Data preprocessing is one of the most critical steps before feeding data to machine learning models. Good data preprocessing can greatly improve a model's performance. On the other hand, if the data is not prepared properly, the result of any model can be just "garbage in, garbage out". Below are the typical steps to preprocess a dataset:
Load the dataset to get a sense of the data
Take care of missing data (optional)
Encode categorical data (optional)
Split the dataset into the Training set and Test set (and optionally a Validation set)
Feature scaling
Thanks to all the powerful libraries available today, we can implement the steps above very easily with Python.
Import the libraries
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```
numpy is a popular library for scientific computing. Here we will mainly use its N-dimensional array object. It also provides very useful linear algebra, Fourier transform, and random number capabilities.
matplotlib is a Python 2D plotting library that helps us visualize the dataset.
pandas provides easy-to-use data structures and data analysis tools for Python. We use it to load and separate datasets.
sklearn is another library we will use later. It is a very powerful tool for data analysis. Because it offers such a comprehensive set of tools, we will introduce them individually as we use them.
Import the dataset
```python
# read a csv file with pandas
dataset = pd.read_csv('Data.csv')
# print out the loaded dataset
dataset
```
|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |
```python
# separate the dataset into X and y
# X contains the independent variables: columns 'Country', 'Age' and 'Salary'
X = dataset.iloc[:, :-1].values
# y contains the dependent variable: column 'Purchased'
y = dataset.iloc[:, -1].values
```
If you look closely, there are two missing values in the dataset: the Age of customer 6 and the Salary of customer 4. Most of the time we need to fill in the missing values to make the model work. There are three common strategies: using the mean, the median, or the most frequent value of the column. Here I will fill them in using the mean.
```python
# Import the Imputer class from sklearn
from sklearn.preprocessing import Imputer
# Instantiate the Imputer class
# missing_values: the placeholder for the missing value, here the default 'NaN'
# strategy: 'mean', 'median' or 'most_frequent'
# axis: 0 - impute along columns, 1 - impute along rows
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
# Fit columns Age and Salary
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
```
Here we used sklearn's Imputer module to take care of the missing data. As you can see from the code, the library makes this very easy, and the missing Age and Salary are filled with the mean of their respective columns.
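As a quick sanity check (not part of the original code), the filled values should equal the column means computed from the table above, roughly 38.78 for Age and 63777.78 for Salary:

```python
# Sanity check: the previously missing entries now hold the column means
print(X[6, 1])  # Age of customer 6, ~38.78
print(X[4, 2])  # Salary of customer 4, ~63777.78
```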
Encoding Categorical Data
In our dataset, the first column is the country name. The values in this column are text, not numbers, but machine learning models only work with numbers, so we need to encode the country names into numbers.
```python
# Import LabelEncoder to encode text into numbers
from sklearn.preprocessing import LabelEncoder
# Encode first column of X
labelEncoder_X = LabelEncoder()
X[:, 0] = labelEncoder_X.fit_transform(X[:, 0])
```
After using LabelEncoder, you can see we encoded the country names from text into numbers: 'France' -> 0, 'Germany' -> 1, 'Spain' -> 2. All good, right? No! Here's the problem: by encoding 'France', 'Germany' and 'Spain' as 0, 1 and 2, we imply that 'Spain' is greater than 'Germany' and 'Germany' is greater than 'France', just like 2 > 1 > 0. This is wrong; all the countries should be treated as being on the same level. So we need to do some extra work to correct this result.
```python
# Import OneHotEncoder module
from sklearn.preprocessing import OneHotEncoder
# Instantiate OneHotEncoder and mark the first column as categorical
oneHotEncoder = OneHotEncoder(categorical_features = [0])
# Encode the first column of X and return an array object
X = oneHotEncoder.fit_transform(X).toarray()

# print value of X
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter = {'float_kind' : float_formatter})
print(X)
```
After the one-hot encoding, the first column of X is split into three columns, one per country. This way the model knows which country a customer comes from while still treating all the countries as equal. Now let's encode y as well.
```python
# before encoding
print('Before encoding: ')
print(y)

# Since y only has two values, 'Yes' and 'No', we can simply encode them as 1 and 0
labelEncoder_y = LabelEncoder()
y = labelEncoder_y.fit_transform(y)
```
Splitting the dataset into the Training set and Test set
One thing almost every machine learning workflow does is split the dataset into a Training set and a Test set. Just like a human, if the machine keeps learning from the same dataset it can "learn it by heart": the model makes very accurate predictions on that dataset but performs poorly when given new data. This scenario is called "overfitting". To avoid it, we separate the data into a Training set, which is used to train the model, and a Test set, which is used to test the model's performance. If the model performs poorly on the Test set, we can adjust the model's settings and try again. In some cases, people even break the dataset into three portions: training set, test set and validation set. After the model has been trained and tested, the validation set can be used to verify its final performance. This method of validating the model is called cross validation.
```python
# Import 'train_test_split', a very self-explanatory module
# Please notice the library we use is 'model_selection' from 'sklearn'
from sklearn.model_selection import train_test_split
# Split X into X_train and X_test. Split y into y_train and y_test
# test_size: usually less than 0.5 of the full dataset
# random_state: seed for shuffling the dataset into a random order before splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```
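With 10 rows and test_size = 0.2, the split leaves 8 rows for training and 2 for testing. A quick check of the resulting shapes (not in the original code):

```python
# After one-hot encoding X has 5 columns (3 country dummies + Age + Salary)
print(X_train.shape, X_test.shape)  # (8, 5) (2, 5)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```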
Feature Scaling
Feature scaling is another must-do step for most data preprocessing. What it does is scale the values in the dataset into a small range such as -1 to 1. This has two benefits: first, computation is faster with small-scale numbers; second, it prevents a feature from dominating the result simply because of its larger scale. For example, our dataset has Age and Salary features. The values of Salary are much bigger than those of Age, so during training the Salary feature could dominate the result and make the Age feature useless. By scaling them to the same level, we avoid this problem.
There are two main ways to scale features: Standardization and Normalization.
Standardization (sd = standard deviation): $$x' = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}$$
Normalization (min-max scaling): $$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
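The scaling step itself is not shown in the snippets above, so here is a minimal sketch using sklearn's StandardScaler, which implements the standardization formula. Note that the scaler is fit on the Training set only and the same transformation is then applied to the Test set:

```python
# Feature Scaling (standardization): fit on the training set only,
# then apply the same transformation to the test set
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

Whether the one-hot encoded country columns should be scaled as well depends on the model; in this sketch the whole matrix is scaled for simplicity.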
Now we have all the steps implemented for data preprocessing. In practice, not all of the steps are needed; select the required steps based on your dataset. Below is the full version of the code.
```python
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X = LabelEncoder()
X[:, 0] = labelEncoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelEncoder_y = LabelEncoder()
y = labelEncoder_y.fit_transform(y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```
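If feature scaling is needed for your model, the standardization sketch from the Feature Scaling section can be appended to the script:

```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```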