Your machine learning model is only as good as the data you feed into it. That makes data preparation for machine learning (or cleaning, wrangling, cleansing, pre-processing, or any other term you use for this stage) incredibly important to get right. It will likely take up a considerable chunk of your time and energy.
Data preparation, whether for data analytics or (more likely) machine learning, means converting data into a form that’s ready for fast, accurate, efficient modeling and analysis. It involves stripping out inaccuracies and other problems that crept in during data gathering, improving data quality, and reducing the risk of data bias.
If you use Python for data science, you’ll be working with the Pandas library. In this article, we’ll look at some of the key data preparation steps for machine learning you should go through before you start modeling data.
Why this data?
Before you dive in, it’s crucial that you have a clear understanding of why this particular dataset has been selected, as well as precisely what it means. Why is this dataset so significant? What do you want to learn from it, and exactly how will you use what it contains? (These decisions are rooted in domain knowledge and careful collaboration with your business colleagues.)
Quick cleans
Once you’ve loaded your unstructured data into Pandas, there are a few simple things you can do immediately to clean it up. For example, you could:
- Remove any columns with more than 50% missing values (if your data set is large enough – more on that in the next section)
- Remove lines of extraneous text that prevent the Pandas library from parsing the data properly
- Remove any columns of URLs that you can’t access or that aren’t useful
On closer inspection of what each column means and whether it’s relevant to your purposes, you could then eliminate any columns that (see the sketch after this list):
- Are badly formatted
- Contain irrelevant or redundant information
- Would need much more pre-processing work or additional data to render useful (although you may want to consider easy ways to fill in the gaps using external data)
- Leak future information which could undermine the predictive elements of your ML model
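Taken together, a minimal sketch of these quick cleans might look like this (the file name and column names are hypothetical, purely for illustration):

import pandas as pd

# skip malformed lines of extraneous text that would otherwise break parsing (pandas 1.3+)
df = pd.read_csv('dataset.csv', on_bad_lines='skip')

# drop URL columns (and any other irrelevant or badly formatted columns)
df = df.drop(columns=['source_url', 'notes_html'])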
Dealing with missing data
If you are dealing with a very large dataset, removing columns with a high proportion of missing values will speed things up without damaging or altering the overall meaning. This is as easy as using Pandas’ .dropna() function on your data frame. For instance, the following script could do the trick:
df = df.dropna(axis=1, thresh=len(df) // 2)  # keep only columns with at least 50% non-missing values
However, it’s also worth noting the issue so that you can identify potential external data sources to combine with this dataset, in order to fill any gaps and enrich your model later on.
If you are using a smaller dataset, or are otherwise worried that dropping the instance/attribute with the missing values could weaken or distort your model, there are several other data preparation techniques you can use. These include:
- Imputing the attribute’s mean/median/mode for all missing values (you can use df['column'].fillna() with .mean(), .median(), or .mode()[0] to quickly solve the problem – see the sketch after this list)
- Using linear regression to impute the attribute’s missing values
- If there is enough data that zeros won’t distort your results, you can simply use df.fillna(0) to replace NaN values with 0 so computations can run
- Clustering your dataset into known classes and calculating missing values using inter-cluster regression
- Combining any of the above with dropping instances or attributes on a case-by-case basis
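As a minimal sketch (with a hypothetical column name), mean imputation and zero-filling might look like this:

# impute the column mean for missing values (swap in .median() or .mode()[0] as needed)
df['column_1'] = df['column_1'].fillna(df['column_1'].mean())

# or, if zeros are harmless in your context, replace every remaining NaN with 0
df = df.fillna(0)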
Think carefully about which of these approaches will work best with the machine learning model you are preparing the data for. Decision trees don’t take too kindly to missing values, for example.
Note that, when doing data preparation using Python, Pandas marks missing numerical data with the floating-point value NaN (not a number). You will find this special value defined in the NumPy library, which you will also need to import. Having this default marker makes it a lot easier to quickly spot missing values and do an initial visual assessment of how extensive the problem is.
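For a quick first look at how widespread the problem is, something like this minimal sketch works:

import numpy as np  # NaN is defined here as np.nan
import pandas as pd

df = pd.read_csv('dataset.csv')
print(df.isnull().sum())            # number of missing values per column
print(df.isnull().mean().round(2))  # proportion of missing values per column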
Should you remove outliers?
Before you can make this decision, you need to have a fairly clear idea of why you have outliers. Is this the product of mistakes made during data collection? Or is it a real anomaly, a useful piece of data that can add something to your understanding?
One quick way to check is to compute the interquartile range (IQR) with a simple script that returns Boolean values: True for outliers and False for normal values:
import pandas as pd

df = pd.read_csv("dataset.csv")
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
print((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
You can also put your data into a box plot to more easily visualize outlier values:
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')
plt.boxplot(df["column"])
plt.show()
You may also find it useful to apply a log or square root transformation. This will minimize the impact on the model if the outlier is an independent variable while helping your assumptions to work better if it’s a dependent variable.
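As a rough sketch (assuming a non-negative, hypothetical column), those transformations might look like this:

import numpy as np

df['column_log'] = np.log1p(df['column'])   # log transform (log1p copes with zeros)
df['column_sqrt'] = np.sqrt(df['column'])   # square root transform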
That said, the most important thing is to consider carefully your reasoning for including or removing the outlier (and for how you handle it if you leave it in). Instead of trying a one-size-fits-all approach and then forgetting about it, this will help you to remain cognizant of potential challenges and issues in the model to discuss with your colleagues and refine your approach.
Transformation
Having fixed the issues above, you can begin to split your dataset into input and output target variables for machine learning and to apply a preprocessing data transform to your input variables.
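A minimal sketch of that split, assuming a hypothetical target column called 'target':

X = df.drop(columns=['target'])  # input variables
y = df['target']                 # output/target variable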
Precisely what kind of transformations you make will, of course, depend on what you plan to do with the big data in your machine learning model. A few options are:
- Standardize the data
Best for: logistic regression, linear regression, linear discriminant analysis
If attributes in your input variables follow a Gaussian distribution but their means or standard deviations vary, you can standardize them so that each has a mean of 0 and a standard deviation of 1. You can import the sklearn.preprocessing module to use its StandardScaler standardization tool:
from sklearn import preprocessing

names = df.columns
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
- Rescale the data
Best for: gradient descent (and other optimization algorithms), regression, neural networks, and machine learning algorithms that use distance measures, e.g. K-Nearest Neighbors
This also involves normalizing data attributes with different scales so that they’re all on the same scale, typically ranging from 0-1. (You can see how the scaling function works in the example below.)
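A minimal sketch using scikit-learn’s MinMaxScaler (assuming df holds only numeric columns):

from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
rescaled_df = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)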
- Normalize the data
Best for: algorithms that weight input values, e.g. neural networks, algorithms that use distance measures, e.g. K-Nearest Neighbors
If your dataset is very sparse and contains a lot of 0s, but the attributes you do have use varying scales, you may need to rescale each row/observation so that it has a unit norm/length of 1. It’s worth noting, however, that to run normalization scripts, you’ll also need the scikit-learn library (sklearn):
from sklearn import preprocessing

df = pd.read_csv('dataset.csv')
normalizer = preprocessing.Normalizer()  # rescales each row to unit (L2) norm
df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)
The result is a table in which each row has been normalized to unit length, so that no single observation dominates simply because of its scale.
- Make the Data Binary
Best for: feature engineering, feature selection, transforming probabilities into clear-cut values
This means applying a binary threshold to data so that all values below the threshold become 0 and all those above it become 1. Once again, we can use a scikit-learn tool (Binarizer) to help us quickly solve the problem (here we’ll be using a sample table of potential recruits’ ages and GPAs to exemplify):
from sklearn.preprocessing import Binarizer

df = pd.read_csv('testset.csv')

# we're selecting the columns to binarize
age = df.iloc[:, 1].values
gpa = df.iloc[:, 4].values
print("Original age data values :", age)
print("Original gpa data values :", gpa)

# now we turn them into 2D arrays we can work with
x = age.reshape(1, -1)
y = gpa.reshape(1, -1)

# we need to set a threshold to define as 1 or 0
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=3)

# finally we run the Binarizer function
print("Binarized age :", binarizer_1.fit_transform(x))
print("Binarized gpa :", binarizer_2.fit_transform(y))
Your output will go from something like this:
Original age data values : [25 21 45 ... 29 30 57]
Original gpa data values : [1.9 2.68 3.49 ... 2.91 3.01 2.15]
To this:
Binarized age : [[0 0 1 ... 0 0 1]]
Binarized gpa : [[0 0 1 ... 0 1 0]]
Don’t forget to summarize your data to highlight the changes before you move on.
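A couple of quick Pandas calls are usually enough for that summary:

print(df.describe())  # summary statistics for each numeric column after the transformations
print(df.head())      # spot-check the first few rows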
Final thoughts on the Python data preparation process: what happens next?
As we’ve seen, knowing how to prepare data for machine learning the right way is vital, but it can be a fiddly task. The more types of data you use, the longer it can take to harmonize them, applying different cleaning, pre-processing, and transformation steps so that everything works together seamlessly.
If you plan to go down the (advisable) route of incorporating external data to enrich your machine learning models, bear in mind that you will save a lot of time by going through a data enrichment platform that automates much of this data cleaning for you. At the end of the day, data preparation for machine learning is important enough to take time and care getting right, but that doesn’t mean you should misdirect your energies into easily automated tasks.