If you’re familiar with Explorium then you know that we believe the core challenge for data science is data. Specifically, we believe data scientists need more of the right data to feed their models and make better predictions. This isn’t always the easiest task.
The process of acquiring external data might look a little something like this: first, you have to understand the issue the business is trying to solve. This is critical in order to proceed with researching the external data sources that may improve your model. Here’s where things start getting tricky. The world of external data is vast and expensive: to find the right type of data from the right partner or provider, you need to spend a lot of time researching. And once you finally find the right source and test it in your models, your work still isn’t done.
It’s time for data preparation.
Data preparation is about cleaning the data, then normalizing it and extracting the features that work best with your model. Because providers are not necessarily collecting the data in a way that can be used for machine learning (ML), this can mean taking the same data and breaking it down into different insights and formats your model can actually work with. We call this finalized external data source an “enrichment” because, in most cases, you will be getting raw, unorganized, and un-normalized data that your model wouldn’t be able to process effectively. However, after all the preparation and processing, just like gold mining, you end up with a dataset that enriches your model’s insights.
How exactly can you mine your external dataset for gold? The following steps can get you started with your data preparation.
1. Data exploration and profiling
The first step in your mining process is to observe the dataset and learn about its shape and gaps. The questions below will help you understand whether the external dataset is complete (for example, whether a geo dataset contains information for all US zip codes so you’re not missing coverage), whether its coverage can effectively answer your questions (for example, a dataset containing mainly EU postal codes won’t be a good fit for a US-focused use case), and how you are going to treat the data.
Ask the following questions:
- How many columns and rows does it have?
- What does each column represent?
- What is the data distribution and coverage (geographic, by industry, or by any other relevant subject)?
- Fill rate: how many null values does each column in the data frame contain?
- Are there any special gaps I should take into consideration?
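As a starting point, here is a minimal pandas sketch of how you might answer these questions; the file name and column names (such as `industry`) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical external source; the file and column names are illustrative.
external = pd.read_csv("companies_external.csv")

# How many rows and columns does it have?
print(external.shape)

# What does each column represent, and what type does it hold?
print(external.dtypes)

# Fill rate: share of non-null values per column.
print(external.notna().mean().sort_values())

# Distribution of a categorical column, e.g. industry.
print(external["industry"].value_counts(normalize=True).head(10))

# Basic distribution of the numeric columns (revenue, employee count, ...).
print(external.describe())
```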
Let’s see how this works on a practical level. Say we are looking for data about companies in the US. Our internal core dataset contains a list of companies and some basic information about them, including their location, the date they applied for a loan, the amount of the loan requested, a contact person at the company, and, of course, the label (did they repay the loan or not).
At the same time, we have found a source that might be relevant to enrich our core data set. It should provide us with the number of employees in a company, the annual revenue of a company, the company’s industry, and more.
The next step is to understand what each column in the external dataset represents and what the data it contains looks like. We would probably want to analyze each column in a different way; for example, we may want to see how many null values it contains or how its values are distributed. However, one of the most important things to test in this process is how much coverage the external source provides for our internal core dataset.
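One way to test that coverage is a join against the internal dataset. Here is a rough sketch that continues from the `external` frame above; the internal file, the `company_name` join key, and the enrichment column names are all assumptions:

```python
# Hypothetical internal core dataset (companies, loan details, label).
internal = pd.read_csv("internal_companies.csv")

# Left-join the external source onto the internal companies and measure
# how many of them get any enrichment at all.
merged = internal.merge(
    external,
    on="company_name",  # assumed join key; real matching may need cleaning or fuzzy logic
    how="left",
    indicator=True,
)

coverage = (merged["_merge"] == "both").mean()
print(f"External source covers {coverage:.0%} of our internal companies")

# Per-column fill rate for the matched rows only, to catch columns that
# exist in the schema but are rarely populated.
matched = merged[merged["_merge"] == "both"]
print(matched[["employee_count", "annual_revenue", "industry"]].notna().mean())
```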
Joining the two datasets in our example shows that the external dataset has 36% coverage for our internal list of companies. In this case, we might assume the low coverage is because the external source only covers a small set of industries, while our list of companies is probably much wider.
This is important! There are instances where companies buy data that looks amazing on paper, but when they match the third-party provider against their internal source, it turns out to be useless. Take our example again: this red flag applies to the overall coverage, but also to the fill rate of individual columns. If we are relying on this provider for the number of employees working at the companies in their dataset, it would be a shame to find out that they can only provide this for 10% of the companies they have data for.
This brings us to step number two.
2. Working plan
Because working with data can become overwhelming and you can easily get lost in the amount of information, you should always plan your steps carefully and keep your data preparation process as organized and detailed as you can. Among other things, this can help your teammates in the future understand why you chose or disqualified a specific provider. Each step should be written down along with all of your conclusions and ideas. This includes:
- All statistics from step number one
- A general explanation of the dataset, with unclear column names renamed
- Gaps in the data
- Next steps in your plan: what do you want to clean? Where should you normalize the data? What kind of features should you try to extract? Are there any aggregations you should perform?
3. Data cleansing and normalizing — minimize interruptions
Now it’s time to start working! After observing and exploring the data, you should have a clear understanding of what you need and what is irrelevant. You can drop columns with a low fill rate or with irrelevant content.
Be careful! A low fill rate or low coverage is not always unwanted. You can decide to fill these columns with an average value, or to keep them null as an indicator of something. For example, if your dataset contains information about stores with a license to sell alcohol, the missing values would indicate the stores in your training set that don’t have a license to sell alcohol.
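A short sketch of both options, continuing with the `merged` frame from the join above; `annual_revenue` and `license_id` are hypothetical columns:

```python
# Option 1: impute a sparse numeric column with a simple average
# (the median is often safer against outliers).
merged["annual_revenue_filled"] = merged["annual_revenue"].fillna(
    merged["annual_revenue"].median()
)

# Option 2: keep the gap, but expose it to the model as an explicit flag.
# Here, a missing match against the alcohol-license data is itself a signal.
merged["has_alcohol_license"] = merged["license_id"].notna().astype(int)
```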
Another aspect of data cleansing is data normalization. You want to make sure your data is consistent and formatted in a way that best fits your models. For example, if you have a pricing column, you would want to strip unwanted characters like “$”, replace suffixes like “K” with their numerical value, and normalize all prices into a single currency, as in the table and sketch below.
| Raw | Clean and normalized (all in USD) |
| --- | --- |
| $40K | 40000 |
| €30K | 33288 |
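A minimal sketch of that normalization; the EUR-to-USD rate is illustrative (it is what makes €30K come out at 33288), and in practice you would use the rate for the relevant date:

```python
import re

import pandas as pd

EUR_TO_USD = 1.1096  # illustrative rate; fetch the real rate for the relevant date

def normalize_price(raw: str) -> int:
    """Turn strings like '$40K' or '€30K' into a numeric USD amount."""
    value = float(re.sub(r"[^\d.]", "", raw))
    if raw.strip().upper().endswith("K"):
        value *= 1_000
    if raw.strip().startswith("€"):
        value *= EUR_TO_USD
    return round(value)

prices = pd.Series(["$40K", "€30K"])
print(prices.map(normalize_price))  # 40000, 33288
```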
4. Squeezing the lemon — feature engineering
Feature engineering is the art and science of transforming raw data into features that better represent a pattern to the learning models. For example, if you had the date “1/10/2019,” you would want to extract the day of the month, the day of the week, the month, and the year. Each of these is a feature the model can learn from, and each of them might affect the target differently.
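A small sketch of that extraction with pandas; the column name and the day-first date format are assumptions:

```python
import pandas as pd

# Hypothetical application-date column, parsed with an assumed day/month/year format.
loans = pd.DataFrame({"application_date": ["1/10/2019"]})
dates = pd.to_datetime(loans["application_date"], format="%d/%m/%Y")

loans["day_of_month"] = dates.dt.day       # 1
loans["day_of_week"] = dates.dt.dayofweek  # 0 = Monday, ..., 6 = Sunday
loans["month"] = dates.dt.month            # 10
loans["year"] = dates.dt.year              # 2019
print(loans)
```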
You can also combine multiple attributes to create a whole new feature. For example, in many fraud use cases, the distance between a person’s registered address and their IP location at registration can be a good indicator of fraud. In another example, we can count the number of social profiles associated with a person, which can be a good indication of their web presence or web activity.
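A rough sketch of both ideas, assuming you have already geocoded the registered address and the IP location and collected a few (hypothetical) social profile columns:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Hypothetical frame with geocoded coordinates and social links.
people = pd.DataFrame({
    "address_lat": [40.7128], "address_lon": [-74.0060],  # registered address (New York)
    "ip_lat": [34.0522], "ip_lon": [-118.2437],           # IP geolocation (Los Angeles)
    "linkedin_url": ["https://linkedin.com/in/example"],
    "twitter_url": [None],
    "facebook_url": [None],
})

# Fraud-style feature: how far is the IP from the registered address?
people["address_ip_distance_km"] = haversine_km(
    people["address_lat"], people["address_lon"], people["ip_lat"], people["ip_lon"]
)

# Web-presence feature: how many social profiles were found for the person?
social_cols = ["linkedin_url", "twitter_url", "facebook_url"]
people["social_profile_count"] = people[social_cols].notna().sum(axis=1)
print(people[["address_ip_distance_km", "social_profile_count"]])
```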
What’s important here is that you approach the data with an ML perspective on the features. There are almost endless variations you can play with, and each new feature should be tested against your models. You’ll want to see how these features perform in your model alongside the other features you have (feature interaction), whether they beat others in terms of selection and explainability, and of course, whether they give you an uplift.
5. Wrapping it all up
Finally, after discovering the right features to feed your model, it is important to save all the metadata about your work and about your source of data (see the example below). Among other things, this should include the raw version of the data, the data provider, and the date the data is relevant for. As we said earlier, it is crucial to be super organized with your data and all the insights you gained while working on it.
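A minimal sketch of such a metadata record, with hypothetical fields and values, saved alongside the processed data:

```python
import json
from datetime import date

# Hypothetical enrichment metadata; every field and value here is illustrative.
enrichment_metadata = {
    "source_name": "us_company_firmographics",
    "provider": "Example Data Provider Inc.",
    "raw_data_path": "s3://my-bucket/raw/us_company_firmographics.csv",
    "data_relevant_for": "2019-10",
    "received_on": str(date.today()),
    "join_key": "company_name",
    "coverage_vs_internal": 0.36,
    "dropped_columns": ["fax_number"],
    "engineered_features": ["day_of_week", "address_ip_distance_km", "social_profile_count"],
    "notes": "Low fill rate for employee_count; kept nulls and added an explicit flag.",
}

with open("us_company_firmographics_metadata.json", "w") as f:
    json.dump(enrichment_metadata, f, indent=2)
```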
Mine for gold
The steps described above will likely take a lot of time. It could be a month or two (getting in touch with a data vendor, getting a sample of the data, joining it with your own data, preparing it and extracting features, feeding it into your models) before you get an objective answer as to whether a dataset is relevant or not. The more automation you can introduce into the process, the better your chances of actually finding a game-changing dataset.
If you make sure your data preparation is tight, organized, and creative, you’ll enhance your models and end up with a significant uplift in your AUC or R². Beyond that, it also means your business will get even more valuable insights that help answer your predictive question. Although you can tune algorithms to try to squeeze extra power out of them, adding good external data sources is really where you make your models smarter, by finding that hidden gold mine.