You’ve built your model, you’ve located your data sources, and you’ve done all the initial processing and ETL to get your data how you want it. Now you’re ready — or almost ready — to deploy your predictive models in the real world.
But wait! Before you go further, you need to appreciate that this isn’t the last hurdle. Rather, it’s the start of a long journey.
Predictive model deployment isn’t something you do once; it’s an ongoing cycle of continuous improvements designed to keep you responsive and adaptable to a changing business context. That means that your deployment strategy needs to take into account how you plan to keep on driving forward, not just make it through the initial push.
Training and testing data for predictive model deployment
Before you can deploy your machine learning models, you need to use the data you’ve collected or acquired to train and test them.
Training is all about teaching your model how to approach the problem of predicting an outcome so that it learns how to get the right results. Testing is where you make sure that the model’s training has worked — by giving it new data and checking that it manages to make accurate predictions.
This is where things get a bit tricky. To test the model, you need to feed it data it hasn't seen before. But for the test to be a fair reflection of the training, that data also needs to come from the same original dataset, held back from training rather than drawn from somewhere else entirely.
That means splitting the original dataset into separate sets: some you will use for training, and some you will hold back for testing. How you go about splitting this data really matters, too. The most common methods are k-fold cross-validation and leave-one-out cross-validation (LOOCV), but the right choice depends on the size of your dataset and how you want to use the data. We explain the difference in detail in this whitepaper.
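To make this concrete, here is a minimal sketch of both splitting strategies, assuming a Python workflow with scikit-learn and a feature matrix X and target y you have already prepared; the data shown is just a placeholder.

```python
# A minimal sketch of hold-out, k-fold, and LOOCV splitting with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X = np.random.rand(100, 5)         # placeholder feature matrix
y = np.random.randint(0, 2, 100)   # placeholder binary target

# Simple hold-out split: 80% for training, 20% reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# K-fold: each of the 5 folds takes a turn as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    pass  # train and evaluate the model on each fold here

# LOOCV: every single observation takes a turn as the test set.
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # number of splits equals number of rows
```

As a rough rule of thumb, LOOCV is thorough but expensive, so it tends to suit small datasets, while k-fold is the pragmatic default for larger ones.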
You will also need to think carefully about the most appropriate training method to use for your model, in line with the type of problem it needs to solve to make a prediction about the data. For example, you might choose a method like regression, classification, clustering, or dimensionality reduction.
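To illustrate how the problem type maps to a training method, here is a rough sketch using scikit-learn estimators as stand-ins for each family; the specific algorithms and parameters are illustrative choices, not recommendations.

```python
# Matching the training method to the problem the model needs to solve.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Predicting a continuous value (e.g. next month's sales) -> regression
regressor = LinearRegression()

# Predicting a discrete label (e.g. churn / no churn) -> classification
classifier = LogisticRegression(max_iter=1000)

# Grouping similar records without labels -> clustering
clusterer = KMeans(n_clusters=4)

# Compressing many correlated features into a few components -> dimensionality reduction
reducer = PCA(n_components=3)
```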
Thorough training and testing are essential before moving on to the predictive model deployment stage, but they can also be time- and resource-intensive. To speed up time to deployment, it's well worth looking for tools, techniques, and technologies that help you automate as much of the process as possible.
Keeping your machine learning models fresh and up-to-date
Deploying your predictive model is, alas, far from the end of the line. Your data sources are updating all the time. The business landscape is evolving. Your models will need tweaking and improving. It’s essential that you work these considerations into the way you deploy your model, right from the outset, to avoid performance problems and bottlenecks later on.
Part of this is ensuring that you have set up flexible data pipelines that will allow you to keep feeding high-quality, accurate, up-to-the-minute data into your models long into the future. It also means ensuring you can easily incorporate additional data sources into your model as they become available and relevant, without having to start over.
To ensure the data you’re basing your model on retains its predictive value, make sure you regularly audit your data sources. Are they still as dependable and relevant as when you first started using them? Has the data they collect begun to move in a different direction? Over-reliance on data sources that are subject to change can lead to model drift and deteriorating results.
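As one example of what such an audit could look like in practice, the sketch below compares a current data feed against a historical baseline, column by column, using a two-sample Kolmogorov-Smirnov test; the dataframes, column handling, and 0.05 threshold are all assumptions you would tune to your own pipelines.

```python
# A minimal sketch of a data source drift audit: compare the current feed
# against a historical baseline with a two-sample KS test.
import pandas as pd
from scipy.stats import ks_2samp

def audit_numeric_columns(baseline: pd.DataFrame, current: pd.DataFrame, alpha=0.05):
    """Flag numeric columns whose distribution has shifted since the baseline."""
    drifted = []
    for col in baseline.select_dtypes("number").columns:
        stat, p_value = ks_2samp(baseline[col].dropna(), current[col].dropna())
        if p_value < alpha:  # small p-value -> distributions likely differ
            drifted.append((col, round(stat, 3)))
    return drifted

# Example usage (dataframes are hypothetical):
# drifted = audit_numeric_columns(last_quarter_df, this_week_df)
```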
Check your models, too, analyzing them carefully to ensure you’re still getting relevant, accurate results. Doing this manually quickly becomes inefficient, but you can cut down on time and hassle by automating the process.
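A minimal sketch of what an automated check might look like, assuming you log recent predictions alongside the outcomes that eventually arrive; the accuracy metric and the 0.80 threshold are placeholder choices.

```python
# A simple automated accuracy check over a recent monitoring window.
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.80  # illustrative minimum acceptable accuracy

def check_model_health(y_true, y_pred) -> bool:
    """Return True if the production model still meets the accuracy bar."""
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Production accuracy over the monitoring window: {accuracy:.3f}")
    return accuracy >= ACCURACY_THRESHOLD

# Wired into a scheduler (cron, Airflow, etc.), a failing check would
# trigger an alert rather than just a print statement.
```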
To stave off performance degradation, it’s wise to intermittently retrain your models with new, relevant data. If your models are absolutely core to the business, it makes sense to do this every few days or weeks, using an offline process and comparing the performance of the offline model to the current production model. If the offline model delivers better results, you know it’s time to manually review and potentially update or replace the model you’re using.
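Here is a rough sketch of that offline retrain-and-compare loop; train_model, load_production_model, and the held-out evaluation set are stand-ins for your own training code and model store, and ROC AUC is just one possible comparison metric.

```python
# Offline "challenger" retraining compared against the production "champion".
from sklearn.metrics import roc_auc_score

def retrain_and_compare(X_new, y_new, X_holdout, y_holdout,
                        train_model, load_production_model):
    """Train a challenger offline and compare it to the current champion."""
    challenger = train_model(X_new, y_new)    # offline retrain on fresh data
    champion = load_production_model()        # current production model

    challenger_auc = roc_auc_score(y_holdout, challenger.predict_proba(X_holdout)[:, 1])
    champion_auc = roc_auc_score(y_holdout, champion.predict_proba(X_holdout)[:, 1])

    if challenger_auc > champion_auc:
        print("Challenger wins: flag the model for manual review and promotion.")
    else:
        print("Champion holds: keep the current production model.")
    return challenger_auc, champion_auc
```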
Keep an eye, too, on your extraction methods and feature engineering. These may also become less effective over time, requiring ongoing monitoring and tweaking to keep them fresh. Using an automated data science platform to search for features will help you to explore more use cases and keep your existing predictive models accurate and robust.
Accessing the right external data
We mentioned above how important it is to maintain your data pipelines and ensure you can switch things up and add new data sources at will. This goes for external data sources as well as internal ones.
Getting hold of enough decent historical data to drive accurate predictions is a constant challenge, and the chances are you won’t be able to supply enough data from your internal datasets. Incorporating external data sources provides invaluable additional context, helping you fill out the complete picture. However, you can’t afford to go right back to the start of the process, cleaning and harmonizing data, every time you get close to deployment or when you want to spruce up your production models.
This is why it makes sense to use a data science platform that’s set up for augmented data discovery. That way, you can connect seamlessly to thousands of external data sources, filling in the gaps, and providing rich detail to constantly improve your models. These will already have been vetted for quality and compliance and should be compatible with your existing datasets and models without you needing to do any heavy lifting.
Final thoughts: predictive model deployment for the long haul
Have you set up data pipelines and workflows that make your model reproducible? If something breaks, can you and your colleagues jump back through previous versions of the model to identify which incremental change brought about the problem and restore a version that works? Is it scalable? Is it fully stress tested?
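If you don't already have a model registry in place, even a simple versioning convention goes a long way. The sketch below saves each trained model under a timestamped version alongside the metadata needed to reproduce it, so a broken deployment can be rolled back to a known-good version; the paths and metadata fields are illustrative, and a dedicated registry tool would handle this more robustly.

```python
# A minimal sketch of versioned model artifacts for reproducibility and rollback.
import json
import time
from pathlib import Path
import joblib

MODEL_DIR = Path("models")

def save_model_version(model, metadata: dict) -> Path:
    """Persist the model alongside the metadata needed to reproduce it."""
    version = time.strftime("%Y%m%d-%H%M%S")
    path = MODEL_DIR / version
    path.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path / "model.joblib")
    (path / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return path

def load_model_version(version: str):
    """Restore a previous version if the latest deployment misbehaves."""
    return joblib.load(MODEL_DIR / version / "model.joblib")
```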
There are certain tasks and considerations you really need to tick off before you even get to the predictive model deployment stage. But the technology you build your models with and manage your data pipelines through also plays a pivotal role. We’ve mentioned automating tasks, but the fact is, a scattered selection of tools can only do so much.
For a truly streamlined predictive model deployment process, you really need a comprehensive machine learning platform: one that provides the infrastructure to connect to all the external data sources you might need, add new data streams as and when you need them, feed that data directly into your models, make training and testing easier, and facilitate ongoing development efforts.