In today’s environment, data scientists are tasked with far more than building a model and deploying it into production. Now they’re chartered with regularly monitoring, fine-tuning, updating, retraining, replacing and jump-starting models — and in some cases, hundreds and thousands of models collectively.
As a result, different levels of model management have emerged. In the following, I try to highlight each, from single model management all the way through building an entire model factory.
You may be wondering how to use the result of your training procedure to score new incoming data. There are a lot of options, such as scoring within the same system that was used for training or exporting models in standardised formats. Alternatively, you can push models into other systems, for example by scoring models as SQL statements within a database or containerising models for processing in an entirely different runtime environment. From the model management perspective, you just need to be able to support all required options.
The standard process looks like this:
Load Data > Transform Data > Train Model > Deploy
Note: In reality, the model alone is often not very helpful unless at least part of the data processing (transformation/integration) is also part of the “model” in production. Many deployment options are surprisingly weak here because they only support deploying the predictive model itself.
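To illustrate the note above, here is a minimal sketch of deploying the transformation together with the predictor as one unit. The class name DeployedModel, the feature logic and the fixed linear rule are all hypothetical stand-ins, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DeployedModel:
    """Bundles the transformation step with the predictor so both ship together."""
    transform: Callable[[dict], Sequence[float]]
    predict: Callable[[Sequence[float]], int]

    def score(self, raw_record: dict) -> int:
        # Raw input passes through the same transformation used at training time.
        return self.predict(self.transform(raw_record))


def to_features(record: dict) -> list:
    # Hypothetical transformation: scale price to thousands, encode a flag.
    return [record["price"] / 1000.0, 1.0 if record["returning"] else 0.0]


def threshold_model(features: Sequence[float]) -> int:
    # Stand-in for a trained model: a fixed linear rule.
    return 1 if 0.5 * features[0] + 1.0 * features[1] > 1.0 else 0


model = DeployedModel(transform=to_features, predict=threshold_model)
prediction = model.score({"price": 1500, "returning": True})
```

Because callers only ever hand raw records to score, the production system cannot accidentally skip or mismatch the transformation.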
Evaluation and monitoring
As part of any form of model management, it is vital to continuously make sure the model keeps performing as it should. Testing against a static set of data collected in the past, as many data scientists are forced to do, only confirms that the model itself has not suddenly changed. More often, you should monitor recently collected data, which lets you measure whether the model is becoming outdated because reality has changed. Sometimes it is also advisable to include manually annotated data to test border cases or simply to make sure the model is not making gross mistakes.
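Monitoring on recent data can be sketched as a rolling window over the latest labelled predictions. The class name, window size and alert threshold below are assumptions chosen for illustration only.

```python
from collections import deque
from typing import Optional


class RecentAccuracyMonitor:
    """Tracks accuracy over the most recent labelled predictions only."""

    def __init__(self, window_size: int, alert_below: float) -> None:
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.alert_below = alert_below

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_outdated(self) -> bool:
        # Only judge once the window is full, to avoid alerts on tiny samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.alert_below


monitor = RecentAccuracyMonitor(window_size=4, alert_below=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
outdated_before = monitor.is_outdated()

# Two recent mistakes push the two oldest correct outcomes out of the window.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
outdated_after = monitor.is_outdated()
```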
Ultimately, model evaluation should result in a score measuring some form of model quality, often classification accuracy but sometimes also more application-dependent choices such as expected cost or a risk measure. What you do with that score, however, is another story.
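As a rough illustration of how the choice of score matters, the sketch below computes both classification accuracy and an expected cost on the same predictions. The labels and the cost table are invented for the example.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def expected_cost(y_true, y_pred, cost):
    # cost maps (actual, predicted) to the cost of that outcome.
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Hypothetical costs: missing a positive is ten times worse than a false alarm.
cost = {(1, 1): 0.0, (0, 0): 0.0, (1, 0): 10.0, (0, 1): 1.0}

acc = accuracy(y_true, y_pred)
avg_cost = expected_cost(y_true, y_pred, cost)
```

Both mistakes count the same towards accuracy, but the missed positive dominates the expected cost.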
Taking results to update and retrain
At this stage, it gets more interesting and feels much more like actually managing something. Suppose your monitoring stage starts reporting more and more errors. You can trigger automatic model updating, retraining or complete replacement.
Some model management setups simply train a new model and then deploy it. However, since training can take significant resources and time, a more sensible approach is to make this switch dependent on performance to ensure that it is worth replacing the existing model. Run an evaluation procedure that takes the previous model (often called the champion) and the newly (re)trained model (the challenger), scores them both and decides whether the new model should be deployed or the old one kept in place. In some cases, you may only want to go through the hassle of model deployment when the new model significantly outperforms the old one.
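A minimal sketch of that champion/challenger decision might look as follows, assuming higher scores are better. The function name and the min_improvement default are hypothetical.

```python
def choose_model(champion_score: float, challenger_score: float,
                 min_improvement: float = 0.02) -> str:
    """Deploy the challenger only if it beats the champion by a clear margin."""
    if challenger_score - champion_score >= min_improvement:
        return "challenger"
    return "champion"


clear_win = choose_model(champion_score=0.80, challenger_score=0.85)
marginal_win = choose_model(champion_score=0.80, challenger_score=0.81)
```

Setting min_improvement to zero reduces this to "deploy whichever model scores higher"; a larger value reflects how costly a redeployment is.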
Models can still struggle with seasonality if you don’t take precautions elsewhere in your management system, however. For example, if a model is predicting sales quotas of clothing, the seasons will affect those predictions dramatically. If you monitor and retrain on a monthly basis, year after year, you can effectively train models to adjust to the current season. You can also manually set up a mix of seasonal models that are weighted differently, depending on the season.
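One way to picture the weighted mix of seasonal models is a simple weighted average of their predictions. The season names, predictions and weights below are made up for illustration.

```python
def blend_seasonal(predictions: dict, weights: dict) -> float:
    """Weighted average of per-season model predictions."""
    total_weight = sum(weights.values())
    return sum(predictions[season] * w for season, w in weights.items()) / total_weight


# Hypothetical predictions from a winter model and a summer model.
predictions = {"winter": 120.0, "summer": 80.0}

# Hypothetical weights for a late-autumn forecast: lean on the winter model.
late_autumn_weights = {"winter": 0.75, "summer": 0.25}

blended = blend_seasonal(predictions, late_autumn_weights)
```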
Now a caveat: sometimes models need to guarantee specific behaviour for certain cases. Injecting expert knowledge into model learning is one option, but having a separate rule model in place that can override the output of the trained model is a more transparent solution.
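The rule model idea can be sketched as a list of rules that are checked before the trained model is consulted. The rule, the stand-in model and the field names are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A rule returns a decision, or None when it does not apply to the record.
Rule = Callable[[dict], Optional[str]]


def predict_with_rules(record: dict, model: Callable[[dict], str],
                       rules: Sequence[Rule]) -> str:
    for rule in rules:
        decision = rule(record)
        if decision is not None:
            return decision  # the first matching rule overrides the model
    return model(record)


def under_age_rule(record: dict) -> Optional[str]:
    # Guaranteed behaviour: never approve applicants under 18.
    return "reject" if record["age"] < 18 else None


def trained_model(record: dict) -> str:
    # Stand-in for a learned model.
    return "approve" if record["income"] > 30000 else "reject"


overridden = predict_with_rules({"age": 16, "income": 50000}, trained_model, [under_age_rule])
from_model = predict_with_rules({"age": 30, "income": 50000}, trained_model, [under_age_rule])
```

Keeping the rules outside the learned model means anyone can read exactly which cases are guaranteed.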
While some models can be updated, many of the algorithms can be forgetful. Data from a long time ago will play less and less of a role in determining the parameters. This is sometimes desirable, but it’s hard to properly adjust the rate of forgetting.
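Forgetting can be illustrated with the simplest possible updatable "model", an exponentially weighted running estimate. This is only an analogy for incremental learners in general, and the forgetting_rate parameter is an assumption of the sketch.

```python
def update_estimate(current: float, observation: float, forgetting_rate: float) -> float:
    """One incremental update; a higher rate forgets old data faster."""
    return (1 - forgetting_rate) * current + forgetting_rate * observation


estimate = 10.0  # value learned from older data
for new_observation in [20.0, 20.0, 20.0]:
    estimate = update_estimate(estimate, new_observation, forgetting_rate=0.5)

# After k updates, the old value keeps a weight of (1 - forgetting_rate) ** k.
old_data_weight = (1 - 0.5) ** 3
```

With a rate of 0.5, three updates leave the older data with only an eighth of its original influence, which shows how sensitive the outcome is to this single parameter.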
An alternative is to retrain a model, building a new model from scratch. This lets you use an appropriate data sampling (and scoring) strategy to make sure the new model is trained on the right mix of past and more recent data.
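A sampling strategy for retraining could be sketched like this, with a parameter that fixes the share of recent records in the training set. The function name, the parameters and the toy data are hypothetical.

```python
import random


def sample_training_set(past: list, recent: list, size: int,
                        recent_fraction: float, seed: int = 0) -> list:
    """Draw a training set with a controlled mix of recent and older records."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_recent = min(len(recent), round(size * recent_fraction))
    n_past = min(len(past), size - n_recent)
    return rng.sample(recent, n_recent) + rng.sample(past, n_past)


past_records = list(range(100))          # stand-in for older data
recent_records = list(range(100, 120))   # stand-in for recently collected data

training_set = sample_training_set(past_records, recent_records,
                                   size=20, recent_fraction=0.5)
```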
Now, the management process looks a bit more like this:
Load Data > Transform Data > (Re)Train Model > Evaluate Model(s) > Deploy
Suppose you now want to continuously monitor and update or retrain an entire set of models. You could handle this as before, just with more than one model, but at this level issues arise around the interface and the actual management. How do you communicate the status of many models to the user and let her interact with them, and who controls the execution of all those processes? There must be a dashboard with capabilities to manage and control individual models at one time.
Most tools allow their internals to be exposed as services, so you can envision a separate program making sure your individual model management process is being called properly. You can either build a separate application or use existing open source software that orchestrates and supervises the modelling workflows and summarises their outputs.
The model family
Handling bunches of models gets even more interesting when you group them into different model families. You can handle models that predict very similar behaviour in a similar way. This is particularly useful if you regularly need a new model. When models are similar, initialise a new model from existing models in the family rather than starting from scratch or only training the new model on isolated past data. Use either the most similar model (determined by some measure of similarity of the objects) or a mix of models for initialisation.
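Initialising from the most similar family member could look roughly like the sketch below. It assumes each model is described by a numeric profile of the object it models and uses Euclidean distance as the similarity measure; both choices, and all names, are illustrative.

```python
import math


def most_similar_model(new_profile, family):
    """family maps a model name to (profile vector, trained parameters)."""
    name = min(family, key=lambda n: math.dist(new_profile, family[n][0]))
    # Copy the parameters so the new model can be trained without touching the original.
    return name, list(family[name][1])


# Hypothetical family of per-store models.
family = {
    "store_a": ([1.0, 0.0], [0.2, 0.8]),
    "store_b": ([0.0, 1.0], [0.9, 0.1]),
}

source_name, initial_params = most_similar_model([0.9, 0.2], family)
```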
If you abstract the interfaces between model families sufficiently, you should be able to mix and match at will. This allows new models to reuse load, transformation, (re)training, evaluation and deployment strategies and combine them in arbitrary ways. For each model, you just need to define which specific process steps are used in each stage of this generic model management pipeline.
There may be only two different ways to deploy a model, but there are a dozen different ways to access data. If you had to split this into different families of model processes, the choices at each stage would multiply and you would end up with over a hundred variations.
The model factory
The final step in model management is to make the jump to creating model factories. This can be done by defining only the individual pieces (“process steps”) from above and combining them in flexible ways defined in a configuration file, for example. This is fantastic because if someone wants to alter the data access or the preferred model deployment later, you would only need to adjust that particular process step instead of having to fix all processes that use it.
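A configuration-driven setup might be sketched as a registry of interchangeable process steps plus a configuration that names one step per stage. The step names, the toy data and the JSON layout are all assumptions, and the "trained model" here is just a number standing in for a real one.

```python
import json

# Registry of interchangeable process steps (all hypothetical stand-ins).
STEPS = {
    "load": {
        "from_list": lambda _: [4.0, 1.0, 1.0],
    },
    "transform": {
        "identity": lambda rows: rows,
        "sorted": lambda rows: sorted(rows),
    },
    "train": {
        "mean": lambda rows: sum(rows) / len(rows),
        "median": lambda rows: sorted(rows)[len(rows) // 2],
    },
}

# In practice this would live in a configuration file.
CONFIG = json.loads("""
{
  "model_a": {"load": "from_list", "transform": "identity", "train": "mean"},
  "model_b": {"load": "from_list", "transform": "sorted", "train": "median"}
}
""")


def build_model(name: str):
    chosen = CONFIG[name]
    result = None
    for stage in ("load", "transform", "train"):
        # Each stage receives the output of the previous one.
        result = STEPS[stage][chosen[stage]](result)
    return result


models = {name: build_model(name) for name in CONFIG}
```

Changing how data is loaded then means editing one entry in the registry, and every pipeline that names it picks up the change.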
At this stage, it makes sense to split the evaluate step into two: the part that computes the score of a model and the part that makes a decision on what to do with that score. The latter can include different strategies to handle champion/challenger scenarios and is independent of how you compute the actual score.
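The split described above can be sketched as two independent, pluggable functions: one that scores a model and one that decides what to do with the scores. All names and the toy models are hypothetical.

```python
def compute_score(model, examples):
    # Scoring part: plain accuracy on labelled examples.
    return sum(model(x) == y for x, y in examples) / len(examples)


def keep_best(champion_score, challenger_score):
    # Decision strategy 1: replace whenever the challenger scores higher.
    return challenger_score > champion_score


def require_margin(margin):
    # Decision strategy 2: replace only when the improvement exceeds a margin.
    def decide(champion_score, challenger_score):
        return challenger_score - champion_score > margin
    return decide


def evaluate(champion, challenger, examples, score_fn, decide_fn):
    """Returns True when the challenger should replace the champion."""
    return decide_fn(score_fn(champion, examples), score_fn(challenger, examples))


examples = [(1, 1), (2, 0), (3, 1), (4, 0)]
champion = lambda x: 1          # always predicts 1
challenger = lambda x: x % 2    # predicts 1 for odd inputs

replace_simple = evaluate(champion, challenger, examples, compute_score, keep_best)
replace_strict = evaluate(champion, challenger, examples, compute_score, require_margin(0.6))
```

The same scores lead to different outcomes under the two strategies, which is exactly why the decision is worth keeping separate from the scoring.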
Then, putting a model factory to work is actually straightforward. Configuration setups define which incarnation of each process step is used for each model pipeline. For each model, you can automatically compare past and current performance and trigger retraining/updating.
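The automatic comparison of past and current performance might be sketched as a loop over all managed models. The model names, the scores and the max_drop threshold are invented for the example.

```python
def models_to_retrain(past_scores: dict, current_scores: dict,
                      max_drop: float = 0.05) -> list:
    """Names of models whose score fell by more than max_drop."""
    flagged = []
    for name, past in past_scores.items():
        if past - current_scores[name] > max_drop:
            flagged.append(name)
    return flagged


# Hypothetical scores from the previous and the latest evaluation run.
past_scores = {"churn": 0.90, "fraud": 0.95, "sales": 0.80}
current_scores = {"churn": 0.89, "fraud": 0.80, "sales": 0.60}

retrain_queue = models_to_retrain(past_scores, current_scores)
```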
This is a lot of information, but data scientists can master every level because they must. Today’s massive trove of information will soon seem minuscule, and it is essential that we develop sound, reliable management practices now to handle the increasingly huge volumes of data and the accompanying flood of models, and ultimately to make sense of it all.
Michael Berthold, Ph.D., founder and CEO, KNIME