The issues ML model retraining won’t solve

Trusting artificial intelligence systems is not easy. Machine learning models can fail on a wide variety of edge cases, the processes underlying their predictions are hard to inspect, and their outputs are difficult to correlate with downstream business results. It’s no wonder that business leaders often look upon AI with some skepticism.

That said, I’ve worked with many forward-thinking AI teams on becoming product-oriented and building trust in their AI systems. I’ve also talked to company leaders about how they can connect their machine learning models to the broader business context in which they’re deployed. At the core of generating trust in AI systems and empowering teams to become product-oriented is the concept of machine learning model monitoring. However, when AI monitoring is proposed, data scientists often ask why it is needed if processes for automatically labeling new data and retraining models are already in place. This kind of thinking is a relic of the traditional, research-oriented approach to data science, and it’s important to understand that the real world is full of scenarios for which ML model retraining is insufficient.

It’s not all about the machine learning model

Machine learning and AI are multidimensional disciplines that cannot be reduced solely to the models they employ. Beyond the ML model, an AI system incorporates data (training, test, and inference), features built upon that data, software platforms, and an encompassing business process. Even if a model is healthy, there is no immediate guarantee that the entire system built around it is. Due to all these interlocking parts, it is easy for a single component of the system to silently break in a way that is difficult to detect without having proper monitoring procedures in place.

As an example, suppose that you’ve trained and deployed a model to perform fraud detection on transactions made by users on your platform. Perhaps one of the model’s features is based on cookies stored in the end user’s browser. Now, suppose that Chrome releases a new beta version of its browser which changes the policies for reading or writing client-side cookies. Because this is a beta version of the browser, it will only be used by a small fraction of your users, say 0.1%. However, for this small portion of users, your model may fail catastrophically by either erroneously predicting fraudulent activity (result: you lose customers) or letting malicious users slip through the cracks (result: fraud goes undetected and you lose money). Either way, since it is such a small fraction of your user base, you are unlikely to catch these errors until at least one, and likely both, of the following happens:

More users, say 10%, get the new browser, in which case there will be a long lapse between when the issue is created and when it is detected.
Downstream KPIs are drastically impacted, i.e. you lose huge slews of customers, money, etc.

For this reason, it’s critical to implement model monitoring of the system at the most granular levels in order to detect these sorts of issues before they snowball into something much larger.

The key thing to note here, though, is that model retraining will not help in this circumstance, because there is not a problem with the model. The model will still continue to utilize the broken feature, regardless of how many times it’s retrained. It is the identification of this broken component that is needed in order to resolve the issue, not the retraining of the model.
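
To make the cookie example concrete, here is a minimal sketch of what granular monitoring could look like: it tracks how often a feature is missing, broken down by browser. The field names (`browser`, `cookie_age_days`), the traffic numbers, and the 0.2 threshold are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def missing_rate_by_segment(records, feature, segment_key):
    """Return {segment: fraction of records in which `feature` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        if record.get(feature) is None:
            missing[segment] += 1
    return {segment: missing[segment] / totals[segment] for segment in totals}


def flag_segments(rates, threshold):
    """Return the segments whose missing rate exceeds `threshold`, sorted by name."""
    return sorted(segment for segment, rate in rates.items() if rate > threshold)


# Hypothetical traffic: 0.1% of requests come from the beta browser,
# and the cookie-based feature is missing for all of them.
records = (
    [{"browser": "chrome-stable", "cookie_age_days": 30}] * 9890
    + [{"browser": "chrome-stable", "cookie_age_days": None}] * 100
    + [{"browser": "chrome-beta", "cookie_age_days": None}] * 10
)

overall_rate = sum(r["cookie_age_days"] is None for r in records) / len(records)
rates = missing_rate_by_segment(records, "cookie_age_days", "browser")
flagged = flag_segments(rates, threshold=0.2)

print(f"overall missing rate: {overall_rate:.3f}")
print(f"flagged segments: {flagged}")
```

In this made-up data the overall missing rate stays close to 1%, which would rarely trigger an alert, while the per-browser view shows that the feature is missing for every beta user.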

The small, but crucial, segment

As a natural continuation of the above, a lack of granular monitoring of your data and model could cause you to miss small, but highly important, segments of users that are vital to your business. Maybe your real estate business could benefit immensely from being able to identify high net-worth clientele along with the homes they are most likely to buy. Or perhaps your retail business would like to identify expectant mothers in order to recommend pregnancy-specific products. Yet, because these segments represent a minority, adding more general customer data to your training pipeline could compound the problem, as could retraining your model on the data you already have. Optimizing for the average case will only cause your model to perform worse on the segments that matter. In these sorts of situations, the key is to identify your most important customer segments via granular monitoring of their impact on the most valuable business KPIs and then gather more data from those specific segments. Model retraining without the prerequisite step of targeted data acquisition will only serve to make your model worse.
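
As a rough illustration of how an average can hide a struggling segment, the sketch below computes accuracy both overall and per segment. The segment names and counts are invented for the example, and accuracy stands in for whatever metric or business KPI actually matters to you.

```python
from collections import defaultdict


def accuracy_by_segment(rows):
    """rows: iterable of (segment, y_true, y_pred). Returns (overall, per_segment)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, y_true, y_pred in rows:
        total[segment] += 1
        correct[segment] += int(y_true == y_pred)
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment


# Hypothetical evaluation data: the small segment is only 5% of all rows.
rows = (
    [("general", 1, 1)] * 912
    + [("general", 1, 0)] * 38
    + [("high_net_worth", 1, 1)] * 20
    + [("high_net_worth", 1, 0)] * 30
)

overall, per_segment = accuracy_by_segment(rows)
print(f"overall accuracy: {overall:.3f}")
for segment, accuracy in sorted(per_segment.items()):
    print(f"{segment}: {accuracy:.2f}")
```

With these numbers the overall accuracy looks healthy even though the model is wrong on most of the high net-worth rows, which is exactly the kind of gap that only segment-level monitoring exposes.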

Concept drift may be too much for retraining

Concept drift occurs when real-world behavior shifts in such a way that the relationship between a model’s input data and the target it predicts fundamentally changes. In other words, the mapping the model learned from its training data no longer holds; the shift is often gradual, which makes it easy to miss. Streaming and real-time data are especially vulnerable to the effects of concept drift, as these types of data reflect the continuously varying state of the world. Sometimes the changes in behavior behind concept drift are so extreme that retraining the ML model is not an effective mitigation tactic. Instead, other approaches to solving the problem must be pursued.

Some of the more effective mitigations for concept drift can include improved feature engineering and feature selection. Being able to reshape the relationship between the data being streamed in and the model’s outputs can often be accomplished by choosing newer and more representative features. Other solutions can include modifying the learning algorithm, selecting new hyperparameters, or simply finding newer data to train on. The key is being able to recognize when such mitigation measures are required, and this can only be achieved via comprehensive monitoring.
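
One simple way to notice that such measures may be needed is to track the model’s error rate over a sliding window of labeled outcomes and compare it with the error rate seen at deployment time. The sketch below does only that; it is a plain windowed check rather than a formal drift detector such as DDM or Page-Hinkley, and the class name, window size, and tolerance are assumptions made for the example.

```python
from collections import deque


class WindowedErrorMonitor:
    """Alert when the recent error rate exceeds the baseline by a set tolerance."""

    def __init__(self, baseline_error, window=100, tolerance=0.10):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # 1 = wrong prediction, 0 = correct

    def update(self, y_true, y_pred):
        """Record one labeled outcome; return True if an alert should fire."""
        self.errors.append(0 if y_true == y_pred else 1)
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough history yet
        recent_error = sum(self.errors) / len(self.errors)
        return recent_error > self.baseline_error + self.tolerance


# Hypothetical stream: 5% errors for 200 outcomes, then 50% errors after a shift.
stream = [(1, 0 if i % 20 == 0 else 1) for i in range(200)]
stream += [(1, 0 if i % 2 == 0 else 1) for i in range(200, 400)]

monitor = WindowedErrorMonitor(baseline_error=0.05)
alerts = [i for i, (y_true, y_pred) in enumerate(stream) if monitor.update(y_true, y_pred)]
first_alert = alerts[0] if alerts else None
print(f"first alert at outcome index: {first_alert}")
```

A check like this only says that something changed. Deciding whether the right response is new features, a different algorithm, or fresher data still takes investigation.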

Even the most accomplished and experienced AI teams are vulnerable to such concept drift-driven issues. Launched in 2008, Google Flu Trends (GFT) was designed to predict flu activity and outbreaks based on Google search queries. The initial versions of the product were promising, and early reports suggested that GFT was capable of predicting regional flu outbreaks up to 10 days ahead of the CDC.

[Images: press coverage of Google Flu Trends in Wired, The Guardian, Time, and Harvard Business Review]

The reality of GFT was unfortunately not so rosy, as a widely cited article in Science and the negative press that followed made clear. While the early years of the product were promising, after some time its predictions began to drift. In fact, during the 2011-2013 flu seasons, GFT substantially overestimated the number of flu-related doctor visits, by more than 50% in some periods and reportedly by about a factor of two at the peak of the 2012-2013 season.

So how did this happen even though the model was periodically retrained? Simply put, model retraining was insufficient because the ways in which users were searching on Google changed too significantly over the months and years of the model’s life. Such changes might have included people searching for symptoms attributable to other illnesses, changes in language and slang, and a general shift in what people use Google for. Regardless of the underlying behavioral cause, the fix would’ve required rethinking and automating the feature selection process, or perhaps using a new learning algorithm altogether.

Better AI monitoring could have led to earlier insights into the underlying concept drift and allowed Google to address it prior to receiving negative press.

Flaws in ML model retraining mechanisms

Implementing automated model retraining eventually becomes a necessary part of every ML pipeline, but it is not sufficient. In fact, model retraining introduces complexities of its own into the system which can actually cause downstream issues in production AI. As an example, bugs in data collection systems can propagate into models when those ML models are retrained on the flawed data. Alternatively, errors in automatic or human labeling processes can cause models to learn to predict erroneous relationships. Therefore, model retraining cannot be viewed in isolation, but must be understood within the context in which it exists. These sorts of issues can be caught only via monitoring of the retraining mechanism itself, and such monitoring is crucial to ensuring that retraining does not actually cause more problems than it solves.
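
As one possible form of monitoring the retraining mechanism, the sketch below runs a few sanity checks on a freshly labeled batch before it is allowed into a retraining job. The function name, the field names, and the tolerance are hypothetical; a real pipeline would likely check many more properties.

```python
def validate_training_batch(rows, label_key, expected_positive_rate,
                            max_relative_shift=0.5, required_fields=()):
    """Return a list of problems found in a labeled batch (empty list = looks OK)."""
    if not rows:
        return ["batch is empty"]
    problems = []

    # A labeling bug often shows up as an implausible share of positive labels.
    positive_rate = sum(1 for row in rows if row[label_key] == 1) / len(rows)
    low = expected_positive_rate * (1 - max_relative_shift)
    high = expected_positive_rate * (1 + max_relative_shift)
    if not low <= positive_rate <= high:
        problems.append(f"positive rate {positive_rate:.3f} outside [{low:.3f}, {high:.3f}]")

    # A data collection bug often shows up as missing values in required fields.
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        if missing:
            problems.append(f"{missing} rows are missing '{field}'")
    return problems


# Hypothetical batches: one healthy, one with over-labeling and a broken field.
good_batch = [{"is_fraud": 1, "amount": 10.0}] * 20 + [{"is_fraud": 0, "amount": 10.0}] * 980
bad_batch = [{"is_fraud": 1, "amount": None}] * 200 + [{"is_fraud": 0, "amount": 10.0}] * 800

good_problems = validate_training_batch(good_batch, "is_fraud", 0.02, required_fields=("amount",))
bad_problems = validate_training_batch(bad_batch, "is_fraud", 0.02, required_fields=("amount",))
print(good_problems)
print(bad_problems)
```

If the returned list is not empty, the retraining job can be paused for review instead of silently learning from flawed data.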

Furthermore, retraining is not a cost-free process. Continuously retraining a model requires gathering new data, paying to store all this data, and paying for the compute time required to retrain the model. It also introduces all of the aforementioned risks. Smarter monitoring can help you to understand when retraining is actually necessary rather than just a box to be checked. In addition, AI monitoring can help with targeted retraining focused on specific segments. You might not always want to retrain your model on an entirely new training set, but might instead want to fine-tune its weights using data gathered from a specific, underrepresented segment. ML monitoring allows these important segments to be identified and provides insight into when retraining is actually required. As an analogy, doing the same full-body workout every time you go to the gym can cause injury and yield diminishing returns. Instead, it’s important to introduce targeted workouts that train specific sets of muscles in order to build a strong and healthy physique. Similarly, model retraining should often be focused on specific data segments that are informed by comprehensive monitoring processes.
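
To illustrate how monitoring output could drive that decision, here is a small sketch of a retraining policy. It assumes per-segment accuracy and each segment’s share of traffic are already available from monitoring; the function name and the thresholds are hypothetical and would need tuning for any real system.

```python
def plan_retraining(segment_accuracy, segment_share, floor=0.90, full_retrain_share=0.5):
    """Return (action, degraded_segments), where action is 'skip', 'targeted', or 'full'."""
    degraded = sorted(s for s, acc in segment_accuracy.items() if acc < floor)
    if not degraded:
        return ("skip", [])  # nothing is below the floor, so retraining is not needed yet
    affected_share = sum(segment_share[s] for s in degraded)
    if affected_share >= full_retrain_share:
        return ("full", degraded)  # most traffic is affected
    return ("targeted", degraded)  # gather data and fine-tune for these segments only


# Hypothetical monitoring snapshot.
accuracy = {"general": 0.96, "high_net_worth": 0.40}
share = {"general": 0.95, "high_net_worth": 0.05}

plan = plan_retraining(accuracy, share)
print(plan)
```

For this snapshot the policy suggests a targeted effort on the one degraded segment rather than a full retraining run.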

The takeaway

ML models in the real world are not siloed in the same way that they often are in research settings. Constant streams of incoming new data can lead to concept drift and throw off model predictions in a variety of unexpected and unpredictable ways. While it is important to have model retraining procedures in place, reaching for retraining as a silver bullet for every problem is not the answer. In fact, doing so introduces additional complexity and can create more problems than it solves. The key to a smart and effective program of model retraining lies in comprehensive monitoring at the most granular levels, allowing you to identify which segments of your data are most important to your ML models and your business. Only then can selective retraining provide the benefits that it promises. Comprehensive monitoring is part of a much larger industry-wide approach to product-oriented data science and builds increased trust in AI-based solutions. I encourage you to incorporate AI monitoring at each and every stage of your model deployment pipeline.