Growing a Machine Learning project - Lessons from the field
August 6, 2020
Machine learning (ML) projects should not be managed as software development projects. Although both use code, ML projects are prone to a higher rate of change over the course of their lifetime. This happens because of their close connection to the environment.
In crude terms, one might think about a software development project as building a finished product such as a plane or a car. On the other hand, the ML project is similar to operating a farm. You need to have all the necessary tools to operate a farm but you also need to respond to the changes in the environment such as weather, nutrients in the soil, possibility of being invaded by parasites or your crop getting sick etc.
A farm needs much more attention since it’s embedded in its surrounding environment much more closely than a car is in the road. In the same way that a farm is connected to its surroundings, ML models are connected to their data. If models are not managed or taken care of, they will not yield good predictions. Similarly, a farm will not yield a good crop if its environment is neglected.
How is the life cycle of a machine learning project different from a software life cycle?
The difference between software and machine learning workflows is nicely illustrated by Clemens Mewald in his talk, "Managing the full TensorFlow training, tracking, and deployment lifecycle with MLflow (sponsored by Databricks)". The diagrams below are taken from Clemens' talk.
The ML workflow has more steps than the software workflow. At many points in the workflow, we might need to go back to a previous state. During deployment, maintenance, or analysis, we might gain new insights. This new understanding might require us to go back to an earlier stage, for example to redo model or feature selection. In particular, it’s possible to go from live monitoring all the way back to putting data into the right format. As a result, we may go through the entire workflow multiple times during the lifetime of the ML system.
Unlike in software development, unit and integration tests are not sufficient for guaranteeing that ML systems are working as expected. Machine learning is coupled tightly with data. It’s the perfect example of the garbage-in-garbage-out principle. ML aims to find patterns in data, so it cannot perform well if the data that it receives is of poor quality. Therefore, if the quality of the incoming data suddenly drops after the model is released, the performance of the whole system drops with it. When this happens, you have no other choice but to fix the system.
ML system development is more time-consuming than software development. One of the reasons is that machine learning has not been used in production for long. Although it shares certain design principles and practices with software development, the productionizing and design of machine learning systems is a young field. There are still important things that need to be figured out. As is the case for any young and popular technology, there are many different corporate and open source tools and platforms being developed to handle problems in its domain: MLflow, Amazon SageMaker, Databricks, Algorithmia, Apache Airflow, Azure ML, H2O.ai, etc. Also, model selection and hyperparameter tuning are computationally intensive steps that could take multiple days to complete. Additionally, the majority of ML systems have a research character that makes development take longer. As a result, it’s hard to estimate how long it would take to build and maintain an ML system.
Although we use software development techniques in building ML systems, they should not be treated as software development projects. Even though it’s clear that ML systems can benefit from the design principles in software development, the fact that machine learning has greater coupling with data makes construction and maintenance more complicated and time-consuming than conventional software development systems.
An example in the context of vulnerability management
Assume that we have a data source that tells us the remediation time of vulnerabilities as a function of their CVSS score. Let’s say the data looks something like the figure below, where the vulnerability severity (CVSS) is shown on the x-axis, and the remediation time measured in days is shown on the y-axis.
Now, what if a new vulnerability comes in that has not been fixed yet? We want to give a reasonable estimate of how much time it’s going to take to fix it. Since the data looks vaguely linear, for the sake of this example we decide to use linear regression, even though we did not validate its assumptions.
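To make this step concrete, here is a minimal sketch of fitting a straight line with ordinary least squares and using it to estimate the remediation time of a new vulnerability. The numbers are made up purely for illustration and are not the data from the figure; the function names `fit_line` and `predict` are hypothetical helpers, not part of any product.

```python
# Illustrative sketch: the (CVSS, days) pairs below are synthetic, not real data.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Synthetic training data: CVSS score -> remediation time in days.
cvss_scores = [2.0, 4.0, 6.0, 8.0]
remediation_days = [50.0, 40.0, 30.0, 20.0]

model = fit_line(cvss_scores, remediation_days)

# Estimate for a new, not yet fixed vulnerability with CVSS 7.0.
estimate = predict(model, 7.0)
print(f"estimated remediation time: {estimate:.1f} days")
```

A hand-written fit keeps the sketch dependency-free; in practice one would likely reach for an established library and check the regression assumptions first.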
This linear regression model looks good for a while, but then there is a new client that starts using our product and her data is fed into our system.
Our model deals with it perfectly, giving a good estimate of the remediation time. We can be confident about its prediction. So far, so good.
Now let’s say there is another client that starts using Delve:
Now we can see that the data coming from the third client is different, and our model no longer makes sense. If we measure the R-squared (R2) metric, which measures how well remediation times are predicted by the model (a higher R2 means the model captures the observations better), for the three clients we get the following table:
We can see that the model works better for client 1 than for client 2. However, it completely fails to predict the data for client 3. If the model performance is monitored, the drop in model performance would create an alert to notify a machine learning engineer that there is a problem in production and that she needs to investigate it. She would need to decide if new data can be used to train the model and make predictions or if the model would need to be changed altogether.
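As a rough illustration of how such a per-client score could be computed, the sketch below evaluates R2 for one fixed linear model against two data sets. Everything here is an assumption made for the example: the model coefficients, the observations, and the names `r_squared`, `typical_days` and `shifted_days` are hypothetical and do not reproduce the table above.

```python
# Illustrative sketch: all numbers are synthetic.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# A hypothetical, already fitted model: days = slope * cvss + intercept.
slope, intercept = -5.0, 60.0
cvss = [2.0, 4.0, 6.0, 8.0]
predicted = [slope * x + intercept for x in cvss]

# Observations that roughly follow the trend the model was trained on.
typical_days = [52.0, 38.0, 31.0, 19.0]
# Observations from a source where everything was fixed within a few days.
shifted_days = [5.0, 5.0, 6.0, 4.0]

r2_typical = r_squared(typical_days, predicted)
r2_shifted = r_squared(shifted_days, predicted)
print(f"typical source: R2 = {r2_typical:.3f}")
print(f"shifted source: R2 = {r2_shifted:.3f}")
```

Note that R2 evaluated on data the model was not fitted to can be negative, which means the model does worse than simply predicting the mean of the observations.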
So what happened? The data coming from the third client has a different distribution than the data coming from the previous two clients. There could be multiple reasons behind this data shift. For example, the client’s devices might have had a similar type of problem, or the client might have recently hired new security analysts who fixed multiple devices on the same day.
If we didn’t have online monitoring of models, we would have learned about the problem from our clients. I will not go into detail here about what potential consequences this could have for a company.
Although in this simple example plotting data makes it clear that there has been a change in data incoming from a client, in a more complex situation, when there is a more sophisticated model working with big data, a simple approach like this would not work. We would need to have a monitoring infrastructure that is capable of running statistical tests or that does some sort of anomaly detection on large amounts of incoming data to discover, in a reasonable amount of time, that there is a change in the distribution. The system would need to have alarm capabilities to notify machine learning engineers. Also, it would need to be able to stop ML procedures from using data with a different distribution to limit the harm of propagating wrong predictions.
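One example of the kind of statistical test such infrastructure might run is the two-sample Kolmogorov-Smirnov test, which compares the empirical distributions of a reference sample and an incoming sample. The sketch below is a minimal pure-Python version under stated assumptions: the samples are synthetic, the function names are hypothetical, and the decision rule uses the standard large-sample approximation of the critical value, which is only rough for samples as small as these.

```python
import bisect
import math

# Illustrative sketch: synthetic remediation times in days.

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / n
        cdf_b = bisect.bisect_right(b, v) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def distribution_changed(reference, incoming, alpha=0.05):
    """Flag a change when the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(incoming)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, incoming) > critical


reference = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
similar = [22, 27, 33, 38, 41, 47, 52, 57, 61, 66]
shifted = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]

print("similar batch flagged:", distribution_changed(reference, similar))
print("shifted batch flagged:", distribution_changed(reference, shifted))
```

A check like this could gate incoming batches before they reach the model, with a flagged batch both raising an alarm and being held back from prediction.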
A better approach than plotting is to monitor a model score such as the root-mean-squared error (RMSE) or R2. Then we could use a simple threshold. In the case of R2, when the value of the metric falls below the threshold, we could trigger a notification that is sent to data scientists. A data scientist could then investigate.
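The threshold idea can be sketched in a few lines. This is only an illustration: the threshold value, the scores, the source names and the helper functions are all hypothetical, and a real system would send the alerts to a notification channel rather than print them. Because a lower RMSE is better while a higher R2 is better, the direction of the comparison is a parameter.

```python
import math

# Illustrative sketch: scores and threshold are made up.

def rmse(y_true, y_pred):
    """Root-mean-squared error between observations and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def needs_attention(score, threshold, higher_is_better=True):
    """True when the score is on the wrong side of the threshold."""
    if higher_is_better:
        return score < threshold
    return score > threshold


def collect_alerts(scores_by_source, threshold, higher_is_better=True):
    """Build one alert message per data source whose score crossed the threshold."""
    return [
        f"{source}: score {score:.2f} crossed threshold {threshold:.2f}"
        for source, score in scores_by_source.items()
        if needs_attention(score, threshold, higher_is_better)
    ]


# Hypothetical R2 scores per data source (higher is better).
r2_scores = {"source_a": 0.91, "source_b": 0.78, "source_c": -0.40}
alerts = collect_alerts(r2_scores, threshold=0.5)
for message in alerts:
    print("ALERT", message)
```

For RMSE, the same helper would be called with `higher_is_better=False` and a threshold expressed in days.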
Once the change has been detected, it requires investigation from the data scientist who built the model. She will need to analyze the data and potentially go through the entire ML workflow. In the cybersecurity context, the change can be significant. It can be a signal of a new vulnerability in a system, and could be important enough to trigger the re-prioritization of the tasks in the entire vulnerability management remediation system.
The whole machine learning workflow would need to be executed again, and an updated version of the model would need to be deployed and monitored. The new model could require new data or functionalities, so the software architecture might need to change as well.
You can see from this simple example how data drift could lead to a cascading effect throughout the entire system.
The environment in which ML algorithms operate can change drastically in a short period. As a consequence, the performance of an ML system changes with it. The dependency on the data to make predictions creates services that are more complex and costly in development and maintenance compared to software services that don’t have a strong coupling with the environment.
Unlike in conventional software systems, test harnesses alone do not suffice for systems that produce predictions. Instead, monitoring infrastructure has to incorporate test harnesses as part of the online and offline monitoring. Monitoring infrastructure should continuously test the relationship between the environment and the model to ensure that the model’s predictions match reality. Once we detect data drift, we need to update our model immediately; otherwise, it would produce unacceptable forecasts. In a typical software development project, the product manager decides which features to implement or to delay, but in an ML system, a change in the data could force a change in the system.
Forcing machine learning projects into the management processes of software development causes unnecessary friction and difficulties instead of facilitating and accelerating progress. Rather, software development processes need to be adapted to machine learning requirements and extended with practices that make sense. As an example, monitoring should be one of the first things considered when designing an ML system. Just as a farmer needs to know how well her crop is growing, a data scientist needs to know how well her models perform. If there is a change in the remediation time distribution, it affects the performance of the model. A data scientist in the cybersecurity domain needs to be able to see it right away and have a way to quickly diagnose the causes behind it, as well as deploy an updated version with as little friction as possible.