Why CVSS (Common Vulnerability Scoring System) is Not Enough
May 7, 2020
The focus of today’s talk is to discuss in more depth the shortcomings of CVSS. The goal of the talk is not to diminish the role of CVSS, but to position it as a starting point for fuller, deeper vulnerability prioritization.
Why is vulnerability prioritization important?
So first, let's discuss the fundamental problem of vulnerability prioritization: more security vulnerabilities were publicly disclosed in the first quarter of this year (2019) than in any previous three-month period. And that is quite an understatement, because the number of vulnerabilities disclosed each year actually grows nearly exponentially.
And just between 2016 and 2017, the number of disclosures more than doubled. This should come as no surprise, as the complexity of the technology landscape grows year after year, simply because of the number of new technologies released every year. And it's a known property of complex systems that exponential behavior emerges from them.
Maybe that's a bit technical, but it just means it should come as no surprise that the number of vulnerabilities and security problems organizations face increases dramatically year after year. It's simply a consequence of a complex phenomenon.
But the real problem does not come from the growing number of disclosures in national vulnerability databases; it's the number of vulnerabilities an organization faces. The number of assets to secure grows exponentially, yet the number of available security experts does not grow at the same rate, and this is a well-known fact of the cybersecurity industry. So how can we solve this problem of information overload? How can we tell which vulnerabilities really matter for an organization? This is the problem that CVSS tried, and is still trying, to solve.
What are the shortcomings of the CVSS score?
I want to emphasize that the goal of this presentation is not to present CVSS in and of itself, so we assume the audience is slightly familiar with the concept. I'll just broadly explain the computation behind CVSS. It's a standard developed and owned by an organization called FIRST (Forum of Incident Response and Security Teams), and its goal is to assess the criticality, or technical severity, of a vulnerability. The CVSS score is computed in three different stages: a base metric, a temporal metric, and an environmental metric. Each of these stages is composed of categorical sub-metrics with predetermined values like none, low, medium, and high, so there is a finite number of values for each stage.
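To make those stages less abstract, here is a minimal sketch of the CVSS v3.1 base-score equation for the scope-unchanged case, using the metric weights published in the FIRST specification (a simplified reimplementation for illustration, not an official one):

```python
import math

# Metric weights from the CVSS v3.1 specification (scope unchanged).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                          # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}               # Privileges Required
UI = {"N": 0.85, "R": 0.62}                          # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}               # C/I/A impact

def roundup(x):
    """CVSS 'Roundup': smallest one-decimal value >= x."""
    return math.ceil(round(x * 10, 5)) / 10

def base_score(av, ac, pr, ui, c, i, a):
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss                                  # scope-unchanged impact
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
print(base_score("N", "L", "N", "N", "H", "H", "H"))  # -> 9.8
```

Note how the formula is a fixed arithmetic combination of a handful of categorical weights: this is exactly why the output can only take on a limited set of values, which matters for the granularity discussion below.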
But even when these scores are computed in practice by organizations, they still have a lot of shortcomings, which we're going to talk about next. The first shortcoming I want to discuss is the opacity of the equation. It is a stated goal of CVSS to be easily implementable and interpretable by security professionals, but the actual values seem to be made up of rather arbitrary numbers, and it's very hard to interpret them. Experiments have been done to reproduce the process of assigning scores to vulnerabilities, and even the most accurate 50 percent of the professionals surveyed mis-scored vulnerabilities by two to four points. This is really problematic because it shows an inherent uncertainty in the scores of vulnerabilities.
For comparison, we developed a machine learning natural language processing model in-house here at Delve, and we obtained a much narrower margin of error of 0.93 severity points, which is a lot closer. That's telling for a process that was designed to be carried out by humans.
Another problem related to this opacity is that CVSS is supposed to work well with textual descriptions of vulnerabilities. But what do we do with zero-days, or with misconfigurations like static resources found on networks? Those are definitely security problems, and they are vulnerabilities, but they don't come with a full-fledged textual description. Still, we need to prioritize them on the same scale as any other vulnerability.
Now, the second shortcoming, which is really important: the lack of granularity. This should come as no surprise, since we have a limited number of values for each categorical sub-component of the score; in practice, there are only about 100 possible scores. But vulnerabilities come in the millions. How can a typical organization deal with 500,000 vulnerabilities when there are only about 100 possible base scores? How can you possibly imagine prioritizing with those numbers? You don't need to be a mathematician to see that you're going to end up with many, many vulnerabilities sharing the same score.

In practice, it's even worse: just six different numerical scores make up 67 percent of all vulnerabilities, and only 73 of the roughly 100 possible scores actually appear in the NVD. When we look at enterprise networks, we see only about 50 different scores, an even starker illustration of the granularity problem.

But the real problem does not come from the vulnerabilities present on the network; it comes from the vulnerabilities that are attackable, that are exploitable. Only about 10 percent of all vulnerabilities have been seen to be exploited in practice, which narrows the effective granularity of CVSS even further. And, linked to that: of those exploitable vulnerabilities, only a small subset are present on assets that organizations really care about. Let's call them important assets or critical assets. It's really this subset of priorities that an organization should take care of.
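The pigeonhole arithmetic behind this granularity problem is easy to sketch, using the numbers quoted above (illustrative figures from the talk, not fresh measurements):

```python
# Back-of-the-envelope illustration of the granularity problem.
total_vulns = 500_000      # vulnerabilities facing a large organization
possible_scores = 101      # CVSS base scores: 0.0, 0.1, ..., 10.0

# Even with a perfectly uniform spread, thousands share each score:
avg_per_score = total_vulns / possible_scores
print(f"average vulns per score (uniform): {avg_per_score:.0f}")

# In practice the distribution is far from uniform: per the talk,
# six scores cover 67% of all vulnerabilities.
in_top_six = int(total_vulns * 0.67)
print(f"vulns sharing just six scores: {in_top_six}"
      f" (~{in_top_six // 6} per score)")
```

Either way you slice it, a ranking with this many ties cannot tell a remediation team what to fix first.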
Now, how do we define that? How do we determine which assets are critical? Well, there is absolutely no place in the CVSS score to achieve this level of granularity and context awareness.
And now for the last shortcoming, which is really a combination of the previous ones: CVSS is not correlated with exploitation likelihood or potential impact, so it is not a measure of risk. What we would like is a metric that is an actual measure of risk. In an ideal world, we would get something like an expected loss for each vulnerability on an asset. But to compute that, we would need something like actuarial tables, a cash value for each asset, and a real probability of exploitation. As far as the state of research goes right now, I think this is not possible to implement in an automated way. So where do we go from here? Maybe we can't get this perfect expected dollar loss over every asset, but maybe we can substitute something more meaningful and more context-driven.
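The ideal metric described above would be a simple product, if only its inputs were knowable; the probabilities and asset values below are entirely hypothetical, which is exactly why this can't be automated today:

```python
# Idealized risk-as-expected-loss. Neither input is directly observable
# in practice, which motivates the proxy values discussed later.
def expected_loss(p_exploit, asset_value):
    """Expected loss = P(exploitation) x loss if exploited."""
    return p_exploit * asset_value

# Hypothetical numbers for two vulnerabilities: note that the riskier
# one is NOT the one on the more valuable asset.
print(expected_loss(0.02, 1_000_000))  # 2% chance on a $1M asset -> 20000.0
print(expected_loss(0.30, 10_000))     # 30% chance on a $10k asset -> 3000.0
```

Even this toy version shows why a severity-only score misleads: ranking by asset value alone, or by likelihood alone, gives a different (and wrong) ordering than ranking by their product.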
What are the components of a reliable vulnerability risk score?
So here are the five characteristics we would like a good risk score to have. A single number with which you can compare every vulnerability. Granular enough that any two given scores are comparable: one is either higher or lower than the other. Risk-based, so that it encodes something close to a probability of exploitation and something close to an importance to the organization. Automated, of course, so that we don't have to rely on pricey security experts, who simply don't scale; humans don't scale. And context-driven, really specific to the organization.
What is vulnerability context and why is it so important?
So how do we represent context? This is a fundamental research question that many statisticians and machine learning scientists would like to solve. But first, we need to answer some smaller questions. What sources of information do we need to represent context? What sources of information do we have? And from what we have, what information can we reliably infer? This is really where machine learning comes into play, and what we're going to explore in the rest of the presentation.
So we have some specific requirements for good sources of information, and they come from general statistical background knowledge. We want them to be as diverse as possible: the more diverse, the merrier. It's really the wisdom-of-the-crowd argument. Put another way, you don't want to base your judgment on a single judge, on a single source of information. And then you want to multiply the sources: not just wider, but deeper. You need that diversity to normalize imperfect information. It's a law-of-large-numbers argument, and it's simple to comprehend even if it's a bit more complex in the details.
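This law-of-large-numbers argument can be shown with a tiny simulation (the true severity and the noise level are made up): each source is an imperfect judge, but the average of many lands far closer to the truth than any single one can be trusted to.

```python
import random

random.seed(0)          # fixed seed for reproducibility
true_severity = 7.5     # hypothetical ground-truth severity

def noisy_source():
    # a single imperfect judge: unbiased, but noisy (+/- ~2 points)
    return true_severity + random.gauss(0, 2.0)

one_source = noisy_source()
many_sources = sum(noisy_source() for _ in range(1000)) / 1000

print(f"single source estimate:      {one_source:.2f}")
print(f"average of 1000 sources:     {many_sources:.2f}")
```

The standard error of the 1000-source average is roughly 2.0 / sqrt(1000), about 0.06 points, versus 2.0 for a lone source: diversity quite literally averages the imperfection away.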
So what information would we like to have, and how would we represent the vulnerability and its context? First, we have the alert for the base vulnerability; let's take a buffer overflow. It has a description and a technical severity from which we can infer the base score. Then we can look at more specific information: for example, the method of assessment, the ease of exploitation, and the detection reliability. Then we can look at which asset the vulnerability lives on, and it's really this that makes context assessment possible. We don't have time to go through all of these assessment steps, but what you have to understand is that we go from the specific to the general, asking specific questions we would like to answer. For some of them, we can use simple if-then-else heuristics; for others, we need more sophisticated methods of assessment. But if we can obtain an efficient representation of the vulnerability across all these levels of abstraction, then we can start to compute meaningful scores.
So let's say we start with the basic CVSS score, and then we successively add these assessments, each one pushing the score from the previous step up or down. What we end up with is a reduction of the inherent uncertainty of the score. This, again, is the wisdom-of-the-crowd argument: by normalizing the assessment steps with respect to all the other vulnerabilities in the system, and with respect to all the assessment steps performed, you converge to a meaningful value. So, given the numbers we wish for, this is how I think we can compute, in a statistically sound way, an end score that is meaningful, and that is probably approximately correct. But this is not where the magic happens. The magic happens when we compare all the vulnerabilities of an organization against each other.
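As a rough sketch of this successive-refinement idea (not Delve's actual algorithm; the step sizes here are invented for illustration), each assessment nudges the score by a small, bounded amount, so no single possibly-wrong step can dominate the result:

```python
def refine(base, adjustments):
    """Apply small signed adjustments to a base score, clamped to [0, 10]."""
    score = base
    for delta in adjustments:
        score = min(10.0, max(0.0, score + delta))
    return round(score, 2)

# Hypothetical assessment steps for one vulnerability:
steps = [
    +0.8,   # asset looks "outstanding" in its network segment
    -0.3,   # detection flagged as possibly unreliable
    +0.5,   # exploit chatter observed in external sources
    -0.2,   # organizations patch this class of issue slowly
]
print(refine(7.5, steps))  # 7.5 + 0.8 - 0.3 + 0.5 - 0.2 -> 8.3
```

Because each delta is small and signed, two vulnerabilities that started with the same base score will almost always end up with different refined scores, which is precisely where the extra granularity comes from.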
So here we have a typical organization, with real data. On the left, you can see that if we only take the CVSS base score, we have a very small number of possible values, which should come as no surprise given what we discussed before. But at each step of the assessment, a lot of granularity emerges, and there is a rescaling of the scores, with very few really critical ones and a large group of, let's say, ordinary vulnerabilities that are not a critical priority for the specific organization in question. This is a good representation of the mathematical process that happens. What we get in the end are distributions a bit like what's shown here: a small number of critical scores, and we know why they are critical, because at each of these steps we can output the specific assessments and show them to the user, so we know why we prioritized these vulnerabilities the way we did. Given the assessment steps, this solves the problems of granularity and opacity. And it's really valuable to obtain these numbers, because remediation teams can focus confidently on the highest-priority vulnerabilities: real priorities that are context-driven.
What is machine learning and how does it help vulnerability prioritization?
So again, what is the role of machine learning in all this? Well, the goal of machine learning here is to infer those numbers that old-school security folks are skeptical of, and specifically to arrive at these scores reliably. The goal of machine learning is to infer reliable information from indirect and noisy signals, which is what we're going to call proxy values. This concept is very interesting: a proxy value is a measurable value that represents another one we're actually interested in. Let's say, for example, we would like to know the cash value of an asset. It's probably not possible to determine that value directly, but maybe we can find another value that depends on the cash value and moves in a correlated fashion with it, and use that as a proxy to prioritize assets by cash value.
First, a short primer on machine learning. I bet most people have seen this many times, because it's really fashionable and appears in a lot of news articles. The goal of machine learning is to learn facts, or inferences, about a given population of interest from a sample of that population, or, more precisely, from a data representation of that sample. It's a very powerful concept: being able to infer facts about a broader group just from a representation of part of the group.
Supervised versus unsupervised machine learning
There are two basic types of machine learning: supervised learning and unsupervised learning. Supervised learning is used when we have supplied labels, a ground truth for the data. From this ground truth, we learn the correlation between the data input and the output, and from that we can generalize to unseen inputs. One of the two main families of supervised learning is regression, where you try to predict a number: for example, the cash value of a house, given characteristics like the number of square feet or the number of windows. The second family is classification, where you try to output categories: the categories could be a critical versus a non-critical vulnerability, or a cat versus a dog in a picture.
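The house-price regression example can be shown concretely with a one-variable least-squares fit, using only the standard library (the data is made up, and deliberately exactly linear at $200 per square foot, so the fit is perfect):

```python
# Toy supervised regression: fit price vs. square feet by least squares.
sqft = [800, 1000, 1200, 1500, 2000]
price = [160_000, 200_000, 240_000, 300_000, 400_000]  # exactly $200/sqft

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
         / sum((x - mean_x) ** 2 for x in sqft))
intercept = mean_y - slope * mean_x

# Generalize to an unseen input: predict the price of a 1,300 sqft house.
print(round(slope * 1300 + intercept))  # -> 260000
```

This is the whole supervised-learning loop in miniature: labeled examples in, a learned relationship out, and a prediction for an input never seen during training.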
My personal favorite part of machine learning is unsupervised learning, also known as pattern recognition. The goal here is to extract structure from the data without any supplied labels, and it's a very, very powerful framework. The two main methodologies generally known to the public are clustering, where we try to learn subgroups of the data based on characteristics we don't know about in advance (a technique widely used in marketing, for example, where you want to segment clients without pre-defined categories), and outlier detection, or outstanding-asset detection, which is of particular interest here. It's also a very powerful framework to work with, and we're going to see an example of it.
How is machine learning used in vulnerability management?
Now we're going to discuss some examples of machine learning applied to vulnerability management. We mentioned proxy values earlier, specifically to rank assets by their cash value. But what if instead we could rank assets by their potential interest to an attacker? This is extremely context-specific. Take a Linux box among a range of Windows workstations, or a Windows workstation among a range of Linux boxes: that is what we call an "outstanding" asset in a network. If we look at specific features for each of these assets, individually as well as blended together, and run an anomaly detection algorithm, we can detect which assets have outstanding values. If we do it well, these assets will be of more interest to the organization, and we can bump up the scores of their vulnerabilities according to how far they sit from the herd. Again, this is just one specific example, but we can easily imagine others from the field of unsupervised learning.
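Here is a toy version of that outstanding-asset detection, using a single numeric feature and a z-score threshold (hypothetical asset names and values; a real system would blend many features and a stronger anomaly-detection algorithm):

```python
import statistics

# One feature per asset: count of open Windows-typical ports.
# A lone Linux box among Windows workstations stands out on it.
open_windows_ports = {
    "ws-01": 5, "ws-02": 6, "ws-03": 5, "ws-04": 7,
    "ws-05": 6, "ws-06": 5, "linux-box": 0,
}

values = list(open_windows_ports.values())
mean = statistics.mean(values)
stdev = statistics.pstdev(values)

# Flag assets more than 2 standard deviations from the herd.
outliers = [name for name, v in open_windows_ports.items()
            if abs(v - mean) / stdev > 2.0]
print(outliers)  # -> ['linux-box']
```

The same z-score idea extends to any numeric feature (open services, OS fingerprint distances, traffic volume), and the distance itself can feed directly into the score adjustment described earlier.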
Now, for the next example, we're going to talk about detection reliability. It is a known fact that false positives are a crippling problem in the vulnerability management landscape: a lot of security products will give you hundreds of thousands of false positives when you run a VM scan. A solution we implemented in the product is to let users flag vulnerabilities as false positives or as verified. We blend these assessments with our security expertise, used as prior knowledge, in what we call a Bayesian inference engine. It's really a supervised classification problem: we have the user-supplied labels, from which we learn the distribution of false positives, and then we can predict false positives on new vulnerabilities.
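A minimal sketch of such a Bayesian update, using a Beta-Bernoulli model (an assumed model for illustration, not necessarily the product's exact engine): the expert prior is expressed as pseudo-counts, and user flags simply add to them.

```python
# Beta-Bernoulli update for a detection check's false-positive rate.
def posterior_fp_rate(prior_fp, prior_tp, flagged_fp, flagged_ok):
    """Return the posterior mean false-positive rate of a check."""
    alpha = prior_fp + flagged_fp   # prior pseudo-counts + user FP flags
    beta = prior_tp + flagged_ok    # prior pseudo-counts + verified findings
    return alpha / (alpha + beta)

# Expert prior: this check is wrong roughly 2 times in 10 (Beta(2, 8)).
# Users then flag 6 findings as false positives and verify 4.
print(posterior_fp_rate(2, 8, 6, 4))  # -> 0.4
```

The appeal of this formulation is that expert knowledge and user feedback live on the same scale: a strong prior (large pseudo-counts) resists a handful of noisy flags, while accumulating evidence eventually dominates it.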
For the third example, we're going to talk about the remediation time of a vulnerability as a signal of priority. Of course, we would like to know which assets are priorities for each organization, but without automation we would need organizations to manually enter a rating or priority for every asset, and that's not scalable. What we can do instead is look at which assets organizations patch, and how quickly, and learn from this. By blending the remediation times of all the organizations, we end up with a pretty good signal of what is important, because the hypothesis here is that people patch the most important things first.

The analogy is this: if you cheat on an exam by copying only your neighbor's answers, there's a good chance your neighbor is just as bad at solving the problem as you are. But if you could get the answers of every other student in the class, you could blend them all and extract a probably good answer. So here we use supervised regression, because we predict a number of days: the shorter the predicted remediation time, the higher the priority, and we bump the score up or down accordingly. In all of these assessment steps, the effect of each individual one is small compared to their combined effect. This is very important in a statistical sense, because we don't want any single, possibly wrong, assessment to have too much impact on the score; but in combination, it's very improbable that we're wrong on every one of them.
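A hedged sketch of how a predicted remediation time could translate into a small, bounded score adjustment (the thresholds and deltas below are invented for illustration; the point is the bounded, signed mapping, not the exact numbers):

```python
# Map a regression model's predicted time-to-patch, blended across
# organizations, into a small signed score delta.
def remediation_adjustment(predicted_days):
    """Faster collective patching implies higher priority."""
    if predicted_days <= 7:        # patched within a week across orgs
        return +0.5
    if predicted_days <= 30:       # patched within a month
        return +0.2
    if predicted_days <= 90:       # unremarkable urgency
        return 0.0
    return -0.3                    # chronically ignored: weak urgency signal

print(remediation_adjustment(3))    # -> 0.5
print(remediation_adjustment(120))  # -> -0.3
```

Note that the largest possible delta is half a point: consistent with the talk's argument, a single mistaken assessment can only nudge the final score, never swing it.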
So let's go to our fourth and last example of using machine learning. We developed a natural language processing engine in-house that monitors exploitation trends, black markets, Twitter, any signal from the external world about which vulnerabilities are being exploited or which systems or software are trending right now. We can map this to the internal representation of the vulnerabilities: the descriptions, the software services, the user-defined tags. Using this mapping, we can infer something close to an exploitation likelihood, because there is a strong correlation between publication on black markets, publication of exploits in Exploit-DB, and exploitation inside organizations.
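As a toy stand-in for that mapping, here is a simple token-set overlap between an external post and internal vulnerability descriptions (the post, the identifiers, and the descriptions are invented; real NLP models are far richer than bag-of-words Jaccard similarity):

```python
# Match external exploitation chatter to internal vulnerability records
# by Jaccard similarity over lowercase token sets.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

external_post = "new exploit released for apache struts remote code execution"
internal_vulns = {
    "VULN-A": "apache struts remote code execution via ognl injection",
    "VULN-B": "openssl padding oracle information disclosure",
}

best = max(internal_vulns,
           key=lambda k: jaccard(external_post, internal_vulns[k]))
print(best)  # -> VULN-A
```

A match like this becomes one more small, signed adjustment: a vulnerability whose description strongly overlaps with fresh exploit chatter gets its score bumped up, exactly like the other assessment steps.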
As you can imagine, when you start using proxies for values of interest that are context-specific and context-important, the only limits are your imagination, your research time, and your ability to iterate quickly in developing machine learning models and, of course, monitoring them, so that you're assured they don't produce absurd results. What we have here takes the CVSS base score and modifies it in a statistically sound way. We have a single actionable metric. We've seen that it's granular enough to compare any two vulnerabilities. It is risk-based, because it includes the priorities of the organization and the likelihood of exploitation. It is fully automated, because it's done by machine learning, and frankly, that is magical. It is context-driven, because everything we add to the CVSS score is information specific to the organization, to the network, to the asset, and to the vulnerability in question. And how do we do it in a reliable way? We need diversity and multiplicity of sources. We need a multi-level context representation: vulnerability, asset, network, organization, and external. It has to use machine learning; otherwise, we can only make simple and probably false assessments. And it has to be statistically sound and scalable.