Learning context at scale, for the benefit of all

This is part 3 in our series on contextual predictive prioritization. In part 1 we presented the overwhelming problem of vulnerability prioritization and in part 2 we presented our solution, contextual analysis, using hands-on examples of real vulnerabilities. This final article will showcase how machine learning actually enables prioritization at scale and across organizations, allowing all players in the process to benefit from the collective intelligence of all other users.

We have seen that we can order priorities by analyzing a series of interdependent factors. This gives a very fine-grained understanding of an organization's security posture, which is then folded into the vulnerability ranking. Statistically, the more diverse the sources of input knowledge, the more sensible the predictions will be.

What we get in the end is a massive reordering of priorities across the whole organization. While some of these input factors can be pretty simple, some more sophisticated models are necessary to really take the prioritization journey to the next step. Here we present a handful of such factors with a quick explanation of their workings.

In this post, we will first provide a quick recap of our prioritization mechanisms, then explain the two main families of machine learning tasks, namely supervised and unsupervised learning. We will conclude with three examples of assessment factors that leverage these techniques and show how they ensure a trustworthy set of predictions.


Ranking of a single vulnerability

The task of ranking vulnerabilities by priority is in itself quite simple: find a complete ordering of vulnerabilities in which any two given vulnerabilities can be compared on the same scale. In our experience, this is best done by analyzing the vulnerabilities in their context, in an increasingly broad manner. The following graph shows the ranking as a sequence in which each intermediate assessment step applies its own partial reordering across all vulnerabilities.

Sequential ranking of a single vulnerability

  1. Base Score: We use the CVSS score as a baseline to work from, as it is an industry-agreed standard. This score is context-free and does not offer enough granularity.
  2. Detection Reliability: We run a series of assessments using Bayesian and aggregated machine learning methods, applied recursively, to rank our own detection methods according to user assessments.
  3. Dynamic Payload Analysis: We can rank payload effectiveness by inspecting HTTP traffic, parameters, web server responses and file content on all DAST-based vulnerabilities.
  4. Exploitation Predictability: We predict the exploitation likelihood of vulnerabilities by monitoring the constantly evolving exploit information mentioned in publishing communities and other threat intelligence feeds.
  5. Asset Context: We develop an ever-increasing understanding of the asset value and relevance to the organization by using a multitude of Machine Learning (ML) methods, meta-heuristics and statistical techniques.
  6. Network Context: An asset is never isolated. By correlating information from all assets on the network, we can infer probable paths of attacks and exploit chaining. We can also identify plausible targets and other network gold nuggets.
  7. Organization Context: Not all organizations remediate the same way. With an efficient representation of the organization and its behaviors, it is possible to predict and influence how systems can be patched efficiently.
  8. Temporal Context: The speed of remediation is critical for certain classes of vulnerabilities. That is why we constantly monitor all assets and vulnerability history in order to predict potential misdetections and time-sensitive discoveries.
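
The sequential ranking described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual ranking algorithm: the stage names, the idea of one multiplier per stage, and the sample values are all hypothetical. Each vulnerability starts from its CVSS base score, every contextual stage rescales it in turn, and the final scores give a complete ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Vulnerability:
    name: str
    cvss: float  # context-free base score, 0.0 to 10.0
    # Hypothetical per-stage multipliers; values above 1.0 raise priority.
    adjustments: dict[str, float] = field(default_factory=dict)


# Hypothetical stage names mirroring steps 2 to 8 of the list above,
# in the order they are applied.
STAGES = [
    "detection_reliability",
    "payload_analysis",
    "exploitation_predictability",
    "asset_context",
    "network_context",
    "organization_context",
    "temporal_context",
]


def contextual_score(vuln: Vulnerability) -> float:
    """Start from CVSS and apply each stage's multiplier in sequence."""
    score = vuln.cvss
    for stage in STAGES:
        score *= vuln.adjustments.get(stage, 1.0)  # missing stage: no change
    return score


def rank(vulns: list[Vulnerability]) -> list[Vulnerability]:
    """Return vulnerabilities ordered from highest to lowest priority."""
    return sorted(vulns, key=contextual_score, reverse=True)


# Invented sample data, for illustration only.
vulns = [
    Vulnerability("A", cvss=9.8,
                  adjustments={"detection_reliability": 0.3, "asset_context": 0.5}),
    Vulnerability("B", cvss=6.5,
                  adjustments={"exploitation_predictability": 1.5, "asset_context": 1.4}),
    Vulnerability("C", cvss=7.5),
]
ranked_names = [v.name for v in rank(vulns)]
print(ranked_names)
```

In this toy data, the vulnerability with the highest CVSS score ends up last once its unreliable detection and low-value asset are taken into account, which is the kind of reordering that contextual analysis is meant to produce.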

Mass ranking of vulnerabilities


We now present the two archetypal classes of problems in machine learning: learning to predict from previously seen examples (supervised learning) and finding unknown yet interesting structured knowledge in raw data (unsupervised learning).

For the first category, we can highlight famous examples: telling dogs from pastries in pictures or detecting cancer from medical data (classification), or predicting the price of a house (regression). By seeing a lot of known examples, that is, data along with target labels, the model learns to predict the label for unseen data. It does so by trying to minimize prediction errors. Artificial neural networks are famously efficient at this, as seen in the current surge of interest in this field.

Chihuahua or muffin?

We usually distinguish two broad categories of supervised learning tasks, regression and classification. A regression model predicts a continuous value (house price), while a classification model will output a categorical label (chihuahua/muffin).

The second family of models is slightly more arcane. Pattern recognition, data clustering and anomaly detection are just a handful of the tasks that fall under the umbrella of unsupervised learning. The goal here is to untangle the structure, essence and relationships in the data without using supplied labels. Well-known examples are customer profiling (data clustering) and network intrusion detection or fraud detection (anomaly detection).

Below are some examples of problems solved using these techniques in the prioritization process.


One very effective signal of a vulnerability's importance is the general rate at which it is patched across all organizations. Arguably, a vulnerability that is always patched within two days of discovery is more important than one that is patched within 100 days, give or take the intrinsic variance in the measurements.

Predicting remediation time from a variety of input signals is a supervised regression task. We seek to predict the number of days (essentially a continuous value) between the day a vulnerability is discovered and the day it is fixed.

The model is fed a wide range of vulnerability, asset and organization features, and by looking at the remediation histories, we are able to predict the remediation time within a very narrow error margin. We then feed this prediction into the ranking algorithm as one signal of the vulnerability's importance, among many others.

This remediation time, when focused on efficient organizations, is a prime source of data for ranking vulnerabilities for all other organizations. That is why it is necessary to carefully understand, select, and tune not only the sources of information fed to the predictive model, but also its outputs and their interpretation.

A regression predictor gives us a probability distribution over real numbers, predicted time to remediation in this case
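
As a rough illustration of the regression task described above, here is a minimal sketch assuming a single input feature and ordinary least squares. The production model uses many more features and is not specified in this article; the `exposure` feature and every number below are invented for the example. The residual spread gives a crude idea of the uncertainty around each predicted remediation time.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_std(xs, ys, slope, intercept):
    """Spread of the errors around the fitted line (n - 2 degrees of freedom)."""
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Invented toy history: a hypothetical exposure score per vulnerability
# and the observed number of days before it was remediated.
exposure = [1.0, 2.0, 3.0, 4.0, 5.0]
days_to_fix = [100.0, 82.0, 58.0, 41.0, 19.0]

slope, intercept = fit_line(exposure, days_to_fix)
spread = residual_std(exposure, days_to_fix, slope, intercept)


def predict_days(score):
    """Predicted days to remediation for a given exposure score."""
    return slope * score + intercept


print(f"predicted days at exposure 3.5: {predict_days(3.5):.1f} (+/- {spread:.1f})")
```

A shorter predicted remediation time would then be passed to the ranking step as a sign of higher importance.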


False positives are an unavoidable shortcoming of all detection mechanisms. While it is possible to optimize the detection mechanisms to avoid false positives as much as possible, it is not always possible to eradicate the problem completely.

What we can do, on the other hand, is use meta-detection techniques to assess each individual vulnerability detection method, using contextual information as additional input.

By allowing users to label our vulnerability detections as false positives or as confirmed vulnerabilities, and by using other cues stemming from indirect confirmation that a vulnerability exists, we can later reassess our own detection reliability using a much more complete representation of the scanning process.

What we end up with, all done behind the scenes, is a supervised classification of every vulnerability as either confirmed or a false positive. This classification blends user labels (supervised learning) with prior expert knowledge (Bayesian inference) in a statistical ensemble machine learning model, much like a democratic process.

A confirmation classifier: input classes on the left, output classes on the right
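
One simple way to picture the blend of prior expert knowledge and user labels is a Beta-Binomial update, sketched below. This is a stand-in chosen for illustration, not the ensemble model used in the product: the function names, the prior counts and the 0.5 threshold are all assumptions. Expert knowledge is encoded as pseudo-counts, and every user label shifts the estimated probability that a detection method reports real vulnerabilities.

```python
def posterior_confirmed_probability(prior_confirmed, prior_false_positive,
                                    user_confirmed, user_false_positive):
    """Posterior mean of a Beta prior updated with user label counts.

    The prior counts act as pseudo-observations encoding expert belief;
    the user counts are the labels actually collected.
    """
    alpha = prior_confirmed + user_confirmed
    beta = prior_false_positive + user_false_positive
    return alpha / (alpha + beta)


def classify(probability, threshold=0.5):
    """Turn the probability into one of the two output classes."""
    return "confirmed" if probability >= threshold else "false positive"


# Hypothetical detection method: experts believe it is right about 80% of
# the time, expressed as 8 confirmed versus 2 false-positive pseudo-counts.
before_labels = posterior_confirmed_probability(8, 2, 0, 0)

# Users then label 3 findings as confirmed and 17 as false positives.
after_labels = posterior_confirmed_probability(8, 2, 3, 17)

print(classify(before_labels), round(before_labels, 3))
print(classify(after_labels), round(after_labels, 3))
```

With few labels the expert prior dominates; as labels accumulate, the users' collective judgment takes over.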


By working with our team of security experts and partners, we realized what truly distinguishes the best penetration testers from the average security practitioner. The answer is simple, yet very difficult to grasp: learned intuition. It is the ability to glance at a very large amount of data and quickly identify the one outstanding and interesting piece of information: the outlier, the most valuable target.

Outlier or anomaly detection is an archetypal task from the family of unsupervised learning methods. The goal is first to find a general representation of every item, a set of numerical characteristics or categorical features, and find a measure of similarity between them. Using this measurement, we can quickly select which ones are definitely separated from the herd.

The difficult part is to devise this set of representations in a format that is both effective and computable, yet also semantically meaningful.

This method is used extensively in the product and it effectively allows us to find interesting targets, such as oddly configured servers, workstations, websites and other network equipment.

Finding outliers according to a multitude of engineered factors simultaneously
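
The outlier detection idea described above (represent every item as numerical features, measure similarity, and pick the items far from the herd) can be sketched as follows. This is a minimal illustration, assuming a k-nearest-neighbour distance score over standardized features; the asset names, the two features and their values are invented, and the product's actual representations are not described in this article.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
            for row in rows]


def knn_outlier_scores(rows, k=2):
    """Score each item by its mean distance to its k nearest neighbours."""
    points = standardize(rows)
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q)
                           for j, q in enumerate(points) if j != i)
        scores.append(mean(distances[:k]))
    return scores


# Invented assets, described by (open ports, days since last patch).
assets = {
    "web-01": (3, 10),
    "web-02": (4, 12),
    "web-03": (3, 11),
    "web-04": (4, 9),
    "legacy-01": (25, 400),
}

names = list(assets)
scores = knn_outlier_scores(list(assets.values()))
most_unusual = max(zip(scores, names))[1]
print(most_unusual)
```

The higher the score, the further an asset sits from its closest peers, which makes it a candidate for closer inspection.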


Prioritization according to context offers an endless stream of opportunities for artificial intelligence. Inferring context is a notoriously difficult and imprecise process. The difficulty is first of all conceptual: what is the context? What is important for a specific organization, and across organizations? It is also technical: it is hard to gain access to a very wide and sensitive range of data and to correlate a multitude of sources in an efficient yet statistically significant way.

We believe that Delve Labs is on the right track, and we have an impressive amount of positive results. We also believe that the road to improvement is a never-ending path, but quite an exciting one.
