
A Predictive Model for the Publication of Exploits using Vulnerability Topics

 

Introduction

Fixing software vulnerabilities in a network is a key aspect of cybersecurity risk management. Security vulnerabilities with publicly available exploits are far more concerning and should be addressed quickly once an exploit is published. But passively waiting for an exploit to be published is a risky prioritization strategy: unfortunately, we do not know in advance which vulnerabilities will have an exploit published.

But what if we did?

This blog post details the workings of one of Delve’s 40+ prioritization factors, the Exploit Publication Predictor (EPP), which uses supervised machine learning to build a predictive model for the publication of vulnerability exploits. Every day, the top predictions for the past 30 days are freely available on Delve's Vulnerability Threat Intelligence Feed.

The EPP uses historical data from OSINT sources (exploit and code databases), along with various properties of each vulnerability, to build a detection model using supervised machine learning. More importantly, this approach incorporates a novel set of features based on a topic model built from the descriptions of all disclosed vulnerabilities. In our previous blog post, we built a topic model using Latent Dirichlet Allocation (LDA), which provided us with a list of topics and their importance for every vulnerability, as a fixed-length vector. With this numerical representation of the underlying concepts associated with vulnerabilities, we can improve on previous studies [1,2,3,4,5,6] by enhancing the feature selection when predicting the publication of exploits.
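As a rough illustration of how descriptions become fixed-length topic vectors, here is a minimal sketch using scikit-learn's LatentDirichletAllocation; the actual VTS topic model described in our previous post may use a different implementation, preprocessing, and number of topics.

```python
# Minimal sketch: turning vulnerability descriptions into fixed-length
# topic vectors (illustrative only; not the production VTS pipeline).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

descriptions = [
    "Buffer overflow in the HTTP parser allows remote code execution.",
    "SQL injection in the login form allows authentication bypass.",
]

# Bag-of-words representation of the descriptions
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

# Fit an LDA model with a fixed number of topics
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(counts)

# Each description becomes a fixed-length vector of topic weights,
# usable directly as machine learning features.
topic_features = lda.transform(counts)  # shape: (n_descriptions, 10)
```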

In short, the EPP uses features from vulnerabilities and features built from a topic mapping of descriptions to build an exploit publication detection model, and obtains good performance.

Methodology

 

Dataset

Our training dataset consists of the NIST National Vulnerability Database (NVD), combined with an aggregation of available OSINT exploit sources, such as Exploit-DB and Packetstorm, as well as the dark web. These sources provide the labels for the vulnerabilities.

The model is trained on all vulnerabilities with a CVE number published since 2015. We filter out exploits published before their corresponding vulnerability, since we specifically wish to predict exploits published after the CVE itself. After filtering, 2,995 vulnerabilities with an associated exploit remain. In contrast, there are tens of thousands of vulnerabilities without an associated exploit. We will see below how to deal with such an imbalanced dataset.
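As an illustration of this labeling step, here is a minimal sketch using hypothetical pandas DataFrames; the real NVD and exploit-feed fields differ.

```python
# Minimal labeling sketch (hypothetical field names; real feeds differ).
import pandas as pd

cves = pd.DataFrame({
    "cve_id": ["CVE-2019-0001", "CVE-2019-0002"],
    "published": pd.to_datetime(["2019-01-10", "2019-02-03"]),
})
exploits = pd.DataFrame({
    "cve_id": ["CVE-2019-0001"],
    "exploit_published": pd.to_datetime(["2019-03-01"]),
})

merged = cves.merge(exploits, on="cve_id", how="left")

# Label 1 only when an exploit was published after the CVE itself;
# missing or earlier exploit dates count as "no exploit".
merged["label"] = (merged["exploit_published"] > merged["published"]).astype(int)
```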

Feature Selection

In our previous blog post detailing the Vulnerability Trend Score (VTS), a topic model was built from the NVD descriptions using a Natural Language Processing (NLP) approach called LDA.

The same topic model is reused here: its output, i.e., the numerical representation of the underlying concepts in the descriptions, is used as features in the training phase. Alongside these, a number of features are extracted from each vulnerability, as sketched in the example after this list:

  • The character length of the vulnerability description
  • The number of references available for the vulnerability
  • The number of software configurations affected by this vulnerability
  • The number of vulnerable products associated with this vulnerability
  • Whether this vulnerability has a Bugtraq ID, i.e., whether it has been discussed on the Bugtraq mailing list
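Here is a minimal sketch of how these non-topic features could be extracted. The record layout is hypothetical; the real NVD JSON schema is more deeply nested, and the Bugtraq check is a stand-in.

```python
# Minimal sketch of the non-topic features (hypothetical record layout).
def extract_features(cve):
    return {
        "description_length": len(cve["description"]),
        "n_references": len(cve["references"]),
        "n_configurations": len(cve["configurations"]),
        "n_vulnerable_products": len(cve["vulnerable_products"]),
        # Hypothetical convention: a reference tagged "BID" means a Bugtraq ID exists.
        "has_bugtraq_id": int(any(r["source"] == "BID" for r in cve["references"])),
    }

example = {
    "description": "Buffer overflow in the HTTP parser allows remote code execution.",
    "references": [{"source": "BID"}, {"source": "MISC"}],
    "configurations": ["cpe:/a:vendor:product:1.0"],
    "vulnerable_products": ["vendor:product"],
}
print(extract_features(example))
```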

 

Machine Learning

The model then learns the relationship between the various features of the vulnerability and the likely publication of an exploit for it.  The steps are presented in the diagram below.

 

First, the properties of the vulnerabilities are transformed into numerical features. Second, each vulnerability is labeled with 1 if an exploit exists and 0 otherwise. These pairs are then passed to the machine learning model, which is trained to identify the label given the features. Once trained, the model is used to make new predictions: it is fed newly published vulnerabilities for which no exploit exists yet, and it outputs a prediction, i.e., a label indicating whether an exploit will be published in the future, along with a probability.

Hence, for the purposes of this article, a model is any algorithm that takes the aforementioned vulnerability features as input and outputs a prediction.
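As a concrete illustration, here is a minimal train-and-predict sketch using scikit-learn's GradientBoostingClassifier as a stand-in for the EPP model, with a synthetic feature matrix in place of the real topic and vulnerability features.

```python
# Minimal train/predict sketch (synthetic data for illustration).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Illustrative feature matrix: topic weights plus the vulnerability features above.
X_train = rng.random((1000, 15))
y_train = rng.integers(0, 2, 1000)  # 1 = exploit published, 0 = not

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# For newly published vulnerabilities, the model outputs both a label
# and the probability that an exploit will be published.
X_new = rng.random((5, 15))
labels = model.predict(X_new)
probabilities = model.predict_proba(X_new)[:, 1]
print(list(zip(labels, probabilities.round(3))))
```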

Model Selection

A number of algorithms were tested in order to compare their performance: Gradient Boosting Trees (GBT), Random Forests, Logistic Regression, and Support Vector Machines (SVM).

In order to better validate the trained model, we employ 10-fold cross-validation, where the dataset is split into 10 sets (folds). The model is trained 10 times; at each iteration, nine folds are used to train the model, while the remaining fold is used as a test set to measure how many correct predictions are made, and the fold used for testing changes every iteration. This technique gives a better assessment of the training phase, since all data is eventually used for training, and of how well the predictor handles unseen data, since every data point is eventually used for testing.
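A minimal sketch of such a comparison, assuming scikit-learn implementations of the four algorithms and synthetic data in place of the real feature matrix:

```python
# Comparing candidate algorithms with 10-fold cross-validation (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 15))
y = rng.integers(0, 2, 500)

candidates = {
    "GBT": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

for name, clf in candidates.items():
    # Each classifier is trained and tested 10 times, once per fold.
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```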

Model Evaluation

To evaluate the trained model, we use accuracy, precision, recall and F1-score as metrics. These metrics are computed using the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

In this case, a positive instance is a vulnerability predicted to have an exploit published, while a negative instance is a vulnerability predicted not to have an exploit published.

The accuracy is the number of correctly classified instances divided by the total number of instances. While this metric is a good indicator of a classifier’s general performance, we are more interested in minimizing the number of false negatives (vulnerabilities incorrectly classified as having no exploit), even if it results in a slightly higher false positive rate (vulnerabilities incorrectly classified as having an exploit). Take a second, think about it!

So we also compute the precision and recall: precision is the proportion of predicted positive instances that truly have a published exploit, while recall is the proportion of vulnerabilities with a published exploit that the model correctly identifies. These two metrics are more indicative of how well the classifier identifies vulnerabilities that will actually have an exploit published. Finally, we also include the F1-score, the harmonic mean of precision and recall.
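In terms of these counts, the metrics are defined as:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```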

Since our dataset includes many more vulnerabilities without a published exploit than with one, we end up with an imbalanced dataset. With such a dataset, a predictive model will be biased towards the majority class (vulnerabilities without an exploit).

For example, if we have a dataset with a majority of instances labeled with the negative class and a minority labeled with the positive class, a predictive model can correctly predict negative instances while predicting positive instances poorly and still obtain 80% accuracy, as shown in the left diagram. With 80 negative and 20 positive instances, for instance, even a model that labels every instance as negative reaches 80% accuracy without ever identifying a single positive instance.

 

Fortunately, the recall measure exposes the model’s weakness at identifying vulnerabilities with a published exploit. In this work, we are interested in maximizing the effectiveness of the model at predicting whether an exploit will be published for a vulnerability. Following the example in the image, we wish to correctly predict positive instances (green circles), even at the risk of misclassifying more negative instances (blue circles) as positive. The right diagram depicts what we aim to achieve: while both diagrams obtain 80% accuracy, the left diagram predicts positive instances poorly, while the right diagram predicts them well, at the cost of additional false positives. We therefore aim to maximize recall.

In order to limit the bias introduced by an imbalanced dataset, we randomly undersample the vulnerabilities without an exploit, leaving a balanced dataset on which to train the model.
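A minimal undersampling sketch with NumPy follows; libraries such as imbalanced-learn provide an equivalent, more featureful RandomUnderSampler.

```python
# Random undersampling of the majority class (illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset: roughly 5% of vulnerabilities have an exploit.
X = rng.random((10000, 15))
y = (rng.random(10000) < 0.05).astype(int)

pos_idx = np.flatnonzero(y == 1)  # vulnerabilities with a published exploit
neg_idx = np.flatnonzero(y == 0)  # vulnerabilities without one

# Randomly keep only as many "no exploit" vulnerabilities as there are
# "exploit" ones, yielding a balanced training set.
neg_sample = rng.choice(neg_idx, size=pos_idx.size, replace=False)
balanced_idx = np.concatenate([pos_idx, neg_sample])
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
```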

Results

The model was trained on the filtered and balanced dataset of vulnerabilities, and obtained the following results:

GBT (Gradient Boosting Tree) obtained the best results and was chosen for the implementation of the EPP, since we wish to maximize the predictive power of the model over its interpretability. 

To show the importance of the chosen metrics, we also trained our GBT model on the initial imbalanced dataset, and obtained the following results:

As can be observed, although the model obtains a high accuracy, it does not predict positive instances well, as shown by the low recall.

Since we are interested in maximizing the positive predictions - identifying vulnerabilities which will have an exploit published - we generate the Precision-Recall (PR) curve for both the balanced and imbalanced datasets. The PR curve shows the trade-off between precision and recall across decision thresholds: the larger the area under the curve, the better the model.

As can be observed, the model trained on the balanced dataset performs much better than the one trained on the imbalanced dataset.
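For reference, here is a minimal sketch of how such a PR curve can be generated with scikit-learn, again on synthetic data rather than the real feature matrix.

```python
# Generating a Precision-Recall curve from predicted probabilities (illustrative).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 15))
y = rng.integers(0, 2, 2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
probas = model.predict_proba(X_test)[:, 1]

# Precision and recall at every probability threshold
precision, recall, _ = precision_recall_curve(y_test, probas)
print(f"Area under the PR curve: {auc(recall, precision):.3f}")

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```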

Once this model has been trained, it is used in a live implementation where it predicts, for all vulnerabilities published in the last 30 days, which ones have a high probability of an exploit being published. These top vulnerabilities are showcased on Delve's Vulnerability Threat Intelligence Feed, along with the vulnerabilities with the highest VTS. The feed is updated daily.

Discussion

 

Features of a machine learning model must be chosen carefully

Our initial prediction model included a number of time-sensitive features:

  • The published date of the vulnerability
  • The date of its last modification
  • The difference between the two.

While this version of the model performed better in the training phase, obtaining around 88% accuracy, precision, recall and F1-score, it performed poorly in the live implementation, predicting no instances of the positive class at all. In short, it predicted that no exploits would be published. We investigated the model and extracted the five features with the largest impact on the GBT classifier (scored from 0 to 1):

An interesting thing to note here is that three of the five top features are time-sensitive. These three features were skewing the model towards older vulnerabilities: newly published vulnerabilities rarely have a modification date, and their publication date is always recent. In the end, including time-sensitive features was not relevant to our task of predicting the publication of exploits specifically for newly published vulnerabilities.
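For illustration, the feature importances of a fitted GBT model can be inspected as in the following sketch; the feature names here are hypothetical stand-ins for the EPP's real feature set.

```python
# Inspecting which features drive a fitted GBT model (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

feature_names = [
    "published_date", "last_modified_date", "days_between_dates",
    "description_length", "n_references",
]
rng = np.random.default_rng(0)
X = rng.random((500, len(feature_names)))
y = rng.integers(0, 2, 500)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher values mean a larger
# contribution to the trees' split decisions.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```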

Once these three features were removed, the model was able to predict the publication of exploits in the live implementation, albeit with a slightly lower test accuracy.

Limitations

While our model performs well, its predictive power depends on the data sources used. Unfortunately, these exploit sources are incomplete and not always easily accessible. We aggregated a number of sources, both OSINT and from the dark web, but did not cover all possible ones. When the model is trained on a subset of sources, it might identify a vulnerability as likely to have an exploit published when, in reality, one already exists outside the sources in the dataset.

Moreover, to better characterize the risk to an organization’s network, this methodology should use actual exploitation data for the labeling phase, such as a list of vulnerabilities exploited in real attacks. Unfortunately, such a dataset is difficult to find as a freely available source online and is therefore out of scope for our purposes.

Conclusion

This blog post presented the EPP, a supervised machine learning model which predicts the probability that recent vulnerabilities will have an exploit published, using topics extracted from their descriptions and features derived from their properties. By combining topics representing the underlying concepts associated with a vulnerability with its various properties, the predictive model was able to perform well, even given the highly imbalanced nature of the training data.

References

[1] C. Sabottke, O. Suciu, and T. Dumitras, “Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits,” in Proceedings of the 24th USENIX Security Symposium, 2015.

[2] M. Bozorgi, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10), Washington, DC, USA, 2010.

[3] M. Edkrantz and A. Said, “Predicting Cyber Vulnerability Exploits with Machine Learning,” in Proceedings of the Scandinavian Conference on Artificial Intelligence (SCAI), 2015.

[4] B. L. Bullough et al., “Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-Source Data,” in Proceedings of the 3rd ACM International Workshop on Security and Privacy Analytics (IWSPA), 2017.

[5] Y. Fang et al., “FastEmbed: Predicting Vulnerability Exploitation Possibility Based on Ensemble Machine Learning Algorithm,” PLoS ONE, vol. 15, no. 2, e0228439, 2020.

[6] J. Jacobs et al., “Improving Vulnerability Remediation Through Better Exploit Prediction,” in Proceedings of the 2019 Workshop on the Economics of Information Security (WEIS), 2019.

 
