Automating Threat Intel with Machine Learning: Extracting the Underlying Concepts from Underground Discussions and OSINT
July 2, 2020
In cybersecurity, news regarding new vulnerabilities appears continuously every day through an array of various sources, be it news networks, social networks, blogs, security advisories, etc. These fall under the umbrella of open-source intelligence (OSINT). Yet not all vulnerabilities generate the same level of interest and length of discussion.
As such, one interesting aspect of vulnerability prioritization is to consider what types of vulnerabilities are currently trending in these OSINT sources, in order to establish the types of vulnerabilities an attacker might favor. For example, if discussions in public sources and underground networks currently mention reflected XSS attacks on particular frameworks more often than usual, then other attackers reading these discussions could be tempted to attack other similar XSS vulnerabilities, or find similar inflection points in other frameworks. Attackers, just like cybersecurity experts, are very much influenced by trends in exploitation methods.
This blogpost explains in part how Delve's Vulnerability Threat Intelligence Feed identifies the top trending vulnerabilities on a daily basis.
The goal of the approach presented in this blog is to identify the patterns or trends underlying online discussions. These are what we aim to model in order to generalize which vulnerabilities might receive more attention from attackers, regardless of the vulnerability’s direct mentions (e.g., its CVE number) in online discussions. The goal is the identification of the underlying concepts associated with online cybersecurity discussions, first to assess the importance of existing vulnerabilities, and second to use this representation as input to future predictive models. To extract this information, a natural language approach is employed, which uses machine learning, and more specifically, topic modeling.
In short, it is possible to build a representation of a vulnerability that captures both its intrinsic features as well as its relationship with the external world.
This framework aims to:
- Create a representation that encodes the semantic meaning of vulnerabilities, i.e., topics from the textual descriptions of all disclosed vulnerabilities to date.
- Crawl online infosec discussions and apply them to a topic model, so as to obtain a weighted list of topics, which we call trends.
- Combine the previously-built trends to their respective topics in a vulnerability, in order to indicate its importance to current infosec discussions.
- Use these trends and topics in future machine learning projects as input features
In order to accomplish these goals, we can use topic modeling, which is an unsupervised machine learning technique where a number of topics are automatically extracted from a text corpus, along with each one’s probability to appear in any of the texts. A topic consists of one or more weighted words, where a higher weight indicates a higher importance to the topic. A topic represents an underlying concept extracted from the text through statistical analysis.
For example, there are a great number of disclosed buffer overflow vulnerabilities out there, each with a different but similar description. This technique can capture a generic topic representing the words associated with buffer overflow vulnerabilities, looking somewhat like this:
A prominent topic modeling technique is Latent Dirichlet Allocation (LDA), which is a generative model that computes the probability of a word being associated with a topic. LDA has been proven effective in previous research and applications[2,3], and is one of the most common topic modeling methods.
Creating the Topic Model
First, let’s extract the textual definition of all vulnerabilities in the National Vulnerability Database (NVD) of the NIST, each associated with a CVE number. These texts are then processed, as depicted by Figure 2. In step 1, URLs and common words are removed, the text is split into tokens (single words), and the words are lemmatized (Stanford) to group together different inflections of a word.
Figure 2: Filtering of a description’s words, and splitting into tokens
These lists of tokens can then be transformed into a numerical representation using the bag-of-words model, as depicted in Figure 3. In this new representation, each token is identified by an ID and its number of occurrences, where the ID is chosen based on its first appearance in the definitions. This transformation is necessary since LDA requires numerical inputs.
Figure 3: Transformation of a description’s tokens into a numerical representation
These are then used as input to LDA, as shown in Figure 4, which generates a set of topics identified by a series of words and an importance factor between 0 and 1. This model is implemented using online learning, where new published vulnerabilities are fed to the model daily and the model is updated incrementally.
Figure 4: Association of words in numerical form to their topics through LDA
Building Trends from Online Discussions
Once we have learned a suitable model for vulnerabilities, one that understands the specifics of language in the description of vulnerabilities, we gather online discussions of vulnerabilities and exploitations from various public sources. While a number of previous works have extracted information from online discussions in order to build models[4,5,6,7], none applied them to a topic model in the same way this blog details. Now, let’s process text posts from discussions in the same way as specified in Figure 2 and Figure 3, and apply them to the existing topic model.
This results in a vector of weights for each text post, where each topic’s importance to this discussion is expressed by a number between 0 and 1, named a weight. By averaging these vectors together across all posts, we obtain a final single vector denoting the relative importance of each of our topics in what we observe in online discussions. We call those “trends”. In the end, by extracting the topics with the highest weights, we obtain the set of current trends. Figure 5 describes the creation of trends:
Figure 5: The creation of a trends score from online discussions
Finally, we look back at the vulnerabilities from the NIST database which were used to build the initial topic model, and identify the ones with topics matching the newly established trends. This is done by applying the dot product to each vulnerability topic vector and the trends vector, or in slightly more human language, by computing the product of each of a vulnerability’s topic weights and its corresponding trend weights, and summing them. This gives us a final Vulnerability Trends Score (VTS).
The higher the VTS is, the more closely this vulnerability matches with online discussions.
Figure 6: The dot product between the vulnerability topics and the trends
The framework can be summarized in the figure below. First, the topic model is trained with vulnerability descriptions, and outputs a number of topics:
Second, online discussions are run through the same topic model to establish which of the previously obtained topics are most important for these discussions:
Figure 7: Overview of the approach
Example of a Use Case
Let’s see how this works in practice: 30 topics (why 30? See the Appendix below) were identified on all of NIST’s vulnerabilities, each identifying a concept observed in vulnerability descriptions. Then, data was gathered from online discussions, such as these tweets from Twitter :
- We have created the first write-up on the recent Apache Tomcat RCE (CVE-2020-9484), read here: [url] #infosec #javahacking #exploit #tomcat #deserialization #webapplicationhacking #vulnerability
- Security Bulletin: Open Source Apache Tomcat vulnerabilities affect IBM Tivoli Application Dependency Discovery Manager (CVE-2020-1938) [url] Multiple vulnerabilities in Open Source Apache Tomcat reported by The Apache Software Foundation [url]
- Apache Tomcat AJP Vulnerability (CNVD-2020-10487/CVE-2020-1938) (NIST) .This vulnerability was discovered by a security researcher of Chaitin Tech . You can read any webapps files or include a file to RCE .JUST A POC-GIF with no DETAILS Tomcat has fix this vulnerability ,UPDATE! [url]"
When applying the topic model to these tweets, a weight for every topic of the model is obtained, relative to these specific tweets. This is what we identify as trends. Among others, these three topics, in a human-interpretable form, came out as important:
- Topic A, related to file upload vulnerabilities
- Topic B, related to applications having default settings and making HTTP requests
- Topic D, related to code execution vulnerabilities
Finally, we can use the trends and the vulnerabilities’ topics to obtain a VTS for each vulnerability, using the method described previously.
The top-ranked vulnerability comes out as CVE-2020-1938, which identifies an Apache Tomcat vulnerability, code-named Ghostcat, enabling attackers to upload files through a default connection, and potentially obtain a remote code execution. The following figure illustrates the evolution of this specific vulnerability over the month of June 2020:
Figure 8: The evolution of CVE-2020-1938’s VTS score over the month of June
This vulnerability appearing as the most important is expected, given that many tweets mention Tomcat vulnerabilities, and some directly speak of this vulnerability. More interestingly, other vulnerabilities matching with topics A, B and D will also have a high trending score, while not necessarily being the most mentioned online.
With a model trained in early June 2020, the top trending topics obtained are shown in Figure 9 as word clouds. The model is trained everyday to observe the evolution of trends.
Figure 9: Word clouds detailing four top trending topics
As you can see, topic A is centered around file upload and XSS vulnerabilities, while topic B is more general and is related to vulnerabilities involving default settings and HTTP requests. Topic C is related to vulnerabilities that include improper validation of user input. The last topic, topic D, is about code execution and android vulnerabilities.
We also tracked the weight of our top trending topics over the beginning of June. Since the topic model is run in an online mode (i.e., updated with new vulnerabilities as they appear), it retains its topic distribution, and the topics evolve over time. We observed some gradual changes over the previously mentioned trends, where we can see that trend B and trend C are on the decline, while trend A seems relatively constant and trend D is gaining popularity in online discussions.
Figure 10: The evolution of our top four trends through time.
The top trending vulnerabilities identified through this approach are showcased on Delve's CVE Threat Intelligence Feed and updated daily.
What was obtained from the method presented in this blog is a semantic mapping of vulnerabilities, and a single metric, which we named the Vulnerability Trend Score (VTS). These are used as numerical representations, linking online discussions and the semantic of a description to a specific vulnerability. These new features bridge the gap between the underlying concepts behind a vulnerability and actual numerical features which can be used by a machine learning model.
Notably, the aforementioned features can also be used to predict if a recently published vulnerability is likely to have an exploit published in the coming days, weeks or months. The vulnerabilities with the highest probability of having an exploit published are also shown on Delve's CVE Threat Intelligence Feed. This approach will be detailed in an upcoming blog post.
LDA requires a fixed number of topics as a parameter. In order to validate this parameter choice, multiple values were tested and evaluated through the coherence of topics: a highly coherent topic will have top scoring words that are similar between themselves and different from top words of other topics. Overall, this metric serves as a way to benchmark models in order to compare their effectiveness. The test included n numbers of topics from 5 to 100, and the best results were found with n=30:
Figure 11: Coherence test for identifying the number of topics providing the best clustering
Through this approach, the highest trending vulnerabilities are indirectly linked to online discussions through the use of topics. In order to validate that the vulnerabilities with the highest trending scores are in fact directly related to online discussions, an hypothesis test was conducted to assert that the descriptions of trending vulnerabilities are more similar to online discussions than a random sample of descriptions are to the same online discussions. To compare the similarity of the texts, we used term frequency-inverse document frequency (TF-IDF) , which is a commonly used similarity measure to identify how relevant each word is to a text.
We sampled the 1000 vulnerabilities with the highest VTS, extracted their descriptions, and compared each one with the online discussions using TF-IDF. We then sampled 1000 random descriptions from all vulnerabilities in the NIST database, and compared them as well to the online discussions. With this, we obtained a distribution of similarity measures for the trending vulnerabilities, and another distribution of similarity measures for the random vulnerabilities. We then conducted a permutation test on these two distributions, and confirmed that the similarity scores of the trending vulnerabilities are significantly higher than the similarity scores of the random sample; we reject the null hypothesis with a p-value approaching 0.000.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." Journal of machine Learning research 3.Jan (2003): 993-1022.
 Lee, Suchul, et al. "LARGen: automatic signature generation for Malwares using latent Dirichlet allocation." IEEE Transactions on Dependable and Secure Computing 15.5 (2016): 771-783.
 Liu, Linqing, et al. "Detecting" Smart" Spammers On Social Network: A Topic Model Approach." arXiv preprint arXiv:1604.08504 (2016).
 Mittal, Sudip, et al. "Cybertwitter: Using twitter to generate alerts for cybersecurity threats and vulnerabilities." 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2016.
 Chen, Haipeng, et al. "Using twitter to predict when vulnerabilities will be exploited." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019.
 Hoffman, Matthew, Francis R. Bach, and David M. Blei. "Online learning for latent dirichlet allocation." advances in neural information processing systems. 2010.
 Shrestha, Prasha, et al. "Multiple social platforms reveal actionable signals for software vulnerability awareness: A study of GitHub, Twitter and Reddit." PLoS One 15.3 (2020): e0230250.
 Röder, Michael, Andreas Both, and Alexander Hinneburg. "Exploring the space of topic coherence measures." Proceedings of the eighth ACM international conference on Web search and data mining. 2015.
 Ramos, Juan. "Using tf-idf to determine word relevance in document queries." Proceedings of the first instructional conference on machine learning. Vol. 242. 2003.