Machine Learning, Penetration Testing, and Your Most Hackable Assets
May 21, 2020
Today's goal is to present some simple ideas on how to use machine learning in cybersecurity. We're going to talk about the basic problem of data overflow in cybersecurity assessments. We'll also talk about the building blocks of machine learning, simplifying the ideas and understanding the essentials. Then we're going to talk about a specific family of machine learning methods called anomaly detection, and we're going to use it in a way that is quite unusual in the world of cybersecurity. Finally, we're going to talk about Batea, an internal project that we decided to open source, which really exploits the ideas presented at the beginning of the presentation.
Can We Automate the Behavior of a Pen Tester?
Let's delve into the problem. We'll start with a slightly metaphysical introduction to it. At first there was a network: the I.T. network. That is the population we want to study; we really want to understand the structure of the network. In order to obtain information about a network, we run Nmap scans, so the object we're going to work with throughout this presentation is really the output of an Nmap scan. Nmap scans a network and shows us the hosts and the OSes on the network, along with information about the ports and their services. The thing with Nmap scans is that even for a single host, you can quickly start to get a lot of information.
The screenshots I'm showing you right now are just examples of how, for one single host, you can quickly be overwhelmed by the amount of information. And this is for just one host. A typical enterprise network can easily have hundreds, thousands, even 50,000 hosts. So any network analyst using basic tools will quickly be overwhelmed by all this information.
What Assets on a Network Are Most Attractive to Attackers?
So what can we do about it? Well, the goal of any analyst in this position, and especially pen testers, for whom an Nmap scan is like a Swiss Army knife, is to quickly spot interesting assets or hosts in huge networks. Internally at Delve, we call these the golden nuggets, because the process is really similar to digging for gold and filtering the gold nuggets out from the dust. So why should we do this quickly?
Well, it's pretty obvious: we want to avoid wasting time on assets of limited interest, because the time of a security analyst is quite costly, and it's no secret that there is a scarcity of expertise.
So what makes an asset a gold nugget? Well, it's hard to define precisely, but it's a combination of factors: cash value to the organization, or the potential for further exploitation. Maybe some assets don't have high value themselves, but they host misconfigurations that can lead to the exploitation of a more valuable asset. Maybe an asset provides an easy access vector into an internal network. Maybe it just has high misconfiguration potential, or unusual configurations; an unusual configuration has a lot of value to an attacker because it gives insight into how a human decided to configure an asset. So how do we deal with big networks, and especially with big Nmap scan reports? Well, a typical analyst will scroll through the Nmap report and get a feel for the network, or maybe grep for some specific keywords. In any case, an experienced pen tester or an experienced hacker will quickly get a feel for the network and spot juicy assets very quickly. The problem is that it's very difficult to teach this expertise to more junior analysts. So our goal here is to understand what this intuition, developed with expertise, really is.
What is the Definition of Expertise in the Context of Machine Learning?
This is based on the work of Herbert Simon, a Nobel Prize winner in economics and a Turing Award winner in artificial intelligence, quite a great mind of the last century: a major component of expertise is the ability to recognize a very large number of specific relevant cues when they are present in any situation, and to do this very rapidly. Maybe that's a bit brainy of a definition, so let's take Google's: intuition is the ability to understand something immediately, without the need for conscious reasoning. And that's really what we want to achieve using machine learning methods. So intuition, in the context of a security assessment, is the ability to quickly recognize relevant cues in large loads of information. We really want to stress that this is the goal of the whole process. The problem is that intuition comes with expertise, expertise is expensive, and relevance is very context specific. In our security assessments, the relevant items are going to be different in different networks. For example, finding a Linux workstation in an IP range filled with Linux servers is very interesting, because there is no reason for it to be there. Maybe it was forgotten there, or maybe it's a misconfiguration. In the inverse situation, on an enterprise network you have ranges of workstations, which are just for people to work on. If a developer left behind a forgotten Linux server there, it probably indicates an error.
How Does Context Influence a Cyber Security Assessment?
So maybe it can be used for further exploitation. It's really context specific, and this is one of the things we want to keep in mind. In the context of a security assessment, the context of an asset is the actual versus typical use of this asset, the embedding, or the relationship with the other assets on the network; so the singularity of the asset is very, very relevant. What can machine learning do? Well, it's a known fact that machine learning is very efficient at automating tasks, especially on large amounts of information. It becomes good where humans become bad. And by working along with experts, which is what we did internally, we can help improve and transfer this intuition to other analysts. So by crafting intelligent machine learning systems, we are able to empower the work of less experienced analysts.
What is Machine Learning and How Does it Work?
As a little introduction to the way we work: there is a saying that if it's done in R, it's statistics; if it's done in Python, it's machine learning; and if it's done in PowerPoint, it's called artificial intelligence, because it's mostly for commercial blabla. But when it works, we call it wizardware. This is what we do internally; we have a wizardware engineering department. So let's delve into wizardware engineering, and we'll start with a very basic look at what machine learning is. First, a few quick definitions.
The population is what we want to study in the wild. It's the object of interest. But in practice we seldom have access to the whole population, so what we get is a sample, a subset of the population, and simply taking a subset introduces biases. Then data is a representation of the sample, and it's necessarily imperfect: the process of collecting data from the sample gives just a cross-section of the actual object. That data can be represented in many different ways, and a numerical representation is what is needed to apply machine learning. So what we want is to obtain a vector of numbers, or an array of numbers, that represents the data of interest, which is a representation of the sample, which is a subset of the population. Then, by optimizing over a task, what we get is a model, and this model gives us information, facts, and inferences about the population of interest. So it's really a full circle of concepts.
The population of interest in our case is the actual network: the physical computers and the wires and the CPUs and the ports. But the representation we use is the Nmap report. That representation is human-consumable, but in practice it's not usable in algorithms, so we need to refine it in order to get numbers. A slightly more intricate representation is the object form, which programmers are really used to; it represents exactly the same information as before, just in a more structured way. Then, if you want to apply machine learning, you move to something more like a data frame, a row/column representation. And even further, we want a numerical representation, which is just numbers: a point in a space, along with a notion of distance between points. I want to stress that this is what machine learning works on; the job of a machine learning engineer or researcher is to transform an object into a vector of numbers and then run the magic stuff on that.
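To make this concrete, here is a minimal sketch of a host record moving from a structured object to a flat numeric vector. The field names and the chosen features are illustrative, not Batea's actual schema:

```python
# One host, first as a structured object (what a parsed Nmap report
# might look like), then flattened into a numeric vector.
host = {
    "ip": "192.168.0.30",
    "os": "linux",
    "open_ports": [22, 80, 8080],
}

def to_vector(host):
    """Flatten a host record into a list of numbers."""
    last_octet = int(host["ip"].split(".")[-1])  # position in the subnet
    port_count = len(host["open_ports"])          # how exposed the host is
    is_linux = 1 if host["os"] == "linux" else 0  # categorical -> binary
    return [last_octet, port_count, is_linux]

print(to_vector(host))  # [30, 3, 1]
```

Each host becomes one row of the eventual matrix, and a notion of distance between rows is what the anomaly detection methods below operate on.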
What are Good Tasks for Supervised vs. Unsupervised Machine Learning?
And what is our task? Well, the task is gold-nuggetting. We want to dig for gold. (And no, this is not a Unix admin. It is a gold prospector.) In traditional software engineering, algorithms take data as input and you get regular output. But what we want is a model: a model takes data as input and outputs inferences and predictions. So what wizardware does is take data as input, take algorithms as input, and produce models as output, from which we can then draw insights and inferences. Let's break down the buzzwords of machine learning. When we have access to labeled data, a ground truth signal, we call it supervised learning. This is what triggered the modern machine learning revolution. If you have access to a whole lot of data with ground truth, like pictures of cats with a label telling whether there is a cat in the picture or not, you can do supervised learning.
The two main forms of supervised learning are regression, where you output a number, say the price of a house given some features about it like the square footage or the number of windows, and classification, where you output a category. Then there is the more arcane part of machine learning, which personally I find more interesting: unsupervised learning. The two archetypal tasks of unsupervised learning are data clustering, where you want to group together items of similar nature, and, as we're going to see, anomaly detection, which tries to separate out points which differ from the herd. We want to establish a baseline and then find the points that differ the most.
Just a quick example of the process of supervised learning. Let's say you have a population of muffins, of Baphomet (a representative of the Satanic church), and of Chihuahuas: real things which exist in the wild. Then you have a data representation, in this case a picture, which is a cross-section of the real thing. If you have labels telling what is in each picture, then you can train a machine learning model to recognize what is in the picture. The underlying principle of machine learning is that if the model is able to accomplish this task, then it has learned something about the actual population. But in practice, obtaining labeled data is notoriously difficult, sometimes just plain impossible, because you just don't have the ground truth.
So supervised learning is not really efficient in our case. It's really efficient when you are a gigantic data collector like Amazon, Facebook, or Google, but for most real-world problems it's not practical to label data. You have to be clever: use proxy variables, or use unsupervised or semi-supervised learning. So we're going to delve into unsupervised learning. The goal of unsupervised learning is to learn about the population's internal structure, really the relationships between the data points, whether through pattern recognition, clustering, or anomaly detection. What you do is make predictions on the individual data points given the overall structure of the data. You don't use labels. You don't use truth signals. You just use the relationships between the points, the space between the points.
What is Anomaly Detection in Machine Learning or AI?
A typical use case of anomaly detection, which is the technique we want to use, is engineering defect detection. In mass production, it's not really possible for humans to inspect every individual item, so manufacturers can use machine learning to try to find defects: objects which differ from the baseline. An example used in cybersecurity is intrusion detection, where you analyze network packets to establish a baseline of network traffic, and then use this baseline to flag abnormal behavior. Another classic use case is fraud detection.
If you ever got your credit card canceled because of a weird transaction while you were traveling, it was most probably an anomaly detection method at work. But you can also translate “anomaly” into words which are a bit more positive, because you don't only want defects: you can reframe the problem as finding outstanding elements, or desirable elements which differ from the undesirable crowd. And this is what we're trying to frame here.
So how does it work? Well, there are many different methods and algorithms to detect anomalies. The first method we're going to see is geometry based. Then we're going to talk about reconstruction-based methods, which are more information based. And then we're going to see isolation-based methods. My goal here is just to present some different methodologies and ideas on how to find anomalies.
First, geometry-based methods. The idea is to represent each element as a point in a space and define a notion of distance between every pair of points. Then you take the elements which are farthest from all of the others; these would be the outliers. Or you can draw a circle around each point, count the number of intersections, and the point with the lowest number of intersections is going to be the outlier. The problem is that in high dimensions, when you check many, many different features at the same time, the notion of distance is really not well behaved. It's a mathematical property; there's nothing you can do about it. If you want to analyze many different things at the same time on your data points, there is no way you can use geometry-based methods efficiently. It's good for simple problems or problems in low dimensions, but that's not what we're going to use, because we potentially want to analyze high-dimensional data.
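As an illustration of the geometric idea, here is a toy sketch where each point is scored by the distance to its k-th nearest neighbour, a simple distance-based outlier score. The function name and data are made up for the example:

```python
import numpy as np

def knn_outlier_scores(X, k=2):
    """Score each point by the distance to its k-th nearest neighbour:
    isolated points score high."""
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row ascending; column 0 is the distance to itself (0),
    # so the k-th neighbour sits at index k.
    return np.sort(dists, axis=1)[:, k]

# A tight cluster of four points plus one point far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = knn_outlier_scores(X)
print(scores.argmax())  # 4: the far-away point scores highest
```

This works nicely in two dimensions; the point of the paragraph above is that the same distances stop being informative once there are tens or hundreds of features.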
The next method that I want to explain is reconstruction based, which uses deep neural networks. This method is really what triggered the deep learning revolution of the last 10 years.
The diagram we have below is what we call an autoencoder.
The goal of an autoencoder is to reconstruct its input. In the middle of the neural network, we put a layer which is a lot smaller than the input and the output. The goal of this layer is to be an information bottleneck: by traversing the network, the network learns to compress the input and then reconstruct it. If the network is not able to reconstruct an input, it means it did not successfully learn the internal structure of that data point, so items with larger reconstruction errors would be outliers. This method is very efficient on high-dimensional data like pictures or natural language processing data. The problem is that you need lots and lots of data points to train it, in practice millions, and it's very costly. So it's not something that we're going to use.
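The reconstruction idea can be sketched in its simplest linear form: project the data onto a one-dimensional bottleneck with PCA (computed via an SVD, the linear analogue of an autoencoder) and score each point by its reconstruction error. This is a toy illustration of the scoring principle, not the deep network described above:

```python
import numpy as np

def reconstruction_errors(X, n_components=1):
    """Compress each point through an n_components-dimensional linear
    bottleneck and return the per-point reconstruction error."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions from the SVD of the centred data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]           # the bottleneck: keep one direction
    X_rec = Xc @ V.T @ V + mean     # compress, then reconstruct
    return np.linalg.norm(X - X_rec, axis=1)

# Points near the line y = x, plus one point far off that line.
X = np.array([[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.0], [1.0, 4.0]])
errors = reconstruction_errors(X)
print(errors.argmax())  # 4: the off-line point reconstructs worst
```

A deep autoencoder replaces the linear projection with learned nonlinear encode/decode layers, which is why it handles images and text but needs far more data.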
What is the Best AI Method for Anomaly Detection?
The method that I prefer is called isolation forest. First we're going to see what an isolation tree is, then the concept of an isolation forest. Again, we have our data represented as points. You can imagine these points to be servers or hosts in a network, and X and Y could be any kind of features that we want to observe. Just keep in mind that in practice we could have tens or hundreds of different dimensions to analyze at the same time, which is not possible to represent in just two. (The X and Y axes here are inverted, which is fine; it's the researcher's freedom to show the axes however they want.) The method is to split the dimensions randomly: we make splits, and we construct a tree on the side which represents the cuts.
Here y1 would be a random cut. Then we iteratively split until we isolate points; we can see on the right that we isolated the green and red points. We do this over and over again until all the points are isolated. The intuition is that if you take the path length from the root to a point, the points with the shortest path lengths should be the more isolated ones. In practice, since the cuts are random, there is absolutely no guarantee that short paths correspond to outliers and long paths to normal points. But if you build these isolation trees many, many times and take the average path length for each point, then by the law of large numbers you can be somewhat certain that this average path length is a good representation of how much of an outlier a point is.
This is very efficient both in dimensionality and in the number of training examples necessary, because there are mathematical shortcuts to take, and in practice we get much better running times and results than with the previous methods. One of the great things about this method is that it gives an ordered score for each point. So you not only get outliers as a binary outcome, but the degree to which each point is an outlier.
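As a sketch of how this looks in practice, scikit-learn ships an `IsolationForest` whose `score_samples` method returns exactly this kind of graded anomaly score (lower means more anomalous), so the points can be ranked rather than merely flagged. The toy features here, last IP octet and open-port count, are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 100 "normal" hosts: low last octets, few open ports.
rng = np.random.default_rng(0)
normal = np.column_stack([rng.integers(1, 50, 100),
                          rng.integers(1, 5, 100)])
# One unusual host: high octet, many open ports.
weird = np.array([[254, 40]])
X = np.vstack([normal, weird]).astype(float)

model = IsolationForest(n_estimators=200, random_state=0).fit(X)
scores = model.score_samples(X)  # lower score = more anomalous
print(scores.argmin())           # 100: the unusual host ranks first
```

Sorting the hosts by this score gives the ranking discussed below: a full ordering of assets by how unusual they are, not a binary in/out decision.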
How Do You Use the Isolation Forest Method to Identify Hackable Assets on a Network?
And this is a very desirable property, because what we want is to rank the assets in a network. So what do we do with this? Since we have potentially high-dimensional data, geometry-based methods are bad. Since we have a potentially low number of data points (and by low I mean even fifty thousand, which is low with respect to deep learning), reconstruction-based methods are out too. Isolation forest is good, so this is what we are going to use to rank our network assets. Let's see how we use this newly found knowledge to do it.
Let's come back to the problem. We want to rank enterprise I.T. assets in order of interest for a potential attacker, or for an analyst protecting against a potential attacker. We have a few problems that we defined earlier: all networks are different, “interesting” is a very contextual concept, and typical networks are quite large.
What we've seen is that unsupervised learning is very good at accomplishing contextual tasks, because it takes the whole dataset into account. As long as we define “interesting” as being an outlier, we can use anomaly detection. And all of this wizardware is big-data friendly, because it can treat large amounts of information very quickly, as opposed to humans. So if we take the network representation as vectors and use an isolation forest, we should be able to get a gold nugget model. This is what we did internally: in the Delve software, we have implemented it in the vulnerability prioritization pipeline. We have also developed an open source tool in order to share it with the community, and there is a live version that you can try on our Web site, which is trained on the data that we have internally.
So Batea is a piece of software that automates this process. We take an Nmap report, parse it into a Python object, and then into an internal representation. Then we do feature engineering, which we're going to talk about next, into a matrix form, and then we run the isolation forest algorithm on it. The output is gold nuggets.
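Here is a minimal sketch of the first stage of such a pipeline: parsing an Nmap XML report (the `nmap -oX` output format) into per-host records with Python's standard library. It pulls out only a couple of fields; Batea's real parser extracts many more:

```python
import xml.etree.ElementTree as ET

# A tiny inline Nmap XML report (one host, two open TCP ports).
NMAP_XML = """<nmaprun>
  <host>
    <address addr="192.168.0.30" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/></port>
      <port protocol="tcp" portid="8080"><state state="open"/></port>
    </ports>
  </host>
</nmaprun>"""

def parse_hosts(xml_text):
    """Extract (ip, open ports) records from an Nmap XML report."""
    hosts = []
    for host in ET.fromstring(xml_text).iter("host"):
        ip = host.find("address").get("addr")
        ports = [int(p.get("portid"))
                 for p in host.iter("port")
                 if p.find("state").get("state") == "open"]
        hosts.append({"ip": ip, "open_ports": ports})
    return hosts

print(parse_hosts(NMAP_XML))
# [{'ip': '192.168.0.30', 'open_ports': [22, 8080]}]
```

From these records, feature engineering turns each host into a row of the matrix that the isolation forest consumes.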
How Do You Build a Mathematical Representation of Assets on an Enterprise Network?
So we have an ordered list of assets, in order of interest. What do we want from the numerical representation? The numerical representation is the most critical part of a machine learning pipeline: we want to translate the Nmap scan into a vector of numbers. We need every feature to have high variance, meaning there is a large range of values it can take. We want features to have ordinality, so that there is an order. For example, with a categorical feature like the OS, it's not clear whether Linux is higher than Windows. But if we translate it into binary features, then "Linux" becomes a binary feature: it's 1 if the OS is Linux and 0 if it's not. You can have many of these features, and it makes sense to compare them together. And of course we want features to have relevance: we want each feature to indicate something about the asset, and in this case especially, to represent a point of interest for an attacker.
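The OS example above can be sketched as a tiny one-hot encoder, hand-rolled here for illustration (in practice a library encoder would do this):

```python
def one_hot(values, categories):
    """Turn each categorical value into a row of 0/1 indicator features,
    one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

oses = ["linux", "windows", "linux", "freebsd"]
cats = ["linux", "windows", "freebsd"]
print(one_hot(oses, cats))
# [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each column now has a meaningful order (0 versus 1), so comparing or splitting on "is Linux" makes sense where comparing arbitrary OS labels did not.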
So what makes an asset a gold nugget? As we said earlier, gold nuggets have value to the organization, so we want features that represent value and the potential for further exploitation. We want features that represent how much access an asset gives to the network, along with misconfigurations and just unusual configurations. We're going to see some of these features. For the basic features, we look at the IP scattering (the distribution of IP addresses), the port count, specific ports, timing information, and host names. As an example, the training data for the model we have live in production comes from a diversity of networks and contains more than a thousand assets.
What we see here is the distribution of port count with respect to the last IP octet. So if the IP address is 192.168.0.30, it's the 30 that's represented on the x axis. What we see from this graph is that most of the assets are close to zero, which is an indication of usual behavior, so the unusual assets are the ones closer to 255. And of course, most assets have a low number of open ports. If I were to observe only these two features, I would definitely go into the upper right corner and see what is there, because those are the most unusual assets. But as we said earlier, we have many, many different features that we want to observe at the same time. So we can go and look at slightly more intricate features.
Maybe you want to look for specific ports, because some ports have more value to the organization: Windows domain control, Linux administration, and database ports are strong indicators of value, as are some specific kinds of servers like Citrix, and SSL certificate information. Some remote admin services like RDP are very popular right now. There are also name services, and the complexity of host names: if you have Dev1, Dev2, Dev3, and then XYZ, then XYZ should stand out.
Just as an example, we have the distributions of services for both Linux and Windows, and we see that some of them are very common, so maybe not so interesting, but some of them are not. If in your network you have some domain, Kerberos, Telnet, or FTP services, you should look at them in particular; they stand out as features, and those are taken into account in Batea. Then maybe you want some more context-based features, say the entropy of a port or the entropy of a hostname. Entropy is a measure of how unusual a piece of information is. For example, the entropy here would indicate that a port assignment is unusual: if you have an HTTP service on port 80 versus the same HTTP service on port 8080, port 8080 is more unusual, so the entropy is higher for port 8080.
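The port-entropy idea can be sketched with self-information: the rarer a port is across the network, the more bits of information it carries. The frequencies below are made up for the example:

```python
import math
from collections import Counter

# Toy network: port 80 appears on 90 hosts, port 8080 on only 10.
network_ports = [80] * 90 + [8080] * 10
counts = Counter(network_ports)
total = sum(counts.values())

def self_information(port):
    """-log2 of a port's frequency: rare ports carry more information."""
    return -math.log2(counts[port] / total)

print(round(self_information(80), 2))    # 0.15 bits: common, expected
print(round(self_information(8080), 2))  # 3.32 bits: rare, stands out
```

Summing or averaging such terms over a host's ports gives a per-host entropy-style score, so a host whose services sit on unusual ports rises in the ranking.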
Unusual port numbers and banners, or just unusual MAC addresses, could also indicate shadow IT or misconfigured IT assets. For example, here each point represents an asset, with the number of named services on the X axis and the port entropy on the Y axis. We can see that, among assets which have the same number of services, those with a higher port entropy indicate a higher misconfiguration potential, because for the same number of ports they have a more unusual distribution of ports. This is very interesting to an attacker, because it indicates an unusual way of setting things up. Then, if we go into more advanced feature engineering, we can use Batea in conjunction with external information like NSE scripts, vulnerability scans, or NLP banner parsing.
By parsing banners using natural language processing methods, we can get really sophisticated insights. For example, what we do here at Delve is scan for vulnerabilities, and our goal is to prioritize those vulnerabilities. So we ran vulnerability scans on the same assets that we used to train Batea; the information on the x axis is the CVSS score of the worst, most critical vulnerability on each asset. There's a lot of insight in this slide, because what we see is that there is really not a lot of granularity in the CVSS score. There is only a very small number of possible scores, and yet we want to prioritize within those scores.
What Assets Should You Prioritize for Vulnerability Management?
So, using the Batea anomaly score, for any given CVSS score you should focus your attention on the unusual assets. Here is the danger zone.
These are assets with critical vulnerabilities and high misconfiguration potential. If you only have a limited time to spend fixing vulnerabilities, these are the points you should be fixing. This is a prime example of how Batea is used inside the product: inside the Delve product, it's just one of many factors taken into account. But you can use Batea in your own environment and correlate it with your own data. The code of Batea is completely open source; if you use the code from GitHub, you have to train your own models, and you can retrain them and use them across various assessments. You can also contribute to the project. And if you want to try it live, we have a trained model that you can use on the website.