Working with data - Lessons from the field
December 11, 2019
Part 1 - Problem finders are the best data scientists
This is the first article in a series of blog posts highlighting lessons learned in the field by doing “real and dirty” data science here at Delve - things nobody teaches you in school, let alone in online classes. In a nutshell, this article is an overview of the right mindset for doing proper data science. Subsequent posts will dig deeper into specific methods, attitudes and tricks from the field required to be a successful data scientist.
Key takeaway: The ability to frame and reframe problems, and explore data with a constant state of suspicion about uncovered relationships is what makes it possible to create real value and insights.
Bad problem framing is the root of all evil. Some might be outraged by this restatement of the famous quote from Donald Knuth, but I stand my ground. Often, the first problem you are asked to solve is ill-defined, so it is the data scientist’s responsibility to refine it. The people who direct you to solve a problem most likely have a very biased way of framing it. They have been living with - or worse yet, only heard about - some kind of pain, maybe for a long time, and they express it in their own jargon. Then they hear about the magical doings of data science and machine learning... Wizardware to the rescue!
You, the data scientist, probably don’t understand their problem well enough, and they probably don’t know how to translate it into a statistically defined, data-driven formulation. That is your job, and you shall do it with diligence. Dig deeper. Question assumptions, and never, ever feel ashamed to ask domain-specific questions.
Framing the problem correctly is the most fundamental step of any data science project. It should be done iteratively, continuously, and through multiple conversations between you and the requester. By asking different but related questions about the situation you are trying to address, you might stumble onto something very interesting. Once you know exactly which problem to solve, and that it is formally well-defined, the question of how to query the data will come more naturally. Beware that you might not be able to find explicit answers to your questions; proxy variables are often the best you can do, as we will explore in a subsequent article.
A recent example here at Delve led to the discovery of an interesting insight. Clients frequently think that it is possible to find “valuable” network devices using machine learning. While this problem seems quite intuitive for a human analyst, determining what “valuable” really means is a prime example of an ill-defined problem. Valuable to whom, and with respect to what? What measurable quantity is value, and which actual data can I use to predict it? By speaking with many different stakeholders, we came to realize that the most valuable assets are most often the unusual ones. Most importantly, unusual with respect to some specific features of the asset, like the hosted website’s content and complexity, the type of services running on them, their network context, etc.
In short, it led us to craft a set of numerical variables for each device, encoding its properties of interest to the stakeholders, and then to run unsupervised anomaly detection methods to uncover the most outstanding devices. It is a prime example of insights derived from good, in-depth problem formulation. This iterative back-and-forth with stakeholders led to the concept of Gold Nuggeting, which you can learn more about in this blog post or white paper.
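The workflow above can be sketched in a few lines. This is a minimal illustration, not the actual Delve pipeline: the features and parameters are hypothetical stand-ins for the stakeholder-derived variables, and scikit-learn’s IsolationForest stands in for whichever unsupervised anomaly detection method fits your data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical per-device feature matrix: one row per network device, one
# column per numeric property (e.g. website complexity, number of exposed
# services, subnet density). Real features come from stakeholder interviews.
X = rng.normal(size=(500, 3))
X[:5] += 6.0  # plant a handful of deliberately unusual devices

# Scale features so no single variable dominates the notion of "unusual".
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised anomaly detection: score each device by how easily it is
# isolated from the rest; -1 marks the most outstanding devices.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X_scaled)

anomalies = np.flatnonzero(labels == -1)
print(f"{len(anomalies)} unusual devices flagged for analyst review")
```

The key point is that the model only becomes meaningful once the feature set encodes what “valuable” means to the stakeholders; the algorithm itself is the easy part.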
Never take the data generating process for granted. The goal of a data scientist is both straightforwardly technical and quite abstract: to uncover insights, relationships, and predictions about phenomena, given only the data representation of samples from the populations of interest.
There are multiple levels of potential bias to take into account here. If you want to make reliable, meaningful predictions that generalize well to new data, great care must be taken with the quality of the data you have in hand.
In the case of cybersecurity data, sampling biases happen mainly because data is collected by specific persons and devices, during specific events or under specific circumstances, like attack scenarios on test servers, or even in certain “demilitarized” subnetworks. These selection biases can be problematic down the line because the real, critical assets could look very different from the sampled ones and the model could be unable to capture critical signals in due time.
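One simple way to catch this kind of selection bias is a two-sample test comparing a variable in your collected sample against a broader reference population. The sketch below uses SciPy’s Kolmogorov-Smirnov test on simulated data; the variable and the two populations are hypothetical illustrations of the test-server scenario described above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical variable: number of exposed services per host, observed on
# the test servers where data was collected vs. the wider production fleet.
test_servers = rng.poisson(lam=3, size=400)  # where collection happened
production = rng.poisson(lam=8, size=400)    # the assets we care about

# Two-sample KS test: a large statistic / tiny p-value means the sampled
# distribution differs from the reference - a hint of selection bias.
stat, p_value = ks_2samp(test_servers, production)
if p_value < 0.01:
    print(f"Warning: KS statistic {stat:.2f} - "
          "sample may not represent the target population.")
```

Such a check will not tell you *why* the distributions differ, but it turns a vague suspicion of bias into a measurable red flag you can investigate.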
Questions are more important than data here. In this field, many problems arise when one starts with a dataset before framing the problem correctly, trying to find relationships without questioning the data’s legitimacy. Here are some questions for which you should have clear answers:
- Does your data represent interesting parts of your population, or is it filled with meaningless, noisy variables? What about missing values?
- What are your outliers? Are they legitimate or should you delete them? Outliers are full of insights.
- What does your data actually look like? Have you plotted it? Plot it.
- How naturally does your data group together, and with respect to which variables? Plot these groups.
- What else could you query to supplement your initial dataset? How hard would it be to obtain it?
- Does the sampling process introduce bias? If so, why and how can you detect it?
- What mechanisms are in place to ensure the diversity of data is statistically significant over time? Will it be a problem?
- Which variables are causally linked to what you are trying to predict, and which ones are potential confounders?
- Can you confidently parametrize your data using known distributions, and on what grounds?
- If you make predictions today based on the data at hand, will it still be true tomorrow?
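A few of these questions (missing values, outliers, what the data looks like) can be answered in minutes with routine checks. Below is a minimal sketch using pandas on a made-up asset dataset; the column names and distributions are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical asset dataset; column names are illustrative only.
df = pd.DataFrame({
    "open_ports": rng.poisson(5, size=200).astype(float),
    "page_complexity": rng.lognormal(0.0, 1.0, size=200),
})
df.loc[rng.choice(200, size=10, replace=False), "open_ports"] = np.nan

# Missing values: how much of each variable is actually usable?
print(df.isna().sum())

# Outliers: flag points far from the bulk, then inspect them - they are
# full of insights, so don't delete them blindly.
z = (df["page_complexity"] - df["page_complexity"].mean()) / df["page_complexity"].std()
candidates = df[z.abs() > 3]
print(f"{len(candidates)} candidate outliers to inspect by hand")

# What does the data look like? Plot it, e.g. df.hist() or
# df.plot.scatter(x="open_ports", y="page_complexity").
```

These checks are deliberately boring; the value comes from actually looking at the output with suspicion rather than feeding the raw table straight into a model.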
This is just a small set of questions you should be able to discuss, and it is by no means complete. Experience and mistakes will lead you to supplement it with your own. Sampling biases are somewhat inevitable, but they can be dealt with if you work with care and the right amount of suspicion. We will take a look at some interesting and pernicious ones in a following post.
Finally, act with the highest possible sense of rigor. There is a natural tendency to frame problems given only the data at hand. This is known as “availability bias”, and it is a very sneaky thing to deal with, as we humans have an innate tendency to avoid extra work. Yet it is our duty to act with impeccable scientific integrity, so be suspicious. Fail at this step, and you might not only end up with the wrong answers to your questions, but asking the wrong questions of your data altogether. Suspicion, curiosity, rigor and integrity are more important than tools and techniques.