Knowledge extraction on anonymized data
September 5, 2019
How security researchers can gain (probable) insights on (almost) anonymous data.
At Delve Labs, our main concern is to provide our customers with better tools, helping them understand their information security posture. Therefore, we require insight into many intricate and sensitive details of their IT environment. We need the ability to query highly private data in a provably secure fashion. This short article introduces the concepts and terminology of privacy-preserving data mining and showcases the most important methods of this increasingly critical discipline. This post will be the first of a series that will illustrate certain techniques, with relevant examples from the field of information security.
The basic goal of data mining is to extract patterns from vast amounts of data, discovering meaningful subsets of lower dimensional structure in the space of available information. Doing this reduces the granularity of representations to extract predictive, causal models on the one hand, or to find correlations for descriptive models on the other. Data mining experienced a significant rise in exposure in recent years, driven by the growing amounts of data collected by ubiquitous and pervasive monitoring technology and by spectacular results in data analytics and machine learning.
The rest of this article is organized as follows: we first define the fundamental problem of privacy-preserving data mining, namely the privacy vs. utility trade-off. Then we define the data lifecycle model, and we conclude with a description of the three major formalizations of anonymity-preserving techniques: k-anonymity, differential privacy, and distributed privacy.
The problem of analytics on sensitive data
While the utility (quality and significance) of the available data is of prime importance, detailed queries can raise privacy concerns from the record owners. Unfortunately, the vast majority of techniques applied to anonymize data either modify, remove or randomize some of the original information, hence reducing its utility. This is known as the fundamental trade-off of utility vs. privacy. The challenge is to define a formal model of privacy from which we can derive a minimum threshold of anonymity, while maximizing utility so that efficient data mining remains possible. To understand data privacy, we must first look at the three general stages data goes through.
The data lifecycle model
| Stage | Definition | Problems |
| --- | --- | --- |
| Data collection | The generation, filtering, transformation and transmission of information from the original source. | The collecting entity is not fully trusted. Randomization on individual data is performed at the source to ensure no critical identifiable attributes are stored. |
| Data publishing | The dissemination of a subset or view of the complete data set to a third party. In this case, we assume the data set is held by a trusted party. | Sensitive attributes can be linked to a unique individual using linkage attacks [ref] or detailed comparison with other records. |
| Result publication | The publication of results or parameters of the knowledge extraction process, keeping the sensitive inputs secret. | A malicious user may be able to query the algorithm in a clever way, allowing them to infer sensitive information about its private input. |
I will give a brief summary of a privacy technique used at each of these stages in the following section.
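As a concrete illustration of source-side randomization at the data collection stage, here is a minimal sketch of classic randomized response: each respondent flips coins before reporting a sensitive boolean, so the collector never learns any individual's true answer but can still estimate the aggregate rate. The function names and parameters below are illustrative, not from any particular library.

```python
import random

def randomized_response(truth: bool, p_truth: float = 0.5) -> bool:
    """Report the true answer with probability p_truth,
    otherwise report a uniformly random answer."""
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports, p_truth: float = 0.5) -> float:
    """Invert the randomization to estimate the true proportion of 'yes':
    observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

random.seed(0)
# Simulate 100,000 respondents, 30% of whom hold the sensitive attribute.
true_answers = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(a) for a in true_answers]
print(round(estimate_true_rate(reports), 2))  # estimate of the true 30% rate
```

Each individual report is plausibly deniable, yet the aggregate estimate converges to the true rate as the number of respondents grows.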
Toward a formal definition of privacy
While privacy is a concept that has no universally accepted boundaries, sensitive information has been collected, queried, breached and published for ages. Many techniques have been used, like incorporating additive or multiplicative noise with a known statistical distribution, all lacking convincing proofs of anonymity. It is only quite recently that formal models of privacy have emerged.
k-Anonymity
The first of the three such models I want to present is k-anonymity [ref], which gives an answer to the following problem:
“Given uniquely-identifying, field-structured data, produce a useful release of the data with a formal guarantee that the individuals who are the subjects of the data cannot be re-identified from the anonymized version.”
k-anonymity is an intrinsic property of a dataset, which is said to be k-anonymous if every combination of identity-revealing characteristics occurs in at least k different records. This way, any identity-revealing attribute is indistinguishable from that of at least k-1 other records. k is a privacy parameter chosen in advance. Many techniques can be used to replace or randomize attributes while respecting this definition. This model will be discussed in depth in its own article.
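To make the definition concrete, here is a small sketch (the data and column names are hypothetical) that checks whether a table is k-anonymous with respect to a chosen set of identity-revealing columns, often called quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k: int) -> bool:
    """A table is k-anonymous if every combination of quasi-identifier
    values appears in at least k records."""
    counts = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in counts.values())

# Ages generalized into ranges and ZIP codes truncated: typical
# transformations used to reach k-anonymity.
records = [
    {"age": "30-40", "zip": "021**", "condition": "flu"},
    {"age": "30-40", "zip": "021**", "condition": "cold"},
    {"age": "30-40", "zip": "021**", "condition": "flu"},
    {"age": "50-60", "zip": "021**", "condition": "cancer"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))  # False: the 50-60 group has one record
```

Further generalizing the lone "50-60" record (for example, widening the age range) would restore 2-anonymity at the cost of some utility, which is the fundamental trade-off in action.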
While k-anonymity is widely used, it is far from perfect. It protects record owners’ identities, and makes it harder to associate any sensitive attribute with any particular individual. It does not, however, defend against detection of the presence or absence of the record’s owner in the data set, potentially leading to information leakage.
Differential Privacy
That is specifically the problem that our next privacy model aims to solve. Differential privacy, now considered to be the strongest form of privacy [ref], formalizes the following task:
“Design a randomized query mechanism such that, for every two databases differing by a single record, no knowledge extraction algorithm can distinguish the two from the mechanism’s output with more than a small advantage (bounded by a parameter epsilon).”
This parameter epsilon is then chosen as a threshold of privacy, and can be made as small as necessary based on privacy requirements. Much work has been done to analyze and complement this definition, and a plethora of applications and algorithms exist that instantiate it. Its robustness has been widely accepted. For example, Uber [ref] published an open source implementation of an algorithm that rewrites SQL queries to be differential-privacy compliant. Google has also published a data collection algorithm using differential privacy called RAPPOR [ref], used in the Chrome browser to collect users’ data, and Apple has publicly claimed to adopt differential privacy in its products which collect behavioral information [ref]. I will cover this model of privacy in a separate blog post with enlightening examples.
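As a small taste of what that post will cover, here is a minimal sketch of the Laplace mechanism, the textbook way to answer a counting query with epsilon-differential privacy. The dataset and parameter values are illustrative.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """A counting query has sensitivity 1: adding or removing one record
    changes the true answer by at most 1. Adding Laplace noise with
    scale 1/epsilon therefore yields epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 47, 52, 61, 38, 44]
# Noisy answer to "how many people are over 40?" (true answer: 4)
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; in practice, repeated queries consume a "privacy budget," since each noisy answer leaks a little more information.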
Distributed Privacy
The final model we will mention is distributed privacy. In this model, mutually-untrusted parties each hold a subset of a dataset, and wish to compute a knowledge model from the sum of their data while revealing as little as possible about their individual components to one another. For example, competing institutions on a market might want to ensure that all of them are compliant with some agreed requirements, but none of them wants to share their private financial information. Here we can highlight the intriguing techniques of secure multi-party computation, oblivious transfer and homomorphic encryption. Many of the techniques that we will illustrate in this privacy model are still only theoretically useful owing to computational limitations, but they nonetheless carry important applications for the technologies of tomorrow, and so this model deserves its own article.
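One of the simplest building blocks of secure multi-party computation can be sketched in a few lines: additive secret sharing, where each party splits its private value into random shares that sum to it modulo a large prime. The parties can then compute a joint sum while any subset of fewer than all shares looks uniformly random. The three-party setup and figures below are illustrative.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret: int, n_parties: int):
    """Split a secret into n additive shares that sum to it mod PRIME.
    Any n-1 shares are uniformly distributed and reveal nothing."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values):
    """Each party shares its value with the others; parties locally add
    the shares they hold, and only those partial sums are published."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party i holds the i-th share of every value...
    partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(n)]
    # ...and only these partial sums are combined into the final result.
    return sum(partial_sums) % PRIME

# Three institutions compute their combined figure without revealing
# their individual contributions.
print(secure_sum([1_200, 3_400, 560]))  # 5160
```

This only resists passive (honest-but-curious) participants; defending against parties who deviate from the protocol requires heavier machinery, which is part of why many of these techniques remain computationally expensive.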
Conclusion
One of the main advantages of the technology developed at Delve Labs is the ability to leverage the vast amount of data provided by the collective gathering of all of our customers’ information. Collecting, storing, managing and analyzing this increasing amount of highly sensitive data brings its share of technical challenges. We are always looking to improve the security and fidelity of our data. Knowledge of formal privacy methods is one of the many ways by which we hope to strengthen our machine learning infrastructure, and we hope to share what we learn in the series of posts we have started here.