Article 29 Data Protection Working Party
The European Union’s Article 29 Data Protection Working Party provides a useful set of criteria for evaluating anonymization methods in its Opinion 05/2014 on Anonymisation Techniques (2014):
- Is it still possible to single out an individual?
- Is it still possible to link records relating to an individual?
- Can information be inferred concerning an individual?
The first criterion means that it should not be possible to isolate the records of a specific individual or small group of individuals. For example, if only three individuals in an anonymized hospital dataset share a diagnosis, the dataset fails the test of singling out. The second means that it should not be possible to link different records pertaining to the same individual or group, whether within one dataset or across datasets. For example, a dataset that includes individuals’ occupations as well as demographic information could potentially be linked to publicly available profiles on LinkedIn, social media, or registers of professionals or government employees. The third means that it should not be possible to infer the value of an attribute, particularly a potentially identifying one, from the other attributes in a dataset. For example, location data collected through smartphones, which has sometimes been released as part of open datasets, usually makes it possible to infer the location of an individual’s home and office.
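The linking criterion can be illustrated with a small sketch. Everything here is hypothetical: the field names, the made-up records, and the rule of accepting only unambiguous matches are assumptions for illustration, not a method taken from the Working Party’s opinion.

```python
# Hypothetical "anonymized" release: names removed, but occupation and
# demographic fields remain.
released = [
    {"occupation": "pediatric surgeon", "age_band": "40-49", "city": "Ottawa", "diagnosis": "D1"},
    {"occupation": "teacher", "age_band": "30-39", "city": "Ottawa", "diagnosis": "D2"},
    {"occupation": "teacher", "age_band": "30-39", "city": "Ottawa", "diagnosis": "D3"},
]

# Hypothetical public profiles, e.g. from a register of professionals.
public = [
    {"name": "Person A", "occupation": "pediatric surgeon", "age_band": "40-49", "city": "Ottawa"},
    {"name": "Person B", "occupation": "teacher", "age_band": "30-39", "city": "Ottawa"},
]

LINK_FIELDS = ("occupation", "age_band", "city")


def link(released_rows, public_rows, fields=LINK_FIELDS):
    """Map each public name to a diagnosis when exactly one released row matches."""
    matches = {}
    for person in public_rows:
        hits = [r for r in released_rows if all(r[f] == person[f] for f in fields)]
        if len(hits) == 1:  # an unambiguous match re-identifies the record
            matches[person["name"]] = hits[0]["diagnosis"]
    return matches


linked = link(released, public)
```

In this made-up data, Person A is the only pediatric surgeon in the release and is therefore linked to a diagnosis, while Person B matches two rows and stays ambiguous.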
To evaluate re-identification risk, the Article 29 Working Party also suggests understanding identity as multidimensional, with each clear attribute as a coordinate. Whenever it is possible to analyze a region of this multi-dimensional space that contains only a few points, there is a risk of individuals being re-identified. In other words, any combination of properties that is unique to a particular individual or a very small group of individuals poses a risk of re-identification. Anonymity is protected when it is only possible to analyze sizeable “clusters” of individuals who cannot be distinguished from one another based on their attributes.
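One way to look for sparsely populated regions of this space is to count how many records share each combination of attribute values. The sketch below is a minimal illustration under stated assumptions: the records are invented, and the threshold `k` is a hypothetical parameter, since the text does not say how small a group is too small.

```python
from collections import Counter

# Hypothetical records: (year_of_birth, gender, partial_postal_code)
records = [
    (1982, "F", "K1A"),
    (1982, "F", "K1A"),
    (1982, "F", "K1A"),
    (1982, "M", "K1A"),
    (1975, "F", "M5V"),
    (1975, "F", "M5V"),
]


def small_cells(rows, k):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(rows)
    return {combo: n for combo, n in counts.items() if n < k}


# k=3 is an arbitrary choice for this example.
risky = small_cells(records, k=3)
```

Here `risky` holds the two combinations that describe only one or two records; the combination shared by three records passes this particular threshold.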
Here’s an example of the application of anonymization techniques to prevent the singling out of individuals or small subgroups:
A hospital database is being anonymized so that it can be shared with a medical research institute. Patient names and health card numbers have been deleted from the dataset, and dates of birth and death have been generalized to years of birth and death only. Dates of diagnosis and treatment have been generalized to monthly intervals. Data fields that remain unchanged are diagnosis and treatment procedures. If, say, only three individuals born in 1982 received a particular diagnosis in March 2014, the risk of re-identification is too high. One option is to delete these records. Another is to apply additional anonymization, perhaps by generalizing years of birth to ten-year intervals (e.g., 1980-1989) or, alternatively, to ten-year age bands (e.g., age 30-39).
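The effect of the extra generalization in this example can be sketched in a few lines. The patient rows, the diagnosis code "D1", and the helper names `by_year` and `by_decade` are all invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (date_of_birth, diagnosis, date_of_diagnosis)
patients = [
    (date(1982, 1, 5), "D1", date(2014, 3, 2)),
    (date(1982, 6, 17), "D1", date(2014, 3, 11)),
    (date(1982, 11, 30), "D1", date(2014, 3, 25)),
    (date(1985, 2, 9), "D1", date(2014, 3, 4)),
    (date(1988, 8, 21), "D1", date(2014, 3, 19)),
    (date(1980, 4, 2), "D1", date(2014, 3, 30)),
]


def by_year(row):
    """Generalize to year of birth and month of diagnosis."""
    dob, diagnosis, diagnosed = row
    return (dob.year, diagnosis, diagnosed.strftime("%Y-%m"))


def by_decade(row):
    """Generalize further: ten-year birth interval, month of diagnosis."""
    dob, diagnosis, diagnosed = row
    start = dob.year // 10 * 10
    return (f"{start}-{start + 9}", diagnosis, diagnosed.strftime("%Y-%m"))


year_groups = Counter(by_year(r) for r in patients)
decade_groups = Counter(by_decade(r) for r in patients)
```

With year of birth, the three patients born in 1982 form a group of three and the others stand alone; with ten-year intervals, all six made-up patients fall into a single group.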
The key to anonymization lies not in deleting particular types of data, but in preventing the occurrence of subsets of one or a few individuals with a specific set of characteristics. The concept of dimensions of identity provides a starting point towards this goal by helping to break down a dataset and suggest possibilities for anonymization. Dimensions not relevant to a particular purpose can be eliminated from the dataset. Within each of the remaining dimensions, the most specific fields can be deleted, randomized, or generalized. Finally, any very small subsets remaining can be identified and deleted. When this is accomplished, the risk of re-identification approaches zero, as any unique or distinct attributes of individuals have been concealed.
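The three steps described above (eliminate irrelevant dimensions, generalize the most specific fields, delete very small subsets) might be combined as in the following sketch. The function `anonymize`, its parameters, the sample rows, and the minimum group size `k` are hypothetical and serve only to show the order of operations.

```python
from collections import Counter


def identity(value):
    return value


def anonymize(rows, keep, generalizers, k):
    """Drop unneeded fields, generalize the rest, then suppress small groups.

    rows: list of dicts; keep: names of the fields to retain;
    generalizers: {field: function} applied to retained fields;
    k: assumed minimum group size (not specified in the article).
    """
    reduced = [
        {f: generalizers.get(f, identity)(row[f]) for f in keep}
        for row in rows
    ]

    def key(record):
        return tuple(record[f] for f in keep)

    sizes = Counter(key(r) for r in reduced)
    return [r for r in reduced if sizes[key(r)] >= k]


# Hypothetical input rows.
rows = [
    {"name": "A", "age": 31, "postal": "K1A 0B1", "diagnosis": "D1"},
    {"name": "B", "age": 34, "postal": "K1A 0C2", "diagnosis": "D1"},
    {"name": "C", "age": 38, "postal": "K1A 1A1", "diagnosis": "D1"},
    {"name": "D", "age": 52, "postal": "M5V 2T6", "diagnosis": "D2"},
]

released = anonymize(
    rows,
    keep=["age", "postal", "diagnosis"],  # "name" is eliminated entirely
    generalizers={
        "age": lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",  # ten-year bands
        "postal": lambda p: p[:3],  # partial postal code
    },
    k=3,
)
```

In this example the first three rows become indistinguishable from one another and are released, while the fourth row forms a group of one and is suppressed.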
Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques.
Privacy and data protection laws stipulate protections for “personal information,” which is usually defined as information that could readily identify an individual. This has given rise to various attempts to define “identifying information,” when in fact there is no clear division between types of information that can and cannot identify individuals. Identity has frequently been conceptualized as a spectrum ranging from unique identifiers such as names and government identification numbers, to anonymous attributes that provide no connection to identity. However, individuals can frequently be identified by unique combinations of attributes that by themselves provide little information about identity. Understanding identity not as a spectrum, but as a set of dimensions, helps to anonymize data more effectively by taking into account all of the combinations of attributes that could identify an individual.
Identifying information is often understood as a spectrum ranging from “verinyms,” which provide certainty of an individual’s identity, to anonymous information which provides absolutely no connection to identity. Most data fall somewhere in the middle of the spectrum, identifying subgroups to which an individual belongs. Individuals can be identified by a single attribute, such as a full postal address, or a combination of attributes: for example, the ages of a person’s children and the school the children attend, combined with an ethnic identifier, may point to one or two individuals.
The goal of anonymization is to reduce the amount of personal information in a dataset to such a point that it no longer poses a significant risk of re-identifying individuals, and can therefore be shared more freely. The key to effective anonymization is the ability to evaluate the risk of re-identification in a dataset and apply changes that reduce this risk to an acceptable threshold. Conceptualizing identity as a set of dimensions helps to evaluate this risk and guide the use of anonymization techniques.
Dimensions of Identity
Datasets containing personal information usually include a variety of personal attributes, such as gender, age, address, transactions, and so on. Some of these attributes are more closely related to each other than others. Grouping data into dimensions provides significant information about re-identification risk. For example, the location dimension includes a number of identifying attributes: mailing address, postal code, electoral district, city, etc. As more attributes belonging to a given dimension are included in a dataset, and as those attributes become more specific, the likelihood of identification within that dimension rises.
Some key dimensions included in many datasets are:
- Personal demographics (e.g., gender, age, ethnicity, number of children)
- System transactions (e.g., appointments, billing, service and program participation)
- Medical information (e.g., test results, diagnoses, prescriptions)
- Financial information (e.g., credit card information, bank transactions)
The risk of releasing data increases with:
- the level of granularity in each of the dimensions, and
- the uniqueness of a certain property.
Sufficient granularity in any dimension can be enough to identify an individual: for example, a full postal address may belong to a single person. Medium to high levels of granularity in several dimensions can also identify an individual: for example, there may be only one individual in a hospital’s database with a particular year of birth, partial postal code, and gender. Along with granularity it is essential to consider the uniqueness of any property or combination of properties included in a dataset. Some postal codes contain twenty households, and others only one. Some medical diagnoses are very common in a particular age bracket, and unusual in others.
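The interaction between granularity and uniqueness can be made concrete by measuring what share of records is unique at different levels of detail. The rows and the helper `unique_share` below are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical rows: (year_of_birth, gender, full_postal_code)
people = [
    (1982, "F", "K1A 0B1"),
    (1982, "M", "K1A 0B1"),
    (1982, "F", "K1A 0C2"),
    (1990, "F", "K1A 0C2"),
    (1990, "F", "K1A 1A1"),
    (1990, "M", "M5V 2T6"),
]


def unique_share(rows, project):
    """Fraction of rows whose projected attribute values no other row shares."""
    keys = [project(r) for r in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# All fields with the full postal code.
full = unique_share(people, lambda r: r)
# Same fields, but with a partial (three-character) postal code.
partial = unique_share(people, lambda r: (r[0], r[1], r[2][:3]))
# A single coarse attribute: the partial postal code alone.
postal_only = unique_share(people, lambda r: r[2][:3])
```

In this made-up data every record is unique at full granularity, fewer are unique once the postal code is shortened, and even the partial postal code alone still singles out the one person in "M5V", which reflects the point that some values are rare regardless of how coarse the field is.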
About Waël Hassan:
Dr. Waël Hassan is the founder of KI Design.