Evaluating Anonymization Methods
Article 29 Data Protection
The European Commission’s Article 29 Data Protection Working Party provides a useful set of criteria for evaluating anonymization methods in its “Opinion on Anonymization Techniques” (2014):
- Is it still possible to single out an individual?
- Is it still possible to link records relating to an individual?
- Can information be inferred concerning an individual?
The first criterion means that it should not be possible to discover information about a specific individual or small group of individuals. For example, if only three individuals in an anonymized hospital dataset share a diagnosis, the dataset fails the test of singling out. The second means that it should not be possible to link different records pertaining to an individual or group. For example, a dataset that includes individuals’ occupations as well as demographic information could potentially be linked to publicly available profiles on LinkedIn, social media, or registers of professionals or government employees. Third, it should not be possible to infer potentially identifying attributes based on other attributes in a dataset. For example, location data collected through smartphones, which has sometimes been released as part of open datasets, usually makes it possible to infer the location of an individual’s home and office.
To evaluate re-identification risk, the Article 29 Working Party also suggests understanding identity as multidimensional, with each clear attribute as a coordinate. Whenever it is possible to analyze a region of this multi-dimensional space that contains only a few points, there is a risk of individuals being re-identified. In other words, any combination of properties that is unique to a particular individual or a very small group of individuals poses a risk of re-identification. Anonymity is protected when it is only possible to analyze sizeable “clusters” of individuals who cannot be distinguished from one another based on their attributes.
Here’s an example of the application of anonymization techniques to prevent the singling out of individuals or small subgroups:
A hospital database is being anonymized so that it can be shared with a medical research institute. Patient names and health card numbers have been deleted from the dataset, and dates of birth and death have been generalized to years of birth and death only. Dates of diagnosis and treatment have been generalized to monthly intervals. Data fields that remain unchanged are diagnosis and treatment procedures. If, say, only three individuals born in 1982 received a particular diagnosis in March 2014, the risk of re-identification is too high. One option is to delete these records. The other is to apply additional anonymization, perhaps by generalizing years of birth to ten-year intervals (e.g., 1980-1989, or alternatively age 30-39).
The key to anonymization lies not in deleting particular types of data, but in preventing the occurrence of subsets of one or a few individuals with a specific set of characteristics. The concept of dimensions of identity provides a starting point towards this goal by helping to break down a dataset and suggest possibilities for anonymization. Dimensions not relevant to a particular purpose can be eliminated from the dataset. Within each of the remaining dimensions, the most specific fields can be deleted, randomized, or generalized. Finally, any very small subsets remaining can be identified and deleted. When this is accomplished, the risk of re-identification approaches zero, as any unique or distinct attributes of individuals have been concealed.