Privacy and data protection laws stipulate protections for “personal information,” which is usually defined as information that could readily identify an individual. This has given rise to various attempts to define “identifying information,” when in fact there is no clear division between types of information that can and cannot identify individuals. Identity has frequently been conceptualized as a spectrum ranging from unique identifiers such as names and government identification numbers, to anonymous attributes that provide no connection to identity. However, individuals can frequently be identified by unique combinations of attributes that by themselves provide little information about identity. Understanding identity not as a spectrum, but as a set of dimensions, helps to anonymize data more effectively by taking into account all of the combinations of attributes that could identify an individual.
Identifying information is often understood as a spectrum ranging from “verinyms,” which provide certainty of an individual’s identity, to anonymous information, which provides absolutely no connection to identity. Most data fall somewhere in the middle of the spectrum, identifying subgroups to which an individual belongs. Individuals can be identified by a single attribute, such as a full postal address, or by a combination of attributes: for example, the ages of a person’s children and the school the children attend, combined with an ethnic identifier, may point to one or two individuals.
The goal of anonymization is to reduce the amount of personal information in a dataset to such a point that it no longer poses a significant risk of re-identifying individuals, and can therefore be shared more freely. The key to effective anonymization is the ability to evaluate the risk of re-identification in a dataset and apply changes that reduce this risk to an acceptable threshold. Conceptualizing identity as a set of dimensions helps to evaluate this risk and guide the use of anonymization techniques.
Dimensions of Identity
Datasets containing personal information usually include a variety of personal attributes, such as gender, age, address, transactions, and so on. Some of these attributes are more closely related to each other than others. Grouping related attributes into dimensions makes re-identification risk easier to assess, because risk accumulates within each dimension. For example, the location dimension includes a number of identifying attributes: mailing address, postal code, electoral district, city, etc. As more attributes belonging to a given dimension are included in a dataset, and as those attributes become more specific, the likelihood of identification within that dimension rises.
Some key dimensions included in many data sets are:
- Location
- Personal demographics (e.g., gender, age, ethnicity, number of children)
- System transactions (e.g., appointments, billing, service and program participation)
- Medical information (e.g., test results, diagnoses, prescriptions)
- Financial information (e.g., credit card information, bank transactions)
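As an illustration of grouping attributes into dimensions, the following minimal sketch sorts a dataset's column names by dimension. The column names and the `DIMENSIONS` mapping are hypothetical and chosen only to mirror the list above; a real dataset would need its own mapping.

```python
# Hypothetical mapping from column names to identity dimensions.
# Both the dimension keys and the column names are illustrative assumptions.
DIMENSIONS = {
    "location": {"mailing_address", "postal_code", "electoral_district", "city"},
    "demographics": {"gender", "age", "ethnicity", "number_of_children"},
    "transactions": {"appointment_date", "billing_code", "program"},
    "medical": {"test_result", "diagnosis", "prescription"},
    "financial": {"credit_card_number", "bank_transaction"},
}


def attributes_by_dimension(columns):
    """Group a dataset's column names by the dimension they belong to."""
    grouped = {name: [] for name in DIMENSIONS}
    grouped["unclassified"] = []
    for column in columns:
        for name, members in DIMENSIONS.items():
            if column in members:
                grouped[name].append(column)
                break
        else:
            # No dimension claimed this column.
            grouped["unclassified"].append(column)
    return grouped


columns = ["postal_code", "city", "gender", "age", "diagnosis", "visit_id"]
grouped = attributes_by_dimension(columns)
for name, members in grouped.items():
    print(f"{name}: {len(members)} attribute(s) {members}")
```

A dimension that collects several attributes, or a column that lands in "unclassified", is a natural place to start a closer review.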
The risk of releasing data increases with:
- the level of granularity in each of the dimensions, and
- the uniqueness of a property or combination of properties.
Sufficient granularity in any dimension can be enough to identify an individual: for example, a full postal address may belong to a single person. Medium to high levels of granularity in several dimensions can also identify an individual: for example, there may be only one individual in a hospital’s database with a particular year of birth, partial postal code, and gender. Along with granularity, it is essential to consider the uniqueness of any property or combination of properties included in a dataset. Some postal codes contain twenty households, and others only one. Some medical diagnoses are very common in a particular age bracket, and unusual in others.
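The interaction between granularity and uniqueness can be sketched in a few lines. The example below counts how many records share each combination of year of birth, partial postal code, and gender, then repeats the count after reducing granularity. The records, the field names, and the `coarsen` rules (birth year to decade, postal prefix to its first character) are all assumptions made up for illustration, not a recommended anonymization policy.

```python
from collections import Counter


def group_sizes(records, keys):
    """Count how many records share each combination of values for keys."""
    return Counter(tuple(record[k] for k in keys) for record in records)


def unique_records(records, keys):
    """Return the records whose combination of values appears only once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


def coarsen(record):
    """Reduce granularity: birth year -> decade, postal prefix -> first character."""
    return {
        **record,
        "birth_year": record["birth_year"] // 10 * 10,
        "postal_prefix": record["postal_prefix"][:1],
    }


# Made-up records for illustration only.
records = [
    {"birth_year": 1961, "postal_prefix": "K1A", "gender": "F"},
    {"birth_year": 1961, "postal_prefix": "K1A", "gender": "F"},
    {"birth_year": 1964, "postal_prefix": "K1B", "gender": "F"},
    {"birth_year": 1987, "postal_prefix": "M5V", "gender": "M"},
    {"birth_year": 1982, "postal_prefix": "M4C", "gender": "M"},
]
keys = ["birth_year", "postal_prefix", "gender"]

before = unique_records(records, keys)
after = unique_records([coarsen(r) for r in records], keys)
print(len(before), "unique before coarsening;", len(after), "after")
```

In this toy data, three of the five records are unique on the original attributes and none are unique after coarsening. Real data will rarely behave so neatly, and the same check should be repeated for every combination of attributes that is to be released.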