Big Data’s Big Privacy Leak – Metadata and Data Lakes, Part 3
Bringing Privacy Regulation into an AI World, Part 3: Big Data’s Big Privacy Leak: Metadata and Data Lakes
This seven-part series explores, from a Canadian perspective, options for effective privacy regulation in an AI context.
For a long time, access control has been the principal means of protecting personal information. Fifty years ago, this meant locked file cabinets. Today, much of our personal data is protected by passwords. This system, refined over the past fifty years, has been highly effective in securing data and minimizing data breaches. But the advent of big data and AI has moved the goalposts. Access control cannot protect all of the personal data we reveal as we navigate the internet. Further, most internet users are now more concerned about how the companies to which they entrust their personal information are using it than about the risk of data theft. To adapt to this rapidly evolving digital environment, it will be necessary to rethink access control and develop stronger practices for controlling the use of personal data.
Many of our daily online activities are regulated by passwords. They safeguard our online lives, giving us access to everything from our smartphones and bank accounts to the many websites where we shop and entertain ourselves. Passwords are keys securing our personal information and property. The security they give us is known as access control.
Yet there is one type of personal data that passwords cannot protect: the traces we leave every time we use the internet and phone networks. The details of our activity as network users are known as metadata. We can keep our personal information under cyber-lock and key, but not our metadata. We can erase browser cookies, but the search engine’s log of our browsing patterns and search keywords remains.
The ground rules of personal data protection have not changed, despite general confusion about how they apply in rapidly changing contexts. Fair information principles, the bedrock of Canadian privacy legislation, state that organizations should only collect, use, share, and retain personal information for specific purposes to which individuals have consented. Any information, or combination of information, that is detailed enough to potentially identify a person is considered personal information, and these rules apply to it.
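To make the idea of a "combination of information" concrete, here is a minimal sketch in Python. The records, field names, and the `unique_fraction` helper are all hypothetical and invented for illustration; they are not drawn from any real dataset or from the legislation itself. The sketch simply measures what share of records is singled out when fields are considered alone versus together.

```python
from collections import Counter

# Hypothetical records: no names, only attributes that look harmless alone.
records = [
    {"postal_prefix": "M5V", "birth_year": 1984, "gender": "F"},
    {"postal_prefix": "M5V", "birth_year": 1984, "gender": "F"},
    {"postal_prefix": "M5V", "birth_year": 1990, "gender": "M"},
    {"postal_prefix": "K1A", "birth_year": 1984, "gender": "F"},
    {"postal_prefix": "K1A", "birth_year": 1975, "gender": "M"},
]


def unique_fraction(rows, fields):
    """Return the share of rows whose values for `fields` appear exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    singled_out = sum(1 for key in keys if counts[key] == 1)
    return singled_out / len(rows)


# One field on its own singles out nobody in this toy sample...
print(unique_fraction(records, ["gender"]))
# ...but the three fields together single out most of the records.
print(unique_fraction(records, ["postal_prefix", "birth_year", "gender"]))
```

In this toy sample, no record is unique by gender alone, while three of the five records are unique once postal prefix, birth year, and gender are combined. Real datasets behave differently in their details, but the pattern is why a combination of seemingly innocuous fields can amount to personal information.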
Yet as larger and larger volumes of data are collected and aggregated by big data initiatives, it becomes more and more difficult to define precisely what is considered personal information. “Data lakes” – massive repositories of relatively unstructured data collected from one or several sources, often without a specific purpose in mind – are a highly valuable asset for companies, providing a wide variety of data for potential future analysis, or for sale to other companies.
Data lakes often contain a mix of metadata and personal content. In combination, these can frequently identify specific individuals. For example, publicly available and searchable databases of Twitter activity show tweets by geographic location – positions so specific as to reveal street addresses. In the commercial realm, big box retailers use customers’ debit and credit card numbers to link their various purchases, and have developed customer sales algorithms so refined that they can identify the purchase patterns of pregnant women and send them coupons for baby products. Personally identifiable data is of far more value to marketers than aggregate data, and powerful AI technologies can be harnessed to re-identify anonymous data.
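Re-identification of this kind does not always require sophisticated AI. As a rough illustration, the Python sketch below links a nameless purchase log to a named directory using two shared fields. Every record, field name, and the `link` function are hypothetical and made up for this example; it shows a plain record-linkage join, not the method of any particular retailer or AI system.

```python
# Hypothetical "anonymized" purchase log: names removed, other fields kept.
purchases = [
    {"postal_prefix": "M5V", "birth_year": 1984, "item": "prenatal vitamins"},
    {"postal_prefix": "K1A", "birth_year": 1975, "item": "garden hose"},
    {"postal_prefix": "M5V", "birth_year": 1990, "item": "headphones"},
]

# Hypothetical second source that pairs names with the same fields.
directory = [
    {"name": "Person A", "postal_prefix": "M5V", "birth_year": 1984},
    {"name": "Person B", "postal_prefix": "K1A", "birth_year": 1975},
    {"name": "Person C", "postal_prefix": "M5V", "birth_year": 1990},
    {"name": "Person D", "postal_prefix": "M5V", "birth_year": 1990},
]


def link(anonymous_rows, named_rows, fields):
    """Attach a name to each anonymous row that matches exactly one named row."""
    # Index the named source by the shared fields.
    index = {}
    for row in named_rows:
        key = tuple(row[f] for f in fields)
        index.setdefault(key, []).append(row["name"])

    matches = []
    for row in anonymous_rows:
        candidates = index.get(tuple(row[f] for f in fields), [])
        # Only an unambiguous match counts as a re-identification.
        if len(candidates) == 1:
            matches.append((candidates[0], row["item"]))
    return matches


reidentified = link(purchases, directory, ["postal_prefix", "birth_year"])
for name, item in reidentified:
    print(name, "bought", item)
```

Here two of the three "anonymous" purchases are tied back to a single named person, while the third stays ambiguous because two people share the same postal prefix and birth year. Adding more fields or more data sources tends to shrink that remaining ambiguity, which is the risk that aggregation in a data lake creates.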
Legally, personal data can only be collected and used for specific purposes to which individuals have given consent. AI systems, however, blur the line between anonymous data and personal information by making it possible to identify individuals and infer more detailed personal information by combining data from multiple sources. Controlling access to data does not address the most significant privacy risks of AI initiatives. To protect privacy in a big data world, it will be necessary to develop more sophisticated strategies to govern the use and sharing of personal data, as I will explore in my next posts.