It is common knowledge these days that you exist all over the internet. Every site you visit, app you use, and company you deal with tracks you in its own way, building databases of information that help them develop more effective (and more profitable) services. While this data is often protected by privacy policies, those policies generally allow the data to be shared with just about anyone, provided certain steps are taken to anonymize it. However, as we mentioned in a previous post, Harvard researcher Latanya Sweeney has recently shown that data can never truly be anonymized: supposedly anonymous records can be pieced together with other publicly available information to “fill in the blanks.”
This is unfortunate, since mining big data can be incredibly useful, not just for maximizing profits but for measuring larger social trends and analyzing regional health concerns. So how can we analyze data without sacrificing privacy, when traditional anonymization does not cut it? One solution is differential privacy. With differential privacy, whenever the data is queried or released, random noise is injected in ways that do not change how the database behaves statistically but do place a mathematical limit on the probability of identifying any one entry. That limit, governed by a parameter usually written ε, is the database’s privacy budget.
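To make that a little more concrete, here is a minimal sketch of the classic Laplace mechanism, one common way to achieve differential privacy for a simple counting query. The dataset, the query, and the choice of ε below are purely hypothetical illustrations, not part of any particular system described in this post.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true answer by at most 1), so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy for this query.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: count patients over 65 in a made-up list of ages.
ages = [34, 71, 68, 45, 80, 29, 66]
private_answer = laplace_count(ages, lambda age: age > 65, epsilon=0.5)
print(round(private_answer))  # near the true count of 4, but deliberately noisy
```

The smaller ε is, the more noise gets added and the stronger the privacy guarantee: no single person’s record can move the published answer by much, which is exactly the mathematical limit described above.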
Currently, differential privacy faces a number of mathematical hurdles, including developing more efficient algorithms that require less computing time and ensuring that the random alterations to the data cannot be sniffed out and reversed. And given the fractured state of American privacy law, even once these technical hurdles are surmounted, it will be difficult for differential privacy to become the norm. If it does succeed, though, the tool would be invaluable for wide-scale social research, with very promising implications for medicine, sociology, economics, and beyond.