My latest Guardian column is “Data protection in the EU: the certainty of uncertainty,” a look at the absurdity of having privacy rules that describe some data-sets as “anonymous” and others as “pseudonymous,” while computer scientists in the real world are happily re-identifying “anonymous” data-sets with techniques that grow more sophisticated every day. The EU is being lobbied as never before on its new data protection rules, mostly by US IT giants, and the new rules have huge loopholes for “anonymous” and “pseudonymous” data that are violently disconnected from the best modern computer science theories. Either the people proposing these categories don’t really care about privacy, or they don’t know enough about it to be making up the rules. Either way, it’s a bad scene.
Since the mid-noughties, de-anonymising has become a kind of full-contact sport for computer scientists, who keep blowing anonymisation schemes out of the water with clever re-identifying tricks. A recent paper in Nature Scientific Reports showed how the “anonymised” data from a European phone company (likely one in Belgium) could be re-identified: just four data points about each person were enough to single out 95% of the users, and even with only two data points, more than half the users in the set could be re-identified.
Some will say this doesn’t matter. They’ll say that privacy is dead, or irrelevant, or unimportant. If you agree, remember this: the reason anonymisation and pseudonymisation are being contemplated in the General Data Protection Regulation is that its authors say privacy is important, and worth preserving. They are talking about anonymising data-sets because they believe that anonymisation will protect privacy – and that means they’re saying, implicitly, that privacy is worth preserving. If that’s the policy’s goal, then the policy should pursue it in ways that conform to reality as we understand it.
Indeed, the whole premise of “Big Data” is at odds with the idea that data can be anonymised. After all, Big Data promises that with very large data-sets, subtle relationships can be teased out. In the world of re-identifying, they talk about “sparse data” approaches to de-anonymisation. Though most of your personal traits are shared with many others, there are some things about you that are less commonly represented in the set – maybe the confluence of your reading habits and your address; maybe your city of birth in combination with your choice of cars.
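To make the “sparse data” idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the six toy records, the three traits (city of birth, car, favourite genre) and the helper name `unique_fraction` are all assumptions, and this is not the method used in any particular re-identification paper. It simply counts how many people in a data-set are the only ones with a given combination of traits.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only:
# (city_of_birth, car, favourite_genre). No names, so "anonymous".
records = [
    ("Leeds", "Ford", "crime"),
    ("Leeds", "Ford", "sci-fi"),
    ("Leeds", "Volvo", "crime"),
    ("Ghent", "Ford", "crime"),
    ("Ghent", "Volvo", "crime"),
    ("Ghent", "Volvo", "sci-fi"),
]


def unique_fraction(rows, columns):
    """Share of rows whose values in `columns` match no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each trait alone is shared by several people; combinations are not.
one_trait = unique_fraction(records, [0])
two_traits = unique_fraction(records, [0, 1])
three_traits = unique_fraction(records, [0, 1, 2])

print(f"city only:            {one_trait:.0%} unique")
print(f"city + car:           {two_traits:.0%} unique")
print(f"city + car + reading: {three_traits:.0%} unique")
```

In this toy set, nobody is unique by city alone, a third of the people are unique once you add their car, and everyone is unique once you add their reading habits. Real data-sets are much bigger, but they also carry far more traits per person, which is the problem.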