This weekend we awoke to hear of plans by Her Majesty’s Revenue & Customs (the UK tax authority, akin to the IRS in the USA, but with more power) to start selling anonymised tax data where doing so “would generate clear public benefits, and where there are robust safeguards in place.”
Although there is no formal announcement on the HMRC news section, you can see some of the press coverage on The BBC, The Guardian or The Telegraph.
You’ll see that one of the Government’s own MPs has described the plan as “borderline insane,” a tactic no doubt employed to garner some headlines and ensure that his opposition is well known; especially given the likely public reaction and HMRC’s less-than-great record on data protection. But is it that insane?
Setting aside the plans to sell the data, and the slightly more nuanced debate that the sale of public data brings (and of course the OpenData / Data.gov movement), I’d like to concentrate on the anonymisation of the data which HMRC might be proposing to use, and just how flawed that can be in the age of Big Data and Cloud Computing.
It is likely the proponents of the HMRC plan will assure the general public that their data won’t be identifiable and the principle of tax-payer confidentiality will be upheld… Well, it turns out that’s really hard to do!
Re-identification is the process of taking a dataset believed to have been stripped of any personally identifiable information (PII) and, by means of processing or data-matching, re-establishing that PII with some level of confidence.
In practice this generally means combining other publicly available information with the ‘anonymised’ information in a data-matching / ‘jigsawing’ exercise. Historically this was hard, processor intensive work which could take days or weeks and thus was usually cost or time prohibitive – even with just one data set to combine.
However, the advances in ‘Big Data’ over recent years, combined with the scalable power of cloud computing, mean that multiple data sets could be combined in a matter of moments – making the re-identification of data not only possible but also practicable.
An often-quoted example of this process came when anonymised usage data, released by Netflix as part of the Netflix Prize, was combined with IMDB reviews (and thus IMDB user names). It was possible to take a Netflix viewing record and link it to an IMDB review based on the time – a seemingly innocuous data point in the Netflix set. By then reversing this process it was possible to take the IMDB reviews and user names and come up with a complete listing of films watched by each user. More information on that here.
This was with just two data sources – IMDB and the anonymised Netflix data. Imagine if the researchers had then added in social media data, perhaps by looking for similar user names, or for posts containing the film’s name around the correct time – something not that complicated to do with Big Data and Cloud tech. It would have been comparatively easy to go from anonymised film usage data to a picture, name and social media details of the person watching it, along with their recent film history.
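The core of this kind of jigsaw attack is nothing exotic: match records from the “anonymised” set against public records that mention the same item at nearly the same time. A minimal sketch, using entirely made-up records and hypothetical field names (no real Netflix or IMDB data or APIs), might look like this:

```python
from datetime import datetime, timedelta

# Hypothetical "anonymised" records keyed by an opaque user ID.
anonymised_ratings = [
    {"user_id": "u1093", "film": "Brazil", "rated_at": datetime(2006, 3, 1, 18, 4)},
    {"user_id": "u2201", "film": "Heat",   "rated_at": datetime(2006, 3, 2, 9, 30)},
]

# Hypothetical public reviews, which carry a real-world identity.
public_reviews = [
    {"reviewer": "film_fan_42", "film": "Brazil", "posted_at": datetime(2006, 3, 1, 18, 6)},
]

def link(anonymised, public, window=timedelta(minutes=15)):
    """Pair records that mention the same film at nearly the same time."""
    matches = []
    for a in anonymised:
        for p in public:
            same_film = a["film"] == p["film"]
            close_in_time = abs(a["rated_at"] - p["posted_at"]) <= window
            if same_film and close_in_time:
                matches.append((a["user_id"], p["reviewer"]))
    return matches

print(link(anonymised_ratings, public_reviews))  # [('u1093', 'film_fan_42')]
```

The opaque ID `u1093` is now tied to a named reviewer, and every other film attached to that ID in the “anonymised” set comes along for free. At scale the join runs across millions of records, but the logic is exactly this simple.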
Just think of the consequences if the same happened with your tax data!
What Do We Do?
Of course, we all want open data, don’t we? But if we get scared by the possibilities like those above, we’d never release any data. A similar recent debate in the UK formed around government plans to allow research based on NHS medical records – Care.data. Fundamentally few people would disagree with using existing medical knowledge to try and improve care for the future, but medicine is complicated and you need a lot of data about an individual person to do that reliably. So, anonymised data would help, and surely we all want better health for our future generations (and maybe even us!).
Obviously we have to be careful HOW we anonymise data. The devil is in the detail. As data professionals we can take obvious steps to anonymise data effectively against the threats we know about at the time we anonymise it. We also look to anonymise data down to the lowest level needed to provide meaningful data for research & development, social good, etc. – perhaps by aggregating data into groups (for example postcode area SW1A rather than SW1A 2, or even SW1A 2AA – Downing Street).
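The postcode example above is a generalisation step: throw away the most identifying digits, keep the rest. A simple sketch of how that coarsening might be implemented (the function name and level scheme are my own illustration, not a standard):

```python
def generalise_postcode(postcode, level=1):
    """Coarsen a UK postcode for release in aggregated data.

    level 1 keeps only the outward code: 'SW1A 2AA' -> 'SW1A'
    level 2 keeps the outward code plus sector: 'SW1A 2AA' -> 'SW1A 2'
    """
    outward, _, inward = postcode.strip().upper().partition(" ")
    if level == 1 or not inward:
        return outward
    return f"{outward} {inward[0]}"

print(generalise_postcode("SW1A 2AA"))           # 'SW1A'
print(generalise_postcode("SW1A 2AA", level=2))  # 'SW1A 2'
```

The trade-off is exactly the one in the text: each level of coarsening makes individuals harder to single out, but also makes the released data less useful for fine-grained research.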
The problem comes, as with most information security, that there will always be someone with more knowledge, more skills or a stronger, often nefarious, desire to break the defences put in place to protect that information. This is the “motivated intruder” attack. It is our job to protect against this as best we can when we anonymise data – it’s a higher standard than “can a reasonable person link data.”
Motivated Intruder Test
So, when anonymising our tax data, HMRC must think of the motivated intruder. In fact, the Information Commissioner’s Office details this exceptionally well in its Anonymisation Code of Practice. HMRC will have to think about some, all, and hopefully more than the following:
- What other information is out there?
- What other information could be “jigsawed” with the tax information?
- For the information they release:
  - Can they aggregate it without losing the utility of the data?
  - What data points does it contain that may help to identify a person?
- What could the data be used for?
- How difficult (and therefore how likely) is it to use this data?
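One way to put a number on “what data points may help to identify a person” is to check, before release, how many records share each combination of quasi-identifiers – the k-anonymity idea. A minimal sketch, with hypothetical field names and made-up records:

```python
from collections import Counter

# Hypothetical records proposed for release. The quasi-identifiers are
# the fields an intruder might jigsaw with outside data.
records = [
    {"area": "SW1A", "age_band": "40-49", "occupation": "teacher"},
    {"area": "SW1A", "age_band": "40-49", "occupation": "teacher"},
    {"area": "SW1A", "age_band": "40-49", "occupation": "MP"},
]

def smallest_group(records, quasi_ids):
    """Return k: the size of the smallest group of records sharing the
    same quasi-identifier values. k == 1 means at least one record is
    unique on those fields, and therefore easiest to re-identify."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

print(smallest_group(records, ["area", "age_band", "occupation"]))  # 1
print(smallest_group(records, ["area", "age_band"]))                # 3
```

Here the occupation field makes one record unique (k = 1); dropping or aggregating it raises k to 3. Real anonymisation assessments go much further than this, but the uniqueness check is a sensible first screen.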
Some of these will be very hard to answer, or even unknown to HMRC. They are the realm of specialists who devote their whole professional life to this sort of question. It’s just like any other form of Information Security – you don’t know what you don’t know yet… so best ask someone who does nothing else. Actually, ask two people – or better still 20.
When we launch a new website or service, or even maintain an existing one, the prudent amongst us employ the services of at least one (sometimes many) security consultancies to “penetration test” it. They use every technique they know to try and break in or break the service. Anonymised data should be no different – HMRC must test their data sets with as many third parties as possible, and they should make those results public to instil confidence.
The publication of anonymised tax records could be very useful for so many aspects of life, some commercial, some social – but the potential harm of doing it incorrectly is huge and the risk of doing so is high. HMRC would be wise to tread very carefully and walk very slowly into this one.