
HMRC Anonymous Data? Be Careful…

This weekend we awoke to hear of plans by Her Majesty’s Revenue & Customs (the UK tax authority, akin to the IRS in the USA, but with more power) to start selling anonymised tax data where doing so “would generate clear public benefits, and where there are robust safeguards in place.”

Although there is no formal announcement on the HMRC news section, you can see some of the press coverage on the BBC, The Guardian or The Telegraph.

You’ll see that one of the Government’s own MPs has described the plan as “borderline insane,” a tactic no doubt employed to garner some headlines and ensure that his opposition is well known, especially given the likely public reaction and HMRC’s not-all-too-great record on data protection. But is it that insane?

Setting aside the plans to sell the data, and the more nuanced debate that the sale of public data brings (and of course the OpenData / Data.gov movement), I’d like to concentrate on the anonymisation of the data which HMRC might be proposing to use, and just how flawed that can be in the age of Big Data and Cloud Computing.

It is likely the proponents of the HMRC plan will assure the general public that their data won’t be identifiable and the principle of taxpayer confidentiality will be upheld… Well, it turns out that’s really hard to do!

Re-Identification

Re-identification is the process of taking a dataset which is believed to have been stripped of any personally identifiable information (PII) and, by means of processing or data-matching, re-establishing that PII with some level of confidence.

In practice this generally means combining other publicly available information with the ‘anonymised’ information in a data-matching / ‘jigsawing’ exercise. Historically this was hard, processor-intensive work which could take days or weeks and thus was usually cost- or time-prohibitive – even with just one data set to combine.

However, the advances in ‘Big Data’ over recent years, combined with the scalable power of cloud computing, mean that multiple data sets could be combined in a matter of moments – making the re-identification of data not only possible but also practicable.

An often-quoted example of this process came when anonymised usage data, released by Netflix as part of the Netflix Prize, was combined with IMDb reviews (and thus IMDb user names). Given a user who had watched a film on Netflix, it was possible to link them to their IMDb review based on the time – a seemingly innocuous data point in the Netflix set. By reversing this process it was possible to take the IMDb reviews and user names and come up with a complete listing of films watched by each user. More information on that here.
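
To make the mechanics concrete, here is a minimal sketch of that kind of timestamp join. Everything in it – the records, the field names and the one-hour window – is invented for illustration; the real attack used far larger datasets and more robust statistics.

```python
# A toy sketch of timestamp-based linkage in the spirit of the Netflix
# Prize de-anonymisation. All records, field names and the one-hour
# window are invented for illustration.
from datetime import datetime, timedelta

# 'Anonymised' usage data: no names, just an opaque subscriber ID.
netflix_ratings = [
    {"subscriber": "user_8271", "film": "The Third Man",
     "rated_at": datetime(2006, 3, 14, 21, 5)},
]

# Public reviews: real user names, film titles and posting times.
imdb_reviews = [
    {"user": "filmfan_dave", "film": "The Third Man",
     "posted_at": datetime(2006, 3, 14, 21, 12)},
]

WINDOW = timedelta(hours=1)

def link(ratings, reviews, window=WINDOW):
    """Yield (anonymous ID, public user) pairs whose activity lines up."""
    for r in ratings:
        for v in reviews:
            if r["film"] == v["film"] and abs(r["rated_at"] - v["posted_at"]) <= window:
                yield r["subscriber"], v["user"]

for anon_id, real_user in link(netflix_ratings, imdb_reviews):
    print(f"{anon_id} is plausibly {real_user}")
```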

This was with just two data sources – IMDb and the anonymised Netflix data. Imagine if the researchers had then added in social media data, perhaps by looking for similar user names, or for posts containing the film’s name around the correct time – something not that complicated to do with Big Data and Cloud tech. It would have been comparatively easy to go from anonymised film usage data to a picture, name and social media details of the person watching it, along with their recent film history.
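
Matching user names across services needs nothing more exotic than the standard library. A toy sketch, with invented names and an arbitrary similarity threshold:

```python
# A toy sketch of fuzzy user-name matching across services, using only
# the standard library. All names are invented.
from difflib import SequenceMatcher

imdb_users = ["filmfan_dave", "cinema_sue"]
social_handles = ["FilmFanDave88", "sue_cinema", "unrelated_bob"]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for user in imdb_users:
    best = max(social_handles, key=lambda h: similarity(user, h))
    if similarity(user, best) >= 0.5:  # threshold chosen arbitrarily
        print(f"{user} may be {best} ({similarity(user, best):.2f})")
```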

Just think of the consequences if the same happened with your tax data!

What Do We Do?

Of course, we all want open data, don’t we? But if we let possibilities like those above scare us, we’d never release any data. A similar recent debate in the UK formed around government plans to allow research based on NHS medical records – Care.data. Fundamentally, few people would disagree with using existing medical knowledge to try and improve care for the future, but medicine is complicated and you need a lot of data about an individual person to do that reliably. So, anonymised data would help, and surely we all want better health for our future generations (and maybe even us!).

Obviously we have to be careful HOW we anonymise data. The devil is in the detail. As data professionals we can take obvious steps to anonymise data effectively against the threats we know about at the time we anonymise it. We also look to anonymise data down to the lowest level of detail needed to provide meaningful data for research & development, social good and so on – perhaps by aggregating data into groups (for example releasing the postcode area SW1A rather than SW1A 2, or even the full SW1A 2AA – Downing Street).
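
A minimal sketch of that kind of generalisation, with invented records and an arbitrary minimum group size:

```python
# A minimal sketch of aggregation-by-generalisation: release only the
# postcode area, and suppress any group smaller than k. The records and
# the value of k are invented for illustration.
from collections import Counter

records = [
    {"postcode": "SW1A 2AA", "tax_band": "higher"},
    {"postcode": "SW1A 1AA", "tax_band": "basic"},
    {"postcode": "SW1A 0AA", "tax_band": "higher"},
    {"postcode": "EC2V 7HH", "tax_band": "higher"},
]

def outward_code(postcode: str) -> str:
    """Generalise: 'SW1A 2AA' -> 'SW1A'."""
    return postcode.split()[0]

K = 3  # minimum group size we are willing to release
group_sizes = Counter(outward_code(r["postcode"]) for r in records)

released = [
    {**r, "postcode": outward_code(r["postcode"])}
    for r in records
    if group_sizes[outward_code(r["postcode"])] >= K
]
print(released)  # the lone EC2V record is suppressed, SW1A survives
```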

The problem, as with most information security, is that there will always be someone with more knowledge, more skill or a stronger, often nefarious, desire to break the defences put in place to protect that information. This is the “motivated intruder” attack. It is our job to protect against this as best we can when we anonymise data – a higher standard than “could a reasonable person link the data?”

Motivated Intruder Test

So, when anonymising our tax data, HMRC must think of the motivated intruder. In fact, the Information Commissioner’s Office details this exceptionally well in its Anonymisation Code of Practice. HMRC will have to think about some, all, and hopefully more than the following (one simple check along these lines is sketched after the list):

  • What other information is out there?
  • What other information could be “jigsawed” with the tax information?
  • What information they release:
    • Can they aggregate without losing utility of the data?
    • What data points are in it which may help to identify a person?
    • What could the data be used for?
  • How difficult (and therefore likely) is it to use this data?
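
Some of these questions can be made quantitative. As a toy first pass (the fields and records below are invented), simply count how many records remain unique on a combination of quasi-identifiers:

```python
# A toy 'motivated intruder' first pass: count how many records are
# unique on a combination of quasi-identifiers. The fields and records
# are invented for illustration.
from collections import Counter

records = [
    {"area": "SW1A", "age_band": "40-49", "occupation": "MP"},
    {"area": "SW1A", "age_band": "40-49", "occupation": "civil servant"},
    {"area": "SW1A", "age_band": "40-49", "occupation": "civil servant"},
]

QUASI_IDENTIFIERS = ("area", "age_band", "occupation")

combos = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
unique = [combo for combo, n in combos.items() if n == 1]

# Every unique combination is a candidate for jigsawing with outside data.
print(f"{len(unique)} of {len(combos)} combinations match exactly one record")
```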

Some of these will be very hard to answer, or even unknown to HMRC. They are the realm of specialists who devote their whole professional lives to this sort of question. It’s just like any other form of Information Security – you don’t know what you don’t know yet… so best ask someone who does nothing else. Actually, ask two people – or better still, 20.

When we launch a new website or service, or even maintain an existing one, the prudent amongst us employ the services of at least one security consultancy (sometimes many) to “penetration test” it. They use every technique they know to try and break in or break the service. Anonymised data should be no different – HMRC must test their data sets with as many third parties as possible, and they should make those results public to instil confidence.

The publication of anonymised tax records could be very useful for so many aspects of life, some commercial, some social – but the potential harm of doing it incorrectly is huge and the risk of doing so is high. HMRC would be wise to tread very carefully and walk very slowly into this one.

Not Every Cloud Has a Silver Lining.

Nor a suitable disaster recovery policy, it seems… ‘cloud’ computing has been something of a buzzword for a few years now, and large corporates are increasingly adopting it. Essentially, cloud computing means delivering something – a service, an application and so on – over the internet, usually implemented as a web application.

For corporates this has a number of advantages, mainly around cost. Cloud services are usually sold on a “pay as you use” basis, so you only pay for what you use; you don’t own the hardware, so have no capital costs; and the provider is responsible for maintaining the system, flexing with demand and, importantly, contingency.

I won’t comment here on the merits of cloud computing, or whether I think the benefits are real – you can be sure it’s here to stay, with big offerings from companies like Amazon, Google & Microsoft. I want to concentrate on the contingency.

Finding Out The Hard Way.

The provocation for this post is a recent failure at the UK-based cloud accounting service Clear Books, which I use for my accounting and have previously recommended to others. Over the weekend their database server died, and consequently so did their service. This post isn’t meant to be a bashing of them – they’ve reacted well – but it shouldn’t have happened. The full updates posted during the outage, which lasted all of Monday, are on their GetSatisfaction site.

[Image: Clear Books’ over-friendly error screen]

When you put your data in the “cloud” it’s easy to believe it’s safe – clouds are big things, after all! Unless you are very foolish, you aren’t going to place business-critical data in the hands of a company that doesn’t look professional. If their web site is terrible, or slow, or they seem a bit 1990s, then you just wouldn’t trust them. Trust is the operative word. Clear Books ticked those boxes for me…

… but on Monday I was reminded that in doing so I had chosen to put all my eggs in one basket, and if that basket broke I was doomed! There is only one person to blame for this – me. Although I had checked what their policies were, I had no idea what their infrastructure actually was. Of course, I also have no idea what infrastructure my Google Apps run on, or what Amazon uses for its S3 offering – but I trust them to manage it.

Fit For Purpose.

Had I known that all my data was residing on a single database server (albeit with RAID disks) I would have been alarmed. That’s just incredibly bad form when providing a commercial offering which holds business-critical data – cloud or otherwise. Real-time replication is intrinsic to modern database servers (and, unlike in years gone by, it adds next to no overhead), so the data should have been replicated. But you can’t do that on one server! It should at the very least have been clustered, so that in the event of a hardware failure another physical machine took over. By the sounds of it the disks are internal to the server too – more bad form. Customer data should be stored on a proper SAN from a reputable manufacturer: multiple disks, multiple paths to the disks from multiple servers, and so on.
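
By way of illustration – and to be clear, I have no knowledge of Clear Books’ actual stack – here is what the client side of a replicated PostgreSQL pair can look like, with placeholder host names and credentials, assuming libpq 10 or later:

```python
# Illustration only. With a PostgreSQL primary and a streaming replica,
# libpq (version 10+) will walk the host list and connect to whichever
# server currently accepts writes, so a promoted standby picks up the
# load. Host names and credentials below are placeholders.
import psycopg2  # thin wrapper that passes this DSN straight to libpq

conn = psycopg2.connect(
    "host=db-primary.example.com,db-standby.example.com "
    "port=5432 dbname=accounts user=app password=secret "
    "target_session_attrs=read-write"  # skip hosts that cannot take writes
)
```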

Enough of how I think it should have been set up – my point is that while the kit may have been fit for purpose, the disaster recovery plan was not… and this was a simple hardware failure. A similar-scale system I look after can fail a database node over with approximately 90 seconds of downtime – not 24+ hours!

So, if you’re thinking of going to the cloud with your data… make sure the provider has good architecture, and an even better disaster recovery plan.

The Future.

In fairness to Clear Books, who I really only want to use as an example (as well as venting a bit, I suppose!), and to their credit, they have been open and honest throughout. This is really important – they could have been less transparent, or blamed their partner Fubra more, but they’ve taken it on the chin and admitted their failings. They have promised to fix them and said they will remain open about the process. You can’t ask for more than that, and you can be damn sure Microsoft didn’t do the same when they had their cloud glitch in 2009.

http://getsatisfaction.com/clearbooks/topics/unexpected_downtime-19c9hf