Health and Transport along Data’s Cockup Boulevard

One of the things about data releases is that there are cockups. Even if we accept your argument that you’d never screw it up, what about the people who follow you, and the people who follow them? Or your predecessor?

In medConfidential’s usual health arena, those cockups tend to be cognitively uncomfortable, or include difficult tradeoffs, as do many decisions to do with people’s health. However, down the road at the Department for Transport, they have examples that have similar potential effects, but that are easier to talk about at parties.

Everyone knows what a train is and, while trains do crash, we have some idea of just how rare that actually is, and get on them daily anyway. For that reason, the examples in this blog post will look at transport, rather than health.

Finding your way to cockup boulevard

Our friends at the UK Anonymisation Network recently published a presentation on the process of anonymisation – mostly looking at the process that organisations should go through. (While the presentation was published in the context of open data, the rules apply for any data.) Full details are in the presentation and its accompanying documents – for the purposes of this post, the description and process in Section 2 is pretty good, within some constraints:

  • Describe your data situation
  • Know your data
  • Understand the use case
  • Understand the legal issues
  • Understand the issue of consent and your ethical obligations
  • Identify the processes you will need to assess disclosure risk
  • Identify the disclosure control processes that are relevant to your situation
  • Identify who your stakeholders are and plan how you will communicate
  • Plan what happens next after you have shared or released data
  • Plan what you will do if things go wrong

The last point is the kicker; this is hard. What happens when you cock it up? Or, if not you, your successor’s successor, who has less of an understanding of what the words actually mean than you do?

The whole process relies on those following the process having an understanding of not only what they’re doing, but the wider data environment in which they are operating. For many organisations, there is a fundamental denial of anything that’s even just outside their narrow silo, let alone the wider “environment”, and that’s going to get messy.

It doesn’t matter how good your SDC process is if you don’t care about the world as it is, rather than just how it would be convenient for it to be. Data, once released, cannot be un-released. Future releases may be stopped (with resultant damage to confidence in the data environment); however, the existing releases will still have been released. Under an Open Data License – which is necessary for arbitrary reuse – it is particularly difficult to get them back.

Some of these will be pure accidents.

Take as an example Transport for London, who run the “Boris bike” hire scheme, and who publish details of cycle hires – from where to where, and when. This is the data behind many of the pretty cycle hire maps you see.

The data published should be “a row identifier, the length of hire, the start time/date, a Bike ID, the Start Location, and the End Location”, thus:

Rental Id, Duration, Bike Id, End Date, EndStation Id, EndStation Name, Start Date, StartStation Id, StartStation Name
18884041,271,4313,02/01/2013 13:32,251,"Brushfield Street, Liverpool Street",02/01/2013 13:28,509,"Fore Street, Guildhall"

A significant amount of public benefit can come from such data being available; many different analyses have been done.

Sometimes the choice to release is deliberate. (The release of New York taxi trip data was a deliberate, if ill-considered, act.) But at some point last year, someone at Transport for London just made a mistake.

For a couple of months, TfL accidentally included the “hire key” ID, which is the identifier of the person who hired the bike. As such, it was possible to derive sensitive details using other data known about the various trips of individuals.
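
One cheap guard against this kind of slip is to check a file's header against an explicit list of approved columns before it goes out. The following is a minimal sketch, assuming the nine columns in the sample header above are the approved set. The function name and the "Hire Key" column label are hypothetical, and this is not a description of TfL's actual process or field names.

```python
import csv
import io

# Assumed allowlist: the nine column names from the sample header above.
APPROVED_COLUMNS = [
    "Rental Id", "Duration", "Bike Id", "End Date", "EndStation Id",
    "EndStation Name", "Start Date", "StartStation Id", "StartStation Name",
]

def unexpected_columns(csv_text, approved=APPROVED_COLUMNS):
    """Return the header columns that are not on the approved list."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)  # only the first row (the header) is inspected
    allowed = set(approved)
    return [name.strip() for name in header if name.strip() not in allowed]

good = (
    "Rental Id, Duration, Bike Id, End Date, EndStation Id, EndStation Name, "
    "Start Date, StartStation Id, StartStation Name\n"
    '18884041,271,4313,02/01/2013 13:32,251,"Brushfield Street, Liverpool Street",'
    '02/01/2013 13:28,509,"Fore Street, Guildhall"\n'
)

# Hypothetical bad release: an extra identifier column has crept into the header.
bad = good.replace("Rental Id,", "Rental Id, Hire Key,", 1)

print(unexpected_columns(good))
print(unexpected_columns(bad))
```

A check like this only catches columns that should not be there at all. It says nothing about whether the approved columns are themselves safe in combination with other data.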

Avoiding cockup boulevard altogether

Whether deliberate or accidental, such issues come from fundamental category errors. We see this a lot – such as people perceiving linked achievement data as a dataset about schools and teachers, without appreciating the crucial significance of it containing the life experiences of children. Some projects see doctors and nurses – people who, when they were aged about 13, decided to spend their life helping people – and consider that an exploitable resource for acquiring nice things.

It will become increasingly common to wrap such things in the banner of “data”, and claim the magic pixie dust will solve all. How likely is it that such category errors will be nowhere within your organisation, and never occur? Especially in a political bureaucracy where you have powerful individuals “masterminding” a programme without regard to the details?

It’s a good thing that the UKAN assessment process has cockup sections one and two.

What is Open Data?

Open data is data published for all to use, with no limit on purpose – which is why personal data cannot ever be open data, except for matters of public record (i.e. some legally-mandated details about people who have power or influence over others’ lives). When aggregated and properly treated, fully anonymised results about people – statistics – can and should be open data. However, any failure to follow a full and complete statistically valid process means you are actually publishing personal data.
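
As a rough illustration of what "aggregated" means here, the sketch below turns trip rows into counts per start station and withholds any count below a threshold. The trips are made up, the threshold of 5 and the function name are assumptions for illustration only, and small-count suppression on its own is one step, not a full and complete statistically valid process.

```python
from collections import Counter

def station_counts(trips, min_count=5):
    """Count trips per start station, withholding counts below min_count.

    min_count=5 is an illustrative assumption, not a recommended threshold.
    """
    counts = Counter(trip["StartStation Name"] for trip in trips)
    return {
        station: (n if n >= min_count else None)  # None marks a withheld count
        for station, n in counts.items()
    }

# Made-up trips, reusing the station names from the sample row earlier.
trips = (
    [{"StartStation Name": "Fore Street, Guildhall"}] * 6
    + [{"StartStation Name": "Brushfield Street, Liverpool Street"}] * 2
)

result = station_counts(trips)
print(result)
```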

In ethical practice, the only entity who can publish rich, detailed personal data on an individual is that individual themselves. It can only ever be something someone does themselves, and not something people do to them.

And broad, open-ended ‘consent’ just won’t cut it. Even if you get someone’s permission for a bunch of the good stuff you imagine doing with their data, it’ll be the bad stuff you haven’t thought of that someone else does that’ll screw you. And the people whose personal data you published. Depending on circumstances, this could be downright abusive or worse.

I may choose to post photos of my meals to Instagram; someone I don’t know choosing to post all my meals to Instagram is just creepy.


P.S. Good luck to Mike Bracken and Tom Steinberg in their future endeavours.
