medConfidential response to NHS England response to Sky News NHS security story and research by the Oxford Internet Institute

NHS England is still trying to justify in 2015 what it tried to sneak through in 2013. Has it learnt nothing?

Original Sky News piece and article;
NHS England briefing;
PharmaTimes repeats NHS England briefing.

Disclosure: Sam Smith of medConfidential sits on the Privacy Advisory Group for the Office of National Statistics’ (census replacement) Beyond 2011 & Big Data programmes, of which the expert academic at the Oxford Internet Institute interviewed by Sky News is also a member.

Does the database exist?

NHS England: “firstly, there is no database of information for the care.data programme yet”
NHS England: “confirmed that pilot schemes are starting again”
NHS England: “To access the data collected as part of care.data, applicants will need to…”

NHS England itself acknowledges, on a page named “our plans”: “for example, the hospital episode statistics (HES) service has been collating administrative information since the 1980s about every hospital admission funded by the NHS.”

So there are existing databases which are vulnerable to these problems and a new database is being built, it’s just not been built yet. (The ‘new’ specification in 2015 appears to be the same care.data specification from 2013 – with various ‘mistakes’ covering HIV, HPV, and AIDS codes corrected.)

Aspects of the existing data services are as concerning, if not more so, than the care.data proposals.

“A statement and briefing were provided to Sky by NHS England ahead of broadcast”

On Thursday evening, NHS England contacted medConfidential, having seen our tweet, to say they had commented to Sky News. But, as of Monday, the Sky News piece still contained no attributed quote or statement from NHS England. It has a quote from the programme director at HSCIC, not NHS England.

We don’t know the ins and outs of exactly who said what to who when but, yet again, it seems that NHS England is hiding behind another government body – the Health and Social Care Information Centre – to provide justifications that do not speak to the full consequences of its own future proposals.

HSCIC is a “creature of statute”, a body which in law may only do things as Directed, including by NHS England. NHS England is the puppeteer cowering behind the curtain, insisting the puppet’s the one at fault.

“this would be a criminal offence”

While ‘hacking’ into a database of medical information would indeed be a criminal offence, it is rather beside the point. It’s the the ‘Hollywood scenario’ of a remote attacker defeating NHS England’s defences with cunning from their back bedroom, or North Korean data terrorists launching an attack.

What is far more relevant is that copies of the data (HES, etc.) have been sold [1] to a whole range of organisations and companies, many of which continue to receive data. And there are no criminal sanctions for misuse of the data by the recipients or data breaches, which – despite previous denials [2] – we now know there have been [3].

NHS England is quite clear that confidential data is already being sent to places: “confidential data is always encrypted whilst in transmission and the secure networks used to transfer data are regularly tested and monitored for any vulnerabilities”. (Unless David Cameron succeeds in outlawing it, as he proposed last week.)

In the case of the Sky News piece, the researcher acted entirely ethically and correctly in using the information provided by the journalist – who had given full and informed consent, and was clearly aware of the risks. Those who would rather continue the status quo and placate, rather than inform, the public are less likely to explain all of the risks and mitigations to a journalist. And highly selective ‘explanations’ do not give the full picture.

Given the continuing distribution of 25 years of hospital records – over 1 billion dated events – this research identifies both the grave risk to the medical privacy of the country, and the continued wilful ignorance of NHS England.

1) On a “cost recovery” basis.
2) On BBC Radio 4’s Today programme, 4 February 2014, Tim Kelsey claimed “in 25 years there has never been a single episode in which the rules… have ever compromised a patient’s privacy.”
3) HSCIC’s FOI response on 7 April 2014 lists a data breach in every year from 2009 to 2012; HSCIC holds no records from before it was formed in 2005.

Where does the data go?

NHS England: “To access the data collected as part of care.data, applicants will need to go through an approvals process and then, during the pathfinder stage, can only see it in a secure data facility (SDF). During pathfinder stage, access applications will only be accepted from select organisations and there is a robust security procedure in place when the applicant visits the SDF.” [our emphasis]

The crucial point being, what about after the pathfinder stage? Where will applicants be able to “see” the data then?

Will NHS England revert to current practice, as for HES and other data, and permit copies of the data to be sent out? There’s little point constructing a “secure data facility” if it is not then used for all future access to the data.

If all NHS England will promise is to keep patients’ data in the SDF “during the pathfinder stage” then it is just a temporary safeguard, which can be removed for the full national roll-out.

So why won’t NHS England promise that patients’ data will always be kept in the secure data facility? It clearly wants to keep its options open – but if the intention is for data to be accessed in other ways in future, why aren’t patients and GPs being told? Given NHS England’s track record of miscommunication, trumpeting what actually amounts to a tightly time-limited conditional safeguard does very little to inspire confidence.

“NHS to carry on selling patient records to insurers” – Telegraph, 27 November 2014

NHS England: “credit rating agencies or health insurers would not be granted access to the NHS’ secure data facility where the information will be held.”

This may sound pretty definite, but can NHS England cite the precise part of legislation which provides the same level of certainty as that statement? We doubt it, because it has never previously been able to do so. NHS England argues the claim on the Telegraph front page was false, but has never provided any evidence to support its assertions. And we’ve asked, repeatedly.

In fact, the law remains mute on the types of companies that may have access to the data – it concentrates on uses – and the undefined phrase “for the promotion of health” leaves open loopholes for data access that even McDonalds or Big Tobacco might use. (Regulations that might begin to address this, for the Care Act passed in May, are still unpublished.)

Misunderstanding the ‘birthday attack’

PharmaTimes: “NHS England said the suggestion by Sky is incorrect, saying the likelihood of being able to identify an individual “is negligible”

NHS England is again misleading the public.

As an analogy, if you consider a classroom and pick two children at random it is highly unlikely – 1 in 133,225 (i.e. 365 x 365) – that they will both have a specific birthday. But if you walk into that same classroom of 23 children or more and ask “Do two of you share a birthday?” then the chances are better than 50-50 that the answer is yes.

Example 1: Know someone who had a heart attack?

Presume someone you know has had a heart attack.

NHS England has 181 A&E departments [4] handling England’s 386 heart attacks per day [5], so each A&E receives, on average, 2 heart attack victims per day. Which, even without any other information, gives a 50% probability of spontaneous identification of a victim whose hospital and date of event is known (neither should be sensitive on their own). As the OII research into the Sky News journalist argued, that is information that gets tweeted, as it is ‘not sensitive’.

Because the data is linked over time – ‘longitudinal’, to use the proper statistical term – discovery of a single medical event would mean you can use that pseudonym to link back to all of that person’s other medical events, because “the pseudonym is allocated to the record instead” (NHS England).

It doesn’t matter what the pseudonym is or what form it takes, what matters is that it links the records. The information associated with the date of the event is what gives you the link to a victim, not the NHS number or pseudonym.

NHS England is therefore being disingenuous when it says “once a patient’s record has been matched, the information that could identify a patient is removed and the pseudonym is allocated to the record instead” and that pseudonyms can be converted back to the original identifier “only by using the specific encryption key that created the pseudonym” and this is “only ever disclosed in very exceptional circumstances”.

Of course NHS England does not disclose the original identifier (NHS number). The key point that the researcher made, and that NHS England missed or continues to wilfully ignore, is that this is completely irrelevant.

And it shows that NHS England has learnt nothing from the concerns of the last year.

In February 2014, David Davis MP argued that knowing the dates he had his nose broken (due to media attention) would mean his entire medical record could be identified. NHS England has never refuted this argument with substance.

4) DH count. See https://www.whatdotheyknow.com/request/131933/response/325271/attach/3/Annex%20A%20Final.pdf
5) 141,000 per year in England: https://www.bhf.org.uk/publications/statistics/cardiovascular-disease-statistics-2014

Example 2: Women with children

NHS England seems to believe that your children’s birthdays are secret.

For example, by the HSCIC’s own rules, in HES the date and code for “Birth date – baby” is deemed identifiable, but the date and code for “maternity: where the baby was delivered” is not [6]. These are the same event, stored twice, but treated as if they are entirely different. Removing only one of them does not magically turn HES into non-personal data, and HES contains dozens – if not hundreds – of such fields.

Similarly, a family is identifiable by knowing the birthdays of the children. For a family of 2 children, there is a 90% likelihood that the birthdays of the two children are unique. For a family with 3 children, the children’s birth dates are almost certainly a unique identifier for that family in the country, tracked via the mother’s medical history.

On average, one set of twins are born in each maternity hospital in the UK per day. There are just 208 triplets born in the UK per year, i.e. fewer than one per day. If you know the birthdate of a triplet you could therefore read off the entire medical history of the mother via that single event.

6) For a single illustrative example, see HSCIC HES inpatient data dictionary, page 11, field: admimeth (and many, many others). This is only one method of delivery, others are equivalent.

Example 3: Who gets chemotherapy?

NHS England repeatedly argues that its care.data programme is necessary because “the NHS isn’t capable, currently, of telling you how many patients are undergoing chemotherapy, for example”.

In fact, the vast majority of chemotherapy is delivered in secondary, not primary care. Extracting data from GPs’ systems would provide no more information than is (or should already be) gathered from the actual providers. If you want to know who is receiving treatment, the most sensible choice is to go to the source of the treatment.

And to count the number of people, it is simply not necessary to know who they are – a count of unique identifiers is enough. NHS England is mandating the use of NHS numbers by care providers, and that mandate is in the process of being passed into law.

To count people, you need to know only that you’re counting non-duplicate entities. It does not matter whether you use names, physical people or their pseudonyms (e.g. telephone number, NHS number, or an arbitrary pseudonym).

Worked example 4: Don’t get into an accident

Relatively minor medical events of those in the public domain are often reported – how many women of a particular age reported to a particular hospital with an elbow injury, for example, the day that Nick Clegg’s wife broke her elbow in 2010, just before the general election? [7] – and even the most private of individuals can find themselves in the newspaper due to an accident.

Standard journalistic practice means that accidents reported in the local press will include the date of the event, a person’s name and age, along with the area of town – in some cases even the road – where the victim lives. Such reports usually provide enough information for an informed guess at likely diagnoses, which can then be matched with a particular incident. (With regard to example 2, the same would be true of someone announcing the birth of their triplets on Twitter or Facebook.)

An experiment by Professor Latanya Sweeney of the Harvard Data Laboratory starkly demonstrates the risks of matching within ‘de-identified’ data, i.e. data where some identifiers have been removed, rather than being replaced by pseudonyms.

Taking the US equivalent of HES – de-identified public hospital records for a state – and using articles in local news reports giving an indication of types of injury, her team was able to confirm that merely by being involved in an incident where you were taken to hospital, it was routinely possible to match to the victim’s entire hospital history, and discover details that even the patient had not told the hospital directly, but which had been discovered from their medical profile.

When contacted by the project, patients were horrified to find they could be identified and have their medical history exposed from the data made available.

7) https://www.google.com/search?q=nick+clegg+wife+election+elbow+broken

Pseudonyms

Identification isn’t just about finding someone’s name; it’s about linking an individual’s data records together so that you can learn things about them. If I know your home address, gender, date of birth, hair colour, eye colour, weight and telephone number, it doesn’t matter how many characters are in your database’s pseudonym – what matters is that I, and my data, can be (re)identified.

NHS England’s argument is bureaucratic obfuscation. It’s like saying that having a phone number doesn’t tell you who someone is and then blaming the patient for answering the phone with their name.

Or in another analogy, it’s the sort of approach that insists you have to know the name of the bug that bit you in order for it to matter. We don’t have many small poisonous bugs in England, but other places do. Small creatures have many names; they have their Latin classification, they have names in English, and in local areas they have names in local languages, etc. In short, they have many pseudonyms – but it’s all the same bug.

If you’re bitten by a poisonous bug, the sensible medical approach doesn’t care about its actual name but rather, by asking questions about its attributes – what colour was it? was it spotty or stripy? how many legs? any wings? – the care provider can work out the appropriate treatment. The name really doesn’t matter; what you care about is the antidote, a name you will care about far, far more! At best, whatever the bug is called may be a link between looking it up and how you cure the bite – but you really don’t need the name.

Attempting to make this all about pseudonyms seriously misses the point. The real problem is the linked individual-level data that the NHS has treated so egregiously badly in the past, which with this argument NHS England appears to continue to want to do.

In 1989 this was all new, and difficult. In 2015, there are no excuses.

In summary

NHS England’s scenario: “In the extremely unlikely event an individual was able to ‘hack’ the system, they would need the encryption key to convert back the coding” is a diversion.

The point is not that one can infer an individual’s identity from the linking pseudonym – taking the “100 character” pseudonym to “convert back the coding” – it’s that there is so much other data in the file that you don’t have to.

As detailed above, in the ‘Hollywood Scenario’ the chances of someone arbitrarily picking a row in a dataset and knowing who it is are slim. But, as PharmaTimes suggests, that’s the imaginary plotline for a movie, not real world protection of patients.

Can NHS England tell the difference? We suggest they listen to the experts who can.

For the rich, dated linked data about which NHS England has given no assurances regarding dissemination beyond the ‘pathfinder’ stage of care.data and using widely-available other information, as the researcher at OII and our by no means exhaustive examples show, there are many ways to identify people’s medical records in individual-level data – regardless of whether it has been pseudonymised (or de-identified).

That NHS England continues to try to mislead the public on this fundamental point in 2015 suggests the “pause” it took to “listen and understand” public concerns throughout 2014 was not enough. Continuing to hold onto and propagate the fantasy that pseudonymisation makes the possibility of re-identification “negligible” is either naïve or incompetent.

We’re not quite sure what’s worse.