Implementing Data Usage Reports

We introduced the concept of Data Usage Reports a year ago. Posting prototypes to officials unannounced led to a DH commitment for HSCIC to look at a roadmap for implementation.

Three weeks later, NHS England announced that they had done no work on implementing the care.data consent codes, and so transparency took a back seat to consent for most of the year. Not forgotten, not less important, just less urgent. Given that HSCIC only had two full-time people working on either issue, this priority was clearly correct (although the hordes of staff digging care.data in deeper suggested a political allocation of resources).

As HSCIC moves towards an announcement on consent implementation in the new year (we have sent them some questions), it’s time to look at what we’ve learnt in a year of discussions about Data Usage Reports. Most of it is relatively dense detail, but the final section is the one missing piece.


It is necessary to close the Data Trust Deficit. The last year of work on Data Usage Reports, looking at all the details, shows this is entirely achievable, where there is political will.

Restating the Principle

You should have a complete knowledge of how individual level data about you has been used or disseminated. Any individual should be able to freely read the outcomes of those projects, the new research, the new knowledge, that they contributed to creating.

It’s that simple.

The best person to notice and report data being used when it shouldn’t be is the individual whose data was used inappropriately. Additionally, Government has a cyber-security problem, in that it is required to defend many pools of bulk personal datasets but didn’t keep track of where they were; this will only exacerbate the data trust deficit. These are the same problem. In discussions where providers suggest that tracking data use will come with disproportionate cost, that is a helpful proxy for systems (or suppliers) where the data risk is disproportionately high.


In particular instances, there may be reasons reporting cannot be done today for particular systems – there is a great legacy of IT systems which don’t keep track – but there is no doubt that it should be done. There is no doubt that it can be done now for the largest areas of concern, moving towards completeness over time. As a simple matter of data hygiene, let alone cyber-security, those systems that can’t audit how they access or release data should be noted, with an associated schedule for replacement or upgrade. Organisations should be publicly accountable for driving those numbers towards zero, with a public commitment to full accountability.

Implementation

Within the NHS, the NHS number is now legally mandated for use by all services, which makes data linkage relatively simple. NHS organisations must use it, which makes reporting usage to HSCIC rather easy. Similarly, the NHS has the login point being the GP surgery, with an existing, functional, dissemination mechanism for usernames and passwords. Nothing we’ve seen this year raises any questions about that.

For the rest of Government, things are somewhat interesting.

If the political principle is stated that citizens should be able to know how individual level data about them has been used, existing laws allow for Data Usage Reports to be implemented on a consensual basis, using existing legislation (the Statistics and Registration Service Act – the flow to ONS is a (very) custom statistic). That is not to say we wouldn’t welcome a legal mandate that every citizen should be able to know how individual level data about them has been used or disseminated, and a requirement on all public sector data controllers to tell them. A legal mandate would also make the rules preventing abuse of the Reports themselves easier to create, but those also exist in the Data Protection Act already (the request is technically a novel custom Subject Access Request, fulfilled digitally, voluntarily, and for free).

When Data Usage Reporting moves from the NHS to the rest of Government, that will mean two separate reports (one for the NHS, one covering all of Verify). Given the interest from the private sector in this concept, we would expect this to begin to stimulate a market for Data Usage Reporting by reputable businesses and charities (should this happen, the treatment of such reports as results of Subject Access Requests will need extra consideration), and for reputable monitoring tools for citizens. Personal Data Stores are likely to be a necessary component.

The links between the various operational databases and the reporting should only be via per-system unique random identifiers. In reporting for each data flow, only the date and the identifiers accessed for that flow on that date are reported back – never a name or similar detail. Because the identifiers for the same individual are unique per system, cross-system links are impossible. When creating reports, if felt needed (as some have suggested), the generation process could ask for (and then discard prior to creation) some other identifiers, so that even the systems holding the archive of accesses can’t know which UUIDs relate to a single individual. This is less of a problem if there is one tightly protected silo of accesses, rather than pools of data lying around departments and APIs.
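As a minimal sketch of that per-system identifier scheme – all names here, and the 128-bit identifier length, are illustrative assumptions rather than a specification:

```python
import secrets


class SystemIdentifierMap:
    """One system's private mapping from a person to an opaque random identifier.

    The identifier is random, not derived (no hash of an NHS number or name),
    so two systems' identifiers for the same person share nothing and cannot
    be linked without access to both private mapping tables.
    """

    def __init__(self):
        self._ids = {}  # person key -> opaque per-system identifier

    def id_for(self, person_key: str) -> str:
        # Stable within this system; freshly random the first time it is needed.
        if person_key not in self._ids:
            self._ids[person_key] = secrets.token_hex(16)
        return self._ids[person_key]


# Two independent systems each hold their own mapping.
hospital = SystemIdentifierMap()
benefits = SystemIdentifierMap()

# The same individual gets unrelated identifiers in each system.
h_id = hospital.id_for("person-A")
b_id = benefits.id_for("person-A")
assert h_id != b_id
assert hospital.id_for("person-A") == h_id  # stable on repeat use

# A data-flow report carries only the date and the opaque identifiers
# accessed on that date - never a name or other direct identifier.
flow_report = {"date": "2015-12-01", "accessed": [h_id]}
```

Linking records across systems would require compromising both private mapping tables, which is the point: the reporting archive alone cannot be turned into a dossier.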

{this would really benefit from the diagram}

Don’t build a dossier on citizens

When engaging with the concept of implementing data usage reports, it is important to remember two fundamental principles:

  1. Build a report for citizens, not a dossier on citizens.
  2. Don’t facilitate new links between databases.

This will require strong legal defences around the report generation organisation, the kind of defences that already exist for the HSCIC and the Office for National Statistics.


Remembering the Policy intent

Most of the time, no one gives any thought to where their data has gone, or why. There should be no requirement for people to read the usage report, unless they choose to.

The most likely time for people to choose to read the report is following a data loss or similar incident. The model in which data usage reports are created must therefore be designed with this in mind. Usage will likely be very, very bursty, and those who are most interested will be the most concerned, not necessarily the most engaged.

In the NHS, that is simply facilitated by the HSCIC, which exists to do precisely that sort of work. For the rest of Government, that function should be provided by the ONS, using existing legal powers, or potentially a new power should Data Usage Reports be put on a statutory footing.

More widely, there is also a concern that there must be no way for any department to use this Reporting information for operational purposes. The best way for that to be enforced is for it not to be held by any operational department.

The National Statistician is also a natural counterpart to the Information Commissioner in a balanced governance framework. The ICO is restricted to considering what uses are legal; the National Statistician may decide that a use which is legal is not necessarily broadly beneficial. Meeting the policy intent of one silo often causes broad damage to other, more important agendas. As we saw with care.data, this perspective is often entirely lacking within a bureaucracy.

Such Government reports should be considered as critical infrastructure for the “cyber-” considerations of civil contingencies planning. Those reports will be most needed when the system is most stressed, and as such, this should be a major consideration in the technical architecture. It may be easier politically to have departments store information in their own silos, where they can change it retrospectively at will; but when a departmental system (or department) imitates TalkTalk and ends up on Pastebin, the primary technical response (unplugging it) mustn’t affect data usage reporting.

(In a Gov.UK/Verify context, it is easy to see how, once a day, standardised reports would be generated and encrypted using a user’s public key, so that when a user wishes to read their report, the display mechanism in Verify decrypts it using the private key on the way to the browser (or, given the performance implications, in the browser). Storing a complete and encrypted NHS data usage report for every patient would only take about 6TB in total – comparatively, not that much space.)
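The arithmetic behind that storage estimate can be checked directly; the patient count and per-report size below are illustrative assumptions rather than sourced figures:

```python
# Back-of-the-envelope sizing for one encrypted usage report per patient.
patients = 60_000_000            # roughly the NHS patient population (assumption)
report_size_bytes = 100 * 1024   # ~100 KB per encrypted report (assumption)

total_bytes = patients * report_size_bytes
total_tb = total_bytes / 10**12  # decimal terabytes

print(f"~{total_tb:.1f} TB")     # ~6.1 TB, of the order quoted above
```

Even doubling either assumption keeps the total within the reach of commodity storage, which is the point of the estimate.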

Into the long, long term

Many of the systems that this Government designs and builds will be in place for the next several decades. They may be replaced, but the agile, iterative approach is little different in these respects from the existing decades-long monolithic contracts.

When the Estonians designed their system in the early 90s, their national security concerns were radically different to what they have evolved into. The Chancellor of the Duchy of Middle Earth may enjoy the Shire today, but the orcs may appear and use these systems for alternate purposes, whether you fear Prime Minister Corbyn or President Trump. Data Usage Reporting may be a minimal defence, but transparency on data linkage is going to be increasingly vital as systems become more digital and more agile.

Individual-ish data (non-NHS)

Data Usage Reporting applies to all individual level data where, even through pseudonymisation or other mechanisms, the chain of ownership from the original data to the end user can be restored. This is true of all administrative data sources. While a data user may not know who the individuals are in the data set, the data custodian does, and usage and outputs can be passed back along the chain of custody.

There are a small number of individual level datasets where that does not apply – the most obvious example being ONS Surveys where they ask random people in the street (or phone random landlines). At that point, while there is individual level data involved, there is no way to link those to a Verify identity, and nor should there be, as no individual identifiers are collected.

Currently, while ONS does a great deal of good work and delivers a lot of anonymised data for secondary uses, the businesses and individuals who volunteer their time to complete those surveys have no mechanism to find out how the world became better as a result of them doing so. It is this feedback cycle that is an important side-effect of Data Usage Reporting.

For these cases, there should simply be an opt-in list of those datasets. Individuals at survey time should be told that, if they want to know how that dataset gets used (or they’re just interested in a particular survey, or want to opt in to the complete list), they may opt in to having those data uses included in their data usage report. That offer should be open to everyone. It would cover datasets such as the census anonymised individual level datasets, which should never be linked to a Verify identity.

The one big area of discussion:

Most of the above issues have not been that major; it’s mostly a question of perspective or (the lack of) political will. What has really prompted a passionate response is whether it should be a PDF.

Here is the pretty dashboard for a Government Data Usage Report, but a (properly designed) PDF is also vital for those who are most concerned – which is to say, the people who will make most use of this. Most people will just use the dashboard unless there’s a problem, but this is not just for most people, most of the time.

To steal a phrase from Baroness Lane-Fox, Data Usage Reporting should reach to the furthest first. Not just in terms of digital exclusion (assisted digital is vital), but in terms of Departmental inclusion, and in terms of providing information for those whose trust in Government handling of their individual level data has been most damaged by recent approaches. Good design is important, but something that people can print out, keep, and trust as much as a bank statement is also entirely necessary.

Egregious cockups around public data will continue until there’s leadership on a new approach

Data Usage Reporting came out of the wreckage of care.data. Given the Government’s plans for data usage, it may be that Reporting will come out of that wreckage, designed by those who saw care.data as a playbook, not a warning. It is data projects in secret that cause the most problems. Transparency drives data quality up (as citizens can see that errors mean that their data wasn’t accessed when it should have been, or accessed when it shouldn’t). It provides a feedback mechanism that allows for projects to correct in small increments, rather than exploding.

Most data handling in Government, and in industry, is no worse than that in the HSCIC in 2013. It just so happens that public expectations of the NHS meant that the issue got addressed there first.  All data is terrible, the main difference is the NHS has been more honest about it than most. The number of British passports issued to people born in the great country of Yorkshire might astound you (they misread the form).

It is deeply ironic, although not amongst particularly strong competition, that the Minister most accountable for how UK citizens’ data is used is Theresa May, with the requirement that she must know how MI5 uses bulk personal datasets. Our next newsletter will cover this topic in more detail.

Without a “Partridge Report” style process – a full accounting of the status quo – Government will not know what it is currently doing with data. Without knowing what is currently happening, it cannot get better.

Data Usage Reporting will need either high-level political leadership, or another data catastrophe. The one outstanding question is whether leadership will come from the unfortunate Minister who found a care.data project in their portfolio, or whether there is a Minister willing to lead from the front.

2016 will be interesting.


The 2016 update is now available: Your Records in Use – Where and When… – political will (or won’t) for telling you how your data has been used.