Data De-identification: Possibilities, Progress, and Perils

October 25, 2019

The process of de-identifying clinical data for use in research and quality improvement involves safeguarding protected health information (PHI) by removing or anonymizing patient identifiers. In this context, an “identifier” is anything that by itself or in a group is uniquely associated with a particular patient. An identifier might be a medical record number, a social security number (SSN), a home address, or a cell phone number. While removing obvious identifiers is relatively straightforward as specified by law, anonymizing patient data requires additional steps.

For example, it might involve transforming identifiers into random numbers that cannot be “mapped” back to individual patients. Dates can also be partial identifiers, since someone may know the date I visited the clinic; therefore, de-identification may include tactics such as shifting dates by as much as a year while maintaining the relative order of events. "Re-identification" is simply the process of recovering those IDs—in other words, figuring out who someone is in the data. 
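To make those two tactics concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `deidentify`, the `MRN-...` identifiers, and the one-year shift window are invented for this example and do not describe any particular institution's pipeline. Each patient gets a random token in place of the real ID and a single random date offset, so the gaps between that patient's visits are preserved.

```python
import secrets
from datetime import date, timedelta

def deidentify(visits, max_shift_days=365):
    """Replace patient IDs with random tokens and shift each patient's dates.

    visits: list of (patient_id, visit_date) tuples (hypothetical layout).
    The ID-to-token mapping lives only inside this call and is discarded,
    so the tokens cannot be mapped back to patients afterwards.
    """
    pseudonyms = {}  # real ID -> random token
    offsets = {}     # real ID -> one date shift per patient
    shifted = []
    for patient_id, visit_date in visits:
        if patient_id not in pseudonyms:
            pseudonyms[patient_id] = secrets.token_hex(8)
            # Uniform shift in [-max_shift_days, +max_shift_days]
            days = secrets.randbelow(2 * max_shift_days + 1) - max_shift_days
            offsets[patient_id] = timedelta(days=days)
        shifted.append((pseudonyms[patient_id], visit_date + offsets[patient_id]))
    return shifted

# Made-up example records
records = [
    ("MRN-001", date(2019, 1, 10)),
    ("MRN-001", date(2019, 3, 2)),
    ("MRN-002", date(2019, 2, 14)),
]
released = deidentify(records)
```

Because the same offset is applied to every date for a given patient, the interval between that patient's two visits is unchanged in `released`, even though neither date matches the original.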

So in de-identification, the goal is not only to remove an obvious identifier like an SSN, but to remove any fields that would make it possible to put a name to a record. Over the years, we’ve gotten better at removing identifiers, but we've also gotten better at figuring out who someone is anyway. In fact, there is now ample evidence that re-identification can be accomplished with relatively little effort and only a modest level of expertise. Below, we’ll take a look at some examples of this dynamic.

Latanya Sweeney and the Governor of Massachusetts’s Medical Records

In the mid-1990s, the State of Massachusetts released "de-identified" medical records, which in this case meant that patient IDs and names had been removed. However, in 1997 an MIT computer science graduate student named Latanya Sweeney showed that she could re-identify many people based only on information such as their sex, zip code, and birth date. She demonstrated this in dramatic fashion by faxing the governor of Massachusetts his medical record after re-identifying him. Sweeney would go on to faculty positions at Carnegie Mellon and Harvard, and her work is part of the reason that the category of PHI now includes more than just name, SSN, address, and phone number.
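The general idea behind this kind of re-identification is a linkage: join the "de-identified" records to some public list that still carries names, using the fields the two have in common. The sketch below is an illustration only. All records, names, and the `link` function are made up for this example, and it is not a reconstruction of Sweeney's actual procedure or data.

```python
from collections import defaultdict

# Fields present in both datasets (assumed for this example)
QUASI_IDENTIFIERS = ("sex", "zip", "birth_date")

def link(released_rows, public_rows):
    """Return {index of released row: name} for rows matching exactly one person."""
    def key(row):
        return tuple(row[field] for field in QUASI_IDENTIFIERS)

    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[key(person)].append(person["name"])

    matches = {}
    for i, row in enumerate(released_rows):
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # a unique match puts a name to the record
            matches[i] = candidates[0]
    return matches

# Made-up "de-identified" medical records: no names, no IDs
medical = [
    {"sex": "F", "zip": "02139", "birth_date": "1970-03-14", "diagnosis": "asthma"},
    {"sex": "M", "zip": "02139", "birth_date": "1982-11-02", "diagnosis": "diabetes"},
    {"sex": "F", "zip": "02140", "birth_date": "1970-03-14", "diagnosis": "hypertension"},
]

# Made-up public list that includes names
public = [
    {"name": "Pat Example", "sex": "F", "zip": "02139", "birth_date": "1970-03-14"},
    {"name": "Sam Sample", "sex": "M", "zip": "02139", "birth_date": "1982-11-02"},
    {"name": "Lee Placeholder", "sex": "M", "zip": "02139", "birth_date": "1982-11-02"},
    {"name": "Kim Demo", "sex": "M", "zip": "02141", "birth_date": "1990-01-01"},
]

reidentified = link(medical, public)
```

In this toy data, only the first medical record matches exactly one named person. The second matches two people, so it stays ambiguous, and the third matches no one.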

A lot of database research went into k-anonymity, which basically says you're safe from such a "re-identification" if the fields left in the data for you are exactly the same for at least k-1 other people. But this is still imperfect. What if all k of you also have some undesirable value (say, a bad credit score, or a stigmatized diagnosis that’s part of your health data)? In such a situation someone can still learn you have that undesirable value. The shortcomings of k-anonymity led to the development of l-diversity and other successor ideas, but all still suffered from similar imperfections.
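Both the definition and its weakness can be checked mechanically. The following is a rough sketch with made-up rows: the helper names `k_anonymity` and `homogeneous_groups`, the generalized zip code and age band columns, and the placeholder diagnoses are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Groups in which every row has the same sensitive value."""
    values = defaultdict(set)
    for r in rows:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return [group for group, seen in values.items() if len(seen) == 1]

# Made-up rows with generalized quasi-identifiers
rows = [
    {"zip": "021**", "age": "30-39", "diagnosis": "diagnosis_a"},
    {"zip": "021**", "age": "30-39", "diagnosis": "diagnosis_b"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diagnosis_x"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diagnosis_x"},
]

k = k_anonymity(rows, ("zip", "age"))
leaky = homogeneous_groups(rows, ("zip", "age"), "diagnosis")
```

Here the table is 2-anonymous, since every combination of zip and age band appears at least twice. Even so, anyone who knows you are in the 40-49 group learns your diagnosis without working out which row is yours.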

Netflix and AOL: Some Cautionary Tales

In 2006, Netflix embarked on a project designed both to advance their machine-learning algorithms and to help their bottom line. The company released data on how their users rated movies and offered a prize of 1 million dollars to anyone who could improve their accuracy at predicting user movie ratings (1-5 stars) by at least 10%. To protect customers’ privacy, names and other substantial personal data were removed. The Netflix Prize was hailed as a milestone in applied ML and was a huge marketing success, until two researchers from the University of Texas at Austin, Arvind Narayanan and Vitaly Shmatikov, showed that they could re-identify individual users anyway. Plans for the Netflix Prize 2 were eventually dropped.

Only a few months prior to the Netflix debacle, America Online (AOL) had committed a comparable blunder when it released a trove of sensitive user search data. As happened with the Netflix Prize, individual users were re-identified. The company received tons of bad press and had to apologize. All of these "re-identifications" carry a high risk of lawsuits because an individual can claim the company didn't honor its commitment to her/his privacy, or released personal data without individual approval.[1] Ever more surprising re-identifications continue to occur with regularity, with some of the most notable being identification of individuals from the DNA of their relatives or from MRIs of their brains.

The Ongoing Privacy Arms Race

There's a serious and consequential race going on between privacy protection and re-identification methods. Using encryption, it's possible for me to learn a model (say, one that predicts someone's stable warfarin dose from their genetics, diet, smoking, weight, etc.) without actually seeing anyone's unencrypted data. Nevertheless, there's still some risk that when I publish the learned model itself, it fits the patients in my training data better than some other population of patients, and so allows people to make inferences about those patients (such as their genotypes, if we know their actual warfarin dose). This is called a model inversion attack. There's not a lot of real worry over someone's CYP2C9 or VKORC1 genotype right now, but there might be in the future if we also associate them with, say, some negative personality trait, or an elevated risk that might affect health insurance status.

Besides encryption, another approach to safeguarding PHI is known as differential privacy. In this approach, we add noise to the data, so that no matter what other information we have, the answers to any query we ask of the data provably change very little whether I decide to be included in the database or not. Therefore, if I allow my data to be used, there's a provable limit on what anyone can learn about me—even just from a model arising from the data.
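One standard way to get this kind of guarantee for a simple counting query is the Laplace mechanism, which adds noise to the query's answer rather than to each record. The sketch below assumes that mechanism. The function names, the example rows, and the choice of epsilon (the privacy parameter, where smaller means more noise and more privacy) are illustrative, not a production implementation.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(rows, predicate, epsilon, rng=None):
    """Noisy count of rows satisfying predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is the usual choice for this query.
    """
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Made-up data: does each patient smoke?
patients = [{"smoker": i % 4 == 0} for i in range(100)]

def is_smoker(row):
    return row["smoker"]

noisy = private_count(patients, is_smoker, epsilon=0.5)
```

Any single answer is off by a few counts, which is what hides whether one particular person is in the data, while the noise averages out to zero, so aggregate patterns remain usable. Note that repeated queries consume privacy budget, which this sketch does not track.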

But even with all these protections, there's still some risk. If we learn something about patients in a particular state that indicates something negative about the health of the entire state, this could negatively affect every individual in the state. That's true even if we can't determine who's who in the data. 

In other words, re-identification of individual patients isn't the only risk of sharing data, but it's the one we’re all focused on. Re-identification rightly gets a lot of attention, because if I don't agree to a data release but someone re-identifies me in de-identified data, I've still lost my privacy (in the case of a medical record, I've given away my entire recorded clinical history). I might be willing to let a small group of researchers know my clinical history for research, but I might not want the whole world knowing it. Once we publicly release data, we can't ever grab it back, whereas in a limited release we still have some legal recourse.

This blog post is the first of two parts. In the next installment, we’ll look at some of the implications of data de-identification and re-identification when trying to leverage clinical data to integrate research and healthcare decision-making.

[1] Ohm P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Review. 2010;57:1701-1777.