This post is a part of our Bioethics in the News series.
By Tom Tomlinson, PhD
It was good news to learn last month that the “Golden State Killer” had at last been identified and apprehended. A very evil man gets what he deserves, and his victims and their families get some justice.
The story of how he was found, however, raised concerns in some quarters. The police had a good DNA sample from the crime scenes, which with other evidence supported the conclusion that the crimes were committed by the same person. But whose DNA was that? Answering that question took some clever detective work. Police uploaded the DNA files to a public genealogy website, GEDmatch, which soon reported other users of GEDmatch who were probably related to the killer. More ordinary police work did the rest.
Most of the concern was over the fact that the police submitted the DNA under a pseudonym, in order to make investigative use of a database whose members had signed up and provided their DNA only for genealogical purposes.

My interest in this story, however, is the way it both feeds and undermines a common narrative about our DNA—that it is uniquely identifying, and that therefore any uses of our DNA pose special threats to our privacy. As The New York Times expressed this idea, “it is beginning to dawn on consumers that even their most intimate digital data—their genetic profiles—may be passed around in ways they never intended.”
It’s true that a sample of DNA belongs uniquely to a particular individual. But the same is true of a fingerprint, a Social Security number, or an iris. More importantly, by themselves none of these pieces of information reveals who that unique individual is.
As the Golden State Killer story illustrates, it’s only when put in the context of other information that any of these admittedly unique markers becomes identifying. If the GEDmatch database contained nothing but genetic profiles, you could determine which genomes the killer was related to. But you’d have no idea who those genomes belonged to, and you’d be no closer to finding the killer.
Although an individual genome can’t by itself be identifying, it can provide a link that ties together different information sources which include that genome. It can then be that collection that points to an individual, or narrows the list of possibilities to increase the odds of identification, and the threats to privacy. Imagine the state police maintains a database of forensic DNA linked to records of criminal convictions, and provides that database to criminologists, stripped of any names or other direct identifiers. Imagine as well that one of the hospitals provides researchers with DNA from their patients along with their de-identified medical records (which can include patients’ age, race, first 3 ZIP numbers, and other demographic information).
If we put those together we can do some interesting research: use the DNA link to identify those who both committed various crimes and had a psychiatric history, so we can compare them to convicted felons without a psychiatric history.
But now it may take very little additional information to identify someone in that combined database and invade their privacy. If I’m a researcher (or hacker) who knows that my 56-year-old neighbor was convicted of assault, I can now also find out whether he has a record of psychiatric illness, and a lot more besides. What he had thought private is no longer so.
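For readers who think in code, the linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every record is invented, the `genome_id` field is a hypothetical stand-in for a matching DNA profile, and the function names `link_on_genome` and `candidates` are made up for this example rather than taken from any real system.

```python
# Made-up, "de-identified" sources. "genome_id" is a hypothetical stand-in
# for a DNA profile that appears in both databases.
forensic_db = [
    {"genome_id": "G1", "conviction": "assault"},
    {"genome_id": "G2", "conviction": "burglary"},
    {"genome_id": "G3", "conviction": "assault"},
]
hospital_db = [
    {"genome_id": "G1", "age": 56, "zip3": "488", "psychiatric_history": True},
    {"genome_id": "G3", "age": 31, "zip3": "488", "psychiatric_history": False},
    {"genome_id": "G4", "age": 56, "zip3": "482", "psychiatric_history": False},
]


def link_on_genome(forensic, hospital):
    """Join the two sources on the shared genome identifier."""
    hospital_by_genome = {row["genome_id"]: row for row in hospital}
    merged = []
    for row in forensic:
        match = hospital_by_genome.get(row["genome_id"])
        if match is not None:
            merged.append({**row, **match})
    return merged


def candidates(merged, **known_facts):
    """Keep only merged records consistent with what an outsider already knows."""
    return [
        row for row in merged
        if all(row.get(key) == value for key, value in known_facts.items())
    ]


merged = link_on_genome(forensic_db, hospital_db)

# The outsider knows two facts about the neighbor: his age and his conviction.
matches = candidates(merged, age=56, conviction="assault")
print(len(matches), matches[0]["psychiatric_history"])
```

Neither source contains a name, yet in this toy data two outside facts leave a single matching record, and with it the psychiatric history the hospital data was supposed to protect.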
The point of this somewhat fanciful example is that as more information is collected about us, from more sources, the threats to our privacy will increase, even if what’s contained in individual sources offers little or no chance of identification.
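The same point can be made numerically. The sketch below, again using invented records and a hypothetical helper named `unique_fraction`, counts what share of people in a small table are the only ones with a given combination of attributes, and shows how that share can rise as attributes are combined.

```python
from collections import Counter

# Made-up records for illustration only; no real data is used.
records = [
    {"zip3": "488", "age": 56, "sex": "M"},
    {"zip3": "488", "age": 56, "sex": "F"},
    {"zip3": "488", "age": 31, "sex": "M"},
    {"zip3": "482", "age": 56, "sex": "M"},
    {"zip3": "482", "age": 31, "sex": "F"},
    {"zip3": "482", "age": 31, "sex": "F"},
]


def unique_fraction(rows, attributes):
    """Share of rows whose combination of the given attributes occurs only once."""
    keys = [tuple(row[name] for name in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


for attributes in (["zip3"], ["zip3", "age"], ["zip3", "age", "sex"]):
    print(attributes, round(unique_fraction(records, attributes), 2))
```

In this toy table no one is unique on ZIP prefix alone, a third are unique once age is added, and two thirds are unique once sex is added as well. Real populations will behave differently in the details, but the direction of the effect is the concern.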
For this reason, the prospect of merging various data sources for “big data” health research will challenge the current research regulatory framework. Under both the current and the new rules (which haven’t yet gone into effect), the distinction between identifiable and non-identifiable research subjects is critical. Research using information that can be linked to an individual’s identity requires that person’s consent. To avoid this requirement, research data must be “de-identified”. De-identification is the regulatory backbone on which much of the current “big data” research relies, allowing the appropriation of patient medical records and specimens for use in research without consent. It also provides the regulatory basis for uploading the data collected in NIH-supported research into a large NIH-sponsored database, the database of Genotypes and Phenotypes (dbGaP), an upload that most NIH-supported genomic studies are required to make. Data from dbGaP can then be used by other researchers to address other research questions.
The possibilities of merging such “de-identified” databases together for research purposes will only increase, including facial recognition databases being collected online and on the street. As the mergers increase, it will become more difficult to claim that the people represented in those databases remain non-identifiable. As Lynch and Meyer point out in the Hastings Center Report, at this point there will be two choices. We can require that all such research will need at least broad consent, which will have to be reaffirmed every time a person’s data is used in new contexts that make identification possible. Or we will have to fundamentally reassess whether privacy can play any role at all in our research ethics, as the very idea of “privacy” evaporates in the panopticon of everyday surveillance.
Tom Tomlinson, PhD, is Director and Professor in the Center for Ethics and Humanities in the Life Sciences in the College of Human Medicine, and Professor in the Department of Philosophy at Michigan State University.
Join the discussion! Your comments and responses to this commentary are welcomed. The author will respond to all comments made by Thursday, July 12, 2018. With your participation, we hope to create discussions rich with insights from diverse perspectives.
References
- Coded Private Information or Specimens Use in Research, Guidance (2008). U.S. Department of Health & Human Services. https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-involving-coded-private-information/index.html. Published December 19, 2016. Accessed June 21, 2018.
- Database of Genotypes and Phenotypes (dbGaP). National Center for Biotechnology Information, U.S. National Library of Medicine. https://www.ncbi.nlm.nih.gov/gap. Accessed June 21, 2018.
- How To Upload your DNA test results to Gedmatch for FREE. Your DNA Guide. https://www.yourdnaguide.com/upload-to-gedmatch/. Accessed June 21, 2018.
- Iris recognition. Wikipedia. https://en.wikipedia.org/wiki/Iris_recognition. Published June 20, 2018. Accessed June 21, 2018.
- Johnson G. Amazon urged not to sell facial recognition tool to police. Associated Press. https://apnews.com/5bd7883d7a1f4cf78ed52faf6641cbd0. Published May 23, 2018. Accessed June 21, 2018.
- Kolata G, Murphy H. The Golden State Killer Is Tracked Through a Thicket of DNA, and Experts Shudder. New York Times. https://nyti.ms/2vSD2nF. Published April 27, 2018. Accessed June 21, 2018.
- Lynch HF, Meyer MN. Regulating Research with Biospecimens under the Revised Common Rule. Hastings Center Report. May 2017;47(3):3-4. https://www.ncbi.nlm.nih.gov/pubmed/28543413.
- McMullan T. What does the panopticon mean in the age of digital surveillance? The Guardian. https://www.theguardian.com/technology/2015/jul/23/panopticon-digital-surveillance-jeremy-bentham. Published July 23, 2015. Accessed June 21, 2018.
- Revised Common Rule. U.S. Department of Health & Human Services. https://www.hhs.gov/ohrp/regulations-and-policy/regulations/finalized-revisions-common-rule/index.html. Published January 19, 2017. Accessed June 21, 2018.
- Vaas L. Facebook can’t wiggle out of facial recognition lawsuit, judge says. Naked Security. https://nakedsecurity.sophos.com/2018/05/16/facebook-cant-wiggle-out-of-facial-recognition-lawsuit-judge-says/. Published May 16, 2018. Accessed June 21, 2018.
“Imagine as well that one of the hospitals provides researchers with DNA from their patients along with their de-identified medical records (which can include patients’ age, race, first 3 ZIP numbers, and other demographic information).”
Latanya Sweeney found, among other things, that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.
Thanks for this. I hadn’t seen it before, but it’s a very nice example of how the odds of identification grow as different datasets are combined, making what had been de-identified data anything but.