Online Genealogy Databases: What Could Go Wrong?
I’m lucky that my mother’s cousin spent years tracking down our family’s ancestry, including photographs, news clippings, and really detailed lineages. But what if he hadn’t, and I got the bug to research my family’s past? Would I turn to an online site? What about companies that would analyze my DNA and send me back information about my ethnicity, where my ancestors lived, and maybe help me connect with new relatives? What are the implications of creating a massive online database full of people’s genetic information? What might happen if someone combined that information with anonymous results from a research study in which pieces of DNA information were revealed? Could the person in the research study be identified?
A group of researchers at MIT and Harvard was interested in this very question. They were concerned that, although the requirements for keeping people’s identities private were met both in research studies and on genealogy websites, the sheer amount of data housed in these massive databases, combined with the uniqueness of DNA data, could be used to reveal the identity of participants in scientific studies.
The MIT/Harvard study focused on the data contained in short tandem repeats on the Y chromosome of male DNA (Y-STRs). This is the kind of information that is often used in forensics, paternity testing, and genealogical DNA testing. These regions of DNA contain short sequences of bases that are repeated a variable number of times, and the number of repeats is specific to an individual. When researchers compare these regions, the more similar the repeat counts are, the more likely two people are to be related. Amazingly, when enough of these regions are examined, the pattern can point to a specific geographic origin for a person’s ancestors. Typically, genealogy websites return results that include a list of surnames (the family name or last name, typically inherited from the father) associated with the pattern of DNA, along with information about the patrilineal line (descent through the male line), including geographical locations, potential spelling variants of the last name, and pedigrees.
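To make the idea of "comparing these regions" concrete, here is a minimal Python sketch of how two Y-STR profiles might be compared. It is not the method the researchers or any genealogy website actually uses. The marker names are commonly used Y-STR markers, but the repeat counts and both function names are made up for illustration.

```python
# A Y-STR profile is modeled here as a mapping from marker name to repeat count.
# The repeat counts below are invented for illustration only.

def shared_fraction(a, b):
    """Fraction of markers typed in both profiles that have identical repeat counts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    same = sum(1 for marker in common if a[marker] == b[marker])
    return same / len(common)

def repeat_distance(a, b):
    """Sum of absolute differences in repeat counts over the shared markers."""
    common = set(a) & set(b)
    return sum(abs(a[marker] - b[marker]) for marker in common)

query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
close = {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}
far = {"DYS19": 16, "DYS390": 21, "DYS391": 10, "DYS393": 12}

print(shared_fraction(query, close), repeat_distance(query, close))  # 0.75 1
print(shared_fraction(query, far), repeat_distance(query, far))      # 0.0 7
```

The profile that differs at only one marker scores as a much closer match than the one that differs everywhere, which is the intuition behind "the more similar they are, the more likely two people are to be related."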
The first thing the researchers did was test these databases. Using Y-STR data from 911 individuals with known surnames, they tested whether the websites would match the 911 DNA samples with the same or similar surnames in the websites’ databases. Using just this DNA data, they were able to identify the correct surname 12% of the time. The researchers then wanted to see if it was possible to improve these results by using additional information about the individuals. They chose two pieces of data that are not protected by the United States Health Insurance Portability and Accountability Act (HIPAA).
This act, signed into law in 1996, prohibits the release of Protected Health Information that could result in a person being identified from their medical records. To protect patients, things like your name, any geographical identifier smaller than a state, the day and month of your birth (but not the year), phone numbers, Social Security number, and a number of other types of information that would allow you to be specifically identified must be kept private on all health-related documents. Data is de-identified, or anonymized, in order to protect participants in research projects that collect health data (including DNA).
The two pieces of information the researchers chose in addition to the Y-STR DNA data were the person’s year of birth and their state of residence, neither of which is protected by HIPAA. They ran a simulation using various online public record search engines and U.S. Census data and found that searching for year of birth and state together would return at least 60,000 results (people who could potentially match that year of birth and state). When a surname was added to the search, the number dropped to only 12 males. That is few enough people to look up one by one if you were trying to track down a single individual.
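The narrowing effect of adding a surname can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the researchers' simulation: the `Record` type, the `narrow` function, and every record below are hypothetical, and none of them describe real people.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    full_name: str
    surname: str
    birth_year: int
    state: str

def narrow(records, birth_year, state, surname=None):
    """Keep records matching year of birth and state, and optionally a surname."""
    matches = [r for r in records if r.birth_year == birth_year and r.state == state]
    if surname is not None:
        matches = [r for r in matches if r.surname.lower() == surname.lower()]
    return matches

# Invented records for illustration only.
records = [
    Record("Alex Example", "Example", 1950, "OH"),
    Record("Blake Example", "Example", 1950, "OH"),
    Record("Casey Sample", "Sample", 1950, "OH"),
    Record("Drew Example", "Example", 1962, "OH"),
    Record("Emery Example", "Example", 1950, "TX"),
]

by_year_state = narrow(records, 1950, "OH")
with_surname = narrow(records, 1950, "OH", surname="Example")
print(len(by_year_state), len(with_surname))  # 3 2
```

In this toy data set the candidate pool only shrinks from three to two, but the same filter applied to real public records is what takes tens of thousands of candidates down to about a dozen.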
To tie all of this together with the DNA data, they conducted two more experiments. First, they had one of their male colleagues from the lab submit a DNA sample to a genealogy service, which then added it to the genealogy website’s database. This is the standard procedure for these online database companies. At the same time, the researchers sequenced his DNA in their lab and used their results to search the online genealogy database. Their search returned the colleague’s entry as a top record, indicating that data generated in a research lab like theirs would match the data produced by the company’s DNA analysis. The second experiment started with the researchers accessing information from the National Center for Biotechnology Information (NCBI) archives. This is a public website that has a small number of genomes from identified individuals.
They tested the DNA data of three of these individuals against a genealogy website. For two of them, who had the common last names Snyder and West, searching the genealogy website’s database with their DNA data did not return the same surname. But when the researchers searched for the third, whose last name was Venter, the search returned only 33 results, eight of which also had the last name Venter. To take this further, the researchers wanted to see whether they could identify their specific Venter in online public records using just the last name Venter, his year of birth, and his state of residence. This search came up with two matches, one of which was the male in the archive. This shows that by using DNA data along with just two pieces of information that are not regulated by HIPAA, it is possible to identify an individual using only online resources, including genealogy websites and public records. In fact, it took only 3 to 7 hours to re-identify an individual!
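One simple way to think about how a list of database hits points to a surname is to tally the surnames among the hits. The Python sketch below uses the counts from the Venter search (33 results, eight named Venter). The other 25 surnames are not given here, so they are filled in with placeholders, and `top_surname` is a hypothetical helper rather than anything the researchers describe.

```python
from collections import Counter

def top_surname(hit_surnames):
    """Return (surname, share) for the most frequent surname among the hits."""
    if not hit_surnames:
        return None, 0.0
    surname, count = Counter(hit_surnames).most_common(1)[0]
    return surname, count / len(hit_surnames)

# Eight hits share one surname; the other 25 are placeholders, each distinct.
hits = ["Venter"] * 8 + [f"Placeholder{i}" for i in range(25)]

surname, share = top_surname(hits)
print(surname, round(share, 2))  # Venter 0.24
```

Even though fewer than a quarter of the hits carry the name, it stands out clearly against the rest, and that candidate surname is what then gets combined with year of birth and state.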
So what do the results of this study mean?
It isn’t realistic to try to stem the flow of people’s information onto the Internet. Thousands of genetic records are added every month by enthusiastic people hoping to find out more about their family’s origins. At the same time, the science behind DNA sequencing and the software used to compare sequences are also advancing, allowing for better and closer matches. This means that as it becomes easier and cheaper to sequence larger portions of people’s DNA, it will take only very small differences in DNA for these computer programs to find relatives and identify potential surnames.
The authors of this paper don’t believe it is possible to stop people from putting more information out there, or to recall all the information that is currently available. They feel, and I agree, that in addition to preventing people from researching their personal histories, this would also hamper scientific progress. They suggest that a more reasonable and feasible solution is to establish clear policies for data sharing. This would include educating the public and participants in studies about the benefits and risks of sharing genetic information. There should also be more legislation regarding the proper use of genetic information.
If you have heard the term “Big Data” before and wondered if and how it might apply to you, this is one example that seems to run in everyone’s family.
The image is from Wikimedia Commons, and is a faithful photographic reproduction of a two-dimensional, public domain work of art.
This blog does not necessarily reflect the views of AAAS, its Council, Board of Directors, officers, or members. AAAS is not responsible for the accuracy of this material. AAAS has made this material available as a public service, but this does not constitute endorsement by the association.