Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

01/24/2020 ∙ by Kerem Ayoz, et al. ∙ Bilkent University, Case Western Reserve University

Sharing genome data in a privacy-preserving way is a major bottleneck for the scientific progress promised by the big data era in genomics. A community-driven protocol, the genomic data-sharing beacon protocol, has been widely adopted for sharing genomic data. The system aims to provide a secure, easy-to-implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye and hair color). Moreover, we show how a genome reconstructed from a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks against beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon, and thus help beacon operators and participants make informed decisions.

1. Introduction

With plummeting sequencing costs, we look forward to reaching a capacity of sequencing one billion individuals over the next 15-20 years, resulting in the availability of very large genomic datasets (Schatz, 2015; Collins and Varmus, 2015; Ledford, 2016). Although such large datasets promise a revolution in medicine, numerous studies have shown that it is not straightforward to ensure the anonymity of the participants in such datasets (Homer et al., 2008a; Sankararaman et al., 2009; Jacobs et al., 2009; Visscher and Hill, 2009; Clayton, 2010).

The human genome is the utmost personal identifier, and sharing genomic data for research while preserving the privacy of individuals has long challenged many different fields (e.g., medicine, bioinformatics, computer science, law, and ethics) due to possibly dire ethical, monetary, and legal consequences. To address this challenge and create frameworks and standards to enable the responsible, voluntary, and secure sharing of genomic data, the Global Alliance for Genomics and Health (GA4GH) was formed by the community (ga4, 2020). The current genomic data-sharing standard of the GA4GH is called the genomic data-sharing beacon. Beacons are the gateways that let users (researchers) and data owners exchange information without, in theory, disclosing any personal information. A user who wants to apply for access to a dataset can learn whether individuals with specific alleles (nucleotides) of interest are present in the beacon through an online interface. That is, a user can submit a query, asking whether a genome exists in the beacon with a certain nucleotide at a certain position, and the beacon answers “yes” or “no”. If the dataset does not contain the desired genome, genomic data is not shared and distributed unnecessarily. In addition, researchers do not have to go through the paperwork to obtain a dataset that will not be helpful for their research. The GA4GH provides a shared beacon interface (bea, 2020) that, as of September 2019, provides access to 80 beacons and acts as a hub where researchers and data owners meet.

Beacons are typically associated with a particular sensitive phenotype (e.g., the SFARI beacon, which hosts individuals with autism). Therefore, the presence of an individual in a particular beacon is considered privacy-sensitive information, and the main aim of beacons is to protect this information. An attacker, using the responses of a beacon and the genomic data of a victim, may try to infer the membership of the victim in a particular beacon by running a membership inference attack. The beacon framework sets a barrier against membership inference attacks by allowing only presence/absence queries for variants and not tying any response to any specific individual. In that sense, beacons are considered to have stronger privacy measures compared to other statistical genomic databases. Despite these barriers, several works have shown that beacons are not bulletproof and that they are vulnerable to membership inference attacks (Shringarpure and Bustamante, 2015; Raisaro et al., 2016; von Thenen et al., 2018).

However, threats against genomic data-sharing beacons are not limited to membership inference attacks. In this paper, for the first time, we identify and analyze the vulnerability of genomic data-sharing beacons for the “genome reconstruction” attack. We consider a scenario, in which the attacker knows the membership of a victim to a beacon that may not be associated with a sensitive phenotype. Then, we show how the attacker can accurately infer the genome of the victim by using the beacon responses. Such an attack may result in serious consequences if the attacker uses the reconstructed genome to infer sensitive information (e.g., disease diagnosis) about the victim or to infer the victim’s membership to another statistical genomic database of interest (e.g., another beacon that is associated with a sensitive phenotype). In particular, we show how the attacker can use the inherent correlations in the genome to run such an attack in an efficient and accurate way compared to a baseline approach. We also show how clustering techniques can be used to further improve the accuracy of such an attack.

Previous works in the literature assume beacons are static and do not change over time. However, beacons are dynamic datasets (donors join and leave) and this results in an increased risk for the genome reconstruction attack. Thus, for the first time, we consider the beacons as dynamic databases and formulate the genome reconstruction attack accordingly. In such a genome reconstruction attack, the attacker reconstructs all or a subset of the genomes in the beacon. Among the reconstructed genomes, it is not trivial to infer which one belongs to the victim. Therefore, we also show how the attacker can identify the victim’s genome among the set of reconstructed genomes using a set of visible phenotypes (physical characteristics) of the victim, which is public information. Finally, to show one of the consequences of the identified genome reconstruction attack, we show how the attacker can utilize the outcome of this attack to initiate a membership inference attack against the same victim in another beacon, which can be associated with a sensitive phenotype. To do this, we combine the identified genome reconstruction attack with the membership inference attacks against beacons from the literature.

We implement and evaluate the identified vulnerability using real genome data obtained from the OpenSNP (Greshake et al., 2014) and HapMap (Consortium et al., 2003) datasets. We particularly evaluate the success of the attacker in reconstructing a victim’s point mutations that include at least one rare nucleotide (i.e., minor allele), since minor alleles (i) reveal sensitive attributes of individuals (e.g., predispositions to privacy-sensitive diseases); and (ii) provide rich information to the attacker for membership inference attacks (Raisaro et al., 2016; von Thenen et al., 2018). We show that the precision and recall of reconstruction reach up to 0.9 (each) when the size of the beacon is increased by […] and the victim is one of the newcomers. Even when the size of the beacon is increased significantly (by […]), precision drops only to 0.8 and recall to 0.5. Furthermore, our results show that when more than one individual is added to the beacon, the attacker can accurately pinpoint the victim’s reconstructed genome by matching the victim’s phenotypical characteristics to the reconstructed genomes using machine learning algorithms. We also show via experiments that the outcome of the genome reconstruction attack can be accurately used for a membership inference attack on another beacon, and that it helps an attacker infer the membership of a victim with only a few queries.

This study clearly shows that privacy risks for genomic data-sharing beacons are much more severe than perceived. This is particularly important since the number of beacon participants, and hence the privacy risk of individuals, increases rapidly. Thus, we believe that the implications of this work will be substantial. The rest of the paper is organized as follows. In the next section, we summarize the related work in genomic privacy. In Section 3, we provide background information about genomics and membership inference attacks against beacons. In Sections 4 and 5, we introduce the system and threat models. In Section 6, we provide the details of the identified vulnerability. In Section 7, we evaluate the identified vulnerability using real genomic datasets. In Section 8, we discuss our main findings and potential mitigation techniques. Finally, we conclude the paper in Section 9.

2. Related Work

Genomic privacy has recently been explored by many studies (Erlich and Narayanan, 2014; Naveed et al., 2015; Ayday et al., 2013a). In the following subsections, we will summarize existing work on privacy in statistical genomic databases, inference attacks, and privacy of genomic data-sharing beacons.

2.1. Privacy in Statistical Genomic Databases and Inference Attacks on Genomic Privacy

Several works have shown that anonymization does not effectively protect the privacy of genomic data (Gitschier, 2009; Gymrek et al., 2013; Hayden, 2013; Malin and Sweeney, 2004; Sweeney et al., 2013; Lin et al., 2004; Kale et al., 2017). It has been shown that the identity of a participant of a genomic study can be revealed by using a second sample (e.g., part of the DNA information from the individual) and the results of the clinical study (Homer et al., 2008b; Wang et al., 2009; Im et al., 2012; Clayton, 2010; Zhou et al., 2011). The concept of differential privacy (DP) (Dwork, 2006) has been frequently used to mitigate membership inference attacks when releasing summary statistics from genomic databases. Fienberg et al. used the DP concept for sharing statistics, such as minor allele frequencies and chi-square values (Fienberg et al., 2011). Yu et al. extended this work and presented a scalable algorithm for an arbitrary number of point mutations (single nucleotide polymorphisms, SNPs) (Yu et al., 2014). Johnson and Shmatikov proposed using the exponential mechanism for the computation and release of statistics about a genomic database (Johnson and Shmatikov, 2013). Tramer et al. also studied the tradeoff between privacy and utility provided by DP (Tramer et al., 2015). Compared to statistical databases, genomic data-sharing beacons have stronger privacy measures since they only allow presence/absence (or yes/no) queries for variants.

Humbert et al. proposed an inference attack on kin genomic privacy using the family ties between individuals, pairwise correlations between the SNPs, and publicly available statistics about DNA (Humbert et al., 2013). Then, Deznabi et al. demonstrated that stronger inference techniques can be generated by combining high-order correlations and family ties (Deznabi et al., 2018). Furthermore, several studies have examined phenotype prediction from genomic data as a means of tracing identity (Humbert et al., 2015a; Lippert et al., 2017; Kayser and de Knijff, 2011; Zubakov et al., 2010; Ou et al., 2012; Allen et al., 2010; Manning et al., 2012; Walsh et al., 2011; Claes et al., 2014; Liu et al., 2012). To mitigate such attribute inference attacks, besides DP-based solutions (to release genomic data), cryptographic solutions have also been proposed to perform some operations on genomic data in a privacy-preserving way. Existing cryptographic solutions mainly focus on (i) private pattern-matching and the comparison of genomic sequences (Troncoso-Pastoriza et al., 2007; De Cristofaro et al., 2013; Jha et al., 2008; Blanton et al., 2012; Naveed et al., 2014), and (ii) privacy-preserving personalized medicine (Baldi et al., 2011; Ayday et al., 2013b). In this work, we identify and analyze a different type of attribute inference attack, particularly against genomic data-sharing beacons.

2.2. Privacy in Genomic Data Sharing Beacons

Researchers showed that the presence (membership) of an individual in a genome-sharing beacon can be inferred by repeatedly querying the beacon. Shringarpure and Bustamante introduced a likelihood-ratio test (LRT) that can predict whether an individual is in the beacon by querying the beacon for multiple SNPs of a victim (Shringarpure and Bustamante, 2015). Note that inferring the membership of an individual in a beacon that is associated with a sensitive phenotype is equivalent to uncovering the sensitive phenotype of the victim. Then, Raisaro et al. showed that if the attacker first queries the SNPs with low minor allele frequency (MAF) values, it needs fewer queries for a successful attack (Raisaro et al., 2016). Later, von Thenen et al. showed that even if the attacker does not have the victim’s low-MAF SNPs, it is still possible to infer membership by exploiting the correlations in the genome (von Thenen et al., 2018). Furthermore, they showed that beacon responses can also be inferred using such correlations (via a query inference, or QI, attack). In an orthogonal work, Hagestedt et al. hypothesized that while current beacon systems are limited to genomic data, in the near future the community is going to need a similar system for other biomedical data types. They proposed a beacon system for sharing DNA methylation data (an epigenetic mechanism that regulates transcriptional activity) and then showed that it is possible to successfully launch a membership inference attack against this system. They also proposed a DP-based solution in their MBeacon system. The approach retains utility by adjusting the noise level for high-risk methylation regions that might leak phenotypic information (i.e., regions that are related to disease).

Contribution of this paper. In this paper, we identify and analyze a genome reconstruction attack against genomic data-sharing beacons by particularly exploiting the information leaked due to beacon updates and the correlations between point mutations. So far, all works in the literature have focused on membership inference attacks against genomic data-sharing beacons. To the best of our knowledge, this is the first work that identifies, thoroughly analyzes, and shows the consequences of the genome reconstruction attack against beacons. Furthermore, as opposed to existing work (which only considers a snapshot of the beacon), we show the privacy risk in dynamic beacons, in which new donors may join or existing donors may leave.

3. Background

In this section, we provide background information on genomics and also on membership inference attack against beacons (that we use in Section 6.5).

3.1. Genomics Background

Single nucleotide polymorphism (SNP) is the most common source of variation in the human genome. A SNP is a point mutation (e.g., substitution of a single nucleotide in the genome: A, T, C, or G) and there are around 50 million known SNPs in the human genome (sit, 2020a). The alternative nucleotides for each locus (SNP position) are called alleles. The major allele is the most frequently observed nucleotide for a SNP position and the minor allele is the rare nucleotide (i.e., the second most common). The frequency (or probability) of observing the minor allele at a SNP position is called the minor allele frequency (MAF) of that SNP. The human genome has two copies of each locus (one per chromosome) and a SNP can be represented in terms of the number of its minor alleles (i.e., 0 for homozygous major, 1 for heterozygous, or 2 for homozygous minor). SNPs in the human population are inherently correlated and this correlation model may change for different populations. Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci. If two SNPs are in LD, they are correlated and co-occur more frequently than expected. Some SNPs are pathogenic and cause genetic diseases (sit, 2020b) and hence, they may carry sensitive information regarding individuals’ health conditions. As discussed in Section 2, most existing works in the genomic privacy literature focus on the protection of SNPs to prevent the risk of genetic discrimination.
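The 0/1/2 genotype encoding and the MAF definition can be illustrated with a minimal Python sketch. The function names and the toy genotypes are hypothetical and chosen only for illustration; the MAF estimate simply counts minor alleles over the two chromosome copies of each individual.

```python
def encode_genotype(allele_pair, minor_allele):
    """Return the number of minor alleles (0, 1, or 2) in a two-allele genotype."""
    return sum(1 for allele in allele_pair if allele == minor_allele)


def minor_allele_frequency(genotypes):
    """Estimate the MAF from 0/1/2 genotype codes (two chromosome copies per person)."""
    if not genotypes:
        raise ValueError("need at least one genotype")
    return sum(genotypes) / (2 * len(genotypes))


# Toy locus where C is the major allele and T is the minor allele (illustrative data).
genotypes = [encode_genotype(pair, "T") for pair in ["CC", "CT", "CC", "TT", "CC"]]
maf = minor_allele_frequency(genotypes)
print(genotypes)  # [0, 1, 0, 2, 0]
print(maf)        # 0.3
```

Here three of the ten chromosome copies carry the minor allele, which gives a MAF of 0.3.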

3.2. Membership Inference Attack Against Genomic Data-Sharing Beacons

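In (Raisaro et al., 2016), Raisaro et al. introduce the Optimal attack, in which the attacker constructs a set of candidate SNPs to be queried and submits queries starting from the SNP with the lowest MAF. Let the null hypothesis ($H_0$) refer to the case in which the queried genome is not in the beacon and the alternative hypothesis ($H_1$) be the case in which the queried genome is a member of the beacon. In (Raisaro et al., 2016), the log-likelihoods ($L$) under the null and alternative hypotheses are given as follows.

$$L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - D_N^i\right) + (1 - x_i) \log\left(D_N^i\right) \right] \tag{1}$$

$$L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - \delta D_{N-1}^i\right) + (1 - x_i) \log\left(\delta D_{N-1}^i\right) \right] \tag{2}$$

where $R$ is the response set, $x_i$ is the answer of the beacon to the query at position $i$ (1 for “yes”, 0 for “no”), and $\delta$ represents a small probability that the attacker’s copy of the victim’s genome does not match the beacon’s copy for a locus (e.g., due to a difference in the variant calling pipeline). $n$ is the number of posed queries and $N$ is the number of individuals in the beacon. $D_N^i$ is the probability that none of the individuals in the beacon has the queried allele at position $i$ and $D_{N-1}^i$ represents the probability of no individual except for the queried person having the queried allele at position $i$. The computations of $D_{N-1}^i$ and $D_N^i$ depend on the queried position and they change at each query as follows: $D_{N-1}^i = (1 - f_i)^{2N-2}$ and $D_N^i = (1 - f_i)^{2N}$, where $f_i$ represents the MAF of the SNP at position $i$. The likelihood-ratio test (LRT) statistic, $\Lambda$, is then determined as

$$\Lambda = L_{H_0}(R) - L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\frac{1 - D_N^i}{1 - \delta D_{N-1}^i} + (1 - x_i) \log\frac{D_N^i}{\delta D_{N-1}^i} \right] \tag{3}$$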

In Section 6.5, we use the Optimal attack when we show how the proposed genome reconstruction attack can be combined with the membership inference attack.
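As a rough illustration, the LRT statistic can be computed as in the following Python sketch. It assumes the form of the test from (Raisaro et al., 2016), i.e., $\Lambda = L_{H_0}(R) - L_{H_1}(R)$ with $D_N^i = (1 - f_i)^{2N}$ and $D_{N-1}^i = (1 - f_i)^{2N-2}$; the function name, argument names, and the default value of the mismatch probability are hypothetical.

```python
import math


def lrt_statistic(responses, mafs, beacon_size, delta=1e-6):
    """Compute Lambda = L_H0 - L_H1 from binary beacon responses and SNP MAFs.

    responses: 1 for "yes", 0 for "no" (one entry per posed query)
    mafs: MAF of the SNP at each queried position
    beacon_size: number of individuals in the beacon
    delta: assumed small probability of a genotype mismatch (illustrative default)
    """
    statistic = 0.0
    for x, f in zip(responses, mafs):
        d_n = (1.0 - f) ** (2 * beacon_size)            # nobody in the beacon has the allele
        d_n_minus_1 = (1.0 - f) ** (2 * beacon_size - 2)  # nobody except the queried person
        if x == 1:
            statistic += math.log(1.0 - d_n) - math.log(1.0 - delta * d_n_minus_1)
        else:
            statistic += math.log(d_n) - math.log(delta * d_n_minus_1)
    return statistic


all_yes = lrt_statistic([1, 1, 1], [0.01, 0.01, 0.01], beacon_size=100)
single_no = lrt_statistic([0], [0.01], beacon_size=100)
```

With these definitions, each “yes” response lowers the statistic (supporting membership), whereas a single “no” response raises it sharply, because a member's allele should almost always be reported as present.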

4. System Model

As shown in Figure 1, we consider a system consisting of the beacon participants (e.g., donors), the beacon, and the beacon users. The donor shares their genome with the beacon. The donor may share their genome with more than one beacon, each of which may or may not be associated with sensitive traits. The genome donor is not active during the protocol after they share their data with the beacon. Also, the beacon never publicly shares its dataset. Some beacons may only share metadata about their donors, such as their gender, age, or ethnicity. In general, we consider the beacon as a dynamic dataset in which new donors may join and existing donors may leave over time. Beacon users issue queries to the beacon. A beacon user is a potential attacker, as shown in Figure 1. As discussed, the beacon user can only ask about the presence of a genome with a particular allele (nucleotide) at a particular position of a given chromosome, and the beacon only responds “yes” or “no”. In this work, we assume the beacon honestly reports the result of each query to the user (e.g., without introducing intentional noise to the query results) and we do not consider a query limit for the users, as it is usually trivial to overcome such limits (e.g., by registering several times with different accounts).

Figure 1. Proposed system model.
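The query interface of this system model can be sketched with a toy in-memory beacon. This is only an illustration under simplifying assumptions: the class name, method names, and genome representation are hypothetical and do not correspond to the GA4GH Beacon API.

```python
class ToyBeacon:
    """Toy beacon that keeps genomes private and answers only yes/no presence queries."""

    def __init__(self):
        # Each genome maps (chromosome, position) to the set of alleles at that locus.
        self._genomes = []

    def add_donor(self, genome):
        """Add a donor's genome; the donor takes no further part in the protocol."""
        self._genomes.append(genome)

    def size(self):
        """Number of donors, as might be exposed through beacon metadata."""
        return len(self._genomes)

    def query(self, chromosome, position, allele):
        """Return True ("yes") if any stored genome carries the allele at the locus."""
        key = (chromosome, position)
        return any(allele in genome.get(key, set()) for genome in self._genomes)


beacon = ToyBeacon()
beacon.add_donor({("1", 1000): {"C"}})
before = beacon.query("1", 1000, "T")   # "no": nobody carries T at this locus yet
beacon.add_donor({("1", 1000): {"C", "T"}})
after = beacon.query("1", 1000, "T")    # "yes": the new donor is heterozygous
```

The change of the response from “no” to “yes” after the update is exactly the kind of signal a dynamic beacon exposes to a user who queries it at two points in time.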

5. Threat Model

Depending on the attacker’s objective, two attacks can be launched against genomic data-sharing beacons: (i) the membership inference attack and (ii) the genome reconstruction attack. Here, for the first time, we identify and study the latter. We assume that the attacker, knowing that an individual is a member of a beacon, tries to reconstruct the victim’s genome by issuing queries to the corresponding beacon. This is a realistic assumption, especially for beacons that are not associated with a sensitive trait (e.g., Kaviar (Glusman et al., 2011)). For such beacons, the membership of an individual may not be privacy-sensitive information. However, using this information, the attacker may infer the genome of the victim and use it for other attacks against the victim.

This vulnerability exists for both static and dynamic beacons. In static beacons, knowing that the victim is a member of the beacon, only the “no” responses provide certain information about the victim’s genome to the attacker. “Yes” responses may be due to any other participant of the beacon, and as the size of the beacon increases, “yes” responses do not provide much information to the attacker. However, in dynamic beacons, when the beacon is updated, the attacker can learn more about the genomes of new participants by using the change in the responses of the beacon. Thus, in this paper, we analyze this vulnerability for dynamic beacons and we assume that the victim is added to the beacon between times $t_1$ and $t_2$, along with other newly added donors.

We assume that, along with the fact that the victim is among the newly joined participants of the beacon, the attacker also knows (i) the number of other newly joined individuals that are added to the beacon along with the victim; (ii) a (partial) snapshot of the beacon before the victim is added (at time $t_1$), that is, the responses to some queries before the victim joins the beacon; (iii) a set of the victim’s visible characteristics (phenotype); and (iv) publicly available information about genomics, such as minor allele frequencies (MAF values) of SNPs and correlations between the SNPs in the population of interest. Finally, we assume that the attacker is a regular beacon user and that it does not collude with the beacon.

6. Genome Reconstruction Attack on Genomic Data-Sharing Beacons

As discussed, we define the genome reconstruction attack as inferring genomic data of a genome donor (i.e., victim) given their membership information to the beacon. To show the effect of genome reconstruction attack more clearly, we consider dynamic beacons and we assume the victim is among the newly joined donors to the beacon. For clarity of the discussion, we present the identified attack only considering newly joined donors. Considering the donors that leave the beacon is symmetrical and trivial. We discuss this case in Section 8.1.

In the genome reconstruction attack, due to the nature of beacon responses, the attacker can only infer whether a victim has at least one minor allele at each SNP position. This is because the response of the beacon only tells whether there is an individual in the beacon with at least one minor allele at a given SNP position. Thus, for each SNP $i$ of the victim $v$ ($\mathrm{SNP}_i^v$), the goal of the attacker is to infer $\Pr(\mathrm{SNP}_i^v = 0)$ and $\Pr(\mathrm{SNP}_i^v \neq 0)$ (i.e., $\mathrm{SNP}_i^v = 1$ or $\mathrm{SNP}_i^v = 2$). For simplicity, we define a binary indicator $y_i^v$ for the event that the victim has at least one minor allele at SNP $i$. Thus, $y_i^v = 1$ if $\mathrm{SNP}_i^v \in \{1, 2\}$, and $y_i^v = 0$, otherwise. Note that inferring this information for a victim results in a serious privacy concern. As we will discuss and show later, using this information, an attacker can associate the genotype of the victim with related phenotypes (e.g., diseases) and initiate a membership inference attack on the victim by targeting another beacon that is associated with a sensitive phenotype (e.g., cancer or HIV+).

We consider a scenario in which the attacker has no information about the victim’s genome, but it knows that the victim is added to the beacon between times $t_1$ and $t_2$. Let $N_1$ and $N_2$ represent the number of individuals in the beacon at times $t_1$ and $t_2$, respectively. We also assume that the attacker knows $N_1$ and $N_2$, which can easily be obtained by monitoring the changes in beacon size (or from the metadata of the beacon). By possessing this information, the attacker can probabilistically infer the genome of the victim by utilizing the changes in the beacon’s responses (at times $t_1$ and $t_2$) as follows: (i) if the previous response (at time $t_1$) was “no” and the current response (at time $t_2$) is “yes”, the probability that the victim has a minor allele at the corresponding query position increases, depending on how many new individuals are added to the beacon in this time interval; (ii) if the previous response was “yes” and the current response is also “yes”, the attacker cannot infer much about the victim’s genome, especially if the total size of the beacon is large; and (iii) if both the previous and the current responses are “no”, the attacker learns that the victim does not have a minor allele at the corresponding query position.
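The three cases above can be sketched in Python as follows. The probability in the second function is an illustrative back-of-the-envelope estimate that is not taken from the paper: it assumes the newcomers are independent draws from a population in Hardy-Weinberg equilibrium with a known MAF. All names are hypothetical.

```python
def classify_response_change(before, after):
    """Interpret a pair of beacon responses (True for "yes") taken before and after an update."""
    if not before and after:
        return "possible_minor"   # no -> yes: at least one newcomer carries the minor allele
    if not before and not after:
        return "no_minor"         # no -> no: no newcomer, including the victim, carries it
    if before and after:
        return "uninformative"    # yes -> yes: an earlier participant already explains the "yes"
    return "carrier_left"         # yes -> no: only possible if donors leave the beacon


def victim_minor_probability(maf, newcomers):
    """P(victim has >= 1 minor allele | response flipped no -> yes), under the stated assumptions."""
    one_person_carries = 1.0 - (1.0 - maf) ** 2
    some_newcomer_carries = 1.0 - (1.0 - maf) ** (2 * newcomers)
    return one_person_carries / some_newcomer_carries


alone = victim_minor_probability(0.05, newcomers=1)
crowded = victim_minor_probability(0.05, newcomers=10)
```

Under these assumptions the probability equals 1 when the victim is the only newcomer and decreases as more individuals are added in the same update, which matches the dependence on the number of newcomers described in case (i).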

Here, the most important (or the most sensitive) information for the attacker is the set of “no” responses at time $t_1$ that turn to “yes” at time $t_2$, because such responses let the attacker infer, with high probability, the positions at which the victim has at least one minor allele (depending on how many new individuals are added to the beacon in this time interval). Since the minor alleles of individuals are typically indicators of privacy-sensitive information about them (e.g., predisposition to diseases or membership in other datasets), in this work, we measure the success of the attacker by its success in inferring the minor alleles of a victim using the beacon responses that turn to “yes”. Exhaustively generating all potential solutions of this problem would result in a total of $2^q$ genomes, where $q$ is the total number of responses that turn to “yes” at time $t_2$ (which can be on the order of tens of thousands), and hence it is intractable. In the following, we first describe a baseline method that provides a tractable solution to this problem. Next, we present a greedy approach to run such an attack more accurately, and then we detail a more sophisticated, clustering-based approach for the genome reconstruction attack.

6.1. Baseline Approach for Genome Reconstruction

First, we describe a baseline approach, in which the attacker, using the responses of the beacon, reconstructs the genomes (of the newly joined donors) by assigning them to $c$ bins according to the MAF values of the SNPs ($c$ can be different from the number of newly added donors $m$, and the selection of $c$ affects the precision and recall of the attacker). The genome reconstruction attack using the baseline algorithm for a particular victim at time $t_2$ can be described as follows. The input of the attacker is (i) a snapshot of the beacon with $N_1$ donors at time $t_1$ (i.e., responses to $p$ percent of all queries at time $t_1$); (ii) the fact that $m$ new donors are added to the beacon between times $t_1$ and $t_2$; (iii) the fact that the victim is among the newly added donors; and (iv) publicly available MAF values of the SNPs.

First, the attacker identifies the set of SNPs for which the response of the beacon was “no” at time $t_1$ and becomes “yes” at time $t_2$. Thus, the attacker constructs a set $Q$ consisting of these SNPs. Then, the attacker creates $c$ empty bins representing the SNP sets of the newcomer donors. For each SNP $i$ in set $Q$, the attacker retrieves its MAF value, $f_i$. Next, the attacker assigns the value of SNP $i$ for each individual $j$ (in each bin) consistent with the SNP’s MAF value as follows: (i) $y_i^j = 1$ with probability $f_i$ and (ii) $y_i^j = 0$ with probability $1 - f_i$. Since the beacon’s response for the SNPs in $Q$ has turned from “no” to “yes”, for every SNP in $Q$ there should be at least one bin (among the $c$ bins) with at least one mutation (i.e., a homozygous minor or heterozygous SNP). Thus, once the values of the SNPs in $Q$ are determined for all bins, the attacker checks whether there is any SNP in $Q$ that is not assigned to any bin. If there is such a SNP, the attacker randomly picks a bin and assigns the value of the corresponding SNP as 1 for that bin. The details of this baseline approach are also shown in Algorithm 1.

Input: B: beacon; m: number of people added to B; p: percentage of SNPs in B captured by the attacker; P: population that represents the composition of B
Output: reconstructed genomes S
// Step 1: Query the beacon
snapshot1 ← B.query(p) at time t1
// Including the victim, m donors join the beacon between times t1 and t2
snapshot2 ← B.query(p) at time t2
// Step 2: Obtain No-Yes SNPs
NoYesResponses ← []
for i ← 1 to snapshot1.length do
      if snapshot1[i] == "No" and snapshot2[i] == "Yes" then
            NoYesResponses.append(i)
      end if
end for
// Step 3: Reconstruct genomes
S ← []
for i ← 1 to NoYesResponses.length do
      for j ← 1 to m do
            if randnum < getMAF(P, NoYesResponses[i]) then
                  S[j].append(NoYesResponses[i])
            end if
      end for
      // Step 4: If a SNP is unassigned, randomly assign it to a reconstruction
      if NoYesResponses[i] is not assigned to any S[j] then
            S[random(1, m)].append(NoYesResponses[i])
      end if
end for
return S
Algorithm 1. Baseline Algorithm for Genome Reconstruction Attack
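A minimal runnable sketch of the baseline approach is given below. It is an interpretation of the description in this section, not the authors' implementation: beacon snapshots are assumed to be dictionaries mapping a SNP identifier to "Yes" or "No", the MAF lookup is a plain dictionary, and all names are hypothetical.

```python
import random


def baseline_reconstruct(snapshot1, snapshot2, maf, num_bins, rng=None):
    """Assign each no -> yes SNP to bins with probability equal to its MAF.

    Returns a list of sets; each set holds the SNPs reconstructed for one newcomer.
    """
    rng = rng or random.Random()
    # Step 2: SNPs whose response changed from "No" to "Yes" between the snapshots.
    flipped = [snp for snp in snapshot1
               if snapshot1[snp] == "No" and snapshot2.get(snp) == "Yes"]
    # Step 3: independent, MAF-driven assignment of every flipped SNP to every bin.
    bins = [set() for _ in range(num_bins)]
    for snp in flipped:
        assigned = False
        for genome in bins:
            if rng.random() < maf[snp]:
                genome.add(snp)
                assigned = True
        # Step 4: a flipped SNP must be carried by at least one newcomer.
        if not assigned:
            rng.choice(bins).add(snp)
    return bins


snapshot1 = {"rs1": "No", "rs2": "No", "rs3": "Yes", "rs4": "No"}
snapshot2 = {"rs1": "Yes", "rs2": "No", "rs3": "Yes", "rs4": "Yes"}
maf = {"rs1": 0.01, "rs2": 0.2, "rs3": 0.4, "rs4": 0.3}
bins = baseline_reconstruct(snapshot1, snapshot2, maf, num_bins=3, rng=random.Random(7))
```

Whatever the random draws are, only the SNPs that flipped (here `rs1` and `rs4`) can appear in a bin, and each of them appears in at least one bin.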

6.2. Greedy Algorithm for Genome Reconstruction

The above-mentioned baseline algorithm assumes that every SNP is independent and disregards the correlations among them. However, SNPs are inherently correlated, and considering such correlations in the genome reconstruction attack may result in significantly more accurate results. In the greedy algorithm discussed here, the attacker forms the bins considering the correlations between the SNPs in the set $Q$ of SNPs whose beacon response changes from “no” to “yes”. Using an iterative approach, the attacker assigns each SNP (minor allele) to an individual such that the probability of assignment is proportional to the average correlation of the new SNP with the already assigned SNPs of the individual (i.e., bin $j$). If no assignment is made this way, a random individual is selected to make sure there is at least one person with the corresponding new SNP.

The genome reconstruction attack using the greedy algorithm for a particular victim at time $t_2$ can be described as follows. The input of the attacker is (i) the responses of the beacon to $p$ percent of all possible queries at time $t_1$; (ii) the fact that $m$ new donors are added to the beacon between times $t_1$ and $t_2$ and the victim is among the newly added donors; (iii) publicly available MAF values of the SNPs; and (iv) a correlation model between the SNPs that is consistent with the population structure of the beacon (which can be computed using publicly available genomic datasets).

For the correlation model, we assume the attacker uses a Markov chain model, as described in (Samani et al., 2015). The attacker calculates the likelihood of the victim having at least one minor allele at a SNP position $i$ as

$$P_k(\mathrm{SNP}_i) = \Pr\left(\mathrm{SNP}_i \mid \mathrm{SNP}_{i-1}, \ldots, \mathrm{SNP}_{i-k}\right) \tag{4}$$

where $k$ is the order of the Markov chain. In order to build a Markov chain model for the genome, we use public sources such as HapMap (Gibbs et al., 2003). Consistent with the previous work in (Samani et al., 2015), we define the $k$th-order model as follows: (i) $P_k(\mathrm{SNP}_i) = 0$ if $F(\mathrm{SNP}_{i-k,i-1}) = 0$ and (ii) $P_k(\mathrm{SNP}_i) = F(\mathrm{SNP}_{i-k,i}) / F(\mathrm{SNP}_{i-k,i-1})$ if $F(\mathrm{SNP}_{i-k,i-1}) > 0$, where $F(\mathrm{SNP}_{i,j})$ is the frequency of occurrence of the sequence that contains $\mathrm{SNP}_i$ to $\mathrm{SNP}_j$. The SNPs are ordered according to their physical positions on the genome. In this work, we use $k = 1$ and, unlike (Samani et al., 2015), we do not limit the correlations to neighboring SNPs. Instead, we create our correlation model by considering the pairwise correlations between all the SNPs in the beacon. Here, we use the Sokal-Michener distance to measure correlations between SNPs.
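To make the pairwise correlation measure concrete, the sketch below computes a Sokal-Michener coefficient between two SNPs from a reference panel. It assumes the classical simple-matching form of the coefficient (the fraction of reference individuals on which the two binary indicator vectors agree); which exact variant the authors use is not stated here, and the function names and toy panel are hypothetical.

```python
def sokal_michener_similarity(u, v):
    """Simple-matching similarity: share of positions where two binary vectors agree."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


def pairwise_similarity(indicators):
    """Build a dict of similarities for every ordered pair of SNPs in a reference panel."""
    snps = sorted(indicators)
    return {(a, b): sokal_michener_similarity(indicators[a], indicators[b])
            for a in snps for b in snps}


# Each vector marks, per reference individual, whether they carry >= 1 minor allele.
panel = {
    "rs1": [1, 0, 0, 1],
    "rs2": [1, 0, 1, 1],
    "rs3": [0, 1, 1, 0],
}
similarity = pairwise_similarity(panel)
print(similarity[("rs1", "rs2")])  # 0.75
print(similarity[("rs1", "rs3")])  # 0.0
```

The corresponding distance is one minus the similarity, so `rs1` and `rs2`, which agree on three of four individuals, are close, while `rs1` and `rs3` never agree.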

In the greedy approach, the attacker first constructs the set $Q$ of SNPs whose beacon response changes from “no” to “yes”. Then, it creates $c$ empty bins ($c$ does not have to be equal to the number of newly added donors $m$), where $c$ is the number of rare SNPs in $Q$. We assume that the SNPs with an MAF value below a threshold are categorized as rare SNPs. Assuming rare SNPs do not have correlations among each other, assigning the rare SNPs in $Q$ to different bins as seeds is expected to result in an accurate separation of individuals. Next, for each remaining SNP $i$ in $Q$, the attacker computes the correlation of $i$ with all the previously assigned SNPs in each bin using the aforementioned correlation model. The attacker assigns $i$ to the bin $j$ that has the highest average correlation value. Eventually, the attacker constructs $c$ potential genomes (in $c$ bins) belonging to newcomer donors.
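The greedy procedure can be sketched as follows. This is a simplified, deterministic reading of the description (each remaining SNP goes to the single bin with the highest average correlation), not the authors' code; the correlation model is passed in as a function, the fallback to a single seed when no SNP is rare is an added assumption, and all names are hypothetical.

```python
def greedy_reconstruct(flipped_snps, maf, similarity, rare_threshold):
    """Seed one bin per rare SNP, then add each other SNP to its most correlated bin."""
    ordered = sorted(flipped_snps, key=lambda snp: maf[snp])  # lowest MAF first
    seeds = [snp for snp in ordered if maf[snp] < rare_threshold]
    if not seeds:
        seeds = ordered[:1]  # assumption: fall back to one bin if no SNP is rare
    seed_set = set(seeds)
    bins = [[snp] for snp in seeds]
    for snp in ordered:
        if snp in seed_set:
            continue
        # Average correlation of the new SNP with the SNPs already placed in a bin.
        best_bin = max(
            bins,
            key=lambda members: sum(similarity(snp, other) for other in members) / len(members),
        )
        best_bin.append(snp)
    return bins


# Toy correlation model: symmetric lookup table with hypothetical values.
table = {
    frozenset(("a", "c")): 0.9, frozenset(("b", "c")): 0.1,
    frozenset(("a", "d")): 0.2, frozenset(("b", "d")): 0.8,
    frozenset(("c", "d")): 0.1, frozenset(("a", "b")): 0.0,
}
maf = {"a": 0.01, "b": 0.02, "c": 0.2, "d": 0.3}
genomes = greedy_reconstruct(["d", "c", "b", "a"], maf, lambda x, y: table[frozenset((x, y))], 0.05)
print(genomes)  # [['a', 'c'], ['b', 'd']]
```

In this toy run the two rare SNPs `a` and `b` seed separate bins, and the common SNPs `c` and `d` join the bin they correlate with most strongly.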

6.3. Clustering-Based Algorithm for Genome Reconstruction

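Input: the beacon; the number of people added to the beacon; the percentage of SNPs in the beacon captured by the attacker; a population that represents the composition of the beacon
Output: reconstructed genomes S

// Step 1: Query Beacon
snapshot1 ← responses of the beacon before the update
// Including the victim, the new donors join the beacon between the two snapshots
snapshot2 ← responses of the beacon after the update

// Step 2: Obtain No-Yes SNPs
NoYesResponses ← []
for each queried SNP position i do
      if snapshot1[i] == "No" and snapshot2[i] == "Yes" then
            NoYesResponses.append(i)
      end if
end for

// Step 3: Cluster No-Yes SNPs
G ← empty graph
for i ← 1 to NoYesResponses.length do
      for j ← 1 to NoYesResponses.length do
            w ← correlation between NoYesResponses[i] and NoYesResponses[j]
            G.addEdge(NoYesResponses[i], NoYesResponses[j], w)
      end for
end for
clusters ← result of spectral (or fuzzy) clustering on G

// Step 4: Reconstruct genomes
S ← []
for each cluster c in clusters do
      genome ← all SNPs set to "no minor allele"
      foreach s in c do
            genome[s] ← "at least one minor allele"
      end foreach
      S.append(genome)
end for
return S

Algorithm 2. Clustering-Based Algorithm for Genome Reconstruction Attack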

The above-mentioned greedy algorithm reconstructs genomes by following a particular order (determined based on the MAFs of the SNPs). Different orders may provide different (and possibly more accurate) solutions. Thus, to consider all query responses together in a collective way, we propose clustering-based approaches for the genome reconstruction attack that cluster the identified minor alleles (for which ) for the newly joined donors to the beacon. The proposed clustering techniques essentially use the correlations between the SNPs (that are computed using the aforementioned correlation model) to separate the SNPs into different bins. We use two types of clustering techniques: (i) one that creates non-overlapping bins (hard clustering); and (ii) one that may assign a SNP into multiple bins (soft or fuzzy clustering).

For (i), we employ spectral clustering, in which a standard clustering method (such as k-means clustering) is applied on certain eigenvectors of the Laplacian matrix of a graph (Ng et al., 2002). Spectral clustering is our method of choice as it has been shown to provide favorable results in many high-dimensional feature spaces like ours (Rodriguez et al., 2019). For (ii), we employ the fuzzy c-means clustering (FCM) algorithm (Bezdek et al., 1984), which is a common choice for these types of tasks. The algorithm is similar to k-means clustering, but it also allows probabilistic assignments of samples to multiple clusters. The descriptions of the two clustering methods differ only in the clustering step. Thus, in the following, we describe both methods together.

The input of both clustering-based algorithms is the same as the input of the greedy algorithm. First, the attacker identifies the set of SNP positions for which the response of the beacon was “no” before the update and became “yes” after it. Then, the attacker builds a graph of SNPs using the correlation model, in which the vertices are the SNPs in this set and the undirected edges are weighted by the correlation values between these SNPs. This graph represents a pairwise similarity model for the SNPs and is used for a quantitative assessment of the correlation of each SNP pair in the set.
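These two steps can be sketched in a few lines. The example below is a minimal illustration, assuming each snapshot is a dictionary from SNP position to a boolean beacon response and that positions missing from the earlier snapshot (when the attacker captured only part of it) are skipped. The names `no_to_yes` and `build_graph` and the constant similarity function are hypothetical.

```python
def no_to_yes(snapshot_before, snapshot_after):
    """SNP positions answered "no" before the update and "yes" after it.

    Positions absent from the earlier snapshot are skipped, since the
    attacker cannot tell whether their response changed.
    """
    return sorted(
        pos
        for pos, answer in snapshot_after.items()
        if answer is True and snapshot_before.get(pos) is False
    )


def build_graph(snps, similarity):
    """Undirected weighted graph as an adjacency dictionary."""
    graph = {s: {} for s in snps}
    for index, a in enumerate(snps):
        for b in snps[index + 1:]:
            weight = similarity(a, b)
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


# Toy snapshots (illustrative values only); rs5 was never captured before.
before = {"rs1": False, "rs2": True, "rs3": False, "rs4": False}
after = {"rs1": True, "rs2": True, "rs3": True, "rs4": False, "rs5": True}

flipped = no_to_yes(before, after)
graph = build_graph(flipped, lambda a, b: 0.5)  # placeholder similarity
```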

Next, the attacker applies either the spectral or the fuzzy clustering algorithm on the constructed graph. The outcome of spectral clustering is a set of disjoint clusters. Fuzzy clustering results in groups of SNPs that maximize the similarity within a group while allowing a SNP to be shared by multiple individuals. Thus, in fuzzy clustering, each SNP is assigned to every cluster for which the algorithm returns a relatively high probability of association.
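The difference between the two outcomes lies in how cluster memberships are turned into bins. As a hedged sketch of the fuzzy case, assume the clustering step has already produced a membership degree per SNP and cluster, and that "relatively high" is modeled by a fixed cutoff. Both the cutoff value and the fallback to the single best cluster are assumptions made for this example.

```python
def assign_fuzzy(memberships, threshold):
    """Place each SNP in every cluster whose membership degree reaches
    `threshold`; fall back to the single best cluster if none does."""
    n_clusters = len(next(iter(memberships.values())))
    clusters = [[] for _ in range(n_clusters)]
    for snp in sorted(memberships):
        degrees = memberships[snp]
        chosen = [k for k, u in enumerate(degrees) if u >= threshold]
        if not chosen:
            chosen = [max(range(n_clusters), key=lambda k: degrees[k])]
        for k in chosen:
            clusters[k].append(snp)
    return clusters


# Toy membership degrees (illustrative values only), two clusters.
memberships = {
    "rs1": [0.9, 0.1],
    "rs2": [0.5, 0.5],
    "rs3": [0.2, 0.8],
}
fuzzy_clusters = assign_fuzzy(memberships, threshold=0.4)
```

Here rs2 lands in both bins, which models a minor allele shared by two newly added donors. A hard clustering would have to pick one.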

After clustering, the attacker obtains a set of clusters, each of which corresponds to a reconstructed genome. The details of this algorithm are also shown in Algorithm 2.
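To make the hard-clustering step concrete, the following is a simplified, dependency-free sketch of the two-cluster case only. Instead of running k-means on several eigenvectors as in (Ng et al., 2002), it splits the SNPs by the sign of the second eigenvector of the normalized affinity matrix, obtained by power iteration. This is a stand-in chosen for brevity, not the method used in the evaluation, and the weight matrix is illustrative.

```python
import math


def spectral_bisection(weights, iterations=200):
    """Split SNPs into two clusters from a symmetric, non-negative weight matrix."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    if any(d <= 0 for d in degree):
        raise ValueError("every SNP needs at least one positive edge weight")
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    # Normalized affinity D^-1/2 W D^-1/2, shifted so its eigenvalues lie in [0, 1].
    shifted = [
        [0.5 * (inv_sqrt[i] * weights[i][j] * inv_sqrt[j] + (1.0 if i == j else 0.0))
         for j in range(n)]
        for i in range(n)
    ]
    # The leading eigenvector is known in closed form: sqrt(degree), normalized.
    total = math.sqrt(sum(degree))
    leading = [math.sqrt(d) / total for d in degree]
    vec = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        # Remove the leading component, then apply one power-iteration step.
        overlap = sum(v * l for v, l in zip(vec, leading))
        vec = [v - overlap * l for v, l in zip(vec, leading)]
        vec = [sum(shifted[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0.0:
            raise ValueError("start vector has no component along the second eigenvector")
        vec = [v / norm for v in vec]
    if vec[0] > 0:  # fix the sign so that the first SNP always gets label 0
        vec = [-v for v in vec]
    return [0 if v <= 0 else 1 for v in vec]


# Toy graph (illustrative weights): SNPs 0-1 and SNPs 2-3 are strongly correlated.
weights = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.9, 0.0],
]
labels = spectral_bisection(weights)
```

Each label group then plays the role of one bin, i.e., the set of positions at which one reconstructed genome carries at least one minor allele.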

6.4. Identifying the Victim Using Genotype-Phenotype Associations

In previous sections, for genome reconstruction, we assumed that the attacker can correctly identify the victim’s genome among several reconstructed bins (e.g., = 1). Assuming the attacker has information about some phenotypic characteristics of the victim and relying upon the fact that SNPs are intrinsically linked to phenotypic traits (such as eye color, blood type etc.), we also study and show how accurately the attacker can identify the victim’s genome among other candidates. This provides a complete methodology for the genome reconstruction attack against beacons in real-life.

Assume the victim is among the new additions to the beacon (it is trivial to extend the methodology if there is more than one victim). The attacker is assumed to have access to two distinct sets: (i) a set of reconstructed genotypes produced by the genome reconstruction attack, each of which is a vector containing the SNP values of one reconstructed genotype (or bin); and (ii) a set containing the values of the victim’s phenotypic traits. Such phenotype information can be obtained from publicly available resources or using the physical traits of the victim. The goal of the attacker is to correctly match the victim’s phenotype to the correct reconstructed genome (that is the most similar to the victim’s) among all candidate reconstructed genome sequences. In (Humbert et al., 2015b), Humbert et al. focused on the deanonymization risk and modeled genotype-phenotype association as an assignment problem. They showed this risk by using the Hungarian algorithm (Kuhn, 1955). Different from (Humbert et al., 2015b), here, we rely on machine learning-based tools for maximizing the matching likelihood and genotype-phenotype associations. We observe that such a formulation provides more accurate results. Also, rather than using SNP values (0, 1 or 2), due to the nature of the proposed attack, we represent each SNP of an individual by a binary state (no minor allele, or at least one minor allele), as discussed before.

To train the model, we first apply feature selection with mutual information to reduce the number of dimensions (i.e., SNPs) of the data from 2,338,175 to 500k. Then, using the selected features, we train classifiers for each considered phenotype. The OpenSNP dataset has the following number of samples for each phenotype: (i) eye color (4 labels), 755 samples; (ii) color blindness (2 labels), 360 samples; (iii) hair type (2 labels), 358 samples; (iv) hair color (4 labels), 456 samples; (v) lactose intolerance (2 labels), 340 samples; (vi) blood type (4 labels), 250 samples; (vii) earwax (2 labels), 240 samples; (viii) tongue rolling (2 labels), 427 samples; and (ix) intolerance to soy (2 labels), 131 samples. For each phenotype, we split the data into training and test sets. On the training set, we used 5-fold cross validation and grid search to tune the parameters of the classifier. We report the test accuracy and discuss the findings in Section 7.3.
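The feature selection step can be illustrated with a small, dependency-free example. This is a minimal sketch, assuming binary SNP states and categorical phenotype labels, that scores each SNP by its empirical mutual information with the label and keeps the top-scoring ones. The function names and toy data are hypothetical, and a real pipeline would likely rely on a library implementation.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    return sum(
        (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


def select_top_snps(genotypes, labels, k):
    """Ids of the k SNPs with the highest mutual information with the label."""
    scores = {snp: mutual_information(col, labels) for snp, col in genotypes.items()}
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


# Toy data (illustrative values only): 3 SNPs, 4 individuals.
labels = ["blue", "blue", "brown", "brown"]
genotypes = {
    "rs1": [1, 1, 0, 0],  # perfectly informative about the label
    "rs2": [1, 0, 1, 0],  # independent of the label
    "rs3": [1, 1, 1, 0],  # partially informative
}
selected = select_top_snps(genotypes, labels, k=2)
```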

Using the above-mentioned trained classifiers, we predict the phenotypes of each reconstructed genome (i.e., bin). Each classifier outputs a probability for each possible label for that phenotype (e.g., eye color being blue). We assume the phenotypes are independent and use a weighted ensemble of the individual classifiers to detect the reconstruction that is most likely to lead to the observed phenotypes of the victim. Each individual classifier’s weight is proportional to its test accuracy.
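One possible form of this ensemble is sketched below. It assumes that the score of a bin is the accuracy-weighted average of the probabilities that the per-phenotype classifiers assign to the victim's known traits; the section does not state the exact combination rule, so this is an assumption. All names, accuracies, and probabilities are illustrative.

```python
def identify_victim_bin(bin_probabilities, victim_traits, test_accuracy):
    """Return the index of the best-matching bin and all bin scores.

    bin_probabilities[b][phenotype][label] is the classifier's probability
    that reconstructed genome b has `label` for `phenotype`.
    """
    total = sum(test_accuracy[p] for p in victim_traits)
    weights = {p: test_accuracy[p] / total for p in victim_traits}
    scores = [
        sum(weights[p] * probs[p].get(victim_traits[p], 0.0) for p in victim_traits)
        for probs in bin_probabilities
    ]
    best = max(range(len(scores)), key=lambda b: scores[b])
    return best, scores


# Toy inputs (illustrative values only).
victim_traits = {"eye_color": "blue", "hair_color": "blond"}
test_accuracy = {"eye_color": 0.8, "hair_color": 0.4}
bin_probabilities = [
    {"eye_color": {"blue": 0.2, "brown": 0.8},
     "hair_color": {"blond": 0.9, "dark": 0.1}},
    {"eye_color": {"blue": 0.7, "brown": 0.3},
     "hair_color": {"blond": 0.3, "dark": 0.7}},
]
best_bin, scores = identify_victim_bin(bin_probabilities, victim_traits, test_accuracy)
```

The second bin wins here because the more accurate eye color classifier carries twice the weight of the hair color classifier.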

The performance of identification of victim’s reconstructed genome under different settings is also discussed in Section 7.3.

6.5. Genome Reconstruction in Membership Inference

To show one consequence of the proposed genome reconstruction attack, we also model and analyze how the proposed attack can be utilized for membership inference (introduced in Section 3.2). We consider a scenario in which the attacker, knowing the membership of an individual to a beacon with no phenotype (or with a non-sensitive phenotype), first utilizes the responses of this beacon to infer specific parts of a victim’s genome (i.e., SNPs). Then, the attacker uses these inferred SNPs to infer the membership of the victim to a beacon with a sensitive phenotype. This attack is important and realistic, because knowing the membership of an individual to a beacon with a non-sensitive phenotype may not seem to create a privacy issue. However, using the proposed genome reconstruction attack and the membership information of the victim to the beacon with non-sensitive phenotype, the attacker can first infer the SNPs of the victim and then, infer the membership of the victim to another beacon with a sensitive phenotype.

To show this, first, we run the proposed genome reconstruction attack in Section 6.3 and infer the SNPs of the victim with at least one minor allele. Using these inferred SNPs, we then run the membership inference attack to infer the membership of the victim in another beacon. For the membership inference attack, we use the Optimal attack in (Raisaro et al., 2016) (described in Section 3.2), which is shown to be an effective attack for membership inference. For our scenario, the Optimal attack in (Raisaro et al., 2016) and the QI-attack in (von Thenen et al., 2018) perform similarly; we choose to implement the Optimal attack due to its simplicity. However, unlike in the original Optimal attack, we query the alleles of the victim that the attacker infers as a result of the genome reconstruction attack. Hence, in the null and alternate hypothesis equations in (1) and (2), there is an additional error due to the inference error of the genome reconstruction attack. Thus, we first experimentally compute the error rate of the genome reconstruction attack for a particular scenario (i.e., for particular values of the attack parameters). We then include this additional error in the parameter in (2) that represents the probability that the attacker’s copy of the victim’s genome does not match the beacon’s copy for a SNP. Furthermore, as opposed to the original Optimal attack, here the attacker may not have access to the SNPs of the victim with the lowest MAF values; instead, the attacker only knows the SNPs that are inferred as a result of the genome reconstruction attack. We evaluate the success of this attack in terms of the power of the attacker in Section 7.4.

7. Evaluation

We evaluated the identified vulnerabilities using real-life genomic datasets. Here, we first describe the datasets we used and then present the evaluation results.

7.1. Datasets and Evaluation Metrics

We used two different genome datasets for evaluation: (i) genome dataset of CEU population from the HapMap dataset (Gibbs et al., 2003) and (ii) OpenSNP genome dataset (ope, 2020). Using the HapMap dataset, we created a beacon that consists of donors (unless otherwise specified) from the CEU population, including around million SNPs for each donor. We created the correlation model (i.e., SNP network or similarity model) for this beacon using individuals from the same HapMap dataset that are not in the constructed beacon. Using the OpenSNP dataset, we created a beacon that consists of donors (unless otherwise specified), including around million SNPs for each donor and created the correlation model using the rest of the OpenSNP dataset. For the OpenSNP dataset, we also collected the reported phenotypes of individuals. We focused on the following phenotypes, each reported by at least individuals: eye color, color blindness, hair type, hair color, lactose intolerance, blood type, earwax type, tongue roller, and intolerance to soy. We used genomes which are associated with at least 5 of the above-mentioned phenotypes.

We evaluated the precision, recall, and accuracy for the reconstruction of a victim’s SNPs based on changes in beacon responses. For precision and recall, we defined the success over correctly inferring the SNPs of the victim with at least one minor allele. For accuracy, we defined the success over correctly inferring all SNPs of the victim in the considered set. Thus, we defined (i) a true positive as correctly inferring that a SNP of the victim has at least one minor allele; (ii) a false positive as incorrectly inferring at least one minor allele for a victim who is homozygous major at that locus; (iii) a true negative as correctly inferring that a SNP of the victim has no minor allele (homozygous major); and (iv) a false negative as incorrectly inferring no minor allele for a victim who has at least one minor allele at that locus (i.e., heterozygous or homozygous minor). We also defined accuracy as the fraction of correctly inferred SNPs among the ones in the considered set (including the SNP positions for which the response of the beacon changed from “no” to “yes” across the update). Furthermore, we quantified the success of identifying the victim’s genome among the reconstructed genomes in terms of the accuracy of the developed genotype-phenotype inference mechanism. Finally, we used the power of membership inference (as a result of the log-likelihood test) to show how the outcome of the genome reconstruction attack can be used for membership inference.
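These definitions translate directly into code. The following minimal sketch assumes that both the ground truth and the reconstruction map each considered SNP to a boolean that is true when the victim has at least one minor allele at that locus. The function name and the toy values are hypothetical.

```python
def reconstruction_metrics(truth, inferred):
    """Precision, recall, and accuracy over the SNPs listed in `truth`.

    SNPs absent from `inferred` count as "no minor allele".
    """
    tp = fp = tn = fn = 0
    for snp, actual in truth.items():
        predicted = inferred.get(snp, False)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(truth) if truth else 0.0
    return precision, recall, accuracy


# Toy example (illustrative values only).
truth = {"rs1": True, "rs2": True, "rs3": False, "rs4": False, "rs5": False}
inferred = {"rs1": True, "rs2": False, "rs3": True, "rs4": False, "rs5": False}
precision, recall, accuracy = reconstruction_metrics(truth, inferred)
```

In this toy example, accuracy (0.6) exceeds precision and recall (0.5 each) because homozygous-major positions dominate the considered set.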

7.2. Evaluation of Genome Reconstruction

First, using both the OpenSNP and HapMap beacons and only focusing on genome reconstruction, we evaluated and compared the baseline method (in Section 6.1) and the proposed clustering-based approach (in Section 6.3). Here, we show the performance of the genome reconstruction algorithm considering the reconstructed genome (among all reconstructed genomes) that is the most similar to the victim’s genome. Thus, we assume that the attacker can identify the victim’s reconstructed genome among the other candidates. Later, we will also show that the attacker can indeed identify this genome with high accuracy using public (i.e., not sensitive) phenotype information about the victim.

Overall, the results we obtained from both beacons are similar to each other, showing that the identified vulnerability is not dataset specific. Thus, we show the results of the HapMap beacon in Appendix A due to space restrictions. Figures 2 and 9 (in Appendix A) show the success (precision, recall, and accuracy) of the reconstruction for various numbers of newly added donors in OpenSNP and HapMap beacons, respectively. The results show that on average, the identified attack using spectral clustering can reconstruct the victim’s genome with a precision close to when the size of the beacon is increased by in an update. We also obtained more than precision even when the size of the beacon is increased by (i.e., when for OpenSNP beacon). This indicates a substantial privacy risk, especially if the reconstructed SNPs are tied to sensitive phenotypes. The accuracy is high for all techniques because accuracy is defined over all SNPs of the victim in the considered set, including the ones that do not have any minor alleles (and the number of such SNPs dominates the number of SNPs with at least one minor allele). Also, the baseline algorithm performs substantially worse than the proposed clustering-based approach. The results also show that spectral clustering-based genome reconstruction outperforms the fuzzy clustering-based approach. We observed that allowing a SNP (that includes at least one minor allele) to be in multiple bins results in a high number of false positives. Therefore, in the remainder of this section, we use spectral clustering-based genome reconstruction for the evaluations.

In Figures 3 and 10 (in Appendix A), we also show the effect of varying number of bins () in the genome reconstruction attack when the number of newly added donors () is for OpenSNP and HapMap beacons, respectively. We observed that for both beacons, precision increases and recall decreases with increasing . Also, as expected, precision and recall becomes balanced when .

Next, in Figures 4 and 11 (in Appendix A), we show the effect of the beacon size before the update when a fixed number of new donors is added in the update, for OpenSNP and HapMap beacons, respectively. Here, we assume that the number of bins is equal to the number of newly added donors. We observed that as the size of the beacon increases, the precision, recall, and accuracy of the reconstruction attack all increase slightly (for a fixed number of newly added donors). This shows that the success of the identified attack mainly relies on the fraction of the newly added donors to the beacon, and it is largely independent of the size of the beacon before the update.

In these experiments, we assumed that the attacker knows the whole snapshot of the beacon before the update (i.e., all query responses). We observed that the total number of inferred SNPs decreases linearly as the fraction of responses known to the attacker decreases. However, precision, recall, and accuracy of the attack remain similar to the results in Figures 2 and 9. The effect of this fraction is clearer for the membership inference attack (we discuss and evaluate this in Section 7.4).

Figure 2. (a) Precision, (b) recall, and (c) accuracy for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of newly added donors.
Figure 3. (a) Precision, (b) recall, and (c) accuracy for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of bins/clusters in the genome reconstruction attack. The number of newly added donors is fixed.
Figure 4. (a) Precision, (b) recall, and (c) accuracy for the genome reconstruction of a newly added donor to OpenSNP beacon with varying beacon size. The number of newly added donors is fixed for all plots.

7.3. Identifying the Victim’s Genome Using Phenotype Inference

Here, we evaluate the success of the attacker in identifying the reconstructed genome of the victim among all reconstructed genomes using the algorithm in Section 6.4. Since the HapMap dataset does not include phenotype information about the genome donors, we only use the OpenSNP beacon for this evaluation. We employed and compared several classifiers for genotype-phenotype associations, including logistic regression, SVM, multilayer perceptron, random forest, and XGBoost. Among these, we obtained the highest classifier accuracy with XGBoost (Chen and Guestrin, 2016), and hence for the rest of this part, we report the results we obtained using the XGBoost classifier.

In Figure 5, we show the classifier accuracy for each phenotype we consider along with the accuracy of random guess. All existing and newly added donors in the OpenSNP beacon have reported these phenotypes. We combined all these considered phenotypes to build a single ensemble classifier. In Figure 6, we show the ensemble classifier accuracy for varying number of newly added donors to the beacon (here, we assumed and we observed similar patterns when as well). Note that we use the original genomes of individuals in the training dataset when building the model, however, for test, we use reconstructed genomes of the victims (that may have noise due to reconstruction error). We observed that the proposed algorithm provides more than accuracy when the size of the beacon is increased by , and the accuracy slightly decreases with increasing number of newly added donors. These results show that the attacker can identify the reconstructed genome of the victim among all reconstructed genomes with a high accuracy.

Figure 5. Classifier accuracy of ensemble classifier (XGBoost) and accuracy of random guess for each considered phenotype in the OpenSNP dataset.
Figure 6. Classification accuracy of genotype inference from phenotype for varying number of newly added donors to the beacon ().

7.4. Using Genome Reconstruction in Membership Inference

To show one severe consequence of the proposed genome reconstruction attack, we also show how the outcome of it can be utilized in a membership inference attack. For this, we constructed two beacons from the OpenSNP dataset: (i) including 100 individuals and (ii) including 100 individuals. We assume that is associated with a privacy-sensitive phenotype and the goal of the attacker is to infer the membership of the victim to . We also assume that new individuals are added to at time and the victim is among these newly joined donors. The attacker only knows that the victim is among the individuals that are added to at time along with a snapshot of at time (i.e., of all query responses at time ).

First, we applied the spectral clustering-based genome reconstruction (that provides the best performance in Section 7.2) to reconstruct the genomes of newly joined donors to . Then, we identified the reconstructed genome of the victim using phenotype information about the victim (as in Section 6.4). Finally, using the reconstructed genome of the victim, we conducted the membership inference attack on using the Optimal attack (as described in Section 3.2).

Similar to Raisaro et al. and von Thenen et al., we plotted the power curve of the membership inference attack at a fixed false positive rate. We empirically built the null hypothesis (defined in Section 3.2). For every query, we determined the distribution of the test statistic under the null hypothesis using individuals that are not in the target beacon. The null hypothesis is rejected when the test statistic is less than a threshold, and we set this threshold from the null distribution so that it corresponds to the chosen false positive rate. Then, we computed the power as the proportion of the individuals in the alternate hypothesis (i.e., victims that are in the target beacon) whose test statistic is less than this threshold. As discussed, for the alternate hypothesis (defined in Section 3.2), we included the real genome of the victim in the beacon but we assumed that the attacker has a noisy version of the victim’s genome (due to the inference error of the genome reconstruction attack).
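The power computation can be sketched as follows. This is a minimal illustration, assuming the test statistics have already been computed for individuals outside the target beacon (null) and for victims inside it (alternate). The false positive rate `alpha` and all values below are illustrative and are not the settings used in the evaluation.

```python
def threshold_at_fpr(null_statistics, alpha):
    """Largest threshold such that at most a fraction `alpha` of the null
    statistics fall strictly below it (the null is rejected below it)."""
    ordered = sorted(null_statistics)
    k = min(int(alpha * len(ordered)), len(ordered) - 1)
    return ordered[k]


def power(alternate_statistics, threshold):
    """Fraction of true members whose statistic falls below the threshold."""
    rejected = sum(1 for s in alternate_statistics if s < threshold)
    return rejected / len(alternate_statistics)


# Toy statistics (illustrative values only).
null_statistics = [float(x) for x in range(20)]   # individuals not in the beacon
alternate_statistics = [-5.0, -1.0, 0.5, 1.5, 3.0]  # victims in the beacon

t = threshold_at_fpr(null_statistics, alpha=0.1)
attack_power = power(alternate_statistics, t)
false_positive_rate = power(null_statistics, t)
```

Repeating this computation after each additional query yields one point of the power curve per query count.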

In Figures 7 and 8, we show the power curve of this attack with varying number of newly added donors () to beacon and for different values (fraction of the whole responses the attacker knows at time ), respectively. As expected, with decreasing values of , the power increases faster since the accuracy of genome reconstruction increases (and hence the error rate of the membership inference attack decreases). For instance, when the victim is among newly added donors to beacon , the attacker can reconstruct the victim’s genome and then infer the victim’s membership to beacon with a very high confidence in just slightly more than queries. Furthermore, as shown in Figure 8, the power of the membership inference attack increases with increasing value. This is because with increasing value, the attacker infers a higher number of (and potentially low-MAF) SNPs of the victim as a result of the genome reconstruction attack.

Figure 7. Power of membership inference attack on beacon with varying number of newly added donors () to beacon .
Figure 8. Power of membership inference attack on beacon for different values (fraction of the whole responses the attacker knows at time ).

8. Discussion

Overall, we believe that the implications of this work will be significant, especially for genomic data-sharing beacons and other statistical genome datasets. Beacons have been widely accepted by the community as the best standard for ease of setup and for encouraging collaboration without compromising security. However, the privacy pitfalls have cast doubts on their usability. Currently, setting up a beacon is a risk for data owners (i.e., potential beacon operators) because they might face legal consequences if their dataset is breached. Thus, it might be a wiser choice to keep the dataset offline and let people go through the rigorous paperwork. For the donors, joining is usually a shot in the dark: they do not understand the technical and statistical risks, but they fear the consequences, which scares them away from joining the study. This work, along with existing membership inference attacks on genomic data-sharing beacons, will help both beacon operators and donors understand the privacy risk and take precautions (or make informed decisions) against the identified risks. In the following, we discuss some alternative scenarios for the proposed attack, a practical use case of the identified vulnerability, and potential mitigation techniques.

8.1. Donors Leaving the Beacon

In Sections 6 and 7, we presented and evaluated the identified vulnerability by only considering the newly joined donors to the beacons. Existing donors may also leave the beacon. Such a scenario can be easily addressed by using the identified attack mechanism. Considering the donors that leave the beacon brings up two different scenarios: (i) the victim is among the newly joined donors (while there are also donors leaving the beacon during the same interval); and (ii) the victim is among the donors that leave the beacon (while there may be other donors leaving or joining the beacon during the same interval).

Scenario (i) is no different from what we discussed in Section 6. The number of “no” responses before the update that turn to “yes” after the update does not change due to the donors leaving the beacon. On the other hand, some “yes” responses before the update may turn to “no” after the update due to the donors leaving the beacon. However, such responses do not provide information about the minor alleles of the victim, and hence we do not consider such responses in this work. In scenario (ii), “yes” responses that turn to “no” after the update provide information about the minor alleles of the victim (and of other donors that leave the beacon during that time interval). Using such responses, one can run the algorithms proposed in Section 6 to reconstruct the genome of the victim. Thus, it is straightforward to consider both newly joining and leaving donors in the proposed attack mechanism.
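Both scenarios reduce to selecting the positions whose response flipped in one direction. A minimal sketch, assuming boolean responses keyed by SNP position, is given below. The helper name `flipped_positions` and the toy snapshots are hypothetical.

```python
def flipped_positions(before, after, old, new):
    """Positions whose beacon response changed from `old` to `new`."""
    return sorted(
        pos for pos, answer in before.items()
        if answer is old and after.get(pos) is new
    )


# Toy snapshots (illustrative values only).
before = {"rs1": True, "rs2": True, "rs3": False}
after = {"rs1": False, "rs2": True, "rs3": True}

# Scenario (i): the victim joins, so the "no" to "yes" flips are informative.
joined_evidence = flipped_positions(before, after, old=False, new=True)
# Scenario (ii): the victim leaves, so the "yes" to "no" flips are informative.
left_evidence = flipped_positions(before, after, old=True, new=False)
```

Either list can then be passed to the same correlation-based binning or clustering step.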

8.2. Risk Quantification for the Genome Reconstruction Attack

The identified vulnerability and the proposed attack algorithm can be used as a privacy risk quantification tool by the beacon operator. For this, we foresee a simulation-based technique to quantify the risk and show it to the beacon operator. This will be a customized technique for each donor in the beacon, and the following discussion is for one particular donor. Assume that a batch of new donors is gathered by the beacon between two consecutive updates. To quantify the genome reconstruction risk, one may run the attack we introduced in Section 6, pretending the donor is added to the beacon along with the other newcomer donors, and compute the fraction of the SNPs that can be reconstructed. Then, using public sources (such as HapMap), one can gather a small number of genomes belonging to individuals from the same population as the donor. Then, the same attack can be run for the selected people (i.e., adding each random individual along with the other newcomer donors), their reconstruction rates can be set as the baseline, and eventually, a privacy risk percentile can be provided for the donor. Moreover, for all correctly inferred SNPs, one can perform a pathogenic scan on ClinVar (Landrum et al., 2017) to inform the donor about what traits they might be linked to should their genome be put onto the beacon. Using this information and based on the privacy risk of the donor, either the donor or the beacon operator will decide whether or not to add the donor to the beacon in that update. This process can be repeated for all the newcomer donors.
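The final step of this procedure, turning reconstruction rates into a percentile, can be sketched as follows. The example assumes that the simulated attack has already produced a reconstruction rate for the donor and for each baseline individual. The percentile definition (share of baseline individuals with a strictly lower rate) and all values are assumptions made for illustration.

```python
def risk_percentile(donor_rate, baseline_rates):
    """Percentage of baseline individuals whose simulated reconstruction
    rate is strictly lower than the donor's."""
    if not baseline_rates:
        raise ValueError("at least one baseline individual is required")
    below = sum(1 for rate in baseline_rates if rate < donor_rate)
    return 100.0 * below / len(baseline_rates)


# Toy reconstruction rates (illustrative values only).
baseline_rates = [0.40, 0.55, 0.60, 0.70, 0.80]
donor_percentile = risk_percentile(0.62, baseline_rates)
```

A donor at a high percentile is easier to reconstruct than most comparable individuals, which could argue for postponing the addition to a larger batch.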

We foresee that using such a quantification algorithm, a potential beacon participant can provide informed consent about how (and what portion of) their data can be used by the beacons (e.g., when the beacon can start using their data in its responses or when the beacon should stop using their data). Similarly, such a tool can guide a beacon operator on the number of participants to include in a batch to update the beacon.

8.3. Mitigation Techniques

To mitigate membership inference attacks against beacons, several countermeasures have been proposed. Shringarpure and Bustamante considered: (i) increasing the beacon size, (ii) sharing only small genomic regions, (iii) using single population beacons, (iv) not publishing the metadata of a beacon, and (v) adding control samples to the beacon dataset (Shringarpure and Bustamante, 2015). Raisaro et al. proposed assigning a query budget for each individual’s genome as a countermeasure. However, later von Thenen et al. showed that such query budgets are not effective considering the auxiliary information of the attacker about the victim and correlations between the SNPs (von Thenen et al., 2018). Lately, Al Aziz et al. proposed two algorithms that are based on randomizing the response set of the beacons with the goal of protecting beacon members’ privacy while maintaining the efficacy of the beacon servers (Al Aziz et al., 2017). However, most of such techniques directly reduce the utility of the beacon without carefully analyzing a balance between privacy (of beacon participants) and utility (of beacon responses). Thus, we believe that existing countermeasures proposed for membership inference are not directly applicable to mitigate genome reconstruction attack.

To mitigate genome reconstruction, here we suggest three simple methods: (i) updating the beacon content only when more than one donor is added in an update; (ii) adding (or removing) donors after quantifying their risks against genome reconstruction (as discussed in Section 8.2); and (iii) adjusting the diversity of the beacon to have beacons with mixed-ethnicity genome donors. We observed that for beacons with mixed-ethnicity donors, it is hard to construct the correlation model (unless the beacon discloses the ethnicities of the donors as metadata), and hence it is hard to conduct the proposed correlation-based genome reconstruction attacks. We will work on more sophisticated countermeasures in future work.

9. Conclusion and Future work

In this paper, we have identified and analyzed a serious privacy concern for genomic data-sharing beacons. Thus far, the only privacy vulnerability that had been identified for beacons was membership inference. We have identified and, via extensive analysis, shown the impact of another serious privacy concern for beacons: genome reconstruction. We have shown the practicality of the identified privacy concern in real life by presenting the whole attack strategy, including genotype-phenotype inference. Furthermore, we have shown how the genome reconstruction attack can be used together with membership inference to identify privacy-sensitive phenotypes of individuals. In future work, we will develop privacy-risk quantification tools for beacon operators (and donors) using the identified vulnerability and also considering the risk of membership inference. Furthermore, we will work on mitigation techniques for the identified vulnerability while preserving the utility of beacon content and beacon responses.

References

  • ga4 (2020) 2020. https://www.ga4gh.org/about-us/. [Online; accessed 10-January-2020].
  • bea (2020) 2020. http://beacon-network.org. [Online; accessed 10-January-2020].
  • sit (2020a) 2020a. https://ghr.nlm.nih.gov/primer/genomicresearch/snp. [Online; accessed 10-January-2020].
  • ope (2020) 2020. http://opensnp.org. [Online; accessed 10-January-2020].
  • sit (2020b) 2020b. Disease Risk. http://www.eupedia.com/genetics/medical_dna_test.shtml [Online; accessed 10-January-2020].
  • Al Aziz et al. (2017) Md Momin Al Aziz, Reza Ghasemi, Md Waliullah, and Noman Mohammed. 2017. Aftermath of Bustamante attack on genomic beacon service. BMC Medical Genomics 10, 2 (2017), 43.
  • Allen et al. (2010) Hana Lango Allen, Karol Estrada, Guillaume Lettre, Sonja I Berndt, Michael N Weedon, Fernando Rivadeneira, Cristen J Willer, Anne U Jackson, Sailaja Vedantam, Soumya Raychaudhuri, et al. 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 7317 (2010), 832–838.
  • Ayday et al. (2013a) Erman Ayday, Emiliano De Cristofaro, Jean-Pierre Hubaux, and Gene Tsudik. 2013a. The chills and thrills of whole genome sequencing. (2013).
  • Ayday et al. (2013b) Erman Ayday, Jean Louis Raisaro, Jean-Pierre Hubaux, and Jacques Rougemont. 2013b. Protecting and evaluating genomic privacy in medical tests and personalized medicine. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society. 95–106.
  • Baldi et al. (2011) Pierre Baldi, Roberta Baronio, Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. 2011. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. In Proceedings of the 18th ACM conference on Computer and communications security. 691–702.
  • Bezdek et al. (1984) James C Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203.
  • Blanton et al. (2012) Marina Blanton, Mikhail J Atallah, Keith B Frikken, and Qutaibah Malluhi. 2012. Secure and efficient outsourcing of sequence comparisons. In Proceedings of European Symposium on Research in Computer Security. 505–522.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
  • Claes et al. (2014) Peter Claes, Denise K Liberton, Katleen Daniels, Kerri Matthes Rosana, Ellen E Quillen, Laurel N Pearson, Brian McEvoy, Marc Bauchet, Arslan A Zaidi, Wei Yao, et al. 2014. Modeling 3D facial shape from DNA. PLoS Genetics 10, 3 (2014).
  • Clayton (2010) David Clayton. 2010. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics (2010).
  • Collins and Varmus (2015) Francis S Collins and Harold Varmus. 2015. A new initiative on precision medicine. New England Journal of Medicine 372, 9 (2015), 793–795.
  • Consortium et al. (2003) International HapMap Consortium et al. 2003. The international HapMap project. Nature 426, 6968 (2003), 789.
  • De Cristofaro et al. (2013) Emiliano De Cristofaro, Sky Faber, and Gene Tsudik. 2013. Secure Genomic Testing with Size- and Position-hiding Private Substring Matching. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society.
  • Deznabi et al. (2018) Iman Deznabi, Mohammad Mobayen, Nazanin Jafari, Oznur Tastan, and Erman Ayday. 2018. An inference attack on genomic data using kinship, complex correlations, and phenotype information. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 15, 4 (2018), 1333–1343.
  • Dwork (2006) Cynthia Dwork. 2006. Differential Privacy. Proceedings of the 33rd International Conference on Automata, Languages and Programming (2006).
  • Erlich and Narayanan (2014) Yaniv Erlich and Arvind Narayanan. 2014. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics 15, 6 (2014), 409–421.
  • Fienberg et al. (2011) Stephen E Fienberg, Aleksandra Slavkovic, and Caroline Uhler. 2011. Privacy preserving GWAS data sharing. In IEEE 11th International Conference on Data Mining Workshops (ICDMW). 628–635.
  • Gibbs et al. (2003) Richard A Gibbs, John W Belmont, Paul Hardenbol, Thomas D Willis, Fuli Yu, Huanming Yang, Lan-Yang Ch’ang, Wei Huang, Bin Liu, Yan Shen, et al. 2003. The international HapMap project. Nature 426, 6968 (2003), 789–796.
  • Gitschier (2009) Jane Gitschier. 2009. Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project. American Journal of Human Genetics 84, 2 (2009), 251–258.
  • Glusman et al. (2011) Gustavo Glusman, Juan Caballero, Denise E Mauldin, Leroy Hood, and Jared C Roach. 2011. Kaviar: an accessible system for testing SNV novelty. Bioinformatics 27, 22 (2011), 3216–3217.
  • Greshake et al. (2014) Bastian Greshake, Philipp E Bayer, Helge Rausch, and Julia Reda. 2014. OpenSNP–a crowdsourced web resource for personal genomics. PLoS One 9, 3 (2014), e89204.
  • Gymrek et al. (2013) Melissa Gymrek, Amy L McGuire, David Golan, Eran Halperin, and Yaniv Erlich. 2013. Identifying personal genomes by surname inference. Science 339, 6117 (2013), 321–324.
  • Hayden (2013) Erika Check Hayden. 2013. Privacy protections: The genome hacker. Nature 497 (2013), 172–174.
  • Homer et al. (2008a) Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. 2008a. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008).
  • Homer et al. (2008b) Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. 2008b. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008).
  • Humbert et al. (2013) Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux, and Amalio Telenti. 2013. Addressing the concerns of the Lacks family: quantification of kin genomic privacy. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1141–1152.
  • Humbert et al. (2015a) Mathias Humbert, Kévin Huguenin, Joachim Hugonot, Erman Ayday, and Jean-Pierre Hubaux. 2015a. De-anonymizing Genomic Databases Using Phenotypic Traits. Proceedings on Privacy Enhancing Technologies 2015 (2015), 99–114.
  • Humbert et al. (2015b) Mathias Humbert, Kévin Huguenin, Joachim Hugonot, Erman Ayday, and Jean-Pierre Hubaux. 2015b. De-anonymizing genomic databases using phenotypic traits. Proceedings on Privacy Enhancing Technologies 2015, 2 (2015), 99–114.
  • Im et al. (2012) Hae Kyung Im, Eric R Gamazon, Dan L Nicolae, and Nancy J Cox. 2012. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. American Journal of Human Genetics 90, 4 (2012), 591–598.
  • Jacobs et al. (2009) Kevin B Jacobs, Meredith Yeager, Sholom Wacholder, David Craig, Peter Kraft, David J Hunter, Justin Paschal, Teri A Manolio, Margaret Tucker, Robert N Hoover, et al. 2009. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nature genetics 41, 11 (2009), 1253–1257.
  • Jha et al. (2008) Somesh Jha, Louis Kruger, and Vitaly Shmatikov. 2008. Towards practical privacy for genomic computation. In Proceedings of IEEE Symposium on Security and Privacy. 216–230.
  • Johnson and Shmatikov (2013) Aaron Johnson and Vitaly Shmatikov. 2013. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1079–1087.
  • Kale et al. (2017) Gulce Kale, Erman Ayday, and Öznur Tastan. 2017. A utility maximizing and privacy preserving approach for protecting kinship in genomic databases. Bioinformatics (2017).
  • Kayser and de Knijff (2011) Manfred Kayser and Peter de Knijff. 2011. Improving human forensics through advances in genetics, genomics and molecular biology. Nature Reviews Genetics 12, 3 (2011), 179–192.
  • Kuhn (1955) Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly 2, 1-2 (1955), 83–97.
  • Landrum et al. (2017) Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth R Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Wonhee Jang, et al. 2017. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic acids research 46, D1 (2017), D1062–D1067.
  • Ledford (2016) H Ledford. 2016. AstraZeneca launches project to sequence 2 million genomes. Nature 532, 7600 (2016), 427.
  • Lin et al. (2004) Z. Lin, A. B. Owen, and R. B. Altman. 2004. Genomic research and human subject privacy. Science 305, 5681 (Jul 2004), 183.
  • Lippert et al. (2017) Christoph Lippert, Riccardo Sabatini, M. Cyrus Maher, Eun Yong Kang, Seunghak Lee, Okan Arikan, Alena Harley, Axel Bernal, Peter Garst, Victor Lavrenko, Ken Yocum, Theodore Wong, Mingfu Zhu, Wen-Yun Yang, Chris Chang, Tim Lu, Charlie W. H. Lee, Barry Hicks, Smriti Ramakrishnan, Haibao Tang, Chao Xie, Jason Piper, Suzanne Brewerton, Yaron Turpaz, Amalio Telenti, Rhonda K. Roby, Franz J. Och, and J. Craig Venter. 2017. Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences (2017). https://doi.org/10.1073/pnas.1711125114
  • Liu et al. (2012) Fan Liu, Fedde van der Lijn, Claudia Schurmann, Gu Zhu, M Mallar Chakravarty, Pirro G Hysi, Andreas Wollstein, Oscar Lao, Marleen de Bruijne, M Arfan Ikram, et al. 2012. A genome-wide association study identifies five loci influencing facial morphology in Europeans. PLoS Genetics 8, 9 (2012).
  • Malin and Sweeney (2004) Bradley A. Malin and Latanya Sweeney. 2004. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics 37, 3 (2004), 179–192.
  • Manning et al. (2012) Alisa K Manning, Marie-France Hivert, Robert A Scott, Jonna L Grimsby, Nabila Bouatia-Naji, Han Chen, Denis Rybin, Ching-Ti Liu, Lawrence F Bielak, Inga Prokopenko, et al. 2012. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nature Genetics 44, 6 (2012), 659–669.
  • Naveed et al. (2014) Muhammad Naveed, Shashank Agrawal, Manoj Prabhakaran, XiaoFeng Wang, Erman Ayday, Jean-Pierre Hubaux, and Carl Gunter. 2014. Controlled Functional Encryption. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security.
  • Naveed et al. (2015) Muhammad Naveed, Erman Ayday, Ellen W Clayton, Jacques Fellay, Carl A Gunter, Jean-Pierre Hubaux, Bradley A Malin, and XiaoFeng Wang. 2015. Privacy in the genomic era. ACM Computing Surveys (CSUR) 48, 1 (2015), 6.
  • Ng et al. (2002) Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002.

    On spectral clustering: Analysis and an algorithm. In

    Advances in neural information processing systems. 849–856.
  • Ou et al. (2012) Xue-ling Ou, Jun Gao, Huan Wang, Hong-sheng Wang, Hui-ling Lu, and Hong-yu Sun. 2012. Predicting human age with bloodstains by sjTREC quantification. PloS ONE 7, 8 (2012).
  • Raisaro et al. (2016) Jean L Raisaro, Florian Tramer, Ji Zhanglong, Diyue Bu, Yongan Zhao, Knox Carey, David Lloyd, Heidi Sofia, Dixie Baker, Paul Flicek, Suyash S Shringarpure, Carlos D Bustamante, Suang Wang, Xiaoqian Jiang, Lucila Ohno-Machado, Haixu Tang, XiaoFeng Wang, and Jean-Pierre Hubaux. 2016. Addressing Beacon Re-Identification Attacks: Quantification and Mitigation of Privacy Risks. The Journal of the American Medical Informatics Association 24, 4 (2016), 799–805.
  • Rodriguez et al. (2019) Mayra Z Rodriguez, Cesar H Comin, Dalcimar Casanova, Odemir M Bruno, Diego R Amancio, Luciano da F Costa, and Francisco A Rodrigues. 2019. Clustering algorithms: A comparative approach. PloS one 14, 1 (2019), e0210236.
  • Samani et al. (2015) Sahel Shariati Samani, Zhicong Huang, Erman Ayday, Mark Elliot, Jacques Fellay, Jean-Pierre Hubaux, and Zoltán Kutalik. 2015. Quantifying genomic privacy via inference attack with high-order SNV correlations. In Security and Privacy Workshops (SPW), 2015 IEEE. 32–40.
  • Sankararaman et al. (2009) Sriram Sankararaman, Guillaume Obozinski, Michael I Jordan, and Eran Halperin. 2009. Genomic privacy and limits of individual detection in a pool. Nature Genetics 41, 9 (2009), 965–967.
  • Schatz (2015) Michael C Schatz. 2015.

    Biological data sciences in genome research.

    Genome Research 25, 10 (2015), 1417–1422.
  • Shringarpure and Bustamante (2015) Suyash S Shringarpure and Carlos D Bustamante. 2015. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics 97, 5 (2015), 631–646.
  • Sweeney et al. (2013) Latanya Sweeney, Akua Abu, and Julia Winn. 2013. Identifying participants in the personal genome project by name. arXiv preprint arXiv:1304.7605 (2013).
  • Tramer et al. (2015) Florian Tramer, Zhicong Huang, Jean-Pierre Hubaux, and Erman Ayday. 2015. Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies. In Proceedings of ACM Conference on Computer and Communications Security (CCS). 1286–1297.
  • Troncoso-Pastoriza et al. (2007) Juan Ramón Troncoso-Pastoriza, Stefan Katzenbeisser, and Mehmet Celik. 2007. Privacy preserving error resilient DNA searching through oblivious automata. Proceedings of ACM CCS ’07 (2007).
  • Visscher and Hill (2009) Peter M Visscher and William G Hill. 2009. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet 5, 10 (2009).
  • von Thenen et al. (2018) Nora von Thenen, Erman Ayday, and A Ercument Cicek. 2018. Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics 35, 3 (2018), 365–371.
  • Walsh et al. (2011) Susan Walsh, Fan Liu, Kaye N Ballantyne, Mannis van Oven, Oscar Lao, and Manfred Kayser. 2011. IrisPlex: a sensitive DNA tool for accurate prediction of blue and brown eye colour in the absence of ancestry information. Forensic Science International: Genetics 5, 3 (2011), 170–180.
  • Wang et al. (2009) Rui Wang, Yong Fuga Li, XiaoFeng Wang, Haixu Tang, and Xiaoyong Zhou. 2009. Learning your identity and disease from research papers: information leaks in genome wide association study. In Proceedings of the 16th ACM Conference on Computer and Communications Security. 534–544.
  • Yu et al. (2014) Fei Yu, Stephen E Fienberg, Aleksandra B Slavković, and Caroline Uhler. 2014. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics 50 (2014), 133–141.
  • Zhou et al. (2011) Xiaoyong Zhou, Bo Peng, Yong Fuga Li, Yangyi Chen, Haixu Tang, and XiaoFeng Wang. 2011. To release or not to release: Evaluating information leaks in aggregate human-genome data. ESORICS’11: Proc. of the 16th European Conf. on Research in Computer Security (2011), 607–627.
  • Zubakov et al. (2010) Dmitry Zubakov, Fan Liu, MC Van Zelm, J Vermeulen, BA Oostra, CM Van Duijn, GJ Driessen, JJM Van Dongen, Manfred Kayser, and AW Langerak. 2010. Estimating human age from T-cell DNA rearrangements. Current Biology 20, 22 (2010), R970–R971.

Appendix A Evaluation of Genome Reconstruction on the HapMap Beacon

Figure 9 shows the success (precision, recall, and accuracy) of the reconstruction for varying numbers of newly added donors () in the HapMap beacon. Figure 10 shows the effect of varying the number of bins () in the genome reconstruction attack when the number of newly added donors () is for the HapMap beacon. Finally, Figure 11 shows the effect of the beacon size () at time when new donors are added between times and for the HapMap beacon. Note that, due to the number of individuals in the HapMap dataset, we could increase the size of the HapMap beacon to donors at most.

Figure 9. Precision, recall, and accuracy for the genome reconstruction of a newly added donor to the HapMap beacon with varying number of newly added donors. Panels: (a) precision, (b) recall, (c) accuracy.
Figure 10. Precision, recall, and accuracy for the genome reconstruction of a newly added donor to the HapMap beacon with varying number of bins/clusters () in the genome reconstruction attack. Number of newly added donors () is . Panels: (a) precision, (b) recall, (c) accuracy.
Figure 11. Precision, recall, and accuracy for the genome reconstruction of a newly added donor to the HapMap beacon with varying beacon size (). Number of newly added donors is and for all plots. Panels: (a) precision, (b) recall, (c) accuracy.
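As a reading aid for the three metrics reported in this appendix, the following minimal sketch shows one way precision, recall, and accuracy could be computed for a reconstructed genome. It assumes a binary encoding per SNP position (1 if the donor carries the minor allele, 0 otherwise); this encoding, the function name `reconstruction_metrics`, and the toy vectors are illustrative assumptions, not the evaluation code used for the figures.

```python
from typing import Dict, Sequence


def reconstruction_metrics(truth: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compare a reconstructed genome with the true one, position by position.

    Both inputs are 0/1 vectors over the same SNP positions, where 1 means the
    donor carries the minor allele. This binary encoding is an assumption made
    for illustration only.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must cover the same SNP positions")

    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly inferred minor alleles
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inferred but not actually present
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # present but missed by the attack
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly inferred absences

    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Hypothetical toy data: 8 SNP positions for one victim.
true_genome = [1, 0, 1, 1, 0, 0, 1, 0]
reconstructed = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = reconstruction_metrics(true_genome, reconstructed)
print(metrics)
```

Under this encoding, precision measures how many of the inferred minor alleles are correct, recall measures how many of the victim's minor alleles the attacker recovered, and accuracy also credits correctly inferred absences.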