GenShare: Sharing Accurate Differentially-Private Statistics for Genomic Datasets with Dependent Tuples

Motivation: The falling cost of DNA sequencing has led to a quantum leap in the availability of genomic data. While sharing genomic data across researchers is an essential driver of advances in health and biomedical research, sharing is often infeasible due to data privacy concerns. Differential privacy is one of the rigorous mechanisms used to share aggregate statistics from genomic datasets without disclosing any private individual-level data. However, differential privacy can still divulge sensitive information about the dataset participants due to the correlation between dataset tuples. Results: Here, we propose GenShare, a model built upon Laplace-perturbation-mechanism-based DP, to introduce a privacy-preserving query-answering sharing model for statistical genomic datasets that include dependency due to the inherent correlations between the genomes of individuals (i.e., family ties). We demonstrate our privacy improvement over the state-of-the-art approaches for a range of practical queries, including cohort discovery, minor allele frequency, and χ² association tests. With a fine-grained analysis of sensitivity in the Laplace perturbation mechanism and by considering joint distributions, GenShare nearly achieves the formal privacy guarantees that the theory of differential privacy permits for queries computed over independent tuples (with differences of only up to 6%), while keeping query results as accurate as theoretically guaranteed by differential privacy. To empower advances in different scientific and medical research areas, GenShare presents a path toward an interactive genomic data sharing system when the datasets include participants with familial relationships.

1 Introduction

Fast-paced high-throughput sequencing technologies are generating a flood of large-scale datasets and biobanks. The number of sequenced human genomes has been increasing at an exponential rate, and about 2.5 million genomes have now been sequenced around the world. This number is projected to reach 105 million genomes in 2025 stephens2015big , especially after the COVID-19 pandemic, during which many countries have decided to study genomic data at a population scale. These rich troves of data are becoming the keystone for empowering advances in medical science. Researchers need large amounts of genomic data that they can leverage to gain a better understanding of 1) the genetic basis of the human genome, identifying associations between phenotypes and specific parts of DNA, and 2) disease diagnosis and treatment (e.g., personalized medicine farnaes2018rapid ). However, since the human genome is the ultimate personal identifier, sharing genomic data is normally discouraged because of privacy concerns, the possible legal, ethical, and financial consequences, and the data protection guidelines in many countries. Hence, sharing genomic data while preserving the privacy of individuals has been challenging for many different fields (e.g., medicine, science, bioinformatics) bonomi2020privacy . The challenge worsens when sharing large datasets or their statistics, as they are usually vulnerable to privacy leaks due to the inherent correlations between the genomes of participating family members almadhoun2020differential ; almadhoun2020inference . To enable the sharing of genomic datasets and to gain more accurate and refined biomedical insights, researchers have proposed applying the concept of differential privacy (DP) dwork2008differential as a protective measure against several inference attacks on genomic datasets (e.g., the Homer attack homer2008resolving ).
Informally, a (randomized) algorithm is differentially private if its output distribution is approximately the same when executed on two inputs (e.g., datasets D and D′) that differ by the presence of a single individual’s data (i.e., neighboring datasets). This condition prevents an adversary with access to the algorithm output from learning anything substantial about any one individual, since the probability of observing a certain outcome for the neighboring datasets does not differ by more than a multiplicative factor of e^ε. Here, ε is referred to as the privacy budget, where smaller values of ε give stronger guarantees of privacy. DP methods are widely used for privately sharing summary statistics after adding adequate noise. One common DP approach is to add Laplace noise (i.e., the Laplace perturbation mechanism (LPM)) nissim2007smooth based on the global sensitivity (GS) of the statistical query (i.e., the maximum difference between the query results on any two neighboring datasets D and D′). uhlerop2013privacy ; yu2014scalable ; johnson2013privacy developed differentially private algorithms that release different queries in a privacy-preserving way from statistical genomic studies, such as genome-wide association studies (GWAS). These queries include but are not limited to 1) count or cohort discovery: to query how many participants in the dataset satisfy given criteria, 2) association tests: to compute statistics for a point mutation (single nucleotide polymorphism, SNP), or 3) minor allele frequency (MAF): to compute the frequency at which the rare nucleotide occurs at a particular SNP.
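As a rough illustration of the Laplace perturbation mechanism described above, the following Python sketch perturbs a toy count query. The function names, the toy genotype list, and the sensitivity value of 1 (one individual changes a simple count by at most 1) are assumptions made for this example and are not taken from the paper.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value perturbed with Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)

def count_query(genotypes, predicate):
    """Count the participants whose genotype satisfies the predicate."""
    return sum(1 for g in genotypes if predicate(g))

# Toy data (hypothetical): minor-allele counts (0, 1, or 2) of eight participants at one SNP.
genotypes = [0, 1, 2, 1, 0, 0, 2, 1]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable

true_count = count_query(genotypes, lambda g: g > 0)  # carriers of the minor allele
# Assumed sensitivity of 1: one individual changes a simple count by at most 1.
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(true_count, round(noisy_count, 2))
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy. Python's `random` module is used here only for illustration; it is not a cryptographically secure noise source.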

Despite the rigorous mathematical foundation of DP cho2020privacy ; he2018achieving ; raisaro2018m and the fact that only aggregate-level information is shared, DP mechanisms can still leak sensitive information about the participating individuals if the dataset includes dependent tuples (i.e., family members). Genomic datasets commonly have dependency between their tuples (or records) due to the inherent correlations between the genomes of individuals that have family ties. In our previous work almadhoun2020differential ; almadhoun2020inference , we demonstrated the feasibility of attribute and membership inference attacks on differentially private query results by exploiting the dependence between tuples. Our evaluation over real-world statistical genomic datasets showed that kinship relations between individuals participating in a genomic dataset cause a significant reduction in the privacy guarantees of traditional DP-based mechanisms. Existing studies have proposed general mechanisms to tackle this problem, such as Pufferfish kifer2011no and its extensions he2014blowfish ; chen2014correlated ; liu2016dependence ; zhao2017dependent . However, these efforts fail to capture the statistical relationships between dependent tuples in genomic datasets and hence result in sub-optimal solutions, which limits their effectiveness in practice. They either lack privacy (degrading the rigorous guarantees of privacy) or utility (introducing an excessive amount of noise), as we show in our evaluation in Section 4. Therefore, there is a critical need for a fine-grained analysis of LPM sensitivity that considers different queries over genomic datasets, in order to enable privacy-preserving genomic data sharing when the dataset includes dependent tuples. This will encourage both healthcare stakeholders and data donors (including families) to widely share and use such valuable data resources.

Our goal in this paper is to enable privacy-preserving sharing of summary statistics from genomic data with dependent tuples by achieving the privacy and utility (encompassing accuracy) guarantees that standard DP provides when all the participants of the dataset are independent (i.e., independent tuples). To achieve this goal, we aim at preserving the privacy of the genomic data donors by analyzing and perturbing the query results with controlled noise in order to minimize the probability of leaking undesired information. We propose the GenShare model, which provides the rigorous theoretical guarantees of the DP formulation in terms of privacy and utility. The key idea of GenShare is to 1) theoretically analyze statistical relationships between the tuples in the genomic datasets to infer both pairwise correlations and complex joint correlations between multiple participants, 2) compute the dependent sensitivity, i.e., how much each query can reveal out of such statistical relationships, and 3) take effective DP protective measures based on each query's sensitivity. Focusing on three types of real-world queries, (1) count or cohort discovery, (2) MAF, and (3) χ² tests, we empirically demonstrate the privacy and utility improvements of our proposed DP-based mechanism for each query type. We provide a use case showing how GenShare could be used to enable data sharing with privacy. Our key theoretical advances show that an LPM-based approach, combined with a fine-grained computation of the sensitivity performed by the data owner (i.e., the entity which collects/generates the genomic dataset), provably achieves the expected data utility of the shared query results, while maintaining the privacy guarantees of DP that can be obtained when the query is computed over independent tuples. This paper makes the following contributions:

  • Introducing a query-answering sharing model, “GenShare”, for genomic datasets with formal privacy guarantees, while ensuring that the query results are as accurate as theoretically guaranteed by DP.

  • Providing an effective LPM-based analysis based on the dependent and independent tuples included in the query computations, which is more accurate and robust than most similar existing approaches.

  • Following the real-world workflows in recent studies for different queries, we show the robustness of GenShare using a range of queries such as cohort discovery, MAF, and χ² tests over real-world statistical genomic datasets.

  • Achieving almost the same privacy guarantees (in terms of estimation error, which is commonly used to quantify genomic privacy) as the query that is computed over independent tuples.

To our knowledge, GenShare is the first model that dynamically and effectively tailors the DP protective measures to each query's sensitivity in order to protect the privacy of individuals with simple or complex correlations who participate in a genomic dataset, while simultaneously maximizing the benefits of data sharing for science. The rest of this paper is organized as follows: Section 2 presents related prior work on DP mechanisms under dependent tuples. Section 3 explains our proposed privacy model “GenShare”, followed by Section 4, where we evaluate GenShare and compare it to the state-of-the-art mechanisms. Section 5 presents the conclusion and highlights future research directions suggested by this paper.

2 Related Work

Several studies have questioned whether DP is valid for correlated data. kifer2011no was the first to raise the issue of privacy degradation when DP is applied over a dataset with correlated tuples. Existing solutions that try to handle the correlation between tuples in the datasets can be categorized into two types, according to whether they consider 1) the dependency between different tuples (i.e., individual-individual correlations), or 2) the dependency among a single individual’s data at different entries of a time series or stream (i.e., temporal correlations).

First, to handle the individual-individual correlations (or vertical correlations) between tuples, Group DP dwork2014algorithmic is one of the first studies; it proposes adding noise proportional to the size of the largest group of correlated tuples in the dataset. This method adds a tremendous amount of noise to a dataset with dependent tuples, thus destroying the data utility. As a generalization of DP, kifer2012rigorous proposes another general and customizable method called Pufferfish, which handles the dependent tuples by adjusting the Laplace scale. However, the main challenge of Pufferfish is the lack of suitable mechanisms to achieve the expected privacy guarantees. Following this general approach of Pufferfish, the baseline approach proposed by chen2014correlated tries to handle the correlation by multiplying the original sensitivity of the query by the number of correlated records (i.e., query sensitivity = number of correlated records × original query sensitivity). Bayesian DP yang2015bayesian uses a modification of Pufferfish, but it only focuses on modeling the tuple correlations by Gaussian Markov random fields. Subsequent studies such as liu2016dependence ; zhao2017dependent ; almadhoun2020differential try to adjust the sensitivity by introducing dependence coefficients according to the number of correlated data, either considering the pairwise correlation between dataset tuples as in liu2016dependence or using heuristic analysis (empirically computed query sensitivity) as in almadhoun2020differential .
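To see why multiplying the sensitivity by the number of correlated records is costly, the short Python sketch below compares the resulting Laplace scales. It only illustrates the baseline scaling rule described in this section; the function names and the base sensitivity of 1 are assumptions for the example.

```python
def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise in the standard mechanism."""
    return sensitivity / epsilon

def group_scaled_laplace_scale(sensitivity, epsilon, n_correlated):
    """Baseline that multiplies the query sensitivity by the number of correlated records."""
    return laplace_scale(sensitivity * n_correlated, epsilon)

epsilon = 0.5
base_sensitivity = 1.0  # assumed sensitivity of a simple count query
for n_correlated in (1, 2, 3, 5):
    scale = group_scaled_laplace_scale(base_sensitivity, epsilon, n_correlated)
    # For Laplace(0, scale), the expected absolute noise equals the scale itself.
    print(n_correlated, scale)
```

The noise magnitude grows linearly with the number of correlated records, regardless of how strongly those records are actually related, which is the utility loss that finer-grained sensitivity analyses try to avoid.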

In the second setting, which handles temporal correlations, song2017pufferfish ; chen2017pegasus propose sharing statistics and counts of a data stream considering horizontal correlations. song2017pufferfish proposes two mechanisms: the Wasserstein mechanism and, when the correlations can be modeled by a Bayesian network, the Markov Quilt mechanism. cao2017quantifying also considers temporal correlations that can be modeled by a Markov chain.

In Section 4.4, we compare our model (in terms of privacy) with similar existing approaches from the two aforementioned categories liu2016dependence ; almadhoun2020differential ; song2017pufferfish ; dwork2014algorithmic . Since hidden Markov models are not suitable for modeling statistical genomic datasets, we do not compare our model with the mechanisms that propose hidden-Markov-based models yang2015bayesian ; song2017pufferfish ; chen2017pegasus .

Figure 1: Our proposed GenShare model

3 Proposed Method

As discussed in Section 2, some researchers have proposed general mechanisms to tackle the degradation in the privacy guarantees of DP that is caused by the dependency between database tuples kifer2012rigorous . However, this privacy risk has not yet been studied for statistical genomic datasets (which potentially include many dependent tuples due to correlations between the genomes of individuals that have family ties), and existing mitigation techniques chen2014correlated ; zhao2017dependent ; chen2017pegasus ; liu2016dependence ; almadhoun2020differential fail to theoretically capture the statistical relationships between dependent tuples in genomic datasets, and hence result in sub-optimal solutions in terms of privacy and utility.

As a first step towards mitigating this risk, following a similar analysis as in liu2016dependence (but modeling the correlations differently, i.e., considering joint correlations), we propose GenShare as a formalization of the ε-DP notion for genomic datasets with dependent tuples. Among all family trees in a dataset D, we denote the one with the strongest relationships (i.e., the one with the largest aggregate kinship coefficient between any individual and the other family members) as the strongest dependent tuple set and represent it as (). We let D and D′ be neighboring datasets with b dependent tuples (i.e., among b dependent tuples, D and D′ differ in one record) if the change of one tuple value in D causes a change of at most b-1 tuple values in D′. Thus, we define GenShare for genomic datasets with dependent tuples using this notion of neighboring datasets, and to achieve the guarantees of ε-DP, we re-formulate LPM by introducing a new fine-grained “sensitivity” definition for genomic datasets that include dependent tuples, as follows:

Theorem 3.1.

For a dataset D with b genomic dependent tuples, a randomized algorithm provides -differential privacy for a query Q with global sensitivity , if .

Lemma 3.1.

Let represent the global sensitivity of a query . The dependent sensitivity for sharing the results of query over a genomic dataset with dependent tuples .

Proof:

To prove Theorem 3.1 and compute , we consider a simple query function to publish a sanitized version of a dataset with dependent tuples. Among these dependent tuples, we have the participant and participants in set , where may contain more than one tuple. To satisfy -DP under this scenario we have:

(1)

where is a randomized algorithm, represents the sanitized version of a data point (SNP) , represents the SNP value of individual , and represents the set of SNP values of the individuals in set , where = . Also, and values are selected to obtain the maximum difference in the value (i.e., = 2 if = 0 and if = 0, = 2). This is to consider the effect of maximum change in the SNP value of participant on the values of dependent individuals in .

To achieve -DP, we add Laplace noise proportional to the query’s global sensitivity, by using a proper Laplace scale for the Laplace distribution, where = . Our goal is to find a proper scaling factor when sharing statistics from a dataset with dependent tuples by changing the original global sensitivity to . By transforming the left-hand side of Equation 1

using the law of total probabilities, we have:

(2)

Here, is a vector representing the values of the SNPs in . includes the set of vectors for potential values of (considering Mendel’s law and the relationships of the dependent tuples in the dataset). Also, is a function that computes the sum of SNP values in . To compute the potential values in , we develop probabilistic models representing the evolution of an SNP value over multiple generations. For this, based on Mendel’s law, we find the family relationships between individuals and compute the probabilities of moving from one SNP value to another, from one generation to the next. The right-hand side of Equation 2 contains two terms: the first (left) term considers the change in SNP i of individual j from the value h to h’, and the second (right) term considers the change in SNP i of the individuals in (due to the dependency between j and the individuals in ) given the change in from the value h to h’. For the first (left) term of the right-hand side of Equation 2, we have:

(3)

where represents which is the maximum change in from the value h to h’. If we ignore the second right term of the right-hand side of Equation 2, and combine the remaining of Equation 1 and Equation 2 , then we have:

(4)

The scale for the Laplace distribution is: which is compatible with the Laplace scale in the standard DP mechanism. To study the effect of the maximum change in an individual ’s data on b-1 dependent tuples (in ), we focus on the second (right) term of the right-hand side of Equation 2 to define as follows:

(5)

Combining Equation 1-5, we have:

(6)

Therefore, we represent the dependent sensitivity for sharing the results of query over a genomic dataset with dependent tuples as = + = .

We derive the dependent sensitivity as:

(7)

In practice, depending on over which individuals a query is computed, first the strongest dependent tuple set among such individuals is determined, and then, the corresponding dependent sensitivity is computed. Furthermore, we observe that the inference power of an adversary may be affected by the number of dependent tuples (i.e., family members) and independent tuples (i.e., unrelated members) included in the query results. Hence, in our sensitivity analysis, (i.e., LPM scale) value can be neatly chosen to find the adequate value of . We show our heuristic analysis on how to choose in Section 4.4.
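The first practical step above, determining the strongest dependent tuple set among the individuals covered by a query, could be sketched as follows. This is only an illustration under stated assumptions: the data layout, the function names, and the toy families are hypothetical, and the kinship coefficients are the standard values for each relationship type rather than numbers taken from the paper.

```python
# Standard kinship coefficients by relationship type (general genetics facts).
KINSHIP = {
    "parent-child": 0.25,
    "full-sibling": 0.25,
    "second-degree": 0.125,  # e.g., aunt/uncle, grandparent, half-sibling
    "first-cousin": 0.0625,
}

def aggregate_kinship(member, pairs):
    """Sum the kinship coefficients between one member and all of their listed relatives."""
    return sum(KINSHIP[rel] for (a, b), rel in pairs.items() if member in (a, b))

def strongest_dependent_set(families, query_members):
    """Return (family name, score) for the family with the largest aggregate kinship
    between any single member and their relatives, restricted to the query members."""
    best_name, best_score = None, 0.0
    for name, pairs in families.items():
        restricted = {pair: rel for pair, rel in pairs.items()
                      if pair[0] in query_members and pair[1] in query_members}
        members = {m for pair in restricted for m in pair}
        for member in members:
            score = aggregate_kinship(member, restricted)
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score

# Hypothetical toy pedigrees; identifiers are illustrative only.
families = {
    "family_A": {("target", "father"): "parent-child",
                 ("target", "mother"): "parent-child",
                 ("target", "aunt"): "second-degree"},
    "family_B": {("p1", "p2"): "parent-child"},
}
query_members = {"target", "father", "mother", "p1", "p2"}
print(strongest_dependent_set(families, query_members))
```

Restricting the pedigree to the query members reflects the point made above: the sensitivity depends on which relatives actually enter the query computation, not on all relatives present in the dataset.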

3.1 Use Case

To clarify our previous computations, here we consider a simple query function to publish a sanitized version of a dataset with dependent tuples. Among these dependent tuples we have the participants j and k, and o, where k and o . To satisfy -DP for genomic datasets with dependent tuples we have:

(8)

By transforming the left-hand side of Equation 8 using the law of total probabilities, we have:

(9)

Therefore, we derive the dependent sensitivity as:

Figure 2: The effect of including only the target and his relatives from the MC family in the count query results on the adversary’s estimation error when inferring the target’s SNP values. Using a range of ε values, we compare our model “GenShare” with 5 existing mechanisms. We provide the data points of the estimation error when we use ε = 0.1 and ε = 1 for a query with 2 family members.

3.2 GenShare Model

Let dataset D include n individuals and m SNPs. We assume a statistical query (e.g., MAF) is computed over q participants in D, including a target and p other dataset participants (q = 1+p). Set () includes individuals from the same family (i.e., the target and his/her family members), and set () includes the other unrelated members (non-relatives) in the dataset. We show the overview of our proposed GenShare model in Figure 1. The entity which collects/generates the genomic dataset is the “data owner”, and the data owner can share statistics about its dataset with a client (i.e., a researcher or physician). This is a common way to share research findings. Following the attack scenario proposed by almadhoun2020differential , to limit the number of dataset members included in the query result, the client (or adversary) sends its query specified by some demographic properties (e.g., age, address). As an example, we consider here an MAF query by the client (or adversary). First, the data owner computes the result of the query on the dataset and, meanwhile, determines the number of family members and unrelated members included in the query results. Based on that, the data owner computes the dependent sensitivity, applies LPM to the query results, and sends them to the client. The data owner reports (i) the query result (the MAF of all SNP values for the dataset participants that are considered in the query computation) and (ii) the number of dataset participants that are used to compute the query results (q).
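The data-owner workflow described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary-based dataset layout, and in particular `toy_sensitivity` are hypothetical. The placeholder sensitivity ignores dependence entirely and is NOT the dependent sensitivity derived in Section 3; a real deployment would plug that quantity in through `sensitivity_fn`.

```python
import random

def minor_allele_frequency(genotypes):
    """MAF of one SNP: observed minor alleles over total alleles (two per individual)."""
    return sum(genotypes) / (2 * len(genotypes))

def answer_maf_query(dataset, selected_ids, family_ids, epsilon, sensitivity_fn, rng):
    """Data-owner side: compute the MAF, count relatives and non-relatives,
    perturb each SNP's MAF with Laplace noise, and report the result with q."""
    if not selected_ids:
        raise ValueError("the query must select at least one participant")
    q = len(selected_ids)
    n_family = sum(1 for pid in selected_ids if pid in family_ids)
    n_unrelated = q - n_family
    scale = sensitivity_fn(n_family, n_unrelated) / epsilon
    n_snps = len(dataset[selected_ids[0]])
    noisy_maf = []
    for snp in range(n_snps):
        maf = minor_allele_frequency([dataset[pid][snp] for pid in selected_ids])
        noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))  # Laplace(0, scale)
        noisy_maf.append(min(1.0, max(0.0, maf + noise)))  # clamping is post-processing
    return {"maf": noisy_maf, "q": q}

def toy_sensitivity(n_family, n_unrelated):
    """Hypothetical placeholder: replacing one genotype shifts the MAF by at most 1/q.
    It does not account for dependence between relatives."""
    return 1.0 / (n_family + n_unrelated)

# Hypothetical toy dataset: participant id -> minor-allele counts for two SNPs.
dataset = {"target": [1, 0], "father": [1, 0], "mother": [0, 1], "u1": [2, 0]}
family_ids = {"target", "father", "mother"}
report = answer_maf_query(dataset, ["target", "father", "u1"], family_ids,
                          epsilon=1.0, sensitivity_fn=toy_sensitivity,
                          rng=random.Random(0))
print(report)
```

Passing the sensitivity as a function of the family and unrelated counts mirrors the description above, where the data owner first determines how many relatives and non-relatives enter the query and only then fixes the noise scale. The sketch perturbs every SNP with the same ε and does not track the cumulative privacy budget across SNPs or repeated queries.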

4 Settings and Evaluation

To evaluate the privacy performance of our proposed model GenShare, we use the correctness metric over a real-world statistical genomic dataset to show the robustness of GenShare. We next discuss our evaluation in detail.

4.1 Dataset Description

We combine three statistical genomic datasets that include genomic data of 1) family members and 2) unrelated members (non-relatives). Our final genomic dataset contains partial DNA sequences from:

  • CEPH/Utah Pedigree 1463 drmanac2010human : to obtain the genotypes of 10 family members (originally 17 members) from variant call format (VCF) files.

  • Manuel Corpas (MC) Family Pedigree corpas2013crowdsourcing : to obtain the genotypes of a scientist named Manuel Corpas (the target in our experiments) and his 4 family members.

  • 1000Genome phase 3 data 10002015global : to obtain data for the unrelated individuals from the same or different population of the target and his family members. We extracted the genotypes from chromosomes 1 and 22 for 2504 participants from 23 populations using the Beagle genetic analysis package browning2018one (to extract the number of minor alleles for each SNP).

4.2 Differentially private data release

In a statistical genomic dataset (e.g., GWAS) with n individuals and m SNPs, uhlerop2013privacy computes the sensitivity for privacy-preserving release of cell counts as 2 (i.e., Laplace noise with scale 2/ε), while the MAF sensitivity can be computed as and χ² statistics as . johnson2013privacy claim that adding Laplace noise with scale to the cell counts of a genomic dataset results in accurate χ² statistics or p-values. In GenShare, we use these algorithms to calculate the global sensitivity of the queries.
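To make the cell-count release concrete, the sketch below perturbs the cells of a toy case/control-by-genotype table with Laplace noise of scale 2/ε (the cell-count sensitivity of 2 quoted above) and then computes a Pearson χ² statistic from the noisy table. The function names, the toy table, and the clipping of noisy counts at 0.5 to keep expected counts positive are assumptions of this example, not part of the cited algorithms.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def perturb_cell_counts(table, epsilon, rng, sensitivity=2.0):
    """Add Laplace noise with scale sensitivity / epsilon to every cell count."""
    scale = sensitivity / epsilon
    # Clipping at 0.5 is an assumption of this sketch to avoid non-positive counts.
    return [[max(0.5, count + laplace_noise(scale, rng)) for count in row]
            for row in table]

def pearson_chi_square(table):
    """Pearson chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical toy table: rows = cases / controls, columns = genotype 0 / 1 / 2.
table = [[30, 50, 20],
         [45, 40, 15]]
rng = random.Random(0)
noisy_table = perturb_cell_counts(table, epsilon=1.0, rng=rng)
print(round(pearson_chi_square(table), 3), round(pearson_chi_square(noisy_table), 3))
```

Computing the statistic from already-perturbed counts is post-processing, so it does not consume additional privacy budget beyond what the cell-count release used.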

Figure 3: The effect of including the target, his father and mother from MC family, and (a) 5 unrelated members (FMT5u) or (b) 10 unrelated members (FMT10u) in the count query results, on the adversary’s estimation error of inferring the target’s SNPs values.
Figure 4: The effect of including the target and only (a) 5 unrelated members (5u), (b) 10 unrelated members (10u), (c) 20 unrelated members (20u) in the MAF query results, on the adversary’s estimation error of inferring the target’s SNPs values.
Figure 5: calculations for computing in the sensitivity when the query results contain (a) family members only, (b) family members and other unrelated members, (c) unrelated members only

4.3 Evaluation Metrics

For evaluating GenShare, we use correctness metric to quantify the privacy-preserving guarantees of GenShare. Estimation error is used to quantify the correctness by measuring the distance Dist between the true value of the SNP and the inferred value by the client (e.g., adversary). For a statistical genomic dataset with SNPs, we measure the expected estimation error as follows:

(10)

Here, is the true value of SNP i for the target individual j, while is the estimated value. We can compute the probabilities for using the Mendelian inheritance probabilities for an SNP given all the potential SNP values (i.e., 0, 1, or 2) for (represented as ). As discussed in Section 4.1, we use a dataset D to evaluate GenShare and compare it with the state-of-the-art mechanisms. D includes n individuals (n = 2520) and m SNPs for each individual (m = 1000). To infer the values of these m SNPs, we repeat our experiments 10 times, considering 100 SNPs (i.e., 100 queries are performed) each time.
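The two ingredients mentioned above, Mendelian inheritance probabilities over the SNP values 0, 1, and 2 and an expected estimation error, can be illustrated with the following Python sketch. It assumes that the distance Dist is the absolute difference between SNP values and that the error is averaged over SNPs; both choices, as well as the function names, are assumptions of this example and may differ from the exact form of Equation 10.

```python
def child_genotype_distribution(father, mother):
    """Mendelian probabilities of a child's minor-allele count (0, 1, 2)
    given the parents' minor-allele counts (each 0, 1, or 2)."""
    pf, pm = father / 2.0, mother / 2.0  # chance each parent transmits the minor allele
    return [(1 - pf) * (1 - pm),
            pf * (1 - pm) + (1 - pf) * pm,
            pf * pm]

def expected_estimation_error(true_values, distributions):
    """Average over SNPs of sum_v P(v) * |true - v|, where P is the adversary's
    distribution over the candidate SNP values 0, 1, 2 (assumed distance: absolute)."""
    total = 0.0
    for true_value, dist in zip(true_values, distributions):
        total += sum(p * abs(true_value - v) for v, p in enumerate(dist))
    return total / len(true_values)

# Toy example: the adversary knows only the parents' genotypes at two SNPs.
parents = [(1, 1), (0, 2)]   # (father, mother) minor-allele counts per SNP
target_truth = [1, 1]        # the target's true SNP values
beliefs = [child_genotype_distribution(f, m) for f, m in parents]
print(beliefs, expected_estimation_error(target_truth, beliefs))
```

A higher expected estimation error means the adversary's inference is further from the true SNP values, i.e., better privacy for the target.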

4.4 Experimental Results

In our evaluation, we assume that the query can include the target with 1) a direct family member, 2) multiple family members, or 3) multiple family members and other unrelated individuals. We compare our model (in terms of privacy) with similar existing work (discussed in Section 2) such as liu2016dependence ; almadhoun2020differential ; song2017pufferfish ; dwork2014algorithmic . Since hidden Markov models are not suitable for modeling kinship relations in a genomic dataset, we do not compare our model with the mechanisms that propose hidden-Markov-based models. In the following, we compare our proposed model (referred to as “GenShare” in the figures) with: (i) the independence assumption (referred to as “Independent Assumption” in the figures), to show that GenShare prevents any client from utilizing the dependencies among the dataset tuples to infer more sensitive attributes about dataset participants (in other words, we aim at achieving the privacy guarantees of standard DP assuming all the participants of the dataset are independent), (ii) the mitigation algorithm proposed in almadhoun2020differential (referred to as “Almadhoun et al.” in the figures), (iii) the dependent sensitivity mechanism proposed in liu2016dependence (referred to as “Liu et al.” in the figures), (iv) the Wasserstein algorithm proposed in song2017pufferfish (referred to as “Wasserstein” in the figures), and (v) Group DP proposed in dwork2014algorithmic (referred to as “Group DP” in the figures).

In Figure 2, we evaluate the effect of different values of the privacy budget, ε, on the adversary’s correctness in inferring the targeted SNPs, considering a different number of family members included in the query results. We evaluate the estimation error using 18 discrete ε values divided into 4 intervals, as shown in the legend of Figure 2.

Here, the count query (used in cohort discovery) results include the statistics from the family members only. First, we include 1 first-degree family member (e.g., mother or father) from the MC family with the target. Then, we include both mother and father with the target in the query results. Third, we include father, mother, and sister in the query results. Last, we consider a second-degree family member (the aunt of the target) in the query results along with the father, mother, and sister of the target, as shown on the x-axis of Figure 2. We make the following key observations: (i) GenShare achieves the best privacy overall; it provides almost the same privacy guarantees (in terms of estimation error) as the query that is computed over independent tuples (i.e., the independence assumption). Hence, our model nearly achieves the standard differential privacy guarantees without any degradation in terms of privacy or utility across several ε values. (ii) Existing techniques generally cannot optimize their schemes to achieve the required privacy and utility guarantees. They either add too much noise (e.g., f = 2 members in the figure) or degrade the rigorous guarantees of privacy (e.g., when f ≥ 3 members). (iii) As expected by DP, decreasing the privacy budget (from ε = 4 down to ε = 0.1) strengthens the privacy guarantees while weakening the utility guarantees.

Next, in Figure 3, we include family members (father and mother) and other unrelated members (5 in Figure 3(a) and 10 in Figure 3(b)) with the target to evaluate the effect of different values of the privacy budget, ε, on the adversary’s correctness in inferring the targeted SNPs. Considering a count query, we observe that GenShare achieves better privacy for various privacy budgets than the existing techniques, even when the query results include unrelated members, as illustrated in Figure 3.

Figure 4 shows that GenShare is equivalent to the standard DP mechanism when the query results only include unrelated members, unlike the existing techniques liu2016dependence ; song2017pufferfish ; dwork2014algorithmic , which compute the dependent sensitivity based on the number of dependent tuples in the dataset, regardless of whether these dependent tuples are included in the query.

In our sensitivity analysis in Section 3 we observe that the inference power of an attacker decreases with an increasing number of independent tuples in the query computation. Hence, (i.e., LPM scale = ) value can be neatly chosen to find the adequate value of considering the number of dependent and independent tuples in the query computation. Since the (i.e., the query sensitivity) is computed considering the query type (illustrated in Section 4.2), the data owner in our model can compute the value in based on the number of family members or unrelated members included in the query result, as shown in Figure 5. As expected, adding more unrelated members to the query results leads to more precise sensitivity computations until reaching the sensitivity of the standard DP mechanism (i.e., = ).

Next, we compare the performance of GenShare when first-degree or second-degree family members (from the MC and Utah families) are included in the query computations with the target. Our results show the robustness of GenShare regardless of the degree of familial relationship between the dataset tuples. The differences in privacy guarantees between GenShare and the “Independent Assumption” do not exceed 5% across a range of privacy parameters ε, with respect to estimation error (Figure 6).

Figure 6: The differences from the “Independent Assumption” privacy guarantees (in terms of estimation error) over a range of ε values, for queries that include first-degree and second-degree relatives.

Finally, we compare the performance of GenShare for different query types, e.g., count, MAF, and χ² tests. As expected, the differentially private statistics computed with GenShare closely match the privacy guarantees of DP under the “Independent Assumption”, with a difference of up to 6% in terms of estimation error across a range of privacy parameters ε (Figure 7).

Overall, our results illustrate the theoretical boundaries of leveraging LPM-DP to mitigate the “tuples dependency” privacy risk in genomic query-answering systems. GenShare is vital for genomic data sharing and, in a broader sense, also has implications for medical data sharing. Data owners should be very careful when sharing data related to such datasets, considering i) the importance of sharing statistical genomic and medical datasets (an aim that many institutes pursue) for high-impact medical research (e.g., NIH recently awarded $73 million to collect and archive information on genes and genomic variants for precision medicine nihaward ), and ii) the sensitivity of the personal information in these datasets, especially given the high probability that they contain families. Moreover, GenShare can be used to give clients from different parties strong insights into each other’s datasets (e.g., before they exchange datasets for joint research). Such a privacy-preserving sharing mechanism may help accelerate data sharing across researchers, especially under the strict data protection regulations that govern sharing and exchanging data worldwide.

Figure 7: Comparison of GenShare applied to count, MAF, and χ² queries. GenShare reduces the differences from the “Independent Assumption” privacy guarantees (in terms of estimation error) across different ε values and the 3 query types.

5 Conclusion

Differential privacy offers a theoretical notion of privacy with formal guarantees that the distribution of query results changes only slightly with the addition or removal of a single tuple in the dataset. However, the privacy guarantees of DP-based solutions rest on the assumption that all tuples in the dataset are independent. In reality, genomic data from different individuals may be dependent because of the familial ties between them. In this paper, we propose GenShare to provide countermeasures against the privacy risks caused by dependent tuples in statistical genomic datasets. To achieve the privacy and utility guarantees theoretically provided by DP, GenShare captures the joint statistical relationships between dependent tuples in the genomic datasets. Our results show that GenShare provides a significant improvement in privacy and utility guarantees over existing mechanisms across a range of privacy parameters ε. In the long run, these contributions will benefit the medical and genomics research community and help realize the promise of privacy-preserving access to the genomic datasets that future health information exchange systems will rely on. Several directions merit further research. Future work could consider: 1) more concepts in differential privacy, such as local sensitivity, 2) complex tasks and applications such as federated machine learning, and 3) different settings, e.g., a larger number of queries or the composition of multiple queries.

References

  • (1) Stephens, Z., Lee, S., Faghri, F., Campbell, R., Zhai, C., Efron, M., Iyer, R., Schatz, M., Sinha, S. & Robinson, G. Big data: astronomical or genomical?. PLoS Biology. 13, e1002195 (2015)
  • (2) Farnaes, L., Hildreth, A., Sweeney, N., Clark, M., Chowdhury, S., Nahas, S., Cakici, J., Benson, W., Kaplan, R., Kronick, R. & Others Rapid whole-genome sequencing decreases infant morbidity and cost of hospitalization. NPJ Genomic Medicine. 3, 1-8 (2018)
  • (3) Bonomi, L., Huang, Y. & Ohno-Machado, L. Privacy challenges and research opportunities for genomic data sharing. Nature Genetics. 52, 646-654 (2020)
  • (4) Almadhoun, N., Ayday, E. & Ulusoy, Ö. Differential privacy under dependent tuples—the case of genomic privacy. Bioinformatics. 36, 1696-1703 (2020)
  • (5) Almadhoun, N., Ayday, E. & Ulusoy, Ö. Inference attacks against differentially private query results from genomic datasets including dependent tuples. Bioinformatics. 36, i136-i145 (2020)
  • (6) Dwork, C. Differential privacy: A survey of results. International Conference On Theory And Applications Of Models Of Computation. pp. 1-19 (2008)
  • (7) Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J., Stephan, D., Nelson, S. & Craig, D. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics. 4, e1000167 (2008)
  • (8) Nissim, K., Raskhodnikova, S. & Smith, A. Smooth sensitivity and sampling in private data analysis. Proceedings Of The Thirty-ninth Annual ACM Symposium On Theory Of Computing. pp. 75-84 (2007)
  • (9) Uhler, C., Slavković, A. & Fienberg, S. Privacy-preserving data sharing for genome-wide association studies. The Journal Of Privacy And Confidentiality. 5, 137 (2013)
  • (10) Yu, F., Fienberg, S., Slavković, A. & Uhler, C. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal Of Biomedical Informatics. 50 pp. 133-141 (2014)
  • (11) Johnson, A. & Shmatikov, V. Privacy-preserving data exploration in genome-wide association studies. Proceedings Of The 19th ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 1079-1087 (2013)
  • (12) Cho, H., Simmons, S., Kim, R. & Berger, B. Privacy-preserving biomedical database queries with optimal privacy-utility trade-offs. Cell Systems. 10, 408-416 (2020)
  • (13) He, Z., Li, Y., Li, J., Li, K., Cai, Q. & Liang, Y. Achieving differential privacy of genomic data releasing via belief propagation. Tsinghua Science And Technology. 23, 389-395 (2018)
  • (14) Raisaro, J., Troncoso-Pastoriza, J., Misbach, M., Sousa, J., Pradervand, S., Missiaglia, E., Michielin, O., Ford, B. & Hubaux, J. MedCo: Enabling Secure and Privacy-Preserving Exploration of Distributed Clinical and Genomic Data. IEEE/ACM Transactions On Computational Biology And Bioinformatics. 16, 1328-1341 (2018)
  • (15) Kifer, D. & Machanavajjhala, A. No free lunch in data privacy. Proceedings Of The 2011 ACM SIGMOD International Conference On Management Of Data. pp. 193-204 (2011)
  • (16) He, X., Machanavajjhala, A. & Ding, B. Blowfish privacy: Tuning privacy-utility trade-offs using policies. Proceedings Of The 2014 ACM SIGMOD International Conference On Management Of Data. pp. 1447-1458 (2014)
  • (17) Chen, R., Fung, B., Yu, P. & Desai, B. Correlated network data publication via differential privacy. The VLDB Journal-The International Journal On Very Large Data Bases. 23, 653-676 (2014)
  • (18) Liu, C., Chakraborty, S. & Mittal, P. Dependence Makes You Vulnerable: Differential Privacy Under Dependent Tuples. NDSS. 16 pp. 21-24 (2016)
  • (19) Zhao, J., Zhang, J. & Poor, H. Dependent differential privacy for correlated data. 2017 IEEE Globecom Workshops (GC Wkshps). pp. 1-7 (2017)
  • (20) Dwork, C., Roth, A. & Others The algorithmic foundations of differential privacy. Foundations And Trends In Theoretical Computer Science. 9, 211-407 (2014)
  • (21) Kifer, D. & Machanavajjhala, A. A rigorous and customizable framework for privacy. Proceedings Of The 31st ACM SIGMOD-SIGACT-SIGAI Symposium On Principles Of Database Systems. pp. 77-88 (2012)
  • (22) Yang, B., Sato, I. & Nakagawa, H. Bayesian differential privacy on correlated data. Proceedings Of The 2015 ACM SIGMOD International Conference On Management Of Data. pp. 747-762 (2015)
  • (23) Song, S., Wang, Y. & Chaudhuri, K. Pufferfish privacy mechanisms for correlated data. Proceedings Of The 2017 ACM International Conference On Management Of Data. pp. 1291-1306 (2017)
  • (24) Cao, Y., Yoshikawa, M., Xiao, Y. & Xiong, L. Quantifying differential privacy under temporal correlations. 2017 IEEE 33rd International Conference On Data Engineering (ICDE). pp. 821-832 (2017)
  • (25) Drmanac, R., Sparks, A., Callow, M., Halpern, A., Burns, N., Kermani, B., Carnevali, P., Nazarenko, I., Nilsen, G., Yeung, G. & Others Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 327, 78-81 (2010)
  • (26) Corpas, M. Crowdsourcing the corpasome. Source Code For Biology And Medicine. 8, 13 (2013)
  • (27) 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68 (2015)
  • (28) Browning, B., Zhou, Y. & Browning, S. A one-penny imputed genome from next-generation reference panels. The American Journal Of Human Genetics. 103, 338-348 (2018)
  • (29) Chen, Y., Machanavajjhala, A., Hay, M. & Miklau, G. Pegasus: Data-adaptive differentially private stream processing. Proceedings Of The 2017 ACM SIGSAC Conference On Computer And Communications Security. pp. 1375-1388 (2017)
  • (30) The National Human Genome Research Institute: NIH awards $73m to continue building resource of genes and genomic variants for precision medicine. NIH.(2021,9), https://www.genome.gov/news/news-release/NHGRI-awards-73million-to-continue-building-Clinical-Genome-Resource-ClinGen