1. Introduction
Recent advances in genome sequencing technologies have enabled individuals to access their genome sequences easily, resulting in massive amounts of genomic data. On one hand, sharing this massive amount of data is important for the progress of genomics research. Genomic data collected by research laboratories or published in public repositories leads to significant breakthroughs in medicine, including the discovery of associations between mutations and diseases. On the other hand, genomic data contains sensitive information about individuals, such as predisposition to diseases and family relationships. Due to privacy concerns, individuals are generally hesitant to share their genomic data. Therefore, how to facilitate genomic data sharing in a privacy-preserving way is a crucial problem.
One way to preserve privacy in genomic data sharing and analysis is to utilize cryptographic techniques. However, these techniques mostly bring interoperability and scalability problems. Encrypted data can only be used for a limited number of operations, and high computation costs decrease the applicability of these techniques for large-scale datasets. Local differential privacy (LDP) is a state-of-the-art definition for preserving the privacy of individuals in data sharing with an untrusted data collector, and hence it is a promising technology for privacy-preserving sharing of genomic data. Perturbing data before sharing provides plausible deniability for the individuals. However, the original LDP definition does not consider data correlations. Hence, applying existing LDP-based data sharing mechanisms directly to genomic data makes the perturbed data vulnerable to attacks utilizing correlations in the data.
In this work, our goal is to provide privacy guarantees for the shared genomic sequence of a data owner against inference attacks (that utilize correlations in the data) while providing high data utility for the data collector. To this end, we develop a new genomic data sharing mechanism by defining a variant of LDP under correlations, named ε-dependent LDP. We use the randomized response (RR) mechanism (a commonly used LDP-based data sharing mechanism) as a baseline, since the total number of states for each genomic data point is 3 and RR provides the best utility for such a small number of states (wang2017locally). Moreover, RR uses the same set of inputs and outputs without an encoding, which allows the data collector to use the perturbed data directly. We show how directly applying the RR mechanism is vulnerable to inference attacks, and we focus on improving its privacy and utility while providing formal privacy guarantees. Thus, we first show how correlations in genomic data can be used by an attacker to infer the original values of perturbed data points. We describe a correlation attack and show how the estimation error of the attacker (a commonly used metric to quantify genomic privacy) decreases due to the direct use of RR.
In the correlation attack, to increase its inference power, the attacker detects (and eliminates) the data values that are not consistent with the other shared values based on correlations. Thus, in the proposed data sharing scheme, we consider such an attack by design and do not share values of data points that are inconsistent with the previously shared data points. While sharing each data point (single nucleotide polymorphism, or SNP) with the data collector, the proposed algorithm eliminates a particular value of a shared SNP if that value occurs with negligible probability considering the SNP's correlations with the other shared SNPs (to prevent an attacker from utilizing such statistically unlikely values to infer the actual values of the SNPs). Then, the algorithm adjusts the sharing probabilities for the non-eliminated values of the SNP by normalizing them and making sure that the attacker's distinguishability between any two possible values of the SNP is bounded by e^ε, which achieves ε-dependent LDP.
To improve utility, we introduce new probability distributions (for the shared states of each SNP), such that, for each shared SNP, the probability of deviating significantly from its "useful values" is small. Useful values of a SNP depend on how the data collector intends to use the collected SNPs. For this, we focus on genomic data sharing beacons (a system constructed with the aim of providing a secure and systematic way of sharing genomic data) and show how to determine probability distributions for different states of each shared SNP with the aim of maximizing the utility of the collected data (this can easily be extended for other uses of genomic data, such as in statistical databases). In the proposed mechanism, the SNPs of a genome donor are processed sequentially. Although the proposed ε-dependent LDP definition is satisfied in any order, the number of eliminated states for each SNP can differ based on the order of processing. Hence, the utility of the shared data changes when the processing order of the SNPs changes. We also show how to determine an optimal order of processing (which provides the highest utility) via a Markov decision process (MDP) and provide a value iteration-based algorithm to achieve this goal. Furthermore, due to the complexity of the optimal algorithm, we propose an efficient greedy algorithm to determine the processing order of the SNPs in the proposed data sharing mechanism.
Since the SNPs of a child are inherited from her parents, genomic data sharing should also consider the privacy of family members. An attacker can gain information about the SNPs of a donor's family members even though the donor shares her SNPs after perturbing their values using the proposed mechanism. Thus, we also explain how the information gain of an attacker about a family member can be computed in terms of the privacy budget (i.e., the ε parameter) of the genome donor. We also propose a basic algorithm to identify the maximum privacy budget (i.e., the largest ε value) that can be used by a genome donor to make sure that the privacy budgets of her family members are not violated due to her sharing.
We conduct experiments with a real-life genomic dataset to show the utility and privacy provided by the proposed scheme. Our experimental results show that the proposed scheme provides better privacy and utility than the original randomized response mechanism. We also show that using the proposed greedy algorithm for the order of processing improves the utility compared to randomly selecting the order of processed SNPs.
The rest of the paper is organized as follows. We review the related work in Section 2 and provide the technical preliminaries in Section 3. We present the proposed framework in Section 4. We propose an algorithm for the optimal data processing order in Section 5. We evaluate the proposed scheme via experiments in Section 6. In Section 7, we discuss kinship relationships during data sharing and our assumptions about the attacker. Finally, we conclude the paper in Section 8.
2. Related Work
In this section, we discuss relevant existing works on genomic privacy along with local differential privacy.
2.1. Genomic Privacy
The topic of genomic privacy has recently been explored by many researchers (survey:genomicera). Several works have studied various inference attacks against genomic data, including membership inference (related:homer; related:wang; related:shringarpureandbastumante) and attribute inference (genomic:lacks; genomic:highorder; Khodam). To mitigate these threats, some researchers proposed using cryptographic techniques for privacy-preserving processing of genomic data (related:baldi; related:ermanclinic; related:wangPrivateEditDistance; deuber2019my). The differential privacy (DP) concept (privacy:differentialprivacy) has also been used to release summary statistics about genomic data in a privacy-preserving way (to mitigate membership inference attacks) (differential:gwas; differential:gwas_yu; differential:gwas_johnson). Unlike the existing DP-based approaches, our goal is to share the genomic sequence of an individual, not summary statistics. To share genomic sequences in a privacy-preserving way, Humbert et al. proposed an optimization-based technique that selectively hides portions of shared genomic data to optimize utility under privacy constraints (genomic:reconciling). However, they do not provide formal privacy guarantees. For the first time, we study the applicability of LDP for genomic data sharing and develop a variant of LDP for correlated data.
2.2. Local Differential Privacy
Differential privacy (DP) (privacy:differentialprivacy) is a concept for preserving the privacy of records in statistical databases while publishing statistical information about the database. Although DP provides strong guarantees for individual privacy, there may be privacy risks for individuals when data is correlated. Several approaches have been proposed (yang2015bayesian; cao2017quantifying; liu2016dependence; song2017pufferfish) to protect the privacy of individuals under correlations. Since these works focus on the privacy of aggregate data release (e.g., summary statistics about data), they are not suitable for individual data sharing. Local differential privacy (LDP) is a state-of-the-art definition for preserving the privacy of individuals in data sharing with an untrusted data collector. However, only a limited number of tasks, such as frequency estimation (wang2017locally), heavy hitters (bassily2017practical), frequent itemset mining (wang2018locally), marginal release (cormode2018marginal), and range queries (cormode2019answering), have been demonstrated under LDP, and the accuracy of these tasks is much lower than performing the same tasks under the central model of differential privacy. Collecting perturbed data from more individuals decreases the accuracy loss due to randomization. Hence, practical usage of LDP-based techniques requires a large number of individuals (data owners), which limits their practicality. To overcome the accuracy loss due to LDP, a shuffling technique has recently been proposed (erlingsson2019amplification; cheu2019distributed). The main idea of shuffling is to utilize a trusted shuffler that receives the perturbed data from individuals and permutes it before sending it to the data collector. However, requiring a trusted shuffler also restricts the practical usage of this method.
Another approach to improving the utility of LDP is providing different privacy protection for different inputs. In the original definition of LDP, all inputs are considered sensitive and indistinguishability needs to be provided for all inputs. Murakami et al. divided the inputs into two groups, sensitive and non-sensitive ones (murakami2019utility). They introduced the notion of utility-optimized LDP, which provides privacy guarantees only for sensitive inputs. Gu et al. (gu2019providing) proposed input-discriminative LDP, which provides distinct protection for each input. However, grouping inputs based on their sensitivity is not realistic in practice due to the subjectivity of sensitivity. In this work, we discriminate the inputs based on their likelihood instead of their sensitivity. We focus on how correlations can be used by an attacker to degrade privacy and how we can mitigate such degradation. Hence, we provide indistinguishability between possible states by eliminating the states that are rarely seen in the population, using correlations. By doing so, we aim to decrease the information gain of an attacker that uses correlations for inference attacks. Furthermore, both of these works (murakami2019utility; gu2019providing) aim to improve utility by providing less indistinguishability for non-sensitive data and thereby providing more accurate estimations. In our work, the accuracy does not rely on estimations. Instead, we provide high accuracy by eliminating rare values from both the input and output sets. Moreover, we improve utility by increasing the probability of "useful values", considering the intended use of the shared data.
3. Technical Preliminaries
In this section, we provide brief background on genomics and LDP.
3.1. Genomics Background
The human genome contains approximately 3 billion pairs of nucleotides (A, T, C, or G). Approximately 99.9% of these pairs are identical across all people. When more than 0.5% of the population does not carry the same nucleotide at a specific position in the genome, this variation is considered a single-nucleotide polymorphism (SNP). More than 100 million SNPs have been identified in humans. For a SNP, the nucleotide observed in the majority of the population is called the major allele, and the nucleotide observed in the minority of the population is called the minor allele. Each person has two alleles for each SNP position, and each of these alleles is inherited from one parent of the individual. Hence, each SNP can be represented by the number of its minor alleles, which can be 0, 1, or 2. In this work, we study the problem of sharing the values of SNPs in a privacy-preserving way. It has been shown that SNPs may have pairwise correlations with each other (e.g., linkage disequilibrium (slatkin2008linkage)). Hence, an attacker can use such correlations to infer the original values of shared SNPs. Furthermore, since alleles are inherited from parents, sharing of SNPs by a genome donor also reveals some information about her family members. Thus, the privacy of the family members should also be considered while sharing genomic data.
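As a small illustration of the 0/1/2 encoding described above, the minor allele count of a genotype can be computed as follows (the function and its names are ours, for illustration only):

```python
def minor_allele_count(genotype, minor_allele):
    """Encode a genotype (two alleles, e.g. 'AG') as its minor allele
    count: 0, 1, or 2. This is the SNP representation used in the paper."""
    return sum(1 for allele in genotype if allele == minor_allele)

# Example: for a SNP whose minor allele is 'G',
# 'AA' -> 0, 'AG' -> 1, 'GG' -> 2.
```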
3.2. Local Differential Privacy
Local differential privacy (LDP) is a variant of differential privacy that allows data to be shared with an untrusted party. In LDP settings, there is a data collector who wants to compute statistical information about a population. Each individual shares her data with the data collector after perturbation (randomization). Then, the data collector uses all collected perturbed data to estimate statistics about the population. During data perturbation, the privacy of the individuals is protected by achieving indistinguishability between possible input values.
In this work, we adopt the general definition of local differential privacy (duchi2013local; kairouz2014extremal), which is expressed as follows:
Definition 1 (ε-local differential privacy).
A randomized mechanism M satisfies ε-local differential privacy (ε-LDP) if, for any two possible values v and v′ of an element x and any set of outputs S ∈ σ(Y),
Pr[M(v) ∈ S] ≤ e^ε · Pr[M(v′) ∈ S],
where Y is the collection of all possible output values of M and σ(Y) denotes an appropriate σ-field on Y.
Definition 1 captures a type of plausible deniability, i.e., no matter what output value of x is released, it is nearly equally as likely to have come from any of its possible input values.
The parameter ε is the privacy budget, which controls the level of privacy. Randomized response (RR) (warner1965randomized) is a mechanism for collecting sensitive information from individuals while providing plausible deniability. Although RR was originally defined for two possible inputs (e.g., yes/no), the mechanism can be generalized to protect privacy when there are more than two possible states. In generalized randomized response (wang2018locally), the correct value is shared with probability p = e^ε / (e^ε + d − 1) and each incorrect value is shared with probability q = 1 / (e^ε + d − 1) to achieve ε-LDP, where d is the number of states.
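A minimal sketch of generalized randomized response follows: the true value is kept with probability e^ε/(e^ε + d − 1) and each of the remaining d − 1 states is reported with probability 1/(e^ε + d − 1). The function name is ours; this is an illustration, not the paper's implementation.

```python
import math
import random

def generalized_rr(true_value, states, epsilon):
    """Generalized randomized response over d = len(states) states."""
    d = len(states)
    # Probability of reporting the true value.
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if random.random() < p:
        return true_value
    # Otherwise report one of the incorrect states uniformly at random,
    # so each has probability (1 - p) / (d - 1) = 1 / (e^eps + d - 1).
    return random.choice([s for s in states if s != true_value])
```

For SNPs, states would be [0, 1, 2] (d = 3).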
4. Proposed Framework
In this section, we first introduce the problem and explain genomic data sharing with an untrusted data collector by directly applying RR mechanism. We then present a correlation attack that utilizes correlations between SNPs and show the significant decrease in the estimation error of the attacker (a commonly used privacy metric for genomic data) after the attack. Then, we show how to simultaneously improve privacy against the correlation attacks and improve utility for genomic analysis. Finally, we present our proposed genomic data sharing mechanism.
4.1. Problem Statement
System Model. Figure 1 shows the overview of the system model and the steps of the proposed framework. We focus on a problem in which genome donors share their genomic data in a privacy-preserving way with a data collector, who uses the collected data to answer queries about the population. In the genomic data sharing scenario, there are n individuals (u_1, …, u_n) as genome donors. A genome donor u_i has a sequence of m SNPs denoted by (x_1^i, …, x_m^i). Since each SNP is represented by the number of minor alleles it carries, each x_j^i takes a value from the set {0, 1, 2}. Today, individuals can obtain their genomic sequences via various online service providers, such as 23andMe, and they also share their sequences with other service providers or online repositories (e.g., for research purposes). Hence, the proposed system model has real-world applications in which individuals want to preserve the privacy of their genomic data when sharing their genomic sequences with other service providers and online repositories.
Threat Model. The data collector is considered untrusted (i.e., a potential attacker). It can share the data directly with another party or use it to answer queries. Hence, we assume the attacker has the data shared by all genome donors with the data collector, but it does not know the original values of any SNPs. In addition, we assume that the attacker knows the pairwise correlations between SNPs (which can be computed using public datasets), the perturbation method, and the privacy budget ε. Thus, the attacker can try to infer whether the shared value of a SNP is equal to its original value using correlations.
Data Utility. The data collector uses the data collected from genome donors to answer queries. Therefore, we define utility as the accuracy of the data collector in answering such queries. For genomic data, typically, the utility of each value of a SNP is different, and the utility of a SNP may change depending on the purpose of data collection (e.g., statistical genomic databases, genomic data sharing beacons, or haploinsufficiency studies). Thus, one of our aims is to improve the utility of the LDP-based data collection mechanism by considering data utility as a part of the data sharing mechanism.
Genomic Data Sharing Under Local Differential Privacy. In (wang2017locally), several approaches are described for estimating the frequency of inputs under LDP, such as direct encoding, histogram encoding, and unary encoding. As shown in (wang2017locally), when the size of the input set is less than 3e^ε + 2, direct encoding is the best among these approaches. Since the size of the input set for genomic data is 3, we also use the direct encoding approach for genomic data sharing. In direct encoding, no specific encoding technique is applied to inputs before perturbation, and the randomized response (RR) mechanism (introduced in Section 3.2) is used for perturbing inputs. To apply the RR mechanism and achieve ε-LDP for genomic data, the value of a SNP is shared correctly with probability p = e^ε / (e^ε + 2) and each incorrect value is shared with probability q = 1 / (e^ε + 2). After receiving perturbed values from individuals, the data collector estimates the frequency of each input v in the population as f̂_v = (n_v / n − q) / (p − q), where n_v is the number of individuals who shared v and n is the total number of individuals.
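Under the direct-encoding estimator of (wang2017locally), the collector's aggregation step can be sketched as below (names and data layout are ours); the estimated frequencies sum to 1 by construction:

```python
import math

def estimate_frequencies(reports, states, epsilon):
    """Unbiased frequency estimation for direct encoding:
    f_hat(v) = (n_v / n - q) / (p - q), where n_v is the number of
    individuals who reported value v out of n reports in total."""
    d = len(states)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    q = 1.0 / (math.exp(epsilon) + d - 1)
    n = len(reports)
    return {v: (reports.count(v) / n - q) / (p - q) for v in states}
```

Since p + (d − 1)q = 1, the estimates over all d states always sum to 1, even though individual estimates can be negative for rare values.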
4.2. Correlation Attack Against LDP-Based Genomic Data Sharing
When multiple data points are shared with the RR mechanism, LDP is still guaranteed if the data points are independent. However, it is known that SNPs have pairwise correlations with each other (e.g., linkage disequilibrium (slatkin2008linkage)). An attacker can use the correlations between SNPs to infer whether a SNP was shared correctly or incorrectly by the RR mechanism.
To show this privacy risk, we consider a correlation attack that can be performed by an attacker as follows. We represent a SNP as x_j and the value of x_j for individual u_i as x_j^i. We assume that all pairwise correlations between SNPs are publicly known; hence, the correlation between any two SNPs x_j and x_l is known by the attacker. Let (y_1^i, …, y_m^i) be the perturbed data shared by u_i with the data collector (the potential attacker, whose goal is to infer the actual SNP values of the individual). Without using the correlations, the attacker's only knowledge about any x_j^i is the probability distribution of the randomized response mechanism. However, using the correlations, the attacker can enhance its knowledge about the probability distribution of x_j^i by eliminating the values that are not likely to be observed (i.e., that have low correlation with the other received data points).
To achieve this, for each y_j^i, using all other received data points (except for y_j^i), the attacker counts the number of inconsistent instances in terms of the correlations between the different values of x_j^i and all other received data points (i.e., those having correlation less than a threshold). Let τ be the correlation threshold of the attacker. The attacker keeps a count of the received data points y_l^i (l ≠ j) that have low correlation with x_j^i = 0, x_j^i = 1, and x_j^i = 2 as c_0, c_1, and c_2, respectively. If any of these counts is greater than or equal to k (where k is an attack parameter for the number of inconsistent data points), the attacker eliminates that value from the probability distribution of x_j^i and considers the remaining values for its inference about the correct value of x_j^i.
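The elimination step of the attack can be sketched as follows. The data layout (a dictionary mapping (SNP, value, other SNP, other value) tuples to pairwise correlations) and all names are our assumptions for illustration:

```python
def eliminate_states(target, shared, corr, tau, k):
    """Return the non-eliminated candidate states for SNP `target`.
    `shared` maps other SNP indices to their shared values; a candidate
    state is eliminated once at least k shared values have correlation
    below tau with it (missing pairs default to 1.0, i.e., consistent)."""
    remaining = []
    for state in (0, 1, 2):
        inconsistent = sum(
            1
            for j, vj in shared.items()
            if j != target and corr.get((target, state, j, vj), 1.0) < tau
        )
        if inconsistent < k:
            remaining.append(state)
    return remaining
```

For example, if state 2 of SNP 0 has low correlation with two already-shared values and k = 2, state 2 is eliminated and only states 0 and 1 remain for the attacker's inference.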
To show the effect of this correlation attack on privacy, we implemented the RR mechanism for genomic data and computed the attacker's estimation error, a metric used in genomic privacy to quantify the distance of the attacker's inferred values from the original data, before and after the attack. Our results (in Figure 5, Section 6.1) clearly show the vulnerability of directly using RR in genomic data sharing. For instance, the attacker's estimation error decreases from 0.8 to 0.4 after the correlation attack. In general, we observed that the attacker's estimation error (i.e., the privacy of the genome donor) decreases by approximately 50% under this attack strategy.
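For concreteness, one common instantiation of this metric (a simplified assumption on our part; the experiments may use a weighted variant) is the mean absolute distance between the true and inferred minor allele counts:

```python
def estimation_error(true_values, inferred_values):
    """Mean absolute distance between the true and inferred SNP values
    (each in {0, 1, 2}); a higher error means more privacy is preserved."""
    pairs = list(zip(true_values, inferred_values))
    return sum(abs(t - i) for t, i in pairs) / len(pairs)

# Example: one of three SNPs is inferred one state away:
# estimation_error([0, 2, 1], [0, 1, 1]) -> 1/3
```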
4.3. ε-dependent Local Differential Privacy
To handle data dependency in privacy-preserving data sharing, some works, such as (chanyaswad2018mvg; liu2016dependence), extend the definition of traditional differential privacy by considering the correlations between elements in the dataset. However, there is a lack of such variants for local differential privacy models, which hinders the application of LDP-based solutions for privacy-preserving genomic data sharing. In this paper, inspired by (liu2016dependence), which handles data dependency by considering the number of elements that can potentially be affected by a single element, we propose the following definition.
Definition 2 (ρ-dependent element).
An element x in a dataset D is said to be dependent under a correlation model C (denoted as ρ-dependent) if its released value depends on at most ρ other elements in D.
Furthermore, let D_x be the set of elements on which a dependent element x is dependent (through model C), with |D_x| ≤ ρ; let R_x be the set of released values of D_x; and let V_x represent the possible value(s) of x that can be released given the releasing of R_x and model C. Note that it is possible for some elements to have only one possible value that can be shared under a specific correlation model. If that only possible value happens to be the true value of the element, we call such elements ineliminable elements, whose privacy will inevitably be compromised for the sake of the utility improvement of the entire set of shared elements (we formally investigate this issue in Section 5). As a result, we propose the following definition.
Definition 3 (ε-dependent LDP).
A randomized mechanism M is said to be ε-dependent locally differentially private for an element x that is not ineliminable if, for any two values v, v′ ∈ V_x and any output y ∈ V_x,
Pr[M(v) = y | R_x, C] ≤ e^ε · Pr[M(v′) = y | R_x, C].
Definition 3 can be considered as a specialization of the general LDP in Definition 1, obtained by restricting both the input set and the output set to the possible values V_x. Essentially, Definition 3 means that any output of a dependent element x is nearly equally as likely to have come from any of its possible input values given the other already shared elements (i.e., R_x) and a correlation model C. We summarize the main notations used throughout this paper in Table 1.
Notation : Description
x_j^i : value of SNP x_j for individual u_i
V_{x_j} : possible values of x_j after the SNPs in D_{x_j} are shared as R_{x_j}, given the correlation model C
y_j : released value of SNP x_j, y_j ∈ V_{x_j}
τ : correlation threshold for value elimination
k : inconsistency threshold
M_i : a tuple describing individual u_i's MDP interaction with the environment, i.e., {set of all MDP states of u_i, initial MDP state of u_i, action set, transition probability between two MDP states, set of rewards, the horizon of the MDP}
π_t : individual u_i's decision policy at time step t
π_t* : individual u_i's optimal decision policy at time step t
V^π(s) : state-value function of state s under policy π
O_i : a SNP sharing order of individual u_i
O_i* : the optimal SNP sharing order of individual u_i
4.4. Achieving ε-dependent LDP in Genomic Data Sharing
Our experimental results show the vulnerability of directly applying the RR mechanism for genomic data sharing. Thus, our goal here is to develop a genomic data sharing approach achieving ε-dependent LDP that is robust against the correlation attack. The definition of LDP states that, given any output, the distinguishability between any two possible inputs needs to be bounded by e^ε. In Section 4.2, all values in the set {0, 1, 2} are considered as possible inputs for all SNPs during data sharing. However, we know that the attacker can eliminate some input states using correlations. Hence, for the rest of the paper, we consider the possible input states to be the ones that are not eliminated by using correlations. In other words, instead of providing indistinguishability between any two values in the set {0, 1, 2}, we provide indistinguishability between the values that are statistically possible.
In the correlation attack described in Section 4.2, the attacker uses two threshold values. Correlation values less than τ are considered as low correlation. In addition, if the number of shared SNPs having low correlation with a state of a particular SNP reaches k, that state of the SNP is eliminated by the attacker. In the data sharing scheme, we also use these two parameters to eliminate states. However, the parameters used by the algorithm may not be the same as the ones used by the attacker. Hence, to distinguish the parameters used by the algorithm from those of the attacker, we represent the parameters used in the algorithm as τ′ and k′ (which are the design parameters of the proposed data sharing algorithm). We describe this algorithm for a donor u_i as follows.
In each step of the proposed algorithm, one SNP is processed. The algorithm first determines the states to be eliminated by considering the previously shared SNPs. Then, the algorithm selects the value to be shared (y_j) such that the distinguishability of the non-eliminated states is bounded by e^ε. Hence, the order of processing may change the number of eliminated states for a SNP, which may also change the utility of the shared data. For instance, when a SNP is processed first, all three of its states are possible (for sharing) since there is no previously shared SNP. However, processing the same SNP last may end up eliminating one or more of its states (due to their correlations with previously shared SNPs). We propose an algorithm to select the optimal sharing order (considering the utility of shared data) in Section 5. In the following, we assume that a sharing order is provided by the algorithm in Section 5 and SNPs are processed one by one following this order.
For x_j, the algorithm considers the previously shared data points and identifies the states to be eliminated. As in the correlation attack, the algorithm counts the number of previously shared SNPs that have low correlation (less than τ′) with states 0, 1, and 2 of x_j; it keeps these counts as c_0, c_1, and c_2, respectively. If any of these counts is greater than or equal to k′, the algorithm eliminates the corresponding value from the possible outputs of x_j. Let p = e^ε / (e^ε + 2), q = 1 / (e^ε + 2), and let the value of x_j be 0. Then, the algorithm assigns the probabilities of the non-eliminated states as follows:
- If there are three possible outputs (i.e., no eliminated state), the algorithm uses the same probability distribution as the RR mechanism: x_j is shared as 0 with probability p, and as 1 or 2 with probability q each.
- If there are two possible outputs (i.e., one eliminated state) and the true value 0 is not eliminated, the algorithm uses an adjusted probability distribution: it shares 0 with probability p/(p + q) and the remaining incorrect value with probability q/(p + q) (regardless of which of the two incorrect states is eliminated).
- If there are two possible outputs (i.e., one eliminated state) and the true value 0 is eliminated, the algorithm uses an adjusted probability distribution: it shares each of the two remaining values with probability 1/2.
- If there is one possible output (i.e., two eliminated states), the corresponding state is selected as the output.
- If there is no possible output (i.e., three eliminated states), the algorithm falls back to the same probability distribution as the RR mechanism.
For other values of x_j, the algorithm works in a similar way. The probability distributions for sharing a data point are also shown in Figure 2. Based on these probabilities, the algorithm selects the value of y_j. If the attacker knows the τ′ and k′ used in the algorithm, it can compute the possible values for each SNP using the perturbed data, τ′, k′, and the correlations between the SNPs. Since the p/q = e^ε ratio is preserved in each case, the attacker can distinguish the possible inputs by at most a factor of e^ε.
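The per-SNP sharing step described in this subsection can be sketched as follows (a simplified illustration under our naming assumptions; state elimination and the processing order are handled outside this function):

```python
import math
import random

def share_snp(true_value, remaining, epsilon):
    """Share one SNP value given its non-eliminated states `remaining`,
    preserving the p/q = e^eps ratio among the surviving states."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 2)
    q = 1.0 / (math.exp(epsilon) + 2)
    if not remaining:            # all states eliminated: fall back to RR
        remaining = [0, 1, 2]
    if len(remaining) == 1:      # single possible output
        return remaining[0]
    if true_value in remaining:  # normalize p and q over the survivors
        weights = [p if s == true_value else q for s in remaining]
    else:                        # true value eliminated: uniform choice
        weights = [1.0] * len(remaining)
    return random.choices(remaining, weights=weights)[0]
```

When two states survive and one of them is the true value, the true value is shared with probability p/(p + q), so the ratio between the two outputs remains e^ε.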
4.5. Improving Utility by Adjusting Probability Distributions
In Section 4.4, we proposed a data sharing mechanism to improve the privacy of the randomized response mechanism against the correlation attack. The mechanism guarantees that the perturbed data belonging to u_i does not include any value that has low correlation with the other shared SNPs. However, consistent with existing LDP-based mechanisms, the algorithm assigns equal sharing probabilities to each incorrect value of a SNP x_j; that is, the probability of sharing any non-eliminated incorrect value is the same. This also implies that the utility of each incorrect value of a SNP is the same (for the data collector). However, this may cause significant utility loss, since the accuracy of genomic analysis may significantly decrease as the values of shared SNPs deviate more from their original values (e.g., in genomic data sharing beacons or when studying haploinsufficiency). For genomic data, typically, the utility of each value of a SNP is different and may change depending on the purpose of data collection. Here, our goal is to improve the utility of shared data by modifying the probability distributions without violating ε-dependent LDP.
To improve utility, we introduce new probability distributions, such that, for each shared SNP, the probability of deviating significantly from its "useful values" is small. Useful values of a SNP depend on how the data collector intends to use the collected SNPs. For instance, for genomic data sharing beacons, changing the value of a shared SNP from 1 to 2 does not decrease the utility, but sharing it as 0 may cause a significant utility loss. Similarly, while studying haploinsufficiency, obfuscating a SNP with value 0 results in a significant utility loss, while changing a 1 to 2 (or a 2 to 1) does not cause a high utility loss. Here, to show how the proposed data sharing mechanism improves the utility, we focus on genomic data sharing beacons (similar analysis can be done for other uses of genomic data as well).
Genomic data sharing beacons allow users (researchers) to learn whether individuals with specific alleles (nucleotides) of interest are present in their dataset. A user can submit a query asking whether a genome exists in the beacon with a certain nucleotide at a certain position, and the beacon answers "yes" or "no". Since having at least one minor allele is enough for a "yes" answer, having one minor allele (a SNP value of 1) or two minor alleles (a SNP value of 2) at a certain position is equivalent in terms of the utility of the beacon's response. Therefore, if the correct value of a SNP is 1 or 2, sharing the incorrect value as 2 or 1 (respectively) will have higher utility than sharing it as 0. Considering this, we change the probability distributions of the data sharing mechanism (given in Section 4.4) as follows to improve the utility.

- If there are three, one, or no possible outputs, the probability distributions described in Section 4.4 are used.
- If there are two possible outputs and the true value of the SNP is not eliminated, the algorithm shares the true value with probability p/(p+q) and the incorrect value with probability q/(p+q).
- If there are two possible outputs, the true value is eliminated, and the true value is 0, the algorithm uses an adjusted probability distribution over the two remaining values (shown in Figure 3).
- If there are two possible outputs, the true value is eliminated, and the true value is 1, the algorithm uses an adjusted probability distribution over the two remaining values (shown in Figure 3).
- If there are two possible outputs, the true value is eliminated, and the true value is 2, the algorithm uses an adjusted probability distribution over the two remaining values (shown in Figure 3).
As in Section 4.4, p = e^ε/(e^ε + 2) and q = 1/(e^ε + 2). These probability distributions (which are also shown in Figure 3) still preserve the ratio between states. Note that for eliminating the states, the same process is used as described in Section 4.4. To determine the processing order of the shared SNPs, the algorithm in Section 5 is used.
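The restricted sharing step can be sketched as follows. This is a hypothetical sketch, not the paper’s exact Figure 3 distributions: it only illustrates restricting a three-state randomized response to the non-eliminated states while preserving the ratio e^ε between the true value and each incorrect value; the utility-aware reweighting of incorrect values is omitted.

```python
import math
import random

def rr_share(true_value, candidates, epsilon, rng=random):
    """Share one SNP value over the non-eliminated candidate states.

    The true value gets weight e^epsilon and every other candidate gets
    weight 1, so the ratio between the true and any incorrect value stays
    e^epsilon after renormalization. If the true value itself was
    eliminated, all candidates get equal weight (a simplifying assumption
    of this sketch).
    """
    weights = [math.exp(epsilon) if v == true_value else 1.0 for v in candidates]
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for v, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return v
    return candidates[-1]
```

With two candidates, the two sharing probabilities are exactly p/(p+q) and q/(p+q), i.e., the ratio e^ε is preserved after renormalization.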
4.6. Proposed Data Sharing Algorithm
In Section 4.4, we described how to improve privacy by eliminating statistically unlikely values for each SNP. In Section 4.5, we explained how to modify probability distributions to improve the utility of shared data for genomic data sharing beacons. Using these two ideas, we describe our proposed genomic data sharing algorithm in the following and provide the details for an individual in Algorithm 1. The algorithm processes all SNPs one by one and, in each iteration, computes a value to share for the SNP being processed (eventually, all SNPs are processed and they are shared with the data collector all at once). The algorithm first eliminates the states having low correlations with the previously processed SNPs, as described in Section 4.4. Two thresholds (a correlation threshold and an inconsistency threshold) are used to determine the eliminated states. We evaluate the effect of these threshold values on utility and privacy in Section 6.2. Then, the algorithm decides the shared value of the SNP using the probability distributions in Figure 3. This process is repeated for all SNPs and the SNP sequence to be shared (i.e., the output) is determined. Since we consider all pairwise correlations, changing the order may change the utility of the proposed scheme by eliminating different states. We discuss the optimal selection of this order (in terms of utility) in Section 5, and Algorithm 2 outputs the optimal order for each individual. Due to the computational complexity of Algorithm 2, we also propose a greedy algorithm in Section 5.2. Thus, either the output of the optimal or the greedy algorithm is used as the input for the proposed data sharing algorithm. In addition, we also visualize the selection of the next SNP to be processed via a tree-structured flowchart in Figure 9 (in Appendix A). Furthermore, in genomic data sharing, there is an interdependent privacy issue between genome donors and their family members.
Thus, we also discuss the effect of the proposed data sharing mechanism on the privacy of donors’ family members in Section 7.1 and propose an algorithm to determine the maximum privacy budget (in terms of the ε parameter) that can be used by a genome donor while considering the privacy budgets of her family members.
Lemma 4.1 ().
Given a processing order, Algorithm 1 achieves dependent local differential privacy for each genomic data point that is not ineliminable.
Proof.
The proof follows directly from the reallocation of probability mass used in the RR mechanism. Since the mechanism eliminates states only based on the pairwise correlations between SNPs, and the ratio e^ε between the sharing probabilities is preserved in the modified RR mechanism, the condition in Definition 3 can always hold for ineliminable SNPs. ∎
5. Optimal Data Processing Order for the Proposed Genomic Data Sharing Mechanism
Algorithm 1 considers/processes one SNP at a time and, as discussed, different processing orders may cause elimination of different states of a SNP, which, in turn, may change the utility of the shared data. Assuming there are n SNPs in total in the sequence of an individual, Algorithm 1 can process them in n! different orders. As a result, determining an optimal order of processing to maximize the utility of the shared sequence of SNPs is a critical and challenging problem. In this section, we formulate the problem of determining the optimal order of processing as a Markov Decision Process (MDP) (sutton2018reinforcement), which can be solved by value iteration using dynamic programming. Note that the algorithm locally processes all SNPs, and then the perturbed data is shared all at once. Hence, the data collector does not observe the order of processing.
Since we consider genomic data sharing beacons to study the utility of shared data (as in Section 4.5) and the proposed sharing scheme is nondeterministic, we aim at achieving the maximum expected utility for the beacon responses using the shared SNPs. Note that a similar analysis can be done for other uses of genomic data as well. Beacon utility is typically measured over a population of individuals; however, in this work, we consider an optimal processing order which maximizes the expected beacon utility for each individual. The reason is twofold: (i) an individual does not have access to other individuals’ SNPs, and (ii) a population’s maximum expected beacon utility can be achieved if all individuals’ maximum expected beacon utilities are obtained, due to the following lemma.
Lemma 5.1 ().
Maximizing the expectation of individuals’ beacon utility is a sufficient condition for maximizing the expectation of a population’s beacon utility.
Proof.
The proof is given in Appendix B. ∎
The sufficient condition in Lemma 5.1 can easily be extended to other genomic data sharing scenarios as long as the individuals share their SNPs independently from each other.
5.1. Determining the Optimal Processing Order via Markov Decision Processes (MDP)
Here, we proceed with obtaining the optimal order of processing which maximizes individuals’ expected beacon utility. First, we model the SNP state elimination and processing order as an agent-environment interaction framework, where the agent is a specific individual (donor), the environment is the proposed SNP sharing scheme considering correlations (in Algorithm 1), and the interaction follows an MDP.
For instance, consider an individual (donor) in the population. Her MDP interaction with the environment is characterized by a tuple consisting of the set of all her MDP states, her initial MDP state, her action set, the transition probabilities between her MDP states, her set of rewards, and the horizon of the MDP (i.e., the number of rounds in discrete time). In our case, the horizon equals the number of SNPs to be processed. At each time step (i.e., when the individual processes her next SNP), the agent chooses an action from her action pool, i.e., selects a specific SNP from her remaining unprocessed SNPs. Then, the environment provides the agent with an MDP state and a reward. In particular, the MDP state is the list recording all observations of the previously processed SNPs of the individual, and the reward is the utility of the beacon response on the newly shared SNP. After observing the new state and receiving the reward, the agent takes the next action, which causes the environment to transition to the next MDP state according to the transition probabilities. Conditioning the transition probabilities only on the current state and action holds due to the Markov property (sutton2018reinforcement), and the reward is determined by the leaf nodes in Figure 9 (in Appendix A). An illustration of the MDP interaction between the agent (the individual) and the environment (Algorithm 1) at a single time step is shown in Figure 4.
Since the optimal order can be predetermined and should be time-invariant, we model the agent’s decision policy at each time step as a deterministic mapping from MDP states to actions. Due to the nondeterministic behavior of Algorithm 1, we characterize the environment’s behavior as a probabilistic mapping from state-action pairs to next states. Furthermore, we define the future cumulative return of the individual starting from an MDP state as the sum of the rewards collected from that state onward, and the state-value function of an MDP state under a policy as the expected future cumulative return, where the expectation is taken with respect to the environment’s probabilistic mapping. Then, to maximize the individual’s expected beacon utility, at each time step the agent takes the action that maximizes the state-value function; note that this optimal action may not be unique.
Thus, we have formulated the optimal order of processing problem as a finite-horizon MDP problem, whose state, action, and reward sets are all finite and whose dynamics are characterized by a finite set of transition probabilities. The finite-horizon MDP problem is P-complete, as the circuit value problem, a well-known P-complete problem (papadimitriou1987complexity), can be reduced to it. In the literature, the exact optimal solution of a finite-horizon MDP problem can be obtained by several methods, for example, value iteration, policy iteration, or linear programming (bertsekas1995dynamic). In Algorithm 2, we provide a value iteration (sutton2018reinforcement) based approach to determine the optimal order of processing for an individual. Algorithm 2 is implemented using dynamic programming starting from the last time step, and its computational complexity is determined by the sizes of the state and action sets (sutton2018reinforcement). For finite-horizon MDPs, the number of MDP states grows exponentially with the number of variables, which is known as the curse of dimensionality. For example, in our case, at each time step, Algorithm 2 needs to calculate the state-value function for an exponential number of states. In the literature, many approaches have been proposed to address this issue, such as state reduction (dean1997model) and logical representations (boutilier2000stochastic), which, however, are outside the scope of this paper. Therefore, Algorithm 2 may be computationally expensive for large amounts of data, and hence, in the following section, we propose a heuristic approach to process long sequences of SNPs.
5.2. A Heuristic Approach
In this work, we consider sharing thousands of SNPs of individuals in a population. As a consequence, it is computationally prohibitive to obtain the exact optimal order of processing for each individual. We propose the following heuristic approach for an individual to process her SNPs in a locally greedy manner. Specifically, at each time step, the algorithm selects, among the individual’s remaining unprocessed SNPs, the one with the maximum expected immediate beacon utility, where the expected immediate utility of a candidate SNP is determined by a certain leaf node in Figure 9 (in Appendix A). After evaluating the condition in the root node using a SNP, only one leaf node can be activated. For example, without loss of generality, assume that at some time step two SNPs are left; after the elimination check, each of them activates a different leaf node, and the heuristic algorithm selects the SNP whose leaf node yields the higher expected utility. If there is a tie between two SNPs, we randomly choose one. We compare this heuristic approach with the optimal algorithm (in Algorithm 2) in Section 6.3.
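The greedy selection can be sketched as below (a simplified sketch: we treat each remaining SNP’s expected immediate utility as a precomputed score, whereas in the paper it is determined online by the activated leaf node in Figure 9; names are illustrative):

```python
def greedy_order(snps, expected_utility):
    """Process SNPs in decreasing order of expected immediate beacon
    utility: at each step, pick the remaining SNP with the highest score.
    Ties are resolved by max()'s first-seen rule here; the paper breaks
    them randomly."""
    remaining = list(snps)
    order = []
    while remaining:
        best = max(remaining, key=expected_utility)
        order.append(best)
        remaining.remove(best)
    return order
```

Each step scans the remaining SNPs once, which is what makes the heuristic quadratic overall rather than exponential like exact value iteration.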
6. Evaluation
We implemented the proposed data sharing scheme in Section 4.6 and used a real genomic dataset containing the genomes of the Utah residents with Northern and Western European ancestry (CEU) population of the HapMap project (international2003international) for evaluation. We used 1000 SNPs of 156 individuals from this dataset for our evaluations. Using this dataset, we computed all pairwise correlations between SNPs. For each of the approximately 1 million SNP pairs, we computed 9 (3 × 3) conditional probabilities. Hence, we computed approximately 9 million conditional probabilities in total (for all pairwise correlations between all SNPs). Note that, to quantify the privacy of the proposed scheme against the strongest attacks, we used the same dataset to compute the attacker’s background knowledge. However, in practice, the attacker may use different datasets to compute such correlations, and its attacks may become less successful when less accurate statistics are used. We also assumed that each donor has the same privacy budget (ε). To quantify privacy, we used the attacker’s estimation error. Estimation error is a commonly used metric to quantify genomic privacy (wagner2017evaluating); it quantifies the average distance of the attacker’s inferred SNP values from the original data as
where is the attacker’s inference probability for being . We assume the attacker’s only knowledge is and initially, which are computed based on . Then, using the correlations, the attacker improves its knowledge by eliminating the statistically less likely values. For the eliminated states, attacker sets the corresponding probability to . Since can be at most 2 for genomic data, is always in the range , where higher indicates better privacy. Thus, when the attacker’s estimation error decreases, the inference power of the attacker (e.g., to infer the predisposition of a target individual to a disease) increases accordingly. To quantify the utility, we used the accuracy of beacon responses. For each SNP, we first run the beacon queries using the original values and then run the same queries with the perturbed values. Let the number of beacon responses (SNPs) for which we obtain the same answer for both original data and perturbed data be . We computed the accuracy as ( is the total number of beacon queries), which is always in the range .
In the following, we first compare the proposed algorithm with the original RR mechanism in terms of privacy and utility. Then, we evaluate the effect of the design parameters on privacy and utility. Finally, we show the effect of the order of processing on utility.
6.1. Comparison with the Original Randomized Response Mechanism
As we discussed in Section 4.2, the original randomized response (RR) mechanism is vulnerable to correlation attacks: when a given state of a SNP is loosely correlated with a sufficiently large fraction of the other SNPs, the attacker can eliminate that state and hence improve its inference power for the correct value of the SNP. In Figure 5, we show this vulnerability in terms of the attacker’s estimation error (blue and red curves in the figure). We observed that the attacker’s estimation error is the smallest (i.e., its inference power is the strongest) when the correlation threshold of the attacker is 0.02 and the inconsistency threshold of the attacker is 0.03, and hence we used these parameters for the attack.
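The elimination rule used by the attacker (and, with its own thresholds, by the sharing algorithm) can be sketched as follows (a sketch with illustrative names: corr(s, other) stands for the precomputed conditional probability between a candidate state and another shared SNP):

```python
def eliminate_states(candidate_states, others, corr, corr_threshold, frac_threshold):
    """Drop a candidate state of a SNP if its correlation with at least a
    frac_threshold fraction of the other (previously shared) SNPs falls
    below corr_threshold; keep the rest."""
    kept = []
    for s in candidate_states:
        low = sum(1 for o in others if corr(s, o) < corr_threshold)
        if not others or low / len(others) < frac_threshold:
            kept.append(s)
    return kept
```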
Under the same settings, we also computed the estimation error provided by the proposed algorithm when its correlation threshold is 0.02 and its inconsistency threshold is 0.03. Therefore, during data sharing, we eliminated states of the SNPs having correlation less than 0.02 (the correlation threshold of the algorithm) with at least a 0.03 fraction of the previously shared SNPs (in Section 6.2, we also evaluate the effect of these parameters on privacy and utility). We also let the attacker conduct the same attack as in Section 4.2 with the same attack parameters as before. Figure 5 shows the comparison of the proposed scheme with the original RR mechanism (the green curve in the figure is the privacy provided by the proposed scheme). The results clearly show that the proposed method improves the privacy provided by RR after the correlation attack. For instance, for small values of ε, the proposed scheme provides a substantial improvement in privacy compared to the RR mechanism. Note that the privacy of RR before the attack (blue curve in the figure) is computed by assuming the attacker does not use correlations. Hence, when the attacker uses correlations, it is not possible to reach that level of privacy with any data sharing mechanism, and the privacy inevitably decreases. With the proposed scheme, we reduce this decrease in privacy. To observe the limits of the proposed approach, we performed the correlation attack by assuming the attacker assigns the value 0 to all SNPs (the most commonly observed value in genomic data) and we observed the attacker’s estimation error as 0.66 (under the same experimental settings) after the correlation attack. Hence, with any mechanism it is not possible to exceed 0.66 after the correlation attack, and the privacy provided by the proposed scheme is remarkable.
Focusing on genomic data sharing beacons, we also compared the utility of shared data using the proposed scheme with that of the original RR mechanism in terms of the accuracy of beacon answers (using the accuracy metric introduced before). We randomly selected 60 people from the population and used their 1000 SNPs to respond to the beacon queries. For 257 SNPs, there was no minor allele, and hence the original response of the beacon query was “no”. There was at least one minor allele in the 60 people for the remaining 743 SNPs (and hence, the original response of the beacon query was “yes”).
For the original RR mechanism, we shared 1000 SNPs of 60 individuals after perturbation. In the RR mechanism, the data collector removes the noise by estimating the frequency of each value using the sharing probabilities, as described in Section 4.1. Hence, if enough individuals reported 0 for the value of a SNP (after perturbation) so that the estimated number of minor allele carriers was zero, we considered the answer of the beacon as “no”. For the proposed data sharing scheme, we did not apply such an estimation, since in the proposed scheme the sharing probabilities of the states are different for each SNP. Figure 6 shows the accuracy of the beacon for 1000 queries. We observed that our proposed scheme provides approximately 95% accuracy even for small values of ε, while the accuracy of the RR mechanism is less than 70% for small ε values and only reaches 85% as ε increases. We provide the accuracy evaluation for the “yes” and “no” responses separately in Appendix C. Note that we do not quantify the utility over the probability of correctly reporting a point; we quantify it over the accuracy of beacon answers. When the answer of a beacon query is “yes”, the original response of the beacon is mostly preserved after perturbation in both the original RR and the proposed mechanism, as shown in Appendix C (while the proposed mechanism still outperforms the RR mechanism, especially for smaller ε values). On the other hand, when the original answer of a beacon query is “no”, all individuals must report 0 for that SNP (to preserve the accuracy of the response). In this case, applying the original RR cannot provide high accuracy when ε is small, because with high probability, at least one individual reports its SNP value as 1 or 2 (i.e., incorrectly). As we also show in Appendix C, our proposed approach significantly outperforms the RR mechanism in terms of the accuracy of the “no” responses. Therefore, we conclude that the proposed scheme provides significantly better utility than the original RR mechanism.
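The data collector’s frequency estimation for the original RR mechanism can be sketched with the standard unbiased k-ary randomized response estimator (an assumption of this sketch: we use the textbook k-RR estimator with p = e^ε/(e^ε + k - 1) and q = 1/(e^ε + k - 1); the debiased count would then be thresholded to decide a “no” answer):

```python
import math

def estimate_counts(reports, epsilon, k=3):
    """Unbiased estimate of the true count of each value in {0,...,k-1}
    from k-ary randomized response reports: each observed count n_v is
    debiased as (n_v - n*q) / (p - q)."""
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    return [(sum(1 for r in reports if r == v) - n * q) / (p - q) for v in range(k)]
```

Since p + (k - 1)q = 1, the debiased estimates always sum to the number of reports, though individual estimates can be negative for small ε.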
Although we evaluated utility here for genomic data sharing beacons, similar utility analyses can be done for other applications as well. Since the proposed scheme eliminates statistically unlikely values, it will still outperform the original RR mechanism under similar settings. Since the proposed data sharing mechanism considers the correlations with the previously shared data points (as in Algorithm 1), its computational complexity is quadratic in the number of shared SNPs of a donor.
One alternative approach to improve privacy in the original RR mechanism is to add a post-processing step that identifies the SNPs having low correlations with the other SNPs and replaces them with values that have high correlations. Such an approach can be useful to prevent correlation attacks that rely on eliminating less likely values. However, this approach provides much lower utility than the proposed mechanism, since the proposed mechanism improves utility by adjusting probability distributions and optimizing the order of processing. We also implemented this alternative post-processing approach and compared it with the proposed mechanism. We observed an estimation error similar to that of the proposed mechanism, which shows that this approach can also prevent correlation attacks. However, as shown in Table 7 in Appendix D, the post-processing approach provides even lower utility than the original RR mechanism without post-processing, because it becomes harder to do efficient estimation after the post-processing. Hence, the proposed mechanism outperforms the original RR mechanism even if post-processing is applied.
6.2. The Effect of Parameters on Utility and Privacy
Attacker's inconsistency threshold | 0.01  | 0.02  | 0.03  | 0.04  | 0.05
Estimation error (E)               | 0.491 | 0.380 | 0.348 | 0.368 | 0.415
In Section 6.1, we used the correlation threshold of the attacker as 0.02 and the inconsistency threshold of the attacker as 0.03 in its correlation attack. In our experiments, these parameters provided the strongest attack against the original RR mechanism. In Table 2, we show how the estimation error of the attacker changes for different values of its inconsistency threshold when its correlation threshold is 0.02. For the ε value used in this experiment, we computed the estimation error before the attack as 0.78 in the original RR mechanism. Increasing the inconsistency threshold results in the attacker eliminating fewer states. For instance, if the attacker selects a very high inconsistency threshold, it cannot eliminate any states and the estimation error remains 0.78. As the inconsistency threshold decreases, more states are eliminated and the estimation error keeps decreasing up to a point (down to the threshold value 0.03 in our experiments, which provides the smallest estimation error). As we further decreased the threshold beyond this point, we observed higher estimation error values, since as the threshold approaches 0, all states are eliminated for more SNPs; in that extreme case, we computed the estimation error as 0.78 as well. We also observed similar results for different values of ε. Similarly, when the inconsistency threshold is fixed at 0.03, we obtained the smallest estimation error for the attacker (and hence the strongest attack) when its correlation threshold is 0.02.
Algorithm's correlation threshold | 0.02  | 0.04  | 0.06  | 0.08  | 0.10
Estimation Error (E)              | 0.483 | 0.486 | 0.492 | 0.499 | 0.503
Accuracy                          | 0.950 | 0.942 | 0.918 | 0.892 | 0.865
Algorithm's inconsistency threshold | 0.01  | 0.02  | 0.03  | 0.04  | 0.05
Estimation Error (E)                | 0.490 | 0.487 | 0.483 | 0.479 | 0.476
Accuracy                            | 0.932 | 0.940 | 0.950 | 0.954 | 0.959
Since the attack against the original RR mechanism is the strongest when the attacker’s correlation threshold is 0.02 and its inconsistency threshold is 0.03, we set the correlation parameters of the proposed data sharing algorithm the same as the attack parameters in Section 6.1. Here, we study the effect of changing these parameters on the performance of the proposed mechanism. We assume that the attacker does not know the parameters used in the algorithm and uses the parameters providing the strongest attack against the original RR mechanism. First, we evaluated the effect of the algorithm’s correlation threshold on privacy and utility (all correlations that are smaller than this threshold are considered low by the algorithm). Our results are shown in Table 3. We observed that increasing the algorithm’s correlation threshold increases the attacker’s estimation error, since we assume the attacker does not know this threshold and uses 0.02 in its attack. However, using 0.02 provided the best utility for the proposed algorithm. Since there is no correlation (conditional probability) less than 0.02 in our dataset, the minimum possible value that we can use for the correlation threshold in the algorithm is 0.02. We also show the privacy and utility of the proposed scheme for different values of the inconsistency threshold in Table 4. We observed that increasing the inconsistency threshold slightly increases utility; however, the privacy also decreases at the same time.
In the previous experiments (Table 3 and Table 4), we assumed that the attacker does not know the parameters used in the algorithm and uses its strongest-attack parameters (correlation threshold 0.02 and inconsistency threshold 0.03). However, the attacker can perform stronger attacks if it knows the design parameters of the algorithm. Thus, we also computed the attacker’s estimation error by assuming it knows the parameters used in the algorithm. The estimation error of the attacker for different values of the algorithm’s correlation and inconsistency thresholds is shown in Tables 5 and 6, respectively. When we increased the correlation threshold up to 0.1, we observed a slight decrease in the estimation error. For instance, when the attacker knows that the correlation threshold is 0.1, we observed the estimation error of the attacker as 0.420. Similarly, the attacker can decrease the estimation error to 0.434 by knowing the value of the inconsistency threshold and selecting its attack parameters accordingly. We also observed that for correlation threshold values greater than 0.1 and inconsistency threshold values less than 0.01, the decrease in the attacker’s estimation error converged. Overall, we conclude that the attacker can slightly reduce its estimation error by knowing the design parameters of the proposed mechanism; however, the gain of the attacker (in terms of reduced estimation error) is negligible (at most 0.07). Furthermore, the proposed scheme still preserves its advantage over the original RR mechanism in all considered scenarios. These results show that varying the design parameters only slightly affects the performance of the proposed scheme.
In our experiments, we assume that the attacker has the same background knowledge (i.e., correlations between SNPs) as the data owner. If the attacker’s knowledge is weaker than this assumption (e.g., if the correlations computed on the attacker’s side are not accurate), its attack will be less successful and its estimation error will be higher than the one we computed in our experiments. On the other hand, if the attacker’s knowledge about the correlations in the data is stronger than the data owner’s, it can perform more successful attacks. To validate this, we added noise to the correlations computed by the data owner and observed that the attacker obtains a lower estimation error than the one in our experiments. In the worst-case scenario, when the data owner does not know (or use) the correlations in the data, the estimation error of the attacker becomes equal to its estimation error when it attacks the original RR mechanism (i.e., the solid line marked with triangles in Figure 5).
Algorithm's correlation threshold | 0.02  | 0.04  | 0.06  | 0.08  | 0.10
Estimation Error (E)              | 0.483 | 0.478 | 0.462 | 0.446 | 0.420
Algorithm's inconsistency threshold | 0.01  | 0.02  | 0.03  | 0.04  | 0.05
Estimation Error (E)                | 0.434 | 0.468 | 0.483 | 0.497 | 0.508
6.3. The Effect of the Processing Order on Utility
In this section, we show the effect of different orders of processing on the utility of the beacon responses. For all experiments, we set the parameters the same as in Section 6.1 (i.e., correlation threshold 0.02 and inconsistency threshold 0.03), and we quantified the accuracy in terms of the fraction of correct beacon responses for a population. We report the results averaged over 100 trials.
To demonstrate that the greedy order of processing (in Section 5.2) outperforms the random order and provides an accuracy that is close to that of the optimal order (in Algorithm 2), we first compared them using a small dataset of 10 SNPs of 10 individuals (obtained from the same HapMap dataset (international2003international) introduced before). When processing the SNPs of an individual using the random order, we randomly permuted the order of her SNP sequence and then fed it into Algorithm 1. Assuming each donor has the same privacy budget ε and varying this budget, we show the results in Figure 7. We observed that for all the privacy budgets, the accuracy obtained by the greedy order is close to that obtained by the optimal order (for small ε, the accuracies provided by the two orders differ only marginally). In contrast, the accuracy achieved by the random order is the lowest for all the privacy budgets, because the random order does not try to maximize individuals’ expected beacon utility. These results show that the greedy order of processing (in Section 5.2) performs comparably to the optimal algorithm (in Algorithm 2), and hence we use the greedy algorithm for our evaluations with larger datasets.
Next, we compared the accuracy achieved by the greedy and random orders on the original dataset (i.e., 1000 SNPs of 156 individuals). The experimental results are shown in Figure 8. We observed that, compared to the small dataset, the accuracy improved significantly. For example, even under very limited privacy budgets, both orders achieve high accuracy, since the large dataset contains stronger (and more) correlations among SNPs. Correlation in the data is critical for the utility of the proposed data sharing mechanism: when data is correlated, the proposed algorithm eliminates statistically unlikely states and adjusts the probability distributions of the remaining states in such a way that the probability of deviating far from the “useful values” of the shared SNPs is small (as discussed in Section 4.5). From Figure 8, we also observed that the accuracy achieved by the greedy order consistently outperforms that obtained by the random order. This suggests that the utility varies under different processing orders and that we can improve the utility of shared data points (SNPs) in a strategic way (e.g., by selecting them in a greedy manner). This outcome can also be generalized to sharing other types of correlated data. Another advantage of determining the processing order using the greedy algorithm is its computational complexity, which is quadratic in the number of shared SNPs of a donor, whereas the computational complexity of the optimal algorithm (in Algorithm 2) grows exponentially with it.
7. Discussion
In this section, we discuss how to consider kinship in data sharing and how the attacker’s background knowledge affects the privacy guarantees.
7.1. Selection of Privacy Budget by Considering Kinship
Due to the rules of inheritance (i.e., Mendel’s law), sharing the value of a SNP also (indirectly) reveals information about the genomes of the donor’s family members, and this may help an attacker improve its inference about some SNPs of the family members. Let a genome donor share the value of one of her SNPs under the proposed mechanism, and consider one of her family members. Here, we discuss the attacker’s information gain about the corresponding SNP of the family member from the shared value, in terms of the donor’s privacy budget. We assume that all family members have their own privacy budgets that they do not want to violate. An attacker can gain information about the genomes of individuals by using the shared SNPs of their family members. Hence, we first compute the indirect privacy budget consumed for a family member when the donor shares her SNP. Then, we propose an algorithm to compute the maximum privacy budget of a genome donor that preserves the privacy of her family members, considering their privacy budgets and the previously shared SNPs.
Computing the Attacker’s Information Gain
Since the proposed data sharing mechanism assigns different sharing probabilities for a given SNP under different scenarios (as discussed in Section 4.5), we discuss the privacy of family members by assuming all three states of the SNP are possible (i.e., no state is eliminated due to correlations), for the sake of generality. The computations in this part can be done similarly for the other scenarios (e.g., the ones having fewer possible states) using the probability distributions in Figure 3. In the scenario with three possible states, when the donor shares a value for her SNP, the original value of the SNP is the shared value with probability p, and each of the other two values is the original value with probability q. As mentioned before, to achieve ε-dependent LDP, p and q are selected as e^ε/(e^ε + 2) and 1/(e^ε + 2), respectively. Therefore, using Mendel’s law, the attacker can compute the probability of each possible value of the corresponding SNP of a family member. For instance, the probability that the family member’s SNP takes a particular value can be computed by summing, over the three possible original values of the donor’s SNP, the product of the corresponding sharing probability and the Mendelian inheritance probability. Hence, based on these conditional probabilities (computed using Mendel’s law), the attacker can gain a certain amount of information about each family member of the donor (the amount of information depends on the kinship relationship between the donor and the corresponding family member). Moreover, the attacker can gain more information about a victim if more than one of the victim’s family members share their values for the same SNP.
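The attacker’s computation can be sketched as follows for the parent-to-child direction (a sketch under simplifying assumptions we introduce for illustration: the unobserved second parent transmits a minor allele with a fixed probability f, and the sharing probabilities are treated as the attacker’s posterior over the donor’s true value, which holds under a uniform prior since p + 2q = 1):

```python
import math

def transmit_prob(genotype):
    """Probability that a parent with the given genotype (0, 1, or 2
    minor alleles) transmits a minor allele to a child (Mendel's law)."""
    return {0: 0.0, 1: 0.5, 2: 1.0}[genotype]

def child_given_parent(parent_value, f):
    """[P(child=0), P(child=1), P(child=2)] given one parent's value,
    assuming the other parent transmits a minor allele with probability f
    (a simplifying assumption of this sketch)."""
    t = transmit_prob(parent_value)
    return [(1 - t) * (1 - f), t * (1 - f) + (1 - t) * f, t * f]

def attacker_child_distribution(shared_value, epsilon, f):
    """Combine the RR probabilities over the donor's true value with
    Mendelian transmission: P(child = v) =
    sum over y of P(true = y | shared) * P(child = v | parent = y),
    with p = e^eps/(e^eps + 2) and q = 1/(e^eps + 2)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 2)
    q = 1.0 / (math.exp(epsilon) + 2)
    posterior = [p if y == shared_value else q for y in range(3)]
    child = [0.0, 0.0, 0.0]
    for y in range(3):
        cond = child_given_parent(y, f)
        for v in range(3):
            child[v] += posterior[y] * cond[v]
    return child
```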
As discussed in Section 3.1, each SNP of a child inherits one allele from the mother and one allele from the father. This means that the attacker can gain the most information from first-degree family members. For instance, if a child has 2 (two minor alleles) as the value of a SNP, the attacker can (using Mendel’s law) infer that neither of her parents can have 0 as the value of that SNP. In Appendix E.1, we analyze the privacy loss of a victim when one of her/his first-degree relatives shares her SNP using the proposed method. We also extend the analysis and consider a case in which two children of a victim (parent) share their SNPs under dependent LDP, in Appendix E.2.
Determining the Maximum Privacy Budget
The maximum privacy budget of each family member who wants to share her SNPs can be computed similarly, by considering the privacy budgets of the family members and the family members who previously shared their SNPs under dependent LDP. Let all members in a family be denoted as a set, and assume some of them have shared their SNPs previously. A basic algorithm, given in Algorithm 3, computes the maximum privacy budget that can be used by a family member who wants to share her SNPs. For each family member who has not yet shared, the algorithm computes the maximum privacy budget that the sharer can use to preserve that member’s privacy (by computing