The functions of the human body are encoded in DNA. The way cells develop and form different tissues and limbs is closely tied to the information stored in the genome. The human genome is a sequence of nucleotides drawn from the four-member set {A, C, G, T}. Human genome sequences are very similar, more than 98 percent alike. Most of the variation among human genomes is due to Single Nucleotide Polymorphisms (SNPs). In fact, an individual's genome can be uniquely characterized by its SNPs; determining them is called genotyping.
Access to the genome sequence can benefit individuals in health care, in both diagnostic and therapeutic decision making. As a result, the use of genetic testing services, as well as the number of genetic testing providers, has risen sharply in the past decade. As genomic data becomes a central part of health care, concerns about the privacy and confidentiality of this data have grown as well. Disclosed data can be used maliciously, for example by insurance companies to raise rates for particular diseases and drugs. Moreover, disclosure also endangers information about relatives, due to the inherited similarities between family members. Thus, access to genomic data is on the one hand useful in curing diseases, while on the other hand its disclosure violates the privacy of individuals [11, 12, 13, 14]. Many papers address privacy in the exploration of genomic data. Some use the concept of k-anonymity to provide data privacy, some use differential privacy, and others provide solutions based on cryptographic methods [15, 16, 17, 18, 19, 20, 21, 22, 23]. The objective in all of those papers is to ensure that no one's data is revealed in a published data set when data is shared for research purposes. In this paper, we look at the privacy issue in a different way. Privacy is already violated at the beginning of the sequencing process, because the sequencing company has access to the sequence. Therefore, before we even disclose our data, the company knows our sequence.
The most popular method for sequencing the whole genome is shotgun DNA sequencing. In this method, the genome is broken into multiple fragments of various lengths. A sequencing machine then reads these fragments and assembles the reads to build the whole sequence. The available assembly algorithms make the sequencing procedure both cost and time effective. With existing sequencing machines, it takes just a couple of days and costs less than 1000 dollars to sequence a genome. To further reduce cost and time, pooled sequencing can be used. In this methodology, rather than sequencing one individual, the genomes of a set of individuals are pooled together and sent to the sequencer. This reduces the cost compared with sequencing each individual's genome separately. Also, as will be seen later, pooled sequencing helps us meet the privacy constraint.
Taking a close look at the sequencing procedure, we realize that the sequencing process is itself a source of leakage of the sequence information. In this paper we introduce a scheme in which sequencing is possible while this kind of leakage is prevented, and we guarantee this privacy mathematically. In fact, we sequence the genomes of a set of individuals using a sequencing machine, while limiting the knowledge obtained by the sequencer as desired. We first note that the sequencing process consists of two phases. The first is the reading phase, in which the sequencer reads the received fragments, i.e. determines the sequence of nucleotides in each fragment. The second is the processing phase, in which a machine called the data collector, using the received reads, assembles the sequence of each individual. We aim at separating the two phases to provide privacy. In fact, we introduce a methodology in which the sequencer is unable to carry out the processing phase while the data collector is able to. In other words, the reading phase, which needs high-tech machines, is outsourced, and the processing phase, which is computational, is done on a trusted local machine. To separate the two phases, we should make sure the data collector has more information than the sequencer. One idea used in that direction is to add a set of individuals whose genome sequences are known a priori to the data collector and unknown to the sequencer. The other idea is to use finite field addition. Briefly, if we have two independent binary random variables and one of them has a uniform distribution, their sum in the binary field reveals no new information about the non-uniform random variable; i.e. observing the output of this summation does not change the distribution of that random variable relative to its prior distribution. With these two ideas, we limit the information leakage at the sequencer as desired, while letting the data collector reconstruct the sequences.
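The finite field addition idea can be checked numerically. The following is a minimal sketch, not part of the proposed scheme itself: it computes the mutual information between a binary variable X and the masked value X xor U exactly, for independent X and U. The function name and the probability arguments are illustrative assumptions.

```python
from itertools import product
from math import log2


def mutual_information_with_masked_bit(p_one: float, q_one: float) -> float:
    """Return I(X; X xor U) in bits for independent bits X and U.

    p_one = P(X = 1) and q_one = P(U = 1); both names are illustrative.
    """
    joint = {}
    for x, u in product((0, 1), repeat=2):
        px = p_one if x else 1 - p_one
        pu = q_one if u else 1 - q_one
        y = x ^ u  # addition in the binary field
        joint[(x, y)] = joint.get((x, y), 0.0) + px * pu

    # Marginal distributions of X and of the masked output Y.
    p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

    return sum(
        p * log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

With a uniform mask (`q_one = 0.5`) the result is zero for any prior on X, while a biased mask leaks a positive amount of information.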
This problem is conceptually connected to the Shamir secret sharing scheme. In this scheme, a secret is split into multiple parts, and each part is stored in a database. The split is done in such a way that the secret can be reconstructed from a subset of the databases. More precisely, there is a threshold on the number of databases: any subset with at least that many databases can reconstruct the secret, and any subset with fewer databases than the threshold receives no information about the secret. Many works build on this solution.
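For readers unfamiliar with threshold secret sharing, a minimal sketch of the standard polynomial construction is given here. It illustrates the threshold property only and is not part of the sequencing scheme; the prime modulus and the function names are illustrative assumptions.

```python
import secrets

# Illustrative field size: the Mersenne prime 2**127 - 1.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, num_shares: int):
    """Split `secret` into points on a random polynomial of degree threshold - 1."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 from at least `threshold` shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, `split_secret(123456789, 3, 5)` produces five shares, and any three of them passed to `recover_secret` return the original value.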
The rest of the paper is organized as follows. The problem setting is provided in Section II. In Section III, an achievable scheme is introduced with the corresponding results. In Section IV, a generalized version of the scheme is introduced with the resulting theorems. Section V concludes the paper and outlines some future steps.
II Problem Setting
We propose an architecture in which there is a trusted data collector and a sequencing machine (i.e. sequencer). Also, there is a set of individuals who want their genomes to be sequenced privately, without leaking the sequence data to the sequencer. There are individuals in this set and they are labeled from to . The data collector has the duty to gather the genomes of the individuals in the set, pool their fragments together (each genome is sheared into fragments of various sizes), and send this pool to the sequencer. Then, the sequencer reads these fragments (reading phase) and reports the resulting reads to the data collector. Finally, the data collector, using the set of reads, assembles the sequences of all individuals (processing phase) and reports the results to them. To provide privacy, unlike conventional methods, we aim at separating the reading phase from the processing phase. In fact, the sequencer carries out the reading phase and the data collector carries out the processing phase. Our privacy objective is to guarantee that the processing phase cannot be done at the sequencer.
To separate the two phases, we should create an information gap between the sequencer and the data collector. To do this, we use another set of individuals whose sequences are known beforehand to the data collector but unknown to the sequencer. The genomes of this set of individuals are also collected by the data collector, and their fragments are added to the pool. This set is of size and the individuals are labeled from to and are called known individuals. The individuals in the previous set, whose genomes are to be sequenced, are called unknown individuals.
We referred to SNPs earlier as the main source of difference between human genomes. Although there are four types of nucleotides, only two of them can occur in each SNP position across all individuals, and this binary set in every position is known a priori for the population. Also, for each SNP position, the allele occurring more frequently in the population is called the major allele and the one occurring less frequently is called the minor allele. Considering this, the sequence of every individual can be characterized by a vector in where is the total number of SNPs and and represent the minor and major alleles, respectively. Moreover, we define the matrix which contains the random variable in its row and column that indicates the allele for unknown individual in SNP position . Similarly, the matrix is defined for the known individuals. Keep in mind that the entries in are unknown at both the sequencer and the data collector, but the entries in are unknown to the sequencer and known to the data collector, leading to an information gap between the two.
Let and denote the set of fragments containing SNP position for the unknown individual and known individual respectively. The data collector sends the set of fragments to the sequencer (see Fig. 1). Let us define the random variables and as the coverage depth for SNP position for the unknown individual and known individual , respectively. Note that in the sequencing process, each individual provides a number of copies of the genome to the data collector, so for most regions of an individual's genome there are multiple fragments containing the region. The sequencer reads each SNP with some probability of error. As will be seen later, to reduce the effect of the reading error caused by the sequencer, we should increase the coverage depth. The set of reads sent to the data collector by the sequencer is denoted by .
Sequencers have errors in reading bases. The probability of error in reading a SNP in a fragment is assumed to be constant across all sequences and for all SNPs and is denoted by . More precisely, in the sequencer, for a fragment of an individual, and in a SNP, the probability that a is read or vice versa is , independent of the individual, the fragment, and the SNP.
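The effect of coverage depth on this error model can be illustrated with a short simulation. This is a sketch under the stated model only (each read of a SNP is flipped independently with a fixed probability); the names `depth` and `error_prob` are illustrative stand-ins for the paper's symbols.

```python
import random


def observed_error_fraction(depth: int, error_prob: float, rng: random.Random) -> float:
    """Fraction of `depth` reads of one SNP that the sequencer reads incorrectly."""
    flips = sum(rng.random() < error_prob for _ in range(depth))
    return flips / depth


def mean_abs_deviation(depth: int, error_prob: float, trials: int, seed: int = 0) -> float:
    """Average distance between the observed error fraction and error_prob."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(observed_error_fraction(depth, error_prob, rng) - error_prob)
    return total / trials
```

Comparing `mean_abs_deviation(10, 0.1, 200)` with `mean_abs_deviation(1000, 0.1, 200)` shows the observed error fraction concentrating around the error probability as the depth grows, which is why a larger coverage depth makes the noise easier to remove.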
Having as a side-information, the data collector maps to the matrix using a function , i.e.
refers to an estimate of the matrix of SNPs for unknown individuals.
The proposed scheme should be such that the following two conditions are satisfied:
Reconstruction Condition: Let and denote the column of the matrix and respectively. The reconstruction condition requires that the inequality below hold for any given :
is referred to as the accuracy level and is a design parameter.
Privacy Condition: For privacy to hold, we want the distribution of to remain almost the same before and after reading the fragments. To be precise, the privacy condition requires that the following inequality hold for any given :
is referred to as the privacy level and is a design parameter.
In the following section we will introduce a proposed scheme that satisfies the two conditions simultaneously.
III Structured Achievable Scheme with Constant Coverage Depth
Assumption 1: Every fragment is short enough to contain no more than one SNP.
Assumption 2: Every fragment is long enough to be correctly mapped to the reference genome, i.e. we can identify exactly which region of the genome sequence it came from.
These two assumptions are realistic. There are approximately 3.3 million SNPs in the human genome. Compared with the roughly 3 billion base pairs of the whole genome, this gives an average distance between two SNPs of roughly 1000 base pairs. Moreover, with short-read alignment algorithms such as Bowtie, it is possible to map reads whose length is on the order of a few hundred bases. Thus, by using such algorithms and choosing fragment lengths of about a few hundred bases, both assumptions hold simultaneously.
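As a quick check of the spacing estimate, the arithmetic can be written out using only the two figures quoted in the text; the variable names are illustrative.

```python
GENOME_LENGTH_BP = 3_000_000_000  # approximate length of the human genome
NUM_SNPS = 3_300_000              # approximate number of SNPs

# Average number of base pairs between consecutive SNPs.
average_spacing_bp = GENOME_LENGTH_BP / NUM_SNPS
print(round(average_spacing_bp))  # 909
```

The result, about 909 base pairs, is consistent with the rough figure of 1000 and is several times longer than a fragment of a few hundred bases.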
In the proposed achievable scheme, we focus on the case where . In cases where is greater than , we partition the set of individuals into groups of size and use this scheme for each group separately. In this paper, we propose a specific assignment scheme for the coverage depth parameters. In the proposed solution, named structured scheme, for , , and we have
where . Also, entries in have prior probabilities following the major allele frequencies and entries in have uniform prior probabilities.
Keeping the coverage depth variables exactly as specified in the above equations is practically impossible, since they are in fact random variables. Analyzing the random case is rather complicated. To gain a better understanding of the problem and keep the analysis tractable, in this section we consider the constant case; later, in Section IV, we generalize the results to the case of random coverage depths.
First, we introduce the main results. Then we derive the mathematical models at the data collector and the sequencer in Subsections III-A and III-B, respectively. We rely on these models to prove the main results in Subsections III-C and III-D. Finally, we discuss the results in Subsection III-E.
The following theorem provides a sufficient condition for the reconstruction condition to hold.
Theorem 1: In the structured scheme with constant coverage depth and reading probability of error of , the reconstruction condition (1) is satisfied if
The following theorem provides a sufficient condition for the privacy condition to hold.
Theorem 2: In the structured scheme with constant coverage depth, the privacy condition (2) is satisfied if
The main message of these results is that we can choose the parameters of the proposed scheme such that both conditions are satisfied, simultaneously. In other words, these theorems confirm that the separation of the reading phase and the processing phase together with adding known individuals and by adjusting coverage depths, offers enough flexibility to satisfy both conditions at the same time; based on (6), is chosen, and using (5), is set.
III-A Mathematical Model in Data Collector in the Structured Scheme
For any SNP position , the data collector should be able to estimate the vector .
In this subsection, we seek the model that the data collector observes in SNP position . We will show that the data collector receives as
in which where
To obtain this model, we should keep in mind that the fragments carry no tags, so neither the data collector nor the sequencer knows which individual each fragment belongs to. Therefore, when the data collector receives the reads from the sequencer, the only information it gets is the number of major (or minor) alleles in every position . Consequently, the data collector receives the following summation
in which and are noisy versions of and respectively, due to the reading error caused by the sequencer. Also, recall that the data collector knows the sequence of known individuals a priori, i.e. it knows the value for all . Let us assume these values are . Therefore we have
Note that subtracting in the above equation is valid, because full knowledge of the matrix is available at the data collector.
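The counting and subtraction steps can be sketched for a single SNP position in the noiseless case. This is an illustration only: coding the major allele as 1 and the minor allele as 0, as well as all function and argument names, are assumptions made for the example.

```python
def pooled_major_count(alleles, depths):
    """Number of untagged reads showing the major allele (coded 1) at one position.

    alleles[i] is the allele of individual i and depths[i] its coverage depth.
    Read errors are ignored in this sketch.
    """
    return sum(d for a, d in zip(alleles, depths) if a == 1)


def unknown_contribution(total_count, known_alleles, known_depths):
    """Remove the part of the pooled count caused by the known individuals."""
    return total_count - pooled_major_count(known_alleles, known_depths)


# Three unknown and two known individuals pooled together.
unknown_alleles, unknown_depths = [1, 0, 1], [4, 2, 1]
known_alleles, known_depths = [0, 1], [3, 5]

total = pooled_major_count(
    unknown_alleles + known_alleles, unknown_depths + known_depths
)
residual = unknown_contribution(total, known_alleles, known_depths)
```

Here the pooled count is 10, the known individuals account for 5 of it, and the remaining 5 is exactly the count the unknown individuals would produce on their own. The sequencer cannot perform this subtraction because it does not know the alleles of the known individuals.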
Next, we derive the parameters of the random variable conditioned on knowing . Based on (10) we have
in which the last inequality is valid for both possible values of ; i.e. and . Using the MMSE estimate and orthogonality principle, we can write
where is a random variable with and . Also and are uncorrelated. Consequently
Based on the central limit theorem,
converges in distribution to a normal distribution with zero mean and variance . Thus, the last term on the right-hand side of (16) converges to a normal distribution with zero mean and variance . Similarly, using (11), we reach a similar equation.
III-B Mathematical Model in Sequencer in Structured Scheme
Similar to the previous subsection, the sequencer receives the summation in (9). The difference from the previous subsection is that all individuals are unknown from the sequencer's viewpoint. Therefore,
Yet, follows (10).
Scaling the summation in (9), the sequencer receives defined as
Taking similar steps as in the previous subsection, is written as
where in which is defined in (8).
III-C Proof of Theorem 1
Note that the value of the summation uniquely corresponds to a (in its binary representation, each entry corresponds to a for different values of ). Therefore, our objective is to find the summation above. The probability of error in estimating the summation, based on (7), is simply upper bounded by
Obviously, here due to the fact that s are chosen from the set . Thus
in which is defined in (8).
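The binary-representation argument used at the start of this proof can be sketched as follows. The sketch assumes that individual m contributes a weight of 2**m to the summation; this power-of-two weighting is an assumption made for illustration, since the exact coverage depth assignment is not reproduced here, and the function names are hypothetical.

```python
from itertools import product


def encode(alleles):
    """Weighted sum in which individual m (0-indexed) has weight 2**m (assumed)."""
    return sum(a << m for m, a in enumerate(alleles))


def decode(total, num_individuals):
    """Read each individual's allele off the binary representation of the sum."""
    return [(total >> m) & 1 for m in range(num_individuals)]


# Every binary allele vector maps to a distinct sum, so decoding is unambiguous.
all_vectors = list(product((0, 1), repeat=4))
all_sums = {encode(v) for v in all_vectors}
```

Because the 16 vectors of length 4 give 16 distinct sums, recovering the sum is equivalent to recovering the allele vector, which is why the proof only needs to bound the error in estimating the summation.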
III-D Proof of Theorem 2
The fact is that, for the sequencer, is equivalent to , because fragments contain just one SNP and are grouped by the SNP position they contain, and for the group containing SNP position , the information is stored in . Thus we have
Recall that denotes the column of . Due to independence of entries in , we have
Based on the last two equalities
Thus, for privacy condition (2) to be satisfied, it is sufficient for every to have
To begin, we define as
It is concluded that the following Markov chain holds,
Thus we have
In what follows, we seek . We have
We expand in binary representation
Consequently, the following equations hold
in which in equation , is the carry over of the left-hand summation in binary field. Equivalently we have
We expand the right hand side of the above equality as
Based on (35) and the fact that entries of have uniform prior probabilities, has uniform distribution, so . For we have
which also results in . Note that follows from the fact that is a sufficient statistic for . Also follows from (36). Similarly, all the terms in (III-D) result in except the last term. Therefore,
Using the last two equalities and (III-D), we have
The proof is complete.
As seen from Theorem 1, the minimum needed to satisfy the reconstruction condition behaves exponentially with . is a noise-resistance parameter: as it becomes larger, the fraction of fragments containing false reads concentrates around the probability of error in the reading phase (); that is why increasing helps to eliminate the noise term in (7).
Taking a deeper look at the proof of Theorem 2, we realize that we have created binary field addition in our scheme, as desired. The bits derived from (35) to (37) are the result of binary field addition. The addition involves two random variables, where one of them has a uniform distribution, , and the other, , follows the distribution of SNP position . If the value of is given alone, the result reveals no new information about . Thus, these bits alone do not leak any information. We have created this kind of addition by adjusting the coverage depth values. From (7) it is concluded that the only bit leaking information in position is , which means the binary field addition scheme does not work perfectly; this is a limitation inherent to the problem addressed in this paper. Interestingly, the maximum entropy of this bit is and this upper bound on the information leakage is independent of . This property is useful: it implies that the average information leakage per bit is at most . Therefore, by increasing , this average decreases, so we can adjust so that we reach the desired . Note that, based on our simulations, is an increasing function of (see Figure 3) and tends to a limiting value. So by increasing , the information leakage per bit decreases at the rate of , not faster.
IV Structured Achievable Scheme with Random Coverage Depth
In the previous section, we analyzed the problem for constant coverage depth; however, this is not a practical case because we do not have exact control over the number of fragments. In this section, we consider a more general case in which the coverage depth parameters are random variables. We assume them to be binomial random variables and approximate them with a normal distribution. Therefore, for , we have
Similarly for , we have
Due to the fact that coverage depths mostly have large values, we assumed that .
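The quality of the normal approximation to a binomial variable at large values can be checked directly. The following is a sketch with hypothetical parameters `n` and `p`; it does not use the paper's actual coverage depth parameters.

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def max_pointwise_gap(n: int, p: float) -> float:
    """Largest gap between the binomial pmf and the matching normal density."""
    mean, var = n * p, n * p * (1 - p)
    return max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, var))
        for k in range(n + 1)
    )
```

Evaluating `max_pointwise_gap(20, 0.3)` and `max_pointwise_gap(1000, 0.3)` shows the gap shrinking as the number of trials grows, which supports using the normal approximation when coverage depths are large.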
As in the previous section, we first state the results. After that, the mathematical model and the estimation rule are introduced in Subsections IV-A and IV-B. Then, the proof of Theorem 3 is provided in Subsection IV-C. Finally, we discuss the results in Subsection IV-D.
The following theorem provides a sufficient condition to satisfy the reconstruction condition.
Theorem 3: In the all-but-one scheme, the reconstruction condition (1) is satisfied if:
For the privacy condition, Theorem 2 is valid here as well. This will be discussed later in Subsection IV-D.
IV-A Mathematical Model in Data Collector in the Structured Scheme
In this subsection, we will show that the information the data collector receives is the value in which is written as
where and are normal random variables with zero mean and variance and respectively, where
In the pooled sequencing scenario, the sequencer will receive , which is defined as
Consider the random variable conditioned on . We have
It is trivial that the random variables and are independent conditioned on . Also, the distribution of resembles that of both conditioned on .
Thus we have and
Similar to the steps taken in Subsection III-A and as a result of the central limit theorem and orthogonality principle
For the second term in (IV-A) we have
Using the law of total variance we have
where . Using the same steps, for the data collector we have
IV-B Estimation Rule
For any SNP position , the objective for the data collector is to estimate the vector . We define the extended vector , where the last entries are known to the data collector. Therefore, for the data collector, estimating is equivalent to estimating .
In this section, our objective is to find the rule that should be used by the data collector to estimate . Using the ML rule, the estimate is obtained by
Based on (IV-A),
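To make the ML rule concrete, here is a minimal sketch under a simplifying assumption: the observation is a weighted sum of the binary entries plus zero-mean Gaussian noise whose variance does not depend on the candidate vector, in which case the ML estimate is the candidate whose noiseless sum is closest to the observation. The weights and the function name are hypothetical and are not taken from the paper.

```python
from itertools import product


def ml_estimate(observation: float, weights):
    """Return the binary vector whose weighted sum is nearest to the observation.

    Under additive Gaussian noise with candidate-independent variance
    (an assumption of this sketch), this minimum-distance choice is the
    maximum-likelihood estimate.
    """
    best = min(
        product((0, 1), repeat=len(weights)),
        key=lambda x: (observation - sum(w * b for w, b in zip(weights, x))) ** 2,
    )
    return list(best)
```

For example, with weights `[1, 2, 4]` an observation of 5.3 is closest to the sum 5, so `ml_estimate(5.3, [1, 2, 4])` returns `[1, 0, 1]`. The exhaustive search is exponential in the number of entries and is meant only to show the decision rule.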
IV-C Proof of Theorem 3
Based on the mathematical model and estimation rule presented in the previous subsections, we are ready to provide the proof of Theorem 3.
Similar to the proof presented in Subsection III-C and based on the estimation rule in (70), our estimation problem resembles detection over an AWGN channel. In other words, if we estimate , then follows accordingly. Thus, for the probability of error we have
Requiring the right-hand side to be less than results in
Rewriting the left-hand side by substituting yields