DNA sequencing is a modern technology that allows the researchers to access the genomic information of a vast number of individuals. The genome is a high dimensional mathematical object which encodes most of the biological functions of an individual. Finding the relations between the elements of the genome and the individuals’ characteristics is a challenge which requires huge amount of data. Fortunately, DNA sequencing technologies are growing so rapidly that the cost of generating massive genomic datasets is affordable for many research institutes.
In the upstream analysis of DNA sequencing datasets, the target is to recover an individual’s genome from a set of reads produced by a sequencer. The information theoretic aspects of this problem are addressed in , where the fundamental limits of reconstructing the genome are characterized. For further studying the problem, please see in [3, 4, 5].
In the downstream analysis, after reconstructing the genome, the main challenge is to learn the connections between the genomes and the real world observable characteristics or phenotypes. This problem is known as Genome-Wide Association Studies (GWAS), where given the genomes sequences of a population of individuals and their observed phenotypes, the objective is to learn the sites in the genome that cause or relate to the observed characteristics. In other word, in a genome of length , only a subsequence of length called causal subsequence is correlated with the observed phenotype.
Being a fundamental problem in biology and medicine, GWAS has been studied extensively and novel biological results have been discovered (see for instance [6, 7, 8, 9, 10, 11, 12]). We note that an important application of GWAS is when the observed phenotype is related to a disease. For instance, there is a number of papers studying the causal factors of diabetes (type I  and type II ) as well as various types of cancers , for example the breast cancer  and the prostate cancer .
Many algorithms have been proposed to solve the GWAS problem (see for instance ). Despite all of these progresses, addressing the problem from an abstract level where reliability in inference is the main objective is missing. We aim at filling this gap by providing the first fundamental limits on this problem. In particular, we study the GWAS problem from an information theoretic perspective.
In this paper an abstract probabilistic model is proposed where it is assumed that a dataset of individuals’ genomes and their corresponding phenotypes is given111In this paper, we assume that the dataset is drawn from one population, i.e., the population stratification is assumed to be done before. This means that the dataset is homogeneous. . See Figure 1 which shows the proposed model of this paper for studying the problem of learning the causal subsequence. The main properties of this model are as follows.
The genomic information of each individual is represented by a sequence of length . This sequence is denoted by in the figure.
The observed phenotype takes two states, which are denoted by 0 and 1.
The observed phenotype is related to a specific subsequence of length of the genome sequence, which is denoted by . Here is the (latent) causal subsequence.
The sequence denotes the genomic locations which are correlated to the observed phenotype. The (deterministic) function maps the genomic information to the label set . The phenotype is also related to the other effects, such as environmental effects, which are modeled by an additive noise.
To model more closely real-world phenomenon, we note that the phenotypes can come from two different sources: genome and environment. For the modeling of the environmental effects, we assume that the mapping of the genomes to the phenotypes is stochastic, i.e., a susceptible individual based on her genome may show the phenotype with some probability related to her environment. Here we treat these features as an additive noise. Although this assumption may be not realistic in cases that the environmental and genomic effects are dependent, this simplification allows us to analyze the problem in an efficient way and to derive non-trivial results.
A natural way to model the GWAS problem is to consider (latent) patterns from a specific (latent) subsequence of the genome (i.e., the causal subsequence) that cause the observed phenotype. In this paper, the proposed model uses a deterministic function
to take the (latent) patterns into consideration. This assumption allows us to use the results in the theory of machine learning for the analysis of the problem. In particular, we use some bounds based on the VC dimension222 Vapnik-Chervonenkis dimension. of the class of functions to prove the results.
This paper revolves around answering the following question. What is the required sample complexity that implies the reliable learning of the causal subsequence? Our main contribution is that in a biological plausible model, we characterize the fundamental limits of learning the causal subsequence exactly. We define the rate of GWAS as , where is the binary entropy function. This definition, as we will see later, is the natural scaling law of the parameters of the problem that allows to define the capacity. Note that we are interested in maximizing the rate for the GWAS problem. A rate is said to be achievable if it is possible to learn the causal subsequence reliably enough, as and . The capacity is defined as the supremum of all achievable rates.
We define two notions of the zero-error-rate and the
error-rate estimation of the causal subsequence. In theerror-rate estimation, we are interested in learning the causal subsequence with an error-rate of at most , while in the zero-error-rate estimation, no positive error-rate is acceptable. In the zero-error-rate estimation, we fully characterize the capacity region. The capacity is proven to be a finite positive number, which shows that the parameters scaling is correct. In the error-rate regime, we show that for small enough , the capacity is the same as the zero-error-rate case. This shows that, eventually, the problems of estimation of the causal subsequence, with the zero-error-rate and with the error-rate are equivalent in the asymptotic regimes.
2 Problem Formulation
In this paper, random variables are represented by capital letters, such as, and their realizations are represented by lower case letters, such as . But as an exception, we use the capital letters , and
to denote the problem’s parameters, which are non-random. For a (discrete) random variable, denotes its probability mass function (pmf). Random pmfs are also denoted by capital letters, like . We denote the sequences by bold letters, like , and the capital bold letters represent random sequences, like . For any positive integer , let . The
norm of a vectorfor each is defined as
For a sequence of length , like and a sequence of length with entries in , like , we define as a subsequence of . Also we denote the length of a sequence by . In this paper, all the logarithms are in the base two and are denoted by . For any , the binary entropy function is defined as
The mutual information of two discrete random variables and is denoted by The distance of two sequences and , is defined as
where means the symmetric difference of sets. The set of all strictly increasing sequences of length with entries in is denoted by . Given a finite set , we define
2.2 System Model
We are interested in associating a specific phenotype with the genotype of a population of individuals. In our problem, each individual’s genotype can be represented by a sequence of length with elements from a finite set . Moreover, the phenotype of each individual can take only two states, which can be represented by . There is a stochastic map , which associates the genotype of the individual, denoted by , to its phenotype . Figure 1 represents the set of biologically plausible functions that we are interested in. In particular, for generating the phenotype , first a subsequence of length of , represented by is chosen. Then, passes through a deterministic binary-valued function and the result is . Finally, meets an additive noise and makes the label . More precisely, , where is a Bernoulli random variable with parameter . We assume that the additive noises are independent for different individuals and they are also independent from the sequences of the individuals. The existence of this additive noise in the problem setup is due to the effect of the other factors in the observed phenotype, such as environment. Note that this model is robust. In one extreme, the observed phenotype is very correlated with the genome (). In the other extreme, the labels are approximately independent from the genome (). Note that is given in the problem.
In our model, the sequence and the deterministic function are unknown but they are same for all individuals. It is assumed that the parameter is given and also . We also assume that the deterministic function is chosen uniformly and randomly from , for a given positive integer . Note that the parameter denotes the number of causal factors of the target phenotype. In the machine learning literature, the parameter is also related to the Vapnik-Chervonenkis dimension (VC dimension) of the class of functions in our model. The sequence is selected randomly and uniformly from the set of all strictly increasing sequences of length with the entries belong to , which is denoted by 333 We note that because there is no information about and and they are latent in the model, we assume that the prior distribution of them is uniform. It is also worth mentioning that the uniform sampling of is only needed for the converse proofs. The achievability proofs hold for any prior distribution on the set . . The entries of the sequence represent the sites in the genome that affect the observed labels. The dataset of individuals sequences is assumed to be sampled uniformly and independently from the set .
We define the parameter in the proposed data generation model. It is assumed that is known444 Note that is a constant that only depends on and . . Note that . In this paper, our objective is to estimate , given sampled sequences and their corresponding observed labels . In the following, we formally define the GWAS algorithms.
Algorithm is a mapping from the set of all possible input datasets, , to the set . When there is no ambiguity, we denote an algorithm by or sometimes by . For a dataset , denotes the output of the algorithm.
Next we formally define the error event and also the probability of error for the algorithms.
For a positive and an algorithm , the error event is defined as , where is the output of the algorithm. Also the worst-case probability of error of an algorithm is defined as
The average probability of error is also defined as .
The parameter is a threshold for the normalized distance between and its estimation . Notice that for any algorithm , . Also by the definition of the error event, it is obvious that if then and thus and
In this paper, we want to characterize the fundamental limits of the GWAS problem, i.e., the region for the parameters of the problem that the reliable estimation of is possible. For this purpose, we derive the fundamental limits in two scenarios. First we study the problem in the case that we want to estimate with zero-error-rate, which will be defined formally later. Second, we study the problem of approximating , where a positive error-rate of is acceptable. In the following definitions, first we define the achievable rates and then we define the capacity region of the problem in two cases.
The rate of an algorithm is defined as , where is the binary entropy function.
For any positive , a positive real is said to be achievable, iff there is a sequence of algorithms 555 Assume that and are strictly increasing functions of . with rate , where 666 By , we mean as . , such that as
A positive real is said to be achievable, iff for any positive is achievable.
Now we are ready to define the capacity region of the problem.
The zero-error-rate capacity is defined as the supremum of all achievable rates and is denoted by Also for any positive , the capacity is defined as the supremum of all achievable rates and is denoted by .
Note that the capacity is the inverse of the minimum number of required sampled data, normalized by , such that the reliable estimation of is asymptotically possible. We notice that in the definition of the zero-error-rate capacity, any positive error-rate must be removed asymptotically, while in the capacity, an error-rate of at most is acceptable. Also the assumption means that the size of the given dataset is very greater than the number of causal factors. The reliable learning is impossible, when the size of the dataset is in the same order with the number of causal factors.
Due to the definitions of the capacities, we have
for any real positive numbers , such that .
3 Main Results
In this section, we state the main results of this paper. In the following theorem, we characterize the capacity
The zero-error-rate capacity is
where is the binary entropy function.
Theorem 1 characterizes the capacity of the zero-error-rate estimation of the causal subsequence. This shows that we have a threshold effect at , in the asymptotic regimes. Note that the capacity is strictly positive because .
For the achievability, we examine all of the subsequences of length of genome and choose the one for which two binary vectors and are jointly typical for some . Note that unlike the channel coding, there is no codebook in this setup and the sequences are produced by nature. This changes drastically the proof techniques. However, we prove that the probability of error in the proposed scheme tends to zero, using some approximation methods ignoring the dependency among some events and bounding the effect of this assumption.
For the converse, we cannot directly use Fano’s inequality due to the definition of the error. Instead, we develop some inequalities similar to it. The need for this new bound is due to the fact that in our case, if we have an approximation of the causal subsequence with at an error rate of at most , then we can not determine it exactly.
We are also interested in characterizing the minimum number of required samples to find an approximation for the causal subsequence, rather than zero-error-rate learning of it. In the following theorem, we state the result for the error-rate capacity.
There is a positive , such that for any we have
It may be surprising that the capacity is the same as the zero-error-rate capacity. This shows that there is no difference between the approximation of the causal subsequence and the zero-error-rate learning of it in the asymptotic regimes, from the perspective of the required sample complexity.
The proofs of the main results of this paper can be found in the next two sections.
4 Proof of Theorem 1
Let be a positive real number. We aim to prove that is achievable. In particular, for any positive , we want to show that is achievable. For this purpose, we introduce an algorithm achieving this rate. The algorithm is a jointly typical decoder. First let us have some definitions.
For any positive , let denotes the set of all jointly typical binary sequences of length , with respect to the pmf , i.e.,
In other words, is the set containing all of the pairs of binary sequences of length with empirical entropies close to the true entropies with respect to the pmf .
The proposed algorithm is as follows.
For a given dataset , the algorithm chooses with the following property: there is a function , such that the binary vectors and are jointly typical, i.e., . If there are more than one such with this property, the algorithm chooses one of them at random. If there is no such , the algorithm chooses an element of randomly as the output. We denote the proposed algorithm by
4.1.2 Analysis of the algorithm
For any fixed positive , we aim to prove that any is achievable using the proposed algorithm. In particular, we are interested to show that as . In other words, for any , we want to prove that the probability tends to zero in the asymptotic regimes.
Fix an arbitrary . Consider two events and as follows. is the event that the causal subsequence has not the acceptance properties of the algorithm, i.e.,
Also is the event that there is a such that has the acceptance properties of the algorithm and . To be more precise, let us define
Note that for the proof of the achievability of , it just suffices to prove that . Using the union of events bound, and thus it suffices to show that and vanish in the asymptotic regimes. We note that
vanishes in the asymptotic regimes, using the law of large numbers. Hence for completing the proof, it suffices to show thatvanishes asymptotically.
First let us state a lemma.
[15, Theorem 3.5] (Sauer’s lemma) For any positive integers and any ,
where is the VC dimension of the class of functions .
For any positive integers and any , we have
Corollary 1 has an important role in our proofs. It establishes a connection between the parameter (which is related to the VC dimension of the class of functions considered), and the number of possible observable patterns as the output of
instances of a function. This bound is very essential for the asymptotic analysis of the algorithmby the union bound. Note that the condition of Corollary 1 is satisfied asymptotically, since we have assumed that .
For any positive and any , the following proposition holds with probability .
For any and any function , such that , the probability that has the acceptance conditions in the proposed algorithm using the function (the probability of the event ) is upper bounded by .
See appendix A. ∎
Using Corollary 1, for the analysis of the algorithm , it just suffices to check at most functions. Let us denote the event in the statement of Theorem 3, which holds with probability , by . Now we write
We note that since . Hence vanishes asymptotically, if
This shows that by choosing small enough , any is achievable. This holds for any positive and thus completes the proof.
In this section, we prove the converse part of Theorem 1. First we need a lemma.
For any positive , let be an achievable rate. Then, we have
where is a function that is defined as
See appendix B. ∎
It can be shown that for any This follows from the following lemma.
For any we have
where is the binary entropy function.
See appendix C. ∎
Using Lemma 3, we conclude that
which shows that .
Now consider an achievable rate . This implies that for any positive , is achievable. Using Lemma 2, we conclude that for any positive Hence,
and this completes the proof.
5 Proof of Theorem 2
Let that be an achievable rate. We prove that First we note that using Lemma 2, . The function is defined as and it is proved that Now consider a sequence of algorithms , with vanishing probability of error, i.e., . We define a complementary procedure as follows. Assume that the output of the algorithm to a given dataset is . Let be a ball with radius in around , with respect to the distance measure . More precisely, we define
Apply the proposed algorithm in the achievability proof of Theorem 1 to find some as the estimation of . More precisely, find the one such that
for some . If there is not such , choose one of the elements of randomly. If there is more than one such element, choose one of them randomly.
We prove that the output of this complementary algorithm is a zero-error-rate estimation of , with high probability. In particular, for any , we show that the output of the complementary procedure has at most error-rate, with high probability in the asymptotic regimes. More precisely, we prove that for any positive .
Fix a positive . Define two events and similar to the achievability proof of Theorem 1. More precisely, we define
In addition, we define
Note that two events and are defined with respect to the parameter . Also by the assumption, , with high probability. In other words, . By the law of large numbers, . Using the union of events bound, we conclude that vanishes asymptotically.
Now observe that
where (a) follows from the definition of the function . Now since , we conclude that vanishes asymptotically 888 The multiplicative factor does not make any problem, noting that asymptotically. This is due to the fact that . , if
for some positive . Let us assume . Using Lemma 2, we have
for small enough Hence we conclude that if is achievable, then it is achievable for any . This means that is achievable. Hence, using Theorem 1, we conclude that and this completes the proof. Note that based on the fact that , there is a positive such that for any , we have . We are done.
Numerical calculation shows that works for the converse of Theorem 2. However, we do not claim that this is the optimum threshold.
In this paper a biological data analysis problem known as Genome-Wide Association Study (GWAS) has been studied from an information theoretic view. In the proposed probabilistic model, we fully characterized the fundamental limits of learning the causal subsequence. It is shown that two problems of the error-rate and the zero-error-rate estimation of the causal subsequence are equivalent. For the future of this work, some extensions of the problem can be considered. The first one is to assume that the environmental effects are not like additive noises in labeling and to explore the capacity in a more general framework. Another extension is to consider the GWAS problem in the case that the labels or phenotypes are not binary-valued. Also modeling the genome sequence by a more realistic probabilistic model is of the future of this work.
-  B. Tahmasebi, M. A. Maddah-Ali, and A. S. Motahari, “Genome-wide association studies: Information theoretic limits of reliable learning,” Information Theory Proceedings (ISIT), 2018 IEEE International Symposium on, pp. 2231–22235, Jun. 2018.
-  A. S. Motahari, G. Bresler, and D. N. C. Tse, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, pp. 6273 – 6289, Oct. 2013.
-  A. Motahari, K. Ramchandran, D. Tse, and N. Ma, “Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads,” Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pp. 1640–1644, 2013.
-  S. Mohajer, A. Motahari, and D. Tse, “Reference-based DNA shotgun sequencing: Information theoretic limits,” Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pp. 1635–1639, 2013.
-  I. Shomorony, T. A. Courtade, and D. Tse, “Fundamental limits of genome assembly under an adversarial erasure model,” IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 2, pp. 199–208, Dec. 2016.
-  J. N. Hirschhorn and M. J. Daly, “Genome-wide association studies for common diseases and complex traits,” Nature Reviews Genetics, vol. 6, no. 2, pp. 95–108, 2005.
-  N. Risch and K. Merikangas, “The future of genetic studies of complex human diseases,” Science, vol. 273, no. 5281, pp. 1516–1517, 1996.
-  J. C. Barrett, D. G. Clayton, P. C. et al. , “Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes,” Nature genetics, vol. 41, no. 6, pp. 703–707, 2009.
-  K. Hara, H. Fujita, T. A. Johnson et al., “Genome-wide association study identifies three novel loci for type 2 diabetes,” Human Molecular Genetics, vol. 23, no. 1, pp. 239–246, 2014.
-  D. F. Easton and R. A. Eeles, “Genome-wide association studies in cancer,” Human Molecular Genetics, vol. 17, pp. R109–R115, 2009.
-  D. F. Easton, K. A. Pooley, A. M. D. et al., “Genome-wide association study identifies novel breast cancer susceptibility loci,” Nature, vol. 447, no. 7148, pp. 1087–1093, 2007.
-  G. Thomas, K. B. Jacobs, M. Y. et al., “Multiple loci identified in a genome-wide association study of prostate cancer,” Nature Genetics, vol. 40, no. 3, pp. 310–315, 2008.
-  A. Najafi, S. Janghorbani, S. A. Motahari, and E. Fatemizadeh, “Statistical association mapping of population-structured genetic data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, Dec. 2017.
-  T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: John Wiley and Sons, 2006.
-  M. Mohri, A. Rostamizadeh, and A. Talwalker, Foundations of Machine Learning, The MIT Press, 2012.
-  F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, pp. 4394–4412, Oct. 2006.
Appendix A Proof of Theorem 3
To prove Theorem 3, we need some preliminaries which are available in the following two subsections.
In this subsection, we first review some definitions about the divergence measures on probability distributions, as well as their main properties.
Let be a convex function, such that and is strictly convex at . Then, the divergence of any (finite) probability measures and on a finite set is defined as
We notice that the divergences satisfy the data processing inequality.
 (Data processing inequality for divergences). For any (finite) probability measures and any channel , the following inequality holds.
Note that the function satisfies the required conditions in Definition 7. It can be shown that in this case, the divergence reduces to the total variation distance of two probability measures.
In the following definition, we define the information between two arbitrary (discrete) random variables.
For any (discrete) random variables and , we define
Specifically, for the case of , we can write
Consider three random variables , such that forms a Markov chain.
Then, for each
forms a Markov chain. Then, for eachdivergence we have
First we define the following channel
where is the identity channel, i.e., with probability one. Note that
which completes the proof. Note that (a) follows from Theorem 4 and (b) follows from the fact that , which is due to the Markov property. ∎
Consider random variables , such that forms a Markov chain. Then, we have
Consider and use Lemma 4 twice. ∎
We need a result on the equivalency of and norms which is stated in the following lemma.