There is a conceptual problem in describing the relationship between the genome and the phenotype. Consider some quantifiable trait of an organism, such as height, weight, antibiotic resistance, spherical equivalent of the eye, intelligence, or some other form of “fitness.” It is natural to seek to find the set of genes responsible for that trait. Genome-wide association studies (GWAS) can detect places in the chromosome with a strong statistical association with some trait. Alleles which influence the trait in question are located at such places, although the precision of GWAS may be insufficient to pinpoint the exact location responsible for the trait variability.
However, even if the allele that correlates with a trait could be pinpointed, one cannot quantify the full effect of each gene on the trait in question, because the effect of one gene may be modified by the presence of other genes at other loci. A naïve approach would attempt to find a spectrum of loci which modify the trait, attach a number measuring the effect of swapping alleles at each of these loci (the “allele effect size” Park18026
) and then predicting the resultant trait value by simply adding together the individual effects over the spectrum of loci. This approach predicts that the values of a trait over a population with independently distributed alleles will converge to a Gaussian distribution (pageof NaturalInheritance_Galton1889 ). This theory underlies biometric analysis of “heritability,” which assumes that trait values are the convolution of genetic and environmental influences. Indeed, the typical distribution of biological traits appear to be bell-shaped curves, although fat tails are not ruled out.
But it is also quite clear that the naïve analysis above is incorrect. Even in the simplest cases of two homologous loci on paired chromosomes, examples of dominant and recessive gene interactions show that the effects of switching alleles at separate loci are not strictly additive. A trait is not a linear combination of individual gene effects — it is a function of the entire genome. To conclusively understand the effect of the genes, one would need a table assigning the value of the trait to every possible combination of the relevant genes at many locations. But such a table would be enormous. If traits are functions of the entire genome we face what has been called the “scale problem” QuantAnalEmpFitnessLandscapes_Szendro2013 . The number of possible genomes grows exponentially with the number of gene loci. For example, if we find loci (each with alternate alleles) associated with a certain trait, then there are genomes. To pick a relatively small case, if there will be approximately potentially different genomes. It would be flatly impossible to measure the trait for each such genome and hence, the function from genome to trait cannot be measured in any practical sense.
This paper explores a third possibility. There may be many traits which are neither strictly additive nor as chaotic as the theoretical nightmare above where each genome has its own arbitrary trait value. It is possible to find a way to interpolate between these two possibilities. Namely, we expand the spectrum of gene effects from the individual loci to associate an interaction effect to every possible set, orcluster of loci (e.g., as singletons, pairs, triads, etc.). Given this many interactions, any trait can be broken into a spectrum of interactions but there are now possibilities. Allele effect sizes are included in this spectrum, which is enlarged to account for all possible interactions. But it is quite plausible that the important interactions are a very sparse subset — needles in the haystack of possible interactions — and these significant interactions can be reliably discovered.
Techniques from the field of compressive sensing, avoid the scale problem arising when we associate a trait with an entire genome, yet also avoid the unquantifiability of a direct map from gene to trait. The goal is to discover the Fourier transform of a trait, which is the gene network (see Section 3.2). The quantitative effect of the genome is broken into Fourier components on a Boolean lattice. Each Fourier coefficient measures the strength of the interaction of a certain subset of loci. The “level” of a Fourier component indicates the number of gene loci in a given cluster (see Section 3.3). It will turn out that the level- component, corresponding to the null set, is the average value of the trait, and the level- components, which correspond to a single locus, are simply the allele effect size, the change in the trait average from swapping allele ’’ to allele ’’ at the locus in question. Similarly, for two homologous loci, the corresponding level- Fourier component encodes their interaction, with the sign expressing whether this is a dominant or a recessive interaction. In this context, the components associated with relatively few genes take the place of “low-frequency” terms in standard Fourier analysis, while “high-frequency” terms represent interactions involving many genes. Formally, the mathematics used is a well-understood generalization of standard Fourier series or the Fourier transform, using a finite abelian group as the underlying basis, rather than the circle or line.111We recently became aware of new work on detecting gene interactions using Fourier analysis on non-abelian groups DetectGenomicSpectralAnalysis_Uminsky2019 . However, it does not address our main theme of smooth/modular traits nor utilize compressive sensing theory.
We can view the Fourier transform of a trait as providing information about the network of gene interactions in the following way. Suppose and are loci and that the Fourier coefficients and indicate the direct effects of different alleles at and , respectively. Then a nonzero Fourier coefficient reveals that, in addition to the direct effects, there is some cooperation or interference between the alleles at loci and . For example, if locus produces a ligand and locus produces a receptor, and if the trait is conditional on ligand binding to receptor, then the trait emerges only when both and are effective. This interaction of and will be reflected by the Fourier coefficient having a nonzero value. Gene interactions can be visualized as a simplicial complex with the loci as points, and large interaction terms as edges. More complex interactions can be pictured as geometric simplices.
There are two potential benefits to exploring the Fourier transform of a trait.
In the Fourier domain, entries provide information about the gene network, such as the interactions between the metabolic activities of different genes. Large gene interaction coefficients may provide clues as to which genes are interacting intensely as well which are occupied with independent tasks. Thus, observations of trait matched to genome may contribute to an understanding of gene function.
In the trait space, we have precise information of the trait for relatively few specific genomes, and no information for the rest. An accurate enough reconstruction of the larger gene interaction Fourier coefficients will allow a low-error prediction of the full trait via an inverse Fourier transform. This would provide insight to trait phenotype for novel, previously unexamined organisms.
1.1 Smooth traits
There is a class of traits which have sufficient additivity of gene effects to permit Fourier analysis to isolate them. Provisionally, let us call such traits “smooth.” Similar to functions dominated by low frequencies in classical Fourier analysis, a smooth trait is one where, on average, the Fourier coefficients of significant size are concentrated in the lower levels
. Important quantitative traits may turn out be smooth. In particular, we give heuristic arguments that traits with a pronounced modular structure will be smooth. This will ultimately be an empirical question. Our present purpose is to show how to circumvent the scale problem for such traits. The low-level concentration phenomenon implies sparsity in the Fourier domain, which allows the use of ideas from compressive sensing to estimate the genome-to-trait function from relatively few observations.
1.2 Requirements of compressive sensing
Two prerequisites must be met if we are to implement a compressive sensing scheme: (i) a sparse (or compressible) representation of the data of interest, and (ii) a sensing modality that is “incoherent” with respect to the sparsifying basis. The first condition is satisfied based on the assumed model of the traits we are interested in. The second condition is conveniently fulfilled in our model since the rows of the Fourier matrix are known to be incoherent relative to the standard basis. Ultimately, this is connected to an uncertainty principle, which dictates that localization in one domain implies its dual is “spread out.” This provides a rule of thumb central to the philosophy of the compressive sensing method — by subsampling in the spread out domain (as opposed to the sparse domain) we are essentially guaranteed to gather nontrivial measurements. As the Fourier transform is a global operator, each of these measurements yield some information about the sparse domain of interest. A review of compressive sensing can be found in Section 4.1.
1.3 Previous work in compressive sensing and genetics
It is compelling to note that compressive sensing has a connection with biological science through the study of combinatorial group testing, which has its roots in designing (the minimum number of) trials to screen for venereal disease in soldiers during World War II Gilbert08grouptesting . Indeed, many previous applications of compressive sensing to genetics have also used group testing in their modi operandi ProbeDesign_CSDNAMicroarrays_Shikh2008 ; RareAlleleCarriers_CS_Zuk2010 ; CompressedGenotyping_Erlich2010 ; BioScreensLinearCodes_Erlich2015 . Other studies have taken a more traditional compressive sensing approach RecoveringSparseSignals_CompressedDNAMicroarrays_Hassibi2008 ; ApplyingCStoGWAS_Chow2014 . By and large what these investigations have in common is the assumption that DNA sequences of interest can be represented as a sparse signal, e.g., because there are relatively few gene markers participating (either directly or differentially), or because certain genetic diseases are rare, or due to the fact that only a few agents in a pool are present or are defective. Most of the work is then devoted to: (i) designing the screening method, minimizing the observations (e.g., the number of sequencing probe spots, the number of pools, or just the sheer number of subjects), and the mathematics of the associated sensing matrix; and (ii) the resulting challenges in recovery of the sparse signal.
Our work has little in common with these earlier approaches — rather than a DNA sequence, we assume it is a trait’s Fourier transform that is the underlying sparse object of interest. Although previous theoreticians have suggested the application of Fourier analysis to trait “fitness” EmpFitnessLandscapesPredEvol_deVisser_Krug2014 , to the best of our knowledge, the study presented here is the first to provide the theory on when and why a trait’s Fourier transform may be compressible, and to also connect it with ideas from compressive sensing. Guidance can be gleaned from previous work on reconstructing a sparse time domain signal from incomplete frequency information (e.g., RobustUncPrinc_CanRomTao2006
), except we treat traits as real-valued functions on the Boolean hypercube of possible genomes, and examine the dual scenario with assumed sparsity in the frequency domain dominated by low-level components, accompanied by incomplete, random observations in the trait space. We also deal with vector sizes that are exponentially larger than typically reported in the compressive sensing literature.
2 Modular Traits
In our model of gene interaction, a trait is a real-valued function on a set of genes, a vector in a -dimensional space. These numbers can, in principle, be empirically determined by measuring the trait for a selection of individuals of each genome. But in practice, if is large (let us say, an order of ten or more), the task exceeds the number of organisms. It would seem that converting these numbers to their Fourier transform coefficients does not obviate this scale problem, but there are certain traits where we might expect the Fourier coefficients to be small, or even vanish for high level.
Consider what we call “gauntlet processes,” multi-stage survival tests. Imagine a sequence of filters, , and denote
as the probability of surviving past filter. Then the overall survival rate is the product
The paradigm for such processes relates to the original idea behind Darwinian fitness. An organism is tested by a series of barriers to survival and to successful reproduction. Each barrier is associated with a certain probability of successful passage. These probabilities are, to good approximation, independent and overall survival (or reproduction) requires success with every barrier. However, the product of probabilities can lead to undesired interaction terms. Instead, we want a measure for “fitness” which is additive. This can be achieved by defining “survivability” as the logarithm of the probability of surviving a gauntlet of challenges. Hence, (1) becomes
It has been noted in prior studies of antibiotic resistance in bacteria EmpFitnessLandscapesPredEvol_deVisser_Krug2014 that the logarithm of survival is the appropriate version of a fitness trait, and not raw survival percentages.
2.1 Modularity, low-level concentration, and sparsity
The gauntlet processes above may be generalized. Filters need not be organized sequentially: it is sufficient that a trait be organized into functional components, with interactions concentrated within each component. Biological functions are typically organized into modules, each with an associated suite of genes MolecularModularCellBio_Hartwell1999 , breaking a task into an array of subtasks. In turn, subtasks may themselves be composite, leading to a hierarchical structure. This organizing principle facilitates evolution, since a submodule may be modified without inducing global complications. This explains why those exceptional proteins which interact with many other proteins are very stable throughout evolution PleiotropyPreservationPerfection_Waxman1998 . Thus, modularity largely confines gene interactions to those within a module, i.e., locally. This motivates our governing hypothesis: due to modularity, many gene networks are, or can be approximated as sparse, with the vast majority of large coefficients concentrated into the lower levels222“Levels” are introduced in Section 3.3 and formally defined in (13). In words, coefficients in the th level of a network measure the strength of interaction within clusters consisting of precisely loci. For instance, for , level deals with interactions for all possible pairs of loci.. If compressive sensing detects clustering of interactions, it can help identify the module structure of genome interactions ClassNotes_Lect15_ClusterModule_Clauset2011 .
2.1.1 Concrete examples
The notion of modules may be made less abstract with a few concrete examples. It has long been conjectured that human intelligence is due to distinct functional components, or “factors” in Thomson’s terminology FactorialAnalysis_Thomson1939 . Further, GWAS have identified gene locations significantly associated with the trait of intelligence GWAS_Intelligence_Sniekers2017 . A typical IQ test involves counting the percentage of correct answers to a collection of questions, in other words, estimating a probability of success. It is plausible then that the task of “problem solving” is composed of genetically distinct modules related, e.g., to correct comprehension, sufficient memory, accurate calculation, etc.
The analysis of myopia genes may also follow this paradigm. Here, we are concerned with survival of useful visual information, which must endure the successive effects of the cornea, of the lens, and then the blur due to excessive axial length of the eye. If the genes causing steep cornea, dense lens, and elongated globe are distinct, then their effects should be roughly additive, when quantified by diopter. Recent work GeneExpressionResponseOpticalDefocus_Tkatchenko2018 is beginning to identify functional modules in the genetics of myopia.
Gene loci often code for enzymes that establish a metabolic network with multiple functions. Further, these networks often resemble a logic network, although there is no exact correspondence. Under suitable restrictions of depth and size, the Fourier components of such logic networks have power spectra concentrated at low levels ConstantDepthCircuits_Linial:1993 . Metabolic networks achieve a desired state via a wide variety of genetically controlled transitions, where genes correspond to edges that permit a transition from one state to the next both in series and in parallel (see Figure 1). Such a structure is sufficiently general as it can represent many possible processes: such as catalysis of a chemical reaction, transport across a membrane, activation of a receptor, etc.
2.1.2 Very small deterministic example
Figure 1 shows a small, but nontrivial network; its gene loci are arranged as two parallel branches with two nodes in each branch. Let us use this circuit for a numerical example to demonstrate how large coefficients in the gene network are restricted to the lower levels. Suppose the parallel branches are combined so as to emulate a logical OR gate. Thus either the left or right branch can achieve the “goal” of completing some subtask, or perhaps, having or not having some dominant trait. Example 2.1.2 on page 2.1.2 explains the mathematical detail of a trait based on this Boolean circuit and contains its resulting truth table showing the genome-to-trait relationship: how the th trait value is a function of genome . From (8) and (11), the associated gene network is , shown in the right-hand table (note, the order of has been permuted so that its indices/coefficients are grouped into their respective levels). Observe that seven rows () are emboldened, highlighting the fact that the large-magnitude coefficients occupy the lower levels, .
The Fourier relationship between a full trait and its gene network .
Suppose we are interested in the genes connected as seen in Figure 1 with the left and right branches combined as an “OR” gate. There are possible genomes: these codes and their decimal equivalents are shown in the Boolean lattice of Figure 2. An organism, due to its genetic code, either DOES or DOES NOT “have the trait,” which we represent with a ‘’ or ‘’, respectively. Hence the trait value for the organism with genome is
Suppose we have access to the organisms with the unique genomes. We measure each individual’s trait value and store it in the vector as the th entry (see the table below on the left). Given the full trait , we can take its Fourier transform (11) to obtain the associated gene network , which is defined on its own Boolean lattice, again shown in Figure 2. Let us permute the order of so that its coefficients are grouped into levels . Recall, corresponds to the number of participating loci identified by the ‘’s in the binary code of index . The table below on the right shows this level-ordered gene network. Rows are highlighted since they contain the larger magnitude — notice they occur in the lower levels: .
We remark that the distinction between “large” and “small” coefficients of is slight in this case (i.e., the magnitudes and versus in the table on page 2.1.2). However, for larger and more complicated networks, significantly greater dynamic ranges will occur, which means the gene networks will be compressible and thus well-approximated by a -sparse representation. In this sense, we can interpret the coefficients here as “insignificant.” Hence, vector can be loosely characterized as “-sparse” since it has relatively “large” coefficients.
Let us analyze each level of the gene network in further depth:
For , in level the index corresponds to the all-ones row of . As a result, is the sum of all of the trait values, thus the average trait value is .
For , the indices in level represent singletons, or individual genes. Notice the corresponding binary codes for each have precisely one flag, indicating the individual locus being evaluated. Here the for all in are relatively large. This makes sense because each gene locus independently has a strong direct effect on the trait.
For , indices in level refer to pairs of genes. However, only and are relatively large. The binary code for has flags at loci , which are in series on the left branch of Figure 1 — we would expect this pair of loci to have a strong interaction — and similarly for index , which represents the interaction between loci in the right branch.
However, the other indices of level correspond to that are relatively small. For example, has flags at loci . Figure 1 affirms that these loci are on different branches and therefore do not strongly interact.
For , the indices in the highest levels and correspond solely to relatively small coefficients. For example, the binary code for has flags at loci . Coefficient is small because these three loci do not strongly interact as a triad to affect the trait. Similarly for , the cluster of all four loci do not strongly interact as a quad.
In summary, even this very small example demonstrates the key property that we conjecture: the larger coefficients of the gene network are confined to the lower levels (in this case, ) as opposed to the higher levels (). As mentioned previously, for larger and more complicated networks, the low-level concentration effect will be much more pronounced.
2.1.3 General effect from two modules
Consider an arbitrary quantitative trait governed by genes, illustrated below. As previously mentioned, there are possible combinations in which the loci can interact. Now suppose the trait is composed of two subtasks, with genes in Module dedicated to the first subtask and genes in Module to the second subtask. Let us assume only local interactions, where the loci of Module do not communicate with those of Module .
The mere act of partitioning the genes into two modules with no cross-interactions naturally leads to a (very) sparse gene network. To see this, let us estimate the density of significant coefficients at level . The problem is isomorphic to that of calculating the probability for stones of the same color to be drawn at random from an urn with white stones and
black stones. For the moment, assumeis larger than . Then, as increases, the white stone entries will asymptotically dominate the white-to-black ratio of successes. In fact, once exceeds , all successes are white and once exceeds , there will be no way of selecting a monocolored -set. Thus there is a cutoff level .
The calculation is quite transparent if we are returning each stone to the urn after selection. There is a probability of selecting a white stone and a corresponding so that is the density of all-white subsets, is the density of all-black subsets. Then by analogy, bounds the density of significant coefficients in level from Modules and . Notice that this argument slightly overestimates the probability of drawing stones of the same color, because it counts some cases where a stone is replaced. In turn, this can only overestimate the density of significant coefficients. Yet, it can be shown with a bit more effort that the case of choosing stones without replacement yields a similar result.
The important conclusion is that modular traits have two properties which immediately imply sparsity of significant coefficients:
A cutoff level beyond which there are essentially no significant coefficients. It follows that provides a convenient way of delineating the “low” and “high” levels.
An exponentially decaying density with the vast majority of its mass concentrated in levels , where is small relative to .
The above analysis examined sparsity density for each level . Another approach is to first assess the overall sparsity and then compare it to the size of the ambient space. Denote as the total number of significant interactions from all levels. The loci of Module have at most interactions, and the loci of Module have at most , and so
There exists such that , therefore the sparsity ratio of relative to the number of possible interactions for the whole network is
which decays exponentially with . The effect of exponential drop off becomes even more pronounced as soon as there are three or more subtasks, or if the subtasks themselves consist of finer subtasks. Thus for large enough , it is clear that a trait with modules limited to local interactions can give rise to , i.e., very strong sparsity in the gene network. Section 6.1 extends this analysis to modules, also including other scenarios, such as when the sparsity is polynomial in the number of genes .
2.1.4 Simple probabilistic model
Next, we extend the level-wise density estimate developed in the previous example to an arbitrary trait affected by genes that is partitioned into modules. For , let the th module have loci and probability ratio , with . Temporarily assume the loci are evenly distributed across all modules so that each and . Then the density of -loci clusters in any module is just . Summing over all modules yields
If we now account for a nonuniform distribution of loci per module, then the density at level is simply . However, asymptotically one of the will dominate the summation, which leads to a more general form of the density, or probability that a -loci cluster can significantly interact:
where are not necessarily integers such that . This form can also represent the merging or melding of different networks. In analogy with the denominator of (5), we observe the parameter informally represents the “effective number of modules.” Although (6) is simplistic and not likely to be exact in any real biological system, it captures the essence of the problem at hand — that modularity strongly favors interactions between fewer genes rather than many.
The number of possible -loci clusters that can be drawn from loci is (see (14)). Thus for each level , multiplying by yields , the expected number of -loci clusters that can have significant interaction energy:
where denotes the integer part of . This can also be interpreted as the expected sparsity of level . Note, it is always the case that because there is only one element in level .
The effect of (7) on a gene network is a natural concentration of significant coefficients into its lower levels. This is because the polynomial growth of cannot “outrun” the rate of exponential decay of . Therefore when modeled by (7), modular traits are necessarily smooth, as defined in Section 1.1. For example, suppose some quantitative trait is governed by genes with and in (6). The columns of Table 1 shows , , and as a function of index for the first levels. Observe the faster decay rate of versus the rise of . The resulting low-level concentration of the trait’s gene network is clearly evident in the plot on the right showing sparsity for all levels . Note, the cutoff level is , and the total expected sparsity of the lower levels is elements.
While modular traits may not exactly obey strict local interactions nor the asymptotic analysis above, this ideal scenario illuminates how the mechanics of modules naturally lead to low-level gene interactions. Subsequently, we can loosen the assumptions of this model to also allow for high-level interactions. In the simulations of Section5 we do this by employing (7) in addition to permitting a limited number of random high-level coefficients.
To conclude this section, many quantitative biological traits may appear to be simple in nature, especially when they are measured on a linear scale, but they actually represent the result of error-free activities in multiple successive processes. From the arguments above, it is reasonable to look at traits in the Fourier domain, expecting that the bulk of the information about gene interactions to be concentrated in components of low level. Theoretical and practical approaches may diverge here.
We can pursue mathematical analysis of precise, but sometimes unrealistic scenarios, so as to understand the overall landscape of fitness functions and their role in evolution. The theory of evolution in its modern synthesis requires a more robust mathematical theory of percolation in a Boolean lattice, endowed with a fitness function. We advocate a broad view that the Fourier transform of the fitness is an appropriate way to classify fitness landscapes and that large percolating subsets are associated with smooth transforms.
Practical. Assume we have identified a candidate trait which can be expressed as a real number and which we expect to be smooth. It is impossible to measure the overall impact of these genes over every possible combination. However, techniques of compressive sensing could begin to reconstruct values of the trait, with error estimates from incomplete and inaccurate data, and model mismatch (discussed in more detail in Section 4.1).
3 The Mathematical Formalism
The use of Fourier techniques on functions on the Boolean lattice is a solidly established field and fully explained in books by O’Donnell AnalysisBooleanFncs_ODonnell2014 and Garban and Steif NoiseSensBoolFncsPerc_Garban2014 . As a consequence, there are a variety of notations and viewpoints. We briefly review the matter to establish notation and orient the reader. Throughout, vectors and matrices are represented with boldface letters, while their elements are non-boldface; thus the th component of a vector is (all indices commence at zero). The transposition of a matrix is denoted . For , the -norm of is defined as . For index in the set , let be its canonical -bit binary representation, with and as the least- and most-significant bits, respectively (note, the are non-italicized). The Gaussian distribution with mean
and varianceis denoted as , and the -th binomial coefficient is .
Suppose we are interested in how particular genes of an organism affect a given quantitative trait. Arbitrarily assign the genes to loci. We will use a simplified model where the gene at each locus has just two possible alleles (though the model can be extended to accommodate factors that are non-binary). Hence, a genome can be represented by the -bit word (note the reverse order of subscripts), where indicates a certain allele at locus , and indicates that an alternate allele has been swapped at that locus. This is an elementary mathematical model, but it is sufficient to capture the essential nature of the scale problem, and is standard in the literature EvolutionaryDynamics_Nowak2006 . Position in the binary word does not correspond to position on the chromosome, and different positions in the word may encode alleles from different chromosomes, or even alleles from mitochondrial alleles. The model only focuses on a certain subset of the organism’s genes in order to quantify their interactions, the rest are treated as a fixed background. Similarly, features of the environment that may influence a trait are considered fixed.333In our model of a genome, not all loci necessarily must be genes. If desired, an obvious extension is to let some loci represent other binary factors, e.g., influences of the natural or man-made world. Although simple, this setup generalizes the usual discussion of dominant and recessive genes, which focuses on only two homologous loci.
3.1 The trait’s Boolean lattice
The -bit words that represent distinct genomes in a given population can be represented by the vertices of an -dimensional Boolean lattice (or hypercube). For example, Figure 2 shows a small hypercube with dimensions and its vertices, labeled in decimal and binary. Throughout, we explicitly reserve the index ‘’ to refer to the “trait side.” That is, suppose the binary word encodes the specific allele combination for specific genes of some organism. Then we can either use this binary code or its decimal-equivalent index to identify that individual, which then permits representing its trait value as (if there are multiple organisms with the same genome, then represents a mean of their trait values). Hence, the full trait is defined on the -dimensional Boolean lattice.
The group acts on -length binary strings in the obvious way (i.e., component-wise addition modulo ). The group elements may be taken as the basis for a real vector space of dimension , and a trait is defined as an element of this vector space, assigning a real value to every possible genome. But the space is also a group algebra with a multiplication corresponding to convolution and also a different operation of pointwise multiplication. A character of the group is a homomorphism to the reals. Because is abelian, the set of characters is itself a different group under pointwise multiplication: another isomorphic copy of . The characters provide an alternate basis for the vector space of functions on .
3.2 The Hadamard matrix and Fourier transform
The character table for our groups and is the Sylvester-type Hadamard matrix444These matrices were introduced as “tessellated pavements” by J. J. Sylvester, who commends their versatility, “furnishing interesting food for thought, or a substitute for the want of it, alike to the analyst at his desk and the fine lady in her boudoir” ThoughtsInverseOrthogMatrices_Sylvester1867 . Note, power-of-two Hadamard matrices are often referred to en masse as “Walsh-Hadamard matrices,” although formally, this connotes a different ordering of the rows and columns from the Sylvester-type defined in (8). of order , defined for by
where . Row and column indices should be labeled to , or in their binary equivalents. The Sylvester-Hadamard matrix changes basis from to , and back. By definition, Hadamard matrices are orthogonal, obeying
is the identity matrix of order. Further, it is well known that Sylvester-type matrices are symmetric: , hence
There is a duality between the and representation of functions, where pointwise multiplication on one domain corresponds to convolution in the other. Analogues of Plancherel’s and Parsevel’s formulae hold, with appropriately chosen scaling factors owing to (9). More details on Hadamard matrices and their use in group theory can be found in Horadam_HadamardMatrices .
Formally, we define a gene network to be the (forward) Fourier transform of a corresponding trait , each defined on its own Boolean lattice. However, to apply the character table to an -dimensional Boolean lattice, we must reshape it into a vector of length ordered in terms of its vertices’ indices. Hence the Fourier coefficients are related to a given trait by the matrix-vector multiplication
Clearly, given a gene network , we can take its inverse-Fourier transform using (10) to find its full trait for the whole population of genomes:
The choice of the labels ‘’ or ‘’ for the generators at a particular locus is a matter of convenience. It is easy to show that flipping the bit of an arbitrary locus will only change the sign, and not the magnitude, of its associated network coefficients.
3.3 The gene network’s Boolean lattice and levels
We emphasize that the gene network is defined on its own Boolean lattice and reserve the index ‘’ to refer to its coefficients. We further point out that the binary codes of the indices contain crucial information on how the genes communicate. That is, for index , the location of the in its binary code are simply “flags” indicating which gene loci participate in a given cluster. The corresponding coefficient then represents the amount of interaction between the genes within this cluster. For instance, the flags in the binary code for index indicate participation of the two rightmost loci, and the value of coefficient tells us how strongly/weakly these two loci interact. Compared to classical discrete Fourier analysis, this is a different interpretation of an index in the frequency domain.
The indices of the gene network’s Boolean lattice are organized into so-called “levels,” where each level contains the binary codes with the same number of ‘’s. With the number of loci fixed and for , define the th level, ,555The levels are also a function of , so we should formally be discussing the “-th level, .” However, since is fixed, we opt for a slightly cleaner notation relying on the reader to recognize when to utilize its dependence on , e.g., in (14). Further, for convenience we sometimes refer to just as “level ,” and occasionally also refer to a coefficient as being “in” or “of” level , when in reality it is its index that is in . as the set of indices whose -bit code contains precisely ones:
As such, level contains indices with just flag in each of its binary codes: the location of the one in each code indicates which individual locus is being examined. Next, level contains indices that have flags in each of its binary codes: here, the locations of the two flags indicate different pairs of loci, and so on. It should be clear that the number of indices (or codes) in each level is “ choose ”:
Returning to the example of loci, the vertices of the (now different) Boolean lattice in Figure 2 should now be viewed as the indices of the corresponding gene network. Notice the right panel shows the -bit codes partitioned into their five respective levels , and that the cardinality of each level confirms . Example 2.1.2 on page 2.1.2 should help clarify how to interpret the network for the genes in Figure 1 (note, the order of has been permuted so that its indices/coefficients are grouped into their respective levels).
In summary: on the “trait side,” the vertices of the Boolean lattice (indexed by ) represent specific -bit genomes of a population — the coefficients are the trait values associated with these genetic codes. On the “gene network side,” the vertices of its Boolean lattice (indexed by ) represent specific clusters of of the gene loci organized into levels — the coefficients reflect the interaction strength of these clusters.
Figure 3 illustrates a simulated trait-gene network Fourier transform pair related by (11) and (12) due to genes. The top panel shows the full trait over all million genomes, and the bottom shows its associated gene network over all gene cluster combinations. This exemplifies the uncertainty principle mentioned in Section 1.2: the two representations contain the exact same genetic information, yet the “energy” of the trait is spread over its whole population, while the gene network’s is focused into just of its possible coefficients; the rest are identically zero. The plotted points of the full trait are too dense to see individual values, so refer to the middle panel to see a zoomed-in portion of the trait values for the first genomes; notice how the trait is spread out with mostly nonzero values. This justifies collecting trait values from relatively few random genomes in order to discover the gene network.
Although the mathematics of the two domains, the gene network space (11) and trait space (12), are formally symmetric, our knowledge about the two is not. We can physically measure the trait value of the th organism, but the scale problem means that we only have access to a very small subset of genomes. At the same time, because of the smoothness assumption, we have some statistical knowledge about the distribution of all the Fourier coefficients in the gene network. Further, from (12) each trait value measurement corresponding to a particular genome may be considered as sampling a weighted average of the network coefficients. That is, each encodes partial information of the full vector . This setup perfectly complements the model of compressive sensing.
4 Compressive Sensing
Important insights have been gained in the last decade into signal and information processing. In particular, there has been a shift away from the long-standing assumption of bandlimited real-world signals to one that assumes a model of a sparse representation SparseRedundReps_Elad2010 . This point of view has enabled many new possibilities of acquiring, storing, and processing of data that was not previously possible. Notable among these has been in the field of compressive sensing, which offers degrees of undersampling that did not exist before. Compressive sensing is a combined sampling-reconstruction modality that is appropriate whenever it is expensive, or even impossible, to acquire many observations of a signal of interest. In that case, and under the correct conditions, we can take relatively few samples and still be able to reconstruct a signal with high fidelity.
“Expensive” may be due to the cost of a sensor’s material or its size, or it may be related to the time required to acquire the data. For instance, CMOS megapixel sensors made of silicon currently cost less than one U.S. dollar, while those made of indium gallium arsenide (InGaAs) are on the order of $ Shortwave infrared compressive imaging systems exist with just one InGaAs detector instead of an array of pixels, which offers significant financial savings InView_SinglePixelCamera ; Rice_SinglePixelCameraCS . In terms of time, there are scenarios that require a very short duration of sensing. One of the original compressive sensing applications addressed magnetic resonance imaging (MRI) CS_MRI_Lustig2008 . An incident is described in WiredMag_Ellenberg2010 where relatively few MRI samples were sufficient to image the organs and save the life of a child who could not safely endure the high dose of anesthesia necessary for a full scan. The U.S. Federal Drug Administration has recently approved the use of compressive sensing in commercial MRI machines FDA_CS-MRI_GE ; FDA_CS-MRI_Siemens .
In the context of the genomic analysis explored in this study, we are interested in quantitative traits affected by genes. The combinatoric possibilities inform that it is essentially impossible to access and measure all individual organisms of a population once becomes large. Hence, compressive sensing may be an appropriate tool to permit measuring the trait values from relatively few genomes, yet still being able to analyze and quantify certain gene-to-trait functions that have been beyond realistic observable and computational means.
4.1 Compressive sensing overview
We now review the specifics of compressive sensing in more detail. Interested readers can find more information in the foundational and related papers, e.g., RobustUncPrinc_CanRomTao2006 ; CS_Donoho2006 ; CanRomTao_Noise ; NearOptSigRecovFromRandProj-UnivEncStrat_CandesTao2006 ; SparseReconConvexRelax_FourierGauss_Vershynin2006 ; RIPlessTheoryCS_CandesPlan2011 . A less formal introduction is available in different survey articles, such as CS_LectureNotes_Baraniuk2007 ; IntroCS_Wakin2008 .
Suppose we are interested in observing a real-valued -D discrete signal of length . Rather than traditional point sampling, suppose further that we acquire general linear measurements via a sensing/measurement matrix of size , with . The basic compressive sensing model is embodied by the observation/measurement vector
where is an unknown additive noise vector of length . In general, there is no hope to recover since this is an underdetermined systems of equations (there are fewer equations than unknowns). However, if at most of the possible entries of are nonzero (commonly referred to as a -sparse signal), and if matrix possesses a necessary “incoherence” property relative to the sparsifying basis, then with as few as
measurements (for some ) we can reconstruct , e.g., by solving the program666Alternative and equivalent formulations of (17) exist as well.
where is some assumed or known measure of the energy of the noise . In words, rather than the ambient dimension , (16) reveals that we only need to sample at a rate roughly commensurate with the underlying sparsity . This is then sufficient for (17) to find the best candidate whose image under coincides closely with , while also being of minimal -norm. The convex -norm constraint is used since it is known to promote sparsity. Nonconvex -quasi-norms are also possible with , but they tend to be less computationally tractable.
The beauty and power of compressive sensing occurs when is extremely sparse compared to the ambient dimension. In this case means we can severely undersample with due to (16).
Let be the solution to (17). A typical measure of the absolute error between the recovered and original signals is
where is the best -sparse approximation to and . In theory, constants and are not too large and can be calculated, however it is not always so easy to do so in practice. Clearly, there is no error when is precisely -sparse and there is no noise. However, (18) tells us that can still be an excellent approximation even if has more than nonzero elements and when the measurements are taken in the presence of moderate noise, both of which are encountered in real-world applications. It is well known, e.g., that vectors whose sorted coefficients decay according to a power law admit a -sparse approximation with minimal error.
A great deal of effort in the compressive sensing community has focused on finding the theoretical conditions of that lead to successful recovery of from (15)–(17). Injecting randomness into is the easiest way to guarantee robust results with high probability. For example,
may be a random matrix with independent and identically distributed (i.i.d.) elements drawn from either a Gaussian or Bernoulli distribution. Although these sensing matrices yield excellent performance, they are plagued by the need to store all elements of, which becomes a nontrivial burden when is large. A more realistic embodiment that avoids this problem is to randomly subsample rows of an unitary matrix, e.g., a discrete Fourier or Hadamard matrix. These matrices do not need to be explicitly stored because they either have a closed-form expression or a “black box” implementation. Further, as mentioned earlier, Fourier and Hadamard matrices are known to be incoherent relative to the standard basis, as well as wavelet bases and others. This makes them ideal sensing operators necessary to admit (16).
In practical applications, however, the measurement matrix may be more elusive. For example, we may assume (15) properly models the measurement procedure, but when it is implemented in the field may be unknowingly perturbed so that the measurements are actually of the form . In this case recovering with (17) incurs a so-called “model mismatch” that does not account for the extra multiplicative (rather than additive) noise term . Previous work HermanStrohmer_GenPertCS ; HermanNeedell_MixedOpsCS ; SensitivityBasisMismatchCS_Calderbank2011 has extended the error expression (18) to include this real-world phenomenon.
A different, but related issue occurs when we do not have the liberty of choosing the type of measurement matrix to be used. That is, is often dictated by the sensing modality or by specific physical, chemical, or biological constraints. For example, the magnetic coils of MRI machines emit sinusoidal radio waves. These spawn measurements in the frequency domain that are naturally modeled by the canonical discrete Fourier transform (the so-called “DFT”), hence consists of rows of the DFT matrix. The previous work in genetics ProbeDesign_CSDNAMicroarrays_Shikh2008 ; RareAlleleCarriers_CS_Zuk2010 ; CompressedGenotyping_Erlich2010 ; BioScreensLinearCodes_Erlich2015 ; RecoveringSparseSignals_CompressedDNAMicroarrays_Hassibi2008 ; ApplyingCStoGWAS_Chow2014 mentioned in Section 1.3 exemplify scenarios where one must design based on realities and restrictions of the sensing environment. In the study presented here, the genome is modeled as an -dimensional Boolean lattice and the trait, or fitness landscape, is a vector which takes the elements of the Boolean lattice as a basis. From (11) and (12), the trait and gene network vector spaces are related by the Fourier transform characterized by the Sylvester-Hadamard matrix , whose rows naturally form our sensing matrix.
5 Simulations and Results
We are tasked with the problem of discovering a genome-to-trait function affected by loci, which can be expressed on a Boolean lattice that captures all possible combinations of participating alleles. Instead of explicitly identifying the genome-to-trait function , it suffices to discover its associated gene network . However, we do not have direct access to the gene network — hence the challenge is to discover a “hidden” gene network from the trait values of a limited number of genomes. A summary of the steps of the simulation can be found in Procedure 1. We conducted trials, the results of which are recorded in Table 2 on page 2.
To assess how well our combined compressive sensing-recovery method works, we used two different measures of error. For a general length- vector , recall that its typical root-mean-square (RMS) and mean-absolute (MA) values are defined as and . Accordingly, the relative (or normalized) RMS and MA errors between and its approximate are defined as:
5.1 Assumed model of the gene network,
Based on the theory of smooth traits presented in Section 2, we can assume with high probability that the gene network is sparse or compressible with the bulk of significant coefficients occupying the lower levels. For convenience we employ a strictly sparse model, although we achieved similar results with a compressible model whose coefficients decayed similar to a power law. To make the simulations more realistic, we relaxed the assumption of only low-level concentration and allowed some nonzero coefficients to occupy higher levels. This corresponds to possible nonlocal interactions across modules, as explained in Section 2.1. In what follows, denote as the support set of nonzero coefficients in level .
5.1.1 Support set
There are many ways to enforce low-level concentration on the gene network in a computer simulation. We implemented the model outlined in Procedure 2 on page 2. Step 1(a) follows Section 2.1.4 to determine the sparsity for each of the lower levels. Step 1(b) then chooses indices uniformly at random from level and assigns them to the subset . The complete low-level support in (20) is the union of the support sets up to cutoff level (for level , the support is trivially ). Step 2 introduces a limited number of high-level interactions for some ; these indices are chosen at random from the union of the levels above and assigned to high-level support . Finally in Step 3, the total support and overall sparsity are just the result of joining and , and their respective sparsities and .
Many natural phenomena possess a high degree of organization. As such, we suspect there will be significant structure within the support . That is, the support sets for real-world gene networks most probably have some degree of dependence. For example, suppose some coefficient is nonzero, thus the index corresponding to the triad of loci will be in . Then it is reasonable to assume that these loci also interact pairwise to some extent, hence coefficients , , will also be nonzero with their respective indices in . In this case there would be a correlation between these indices in support sets and . A priori knowledge of this kind of structure could make reconstructions of the gene network faster and with better accuracy. At the same time, counterexamples can be found. Without empirical evidence as a guide, we therefore chose not to incorporate this phenomenon into the model of the gene network for our simulations; thus the random indices of in (21) are uncorrelated with each other.
5.1.2 Coefficient values
There are also many ways to assign values to the nonzero coefficients corresponding to support set . For example, it may be that the magnitudes decay with increasing level , say, according to some power law. However, as above, lacking empirical evidence, we chose to simply assign i.i.d. random Gaussian numbers777Some of these coefficients will probably be small enough to be considered “insignificant.” However, this does not contradict our assumption of a smooth trait. to these coefficients as seen in Step 4. More sophisticated models are, of course, possible. But this, as well as our method for generating the support set in Steps 1–3, is sufficient for our goal: to demonstrate how compressive sensing can deal with the scale problem.
Generate the support set
High levels Generate support set containing indices chosen uniformly at random from the union of levels above the cutoff level:
Total support The resulting support set of the gene network is
with sparsity .
Generate the coefficients
For each , set
5.2 Generating the gene network,
Next, we explain the choice of parameters used in Procedure 2 to generate the gene network for use in our simulations. We were limited by available computing resources, so chose arbitrary quantitative traits affected by just genes. We utilized the probabilistic model described in Section 2.1.4 to determine the low-level sparsity distribution with and in (6). The number of nonzero coefficients in levels is shown in Table 1 with cutoff level , resulting in low-level sparsity coefficients. For the higher levels we permitted randomly located coefficients. This yielded a -sparse vector with meaningful interactions, representing a sparsity ratio of merely .
The variance in (22) was chosen so that the resulting trait in (12) would have a realistic range. For example, measurements of myopia are usually in the range of roughly , so we simply chose (though this may seem large, it is countered by the scaling constant in the denominator of (12)). The support set and corresponding values of the gene network were randomly regenerated for each simulation trial, as described in the Procedure 2. The bottom of Figure 3 shows a particular gene network of size used in one trial of our simulations. From Step 1 of Procedure 1, this is the example “hidden” gene network that we next attempt to discover.
5.3 Generating the observations,
Consider the gene network in the bottom of Figure 3, which we do not have the ability to directly measure. Assume we are constrained to observe only a subset of the organisms of its associated full trait , shown in the top of Figure 3. This is a valid assumption because with genes the population of size is already quite large. If genomes of are subsampled essentially uniformly at random from (12), this is equivalent to the (noiseless) observation vector in (15) with matrix defined as the corresponding randomly chosen rows of .
To implement Step 2 of Procedure 1, we chose and in (16), resulting in measured genome-trait values. This represents an extremely strong undersampling compared to the ambient dimension : the subsampling ratio is just . Any practical application will have measurements that have been corrupted. To simulate this we added a white Gaussian noise (AWGN) vector to in (15), such that
. In terms of decibels (dB), the signal-to-noise ratio (SNR) isdB. The top of Figure 4 depicts these randomly chosen and noisy trait measurements with red dots superimposed on the true full trait in blue (duplicated from the top of Figure 3). Although the red dots seem plentiful, the zoomed-in portion in the middle panel reveals that only genomes and their perturbed trait values have been sampled at random from the first true trait values — this extremely small sampling of coincides with the overall subsampling ratio .
The distribution of the values of the true full trait and its noisy subsamples are shown in the bottom of Figure 4. The histogram of (left) has mean, and (right) has mean and standard deviation . Although the mean is off by about , the severe random undersampling and perturbed observations have captured the range of trait values contained in the true full trait with only about error.
5.4 Recovery of the gene network,
For Step 3 of Procedure 1, given measurements and matrix in the previous section, we recovered an approximate gene network by implementing (17) in two substeps. First, we used the “Fast Adaptive Shrinkage/Thresholding Algorithm” (FASTA) FASTA:2014 to recover the approximate support of the gene network. Define as the submatrix of whose columns are restricted to the indices of . With and of full column rank, we then performed a traditional least-squares regression on , the support of :
Inserting zeros on the complement of yields .
The top of Figure 5 shows the gene network recovered from noisy observations .888We also conducted simulations with noiseless measurements . Note that (18) predicts perfect reconstruction in the case of strictly -sparse vectors recovered from noiseless observations. Our simulations essentially confirm this: both relative errors and were on the order of , which is close to machine epsilon. Compare this with the hidden gene network we are trying to discover in the bottom of Figure 3. Although they look quite similar there are small differences between them, seen in the pointwise error in the bottom of Figure 5. Some of these differences are due to the FASTA algorithm either missing the smallest coefficients of , or introducing relatively small coefficients with incorrect support in . Applying (19) to and in this example we first compute the - and -norms of the ground-truth: and . Next, the absolute - and -errors were and , which yielded relative errors of and .
Table 2 on page 2 lists the relative errors of the gene networks recovered in the simulation trials we conducted (the emboldened row corresponds to this example trial). The bottom row is the average over all trials. From this we see for our assumed model of an arbitrary quantitative trait affected by genes with meaningful clusters, when randomly sampled trait values are corrupted with dB of AWGN, on average we can recover its associated gene network with about – accuracy. Clearly, the error can be further reduced if one has access to more trait values. It is also possible to fine-tune the parameters in the FASTA algorithm to either permit fewer measurements and/or achieve lower error, but these results are fairly representative for the scenario presented. The key takeaway of these simulations is that the significant coefficients of the gene network can be faithfully recovered with minimal error. In turn, we see in the next section that we can also dependably predict trait values.
5.5 Predicting the trait,
In analogy with (12), the approximate full trait corresponding to is
Recognize, though, we do not have to predict the full trait; we may only be interested in the trait value for some previously unobserved th genome. This is easily accomplished by taking the inner product of the th row of with .
Clearly, the effects of any perturbations in are inherited by , so we are interested in assessing deviations from the ground-truth . A standard measure of this is an average error over all predicted trait values. But by assumption, for large we most probably do not have access to the full trait . Suppose, however, that we have access to the - and -norms and errors of the gene network, either through measurement, approximation, or some other a priori knowledge; with these we can establish theoretical error bounds on in terms of .
From the unitarity of the Fourier transform (i.e., ), the absolute error of in terms of the -norm is just a scaled version999We could have defined the transforms in (11) and (12) to be and , respectively. The symmetry would have then precisely yielded . of that of :
while the relative -errors are identical:
This is significant: we can precisely determine the absolute and relative -errors of the trait solely based on those of the gene network due to the unitarity of the Fourier transform.
However, similar analysis of the errors in terms of the -norm does not yield such equalities. In particular, the absolute -error of is upper-bounded by that of :
where in the second inequality we used the fact that for a matrix with rows. Unfortunately, the lower bound on the -norm of incurs a significant scaling factor:
This theoretical bound does not provide any meaningful utility owing to the fact that will be extremely large for any of practical interest.
In our numerical simulations, however, we have access to all of and . For the trial we are examining, the top of Figure 6 shows the predicted full trait reconstructed from using (24), which is very similar to the true trait in the top of Figure 3 — this is affirmed in the middle panel showing a zoomed-in portion of the predicted trait values for the first genomes (red ‘’s) plotted relative to the first true trait values (blue dots). The differences are recorded in the error , however, plotting this as a function of all genomes contains too many points to be a useful visual aid. Instead, we show the distribution of error values in the bottom of Figure 6. The histogram of the prediction error has a bell-like shape, with mean and standard deviation .