Structural variants (SV), insertions, deletions, trans-locations, copy number variants, are by far the most common types of human genetic variation (chaisson15). They have been linked to large number of heritable disorders (hurles08). Technology to assay the presence or absence of these variants has steadily improved in ease and resolution (huddleston16; audano19)
. Whole genome shotgun DNA sequencing (WGS) can detect small variants (less than 10bp) readily and can detect some classes of large SV. This approach, however, is inferential and often struggles to capture copy number variation in gene families or to correctly estimate the size of insertions. An alternative approach, genomic mapping (such as the technology of BioNano Genomics), addresses the deficiencies of WGS by providing linkage and size information from ordered fragments of chromosomes spanning tens to hundreds of kilobases. In contrast to WGS, genomic mapping approaches directly observe SV, rather than inferring the existence of a SV from patterns of mismatch in WGS data. In the near future, these genome mapping technologies are expected to be used for clinical diagnosis of SV known to be associated with genetic disorders.
In a clinical setting, the cells or tissues needed for analysis may be hard to obtain, which poses several important statistical questions: what is the minimum amount of starting material necessary to have some confidence of detecting a target fragment? What is the optimal sampling strategy for the primary and derived material throughout the process? How best to model the technical errors–such as failure to digest at a site–during the processing of the data as these errors can lead to false positives and negatives? As is often the case, answering these questions motivated an exploration and expansion of the statistical machinery used to model this biological process. Specifically, we established a relationship between the tail bounds of the binomial and hyper-geometric distributions.
2 Statistical Model
In this section, we abstract our sampling procedure into an “urn sampling” model. As DNA is processed through the optical mapping procedure, we imagine the material passing through a series of urns. Assume we have different types of long sequences (i.e. chromosomes), each type has copies (i.e. cells), so we have long sequences in total. We assume only one type of long sequences contains the target sequence, or the fragment of interest. The basic idea of our sampling model is shown in Figure 1. The notations introduced below are summarized in Table 1.
|n||Number of cells in the first urn (copies of each type of long sequences).|
|K||Number of sequences sampled from the first urn.|
|R||Number of sequences sampled from third urn.|
|L||Approximated length of long sequence.|
|l||Approximated length of short sequence.|
|T||Threshold on detectability of target sequences.|
|f||Length of fragment of interest.|
|c||Approximated ratio between lengths of long and short sequences.|
|Q||Minimum number of target sequences we want in the detection machine.|
|p||Minimum confidence in achieving the goal.|
The first urn contains our original biological material, total of long sequences out of which of them contain the target sequence. At the first stage, we sample sequences without replacement from the first urn, and put them in the second urn. The second urn will therefore contain a random number of target sequences. All of the long sequences in second urn are cut at random locations according to a Poisson process and placed into the third urn. The third urn will therefore contain a random number of sequences out of which are target sequences. The content of the third urn models the biological material prepared for assay in a detection machine. Finally, we sample smaller sequences without replacement out of the third urn and put them into a detection machine. There will be a random number of target sequences processed by the detection machine, and the goal is to assure that for some pre-specified values and
, we have the probability ofis at least . Throughout the experiment, the variables () are in our control and we will find the conditions on them to achieve our goal. Throughout this paper, we call the long sequence in the second urn which contains the fragment of interest as “target sequence”.
Next we state the following biological assumptions:
The length of target sequence is .
The lengths of long sequences in the first urn are approximately , here .
Short sequences in the third urn have lengths approximately , and we have .
We proceed by describing the probabilistic parts of our model. The distributions and their expectations are summarized in Table 2. There are target sequences in the second urn. It is straightforward to see , a hyper-geometric distribution with samples and samples of interest and as sampling size. Hence .
Let () denotes the number of cuts on -th long sequence in the second urn. Combine with the third assumption above, we assume that
follows a Poisson distribution with mean. Note that cuts divide the sequence into shorter sub-sequences. Consequently, is the total number of short sequences in the third urn, and follows Poisson distribution with mean .
Write as the number of the sequences in the third urn that contain the target sequence. The distribution of is more complicated than that of . Assuming , we have at least target sequence contained in the second urn. We have , where fix ,
are independent Bernoulli random variables. Condition on, the probability of success of random variable satisfies
respectively. Here , , . The proof is found in Section 4.1.
Finally, condition on and , the number of target sequences in the detection machine follows a hyper-geometric distribution with parameters , and .
2.1 Analytical Results
In this section, we present the analytical results of our statistical modeling. Mathematically, our goal can be written as
Now we consider the quantity , such that with pre-fixed quantities , and
Note here . We will find as a function of from a concentration inequality on hyper-geometric distribution.
In section 4.2
, we developed the relationship between the tail bounds of binomial distribution and that of hyper-geometric distribution. Specifically, consider the following random variables with parameters:
, a hyper-geometric distributed random variable.
We proved that under some regularity conditions, the following relations are true
The conditions needed and detailed proof are presented in section 4.2.
In section 3, the numerical calculations implied that for large , (5) is a better bound, otherwise we may want to use (4). From the relationship above we immediately know a tail bound on binomial distribution can also be used as the tail bound for hyper-geometric distribution.
Throughout this paper, we assume the conditions needed for (4) and (5) are always met. Therefore we may use large deviation bounds from arratia89 at the following two binomial distributions: and to find in (3).
From now on we write . Note that and are typically unknown. Therefore, itself is still a random quantity and we need to further find a upper bound for depending on and , this is denoted by . With large probability, sampling sequences in the third urn is enough to guarantee sampling no less than samples.
It is fairly straightforward to see increases with and decreases with . Now we fix and , and write and as the probabilistic upper/lower bounds for and , respectively. From (4) and (5) we can find directly from tail bounds on and . In particular, the steps needed to determine for a given and are summarized here:
Use lemma 2 on binomial distributions and to find lower bound of . Here depends only on , and so that: .
Set from step 1. Note that is the summation of independent Bernoulli trials. Hence from lemma 2 we can find lower bound of depending only on , , , , , , so that: Consequently .
Use inequality from lemma 1 to find and depending only on so that: and .
Use lemma 2 on binomial distributions and to find so that:
Note that we need to ensure the needed sample size is not larger than the available number of short sequences . To this end, both and are deterministic functions of given constants and we can add numerical constraint on to force it smaller than . A key observation from our numerical result is, as gets larger, and will be more concentrated around the mean , while will be much smaller than . Therefore, we need to find a lower bound on to ensure .
Finally, given that we choose and as our sampling sizes at two stages, respectively. The following relations are true:
It suffices to set the desired probability equal to the right-hand-side of (6). The exact selection of can be found in Section 4.3. As discussed in section 4.3, the range of is , while not every in this range is feasible, a straightforward monotone analysis shows that as long as is larger than a certain threshold, the solution always exists.
2.2 Optimal Sampling Strategy
In this section, we discuss how to use the formulas derived in section 2.1 to find the optimal values of and for any given and . Specifically, assume there is a user-specified cost function over number of samples and the sampling size from first urn. In this paper we assume is an monotone increasing function of both and .
The proposed procedure is summarized here:
Solve for such that .
For fixed , we calculate .
For any fixed and such that , we calculate .
The implementation details are discussed in section 4. In reality the amount of biological materials is limited, hence there is an upper bound on and there are only finite number of to consider. We do not need to consider any as that would lead to sub-optimal design. However, for fixed , we do need to consider , because larger might lead to smaller and a more efficient solution.
Assume we have a cost function that increases with and . We only have finitely many to consider and a brute force search among all the possible triples will yield the optimal minimizing the cost function.
Due to technology limits, we may have certain constraints on sampling percentages: for example, we can only sample in the first stage, and from the second stage. We can still use the brute force search only considering the cases that do satisfy these extra constraints.
3 Numerical Results and Conclusions
For our numerical results, the calculations were based on biologically reasonable parameters: , , , , , .
together with the results without using any concentration inequalities (we get the tail points by the inverse of cumulative distribution functions, which is applicable for relatively small); both of them have the similar patterns. From original calculation results we can find two “kinks” for each fixed . This is because when is small, we will need to sample almost everything from the second stage, which will force us to choose the correspond for as the binomial bounds. Then as gets larger but not big enough, we will use for both stages. Finally will get close to which again forces to use at the first sampling stage.
In Figure 3 we plot the simulation results together with population expectation results. Here the simulation means of each and fixed we create large amount of , and . Then for each simulation trial, we use a brute force search to find the smallest that can gives us (2). Note this simulation is an “averaging” approach while our algorithm is more like a tolerance interval approach, thus they are not comparable and we put them into two separate figures. The population expectation results means we replace and directly by their expectations, and again brute force search for the smallest . From Figure 3 we can see as gets larger, these two results will be very close, which implies for large , we can approximately use expectations of and to conduct the calculation.
Table 3, provides examples the minimization results based on a linear cost function. under various constraints. In particular we use , and various sampling percentage constraints on both sampling stages.
We have also applied our algorithm to other choices of . The lessons learned are similar to what we have shown here. In the supporting materials we provide the Matlab code that can be used to calculate optimal sampling strategy with different parameters.
In conclusion, we have developed an optimization approach for estimating the amount of material needed for genomic mapping based on a simple, yet realistic, model of the process that uses a novel result regarding the tail bounds of the hyper-geometric distribution. Our approach is both computationally and analytically tractable and We show that if a genomic mapping technology can sample most of the chromosomal fragments within a sample, comparatively little biological material is needed to detect a variant at high confidence.
4.1 Proof of Equation (1)
There are copies of the target fragments in the second urn. Some of the fragments of interest might not survive during the cutting process, therefore we have . Define as the event that the -th target fragment survives (i.e. being intact after cutting procedure) and is placed in the third urn. Given that we have cuts on the -th target sequence, the locations of these
cuts are then uniformly distributed, therefore.
Next, in order for the target fragment to be usable by the detector, it has to be longer than . If , the sequences that contain the target fragment are always longer than , then . Otherwise we estimate from a lower bound using the probability of an event , where is the event of not having cuts within on either one or the other side of the target sequence (see Figure 4). Recall that , , . Then by inclusion and exclusion = and consequently
4.2 Hyper-geometric Distribution and Binomial Bounds
In this section, we discuss the relationship between the tail bounds of binomial distribution and that of hyper-geometric distribution.
For fixed positive integer , consider the following two inequalities
The above two inequalities can be simplified as:
We use mathematical induction on . Given that . Now for , we want . It suffices to have , which only requires . Similarly we can prove this property for (9). ∎
Again we use (backward) mathematical induction on . Given . We need . It suffices to show
the inequality above is equivalent to . Take first order derivative of with respect to we have:
If for , the result is proved by using the monotonicity of and the assumption that . Now we will prove . It suffices to show:
The fist line of (11) is obviously negative by noting the following facts
For the rest lines, we use the following relations:
where the last inequality follows by assumption . For the rest parts, we want . It suffices to show
which is equivalent to
this follows directly from the assumptions. Thus the first part of Property is proved.
Now we prove the second part. From mathematical reduction on , we want . It suffices to have
which is equivalent to . Similarly as before, we want this function increases with , from which we only need to check and this follows from our assumption. Consider the first order derivative of :
We can expand the first line of (12) and write it as:
we want to show the above line is non-positive. Note that
Finally we only need , which follows from our assumption: .
Same as before we use (backward) mathematical induction on . For we want:
which is equivalent to . Similarly as the proof of Property , it suffices to show
for any that satisfies the assumptions. It suffices to have , this only needs , which is obviously true according to our assumptions. Similarly for the second part we need for :
and it suffices to have for any that satisfies the assumptions. Again we prove the monotonicity of :
it suffices to show , which can be proved by using our assumption . Thus the second part is proved. ∎
4.3 Implementation Details
In this section we discuss the implementation details of optimal sampling strategy in section 2.
The following quantities should be specified/calculated beforehand:
Specify the values of , , , , , , according to the particular application.
Select , and so that the right-hand-side of (6) becomes .
Compute: , , and set , . Here and
are the expected value and variance of Bernoulli Berrandom variable.
Also we write as the relative entropy function defined in arratia89.
4.3.1 Calculating lower bound on K
We need to find the lower bound of such that with large probability we have at least target sequences in the third urn. Equivalently, we want . To this end, we assume the cutting process in urn 2 does not break any target sequences and we take everything out from urn 3. Therefore, we only need to make sure is larger than with high probability. In section 3, we solved both (4) and (5) to get different lower bounds for , similarly with different lower bounds on we will have different lower bounds for downstream quantities like , etc.
4.3.2 Calculating lower bound on R
Algorithm 1 can be used to calculate with pre-fixed and . Please note that we use tail bounds of binomial distribution to approximate that of hyper-geometric distribution in step 1, 2 and step 4. Here step 1 and 2 only requires property 1 and 2 in section 4.2, while for step 4 we also need property 3 and 4, because we need the relations in (4) and (5) to be true with and as well. For each fixed , the range of is relatively small, thus for each input we can simply try all the possible and calculate the corresponding smallest (use to denote it) that achieves our goal. To make our algorithm more efficient, we can first find the smallest that can give us a lower tail that is larger than (any smaller will not be feasible, see our supporting codes for details), call this . For each from to , we use Algorithm 1 to find .