A key step in many bioinformatics analysis pipelines is the identification of regions of similarity between pairs of DNA sequencing reads. This task, known as pairwise sequence alignment, is a heavy computational burden, particularly in the context of third-generation long-read sequencing technologies, which produce noisy reads (pacbio_errorRate). This challenge is commonly addressed via a two-step approach: first, an alignment estimation procedure is used to identify those pairs that are likely to have a large alignment. Then, computationally intensive alignment algorithms are applied only to the selected pairs. This two-step approach can greatly speed up the alignment task because, in practice, one only cares about the alignment between reads with a large sequence identity or overlap.
Several works have developed ways to efficiently estimate pairwise alignments (Myers2014; Berlin2015; li2016minimap; li2018minimap2). The proposed algorithms typically rely on hashing to efficiently find pairs of reads that share many $k$-mers (length-$k$ contiguous substrings). Particularly relevant to our discussion is the MHAP algorithm of Berlin2015. Suppose we want to estimate the overlap size between two strings $x_1$ and $x_2$, and let $S_i$ be the set of all $k$-mers in $x_i$, for $i = 1, 2$. For a hash function $h$, we can compute a min-hash
$$\min_{s \in S_i} h(s)$$
for each read $x_i$. The key observation behind MHAP is that, for a randomly selected hash function $h$,
$$\Pr\left[\min_{s \in S_1} h(s) = \min_{s \in S_2} h(s)\right] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}.$$
In other words, the indicator function
$$Z_h = \mathbf{1}\left\{\min_{s \in S_1} h(s) = \min_{s \in S_2} h(s)\right\}$$
provides an unbiased estimator for the $k$-mer Jaccard similarity between the sets $S_1$ and $S_2$, which we denote by $J(S_1, S_2)$. By computing $Z_h$ for several different random hash functions, one can thus obtain an arbitrarily accurate estimate of $J(S_1, S_2)$. As discussed in Berlin2015, $J(S_1, S_2)$ serves as an estimate for the overlap size and can be used to filter pairs of reads that are likely to have a significant overlap.
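The min-hash collision estimator described above can be sketched in a few lines. This is an illustrative sketch rather than the MHAP implementation; the function names and the use of Python's salted built-in `hash` as a stand-in for random hash functions are our own assumptions.

```python
import random

def kmers(seq, k):
    """Set of all length-k contiguous substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_jaccard(s1, s2, k=8, num_hashes=100, seed=0):
    """Estimate the k-mer Jaccard similarity of s1 and s2 by counting
    min-hash collisions over num_hashes random hash functions."""
    a, b = kmers(s1, k), kmers(s2, k)
    rng = random.Random(seed)
    collisions = 0
    for _ in range(num_hashes):
        # Salting the built-in hash gives a cheap stand-in for a
        # randomly drawn hash function.
        salt = rng.getrandbits(64)
        h = lambda s: hash((salt, s))
        if min(h(x) for x in a) == min(h(x) for x in b):
            collisions += 1
    return collisions / num_hashes
```

Each collision indicator plays the role of the unbiased estimator $Z_h$ above; averaging over many hash functions drives down the variance of the estimate.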
Now suppose we fix a reference read $x_0$ and wish to estimate the size of the overlap between $x_0$ and $x_i$, for $i = 1, \dots, n$. Assume that all reads are of length $L$ and let $\theta_i$ be the overlap fraction between $x_0$ and $x_i$ (i.e., the maximum $\theta$ such that a $\theta L$-prefix of $x_0$ matches a $\theta L$-suffix of $x_i$ or vice-versa). By taking $m$ random hash functions $h_1, \dots, h_m$, we can compute min-hashes for $x_0$ and each $x_i$. The MHAP approach corresponds to estimating each $\theta_i$ as
In the context of crowdsourcing and vote aggregation (dawid1979maximum; raykar2010learning; ghosh2011moderates), one can think of each hash function $h_j$ as a worker/expert/participant, who is providing binary responses to the questions “do $x_0$ and $x_i$ have a large alignment score?” for $i = 1, \dots, n$. Based on the binary matrix of observations $A = [A_{ij}]$, we want to estimate the true overlap fractions $\theta_1, \dots, \theta_n$.
The idea of jointly estimating $\theta_1, \dots, \theta_n$ from the whole matrix $A$ was recently proposed by baharav2019spectral. The authors noticed that in practical datasets the distribution of $k$-mers can be heavily skewed. This causes some hash functions to be “better than others” at estimating alignment scores. Hence, much like in crowdsourcing models, each worker has a different level of expertise, which determines the quality of their answer to all questions. Motivated by this, baharav2019spectral proposed a model where each hash function $h_j$ has an associated unreliability parameter $\epsilon_j$ and, for $i = 1, \dots, n$ and $j = 1, \dots, m$, the binary observations are modeled as
$$A_{ij} = X_{ij} \vee Z_{ij}, \qquad X_{ij} \sim \text{Ber}(\theta_i), \quad Z_{ij} \sim \text{Ber}(\epsilon_j),$$
where $\text{Ber}(\theta)$ is a Bernoulli distribution with parameter $\theta$ and $\vee$ is the OR operator. If a given $h_j$ assigns low values to common $k$-mers, spurious min-hash collisions are more likely to occur, leading to the observation $A_{ij} = 1$ even when $x_0$ and $x_i$ do not have an overlap (thus making $h_j$ a “bad” hash function). Similarly, some workers in crowdsourcing applications provide less valuable feedback, but we cannot know a priori how reliable each worker is.
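To make the model concrete, the observation process in (4) can be simulated directly. The NumPy-based sketch below is our own; `theta` holds the overlap fractions and `eps` the per-hash spurious-collision probabilities.

```python
import numpy as np

def simulate_collisions(theta, eps, rng=None):
    """Simulate the binary min-hash collision matrix of model (4):
    A[i, j] = X[i, j] OR Z[i, j], with
    X[i, j] ~ Ber(theta[i])  (collision caused by a true overlap) and
    Z[i, j] ~ Ber(eps[j])    (spurious collision of hash function j)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.asarray(theta, dtype=float)
    eps = np.asarray(eps, dtype=float)
    X = rng.random((len(theta), len(eps))) < theta[:, None]
    Z = rng.random((len(theta), len(eps))) < eps[None, :]
    return (X | Z).astype(int)
```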
A key observation about the model in (4) is that, in expectation, the observation matrix $A$ is rank-one after accounting for an offset. More precisely, since $\Pr(A_{ij} = 0) = (1 - \theta_i)(1 - \epsilon_j)$,
$$\mathbb{E}\left[\mathbf{1}\mathbf{1}^\top - A\right] = u v^\top,$$
where $u_i = 1 - \theta_i$ and $v_j = 1 - \epsilon_j$. baharav2019spectral proposed to estimate $\theta_1, \dots, \theta_n$
by computing a singular value decomposition (SVD) of $\mathbf{1}\mathbf{1}^\top - A$ and setting $\hat{\theta}_i = 1 - \hat{u}_i$, where $\hat{u}$ is the leading left singular vector of $\mathbf{1}\mathbf{1}^\top - A$, suitably rescaled. The resulting overlap estimates are called the Spectral Jaccard Similarity scores and were shown to provide a much better estimate of overlap sizes than the estimator given by (3), by accounting for the variable quality of hash functions for the task.
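Under this offset formulation, the Spectral Jaccard Similarity computation reduces to a single SVD. The sketch below is our own simplified reading, not the authors' implementation: it assumes the model $\mathbb{E}[\mathbf{1}\mathbf{1}^\top - A] = uv^\top$ with $u_i = 1 - \theta_i$, and takes the calibration constant $\lVert v \rVert$ as a known input.

```python
import numpy as np

def spectral_overlap_estimates(A, v_norm):
    """Estimate overlap fractions theta from a binary collision matrix A,
    assuming E[1 - A] = u v^T with u = 1 - theta. The scale ||v|| is
    assumed known (in practice it is estimated via calibration reads)."""
    M = 1.0 - np.asarray(A, dtype=float)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    u_hat = (S[0] / v_norm) * U[:, 0]
    u_hat *= np.sign(u_hat.sum())  # resolve the SVD sign ambiguity
    return 1.0 - u_hat
```

On data simulated from model (4), the estimates concentrate around the true overlap fractions once the number of hash functions is moderately large.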
In this paper, motivated by the model of baharav2019spectral, we consider the more general framework of rank-one models. In this setting, a vector of parameters $q \in [0,1]^n$ (the item qualities) is to be estimated from the binary responses provided by $m$ workers, and the observation matrix is assumed to satisfy $\mathbb{E}[A] = q p^\top$. In the context of these rank-one models, a natural estimator for $q$ is the leading left singular vector of $A$. Such a spectral estimator has been shown to have good performance both in the context of pairwise sequence alignment (baharav2019spectral) and in voting aggregation applications (ghosh2011moderates; karger2013efficient). However, the spectral decomposition by default allocates worker resources uniformly across all items. In practice, one is often only interested in identifying the “most popular” items, which, in the context of pairwise sequence alignment, corresponds to the reads that have the largest overlaps with a reference read $x_0$. Hence, we seek strategies that harness the performance of spectral methods while using adaptivity to avoid wasting worker resources on unpopular items.
Main contributions: We propose an adaptive spectral estimation algorithm, based on multi-armed bandits, for identifying the largest entries of the leading left singular vector of $\mathbb{E}[A]$. A key technical challenge is that multi-armed bandit algorithms generally rely on our ability to build confidence intervals for each arm, but it is difficult to obtain tight element-wise confidence intervals for the singular vectors of random matrices with low expected rank (abbe2017entrywise). For that reason, we propose a variation of the spectral estimator for $q$, in which one computes the leading right singular vector first and uses it to estimate the entries of $q$ via a matched filter (verdu1998multiuser). This allows us to compute entrywise confidence intervals for each $\hat{q}_i$, which in turn allows us to adapt the Sequential Halving bandit algorithm of Karnin2013 to identify the top-$k$ entries of $q$. We provide theoretical performance bounds on the total workers’ response budget required to correctly identify the top-$k$
items with a given probability. We empirically validate our algorithm on controlled experiments that simulate the vote aggregation scenario and on real data in the context of pairwise alignment of DNA sequences. For a PacBio E. coli dataset (PBecoli), we show that adaptivity can reduce the budget requirements (which correspond to the number of min-hash comparisons) by around half.
Related work: Our work is motivated by the bioinformatics literature on pairwise sequence alignment (Myers2014; Berlin2015; li2016minimap; li2018minimap2). In particular, we build on the idea of using min-hash-based techniques to efficiently estimate pairwise sequence alignments (Berlin2015), described by the estimator in (3). More sophisticated versions of this idea have been proposed, such as the use of a tf-idf (term frequency-inverse document frequency) weighting for the different hash functions (chum2008near; marccais2019locality), which follows the same observation that “some hashes are better than others” made by baharav2019spectral. A proposed strategy to reduce the number of hash functions needed for the alignment estimates is to use bottom sketches (ondov2016mash). More precisely, for a single hash function, one can keep the $s$ smallest hash values per read (the bottom-$s$ sketch) and estimate the alignments based on the size of the intersection of the bottom sketches. Winnowing has also been used in combination with min-hash techniques to allow the mapping of reads to very long sequences, such as full genomes (jain2017fast).
The literature on crowdsourcing and vote aggregation is vast (raykar2010learning; dalvi2013aggregating; whitehill2009whose; ghosh2011moderates; karger2014budget; liu2012variational; karger2013efficient; welinder2010multidimensional; zhou2014aggregating; zhou2012learning; zhang2014spectral; shah2016permutation). Many of these works are motivated by the classical Dawid-Skene model (dawid1979maximum). ghosh2011moderates considered a setting where the questions have binary answers and the workers’ responses are noisy. They proposed a rank-one model and used a spectral method to estimate the true answers. For a similar setting, karger2014budget showed that random allocation of questions to workers (according to a sparse random graph) followed by belief propagation is optimal for obtaining the answers with a target probability of error, and this work was later extended (karger2013efficient; dalvi2013aggregating). While the rank-one model these works have considered is similar in spirit to ours, in our setting the true answers to the questions are not binary. Rather, our answers are parameters in $[0,1]$, which can be thought of as the quality of an item or the fraction of the population that would answer “yes” to a question. Natural tasks to consider in our case would be to find the top ten products among a catalogue of 100,000 products, or to identify the items that are liked by more than 80% of the population. Notice that such questions are meaningless in the case of questions with binary answers.
The use of adaptive strategies for ranking a set of items or identifying the top-$k$ items has been studied in the context of pairwise comparisons between the items (heckel2018approximate; heckel2019active; szorenyi2015online; busa2013top). Moreover, the top-$k$ arm selection problem has been considered in many contexts, and a large set of algorithms exists to identify the best items while minimising the number of observations required to do so (maron1994hoeffding; even2002pac; heidrich2009hoeffding; bubeck2013multiple; kalyanakrishnan2012pac; jun2016top; Karnin2013). Recent works have taken advantage of some of these multi-armed bandit algorithms to solve large-scale computation problems (Bagaria2017; Bagaria2018; baharav2019ultra), which is similar in flavor to our work.
Outline of manuscript: In Section 2, we introduce rank-one models and give examples of potential applications. In Section 3, we develop a spectral estimator for these models that enables construction of confidence intervals. In Section 4, we leverage these confidence intervals to develop adaptive bandit-based algorithms. Section 5 presents empirical results.
2 Rank-One Models
While our main target application is the pairwise sequence alignment problem, we define rank-one models for the general setting of response aggregation problems. In this setting, we are interested in estimating a set of parameters, or item values, $q_1, \dots, q_n$. To do that, we recruit a set of $m$ workers with unknown levels of expertise, who provide binary opinions about the items. We can choose the number of workers, and their opinions can be requested for any subset of the items. The rank-one model assumption is that the matrix of responses is rank-one in expectation; i.e.,
$$\mathbb{E}[A] = q p^\top$$
for unknown vectors $q \in [0,1]^n$ and $p \in [0,1]^m$. This means that, for each $i$ and $j$, $A_{ij}$ has a $\text{Ber}(q_i p_j)$ distribution. Furthermore, we assume throughout that all entries of $A$ are independent.
Abstractly, this setting applies to a situation in which we have a list of $n$ questions that we want answered. One can think of a question as the rating of a movie or whether a product is liked. The parameter associated with the $i$th question, $q_i$, can be thought of as representing the average rating of the movie, or the fraction of people in the population who like the product. We can think of $A_{ij}$ as a noisy version of the response of worker $j$ to question $i$, with the noise modelling respondent error. We point out that the $A_{ij}$ can alternatively be viewed as noisy observations of responses to the questions through a binary channel (cover2012elements), with each worker having a channel of distinct characteristics. We next discuss two instances of this model and the corresponding binary channels.
One-sided error: In the pairwise sequence alignment problem presented in Section 1, we want to recover the pairwise overlap fractions (or alignment scores) $\theta_i$ between read $x_0$ and read $x_i$, for $i = 1, \dots, n$. The observation for read pair $(x_0, x_i)$ and hash function $h_j$ is modelled by $A_{ij} = X_{ij} \vee Z_{ij}$, which can be thought of as observing a random variable $X_{ij} \sim \text{Ber}(\theta_i)$ through a Z-channel with crossover probability $\epsilon_j$. A Z-channel, shown in Fig. 1(a), is one where an input $1$ is never flipped but an input $0$ is flipped with some probability (cover2012elements). Hence, this models one-sided errors. If we process the data as $\bar{A}_{ij} = 1 - A_{ij}$, we then have that
$$\mathbb{E}[\bar{A}_{ij}] = (1 - \theta_i)(1 - \epsilon_j),$$
giving us $q_i = 1 - \theta_i$ and $p_j = 1 - \epsilon_j$. While, in this case, the model holds for the processed rather than the raw observations, our main results (presented in Section 3 for a binary matrix $A$) can be readily extended.
Two-sided error: In the binary crowdsourcing problem, items $i = 1, \dots, n$ have associated parameters $\theta_i$ (the population rating of the item). The observation of the rating of worker $j$ on the $i$th item is modelled as $A_{ij} = X_{ij} \oplus Z_{ij}$, with $X_{ij} \sim \text{Ber}(\theta_i)$ and $Z_{ij} \sim \text{Ber}(\epsilon_j)$, where $\oplus$ represents the XOR operation. This can be thought of as observing a random variable $X_{ij} \sim \text{Ber}(\theta_i)$ through a Binary Symmetric Channel with crossover probability $\epsilon_j$. A Binary Symmetric Channel, shown in Fig. 1(b), is one where the probability of flipping a $0$ to a $1$ and a $1$ to a $0$ is the same. The processed data in this case is $\bar{A}_{ij} = 2A_{ij} - 1$, and the expected value of our observation matrix is given by
$$\mathbb{E}[\bar{A}_{ij}] = (2\theta_i - 1)(1 - 2\epsilon_j),$$
giving us $q_i = 2\theta_i - 1$ and $p_j = 1 - 2\epsilon_j$. As in the case of one-sided errors, the observations are not in $\{0,1\}$ but they only take two values, and the results in Section 3 still hold.
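The two-sided construction can be checked numerically: after mapping responses to $\pm 1$, the empirical mean matrix is close to a rank-one outer product. The simulator below is our own sketch; `theta` holds the item qualities and `eps` the per-worker crossover probabilities.

```python
import numpy as np

def simulate_bsc_responses(theta, eps, rng):
    """Worker j's answer to item i: a Ber(theta[i]) truth bit passed
    through a binary symmetric channel with crossover prob eps[j]."""
    theta = np.asarray(theta, dtype=float)
    eps = np.asarray(eps, dtype=float)
    truth = rng.random((len(theta), len(eps))) < theta[:, None]
    flip = rng.random((len(theta), len(eps))) < eps[None, :]
    return (truth ^ flip).astype(int)

# After the +/-1 processing, E[2A - 1] = (2*theta - 1)(1 - 2*eps)^T,
# which is a rank-one matrix.
```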
Notice that if we do not make the assumption of symmetry and use a general binary channel, shown in Fig. 1(c), the model is still low-rank (it is rank-two) in expectation. In this manuscript, we focus on rank-one models, but briefly discuss this generalization in Appendix A.
Model Identifiability: Strictly speaking, the models described above are not identifiable, unless extra information is provided. This is because the scales of $q$ and $p$ are unspecified: replacing $q$ and $p$ with $cq$ and $p/c$, for any $c > 0$, leads to the same distribution for the observation matrix $A$.
In practice one can overcome this issue by including questions with known answers. For the pairwise sequence alignment problem of baharav2019spectral, the authors add “calibration reads” to the dataset. These are random reads, expected to have zero overlap with other reads in the dataset. In a crowdsourcing setting, one can similarly add questions whose answers are known. Based on questions with known answers, it is possible to accurately estimate the “average expertise” of the workers, captured by $\lVert p \rVert$. In order to avoid overcomplicating the model and the results, and to circumvent the unidentifiability issue, we assume that $\lVert p \rVert$ is known in our theoretical analysis. In our experiments, we adopt the known-questions strategy to estimate $\lVert p \rVert$.
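The calibration idea can be sketched as follows: if we append rows whose true item value is known to be $q_i = 1$ (e.g., random reads with zero overlap, after the one-sided processing above), their column means estimate $p$, and hence $\lVert p \rVert$. This is our own illustrative sketch, not the authors' calibration procedure.

```python
import numpy as np

def estimate_p_norm(A_cal):
    """Estimate ||p|| from calibration rows whose true item value is
    q_i = 1, so that E[A_cal[i, j]] = p_j for every calibration row i."""
    p_hat = A_cal.mean(axis=0)  # column means estimate p_j
    return float(np.linalg.norm(p_hat))
```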
3 Spectral Estimators
Consider the general rank-one model described in Section 2. The binary matrix of observations can be written as $A = q p^\top + W$, where $W = A - \mathbb{E}[A]$ is a zero-mean noise matrix. We assume throughout this section that the entries of $A$ are independent, with $A_{ij} \sim \text{Ber}(q_i p_j)$, where $q \in [0,1]^n$ and $p \in [0,1]^m$. Moreover, we assume that $\lVert p \rVert$ is known, in order to make the model identifiable, as described in Section 2.
A natural estimator for $q$ is the leading left singular vector of $A$ (or the leading eigenvector of $A A^\top$), rescaled to have the same norm as $q$. One issue with such an estimator is that it is not straightforward to obtain confidence intervals for each of its entries. There is a fair amount of work on constructing confidence intervals around eigenvectors of perturbed matrices (coja2010graph; chin2015stochastic; chen2016community). However, the translation of control over eigenvectors to element-wise control using standard bounds costs us a factor of $\sqrt{n}$, which makes the resulting bounds too loose for the purposes of the adaptive algorithms we will explore in Section 4. There has been some work on directly obtaining element-wise control for eigenvectors (abbe2017entrywise; fan2016ell). However, they analyse a slightly different quantity, and so a direct application of these results to our rank-one models does not give us the desired element-wise control.
In order to overcome this issue and obtain element-wise confidence bounds on our estimate of each $q_i$, we propose a variation on the standard spectral estimator for $q$. To provide intuition for our method, let us consider a simpler setting – one where we know $p$ exactly. In this case, a popular means to estimate $q_i$ is the matched filter estimator (verdu1998multiuser)
$$\hat{q}_i = \frac{A_i\, p}{\lVert p \rVert^2},$$
where $A_i$ is the $i$th row of $A$. It is easy to see that $\hat{q}_i$ is an unbiased estimator of $q_i$, and standard concentration inequalities can be used to obtain confidence intervals. We try to mimic this intuition by splitting the rows of the matrix into two groups – red rows and blue rows. We then use the red rows to obtain an estimate $\hat{p}$ of $p$. We treat this as the true value of $p$ and obtain the matched filter estimates for the $q_i$s corresponding to the blue rows, which gives us element-wise confidence intervals. We then use the blue rows to estimate $p$, and apply the matched filter to obtain estimates for the $q_i$s corresponding to the red rows. This is summarised in the following algorithm.
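A minimal sketch of the split estimator just described (our simplified reading of Algorithm 1): one half of the rows yields $\hat{p}$ via an SVD, the other half is matched-filtered against it, and the roles are then swapped. The known norm $\lVert p \rVert$ is taken as an input, per the identifiability assumption.

```python
import numpy as np

def split_matched_filter(A, p_norm):
    """Split-sample spectral estimate of q for a binary matrix A with
    E[A] = q p^T. One half of the rows estimates p (leading right
    singular vector, rescaled to p_norm); the other half's rows are
    matched-filtered against that estimate, and the roles are swapped."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    red, blue = A[: n // 2], A[n // 2:]
    q_hat = np.empty(n)

    def estimate(rows_for_p, rows_for_q):
        _, _, Vt = np.linalg.svd(rows_for_p, full_matrices=False)
        p_hat = Vt[0] * np.sign(Vt[0].sum()) * p_norm
        # Matched filter: q_i = <A_i, p_hat> / ||p_hat||^2.
        return rows_for_q @ p_hat / p_norm ** 2

    q_hat[n // 2:] = estimate(red, blue)
    q_hat[: n // 2] = estimate(blue, red)
    return q_hat
```

Because each row is filtered against a $\hat{p}$ computed from the other half of the matrix, the noise in $A_i$ is independent of $\hat{p}$, which is what makes entrywise confidence intervals tractable.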
The main result in this section is an element-wise confidence interval for the resulting estimate $\hat{q}$.
In the remainder of this section, we describe the key technical results required to prove Theorem 1. We discuss the application of these confidence intervals to create an adaptive algorithm in Section 4.
To prove Theorem 1, we first establish a connection between control of $\lVert \hat{p} - p \rVert$ and element-wise control of $\hat{q}$. Then we provide expectation and tail bounds for the error in $\hat{p}$. For ease of exposition, we will drop the subscripts indicating the red/blue split in $\hat{q}_i$, $q_i$, $A_i$, and $\hat{p}$. We will implicitly assume that $A_i$ and $\hat{p}$ correspond to distinct halves of the data matrix, thus being independent. The main technical ingredient required to establish Theorem 1 is the following lemma.
The error of the estimator $\hat{q}$ satisfies
Notice that the right-hand side of the bound in Lemma 1 (proved in Appendix B) involves the error of $\hat{p}$, and can in turn be used to bound the error of the estimator $\hat{q}$. The first term in the bound is a sum of independent, bounded, zero-mean random variables $(A_j - q p_j) p_j$, for $j = 1, \dots, m$. Using Hoeffding’s inequality, we show in Appendix C that, for any $t > 0$,
$$\Pr\left( \left| \frac{1}{\lVert p \rVert^2} \sum_{j=1}^m (A_j - q p_j)\, p_j \right| \ge t \right) \le 2 \exp\left( -2 t^2 \lVert p \rVert^2 \right).$$
In order to bound the second and third terms on the right-hand side of Lemma 1, we resort to matrix concentration inequalities and the Davis-Kahan theorem (davis1970rotation). More precisely, we have the following lemma, which we prove in Appendix D.
The error of the estimator $\hat{p}$ satisfies, for $t > 0$, the expectation bound (a) and the tail bound (b), where $c_1$ and $c_2$ are constants specified in Appendix G.
Given (11) and the bounds in Lemma 2, it is straightforward to establish Theorem 1, as we do next by combining Lemmas 1 and 2 with equation (11) to build a confidence interval for $\hat{q}$. To that end, fix some $\delta > 0$. Lemma 2(b) then implies that
Hence, if ,
Notice that (12) is a vacuous statement whenever its right-hand side is greater than $1$; in that regime the inequality holds trivially, so the bound can be applied uniformly. The result in Theorem 1 then follows.
4 Leveraging confidence intervals for adaptivity
In the pairwise sequence alignment problem, one is typically only interested in identifying pairs of reads with large overlaps. Hence, by discarding pairs of reads with a small overlap based on a coarse alignment estimate and adaptively picking the pairs of reads for which a more accurate estimate is needed, it is possible to save significant computational resources. Similarly, in crowdsourcing applications, one may be interested in employing adaptive schemes in order to effectively use worker resources to identify only the most popular items.
We consider two natural problems that can be addressed within an adaptive framework. The first one is the identification of the top-$k$ largest alignment scores. In the second problem, the goal is to return a list of reads with high pairwise alignment to the reference, i.e., all reads with $\theta_i$ above a certain threshold $\gamma$. More generally, we consider the task of identifying a set of reads including all reads with pairwise alignment above $\gamma$ and no reads with pairwise alignment below $\gamma - \epsilon$, for some $\epsilon > 0$. Adaptivity in the first problem can be achieved by casting the problem as a top-$k$ multi-armed bandit problem, while the second problem can be cast as a Thresholding Bandit problem (locatelli2016optimal).
4.1 Identifying the top-$k$ alignments:
We consider the setting where we wish to find the $k$ largest pairwise alignments with a given read. We assume that we have a total computational budget of $T$ min-hash comparisons. Notice that the regime where $T < n$ is uninteresting, as we cannot even make one min-hash comparison per read. When $T \ge n$, a simple non-adaptive approach is to divide our budget evenly among all $n$ reads (as done in baharav2019spectral). This gives us an $n \times (T/n)$ min-hash collision matrix $A$, from which we can estimate $\theta$ using Algorithm 1 and choose the top $k$ alignments based on their $\hat{\theta}_i$ values. Let $\theta_{(1)} \ge \theta_{(2)} \ge \dots \ge \theta_{(n)}$ be the sorted entries of the true $\theta$ and define the gaps $\Delta_i = \theta_{(i)} - \theta_{(k+1)}$ for $i \le k$ and $\Delta_i = \theta_{(k)} - \theta_{(i)}$ for $i > k$. Notice that the non-adaptive approach recovers the top-$k$ correctly if each estimate stays within $\Delta_i / 2$ of its true value. From the union bound and Theorem 1,
if the per-read budget $T/n$ is sufficiently large. Hence, the budget required to achieve an error probability of $\delta$ is
Moreover, from (13), we see that a given budget allows us to correctly identify the top-$k$ alignments only if the gaps $\Delta_i$ are sufficiently large. Hence, the budget places a constraint on the minimum gap that can be resolved.
Next, we propose an adaptive way to allocate the same budget $T$. Algorithm 2 builds on an approach by Karnin2013, but incorporates the spectral estimation approach from Section 3. The algorithm assumes the regime $T \ge n$.
At the $r$-th iteration, Algorithm 2 uses new hash functions and computes the min-hash collisions for the reads in $S_r$, which are represented by the matrix $A^{(r)}$. Notice that we assume that the min-hashes in each iteration are different, which makes the observation matrices $A^{(1)}, A^{(2)}, \dots$ all independent. Also, we assume that the norm of the right singular vector of $\mathbb{E}[A^{(r)}]$, $\lVert p \rVert$, can be obtained exactly at each iteration of the algorithm. As discussed in Section 2, this makes the model identifiable and can be emulated in practice with calibration reads. At each iteration, Algorithm 2 eliminates half of the reads in $S_r$. After $r$ iterations, the number of remaining reads satisfies $|S_r| \le n/2^r$, and the total budget used is at most $T$. Finally, in the “clean up” stage, we use the remaining budget to obtain the top $k$ among the remaining items.
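The halving loop can be sketched with a generic Sequential Halving skeleton (after Karnin2013). For clarity, this sketch scores arms by plain empirical means rather than the spectral estimator of Section 3; `pull_mean(i, t)`, which stands for $t$ fresh observations of item $i$, is an assumption of the sketch.

```python
import numpy as np

def sequential_halving_topk(pull_mean, n, k, budget):
    """Return k candidate indices out of n items using roughly `budget`
    total observations. pull_mean(i, t) must return the empirical mean
    of t fresh observations of item i (fresh randomness on every call)."""
    survivors = list(range(n))
    num_rounds = max(1, int(np.ceil(np.log2(n / k))))
    per_round = budget // num_rounds
    for _ in range(num_rounds):
        t = max(1, per_round // len(survivors))
        scores = {i: pull_mean(i, t) for i in survivors}
        survivors.sort(key=lambda i: scores[i], reverse=True)
        # Keep the better half, but never fewer than k arms.
        survivors = survivors[: max(k, len(survivors) // 2)]
        if len(survivors) == k:
            break
    return survivors[:k]
```

Because eliminated arms receive no further pulls, most of the budget concentrates on the surviving (hard-to-separate) arms, which is the source of the adaptive savings.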
Comparing (15) and (16) with the non-adaptive counterparts (13) and (14) requires a handle on the gap-dependent hardness quantity, which captures how difficult it is to separate the top-$k$ alignments from the rest. The extreme case occurs when all of the suboptimal items have very similar qualities; in this case, adaptivity is not helpful. However, when the gap between the top-$k$ and the remaining alignments is large, the adaptive budget requirement is substantially smaller, and a given budget allows us to correctly identify the top-$k$ alignments with a gap that is significantly smaller than the one afforded in the non-adaptive case. As a concrete example, suppose that out of the $n$ alignments, there are a few highly overlapping reads, a moderate number of moderately overlapping reads, and the rest have essentially no overlap. In this case, Algorithm 2 requires a much smaller budget than the non-adaptive approach.
4.2 Identifying all alignments above a threshold:
In this section, we develop a bandit algorithm to return a set of coordinates of $q$ such that, with high probability, all coordinates with $q_i > \gamma$ are returned and no coordinates with $q_i < \gamma - \epsilon$ are returned, for some $\epsilon > 0$. We assume that $\gamma$ and $\epsilon$ are known. The algorithm and analysis follow similarly to locatelli2016optimal.
For any $\delta > 0$, the estimates $\hat{q}_i$ from Algorithm 1, run with an observation matrix containing enough workers' responses, will satisfy $|\hat{q}_i - q_i| \le \epsilon/2$ with probability at least $1 - \delta$. Thus, if we return the coordinates with $\hat{q}_i > \gamma - \epsilon/2$, then with probability $1 - \delta$ the guarantee above holds for all $i$.
Notice that a naive non-adaptive approach to this thresholding problem would consist of applying Algorithm 1 to the entire observation matrix, with enough workers such that the confidence intervals are smaller than $\epsilon/2$, and returning the coordinates with value more than $\gamma - \epsilon/2$. In that case, Lemma 3 gives the number of workers’ responses that suffice to succeed with probability at least $1 - \delta$. In Algorithm 3, we propose an adaptive way of performing the same task.
Line 14 is used to make sure that enough budget remains for the clean-up stage. Selecting arms uniformly at random is simply given as a concrete way for the algorithm to run.
Let $\Delta_i = |q_i - \gamma| + \epsilon/2$ denote the gap of coordinate $i$ to the threshold. Define
See Figure 2 for an illustration. Further, let $\Delta_{(1)} \le \Delta_{(2)} \le \dots \le \Delta_{(n)}$ be the sorted list of the $\Delta_i$.
The proof of the theorem is in Appendix F.
Theorem 3 enables us to construct confidence intervals of width $\epsilon/2$ for only the “most borderline” alignments, and near-optimal gap-dependent confidence intervals for the rest, while the non-adaptive algorithm would need to construct confidence intervals of width $\epsilon/2$ for all of them.
5 Empirical Results
In order to validate the Adaptive Spectral Top-$k$ algorithm, we conducted two types of experiments:
controlled experiments on simulated data for a crowdsourcing model with symmetric errors;
pairwise sequence alignment experiments on real DNA sequencing data.
We consider the top-$k$ identification problem in both cases. We run Algorithm 2 with some slight modifications, namely halving until we have fewer than $2k$ remaining arms before moving to the clean-up step, and compare its performance with the non-adaptive spectral approach. Further experimental details are in Appendix I. We measure success in two ways. First, we consider the error probability of returning the exact top-$k$ items (i.e., any deviation from the top-$k$ is considered a failure). Second, we consider a less stringent metric, where we allow the algorithm to return its top-$k$ items and evaluate performance by the fraction of the true top-$k$ items that are present. Our code is publicly available online at github.com/TavorB/adaptiveSpectral.
Controlled experiments: We consider a crowdsourcing scenario with symmetric errors as modelled in (8). We want to determine the best $k$ products from a list of $n$ products. We generate the true product qualities (that is, the $\theta_i$ parameters) independently from a Beta distribution. Each of the worker abilities is drawn from a Uniform distribution, independent of everything else. We consider the problem of top-$k$ product detection at various budgets, as shown in Figure 3(a), with success rate measured by presence in the top-$k$ items. We see that the adaptive algorithm requires significantly fewer worker responses to achieve performance equal to the non-adaptive one.
In Figure 3(b) we consider the same setup as above but in the fixed-confidence setting, considering the problem of detecting all products that are liked by more than a fraction $\gamma$ of the population while returning none that are liked by less than a fraction $\gamma - \epsilon$ of the population. Again, we see that for the same probability of error the adaptive algorithm needs far fewer workers.
Real data experiments: Using the PacBio E. coli dataset (PBecoli) that was examined in baharav2019spectral, we consider the problem of finding, for a fixed reference read, the $k$ reads in the dataset that have the largest alignment with the reference read. We show the fraction of the true top-$k$ reads that are present when returning $k$ reads in Figure 3(c), and the success probability when returning exactly $k$ reads in Figure 3(e) (i.e., the probability of returning exactly the top-$k$ reads). To achieve an error rate of 0.9%, the non-adaptive algorithm requires over 8500 min-hash comparisons per read, while the adaptive algorithm requires fewer than 6000 per read to reach an error rate of 0.1%.
Motivated by applications in sequence alignment, we considered the problem of efficiently finding the largest elements in the left singular vector of a binary matrix $A$ with $\mathbb{E}[A] = q p^\top$. To utilize the natural spectral estimator of $q$, we designed a method to construct confidence intervals around the spectral estimate. To perform this spectral estimation efficiently, we leveraged multi-armed bandit algorithms to adaptively estimate the entries of the leading left singular vector to the necessary degree of accuracy. We showed that this method provides computational gains both on real data and in controlled experiments.
Over the last decade, high-throughput sequencing technologies have driven down the time and cost of acquiring biological data tremendously. This has caused an explosion in the amount of available genomic data, allowing scientists to obtain quantitative insights into the biology of all living organisms. Countless tasks – such as gene expression quantification, metagenomic sequencing, and single-cell RNA sequencing – heavily rely on some form of pairwise sequence alignment, which is a heavy computational burden and often the bottleneck of the analysis pipeline. The development of efficient algorithms for this task, which is the main outcome of this paper, is thus critical for the scalability of genomic data analysis.
From a theoretical perspective, this work establishes novel connections between a classical problem in bioinformatics (pairwise sequence alignment), spectral methods for parameter estimation from crowdsourced noisy data, and multi-armed bandits. This will help facilitate the transfer of insights and algorithms between these traditionally disparate areas. It will also add a new set of techniques to the toolbox of the computational biology community that we believe will find a host of applications in the context of genomics and other large-scale omics data analysis. Further, this work will allow other Machine Learning researchers unfamiliar with bioinformatics to utilise their expertise in solving new problems at this novel intersection of bioinformatics, spectral methods, and multi-armed bandits.
G.M. Kamath would like to thank Lester Mackey of Microsoft Research, New England Lab for useful discussions on Rank-one models and their connection to crowd-sourcing. The research of T. Baharav was supported by the Alcatel Lucent Stanford Graduate Fellowship and NSF GRFP. The research of I. Shomorony was supported in part by NSF grant CCF-2007597.
Appendix A A Rank-Two Model
Consider the case where the $(i,j)$-th observation is the output of a $\text{Ber}(\theta_i)$ random variable passed through a general binary channel (see Figure 1(c)). Here $\alpha_j$ is the probability of a $0$ being flipped to a $1$ and $\beta_j$ is the probability of a $1$ being flipped to a $0$ on the $j$th column.
We note that we have $n + 2m$ parameters here, while a rank-one model would admit only $n + m$ parameters. Hence this is not a rank-one model. However, we note that
$$\mathbb{E}[A_{ij}] = \alpha_j + \theta_i (1 - \alpha_j - \beta_j),$$
so that $\mathbb{E}[A] = \mathbf{1}\alpha^\top + \theta(\mathbf{1} - \alpha - \beta)^\top$. This shows that, when the noise in the workers’ responses is modelled by a general binary channel (see Figure 1(c)), we have a rank-two model.
Appendix B Proof of Lemma 1
We start by using the triangle inequality to obtain
To bound the first term, we first notice that since
and is independent of , we have that
From the triangle inequality, we then have
From the fact that and the fact that for any , the second term can be bounded as