Chaining with overlaps revisited

01/19/2020 ∙ by Veli Mäkinen, et al. ∙ 0

Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments as input. Depending on the optimization criteria and the exact definition of a chain, there are several O(n log n) time algorithms to solve this problem optimally, where n is the number of input anchors. In this paper, we focus on a formulation allowing the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm to consider a strict definition of precedence relation on anchors, adding the required derivation to convince on the correctness of the resulting algorithm that runs in O(n log^2 n) time on anchors formed by exact matches. With the more relaxed definition of precedence relation considered by Shibuya and Kurochin or when anchors are non-nested such as matches of uniform length (k-mers), the algorithm takes O(n log n) time. We also establish a connection between chaining with overlaps to the widely studied longest common subsequence (LCS) problem.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As optimal alignment of two strings takes quadratic time (which has recently been shown to be conditionally hard to improve [3]

), there are several heuristic approaches in the literature to accomplish this important task using reasonable resources. Among these, chaining algorithms represent a rigorous algorithmic approach to solve optimally a slightly easier problem: Given a precomputed set of plausible anchoring local matches, extract a chain of matches that forms a good (semi-global) alignment.

If anchors are not allowed to overlap in the solution, there are several time solutions for various formulations of the chaining problem [11, 7, 1, 2], where is the number of anchors. Some of the solutions and extensions focus on asymmetric measures, where overlaps are allowed in one of the strings [9, 10], or add other features that make the problem even harder [14]. While these formulations are useful in different contexts, the asymmetric measures can return different values depending on which of the coordinates is considered in the solver. This is an undesirable consequence in, e.g., string alignment, where the solution may be different depending on which string is used to traverse the ordered anchors, and specifically the solver may over count the amount of aligned characters.

The fully symmetric chaining variant allows arbitrary overlaps, guarantees not to over count the amount of aligned characters, and in addition, is particularly important for its connections to the Longest Common Subsequence problem (LCS): An optimal chain in this formulation corresponds to a LCS of the input strings, restricted to the matches included in the anchors. As far as we know, except for trivial time solutions, only Shibuya and Kurochkin [13] have proposed a solution aiming to solve the fully symmetric case of allowing overlaps of anchors in both strings simultaneously. However, while considering overlaps properly, the algorithm by Shibuya and Kurochkin [13] assumes a relaxed precedence order for the anchors.

In this paper, we revisit this algorithm and propose a modification that takes into account a strict order for the anchors. This modified algorithm runs in time on exact matches as input. When relaxing the precedence order or when the input consist of non-nested anchors such as -mer matches, the algorithm can be simplified to take time. The resulting algorithms are slightly simpler than the original [13], requiring only a general data structure for semi-dynamic range maximum queries, while the original uses in addition a tailored structure. We also provide detailed derivation of the algorithms, while the original [13] comes with no proof of correctness. Finally, we show that the relaxed chaining problem also solves a restricted version of the LCS problem.

2 Chaining problems

Let be a long text string and short pattern string. An anchor interval pair denotes a match between and . For now, we assume these matches are precomputed, and they could be either full identities or close similarities. We denote the endpoints of the intervals in anchor as for . Given two anchors and we define two relations: precedence and overlap. The former is denoted and this relation holds whenever , , , and . The latter is denoted and holds whenever or . The complement of the overlap relation is denoted for (an empty intersection is interpreted as truth value False). Figure 1 illustrates these concepts.

(a) No overlaps

(b) One-sided overlaps

(c) Larger overlap in the first dimension

(d) Larger overlap in the second dimension
Figure 1: Precedence and overlaps of anchors separated into different cases related to the chaining algorithms. Dotted and solid rectangles denote anchors and , respectively. In (a) and (b) no overlaps are allowed in the first dimension, while in (c) and (d) anchors are assumed to overlap in the first dimension.
Problem 1 (Chaining with overlaps).

Let be an array of anchor interval pairs . For each , , compute the symmetric ordered coverage score , where

  • is an ordered subset (chain) of pairs from ,

  • , for all ,

  • , and

Notice that for chains containing no overlaps, is the sum of lengths of the anchors in it, where length is defined as the minimum of the interval lengths. For overlapping cases, only the segment after the overlap is added to the score.

We develop an time algorithm to solve this problem assuming one additional property of the input:

  • Equal Match Length property: For each anchor it holds .

If the set of anchors is computed e.g. by Maximal Exact Matches (MEMs) [8], the input automatically satisfies the Equal Match Length property.

Our algorithm is based on techniques by Shibuya and Kurochkin [13], who solved a version of the problem with the definition of precedence relaxed to consider only start points of intervals: weakly precedes if and . Let us denote this relation .

Problem 2 (Chaining with overlaps and weak precedence).

Let be an array of anchor interval pairs . For each , , compute the symmetric weakly ordered coverage score , defined as in Problem 1, with the precedence condition relaxed to

  • , for all .

To see the connection of the problems, consider a chain for which holds but not for some . That is, at least one of the intervals of is nested inside the corresponding interval of . Say is nested in with (the other case is symmetric). Consider modifying into , where and . Assuming Equal Match Length property such adjustment is possible and causes without affecting the score. One can thus adjust any chain for which the weak precedence relation holds into another chain , where the (strict) precedence relation holds, so that .

As can be seen from the above construction, the two problems are identical when the input anchors are non-nested. This happens e.g. when anchors are matches of uniform length (-mer matches). Even more importantly, if one is only interested in the overall maximum scoring chain, the two problems produce the same result.

Lemma 1.

Assuming Equal Match Length property, the solutions

from Problem 1 and

from Problem 2 are the same.

Proof.

Consider an optimal chain for Problem 2. If does not hold, then one of the intervals of is nested inside the corresponding interval of . This means that for it holds . Continuing this induction, one observes that there is an overall maximum scoring chain, say , that ends with strict precedence. Consider now the construction given before this lemma that converts into . Adjust the construction so that instead of modifying into , just remove . This gives with strict precedence holding and without change in the score, that is, is an optimal solution to both problems. ∎

The motivation behind studying symmetric formulations of this problem is that the solution is guaranteed to be less or equal to the amount of ordered common characters between and

. This is important in various applications where one is interested in the alignment between two strings (such as in Bioinformatics). Asymmetric formulations can over estimate this amount. A further justification for the relevance of the definitions is the connection to the widely studied

Longest Common Subsequence (LCS) problem: String is a LCS of strings and if it is a longest string that can be obtained by deleting or more characters from both and from . Such can be written as and as , where and . Consider the set of anchors being exact matches between and . We say that is an anchor-restricted LCS if it is a longest string with all characters in appearing in increasing order in T and P where each such occurrence is contained in at least one anchor. Such can be written as and as (defined above) such that for each there is an anchor in with and for some , . We now show that an anchor-restricted LCS can be found by solving the problem of chaining under the weak precedence (Problem 2).

Theorem 2.

Assume the anchors are exact matches between input strings and . The score of a chain such that of Problem 2 equals the length of an anchor-restricted LCS of and .

Proof.

As discussed above, we can assume is a chain under the strict precedence order. Each anchor in contributes to the score by the minimum length of its intervals after the overlaps with the previous anchor intervals have been cut out. This minimum length equals the number of characters that can be included to the common subsequence. That is, we can extract an anchor-restricted subsequence of and of length from the solution. We need to show that such subsequence is the longest among anchor-restricted subsequences. Assume, for contradiction, that there is an anchor-restricted LCS longer than . Consider the chain of anchors formed by taking for each an anchor containing match . Assign a score to each anchor included in the chain. Let us modify this chain into a chain where weak precedence holds such that the total score (number of matches induced by the solution) remains the same. First, we merge from left to right all runs of identical anchors; score of an anchor is then the length of the run in the original chain. Then we consider anchors from left to right. Consider the first pair of anchors in the current chain for which (case is symmetric). Let the score of be . By construction, we know that the left-most position possible for the first match of included in is . Therefore, we can remove from the chain and include matches from . The total score does no decrease by this change. This process can be repeated until the weak precedence relation holds up to , and then continued similarly to the end of the chain, yielding a contradiction. ∎

Shibuya and Kurochkin [13] gave an time algorithm for Problem 2, but their algorithm comes with no proof of correctness. Our goal in this paper is to complement the original proposal with the required derivation steps to see that one can indeed solve the problem correctly in time. Instead of proving directly the correctness of the original proposal, we derive a simplified version of the algorithm, whose correctness is easier to verify.111Uricaru et al. [14] claim that also algorithms by Felsner et al. [7] and Abouelhoda and Ohlebusch [2] could be extended to handle overlaps. We think this is only possible if using asymmetric measure of overlaps [9] or a sum of extensions [14] instead of : With sum as the measure one can use identical methods as used in breaking gap costs [2], but with minimum, as we will see, more cases need to be considered.

We derive this algorithm in three steps: First we consider one-sided overlaps of anchors. Then we modify this algorithm to handle two-sided overlaps of anchors, solving Problem 1. Finally, we show that the use of strict precedence relation can be relaxed to in order to solve Problem 2.

3 Chaining algorithms

Our goal is here to study the variations of chaining algorithms under the symmetric ordered coverage. We will give chaining algorithms under the symmetric ordered coverage and equal-match property taking time. In order to do this we will structure the recurrence relations that solve Problems 1 and 2 such that one can factor out dependencies between anchors into different cases that are handled by evaluation order of the recurrences, range search, and special features of the scoring function. Assume now that the anchor interval pairs are stored in an array in arbitrary order. We fill a table so that gives the maximum symmetric ordered coverage of using the pair and any subset of pairs that precede : Hence, gives the total maximum symmetric ordered coverage.

After considering separately non-overlapping and overlapping cases (see Fig. 1), one observes that can be computed by , where

These recurrences can be computed in time: Sort by values to handle one dimension of the precedence relation. Then compute each in this order by scanning previously computed values and check precedence in the other dimension. Add the coverage values ( part) depending on the overlap relation. Select the maximum among the options of added with the coverage value.

By assuming Equal Match Length property, we can simplify the recurrence of with

3.1 One-sided overlaps

We will now present an algorithm that works for one-sided overlaps (see Fig. 1): We restrict the chains so that no two anchors in the solution overlap in the first coordinate (that is, in ). This lets us modify the recurrence of into the form

That is, we added the constraint on overlaps, removed the then obsolete and took out the values not affected by s. Now it is easy to see that the evaluation of the values can be done when visiting the starting points of the anchors in the first coordinate, and the maximizations over range of values can be done using search trees, specified in the next lemma.

Lemma 3.

The following three operations can be supported with a one-dimensional range search tree in time , where is the number of search keys inserted to the tree.

  • : Update associated with into .

  • : Update associated with into .

  • : Return maximum , where (Range Maximum Query).

Moreover, the search tree can be built in time, given the pairs sorted by component .

We will later need also a two-dimensional version of this structure.

Lemma 4.

The following two operations can be supported with a two-dimensional range search tree in time , where is the number of search keys inserted to the tree.

  • : Update associated with and into .

  • : Update associated with and into .

  • : Return maximum , where and (2D Range Maximum Query).

Moreover, the search tree can be built in time, given the triplets sorted first by primary key and then by secondary key.

These lemmas follow directly by maintaining maxima of values in each subtree for the corresponding standard range search structures [6] that support listing all the (key, value) pairs in a range. It is essential to build the two-dimensional range search tree with initial (primary key, secondary key, value) triplets, as it does not support efficient dynamic insertions and deletion of keys; for the one-dimensional case we could replace updates and upgrades by insertions and deletions.

Input: A set of interval pairs with all interval endpoints being distinct positive integers.
Output: Array containing the symmetric ordered coverage values.
Initialize one-dimensional search trees and with keys , , and with key , all keys associated with values ;
;
;
;
for  to  do
       ;
       ;
       if  then
             ;
             ;
             ;
            
       end if
      else
             ;
             ;
            
       end if
      
end for
return ;
Algorithm 1 Chaining allowing one-sided overlaps.

We obtain Algorithm 1 to handle the one-sided overlaps case. The pseudocode of Algorithm 1 assumes interval endpoints to be distinct. This assumption is only used for the ease of presentation. It can be relaxed by the standard method used in computational geometry: Replace each endpoint by a pair where identifies the anchor in question. These pairs are distinct, and can be used as the keys of the search trees (in place of just ). Range queries can be implemented to ignore the secondary key .

Lemma 5.

Problem 1 on input pairs restricted to solutions that contain only one-sided overlaps can be solved in time, assuming the input satisfies Equal Match Length property.

Proof.

The evaluation order of Algorithm 1 guarantees that when computing the values and , the data structures contain only anchors that precede the current anchor and do not overlap it in the first coordinate. The range query on guarantees that we also consider only those anchors that precede and do not overlap in the second coordinate for the computation of . The range query on guarantees that we also consider only those anchors that overlap in the second coordinate for the computation of , but this is not enough to guarantee predecessor-relation to hold. That is, there can be an anchor stored in with and and thus the evaluation order and range query fail to guarantee to make (recall the definition). We need to show that if such is in an optimal chain to , there is always another optimal chain to not including . Consider the last anchor in an optimal chain to that overlaps and precedes . Then we know that and , so a chain where directly precedes does not decrease the score. If such does not exist but an optimal chain to includes , we have that , as all anchors in an optimal chain to , excluding , are stored in , and adding (contribution of ) to the score is at least as much as (contribution of of and ). ∎

3.2 Two-sided overlaps

The trick by Shibuya and Kurochkin [13] to handle two-sided overlaps is to separate them to two cases (see Fig. 1): (c) overlaps in the first dimension are at least as long as in the second dimension and (d) overlaps are longer in the second dimension. Since our algorithm so far considers all anchors that do not overlap in the first dimension, it will be enough to consider how to enhance the algorithm to handle anchors that do overlap in the first dimension.

Consider case (c). That is, for any two pairs of anchors , , it holds , , and . The latter inequality can be written as (due to Equal Match Length property). Also, if precedes in an optimal chain to , the score calculated up to will increase by inclusion of by (due to Equal Match Length property). This means that once we first stop at anchor in our algorithm, if we have inserted to a search tree all anchors s that overlap in the first dimension, using keys and values , we can query and add to obtain the correct score for this case. However, in the order we evaluate the anchors we can only guarantee and thus (property of case (c)), but not or . To solve this, we add another dimension to the search tree, so we can add constraint to the query, which also covers the remaining constraint (property of case (c)).

Case (d) is almost symmetric to case (c): For any two pairs of anchors , , it holds , , and . The latter inequality can be written as . Also, if precedes in an optimal chain to , the score calculated up to will increase by inclusion of by (due to Equal Match Length property). This means that once we first stop at anchor in our algorithm, if we have inserted to a search tree all anchors s that overlap in the first dimension, using keys and values , we can query and add to obtain the correct score for this case. As before, we need to add another dimension to the search tree to handle constraint , which also covers constraint (property of case (d)). We are left with constraints and , where the first ones follow from the evaluation order, but now the latter is not automatically guaranteed to hold: Using arguments analogous to the proof of Lemma 5, we show that such nested case cannot change the optimal solution.

The resulting enhancement to handle two-sided overlaps is given as Algorithm 2.

Input: A set of interval pairs with all interval endpoints being distinct positive integers.
Output: Array containing the symmetric ordered coverage values.
Initialize one-dimensional search trees and with keys , , and with key , all keys associated with values ;
Initialize two-dimensional search trees and with keys and , respectively, for , associated with values ;
;
;
;
for  to  do
       ;
       ;
       if  then
             ;
             ;
             ;
             ;
             ;
             ;
             ;
            
       end if
      else
             ;
             ;
             ;
             ;
            
       end if
      
end for
return ;
Algorithm 2 Chaining with two-sided overlaps.

The pseudocode of Algorithm 2 assumes interval endpoints to be distinct, but this can be relaxed as in the proof of Lemma 5. Using the data structure from Lemmas 3 and 4 we obtain the following result.

Theorem 6.

Problem 1 on input pairs can be solved in time (by Algorithm 2), assuming the input satisfies Equal Match Length property.

Proof.

As discussed earlier, it is sufficient to consider anchors and that satisfy the precedence and overlap relations except for not holding, as all other constraints are properly covered by the combination of evaluation order and the queries. Such invalid anchors can affect the query results from data structures and when computing the score for . Consider that an optimal chain to has as the previous anchor and thus . Consider the last anchor in an optimal chain to that precedes and overlaps . Assume the overlap is larger in the first dimension (the other case is considered already in the proof of Lemma 5). Then as must overlap in the first dimension, must directly precede for it being the last with this property, and the overlap between and is larger in the first dimension due to transitivity. As , the direct use of before gives . That is, can be omitted from the optimal path. ∎

3.3 Overlaps with weak precedence

Let us now proceed to improve the running time of Algorithm 2 to by considering chains under the weak precedence relation (Problem 2). For this, we drop the second dimension of the data structures and , that were added to guarantee strict precedence. However, this is not sufficient for proving correctness as we used these constraints to indirectly guarantee precedence of start positions of anchors as well. Case (c) causes no problems, as the evaluation order guarantees that contains anchors with and the query restricts to cases . However, in case (d) the solution returned can have . We will consider this in the proof of the next theorem.

Theorem 7.

Problem 2 on input pairs can be solved in time (by Algorithm 2 with the operations on the second dimension of search trees and omitted), assuming the input satisfies Equal Match Length property.

Proof.

As discussed, it is sufficient to show that queries from correspond to proper solutions. For contradiction, assume that , , and only for s for which . Such solution is not proper (weak precedence not holding), so we need to show that there is an equivalently good proper solution. First, if it also holds , we have the nested case handled already in the proof of Theorem 6. It is left to consider case where and hold. Consider cutting subinterval of length from the end of both intervals of of to create an anchor that replaces in the set of input anchors. Consider two cases a) is non-empty and b) is empty. In case a), for all hold . Then is the score of an optimal chain to . But since is nested in , we have already seen that there is an equivalently good chain to omitting . In case b), we have that . This case is illustrated in Fig. 2. Assume that an optimal chain to covers fully the interval in the second dimension. Consider the last anchor in this optimal chain that overlaps position in the second dimension. It holds . From case a) we know that there is a proper optimal chain to that avoids with score . Thus, there is a proper optimal chain to that avoids with score . If there is no such that overlaps position , then either i) is nested in or ii) . In case i), we know that there is an optimal chain to that avoids with . Thus, there is an optimal chain to that avoids with . In case ii), . As in all cases we can replace with another anchor that yields at least the same score for , we get a contradiction with our counterargument. ∎

Figure 2: Dotted and solid rectangles denote and , respectively. Here the (weak) precedence is not holding as the interval of in the second dimension succeeds the corresponding interval of . The thick line segment represents maximum coverage a chain ending at (not including ) can achieve before entering the interval of . The algorithm subtracts this thick line segment length and the length of from the score, so that there is at least as good chain to that avoids using .

4 Discussion

We studied symmetric chaining formulations starting from the motivation to avoid over counting the matches of local anchors. This over counting can also be avoided using the asymmetric formulation studied in Sect. 3.1, and this formulation is actually a special case of the chaining between a sequence and a DAG [10]; it appears hard to extend the fully symmetric chaining formulations we studied here to work with DAGs. We are currently working on a practical alternative that uses a multiple alignment in place of a DAG, so that we can use our new methods for long read sequence alignment in transcript prediction.

References

  • [1] M. I. Abouelhoda and E. Ohlebusch (2003) Multiple genome alignment: chaining algorithms revisited. See Combinatorial pattern matching, 14th annual symposium, CPM 2003, morelia, michocán, mexico, june 25-27, 2003, proceedings, Baeza-Yates et al., pp. 1–16. External Links: Link, Document Cited by: §1.
  • [2] M. I. Abouelhoda and E. Ohlebusch (2005) Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3 (2-4), pp. 321–341. External Links: Link, Document Cited by: §1, footnote 1.
  • [3] A. Backurs and P. Indyk (2015) Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). See Proceedings of the forty-seventh annual ACM on symposium on theory of computing, STOC 2015, portland, or, usa, june 14-17, 2015, Servedio and Rubinfeld, pp. 51–58. External Links: Link, Document Cited by: §1.
  • [4] R. A. Baeza-Yates, E. Chávez, and M. Crochemore (Eds.) (2003)

    Combinatorial pattern matching, 14th annual symposium, CPM 2003, morelia, michocán, mexico, june 25-27, 2003, proceedings

    .
    Lecture Notes in Computer Science, Vol. 2676, Springer. External Links: Link, Document, ISBN 3-540-40311-6 Cited by: 1.
  • [5] K. L. Clarkson (Ed.) (1995) Proceedings of the sixth annual ACM-SIAM symposium on discrete algorithms, 22-24 january 1995. san francisco, california, USA. ACM/SIAM. External Links: Link, ISBN 0-89871-349-8 Cited by: 11.
  • [6] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars (2008) Computational geometry: algorithms and applications, 3rd edition. Springer. External Links: Link, ISBN 9783540779735 Cited by: §3.1.
  • [7] S. Felsner, R. Müller, and L. Wernisch (1997) Trapezoid graphs and generalizations, geometry and algorithms. Discrete Applied Mathematics 74 (1), pp. 13–32. External Links: Link, Document Cited by: §1, footnote 1.
  • [8] D. Gusfield (1997) Algorithms on strings, trees, and sequences - computer science and computational biology. Cambridge University Press. External Links: Link, Document, ISBN 0-521-58519-8 Cited by: §2.
  • [9] V. Mäkinen, L. Salmela, and J. Ylinen (2012) Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics 13, pp. 255. External Links: Link, Document Cited by: §1, footnote 1.
  • [10] V. Mäkinen, A. I. Tomescu, A. Kuosmanen, T. Paavilainen, T. Gagie, and R. Chikhi (2019) Sparse dynamic programming on DAGs with small width. ACM Trans. Algorithms 15 (2), pp. 29:1–29:21. External Links: Link, Document Cited by: §1, §4.
  • [11] G. Myers and W. Miller (1995) Chaining multiple-alignment fragments in sub-quadratic time. See Proceedings of the sixth annual ACM-SIAM symposium on discrete algorithms, 22-24 january 1995. san francisco, california, USA, Clarkson, pp. 38–47. External Links: Link Cited by: §1.
  • [12] R. A. Servedio and R. Rubinfeld (Eds.) (2015)

    Proceedings of the forty-seventh annual ACM on symposium on theory of computing, STOC 2015, portland, or, usa, june 14-17, 2015

    .
    ACM. External Links: Link, ISBN 978-1-4503-3536-2 Cited by: 3.
  • [13] T. Shibuya and I. Kurochkin (2003) Match Chaining Algorithms for cDNA Mapping. In Algorithms in Bioinformatics, G. Benson and R. D. M. Page (Eds.), Berlin, Heidelberg, pp. 462–475. External Links: ISBN 978-3-540-39763-2 Cited by: §1, §1, §2, §2, §3.2.
  • [14] R. Uricaru, A. Mancheron, and E. Rivals (2011) Novel definition and algorithm for chaining fragments with proportional overlaps. Journal of Computational Biology 18 (9), pp. 1141–1154. External Links: Link, Document Cited by: §1, footnote 1.