Longest Common Prefixes with k-Errors and Applications

01/13/2018
by   Lorraine A. K. Ayad, et al.

Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ over a constant-sized alphabet that occurs elsewhere in the string with $k$-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant $k$ and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $O(n \log^k n \log\log n)$ time on average using $O(n)$ space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.

1 Introduction

The longest common prefix (LCP) array is a commonly used data structure alongside the suffix array (SA). The LCP array stores the length of the longest common prefix between two adjacent suffixes of a given string as they are stored (in lexicographical order) in the SA [20]. A typical use combining the SA and the LCP array is to simulate the suffix tree functionality using less space [2].

However, there are many practical scenarios where the LCP array may be applied without making use of the SA. The LCP array provides us with essential information regarding repetitiveness in a given string and is therefore a useful data structure for analysing textual data in areas such as molecular biology, musicology, or natural language processing (see [21] for some applications).

It is also quite common to account for potential alterations within textual data (sequences). For example, they can be the result of DNA replication or sequencing errors in DNA sequences. In this context, it is natural to define the longest common prefix with $k$-errors. Given a string $x[0..n-1]$, the longest common prefix with $k$-errors for every suffix $x[i..n-1]$ is the length of the longest common prefix of $x[i..n-1]$ and any $x[j..n-1]$, where $j \neq i$, after applying up to $k$ substitution operations [21]. Some applications are given below.

Interspersed Repeats.

Repeated sequences are a common feature of genomes. One type in particular, interspersed repeats, is known to occur in all eukaryotic genomes. These repeats have no repetitive pattern and appear irregularly within DNA sequences [15]. Single nucleotide polymorphisms result in the existence of interspersed repeats that are not identical [19]. Identifying these repeats has been linked to genome folding locations and phylogenetic analysis [24].

Genome Mappability Data Structure.

In [3] the authors showed that, using the longest common prefixes with $k$-errors, they can construct, in $O(n)$ worst-case time, an $O(n)$-sized data structure answering the following type of queries in $O(1)$ time per query: find the smallest $m$ such that at least $\mu$ of the substrings of $x$ of length $m$ do not occur more than once in $x$ with at most $k$ errors. This is a data structure version of the genome mappability problem [8, 21, 4].

Longest Common Substring with $k$-Errors.

The longest common substring with $k$-errors problem has received much attention recently, in particular due to its applications in computational biology [27, 18, 26]. We are asked to find the longest substrings of two strings that are at distance at most $k$. The notion of longest common prefix with $k$-errors is thus closely related to the notion of longest common substring with $k$-errors. We refer the interested reader to [1, 10, 11, 14, 25].

All-Pairs Suffix/Prefix Overlaps with $k$-Errors.

Finding approximate overlaps is the first stage of most genome assembly methods. Given a set of strings and an error-rate $\epsilon$, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within distance $k = \lceil \epsilon \ell \rceil$, where $\ell$ is the length of the overlap [23, 28, 16]. By concatenating the strings to form one single string $x$ and then computing longest common prefixes with $k$-errors for $x$ only against the prefixes of the strings, we have all the information we need to solve this problem.

Our Model.

We assume the standard word-RAM model with word size $w = \Omega(\log n)$. Although real-world text datasets are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. We are thus interested in the average-case behaviour of our algorithms. When we state average-case time complexities for our algorithms, we assume that the input is a string $x$ of length $n$ over an alphabet $\Sigma$ of size $\sigma > 1$ with the letters of $x$ being independent and identically distributed random variables, uniformly distributed over $\Sigma$. In the context of molecular biology we typically have $\Sigma = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$ and so we assume $\sigma = O(1)$.

Related Works.

The problem of computing longest common prefixes with $k$-errors was first studied by Manzini for $k = 1$ in [21]. We distinguish the following techniques that can be applied to solve this and other related problems.

Non-constant $k$ and $\omega(n)$ space:

In this case, we can make use of the well-known data structure by Cole et al [7]. The size of the data structure is , where is a constant.

Constant $k$ and $O(n)$ space:

In this case, we can make use of the technique by Thankachan et al. [25], which builds heavily on the data structure by Cole et al. The working space is exponential in $k$ but $O(n)$ for $k = O(1)$.

Non-constant $k$ and $O(n)$ space:

In this case, there exists a simple -time worst-case algorithm to solve the problem. The best-known average-case algorithm was presented in [3]. It requires time on average, where .

Other related works:

In [14] it was shown that a strongly subquadratic-time algorithm for the longest common substring with $k$-errors problem, for $k = \Omega(\log n)$ and binary strings, refutes the Strong Exponential Time Hypothesis. Thus subquadratic-time solutions for approximate variants of the problem have been developed [14]. A non-deterministic algorithm is also known [1].

Our Contribution.

In this paper, we continue the line of research for non-constant $k$ and $O(n)$ space to investigate the limits of computation in the average-case setting; in particular in light of the worst-case lower bound shown in [14]. We make the following threefold contribution.

  1. We first show a non-trivial upper bound of independent interest: the expected length of the maximal longest common prefix with $k$-errors between a pair of suffixes of $x$ is $O(\log_\sigma n)$ when $k < \frac{\log n}{\log\log n}$.

  2. By applying this result, we significantly improve upon the state-of-the-art algorithm for non-constant $k$ and using $O(n)$ space [3]. Specifically, our algorithm runs in $O(n \log^k n \log\log n)$ time on average using $O(n)$ space.

  3. Notably, we extend our results to the edit distance model with no extra cost, thus solving the genome mappability data structure problem, the longest common substring with $k$-errors problem, and the all-pairs suffix/prefix overlaps with $k$-errors problem in strongly sub-quadratic time for $k < \frac{\log n}{\log\log n}$.

2 Preliminaries

We begin with some basic definitions and notation. Let $x = x[0]x[1] \ldots x[n-1]$ be a string of length $|x| = n$ over a finite ordered alphabet $\Sigma$ of size $|\Sigma| = \sigma = O(1)$. For two positions $i$ and $j$ on $x$, we denote by $x[i..j] = x[i] \ldots x[j]$ the substring (sometimes called factor) of $x$ that starts at position $i$ and ends at position $j$. We recall that a prefix of $x$ is a substring that starts at position 0 ($i = 0$) and a suffix of $x$ is a substring that ends at position $n-1$ ($j = n-1$).

Let $y$ be a string of length $m$ with $0 < m \le n$. We say that there exists an occurrence of $y$ in $x$, or, more simply, that $y$ occurs in $x$, when $y$ is a substring of $x$. Every occurrence of $y$ can be characterised by a starting position in $x$. We thus say that $y$ occurs at the starting position $i$ in $x$ when $y = x[i..i+m-1]$.

The Hamming distance between two strings $x$ and $y$, with $|x| = |y|$, is defined as $d_H(x, y) = |\{i : x[i] \neq y[i],\ i = 0, 1, \ldots, |x|-1\}|$. If $|x| \neq |y|$, we set $d_H(x, y) = \infty$. The edit distance between $x$ and $y$ is the minimum total cost of a sequence of edit operations (insertions, deletions, substitutions) required to transform $x$ into $y$. It is known as Levenshtein distance for unit cost operations. We consider this special case here. If two strings $x$ and $y$ are at (Hamming or edit) distance at most $k$, we say that $x$ and $y$ have $k$-errors or have at most $k$ errors.
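The two distance measures above can be illustrated with a minimal Python sketch. The function names `hamming_distance` and `edit_distance` are chosen here for illustration only, and the convention of returning infinity for strings of different lengths is an assumption of the sketch. The edit distance uses the textbook quadratic dynamic programme with unit costs, not any of the faster techniques discussed later.

```python
import math


def hamming_distance(x: str, y: str) -> float:
    """Number of mismatching positions; infinite when the lengths differ (assumed convention)."""
    if len(x) != len(y):
        return math.inf
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]
```

For instance, `ACGT` and `AGGT` are at Hamming distance 1, while `ACGT` and `CGT` are at edit distance 1 but at infinite Hamming distance under this convention.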

We denote by SA the suffix array of $x$. SA is an integer array of size $n$ storing the starting positions of all (lexicographically) sorted non-empty suffixes of $x$, i.e. for all $1 \le r < n$ we have $x[\mathrm{SA}[r-1]..n-1] < x[\mathrm{SA}[r]..n-1]$ [20]. Let $\mathrm{lcp}(r, s)$ denote the length of the longest common prefix between $x[\mathrm{SA}[r]..n-1]$ and $x[\mathrm{SA}[s]..n-1]$ for positions $r$, $s$ on $x$. We denote by LCP the longest common prefix array of $x$ defined by $\mathrm{LCP}[r] = \mathrm{lcp}(r-1, r)$ for all $1 \le r < n$, and $\mathrm{LCP}[0] = 0$. The inverse iSA of the array SA is defined by $\mathrm{iSA}[\mathrm{SA}[r]] = r$, for all $0 \le r < n$. It is known that SA, iSA, and LCP of a string of length $n$, over a constant-sized alphabet, can be computed in time and space $O(n)$ [22, 9]. It is then known that a range minimum query (RMQ) data structure over the LCP array, that can be constructed in $O(n)$ time and $O(n)$ space [5], can answer lcp-queries in $O(1)$ time per query [20]. The lcp queries are also known as longest common extension (LCE) queries.
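To make the roles of SA, iSA and LCP concrete, here is a deliberately naive Python sketch. It sorts suffixes directly and compares them letter by letter, so it does not achieve the linear-time bounds cited above, and a plain `min` over a slice stands in for the RMQ data structure. All function names are illustrative assumptions.

```python
def suffix_array(x: str) -> list:
    """Starting positions of the suffixes of x in lexicographic order (naive sort)."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def inverse_suffix_array(sa: list) -> list:
    """iSA[SA[r]] = r: the rank of each suffix."""
    isa = [0] * len(sa)
    for r, i in enumerate(sa):
        isa[i] = r
    return isa


def common_prefix_length(u: str, v: str) -> int:
    l = 0
    while l < len(u) and l < len(v) and u[l] == v[l]:
        l += 1
    return l


def lcp_array(x: str, sa: list) -> list:
    """LCP[r] = longest common prefix of the suffixes ranked r-1 and r; LCP[0] = 0."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        lcp[r] = common_prefix_length(x[sa[r - 1]:], x[sa[r]:])
    return lcp


def lcp_query(lcp: list, r: int, s: int) -> int:
    """lcp(r, s) for ranks r < s as a range minimum over LCP[r+1..s]."""
    return min(lcp[r + 1:s + 1])
```

On the string `banana`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, and a query between the ranks of `a` and `anana` returns 1.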

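The permuted LCP array, denoted by PLCP, has the same contents as the LCP array but in different order. Let $\Phi(i)$ denote the starting position of the lexicographic predecessor of $x[i..n-1]$. For $i = 0, \ldots, n-1$, we define $\mathrm{PLCP}[i] = \mathrm{LCP}[\mathrm{iSA}[i]] = \mathrm{lcp}(\mathrm{iSA}[\Phi(i)], \mathrm{iSA}[i])$, that is, $\mathrm{PLCP}[i]$ is the length of the longest common prefix between $x[i..n-1]$ and its lexicographic predecessor. For the starting position $i$ of the lexicographically smallest suffix we set $\mathrm{PLCP}[i] = 0$. For any $k \ge 0$, we define $\mathrm{lcp}_k(y, z)$ as the largest $\ell \ge 0$ such that $y[0..\ell-1]$ and $z[0..\ell-1]$ exist and are at Hamming distance at most $k$; note that this is defined for a pair of strings. We analogously define the permuted LCP array with $k$-errors, denoted by $\mathrm{PLCP}_k$. For $i = 0, \ldots, n-1$, we have that

$$\mathrm{PLCP}_k[i] = \max_{j = 0, \ldots, n-1,\ j \neq i} \mathrm{lcp}_k(x[i..n-1], x[j..n-1]).$$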

The main computational problem in scope can be formally stated as follows.

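PLCP with $k$-Errors
Input: A string $x$ of length $n$ and an integer $0 < k < n$
Output: $\mathrm{PLCP}_k$ and $P_k$; $P_k[i] \neq i$, for $i = 0, \ldots, n-1$, is such that $x[i..i+\ell-1]$ and $x[P_k[i]..P_k[i]+\ell-1]$ have $k$-errors, where $\ell = \mathrm{PLCP}_k[i]$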
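As a reference point for the problem just stated, the following Python sketch computes the output by brute force under the Hamming distance model. It compares every pair of suffixes directly, so it is only meant for checking faster implementations on small inputs and is not the algorithm of this paper. The names `lcp_k` and `plcp_k_bruteforce`, and the use of `None` when no other suffix exists, are assumptions of the sketch.

```python
def lcp_k(x: str, i: int, j: int, k: int) -> int:
    """Longest common prefix of x[i:] and x[j:] allowing at most k substitutions."""
    n = len(x)
    errors = 0
    l = 0
    while i + l < n and j + l < n:
        if x[i + l] != x[j + l]:
            if errors == k:
                break            # a (k+1)-th mismatch ends the prefix
            errors += 1
        l += 1
    return l


def plcp_k_bruteforce(x: str, k: int):
    """Return (PLCP_k, P_k): for each i, the best length and a witness position j != i."""
    n = len(x)
    plcp = [0] * n
    witness = [None] * n
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            l = lcp_k(x, i, j, k)
            if witness[i] is None or l > plcp[i]:
                plcp[i], witness[i] = l, j
    return plcp, witness
```

For `x = "abcabd"` and `k = 1`, position 0 obtains length 3 with witness 3, since `abc` and `abd` differ in a single letter.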

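We assume that $k < \frac{\log n}{\log\log n}$ throughout, since all relevant time-complexities contain an $n \log^k n$ factor and any larger $k$ would force this value to be $\Omega(n^2)$; indeed, for $k = \frac{\log n}{\log\log n}$ we have:

$$n \log^k n = n \left(2^{\log\log n}\right)^{\frac{\log n}{\log\log n}} = n \cdot 2^{\log n} = n^2 .$$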

3 Computing $\mathrm{PLCP}_k$

In this section we propose a new algorithm for the PLCP with $k$-Errors problem under both the Hamming and the edit distance (Levenshtein distance) models. This algorithm is based on a deeper look into the expected behaviour of the longest common prefixes with $k$-errors. This in turn allows us to make use of the $y$-fast trie, an efficient data structure for maintaining integers from a bounded domain. We already know the following result for $k$ errors under the Hamming distance model.

Theorem 3.1 ([3])

Problem PLCP with -Errors for can be solved in average-case time , where , using extra space.

In the rest of this section, we show the following result for $k$ errors under both the Hamming and the edit distance models.

Theorem 3.2

Problem PLCP with $k$-Errors can be solved in average-case time $O(n c^k \log^k n \log\log n)$, where $c > 1$ is a constant, using $O(n)$ extra space.

For clarity of presentation, we first do the analysis and present the algorithm under the Hamming distance model in Sections 3.1 and 3.2. We then show how to extend our technique to work under the edit distance model in Section 3.3.

3.1 Expectations

The expected maximal value in the LCP array is $2\log_\sigma n + O(\log\log n)$ [13]. We can thus obtain a trivial $O(k \log_\sigma n)$ bound on the expected length of the maximal longest common prefix with $k$-errors for arbitrary $k$ and $n$. By looking deeper into the expected behaviour of the longest common prefixes with $k$-errors we show the following result of independent interest for when $k < \frac{\log n}{\log\log n}$.

Theorem 3.3

Let $x$ be a string of length $n$ over an alphabet of size $\sigma > 1$ and $1 \le k < \frac{\log n}{\log\log n}$ be an integer.

  (a) The expected length of the maximal longest common prefix with $k$-errors between a pair of suffixes of $x$ is $O(\log_\sigma n)$.

  (b) There exists a constant $\alpha$ such that the expected number of pairs of suffixes of $x$ with a common prefix with $k$-errors of length at least $\alpha \log_\sigma n$ is $O(1)$.

Proof (a)

Let us denote the $i$th suffix of $x$ by $x_i = x[i..n-1]$. Further let us define the following random variables:

$$X_{i,j} = \mathrm{lcp}_k(x_i, x_j) \quad \text{and} \quad Y = \max_{i \neq j} X_{i,j}.$$

Claim

$\Pr(X_{i,j} \ge m) \le \binom{m}{k} \frac{1}{\sigma^{m-k}}$.

Proof (of Claim)

Each possible set of positions where a substitution is allowed is a subset of one of the $\binom{m}{k}$ subsets of $\{0, \ldots, m-1\}$ of size $k$. For each of these subsets, we can disregard what happens in the chosen positions; in order to yield a match with $k$-errors, the remaining $m-k$ positions must match and each of them matches with probability $\frac{1}{\sigma}$. The claim follows by applying the Union-Bound (Boole’s inequality).∎

By applying the Union-Bound again we have that

for and for . The expected value of is given by:

(Note that we bound the first summand using that for all .)

Claim

Let . We have that for .

Proof (of Claim)

By assuming , for some , we apply the above claim to bound the second summand as follows.

for some constant since and . Then and we can thus pick an large enough such that this sum is for any . ∎

Proof (b)

Let

be the indicator random variable for the event

. We then have that

which we have already shown is if for some .∎

3.2 Improved Algorithm for Hamming Distance

The $y$-fast trie, introduced in [29], supports insert, delete and search (exact, predecessor and successor queries) in $O(\log\log U)$ time with high probability, using $O(N)$ space, where $N$ is the number of stored values and $U$ is the size of the universe. We consider each substring of $x$ of length at most $\alpha \log_\sigma n$, for a constant $\alpha$ satisfying Theorem 3.3 (b), as an $(\alpha \log_\sigma n)$-digit number; note that by our assumptions this number fits in a computer word. We thus have $U = \sigma^{\alpha \log_\sigma n} = n^{\alpha}$ and hence $\log\log U = O(\log\log n)$.

We initialise and for each based on the longest common prefix of (i.e. not allowing any errors) that occurs elsewhere using the SA and the LCP array; this can be done in time. For each pair of suffixes that share a prefix of at least we perform (at most) LCE queries to find their longest common prefix allowing for -errors; by Theorem 3.3 these pairs are .

We then initialise the -fast trie by inserting to it for each position of with . (For the rest of the positions, for which we reach the end of , we insert after some trivial technical considerations.) This procedure takes time in total.

We then want to find a longest prefix of the strings of length at most that are at Hamming distance at most from that occurs elsewhere in as well as an occurrence of it. If this prefix is of length , we find all positions in for which and treat each of them individually. We generate a subset of the strings; we avoid generating some that we already know do not occur in . We only want to allow the first error at position , where . Let us denote the substitution at position with letter by . Suppose that the longest prefix of after substitutions that occurs elsewhere in the string is of length . We then want to allow the th error at positions ; inspect Figure 1 for an illustration. It should be clear that we obtain each possible sequence of substitutions at most once.

Figure 1: The th error is any possible substitution at a position .

We view each string created after at most $k$ substitution operations as a number; the aim is to find its longest prefix that occurs elsewhere in $x$. To this end we perform at most three queries over the $y$-fast trie: an exact; a predecessor; and a successor query. If the exact query is unsuccessful, then either the predecessor or the successor query will return a factor of $x$ that attains the maximal longest common prefix that any factor of $x$ has with the query string. Note that it may be the case that this factor only occurs at position $i$ itself; however, in this case the length obtained will be smaller than or equal to the value currently stored at $\mathrm{PLCP}_k[i]$ due to how we generate each such string. Hence we do not perform an invalid update of $\mathrm{PLCP}_k[i]$.
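The query pattern described above can be sketched in Python under a loud simplification: a sorted list searched with the standard `bisect` module stands in for the $y$-fast trie, so queries take logarithmic rather than doubly logarithmic time. The sketch relies on the fact that, for fixed-length strings read as numbers, numeric order coincides with lexicographic order, so the window sharing the longest prefix with a query is its exact match, predecessor or successor. The names `encode`, `build_index` and `best_match` are hypothetical, and the index is assumed to be non-empty.

```python
from bisect import bisect_left


def encode(s: str, alphabet: str) -> int:
    """Read s as a base-|alphabet| number, most significant letter first."""
    value = 0
    for c in s:
        value = value * len(alphabet) + alphabet.index(c)
    return value


def build_index(x: str, m: int, alphabet: str) -> list:
    """Sorted (code, start) pairs for every length-m window of x."""
    return sorted((encode(x[i:i + m], alphabet), i) for i in range(len(x) - m + 1))


def common_prefix_length(u: str, v: str) -> int:
    l = 0
    while l < len(u) and l < len(v) and u[l] == v[l]:
        l += 1
    return l


def best_match(x: str, index: list, query: str, alphabet: str):
    """Return (length, start) of a window of x sharing the longest prefix with query."""
    m = len(query)
    pos = bisect_left(index, (encode(query, alphabet), -1))
    candidates = []
    if pos < len(index):
        candidates.append(index[pos])      # exact match if present, else successor
    if pos > 0:
        candidates.append(index[pos - 1])  # predecessor
    return max((common_prefix_length(x[i:i + m], query), i) for _, i in candidates)
```

With `x = "ACGTACGA"` and windows of length 4, the query `ACGG` has no exact match; its predecessor `ACGA` and successor `ACGT` both share a prefix of length 3 with it.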

Having found such a factor, we can then compute the length of the longest common prefix between this factor and the query string in constant time using standard bit-level operations. For clarity of presentation we assume $\sigma = 2$. An XOR operation between the two strings, viewed as $m$-bit integers, provides us with an integer $d$ specifying the positions of errors (bits set on when $d$ is viewed as binary). If $d \neq 0$, we take $\lfloor \log d \rfloor$, which provides us with the index of the leftmost bit set on, which in turn specifies the length of the longest common prefix between the two strings; specifically, this length is $m - 1 - \lfloor \log d \rfloor$.
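The bit-level step can be sketched in Python for a binary alphabet. Here `int.bit_length()` plays the role of locating the leftmost set bit, since for a positive integer it equals $\lfloor \log_2 d \rfloor + 1$. The function name `lcp_by_xor` and the explicit length parameter `m` are assumptions of this sketch.

```python
def lcp_by_xor(u: int, v: int, m: int) -> int:
    """Longest common prefix of two m-bit strings given as integers."""
    d = u ^ v                      # set bits mark mismatching positions
    if d == 0:
        return m                   # the two strings are identical
    return m - d.bit_length()      # equals m - 1 - floor(log2(d))


# Example: 10110 and 10011 agree on their first two bits only.
a = int("10110", 2)
b = int("10011", 2)
common = lcp_by_xor(a, b, 5)
```

The length `m` must be passed explicitly because leading zeros of the packed strings are not visible in the integers themselves.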

If we perform LCE queries between all suffixes of the text that have as a prefix and ; by Theorem 3.3 we expect this to happen times in total, so the cost is immaterial.

We have positions where we need to consider the errors, yielding an overall time complexity of . We thus obtain the following result.

See Theorem 3.2.

Remark 1

We have that and hence the required time is bounded by .

Remark 2

If , where is the word size in the word-RAM model, we can make use of the deterministic data structure presented in [6] (Theorem 1 therein), which can be built in time for a string of length and answers predecessor queries (i.e. given a query string , it returns the lexicographically largest suffix of that is smaller than ) in time . In particular, the queries in scope can be answered in time per query.

3.3 Edit Distance

We next consider computing under the edit distance model; however in this case we observe that and are at edit distance for . We hence alter the definition so that refers to the longest common prefix of with -errors occurring at a position .

The proof of Theorem 3.3 can be extended to allow for -errors under the edit distance. In this case we have that ; this can be seen by following the same reasoning as in the first claim of the proof with two extra considerations: (a) each deletion/insertion operation conceptually shifts the letters to be matched (giving the factor); (b) the letters to be matched are minus the number of deletions and substitutions and hence at least . The extra factor gets consumed by later in the proof since .

On the technical side, we modify the algorithm of Section 3.2 as follows:

  1. At each position, in addition to substitutions, we also consider insertions and deletions. This yields a multiplicative factor in the time complexity. We keep counters ins for insertions and del for deletions; for each length we obtain, we add del and subtract ins.

  2. When querying for a string while processing position we now have to check that we do not return a position . We can resolve this by spending time for each position ; when we start processing position , we create an array of size that stores for each position a position with the maximal longest common prefix with using the SA and the LCP array. When a query returns a position we instead consider .

  3. We replace the LCE queries used to compute values in longer than (that required time in total) by the Landau-Vishkin technique [17] to perform extensions. For an illustration inspect Figure 2. We initiate diagonal paths in the classical dynamic programming matrix for and . The th diagonal path above and the th diagonal path below the main diagonal are initialised to errors. The path starting at the main diagonal is initialised to errors. We first perform an LCE query between and , for all , and an LCE query between and , for all . Then, for all , we try to extend a path with exactly errors to a path with exactly errors. We perform an insertion, a deletion, or a substitution with a further LCE query and pick the farthest reaching extension. The bottom-most extension of any diagonal when specifies the length of the longest common prefix with -errors. The whole process takes time .

Figure 2: We need to perform extension steps in diagonals of the dynamic programming matrix for and .
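The diagonal extension of item 3 can be sketched in Python as follows. This is a simplified illustration in the spirit of the Landau-Vishkin technique, not the authors' implementation: a naive letter-by-letter comparison stands in for constant-time LCE queries, and the function computes, for two strings `u` and `v`, the length of the longest prefix of `u` that is within edit distance `k` of some prefix of `v`. The names `lce` and `lcp_k_edit` are hypothetical.

```python
def lce(u: str, v: str, i: int, j: int) -> int:
    """Naive longest common extension of u[i:] and v[j:] (stand-in for an O(1) LCE query)."""
    l = 0
    while i + l < len(u) and j + l < len(v) and u[i + l] == v[j + l]:
        l += 1
    return l


def lcp_k_edit(u: str, v: str, k: int) -> int:
    """Length of the longest prefix of u within edit distance k of some prefix of v."""
    # furthest[d]: largest row i such that u[:i] and v[:i + d] are within the error budget
    furthest = {0: lce(u, v, 0, 0)}
    best = furthest[0]
    for e in range(1, k + 1):
        extended = {}
        for d in range(-e, e + 1):
            start = max(0, -d)                   # first row of diagonal d
            limit = min(len(u), len(v) - d)      # last row of diagonal d
            if start > limit:
                continue                         # diagonal lies outside the matrix
            candidates = [start]                 # reachable with |d| <= e insertions/deletions
            if d in furthest:
                candidates.append(furthest[d] + 1)      # substitution
            if d + 1 in furthest:
                candidates.append(furthest[d + 1] + 1)  # deletion of a letter of u
            if d - 1 in furthest:
                candidates.append(furthest[d - 1])      # insertion of a letter of v
            i = min(max(candidates), limit)
            i += lce(u, v, i, i + d)             # slide along matching letters
            extended[d] = i
            best = max(best, i)
        furthest = extended
    return best
```

For example, with one error the whole of `ACGTACGT` matches the prefix `ACTACGT` of `ACTACGTT`, because deleting the letter `G` aligns the two. Each error level touches at most $2e + 1$ diagonals, which is where the quadratic dependence on $k$ in the number of extension steps comes from.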

4 Genome Mappability Data Structure

The genome mappability problem has already been studied under the Hamming distance model [8, 21, 4]. We can also define the problem under the edit distance model. Given a string $x$ of length $n$ and integers $m < n$ and $k < m$, we are asked to count, for each length-$m$ substring $x[i..i+m-1]$ of $x$, the number occ of other substrings of $x$ occurring at a position $j \neq i$ that are at edit distance at most $k$ from $x[i..i+m-1]$. We then say that this substring has $k$-mappability equal to occ. Specifically, we consider a data structure version of this problem [3]. Given $x$ and $k$, construct a data structure, which, for a query value $\mu$ given on-line, returns the minimal value of $m$ that forces at least $\mu$ length-$m$ substrings of $x$ to have $k$-mappability equal to $0$.

Theorem 4.1 ([3])

An $O(n)$-sized data structure answering genome mappability queries in $O(1)$ time per query can be constructed from $\mathrm{PLCP}_k$ in time $O(n)$.

By combining Theorem 3.2 with Theorem 4.1 we obtain the first efficient algorithm for the genome mappability data structure under the edit distance model.

5 Longest Common Substring with $k$-Errors

In the longest common substring with $k$-errors problem we are asked to find the longest substrings of two strings that are at distance at most $k$. The Hamming distance version has received much attention due to its applications in computational biology [27, 18, 26]. Under edit distance, the problem is largely unexplored. The average $k$-error common substring is an alignment-free method based on this notion for measuring string dissimilarity under Hamming distance; we denote the induced distance by $\mathrm{ACS}_k(x, y)$ for two strings $x$ and $y$ (see [27] for the definition). $\mathrm{ACS}_k(x, y)$ can be computed in $O(|x| + |y|)$ time from arrays $\Lambda_{x,y}$ and $\Lambda_{y,x}$, defined as

$$\Lambda_{x,y}[i] = \max_{0 \le j < |y|} \mathrm{lcp}_k(x[i..|x|-1],\, y[j..|y|-1]).$$

A worst-case and a more practical average-case algorithm for the computation of $\mathrm{ACS}_k(x, y)$ have been presented in [25, 26]. This measure was extended to allow for wildcards (don’t care letters) in the strings in [12]. Here we provide a natural generalisation of this measure: the average $k$-error common substring under the edit distance model. The sole change is in the definition of $\mathrm{lcp}_k$: in addition to substitutions, we also allow for insertion and deletion operations.

The algorithm of Section 3.3 can be applied to compute $\Lambda_{x,y}$ and $\Lambda_{y,x}$ under the edit distance model within the same complexities. To compute $\Lambda_{y,x}$, we start by constructing the $y$-fast trie for $x$. We then do the queries for the suffixes of $y$; we now also check for an exact match (i.e. for $k = 0$). The array $\Lambda_{x,y}$ is computed symmetrically. We obtain the following result.

Theorem 5.1

Given two strings $x$ and $y$ of length at most $n$ and a distance threshold $k$, arrays $\Lambda_{x,y}$ and $\Lambda_{y,x}$ and $\mathrm{ACS}_k(x, y)$ can be computed in average-case time $O(n c^k \log^k n \log\log n)$, where $c > 1$ is a constant, using $O(n)$ extra space.

Remark 3

By applying Theorem 5.1 we essentially solve the longest common substring with $k$-errors for $x$ and $y$ within the same complexities.

6 All-Pairs Suffix/Prefix Overlaps with $k$-Errors

Given a set of strings and an error-rate $\epsilon$, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within distance $k = \lceil \epsilon \ell \rceil$, where $\ell$ is the length of the overlap [23, 28, 16].
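The problem statement can be made concrete with a brute-force Python sketch under the Hamming distance model. It assumes that an overlap of length $\ell$ may contain at most $\lceil \epsilon \ell \rceil$ mismatches, and it takes the error-rate as a `Fraction` to avoid floating-point rounding in the ceiling. The names `longest_overlap` and `all_pairs_overlaps` are illustrative, and the quadratic scan is a reference implementation for small inputs only.

```python
import math
from fractions import Fraction


def longest_overlap(s: str, t: str, eps: Fraction) -> int:
    """Longest l such that the length-l suffix of s and prefix of t have <= ceil(eps * l) mismatches."""
    for l in range(min(len(s), len(t)), 0, -1):
        budget = math.ceil(eps * l)
        mismatches = sum(a != b for a, b in zip(s[len(s) - l:], t[:l]))
        if mismatches <= budget:
            return l
    return 0


def all_pairs_overlaps(strings: list, eps: Fraction) -> dict:
    """Map each ordered pair (a, b), a != b, to the longest suffix/prefix overlap length."""
    return {(a, b): longest_overlap(strings[a], strings[b], eps)
            for a in range(len(strings))
            for b in range(len(strings))
            if a != b}
```

Note that with any positive error-rate the budget for an overlap of length 1 is already one mismatch, so every ordered pair of non-empty strings has an overlap of length at least 1 under this definition.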

Using our technique but only inserting prefixes of the strings in the $y$-fast trie and querying for all starting positions (suffixes) in a similar manner as in Section 3.1, we obtain the following result.

Theorem 6.1

Given a set of strings of total length $n$ and a distance threshold $k$, the length of the maximal longest suffix/prefix overlaps of every string against all other strings within distance $k$ can be computed in average-case time $O(n c^k \log^k n \log\log n)$, where $c > 1$ is a constant, using $O(n)$ extra space.

References

  • [1] Amir Abboud, Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In SODA, SODA ’15, pages 218–230. Society for Industrial and Applied Mathematics, 2015.
  • [2] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.
  • [3] Hayam Alamro, Lorraine A. K. Ayad, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. Longest common prefixes with k-mismatches and applications. In SOFSEM, volume 10706 of LNCS, pages 636–649. Springer International Publishing, 2018.
  • [4] Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, Jakub Radoszewski, and Wing-Kin Sung. Faster algorithms for 1-mappability of a sequence. In COCOA, volume 10628 of LNCS, pages 109–121. Springer International Publishing, 2017.
  • [5] Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In LATIN, volume 1776 of LNCS, pages 88–94. Springer-Verlag, 2000.
  • [6] Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In CPM, volume 78 of LIPIcs, pages 6:1–6:11. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
  • [7] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In STOC, STOC ’04, pages 91–100. ACM, 2004.
  • [8] Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David Knowles, Emanuele Raineri, Roderic Guigó, and Paolo Ribeca. Fast computation and applications of genome mappability. PLoS ONE, 7(1), 2012.
  • [9] Johannes Fischer. Inducing the LCP-array. In WADS, volume 6844 of LNCS, pages 374–385. Springer-Verlag, 2011.
  • [10] Tomás Flouri, Emanuele Giaquinta, Kassian Kobert, and Esko Ukkonen. Longest common substrings with mismatches. Inf. Process. Lett., 115(6-8):643–647, 2015.
  • [11] Szymon Grabowski. A note on the longest common substring with k-mismatches problem. Inf. Process. Lett., 115(6-8):640–642, 2015.
  • [12] Sebastian Horwege, Sebastian Lindner, Marcus Boden, Klas Hatje, Martin Kollmar, Chris-Andre Leimeister, and Burkhard Morgenstern. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Research, 42(Webserver-Issue):7–11, 2014.
  • [13] S. Karlin, G. Ghandour, F. Ost, S. Tavaré, and L. J. Korn. New approaches for computer analysis of nucleic acid sequences. In Proceedings of the National Academy of Sciences USA, volume 80, pages 5660–5664, 1983.
  • [14] Tomasz Kociumaka, Jakub Radoszewski, and Tatiana A. Starikovskaya. Longest common substring with approximately k mismatches. CoRR, abs/1712.08573, 2017.
  • [15] Roman Kolpakov, Ghizlane Bana, and Gregory Kucherov. mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research, 31(13):3672–3678, 2003.
  • [16] Gregory Kucherov and Dekel Tsur. Improved filters for the approximate suffix-prefix overlap problem. In SPIRE, volume 8799 of LNCS, pages 139–148. Springer, 2014.
  • [17] Gad M. Landau and Uzi Vishkin. Efficient string matching in the presence of errors. In IEEE, editor, FOCS, pages 126–136. IEEE Computer Society, 1985.
  • [18] Chris-Andre Leimeister and Burkhard Morgenstern. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000–2008, 2014.
  • [19] Kung-Hao Liang. Bioinformatics for Biomedical Science and Clinical Applications. Woodhead Publishing Series in Biomedicine. Woodhead Publishing, 2013.
  • [20] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.
  • [21] Giovanni Manzini. Longest common prefix with mismatches. In SPIRE, volume 9309 of LNCS, pages 299–310. Springer, 2015.
  • [22] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In DCC, IEEE, pages 193–202, 2009.
  • [23] Kim R. Rasmussen, Jens Stoye, and Eugene W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. Journal of Computational Biology, 13(2):296–308, 2006.
  • [24] Arian FA Smit. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Current Opinion in Genetics & Development, 9(6):657–663, 1999.
  • [25] Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472–482, 2016.
  • [26] Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Alberto Apostolico, and Srinivas Aluru. ALFRED: A practical method for alignment-free distance computation. Journal of Computational Biology, 23(6):452–460, 2016.
  • [27] Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13(2):336–350, 2006.
  • [28] Niko Välimäki, Susana Ladra, and Veli Mäkinen. Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213:49–58, 2012.
  • [29] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett., 17(2):81–84, 1983.