Approximate Online Pattern Matching in Sub-linear Time

We consider the approximate pattern matching problem under edit distance. In this problem we are given a pattern P of length w and a text T of length n over some alphabet Σ, and a positive integer k. The goal is to find all the positions j in T such that there is a substring of T ending at j which has edit distance at most k from the pattern P. Recall, the edit distance between two strings is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. For a position t in {1,...,n}, let k_t be the smallest edit distance between P and any substring of T ending at t. In this paper we give a constant factor approximation to the sequence k_1, k_2, ..., k_n. We consider both offline and online settings. In the offline setting, where both P and T are available, we present an algorithm that for all t in {1,...,n} computes the value of k_t approximately within a constant factor. The worst case running time of our algorithm is O(n w^{3/4}). As a consequence we break the O(nw)-time barrier for this problem. In the online setting, we are given P and then T arrives one symbol at a time. We design an algorithm that upon arrival of the t-th symbol of T computes k_t approximately within an O(1) multiplicative factor and w^{8/9} additive error. Our algorithm takes O(w^{1-(7/54)}) amortized time per symbol arrival and takes O(w^{1-(1/54)}) additional space apart from storing the pattern P. Both of our algorithms are randomized and produce correct answers with high probability. To the best of our knowledge this is the first worst-case sub-linear (in the length of the pattern) time and sub-linear succinct space algorithm for the online approximate pattern matching problem.


1 Introduction

Finding the occurrences of a pattern in a larger text is one of the fundamental problems in computer science. Due to its immense applications this problem has been studied extensively under several variations [22, 18, 5, 14, 20, 15, 21, 28, 23]. One of the most natural variations is where we are allowed to have a small number of errors while matching the pattern. This problem of pattern matching while allowing errors is known as approximate pattern matching. The kind of possible errors varies with the application. Generally we capture the amount of error by a distance metric defined over the set of strings. One common and widely used distance measure is the edit distance (aka Levenshtein distance) [25]. The edit distance between two strings is the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. In this paper we focus on the approximate pattern matching problem under edit distance. This problem has various applications ranging from computational biology, signal transmission, web searching, and text processing to many more.

Given a pattern P of length w and a text T of length n over some alphabet Σ, and an integer k, we want to identify all the substrings of T at edit distance at most k from P. As the number of such substrings might be quadratic in n, and one wants to obtain efficient algorithms, one focuses on finding the set of all right-end positions in T of those substrings at distance at most k. More specifically, for a position t in T, we let k_t be the smallest edit distance between P and a substring of T ending at the t-th position of T. (We number positions in P and T from 1.) The goal is to compute the sequence k_t for t = 1, ..., n. Using the basic dynamic programming paradigm we can solve this problem in time O(nw) [30]. Later, Masek and Paterson [26] shaved a logarithmic factor from the above running time bound. Despite a long line of research, this running time remains the best known till now. Recently, Backurs and Indyk [6] indicated that this bound cannot be improved significantly unless the Strong Exponential Time Hypothesis (SETH) is false. Moreover, Abboud et al. [3] showed that even shaving an arbitrarily large polylog factor would imply that NEXP does not have non-uniform NC^1 circuits, which is a likely but hard to prove conclusion. More hardness results can be found in [2, 7, 1, 4].
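
As a point of reference, the following is a minimal sketch of this O(nw)-time dynamic program; the function and variable names are ours, and the recurrence is the textbook one in which a match may start at any position of the text for free.

```python
def approx_match_dp(P: str, T: str) -> list:
    """Return [k_1, ..., k_n] for pattern P and text T in O(nw) time."""
    w, n = len(P), len(T)
    prev = list(range(w + 1))       # column for the empty text prefix: D[j][0] = j
    k = []                          # k[t-1] will hold k_t
    for t in range(1, n + 1):
        cur = [0] * (w + 1)         # cur[0] = 0: a match may start at any text position
        for j in range(1, w + 1):
            cur[j] = min(
                prev[j - 1] + (P[j - 1] != T[t - 1]),   # substitute / match (D-step)
                prev[j] + 1,                            # skip a text symbol (H-step)
                cur[j - 1] + 1,                         # skip a pattern symbol (V-step)
            )
        k.append(cur[w])
        prev = cur
    return k
```

For example, approx_match_dp("aa", "axa") returns [1, 1, 1], since every position of "axa" has a substring ending there at edit distance 1 from "aa".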

In this paper we focus on finding an approximation to the sequence k_1, ..., k_n. For reals a ≥ 1 and b ≥ 0, a sequence s_1, ..., s_n is an (a, b)-approximation to k_1, ..., k_n if for each t ∈ {1, ..., n}, k_t ≤ s_t ≤ a · k_t + b. Hence, a is the multiplicative error and b is the additive error of the approximation. An algorithm computes an (a, b)-approximation to approximate pattern matching if it outputs an (a, b)-approximation of the true sequence k_1, ..., k_n. We refer to an (a, 0)-approximation simply as an a-approximation. Our main theorem is the following.
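
A small illustrative check of the (a, b)-approximation guarantee, assuming the one-sided convention stated above (the output never underestimates k_t); the function name is ours:

```python
def is_ab_approximation(s, k, a: float, b: float) -> bool:
    # every output value must dominate the true k_t and stay within a*k_t + b
    return all(kt <= st <= a * kt + b for st, kt in zip(s, k))
```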

Theorem 1.1.

There is a constant c ≥ 1 and a randomized algorithm that computes a c-approximation to approximate pattern matching in time O(n w^{3/4}) with high probability.

In the recent past researchers have also studied the approximate pattern matching problem in the online setting. The online version of this pattern matching problem mostly arises in real-life applications that require matching a pattern in a massive data set, as in telecommunications, monitoring Internet traffic, building firewalls to block viruses and malware connections, and many more. The online approximate pattern matching problem is as follows: we are given the pattern P first and then the text T arrives one symbol at a time. Upon receipt of the t-th symbol we should output the corresponding k_t. The online algorithm runs in amortized time f per symbol if it runs in total time O(n · f), and it uses succinct space s if, in addition to storing P, it uses at most s cells of memory at any time.

Theorem 1.2.

There is a constant c so that there is a randomized online algorithm that computes a (c, w^{8/9})-approximation to approximate pattern matching in O(w^{1-(7/54)}) amortized time per arriving symbol and O(w^{1-(1/54)}) succinct space, with high probability.

To the best of our knowledge this is the first online approximation algorithm that takes sublinear (in the length of the pattern) running time and sublinear succinct space for the approximate pattern matching problem. The succinct space data structure is quite natural from the practical point of view and has been considered for many problems including pattern matching, e.g. [29, 19].

To prove our result we use the technique developed by Chakraborty, Das, Goldenberg, Koucký and Saks in [8], where they provide a sub-quadratic time constant factor approximation algorithm for the edit distance problem. Suppose one has only black-box access to a sub-quadratic time approximation algorithm for computing the edit distance. It is not clear how to use that algorithm to design an algorithm for the offline approximate pattern matching problem that runs in time O(n w^{1-ε}), for some ε > 0. So even given the result of [8], it was still open whether one can solve the approximate pattern matching problem in time better than O(nw).

In this paper we first design an offline algorithm by building upon the technique used in [8]. To do this we exploit the similarity between the "dynamic programming graphs" (see Section 2) for the approximate pattern matching problem and the edit distance problem. As witnessed for example by the running time of our pattern matching algorithm, which is O(n w^{3/4}), whereas the running time of the edit distance algorithm of [8] is truly sub-quadratic in the length of the input strings, this still requires careful modifications to the edit distance algorithm. However, the scenario becomes more involved if one wants to design an online algorithm using only a small amount of extra space. The approximation algorithm for edit distance in [8] works in two phases: first a covering algorithm is used to discover a suitable set of shortcuts in the pattern matching graph, and then a min-cost path algorithm on a grid graph with the shortcuts yields the desired result. In the online setting we carefully interleave these phases. However, that by itself is not sufficient, since the first phase, i.e., the covering algorithm used in [8], essentially relies on the fact that both of the strings are available at any point in time. We modify the covering technique so that it can also be implemented in the situation when we cannot see the full text. We show that if we store the pattern then we need only a sublinear amount of extra space to perform the sampling. Furthermore, the min-cost path algorithm in [8] uses more space than we can afford here. We modify that algorithm too, so that it also works using only sublinear space. We describe our algorithm in more detail in Section 6.

1.1 Related work

The approximate pattern matching problem is one of the most extensively studied problems in modern computer science due to its direct applicability to data-driven applications. In contrast to exact pattern matching, here a text location has a match if the distance between the pattern and the text is within some tolerated limit. In our work we study approximate pattern matching under the edit distance metric. The very first O(nw)-time algorithm was given by Sellers [30] in 1980. Masek and Paterson [26] improved this running time by a logarithmic factor using the Four Russians [34] technique. Later, [27, 24, 17] gave O(nk)-time algorithms, where k is the upper limit on the number of allowed edit operations. All of these algorithms use at least linear space. However, [16, 33] reduced the space usage while maintaining the running time. A faster algorithm was given by Cole and Hariharan [13], with a running time of O(n + nk^4/w). We refer the interested reader to a beautiful survey by Navarro [28] for a comprehensive treatment of this topic.

All the above mentioned algorithms assume that the entire text is available from the very beginning of the process. However, in the online version, the pattern is given at the beginning and the text arrives in a stream, one symbol at a time. Clifford et al. [9] gave a "black-box algorithm" for online approximate matching where the supported distance metrics are Hamming distance, matching with wildcards, k-mismatch, and the L_1 norm. Their per-symbol running time is determined by the running time of the best offline algorithm. This result was extended in [11] with an algorithm solving online approximate pattern matching under the edit distance metric; this algorithm uses linear space. In [12] the runtime per symbol was further improved. However, none of these algorithms for the edit distance metric is black-box, and they highly depend on the specific structure of the corresponding offline algorithm. Furthermore, all these algorithms use linear space. Recently, Starikovskaya [31] gave a randomized algorithm whose worst-case time per symbol and space both depend polynomially on k. Although her algorithm takes both sublinear time and sublinear space for small values of k, the heavy dependency on k in the complexity bounds makes it much worse than the previously known algorithms in the high-k regime. On the lower bound side, Clifford, Jalsenius and Sach [10] showed in the cell-probe model a lower bound on the expected amortized running time per output of any randomized algorithm solving the online approximate pattern matching problem.

2 Preliminaries

We recall some basic definitions from [8]. Consider the text T of length n to be aligned along the horizontal axis and the pattern P of length w to be aligned along the vertical axis. For i ∈ {1, ..., n}, T_i denotes the i-th symbol of T, and for j ∈ {1, ..., w}, P_j denotes the j-th symbol of P. T_{i..j} is the substring of T starting at the i-th symbol and ending at the j-th symbol of T. For any interval I ⊆ {1, ..., n}, T_I denotes the substring of T indexed by I, and for an interval J ⊆ {1, ..., w}, P_J denotes the substring of P indexed by J.

Edit distance and pattern matching graphs.

For a text T of length n and a pattern P of length w, the edit distance graph G_{T,P} is a directed weighted graph, called a grid graph, with vertex set {0, ..., n} × {0, ..., w} and the following three types of edges: (i-1, j) → (i, j) (H-steps), (i, j-1) → (i, j) (V-steps), and (i-1, j-1) → (i, j) (D-steps). Each H-step or V-step has cost 1, and each D-step costs 0 if T_i = P_j and 1 otherwise. The pattern matching graph G'_{T,P} is the same as the edit distance graph except for the cost of the horizontal edges in the bottom-most row (the row j = 0), which is zero.

For I ⊆ {0, ..., n} and J ⊆ {0, ..., w}, G_{I×J} is the subgraph of G_{T,P} induced on I × J. Clearly, G_{T,P} = G_{{0,...,n}×{0,...,w}}. We define the cost of a path p in G_{T,P}, denoted by cost(p), as the sum of the costs of its edges. We also define the cost of a graph G_{I×J}, denoted by cost(G_{I×J}), as the cost of the cheapest path from its bottom-left corner to its top-right corner.
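
A minimal sketch of computing the cost of a grid graph by dynamic programming (the function name and the flag are ours, and it assumes, as stated above, that only the bottom-row horizontal edges of the pattern matching graph are free). With ordinary horizontal costs the result is the edit distance of T and P; with the bottom row free it is k_n, consistent with Proposition 2.1 below.

```python
def grid_graph_cost(T: str, P: str, free_bottom_row: bool = False) -> int:
    n, w = len(T), len(P)
    INF = float("inf")
    # dist[i][j] = cost of the cheapest path from (0, 0) to vertex (i, j)
    dist = [[INF] * (w + 1) for _ in range(n + 1)]
    dist[0][0] = 0
    for i in range(n + 1):
        for j in range(w + 1):
            if i > 0:                       # H-step (i-1, j) -> (i, j)
                h_cost = 0 if (free_bottom_row and j == 0) else 1
                dist[i][j] = min(dist[i][j], dist[i - 1][j] + h_cost)
            if j > 0:                       # V-step (i, j-1) -> (i, j)
                dist[i][j] = min(dist[i][j], dist[i][j - 1] + 1)
            if i > 0 and j > 0:             # D-step, cost 0 iff T_i = P_j
                d_cost = 0 if T[i - 1] == P[j - 1] else 1
                dist[i][j] = min(dist[i][j], dist[i - 1][j - 1] + d_cost)
    return dist[n][w]
```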

The following is well known in the literature (e.g. see [30]).

Proposition 2.1.

Consider a pattern P of length w and a text T of length n, and let G'_{T,P} be their pattern matching graph. For any t ∈ {1, ..., n}, let k_t be the smallest edit distance between P and a substring of T ending at the t-th position of T, and let c_t be the cost of the cheapest path from (0, 0) to (t, w) in G'_{T,P}. Then k_t = c_t.

A similar proposition is also true for the edit distance graph.

Proposition 2.2.

Consider a pattern P of length w and a text T of length n, and let G_{T,P} be their edit distance graph. For any i ≤ t in {0, ..., n}, let c be the cost of the cheapest path from (i, 0) to (t, w) in G_{T,P}. Then c is the edit distance between T_{i+1..t} and P.

Let G be a grid graph on I × J and let p be a path in G. The horizontal projection of the path p is the set of all i ∈ I such that (i, j) lies on p for some j ∈ J. Let I' be a set contained in the horizontal projection of p; then p_{I'} denotes the (unique) minimal subpath of p with horizontal projection I'. Let G' = G_{I'×J'} be a subgraph of G. For k ≥ 0 we say that G' k-covers the path p if the initial and the final vertex of p_{I'} are at a vertical distance of at most k from the bottom-left corner and the top-right corner of G', respectively.

A certified box of G is a pair (G_{I×J}, c) where I, J are intervals and c is a cost such that cost(G_{I×J}) ≤ c. At a high level, our goal is to approximate each path p in G by a path via the corner vertices of certified boxes. For that we want that a substantial portion of the path goes via those boxes and that the sum of the costs of the certified boxes is not much larger than the actual cost of the path. The next definition makes our requirements precise. Let σ = (G_{I_1×J_1}, c_1), ..., (G_{I_m×J_m}, c_m) be a sequence of certified boxes in G. Let p be a path in G with horizontal projection I. For any k ≥ 0, we say that σ k-approximates p if the following three conditions hold:

  1. I_1, ..., I_m is a decomposition of I, i.e., I = I_1 ∪ ⋯ ∪ I_m, and the intervals I_1, ..., I_m appear in this left-to-right order along I.

  2. For each i ∈ {1, ..., m}, G_{I_i×J_i} k-covers p_{I_i}.

  3. The total cost satisfies ∑_{i=1}^{m} c_i = O(k + cost(p)).

3 Offline approximate pattern matching

To prove Theorem 1.1 we design an algorithm as follows. For each power of two k below a suitable threshold, we run the standard O(nk)-time algorithm [16] to identify all t such that k_t ≤ k. To identify positions t whose k_t is above the threshold, for each relevant k that is a power of two we will use the technique of [8] to compute an approximation of k_t. The obtained information can be combined in a straightforward manner to get a single approximation to k_1, ..., k_n: for each t, if for some k below the threshold k_t is at most k (as determined by the former algorithm), then output the smallest such k as the approximation of k_t; otherwise output the approximation of k_t found by the latter algorithm. This way we obtain a constant-factor approximation of k_t for every t. We will now elaborate on the latter algorithm based on [8]. The edit distance algorithm of [8] has two phases, which we will also use. The first phase (the covering phase) identifies a set of certified boxes, subgraphs of the pattern matching graph with good upper bounds on their cost. These certified boxes should cover the min-cost paths of interest. Then the next phase runs a min-cost path algorithm on these boxes to obtain the output sequence. Both of these phases will take O(n w^{3/4}) time, so the overall running time of our algorithm will be O(n w^{3/4}).
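
A hedged sketch of how the two regimes are combined per position; the threshold K, the dictionary encoding, and the function name are our assumptions, and the paper's actual parameter choices differ.

```python
# exact_at[k][t] is True if the O(nk)-time algorithm run with threshold k
# certified that k_t <= k; approx[t] is the approximation of k_t produced
# via the technique of [8].
def combine(exact_at: dict, approx: list, K: int, n: int) -> list:
    out = []
    thresholds = sorted(k for k in exact_at if k <= K)   # e.g. powers of two
    for t in range(n):
        small = next((k for k in thresholds if exact_at[k][t]), None)
        out.append(small if small is not None else approx[t])
    return out
```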

We describe the algorithms for the two phases next. The algorithm will use several parameters. The meaning of the parameters is essentially the same as in [8], and we will see it in a moment, but their setting is different. We also use the large enough constants from [8]. For simplicity we will assume without loss of generality that the relevant parameters are powers of two (by rounding them down to the nearest powers of two), that one of them is a reciprocal of a power of two (by decreasing it by at most a factor of two), that w is divisible by the block length (by chopping off a small suffix from P, which will affect the approximation by a negligible additive error), and similarly for n (if not, we can run the algorithm twice: on the largest prefix of T of length divisible by the block length and then on the largest suffix of T of length divisible by it). The algorithm will not explicitly compute k_t for all t but only for t that are multiples of a fixed step size, and then it will use the same value for each block of consecutive t's. Again, this will affect the approximation by a negligible additive error.

4 Covering phase

We describe the first phase of the algorithm now. First, we partition the text T into consecutive parts whose length is proportional to w. Then we process each of the parts independently. Let T' be one of the parts. We partition T' into consecutive substrings of a fixed block length, and we also partition P into consecutive substrings of a fixed block length (the two lengths are parameters of the algorithm). For a substring of T' starting at the i-th symbol of T' and ending at the j-th symbol of T', we let the interval {i, ..., j} be its span. Then the covering algorithm proceeds as follows:

Dense substrings.

In this part the algorithm aims to identify, for each distance threshold that is a power of two, a set of substrings of P which are similar to many relevant substrings of T'. (A substring of T' is relevant if it starts at one of the positions aligned to the block-length grid and it is of the same length as the pattern substring under consideration.) We identify each such pattern substring by testing a random sample of relevant substrings of T'. If we determine with high confidence that there are many substrings of T' similar to it, we add it into a set of such strings, and we also identify all the pattern substrings that are similar to it. By the triangle inequality we would also expect them to be similar to many relevant substrings of T'. So we add these to the set as well, as we will not need to process them anymore. We output the set of certified boxes of edit distance found this way. More formally:

For each distance threshold that is a power of two, the algorithm maintains a set of substrings of P. These sets are initially empty.

Step 1. For each threshold and each substring of P in the partition, if the substring is already in the corresponding set then we continue with the next pair. Otherwise we process it as follows.

Step 2. Independently at random, sample a prescribed number of aligned substrings of T' of the appropriate length. (By an aligned substring of a given length in T' we mean a substring starting at a symbol whose position is aligned to the block-length grid.) For each sampled substring check if its edit distance from the current substring of P is at most the threshold. If too few of the samples have their edit distance below the threshold then we are done with processing this pair and we continue with the next pair. (A sketch of this test appears after Step 5.)

Step 3. Otherwise we identify all substrings of P in the partition that are not in the set and are at edit distance at most the threshold from the current substring, and we record the set of their spans relative to the whole P.

Step 4. Then we identify all aligned substrings of T' of the appropriate length that are at edit distance at most the threshold from the current substring, and we record the set of their spans. We might also allow some aligned substrings of T' of edit distance at most twice the threshold to be included in the set (as some might be misidentified to have the smaller edit distance by our procedure that searches for them).

Step 5. For each pair of spans, one recorded in Step 3 and one recorded in Step 4, we output the corresponding certified box. We add the substrings corresponding to the spans recorded in Step 3 into the maintained set and continue with the next pair.
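
The sampling test of Step 2 can be sketched as follows. This is a hedged illustration: the block length d, the distance threshold theta, the sample count, and the acceptance fraction are placeholders for the paper's parameters, which are set differently as functions of w, and the paper uses Ukkonen's banded check rather than the plain quadratic edit distance used here for self-containment.

```python
import random

def edit_distance(x: str, y: str) -> int:
    # plain O(|x||y|) DP; the algorithm itself uses Ukkonen's banded check instead
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def looks_dense(Q: str, T_part: str, d: int, theta: int, samples: int) -> bool:
    """Decide with high confidence whether many aligned substrings of T_part
    are within edit distance theta of the pattern block Q."""
    starts = [s for s in range(0, len(T_part) - len(Q) + 1) if s % d == 0]
    if not starts:
        return False
    hits = sum(
        edit_distance(Q, T_part[s:s + len(Q)]) <= theta
        for s in random.choices(starts, k=samples)     # independent random samples
    )
    return hits >= samples // 2        # enough samples look similar to Q
```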

Once we process all such pairs, we proceed to the next phase: extension sampling.

Extension sampling.

In this part, for every threshold and every pattern substring that does not have all of its blocks contained in the corresponding set, we randomly sample a set of such blocks. For each sampled block we determine all relevant substrings of T' at edit distance at most the threshold from it. There should be few such substrings of T' (otherwise the block would have been classified as dense). We extend each such substring into a longer substring within T' and we check the edit distance of the extended string against the pattern substring. For each extended substring of small enough edit distance we output a set of certified boxes.

Here we define the appropriate extension of substrings. Let Q be a substring of P of length less than the target length, starting at the i_Q-th symbol of P, and let B be a substring of T' of the same length as Q, starting at the i_B-th symbol of T'. The diagonal extension of B in T' with respect to Q and P is the substring of T' of the target length starting at position i_B - i_Q + 1, i.e., on the same diagonal as Q. If this window would begin before the start of T', the extension is the prefix of T' of the target length, and if it would extend past the end of T', the extension is the suffix of T' of the target length.
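
A hedged sketch of the diagonal extension under the reading above; the 1-based offsets, the argument names, and the clipping convention are our assumptions.

```python
def diagonal_extension(T_part: str, i_B: int, i_Q: int, length: int) -> str:
    # i_B: 1-based start of the match B inside T_part;
    # i_Q: 1-based start of the sampled block Q inside the pattern;
    # length: target length of the extension (len(T_part) >= length assumed).
    start = i_B - i_Q + 1                                   # same diagonal as Q
    start = max(1, min(start, len(T_part) - length + 1))    # clip to prefix / suffix
    return T_part[start - 1:start - 1 + length]
```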

Step 6. Process all pairs of a threshold and a pattern substring as above.

Step 7. Independently at random, sample blocks that are part of the current pattern substring and that are not in the corresponding set. (If there is no such block, continue with the next pair.)

Step 8. For each sampled block, find all aligned substrings of T' of the same length that are at edit distance at most the threshold from it.

Step 9. For each such substring determine its diagonal extension with respect to the sampled block and the pattern substring. Check if the edit distance of the extension and the pattern substring is small enough. If so, compute it and denote the distance by c. Let I be the span of the extension relative to T, and let J be the span of the pattern substring in P. For all the relevant powers of two, output the certified box (G_{I×J}, c). Proceed with the next pair.

This ends the covering algorithm which outputs various certified boxes.

To implement the above algorithm we will use Ukkonen's [32] algorithm to check whether the edit distance of two strings is at most a given threshold k; for strings of length O(w) this check runs in time O(kw), and if the edit distance is within the threshold the algorithm can also output its precise value. To identify all substrings of T' at edit distance at most a given threshold from a given string (here the given string plays the role of the pattern and T', one of the parts of T, plays the role of the text), we use the O(nk)-time pattern matching algorithm of Galil and Giancarlo [16]. For a given threshold, this algorithm determines for each position in T' whether there is a substring of edit distance at most the threshold from the given string ending at that position in T'. If the algorithm reports such a position then we know by the following proposition that the substring of the appropriate length ending there is at edit distance at most twice the threshold. At the same time we are guaranteed to identify all the substrings of T' of the appropriate length at edit distance at most the threshold from the given string. Hence in Step 4, finding all the substrings at distance at most the threshold, with perhaps some extra substrings of edit distance at most twice the threshold, can be done in time O(|T'| k).

Proposition 4.1.

For strings x and y, and integers j and k, if some substring of y ending at the j-th symbol of y is at edit distance at most k from x, then the substring of y of length |x| ending at the j-th symbol of y is at edit distance at most 2k from x.

Proof.

Let z be the best match for x ending at the j-th symbol of y; hence the edit distance of x and z is at most k. If z is longer than x by r symbols then r ≤ k, and by the triangle inequality the substring of y of length |x| ending at the j-th symbol of y is at edit distance at most 2k from x. Similarly if z is shorter than x by r symbols. ∎
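
For concreteness, the threshold edit-distance check invoked throughout this section can be sketched as a banded dynamic program in the spirit of Ukkonen's algorithm [32]; the function below is a simplified stand-in with an assumed name, not the paper's implementation. It fills only the DP cells within distance k of the diagonal, so it runs in O(k · |x|) time, and it returns the exact distance whenever that distance is at most k.

```python
def edit_distance_at_most(x: str, y: str, k: int):
    """Return ED(x, y) if it is at most k, otherwise None."""
    if abs(len(x) - len(y)) > k:
        return None                          # lengths alone already force ED > k
    INF = k + 1
    n, m = len(x), len(y)
    prev = {j: j for j in range(0, min(k, m) + 1)}   # DP row i = 0, inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(
                prev.get(j, INF) + 1,                           # delete x[i-1]
                cur.get(j - 1, INF) + 1,                        # insert y[j-1]
                prev.get(j - 1, INF) + (x[i - 1] != y[j - 1]),  # substitute / match
            )
        prev = cur
    d = prev.get(m, INF)
    return d if d <= k else None
```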

4.1 Correctness of the covering algorithm

Lemma 4.2.

Let i ≤ t be such that i is a multiple of the block length. Let p be the min-cost path between vertex (i, 0) and vertex (t, w) in the edit distance graph of T' and P, of sufficiently large cost. The covering algorithm outputs a set of weighted boxes such that every box is correctly certified, i.e., cost(G_{I×J}) ≤ c, and, with high probability, there is a subset of these boxes that approximates p in the sense of Section 2.

It is clear from the description of the covering algorithm that it outputs only correct certified boxes from the edit distance graph of T' and P, that is, for each output box (G_{I×J}, c) we have cost(G_{I×J}) ≤ c.

The cost of p corresponds to the edit distance between P and the corresponding substring of T', and it is appropriately bounded in terms of the reported costs by Proposition 4.1. Let k be the smallest power of two that is at least this cost. We claim that, by essentially the same argument as in Proposition 3.8 and Theorem 3.9 of [8], the algorithm outputs with high probability a set of certified boxes that approximates p.

There are differences between the current covering algorithm and that of [8]. The main substantial difference is that the algorithm in [8] searches for certified boxes located only within diagonals along the main diagonal of the edit distance graph. (This rests on the observation of Ukkonen [32] that a path of small cost must pass only through vertices on diagonals close to the main one.) Here we process certified boxes in the whole matrix, as each position t requires a different "main" diagonal. Except for this difference and the order of processing various pieces, the algorithms are the same.

Although technically not quite correct, one could say that the certified boxes output by the current algorithm form a superset of the boxes output by the algorithm of [8]. This is not entirely accurate, as the discovery of certified boxes depends on the number (density) of relevant substrings of T' similar to a given pattern substring. In [8] this density is measured only in a strip along the main diagonal of the edit distance graph, whereas here it is measured within the whole graph. (So the actual classification of substrings into dense and sparse might differ between the two algorithms.) However, this difference is immaterial for the correctness argument in Theorem 3.9 of [8].

Another difference is that in Step 4 we use the O(nk)-time algorithm to search for all the similar substrings. This algorithm will report all the substrings we were looking for, and additionally it might report some substrings of up to twice the required edit distance. This necessitates the doubled upper bound in the certified boxes in Step 5. It also means a loss of a factor of at most two in the approximation guarantee, as the boxes of interest are reported with up to twice the cost used by the original algorithm in [8], which would give a better constant factor. (In that theorem the relevant quantity is an (arbitrary) upper bound on the cost of p, provided it satisfies certain technical conditions requiring that it is large enough relative to the additive error; this is satisfied by our setting of parameters.)

Another technical difference is that the path p might pass through two edit distance graphs corresponding to two consecutive parts of the text. This means that one needs to argue separately about the restriction of p to each of the two graphs. However, the proof of Theorem 3.9 in [8] analyses the approximation of the path in separate parts restricted to substrings of a fixed size. As the relevant lengths are multiples of that size, the argument for each piece applies in our setting as well.

4.2 Time complexity of the covering algorithm

Now we analyze the running time:

Claim 4.3.

The covering algorithm runs in time O(n w^{3/4}) with high probability.

We analyse the running time of the covering algorithm for each part T' separately. We claim that the running time on each part is O(w^{7/4}), so the total running time is O(n w^{3/4}).

In Step 1, for every and , we might sample substrings of of length and check whether their edit distance from is at most . This takes time at most in total.

We say that a bad event happens either if some pattern substring has many relevant substrings of T' within the distance threshold but we sample fewer than the required fraction of them, or if some pattern substring has few relevant substrings of T' within the distance threshold but we sample more than the required fraction of them. By a Chernoff bound, the probability of a bad event happening during the whole run of the covering algorithm is negligible, provided the constant in the sample size is sufficiently large. Assuming no bad event happens, we analyze the running time of the algorithm further.

Each substring that reaches Step 3 can be associated with the set of its relevant substrings in T' at edit distance at most the threshold from it, and this set is large. These associated substrings must be different for different strings that reach Step 3: if they were not distinct, then the two strings would be at small edit distance from each other, and one of them would have been put into the maintained set in Step 5 while processing the other one, so it could not reach Step 3. Hence, we can reach Steps 3–5 only for a bounded number of strings. For a given threshold and each string that reaches Step 3, the execution of Steps 3 and 4 takes bounded time, hence we will spend in them a bounded amount of time in total.

Step 5 can report only a limited number of certified boxes for each processed substring, so the total time spent in this step is within the claimed bound.

Step 7 takes order less time than Step 8. In Step 8 we use Ukkonen’s [32] -time edit distance algorithm to check the distance of strings of length . We need to check pairs for the total cost per .

As no bad event happens, for each , there will be at most strings processed in Step 9. We will spend time on each of them to check for edit distance and to output the certified boxes. Hence, for each we will spend here time, which is in total.

Thus, the total time spent by the algorithm in each of the steps is as required.

5 Min-cost Path in a Grid Graph with Shortcuts

In this section we explain how we use the certified boxes to calculate the approximation of the k_t's. Consider any grid graph G. A shortcut in G is an additional edge, with some cost, from a vertex to another vertex above and to the right of it.

Let G be the edit distance graph for T and P. Let (G_{I×J}, c) be a certified box in G output by the covering algorithm. Add a shortcut edge from the bottom-left corner vertex of G_{I×J} to its top-right corner vertex with cost c. Do this for all certified boxes output by the covering algorithm to obtain an intermediate graph. Next, remove from it all the diagonal edges (D-steps), whether of cost 0 or 1, to obtain the shortcut graph.

Proposition 5.1.

If p is a path from (i, 0) to (t, w) in G which is approximated by a subset of the certified boxes output by the covering algorithm, then there is a path from (i, 0) to (t, w) in the shortcut graph, consisting of the shortcut edges corresponding to those boxes together with H- and V-steps, whose cost is at most a constant times cost(p) plus a small additive term.

Proof.

Let S be the set of certified boxes that approximates p, and consider the subset of S consisting of the boxes that contribute a shortcut edge. By definition, each box in S is correctly certified. We approximate the path p by a path q as illustrated in Fig. 1(b). For each box in S let v be the first vertex of the corresponding subpath of p. Define the sequence of these vertices along p. Moreover, if a box contributes a shortcut edge, let u and u' be the start and end vertex, resp., of that shortcut edge. As the box covers the corresponding subpath, u and u' are at small vertical distance from the subpath's endpoints. Hence we define q passing through all of these vertices. For each box, the part of q between consecutive vertices can be constructed in the following way: first climb from the previous vertex using V-steps, then, if the box contributes a shortcut edge, take it and then climb up again using V-steps; otherwise take H-steps to reach the column of the next vertex and then take V-steps up to it.

Figure 1: (a) The shortcut edge corresponding to a certified box. (b) An example of a path (solid) passing through a certified box. The dashed path is its approximation using the shortcut edge.

Next we argue about the cost of q. For each box, if it contributes a shortcut edge then that edge costs at most its certified cost; otherwise the horizontal subpath with the same projection costs at most the length of that projection. The sum of the costs of the vertical edges is small because each box covers the corresponding subpath of p. Hence, by the definition of approximation by certified boxes, the total cost of q is at most a constant times cost(p) plus a small additive term. ∎

By Lemma 4.2 and Proposition 5.1, for each relevant t, the cost of a shortest path from (i, 0) to (t, w) in the shortcut graph is bounded by a constant times k_t plus a small additive term. At the same time, any path in the shortcut graph from (i, 0) to (t, w) has cost at least k_t, since all boxes are correctly certified. So we only need to find the minimal cost of a shortest path from any (i, 0) to (t, w) in the shortcut graph to get an approximation of k_t.

To find the minimal cost, we reset to zero the cost of all horizontal edges in the bottom-most row (the row j = 0) to get a new graph. This graph corresponds to taking the pattern matching graph G'_{T,P}, removing from it all its diagonal edges and adding the shortcut edges. The cost of a path from (0, 0) to (t, w) in this graph is the minimum over i of the cost of a shortest path from (i, 0) to (t, w) in the previous graph.

Hence, we want to calculate the cost of the shortest path from (0, 0) to (t, w) for all t. (In fact, we really care only about t that are multiples of the chosen step size; for all the other values of t we approximate k_t by the value at the previous such multiple.) For this we will use a simple algorithm that will make a single sweep over the shortcut edges sorted by their origin and calculate the distances for t = 1, ..., n. The algorithm will maintain a data structure that at time t will allow us to answer efficiently queries about the cost of the shortest path from (0, 0) to (t, w).

The data structure will consist of a binary tree with a leaf for each text position. Each node is associated with a subinterval of the positions, so that the i-th leaf (counting from left to right) corresponds to the single position i, and each internal node corresponds to the union of the intervals of all its children. We denote by I_v the interval associated with a node v. The depth of the tree is O(log n). At time t, a query to the node v of the data structure will return the cost of the shortest path from (0, 0) to (t, w) that uses some shortcut edge ending at a column in I_v. Each node v of the data structure stores a pair of numbers (c_v, t_v), where c_v is the cost of the relevant shortest path from (0, 0) to (t_v, w) and t_v is the time it was updated the last time. (Initially the cost is set to infinity.) At time t, the query to the node v returns c_v plus the cost of the remaining horizontal steps from column t_v to column t.

At time t, to find the cost of the shortest path from (0, 0) to (t, w), we traverse the data structure from the root to the leaf t. Let v_1, ..., v_r be the left children of the nodes along the path at which we continue to the right child. We query the nodes v_1, ..., v_r (and the leaf t itself) to get answers a_1, ..., a_r. The cost of the shortest path from (0, 0) to (t, w) through a shortcut is the minimum of these answers. As each query takes constant time to answer, computing the shortest-path cost to (t, w) takes O(log n) time.
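
A hedged sketch of an equivalent query structure (class name and encoding are ours). Instead of the (cost, last-update-time) pairs described above, each column i stores the value c − i, where c is the cost of a path from (0, 0) to (i, w) ending with a shortcut at column i; since extending such a path to column t costs one unit per horizontal step in the top row, the cheapest cost through a shortcut at time t is t plus the prefix minimum over columns i ≤ t. A point-update / prefix-minimum binary tree then answers each arrival in O(log n) time.

```python
import math

class PrefixMinTree:
    def __init__(self, n: int):
        self.size = 1 << math.ceil(math.log2(n + 1))
        self.val = [math.inf] * (2 * self.size)   # min over the node's interval

    def update(self, i: int, value: float) -> None:
        # record a new candidate value at position i and refresh ancestors
        i += self.size
        self.val[i] = min(self.val[i], value)
        i //= 2
        while i >= 1:
            self.val[i] = min(self.val[2 * i], self.val[2 * i + 1])
            i //= 2

    def prefix_min(self, t: int) -> float:
        # minimum over stored values at positions 0..t
        lo, hi = self.size, self.size + t
        res = math.inf
        while lo <= hi:
            if lo % 2 == 1:
                res = min(res, self.val[lo]); lo += 1
            if hi % 2 == 0:
                res = min(res, self.val[hi]); hi -= 1
            lo //= 2; hi //= 2
        return res
```

Usage, under the assumptions above: when a path of cost c from (0, 0) to (i, w) ending with a shortcut at column i becomes available, call tree.update(i, c - i); upon the arrival of symbol t, the cheapest cost through a shortcut is t + tree.prefix_min(t).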

The algorithm that outputs the cheapest cost of any path from (0, 0) to (t, w) in the final graph will process the shortcut edges one by one in the order of increasing origin. The algorithm will maintain lists of updates to the data structure to be made before each time t. At time t the algorithm first outputs the cost of the shortest path from (0, 0) to (t, w). Then it takes each shortcut edge originating at column t one by one. (The algorithm ignores shortcut edges that cannot contribute to any answer.) Using the current state of the data structure it calculates the cost of a shortest path from (0, 0) to the head of the shortcut edge and adds to the list