Faster Approximate(d) Text-to-Pattern L1 Distance

01/28/2018
by   Przemysław Uznański, et al.

The problem of finding the distance between a pattern of length m and a text of length n is a typical way of generalizing pattern matching to incorporate a dissimilarity score. For both Hamming and L_1 distances only a super-linear upper bound O(n√(m)) is known, which prompts the question of relaxing the problem: either by asking for a 1 ±ε approximate distance (every distance is reported up to a multiplicative factor), or for a k-approximated distance (distances exceeding k are reported as ∞). We focus on the L_1 distance, for which we show new algorithms achieving complexities of O(ε^-1 n) and O((m+k√(m)) · n/m), respectively. This is a significant improvement upon previous algorithms with runtime O(ε^-2 n) of Lipsky and Porat (Algorithmica 2011) and O(n√(k)) of Amir, Lipsky, Porat and Umanski (CPM 2005). We also provide a series of reductions, showing that if our upper bound for approximate L_1 distance is tight, then so is our upper bound for k-approximated L_1 distance, and if the latter is tight then so is the k-approximated Hamming distance upper bound due to the result of Gawrychowski and Uznański (arXiv 2017).

1 Introduction

One of the fundamental problems in text algorithms is, given a text T of length n and a pattern P of length m, both over some (integer) alphabet Σ, the computation of the distance between P and every m-substring of T. This is a popular way of searching for imperfect occurrences of the pattern in the text, generalizing the standard pattern matching problem. Two distance functions have received the most attention in the context of integer sequences: Hamming distance and L_1 distance.

These problems exhibit the usual story (in pattern matching problems) of starting with a naive quadratic-time upper bound, with the goal of developing a close to linear-time one. [Abr87] gave an algorithm computing text-to-pattern Hamming distance in time O(n√(m log m)). This algorithm goes beyond the usual repertoire of combinatorial tools, and uses boolean convolution as a subroutine. Using a similar approach, [CCI05] and [ALPU05] have shown an identical upper bound for the L_1 distance version of the problem.

However, the bound of O(n√m), up to logarithmic factors, is unsatisfactory. What followed was a compelling argument (see the note [Cli09]) showing that any significant improvement to the bound for Hamming distances by a combinatorial algorithm leads automatically to an improvement in the complexity of boolean matrix multiplication, suggesting that further progress might be hard. Later, a direct reduction from the Hamming distance problem to the L_1 one was shown [LP08], and a reverse reduction was presented in [GLU17], together with a full suite of two-way reductions between those two metrics and other score functions, linking the complexity of those problems together (ignoring poly-logarithmic factors).

Thus, the natural next step is to consider relaxations of those two problems, i.e. to require only the reporting of a multiplicative (1 ± ε) approximation of the distances. A result of [Kar93] has shown how to use random projections to achieve a (1 ± ε) approximation of the Hamming distance in time O(ε^-2 n), up to polylogarithmic factors. Many believed that the so-called variance bound, that is the ε^-2 dependency, is tight, given the evidence of similar lower bounds in sketching of Hamming distance (cf. [Woo04], [JKS08], [CR12]). However, in a breakthrough paper, [KP15] presented a (quite involved) algorithm whose running time depends on ε only as ε^-1, that is O(ε^-1 n) up to polylogarithmic factors. That result was recently simplified in [KP18], with a slightly improved runtime. For the L_1 distance, the only known approximation algorithm was the one from [LP11], with runtime O(ε^-2 n). The question whether the barrier of ε^-2 can be beaten for the L_1 distance remained open.

Another standard way of relaxing exact text-to-pattern distance is to report exactly only the values not exceeding a certain threshold value k, the so-called k-approximated distance. The motivation for this comes from the interpretation of exact text-to-pattern Hamming distance as simply counting mismatches in exact pattern matching; k-approximated Hamming distance then becomes reporting only alignments where there are at most k mismatches. The very first solution to the Hamming distance version of this problem was shown in [LV86], working in time O(nk), using an essentially combinatorial approach of taking O(1) time per mismatch per alignment using LCP queries. This initiated a series of improvements to the complexity, with algorithms of complexity O(n√(k log k)) and O((k^3 + m) · n/m) in [ALP04], later improved to O((k^2 + m) · n/m) by [CFP16] and finally to O((m + k√m) · n/m) by [GU17] (the last three bounds up to polylogarithmic factors). The last result also provides an argument that its bound might be tight up to sub-polynomial factors, by extending the argumentation from [Cli09] to incorporate k into the reduction. On the L_1 side, the only known k-approximated algorithm was from [ALPU05], with complexity O(n√k). The question of whether one can design algorithms that, for polynomially large values of k, still work in almost-linear time remained open (as is the case for the Hamming distance when k = O(√m)).

Problem definition and preliminaries.

Let X = x_1 x_2 … x_ℓ and Y = y_1 y_2 … y_ℓ be two words of equal length over an integer alphabet. We define their L_1 distance as ‖X − Y‖_1 = Σ_i |x_i − y_i|, and their Hamming distance as Ham(X, Y) = |{i : x_i ≠ y_i}|.

The exact text-to-pattern L_1 distance between text T and pattern P is defined as an array S such that S[i] = ‖T[i .. i+m−1] − P‖_1 for every starting position i. The (1 ± ε) approximate text-to-pattern distance is an array S_ε such that for all i, (1 − ε) · S[i] ≤ S_ε[i] ≤ (1 + ε) · S[i]. The k-approximated text-to-pattern distance is the array S_k such that for all i, S_k[i] = S[i] when S[i] ≤ k and S_k[i] = ∞ when S[i] > k. The definitions for exact, approximate and approximated Hamming distance follow in the same manner.
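To make the three output formats concrete, the following is a minimal brute-force sketch of the definitions above. It is a quadratic-time reference only, not one of the algorithms of this paper, and the function names are chosen here for illustration.

```python
import math


def l1_text_to_pattern(text, pattern):
    """Exact L_1 distance between the pattern and every m-substring of the text."""
    n, m = len(text), len(pattern)
    return [sum(abs(text[i + j] - pattern[j]) for j in range(m))
            for i in range(n - m + 1)]


def k_approximated(exact, k):
    """Keep distances not exceeding k exactly; report all others as infinity."""
    return [d if d <= k else math.inf for d in exact]


def is_eps_approximation(estimate, exact, eps):
    """Check the (1 +/- eps) multiplicative guarantee at every alignment."""
    return all((1 - eps) * d <= e <= (1 + eps) * d
               for e, d in zip(estimate, exact))


text = [1, 5, 2, 2, 7]
pattern = [2, 2]
exact = l1_text_to_pattern(text, pattern)   # [4, 3, 0, 5]
capped = k_approximated(exact, 3)           # [inf, 3, 0, inf]
```

Note that a (1 ± ε) approximation must report 0 wherever the exact distance is 0, since the guarantee is multiplicative.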

We assume that all of the values in the input are positive. If not, then we can add some large integer to every value of the input without changing the distance. Let  be the upper bound on every value of the input. We also assume the RAM model, with words big enough to hold integers up to , and with arithmetic operations over those in constant time. Unless stated otherwise, we denote the text length as n and the pattern length as m.

We define the size of the run-length encoding (RLE) of a string as the number of its runs (maximal sequences of identical consecutive letters). We say that a string is r-RLE if the size of its RLE is at most r.
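As a small illustration of this definition, the sketch below computes the run-length encoding of a sequence and checks a bound on its size. The helper names `rle` and `is_r_rle` are illustrative only.

```python
from itertools import groupby


def rle(s):
    """Run-length encoding: list of (letter, run length) for maximal runs."""
    return [(letter, len(list(run))) for letter, run in groupby(s)]


def is_r_rle(s, r):
    """True if the sequence consists of at most r maximal runs."""
    return len(rle(s)) <= r


runs = rle([3, 3, 3, 1, 1, 3])   # [(3, 3), (1, 2), (3, 1)]
```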

Our results.

We show improved algorithms for both the (1 ± ε) approximation and the k-approximated version of the text-to-pattern L_1 distance.

Theorem 1.1.

There is a randomized Monte Carlo algorithm that outputs a (1 ± ε) approximation of text-to-pattern L_1 distance in time O(ε^-1 n). The algorithm works with high probability.

Theorem 1.2.

There is a deterministic algorithm that outputs k-approximated text-to-pattern L_1 distance in time O((m + k√m) · n/m).

This shows that, similarly to reporting k-approximated Hamming distance, one can report exactly all positions where the L_1 distance is at most O(√m) in almost linear time.

Our results show that in text-to-pattern approximate/approximated distance reporting, there does not seem to be a significant difference between the complexities of the Hamming and L_1 distance versions (up to current upper bounds).

We also link the k-approximated Hamming and L_1 distance problems.

Theorem 1.3.

Let  be the runtime of k-approximated text-to-pattern Hamming distance. Then k-approximated text-to-pattern L_1 distance is computed in time

Corollary 1.4.

If the k-approximated text-to-pattern Hamming distance is computed in time  for  then k-approximated text-to-pattern L_1 distance is computed in time  as well.

[Figure: diagram of reductions among k-approximated Hamming distance, k-approximated L_1 distance, Boolean Matrix Multiplication, bounded-RLE Hamming distance and bounded-RLE L_1 distance. Edge labels: [GU17], [GLU17], (3.1), (3.4), (3.5), (3.6).]
Figure 1: Existing (dashed lines) and new (solid lines) reductions.

Overview of the techniques.

The main technique used in our approximation algorithm, just as in the previous work of [LP11], is generalized weighted mismatches: given an arbitrary weight function σ on pairs of letters, we output the array S_σ such that S_σ[i] = Σ_j σ(T[i+j], P[j]). We use the following algorithm, described first by [LP11], that computes S_σ in time O(|Σ| n log m): for each letter c ∈ Σ, the contribution of this letter can be computed from a convolution of two vectors, the first being χ_c(T), the characteristic vector of letter c in T, and the second being σ(c, P), the vector of weights σ(c, P[j]).
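The per-letter convolution scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation from [LP11]: the helper `convolve` is a plain quadratic convolution standing in for an FFT-based one, and `sigma` is any weight function on pairs of letters supplied by the caller.

```python
def convolve(a, b):
    """Plain quadratic convolution; an FFT-based routine would replace it."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b):
                out[i + j] += x * y
    return out


def generalized_weighted_matching(text, pattern, sigma):
    """For every alignment i, return the sum over j of sigma(text[i+j], pattern[j])."""
    n, m = len(text), len(pattern)
    scores = [0] * (n - m + 1)
    for c in set(text):
        # Characteristic vector of letter c in the text.
        chi = [1 if t == c else 0 for t in text]
        # Weights of c against each pattern letter, reversed so that
        # convolution lines the pattern up with the text window.
        weights = [sigma(c, p) for p in reversed(pattern)]
        conv = convolve(chi, weights)
        for i in range(n - m + 1):
            scores[i] += conv[i + m - 1]
    return scores


text, pattern = [1, 5, 2, 2, 7], [2, 2]
l1 = generalized_weighted_matching(text, pattern, lambda x, y: abs(x - y))
ham = generalized_weighted_matching(text, pattern, lambda x, y: int(x != y))
```

One convolution is performed per distinct letter, which is why the running time of this scheme scales with the alphabet size.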

In the k-approximated algorithm, we use the techniques of alignment filtering and kernelization, used in the context of Hamming distances by [CFP16] and then refined and simplified by [GU17]. The general idea is to first consider the periodic structure of the pattern. If the pattern is not periodic enough, then m-substrings of the text that have a small distance to the pattern cannot occur too often. One can use an approximate algorithm to filter out all the alignments with too large a distance, and manually verify each of the remaining alignments in O(k) time. If the pattern is periodic enough, then both the pattern and the text can be rearranged into a new instance that retains letter alignments, and in which both the new pattern and the new text are compressible (both have an RLE of only O(k) blocks). This reduces the problem of, say, text-to-pattern Hamming distances to the same problem, but with additional constraints on the RLE of the pattern and the text.

Linearity preserving reductions, introduced in [GLU17], are a formalization of previously existing reductions between metrics (cf. [LP08]). The main idea is that in order to show a reduction between two pattern-matching problems, one can represent them as, say, (+,⋄) and (+,□) convolutions, and show a reduction just between the ⋄ and □ binary operators. To make such a reduction work, it needs to be of a specific form. More precisely, fix an integer t being the size of the reduction, integer coefficients α_1, …, α_t and functions f_1, …, f_t and g_1, …, g_t such that for any x, y:

x ⋄ y = Σ_{i=1..t} α_i · (f_i(x) □ g_i(y)).

Then the (+,⋄)-convolution of X and Y is computed as a linear combination of t (+,□)-convolutions of f_i(X) with g_i(Y), where f_i and g_i are applied letter by letter, for 1 ≤ i ≤ t.
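A toy instance may help to see the shape of such a reduction. The sketch below is an illustrative example chosen here, not one of the reductions from [GLU17]: over the binary alphabet, the mismatch indicator satisfies [x ≠ y] = x · (1 − y) + (1 − x) · y, which is a reduction of size 2 with both coefficients equal to 1 from the mismatch operator to ordinary multiplication.

```python
def op_convolution(text, pattern, op):
    """For every alignment i, sum op(text[i+j], pattern[j]) over j."""
    n, m = len(text), len(pattern)
    return [sum(op(text[i + j], pattern[j]) for j in range(m))
            for i in range(n - m + 1)]


def mismatch(x, y):
    return int(x != y)


def product(x, y):
    return x * y


# (coefficient, f, g) triples: mismatch(x, y) = sum of coeff * product(f(x), g(y)).
REDUCTION = [
    (1, lambda x: x, lambda y: 1 - y),
    (1, lambda x: 1 - x, lambda y: y),
]


def mismatch_via_products(text, pattern):
    """Hamming distances over {0, 1} as a linear combination of product-convolutions."""
    total = [0] * (len(text) - len(pattern) + 1)
    for alpha, f, g in REDUCTION:
        part = op_convolution([f(x) for x in text],
                              [g(y) for y in pattern], product)
        total = [t + alpha * p for t, p in zip(total, part)]
    return total


text, pattern = [0, 1, 1, 0, 1], [1, 0]
direct = op_convolution(text, pattern, mismatch)
reduced = mismatch_via_products(text, pattern)
```

Because the identity holds letter by letter and the outer operation is a sum, it carries over to every alignment at once, which is the point of requiring this linear form.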

2 Proof of Theorem 1.1

Input: Integer strings T and P.
Output: Score vector .
def :
   if  then
      return 0
   else if  then
      return
   else
      return

def :
   for  to  do

   return
Algorithm 1: (1 ± ε)-approximation of text-to-pattern L_1 distance.

In this section we prove Theorem 1.1. We use a procedure generalized_weighted_matching that computes, for a text T, a pattern P and an arbitrary weight function σ, the array S_σ such that S_σ[i] = Σ_j σ(T[i+j], P[j]), in O(|Σ| n log m) time.

Let , and let be the smallest positive integer such that . We claim that with such parameters, Algorithm 1 outputs the desired -approximation in the claimed time. Let be its output.

Theorem 2.1.

For any , with probability at least .

Proof.

Consider first and , two characters of the input. We analyze how well Algorithm 1 approximates in the consecutive calls of generalized_weighted_matching. First, fix value of and consider the binary representations of and . More precisely, let and for some . Algorithm 1

in essence estimates

with where is the estimation of a contribution of to and depends only on values of for in the following way:

  • If, for every we have , then .

  • Otherwise, let be the largest such that and . If , then the local estimation is that and so , and otherwise .

Consider and , that is is the position of the highest bit on which and differ, and is the position of the highest bit of . In general, , and we say that pair is -bad, if .

We first observe that for a pair to be at least -bad, a following condition must be met: . Since is chosen uniformly at random from a large enough range of integers, there is

We also observe following: for any pair , in , all the coefficients are computed correctly, since for any such that there is , and then . Therefore

If a pair is -bad, it immediately follows that the absolute error of estimation is at most .

We now estimate expected error in estimation based on choice of . If a particular pair is -bad, then . Using the previous observations, we have

By linearity of expectation and by Markov’s inequality the claim follows. ∎

Now, a standard amplification technique applies: it is enough to repeat Algorithm 1 independently r times and to take, for each position, the median of the r reported values as the final estimate. Taking r = Θ(log n) large enough makes the final estimate at any single position good with high probability, and by the union bound the whole output array is a good estimate of the exact distances.
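The amplification step can be sketched as below. This is a generic median-of-repetitions wrapper, assuming only that a single run returns one estimate per alignment; `noisy_oracle` is a hypothetical stand-in for one run of Algorithm 1, used here purely for demonstration.

```python
import random
from statistics import median


def amplify(run_once, repetitions):
    """Run an estimator several times; take the per-position median."""
    runs = [run_once() for _ in range(repetitions)]
    # zip(*runs) groups together all estimates for the same alignment.
    return [median(column) for column in zip(*runs)]


def noisy_oracle(exact, failure_probability, rng):
    """Hypothetical single run: each entry is wrong with the given probability."""
    def run_once():
        return [d if rng.random() >= failure_probability else 10 * d + 1
                for d in exact]
    return run_once


exact = [4, 3, 0, 5]
estimate = amplify(noisy_oracle(exact, 0.25, random.Random(0)), repetitions=41)
```

The median at a position is correct as long as more than half of the runs are correct there, which is why a per-run success probability above 1/2 is enough.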

The complexity of Algorithm 1 is dominated by generalized_weighted_matching being invoked times on alphabet of size . Each such invocation takes , and Algorithm 1 takes time and the total time for computing -approximation is .

3 Proofs of Theorem 1.2 and Theorem 1.3

In our construction of the algorithm for k-approximated text-to-pattern L_1 distance, we make extensive use of techniques used in [GU17] for solving k-approximated Hamming distances. More precisely, we need two components:

Corollary 3.1 ([CFP16], [GU17]).

The k-approximated text-to-pattern Hamming distance problem reduces, in O(n) time up to polylogarithmic factors, to O(n/m) instances of text-to-pattern Hamming distance on O(k)-RLE inputs of length O(m).

Corollary 3.2 ([GU17]).

Text-to-pattern Hamming distance on O(k)-RLE inputs of length O(m) is computed exactly in O(k√m) time, up to polylogarithmic factors.

We also need the following definition, as in [GU17] and [CFP16]: an integer π > 0 is an x-period of a string S if Ham(S[π+1 .. |S|], S[1 .. |S|−π]) ≤ x.

Lemma 3.3 (Fact 3.1 in [CFP16]).

If the minimal 2x-period of the pattern is ℓ, then for any two distinct m-substrings of the text with Hamming distance to the pattern at most x, their starting positions are at distance at least ℓ.
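The definition of an approximate period and the statement of the lemma can be checked by brute force on small inputs. The sketch below uses 0-based Python slices and helper names chosen here for illustration; it treats x as the mismatch budget and looks at the minimal 2x-period of the pattern.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def is_x_period(s, pi, x):
    """pi is an x-period if s shifted by pi disagrees with s in at most x places."""
    return 0 < pi <= len(s) and hamming(s[pi:], s[:len(s) - pi]) <= x


def minimal_x_period(s, x):
    """Smallest x-period; len(s) always qualifies, so the search terminates."""
    return next(pi for pi in range(1, len(s) + 1) if is_x_period(s, pi, x))


def occurrences(text, pattern, x):
    """Starting positions where the pattern matches with at most x mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= x]


pattern, text, x = "abab", "abababab", 0
period = minimal_x_period(pattern, 2 * x)   # 2
hits = occurrences(text, pattern, x)        # [0, 2, 4]
gaps = [b - a for a, b in zip(hits, hits[1:])]
```

In this example every gap between consecutive occurrences equals the minimal period, so the bound of the lemma is met with equality.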

We start with a version of Corollary 3.1.

Theorem 3.4.

The k-approximated text-to-pattern L_1 distance problem reduces, in O(n) time up to polylogarithmic factors, to O(n/m) instances of exact text-to-pattern L_1 distance with O(k)-RLE inputs of length O(m), where both pattern and text might have wildcards.

Proof.

By a standard trick, it is enough to consider the case where the text is of length 2m, as any other case can be reduced to O(n/m) instances of this type. We proceed by showing the key features of the reduction in Corollary 3.1.

Observe that for any words X, Y of equal length over an integer alphabet, Ham(X, Y) ≤ ‖X − Y‖_1, since every mismatch contributes at least 1 to the L_1 distance. This makes any technique eliminating alignments with too large a Hamming distance correct for filtering L_1 distances as well. We can determine easily minimal -period of the pattern. As in [GU17], we run Karloff’s approximate Hamming distances algorithm [Kar93] matching the pattern against the pattern, with precision . This takes  time, and we end up with one of two cases:

  • every -period of the pattern is at least , or

  • there is a -period of the pattern that is at most .

No small -period.

We run Karloff’s algorithm on the pattern against the text, with . We then filter out all alignments where there were more than  reported mismatches. By Lemma 3.3, there are  such alignments. By the relation between L_1 and Hamming distances, all discarded alignments are safe to discard for L_1 distances as well. Then, we test every such position using the “kangaroo jumps” technique of Landau and Vishkin [LV86], using  constant-time operations per position, in  total time. The only modification in this part with respect to the approach of [GU17] is that with each mismatch found, we account for the L_1 score it generates.

Small -period.

Let ℓ be such a small period. The initial approach from [GU17] in this case can be summarized as follows. First, a subword T' of T is located that contains all alignments of P that match with Hamming distance at most k. Thus, for our purposes, it contains all alignments with L_1 distance to P at most k as well (cf. Lemma 2.4 in [GU17]). Then both P and T' are padded with special characters, and are subsequently rearranged into P* and T* with the following properties (cf. Lemma 2.5 in [GU17]):

  • Both P* and T* are of length O(m).

  • Both P* and T* have O(k) runs.

  • There is a map such that if a pair of letters is aligned between P and T', then there is a corresponding aligned pair of identical letters in P* and T*. Moreover, any additional aligned pair of letters in P* and T* involves at least one special letter. The map depends only on values of , and .

The last property, coupled with the invariance of Hamming distance under permuting of the input (as long as we preserve alignments), means that text-to-pattern Hamming distances between P* and T* encode, up to a constant additive term and reordering, text-to-pattern Hamming distances between P and T', and thus all small distances between P and T. We use the same transformation and claim that if all special characters used are wildcards (symbols that have distance 0 to every other character), the L_1 distance is preserved. ∎

Now, instead of building an explicit algorithm for computing L_1 distance on bounded-RLE instances, we make use of the existing algorithm for Hamming distances and a reduction that preserves bounds on the RLE.

Corollary 3.5 (cf. Theorem 2.1, Theorem 2.2 and Lemma A.1 in [GLU17]).

For any , there is a linearity preserving reduction from L_1 distance between integers from  to  instances of Hamming distance. There is a converse reduction from Hamming distance to  instances of L_1 distance. These reductions allow for wildcards in the input and produce wildcard-less instances on the output.

We now have enough tools to construct a k-approximated L_1 distance algorithm of the desired runtime.

Proof of Theorem 1.2.

By Theorem 3.4, we reduce the input instance to O(n/m) instances of bounded-RLE L_1 distance. By fixing  in Corollary 3.5, each of those reduces to  instances of text-to-pattern Hamming distance. Additionally, the reduction does not create any new runs, so the output instances have the same bound on the RLE, and Corollary 3.2 applies. The final runtime is O((m + k√m) · n/m). ∎

To finish the full chain of reductions, we also show the following.

Lemma 3.6.

Text-to-pattern Hamming distance on -RLE inputs with text and pattern of length reduces to instances of -approximated Hamming distance on inputs of length .

Proof.

We proceed in a manner similar to [GU17]. We observe that it is enough to compute the second discrete derivative D²S of the output array S, defined as D²S[i] = S[i+2] − 2 · S[i+1] + S[i], since having D²S and two initial values of S (the latter can be computed naively in time O(m)) is enough to recover all of S. For any two blocks of the same letter, one in the text and one in the pattern, D²S needs to be updated in only four places, all determined by the endpoints of the two blocks. We now explain how to deal with the first kind of updates, with the three following being done in an analogous manner.
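The claim about the second discrete derivative can be checked numerically. The sketch below, with illustrative names and inclusive block endpoints assumed, counts how many positions of a text block and a pattern block are aligned at each shift, takes second differences, and recovers the original sequence from them and two initial values.

```python
def overlap(a, b, c, d, shift):
    """Number of aligned positions between text block [a, b] and pattern
    block [c, d] when the pattern is shifted by `shift` (inclusive ends)."""
    return max(0, min(b, d + shift) - max(a, c + shift) + 1)


def second_difference(values):
    """Second discrete derivative: v[i+2] - 2*v[i+1] + v[i]."""
    return [values[i + 2] - 2 * values[i + 1] + values[i]
            for i in range(len(values) - 2)]


def recover(second, first_value, second_value):
    """Rebuild a sequence from its second differences and two initial values."""
    out = [first_value, second_value]
    for delta in second:
        out.append(delta + 2 * out[-1] - out[-2])
    return out


# Text block occupies positions 10..14, pattern block positions 0..2.
counts = [overlap(10, 14, 0, 2, s) for s in range(5, 18)]
# counts == [0, 0, 0, 1, 2, 3, 3, 3, 2, 1, 0, 0, 0]
d2 = second_difference(counts)
nonzero = [i for i, v in enumerate(d2) if v != 0]
rebuilt = recover(d2, counts[0], counts[1])
```

The overlap count is a trapezoid as a function of the shift, so its second difference is non-zero only at the four breakpoints (or fewer, when the two blocks have equal length and two breakpoints coincide).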

We first reduce to a problem of -sparse text-to-pattern Hamming distance, where text and pattern are of length and have each at most regular characters, with every other character being wildcard (special character having 0 distance to every other character). We construct sparse instance as follows: for every position in that starts a block, we set , and similarly in pattern for a position (that starts a block), we set . Observe, that if , then in the answer there is , and if then remains unchanged (answer counts mismatches, while we want to count matches). To invert the answer, we create such that iff and otherwise, and in an analogous manner. Then a single convolution of with counts for every alignment total number of non-special text characters that were aligned with non-special pattern characters. A single subtraction yields answer.

To reduce from -sparse instances of Hamming distance to -approximated Hamming distance, we follow an analogous reduction from [GLU17] (c.f. Lemma A.1) that reduces Hamming distance on to Hamming distance on . Write first instance such that iff and iff (with analogous transformation on to compute ). Write second instance such that iff and iff (with the same transformation on to compute ). Now we observe that , thus it is enough to compute exact Hamming text-to-pattern distances on those two instances and subtract them. However, we observe that in both of them, there are in total at most characters different than 0, thus -approximated Hamming distance works just as fine. ∎

We now observe that the proof of Theorem 1.3 follows automatically from plugging Lemma 3.6 in place of Corollary 3.2 in the proof of Theorem 1.2. We conclude this section with a remark on the necessity of applying the reduction to the kernelized version of the problems, instead of directly to the approximated problems. Namely, some coefficients of the reduction are negative, which makes the reduction fail to work with arithmetic on values that include ∞ (for instance, ∞ − ∞ is undefined).

References

  • [Abr87] Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
  • [ALP04] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with mismatches. J. Algorithms, 50(2):257–275, 2004.
  • [ALPU05] Amihood Amir, Ohad Lipsky, Ely Porat, and Julia Umanski. Approximate matching in the L_1 metric. In CPM, pages 91–103, 2005.
  • [CCI05] Peter Clifford, Raphaël Clifford, and Costas S. Iliopoulos. Faster algorithms for δ,γ-matching and related problems. In CPM, pages 68–78, 2005.
  • [CFP16] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. The k-mismatch problem revisited. In SODA, pages 2039–2052. SIAM, 2016.
  • [Cli09] Raphaël Clifford. Matrix multiplication and pattern matching under Hamming norm. http://www.cs.bris.ac.uk/Research/Algorithms/events/BAD09/BAD09/Talks/BAD09-Hammingnotes.pdf, 2009. Retrieved March 2017.
  • [CR12] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-hamming-distance. SIAM J. Comput., 41(5):1299–1317, 2012.
  • [GLU17] Daniel Graf, Karim Labib, and Przemysław Uznański. Hamming distance completeness and sparse matrix multiplication. CoRR, abs/1711.03887, 2017.
  • [GU17] Paweł Gawrychowski and Przemysław Uznański. Optimal trade-offs for pattern matching with k mismatches. CoRR, abs/1704.01311, 2017.
  • [JKS08] T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129–135, 2008.
  • [Kar93] Howard J. Karloff. Fast algorithms for approximately counting mismatches. Inf. Process. Lett., 48(2):53–60, 1993.
  • [KP15] Tsvi Kopelowitz and Ely Porat. Breaking the variance: Approximating the hamming distance in 1/ε time per alignment. In FOCS, pages 601–613, 2015.
  • [KP18] Tsvi Kopelowitz and Ely Porat. A simple algorithm for approximating the text-to-pattern hamming distance. In SOSA, pages 10:1–10:5, 2018.
  • [LP08] Ohad Lipsky and Ely Porat. L_1 pattern matching lower bound. Inf. Process. Lett., 105(4):141–143, 2008.
  • [LP11] Ohad Lipsky and Ely Porat. Approximate pattern matching with the L_1, L_2 and L_∞ metrics. Algorithmica, 60(2):335–348, 2011.
  • [LV86] Gad M. Landau and Uzi Vishkin. Efficient string matching with mismatches. Theor. Comput. Sci., 43:239–249, 1986.
  • [Woo04] David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA, pages 167–175, 2004.