1 Introduction
One of the fundamental problems in text algorithms is, given a text $t$ of length $n$ and a pattern $p$ of length $m$, both over some (integer) alphabet $\Sigma$, the computation of the distance between $p$ and every $m$-substring of $t$. This is a popular way of searching for imperfect occurrences of the pattern in the text, generalizing the standard pattern matching problem. Two distance functions have received the most attention in the context of integer sequences: the Hamming distance and the $\ell_1$ distance.
Those problems exhibit the usual story in pattern matching of starting with a naive quadratic-time upper bound, the goal being to develop a close-to-linear-time algorithm. The result of [Abr87] gave an algorithm computing the text-to-pattern Hamming distance in time $O(n \sqrt{m \log m})$. This algorithm goes beyond the usual repertoire of combinatorial tools, and uses boolean convolution as a subroutine. Using a similar approach, [CCI05] and [ALPU05] showed an identical upper bound for the $\ell_1$ version of the problem.
However, the bound of $O(n \sqrt{m \log m})$ is unsatisfactory. What followed was a compelling argument (see the note [Cli09]) showing that any significant improvement to this bound for Hamming distances by a combinatorial algorithm automatically leads to an improvement in the complexity of boolean matrix multiplication, suggesting that further progress might be hard. Later, a direct reduction from the Hamming distance problem to the $\ell_1$ one was shown [LP08], and a reverse reduction was presented in [GLU17], together with a full suite of two-way reductions between those two metrics and other score functions, linking the complexities of those problems together (up to polylogarithmic factors).
Thus, the natural next step is to consider relaxations of those two problems, i.e. to require only the reporting of a multiplicative approximation of the distances. A result of [Kar93] showed how to use random projections to achieve a $(1 \pm \varepsilon)$-approximation of the Hamming distances in time $\widetilde{O}(n/\varepsilon^2)$. Many believed that this so-called variance bound, that is, the $\varepsilon^{-2}$ dependency, is tight, given the evidence of similar lower bounds in sketching of the Hamming distance (c.f. [Woo04], [JKS08], [CR12]). However, in a breakthrough paper, [KP15] presented a (quite involved) algorithm working in time $\widetilde{O}(n/\varepsilon)$. That result was recently simplified in [KP18], with a slight improvement to the polylogarithmic factors in the runtime. For the $\ell_1$ distance, the only known approximation algorithm was the one from [LP11], with runtime $\widetilde{O}(n/\varepsilon^2)$. The question whether the barrier of $\varepsilon^{-2}$ can be beaten for the $\ell_1$ distance remained open.
Another standard way of relaxing the exact text-to-pattern distance is to report exactly only the values not exceeding a certain threshold $k$, the so-called $k$-approximated distance. The motivation for this comes from the interpretation of the exact text-to-pattern Hamming distance as simply counting mismatches in exact pattern matching; the $k$-approximated Hamming distance then reports only the alignments with at most $k$ mismatches. The very first solution to the Hamming distance version of this problem was shown in [LV86], working in time $O(nk)$ and using an essentially combinatorial approach of taking constant time per mismatch per alignment using LCP queries. This initiated a series of improvements to the complexity, with algorithms of complexity $O(n \sqrt{k \log k})$ and $\widetilde{O}(n + k^3 n/m)$ in [ALP04], later improved to $\widetilde{O}(n + k^2 n/m)$ by [CFP16] and finally to $\widetilde{O}(n + kn/\sqrt{m})$ by [GU17]. The last result also provides an argument that $\widetilde{O}(n + kn/\sqrt{m})$ might be tight up to subpolynomial factors, by extending the argumentation from [Cli09] to incorporate $k$ into the reduction. On the $\ell_1$ side, the only known $k$-approximated algorithm was the one from [ALPU05], with complexity $O(n \sqrt{k \log k})$. The question whether one can design algorithms that still work in almost-linear time for polynomially large values of $k$ remained open (as is the case for the Hamming distance when $k \le \sqrt{m}$).
Problem definition and preliminaries.
Let $x$ and $y$ be two words of equal length over an integer alphabet $\{1, \ldots, n^c\}$ for some constant $c$. We define their $\ell_1$ distance as $\ell_1(x, y) = \sum_i |x_i - y_i|$, and their Hamming distance as $\mathrm{Ham}(x, y) = |\{i : x_i \neq y_i\}|$.
The exact text-to-pattern $\ell_1$ distance between a text $t$ and a pattern $p$ is defined as an array $S$ such that $S[i] = \ell_1(t[i .. i+m-1], p)$. The $(1 \pm \varepsilon)$-approximate text-to-pattern $\ell_1$ distance is an array $S'$ such that for all $i$, $(1 - \varepsilon) S[i] \le S'[i] \le (1 + \varepsilon) S[i]$. The $k$-approximated text-to-pattern $\ell_1$ distance is an array $S''$ such that for all $i$, $S''[i] = S[i]$ when $S[i] \le k$ and $S''[i] > k$ when $S[i] > k$. The definitions of the exact, approximate and approximated Hamming distance follow in the same manner.
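As a reference point, the exact and $k$-approximated arrays can be computed naively in $O(nm)$ time (a minimal sketch; the function names are ours, not from the paper):

```python
def exact_l1_distances(text, pattern):
    # S[i] = l1 distance between pattern and the m-substring of text starting at i
    n, m = len(text), len(pattern)
    return [sum(abs(text[i + j] - pattern[j]) for j in range(m))
            for i in range(n - m + 1)]

def k_approximated(S, k):
    # report S[i] exactly when S[i] <= k; otherwise only the fact that S[i] > k
    return [s if s <= k else None for s in S]
```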
We assume that all of the values in the input are positive. If not, then we can add some large integer to every value of the input without changing the distances. Let $U$ be the upper bound on every value of the input. We also assume the RAM model, with words big enough to hold integers up to $mU$, and with arithmetic operations over those in constant time. Unless stated otherwise, we denote the text length by $n$ and the pattern length by $m$.
We define the size of the run-length encoding (RLE) of a string as the number of its runs (maximal sequences of identical letters). We say that a string is RLE-$k$ if the size of its RLE is at most $k$.
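For concreteness, a minimal run-length encoder (our helper, not from the paper):

```python
def rle(s):
    # list of (letter, run_length) pairs; the RLE size is the number of runs
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [tuple(r) for r in runs]
```

A string $s$ is then RLE-$k$ exactly when `len(rle(s)) <= k`.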
Our results.
We show improved algorithms for both the approximation and the $k$-approximated version of the text-to-pattern $\ell_1$ distance.
Theorem 1.1.
There is a randomized Monte Carlo algorithm that outputs a $(1 \pm \varepsilon)$-approximation of the text-to-pattern $\ell_1$ distance in time $\widetilde{O}(n/\varepsilon)$. The algorithm works with high probability.
Theorem 1.2.
There is a deterministic algorithm that outputs the $k$-approximated text-to-pattern $\ell_1$ distance in time $\widetilde{O}(n + kn/\sqrt{m})$.
This shows that, similarly to reporting approximated Hamming distances, one can report exactly all positions where the $\ell_1$ distance is at most $k$ in almost linear time.
Our results show that in text-to-pattern approximate/approximated distance reporting, there does not seem to be a significant difference between the complexities of the Hamming and the $\ell_1$ versions (up to the current upper bounds).
We also link the $k$-approximated Hamming and $\ell_1$ distance problems.
Theorem 1.3.
Let $T(n, m, k)$ be the runtime of $k$-approximated text-to-pattern Hamming distance. Then the $k$-approximated text-to-pattern $\ell_1$ distance is computed in time $\widetilde{O}\big(n + (n/m) \cdot T(O(m), O(m), O(k))\big)$.
Corollary 1.4.
If the $k$-approximated text-to-pattern Hamming distance is computed in time $T(n, m, k) = \widetilde{O}(n + kn/\sqrt{m})$, then the $k$-approximated text-to-pattern $\ell_1$ distance is computed in time $\widetilde{O}(n + kn/\sqrt{m})$ as well.
Overview of the techniques.
The main technique used in our approximation algorithm, just as in the previous work of [LP11], is the generalized weighted mismatches problem: given an arbitrary weight function $w : \Sigma \times \Sigma \to \mathbb{Z}$, output an array $S_w$ such that $S_w[i] = \sum_j w(t[i+j], p[j])$. We use the following algorithm, described first by [LP11], that computes $S_w$ in time $O(|\Sigma|\, n \log m)$: for each letter $c \in \Sigma$, the contribution of this letter is computed from the convolution of two vectors, the first being the characteristic vector of the letter $c$ in $t$, and the second being $(w(c, p[j]))_j$.

In the $k$-approximated algorithm, we use the techniques of alignment filtering and kernelization, used in the context of Hamming distances by [CFP16] and then refined and simplified by [GU17]. The general idea is to first consider the periodic structure of the pattern. If the pattern is not periodic enough, then substrings of the text that have small distance to the pattern cannot occur too often. One can use an approximate algorithm to filter out all the alignments with too large a distance, and manually verify all the alignments that remain in $\widetilde{O}(k)$ time each. If the pattern is periodic enough, then both the pattern and the text can be rearranged into a new instance that retains letter alignments, and both the new pattern and the new text are compressible (both have RLE of only $O(k)$ blocks). This reduces the problem of, say, text-to-pattern Hamming distances to the same problem, but with additional constraints on the RLE of the pattern and the text.
Linearity-preserving reductions, introduced in [GLU17], are a formalization of previously existing reductions between metrics (cf. [LP08]). The main idea is that, in order to show a reduction between two pattern-matching problems, one can represent them as, say, $(+, \diamond)$ and $(+, \bullet)$ convolutions, and show a reduction just between the binary operators $\diamond$ and $\bullet$. To make such a reduction work, it needs to be of a specific form. More precisely, fix an integer $r$ being the size of the reduction, integer coefficients $\alpha_1, \ldots, \alpha_r$ and functions $f_1, g_1, \ldots, f_r, g_r$ such that for any $x, y$:
$$x \diamond y = \sum_{j=1}^{r} \alpha_j \cdot \big( f_j(x) \bullet g_j(y) \big).$$
Then the $(+,\diamond)$-convolution of $t$ and $p$ is computed as a linear combination of the $r$ $(+,\bullet)$-convolutions of $t_j$ with $p_j$, where $t_j[i] = f_j(t[i])$ and $p_j[i] = g_j(p[i])$.
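As a simple concrete instance of this form (our illustration, not an example taken from [GLU17]): over a finite alphabet $\Sigma$, the Hamming mismatch indicator decomposes as

```latex
[x \neq y] \;=\; 1 \;-\; \sum_{c \in \Sigma} \mathbf{1}[x = c] \cdot \mathbf{1}[y = c],
```

a reduction of size $|\Sigma| + 1$ with coefficients $1, -1, \ldots, -1$, which turns mismatch counting into $|\Sigma|$ ordinary $(+,\cdot)$-convolutions of characteristic vectors; this is precisely the structure behind the boolean-convolution approach of [Abr87].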
2 Proof of Theorem 1.1
In this section we prove Theorem 1.1. We use a procedure generalized_weighted_matching that computes, for a text $t$, a pattern $p$ and an arbitrary weight function $w$, the array $S_w$ such that $S_w[i] = \sum_j w(t[i+j], p[j])$, in $O(|\Sigma|\, n \log m)$ time.
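A sketch of this primitive, with naive $O(nm)$ correlations standing in for the FFT-based convolutions that give the stated runtime (a minimal illustration, not the paper's implementation):

```python
def correlate(a, b):
    # out[i] = sum_j a[i+j] * b[j], for every alignment i of b inside a
    return [sum(a[i + j] * bj for j, bj in enumerate(b))
            for i in range(len(a) - len(b) + 1)]

def generalized_weighted_matching(text, pattern, w):
    # S_w[i] = sum_j w(text[i+j], pattern[j]): one correlation per letter of the text
    S = [0] * (len(text) - len(pattern) + 1)
    for c in set(text):
        indicator = [1 if ch == c else 0 for ch in text]  # characteristic vector of c
        weights = [w(c, pj) for pj in pattern]            # weights of c against pattern
        for i, v in enumerate(correlate(indicator, weights)):
            S[i] += v
    return S
```

With $w(x, y) = |x - y|$ this computes exact $\ell_1$ distances; with $w(x, y) = [x \neq y]$ it computes Hamming distances.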
Let $r$ be an integer chosen uniformly at random from a large enough range, and let $B$ be the smallest positive integer such that $2^B \ge 1/\varepsilon$. We claim that with such parameters, Algorithm 1 outputs the desired approximation in the claimed time. Let $S'$ be its output.
Theorem 2.1.
For any $i$, $\big|S'[i] - S[i]\big| = O(\varepsilon \log U) \cdot S[i]$ with probability at least $2/3$.
Proof.
Consider first $x$ and $y$, two characters of the input. We analyze how well Algorithm 1 approximates $|x - y|$ in the consecutive calls of generalized_weighted_matching. First, fix the value of $r$ and consider the binary representations of $x + r$ and $y + r$. More precisely, let $x + r = \sum_{\ell} x_\ell 2^\ell$ and $y + r = \sum_{\ell} y_\ell 2^\ell$ for some $x_\ell, y_\ell \in \{0, 1\}$. Algorithm 1 in essence estimates $|x - y|$ with $e(x, y) = \sum_\ell e_\ell(x, y)$, where $e_\ell(x, y)$ is the estimation of the contribution of bit $\ell$ to $|x - y|$ and depends only on the values of $x_{\ell'}, y_{\ell'}$ for $\ell' \in [\ell, \ell + B]$, in the following way:
If, for every $\ell' \in [\ell, \ell + B]$, we have $x_{\ell'} = y_{\ell'}$, then $e_\ell(x, y) = 0$.

Otherwise, let $u$ be the largest $\ell' \in [\ell, \ell + B]$ such that $x_{\ell'} \neq y_{\ell'}$. If $x_u = 1$ (and $y_u = 0$), then the local estimation is that $x + r > y + r$ and so $e_\ell(x, y) = 2^\ell (x_\ell - y_\ell)$, and otherwise $e_\ell(x, y) = 2^\ell (y_\ell - x_\ell)$.
Consider $d$ and $b$, that is, $d$ is the position of the highest bit on which $x + r$ and $y + r$ differ, and $b$ is the position of the highest bit of $|x - y|$. In general $d \ge b$, and we say that a pair $(x, y)$ is $j$-bad if $d - b \ge j$.
We first observe that for a pair to be at least $j$-bad, the following condition must be met: some multiple of $2^{b+j}$ lies in the interval $(\min(x, y) + r, \max(x, y) + r]$. Since $r$ is chosen uniformly at random from a large enough range of integers, there is $\Pr[(x, y) \text{ is } j\text{-bad}] \le |x - y| / 2^{b+j} \le 2^{1-j}$.
We also observe the following: for any pair $(x, y)$, in $e(x, y)$, all the coefficients $e_\ell$ with $\ell \ge d - B$ are computed correctly, since for any $\ell' \in [\ell, \ell + B]$ such that $\ell' > d$ there is $x_{\ell'} = y_{\ell'}$, and then the sign in the local estimation is determined by the true highest differing bit $d$. Therefore the estimation error comes only from the coefficients $e_\ell$ with $\ell < d - B$, and is at most $\sum_{\ell < d - B} 2 \cdot 2^\ell \le 2^{d - B + 2}$.
If a pair is exactly $j$-bad (that is, $j$-bad but not $(j+1)$-bad), it immediately follows that the absolute error of the estimation is at most $2^{d - B + 2} = 2^{b + j - B + 2} \le 4 \cdot 2^j \varepsilon |x - y|$.
We now estimate the expected error of the estimation over the choice of $r$. If a particular pair is exactly $j$-bad, then the absolute error is at most $4 \cdot 2^j \varepsilon |x - y|$. Using the previous observations, we have $\mathbb{E}_r\big[\, |e(x, y) - |x - y||\, \big] \le \sum_{j=0}^{O(\log U)} \Pr[(x, y) \text{ is } j\text{-bad}] \cdot 4 \cdot 2^j \varepsilon |x - y| \le \sum_{j=0}^{O(\log U)} 2^{1-j} \cdot 4 \cdot 2^j \varepsilon |x - y| = O(\varepsilon \log U) \cdot |x - y|$.
By linearity of expectation and by Markov’s inequality the claim follows. ∎
Now a standard amplification technique applies: it is enough to repeat Algorithm 1 independently $\Theta(\log n)$ times and take the median value of the outputs as the final estimate $S''[i]$. Taking the constant in the number of repetitions large enough makes the final estimate good with high probability, and by the union bound the whole $S''$ is a good estimate of $S$.
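The amplification step in miniature (a deterministic stand-in estimator is used for illustration):

```python
import statistics

def amplified_estimate(estimate_once, trials):
    # median of independent runs: if each run is within the error bound with
    # probability >= 2/3, the median fails only with probability exp(-Omega(trials))
    return statistics.median(estimate_once() for _ in range(trials))
```

Here a single outlier run (e.g. one far-off estimate among several good ones) does not affect the median.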
3 Proofs of Theorem 1.2 and Theorem 1.3
In our construction of the algorithm for the $k$-approximated text-to-pattern $\ell_1$ distance, we make extensive use of the techniques used in [GU17] for solving $k$-approximated Hamming distances. More precisely, we need two components:
Corollary 3.1 ([CFP16],[GU17]).
The $k$-approximated text-to-pattern Hamming distance problem reduces in $\widetilde{O}(n)$ time to $O(n/m)$ instances of text-to-pattern Hamming distance on RLE-$O(k)$ inputs of length $O(m)$.
Corollary 3.2 ([GU17]).
Text-to-pattern Hamming distance on RLE-$k$ inputs of length $O(m)$ is computed exactly in $\widetilde{O}(k \sqrt{m})$ time.
We also need the following definition, as in [GU17] and [CFP16]: an integer $\pi$ is a $d$-mismatch period of a string $w$ if $\mathrm{Ham}(w[1 .. |w| - \pi], w[1 + \pi .. |w|]) \le d$.
Lemma 3.3 (c.f. Fact 3.1 in [CFP16]).
If the minimal $2k$-mismatch period of the pattern is $\pi$, then for any two distinct substrings of the text with Hamming distance to the pattern at most $k$, their starting positions are at distance at least $\pi$.
We start with an $\ell_1$ version of Corollary 3.1.
Theorem 3.4.
The $k$-approximated text-to-pattern $\ell_1$ distance problem reduces in $\widetilde{O}(n)$ time to $O(n/m)$ instances of exact text-to-pattern $\ell_1$ distance on RLE-$O(k)$ inputs of length $O(m)$, where both the pattern and the text might have wildcards.
Proof.
By a standard trick, it is enough to consider the case where $t$ is of length $2m$, as any other case can be reduced to $O(n/m)$ instances of this type. We proceed by showing how the key features of the reduction in Corollary 3.1 carry over.
Observe that for any words $x, y$ over an integer alphabet, there is $\mathrm{Ham}(x, y) \le \ell_1(x, y)$. This makes any technique eliminating alignments with too large Hamming distance valid for filtering by $\ell_1$ distance as well. We can easily determine the approximate periods of the pattern: as in [GU17], we run Karloff's approximate Hamming distances algorithm [Kar93] matching pattern against pattern, with a constant precision, say $\varepsilon = 1/2$. This takes $\widetilde{O}(m)$ time, and we end up with one of two cases:

every $O(k)$-mismatch period of the pattern is larger than $2k$, or

there is an $O(k)$-mismatch period of the pattern that is at most $2k$.
No small period.
We run Karloff's algorithm on the pattern against the text, with precision $\varepsilon = 1/2$, and filter out all alignments where more than $2k$ mismatches were reported. By the approximation guarantee, every discarded alignment has Hamming distance larger than $k$, and every surviving alignment has Hamming distance $O(k)$; by Lemma 3.3 (applied with suitable constants), there are $O(n/k)$ surviving alignments. By the relation between $\ell_1$ and Hamming distances, all discarded alignments were safe to discard for $\ell_1$ distances as well. Then, we test every surviving position using the “kangaroo jumps” technique of Landau and Vishkin [LV86], using $O(k)$ constant-time operations per position, in $O(n)$ total time. The only modification of the [GU17] approach in this part is that with each mismatch found through the jumps, we account for the $\ell_1$ score it generates, and abandon the position once the score exceeds $k$.
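A sketch of the verification step (the LCP here is naive for clarity; the actual technique answers LCP queries in $O(1)$ after linear preprocessing of a suffix structure, and the early-exit threshold is our simplification):

```python
def lcp(a, i, b, j):
    # longest common prefix of a[i:] and b[j:] (naive; O(1) with a suffix structure)
    l = 0
    while i + l < len(a) and j + l < len(b) and a[i + l] == b[j + l]:
        l += 1
    return l

def verify_alignment(text, pattern, i, k):
    # jump from mismatch to mismatch; return the l1 score if it is at most k,
    # and None as soon as it provably exceeds k
    j, total = 0, 0
    while j < len(pattern):
        j += lcp(pattern, j, text, i + j)  # kangaroo jump to the next mismatch
        if j >= len(pattern):
            break
        total += abs(text[i + j] - pattern[j])
        if total > k:
            return None
        j += 1
    return total
```

Since integer characters contribute at least 1 per mismatch, at most $k + 1$ jumps happen before the threshold is exceeded.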
Small period.
Let $\pi$ be such a small period. The initial approach from [GU17] in this case can be summarized as follows. First, a subword $t'$ of the text is located that contains all alignments matching the pattern with Hamming distance at most $k$; thus, for our purposes, it contains all alignments with $\ell_1$ distance at most $k$ as well (c.f. Lemma 2.4 in [GU17]). Then both $p$ and $t'$ are padded with special characters, and are subsequently rearranged into $p'$ and $t''$ with the following properties (c.f. Lemma 2.5 in [GU17]):
Both $p'$ and $t''$ are of length $O(m)$.

Both $p'$ and $t''$ have $O(k)$ runs.

There is a map of alignments such that if a pair of letters is aligned between $t'$ and $p$, then there is a corresponding aligned pair of the same letters between $t''$ and $p'$. Moreover, any additional aligned pairs of letters in $t''$ and $p'$ involve at least one special character. The map depends only on the values of $m$, $k$ and $\pi$.
The last property, coupled with the invariance of the Hamming distance under permuting the input (as long as we preserve alignments), means that the text-to-pattern Hamming distances between $t''$ and $p'$ encode, up to a constant additive term and a reordering, the text-to-pattern Hamming distances between $t'$ and $p$, and thus all small distances between $t$ and $p$. We use the same transformation, and claim that if all the special characters used are wildcards (symbols that are at distance 0 to every other character), the $\ell_1$ distance is preserved as well. ∎
Now, instead of building an explicit algorithm for computing the $\ell_1$ distance on bounded-RLE instances, we make use of an existing algorithm for Hamming distances and of a reduction that preserves bounds on the RLE.
Corollary 3.5 (c.f. Theorem 2.1, Theorem 2.2 and Lemma A.1 in [GLU17]).
For any $U$, there is a linearity preserving reduction from $\ell_1$ distance between integers from $\{0, \ldots, U\}$ to $O(\log U)$ instances of Hamming distance. There is a converse reduction from Hamming distance to $O(1)$ instances of $\ell_1$ distance. Those reductions allow for wildcards in the input and produce wildcard-less instances on the output.
We now have enough tools to construct a $k$-approximated $\ell_1$ distance algorithm of the desired runtime.
Proof of Theorem 1.2.
By Theorem 3.4, we reduce the input instance to $O(n/m)$ instances of bounded-RLE $\ell_1$ distances. By fixing $U = n^{O(1)}$ in Corollary 3.5, each of those reduces to $O(\log n)$ instances of text-to-pattern Hamming distance. Additionally, the reduction does not create any new runs, so the output instances obey the same bound on the RLE, and Corollary 3.2 applies. The final runtime is $O(n/m) \cdot O(\log n) \cdot \widetilde{O}(k \sqrt{m}) = \widetilde{O}(n + kn/\sqrt{m})$. ∎
To finish the full chain of reductions, we also show the following.
Lemma 3.6.
Text-to-pattern Hamming distance on RLE-$k$ inputs with text and pattern of length $O(m)$ reduces to $O(1)$ instances of $O(k)$-approximated Hamming distance on inputs of length $O(m)$.
Proof.
We proceed in a manner similar to [GU17]. We observe that it is enough to compute the second discrete derivative of the output array $S$, defined as $\Delta[i] = S[i] - 2S[i-1] + S[i-2]$, since having $\Delta$ and two initial values of $S$ (which can be computed naively in $O(m)$ time) is enough to recover all of $S$. For any two blocks $b_t$ of the text and $b_p$ of the pattern consisting of the same letter, $\Delta$ needs to be updated in only 4 places, corresponding to the four combinations of an endpoint of $b_t$ with an endpoint of $b_p$. We now explain how to deal with the first kind of updates, the three others being handled in an analogous manner.
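Recovering $S$ from $\Delta$ and two initial values is a double prefix sum (the indexing convention here is ours):

```python
def recover(delta, s0, s1):
    # delta[i] = S[i] - 2*S[i-1] + S[i-2] for i >= 2; unfold the recurrence
    S = [s0, s1]
    for d in delta:
        S.append(d + 2 * S[-1] - S[-2])
    return S
```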
We first reduce to a problem of sparse text-to-pattern Hamming distance, where the text and the pattern are of length $O(m)$ and each has at most $O(k)$ regular characters, with every other character being a wildcard (a special character at distance 0 to every other character). We construct the sparse instance as follows: for every position of the text that starts a block, we keep the letter at this position, and similarly for every position of the pattern that starts a block; all other positions become wildcards. Observe that the Hamming distance answer on this instance counts mismatches, while we want to count matches of block-starting letters. To invert the answer, we create a binary vector $x$ such that $x[i] = 1$ iff the $i$-th character of the sparse text is not a wildcard and $x[i] = 0$ otherwise, and a vector $y$ from the sparse pattern in an analogous manner. Then a single convolution of $x$ with $y$ counts, for every alignment, the total number of non-wildcard text characters aligned with non-wildcard pattern characters, and a single subtraction yields the answer.
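The inversion step as code (wildcards are modeled as None, and a naive correlation stands in for a single FFT convolution):

```python
def correlate(a, b):
    # out[i] = sum_j a[i+j] * b[j]
    return [sum(a[i + j] * bj for j, bj in enumerate(b))
            for i in range(len(a) - len(b) + 1)]

def matches_from_mismatches(text, pattern, mismatches):
    # matches[i] = (# aligned non-wildcard pairs at alignment i) - mismatches[i]
    x = [0 if c is None else 1 for c in text]
    y = [0 if c is None else 1 for c in pattern]
    return [a - mm for a, mm in zip(correlate(x, y), mismatches)]
```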
To reduce from sparse instances of Hamming distance to $k$-approximated Hamming distance, we follow a reduction analogous to the one from [GLU17] (c.f. Lemma A.1) that reduces Hamming distance with wildcards to wildcard-less Hamming distance. Write the first instance $t_1, p_1$ such that $t_1[i] = t[i]$ iff $t[i]$ is not a wildcard and $t_1[i] = 0$ iff it is (with the analogous transformation of $p$ into $p_1$). Write the second instance $t_2, p_2$ such that $t_2[i] = 1$ iff $t[i]$ is not a wildcard and $t_2[i] = 0$ iff it is (with the same transformation of $p$ into $p_2$, using the value 2 in place of 1). Now we observe that, at every alignment, the number of aligned non-wildcard matches equals $\mathrm{Ham}(t_2, p_2) - \mathrm{Ham}(t_1, p_1)$, thus it is enough to compute exact Hamming text-to-pattern distances on those two instances and subtract them. However, we observe that in both of them there are in total at most $O(k)$ characters different than 0, thus every distance value is $O(k)$, and $O(k)$-approximated Hamming distance works just as fine. ∎
We now observe that the proof of Theorem 1.3 follows automatically by plugging Lemma 3.6 in place of Corollary 3.2 in the proof of Theorem 1.2. We conclude this section with a remark on the necessity of applying the reduction to the kernelized version of the problems, instead of directly to the $k$-approximated problems: some coefficients $\alpha_j$ in the reduction of Corollary 3.5 are negative, which makes the reduction fail to work with arithmetic on values truncated at the threshold $k$.
References
 [Abr87] Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.
 [ALP04] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with mismatches. J. Algorithms, 50(2):257–275, 2004.
 [ALPU05] Amihood Amir, Ohad Lipsky, Ely Porat, and Julia Umanski. Approximate matching in the $\ell_1$ metric. In CPM, pages 91–103, 2005.
 [CCI05] Peter Clifford, Raphaël Clifford, and Costas S. Iliopoulos. Faster algorithms for $\delta$,$\gamma$-matching and related problems. In CPM, pages 68–78, 2005.
 [CFP16] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. The k-mismatch problem revisited. In SODA, pages 2039–2052. SIAM, 2016.
 [Cli09] Raphaël Clifford. Matrix multiplication and pattern matching under Hamming norm. http://www.cs.bris.ac.uk/Research/Algorithms/events/BAD09/BAD09/Talks/BAD09Hammingnotes.pdf, 2009. Retrieved March 2017.
 [CR12] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gaphammingdistance. SIAM J. Comput., 41(5):1299–1317, 2012.
 [GLU17] Daniel Graf, Karim Labib, and Przemyslaw Uznański. Hamming distance completeness and sparse matrix multiplication. CoRR, abs/1711.03887, 2017.
 [GU17] Paweł Gawrychowski and Przemysław Uznański. Optimal tradeoffs for pattern matching with k mismatches. CoRR, abs/1704.01311, 2017.
 [JKS08] T. S. Jayram, Ravi Kumar, and D. Sivakumar. The oneway communication complexity of hamming distance. Theory of Computing, 4(1):129–135, 2008.
 [Kar93] Howard J. Karloff. Fast algorithms for approximately counting mismatches. Inf. Process. Lett., 48(2):53–60, 1993.
 [KP15] Tsvi Kopelowitz and Ely Porat. Breaking the variance: Approximating the hamming distance in $\widetilde{O}(1/\varepsilon)$ time per alignment. In FOCS, pages 601–613, 2015.
 [KP18] Tsvi Kopelowitz and Ely Porat. A simple algorithm for approximating the texttopattern hamming distance. In SOSA, pages 10:1–10:5, 2018.
 [LP08] Ohad Lipsky and Ely Porat. $\ell_1$ pattern matching lower bound. Inf. Process. Lett., 105(4):141–143, 2008.
 [LP11] Ohad Lipsky and Ely Porat. Approximate pattern matching with the $\ell_1$, $\ell_2$ and $\ell_\infty$ metrics. Algorithmica, 60(2):335–348, 2011.
 [LV86] Gad M. Landau and Uzi Vishkin. Efficient string matching with mismatches. Theor. Comput. Sci., 43:239–249, 1986.

 [Woo04] David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA, pages 167–175, 2004.