Streaming dictionary matching with mismatches

In the k-mismatch problem we are given a pattern of length m and a text, and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only O(k log(m/k)) space [Clifford, Kociumaka, Porat, 2017]. In this work we consider a strictly harder problem called dictionary matching with k mismatches, where we are given a dictionary of d patterns of lengths at most m and must find all their k-mismatch occurrences in the text, and we show the first streaming algorithm for it. The algorithm uses O(k d log^k d polylog m) space and processes each position of the text in O(k log^(k+1) d polylog m + occ) time, where occ is the number of k-mismatch occurrences of the patterns that end at this position.


1 Introduction

The pattern matching problem is the fundamental problem of string processing, and has been studied for more than 40 years. Most of the existing algorithms are deterministic and assume the word-RAM model of computation. Under these assumptions, we must store the input in full, which is infeasible for modern massive data applications. The streaming model of computation was designed to overcome the restrictions of the word-RAM model. In this model we assume that the text arrives as a stream, one character at a time. Each time a new character of the text arrives, we must update the output. The space complexity of an algorithm is defined to be all the space used, including the space we need to store the information about the pattern(s) and the text. The time complexity of an algorithm is defined to be the time we spend to process one character of the text. The streaming model of computation aims for algorithms that use as little space and time as possible.

The first sublinear-space streaming algorithm for exact pattern matching was suggested by Porat and Porat in FOCS 2009 [26]. For a pattern of length m, their algorithm uses O(log m) space and O(log m) time per character.¹

¹ All streaming algorithms we discuss in this paper are randomised by necessity. They can err with probability inverse-polynomial in the length of the input.

Later, Breslauer and Galil gave an O(log m)-space and O(1)-time algorithm [8].

The first algorithm for dictionary matching was developed by Aho and Corasick [1]. The algorithm assumes the word-RAM model of computation, and for a dictionary of d patterns of maximal length m uses O(dm) space and O(1 + occ) amortised time per character, where occ is the number of occurrences. Apart from the Aho–Corasick algorithm, other word-RAM algorithms for exact dictionary matching include [27, 4, 7, 16, 14, 13, 22]. In particular, [16, 4, 7, 22] focus on space-efficient solutions. In ESA 2015, Clifford et al. [9] showed a streaming dictionary matching algorithm that uses O(d log m) space and O(log log(m + d)) time per character. In ESA 2017, Golan and Porat [18] showed an improved algorithm that uses the same amount of space and O(1) time per character for constant-size alphabets.

In the k-mismatch problem we are given a pattern of length m and a text, and must find all alignments of the pattern and the text where the Hamming distance is at most k. By a reduction to streaming exact pattern matching, Porat and Porat [26] showed the first streaming k-mismatch algorithm, with space O(k³ polylog m) and time O(k² polylog m) per character. The complexity has been subsequently improved in [10, 17, 11]. The current best algorithm uses only O(k log(m/k)) space and Õ(√k) time per character [11].

1.1 Our results

In this work, we commence a study of dictionary matching with mismatches in the streaming model of computation. In this problem we are given a dictionary of d patterns of maximal length m and must find all their k-mismatch occurrences in the text. This problem is strictly harder than both k-mismatch and dictionary matching; on the other hand, it is well-motivated by practical applications in cybersecurity and bioinformatics. Our goal is to develop an algorithm that is efficient in terms of both space and time. We assume that we receive the dictionary first, preprocess it, and then receive the text, which is by now a standard assumption in streaming pattern matching.
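A brute-force reference implementation makes the problem statement concrete (a Python sketch with names of our choosing; it runs in time quadratic in the input, far from the streaming bounds discussed below):

```python
def hamming(a, b):
    # Hamming distance of two equal-length strings
    return sum(x != y for x, y in zip(a, b))

def dict_match_k(text, patterns, k):
    """Return all pairs (end, idx): pattern `idx` occurs with at most k
    mismatches in `text`, ending at 0-based position `end`."""
    out = []
    for end in range(len(text)):
        for idx, p in enumerate(patterns):
            start = end - len(p) + 1
            if start >= 0 and hamming(text[start:end + 1], p) <= k:
                out.append((end, idx))
    return out
```

A streaming algorithm must report the same pairs while reading the text once and keeping far less than the whole input in memory.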

We can obtain a streaming algorithm for the problem in a straightforward way, by running one instance of the k-mismatch algorithm [11] per pattern:

Corollary 1.

There is a streaming algorithm for dictionary matching with k mismatches that uses Õ(kd) space and Õ(d√k) time per character.² The algorithm is randomised and its answers are correct w.h.p.

² Hereafter Õ(·) hides a multiplicative factor polynomial in log m.

As can be seen, the time complexity of Corollary 1 depends on d linearly. This is prohibitive for applications where the stream characters arrive at high speed and the dictionary is large, up to several thousands of patterns (such as intrusion detection), since we must be able to process each character before the next one arrives to benefit from the space advantages of streaming algorithms.

Our contribution in this paper is twofold. First, we show a streaming dictionary matching algorithm that uses space sublinear in m and time sublinear in d (Section 5). To achieve this, we introduce a randomised variant of the k-errata tree, a famous data structure of Cole, Gottlieb, and Lewenstein for dictionary matching with mismatches [12] (Section 4), which allows us to improve both the query time and the space consumption of the data structure. This variant of the k-errata tree can be considered a generalisation of z-fast tries [5, 6], which have proved useful in many streaming applications.

Theorem 2.

There is a streaming algorithm that solves dictionary matching with k mismatches in O(k d log^k d polylog m) space and O(k log^(k+1) d polylog m + occ) worst-case time per arriving character. The algorithm is randomised and correct w.h.p.

Hereafter, occ is the number of k-mismatch occurrences of the patterns that end at the currently processed position of the text; it is at most d and typically much smaller than the total number of occurrences of the patterns in the text. The techniques that we use to combine the algorithms have a flavour similar to [9, 18, 10, 11]. However, our techniques make a significant step forward to allow both mismatches and multiple patterns.

Our second contribution is time and space lower bounds for streaming dictionary matching with mismatches, which show that our algorithm is within a polylogarithmic factor of optimal in terms of space for certain values of k. We start by showing a space lower bound by reduction from the indexing problem (see the proof for the definition):

Lemma 1.

Any streaming algorithm for dictionary matching with k mismatches requires Ω(kd) bits of space.

Our time lower bound is conditional on the Strong Exponential-Time Hypothesis (SETH) of Impagliazzo, Paturi, and Zane [19, 20]; see also [15, Chapter 14]:

Hypothesis 3 (SETH).

For every δ > 0, there exists an integer q such that SAT on q-CNF formulas with m clauses and n variables cannot be solved in O(2^((1−δ)n) poly(m)) time.

Recall that we assume that we preprocess the dictionary before receiving the text.

Lemma 2.

Suppose that there is ε > 0 such that, after having preprocessed the dictionary in time polynomial in its total length, we can solve streaming dictionary matching with k mismatches in O(d^(1−ε)) time per character of the text. Then SETH is false.

1.2 Related work

Previously, dictionary matching with mismatches was addressed in [24, 3, 25]. Muth and Manber [24] gave a randomised algorithm for k = 1, and Baeza-Yates and Navarro [3] and Navarro [25] gave the first algorithms for a general value of k. The time complexity of these algorithms is good on average, but in the worst case can be as large as linear in the total length of the dictionary per character.

2 Overview of techniques

In this section we give an overview of the main ideas of this work. Recall that we are given a text arriving as a stream and a dictionary of d patterns of maximal length m, and must find all k-mismatch occurrences of the patterns in the text. Hereafter, all logs are base two; we also assume that the parameters are large enough for the stated bounds to be meaningful, and otherwise we can use Corollary 1 to achieve the complexities of Theorem 4.

2.1 Algorithm based on dictionary look-up with mismatches

A compact trie for a dictionary of patterns is a tree where each inner node has degree at least two. The edges of the compact trie are labelled with strings, and the labels of the edges outgoing from the same node must start with different characters. The label of a node v is defined to be the concatenation of the labels of the edges on the root-to-v path. There is a one-to-one correspondence between the leaves of the trie and the dictionary patterns. The label of a leaf must be equal to the corresponding pattern appended with a special character, usually denoted $, that does not belong to the main alphabet. A trie can be used, for example, for dictionary look-up queries: given a string S, a dictionary look-up query retrieves all the patterns in the dictionary equal to S.
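The look-up query can be sketched with a plain (non-compacted) trie; a compact trie additionally merges every unary path into a single labelled edge. A minimal Python illustration with names of our choosing:

```python
def build_trie(patterns):
    # plain trie; each node is a dict: character -> child, "$ids" -> pattern ids
    trie = {"$ids": []}
    for idx, p in enumerate(patterns):
        node = trie
        for c in p + "$":  # '$' terminates each pattern, as in the text
            node = node.setdefault(c, {"$ids": []})
        node["$ids"].append(idx)
    return trie

def lookup(trie, s):
    # dictionary look-up: ids of the patterns equal to s
    node = trie
    for c in s + "$":
        if c not in node:
            return []
        node = node[c]
    return node["$ids"]
```

Because every pattern is stored with the terminator $, a look-up is a single root-to-leaf traversal, and a pattern that is a prefix of another is still found correctly.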

The k-errata tree was introduced by Cole, Gottlieb, and Lewenstein in STOC 2004 [12]. The k-errata tree supports dictionary look-up with mismatches queries: given a query string S, find all patterns in the dictionary that are at Hamming distance at most k from it. In Section 4 we show the first randomised implementation of the k-errata tree and consequently develop a streaming algorithm for dictionary matching with k mismatches:

Lemma 3.

There is a streaming algorithm for dictionary matching with k mismatches that uses Õ(km + kd log^k d) space and Õ(k log^(k+1) d + occ) time per character, where occ is the number of the occurrences. Furthermore, the algorithm can output the mismatches in Õ(k) time per occurrence by request. The algorithm is randomised and correct w.h.p.

2.2 Improving space

The algorithm of Corollary 1 is efficient in terms of space, but not in terms of time. The algorithm based on a randomised implementation of the k-errata tree is efficient in terms of time, but not in terms of space. As our main contribution, we show that it is possible to achieve sublinear dependency on m for the space, and on d and m for the time. To develop our algorithm, we consider the patterns with large periods (which occur rarely) and the patterns with small periods (which occur often, but are compressible) separately.

Definition 1 (-period [10]).

The k-period of a string S of length n is the minimal integer π ≥ 1 such that the Hamming distance between S[π + 1 .. n] and S[1 .. n − π] is at most k.

Observation 1.

If the 2k-period of a string S is larger than ℓ, then there is at most one k-mismatch occurrence of S per ℓ consecutive positions of the text.
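The k-period of Definition 1 can be computed by brute force in quadratic time, which is convenient for sanity-checking the observation above (a Python sketch; the function names are ours):

```python
def hamming(a, b):
    # Hamming distance of two equal-length strings
    return sum(x != y for x, y in zip(a, b))

def k_period(s, k):
    # minimal shift pi >= 1 with Hamming distance of s[pi:] and s[:-pi] at most k
    n = len(s)
    for pi in range(1, n + 1):
        if hamming(s[pi:], s[:n - pi]) <= k:
            return pi
    return n
```

For example, a string with k-period p has its occurrences structured along shifts of p; the streaming algorithms below treat the large-period and small-period regimes separately.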

We partition the dictionary into three smaller dictionaries and process them in parallel. The first dictionary contains all patterns P such that the 2k-period of a fixed-length prefix of P is large, which means that the k-mismatch occurrences of this prefix, and consequently of P, are rare. The second dictionary contains all patterns P such that the 2k-period of P is small. The third dictionary contains all the remaining patterns, i.e. all patterns P such that the 2k-period of the prefix is small but the 2k-period of P itself is large. A similar partitioning based on the values of exact periods of the patterns was used by the algorithms for exact dictionary matching in a stream [9, 18].

For the first dictionary we combine a streaming k-mismatch algorithm of Porat and Porat with the k-errata tree. The k-mismatch algorithm generates a small number of equispaced subpatterns such that, given the subset of the matching subpatterns, we can compute the Hamming distance between the pattern and the text; it then runs a streaming (exact) pattern matching algorithm for each of the subpatterns. The exact pattern matching algorithm tests each position of the text a logarithmic number of times. Once a position passes all the tests, the algorithm announces that it is an occurrence of the pattern. There can be quite many candidate positions at the beginning, and it takes a lot of time to check them, but they become rarer and rarer towards the last test. The main idea of our approach is to skip the first few tests using the k-errata tree, hence improving the time compared to a naive application of independent streaming k-mismatch algorithms. In Section 5.2 we will show the following result.

Lemma 4.

If for each pattern in the dictionary the 2k-period of its fixed-length prefix is sufficiently large, there is a streaming algorithm for dictionary matching with k mismatches that uses O(k d log^k d polylog m) space and O(k log^(k+1) d polylog m + occ) amortised time per character. The algorithm is randomised and correct w.h.p.

For the second dictionary we use a different strategy. We partition the dictionary again, this time into a logarithmic number of groups, where group i contains the patterns of length in [2^i, 2^(i+1)). For group i, we partition the text into overlapping blocks whose length is proportional to 2^i, so that every occurrence of a pattern of the group lies entirely within some block, and process each block independently. We will show that the region of a block containing k-mismatch occurrences of the patterns in group i must be periodic, and we will show how to maintain this region in a streaming fashion. Furthermore, we introduce a new encoding of this periodic region that allows computing a hash of any of its substrings, which in turn will make it possible to use the k-errata tree to retrieve the occurrences of the patterns (Section 5.3). Finally, our algorithm for the third dictionary combines the ideas of the algorithms for the patterns with large and small periods.

Lemma 5.

If for each pattern in the dictionary the 2k-period of its fixed-length prefix is sufficiently small, there is a streaming algorithm for dictionary matching with k mismatches that uses O(k d log^k d polylog m) space and O(k log^(k+1) d polylog m + occ) amortised time per character. The algorithm is randomised and correct w.h.p.

Lemmas 4 and 5 give us the first streaming algorithm for dictionary matching with k mismatches:

Theorem 4.

There is a streaming algorithm that solves the problem of dictionary matching with k mismatches in O(k d log^k d polylog m) space and O(k log^(k+1) d polylog m + occ) amortised time per character. The algorithm is randomised and its answers are correct w.h.p.

Finally, in Appendix 6 we show how to de-amortise the running time to obtain our main result, Theorem 2.

3 Preliminaries: Fingerprints and sketches

In this section we give the definitions of the two hash functions that we use throughout the paper. We first define Karp–Rabin fingerprints, which let us decide whether two strings are equal.

Definition 2 (Karp-Rabin fingerprints [21]).

The Karp–Rabin fingerprint of a string S = S[1] S[2] … S[n] is a hash function defined as φ(S) = Σ_{i=1..n} S[i]·r^i mod p, where p is a fixed prime number and r ∈ {0, 1, …, p − 1} is chosen uniformly at random. The reverse Karp–Rabin fingerprint is defined as φ^R(S) = Σ_{i=1..n} S[i]·r^(n−i+1) mod p.

Fact 1 (Karp–Rabin fingerprints).

For r chosen uniformly at random, the probability that two distinct strings of equal length n have equal Karp–Rabin fingerprints is at most n/p. This claim holds for reverse fingerprints as well.

Consider a string S that is equal to the concatenation of two strings U and V, that is, S = UV. We can compute φ(S) in O(1) time given φ(U) and φ(V), and φ(V) in O(1) time given φ(S) and φ(U). It also follows that, given the Karp–Rabin fingerprints of U and S, we can compute the reverse Karp–Rabin fingerprint of V in O(1) time, which we will use in the proof of Lemma 3.
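These composition properties are easy to verify with a zero-based variant of the fingerprint (exponents starting at 0 instead of 1; the base R is fixed below for reproducibility, although the collision bound of Fact 1 requires choosing it uniformly at random):

```python
MOD = (1 << 61) - 1  # a fixed Mersenne prime
R = 123457           # must be chosen at random for the collision bound to hold

def fp(s):
    # Karp-Rabin fingerprint: sum of s[j] * R^j modulo MOD
    h = 0
    for j, c in enumerate(s):
        h = (h + ord(c) * pow(R, j, MOD)) % MOD
    return h

def concat_fp(fp_u, len_u, fp_v):
    # fingerprint of UV from the fingerprints of U and V
    return (fp_u + pow(R, len_u, MOD) * fp_v) % MOD

def suffix_fp(fp_uv, fp_u, len_u):
    # fingerprint of V from the fingerprints of UV and U
    inv = pow(pow(R, len_u, MOD), -1, MOD)
    return (fp_uv - fp_u) * inv % MOD
```

Both directions use only a modular power of R (and its inverse), which is what makes the O(1)-time composition possible once the needed powers are precomputed.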

We now recall the definition of k-mismatch sketches, which allow us to decide whether two strings are at Hamming distance at most k.

Definition 3 (-mismatch sketch [11]).

For a fixed prime number and for chosen uniformly at random, the -mismatch sketch of a string is defined as , , where and for .

Lemma 6 ([11]).

Given the sketches and of two strings of equal lengths , in time we can decide (with high probability) whether the Hamming distance between and is at most . If so, the algorithm reports each mismatch between and as well as the difference . The algorithm uses space.

Lemma 7 ([11]).

We can construct one of the sketches , , or given the other two in time using space, provided that all the processed strings are over the alphabet and are of length at most . Furthermore, we can compute , where is a concatenation of copies of , in time as well under the same assumption.

4 Proof of Lemma 3 - the k-errata tree

We start the proof by recalling the definition of the k-errata tree of Cole et al. [12]. Next, we show a randomised implementation of this data structure, and finally we show a streaming algorithm based on the k-errata tree.

4.1 Reminder: the -errata tree

Consider a dictionary of d patterns of maximal length m. The k-errata tree for the dictionary is a recursively built set of compact tries; its total size is bounded in Fact 2 below.

Let us first give a construction of the k-errata tree that has the desired size but not the desired query time. We start with the compact trie for the dictionary, and decompose it into heavy paths.

Definition 4.

The heavy path of a trie T is the path that starts at the root of T and at each node on the path branches to the child with the largest number of leaves in its subtree (the heavy child), with ties broken arbitrarily. The heavy path decomposition is defined recursively: it is the union of the heavy path of T and the heavy path decompositions of the off-path subtrees of the heavy path.
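Definition 4 translates directly into code; the sketch below (Python, names ours) returns the heavy paths of a rooted tree given as a child map:

```python
def heavy_paths(children, root):
    """Heavy path decomposition of a rooted tree.
    children: dict mapping a node to the list of its children."""
    leaves = {}

    def count(u):
        # number of leaves in the subtree of u
        ch = children.get(u, [])
        leaves[u] = sum(count(c) for c in ch) if ch else 1
        return leaves[u]

    count(root)
    paths, stack = [], [root]
    while stack:
        u = stack.pop()
        path = [u]
        # follow heavy children (largest number of leaves, ties broken
        # arbitrarily), pushing the off-path subtrees for later
        while children.get(u):
            heavy = max(children[u], key=lambda c: leaves[c])
            stack.extend(c for c in children[u] if c != heavy)
            path.append(heavy)
            u = heavy
        paths.append(path)
    return paths
```

Every root-to-leaf path of the tree crosses only a logarithmic number of heavy paths, which is the property the k-errata tree query exploits.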

During the recursive step, we construct a number of new compact tries. For each heavy path H and each node u on H, consider the off-path subtrees hanging from u. First, we create a vertical substitution trie for u. Let c be the first character on the edge of H outgoing from u. Consider an off-path subtree hanging from u, and let c' be the first character on the edge from u to this subtree. For each pattern in this off-path subtree, we replace c' by c. We consider the set of patterns obtained by such a substitution for all off-path subtrees hanging from u, and build a new compact trie for this set. Next, we create horizontal substitution tries for the node u, a separate one for each off-path subtree hanging from u. To do so, we take the patterns in the subtree and cut off their first characters, up to and including the first character on the edge from u to this subtree, and then build a compact trie on the resulting set of patterns. To finish the recursive step, we build the (k−1)-errata trees for each of the new vertical and horizontal tries.

From the construction it follows that the k-errata tree is a set of compact tries, and each string in the tries originates from a pattern in the dictionary. We mark the end of the path labelled by such a string with the id of the pattern it originates from.

Fact 2 ([12]).

The id of any pattern in the dictionary occurs in the compact tries of the k-errata tree O(log^k d) times, and as a corollary the total size of the tries is O(d log^k d).

A dictionary look-up with mismatches for a string S is performed in a recursive way as well. We will make use of a procedure called prefix search. This procedure takes three arguments: a compact trie, a starting node u in this trie, and a query string; it must output a pointer to the end of the longest path starting at u and labelled by a prefix of the query string. For the purposes of recursion, we introduce a mismatch credit — the number of mismatches that we are still allowed to make — and start with mismatch credit k. The algorithm first runs a prefix search in the trie for the query string S starting from the root. If the path is labelled by the whole of S, the algorithm returns the ids of the patterns associated with the end of the path. Otherwise, we consider the heavy paths H_1, H_2, …, H_q traversed by the prefix search. Let v_i be the position where the prefix search leaves the heavy path H_i. Note that for i < q, v_i is necessarily a node of H_i, while v_q can be a position on an edge. We can divide all the patterns into four groups: (i) patterns hanging off some node u of a heavy path H_i, where u is located above v_i; (ii) patterns in the subtrees of v_i's children not in the heavy path H_i, for i < q; (iii) patterns in the subtree of the position in H_q that is just below v_q; (iv) if v_q is a node, patterns in the subtrees of v_q's children not in the heavy path H_q. We process each of these groups independently. Consider a pattern in group (i), and let it hang from a node u of H_i located above v_i. Let ℓ be the length of the label of u; then S and any pattern in this subtree have a mismatch at position ℓ + 1. When creating the vertical substitution tries, we removed this mismatch. Therefore, we can retrieve all such patterns that are at Hamming distance at most k from S by running the algorithm recursively with mismatch credit k − 1 in the (k−1)-errata tree that we created for the vertical substitution trie of the node u. The patterns of groups (ii) and (iv) are processed in a similar way, but using the (k−1)-errata trees for the horizontal substitution tries.
Finally, to process the patterns of group (iii), we run the algorithm with mismatch credit k − 1 starting from the position that follows v_q in H_q.

This algorithm correctly retrieves the subset of the patterns that are at Hamming distance at most k from S, but can be slow as it makes many recursive calls. Cole et al. showed that the number of recursive calls can be reduced to logarithmic by grouping the substitution tries. In more detail, for each heavy path we consider its vertical substitution tries and build a weight-balanced tree, where the leaves of the weight-balanced tree are the vertical substitution tries in top-down order, and for each inner node of the tree we create a new trie by merging the tries below it. For each of these group vertical substitution tries we build the (k−1)-errata tree. We group the horizontal substitution tries in a similar way: we consider each node u and build a weight-balanced tree on the horizontal substitution tries that we created for u. To speed up the algorithm, we search a logarithmic number of group substitution tries instead of searching each substitution trie individually.

Remark 5.

We will use the k-errata tree to retrieve the patterns that are within Hamming distance k from the query string or from one of its prefixes. In order to do this, we store a pointer from each node to its nearest marked ancestor. At the end of each prefix search we follow the pointers and retrieve the patterns corresponding to the marked nodes above. The number of operations that we perform does not change.

It remains to explain how we perform the prefix search operations. Cole et al. gave a deterministic implementation of prefix search that requires extra space and preprocessing time, which is too much for our purposes. In the next section, we show a randomised implementation of prefix search that requires both less space and less time.

4.2 Randomised implementation of the k-errata tree

In this section we will show the following lemma.

Lemma 8.

A dictionary of d patterns of maximal length m can be preprocessed into a data structure called the randomised k-errata tree that uses Õ(k d log^k d) space and can answer a dictionary look-up with k mismatches query for a string S in Õ(k log^(k+1) d + occ) time, assuming that we know the k-mismatch sketches of all prefixes of S.

Recall from above that the k-errata tree is a collection of compact tries. In the randomised version of the k-errata tree, we replace each of them with a z-fast trie. We also store the k-mismatch sketch of the label of every node of the tries, which requires Õ(k d log^k d) space in total.

Fact 3 (z-fast tries [6, 5]).

Consider a string S and suppose that we can compute the reverse Karp–Rabin fingerprint of any prefix of S in t time. A compact trie on a set of d strings of length at most m can be implemented in O(d) space to support the following queries in O(t log m) time: given S, find the highest node v such that the longest prefix of S present in the trie is a prefix of the label of the root-to-v path. The answers are correct w.h.p.³

³ The error probability comes from the collision probability of Karp–Rabin fingerprints.

We now explain how we answer dictionary look-up with mismatches queries. Recall that each dictionary look-up with mismatches is a sequence of calls to the prefix search procedure, and therefore it suffices to give an efficient implementation of prefix search. We first explain how to implement this operation when it starts at the root of some compact trie of the k-errata tree. Assuming that we can retrieve the reverse Karp–Rabin fingerprint of any substring of the query in O(1) time, Fact 3 immediately implies that a prefix search starting at the root of a compact trie can be implemented in O(log m) time. Note that if the end of the prefix search is a position inside an edge of the trie, we will only know the edge it belongs to; but as we explain next, this is sufficient for our purposes.

We now give an implementation of a prefix search starting at an arbitrary position of a compact trie, by reducing it first to a prefix search that starts at a node of the trie, and then to a prefix search that starts at the root of the trie. We first show a reduction from a prefix search that starts at an arbitrary position on an edge to one that starts at a node. We might know the edge, but not the exact position. From the description of the query algorithm in Section 4.1 it follows that the algorithm will continue along the edge by running prefix search operations until it either runs out of mismatch credit or reaches the lower end of the edge. We will fast-forward to the lower end of the edge using the k-mismatch sketches. Namely, let S be the query string at the moment we entered the current trie (note that we do not change the trie when retrieving patterns of group (iv)). Importantly, the current query string is a suffix of S. We want to check whether we can reach the lower end of the edge without running out of mismatch credit. In other words, we want to compare the number of mismatches between the label of the lower end of the edge and the prefix of S of the corresponding length with the mismatch credit. We use the k-mismatch sketches for this task: we store the sketch of the label, and the sketch of the prefix can be computed quickly as it is a substring of S. Having computed the sketches, we can compute the Hamming distance between the two strings using Lemma 6. If the Hamming distance is larger than the mismatch credit, we stop; otherwise, we continue the prefix search from the lower end of the edge. Finally, we show an implementation of a prefix search for a query string that starts at a node u of a trie. Let U be the label of u. Our task is equivalent to performing a prefix search starting from the root of the trie for the concatenation of U and the query string. We do not know the reverse Karp–Rabin fingerprints of the prefixes of this concatenation, but we can compute them as follows. First, we use the k-mismatch sketches, similarly to above, to compute the at most k mismatches that occurred on the way from the root of the trie to u.
After having computed the mismatches, we can compute any of the fingerprints in Õ(k) time by taking the fingerprint of the corresponding substring of S and "fixing" it in at most k positions.

It follows that we can answer a dictionary look-up with mismatches query within the time bound of Lemma 8, and compute the mismatches for each of the retrieved patterns in Õ(k) time per pattern if requested.

4.3 Streaming algorithm

During the preprocessing step, the algorithm builds the k-errata tree for the reverses of the patterns. During the main step, the algorithm maintains the Karp–Rabin fingerprints and the k-mismatch sketches of the m longest prefixes of the text in a round-robin fashion, updating them when a new character arrives (Lemma 7). If the text ends with a k-mismatch occurrence of some pattern P, there is a suffix of the text of length |P| such that the Hamming distance between it and some pattern in the dictionary is bounded by k. It means that we can retrieve all occurrences of such patterns by running a dictionary look-up with mismatches for the reverse of the m-length suffix of the text. We can retrieve the reverse fingerprint and the k-mismatch sketch of any substring of this suffix (Lemma 7), and therefore perform the dictionary look-up query within the time bound of Lemma 8. In total, the algorithm achieves the space and time bounds claimed in Lemma 3.

5 Proof of Theorem 4 - improving space

We assume that the lengths of the patterns are sufficiently large; for shorter patterns we can use the algorithm of Lemma 3. We first partition the dictionary into two smaller dictionaries: the first dictionary contains the patterns such that the 2k-period of their fixed-length prefix is large, and the second dictionary contains the patterns such that the 2k-period of their prefix is small. In Section 5.2 we show a streaming algorithm that finds all k-mismatch occurrences of the patterns in the first dictionary, and in Section 5.3 a streaming algorithm for the second. The two algorithms run in parallel give Theorem 4.

5.1 Reminder: The k-mismatch algorithm of Porat and Porat

We first give an outline of the k-mismatch algorithm of Porat and Porat [26], which will be used in the proofs of Lemmas 4 and 5. Porat and Porat started by demonstrating a streaming algorithm for exact pattern matching, and then showed a reduction from the k-mismatch problem to exact pattern matching. The pseudocode for exact pattern matching for a streaming text T and a pattern P is given in Algorithm 1. The algorithm stores O(log m) levels of positions of the text. The positions stored in level j are occurrences of the prefix P[1..2^(j−1)] and form an arithmetic progression, which allows storing them in constant space. In total, the algorithm uses O(log m) space and O(log m) time per character.

Input: Text T of length n arriving as a stream, a pattern P of length m

1:for each arriving character T[i] do
2:     Compute the fingerprint φ(T[1..i]) from the fingerprint φ(T[1..i−1])
3:     if T[i] = P[1] then push i to level 1 end if
4:     for each level j = 1, 2, …, log m do
5:          p ← leftmost position in level j
6:         if i − p + 1 = 2^j then check if p is an occurrence of P[1..2^j]
7:              Pop p and φ(T[1..p−1]) from level j
8:              Compute φ(T[p..i]) from φ(T[1..p−1]) and φ(T[1..i])
9:              if φ(T[p..i]) = φ(P[1..2^j]) then push p to level j + 1 end if
10:         end if
11:     end for
12:end for
Algorithm 1 Streaming pattern matching
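Algorithm 1 can be simulated in a few lines. The sketch below (Python, names ours) keeps each level as a plain list of (start position, prefix fingerprint) pairs instead of the constant-space arithmetic progressions, and scans every level at every character, so it illustrates the level logic rather than the O(log m) space and time bounds:

```python
import random

MOD = (1 << 61) - 1  # a fixed prime

def stream_match(text, pattern, r=None):
    """Report the 0-based end positions of the exact occurrences of
    `pattern` in `text`, reading one character at a time.  Level j holds
    candidate starts whose pattern prefix of length 2^j has been verified
    with Karp-Rabin fingerprints."""
    m = len(pattern)
    r = r if r is not None else random.randrange(2, MOD)
    # pat_hash[L] = fingerprint of pattern[:L], anchored at exponent 0
    pat_hash = [0] * (m + 1)
    for j, c in enumerate(pattern):
        pat_hash[j + 1] = (pat_hash[j] + ord(c) * pow(r, j, MOD)) % MOD
    levels = [[] for _ in range(m.bit_length() + 1)]
    occ, h = [], 0  # h = fingerprint of the text read so far
    for i, c in enumerate(text):
        if c == pattern[0]:
            levels[0].append((i, h))  # store start and hash of text[:start]
        h = (h + ord(c) * pow(r, i, MOD)) % MOD
        n = i + 1
        for j in range(len(levels)):
            target = min(1 << (j + 1), m)  # prefix length verified next
            keep = []
            for start, h_start in levels[j]:
                if n - start < target:
                    keep.append((start, h_start))
                    continue
                # n - start == target: compare text[start:n] with pattern[:target]
                window = (h - h_start) * pow(pow(r, start, MOD), -1, MOD) % MOD
                if window == pat_hash[target]:
                    if target == m:
                        occ.append(i)       # full occurrence ends here
                    else:
                        levels[j + 1].append((start, h_start))
            levels[j] = keep
    return occ
```

As in Algorithm 1, a candidate is tested only a logarithmic number of times, each test doubling the verified prefix length until the whole pattern is matched.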

The k-mismatch algorithm for a pattern P can be reduced to several instances of the exact pattern matching algorithm in the following way. Let p_1, p_2, … be the first few primes larger than one threshold, and q_1, q_2, … be the first few primes larger than another, where both thresholds are chosen as in [26]. A subpattern of P is defined by two primes p_i, q_j and an integer r: it consists of the characters P[r], P[r + p_i q_j], P[r + 2 p_i q_j], and so on until the end of P. The prime number theorem implies that all the primes we consider are small, and therefore the number of subpatterns is small as well.

Lemma 9 ([26]).

Consider an alignment of the pattern P and the text. Given the subset of the subpatterns that match exactly at this alignment, there is a deterministic Õ(k)-time algorithm that outputs "No" if the Hamming distance between P and the text at this alignment is larger than k, and the true value of the Hamming distance as well as the mismatch positions otherwise.

We can now explain the reduction from k-mismatch to exact pattern matching. For each pair of primes p_i, q_j and an integer r, we define a text substream consisting of the characters T[r], T[r + p_i q_j], T[r + 2 p_i q_j], and so on until the end of T. We run the exact pattern matching algorithm for each substream and each subpattern aligned with it. At each position we then know which of the subpatterns match, and hence, by Lemma 9, can compute the Hamming distance between P and the text. In total, for each pair of primes there are p_i q_j substreams; each arriving character of T belongs to exactly one substream for each pair of primes, which bounds the time spent per character by the number of prime pairs and the space by the total number of substreams.
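The subpatterns and substreams are plain arithmetic progressions of positions, which slicing expresses directly. A toy Python illustration (the prime thresholds below are arbitrary, not the ones of [26]):

```python
def primes_greater_than(x, count):
    # first `count` primes larger than x, by trial division (fine for small x)
    found, n = [], x
    while len(found) < count:
        n += 1
        if n > 1 and all(n % q for q in range(2, int(n ** 0.5) + 1)):
            found.append(n)
    return found

def subpattern(pattern, step, r):
    # characters of `pattern` at positions r, r + step, r + 2*step, ...
    return pattern[r::step]

def subpattern_matches(text, start, pattern, step, r):
    # a subpattern matches the text at alignment `start` iff the sampled
    # positions carry no mismatch there
    window = text[start:start + len(pattern)]
    return subpattern(window, step, r) == subpattern(pattern, step, r)
```

Every text position falls into exactly one residue class r modulo each step, which is why a new character triggers an update in exactly one substream per pair of primes.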

5.2 Proof of Lemma 4 - patterns with large periods

During the preprocessing step we build the k-errata tree for the dictionary of the prefixes of all the patterns in the first dictionary. We then run the streaming algorithm of Lemma 3, which retrieves all k-mismatch occurrences of the prefixes in the text using the k-errata tree. Note that any k-mismatch occurrence of a pattern P starts with a k-mismatch occurrence of its prefix. After having found an occurrence of the prefix, our second step is to check whether it can be extended into a full occurrence of P.

In order to do this, we run the k-mismatch algorithm for each pattern in the dictionary. Recall that this algorithm consists of instances of the exact pattern matching algorithm, one for each substream and for each subpattern of the pattern (see Section 3 for the definition of subpatterns). Suppose we have found a k-mismatch occurrence of a prefix, as well as the mismatches between this occurrence and the prefix, using the k-errata tree. To plug the occurrence into the k-mismatch algorithm, for each subpattern of the pattern we consider its prefix of maximal length which is fully contained in the found occurrence. Given the mismatch positions, we can quickly decide whether this prefix matches the text. If it does, we add the occurrence to the appropriate level of the exact pattern matching algorithm for the subpattern, and continue from there.

Next, we need to explain how we update the instances of the k-mismatch algorithm. We cannot consider each instance at each position, as that would be too expensive. Instead, we store a binary search tree which contains the first position for each level of each instance of the exact pattern matching algorithm. When a new character arrives, we use the binary search tree to find all instances that require an update, and update them.
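The idea of touching only the instances that are due can be sketched as follows; this uses a heap rather than the binary search tree of the text (both support the same "pop all due instances" pattern), and the instance identifiers are illustrative:

```python
import heapq


class UpdateScheduler:
    """Keep, for each instance of the exact matching algorithm, the first
    text position at which it needs attention; when a new character
    arrives, pop only the instances that are due."""

    def __init__(self):
        self.heap = []  # entries are (next_position, instance_id)

    def schedule(self, instance_id, next_position):
        heapq.heappush(self.heap, (next_position, instance_id))

    def due(self, current_position):
        """All instances whose next update position has been reached."""
        ready = []
        while self.heap and self.heap[0][0] <= current_position:
            ready.append(heapq.heappop(self.heap)[1])
        return ready
```

A popped instance would be updated and then re-scheduled with its next position of interest, so the per-character work is proportional to the number of due instances only.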

We now analyse the complexity of the algorithm. Finding the occurrences of the prefixes takes the space and per-character time of Lemma 3. The k-mismatch algorithms require additional space in total. To estimate the running time, recall that by Observation 1 each prefix has at most one k-mismatch occurrence per fixed-length stretch of the text. Hence, only a bounded number of new positions is added to the instances of the exact pattern matching algorithms over each such stretch, and consequently any level of any of the exact pattern matching algorithms stores at most one position per stretch. Updating these positions takes amortised time per character as claimed.

Lemma 4 follows; see Algorithm 2 for pseudocode.

Input: A text arriving as a stream, and a dictionary of patterns such that the k-period of a fixed-length prefix of each pattern is large

1:for each pattern in the dictionary do Preprocessing
2:     Extract the prefix of the pattern
3:end for
4:Build the randomised -errata tree on the reverses of the prefixes
5:for each position of the text do Main stage
6:     Use the k-errata tree to find the k-mismatch occurrences of the prefixes ending at this position
7:     for each pattern in the dictionary do
8:         if there is a k-mismatch occurrence of its prefix then
9:              Plug this occurrence into the k-mismatch algorithm for the pattern
10:         end if
11:         If required, run the next step of the k-mismatch algorithm for the pattern
12:     end for
13:end for
Algorithm 2 Streaming dictionary matching with mismatches, large periods

5.3 Proof of Lemma 5 - patterns with small periods

In this section we show a streaming algorithm for the second dictionary, which contains the patterns whose prefixes have a small k-period. For each pattern, we consider its longest prefix with a small k-period. Two cases are possible: (i) this prefix equals the pattern itself (in other words, the k-period of the whole pattern is small); (ii) it is a proper prefix of the pattern.

We first assume that Case (i) holds for all the patterns in the dictionary, and then extend the algorithm to Case (ii) as well. We start by showing a simple but important property of patterns with small periods.

Lemma 10.

Let the pattern be a string with a small k-period, and consider a substring of the text formed by two consecutive blocks. Let the first region be the longest suffix of the first block with a small k-period, and the second region be the longest prefix of the second block with a small k-period. Every k-mismatch occurrence of the pattern in this substring is fully contained in the concatenation of the two regions.

Proof.

Consider a k-mismatch occurrence of the pattern in the substring. Let the first part be the portion of the occurrence that is a suffix of the first block, and the second part be the portion that is a prefix of the second block. Since the Hamming distance between the occurrence and the pattern is at most k, each part inherits a small period from the pattern: by the triangle inequality, the Hamming distance between a part and its shifted copy is bounded by the corresponding distance for the pattern plus twice the Hamming distance between the occurrence and the pattern. A similar claim holds for both parts. Therefore, the first part is fully contained in the first region and the second part is fully contained in the second region. ∎

We are now ready to describe the algorithm. We divide the patterns into groups by length, where each group contains the patterns whose lengths fall into one dyadic range. During the preprocessing stage, we build the k-errata tree for the patterns of each of the groups. For each group, we process the text in blocks overlapping by half of their length. Any occurrence of a pattern of the group is then fully contained in at least one of the blocks, and each occurrence is contained in at most two blocks.
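The half-overlapping block scheme can be sketched directly; here `block_len` stands for the block length chosen for a group (roughly twice the maximal pattern length of the group), and the containment property is what the algorithm relies on:

```python
def block_starts(n, block_len):
    """Starting positions of text blocks of length `block_len` overlapping
    by half, so that every window of length at most block_len // 2 is
    fully contained in at least one block (and in at most two)."""
    step = block_len // 2
    return list(range(0, max(n - step, 1), step))
```

For a text of length 10 and blocks of length 4, the blocks start at 0, 2, 4, 6, and any window of length at most 2 lies inside one of them.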

5.3.1 Computing the compressible region and its data structure

Consider a block. While reading its first half, we compute the first part of the compressible region together with an associated data structure. The data structure will be used to answer the following queries: given a substring of the region (defined by its starting and ending positions), return its k-mismatch sketch. We first explain how we compute the region and then give the details of the data structure.

We initialize the region with an empty string, and update it after each batch of characters. We assume that the half-length of the block is a multiple of the batch size, meaning that the region is fully computed by the time we reach the middle of the block. If this is not the case, we can process a shorter batch during the last step; the time complexity does not change (recall that we assume the patterns to be sufficiently long). While reading the next batch of characters, we compute the k-mismatch sketches of its prefixes (Lemma 7). After the batch has been read, we update the region. It suffices to compute, for each candidate period value, the longest suffix of the processed part of the block whose k-period is at most this value, and to take the longest of the computed suffixes. For a fixed candidate value, we use binary search and the k-mismatch sketches. Suppose we want to check whether the k-period of a given suffix equals a given value.

Observation 2.

If the considered suffix is longer than the previous region extended by the current batch, then its k-period is larger than the candidate value.

Proof.

If the k-period of the suffix were at most the candidate value, the k-period of its part read before the current batch would be at most this value as well, and hence that part would be contained in the previous region. Since the suffix is longer than the region extended by the batch, we obtain a contradiction. ∎

Therefore, we only need to consider the case when the suffix is fully contained in the region extended by the current batch. To decide whether the suffix has the given k-period, we must compute the Hamming distance between the suffix and its copy shifted by the candidate period. Both of these two strings can be represented as a concatenation of a suffix of the region and a prefix of the batch. We can retrieve the k-mismatch sketch of any suffix of the region using the data structure, and we know the k-mismatch sketches of all prefixes of the batch. Therefore, we can compute the k-mismatch sketches of both strings and the Hamming distance between them. If the Hamming distance is at most k, the k-period of the suffix is at most the candidate value, and otherwise it is larger. The claimed update time, amortised over the characters of the batch, follows.
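Under the standard definition used in the streaming k-mismatch literature, p is a k-period of a string if the string disagrees with its own shift by p in at most k positions. A brute-force sketch of this test (the paper performs it in polylogarithmic time via sketches; here we compare characters directly):

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def has_k_period(s, p, k):
    """Check whether p is a k-period of s, i.e. whether s without its
    last p characters and s without its first p characters differ in at
    most k positions."""
    assert 1 <= p <= len(s)
    return hamming(s[:len(s) - p], s[p:]) <= k
```

For example, `"abababab"` has 2 as a 0-period, while `"abababax"` needs a budget of one mismatch for the same shift.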

We now describe the data structure associated with the region. Suppose that after the latest update the k-period of the region equals some value, and consider a partitioning of the region into non-overlapping mini-blocks of this length. We say that a mini-block contains a mismatch if, for some offset, its character at this offset differs from the character at the same offset of the preceding mini-block. For convenience, we also say that the first mini-block of the region is mismatch-containing.

Observation 3.

The total number of mini-blocks containing a mismatch is O(k).

Proof.

By definition of the k-period, the Hamming distance between the region and its shifted copy is at most k, and it upper bounds the number of mini-blocks containing a mismatch, apart from the first one. ∎

The data structure consists of three parts. First, we store a binary search tree on the set of all mini-blocks containing a mismatch. Secondly, for each mini-block containing a mismatch we store the k-mismatch sketch of each of its suffixes. Thirdly, for each such mini-block we store the sketch of the suffix of the region starting at its position. In total, the data structure occupies little space.
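The mismatch-containing mini-blocks and the binary-search step over them can be sketched as follows (with direct string comparison standing in for the sketch-based tests, and `bisect` over a sorted list playing the role of the binary search tree):

```python
import bisect


def mismatch_miniblocks(region, p):
    """Indices of the mini-blocks (of length p) whose content differs from
    the preceding mini-block; the first mini-block counts by convention.
    The last mini-block may be shorter and is compared position-wise."""
    blocks = [region[i:i + p] for i in range(0, len(region), p)]
    return [0] + [j for j in range(1, len(blocks))
                  if blocks[j] != blocks[j - 1][:len(blocks[j])]]


def streak_start(marked, block_index):
    """The marked mini-block beginning the streak of mismatch-free
    mini-blocks that contains `block_index` (binary search)."""
    i = bisect.bisect_right(marked, block_index) - 1
    return marked[i]
```

In a streak of mismatch-free mini-blocks, every mini-block equals the previous one, so the streak is a repetition of a single mini-block; this is what lets the algorithm reconstruct the sketch of any suffix from the few stored sketches.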

Lemma 11.

The data structure can be updated in amortised time per character. After it has been updated, we can efficiently compute the k-mismatch sketch of any substring of the region.

Proof.

Using the k-mismatch sketches of the region and of its shifted copy, we can find the mini-blocks containing a mismatch. We can then re-compute the binary search tree. Since we have already computed the sketches of the latest suffixes of the text, we can also compute the sketches for the mismatch-containing mini-blocks.

By Lemma 7, we only need to explain how to compute the sketch of an arbitrary suffix of the region. Given the starting position of a suffix, we use the binary search tree to determine the streak of mismatch-free mini-blocks that the position belongs to, and retrieve the stored sketch of the suffix starting just after the streak. The remaining part of the suffix consists of a number of repetitions of a single mini-block, prepended with a suffix of the mini-block containing the starting position. We can compute the sketch of this mini-block and of its suffix, and therefore we can compute the sketch of the remaining part using Lemma 7 (note that the length of a mini-block is bounded by the k-period of the region). ∎

Suppose we have reached the middle position of the block and have finished the computation of the first part of the region and of its data structure. We then continue to computing the second part, the longest prefix of the second half of the block with a small k-period. First, we compute and remember the k-mismatch sketches and the fingerprints of the first few prefixes of the second half. While reading the second half, we compute the sketches and the fingerprints of its prefixes. We also assume that the sketch of the first part of the region is known. After each batch of characters, we update the prefix. It suffices to compute, for each candidate period value, the longest prefix of the processed part whose k-period is at most this value, and to take the longest of the computed prefixes. For a fixed candidate value, we use binary search and the sketches. If the prefix ends before the end of the processed part, we have reached its end and can stop the computation. This part of the algorithm requires little space and amortised time per character comparable to the first part.

5.3.2 Retrieving occurrences of the patterns

During the preprocessing step we build the randomised k-errata tree on the reverses of the patterns. Suppose we are in the course of computing the region and would like to retrieve the k-mismatch occurrences of the patterns at the current position. Any sufficiently short suffix of the current text can be represented as a concatenation of at most three strings: a substring of the first part of the region, a substring of its second part, and, if the region has not been updated yet, one of the latest suffixes of the text. The data structure allows us to compute the k-mismatch sketch of any substring of the region, and we also store the k-mismatch sketches of the latest suffixes of the text. Therefore, we can retrieve the k-mismatch occurrences of the patterns at the current position using the k-errata tree.

We summarize our solution for Case (i) in Algorithm 3.

Input: A text arriving as a stream, and a dictionary of patterns such that the k-period of each pattern is small

1:for each group of pattern lengths do Preprocessing
2:     Collect the patterns whose lengths fall into the group's range
3:     Build the randomised k-errata tree on the reverses of the patterns of the group
4:end for
5:for each block of the text do Main stage
6:     Maintain a compressible region of the block containing all occurrences of the patterns of the group
7:     Maintain a data structure that allows to compute the k-mismatch sketch of any substring of the region
8:     At each position, use the k-errata tree to retrieve the k-mismatch occurrences of the patterns
9:end for
Algorithm 3 Streaming dictionary matching with mismatches, patterns with small periods

5.3.3 Extension to Case (ii) and wrapping up

Consider now Case (ii), when each pattern has a proper prefix with a small k-period that cannot be extended. Note first that the k-period of this prefix extended by one character must be large, and therefore by Observation 1 there can be at most one k-mismatch occurrence of the extended prefix per fixed-length stretch of the text. We use the techniques of the algorithm for Case (i) to retrieve the occurrences of the extended prefix, and then use the techniques of the algorithm for patterns with large periods (Lemma 4) to extend the retrieved occurrences.

In more detail, we again process the patterns in groups, and for each group divide the text into blocks. Consider a block and its compressible region, and consider the prefix of a pattern extended by the character following it. All k-mismatch occurrences of the extended prefix are fully contained in the region, and we can compute the Karp-Rabin fingerprint and the k-mismatch sketch of any substring of the region. Therefore, we can find the occurrences of the extended prefix using the k-errata tree. We now need to decide which of the found occurrences can be extended into full occurrences of the pattern. In order to do this, we run an instance of the k-mismatch algorithm of Porat and Porat for each pattern. When we find an occurrence of the extended prefix, we plug it into the k-mismatch algorithm for the pattern and proceed as described in Section 5.2.
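The constant-time substring fingerprints used throughout rely on a standard property of Karp-Rabin fingerprints: from the fingerprints of all prefixes (and the powers of the base), the fingerprint of any substring follows in O(1). A minimal sketch with an illustrative base and modulus:

```python
MOD = (1 << 61) - 1   # a large Mersenne prime (illustrative choice)
BASE = 131            # illustrative base


def prefix_fingerprints(s):
    """phi[i] is the fingerprint of s[:i]; pw[i] is BASE**i mod MOD."""
    phi, pw = [0], [1]
    for ch in s:
        phi.append((phi[-1] * BASE + ord(ch)) % MOD)
        pw.append(pw[-1] * BASE % MOD)
    return phi, pw


def substring_fp(phi, pw, i, j):
    """Fingerprint of s[i:j], computed in O(1) from the prefix values."""
    return (phi[j] - phi[i] * pw[j - i]) % MOD
```

Equal substrings always get equal fingerprints; distinct substrings collide only with small probability over a random base, which is where the randomisation of the algorithm comes from.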

In total over all groups of patterns, the algorithm for Case (i) achieves the claimed space and time per character, and the algorithm for Case (ii) achieves the claimed space and amortised time per character.

Lemma 5 follows.

6 Proof of Theorem 2 - De-amortization

Recall that the streaming algorithm of Theorem 4 is composed of the algorithms of Lemma 4 and of Lemma 5 run in parallel. Below we explain how to de-amortize these two algorithms. We use a standard approach called the tail trick that was already used in [9, 10, 11].

First, note that there is an easy way to de-amortise the algorithm of Lemma 4 if we allow delaying the occurrences by one block length. In order to do that, we divide the text into non-overlapping blocks and de-amortise the processing time of a block over the next block, by running a fixed number of steps of the computation per character. We will need to memorize the last two blocks of characters, but this requires only modest space and we can afford it.
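The tail trick can be sketched as a scheduler that buffers the current block while spending a fixed number of steps per arriving character on the previous block's work, so every result is delayed by at most one block. The helper `shout` is a toy unit of work for illustration; the sketch assumes `steps_per_char * block_len` covers a block's total work:

```python
def deamortise(stream, block_len, make_work, steps_per_char):
    """Tail trick: `make_work(block)` returns an iterator performing the
    block's computation in small steps; we advance it `steps_per_char`
    steps per character of the following block."""
    results, block, pending = [], [], None
    for ch in stream:
        block.append(ch)
        for _ in range(steps_per_char):     # a fixed budget per character
            if pending is None:
                break
            try:
                results.append(next(pending))
            except StopIteration:
                pending = None
        if len(block) == block_len:         # start the finished block's work
            pending = make_work(''.join(block))
            block = []
    if pending is not None:                 # drain the final block's work
        results.extend(pending)
    return results


def shout(block):
    """Example unit of work: emit the block uppercased, one step per char."""
    for ch in block:
        yield ch.upper()
```

Each character triggers only O(steps_per_char) work, while the amortised total stays the same; only the reporting is delayed by one block.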

We now show how to de-amortize the algorithm for Case (i) of Lemma 5. This time, we will not need the delay. Recall that we consider the patterns of each group of lengths separately. For each group, we process the text in overlapping blocks. For each block, we compute a compressible region inside it and a data structure that allows to compute the sketches of any substring of the region efficiently. We compute the region and the data structure online, updating them after each batch of characters of the text. This is the only step of the algorithm that requires de-amortization, and we can de-amortise it in a standard way: we spread the time needed for an update by running a fixed number of steps of the computation per each of the following characters of the text. We also maintain the sketches of the latest suffixes of the text in a round-robin fashion. If we need to extract the sketch or the fingerprint of some substring before the update is finished, we use the previous version of the data structure and the sketches or fingerprints of the latest suffixes of the text to compute the required values using Lemma 7.

Finally, we show how to de-amortize the algorithm of Case (ii) of Lemma 5, again with a bounded delay. Recall that this algorithm first processes the prefixes using the algorithm for Case (i) of Lemma 5, which can be de-amortized with no delay as explained above, and then feeds the occurrences that we found into the algorithm of Lemma 4, which can be de-amortized with a bounded delay. The claim follows.

Removing the delay.

We now show how to remove the delay. Recall that we assume the patterns to be sufficiently long. We partition each pattern into two parts: a suffix of fixed length and the remaining prefix. The idea is to find the occurrences of the prefixes and of the suffixes independently, and then to see which of them together form an occurrence of a pattern.
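The joining step can be sketched as follows; the names are illustrative, occurrences are identified by their ending positions, and the Hamming distances of the two parts must together stay within the budget k:

```python
def join_occurrences(prefix_ends, suffix_ends, suffix_len, k,
                     prefix_dist, suffix_dist):
    """A pattern occurrence ending at position e decomposes into a prefix
    occurrence ending at e - suffix_len and a suffix occurrence ending
    at e; report e whenever both parts were found and their mismatch
    counts sum to at most k. Distances are dicts position -> Hamming
    distance of the reported part."""
    out = []
    for e in suffix_ends:
        p_end = e - suffix_len
        if p_end in prefix_ends and prefix_dist[p_end] + suffix_dist[e] <= k:
            out.append(e)
    return out
```

Since the prefix occurrences are reported with a bounded delay, they are already available (and stored) by the time the matching suffix occurrence is found, which is exactly what makes this join possible without waiting.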

As above, we have three possible cases: the k-period of the pattern's prefix is large; the k-period of the whole pattern is small; the k-period of the whole pattern is large but the k-period of its prefix is small.

In the second case, we can still use the de-amortized algorithm of Case (i) of Lemma 5. If an occurrence of a pattern belongs to a compressible region of some block, the corresponding occurrence of its prefix must belong to the region as well. We can retrieve the sketch of any substring of the region using the data structure that we constructed for it and the sketches for the latest prefixes of the text. We can therefore use the k-errata tree for the prefixes, as before.

We now explain how we remove the delay in the first and the third cases. To find the occurrences of the suffixes we use the k-errata tree together with the streaming algorithm of Section 4.3. Since all the suffixes have equal lengths, it suffices to retrieve the patterns corresponding to the nodes where the operations terminate.

To find the occurrences of the prefixes we use the de-amortised version of the algorithm of Lemma 4 or of Lemma 5, as appropriate, which reports the occurrences with a bounded delay. It means that by the time we find an occurrence of a suffix, the corresponding occurrence of the prefix has already been reported, so it is easy to check whether they form an occurrence of the pattern. The only technicality is that we need to store the occurrences of the prefixes found while processing the latest characters of the text.

We do it in the following way. We create binary search trees, one for each possible value of the Hamming distance and for each of the tries of the k-errata tree. Suppose that at a certain position we encounter an occurrence of one of the prefixes, with some Hamming distance to the text. The prefix corresponds to nodes of the k-errata tree. Consider the binary search tree we created for this Hamming distance and a trie of the k-errata tree: if the prefix corresponds to a node of this trie, we add the position of the occurrence to this binary search tree. Note that the total size of the binary search trees is bounded, as each of the patterns has at most one k-mismatch occurrence over each fixed-length stretch of the text.

Suppose we are at a position of the text, run a dictionary look-up query, and find the nodes in the tries of the k-errata tree corresponding to the suffixes that occur at this position with at most k mismatches. For each such node we know the Hamming distance between the occurrence and the text. We then go to the binary search trees corresponding to the trie containing the node and to each Hamming distance that, combined with the distance of the suffix, stays within the budget k. From each of the considered binary search trees, we report all occurrences corresponding to the node.
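The bookkeeping above can be sketched with buckets keyed by (trie, distance), each holding a sorted list of positions (standing in for the binary search trees); the identifiers are illustrative:

```python
import bisect
from collections import defaultdict


class PrefixOccurrenceIndex:
    """Store recent prefix occurrences keyed by (trie id, Hamming
    distance), each bucket a sorted list of positions."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, trie_id, dist, position):
        bisect.insort(self.buckets[(trie_id, dist)], position)

    def report(self, trie_id, suffix_dist, k, position):
        """All stored prefix occurrences at `position` in the given trie
        whose distance, combined with `suffix_dist`, is at most k."""
        found = []
        for d in range(k - suffix_dist + 1):
            lst = self.buckets.get((trie_id, d), [])
            i = bisect.bisect_left(lst, position)
            while i < len(lst) and lst[i] == position:
                found.append((d, position))
                i += 1
        return found
```

A query touches one bucket per admissible distance value, mirroring how the text consults one binary search tree per (trie, distance) pair.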

7 Proof of Lemma 1 - Space lower bound

In the communication complexity setting, the index problem is stated as follows. There are two players, Alice and Bob. Alice holds a binary string, and Bob holds an index encoded in binary. In a one-round protocol, Alice sends Bob a single message (depending on her input and on her random coin flips) and Bob must compute the bit of Alice's input at his index, using her message and his own random coin flips, correctly with constant probability greater than 1/2. The length of Alice's message (in bits) is called the randomised one-way communication complexity of the problem. In [23] it was shown that the randomised one-way communication complexity of the index problem is linear in the length of Alice's string.

We will construct a randomised one-way communication protocol for the index problem from the streaming algorithm for dictionary matching with mismatches. As above, let d be the size of the dictionary. Split Alice's string into d blocks of equal length, and fix d distinct characters outside the binary alphabet, one per block. For each block create a string, where