Generalised Pattern Matching Revisited

01/16/2020
by Bartłomiej Dudek, et al.

In the problem of Generalised Pattern Matching (GPM) [STOC'94, Muthukrishnan and Palem], we are given a text T of length n over an alphabet Σ_T, a pattern P of length m over an alphabet Σ_P, and a matching relationship ⊆ Σ_T × Σ_P, and must return all substrings of T that match P (reporting) or the number of mismatches between each substring of T of length m and P (counting). In this work, we improve over all previously known algorithms for this problem for various parameters describing the input instance:

  * D, the maximum number of characters that match a fixed character,
  * S, the number of pairs of matching characters,
  * I, the total number of disjoint intervals of characters that match the m characters of the pattern P.

At the heart of our new deterministic upper bounds for D and S lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate the first lower bounds for GPM. We start by showing that any deterministic or Monte Carlo algorithm for GPM must use Ω(S) time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.



1 Introduction

Processing noisy data is a cornerstone of modern string processing. One possible approach to this challenge is approximate pattern matching, where the task is to find all substrings of the text that are close to the pattern under some similarity measure, such as Hamming or edit distance. The approximate pattern matching approach assumes that the noise is arbitrary, i.e. that any character of the pattern or of the text can be deleted or replaced by any other character of the alphabet.

The assumption that the noise is completely arbitrary is not necessarily justified, as in practice we may have some predetermined knowledge about the structure of the errors. In this paper we focus on the Generalised Pattern Matching (GPM) problem, which addresses this setting. We are given a text T over an alphabet Σ_T and a pattern P over an alphabet Σ_P, and we allow each character of Σ_P to match a subset of the characters of Σ_T. We must report all substrings of the text that match the pattern. This problem was introduced in STOC'94 [35] by Muthukrishnan and Palem to provide a unified approach to different extensions of the classical pattern matching question that had been considered as separate problems in the early 90s. Later, Muthukrishnan [34] considered a counting variant of GPM, where the task is to count the number of mismatches between each substring of the text and the pattern. Formally, the problem is defined as follows:

Generalised Pattern Matching (GPM)
Input: A text T ∈ Σ_T^n, a pattern P ∈ Σ_P^m, and a matching relationship ⊆ Σ_T × Σ_P.
Output (Reporting): All positions i such that T[i, i+m−1] matches P.
Output (Counting): For each position i, the number of positions j such that T[i+j−1] does not match P[j].
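For concreteness, the following brute-force reference implementation of both variants may be helpful (a minimal sketch of ours, not an algorithm from the paper; the matching relationship is given as a set of pairs of matching characters):

```python
def gpm_counting(text, pattern, matches):
    """Brute-force GPM counting: `matches` is a set of pairs (t, p) meaning
    that text character t matches pattern character p.  Returns, for every
    alignment i, the number of mismatching positions.  Runs in O(nm) time."""
    n, m = len(text), len(pattern)
    return [sum((text[i + j], pattern[j]) not in matches for j in range(m))
            for i in range(n - m + 1)]

def gpm_reporting(text, pattern, matches):
    """All alignments where every pattern character matches."""
    return [i for i, k in enumerate(gpm_counting(text, pattern, matches)) if k == 0]

# Example: text 'a' matches pattern 'a' and 'b'; text 'b' matches pattern 'b'.
M = {('a', 'a'), ('a', 'b'), ('b', 'b')}
print(gpm_reporting("abab", "ab", M))  # [0, 2]
print(gpm_counting("abab", "ab", M))   # [0, 1, 0]
```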

Muthukrishnan and Palem [35] and subsequent work [34, 36] considered three natural parameters describing the matching relationship (D and S) or the pattern (I). Viewing the matching relationship as a bipartite graph with edges connecting pairs of matching characters of Σ_T and Σ_P, D is the maximum degree of a node and S is the total number of edges in the graph. Next, the parameter I describes the pattern rather than the matching relationship. For each character P[j] of the pattern, let I_j be the minimal set of disjoint sorted intervals that contain the characters that match P[j], and define I = Σ_j |I_j|.
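These parameters are easy to compute directly from the relationship; the sketch below (our illustration, with characters ordered via their code points so that intervals are well defined) makes the definitions concrete:

```python
from collections import defaultdict

def parameters(matches, pattern):
    """D, S, I for a matching relationship given as a set of pairs (t, p):
    text character t matches pattern character p."""
    S = len(matches)
    deg = defaultdict(int)
    for t, p in matches:
        deg[('T', t)] += 1   # degree on the text side of the bipartite graph
        deg[('P', p)] += 1   # degree on the pattern side
    D = max(deg.values(), default=0)
    I = 0
    for p in pattern:        # count maximal runs of consecutive matching characters
        codes = sorted(ord(t) for t, q in matches if q == p)
        I += sum(1 for i, c in enumerate(codes) if i == 0 or codes[i - 1] != c - 1)
    return D, S, I

M = {('a', 'a'), ('b', 'a'), ('b', 'b')}
print(parameters(M, "ab"))  # (2, 3, 2): D = 2, S = 3, I = 2
```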

The maximum number of characters that match a fixed character, D.

For the reporting variant of GPM, Muthukrishnan [34] showed a Las Vegas algorithm, and Indyk [27] used superimposed codes to obtain a deterministic algorithm. For the counting variant, Muthukrishnan [34] showed a (1+ε)-approximation Las Vegas algorithm, and Indyk [27] gave a (1+ε)-approximation deterministic algorithm and a (1+ε)-approximation Monte Carlo algorithm.

The number of matching pairs of characters, S.

For this parameter, Muthukrishnan and Ramesh [36] gave an algorithm for the reporting variant of GPM.

The number of intervals of matching characters, I.

For this parameter, Muthukrishnan [34] gave an efficient algorithm.¹

¹[34, Theorem 9] claims a stronger bound, but the first sentence of the proof states the running time only for the case n = O(m), where the first term is the time needed to read the input. For a longer text, one needs to apply the algorithm O(n/m) times for overlapping blocks of length O(m), making the total time larger accordingly.

1.1 Our Contribution

We improve the existing randomised and deterministic upper bounds for GPM, and demonstrate matching lower bounds. At the heart of our deterministic algorithms for the counting variant of GPM is a solution to an open problem of Indyk [27] on the construction of superimposed codes.

Data-dependent superimposed codes.

A k-superimposed code is a set of binary vectors such that no vector is contained in the Boolean sum (i.e. bitwise OR) of any k other vectors. Superimposed codes find their main applications in information retrieval (e.g. in compressed representations of document attributes) and in optimizing broadcasting on radio networks [30], and have also proved to be useful in graph algorithms [25, 1]. Indyk [27] extended the notion of superimposed codes to so-called data-dependent superimposed codes, and asked for a deterministic construction of such codes with a certain additional property that makes them useful for counting mismatches (see Section 2 for a formal definition). We provide such a construction algorithm in Theorem 2.7. We briefly describe the high-level idea below.
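To make the (data-independent) definition concrete, here is a brute-force checker of the k-superimposed property, viewing each codeword as the set of its 1-positions (a test utility of ours, exponential in k and intended only for small examples):

```python
from itertools import combinations

def is_k_superimposed(codewords, k):
    """codewords: list of sets of positions (the 1-positions of each binary
    vector).  True iff no codeword is covered by the union (Boolean sum)
    of any k other codewords."""
    for i, c in enumerate(codewords):
        others = codewords[:i] + codewords[i + 1:]
        for group in combinations(others, min(k, len(others))):
            if c <= set().union(*group):   # c contained in the bitwise OR
                return False
    return True

# Each codeword below owns a private position, so no union of the other
# two codewords can cover it.
print(is_k_superimposed([{0, 3}, {1, 3}, {2, 3}], 2))  # True
```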

We need the concept of discrepancy minimization. Given a universe U, each of its elements is assigned one of two colours, red or blue. The discrepancy of a subset of U is defined as the difference between the number of red and blue elements in it, and the discrepancy of a family F of subsets is defined as the maximum of the absolute values of the discrepancies of the subsets in F. Discrepancy minimization is a fundamental notion with numerous applications, including derandomization, computational geometry, numerical integration, and understanding the limits of models of computation (see e.g. [13]). A recent line of work showed a series of algorithms for constructing colourings of low discrepancy in various settings [33, 5, 10, 6, 9, 8, 7, 32]. For our applications, we need to work under the assumption that the size of each subset in F is bounded by a given parameter. In Theorem 2.4, we describe a fast deterministic algorithm that returns a colouring of small discrepancy for this case. We follow the algorithm described by Chazelle [13], which can be roughly summarized as the method of conditional expectations tweaked so as to allow for an efficient implementation. In more detail, Chazelle's construction assumes infinite precision of computation and does not immediately translate into an efficient algorithm in the Word RAM model of computation; some technical issues must be resolved to bound the required precision and the overall complexity.
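The general method is easy to sketch (our simplified illustration: the exponential potential G and the parameter lam below are standard choices for this kind of construction, not necessarily the exact ones used in the paper):

```python
import math

def greedy_colouring(n, sets, lam=0.5):
    """Colour elements 0..n-1 with +1/-1 one at a time, never backtracking.
    Potential: G = sum_i cosh(lam * disc_i), where disc_i is the current
    signed discrepancy of sets[i]; each element receives the colour that
    minimizes G (method of conditional expectations)."""
    disc = [0] * len(sets)
    member = [[] for _ in range(n)]          # sets containing each element
    for i, s in enumerate(sets):
        for x in s:
            member[x].append(i)
    colour = [0] * n
    for x in range(n):
        def G(c):  # potential restricted to the sets containing x
            return sum(math.cosh(lam * (disc[i] + c)) for i in member[x])
        colour[x] = 1 if G(+1) <= G(-1) else -1
        for i in member[x]:
            disc[i] += colour[x]
    return colour

sets = [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 2, 4}]
chi = greedy_colouring(6, sets)
print([abs(sum(chi[x] for x in s)) for s in sets])  # small discrepancies
```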

We apply discrepancy minimization to design, in Lemma 2.6, a procedure that, given a family of subsets of U, partitions the universe into not too many parts such that the intersection of each part with each of the subsets is small. The procedure follows the natural idea of colouring the universe with two colours and then recursing on the elements of the same colour (see the sketch below). Every step of this construction introduces some penalty that needs to be carefully controlled so as to guarantee the desired property in the end. Because of this penalty, we are only able to guarantee that the intersections are small, but not constant. To finish the construction, we combine the partition with a hash function into a ring of polynomials. We stress that this part of the construction is new and not simply a modification of Chazelle's (or Indyk's) method.
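The recursion can be sketched as follows (our illustration, reusing greedy_colouring from the previous sketch; the stopping rule and the bookkeeping of the actual Lemma 2.6 are more delicate):

```python
def partition(universe, sets, threshold):
    """Recursively split `universe` with low-discrepancy colourings until
    every part intersects every set in at most `threshold` elements."""
    if all(len(s & universe) <= threshold for s in sets):
        return [universe]
    elems = sorted(universe)
    pos = {x: i for i, x in enumerate(elems)}
    restricted = [{pos[x] for x in s & universe} for s in sets]
    chi = greedy_colouring(len(elems), restricted)
    red = {x for x, c in zip(elems, chi) if c == +1}
    blue = universe - red
    if not red or not blue:          # safety net against degenerate splits
        red, blue = set(elems[::2]), set(elems[1::2])
    return partition(red, sets, threshold) + partition(blue, sets, threshold)

sets = [set(range(0, 50)), set(range(25, 75))]
parts = partition(set(range(100)), sets, threshold=10)
print(len(parts), max(len(s & p) for s in sets for p in parts))
```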

Upper bounds for GPM.

Similarly to previous work, we assume that the sizes of the alphabets are polynomial in the length of the text and that the matching relationship is given as a graph on the vertex set Σ_T ∪ Σ_P. We also assume access to three oracles that can answer each of the following questions in constant time (a toy realization follows the list below):

  1. Is there an edge between a ∈ Σ_T and b ∈ Σ_P (in other words, do a and b match)?

  2. What is the degree of a character of Σ_T or Σ_P (in other words, what is the number of characters that match a given character)?

  3. What is the i-th neighbour of a character (in other words, what is the i-th character matching it)? We assume an arbitrary (but fixed) order of the neighbours of every node.
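A toy realization of this interface (our own illustrative code; the paper only assumes the oracles, not any concrete representation):

```python
from collections import defaultdict

class MatchingOracle:
    """Adjacency-list view of the matching relationship supporting the three
    oracle queries.  Edge queries use a hash set, hence O(1) expected rather
    than worst-case time - good enough for illustration."""
    def __init__(self, pairs):                 # pairs of matching characters
        self.edges = set(pairs) | {(b, a) for a, b in pairs}
        self.adj = defaultdict(list)
        for a, b in sorted(self.edges):        # arbitrary but fixed order
            self.adj[a].append(b)

    def match(self, a, b):      # oracle 1: do a and b match?
        return (a, b) in self.edges

    def degree(self, a):        # oracle 2: number of characters matching a
        return len(self.adj[a])

    def neighbour(self, a, i):  # oracle 3: i-th character matching a
        return self.adj[a][i]

o = MatchingOracle([('a', 'a'), ('a', 'b'), ('b', 'b')])
print(o.match('a', 'b'), o.degree('a'), o.neighbour('a', 1))  # True 2 b
```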

Under these assumptions, we show the following upper bounds summarized in Tables 1 and 2:

  1. We start by showing a new Monte Carlo algorithm for the parameter D (Theorem 3.1). While its running time is the same as that of [34], it encapsulates a novel approach to the problem that serves as a basis for our other algorithms. We then derive a Monte Carlo algorithm for the parameter S (Theorem 3.2). As a corollary, we show a (1+ε)-approximation Monte Carlo algorithm that solves the counting variant of GPM (Corollary 3.3). All three algorithms have inverse-polynomial error probability.

  2. Next, using the data-dependent superimposed codes, we construct (1+ε)-approximation deterministic algorithms for the counting variant of GPM (Theorems 3.4 and 3.5). By taking ε small enough, we immediately obtain deterministic algorithms for the reporting variant of the problem with the same complexities.

  3. Finally, we show that both the reporting and the counting variants of GPM can be solved exactly and deterministically when parameterized by I (Theorem 3.7).

Table 1: Generalised Pattern Matching (reporting): running times of the deterministic and randomised algorithms of [27], [34], [36], and of this work.
Table 2: Generalised Pattern Matching (counting): running times and approximation factors of the deterministic and randomised algorithms of [27], [34], and of this work.
Lower bounds for GPM.

We also show the first lower bounds for GPM (see Appendix 4). We start with a simple adversary-based argument showing that any deterministic algorithm, or any Monte Carlo algorithm with constant error probability, that solves GPM must use Ω(S) time (Lemmas 4.1 and 4.2). We then proceed to show higher lower bounds for combinatorial algorithms by a reduction from Boolean matrix multiplication (Lemma 4.3 and Corollary 4.4).² All the lower bounds are presented for the reporting variant of GPM, so they immediately apply to the counting variant as well. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.

²It is not clear what combinatorial means precisely. However, FFT and Boolean convolution, which are often used in algorithms on strings, are considered not to be combinatorial.

1.2 Related Work

Degenerate string matching.

A more general approach to dealing with noise in string data is degenerate string matching, where the set of matching characters is specified for every position of the text or of the pattern (as opposed to every character of the alphabets). Abrahamson [3] showed the first efficient algorithm for the case of a degenerate pattern and a standard text. Later, several practically efficient algorithms were shown [37, 26].

Pattern matching with don’t cares.

In this problem, both the text and the pattern are over an alphabet Σ ∪ {?}, where "?" is a special don't care character. Two characters match if either one of them is the don't care character, or they are equal. The study of this problem commenced in [21], where an O(n log m log|Σ|)-time algorithm was presented. The time complexity of the algorithm was improved in subsequent work [18, 28, 29], culminating in an elegant O(n log m)-time deterministic algorithm of Clifford and Clifford [15]. Clifford and Porat [17] also considered the problem of identifying all alignments where the number of mismatching characters is bounded by a given threshold k.

Threshold pattern matching.

In the threshold pattern matching problem, we are given a parameter δ, and we say that two characters a and b match if |a − b| ≤ δ. The threshold pattern matching problem has been studied both in the reporting and the counting variants [4, 11, 12, 16, 19, 20, 22, 39]. The best algorithm for the reporting variant is deterministic and takes linear time (after the pattern has been preprocessed). The best known deterministic and randomised algorithms for the counting variant are those of [39].

In threshold pattern matching, the characters matching a fixed pattern character form a single interval, so I ≤ m. Hence, Theorem 3.7 immediately yields a faster deterministic algorithm for the counting variant of the threshold pattern matching problem (Corollary 3.8).
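For instance (our illustration, over an integer alphabet), with δ = 1 every pattern character P[j] matches exactly the interval [P[j]−δ, P[j]+δ], giving one interval per pattern position:

```python
def threshold_intervals(pattern, delta, sigma_max):
    """Each pattern character matches one interval of the ordered alphabet
    {0, ..., sigma_max}, so the parameter I equals len(pattern)."""
    return [(max(0, c - delta), min(sigma_max, c + delta)) for c in pattern]

print(threshold_intervals([3, 0, 7], delta=1, sigma_max=9))
# [(2, 4), (0, 1), (6, 8)]  ->  I = 3 = m
```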

2 Data-Dependent Superimposed Codes

We start by solving an open problem posed by Indyk [27]: to provide a deterministic algorithm for constructing a variant of data-dependent superimposed codes that is particularly suitable for the counting variant of GPM. The solution that we present is rather involved; a reader more interested in the pattern matching applications can skip this section on a first reading.

Definition 2.1.

Let S_1, …, S_d be subsets of a universe U. A family of sets C = {C(u) : u ∈ U}, where C(u) ⊆ {1, …, ℓ} and |C(u)| = w for every u ∈ U, is called an (S_1, …, S_d)-superimposed code if for every i and every u ∈ S_i, a constant fraction of the elements of C(u) does not belong to ⋃_{v ∈ S_i∖{u}} C(v). We call ℓ and w respectively the length and the weight of the code C.

Suppose that the size of each S_i is at most k, where k is some fixed integer. Indyk asked whether there exists a fast deterministic algorithm that computes an (S_1, …, S_d)-superimposed code of some weight w and small length ℓ. It can be seen that we cannot hope to construct such a code with the length independent of d. In the following lemma we show that even if we restrict ourselves to singleton sets, the length must still depend significantly on d.

Lemma 2.2.

For every constant , function , and large enough , there exists a family of singleton sets and such that any -superimposed code of weight must have length .

Proof.

Consider sets for , where will be determined later. Let and suppose that there is a -superimposed code . Then, by the definition of superimposed codes and from , for it holds that

so . Hence, every and must be disjoint, so . Assume towards a contradiction that . We obtain

where the last inequality holds for sufficiently large . This leads to a contradiction, and the claim follows. ∎

Therefore, one should allow the length to depend on d. We give a positive answer for this natural relaxation. We start by showing an efficient deterministic algorithm for discrepancy minimization, which will play an essential role in our approach.

2.1 Discrepancy Minimization

Let us start with a formal definition of discrepancy.

Definition 2.2 (Discrepancy).

Consider a family of sets S_1, …, S_d over a universe U. We call a function χ : U → {−1, +1} a colouring. The discrepancy of a set S_i under χ is defined as disc_χ(S_i) = Σ_{x ∈ S_i} χ(x), and the discrepancy of χ is defined as max_i |disc_χ(S_i)|.
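In this notation, red and blue from the informal description in Section 1.1 correspond to +1 and −1; for example (our toy instance):

```python
def discrepancy(sets, chi):
    """max_i |sum of colours over S_i| for a colouring chi: element -> +-1."""
    return max(abs(sum(chi[x] for x in s)) for s in sets)

sets = [{0, 1, 2, 3}, {1, 2}]
chi = {0: +1, 1: -1, 2: +1, 3: -1}
print(discrepancy(sets, chi))  # 0: both sets are perfectly balanced
```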

In [13, Section 1.1], Chazelle presented a construction of a colouring of small discrepancy assuming infinite precision of computation. Our deterministic algorithm will follow the outline of this construction (although crucial modifications are required in order to overcome the infinite precision assumption), so we quickly restate Chazelle’s construction below. The main idea is to assign colours so as to minimize the value of an objective function defined as follows: let be chosen so that for some constant , and let (respectively, ) be the number of such that (respectively, ) for . Define

Chazelle's construction assigns colours to the elements of the universe one at a time, without ever backtracking. To assign a colour to an element, it performs the following three simple steps. First, it computes the value of the objective function under the assumption that the element receives colour +1. Second, it computes the value under the assumption that the element receives colour −1. Finally, it commits to the colour that yields the smaller of the two values. Observe that the value of the objective function can only decrease during this process. This implies an important property of Chazelle's construction: at any moment of the construction, the objective function is bounded by its value at initialization. Let us show that small values of the objective function imply small discrepancy. In order to do this, we follow the outline of [13], but use a slightly higher bound to be able to apply the lemma later.

Lemma 2.3 ([13]).

If, after all elements of have been assigned a colour, we have for all , then the discrepancy of the resulting colouring is at most for any constant .

Proof.

After all elements of have been assigned a colour, we have

Consequently,

By taking the logarithm of both sides, we obtain

For all ,

which implies that

Substituting , we finally obtain for any :

We will show a deterministic algorithm that computes a colouring for which the values are bounded appropriately; by Lemma 2.3, this implies that the discrepancy is small. We must overcome several crucial issues: first, we must explain how to compute the objective function. Second, we must design an algorithm that uses only multiplications and additions, so as to be able to control the accumulated precision error. Finally, we must explain how to remove the assumption of infinite precision and ensure that we never operate on numbers that are too small.

Proposition 2.3.

Assume . There is a deterministic algorithm that computes such that for some constant in time. Both and are bounded from below by .

Proof.

We present the algorithm as a sequence of four steps. Let , where is computed in time by incrementing a counter until . As , it holds that . Compute such that by incrementing a counter until and returning . We have . This step takes time. For , we have . For , we have and . Finally, we have . It follows that we can take , or equivalently, , which concludes the proof.

Note that both quantities are bounded from below as required.

We can implement Chazelle’s construction to use only multiplications and additions via segment trees.

Proposition 2.3.

Assume that and are known. Chazelle’s construction can be implemented via addition and multiplication operations.

Proof.

We maintain a complete binary tree on top of , where . At any moment, the -th leaf stores and the -th leaf stores for all , while all the other leaves store the value . Each internal node stores the sum of the values in the leaves of its subtree. In particular, the root stores the value . To update after setting for , we must update the values stored in the -th and -th leaves for all such that , as well as the sums in the internal nodes above these leaves. For each leaf, we use one multiplication operation (we must multiply the value by or , as appropriate), and for each internal node we use one addition operation. In total, we need addition and multiplication operations. ∎
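A generic sketch of such a tree (our illustration: leaves hold positive weights, the root holds the potential, and an update multiplies one leaf and refreshes the sums on the path to the root):

```python
class SumTree:
    """Complete binary tree over `size` leaves; every internal node stores
    the sum of its subtree, so the root holds the full potential."""
    def __init__(self, size, init=1.0):
        self.n = 1
        while self.n < size:
            self.n *= 2
        self.t = [0.0] * (2 * self.n)
        for i in range(size):
            self.t[self.n + i] = init
        for v in range(self.n - 1, 0, -1):
            self.t[v] = self.t[2 * v] + self.t[2 * v + 1]

    def multiply_leaf(self, i, factor):
        v = self.n + i
        self.t[v] *= factor          # one multiplication at the leaf
        v //= 2
        while v >= 1:                # one addition per internal node on the path
            self.t[v] = self.t[2 * v] + self.t[2 * v + 1]
            v //= 2

    def total(self):
        return self.t[1]             # current value of the potential

tree = SumTree(4)            # potential starts at 4.0
tree.multiply_leaf(2, 0.5)   # recolouring one element rescales one leaf
print(tree.total())          # 3.5
```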

We are now ready to remove the infinite precision assumption and to show the final result of this section. Our algorithm will follow the outline of Proposition 2.1, but the addition and the multiplication operations will be implemented with precision . Moreover, we will guarantee that the algorithm only works with values in , which will imply that both arithmetic operations can be performed in constant time and that the algorithm takes time.

Theorem 2.4.

Given a family of sets where and , one can find deterministically in time a colouring such that for some constant .

Proof.

Let and . If , then for any colouring . From now on, we assume .

We first compute as explained in Proposition 2.1. After having computed and , the algorithm initializes a complete binary tree on top of , where . The algorithm assigns to every leaf , and to all other leaves, and then performs a bottom-up traversal to compute the values of inner nodes as the sum of their children.

Then we proceed as in Proposition 2.1, that is to update after setting for , we update the values stored in the -th and -th leaves for all such that , as well as the sums in the internal nodes above these leaves. We implement addition and multiplication with precision , which means that instead of the true value the algorithm obtains a value such that . Moreover, to ensure that the values the algorithm operates with are never too small, we apply the following workaround. For each leaf , we store a counter denoting that the value in the leaf should have been multiplied by , but was not in order not to store numbers below . Whenever the value in the leaf is multiplied by , we also multiply it by , where is the largest integer such that the value is still larger than and update . Then we update the values in the inner nodes on the path from the leaf to the root and, while summing the values of children, we treat the values in the leaves that have (in other words, the values that are smaller than ) as zeros.

Recall that after assigning a colour to an element , we must update the value . We claim that at any moment, the absolute difference (“the absolute error”) between the value  computed by our algorithm and the value computed by the algorithm of Proposition 2.1 with infinite precision, is . Below we call the latter value “the true value” of . It follows that we can choose small enough so that after our algorithm has assigned colours to all elements of , the value of will be bounded by . By Lemma 2.3, this implies that the discrepancy of the constructed colouring is bounded by , where the constant is as in Proposition 2.1.

We show the claim by induction. Namely, we show that after we have assigned colours to  elements, the absolute error is . For , the claim obviously holds. Consider now . The value computed by the algorithm can be different from the true value of at this step for three reasons:

  1. The values in some leaves are replaced with zeros.

  2. Addition and multiplication are implemented with precision .

  3. We decide the colouring based on approximate values of .

Now we bound the absolute error between the value computed by our algorithm and the true value at this step. Recall that the total number of arithmetic operations in the algorithm is . It follows that the computed values are the values of arithmetic expressions with addition and multiplication operations. By underestimating the values of leaves, we additionally decrease the values by at most . Implementing addition and multiplication with precision , we compute the sum of the remaining terms in the arithmetic expressions with precision . By the definition of and the induction assumption, the true values at this step are bounded from above by . It therefore follows that at step we add at most to the absolute error. By the induction assumption, the total absolute error at step is

This implies that the value of can be bounded by for small enough, which concludes the proof. ∎

Theorem 2.4 can be used to partition the universe into a small number of subsets such that the intersection of every subset of the partition with every set is small. We start with a simple technical lemma.

Lemma 2.5.

Consider a process that starts with , and keeps computing as long as . The process ends after at most steps.

Proof.

We claim that after at most steps of the process we have that . Assume otherwise, that is, after steps we still have . But then, for each , , and therefore

Substituting we obtain

where the last inequality holds for , which leads to a contradiction. For sufficiently large we have . Thus, by repeating the above reasoning times, we obtain that after steps the value of has decreased to at most . Then, using the fact that as long as , we conclude that after additional iterations the value of decreases to at most , and so the process terminates. ∎

Lemma 2.6.

Given a family of sets where and , one can construct deterministically in time a function such that for each and for each , the intersection of and contains elements.

Proof.

We can reformulate the statement of the lemma as follows. We must show that there is a partitioning of into subsets such that for every , the intersection has size at most .

We partition recursively using the procedure from Theorem 2.4. We start with a single set . Suppose that after several steps we have a partitioning of into sets such that for all and and some integer . We then apply Theorem 2.4 to the sets . Using the colouring output by the lemma, we partition each set into sets and , where the former contains all the elements of of colour and the latter all the elements of of colour . For we choose (and also the value of for ) so that its binary representation equals the binary representation of appended with . By Theorem 2.4, there is a constant such that . We continue this process until for all and .

It remains to bound the number of iterations. By setting in Lemma 2.5, we obtain that we need at most recursive applications of the partition procedure implemented with Theorem 2.4 to ensure that every set has at most elements in common with every . Therefore, the size of the image of is bounded by . The overall construction time is . ∎

2.2 Superimposed Codes

We are now ready to show an efficient construction algorithm for data-dependent superimposed codes (see Definition 2.1). At a high level, we will construct a family of functions which, combined with the partition  from Lemma 2.6, will give us the superimposed code.

Theorem 2.7.

Given a family of sets where and , one can construct an -superimposed code of weight and in time and space.

Proof.

By applying Lemma 2.6, we obtain in time a function which gives a partitioning of into subsets such that, for some constant , for every and it holds that .

Consider the ring of polynomials F₂[x]. We define a mapping that treats the binary representation of an element of the universe as the coefficient vector of a polynomial in F₂[x].

Let H be the family of functions of the form h_p(f) = f mod p, over all irreducible polynomials p of a suitably chosen degree q. By Gauss's formula [14, 23], there are Θ(2^q/q) irreducible polynomials of degree q over F₂, and this is the size of the family H. Consider two distinct polynomials f and g. Because F₂[x] is a unique factorization domain [23], only a bounded number of irreducible polynomials of degree q can hash both f and g to the same value, i.e. divide f − g. We choose q in such a way that the probability that f and g are hashed to the same value, when a hash function is chosen uniformly at random from H, is sufficiently small.
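A compact illustration of this style of hashing (our own sketch: polynomials over F₂ packed into machine words as bit masks, with irreducibility checked by naive trial division):

```python
def polymod(f, p):
    """Remainder of f modulo p, both polynomials over GF(2) encoded as
    integers (bit i = coefficient of x^i)."""
    dp = p.bit_length() - 1
    while f.bit_length() - 1 >= dp:
        f ^= p << (f.bit_length() - 1 - dp)   # subtraction = XOR over GF(2)
    return f

def is_irreducible(p):
    """Trial division by all polynomials of smaller positive degree."""
    return all(polymod(p, d) != 0 for d in range(2, 1 << (p.bit_length() - 1)))

q = 5
H = [p for p in range(1 << q, 1 << (q + 1)) if is_irreducible(p)]
print(len(H))                          # 6 = (2^5 - 2)/5 irreducible polynomials
u = 0b1101011                          # an element, read as a polynomial
print([polymod(u, p) for p in H[:3]])  # hash values h_p(u) = u mod p
```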

If , then and we can take and set . From now on, assume . Let be as in Lemma 2.6. Consider such that , where . We define as follows:

where the mapping treats a polynomial as a -bit number . Clearly, and where:

We claim that the obtained code is a -superimposed code. Consider any and . We need to count elements of that do not belong to any , for . Let and so . By construction, . Thus, by the union bound, the probability that for some is at most  for chosen uniformly at random from . Recall that consists of elements for . The number of irreducible polynomials such that for some is at most . Consequently, at least elements of  do not belong to any , for .

We now show that we can construct the above superimposed codes in time. To this end, we need to generate all irreducible polynomials of degree and to explain how we compute remainders modulo these polynomials. Note first that, as we only operate on polynomials of degree , they fit in a machine word, and hence we can subtract two polynomials or multiply a polynomial by any power of x in constant time. We can now use this to generate the irreducible polynomials and compute the sets at the same time. We maintain a bit vector that, for each polynomial of degree , stores an indicator bit equal to iff , i.e. iff its remainder modulo any polynomial of smaller degree is not zero. We consider the polynomials of degree in order. For every irreducible polynomial , we compute a table for all polynomials of degree in overall time using dynamic programming with the following recursive formula:

We use the table to compute for all . Also, if for a polynomial the remainder is zero, we zero out the corresponding bit in . Here we use the fact that to guarantee that we will find all irreducible polynomials of degree in this way.

As there are irreducible polynomials, in total we spend time. At any moment, we use space to store the table and space to store the codes. ∎

3 Upper Bounds for Generalised Pattern Matching

In this section, we present new algorithms for the parameters D, S, and I. Our algorithms for the parameters D and S share similar ideas, so we present them together in Section 3.1. The algorithm for the parameter I is presented in Section 3.2.

We start by recalling the formal statement of the Pattern Matching with Don’t Cares problem that will be used throughout this section.

Pattern Matching with Don't Cares (counting, binary alphabet)
Input: A text T ∈ {0, 1, ?}^n and a pattern P ∈ {0, 1, ?}^m, where "?" is a don't care character that matches any character of the alphabet.
Output: For each position i, the number of positions j such that T[i+j−1] does not match P[j].

Clifford and Clifford [15] showed that this problem can be solved in O(n log m) time.
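The core of such algorithms is counting mismatches by cross-correlation, computed with the FFT. A minimal sketch for the binary-with-wildcards case follows (ours; for clarity it uses one FFT over the whole text, whereas the O(n log m) bound additionally requires splitting the text into overlapping blocks of length O(m)):

```python
import numpy as np

def count_mismatches(text, pattern):
    """Mismatch counts for every alignment of `pattern` (over {'0','1','?'})
    in `text`.  A mismatch occurs iff '1' faces '0' or '0' faces '1';
    the don't care '?' never mismatches.  Two FFT cross-correlations."""
    n, m = len(text), len(pattern)
    t1 = np.array([c == '1' for c in text], dtype=float)
    t0 = np.array([c == '0' for c in text], dtype=float)
    p1 = np.array([c == '1' for c in pattern], dtype=float)
    p0 = np.array([c == '0' for c in pattern], dtype=float)
    size = 1 << (n + m).bit_length()
    def corr(a, b):  # cross-correlation: result[i + m - 1] = sum_j a[i+j] * b[j]
        return np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b[::-1], size), size)
    res = corr(t1, p0) + corr(t0, p1)
    return np.rint(res[m - 1 : n]).astype(int)

print(count_mismatches("01?10", "1?0"))  # [1 1 0]
```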

3.1 Parameters D and S

We first show Monte Carlo algorithms for the reporting and counting variants of GPM, and then derandomise them using the data-dependent superimposed codes of Section 2.

3.1.1 Randomised Algorithms

We start by presenting a new reporting algorithm for the parameter D. It does not improve over the algorithm of [34], but encapsulates a novel idea that will be used by all our algorithms for the parameters D and S. Essentially, we use hashing to reduce