Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication

05/07/2019 · by Giulia Bernardini, et al. · University of Pisa · Akademia Sztuk Pięknych we Wrocławiu · Centrum Wiskunde & Informatica

An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an O(nm^1.5√(log m) + N)-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction we show that a combinatorial algorithm solving the EDSM problem in O(nm^1.5-ϵ + N) time, for any ϵ>0, refutes this conjecture. Of course, the notion of combinatorial algorithms is not clearly defined, so our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. Two standard tools used in algorithms on strings are string periodicity and fast Fourier transform. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a non-combinatorial O(nm^1.381 + N)-time algorithm for EDSM. To the best of our knowledge, we are the first to do so.




1 Introduction

Boolean matrix multiplication (BMM) is one of the most fundamental computational problems. Apart from its theoretical interest, it has a wide range of applications [39, 54, 31, 29, 48]. BMM is also the core combinatorial part of integer matrix multiplication. In both problems, we are given two n × n matrices and we are to compute n^2 values. Integer matrix multiplication can be performed in truly subcubic time, i.e., in O(n^(3-ε)) operations over the field, for some ε > 0. The fastest known algorithms for this problem run in O(n^2.373) time [32, 56]. These algorithms are known as algebraic: they rely on the underlying ring structure.

There also exists a different family of algorithms for the BMM problem known as combinatorial. Their focus is on unveiling the combinatorial structure in the Boolean matrices to reduce redundant computations. A series of results [7, 9, 15] culminating in an Ô(n^3 / log^4 n)-time algorithm [60] (the Ô(·) notation suppresses poly(log log n) factors) has led to the popular combinatorial BMM conjecture stating that there is no combinatorial algorithm for BMM working in time O(n^(3-ε)), for any ε > 0 [2]. There has been ample work on applying this conjecture to obtain BMM hardness results: see, e.g., [46, 2, 51, 35, 45, 44, 17].

String matching is another fundamental problem. The problem is to find all fragments of a text of length n that match a pattern of length m. This problem has several linear-time solutions [23]. In many real-world applications, it is often the case that letters at some positions are either unknown or uncertain. A way of representing these positions is with a subset of the alphabet Σ. Such a representation is called a degenerate string. The first efficient algorithm for a degenerate text and a standard pattern was published by Fischer and Paterson in 1974 [30]. It has undergone several improvements since then [38, 41, 20, 19]. The first efficient algorithm for a degenerate pattern and a standard text was published by Abrahamson in 1987 [3], followed by several practically efficient algorithms [59, 49, 36].

Degenerate letters are used in the IUPAC notation [40] to represent a position in a DNA sequence that can have multiple possible alternatives. These are used to encode the consensus of a population of sequences [21, 4] in a multiple sequence alignment (MSA). In the presence of insertions or deletions in the MSA, we may need to consider alternative representations. Consider the following MSA of three closely-related sequences:

  1. GCAACGGGTA--TT

  2. GCAACGGGTATATT

  3. GCACCTGG----TT

These sequences can be compacted into a single sequence T̃ of sets of strings containing some deterministic and some non-deterministic segments, for instance T̃ = GCA · {A, C} · C · {G, T} · GG · {TA, TATA, ε} · TT. A non-deterministic segment is a finite set of deterministic strings and may contain the empty string ε corresponding to a deletion. The total number of segments is the length of T̃ and the total number of letters is the size of T̃. We denote the length by n and the size by N.
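As a concrete illustration, such a compaction can be built and queried programmatically. The sketch below (the segmentation chosen here is one natural compaction of the MSA above, not a canonical one) represents an ED string as a list of string sets, computes its length n and size N, and expands the set of plain strings it represents:

```python
from itertools import product

# An ED string as a list of segments; each segment is a set of strings.
# "" encodes the empty string (a deletion in the MSA).
ed_text = [
    {"GCA"}, {"A", "C"}, {"C"}, {"G", "T"}, {"GG"},
    {"TA", "TATA", ""}, {"TT"},
]

n = len(ed_text)                                  # length: number of segments
N = sum(len(s) for seg in ed_text for s in seg)   # size: total number of letters

def expand(ed):
    """Enumerate the set of plain strings an ED string represents."""
    return {"".join(choice) for choice in product(*ed)}

# The three aligned sequences (gaps removed) are all members of the language.
language = expand(ed_text)
assert {"GCAACGGGTATT", "GCAACGGGTATATT", "GCACCTGGTT"} <= language
```

Note that the expansion is exponential in general; the point of the EDSM problem is precisely to avoid materializing it.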

This representation has been defined in [37] by Iliopoulos et al. as an elastic-degenerate (ED) string. Being a sequence of subsets of Σ*, it can be seen as a generalization of a degenerate string. The natural problem that arises is finding all matches of a deterministic pattern P in an ED text T̃. This is the elastic-degenerate string matching (EDSM) problem. Since its introduction in 2017 [37], it has attracted some attention in the combinatorial pattern matching community, and a series of results have been published. The simple algorithm by Iliopoulos et al. [37] for EDSM was first improved by Grossi et al. in the same year, who showed that, for a pattern of length m, the EDSM problem can be solved on-line in O(nm^2 + N) time [34]; on-line means that the text is read segment-by-segment and an occurrence is detected as soon as possible. This result was improved by Aoyama et al. [6], who presented an O(nm^1.5√(log m) + N)-time algorithm. An important feature of these bounds is their linear dependency on N. A different branch of on-line algorithms waiving the linear-dependency restriction exists [34, 50, 18]. Moreover, the EDSM problem has been considered under Hamming and edit distance [12].

A question with a somewhat similar flavor is the word break problem. We are given a dictionary D of total length m and a string S of length n, and the question is whether we can split S into fragments that appear in D (the same element of D can be used multiple times). Backurs and Indyk [8] designed an Õ(nm^(1/2-1/18))-time algorithm for this problem (the Õ notation suppresses polylogarithmic factors). Bringmann et al. [14] improved this to Õ(nm^(1/3)) and showed that this is optimal for combinatorial algorithms by a reduction from k-Clique. Their algorithm uses fast Fourier transform (FFT), and so it is not clear whether it should be considered combinatorial. While this problem seems similar to EDSM, there does not seem to be a direct reduction, and so their lower bound does not immediately apply.
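The word break semantics can be pinned down with the textbook quadratic dynamic program, far slower than the bounds just cited but unambiguous (the function name and structure below are ours):

```python
def word_break(s, dictionary):
    """Can s be split into (reusable) dictionary words?
    Classic O(|s|^2)-style DP: reachable[i] means s[:i] is a
    concatenation of dictionary words."""
    words = set(dictionary)
    longest = max(map(len, words), default=0)
    n = len(s)
    reachable = [False] * (n + 1)
    reachable[0] = True          # the empty prefix is trivially decomposable
    for j in range(1, n + 1):
        for i in range(max(0, j - longest), j):
            if reachable[i] and s[i:j] in words:
                reachable[j] = True
                break
    return reachable[n]
```

For example, word_break("abcabc", ["abc"]) is True, since a dictionary word may be reused.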

Our Results. It is known that BMM and triangle detection in graphs either both have truly subcubic combinatorial algorithms or none of them do [58]. Recall also that the currently fastest algorithm with linear dependency on N for the EDSM problem runs in O(nm^1.5√(log m) + N) time [6]. In this paper we prove the following two theorems.

Theorem 1.

If the EDSM problem can be solved in O(nm^(1.5-ε) + N) time, for any ε > 0, with a combinatorial algorithm, then there exists a truly subcubic combinatorial algorithm for triangle detection.

Arguably, the notion of combinatorial algorithms is not clearly defined, and Theorem 1 should be understood as an indication that in order to achieve a better complexity one should use fast matrix multiplication. Indeed, there are examples where a lower bound conditioned on BMM was helpful in constructing efficient algorithms using fast matrix multiplication [1, 16, 13, 47, 25, 57, 61]. We successfully design such a non-combinatorial algorithm by combining three ingredients: a string periodicity argument, FFT, and fast matrix multiplication. While periodicity is a standard tool in combinatorial pattern matching [42, 24, 43] and using FFT is also not unusual (for example, it often shows up in approximate string matching [3, 5, 19, 33]), to the best of our knowledge, we are the first to combine these tools with fast matrix multiplication. Specifically, we show the following result for the EDSM problem.

Theorem 2.

The EDSM problem can be solved on-line in O(nm^1.381 + N) expected time.

An important building block in our solution that might find applications in other problems is a method of selecting a small set of equal-length substrings of the pattern, called anchors, so that any relevant occurrence of a string from an ED text set contains at least one but not too many such anchors inside. This is obtained by rephrasing the question in a graph-theoretical language and then generalizing the well-known fact that an instance of the hitting set problem with m sets over [n], each of size at least k, has a solution of size O((n/k) log m). While the idea of carefully selecting some substrings of the same length is not new (for example, Kociumaka et al. [43] used it to design a data structure for pattern matching queries on a string), our setting is different and hence so is the method of selecting these substrings.

Roadmap. Section 2 provides the necessary definitions and notation as well as the algorithmic toolbox used throughout the paper. In Section 3 we prove our hardness result for the EDSM problem (Theorem 1). In Section 4 we present our algorithm for the same problem (Theorem 2); this is the most technically involved part of the paper.

2 Preliminaries

Let x = x[1] x[2] ⋯ x[n] be a string of length |x| = n over a finite ordered alphabet Σ of size |Σ| = σ. For two positions i and j on x, we denote by x[i..j] the substring of x that starts at position i and ends at position j (it is of length j - i + 1 if j ≥ i, and empty otherwise). By ε we denote the empty string of length 0. A prefix of x is a substring of the form x[1..j], and a suffix of x is a substring of the form x[i..n]. x^R denotes the reverse of x, that is, x[n] x[n-1] ⋯ x[1]. We say that a string y is a power of a string x if there exists an integer k > 1, such that y is expressed as k consecutive concatenations of x, denoted by y = x^k. A period of a string x is any integer p such that x[i] = x[i + p] for every i = 1, …, n - p, and the period of x, denoted by per(x), is the smallest such p. We call a string x strongly periodic if per(x) ≤ |x|/4.
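The period is computable in linear time because per(x) equals |x| minus the length of the longest proper border of x, which is exactly what the KMP prefix function yields. A minimal sketch (the one-quarter threshold in strongly_periodic follows the convention assumed above):

```python
def period(x):
    """Smallest period of x, via the KMP prefix function:
    per(x) = |x| - (length of the longest proper border of x)."""
    n = len(x)
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and x[i] != x[k]:
            k = pi[k - 1]
        if x[i] == x[k]:
            k += 1
        pi[i] = k
    return n - pi[-1] if n else 0

def strongly_periodic(x):
    # period at most a quarter of the length
    return len(x) > 0 and 4 * period(x) <= len(x)
```

For instance, period("abcab") is 3 and period("aaaa") is 1.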

Lemma 1 ([28]).

If p and q are both periods of the same string x, and additionally p + q ≤ |x| + gcd(p, q), then gcd(p, q) is also a period of x.

A trie is a rooted tree in which every edge is labeled with a single letter, and every two edges outgoing from the same node have different labels. The label of a node v in such a tree T, denoted by L(v), is defined as the concatenation of the labels of all the edges on the path from the root of T to v. Thus, the label of the root of T is ε, and a trie is a representation of a set of strings consisting of the labels of all its leaves. By replacing each path consisting of nodes with exactly one child by an edge labeled by the concatenation of the labels of the edges of the path, we obtain a compact trie. The nodes of the trie that are removed after this transformation are called implicit, while the remaining ones are referred to as explicit. The suffix tree of a string x is the compact trie representing all suffixes of x, where instead of explicitly storing the label x[i..j] of an edge we represent it by the pair (i, j).

A heavy path decomposition of a tree T is obtained by selecting, for every non-leaf node v, its child u such that the subtree rooted at u is the largest. This decomposes the nodes of T into node-disjoint paths, with each such path p (called a heavy path) starting at some node, called the head of p, and ending at a leaf. An important property of such a decomposition is that the number of distinct heavy paths above any leaf (that is, intersecting the path from that leaf to the root) is only logarithmic in the size of T [53].
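The decomposition can be sketched in a few lines (heavy_path_heads is an illustrative helper, not from the paper; the tree is given as an explicit children map with an entry for every node). The logarithmic bound holds because every light edge on a leaf-to-root path at least halves the subtree size:

```python
def heavy_path_heads(children, root):
    """Heavy-path decomposition of a rooted tree given as {node: [children]}.
    Returns (heavy, heads): each non-leaf's child with the largest subtree,
    and the set of heads (the root plus every node entered by a light edge)."""
    # subtree sizes via iterative post-order
    size, order, stack = {}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children[u])
    heavy = {u: max(children[u], key=lambda c: size[c])
             for u in children if children[u]}
    heads = {root} | {c for u in children for c in children[u]
                      if heavy.get(u) != c}
    return heavy, heads
```

Each head starts one heavy path; walking from any leaf to the root crosses at most O(log |T|) of them.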

Let Σ̃ denote the set of all finite non-empty subsets of Σ*. Previous works (cf. [37, 34, 12, 6, 50]) define Σ̃ as the set of all finite non-empty subsets of Σ* excluding {ε}, but we waive here the latter restriction as it has no algorithmic implications. An elastic-degenerate string, or ED string, over alphabet Σ is a string over Σ̃, i.e., an ED string is an element of Σ̃*.

Let T̃ = T̃[1] T̃[2] ⋯ T̃[n] denote an ED string of length n. We assume that, for any 1 ≤ i ≤ n, the set T̃[i] ⊆ Σ* is implemented as an array and can be accessed by an index, i.e., T̃[i] = {T̃[i][k] : k = 1, …, |T̃[i]|}. For any i, ||T̃[i]|| denotes the total length of all strings in T̃[i], and for any ED string T̃, ||T̃|| denotes the total length of all strings in all T̃[i]s, or the size of T̃, i.e., ||T̃|| = Σ_{i=1..n} ||T̃[i]|| = N. An ED string T̃ can be thought of as a representation of the set of strings L(T̃) = T̃[1] × T̃[2] × ⋯ × T̃[n], where A × B = {xy : x ∈ A, y ∈ B} for any sets of strings A and B. For any ED string T̃ and a pattern P, we say that P matches T̃[i..j] = T̃[i] T̃[i+1] ⋯ T̃[j] if

  1. i = j and P is a substring of some string in T̃[i], or,

  2. i < j and P = P_i P_{i+1} ⋯ P_j, where P_i is a suffix of some string in T̃[i], P_j is a prefix of some string in T̃[j], and P_k ∈ T̃[k], for all i < k < j.

We say that an occurrence of a string P ends at position j of an ED string T̃ if there exists i ≤ j such that P matches T̃[i..j]. We will refer to string P as the pattern and to ED string T̃ as the text. We define the main problem considered in this paper.

Elastic-Degenerate String Matching (EDSM) INPUT: A string P of length m and an ED string T̃ of length n and size N. OUTPUT: All positions in T̃ where at least one occurrence of P ends.

Example 1.

Pattern P = AC ends at positions 2 and 3 of the text T̃ = {A} · {C, T} · {ACA}: at position 2 because A is a suffix of A ∈ T̃[1] and C is a prefix of C ∈ T̃[2], and at position 3 because AC is a substring of ACA ∈ T̃[3].
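A direct reference implementation of this definition, checking conditions 1 and 2 with a set of active pattern-prefix lengths, may clarify the semantics (this is a naive quadratic-per-segment check, not the algorithm of this paper):

```python
def edsm_ends(pattern, ed_text):
    """Positions j (1-indexed) of the ED text where an occurrence of the
    pattern ends, per conditions 1 and 2 of the matching definition."""
    m = len(pattern)
    ends = []
    active = set()   # lengths 0 < i < m of pattern prefixes alive after the previous segment
    for j, segment in enumerate(ed_text, start=1):
        new_active = set()
        hit = False
        for s in segment:
            if pattern in s:                        # condition 1: whole pattern inside one string
                hit = True
            for i in active:                        # condition 2: extend an alive prefix
                if pattern[i:] == s[: m - i]:       # pattern[i:] is a prefix of s: occurrence
                    hit = True
                elif s == pattern[i : i + len(s)]:  # s consumed entirely: prefix grows
                    new_active.add(i + len(s))
            for i in range(1, m):                   # start an occurrence inside this segment
                if s.endswith(pattern[:i]):
                    new_active.add(i)
        active = new_active
        if hit:
            ends.append(j)
    return ends
```

An empty string in a segment correctly passes every alive prefix through unchanged, which is how deletions participate in occurrences.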

Aoyama et al. [6] obtained an on-line O(nm^1.5√(log m) + N)-time algorithm by designing an efficient solution for the following problem.

Active Prefixes (AP) INPUT: A string P of length m, a bit vector U of size m, a set S of strings of total length N. OUTPUT: A bit vector V of size m with V[j] = 1 if and only if there exist i and a string w ∈ S, such that U[i] = 1 and P[i+1..j] = w.

In more detail, given an ED text T̃[1] ⋯ T̃[n] one should consider an instance of the AP problem per segment. Hence, an O(m^1.5√(log m) + N_i)-time solution for AP (with N_i being the size of the i-th segment of the ED text) implies an O(nm^1.5√(log m) + N)-time solution for EDSM, as Σ_i N_i = N. We provide an example of the AP problem.

Example 2.

Let P = ababab of length m = 6, U = 010000, and S = {ab, abab}. We have that V = 000101: from U[2] = 1 we can extend with ab = P[3..4] and with abab = P[3..6].
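A naive O(mN)-time solution of AP makes the interface concrete (indexing follows the 1-based convention of the problem statement, stored in 0-based Python lists):

```python
def active_prefixes(pattern, U, strings):
    """Naive AP: V[j] = 1 iff U[i] = 1 for some i and some w in strings
    satisfies P[i+1..j] = w (1-based positions, U[i] stored at U[i-1])."""
    m = len(pattern)
    V = [0] * m
    for i in range(1, m + 1):          # 1-based length of an active prefix
        if not U[i - 1]:
            continue
        for w in strings:
            j = i + len(w)             # candidate 1-based end position
            if j <= m and pattern[i:j] == w:
                V[j - 1] = 1
    return V
```

The fast algorithm of [6] computes the same vector, but replaces the inner scan with a combination of suffix-tree and convolution machinery.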

For our hardness results we rely on BMM and the following closely related problem.

Boolean Matrix Multiplication (BMM) INPUT: Two n × n Boolean matrices A and B. OUTPUT: The n × n Boolean matrix C = A · B, where C[i, j] = ⋁_k (A[i, k] ∧ B[k, j]).

Triangle Detection (TD) INPUT: Three n × n Boolean matrices A, B and C. OUTPUT: Are there i, j, k such that A[i, j] = B[j, k] = C[k, i] = 1?

An algorithm for these problems is called truly subcubic if it runs in O(n^(3-ε)) time, for some ε > 0. TD and BMM either both have truly subcubic combinatorial algorithms, or none of them do [58].
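Both problems are small enough to state as executable definitions; the naive cubic implementations below fix the exact semantics used in our reductions:

```python
def bmm(A, B):
    """Naive O(n^3) Boolean matrix product."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def has_triangle(A, B, C):
    """Naive O(n^3) triangle detection: is there (i, j, k) with
    A[i][j] = B[j][k] = C[k][i] = 1?"""
    n = len(A)
    return any(A[i][j] and B[j][k] and C[k][i]
               for i in range(n) for j in range(n) for k in range(n))
```

In graph terms, TD with A = B = C the adjacency matrix asks whether the graph contains a triangle.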

3 EDSM Conditional Lower Bound

As a warm-up, we first show a conditional lower bound for the AP problem that already highlights the high-level idea used in the proof of Theorem 1.

Theorem 3.

If the AP problem can be solved in O(m^(1.5-ε) + N) time, for any ε > 0, with a combinatorial algorithm, then there exists a truly subcubic combinatorial algorithm for Boolean matrix multiplication.

Proof.

Recall that in an instance of BMM the matrices are denoted by A and B. To compute C = A · B, we need to find, for every pair (i, j), an index k such that A[i, k] = 1 and B[k, j] = 1. To this purpose, we split matrix B into blocks of size s × s, for a parameter s to be chosen later. This corresponds to considering the values of k and j in intervals of size s, and clearly there are n/s such intervals for each of them. Matrix B is thus split into (n/s)^2 blocks, giving rise to an equal number of instances of the AP problem, each one corresponding to an interval of k and an interval of j. We will now describe the instance corresponding to the (x, y)-th block, 1 ≤ x, y ≤ n/s.

We build the string of the AP problem, for any block, as a concatenation of n gadgets corresponding to the rows of A, and the bit vector of the AP problem as a concatenation of n bit vectors, one per gadget. Each gadget is simply the string a^s b a^s (s being the interval length), and, if A[i, k] = 1 for a k in the x-th interval, we set 1 in the bit vector of the i-th gadget at the position corresponding to the k-th a in the first half of the gadget. After solving the AP problem, we will look for 1 in the output bit vector at the position corresponding to the j-th a in the second half of the gadget; it should be there if and only if A[i, k] = B[k, j] = 1 for some k in the x-th interval. In order to enforce this, we need to include in the set of strings, for every entry B[k, j] = 1 of the block (with k and j counted within their intervals), the string consisting of s - k letters a, then b, then j letters a.

This guarantees that after solving the AP problem we have the required property, and thus after solving all the instances we have obtained matrix C. Indeed, consider the values of j, i.e., the index that runs over the columns of C, in intervals of size s. By construction and by the definition of BMM, the i-th row of the y-th column interval of C is obtained by taking the disjunction, over all blocks x = 1, …, n/s, of the part of the output bit vector corresponding to the second half of the i-th gadget.

We have a total of (n/s)^2 instances (s being the interval length). In each of them, the total length of all strings is O(s^3), and the length of the input string is O(ns). Using our assumed O(m^(1.5-ε) + N)-time algorithm for each instance, we obtain the following total time:

O((n/s)^2 ((ns)^(1.5-ε) + s^3)) = O(n^(3.5-ε) s^(-0.5-ε) + n^2 s).

If we set s = n^(1-ε/2), then the total time becomes:

O(n^(3-7ε/4+ε²/2) + n^(3-ε/2)) = O(n^(3-ε/2)).

Hence we obtain a combinatorial BMM algorithm with complexity O(n^(3-ε')), where ε' = ε/2. ∎

Example 3.

Consider a small instance of the BMM problem with matrices A and B.


Given A and B, we compute C by solving instances of the AP problem constructed as follows. The pattern is a concatenation of gadgets, one per row of A,

where the six gadgets are separated for readability. For the AP instances, the vectors shown below are the input bit vectors, the sets are the input sets of strings, and finally the vectors are the output bit vectors.

As an example of how to obtain C, the first row of a block of C is obtained by taking the disjunction of the bold parts of the corresponding output bit vectors.

Now we move to showing the promised conditional lower bound for the EDSM problem. Specifically, we show that TD can be reduced to the decision version of the EDSM problem, in which the goal is to detect whether there exists at least one occurrence of P in T̃.

Theorem 1 (restated).

Proof.

Consider an instance of TD, where we are given three n × n Boolean matrices A, B and C, and the question is to check if there exist i, j, k such that A[i, j] = B[j, k] = C[k, i] = 1. Let s be a parameter, to be determined later, that corresponds to decomposing the range of each index into blocks of size s. We reduce to an instance of EDSM over an alphabet whose size depends on n and s.

Pattern P. We construct P by concatenating, in some fixed order, the following strings:

for every pair of block indices, where the letters used in the different gadgets are drawn from disjoint subsets of Σ.

ED text T̃. The text consists of three parts. Its middle part encodes all the entries equal to 1 in matrices A, B, and C, and consists of three sets of strings, one per matrix, where:

  1. the first set contains one string for every pair (i, j) such that A[i, j] = 1;

  2. the second set contains one string for every pair (j, k) such that B[j, k] = 1, i.e., one string per entry of B equal to 1;

  3. the third set contains one string for every pair (k, i) such that C[k, i] = 1.

It is easy to see that these sets can be constructed in time linear in their total size. This implies the following:

  1. The length of the pattern is polynomial in n and s;

  2. The size of T̃ is polynomial in n and s.

By the above construction, we obtain the following fact.

Fact 1.

The pattern P matches the middle part of T̃ if and only if the following holds for some i, j, k:

Solving the TD problem thus reduces to taking the disjunction of all such conditions. Let us write down all such strings in some arbitrary but fixed order. We aim to construct a small number of sets of strings that, when considered as an ED text, match any prefix of the pattern; a similar construction can be carried out to obtain sets of strings that match any suffix. These sets will then be added to the left and to the right of the middle part, respectively, to obtain the ED text T̃.

ED Prefix. We construct sets of strings as follows: the first one contains the empty string together with the shortest relevant prefixes of P, and each subsequent set extends the previous ones. Formally, for every i, the i-th of such sets is:

ED Suffix. We similarly construct sets of strings matching the possible suffixes of P, to be appended on the right:

The total length of all the ED prefix and ED suffix strings is polynomial in n and s. The whole ED text T̃ is the concatenation of the ED prefix sets, the three middle sets, and the ED suffix sets.

Lemma 2.

The pattern P occurs in the ED text T̃ if and only if there exist i, j, k such that A[i, j] = B[j, k] = C[k, i] = 1.

Proof.

By Fact 1, if such i, j, k exist then P matches a fragment of the middle part of T̃. Then, by construction of the ED prefix sets, the remaining prefix of P matches the ED prefix (this can be proved by induction), and similarly the remaining suffix of P matches the ED suffix, so the whole P matches, and so P occurs in T̃. Conversely, the separator letters $ appear only in the middle of the strings of the middle sets, all gadgets have the same length, and the separators are distinct, so any occurrence of the pattern in T̃ must align the gadgets as intended. But then, by Fact 1, there exists a triangle. ∎

Note that for the EDSM instance constructed above, the length, the pattern length and the size are all polynomial in n and s. Thus if we had a solution running in O(nm^(1.5-ε) + N) time, for some ε > 0, by setting s appropriately we would obtain an O(n^(3-ε'))-time algorithm for TD, for some ε' > 0. ∎

4 An O(nm^1.381 + N)-time Algorithm for EDSM

Our goal is to design a non-combinatorial O(nm^1.381 + N)-time algorithm for EDSM. It suffices to solve an instance of the AP problem in O(m^1.381 + N) time. We further reduce the AP problem to a logarithmic number of restricted instances of the problem, in which the length of every string is in [ℓ, 2ℓ), for some ℓ. If we solve every such restricted instance in O(m^1.381 + N) time, then we can solve the original instance in the same time, up to a logarithmic factor absorbed by the exponent, by taking the disjunction of results. We partition the strings in S into three types, compute the corresponding bit vector for each type separately and in different ways, and, finally, take the disjunction to obtain the answer for the restricted instance.

Partitioning S. Let ℓ be such that the length of every string in S belongs to [ℓ, 2ℓ) (to avoid clutter we assume that ℓ is an integer divisible by 4, but this can be avoided by appropriately adjusting the constants). The three types of strings are as follows:

Type 1:

Strings x such that no length-(ℓ/4) substring of x is strongly periodic.

Type 2:

Strings containing at least one length-(ℓ/4) substring that is not strongly periodic and at least one length-(ℓ/4) substring that is strongly periodic.

Type 3:

Strings x such that every length-(ℓ/4) substring of x is strongly periodic (in Lemma 3 we show that in this case per(x) ≤ ℓ/16).

These three types evidently form a partition of S and, before we proceed with the algorithm, we need to show that we can determine the type of a string x in O(|x|) time. We start with showing that, in fact, strings of type 3 are exactly the strings with period at most ℓ/16.

Lemma 3.

Let x be a string of length at least ℓ/4. If per(x[i..i+ℓ/4-1]) ≤ ℓ/16 for every i = 1, …, |x| - ℓ/4 + 1, then per(x) ≤ ℓ/16.

Proof.

We first show that, for any string w and letters a, b, if per(aw) ≤ (|w|+1)/4 and per(wb) ≤ (|w|+1)/4 then per(aw) = per(wb). This follows from Lemma 1: let p = per(aw) and q = per(wb). Since p and q are both periods of w and p + q ≤ (|w|+1)/2 ≤ |w| + gcd(p, q), we obtain that g = gcd(p, q) is a period of w. If p ≠ q then either g < p or g < q; by symmetry it is enough to consider the former possibility. We claim that then g is a period of aw. Indeed, a = w[p] (observe that p ≤ |w|) and w[i] = w[i+g] for any i, so by p being a multiple of g we obtain that a = w[g], which is a contradiction because g cannot be a period of aw by the definition of p = per(aw).

If per(x[i..i+ℓ/4-1]) ≤ ℓ/16 for every i then, by the above reasoning applied to every two consecutive length-(ℓ/4) substrings, their periods are in fact all equal to the same p ≤ ℓ/16. But then x[i] = x[i+p] for every i = 1, …, |x| - p, so per(x) ≤ ℓ/16. ∎

Lemma 4.

Given a string x ∈ S we can determine its type in O(|x|) time.

Proof.

It is well-known that per(y) can be computed in O(|y|) time for any string y [23]. We partition x into consecutive blocks of length ℓ/8 and compute the period of every block in O(|x|) total time. Observe that every length-(ℓ/4) substring of x contains at least one whole block inside, and that any period q ≤ ℓ/8 of a substring containing a block is also a period of that block. Hence, if per(b) > ℓ/16 for a block b, then the period of any length-(ℓ/4) substring that contains b is larger than ℓ/16. Consequently, if per(b) > ℓ/16 for every block b, we declare x to be of type 1.

Consider now a block b such that p = per(b) ≤ ℓ/16. If the period q of a length-(ℓ/4) substring y that contains b is at most ℓ/16, then in fact q = p: q is also a period of b and so, by Lemma 1 applied on b, q must be a multiple of p, and, by repeatedly applying the period q of y and the period p of b and using the fact that b occurs inside y, we conclude that in fact y[i] = y[i+p] for any i, and thus q = p. This allows us to check if there exists a length-(ℓ/4) substring y with per(y) ≤ ℓ/16 that contains b by computing how far the period p extends to the left and to the right of b in x (if the period does not extend in some direction we simply stop there). Then, there exists such a substring y if and only if the length of the extended substring with period p is at least ℓ/4.

For every block b with per(b) ≤ ℓ/16 we can thus check, in time proportional to the length of the extension, whether there exists a length-(ℓ/4) substring containing b with period at most ℓ/16. By repeating this procedure for every block, we can distinguish between x of type 2 and x of type 3 in O(|x|) total time. ∎
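The partition can also be checked by brute force, which is quadratic rather than linear but makes the three cases explicit (the window length ℓ/4 and the one-quarter periodicity threshold below follow the conventions assumed in this section):

```python
def per(x):
    """Smallest period, by brute force (linear time is possible via KMP)."""
    return next(p for p in range(1, len(x) + 1)
                if all(x[i] == x[i + p] for i in range(len(x) - p)))

def classify(x, ell):
    """Brute-force version of the type test of Lemma 4.
    Windows have length ell // 4; a window is strongly periodic when
    its period is at most a quarter of its length."""
    k = ell // 4
    windows = [x[i:i + k] for i in range(len(x) - k + 1)]
    strongly = [4 * per(w) <= k for w in windows]
    if not any(strongly):
        return 1      # no window is strongly periodic
    if all(strongly):
        return 3      # every window is strongly periodic, so per(x) is small
    return 2
```

With ell = 16, a run of a's is type 3, a string like abcdabcd... is type 1, and a string mixing both kinds of windows is type 2.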

4.1 Type 1 Strings

In this section we show how to solve a restricted instance of the AP problem where every string is of type 1, that is, none of its length-(ℓ/4) substrings is strongly periodic, and, furthermore, the length of every string belongs to [ℓ, 2ℓ). Observe that any two occurrences in x of the same length-(ℓ/4) substring must be at distance larger than ℓ/16 from each other, as otherwise we would have two occurrences of a length-(ℓ/4) substring at distance at most ℓ/16, making the period of that substring at most ℓ/16 and contradicting the assumption that x is of type 1.

We start with constructing the suffix tree of P (our pattern in the EDSM problem) in O(m log m) time [55] (note that we are spending O(m log m) time and not O(m) so as to avoid any assumptions on the alphabet). For every explicit node u, we construct a perfect hash function mapping the first letter on every edge outgoing from u to the corresponding edge. This takes O(m) expected time [52] and allows us to navigate in the suffix tree in constant time per letter. Then, for every x ∈ S we check if it occurs in P using the suffix tree in O(|x|) time, and if not we disregard it from further consideration. We want to further partition S into groups that are processed separately. For every group, we want to select a set of length-(ℓ/4) substrings of P, called the anchors, each represented by one of its occurrences in P, such that:

  1. The total number of occurrences of all anchors in P is O((m/ℓ) log m).

  2. For every x in the group, at least one of its length-(ℓ/4) substrings is an anchor.

  3. For every x in the group, at most O(log m) of its length-(ℓ/4) substrings are anchors.

We formalize this using the following auxiliary problem, which is a strengthening of the hitting set problem: for any collection of m sets over [n], each of size at least k, we can choose a subset of [n] of size O((n/k) log m) that nontrivially intersects every set.
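The quoted hitting set fact is realized by the standard greedy algorithm, which repeatedly picks the element contained in the most not-yet-hit sets; its solution is within an O(log m) factor of optimal, giving O((n/k) log m) elements when every set has size at least k. A minimal sketch:

```python
def greedy_hitting_set(sets):
    """Greedy hitting set: repeatedly pick the element occurring in the
    largest number of not-yet-hit sets, until every set is hit."""
    remaining = [set(s) for s in sets]
    chosen = set()
    while remaining:
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = max(counts, key=counts.get)   # most frequent element
        chosen.add(best)
        remaining = [s for s in remaining if best not in s]
    return chosen
```

The NS problem below strengthens this guarantee by additionally bounding, for every set, how many of its elements end up selected.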

Node Selection (NS) INPUT: A bipartite graph G = (V1 ∪ V2, E) with deg(v) ≥ d for every v ∈ V1. OUTPUT: A set of O((|V2|/d) log |V1|) nodes of V2 such that every node in V1 has at least one selected neighbor, but only O(log |V1|) such selected neighbors.

To reduce finding anchors to an instance of the NS problem, we first build a bipartite graph in which the nodes on the left correspond to the strings x ∈ S, the nodes on the right correspond to the distinct length-(ℓ/4) substrings of P, and there is an edge connecting a node corresponding to a string x with a node corresponding to a length-(ℓ/4) string w when w occurs in x. Using suffix links, we can find the node of the suffix tree corresponding to every length-(ℓ/4) substring of x in O(|x|) total time, so the whole construction takes