Given a string and a pattern , the core problem of pattern matching is to report all locations where occurs in . Pattern matching problems can be divided into two categories: the algorithmic problem, where the text and the pattern are given at the same time, and the data structure problem, where one is allowed to preprocess the text (pattern) before a query pattern (text) is given. Many problems in both categories are well studied in the history of stringology, and optimal solutions to many variants have been found.
In recent decades, researchers have shown increasing interest in the compressed version of this problem, where the space used by the index is related to the size of some compressed representation of instead of the length of . Such measures include the size of the LZ77-parse of , the smallest grammar representing , and the number of runs in the BWT of ; see e.g. [10, 3, 9, 8, 17, 16, 13]. This problem is highly relevant because the amount of highly repetitive data is increasing rapidly, and compressing such data makes it possible to handle greater amounts of it. Sources of such data include DNA sequencing and version control repositories.
In this paper we consider what we call the compressed indexing problem, which is to preprocess a string of length into a compressed representation that supports fast pattern matching queries. That is, given a string of length , report all occurrences of substrings in that match .
Table 1 gives an overview of the results on this problem.
|Gagie et al. |
|Nishimoto et al. |
|Bille et al. |
|Bille et al. |
|Bille et al. |
|Theorem 2 (1)|
|Theorem 2 (2)|
1.1 Our Results
In this paper we improve previous solutions that are bounded by the size of the LZ77-parse. For constant-sized alphabets we obtain the following result:
Given a string of length from a constant-sized alphabet with an LZ77 parse of length , we can build a compressed-index supporting pattern matching queries in time using space.
In particular, we are the first to obtain optimal search time using only space. For general alphabets we obtain the following:
Given a string of length from an integer alphabet polynomially bounded by with an LZ77-parse of length , we can build a compressed-index supporting pattern matching queries in:
time using space.
expected time using space.
time using space.
1.2 Technical Overview
Our main new contribution is based on a new grammar construction. In [15], Mehlhorn et al. presented a way to maintain dynamic sequences subject to equality testing using a technique called signatures. They presented two signature construction techniques. One is randomized and leads to complexities that hold in expectation. The other is based on a deterministic coin-tossing technique of Cole and Vishkin [5] and leads to worst-case running times but incurs an iterated logarithmic overhead compared to the randomized solution. This technique also resembles the string labeling techniques found e.g. in . To the best of our knowledge, we are the first to consider grammar compression based on the randomized solution from [15]. Despite it being randomized, we show how to obtain worst-case query bounds for text indexing using this technique.
The main idea in this grammar construction is that similar substrings will be parsed almost identically. This property also holds true for the deterministic construction technique which has been used to solve dynamic string problems with and without compression, see e.g. [17, 1]. In [12], Jeż devises a different grammar construction algorithm with similar properties to solve the algorithmic pattern matching problem on grammar-compressed strings, which has later been used for both static and dynamic string problems; see [20, 11].
Our primary solution has an term in the query time which is problematic for short query patterns. To handle this, we show different solutions for handling short query patterns. These are based on the techniques from LZ77-based indexing combined with extra data structures to speed up the queries.
2 Preliminaries
We assume a standard unit-cost RAM model with word size and that the input is from an integer alphabet . We measure space complexity in terms of machine words unless explicitly stated otherwise. A string of length is a sequence of symbols drawn from an alphabet . The sequence is the substring of given by and strings can be concatenated, i.e. . The empty string is denoted and while if , if and if . The reverse of denoted is the string . A run in a string is a substring with identical letters, i.e. for . Let be a run in then it is a maximal run if it cannot be extended, i.e. and . If there are no runs in we say that is run-free and it follows that for . Denote by the set of integers .
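The notions of run, maximal run and run-free string can be illustrated with a minimal Python sketch. The function names `maximal_runs` and `is_run_free` are ours and purely illustrative, positions are 0-indexed, and we assume that a run consists of at least two identical adjacent symbols.

```python
from itertools import groupby


def maximal_runs(s):
    """Return (start, length, symbol) for every maximal run in s.

    Assumption: a run has length at least 2; positions are 0-indexed.
    """
    runs = []
    pos = 0
    for symbol, group in groupby(s):
        length = len(list(group))
        if length >= 2:
            runs.append((pos, length, symbol))
        pos += length
    return runs


def is_run_free(s):
    """A string is run-free when no two adjacent symbols are equal."""
    return all(a != b for a, b in zip(s, s[1:]))
```

For example, `maximal_runs("aabcccd")` reports the runs `aa` and `ccc`, while `"abcab"` is run-free.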
Let be a set of points in a 2-dimensional grid. The 2D-orthogonal range reporting problem is to compactly represent while supporting range reporting queries, that is, given a rectangle report all points in the set . We use the following:
Lemma 1 (Chan et al. [4])
For any set of points in and constant , we can solve 2D-orthogonal range reporting with expected preprocessing time using:
space and query time
space and query time
where is the number of occurrences inside the rectangle.
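To make the query semantics of 2D-orthogonal range reporting concrete, the following Python sketch stores the points sorted by first coordinate and filters on the second. It is only a reference implementation of what a query returns; it does not achieve the bounds of Lemma 1, and the class name `RangeReporter` is a hypothetical name chosen for illustration.

```python
from bisect import bisect_left, bisect_right


class RangeReporter:
    """Naive range reporting: binary search on x, linear filter on y."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]

    def report(self, x1, x2, y1, y2):
        """Return all points (x, y) with x1 <= x <= x2 and y1 <= y <= y2."""
        lo = bisect_left(self.xs, x1)
        hi = bisect_right(self.xs, x2)
        return [(x, y) for x, y in self.points[lo:hi] if y1 <= y <= y2]
```

The structures of Lemma 1 answer the same query, but with query time depending only on the number of reported points plus a small additive term.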
A Karp-Rabin fingerprinting function [14] is a randomized hash function for strings. Given a string of length and a fingerprinting function we can in time and space compute and store fingerprints such that the fingerprint of any substring of can be computed in constant time. Identical strings have identical fingerprints. The fingerprints of two strings and collide when and . A fingerprinting function is collision-free for a set of strings when there are no collisions between the fingerprints of any two strings in the set. We can find a collision-free fingerprinting function for a set of strings with total length in expected time .
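The constant-time substring fingerprints can be sketched as follows. This is a minimal illustration that assumes a polynomial hash modulo the Mersenne prime 2^61 - 1 with a pseudo-random base; the class name `Fingerprinter`, the choice of prime and the `seed` parameter are our assumptions and are not prescribed by the text. The sketch does not check for collisions.

```python
import random


class Fingerprinter:
    """Prefix fingerprints allowing O(1) fingerprints of substrings."""

    def __init__(self, text, prime=(1 << 61) - 1, seed=0):
        rng = random.Random(seed)
        self.p = prime
        self.base = rng.randrange(256, prime - 1)
        self.prefix = [0]  # prefix[i] = fingerprint of text[:i]
        self.power = [1]   # power[i] = base**i mod p
        for ch in text:
            self.prefix.append((self.prefix[-1] * self.base + ord(ch)) % self.p)
            self.power.append((self.power[-1] * self.base) % self.p)

    def substring(self, i, j):
        """Fingerprint of text[i:j] (0-indexed, j exclusive)."""
        return (self.prefix[j] - self.prefix[i] * self.power[j - i]) % self.p
```

Two equal substrings always receive the same value, for instance the two occurrences of `abra` in `abracadabra`; distinct substrings may collide with small probability.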
Let be a lexicographically sorted set of strings. The weak prefix search problem is to compactly represent while supporting weak prefix queries, that is, given a query string of length report the rank of the lexicographically smallest and largest strings in of which is a prefix. If no such strings exist, the answer can be arbitrary.
Lemma 2 (Belazzougui et al. [2], appendix H.3)
Given a set of strings with average length , from an alphabet of size , we can build a data structure using bits of space supporting weak prefix search for a pattern of length in time where is the word size.
We will refer to the data structure of Lemma 2 as a z-fast trie following the notation from . The term in the time complexity is due to a linear time preprocessing of the pattern and is not part of the actual search. Therefore it is simple to do weak prefix search for any length substring of in time after preprocessing once in time.
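The interface of a weak prefix search can be illustrated by a simple sketch over a sorted Python list. It is not a z-fast trie and does not meet the bounds of Lemma 2: it truncates every string to the pattern length on each query, and unlike a weak prefix search it returns `None` instead of an arbitrary answer when no string has the pattern as a prefix. The function name `prefix_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def prefix_range(sorted_strings, pattern):
    """Return (lo, hi), the ranks of the smallest and largest strings
    having pattern as a prefix, or None if there are none.

    Truncating every string to len(pattern) keeps the list sorted, so the
    strings with the given prefix form a contiguous range.
    """
    keys = [s[:len(pattern)] for s in sorted_strings]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return (lo, hi - 1) if lo < hi else None
```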
The LZ77-parse [21] of a string of length is a string of the form . We define , for . For to be a valid parse, we require , , , and for . This guarantees represents and is uniquely defined in terms of . The substring is called the phrase of the parse and is its source. A minimal LZ77-parse of can be found greedily in time and stored in space . We call the positions the borders of .
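A greedy parse can be sketched in a few lines of Python. We assume, as an illustration only, that each phrase is a triple (source, length, character): the phrase copies `length` symbols starting at an earlier position `source` (the copy may overlap the phrase itself) and then appends one explicit character. The quadratic-time search below is a naive stand-in for the efficient greedy algorithm, and the names `lz77_parse` and `lz77_decode` are ours.

```python
def lz77_parse(s):
    """Greedy parse of s into (source, length, char) triples (0-indexed)."""
    phrases = []
    n = len(s)
    u = 0  # start of the current phrase
    while u < n:
        best_src, best_len = 0, 0
        for src in range(u):
            length = 0
            # Leave one symbol for the explicit trailing character.
            while u + length < n - 1 and s[src + length] == s[u + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        phrases.append((best_src, best_len, s[u + best_len]))
        u += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the string from its phrases, copying symbol by symbol."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)
```

On `abababab` the sketch produces three phrases, the last of which copies five symbols from position 0 and overlaps its own source.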
3 Signature Grammars
We consider a hierarchical representation of strings given by Mehlhorn et al. [15] with some slight modifications. Let be a run-free string of length from an integer alphabet and let be a uniformly random permutation of . Define a position as a local minimum of if and and . In the block decomposition of , a block starts at position and at every local minimum in and ends just before the next block begins (the last block ends at position ). The block decomposition of a string can be used to construct the signature tree of denoted which is an ordered labeled tree with several useful properties.
Let be a run-free string of length from an alphabet and let be a uniformly random permutation of such that is the rank of the symbol in this permutation. Then the expected length between two local minima in the sequence is at most 3 and the longest gap is in expectation.
First we show that the expected length between two local minima is at most 3. Look at a position in the sequence . To determine if is a local minimum, we only need to consider the two neighbouring elements and thus let us consider the triple . We need to consider the following cases. First assume . There exist 6 permutations of a triple with unique elements and in two of these the minimum element is in the middle. Since is a uniformly random permutation of all 6 permutations are equally likely, and thus there is a 1/3 chance that the element at position is a local minimum. Now instead assume in which case there is chance that the middle element is the smallest. Finally, in the case where or there is also chance. As is run-free, these cases cover all possible cases. Thus there is at least chance that any position is a local minimum independently of . The expected number of local minima in the sequence is therefore at least and the expected distance between any two local minima is at most 3.
The expected longest distance between two local minima of was shown in .
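The block decomposition can be sketched as follows. We assume a literal reading of the definition: a block starts at the first position and at every interior position whose rank is strictly smaller than the ranks of both neighbours. The names `random_rank` and `block_decomposition` and the `seed` parameter are hypothetical and only serve the illustration.

```python
import random


def random_rank(alphabet, seed=0):
    """Map each symbol to its rank in a pseudo-random permutation."""
    symbols = list(alphabet)
    random.Random(seed).shuffle(symbols)
    return {c: r for r, c in enumerate(symbols)}


def block_decomposition(s, rank):
    """Split a run-free string s into blocks (0-indexed positions).

    A block starts at position 0 and at every local minimum, i.e. every
    interior position whose rank is smaller than that of both neighbours.
    """
    n = len(s)
    starts = [0]
    for i in range(1, n - 1):
        if rank[s[i - 1]] > rank[s[i]] < rank[s[i + 1]]:
            starts.append(i)
    return [s[a:b] for a, b in zip(starts, starts[1:] + [n])]
```

With the fixed ranks a < b < c, the string `cabcbacb` has local minima at its two occurrences of `a` and is split into the blocks `c`, `abcb` and `acb`. Note that under this literal reading the first block may consist of a single symbol.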
3.1 Signature Grammar Construction
We now give the construction algorithm for the signature tree . Consider an ordered forest of trees. Initially, consists of trees where the tree is a single node with label . Let the label of a tree denoted be the label of its root node. Let denote the string that is given by the in-order concatenation of the labels of the trees in . The construction of proceeds as follows:
Let be a maximal subrange of consecutive trees of with identical labels, i.e. . Replace each such subrange in by a new tree having as root a new node with children and a label that identifies the number of children and their label. We call this kind of node a run node. Now is run-free.
Consider the block decomposition of . Let be consecutive trees in such that their labels form a block in . Replace all identical blocks by a new tree having as root a new node with children and a unique label. We call this kind of node a run-free node.
Repeat the two steps above until contains a single tree; we call this tree .
In each iteration the size of decreases by at least a factor of two and each iteration takes time, thus it can be constructed in time.
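The construction can be sketched in Python as follows. This is an illustrative sketch under several assumptions of ours: identical subtrees are merged on the fly through a dictionary, so the structure built is the merged DAG rather than the tree; the random permutation is simulated by giving every new label an independent random priority; and the input is non-empty. The class name `SignatureGrammar` and its methods are hypothetical.

```python
import random
from itertools import groupby


class SignatureGrammar:
    def __init__(self, text, seed=0):
        self.rng = random.Random(seed)
        self.ids = {}       # node description -> node id (merges duplicates)
        self.nodes = []     # node id -> description
        self.priority = []  # node id -> random priority (the permutation)
        level = [self._node(("leaf", c)) for c in text]
        while len(level) > 1:
            level = self._compress_runs(level)        # step 1
            if len(level) > 1:
                level = self._compress_blocks(level)  # step 2
        self.root = level[0]

    def _node(self, description):
        if description not in self.ids:
            self.ids[description] = len(self.nodes)
            self.nodes.append(description)
            self.priority.append(self.rng.random())
        return self.ids[description]

    def _compress_runs(self, level):
        """Replace each maximal run of equal labels by one run node."""
        out = []
        for label, group in groupby(level):
            count = len(list(group))
            out.append(self._node(("run", label, count)) if count > 1 else label)
        return out

    def _compress_blocks(self, level):
        """Replace each block of the block decomposition by a run-free node."""
        p = self.priority
        starts = [0] + [i for i in range(1, len(level) - 1)
                        if p[level[i - 1]] > p[level[i]] < p[level[i + 1]]]
        ends = starts[1:] + [len(level)]
        return [self._node(("block", tuple(level[a:b])))
                for a, b in zip(starts, ends)]

    def expand(self, v=None):
        """Return the string produced by node v (the root by default)."""
        v = self.root if v is None else v
        kind, *rest = self.nodes[v]
        if kind == "leaf":
            return rest[0]
        if kind == "run":
            return self.expand(rest[0]) * rest[1]
        return "".join(self.expand(c) for c in rest[0])
```

For a highly repetitive input such as 64 copies of `ab`, the sketch creates only a handful of distinct nodes, because equal blocks receive the same label and the resulting run is stored as a single run node.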
Consider the directed acyclic graph (DAG) of the tree where all identical subtrees are merged. Note we can store run nodes in space since all out-going edges are pointing to the same node, so we store the number of edges along with a single edge instead of explicitly storing each of them. For run-free nodes we use space proportional to their out-degrees. We call this the signature DAG of denoted . There is a one-to-one correspondence between this DAG and an acyclic run-length grammar producing where each node corresponds to a production and each leaf to a terminal.
3.2 Properties of the Signature Grammar
We now show some properties of and that we will need later. Let denote the substring of given by the labels of the leaves of the subtree of induced by the node in left to right order.
Let be a node in the signature tree for a string of length . If has height then is at least and thus (and ) has height .
This follows directly from the out-degree of all nodes being at least 2.
Denote by the set of nodes in that are ancestors of the through leaf of . These nodes form a sequence of adjacent nodes at every level of and we call them relevant nodes for the substring .
and have identical nodes except at most the two first and two last nodes on each level whenever .
Trivially, the leaves of and are identical if . Now we show it is true for nodes on level assuming it is true for nodes on level . We only consider the left part of each level as the argument for the right part is (almost) symmetric. Let be the nodes on level in and the nodes on level in in left to right order. From the assumption, we have are identical with for some . When constructing the level of , these nodes are divided into blocks. Let be the first block that starts after then by the block decomposition, the first block after starts at . The nodes are spanned by at most two blocks and similarly for . These blocks become the first one or two nodes on level in and respectively. The block starting at is identical to the block starting at and the same holds for the following blocks. These blocks result in identical nodes on level . Thus, if we ignore the at most two first (and last) nodes on level the remaining nodes are identical.
We call nodes of consistent with respect to if they are guaranteed to be in any other where . We denote the remaining nodes of as inconsistent. From the above lemma, it follows that at most the left-most and right-most two nodes on each level of can be inconsistent.
The expected size of the signature DAG is .
We first bound the number of unique nodes in in terms of the LZ77-parse of which has size . Consider the decomposition of into the substrings given by the phrases and borders of the LZ77-parse of and the corresponding sets of relevant nodes . Clearly, the union of these sets contains all the nodes of . Since identical nodes are represented only once in we need only count one of their occurrences in . We first count the nodes at levels lower than . A set of nodes relevant to a substring of length one has no more than such nodes. By Lemma 5 only of the relevant nodes for a phrase are not guaranteed to also appear in the relevant nodes of its source. Thus we count a total of nodes for the sets of relevant nodes. Consider the leftmost appearance of a node appearing one or more times in . By definition, and because every node of is in at least one relevant set, it must already be counted towards one of the sets. Thus there are unique vertices in at levels lower than . Now for the remaining at most levels, there are no more than nodes because the out-degree of every node is at least two. Thus we have proved that there are unique nodes in . By Lemma 3 the average block size and thus the expected out-degree of a node is . It follows that the expected number of edges and the expected size of is .
A signature grammar of using (worst case) space can be constructed in expected time.
Construct a signature grammar for using the signature grammar construction algorithm. If the average out-degree of the run-free nodes in is more than some constant greater than 3 then try again. In expectation it only takes a constant number of retries before this is not the case.
Given a node , the child that produces the character at position in can be found in time.
First assume is a run-free node. If we store for each child of in order, the correct child corresponding to position can simply be found by iterating over these. However, this may take time since this is the maximum out-degree of a node in . This can be improved to by doing a binary search, but instead we use a fusion tree [7] that allows us to do this in time since we have at most elements. This does not increase the space usage. If is a run node then it is easy to calculate the right child by a single division.
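The child lookup can be sketched as follows, assuming 0-indexed positions and that each run-free node stores the cumulative lengths of the strings produced by its children. The sketch uses the binary search mentioned in the proof in place of a fusion tree, so it illustrates the logic rather than the stated time bound; all function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def prefix_lengths(child_lengths):
    """Cumulative lengths of the strings produced by the children."""
    return list(accumulate(child_lengths))


def child_in_run_free_node(prefix, position):
    """Return (child index, offset inside that child) for a run-free node."""
    k = bisect_right(prefix, position)
    start = prefix[k - 1] if k > 0 else 0
    return k, position - start


def child_in_run_node(child_length, position):
    """All children of a run node are equal, so one division suffices."""
    return divmod(position, child_length)
```

For children producing strings of lengths 3, 1 and 4, position 7 falls in the third child at offset 3; in a run node whose child has length 5, position 12 falls in the third copy at offset 2.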
4 Long Patterns
In this section we present how to use the signature grammar to construct a compressed index that we will use for patterns of length for constant . We obtain the following lemma:
Given a string of length with an LZ77-parse of length we can build a compressed index supporting pattern matching queries in time using space for any constant .
4.1 Data Structure
Consider a vertex with children in . Let denote the prefix of given by concatenating the strings represented by the first children of and let be the suffix of given by concatenating the strings represented by the last children of .
The data structure is composed of two z-fast tries (see Lemma 2) and and a 2D-range reporting data structure .
For every non-leaf node we store the following. Let be the number of children of if is a run-free node otherwise let :
The reverse of the strings for in the z-fast trie .
The strings for in the z-fast trie .
The points where is the rank of the reverse of in and is the rank of in for are stored in . A point stores the vertex and the length of as auxiliary information.
Assume in the following that there are no fingerprint collisions. Compute all the prefix fingerprints of . Consider the signature tree for . Let denote the ’th left-most vertex on level in and let be the last level. Let . Symmetrically, let denote the ’th right-most vertex on level in and let . Let .
For search for the reverse of in and for in using the precomputed fingerprints. Let and be the respective ranges returned by the search. Do a range reporting query for the (possibly empty) range in . Each point in the range identifies a node and a position such that occurs at position in the string . If is a run node, there is furthermore an occurrence of in for all positions where and .
To report the actual occurrences of in we traverse all ancestors of in ; for each occurrence of in found, recursively visit each parent of and offset the location of the occurrence to match the location in instead of . When is the root, report the occurrence. Observe that the time it takes to traverse the ancestors of is linear in the number of occurrences we find.
We now describe how to handle fingerprint collisions. Given a z-fast trie, Gagie et al.  show how to perform weak prefix queries and identify all false positives using extra time by employing bookmarked extraction and bookmarked fingerprinting. Because we only compute fingerprints and extract prefixes (suffixes) of the strings represented by vertices in we do not need bookmarking to do this. We refer the reader to  for the details. Thus, we modify the search algorithm such that all the searches in and are carried out first, then we verify the results before progressing to doing range reporting queries only for ranges that were not discarded during verification.
For any occurrence of in there is a node in that stabs , i.e. a suffix of equals a prefix and a prefix of equals the remaining suffix for some and . Since we put all combinations of , into and , we would be guaranteed to find all nodes that contain in if we searched for all possible split-points of , i.e. and for .
We now argue that we do not need to search for all possible split-points of but only need to consider those in the set . For a position , we say the node stabs if the nearest common ancestor of the and leaf of denoted is .
Look at any occurrence of . Consider and . Look at a possible split-point and the node that stabs position in . Let and be adjacent children of such that the rightmost leaf descendant of is the leaf and the leftmost leaf descendant of is the leaf. We now look at two cases for and argue it is irrelevant to consider position as split-point for in these cases:
Case is consistent (with respect to ). In this case it is guaranteed that the node that stabs in is identical to . Since is a descendant of the root of (as the root of is inconsistent) cannot contain and thus it is irrelevant to consider as a split-point.
Case is inconsistent and and are both consistent (with respect to ). In this case and have identical corresponding nodes and in . Because and are children of the same node it follows that and must also both be children of some node that stabs in (however and may not be identical since is inconsistent). Consider the node to the left of (or symmetrically for the right side if is an inconsistent node in the right side of ). If contains then is also a child of (otherwise would be inconsistent). So it suffices to check the split-point . Surely stabs an inconsistent node in , so either we consider that position relevant, or the same argument applies again and a split-point further to the left is eventually considered relevant.
Thus only split-points where and at least one of or are inconsistent are relevant. These positions are a subset of the positions in , and thus we try all relevant split-points.
A query on and takes time by Lemma 2 while a query on takes time using Lemma 1 (i) (excluding reporting). We do queries as the size of is . Verification of the strings we search for takes total time . Constructing the signature DAG for takes time, thus total time without reporting is for any . This holds because if then , otherwise and then . For every query on we may find multiple points each corresponding to an occurrence of . It takes time to report each point thus the total time becomes .
5 Short Patterns
Our solution for short patterns uses properties of the LZ77-parse of . A primary substring of is a substring that contains one or more borders of ; all other substrings are called secondary. A primary substring that matches a query pattern is a primary occurrence of while a secondary substring that matches is a secondary occurrence of . In a seminal paper on LZ77-based indexing [13], Kärkkäinen and Ukkonen use some observations by Farach and Thorup [6] to show how all secondary occurrences of a query pattern can be found given a list of the primary occurrences of through a reduction to orthogonal range reporting. Employing the range reporting result given in Lemma 1 (ii), all secondary occurrences can be reported as stated in the following lemma:
Lemma 10 (Kärkkäinen and Ukkonen [13])
Given the LZ77-parse of a string there exists a data structure that uses space that can report all secondary occurrences of a pattern given the list of primary occurrences of in in time.
We now describe a data structure that can report all primary occurrences of a pattern of length at most in time using space.
Given a string of length and a positive integer we can build a compressed index supporting pattern matching queries for patterns of length in time using space that works for .
Consider the set of substrings of that are defined by for , i.e. the substrings of length surrounding the borders of the LZ77-parse. The total length of these strings is . Construct the generalized suffix tree over the set of strings . This takes words of space. To ensure no occurrence is reported more than once, if multiple suffixes in this generalized suffix tree correspond to substrings of that start at the same position in , only include the longest of these. This happens when the distance between two borders is less than .
To find the primary occurrences of of length , simply find all occurrences of in . These occurrences are a superset of the primary occurrences of in , since contains all substrings starting/ending at most positions from a border. It is easy to filter out all occurrences that are not primary, simply by calculating if they cross a border or not. This takes time (where includes secondary occurrences). Combined with Lemma 10 this gives Lemma 11.
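The window-based search for primary occurrences can be sketched as follows. The sketch assumes that a border is given as a 0-indexed position and that an occurrence is primary when it contains such a position; it scans each window with `str.find` instead of using a generalized suffix tree, and it removes duplicates with a set. The function name `primary_occurrences` and the parameter `k` (the maximum pattern length) are hypothetical.

```python
def primary_occurrences(text, borders, pattern, k):
    """Start positions of occurrences of pattern that contain a border.

    Only the windows of radius k around each border are inspected, so the
    pattern must have length at most k.
    """
    m = len(pattern)
    assert 1 <= m <= k
    found = set()
    for b in borders:
        lo = max(0, b - k + 1)
        hi = min(len(text), b + k)
        window = text[lo:hi]
        i = window.find(pattern)
        while i != -1:
            start = lo + i
            if start <= b < start + m:  # the occurrence crosses border b
                found.add(start)
            i = window.find(pattern, i + 1)
    return sorted(found)
```

For `abababab` with borders at positions 0, 1 and 7, the pattern `ab` has primary occurrences at positions 0 and 6; the occurrences at positions 2 and 4 cross no border and are therefore not reported.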
6 Semi-Short Patterns
In this section, we show how to handle patterns of length between and . It is based on the same reduction to 2D-range reporting as used for long patterns. However, the positions in that are inserted in the range reporting structure are now based on the LZ77-parse of instead. Furthermore, we use Lemma 1 (ii), which gives faster range reporting but uses super-linear space; this is fine because we put fewer points into the structure. We get the following lemma:
Given a string of length we solve the compressed indexing problem for a pattern of length with for any positive constant in time using space.
6.1 Data Structure
As in the previous section for short patterns, we only need to worry about primary occurrences of in . Let be the set of all substrings of length at most that cross a border in . The split positions of such a string are the offsets of the leftmost borders in its occurrences. All primary occurrences of in are in this set. The size of this set is . The data structure is composed of the following:
A dictionary mapping each string in to its split positions.
A z-fast trie on the reverse of the strings for .
A z-fast trie on the strings for .
A range reporting data structure with a point for every pair of strings for where and is the lexicographical rank of the reverse of in the set and is the lexicographical rank of in the set . We store the border along with the point .
The data structure described in Lemma 10 to report secondary occurrences.
The signature grammar for .
Each entry in requires bits to store since a split position can be at most . Thus the dictionary can be stored in bits which for is words. The tries and take space while takes space. The signature grammar takes . Thus the total space is .
Assume a lookup for in does not give false-positives. Given a pattern compute all prefix fingerprints of . Next do a lookup in . If there is no match then does not occur in . Otherwise, we do the following for each of the split-points stored in . First split into a left part and a right part . Then search for the reverse of in and for in using the corresponding fingerprints. The search induces a (possibly empty) range for which we do a range reporting query in . Each occurrence in corresponds to a primary occurrence of in , so report these. Finally use Lemma 10 to report all secondary occurrences.
Unfortunately, we cannot guarantee a lookup for in does not give a false positive. Instead, we pause the reporting step when the first possible occurrence of has been found. At this point, we verify the substring matches the found occurrence in . We know this occurrence is around an LZ-border in such that is to the left of the border and is to the right of the border. Thus we can efficiently verify that actually occurs at this position using the grammar.
Computing the prefix fingerprints of takes time. First, we analyze the running time in the case actually exists in . The lookup in takes time using perfect hashing. For each split-point we do two z-fast trie lookups in time . Since each different split-point corresponds to at least one unique occurrence, this takes at most time in total. Similarly each lookup and occurrence in the 2D-range reporting structure takes time, which is therefore also bounded by time. Finally, we verify one of the found occurrences against in time. So the total time is in this case.
In the case does not exist, either the lookup in tells us that, and we spend time, or the lookup in is a false positive. In the latter case, we perform exactly two z-fast trie lookups and one range reporting query. These all take time . Since this is time. Again, we verify the found occurrence against in time. The total time in this case is therefore .
Note that we ensure our fingerprint function is collision-free for all substrings in during the preprocessing; thus there can only be collisions if does not occur in when .
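The reduction used in the query can be illustrated by a naive sketch: the pattern is split at a split point, the reversed left part is matched against the reversed text to the left of a border, and the right part is matched against the text starting at the border. The sketch assumes 0-indexed border positions such that the right part starts exactly at the border, and it replaces the two z-fast tries and the range reporting structure by linear scans. The function name `primary_by_splitting` is hypothetical.

```python
def primary_by_splitting(text, borders, pattern, split_points):
    """Start positions of occurrences whose split point aligns with a border.

    For a split point j, pattern[:j] must end just before a border b and
    pattern[j:] must start at b; the occurrence then starts at b - j.
    """
    left_keys = {b: text[:b][::-1] for b in borders}  # reversed left contexts
    right_keys = {b: text[b:] for b in borders}       # right contexts
    hits = set()
    for j in split_points:
        left_rev, right = pattern[:j][::-1], pattern[j:]
        for b in borders:
            if left_keys[b].startswith(left_rev) and right_keys[b].startswith(right):
                hits.add(b - j)
    return sorted(hits)
```

In the index, the borders matching the reversed left part form a range of ranks in one trie, the borders matching the right part form a range in the other, and the borders satisfying both are exactly the points inside the corresponding rectangle.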
7 Randomized Solution
In this section we present a very simple way to turn the worst-case time of Lemma 9 into expected time. First observe that this is already true if the pattern we search for occurs at least once or if .
As in the semi-short patterns section, we consider the set of substrings of of length at most that cross a border. Create a dictionary with entries and insert all the strings from . This means only a fraction of the entries are used, and thus if we look up a string (where ) that is not in there is only a chance of getting a false positive.
Now to answer a query, we first check if in which case we look it up in . If it does not exist, report that. If it does exist in or if use the solution from Lemma 9 to answer the query.
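The one-bit-per-entry dictionary can be sketched as a sparsely filled bit table. In this sketch, the table has `slack` times as many entries as there are inserted strings, so a string outside the set hits a set bit with probability roughly at most 1/`slack`; the parameter `slack`, the class name `BitFilter` and the use of BLAKE2b as the hash function (in place of the fingerprints used in the text) are assumptions made for illustration.

```python
import hashlib


class BitFilter:
    """Bit table with one hash function: no false negatives, few false positives."""

    def __init__(self, strings, slack=64):
        strings = list(strings)
        self.size = max(1, slack * len(strings))
        self.bits = bytearray((self.size + 7) // 8)
        for s in strings:
            i = self._slot(s)
            self.bits[i >> 3] |= 1 << (i & 7)

    def _slot(self, s):
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def might_contain(self, s):
        """False means s is definitely absent; True may be a false positive."""
        i = self._slot(s)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

A query first consults the filter and only falls back to the slower worst-case structure when the filter answers positively, which for absent patterns happens only with small probability.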
In the case does not exist, we spend either time if reports no, or time if reports a false-positive. Since there is only chance of getting a false positive, the expected time in this case is . In all other cases, the running time is in worst-case, so the total expected running time is . The space usage of is bits since we only need to store one bit for each entry. This is words for . To sum up, we get the following lemma:
Given a signature grammar for a text of length with an LZ77-parse of length we can build a compressed index supporting pattern matching queries in expected time using space for any constant .
- [1] Stephen Alstrup, Gerth Stølting Brodal, and Theis Rauhe. Pattern matching in dynamic texts. In Proceedings of the 11th Annual Symposium on Discrete Algorithms. Citeseer, 2000.
- [2] Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In Proc. 18th ESA, pages 427–438, 2010.
- [3] Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, and Hjalte Wedel Vildhøj. Time-space trade-offs for Lempel-Ziv compressed indexing. In 28th Annual Symposium on Combinatorial Pattern Matching. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017.
- [4] Timothy M. Chan, Kasper Green Larsen, and Mihai Patrascu. Orthogonal range searching on the RAM, revisited. In Proc. 27th SOCG, pages 1–10, 2011.
- [5] Richard Cole and Uzi Vishkin. Deterministic coin tossing with applications to optimal parallel list ranking. Information and Control, 70(1):32–53, 1986.
- [6] M. Farach and M. Thorup. String Matching in Lempel-Ziv Compressed Strings. Algorithmica, 20(4):388–404, 1998.
- [7] M. L. Fredman and D. E. Willard. Blasting through the information theoretic barrier with fusion trees. In Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, STOC ’90, pages 1–7, New York, NY, USA, 1990. ACM.
- [8] Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon Puglisi. A faster grammar-based self-index. Language and Automata Theory and Applications, pages 240–251, 2012.
- [9] Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J Puglisi. LZ77-based self-indexing with faster pattern matching. In Latin American Symposium on Theoretical Informatics, pages 731–742. Springer, 2014.
- [10] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. arXiv preprint arXiv:1705.10382, 2017.
- [11] Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. arXiv preprint arXiv:1511.02612, 2015.
- [12] Artur Jeż. Faster fully compressed pattern matching by recompression. ACM Transactions on Algorithms (TALG), 11(3):20, 2015.
- [13] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. Proceedings of the 3rd South American Workshop on String Processing (WSP’96), 26(Teollisuuskatu 23):141–155, 1996.
- [14] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987.
- [15] Kurt Mehlhorn, Rajamani Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183–198, 1997.
- [16] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007.
- [17] Takaaki Nishimoto, I Tomohiro, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Dynamic index, LZ factorization, and LCE queries in compressed space. arXiv preprint arXiv:1504.06954, 2015.
- [18] Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In Proc. 50th FOCS, pages 315–323, 2009.
- [19] S. C. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of 37th Conference on Foundations of Computer Science, Oct 1996.
- [20] I Tomohiro. Longest common extension with recompression. 2017.
- [21] Jacob Ziv and Abraham Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, (3), 1977.