Fast entropy-bounded string dictionary look-up with mismatches

We revisit the fundamental problem of dictionary look-up with mismatches. Given a set (dictionary) of d strings of length m and an integer k, we must preprocess it into a data structure to answer the following queries: Given a query string Q of length m, find all strings in the dictionary that are at Hamming distance at most k from Q. Chan and Lewenstein (CPM 2015) showed a data structure for k = 1 with optimal query time O(m/w + occ), where w is the size of a machine word and occ is the size of the output. The data structure occupies O(w d log^{1+ε} d) extra bits of space (beyond the entropy-bounded space required to store the dictionary strings). In this work we give a solution with similar bounds for a much wider range of values of k. Namely, we give a data structure that has O(m/w + log^k d + occ) query time and uses O(w d log^k d) extra bits of space.


1 Introduction

The problem of dictionary look-up was introduced by Minsky and Papert in 1968 and is a fundamental task in many areas such as bioinformatics, information retrieval, and web search. Informally, the task is to store a set of strings, referred to as a dictionary, in small space so as to support the following queries efficiently: Given a query string, return all dictionary strings that are close to it under some measure of distance. In this work we focus on Hamming distance and exact solutions to the problem. Formally, the problem is stated as follows.

Dictionary look-up with mismatches. We are given a dictionary that is a set of d strings of length m and an integer k. The task is to preprocess the dictionary into a data structure that supports the following queries: Given a string Q of length m, return all the strings in the dictionary such that the Hamming distance between each of them and Q is at most k.
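As a point of reference, the problem can always be answered naively in O(dm) time by scanning the whole dictionary; the data structures discussed in this paper exist to beat this bound. A minimal sketch of the baseline (illustrative only, not the paper's method):

```python
def hamming(a, b):
    """Hamming distance of two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def lookup_with_mismatches(dictionary, q, k):
    """Return all dictionary strings at Hamming distance at most k from q.

    Naive O(d*m) scan -- the baseline that the data structures below improve on.
    """
    return [s for s in dictionary if len(s) == len(q) and hamming(s, q) <= k]
```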

As a natural first step, much effort has been concentrated on the case k = 1 [3, 22, 4, 7, 8, 31, 11]. We note that the structure of the problem in this case is very special. Namely, if two strings have Hamming distance at most one, then there is an integer i such that their prefixes of length i − 1 are equal and their suffixes of length m − i are equal. Many existing solutions rely heavily on this property and cannot be extended to the case of arbitrary k. The first non-trivial solution for arbitrary k was given in the seminal paper of Cole, Gottlieb, and Lewenstein [13], who introduced a data structure called the k-errata tree. The k-errata tree requires O(w(dm + d log^k d)) bits of space and has query time O(m + log^k d · log log d + occ), where w is the size of a machine word and occ is the size of the output. The subsequent work [10, 9, 25] mainly focused on improving the space complexity of the data structure.

Two works are of particular interest to us. Belazzougui and Venturini [4] showed a data structure for k = 1 with query time O(m + occ) that uses entropy-bounded space, n H_λ + o(n) + O(w d log d) bits, where n = dm is the total length of the dictionary strings and H_λ is the λ-th empirical entropy of their concatenation. It was followed by the work of Chan and Lewenstein [11], who improved the query time to O(m/w + occ), while using approximately the same amount of bits, n H_λ + o(n) + O(w d log^{1+ε} d). In the model of Chan and Lewenstein the size σ of the alphabet is constant, and the query string arrives in a packed form, meaning that each w/log σ letters are stored in one machine word, under the standard assumption w = Ω(log n). The interest in this kind of bounds is explained by the fact that the value n H_λ is a lower bound on the output size of any compressor that encodes each letter of the dictionary strings with a code that only depends on the letter itself and on the λ immediately preceding letters [26].

1.1 Our contribution and techniques

We investigate this line of research further and give a new data structure with similar bounds for a much wider range of values of k. We adopt the model of Chan and Lewenstein and show a data structure for dictionary look-up with k mismatches that has query time O(m/w + log^k d + occ) and uses n H_λ + o(n) + O(w d log^k d) bits of space for all λ = o(log_σ n) (Theorem 5). If in addition log^k d = O(m/w), the query time becomes O(m/w + occ), matching the query time of Chan and Lewenstein.

The basis of our data structure is the k-errata tree of Cole, Gottlieb, and Lewenstein [13]. We first introduce a small but important modification to this data structure that will allow us to reduce the time requirements for non-constant k. At a high level, the k-errata tree is a collection of compact tries, where each trie contains suffixes of a subset of strings in the dictionary. The query algorithm runs prefix search queries in the tries. In Section 4.1 we show that the prefix search queries can be implemented in O(m/w) shared time plus a small additional cost per query, using linear space beyond the space required by the k-errata tree. Next, in Section 4.2 we show how to improve the space complexity to entropy-bounded. Our main contribution at this step is a new reduction from prefix search queries in the tries of the k-errata tree to prefix search queries on a compact trie containing only a subset of all suffixes of the dictionary strings. Finally, in Section 5 we improve the time that we spend per query to constant (amortised) time by a clever use of Karp–Rabin fingerprints, which gives the final result, Theorem 5. We emphasize that we derandomize the query algorithm and that our data structure is deterministic, despite the fact that we use Karp–Rabin fingerprints.
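For intuition, Karp–Rabin fingerprints allow constant-time equality tests between substrings after linear-time preprocessing. A minimal sketch with a fixed base and modulus (the paper derandomizes the choice of the fingerprint function; the constants here are purely illustrative):

```python
MOD = (1 << 61) - 1  # a Mersenne prime, an illustrative choice
BASE = 131           # illustrative base

def prefix_fingerprints(s):
    """F[i] is the fingerprint of s[:i]; computed once in O(|s|) time."""
    F = [0] * (len(s) + 1)
    for i, c in enumerate(s):
        F[i + 1] = (F[i] * BASE + ord(c)) % MOD
    return F

def powers(n):
    """pw[i] = BASE**i mod MOD."""
    pw = [1] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] * BASE % MOD
    return pw

def fingerprint(F, pw, i, j):
    """Fingerprint of s[i:j] in O(1) time from the prefix fingerprints."""
    return (F[j] - F[i] * pw[j - i]) % MOD
```

Equal substrings always receive equal fingerprints; the interesting direction (unequal substrings colliding) is what the choice of the fingerprint function, and its derandomization in Section 5, must control.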

1.2 Related work

Many of the works cited above consider not only the Hamming distance, but also the edit distance. This is in particular true for k = 1, when the two distances can be treated similarly. Another interesting direction is heuristic methods for the Hamming and the edit distances, which have worse theoretical guarantees but perform well in practice [12, 6, 23, 27]. Finally, we note that the solutions discussed in this work are beneficial in the low-distance regime, i.e. when k is small. For large k, one should turn to approximate approaches, such as locality-sensitive hashing (see [2] and references therein).

Several works have studied the question of developing efficient data structures for string processing when the query arrives in a packed form. In particular, Takagi et al. suggested a data structure called packed compact tries [29] to support efficient exact dictionary look-ups, and Bille, Gørtz, and Skjoldjensen used a similar technique to develop an efficient text index [5].

2 Preliminaries

We assume a constant-size integer alphabet of size σ. A string is a sequence of letters of the alphabet. For a string S we denote its length by |S| and its substring S[i] … S[j], where 1 ≤ i ≤ j ≤ |S|, by S[i..j]. If i = 1, the substring is referred to as a prefix of S, and if j = |S|, S[i..j] is called a suffix of S. We say that S is given in a packed form if each w/log σ letters of S are stored in one machine word, i.e. S occupies O(|S| log σ / w) machine words in total. Given a string in packed form, we can access (a packed representation of) any (w/log σ)-length substring of it in constant time using the shift operation.
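For concreteness, here is what packed access with shifts looks like when each letter occupies two bits (a hypothetical 4-letter alphabet; Python integers stand in for machine words):

```python
SIGMA_BITS = 2  # bits per letter for a 4-letter alphabet {0,1,2,3} (an assumption for illustration)

def pack(s):
    """Pack a sequence of letters over {0..3} into one integer, first letter in the low bits."""
    x = 0
    for i, c in enumerate(s):
        x |= c << (i * SIGMA_BITS)
    return x

def packed_substring(x, i, length):
    """Extract letters s[i..i+length-1] from the packed word with one shift and one mask."""
    mask = (1 << (length * SIGMA_BITS)) - 1
    return (x >> (i * SIGMA_BITS)) & mask
```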

A trie is a basic data structure for storing a set of strings. A trie is a tree which has the following three properties:

  1. Each edge is labelled by a letter of the alphabet;

  2. Each two edges outgoing from the same node are labelled with different letters, and the edges are ordered by the letters;

  3. Let the label of a node v be equal to the concatenation of the labels of the edges in the root-to-v path. For each string in the set there is a node of the trie whose label is equal to that string, and the label of each node is equal to a prefix of some string in the set.

At each node we store the set of ids of the strings that are equal to the node’s label. The number of nodes in a trie can be proportional to the total length of the strings. To improve the space requirements, we replace each path of nodes of degree one with no string ids assigned to them with a single edge labelled by the concatenation of the letters on the edges of the path. The result is called a compact trie. Each node of the trie is represented in the compact trie as well, some as nodes and some as positions on the edges. We refer to the set of all nodes and positions on the edges of the compact trie as positions.
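The path-compression step can be sketched as follows (a toy plain-dict representation, assuming string ids are attached to trie nodes as described):

```python
def build_trie(strings):
    """Plain (uncompressed) trie; each node is a dict with children and string ids."""
    root = {"children": {}, "ids": []}
    for sid, s in enumerate(strings):
        node = root
        for c in s:
            node = node["children"].setdefault(c, {"children": {}, "ids": []})
        node["ids"].append(sid)
    return root

def compress(node):
    """Replace unary, id-free paths with single edges labelled by the concatenated letters."""
    new_children = {}
    for c, child in node["children"].items():
        label = c
        # follow the chain of degree-one nodes with no string ids assigned
        while len(child["children"]) == 1 and not child["ids"]:
            (c2, nxt), = child["children"].items()
            label += c2
            child = nxt
        compress(child)
        new_children[label] = child
    node["children"] = new_children
    return node
```

On the set {"abc", "abd"} compression produces a root edge labelled "ab" with two child edges "c" and "d", illustrating why the compact trie over d strings has O(d) nodes.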

Fact .

A compact trie containing d strings has O(d) nodes.

3 The k-errata tree: Reminder and fix

Our definition of the k-errata tree follows closely that of Cole, Gottlieb, and Lewenstein [13], but as explained below we introduce an important fix to the original definition. We try to be as concise as possible, but we feel obliged to provide all the details, both because we modify the original definition and because the details are important for our final result.

Intuition.

Let us explain the main idea first. Denote the given dictionary of d strings by D. The k-errata tree for D is built recursively. We start with the compact trie T containing all the strings in D and decompose it into heavy paths.

[21] The heavy path of T is the path that starts at the root of T and at each node on the path branches to the child with the largest number of leaves in its subtree (the heavy child), with ties broken arbitrarily. The heavy path decomposition is defined recursively, namely, it is defined to be the union of the heavy path of T and the heavy path decompositions of the subtrees of T that hang off the heavy path. The first node of a heavy path is referred to as its head.
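A sketch of the decomposition on an explicit child-list representation (head[v] identifies the heavy path of v by its first node):

```python
def heavy_path_heads(children, root):
    """Heavy path decomposition of a tree given as children: node -> list of children.

    Returns head: node -> head (first node) of the heavy path containing it.
    At each node the path continues into the child with the most leaves below it.
    """
    leaves = {}

    def count(v):  # number of leaves in the subtree of v
        kids = children.get(v, [])
        leaves[v] = sum(count(c) for c in kids) if kids else 1
        return leaves[v]

    count(root)
    head = {}
    stack = [(root, root)]
    while stack:
        v, h = stack.pop()
        head[v] = h
        kids = children.get(v, [])
        if kids:
            heavy = max(kids, key=lambda c: leaves[c])
            stack.append((heavy, h))  # the heavy child extends the current path
            stack.extend((c, c) for c in kids if c is not heavy)  # light children head new paths
    return head
```

Switching from one heavy path to the next at least halves the number of leaves below the current node, which is the source of the logarithmic factors throughout the analysis.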

Recall that our task is to find all strings in D such that the Hamming distance between them and the query string Q is at most k. As a first step, we find the longest path π that starts at the root of T and is labelled with a prefix of Q. Let this path trace heavy paths H_1, H_2, …, H_q, leaving the heavy path H_i at a position u_i of H_i, i = 1, 2, …, q. We can partition all the strings in D into three categories:

  1. Strings diverging off a heavy path H_i at some node u, where u is located above u_i;

  2. Strings in the subtrees of u_i’s children that diverge from the heavy path H_i, for i = 1, 2, …, q;

  3. Strings in the subtree rooted at u_q.

Consider the set of strings in D that diverge from a heavy path at a node u. They necessarily have their first mismatch with Q there. The first idea is that we can fix that mismatch in each of the strings (decreasing the Hamming distance between them and Q by one), and then run a dictionary look-up with k − 1 mismatches on the resulting set of strings. The second idea is that running an independent dictionary look-up query for each node in each heavy path is expensive, so we introduce a grouping on the nodes that reduces the number of queries to logarithmic.

Data structure.

We assign each string in D a credit of k mismatches and start building the k-errata tree in a recursive manner. First, we build the compact trie T for the dictionary D. For each leaf of T we store the ids of the dictionary strings equal to the leaf’s label, ordered by the mismatch credits. Second, we decompose T into heavy paths. For each node of T we store a pointer to the heavy path it belongs to, and for each heavy path we store a pointer to its head. We will now make use of weight-balanced trees, defined in analogy with weight-balanced search trees.

A weight-balanced tree with leaves of weights w_1, …, w_n (in left-to-right order) is a ternary tree. We build it recursively top-to-down. Let j be the smallest index such that w_1 + … + w_j exceeds half of the total weight. Then the left subtree hanging from the root is a weight-balanced tree with leaves of weights w_1, …, w_{j−1}, the middle subtree contains one leaf of weight w_j, and the right subtree is a weight-balanced tree with leaves of weights w_{j+1}, …, w_n.
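The recursive construction can be sketched as follows, assuming the split picks the smallest index whose prefix weight exceeds half the total (one standard choice; exact tie handling is an assumption here):

```python
def weight_balanced(weights, lo=0, hi=None):
    """Build a ternary weight-balanced tree over leaves lo..hi-1.

    Returns a nested tuple (left_subtree, middle_leaf_index, right_subtree),
    with None standing for an empty subtree.
    """
    if hi is None:
        hi = len(weights)
    if lo >= hi:
        return None
    total = sum(weights[lo:hi])
    acc = 0
    j = lo
    # smallest j such that weights[lo] + ... + weights[j] exceeds total / 2
    while j < hi - 1 and acc + weights[j] <= total / 2:
        acc += weights[j]
        j += 1
    return (weight_balanced(weights, lo, j), j, weight_balanced(weights, j + 1, hi))
```

The split rule guarantees that each of the two recursive subtrees carries at most half of the total weight, so a leaf of weight w_i sits at depth O(log(W/w_i)), where W is the total weight.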

We build two sets of (k−1)-errata trees for each heavy path H of T. We call the trees in the first set vertical, and those in the second set horizontal, according to the way we construct them.

We first explain how we build the vertical (k−1)-errata trees. Suppose that H contains nodes v_1, v_2, …, v_s (from top to bottom), and the weight of a node v_t is the number of strings that diverge from H at v_t. As a preliminary step, we build a weight-balanced tree on the nodes of H. Consider a node x of the weight-balanced tree containing v_i, …, v_j in its subtree. Let ℓ be the length of the string written on the path from the head of H to v_j, and let a be the first letter on the edge from v_j to the next node of H. We build a new set of strings as follows: For each node v_t, i ≤ t ≤ j, we take each string that diverges from the path at v_t, cut off its prefix of length ℓ + 1, and decrease the credit of the string by the number of mismatches between the cut-off prefix and the label of v_j appended with the letter a. If the credit of a string becomes negative, we delete it. Finally, we build the (k−1)-errata tree for each of the newly created sets of strings.

We now explain how we build the horizontal (k−1)-errata trees. We repeat the following for each node v of H. Let ℓ be the length of the label of v. Consider the set of all children of v except for the child of v that belongs to H. For each child in this set, we build a new set of strings as follows: We take each string that ends below the child, cut off its prefix of length ℓ + 1, and decrease the credit of the string by one. Similarly to above, if the credit of a string becomes negative, we delete it. We define the weight of each child as the number of strings in the corresponding set. Next, we build the weight-balanced tree on the set of the children, and for each node of the tree consider the set of strings that is the union of the sets of strings below it. Finally, we build the (k−1)-errata tree for each of these sets of strings.

Our modification to the original definition is that we truncate the strings and store the mismatch credits. Because of this, all the strings we work with are suffixes of the dictionary strings, which allows us to process them efficiently.

Queries.

A dictionary look-up with mismatches for a string Q is performed in a recursive way as well. For the purposes of recursion, we introduce an extra parameter j, the number of mismatches used so far, and allow dictionary look-ups to be run from any position of a trie of the k-errata tree. We will make use of a procedure called PrefixSearch: Given a string and a position p of a trie, PrefixSearch returns the longest path starting at p that is labelled by a prefix of the query string.

Suppose we must answer a dictionary look-up with mismatches for a string Q that starts at a position p. We initialize j = 0. If j = k, we run a PrefixSearch to find a path starting at p labelled by Q. If such a path exists, we output all the dictionary strings assigned to the end of this path such that their mismatch credit is at least j. Assume now j < k. If Q is empty, the look-up terminates and we output all the dictionary strings assigned to the current position such that their mismatch credit is at least j. Otherwise, we run a PrefixSearch to find the longest path π starting at p that is labelled by a prefix of Q. Suppose that π passes through heavy paths H_1, H_2, …, H_q, leaving H_i at a position u_i, i = 1, 2, …, q. Note that for i < q, u_i is necessarily a node of H_i, and for i = q it can be a position on an edge.

Recall that for each node of T we store the heavy path it belongs to, and for each heavy path we store its head. The position u_q is the ending position of π. To find u_{q−1}, consider the heavy path H_q containing u_q: by definition, u_{q−1} is the parent of the head of H_q. We find all the nodes u_i, i = q − 2, …, 1, analogously. Recall that we partitioned the dictionary strings into three types.

Strings of Type 1. We process each path H_i in turn. We select a set of O(log d) nodes of the weight-balanced tree covering the part of H_i from the beginning and up to (but not including) u_i. To do this, we follow the path from the root of the weight-balanced tree to u_i and take the nodes that hang off to the left of the path. Consider one of the selected nodes and its (k−1)-errata tree. All the strings in this tree have equal lengths, say ℓ. To finish the recursive step, we run a dictionary look-up with j mismatches for the suffix of Q of length ℓ in this tree.

Strings of Type 2. We take the weight-balanced tree built on the children of u_i and select a set of nodes that covers all its leaves except for the head of H_{i+1}. To select this set, we find the path from the root of the weight-balanced tree to the head of H_{i+1}, and take the nodes that hang off this path. For each of the selected nodes, we run a dictionary look-up with j mismatches in its (k−1)-errata tree analogously to above. We also run a dictionary look-up with j + 1 mismatches starting from the position in H_i that is one letter below u_i.

Strings of Type 3. If u_q is a position on an edge, we run a dictionary look-up query from the next position on the edge with j + 1 mismatches. If u_q is a node, we run two dictionary look-up queries. The first query is run with j mismatches in the horizontal (k−1)-errata tree corresponding to the set of all children of u_q that are not in H_q. The second query is run from the position in H_q that is one letter below u_q with j + 1 mismatches.

Correctness of the algorithm follows from the following observations: first, we account for all dictionary strings. Second, in the case of the (k−1)-errata trees, we account for the mismatches between the portion of the strings that we truncate and the query string via the mismatch credits. Finally, when we continue the search in the same tree, there is exactly one mismatch, and we account for it by increasing j.

Analysis.

The bounds on the space and the time complexities are summarised below. The proofs of the lemmas, which we provide in Appendix A for completeness, follow closely the proofs given by Cole, Gottlieb, and Lewenstein [13].

Lemma .

The tries of the k-errata tree contain O(d log^k d) strings in total.

Lemma .

A dictionary look-up with k mismatches for a query string requires O(log^k d) PrefixSearch operations. Apart from the time required for these operations, the algorithm spends O(log^k d + occ) time.

In the next section we give an efficient implementation of PrefixSearch under the assumption that the query string arrives in a packed form. We will use, in particular, the following simple observation.

Fact .

Each trie of the k-errata tree is built on a set of equal-length suffixes of the dictionary strings. If we run a PrefixSearch for a suffix of the query string from a position p of a trie of the k-errata tree, then the strings in the subtree of p have the same length as that suffix.

4 Prefix search for packed strings

We first recall several well-known data structure results that we use throughout the section. A priority queue is a data structure similar to a regular queue, where each element has an integer (“priority”) associated with it. In a priority queue, an element with high priority is served before an element with low priority. A priority queue can be implemented as a heap, which for a set of n elements occupies O(nw) bits of space and has query time O(log n).

A predecessor data structure on a set of integers supports the following queries: Given an integer x, return the largest integer in the set that is at most x. For a set of n integer keys, the predecessor data structure can be implemented as a binary search tree in O(nw) bits of space to support the predecessor queries in O(log n) time. (We do not use solutions such as [17, 30] to avoid dependency on the size of the universe, which will be important for our final result.)

A dictionary data structure stores a set of integers. A dictionary look-up receives an integer x and outputs “yes” if and only if x belongs to the set.

[28] Let S be any given set of n integers. There is a dictionary over S that occupies O(nw) bits of space and has query time O(1).

We will also need lowest common ancestor queries on tries. Given two nodes u and v of a trie, their lowest common ancestor is the node of maximal depth that contains both u and v in its subtree.

[18] A trie of size n can be preprocessed in O(nw) bits of space to support lowest common ancestor queries in O(1) time.

Finally, we need weighted level ancestor queries on tries. A weighted level ancestor query receives a node u and an integer ℓ, and must output the deepest ancestor of u such that the length of the label of the ancestor is at most ℓ. We will use the weighted level ancestor queries on tries for fast navigation: Suppose that we know a leaf labelled by a string S; then to find a position labelled by a prefix of S we can use one weighted level ancestor query instead of performing a PrefixSearch for that prefix. To avoid dependency on the size of the universe, we use the following simple folklore solution instead of [20, 1, 14].

A trie of size n can be preprocessed in O(nw) bits of space to support weighted level ancestor queries in O(log n) time.

Proof.

We consider the heavy path decomposition of the trie. For each node we store a pointer to the head of the heavy path containing it, and for each path we build a binary search tree containing the lengths of the labels of the nodes in it. Suppose we are to answer a weighted level ancestor query for a node u and an integer ℓ. The path from the root of the trie to u (which contains all the ancestors of u) traverses a subset of heavy paths. The size of this subset is O(log n), because each time we switch paths the weight of the current node decreases by at least a factor of two. We iterate over this set of paths to find the path that contains the answer, and then use the binary search tree to find the location of the answer in the path. Both steps take O(log n) time. ∎

4.1 Linear space

As a warm-up we show a linear-space implementation of PrefixSearch that improves the runtime of dictionary look-up queries to O(m/w + log^k d · log(dm) + occ). Formally, we will show the following result.

Assume a constant-size alphabet. For a dictionary of d strings of length m, there is a data structure for dictionary look-up with k mismatches that occupies O(w(dm + d log^k d)) bits of space and has query time O(m/w + log^k d · log(dm) + occ), where w is the size of a machine word.

Let S be the set of all suffixes of the strings in the dictionary. We build a compact trie T_S on S. (In the literature, T_S is referred to as the suffix tree of the dictionary.) As S contains dm strings, the size of T_S is O(dm), and therefore it occupies O(wdm) bits of space. We can reduce PrefixSearch queries on the tries of the k-errata tree to PrefixSearch queries on T_S. We distinguish between PrefixSearch queries that start at the root of some trie of the k-errata tree (rooted queries), and those that start at some inner node or even a position on an edge of a trie of the k-errata tree (unrooted queries). Note that unrooted queries are used in the case k ≥ 2 only.

Lemma .

After O(wdm) bits of space preprocessing, we can answer a rooted PrefixSearch query for a string P and any trie of the k-errata tree in O(log(dm)) time given the answer to a rooted PrefixSearch for P in T_S.

Lemma .

Assume k ≥ 2. After O(wdm) bits of space preprocessing, we can reduce an unrooted PrefixSearch query for a string P that starts at a position of a trie of the k-errata tree to a rooted PrefixSearch for some suffix of P in a trie of a (k−1)-errata tree in O(log(dm)) time given the answer to a rooted PrefixSearch query for P in T_S.

Both lemmas were proved in [13]. For completeness, we give their proofs in Appendix A. Suppose we are to answer a dictionary look-up with k mismatches for a string Q. Our algorithm traverses the k-errata tree and generates rooted and unrooted PrefixSearch queries. We maintain a priority queue. Each time we need an answer to a PrefixSearch for a string P in T_S, we add P to the priority queue. At each step of the algorithm we extract the longest string from the queue and answer the PrefixSearch query for it. Notice that all strings in the queue are suffixes of Q and that the maximal length of the strings in the queue cannot increase. We can therefore assume that we must answer PrefixSearch queries for the suffixes of Q starting at positions i_1 ≤ i_2 ≤ … ≤ i_t, where t = O(log^k d).
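The queue discipline can be sketched with heapq: since every pending query is a suffix of Q, the longest pending suffix is the one with the smallest starting position, so a min-heap over start positions serves the queries in order of decreasing length (a toy sketch):

```python
import heapq

def longest_first(starts):
    """Serve pending PrefixSearch requests longest-suffix-first.

    Each request is identified by the starting position of a suffix of Q;
    the longest pending suffix has the smallest start, so a min-heap suffices.
    """
    heap = list(starts)
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap))
    return order
```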

Bille, Gørtz, and Skjoldjensen [5] showed that we can preprocess T_S in linear space to answer a PrefixSearch query for a single query string of length m in O(m/w + log(dm)) time. As an immediate corollary we obtain that we can answer the t PrefixSearch queries in O(t(m/w + log(dm))) time, but this is too slow for our purposes. Below we develop their ideas to give a more efficient approach.

T_S can be preprocessed in O(wdm) bits of space to answer PrefixSearch for the suffixes of Q starting at positions i_1 ≤ i_2 ≤ … ≤ i_t in O(m/w + t log(dm)) time.

Proof.

We assume that the strings in the dictionary are stored in the packed form. By construction, each edge of a trie of the k-errata tree is labelled by a substring of a dictionary string. It means that we can store each label as three integers: the id of the string, and the starting and the ending positions of the substring. Next, we preprocess T_S for weighted level ancestor queries. A node or a position in the trie is called boundary if the length of its label is a multiple of w/log σ, where w is the size of a machine word and σ is the size of the alphabet. Boundary nodes cut the tree into micro-trees. We only consider the micro-trees containing more than two nodes. We define the label of a leaf of a micro-tree as a machine word that contains a packed representation of the string written on the path from the root of the micro-tree to the leaf. The labels can be treated as integers; for each micro-tree we create a dictionary and a predecessor data structure on the labels of its leaves. We also preprocess each micro-tree for lowest common ancestors. Note that the total size of the micro-trees is O(dm), as each edge of T_S contains at most two nodes of the micro-trees. Therefore, the preprocessing requires O(wdm) bits of space.

We now explain how to answer the PrefixSearch queries for the suffixes of Q starting at the positions i_1 ≤ i_2 ≤ … ≤ i_t. For i_1, we start at the root of T_S. For i_r, r ≥ 2, we use the information obtained at the previous step. Namely, suppose that the PrefixSearch for the suffix starting at i_{r−1} terminated at a position labelled by a string P. We take any leaf below this position; let it be labelled by a string L. Let P′ be the longest prefix of P such that its length is a multiple of w/log σ. We then start the PrefixSearch from the position labelled by P′. To find this position, we first find the leaf labelled by L, and then use a weighted level ancestor query to jump to the position in O(log(dm)) time. Notice that the position is boundary. If it is not a root of a micro-tree, it has a single outgoing edge of length at least w/log σ. We compare the first w/log σ letters of the label of this edge and of the remaining part of the suffix in O(1) time by comparing the corresponding machine words. If they are equal, we continue from the next boundary node on the edge in a similar manner. Otherwise, we find the first mismatch between the two strings in O(1) time as follows: First, compute a bitwise XOR of the two machine words, and then locate the most significant set bit using the technique of [19].

If the position is the root of a micro-tree M, we search for the next block of the suffix in the dictionary of M. If the block is in the dictionary and corresponds to a leaf of M, we continue from that leaf. Otherwise, we find its predecessor B⁻ and successor B⁺ using the predecessor data structure. The PrefixSearch must terminate either on the path from the root of M to the leaf of the micro-tree labelled by B⁻, or on the path from the root of M to the leaf of the micro-tree labelled by B⁺. We compute the longest common prefix of the block with B⁻ and with B⁺ using bitvector operations in O(1) time as explained above, take the longer of the two, and find the position labelled by it in O(log(dm)) time using a weighted level ancestor query.

The running time of each PrefixSearch query is proportional, up to the O(log(dm)) cost of the weighted level ancestor queries, to the number of (w/log σ)-length blocks of Q that we compare with the labels of the edges of T_S. Notice that each two different PrefixSearch queries share at most one block of letters. Therefore, as the size of the alphabet is constant, the total running time of the t PrefixSearch queries is O(m/w + t log(dm)). ∎
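The word-level mismatch location used in the proof can be sketched as follows. Here letters are packed low-to-high, so the first mismatching letter corresponds to the least significant set bit of the XOR (with the opposite packing order one would take the most significant bit instead, as in the text):

```python
SIGMA_BITS = 2  # bits per letter, a 4-letter alphabet assumed for illustration

def pack(s):
    """Pack letters over {0..3} into one integer, first letter in the low bits."""
    x = 0
    for i, c in enumerate(s):
        x |= c << (i * SIGMA_BITS)
    return x

def first_mismatch(x, y):
    """Index of the first differing letter of two packed equal-length strings,
    or None if they are equal.  With letters packed low-to-high, the first
    mismatch sits at the least significant set bit of x XOR y."""
    z = x ^ y
    if z == 0:
        return None
    lsb = (z & -z).bit_length() - 1  # position of the lowest set bit
    return lsb // SIGMA_BITS
```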

Together, the lemmas above give Theorem 4.1.

4.2 Entropy-bounded space

In this section we improve the space requirements of our implementation of PrefixSearch and show the following theorem.

Assume a constant-size alphabet. For a dictionary of d strings of length m and any λ = o(log_σ(dm)), let H_λ be the λ-th empirical entropy of the concatenation of all the dictionary strings. There is a data structure for dictionary look-up with k mismatches that uses dm·H_λ + o(dm) + O(w d log^k d) bits of space and has query time O(m/w + log^k d · log(dm) + occ), where w is the size of a machine word.

There are two bottlenecks: First, we need to store the dictionary strings, and second, the tree structure of T_S requires O(wdm) bits of space. To overcome the first bottleneck, we replace the packed representation of the dictionary strings by the Ferragina–Venturini representation:

[16] Under the assumption of an alphabet of constant size σ, for any λ = o(log_σ n) there exists a data structure that uses nH_λ + o(n) bits of space, where n is the total length of the dictionary strings, and supports constant-time access to any Θ(log n)-length substring of a dictionary string.

If d is a constant, it suffices to store the Ferragina–Venturini representation of the dictionary strings to obtain the bounds of Theorem 4.2. Indeed, when a query string Q arrives, we can decide if the Hamming distance between Q and a dictionary string is at most k in O(m/w + k) time, using comparison by machine words and bitvector operations. As d is constant, we obtain the desired time bound. Below we assume that d is not a constant.

We now deal with the second bottleneck. We will consider a smaller trie T′ built on a subset S′ of S, and will show that PrefixSearch queries on the tries of the k-errata tree can be reduced to PrefixSearch queries on this trie. S′ is defined to be the set of all suffixes of the dictionary strings that start at positions 1, w/log σ + 1, 2w/log σ + 1, and so on. We call such suffixes sampled. Below we show that we can reduce PrefixSearch queries in the tries of the k-errata tree to PrefixSearch queries in T′.

After O(w d log^k d) bits of space preprocessing, we can answer a rooted PrefixSearch query for a string P in a trie of the k-errata tree in O(log(dm)) time given the answer to a rooted PrefixSearch query for a suffix P′ of P in T′, where the length of P′ is the largest multiple of w/log σ not exceeding |P|.

Proof.

At the preprocessing step, we traverse T′ and remember the leftmost and the rightmost leaves in each of its subtrees. We also remember the neighbours of each leaf in the left-to-right order, and finally we preprocess the trie for lowest common ancestor queries. As a second step, we preprocess each trie of the k-errata tree for lowest common ancestor and weighted level ancestor queries. We also build the following data structure for each trie of the k-errata tree. For each string S in the trie, write S = H S′, where S′ is the longest sampled suffix of S. We call H the head of S, and define the rank of S to be the rank of S′ in T′. We build a compact trie containing the heads of all the strings, and preprocess it as in Lemma 4.1. If the trie of the k-errata tree contains r strings, we use O(wr) bits of space for the preprocessing, i.e. O(w d log^k d) bits of space in total. We also associate a predecessor data structure with each of the leaves of the trie of heads. The predecessor data structure of a leaf labelled by H contains the ranks of all the strings such that their head is equal to H. The predecessor data structures occupy O(w d log^k d) bits of space in total as well.

Suppose we are to answer a rooted PrefixSearch query for a string P and a trie of the k-errata tree. Let T_H be the compact trie containing the heads of the strings in the trie. By Fact 3, the length of the heads is smaller than w/log σ. We first read the first w/log σ letters of P in O(1) time, and run a PrefixSearch for them in T_H in O(log(dm)) time. If the PrefixSearch terminates in a position of T_H that is not in a leaf, it remains to find the position corresponding to it in the trie of the k-errata tree, which we can do with one weighted level ancestor query.

Assume now that the PrefixSearch terminates in a leaf of T_H. By the condition of the lemma, we know the answer to the rooted PrefixSearch for P′ in T′. We also store the leftmost and the rightmost leaves in each subtree of T′, and therefore can find the predecessor of P′ among the sampled suffixes in O(1) time. We use the predecessor data structure associated with the leaf to find the predecessor S₁ and the successor S₂ of P in O(log(dm)) time. To find the position where the PrefixSearch for P terminates, we compute the length ℓ₁ of the longest common prefix of P and S₁ and the length ℓ₂ of the longest common prefix of P and S₂. We can compute the longest common prefix of P and S₁ (whose suffix is a sampled suffix of a dictionary string) in O(1) time via a lowest common ancestor query on T′. We then compute ℓ₂ in a similar way. If ℓ₁ = ℓ₂, we return the lowest common ancestor of S₁ and S₂ as the answer. If ℓ₁ > ℓ₂, then the answer is the ancestor of S₁ such that the length of its label is ℓ₁, and we can find it by one weighted level ancestor query. The case ℓ₁ < ℓ₂ is analogous. ∎

Assume m ≥ w. After O(w d log^k d) bits of space preprocessing, we can answer an unrooted PrefixSearch query for a string P by reducing it to a rooted PrefixSearch query in O(1) time, given the answer to a rooted PrefixSearch query for the string P[j..m] in T, where j is the smallest multiple of w.

Proof.

During the preprocessing step, we preprocess T for lowest common ancestor queries and each trie of the k-errata tree for weighted level ancestor queries. Let u be the position in a trie T' where we start the PrefixSearch for P. The search path for P traverses a number of heavy paths. The first path is the path containing u. Let L be the label of the part of this path starting from u. We consider two cases. Suppose first that the starting position is a multiple of w. In this case, L is a suffix of one of the dictionary strings starting at a position that is a multiple of w, i.e. it is sampled. Therefore, we can find the longest common prefix of P and L using one lowest common ancestor query on T. We can then find the node in the path corresponding to this longest common prefix using one weighted level ancestor query. From there, we can find the starting node of the second heavy path traversed by P in O(1) time. It remains to answer a rooted PrefixSearch query in the subtree rooted at this node, which is a trie of a (k − 1)-errata tree by construction. When we know the answer for this PrefixSearch, we can go back to T' using one weighted level ancestor query. In the second case, the starting position is not a multiple of w. We start by comparing P and L in blocks of w letters until we reach the start of a sampled suffix, and then proceed as above. ∎

Suppose that we are to answer a dictionary look-up with k mismatches for a string Q. Our algorithm traverses the k-errata tree and generates rooted PrefixSearch queries for the suffixes of Q in T. We maintain a priority queue. Each time we need an answer to a rooted PrefixSearch for a suffix of Q in T, we add this suffix to the priority queue. At each step we extract the longest string from the queue and answer the PrefixSearch query for it. Since the maximal length of suffixes in the queue cannot increase, we can assume that we must answer PrefixSearch queries for the suffixes of Q starting at positions i1 < i2 < ... < iq, where q = O(log^k d). Moreover, for each j the position ij is a multiple of w. We preprocess T as in Lemma 4.1, which requires O(w d) bits of space. We first run PrefixSearch for Q[i1..m] in O(m/w) time. Suppose it follows the path labelled by a prefix of Q[i1..m]. Let S be an arbitrary string that ends below the end of this path. We then find the leaf corresponding to S. By construction of T and because i2 is a multiple of w, such a leaf must exist. We then use a weighted level ancestor query to find the end of the path labelled by the longest suffix of the matched prefix whose starting position in Q is a multiple of w, and continue the PrefixSearch for Q[i2..m] from there, and so on. The total running time is O(m/w + log^k d).
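The scheduling idea, extracting pending suffix queries in order of non-increasing length, can be sketched with a binary heap keyed by starting position (a smaller start means a longer suffix). This is a toy model of the queue discipline only, not the paper's implementation.

```python
import heapq

def process_suffix_queries(q, starts):
    """Answer pending rooted PrefixSearch queries for suffixes q[i:]
    longest-first: the heap pops the smallest starting position first,
    so the maximal suffix length in the queue never increases."""
    heap = list(starts)
    heapq.heapify(heap)
    answered = []
    while heap:
        i = heapq.heappop(heap)   # longest remaining suffix q[i:]
        answered.append(q[i:])
    return answered
```

Because the suffixes come out in left-to-right order of starting position, a single scan over Q suffices to serve all of them, which is what yields the O(m/w + log^k d) total bound above.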

If m ≥ w, as we assumed earlier, the lemmas of this section and the discussion above give Theorem 4.2.

5 Removing extra logarithm from the time complexity

In this section we improve the query time to O(m/w + log^k d + occ) and show our final result.

Assume a constant-size alphabet. For a dictionary of d strings of length m and for any l ≥ 0, let H_l be the l-th empirical entropy of the concatenation of all strings in the dictionary. There exists a data structure for dictionary look-ups with k mismatches that uses dm H_l + o(dm) + O(w d log^k d) bits of space and has query time O(m/w + log^k d + occ), where w is the size of a machine word.

As explained in Theorem 4.2, we can assume m ≥ w. Recall that the dictionary look-up with k mismatches is run recursively. The first k − 1 levels of recursion require O(log^{k-1} d) PrefixSearch queries and can be implemented in O(m/w + log^{k-1} d) time. Therefore, it suffices to improve the runtime of the two last levels of the recursion, where we must perform a batch of O(log^{k-1} d) dictionary look-up queries with one mismatch. To achieve the desired complexity we use the fact that the queries are related, as explained below.

Preprocessing.

For a string S = s_1 s_2 ... s_m we define its reverse rev(S) = s_m ... s_2 s_1. First, we build a compact trie T_R on the reverses of all the dictionary strings and preprocess it as described in Lemma 4.1, which takes O(w d) bits of space. We store the reverses using the Ferragina–Venturini representation (Lemma 4.2) in dm H'_l + o(dm) bits of space, where H'_l is the l-th empirical entropy of the reverse of the concatenation of all the strings in the dictionary. By [15, Theorem A.3], H'_l is within lower-order terms of H_l. For the second step, we need Karp–Rabin fingerprints. We modify the standard definition as we work with packed strings.

[Karp–Rabin fingerprints [24]] Consider a string S and its packed representation S = s_1 s_2 ... s_t, where each s_i occupies one machine word. (If the length of S is not a multiple of w, we append an appropriate number of zeros.) The Karp–Rabin fingerprint of S is defined as phi(S) = (sum_{i=1}^{t} s_i r^{t-i}) mod p, where p is a fixed prime number and r is a randomly chosen integer in [0, p − 1].
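A minimal sketch of this packed fingerprint, with W bytes standing in for one machine word; the constants W, P and R are illustrative choices, not the paper's.

```python
W = 8                      # bytes per "machine word" (assumption for the sketch)
P = (1 << 61) - 1          # a fixed prime
R = 1234567891011          # in the real construction, chosen at random in [0, P)

def pack(s: bytes):
    """Split s into W-byte chunks, appending zeros to fill the last chunk."""
    if len(s) % W:
        s = s + b"\x00" * (W - len(s) % W)
    return [int.from_bytes(s[i:i + W], "big") for i in range(0, len(s), W)]

def fingerprint(s: bytes) -> int:
    """phi(S) = sum_i s_i * R^(t-i) mod P, evaluated by Horner's rule,
    one O(1) step per machine word of the packed string."""
    f = 0
    for chunk in pack(s):
        f = (f * R + chunk) % P
    return f
```

Evaluating by Horner's rule processes one word per step, so a packed string of length m is fingerprinted in O(m/w) time, matching the bounds used below.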

From the definition it follows that if two strings are equal, their fingerprints are equal. Furthermore, it is well-known that for any p and r chosen as above, the probability of two distinct strings of length l having the same fingerprint (collision probability) is less than l/p. Consider a trie T' of the k-errata tree. By definition, the length of the leaf labels in T' is at most m. From the bound on the collision probability it follows that we can choose p and r so that the fingerprints of the reverses of these labels are distinct. For each leaf of T', we compute the Karp–Rabin fingerprint of the reverse of its label and add it to a dictionary (Lemma 4) associated with T'. Also, using the same p and r, we compute Karp–Rabin fingerprints corresponding to inner nodes of the tries of the k-errata tree. Namely, consider one such node, and let L be its label and m' be the length of the strings in its trie. We take the reverse of L, prepend it with m' − |L| zeros, and compute the Karp–Rabin fingerprint of the resulting string.
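The choice of p and r can be made effective by redrawing r until all reversed leaf labels receive distinct fingerprints; by the collision bound, the expected number of redraws is constant. The sketch below is illustrative: W, P and the helper names are assumptions, not the paper's.

```python
import random

W = 8                      # bytes per "machine word" (assumption)
P = (1 << 61) - 1          # fixed prime

def fingerprint(s: bytes, r: int) -> int:
    """Packed Karp-Rabin fingerprint of s with base r, as defined above."""
    if len(s) % W:
        s = s + b"\x00" * (W - len(s) % W)
    f = 0
    for i in range(0, len(s), W):
        f = (f * r + int.from_bytes(s[i:i + W], "big")) % P
    return f

def choose_r(labels):
    """Redraw r until the reversed labels get pairwise distinct
    fingerprints; each draw fails with probability at most
    (number of pairs) * (max length) / P."""
    while True:
        r = random.randrange(P)
        fps = {fingerprint(lab[::-1], r) for lab in labels}
        if len(fps) == len(labels):
            return r
```

This turns the Monte Carlo fingerprints into a structure that is correct with certainty after preprocessing, which is what the deterministic query test below relies on.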

Queries.

We must run O(log^{k-1} d) dictionary look-up queries with one mismatch. Consider one of these queries, let it be a query for a string P (which must be a suffix of Q) in a trie T', and recall the algorithm of Section 3. First, we run a PrefixSearch to find the longest path that is labelled by a prefix of P. For this step we can use Lemma 4.1, as the total number of such queries is O(log^{k-1} d) and therefore we can spend O(log d) time on each of them. Suppose that the search traverses the heavy paths P_1, ..., P_t and leaves the heavy path P_j at a position u_j. We can find the positions u_1, ..., u_t in O(t) time once we have found the end of the search. The rest of the algorithm can be described as follows. First, we must perform dictionary look-ups with no further mismatches (i.e., PrefixSearch) in vertical and horizontal errata trees (that are tries of the k-errata tree by definition). Second, for each j < t, we must perform a dictionary look-up with no further mismatches (PrefixSearch) from a position that follows u_j in the heavy path P_j. Importantly, each such u_j is a node. Finally, we must perform a dictionary look-up with no further mismatches (PrefixSearch) from a position that follows u_t in the heavy path P_t.

We note that to perform the PrefixSearch from the position following u_t we can use Lemma 4.1, as before, because the total number of such PrefixSearch operations is O(log^{k-1} d). We now explain how we perform the PrefixSearch operations in vertical and horizontal errata trees, as well as the PrefixSearch operations from the nodes u_j, j < t. In total, we must perform O(log^k d) such operations, and for each of them the query string is a suffix of Q. Let Q[i_1..m], ..., Q[i_q..m] be the suffixes of Q for which we are to run a PrefixSearch. We create a bitvector of length m where the bit i_j is set for each j. We then compute the Karp–Rabin fingerprints of the reverses of Q[i_1..m], ..., Q[i_q..m] in O(m/w + log^k d) time using the following fact.

Fact .

Given the Karp–Rabin fingerprints of strings S1 and S2, where the length of S1 is a multiple of w, we can compute the Karp–Rabin fingerprint of their concatenation S1 S2 in O(1) time.
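Under the chunk-polynomial definition above, the fact amounts to phi(S1 S2) = phi(S1) * r^{t2} + phi(S2) mod p, where t2 is the number of words of S2: when |S1| is a multiple of w, the chunk boundaries of S1 S2 align with those of S1 and S2. A sketch with illustrative constants:

```python
W = 8
P = (1 << 61) - 1
R = 1234567891

def fp(s: bytes) -> int:
    """Packed Karp-Rabin fingerprint (Horner's rule over W-byte chunks)."""
    if len(s) % W:
        s += b"\x00" * (W - len(s) % W)
    f = 0
    for i in range(0, len(s), W):
        f = (f * R + int.from_bytes(s[i:i + W], "big")) % P
    return f

def fp_concat(f1: int, f2: int, chunks2: int) -> int:
    """Fingerprint of S1+S2 from phi(S1) and phi(S2), valid when |S1| is a
    multiple of W; with R^chunks2 precomputed this takes O(1) time."""
    return (f1 * pow(R, chunks2, P) + f2) % P
```

In the data structure the needed powers of R are precomputed once, so each combination is a single multiply-add modulo P.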

We iterate over the blocks of w bits of the bitvector starting from the last one and maintain the Karp–Rabin fingerprint of the reverse of the suffix of Q that starts at the current position. When we start a new block, we update the Karp–Rabin fingerprint. If a block contains set bits (which we can decide in constant time), we extract the positions of all set bits in O(1) time per bit using the technique of [19], and compute the corresponding Karp–Rabin fingerprints. Also, as a preliminary step, we run a PrefixSearch for the reverse of Q in the compact trie T_R on the reverses of the dictionary strings in O(m/w) time. Let v be the position where this PrefixSearch terminates.
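The right-to-left sweep can be sketched as follows: it maintains the fingerprint of the reversed suffix one W-byte block at a time and reports it at the set positions, which (as in the text) are assumed to be multiples of the word size. Constants and names are illustrative.

```python
W = 8
P = (1 << 61) - 1
R = 987654321

def fp(s: bytes) -> int:
    """Packed Karp-Rabin fingerprint; len(s) assumed a multiple of W here."""
    f = 0
    for i in range(0, len(s), W):
        f = (f * R + int.from_bytes(s[i:i + W], "big")) % P
    return f

def suffix_fingerprints(q: bytes, set_positions):
    """Fingerprints of rev(q[i:]) for every i in set_positions, computed in
    one right-to-left sweep over W-byte blocks: extending the reversed
    suffix by one block appends one chunk, a single Horner step."""
    wanted, out, f = set(set_positions), {}, 0
    for start in range(len(q) - W, -1, -W):
        block = q[start:start + W][::-1]   # letters of this block, reversed
        f = (f * R + int.from_bytes(block, "big")) % P
        if start in wanted:
            out[start] = f
    return out
```

The sweep touches each block once, so all requested fingerprints are produced in O(m/w) time plus O(1) per reported position.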

PrefixSearch in vertical and horizontal errata trees. Assume we must answer a PrefixSearch for a suffix P of Q on a tree T'. We search for the fingerprint of the reverse of P in the dictionary associated with T'. The search returns at most one leaf of the tree. We know that its label is equal to P with high probability, but we need a deterministic answer. We test the leaf as follows. Let S be one of the dictionary strings such that its id is stored at the leaf. We find the leaf of the compact trie T_R on the reverses of the dictionary strings that corresponds to the reverse of S. Now, we can compute the length of the longest common prefix of the reverses of S and Q in constant time via a lowest common ancestor query for this leaf and the position v, and check that it is indeed equal to or larger than the length of P.

PrefixSearch from u_j, j < t. This step is equivalent to the following: find all the strings in the trie that start with the label of u_j and end with a given suffix of Q. We can compute the Karp–Rabin fingerprints of the reverses of the strings that we are looking for as follows. The positions u_j, j < t, are necessarily nodes, and we store the Karp–Rabin fingerprints of the reverses of their labels. Recall that if L was the label of u_j, we prepended the reverse of L with m' − |L| zeros, where m' is the length of the strings in the trie containing u_j. It follows that we can compute the fingerprint of the reverse of the label of u_j prepended with zeros in O(1) time. Knowing this fingerprint and the fingerprint of the reverse of the corresponding suffix of Q, we can compute the fingerprint of the strings we are searching for in constant time. We note that prepending with zeros is necessary in order to align the borders of the blocks in the reverse of the label of u_j and the reverse of the suffix of Q. We finish the computation as above, that is, we find a leaf such that the fingerprint of the reverse of its label is equal to the fingerprint of the strings we are looking for, and test it using the trie on the reverses of the dictionary strings.
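The zero-padding trick can be checked on a small example: since the fingerprint is linear in the chunk values, the fingerprint of rev(L + Qpart) is the sum (mod p) of the fingerprint of rev(L) prepended with zeros and the fingerprint of rev(Qpart) followed by zeros, because the padding aligns the chunk boundaries. A sketch with illustrative constants (W, P, R and the helper names are assumptions):

```python
W = 4
P = (1 << 61) - 1
R = 1_000_003

def fp(s: bytes) -> int:
    """Packed Karp-Rabin fingerprint; len(s) must be a multiple of W."""
    assert len(s) % W == 0
    f = 0
    for i in range(0, len(s), W):
        f = (f * R + int.from_bytes(s[i:i + W], "big")) % P
    return f

def target_fingerprint(rev_label_padded_fp: int, rev_qpart: bytes, m: int) -> int:
    """phi(rev(L + Qpart)) = phi(0^{m-|L|} rev(L)) + phi(rev(Qpart) 0^{|L|}):
    the zero-padding makes both summands m letters long, so corresponding
    chunks simply add up and phi is linear in the chunk values."""
    padded = rev_qpart + b"\x00" * (m - len(rev_qpart))
    return (rev_label_padded_fp + fp(padded)) % P
```

The first summand is precomputed per node at preprocessing time, so at query time one addition modulo P yields the fingerprint of the sought strings.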

References

  • [1] Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2), May 2007.
  • [2] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC’15, pages 793–801, 2015.
  • [3] Djamal Belazzougui. Faster and space-optimal edit distance “1” dictionary. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’09, pages 154–167, 2009.
  • [4] Djamal Belazzougui and Rossano Venturini. Compressed string dictionary look-up with edit distance one. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’12, pages 280–292, 2012.
  • [5] Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’17, pages 6:1–6:11, 2017.
  • [6] Thomas Bocek, Ela Hunt, Burkhard Stiller, and Fabio Hecht. Fast similarity search in large dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, 2007.
  • [7] Gerth Stølting Brodal and Leszek Gasieniec. Approximate dictionary queries. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’96, pages 65–74, 1996.
  • [8] Gerth Stølting Brodal and Srinivasan Venkatesh. Improved bounds for dictionary look-up with one error. Inf. Process. Lett., 75:57–59, 2000.
  • [9] Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. Compressed indexes for approximate string matching. Algorithmica, 58:263–281, 2006.
  • [10] Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’06, pages 45–59, 2006.
  • [11] Timothy Chan and Moshe Lewenstein. Fast string dictionary lookup with one error. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’15, pages 114–123, 2015.
  • [12] Aleksander Cisłak and Szymon Grabowski. A practical index for approximate dictionary matching with few mismatches. Computing & Informatics, 36(5):1088–1106, 2017.
  • [13] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proc. of the 36th Annual ACM Symposium on Theory of Computing, STOC’04, pages 91–100, 2004.
  • [14] Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’96, pages 130–140, 1996.
  • [15] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, July 2005.
  • [16] Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115–121, 2007.
  • [17] Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proc. of the Annual Symposium on Combinatorial Pattern Matching, CPM’15, pages 160–171, 2015.
  • [18] Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proc. of the Annual Conference on Combinatorial Pattern Matching, CPM’06, pages 36–48, 2006.
  • [19] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424–436, December 1993.
  • [20] Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Proc. of the Annual European Symposium on Algorithms, ESA’14, pages 455–466, 2014.
  • [21] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984.
  • [22] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Compressed dictionary matching with one error. In Proc. of the Data Compression Conference, DCC’11, pages 113–122, 2011.
  • [23] Daniel Karch, Dennis Luxen, and Peter Sanders. Improved fast similarity search in dictionaries. In Proc. of the International Symposium on String Processing and Information Retrieval, SPIRE’10, pages 173–178, 2010.
  • [24] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, March 1987.
  • [25] Tak Wah Lam, Wing-Kin Sung, and Swee-Seong Wong. Improved approximate string matching using compressed suffix data structures. Algorithmica, 51:298–314, 2005.
  • [26] Giovanni Manzini. An analysis of the Burrows–Wheeler transform. J. ACM, 48(3):407–430, May 2001.
  • [27] Moshe Mor and Aviezri S. Fraenkel. A hash code method for detecting and correcting spelling errors. Commun. ACM, 25(12):935–938, December 1982.
  • [28] Milan Ružić. Uniform deterministic dictionaries. ACM Trans. Algorithms, 4(1):1:1–1:23, March 2008.
  • [29] Takuya Takagi, Shunsuke Inenaga, Kunihiko Sadakane, and Hiroki Arimura. Packed compact tries: A fast and efficient data structure for online string processing. In Proc. of the 27th International Workshop on Combinatorial Algorithms, volume 9843 of IWOCA’16, pages 213–225. Springer, 2016.
  • [30] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
  • [31] Andrew Chi-Chih Yao and Foong Frances Yao. Dictionary look-up with one error. J. Algorithms, 25:194–202, 1997.

Appendix A Missing proofs

Let the weight of a node v of a weight-balanced tree be the sum of the weights of the leaves in the subtree rooted at v.

Fact .

The weights of the nodes in any root-to-leaf path of a weight-balanced tree decrease. Moreover, if (u, v) is an edge in the path, and v is not a leaf, then the weight of v is at least two times smaller than the weight of u.

The proofs of the following lemmas follow closely the proofs given by Cole, Gottlieb, and Lewenstein [13], and we provide them only for completeness. We note that the upper bounds in [13] are more precise, but the ones below suffice for our purposes.

See 3

Proof.

Let N_k(d) denote the number of strings in the tries of the k-errata tree for a dictionary of size d. We will show by induction on k that N_k(d) can be upper-bounded by

N_k(d) ≤ 2^k d C(log d' + k, k),

where d' is the nearest power of two larger than d and C(·, ·) denotes a binomial coefficient. The right-hand side of the inequality is O(d log^k d), which will give the claim of the lemma. Indeed, using the inequality C(a, k) ≤ (e a / k)^k we can upper bound the right-hand side by d (c log d' / k)^k for some constant c. For d large enough, log d' ≤ 2 log d, so N_k(d) = O(d log^k d).

For k = 0 we have N_0(d) = d as the 0-errata tree contains only one trie, so the base case holds. Let now k ≥ 1. Recall that we start the construction of the k-errata tree by creating a compact trie on the d strings, and then build the (k − 1)-errata trees recursively. We now estimate the number of strings in the tries of the (k − 1)-errata trees. For that, we will consider each string in the dictionary, and will estimate its contribution to the total size of the trees. Let S correspond to a leaf of the trie such that the path from the root to this leaf passes through heavy paths P_1, P_2, ..., P_t, leaving the heavy path P_j at a node u_j.

For each heavy path P_j we build a number of vertical (k − 1)-errata trees that contain S. The claim is that there are at most two such trees containing at most d strings, at most two trees containing at most d/2 strings, and so on. Indeed, from the definition of heavy paths it follows that the total weight of the nodes in the heavy path P_{j+1} is at least two times smaller than the total weight of the nodes in the heavy path P_j. Furthermore, consider the sequence of the vertical (k − 1)-errata trees containing S in the top-down order. By Fact A we obtain that the size of the trees in this sequence must decrease by a factor of at least two each time, except for when we arrive at a leaf of a weight-balanced tree corresponding to a heavy path P_j, where it can decrease just by one. Therefore, over every two consecutive trees the size decreases by at least a factor of two. The claim follows.

A similar claim holds for the horizontal (k − 1)-errata trees: among the trees containing S, there are at most two trees of size at most d, at most two trees of size at most d/2, and so on. The string S belongs to the horizontal (k − 1)-errata trees associated with the nodes u_1, ..., u_t only. The total size of the horizontal (k − 1)-errata trees for S decreases by a factor of at least two each time we switch paths (recall that we build these trees for all the children of u_j but the heavy one). Consider the sequence of the horizontal (k − 1)-errata trees containing S in the top-down order. By Fact A, the size of the trees in this sequence must decrease by a factor of at least two each time, except for when we arrive at a leaf of a weight-balanced tree corresponding to a heavy path P_j, where it can decrease just by one. Therefore, over every two consecutive trees the size decreases by at least a factor of two, and we obtain the claim as before.

Let d' be the nearest power of two larger than d. By the induction hypothesis and because N_{k−1} is a non-decreasing function of its argument, we have

N_k(d) ≤ d + 4 (N_{k−1}(d') + N_{k−1}(d'/2) + ... + N_{k−1}(1)).

Plugging in the expression for N_{k−1} and simplifying the sums of binomial coefficients (we use the fact that log d' is an integer), we obtain the claimed bound.

See 3

Proof.

Let Q_k(d) be the number of PrefixSearch operations that we run while performing a dictionary look-up with k mismatches for a dictionary of size d. We will show by induction on k that Q_k(d) can be upper-bounded by

Q_k(d) ≤ 2^k C(log d' + k, k),

where d' is the nearest power of two larger than d and C(·, ·) denotes a binomial coefficient. We have Q_0(d) = 1, so the base case holds. Let now k ≥ 1. Let P_1, ..., P_t be the heavy paths traced by the PrefixSearch for the query string, and u_1, ..., u_t be the positions where the search leaves the paths.

First, we estimate the number of PrefixSearch operations in vertical (k − 1)-errata trees that we run for the patterns of Type 1. Consider the weight-balanced tree for a heavy path P_j, and the path from the root of this tree to u_j. The set of nodes covering the part of P_j from its head to u_j is the children of the nodes in this path hanging off to the left. Recall that each node of the weight-balanced tree has at most two children hanging off to the left. Since the weight of each node in this path is at least two times larger than the weight of the next one, and because of Fact A, we obtain that the number of PrefixSearch operations in vertical (k − 1)-errata trees is at most 2 (Q_{k−1}(d') + Q_{k−1}(d'/2) + ... + Q_{k−1}(1)).

Analogously, we upper bound the number of PrefixSearch operations in horizontal (k − 1)-errata trees that we run for the patterns of Type 2 by 2 (Q_{k−1}(d') + Q_{k−1}(d'/2) + ... + Q_{k−1}(1)).

Finally, we upper bound the number of PrefixSearch operations that we run in the same k-errata tree, i.e. from the nodes following u_j in P_j (these are the operations we run for the patterns of Types 2 and 3), by t ≤ log d'. Summarizing, we obtain

Q_k(d) ≤ log d' + 4 (Q_{k−1}(d') + Q_{k−1}(d'/2) + ... + Q_{k−1}(1)).

Analogously to the space bound, we can show that Q_k(d) = O(log^k d). We also spend O(1) extra time for each PrefixSearch operation to find the starting node of the PrefixSearch, which is O(log^k d) time in total. Finally, we spend O(occ) time to output the desired dictionary strings. ∎

See 4.1

Proof.

At the preprocessing step, we traverse the compact trie T on the dictionary strings and remember the leftmost and the rightmost leaves in each of its subtrees. We also remember the neighbours of each leaf in the left-to-right order, and finally we preprocess the trie for lowest common ancestor queries. We then create a predecessor data structure for each trie T' of the k-errata tree. By definition, T' is built on a subset of the dictionary strings, and therefore the rank of each string in T' relative to T is well-defined. The predecessor data structure is built on the set of the ranks of the strings in T'. We also preprocess each trie for lowest common ancestor and weighted level ancestor queries.

Suppose we would like to answer a rooted PrefixSearch query for a string P and a trie T' of the k-errata tree. As we store the leftmost and the rightmost leaves in each subtree of T, we can find the predecessor, and therefore the successor, of P in T in O(1) time. Using the predecessor data structure, we can further find the predecessor S1 and the successor S2 of P in T' in O(log log d) time. We then compute the lowest common ancestor of S1 and S2 in O(1) time. We use two lowest common ancestor queries in T to find the lengths l1 and l2 of the longest common prefixes of P and S1 and of P and S2. If l1 = l2, we return the lowest common ancestor of S1 and S2 as the answer. If l1 > l2, then the answer is the ancestor of S1 such that the length of its label is l1. We can find it by one weighted level ancestor query. The case l1 < l2 is analogous. ∎

See 4.1

Proof.

The preprocessing step repeats that of Lemma 4.1. We also store the first letter on each edge of each trie. The search path for P traverses a number of heavy paths of T'. Consider the heavy path of T' containing the starting position u, and let L be the label of the part of this heavy path starting from u and up to the last node. If we know the end of the PrefixSearch query for P in T, we can compute the length of the longest common prefix of P and L using one lowest common ancestor query on T. We can then find the node in the path corresponding to the end of this longest common prefix using one weighted level ancestor query. From there, we can find the starting node of the second heavy path traversed by P in O(1) time. It remains to answer a rooted PrefixSearch query for P on the subtree rooted at this node, which belongs to a