Weighted Ancestors in Suffix Trees Revisited

02/28/2021 ∙ by Djamal Belazzougui, et al. ∙ Helsingin yliopisto ∙ University of Leicester ∙ Mail.Ru Group

The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided O(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an O(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in O(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.


1 Introduction

The suffix tree is one of the central data structures in stringology. Its primary application is to determine whether an arbitrary string t occurs as a substring of another string, s, which can be done in time proportional to the length of t by traversing the suffix tree of s downward from the root and reading off the symbols of t along the way. Many algorithms using suffix trees perform this procedure for substrings s[p..q] of s itself. In this important special case, the traversal can be executed much faster than O(q − p + 1) time provided the tree has been preprocessed to build some additional data structures [1, 11, 13]; particularly surprising is that the traversal can be performed in constant time using only linear space, as shown by Gawrychowski et al. [13]. In this paper, we describe the first linear construction algorithm for such a data structure. The lack of an efficient construction algorithm for the result of [13] has been, apparently, the main obstacle hindering its wider adoption. Our solution is completely different from that of [13] and is based on a combinatorial result of Kärkkäinen et al. [17] (see also [16]) that estimates the sum of irreducible LCP values (precise definitions follow).

As one might expect, the described data structure has a multitude of applications: [1, 7, 8, 11, 19, 20, 21], to name a few. We, however, do not dive further into a discussion of these applications and refer the reader to the overview in [13, Sect. 1] and references therein for more details.

In order to “locate” a substring s[p..q] in the suffix tree of s, it suffices to answer the following query: given a node v of the tree that corresponds to the suffix of s starting at position p (usually, it is a leaf), we should find the farthest (closest to the root) ancestor u of this node such that the string written on the root–u path has length at least q − p + 1. This problem is a particular case of the general weighted ancestor problem [1, 11, 22]: given a tree whose nodes are associated with integer weights such that the weights decrease if one ascends from any node to the root (the weight of a node in the suffix tree is the length of the string written on the root–node path), the tree should be preprocessed in order to answer weighted ancestor queries that, for a given node v and a number w, return the farthest ancestor of v whose weight is at least w. The problem can be viewed as a generalization to trees of the predecessor search problem, in which we preprocess a set of integers to support predecessor queries: for any given number x, return the largest integer from the set that precedes x.
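The weighted ancestor query just defined can be illustrated by a naive sketch that simply climbs from the node toward the root. This is not the constant-time structure developed in the paper, only a baseline that pins down what the query returns. The names `parent`, `weight`, and `weighted_ancestor` are hypothetical, and the sketch assumes that a node counts as its own ancestor.

```python
from typing import List, Optional


def weighted_ancestor(parent: List[Optional[int]], weight: List[int],
                      v: int, w: int) -> Optional[int]:
    """Return the farthest ancestor of v (v itself included, by assumption)
    whose weight is at least w, or None if even v is too light.

    Weights are assumed to decrease strictly when moving toward the root.
    """
    if weight[v] < w:
        return None
    u = v
    # Climb while the parent still satisfies the weight condition.
    while parent[u] is not None and weight[parent[u]] >= w:
        u = parent[u]
    return u


# Toy tree: node 0 is the root; 1 is a child of 0; 2 and 3 are children of 1.
parent = [None, 0, 1, 1]
weight = [0, 2, 5, 4]
answer = weighted_ancestor(parent, weight, 2, 2)  # climbs from node 2 to node 1
```

The loop takes time proportional to the length of the path, which is exactly the cost that the data structure of this paper avoids.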

Clearly, any linear-space solution for weighted ancestors can be used as a solution for predecessor search. As was shown in [22] and [13], a certain converse reduction is also possible: the weighted ancestor queries for a tree with n nodes and integer weights from a range [0..U] can be answered in O(pred(n, U)) time using linear space, where pred(n, U) is the time required to answer predecessor search queries for any set of n integers from the range [0..U] using O(n) space. Therefore, when U = n (as in the case of the suffix tree), both problems can be solved in linear space with O(log log n)-time queries using the standard van Emde Boas or y-fast trie data structures [30, 31].

Due to the lower bound of Pătraşcu and Thorup [25], the O(log log n) time is optimal for weights from [0..n] when the available space is linear, and moreover, any solution of the weighted ancestor problem that uses O(n polylog n) space must spend Ω(log log n) time on queries (see [13, Appendix A]). In view of this lower bound, it is all the more unexpected that the special case of suffix trees admits an O(n)-space solution with constant query time. In order to circumvent the lower bound, Gawrychowski et al. [13] solve predecessor search problems on some paths of the suffix tree using bits of space (or slightly less), which admits a constant time solution by a so-called rank data structure; because of the internal repetitive structure of the suffix tree, the solution for one path can be reused in many different paths in such a way that, in total, the utilized space is linear. Our approach essentially relies on the same intuition, but we perform path predecessor queries on different trees closely related to the suffix tree, and the advantages of repetitive structures come implicitly; in particular, we do not explicitly treat periodic regions of the string separately so that, in this regard, our solution is more “uniform”, in a sense, than that of [13].

This paper is organized as follows. In Section 2 the problem is reduced to certain path-counting queries using (unweighted) level ancestor queries. Section 3 describes a simple solution for the queries locating substrings whose position corresponds to an irreducible LCP value. In Section 4 we reduce general queries to the queries at irreducible positions and a certain geometric orthogonal predecessor problem. Section 5 describes a solution for this special geometric problem. Conclusions and reflections are then offered in Section 6.

2 Basic Data Structures

Let us fix a string s of length n + 1, denoting its letters by s[0], s[1], …, s[n]. We write s[p..q] for the substring s[p]s[p+1]···s[q], assuming it is empty if p > q; s[p..q] is called a suffix (resp., prefix) of s if q = n (resp., p = 0). For any string t, let |t| denote its length. We say that t occurs at position p in s if s[p..p+|t|−1] = t. Denote [p..q] = {k ∈ ℤ : p ≤ k ≤ q}.

A suffix tree of s is a compacted trie containing all suffixes of s. The labels on the edges of the tree are stored as pointers to corresponding substrings of s. For each tree node v, denote by str(v) the string written on the root–v path. The number |str(v)| is called the string depth of v. The locus of a substring s[p..q] is the (unique) node v of minimal string depth such that s[p..q] is a prefix of str(v).

The string s and its suffix tree are the input to our algorithm. To simplify the exposition, we assume that s[n] is equal to a special letter $ that is smaller than all other letters in s, so that there is a one-to-one correspondence between the suffixes of s and the leaves of the suffix tree. Our computational model is the word-RAM and space is measured in Θ(log n)-bit machine words. Our goal is to construct in O(n) time a data structure that can find in the tree the locus of any substring s[p..q] of s in O(1) time.

A level ancestor query in a tree asks, for a given node v and an integer k, the kth node on the v–root path (provided the path has at least k nodes). It is known that any tree can be preprocessed in linear time to answer such queries in constant time [5, 6]. With such a structure, the locus of a substring s[p..q] can be found by first locating the leaf v of the suffix tree corresponding to s[p..n] (i.e., str(v) = s[p..n], which can be located using a precomputed array of length n + 1) and, then, counting the number k of nodes u on the v–root path such that |str(u)| ≥ q − p + 1; then, evidently, the locus of s[p..q] is given by the level ancestor query on v and k.

We have to complicate this scheme slightly since the machinery that we develop in the sequel allows us to count only those nodes on the v–root path that have a branch to the “left” of the path (or, symmetrically, to the “right”). More formally, given a node v, a node u on the v–root path is called left-branching (resp., right-branching) if there is a suffix s[i..n] that is lexicographically smaller (resp., greater) than the string str(v) and its longest common prefix with str(v) is str(u). For instance, the path from the leaf corresponding to the suffix s[6..n] in Figure 1 has four nodes but only one of them (namely, the root) is left-branching.

For any suffix tree, one can build in linear time a data structure that, for any node v and integer k, can return in O(1) time the kth left-branching (or right-branching) node on the v–root path.

Proof.

Traversing the suffix tree, we construct another tree on the same set of nodes in which the parent of each non-root node is either its nearest left-branching ancestor in the suffix tree (if any) or the root. The queries for left-branching nodes can be answered by the level ancestor structure [5, 6] built on this new tree. The right-branching case is symmetric. ∎
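The construction in this proof can be sketched on an arbitrary rooted tree with ordered children. An ancestor u of v is left-branching for v exactly when the path from u down to v leaves u through a child that is not its leftmost one, so it suffices to know, for every node, whether it is the leftmost child of its parent. The sketch below is only an illustration under stated assumptions: the names `left_branching_links` and `kth_left_branching` are hypothetical, nodes are assumed to be numbered so that `parent[v] < v` with root 0, `None` is used instead of the root as the "no such ancestor" sentinel, and the constant-time level ancestor structure of [5, 6] is replaced by a plain walk.

```python
from typing import List, Optional


def left_branching_links(parent: List[Optional[int]],
                         is_first_child: List[bool]) -> List[Optional[int]]:
    """link[v] = nearest proper ancestor of v that is left-branching for v,
    or None if there is none. Assumes parent[v] < v and that node 0 is the root."""
    link: List[Optional[int]] = [None] * len(parent)
    for v in range(1, len(parent)):
        u = parent[v]
        # u is left-branching for v iff v hangs off a non-leftmost edge of u;
        # otherwise v inherits the link of u.
        link[v] = u if not is_first_child[v] else link[u]
    return link


def kth_left_branching(link: List[Optional[int]], v: int, k: int) -> Optional[int]:
    """k-th (k >= 1) left-branching node met when walking from v to the root."""
    u: Optional[int] = v
    for _ in range(k):
        u = link[u]
        if u is None:
            return None
    return u


# Root 0 has children 1, 2 (in this order); node 2 has children 3, 4; node 4 has child 5.
parent = [None, 0, 0, 2, 2, 4]
is_first_child = [False, True, False, True, False, True]  # entry for the root is unused
link = left_branching_links(parent, is_first_child)
```

In the auxiliary tree defined by `link`, the kth left-branching node is the kth ancestor, which is what a level ancestor structure returns in constant time.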

Thus, to find the locus for s[p..q], we have to count on a leaf–root path the number of left- and right-branching nodes whose string depths are at least the threshold q − p + 1; we then use these two numbers in the data structure of Lemma 2 in order to find two candidate nodes, and the node with smaller string depth is the locus. In the remaining text, we focus only on this counting problem and only on left-branching nodes as the right-branching case is symmetric. First, however, we define a number of useful standard structures.

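The suffix array of s is an array SA[0..n] containing integers from 0 to n such that s[SA[0]..n] < s[SA[1]..n] < ··· < s[SA[n]..n] lexicographically [23]. The inverse suffix array, denoted ISA, is defined as SA[ISA[p]] = p, for all p ∈ [0..n]. For any positions p and q, denote by lcp(p, q) the length of the longest common prefix of s[p..n] and s[q..n]. The longest common prefix (LCP) array is an array LCP[0..n] such that LCP[0] = 0 and LCP[i] = lcp(SA[i], SA[i−1]), for i ∈ [1..n]. The permuted longest common prefix (PLCP) array is an array PLCP[0..n] such that PLCP[p] = LCP[ISA[p]], for p ∈ [0..n]. The Burrows–Wheeler transform [9] is a string BWT[0..n] such that BWT[i] = s[SA[i]−1] if SA[i] ≠ 0, and BWT[i] = s[n] otherwise. All the arrays SA, ISA, LCP, PLCP and the string BWT can be built from the suffix tree in O(n) time. Some of these structures are depicted in Figure 1.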
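The definitions above can be checked with a small quadratic-time sketch that builds all five structures directly from their definitions. It is not the linear-time construction referred to in the text. The function name `build_arrays` is hypothetical, and the example string `baddaddacca$` is an assumption: it is one string whose SA and LCP arrays agree with the values listed in Figure 1, not necessarily the string used there.

```python
from typing import List, Tuple


def build_arrays(s: str) -> Tuple[List[int], List[int], List[int], List[int], str]:
    """Naive construction of SA, ISA, LCP, PLCP and BWT for a string s whose
    last letter is a sentinel smaller than all other letters."""
    m = len(s)
    # Suffix array: starting positions sorted by the suffixes they start.
    sa = sorted(range(m), key=lambda p: s[p:])
    isa = [0] * m
    for i, p in enumerate(sa):
        isa[p] = i

    def lcp(p: int, q: int) -> int:
        k = 0
        while p + k < m and q + k < m and s[p + k] == s[q + k]:
            k += 1
        return k

    # LCP[0] = 0 and LCP[i] = lcp(SA[i], SA[i-1]).
    lcp_arr = [0] + [lcp(sa[i], sa[i - 1]) for i in range(1, m)]
    # PLCP[p] = LCP[ISA[p]].
    plcp = [lcp_arr[isa[p]] for p in range(m)]
    # BWT[i] = s[SA[i]-1], or the last letter of s when SA[i] = 0.
    bwt = "".join(s[p - 1] if p != 0 else s[-1] for p in sa)
    return sa, isa, lcp_arr, plcp, bwt


sa, isa, lcp_arr, plcp, bwt = build_arrays("baddaddacca$")
```

Python compares strings by code point, so `$` sorts before the lowercase letters, which matches the assumption on the sentinel.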

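i    LCP[i]   SA[i]   sorted suffix
0    0        11      s[11..n]
1    0        10      s[10..n]
2    1        7       s[7..n]
3    1        4       s[4..n]
4    4        1       s[1..n]
5    0        0       s[0..n]
6    0        9       s[9..n]
7    1        8       s[8..n]
8    0        6       s[6..n]
9    2        3       s[3..n]
10   1        5       s[5..n]
11   3        2       s[2..n]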
Figure 1: The suffix tree, SA, LCP, and BWT of the string s. All irreducible LCP values are in bold and their sum is 7; all positions of s except 2, 3, and 4 are irreducible.

3 Queries at Irreducible Positions

An LCP value LCP[i] is called irreducible if either i = 0 or BWT[i−1] ≠ BWT[i]. In other words, the irreducible LCP values are those values of LCP that correspond to the first positions in the runs of BWT (e.g., all values except for LCP[3], LCP[9], and LCP[11] in Figure 1). The central combinatorial fact that is at the core of our construction is the following surprising lemma proved by Kärkkäinen et al. [17].

[see [17]] The sum of all irreducible LCP values is O(n log n).

We say that a position p of the string s is irreducible if PLCP[p] (= LCP[ISA[p]]) is an irreducible LCP value.
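As a small illustration of these definitions, the following sketch marks the irreducible LCP values through the runs of the BWT and lists the irreducible positions. The function name `irreducible_info` is hypothetical, and the example string `baddaddacca$` is an assumption chosen so that its SA and LCP arrays agree with the values listed in Figure 1.

```python
from typing import List, Tuple


def irreducible_info(s: str) -> Tuple[List[int], int]:
    """Return (sorted list of irreducible positions, sum of irreducible LCP values)
    for a string s ending with a sentinel smaller than all other letters."""
    m = len(s)
    sa = sorted(range(m), key=lambda p: s[p:])
    bwt = [s[p - 1] if p != 0 else s[-1] for p in sa]

    def lcp(p: int, q: int) -> int:
        k = 0
        while p + k < m and q + k < m and s[p + k] == s[q + k]:
            k += 1
        return k

    lcp_arr = [0] + [lcp(sa[i], sa[i - 1]) for i in range(1, m)]
    # LCP[i] is irreducible iff i = 0 or a new run of the BWT starts at i.
    irreducible = [i == 0 or bwt[i - 1] != bwt[i] for i in range(m)]
    positions = sorted(sa[i] for i in range(m) if irreducible[i])
    total = sum(lcp_arr[i] for i in range(m) if irreducible[i])
    return positions, total


positions, total = irreducible_info("baddaddacca$")
```

On this string only the positions 2, 3, and 4 are not irreducible, and the irreducible LCP values sum to 7.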

Consider a query that asks to find the locus of a substring s[p..q] such that p is an irreducible position. Denote ℓ_p = PLCP[p]. We first locate in O(1) time the leaf v corresponding to s[p..n] using a precomputed array. As was discussed above, the problem is reduced to the counting of left-branching nodes on the v–root path whose string depth is at least q − p + 1. We will construct for p a bit array N_p of length ℓ_p such that, for any d ∈ [0..ℓ_p − 1], N_p[d] = 1 iff there is a left-branching node with string depth d on the v–root path. Since the suffix s[p..n] and its lexicographical predecessor among all suffixes (if any, i.e., if ISA[p] ≠ 0) have the longest common prefix of length ℓ_p, the lowest left-branching node on the v–root path has string depth ℓ_p. Therefore, the number of left-branching nodes on the v–root path whose string depth is at least q − p + 1 is equal to the number of ones in the subarray N_p[q − p + 1..ℓ_p − 1] plus 1. The bit counting query is answered using the following rank data structure.

[see [10, 15]] For any bit array of length m packed into Θ(log n)-bit machine words, one can construct in O(⌈m / log n⌉) time a rank data structure that can count the number of ones in any range in O(1) time, provided a table computable in o(n) time and independent of the array has been precomputed.
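A rank data structure of this kind can be sketched as follows: the bits are packed into words, the number of ones before each word is stored, and a range count is the difference of two prefix counts. The class name `RankBits` and the 64-bit word size are assumptions made for illustration, and the precomputed lookup table of the lemma is replaced here by Python's built-in bit counting on a single word.

```python
from typing import List, Sequence


class RankBits:
    """Bit array with per-word prefix counts of ones (a sketch, not the
    succinct structure of [10, 15])."""

    WORD = 64  # assumed word size

    def __init__(self, bits: Sequence[int]) -> None:
        self.n = len(bits)
        self.words: List[int] = []
        self.prefix: List[int] = [0]  # prefix[k] = number of ones in the first k words
        for start in range(0, self.n, self.WORD):
            w = 0
            for j, b in enumerate(bits[start:start + self.WORD]):
                if b:
                    w |= 1 << j
            self.words.append(w)
            self.prefix.append(self.prefix[-1] + bin(w).count("1"))

    def rank1(self, i: int) -> int:
        """Number of ones among bits[0], ..., bits[i-1], for 0 <= i <= n."""
        q, r = divmod(i, self.WORD)
        total = self.prefix[q]
        if r:
            # Count the ones among the lowest r bits of the partially covered word.
            total += bin(self.words[q] & ((1 << r) - 1)).count("1")
        return total

    def count(self, lo: int, hi: int) -> int:
        """Number of ones among bits[lo], ..., bits[hi] (inclusive range)."""
        return self.rank1(hi + 1) - self.rank1(lo) if lo <= hi else 0


bits = [1, 0, 1, 1, 0] * 30  # 150 bits, spanning three words
rb = RankBits(bits)
```

Each query touches two prefix counters and at most two words, so its cost does not depend on the length of the queried range.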

By Lemma 3, the total length of all arrays for all irreducible positions of is . Therefore, one can construct the rank data structures of Lemma 3 for them in overall time provided . It remains to build the arrays themselves.

To this end, we traverse the suffix tree in depth first (i.e., lexicographical) order, maintaining along the way a bit array of length such that, when we are in a node , the subarray marks the string depths of all left-branching ancestors of , i.e., for , we have iff there is a left-branching node with string depth on the –root path. The array is stored in a packed form in -bit machine words. For each node , we consider its outgoing edges. Each pair of adjacent edges (in the lexicographical order of their labels) corresponds to a unique value such that is the suffix corresponding to the leftmost (lexicographically smallest) leaf of the subtree connected to the larger one of the two edge labels. We check whether the value is irreducible comparing and and, if so, store the subarray into in time. As we touch in this way every value only once, the overall time is by the same argument using Lemma 3.

4 Reduction to Irreducible Positions

Consider a query for the locus of a substring such that the position is not irreducible. Let be the leaf corresponding to . As was discussed in Section 2, to answer the query, we have to count the number of left-branching nodes with string depths at least on the –root path. As a first approach, we do the following precalculations for this.

We build in time on the suffix tree a data structure that allows us to find the lowest common ancestor for any pair of nodes in time [4, 14]. In one tree traversal, we precompute in each node the number of left-branching nodes on the –root path. We create two arrays and such that, for any , and ( is undefined if there is no such ), i.e., and are indexes of irreducible LCP values, respectively, preceding and succeeding . Thus, for any suffix , the suffixes and are, respectively, the closest lexicographical predecessor and successor of with irreducible starting position. The closest irreducible lexicographical neighbour of is that suffix among these two that has longer common prefix with , i.e., if either or is undefined, and otherwise. Since can be calculated in time for any positions and by one lowest common ancestor query on their corresponding leaves, the closest irreducible lexicographical neighbour can be computed in time.
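The notion of the closest irreducible lexicographical neighbour can be illustrated with a naive sketch. It computes, for every index of the suffix array, the nearest preceding and succeeding indices with irreducible LCP values, and then picks the one whose suffix shares the longer common prefix. The function name is hypothetical, lcp values are computed by direct comparison instead of lowest common ancestor queries, ties are resolved in favour of the preceding suffix (an assumption), and the example string `baddaddacca$` is an assumption chosen to agree with the values in Figure 1.

```python
from typing import Dict, List, Optional


def closest_irreducible_neighbours(s: str) -> Dict[int, int]:
    """Map every non-irreducible position p of s to the starting position of
    its closest irreducible lexicographical neighbour."""
    m = len(s)
    sa = sorted(range(m), key=lambda p: s[p:])
    bwt = [s[p - 1] if p != 0 else s[-1] for p in sa]
    irreducible = [i == 0 or bwt[i - 1] != bwt[i] for i in range(m)]

    def lcp(p: int, q: int) -> int:
        k = 0
        while p + k < m and q + k < m and s[p + k] == s[q + k]:
            k += 1
        return k

    # prev_irr[i] / next_irr[i]: nearest irreducible index strictly before / after i.
    prev_irr: List[Optional[int]] = [None] * m
    last: Optional[int] = None
    for i in range(m):
        prev_irr[i] = last
        if irreducible[i]:
            last = i
    next_irr: List[Optional[int]] = [None] * m
    last = None
    for i in range(m - 1, -1, -1):
        next_irr[i] = last
        if irreducible[i]:
            last = i

    result: Dict[int, int] = {}
    for i in range(m):
        if irreducible[i]:
            continue
        p = sa[i]
        before = sa[prev_irr[i]]  # always defined because index 0 is irreducible
        after = next_irr[i]
        if after is None or lcp(p, before) >= lcp(p, sa[after]):
            result[p] = before
        else:
            result[p] = sa[after]
    return result


neighbours = closest_irreducible_neighbours("baddaddacca$")
```

For this string the non-irreducible positions 4, 3, and 2 are mapped to the irreducible positions 1, 6, and 5, respectively.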

Now, consider a query for the locus of with non-irreducible . We first find in time the closest irreducible lexicographical neighbour for . Let and be the leaves corresponding to and , respectively. We compute in time the lowest common ancestor of and . Thus, . Using the number of left-branching nodes on the –root and –root paths that were precomputed in and , we calculate in time the number of left-branching nodes on the –root path that lie between the nodes and (inclusively).

Denote . If , then we can count in the –root path the number of left-branching nodes such that as follows. The number of nodes such that is equal to . The number of nodes such that can be found by counting the number of ones in the subarray of the bit array associated with the irreducible position (assuming that all values with are zeros in case the length of is less than ), which can be performed in time by Lemma 3. Thus, the problem is solved for the case .

To address the case , we have to develop more sophisticated techniques that allow us to reduce the counting of left-branching nodes on the –root path to counting on a different leaf–root path with a different threshold that meets the condition , for an analogously appropriately defined node . The remainder of the text describes the reduction.

Similar to irreducible positions, let us define, for each non-irreducible position , a number and an array such that, for any , iff there is a left-branching node of string depth on the path from the leaf corresponding to to the root. We do not actually store the arrays and use them only in the analysis.

For any non-irreducible position , we have and, if for some , then .

Proof.

Let be the suffix lexicographically preceding , i.e., and . Since is not irreducible, we have , i.e., . Hence, is the suffix lexicographically preceding : . Therefore, , which is equal to .

If for , then there is a suffix that is smaller than and . Then, , which implies that . ∎

Consider two consecutive irreducible positions and in (i.e., all positions between and are not irreducible). Lemma 4 states that the arrays form a trapezoidal structure as depicted in Figure 2 in which each value is inherited by all arrays above it while it fits in their range.

Figure 2: Here, and are consecutive irreducible positions. Each line of the trapezoid depicts an array , for , in which gray and white positions signify, respectively, ones and zeros.

The query for the locus of was essentially reduced to the counting of ones in the subarray , where . The problem now is that the array is not stored explicitly since the position is not irreducible. Denote the length of by , so that , which is a more convenient notation for what follows. Let be the closest irreducible position preceding , i.e., . If we are lucky and neither of the arrays introduces new values in the subarray (in other words if ), then we can simply count the number of ones in in time using Lemma 3 and, thus, solve the problem. Unfortunately, new values indeed could be introduced. But, as it turns out, such new values are, in a sense, “covered” by other arrays at some irreducible positions as the following lemma suggests.

Let be a non-irreducible position and be the closest irreducible lexicographical neighbour of . Denote . If, for some , we have but , then .

Proof.

The condition implies that there is a suffix that is smaller than and . If , then and, hence, the suffix is smaller than and . This gives , which contradicts the equality . Hence, . Thus, at least one of the values must be irreducible (note that since otherwise would be irreducible). Let be an irreducible position such that . We obtain since and, for any irreducible (in particular, for ), we have . ∎

Observe that in Lemma 2 is equal to , where is the lowest common ancestor of the leaves and corresponding to and . Thus, the numbers can be precomputed for all non-irreducible positions in time using lowest common ancestor queries and the arrays and described in the beginning of the present section. For irreducible positions , we put by definition. To illustrate Lemma 2 and its consequences, we depict in Figure 3 the same trapezoidal structure as in Figure 2 coloring in each a prefix of length (and also depicting the range ).

Figure 3: The yellow stripe in each array is a prefix of length . Note that, according to Lemma 2, each gray position whose corresponding position below is white is covered in yellow. The blue stripe is , where . It coincides with two corresponding regions below it that are in light blue. Here we have .

The problematic condition that we consider can be reformulated as . It follows from Lemmas 4 and 2 that in this case . Let be the first position preceding such that . Note that since by definition and, thus, . Then, applying Lemma 2 consecutively to all positions , we conclude that .

Informally, in terms of Figure 3, the searching of corresponds to the moving of the range down until we encounter an “obstacle”, a colored part of of length . The specific algorithm finding is discussed in Section 5; let us assume for the time being that is already known. Then, Lemma 2 implies that all the ranges with coincide and, thus, the whole problem was reduced to the counting of ones in the subarray . But since , the problem can be solved by the method described at the beginning of the section: we find an irreducible position such that , then locate the leaves and corresponding to and , respectively, find the lowest common ancestor of and , and separately count the number of left-branching nodes on the –root path such that and such that .

Let us briefly recap the reductions described above: the searching for the locus of was essentially reduced to the counting of ones in the subarray , where , for which we first compute the position (this is discussed in Section 5) and, then, count the number of ones in the subarray (the subarray coincides with by Lemmas 4 and 2) using lowest common ancestor queries, the arrays and , and some precomputed numbers in the nodes.

5 Reduction to Special Weighted Ancestors

We are to compute , for a non-irreducible position . As is seen in Figure 3, this is a kind of geometric “orthogonal range predecessor” problem.

For each irreducible position in , we build a tree with nodes, where is the next irreducible position after (so that all with are not irreducible): Each node of is associated with a unique position from and the root is associated with . A node associated with position has weight ; the root has weight . The parent of a non-root node associated with is the node associated with the position . Thus, the weights strictly increase as one ascends to the root. The tree is easier to explain in terms of the trapezoidal structure drawn in Figure 4: the parent of a node associated with is its closest predecessor whose colored range goes at least one position farther to the right than . By Lemma 4, and, hence, each weight such that is upperbounded by . However, it may happen that and, thus, . In this case, the colored range stretches beyond the length of (for simplicity, Figure 4 does not have such cases). Such large weights are a source of technical complications in our scheme.

Figure 4: Each stripe of the trapezoid depicts an array , for , and its prefix of length (possibly empty) is colored in yellow (not like in Figure 3). Each yellow prefix corresponds to a node of and it is connected to the yellow prefix corresponding to its parent in . To avoid overloading the picture, gray positions signifying s in are not drawn.

The tree can be constructed in linear time inserting consecutively the nodes associated with and maintaining a stack that contains all nodes of the path from the last inserted node to the root: to insert a new node with weight , we pop from the stack all nodes until a node heavier than is encountered to which the new node is attached.
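The stack-based insertion just described can be sketched as follows. The function name `build_parents` is hypothetical; the input is assumed to be the list of node weights in insertion order, with the root first and given weight infinity so that it is never popped.

```python
import math
from typing import List, Optional


def build_parents(weights: List[float]) -> List[Optional[int]]:
    """weights[0] is the root (assumed math.inf); nodes 1, 2, ... are inserted
    in order. The parent of a node is the closest earlier node that is
    strictly heavier."""
    parent: List[Optional[int]] = [None] * len(weights)
    stack = [0]  # nodes on the path from the last inserted node up to the root
    for v in range(1, len(weights)):
        # Pop nodes that are not heavier than the new node.
        while weights[stack[-1]] <= weights[v]:
            stack.pop()
        parent[v] = stack[-1]
        stack.append(v)
    return parent


parents = build_parents([math.inf, 3, 1, 2, 5, 4, 4])
```

Every node is pushed once and popped at most once, which gives the linear running time.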

In terms of the tree , the node associated with the sought position is the nearest ancestor of the node associated with such that , where : the condition is equivalent to since , , and . Since the threshold in the condition does not exceed , we will be able to ignore differences between weights larger than in our algorithm by “cutting” them in a way. Still, even with this restriction, at first glance this looks like quite a general weighted ancestor problem that admits no constant time solution in the space available. However, we will be able to use a structure common to such trees in order to speed up the computation. The idea is to decompose the tree into heavy paths (see below), as is usually done for weighted ancestor queries, and perform fast queries on the paths via the use of more memory; the trick is that, with some care, this space can be shared among “similar” trees and can be “traded” for irreducible LCP values relying again on Lemma 3.

Consider one of these trees. An edge connecting a child v to its parent u is called heavy if size(v) > size(u)/2, and light otherwise, where size(v) denotes the number of nodes in the subtree rooted at v. Thus, at most one child can be connected to u by a heavy edge. One can easily show that any leaf–root path contains at most log m light edges, where m is the number of nodes in the tree. All nodes of the tree are decomposed into disjoint heavy paths, maximal paths with only heavy edges in them. Note that in this version of the heavy-light decomposition [28] the number of heavy edges incident to a given non-leaf node can be zero (it then forms a singleton path). The decomposition can be constructed in linear time in one tree traversal.
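A heavy path decomposition of this kind can be sketched in a few lines. The function name `heavy_paths` is hypothetical, nodes are assumed to be numbered so that `parent[v] < v` with root 0, and the threshold "more than half of the parent's subtree" is an assumption that is consistent with the remark that a non-leaf node may have no heavy edge to its children.

```python
from typing import List, Optional, Tuple


def heavy_paths(parent: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Return (size, head): size[v] is the number of nodes in the subtree of v,
    head[v] is the topmost node of the heavy path that contains v."""
    n = len(parent)
    size = [1] * n
    for v in range(n - 1, 0, -1):  # children are numbered after their parents
        size[parent[v]] += size[v]
    head = list(range(n))
    for v in range(1, n):
        u = parent[v]
        if 2 * size[v] > size[u]:  # heavy edge: v holds more than half of u's subtree
            head[v] = head[u]
    return size, head


# Root 0 has children 1 and 4; node 1 has child 2; node 2 has child 3.
size, head = heavy_paths([None, 0, 1, 2, 0])
```

In the toy tree the edges 1–0 and 2–1 are heavy, so the nodes 0, 1, 2 form one heavy path, while 3 and 4 are singleton paths.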

The predecessor search problem, for a set of increasing numbers w_1 < w_2 < ··· < w_k and a given threshold w, is to find min{i : w_i ≥ w}. Although the problem, in fact, searches for a successor, we call it “predecessor search” as it is essentially an equivalent problem.
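In the form in which it is used here, the query returns the first element that is not smaller than the threshold. A minimal sketch with binary search, which takes logarithmic time and is only meant to fix the semantics, could look as follows. The function name `pred_search` and the 0-based indexing are assumptions.

```python
from bisect import bisect_left
from typing import List, Optional


def pred_search(weights: List[int], threshold: int) -> Optional[int]:
    """Return the smallest index i with weights[i] >= threshold, or None if all
    weights are smaller. The list is assumed to be strictly increasing."""
    i = bisect_left(weights, threshold)
    return i if i < len(weights) else None


path_weights = [2, 5, 9]
hit = pred_search(path_weights, 6)  # index of the first weight that is at least 6
```

The structures described in this section answer the same query in constant time by exploiting the bounded range of the thresholds.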

[[13, Lem. 11]] Consider a tree on nodes with weights from such that some of the nodes are marked, any leaf–root path contains at most marked nodes, and the weights on any leaf–root path increase. One can preprocess the tree in time so that predecessor search can be performed among the marked ancestors of any node in time, provided a table computable in and independent of the tree is precalculated. (Footnote 1: Although Lemma 11 in [13] does not claim linear construction time, it easily follows from its proof.)

For each heavy path, we mark the node closest to the root. Then, the data structure of Lemma 5 is built on each tree , which takes total time for all the trees . To answer a weighted ancestor query on , we find in time the heavy path containing the answer using this data structure and, then, consider the predecessor problem inside the path.

Consider a heavy path whose node weights are in increasing order. We have to answer a predecessor query on the path for a threshold such that (recall that only thresholds are of interest for us) and it is known that its result is in the path, i.e., . Let . A trivial constant-time solution for such queries is to construct a bit array , where , endowed with a rank data structure such that, for any , we have iff , for some . Then, the predecessor query with a threshold such that can be answered by counting the number of 1s in the subarray , thus giving the result . The array occupies space since .

It turns out that, with minor changes, the arrays can be shared among many heavy paths in different trees. However, this approach per se leads to time and space, as will be evident in the sequel. We therefore need a slightly more elaborate solution.

Instead of the array , we construct a bit array endowed with a rank data structure such that, for any , iff the subarray contains non-zero values (assuming that , for ). The bit array packed into machine words of size bits can be built from the numbers in one pass in time. For each 1 value, , we collect all weights such that into a set . The sets , for all such that , are disjoint and non-empty, and can be assembled in one pass through the weights in time. We also store pointers , with (), such that refers to a non-empty set such that is the th value in (i.e., is the number of 1s in the subarray ). Each set is equipped with the following fusion heap data structure.

[see [12, 26]] For any set of integers from , one can build in time a fusion heap that answers predecessor queries in time.

With this machinery, a predecessor query with a threshold can be answered by first counting the number of 1s in the subarray and, then, finding in time the predecessor of in the fusion heap referred by .

The array occupies space, which is since . The computation of , which takes time, is the most time and space consuming part of the described construction; all other structures take time and space. We are to show that the computation of can sometimes be avoided if (almost) the same array was already constructed for a different path. The following lemma is the key for this optimization.

Given a tree for an irreducible position , consider its node associated with a non-irreducible position . Let be the closest irreducible lexicographical neighbour of and let . Then, the subtree of rooted at coincides with the tree in which all children of the root with weights are cut, then all weights are increased by , and the weight of the root is set to the weight of (see Figure 5).

Proof.

Denote by the subtree rooted at . Observe that all nodes of are exactly all nodes of associated with positions from a range , for some (see Figure 5). The position is either irreducible or . In order to prove the lemma, it suffices to show that (1) all positions from are not irreducible, (2) , for any , and (3) the position is either irreducible or .

(1) Suppose, to the contrary, that a position , for some , is irreducible. Then, since and , we have . Therefore, , the length of the common prefix of and its closest irreducible lexicographical neighbour, must be at least (since it was assumed that is irreducible). But then the weight of the node associated with is at least , which is equal to the weight of . Thus, cannot be an ancestor of the node, which is a contradiction.

(2) For , and share a common prefix of length since and . Further, since the weight is smaller than the weight of its ancestor . Suppose, without loss of generality, that is lexicographically smaller than (the case when it is greater is analogous). Then, neither of the suffixes lexicographically lying between and can start with an irreducible position, for otherwise . Therefore, the closest irreducible lexicographical neighbours of and coincide and .

(3) Suppose that is not irreducible and, by contradiction, . The suffixes and have a common prefix of length (note that ). The position is not irreducible since otherwise . Then, we have because otherwise the node corresponding to would have weight that is smaller than the weight of and, thus, would be a descendant of . Let be the closest irreducible lexicographical neighbour of