Sliding Suffix Tree

01/23/2018 ∙ by Andrej Brodnik, et al. ∙ University of Ljubljana

We consider a sliding window over a stream of characters from some finite alphabet. The user wants to perform deterministic substring matching on the current sliding window content and obtain positions of the matches. We present an indexed version of the sliding window based on a suffix tree. The data structure has optimal time queries Θ(m+occ) and amortized constant time updates, where m is the length of the query string and occ the number of occurrences.


1 Introduction and Related Work

Text indexing, and big data in general, is a well studied field of computer science and engineering. An especially intriguing area is (infinite) streams of data which are too big to fit onto disk and, consequently, cannot be indexed in the traditional way (e.g. by using an FM-index [5]). In practice, data streams are processed on-the-fly by efficient, carefully engineered filters. Excerpts of the data, called text features, are stored for later use, while the original stream data is discarded.

In our research, we consider an infinite stream of characters where the main memory holds the most recent characters of the stream in terms of a sliding window. At any moment, a user wants to find all occurrences of a given substring in the current window. In general, to answer the query, we could construct an automaton from the query using KMP

[9] or Boyer-Moore [1], and then feed the stream to the constructed automaton. This, however, requires that all queries are known in advance. On the other hand, if the query arrives on-the-fly, the automaton needs to be constructed from scratch. In both cases we need to scan the whole window, which requires time linear in the size of the window. A better possibility would be to run Ukkonen’s online suffix tree construction algorithm [11] and construct the suffix tree. When the query arrives, we inject a delimiter character to finalize the suffix tree construction and perform the query on the constructed tree. Finalizing, however, might take, in the worst case, time linear in the size of the window.

In this paper we show how to construct and maintain an indexed version of the sliding window allowing a user to find occurrences of a substring in optimal time and space. This is the first data structure for on-the-fly text indexing which requires amortized constant time for updates and worst case optimal time for queries.

In the following section we define the notation and preliminary data structures and algorithms. In Section 3 we formally present a sliding suffix tree and we conclude in Section 4 with discussion and open problems.

2 Notation and Preliminaries

Capital letters denote strings and lower case letters integers, except for c, which denotes an arbitrary character. Further, lower case Greek letters represent nodes in a tree, and calligraphic capital letters (e.g. 𝒯, 𝒮, and ℒ) tree-based data structures. We denote the concatenation of two strings by simply writing one string beside the other, e.g. TS, and the length of a string T as |T|. By T[i..j] we denote the substring of T starting at position i and ending at j inclusive, where 1 ≤ i ≤ j ≤ |T|. The suffix of T starting at i is T[i..] = T[i..|T|] and the prefix of T ending at j is T[..j] = T[1..j], both inclusive.

We denote by W the sliding window over an infinite input stream of characters drawn from an alphabet Σ of constant size. By d we denote the number of all characters read so far. To store a suffix starting at the current position, we store the current d. At any later time d′, we can retrieve the content of this suffix as W[|W| − (d′ − d)..], where d′ − d < |W| must hold for the suffix to still be present in W.
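
A small sketch of this bookkeeping (function and variable names are ours, not the paper's), using a 0-based Python slice for the 1-based position above:

    def suffix_in_window(W, d_stored, d_now):
        # A suffix is stored as the stream offset d_stored at which it began;
        # at a later time d_now it starts at window position
        # |W| - (d_now - d_stored) (1-based), provided it has not slid out yet.
        age = d_now - d_stored
        if age >= len(W):
            return None                   # the suffix is no longer inside W
        return W[len(W) - age - 1:]       # 0-based slice of the 1-based position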

2.1 Suffix Tree, Suffix Links Tree, and Lowest Common Ancestor

A suffix tree is a dictionary containing each suffix of the text as a key with its position in the text as its value. The data structure is implemented as a PATRICIA tree, where each internal node stores a skip value and the first (discriminative) character of the incoming edge, whereas each leaf stores the position of the suffix in the text. We denote by strdepth(ω) the string depth of a node ω in a suffix tree and define it as the sum of all skip values from the root to ω. Each edge implicitly contains a label, which is a substring of a suffix starting and ending at the string depth of the originating and the string depth of the terminating node respectively. We say a node ω spells out a string S, where S is the concatenation of all labels from the root to ω; more formally, take a leaf in the subtree of ω and let it store the position i of a suffix, then S = W[i..i + strdepth(ω) − 1]. Next, let S₁ and S₂ be the strings spelled out by nodes ω₁ and ω₂ respectively. We define a suffix link as an edge from node ω₁ to ω₂ if S₂ = S₁[2..], and denote this by sl(ω₁) = ω₂. If we follow suffix links from ω i times, we write this as sl^i(ω).
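
The following is a minimal sketch of how such a node might be represented under the conventions above (names are ours, not the paper's):

    class STNode:
        def __init__(self, skip=0, first_char=None, suffix_pos=None):
            self.skip = skip                # skip value of the incoming edge
            self.first_char = first_char    # first (discriminative) character of the edge
            self.children = {}              # discriminative character -> child node
            self.suffix_link = None         # sl(node), defined for internal nodes
            self.suffix_pos = suffix_pos    # position of the suffix (leaves only)
            self.parent = None

        def add_child(self, child):
            # attach a child; its discriminative character identifies the edge
            child.parent = self
            self.children[child.first_char] = child

    def strdepth(node):
        # string depth: sum of the skip values on the path from the root to `node`
        depth = 0
        while node is not None:
            depth += node.skip
            node = node.parent
        return depth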

We define the suffix links tree 𝒮 as follows. For each internal node in the suffix tree 𝒯 let there be a node in the suffix links tree 𝒮. For each suffix link from ω₁ to ω₂ in the suffix tree, ω₂ is the parent of ω₁ in the suffix links tree. Consequently, following a suffix link from ω i times is the same as finding the i-th ancestor of ω in the suffix links tree.

The lowest common ancestor (LCA) of nodes ω₁ and ω₂ in a tree is the deepest node which is an ancestor of both ω₁ and ω₂. The first constant time, linear space LCA algorithm was presented in [7] and later simplified by [10]. A dynamic version of the data structure, still running in constant time and linear space, was introduced in [3]. We will use this result to perform constant time LCA lookups and maintain the structure in amortized constant time.
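
To pin down what the LCA query returns, here is a naive stand-in that walks parent pointers; the paper instead relies on the dynamic constant-time, linear-space structure of [3]:

    def lca(u, v, parent, depth):
        # naive LCA: lift the deeper node first, then walk both up in lockstep
        while depth[u] > depth[v]:
            u = parent[u]
        while depth[v] > depth[u]:
            v = parent[v]
        while u != v:
            u, v = parent[u], parent[v]
        return u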

2.2 Ukkonen’s online suffix tree construction algorithm

In [11] Ukkonen presented a suffix tree construction algorithm which builds the data structure in a single pass. During the construction, the algorithm maintains the following invariants, spending amortized constant time per character:

  • the implicit buffer B which corresponds to the longest repeated suffix of the text processed so far (see the sketch after this list),

  • the active node α which represents the node where we end up by navigating B in the suffix tree constructed so far, i.e. B is a prefix of the string spelled out by α.
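
The first invariant can be stated as a naive reference computation (a sketch for intuition only; Ukkonen's algorithm of course never recomputes B this way):

    def implicit_buffer(text):
        # B is the longest suffix of the text processed so far that also
        # occurs at an earlier position, i.e. the longest repeated suffix.
        for k in range(len(text), 0, -1):
            suffix = text[len(text) - k:]
            if text.find(suffix) < len(text) - k:
                return suffix
        return ""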

The execution of the algorithm can be viewed as an automaton with two states. In the buffering state, the automaton reads a character c from the input stream and implicitly extends all labels of leaves by c. Then, it checks whether c matches the next character after the prefix of length |B| spelled out by α. If it does, c is appended to B and the automaton remains in the buffering state reading the next character. When |B| exceeds strdepth(α), the child of α in the direction of c becomes the new active node.

On the other hand, if character c does not match, the automaton switches to the expanding state. First it inserts a new branch in the direction of c with a single leaf storing the starting position of the suffix Bc. If |B| = strdepth(α), the new branch is added as a child of α. Otherwise, if |B| < strdepth(α), the incoming edge of α is split such that the string depth of the newly inserted internal node is |B|, and the new branch is added to this node. Once the branch is inserted, the first character is removed from B, obtaining the new buffer B[2..]. The new active node corresponding to B[2..] is found in the following way. Let π denote the parent of the original active node α. Then the new active node is the node obtained by navigating the suffix B[2..] from the node sl(π). When the new active node is obtained, c is reconsidered. If a branch in the direction of c exists, the automaton switches to the buffering state. Otherwise, it remains in the expanding state and repeats the new branch insertion. Each time the expanding state is re-entered, B is shortened by one character. In the worst case, if c does not occur in the text yet, the suffix links are followed all the way up to the root node, and c is added as a new child of the root node. In this case the implicit buffer B becomes the empty string.
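
The two-state behaviour can be made concrete with a compact, standard formulation of Ukkonen's online construction. This is only an illustrative sketch: it uses the conventional active point (node, edge, length) and a remainder counter rather than the paper's α and B (after each extension, remainder equals |B|), and the class and attribute names are ours:

    class Node:
        # One suffix tree node; edges are labelled by [start, end) indices into
        # the text read so far (end is None for leaf edges, which grow implicitly).
        def __init__(self, start, end):
            self.start = start
            self.end = end
            self.children = {}
            self.slink = None

    class OnlineSuffixTree:
        def __init__(self):
            self.text = []
            self.root = Node(-1, -1)
            self.active_node = self.root
            self.active_edge = 0
            self.active_length = 0
            self.remainder = 0          # suffixes not yet inserted explicitly, i.e. |B|

        def _edge_length(self, node):
            end = len(self.text) if node.end is None else node.end
            return end - node.start

        def extend(self, c):
            self.text.append(c)
            pos = len(self.text) - 1
            self.remainder += 1
            last_internal = None
            while self.remainder > 0:
                if self.active_length == 0:
                    self.active_edge = pos
                edge_char = self.text[self.active_edge]
                if edge_char not in self.active_node.children:
                    # expanding: add a new leaf directly below the active node
                    self.active_node.children[edge_char] = Node(pos, None)
                    if last_internal is not None:
                        last_internal.slink = self.active_node
                        last_internal = None
                else:
                    nxt = self.active_node.children[edge_char]
                    if self.active_length >= self._edge_length(nxt):
                        # walk down: the active point lies beyond this edge
                        self.active_edge += self._edge_length(nxt)
                        self.active_length -= self._edge_length(nxt)
                        self.active_node = nxt
                        continue
                    if self.text[nxt.start + self.active_length] == c:
                        # buffering: c extends the implicit buffer, stop this phase
                        self.active_length += 1
                        if last_internal is not None:
                            last_internal.slink = self.active_node
                        break
                    # expanding: split the edge and hang a new leaf off the split node
                    split = Node(nxt.start, nxt.start + self.active_length)
                    self.active_node.children[edge_char] = split
                    split.children[c] = Node(pos, None)
                    nxt.start += self.active_length
                    split.children[self.text[nxt.start]] = nxt
                    if last_internal is not None:
                        last_internal.slink = split
                    last_internal = split
                self.remainder -= 1
                if self.active_node is self.root and self.active_length > 0:
                    self.active_length -= 1
                    self.active_edge = pos - self.remainder + 1
                else:
                    self.active_node = self.active_node.slink or self.root

For example, feeding the characters of "abab" one by one leaves remainder (and hence |B|) equal to 2, matching the buffering behaviour described above; the next character then either keeps buffering ("a") or triggers expanding steps (any other character).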

We say the currently constructed suffix tree is unfinalized until B is completely emptied. Moreover, there are exactly |B| leaves missing in the unfinalized tree and these correspond to the suffixes of B. For finite texts we finalize the suffix tree at the end by appending a unique character, which forces the algorithm to empty B and finalize the tree. For infinite streams, however, there is no final character. Consequently, we need to support:

  1. Queries: When performing queries, we need to report the occurrences both in the partially constructed suffix tree and in B.

  2. Maintenance: The original Ukkonen’s algorithm supports adding a new character to the indexed text. When the window is shifted, we also need to remove the oldest (longest) suffix from the text.

3 Sliding Suffix Tree

The sliding suffix tree is an indexed version of the current sliding window content W. Formally, we define two operations:

  • find(Q, W) — returns all positions of the query string Q in W.

  • shift(c, W) — appends the character c to W and removes the oldest character from W.

Initially, W is empty, and until the length of W reaches the desired size, the shift operation only appends new characters.
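
The interface can be summarised by the following sketch (class and method names are ours); find() is shown as a naive scan purely to fix the semantics, in place of the Θ(m + occ) tree query developed below:

    from collections import deque

    class SlidingWindowIndex:
        def __init__(self, window_size):
            self.window = deque(maxlen=window_size)

        def shift(self, c):
            # append c; once the window is full, the oldest character drops out
            self.window.append(c)

        def find(self, Q):
            # all 1-based positions of Q in the current window content
            W = "".join(self.window)
            m = len(Q)
            return [i + 1 for i in range(len(W) - m + 1) if W[i:i + m] == Q]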

The sliding suffix tree is built on top of Ukkonen’s online suffix tree construction algorithm. We maintain a possibly unfinalized suffix tree 𝒯 including the implicit buffer B and the active node α (Fig. 1 on the left). Figure 1 on the right illustrates the positions of W and B in the stream. Notice that B is always a proper suffix of W. Additionally, we maintain the suffix links tree 𝒮 of 𝒯, with an auxiliary data structure ℒ required for constant time LCA queries on 𝒮.

Figure 1: On the left: illustration of the partially constructed suffix tree 𝒯 with the implicit buffer B and the active node α. On the right: illustration of the stream, the sliding window W, the implicit buffer B, and three cases for the positions of the query strings Q₁, Q₂, and Q₃.

In the next two subsections we show how to perform the find operation in Θ(m + occ) time in the worst case and the shift operation in amortized constant time. As a model of computation, we use the standard RAM model.

3.1 Queries

To find all occurrences of a query Q in W, we first navigate Q in 𝒯. Let 𝒯_Q denote the subtree rooted at the node at which we finished the navigation. Leaves of 𝒯_Q make up the first part of the resulting set. In Figure 1, Q₁ corresponds to such an occurrence. Also, the position of Q₂ in the same figure will be contained in one of the leaves of 𝒯_Q, since 𝒯 contains all suffixes starting from the beginning of W up to the beginning of B.

The second part of the resulting set are the missing leaves of 𝒯_Q due to the unfinalized state of 𝒯. Intuitively, these leaves correspond to suffixes of B which start with Q. Q₃ in Figure 1 illustrates one such position. Obviously, if |Q| > |B| there are no matches of Q in B and we solely return the leaves of 𝒯_Q. If |Q| = |B|, we test whether the active node α is the same node as the root of 𝒯_Q. If it is, we add one additional occurrence at position |W| − |B| + 1 to the resulting set.

The case |Q| < |B| requires special attention. One solution would be to scan for Q in B using KMP or similar approaches. But since |B| can be Θ(|W|) in the worst case, we cannot afford the scan. In the remainder of this subsection we show how to determine the missing leaves in time O(m + occ). First, we claim that the navigated subtree 𝒯_Q always exists if there are any occurrences of Q to be found in B.

If Q exists in the buffer B, then the subtree 𝒯_Q obtained by navigating the query Q in 𝒯 exists.

Proof.

If Q exists somewhere in B, then Q is a substring of the string spelled out by α. From the property of the suffix tree, by following the suffix links from α we will find a node which spells out a string with Q at the beginning. This node is the root of 𝒯_Q. ∎

To consider occurrences of Q in B where |Q| < |B|, we determine the relation of each node in 𝒯_Q to α. Since 𝒯_Q contains O(occ) nodes, we can afford this operation if we spend at most constant time per node. We proceed depending on whether α is an internal node of 𝒯 or not.

Let α be the active node of 𝒯 and assume α is an internal node. The string Q is located in B at position i, iff the node sl^(i−1)(α), reached by following i − 1 suffix links from α, is a node in 𝒯_Q.

Proof.

(⇒) We need to prove that a node corresponding to a suffix of B which starts with Q exists in 𝒯, since 𝒯 is not finalized. Recall the expanding state of Ukkonen’s algorithm. At each call, the operation adds a leaf and possibly an internal node, whereas the existing internal nodes are left untouched. Since α is an internal node, no changes will be made either to it or to the nodes visited when recursively following the suffix links from α, since they are also internal nodes. Therefore, a node corresponding to a suffix of B which begins with Q exists in 𝒯_Q, if such a suffix exists in B.

(⇐) By the definition of α, B is a prefix of the string which α spells out. α is also an internal node, so it always has an outgoing suffix link (in case α is the root, let its suffix link point to the root itself). When following the suffix link of α, each time we implicitly remove one character from the beginning of B. Suppose we follow the suffix link i − 1 times and reach a node which is a member of 𝒯_Q. By the definition of the suffix tree, each node in 𝒯_Q spells out a string which starts with Q. Therefore, the reached node corresponds to the suffix of B at position i, which starts with Q. ∎

By using the lowest common ancestor operation (LCA) we can check in constant time whether a node is reachable from another node by following the suffix links in 𝒯. If ω is an ancestor of α in 𝒮 (i.e. the LCA of ω and α in 𝒮 is ω), then ω is reachable by following the suffix links from α in 𝒯. To determine all occurrences of Q in B in time O(occ), for each candidate node ω in 𝒯_Q we find its LCA with α in 𝒮. If the LCA is ω, then ω is an ancestor of α in 𝒮, and by Lemma 3.1, Q is located in B at position i, where i − 1 is the number of suffix links followed from α to ω.
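
A short sketch of this check (all names are illustrative; lca is assumed to be a constant-time LCA oracle on 𝒮 and depth_S the depth of a node in 𝒮):

    def buffer_occurrences(candidates, alpha, lca, depth_S):
        # A node omega of T_Q is reachable from alpha via suffix links iff it
        # is an ancestor of alpha in S, i.e. LCA(omega, alpha) == omega; the
        # number of links followed, depth_S(alpha) - depth_S(omega), gives the
        # (1-based) position of Q inside B.
        positions = []
        for omega in candidates:
            if lca(omega, alpha) is omega:
                positions.append(depth_S(alpha) - depth_S(omega) + 1)
        return positions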

If α is a leaf of 𝒯, we cannot use the approach described above, because leaves do not have usable suffix links. Instead, we find the occurrences of Q in B by exposing a repetitive pattern inside B.

[The Buffer Pumping Lemma] The buffer B is extended by a new character c during Ukkonen’s suffix tree construction algorithm without inflicting an expansion of the tree, iff c corresponds to the next character in a repetitive pattern inside W.

Proof.

(⇒) Since c does not inflict an expansion of the tree, Bc occurred in the text before. Let j denote the position of the last such occurrence, as illustrated in Figure 2. Notice that the occurrence of B starting at j and the occurrence of B starting at |W| − |B| + 1 overlap. Consequently, c equals the character immediately following the earlier occurrence of B, and in turn B is a concatenation of repetitions of the pattern P = W[j..|W| − |B|], where the last repetition might be empty.

(⇐) Given B and a repetitive pattern P, we can extend B by the next character c of the pattern. An expansion of the tree will not occur, because Bc was present in the text before and consequently a corresponding edge in the partially constructed suffix tree already exists. ∎

Let there be a single leaf in the subtree obtained when navigating B in 𝒯 and let j be the position stored in this leaf. The repetitive pattern inside W is P = W[j..|W| − |B|].

Proof.

Since the leaf storing j is the only leaf in the obtained subtree, there are exactly two occurrences of B in the text: the first one at position j and the second one at position |W| − |B| + 1. If |P| ≥ |B|, then B is a prefix of P, because the leaf spelling out W[j..] was obtained by navigating B. If |P| < |B|, then the two occurrences overlap and W[j..] consists of repetitions of P due to the buffer pumping lemma. ∎

Figure 2: Structure of W relative to B when α is a leaf. Subfigures a) and b) illustrate the cases |Q| ≥ |P| and |Q| < |P| respectively. Below each subfigure is an illustration of the query Q relative to B and to P respectively.

With the help of the lemma and the corollary above we can efficiently determine the positions of Q in B by exposing the repetitive pattern P inside B. Depending on the length of Q, two cases are possible, as illustrated in Figure 2. If |Q| ≥ |P| (Fig. 2.a), we scan for Q in B up to position |P| inside B, and for each occurrence of Q at some position x we add the occurrences x, x + |P|, x + 2|P|, … to the resulting set until we reach the end of the window. We require O(m + occ) time in the worst case. If |Q| < |P| (Fig. 2.b), we visit the leaves of 𝒯_Q and consider the suffixes starting inside the interval [j, j + |P| − 1] of the stream. For each such occurrence at position x, we add x + |P|, x + 2|P|, … to the resulting set until we reach the end of the window. We spend O(occ) time in the worst case.
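
A naive sketch of the periodicity argument (not the paper's O(m + occ) procedure; indices are 0-based and the argument names are ours):

    def pump_occurrences(W, Q, j, b):
        # W is the window, Q the query, j the position stored in the single
        # leaf reached by navigating B, and b = |B|.  By the corollary, the
        # pattern P = W[j:len(W)-b] tiles W[j:], so every occurrence of Q that
        # starts inside B is a |P|-shifted copy of one starting in the first
        # period of B.
        p = (len(W) - b) - j                          # |P|, the period
        occ = []
        for x in range(len(W) - b, len(W) - b + p):   # first period inside B
            if W[x:x + len(Q)] == Q:
                y = x
                while y + len(Q) <= len(W):           # pump by the period
                    occ.append(y)
                    y += p
        return occ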

The data structure we use consists of 𝒯, 𝒮 and ℒ, where 𝒯 requires O(|W|) space in the worst case (i.e. O(|W|) nodes), assuming an alphabet of constant size. Next, 𝒮 contains only the internal nodes of 𝒯 and is oblivious to the alphabet size, so its space complexity has the same upper bound. Finally, ℒ, used for constant time LCA queries on 𝒮, requires linear space in terms of the number of nodes in 𝒮. This brings us to the following theorem.

A user can find all occurrences of a query Q in a sliding suffix tree of size O(|W|) in time Θ(m + occ).

3.2 Maintenance

To shift the window W, we read a character c, add it to our data structure and at the same time remove the oldest (longest) stored suffix. During the maintenance no queries can be performed.

To add a character, we first execute the original Ukkonen algorithm as described in Subsection 2.2. During the expanding state we add to 𝒯 either one node (a new leaf is added to the active node) or two nodes (the incoming edge of the active node is split and a new leaf is added). Since 𝒮 contains only internal nodes of 𝒯, it remains unchanged in the first case; in the second case, a node is also added to 𝒮 as follows.

When the expanding state is visited for the first time, a new internal node ω₁ is added to 𝒯. We also add a new node to 𝒮. At this point no suffix link originating in ω₁ has been set, so ω₁ does not have a parent in 𝒮 yet. In the next step either the expanding state is re-entered or the buffering state is entered. If the expanding state is re-entered, we repeat the procedure obtaining new nodes ω₂ in 𝒯 and in 𝒮. Now a suffix link is created from ω₁ to ω₂ and consequently ω₂ becomes the parent of ω₁ in 𝒮. If the buffering state is entered, either the root node or a node containing the matched character is reached. Instead of creating new nodes in 𝒯 and 𝒮 as we did in the expanding state, we create a suffix link to this existing node in 𝒯 and set the parent of the corresponding node in 𝒮 accordingly.
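
The mirroring of 𝒯 into 𝒮 boils down to the following invariant (a trivial sketch; names are ours):

    def set_suffix_link(omega_from, omega_to, parent_in_S):
        # Creating the suffix link sl(omega_from) = omega_to in the suffix
        # tree makes omega_to the parent of omega_from in the suffix links
        # tree S, here kept as an explicit parent map.
        omega_from.suffix_link = omega_to
        parent_in_S[omega_from] = omega_to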

Adding a new character to 𝒯 requires amortized constant time [11]. During the re-entrances of the expanding state, a chain of nodes is formed in 𝒮, which is finally attached to an existing node in 𝒮 in constant time when the buffering state is entered. For updating ℒ, attaching a chain of nodes to a tree requires time linear in the length of the chain [3]. By amortizing over all expanding calls, adding a new character takes amortized constant time.

To remove the oldest stored suffix from 𝒯, we first find the corresponding leaf (e.g. by following a linked list of all leaves). If the leaf’s parent has three or more children, the parent remains unchanged and we just remove the leaf from 𝒯. Since leaves of 𝒯 are not present in 𝒮, both 𝒮 and ℒ remain unchanged.

On the other hand, if the leaf’s parent π has exactly two children, we remove the leaf from 𝒯 and also its parent π from 𝒯 and from 𝒮. To remove π from 𝒯 we merge its incoming and its remaining outgoing edge. Due to the following lemma, we can also safely remove π from 𝒮, since it is always a leaf in 𝒮.

Let π be a node with two children in 𝒯, where one child is a leaf storing the position of the longest suffix of W. Then π is not a terminating node of any suffix link.

Proof by contradiction.

Assume there is a node ω in 𝒯 with a suffix link pointing to π, i.e. sl(ω) = π. Since π has two children, ω has at most two children, because removing the first character of any suffix below ω yields a suffix below π. Observe the child of π storing the position of the longest suffix, i.e. the child spelling out the whole window content. One child of ω should then spell out this longest suffix prepended by some character. Since it is already the longest suffix which exists in the window, a longer suffix and its corresponding leaf do not exist. Then only one child of ω remains, and due to path compression ω does not exist in 𝒯, which contradicts the initial assumption. ∎

At the moment of removal, the removed leaf or its parent can be the active node α. If this is the case, then B was a prefix of the removed suffix. Recall that at any time B corresponds to the longest repeated suffix of the window. Since the oldest suffix is removed by shifting the window, the new longest repeated suffix is consequently shortened by one character, so we update B to B[2..]. To find the new active node and the edge corresponding to the updated B, we simply follow the suffix link of α’s parent and navigate the remainder of B from the obtained node. The navigation time is amortized over all expanding calls, so finding the new α requires amortized constant time.

To remove a leaf from 𝒮 and update ℒ accordingly we require constant time in the worst case [3].

During the shift operation, no additional data structures are used. Consequently, the space complexity of the sliding suffix tree remains asymptotically unchanged. We conclude with the following theorem.

The sliding suffix tree of size O(|W|) can be shifted in amortized constant time.

4 Conclusions and Open Problems

In this paper we presented the sliding suffix tree for performing online substring queries on a stream. By extending Ukkonen’s online suffix tree construction algorithm, the presented data structure supports queries in optimal Θ(m + occ) time for alphabets of constant size while maintaining amortized constant time updates, where m is the length of the query string and occ the number of occurrences.

An open question remains whether the data structure can be updated in worst case constant time. There is a well known linear time suffix sorting lower bound [4], but to our knowledge, no per-character lower bound has been explored. Ukkonen’s algorithm requires, by design, amortized constant time for updates due to the implicit buffer of unfinalized nodes. To the best of our knowledge, no online suffix tree construction algorithm without such an implicit buffer has been developed.

In this paper we assumed a constant-size alphabet in the asymptotic bounds for queries and updates. For an arbitrary alphabet Σ, the current implementation of the data structure requires an additional O(log |Σ|) factor of time to determine a child at each step while maintaining the same space complexity, whereas the 𝒮 and ℒ data structures are oblivious to |Σ|. An interesting question is whether the same asymptotic times can be achieved for integer alphabets, as was done in [4] for texts of fixed length. In our case the alphabet size is bounded by |W|, but the alphabet can change over time.

Streaming algorithms are common in high-throughput environments, so it seems natural to involve parallelism. Recently, two methods were introduced for performing fine-grained parallel queries on suffix trees [8, 2]. Both methods perform queries on static data structures only, and extending them to support the shift operation used by the sliding suffix tree might be feasible. From a more coarse-grained point of view, the current query and update operations must be executed atomically. An interesting design question is whether the data structure could be organized so that a query and an update can be performed simultaneously when different parts of the data structure are involved.

Finally, the presented data structure, while theoretically sound, should also be competitive in practice. From our point of view, the main issue with the tree-based data structures used in the sliding suffix tree is space consumption. The majority of the space is accounted for by the auxiliary data structure used for constant time lowest common ancestor queries. Some work on practical lowest common ancestor data structures has already been done in [6]. We believe that once the data structure is implemented succinctly, it should present a viable alternative to existing solutions.

References