A compressed dynamic self-index for highly repetitive text collections

Takaaki Nishimoto et al. · November 8, 2017

We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding is a compressed dynamic self-index for highly repetitive texts, but it has a significant disadvantage: pattern search is slow for short patterns. We remedy this disadvantage by leveraging the idea behind the truncated suffix tree (TST), and we present the first compressed dynamic self-index, named the TST-index, that supports both fast pattern search and dynamic update operations for highly repetitive texts. Experiments on a benchmark dataset of highly repetitive texts show that pattern search with the TST-index is significantly faster.


1 Introduction

A highly repetitive text collection is a set of texts in which any two texts can be converted into each other with a few modifications. Such collections have become common in various fields of research and industry. Examples include genome sequences of the same species, version-controlled documents, and source code repositories. For human genomes, the difference between two individuals' sequences is estimated at around 0.1%, and huge collections of human genomes exist, such as that of the 1000 Genomes Project [1]. As another example, Wikipedia belongs to the category of version-controlled documents. There is clearly a strong, growing need for powerful methods to process huge collections of repetitive texts.

A self-index is a data representation for a text with support for random access and pattern search operations. Quite a few self-indexes for highly repetitive texts have been proposed thus far. Early methods include the SLP-index [5] and LZ-end [9]. The block tree index (BT-index) [14] is a recently proposed self-index that reaches compression close to that of Lempel-Ziv. The run-length FM-index (RLFM-index) [7] is a recent breakthrough built on the notion of a run-length-encoded Burrows-Wheeler transform (BWT). The RLFM-index can be constructed in a space-efficient, online manner while supporting fast operations. Although existing self-indexes for highly repetitive texts are constructible in a compressed space and also support fast operations, there are no proposed methods supporting dynamic updating of self-indexes for editing a set of texts with a highly repetitive structure. Thus, an important open challenge is to develop a self-index for highly repetitive texts that supports not only fast operations for random access and pattern search but also dynamic updating of the index.

The ESP-index [18] and signature encoding [15] are two self-indexes for highly repetitive texts built on the notion of a locally consistent parsing (LCPR) [11]. While they have the advantages of being constructible in an online manner and supporting dynamic updates of their data structures, they share a large disadvantage: pattern search is slow for short patterns. From a given string, these methods build a parse tree that guarantees upper bounds on the parsing discrepancies between different appearances of the same substring. For pattern search, the ESP-index performs a top-down search of the parse tree to find candidate pattern appearances and then checks whether the pattern actually occurs around each candidate. In contrast, signature encoding uses two-dimensional range reporting queries for pattern search. For both methods, traversing the search space takes a long time, especially for short patterns, which limits their applicability in practice.

In this paper, we present a novel compressed dynamic self-index for highly repetitive text collections, called the truncated-suffix-tree-based index (TST-index). The TST-index improves on existing LCPR-based self-indexes by supporting faster pattern search, leveraging the idea behind the k-truncated suffix tree (k-TST) [13]. A k-TST is built from an input text, and a self-index is created for the text transformed with the k-TST, which greatly improves search performance. In addition, the TST-index supports dynamic updates, a useful property when input texts can be edited, as noted above. Experiments with a benchmark dataset of highly repetitive texts show that pattern search with the TST-index is much faster than with signature encoding. Moreover, the TST-index achieves pattern search performance competitive with that of the RLFM-index, while its size can be smaller than that of the RLFM-index.

2 Preliminaries

Let Σ be an ordered alphabet and σ = |Σ|. Let string T be an element of Σ*, with its length denoted by |T|. In general, a string P to be searched for is called a pattern. Let N = |T| and m = |P| throughout this paper. For a string T = XYZ, X, Y, and Z are called a prefix, substring, and suffix of T, respectively. For two strings X and Y, XY denotes their concatenation.

The empty string ε is a string of length 0. Let Σ+ = Σ* \ {ε}. For any 1 ≤ i ≤ |T|, T[i] denotes the i-th character of T. For any 1 ≤ i ≤ j ≤ |T|, T[i..j] denotes the substring of T that begins at position i and ends at position j; let T[i..] = T[i..|T|] and T[..i] = T[1..i]. For strings T and P and indexes i and j, we define insertion and deletion operations as follows: insert(T, P, i) = T[..i-1] P T[i..] and delete(T, i, j) = T[..i-1] T[j+1..]. For any strings P and T, let Occ(P, T) denote the set of all occurrence positions of P in T; that is, Occ(P, T) = { i | T[i..i+|P|-1] = P, 1 ≤ i ≤ |T|-|P|+1 }. Let occ = |Occ(P, T)|.

For a string P and an integer k ≥ 1, we say that P is a k-gram if |P| = k. We then say that P is a k-short pattern if |P| ≤ k, or a k-long pattern if |P| > k. Similarly, we say that a suffix of T of length at most k is a k-short suffix of T. Let Substr_k(T) be the set of all substrings of length k and all suffixes of length at most k in T, i.e., Substr_k(T) = { T[i..i+k-1] | 1 ≤ i ≤ |T|-k+1 } ∪ { T[i..] | |T|-k+2 ≤ i ≤ |T| }. For example, if T = abaab$ and k = 3, then Substr_3(T) = {aba, baa, aab, ab$, b$, $}. For a string T over an integer alphabet {1, ..., c}, we say that T is a c-colored sequence if T[i] ≠ T[i+1] holds for any 1 ≤ i < |T|.
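As a quick sanity check of these definitions, the following minimal Python sketch computes Substr_k(T); the function name substr_k and the example string are ours, not the paper's.

```python
# A minimal sketch of Substr_k(T): all k-grams of T together with the
# suffixes of T shorter than k (the k-short suffixes); the length-k
# suffix is already covered by the k-grams.

def substr_k(T: str, k: int) -> set[str]:
    kgrams = {T[i:i + k] for i in range(len(T) - k + 1)}
    short_suffixes = {T[i:] for i in range(max(0, len(T) - k + 1), len(T))}
    return kgrams | short_suffixes

if __name__ == "__main__":
    print(sorted(substr_k("abaab$", 3)))
    # ['$', 'aab', 'ab$', 'aba', 'b$', 'baa']
```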

For a string T and an integer i, the longest prefix of T[i..] that occurs in T[..i-1] is called the longest previous factor without self-reference at position i of T. Furthermore, f_1, ..., f_z is called a factorization of T if T = f_1 ⋯ f_z holds, where each factor f_j is a substring of T and z is the size of the factorization. The run-length encoding RLE(T) is the factorization of string T such that each factor is a maximal run of the same character in T. Such a run is denoted as c^ℓ for length ℓ and character c. For example, for T = aabbbc, RLE(T) = a^2 b^3 c^1. RLE(T) is a colored sequence when we treat each factor as a single character.
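Run-length encoding is easy to make concrete; the following sketch (with names of our choosing) computes the factorization used in the example above.

```python
from itertools import groupby

# Run-length encoding as a factorization into maximal runs c^l, returned
# as (character, length) pairs.

def rle(T: str) -> list[tuple[str, int]]:
    return [(c, len(list(g))) for c, g in groupby(T)]

if __name__ == "__main__":
    print(rle("aabbbc"))   # [('a', 2), ('b', 3), ('c', 1)]
    # Adjacent factors always differ in character, so RLE(T) is a
    # colored sequence over the run alphabet.
```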

Next, the Lempel-Ziv77 (LZ77) factorization without self-reference [21] of a string T is the factorization f_1, ..., f_z satisfying the following conditions: (1) T = f_1 ⋯ f_z; (2) f_1 = T[1]; (3) for i ≥ 2, f_i is the longest previous factor without self-reference at position |f_1 ⋯ f_{i-1}| + 1 of T, if it exists; and (4) otherwise, f_i is the single character T[|f_1 ⋯ f_{i-1}| + 1]. Hence, z is the number of factors in LZ77. For example, if T = abaab, then f_1, ..., f_4 = a, b, a, ab.
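A naive quadratic-time sketch of this factorization follows (this is not the paper's algorithm; it only instantiates conditions (1)-(4) directly).

```python
# LZ77 factorization without self-reference: each factor is the longest
# prefix of the remaining suffix that already occurs strictly before it,
# or a single fresh character if no such prefix exists.

def lz77_no_self_ref(T: str) -> list[str]:
    factors, i = [], 0
    while i < len(T):
        l = 0
        # Extend the factor while T[i..i+l] occurs inside T[..i-1].
        while i + l < len(T) and T[i:i + l + 1] in T[:i]:
            l += 1
        factors.append(T[i:i + l] if l > 0 else T[i])
        i += max(l, 1)
    return factors

if __name__ == "__main__":
    print(lz77_no_self_ref("abaab"))   # ['a', 'b', 'a', 'ab']
```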

A self-index is a data structure built on a string T and supporting the following operations:

  • Count: Given a pattern P, count the number of appearances of P in T.

  • Locate: Given a pattern P, return all positions at which P appears in T.

  • Extract: Given a range [i, j], return T[i..j].

Such a self-index is considered static. In contrast, a dynamic self-index supports not only the above three operations but also update operations that insert a string of length y into, or delete a substring of length y from, the string T. A compressed dynamic self-index is a compressed representation of a dynamic self-index. We assume that N denotes the length of T in the static setting and the maximal length of T in the dynamic setting.
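For concreteness, a hypothetical Python interface for the three static operations is sketched below; the paper specifies the operations but no API, so every name here is illustrative.

```python
from typing import Protocol

class SelfIndex(Protocol):
    """Hypothetical interface of a (static) self-index on a text T."""

    def count(self, P: str) -> int:
        """Number of occurrences of pattern P in T."""
        ...

    def locate(self, P: str) -> list[int]:
        """All positions i with T[i..i+|P|-1] = P."""
        ...

    def extract(self, i: int, j: int) -> str:
        """The substring T[i..j]."""
        ...
```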

Our model of computation is a unit-cost word RAM with a machine word size of Θ(log N) bits. We evaluate space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained by multiplying by a log N factor.

Finally, let log^(1) x = log x, log^(i+1) x = log log^(i) x, and log* x = min{ i | log^(i) x ≤ 1 } for a positive real number x.

3 Literature Review

Self-indexes for highly repetitive text collections constitute an active research area, and many methods have been proposed thus far. In this section, we review the representative methods summarized in the upper part of Table 1. See [7] for a complete survey.

Self-indexes for highly repetitive text collections are broadly classified into three categories. The first category is the grammar-based self-index. Claude and Navarro [5] presented the SLP-index, which is built on a straight-line program (SLP), a context-free grammar (CFG) in Chomsky normal form deriving a single text. For a CFG of size n built from text T, the SLP-index takes O(n) space in memory while supporting locate queries in time depending on a sampling parameter [5].

Two other grammar-based self-indexes, the ESP-index [18] and signature encoding [15], use the notion of an LCPR. Each takes space proportional to the grammar size while supporting locate queries whose time grows with the pattern length and the number of candidate occurrences. Signature encoding also supports dynamic updates for highly repetitive texts. Although the ESP-index and signature encoding have the advantage that they can be built in an online manner, and although signature encoding also supports dynamic updates, these self-indexes share a large disadvantage: locate queries are slow for short patterns.

The second category includes self-indexes based on the LZ77 factorization. Recently, Bille et al. [4] presented a self-index with a time-space trade-off on LZ77. Their self-index takes space bounded in terms of z, the number of factors in the LZ77 factorization with self-reference of T, while supporting locate queries whose time involves an arbitrary constant ε > 0. Navarro [14] presented a self-index based on a block tree (BT), called the BT-index, which likewise uses space bounded in terms of z and locates a pattern in time involving an arbitrary constant ε > 0.

The third category includes the RLFM-index, built on the notion of a run-length-encoded BWT. The RLFM-index was originally proposed in [10] but has recently been extended to support locate queries [7]. It uses O(r) space, where r is the number of BWT runs, and supports fast locate queries. Although the RLFM-index supports fast locate queries in compressed space, it has a serious limitation: it cannot be dynamically updated.

Despite the importance of compressed dynamic self-indexes for highly repetitive text collections, no previous method supports both fast queries and dynamic updates of the index while achieving a high compression ratio for highly repetitive texts. We thus present a compressed dynamic self-index, called the TST-index, that meets both demands and is applicable to highly repetitive text collections in practice. The TST-index has the following theoretical property, which is proven over the remaining sections.

Theorem 1.

For a string T of length N and an integer k ≥ 1, the TST-index takes compressed space while supporting the following four operations: (i) count queries for a pattern of length m; (ii) locate queries for a pattern of length m; (iii) extract queries; and (iv) update operations. The space usage and operation times are bounded in terms of N, k, m, occ, and the size of the underlying k-signature encoding.

Method Update time
RLFM-index [7] Unsupported
Bille et al. [4] Unsupported
BT-index [14] Unsupported
SLP-index [5] Unsupported
Signature encoding [15] Supported (on average)
TST-index-s (this study) Unsupported
TST-index-d (this study) Supported
Table 1: Summary of self-indexes for highly repetitive text collections; the space and locate-time bounds of each index are discussed in the text. Here, occ_c is the number of candidate occurrences of a given pattern as obtained by the ESP-index [18].

4 Fast queries with truncated suffix trees

In this section we present a novel text transformation, called the k-TST transformation, to improve the search performance of a self-index. We first introduce the k-TST in the next subsection and then present the k-TST transformation in the following subsection.

4.1 Tries, compact tries and truncated suffix trees

A trie for a set of strings S is a rooted tree whose nodes represent all prefixes of the strings in S (see the left side of Figure 1). Let V be the set of nodes in the trie, and let L ⊆ V be the set of leaves. We then define the following five operations for a node v ∈ V, a string s, and a character c:

  • str(v): Returns the string spelled out on the path starting at the root and ending at node v.

  • node(s): Returns the node v such that str(v) = s, if it exists.

  • leaves(v): Returns the set of all leaves u such that str(u) contains str(v) as a prefix.

  • child(v, c): Returns the node u such that str(u) = str(v)c holds, if it exists.

  • parent(v): Returns the node u such that str(v) = str(u)c holds for some character c, if it exists.

All nodes in V are categorized into two types. If a node v is an internal node and has only one child, then v is called an implicit node; otherwise, v is called an explicit node. Let V_im and V_ex be the respective sets of implicit and explicit nodes in the trie.

For v ∈ V, the function exp(v) returns the lowest explicit node that is an ancestor of v (taking exp(v) = v when v is explicit). The computation times for parent(v), str(v), node(s), and leaves(v) are constant, O(|str(v)|), O(|s| t_c), and O(|leaves(v)|), respectively, where t_c is the computation time for child. Here, t_c = O(1) when we use perfect hashing [6] in O(|V|) space. Moreover, exp(v) can be computed in constant time if each node stores a pointer to its lowest explicit ancestor in constant space. Hence, the data structure requires O(|V|) space.
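A minimal pointer-based Python sketch of a trie with these operations follows. It is not the paper's data structure: there is no perfect hashing, so child takes expected dictionary-lookup time, and all class and method names are ours.

```python
class TrieNode:
    def __init__(self, parent=None, char=""):
        self.parent, self.char = parent, char          # parent pointer, edge label
        self.children: dict[str, "TrieNode"] = {}      # char -> child node

class Trie:
    def __init__(self, strings):
        self.root = TrieNode()
        for s in strings:
            v = self.root
            for c in s:
                v = v.children.setdefault(c, TrieNode(v, c))

    def str_(self, v):                 # str(v): label of the root-to-v path
        out = []
        while v.parent is not None:
            out.append(v.char)
            v = v.parent
        return "".join(reversed(out))

    def node(self, s):                 # node(s): node with str(v) = s, if any
        v = self.root
        for c in s:
            v = v.children.get(c)
            if v is None:
                return None
        return v

    def leaves(self, v):               # leaves(v): all leaves below v
        if not v.children:
            return [v]
        return [l for c in v.children.values() for l in self.leaves(c)]

    def child(self, v, c):             # child(v, c)
        return v.children.get(c)

    def parent(self, v):               # parent(v)
        return v.parent

if __name__ == "__main__":
    t = Trie(["aba", "ab$", "baa"])
    print([t.str_(l) for l in t.leaves(t.node("ab"))])   # ['aba', 'ab$']
```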

A compact trie is a space-efficient representation of a trie in which every maximal chain of implicit nodes is collapsed into a single edge (see the right side of Figure 1). We use two representations for the node labels of a compact trie. The first uses a reference string R such that each edge label is represented as a pair of start and end positions in R, resulting in a space-efficient representation of the edge labels; this is called a compact trie with a reference string and takes O(|V_ex| + |R|) space. The second represents each edge label explicitly and is called a compact trie without a reference string; it takes space proportional to the total length of the edge labels. The compact trie with a reference string is more space efficient and is used in the static case, while the compact trie without a reference string is used in the dynamic case.

We can insert or delete a string s into or from a compact trie without a reference string in O(|s| t_u) time [12], where t_u is the time to update the data structure for child at a node v when a child is inserted into or removed from v. Here, t_u is sublogarithmic when we use the predecessor/successor data structure of Beame and Fich [3] in space linear in the number of children of v.

Example 2.

The trie on the left side of Figure 1 is built on the string set Substr_3(abaab$) = {aba, baa, aab, ab$, b$, $}. In this trie, for instance, node(ab) is the node reached by reading ab from the root, str(node(ab)) = ab, leaves(node(ab)) = {node(aba), node(ab$)}, child(node(ab), a) = node(aba), and parent(node(ab)) = node(a).

Figure 1: Trie for the string set of Example 2 (left) and its corresponding compact trie with a reference string (right).

The k-TST [13] of a string T is a trie for Substr_k(T). We assume that T ends with a special character $ not included in Σ. Since Substr_3(abaab$) is the string set of Example 2, the trie of Figure 1 is a 3-TST built on abaab$; the right side of Figure 1 shows its compact-trie representation.

An important fact is that every leaf l of a k-TST corresponds to at least one occurrence of str(l) in T. We thus explicitly store str(l) for every leaf l.

Vitale et al. showed that the reference string of a k-TST can be represented by a string whose length is linear in the size of the trie, and that the k-TST of a string can be constructed in an online manner with working space proportional to the size of the trie [20].
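Under the sketches above (our substr_k and Trie helpers), building a k-TST naively is a one-liner; it only pins down the definition and is far less efficient than the online construction of Vitale et al. [20].

```python
# Naive k-TST construction: the trie for Substr_k(T).

def build_k_tst(T: str, k: int) -> Trie:
    assert T.endswith("$"), "T is assumed to end with a unique terminator"
    return Trie(substr_k(T, k))
```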

4.2 k-TST transformation

We now present a string transformation using a k-TST, which we call the k-TST transformation. A string T is transformed into a new string T' by the k-TST transformation with a k-TST built on T. A self-index with improved search performance can then be built on T'.

The k-TST transformation using a k-TST of T transforms T into a new string T' of length |T| by replacing the k-gram (or k-short suffix) beginning at each position of T by the corresponding leaf, regarded as a character: T'[i] = node(T[i..i+k-1]) for 1 ≤ i ≤ |T|, where T[i..i+k-1] is read as the k-short suffix T[i..] when i > |T|-k+1. Similarly, a pattern P is transformed into P' by the k-TST transformation using the same k-TST, i.e., P'[i] = node(P[i..i+k-1]) for 1 ≤ i ≤ |P|-k+1. Given these definitions, the following lemma holds.

Lemma 3.

For two strings T and P and the k-TST of T, the following equation holds:

Occ(P, T) = ⋃_{l ∈ leaves(node(P))} Occ(l, T')  if |P| ≤ k;   Occ(P, T) = Occ(P', T')  if |P| > k.   (1)

For |P| > k, Occ(P, T) = ∅ when P' cannot be computed, i.e., when P contains a k-gram not occurring in T.

Proof.

Recall that the k-gram beginning at position i in T is transformed to the corresponding k-TST leaf at position i in T'. This means that a substring of length at most k beginning at position i in T is transformed to a k-TST leaf containing it as a prefix at position i in T'. Let leaves(node(P)) be the set of all leaves containing P as a prefix. Then all occurrence positions of all leaves of leaves(node(P)) in T' are given by ⋃_{l ∈ leaves(node(P))} Occ(l, T'). Since this union equals Occ(P, T), Equation 1 holds for k-short patterns.

Similarly, for a k-long pattern P, the substring P beginning at position i in T is transformed to P' beginning at position i in T'. Occ(P, T) = ∅ clearly holds when P' cannot be computed. Therefore, Equation 1 also holds for k-long patterns. ∎

Example 4.

Let T = abaab$ and k = 3, and let the trie in Figure 1 be the 3-TST of T. Then T' = node(aba) node(baa) node(aab) node(ab$) node(b$) node($). Let P1 = ab and P2 = abaa. Then Occ(P1, T) = {1, 4} holds, because leaves(node(P1)) = {node(aba), node(ab$)}, Occ(node(aba), T') = {1}, and Occ(node(ab$), T') = {4}. Similarly, Occ(P2, T) = {1}, because P2' = node(aba) node(baa) and Occ(P2', T') = {1}.

We can compute T' in time linear in N, up to the cost of trie navigation. P' can also be computed in time linear in |P| by using child and parent: consecutive k-grams of P overlap in k-1 characters, so the leaf for the next k-gram can be reached from the leaf for the current one, without restarting at the root, whenever the required nodes exist.
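The following sketch (reusing our hypothetical Trie and build_k_tst helpers) makes the transformation concrete. It walks from the root at every position, so it is slower than the sliding computation just described, and Python's id() stands in for an arbitrary injective node id.

```python
# k-TST transformation of the text: position i is replaced by the leaf
# for T[i..i+k-1] (slices shorter than k yield the k-short suffixes,
# which are leaves of the k-TST by construction, so node() never
# returns None for the text itself).

def tst_transform(T: str, k: int, tst: Trie) -> list[int]:
    return [id(tst.node(T[i:i + k])) for i in range(len(T))]

def transform_pattern(P: str, k: int, tst: Trie):
    """Transform a k-long pattern: only its |P|-k+1 full k-grams matter.
    Returns None when P contains a k-gram that does not occur in T."""
    Pp = []
    for i in range(len(P) - k + 1):
        leaf = tst.node(P[i:i + k])
        if leaf is None:
            return None
        Pp.append(id(leaf))
    return Pp
```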

Lemma 3 tells us that we can speed up search queries for k-short patterns on a general self-index built on T' whenever the pattern search time of that self-index exceeds the cost of the k-TST lookup. If the length of the pattern P is at most k, i.e., |P| ≤ k, then we perform the search through the k-TST and locate the resulting leaves in T'. Otherwise, we transform P into P' and perform the search for P' on a self-index for T'.
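A sketch of this dispatch follows, again under our hypothetical helpers. Occurrences in T' are found here by naive scanning purely for illustration; in the TST-index, locate queries on T' are instead answered by the self-index of the next section.

```python
# Query dispatch of Lemma 3 / Theorem 5: k-short patterns go through the
# k-TST leaves; k-long patterns are matched via their transformation.

def locate(P: str, k: int, tst: Trie, Tp: list[int]) -> list[int]:
    if len(P) <= k:                        # k-short pattern
        v = tst.node(P)
        if v is None:
            return []
        leaf_ids = {id(l) for l in tst.leaves(v)}
        return [i + 1 for i, x in enumerate(Tp) if x in leaf_ids]
    Pp = transform_pattern(P, k, tst)      # k-long pattern
    if Pp is None:                         # P has a k-gram absent from T
        return []
    m = len(Pp)
    return [i + 1 for i in range(len(Tp) - m + 1) if Tp[i:i + m] == Pp]

if __name__ == "__main__":
    T, k = "abaab$", 3
    tst = build_k_tst(T, k)
    Tp = tst_transform(T, k, tst)
    print(locate("ab", k, tst, Tp))        # [1, 4]
    print(locate("abaa", k, tst, Tp))      # [1]
```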

We obtain the following theorem by using Lemma 3.

Theorem 5.

Let I be a self-index on the transformed string T' supporting locate and extract queries. Then there exists an index, consisting of I and the k-TST of T, that supports locate queries for a k-short pattern P through the k-TST, and locate queries for a k-long pattern P through I applied to P', using additional space proportional to the size of the k-TST.

We apply Theorem 5 to signature encoding and present a novel self-index, named the TST-index, in the next section.

5 TST-index

We obtain the TST-index by combining Theorem 5 with signature encoding. First, we introduce the LCPR and signature encoding, and then we develop the TST-index.

5.1 Locally consistent parsing (LCPR)

An LCPR [11] is a factorization of a string T determined by a bit string b computed from T. Let p_i be the position of the i-th 1 in b. Then b and the positions p_i satisfy the conditions given in Lemma 6.

Lemma 6 ([11]).

For a c-colored sequence T of length at least 2, there exists a function f that returns a bit sequence b = f(T) of length |T| satisfying the following properties: (1) p_1 = 1 and 2 ≤ p_{i+1} - p_i ≤ 3 hold for 1 ≤ i < h, where p_i is the position of the i-th 1 in b and h is the number of 1s in b. (2) If T[i-Δ_L..i+Δ_R] = U[j-Δ_L..j+Δ_R] holds for integers i and j, then b[i] = b_U[j] holds, where Δ_L = O(log* c), Δ_R = O(1), b_U = f(U), and T and U are c-colored sequences.

An LCPR of a c-colored string T using b = f(T) is defined as the factorization T[p_1..p_2-1], T[p_2..p_3-1], ..., T[p_h..|T|].

Example 7.

Let T = 12132 be a 3-colored sequence, and assume that f(T) = 10100. Then p_1 = 1 and p_2 = 3, so the LCPR of T using f(T) is T[1..2], T[3..5], i.e., the factors (1 2) and (1 3 2).

5.2 Signature encoding

A signature encoding [11] of a string T is a context-free grammar G = (Σ, V, D, S) generating the single text T, built using an LCPR and run-length encoding. Here, V is a set of positive integers called variables. D is a set of deterministic production rules (a.k.a. assignments), each of whose right-hand sides is either a sequence of variables in V or a single character in Σ. S ∈ V is the start symbol deriving the string T.

A signature encoding corresponds to a balanced derivation tree of height O(log N) built on the given string T, where each internal node (respectively, leaf node) is labeled by a variable in V (respectively, a character in Σ), as illustrated in Figure 2. Since this derivation tree is balanced, the node labels at each level can be read as a sequence from the leftmost node to the rightmost node at that level. Level 0 is the sequence of variables deriving the single characters of T. Each run-length level is obtained from the level below it by replacing every maximal run X^ℓ of a variable X with a variable deriving X^ℓ, and each shrinking level is obtained from the level below it by replacing every block determined by an LCPR of that level with a variable deriving the block; the levels of the derivation tree alternate between these two steps.

The topmost level is the first level consisting of a single variable; this variable is the start symbol S, and the number of levels is the height of the derivation tree of T, which is O(log N). Let w be the size of the signature encoding, i.e., the number of variables in V, whose bound is given by the following lemma.

Lemma 8 ([17]).

The size w of the signature encoding of T is bounded by O(z log N log* N), where z is the number of factors in the LZ77 factorization without self-reference of T.

A signature encoding satisfies the following properties by definition. (1) The height of the derivation tree is O(log N), because the length of the level sequence shrinks by a constant factor every two levels. (2) Every level sequence is a colored sequence over the variables, which is what allows an LCPR to be applied to it; the run-length levels guarantee that no two adjacent symbols are equal. (3) D can be stored in O(w) space, because every variable in V is in one of three cases: (i) it derives a single character; (ii) it derives X^ℓ with X ∈ V and ℓ ≥ 2; or (iii) it derives a variable sequence of length two or three.

Figure 2: The derivation tree of the signature encoding of Example 9.
Example 9.

Let G = (Σ, V, D, S) be a context-free grammar with Σ = {a, b} deriving T = aabb, built as follows. Level 0 is X_a X_a X_b X_b with assignments X_a → a and X_b → b; the run-length level replaces the two maximal runs, giving Y_1 Y_2 with Y_1 → X_a^2 and Y_2 → X_b^2; and the shrinking level replaces the single block (Y_1, Y_2) with the start symbol S → Y_1 Y_2, assuming the LCPR groups Y_1 Y_2 into one block. Then G is a signature encoding of T.
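To make the alternation of run-length and shrinking levels concrete, here is a deliberately simplified Python sketch. It hash-conses assignments so that identical subsequences at the same level share variables, but it chooses block boundaries by fixed pairing rather than by an LCPR, so, unlike a real signature encoding, its parsing is not locally consistent; all names are ours.

```python
from itertools import groupby

def simplified_signature_encoding(T: str):
    rules, memo = {}, {}

    def var(rhs):                          # hash-consing: one variable per rhs
        if rhs not in memo:
            memo[rhs] = len(memo) + 1
            rules[memo[rhs]] = rhs
        return memo[rhs]

    level = [var(("chr", c)) for c in T]   # level 0: one variable per character
    while len(level) > 1:
        # Run-length level: replace every maximal run X^l by one variable.
        runs = [var(("run", x, len(list(g)))) for x, g in groupby(level)]
        # Shrinking level: fixed pairing instead of an LCPR-chosen blocking.
        level = [var(("blk",) + tuple(runs[i:i + 2]))
                 for i in range(0, len(runs), 2)]
    return level[0], rules                 # start variable and assignments

if __name__ == "__main__":
    start, rules = simplified_signature_encoding("aabbaabb")
    print(start, rules)   # the two identical halves share their variables
```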

5.3 Static TST-indexes

We can extend a signature encoding into a self-index supporting pattern search. The key idea is that there exists a common variable sequence of length O(log |P| log* N) that appears in the derivation tree at every occurrence of a pattern P in T. This sequence is called the core of P and is computable from P, as described elsewhere [18, 15] in detail.

Since we can perform top-down searches of the parse tree for the core of P instead of P itself, we obtain the following lemma. (Takabatake et al. [18] construct the grammar representing an input text by using a technique called alphabet reduction, which is essentially equal to an LCPR. Although the definition of their grammar is slightly different from ours, we can use their search algorithm with our signature encoding.)

Lemma 10 ([18]).

For a signature encoding of size w, there exists an index of O(w) space supporting count and locate queries as well as extract queries, with query times bounded in terms of the pattern length, log N, occ, and occ_c, where occ_c is the number of candidate occurrences of P obtained by this index.

Proof.

In the original paper of Takabatake et al. [18], the index answers count and locate queries with two additive overhead terms. The first term can be removed by using perfect hashing [6]. The second term, the computation time of a lookup subroutine of the search, can be reduced by using the data structure proposed by Alstrup et al. [2]. Both of these data structures use O(w) space.

Next, we show that we can compute the occurrence position in T of a node of the derivation tree in O(log N) time by using O(w) space. The directed acyclic graph (DAG) of the signature encoding represents the derivation tree of T, and each node in the DAG represents a distinct variable in V. For the edge from a parent to its i-th child in the DAG, let the relevant offset be the length of the string generated by the children preceding the i-th child. Then the occurrence position of a node in T is determined by the sum of the relevant offsets on the path starting at the root and ending at the node. Hence, we can compute an occurrence position in time proportional to the height of the derivation tree, i.e., in O(log N) time, by storing all relevant offsets. Furthermore, the offsets can be stored in O(w) space, because the relevant offsets between a run variable e → X^ℓ and its children can be represented in O(1) space.

Next, for a node u of the DAG, we store the lowest ancestor v of u such that the in-degree of v is at least 2 or v is the root of the DAG, together with the sum of the relevant offsets on the path from v to u. By summing up these precomputed values, we skip the unary paths of the DAG, reducing the computation time accordingly, and the data structures use O(w) space. ∎

We call the signature encoding built on T' a k-signature encoding. Our static self-index consists of the k-TST of T and the self-index of Lemma 10 representing T'. In turn, we construct the k-TST from T, T' from T and the k-TST, and the self-index of Lemma 10 from T'. The search algorithm is based on Theorem 5: for a locate query with |P| ≤ k, we compute node(P) using the k-TST and output all occurrences of the leaves below it in T' by locate queries on the self-index for T'; otherwise, we compute P' using the k-TST and output all occurrences of P' in T' by locate queries on the self-index for T'. Count queries proceed similarly. Hence, we obtain the following performance for our self-index from Theorem 5 and Lemma 10.

Theorem 11.

For a string T of length N and an integer k ≥ 1, there exists an index, consisting of the k-TST of T and the index of Lemma 10 for T', supporting (i) locate queries for k-short patterns, (ii) locate and count queries for k-long patterns, and (iii) extract queries, in space bounded in terms of the size of the k-TST and the size w_k of the k-signature encoding of T'.

We give the following upper bound for w_k.

Theorem 12.

For the k-signature encoding of T', its size w_k is upper-bounded in terms of z, k, and N.

Proof.

See Section 6. ∎

A further upper bound follows from [19]. Thus, the size of our index is bounded accordingly. The results of an empirical evaluation of the TST-index are given in Section 7.

5.4 Dynamic TST-indexes

In this section, we consider maintaining our self-index, including its support for k-short patterns, under a dynamic text T. Recall that our index consists of the k-TST of T and an index of T'. In the dynamic setting, the index is still based on Theorem 5, but we use the following data structure for T' instead of that given by Lemma 10.

Lemma 13 (Dynamic signature encoding [16]).

With the addition of data structures taking space proportional to the signature encoding, a signature encoding of T' can support update operations. The modified data structure also supports computing the occurrences of a given variable in T' and answering extract queries.

Hence, our dynamic index supports locate queries for k-short patterns. For count queries, we append |Occ(str(l), T)| to each leaf l of the k-TST as extra information. We also append the corresponding information to the dynamic structure for T'.

The remaining problem is how to update these data structures when T is edited. The main question is how T' changes when T changes; once that is understood, we can update the data structures through each update operation.

Let T' denote the k-TST transformation of a string T. For a string P and two integers i and j, consider T1 = insert(T, P, i) and T2 = delete(T, i, j). Then T1' (respectively, T2') agrees with T' outside a window around position i: only the positions whose k-grams overlap the edited region, i.e., positions i-k+1 through i+|P|-1 of T1 (respectively, i-k+1 through i-1 of T2), need to be recomputed, while the prefix of T' before the window is unchanged and the suffix after the window is the corresponding suffix of T', shifted by |P| (respectively, by -(j-i+1)).
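To make this locality concrete, the following sketch (building on our earlier hypothetical Trie, TrieNode, and transformation helpers) patches T' after an insertion: only the k-grams starting inside a window of length about |P| + k around the edit are recomputed. A real dynamic TST-index must also update the dynamic signature encoding of T' and remove leaves whose k-grams no longer occur, which this sketch omits.

```python
def insert_and_retransform(T, Tp, P, i, k, tst):
    """insert(T, P, i) with 1-based i, as in the text: rebuild only the
    entries of T' whose k-grams overlap the edited region."""
    T2 = T[:i - 1] + P + T[i - 1:]
    lo = max(0, i - k)                     # 0-based start of affected window
    hi = min(i + len(P) - 1, len(T2))      # 0-based end (exclusive)
    mid = []
    for j in range(lo, hi):
        g = T2[j:j + k]
        v = tst.node(g)
        if v is None:                      # a k-gram new to T: grow the k-TST
            v = tst.root
            for c in g:
                v = v.children.setdefault(c, TrieNode(v, c))
        mid.append(id(v))
    # Prefix entries are unchanged; suffix entries are the old ones, shifted.
    return T2, Tp[:lo] + mid + Tp[i - 1:]
```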