
A New Class of String Transformations for Compressed Text Indexing

05/11/2022
by   Raffaele Giancarlo, et al.
UniPa

Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties of BWT has been a challenge for many researchers for a long time. Among the known BWT variants, the only one that has been recently shown to be a valid alternative to BWT is the Alternating BWT (ABWT), another invertible string transformation introduced about ten years ago in connection with a generalization of Lyndon words. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the myriad virtues of BWT. We show that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT. Although all transformations support pattern search, we show that, in the general case, the transformations within our larger class may take quadratic time for inversion and pattern search. As a further result, we show that the local orderings-based transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string, and we provide an algorithm solving this problem in linear time.


1 Introduction

Michael Burrows and David Wheeler introduced in 1994 a reversible word transformation [12], denoted by BWT, that turned out to have “myriad virtues”. At the time of its introduction in the field of text compression, the Burrows-Wheeler Transform was perceived as a magic box: when used as a preprocessing step, it would make rather weak compressors competitive, in terms of compression ratio, with the best ones available [25]. In the years that followed, many studies have shown the effectiveness of the BWT and its central role in the field of Data Compression, due to the fact that it can be seen as a “booster” of the performance of memoryless compressors [26, 37, 49].

The importance of this transformation increased further when it was proven in [27] that, in addition to making it easier to represent a string in space close to its entropy, it also makes it easier to search for pattern occurrences in the original string. After this discovery, data transformations inspired by the BWT have been proposed for compactly representing and searching other combinatorial objects such as trees, graphs, finite automata, and even string alignments. See [1, 30] for an attempt to unify some of these results and [50] for an in-depth treatment of the field of compact data structures.

Going back to the original Burrows-Wheeler string transformation, we can summarize its salient features as follows: 1) it can be computed and inverted in linear time, 2) it produces strings which are provably compressible in terms of the high order entropy of the input, 3) it supports pattern search directly on the transformed string in time proportional to the pattern length. It is the combination of these three properties that makes the BWT a fundamental tool for the design of compressed self-indices. In Section 2 we review these properties and also the many attempts to modify the original design. However, we recall that, despite more than twenty years of intense scrutiny, the only nontrivial known BWT variant that fully satisfies properties 1-3 is the Alternating BWT (ABWT). The ABWT has been introduced in [32] in the field of combinatorics of words and its basic algorithmic properties have been described in [34, 35].

In this paper we introduce a whole new family of transformations that satisfy properties 1-3 and can therefore replace the BWT in the construction of compressed self-indices with the same time efficiency of the original BWT and the potential of achieving better compression. We show that our family, supporting linear time computation, inversion, and search, is a special case of a much larger class of transformations that also satisfy properties 1-3 except that, in the general case, inversion and pattern search may take quadratic time. Our larger class also includes the BWT and the ABWT as special cases, and therefore it constitutes a natural candidate for the study of additional properties shared by all known BWT variants.

More in detail, in Section 3 we describe a class of string transformations based on context adaptive alphabet orderings. The main feature of the above class of transformations is that, in the rotation sorting phase, we use alphabet orderings that depend on the context (i.e., the longest common prefix of the rotations being compared). We prove that such transformations are always invertible and provide a general inversion algorithm that runs in time quadratic with respect to the string length. We consider also some subclasses of such transformations in which the permutations associated to each prefix are defined by “simple” rules and we show that they have more efficient inversion algorithms.

In Section 4 we consider the subclass of transformations based on local orderings. In this subclass, the alphabet orderings only depend on a constant portion of the context. We prove that local ordering transformations can be inverted in linear time, and that pattern search in the transformed string takes time proportional to the pattern length. Thus, these transformations have the same properties 1-3 that were so far a prerogative of the BWT and ABWT. We also show that it is always possible to implement an r-index [31] on top of a BWT based on a local ordering, thus making this transformation suitable also for highly repetitive collections.

Having now at our disposal a wide class of string transformations with the same remarkable properties of the BWT, it is natural to use them to improve BWT-based data structures by selecting the one more suitable for the task. In this paper we initiate this study by considering the problem of selecting the BWT variant that minimizes the number of runs in the transformed string. The motivation is that data centers often store highly repetitive collections, such as genome databases, source code repositories, and versioned text collections. For such collections theoretical and practical evidence suggests that entropy underestimates the compressibility of the collection and we can obtain much better compression ratios exploiting runs of equal symbols in the BWT [31, 41, 45, 48, 47, 52, 53, 55].

As we review in Section 2.2, run-minimization is a challenging and multifaceted problem: we believe the introduction of a new class of string transformations makes it even more interesting. In Section 5 we contribute to this problem by showing that, for constant size alphabets and for the most general class of transformations considered in this paper, the BWT variant that minimizes the number of runs can be found in linear time using a dynamic programming algorithm. Although this result does not lead to a practical compression algorithm, the minimal number of runs is a lower bound for the number of runs achievable by the other variants described in this paper and therefore constitutes a baseline for further theoretical or experimental studies.

A preliminary version of this paper appeared in [36]. In this new version we introduce and analyze new BWT variants, in particular the ones described in Sections 3.3 and 3.4. We also added Section 4.1 where we show that the recently introduced r-index [31] can be adapted to work with some of the proposed BWT variants. This result is especially important since the r-index is well suited to highly repetitive collections, which are relevant for the run minimization problem discussed in Section 5. Another new result is the characterization of local ordering transformations contained in Section 4.2. The new version also includes a more in-depth discussion of the existing literature: Section 2.1 has been largely extended and Section 2.2 is new. Finally, we have added some carefully designed examples in order to better illustrate some subtle points of the exposition. All references have been updated, and new relevant results have been discussed in the paper.

2 Notation and background

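Let $\Sigma = \{c_1, c_2, \ldots, c_\sigma\}$ be a finite ordered alphabet of size $\sigma$ with $c_1 < c_2 < \cdots < c_\sigma$, where $<$ denotes the standard lexicographic order. We denote by $\Sigma^*$ the set of strings over $\Sigma$. Given a string $s = s_1 s_2 \cdots s_n \in \Sigma^*$, we denote by $|s|$ its length $n$. We use $\epsilon$ to denote the empty string. A factor of $s$ is written as $s[i,j] = s_i \cdots s_j$ with $1 \le i \le j \le n$. A factor of type $s[1,j]$ is called a prefix, while a factor of type $s[i,n]$ is called a suffix. The $i$-th symbol in $s$ is denoted by $s[i]$. Two strings $s, t \in \Sigma^*$ are called conjugate if $s = uv$ and $t = vu$, where $u, v \in \Sigma^*$. We also say that $s$ is a cyclic rotation of $t$. A string $s$ is primitive if all its cyclic rotations are distinct. Given a string $s$ and $c \in \Sigma$, we write $\mathsf{rank}_c(s, i)$ to denote the number of occurrences of $c$ in $s[1,i]$, and $\mathsf{select}_c(s, j)$ to denote the position of the $j$-th $c$ in $s$.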

Given a primitive string $s$, we consider the matrix of all its cyclic rotations sorted in lexicographic order. Note that the rotations are all distinct by the primitivity of $s$. The last column of the matrix is called the Burrows-Wheeler Transform of the string $s$ and it is denoted by $BWT(s)$ (see Figure 1 (left)). It is shown in [12] that $BWT(s)$ is always a permutation of $s$, and that there exists a linear time procedure to recover $s$, given $BWT(s)$ and the position of $s$ in the rotations matrix (the row marked by the arrow in Figure 1 (left)).
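As a concrete illustration of this definition, the following Python sketch builds the sorted rotation matrix explicitly and inverts the transform through the correspondence between the first and last columns. The function names `bwt` and `inverse_bwt` are ours, and rotations are sorted naively, so the sketch does not attain the linear time bounds cited in the text. It assumes the input ends with a unique end-of-string symbol `$` that is smaller than every other symbol.

```python
def bwt(s):
    """Return (last column, row index of s) of the sorted rotation matrix.

    Naive sketch: materializes all rotations, so it is not linear time.
    """
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last = "".join(rot[-1] for rot in rotations)
    return last, rotations.index(s)


def inverse_bwt(last, row):
    """Recover s from its BWT and the row index of s in the matrix."""
    n = len(last)
    first = sorted(last)  # first column of the matrix
    # order[j] = position in the last column of the symbol occurrence that
    # sits at first[j]; a stable sort keeps equal symbols in the same
    # relative order in both columns.
    order = sorted(range(n), key=lambda i: last[i])
    out = []
    j = row
    for _ in range(n):
        out.append(first[j])
        j = order[j]  # row holding the rotation shifted left by one symbol
    return "".join(out)


if __name__ == "__main__":
    transformed, row = bwt("banana$")
    print(transformed, row)
    print(inverse_bwt(transformed, row))
```

The inversion relies only on the last column and one row index, which mirrors the statement that the transform is a permutation of the input that can be undone given the position of the original string.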

In this paper we follow the assumption, usual in text indexing, that the last symbol of $s$ is a unique end-of-string symbol. This guarantees the primitivity of $s$. In addition, this assumption implies that, if we compare two cyclic rotations symbol-by-symbol, they differ as soon as one of them reaches the end-of-string symbol (or sooner). This property ensures that all families of transformations defined in this paper can be computed in linear time using a suffix tree as described in Section 3 (note that the BWT and ABWT can be computed in linear time also for a generic primitive string [37, 35, 10]). The reader will notice that the algorithms computing the inverse transformations provided in this paper do not make use of the unique end-of-string symbol and therefore are valid for any primitive string.

Figure 1: The original BWT matrix for the input string $s$ (left), and the ABWT matrix of cyclic rotations sorted using the alternating lexicographic order (right). In both matrices the horizontal arrow marks the position of the original string $s$, and the last column is the output of the transformation.

The BWT has been introduced as a data compression tool: it was empirically observed that $BWT(s)$ usually contains long runs of equal symbols. This notion was later mathematically formalized in terms of the empirical entropy of the input string [26, 49]. For $k \geq 0$, the $k$-th order empirical entropy of a string $s$, denoted as $H_k(s)$, is a lower bound to the compression ratio of any algorithm that encodes each symbol of $s$ using a codeword that only depends on the $k$ symbols preceding it in $s$. The simplest compressors, such as Huffman coding, in which the code of a symbol does not depend on the previous symbols, typically achieve a (modest) compression bounded in terms of the zeroth-order entropy $H_0$. Compressors in this class are referred to as memoryless compressors. More sophisticated compressors, such as Lempel-Ziv compressors and derivatives, use knowledge from the already seen part of the input to compress the incoming symbols. They are slower than memoryless compressors but they achieve a much better compression ratio, which can usually be bounded in terms of the $k$-th order entropy of the input string for a large $k$ [42].
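To make the entropy notions concrete, here is a small Python sketch of the zeroth-order and $k$-th order empirical entropy. It uses the usual definition from the compression literature, which is not spelled out in the text above: $H_0$ is computed from symbol frequencies, and $H_k$ is the length-weighted average of the $H_0$ of the symbols that follow each length-$k$ context. The function names `h0` and `hk` are ours.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(cnt / n * log2(n / cnt) for cnt in Counter(symbols).values())


def hk(s, k):
    """k-th order empirical entropy, in bits per symbol (assumed definition)."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(k, len(s)):
        # symbol s[i] is seen in the context of the k symbols preceding it
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


if __name__ == "__main__":
    text = "abababab"
    print(h0(text), hk(text, 1))
```

On a string such as `abababab` the zeroth-order entropy is one bit per symbol, while the first-order entropy is zero because each symbol is fully determined by its predecessor; this gap is what higher-order compressors exploit.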

It is proven in [26, Theorem 5.4] that the informal statement “the output of the BWT is highly compressible” can be formally restated saying that $BWT(s)$ can be compressed up to $H_k(s)$, for any $k$, using any tool able to compress up to the zeroth-order entropy. In other words, after applying the BWT, we can achieve high order compression using a simple (and fast) memoryless compressor. This property is often referred to as the “boosting” property of the BWT. Another remarkable property of the BWT is that it can be used to build compressed indices. It is shown in [28] how to compute the number of occurrences of a pattern $x$ in $s$ in $O(|x| \cdot t_{\mathsf{rank}})$ time, where $t_{\mathsf{rank}}$ is the cost of executing a $\mathsf{rank}$ query over $BWT(s)$. This result has spurred a great interest in data structures representing compactly a string $s$ and efficiently supporting the queries $\mathsf{rank}$, $\mathsf{select}$, and $\mathsf{access}$ (return $s[i]$ given $i$, which is a nontrivial operation when $s$ is represented in compressed form) and there are now many alternative solutions, with different trade-offs. In this paper, we assume a RAM model with word size $w$ and an alphabet of size $\sigma = w^{O(1)}$. Under this assumption, we make use of the following result (Theorem 5.1 in [7]):

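Let $s$ denote a string of length $n$ over an alphabet of size $\sigma = w^{O(1)}$. We can represent $s$ in $nH_0(s) + o(n)$ bits and support constant time $\mathsf{rank}$, $\mathsf{select}$, and $\mathsf{access}$ queries.∎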

The assumption on the alphabet size is only required to ensure that $\mathsf{rank}$, $\mathsf{select}$, and $\mathsf{access}$ take constant time; we use it to simplify the statements of our results. It is straightforward to verify that for larger alphabets our results still hold with the bounds on the running times multiplied by the largest of the costs of the above operations over the larger alphabets.

The properties of the BWT of being compressible and searchable combine nicely to give us indexing capabilities in compressed space. Indeed, combining a zeroth-order representation supporting $\mathsf{rank}$, $\mathsf{select}$, and $\mathsf{access}$ queries with the boosting property of the BWT, we obtain a full text self-index for $s$ whose space is bounded in terms of the $k$-th order empirical entropy of $s$; see [28, 44, 50, 51] for further details on these results and on the field of compressed data structures and algorithms that originated from this area of research.
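The counting procedure mentioned above is commonly implemented as a backward search over the last column. The Python sketch below is a minimal version under our own assumptions: `rank` is simulated by counting in a slice of the string, so each step costs linear rather than constant time, and the names `bwt_last_column` and `count_occurrences` are hypothetical. The input is assumed to end with a unique `$` terminator.

```python
from collections import Counter


def bwt_last_column(s):
    """Last column of the lexicographically sorted rotation matrix (naive)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(last, pattern):
    """Count occurrences of pattern using backward search on the BWT."""
    counts = Counter(last)
    # smaller[c] = number of symbols in the text that are smaller than c
    smaller, total = {}, 0
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]
    lo, hi = 0, len(last)  # half-open range of rows prefixed by the suffix read so far
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        # last[:k].count(c) plays the role of a rank query
        lo = smaller[c] + last[:lo].count(c)
        hi = smaller[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    last = bwt_last_column("banana$")
    print(count_occurrences(last, "ana"))
```

Replacing the slice counts with a succinct rank structure is what turns this loop into a procedure whose cost is proportional to the pattern length.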

2.1 Known BWT variants

We observed that the salient features of the Burrows-Wheeler transformation can be summarized as follows: 1) it can be computed and inverted in linear time, 2) it produces strings which are provably compressible in terms of the high order entropy of the input, 3) it supports linear time pattern search directly on the transformed string. The combination of these three properties makes the BWT a fundamental tool for the design of compressed self-indices. Over the years, many variants of the original BWT have been proposed; in the following we review them, in roughly chronological order, emphasizing to what extent they share the features 1-3 mentioned above.

The original BWT is defined by sorting in lexicographic order all the cyclic rotations of the input string. In [56] Schindler proposes a bounded context transformation that differs from the BWT in that the rotations are lexicographically sorted considering only the first $k$ symbols of each rotation. In [20, 54] it has been shown that this variant satisfies properties 1-3, with the limitation that the compression ratio can reach at most the $k$-th order entropy and that it supports searches of patterns of length at most $k$. Chapin and Tate [15] have experimented with computing the BWT using a different alphabet order. This simple variant still satisfies properties 1-3, but it clearly does not bring any new theoretical insight. In the same paper, the authors also propose a variant in which rotations are sorted following a scheme inspired by reflected Gray codes. This variant shows some improvements in terms of compression, but it has never been analyzed theoretically and it does not seem to support property 3.

A variant called Bijective BWT (BBWT) has been proposed in [38] as a transformation which is bijective even without assuming that the input string $s$ is primitive. In this variant, the output consists of the last characters of the lexicographically sorted cyclic rotations of all factors of the Lyndon Factorization [16] of $s$. This variant has been recently shown in [3] to satisfy property 1; however, since it is based on the cyclic rotations of the Lyndon factors, properties 2 and 3 have not been studied. Note that the related variant Extended BWT [46], which takes as input a collection of strings, does satisfy property 3 for the problem of circular pattern search [24, 29, 40, 39, 10, 11].

More recently, some authors have proposed variants in which the lexicographic order is replaced by a different order relation. The interested reader can find relevant work in a recent review [21]. It turns out that these variants satisfy property 1 in part but nothing is known with respect to properties 2 and 3 (an outline of a substring matching algorithm is given in [22, Sec. 6] without any time analysis, but it is based on substrings rather than single symbols). A major problem when using orderings substantially different from the lexicographic order is that the rotations prefixed by the same substring are not necessarily consecutive in the sorted matrix. For instance, if the cyclic rotations of Figure 1 are sorted according to the V-order [23], two rotations sharing a common prefix are not consecutive in the ordering. Since all the BWT-based searching algorithms work by keeping track of the rows prefixed by larger and larger pattern substrings, the fact that these rows are not consecutive makes the design of an efficient algorithm extremely difficult.

To the best of our knowledge, the only nontrivial BWT variant that fully satisfies properties 1-3 is the Alternating BWT (ABWT). This transformation has been derived in [32] starting from a result in combinatorics of words [19] characterizing the BWT as the inverse of a known bijection between words and multisets of primitive necklaces [33]. The ABWT is defined as the BWT except that, when sorting rotations, instead of the standard lexicographic order we use a different lexicographic order, called the alternating lexicographic order. In the alternating lexicographic order, the first character of each rotation is sorted according to the standard order of $\Sigma$ (i.e., $c_1 < c_2 < \cdots < c_\sigma$). However, if two rotations start with the same character we compare their second characters using the reverse ordering (i.e., $c_\sigma < c_{\sigma-1} < \cdots < c_1$) and so on, alternating the standard and reverse orderings in odd and even positions. Figure 1 (right) shows how the rotations of an input string are sorted using the alternating ordering and the resulting ABWT.
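The alternating lexicographic order can be sketched in Python by negating the symbol codes in even positions (1-based), so that a plain tuple comparison alternates between the standard and the reverse alphabet order. This is a naive illustration with a function name of our choosing (`abwt`); it sorts explicit rotations and therefore is not linear time. The example string and its `$` terminator are our own choice, not the one used in Figure 1.

```python
def abwt(s):
    """Return (last column, row index of s) under the alternating order."""
    n = len(s)

    def alternating_key(rotation):
        # 0-based even positions use the standard order, odd positions the
        # reverse order; all rotations have the same length, so comparing
        # the resulting tuples compares the rotations position by position.
        return tuple(ord(c) if i % 2 == 0 else -ord(c)
                     for i, c in enumerate(rotation))

    rotations = sorted((s[i:] + s[:i] for i in range(n)), key=alternating_key)
    last = "".join(rot[-1] for rot in rotations)
    return last, rotations.index(s)


if __name__ == "__main__":
    print(abwt("banana$"))
```

For `banana$` the rotations starting with `a` are ordered `anana$b`, `ana$ban`, `a$banan`, since in the second position `n` precedes `$` under the reverse order.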

The algorithmic properties of the BWT and ABWT are compared in [34, 35]. It is shown that they can be both computed and inverted in linear time and that their main difference is in the definition of the LF-map, i.e. the correspondence between the characters in the first and last column of the sorted rotations matrix. In the original BWT, the $i$-th occurrence of a character $c$ in the first column $F$ corresponds to the $i$-th occurrence of $c$ in the last column $L$, i.e., equal characters appear in the same relative order in $F$ and $L$. Instead, in the ABWT, equal characters appear in the reverse order in $F$ and $L$. That is, the $i$-th occurrence of $c$ from the top in $F$ corresponds to the $i$-th occurrence of $c$ from the bottom in $L$. Since this modified LF-map can still be computed efficiently using $\mathsf{rank}$ operations, the ABWT can replace the BWT for the construction of self-indices. The experiments in [34] show that the ABWT has essentially the same compression performance as the BWT. However, we are not aware of any experiment using the ABWT for indexing purposes.
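The reversed correspondence between the first and last columns can be turned into an inversion routine. The sketch below is a minimal Python illustration under our own naming (`inverse_abwt`); it precomputes the correspondence with a sort instead of rank queries, so it is not the linear time procedure of [34, 35].

```python
def inverse_abwt(last, row):
    """Recover s from its ABWT (last column) and the row index of s."""
    n = len(last)
    first = sorted(last)  # the first column is sorted in the standard order
    positions = {}
    for i, c in enumerate(last):
        positions.setdefault(c, []).append(i)
    # For each symbol, the k-th occurrence from the top of the first column
    # is matched with the k-th occurrence from the bottom of the last column.
    order = []
    for c in sorted(positions):
        order.extend(reversed(positions[c]))
    out = []
    j = row
    for _ in range(n):
        out.append(first[j])
        j = order[j]  # row of the rotation shifted left by one symbol
    return "".join(out)


if __name__ == "__main__":
    print(inverse_abwt("abnn$aa", 4))
```

The only difference from a standard BWT inversion is the `reversed(...)` call, which reflects the statement that equal characters appear in reverse order in the two columns.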

Note that in [34, 35] the ABWT has been studied within a larger class of transformations in which the alphabet ordering depends on the position of the characters within any cyclic rotation. Although the “compression boosting” property holds for all transformations in this class, in [35] it is shown that the algorithmic techniques, based on the computation of the $\mathsf{rank}$ function on the transformed string, that allow us to invert BWT and ABWT in linear time cannot be applied to any other transformation in that class [35, Theorem 5.9]. Indeed, apart from the BWT and ABWT, for all the transformations studied in [35], the best inversion algorithm takes cubic time [35, Theorem 4.4]. In this paper we improve this result by providing a quadratic time algorithm for a general class of string transformations that includes the one studied in [34, 35] (see Section 3.3).

2.2 Number of runs minimization

The problem of minimizing the number of runs in a transformed string has been studied initially for collections of (relatively short) strings because of the relevance of this setting for bioinformatics applications [18, 4, 43]. String collections are usually transformed with the Extended BWT [46] (EBWT) or with an EBWT variant in which a distinct end-marker is appended to each string, making the collection ordered. In [18, 5, 6] the authors introduce some heuristics for reordering the strings on-the-fly during the EBWT construction in order to reduce the number of runs. Experiments show that these heuristics yield a significant improvement in the overall compression. Recently, the authors of [8, 9] extended these results, obtaining an offline linear time algorithm for finding the string reordering yielding the minimum number of runs, and showing that the optimal reordering can reduce the number of runs by a factor that depends on the sum of the string lengths and on the alphabet size. The authors of [13] consider instead the case in which the same end-marker is appended to each string, and for this setting provide an algorithm to find the string reordering that minimizes the number of BWT runs. See [14] for a recent review of BWT variants for string collections and their properties with respect to the number of runs.

The problem of minimizing the number of runs in a single-string BWT has been studied in [9] where the authors prove that the decision problem of finding the alphabet permutation that minimizes the number of runs in the BWT is NP-complete and the optimization variant is APX-hard. Note that, by introducing new BWT variants, we are adding a new dimension to the run minimization problem: in addition to minimizing over alphabet or string reorderings, we can also minimize over a given class of BWT variants.
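To illustrate what is being minimized, the Python sketch below counts the runs of a transformed string and searches exhaustively over alphabet permutations. The exhaustive search is our own illustration: it tries all $\sigma!$ orderings, which is only feasible for tiny alphabets, and it is not the method of [9]. The function names are hypothetical, and for simplicity the end-of-string symbol is permuted together with the other symbols.

```python
from itertools import groupby, permutations


def count_runs(t):
    """Number of maximal runs of equal symbols in t."""
    return sum(1 for _ in groupby(t))


def bwt_with_order(s, order):
    """BWT of s where the alphabet is ordered as listed in `order`."""
    position = {c: i for i, c in enumerate(order)}
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))),
                       key=lambda rot: [position[c] for c in rot])
    return "".join(rot[-1] for rot in rotations)


def best_alphabet_order(s):
    """Brute-force search for the alphabet order minimizing the BWT runs."""
    alphabet = sorted(set(s))
    return min((count_runs(bwt_with_order(s, p)), "".join(p))
               for p in permutations(alphabet))


if __name__ == "__main__":
    print(count_runs(bwt_with_order("banana$", "$abn")))
    print(best_alphabet_order("banana$"))
```

Minimizing over a class of BWT variants, as proposed in the text, would replace the loop over alphabet permutations with a search over the orderings that define the variant.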

3 BWTs based on Context Adaptive Alphabet Orderings

In this section we introduce a class of string transformations that generalize the BWT in a very natural way. Given a primitive string $s$, as in the original BWT definition, we consider the matrix containing all its cyclic rotations. In the original BWT, the matrix rows are sorted according to the standard lexicographic order. We generalize this concept by sorting the rows using an ordering that depends on their common context, i.e., their longest common prefix. Formally, for each string $x$ that prefixes two or more rows, we assume that an ordering $\pi_x$ is defined on the symbols of $\Sigma$. When comparing two rows whose longest common prefix is $x$, their relative rank is determined by the ordering $\pi_x$ applied to the symbols that follow $x$. Once the matrix rows have been ordered with this procedure, the output of the transformation is the last column of the matrix, as in the original BWT. Thus, these BWT variants are based on context adaptive alphabet orderings. For simplicity, in the following, we call them context adaptive BWTs.

Figure 2: The generalized BWT matrix for the string computed using the orderings , , , , and for every other substring . The horizontal arrow marks the position of the original string ; the last column is the output of the transformation.

We denote by $M_*(s)$ the matrix obtained using this generalized sorting procedure, and by $BWT_*(s)$ the last column of $M_*(s)$. Clearly $BWT_*(s)$ depends on $s$ and the ordering $\pi_x$ used for each common prefix $x$. Since we can arbitrarily choose an alphabet ordering for any substring $x$ of $s$, and there are $\sigma!$ orderings to choose from, our definition includes a very large number of string transformations.

In Figure 2, the generalized BWT matrix for the string is shown. The ordering associated to the empty string is so, among the rows that have no common prefix, first we have those starting with , then those starting with , and finally the one starting with . Since , among the rows which have as their common prefix, first we have the ones in which is the next letter, then the ones in which is the next letter, followed by the ones in which is the next letter. The complete ordering of the rows is established in a similar way on the basis of the orderings . In particular, the ordering between any pair of rows is determined by where is their longest common prefix.

This class of transformations has been mentioned in [26, Sect. 5.2] under the name of string permutations realized by a Suffix Tree (the definition in [26] is slightly more general; for example it includes the bounded context BWT, which is not included in our class). Indeed, one can easily see that $BWT_*(s)$ can also be obtained by visiting the suffix tree of $s$ in depth first order, except that when we reach a node $v$ (including the root), we sort its outgoing edges according to their first characters using the permutation $\pi_u$ associated to the string $u$ labeling the path from the root to $v$. During such a visit, each time we reach a leaf, we write the symbol associated to it: the resulting string is exactly $BWT_*(s)$ (see Figure 3 (right)).

Figure 3: Standard suffix tree for the input string $s$, with a unique end-of-string symbol used as a string terminator (left), and suffix tree with edges reordered using the same orderings of Figure 2 (right). Each leaf is labeled with the symbol that precedes in $s$ the suffix spelled by that leaf. Note that reading left to right the symbols associated to each leaf gives $BWT(s)$ (left) and $BWT_*(s)$ (right).

Although in [26] the authors could not prove the invertibility of context adaptive transformations, which we do in Section 3.2, they observed that their relationship with the suffix tree has two important consequences: 1) they can be computed in linear time, and 2) they provably produce highly compressible strings, i.e., they have the “boosting” property of transforming a zeroth order compressor into a $k$-th order compressor.

Summing up, context adaptive transformations generalize the BWT in two important aspects: efficient computation (linear time in $|s|$) and compressibility. In [26], the only known instances of reversible suffix tree induced transformations were the original BWT and the bounded context BWT. In the following, we prove that all context adaptive BWTs defined above are invertible. Interestingly, to prove invertibility we first establish another important property of these transformations, namely that they can be used to count the number of occurrences of a pattern in $s$, which is another fundamental property of the original BWT.

We conclude this section by observing that both the BWT and ABWT belong to the class we have just defined. To get the BWT, we trivially define $\pi_x$ to be the standard ordering for every $x$, and to get the ABWT, we define $\pi_x$ to be the standard ordering for every $x$ with $|x|$ even, and the reverse ordering for every $x$ with $|x|$ odd.
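The definition of this class can be sketched directly in Python: two rotations are compared by locating their longest common prefix and then ranking the two symbols that follow it with the ordering attached to that prefix. The sketch is naive (explicit rotations, comparison-based sort) and does not reflect the suffix tree construction discussed above. The names `context_adaptive_bwt`, `ordering`, `standard`, and `alternating`, as well as the fixed example alphabet, are our assumptions.

```python
from functools import cmp_to_key

ALPHABET = "$abn"  # example alphabet listed in its standard order


def standard(context):
    """Same ordering for every context: yields the original BWT."""
    return ALPHABET


def alternating(context):
    """Standard order for even-length contexts, reverse otherwise (ABWT)."""
    return ALPHABET if len(context) % 2 == 0 else ALPHABET[::-1]


def context_adaptive_bwt(s, ordering):
    """Return (last column, row of s); `ordering(x)` lists the alphabet
    in the order to be used after the common prefix x."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]

    def compare(a, b):
        k = 0
        while k < n and a[k] == b[k]:
            k += 1
        if k == n:
            return 0  # identical rotations; cannot happen if s is primitive
        pi = ordering(a[:k])
        return pi.index(a[k]) - pi.index(b[k])

    rotations.sort(key=cmp_to_key(compare))
    return "".join(rot[-1] for rot in rotations), rotations.index(s)


if __name__ == "__main__":
    print(context_adaptive_bwt("banana$", standard))
    print(context_adaptive_bwt("banana$", alternating))
```

Any function from contexts to alphabet permutations can be passed as `ordering`, which is a simple way to experiment with other members of the class.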

3.1 Counting occurrences of patterns in Context Adaptive BWTs

Let $L = BWT_*(s)$ denote a context adaptive BWT. In the following, we assume that $L$ is enriched with data structures supporting constant time rank queries as in Theorem 2. In this section we show that, given $L$ and the set of alphabet permutations used to build $M_*(s)$, we can determine in $O(\sigma |x|^2)$ time the set of rows prefixed by $x$, for any string $x$. We first observe that, by construction, this set of rows, if non-empty, forms a contiguous range inside $M_*(s)$. This observation justifies the following definitions.

Given a string $x$, we denote by $R_x$ the range of rows of $M_*(s)$ prefixed by $x$. More precisely, if $R_x = [b, \ell]$, then row $i$ is prefixed by $x$ if and only if it is $b \le i < b + \ell$. If no rows are prefixed by $x$, we set $R_x = [0, 0]$. Note that $\ell$ is the number of occurrences of $x$ in the circular string $s$.

For technical reasons, given $x$, we are also interested in the set of rows prefixed by the strings $xc$ as $c$ varies in $\Sigma$. Clearly, these sets of rows are consecutive in $M_*(s)$ and their union coincides with $R_x$.

Given a string $x$, we denote by $\mathcal{R}_x$ the set of $\sigma + 1$ integers $[b, \ell_1, \ldots, \ell_\sigma]$ such that $b$ is the lower extreme of $R_x$ and, for $i = 1, \ldots, \sigma$, $\ell_i$ is the number of rows of $M_*(s)$ prefixed by $xc_i$.

Since $R_x$ is the union of the ranges $R_{xc_i}$, for $i = 1, \ldots, \sigma$, we have that, if $R_x = [b, \ell]$, then $\ell = \sum_{i=1}^{\sigma} \ell_i$. Note also that the ordering of the ranges $R_{xc_i}$ within $R_x$ is determined by the permutation $\pi_x$. As observed in Section 2, we can assume that $L$ supports constant time rank queries. This implies that, in constant time, we are also able to count the number of occurrences of a symbol inside a substring $L[i,j]$.

Let us consider the string and the generalized BWT matrix illustrated in Figure 2. Since , we have that , therefore .

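Given $\mathcal{R}_x$ and the permutation $\pi_x$, the set of values $R_{xc_i}$ for all $c_i \in \Sigma$ can be computed in $O(\sigma)$ time.

Proof.

If $\mathcal{R}_x = [b, \ell_1, \ldots, \ell_\sigma]$, then $R_{xc_i} = [b_i, \ell_i]$ with

$$ b_i = b + \sum_{j} \ell_j \qquad (1) $$

where the summation in (1) is done over all $j$ such that $c_j$ is smaller than $c_i$, according to the permutation $\pi_x$. ∎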
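The computation in this proof amounts to a prefix sum over the symbol counts, taken in the order prescribed by the permutation for the current context. A minimal Python sketch follows; the function name `child_ranges`, the dictionary-based representation, and the 0-based row indices are our assumptions, with each range stored as a (first row, number of rows) pair.

```python
def child_ranges(b, counts, pi):
    """Split the range of rows prefixed by x into one range per next symbol.

    b      -- first row prefixed by x
    counts -- dict: symbol c -> number of rows prefixed by x + c
    pi     -- the alphabet listed in the order used after context x
    """
    ranges = {}
    start = b
    for c in pi:
        length = counts.get(c, 0)
        ranges[c] = (start, length)  # rows start .. start + length - 1
        start += length
    return ranges


if __name__ == "__main__":
    # Rows prefixed by "a" in the alternating-order matrix of "banana$":
    # they start at row 1; two continue with "n" and one with "$".
    print(child_ranges(1, {"n": 2, "$": 1}, "nba$"))
```

In the example, the reverse ordering places the rows prefixed by `an` before the row prefixed by `a$`, so the former occupy rows 1 and 2 and the latter row 3.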

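Let $x = x_1 x_2 \cdots x_m$ be any length-$m$ string with $m > 1$. Then, given $\mathcal{R}_{x_1 \cdots x_{m-1}}$ and $\mathcal{R}_{x_2 \cdots x_m}$, the set of values $\mathcal{R}_{x_1 \cdots x_m}$ can be computed in $O(\sigma)$ time.

Proof.

By Lemma 3.1, given $\mathcal{R}_{x_1 \cdots x_{m-1}}$ and $\pi_{x_1 \cdots x_{m-1}}$, we can compute $R_{x_1 \cdots x_m}$. In order to compute $\mathcal{R}_{x_1 \cdots x_m}$, we additionally need the number of rows prefixed by $x_1 \cdots x_m c$, for any $c \in \Sigma$. These numbers can be obtained by first computing the ranges $R_{x_2 \cdots x_m c}$ using again Lemma 3.1. The number of rows prefixed by $x_1 x_2 \cdots x_m c$ can be obtained by counting the number of occurrences of $x_1$ in the portions of $L$ corresponding to each range $R_{x_2 \cdots x_m c}$. The counting takes $O(\sigma)$ time since we are assuming $L$ supports constant time $\mathsf{rank}$ as in Theorem 2. ∎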

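Suppose we are given $L = BWT_*(s)$ with constant time rank support, and the set of permutations used to compute the matrix $M_*(s)$. Then, given any string $x$, the range $R_x$ of rows prefixed by $x$ can be computed in $O(\sigma |x|^2)$ time and $O(\sigma |x|)$ space.

Proof.

Let $x = x_1 \cdots x_m$. We need to compute $\mathcal{R}_{x_1 \cdots x_m}$. To this end we consider the following scheme, inspired by the Newton finite difference formula:

$$
\begin{array}{lllll}
\mathcal{R}_{x_1} & \mathcal{R}_{x_1 x_2} & \mathcal{R}_{x_1 x_2 x_3} & \cdots & \mathcal{R}_{x_1 \cdots x_m} \\
\mathcal{R}_{x_2} & \mathcal{R}_{x_2 x_3} & \cdots & \mathcal{R}_{x_2 \cdots x_m} & \\
\vdots & \vdots & & & \\
\mathcal{R}_{x_{m-1}} & \mathcal{R}_{x_{m-1} x_m} & & & \\
\mathcal{R}_{x_m} & & & &
\end{array}
$$

Using Lemma 3.1, we can compute $\mathcal{R}_{x_i \cdots x_j}$ given $\mathcal{R}_{x_i \cdots x_{j-1}}$ and $\mathcal{R}_{x_{i+1} \cdots x_j}$. Thus, from two consecutive entries in the same column, we can compute one entry in the following column. To compute $\mathcal{R}_{x_1 \cdots x_m}$ we can for example perform the computation bottom-up, proceeding row by row. In this case, we are essentially computing the ranges corresponding to $x_m$, $x_{m-1} x_m$, and so on, in a sort of backward search. However, we can also perform the computation top down, diagonal by diagonal, and in this case, we are computing the ranges corresponding to $x_1$, $x_1 x_2$, and so on, up to $x_1 \cdots x_m$. In both cases, the information one needs to store from one iteration to the next is $O(m)$ values $\mathcal{R}$, which take $O(m\sigma)$ words. By Lemma 3.1, the computation of each value takes $O(\sigma)$ time, so the overall complexity is $O(m^2 \sigma)$ time. ∎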

For the transformation described in Figure 2, the computation of Theorem 3.1 for computing works as follows

Finally, since , we have .

Note that our scheme for the computation of $R_x$ is based on the computation of $\mathcal{R}_y$ for substrings $y$ of $x$. If $x$ has many repetitions, the overall cost could be less than quadratic. In the extreme case, $x = c^m$ for a single symbol $c$, $R_x$ can be computed in $O(m\sigma)$ time.

3.2 Inverting Context Adaptive BWTs

We now show that the machinery we set up for counting occurrences can be used to retrieve given , and thus to invert any context adaptive BWT.

Given and a row index with , the -st character of row can be computed in time.

Proof.

Let denote the alphabet symbols reordered according to permutation , and let denote the values reordered according to the same permutation. Since , row is prefixed by . Since the rows prefixed by  are sorted in their -st position according to , the -st symbol of row is the symbol such that

Given with constant time rank support, the permutations used to build the matrix , and the row index containing in , the original string can be recovered in time and working space.

Proof.

Let . From , in time we retrieve the number of occurrences of each character in and hence the ranges , , …, . From those and the row index , we retrieve the first character of , i.e. . Next, counting the number of occurrences of in the ranges of corresponding to , , …, , we compute .

Finally, we show by induction that, for , given , we can retrieve and in time. By Lemma 3.2, from and we retrieve . Next, assuming we maintained the ranges , for we can compute adding one diagonal to the scheme shown in the proof of Theorem 3.1. By Lemma 3.1, the overall cost is time as claimed. ∎

The above theorem establishes that all context adaptive BWTs are invertible. Note that in our definition, the alphabet ordering associated to can depend on the whole string ; in this sense the context has full memory. We consider this an important conceptual result. However, from a practical point of view, a transformation whose definition requires the specification of alphabet permutations appears rather cumbersome. For this reason, in the following, we consider some subclasses of transformations in which the permutations associated to each prefix are defined by “simple” rules. We show that, in some cases, nontrivial generalized BWTs have simpler and more efficient inversion algorithms.
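To make the invertibility claim concrete, the following Python sketch builds the matrix of a context adaptive transformation by brute force and inverts it with the textbook prepend-and-sort procedure. It is not the algorithm of the theorem above: it sorts whole rotations and is much slower. The callback `perm_of`, the function names, and the representation of an ordering as a string (smallest symbol first) are illustrative assumptions, not notation from the paper.

```python
from functools import cmp_to_key


def make_compare(perm_of):
    """Comparator induced by a context -> ordering rule.

    perm_of(w) is a hypothetical callback returning the alphabet ordering
    (a string, smallest symbol first) associated with the context w.
    """
    def compare(u, v):
        for i, (a, b) in enumerate(zip(u, v)):
            if a != b:
                order = perm_of(u[:i])  # u[:i] is the longest common prefix
                return order.index(a) - order.index(b)
        return 0  # identical strings (can happen for periodic inputs)
    return compare


def context_adaptive_bwt(s, perm_of):
    """Return (last column, index of the row containing s)."""
    rows = sorted((s[i:] + s[:i] for i in range(len(s))),
                  key=cmp_to_key(make_compare(perm_of)))
    return "".join(row[-1] for row in rows), rows.index(s)


def naive_inverse(last_column, row_index, perm_of):
    """Rebuild the matrix column by column: prepend the last column, re-sort."""
    key = cmp_to_key(make_compare(perm_of))
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        rows = sorted((c + r for c, r in zip(last_column, rows)), key=key)
    return rows[row_index]
```

For instance, with the made-up rule `perm_of = lambda w: "nba" if w.endswith("a") else "abn"`, calling `naive_inverse(*context_adaptive_bwt("banana", perm_of), perm_of)` should give back the input string. The sketch works because the comparator orders prefixes of rows consistently with the full rows, so each re-sorting step reproduces one more column of the matrix.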

3.3 Special case: ordering based on context length

In [34], the authors introduced a set of generalized transformations that turn out to be a subclass of context adaptive BWTs. Given a -tuple of alphabet permutations , using the notation of this paper, the transformation in [34] is defined as the context adaptive BWT in which the permutation associated to the string is where . Hence, the permutation associated to each string only depends on its depth, and the -tuple completely determines the transformation. In [34, 35], it was shown that for every the transformation can be inverted in time; Theorem 3.2 therefore constitutes an alternative faster inversion algorithm. To our knowledge, no faster algorithm is known, even for , when the whole transformation depends on just two alphabet permutations.

Let us consider and the -tuple of alphabet permutations , where , and . The matrix for the string is shown in Figure 4. If is the longest common prefix between two rows, their ordering depends on the respective characters at the -th position, according to the permutation . For instance, , since is the longest common prefix and , according to .

Figure 4: The matrix for the string computed using the triple of alphabet permutations . The horizontal arrow marks the position of the original string ; the last column is the output of the transformation.
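The depth-based rule can be illustrated with a short brute-force Python sketch. It assumes that a context of length i uses the permutation at position i mod k of the tuple (0-based); the exact indexing convention of [34] is not recoverable from the text here, so this choice, like the function name and the string representation of orderings, is an assumption made for illustration only.

```python
from functools import cmp_to_key


def depth_ordered_bwt(s, perms):
    """Brute-force transformation driven by a tuple of alphabet orderings.

    perms is a list of strings, each listing the alphabet from smallest to
    largest symbol. Assumption: two rows whose longest common prefix has
    length i are compared with perms[i % len(perms)].
    """
    k = len(perms)

    def compare(u, v):
        for i, (a, b) in enumerate(zip(u, v)):
            if a != b:
                order = perms[i % k]
                return order.index(a) - order.index(b)
        return 0  # identical rotations

    rows = sorted((s[i:] + s[:i] for i in range(len(s))),
                  key=cmp_to_key(compare))
    # Last column and position of the original string in the matrix.
    return "".join(row[-1] for row in rows), rows.index(s)
```

With a single ordering the sketch reduces to the usual sorting of rotations, and with two orderings that are reversals of each other it alternates the direction of comparison from one column to the next.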

3.4 Special case: ordering

Given a permutation of the alphabet , we denote by the reversal of , that is, the permutation such that, for each pair of symbols , in ,

Let denote an arbitrary permutation of . We consider the subclass of context adaptive transformations in which the permutation associated to each substring can be either or its reversal . Once is established, for each string we only need an additional bit to specify the ordering . For this subclass we say that the matrix is based on a  ordering (see example in Figure 5). In this section, we show that, for the transformations in this subclass, the space and time for inversion can be reduced by a factor .

Figure 5: Example of a transformation based on a  ordering. Let so that . The generalized BWT matrix for the string is computed using the following permutations , , , , and for every other substring . The horizontal arrow marks the position of the original string ; the last column is the output of the transformation.
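As a rough illustration of this subclass, the Python sketch below stores a base ordering once and a single bit per context, here encoded as membership in a set of contexts that use the reversal. The function name, the set `reversed_contexts`, and the string representation of the ordering are hypothetical choices; the matrix is built by brute-force sorting, not with the range computations discussed in this section.

```python
from functools import cmp_to_key


def two_way_bwt(s, pi, reversed_contexts):
    """Brute-force transformation where each context uses pi or its reversal.

    pi lists the alphabet from smallest to largest symbol. A context w uses
    the reversal of pi when w is in reversed_contexts, and pi otherwise, so
    one bit per context suffices once pi is fixed.
    """
    pi_reversed = pi[::-1]

    def compare(u, v):
        for i, (a, b) in enumerate(zip(u, v)):
            if a != b:
                order = pi_reversed if u[:i] in reversed_contexts else pi
                return order.index(a) - order.index(b)
        return 0  # identical rotations

    rows = sorted((s[i:] + s[:i] for i in range(len(s))),
                  key=cmp_to_key(compare))
    return "".join(row[-1] for row in rows), rows.index(s)
```

Passing an empty set gives the ordinary sorting of rotations under `pi`; adding the empty context `""` to the set reverses only the order of the first column.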

Let be based on a  ordering, and let be any length- string, with . Then, given , and , we can compute in time.

Proof.

Let and . Recall that there is a bijection between the rows in and the rows in whose last symbol is . We exploit this bijection to find given . The size of is equal to the number of rows in ending with , so to completely determine we just need to compute how many rows in precede . If this number is equal to the number of rows in above and ending with . If , since necessarily , this number is equal to the number of rows in below and ending with . Since counting the number of rows ending in in a given range can be done using rank operations on , the lemma follows. ∎
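The counting step at the end of the proof can be sketched as follows. For simplicity, the sketch answers rank queries with a plain table of prefix counts rather than a succinct constant-time rank structure; the names `build_rank` and `count_in_range` are hypothetical.

```python
def build_rank(last_column):
    """Prefix counts: table[c][i] = occurrences of c in last_column[:i]."""
    alphabet = sorted(set(last_column))
    table = {c: [0] for c in alphabet}
    for symbol in last_column:
        for c in alphabet:
            table[c].append(table[c][-1] + (symbol == c))
    return table


def count_in_range(table, c, lo, hi):
    """Number of rows with index in [lo, hi) whose last symbol is c."""
    return table[c][hi] - table[c][lo]
```

Counting the rows above or below a given position inside a range then amounts to one call with the appropriate half-open interval, that is, two rank lookups.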

Consider again the example of Fig. 5. Let , so and . It is , and . The number of rows in ending with is . Such a number gives the size of . Moreover, since , we need to count the number of rows in above and ending with . Such number is . This means that row in precedes . Hence, . If we consider , it is and , , . The number of rows in ending with is . So, is the size of . Since , we have to count the number of rows in below and ending with . Such number is . This means that there is row in that precedes , hence .

Suppose is based on a  ordering. Given with constant time rank support, and any string , we can compute the range of rows prefixed by in time and space.

Proof.

We reason as in the proof of Theorem 3.1, except that because of Lemma 3.4 we work with instead of . The claim follows by observing that each entry takes space and can be computed in time. ∎

Suppose is based on a  ordering. Given with constant time rank support and the row index containing in , we can retrieve the original string in time and working space.∎

4 BWTs based on local orderings

In this section, we consider the context adaptive transformations in which the alphabet ordering associated to each string only depends on the last symbols of , where is fixed. In the following, we refer to these string transformations as BWTs based on local orderings. We show that local ordering transformations have properties very similar to the original BWT, since they can be inverted in linear time and also support searching for a pattern in the original text in time proportional to the pattern length.

Figure 6: The local ordering BWT matrix for the string computed using the orderings , , . The horizontal arrow marks the position of the original string ; the last column is the output of the transformation.

We start by analyzing the case . For such local ordering transformations, the matrix depends on only alphabet orderings: one for each symbol plus the one used to sort the first column of . See Figure 6 for an example. The following lemma establishes an important property of local ordering transformations.

If is based on a local ordering, then for any pair of characters and , there is an order preserving bijection between the set of rows starting with and the set of rows starting with and ending with .

Proof.

Note that both sets of rows contain a number of elements equal to the number of occurrences of in the circular string . In the following, we write to denote the cyclic rotation of starting with . Assume that rotations and both start with and end with and let denote the first column in which the two rotations differ. Rotation precedes in if and only if is smaller than according to the alphabet ordering associated to symbol . The two rotations and both start with and their relative position also depends on the relative ranks of and according to the alphabet ordering associated to symbol . Hence the relative order of and is the same as the one of and . ∎
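The lemma can be checked experimentally on small inputs. The Python sketch below builds, by brute-force sorting, the matrix of a local ordering transformation in which the ordering depends only on the last symbol of the common prefix, and then compares the two sets of rows from the statement. The function names, the dictionary `order_after`, and the sample orderings in the usage note are assumptions chosen for illustration.

```python
from functools import cmp_to_key


def local_ordering_rows(s, first_order, order_after):
    """Sorted rotations of s under a local ordering with one symbol of context.

    first_order sorts the first column; order_after[x] is the ordering used
    when the common prefix of two rows ends with the symbol x. Orderings are
    strings listing the alphabet from smallest to largest symbol.
    """
    def compare(u, v):
        for i, (a, b) in enumerate(zip(u, v)):
            if a != b:
                order = first_order if i == 0 else order_after[u[i - 1]]
                return order.index(a) - order.index(b)
        return 0  # identical rotations

    return sorted((s[i:] + s[:i] for i in range(len(s))),
                  key=cmp_to_key(compare))


def bijection_is_order_preserving(rows, a, c):
    """Compare rows starting with c followed by a against the rows that
    start with a and end with c, after moving their last symbol to the front."""
    starting_with_ca = [r for r in rows if r[:2] == c + a]
    rotated = [r[-1] + r[:-1] for r in rows if r[0] == a and r[-1] == c]
    return starting_with_ca == rotated
```

For example, with `rows = local_ordering_rows("banana", "abn", {"a": "nba", "b": "abn", "n": "bna"})`, the check should succeed for every pair of symbols, in line with the lemma.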

Armed with the above lemma, we now show that for local ordering transformations we can establish much stronger results than the ones provided in Section 3.1.

Suppose supports constant time rank queries. Let be any length- string with . Then, given , and , the value can be computed in time.

Proof.

By Lemma 4, there is an order preserving bijection between the rows in and those in ending with . In this bijection, the rows in correspond to those in ending with . Because of this bijection, if, among the ordered set of rows starting with and ending with , those prefixed by are in positions , then, among the rows starting with , those prefixed by are consecutive and in positions . The lemma follows since if and , we have

and, if , it is . ∎

Suppose is based on a local ordering and supports constant time rank queries. After a time preprocessing, given any string