Michael Burrows and David Wheeler introduced in 1994 a reversible word transformation [bwt94], denoted by BWT, that turned out to have “myriad virtues”. At the time of its introduction in the field of text compression, the Burrows-Wheeler Transform was perceived as a magic box: used as a preprocessing step, it made rather weak compressors competitive in terms of compression ratio with the best ones available [cj/fenwick96]. In the years that followed, many studies have shown the effectiveness of the BWT and its central role in the field of Data Compression, due to the fact that it can be seen as a “booster” of the performance of memoryless compressors [FGMS2005; GIA07; Manzini2001]. Moreover, it was shown in [Ferragina:2000] that the BWT can be used to efficiently search for occurrences of patterns inside the original text. These capabilities of the BWT gave rise to the field of Compressed Full-text Self-indices [books/MBCT2015; books/daglib/0038982]. The remarkable properties of the BWT have aroused great interest both from the theoretical and the applicative point of view [MantaciRRS07; MantaciRRS08; MantaciRS08; YangZhangWang2010; LiDurbin10; bioinformatics/CoxBJR12; KKbioinf15; RosoneSciortino_CiE2013; tcs/GagieMS17; PPRS2019].
In the context of Combinatorics on Words, many studies have addressed the characterization of the words that become the most compressible after the application of the BWT [MaReSc; puglisiSimpson; PakRedlich2008; RestivoRosoneTCS2009; RestivoRosoneTCS2011; Ferenczi_Zamboni2012arXiv]. Recent studies have focused on measuring the “clustering effect” of the BWT, a property related to its boosting role as a preprocessing step for text compressors [MantaciRRSV17; MantaciRRS17].
In [CDP2005], the authors characterize the BWT as the inverse of a known bijection between words and multisets of primitive necklaces [GeRe]. Starting from this result, in [GesselRestivoReutenauer2012] the authors introduce and study the basic properties of the Alternating Burrows-Wheeler Transform, denoted ABWT from now on. It is a transformation on words analogous to the BWT, except that the cyclic rotations of the input word are sorted using the alternating lexicographic order instead of the usual lexicographic order. The alternating lexicographic order is defined for infinite words as follows: the first letters are compared with the given alphabetic order; in case of equality the second letters are compared with the opposite order, and so on, alternating the two orders for odd and even positions.
In this paper we show that the ABWT satisfies most of the properties that make the BWT such a useful transformation. Not only can the ABWT be computed and inverted in linear time, but the backward-search procedure, which is the basis for indexed exact string matching on the BWT, can also be efficiently generalized to the ABWT. This implies that the ABWT can be used to build an efficient compressed full-text index for the transformed string, similarly to the BWT. Note that the variants of the original BWT which have been introduced so far in the literature [Schindler1997; dcc/ChapinT98; DAYKIN2017] were either simple modifications that do not bring new theoretical insight, or they were significantly different but without the remarkable compression and search properties of the original (see [GMRS_CPM2019arxiv, Section 2.1] for a more detailed discussion of these variants).
The existence of the ABWT shows that the classical lexicographic order is not the only order relation one can use to obtain a reversible transformation. Indeed, the lexicographic and the alternating lexicographic order are two particular cases of a more general class of order relations considered in [DOLCE_Reut_Rest2018; Reutenauer2006]. In a preliminary version of this paper [GMRRS_DLT2018] we therefore introduced a class of reversible word transformations based on the above order relations that includes both the original BWT and the Alternating BWT. Within this class, we introduce the notion of rank-invertibility, a property that guarantees that the transformation can be efficiently inverted using rank operations, and we prove that the BWT and the ABWT are the only transformations within this class that are rank-invertible.
We also consider the problem of efficiently computing the ABWT. We first show how to generalize to the alternating lexicographic order the Difference Cover technique introduced in [Karkkainen:2006]. This result leads to the design of time-optimal and space-efficient algorithms for the construction of the ABWT in different models of computation when the input string ends with a unique end-of-string symbol. Finally, we explore some combinatorial properties of Galois words, which are the minimal cyclic rotations within a conjugacy class with respect to the alternating lexicographic order. We provide a linear time and space algorithm to find the Galois rotation of a given word and we show that, by combining this algorithm with the Difference Cover technique, the ABWT can be computed in linear time even when the input string does not end with a unique end-of-string symbol.
Motivated by the discovery of the ABWT, in [GMRS_CPM2019arxiv] the authors explore a class of string transformations that includes the one considered in this paper. In this larger class, the cyclic rotations of the input string are sorted using an alphabet ordering that depends on the longest common prefix of the rotations being compared. Somewhat surprisingly, some of the transformations in this class have the same properties as the BWT and the ABWT, showing that our understanding of these transformations is still incomplete.
Let Σ = {c₁, c₂, …, c_σ} be an ordered constant-size alphabet with c₁ < c₂ < ⋯ < c_σ, where < denotes the standard lexicographic order. We denote by Σ* the set of words over Σ. Given a finite word w = w₁w₂⋯w_n ∈ Σ*, we denote by |w| its length n. We use ε to denote the empty word. We denote by |w|_c the number of occurrences of the letter c in w. The Parikh vector of a word w is a σ-length array P_w of integers such that P_w[i] = |w|_{c_i} for each i = 1, …, σ. Given a word w, a symbol c, and an integer i with 1 ≤ i ≤ |w|, we write rank_c(w, i) to denote the number of occurrences of c in w[1..i].
Given a finite word w, a factor of w is written as w[i..j] = w_i ⋯ w_j, with 1 ≤ i ≤ j ≤ |w|. A factor of type w[1..j] is called a prefix, while a factor of type w[i..n] is called a suffix. The longest proper factor of w that is both a prefix and a suffix of w is called a border. The i-th symbol of w is denoted by w[i]. Two words u, v ∈ Σ* are conjugate if u = xy and v = yx, where x, y ∈ Σ*. We also say that u is a cyclic rotation of v. A word w is primitive if all of its cyclic rotations are distinct. A primitive word is a Lyndon word if it is lexicographically smaller than all of its other conjugates. Conjugacy between words is an equivalence relation over Σ*. A word u is called a circular factor of w if it is a factor of some conjugate of w.
Given two words of the same length, u = u₁u₂⋯u_n and v = v₁v₂⋯v_n, we write u ⪯_lex v if and only if u = v or u_k < v_k, where k is the smallest index at which the corresponding characters of the two words differ. Analogously, and with the same notation as before, we write u ⪯_alt v if and only if u = v or (a) k is even and u_k > v_k, or (b) k is odd and u_k < v_k. Notice that ⪯_lex is the standard lexicographic order relation on words, while ⪯_alt is the alternating lexicographic order relation. These orders are used in Section 3 to define two different transformations on words.
The run-length encoding of a word w, denoted by rle(w), is a sequence of pairs (c_i, ℓ_i) such that c_i^{ℓ_i} is a maximal run of the letter c_i (i.e., ℓ_i > 0 and c_i ≠ c_{i+1}), and all such maximal runs are listed in rle(w) in the order they appear in w. We denote by ρ(w) = |rle(w)| the number of pairs in rle(w), or equivalently the number of equal-letter runs in w. Moreover, we denote by ρ_c(w) the number of pairs in rle(w) whose first component is the letter c. Notice that ρ(w) ≤ ρ(w₁) + ⋯ + ρ(w_t), where w = w₁ ⋯ w_t is any partition of w.
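For concreteness, the run-length encoding can be sketched in a few lines of Python (the function names `rle` and `runs` are ours, not the paper's notation):

```python
from itertools import groupby

def rle(w):
    """Run-length encoding of w: the list of (letter, length) pairs of
    its maximal equal-letter runs, in order of appearance."""
    return [(c, len(list(g))) for c, g in groupby(w)]

def runs(w):
    """Number of equal-letter runs of w, i.e. the length of rle(w)."""
    return len(rle(w))
```

For instance, `rle("aabbba")` yields `[("a", 2), ("b", 3), ("a", 1)]`, so `runs("aabbba")` is 3.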
The zero-th order empirical entropy of the word w is defined as

H₀(w) = Σ_{c ∈ Σ} (|w|_c / |w|) log(|w| / |w|_c)

(all logarithms are taken to base 2 and we assume 0 log 0 = 0). The value |w| H₀(w) is the output size of an ideal compressor that uses log(|w| / |w|_c) bits to encode each occurrence of the symbol c. This is the minimum size we can achieve using a uniquely decodable code in which a fixed codeword is assigned to each alphabet symbol.
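The definition of H₀ translates directly into code; a minimal Python sketch (the convention 0 log 0 = 0 is handled implicitly, since absent symbols contribute no term to the sum):

```python
from collections import Counter
from math import log2

def h0(w):
    """Zero-th order empirical entropy of w, in bits per symbol:
    sum over the symbols c of (n_c / n) * log2(n / n_c), where
    n = |w| and n_c is the number of occurrences of c in w."""
    n = len(w)
    return sum((nc / n) * log2(n / nc) for nc in Counter(w).values())
```

A word over a single letter has entropy 0, while a balanced binary word such as "abab" has entropy 1 bit per symbol.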
For any length-k factor u of w, we denote by u_w the sequence of characters preceding the occurrences of u in w, taken from left to right. If u is not a factor of w, the word u_w is empty. The k-th order empirical entropy of w is defined as

H_k(w) = (1 / |w|) Σ_{u ∈ Σ^k} |u_w| H₀(u_w).

The value |w| H_k(w) is a lower bound to the output size of any compressor that encodes each symbol with a code that only depends on the symbol itself and on the k preceding symbols. Since the use of a longer context helps compression, it is not surprising that H_{k+1}(w) ≤ H_k(w) for any k ≥ 0.
3 BWT and Alternating BWT
In this section we describe two different invertible transformations on words, based on the lexicographic and on the alternating lexicographic order, respectively. Given a primitive word w of length n in Σ*, the Burrows-Wheeler transform, denoted by BWT [bwt94], and the Alternating Burrows-Wheeler transform, denoted by ABWT [GesselRestivoReutenauer2012], are defined constructively as follows:
1. Create the matrix M(w) whose rows are the cyclic rotations of w;
2. Create the matrix M_s(w):
for the BWT, by sorting the rows of M(w) according to ⪯_lex;
for the ABWT, by sorting the rows of M(w) according to ⪯_alt;
3. Return as output the pair (L, I), where L is the last column in the matrix M_s(w)
and, in both cases, I is the integer giving the position of w in that matrix.
An example of the above process, together with the corresponding output, is provided in Fig. 1.
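The constructive definition above can also be turned into a naive quadratic-time Python sketch (function and variable names are ours; practical implementations use the suffix-array based algorithms discussed in Section 6):

```python
from functools import cmp_to_key

def _alt_cmp(u, v):
    # Alternating lexicographic comparison of equal-length words:
    # standard order at even 0-based positions, reversed at odd ones.
    for i, (a, b) in enumerate(zip(u, v)):
        if a != b:
            s = -1 if a < b else 1
            return s if i % 2 == 0 else -s
    return 0

def transform(w, alternating=False):
    """Return the pair (L, I) for the BWT (default) or the ABWT of a
    primitive word w: sort the cyclic rotations, then take the last
    column and the row index of w itself."""
    rots = [w[i:] + w[:i] for i in range(len(w))]
    rots.sort(key=cmp_to_key(_alt_cmp) if alternating else None)
    return "".join(r[-1] for r in rots), rots.index(w)
```

For example, `transform("banana")` gives `("nnbaaa", 3)` while `transform("banana", alternating=True)` gives `("bnnaaa", 3)`: the two transforms output the same multiset of symbols in different orders.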
If two words are conjugate, their BWT (resp. ABWT) will have the same column L and differ only in the index I, whose purpose is only to distinguish between the different members of the conjugacy class. However, I is not necessary in order to recover the matrix M_s(w) from the last column L.
The following proposition, proved in [GMRRS_DLT2018], states that three well-known properties of the BWT hold, in a slightly modified form, for the ABWT as well. Here we report the proof for the sake of completeness.
Proposition 3.2. Let w be a word and let (L, I) be the output of the BWT or the ABWT applied to w. The following properties hold:
1. Let F denote the first column of M_s(w); then, for both the BWT and the ABWT, F is obtained by lexicographically sorting the symbols of L.
2. For every i = 1, …, n, the symbol L[i] circularly precedes the symbol F[i] in the original word w, for both the BWT and the ABWT.
3. For each symbol c, let n_c = |w|_c; then the i-th occurrence of c in F corresponds
(a) for the BWT, to its i-th occurrence in L;
(b) for the ABWT, to its (n_c + 1 − i)-th occurrence in L.
Properties 1, 2 and 3a for the BWT have been established in [bwt94]. Properties 1 and 2 for the ABWT are straightforward. To prove Property 3b, consider two rows i and j, with i < j, in M_s(w) both starting with the symbol c. Let cu′ and cv′ be the two conjugates of w in rows i and j of M_s(w). By construction we have cu′ ⪯_alt cv′. To establish Property 3b, we need to show that row i cyclically rotated by one position follows, in the ⪯_alt order, row j cyclically rotated by one position. In other words, we need to show that cu′ ≺_alt cv′ implies v′c ≺_alt u′c.
To prove the above implication, we notice that if the first position in which cu′ and cv′ differ is odd (resp. even), then the first position in which u′c and v′c differ will be in an even (resp. odd) position. The thesis follows by the alternate use of the standard and reverse order in ⪯_alt (see [GesselRestivoReutenauer2012] for a different proof of the same property).∎
It is well known that in the BWT the occurrences of the same symbol appear in columns F and L in the same relative order, according to Property 3a. In the ABWT, by Property 3b, the occurrences in L appear in the reverse order with respect to F. For example, in Fig. 1 (right) we see that the occurrences of one of the symbols appear in column F in the order 1st, 3rd, and 2nd, while in column L they are in the reverse order: 2nd, 3rd, and 1st.
Note that, although the BWT and the ABWT are defined very similarly, they are very different combinatorial tools. Combinatorial aspects that distinguish the BWT and the ABWT can be found in [GesselRestivoReutenauer2012; GMRRS_DLT2018]; these differences make the ABWT interesting to study as a tool for characterizing families of words.
Moreover, in [GMRRS_DLT2018] we experimentally tested the ABWT as a pre-processing step of a compression tool, comparing its performance with a BWT-based compressor. We showed that the behaviour of the two transformations is essentially equivalent in terms of compression. In fact, these experiments confirm a theoretical result we proved in [GMRRS_DLT2018] for a larger class of transformations that can be seen as a generalization of the BWT and that includes the ABWT as a special case. In the next section, we give a brief description of the properties we proved in [GMRRS_DLT2018] for this class of transformations, all of which also hold for the ABWT.
4 Generalized BWTs: a synopsis
In this section we describe the class of Generalized BWTs introduced in [GMRRS_DLT2018] and report their main properties.
Given the alphabet Σ of size σ, in the following we denote by S_σ the set of permutations of the alphabet symbols. Two important permutations are distinguished in S_σ: the identity permutation Id, corresponding to the lexicographic order, and the reverse permutation Rev, corresponding to the reverse lexicographic order. We consider the generalized lexicographic orders introduced in [Reutenauer2006] (cf. also [DOLCE_Reut_Rest2018]) that, for the purposes of this paper, can be formalized as follows.
Definition 4.1. Given a k-tuple π = (π₁, …, π_k) of elements of S_σ, we denote by ⪯_π the lexicographic order such that, given two words of the same length u = u₁⋯u_n and v = v₁⋯v_n, it is u ⪯_π v if and only if u = v or u_j <_{π_j} v_j, where j is the smallest index such that u_j ≠ v_j, and <_{π_j} is the lexicographic order induced by the permutation π_j. Without loss of generality, we can assume π₁ = Id.
Using the above definition, a class of generalized BWTs can be defined as follows:
Definition 4.2. Given a k-tuple π of elements of S_σ, we denote by BWT_π the transformation mapping a primitive word w to the last column of the matrix containing the cyclic rotations of w sorted according to the lexicographic order ⪯_π. The output of BWT_π applied to w is the pair (L, I), where L is the last column of the sorted matrix and I is the index of the row containing the word w.
Note that for π = (Id, Id, …, Id), BWT_π is the usual BWT, while for π = (Id, Rev, Id, Rev, …), BWT_π coincides with the ABWT defined in Section 3.
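The definition of BWT_π can be sketched naively in Python, where each permutation is given as a string listing the alphabet in the desired order and `perms[i]` resolves mismatches at (0-based) position i (the names and the tuple-as-list representation are our assumptions, not the paper's):

```python
from functools import cmp_to_key

def bwt_pi(w, perms):
    """Generalized transform BWT_pi: sort the cyclic rotations of the
    primitive word w with the order induced by the tuple of alphabet
    permutations perms (one per position, so perms must contain at
    least len(w) entries). Returns the pair (L, I)."""
    rank = [{c: r for r, c in enumerate(p)} for p in perms]
    def cmp(u, v):
        for i, (a, b) in enumerate(zip(u, v)):
            if a != b:                       # resolve with perms[i]
                return rank[i][a] - rank[i][b]
        return 0
    rots = sorted((w[i:] + w[:i] for i in range(len(w))),
                  key=cmp_to_key(cmp))
    return "".join(r[-1] for r in rots), rots.index(w)
```

With all permutations equal to the identity this computes the BWT; alternating the identity with the reversed alphabet gives the ABWT.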
Remark 4.3. For most applications, it is assumed that the last symbol of w is a unique end-of-string marker $ smaller than each symbol of the alphabet Σ. Under this assumption, lexicographically sorting w's cyclic rotations is equivalent to building the suffix tree [Lothaire:2005; Gusfield1997] of w, which can be done in linear time. In this setting, we can compute BWT_π(w) in linear time: we perform a depth-first visit of the suffix tree in which the children of each node are visited in the order induced by π. In other words, the children of a node of string-depth d (the number of letters in the word obtained by concatenating the labels of the edges in the path from the root of the suffix tree to the node) are visited according to the order induced by π_{d+1}. Since the suffix tree has O(n) nodes, for a constant alphabet the whole procedure takes linear time.
The next result, proved in [GMRRS_DLT2018], guarantees that the transformations BWT_π are invertible. As we specify in the next section, the inversion procedure for the ABWT is more efficient.
For every k-tuple π, the transformation BWT_π is invertible, with an inversion algorithm whose running time is polynomial in n = |w|.∎
Note that recently [GMRS_CPM2019arxiv] the complexity for the inversion of a generic transformation BWT_π has been improved.
The following theorem, proved in [GMRRS_DLT2018], shows that each transformation BWT_π produces a number of equal-letter runs that is at most twice the number of equal-letter runs of the input word. This fact generalizes a result proved for the BWT in [MantaciRRSV17].
Given a k-tuple π and a word w over a finite alphabet Σ, it holds that ρ(BWT_π(w)) ≤ 2 ρ(w).
A key property of the BWT is that it allows to reduce the problem of compressing a string w up to its k-th order entropy to the problem of compressing a collection of factors of BWT(w^R) up to their zero-th order entropy, where w^R is the reverse of the word w. This means that a BWT-based compressor, combining the BWT with a zero order (memoryless) compressor, is able to achieve the same high order compression typical of more complex tools such as Lempel-Ziv encoders. In [GMRRS_DLT2018], we prove that a similar result also holds for the transformations BWT_π.
Let π be a k-tuple and y = BWT_π(w^R), where w^R is the reverse of the word w. For each positive integer k, there exists a factorization y = y₁ y₂ ⋯ y_t of y such that Σ_{i=1}^{t} |y_i| H₀(y_i) ≤ |w| H_k(w).
5 Rank-invertible transformations
It is well known that the key to efficiently computing the inverse of the original BWT is the existence of an easy-to-compute permutation mapping, in the matrix M_s(w), a row index to the row index containing that row right-shifted by one position. This permutation is called LF-mapping since, by Proposition 3.2, LF(i) is the position in the first column F of M_s(w) corresponding to the i-th entry in column L: in other words, F[LF(i)] is the same symbol occurrence in w as L[i]. Again, by Proposition 3.2 we have that L[LF(i)] is the symbol preceding L[i] in the input word w. Define C[c] as the number of symbols of L smaller than c, and rank_c(L, i) as above. If L[i] = c with rank_c(L, i) = j, then for the BWT by construction LF(i) = C[c] + j, and we can recover w with the formula:

w[n − t] = L[LF^t(I)], for t = 0, 1, …, n − 1.   (1)
Note that the inversion formula (1) only depends on Properties 1 and 2 of Proposition 3.2. Since such properties hold for every generalized transformation BWT_π, (1) provides an inversion formula for every transformation in that class. In other words, inverting a generalized BWT amounts to computing n iterations of the LF-mapping.
Note that C[c] is simply the total number of occurrences in L of symbols smaller than c, and rank_c(L, i) is the number of occurrences of the symbol c among the first i symbols of L.
Since the rank operation on (compressed) arrays over a finite alphabet can be computed in constant time [BNtalg14] and the values C[c] can be precomputed, the computation of the LF map for both the BWT and the ABWT takes O(1) time per entry; for the ABWT, by Property 3b, it suffices to use LF(i) = C[c] + n_c + 1 − rank_c(L, i), where c = L[i]. This implies that, thanks to the simple structure of its LF-mapping, the ABWT can also be inverted in linear time.
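The inversion procedure for both transforms can be sketched as follows (a naive Python version with explicitly precomputed ranks; the only difference between the two cases is the direction in which occurrences are matched, as dictated by Properties 3a and 3b):

```python
def invert(L, I, alternating=False):
    """Recover w from the pair (L, I) by iterating the LF-mapping.
    For the BWT the i-th c in L matches the i-th c in F; for the
    ABWT it matches the (n_c + 1 - i)-th one."""
    n = len(L)
    counts = {}
    for c in L:
        counts[c] = counts.get(c, 0) + 1
    C, tot = {}, 0                 # C[c] = symbols of L smaller than c
    for c in sorted(counts):
        C[c], tot = tot, tot + counts[c]
    seen, rank = {}, [0] * n       # rank[i] = occurrences of L[i] in L[0..i]
    for i, c in enumerate(L):
        seen[c] = seen.get(c, 0) + 1
        rank[i] = seen[c]
    def lf(i):
        c = L[i]
        return C[c] + (counts[c] - rank[i] if alternating else rank[i] - 1)
    out, i = [], I
    for _ in range(n):             # read w backwards, one symbol per step
        out.append(L[i])
        i = lf(i)
    return "".join(reversed(out))
```

On the running example, both `invert("nnbaaa", 3)` and `invert("bnnaaa", 3, alternating=True)` return `"banana"`.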
The computation of the LF map is also the main operation in the so-called backward-search procedure, which makes it possible to use (a compressed version of) L as a full-text index for w [Ferragina:2005]. The following proposition is the key to generalizing the backward-search procedure to the ABWT.
Proposition 5.1. Given a string u, let [b, e] denote the range of rows of M_s(w) which are prefixed by u. For any c ∈ Σ, let b′ and e′ denote the positions of the first and the last occurrence, if any, of c in L[b..e].
If c occurs in L[b..e], then [min(LF(b′), LF(e′)), max(LF(b′), LF(e′))] is the range of rows of M_s(w) which are prefixed by cu; if c does not occur in L[b..e], then no rows of M_s(w) are prefixed by cu and therefore cu is not a (circular) substring of w.
Assume first that c occurs in L[b..e]. It is immediate that if b′, e′ are the positions of the first and last c in L[b..e], then LF(b′) and LF(e′) are the extremes of an interval of rows starting with c, and every other occurrence of c in L[b..e] is mapped to a position between LF(b′) and LF(e′). The thesis follows since the rows in this interval are exactly the rotations of w prefixed by cu. If c does not occur in L[b..e], then no occurrence of u in w is preceded by c, and cu is not a circular substring of w.∎
Proposition 5.1 implies that if we use a compressed representation of the last column L of M_s(w) supporting constant-time rank operations, then, for any pattern P of length m, we can compute in O(m) time the range of rows of the matrix M_s(w) which are prefixed by P. Hence, the ABWT can be used as a compressed index in the same way as the BWT.
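A naive sketch of the generalized backward search (Python; linear scans stand in for the constant-time rank structures of [BNtalg14], and taking the min/max of the two LF values handles the order reversal of the ABWT, per Proposition 5.1):

```python
def backward_count(L, P, alternating=False):
    """Count the rows of the sorted rotation matrix prefixed by P,
    i.e. the circular occurrences of P, by backward search on the
    last column L of the BWT (default) or ABWT."""
    n = len(L)
    counts = {}
    for c in L:
        counts[c] = counts.get(c, 0) + 1
    C, tot = {}, 0
    for c in sorted(counts):
        C[c], tot = tot, tot + counts[c]
    seen, rank = {}, [0] * n
    for i, c in enumerate(L):
        seen[c] = seen.get(c, 0) + 1
        rank[i] = seen[c]
    def lf(i):
        c = L[i]
        return C[c] + (counts[c] - rank[i] if alternating else rank[i] - 1)
    b, e = 0, n - 1                     # current range of matrix rows
    for c in reversed(P):               # extend the match leftwards
        occ = [i for i in range(b, e + 1) if L[i] == c]
        if not occ:
            return 0
        x, y = lf(occ[0]), lf(occ[-1])  # LF of first/last occurrence
        b, e = min(x, y), max(x, y)
    return e - b + 1
```

On the transforms of "banana" from Section 3, the pattern "an" is found twice using either the BWT last column or the ABWT last column.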
The above results suggest that it is worthwhile to search for other transformations in the class BWT_π which share the same properties of the BWT and the ABWT. Because of the important role played by the rank operation, we introduce the notion of rank-invertibility for this class of transformations.
Definition 5.2. The transformation BWT_π is rank-invertible if there exists a function G such that, for any word w, setting (L, I) = BWT_π(w), we have

LF(i) = G(P_L, L[i], rank_{L[i]}(L, i)) for i = 1, …, |L|.

In other words, LF(i) only depends on the Parikh vector of L, the symbol L[i], and the number of occurrences of L[i] in L up to position i.∎
Note that we pose no limit on the complexity of the function G; we only ask that it can be computed using only P_L, L[i], and the number of occurrences of L[i] in L[1..i].
We observed that, for π = (Id, Id, …, Id), BWT_π coincides with the BWT and it is therefore rank-invertible. The main result of this section is Theorem 5.9, establishing that the BWT and the ABWT are the only rank-invertible transformations in the class BWT_π. We start our analysis by considering tuples of the form π = (Id, τ, Id, τ, …).
Lemma 5.3. Let σ ≥ 3 and π = (Id, τ, Id, τ, …), where τ is a permutation of Σ. If there exist two pairs (p, q) and (p′, q′) of symbols of Σ, with p < q and p′ < q′, such that p <_τ q and q′ <_τ p′,
then BWT_π is not rank-invertible.
Consider for example the case Σ = {a, b, c} with τ inducing the order a <_τ c <_τ b. Two pairs satisfying the hypothesis are (a, b) and (b, c), since according to the ordering τ it is a <_τ b, while b < c but c <_τ b.
Consider now the two words v and z of Fig. 2. Both words contain two occurrences of the same symbol, say x. In the first word the x's are followed respectively by a and b (the symbols in the first pair), and in z the x's are followed by b and c (the symbols in the second pair).
Let F_v, L_v (resp. F_z, L_z) denote the first and last column of the matrix associated to v (resp. z). By definition, each matrix is obtained by sorting the cyclic rotations of v and z according to the lexicographic order in which symbols in odd positions are sorted according to the usual alphabetic order, while symbols in even positions are sorted according to the ordering τ. We show the two matrices in Fig. 2, where we use subscripts to distinguish the two occurrences of x in v and z.
The relative position of the two x's in L_v is determined by the symbols following them in v, namely those in the first pair. Since these symbols appear in the first column of the cyclic rotations matrix, which is sorted according to the usual alphabetic order, the two x's appear in L_v in the order x₁, x₂. The same is true for L_z: since the pair (b, c) is also sorted with respect to the usual order, the two x's appear in L_z in the order x₁, x₂.
The position of the two x's in F_v is also determined by the symbols following them in v; but since these symbols are now in the second column, their relative order is determined by the ordering τ. Hence the two x's appear in F_v in the order x₁, x₂ (because a <_τ b). In F_z the ordering of the x's is x₂, x₁, since it depends on the τ-ordering of the symbols b and c, which by construction is different from their standard ordering.
Note that v and z have the same Parikh vector P. If, by contradiction, BWT_π were rank-invertible, the function G should give the correct LF-mapping for both v and z. This is impossible, since for v the value G(P, x, 1) should be the position in F_v of the first x,
while for z the value G(P, x, 1) should be the position in F_z of the second x.
In the general case of an arbitrary permutation τ satisfying the hypothesis of the lemma, the reasoning is the same. Given the two pairs (p, q) and (p′, q′), we build two words v and z with the same Parikh vector such that in v (resp. z) the two occurrences of a distinguished symbol are followed by the symbols in the first pair (resp. in the second pair). We then build the rotation matrices as before, and we find that in both L_v and L_z the two occurrences appear in the same relative order. However, in columns F_v and F_z they are not in the same relative order, since their order depends on the τ-ordering of the pairs and, by construction, such orderings are not the same. Reasoning as before, we get that there cannot exist a function G giving the correct LF-mapping for both v and z.∎
Lemma 5.4. Let σ ≥ 2 and π = (Id, τ, Id, τ, …). Then BWT_π is rank-invertible if and only if τ = Id or τ = Rev.
If σ = 2 the result is trivial, since the only possible permutations on a binary alphabet are the identity and the reverse permutation. Let us assume σ ≥ 3. We need to prove that if τ ≠ Id and τ ≠ Rev then BWT_π is not rank-invertible.
Note that any permutation τ over the alphabet Σ induces an ordering on any triplet of symbols of Σ. For example, if Σ = {a, b, c, d}, the permutation τ = (c, a, d, b) induces on the triplet (a, b, c) the ordering (c, a, b). It is easy to prove, by induction on the alphabet size, that if τ ≠ Id and τ ≠ Rev, then there exists a triplet of symbols x < y < z such that the ordering induced by τ on {x, y, z} is different from both the identity and the reverse ordering; that is, τ restricted to {x, y, z} is neither the identity nor the reverse permutation. Without loss of generality we can assume that the triplet is (a, b, c).
For any permutation of {a, b, c} different from the identity and the reverse permutation, there exist two pairs of symbols satisfying the hypothesis of Lemma 5.3. Hence, we can build two words v and z which show that BWT_π is not rank-invertible. Note that the argument in the proof of Lemma 5.3 is still valid if we add to v and z the same number of occurrences of the symbols of Σ different from a, b, c, so that v and z are effectively over an alphabet of size σ.∎
Lemma 5.4 establishes which transformations BWT_π are rank-invertible when the tuple alternates the identity with a single permutation τ. To study the general case, we start by establishing a simple corollary.
Corollary 5.5. Let k ≥ 2 and π = (Id, π₂, …, π_k). If π₂ ≠ Id and π₂ ≠ Rev, then BWT_π is not rank-invertible.
We reason as in the proof of Lemma 5.4, observing that the presence of the permutations π₃, …, π_k has no influence on the proof, since in the words used there the row ordering is determined by the first two symbols of each rotation.∎
The following three lemmas establish necessary conditions on the structure of the tuple π for BWT_π to be rank-invertible. In particular, the following lemma shows that BWT_π is not rank-invertible if π contains anywhere a triplet of consecutive permutations (Id, Id, τ) with τ ≠ Id.
Lemma 5.6. Let π = (π₁, …, π_k) with (π_i, π_{i+1}, π_{i+2}) = (Id, Id, τ) for some i ≥ 1, with τ ≠ Id. Then BWT_π is not rank-invertible.
Note that when i = 1, the k-tuple starts with the triplet (Id, Id, τ). We first analyze the case σ = 2, which implies τ = Rev. Let us consider two words v and z with the same Parikh vector, each containing exactly two occurrences of the symbol a, chosen so that in z the two rotations starting with a share their first two symbols and first differ at the third one (for instance, for i = 1 one can take v = aabbb and z = ababb); we use subscripts to distinguish the two occurrences of a. It is easy to see that, in the cyclic rotations matrix for v, a₁ precedes a₂ in both the first and the last column. Hence, if BWT_π were rank-invertible, the value G(P, a, 1) should be the position in the first column of the first a.
At the same time, in the cyclic rotations matrix for z, a₁ precedes a₂ in the last column, but in the first column a₂ precedes a₁, since the two rotations prefixed by ab differ in the third column and τ = Rev reverses the standard order there. Therefore the value G(P, a, 1) should be the position in the first column of the second a.
Hence BWT_π cannot be rank-invertible.
Let us now consider the case σ ≥ 3. Since τ ≠ Id, there are two symbols, say b and c, such that their relative order according to τ is reversed, that is, b < c and c <_τ b. Consider now two words v and z built as in the binary case, where in z the two rotations starting with a share their first two symbols and first differ at the position governed by τ, where one has the symbol b and the other the symbol c. It is immediate to see that, in the cyclic rotations matrix for v, a₁ precedes a₂ in both the first and the last column. Hence, if BWT_π were rank-invertible, the value G(P, a, 1) should be the position in the first column of the first a.
At the same time, in the cyclic rotations matrix for z, a₁ precedes a₂ in the last column, but in the first column a₂ precedes a₁, since the two rotations differ in the column compared with τ and c <_τ b. Hence the value G(P, a, 1) should be the position in the first column of the second a;
hence BWT_π cannot be rank-invertible.∎
The following lemma shows that BWT_π is not rank-invertible if π contains anywhere a triplet of consecutive permutations (Id, Rev, τ), with τ ≠ Id.
Lemma 5.7. Let π = (π₁, …, π_k) with (π_i, π_{i+1}, π_{i+2}) = (Id, Rev, τ) for some i ≥ 1, with τ ≠ Id. Then BWT_π is not rank-invertible.
As in the proof of Lemma 5.6, on a binary alphabet we can consider two words v and z with the same Parikh vector and two occurrences of the symbol a, and in the general case their variants obtained by assuming that there are two symbols, say b and c, such that their relative order according to τ is reversed, that is, b < c and c <_τ b. Recall that we use subscripts to distinguish the two occurrences of the symbol a. In the cyclic rotations matrix for v, in the first column a₁ precedes a₂ while in the last column a₂ precedes a₁. At the same time, in both the first and the last column of the cyclic rotations matrix for z, a₁ precedes a₂. Reasoning as in the proof of Lemma 5.6, we get that BWT_π cannot be rank-invertible.∎
The following lemma shows that BWT_π is not rank-invertible if π contains anywhere a triplet of consecutive permutations (Rev, Id, τ), with τ ≠ Rev.
Lemma 5.8. Let π = (π₁, …, π_k) with (π_i, π_{i+1}, π_{i+2}) = (Rev, Id, τ) for some i ≥ 1, with τ ≠ Rev. Then BWT_π is not rank-invertible.
We reason as in the proof of Lemma 5.7, considering again two words with two occurrences of the symbol a in the case of a binary alphabet, and their variants in the general case.∎
We are now ready to establish the main result of this section.
Theorem 5.9. If π₁ = Id, then the BWT and the ABWT are the only transformations BWT_π which are rank-invertible.
For k = 2, the result follows from Lemma 5.4. Let us suppose k > 2 and assume that BWT_π is rank-invertible. Both in the case of a binary alphabet and in the general case, by Corollary 5.5 we must have π₂ = Id or π₂ = Rev. If π₂ = Id and BWT_π is not the BWT, then the k-tuple must contain a triplet (Id, Id, τ) with τ ≠ Id, which is impossible by Lemma 5.6. If π₂ = Rev, by Lemma 5.7 it is π₃ = Id. We have therefore established that π has the form (Id, Rev, Id, π₄, …, π_k). By Lemma 5.8 it is π₄ = Rev. By iterating the same reasoning we can conclude that π coincides with (Id, Rev, Id, Rev, …), that is, BWT_π is the ABWT, concluding the proof.∎
6 Efficient computation of the ABWT
The bottleneck for the computation of the ABWT (as well as of any transformation BWT_π) of a given string is the ⪯_alt-based (resp. ⪯_π-based) sorting of its cyclic rotations. In Remark 4.3 we observed that, if a unique end-of-string symbol, smaller than any other symbol in the alphabet, is appended to the input string, all transformations in the class can be computed in linear time by first building the suffix tree of the input string. However, even for computing the BWT this strategy has never been used in practice. The reason is that suffix tree construction algorithms, although linear-time, have large multiplicative constants and are not fast in practice. In addition, the suffix tree itself requires about ten to fifteen times the size of the input, a huge amount of temporary space that is not necessarily available (considering also that saving space is the primary reason for using the BWT). For the above reasons the BWT is usually computed by first building the Suffix Array [Karkkainen:2003; MF02j], which is the array giving the lexicographic order of all the suffixes of the input string.
A fundamental result on Suffix Array construction is the technique in [Karkkainen:2006] that, using the concept of difference cover, makes it possible to design efficient Suffix Array construction algorithms for different models of computation, including RAM, External Memory, and Cache-Oblivious.
In this section, we show that this technique can be adapted to compute the ABWT within the same time bound as the BWT.
Firstly, in order to use the notion of suffix array for the computation of the ABWT, we need to extend the definition of the alternating lexicographic order to strings of different length.
Definition 6.1. Let u = u₁u₂⋯u_m and v = v₁v₂⋯v_k be two words with u ≠ v.
1. If neither word is a prefix of the other, let j be the smallest index in which u_j ≠ v_j. Then, if j is even, u ≺_alt v iff u_j > v_j. Otherwise, if j is odd, u ≺_alt v iff u_j < v_j.
2. If u is a proper prefix of v, we say that u ≺_alt v if |u| is even, and v ≺_alt u if |u| is odd.
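Definition 6.1 can be sketched directly in Python (`alt_less` is our name; indices are 0-based, so the parity test is inverted with respect to the 1-based text):

```python
def alt_less(u, v):
    """True iff u strictly precedes v in the alternating lexicographic
    order of Definition 6.1: a mismatch at 1-based position j is
    resolved by the standard order when j is odd and by the reversed
    order when j is even; a proper prefix u precedes v iff |u| is even."""
    for i, (a, b) in enumerate(zip(u, v)):   # i is the 0-based position
        if a != b:
            return (a < b) if i % 2 == 0 else (a > b)
    if len(u) == len(v):
        return False                          # u == v
    return (len(u) < len(v)) == (min(len(u), len(v)) % 2 == 0)
```

For instance "ab" precedes "aa" (the mismatch is at the even 1-based position 2, where the order is reversed), "ab" precedes "aba" since the prefix "ab" has even length, and "ab" also precedes "a" because the prefix "a" has odd length.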
Suffix array algorithms often assume that the input string ends with a unique end-of-string symbol $ smaller than any other symbol in the alphabet Σ. Remark that if we append the end-of-string symbol $ to the string w, the ⪯_alt-order relation between two suffixes of w$ is always determined by case 1 of Definition 6.1. Moreover, using the end-of-string symbol implies that the ⪯_alt-based sorting of the cyclic rotations of the input string is induced by the ⪯_alt-based sorting of its suffixes; note that this property does not hold in general. However, it is easy to verify that, apart from the symbol $, the output ABWT(w$) may be different from ABWT(w), and its number of equal-letter runs can be greater (see Fig. 4).
Here we assume that the input string contains a unique end-of-string symbol $; in the next section, we show how to remove this hypothesis by using combinatorial properties of some special rotations of the input string.
To illustrate the idea behind difference cover algorithms, in the following, given a positive integer v, we denote by [0, v) the set {0, 1, …, v − 1}.
A set D ⊆ [0, v) is a difference cover modulo v if every integer in [0, v) can be expressed as a difference, modulo v, of two elements of D, i.e. [0, v) = {(a − b) mod v : a, b ∈ D}.
For example, for v = 7 the set D = {1, 2, 4} is a difference cover, since 0 = 1 − 1, 1 = 2 − 1, 2 = 4 − 2, 3 = 4 − 1, 4 ≡ 1 − 4 (mod 7), 5 ≡ 2 − 4 (mod 7), and 6 ≡ 1 − 2 (mod 7). An algorithm by Colbourn and Ling [ipl/ColbournL00] ensures that for any v a difference cover modulo v of size at most √(1.5v) + 6 can be computed in O(√v) time. The suffix array construction algorithms described in [Karkkainen:2006] are based on the general strategy shown in Algorithm 1. Steps 3 and 4 rely heavily on the following property of difference covers: for any pair of indices i, j there exists k ∈ [0, v) such that (i + k) mod v and (j + k) mod v both belong to D. This implies that to compare lexicographically the suffixes starting at positions i and j it suffices to compare at most v symbols, since the suffixes starting at positions i + k and j + k are both sampled suffixes and their relative order has been determined at Step 2.
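Both the covering condition and the shift property used in Steps 3 and 4 are easy to check by brute force; a small Python sketch (function names are ours):

```python
def is_difference_cover(D, v):
    """True iff every residue in {0, ..., v-1} is a difference,
    modulo v, of two elements of D."""
    return {(a - b) % v for a in D for b in D} == set(range(v))

def cover_shift(D, v, i, j):
    """Smallest k such that (i + k) mod v and (j + k) mod v are both
    in D; such a k exists whenever D is a difference cover modulo v."""
    for k in range(v):
        if (i + k) % v in D and (j + k) % v in D:
            return k
```

For example, {1, 2, 4} is a difference cover modulo 7, and for the positions i = 0, j = 3 the shift k = 1 lands both on sampled residues (1 and 4).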
To see how the algorithm works, consider for example v = 3, the difference cover D = {1, 2}, and an input string w. The sampled suffixes are those starting at positions i with i mod 3 ∈ {1, 2}. To sort them, consider the string whose elements are the 3-tuples of symbols starting at the sampled positions of w, taken in the order of the sampled positions: