A theoretical and experimental analysis of BWT variants for string collections

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.

1 Introduction

The Burrows-Wheeler-Transform [BW94] (BWT) is a fundamental string transformation which is at the heart of many modern compressed data structures for text processing, in particular in bioinformatics [bowtie, bwa, bowtie2]. With the increasing availability of low-cost high-throughput sequencing technologies, the focus has moved from single strings to large string collections, such as the 1000 Genomes project [1000genomes], the Genome 10K Project [10K], the 100,000 Genomes Project [100K], the 1001 Arabidopsis Project [arab], and the 3,000 Rice Genomes Project (3K RGP) [rice]. This has led to a widespread use of compressed data structures for string collections.

Concurrently, ever increasing text sizes have been driving a trend towards ever smaller data structures. The size of BWT-based data structures is typically measured in the number of runs (maximal substrings consisting of the same letter) of the BWT, commonly denoted r. This parameter has become fundamental as a measure of storage space required by such data structures. Moreover, much recent research effort has concentrated on the construction of data structures which can not only store but also query, process, and mine strings in space and time proportional to r [GagieNP18, BannaiGI20, Oliva0SMKGB21, CobasGN21].

The parameter r is also being increasingly seen as a measure of repetitiveness of the string, with several recent works theoretically exploring its suitability as such a measure, as well as its relationship to other such measures [Navarro21a, GiulianiILPST21, AkagiFI22].

Several tools exist that compute variants of the BWT for string collections, among these BCR [BauerCR13], ropebwt2 [Li14a], nvSetBWT [Pantaleoni14], msbwt [HoltM14], Merge-BWT [Siren16], eGSA [Louza2017d], BigBWT [BoucherGKLMM19], bwt-lcp-parallel [BonizzoniVPPR19], eGAP [EgidiLMT19], gsufsort [LouzaTGPR20], G2BWT [DominguezN21], and pfpebwt [BoucherCLMM21]. It should be noted though that, when the input is a collection of strings, it is not completely straightforward how to compute the BWT, since the BWT was originally designed for individual strings. In fact, there exists more than one way to compute a Burrows-Wheeler-type transform for a collection of strings, and it turns out that different tools not only use different algorithms, but they output different data structures. As a first example, in Table 1, we give the BWT variants as computed by tools on a toy example of DNA-strings.

| variant | result on example | tools |
|---|---|---|
| eBWT | CGGGATGTACGTTAAAAA | pfpebwt [BoucherCLMM21] |
| dolEBWT | GGAAACGG$$$TTACTGT$AAA$ | G2BWT [DominguezN21], pfpebwt [BoucherCLMM21], msbwt [HoltM14] |
| mdolBWT | GAGAAGCG$$$TTATCTG$AAA$ | BCR [BauerCR13], ropebwt2 [Li14a], nvSetBWT [Pantaleoni14], Merge-BWT [Siren16], eGSA [Louza2017d], eGAP [EgidiLMT19], bwt-lcp-parallel [BonizzoniVPPR19], gsufsort [LouzaTGPR20] |
| concBWT | $AAGAGGGC$#$TTACTGT$AAA$ | BigBWT [BoucherGKLMM19], tools for single-string BWT |
| colexBWT | AAAGGCGG$$$TTACTGT$AAA$ | ropebwt2 [Li14a] |

Table 1: The different BWT variants on the multiset M = {ATATG, TGA, ACG, ATCA, GGA}. For detailed explanations, see Section 3.

The classical way of computing text indexes of string collections is to concatenate the strings, adding a different end-of-string-symbol at the end of each string, and then computing the index for the concatenated string. This is the method traditionally used for generating classical data structures such as suffix trees and suffix arrays for more than one string, and results in the so-called generalized suffix tree resp. generalized suffix array (see e.g. [Gusfield1997, Ohlebusch2013]). The drawback of this method is an increase in the size of the alphabet, from σ, often a small constant in applications, to σ + k, where k is the number of elements in the collection, typically in the thousands or even tens or hundreds of thousands. One way to avoid this is to use only conceptually different end-of-string-symbols, i.e. to have only one dollar-sign but apply string input order to break ties. This is the method used e.g. by ropebwt2 [Li14a] and by BCR [BauerCR13]. Another method to avoid increasing the alphabet is to separate the input strings using the same end-of-string-symbol, and add a different end-of-string-symbol to the end of the concatenated string, to ensure correctness, as e.g. in BigBWT [BoucherGKLMM19]. An equivalent solution is to concatenate the input strings without removing the end-of-line or end-of-file characters, since these act as separators; or to concatenate them without separators and use a bitvector to mark the end of each string.

Many recent studies use string collections in experiments (e.g. [PuglisiZ21, BannaiGI20, KuhnleMBGLM19]); often the input strings are turned into one single sequence using one of the methods described above, and then the single-string BWT is computed; it is, however, not always stated explicitly which method was used to obtain one sequence. Underlying this is the implicit assumption that all methods are equivalent.

In 2007, Mantaci et al. [MantaciRRS07] introduced the extended Burrows-Wheeler-Transform (eBWT), which generalizes the BWT to a multiset of strings. The eBWT, like the BWT, is reversible; moreover, it is independent of the order in which the strings in the collection are presented. This is not true of any of the other methods mentioned above. Note that the eBWT differs from the BWT in several ways, most importantly in the order relation for sorting conjugates: while the BWT uses lexicographic order, the eBWT uses the so-called omega-order. (For precise definitions, see Section 2.)

The only tool to date that computes the eBWT according to the original definition is pfpebwt [BoucherCLMM21]; all other tools append an end-of-string character to the input strings, explicitly or implicitly, and as a consequence, the resulting data structures differ from the one defined in [MantaciRRS07]. Moreover, the output in most cases depends on the input order of the sequences (except for [DominguezN21] and, using a specific option, [Li14a]). As a further complication, the exact nature of this dependence differs from one data structure to another, as we will show in this paper.

The result is that the BWT variants computed by different tools on the same dataset, or by the same tool on the same dataset but given in a different order, may vary considerably. As we will show, this variability extends to the parameter r, the number of runs of the BWT. This is all the more important given the fact that r (and the related parameter n/r, the average length of a run) is increasingly being used as a parameter characterizing the dataset itself, namely as a measure of its repetitiveness (see e.g. [CobasGN21, BannaiGI20, BoucherCGHMNR21]).

1.1 Our contribution

To the best of our knowledge, this is the first systematic treatment of the different BWT variants in use for collections of strings. Our contributions are:

  1. We define five distinct BWT variants which are computed by current tools specifically designed for string collections. We formally describe the differences between these, identifying specific intervals to which differences are restricted.

  2. We show how the input order influences the output, depending on the BWT variant.

  3. We describe the consequences on the number of runs of the BWT and give an upper bound on the amount by which the colexicographic order (sometimes referred to as ’reverse lexicographic order’) can differ from the optimal order of Bentley et al. [BentleyGT20].

  4. We complement our theoretical analysis with extensive experiments, comparing the five BWT variants on eight real-life datasets with different characteristics.

1.2 Related work

This paper deals with tools for string collections, so we did not include any tool that computes the BWT of a single string, such as libdivsufsort [libdivsufsort], sais-lite-lcp [sais-lite-lcp], libsais [libsais], bwtdisk [FerraginaGM12]. Even though, in many cases, these are the tools used for collections of strings, the data structure they compute depends on the method with which the string collection was turned into a single string, as explained above. Nor did we include other BWT variants for single strings such as the bijective BWT [GilScott12, KopplHHS20], since, again, these were not designed for string collections.

The Big-xBWT [GagieGM21] is a tool for compressing and indexing read collections, using the xBWT of Ferragina et al. [xbwt, FerraginaLMM09]. In addition to the string collection, it requires a reference sequence as input, in contrast to the other tools. Moreover, the output is not comparable either, since its length can vary—as opposed to all other BWT variants we review, the xBWT is not a permutation of the input characters but can be shorter, due to the fact that it first maps the input to a tree and then applies the xBWT to it, a BWT-like index for labeled trees, rather than for strings. Likewise, the tool [OhlebuschSB18] for reference-free xBWT is not included in this review: even though it does not require a reference sequence, it, too, computes the xBWT, which is a data structure that does not fall within the category we focus on.

Finally, we did not include SPRING [ChandakTOHW19], a reference-free compressor for FASTQ and FASTA files: even though it uses the BWT during computation, it does not output it.

1.3 Overview

We give the necessary definitions in Section 2; note that we assume familiarity of the reader with the Burrows-Wheeler-Transform. In Section 3, we present the BWT variants and analyse their differences. In Section 4 we discuss the effects on the repetitiveness measure r, while our experimental results are presented in Section 5. We give some conclusions from our study in Section 6. The full tables with detailed results on all eight datasets can be found in the Appendix.

2 Preliminaries

Let Σ be a finite ordered alphabet of size σ. We use the notation T = T[1..n] for a string T of length n over Σ, T[i] for the i-th character, and T[i..j] for the substring T[i]···T[j] of T, where i ≤ j; |T| denotes the length of T, and ε the empty string. Let T be a string over Σ and n = |T|; a string U is called a conjugate of T if U = T[i..n]T[1..i−1] for some i ∈ {1, …, n} (also called the i-th rotation of T). For a string T over Σ and an integer m > 0, T^m denotes the m-fold concatenation of T.

A string T is called primitive if T = U^m implies T = U and m = 1. Every string T can be written uniquely as T = U^m, where U is primitive; we refer to U as root(T) and to m as exp(T), i.e., T = root(T)^exp(T). A run in string T is a maximal substring consisting of the same character; we denote by runs(T) the number of runs of T. Often, an end-of-string character (usually denoted $) is appended to the end of T; this character is not an element of Σ and is assumed to be smaller than all characters from Σ. Note that appending a $ makes any string primitive.

For two strings T, S, the (unit-cost) edit distance d_edit(T, S) is defined as the minimum number of operations necessary to transform T into S, where an operation can be deletion or insertion of a character, or substitution of a character by another. The Hamming distance d_H(T, S), defined only if |T| = |S|, is the number of positions i such that T[i] ≠ S[i].

The lexicographic order on Σ* is defined by T <_lex S if T is a proper prefix of S, or if there exists an index j s.t. T[j] < S[j] and for all i < j, T[i] = S[i]. The colexicographic order, or colex-order (referred to as reverse lexicographic order in [Li14a, CoxBJR12]) is defined by T <_colex S if T^rev <_lex S^rev, where T^rev = T[n]T[n−1]···T[1] denotes the reverse of the string T.

Given a string T over Σ, the Burrows-Wheeler-Transform [BW94], BWT(T), is a permutation of the characters of T, given by concatenating the last characters of the lexicographically sorted conjugates of T. The number of runs of the BWT of string T is denoted r(T), i.e. r(T) = runs(BWT(T)). To make the BWT uniquely reversible, one can add an index to it marking the lexicographic rank of the conjugate in input. For example, BWT(banana) = nnbaaa, hence r(banana) = 3, and the index 4 specifies that the input was the 4th conjugate in lexicographic order. Alternatively, one adds a $ to the end of T, which makes the input unique: BWT(banana$) = annb$aa. Note that the BWT with and without end-of-string symbol can be quite different.
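The definition above can be tried out directly. The following is a minimal sketch, not an efficient construction: it sorts all rotations explicitly, which is only sensible for toy inputs. The function names `rotations`, `bwt`, and `runs` are our own choices for illustration.

```python
def rotations(text):
    """All conjugates (rotations) of text, by starting position."""
    return [text[i:] + text[:i] for i in range(len(text))]


def bwt(text):
    """Last characters of the lexicographically sorted rotations of text."""
    return "".join(rot[-1] for rot in sorted(rotations(text)))


def runs(text):
    """Number of maximal blocks of equal characters in text."""
    return sum(1 for i, c in enumerate(text) if i == 0 or c != text[i - 1])


if __name__ == "__main__":
    for t in ("banana", "banana$"):
        print(t, bwt(t), runs(bwt(t)))
```

Python compares characters by code point, so `$` is smaller than every letter, in line with the convention for the end-of-string symbol.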

Next we define the omega-order [MantaciRRS07] on Σ*: T ≺_ω S if root(T) = root(S) and exp(T) < exp(S), or if T^ω <_lex S^ω (implying root(T) ≠ root(S)), where T^ω denotes the infinite string obtained by concatenating T infinitely many times. The omega-order relation coincides with the lexicographic order if neither of the two strings is a proper prefix of the other. The two orders can differ otherwise, e.g. GT <_lex GTC but GTC ≺_ω GT.

Given a multiset of strings M = {T_1, …, T_k}, the extended Burrows-Wheeler-Transform [MantaciRRS07], eBWT(M), is a permutation of the characters of the strings in M, given by concatenating the last characters of the conjugates of each T_i, for i = 1, …, k, listed in omega-order. For example, the omega-sorted conjugates of M = {GTC, GT} are: CGT, GTC, GT, TCG, TG, hence, eBWT(M) = TCTGG. Again, adding the indices of the input conjugates makes the eBWT uniquely reversible, in this case 2 and 3.
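As an illustration of the omega-order and the eBWT, here is a small sketch under our own naming (`omega_cmp`, `ebwt`). It compares finite prefixes of the infinite powers: by the Fine and Wilf theorem, two infinite powers are equal exactly when they agree on their first |T| + |S| characters, so a prefix of that length decides the comparison. It is meant for toy inputs only, not as a stand-in for tools such as pfpebwt.

```python
from functools import cmp_to_key


def rotations(text):
    """All conjugates (rotations) of text, by starting position."""
    return [text[i:] + text[:i] for i in range(len(text))]


def omega_cmp(t, s):
    """Negative if t precedes s in omega-order, positive if it follows."""
    n = len(t) + len(s)
    # Prefixes of length |t| + |s| of the infinite powers t^omega, s^omega.
    t_pow = (t * (n // len(t) + 1))[:n]
    s_pow = (s * (n // len(s) + 1))[:n]
    if t_pow != s_pow:
        return -1 if t_pow < s_pow else 1
    # Equal infinite powers mean equal roots: smaller exponent goes first.
    return len(t) - len(s)


def ebwt(strings):
    """Last characters of all conjugates, listed in omega-order."""
    conjugates = [rot for s in strings for rot in rotations(s)]
    conjugates.sort(key=cmp_to_key(omega_cmp))
    return "".join(c[-1] for c in conjugates)


if __name__ == "__main__":
    print(ebwt(["GTC", "GT"]))
    print(ebwt(["ATATG", "TGA", "ACG", "ATCA", "GGA"]))
```

On the two-string example the sketch lists GTC before GT, although GT is lexicographically smaller; this is exactly the case where the two orders disagree.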

3 BWT variants for string collections

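We identified five distinct transforms, which we list below, that were computed by the programs listed in Table 1. Let M = {T_1, …, T_k} be a multiset of strings, with total length N = |T_1| + ··· + |T_k|. Since several of the data structures depend on the order in which the strings are listed, we implicitly regard M as a list [T_1, …, T_k], and write π(M) for a specific permutation π in which the strings are presented.

  1. eBWT(M): the extended BWT of M of Mantaci et al. [MantaciRRS07]

  2. dolEBWT(M) = eBWT({T_i$ | T_i ∈ M})  (“dollar-eBWT”)

  3. mdolBWT(M) = BWT(T_1 $_1 T_2 $_2 ··· T_k $_k), where the dollars are smaller than all characters from Σ and $_1 < $_2 < ··· < $_k  (“multidollar BWT”)

  4. concBWT(M) = BWT(T_1 $ T_2 $ ··· T_k $ #), where # < $  (“concatenated BWT”)

  5. colexBWT(M) = mdolBWT(γ(M)), where γ is the permutation corresponding to the colexicographic (’reverse lexicographic’) order of the strings in M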

| type | variant | example | order of shared suffixes | independent of input order? |
|---|---|---|---|---|
| non-sep.-based | eBWT | CGGGATGTACGTTAAAAA | omega-order of strings | yes |
| separator-based | dolEBWT | GGAAACGG$$$TTACTGT$AAA$ | lexicographic order of strings | yes |
| separator-based | mdolBWT | GAGAAGCG$$$TTATCTG$AAA$ | input order of strings | no |
| separator-based | concBWT | AAGAGGGC$$$TTACTGT$AAA$ | lexicographic order of subsequent strings in input | no |
| separator-based | colexBWT | AAAGGCGG$$$TTACTGT$AAA$ | colexicographic order | yes |

Table 2: Overview of properties of the five BWT variants considered in this paper. The colors in the example BWTs correspond to interesting intervals in separator-based variants, see Section 3.2.

Because all BWT variants except the eBWT use additional end-of-string symbols as string separators, we refer to these four by the collective term separator-based BWT variants. In Table 2 we show the five data structures on our running example of DNA-strings, and give first properties of these data structures. For ease of exposition and comparison, we replaced all separator-symbols by the same dollar-sign $, even where, conceptually or concretely, different dollar-signs are assumed to terminate the individual strings, as is the case for mdolBWT. Moreover, the concBWT contains one additional character, the final end-of-string symbol, here denoted by #, which is smaller than all other characters; thus, the additional rotation starting with # is the smallest and results in an additional dollar in the first position of the transform. For ease of comparison, we remove this first symbol from concBWT and replace the # by $.

It is important to point out that the programs listed in Table 1 do not necessarily use the definitions given here; however, in each case, the resulting transform is the one claimed, up to renaming or removing separator characters, see Sections 3.1 and 3.2.
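The four separator-based definitions can be sketched by explicit rotation sorting. This is a naive illustration under our own function names, not how any of the tools in Table 1 work. Two modelling choices are assumptions of the sketch: the distinct dollars of mdolBWT are encoded as pairs `(0, i)` so that they sort before every character and by string index, and concBWT relies on `#` < `$` < letters in ASCII. All dollars are printed as `$`.

```python
from functools import cmp_to_key


def rotations(text):
    return [text[i:] + text[:i] for i in range(len(text))]


def omega_cmp(t, s):
    """Omega-order via prefixes of length |t| + |s| of the infinite powers."""
    n = len(t) + len(s)
    t_pow = (t * (n // len(t) + 1))[:n]
    s_pow = (s * (n // len(s) + 1))[:n]
    if t_pow != s_pow:
        return -1 if t_pow < s_pow else 1
    return len(t) - len(s)


def dol_ebwt(strings):
    """eBWT of the strings after appending '$' to each of them."""
    rows = [rot for s in strings for rot in rotations(s + "$")]
    rows.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rows)


def mdol_bwt(strings):
    """BWT of T_1 $_1 T_2 $_2 ... T_k $_k with $_1 < $_2 < ... < $_k."""
    text = []
    for i, s in enumerate(strings):
        text += [(1, c) for c in s] + [(0, i)]
    rows = sorted(text[j:] + text[:j] for j in range(len(text)))
    return "".join("$" if r[-1][0] == 0 else r[-1][1] for r in rows)


def conc_bwt(strings):
    """BWT of T_1 $ T_2 $ ... T_k $ #, keeping the '#' and the leading '$'."""
    text = "$".join(strings) + "$#"
    return "".join(r[-1] for r in sorted(rotations(text)))


def colex_bwt(strings):
    """mdolBWT of the strings listed in colexicographic order."""
    return mdol_bwt(sorted(strings, key=lambda s: s[::-1]))


if __name__ == "__main__":
    M = ["ATATG", "TGA", "ACG", "ATCA", "GGA"]
    for f in (dol_ebwt, mdol_bwt, conc_bwt, colex_bwt):
        print(f.__name__, f(M))
```

On the running example the output of `conc_bwt` is the unnormalised string of Table 1; dropping its first symbol and replacing `#` by `$` gives the concBWT entry of Table 2.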

3.1 The effect of adding separator symbols

The first obvious difference between the eBWT and the separator-based variants is their length: eBWT(M) has length N, while all other variants have length N + k, since they contain an additional character (the separator) for each input string.

In the four separator-based transforms, the k-length prefix consists of a permutation of the last characters of the input strings. This is because the rotations starting with the dollars are the first lexicographically; in the eBWT, these characters occur interspersed with the rest of the transform, namely in the positions corresponding to the omega-ranks of the input strings (see Table 2).

The next point is that adding a $ to the end of the strings introduces a distinction, not present in the eBWT, between suffixes and other substrings: since the separators are smaller than all other characters, occurrences of a substring as suffix will be listed en bloc before all other occurrences of the same substring. On the other hand, in the eBWT, these occurrences will be listed interspersed with the other occurrences of the same substring.

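Let M = {AACGAC, TCAC} and U = AC. U occurs both as a suffix and as an internal factor; the characters preceding it are A (internal substring) and C, G (suffix), and we have eBWT(M) = CGACATAACC and dolEBWT(M) = CC$GCAAATAC$. In the eBWT the three characters appear as GAC (positions 2 to 4), with the internal occurrence between the two suffix occurrences, while in the dolEBWT they appear as GCA (positions 4 to 6), with the two suffix occurrences first.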

Finally, it should be noted that adding end-of-string symbols to the input strings changes the definition of the order applied. As observed above, the omega-order coincides with the lexicographic order on all pairs of strings where neither is a proper prefix of the other; but with end-of-string characters, no input string can be a proper prefix of another. Thus, on rotations of the T_i$'s, the omega-order equals the lexicographic order. As an example, consider the multiset {GTC$, GT$} from Section 2: we have the following omega-order among the rotations: $GT, $GTC, C$GT, GT$, GTC$, T$G, TC$G, which coincides with the lexicographic order. Similarly, adding different dollars $_1, $_2, …, $_k and applying the omega-order results again in the lexicographic order between the rotations, with different dollar symbols considered as distinct characters. Indeed, if we append a different dollar-sign to each input string, then the omega-order, the lexicographic order, and the order of the suffixes of the concatenated string (i.e. our mdolBWT) are all equivalent.

Regarding the differences among the four separator-based BWT variants, we will show that all differences occur in certain well-defined intervals of the BWT, and that the differences themselves depend only on a specific permutation of {1, …, k}, given by the combination of the input order, the lexicographic order of the input strings, and the BWT variant applied. In Tables 3 and 4 we give the full BWT matrices for all five BWT variants, along with the optimal one minimizing the number of runs, see Section 4.

| index | mdol | rotation | index | dolE | rotation | index | conc | rotation |
|---|---|---|---|---|---|---|---|---|
| (1,6) | G | $_1ATATG | (3,4) | G | $ACG | 23 | A | $#ATATG$TGA$ACG$ATCA$GGA |
| (2,4) | A | $_2TGA | (1,6) | G | $ATATG | 10 | A | $ACG$ATCA$GGA$#ATATG$TGA |
| (3,4) | G | $_3ACG | (4,5) | A | $ATCA | 14 | G | $ATCA$GGA$#ATATG$TGA$ACG |
| (4,5) | A | $_4ATCA | (5,4) | A | $GGA | 19 | A | $GGA$#ATATG$TGA$ACG$ATCA |
| (5,4) | A | $_5GGA | (2,4) | A | $TGA | 6 | G | $TGA$ACG$ATCA$GGA$#ATATG |
| (2,3) | G | A$_2TG | (4,4) | C | A$ATC | 22 | G | A$#ATATG$TGA$ACG$ATCA$GG |
| (4,4) | C | A$_4ATC | (5,3) | G | A$GG | 9 | G | A$ACG$ATCA$GGA$#ATATG$TG |
| (5,3) | G | A$_5GG | (2,3) | G | A$TG | 18 | C | A$GGA$#ATATG$TGA$ACG$ATC |
| (3,1) | $ | ACG$_3 | (3,1) | $ | ACG$ | 11 | $ | ACG$ATCA$GGA$#ATATG$TGA$ |
| (1,1) | $ | ATATG$_1 | (1,1) | $ | ATATG$ | 1 | $ | ATATG$TGA$ACG$ATCA$GGA$# |
| (4,1) | $ | ATCA$_4 | (4,1) | $ | ATCA$ | 15 | $ | ATCA$GGA$#ATATG$TGA$ACG$ |
| (1,3) | T | ATG$_1AT | (1,3) | T | ATG$AT | 3 | T | ATG$TGA$ACG$ATCA$GGA$#AT |
| (4,3) | T | CA$_4AT | (4,3) | T | CA$AT | 17 | T | CA$GGA$#ATATG$TGA$ACG$AT |
| (3,2) | A | CG$_3A | (3,2) | A | CG$A | 12 | A | CG$ATCA$GGA$#ATATG$TGA$A |
| (1,5) | T | G$_1ATAT | (3,3) | C | G$AC | 13 | C | G$ATCA$GGA$#ATATG$TGA$AC |
| (3,3) | C | G$_3AC | (1,5) | T | G$ATAT | 5 | T | G$TGA$ACG$ATCA$GGA$#ATAT |
| (2,2) | T | GA$_2T | (5,2) | G | GA$G | 21 | G | GA$#ATATG$TGA$ACG$ATCA$G |
| (5,2) | G | GA$_5G | (2,2) | T | GA$T | 8 | T | GA$ACG$ATCA$GGA$#ATATG$T |
| (5,1) | $ | GGA$_5 | (5,1) | $ | GGA$ | 20 | $ | GGA$#ATATG$TGA$ACG$ATCA$ |
| (1,2) | A | TATG$_1A | (1,2) | A | TATG$A | 2 | A | TATG$TGA$ACG$ATCA$GGA$#A |
| (4,2) | A | TCA$_4A | (4,2) | A | TCA$A | 16 | A | TCA$GGA$#ATATG$TGA$ACG$A |
| (1,4) | A | TG$_1ATA | (1,4) | A | TG$ATA | 4 | A | TG$TGA$ACG$ATCA$GGA$#ATA |
| (2,1) | $ | TGA$_2 | (2,1) | $ | TGA$ | 7 | $ | TGA$ACG$ATCA$GGA$#ATATG$ |

Table 3: From left to right we show the mdolBWT, the dolEBWT and the concBWT of the string collection M = {ATATG, TGA, ACG, ATCA, GGA}.

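| index | eBWT | rotation | index | colexBWT | rotation | index | optimum | rotation |
|---|---|---|---|---|---|---|---|---|
| (4,4) | C | AATC | (1,5) | A | $_1ATCA | (1,4) | A | $_1TGA |
| (3,1) | G | ACG | (2,4) | A | $_2GGA | (2,4) | A | $_2GGA |
| (5,3) | G | AGG | (3,4) | A | $_3TGA | (3,5) | A | $_3ATCA |
| (1,1) | G | ATATG | (4,4) | G | $_4ACG | (4,4) | G | $_4ACG |
| (4,1) | A | ATCA | (5,6) | G | $_5ATATG | (5,6) | G | $_5ATATG |
| (1,3) | T | ATGAT | (1,4) | C | A$_1ATC | (1,3) | G | A$_1TG |
| (2,3) | G | ATG | (2,3) | G | A$_2GG | (2,3) | G | A$_2GG |
| (4,3) | T | CAAT | (3,3) | G | A$_3TG | (3,4) | C | A$_3ATC |
| (3,2) | A | CGA | (4,1) | $ | ACG$_4 | (4,1) | $ | ACG$_4 |
| (3,3) | C | GAC | (5,1) | $ | ATATG$_5 | (5,1) | $ | ATATG$_5 |
| (5,2) | G | GAG | (1,1) | $ | ATCA$_1 | (3,1) | $ | ATCA$_3 |
| (1,5) | T | GATAT | (5,3) | T | ATG$_5AT | (5,3) | T | ATG$_5AT |
| (2,2) | T | GAT | (1,3) | T | CA$_1AT | (3,3) | T | CA$_3AT |
| (5,1) | A | GGA | (4,2) | A | CG$_4A | (4,2) | A | CG$_4A |
| (1,2) | A | TATGA | (4,3) | C | G$_4AC | (4,3) | C | G$_4AC |
| (4,2) | A | TCAA | (5,5) | T | G$_5ATAT | (5,5) | T | G$_5ATAT |
| (1,4) | A | TGATA | (2,2) | G | GA$_2G | (1,2) | T | GA$_1T |
| (2,1) | A | TGA | (3,2) | T | GA$_3T | (2,2) | G | GA$_2G |
| | | | (2,1) | $ | GGA$_2 | (2,1) | $ | GGA$_2 |
| | | | (5,2) | A | TATG$_5A | (5,2) | A | TATG$_5A |
| | | | (1,2) | A | TCA$_1A | (3,2) | A | TCA$_3A |
| | | | (5,4) | A | TG$_5ATA | (5,4) | A | TG$_5ATA |
| | | | (3,1) | $ | TGA$_3 | (1,1) | $ | TGA$_1 |

Table 4: From left to right we show the eBWT, the colexBWT and the optimal BWT of the string collection M = {ATATG, TGA, ACG, ATCA, GGA}, see Section 4.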

3.2 Interesting intervals

Let us call a string U a shared suffix w.r.t. multiset M if it is the suffix of at least two strings in M. Let b be the lexicographic rank of the smallest rotation beginning with U$ and e the lexicographic rank of the largest rotation beginning with U$, among all rotations of strings T$, where T ∈ M. (One can think of [b, e] as the suffix-array interval of U$.) We call [b, e] an interesting interval if there exist i ≠ j s.t. U is a suffix of both T_i and T_j, and the preceding characters in T_i and T_j are different, i.e., the two occurrences of U as suffix of T_i and T_j constitute a left-maximal repeat. [Footnote 1: Interesting intervals correspond to internal nodes in the suffix tree of the reverse string, within the subtree of $.] Note that [1, k] is always an interesting interval, unless all strings end with the same character.

Any two distinct interesting intervals are disjoint.

Proof.

Follows immediately from the fact that no two distinct substrings ending in $ can be one prefix of the other. ∎

We can now narrow down the differences between any two separator-based BWTs of the same multiset. The next proposition states that these can only occur in interesting intervals (part 1). This implies that the dollar-symbols appear in the same positions in all separator-based variants except for one very specific case (part 2). Moreover, we get an upper bound on the Hamming distance between two separator-based BWTs (part 3).

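Proposition 3.2. Let L_1 and L_2 be two separator-based BWTs of the same multiset M.

  1. If L_1[i] ≠ L_2[i] then i ∈ [b, e] for some interesting interval [b, e].

  2. Let I_1 and I_2 be the positions of the dollars in L_1 resp. L_2. If I_1 ≠ I_2 then there exist i ≠ j such that T_i is a proper suffix of T_j.

  3. d_H(L_1, L_2) ≤ Σ (e − b + 1), where the sum is over all interesting intervals [b, e].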

Proof.

1. Let L_1[i] = x and L_2[i] = y, with x ≠ y. Since all separator-based BWT variants use the lexicographic order of the rotations, this means that there exists a substring U which is preceded by x in one string T_p and by y in another string T_q, where the first occurrence has rank i in one BWT variant and the other has rank i in the other BWT variant. This implies that the two occurrences are followed by two dollars, and either the two dollars are different, or they are the same dollar, and the subsequent substrings are different. Therefore, U defines an interesting interval. Parts 2. and 3. follow from 1. ∎

Proposition 3.2 implies that the variation of the different transforms can be explained based solely on what rule is used to break ties for shared suffixes. We see next how the different BWT variants determine this tie-breaking rule.
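To make the statement concrete, the following sketch computes the interesting intervals of the running example and checks that every input order yields an mdolBWT that differs from the original one only inside these intervals. It is a brute-force illustration under our own naming (`sorted_rows`, `interesting_intervals`, `positions_outside`); for a string that is an entire suffix occurrence without a preceding character, the sketch uses `$` as the preceding symbol, which is an assumption on our side.

```python
from collections import defaultdict
from itertools import permutations


def sorted_rows(strings):
    """Sorted rotations of every T_i $_i, tagged with (string, offset)."""
    rows = []
    for i, s in enumerate(strings):
        sym = [(1, c) for c in s] + [(0, i)]  # dollars sort first, by index
        for j in range(len(sym)):
            rows.append((sym[j:] + sym[:j], i, j))
    rows.sort()
    return rows


def mdol_bwt(strings):
    """Character preceding each sorted rotation; dollars printed as '$'."""
    return "".join(strings[i][j - 1] if j > 0 else "$"
                   for _, i, j in sorted_rows(strings))


def interesting_intervals(strings):
    """1-based rank intervals of shared suffixes with distinct left context."""
    ranks, left = defaultdict(list), defaultdict(set)
    for rank, (_, i, j) in enumerate(sorted_rows(strings), start=1):
        suffix = strings[i][j:]  # the rotation starts with suffix + dollar
        ranks[suffix].append(rank)
        left[suffix].add(strings[i][j - 1] if j > 0 else "$")
    return sorted((min(r), max(r)) for u, r in ranks.items()
                  if len(left[u]) > 1)


def positions_outside(strings):
    """Mismatch positions, over all input orders, outside every interval."""
    intervals = interesting_intervals(strings)
    base, bad = mdol_bwt(strings), set()
    for perm in permutations(strings):
        other = mdol_bwt(list(perm))
        for pos, (x, y) in enumerate(zip(base, other), start=1):
            if x != y and not any(b <= pos <= e for b, e in intervals):
                bad.add(pos)
    return sorted(bad)


if __name__ == "__main__":
    M = ["ATATG", "TGA", "ACG", "ATCA", "GGA"]
    print(interesting_intervals(M))
    print(positions_outside(M))
```

The total length of the reported intervals is the bound on the Hamming distance between any two separator-based variants of the same multiset.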

3.3 Permutations induced by separator-based BWT variants

Let us now restrict ourselves to M being a set, i.e., no string occurs more than once. (This is just for convenience since now the input order uniquely defines a permutation w.r.t. lexicographic order; the results of this section apply equally to multisets.) As we showed in the previous subsection, the only differences between the different separator-based BWT variants are given by the order in which shared suffixes are listed. It is also clear that the same order applies in each interesting interval, including the k-length prefix of the transform. Therefore, it suffices to study the permutation of the dollars in this prefix. Since the strings are all distinct, they each have a unique lexicographic rank within the set M, thus the permutation of the dollars can be given w.r.t. the lexicographic order. Let us refer to the permutations induced by the dolEBWT, mdolBWT, concBWT, and colexBWT as π_dolE, π_mdol, π_conc, and π_colex, respectively.

Now, also the input order can be seen as a permutation ρ of the lexicographic order [Footnote 2: For those used to thinking about suffix arrays, ρ can be seen as the inverse suffix array of the input if the strings are thought of as meta-characters.]; if the strings are input in lexicographic order, then ρ = id. For our toy example M = [ATATG, TGA, ACG, ATCA, GGA], we have ρ = 25134.

It is easy to see that the permutation π_mdol is equal to ρ, since the dollar-symbols are ordered according to ρ. For the dolEBWT, the rank of $_i equals the lexicographic rank of T_i among all input strings, i.e., π_dolE = id. Further, by definition, π_colex = γ, where γ denotes the colexicographic order of the input strings. The situation is more complex in the case of concBWT. Since the # is the smallest character, the last string of the input will be the first, while for the others, the lexicographic rank of the following string decides the order. In our running example, π_conc = 45132. We next formalize this.

Let Φ be the linking permutation [KucherovTV13] of ρ, defined by Φ(ρ(i)) = ρ(i + 1), for i < k, and Φ(ρ(k)) = ρ(1), the permutation that maps each element to the element in the next position and the last element to the first. Let us also define, for j ∈ {1, …, k} and x ≠ j, f_j(x) by f_j(x) = x if x < j and f_j(x) = x − 1 otherwise, i.e. f_j(x) gives the rank of element x in the set {1, …, k} \ {j}.

Let ρ be the permutation of the input order w.r.t. the lexicographic order, i.e. the i-th input string has lexicographic rank ρ(i). Then π_conc is given by:

π_conc(1) = ρ(k), and π_conc(1 + f_ρ(1)(Φ(x))) = x for every x ≠ ρ(k).   (1)
Proof.

Follows straightforwardly from the tie-breaking rule of concBWT. ∎

The mapping ρ ↦ π_conc for k = 3 is as follows: 123 ↦ 312, 132 ↦ 231, 213 ↦ 321, 231 ↦ 132, 312 ↦ 231, and 321 ↦ 123. Note that no ρ maps to 213.

As can be seen already for k = 3, not all permutations are reached by this mapping. We will call a permutation π feasible if there exists an input order ρ such that π_conc = π. For k = 3, there are 5 feasible permutations (out of 6), for k = 4, 18 (out of 24). In Table 5, we give the percentage of feasible permutations, for k up to 11. The lexicographic order is always feasible, namely with ρ = k, k−1, …, 1; however, the colex order is not always feasible, as the following example shows.

Let M = {ACA, GAA, TGA}, thus γ = 213, but as we have seen, no permutation will yield this order for concBWT. In addition, the colexBWT AAAACGG$AT$$ has 7 runs, while all feasible ones have at least 8: AAAGACG$AT$$, AAACGAG$AT$$, AAAAGCG$AT$$, AAAGCAG$AT$$, AAACAGG$AT$$.

| no. of seq’s | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|
| feasible permutations | 83.33% | 75.0% | 68.33% | 63.89% | 60.12% | 57.29% | 54.8% | 52.81% | 51.0% |

Table 5: Percentage of feasible permutations w.r.t. concBWT.
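The tie-breaking rule of concBWT is easy to simulate, which gives a way to recompute the first columns of Table 5. The sketch below is our own illustration (the names `pi_conc` and `feasible` are not from any tool) and assumes distinct input strings, so that each string has a unique lexicographic rank. A permutation is written as a tuple whose i-th entry is the lexicographic rank of the string whose dollar comes i-th.

```python
from itertools import permutations


def pi_conc(rho):
    """Dollar order of concBWT for input order rho.

    rho[i] is the lexicographic rank (1-based) of the (i+1)-th input string.
    The last input string comes first (its dollar is followed by '#'); every
    other string is ranked by the string that follows it in the input.
    """
    k = len(rho)
    others = sorted(range(k - 1), key=lambda i: rho[i + 1])
    return (rho[-1],) + tuple(rho[i] for i in others)


def feasible(k):
    """All dollar orders that some input order of k strings can produce."""
    return {pi_conc(rho) for rho in permutations(range(1, k + 1))}


if __name__ == "__main__":
    print(pi_conc((2, 5, 1, 3, 4)))  # running example
    for k in range(3, 8):
        share = 100 * len(feasible(k)) / len(list(permutations(range(k))))
        print(k, len(feasible(k)), f"{share:.2f}%")
```

The enumeration grows factorially, so this direct approach is only practical for small k.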

An important consequence is that the permutations induced by mdolBWT and concBWT are always different: π_mdol ≠ π_conc holds always, since π_conc(1) = ρ(k) ≠ ρ(1) = π_mdol(1). This means that, in whatever order the strings are given w.r.t. lexicographic order, on most string sets the resulting transforms mdolBWT and concBWT will differ.

4 Effects on the parameter r

What is the effect of the different permutations of the strings in M, induced by these BWT variants, on the number of runs of the BWT? As the following example shows, the number of runs can differ significantly between different variants.

Let M = [AAAA, AGCA, GCAA, GTCA, CAAA, CGCA, TCAA, TTCA]. Then
mdolBWT(M) = AAAAAAAAACACACACACACAC$$GTGTGT$$AC$$GT$$ has 28 runs, while
colexBWT(M) = AAAAAAAAAAAACCCCAACCAC$$GGTTGT$$AC$$GT$$ has 18 runs.

The results of Section 3 give us a method to measure the degree to which the BWT variants can differ.

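Let [b, e] be an interesting interval of a separator-based BWT L, and (n_1, …, n_σ) the Parikh vector of L[b..e], i.e. n_i is the number of occurrences of the i-th character. Let a be such that n_a = max_i n_i, and m = Σ_{i ≠ a} n_i, the sum of the other character multiplicities. Then the maximum number of runs in interval [b, e] is e − b + 1 if m ≥ n_a − 1, and 2m + 1 otherwise.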

Proof.

Place the n_a a-characters in a row, creating n_a + 1 gaps, namely one between each pair of adjacent a’s, and one each at the beginning and at the end. Now place all b-characters, each in a different gap; since n_a is maximum, there are enough gaps. Then place all c’s, first filling gaps that are still empty, if any, then into gaps without c, etc. We never have to place two identical characters in the same gap. If the total number of non-a-characters is at least n_a − 1, then we can fill every gap between adjacent a’s, thus separating all a’s, and creating a run for every character of L[b..e]. If we have fewer than n_a − 1 such characters, then we are still creating two runs with each non-a-character, but we cannot separate all a’s. ∎
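The bound can be checked by brute force on small character multisets. The sketch below encodes the case distinction as it follows from the proof (all characters form their own run when the other characters can separate all copies of the most frequent one, and 2m + 1 runs otherwise); `max_runs` and `brute_force_max_runs` are names introduced here for illustration.

```python
from collections import Counter
from itertools import permutations


def runs(seq):
    """Number of maximal blocks of equal symbols in seq."""
    return sum(1 for i, c in enumerate(seq) if i == 0 or c != seq[i - 1])


def max_runs(counts):
    """Largest number of runs over all arrangements of the multiset."""
    n_a = max(counts)        # multiplicity of a most frequent character
    m = sum(counts) - n_a    # number of all other characters
    return n_a + m if m >= n_a - 1 else 2 * m + 1


def brute_force_max_runs(chars):
    """Same quantity by trying every distinct arrangement (tiny inputs only)."""
    return max(runs(p) for p in set(permutations(chars)))


if __name__ == "__main__":
    for chars in ("aaab", "aabbc", "aaaabc", "aaabb", "GAGAA"):
        counts = list(Counter(chars).values())
        print(chars, max_runs(counts), brute_force_max_runs(chars))
```

The last input, GAGAA, is the content of the first interesting interval of the running example, for which every character can form its own run.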

We will use this lemma to measure the variability of a dataset:

Let be a multiset. For an interesting interval , let be the upper bound on the number of runs in from Lemma 4. Then the variability of is

Which of the BWT variants produces the fewest runs? As we have shown, this depends on the input order with most BWT variants, and the only possible variation is within interesting intervals. The colexBWT has been shown experimentally to yield a low number of runs of the BWT [Li14a, CoxBJR12]. Even though it does not always minimize r (one can easily create small examples where other permutations yield a lower number of runs), we can bound its distance from the optimum.

Let L be the colexBWT of multiset M, and let r_OPT denote the minimum number of runs of any separator-based BWT of M. Then runs(L) ≤ r_OPT + 2·c, where c is the number of interesting intervals.

Proof.

Let [b, e] be an interesting interval containing d distinct characters, and let U be the shared suffix defining [b, e]. Since the strings are listed according to the colex order, all strings in which U is preceded by the same character will appear in one block, and therefore, L has exactly d runs in the interval [b, e]. Let x = L[b − 1] and y = L[e + 1]. If x occurs in L[b..e] and it is not the first run of L[b..e] (i.e., L[b] ≠ x), then listing first the strings where U is preceded by x would reduce the number of runs by 1; similarly, listing those where y precedes U as last of the group would reduce the number of runs by 1. By Prop. 3.2, this is the only possibility for varying the number of runs. ∎

Bentley, Gibney, and Thankachan recently gave a linear-time algorithm for computing the order of the dollars which minimizes the number of runs [BentleyGT20], i.e. the optimal order for mdolBWT. The idea is, in effect, to start from the colex-order and then adjust, where possible, the order of the runs within interesting intervals in order to minimize character changes at the borders, i.e. such that the first and the last run of each interesting interval is identical to the run preceding and following that interesting interval. This is equivalent to sorting groups of sequences sharing the same left-maximal suffix. This sorting can be done on each interesting interval independently without affecting the other interesting intervals. In Table 4, we show the result on our toy example, where it reduces the number of runs by 2 w.r.t. colex order (from 14 to 12). We implemented an algorithm that computes the number of optimal runs according to the method of [BentleyGT20] and applied it to our datasets. In the next section, we compare the number of runs of each of the five BWT variants to the optimum.
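For a collection as small as the toy example, the optimum can also be found by exhaustive search over all input orders. The sketch below does exactly that; it is a brute-force stand-in for illustration and not the linear-time algorithm of [BentleyGT20]. The function names are our own, and all dollars are counted as the same character when counting runs, as in Tables 2 and 4.

```python
from itertools import permutations


def mdol_bwt(strings):
    """BWT of T_1 $_1 ... T_k $_k with $_1 < ... < $_k, dollars shown as '$'."""
    text = []
    for i, s in enumerate(strings):
        text += [(1, c) for c in s] + [(0, i)]  # dollars sort first, by index
    rows = sorted(text[j:] + text[:j] for j in range(len(text)))
    return "".join("$" if r[-1][0] == 0 else r[-1][1] for r in rows)


def runs(text):
    """Number of maximal blocks of equal characters in text."""
    return sum(1 for i, c in enumerate(text) if i == 0 or c != text[i - 1])


def colex_order(strings):
    """Strings sorted by their reverses."""
    return sorted(strings, key=lambda s: s[::-1])


def best_order(strings):
    """Input order minimising the runs of mdolBWT (exhaustive, k! orders)."""
    return min((list(p) for p in permutations(strings)),
               key=lambda p: runs(mdol_bwt(p)))


if __name__ == "__main__":
    M = ["ATATG", "TGA", "ACG", "ATCA", "GGA"]
    r_colex = runs(mdol_bwt(colex_order(M)))
    r_opt = runs(mdol_bwt(best_order(M)))
    print("colex:", r_colex, "optimum:", r_opt)
```

The order reported by `best_order` need not coincide with the one shown in Table 4, since several input orders can reach the same minimum.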

5 Experimental results

We computed the five BWT variants for eight different genomic datasets with different characteristics. Four of the datasets contain short reads: SARS-CoV-2 short [Greaney22], Simons Diversity reads [Mallick2016], 16S rRNA short [Winand2019], and Influenza A reads [Influenza2015]. The other four contain long sequences: SARS-CoV-2 long [STARR20201295], 16S rRNA long [16S2018], Candida auris reads [candida2019], and SARS-CoV-2 genomes [BoucherCLMM21], the last of which consists of whole viral genomes. The main features of the datasets, including the number of sequences, the sequence length, and the mean runlength of the optimal BWT, are reported in Table 6. We include the details of the experimental setup in the Appendix.

On each of the datasets, we computed the pairwise Hamming distance between the separator-based BWTs. To compare them to the eBWT, we computed the pairwise edit distance on a small subset of the sequences (for obvious computational reasons), and also the Hamming distance on the same small subset, for comparison. We generated some statistics on each of the datasets: the number of interesting intervals, the fraction of positions within interesting intervals (total length of interesting intervals divided by total length of the dataset), and the dataset’s variability (Def. 4). To study the variation of the parameter $r$, we implemented an algorithm for the optimal input order according to Bentley et al. [BentleyGT20] and computed the optimal number of runs $r_{\mathrm{opt}}$ for each dataset, comparing it to the number of runs of all five BWT variants. In Tables 8 and 9, we include a compact version of these results for the two datasets with the highest and the lowest variation between the BWT variants, the SARS-CoV-2 short sequences and the SARS-CoV-2 genomes, respectively. The full experimental results for all eight datasets are contained in the Appendix.
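
As a minimal sketch of the two distance measures (not the code used for the experiments), the following Python functions compute the absolute and normalized Hamming distance between two transforms of equal length, and the edit distance between two transforms of possibly different length. The function names and the toy strings are hypothetical; normalizing the Hamming distance by the string length is an assumption that is consistent with the values reported in the tables.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def normalized_hamming(x, y):
    """Hamming distance divided by the common length (assumed normalization)."""
    return hamming(x, y) / len(x)


def edit_distance(x, y):
    """Unit-cost edit (Levenshtein) distance, using two rows of the DP table."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


# Two toy transforms with separators, and one without.
print(hamming("BA$$A", "AB$$A"))             # prints 2
print(normalized_hamming("BA$$A", "AB$$A"))  # prints 0.4
print(edit_distance("BA$$A", "BAA"))         # prints 2
```

The dynamic program takes time proportional to the product of the two lengths, which is why a quadratic-time edit distance is only practical on a small subset of the sequences.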

In Table 7 we give a brief summary of these results, reporting, for each dataset, the fraction of positions in interesting intervals, the dataset’s variability, the average pairwise Hamming distance between separator-based BWT variants, and the maximum and minimum value, among the five BWT variants, of the average runlength of the BWT.

The experiments showed a high variation in the number of runs, in particular on datasets of short sequences. The highest difference was between colexBWT and concBWT, by a multiplicative factor of over 4.2, on the SARS-CoV-2 short dataset. In Figure 1 we plot the average runlength for the four short sequence datasets, and the percentage increase of the number of runs w.r.t. the optimum. The variation is less pronounced on the one dataset which is less repetitive, namely Simons Diversity reads. On most long sequence datasets, on the other hand, the differences were quite small (see Appendix). Recall also that the mdolBWT and concBWT vary depending on the input permutation. To better understand how far the colexBWT is from the optimum w.r.t. the number of runs, we plot in Figure 2 the number of runs of colexBWT w.r.t. the optimum, on all eight datasets. The strongest increase is on short sequences, where the variation among all BWT variants is high, as well. On the long sequence datasets, with the exception of SARS-CoV-2 long sequences, the colexBWT is very close to the optimum; however, note that on those datasets, all BWTs are close to the optimum.
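
The quantities discussed here can be recomputed from the run counts in Table 8. The short Python sketch below does so for the separator-based variants on the SARS-CoV-2 short dataset; it assumes that the length n counts one separator per sequence in addition to the 25,000,000 input characters, an assumption under which the computed average runlengths agree with Tables 6 and 7.

```python
# Run counts for the SARS-CoV-2 short dataset, copied from Table 8.
runs = {
    "dolEBWT": 1_868_581,
    "mdolBWT": 3_113_818,
    "concBWT": 3_402_513,
    "colexBWT": 808_906,
}
r_opt = 725_979

# Assumption: n = input characters + one separator per sequence.
n = 25_000_000 + 500_000

# Multiplicative factor between the largest and the smallest number of runs.
factor = max(runs.values()) / min(runs.values())


def pct_increase(r):
    """Percentage increase of r over the optimal number of runs."""
    return 100.0 * (r - r_opt) / r_opt


print(f"max/min run ratio: {factor:.3f}")
for name, r in runs.items():
    print(f"{name}: n/r = {n / r:.3f}, "
          f"increase over optimum = {pct_increase(r):.1f}%")
```

Under this assumption the ratio between concBWT and colexBWT comes out slightly above 4.2, and the average runlengths of colexBWT and concBWT are about 31.524 and 7.494, the maximum and minimum listed for this dataset in Table 7.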

The average number of runs and the average pairwise Hamming distance strongly depend on the length of the sequences in the input collection. If the collection has a lot of short sequences which are very similar, then the differences between the BWTs, both w.r.t. the number of runs and as measured by the Hamming distance, can be large. This is because there are a lot of maximal shared suffixes, and so many positions are in interesting intervals. To better understand this relationship, we plotted, in Figure 3, the average Hamming distance against the two parameters variability and fraction of positions in interesting intervals. We see that the two datasets with the highest average Hamming distance, the SARS-CoV-2 short dataset and the Simons Diversity reads, have at least one of the two values very close to 1, while for those datasets where both values are very low, the BWT variants do not differ very much.

| dataset | no. seq | total length | avg | min | max | $n/r$ (opt) |
|---|---|---|---|---|---|---|
| SARS-CoV-2 short | 500,000 | 25,000,000 | 50 | 50 | 50 | 35.125 |
| Simons Diversity reads | 500,000 | 50,000,000 | 100 | 100 | 100 | 8.133 |
| 16S rRNA short | 500,000 | 75,929,833 | 152 | 69 | 301 | 44.873 |
| Influenza A reads | 500,000 | 115,692,842 | 231 | 60 | 251 | 50.275 |
| SARS-CoV-2 long | 50,000 | 53,726,351 | 1,075 | 265 | 3,355 | 74.498 |
| 16S rRNA long | 16,741 | 25,142,323 | 1,502 | 1,430 | 1,549 | 47.140 |
| Candida auris reads | 50,000 | 124,150,880 | 2,483 | 214 | 8,791 | 1.732 |
| SARS-CoV-2 genomes | 2,000 | 59,610,692 | 29,805 | 22,871 | 29,920 | 523.240 |

Table 6: Main parameters of the eight datasets used for the experiments. From left to right we report the dataset names, the number of sequences, the total length, the average, minimum and maximum sequence length, and the optimum mean runlength ($n/r_{\mathrm{opt}}$).

| dataset | ratio of pos.s in intr.int.s | variability | avg. Hamming d. betw. separator-based BWTs | max avg. runlength | min avg. runlength |
|---|---|---|---|---|---|
| SARS-CoV-2 short | 0.792 | 0.210 | | 31.524 | 7.494 |
| Simons Diversity reads | 0.107 | 0.976 | | 7.873 | 5.299 |
| 16S rRNA short | 0.741 | 0.058 | | 44.253 | 18.836 |
| Influenza A reads | 0.103 | 0.363 | | 49.172 | 23.100 |
| SARS-CoV-2 long | 0.175 | 0.037 | | 73.204 | 57.568 |
| 16S rRNA long | 0.047 | 0.104 | | 46.879 | 45.015 |
| Candida auris reads | 0.007 | 0.497 | | 1.732 | 1.726 |
| SARS-CoV-2 genomes | 0.001 | 0.148 | | 521.610 | 499.549 |

Table 7: Summary of the results on the eight datasets used for the experiments. From left to right we report the dataset names, followed by the ratio of positions in interesting intervals, the variability of the dataset (see Def. 4), and the average normalized Hamming distance between any two separator-based BWT variants. In the last two columns we report the maximum and minimum average runlength ($n/r$) taken over all five BWT variants.
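
For illustration, an average pairwise normalized Hamming distance of the kind summarized in Table 7 can be obtained from the six pairwise values of a per-dataset table. The Python sketch below uses the values reported in Table 8 for the SARS-CoV-2 short dataset; taking the plain arithmetic mean over the six unordered pairs is an assumption about how the summary value is defined.

```python
from itertools import combinations
from statistics import mean

variants = ["dolEBWT", "mdolBWT", "concBWT", "colexBWT"]

# Normalized pairwise Hamming distances, copied from Table 8.
norm_hamming = {
    ("dolEBWT", "mdolBWT"): 0.11820,
    ("dolEBWT", "concBWT"): 0.11477,
    ("dolEBWT", "colexBWT"): 0.11423,
    ("mdolBWT", "concBWT"): 0.11819,
    ("mdolBWT", "colexBWT"): 0.12168,
    ("concBWT", "colexBWT"): 0.11818,
}

# Sanity check: every unordered pair of variants is present exactly once.
if set(norm_hamming) != set(combinations(variants, 2)):
    raise ValueError("missing or unexpected pair of variants")

avg = mean(norm_hamming.values())
print(f"average normalized Hamming distance: {avg:.5f}")
```

On these six values the mean is roughly 0.1175, in line with the figure of up to 12% quoted for this dataset.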

SARS-CoV-2 short (500,000 short sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 3,014,183 | 2,926,602 | 2,912,860 |
| mdolBWT | 0.11820 | 0 | 3,013,908 | 3,102,887 |
| concBWT | 0.11477 | 0.11819 | 0 | 3,013,634 |
| colexBWT | 0.11423 | 0.12168 | 0.11818 | 0 |

Dataset properties

| property | value |
|---|---|
| no. sequences | 500,000 |
| average length | 50 |
| total length | 25,000,000 |
| no. of interesting intervals | 116,598 |
| total length intr.int.s | 20,187,840 |
| fraction pos.s in intr.int.s | 0.792 |
| variability | 0.210 |

Edit distance on a subset of 5,000 sequences (above the diagonal: absolute edit distance; below the diagonal: normalized edit distance)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 28,702 | 43,903 | 43,828 | 46,936 |
| dolEBWT | 0.11256 | 0 | 17,000 | 16,921 | 20,104 |
| mdolBWT | 0.17217 | 0.06667 | 0 | 16,130 | 20,812 |
| concBWT | 0.17187 | 0.06636 | 0.06325 | 0 | 20,830 |
| colexBWT | 0.18406 | 0.07884 | 0.08162 | 0.08169 | 0 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 1,902,148 |
| dolEBWT | 1,868,581 |
| mdolBWT | 3,113,818 |
| concBWT | 3,402,513 |
| colexBWT | 808,906 |
| optimum | 725,979 |

Table 8: Results for the SARS-CoV-2 short dataset. Top left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. Top right: summary of the dataset properties. Bottom left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Bottom right: number of runs and average runlength ($n/r$) of all BWT variants.

SARS-CoV-2 genomes (2,000 long sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 7,958 | 7,900 | 7,263 |
| mdolBWT | 0.00013 | 0 | 7,958 | 7,957 |
| concBWT | 0.00013 | 0.00013 | 0 | 7,990 |
| colexBWT | 0.00012 | 0.00013 | 0.00013 | 0 |

Dataset properties

| property | value |
|---|---|
| no. sequences | 2,000 |
| total length | 59,612,692 |
| average length | 29,805 |
| no. of interesting intervals | 1,863 |
| total length intr.int.s | 80,486 |
| fraction pos.s in intr.int.s | 0.001 |
| variability | 0.148 |

Edit distance on a subset of 50 sequences (above the diagonal: absolute edit distance; below the diagonal: normalized edit distance)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 786 | 795 | 801 | 791 |
| dolEBWT | 0.00053 | 0 | 98 | 107 | 86 |
| mdolBWT | 0.00053 | 0.00007 | 0 | 105 | 112 |
| concBWT | 0.00054 | 0.00007 | 0.00007 | 0 | 114 |
| colexBWT | 0.00053 | 0.00006 | 0.00008 | 0.00008 | 0 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 117,628 |
| dolEBWT | 117,410 |
| mdolBWT | 118,870 |
| concBWT | 119,334 |
| colexBWT | 114,287 |
| optimum | 113,930 |

Table 9: Results for the SARS-CoV-2 genomes dataset. Top left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. Top right: summary of the dataset properties. Bottom left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Bottom right: number of runs and average runlength ($n/r$) of all BWT variants.

Figure 1: Left: plot showing the average runlength ($n/r$) of all BWT variants on short sequence datasets. Right: number of runs (percentage increase with respect to optimal BWT) of all BWT variants on short sequence datasets.

Figure 2: Plot showing the number of runs of the colexBWT with respect to optimal BWT (percentage increase) on all eight datasets.

Figure 3: Plot showing the average normalized Hamming distance variations with respect to variability and fraction of positions in interesting intervals on all datasets.

6 Conclusion

We presented the first study of different variants of the Burrows-Wheeler-Transform for string collections. We found that the data structures computed by different tools can differ considerably, as measured by the pairwise Hamming distance: up to 12% between different BWT variants on the same dataset in our experiments. We showed that most BWT variants in use are input order dependent, so the same tool can produce different variants if the input set is permuted. These differences extend also to the number of runs $r$, a parameter that is central in the analysis of BWT-based data structures, and which is increasingly being used as a measure of the repetitiveness of the dataset itself.

With string collections replacing individual sequences as the prime object of research and analysis, and thus becoming the standard input for text indexing algorithms, we believe that it is all the more important for users and researchers to be aware that not all methods are equivalent, and to understand the precise nature of the BWT variant produced by a particular tool. We further suggest standardizing the definition of the parameter $r$ for string collections, using either the colexicographic order or the optimal order of Bentley et al. [BentleyGT20].

References

Appendix A Full experimental results

A.1 Experimental setup

All datasets are stored in FASTA format.

We used three tools for computing the five BWT variants: pfpebwt, ropebwt2 and Big-BWT. In order to make the BWTs comparable, we made some adaptations to both tools and inputs. We modified ropebwt2 to make it work with the same character order as the other tools, i.e. $ < A < C < G < N < T. Then we used ropebwt2 for computing both the mdolBWT and the colexBWT, using the -R and -R -s flags, respectively. We used pfpebwt for constructing both the eBWT and the dolEBWT. In order to compute the dolEBWT, we modified the input files, appending an end-of-string character at the end of each sequence. Finally, for computing the concBWT, we removed the headers from the FASTA files, arranging the sequences in newline separated files, and ran Big-BWT without additional flags on these newline separated files.
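
The two input adaptations described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the scripts used for the experiments: the function names are hypothetical, and the choice of `$` as end-of-string character is an assumption, since the character expected by each tool is not specified here.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_headers(lines):
    """Newline separated sequences, one per line, without FASTA headers."""
    return "".join(seq + "\n" for _, seq in parse_fasta(lines))


def append_end_marker(lines, marker="$"):
    """FASTA text in which every sequence ends with an end-of-string character.

    The default marker '$' is an assumption made for this sketch.
    """
    return "".join(f"{header}\n{seq}{marker}\n"
                   for header, seq in parse_fasta(lines))


fasta = [">r1 sample", "ACG", "T", "", ">r2", "GGA"]
print(strip_headers(fasta), end="")      # ACGT and GGA on separate lines
print(append_end_marker(fasta), end="")  # headers kept, ACGT$ and GGA$
```

Multi-line records are joined into a single sequence, so each output line of `strip_headers` corresponds to exactly one input string.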

A.2 Further information on the tools

We tested all 12 tools extensively, and determined which data structure they compute, using both our tests and the algorithm descriptions in the respective papers. In this section, we include further information about some of these tools.

  • pfpebwt is a tool computing the eBWT of string collections (https://github.com/davidecenzato/PFP-eBWT.git). It takes a fasta file as input and outputs the eBWT in either plain ASCII text or RLE (run-length-encoded) format. We used (a) no flags for long sequences, and (b) the flags -w 10 -p 10 -n 3 --reads for short sequences. We included it in two different rows of Table 1 because by default pfpebwt computes the eBWT, but it can compute the dolEBWT if the sequences have explicit end-of-string characters (not in multi-thread mode).

  • G2BWT is a tool computing the dolEBWT of short sequence collections (https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2). It takes in input newline separated files. Even though it is not stated explicitly, this tool computes the dolEBWT because, when it constructs the grammar, it uses dollars for separating adjacent strings. Thus, also the string rotations will contain dollars. We tested it using the default settings.

  • msbwt is a tool implementing the Holt and McMillan [HoltM14] merge-based BWT construction algorithm (https://github.com/holtjma/msbwt.git). It takes as input a list of one or several fastq files. Although this tool uses the BCR approach [BauerCR13] for computing the BWTs to merge, it actually computes the dolEBWT. This is because it includes a preprocessing step in which it sorts the input strings lexicographically. Thus, the resulting mdolBWT corresponds to the dolEBWT.

  • BCR is part of the BEETL tool library (https://github.com/giovannarosone/BCR_LCP_GSA). It takes in input a fasta file, a fastq file, or a gz compressed fastq file. It computes the mdolBWT following the approach of Bauer et al., described in [BauerCR13]. We set the ’dataTypeLengthSequences’ variable in Parameters.h to 1.

  • ropebwt2 is a tool computing the FM-index and the mdolBWT of string collections (https://github.com/lh3/ropebwt2.git), using an approach similar to BCR. It takes in input a fasta file, a fastq file, or a gz compressed fastq file. We listed it in two different rows of Table 1 because it computes the mdolBWT or the colexBWT, depending on the flags. We used the -R and the -R -s flags, respectively, to obtain the two transforms. In addition, we modified main.c in order to change the order of the characters to $ < A < C < G < N < T.

  • merge-BWT computes the mdolBWT of a string collection by merging the BWTs of subcollections of the input (https://github.com/jltsiren/bwt-merge.git). It takes in input a list of one or several mdolBWTs. The order of the dollars will depend on the order in which the input BWTs are listed. We tested it using -i plain_sorted and -o plain_sorted flags. We computed the BWTs of the subcollections using ropebwt2.

  • nvSetBWT is a tool included in the nvbio suite (https://github.com/NVlabs/nvbio.git). It takes as input either a fastq or a newline separated file. We tested it using the -R flag for skipping the reverse strand. However, although the algorithmic descriptions in [Pantaleoni14, LiuLL14] seem to describe the mdolBWT, the output of the current version (version 1.1) does not correspond to a possible BWT, because its Parikh vector is different from that of the input.

  • eGSA computes the generalized enhanced suffix array and the mdolBWT of a string collection (https://github.com/felipelouza/egsa.git). It takes in input a text file, a fasta file, or a fastq file. It uses the gSACA-K algorithm for computing the suffix array of subcollections of the input and then merges all suffix arrays. Thus it computes the mdolBWT. We tested it with the -b flag.

  • eGAP computes the mdolBWT, and optionally the LCP-array (longest common prefix array) and DA (document array) of a string collection (https://github.com/felipelouza/egap.git). It takes in input a newline separated file, a fasta file, or a fastq file. We tested it with default settings.

  • bwt-lcp-parallel computes the mdolBWT and the LCP-array of a collection of short sequences (https://github.com/AlgoLab/bwt-lcp-parallel.git). It takes in input fasta files and does not support the N character. We tested it using standard settings.

  • gsufsort computes the SA, LCP and mdolBWT of a string collection (https://github.com/felipelouza/gsufsort.git), using the gSACA-K algorithm of [LouzaGT17b]. It takes in input a newline separated file, a fasta file, or a fastq file. We tested it using --fasta and --bwt flags.

  • BigBWT computes the concBWT, and optionally the suffix array, of a highly repetitive text or string collection (https://github.com/alshai/Big-BWT.git). It takes as input a newline separated file or a fasta file. This tool with the -f flag is used internally in the r-index (https://github.com/alshai/r-index), producing the BWT of the strings concatenated without dollars; an additional data structure is used to mark the string boundaries. On the other hand, the tool without the -f flag will compute the BWT of the fasta files without skipping the fasta headers. We used standard parameters and newline separated files as input; the output is then the concBWT.
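
To make the relation between some of the transforms listed above concrete, the following Python sketch gives a naive, quadratic-time reference construction. It is meant for toy inputs only and is not related to the implementation of any of the tools. It assumes that the mdolBWT is the BWT of the concatenation of the input strings, each followed by its own dollar, where the dollars are smaller than all other characters, are ordered by input position, and are all written as `$` in the output. The colexBWT is then obtained by first sorting the input colexicographically, and the dolEBWT by first sorting it lexicographically, as in the msbwt entry above. All function names are hypothetical.

```python
def mdol_bwt(strings):
    """Naive multidollar BWT of a list of strings, in the given input order."""
    entries = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Suffix s[j:] followed by the i-th dollar; (0, i) sorts before
            # every (1, c), and dollars compare by input position.
            key = [(1, c) for c in s[j:]] + [(0, i)]
            # Character preceding the suffix; a dollar at string starts.
            prev = s[j - 1] if j > 0 else "$"
            entries.append((key, prev))
    entries.sort(key=lambda e: e[0])
    return "".join(prev for _, prev in entries)


def colex_bwt(strings):
    """mdolBWT of the input sorted by colexicographic order."""
    return mdol_bwt(sorted(strings, key=lambda s: s[::-1]))


def dol_ebwt(strings):
    """mdolBWT of the input sorted by lexicographic order."""
    return mdol_bwt(sorted(strings))


print(mdol_bwt(["AC", "CA"]))   # prints CAC$A$
print(mdol_bwt(["CA", "AC"]))   # prints ACC$A$
print(colex_bwt(["AC", "CA"]))  # prints ACC$A$
print(dol_ebwt(["CA", "AC"]))   # prints CAC$A$
```

On this toy input the two input orders give transforms with 6 and 5 runs, respectively, and both outputs contain the same multiset of characters, as any valid BWT of the collection must.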

A.3 Results on individual datasets

SARS-CoV-2 short (500,000 short sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 3,014,183 | 2,926,602 | 2,912,860 |
| mdolBWT | 0.11820 | 0 | 3,013,908 | 3,102,887 |
| concBWT | 0.11477 | 0.11819 | 0 | 3,013,634 |
| colexBWT | 0.11423 | 0.12168 | 0.11818 | 0 |

Dataset properties

| property | value |
|---|---|
| no. sequences | 500,000 |
| average length | 50 |
| total length | 25,000,000 |
| no. of interesting intervals | 116,598 |
| total length intr.int.s | 20,187,840 |
| fraction pos.s in intr.int.s | 0.792 |
| variability | 0.210 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 1,902,148 |
| dolEBWT | 1,868,581 |
| mdolBWT | 3,113,818 |
| concBWT | 3,402,513 |
| colexBWT | 808,906 |
| optimum | 725,979 |

Hamming distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 21,362 | 21,196 | 20,626 |
| mdolBWT | 0.08377 | 0 | 21,376 | 21,256 |
| concBWT | 0.08312 | 0.08383 | 0 | 21,259 |
| colexBWT | 0.08089 | 0.08336 | 0.08337 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 5,000 |
| total length | 250,000 |
| average length | 50 |
| no. of interesting intervals | 2,476 |
| total length intr.int.s | 180,038 |
| fraction pos.s in intr.int.s | 0.706 |
| variability | 0.173 |

Edit distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 28,702 | 43,903 | 43,828 | 46,936 |
| dolEBWT | 0.11256 | 0 | 17,000 | 16,921 | 20,104 |
| mdolBWT | 0.17217 | 0.06667 | 0 | 16,130 | 20,812 |
| concBWT | 0.17187 | 0.06636 | 0.06325 | 0 | 20,830 |
| colexBWT | 0.18406 | 0.07884 | 0.08162 | 0.08169 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 52,979 |
| dolEBWT | 50,803 |
| mdolBWT | 54,766 |
| concBWT | 54,698 |
| colexBWT | 37,320 |
| optimum | 35,904 |

Table 10: Results for the SARS-CoV-2 short dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

Simons Diversity reads (500,000 short sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 3,624,283 | 3,602,362 | 3,594,438 |
| mdolBWT | 0.07249 | 0 | 3,628,799 | 3,623,154 |
| concBWT | 0.07133 | 0.07186 | 0 | 3,617,679 |
| colexBWT | 0.07189 | 0.07246 | 0.07168 | 0 |

Dataset properties

| property | value |
|---|---|
| no. of sequences | 500,000 |
| total length | 50,000,000 |
| average length | 100 |
| no. of interesting intervals | 316,013 |
| total length intr.int.s | 5,387,549 |
| fraction pos.s in intr.int.s | 0.107 |
| variability | 0.976 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 8,974,105 |
| dolEBWT | 9,337,122 |
| mdolBWT | 9,362,564 |
| concBWT | 9,530,334 |
| colexBWT | 6,414,356 |
| optimum | 6,209,567 |

Hamming distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 23,742 | 23,461 | 23,535 |
| mdolBWT | 0.04748 | 0 | 23,785 | 23,722 |
| concBWT | 0.04646 | 0.04710 | 0 | 23,660 |
| colexBWT | 0.04707 | 0.04744 | 0.04685 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 5,000 |
| total length | 500,000 |
| average length | 100 |
| no. of interesting intervals | 3,111 |
| total length intr.int.s | 35,404 |
| fraction pos.s in intr.int.s | 0.070 |
| variability | 0.989 |

Edit distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 72,898 | 72,878 | 72,918 | 74,026 |
| dolEBWT | 0.14435 | 0 | 17,820 | 17,560 | 22,481 |
| mdolBWT | 0.14431 | 0.03529 | 0 | 17,726 | 22,586 |
| concBWT | 0.14439 | 0.03477 | 0.03510 | 0 | 22,595 |
| colexBWT | 0.14659 | 0.04452 | 0.04472 | 0.04474 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 77,646 |
| dolEBWT | 81,758 |
| mdolBWT | 81,883 |
| concBWT | 82,779 |
| colexBWT | 64,229 |
| optimum | 62,117 |

Table 11: Results for the Simons Diversity reads dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

16S rRNA short (500,000 short sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 2,202,008 | 2,540,310 | 1,748,072 |
| mdolBWT | 0.02881 | 0 | 2,201,003 | 2,202,717 |
| concBWT | 0.03324 | 0.02880 | 0 | 2,784,600 |
| colexBWT | 0.02287 | 0.02882 | 0.03643 | 0 |

Dataset properties

| property | value |
|---|---|
| no. of sequences | 500,000 |
| total length | 75,929,833 |
| average length | 152 |
| no. of interesting intervals | 54,366 |
| total length intr.int.s | 56,708,529 |
| fraction pos.s in intr.int.s | 0.742 |
| variability | 0.058 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 1,992,130 |
| dolEBWT | 1,992,211 |
| mdolBWT | 4,057,541 |
| concBWT | 2,767,797 |
| colexBWT | 1,727,127 |
| optimum | 1,703,234 |

Hamming distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 20,159 | 23,229 | 15,835 |
| mdolBWT | 0.02635 | 0 | 20,024 | 20,092 |
| concBWT | 0.03036 | 0.02617 | 0 | 25,464 |
| colexBWT | 0.02070 | 0.02626 | 0.03329 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 5,000 |
| total length | 765,037 |
| average length | 152 |
| no. of interesting intervals | 1,376 |
| total length intr.int.s | 139,041 |
| fraction pos.s in intr.int.s | 0.182 |
| variability | 0.222 |

Edit distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 51,683 | 62,799 | 63,303 | 61,732 |
| dolEBWT | 0.06756 | 0 | 16,968 | 20,180 | 14,166 |
| mdolBWT | 0.08209 | 0.02218 | 0 | 16,695 | 19,371 |
| concBWT | 0.08274 | 0.02638 | 0.02182 | 0 | 21,683 |
| colexBWT | 0.08069 | 0.01852 | 0.02532 | 0.02834 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 35,262 |
| dolEBWT | 35,293 |
| mdolBWT | 50,581 |
| concBWT | 38,900 |
| colexBWT | 30,568 |
| optimum | 30,007 |

Table 12: Results for the 16S rRNA short dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

Influenza A reads (500,000 short sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 3,040,590 | 3,038,509 | 2,938,706 |
| mdolBWT | 0.02617 | 0 | 3,039,095 | 3,041,816 |
| concBWT | 0.02615 | 0.02616 | 0 | 3,089,670 |
| colexBWT | 0.02529 | 0.02618 | 0.02659 | 0 |

Dataset properties

| property | value |
|---|---|
| no. of sequences | 500,000 |
| total length | 116,192,842 |
| average length | 231 |
| no. of interesting intervals | 213,735 |
| total length intr.int.s | 11,995,246 |
| fraction pos.s in intr.int.s | 0.103 |
| variability | 0.363 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 3,258,605 |
| dolEBWT | 3,298,502 |
| mdolBWT | 5,030,032 |
| concBWT | 4,629,150 |
| colexBWT | 2,362,987 |
| optimum | 2,311,133 |

Hamming distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 23,456 | 23,456 | 22,873 |
| mdolBWT | 0.02018 | 0 | 23,509 | 23,407 |
| concBWT | 0.02018 | 0.02023 | 0 | 24,061 |
| colexBWT | 0.01968 | 0.02014 | 0.02070 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 5,000 |
| total length | 1,162,319 |
| average length | 231 |
| no. of interesting intervals | 3,062 |
| total length intr.int.s | 36,019 |
| fraction pos.s in intr.int.s | 0.031 |
| variability | 0.966 |

Edit distance on a subset of 5,000 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 75,966 | 75,935 | 75,991 | 76,437 |
| dolEBWT | 0.06536 | 0 | 18,043 | 18,316 | 21,869 |
| mdolBWT | 0.06533 | 0.01552 | 0 | 17,835 | 22,536 |
| concBWT | 0.06538 | 0.01576 | 0.01534 | 0 | 23,078 |
| colexBWT | 0.06576 | 0.01881 | 0.01939 | 0.01986 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 81,992 |
| dolEBWT | 85,489 |
| mdolBWT | 89,256 |
| concBWT | 87,867 |
| colexBWT | 70,534 |
| optimum | 68,900 |

Table 13: Results for the Influenza A reads dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

SARS-CoV-2 long (50,000 long sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 248,189 | 248,205 | 255,357 |
| mdolBWT | 0.00462 | 0 | 248,572 | 248,631 |
| concBWT | 0.00462 | 0.00462 | 0 | 248,765 |
| colexBWT | 0.00475 | 0.00462 | 0.00463 | 0 |

Dataset properties

| property | value |
|---|---|
| no. sequences | 50,000 |
| total length | 53,776,351 |
| average length | 1,075 |
| no. of interesting intervals | 31,931 |
| total length intr.int.s | 9,436,894 |
| fraction pos.s in intr.int.s | 0.17548 |
| variability | 0.03716 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 882,634 |
| dolEBWT | 879,608 |
| mdolBWT | 934,129 |
| concBWT | 934,117 |
| colexBWT | 734,610 |
| optimum | 721,845 |

Hamming distance on a subset of 1,500 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 4,936 | 4,939 | 5,296 |
| mdolBWT | 0.00306 | 0 | 4,884 | 4,908 |
| concBWT | 0.00306 | 0.00303 | 0 | 5,012 |
| colexBWT | 0.00328 | 0.00304 | 0.00310 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. sequences | 1,500 |
| total length | 1,612,956 |
| average length | 1,075 |
| no. of interesting intervals | 1,046 |
| total length intr.int.s | 152,035 |
| fraction pos.s in intr.int.s | 0.094 |
| variability | 0.047 |

Edit distance on a subset of 1,500 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 19,140 | 21,809 | 21,796 | 22,618 |
| dolEBWT | 0.01186 | 0 | 4,345 | 4,322 | 5,186 |
| mdolBWT | 0.01351 | 0.00269 | 0 | 4,110 | 4,820 |
| concBWT | 0.01350 | 0.00306 | 0.00255 | 0 | 4,893 |
| colexBWT | 0.01401 | 0.00321 | 0.00299 | 0.00303 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 45,262 |
| dolEBWT | 45,155 |
| mdolBWT | 45,572 |
| concBWT | 45,644 |
| colexBWT | 42,516 |
| optimum | 42,093 |

Table 14: Results for the SARS-CoV-2 long dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

16S rRNA long (16,741 long sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 85,960 | 42,948 | 67,103 |
| mdolBWT | 0.00342 | 0 | 85,961 | 82,890 |
| concBWT | 0.00171 | 0.00342 | 0 | 71,264 |
| colexBWT | 0.00267 | 0.00329 | 0.00283 | 0 |

Dataset properties

| property | value |
|---|---|
| no. sequences | 16,741 |
| total length | 25,159,064 |
| average length | 1,501 |
| no. of interesting intervals | 9,918 |
| total length intr.int.s | 1,173,284 |
| fraction pos.s in intr.int.s | 0.047 |
| variability | 0.104 |

Number of runs on the big dataset

| variant | no. runs |
|---|---|
| eBWT | 547,991 |
| dolEBWT | 547,793 |
| mdolBWT | 555,687 |
| concBWT | 558,902 |
| colexBWT | 536,682 |
| optimum | 533,712 |

Hamming distance on a subset of 1,500 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 4,740 | 3,104 | 3,926 |
| mdolBWT | 0.00210 | 0 | 4,716 | 4,783 |
| concBWT | 0.00137 | 0.00209 | 0 | 4,208 |
| colexBWT | 0.00174 | 0.00212 | 0.00186 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. sequences | 1,500 |
| total length | 2,260,229 |
| average length | 1,501 |
| no. of interesting intervals | 946 |
| total length intr.int.s | 72,933 |
| fraction pos.s in intr.int.s | 0.032 |
| variability | 0.104 |

Edit distance on a subset of 1,500 sequences (above the diagonal: absolute; below the diagonal: normalized)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 18,328 | 22,194 | 21,021 | 21,987 |
| dolEBWT | 0.00811 | 0 | 4,410 | 2,761 | 3,858 |
| mdolBWT | 0.00982 | 0.00195 | 0 | 4,323 | 4,691 |
| concBWT | 0.00930 | 0.00122 | 0.00191 | 0 | 4,146 |
| colexBWT | 0.00973 | 0.00171 | 0.00208 | 0.00183 | 0 |

Number of runs on the small dataset

| variant | no. runs |
|---|---|
| eBWT | 62,077 |
| dolEBWT | 62,031 |
| mdolBWT | 62,712 |
| concBWT | 62,800 |
| colexBWT | 61,235 |
| optimum | 60,979 |

Table 15: Results for the 16S rRNA long dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength ($n/r$) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength ($n/r$) of all BWT variants on a subset of the input collection.

Candida auris reads (50,000 long sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 306,071 | 306,431 | 305,665 |
| mdolBWT | 0.00246 | 0 | 305,649 | 305,713 |
| concBWT | 0.00247 | 0.00246 | 0 | 305,469 |
| colexBWT | 0.00246 | 0.00246 | 0.00246 | 0 |

Dataset properties

| property | value |
|---|---|
| no. of sequences | 50,000 |
| total length | 124,200,880 |
| average length | 2,483 |
| no. of interesting intervals | 39,076 |
| total length of interesting intervals | 913,721 |
| fraction of positions in interesting intervals | 0.007 |
| variability | 0.497 |

Number of runs on the big dataset

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT | optimum |
|---|---|---|---|---|---|---|
| no. runs | 72,014,777 | 71,972,783 | 71,972,346 | 71,973,221 | 71,725,274 | 71,704,473 |

Hamming distance on a subset of 1,500 sequences (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 6,333 | 6,393 | 6,260 |
| mdolBWT | 0.00169 | 0 | 6,411 | 6,354 |
| concBWT | 0.00170 | 0.00171 | 0 | 6,294 |
| colexBWT | 0.00167 | 0.00169 | 0.00168 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 1,500 |
| total length | 3,755,776 |
| average length | 2,503 |
| no. of interesting intervals | 1,189 |
| total length of interesting intervals | 18,372 |
| fraction of positions in interesting intervals | 0.005 |
| variability | 0.530 |

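Edit distance on a subset of 1,500 sequences (above the diagonal: absolute edit distance; below the diagonal: normalized edit distance)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 30,345 | 30,552 | 30,562 | 30,794 |
| dolEBWT | 0.00808 | 0 | 4,835 | 4,860 | 6,005 |
| mdolBWT | 0.00813 | 0.00129 | 0 | 4,824 | 6,105 |
| concBWT | 0.00814 | 0.00129 | 0.00128 | 0 | 6,035 |
| colexBWT | 0.00820 | 0.00160 | 0.00163 | 0.00161 | 0 |

Number of runs on the small dataset

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT | optimum |
|---|---|---|---|---|---|---|
| no. runs | 2,635,300 | 2,633,676 | 2,633,652 | 2,633,727 | 2,629,094 | 2,628,470 |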

Table 16: Results for the Candida auris reads dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.
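Unlike the Hamming distance, the edit distance is also defined for strings of different lengths, so it can compare transforms with and without separators. The following is a minimal sketch of the textbook dynamic-programming (Levenshtein) computation with unit costs. The function names and toy strings are hypothetical, and normalizing by the length of the longer string is an assumption made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Toy strings: one transform without separators, one with two separators
# (hypothetical, not the output of any tool).
without_sep = "GGAATTCC"
with_sep = "GGAA$$TTCC"

print(edit_distance(without_sep, with_sep))
print(normalized_edit_distance(without_sep, with_sep))
```

This computation takes time proportional to the product of the two lengths, so it is only practical for small inputs.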

SARS-CoV-2 genomes (2,000 long sequences)

Hamming distance on the big dataset (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 7,958 | 7,900 | 7,263 |
| mdolBWT | 0.00013 | 0 | 7,958 | 7,957 |
| concBWT | 0.00013 | 0.00013 | 0 | 7,990 |
| colexBWT | 0.00012 | 0.00013 | 0.00013 | 0 |

Dataset properties

| property | value |
|---|---|
| no. of sequences | 2,000 |
| total length | 59,612,692 |
| average length | 29,085 |
| no. of interesting intervals | 1,863 |
| total length of interesting intervals | 80,486 |
| fraction of positions in interesting intervals | 0.001 |
| variability | 0.148 |

Number of runs on the big dataset

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT | optimum |
|---|---|---|---|---|---|---|
| no. runs | 117,628 | 117,410 | 118,870 | 119,334 | 114,287 | 113,930 |

Hamming distance on a subset of 50 sequences (above the diagonal: absolute Hamming distance; below the diagonal: normalized Hamming distance)

| | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|
| dolEBWT | 0 | 105 | 119 | 90 |
| mdolBWT | 0.00007 | 0 | 124 | 116 |
| concBWT | 0.00008 | 0.00008 | 0 | 118 |
| colexBWT | 0.00006 | 0.00008 | 0.00008 | 0 |

Small dataset properties

| property | value |
|---|---|
| no. of sequences | 50 |
| total length | 1,490,184 |
| average length | 29,802 |
| no. of interesting intervals | 43 |
| total length of interesting intervals | 271 |
| fraction of positions in interesting intervals | |
| variability | 0.690 |

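Edit distance on a subset of 50 sequences (above the diagonal: absolute edit distance; below the diagonal: normalized edit distance)

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT |
|---|---|---|---|---|---|
| eBWT | 0 | 786 | 795 | 801 | 791 |
| dolEBWT | 0.00053 | 0 | 98 | 107 | 86 |
| mdolBWT | 0.00053 | 0.00007 | 0 | 105 | 112 |
| concBWT | 0.00054 | 0.00007 | 0.00007 | 0 | 114 |
| colexBWT | 0.00053 | 0.00006 | 0.00008 | 0.00008 | 0 |

Number of runs on the small dataset

| | eBWT | dolEBWT | mdolBWT | concBWT | colexBWT | optimum |
|---|---|---|---|---|---|---|
| no. runs | 25,258 | 25,255 | 25,274 | 25,285 | 25,221 | 25,210 |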

Table 17: Results for the SARS-CoV-2 genomes dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.
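To put the run counts in perspective, one can express each variant's number of runs as a relative excess over the optimum. The sketch below does this for the separator-based variants on the big SARS-CoV-2 dataset, using the run counts reported above; the helper name `excess_over_optimum` is hypothetical and the percentage is only one possible way to summarize the gap.

```python
def excess_over_optimum(runs: dict[str, int], optimum: int) -> dict[str, float]:
    """Relative excess (in percent) of each run count over the optimum."""
    return {name: 100.0 * (r - optimum) / optimum for name, r in runs.items()}


# Run counts of the separator-based variants on the big SARS-CoV-2 dataset,
# copied from the table above.
runs_big = {
    "dolEBWT": 117_410,
    "mdolBWT": 118_870,
    "concBWT": 119_334,
    "colexBWT": 114_287,
}
optimum_big = 113_930

for name, pct in excess_over_optimum(runs_big, optimum_big).items():
    print(f"{name}: {pct:.2f}% above the optimum")
```

On these numbers the colexBWT stays within about 0.3% of the optimum, while the other three variants are roughly 3% to 5% above it.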