The Burrows-Wheeler-Transform [BW94] (BWT) is a fundamental string transformation which is at the heart of many modern compressed data structures for text processing, in particular in bioinformatics [bowtie, bwa, bowtie2]. With the increasing availability of low-cost high-throughput sequencing technologies, the focus has moved from single strings to large string collections, such as the 1000 Genomes project [1000genomes], the Genome 10K Project [10K], the 100,000 Genomes Project [100K], the 1001 Arabidopsis Project [arab], and the 3,000 Rice Genomes Project (3K RGP) [rice]. This has led to a widespread use of compressed data structures for string collections.
Concurrently, ever increasing text sizes have been driving a trend towards ever smaller data structures. The size of BWT-based data structures is typically measured in the number of runs (maximal substrings consisting of the same letter) of the BWT, commonly denoted . This parameter has become fundamental as a measure of storage space required by such data structures. Moreover, much recent research effort has concentrated on the construction of data structures which can not only store but query, process, and mine strings in space and time proportional to [GagieNP18, BannaiGI20, Oliva0SMKGB21, CobasGN21].
The parameter is also being increasingly seen as a measure of repetitiveness of the string, with several recent works theoretically exploring its suitability as such a measure, as well as its relationship to other such measures [Navarro21a, GiulianiILPST21, AkagiFI22].
Several tools exist that compute variants of the for string collections, among these BCR [BauerCR13], ropebwt2 [Li14a], nvSetBWT [Pantaleoni14], msbwt [HoltM14], Merge-BWT [Siren16], eGSA [Louza2017d], BigBWT [BoucherGKLMM19], bwt-lcp-parallel [BonizzoniVPPR19], eGAP [EgidiLMT19], gsufsort [LouzaTGPR20], G2BWT [DominguezN21], and pfpebwt [BoucherCLMM21]. It should be noted though that, when the input is a collection of strings, it is not completely straightforward how to compute the BWT—since the BWT was originally designed for individual strings. In fact, there exists more than one way to compute a Burrows-Wheeler-type transform for a collection of strings, and it turns out that different tools not only use different algorithms, but they output different data structures. As a first example, in Table 1, we give the BWT variants as computed by tools on a toy example of DNA-strings.
The classical way of computing text indexes of string collections is to concatenate the strings, adding a different end-of-string-symbol at the end of each string, and then computing the index for the concatenated string. This is the method traditionally used for generating classical data structures such as suffix trees and suffix arrays for more than one string, and results in the so-called generalized suffix tree resp. generalized suffix array (see e.g. [Gusfield1997, Ohlebusch2013]). The drawback of this method is an increase in the size of the alphabet, from , often a small constant in applications, to , where is the number of elements in the collection, typically in the thousands or even tens or hundreds of thousands. One way to avoid this is to use only conceptually different end-of-string-symbols, i.e. to have only one dollar-sign but apply string input order to break ties. This is the method used e.g. by ropebwt2 [Li14a] and by BCR [BauerCR13]. Another method to avoid increasing the alphabet is to separate the input strings using the same end-of-string-symbol, and add a different end-of-string-symbol to the end of the concatenated string, to ensure correctness, as e.g. in BigBWT [BoucherGKLMM19]. An equivalent solution is to concatenate the input strings without removing the end-of-line or end-of-file characters, since these act as separators; or to concatenate them without separators and use a bitvector to mark the end of each string.
Many recent studies use string collections in experiments (e.g. [PuglisiZ21, BannaiGI20, KuhnleMBGLM19]); often the input strings are turned into one single sequence using one of the methods described above, and then the single-string is computed; it is, however, not always stated explicitly which was the method used to obtain one sequence. Underlying this is the implicit assumption that all methods are equivalent.
In 2007, Mantaci et al. [MantaciRRS07] introduced the extended Burrows-Wheeler-Transform (), which generalizes the to a multiset of strings. The , like the , is reversible; moreover, it is independent of the order in which the strings in the collection are presented. This is not true of any of the other methods mentioned above. Note that the differs from the in several ways, most importantly in the order relation for sorting conjugates: while the uses lexicographic order, the uses the so-called omega-order. (For precise definitions, see Section 2.)
The only tool up to date that computes the according to the original definition is pfpebwt [BoucherCLMM21]; all other tools append an end-of-string character to the input strings, explicitly or implicitly, and as a consequence, the resulting data structures differ from the one defined in [MantaciRRS07]. Moreover, the output in most cases depends on the input order of the sequences (except for [DominguezN21] and, using a specific option, [Li14a]). As a further complication, the exact nature of this dependence differs from one data structure to another, as we will show in this paper.
The result is that the variants computed by different tools on the same dataset, or by the same tool on the same dataset but given in a different order, may vary considerably. As we will show, this variability extends to the parameter , the number of runs of the . This is all the more important given the fact that (and the related parameter , the average length of a run) is increasingly being used as a parameter characterizing the dataset itself, namely as a measure of its repetitiveness (see e.g. [CobasGN21, BannaiGI20, BoucherCGHMNR21]).
1.1 Our contribution
To the best of our knowledge, this is the first systematic treatment of the different BWT variants in use for collections of strings. Our contributions are:
We define five distinct BWT variants which are computed by current tools specifically designed for string collections. We formally describe the differences between these, identifying specific intervals to which differences are restricted.
We show the influence of the input order on the output, in dependence of the BWT variant.
We describe the consequences on the number of runs of the BWT and give an upper bound on the amount by which the colexicographic order (sometimes referred to as ’reverse lexicographic order’) can differ from the optimal order of Bentley et al. [BentleyGT20].
We complement our theoretical analysis with extensive experiments, comparing the five BWT variants on eight real-life datasets with different characteristics.
1.2 Related work
This paper deals with tools for string collections, so we did not include any tool that computes the BWT of a single string, such as libdivsufsort [libdivsufsort], sais-lite-lcp [sais-lite-lcp], libsais [libsais], bwtdisk [FerraginaGM12]. Even though, in many cases, these are the tools used for collections of strings, the data structure they compute depends on the method with which the string collection was turned into a single string, as explained above. Nor did we include other BWT variants for single strings such as the bijective BWT [GilScott12, KopplHHS20], since, again, these were not designed for string collections.
The Big-xBWT [GagieGM21] is a tool for compressing and indexing read collections, using the xBWT of Ferragina et al. [xbwt, FerraginaLMM09]. In addition to the string collection, it requires a reference sequence as input, in contrast to the other tools. Moreover, the output is not comparable either, since its length can vary—as opposed to all other BWT variants we review, the xBWT is not a permutation of the input characters but can be shorter, due to the fact that it first maps the input to a tree and then applies the xBWT to it, a BWT-like index for labeled trees, rather than for strings. Likewise, the tool [OhlebuschSB18] for reference-free xBWT is not included in this review: even though it does not require a reference sequence, it, too, computes the xBWT, which is a data structure that does not fall within the category we focus on.
Finally, we did not include SPRING [ChandakTOHW19], a reference-free compressor for FASTQ and FASTA files: even though it uses the BWT during computation, it does not output it.
We give the necessary definitions in Section 2; note that we assume familiarity of the reader with the Burrows-Wheeler-Transform. In Section 3, we present the BWT variants and analyse their differences. In Section 4 we discuss the effects on the repetitiveness measure , while our experimental results are presented in Section 5. We give some conclusions from our study in Section 6. The full tables with detailed results on all eight datasets can be found in the Appendix.
Let be a finite ordered alphabet of size . We use the notation for a string of length over , for the th character, and for the substring of , where ; denotes the length of , and the empty string. Let be a string over and ; a string is called a conjugate of if for some , (also called the th rotation of ). For a string over and an integer , denotes the -fold concatenation of .
A string is called primitive if implies and . Every string can be written uniquely as , where is primitive; we refer to as and to as , i.e., . A run in string is a maximal substring consisting of the same character; we denote by the number of runs of . Often, an end-of-string character (usually denoted $) is appended to the end of ; this character is not element of and is assumed to be smaller than all characters from . Note that appending a $ makes any string primitive.
For two strings , the (unit-cost) edit distance is defined as the minimum number of operations necessary to transform into , where an operation can be deletion or insertion of a character, or substitution of a character by another. The Hamming distance , defined only if , is the number of positions such that .
The lexicographic order on is defined by if is a proper prefix of , or if there exists an index s.t. and for all , . The colexicographic order, or colex-order (referred to as reverse lexicographic order in [Li14a, CoxBJR12]) is defined by if , where denotes the reverse of the string .
Given a string over , the Burrows-Wheeler-Transform [BW94], , is a permutation of the characters of , given by concatenating the last characters of the lexicographically sorted conjugates of . The number of runs of the of string is denoted , i.e. . To make the BWT uniquely reversible, one can add an index to it marking the lexicographic rank of the conjugate in input. For example, , hence , and the index specifies that the input was the 4th conjugate in lexicographic order. Alternatively, one adds a $ to the end of , which makes the input unique: . Note that with and without end-of-string symbol can be quite different.
Next we define the omega-order [MantaciRRS07] on : if and , or if (implying ), where denotes the infinite string obtained by concatenating infinitely many times. The omega-order relation coincides with the lexicographic order if neither of the two strings is a proper prefix of the other. The two orders can differ otherwise, e.g. but .
Given a multiset of strings , the extended Burrows-Wheeler-Transform [MantaciRRS07], , is a permutation of the characters of the strings in , given by concatenating the last characters of the conjugates of each , for , listed in omega-order. For example, the omega-sorted conjugates of are: CGT, GTC, GT, TCG, TG, hence, . Again, adding the indices of the input conjugates makes the uniquely reversible, in this case .
3 BWT variants for string collections
We identified five distinct transforms, which we list below, that were computed by the programs listed in Table 1. Let be a multiset of strings, with total length . Since several of the data structures depend on the order in which the strings are listed, we implicitly regard as a list , and write for a specific permutation in which the strings are presented.
: the extended of of Mantaci et al. [MantaciRRS07]
, (“concatenated BWT”)
, where is the permutation corresponding to the colexicographic (’reverse lexicographic’) order of the strings in
|variant||example||order of shared suffixes||independent|
|of input order?|
|C G G G AT GTA C G T T AAAA A||omega-order of strings||yes|
|GGAAA CGG$$$TTA CT GT$AAA$||lexicographic order of strings||yes|
|GAGAA GCG$$$TTA TC TG$AAA$||input order of strings||no|
|AAGAG GGC$$$TTA CT GT$AAA$||lexicographic order of||no|
|subsequent strings in input|
|AAAGG CGG$$$TTA CT GT$AAA$||colexicographic order||yes|
Because all BWT variants except the use additional end-of-string symbols as string separators, we refer to these four by the collective term separator-based BWT variants. In Table 2 we show the five data structures on our running example of DNA-strings, and give first properties of these data structures. For ease of exposition and comparison, we replaced all separator-symbols by the same dollar-sign $, even where, conceptually or concretely, different dollar-signs are assumed to terminate the individual strings, as is the case for mdolBWT. Moreover, the concBWT contains one additional character, the final end-of-string symbol, here denoted by #, which is smaller than all other characters; thus, the additional rotation starting with # is the smallest and results in an additional dollar in the first position of the transform. For ease of comparison, we remove this first symbol from concBWT and replace the # by $.
It is important to point out that the programs listed in Table 1 do not necessarily use the definitions given here; however, in each case, the resulting transform is the one claimed, up to renaming or removing separator characters, see Section 3.1 and 3.2.
3.1 The effect of adding separator symbols
The first obvious difference between the and the separator-based variants is their length: has length , while all other variants have length , since they contain an additional character (the separator) for each input string.
In the four separator-based transforms, the -length prefix consists of a permutation of the last characters of the input strings. This is because the rotations starting with the dollars are the first lexicographically; in the , these characters occur interspersed with the rest of the transform; namely in the positions corresponding to the omega-ranks of the input strings (see Table 2).
The next point is that adding a $ to the end of the strings introduces a distinction, not present in the , between suffixes and other substrings: since the separators are smaller than all other characters, occurrences of a substring as suffix will be listed en bloc before all other occurrences of the same substring. On the other hand, in the , these occurrences will be listed interspersed with the other occurrences of the same substring.
Let and . occurs both as a suffix and as an internal factor; the characters preceding it are A (internal substring) and C,G (suffix), and we have C GACATAACC, dolEBWT CC$ GCAAATAC$.
Finally, it should be noted that adding end-of-string symbols to the input strings changes the definition of the order applied. As observed above, the omega-order coincides with the lexicographic order on all pairs of strings where neither is a proper prefix of the other; but with end-of-strings characters, no input string can be a proper prefix of another. Thus, on rotations of the ’s, the omega-order equals the lexicographic order. As an example, consider the multiset GTC$, GT$ from Section 2: we have the following omega-order among the rotations: $GT, $GTC, C$GT, GT$, GTC$, T$G, TC$G, which coincides with the lexicographic order. Similarly, adding different dollars $, $, …, $ and applying the omega-order results again in the lexicographic order between the rotations, with different dollar symbols considered as distinct characters. Indeed, if we append a different dollar-sign to each input string, then the omega-order, the lexicographic order, and the order of the suffixes of the concatenated string (i.e. our mdolBWT) are all equivalent.
Regarding the differences among the four separator-based BWT variants, we will show that all differences occur in certain well-defined intervals of the BWT, and that the differences themselves depend only on a specific permutation of , given by the combination of the input order, the lexicographic order of the input strings, and the BWT variant applied. In Tables 3 and 4 we give the full matrices for all five BWT variants, along with the optimal one minimizing the number of runs, see Section 4.
3.2 Interesting intervals
Let us call a string a shared suffix w.r.t. multiset if it is the suffix of at least two strings in . Let be the lexicographic rank of the smallest rotation beginning with and the lexicographic rank of the largest rotation beginning with , among all rotations of strings , where . (One can think of as the suffix-array interval of .) We call an interesting interval if there exist s.t. is a suffix of both and , and the preceding characters in and are different, i.e., the two occurrences of as suffix of and constitute a left-maximal repeat.111Interesting intervals correspond to internal nodes in the suffix tree of the reverse string, within the subtree of . Note that is always an interesting interval, unless all strings end with the same character.
Any two distinct interesting intervals are disjoint.
Follows immediately from the fact that no two distinct substrings ending in can be one prefix of the other. ∎
We can now narrow down the differences between any two separator-based BWTs of the same multiset. The next proposition states that these can only occur in interesting intervals (part 1). This implies that the dollar-symbols appear in the same positions in all separator-based variants except for one very specific case (part 2). Moreover, we get an upper bound on the Hamming distance between two separator-based BWTs (part 3).
Let and be two separator-based BWTs of the same multiset .
If then for some interesting interval .
Let and be the positions of the dollars in resp. . If then there exist such that is a proper suffix of .
1. Let and . Since all separator-based variants use the lexicographical order of the rotations, this means that there exists a substring which is preceded by x in one string and by y in another , the first occurrence has rank in one and the other has rank in the other variant. This implies that the two occurrences are followed by two dollars, and either the two dollars are different, or they are the same dollar, and the subsequent substrings are different. Therefore, defines an interesting interval. Parts 2. and 3. follow from 1. ∎
Proposition 3.2 implies that the variation of the different transforms can be explained based solely on what rule is used to break ties for shared suffixes. We see next how the different BWT variants determine this tie-breaking rule.
3.3 Permutations induced by separator-based BWT variants
Let us now restrict ourselves to being a set, i.e., no string occurs more than once. (This is just for convenience since now the input order uniquely defines a permutation w.r.t. lexicographic order; the results of this section apply equally to multisets .) As we showed in the previous subsection, the only differences between the different separator-based BWT variants are given by the order in which shared suffixes are listed. It is also clear that the same order applies in each interesting interval, including the -length prefix of the transform. Therefore, it suffices to study the permutation of the dollars in this prefix. Since the strings are all distinct, they each have a unique lexicographic rank within the set , thus the permutation of the dollars can be given w.r.t. the lexicographic order. Let us refer to these permutations by the dolEBWT, mdolBWT, concBWT, and colexBWT by and , respectively.
Now, also the input order can be seen as a permutation of the lexicographic order222For those used to thinking about suffix arrays, can be seen as the inverse suffix array of the input if the strings are thought of as meta-characters.; if the strings are input in lexicographic order, then . For our toy example , we have .
It is easy to see that the permutation is equal to , since the dollar-symbols are ordered according to . For the dolEBWT, the rank of $ equals the lexicographic rank of among all input strings, i.e., . Further, by definition, where denotes the colexicographic order of the input strings. The situation is more complex in the case of concBWT. Since the # is the smallest character, the last string of the input will be the first, while for the others, the lexicographic rank of the following string decides the order. In our running example, . We next formalize this.
Let be the linking permutation [KucherovTV13] of , defined by , for , and , the permutation that maps each element to the element in the next position and the last element to the first. Let us also define, for and , by if and otherwise, i.e. gives the rank of element in the set .
Let be the permutation of the input order w.r.t. the lexicographic order, i.e. the th input string has lexicographic rank . Then is given by:
Follows straightforwardly from the tie-breaking rule of concBWT. ∎
The mapping for is as follows: , , , , , and . Note that no maps to .
As can be seen already for , not all permutations are reached by this mapping. We will call a permutation feasible if there exists an input order such that . For , there are feasible permutations (out of ), for , (out of ). In Table 5, we give the percentage of feasible permutations , for up to . The lexicographic order is always feasible, namely with ; however, the colex order is not always feasible, as the following example shows.
Let , thus , but as we have seen, no permutation will yield this order for concBWT. In addition, the AAAACGG$AT$$ has runs, while all feasible ones have at least : AAAGACG$AT$$, AAACGAG$AT$$, AAAAGCG$AT$$, AAAGCAG$AT$$, AAACAGG$AT$$.
An important consequence is that the permutations induced by mdolBWT and concBWT are always differnt: holds always, since . This means that, in whatever order the strings are given w.r.t. lexicographic order, on most string sets the resulting transforms mdolBWT and concBWT will differ.
4 Effects on the parameter
What is the effect of the different permutations of the strings in , induced by these variants, on the number of runs of the ? As the following example shows, the number of runs can differ significantly between different variants.
Let . Then
AAAAAAAAACACACACACACAC$$GTGTGT$$AC$$GT$$ has 28 runs, while
AAAAAAAAAAAACCCCAACCAC$$GGTTGT$$AC$$GT$$ has 18 runs.
The results of Section 3 give us a method to measure the degree to which the BWT variants can differ.
Let be an interesting interval, and
the Parikh vector of, i.e. is the number of occurrences of the th character. Let a be such that , and , the sum of the other character multiplicities. Then the maximum number of runs in interval is if , and otherwise.
Place the a-characters in a row, creating gaps, namely one between each adjacent a, and one each at the beginning and at the end. Now place all b-characters, each in a different gap; since is maximum, there are enough gaps. Then place all c’s, first filling gaps that are still empty, if any, then into gaps without c, etc. We never have to place two identical characters in the same gap. If the total number of non-a-characters is at least than , then we can fill every gap, thus separating all a’s, and creating a run for every character of . If we have fewer than characters, then we are still creating two runs with each non-a-character, but we cannot separate all a’s. ∎
We will use this lemma to measure the variability of a dataset:
Let be a multiset. For an interesting interval , let be the upper bound on the number of runs in from Lemma 4. Then the variability of is
Which of the BWT variants produces the fewest runs? As we have shown, this depends on the input order with most BWT variants, and the only possible variation is within interesting intervals. The colexBWT has been shown experimentally to yield a low number of runs of the [Li14a, CoxBJR12]. Even though it does not always minimize (one can easily create small examples where other permutations yield a lower number of runs), we can bound its distance from the optimum.
Let be the colexBWT of multiset , and let denote the minimum number of runs of any separator-based BWT of . Then , where is the number of interesting intervals.
Let be an interesting interval containing distinct characters, and let be the shared suffix defining . Since the strings are listed according to the colex order, all strings in which is preceded by the same character will appear in one block, and therefore, has exactly runs in the interval . Let and . If occurs in and it is not the first run of (i.e., ), then listing first the strings where is preceded by would reduce the number of runs by ; similarly, listing those where precedes as last of the group would reduce the number of runs by . By Prop. 3.2, this is the only possibility for varying the number of runs. ∎
Bentley, Gibney, and Thankachan recently gave a linear-time algorithm for computing the order of the dollars which minimizes the number of runs [BentleyGT20], i.e. the optimal order for mdolBWT. The idea is, in effect, to start from the colex-order and then adjust, where possible, the order of the runs within interesting intervals in order to minimize character changes at the borders, i.e. such that the first and the last run of each interesting interval is identical to the run preceding and following that interesting interval. This is equivalent to sorting groups of sequences sharing the same left-maximal suffix. This sorting can be done on each interesting interval independently without affecting the other interesting intervals. In Table 4, we show the result on our toy example, where it reduces the number of runs by w.r.t. colex order. We implemented an algorithm that computes the number of optimal runs according to the method of [BentleyGT20] and applied it to our datasets. In the next section, we compare the number of runs of each of the five BWT variants to the optimum.
5 Experimental results
We computed the five BWT variants for eight different genomic datasets, with different characteristics. Four of the datasets contain short reads: SARS-CoV-2 short [Greaney22], Simons Diversity reads [Mallick2016], 16S rRNA short [Winand2019], Influenza A reads [Influenza2015], while four contain long sequences: SARS-CoV-2 long [STARR20201295], 16S rRNA long [16S2018], Candida auris reads [candida2019], one of which, SARS-CoV-2 genomes, consists of whole viral genomes [BoucherCLMM21]. The main features of the datasets, including the number of sequences, sequence length, and the mean runlength of the optimal BWT are reported in Table 6. We include the details of the experiment setup in the Appendix.
On each of the datasets, we computed the pairwise Hamming distance between separator-based BWTs. To compare them to the , we computed the pairwise edit distance on a small subset of the sequences (for obvious computational reasons), computing also the Hamming distance on the small set, for comparison. We generated some statistics on each of the data sets: the number of interesting intervals, the fraction of positions within interesting intervals (total length of interesting intervals divided by total length of the dataset), and the dataset’s variability (Def. 4). To study the variation of the -parameter, we implemented an algorithm for the optimal input order according to Bentley et al. [BentleyGT20] and computed for each data set, comparing it to the number of runs of all five BWT variants. In Table 8 and 9, we include a compact version of these results for the two datasets with the highest and the lowest variation between the BWT variants, the SARS-CoV-2 short sequences and the SARS-CoV-2 genomes, respectively. The full experimental results for all eight datasets are contained in the Appendix.
In Table 7 we give a brief summary of these results, reporting, for each dataset, the fraction of positions in interesting intervals, the dataset’s variability, the average pairwise Hamming distance between separator-based BWT variants, and the maximum and minimum value, among the five BWT variants, of the average runlength of the BWT.
The experiments showed a high variation in the number of runs in particular on datasets of short sequences. The highest difference was between colexBWT and concBWT, by a multiplicative factor of over , on the SARS-CoV-2 short dataset. In Figure 1 we plot the average runlength for the four short sequence datasets, and the percentage increase of the number of runs w.r.t. . The variation is less pronounced on the one dataset which is less repetitive, namely Simons Diversity reads. On most long sequence datasets, on the other hand, the differences were quite small (see Appendix). Recall also that the mdolBWT and concBWT vary depending on the input permutation. To better understand how far the colexBWT is from the optimum w.r.t. the number of runs, we plot in Figure 2 the number of runs of colexBWT w.r.t. to , on all eight datasets. The strongest increase is on short sequences, where the variation among all BWT variants is high, as well. On the long sequence datasets, with the exception of SARS-CoV-2 long sequences, the colexBWT is very close to the optimum; however, note that on those datasets, all BWTs are close to the optimum.
The average number of runs and the average pairwise Hamming distance strongly depend on the length of the sequences in the input collection. If the collection has a lot of short sequences which are very similar, then the differences between the BWTs both w.r.t. the number of runs, and as measured by the Hamming distance, can be large. This is because there are a lot of maximal shared suffixes and so, many positions are in interesting intervals. To better understand this relationship, we plotted, in Figure 3, the average Hamming distance against the two parameters variability and fraction of positions in interesting intervals. We see that the two datasets with highest average Hamming distance, SARS-CoV-2 short dataset and the Simons Diversity reads, have at least one of the two values very close to , while for those datasets where both values are very low, the BWT variants do not differ very much.
We presented the first study of different variants of the Burrows-Wheeler-Transform for string collections. We found that the data structures computed by different tools differ not insignificantly, as measured by the pairwise Hamming distance: up to 12% between different BWT variants on the same dataset in our experiments. We showed that most BWT variants in use are input order dependent, so the same tool can produce different variants if the input set is permuted. These differences extend also to the number of runs , a parameter that is central in the analysis of BWT-based data structures, and which is increasingly being used as a measure of the repetitiveness of the dataset itself.
With string collections replacing individual sequences as the prime object of research and analysis, and thus becoming the standard input for text indexing algorithms, we believe that it is all the more important for users and researchers to be aware that not all methods are equivalent, and to understand the precise nature of the BWT variant produced by a particular tool. We suggest further to standardize the definition of the parameter for string collections, using either the colexicographic order or the optimal order of Bentley et al. [BentleyGT20].
Appendix A Full experimental results
a.1 Experimental setup
All datasets are stored in FASTA format.
We used three tools for computing the five variants; pfpebwt, ropebwt2 and Big-BWT. In order to make the BWTs comparable we did some adaptations to both tools and inputs. We modified ropebwt2 to make it work with the same character order as the other tools, i.e. . Then we used ropebwt2 for computing both the mdolBWT and the colexBWT using the -R and -R -s flags respectively. We used pfpebwt for constructing both the and the dolEBWT variants. In order to compute the dolEBWT, we modified the input files, appending an end-of-string character at the end of each sequence. Finally, for computing the concBWT, we removed the headers from the FASTA files, arranging the sequences in newline separated files, and ran Big-BWT without additional flags on these newline separated files.
a.2 Further information on the tools
We tested all 12 tools extensively, and determined which data structure they compute, using both our tests and the algorithm descriptions in the respective papers. In this section, we include further information about some of these tools.
pfpebwt is a tool computing the of string collections (https://github.com/davidecenzato/PFP-eBWT.git). It takes in input a fasta file and gives in output the in either plain ASCII text or RLE (run-length-encoded) format. We used (a) no flags for long sequences, and (b) the flags -w 10 -p 10 -n 3 --reads for short sequences. We included it in two different rows of Table 1 because by default pfpebwt computes the , but it can compute the dolEBWT if the sequences have explicit end-of-string characters (not in multi-thread mode).
G2BWT is a tool computing the dolEBWT of short sequence collections (https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2). It takes in input newline separated files. Even though it is not stated explicitly, this tool computes the dolEBWT because, when it constructs the grammar, it uses dollars for separating adjacent strings. Thus, also the string rotations will contain dollars. We tested it using the default settings.
msbwt is a tool implementing the Holt and McMillan [HoltM14] merge-based BWT construction algorithm (https://github.com/holtjma/msbwt.git). It takes in input a list of one or several fastq files. Even if this tool uses the BCR approach [BauerCR13] for computing the BWTs to merge, it actually computes the dolEBWT. This is because it features a preprocessing where it sorts the input strings lexicographically. Thus, the resulting mdolBWT corresponds to the dolEBWT.
BCR is part of the BEETL tool library (https://github.com/giovannarosone/BCR_LCP_GSA). It takes in input a fasta file, a fastq file, or a gz compressed fastq file. It computes the mdolBWT following the approach of Bauer et al., described in [BauerCR13]. We set the ’dataTypeLengthSequences’ variable in Parameters.h to 1.
ropebwt2 is a tool computing the FM-index and the mdolBWT of string collections (https://github.com/lh3/ropebwt2.git), using an approach similar to BCR. It takes in input a fasta file, a fastq file, or a gz compressed fastq file. We listed it in two different rows of Table 1 because it computes the mdolBWT or the colexBWT, depending on the flags. We used the -R and the -R -s flags, respectively, to obtain the two transforms. In addition, we modified main.c in order to change the order of the characters to $ < A < C < G < N < T.
merge-BWT computes the mdolBWT of a string collection by merging the BWTs of subcollections of the input (https://github.com/jltsiren/bwt-merge.git). It takes in input a list of one or several mdolBWTs. The order of the dollars will depend on the order in which the input BWTs are listed. We tested it using -i plain_sorted and -o plain_sorted flags. We computed the BWTs of the subcollections using ropebwt2.
nvSetBWT is a tool included in nvbio suite (https://github.com/NVlabs/nvbio.git). It takes in input either a fastq or a newline separated file. We tested it using the -R flag for skipping the reverse strand. However, even if the algorithmic descriptions in [Pantaleoni14, LiuLL14] seem to describe the mdolBWT, the output of the current version (version 1.1) does not correspond to a possible BWT because the Parikh vector is different from that of the input.
eGSA computes the generalized enhanced suffix array and the mdolBWT of a string collection (https://github.com/felipelouza/egsa.git). It takes in input a text file, a fasta file, or a fastq file. It uses the gSACA-K algorithm for computing the suffix array of subcollections of the input and then merges all suffix arrays. Thus it computes the mdolBWT. We tested it with the -b flag.
eGAP computes the mdolBWT, and optionally the LCP-array (longest common prefix array) and DA (document array) of a string collection (https://github.com/felipelouza/egap.git). It takes in input a newline separated file, a fasta file, or a fastq file. We tested it with default settings.
bwt-lcp-parallel computes the mdolBWT and the LCP-array of a collection of short sequences (https://github.com/AlgoLab/bwt-lcp-parallel.git). It takes in input fasta files and does not support the N character. We tested it using standard settings.
gsufsort computes the SA, LCP and mdolBWT of a string collection (https://github.com/felipelouza/gsufsort.git), using the gSACA-K algorithm of [LouzaGT17b]. It takes in input a newline separated file, a fasta file, or a fastq file. We tested it using --fasta and --bwt flags.
BigBWT computes the concBWT, and optionally the suffix array, of a highly repetitive text or string collection (https://github.com/alshai/Big-BWT.git). It takes in input a newline separated file or a fasta file. This tool with the -f flag is used internally in the -index (https://github.com/alshai/r-index), producing the BWT of the strings concatenated without dollars, and an additional data structure is used to mark the string boundaries. On the other hand, the tool without the -f flag will compute the BWT of the fasta files without skipping the fasta headers. We used standard parameters and as input newline separated files, the output then is the concBWT.