Computing the optimal BWT of very large string collections

12/02/2022
by Davide Cenzato, et al.

It is known that the exact form of the Burrows-Wheeler-Transform (BWT) of a string collection depends, in most implementations, on the input order of the strings in the collection. Reordering the strings of an input collection changes the number of equal-letter runs r, arguably the most important parameter of BWT-based data structures such as the FM-index or the r-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection that yields the minimum number of runs in the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler-Transform with the minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Cenzato et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in [Cox et al., Bioinformatics, 2012]; and iii) the algorithm by Bentley et al. We present results on both real-life and simulated data, showing that the improvement in r with respect to the input order is significant and that the overhead of computing the optimal BWT is negligible, making our tool competitive with other tools for BWT computation in terms of running time and space usage. In particular, on real data the optBWT obtains up to 31 times fewer runs with only a 1.39× slowdown. Source code is available at https://github.com/davidecenzato/optimalBWT.git.
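
To make the dependence on input order concrete, the following is a minimal sketch in Python, not the authors' tool. It assumes one common convention for collections: each string gets its own end marker, and the markers are ordered by input position (all printed as `$`). The function names `collection_bwt`, `runs`, and `best_order` are hypothetical. The search over orders is brute force over all permutations, usable only on toy inputs, and is not the linear-time algorithm of Bentley et al.

```python
from itertools import groupby, permutations


def collection_bwt(strings):
    """BWT of a collection, assuming end markers $_1 < $_2 < ... in input order."""
    suffixes = []
    for i, s in enumerate(strings):
        # Letters are tagged 1 and the end marker 0, so markers sort before
        # letters and ties between markers are broken by input position i.
        seq = [(1, c) for c in s] + [(0, i)]
        for j in range(len(seq)):
            # Character preceding the suffix; the first suffix wraps around
            # to the string's own end marker, printed as "$".
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((seq[j:], prev))
    suffixes.sort(key=lambda t: t[0])
    return "".join(prev for _, prev in suffixes)


def runs(bwt):
    """Number of equal-letter runs r."""
    return sum(1 for _ in groupby(bwt))


def best_order(strings):
    """Brute-force search for an input order minimizing r (toy sizes only)."""
    return min(permutations(strings), key=lambda p: runs(collection_bwt(p)))


if __name__ == "__main__":
    for order in (["a", "b", "a"], ["a", "a", "b"]):
        bwt = collection_bwt(order)
        print(order, bwt, runs(bwt))
```

Under these assumptions, the same three strings give 4 runs in the order a, b, a and 3 runs in the order a, a, b, because the characters preceding the end markers appear in the BWT in input order.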
