Matching Statistics speed up BWT construction

05/12/2023
by Francesco Masillo, et al.

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. In particular, the Burrows-Wheeler Transform (BWT) is the basis of some of the most popular self-indexes for genomic data, due to its known favourable behaviour on repetitive data. Some tools that exploit the intrinsic repetitiveness of biological data have risen in popularity, due to their speed and low space consumption. We introduce a new algorithm for computing the BWT, which takes advantage of the redundancy of the data through a compressed version of matching statistics, the CMS of [Lipták et al., WABI 2022]. We show that it suffices to sort a small subset of suffixes, lowering both computation time and space. Our result rests on a new insight that links the so-called insert-heads of [Lipták et al., WABI 2022] to the well-known run boundaries of the BWT. We give two implementations of our algorithm, both competitive in our experimental validation on highly repetitive real-life datasets. In most cases, they outperform other tools with respect to running time, at the cost of a higher memory footprint, which is nevertheless still considerably smaller than the total size of the input data.
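
For readers unfamiliar with the objects named in the abstract, the following is a minimal sketch of the textbook definitions only: a naive BWT obtained by sorting all suffixes, a count of BWT runs, and naive matching statistics of a pattern against a reference. It is not the authors' algorithm, which avoids sorting all suffixes. The function names (`bwt`, `run_count`, `matching_statistics`) and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all suffixes of text + sentinel, then take the
    character preceding each suffix. Assumes the sentinel is smaller
    than every character of text and does not occur in it."""
    s = text + sentinel
    suffix_order = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] with i == 0 wraps around to the sentinel, as required.
    return "".join(s[i - 1] for i in suffix_order)


def run_count(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def matching_statistics(pattern: str, reference: str) -> list[int]:
    """Naive matching statistics: for each position i of pattern, the
    length of the longest prefix of pattern[i:] occurring in reference."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in reference):
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, run_count(transformed))
    print(matching_statistics("abca", "bcab"))
```

Both routines are quadratic or worse and are meant only to fix the definitions; on repetitive inputs the BWT tends to have few runs, which is the property the abstract refers to.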


