Computing the original eBWT faster, simpler, and with less memory

06/21/2021
by Christina Boucher, et al.

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. Since its introduction, however, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x over the second-best method. The peak memory is at most 2x larger than that of the second-best method. Compared with methods that, like our algorithm, are able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
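To make the order-independence property concrete, the following is a minimal sketch of the textbook definition of the original eBWT: collect every rotation of every input string, sort the rotations by omega-order (compare their infinite repetitions), and concatenate the last characters. It is not the linear-time algorithm proposed in the paper, nor the pfpebwt implementation; it is a naive, quadratic-or-worse illustration. The names `omega_cmp` and `naive_ebwt` are hypothetical and chosen only for this example.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare u and v by omega-order, i.e. compare u^omega with v^omega.

    By the Fine and Wilf periodicity lemma, two infinite powers that agree
    on their first len(u) + len(v) characters are identical, so comparing
    prefixes of that length is sufficient.
    """
    n = len(u) + len(v)
    pu = (u * (n // len(u) + 1))[:n]
    pv = (v * (n // len(v) + 1))[:n]
    return (pu > pv) - (pu < pv)


def naive_ebwt(strings):
    """Naive eBWT of a collection: last characters of all rotations,
    taken in omega-order. Illustrative only; not the paper's algorithm."""
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


# The result does not depend on the order of the input strings.
assert naive_ebwt(["ab", "aab"]) == naive_ebwt(["aab", "ab"])

# For a single string, no end-of-string symbol is needed.
print(naive_ebwt(["banana"]))
```

Rotations that tie under omega-order are powers of the same root and therefore end in the same character, so the output is well defined even when the sort breaks ties arbitrarily.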


