Prefix-Free Parsing for Building Big BWTs

03/29/2018
by Travis Gagie, et al.
High-throughput sequencing technologies have led to explosive growth of genomic databases, the largest of which will soon be hundreds of terabytes or more. For many applications we want indexes of these databases, but building them is a challenge. Fortunately, many genomic databases are highly repetitive, and we should be able to exploit that repetition to help us compute the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper we describe a preprocessing step, prefix-free parsing, that takes a text T and in one pass generates a dictionary D and a parse P of T, with the property that we can compute the BWT of T from D and P alone, using workspace proportional only to their total size and O(|T|) time when we can work in internal memory. Our experiments show that D and P are often significantly smaller than T in practice and so may fit in a reasonable internal memory even when T is very large.
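
The abstract does not spell out how the one-pass parse works, so the following is only a minimal sketch of the general idea under stated assumptions: slide a window of length w over T and end a phrase whenever the window's hash is 0 modulo a chosen parameter, letting consecutive phrases overlap by w characters. The names `prefix_free_parse`, `window_hash`, and `reconstruct`, the default values of `w` and `modulus`, the sentinel characters, and the non-rolling hash are all illustrative choices, not details taken from the text above.

```python
def window_hash(window):
    """Simple polynomial hash of a short string (assumed stand-in for a rolling hash)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text, w=2, modulus=3):
    """Split `text` into overlapping phrases; return (dictionary D, parse P).

    Assumes `text` contains neither sentinel character '\\x01' nor '\\x02'.
    """
    # Hypothetical sentinels: one start marker and w end markers.
    padded = "\x02" + text + "\x01" * w
    phrases = []
    start = 0
    for end in range(w, len(padded) + 1):
        window = padded[end - w:end]
        is_last = end == len(padded)
        is_trigger = window_hash(window) % modulus == 0 and end - start > w
        if is_last or is_trigger:
            phrases.append(padded[start:end])
            start = end - w  # the next phrase re-uses the last w characters
    dictionary = sorted(set(phrases))            # D: distinct phrases, sorted
    rank = {phrase: i for i, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]  # P: phrase ranks in text order
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Rebuild the original text from D and P by undoing the w-character overlaps."""
    padded = dictionary[parse[0]] + "".join(dictionary[r][w:] for r in parse[1:])
    return padded[1:len(padded) - w]  # strip the start and end sentinels


if __name__ == "__main__":
    d, p = prefix_free_parse("GATTACAT!GATACAT!GATTAGATA", w=2, modulus=3)
    print(len(d), "distinct phrases;", len(p), "phrases in the parse")
```

On repetitive input the same phrases recur, so D stays small while P records only their order; that is the property the abstract relies on when it says D and P may fit in memory even when T does not. A real implementation would likely use a rolling hash so each window costs constant time.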
