PFP Data Structures

06/21/2020
by Christina Boucher et al.

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice, D and P are significantly smaller than S, and computing BWT(S) from them is more efficient than computing it from S directly, at least when S consists of genomes from individuals of the same species. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support the following queries quickly, still in O(|PFP(S)|) space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be constructed efficiently for very large repetitive datasets: construction takes one hour and 54 GB of peak memory for 1000 variants of human chromosome 19, which initially occupy roughly 56 GB.
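
To make the dictionary D and parse P concrete, here is a minimal Python sketch of a prefix-free-style parse. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window length `w`, the modulus `p`, the sentinel characters, and the toy window hash (a sum of character codes standing in for a proper rolling hash) are all hypothetical choices made for this example. Phrases end wherever the window hash is 0 modulo `p`, and consecutive phrases overlap by `w` characters.

```python
def prefix_free_parse(s, w=2, p=3):
    """Toy prefix-free-style parse of s into (dictionary, parse).

    Assumptions (hypothetical, for illustration only): the characters
    '\x01' and '\x02' do not occur in s, and a window is a "trigger"
    when the sum of its character codes is divisible by p.
    """
    # Pad with a start sentinel and w end sentinels so that the first
    # phrase has a defined start and the last phrase a defined end.
    text = "\x01" + s + "\x02" * w

    def is_trigger(window):
        return sum(map(ord, window)) % p == 0

    phrases = []
    start = 0
    for i in range(1, len(text) - w + 1):
        at_end = (i + w == len(text))
        if at_end or is_trigger(text[i:i + w]):
            # The phrase runs through the end of the trigger window;
            # the next phrase starts at the beginning of that window.
            phrases.append(text[start:i + w])
            start = i

    dictionary = sorted(set(phrases))           # D: distinct phrases, sorted
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]        # P: phrase ranks in text order
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Rebuild the original string from (dictionary, parse)."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]                # skip the w overlapping chars
    return out[1:-w]                            # strip the sentinels


if __name__ == "__main__":
    s = "GATTACAGATTACAGATTACTGATTACA"
    D, P = prefix_free_parse(s)
    print(len(D), "distinct phrases;", len(P), "phrases in the parse")
    assert reconstruct(D, P) == s
```

On repetitive inputs many phrases repeat, so D stays small while P records only their ranks, which is the reason the abstract gives for |PFP(S)| being much smaller than |S| on collections of similar genomes.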

research
07/13/2020

Update Query Time Trade-off for dynamic Suffix Arrays

The Suffix Array SA(S) of a string S[1 ... n] is an array containing all...
research
03/29/2018

Prefix-Free Parsing for Building Big BWTs

High-throughput sequencing technologies have led to explosive growth of ...
research
05/12/2020

Counting Distinct Patterns in Internal Dictionary Matching

We consider the problem of preprocessing a text T of length n and a dict...
research
11/14/2022

Augmented Thresholds for MONI

MONI (Rossi et al., 2022) can store a pangenomic dataset T in small spac...
research
07/19/2018

The colored longest common prefix array computed via sequential scans

Due to the increased availability of large datasets of biological sequen...
research
11/16/2018

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

While short read aligners, which predominantly use the FM-index, are abl...
research
06/30/2022

Prefix-free parsing for building large tunnelled Wheeler graphs

We propose a new technique for creating a space-efficient index for larg...
