Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

11/16/2018
by Alan Kuhnle, et al.

While short read aligners, which predominantly use the FM-index, can easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known way to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.
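
To make the two FM-index components concrete, here is a minimal Python sketch of backward search over the BWT. It is an illustration under simplifying assumptions, not the construction method of the paper: the suffix array is built naively by sorting suffixes and stored in full rather than sampled, rank is computed by a linear scan instead of a succinct rank structure, and all function names (`suffix_array`, `bwt_from_sa`, `build_c`, `rank`, `backward_search`, `run_count`) are hypothetical. The text is assumed to end with a unique terminator `$` that sorts before every other character.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    # Fine for a toy example, far too slow for genomic databases.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[k] is the character preceding the k-th smallest suffix.
    # For the suffix starting at 0, text[-1] wraps around to the terminator.
    return "".join(text[i - 1] for i in sa)


def build_c(bwt):
    # C[ch] = number of characters in the text strictly smaller than ch.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def rank(bwt, ch, i):
    # Occurrences of ch in bwt[0:i]; a real index answers this without
    # scanning, using a rank structure over the (run-length compressed) BWT.
    return bwt.count(ch, 0, i)


def backward_search(bwt, c_table, pattern):
    # Returns the half-open SA interval [lo, hi) of suffixes prefixed by
    # pattern, or (0, 0) if the pattern does not occur.
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return (0, 0)
        lo = c_table[ch] + rank(bwt, ch, lo)
        hi = c_table[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


def run_count(bwt):
    # Number of maximal runs of equal characters in the BWT; run-length
    # compression stores one entry per run.
    return sum(1 for i, ch in enumerate(bwt) if i == 0 or ch != bwt[i - 1])
```

Given `sa = suffix_array(text)` and the interval `(lo, hi)` returned by `backward_search`, the occurrence positions are the entries `sa[lo:hi]`. This sketch reads them from a full suffix array, which is exactly the part that becomes too large at pan-genome scale and that an SA sample is meant to replace.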


