Computing Maximal Unique Matches with the r-index

05/03/2022
by   Sara Giuliani, et al.

In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index while preserving the space and time bounds. We add O(r) additional samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT. These samples permit the computation of the second longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. When the dataset is not highly repetitive, our method uses up to 8 times less space but is up to 19 times slower. On highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.
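
To make the role of the second longest match concrete, here is a minimal brute-force sketch in Python. It is not the r-index algorithm and not mum-phinder: it scans the text directly in quadratic time, and all function names (`lcp`, `two_longest_matches`, `naive_mums`) and the `min_len` parameter are hypothetical, chosen only for illustration. For each pattern suffix it records the longest and second longest match in the text; a maximal match whose second longest match is strictly shorter occurs once in the text and is a candidate MUM, which is then checked for uniqueness in the pattern.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def two_longest_matches(text, pattern, i):
    """Longest and second longest match of pattern[i:] over distinct
    text positions. Returns (best, second, pos), where pos is the text
    position of the longest match (-1 if nothing matches)."""
    best, second, pos = 0, 0, -1
    suffix = pattern[i:]
    for j in range(len(text)):
        length = lcp(suffix, text[j:])
        if length > best:
            best, second, pos = length, best, j
        elif length > second:
            second = length  # ties with best land here, so second == best
    return best, second, pos


def count_occurrences(s, w):
    """Number of (possibly overlapping) occurrences of w in s."""
    return sum(1 for k in range(len(s) - len(w) + 1) if s.startswith(w, k))


def naive_mums(text, pattern, min_len=1):
    """Brute-force MUMs as (pattern_pos, text_pos, length) triples.
    Illustration only; assumed helper, not part of any real tool."""
    mums = []
    prev_best = 0
    for i in range(len(pattern)):
        best, second, pos = two_longest_matches(text, pattern, i)
        # Maximal: the match starting at i is not contained in the
        # longer match that starts one position earlier.
        is_mem = best >= min_len and (i == 0 or prev_best <= best)
        # Unique in the text: every other text position matches less.
        if is_mem and second < best:
            w = pattern[i:i + best]
            # Candidate MUM: still has to be unique in the pattern.
            if count_occurrences(pattern, w) == 1:
                mums.append((i, pos, best))
        prev_best = best
    return mums


print(naive_mums("GATTACA", "TTACG"))  # [(0, 2, 4), (4, 0, 1)]
```

In this toy run, `TTAC` (pattern position 0, text position 2) and `G` (pattern position 4, text position 0) are reported; raising `min_len` to 2 keeps only the first. The approach described in the abstract obtains the same two quantities per pattern suffix from the index and the extra LCP samples instead of scanning the text.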


