Acceleration of FM-index Queries Through Prefix-free Parsing

05/10/2023
by   Aaron Hong, et al.

FM-indexes are a crucial data structure in applications such as DNA alignment, but searching with them usually takes at least one random access per character of the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it must be parsed somehow before word-based FM-indexing can be applied. Last year, Deng et al. proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries over all state-of-the-art methods with only a minor increase in memory. Our source code is available at https://github.com/marco-oliva/afm.
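To make the parsing step concrete, the following is a minimal sketch of prefix-free parsing in Python. It is an illustration under stated assumptions, not the code from the repository above. The function names `prefix_free_parse` and `reconstruct` are hypothetical, the sentinel characters `\x01` and `\x02` are assumed not to occur in the text, and `zlib.crc32` recomputed per window stands in for the rolling hash a practical implementation would likely use. The window length `w` and modulus `p` play the role of the tuning parameters mentioned in the abstract.

```python
import zlib


def prefix_free_parse(text, w=4, p=10):
    """Split text into overlapping phrases; return (dictionary, parse).

    A window of length w is treated as a trigger string when its hash is
    0 modulo p. Each phrase ends with a trigger string, and consecutive
    phrases overlap by exactly w characters.
    """
    if w < 1 or p < 1:
        raise ValueError("w and p must be positive")
    if "\x01" in text or "\x02" in text:
        raise ValueError("text must not contain the sentinel characters")

    # Assumption: one start sentinel, and w end sentinels that force a
    # final trigger window.
    s = "\x01" + text + "\x02" * w
    last = len(s) - w  # start position of the forced final window

    phrases = []
    start = 0
    for i in range(1, last + 1):
        window = s[i:i + w]
        # crc32 is a stand-in hash, recomputed per window for brevity.
        if i == last or zlib.crc32(window.encode()) % p == 0:
            phrases.append(s[start:i + w])  # phrase includes the trigger
            start = i                       # next phrase starts at the trigger

    dictionary = sorted(set(phrases))             # distinct phrases, sorted
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]          # text as a sequence of ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w=4):
    """Invert prefix_free_parse by dropping the w-character overlaps."""
    phrases = [dictionary[r] for r in parse]
    s = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return s[1:len(s) - w]  # strip the start sentinel and the w end sentinels


if __name__ == "__main__":
    genome = "GATTACAGATTACAGATTACA" * 5 + "ACGT"
    dictionary, parse = prefix_free_parse(genome, w=4, p=10)
    print(len(dictionary), "distinct phrases,", len(parse), "phrases in the parse")
    assert reconstruct(dictionary, parse, w=4) == genome
```

In this sketch a larger `p` makes trigger windows rarer, so phrases tend to be longer and the parse shorter. A word-based index would then be built over the parse, with each rank treated as one symbol.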


research
06/25/2022

PalFM-index: FM-index for Palindrome Pattern Matching

Palindrome pattern matching (pal-matching) problem is a generalized patt...
research
10/04/2021

FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

The run-length compressed Burrows-Wheeler transform (RLBWT) used in conj...
research
09/21/2020

Space/time-efficient RDF stores based on circular suffix sorting

In recent years, RDF has gained popularity as a format for the standardi...
research
04/11/2019

Gating Mechanisms for Combining Character and Word-level Word Representations: An Empirical Study

In this paper we study how different ways of combining character and wor...
research
06/21/2022

The Complexity of the Co-Occurrence Problem

Let S be a string of length n over an alphabet Σ and let Q be a subset o...
research
09/25/2019

Internal Dictionary Matching

We introduce data structures answering queries concerning the occurrence...
research
11/17/2021

Character Transformations for Non-Autoregressive GEC Tagging

We propose a character-based nonautoregressive GEC approach, with automa...
