Nucleotide String Indexing using Range Matching

08/06/2023
by Alon Rashelbach, et al.
The two most common data structures for genome indexing, FM-indices and hash tables, exhibit a fundamental trade-off between memory footprint and performance. We present Ranger, a new indexing technique for nucleotide sequences that is both memory-efficient and fast. We observe that nucleotide sequences can be represented as integer ranges, and we leverage a range-matching algorithm based on neural networks to perform the lookup. We prototype Ranger in software and integrate it into the popular Minimap2 tool. Ranger achieves end-to-end performance almost identical to that of the original Minimap2, while occupying 1.7× and 1.2× less memory for short and long reads, respectively. With limited memory capacity, Ranger achieves up to a 4.3× speedup for short reads compared to FM-Index, and up to 4.2× and 1.8× speedups for short and long reads, respectively, compared to hash tables. Ranger also opens up new opportunities for hardware acceleration: it reduces the memory footprint of long-seed indexes used in state-of-the-art alignment accelerators by up to 23×, which results in 3× faster alignment with negligible accuracy degradation. Moreover, its worst-case memory bandwidth and latency can be bounded in advance without the need to inflate DRAM capacity.
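The observation that nucleotide sequences map to integer ranges can be illustrated with a minimal sketch. It assumes a 2-bit code per base (A=0, C=1, G=2, T=3) and a fixed k-mer length, neither of which is specified in the abstract, and it substitutes a plain binary search for the neural-network range matcher. The names `encode`, `prefix_range`, and `RangeIndex` are hypothetical and are not taken from Ranger or Minimap2.

```python
from bisect import bisect_right

# Assumed 2-bit code per base; the abstract does not specify an encoding.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODE[base]
    return value


def prefix_range(prefix, k):
    """Inclusive integer range covering every k-mer that starts with `prefix`."""
    pad = 2 * (k - len(prefix))      # bits left free after the prefix
    lo = encode(prefix) << pad       # prefix followed by all A's
    hi = lo | ((1 << pad) - 1)       # prefix followed by all T's
    return lo, hi


class RangeIndex:
    """Toy matcher over disjoint, inclusive integer ranges.

    Binary search stands in here for the learned range-matching model.
    """

    def __init__(self, ranges):
        # ranges: iterable of (lo, hi, payload) with non-overlapping [lo, hi]
        self.ranges = sorted(ranges)
        self.starts = [r[0] for r in self.ranges]

    def lookup(self, key):
        # Find the last range whose start is <= key, then check its end.
        i = bisect_right(self.starts, key) - 1
        if i >= 0 and key <= self.ranges[i][1]:
            return self.ranges[i][2]
        return None


k = 4
index = RangeIndex([
    (*prefix_range("AC", k), "AC"),
    (*prefix_range("GT", k), "GT"),
])
print(index.lookup(encode("ACGT")))  # falls inside the range for prefix "AC"
print(index.lookup(encode("TTTT")))  # falls outside every stored range
```

In this sketch every k-mer sharing a prefix falls into one contiguous interval, so a seed lookup reduces to finding the interval that contains a single integer key. That is the kind of query a range-matching algorithm answers.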
