E2FM: an encrypted and compressed full-text index for collections of genomic sequences

10/10/2019
by Ferdinando Montecuollo, et al.

Next Generation Sequencing (NGS) platforms and, more generally, high-throughput technologies are giving rise to an exponential growth in the size of nucleotide sequence databases. Moreover, many emerging applications of nucleotide datasets, such as those related to personalized medicine, require compliance with regulations about the storage and processing of sensitive data. We have designed and carefully engineered E2FM-index, a new full-text index in minute space which was optimized for compressing and encrypting nucleotide sequence collections in FASTA format and for performing fast pattern-search queries. E2FM-index builds self-indexes which occupy as little as 1/20 of the storage required by the input FASTA file, thus saving about 95% of storage when indexing collections of highly similar sequences; moreover, it can search the built indexes for exact pattern matches in times ranging from a few milliseconds to a few hundred milliseconds, depending on pattern length. Supplementary material and supporting datasets are available through Bioinformatics Online and https://figshare.com/s/6246ee9c1bd730a8bf6e.
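
As a rough illustration of the kind of exact pattern search a full-text index supports, the following minimal sketch builds a textbook FM-index (suffix array, Burrows-Wheeler transform, and rank tables) and runs backward search over it. It is not the E2FM-index implementation: it performs no compression or encryption, it stores uncompressed tables, and the function names `build_fm_index` and `locate` are hypothetical names chosen for this example.

```python
def build_fm_index(text):
    """Build a naive, uncompressed FM-index (illustrative only)."""
    text = text + "$"  # sentinel, assumed absent from the input and smaller than A/C/G/T
    # Suffix array by direct sorting: fine for tiny inputs, not for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in the text strictly smaller than ch.
    c_table = {}
    total = 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def locate(pattern, sa, c_table, occ):
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTACG")
    print(locate("ACG", *index))
```

The search loop runs once per pattern character regardless of text length, which is consistent with query times that depend mainly on pattern length; a practical index would replace the dense `occ` tables and full suffix array with sampled, compressed structures.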

research
10/07/2019

ER-index: a referential index for encrypted genomic databases

Huge DBMSs storing genomic information are being created and engineerize...
research
11/08/2017

A compressed dynamic self-index for highly repetitive text collections

We present a novel compressed dynamic self-index for highly repetitive t...
research
09/19/2018

The Read-Optimized Burrows-Wheeler Transform

The advent of high-throughput sequencing has resulted in massive genomic...
research
08/04/2019

Matching reads to many genomes with the r-index

The r-index is a tool for compressed indexing of genomic databases for e...
research
03/29/2021

A Fast and Small Subsampled R-index

The r-index (Gagie et al., JACM 2020) represented a breakthrough in comp...
research
12/20/2017

Text Indexing and Searching in Sublinear Time

We introduce the first index that can be built in o(n) time for a text o...
research
02/05/2021

A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs

FM-index is an efficient data structure for string search and is widely ...
