KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

04/28/2023
by Sabuzima Nayak, et al.

K-mer counting is a prerequisite step for DNA assembly because it speeds up the overall process. K-mer frequencies are used to estimate the parameters of DNA assembly, for error correction, and so on. The process also yields a list of distinct K-mers, which assists in searching large databases and in reducing the size of de Bruijn graphs. Nonetheless, K-mer counting is a data- and compute-intensive process. Hence, it is crucial to implement a lightweight data structure that occupies little memory yet processes K-mers quickly. We propose a lightweight K-mer counting technique, called KmerCo, that implements a potent counting Bloom Filter variant, called countBF. KmerCo has two phases: insertion and classification. The insertion phase inserts all K-mers into countBF and determines the distinct K-mers. The classification phase classifies the distinct K-mers into trustworthy and erroneous K-mers based on a user-provided threshold value. We also propose a novel benchmark performance metric. We used the Hadoop MapReduce program to determine the frequency of K-mers. We conducted rigorous experiments to demonstrate the superiority of KmerCo over state-of-the-art K-mer counting techniques. The experiments were conducted using DNA sequences of four organisms. The datasets were pruned to generate four datasets of different sizes. KmerCo is compared with Squeakr, BFCounter, and Jellyfish. Compared with these three methods, KmerCo had the lowest memory usage, the highest number of insertions per second, and a positive trustworthy rate.
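The two phases described above can be illustrated with a minimal Python sketch. It uses a generic textbook counting Bloom filter, not the countBF variant from the paper, whose internals are not given in this abstract. The class and function names, the hash choice (`blake2b`), the filter size, and the rule "estimated count at or above the threshold means trustworthy" are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter (illustrative; NOT the paper's countBF)."""

    def __init__(self, num_counters=1 << 16, num_hashes=3, max_count=255):
        self.counters = bytearray(num_counters)  # one 8-bit counter per slot
        self.num_hashes = num_hashes
        self.max_count = max_count

    def _slots(self, item):
        # Derive num_hashes slot indices by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % len(self.counters)

    def add(self, item):
        for slot in self._slots(item):
            if self.counters[slot] < self.max_count:  # saturate, never overflow
                self.counters[slot] += 1

    def estimate(self, item):
        # The minimum over the item's slots never undercounts (below saturation).
        return min(self.counters[slot] for slot in self._slots(item))


def kmers(sequence, k):
    """Yield every length-k substring of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def insertion_phase(reads, k, cbf):
    """Insert all K-mers; record a K-mer as distinct when first seen."""
    distinct = []
    for read in reads:
        for kmer in kmers(read, k):
            if cbf.estimate(kmer) == 0:  # a false positive may hide a new K-mer
                distinct.append(kmer)
            cbf.add(kmer)
    return distinct


def classification_phase(distinct, cbf, threshold):
    """Assumed rule: estimated count >= threshold means trustworthy."""
    trustworthy, erroneous = [], []
    for kmer in distinct:
        target = trustworthy if cbf.estimate(kmer) >= threshold else erroneous
        target.append(kmer)
    return trustworthy, erroneous
```

For example, calling `insertion_phase(["ACGTACGT", "ACGTTCGT"], 4, CountingBloomFilter())` and then `classification_phase(distinct, cbf, 2)` would place `ACGT`, which occurs three times, among the trustworthy K-mers. Because a counting Bloom filter can overestimate counts, a rare K-mer may occasionally be misclassified as trustworthy; the memory saved by not storing the K-mers themselves is the trade-off.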

