ATHENA: Automated Tuning of Genomic Error Correction Algorithms using Language Models

12/30/2018
by Mustafa Abdallah, et al.

The performance of most error-correction (EC) algorithms that operate on genomic sequencer reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction. We perform this in a data-driven manner, motivated by the observation that different configuration parameters are optimal for different datasets, i.e., for reads from different instruments and organisms. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that EC performance can be estimated quantitatively and efficiently using the perplexity metric, which is prevalent in NLP. After training the language model, we show that the perplexity computed on runtime data correlates strongly and negatively with how well the erroneous next-generation sequencing (NGS) reads are corrected. We therefore use perplexity to guide a hill climbing-based search that converges toward the best k-value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. This matters because a reference genome often carries its biases forward through the stages of the pipeline.
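The abstract includes no code, so the following is a minimal, self-contained Python sketch of the two ideas it describes: an N-gram language model (here, a character-level bigram model) whose perplexity scores a set of reads, and a hill-climbing search that selects the k whose corrected output minimizes that perplexity. The function names (`train_bigram_model`, `perplexity`, `tune_k`) and the `correct_fn` hook standing in for a real k-mer based EC tool are illustrative assumptions, not the authors' implementation (which also supports RNN language models).

```python
import math
from collections import Counter


def train_bigram_model(reads):
    """Count character unigrams/bigrams over the reads (A/C/G/T alphabet)."""
    unigrams, bigrams = Counter(), Counter()
    for read in reads:
        for a, b in zip(read, read[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return unigrams, bigrams


def perplexity(reads, model, alphabet_size=4, alpha=1.0):
    """Per-symbol perplexity under the bigram model, with add-alpha smoothing.
    Lower perplexity means the reads look more like the training data."""
    unigrams, bigrams = model
    log_prob, n_symbols = 0.0, 0
    for read in reads:
        for a, b in zip(read, read[1:]):
            p = (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * alphabet_size)
            log_prob += math.log(p)
            n_symbols += 1
    return math.exp(-log_prob / max(n_symbols, 1))


def tune_k(reads, model, correct_fn, k_start=17, k_min=11, k_max=31, step=2):
    """Hill climbing on k: keep moving to a neighbouring k while the reads it
    produces after correction have lower perplexity under the language model."""
    cache = {}

    def score(k):
        if k not in cache:
            cache[k] = perplexity(correct_fn(reads, k), model)
        return cache[k]

    k = k_start
    while True:
        neighbours = [c for c in (k - step, k + step) if k_min <= c <= k_max]
        best_neighbour = min(neighbours, key=score, default=k)
        if score(best_neighbour) < score(k):
            k = best_neighbour
        else:
            return k, score(k)


# Usage sketch: `run_ec` is a hypothetical wrapper around any k-mer based EC
# tool; an identity pass-through keeps the example self-contained and runnable.
if __name__ == "__main__":
    raw_reads = ["ACGTACGTGGCA", "TTGACCAGTACG", "ACGTTGCATGCA"]
    lm = train_bigram_model(raw_reads)
    run_ec = lambda reads, k: reads  # replace with a real EC invocation
    best_k, best_ppl = tune_k(raw_reads, lm, run_ec)
    print(f"best k = {best_k}, perplexity = {best_ppl:.3f}")
```

In practice the language model would be trained on a trusted subset of reads and `run_ec` would invoke the EC tool with each candidate k; because perplexity correlates negatively with correction quality, the search needs no reference genome as ground truth.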

