TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

12/09/2022
by Meryem Banu Cavlak, et al.

Basecalling is an essential step in nanopore sequencing analysis, in which the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps of the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling, so the highly accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall 1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, 2) maintains high accuracy in downstream analysis, 3) precisely filters out up to 94.71% of off-target reads, and 4) achieves better performance, sensitivity, and generality compared to prior works. We freely open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall.
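
The two-stage flow described above can be illustrated with a minimal Python sketch. It is not the TargetCall implementation: the function names (`target_call`, `similarity_check`, `kmers`), the `light_call` and `accurate_call` callbacks, and the parameters `k` and `min_fraction` are all hypothetical, and a simple k-mer overlap test stands in for whatever matching method the authors' Similarity Check actually uses. The point is only the control flow: every raw signal goes through the cheap basecaller, and only signals whose noisy read looks on-target reach the expensive basecaller.

```python
from typing import Callable, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_check(noisy_read: str, target_kmers: Set[str],
                     k: int, min_fraction: float) -> bool:
    """Label a noisy read as on-target (True) or off-target (False).

    Assumption: k-mer overlap is a stand-in for real read-to-reference
    matching; it is used here only to keep the sketch self-contained.
    """
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False  # read shorter than k: treat as off-target
    shared = len(read_kmers & target_kmers)
    return shared / len(read_kmers) >= min_fraction


def target_call(signals: Iterable[object],
                light_call: Callable[[object], str],
                accurate_call: Callable[[object], str],
                target_reference: str,
                k: int = 5,
                min_fraction: float = 0.5) -> List[str]:
    """Run the slow basecaller only on signals that pass the filter."""
    target_kmers = kmers(target_reference, k)
    reads = []
    for signal in signals:
        noisy_read = light_call(signal)  # fast, low-accuracy basecalling
        if similarity_check(noisy_read, target_kmers, k, min_fraction):
            reads.append(accurate_call(signal))  # slow, high-accuracy
    return reads
```

In this sketch the saving comes from `accurate_call` being invoked only for signals labeled on-target; the threshold `min_fraction` trades sensitivity (keeping on-target reads) against how many off-target reads are filtered out.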
