Blind Biological Sequence Denoising with Self-Supervised Set Learning

09/04/2023
by Nathan Ng, et al.

Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤ 6 subreads with 17% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.
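
The idea of a set "midpoint" in both the latent and sequence spaces can be illustrated with a toy sketch. This is not the SSSL model: the abstract describes a learned embedding and a decoder, whereas the sketch below substitutes a fixed k-mer count vector for the embedding and a medoid under edit distance for the sequence-space midpoint. All function names (`edit_distance`, `kmer_embedding`, `latent_midpoint`, `sequence_medoid`) and the example subreads are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def kmer_embedding(seq: str, k: int = 2, alphabet: str = "ACGT") -> list[float]:
    """Assumed stand-in for a learned encoder: a fixed k-mer count vector."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [float(counts["".join(p)]) for p in product(alphabet, repeat=k)]


def latent_midpoint(subreads: list[str]) -> list[float]:
    """Coordinate-wise mean of the subread embeddings (the latent 'average')."""
    vectors = [kmer_embedding(s) for s in subreads]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sequence_medoid(subreads: list[str]) -> str:
    """Subread with the smallest total edit distance to the whole set."""
    return min(subreads, key=lambda s: sum(edit_distance(s, t) for t in subreads))


# Hypothetical noisy subreads of one short source sequence.
subreads = ["ACGTTGCA", "ACGTGCA", "ACGTTGCA", "ACCTTGCA"]
print(sequence_medoid(subreads))
print(latent_midpoint(subreads))
```

A medoid can only return one of the observed subreads, so it cannot correct an error shared by every subread. That limitation is one reason a learned set embedding that is decoded into a new sequence, as the abstract describes, is a different and more flexible approach than this sketch.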
