Sketching and Sequence Alignment: A Rate-Distortion Perspective

07/09/2021
by Ilan Shomorony, et al.

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length n and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require B = (1/D) · O(log n) bits to achieve a distortion D, our proposed approach only requires B = log^2(1/D) · O(1) bits. This can lead to significant computational savings in pairwise alignment estimation.
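To make the min-hash baseline mentioned in the abstract concrete, here is a minimal Python sketch of a generic k-mer min-hash and of the two bit-budget scalings quoted above. It is an illustration under stated assumptions, not the authors' implementation: the k-mer length, the number of hash functions, the BLAKE2b-based hash, and all function names are hypothetical choices, the constants hidden in the O(·) terms are set to 1, and logarithms are taken base 2. The abstract does not describe how locational hashing works, so only its stated scaling is shown.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """64-bit hash of a k-mer under a given seed (illustrative choice)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=64):
    """Generic min-hash sketch: the minimum hash value per seed."""
    ks = kmers(seq, k)
    return [min(seeded_hash(x, s) for x in ks) for s in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima, an estimate of k-mer set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def minhash_bits(distortion, n):
    """B = (1/D) * log n, with the hidden constant assumed to be 1."""
    return (1.0 / distortion) * math.log2(n)


def locational_bits(distortion):
    """B = log^2(1/D), with the hidden constant assumed to be 1."""
    return math.log2(1.0 / distortion) ** 2


if __name__ == "__main__":
    read_a = "ACGTTGCAACGTAGCTAGCTAGGATCCGATCGATTACG"
    read_b = "TTGCAACGTAGCTAGCTAGGATCCGATCGATTACGGGA"
    sim = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print(f"estimated k-mer Jaccard similarity: {sim:.2f}")
    for d in (0.1, 0.01, 0.001):
        print(d, minhash_bits(d, n=10_000), locational_bits(d))
```

Under these assumptions the min-hash budget grows linearly in 1/D, while the locational-hashing budget grows only polylogarithmically in 1/D, which is the gap the abstract highlights. The actual constants, and therefore the crossover point, are not given in the abstract.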
