Computing all-vs-all MEMs in run-length encoded collections of HiFi reads

by Diego Díaz-Domínguez, et al.

We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty of our work is that we use run-length compression to help deal with these errors. Our method receives as input a run-length-encoded string collection containing the HiFi reads along with their reverse complements. It then splits the encoding into two arrays, one storing the symbols of the equal-symbol runs and the other storing the run lengths. The purpose of the split is to compute the BWT of the run symbols and reorder their lengths accordingly. We show that this special BWT, because it encodes both the HiFi reads and their reverse complements, supports bi-directional queries for the HiFi reads. We then propose a variation of the MEM algorithm of Belazzougui et al. (2013) that exploits the run-length encoding and the implicit bi-directional property of our BWT to compute approximate MEMs. Concretely, if the algorithm finds that two substrings, a_1 … a_p and b_1 … b_p, form a MEM, it reports the MEM only if their corresponding length sequences, ℓ^a_1 … ℓ^a_p and ℓ^b_1 … ℓ^b_p, do not differ beyond an input threshold. We measure the similarity of the length sequences with a simple metric that we call the run-length excess. Our technique facilitates the detection of MEMs with homopolymer errors because it does not require dynamic programming to find approximate matches in which the only edits are changes in the lengths of the equal-symbol runs. Finally, we present a method that relies on a geometric data structure to report the text occurrences of the MEMs detected by our algorithm.
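
As a rough illustration of the split described in the abstract, the Python sketch below separates a read into its run symbols and run lengths, then filters a candidate match by comparing the two length sequences against a threshold. The abstract does not define the run-length excess, so `length_difference` is a hypothetical stand-in (a sum of absolute differences), not the authors' metric. The function names and the example reads are also assumptions made for illustration. The sketch compares whole reads and omits the BWT and the MEM search.

```python
from itertools import groupby


def rle_split(seq):
    """Split a string into its run symbols and the matching run lengths."""
    symbols, lengths = [], []
    for sym, run in groupby(seq):
        symbols.append(sym)
        lengths.append(sum(1 for _ in run))
    return "".join(symbols), lengths


def length_difference(lens_a, lens_b):
    """Hypothetical stand-in for the run-length excess (NOT the paper's
    definition): the sum of absolute differences between aligned run lengths."""
    return sum(abs(x - y) for x, y in zip(lens_a, lens_b))


def accept_match(lens_a, lens_b, threshold):
    """Keep a match on run symbols only if the run lengths are close enough."""
    if len(lens_a) != len(lens_b):
        return False
    return length_difference(lens_a, lens_b) <= threshold


# Two made-up reads that differ only in homopolymer lengths.
read_a = "ACCCGTTA"
read_b = "ACCGTTTA"

sym_a, len_a = rle_split(read_a)
sym_b, len_b = rle_split(read_b)

# The run symbols match exactly, so the reads agree up to run lengths.
print(sym_a, len_a)
print(sym_b, len_b)
print(accept_match(len_a, len_b, threshold=2))
```

Because the run symbols of the two example reads are identical, the decision to keep the match reduces to an arithmetic check on the length arrays. This is the sense in which no dynamic programming is needed when the only edits are run-length changes.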




