The colored longest common prefix array computed via sequential scans

07/19/2018
by F. Garofalo, et al.

Due to the increased availability of large datasets of biological sequences, sequence comparison tools increasingly rely on efficient alignment-free approaches. Most alignment-free approaches require computing statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that makes it possible to tackle several problems efficiently with an alignment-free approach. We show that this data structure can be computed via sequential scans in semi-external memory. Using the cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists of comparing a single string pairwise against each string of a collection of m strings simultaneously, in order to obtain m ACS-induced distances. Experimental results confirm the effectiveness of our approach.
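
To make the multi-string ACS problem concrete, the following is a minimal brute-force sketch in Python. It is not the cLCP-based strategy of the paper, which the abstract does not detail. It assumes the commonly cited ACS definition, in which the score of a string against another is the average, over all starting positions, of the longest substring that also occurs in the other string, and it assumes the usual log-normalized, symmetrized form for the induced distance. All function names are hypothetical and chosen for illustration only.

```python
import math


def longest_match_lengths(a, b):
    """For each position i of a, return the length of the longest
    prefix of a[i:] that occurs somewhere in b (brute force)."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Extend the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs_score(a, b):
    """Average of the longest-match lengths of a against b."""
    lengths = longest_match_lengths(a, b)
    return sum(lengths) / len(lengths)


def directed_distance(a, b):
    """Assumed normalization: log|b| / ACS(a, b) - 2 log|a| / |a|."""
    score = acs_score(a, b)
    if score == 0:
        # No shared character at all: treat the strings as infinitely far apart.
        return math.inf
    return math.log(len(b)) / score - 2 * math.log(len(a)) / len(a)


def acs_distance(a, b):
    """Symmetrized ACS-induced distance (assumed form)."""
    return (directed_distance(a, b) + directed_distance(b, a)) / 2


def acs_one_vs_many(query, collection):
    """Multi-string setting: one query against m strings, giving m distances."""
    return [acs_distance(query, s) for s in collection]
```

This sketch holds every string in memory and takes time that grows quickly with string length, which is exactly the cost a scan-based, semi-external-memory approach is meant to avoid on large collections.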


research
07/13/2020

Update Query Time Trade-off for dynamic Suffix Arrays

The Suffix Array SA(S) of a string S[1 ... n] is an array containing all...
research
07/03/2022

Suffix sorting via matching statistics

We introduce a new algorithm for constructing the generalized suffix arr...
research
05/17/2018

External memory BWT and LCP computation for sequence collections with applications

We propose an external memory algorithm for the computation of the BWT a...
research
06/21/2020

PFP Data Structures

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a p...
research
11/15/2018

Vectorized Character Counting for Faster Pattern Matching

Many modern sequence alignment tools implement fast string matching usin...
research
11/23/2010

Evolutionary distances in the twilight zone -- a rational kernel approach

Phylogenetic tree reconstruction is traditionally based on multiple sequ...
research
12/14/2022

An Efficient Incremental Simple Temporal Network Data Structure for Temporal Planning

One popular technique to solve temporal planning problems consists in de...