Longest Common Prefix Arrays for Succinct k-Spectra

06/08/2023
by   Jarno N. Alanko, et al.

The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. k-spectra have many applications in bioinformatics, including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) was recently introduced as an algorithmic tool to efficiently represent and query these objects. The longest common prefix (LCP) array for a k-spectrum is an array of length n, where n is the length of the SBWT of the spectrum, that stores the lengths of the longest common prefixes of adjacent k-mers as they occur in lexicographical order. The LCP array has at least two important applications: accelerating pseudoalignment algorithms that use the SBWT, and allowing simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper, we explore algorithms to compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but that for smaller values of k it is often outperformed by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays.
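
To make the definitions concrete, the following is a minimal sketch of the k-spectrum and its LCP array computed directly from an explicit, sorted list of k-mers. It is not the paper's method: it does not build or query the SBWT, and it ignores any padding or dummy k-mers that an SBWT representation may add. The function names (`k_spectrum`, `common_prefix_length`, `naive_lcp_array`) and the example string are hypothetical, chosen only for illustration. Each adjacent pair is compared character by character, so the cost is O(nk) for n k-mers, the same bound as the straightforward baseline mentioned above.

```python
def k_spectrum(text, k):
    """Return the distinct length-k substrings of text in lexicographic order."""
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


def common_prefix_length(a, b):
    """Count how many leading characters a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def naive_lcp_array(kmers):
    """LCP array of a lexicographically sorted k-mer list.

    lcp[0] is set to 0 by convention (an assumption of this sketch);
    lcp[i] is the common prefix length of kmers[i - 1] and kmers[i].
    """
    lcp = [0] * len(kmers)
    for i in range(1, len(kmers)):
        lcp[i] = common_prefix_length(kmers[i - 1], kmers[i])
    return lcp


# Illustrative input, not taken from the paper.
kmers = k_spectrum("ACGTACGA", 3)
lcp = naive_lcp_array(kmers)
print(kmers)  # ['ACG', 'CGA', 'CGT', 'GTA', 'TAC']
print(lcp)    # [0, 0, 2, 0, 0]
```

Here only `CGA` and `CGT` share a non-empty prefix (`CG`), so the single non-zero entry is 2. Because the sketch materializes every k-mer as a separate string, it is useful for checking small cases by hand rather than for genome-scale inputs.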


