Algorithms to compute the Burrows-Wheeler Similarity Distribution

03/25/2019
by Felipe A. Louza, et al.

The Burrows-Wheeler transform (BWT) is a well-studied text transformation widely used in data compression and text indexing. The BWT of two strings can also provide similarity measures between them, based on the observation that the more their symbols are intermixed in the transformation, the more similar the strings are. In this article we present two new algorithms to compute BWT-based similarity measures for string collections. In particular, we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the BWT computed for the concatenation of all strings, and use compressed data structures that reduce the running time while keeping a small memory footprint, as shown by a set of experiments with real and artificial datasets.
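To make the intermixing idea concrete, here is a minimal Python sketch for a single pair of strings. It is a naive illustration, not the algorithms the article presents. It makes three assumptions that the abstract does not spell out: the distribution is taken over the lengths of maximal runs of symbols coming from the same string, the two terminator characters are hypothetical placeholders that do not occur in the inputs, and the suffixes that begin at a terminator are ignored. The function names `origin_bits` and `run_length_distribution` are invented for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import groupby


def origin_bits(s1: str, s2: str) -> list[int]:
    """Tag each sorted suffix of the concatenation with its source string (0 or 1)."""
    # Hypothetical terminators: assumed absent from both inputs and
    # smaller than every other symbol.
    text = s1 + "\x00" + s2 + "\x01"
    # Naive suffix sorting: fine for a demonstration, far too slow for
    # real collections.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # Skip the two suffixes that start at a terminator (an assumption of
    # this sketch) and tag the rest by the string they start in.
    return [0 if i < len(s1) else 1 for i in order if text[i] not in "\x00\x01"]


def run_length_distribution(bits: list[int]) -> dict[int, Fraction]:
    """Return the fraction of maximal same-origin runs having each length."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    counts = Counter(runs)
    return {k: Fraction(c, len(runs)) for k, c in sorted(counts.items())}


if __name__ == "__main__":
    for a, b in [("ab", "ab"), ("aaa", "bbb")]:
        print(a, b, run_length_distribution(origin_bits(a, b)))
```

Under these assumptions, two identical strings such as "ab" and "ab" interleave perfectly, so every run has length 1. Two strings with disjoint alphabets such as "aaa" and "bbb" do not mix at all, so the sequence consists of just two long runs.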


research
11/06/2018

Tunneling on Wheeler Graphs

The Burrows-Wheeler Transform (BWT) is an important technique both in da...
research
05/25/2018

Strong link between BWT and XBW via Aho-Corasick automaton and applications to Run-Length Encoding

The boom of genomic sequencing makes compression of set of sequences ine...
research
02/26/2022

A theoretical and experimental analysis of BWT variants for string collections

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et ...
research
11/04/2020

Neural text normalization leveraging similarities of strings and sounds

We propose neural models that can normalize text by considering the simi...
research
06/25/2018

Handling Massive N-Gram Datasets Efficiently

This paper deals with the two fundamental problems concerning the handli...
research
02/04/2019

A New Class of Searchable and Provably Highly Compressible String Transformations

The Burrows-Wheeler Transform is a string transformation that plays a fu...
research
10/18/2019

b-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

Recently, randomly mapping vectorial data to strings of discrete symbols...
