String Sampling with Bidirectional String Anchors

12/20/2021
by Grigorios Loukides, et al.

The minimizers sampling mechanism is a popular string sampling mechanism introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (that is, in every sliding window of length w + k - 1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Minimizers sampling mechanisms have two main disadvantages: first, they do not have good guarantees on the expected size of their samples for every combination of w and k; and second, indexes constructed over their samples do not have good worst-case guarantees for on-line pattern searches. We introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment (that is, in every sliding window of length ℓ). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ, and that these sizes are competitive with or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. As a negative result, we show that computing a total order ≤ on the input alphabet that minimizes the bd-anchors sample size is NP-hard. We also show that, using any bd-anchors sample, we can construct in near-linear time an index that requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time.
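
To make the two definitions concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm the abstract refers to. It applies the definitions directly, window by window. The function names `minimizers` and `bd_anchors`, the 0-based positions, and the rule of breaking ties by the leftmost candidate are assumptions made for this illustration only.

```python
def minimizers(s, w, k):
    """Start positions of the smallest k-mer in every window of w consecutive k-mers.

    Brute force; ties are broken by the leftmost position (an assumption).
    """
    if w < 1 or k < 1:
        raise ValueError("w and k must be positive integers")
    positions = set()
    n_kmers = len(s) - k + 1  # number of length-k substrings of s
    for start in range(n_kmers - w + 1):
        # Each window covers k-mers start .. start+w-1, i.e. w + k - 1 letters.
        best = min(range(start, start + w), key=lambda i: (s[i:i + k], i))
        positions.add(best)
    return sorted(positions)


def bd_anchors(s, ell):
    """Start positions of the smallest rotation in every length-ell window.

    Brute force; ties are broken by the leftmost position (an assumption).
    """
    if ell < 1:
        raise ValueError("ell must be a positive integer")
    positions = set()
    for start in range(len(s) - ell + 1):
        frag = s[start:start + ell]
        # Rotation j of frag starts at frag[j] and wraps around to frag[:j].
        best = min(range(ell), key=lambda j: (frag[j:] + frag[:j], j))
        positions.add(start + best)
    return sorted(positions)


if __name__ == "__main__":
    text = "aabaaabcbda"
    print(bd_anchors(text, 5))     # window length 5
    print(minimizers(text, 3, 3))  # window length w + k - 1 = 5
```

For the string `aabaaabcbda`, with both mechanisms using windows of length 5, this sketch yields the bd-anchors positions `[3, 4, 5, 10]` and the minimizers positions `[0, 3, 4, 5, 6]`. Consecutive windows often agree on the selected position, which is what keeps both samples small.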

