Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays

06/24/2021
by   Dominik Kempa, et al.

The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-n text uses Θ(n log n) bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-n text over an alphabet of size σ, these data structures use only O(n log σ) bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for σ = 2, the fastest algorithm remains stuck at O(n) time [Hon et al., FOCS 2003], which is slower by a Θ(log n) factor than the lower bound of Ω(n / log n) (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes O(n log σ) bits, answers suffix array queries in O(log^ϵ n) time, and can be constructed in O(n log σ / √(log n)) time using O(n log σ) bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using O(n log σ) bits).
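
To make the objects in the abstract concrete, the following is a minimal Python sketch of a plain (uncompressed) suffix array, a pattern count via binary search over it, and a rough bit count contrasting n log n with n log σ. It is not the construction from the paper: the sorting-based builder is deliberately naive and far slower than O(n), positions are 0-indexed, and the names `suffix_array`, `count_occurrences`, `plain_sa_bits`, and `packed_text_bits` are hypothetical helpers chosen for illustration.

```python
from math import ceil, log2


def suffix_array(text: str) -> list[int]:
    """Naive builder: sort suffix start positions by the suffix they start.

    Illustrative only; this is not a linear-time or sublinear-time method.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def plain_sa_bits(n: int) -> int:
    """Bits for n entries of ceil(log2 n) bits each (the n log n term)."""
    return n * ceil(log2(n))


def packed_text_bits(n: int, sigma: int) -> int:
    """Bits for n symbols of ceil(log2 sigma) bits each (the n log sigma term)."""
    return n * ceil(log2(sigma))


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)
    print(count_occurrences(text, sa, "ana"))
    # Assumed example sizes: n = 2**20 symbols over a 4-letter alphabet.
    print(plain_sa_bits(2**20), packed_text_bits(2**20, 4))
```

For the assumed sizes n = 2^20 and σ = 4, the two helper functions differ by a factor of 10, which illustrates the gap between the Θ(n log n) bits of a plain suffix array and the O(n log σ) bits targeted by compact indexes; the constants hidden in the O-notation are ignored here.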


