Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays

by Dominik Kempa, et al.

The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-n text uses Θ(n log n) bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-n text over an alphabet of size σ, these data structures use only O(n log σ) bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for σ=2, the fastest algorithm remains stuck at O(n) time [Hon et al., FOCS 2003], which is slower by a Θ(log n) factor than the lower bound of Ω(n / log n) (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes O(n log σ) bits, answers suffix array queries in O(log^ϵ n) time, and can be constructed in O(n log σ / √(log n)) time using O(n log σ) bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using O(n log σ) bits).
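To make the object under discussion concrete, the following is a minimal Python sketch of a plain (uncompressed) suffix array and of pattern matching by binary search over it. It is only an illustration of the definition: the construction simply sorts the suffixes and is far slower than any algorithm mentioned above, it is not the construction from the paper, and the function names `suffix_array` and `occurrences` as well as the 0-based indexing are assumptions made for this example.

```python
def suffix_array(text):
    """Return the 0-based start positions of all suffixes of `text`,
    listed in lexicographic order of the suffixes.

    Naive illustration: sorts the suffixes directly, which is much
    slower than the linear-time and sublinear-time constructions
    discussed in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`,
    found by two binary searches over the suffix array `sa`."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)
    print(occurrences(text, sa, "ana"))
```

Storing `sa` explicitly takes one position per suffix, which is the Θ(n log n)-bit cost that the compressed suffix array and the FM-index are designed to avoid.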




