Text Indexing and Searching in Sublinear Time

12/20/2017
by J. Ian Munro, et al.

We introduce the first index that can be built in $o(n)$ time for a text of length $n$, and also queried in $o(m)$ time for a pattern of length $m$. On a constant-size alphabet, for example, our index uses $O(n \log^{1/2+\varepsilon} n)$ bits, is built in $O(n / \log^{1/2-\varepsilon} n)$ deterministic time, and finds the $\mathrm{occ}$ pattern occurrences in time $O(m/\log n + \sqrt{\log n}\,\log\log n + \mathrm{occ})$, where $\varepsilon > 0$ is an arbitrarily small constant. As a comparison, the most recent classical text index uses $O(n \log n)$ bits, is built in $O(n)$ time, and searches in time $O(m/\log n + \log\log n + \mathrm{occ})$. We build on a novel text sampling based on difference covers, whose properties allow us to compute longest common prefixes efficiently, in constant time. We extend our results to the secondary memory model as well, where we give the first construction in $o(\mathrm{Sort}(n))$ time of a data structure with suffix array functionality, which can search for patterns in almost optimal time, with an additive penalty of $O(\sqrt{\log_{M/B} n}\,\log\log n)$, where $M$ is the size of the available main memory and $B$ is the disk block size.
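
As background for the sampling mentioned in the abstract, the following is a minimal sketch of the classical notion of a difference cover, not the paper's own construction. A set D is a difference cover modulo v if every residue modulo v can be written as a difference of two elements of D. The useful consequence is that any two text positions i and j can be shifted by the same small amount h < v so that both land on sampled positions. The function names, the example set {0, 1, 3} and the modulus 7 are assumptions chosen for illustration.

```python
def is_difference_cover(D, v):
    """Check that every residue modulo v is a difference of two elements of D."""
    diffs = {(a - b) % v for a in D for b in D}
    return diffs == set(range(v))


def sample_positions(n, D, v):
    """Text positions p in [0, n) whose residue p mod v belongs to D."""
    members = {x % v for x in D}
    return [p for p in range(n) if p % v in members]


def common_shift(i, j, D, v):
    """Smallest h in [0, v) such that i + h and j + h are both sampled residues.

    Such an h always exists when D is a difference cover modulo v.
    """
    members = {x % v for x in D}
    for h in range(v):
        if (i + h) % v in members and (j + h) % v in members:
            return h
    raise ValueError("D is not a difference cover modulo v")


if __name__ == "__main__":
    D, v = {0, 1, 3}, 7  # illustrative choice, not taken from the paper
    print(is_difference_cover(D, v))
    print(sample_positions(10, D, v))
    print(common_shift(2, 5, D, v))
```

In this sketch, comparing the suffixes starting at i and j only requires inspecting fewer than v characters before reaching a pair of sampled positions, where precomputed information about sampled suffixes could take over. How the paper achieves constant-time longest-common-prefix computation on top of its sampling is not shown here.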

research
06/25/2022

PalFM-index: FM-index for Palindrome Pattern Matching

Palindrome pattern matching (pal-matching) problem is a generalized patt...
research
09/08/2018

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Indexing highly repetitive texts --- such as genomic databases, software...
research
04/11/2020

Grammar-compressed Self-index with Lyndon Words

We introduce a new class of straight-line programs (SLPs), named the Lyn...
research
08/04/2023

Optimally Computing Compressed Indexing Arrays Based on the Compact Directed Acyclic Word Graph

In this paper, we present the first study of the computational complexit...
research
11/24/2022

A fast and simple O(z log n)-space index for finding approximately longest common substrings

We describe how, given a text T [1..n] and a positive constant ϵ, we can...
research
11/03/2018

Compressed Multiple Pattern Matching

Given d strings over the alphabet {0,1,...,σ-1}, the classical Aho--Cora...
research
10/10/2019

E2FM: an encrypted and compressed full-text index for collections of genomic sequences

Next Generation Sequencing (NGS) platforms and, more generally, high-thr...
