Faster Algorithms for Longest Common Substring

05/07/2021
by Panagiotis Charalampopoulos, et al.

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an 𝒪(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an 𝒪(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in 𝒪(n log σ / log n) space and read in 𝒪(n log σ / log n) time. We show that, in this model, we can compute an LCS in time 𝒪(n log σ / √(log n)), which is sublinear in n if σ = 2^{o(√(log n))} (in particular, if σ = 𝒪(1)), using optimal space 𝒪(n log σ / log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Thankachan et al. showed how to compute a k-mismatch LCS in 𝒪(n log^k n) time for k = 𝒪(1) [J. Comput. Biol. 2016]. We show an 𝒪(n log^{k-1/2} n)-time algorithm, for any constant k > 0 and irrespective of the alphabet size, using 𝒪(n) space, as in previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
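To make the two problem statements concrete, the following is a minimal quadratic-time baseline, not the algorithm of the paper. It scans every alignment (diagonal) of S against T with a sliding window that tolerates at most k mismatches; setting k = 0 gives the exact LCS problem. The function name `k_mismatch_lcs` and its return convention are assumptions made for illustration only.

```python
def k_mismatch_lcs(s: str, t: str, k: int = 0) -> tuple:
    """Return (length, start_in_s, start_in_t) of a longest substring of s
    that occurs in t with at most k mismatches.

    Naive O(|s| * |t|)-time, O(1)-extra-space baseline for illustration;
    it is far slower than the suffix-tree and word-RAM methods in the abstract.
    """
    n, m = len(s), len(t)
    best = (0, 0, 0)
    # A diagonal with shift d aligns s[i] with t[i + d].
    for d in range(-(n - 1), m):
        i0 = max(0, -d)               # first position of s on this diagonal
        j0 = i0 + d                   # matching first position of t
        length = min(n - i0, m - j0)  # number of aligned character pairs
        left = 0                      # left end of the current window
        mismatches = 0                # mismatches inside the window
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best


if __name__ == "__main__":
    length, i, j = k_mismatch_lcs("abcdxyz", "xyzabcd")
    print(length, "abcdxyz"[i:i + length])
    print(k_mismatch_lcs("abcde", "abXde", k=1)[0])
```

With k = 0 the window resets at every mismatch, so the sketch reports a longest exact common substring; with k > 0 it reports a longest pair of equal-length fragments at Hamming distance at most k.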

research
∙ 06/13/2021

The k-mappability problem revisited

The k-mappability problem has two integer parameters m and k. For every...
research
∙ 08/08/2023

Linear Time Construction of Cover Suffix Tree and Applications

The Cover Suffix Tree (CST) of a string T is the suffix tree of T with a...
research
∙ 09/02/2022

Elastic-Degenerate String Matching with 1 Error

An elastic-degenerate string is a sequence of n finite sets of strings o...
research
∙ 01/16/2020

Faster STR-EC-LCS Computation

The longest common subsequence (LCS) problem is a central problem in str...
research
∙ 06/21/2021

Computing the original eBWT faster, simpler, and with less memory

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of t...
research
∙ 05/24/2018

Longest Unbordered Factor in Quasilinear Time

A border u of a word w is a proper factor of w occurring both as a prefi...
research
∙ 09/08/2023

Recursive Error Reduction for Regular Branching Programs

In a recent work, Chen, Hoza, Lyu, Tal and Wu (FOCS 2023) showed an impr...
