Computing MEMs on Repetitive Text Collections
We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n], which is represented as a (hopefully much smaller) run-length context-free grammar of size g_rl. We show that the problem can be solved in time O(m^2 log^ϵ n), for any constant ϵ > 0, on a data structure of size O(g_rl). Further, on a locally consistent grammar of size O(δ log(n/δ)), the time decreases to O(m log m (log m + log^ϵ n)). The value δ is a function of the substring complexity of T, and Ω(δ log(n/δ)) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n and δ. We extend our results to the problem of finding q-MEMs, which must appear at least q times in T.
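To make the problem statement concrete, here is a minimal brute-force sketch in Python of what a MEM (and a q-MEM) is. It is not the grammar-based method of the paper and has none of its time or space guarantees: it scans the plain text directly. It assumes the usual definition, in which a MEM is a substring of P that occurs in T at least q times and cannot be extended to the left or right within P without dropping below q occurrences. The names `count_occurrences` and `naive_mems` are hypothetical, and positions are 0-based.

```python
def count_occurrences(text, s):
    """Count occurrences of s in text, allowing overlaps."""
    count = 0
    start = text.find(s)
    while start != -1:
        count += 1
        start = text.find(s, start + 1)
    return count


def naive_mems(pattern, text, q=1):
    """Return the q-MEMs of pattern in text as (start, length) pairs.

    Brute-force illustration only: quadratic or worse in practice.
    """
    m = len(pattern)

    # longest[i] = length of the longest substring starting at pattern[i]
    # that occurs at least q times in text (0 if even one symbol fails).
    longest = []
    for i in range(m):
        length = 0
        while (i + length < m
               and count_occurrences(text, pattern[i:i + length + 1]) >= q):
            length += 1
        longest.append(length)

    # pattern[i : i + longest[i]] is right-maximal by construction.
    # It is left-maximal iff the match starting at i - 1 does not
    # already cover it, i.e. longest[i - 1] <= longest[i].
    mems = []
    for i, length in enumerate(longest):
        if length > 0 and (i == 0 or longest[i - 1] <= length):
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    print(naive_mems("cadabrac", "abracadabra"))       # MEMs (q = 1)
    print(naive_mems("cadabrac", "abracadabra", q=2))  # 2-MEMs
```

With T = "abracadabra" and P = "cadabrac", the MEMs are "cadabra" (start 0) and "abrac" (start 3). Requiring q = 2 occurrences shrinks them to "a" (start 1) and "abra" (start 3), since "c" and "d" each occur only once in T.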