Towards a Definitive Measure of Repetitiveness

10/04/2019
by   Tomasz Kociumaka, et al.

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences, other than the uncomputable Kolmogorov complexity. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate it. Recently, a more principled measure, the size γ of the smallest attractor of a string S[1..n], was introduced. The measure γ lower bounds all the previous relevant ones (e.g., z), yet S can be represented and indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness, it is NP-complete to compute, and it is not known whether S can always be represented in O(γ) space. In this paper we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the concept of compressibility in repetitive strings: we prove that, for some string families, γ = Ω(δ log n). Still, we can build a representation of S of size O(δ log(n/δ)), which supports direct access to any S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ϵ n) for any constant ϵ > 0. Further, such a representation is worst-case optimal because, in some families, S can only be represented in Ω(δ log n) space. We complete our characterization of δ by showing that γ, z, and other measures of repetitiveness are always O(δ log(n/δ)), but in some string families the smallest context-free grammar is of size g = Ω(δ log^2 n / log log n). No such lower bound is known to hold for γ.
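To make the measure concrete: δ is defined via substring complexity, as the maximum over all lengths k of d(k)/k, where d(k) is the number of distinct substrings of length k in S. The sketch below is a naive O(n²)-time illustration of that definition, not the paper's linear-time algorithm (which requires suffix-tree or suffix-automaton machinery); the function name `delta` is our own.

```python
def delta(s: str) -> float:
    """Naive substring-complexity computation of the measure delta.

    delta(S) = max over k in [1..n] of d(k)/k, where d(k) is the
    number of distinct length-k substrings of S. Quadratic time;
    the paper achieves linear time with more sophisticated tools.
    """
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect all distinct k-length substrings (k-mers) of s.
        kmers = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(kmers) / k)
    return best
```

For example, a highly repetitive string such as "aaaa" has δ = 1 (one distinct substring per length), whereas "abab" has δ = 2, driven by its two distinct characters at k = 1.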
