Towards a Definitive Measure of Repetitiveness

10/04/2019
by Tomasz Kociumaka, et al.

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences other than the uncomputable Kolmogorov complexity. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate it. Recently, a more principled measure, the size γ of the smallest attractor of a string S[1..n], was introduced. Measure γ lower bounds all the previous relevant ones (e.g., z), yet S can be represented and indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness, it is NP-complete to compute, and it is not known if S can always be represented in O(γ) space. In this paper we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the concept of compressibility in repetitive strings: we prove that, for some string families, γ = Ω(δ log n). Still, we can build a representation of S of size O(δ log(n/δ)), which supports direct access to any S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ϵ n) for any constant ϵ > 0. Further, this representation is worst-case optimal because, in some families, S can only be represented in Ω(δ log n) space. We complete our characterization of δ by showing that γ, z and other measures of repetitiveness are always O(δ log(n/δ)), but in some string families, the smallest context-free grammar is of size g = Ω(δ log^2 n / log log n). No such lower bound is known to hold for γ.
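
The abstract does not define δ. As a minimal sketch, assuming the definition commonly used for this measure, δ = max over k of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S, δ can be computed by brute force as follows. The function names `substring_complexity` and `delta` are illustrative, and this naive version takes roughly cubic time on long strings; it is not the linear-time algorithm the abstract refers to.

```python
from fractions import Fraction
from typing import List


def substring_complexity(s: str) -> List[int]:
    """Return [d_1, ..., d_n], where d_k counts the distinct length-k substrings of s."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s: str) -> Fraction:
    """Brute-force delta = max_k d_k / k (assumed definition; not linear time)."""
    if not s:
        raise ValueError("delta is only defined here for non-empty strings")
    # Exact rational arithmetic avoids floating-point ties between different k.
    return max(
        Fraction(d, k) for k, d in enumerate(substring_complexity(s), start=1)
    )


if __name__ == "__main__":
    for text in ("aaaa", "abab", "abcd"):
        print(text, substring_complexity(text), delta(text))
```

On a unary string such as "aaaa" every d_k equals 1, so the maximum is attained at k = 1, whereas a string of all-distinct symbols attains δ = n at k = 1.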


