Sensitivity of string compressors and repetitiveness measures

07/19/2021
by Tooru Akagi, et al.

The sensitivity of a string compression algorithm C measures how much the output size C(T) for an input string T can increase when a single character edit operation is performed on T. This notion makes it possible to measure the robustness of compression algorithms against errors and/or dynamic changes in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, defined by max_{T ∈ Σ^n} {C(T')/C(T) : ed(T, T') = 1}, where ed(T, T') denotes the edit distance between T and T'. For the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is only a small constant (2 or 3, depending on the version of Lempel-Ziv 77 and the type of edit operation). We strengthen these upper bounds by presenting matching lower bounds on the worst-case sensitivity for all of these major versions of the Lempel-Ziv 77 factorization. This contrasts with previously known related results: the size z_{78} of the Lempel-Ziv 78 factorization can increase by a factor of Ω(n^{3/4}) [Lagarde and Perifel, 2018], and the number r of runs in the Burrows-Wheeler transform can increase by a factor of Ω(log n) [Giuliani et al., 2021], when a character is prepended to an input string of length n. We also study the worst-case sensitivity of several grammar compression algorithms, including Bisection, AVL-grammar, GCIS, and CDAWG. Further, we extend the notion of worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size γ and the substring complexity δ, and present matching upper and lower bounds on the worst-case multiplicative sensitivity for γ and δ.
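The definition above can be explored by brute force for very small n. The following is a minimal sketch, not the paper's method: it assumes one particular variant (a greedy, self-referencing Lempel-Ziv 77 factorization without trailing characters, counted by the hypothetical helper `lz77_size`), restricts inserted and substituted characters to the given alphabet, and enumerates every string of length n. The names `single_edits` and `worst_case_sensitivity` are likewise illustrative.

```python
from fractions import Fraction
from itertools import product


def lz77_size(t):
    """Count factors of a greedy, self-referencing LZ77 factorization.

    Assumption: each factor is either a fresh character or the longest
    prefix of the remaining suffix that also starts at an earlier position
    (overlap with the current factor is allowed).
    """
    i, z, n = 0, 0, len(t)
    while i < n:
        longest = 0
        for j in range(i):  # every earlier starting position
            k = 0
            while i + k < n and t[j + k] == t[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a fresh character forms a factor of length 1
        z += 1
    return z


def single_edits(t, alphabet):
    """Return all strings at edit distance exactly 1 from t (over alphabet)."""
    out = set()
    for i in range(len(t)):
        out.add(t[:i] + t[i + 1:])  # deletion
        for c in alphabet:
            out.add(t[:i] + c + t[i + 1:])  # substitution
    for i in range(len(t) + 1):
        for c in alphabet:
            out.add(t[:i] + c + t[i:])  # insertion
    out.discard(t)  # a substitution by the same character is not an edit
    return out


def worst_case_sensitivity(n, alphabet):
    """Maximize C(T')/C(T) over all T of length n and all single edits T'."""
    if n < 1:
        raise ValueError("n must be at least 1 so that C(T) is nonzero")
    best = Fraction(0)
    for chars in product(alphabet, repeat=n):
        t = "".join(chars)
        base = lz77_size(t)
        for t2 in single_edits(t, alphabet):
            best = max(best, Fraction(lz77_size(t2), base))
    return best


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, worst_case_sensitivity(n, "ab"))
```

The enumeration is exponential in n, so it is only useful as a sanity check on tiny inputs; it says nothing about the asymptotic bounds proved in the paper.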
