The Many Qualities of a New Directly Accessible Compression Scheme

03/31/2023
by Domenico Cantone, et al.

We present a new variable-length, computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports fast direct access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a simple and flexible representation that can be geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma-\lambda+3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N + \mathcal{O}(n(\lambda - (F_{\sigma+3} - 3)/F_{\sigma+1})) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is designed as a computation-friendly compression scheme, as it has several features that make it very effective in text-processing tasks. For the string-matching problem, which we take as a case study, we show experimentally that the new scheme enables searches that are up to 29 times faster than standard string-matching techniques on plain texts.
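To get a feel for the size of the Fibonacci-based terms in these bounds, the following minimal sketch evaluates the quantities that appear inside the $\mathcal{O}(\cdot)$ expressions, reading the access overhead as $(F_{\sigma-\lambda+3} - 3)/F_{\sigma+1}$ and the per-character space term as $\lambda - (F_{\sigma+3} - 3)/F_{\sigma+1}$. The indexing convention $F_1 = F_2 = 1$, the function names, and the sample values of $\sigma$ and $\lambda$ are assumptions made for illustration; the abstract does not fix them, and the results are not measurements of the scheme.

```python
def fib(j: int) -> int:
    """Return F_j, assuming the convention F_0 = 0, F_1 = F_2 = 1."""
    if j < 0:
        raise ValueError("Fibonacci index must be non-negative")
    a, b = 0, 1
    for _ in range(j):
        a, b = b, a + b
    return a


def overhead_term(sigma: int, lam: int) -> float:
    """Quantity inside the O(.) of the expected access overhead:
    (F_{sigma - lam + 3} - 3) / F_{sigma + 1}."""
    return (fib(sigma - lam + 3) - 3) / fib(sigma + 1)


def space_term(sigma: int, lam: int) -> float:
    """Quantity multiplying n inside the O(.) of the space bound:
    lam - (F_{sigma + 3} - 3) / F_{sigma + 1}."""
    return lam - (fib(sigma + 3) - 3) / fib(sigma + 1)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to illustrate the formulas.
    sigma, lam = 8, 5
    print("overhead term:", overhead_term(sigma, lam))
    print("space term:   ", space_term(sigma, lam))
```

Under the assumed convention, `overhead_term(8, 5)` evaluates $(F_6 - 3)/F_9 = 5/34$. Since $F_{\sigma+3}/F_{\sigma+1}$ approaches $\varphi^2 \approx 2.618$ as $\sigma$ grows, the space term stays bounded for fixed $\lambda$, which is consistent with the stated $N + \mathcal{O}(n)$ total.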
