Optimal alphabet for single text compression

01/13/2022
by Armen E. Allahverdyan, et al.

A text can be viewed via different representations, i.e. as a sequence of letters, n-grams of letters, syllables, words, and phrases. Here we study the optimal noiseless compression of texts using the Huffman code, where the alphabet of encoding coincides with one of those representations. We show that it is necessary to account for the codebook when compressing a single text. Hence, the total compression consists of the optimally compressed text, characterized by the entropy of the alphabet elements, and the codebook, which is text-specific and therefore has to be included for noiseless (de)compression. For texts of Project Gutenberg, the best compression is provided by syllables, i.e. the minimal meaning-expressing elements of the language. If only sufficiently short texts are retained, the optimal alphabet is that of letters or 2-grams of letters, depending on the retained length.
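
The trade-off between encoded text and codebook can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the split into non-overlapping n-grams, and the codebook cost model (8 bits per character of each symbol plus the bits of its codeword) are assumptions made here purely for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a Huffman code over freqs."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def total_bits(text, n):
    """Return (encoded text bits, codebook bits) for an alphabet of n-grams.

    Assumption: the text is cut into non-overlapping n-grams, and each
    codebook entry costs 8 bits per character plus its codeword length.
    """
    symbols = [text[i:i + n] for i in range(0, len(text), n)]
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    text_bits = sum(freqs[s] * lengths[s] for s in freqs)
    codebook_bits = sum(8 * len(s) + lengths[s] for s in freqs)
    return text_bits, codebook_bits


if __name__ == "__main__":
    sample = "aaaabbc"
    for n in (1, 2):
        print(n, total_bits(sample, n))
```

On a toy string like this, the larger alphabet shortens the encoded text but enlarges the codebook, which is the effect that favors letters or 2-grams for short texts in the abstract above. Real measurements would need a syllable or word tokenizer and a more careful codebook encoding.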


