Crossword: A Semantic Approach to Data Compression via Masking

04/03/2023
by Mingxiao Li, et al.

Traditional data compression methods are typically based on symbol-level statistics: the information source is modeled as a long sequence of i.i.d. random variables or as a stochastic process, which establishes entropy as the fundamental limit for lossless compression and mutual information as the limit for lossy compression. However, real-world sources (including text, music, and speech) are often statistically ill-defined because of their close connection to human perception, so the model-driven approach can be quite suboptimal. This study focuses on English text and exploits its semantic aspect to further improve compression efficiency. The main idea stems from the crossword puzzle, where hidden words can still be reconstructed precisely as long as some key letters are provided. The proposed masking-based strategy resembles this game. In a nutshell, the encoder evaluates the semantic importance of each word according to the semantic loss and then masks the less important ones, while the decoder recovers the masked words from the semantic context by means of a Transformer. Our experiments show that the proposed semantic approach achieves much higher compression efficiency than traditional methods such as Huffman coding and UTF-8 encoding, while largely preserving the meaning of the target text.
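
The encoder step described above (score each word, then mask the least important ones) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the semantic loss, so `importance_scores` substitutes a simple length-over-frequency heuristic, and the names `MASK`, `importance_scores`, `mask_words`, and `mask_ratio` are hypothetical. The Transformer-based decoder that fills the masks back in is not shown.

```python
from collections import Counter

MASK = "_"  # hypothetical mask symbol, not taken from the paper


def importance_scores(words):
    """Stand-in importance score: longer and rarer words score higher.

    Assumption: the paper ranks words by a semantic loss; this heuristic
    is only an illustrative substitute for that measure.
    """
    counts = Counter(w.lower() for w in words)
    return [len(w) / counts[w.lower()] for w in words]


def mask_words(text, mask_ratio=0.3):
    """Replace the lowest-scoring fraction of words in `text` with MASK."""
    words = text.split()
    scores = importance_scores(words)
    n_mask = int(len(words) * mask_ratio)
    # Indices ordered from least to most important (ties broken by position).
    order = sorted(range(len(words)), key=lambda i: (scores[i], i))
    to_mask = set(order[:n_mask])
    return " ".join(MASK if i in to_mask else w for i, w in enumerate(words))


if __name__ == "__main__":
    print(mask_words("the cat sat on the mat", mask_ratio=0.5))
    # prints "_ cat sat _ _ mat"
```

With this heuristic the short, repeated function words are masked first, which loosely mirrors the idea that words carrying little meaning can be dropped and later inferred from context.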


