LLMZip: Lossless Text Compression using Large Language Models

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. These estimates are significantly smaller than the currently available estimates in <cit.>, <cit.>. A natural byproduct is an algorithm for lossless compression of English text that combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
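The abstract does not spell out the compression pipeline, so the following is only a minimal sketch of the general idea (an LM predicting the next token, followed by standard lossless coding), not necessarily the authors' exact method. It assumes a Hugging Face-style causal-LM interface; the checkpoint name, window size, and the rank-then-zlib variant shown here are illustrative assumptions. The corresponding entropy upper bound would be estimated, in the usual way, from the average negative log2-probability the model assigns to each true next token.

```python
# Sketch: LLM-assisted lossless compression via token ranks (assumed variant).
# The model name and window size below are illustrative, not the paper's setup.
import zlib
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "huggyllama/llama-7b"   # assumed checkpoint identifier
WINDOW = 512                    # assumed number of past tokens used as context

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def compress(text: str) -> bytes:
    """Replace each token by its rank under the LM's prediction for that
    position, then apply a generic lossless compressor to the rank stream.
    (For exact recovery, the first token and sequence length must also be
    stored; omitted here for brevity.)"""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ranks = []
    with torch.no_grad():
        for i in range(1, len(ids)):
            ctx = ids[max(0, i - WINDOW):i].unsqueeze(0)
            logits = model(ctx).logits[0, -1]            # next-token scores
            order = torch.argsort(logits, descending=True)
            rank = (order == ids[i]).nonzero().item()    # 0 if LM's top guess
            ranks.append(rank)
    # A good predictor yields mostly small ranks (many zeros), which a
    # generic compressor such as zlib encodes very compactly.
    payload = b"".join(r.to_bytes(4, "big") for r in ranks)
    return zlib.compress(payload, 9)
```

Because a strong predictor places the true next token near the top of its ranking most of the time, the rank stream is dominated by small values and compresses far better than the raw text; an arithmetic coder driven directly by the model's probabilities is the natural alternative to the generic compressor used in this sketch.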
