LLMZip: Lossless Text Compression using Large Language Models

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. These estimates are significantly smaller than the currently available estimates in <cit.>, <cit.>. A natural byproduct is an algorithm for lossless compression of English text that combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
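The abstract does not spell out the compression pipeline, so the following is only a minimal sketch of the general idea (an LM predicting the next token, followed by standard lossless coding), not necessarily the authors' exact method. It assumes a Hugging Face-style causal-LM interface; the checkpoint name, window size, and the rank-then-zlib variant shown here are illustrative assumptions. The corresponding entropy upper bound would be estimated, in the usual way, from the average negative log2-probability the model assigns to each true next token.

```python
# Sketch: LLM-assisted lossless compression via token ranks (assumed variant).
# The model name and window size below are illustrative, not the paper's setup.
import zlib
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "huggyllama/llama-7b"   # assumed checkpoint identifier
WINDOW = 512                    # assumed number of past tokens used as context

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def compress(text: str) -> bytes:
    """Replace each token by its rank under the LM's prediction for that
    position, then apply a generic lossless compressor to the rank stream.
    (For exact recovery, the first token and sequence length must also be
    stored; omitted here for brevity.)"""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ranks = []
    with torch.no_grad():
        for i in range(1, len(ids)):
            ctx = ids[max(0, i - WINDOW):i].unsqueeze(0)
            logits = model(ctx).logits[0, -1]            # next-token scores
            order = torch.argsort(logits, descending=True)
            rank = (order == ids[i]).nonzero().item()    # 0 if LM's top guess
            ranks.append(rank)
    # A good predictor yields mostly small ranks (many zeros), which a
    # generic compressor such as zlib encodes very compactly.
    payload = b"".join(r.to_bytes(4, "big") for r in ranks)
    return zlib.compress(payload, 9)
```

Because a strong predictor places the true next token near the top of its ranking most of the time, the rank stream is dominated by small values and compresses far better than the raw text; an arithmetic coder driven directly by the model's probabilities is the natural alternative to the generic compressor used in this sketch.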
