Run-Length Encoding in a Finite Universe

09/15/2019
by N. Jesper Larsson, et al.

Text compression schemes and compact data structures usually combine sophisticated probability models with basic coding methods whose average codeword lengths closely match the entropy of known distributions. In the frequent case where basic coding represents run-lengths of outcomes that have probability p, i.e., the geometric distribution Pr(i) = p^i(1-p), a Golomb code is an optimal instantaneous code. It has the additional advantage that codewords can be computed using only an integer parameter calculated from p, without the need for a large or sophisticated data structure. Golomb coding does not, however, gracefully handle the case where run-lengths are bounded by a known integer n: the codewords allocated for the case i > n are wasted. While negligible for large n, this waste makes Golomb coding unattractive in situations where n is recurrently small, e.g., when representing many short lists of integers drawn from limited ranges, or when the range of n is narrowed down by a recursive algorithm. We address the problem of choosing a code for this case, considering efficiency from both information-theoretic and computational perspectives, and arrive at a simple code that allows computing a codeword using only O(1) simple computer operations and O(1) machine words. We demonstrate experimentally that the resulting representation length is very close (equal in a majority of tested cases) to that of the optimal Huffman code, to the extent that the expected difference is practically negligible. We describe an efficient branch-free implementation of encoding and decoding.
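To make the wasted-codeword argument concrete, the following is a minimal sketch of ordinary Golomb coding, not the bounded code the abstract proposes. The function names (`golomb_parameter`, `golomb_encode`, `kraft_sum`) are hypothetical and chosen for illustration. The parameter rule used is the standard one for geometric sources: the smallest m with p^m + p^(m+1) <= 1. The Kraft sum over the codewords for 0..n shows how much of the code space remains unused when run-lengths cannot exceed n.

```python
def golomb_parameter(p):
    """Smallest m with p**m + p**(m+1) <= 1 (standard rule; requires 0 < p < 1)."""
    m = 1
    while p ** m + p ** (m + 1) > 1:
        m += 1
    return m


def _bits(value, width):
    """Fixed-width binary string; empty when width is not positive."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(i, m):
    """Golomb codeword for run-length i >= 0 with parameter m >= 1, as a bit string."""
    q, r = divmod(i, m)
    b = (m - 1).bit_length()       # ceil(log2(m))
    cutoff = (1 << b) - m          # remainders below cutoff use b - 1 bits
    unary = "1" * q + "0"          # quotient in unary, terminated by a 0
    if r < cutoff:
        return unary + _bits(r, b - 1)
    return unary + _bits(r + cutoff, b)


def kraft_sum(n, m):
    """Sum of 2**-length over the codewords for run-lengths 0..n."""
    return sum(2.0 ** -len(golomb_encode(i, m)) for i in range(n + 1))


if __name__ == "__main__":
    p = 0.9                        # illustrative value, not taken from the paper
    m = golomb_parameter(p)
    for n in (3, 15, 63):
        print(n, m, 1.0 - kraft_sum(n, m))   # fraction of code space left unused
```

The unused fraction printed for each n shrinks as n grows, which matches the abstract's remark that the waste is negligible for large n but matters when n is small.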


