Compressing Multisets with Large Alphabets

07/15/2021 ∙ by Daniel Severo, et al. ∙ Facebook, UCL

Current methods that optimally compress multisets are not suitable for high-dimensional symbols, as their compute time scales linearly with alphabet size. Compressing a multiset as an ordered sequence with off-the-shelf codecs is computationally more efficient, but has a sub-optimal compression rate, as bits are wasted encoding the order between symbols. We present a method that can recover those bits, assuming symbols are i.i.d., at the cost of an additional 𝒪(|ℳ|log M) in average time complexity, where |ℳ| and M are the total and unique number of symbols in the multiset. Our method is compatible with any prefix-free code. Experiments show that, when paired with efficient coders, our method can efficiently compress high-dimensional sources such as multisets of images and collections of JSON files.



Code: official code accompanying the arXiv paper Compressing Multisets with Large Alphabets.

Appendix A Asymmetric numeral systems

The multiset compression method in this paper depends on asymmetric numeral systems (ANS) [4], a last-in-first-out, or ‘stack-like’, sequential codec. This appendix provides slightly more detail on ANS; in particular, we define the encoding and decoding functions and sketch a proof of optimality. Note that compression of sequences is also solved by arithmetic coding (AC) [23], but AC is first-in-first-out, or ‘queue-like’. We give only a brief overview of ANS here, as it is a critical component of our method; for more detail, see [4, 21].

Similarly to how an ‘exact’ version of AC, using infinite-precision arithmetic on rational numbers in the interval $[0, 1)$, is often used to describe that method at a high level, we will describe ANS at a high level using arithmetic on arbitrarily large natural numbers. In practice the exact version described here would be very slow, and a ‘renormalization’ technique is used to keep the coder’s state within a fixed, finite interval (equivalently, one can think of the state as unbounded, with the encoder and decoder operating only on its highest-order bits). For the sake of brevity we do not describe renormalization here, but details can be found in [4, 21].

ANS encoding works by moving information into a single, large natural number, which we will denote $s$, in a reversible way so that the data and the value of $s$ before encoding can be easily recovered. To encode a symbol $x \in \mathcal{A}$, ANS requires access to a probability distribution over the alphabet $\mathcal{A}$, specified by a quantized cumulative distribution function (CDF) and probability mass function (PMF). In particular, encoding requires a precision parameter $r$ and access to what we refer to as a forward lookup function:

$$\mathrm{forward\_lookup} \colon x \mapsto (c_x, p_x), \tag{15}$$

where $c_x$ and $p_x$ are integers in $\{0, 1, \ldots, 2^r\}$, with $\sum_{y \in \mathcal{A}} p_y = 2^r$, such that

$$P(x) = \frac{p_x}{2^r} \tag{16}$$

$$c_x = \sum_{y < x} p_y. \tag{17}$$

With these quantities in hand, the encoding function, which we denote $\mathrm{encode}$, is

$$\mathrm{encode}(s, x) = 2^r \left\lfloor s / p_x \right\rfloor + c_x + (s \bmod p_x), \tag{18}$$

where $\lfloor \,\cdot\, \rfloor$ denotes integer division, discarding any remainder.
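For concreteness, the encoding step can be sketched in a few lines of Python, using exact big-integer arithmetic with renormalization omitted. The alphabet, precision, and quantized probabilities below are invented toy values for illustration, not from the paper.

```python
# Toy quantized distribution: probabilities are multiples of 1/2**r
# and the p_x sum to 2**r. (Illustrative values only.)
r = 8
pmf = {'a': 128, 'b': 64, 'c': 64}  # p_x values; P(x) = p_x / 2**r

def forward_lookup(x):
    """Return (c_x, p_x): quantized CDF and PMF of symbol x."""
    c = 0
    for y in sorted(pmf):  # c_x is the sum of p_y over symbols y < x
        if y == x:
            return c, pmf[y]
        c += pmf[y]
    raise KeyError(x)

def encode(s, x):
    """Push symbol x onto the big-integer state s (cf. eq. 18)."""
    c, p = forward_lookup(x)
    return 2**r * (s // p) + c + (s % p)

# The state grows by roughly log2(1/P(x)) bits per encoded symbol.
s = 1
for x in 'abcabc':
    s = encode(s, x)
```

Note that with the state this small, the approximation in the text is loose; it becomes accurate once $s$ is large relative to $2^r$.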

Observe that if $s' = \mathrm{encode}(s, x)$, then using the fact that $s \bmod p_x < p_x$ (by the definition of $\bmod$), and $c_x + p_x \le 2^r$ (by the definition of $c_x$ and $p_x$), we have

$$2^r \left\lfloor s / p_x \right\rfloor \;\le\; s' \;<\; 2^r \left( \left\lfloor s / p_x \right\rfloor + 1 \right), \tag{19}$$

which, together with eq. 16, implies that

$$\frac{s}{P(x)} - 2^r \;<\; s' \;<\; \frac{s}{P(x)} + 2^r. \tag{20}$$

Thus for large $s$ we have an accurate approximation

$$\log_2 s' \approx \log_2 s + \log_2 \frac{1}{P(x)}. \tag{21}$$
[21] gives more detail and shows that in a standard ANS implementation, renormalization ensures that the inaccuracy of eq. 21 is bounded by $2.2 \times 10^{-5}$ bits per operation, which is equivalent to one bit of redundancy for every 45,000 operations. As well as this small ‘per-symbol’ redundancy, in practical ANS implementations there are also ‘one-time’ redundancies incurred when initializing and terminating encoding. The one-time overhead is usually bounded by 16, 32 or 64 bits, depending on the implementation [21, 4].

To see that $x$ can be recovered losslessly from the ANS state $s' = \mathrm{encode}(s, x)$, observe that

$$s' \bmod 2^r = c_x + (s \bmod p_x) \;\in\; [c_x, c_x + p_x). \tag{22}$$

That means that $x$, $c_x$ and $p_x$ can all be recovered from $s'$ by binary search over the intervals $[c_y, c_y + p_y)$ for $y \in \mathcal{A}$, for the interval containing $s' \bmod 2^r$. In the worst case, this search is $\mathcal{O}(\log |\mathcal{A}|)$, although in some cases a search can be avoided, by mathematically computing the required interval and symbol. Whether implemented using search or otherwise, we refer to the function which recovers $x$, $c_x$ and $p_x$ as the reverse lookup function:

$$\mathrm{reverse\_lookup} \colon s' \bmod 2^r \mapsto (x, c_x, p_x). \tag{23}$$
Having recovered $c_x$ and $p_x$, eq. 18 can be solved for $s$:

$$s = p_x \left\lfloor s' / 2^r \right\rfloor + (s' \bmod 2^r) - c_x. \tag{24}$$

Thus $\mathrm{encode}$ has a well-defined, computationally straightforward inverse, which we denote $\mathrm{decode}$, with

$$\mathrm{decode}(\mathrm{encode}(s, x)) = (s, x). \tag{25}$$
It is possible to run the $\mathrm{decode}$ function with a different distribution to that which was used for the most recent $\mathrm{encode}$. Instead of recovering the last encoded symbol, $\mathrm{decode}$ then generates a sample from the new distribution, in a way which is reversible, and efficient in the sense that it consumes a number of random bits close to the information content of the generated sample. This idea is central to the multiset compression method which we describe.
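This sampling behaviour can be sketched as follows. The coin distribution $Q$, the precision, and the starting state are invented toy values; the point is that decoding under $Q$ yields a sample from $Q$ while shrinking the state by roughly $\log_2(1/Q(\text{sample}))$ bits.

```python
import math

r = 8
q_pmf = {'heads': 192, 'tails': 64}    # Q(heads) = 3/4, Q(tails) = 1/4
q_cdf = {'heads': 0, 'tails': 192}

def decode_with(s_prime, pmf, cdf):
    """Pop one symbol under (pmf, cdf), returning (smaller state, symbol)."""
    s_bar = s_prime % 2**r
    for x, c in cdf.items():
        if c <= s_bar < c + pmf[x]:
            return pmf[x] * (s_prime // 2**r) + s_bar - c, x
    raise ValueError(s_bar)

s = 10**12  # a state holding ~40 bits of previously encoded information
s_new, sample = decode_with(s, q_pmf, q_cdf)
bits_consumed = math.log2(s) - math.log2(s_new)
# bits_consumed is close to log2(1/Q(sample)), and the step is reversible:
# re-encoding `sample` under Q restores s exactly.
```

Re-encoding `sample` with the same quantized $Q$ via eq. 18 returns the state to exactly $10^{12}$, which is what makes the sampling step invertible.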

More practical details of ANS, including other variants, can be found in [4, 21]. There is also a ‘vectorized’ version, described in [8].

Appendix B Initial bits

This section provides detail on the initial bits overhead of our method.

At encoding step $n$, a symbol $x_n$ is sampled without replacement from the multiset, and then encoded using approximately $\log_2 (1 / P(x_n))$ bits. Sampling is done with ANS, as discussed in the method section. Encoding can be done with any coder compatible with ANS, including ANS itself, as long as both sampling and encoding are done on the same integer state.

Sampling decreases the integer state, while encoding increases it. Hence, the number of bits required to represent the integer state changes at step $n$ by approximately

$$\Delta_n = \log_2 \frac{1}{P(x_n)} - \log_2 \frac{|\mathcal{M}_n|}{\mathcal{M}_n(x_n)},$$

where $\mathcal{M}_n$ is the multiset of symbols remaining at step $n$ (so $\mathcal{M}_1 = \mathcal{M}$), and $\mathcal{M}_n(x_n)$ is the multiplicity of $x_n$ in $\mathcal{M}_n$. Naively calculated, the final integer state would need approximately

$$\sum_{n=1}^{|\mathcal{M}|} \Delta_n = \sum_{n=1}^{|\mathcal{M}|} \log_2 \frac{1}{P(x_n)} - \log_2 \frac{|\mathcal{M}|!}{\prod_x \mathcal{M}(x)!} = \log_2 \frac{1}{P(\mathcal{M})}$$

bits, which is exactly the information content of the multiset.
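This accounting can be checked numerically. The i.i.d. model $P$ and the multiset below are invented toy values; summing the per-step changes $\Delta_n$ along one sampling order reproduces the multiset's information content $\log_2 (1/P(\mathcal{M}))$.

```python
import math
from collections import Counter

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}         # i.i.d. symbol distribution
multiset = Counter({'a': 3, 'b': 2, 'c': 1})
total = sum(multiset.values())               # |M| = 6

# Information content of the multiset: bits for the ordered sequence,
# minus the log-count of distinct orderings, log2(|M|! / prod_x M(x)!).
seq_bits = sum(m * math.log2(1 / P[x]) for x, m in multiset.items())
order_bits = math.log2(
    math.factorial(total)
    / math.prod(math.factorial(m) for m in multiset.values())
)
multiset_bits = seq_bits - order_bits

# Per-step accounting along one particular sampling order.
delta_sum = 0.0
remaining = Counter(multiset)
for x in ['a', 'b', 'a', 'c', 'b', 'a']:
    n_left = sum(remaining.values())         # |M_n|
    delta_sum += math.log2(1 / P[x]) - math.log2(n_left / remaining[x])
    remaining[x] -= 1

assert abs(delta_sum - multiset_bits) < 1e-9
```

Any sampling order gives the same sum, since the per-step savings telescope to $\log_2 (|\mathcal{M}|! / \prod_x \mathcal{M}(x)!)$.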

However, if the ANS state reaches zero at any step $n$, it must be artificially increased to allow for sampling, implying the savings at step $n$ can be less than $\log_2 (|\mathcal{M}_n| / \mathcal{M}_n(x_n))$. Just as the final sequence is completely determined by the initial ANS state, so is the final ANS state, and hence the message length. In the worst case, unlikely symbols will be sampled from the multiset early on (i.e. for small $n$), wasting potential savings. In our experiments, this ‘depletion’ of the state did not seem to occur. There are at least two theoretical results that corroborate these findings.

First, for a fixed multiset we can easily show that the expected change in message length at step $n$ is always non-negative:

$$\mathbb{E}\left[ \Delta_n \mid \mathcal{M}_n \right] = \sum_x q_n(x) \log_2 \frac{q_n(x)}{P(x)} = D_{\mathrm{KL}}\!\left( q_n \,\|\, P \right) \;\ge\; 0, \tag{31}$$

where $q_n$ is the marginal distribution of the symbol sampled at step $n$, i.e. $q_n(x) = \mathcal{M}_n(x) / |\mathcal{M}_n|$.
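A small numeric check of this identity, with toy values for $P$ and for the empirical distribution of the remaining multiset (both invented for illustration):

```python
import math

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # i.i.d. model distribution
q = {'a': 4 / 6, 'b': 1 / 6, 'c': 1 / 6}  # empirical dist. of remaining symbols

# E[Delta_n] = sum_x q(x) * (log2(1/P(x)) - log2(1/q(x)))
expected_delta = sum(
    q[x] * (math.log2(1 / P[x]) - math.log2(1 / q[x])) for x in q
)
kl = sum(q[x] * math.log2(q[x] / P[x]) for x in q)

assert abs(expected_delta - kl) < 1e-12
assert kl >= 0  # KL divergence is non-negative
```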

Second, since $x_1, x_2, \ldots, x_{|\mathcal{M}|}$ is an exchangeable process, it is also stationary, therefore $\mathbb{E}[\Delta_n] = \mathbb{E}[\Delta_1]$ for all $n$, as well as $\mathbb{E}[\log_2 (1 / q_n(x_n))] = \mathbb{E}[\log_2 (1 / q_1(x_1))]$. Finally, eq. 31 and stationarity, together with

$$\sum_{n=1}^{|\mathcal{M}|} \mathbb{E}[\Delta_n] = \mathbb{E}\!\left[ \log_2 \frac{1}{P(\mathcal{M})} \right],$$

imply that

$$\mathbb{E}[\Delta_n] = \frac{1}{|\mathcal{M}|} \, \mathbb{E}\!\left[ \log_2 \frac{1}{P(\mathcal{M})} \right] \;\ge\; 0.$$
This suggests that the only time the state is likely to be empty is at the very beginning of encoding, i.e. the initial bits overhead is a one-time overhead.