Bit-Swap: Recursive Bits-Back Coding for Lossless Compression with Hierarchical Latent Variables

05/16/2019 ∙ by Friso H. Kingma, et al. ∙ 11

The bits-back argument suggests that latent variable models can be turned into lossless compression schemes. Translating the bits-back argument into efficient and practical lossless compression schemes for general latent variable models, however, is still an open problem. Bits-Back with Asymmetric Numeral Systems (BB-ANS), recently proposed by Townsend et al. (2019), makes bits-back coding practically feasible for latent variable models with one latent layer, but it is inefficient for hierarchical latent variable models. In this paper we propose Bit-Swap, a new compression scheme that generalizes BB-ANS and achieves strictly better compression rates for hierarchical latent variable models with Markov chain structure. Through experiments we verify that Bit-Swap results in lossless compression rates that are empirically superior to existing techniques. Our implementation is available at https://github.com/fhkingma/bitswap.

1 Introduction

Likelihood-based generative models, that is, models of joint probability distributions trained by maximum likelihood, have recently achieved large advances in density estimation performance on complex, high-dimensional data. Variational autoencoders (Kingma & Welling, 2013; Kingma et al., 2016), PixelRNN and PixelCNN and their variants (Oord et al., 2016; van den Oord et al., 2016b; Salimans et al., 2017; Parmar et al., 2018; Chen et al., 2018), and flow-based models like RealNVP (Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018) can successfully model high dimensional image, video, speech, and text data (Karras et al., 2018; Kalchbrenner et al., 2017; van den Oord et al., 2016a; Kalchbrenner et al., 2016, 2018; Vaswani et al., 2017).

The excellent density estimation performance of modern likelihood-based models suggests another application: lossless compression. Any probability distribution can in theory be converted into a lossless code, in which each datapoint is encoded into a number of bits equal to its negative log probability assigned by the model. Since the best expected codelength is achieved when the model matches the true data distribution, designing a good lossless compression algorithm is a matter of jointly solving two problems:

  1. Approximating the true data distribution $p_{\text{data}}(\mathbf{x})$ as well as possible with a model $p_\theta(\mathbf{x})$,

  2. Developing a practical compression algorithm, called an entropy coding scheme, that is compatible with this model and results in codelengths equal to $-\log p_\theta(\mathbf{x})$.

Figure 1: Schematic overview of lossless compression. The sender encodes data $\mathbf{x}$ to a code with the least amount of bits possible without losing information. The receiver decodes the code and must be able to exactly reconstruct $\mathbf{x}$.

| Compression Scheme | Rate |
|---|---|
| Uncompressed | 8.00 |
| GNU Gzip | 5.96 |
| bzip2 | 5.07 |
| LZMA | 5.09 |
| PNG | 4.71 |
| WebP | 3.66 |
| BB-ANS | 3.62 |
| Bit-Swap (ours) | 3.51 |

Table 1: Lossless compression rates (in bits per dimension) on unscaled and cropped ImageNet of Bit-Swap against other compression schemes. See Section 5 for an explanation of Bit-Swap and Section 6 for detailed results.

Unfortunately, it is generally difficult to jointly design a likelihood-based model and an entropy coding scheme that together achieve a good compression rate while remaining computationally efficient enough for practical use. Any model with tractable density evaluation can be converted into a code using Huffman coding, but building a Huffman tree requires resources that scale exponentially with the dimension of the data. The situation is more tractable, but still practically too inefficient, when autoregressive models are paired with arithmetic coding or asymmetric numeral systems (explained in Section 2.1). The compression rate will be excellent due to the effectiveness of autoregressive models in density estimation, but the resulting decompression process, which is essentially identical to the sample generation process, will be extremely slow.

Fortunately, fast compression and decompression can be achieved by pairing variational autoencoders with a recently proposed practically efficient coding method called Bits-Back with Asymmetric Numeral Systems (BB-ANS) (Townsend et al., 2019). However, the practical efficiency of BB-ANS rests on two requirements:

  1. All involved inference and recognition networks are fully factorized probability distributions

  2. There are few latent layers in the variational autoencoder.

The first requirement ensures that encoding and decoding are always fast and parallelizable. The second, as we will discuss later, ensures that BB-ANS achieves an efficient bitrate: it turns out that BB-ANS incurs an overhead that grows with the number of latent variables. But these requirements restrict the capacity of the variational autoencoder and pose difficulties for density estimation performance, and hence the resulting compression rate suffers.

To work toward designing a computationally efficient compression algorithm with a good compression rate, we propose Bit-Swap, which improves BB-ANS’s performance on hierarchical latent variable models with Markov chain structure. Compared to latent variable models with only one latent layer, these hierarchical latent variable models allow us to achieve better density estimation performance on complex high-dimensional distributions. Meanwhile, Bit-Swap, as we show theoretically and empirically, yields a better compression rate compared to BB-ANS on these models due to reduced overhead.

2 Background

First, we will set the stage by introducing the lossless compression problem. Let $p_{\text{data}}$ be a distribution over discrete data $\mathbf{x} = (x_1, \ldots, x_D)$. Each component $x_i$ of $\mathbf{x}$ is called a symbol. Suppose that a sender would like to communicate a sample $\mathbf{x} \sim p_{\text{data}}$ to a receiver through a code, or equivalently a message. The goal of lossless compression is to send this message using the minimum amount of bits on average over $p_{\text{data}}$, while ensuring that $\mathbf{x}$ is always fully recoverable by the receiver. See Figure 1 for an illustration.

Entropy coding schemes use a probabilistic model $p_\theta(\mathbf{x})$ to define a code with codelengths $-\log p_\theta(\mathbf{x})$. If $p_\theta(\mathbf{x})$ matches $p_{\text{data}}(\mathbf{x})$ well, then the resulting average codelength $\mathbb{E}_{p_{\text{data}}}[-\log p_\theta(\mathbf{x})]$ will be close to the entropy of the data $H(p_{\text{data}})$, which is the average codelength of an optimal compression scheme.
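As a small numerical illustration of this relationship, the sketch below compares the entropy of a source with the average codelength obtained under a mismatched model. The four-symbol distributions and all function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def entropy_bits(p_data):
    """Entropy of p_data in bits: the average codelength of an optimal code."""
    return -sum(p * math.log2(p) for p in p_data if p > 0)

def expected_codelength_bits(p_data, p_model):
    """Average of -log2 p_model(x) under p_data (the cross-entropy)."""
    return -sum(p * math.log2(q) for p, q in zip(p_data, p_model) if p > 0)

# Hypothetical four-symbol source and a mismatched uniform model.
p_data = [0.5, 0.25, 0.125, 0.125]
p_model = [0.25, 0.25, 0.25, 0.25]

optimal = entropy_bits(p_data)                          # entropy of the source
mismatched = expected_codelength_bits(p_data, p_model)  # cost of the wrong model
matched = expected_codelength_bits(p_data, p_data)      # cost of a perfect model

print(optimal, mismatched, matched)
```

With these numbers the mismatched model costs a quarter of a bit per symbol more than the entropy, while a perfectly matched model attains the entropy.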

2.1 Asymmetric Numeral Systems

We will employ a particular entropy coding scheme called Asymmetric Numeral Systems (ANS) (Duda et al., 2015). Given a univariate probability distribution $p(x)$, ANS encodes a symbol $x$ into a sequence of bits, or bitstream, of length approximately $-\log p(x)$ bits. ANS can also code a vector of symbols $\mathbf{x}$ using a fully factorized probability distribution $p(\mathbf{x}) = \prod_i p(x_i)$, resulting in $-\log p(\mathbf{x}) = \sum_i -\log p(x_i)$ bits. (It also works with autoregressive $p(\mathbf{x})$, but throughout this work we will only use fully factorized models for parallelizability purposes.)

ANS has an important property: if a sequence of symbols is encoded, then they must be decoded in the opposite order. In other words, the state of the ANS algorithm is a bitstream with a stack structure. Every time a symbol is encoded, bits are pushed on to the right of the stack; every time a symbol is decoded, bits are popped from the right of the stack. See Figure 2 for an illustration. This property will become important when ANS is used in BB-ANS and Bit-Swap for coding with latent variable models.

Figure 2: Asymmetric Numeral Systems (ANS) operates on a bitstream in a stack-like manner. Symbols are decoded in the opposite order to that in which they were encoded.

2.2 Latent Variable Models

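The codelength of an entropy coding technique depends on how well its underlying model $p_\theta(\mathbf{x})$ approximates the true data distribution $p_{\text{data}}(\mathbf{x})$. In this paper, we focus on latent variable models, which approximate $p_{\text{data}}(\mathbf{x})$ with a marginal distribution $p_\theta(\mathbf{x})$ defined by

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}, \mathbf{z})\, d\mathbf{z} = \int p_\theta(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} \tag{1}$$

where $\mathbf{z}$ is an unobserved latent variable. For continuous $\mathbf{z}$, $p_\theta(\mathbf{x})$ can be seen as an infinite mixture, which makes such an implicit distribution over $\mathbf{x}$ potentially highly flexible.

Since exactly evaluating and optimizing the marginal likelihood $p_\theta(\mathbf{x})$ is intractable, variational autoencoders introduce an inference model $q_\theta(\mathbf{z}|\mathbf{x})$, which approximates the model posterior $p_\theta(\mathbf{z}|\mathbf{x})$. For any choice of $q_\theta(\mathbf{z}|\mathbf{x})$, we can rewrite the marginal likelihood as follows:

$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\theta(\mathbf{z}|\mathbf{x})} \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\theta(\mathbf{z}|\mathbf{x})} + \mathbb{E}_{q_\theta(\mathbf{z}|\mathbf{x})} \log \frac{q_\theta(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})} = \mathcal{L}(\theta) + D_{\mathrm{KL}}\big(q_\theta(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big) \tag{2}$$

As $D_{\mathrm{KL}}\big(q_\theta(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big) \geq 0$, the inference model and generative model can be found by jointly optimizing the Evidence Lower BOund (ELBO), a lower bound on $\log p_\theta(\mathbf{x})$:

$$\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(\mathbf{z}|\mathbf{x})} \big[ \log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\theta(\mathbf{z}|\mathbf{x}) \big] \tag{3}$$

For continuous $\mathbf{z}$ and a differentiable inference model and generative model, the ELBO can be optimized using the reparameterization trick (Kingma & Welling, 2013).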
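The decomposition of the log marginal likelihood into the ELBO plus a KL term can be checked numerically on a model small enough to enumerate. The sketch below uses a hypothetical toy model with one binary latent and one binary observation; the probability tables and the choice of inference distribution are assumptions made purely for illustration.

```python
import math

# Hypothetical toy model: binary latent z, binary observation x.
P_Z = [0.5, 0.5]                         # prior p(z)
P_X_GIVEN_Z = [[0.9, 0.1], [0.2, 0.8]]   # rows index z, columns index x

def log_marginal(x):
    """log2 p(x), obtained by summing the joint over z."""
    return math.log2(sum(P_Z[z] * P_X_GIVEN_Z[z][x] for z in range(2)))

def posterior(x):
    """Exact model posterior p(z|x)."""
    joint = [P_Z[z] * P_X_GIVEN_Z[z][x] for z in range(2)]
    total = sum(joint)
    return [j / total for j in joint]

def elbo(x, q):
    """E_q[log2 p(x, z) - log2 q(z)] for an inference distribution q over z."""
    return sum(
        q[z] * (math.log2(P_Z[z] * P_X_GIVEN_Z[z][x]) - math.log2(q[z]))
        for z in range(2) if q[z] > 0
    )

def kl_bits(q, p):
    """KL divergence D_KL(q || p) in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

x = 0
q = [0.6, 0.4]  # an arbitrary, imperfect inference distribution
gap = kl_bits(q, posterior(x))

print(log_marginal(x), elbo(x, q), gap)
```

The ELBO falls short of the log marginal likelihood by exactly the KL term, and the gap closes when the inference distribution equals the true posterior.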

2.3 Bits-Back Coding with ANS

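It is not straightforward to use a latent variable model for compression, but it is possible with the help of the inference network $q_\theta(\mathbf{z}|\mathbf{x})$. Assume that both the sender and receiver have access to $p_\theta(\mathbf{x}|\mathbf{z})$, $p(\mathbf{z})$, $q_\theta(\mathbf{z}|\mathbf{x})$ and an entropy coding scheme. Let $\mathbf{x}$ be the datapoint the sender wishes to communicate. The sender can send a latent sample $\mathbf{z} \sim q_\theta(\mathbf{z}|\mathbf{x})$ by coding using the prior $p(\mathbf{z})$, along with $\mathbf{x}$, coded with $p_\theta(\mathbf{x}|\mathbf{z})$. This scheme is clearly valid and allows the receiver to recover $\mathbf{x}$, but results in an inefficient total codelength of $\mathbb{E}[-\log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})]$. Wallace (1990) and Hinton & Van Camp (1993) show in a thought experiment, called the bits-back argument, that it is possible to instead transmit $\mathbb{E}[-\log q_\theta(\mathbf{z}|\mathbf{x})]$ fewer bits in a certain sense, thereby yielding a better net codelength equal to the negative ELBO $-\mathcal{L}(\theta)$ of the latent variable model.

BB-ANS (Townsend et al., 2019), illustrated in Figure 3, makes the bits-back argument concrete. BB-ANS operates by starting with ANS initialized with a bitstream of $N_{\text{init}}$ random bits. Then, to encode $\mathbf{x}$, BB-ANS performs the following steps:

  1. Decode $\mathbf{z}$ from bitstream using $q_\theta(\mathbf{z}|\mathbf{x})$, subtracting $-\log q_\theta(\mathbf{z}|\mathbf{x})$ bits from the bitstream,

  2. Encode $\mathbf{x}$ to bitstream using $p_\theta(\mathbf{x}|\mathbf{z})$, adding $-\log p_\theta(\mathbf{x}|\mathbf{z})$ bits to the bitstream,

  3. Encode $\mathbf{z}$ to bitstream using $p(\mathbf{z})$, adding $-\log p(\mathbf{z})$ bits to the bitstream.

The resulting bitstream, which has a length of $N_{\text{total}} := N_{\text{init}} + \log q_\theta(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})$ bits, is then sent to the receiver.

Figure 3: Bits-Back with Asymmetric Numeral Systems (BB-ANS).

The receiver decodes the data by initializing ANS to the received bitstream, then proceeds in reverse order, with the encode and decode operations swapped: the receiver decodes $\mathbf{z}$ using $p(\mathbf{z})$, decodes $\mathbf{x}$ using $p_\theta(\mathbf{x}|\mathbf{z})$, then encodes $\mathbf{z}$ using $q_\theta(\mathbf{z}|\mathbf{x})$. The final step of encoding $\mathbf{z}$ will recover the bits that the encoder used to initialize ANS. Thus, the sender will have successfully transmitted $\mathbf{x}$ to the receiver, along with the initial $N_{\text{init}}$ bits, and it will have taken $N_{\text{total}}$ bits to do so.

To summarize, it takes $N_{\text{total}}$ bits to transmit $\mathbf{x}$ plus $N_{\text{init}}$ bits. In this sense, the net number of bits sent regarding $\mathbf{x}$ only, ignoring the initial $N_{\text{init}}$ bits, is

$$N_{\text{total}} - N_{\text{init}} = \log q_\theta(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}),$$

which is on average equal to $-\mathcal{L}(\theta)$, the negative ELBO.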
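The three sender steps and the mirrored receiver steps can be sketched with a toy ANS coder that stores the bitstream as a single arbitrary-precision integer. Everything below is an illustrative assumption rather than the authors' implementation: the integer frequency tables stand in for the prior, the likelihood and the inference model of a tiny discrete model, and `push`, `pop`, `bbans_encode` and `bbans_decode` are hypothetical helper names.

```python
import math
import random

# Hypothetical integer frequency tables; every row sums to 16.
P_Z = [4, 4, 4, 4]                                                    # prior p(z)
P_X_GIVEN_Z = [[10, 2, 2, 2], [2, 10, 2, 2], [2, 2, 10, 2], [2, 2, 2, 10]]
Q_Z_GIVEN_X = [[13, 1, 1, 1], [1, 13, 1, 1], [1, 1, 13, 1], [1, 1, 1, 13]]

def push(state, sym, freqs):
    """ANS encode: push `sym` onto the integer state."""
    total, start, f = sum(freqs), sum(freqs[:sym]), freqs[sym]
    return total * (state // f) + start + state % f

def pop(state, freqs):
    """ANS decode: pop the most recently pushed symbol off the state."""
    total = sum(freqs)
    r = state % total
    sym, start = 0, 0
    while start + freqs[sym] <= r:
        start += freqs[sym]
        sym += 1
    return sym, freqs[sym] * (state // total) + r - start

def bits(freqs, sym):
    """Ideal codelength -log2 of the probability of `sym`."""
    return -math.log2(freqs[sym] / sum(freqs))

def bbans_encode(state, x):
    z, state = pop(state, Q_Z_GIVEN_X[x])      # 1. decode z with q(z|x)
    state = push(state, x, P_X_GIVEN_Z[z])     # 2. encode x with p(x|z)
    state = push(state, z, P_Z)                # 3. encode z with p(z)
    net = bits(P_X_GIVEN_Z[z], x) + bits(P_Z, z) - bits(Q_Z_GIVEN_X[x], z)
    return state, net

def bbans_decode(state):
    z, state = pop(state, P_Z)                 # decode z with p(z)
    x, state = pop(state, P_X_GIVEN_Z[z])      # decode x with p(x|z)
    state = push(state, z, Q_Z_GIVEN_X[x])     # encode z with q(z|x): bits back
    return x, state

# Initial random bits, with the top bit set so the state is at least 2**63.
initial_state = random.Random(0).getrandbits(64) | (1 << 63)
data = [0, 3, 1, 1, 2, 0]

state, ideal_net_bits = initial_state, 0.0
for x in data:
    state, net = bbans_encode(state, x)
    ideal_net_bits += net
actual_net_bits = math.log2(state) - math.log2(initial_state)

decoded = []
for _ in data:
    x, state = bbans_decode(state)
    decoded.append(x)
decoded.reverse()  # ANS is a stack, so datapoints come out last-in, first-out

print(decoded, actual_net_bits, ideal_net_bits)
```

After decoding, the receiver holds both the data and the original initial state, and the growth of the state in bits tracks the sum of the per-datapoint net codelengths.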

3 Initial Bits in Bits-Back Coding

We now turn to the core issue that our work addresses: the amount of initial bits required for BB-ANS to function properly.

It is crucial for there to be enough initial bits in the ANS state for the sender to decode $\mathbf{z}$ from the initial bitstream. That is, writing $N_{\text{init}}$ for the number of initial bits, we must have

$$N_{\text{init}} \geq -\log q_\theta(\mathbf{z}|\mathbf{x}) \tag{4}$$

in order to guarantee that the receiver can recover the initial bits. If not, then to sample $\mathbf{z}$, the sender must draw bits from an auxiliary random source, and those bits will certainly not be recoverable by the receiver. And, if those bits are not recoverable, then the sender will have spent the full bitstream length $N_{\text{total}}$ to transmit $\mathbf{x}$ only, without $N_{\text{init}}$ bits in addition. So, we must commit to sending at least $N_{\text{init}} \geq -\log q_\theta(\mathbf{z}|\mathbf{x})$ initial bits to guarantee a short net codelength for $\mathbf{x}$.

Unfortunately, the initial number of bits required can be significant. As an example, if the latent variables are continuous, as is common with variational autoencoders, one must discretize the density $q_\theta(\mathbf{z}|\mathbf{x})$ into bins of volume $\delta_{\mathbf{z}}$, yielding a probability mass function $q_\theta(\mathbf{z}|\mathbf{x})\delta_{\mathbf{z}}$. But this imposes a requirement on the initial bits: now $N_{\text{init}} \geq -\log q_\theta(\mathbf{z}|\mathbf{x})\delta_{\mathbf{z}}$ increases as the discretization resolution $1/\delta_{\mathbf{z}}$ increases.

Townsend et al. (2019) remark that initial bits can be avoided by transmitting multiple datapoints in sequence, where every datapoint (except for the first one, $\mathbf{x}^1$) uses the bitstream built up thus far as initial bitstream. This amortizes the initial cost when the number of datapoints transmitted is large, but the cost can be significant for few or moderate numbers of datapoints, as we will see in experiments in Section 6.
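To make the dependence on discretization resolution concrete, the sketch below evaluates the approximate cost of decoding one latent dimension from a discretized Gaussian. It assumes the density is roughly constant within a bin, so that the bin mass is the density times the bin width; the unit-width interval and the standard normal posterior are hypothetical choices, not the paper's setup.

```python
import math

def gaussian_pdf(z, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decode_cost_bits(z, mu, sigma, delta_z):
    """Approximate -log2(q(z|x) * delta_z), treating the density as flat in a bin."""
    return -math.log2(gaussian_pdf(z, mu, sigma) * delta_z)

# Hypothetical setup: a unit-width interval split into 2**k equal bins.
costs = {k: decode_cost_bits(0.0, 0.0, 1.0, 1.0 / 2 ** k) for k in (4, 8, 12)}

for k, cost in costs.items():
    print(f"2**{k} bins: about {cost:.2f} bits for this latent dimension")
```

Under this approximation every doubling of the number of bins adds one bit per latent dimension to the initial bits that must be available.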

4 Problem Scenario: Hierarchical Latent Variables

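Algorithm 1 BB-ANS for lossless compression with hierarchical latent variables. The operations below show the procedure for encoding a dataset onto a bitstream.
  Input: data $\mathcal{D}$, depth $L$, $p_\theta(\mathbf{x}, \mathbf{z}_{1:L})$, $q_\theta(\mathbf{z}_{1:L}|\mathbf{x})$
  Require: ANS
  Initialize: bitstream
  repeat
     Take $\mathbf{x} \in \mathcal{D}$
     decode $\mathbf{z}_1$ with $q_\theta(\mathbf{z}_1|\mathbf{x})$
     for $i = 1$ to $L-1$ do
        decode $\mathbf{z}_{i+1}$ with $q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i)$
     end for
     encode $\mathbf{x}$ with $p_\theta(\mathbf{x}|\mathbf{z}_1)$
     for $i = 1$ to $L-1$ do
        encode $\mathbf{z}_i$ with $p_\theta(\mathbf{z}_i|\mathbf{z}_{i+1})$
     end for
     encode $\mathbf{z}_L$ with $p(\mathbf{z}_L)$
  until $\mathcal{D} = \emptyset$
  Send: bitstream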

Initial bits issues also arise when the model has many latent variables. Models with multiple latent variables are more expressive and in practice can more closely model $p_{\text{data}}$, leading to better compression performance. But since $-\log q_\theta(\mathbf{z}|\mathbf{x})$ generally grows with the dimension of $\mathbf{z}$, adding more expressive power to the latent variable model via more latent variables will incur a larger initial bitstream for BB-ANS.

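Algorithm 2 Bit-Swap (ours) for lossless compression with hierarchical latent variables. The operations below show the procedure for encoding a dataset onto a bitstream.
  Input: data $\mathcal{D}$, depth $L$, $p_\theta(\mathbf{x}, \mathbf{z}_{1:L})$, $q_\theta(\mathbf{z}_{1:L}|\mathbf{x})$
  Require: ANS
  Initialize: bitstream
  repeat
     Take $\mathbf{x} \in \mathcal{D}$
     decode $\mathbf{z}_1$ with $q_\theta(\mathbf{z}_1|\mathbf{x})$
     encode $\mathbf{x}$ with $p_\theta(\mathbf{x}|\mathbf{z}_1)$
     for $i = 1$ to $L-1$ do
        decode $\mathbf{z}_{i+1}$ with $q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i)$
        encode $\mathbf{z}_i$ with $p_\theta(\mathbf{z}_i|\mathbf{z}_{i+1})$
     end for
     encode $\mathbf{z}_L$ with $p(\mathbf{z}_L)$
  until $\mathcal{D} = \emptyset$
  Send: bitstream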

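We specialize our discussion to the case of hierarchical latent variable models: variational autoencoders with multiple latent variables whose sampling process obeys a Markov chain of the form $\mathbf{z}_L \rightarrow \mathbf{z}_{L-1} \rightarrow \cdots \rightarrow \mathbf{z}_1 \rightarrow \mathbf{x}$, shown schematically in Figure 4. (It is well known that such models are better density estimators than shallower models, and we will verify in experiments in Section 6 that these models indeed can model $p_{\text{data}}$ more closely than standard variational autoencoders. A discussion regarding other topologies can be found in Appendix G.) Specifically, we consider a model whose marginal distributions are

$$\begin{aligned} p_\theta(\mathbf{x}) &= \int p_\theta(\mathbf{x}|\mathbf{z}_1)\, p_\theta(\mathbf{z}_1)\, d\mathbf{z}_1 \\ p_\theta(\mathbf{z}_1) &= \int p_\theta(\mathbf{z}_1|\mathbf{z}_2)\, p_\theta(\mathbf{z}_2)\, d\mathbf{z}_2 \\ &\;\;\vdots \\ p_\theta(\mathbf{z}_{L-1}) &= \int p_\theta(\mathbf{z}_{L-1}|\mathbf{z}_L)\, p(\mathbf{z}_L)\, d\mathbf{z}_L \end{aligned} \tag{5}$$

and whose marginal distribution over $\mathbf{x}$ is

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}_1)\, p_\theta(\mathbf{z}_1|\mathbf{z}_2) \cdots p_\theta(\mathbf{z}_{L-1}|\mathbf{z}_L)\, p(\mathbf{z}_L)\, d\mathbf{z}_{1:L}. \tag{6}$$

We define an inference model $q_\theta(\mathbf{z}_i|\mathbf{z}_{i-1})$ for every latent layer $\mathbf{z}_i$ (with $\mathbf{z}_0 = \mathbf{x}$), so that we can optimize a variational bound on the marginal likelihood $p_\theta(\mathbf{x})$. The resulting optimization objective (ELBO) is

$$\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(\mathbf{z}_{1:L}|\mathbf{x})} \big[ \log p_\theta(\mathbf{x}, \mathbf{z}_{1:L}) - \log q_\theta(\mathbf{z}_{1:L}|\mathbf{x}) \big]. \tag{7}$$

Figure 4: The model class we are targeting: hierarchical latent variable models. Specifically, variational autoencoders whose sampling process obeys a Markov chain.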

Now, consider what happens when this model is used with BB-ANS for compression. Figure 5(b) illustrates BB-ANS for such a model with three latent layers $\{\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3\}$; the algorithm for arbitrary latent depths $L$ of a hierarchical latent variable model is shown in Algorithm 1.

The first thing the sender must do is decode the latent variables $\mathbf{z}_{1:L}$ from the initial bitstream of ANS. So, the number of bits present in the initial bitstream must be at least

$$N^{\text{BBANS}}_{\text{init}} := -\log q_\theta(\mathbf{z}_{1:L}|\mathbf{x}) = \sum_{i=0}^{L-1} -\log q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i) \tag{8}$$

where $\mathbf{z}_0 = \mathbf{x}$. Notice that $N^{\text{BBANS}}_{\text{init}}$ must grow with the depth $L$ of the latent variable model. With $L$ sufficiently large, the required initial bits could make BB-ANS impractical as a compression scheme with hierarchical latent variables.

Figure 5: (a) Bit-Swap (ours, left) vs. (b) BB-ANS (right) on a hierarchical latent variable model with three latent layers. Notice that BB-ANS needs a longer initial bitstream compared to Bit-Swap.

5 Bit-Swap

To mitigate this issue, we propose Bit-Swap (Algorithm 2), an improved compression scheme that makes bits-back coding efficiently compatible with the layered structure of hierarchical latent variable models.

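In our proposed model (Equations 5-6), the sampling process of both the generative model and the inference model obeys a Markov chain dependency between the stochastic variables. The data $\mathbf{x}$ is generated conditioned on a latent variable $\mathbf{z}_1$, as in a standard variational autoencoder. However, instead of using a fixed prior for $\mathbf{z}_1$, we assume that $\mathbf{z}_1$ is generated by a second latent variable $\mathbf{z}_2$. Subsequently, instead of using a fixed prior for $\mathbf{z}_2$, we assume that $\mathbf{z}_2$ is generated by a third latent variable $\mathbf{z}_3$, and so on.

These nested dependencies enable us to recursively apply the bits-back argument as follows. Suppose we aim to compress one datapoint $\mathbf{x}$ in a lossless manner. With standard BB-ANS, the sender begins by decoding $\mathbf{z}_{1:L}$, which incurs a large cost of initial bits. With Bit-Swap, we notice that we can apply the first two steps of the bits-back argument on the first latent variable: first decode $\mathbf{z}_1$ and directly afterwards encode $\mathbf{x}$. This adds bits to the bitstream, which means that further decoding operations for $\mathbf{z}_{2:L}$ will need fewer initial bits to proceed. Now, we recursively apply the bits-back argument for the second latent variable $\mathbf{z}_2$ in a similar fashion: first decode $\mathbf{z}_2$ and afterwards encode $\mathbf{z}_1$. Similar operations of encoding and decoding can be performed for the remaining latent variables $\mathbf{z}_{3:L}$: right before decoding $\mathbf{z}_{i+1}$, Bit-Swap always encodes $\mathbf{z}_{i-1}$, and hence at least $-\log p_\theta(\mathbf{z}_{i-1}|\mathbf{z}_i)$ bits are available to decode $\mathbf{z}_{i+1} \sim q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i)$ without an extra cost of initial bits. Therefore, the amount of initial bits that Bit-Swap needs is bounded by $\sum_{i=0}^{L-1} \max\left(0, \log \frac{p_\theta(\mathbf{z}_{i-1}|\mathbf{z}_i)}{q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i)}\right)$, where we used the convention $\mathbf{z}_0 = \mathbf{x}$ and $p_\theta(\mathbf{z}_{-1}|\mathbf{z}_0) = 1$. We can guarantee that Bit-Swap requires no more initial bits than BB-ANS:

$$N^{\text{BitSwap}}_{\text{init}} \leq \sum_{i=0}^{L-1} \max\left(0, \log \frac{p_\theta(\mathbf{z}_{i-1}|\mathbf{z}_i)}{q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i)}\right) \tag{9}$$

$$\leq \sum_{i=0}^{L-1} -\log q_\theta(\mathbf{z}_{i+1}|\mathbf{z}_i) = N^{\text{BBANS}}_{\text{init}}. \tag{10}$$

The second inequality holds because $\log p_\theta(\mathbf{z}_{i-1}|\mathbf{z}_i) \leq 0$ for discrete variables. See Figure 5(a) for an illustration of Bit-Swap on a model with three latent variables $\{\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3\}$.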
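The difference in initial bits between the two operation orders can be illustrated by tracking only the length of the bitstream: every decode removes bits and every encode adds bits. The sketch below does this for a model with three latent layers. The per-operation codelengths are hypothetical numbers picked for illustration, and the helper names are not taken from the paper or its code.

```python
def required_initial_bits(ops):
    """Smallest initial bitstream length that never lets the stream run empty."""
    balance, need = 0.0, 0.0
    for kind, bits in ops:
        balance += bits if kind == "encode" else -bits
        need = max(need, -balance)
    return need

def bbans_ops(dec, enc):
    """BB-ANS order: decode every latent first, then encode everything."""
    return [("decode", d) for d in dec] + [("encode", e) for e in enc]

def bitswap_ops(dec, enc):
    """Bit-Swap order: interleave each decode with the preceding layer's encode."""
    ops = [("decode", dec[0]), ("encode", enc[0])]
    for i in range(1, len(dec)):
        ops += [("decode", dec[i]), ("encode", enc[i])]
    return ops + [("encode", enc[-1])]

def bitswap_bound(dec, enc):
    """Sum over layers of max(0, decode cost minus the bits encoded just before)."""
    prev_enc = [0.0] + enc[: len(dec) - 1]
    return sum(max(0.0, d - e) for d, e in zip(dec, prev_enc))

# Hypothetical codelengths in bits for L = 3 latent layers.
# dec[i] is the cost of decoding z_{i+1} given z_i (with z_0 = x).
# enc[0] is the cost of x given z_1, enc[i] that of z_i given z_{i+1},
# and enc[-1] that of z_L under the prior.
dec = [50.0, 40.0, 30.0]
enc = [400.0, 45.0, 35.0, 30.0]

bbans_need = required_initial_bits(bbans_ops(dec, enc))
bitswap_need = required_initial_bits(bitswap_ops(dec, enc))
bound = bitswap_bound(dec, enc)

print(bbans_need, bitswap_need, bound)
```

With these numbers BB-ANS needs the sum of all decode costs up front, whereas Bit-Swap only needs enough bits for the first decode, because encoding the data replenishes the bitstream before the deeper latents are decoded.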

6 Experiments

| Depth ($L$) | ELBO | # Parameters | Avg. Net Bitrate | Scheme | Initial (timestep 1) | CMA (timestep 50) | CMA (timestep 100) |
|---|---|---|---|---|---|---|---|
| 1 | 1.35 | 2.84M | - | - | - | - | - |
| 2 | 1.28 | 2.75M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |
| 4 | 1.27 | 2.67M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |
| 8 | 1.27 | 2.60M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |

Table 2: MNIST model optimization results (columns 2 and 3) and test data compression results (columns 4 to 8) for various depths of the model (column 1). Column 2 shows the ELBO in bits/dim of the trained models evaluated on the test data. Column 3 denotes the number of parameters used (in millions). Using the trained models, we executed Bit-Swap and BB-ANS on the test data. We used bins to discretize the latent space (see Appendix F). Column 5 denotes the scheme used: BB-ANS or Bit-Swap. Column 4 denotes the average net bitrate in bits/dim (see Section 6.2), averaged over Bit-Swap and BB-ANS. Columns 6-8 show the cumulative moving average in bits/dim (CMA) (see Section 6.2) at various timesteps (1, 50 and 100 respectively). The reported bitrates are the result of compression of 100 datapoints (timesteps), averaged over 100 experiments. We believe that the small discrepancy between the ELBO and the net bitrate comes from the noise resulting from discretization. Also, Bit-Swap reduces to BB-ANS for $L = 1$.

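| Depth ($L$) | ELBO | # Parameters | Avg. Net Bitrate | Scheme | Initial (timestep 1) | CMA (timestep 50) | CMA (timestep 100) |
|---|---|---|---|---|---|---|---|
| 1 | 4.57 | 45.3M | - | - | - | - | - |
| 2 | 3.83 | 45.0M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |
| 4 | 3.81 | 44.9M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |
| 8 | 3.78 | 44.7M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |

Table 3: CIFAR-10 model optimization (columns 2 and 3) and test data compression results (columns 4 to 8) for various depths of the model (column 1). The same comments apply as for Table 2.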

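| Depth ($L$) | ELBO | # Parameters | Avg. Net Bitrate | Scheme | Initial (timestep 1) | CMA (timestep 50) | CMA (timestep 100) |
|---|---|---|---|---|---|---|---|
| 1 | 4.94 | 45.3M | - | - | - | - | - |
| 2 | 4.53 | 45.0M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |
| 4 | 4.48 | 44.9M | | BB-ANS | | | |
| | | | | Bit-Swap | | | |

Table 4: ImageNet ($32 \times 32$) model optimization (columns 2 and 3) and test data compression results (columns 4 to 8) for various depths of the model (column 1). The same comments apply as for Table 2.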
Figure 6: Cumulative moving average of compression rate over time for Bit-Swap (blue) and BB-ANS (orange) for sequences of 100 datapoints, averaged over 100 experiments. The blue dotted line and region represent the average and standard deviation of the net bitrate across the entire test set, without the initial bits (see Section 6.2).

To compare Bit-Swap against BB-ANS, we use the following image datasets: MNIST, CIFAR-10 and ImageNet ($32 \times 32$). Note that the methods are not constrained to this specific type of data. As long as it is feasible to learn a hierarchical latent variable model with Markov chain structure of the data under the given model assumptions, and the data is discrete, it is possible to execute the compression schemes Bit-Swap and BB-ANS on this data.

Referring back to the introduction, designing a good lossless compression algorithm is a matter of jointly solving two problems: 1) approximating the true data distribution $p_{\text{data}}(\mathbf{x})$ as well as possible with a model $p_\theta(\mathbf{x})$, and 2) developing a practical compression scheme that is compatible with this model and results in codelengths equal to $-\log p_\theta(\mathbf{x})$. We address the first point in Section 6.1. As for the second point, we achieve bitrates that are approximately equal to $-\mathcal{L}(\theta)$, the negative ELBO, which is an upper bound on $-\log p_\theta(\mathbf{x})$. We will address this in Section 6.2.

6.1 Performance of Hierarchical Latent Variable Models

We begin our experiments by demonstrating how hierarchical latent variable models with Markov chain structure with different latent layer depths compare to a latent variable model with only one latent variable in terms of how well the models are able to approximate a true data distribution $p_{\text{data}}$. A detailed discussion on the model architecture design can be found in Appendix D.

The results of training of the hierarchical latent variable models are shown in the left three columns of Table 2 (MNIST), 3 (CIFAR-10) and 4 (ImageNet ($32 \times 32$)). One latent layer corresponds to one latent variable $\mathbf{z}_i$. The metric we used is bits per dimension (bits/dim) as evaluated by the negative ELBO $-\mathcal{L}(\theta)$. Note from the resulting ELBO that, as we add more latent layers, the expressive power increases. A discussion on the utility of more latent layers can be found in Appendix E.

6.2 Performance of Bit-Swap versus BB-ANS

We now show that Bit-Swap indeed reduces the initial bits required (as discussed in Section 5) and outperforms BB-ANS on hierarchical latent variable models in terms of actual compression rates. To compare the performance of Bit-Swap versus BB-ANS for different depths of the latent layers, we conducted 100 experiments for every model and dataset. In every experiment we compressed 100 datapoints in sequence and calculated the cumulative moving average (CMA) of the resulting lengths of the bitstream after each datapoint. Note that this includes the initial bits necessary for decoding latent layers. In addition, we calculated the net number of bits added to the bitstream after every datapoint, as explained in Section 2.3, and averaged them over all datapoints and experiments for one dataset and model. This can be interpreted as a lower bound of the CMA of a particular model and dataset. We discretized the continuous latent variables using discretization bins for all datasets and experiments, as explained in Appendix F.
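The two reported quantities can be sketched as follows, under one reading of the description above: the CMA after $n$ datapoints is the total bitstream length (initial bits included) divided by $n$ times the number of dimensions, and the average net bitrate is the mean number of bits added per datapoint, per dimension. The numbers and function names below are hypothetical and are not measurements from the paper.

```python
def cumulative_moving_average(bitstream_lengths, dims):
    """CMA in bits/dim after each datapoint, with the initial bits included."""
    return [length / ((n + 1) * dims) for n, length in enumerate(bitstream_lengths)]

def average_net_bitrate(net_bits_per_datapoint, dims):
    """Mean net bits added per datapoint, expressed in bits/dim."""
    return sum(net_bits_per_datapoint) / (len(net_bits_per_datapoint) * dims)

# Hypothetical run: 100 datapoints of 100 dimensions each,
# 300 initial bits, and 150 net bits added per datapoint.
dims, initial_bits = 100, 300
net_bits = [150] * 100
lengths = [initial_bits + sum(net_bits[: n + 1]) for n in range(len(net_bits))]

cma = cumulative_moving_average(lengths, dims)
net_rate = average_net_bitrate(net_bits, dims)

print(cma[0], cma[49], cma[99], net_rate)
```

In this made-up run the CMA starts well above the net bitrate because the initial bits are charged to a single datapoint, and it approaches the net bitrate from above as more datapoints share that cost.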

The CMA (with the corresponding average net bitrate) over 100 experiments for every model and dataset is shown in Figure 6. Bit-Swap is depicted in blue and BB-ANS in orange. These graphs show two properties of Bit-Swap and BB-ANS: the difference between Bit-Swap and BB-ANS in the need for initial bits, and the fact that the CMA of Bit-Swap and BB-ANS both amortize towards the average net bitrate. The last five columns of Table 2 (MNIST), 3 (CIFAR-10) and 4 (ImageNet ($32 \times 32$)) show the CMA (in bits/dim) after 1, 50 and 100 datapoints for Bit-Swap versus BB-ANS and the average net bitrate (in bits/dim).

The initial cost is amortized (see Section 3) as the number of datapoints compressed grows. Also, the CMA converges to the average net bitrate. The relatively high initial cost of both compression schemes comes from the fact that the initial cost increases with the number of discretization bins, discussed in Appendix F. Furthermore, discretizing the latent space adds noise to the distributions. When using BB-ANS, remember that this initial cost also grows linearly with the number of latent layers $L$. Bit-Swap does not have this problem. This results in a CMA performance gap that grows with the number of latent layers $L$. The efficiency of Bit-Swap compared to BB-ANS results in much faster amortization, which makes Bit-Swap a more practical algorithm.

Finally, we compared both Bit-Swap and BB-ANS against a number of benchmark lossless compression schemes. For MNIST, CIFAR-10 and ImageNet ($32 \times 32$) we report the bitrates, shown in Table 5, as a result of compressing 100 datapoints in sequence (averaged over 100 experiments) and used the best models reported in Table 2, 3 and 4 to do so. We also compressed 100 single images independently taken from the original unscaled ImageNet, cropped to multiples of 32 pixels on each side, shown in Table 6. First, we trained the same model as used for ImageNet ($32 \times 32$) on random patches of the corresponding train set. Then we executed Bit-Swap and BB-ANS by compressing one block at a time and averaging the bitrates of all the blocks in one image. We used the same cropped images for the benchmark schemes. We did not include deep autoregressive models as benchmark, because they are too slow to be practical (see introduction). Bit-Swap clearly outperforms all other benchmark lossless compression schemes.

| Compression Scheme | MNIST | CIFAR-10 | ImageNet |
|---|---|---|---|
| Uncompressed | 8.00 | 8.00 | 8.00 |
| GNU Gzip | 1.65 | 7.37 | 7.31 |
| bzip2 | 1.59 | 6.98 | 7.00 |
| LZMA | 1.49 | 6.09 | 6.15 |
| PNG | 2.80 | 5.87 | 6.39 |
| WebP | 2.10 | 4.61 | 5.29 |
| BB-ANS | 1.48 | 4.19 | 4.66 |
| Bit-Swap | 1.29 | 3.82 | 4.50 |

Table 5: Compression rates (in bits/dim) on MNIST, CIFAR-10, ImageNet ($32 \times 32$). The experimental set-up is explained in Section 6.2.

| Compression Scheme | ImageNet (unscaled & cropped) |
|---|---|
| Uncompressed | 8.00 |
| GNU Gzip (Gailly & Adler, 2018) | 5.96 |
| bzip2 (Seward, 2010) | 5.07 |
| LZMA (Pavlov, 1996) | 5.09 |
| PNG | 4.71 |
| WebP | 3.66 |
| BB-ANS | 3.62 |
| Bit-Swap | 3.51 |

Table 6: Compression rates (in bits/dim) on 100 images taken independently from unscaled and cropped ImageNet. The experimental set-up is explained in Section 6.2.

7 Conclusion

Bit-Swap advances the line of work on practical compression using latent variable models, starting from the theoretical bits-back argument (Wallace, 1990; Hinton & Van Camp, 1993), and continuing on to practical algorithms based on arithmetic coding (Frey & Hinton, 1996; Frey, 1997) and asymmetric numeral systems (Townsend et al., 2019).

Bit-Swap enables us to efficiently compress using hierarchical latent variable models with a Markov chain structure, as it is able to avoid a significant number of initial bits that BB-ANS requires to compress with the same models. The hierarchical latent variable models function as powerful density estimators, so combined with Bit-Swap, we obtain an efficient, low overhead lossless compression algorithm capable of effectively compressing complex, high-dimensional datasets.

Acknowledgements

We want to thank Diederik Kingma for helpful comments and fruitful discussions. This work was funded in part by ONR PECASE N000141612723, Huawei, Amazon AWS, and Google Cloud.

References

Appendix A Asymmetric Numeral Systems (ANS)

We will describe a version of Asymmetric Numeral Systems that we have assumed access to throughout the paper and used in the experiments, namely the range variant (rANS). All the other versions and interpretations can be found in (Duda et al., 2015).

ANS encodes a (sequence of) data point(s) into a natural number $s \in \mathbb{N}$, which is called the state. We will use unconventional notation, yet consistent with our work: $x$ to denote a single datapoint and $s$ to denote the state. The goal is to obtain a state $s$ whose length of the binary representation grows with a rate that closely matches the entropy of the data distribution involved.

Suppose we wish to encode a datapoint $x$ that can take on two symbols $\{x_1, x_2\}$ that have equal probability, starting with state $s = 1$. A valid scheme for this distribution is

$$x_1: s \rightarrow 2s, \qquad x_2: s \rightarrow 2s + 1. \tag{11}$$

This simply assigns 0 to $x_1$ and 1 to $x_2$ in binary. Therefore, it appends 0 or 1 to the right of the binary representation of the state $s$. Note that this scheme is fully decodable: if the current state is $s'$, we can read off the last encoded symbol by telling if the state $s'$ is even (last symbol was $x_1$) or odd (last symbol was $x_2$). Consequently, after figuring out which symbol $x$ was last encoded, the state $s$ before encoding that symbol is obtained by

$$x_1 \ (s' \text{ even}): \; s = \frac{s'}{2}, \qquad x_2 \ (s' \text{ odd}): \; s = \frac{s' - 1}{2}. \tag{12}$$
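The two-symbol scheme is short enough to run directly. In the minimal sketch below, the symbols are represented by the bits 0 and 1, the state starts at 1, and the function names are illustrative.

```python
def encode_bit(state, bit):
    """Append `bit` to the right of the binary representation of the state."""
    return 2 * state + bit

def decode_bit(state):
    """Read the last encoded bit and recover the state that preceded it."""
    return state % 2, state // 2

message = [1, 0, 1, 1]

state = 1
for bit in message:
    state = encode_bit(state, bit)
encoded_state = state

decoded = []
while state > 1:
    bit, state = decode_bit(state)
    decoded.append(bit)

print(bin(encoded_state), decoded)
```

The bits come back in the reverse of the order in which they were encoded, which is the stack behaviour of ANS described in Section 2.1.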

Now, for the general case, suppose that the datapoint $x$ can take on a multitude of symbols $\{x_1, x_2, \ldots, x_I\}$ with probability $\{p_1, p_2, \ldots, p_I\}$. In order to obtain a scheme analogous to the case with two symbols $\{x_1, x_2\}$, we have to assign every possible symbol $x_i$ to a specific subset of the natural numbers $S_i \subset \mathbb{N}$, such that the subsets partition the natural numbers. Consequently, $\mathbb{N}$ is a disjoint union of the subsets $S_i$. Also, the elements in the subset $S_i$ corresponding to $x_i$ must be chosen such that they occur in $\mathbb{N}$ with probability $p_i$.

This is accomplished by choosing a multiplier $M$, called the precision of ANS, that scales up the probabilities $\{p_1, p_2, \ldots, p_I\}$. The scaled up probability $p_i$ is denoted by $F[i]$ and the $F$’s are chosen such that $\sum_{i=1}^{I} F[i] = M$. We also choose subsets $\{K_1, K_2, \ldots\}$ that form intervals of length $M$ and partition the natural numbers. That means, the first $M$ numbers belong to $K_1$, the second $M$ numbers belong to $K_2$, and so on. Then, in every partition $K_n$, the first $F[1]$ numbers are assigned to symbol $x_1$ and form the subset $S_{n1}$, the second $F[2]$ numbers are assigned to symbol $x_2$ and form the subset $S_{n2}$, and so on.

Now, we define $S_i = \bigcup_{n} S_{ni}$. The resulting subsets $S_i$ partition the natural numbers $\mathbb{N}$. Furthermore, the elements of $S_i$ occur with probability approximately equal to $p_i$ in $\mathbb{N}$.

Now, suppose we are given an initial state $s$. The scheme rANS can be interpreted as follows. Encoding a symbol $x_i$ is done by converting the state $s$ to a new state $s'$ that equals the $s$-th occurrence in the set $S_i$. This operation is made concrete in the following formula:

$$C(x_i, s) = M \left\lfloor \frac{s}{F[i]} \right\rfloor + B[i] + (s \bmod F[i]) = s' \tag{13}$$

where $B[i] = \sum_{j=1}^{i-1} F[j]$, and $\lfloor \cdot \rfloor$ denotes the floor function.

Furthermore, suppose we are given a state $s'$ and we wish to know which number was last encoded (or in other words, we wish to decode from $s'$). Note that the union of the subsets $S_i$ partitions the natural numbers $\mathbb{N}$, so every number can be uniquely identified with one of the symbols $x_i$. Afterwards, if we know what the last encoded symbol $x_i$ was, we can figure out the state $s$ that preceded that symbol by doing a look-up for $s'$ in the set $S_i$. The index of $s'$ in $S_i$ equals the state $s$ that preceded $s'$. This operation is made concrete in the following formula, which returns a symbol-state pair.

$$D(s') = \left( x_i, \; F[i] \left\lfloor \frac{s'}{M} \right\rfloor + R - B[i] \right) = (x_i, s), \qquad R = s' \bmod M, \quad i = \max\{ j : B[j] \leq R \} \tag{14}$$

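The new state $s'$ after encoding $x_i$ using this scheme is approximately equal to $s / p_i$. Consequently, encoding a sequence of symbols $x^1, \ldots, x^T$ onto the initial state $s_0$ results in a state $s_T$ approximately equal to $s_T \approx \frac{s_0}{p(x^1)\, p(x^2) \cdots p(x^T)}$, where $p(x^t)$ is the probability of the $t$-th symbol. Thus, the resulting codelength is

$$\log s_T \approx \log s_0 + \sum_{t=1}^{T} \log \frac{1}{p(x^t)}. \tag{15}$$

If we divide by $T$, we obtain an average codelength which approaches the entropy of the data.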
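The encoding and decoding rules of rANS can be written in a few lines when the state is kept as one arbitrary-precision integer. The sketch below is a simplified illustration rather than the implementation used in the experiments: it does not renormalize or stream bits, symbols are indexed from 0, and the frequency table and starting state are hypothetical.

```python
import math

def rans_encode(state, i, F):
    """Push symbol i: M * floor(s / F[i]) + B[i] + (s mod F[i])."""
    M, B_i = sum(F), sum(F[:i])
    return M * (state // F[i]) + B_i + state % F[i]

def rans_decode(state, F):
    """Pop the last symbol; return it with the state that preceded it."""
    M = sum(F)
    R = state % M
    i, B_i = 0, 0
    while B_i + F[i] <= R:   # find the symbol with B[i] <= R < B[i] + F[i]
        B_i += F[i]
        i += 1
    return i, F[i] * (state // M) + R - B_i

# Hypothetical scaled probabilities with precision M = 8: p = (1/8, 3/8, 1/2).
F = [1, 3, 4]
message = [2, 1, 0, 2, 2, 1]

s0 = 2 ** 32
state = s0
for i in message:
    state = rans_encode(state, i, F)

actual_bits = math.log2(state) - math.log2(s0)
ideal_bits = sum(-math.log2(F[i] / sum(F)) for i in message)

decoded = []
for _ in message:
    i, state = rans_decode(state, F)
    decoded.append(i)
decoded.reverse()  # symbols are popped in the opposite order

print(decoded, actual_bits, ideal_bits)
```

The growth of the state in bits closely matches the sum of the ideal codelengths, and decoding returns both the message and the starting state.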

Appendix B The bits-back argument

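We will present a detailed explanation of the bits-back argument that is fitted to our case. Again, suppose that a sender would like to communicate a sample $\mathbf{x} \sim p_{\text{data}}$ to a receiver through a code that comprises the minimum amount of bits on average over $p_{\text{data}}$. Now suppose that both the sender and receiver have access to the distributions $q_\theta(\mathbf{z}|\mathbf{x})$, $p_\theta(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z})$, whose parameters are optimized such that $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$ approximates $p_{\text{data}}(\mathbf{x})$ well. Furthermore, both the sender and receiver have access to an entropy coding technique like ANS. A naive compression scheme for the sender is

  1. Sample $\mathbf{z} \sim q_\theta(\mathbf{z}|\mathbf{x})$

  2. Encode $\mathbf{x}$ using $p_\theta(\mathbf{x}|\mathbf{z})$

  3. Encode $\mathbf{z}$ using $p(\mathbf{z})$.

This would result in a bitrate equal to $-\log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})$. The resulting bitstream gets sent over and the receiver would proceed in the following way:

  1. Decode $\mathbf{z}$ using $p(\mathbf{z})$

  2. Decode $\mathbf{x}$ using $p_\theta(\mathbf{x}|\mathbf{z})$.

Consequently, the sample $\mathbf{x}$ is recovered in a lossless manner at the receiver’s end. However, we can do better using the bits-back argument, which lets us achieve a bitrate of $\log q_\theta(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})$, which is the naive bitrate reduced by $-\log q_\theta(\mathbf{z}|\mathbf{x})$ bits. To understand this, we have to clarify what decoding means. If the model fits the true distributions perfectly, entropy coding can be understood as a (bijective) mapping between datapoints and uniformly distributed random bits. Therefore, assuming that the bits are uniformly distributed, decoding information $\mathbf{z}$ or $\mathbf{x}$ from the bitstream can be interpreted as sampling $\mathbf{z}$ or $\mathbf{x}$.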

Now, assume the sender has access to an arbitrary bitstream that is already set in place. This can be in the form of previously compressed datapoints or other auxilary information. Looking at the naive compression scheme, we note that the step ’Sample ’ can be substituted by ’Decode using ’. Consequently, the compression scheme at the sender’s end using the bits-back argument is

  1. Decode using

  2. Encode using

  3. Encode using .

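This results in a total bitrate equal to $N_{\text{total}} = N_{\text{init}} + \log q_\theta(z|x) - \log p_\theta(x|z) - \log p(z)$, where $N_{\text{init}}$ denotes the number of bits that were already set in place. The total resulting bitstream is sent over, and the receiver now proceeds as:

  1. Decode $z$ using $p(z)$

  2. Decode $x$ using $p_\theta(x|z)$

  3. Encode $z$ using $q_\theta(z|x)$

and, again, the sample $x$ is recovered in a lossless manner. But now, in the last step, the receiver has recovered the $N_{\text{init}}$ bits of auxiliary information that were set in place, thus gaining $N_{\text{init}}$ bits “back”. Ignoring the initial bits $N_{\text{init}}$, the net number of bits regarding $x$ is

$$N_{\text{total}} - N_{\text{init}} = \log q_\theta(z|x) - \log p_\theta(x|z) - \log p(z) \qquad (16)$$

which is on average equal to the negative ELBO $-\mathcal{L}(\theta)$.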
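To make the difference between the naive and the bits-back bitrate concrete, the expected codelengths can be computed exactly for a small discrete model. This is an illustrative sketch only: the probability tables `p_z`, `p_x_given_z` and `q_z_given_x` are hypothetical toy values, and the helper `expected_codelengths` is not part of our implementation.

```python
import math

# Hypothetical toy model: binary latent z, ternary observation x.
p_z = [0.5, 0.5]                      # prior p(z)
p_x_given_z = [[0.7, 0.2, 0.1],       # p(x | z=0)
               [0.1, 0.3, 0.6]]       # p(x | z=1)
q_z_given_x = [[0.85, 0.15],          # q(z | x=0)
               [0.40, 0.60],          # q(z | x=1)
               [0.15, 0.85]]          # q(z | x=2)

def expected_codelengths(x):
    """Return (naive, bits-back net, -log2 p(x)) in bits.

    The first two values are averaged over z ~ q(z|x).
    """
    naive = 0.0
    net = 0.0
    for z, q in enumerate(q_z_given_x[x]):
        cost = -math.log2(p_x_given_z[z][x]) - math.log2(p_z[z])
        naive += q * cost
        net += q * (cost + math.log2(q))   # subtract the bits gained back
    marginal = sum(p_z[z] * p_x_given_z[z][x] for z in range(len(p_z)))
    return naive, net, -math.log2(marginal)

for x in range(3):
    naive, net, ideal = expected_codelengths(x)
    print(f"x={x}: naive={naive:.3f}  bits-back net={net:.3f}  -log2 p(x)={ideal:.3f}")
```

For every $x$ the bits-back value is the negative ELBO, so it lies between the ideal codelength $-\log_2 p(x)$ and the naive bitrate; the gap to the ideal closes as the approximate posterior approaches the true posterior.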

If the $N_{\text{init}}$ bits consist of relevant information, the receiver can then proceed by decompressing that information after gaining these “bits back”. As an example, Townsend et al. (2019) point out that it is possible to compress a sequence of datapoints $\{x_i\}_{i=1}^{N}$, where every datapoint $x_i$ (except for the first one, $x_1$) uses the bitstream built up thus far as its initial bitstream. Then, at the receiver’s end, the datapoints are decompressed in reverse order. This way the receiver effectively gains the “bits back” after finishing the three decompression steps of each datapoint $x_i$, such that decompression of the next datapoint $x_{i-1}$ can proceed. The only bits that can be irrelevant or redundant are the initial bits needed to compress the first datapoint $x_1$, though this cost is amortized when compressing multiple datapoints.
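The sender and receiver steps, together with chaining over several datapoints, can be sketched end to end. The code below is a hedged illustration rather than the released implementation: it assumes a range-variant ANS on unbounded Python integers, a hypothetical toy model with a binary latent, and helper names (`make_table`, `push`, `pop`, `send`, `receive`) chosen for this example.

```python
import math

PRECISION = 10                      # assumed quantization of probabilities
TOTAL = 1 << PRECISION

def make_table(probs):
    """Quantize probabilities to positive integer frequencies summing to TOTAL."""
    freqs = [max(1, round(p * TOTAL)) for p in probs]
    freqs[freqs.index(max(freqs))] += TOTAL - sum(freqs)
    starts = [sum(freqs[:i]) for i in range(len(freqs))]
    return freqs, starts

def push(state, sym, table):
    """ANS encode: put `sym` on top of the state."""
    freqs, starts = table
    f = freqs[sym]
    return (state // f) * TOTAL + starts[sym] + state % f

def pop(state, table):
    """ANS decode: take the top symbol off the state (exact inverse of push)."""
    freqs, starts = table
    slot = state % TOTAL
    sym = max(i for i, start in enumerate(starts) if start <= slot)
    return freqs[sym] * (state // TOTAL) + slot - starts[sym], sym

# Hypothetical toy model: binary latent z, ternary observation x.
P_Z = make_table([0.5, 0.5])
P_X_GIVEN_Z = [make_table([0.7, 0.2, 0.1]), make_table([0.1, 0.3, 0.6])]
Q_Z_GIVEN_X = [make_table([0.85, 0.15]), make_table([0.4, 0.6]),
               make_table([0.15, 0.85])]

def send(state, x):
    state, z = pop(state, Q_Z_GIVEN_X[x])      # 1. decode z using q(z|x)
    state = push(state, x, P_X_GIVEN_Z[z])     # 2. encode x using p(x|z)
    return push(state, z, P_Z)                 # 3. encode z using p(z)

def receive(state):
    state, z = pop(state, P_Z)                 # 1. decode z using p(z)
    state, x = pop(state, P_X_GIVEN_Z[z])      # 2. decode x using p(x|z)
    return push(state, z, Q_Z_GIVEN_X[x]), x   # 3. encode z using q(z|x)

data = [0, 2, 1, 2, 0, 0, 1]
initial_state = (1 << 64) + 12345              # stands in for the initial bits

state = initial_state
for x in data:                                 # chaining: reuse the running state
    state = send(state, x)
net_bits = math.log2(state) - math.log2(initial_state)

decoded = []
for _ in data:                                 # receiver works in reverse order
    state, x = receive(state)
    decoded.append(x)
decoded.reverse()

print("net bits for the sequence:", round(net_bits, 2))
print("lossless:", decoded == data, "| initial bits recovered:", state == initial_state)
```

Since `pop` and `push` are exact inverses, the receiver ends with precisely the initial state, which is the sense in which the initial bits are gained "back".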

Appendix C AC and ANS: Queue vs. Stack

Entropy coding techniques make lossless compression possible given an arbitrary discrete distribution over data. There exist several practical compression schemes, of which the most common flavors are

  1. Arithmetic Coding (AC) (Witten et al., 1987)

  2. Asymmetric Numeral Systems (ANS) (Duda et al., 2015).

Both schemes operate by encoding the data into a single number (in its binary representation equivalent to a sequence of bits, or bitstream) and decoding the other way around, where the single number serves as the message to be sent. By doing so, both schemes result in a message length with a small overhead of around 2 bits. That is, compressing $x$ results in sending/receiving a codelength of approximately

$$-\log p(x) + 2 \text{ bits} \qquad (17)$$

However, AC and ANS differ in how they operate on the bitstream. In AC, the bitstream is treated as a queue structure, where the first bits of the bitstream are the first to be decoded (FIFO).

Figure 7: Arithmetic Coding (AC) operates on a bitstream in a queue-like manner. Symbols are decoded in the same order as they were encoded.

In ANS, the bitstream is treated as a stack structure, where the last bits of the bitstream are the first to be decoded (LIFO).

Figure 8: Asymmetric Numeral Systems (ANS) operates on a bitstream in a stack-like manner. Symbols are decoded in the opposite order to that in which they were encoded.
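The stack-like behaviour of ANS is easiest to see in the special case of a uniform distribution, where encoding reduces to appending a digit in a positional numeral system. The sketch below assumes that special case; the function names and the sentinel start state of 1 are choices made for this illustration.

```python
def push(state, symbol, base):
    """Encode a symbol drawn from a uniform distribution over `base` values."""
    return state * base + symbol

def pop(state, base):
    """Decode the most recently pushed symbol and restore the previous state."""
    return state // base, state % base

def encode_all(symbols, base):
    state = 1                      # sentinel start state marks the stack bottom
    for s in symbols:
        state = push(state, s, base)
    return state

def decode_all(state, base):
    decoded = []
    while state > 1:               # stop once only the sentinel remains
        state, s = pop(state, base)
        decoded.append(s)
    return decoded

BASE = 4
symbols = [3, 0, 2, 1]
message = encode_all(symbols, BASE)
print(decode_all(message, BASE))   # [1, 2, 0, 3]: last in, first out
```

A queue-like coder such as AC would instead return the symbols in their original order, which is why the two schemes interact differently with bits-back coding.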

AC and ANS both operate on one discrete symbol at a time. Therefore, the compression schemes are constrained to a distribution that factorizes into a product of discrete directed univariate distributions in order to operate. Furthermore, both the sender and receiver need to have access to the compression scheme and the model in order to be able to communicate.

Frey & Hinton (1996) implement the bits-back theory on a single datapoint $x$ using AC. The net codelength is then equal to the negative ELBO, provided that we have access to an initial random bitstream. However, when calculating the actual codelength we must also count the length of the initial bitstream from which we decode $z$, which we call the initial bits. In that case the codelength degenerates to that of the naive scheme, $-\log p_\theta(x|z) - \log p(z)$. So implementing bits-back on a single datapoint does not yield the advantage of getting “bits back”.

By communicating a sequence of datapoints $\{x_i\}_{i=1}^{N}$, only the first datapoint $x_1$ needs to have an initial random bitstream set in place. Afterwards, a subsequent datapoint $x_i$ may just use the existing bitstream built up thus far to decode the corresponding latent variable $z_i$. This procedure was first described by Frey (1997), and was called bits-back with feedback. We will use the shorter and more convenient term chaining, which was introduced by Townsend et al. (2019).

Chaining is not directly possible with AC, because the existing bitstream is treated as a queue structure, whereas bits-back only works if the latent variable $z$ is decoded earlier than the corresponding datapoint $x$, which requires $z$ to be ‘stacked’ on top of $x$ when decoding. Frey solves this problem by reversing the order of the newly encoded bits before adding them to the bitstream. This incurs a cost of between 2 and 32 bits in the encoding procedure of each datapoint, depending on the implementation.

Appendix D Model Architecture

| | MNIST | CIFAR-10 | ImageNet (32 × 32) |
| Dimension $x$ | | | |
| Dimension $z_i$ | | | |
| Dimension Residual | | | |
| Latent Layers | | | |
| ‘Processing’ Residual | 4 | 4 | 4 |
| ‘Ordinary’ Residual | 8 | 8 | 8 |
| Dropout Rate | 0.2 | 0.3 | 0.0 |

Table 7: Hyperparameters of the model architecture for MNIST, CIFAR-10 and ImageNet (32 × 32). The first three rows denote the dimensions of $x$, $z_i$ and the output of the Residual blocks, respectively. The fourth row gives the number of latent layers used. The fifth and sixth rows give the number of ‘processing’ Residual blocks and ‘ordinary’ Residual blocks, respectively, as explained in Appendix D.
Figure 9: A schematic representation of the networks corresponding to (left) and (right) of the inference model. The arrows show the direction of the forward propagation.
Figure 10: A schematic representation of the networks corresponding to (left) and (right) of the generative model. The arrows show the direction of the forward propagation.

For all three datasets (MNIST, CIFAR-10 and ImageNet (32 × 32)), we chose a Logistic distribution for the prior $p(z_L)$ and conditional Logistic distributions for $q_\theta(z_1|x)$, $q_\theta(z_i|z_{i-1})$ and $p_\theta(z_i|z_{i+1})$. The distribution $p_\theta(x|z_1)$ is chosen to be a discretized Logistic distribution as defined in Kingma et al. (2016). We modeled the Logistic distributions by a neural network for every pair of parameters $(\mu, \sigma)$. A schematic representation of the different networks is shown in Figures 9 and 10. The $\sigma$ parameter of $p_\theta(x|z_1)$ is modeled unconditionally, and optimized directly. We chose Residual blocks (He et al., 2016) as hidden layers. We also used Dropout (Srivastava et al., 2014) to prevent overfitting, Weight Normalization and Data-Dependent Initialization (Salimans & Kingma, ), Polyak averaging (Polyak & Juditsky, 1992) of the model’s parameters and the Adam optimizer (Kingma & Ba, 2015).
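As an illustration of what a discretized Logistic distribution over pixel values looks like, the probability of each bin can be obtained as a difference of Logistic CDF values. This is a minimal sketch under stated assumptions: unit-width bins over the integers 0 to 255, outermost bins that absorb the tails, and hypothetical values for the location and scale; the exact parameterization used in our code may differ.

```python
import math

def sigmoid(t):
    """Numerically stable Logistic CDF with location 0 and scale 1."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def discretized_logistic_pmf(mu, scale, num_bins=256):
    """Probability of each integer bin k in [0, num_bins) with unit-width bins.

    Assumption of this sketch: the first and last bins absorb the tails,
    so that the probabilities sum to one.
    """
    pmf = []
    for k in range(num_bins):
        upper = 1.0 if k == num_bins - 1 else sigmoid((k + 0.5 - mu) / scale)
        lower = 0.0 if k == 0 else sigmoid((k - 0.5 - mu) / scale)
        pmf.append(upper - lower)
    return pmf

pmf = discretized_logistic_pmf(mu=100.0, scale=6.0)   # hypothetical parameters
pixel = 103
print(f"P(x={pixel}) = {pmf[pixel]:.5f}")
print(f"codelength   = {-math.log2(pmf[pixel]):.3f} bits")
```

The resulting probability table is exactly what an entropy coder such as ANS needs in order to encode or decode one pixel value.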

To make a fair comparison between the different latent layer depths for one dataset, we used the same number of ‘ordinary’ Residual blocks for the entire inference model and for the generative model, and kept this number fixed for all latent layer depths. The blocks are distributed over the networks that make up the inference model and the networks that make up the generative model. In addition, we added ‘processing’ Residual blocks at the beginning of the network corresponding to $q_\theta(z_1|x)$ and at the end of the network corresponding to $p_\theta(x|z_1)$. Finally, we decreased the channel dimension of the output of all the Residual blocks in order to ensure that the parameter count stays constant (or decreases) as we add more latent layers. All the chosen hyperparameters are shown in Table 7. We refer to the code at https://github.com/fhkingma/bitswap for further details on the implementation.

Appendix E Usefulness of Latent Layers & Posterior Collapse

Figure 11: Stack plots of the number of bits/dim required per stochastic layer to encode the test set over time. The bottom-most (white) area corresponds to the bottom-most (reconstruction of $x$) layer, the second area from the bottom denotes the first latent layer, the third area denotes the second latent layer, and so on.

Posterior collapse is one of the drawbacks of using variational auto-encoders (Chen et al., 2017). Especially when using deep hierarchies of latent variables, the higher latent layers can become redundant and therefore unused (Zhao et al., 2017). We counter this problem by using the free-bits technique as explained in Chen et al. (2017) and Kingma et al. (2016). As a result of this technique, all latent layers across all models and datasets are used. To demonstrate this, we generated stack plots of the number of bits/dim required per stochastic layer to encode the test set over time, shown in Figure 11. The bottom-most (white) area corresponds to the bottom-most (reconstruction of $x$) layer, the second area from the bottom denotes the first latent layer, the third area denotes the second latent layer, and so on.
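The core of the free-bits technique is a lower clamp on each KL term of the training objective. The snippet below is a simplified sketch of that idea, assuming per-layer KL values given as plain floats and a hypothetical threshold `free_bits`; the cited works apply the clamp to minibatch averages over groups of latent variables inside an automatic differentiation framework.

```python
def free_bits_kl(kl_per_layer, free_bits):
    """Sum of per-layer KL terms, each clamped from below at `free_bits`."""
    return sum(max(free_bits, kl) for kl in kl_per_layer)

def free_bits_loss(reconstruction_cost, kl_per_layer, free_bits):
    """Negative ELBO in which the KL terms are replaced by their clamped values."""
    return reconstruction_cost + free_bits_kl(kl_per_layer, free_bits)

kl_values = [0.01, 0.5, 2.0]          # hypothetical KL per latent layer
print(free_bits_kl(kl_values, free_bits=0.1))
```

A layer whose KL is already below the threshold contributes a constant to the loss, so the optimizer gains nothing by pushing its KL further towards zero, which discourages that layer from collapsing.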

Appendix F Discretization of the Latent Space

In order to perform lossless compression with continuous latent distributions, we need to determine how to discretize the latent space $z_i$ for every corresponding distribution $q_\theta(z_i|z_{i-1})$, $p_\theta(z_i|z_{i+1})$ and $p(z_L)$. Townsend et al. (2019), building on MacKay (2003), show that if the bins $\delta_z$ of $z \sim q_\theta(z|x)$ match the bins of $z \sim p(z)$, continuous latents can be discretized up to arbitrary precision without affecting the net compression rate, as a result of getting “bits back”. We generalize this result to hierarchical latent variables by stating that the bins $\delta_{z_i}$ of the latent conditional generative distributions $p_\theta(z_i|z_{i+1})$ have to match the bins of the inference distributions $q_\theta(z_i|z_{i-1})$ in order to avoid affecting the compression rate. Nonetheless, the length of the initial bitstream needed to decode a latent sample (or possibly several samples) is still dependent on the corresponding bin size(s). Therefore, we cannot make the bin sizes too small without affecting the total codelength too much.

There are several discretization techniques we could use. One option is to simply discretize uniformly, which means dividing the space into bins of equal width. However, given the constraint that the initial bitstream needed increases with larger precision, we have to make bin sizes reasonably large. Accordingly, uniform discretization of non-uniform distributions could lead to large discretization errors, and this could lead to inefficient codelengths.

Another option is to follow the discretization technique used in Townsend et al. (2019) by dividing the latent space into bins that have equal mass under some distribution (as opposed to equal width). Ideally, the bins of $z_i \sim q_\theta(z_i|z_{i-1})$ match the bins of $z_i \sim p_\theta(z_i|z_{i+1})$ and the bins have equal mass under either $q_\theta(z_i|z_{i-1})$ or $p_\theta(z_i|z_{i+1})$. However, when using ANS with hierarchical latent variable models it is not possible to let the discretization of $z_i \sim q_\theta(z_i|z_{i-1})$ depend on bins based on $p_\theta(z_i|z_{i+1})$, because $z_{i+1}$ is not yet available to the sender when decoding $z_i$. Conversely, discretization of $z_i \sim p_\theta(z_i|z_{i+1})$ cannot depend on bins based on $q_\theta(z_i|z_{i-1})$, since $z_{i-1}$ is not yet available to the receiver when decoding $z_i$. Note that the operations of the compression scheme at the sender's end have to be the opposite of the operations at the receiver's end, and we need the same discretizations at both ends. Under these conditions, it is not possible to use either $q_\theta(z_i|z_{i-1})$ or $p_\theta(z_i|z_{i+1})$ for the bin sizes and at the same time match the bins of $q_\theta(z_i|z_{i-1})$ with the bins of $p_\theta(z_i|z_{i+1})$.

So, we sampled a batch from the latent generative model by ancestral sampling and a batch from the latent inference model (using the training dataset) right after learning. This gives us unbiased estimates of the statistics of the marginal distributions, defined in Equation 5, which we can save as part of the model. Consequently, we used the marginal distributions to determine the bin sizes for discretization of the latent inference and generative distributions. Note that we only have to perform this sampling operation once, hence this process does not affect the compression speed.

However, we found that using uniform discretization for all latent layers except for the top one (corresponding to the prior) gives the best discretization and leads to the best compression results. Nevertheless, the top layer is discretized with bins that have equal mass under the prior, following Townsend et al. (2019).
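The two discretization strategies discussed in this appendix can be compared on a standard Logistic distribution. The following sketch is illustrative only: the function names, the number of bins and the range used for the equal-width bins are assumptions made for this example, and the outermost bins are taken to extend to infinity.

```python
import math

def logistic_cdf(z, mu=0.0, scale=1.0):
    t = (z - mu) / scale
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def logistic_icdf(u, mu=0.0, scale=1.0):
    """Inverse CDF (quantile function) of the Logistic distribution."""
    return mu + scale * math.log(u / (1.0 - u))

def equal_mass_edges(num_bins, mu=0.0, scale=1.0):
    """Interior bin edges such that every bin has mass 1 / num_bins."""
    return [logistic_icdf(i / num_bins, mu, scale) for i in range(1, num_bins)]

def equal_width_edges(num_bins, low, high):
    """Interior bin edges of num_bins equal-width bins spanning [low, high]."""
    width = (high - low) / num_bins
    return [low + i * width for i in range(1, num_bins)]

def bin_masses(edges, mu=0.0, scale=1.0):
    """Mass of each bin; the first and last bins extend to -inf and +inf."""
    cdf = [0.0] + [logistic_cdf(e, mu, scale) for e in edges] + [1.0]
    return [b - a for a, b in zip(cdf, cdf[1:])]

NUM_BINS = 16
mass_eq_mass = bin_masses(equal_mass_edges(NUM_BINS))
mass_eq_width = bin_masses(equal_width_edges(NUM_BINS, -6.0, 6.0))

print("equal mass : min %.4f max %.4f" % (min(mass_eq_mass), max(mass_eq_mass)))
print("equal width: min %.4f max %.4f" % (min(mass_eq_width), max(mass_eq_width)))
print("bits to decode one equal-mass bin index:", math.log2(NUM_BINS))
```

With equal-mass bins, decoding a bin index from the prior consumes $\log_2$ of the number of bins, which makes explicit why finer bins require a longer initial bitstream.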

Appendix G General Applicability of Bit-Swap

Our work only concerns a very particular case of hierarchical latent variable models, namely those in which the sampling processes of both the generative and the inference model of the corresponding variational autoencoder obey Markov chains of the form

$$z_L \rightarrow z_{L-1} \rightarrow \cdots \rightarrow z_1 \rightarrow x$$

and

$$z_L \leftarrow z_{L-1} \leftarrow \cdots \leftarrow z_1 \leftarrow x$$

respectively. It might seem very restrictive to only be able to assume this topology of stochastic dependencies. However, the Bit-Swap idea can in fact be applied to any latent variable topology in which it is possible to apply the bits-back argument in a recursive manner.

To show this we present two hypothetical latent variable topologies in Figure 12. Figure 12(a) shows an asymmetrical tree structure of stochastic dependencies and Figure 12(b) shows a symmetrical tree structure of stochastic dependencies where the variables of one hierarchical layer can also be connected. In Figures 12(c) and 12(d) we show the corresponding Bit-Swap compression schemes.

The more general applicability of Bit-Swap allows us to design complex stochastic dependencies and potentially stronger models. This might be an interesting direction for future work.
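For the Markov chain case treated in this work, the recursive application of the bits-back argument amounts to interleaving decode and encode operations layer by layer. The sketch below only lists the order of operations as described in the main text; the function names are hypothetical and no actual entropy coding is performed.

```python
def var(i):
    """Name of the variable at level i, where level 0 is the data x."""
    return "x" if i == 0 else f"z{i}"

def bitswap_sender_ops(num_layers):
    """Operations executed by the sender, in order, for a chain of latent layers."""
    ops = []
    for i in range(1, num_layers + 1):
        ops.append(f"decode {var(i)} with q({var(i)}|{var(i - 1)})")
        ops.append(f"encode {var(i - 1)} with p({var(i - 1)}|{var(i)})")
    ops.append(f"encode {var(num_layers)} with p({var(num_layers)})")
    return ops

def bitswap_receiver_ops(num_layers):
    """The receiver undoes the sender's operations in reverse order."""
    swap = {"decode": "encode", "encode": "decode"}
    ops = []
    for op in reversed(bitswap_sender_ops(num_layers)):
        verb, rest = op.split(" ", 1)
        ops.append(f"{swap[verb]} {rest}")
    return ops

for line in bitswap_sender_ops(3):
    print("sender:  ", line)
for line in bitswap_receiver_ops(3):
    print("receiver:", line)
```

The same construction carries over to other topologies as long as every decode with the inference model can be immediately followed by an encode with the generative model for the variable one level below.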

(a) Asymmetrical tree structure
(b) Symmetrical tree structure including dependencies within a hierarchical layer
(c) Bit-Swap executed on 12(a)
(d) Bit-Swap executed on 12(b)
Figure 12: The top-left figure shows an asymmetrical tree structure of stochastic dependencies and the top-right figure shows a symmetrical tree structure of stochastic dependencies where the variables of one hierarchical layer can also be connected. The black arrows indicate the direction of the generative model and the gray dotted arrows show the direction of the inference model. The black dotted arrows show the variables on which the prior(s) is/are defined. In the bottom left and the bottom right we show the corresponding Bit-Swap compression schemes. In the right column of every figure, we show the variables that are being operated on. On the left of every figure we show the operations that must be executed by the sender, and in the middle we show the operations executed by the receiver. The operations must be executed in the order dictated by the direction of the corresponding arrow. The sender always uses the inference model for decoding and the generative model for encoding. The receiver always uses the generative model for decoding and the inference model for encoding.