1 Introduction
To devise a lossless compression algorithm means to devise a uniquely decodable code whose expected length is as close as possible to the entropy of the data. A general recipe for this is to first train a generative model by minimizing cross entropy to the data distribution, and then construct a code that achieves lengths close to the negative log likelihood of the model. This recipe is justified by classic results in information theory that ensure that the second step is possible; in other words, optimizing cross entropy optimizes the performance of some hypothetical compressor. And, thanks to recent advances in deep likelihood-based generative models, these hypothetical compressors are quite good. Autoregressive models, latent variable models, and flow models are now achieving state-of-the-art cross entropy scores on a wide variety of real-world datasets in speech, videos, text, images, and other domains (oord2016pixel; oord2016conditional; salimans2017pixelcnn++; parmar2018image; chen2018pixelsnail; menick2018generating; child2019generating; kalchbrenner2017video; dinh2014nice; dinh2016density; kingma2018glow; prenger2019waveglow; ho2019flow++; oord2016wavenet; kingma2013auto; maaloe2019biva).
But we are not interested in hypothetical compressors. We are interested in practical, computationally efficient compressors that scale to high-dimensional data and harness the excellent cross entropy scores of modern deep generative models. Unfortunately, naively applying existing codes, like Huffman coding (huffman1952method), requires computing the model likelihood for all possible values of the data, which expends computational resources scaling exponentially with the data dimension. This inefficiency stems from the lack of assumptions about the generative model's structure.
Coding algorithms must be tailored to specific types of generative models if we want them to be efficient enough for practical use. There is already a rich literature of tailored coding algorithms for autoregressive models and variational autoencoders, assuming they are built from conditional distributions which are already tractable for coding (rissanen1976generalized; duda2013asymmetric; hinton1993keeping; frey1997efficient; townsend2018practical). On the other hand, there are currently no such algorithms for flow models in general (mentzer2018practical). This lack of efficient coding algorithms is a drawback of flow models that stands in odd contrast with their many advantages, like fast and realistic sampling, interpretable latent spaces, fast likelihood evaluation, competitive cross entropy scores, and ease of training with unbiased log likelihood gradients (dinh2014nice; dinh2016density; kingma2018glow; ho2019flow++).

To rectify this situation, we introduce local bits-back coding, a new technique for turning a general, pretrained, off-the-shelf flow model into an efficient coding algorithm suitable for continuous data discretized to high precision. We show how to implement local bits-back coding without assumptions on the flow structure, leading to an algorithm that runs in polynomial time and space with respect to the data dimension. Going further, we show how to tailor our implementation to various specific types of flows, culminating in an algorithm for RealNVP-type flows that runs in linear time and space with respect to the data dimension and is fully parallelizable for both encoding and decoding. We then show how to adapt local bits-back coding to losslessly code data discretized to arbitrarily low precision, and in doing so, we end up with a new compression interpretation of dequantization, a method commonly used to train flow models on discrete data. We test our algorithms on state-of-the-art flow models trained on real-world image datasets, and we find that they are computationally efficient and attain codelengths in close agreement with theoretical predictions. We plan on providing an open source code release.
2 Preliminaries
Lossless compression. We begin by defining lossless compression of $D$-dimensional discrete data $\mathbf{x}^\circ$ using a probability mass function $p(\mathbf{x}^\circ)$ represented by a generative model. It means to construct a uniquely decodable code $C$, which is an injective map from data sequences to binary strings, whose lengths $|C(\mathbf{x}^\circ)|$ are close to $-\log p(\mathbf{x}^\circ)$ (cover2012elements). (We always use base 2 logarithms.) The rationale is that if the generative model is expressive and trained well, its cross entropy will be close to the entropy of the data distribution. So, if the lengths of $C$ match the model's negative log probabilities, the expected length of $C$ will be small, and hence $C$ will be a good compression algorithm. Constructing such a code is always possible in theory, because the Kraft-McMillan inequality (kraft1949device; mcmillan1956two) ensures that there always exists some code with lengths $|C(\mathbf{x}^\circ)| = \lceil -\log p(\mathbf{x}^\circ) \rceil \approx -\log p(\mathbf{x}^\circ)$.

Flow models. We wish to construct a computationally efficient code specialized to a flow model $f$, which is a differentiable bijection between continuous data $\mathbf{x} \in \mathbb{R}^D$ and latents $\mathbf{z} = f(\mathbf{x}) \in \mathbb{R}^D$ (deco1995higher; dinh2014nice; dinh2016density). A flow model comes with a density $p(\mathbf{z})$ on the latent space, and thus has an associated sampling process, $\mathbf{x} = f^{-1}(\mathbf{z})$ for $\mathbf{z} \sim p(\mathbf{z})$, under which it defines a probability density function via the change-of-variables formula for densities:

$$-\log p(\mathbf{x}) = -\log p(\mathbf{z}) - \log\left|\det \mathbf{J}(\mathbf{x})\right| \tag{1}$$

where $\mathbf{J}(\mathbf{x})$ denotes the Jacobian of $f$ at $\mathbf{x}$. Flow models are straightforward to train with maximum likelihood, as Eq. 1 allows unbiased exact log likelihood gradients to be computed efficiently.
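As a minimal sketch of Eq. 1, assume a toy elementwise affine flow $z_i = (x_i - \mu_i)/s_i$ with an independent standard normal prior. The flow, its parameters `mu` and `scale`, and all function names are illustrative assumptions rather than any model used in this paper. Under this flow the data density is a Gaussian with mean `mu` and standard deviation `scale`, so the change-of-variables value can be compared against the closed form, in bits.

```python
import math

def standard_normal_nll_bits(z):
    """Negative log density (base 2) of an independent standard normal prior."""
    nats = sum(0.5 * math.log(2 * math.pi) + 0.5 * zi * zi for zi in z)
    return nats / math.log(2)

def affine_flow(x, mu, scale):
    """Toy flow z = (x - mu) / scale; returns z and log2 |det J(x)|."""
    z = [(xi - mi) / si for xi, mi, si in zip(x, mu, scale)]
    # The Jacobian is diagonal with entries 1 / scale.
    log2_det = sum(-math.log2(abs(si)) for si in scale)
    return z, log2_det

def flow_nll_bits(x, mu, scale):
    """Eq. 1: -log p(x) = -log p(z) - log |det J(x)|."""
    z, log2_det = affine_flow(x, mu, scale)
    return standard_normal_nll_bits(z) - log2_det

def gaussian_nll_bits(x, mu, scale):
    """Closed-form check: x follows N(mu, scale^2) under this toy flow."""
    nats = sum(0.5 * math.log(2 * math.pi * si * si) + 0.5 * ((xi - mi) / si) ** 2
               for xi, mi, si in zip(x, mu, scale))
    return nats / math.log(2)

x, mu, scale = [0.3, -1.2, 2.0], [0.0, -1.0, 1.5], [1.0, 0.5, 2.0]
print(flow_nll_bits(x, mu, scale), gaussian_nll_bits(x, mu, scale))
```

The two printed values should agree up to floating-point error; for a neural network flow, only the log-determinant computation changes.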
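Dequantization. To make a flow viable for discrete data $\mathbf{x}^\circ$, which is what many popular datasets provide, it is standard practice to define a derived discrete model $P(\mathbf{x}^\circ) := \int_{[0,1)^D} p(\mathbf{x}^\circ + \mathbf{u})\, d\mathbf{u}$ to be trained by minimizing a dequantization objective averaged over a dataset. A dequantization objective is a variational bound on the codelength of $P(\mathbf{x}^\circ)$:

$$\mathbb{E}_{\mathbf{u} \sim q(\mathbf{u}|\mathbf{x}^\circ)}\left[-\log \frac{p(\mathbf{x}^\circ + \mathbf{u})}{q(\mathbf{u}|\mathbf{x}^\circ)}\right] \geq -\log \int_{[0,1)^D} p(\mathbf{x}^\circ + \mathbf{u})\, d\mathbf{u} = -\log P(\mathbf{x}^\circ) \tag{2}$$

Here, $q(\mathbf{u}|\mathbf{x}^\circ)$ proposes dequantization noise $\mathbf{u} \in [0,1)^D$ that transforms discrete data $\mathbf{x}^\circ$ into continuous data $\mathbf{x}^\circ + \mathbf{u}$; it can be fixed to either a uniform distribution (uria2016neural; theis2016note; salimans2017pixelcnn++) or to another parameterized flow to be trained jointly with $f$ (ho2019flow++). This dequantization objective serves as a theoretical codelength for flow models trained on discrete data, just like negative log probability mass serves as a theoretical codelength for discrete generative models (theis2016note).

3 Local bits-back coding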
Our goal is to develop computationally efficient coding algorithms for flows trained with a dequantization objective, and we want to attain codelengths that closely match the theoretical codelengths (Eq. 2) these flows are trained to minimize. First, in Sections 3.1 to 3.4, we develop algorithms that use flows to code continuous data discretized to high precision. These algorithms will attain a codelength that matches the negative log density of the flow (Eq. 1), plus a constant that depends on the discretization precision. Second, in Section 3.5, we show how to adapt these algorithms to losslessly code data discretized to low precision (specifically, however many bits of precision are present in the dataset used to train the flow model), thereby attaining our desired codelength (Eq. 2) for discrete data.
3.1 Coding continuous data using discretization
We first address the problem of developing coding algorithms that attain codelengths given by negative log densities of flow models, such as Eq. 1. Probability density functions do not directly map to codelength, unlike probability mass functions, which enjoy the result of the Kraft-McMillan inequality. So, following standard procedure (cover2012elements, section 8.3), we discretize the data to a high precision $k$ and code this discretized data with a certain probability mass function derived from the density model. Specifically, we tile $\mathbb{R}^D$ with hypercubes of volume $\delta_x := 2^{-kD}$; we call each hypercube a bin. For $\mathbf{x} \in \mathbb{R}^D$, let $B(\mathbf{x})$ be the unique bin that contains $\mathbf{x}$, and let $\bar{\mathbf{x}}$ be the center of the bin $B(\mathbf{x})$. We call $\bar{\mathbf{x}}$ the discretized version of $\mathbf{x}$. For a sufficiently smooth probability density function $p(\mathbf{x})$, such as a density coming from a neural network flow model, the probability mass function $P(\bar{\mathbf{x}}) := \int_{B(\bar{\mathbf{x}})} p(\mathbf{x})\, d\mathbf{x}$ takes on the pleasingly simple form $P(\bar{\mathbf{x}}) \approx p(\bar{\mathbf{x}})\delta_x$ when the precision $k$ is large. Now we invoke the Kraft-McMillan inequality, so the theoretical codelength for $\bar{\mathbf{x}}$ using $P$ is

$$-\log P(\bar{\mathbf{x}}) \approx -\log p(\bar{\mathbf{x}})\delta_x \tag{3}$$

bits. This is the compression interpretation of the negative log density: it is a codelength for data discretized to high precision, when added to the total number of bits of discretization precision, $-\log \delta_x = kD$. It is this codelength, Eq. 3, that we will try to achieve with an efficient algorithm for flow models. We defer the problem of coding data discretized to low precision to Section 3.5.
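The approximation behind Eq. 3 can be checked numerically in one dimension. The sketch below assumes a standard normal density as a stand-in for a flow density and bins of width $2^{-k}$; the function names are illustrative. It compares the exact bin codelength with the negative log density at the bin center plus $k$ bits of precision.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def discretize(x, k):
    """Return (lower edge, center, width) of the width-2^-k bin containing x."""
    width = 2.0 ** -k
    lower = math.floor(x / width) * width
    return lower, lower + 0.5 * width, width

def exact_codelength_bits(x, k):
    """-log2 P(x_bar), with P the exact probability mass of the bin."""
    lower, _, width = discretize(x, k)
    return -math.log2(normal_cdf(lower + width) - normal_cdf(lower))

def approx_codelength_bits(x, k):
    """Eq. 3 in one dimension: -log2 p(x_bar) plus k bits of precision."""
    _, center, _ = discretize(x, k)
    return -math.log2(normal_pdf(center)) + k

for k in (2, 8, 12):
    print(k, exact_codelength_bits(0.3, k), approx_codelength_bits(0.3, k))
```

The gap between the two columns shrinks as $k$ grows, which is the sense in which the negative log density is a codelength for finely discretized data.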
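3.2 Background on bits-back coding
The main tool we will employ to develop our coding algorithms is bits-back coding (wallace1968information; hinton1993keeping; frey1997efficient; honkela2004variational), a coding technique originally designed for latent variable models (connecting bits-back coding to flow models is done in Section 3.3 and is new to our work). Bits-back coding codes $\mathbf{x}$ using a distribution of the form $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})$, where $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})$ includes a hidden variable $\mathbf{z}$. Bits-back coding is relevant when $\mathbf{z}$ ranges over an exponentially large set, making it intractable to code with $p(\mathbf{x})$ directly, even though coding with $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z})$ may be tractable individually. To code in this case, bits-back coding introduces a new distribution $q(\mathbf{z}|\mathbf{x})$ with tractable coding, and the encoder jointly encodes $\mathbf{z}$ along with $\mathbf{x}$ via these steps:

1. Decode $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})$ from an auxiliary source of random bits
2. Encode $\mathbf{x}$ using $p(\mathbf{x}|\mathbf{z})$
3. Encode $\mathbf{z}$ using $p(\mathbf{z})$

The first step, which decodes $\mathbf{z}$ from random bits, produces a sample $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})$. The second and third steps transmit $\mathbf{z}$ along with $\mathbf{x}$. At decoding time, the decoder recovers $(\mathbf{x}, \mathbf{z})$, then recovers the bits the encoder used to sample $\mathbf{z}$ using $q$. So, the encoder will have transmitted extra information in addition to $\mathbf{x}$: precisely $\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})}[-\log q(\mathbf{z}|\mathbf{x})]$ bits on average. Consequently, the net number of bits transmitted regarding $\mathbf{x}$ only will be $\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})}[\log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{x}, \mathbf{z})]$, which is redundant compared to the desired length $-\log p(\mathbf{x})$ by an amount equal to the KL divergence $D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}))$ from $q$ to the true posterior.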
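The three steps above can be sketched end to end on a toy model. The example assumes a binary latent, a three-valued observation, and an unbounded-integer variant of asymmetric numeral systems as the entropy coder; the probability tables, the total count `M`, and all function names are illustrative assumptions. A practical coder would renormalize the state into fixed-width words instead of relying on arbitrary-precision integers.

```python
M = 16  # every distribution is quantized to integer counts that sum to M

def make_table(counts):
    """Map each symbol to its (start, freq) slice of the range [0, M)."""
    assert sum(counts) == M and all(c > 0 for c in counts)
    table, start = {}, 0
    for symbol, freq in enumerate(counts):
        table[symbol] = (start, freq)
        start += freq
    return table

def push(state, table, symbol):
    """ANS encode: grows the state by about log2(M / freq) bits."""
    start, freq = table[symbol]
    return (state // freq) * M + start + state % freq

def pop(state, table):
    """ANS decode: exact inverse of push. On a state holding random bits,
    the returned symbol is a sample from the table."""
    slot = state % M
    for symbol, (start, freq) in table.items():
        if start <= slot < start + freq:
            return symbol, freq * (state // M) + slot - start
    raise ValueError("slot not covered by table")

P_Z = make_table([8, 8])                                       # prior p(z)
P_X_GIVEN_Z = [make_table([8, 4, 4]), make_table([2, 6, 8])]   # p(x|z)
Q_Z_GIVEN_X = [make_table([12, 4]), make_table([6, 10]),
               make_table([5, 11])]                            # q(z|x)

def encode(x, state):
    z, state = pop(state, Q_Z_GIVEN_X[x])    # step 1: decode z ~ q(z|x)
    state = push(state, P_X_GIVEN_Z[z], x)   # step 2: encode x using p(x|z)
    return push(state, P_Z, z)               # step 3: encode z using p(z)

def decode(state):
    z, state = pop(state, P_Z)               # undo step 3
    x, state = pop(state, P_X_GIVEN_Z[z])    # undo step 2
    return x, push(state, Q_Z_GIVEN_X[x], z) # undo step 1: the bits come back

def roundtrip(data, initial_state):
    state = initial_state
    for x in data:
        state = encode(x, state)
    decoded = []
    for _ in data:
        x, state = decode(state)
        decoded.append(x)
    return decoded[::-1], state  # the coder is a stack, so order is reversed

data = [0, 2, 1, 1, 0, 2]
initial_state = 0x9E3779B97F4A7C15  # stand-in for the auxiliary random bits
decoded, final_state = roundtrip(data, initial_state)
print(decoded == data, final_state == initial_state)
```

Recovering the initial state exactly is what "getting the bits back" means here: the auxiliary bits consumed in step 1 are returned to the decoder and so do not count toward the net codelength.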
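Bits-back coding also works with continuous $\mathbf{z}$ discretized to high precision, with negligible change in codelength (hinton1993keeping; townsend2018practical). In this case, $q(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$ are probability density functions. Discretizing $\mathbf{z}$ to bins $\bar{\mathbf{z}}$ of small volume $\delta_z$ and defining the probability mass functions $Q(\bar{\mathbf{z}}|\mathbf{x})$ and $P(\bar{\mathbf{z}})$ by the method in Section 3.1, we see that the bits-back codelength remains approximately unchanged:

$$\mathbb{E}_{\bar{\mathbf{z}} \sim Q(\bar{\mathbf{z}}|\mathbf{x})}\left[\log \frac{Q(\bar{\mathbf{z}}|\mathbf{x})}{p(\mathbf{x}|\bar{\mathbf{z}})P(\bar{\mathbf{z}})}\right] \approx \mathbb{E}_{\bar{\mathbf{z}} \sim Q(\bar{\mathbf{z}}|\mathbf{x})}\left[\log \frac{q(\bar{\mathbf{z}}|\mathbf{x})\delta_z}{p(\mathbf{x}|\bar{\mathbf{z}})p(\bar{\mathbf{z}})\delta_z}\right] \approx \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})}\left[\log \frac{q(\mathbf{z}|\mathbf{x})}{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}\right] \tag{4}$$

When bits-back coding is applied to a particular latent variable model, such as a VAE, the distributions involved may take on a certain meaning: $p(\mathbf{z})$ would be the prior, $p(\mathbf{x}|\mathbf{z})$ would be the decoder network, and $q(\mathbf{z}|\mathbf{x})$ would be the encoder network (kingma2013auto; rezende2014stochastic; dayan1995helmholtz; chen2016variational; frey1997efficient; townsend2018practical; kingma2019bitswap). However, it is important to note that these distributions do not need to correspond explicitly to parts of the model at hand. Any choice of $p(\mathbf{x}|\mathbf{z})$, $p(\mathbf{z})$, and $q(\mathbf{z}|\mathbf{x})$ will do for coding data losslessly (though some choices will result in better codelength). We exploit this fact in Section 3.3, where we apply bits-back coding to flow models by constructing artificial distributions $p(\mathbf{x}|\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$, which do not come with a flow model by default.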
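3.3 Local bits-back coding
We now present local bits-back coding, our new high-level principle for using a flow model to code data discretized to high precision. Following Section 3.1, we discretize continuous data $\mathbf{x}$ into $\bar{\mathbf{x}}$, which is the center of a bin of volume $\delta_x$. The codelength we desire for $\bar{\mathbf{x}}$ is the negative log density of $f$ (Eq. 1), plus a constant depending on the discretization precision:

$$-\log p(\bar{\mathbf{x}})\delta_x = -\log p(f(\bar{\mathbf{x}})) - \log\left|\det \mathbf{J}(\bar{\mathbf{x}})\right| - \log \delta_x \tag{5}$$

where $\mathbf{J}(\bar{\mathbf{x}})$ is the Jacobian of $f$ at $\bar{\mathbf{x}}$. We will construct two densities $\tilde{p}(\mathbf{z}|\mathbf{x})$ and $\tilde{p}(\mathbf{x}|\mathbf{z})$ such that bits-back coding attains Eq. 5. We need a small scalar parameter $\sigma > 0$, with which we define

$$\tilde{p}(\mathbf{z}|\mathbf{x}) := \mathcal{N}\left(\mathbf{z};\, f(\mathbf{x}),\, \sigma^2 \mathbf{J}(\mathbf{x})\mathbf{J}(\mathbf{x})^\top\right) \quad \text{and} \quad \tilde{p}(\mathbf{x}|\mathbf{z}) := \mathcal{N}\left(\mathbf{x};\, f^{-1}(\mathbf{z}),\, \sigma^2 \mathbf{I}\right) \tag{6}$$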
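To encode $\bar{\mathbf{x}}$, local bits-back coding follows the method described in Section 3.2 with continuous $\mathbf{z}$:

1. Decode $\bar{\mathbf{z}} \sim \tilde{P}(\bar{\mathbf{z}}|\bar{\mathbf{x}}) \approx \tilde{p}(\bar{\mathbf{z}}|\bar{\mathbf{x}})\delta_z$ from an auxiliary source of random bits
2. Encode $\bar{\mathbf{x}}$ using $\tilde{P}(\bar{\mathbf{x}}|\bar{\mathbf{z}}) \approx \tilde{p}(\bar{\mathbf{x}}|\bar{\mathbf{z}})\delta_x$
3. Encode $\bar{\mathbf{z}}$ using $P(\bar{\mathbf{z}}) \approx p(\bar{\mathbf{z}})\delta_z$

From these steps, we see that $\tilde{p}(\mathbf{z}|\mathbf{x})$ (Eq. 6) is artificially injected noise, scaled by $\sigma$ (the flow model remains unmodified). The distribution of this noise represents how a local linear approximation of $f$ would behave if it were to act on a small Gaussian around $\bar{\mathbf{x}}$.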
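To justify local bits-back coding, we simply calculate its expected codelength. First, our choices of $\tilde{p}(\mathbf{z}|\mathbf{x})$ and $\tilde{p}(\mathbf{x}|\mathbf{z})$ (Eq. 6) satisfy the following equation:

$$\mathbb{E}_{\mathbf{z} \sim \tilde{p}(\mathbf{z}|\mathbf{x})}\left[\log \tilde{p}(\mathbf{z}|\mathbf{x}) - \log \tilde{p}(\mathbf{x}|\mathbf{z})\right] = -\log\left|\det \mathbf{J}(\mathbf{x})\right| + O(\sigma^2) \tag{7}$$

Next, just like standard bits-back coding (Eq. 4), local bits-back coding attains an average codelength close to $\mathbb{E}_{\mathbf{z} \sim \tilde{p}(\mathbf{z}|\bar{\mathbf{x}})} L(\bar{\mathbf{x}}, \mathbf{z})$, where

$$L(\mathbf{x}, \mathbf{z}) := \log \tilde{p}(\mathbf{z}|\mathbf{x})\delta_z - \log \tilde{p}(\mathbf{x}|\mathbf{z})\delta_x - \log p(\mathbf{z})\delta_z \tag{8}$$

Equations 6, 7, and 8 imply that the expected codelength of local bits-back coding matches our desired codelength (Eq. 5), up to first order in $\sigma$ (see Appendix A for details):

$$\mathbb{E}_{\mathbf{z} \sim \tilde{p}(\mathbf{z}|\mathbf{x})} L(\mathbf{x}, \mathbf{z}) = -\log p(\mathbf{x})\delta_x + O(\sigma^2) \tag{9}$$

Note that local bits-back coding exactly achieves the desired codelength for flows (Eq. 5), up to first order in $\sigma$. This is in stark contrast to bits-back coding with latent variable models like VAEs, for which the bits-back codelength is the negative evidence lower bound, which is redundant by an amount equal to the KL divergence from the approximate posterior to the true posterior (kingma2013auto).

Local bits-back coding always codes losslessly, no matter the setting of $\sigma$, $\delta_x$, and $\delta_z$. However, $\sigma$ must be small for the $O(\sigma^2)$ inaccuracy in Eq. 9 to be negligible. But for $\sigma$ to be small, the discretization volumes $\delta_z$ and $\delta_x$ must be small too, otherwise the discretized Gaussians $\tilde{P}(\bar{\mathbf{z}}|\bar{\mathbf{x}})$ and $\tilde{P}(\bar{\mathbf{x}}|\bar{\mathbf{z}})$ will be poor approximations of the original Gaussians $\tilde{p}(\mathbf{z}|\mathbf{x})$ and $\tilde{p}(\mathbf{x}|\mathbf{z})$. So, because $\delta_x$ must be small, the data must be discretized to high precision. And, because $\delta_z$ must be small, a relatively large number of auxiliary bits must be available to decode $\bar{\mathbf{z}}$. We will resolve the high precision requirement for the data with another application of bits-back coding in Section 3.5, and we will explore the impact of varying $\sigma$, $\delta_x$, and $\delta_z$ on real-world data in experiments in Section 4.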
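The claim that the noise terms of Eq. 7 leave only the log-determinant, up to an error that vanishes with $\sigma$, can be checked numerically in one dimension. The sketch below assumes a toy flow $f(x) = \sinh(x)$, chosen only because its inverse and derivative are available in closed form, and evaluates the expectation by quadrature over the standardized noise. All names and parameter values are illustrative assumptions.

```python
import math

def flow(x):            # toy one-dimensional flow f(x) = sinh(x)
    return math.sinh(x)

def flow_inverse(z):
    return math.asinh(z)

def jacobian(x):        # f'(x) = cosh(x), always positive
    return math.cosh(x)

def normal_logpdf(v, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def expected_log_ratio_bits(x, sigma, half_width=8.0, steps=3200):
    """E[log p~(z|x) - log p~(x|z)] under z = f(x) + sigma * J(x) * eps,
    eps ~ N(0, 1), computed with a Riemann sum over eps."""
    j = jacobian(x)
    h = 2 * half_width / steps
    total = 0.0
    for n in range(steps + 1):
        eps = -half_width + n * h
        z = flow(x) + sigma * j * eps
        log_q = normal_logpdf(z, flow(x), sigma * abs(j))   # p~(z|x), Eq. 6
        log_p = normal_logpdf(x, flow_inverse(z), sigma)    # p~(x|z), Eq. 6
        weight = math.exp(-0.5 * eps * eps) / math.sqrt(2 * math.pi) * h
        total += weight * (log_q - log_p)
    return total / math.log(2)

x = 0.7
target = -math.log2(abs(jacobian(x)))   # right-hand side of Eq. 7 without the error term
for sigma in (0.1, 0.01):
    print(sigma, expected_log_ratio_bits(x, sigma) - target)
```

Shrinking $\sigma$ by a factor of 10 should shrink the printed gap by roughly a factor of 100, consistent with an $O(\sigma^2)$ error.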
3.4 Concrete local bits-back coding algorithms
We have shown that local bits-back coding attains the desired codelength (Eq. 5) for data discretized to high precision. Now, we instantiate local bits-back coding with concrete algorithms.
3.4.1 Black box flows
Algorithm 1 is the most straightforward implementation of local bits-back coding. It directly implements the steps in Section 3.3 by explicitly computing the Jacobian of the flow (using, say, automatic differentiation). It therefore makes no assumptions on the structure of the flow, and hence we call it the black box algorithm.

Coding with $\tilde{p}(\mathbf{x}|\mathbf{z})$ (Eq. 6) is efficient because its coordinates are independent (townsend2018practical). The same applies to the prior $p(\mathbf{z})$ if its coordinates are independent too, or if another efficient coding algorithm already exists for it (see Section 3.4.3). However, to enable efficient coding with $\tilde{p}(\mathbf{z}|\mathbf{x})$ (for instance, when decoding $\bar{\mathbf{z}}$ from auxiliary random bits during bits-back encoding), we must rely on the fact that any multivariate Gaussian can be converted into a linear autoregressive model, which can be coded efficiently, one coordinate at a time, using arithmetic coding or asymmetric numeral systems.

To see how, suppose $\mathbf{y} = \mathbf{J}\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\mathbf{J}$ is a full-rank matrix (such as a Jacobian of a flow model). Let $\mathbf{L}$ be the Cholesky decomposition of $\mathbf{J}\mathbf{J}^\top$. Since $\mathbf{L}\mathbf{L}^\top = \mathbf{J}\mathbf{J}^\top$, the distribution of $\mathbf{L}\boldsymbol{\epsilon}$ is equal to the distribution of $\mathbf{J}\boldsymbol{\epsilon} = \mathbf{y}$. So, solutions $\tilde{\mathbf{y}}$ to the linear system $\mathbf{L}^{-1}\tilde{\mathbf{y}} = \boldsymbol{\epsilon}$ have the same distribution as $\mathbf{y}$, and because $\mathbf{L}$ is triangular, $\mathbf{L}^{-1}$ is easily computable and also triangular, and thus solving for $\tilde{\mathbf{y}}$ can be done by substitution, one coordinate at a time: $\tilde{y}_i = \left(\epsilon_i - \sum_{j<i} (\mathbf{L}^{-1})_{ij}\tilde{y}_j\right) / (\mathbf{L}^{-1})_{ii}$, where $i$ increases from $1$ to $D$. In other words, $\tilde{y}_i$ given $\tilde{y}_1, \dots, \tilde{y}_{i-1}$ is Gaussian with mean $-\sum_{j<i} (\mathbf{L}^{-1})_{ij}\tilde{y}_j / (\mathbf{L}^{-1})_{ii}$ and standard deviation $1/(\mathbf{L}^{-1})_{ii}$, which is a linear autoregressive model that represents the same distribution as $\mathbf{y} = \mathbf{J}\boldsymbol{\epsilon}$.

If nothing is known about the structure of the Jacobian of the flow, Algorithm 1 requires $O(D^2)$ space to store the Jacobian and $O(D^3)$ time to compute the Cholesky decomposition. This is certainly an improvement on exponential space and time, which is what naive algorithms require (Section 1), but it is still not efficient enough for high-dimensional data in practice. To make our coding algorithms more efficient, we need to make additional assumptions on the flow. If the Jacobian is always block diagonal, say with fixed block size $c \times c$, then the steps in Algorithm 1 can be modified to process each block separately in parallel, thereby reducing the required space and time to $O(cD)$ and $O(c^2 D)$, respectively. This makes Algorithm 1 efficient for flows that operate as elementwise transformations or as $1 \times 1$ convolutions, such as activation normalization flows and invertible $1 \times 1$ convolution flows (kingma2018glow).
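A minimal sketch of this conversion, assuming a small dense matrix as a stand-in for a flow Jacobian and using a plain-Python Cholesky factorization. Rather than inverting $\mathbf{L}$, it works with $\mathbf{L}$ directly, which yields the same conditionals: the noise coordinates behind the already-coded prefix are recovered by forward substitution, and the next coordinate is then Gaussian with a mean that is linear in the prefix. The matrix values and function names are illustrative assumptions.

```python
import math

def matmul_transpose(j):
    """Return J J^T for a square matrix given as a list of rows."""
    n = len(j)
    return [[sum(j[a][k] * j[b][k] for k in range(n)) for b in range(n)]
            for a in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1):
            partial = sum(low[i][m] * low[k][m] for m in range(k))
            if i == k:
                low[i][k] = math.sqrt(a[i][i] - partial)
            else:
                low[i][k] = (a[i][k] - partial) / low[k][k]
    return low

def next_conditional(low, y_prefix):
    """Mean and standard deviation of y_i given y_1..y_{i-1}, where y = L eps."""
    eps = []
    for r, y_r in enumerate(y_prefix):   # recover eps_1..eps_{i-1} from the prefix
        mean_r = sum(low[r][c] * eps[c] for c in range(r))
        eps.append((y_r - mean_r) / low[r][r])
    i = len(y_prefix)
    mean_i = sum(low[i][c] * eps[c] for c in range(i))
    return mean_i, low[i][i]

J = [[2.0, 1.0], [0.5, 3.0]]   # stand-in for a full-rank flow Jacobian
cov = matmul_transpose(J)
L = cholesky(cov)
print(next_conditional(L, []))      # marginal of the first coordinate
print(next_conditional(L, [1.3]))   # second coordinate given the first is 1.3
```

Each conditional is a one-dimensional Gaussian, so it can be discretized and handed to any scalar entropy coder in sequence.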
3.4.2 Autoregressive flows
An autoregressive flow $\mathbf{z} = f(\mathbf{x})$ is a sequence of one-dimensional flows $z_i = f_i(x_i; \mathbf{x}_{1:i-1})$ for each coordinate $i \in \{1, \dots, D\}$ (papamakarios2017masked; kingma2016improved). Algorithm 2 shows how to code with an autoregressive flow in linear time and space. It never explicitly calculates and stores the Jacobian of the flow, unlike Algorithm 1. Rather, it invokes one-dimensional local bits-back coding on one coordinate of the data at a time, thus exploiting the structure of the autoregressive flow in an essential way.

A key difference between Algorithm 1 and Algorithm 2 is that the former needs to run the forward and inverse directions of the entire flow and compute and factorize a Jacobian, whereas the latter only needs to do so for each one-dimensional flow on each coordinate of the data. Consequently, Algorithm 2 runs in $O(D)$ time and space (excluding resource requirements of the flow itself). The encoding procedure of Algorithm 2 is similar to log likelihood computation for autoregressive flows, so the model evaluations it requires are completely parallelizable over dimensions. The decoding procedure, on the other hand, is similar to sampling, so it requires $D$ model evaluations in serial (the full decoding procedure is listed in Appendix B). These tradeoffs are entirely analogous to those of coding with discrete autoregressive models.

Autoregressive flows with further special structure lead to even more efficient implementations of Algorithm 2. As an example, let us focus on a NICE/RealNVP coupling layer (dinh2014nice; dinh2016density). This type of flow computes $\mathbf{z} = f(\mathbf{x})$ by splitting the coordinates of the input into two halves, $\mathbf{x}_{\leq D/2}$ and $\mathbf{x}_{>D/2}$. The first half is passed through unchanged as $\mathbf{z}_{\leq D/2} = \mathbf{x}_{\leq D/2}$, and the second half is passed through an elementwise transformation $\mathbf{z}_{>D/2} = f(\mathbf{x}_{>D/2}; \mathbf{x}_{\leq D/2})$ which is conditioned on the first half only. Specializing Algorithm 2 to this kind of flow allows both encoding and decoding to be parallelized over coordinates, reminiscent of how the forward and inverse directions for inference and sampling can be parallelized for these flows (dinh2014nice; dinh2016density). See Appendix B for the complete algorithm listing.

Efficient coding algorithms already exist for certain autoregressive flows. For example, if $f$ is an autoregressive flow whose prior $p(\mathbf{z}) = \prod_i p_i(z_i)$ is independent over coordinates, then $f$ can be rewritten as a continuous autoregressive model $p(x_i | \mathbf{x}_{1:i-1}) = p_i(f_i(x_i; \mathbf{x}_{1:i-1}))\,|f_i'(x_i; \mathbf{x}_{1:i-1})|$, which can be discretized and coded one coordinate at a time using arithmetic coding or asymmetric numeral systems. The advantage of Algorithm 2, as we will see next, is that it applies to more complex priors that prevent the distribution over $\mathbf{x}$ from naturally factorizing as an autoregressive model.
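The structure that makes coupling layers parallelizable can be seen in a small sketch. It assumes an affine coupling layer with a made-up conditioner `scale_and_shift` (any function of the first half would do) and an even input dimension; these are illustrative assumptions, not the layers of a trained model. Because the conditioner sees only the unchanged first half, the forward and inverse passes both evaluate it once, and every coordinate of the second half is an independent one-dimensional flow.

```python
import math

def scale_and_shift(x_first):
    """Hypothetical conditioner: log-scales and shifts computed from the first half."""
    log_scale = [math.tanh(v) for v in x_first]
    shift = [0.5 * v * v for v in x_first]
    return log_scale, shift

def coupling_forward(x):
    """z_first = x_first; z_second = x_second * exp(s) + t, elementwise."""
    half = len(x) // 2
    x_first, x_second = x[:half], x[half:]
    log_scale, shift = scale_and_shift(x_first)
    z_second = [v * math.exp(s) + t for v, s, t in zip(x_second, log_scale, shift)]
    # The Jacobian is triangular with diagonal (1, ..., 1, e^s_1, ..., e^s_half).
    log_det = sum(log_scale)
    return x_first + z_second, log_det

def coupling_inverse(z):
    half = len(z) // 2
    z_first, z_second = z[:half], z[half:]
    log_scale, shift = scale_and_shift(z_first)   # same inputs as the forward pass
    x_second = [(v - t) * math.exp(-s) for v, s, t in zip(z_second, log_scale, shift)]
    return z_first + x_second

x = [0.4, -1.1, 2.0, 0.3]
z, log_det = coupling_forward(x)
print(z, log_det, coupling_inverse(z))
```

In a coder, each second-half coordinate would get its own one-dimensional local bits-back step with Jacobian $e^{s_i}$, and none of those steps depends on another.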
3.4.3 Compositions of flows
Flows like NICE (dinh2014nice), RealNVP (dinh2016density), Glow (kingma2018glow), and Flow++ (ho2019flow++) are composed of many intermediate flows: they have the form $f(\mathbf{x}) = f_K \circ \cdots \circ f_1(\mathbf{x})$, where each of the $K$ layers $f_i$ is one of the types of flows discussed above. These models derive their density estimation power from applying simple flows many times, resulting in an extremely complex and expressive composite flow. The expressiveness of the composite flow suggests that coding will be difficult, but we can exploit the compositional structure to code efficiently. Since the composite flow $f = f_K \circ \cdots \circ f_1$ can be interpreted as a single flow $\mathbf{z}_1 = f_1(\mathbf{x})$ with a flow prior $f_K \circ \cdots \circ f_2(\mathbf{z}_1)$, all we have to do is code the first layer $f_1$ using the appropriate local bits-back coding algorithm, and when coding its output $\mathbf{z}_1$, we recursively invoke local bits-back coding for the prior $f_K \circ \cdots \circ f_2$ (kingma2019bitswap). A straightforward inductive argument shows that this leads to the correct codelength. If coding any $\mathbf{z}_1$ with $f_K \circ \cdots \circ f_2$ achieves the expected codelength $-\log p_{f_K \circ \cdots \circ f_2}(\mathbf{z}_1)\delta_z + O(\sigma^2)$, then the expected codelength for $f_1$, using $f_K \circ \cdots \circ f_2$ as a prior, is $-\log p_{f_K \circ \cdots \circ f_2}(f_1(\mathbf{x})) - \log|\det \mathbf{J}_1(\mathbf{x})| - \log \delta_x + O(\sigma^2)$. Continuing the same into $f_K \circ \cdots \circ f_2$, we conclude that the resulting expected codelength

$$-\log p(\mathbf{z}_K) - \sum_{i=1}^{K} \log\left|\det \mathbf{J}_i(\mathbf{z}_{i-1})\right| - \log \delta_x + O(\sigma^2) \tag{10}$$

where $\mathbf{z}_0 := \mathbf{x}$ and $\mathbf{z}_i := f_i(\mathbf{z}_{i-1})$, is what we expect from coding with the whole composite flow $f$. This codelength is averaged over noise injected into each layer $\mathbf{z}_i$, but we find that this is not an issue in practice. Our experiments in Section 4 show that it is easy to make $\sigma$ small enough to be negligible for neural network flow models, which are generally resistant to activation noise.

We call this the compositional algorithm. Its significance is that, provided that coding with each intermediate flow is efficient, coding with the composite flow is efficient too, despite the complexity of the composite flow as a function class. The composite flow's Jacobian never needs to be calculated or factorized, leading to dramatic speedups over using Algorithm 1 on the composite flow as a black box. Coding with RealNVP-type models needs just $O(D)$ time and space, is fully parallelizable, and attains state-of-the-art codelengths thanks to the cross entropy scores of these models (Section 4).
3.5 Dequantization for coding unrestricted-precision data
We have shown how to code data discretized to high precision, achieving codelengths close to $-\log p(\bar{\mathbf{x}})\delta_x$. In practice, however, data is usually discretized to low precision; for example, images from CIFAR-10 and ImageNet consist of integers in $\{0, 1, \dots, 255\}$. Coding this kind of data directly would force us to code at a precision much higher than 1, which would be a waste of bits.

To resolve this issue, we propose to use this extra precision within another bits-back coding scheme to arrive at a good lossless codelength for data at its original precision. Let us focus on the setting of coding integer-valued data $\mathbf{x}^\circ \in \mathbb{Z}^D$ up to precision $\delta_x$. Recall from Section 2 that flow models are trained on such data by minimizing a dequantization objective (Eq. 2), which we reproduce here:

$$\mathbb{E}_{\mathbf{u} \sim q(\mathbf{u}|\mathbf{x}^\circ)}\left[-\log \frac{p(\mathbf{x}^\circ + \mathbf{u})}{q(\mathbf{u}|\mathbf{x}^\circ)}\right] \tag{11}$$

Above, $q(\mathbf{u}|\mathbf{x}^\circ)$ is a dequantizer, which adds noise $\mathbf{u} \in [0,1)^D$ to turn $\mathbf{x}^\circ$ into continuous data $\mathbf{x}^\circ + \mathbf{u}$ for the flow model to fit (uria2016neural; theis2016note; salimans2017pixelcnn++; ho2019flow++). We assume that the dequantizer is itself provided as a flow model, specified by $\mathbf{u} = q_{\mathbf{x}^\circ}(\boldsymbol{\epsilon}) \in [0,1)^D$ for $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$, as in (ho2019flow++). In Algorithm 3, we propose a bits-back coding scheme in which $\bar{\mathbf{u}} \sim q(\bar{\mathbf{u}}|\mathbf{x}^\circ)\delta_x$ is decoded from auxiliary bits using local bits-back coding, and $\mathbf{x}^\circ + \bar{\mathbf{u}}$ is encoded using the original flow $f$, also using local bits-back coding.

The decoder, upon receiving $\mathbf{x}^\circ + \bar{\mathbf{u}}$, recovers the original $\mathbf{x}^\circ$ and $\bar{\mathbf{u}}$ by rounding (see Appendix B for the full pseudocode). So, the net codelength for Algorithm 3 is given by subtracting the bits needed to decode $\bar{\mathbf{u}}$ from the bits needed to encode $\mathbf{x}^\circ + \bar{\mathbf{u}}$:

$$-\log p(\mathbf{x}^\circ + \bar{\mathbf{u}})\delta_x - \left(-\log q(\bar{\mathbf{u}}|\mathbf{x}^\circ)\delta_x\right) = -\log p(\mathbf{x}^\circ + \bar{\mathbf{u}}) + \log q(\bar{\mathbf{u}}|\mathbf{x}^\circ) \tag{12}$$

This codelength closely matches the dequantization objective (Eq. 11) on average, and it is reasonable for the low-precision discrete data $\mathbf{x}^\circ$ because, as we stated in Section 2, it is a variational bound on the codelength of a certain discrete generative model for $\mathbf{x}^\circ$, and modern flow models are explicitly trained to minimize this bound (uria2016neural; theis2016note; ho2019flow++). The resulting code is lossless for $\mathbf{x}^\circ$, and Algorithm 3 thus provides a new compression interpretation of dequantization: it converts a code suitable for high precision data into a code suitable for low precision data, just as the dequantization objective (Eq. 11) converts a model suitable for continuous data into a model suitable for discrete data (theis2016note).
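The relationship between the dequantization codelength and the codelength of the derived discrete model can be illustrated in one dimension. The sketch assumes a Gaussian density with made-up parameters `MU` and `S` as a stand-in for a flow density, and a uniform dequantizer, for which $\log q(u|x^\circ) = 0$. It also shows the rounding step the decoder uses to split $x^\circ + u$ back into its parts. All names and values are illustrative assumptions.

```python
import math

MU, S = 3.2, 0.6   # toy continuous model p(x) = N(MU, S^2), standing in for a flow density

def log2_density(x):
    nats = -0.5 * math.log(2 * math.pi * S * S) - 0.5 * ((x - MU) / S) ** 2
    return nats / math.log(2)

def cdf(x):
    return 0.5 * (1.0 + math.erf((x - MU) / (S * math.sqrt(2))))

def exact_codelength_bits(x_int):
    """-log2 P(x) with P(x) the integral of p(x + u) over u in [0, 1)."""
    return -math.log2(cdf(x_int + 1) - cdf(x_int))

def dequantization_bound_bits(x_int, samples=1000):
    """E_u[-log2 p(x + u) + log2 q(u|x)] for uniform q, by midpoint quadrature."""
    total = 0.0
    for n in range(samples):
        u = (n + 0.5) / samples
        total += -log2_density(x_int + u)   # log2 q(u|x) = 0 for the uniform dequantizer
    return total / samples

def recover(x_continuous):
    """Decoder side: rounding down splits x + u back into (x, u)."""
    x_int = math.floor(x_continuous)
    return x_int, x_continuous - x_int

print(exact_codelength_bits(3), dequantization_bound_bits(3), recover(3 + 0.625))
```

The bound exceeds the exact discrete codelength by a small positive amount; a learned dequantizer, as in Flow++, is a way to shrink that gap.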
4 Experiments
We designed experiments to investigate the following: (1) how well local bits-back codelengths match the theoretical codelengths of modern, state-of-the-art flow models on high-dimensional data, (2) the effects of the precision and noise parameters $\delta$ and $\sigma$ on codelengths (Section 3.3), and (3) the computational efficiency of local bits-back coding for use in practice.

We focused on Flow++ (ho2019flow++), a state-of-the-art RealNVP-type flow that uses a flow-based dequantizer. Our coding implementation involves all concepts presented in this paper: Algorithm 1 for elementwise and convolution flows (kingma2018glow), Algorithm 2 for coupling layers, the compositional method of Section 3.4.3, and Algorithm 3 for dequantization. We used asymmetric numeral systems (ANS) (duda2013asymmetric), following the BB-ANS (townsend2018practical) and Bit-Swap (kingma2019bitswap) algorithms for VAEs (though the ideas behind our algorithms do not depend on ANS). We expect our implementation to easily extend to other models, like flows for video (kumar2019videoflow) and audio (prenger2019waveglow), though we leave that for future work.

Codelengths. Table 1 lists the local bits-back codelengths on the test sets of CIFAR-10, 32x32 ImageNet, and 64x64 ImageNet. The listed theoretical codelengths are the average negative log likelihoods of our model reimplementations (without importance sampling for the variational dequantization bound), and we find that our coding algorithm attains very similar lengths. To the best of our knowledge, these results are state-of-the-art for lossless compression with fully parallelizable compression and decompression.
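Table 1: Codelengths on test sets, in bits per dimension.

| Compression algorithm | CIFAR-10 | ImageNet 32x32 | ImageNet 64x64 |
|---|---|---|---|
| Theoretical | 3.116 | 3.871 | 3.701 |
| Local bits-back (ours) | 3.118 | 3.875 | 3.703 |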
Effects of precision and noise. Recall from Section 3.3 that the noise level $\sigma$ should be small to attain accurate codelengths. This means that the discretization volumes $\delta_x$ and $\delta_z$ should be small as well to make discretization effects negligible, at the expense of a larger requirement of auxiliary bits, which are not counted into bits-back codelengths (hinton1993keeping). Above, we fixed $\delta$ and $\sigma$, but here, we study the impact of varying them: on each dataset, we compressed 20 random datapoints in sequence, then calculated the local bits-back codelength and the auxiliary bits requirement; we did this for 5 random seeds and averaged the results. See Fig. 1 for CIFAR-10 results, and see Appendix C for results on all models with standard deviation bars. We indeed find that as $\delta$ and $\sigma$ decrease, the codelength becomes more accurate, and we find a sharp transition in performance when $\delta$ is too large relative to $\sigma$, indicating that coarse discretization destroys noise with small scale. Also, as expected, we find that the auxiliary bits requirement grows as $\delta$ shrinks. If auxiliary bits are not available, they must be counted into the codelength for the first datapoint (townsend2018practical; kingma2019bitswap), but the cost is negligible for long sequences, as one would have when encoding an entire test set or when encoding audio or video data with large numbers of frames (prenger2019waveglow; kumar2019videoflow).

Computational efficiency. We used OpenMP-based CPU code for compression with parallel ANS streams (giesen2014interleaved), with neural net operations running on a GPU. See Table 2 for encoding timings (decoding timings in Appendix C are nearly identical), averaged over 5 runs, on 16 CPU cores and 1 Titan X GPU. We timed the black box algorithm (Algorithm 1) and the compositional algorithm (Section 3.4.3) on single datapoints, and we also timed the latter with batches of datapoints, made possible by its low memory requirements (this was not possible with the black box algorithm, which already needs batching to compute the Jacobian for one datapoint). We find that the compositional algorithm is only slightly slower than running the neural net on its own, whereas the black box algorithm is significantly slower due to Jacobian computation. This confirms that our Jacobian-free coding techniques are crucial for practical use.
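Table 2: Encoding timings.

| Compression algorithm | Batch size | CIFAR-10 | ImageNet 32x32 | ImageNet 64x64 |
|---|---|---|---|---|
| Black box (Algorithm 1) | 1 | | | |
| Compositional (Section 3.4.3) | 1 | | | |
| Compositional (Section 3.4.3) | 64 | | | |
| Neural net only, without coding | 1 | | | |
| Neural net only, without coding | 64 | | | |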
5 Related work
We have built upon the bits-back argument (wallace1968information; hinton1993keeping) and its practical implementations (rissanen1976generalized; frey1997efficient; duda2013asymmetric; townsend2018practical; kingma2019bitswap). Our work enables flow models to perform lossless compression, which is already possible with VAEs and autoregressive models with certain tradeoffs. VAEs and flow models (RealNVP-type models specifically) currently attain similar theoretical codelengths on image datasets (ho2019flow++; maaloe2019biva) and have similarly fast coding algorithms, but VAEs are more difficult to train due to posterior collapse (chen2016variational), which implies worse codelengths unless they are very carefully tuned by the practitioner. Meanwhile, autoregressive models currently attain the best codelengths (2.80 bits/dim on CIFAR-10 and 3.44 bits/dim on ImageNet 64x64 (child2019generating)), but decoding is extremely slow due to serial model evaluations, just like sampling. Our compositional algorithm for RealNVP-type flows, on the other hand, is parallelizable over data dimensions and uses a single model pass for both encoding and decoding.

Concurrent work (gritsenko2019relationship) proposes Eq. 6 and its analysis in Appendix A to connect flows with VAEs to design new types of generative models, while by contrast, we take a pretrained, off-the-shelf flow model and employ Eq. 6 as artificial noise for compression. While the local bits-back coding concept and the black box Algorithm 1 work for any flow, our fast linear time coding algorithms are specialized to autoregressive flows and the RealNVP family; it would be interesting to find fast coding algorithms for other types of flows (grathwohl2018ffjord; behrmann2018invertible), investigate non-image modalities (kumar2019videoflow; prenger2019waveglow), and explore connections with other literature on compression with neural networks (balle2016end; balle2018variational; theis2017lossy; rippel2017real).
6 Conclusion
We presented local bits-back coding, a technique for designing lossless compression algorithms backed by flow models. Along with a compression interpretation of dequantization, we presented concrete coding algorithms for various types of flows, culminating in an algorithm for RealNVP-type models that is fully parallelizable for encoding and decoding, runs in linear time and space, and achieves codelengths very close to theoretical predictions on high-dimensional real-world datasets. As modern flow models are capable of attaining excellent theoretical codelengths via straightforward, stable training, we hope that they will become serious contenders for practical compression with the help of our algorithms, and more broadly, we hope that our work will open up new possibilities for compression technology to harness the density estimation power of modern deep generative models.
References
 [1] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Endtoend optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
 [2] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
 [3] Jens Behrmann, David Duvenaud, and JörnHenrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
 [4] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[5]
Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel.
PixelSNAIL: An improved autoregressive generative model.
In
International Conference on Machine Learning
, pages 863–871, 2018.  [6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
 [7] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [8] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995.
 [9] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems, pages 247–254, 1995.
 [10] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 [11] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 [12] Jarek Duda. Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.

 [13] Brendan J. Frey and Geoffrey E. Hinton. Efficient stochastic source coding and an application to a Bayesian network source model. The Computer Journal, 40(2 and 3):157–165, 1997.
 [14] Fabian Giesen. Interleaved entropy coders. arXiv preprint arXiv:1402.3392, 2014.
 [15] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

 [16] Alexey A Gritsenko, Jasper Snoek, and Tim Salimans. On the relationship between normalising flows and variational- and denoising autoencoders. 2019.
 [17] Geoffrey Hinton and Drew Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer, 1993.
 [18] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, 2019.
 [19] Antti Honkela and Harri Valpola. Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.
 [20] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
 [21] Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771–1779, 2017.
 [22] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
 [23] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
 [24] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [25] Friso H Kingma, Pieter Abbeel, and Jonathan Ho. Bit-Swap: Recursive bits-back coding for lossless compression with hierarchical latent variables. In International Conference on Machine Learning, 2019.
 [26] Leon Gordon Kraft. A device for quantizing, grouping, and coding amplitude-modulated pulses. PhD thesis, Massachusetts Institute of Technology, 1949.
 [27] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
 [28] Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. Biva: A very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102, 2019.
 [29] Brockway McMillan. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory, 2(4):115–116, 1956.
 [30] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.
 [31] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. arXiv preprint arXiv:1811.12817, 2018.
 [32] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
 [33] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
 [34] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
 [35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [36] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2922–2930. JMLR.org, 2017.
 [37] Jorma J Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of research and development, 20(3):198–203, 1976.
 [38] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 [39] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
 [40] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.
 [41] James Townsend, Thomas Bird, and David Barber. Practical lossless compression with latent variables using bits back coding. In International Conference on Learning Representations, 2019.
 [42] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
 [43] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

 [44] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning (ICML), 2016.
 [45] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016.
 [46] Christopher S Wallace and David M Boulton. An information measure for classification. The Computer Journal, 11(2):185–194, 1968.
Appendix A Details on local bits-back coding
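Here, we show that the expected codelength of local bits-back coding agrees with Eq. 5 up to first order in $\sigma$:
$$\mathbb{E}_z L(x, z) = -\log p(x)\delta_x + O(\sigma^2) \tag{13}$$
Sufficient conditions for the following argument are that the prior log density and the inverse of the flow have bounded derivatives of all orders. Let $y = f(x)$ and let $J$ be the Jacobian of $f$ at $x$. If we write $z = y + \sigma J \epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$, the local bits-back codelength satisfies:
$$\mathbb{E}_z L(x, z) + \log \delta_x = \underbrace{\mathbb{E}_z \log \mathcal{N}(z; y, \sigma^2 J J^\top)}_{(a)} - \underbrace{\mathbb{E}_z \log \mathcal{N}(x; f^{-1}(z), \sigma^2 I)}_{(b)} - \underbrace{\mathbb{E}_z \log p(z)}_{(c)} \tag{14}$$
We proceed by calculating each term. The first term (a) is the negative differential entropy of a Gaussian with covariance matrix $\sigma^2 J J^\top$, where $d$ denotes the dimension of $x$:
$$\mathbb{E}_z \log \mathcal{N}(z; y, \sigma^2 J J^\top) = -\frac{d}{2} \log(2\pi e \sigma^2) - \log |\det J| \tag{15}$$
We calculate the second term (b) by taking a Taylor expansion of $f^{-1}$ around $y$. Let $f^{-1}_i$ denote the $i$-th coordinate of $f^{-1}$. The inverse function theorem yields
\begin{align}
f^{-1}_i(z) &= f^{-1}_i(y) + \sigma \nabla f^{-1}_i(y)^\top J \epsilon + \frac{\sigma^2}{2} (J\epsilon)^\top \nabla^2 f^{-1}_i(y) (J\epsilon) + O(\sigma^3) \tag{16} \\
&= x_i + \sigma \epsilon_i + \frac{\sigma^2}{2} \epsilon^\top J^\top H_i J \epsilon + O(\sigma^3) \tag{17}
\end{align}
where $H_i = \nabla^2 f^{-1}_i(y)$. Write $r_i = \epsilon^\top J^\top H_i J \epsilon$, so that the previous equation can be written in vector form as $f^{-1}(z) = x + \sigma \epsilon + \frac{\sigma^2}{2} r + O(\sigma^3)$. With this in hand, term (b) reduces to:
\begin{align}
\mathbb{E}_z \log \mathcal{N}(x; f^{-1}(z), \sigma^2 I) &= -\frac{d}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \mathbb{E}_z \left\| f^{-1}(z) - x \right\|^2 \tag{18} \\
&= -\frac{d}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \mathbb{E}_\epsilon \left\| \sigma \epsilon + \frac{\sigma^2}{2} r + O(\sigma^3) \right\|^2 \tag{19} \\
&= -\frac{d}{2} \log(2\pi\sigma^2) - \frac{1}{2} \mathbb{E}_\epsilon \left[ \|\epsilon\|^2 + \sigma \epsilon^\top r + O(\sigma^2) \right] \tag{20}
\end{align}
Because the coordinates of $\epsilon$ are independent and have zero third moment, we have
$$\mathbb{E}_\epsilon \left[ \epsilon^\top r \right] = \sum_{i,j,k} (J^\top H_i J)_{jk} \, \mathbb{E}_\epsilon \left[ \epsilon_i \epsilon_j \epsilon_k \right] = 0 \tag{21}$$
which implies that
$$\mathbb{E}_z \log \mathcal{N}(x; f^{-1}(z), \sigma^2 I) = -\frac{d}{2} \log(2\pi e \sigma^2) + O(\sigma^2) \tag{22}$$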
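As a numerical sanity check of the first-order claim, the sketch below evaluates the expected local bits-back codelength for a small two-dimensional affine coupling flow and compares it with the flow's negative log likelihood. The flow, the functions `s` and `t`, the test point, the standard normal prior, and the use of Gauss-Hermite quadrature in place of sampling are all assumptions made for illustration and are not taken from the paper. Writing $y = f(x)$, $J$ for the Jacobian of $f$ at $x$, and $z = y + \sigma J \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the remaining prior term $-\mathbb{E}_z \log p(z)$ equals $-\log p(y) + O(\sigma^2)$ by the same Taylor argument, since its first-order term is linear in $\epsilon$ and has zero mean. The gap between the two quantities should therefore shrink roughly fourfold each time $\sigma$ is halved.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Hypothetical RealNVP-style coupling flow on R^2 (illustration only):
#   y1 = x1,  y2 = x2 * exp(s(x1)) + t(x1)
def s(u):
    return 0.5 * np.tanh(u)

def t(u):
    return np.sin(u)

def f(x):
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2 * np.exp(s(x1)) + t(x1)], axis=-1)

def f_inv(z):
    z1, z2 = z[..., 0], z[..., 1]
    return np.stack([z1, (z2 - t(z1)) * np.exp(-s(z1))], axis=-1)

def jacobian(x):
    """Jacobian of f at a single point x of shape (2,)."""
    x1, x2 = x
    ds = 0.5 / np.cosh(x1) ** 2  # s'(x1)
    dt = np.cos(x1)              # t'(x1)
    return np.array([[1.0, 0.0],
                     [x2 * ds * np.exp(s(x1)) + dt, np.exp(s(x1))]])

def log_prior(z):
    """Log density of a standard normal prior on R^2 (an assumption)."""
    return -0.5 * np.sum(z ** 2, axis=-1) - np.log(2 * np.pi)

def flow_nll(x):
    """-log p(x) = -log p(f(x)) - log|det J|."""
    J = jacobian(x)
    return -log_prior(f(x)) - np.log(abs(np.linalg.det(J)))

def expected_codelength(x, sigma, n=40):
    """E_z[L(x, z)] + log(delta_x), integrating over eps ~ N(0, I) by quadrature."""
    d = x.shape[-1]
    y, J = f(x), jacobian(x)
    nodes, weights = hermegauss(n)
    weights = weights / weights.sum()        # normalise to N(0, 1) probabilities
    e1, e2 = np.meshgrid(nodes, nodes, indexing="ij")
    eps = np.stack([e1, e2], axis=-1)        # shape (n, n, 2)
    w = weights[:, None] * weights[None, :]  # product weights, shape (n, n)
    z = y + sigma * eps @ J.T                # z = y + sigma * J * eps
    log_norm = 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # (a) log N(z; y, sigma^2 J J^T)
    log_enc = -0.5 * np.sum(eps ** 2, axis=-1) - log_norm - np.log(abs(np.linalg.det(J)))
    # (b) log N(x; f^{-1}(z), sigma^2 I)
    log_dec = -0.5 * np.sum((x - f_inv(z)) ** 2, axis=-1) / sigma ** 2 - log_norm
    # (c) log p(z)
    return float(np.sum(w * (log_enc - log_dec - log_prior(z))))

x = np.array([0.3, -0.5])
for sigma in (0.2, 0.1, 0.05):
    gap = expected_codelength(x, sigma) - flow_nll(x)
    print(f"sigma={sigma:<5} gap={gap:.6f}")
```

The quadrature replaces Monte Carlo sampling only so that the second-order gap is not masked by sampling noise; a practical codec would of course sample $z$ rather than integrate over it.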
Appendix B Full algorithms
This appendix lists the full pseudocode of our coding algorithms including decoding procedures, which we omitted from the main text for brevity.