Compression with Flows via Local Bits-Back Coding

05/21/2019 ∙ by Jonathan Ho, et al. ∙ UC Berkeley

Likelihood-based generative models are the backbones of lossless compression, due to the guaranteed existence of codes with lengths close to negative log likelihood. However, there is no guaranteed existence of computationally efficient codes that achieve these lengths, and coding algorithms must be hand-tailored to specific types of generative models to ensure computational efficiency. Such coding algorithms are known for autoregressive models and variational autoencoders, but not for general types of flow models. To fill in this gap, we introduce local bits-back coding, a new compression technique compatible with flow models. We present efficient algorithms that instantiate our technique for many popular types of flows, and we demonstrate that our algorithms closely achieve theoretical codelengths for state-of-the-art flow models on high-dimensional data.


1 Introduction

To devise a lossless compression algorithm means to devise a uniquely decodable code whose expected length is as close as possible to the entropy of the data. A general recipe for this is to first train a generative model by minimizing cross entropy to the data distribution, and then construct a code that achieves lengths close to the negative log likelihood of the model. This recipe is justified by classic results in information theory that ensure that the second step is possible—in other words, optimizing cross entropy optimizes the performance of some hypothetical compressor. And, thanks to recent advances in deep likelihood-based generative models, these hypothetical compressors are quite good. Autoregressive models, latent variable models, and flow models are now achieving state-of-the-art cross entropy scores on a wide variety of real-world datasets in speech, videos, text, images, and other domains (oord2016pixel, ; oord2016conditional, ; salimans2017pixelcnn++, ; parmar2018image, ; chen2018pixelsnail, ; menick2018generating, ; child2019generating, ; kalchbrenner2017video, ; dinh2014nice, ; dinh2016density, ; kingma2018glow, ; prenger2019waveglow, ; ho2019flow++, ; oord2016wavenet, ; kingma2013auto, ; maaloe2019biva, ).

But we are not interested in hypothetical compressors. We are interested in practical, computationally efficient compressors that scale to high-dimensional data and harness the excellent cross entropy scores of modern deep generative models. Unfortunately, naively applying existing codes, like Huffman coding (huffman1952method, ), requires computing the model likelihood for all possible values of the data, which expends computational resources scaling exponentially with the data dimension. This inefficiency stems from the lack of assumptions about the generative model’s structure.

Coding algorithms must be tailored to specific types of generative models if we want them to be efficient enough for practical use. There is already a rich literature of tailored coding algorithms for autoregressive models and variational autoencoders, assuming they are built from conditional distributions that are already tractable for coding (rissanen1976generalized, ; duda2013asymmetric, ; hinton1993keeping, ; frey1997efficient, ; townsend2018practical, ). On the other hand, there are currently no such algorithms for flow models in general (mentzer2018practical, ). This lack of efficient coding algorithms stands in odd contrast with the many advantages of flow models, like fast and realistic sampling, interpretable latent spaces, fast likelihood evaluation, competitive cross entropy scores, and ease of training with unbiased log likelihood gradients (dinh2014nice, ; dinh2016density, ; kingma2018glow, ; ho2019flow++, ).

To rectify this situation, we introduce local bits-back coding, a new technique for turning a general, pretrained, off-the-shelf flow model into an efficient coding algorithm suitable for continuous data discretized to high precision. We show how to implement local bits-back coding without assumptions on the flow structure, leading to an algorithm that runs in polynomial time and space with respect to the data dimension. Going further, we show how to tailor our implementation to various specific types of flows, culminating in an algorithm for RealNVP-type flows that runs in linear time and space with respect to the data dimension and is fully parallelizable for both encoding and decoding. We then show how to adapt local bits-back coding to losslessly code data discretized to arbitrarily low precision, and in doing so, we end up with a new compression interpretation of dequantization, a method commonly used to train flow models on discrete data. We test our algorithms on state-of-the-art flow models trained on real-world image datasets, and we find that they are computationally efficient and attain codelengths in close agreement with theoretical predictions. We plan on providing an open source code release.

2 Preliminaries

Lossless compression We begin by defining lossless compression of $D$-dimensional discrete data $\mathbf{x}$ using a probability mass function $p(\mathbf{x})$ represented by a generative model. It means to construct a uniquely decodable code $C$, which is an injective map from data sequences to binary strings, whose lengths $|C(\mathbf{x})|$ are close to $-\log p(\mathbf{x})$ (cover2012elements, ) (we always use base-2 logarithms). The rationale is that if the generative model is expressive and trained well, its cross entropy will be close to the entropy of the data distribution. So, if the lengths of $C$ match the model's negative log probabilities, the expected length of $C$ will be small, and hence $C$ will be a good compression algorithm. Constructing such a code is always possible in theory, because the Kraft-McMillan inequality (kraft1949device, ; mcmillan1956two, ) ensures that there always exists some code with lengths $\lceil -\log p(\mathbf{x}) \rceil$.
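As a tiny illustration (ours, not from the paper), the snippet below prints the codelength the Kraft-McMillan inequality guarantees for a toy probability mass function; the symbols and probabilities are arbitrary assumptions for the example.

```python
# Minimal illustration (not from the paper) of the codelength target: the
# Kraft-McMillan inequality guarantees a uniquely decodable code whose length
# for each symbol x is at most ceil(-log2 p(x)).
import math

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
for symbol, p in pmf.items():
    print(symbol, math.ceil(-math.log2(p)), "bits")   # 1, 2, 3, 3
```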

Flow models We wish to construct a computationally efficient code specialized to a flow model $f$, which is a differentiable bijection between continuous data $\mathbf{x} \in \mathbb{R}^D$ and latents $\mathbf{z} = f(\mathbf{x})$ (deco1995higher, ; dinh2014nice, ; dinh2016density, ). A flow model comes with a density $p(\mathbf{z})$ on the latent space, and thus has an associated sampling process—$\mathbf{x} = f^{-1}(\mathbf{z})$ for $\mathbf{z} \sim p(\mathbf{z})$—under which it defines a probability density function via the change-of-variables formula for densities:

$$\log p(\mathbf{x}) = \log p(f(\mathbf{x})) + \log \left|\det J(\mathbf{x})\right| \qquad (1)$$

where $J(\mathbf{x})$ denotes the Jacobian of $f$ at $\mathbf{x}$. Flow models are straightforward to train with maximum likelihood, as Eq. 1 allows unbiased exact log likelihood gradients to be computed efficiently.
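As a minimal illustration (ours, not from the paper's code release), the sketch below evaluates Eq. 1 for a toy elementwise flow with a standard-normal prior; the flow, its parameters, and all function names are assumptions made for the example.

```python
# Minimal sketch of Eq. 1 for a toy elementwise flow f(x) = tanh(w * x + b)
# with a standard-normal prior on z. All names are illustrative only.
import numpy as np

def flow_forward(x, w, b):
    """Apply the toy flow elementwise and return (z, log|det J(x)|)."""
    pre = w * x + b
    z = np.tanh(pre)
    # The Jacobian of an elementwise flow is diagonal, so the log-determinant
    # is a sum of log derivatives: d tanh(u)/du = 1 - tanh(u)^2.
    log_det = np.sum(np.log(np.abs(w)) + np.log1p(-np.tanh(pre) ** 2))
    return z, log_det

def log_prior(z):
    """Standard-normal prior density on the latent space (in nats)."""
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))

def log_density(x, w, b):
    """Change of variables (Eq. 1): log p(x) = log p(f(x)) + log|det J(x)|."""
    z, log_det = flow_forward(x, w, b)
    return log_prior(z) + log_det

x = np.random.randn(4)
print(log_density(x, w=0.5, b=0.1))
```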

Dequantization To make a flow viable for discrete data $\mathbf{x}$, such as images with integer pixel values, which is what many popular datasets provide, it is standard practice to define a derived discrete model $P(\mathbf{x}) := \int_{[0,1)^D} p(\mathbf{x} + \mathbf{u})\, d\mathbf{u}$ to be trained by minimizing a dequantization objective averaged over a dataset. A dequantization objective is a variational bound on the codelength of $P$:

$$-\log P(\mathbf{x}) \leq \mathbb{E}_{\mathbf{u} \sim q(\cdot\,|\,\mathbf{x})}\left[\log \frac{q(\mathbf{u}|\mathbf{x})}{p(\mathbf{x}+\mathbf{u})}\right] \qquad (2)$$

Here, $q(\mathbf{u}|\mathbf{x})$ proposes dequantization noise $\mathbf{u}$ that transforms discrete data $\mathbf{x}$ into continuous data $\mathbf{x} + \mathbf{u}$; it can be fixed to a uniform distribution (uria2016neural, ; theis2016note, ; salimans2017pixelcnn++, ) or it can be another parameterized flow to be trained jointly with $p$ (ho2019flow++, ). This dequantization objective serves as a theoretical codelength for flow models trained on discrete data, just like negative log probability mass serves as a theoretical codelength for discrete generative models (theis2016note, ).

3 Local bits-back coding

Our goal is to develop computationally efficient coding algorithms for flows trained with a dequantization objective, and we want to attain codelengths that closely match the theoretical codelengths (Eq. 2) these flows are trained to minimize. First, in Sections 3.1 to 3.4, we develop algorithms that use flows to code continuous data discretized to high precision. These algorithms will attain a codelength that matches the negative log density of the flow (Eq. 1), plus a constant that depends on the discretization precision. Second, in Section 3.5, we show how to adapt these algorithms to losslessly code data discretized to low precision—specifically, however many bits of precision are present in the dataset used to train the flow model—thereby attaining our desired codelength (Eq. 2) for discrete data.

3.1 Coding continuous data using discretization

We first address the problem of developing coding algorithms that attain codelengths given by negative log densities of flow models, such as Eq. 1. Probability density functions do not directly map to codelengths, unlike probability mass functions, which enjoy the result of the Kraft-McMillan inequality. So, following standard procedure (cover2012elements, , section 8.3), we discretize the data to a high precision $\delta_x$ and code this discretized data with a certain probability mass function derived from the density model. Specifically, we tile $\mathbb{R}^D$ with hypercubes of volume $\delta_x$; we call each hypercube a bin. For $\mathbf{x} \in \mathbb{R}^D$, let $B(\mathbf{x})$ be the unique bin that contains $\mathbf{x}$, and let $\bar{\mathbf{x}}$ be the center of the bin $B(\mathbf{x})$. We call $\bar{\mathbf{x}}$ the discretized version of $\mathbf{x}$. For a sufficiently smooth probability density function $p$, such as a density coming from a neural network flow model, the probability mass function $P(\bar{\mathbf{x}}) := \int_{B(\bar{\mathbf{x}})} p(\mathbf{x}')\, d\mathbf{x}'$ takes on the pleasingly simple form $P(\bar{\mathbf{x}}) \approx p(\bar{\mathbf{x}})\, \delta_x$ when the precision is high. Now we invoke the Kraft-McMillan inequality, so the theoretical codelength for $\bar{\mathbf{x}}$ using $P$ is

$$-\log P(\bar{\mathbf{x}}) \approx -\log p(\bar{\mathbf{x}}) - \log \delta_x \qquad (3)$$

bits. This is the compression interpretation of the negative log density: when added to the total number of bits of discretization precision, it is a codelength for data discretized to high precision. It is this codelength, Eq. 3, that we will try to achieve with an efficient algorithm for flow models. We defer the problem of coding data discretized to low precision to Section 3.5.
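A tiny numerical illustration (ours, not from the paper) of Eq. 3 in one dimension follows; the density and bin width are assumptions for the example.

```python
# For a smooth 1-D density, the bin mass P(x_bar) is close to p(x_bar) * delta,
# so the ideal codelength -log2 P(x_bar) is close to -log2 p(x_bar) - log2 delta.
import numpy as np
from scipy.stats import norm

delta = 1e-3                                         # bin volume (width, in 1-D)
x = 0.37
x_bar = (np.floor(x / delta) + 0.5) * delta          # center of the bin containing x
mass = norm.cdf(x_bar + delta / 2) - norm.cdf(x_bar - delta / 2)
print(-np.log2(mass))                                # ideal codelength for x_bar
print(-np.log2(norm.pdf(x_bar)) - np.log2(delta))    # Eq. 3 approximation
```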

3.2 Background on bits-back coding

The main tool we will employ to develop our coding algorithms is bits-back coding (wallace1968information, ; hinton1993keeping, ; frey1997efficient, ; honkela2004variational, ), a coding technique originally designed for latent variable models (connecting bits-back coding to flow models is done in Section 3.3 and is new to our work). Bits-back coding codes $\mathbf{x}$ using a distribution of the form $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})$, where $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})$ includes a hidden variable $\mathbf{z}$. Bits-back coding is relevant when $\mathbf{z}$ ranges over an exponentially large set, making it intractable to code with $p(\mathbf{x})$ directly, even though coding with $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z})$ may be tractable individually. To code $\mathbf{x}$ in this case, bits-back coding introduces a new distribution $q(\mathbf{z}|\mathbf{x})$ with tractable coding, and the encoder jointly encodes $\mathbf{x}$ along with $\mathbf{z}$ via these steps:

  1. Decode $\mathbf{z}$ from an auxiliary source of random bits using $q(\mathbf{z}|\mathbf{x})$

  2. Encode $\mathbf{x}$ using $p(\mathbf{x}|\mathbf{z})$

  3. Encode $\mathbf{z}$ using $p(\mathbf{z})$

The first step, which decodes from random bits, produces a sample $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})$. The second and third steps transmit $\mathbf{x}$ along with $\mathbf{z}$. At decoding time, the decoder recovers $\mathbf{x}$ and $\mathbf{z}$, then recovers the bits the encoder used to sample $\mathbf{z}$ by encoding $\mathbf{z}$ using $q(\mathbf{z}|\mathbf{x})$. So, the encoder will have transmitted extra information in addition to $\mathbf{x}$—precisely $\mathbb{E}_q[-\log q(\mathbf{z}|\mathbf{x})]$ bits on average. Consequently, the net number of bits transmitted regarding $\mathbf{x}$ only will be $\mathbb{E}_q[-\log p(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}) + \log q(\mathbf{z}|\mathbf{x})]$, which is redundant compared to the desired length $-\log p(\mathbf{x})$ by an amount equal to the KL divergence from $q(\mathbf{z}|\mathbf{x})$ to the true posterior $p(\mathbf{z}|\mathbf{x})$.
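As a concrete check of this accounting (ours, not from the paper), the sketch below computes the expected net bits-back codelength for a toy two-component latent variable model; with $q$ equal to the true posterior it reproduces $-\log p(\mathbf{x})$ exactly, and any other $q$ pays the KL penalty. The mixture and its probabilities are assumptions for the example.

```python
# Bits-back codelength accounting for a toy model p(x, z) = p(z) p(x|z)
# with posterior proposal q(z|x). All numbers are illustrative.
import numpy as np

p_z = np.array([0.5, 0.5])              # prior over latent z in {0, 1}
p_x_given_z = np.array([[0.9, 0.1],     # p(x | z=0)
                        [0.2, 0.8]])    # p(x | z=1)

def bits(p):
    return -np.log2(p)

def bits_back_length(x, q_z_given_x):
    """Expected net codelength: E_q[ -log p(x|z) - log p(z) + log q(z|x) ]."""
    return sum(q * (bits(p_x_given_z[z, x]) + bits(p_z[z]) - bits(q))
               for z, q in enumerate(q_z_given_x))

x = 0
p_x = sum(p_z[z] * p_x_given_z[z, x] for z in range(2))
true_posterior = np.array([p_z[z] * p_x_given_z[z, x] / p_x for z in range(2)])
print(bits_back_length(x, true_posterior), bits(p_x))  # equal: KL is zero
print(bits_back_length(x, np.array([0.5, 0.5])))       # larger by the KL divergence
```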

Bits-back coding also works with continuous $\mathbf{z}$ discretized to high precision, with negligible change in codelength (hinton1993keeping, ; townsend2018practical, ). In this case, $p(\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$ are probability density functions. Discretizing $\mathbf{z}$ into $\bar{\mathbf{z}}$ using bins of small volume $\delta_z$ and defining the probability mass functions $P(\bar{\mathbf{z}})$ and $Q(\bar{\mathbf{z}}|\mathbf{x})$ by the method in Section 3.1, we see that the bits-back codelength remains approximately unchanged:

$$\mathbb{E}_{\bar{\mathbf{z}} \sim Q}\left[-\log p(\mathbf{x}|\bar{\mathbf{z}}) - \log P(\bar{\mathbf{z}}) + \log Q(\bar{\mathbf{z}}|\mathbf{x})\right] \approx \mathbb{E}_{\mathbf{z} \sim q}\left[-\log p(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}) + \log q(\mathbf{z}|\mathbf{x})\right] \qquad (4)$$

because the $\log \delta_z$ terms from discretizing $p(\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$ cancel.

When bits-back coding is applied to a particular latent variable model, such as a VAE, the distributions involved may take on a certain meaning: $p(\mathbf{z})$ would be the prior, $p(\mathbf{x}|\mathbf{z})$ would be the decoder network, and $q(\mathbf{z}|\mathbf{x})$ would be the encoder network (kingma2013auto, ; rezende2014stochastic, ; dayan1995helmholtz, ; chen2016variational, ; frey1997efficient, ; townsend2018practical, ; kingma2019bitswap, ). However, it is important to note that these distributions do not need to correspond explicitly to parts of the model at hand. Any $p(\mathbf{x}|\mathbf{z})$, $p(\mathbf{z})$, and $q(\mathbf{z}|\mathbf{x})$ will do for coding data losslessly (though some choices will result in better codelengths). We exploit this fact in Section 3.3, where we apply bits-back coding to flow models by constructing artificial distributions $p(\mathbf{x}|\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$, which do not come with a flow model by default.

3.3 Local bits-back coding

We now present local bits-back coding, our new high-level principle for using a flow model to code data discretized to high precision. Following Section 3.1, we discretize continuous data $\mathbf{x}$ into $\bar{\mathbf{x}}$, which is the center of a bin of volume $\delta_x$. The codelength we desire for $\bar{\mathbf{x}}$ is the negative log density of Eq. 1, plus a constant depending on the discretization precision:

$$-\log\left(p(\bar{\mathbf{x}})\,\delta_x\right) = -\log p(f(\bar{\mathbf{x}})) - \log\left|\det J(\bar{\mathbf{x}})\right| - \log \delta_x \qquad (5)$$

where $J(\bar{\mathbf{x}})$ is the Jacobian of $f$ at $\bar{\mathbf{x}}$. We will construct two densities $p(\mathbf{x}|\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$ such that bits-back coding attains Eq. 5. We need a small scalar parameter $\sigma > 0$, with which we define

$$q(\mathbf{z}\,|\,\bar{\mathbf{x}}) = \mathcal{N}\!\left(\mathbf{z};\, f(\bar{\mathbf{x}}),\; \sigma^2 J(\bar{\mathbf{x}}) J(\bar{\mathbf{x}})^\top\right), \qquad p(\mathbf{x}\,|\,\mathbf{z}) = \mathcal{N}\!\left(\mathbf{x};\, f^{-1}(\mathbf{z}),\; \sigma^2 I\right) \qquad (6)$$

To encode $\bar{\mathbf{x}}$, local bits-back coding follows the method described in Section 3.2 with continuous $\mathbf{z}$ discretized to $\bar{\mathbf{z}}$ with bins of volume $\delta_z$:

  1. Decode $\bar{\mathbf{z}}$ from an auxiliary source of random bits using $q(\mathbf{z}|\bar{\mathbf{x}})$

  2. Encode $\bar{\mathbf{x}}$ using $p(\mathbf{x}|\bar{\mathbf{z}})$

  3. Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$

From these steps, we see that the $\bar{\mathbf{z}}$ decoded with Eq. 6 is artificially injected noise, scaled by $\sigma$ (the flow model itself remains unmodified). The distribution of this noise represents how a local linear approximation of $f$ would behave if it were to act on a small Gaussian around $\bar{\mathbf{x}}$.
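To make this concrete, here is a small numerical check (ours, not from the paper) of the local bits-back construction on a toy linear flow, using the Gaussian choices written in Eq. 6 above; the matrices, sample size, and names are assumptions made for the example.

```python
# Numerical check: with q(z|x) = N(f(x), sigma^2 J J^T) and
# p(x|z) = N(f^(-1)(z), sigma^2 I), the net codelength
# E_q[-log p(x|z) - log p(z) + log q(z|x)] should match -log p(x)
# up to O(sigma^2). Toy flow: f(x) = A x with a standard-normal prior.
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
D, sigma = 2, 1e-3
A = np.array([[2.0, 0.5], [0.0, 1.5]])          # Jacobian of the linear flow
A_inv = np.linalg.inv(A)
x = rng.standard_normal(D)

def log_prior(z):
    return mvn.logpdf(z, mean=np.zeros(D), cov=np.eye(D))

# True negative log density via the change of variables (Eq. 1), in nats.
neg_log_p_x = -(log_prior(A @ x) + np.log(abs(np.linalg.det(A))))

# Monte Carlo estimate of the expected net bits-back codelength.
cov_q = sigma**2 * A @ A.T
z_samples = rng.multivariate_normal(A @ x, cov_q, size=5000)
net = np.mean([
    -mvn.logpdf(x, mean=A_inv @ z, cov=sigma**2 * np.eye(D))  # encode x with p(x|z)
    - log_prior(z)                                            # encode z with the prior
    + mvn.logpdf(z, mean=A @ x, cov=cov_q)                    # bits gained decoding z
    for z in z_samples])
print(neg_log_p_x, net)   # agree up to O(sigma^2) and Monte Carlo error
```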

To justify local bits-back coding, we simply calculate its expected codelength. First, our choices of $p(\mathbf{x}|\mathbf{z})$ and $q(\mathbf{z}|\mathbf{x})$ (Eq. 6) satisfy the following equation for $\mathbf{z}$ close to $f(\bar{\mathbf{x}})$:

$$q(\mathbf{z}|\bar{\mathbf{x}}) \approx p(\bar{\mathbf{x}}|\mathbf{z})\,\left|\det J(\bar{\mathbf{x}})\right|^{-1} \qquad (7)$$

Next, just like standard bits-back coding (Eq. 4), local bits-back coding attains an average codelength close to $L_{\mathrm{LBB}}(\bar{\mathbf{x}})$, where

$$L_{\mathrm{LBB}}(\bar{\mathbf{x}}) = \mathbb{E}_{\bar{\mathbf{z}} \sim Q(\cdot|\bar{\mathbf{x}})}\left[-\log\left(p(\bar{\mathbf{x}}|\bar{\mathbf{z}})\,\delta_x\right) - \log\left(p(\bar{\mathbf{z}})\,\delta_z\right) + \log\left(q(\bar{\mathbf{z}}|\bar{\mathbf{x}})\,\delta_z\right)\right] \qquad (8)$$

Equations 7, 8 and 6 imply that the expected codelength of local bits-back coding matches our desired codelength (Eq. 5), up to first order in $\sigma$ (see Appendix A for details):

$$L_{\mathrm{LBB}}(\bar{\mathbf{x}}) = -\log p(f(\bar{\mathbf{x}})) - \log\left|\det J(\bar{\mathbf{x}})\right| - \log \delta_x + O(\sigma^2) \qquad (9)$$

Note that local bits-back coding exactly achieves the desired codelength for flows (Eq. 5), up to first order in $\sigma$. This is in stark contrast to bits-back coding with latent variable models like VAEs, for which the bits-back codelength is the negative evidence lower bound, which is redundant by an amount equal to the KL divergence from the approximate posterior to the true posterior (kingma2013auto, ).

Local bits-back coding always codes losslessly, no matter the setting of $\sigma$, $\delta_x$, and $\delta_z$. However, $\sigma$ must be small for the inaccuracy in Eq. 9 to be negligible. But for $\sigma$ to be small, the discretization volumes $\delta_x$ and $\delta_z$ must be small too, otherwise the discretized Gaussians $P(\bar{\mathbf{x}}|\bar{\mathbf{z}})$ and $Q(\bar{\mathbf{z}}|\bar{\mathbf{x}})$ will be poor approximations of the original Gaussians $p(\mathbf{x}|\bar{\mathbf{z}})$ and $q(\mathbf{z}|\bar{\mathbf{x}})$. So, because $\delta_x$ must be small, the data must be discretized to high precision. And, because $\delta_z$ must be small, a relatively large number of auxiliary bits must be available to decode $\bar{\mathbf{z}}$. We will resolve the high precision requirement for the data with another application of bits-back coding in Section 3.5, and we will explore the impact of varying $\sigma$, $\delta_x$, and $\delta_z$ on real-world data in experiments in Section 4.

3.4 Concrete local bits-back coding algorithms

We have shown that local bits-back coding attains the desired codelength (Eq. 5) for data discretized to high precision. Now, we instantiate local bits-back coding with concrete algorithms.

3.4.1 Black box flows

Algorithm 1 is the most straightforward implementation of local bits-back coding. It directly implements the steps in Section 3.3 by explicitly computing the Jacobian of the flow (using, say, automatic differentiation). It therefore makes no assumptions on the structure of the flow, and hence we call it the black box algorithm.

1: data $\bar{\mathbf{x}}$, flow $f$, discretization volumes $\delta_x, \delta_z$, noise level $\sigma$
2: $J \leftarrow J(\bar{\mathbf{x}})$ ▷ Compute the Jacobian of $f$ at $\bar{\mathbf{x}}$
3: Decode $\bar{\mathbf{z}}$ using $\mathcal{N}(f(\bar{\mathbf{x}}),\, \sigma^2 J J^\top)$ discretized to volume $\delta_z$ ▷ By converting to an AR model (Section 3.4.1)
4: Encode $\bar{\mathbf{x}}$ using $\mathcal{N}(f^{-1}(\bar{\mathbf{z}}),\, \sigma^2 I)$ discretized to volume $\delta_x$
5: Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
Algorithm 1 Local bits-back encoding: for black box flows (decoding in Appendix B)

Coding with $p(\mathbf{x}|\mathbf{z})$ in Eq. 6 is efficient because its coordinates are independent (townsend2018practical, ). The same applies to the prior $p(\mathbf{z})$ if its coordinates are independent too, or if another efficient coding algorithm already exists for it (see Section 3.4.3). However, to enable efficient coding with $q(\mathbf{z}|\bar{\mathbf{x}})$—for instance, when decoding $\bar{\mathbf{z}}$ from auxiliary random bits during bits-back encoding—we must rely on the fact that any multivariate Gaussian can be converted into a linear autoregressive model, which can be coded efficiently, one coordinate at a time, using arithmetic coding or asymmetric numeral systems.

To see how, suppose $\hat{\mathbf{z}} \sim \mathcal{N}(\boldsymbol{\mu},\, \sigma^2 A A^\top)$, where $\boldsymbol{\mu} \in \mathbb{R}^D$ and $A$ is a full-rank $D \times D$ matrix (such as a Jacobian of a flow model). Let $L$ be the Cholesky factor of $\sigma^2 A A^\top$. Since $L L^\top = \sigma^2 A A^\top$, the distribution of $\boldsymbol{\mu} + L\boldsymbol{\epsilon}$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$ is equal to the distribution of $\hat{\mathbf{z}}$. So, solutions $\boldsymbol{\epsilon}$ to the linear system $L\boldsymbol{\epsilon} = \hat{\mathbf{z}} - \boldsymbol{\mu}$ have the same distribution as $\mathcal{N}(\mathbf{0}, I)$, and because $L$ is triangular, solving for $\boldsymbol{\epsilon}$ can be done with back substitution: $\epsilon_i = \left(\hat{z}_i - \mu_i - \sum_{j < i} L_{ij}\epsilon_j\right)/L_{ii}$, where $i$ increases from $1$ to $D$. In other words, $\hat{z}_i \mid \hat{\mathbf{z}}_{1:i-1} \sim \mathcal{N}\!\left(\mu_i + \sum_{j<i} L_{ij}\epsilon_j,\; L_{ii}^2\right)$ is a linear autoregressive model that represents the same distribution as $\mathcal{N}(\boldsymbol{\mu},\, \sigma^2 A A^\top)$.
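A minimal numpy sketch of this Gaussian-to-autoregressive conversion (ours, not the paper's implementation) is below; it verifies that the product of the per-coordinate conditionals equals the joint density. All names are illustrative.

```python
# Convert N(mu, sigma^2 A A^T) into an equivalent linear autoregressive model
# via the Cholesky factor, so it can be coded one coordinate at a time.
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_to_autoregressive(mu, A, sigma):
    L = np.linalg.cholesky(sigma**2 * (A @ A.T))   # lower triangular, L @ L.T = covariance

    def conditionals(z):
        """Per-coordinate conditional means/stds of z_i given z_{<i}, plus whitened eps."""
        D = len(mu)
        eps = np.zeros(D)
        means, stds = np.zeros(D), np.diag(L).copy()
        for i in range(D):
            means[i] = mu[i] + L[i, :i] @ eps[:i]   # mean of z_i given z_{<i}
            eps[i] = (z[i] - means[i]) / L[i, i]    # solve L eps = z - mu, coordinate by coordinate
        return means, stds, eps

    return conditionals

# Usage: the conditionals define exactly the same density as the joint Gaussian.
rng = np.random.default_rng(0)
D = 4
mu, A, sigma = rng.standard_normal(D), rng.standard_normal((D, D)), 0.1
z = rng.multivariate_normal(mu, sigma**2 * (A @ A.T))
means, stds, _ = gaussian_to_autoregressive(mu, A, sigma)(z)
log_ar = np.sum(-0.5 * ((z - means) / stds) ** 2 - np.log(stds) - 0.5 * np.log(2 * np.pi))
log_joint = multivariate_normal.logpdf(z, mean=mu, cov=sigma**2 * (A @ A.T))
print(log_ar, log_joint)   # equal up to floating point error
```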

If nothing is known about the structure of the Jacobian of the flow, Algorithm 1 requires $O(D^2)$ space to store the Jacobian and $O(D^3)$ time to compute the Cholesky decomposition. This is certainly an improvement on exponential space and time, which is what naive algorithms require (Section 1), but it is still not efficient enough for high-dimensional data in practice. To make our coding algorithms more efficient, we need to make additional assumptions on the flow. If the Jacobian is always block diagonal, say with fixed block size $k$, then the steps in Algorithm 1 can be modified to process each block separately in parallel, thereby reducing the required space and time to $O(kD)$ and $O(k^2 D)$, respectively. This makes Algorithm 1 efficient for flows that operate as elementwise transformations or as convolutions, such as activation normalization flows and invertible convolution flows (kingma2018glow, ).

3.4.2 Autoregressive flows

An autoregressive flow computes $\mathbf{z} = f(\mathbf{x})$ via a sequence of one-dimensional flows $z_i = f_i(x_i;\, \mathbf{x}_{1:i-1})$, one for each coordinate $i$, where each $f_i$ is invertible in its first argument (papamakarios2017masked, ; kingma2016improved, ). Algorithm 2 shows how to code with an autoregressive flow in linear time and space. It never explicitly calculates and stores the Jacobian of the flow, unlike Algorithm 1. Rather, it invokes one-dimensional local bits-back coding on one coordinate of the data at a time, thus exploiting the structure of the autoregressive flow in an essential way.

1: data $\bar{\mathbf{x}}$, autoregressive flow $f$, discretization volumes $\delta_x, \delta_z$, noise level $\sigma$
2: for $i = D, \dots, 1$ do ▷ Iteration ordering not mandatory, but convenient for ANS
3:     Decode $\bar{z}_i$ using $\mathcal{N}\!\left(f_i(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2 f_i'(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1})^2\right)$ discretized to volume $\delta_z$ ▷ Neural net operations parallelizable over $i$
4:     Encode $\bar{x}_i$ using $\mathcal{N}\!\left(f_i^{-1}(\bar{z}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2\right)$ discretized to volume $\delta_x$
5: end for
6: Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
Algorithm 2 Local bits-back encoding: for autoregressive flows (decoding in Appendix B)

A key difference between Algorithm 1 and Algorithm 2 is that the former needs to run the forward and inverse directions of the entire flow and compute and factorize a Jacobian, whereas the latter only needs to do so for each one-dimensional flow on each coordinate of the data. Consequently, Algorithm 2 runs in $O(D)$ time and space (excluding resource requirements of the flow itself). The encoding procedure of Algorithm 2 is similar to log likelihood computation for autoregressive flows, so the model evaluations it requires are completely parallelizable over dimensions. The decoding procedure, on the other hand, is similar to sampling, so it requires $D$ model evaluations in serial (the full decoding procedure is listed in Appendix B). These tradeoffs are entirely analogous to those of coding with discrete autoregressive models.

Autoregressive flows with further special structure lead to even more efficient implementations of Algorithm 2. As an example, let us focus on a NICE/RealNVP coupling layer (dinh2014nice, ; dinh2016density, ). This type of flow computes $\mathbf{z} = f(\mathbf{x})$ by splitting the coordinates of the input into two halves, $\mathbf{x}_1$ and $\mathbf{x}_2$. The first half is passed through unchanged as $\mathbf{z}_1 = \mathbf{x}_1$, and the second half is passed through an elementwise transformation $\mathbf{z}_2 = g(\mathbf{x}_2;\, \mathbf{x}_1)$ which is conditioned on the first half only. Specializing Algorithm 2 to this kind of flow allows both encoding and decoding to be parallelized over coordinates, reminiscent of how the forward and inverse directions for inference and sampling can be parallelized for these flows (dinh2014nice, ; dinh2016density, ). See Appendix B for the complete algorithm listing.
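For concreteness, here is a toy affine coupling layer (ours, not the paper's model) that makes the structure explicit: the scale and translation depend only on the first half, so the per-coordinate coding steps for the second half are independent of one another. The coupling network is an arbitrary stand-in.

```python
# Illustrative RealNVP-style affine coupling layer:
# x = (x1, x2) -> z = (x1, x2 * exp(s(x1)) + t(x1)).
# The Jacobian restricted to x2 is diagonal and depends only on x1, which is
# why the per-coordinate steps of Algorithm 2 parallelize for this flow.
import numpy as np

def scale_translate(x1):
    """Stand-in for the coupling network; any function of x1 alone would do."""
    return np.tanh(x1), 0.5 * x1            # (log-scale s(x1), translation t(x1))

def coupling_forward(x1, x2):
    s, t = scale_translate(x1)
    z2 = x2 * np.exp(s) + t                 # elementwise, conditioned on x1 only
    return x1, z2, np.sum(s)                # log|det J| = sum of log-scales

def coupling_inverse(z1, z2):
    s, t = scale_translate(z1)
    return z1, (z2 - t) * np.exp(-s)

x1, x2 = np.random.randn(3), np.random.randn(3)
z1, z2, log_det = coupling_forward(x1, x2)
print(np.allclose(coupling_inverse(z1, z2)[1], x2))   # True: the layer is invertible
```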

Efficient coding algorithms already exist for certain autoregressive flows. For example, if $f$ is an autoregressive flow whose prior $p(\mathbf{z})$ is independent over coordinates, then $p(\mathbf{x})$ can be rewritten as a continuous autoregressive model $\prod_i p(x_i \mid \mathbf{x}_{1:i-1})$, which can be discretized and coded one coordinate at a time using arithmetic coding or asymmetric numeral systems. The advantage of Algorithm 2, as we will see next, is that it applies to more complex priors that prevent the distribution over $\mathbf{x}$ from naturally factorizing as an autoregressive model.

3.4.3 Compositions of flows

Flows like NICE (dinh2014nice, ), RealNVP (dinh2016density, ), Glow (kingma2018glow, ), and Flow++ (ho2019flow++, ) are composed of many intermediate flows: they have the form $f = f_N \circ \cdots \circ f_1$, where each of the layers $f_\ell$ is one of the types of flows discussed above. These models derive their density estimation power from applying simple flows many times, resulting in an extremely complex and expressive composite flow. The expressiveness of the composite flow suggests that coding will be difficult, but we can exploit the compositional structure to code efficiently. Since the composite flow $f$ can be interpreted as a single flow $f_1$ with a flow prior given by $f_N \circ \cdots \circ f_2$, all we have to do is code the first layer $f_1$ using the appropriate local bits-back coding algorithm, and when coding its output $\mathbf{z}_1 = f_1(\bar{\mathbf{x}})$, we recursively invoke local bits-back coding for the prior (kingma2019bitswap, ). A straightforward inductive argument shows that this leads to the correct codelength. If coding any $\mathbf{z}_\ell$ with the flow $f_N \circ \cdots \circ f_{\ell+1}$ achieves the expected codelength $-\log p_\ell(\mathbf{z}_\ell) - \log \delta_z$, where $p_\ell$ is the density defined by that flow, then the expected codelength for $\mathbf{z}_{\ell-1}$, using $f_\ell$ with $p_\ell$ as a prior, is $-\log p_{\ell-1}(\mathbf{z}_{\ell-1}) - \log \delta_z$, again up to first order in $\sigma$. Continuing the same argument down to $\mathbf{z}_0 = \bar{\mathbf{x}}$, we conclude that the resulting expected codelength

$$-\log p(\mathbf{z}_N) - \sum_{\ell=1}^{N} \log\left|\det J_\ell(\mathbf{z}_{\ell-1})\right| - \log \delta_x \qquad (10)$$

where $\mathbf{z}_\ell = f_\ell(\mathbf{z}_{\ell-1})$ and $\mathbf{z}_0 = \bar{\mathbf{x}}$, is what we expect from coding $\bar{\mathbf{x}}$ with the whole composite flow $f$. This codelength is averaged over noise injected into each layer $f_\ell$, but we find that this is not an issue in practice. Our experiments in Section 4 show that it is easy to make $\sigma$ small enough to be negligible for neural network flow models, which are generally resistant to activation noise.

We call this the compositional algorithm. Its significance is that, provided that coding with each intermediate flow is efficient, coding with the composite flow is efficient too, despite the complexity of the composite flow as a function class. The composite flow's Jacobian never needs to be calculated or factorized, leading to dramatic speedups over using Algorithm 1 on the composite flow as a black box. Coding with RealNVP-type models needs just $O(D)$ time and space, is fully parallelizable, and attains state-of-the-art codelengths thanks to the cross entropy scores of these models (Section 4).
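The bookkeeping behind Eq. 10 can be sketched as follows (ours, not the paper's coder, which interleaves actual ANS encode/decode operations); each layer is assumed to expose a forward pass returning its output and log-determinant.

```python
# Compositional codelength bookkeeping (Eq. 10): code layer by layer, treating
# the output of each layer as the "data" for the next, and only code against
# the simple prior at the very end.
import numpy as np

def compositional_codelength(x, layers, log_prior, delta_x):
    """Expected codelength (nats) for x discretized to bin volume delta_x."""
    total, h = -np.log(delta_x), x
    for forward in layers:
        h, log_det = forward(h)             # local bits-back for this layer contributes
        total -= log_det                    # -log|det J_l| on average (Eq. 9)
    return total - log_prior(h)             # finally, code the last output with the prior

# Toy usage: two scaling layers and a standard-normal prior, reported in bits.
layers = [lambda h: (2.0 * h, h.size * np.log(2.0)),
          lambda h: (0.5 * h, h.size * np.log(0.5))]
log_prior = lambda z: -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
print(compositional_codelength(np.random.randn(4), layers, log_prior, 1e-6) / np.log(2))
```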

3.5 Dequantization for coding unrestricted-precision data

We have shown how to code data discretized to high precision, achieving codelengths close to $-\log p(\bar{\mathbf{x}}) - \log \delta_x$ (Eq. 5). In practice, however, data is usually discretized to low precision; for example, images from CIFAR10 and ImageNet consist of integers in $\{0, 1, \dots, 255\}$. Coding this kind of data directly would force us to code at a precision much higher than its inherent precision of 1, which would be a waste of bits.

To resolve this issue, we propose to use this extra precision within another bits-back coding scheme to arrive at a good lossless codelength for data at its original precision. Let us focus on the setting of coding integer-valued data. Recall from Section 2 that flow models are trained on such data by minimizing a dequantization objective (Eq. 2), which we reproduce here:

$$-\log P(\mathbf{x}) \leq \mathbb{E}_{\mathbf{u} \sim q(\cdot\,|\,\mathbf{x})}\left[\log \frac{q(\mathbf{u}|\mathbf{x})}{p(\mathbf{x}+\mathbf{u})}\right] \qquad (11)$$

Above, $q(\mathbf{u}|\mathbf{x})$ is a dequantizer, which adds noise $\mathbf{u} \in [0,1)^D$ to turn $\mathbf{x}$ into continuous data $\mathbf{x} + \mathbf{u}$ for the flow model $p$ to fit (uria2016neural, ; theis2016note, ; salimans2017pixelcnn++, ; ho2019flow++, ). We assume that the dequantizer is itself provided as a flow model, specified by $\mathbf{u} = g(\boldsymbol{\epsilon};\, \mathbf{x})$ for noise $\boldsymbol{\epsilon}$ drawn from a fixed prior, as in (ho2019flow++, ). In Algorithm 3, we propose a bits-back coding scheme in which $\bar{\mathbf{u}}$ is decoded from auxiliary bits using local bits-back coding with the dequantization flow, and $\mathbf{x} + \bar{\mathbf{u}}$ is encoded using the original flow, also using local bits-back coding.

1: discrete data $\mathbf{x}$, flow density $p$, dequantization flow conditional density $q(\mathbf{u}|\mathbf{x})$, discretization volume $\delta$
2: Decode $\bar{\mathbf{u}}$ using $q(\cdot\,|\,\mathbf{x})$ via local bits-back coding
3: $\bar{\mathbf{y}} \leftarrow \mathbf{x} + \bar{\mathbf{u}}$ ▷ Dequantize
4: Encode $\bar{\mathbf{y}}$ using $p$ via local bits-back coding
Algorithm 3 Local bits-back encoding with variational dequantization (decoding in Appendix B)

The decoder, upon receiving $\bar{\mathbf{y}} = \mathbf{x} + \bar{\mathbf{u}}$, recovers the original $\mathbf{x}$ and $\bar{\mathbf{u}}$ by rounding (see Appendix B for the full pseudocode). So, the net codelength for Algorithm 3 is given by subtracting the bits needed to decode $\bar{\mathbf{u}}$ from the bits needed to encode $\mathbf{x} + \bar{\mathbf{u}}$:

$$-\log\left(p(\mathbf{x} + \bar{\mathbf{u}})\,\delta\right) - \left(-\log\left(q(\bar{\mathbf{u}}|\mathbf{x})\,\delta\right)\right) = \log \frac{q(\bar{\mathbf{u}}|\mathbf{x})}{p(\mathbf{x} + \bar{\mathbf{u}})} \qquad (12)$$

This codelength closely matches the dequantization objective (Eq. 11) on average, and it is reasonable for the low-precision discrete data because, as we stated in Section 2, it is a variational bound on the codelength of a certain discrete generative model for $\mathbf{x}$, and modern flow models are explicitly trained to minimize this bound (uria2016neural, ; theis2016note, ; ho2019flow++, ). The resulting code is lossless for $\mathbf{x}$, and Algorithm 3 thus provides a new compression interpretation of dequantization: it converts a code suitable for high precision data into a code suitable for low precision data, just as the dequantization objective (Eq. 11) converts a model suitable for continuous data into a model suitable for discrete data (theis2016note, ).
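The net-codelength bookkeeping of Algorithm 3 can be sketched as follows (ours, not the paper's implementation), assuming both the dequantizer and the model are coded with local bits-back at the same precision so the $\log\delta$ terms cancel as in Eq. 12. All names and the toy densities are assumptions.

```python
# Net codelength of Algorithm 3: bits spent encoding x + u with p, minus bits
# recovered by decoding u from auxiliary bits with q(u|x).
import numpy as np

def algorithm3_net_codelength(x_int, sample_u, log_q, log_p):
    """Return (u, net codelength in bits)."""
    u = sample_u(x_int)                     # u decoded from auxiliary bits
    spent = -log_p(x_int + u)               # bits spent encoding x + u
    recovered = -log_q(u, x_int)            # bits the decoder gives back
    return u, (spent - recovered) / np.log(2)

# Toy usage: uniform dequantizer (log q = 0) and a standard-normal "flow" density.
rng = np.random.default_rng(0)
x_int = np.array([0.0, 1.0, 2.0])
_, bits = algorithm3_net_codelength(
    x_int,
    sample_u=lambda x: rng.random(x.size),
    log_q=lambda u, x: 0.0,
    log_p=lambda y: -0.5 * np.sum(y ** 2 + np.log(2 * np.pi)))
print(bits)
```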

4 Experiments

We designed experiments to investigate the following: (1) how well local bits-back codelengths match the theoretical codelengths of modern, state-of-the-art flow models on high-dimensional data, (2) the effects of the precision and noise parameters $\delta$ and $\sigma$ on codelengths (Section 3.3), and (3) the computational efficiency of local bits-back coding for use in practice.

We focused on Flow++ (ho2019flow++, ), a state-of-the-art RealNVP-type flow that uses a flow-based dequantizer. Our coding implementation involves all concepts presented in this paper: Algorithm 1 for elementwise and convolution flows (kingma2018glow, ), Algorithm 2 for coupling layers, the compositional method of Section 3.4.3, and Algorithm 3 for dequantization. We used asymmetric numeral systems (ANS) (duda2013asymmetric, ), following the BB-ANS (townsend2018practical, ) and Bit-Swap (kingma2019bitswap, ) algorithms for VAEs (though the ideas behind our algorithms do not depend on ANS). We expect our implementation to easily extend to other models, like flows for video (kumar2019videoflow, ) and audio (prenger2019waveglow, ), though we leave that for future work.

Codelengths Table 1 lists the local bits-back codelengths on the test sets of CIFAR10, 32x32 ImageNet, and 64x64 ImageNet. The listed theoretical codelengths are the average negative log likelihoods of our model reimplementations (without importance sampling for the variational dequantization bound), and we find that our coding algorithm attains very similar lengths. To the best of our knowledge, these results are state-of-the-art for lossless compression with fully parallelizable compression and decompression.

Compression algorithm CIFAR10 ImageNet 32x32 ImageNet 64x64
Theoretical 3.116 3.871 3.701
Local bits-back (ours) 3.118 3.875 3.703
Table 1: Local bits-back codelengths (in bits per dimension)

Effects of precision and noise Recall from Section 3.3 that the noise level $\sigma$ should be small to attain accurate codelengths. This means that the discretization volumes $\delta_x$ and $\delta_z$ should be small as well to make discretization effects negligible, at the expense of a larger requirement of auxiliary bits, which are not counted into bits-back codelengths (hinton1993keeping, ). Above, we fixed $\delta$ and $\sigma$; here, we study the impact of varying them: on each dataset, we compressed 20 random datapoints in sequence, then calculated the local bits-back codelength and the auxiliary bits requirement; we did this for 5 random seeds and averaged the results. See Fig. 1 for CIFAR results, and see Appendix C for results on all models with standard deviation bars. We indeed find that as $\delta$ and $\sigma$ decrease, the codelength becomes more accurate, and we find a sharp transition in performance when $\delta$ is too large relative to $\sigma$, indicating that coarse discretization destroys noise with small scale. Also, as expected, we find that the auxiliary bits requirement grows as $\delta$ shrinks. If auxiliary bits are not available, they must be counted into the codelength for the first datapoint (townsend2018practical, ; kingma2019bitswap, ), but the cost is negligible for long sequences, as one would have when encoding an entire test set or when encoding audio or video data with large numbers of frames (prenger2019waveglow, ; kumar2019videoflow, ).

Figure 1: Effects of the precision and noise parameters $\delta$ and $\sigma$ on coding a random subset of CIFAR10

Computational efficiency We used OpenMP-based CPU code for compression with parallel ANS streams (giesen2014interleaved, ), with neural net operations running on a GPU. See Table 2 for encoding timings (decoding timings in Appendix C are nearly identical), averaged over 5 runs, on 16 CPU cores and 1 Titan X GPU. We timed the black box algorithm (Algorithm 1) and the compositional algorithm (Section 3.4.3) on single datapoints, and we also timed the latter with batches of 64 datapoints, made possible by its low memory requirements (this was not possible with the black box algorithm, which already needs batching to compute the Jacobian for one datapoint). We find that the compositional algorithm is only slightly slower than running the neural net on its own, whereas the black box algorithm is significantly slower due to Jacobian computation. This confirms that our Jacobian-free coding techniques are crucial for practical use.

Compression algorithm Batch size CIFAR10 ImageNet 32x32 ImageNet 64x64
Black box (Algorithm 1) 1
Compositional (Section 3.4.3) 1
64
Neural net only, without coding 1
64
Table 2: Encoding time (in seconds per datapoint). Decoding times are nearly identical (Appendix C)

5 Related work

We have built upon the bits-back argument (wallace1968information, ; hinton1993keeping, ) and its practical implementations (rissanen1976generalized, ; frey1997efficient, ; duda2013asymmetric, ; townsend2018practical, ; kingma2019bitswap, ). Our work enables flow models to perform lossless compression, which is already possible with VAEs and autoregressive models with certain tradeoffs. VAEs and flow models (RealNVP-type models specifically) currently attain similar theoretical codelengths on image datasets (ho2019flow++, ; maaloe2019biva, ) and have similarly fast coding algorithms, but VAEs are more difficult to train due to posterior collapse (chen2016variational, ), which implies worse codelengths unless they are very carefully tuned by the practitioner. Meanwhile, autoregressive models currently attain the best codelengths (2.80 bits/dim on CIFAR10 and 3.44 bits/dim on ImageNet 64x64 (child2019generating, )), but decoding is extremely slow due to serial model evaluations, just like sampling. Our compositional algorithm for RealNVP-type flows, on the other hand, is parallelizable over data dimensions and uses a single model pass for both encoding and decoding.

Concurrent work (gritsenko2019relationship, ) proposes Eq. 6 and an analysis like that of Appendix A to connect flows with VAEs and design new types of generative models; by contrast, we take a pretrained, off-the-shelf flow model and employ Eq. 6 as artificial noise for compression. While the local bits-back coding concept and the black-box Algorithm 1 work for any flow, our fast linear-time coding algorithms are specialized to autoregressive flows and the RealNVP family; it would be interesting to find fast coding algorithms for other types of flows (grathwohl2018ffjord, ; behrmann2018invertible, ), investigate non-image modalities (kumar2019videoflow, ; prenger2019waveglow, ), and explore connections with other literature on compression with neural networks (balle2016end, ; balle2018variational, ; theis2017lossy, ; rippel2017real, ).

6 Conclusion

We presented local bits-back coding, a technique for designing lossless compression algorithms backed by flow models. Along with a compression interpretation of dequantization, we presented concrete coding algorithms for various types of flows, culminating in an algorithm for RealNVP-type models that is fully parallelizable for encoding and decoding, runs in linear time and space, and achieves codelengths very close to theoretical predictions on high-dimensional real-world datasets. As modern flow models are capable of attaining excellent theoretical codelengths via straightforward, stable training, we hope that they will become serious contenders for practical compression with the help of our algorithms, and more broadly, we hope that our work will open up new possibilities for compression technology to harness the density estimation power of modern deep generative models.

References

Appendix A Details on local bits-back coding

Here, we show that the expected codelength of local bits-back coding agrees with Eq. 5 up to first order in $\sigma$:

$$L_{\mathrm{LBB}}(\bar{\mathbf{x}}) = -\log p(f(\bar{\mathbf{x}})) - \log\left|\det J(\bar{\mathbf{x}})\right| - \log \delta_x + O(\sigma^2) \qquad (13)$$

Sufficient conditions for the following argument are that the prior log density $\log p(\mathbf{z})$ and the inverse $f^{-1}$ of the flow have bounded derivatives of all orders. Let $\bar{\mathbf{z}} = f(\bar{\mathbf{x}})$ and let $J = J(\bar{\mathbf{x}})$ be the Jacobian of $f$ at $\bar{\mathbf{x}}$. If we write $\hat{\mathbf{z}} = \bar{\mathbf{z}} + \sigma J\boldsymbol{\epsilon}$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$, so that $\hat{\mathbf{z}} \sim q(\cdot\,|\,\bar{\mathbf{x}})$, the local bits-back codelength satisfies:

$$L_{\mathrm{LBB}}(\bar{\mathbf{x}}) \approx \underbrace{\mathbb{E}\left[\log q(\hat{\mathbf{z}}|\bar{\mathbf{x}})\right]}_{(a)} + \underbrace{\mathbb{E}\left[-\log p(\bar{\mathbf{x}}|\hat{\mathbf{z}})\right]}_{(b)} + \underbrace{\mathbb{E}\left[-\log p(\hat{\mathbf{z}})\right]}_{(c)} - \log \delta_x \qquad (14)$$

We proceed by calculating each term. The first term (a) is the negative differential entropy of a Gaussian with covariance matrix $\sigma^2 J J^\top$:

$$(a) = -\frac{D}{2}\log\left(2\pi e \sigma^2\right) - \log\left|\det J\right| \qquad (15)$$

We calculate the second term (b) by taking a Taylor expansion of $f^{-1}$ around $\bar{\mathbf{z}}$. Let $f^{-1}_i$ denote the $i$-th coordinate of $f^{-1}$. The inverse function theorem yields

$$f^{-1}_i(\hat{\mathbf{z}}) = f^{-1}_i(\bar{\mathbf{z}}) + \sigma\left(J^{-1} J\boldsymbol{\epsilon}\right)_i + \frac{\sigma^2}{2}\,(J\boldsymbol{\epsilon})^\top H_i\, (J\boldsymbol{\epsilon}) + O(\sigma^3) \qquad (16)$$
$$= \bar{x}_i + \sigma\epsilon_i + \frac{\sigma^2}{2}\,(J\boldsymbol{\epsilon})^\top H_i\, (J\boldsymbol{\epsilon}) + O(\sigma^3) \qquad (17)$$

where $H_i$ is the Hessian of $f^{-1}_i$ at $\bar{\mathbf{z}}$. Write $\mathbf{h}(\boldsymbol{\epsilon})$ for the vector with coordinates $h_i(\boldsymbol{\epsilon}) = (J\boldsymbol{\epsilon})^\top H_i (J\boldsymbol{\epsilon})$, so that the previous equation can be written in vector form as $f^{-1}(\hat{\mathbf{z}}) = \bar{\mathbf{x}} + \sigma\boldsymbol{\epsilon} + \frac{\sigma^2}{2}\mathbf{h}(\boldsymbol{\epsilon}) + O(\sigma^3)$. With this in hand, term (b) reduces to:

$$(b) = \frac{D}{2}\log\left(2\pi\sigma^2\right) + \mathbb{E}\left[\frac{\left\|\bar{\mathbf{x}} - f^{-1}(\hat{\mathbf{z}})\right\|^2}{2\sigma^2}\right]\log e \qquad (18)$$
$$= \frac{D}{2}\log\left(2\pi\sigma^2\right) + \mathbb{E}\left[\frac{\left\|\sigma\boldsymbol{\epsilon} + \frac{\sigma^2}{2}\mathbf{h}(\boldsymbol{\epsilon}) + O(\sigma^3)\right\|^2}{2\sigma^2}\right]\log e \qquad (19)$$
$$= \frac{D}{2}\log\left(2\pi\sigma^2\right) + \left(\frac{D}{2} + \frac{\sigma}{2}\,\mathbb{E}\left[\boldsymbol{\epsilon}^\top\mathbf{h}(\boldsymbol{\epsilon})\right] + O(\sigma^2)\right)\log e \qquad (20)$$

Because the coordinates of $\boldsymbol{\epsilon}$ are independent and have zero third moment, we have

$$\mathbb{E}\left[\boldsymbol{\epsilon}^\top\mathbf{h}(\boldsymbol{\epsilon})\right] = \sum_{i} \mathbb{E}\left[\epsilon_i\,(J\boldsymbol{\epsilon})^\top H_i\,(J\boldsymbol{\epsilon})\right] = 0 \qquad (21)$$

which implies that

$$(b) = \frac{D}{2}\log\left(2\pi\sigma^2\right) + \frac{D}{2}\log e + O(\sigma^2) \qquad (22)$$

The final term (c) is given by

$$(c) = \mathbb{E}\left[-\log p(\bar{\mathbf{z}} + \sigma J\boldsymbol{\epsilon})\right] \qquad (23)$$
$$= -\log p(\bar{\mathbf{z}}) - \sigma\,\mathbb{E}\left[\nabla\log p(\bar{\mathbf{z}})^\top J\boldsymbol{\epsilon}\right] + O(\sigma^2) \qquad (24)$$
$$= -\log p(f(\bar{\mathbf{x}})) + O(\sigma^2) \qquad (25)$$

Altogether, summing Eqs. 25, 22 and 15 yields the total codelength

$$L_{\mathrm{LBB}}(\bar{\mathbf{x}}) = -\log p(f(\bar{\mathbf{x}})) - \log\left|\det J(\bar{\mathbf{x}})\right| - \log \delta_x + O(\sigma^2) \qquad (26)$$

which, to first order, does not depend on $\sigma$, and matches Eq. 5.

Appendix B Full algorithms

This appendix lists the full pseudocode of our coding algorithms including decoding procedures, which we omitted from the main text for brevity.

1: flow $f$, discretization volumes $\delta_x, \delta_z$, noise level $\sigma$
2: procedure Encode($\bar{\mathbf{x}}$)
3:     $J \leftarrow J(\bar{\mathbf{x}})$ ▷ Compute the Jacobian of $f$ at $\bar{\mathbf{x}}$
4:     Decode $\bar{\mathbf{z}}$ using $\mathcal{N}(f(\bar{\mathbf{x}}),\, \sigma^2 J J^\top)$ discretized to volume $\delta_z$ ▷ By converting to an AR model (Section 3.4.1)
5:     Encode $\bar{\mathbf{x}}$ using $\mathcal{N}(f^{-1}(\bar{\mathbf{z}}),\, \sigma^2 I)$ discretized to volume $\delta_x$
6:     Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
7: end procedure
8:
9: procedure Decode( )
10:     Decode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
11:     Decode $\bar{\mathbf{x}}$ using $\mathcal{N}(f^{-1}(\bar{\mathbf{z}}),\, \sigma^2 I)$ discretized to volume $\delta_x$
12:     $J \leftarrow J(\bar{\mathbf{x}})$ ▷ Compute the Jacobian of $f$ at $\bar{\mathbf{x}}$
13:     Encode $\bar{\mathbf{z}}$ using $\mathcal{N}(f(\bar{\mathbf{x}}),\, \sigma^2 J J^\top)$ discretized to volume $\delta_z$ ▷ By converting to an AR model (Section 3.4.1)
14:     return $\bar{\mathbf{x}}$
15: end procedure
Algorithm 1 Local bits-back coding: for black box flows
1: autoregressive flow $f$, discretization volumes $\delta_x, \delta_z$, noise level $\sigma$
2: procedure Encode($\bar{\mathbf{x}}$)
3:     for $i = D, \dots, 1$ do ▷ Iteration ordering not mandatory, but convenient for ANS
4:         Decode $\bar{z}_i$ using $\mathcal{N}\!\left(f_i(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2 f_i'(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1})^2\right)$ discretized to volume $\delta_z$ ▷ Neural net operations parallelizable over $i$
5:         Encode $\bar{x}_i$ using $\mathcal{N}\!\left(f_i^{-1}(\bar{z}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2\right)$ discretized to volume $\delta_x$
6:     end for
7:     Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
8: end procedure
9:
10: procedure Decode( )
11:     Decode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
12:     for $i = 1, \dots, D$ do ▷ Order should be the opposite of encoding when using ANS
13:         Decode $\bar{x}_i$ using $\mathcal{N}\!\left(f_i^{-1}(\bar{z}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2\right)$ discretized to volume $\delta_x$
14:         Encode $\bar{z}_i$ using $\mathcal{N}\!\left(f_i(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1}),\; \sigma^2 f_i'(\bar{x}_i;\, \bar{\mathbf{x}}_{1:i-1})^2\right)$ discretized to volume $\delta_z$
15:     end for
16:     return $\bar{\mathbf{x}}$
17: end procedure
Algorithm 2 Local bits-back coding: for autoregressive flows
1: coupling layer $f$, discretization volumes $\delta_x, \delta_z$, noise level $\sigma$
2: $f$ has the form $f(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1,\, g(\mathbf{x}_2;\, \mathbf{x}_1))$, where $g(\cdot\,;\mathbf{x}_1)$ operates elementwise
3: write $\bar{\mathbf{x}} = (\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2)$ and $\bar{\mathbf{z}} = (\bar{\mathbf{z}}_1, \bar{\mathbf{z}}_2)$
4: procedure Encode($\bar{\mathbf{x}}$)
5:     for each coordinate $i$ of $\bar{\mathbf{x}}_2$ do ▷ Neural net operations parallelizable over $i$
6:         Decode $\bar{z}_{2,i}$ using $\mathcal{N}\!\left(g_i(\bar{x}_{2,i};\, \bar{\mathbf{x}}_1),\; \sigma^2 g_i'(\bar{x}_{2,i};\, \bar{\mathbf{x}}_1)^2\right)$ discretized to volume $\delta_z$
7:         Encode $\bar{x}_{2,i}$ using $\mathcal{N}\!\left(g_i^{-1}(\bar{z}_{2,i};\, \bar{\mathbf{x}}_1),\; \sigma^2\right)$ discretized to volume $\delta_x$
8:     end for
9:     for each coordinate $i$ of $\bar{\mathbf{x}}_1$ do
10:         $\bar{z}_{1,i} \leftarrow \bar{x}_{1,i}$
11:     end for
12:     Encode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
13: end procedure
14:
15: procedure Decode( )
16:     Decode $\bar{\mathbf{z}}$ using $p(\mathbf{z})$ discretized to volume $\delta_z$
17:     for each coordinate $i$ of $\bar{\mathbf{z}}_1$ do
18:         $\bar{x}_{1,i} \leftarrow \bar{z}_{1,i}$
19:     end for
20:     for each coordinate $i$ of $\bar{\mathbf{z}}_2$ do ▷ Neural net operations parallelizable over $i$
21:         Decode $\bar{x}_{2,i}$ using $\mathcal{N}\!\left(g_i^{-1}(\bar{z}_{2,i};\, \bar{\mathbf{x}}_1),\; \sigma^2\right)$ discretized to volume $\delta_x$
22:         Encode $\bar{z}_{2,i}$ using $\mathcal{N}\!\left(g_i(\bar{x}_{2,i};\, \bar{\mathbf{x}}_1),\; \sigma^2 g_i'(\bar{x}_{2,i};\, \bar{\mathbf{x}}_1)^2\right)$ discretized to volume $\delta_z$
23:     end for
24:     return $\bar{\mathbf{x}} = (\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2)$
25: end procedure
Algorithm 2 Local bits-back coding: for autoregressive flows, specialized to coupling layers
1: flow density $p$, dequantization flow conditional density $q(\mathbf{u}|\mathbf{x})$, discretization volume $\delta$
2: procedure Encode($\mathbf{x}$) ▷ $\mathbf{x}$ is discrete data
3:     Decode $\bar{\mathbf{u}}$ using $q(\cdot\,|\,\mathbf{x})$ via local bits-back coding
4:     $\bar{\mathbf{y}} \leftarrow \mathbf{x} + \bar{\mathbf{u}}$ ▷ Dequantize
5:     Encode $\bar{\mathbf{y}}$ using $p$ via local bits-back coding
6: end procedure
7:
8: procedure Decode( )
9:     Decode $\bar{\mathbf{y}}$ using $p$ via local bits-back coding
10:     $\mathbf{x} \leftarrow \lfloor\bar{\mathbf{y}}\rfloor$ ▷ Quantize
11:     $\bar{\mathbf{u}} \leftarrow \bar{\mathbf{y}} - \mathbf{x}$
12:     Encode $\bar{\mathbf{u}}$ using $q(\cdot\,|\,\mathbf{x})$ via local bits-back coding
13:     return $\mathbf{x}$
14: end procedure
Algorithm 3 Local bits-back coding with variational dequantization