Variable-Bitrate Neural Compression via Bayesian Arithmetic Coding

02/18/2020 ∙ by Yibo Yang, et al. ∙ 14

Deep Bayesian latent variable models have enabled new approaches to both model and data compression. Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as variational autoencoders, in post-processing. The approach thus separates model design and training from the compression task. Our algorithm generalizes arithmetic coding to the continuous domain, using adaptive discretization accuracy that exploits estimates of posterior uncertainty. A consequence of the "plug and play" nature of our approach is that various rate-distortion trade-offs can be achieved with a single trained model, eliminating the need to train multiple models for different bit rates. Our experimental results demonstrate the importance of taking into account posterior uncertainties, and show that image compression with the proposed algorithm outperforms JPEG over a wide range of bit rates using only a single machine learning model. Further experiments on Bayesian neural word embeddings demonstrate the versatility of the proposed method.





1 Introduction

Probabilistic latent-variable models have become a mainstay of modern machine learning. Scalable approximate Bayesian inference methods, in particular Black Box Variational Inference (Ranganath et al., 2014; Rezende et al., 2014), have spurred the development of increasingly large and expressive probabilistic models, including deep generative probabilistic models such as variational autoencoders (Kingma and Welling, 2014a) and Bayesian neural networks (MacKay, 1992; Blundell et al., 2015). One natural application of deep latent variable modeling is data compression, and recent work has focused on end-to-end procedures that optimize a model for a particular compression objective. Here, we study a related but different problem: given a trained model, what is the best way to encode the information contained in its continuous latent variables?

As we demonstrate, our proposed solution provides an entirely new “plug & play” approach to lossy compression that separates the compression task from modeling and training. Our method can be applied to both model and data compression, and allows tuning the trade-off between bitrate and reconstruction quality without the need to retrain the model.

Compression aims to best describe some data in as few bits as possible. For continuous-valued data like natural images, videos, or distributed representations, digital compression is necessarily lossy, as arbitrary real numbers cannot be perfectly represented by a finite number of bits. Lossy compression algorithms therefore typically find a discrete approximation of some semantic representation of the data, which is then encoded with a lossless compression method.

In classical lossy compression methods such as JPEG or MP3, the semantic representation is carefully designed to support compression at variable bitrates. By contrast, state-of-the-art deep-learning-based approaches to lossy data compression (Ballé et al., 2017, 2018; Rippel and Bourdev, 2017; Mentzer et al., 2018; Lombardo et al., 2019) are trained to minimize a distortion metric at a fixed bitrate. To support variable-bitrate compression, one has to train several models for different bitrates. While training several models may be viable in many cases, a bigger issue with this approach is the increase in decoder size, as the decoder has to store the parameters of not one but several deep neural networks, one for each bitrate setting. In applications like video streaming under fluctuating connectivity, the decoder further has to load a new deep learning model into memory every time a change in bandwidth requires adjusting the bitrate.

By contrast, we propose a lossy neural compression method that decouples training from compression, and that enables variable-bitrate compression with a single model. We generalize a classical entropy coding algorithm, Arithmetic Coding (Witten et al., 1987; MacKay, 2003), from discrete data to the continuous domain. At the heart of the proposed Bayesian Arithmetic Coding algorithm is an adaptive quantization scheme that exploits posterior uncertainty estimates to automatically reduce the accuracy of latent variables for which the model is uncertain anyway. This strategy is analogous to the way humans communicate quantitative information. For example, Wikipedia lists the population of Rome in 2017 with the specific number . By contrast, its population in the year  AD is estimated by the rounded number because the high uncertainty would make a more precise number meaningless. Our ablation studies show that this posterior-informed quantization scheme is crucial to obtaining competitive performance.

In detail, our contributions are as follows:

  • A new algorithm. We present a fundamentally novel approach to compressing latent variables in a variational inference framework. Our approach generalizes arithmetic coding from discrete to continuous distributions and takes posterior uncertainty into account.

  • Single-model compression at variable bitrates. The decoupling of modeling and compression allows us to adjust the trade-off between bitrate and distortion in post-processing. This is in contrast to existing approaches to both data and model compression, which often require specialized models for each bitrate.

  • Automatic self-pruning. Deep latent variable models often exhibit posterior collapse, i.e., the variational posterior collapses to the model prior. In our approach, latent dimensions with collapsed posteriors require close to zero bits and thus do not require manual pruning.

  • Competitive experimental performance. We show that our method outperforms JPEG over a wide range of bitrates using only a single model. We also show that we can successfully compress word embeddings with minimal loss, as evaluated on a semantic reasoning task.

The paper is structured as follows: Section 2 discusses related work in neural compression; Section 3 describes our proposed Bayesian Arithmetic Coding algorithm. We give empirical results in Section 4, and conclude in Section 5.

2 Related Work

Compressing continuous-valued data is a classical problem in the signal processing community. Typically, a distortion measure (often the squared error) and a source distribution are assumed, and the goal is to design a quantizer that optimizes the rate-distortion (R-D) performance (Lloyd, 1982; Berger, 1972; Chou et al., 1989). Optimal vector quantization, although theoretically well-motivated (Gallager, 1968), is not tractable in high-dimensional spaces (Gersho and Gray, 2012) and not scalable in practice. Therefore most classical lossy compression algorithms map data to a suitably designed semantic representation, in such a way that coordinate-wise scalar quantization can be fruitfully applied.

Recent machine-learning-based data compression methods learn such representations from data rather than designing them by hand, but similar to classical methods, most such ML methods directly take quantization into account in the generative model design or training. Various approaches replace the non-differentiable quantization operation with either stochastic binarization (Toderici et al., 2016, 2017), additive uniform noise (Ballé et al., 2017, 2018; Habibian et al., 2019), or other differentiable approximations (Agustsson et al., 2017; Theis et al., 2017; Mentzer et al., 2018; Rippel and Bourdev, 2017); many such schemes result in uniform quantization of the latent variables, with the exception of (Agustsson et al., 2017), which optimizes for quantization grid points.

We depart from such approaches by considering quantization as a post-processing step that decouples quantization from model design and training. An important feature of our algorithm is a new quantization scheme that automatically adapts to different length scales in the representation space by exploiting posterior uncertainty estimates. To the best of our knowledge, the only prior work that uses posterior uncertainty for compression is in the context of bits-back coding (Honkela and Valpola, 2004; Townsend et al., 2019), but these works focus on lossless compression.

Most existing neural image compression methods require training a separate machine learning model for each desired bitrate setting (Ballé et al., 2017, 2018; Mentzer et al., 2018; Theis et al., 2017; Lombardo et al., 2019). In fact, Alemi et al. (2018) showed that any particular fitted VAE model only targets one specific point on the rate-distortion curve. Our approach shares the goal of variable-bitrate single-model compression with methods based on recurrent VAEs (Gregor et al., 2016; Toderici et al., 2016, 2017; Johnston et al., 2018), which use dedicated model architectures for progressive image reconstruction; we instead focus more broadly on lossy compression for any given generative model, designed and trained for specific application purposes (possibly other than compression).

3 Posterior-Informed Variable-Bitrate Compression

We now propose an algorithm for compressing latent variables in trained models. After describing the problem setup and assumptions (Subsection 3.1), we briefly review Arithmetic Coding (Subsection 3.2). Subsection 3.3 describes our proposed lossy compression algorithm, which generalizes Arithmetic Coding to the continuous domain.

3.1 Problem Setup

Generative Model and Variational Inference.

We consider a wide class of generative probabilistic models with data $x$ and unknown (or “latent”) variables $z \in \mathbb{R}^K$ from some continuous latent space with dimension $K$. The generative model is defined by a joint probability distribution,

$$p(x, z) = p(z)\, p(x \mid z) \qquad (1)$$

with a prior $p(z)$ and a likelihood $p(x \mid z)$. Although our presentation focuses on unsupervised representation learning, our framework also captures the supervised setup.

Footnote 1: For supervised learning with labels $y$, we would consider a conditional generative model with conditional likelihood $p(y \mid z, x)$, where $z$ are the model parameters, treated as a Bayesian latent variable with associated prior $p(z)$.

Our proposed compression method uses $z$ as a proxy to describe the data $x$. This requires “solving” Eq. 1 for $z$ given $x$, i.e., inferring the posterior $p(z \mid x)$. Since exact Bayesian inference is often intractable, we resort to Variational Inference (VI) (Jordan et al., 1999; Blei et al., 2017; Zhang et al., 2019), which approximates the posterior by a so-called variational distribution $q_\phi(z \mid x)$ by minimizing the Kullback-Leibler divergence

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big) \qquad (2)$$

over a set of variational parameters $\phi$.

Factorization Assumptions.

We assume that both the prior and the variational distribution are fully factorized (mean-field assumption). For concreteness, our examples use a Gaussian variational distribution. Thus,

$$p(z) = \prod_{i=1}^{K} p(z_i) \quad \text{and} \quad q_\phi(z \mid x) = \prod_{i=1}^{K} \mathcal{N}(z_i; \mu_i, \sigma_i^2), \qquad (3)$$

where $p(z_i)$ is a prior for the $i$th component of $z$, and the means $\mu_i$ and standard deviations $\sigma_i$ together comprise the variational parameters $\phi$ over which VI optimizes.

Footnote 2: These parameters are often amortized by a neural network (in which case $\mu_i$ and $\sigma_i$ depend on $x$), but don’t have to be (in which case $\mu_i$ and $\sigma_i$ do not depend on $x$ and are directly optimized).

Prominently, the model class defined by Eqs. 1-3 includes variational autoencoders (VAEs) (Kingma and Welling, 2014b) for data compression, but we stress that the class is much wider, capturing also Bayesian neural nets (MacKay, 2003), probabilistic word embeddings (Barkan, 2017; Bamler and Mandt, 2017), matrix factorization (Mnih and Salakhutdinov, 2008), and topic models (Blei et al., 2003).

Protocol Overview.

We consider two parties in communication, a sender and a receiver. Given a probabilistic model (Eq. 1), the goal is to transmit a data sample $x$ as efficiently as possible. Both parties have access to the model, but only the sender has access to $x$, which it uses to fit a variational distribution $q_\phi(z \mid x)$. It then uses the algorithm proposed below to select a latent variable vector $\hat{z}$ that has high probability under $q_\phi$, and that can be encoded into a compressed bitstring, which gets transmitted to the receiver. The receiver losslessly decodes the compressed bitstring back into $\hat{z}$ and uses the likelihood $p(x \mid \hat{z})$ to generate a reconstructed data point $\hat{x}$, typically setting $\hat{x} = \arg\max_x p(x \mid \hat{z})$.

The rest of this section describes how the proposed algorithm selects $\hat{z}$ and encodes it into a compressed bitstring.

3.2 Background: Arithmetic Coding

Our lossy compression algorithm, introduced in Section 3.3 below, generalizes a lossless compression algorithm, arithmetic coding (AC) (Witten et al., 1987; MacKay, 2003), from discrete data to the continuous space of latent variables $z$. To get there, we first review the main idea of AC that our proposed algorithm borrows.

AC is an instance of so-called entropy coding. It uniquely maps messages $m$ from a discrete set $\mathcal{M}$ to a compressed bitstring of some length $R_m$ (the “bitrate”). Entropy coding exploits prior knowledge of the distribution $p(m)$ of messages to map probable messages to short bitstrings while spending more bits on improbable messages. This way, entropy coding algorithms aim to minimize the expected rate $\mathbb{E}_{p(m)}[R_m]$. For lossless compression, the expected rate has a fundamental lower bound, the entropy $H = \mathbb{E}_{p(m)}[h(m)]$, where $h(m) = -\log_2 p(m)$ is the Shannon information content of $m$. AC provides near optimal lossless compression as it maps each message $m$ to a bitstring of length $R_m = \lceil h(m) \rceil$, where $\lceil \cdot \rceil$ denotes the ceiling function.
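The relation between entropy and expected rate can be checked numerically. The following is a minimal sketch, assuming each message costs the ceiling of its information content in bits; the four-message distribution and all function names are hypothetical and serve only as an illustration.

```python
import math

def information_content(p):
    """Shannon information content h(m) = -log2 p(m), in bits."""
    return -math.log2(p)

def entropy(probs):
    """Entropy H = sum_m p(m) h(m); zero-probability messages are skipped."""
    return sum(p * information_content(p) for p in probs if p > 0)

def expected_rate(probs):
    """Expected bitrate when message m costs ceil(h(m)) bits (assumed rate model)."""
    return sum(p * math.ceil(information_content(p)) for p in probs if p > 0)

# Hypothetical source with four messages, chosen only for illustration.
probs = [0.5, 0.25, 0.125, 0.125]
print("entropy:", entropy(probs))
print("expected rate:", expected_rate(probs))
```

For distributions whose probabilities are powers of two, the two quantities coincide; otherwise the rounding in the ceiling makes the expected rate exceed the entropy by less than one bit.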

Figure 1: Comparison of standard Arithmetic Coding (AC, left) and Bayesian AC (right, proposed). Both methods use a prior CDF (orange) to map nonuniformly distributed data to a number $\xi$, and require an uncertainty region for truncation.

AC is usually discussed in the context of streaming compression where $m$ is a sequence of symbols from a finite alphabet, as AC improves on this task over the more widely known Huffman coding (Huffman, 1952). In our work, we focus on a different aspect of AC: its use of a cumulative probability distribution function to map a nonuniformly distributed random variable $m \sim p(m)$ to a number $\xi$ that is nearly uniformly distributed over the interval $[0, 1)$.

Figure 1 (left) illustrates AC for a binomial-distributed message $m \in \{0, \ldots, 10\}$ (the number of ‘heads’ in a sequence of ten coin flips). The solid and dashed orange lines show the left and right sided cumulative distribution function, $F_<(m) := \sum_{m' < m} p(m')$ and $F_\le(m) := \sum_{m' \le m} p(m')$, respectively. They define a partitioning of the interval $[0, 1)$ (vertical axis in Figure 1 (left)) into pairwise disjoint subintervals $I_m := [F_<(m), F_\le(m))$ (orange squares). Since the intervals $I_m$ are disjoint for all $m$, any number $\xi \in I_m$ uniquely identifies a given message $m$. AC picks such a number $\xi \in I_m$ and encodes it into a string of bits $b_1 b_2 \ldots b_R$, by writing it in binary representation,

$$\xi = (0.b_1 b_2 \ldots b_R)_2 = \sum_{r=1}^{R} b_r\, 2^{-r}. \qquad (4)$$

Footnote 3: If $m$ is a sequence of symbols, $F_<$ and $F_\le$ are defined by lexicographical order and can be constructed in a streaming manner.

Since any $\xi \in I_m$ may be used to identify the message $m$, we can interpret the interval $I_m$ as an uncertainty region in $\xi$-space. AC picks the number $\xi \in I_m$ with the shortest binary representation. This requires at most $R_m = \lceil h(m) \rceil$ bits because the numbers $\xi$ that can be represented by Eq. 4 with $R = R_m$ form a uniform grid with spacing $2^{-R_m}$, which is at most as wide as the size of the interval, $|I_m| = p(m) = 2^{-h(m)}$. The red arrows in Figure 1 (left) illustrate how AC would encode a message in the toy example into a bitstring. Decoding works in the opposite direction and maps $\xi$ back to $m$.
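The interval view of AC described above can be sketched for the coin-flip example. This is a simplified illustration under stated assumptions, not a production arithmetic coder: it encodes a single message, uses exact rational arithmetic, and ignores the delimiter and streaming issues that practical implementations handle. The function names are hypothetical.

```python
import math
from fractions import Fraction

def binomial_pmf(n):
    """Exact probabilities of observing m heads in n fair coin flips."""
    return [Fraction(math.comb(n, m), 2 ** n) for m in range(n + 1)]

def encode(m, pmf):
    """Shortest bitstring b_1..b_R whose binary fraction lies in the interval of m."""
    low = sum(pmf[:m], Fraction(0))      # left-sided CDF of m
    high = low + pmf[m]                  # right-sided CDF of m
    r = 1
    while True:
        k = math.ceil(low * 2 ** r)      # smallest grid point k / 2^r that is >= low
        if Fraction(k, 2 ** r) < high:   # grid point falls inside [low, high)
            return format(k, "b").zfill(r)
        r += 1

def decode(bits, pmf):
    """Map the binary fraction back to the message whose interval contains it."""
    xi = Fraction(int(bits, 2), 2 ** len(bits))
    cumulative = Fraction(0)
    for m, p in enumerate(pmf):
        cumulative += p
        if xi < cumulative:
            return m
    raise ValueError("xi lies outside [0, 1)")

pmf = binomial_pmf(10)
for m in range(11):
    bits = encode(m, pmf)
    print(m, bits, decode(bits, pmf))
```

Probable messages near the center of the binomial distribution own wide intervals and therefore receive short bitstrings, while the rare outcomes in the tails need more bits.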

In the next section, we generalize AC to the continuous domain. As we will show, the concept of an “uncertainty region” in $\xi$-space again becomes crucial.

3.3 Bayesian Arithmetic Coding

We now present our proposed algorithm, Bayesian Arithmetic Coding (Bayesian AC), which generalizes standard AC from the domain of discrete messages $m$ to the domain of continuous latent variables $z \in \mathbb{R}^K$. Similar to AC, Bayesian AC exploits knowledge of a prior probability distribution $p(z)$ in combination with a (soft) uncertainty region to encode probable values of $z$ into short bitstrings.

From Intervals to Distributions.

The main ideas that Bayesian AC borrows from standard AC are as follows: (1) the use of a cumulative distribution function to map any non-uniformly distributed random variable to a uniformly distributed random variable over the interval $[0, 1)$, and (2) the use of an “uncertainty region” to select a number on this interval to encode the message with as few bits as possible. While AC was characterized by an interval $I_m$ with hard boundaries, Bayesian AC softens this uncertainty region by drawing on posterior uncertainty.

We consider a single continuous latent variable $z_i \in \mathbb{R}$ with arbitrary prior $p(z_i)$. The cumulative distribution function (CDF) of the prior,

$$F(z_i) := \int_{-\infty}^{z_i} p(z_i')\, dz_i', \qquad (5)$$

is shown in orange in Figure 1 (right). It maps $z_i$ to $\xi_i := F(z_i)$. In contrast to the discrete case discussed in Section 3.2, where the prior CDF maps each message $m$ to an entire interval $I_m$, note that the CDF of a continuous random variable maps real numbers to real numbers.

Since $\xi_i$ is almost surely an irrational number, its binary representation is infinitely long, and a practical encoding method has to truncate it to some finite length. We find an optimal truncation by generalizing the idea of the uncertainty region $I_m$ to the continuous space. To this end, we consider the posterior uncertainty in $z$-space and map it to $\xi$-space. Approximating the posterior by the variational distribution $q_\phi(z_i \mid x)$, see Eq. 3, we thus consider the function

$$g(\xi_i) := q_\phi\big(F^{-1}(\xi_i) \mid x\big), \qquad (6)$$

where $F^{-1}$ is the inverse CDF (the quantile function), which maps $\xi_i$ back to $z_i$. Note that $g$ is not a normalized probability distribution, as Eq. 6 deliberately does not include the Jacobian because the final objective will be to maximize $g$ at a single point (see Eq. 7 below).
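As a concrete illustration of this soft uncertainty region, the sketch below evaluates the posterior density at the prior quantile of a candidate number in the unit interval, without a Jacobian. It assumes a standard normal prior and a Gaussian variational posterior; the numerical values and the function name are hypothetical.

```python
from statistics import NormalDist

def soft_uncertainty(xi, prior, posterior):
    """Posterior density evaluated at the prior quantile of xi (no Jacobian)."""
    z = prior.inv_cdf(xi)      # quantile function of the prior, valid for 0 < xi < 1
    return posterior.pdf(z)    # variational density at the corresponding z

# Hypothetical values, chosen only for illustration.
prior = NormalDist(0.0, 1.0)       # standard normal prior
posterior = NormalDist(0.3, 0.1)   # variational posterior: mean 0.3, std 0.1
xi_peak = prior.cdf(0.3)           # the function peaks at the CDF of the posterior mean

for xi in (0.5, 0.625, xi_peak):
    value = soft_uncertainty(xi, prior, posterior)
    print(f"xi = {xi:.4f}  value = {value:.4f}")
```

The candidates 0.5 and 0.625 need one and three binary digits respectively, which hints at the trade-off between a short code and a high value of the function.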


The solid and dashed purple curves in Figure 1 (right) plot $q_\phi(z_i \mid x)$ and $g(\xi_i)$ on the horizontal and vertical axis, respectively. The red arrows illustrate how a finite uncertainty region in $z$-space is mapped to a finite width of $g$ in $\xi$-space. Bayesian AC encodes a quantile $\hat{\xi}_i$ that has high value under $g$ while at the same time having a short binary representation. The two purple arrowheads on the vertical axis point to two viable candidates that both lie within the uncertainty region. The choice between these two points poses a rate-distortion trade-off: while one candidate has higher value under $g$ (i.e., it identifies a point $\hat{z}_i = F^{-1}(\hat{\xi}_i)$ with higher approximate posterior probability $q_\phi(\hat{z}_i \mid x)$), the alternative can be encoded in fewer bits.

Optimizing the Rate-Distortion Trade-Off.

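Rather than considering a hard uncertainty region, Bayesian AC simply tries to find a point $\hat{\xi}$ that identifies latent variables $\hat{z} = F^{-1}(\hat{\xi})$ with high probability under the variational distribution while being expressible in few bits. We thus express $g$ in terms of the coordinates $\xi_i$ using Eq. 3 (with $F$ denoting the prior CDF of the respective dimension, Eq. 5),

$$g(\xi) = \prod_{i=1}^{K} \mathcal{N}\big(F^{-1}(\xi_i); \mu_i, \sigma_i^2\big). \qquad (7)$$

For each dimension $i$, we restrict the quantile $\xi_i$ to the set of code points $\hat{\xi}_i$ that can be represented in binary via Eq. 4 with a finite but arbitrary bitlength $R(\hat{\xi}_i)$. We define the total bitlength $R(\hat{\xi}) = \sum_{i=1}^{K} R(\hat{\xi}_i)$, i.e., the length of the concatenation of all codes $\hat{\xi}_i$, neglecting, for now, an overhead for delimiters (see below). Using a rate penalty parameter $\lambda > 0$ that is shared across all dimensions $i$, we minimize the rate-distortion objective

$$\mathcal{L}_\lambda(\hat{\xi}) = -\log g(\hat{\xi}) + \lambda R(\hat{\xi}) = \sum_{i=1}^{K} \left[ \frac{\big(F^{-1}(\hat{\xi}_i) - \mu_i\big)^2}{2\sigma_i^2} + \lambda R(\hat{\xi}_i) \right] + \text{const.} \qquad (8)$$

The optimization thus decouples across all latent dimensions $i$, and can be solved efficiently and in parallel by minimizing the independent objective functions

$$\ell_\lambda(\hat{\xi}_i) = \frac{\big(F^{-1}(\hat{\xi}_i) - \mu_i\big)^2}{2\sigma_i^2} + \lambda R(\hat{\xi}_i). \qquad (9)$$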
Although the bitlength $R(\hat{\xi}_i)$ is discontinuous (it counts the number of binary digits, see Eq. 4), the per-dimension objective $\ell_\lambda$ can be efficiently minimized over $\hat{\xi}_i$ using Algorithm 1. The algorithm iterates over all rates $r = 1, 2, \ldots$ and searches for the code point $\hat{\xi}_i$ that minimizes $\ell_\lambda$. For each $r$, the algorithm only needs to consider the two code points with rate at most $r$ that enclose the optimum $F(\mu_i)$ and are closest to it; these two code points can be easily computed in constant time. The iteration terminates as soon as the maximally possible remaining increase in $\log g$ is smaller than the minimum rate penalty for an increasing bitlength (in practice, the iteration rarely exceeds ).


After finding the optimal code points $\hat{\xi}_i$, they have to be encoded into a single bitstring. Simply concatenating the binary representations (Eq. 4) of all $\hat{\xi}_i$ would be ambiguous due to their variable lengths (see detailed discussion in the Supplementary Material). Instead, we treat the code points as symbols from a discrete vocabulary and encode them via lossless entropy coding, e.g., standard Arithmetic Coding. The entropy coder requires a probabilistic model over all code points; here we simply use their empirical distribution. When using our method for model compression, this empirical distribution has to be transmitted to the receiver as additional header information that counts towards the total bitrate. For data compression, by contrast, we obtain the empirical distribution of code points on training data and include it in the decoder.
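The cost of entropy coding the code points under their empirical distribution can be estimated as follows. This is a rough sketch that reports the ideal Shannon code length only; an actual arithmetic coder adds a small overhead, and the header cost mentioned above is not included. The function name and the example symbols are hypothetical.

```python
import math
from collections import Counter

def empirical_code_length_bits(code_points):
    """Ideal total bits when symbols are coded under their own empirical distribution."""
    counts = Counter(code_points)
    n = len(code_points)
    # A symbol seen c times out of n has probability c / n and costs -log2(c / n) bits each.
    return sum(-c * math.log2(c / n) for c in counts.values())

# Hypothetical code points, written as the bitstrings of their binary fractions.
code_points = ["1", "1", "01", "1"]
print(empirical_code_length_bits(code_points))
```

Because identical code points share one vocabulary entry, dimensions that all collapse to the same short code point contribute very few bits in total.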

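Input: Prior CDF $F$, rate penalty $\lambda > 0$, variational mode $\mu_i$ and variance $\sigma_i^2$.
Output: Optimal code point $\hat{\xi}_i^*$.
  Evaluate $\xi_i \leftarrow F(\mu_i)$.
  Initialize $r \leftarrow 0$; $\ell^* \leftarrow \infty$.
  repeat
     Set $r \leftarrow r + 1$; $\hat{\xi}^- \leftarrow 2^{-r} \lfloor 2^r \xi_i \rfloor$; $\hat{\xi}^+ \leftarrow 2^{-r} \lceil 2^r \xi_i \rceil$.
     if $\hat{\xi}^- \neq 0$ and $\ell_\lambda(\hat{\xi}^-) < \ell^*$ then
        Update $\hat{\xi}_i^* \leftarrow \hat{\xi}^-$; $\ell^* \leftarrow \ell_\lambda(\hat{\xi}^-)$.
     end if
     if $\hat{\xi}^+ \neq 1$ and $\ell_\lambda(\hat{\xi}^+) < \ell^*$ then
        Update $\hat{\xi}_i^* \leftarrow \hat{\xi}^+$; $\ell^* \leftarrow \ell_\lambda(\hat{\xi}^+)$.
     end if
  until $\lambda (r + 1) \geq \ell^*$.
Algorithm 1 Rate-Distortion Optimization for Dimension $i$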
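The per-dimension search can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes a Gaussian prior and posterior (via the standard library's `NormalDist`), a squared-error distortion scaled by the posterior variance, and a stopping rule that ends the loop once the rate penalty alone exceeds the best objective found so far. The function name, the `max_bits` cap, and the clipping constant are hypothetical additions for numerical safety.

```python
import math
from statistics import NormalDist

def optimize_code_point(prior, mu, sigma, lam, max_bits=32):
    """Search for the code point minimizing distortion + lam * bitlength.

    Returns the chosen code point in (0, 1) and its number of binary digits.
    """
    eps = 2.0 ** -(max_bits + 1)
    xi = min(max(prior.cdf(mu), eps), 1.0 - eps)  # keep the target strictly inside (0, 1)
    best_xi, best_bits, best_loss = None, None, math.inf
    for r in range(1, max_bits + 1):
        scale = 2 ** r
        # Only the two grid points enclosing xi can be optimal at this bitlength.
        for k in (math.floor(xi * scale), math.ceil(xi * scale)):
            if 0 < k < scale:                     # 0 and 1 have no finite quantile
                z = prior.inv_cdf(k / scale)
                loss = (z - mu) ** 2 / (2 * sigma ** 2) + lam * r
                if loss < best_loss:              # strict: keeps the shortest representation
                    best_xi, best_bits, best_loss = k / scale, r, loss
        if lam * (r + 1) >= best_loss:            # more bits can no longer pay off
            break
    return best_xi, best_bits

prior = NormalDist(0.0, 1.0)
print(optimize_code_point(prior, mu=0.3, sigma=0.01, lam=0.1))  # confident posterior
print(optimize_code_point(prior, mu=0.0, sigma=1.0, lam=1.0))   # collapsed posterior
```

A dimension whose posterior equals the prior settles on the one-bit code point 0.5, which mirrors the self-pruning behaviour described in the introduction, while a narrow posterior justifies spending additional bits.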
Figure 2: Effect of an anisotropic posterior distribution on lossy compression. Left: linear regression model with optimal fit (green) and fits from two lossy compression methods. Right: posterior distribution and positions of the compressed models in latent space. Although both compressed models are equally far away from the optimal solution (green dot), Bayesian AC (orange) fits the data better because it takes the anisotropy of the posterior into account.


The proposed algorithm adjusts the accuracy for each latent variable $z_i$ based on two factors: (i) a global rate setting $\lambda$ that is shared across all dimensions $i$; and (ii) a per-dimension posterior uncertainty estimate $\sigma_i$. Point (i) allows tuning the rate-distortion trade-off whereas (ii) takes the anisotropy of the latent space into account.

Figure 2 illustrates the effect of anisotropy in latent space. The right panel plots the posterior of a toy Bayesian linear regression model (see left panel) with only two latent variables. Due to the elongated shape of the posterior, Bayesian AC uses a higher accuracy for the latent variable with low posterior uncertainty than for the one with high uncertainty. As a result, the algorithm encodes a point (orange dot in right panel) that is closer to the optimal (MAP) solution (green dot) along the axis with low uncertainty than along the axis with high uncertainty.

The purple dot in Figure 2 (right) compares to a more common quantization method, which simply rounds the MAP solution to the nearest point (which is then entropy coded) on a grid with fixed spacing. We tuned the grid spacing so that the resulting encoded point (purple dot) has the same distance to the optimum as our proposed solution (orange dot). Despite the equal distance to the optimum, Bayesian AC encodes model parameters with higher posterior probability. The resulting model fits the data better (left panel).
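For comparison, the uniform-grid baseline is simple to state in code. The sketch below rounds every coordinate to the nearest multiple of a shared grid spacing, regardless of posterior uncertainty; the function name and the example values are hypothetical.

```python
def quantize_uniform(values, spacing):
    """Round each coordinate to the nearest multiple of a fixed grid spacing."""
    return [spacing * round(v / spacing) for v in values]

# Hypothetical two-dimensional MAP estimate and grid spacing.
print(quantize_uniform([0.26, -0.74], spacing=0.5))
```

Unlike Bayesian AC, this baseline spends the same accuracy on every dimension, so it cannot exploit directions in which the posterior is wide.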

This concludes the description of the proposed Bayesian Arithmetic Coding algorithm. In the next section, we analyze the algorithm’s behaviour experimentally and demonstrate its performance for variable-bitrate compression on both word embeddings and images.

4 Experiments

We tested our approach in two very different domains: word embeddings and images. For word embeddings, we measured the performance drop on a semantic reasoning task due to lossy compression. Our proposed Bayesian AC method significantly improves model performance over uniform discretization and compression with either Arithmetic Coding (AC), gzip, bzip2, or lzma at equal bitrate. For image compression, we show that a single standard VAE, compressed with Bayesian AC, outperforms JPEG and other baselines at a wide range of bitrates, both quantitatively and visually.

4.1 Compressing Word Embeddings

Figure 3: Performance of compressed word embeddings on a standard semantic and syntactic reasoning task (Mikolov et al., 2013a). Bayesian AC (orange, proposed) leads to much smaller file sizes at equal model performance over a wide range of performances.

We consider the Bayesian Skip-gram model for neural word embeddings (Barkan, 2017), a probabilistic generative formulation of word2vec (Mikolov et al., 2013b) which interprets word and context embedding vectors as latent variables and associates them with Gaussian approximate posterior distributions. Point estimating the latent variables would result in classical word2vec. Even though the model was not specifically designed or trained with model compression taken into consideration, the proposed algorithm can successfully compress it in post-processing.

Experiment Setup.

We implemented the Black Box VI version of the Bayesian Skip-gram model proposed in (Bamler and Mandt, 2017) and trained the model on books published between and from the Google Books corpus (Michel et al., 2011), following the preprocessing described in (Bamler and Mandt, 2017) with a vocabulary of words and embedding dimension .

Footnote 4: See Supplement for hyperparameters. Our code is available at

In the trained model, we observed that the distribution of posterior modes across all words and all dimensions of the embedding space was quite different from the prior. To improve the bitrate of our method, we used an “empirical prior” for encoding that is shared across all words and dimensions; we chose a Gaussian whose variance is the empirical variance of all variational means.

We compare our method’s performance to a baseline that quantizes to a uniform grid and then uses the empirical distribution of quantized coordinates for lossless entropy coding. We also compare to uniform quantization baselines that replace the entropy coding step with the standard compression libraries gzip, bzip2, and lzma. These methods are not restricted by a factorized distribution of code points and could therefore detect and exploit correlations between quantized code points across words or dimensions.

Figure 4: Qualitative behavior of three different image compression methods upon reducing the bitrate (bitrates are on different scales). While JPEG and uniformly quantizing a VAE see loss in pixel-level detail, Bayesian AC tends to preserve details but semantically confuses the encoded object with a generic one.

We evaluate performance on the semantic and syntactic reasoning task proposed in (Mikolov et al., 2013a), a popular dataset of semantic relations like “Japan : yen = Russia : ruble” and syntactic relations like “amazing : amazingly = lucky : luckily”, where the goal is to predict the last word given the first three words. We report Hits@10, i.e., the fraction of challenges for which the compressed model ranks the correct prediction among the top ten.
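The Hits@10 metric itself is straightforward to compute once each analogy query has a ranked candidate list. The sketch below assumes the ranking has already been produced by the (compressed) embedding model; how candidates are scored is outside its scope, and the function name and toy data are hypothetical.

```python
def hits_at_k(ranked_predictions, targets, k=10):
    """Fraction of queries whose target word appears among the top-k ranked candidates."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_predictions, targets))
    return hits / len(targets)

# Toy example with two queries and k = 1.
ranked = [["ruble", "yen"], ["luckily", "lucky"]]
targets = ["ruble", "lucky"]
print(hits_at_k(ranked, targets, k=1))
```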


Figure 3 shows the model performance on the semantic and syntactic reasoning tasks as a function of compression rate. Our proposed Bayesian AC significantly outperforms all baselines and reaches the same Hits@10 at less than half the bitrate over a wide range.

Footnote 5: The uncompressed model performance (dotted gray line in Figure 3) is not state of the art. This is not a shortcoming of the compression method but merely of the model, and can be attributed to the smaller vocabulary and training set used compared to (Mikolov et al., 2013b) due to hardware constraints.

4.2 Image Compression

(a) Original
(b) JPEG
(c) Bayesian AC
(d) Uniform grid
(e) Ballé et al.
Figure 5: Image reconstructions at matching bitrate (0.24 bits per pixel). Bayesian AC (c; proposed) outperforms AC with uniform quantization (d) and JPEG (b) and is comparable to the approach by (Ballé et al., 2017) (e) despite using a model that is not optimized for this specific bitrate. Uniform quantization here used a modified version of the VAE in Figure 6, using an additional conv layer with smaller dimensions to reduce the bitrate down to 0.24 (this was not possible in the original model even with the largest possible grid spacing).

While Section 4.1 demonstrated the proposed Bayesian AC method for model compression, we now apply the same method to data compression using a variational autoencoder (VAE). We first provide a qualitative evaluation on MNIST, and then quantitative results on full resolution color images.


For simplicity, we consider regular VAEs with a standard normal prior and Gaussian variational posterior. The generative network parameterizes a factorized Bernoulli or Gaussian likelihood model in the two experiments, respectively. Network architectures are described below and in more detail in Supplementary Material.


We consider the following baselines:

  • Uniform quantization: for a given image $x$, we quantize each dimension of the posterior mean vector $\mu(x)$ to a uniform grid. We report the bitrate for encoding the resulting quantized latent representation via standard entropy coding (e.g., arithmetic coding). Entropy coding requires prior knowledge of the probabilities of each grid point. Here, we use the empirical frequencies of grid points over a subset of the training set;

  • $k$-means quantization: similar to “uniform quantization”, but with the placement of grid points optimized via $k$-means clustering on a subset of the training data;

  • JPEG: we used the libjpeg implementation packaged with the Python Pillow library, using default configurations (e.g., 4:2:0 subsampling), and we adjust the quality parameter to vary the rate-distortion trade-off;

  • Deep learning baseline: we compare to Ballé et al. (2017), who directly optimized for the rate and distortion, training a separate model for each point on the R-D curve. In our large-scale experiment, we adopt their model architecture, so their performance essentially represents the end-to-end optimized performance upper bound for our method (which uses a single model).

4.2.1 Qualitative Analysis on Toy Experiment

We trained a simple VAE on MNIST digits, and compared our method to uniform quantization and JPEG.

Figure 4 shows example compressed digits, ranging from the highest to the lowest bitrate that each method allows. As the rate decreases, we see that unlike JPEG, which introduces pixel-level artifacts, both VAE-based methods were able to preserve semantic aspects of the original image. It is interesting to see how the performance degrades in the strongly compressed regime for Bayesian AC. With an aggressive decrease in bitrate (as $\lambda \to \infty$), our method gradually “confuses” the original image with a generic image (8 being in the center of the embedding space), while preserving approximately the same level of sharpness.

4.2.2 Full-Resolution Color Image Compression

We apply our Bayesian AC method to a VAE trained on color images and demonstrate its practical image compression performance rivaling JPEG.

Model and Dataset.

The inference and generative networks of the VAE are identical to the analysis and synthesis networks of Ballé et al. (2017), using 3 layers of 256 filters each in a convolutional architecture. We used a diagonal Gaussian likelihood model, whose mean is computed by the generative net and whose variance is fixed as a hyper-parameter, similar to a $\beta$-VAE (Higgins et al., 2017) approach (this hyper-parameter was tuned to 0.001 to ensure the VAE achieved overall good R-D trade-off; see (Alemi et al., 2018)). We trained the model on the same subset of the ImageNet dataset as used in (Ballé et al., 2017). We evaluated performance on the standard Kodak dataset (35), a separate set of 24 uncompressed color images. As in the word embedding experiment, we also observed that using an empirical prior for our method improved the bitrate. We used the same generic density model as in (Ballé et al., 2018), fitting a different distribution for each latent channel, on samples of posterior means (treating spatial dimensions as i.i.d.).

Figure 6: Aggregate rate-distortion performance on the Kodak dataset (higher is better). Bayesian AC (blue, proposed) outperforms JPEG for all tested bitrates with a single model. By contrast, (Ballé et al., 2017) (black squares) relies on individually optimized models for each bitrate that all have to be included in the decoder.

As is common in image compression work, we measure the distortion between the original and compressed image under two quality metrics (the higher the better): Peak Signal-to-Noise Ratio (PSNR) and MS-SSIM (Wang et al., 2003), over all RGB channels. Figure 6 shows rate-distortion performance, where we averaged both the bits per pixel (BPP) and the quality measure across all images in the Kodak dataset, for each fixed R-D trade-off setting (we obtained similar results when averaging only over the quality metrics for fixed bitrates). The results for Ballé et al. (2017) are taken from the paper and the authors’ website.
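As a rough illustration of how one point on such an aggregate curve could be computed, the sketch below evaluates PSNR and bits per pixel for each image and then averages both across images. It is not the authors' evaluation code: the function names, the flat-list pixel representation, and the 8-bit peak value of 255 are assumptions made for this example, and MS-SSIM is omitted.

```python
import math


def psnr(original, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length flat sequences of pixel values.

    The sequences are assumed to contain all RGB channels of one image,
    so the mean squared error is taken over every channel at once.
    """
    if len(original) != len(reconstruction) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def bits_per_pixel(num_bits, height, width):
    """Compressed size in bits divided by the number of pixels."""
    return num_bits / (height * width)


def aggregate_rd_point(per_image_results):
    """Average (bpp, quality) pairs over images for one fixed R-D setting."""
    bpps = [bpp for bpp, _ in per_image_results]
    qualities = [quality for _, quality in per_image_results]
    return sum(bpps) / len(bpps), sum(qualities) / len(qualities)
```

Calling `aggregate_rd_point` once per trade-off setting would give one (BPP, quality) point per setting, which is the kind of curve plotted in Figure 6.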

We found that our method generally produced images with higher quality, both in terms of PSNR and perceptual quality, compared to JPEG and uniform quantization. Similar to (Ballé et al., 2017), our method avoids jarring artifacts, and introduces blurriness at low bitrate. See Figure 5 for example image reconstructions. For more examples and R-D curves on individual images, see Supplementary Material.

Although our results fall short of the end-to-end optimized rate-distortion performance of Ballé et al. (2017), it is worth emphasizing that our method allows operating anywhere on the R-D curve with a single trained VAE model, unlike Ballé et al. (2017), which requires costly optimization and storage of individual models for each point on the R-D curve.

Figure 7: Variational posteriors and the encoding cost for the first 6 latent channels of an image-compression VAE trained with a high hyper-parameter setting. “Baseline” refers to uniform quantization. The proposed Bayesian AC method wastes fewer bits on channels that exhibit posterior collapse (channels 2, 3, and 5) than the baseline method. It instead spends more bits on channels without posterior collapse.

Indifference to Posterior Collapse.

A known issue in deep generative models such as VAEs is the phenomenon of posterior collapse, where the model ignores some subset of latent variables, and the corresponding variational posterior distributions collapse to closely match the prior. Since such collapsed dimensions do not contribute to the model’s performance, they constitute an overhead in regular neural compression approaches and may need to be pruned.

One curious consequence of our approach is that it spends close to zero bits encoding the collapsed latent dimensions. As an illustration, we trained a VAE as used in the color image compression experiment with a high hyper-parameter setting to purposefully induce posterior collapse, and examined the average number of bits spent on various latent channels.

Figure 7 shows the prior, the aggregated (approximate) posterior, and histograms of posterior means for the first six channels of the VAE; all quantities were averaged over an image batch and across latent spatial dimensions. We observe that channels 2, 3, and 5 appear to exhibit posterior collapse, as the aggregated posteriors closely match the prior while the posterior means cluster tightly at zero; this is also reflected by the low average KL divergence between the variational posterior and the prior (see the text inside each panel). We observe that, for these collapsed channels, our method spends fewer bits on average than uniform quantization (baseline) at the same total bitrate, and instead spends more bits on channels 1, 4, and 6, which do not exhibit posterior collapse. The explanation is that a collapsed posterior has unusually high variance, causing our model to refrain from long code words due to the high penalty per bitrate in Eq. 9.
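The per-channel KL diagnostic described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a diagonal Gaussian posterior and a standard normal prior (the source does not state the prior's form here), and the function names and the `threshold_bits` cutoff are hypothetical choices made for this example.

```python
import math


def gaussian_kl_bits(mean, std):
    """KL( N(mean, std^2) || N(0, 1) ) for one latent dimension, in bits."""
    kl_nats = 0.5 * (mean ** 2 + std ** 2 - 1.0) - math.log(std)
    return kl_nats / math.log(2.0)


def average_kl_per_channel(means, stds):
    """Average KL per channel.

    means[c] and stds[c] hold the posterior means and standard deviations
    of channel c, flattened over the image batch and spatial dimensions.
    """
    return [
        sum(gaussian_kl_bits(m, s) for m, s in zip(ch_means, ch_stds)) / len(ch_means)
        for ch_means, ch_stds in zip(means, stds)
    ]


def collapsed_channels(means, stds, threshold_bits=0.01):
    """Indices of channels whose average KL falls below the (assumed) threshold."""
    averages = average_kl_per_channel(means, stds)
    return [c for c, kl in enumerate(averages) if kl < threshold_bits]
```

A channel whose posterior means sit at zero with unit standard deviation yields an average KL of zero under this assumed prior and would be flagged as collapsed, whereas a channel with sharply peaked posteriors away from zero would not.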

5 Conclusions

We proposed a novel algorithm for lossy compression, based on a new quantization scheme that automatically adapts encoding accuracy to posterior uncertainty estimates. This decouples the task of compression from model design and training, and enables variable-bitrate compression for probabilistic generative models with mean-field variational distributions, in post-processing.

We empirically demonstrated the effectiveness of our approach for both model and data compression. Our proposed algorithm can be readily applied to many existing models. In particular, we believe it holds promise for compressing Bayesian neural networks.


Acknowledgements

Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FW-HTF-RM), and Qualcomm.

References

  • E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, Cited by: §2.
  • A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a broken elbo. In International Conference on Machine Learning, Cited by: §2, §4.2.2.
  • J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. International Conference on Learning Representations. Cited by: §1, §2, §2, Figure 5, Figure 6, 4th item, §4.2.2, §4.2.2, §4.2.2, §4.2.2.
  • J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In ICLR. Cited by: §1, §2, §2, §4.2.2.
  • R. Bamler and S. Mandt (2017) Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 380–389. Cited by: §3.1, §4.1.
  • O. Barkan (2017) Bayesian neural word embedding. In Association for the Advancement of Artificial Intelligence, pp. 3135–3143. Cited by: §3.1, §4.1.
  • T. Berger (1972) Optimum quantizers and permutation codes. IEEE Trans. Information Theory 18, pp. 759–765. Cited by: §2.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American statistical Association 112 (518), pp. 859–877. Cited by: §3.1.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. Journal of machine Learning research 3, pp. 993–1022. Cited by: §3.1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1.
  • P. A. Chou, T. D. Lookabaugh, and R. M. Gray (1989) Entropy-constrained vector quantization. IEEE Trans. Acoustics, Speech, and Signal Processing 37, pp. 31–42. Cited by: §2.
  • R. G. Gallager (1968) Information theory and reliable communication. Vol. 2, Springer. Cited by: §2.
  • A. Gersho and R. M. Gray (2012) Vector quantization and signal compression. Vol. 159, Springer Science & Business Media. Cited by: §2.
  • K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra (2016) Towards conceptual compression. External Links: 1604.08772 Cited by: §2.
  • A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042. Cited by: §2.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations. Cited by: §4.2.2.
  • A. Honkela and H. Valpola (2004) Variational learning and bits-back coding: an information-theoretic view to bayesian learning. IEEE transactions on Neural Networks 15 (4), pp. 800–810. Cited by: §2.
  • D. A. Huffman (1952) A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40 (9), pp. 1098–1101. Cited by: §3.2.
  • N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393. Cited by: §2.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233. Cited by: §3.1.
  • D. P. Kingma and M. Welling (2014a) Auto-encoding variational Bayes. In International Conference on Learning Representations, pp. 1–9. Cited by: §1.
  • D. P. Kingma and M. Welling (2014b) Auto-encoding variational Bayes. In International Conference on Learning Representations, Cited by: §3.1.
  • S. Lloyd (1982) Least squares quantization in pcm. IEEE transactions on information theory 28 (2), pp. 129–137. Cited by: §2.
  • S. Lombardo, J. Han, C. Schroers, and S. Mandt (2019) Deep generative video compression. In Advances in Neural Information Processing Systems, pp. 9283–9294. Cited by: §1, §2.
  • D. J. MacKay (1992) A practical bayesian framework for backpropagation networks. Neural computation 4 (3), pp. 448–472. Cited by: §1.
  • D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge University Press. Cited by: §1, §3.1, §3.2.
  • F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402. Cited by: §1, §2, §2.
  • J. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. (2011) Quantitative analysis of culture using millions of digitized books. science 331 (6014), pp. 176–182. Cited by: §4.1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: Figure 3, §4.1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §4.1, footnote 5.
  • A. Mnih and R. R. Salakhutdinov (2008) Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264. Cited by: §3.1.
  • R. Ranganath, S. Gerrish, and D. M. Blei (2014) Black box variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 814–822. Cited by: §1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §1.
  • O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. External Links: 1705.05823 Cited by: §1, §2.
  • [35] The kodak photocd dataset. Note: 2020-01-09 Cited by: §4.2.2.
  • L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. International Conference on Learning Representations. Cited by: §2, §2.
  • G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2016) Variable rate image compression with recurrent neural networks. International Conference on Learning Representations. Cited by: §2, §2.
  • G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435–5443. Cited by: §2, §2.
  • J. Townsend, T. Bird, and D. Barber (2019) Practical lossless compression with latent variables using bits back coding. arXiv preprint arXiv:1901.04866. Cited by: §2.
  • Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §4.2.2.
  • I. H. Witten, R. M. Neal, and J. G. Cleary (1987) Arithmetic coding for data compression. Communications of the ACM 30 (6), pp. 520–540. Cited by: §1, §3.2.
  • C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt (2019) Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence. Cited by: §3.1.