Improving Inference for Neural Image Compression

06/07/2020 ∙ by Yibo Yang, et al. ∙ University of California, Irvine

We consider the problem of lossy image compression with deep latent variable models. State-of-the-art methods build on hierarchical variational autoencoders (VAEs) and learn inference networks to predict a compressible latent representation of each data point. Drawing on the variational inference perspective on compression, we identify three approximation gaps which limit performance in the conventional approach: (i) an amortization gap, (ii) a discretization gap, and (iii) a marginalization gap. We propose improvements to each of these three shortcomings based on iterative inference, stochastic annealing for discrete optimization, and bits-back coding, resulting in the first application of bits-back coding to lossy compression. In our experiments, which include extensive baseline comparisons and ablation studies, we achieve new state-of-the-art performance on lossy image compression using an established VAE architecture, by changing only the inference method.




1 Introduction

Deep learning methods are reshaping the field of data compression, and have recently started to outperform state-of-the-art classical codecs on image compression (Minnen et al., 2018). Besides being useful in its own right, image compression can be seen as a stepping stone towards better video codecs (Lombardo et al., 2019; Habibian et al., 2019), which can reduce a sizable amount of global internet traffic.

State-of-the-art neural methods for lossy image compression (Ballé et al., 2018; Minnen et al., 2018; Lee et al., 2019) learn a mapping between images and latent variables with a variational autoencoder (VAE). An inference network maps a given image to a compressible latent representation, which a generative network can then map back to a reconstructed image. In fact, compression can be more broadly seen as a form of inference: to compress—or “encode”—data, one has to perform inference over a well-specified decompression—or “decoding”—algorithm.

In classical compression codecs, the decoder has to follow a well-specified procedure to ensure interoperability between different implementations of the same codec. By contrast, the encoding process is typically not uniquely defined: different encoder implementations of the same codec often compress the same input data to different bitstrings. For example, pngcrush (Randers-Pehrson, 1997) typically produces smaller PNG files than ImageMagick. Yet, both programs are standard-compliant PNG encoder implementations as both produce a compressed file that decodes to the same image. In other words, both encoder implementations perform correct inference over the standardized decoder specification, but use different inference algorithms with different performance characteristics.

The insight that better inference leads to better compression performance even within the same codec motivates us to reconsider how inference is typically done in neural data compression with VAEs. In this paper, we show that the conventional amortized inference (Kingma and Welling, 2013; Rezende et al., 2014) in VAEs leaves substantial room for improvement when used for data compression.

We propose an improved inference method for data compression tasks with three main innovations:

  1. Improved amortization: The amortized inference strategy in VAEs speeds up training, but is restrictive at compression time. We draw a connection between a recently proposed iterative procedure for compression (Campos et al., 2019) and the broader VI literature on closing the amortization gap, which provides the basis of the following two novel inference methods.

  2. Improved discretization: Compression requires discretizing the latent representation from VAEs, because only discrete values can be entropy coded. As inference over discrete variables is difficult, existing methods typically relax the discretization constraint in some way during inference and then discretize afterwards. We instead propose a novel method based on a stochastic annealing scheme that performs inference directly over discrete points.

  3. Improved entropy coding: In lossless compression with latent-variable models, bits-back coding (Wallace, 1990; Hinton and Van Camp, 1993) allows approximately coding the latents with the marginal prior. It is so far believed that bits-back coding is incompatible with lossy compression (Habibian et al., 2019) because it requires inference on the decoder side, which does not have access to the exact (undistorted) input data. We propose a remedy to this limitation, resulting in the first application of bits-back coding to lossy compression.

We evaluate the above three innovations on an otherwise unchanged architecture of an established model and compare against a wide range of baselines and ablations. Our proposals significantly improve compression performance and result in a new state of the art in lossy image compression.

The rest of the paper is structured as follows: Section 2 summarizes lossy compression with VAEs, for which Section 3 proposes the above three improvements. Section 4 reports experimental results. We conclude in Section 5. We review related work in each relevant subsection.

2 Background: Lossy Neural Image Compression as Variational Inference

In this section, we summarize an existing framework for lossy image compression with deep latent variable models, which will be the basis of three proposed improvements in Section 3.

Related Work.

Ballé et al. (2017) and Theis et al. (2017) were among the first to recognize a connection between the rate-distortion objective of lossy compression and the loss function of a certain kind of Variational Autoencoders (VAEs), and to apply it to end-to-end image compression. Ballé et al. (2018) propose a hierarchical model, upon which current state-of-the-art methods (Minnen et al., 2018; Lee et al., 2019) further improve by adding an auto-regressive component. For simplicity, we adopt the VAE of (Minnen et al., 2018) without the auto-regressive component, reviewed in this section below, which has also been a basis for recent compression research (Johnston et al., 2019).

Generative Model.

Figure 1a shows the generative process of an image $x$. The Gaussian likelihood $p(x \mid y) = \mathcal{N}(x;\, g(y; \theta),\, \sigma_x^2 I)$ has a fixed variance $\sigma_x^2$, and its mean is computed from latent variables $y$ by a deconvolutional neural network (DNN) $g$ with weights $\theta$. A second DNN $g_h$ with weights $\theta_h$ outputs the parameters (location and scale) of the prior $p(y \mid z)$ conditioned on “hyperlatents” $z$.


Figure 1: Graphical model and control flow charts. a) generative model with hyperlatents $z$, latents $y$, and image $x$ (Minnen et al., 2018); b) conventional method for compression (dashed blue, see Eq. 1) and decompression (solid black); c) common training objective due to (Ballé et al., 2017) (see Eq. 3); d) proposed hybrid amortized (dashed blue) / iterative (dotted red) inference (Section 3.1); e)-f) inference in the proposed lossy bits-back method (Section 3.3); the encoder first executes e) and then keeps $\hat{y}$ fixed while executing f); the decoder reconstructs $\hat{y}$ and then executes f) to get bits back.

Compression and Decompression.

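Figure 1b illustrates compression (“encoding”, dashed blue) and decompression (“decoding”, solid black) as proposed in (Minnen et al., 2018). The encoder passes a target image $x$ through trained inference networks $f$ and $f_h$ with weights $\phi$ and $\phi_h$, respectively. The resulting continuous latent representations, $(\mu_y, \mu_z)$, are rounded (denoted $\lfloor\cdot\rceil$) to discrete $(\hat{y}, \hat{z})$,

$$\hat{y} = \lfloor \mu_y \rceil; \quad \hat{z} = \lfloor \mu_z \rceil, \quad \text{where} \quad \mu_y = f(x; \phi); \quad \mu_z = f_h(f(x; \phi); \phi_h). \tag{1}$$

The encoder then entropy-codes $\hat{z}$ and $\hat{y}$, using discretized versions $P(\hat{z})$ and $P(\hat{y} \mid \hat{z})$ of the hyperprior $p(z)$ and (conditional) prior $p(y \mid z)$, respectively, as entropy models. This step is simplified by a clever restriction on the shape of $p$, such that it agrees with the discretized $P$ on all integers (see (Ballé et al., 2017, 2018; Minnen et al., 2018)). The decoder then recovers $\hat{z}$ and $\hat{y}$ and obtains a lossy image reconstruction $x' := g(\hat{y}; \theta)$.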
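The rounding and entropy-coding step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of (Minnen et al., 2018): the encoder outputs and the prior locations and scales are made-up numbers, the helper names are hypothetical, and the entropy model is assumed to assign each integer the Gaussian probability mass of the unit interval around it. The returned value is the ideal code length; a real entropy coder adds a small overhead.

```python
import math

def gaussian_cdf(t, loc, scale):
    """CDF of a Gaussian with the given location and scale."""
    return 0.5 * (1.0 + math.erf((t - loc) / (scale * math.sqrt(2.0))))

def discretized_prob(k, loc, scale):
    """Probability mass that the Gaussian assigns to [k - 0.5, k + 0.5]."""
    return gaussian_cdf(k + 0.5, loc, scale) - gaussian_cdf(k - 0.5, loc, scale)

def code_length_bits(symbols, locs, scales):
    """Ideal entropy-coding cost in bits: sum of -log2 P(symbol)."""
    return sum(-math.log2(discretized_prob(k, m, s))
               for k, m, s in zip(symbols, locs, scales))

# Hypothetical continuous encoder outputs (assumption, not from the paper).
mu_y = [0.2, -1.7, 3.4]
# Rounding to the nearest integer gives the discrete latents.
y_hat = [round(m) for m in mu_y]

# Hypothetical prior parameters, standing in for the output of the hyper-decoder.
locs = [0.0, -1.5, 3.0]
scales = [1.0, 1.0, 2.0]
bits = code_length_bits(y_hat, locs, scales)
print(y_hat, round(bits, 3))
```

Latents that land close to the predicted location and have a small predicted scale cost few bits, which is why a good hyperprior lowers the bitrate.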

Model Training.

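Let $x$ be sampled from a training set of images. To train the above VAE for lossy compression, Theis et al. (2017) and Minnen et al. (2018) consider minimizing a rate-distortion objective:

$$\mathcal{L}_\lambda(\lfloor\mu_y\rceil, \lfloor\mu_z\rceil) = \mathcal{R}(\lfloor\mu_y\rceil, \lfloor\mu_z\rceil) + \lambda\, \mathcal{D}(\lfloor\mu_y\rceil, x) = -\log_2 p(\lfloor\mu_y\rceil, \lfloor\mu_z\rceil) + \lambda\, \lVert x - g(\lfloor\mu_y\rceil; \theta) \rVert_2^2, \tag{2}$$

where $(\mu_y, \mu_z)$ is computed from the inference networks as in Eq. 1, and the parameter $\lambda > 0$ controls the trade-off between bitrate $\mathcal{R}$ under entropy coding, and distortion (reconstruction error) $\mathcal{D}$. As the rounding operations prevent gradient-based optimization, Ballé et al. (2017) propose to replace rounding during training by adding uniform noise from the interval $[-\tfrac{1}{2}, \tfrac{1}{2}]$ to each coordinate of $\mu_y$ and $\mu_z$ (see Figure 1c). This is equivalent to sampling from uniform distributions $q(y \mid \mu_y)$ and $q(z \mid \mu_z)$ with a fixed width of one centered around $\mu_y$ and $\mu_z$, respectively. One thus obtains the following relaxed rate-distortion objective, for a given data point $x$,

$$\tilde{\mathcal{L}}_\lambda(\mu_y, \mu_z) = \mathbb{E}_{q(y \mid \mu_y)\, q(z \mid \mu_z)} \left[ -\log_2 p(y, z) + \lambda\, \lVert x - g(y; \theta) \rVert_2^2 \right]. \tag{3}$$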
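To make the difference between the rounded and the noise-relaxed objective concrete, the following one-dimensional sketch compares the two. Everything in it is an assumption for illustration: the decoder is the identity $g(y) = y$, the prior is a unit Gaussian convolved with a unit-width uniform, there is no hyperlatent, and the function names are hypothetical. The relaxed objective is estimated by Monte Carlo with a seeded generator.

```python
import math
import random

def gaussian_cdf(t):
    """CDF of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def rate_bits(y):
    """-log2 of a unit Gaussian convolved with a unit-width uniform (toy prior)."""
    return -math.log2(gaussian_cdf(y + 0.5) - gaussian_cdf(y - 0.5))

def objective(y, x, lam):
    """Rate plus lam times squared error, with a toy identity decoder g(y) = y."""
    return rate_bits(y) + lam * (x - y) ** 2

def rounded_objective(mu, x, lam):
    """Objective evaluated at the rounded latent (what is paid at compression time)."""
    return objective(float(round(mu)), x, lam)

def relaxed_objective(mu, x, lam, num_samples=10000, seed=0):
    """Monte Carlo estimate of the objective with additive uniform noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        noise = rng.uniform(-0.5, 0.5)  # replaces rounding during training
        total += objective(mu + noise, x, lam)
    return total / num_samples

x, lam, mu = 1.3, 2.0, 1.3
print(rounded_objective(mu, x, lam), relaxed_objective(mu, x, lam))
```

The relaxed version is differentiable in `mu`, whereas the rounded version is piecewise constant; the two generally disagree, which is the discrepancy that Section 3.2 calls the discretization gap.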


Connection to Variational Inference.

As pointed out in (Ballé et al., 2017), the relaxed objective in Eq. 3 is the negative evidence lower bound (NELBO) of variational inference (VI) if we identify $\lambda = 1/(2 \sigma_x^2 \ln 2)$. This draws a connection between lossy compression and VI (Blei et al., 2017; Zhang et al., 2019). We emphasize a distinction between variational inference and variational expectation maximization (EM) (Beal and Ghahramani, 2003): whereas variational EM trains a model, VI is used to compress data using a trained model. A central result of this paper is that improving inference in a fixed model at compression time already suffices to significantly improve compression performance.

3 Novel Inference Techniques for Data Compression

This section presents our main contributions. We identify three approximation gaps in VAE-based compression methods (see Section 2): an amortization gap, a discretization gap, and a marginalization gap. Recently, the amortization gap was considered by Campos et al. (2019); we expand on this idea in Section 3.1, bringing it under a wider framework of algorithms in the VI literature. We then propose two specific methods that improve inference at compression time: a novel inference method over discrete representations (Section 3.2) that closes the discretization gap, and a novel lossy bits-back coding method (Section 3.3) that closes the marginalization gap.

3.1 Amortization Gap and Hybrid Amortized-Iterative Inference

Amortization Gap.

Amortized variational inference (Kingma and Welling, 2013; Rezende et al., 2014) turns optimization over local (per-data) variational parameters into optimization over the global weights of an inference network. In the VAE of Section 2, the inference networks $f$ and $f_h$ map an image $x$ to local parameters $\mu_y$ and $\mu_z$ of the variational distributions $q(y \mid \mu_y)$ and $q(z \mid \mu_z)$, respectively (Eq. 1). Amortized VI speeds up training by avoiding an expensive inner inference loop of variational EM, but it leads to an amortization gap (Cremer et al., 2018; Krishnan et al., 2018), which is the difference between the value of the NELBO (Eq. 3) when $\mu_y$ and $\mu_z$ are obtained from the inference networks, compared to its true minimum when the NELBO is minimized directly over $\mu_y$ and $\mu_z$. Since the NELBO approximates the rate-distortion objective (Section 2), the amortization gap translates into sub-optimal performance when amortization is used at compression time. As discussed below, this gap can be closed by refining the output of the inference networks by iterative inference.

Related Work.

The idea of combining amortized inference with iterative optimization has been studied and refined by various authors (Hjelm et al., 2016; Kim et al., 2018; Krishnan et al., 2018; Marino et al., 2018). While not formulated in the language of variational autoencoders and variational inference, Campos et al. (2019) apply a simple version of this idea to compression. Drawing on the connection to hybrid amortized-iterative inference, we show that the method can be drastically improved by addressing the discretization gap (Section 3.2) and the marginalization gap (Section 3.3).

Hybrid Amortized-Iterative Inference.

We reinterpret the proposal in (Campos et al., 2019) as a basic version of a hybrid amortized-iterative inference idea that changes inference only at compression time but not during model training. Figure 1d illustrates the approach. When compressing a target image $x$, one initializes $\mu_y$ and $\mu_z$ from the trained inference networks $f$ and $f_h$, see Eq. 1. One then treats $\mu_y$ and $\mu_z$ as local variational parameters, and minimizes the NELBO (Eq. 3) over $(\mu_y, \mu_z)$ with stochastic gradient descent. This approach thus separates inference at compression time from inference during model training. We show in the next two sections that this simple idea forms a powerful basis for new inference approaches that drastically improve compression performance.
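The refinement loop can be illustrated on a toy scalar problem. The sketch below is not the authors' code and makes several simplifying assumptions: the decoder is linear, $g(y) = 2y$, the prior is a standard Gaussian with the rate measured in nats, there is no hyperlatent, and the "inference network" is a deliberately imperfect stand-in (`0.3 * x`). Only the structure matches the text: initialize from the amortized prediction, then run stochastic gradient descent on the noise-relaxed objective for this one data point.

```python
import random

def exact_relaxed_objective(mu, x, lam):
    """Closed form of E_u[0.5*(mu+u)^2 + lam*(x - 2*(mu+u))^2] for u ~ U(-0.5, 0.5).

    Uses E[u] = 0 and E[u^2] = 1/12.
    """
    return 0.5 * (mu ** 2 + 1.0 / 12.0) + lam * ((x - 2.0 * mu) ** 2 + 4.0 / 12.0)

def refine(x, lam, mu_init, steps=5000, lr=0.002, seed=0):
    """Refine a single variational parameter mu by SGD on the relaxed objective."""
    rng = random.Random(seed)
    mu = mu_init
    for _ in range(steps):
        y = mu + rng.uniform(-0.5, 0.5)        # noisy latent sample
        grad = y - 4.0 * lam * (x - 2.0 * y)   # d/dmu of 0.5*y^2 + lam*(x - 2*y)^2
        mu -= lr * grad
    return mu

x, lam = 3.0, 1.0
mu_init = 0.3 * x                  # stand-in for an imperfect inference network
mu_opt = refine(x, lam, mu_init)
mu_star = 4.0 * lam * x / (1.0 + 8.0 * lam)   # analytic minimizer of the toy objective

print(exact_relaxed_objective(mu_init, x, lam))
print(exact_relaxed_objective(mu_opt, x, lam))
```

The difference between the two printed values is the amortization gap of this toy problem that the per-image optimization recovers. The model weights are never touched, so the decoder stays unchanged.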

3.2 Discretization Gap and Stochastic Gumbel Annealing (SGA)

Discretization Gap.

Compressing data to a bitstring is an inherently discrete optimization problem. As discrete optimization in high dimensions is difficult, neural compression methods instead optimize some relaxed objective function (such as the NELBO $\tilde{\mathcal{L}}_\lambda$ in Eq. 3) over continuous representations (such as $\mu_y$ and $\mu_z$), which are then discretized afterwards for entropy coding (see Eq. 1). This leads to a discretization gap: the difference between the true rate-distortion objective $\mathcal{L}_\lambda$ at the discretized representation $(\hat{y}, \hat{z})$, and the relaxed objective function $\tilde{\mathcal{L}}_\lambda$ at the continuous approximation $(\mu_y, \mu_z)$.

Related Work.

Current neural compression methods use a differentiable approximation to discretization during model training, such as the Straight-Through Estimator (STE) (Bengio et al., 2013; Oord et al., 2017; Yin et al., 2019), adding uniform noise (Ballé et al., 2017) (see Eq. 3), stochastic binarization (Toderici et al., 2016), and soft-to-hard quantization (Agustsson et al., 2017). Discretization at compression time was addressed in (Yang et al., 2020) without assuming that the training procedure takes discretization into account, whereas our work makes this additional assumption.

Stochastic Gumbel Annealing (SGA).

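Refining the hybrid amortized-iterative inference approach of Section 3.1, for a given image $x$, we aim to find its discrete (in our case, integer) representation $(\hat{y}, \hat{z})$ that optimizes the rate-distortion objective in Eq. 2, this time re-interpreted as a function of $(\hat{y}, \hat{z})$ directly. Our proposed annealing scheme approaches this discretization optimization problem with the help of continuous proxy variables $\mu_y$ and $\mu_z$, which we initialize from the inference networks as in Eq. 1. We limit the following discussion to the latents $\mu_y$ and $\hat{y}$, treating the hyperlatents $\mu_z$ and $\hat{z}$ analogously. For each dimension $i$ of $\mu_y$, we map the continuous coordinate $\mu_{y,i} \in \mathbb{R}$ to an integer coordinate $\hat{y}_i \in \mathbb{Z}$ by rounding either up or down. Let $r_{y,i}$ be a one-hot vector that indicates the rounding direction, with $r_{y,i} = (1, 0)$ for rounding down (denoted $\lfloor\cdot\rfloor$) and $r_{y,i} = (0, 1)$ for rounding up (denoted $\lceil\cdot\rceil$). Thus, the result of rounding is the inner product, $\hat{y}_i = r_{y,i} \cdot (\lfloor\mu_{y,i}\rfloor, \lceil\mu_{y,i}\rceil)$. Now we let $r_{y,i}$ be a Bernoulli variable with a “tempered” distribution

$$q_\tau(r_{y,i} \mid \mu_{y,i}) \propto \begin{cases} \exp\{-\psi(\mu_{y,i} - \lfloor\mu_{y,i}\rfloor)/\tau\} & \text{if } r_{y,i} = (1, 0) \\ \exp\{-\psi(\lceil\mu_{y,i}\rceil - \mu_{y,i})/\tau\} & \text{if } r_{y,i} = (0, 1) \end{cases} \tag{4}$$

with temperature parameter $\tau > 0$, and $\psi: [0, 1) \to \mathbb{R}$ is an increasing function satisfying $\lim_{\alpha \to 1} \psi(\alpha) = \infty$ (this is chosen to ensure continuity of the resulting objective (5); we used $\psi = \tanh^{-1}$). Thus, as $\mu_{y,i}$ approaches an integer, the probability of rounding to that integer approaches one. Defining $q_\tau(r_y \mid \mu_y) = \prod_i q_\tau(r_{y,i} \mid \mu_{y,i})$, we thus minimize the stochastic objective

$$\tilde{\mathcal{L}}_{\lambda,\tau}(\mu_y, \mu_z) = \mathbb{E}_{q_\tau(r_y \mid \mu_y)\, q_\tau(r_z \mid \mu_z)} \left[ \mathcal{L}_\lambda\big( r_y \cdot (\lfloor\mu_y\rfloor, \lceil\mu_y\rceil),\; r_z \cdot (\lfloor\mu_z\rfloor, \lceil\mu_z\rceil) \big) \right] \tag{5}$$

where we reintroduce the hyperlatents $z$, and $r_y \cdot (\lfloor\mu_y\rfloor, \lceil\mu_y\rceil)$ denotes the discrete vector $\hat{y}$ obtained by rounding each coordinate $\mu_{y,i}$ of $\mu_y$ according to its rounding direction $r_{y,i}$. At any temperature $\tau$, the objective in Eq. 5 smoothly interpolates the R-D objective in Eq. 2 between all integer points. We minimize Eq. 5 over $(\mu_y, \mu_z)$ with stochastic gradient descent, propagating gradients through Bernoulli samples via the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2016) using, for simplicity, the same temperature $\tau$ as in $q_\tau$. We note that in principle, REINFORCE (Williams, 1992) can also work instead of Gumbel-softmax. We anneal $\tau$ towards zero over the course of optimization, such that the Gumbel approximation becomes exact and the stochastic rounding operation converges to the deterministic one in Eq. 1. We thus name this method “Stochastic Gumbel Annealing” (SGA).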
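The sampling step at the core of SGA can be sketched for a single scalar coordinate as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, a small clipping constant `eps` is introduced (as an assumption) to keep the inverse hyperbolic tangent finite when the coordinate sits exactly on an integer, and the gradient step of the full method is omitted. The tempered rounding probabilities use $\psi = \tanh^{-1}$, and the relaxed sample uses the Gumbel-softmax trick with the same temperature.

```python
import math
import random

def rounding_candidates(mu):
    """The two integers that mu can be rounded to (down, up)."""
    lo = math.floor(mu)
    return lo, lo + 1

def rounding_logits(mu, tau, eps=1e-6):
    """Unnormalized log-probabilities of rounding down and up at temperature tau."""
    lo, hi = rounding_candidates(mu)
    # Clip distances below 1 so that atanh stays finite when mu is an integer.
    d_lo = min(mu - lo, 1.0 - eps)
    d_hi = min(hi - mu, 1.0 - eps)
    return -math.atanh(d_lo) / tau, -math.atanh(d_hi) / tau

def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def rounding_probs(mu, tau):
    """Probabilities (round down, round up) under the tempered distribution."""
    return softmax2(*rounding_logits(mu, tau))

def gumbel_softmax_round(mu, tau, rng):
    """Relaxed (differentiable in spirit) sample of the rounded value."""
    logit_lo, logit_hi = rounding_logits(mu, tau)
    # Gumbel(0, 1) noise via inverse transform sampling.
    g_lo = -math.log(-math.log(max(rng.random(), 1e-12)))
    g_hi = -math.log(-math.log(max(rng.random(), 1e-12)))
    w_lo, w_hi = softmax2((logit_lo + g_lo) / tau, (logit_hi + g_hi) / tau)
    lo, hi = rounding_candidates(mu)
    return w_lo * lo + w_hi * hi

rng = random.Random(0)
for tau in (0.5, 0.1, 0.01):   # annealing the temperature towards zero
    print(tau, rounding_probs(2.3, tau), gumbel_softmax_round(2.3, tau, rng))
```

At high temperature the relaxed sample can fall anywhere between the two neighboring integers; as the temperature decreases, the probability of the nearer integer approaches one and the sample collapses onto it, matching deterministic rounding in the limit.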


Figure 2: Comparing Stochastic Gumbel Annealing (Section 3.2) to alternatives (Table 1, [A1]-[A4]).

Comparison to Alternatives.

Alternatively, we could start with Eq. 2, and optimize it as a function of $(\mu_y, \mu_z)$, using existing discretization methods to relax rounding; the optimized $(\mu_y, \mu_z)$ would subsequently be rounded to $(\hat{y}, \hat{z})$. However, our method outperforms all these alternatives. Specifically, we tried (with shorthands [A1]-[A4] for ‘ablation’, referring to Table 1 of Section 4): [A1] MAP: ignoring rounding during optimization; [A2] Straight-Through Estimator (STE) (Bengio et al., 2013): rounding only on forward pass of automatic differentiation; [A3] Uniform Noise (Campos et al., 2019): optimizing Eq. 3 over $(\mu_y, \mu_z)$ at compression time; and [A4] Deterministic Annealing: turning the stochastic rounding operations into weighted averages, by pushing the expectation under $q_\tau$ inside the arguments of $\mathcal{L}_\lambda$ in Eq. 5 (resulting in an algorithm similar to (Agustsson et al., 2017)).

We tested the above discretization methods on images from the Kodak data set (Kodak, ), using a pre-trained hyperprior model (Section 2). Figure 2 (left) shows learning curves for the true rate-distortion objective $\mathcal{L}_\lambda$, evaluated via rounding the intermediate $(\mu_y, \mu_z)$ throughout optimization. Figure 2 (right) shows the discretization gap, i.e., the true objective $\mathcal{L}_\lambda$ minus the relaxed objective of each method. Our proposed SGA method closes the discretization gap and achieves the lowest true rate-distortion loss among all methods compared, using the same initialization. Deterministic Annealing also closes the gap and Uniform noise produces a consistently negative gap (as the initial tends to be close to ), but both converge to worse solutions than SGA. MAP converges to a non-integer solution with large discretization gap, and STE consistently diverges even with a tiny learning rate, as studied in (Yin et al., 2019) (the fixes via ReLU or clipped ReLU proposed in that paper did not help).

3.3 Marginalization Gap and Lossy Bits-Back Coding

Marginalization Gap.

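To compress an image $x$ with the hierarchical model of Section 2, ideally we would only need to encode and transmit its latent representation $\hat{y}$ but not the hyper-latents $\hat{z}$, as only $\hat{y}$ is needed for reconstruction. This would require entropy coding with (a discretization of) the marginal prior $p(y) = \int p(z)\, p(y \mid z)\, dz$, which unfortunately is computationally intractable. Therefore, the standard compression approach reviewed in Section 2 instead encodes some hyperlatent $\hat{z}$ using the discretized hyperprior $P(\hat{z})$ first, and then encodes $\hat{y}$ using the entropy model $P(\hat{y} \mid \hat{z})$. This leads to a marginalization gap, which is the difference between the information content $-\log_2 P(\hat{z}, \hat{y}) = -\log_2 P(\hat{z})\, P(\hat{y} \mid \hat{z})$ of the transmitted tuple $(\hat{z}, \hat{y})$ and the information content of $\hat{y}$ alone,

$$-\log_2 P(\hat{z}, \hat{y}) - \big(-\log_2 P(\hat{y})\big) = -\log_2 P(\hat{z} \mid \hat{y}). \tag{6}$$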
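A tiny discrete example may help to see why this gap is never negative. The joint distribution below is made up for illustration (two hyperlatent values, two latent values), and the function names are hypothetical; the only claim is the identity that the extra cost of transmitting the hyperlatent equals the information content of the hyperlatent given the latent.

```python
import math

# Hypothetical discrete model: hyperprior P(z) and conditional prior P(y | z).
p_z = {0: 0.5, 1: 0.5}
p_y_given_z = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def joint(z, y):
    """P(z, y) = P(z) * P(y | z)."""
    return p_z[z] * p_y_given_z[z][y]

def marginal(y):
    """P(y), obtained by summing the joint over all hyperlatent values."""
    return sum(joint(z, y) for z in p_z)

def bits(p):
    """Information content in bits of an event with probability p."""
    return -math.log2(p)

def marginalization_gap(z, y):
    """Extra bits paid for sending (z, y) instead of y alone."""
    return bits(joint(z, y)) - bits(marginal(y))

for z in p_z:
    for y in (0, 1):
        print(z, y, round(marginalization_gap(z, y), 4))
```

Summing over the hyperlatent is trivial with two values, but it is intractable for the high-dimensional continuous hyperlatents of the actual model, which is why the marginal prior cannot be used directly as an entropy model.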


Related Work.

A similar marginalization gap has been addressed in lossless compression by bits-back coding (Wallace, 1990; Hinton and Van Camp, 1993) and made more practical by the BB-ANS algorithm (Townsend et al., 2019b). Recent work has improved its efficiency in hierarchical latent variable models (Townsend et al., 2019a; Kingma et al., 2019), and extended it to flow models (Ho et al., 2019). To our knowledge, bits-back coding has not yet been used in lossy compression. This is likely because bits-back coding requires Bayesian posterior inference on the decoder side, which seems incompatible with lossy compression, as the decoder does not have access to the undistorted data $x$.

Lossy Bits-Back Coding.

We extend the bits-back idea to lossy compression by noting that the discretized latent representation $\hat{y}$ is encoded losslessly (Figure 1b) and thus amenable to bits-back coding. A complication arises as bits-back coding requires that the encoder’s inference over $z$ can be exactly reproduced by the decoder (see below). This requirement is violated by the inference network $f_h$ in Eq. 1, where $\mu_z = f_h(f(x; \phi); \phi_h)$ depends on $x$, which is not available to the decoder in lossy compression. It turns out that the naive fix of setting instead $\mu_z = f_h(\hat{y}; \phi_h)$ would hurt performance by more than what bits-back saves (see ablations in Section 4). We propose instead a two-stage inference algorithm that cuts the dependency on $x$ after an initial joint inference over $\hat{y}$ and $z$.

Bits-back coding bridges the marginalization gap in Eq. 6 by encoding a limited number of bits of some additional side information (e.g., an image caption or a previously encoded image of a slide show) into the choice of $\hat{z}$ using an entropy model $Q(\hat{z} \mid \mu_z, \sigma_z^2)$ with parameters $\mu_z$ and $\sigma_z^2$. We obtain $Q(\hat{z} \mid \mu_z, \sigma_z^2)$ by discretizing a Gaussian variational distribution $q(z \mid \mu_z, \sigma_z^2) = \mathcal{N}(z; \mu_z, \operatorname{diag}(\sigma_z^2))$, thus extending the last layer of the hyper-inference network $f_h$ (Eq. 1) to output now a tuple $(\mu_z, \sigma_z^2)$ of means and diagonal variances. This replaces the form of variational distribution $q(z \mid \mu_z)$ of Eq. 3, as the restriction to a box-shaped distribution with fixed width is not necessary in bits-back coding. We train the resulting VAE by minimizing the NELBO. Since $q(z \mid \mu_z, \sigma_z^2)$ now has a variable width, the NELBO has an extra term compared to Eq. 3 that subtracts the entropy of $q(z \mid \mu_z, \sigma_z^2)$, reflecting the expected number of bits we ‘get back’. Thus, the NELBO is again a relaxed rate-distortion objective, where now the net rate is the compressed file size minus the amount of embedded side information.


Our lossy bits-back algorithm covers both compression and decompression with the trained model. Subroutine encode initializes the variational parameters $\mu_y$, $\mu_z$, and $\sigma_z^2$ conditioned on the target image $x$ using the trained inference networks. It then jointly performs SGA over $\hat{y}$ by following Section 3.2, and Black-Box Variational Inference (BBVI) over $\mu_z$ and $\sigma_z^2$ by minimizing the NELBO. At the end of the routine, the encoder decodes the provided side information (an arbitrary bitstring) into $\hat{z}$ using $Q(\hat{z} \mid \mu_z, \sigma_z^2)$ as entropy model, and then encodes $\hat{z}$ and $\hat{y}$ as usual.

The important step happens between the joint optimization and the decoding of the side information: up until this step, the fitted variational parameters $\mu_z$ and $\sigma_z^2$ depend on the target image $x$ due to their initialization from the inference networks. This would prevent the decoder, which does not have access to $x$, from reconstructing the side information by encoding $\hat{z}$ with the entropy model $Q(\hat{z} \mid \mu_z, \sigma_z^2)$. The encoder therefore re-fits the variational parameters $\mu_z$ and $\sigma_z^2$ in a way that is exactly reproducible based on $\hat{y}$ alone. This is done in the subroutine reproducible_BBVI, which performs BBVI in the prior model $p(z, y) = p(z)\, p(y \mid z)$, treating $y = \hat{y}$ as observed and only $z$ as latent. Although we reset $\mu_z$ and $\sigma_z^2$ immediately after optimizing over them in the joint optimization step, optimizing jointly over both $\hat{y}$ and $(\mu_z, \sigma_z^2)$ allows the method to find a better $\hat{y}$.

The decoder decodes $\hat{z}$ and $\hat{y}$ as usual. Since the subroutine reproducible_BBVI depends only on $\hat{y}$ and uses a fixed random seed, calling it from the decoder yields the exact same entropy model $Q(\hat{z} \mid \mu_z, \sigma_z^2)$ as used by the encoder, allowing the decoder to recover the side information by encoding $\hat{z}$ with this entropy model.
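The reproducibility requirement can be illustrated with a toy version of the re-fitting step. The sketch below is an assumption-laden stand-in, not the paper's subroutine: the prior model is a scalar toy ($z \sim \mathcal{N}(0, 1)$, $y \mid z \sim \mathcal{N}(z, 1)$), the variational family is a single Gaussian fitted with reparameterization gradients, and the function name merely echoes the one used in the text. What it demonstrates is the property the text relies on: with a fixed seed and an initialization that depends only on the decoded latent, encoder and decoder arrive at bit-identical variational parameters.

```python
import math
import random

def reproducible_bbvi(y_hat, steps=3000, lr=0.01, seed=0):
    """Fit q(z) = N(m, s^2) to the toy posterior p(z | y_hat), using y_hat only."""
    rng = random.Random(seed)   # fixed seed: encoder and decoder draw the same noise
    m, log_s = 0.0, 0.0         # initialization must not depend on the image x
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0)
        s = math.exp(log_s)
        z = m + s * eps                      # reparameterized sample from q
        dz = 2.0 * z - y_hat                 # d/dz of 0.5*z^2 + 0.5*(y_hat - z)^2
        m -= lr * dz
        log_s -= lr * (dz * s * eps - 1.0)   # the -1 is the gradient of -entropy
    return m, math.exp(log_s)

y_hat = 3.0
encoder_params = reproducible_bbvi(y_hat)   # run by the encoder
decoder_params = reproducible_bbvi(y_hat)   # run again by the decoder
print(encoder_params == decoder_params, encoder_params)
```

In this toy model the exact posterior is Gaussian with mean `y_hat / 2` and variance `1/2`, so the fitted parameters can be checked against it. In practice, exact reproducibility across machines additionally requires deterministic floating-point behavior, which this sketch does not address.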

4 Experiments

We demonstrate the empirical effectiveness of our approach by applying two variants of it —with and without bits-back coding—to an established (but not state-of-the-art) base model (Minnen et al., 2018). We improve its performance drastically, achieving an average of over BD rate savings on Kodak and on Tecnick (Asuni and Giachetti, 2014), outperforming the previous state-of-the-art of both classical and neural lossy image compression methods. We conclude with ablation studies.

Name | Explanation and Reference (for baselines)

Proposed methods:
[M1] SGA | Proposed standalone variant: same trained model as [M3], SGA (Section 3.2) at compression time.
[M2] SGA + BB | Proposed variant with bits-back coding: both proposed improvements of Sections 3.2 and 3.3.

Baselines:
[M3] Base Hyperprior | Base method of our two proposals, reviewed in Section 2 of the present paper (Minnen et al., 2018).
[M4] Context + Hyperprior | Like [M3] but with an extra context model defining the prior (Minnen et al., 2018).
[M5] Context-Adaptive | Context-Adaptive Entropy Model proposed in (Lee et al., 2019).
[M6] Hyperprior Scale-Only | Like [M3] but the hyperlatents $z$ model only the scale (not the mean) of $p(y \mid z)$ (Ballé et al., 2018).
[M7] CAE | Pioneering “Compressive Autoencoder” model proposed in (Theis et al., 2017).
[M8] BPG 4:4:4 | State-of-the-art classical lossy image compression codec ‘Better Portable Graphics’ (Bellard, 2014).
[M9] JPEG 2000 | Classical lossy image compression method (Adams, 2001).

Ablations:
[A1] MAP | Like [M1] but with continuous optimization over $\mu_y$ and $\mu_z$ followed by rounding instead of SGA.
[A2] STE | Like [M1] but with straight-through estimation (i.e., round only on the forward pass) instead of SGA.
[A3] Uniform Noise | Like [M1] but with uniform noise injection instead of SGA (see Section 3.1) (Campos et al., 2019).
[A4] Deterministic Annealing | Like [M1] but with a deterministic version of Stochastic Gumbel Annealing.
[A5] BB without SGA | Like [M2] but without optimization over $\hat{y}$ at compression time.
[A6] BB without iterative inference | Like [M2] but without optimization over $\hat{y}$, $\mu_z$, and $\sigma_z^2$ at compression time.

Table 1: Compared methods ([M1]-[M9]) and ablations ([A1]-[A6]). We propose two variants: [M1] compresses images, and [M2] compresses images + side information via bits-back coding.

Proposed Methods.

Table 1 describes all compared methods, marked as [M1]-[M9] for short. We tested two variants of our method ([M1] and [M2]): SGA builds on the exact same model and training procedure as in (Minnen et al., 2018) (the Base Hyperprior model [M3]) and changes only how the trained model is used for compression by introducing hybrid amortized-iterative inference (Campos et al., 2019) (Section 3.1) and Stochastic Gumbel Annealing (Section 3.2). SGA+BB adds to it bits-back coding (Section 3.3), which requires changing the inference model over hyperlatents $z$ to admit a more flexible variational distribution as discussed in Section 3.3; it also lifts a restriction on the shape of the hyperprior in (Minnen et al., 2018) as explained below Eq. 1. The two proposed variants address different use cases: SGA is a “standalone” variant of our method that compresses individual images while SGA+BB encodes images with additional side information. In all results, we annealed the temperature of SGA with an exponential decay schedule following (Jang et al., 2016) and used Adam (Kingma and Ba, 2015) for optimization; we found good convergence without extensive hyperparameter tuning, and provide details in the Supplementary Material.


We compare to the Base Hyperprior model from (Minnen et al., 2018) without our proposed improvements ([M3] in Table 1), two state-of-the-art neural methods ([M4] and [M5]), two other neural methods ([M6] and [M7]), the state-of-the-art classical codec BPG [M8], and JPEG 2000 [M9]. We reproduced the Base Hyperprior results, and took the other results from (Ballé et al., ).

Figure 3: Compression performance comparisons on Kodak against existing baselines. Left: R-D curves. Right: BD rate savings (%) relative to BPG. Legend shared; higher values are better in both.
Figure 4: BD rate savings of various ablation methods relative to Base Hyperprior on Kodak.


Figure 3 compares the compression performance of our proposed method to existing baselines on the Kodak dataset, using the standard Peak Signal-to-Noise Ratio (PSNR) quality metric (higher is better), averaged over all images for each considered quality setting $\lambda$. The left panel of Figure 3 plots PSNR vs. bitrate, and the right panel shows the resulting BD rate savings (Bjontegaard, 2001) computed relative to BPG as a function of PSNR for readability (higher is better). The BD plot cuts out CAE [M7] and JPEG 2000 [M9] at the bottom, which performed worse. Overall, both variants of the proposed method (blue and orange lines in Figure 3) improve substantially over the Base Hyperprior model (brown), and outperform the previous state-of-the-art. We report similar results on the Tecnick dataset (Asuni and Giachetti, 2014) in the Supplementary Material. Figure 5 shows a qualitative comparison of a compressed image (in the order of BPG, SGA (proposed), and the Base Hyperprior), in which we see that our method notably enhances the baseline image reconstruction at comparable bitrate, while avoiding unpleasant visual artifacts of BPG.
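For reference, the PSNR metric used on the vertical axis of the rate-distortion curves can be computed as in the following sketch. It assumes 8-bit images given as flat sequences of pixel values (so the peak value is 255); the function name and the flat-list representation are illustrative choices, not part of the evaluation code used for the reported results.

```python
import math

def psnr(original, reconstruction, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstruction) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel off by one gray level gives MSE = 1.
print(psnr([10, 20, 30], [11, 21, 31]))
```

Because PSNR is a logarithm of the mean squared error, it is monotonically related to the squared-error distortion term of the rate-distortion objective, which is why minimizing that objective at a fixed bitrate raises PSNR.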

Ablations. Figure 4 compares BD rate improvements of the two variants of our proposal (blue and orange) to six alternative choices (ablations [A1]-[A6] in Table 1), measured relative to the Base Hyperprior model (Minnen et al., 2018) (zero line; [M3] in Table 1), on which our proposals build. [A1]-[A4] replace the discretization method of SGA [M1] with four alternatives discussed at the end of Section 3.2. [A5] and [A6] are ablations of our proposed bits-back variant SGA+BB [M2], which remove iterative optimization over $\hat{y}$, or over $\hat{y}$, $\mu_z$, and $\sigma_z^2$, respectively, at compression time. Going from red [A3] to orange [M1] to blue [M2] traces our proposed improvements over (Campos et al., 2019) by adding SGA (Section 3.2) and then bits-back (Section 3.3). It shows that iterative inference (Section 3.1) and SGA contribute approximately equally to the performance gain, whereas the gain from bits-back coding is smaller. Interestingly, lossy bits-back coding without iterative inference and SGA actually hurts performance [A6]. As Section 3.3 mentioned, this is likely because bits-back coding constrains the posterior over $z$ to be conditioned only on $\hat{y}$ and not on the original image $x$.


(a) Original image (from Kodak dataset)
(b) BPG 4:4:4; 0.14 BPP
(c) Proposed (SGA); 0.13 BPP
(d) (Minnen et al., 2018); 0.12 BPP
Figure 5: Qualitative comparison of lossy compression performance. Our method (c; [M1] in Table 1) significantly boosts the visual quality of the Base Hyperprior method (d; [M3] in Table 1) at similar bit rates. It sharpens details of the hair and face (see insets) obscured by the baseline method (d), while avoiding the ringing artifacts around the jaw and ear pendant produced by a classical codec (b).

5 Discussion

Starting from the variational inference view on data compression, we proposed three enhancements to the standard inference procedure in VAEs: hybrid amortized-iterative inference, Stochastic Gumbel Annealing, and lossy bits-back coding, which translated to dramatic performance gains on lossy image compression. Improved inference provides a promising new direction for improving compression, orthogonal to modeling choices (e.g., with auto-regressive priors (Minnen et al., 2018), which can harm decoding efficiency). Although lossy bits-back coding in the present VAE only gave relatively minor benefits, it may reach its full potential in more hierarchical architectures as in (Kingma et al., 2019). Similarly, carrying out iterative inference at training time as well may lead to further performance improvements, with techniques like iterative amortized inference (Marino et al., 2018).

6 Acknowledgements

We thank Yang Yang for valuable feedback on the manuscript. Yibo Yang acknowledges funding from the Hasso Plattner Foundation. Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FW-HTF-RM), and Qualcomm.


  • M. D. Adams (2001) The jpeg-2000 still image compression standard. Cited by: Table 1.
  • E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, Cited by: §3.2, §3.2.
  • A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy (2017) Fixing a broken ELBO. arXiv preprint arXiv:1711.00464. Cited by: Improving Inference for Neural Image Compression.
  • N. Asuni and A. Giachetti (2014) TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms (SAMPLING 1200 RGB set). In STAG: Smart Tools and Apps for Graphics, Cited by: Figure S2, §S3, §4, §4.
  • J. Ballé, S. J. Hwang, N. Johnston, and D. Minnen. TensorFlow Compression: data compression in TensorFlow. Cited by: §S1, §S3, §4.
  • J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. International Conference on Learning Representations. Cited by: Figure 1, §2, §2, §2, §2, §3.2.
  • J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In ICLR. Cited by: Improving Inference for Neural Image Compression, item 2, §1, §S1, §S1, §2, §2, §S3, Table 1.
  • M. J. Beal and Z. Ghahramani (2003) The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, Vol. 7, pp. 453–464. Cited by: §2.
  • F. Bellard (2014) BPG specification. Accessed June 3, 2020. Cited by: Table 1.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.2, §3.2.
  • G. Bjontegaard (2001) Calculation of average PSNR differences between RD-curves. VCEG-M33. Cited by: §4.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §2.
  • J. Campos, S. Meierhans, A. Djelouah, and C. Schroers (2019) Content adaptive optimization for neural image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: item 1, §3.1, §3.1, §3.2, §3, §4, §4, Table 1.
  • C. Cremer, X. Li, and D. Duvenaud (2018) Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pp. 1078–1086. Cited by: §3.1.
  • A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042. Cited by: item 3, §1.
  • G. E. Hinton and D. Van Camp (1993) Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13. Cited by: item 3, §3.3.
  • D. Hjelm, R. R. Salakhutdinov, K. Cho, N. Jojic, V. Calhoun, and J. Chung (2016) Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, pp. 4691–4699. Cited by: §3.1.
  • J. Ho, E. Lohn, and P. Abbeel (2019) Compression with flows via local bits-back coding. In Advances in Neural Information Processing Systems, pp. 3874–3883. Cited by: §3.3.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §3.2, §4.
  • N. Johnston, E. Eban, A. Gordon, and J. Ballé (2019) Computationally efficient neural image compression. arXiv preprint arXiv:1912.08771. Cited by: §2.
  • Y. Kim, S. Wiseman, A. Miller, D. Sontag, and A. Rush (2018) Semi-amortized variational autoencoders. In International Conference on Machine Learning, pp. 2678–2687. Cited by: §3.1.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In ICLR: International Conference on Learning Representations, Cited by: §4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.1.
  • F. H. Kingma, P. Abbeel, and J. Ho (2019) Bit-swap: recursive bits-back coding for lossless compression with hierarchical latent variables. arXiv preprint arXiv:1905.06845. Cited by: §3.3, §5.
  • E. Kodak. Kodak lossless true color image suite (PhotoCD PCD0992). Cited by: 0(a), Figure S1, §3.2, §S3, Figure 4, 4(a), §4, §4.
  • R. Krishnan, D. Liang, and M. Hoffman (2018) On the challenges of learning with inference networks on sparse, high-dimensional data. In International Conference on Artificial Intelligence and Statistics, pp. 143–151. Cited by: §3.1, §3.1.
  • J. Lee, S. Cho, and S. Beack (2019) Context-adaptive entropy model for end-to-end optimized image compression. In the 7th Int. Conf. on Learning Representations, Cited by: Improving Inference for Neural Image Compression, §1, §2, Table 1.
  • S. Lombardo, J. Han, C. Schroers, and S. Mandt (2019) Deep generative video compression. In Advances in Neural Information Processing Systems 32, pp. 9287–9298. Cited by: §1.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.2.
  • J. Marino, Y. Yue, and S. Mandt (2018) Iterative amortized inference. In Proceedings of the 35th International Conference on Machine Learning, pp. 3403–3412. Cited by: §3.1, §5.
  • D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780. Cited by: Improving Inference for Neural Image Compression, §1, §1, §S1, Figure 1, §2, §2, §2, 0(d), 4(d), §4, §4, §4, Table 1, §4, §5.
  • A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. arXiv preprint arXiv:1711.00937. Cited by: §3.2.
  • G. Randers-Pehrson (1997) MNG: a multiple-image format in the png family. World Wide Web Journal 2 (1), pp. 209–211. Cited by: §1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §1, §3.1.
  • L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. International Conference on Learning Representations. Cited by: §2, §2, Table 1.
  • G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2016) Variable rate image compression with recurrent neural networks. International Conference on Learning Representations. Cited by: §3.2.
  • J. Townsend, T. Bird, J. Kunze, and D. Barber (2019a) HiLLoC: lossless image compression with hierarchical latent variable models. arXiv preprint arXiv:1912.09953. Cited by: §3.3.
  • J. Townsend, T. Bird, and D. Barber (2019b) Practical lossless compression with latent variables using bits back coding. arXiv preprint arXiv:1901.04866. Cited by: §3.3.
  • C. S. Wallace (1990) Classification by minimum-message-length inference. In International Conference on Computing and Information, pp. 72–81. Cited by: item 3, §3.3.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §3.2.
  • Y. Yang, R. Bamler, and S. Mandt (2020) Variable-bitrate neural compression via Bayesian arithmetic coding. In International Conference on Machine Learning, Cited by: §3.2.
  • P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin (2019) Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662. Cited by: §S2, §3.2, §3.2.
  • C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt (2019) Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.

S1 Model Architecture and Training

As mentioned in the main text, the Base Hyperprior model uses the same architecture as in Table 1 of Minnen et al. [2018], except without the “Context Prediction” and “Entropy Parameters” components (that work refers to this model as “Mean & Scale Hyperprior”). The model is mostly already implemented in bmshj2018.py from the tensorflow-compression library of Ballé et al., following Ballé et al. [2018]; we modified that implementation to double the number of output channels of the HyperSynthesisTransform so that it predicts both the mean and (log) scale of the conditional prior over the latents.
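To illustrate what doubling the output channels buys, the following minimal sketch splits a vector of 2C network outputs into C means and C scales. It is not taken from bmshj2018.py: the function name `split_mean_and_scale` and the convention that the first half holds the means and the second half the log-scales are assumptions made for illustration only.

```python
import math


def split_mean_and_scale(params):
    """Split 2*C raw outputs into C means and C strictly positive scales.

    Assumed layout (hypothetical): first half = means, second half = log-scales.
    """
    if len(params) % 2 != 0:
        raise ValueError("expected an even number of output channels")
    half = len(params) // 2
    means = list(params[:half])
    # Exponentiating the log-scale guarantees a positive scale parameter.
    scales = [math.exp(log_scale) for log_scale in params[half:]]
    return means, scales


# Toy example with C = 2 channels.
means, scales = split_mean_and_scale([0.5, -1.0, 0.0, math.log(2.0)])
print(means, scales)
```

Predicting the log of the scale rather than the scale itself lets the network output any real number while the resulting scale stays positive.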

As mentioned in Sections 3.3 and 4, the lossy bits-back variant of this model makes the following modifications:

  1. The number of output channels of the hyper inference network is doubled to compute both the mean and diagonal (log) variance of a Gaussian;

  2. The hyperprior is no longer restricted to the form of a flexible density model convolved with a uniform distribution; instead it simply uses the flexible density, as described in the Appendix of Ballé et al. [2018].

The above models were trained on CLIC-2018 images for 2 million iterations, using minibatches of eight randomly-cropped image patches. We trained models with on a log-scale. Following Ballé et al. [2018], we found that increasing the number of latent channels helps avoid the “bottleneck” effect and improve rate-distortion performance at higher rates, and increased the number of latent channels from 192 to 256 for models trained with and .

S2 Hyper-Parameters of Various Methods Considered

Given a pre-trained model (see above) and image(s) to be compressed, our experiments in Section 4 explored hybrid amortized-iterative inference to improve compression performance. Below we provide hyper-parameters of each method considered:

Stochastic Gumbel Annealing.

In all experiments using SGA (including the SGA+BB experiments, which optimized over using SGA), we used the Adam optimizer with an initial learning rate of , and an exponentially decaying temperature schedule , where is the iteration number, and controls the speed of the decay. We found that lower generally increases the number of iterations needed for convergence, but can also yield slightly better solutions; in all of our experiments we set , and obtained good convergence with iterations.
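The trade-off between decay speed and iteration count can be made concrete with a small sketch. The exact schedule and constants are not reproduced here; the code below assumes the simple form tau(t) = tau0 * exp(-c * t), and the names `temperature`, `iterations_to_reach`, and all numeric values are hypothetical placeholders rather than the settings used in our experiments.

```python
import math


def temperature(t, decay_rate, tau0=1.0):
    """Temperature at iteration t under an assumed exponential decay."""
    return tau0 * math.exp(-decay_rate * t)


def iterations_to_reach(target_tau, decay_rate, tau0=1.0):
    """Smallest iteration count at which the temperature drops to target_tau."""
    return math.ceil(math.log(tau0 / target_tau) / decay_rate)


# Hypothetical decay rates: a slower decay needs more iterations to cool down.
fast = iterations_to_reach(0.1, decay_rate=0.01)
slow = iterations_to_reach(0.1, decay_rate=0.001)
print(fast, slow)
```

Under this assumed form, dividing the decay rate by ten multiplies the number of iterations needed to reach a given temperature by roughly ten, which matches the qualitative behaviour described above.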

Lossy bits-back.

In the SGA+BB experiments, the joint-optimization step of the encode subroutine of the lossy bits-back algorithm performs joint optimization w.r.t. , , and with Black-box Variational Inference (BBVI), using the reparameterization trick to differentiate through Gumbel-softmax samples for , and Gaussian samples for . We used Adam with an initial learning rate of , and ran stochastic gradient descent iterations; the optimization w.r.t.  (SGA) used the same temperature annealing schedule as above. The reproducible_BBVI subroutine of the same algorithm used Adam with an initial learning rate of and stochastic gradient descent iterations; we used the same random seed for the encoder and decoder for simplicity, although a hash of can also be used instead.
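The two sampling primitives mentioned above, and the reason a shared seed makes the stochastic optimization reproducible on the decoder side, can be sketched as follows. This is an illustrative stand-alone example using Python's standard library, not the code used in our experiments; the helper names `gaussian_reparam` and `gumbel_softmax` and the seed value are assumptions.

```python
import math
import random


def gaussian_reparam(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so that both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in noisy]
    total = sum(weights)
    return [w / total for w in weights]


# Encoder and decoder seed their generators identically (hypothetical seed).
encoder_rng = random.Random(1234)
decoder_rng = random.Random(1234)
z_encoder = gaussian_reparam([0.0, 1.0], [0.0, -2.0], encoder_rng)
z_decoder = gaussian_reparam([0.0, 1.0], [0.0, -2.0], decoder_rng)
probs = gumbel_softmax([0.2, -0.3], tau=0.5, rng=encoder_rng)
print(z_encoder == z_decoder, probs)
```

Because both sides draw identical noise, a decoder that repeats the same sequence of sampling and update steps arrives at the same result as the encoder, which is the property the shared seed is meant to provide.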

Alternative Discretization Methods

In Figure 2 and ablation studies of Section 4, all alternative discretization methods were optimized with Adam for iterations to compare with SGA. The initial learning rate was for MAP, Uniform Noise, and Deterministic Annealing, and for STE. All methods were tuned on a best-effort basis to ensure convergence, except that STE consistently encountered convergence issues even with a tiny learning rate (see [Yin et al., 2019]). The rate-distortion results for MAP and STE were calculated with early stopping (i.e., using the intermediate with the lowest true rate-distortion objective during optimization), just to give them a fair chance. Lastly, the comparisons in Figure 2 used the Base Hyperprior model trained with .
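The early-stopping rule used for MAP and STE amounts to keeping the best iterate seen so far under the true (discretized) objective, rather than the final one. A minimal generic sketch is given below; the function name `optimize_with_early_stopping` and the toy update rule and objective are hypothetical and stand in for the actual optimizer step and rate-distortion loss.

```python
def optimize_with_early_stopping(step, true_objective, init, num_iters):
    """Run `step` for num_iters iterations; return the best iterate found.

    `true_objective` is evaluated on every iterate (including the initial one),
    and the iterate with the lowest value is returned with that value.
    """
    current = init
    best, best_value = init, true_objective(init)
    for _ in range(num_iters):
        current = step(current)
        value = true_objective(current)
        if value < best_value:
            best, best_value = current, value
    return best, best_value


# Toy stand-in: the update keeps drifting upward, while the "true" objective
# is measured after rounding, so the final iterate is not the best one.
best, best_value = optimize_with_early_stopping(
    step=lambda x: x + 0.7,
    true_objective=lambda x: (round(x) - 2) ** 2,
    init=0.0,
    num_iters=5,
)
print(best, best_value)
```

In this toy run the optimizer overshoots, and the returned iterate is the intermediate one whose rounded value attains the lowest objective.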

S3 Additional Results

On Kodak, we achieve the following BD rate savings: for SGA+BB and for SGA relative to Base Hyperprior; for SGA+BB and for SGA relative to BPG. On Tecnick, we achieve the following BD rate savings: for SGA+BB and for SGA relative to Base Hyperprior; for SGA+BB and for SGA relative to BPG. We note that the bitrates in our rate-distortion results were based on rate estimates from entropy losses; for SGA, we confirmed that the rate estimate from the prior agrees with (typically within 1% of) the actual file size produced by the codec implementation in the tensorflow-compression library of Ballé et al. (see also Ballé et al. [2018]).
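For readers unfamiliar with entropy-based rate estimates, the sketch below shows how an estimated bitrate follows from the probabilities a model assigns to the coded symbols, and how such an estimate can be compared against a real file size. The function names and the toy numbers are hypothetical; this is not the evaluation code behind the reported results.

```python
import math


def estimated_bits(probabilities):
    """Ideal code length in bits: sum of -log2(p) over all coded symbols."""
    return sum(-math.log2(p) for p in probabilities)


def bits_per_pixel(num_bits, height, width):
    """Bitrate normalized by the number of image pixels."""
    return num_bits / (height * width)


def relative_gap(estimate_bits, file_size_bytes):
    """Relative difference between the estimate and the actual file size."""
    actual_bits = 8 * file_size_bytes
    return abs(estimate_bits - actual_bits) / actual_bits


# Toy example: four symbols with dyadic probabilities on a 2x2 "image".
bits = estimated_bits([0.5, 0.25, 0.25, 0.125])
bpp = bits_per_pixel(bits, height=2, width=2)
gap = relative_gap(bits, file_size_bytes=1)
print(bits, bpp, gap)
```

An entropy coder can approach this ideal code length closely, which is why the estimate and the actual file size are expected to agree to within a small relative gap.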

Below we provide an additional qualitative comparison on the Kodak dataset, and report detailed rate-distortion performance on the Tecnick [Asuni and Giachetti, 2014] dataset.

[Figure S1 panels: (a) Original image (kodim19 from Kodak); (b) BPG 4:4:4, 0.143 BPP; (c) Proposed (SGA), 0.142 BPP; (d) [Minnen et al., 2018], 0.130 BPP.]

Figure S1: Qualitative comparison of lossy compression performance on an image from the Kodak dataset. Figures in the bottom row focus on the same cropped region of images in the top row. Our method (c; [M1] in Table 1) significantly boosts the visual quality of the Base Hyperprior method (d; [M3] in Table 1) at similar bit rates. Unlike the learning-based methods (subfigures c, d), the classical codec BPG (subfigure b) introduces blocking and geometric artifacts near the rooftop, and ringing artifacts around the contour of the lighthouse.


Figure S2: Rate-distortion performance comparisons on the Tecnick [Asuni and Giachetti, 2014] dataset against existing baselines. Image quality measured in Peak Signal-to-Noise Ratio (PSNR) in RGB; higher values are better.
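As a reminder of how the quality metric is defined, the following sketch computes PSNR from the mean squared error over all RGB sample values of one image. It assumes 8-bit samples (peak value 255) and a single MSE pooled over all pixels and channels; the function name `psnr_rgb` is hypothetical, and how per-image values are aggregated over a dataset is not covered here.

```python
import math


def psnr_rgb(original, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length flat sequences of RGB samples."""
    if len(original) != len(reconstruction):
        raise ValueError("images must have the same number of samples")
    mse = sum((a - b) ** 2
              for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: one RGB pixel, reconstructed with small and large errors.
pixel = [120, 64, 200]
print(psnr_rgb(pixel, [121, 64, 199]), psnr_rgb(pixel, [130, 50, 180]))
```

Smaller reconstruction error gives a smaller MSE and therefore a higher PSNR, consistent with higher values being better.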


Figure S3: BD rate savings (%) relative to BPG as a function of PSNR, computed from R-D curves (Figure S2) on Tecnick. Higher values are better.