1 Introduction
Deep learning methods are reshaping the field of data compression, and have recently started to outperform state-of-the-art classical codecs on image compression (Minnen et al., 2018). Besides being useful in its own right, image compression can be seen as a stepping stone towards better video codecs (Lombardo et al., 2019; Habibian et al., 2019), which could reduce a sizable amount of global internet traffic.
State-of-the-art neural methods for lossy image compression (Ballé et al., 2018; Minnen et al., 2018; Lee et al., 2019) learn a mapping between images and latent variables with a variational autoencoder (VAE). An inference network maps a given image to a compressible latent representation, which a generative network can then map back to a reconstructed image. In fact, compression can be more broadly seen as a form of inference: to compress—or “encode”—data, one has to perform inference over a well-specified decompression—or “decoding”—algorithm.
In classical compression codecs, the decoder has to follow a well-specified procedure to ensure interoperability between different implementations of the same codec. By contrast, the encoding process is typically not uniquely defined: different encoder implementations of the same codec often compress the same input data to different bitstrings. For example, pngcrush (Randers-Pehrson, 1997) typically produces smaller PNG files than ImageMagick. Yet, both programs are standard-compliant PNG encoder implementations, as both produce a compressed file that decodes to the same image. In other words, both encoder implementations perform correct inference over the standardized decoder specification, but use different inference algorithms with different performance characteristics.
The insight that better inference leads to better compression performance even within the same codec motivates us to reconsider how inference is typically done in neural data compression with VAEs. In this paper, we show that the conventional amortized inference (Kingma and Welling, 2013; Rezende et al., 2014) in VAEs leaves substantial room for improvement when used for data compression.
We propose an improved inference method for data compression tasks with three main innovations:

Improved amortization: The amortized inference strategy in VAEs speeds up training, but is restrictive at compression time. We draw a connection between a recently proposed iterative procedure for compression (Campos et al., 2019) and the broader literature of variational inference (VI) on closing the amortization gap; this connection provides the basis for the following two novel inference methods.

Improved discretization: Compression requires discretizing the latent representation from VAEs, because only discrete values can be entropy coded. As inference over discrete variables is difficult, existing methods typically relax the discretization constraint in some way during inference and then discretize afterwards. We instead propose a novel method based on a stochastic annealing scheme that performs inference directly over discrete points.

Improved entropy coding: In lossless compression with latent-variable models, bits-back coding (Wallace, 1990; Hinton and Van Camp, 1993) allows approximately coding the latents with the marginal prior. It has so far been believed that bits-back coding is incompatible with lossy compression (Habibian et al., 2019), because it requires inference on the decoder side, which does not have access to the exact (undistorted) input data. We propose a remedy to this limitation, resulting in the first application of bits-back coding to lossy compression.
We evaluate the above three innovations on an otherwise unchanged architecture of an established model and compare against a wide range of baselines and ablations. Our proposals significantly improve compression performance and result in a new state of the art in lossy image compression.
2 Background: Lossy Neural Image Compression as Variational Inference
In this section, we summarize an existing framework for lossy image compression with deep latent variable models, which will be the basis of the three improvements we propose in Section 3.
Related Work.
Ballé et al. (2017) and Theis et al. (2017) were among the first to recognize a connection between the rate-distortion objective of lossy compression and the loss function of a certain kind of Variational Autoencoder (VAE), and to apply it to end-to-end image compression. Ballé et al. (2018) propose a hierarchical model, upon which current state-of-the-art methods (Minnen et al., 2018; Lee et al., 2019) further improve by adding an autoregressive component. For simplicity, we adopt the VAE of Minnen et al. (2018) without the autoregressive component, reviewed in this section below, which has also been a basis for recent compression research (Johnston et al., 2019).

Generative Model.
Figure 1a shows the generative process of an image $x$. The Gaussian likelihood $p(x \mid y) = \mathcal{N}\big(x;\, g(y; \theta),\, \sigma_x^2 I\big)$ has a fixed variance $\sigma_x^2$, and its mean is computed from latent variables $y$ by a deconvolutional neural network (DNN) $g$ with weights $\theta$. A second DNN $h$ with weights $\theta_z$ outputs the parameters (location and scale) of the prior $p(y \mid z)$, conditioned on “hyperlatents” $z$.

Compression and Decompression.
Figure 1b illustrates compression (“encoding”, dashed blue) and decompression (“decoding”, solid black) as proposed in (Minnen et al., 2018). The encoder passes a target image $x$ through trained inference networks $f_y$ and $f_z$ with weights $\phi_y$ and $\phi_z$, respectively. The resulting continuous latent representations, $\mu_y$ and $\mu_z$, are rounded (denoted $\lfloor \cdot \rceil$) to discrete $\hat{y}$ and $\hat{z}$,

$\hat{y} = \lfloor \mu_y \rceil, \quad \mu_y = f_y(x; \phi_y); \qquad \hat{z} = \lfloor \mu_z \rceil, \quad \mu_z = f_z(\mu_y; \phi_z).$   (1)
The encoder then entropy-codes $\hat{z}$ and $\hat{y}$, using discretized versions $\tilde{P}(\hat{z})$ and $P(\hat{y} \mid \hat{z})$ of the hyperprior $p(z)$ and the (conditional) prior $p(y \mid z)$, respectively, as entropy models. This step is simplified by a clever restriction on the shape of the prior, such that it agrees with its discretization on all integers (see (Ballé et al., 2017, 2018; Minnen et al., 2018)). The decoder then recovers $\hat{z}$ and $\hat{y}$ and obtains a lossy image reconstruction $\hat{x} = g(\hat{y}; \theta)$.

Model Training.
Let $x$ be sampled from a training set of images. To train the above VAE for lossy compression, Theis et al. (2017); Minnen et al. (2018) consider minimizing a rate-distortion objective:

$\mathcal{L}(\hat{y}, \hat{z}) = -\log_2 \tilde{P}(\hat{z}) - \log_2 P(\hat{y} \mid \hat{z}) + \lambda\, \| x - g(\hat{y}; \theta) \|_2^2$   (2)

where $(\hat{y}, \hat{z})$ is computed from the inference networks as in Eq. 1, and the parameter $\lambda$ controls the trade-off between the bitrate under entropy coding (the first two terms) and the distortion (reconstruction error, last term). As the rounding operations prevent gradient-based optimization, Ballé et al. (2017) propose to replace rounding during training by adding uniform noise from the interval $[-\tfrac{1}{2}, \tfrac{1}{2}]$ to each coordinate of $\mu_y$ and $\mu_z$ (see Figure 1c). This is equivalent to sampling from uniform distributions $q(\tilde{y} \mid x)$ and $q(\tilde{z} \mid x)$ with a fixed width of one, centered around $\mu_y$ and $\mu_z$, respectively. One thus obtains the following relaxed rate-distortion objective, for a given data point $x$:

$\tilde{\mathcal{L}}(\mu_y, \mu_z) = \mathbb{E}_{q(\tilde{y} \mid x)\, q(\tilde{z} \mid x)} \big[ -\log_2 \tilde{p}(\tilde{z}) - \log_2 \tilde{p}(\tilde{y} \mid \tilde{z}) + \lambda\, \| x - g(\tilde{y}; \theta) \|_2^2 \big]$   (3)
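As a concrete illustration of this relaxation (a toy NumPy sketch under the notation above, not the actual training code): rounding is replaced during training by additive uniform noise, which has the same width-one support as the rounding error, while deterministic rounding is used at compression time.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_y = np.array([0.7, -1.2, 3.4])      # continuous latents from an encoder

# training-time surrogate: additive uniform noise on [-1/2, 1/2)
y_tilde = mu_y + rng.uniform(-0.5, 0.5, size=mu_y.shape)

# compression-time discretization: deterministic rounding (Eq. 1)
y_hat = np.round(mu_y)

# both stay within half a unit of the continuous latents
assert np.all(np.abs(y_tilde - mu_y) <= 0.5)
assert np.all(np.abs(y_hat - mu_y) <= 0.5)
```

The noisy surrogate is differentiable in `mu_y`, which is what makes gradient-based training of Eq. 3 possible.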
Connection to Variational Inference.
As pointed out in (Ballé et al., 2017), the relaxed objective in Eq. 3 is the negative evidence lower bound (NELBO) of variational inference (VI) if we identify the uniform distributions $q(\tilde{y} \mid x)$ and $q(\tilde{z} \mid x)$ of Eq. 3 with a variational posterior. This draws a connection between lossy compression and VI (Blei et al., 2017; Zhang et al., 2019). We emphasize a distinction between variational inference and variational expectation maximization (EM) (Beal and Ghahramani, 2003): whereas variational EM trains a model, VI is used to compress data using a trained model. A central result of this paper is that improving inference in a fixed model at compression time already suffices to significantly improve compression performance.
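This identification can be made explicit (a sketch in the notation of Eqs. 2–3, assuming the fixed-variance Gaussian likelihood of Section 2). Because the uniform variational distribution has fixed width one, its differential entropy is zero, so the NELBO reduces to rate plus weighted distortion:

```latex
\underbrace{\mathbb{E}_{q}\!\left[\log_2 q(\tilde{y}, \tilde{z} \mid x)\right]}_{=\,0\ \text{(width-one uniform)}}
\;+\; \underbrace{\mathbb{E}_{q}\!\left[-\log_2 \tilde{p}(\tilde{z}) - \log_2 \tilde{p}(\tilde{y} \mid \tilde{z})\right]}_{\text{rate}}
\;+\; \underbrace{\mathbb{E}_{q}\!\left[-\log_2 p(x \mid \tilde{y})\right]}_{\text{distortion (up to a constant)}}
```

where the last term uses $-\log_2 p(x \mid \tilde{y}) = \frac{\log_2 e}{2\sigma_x^2}\, \| x - g(\tilde{y}; \theta) \|_2^2 + \text{const}$, so that $\lambda$ corresponds to $\log_2 e / (2\sigma_x^2)$.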
3 Novel Inference Techniques for Data Compression
This section presents our main contributions. We identify three approximation gaps in VAE-based compression methods (see Section 2): an amortization gap, a discretization gap, and a marginalization gap. Recently, the amortization gap was considered by Campos et al. (2019); we expand on this idea in Section 3.1, bringing it under a wider framework of algorithms in the VI literature. We then propose two specific methods that improve inference at compression time: a novel inference method over discrete representations (Section 3.2) that closes the discretization gap, and a novel lossy bits-back coding method (Section 3.3) that closes the marginalization gap.
3.1 Amortization Gap and Hybrid AmortizedIterative Inference
Amortization Gap.
Amortized variational inference (Kingma and Welling, 2013; Rezende et al., 2014) turns optimization over local (per-data) variational parameters into optimization over the global weights of an inference network. In the VAE of Section 2, the inference networks $f_y$ and $f_z$ map an image $x$ to local parameters $\mu_y$ and $\mu_z$ of the variational distributions $q(\tilde{y} \mid x)$ and $q(\tilde{z} \mid x)$, respectively (Eq. 1). Amortized VI speeds up training by avoiding an expensive inner inference loop of variational EM, but it leads to an amortization gap (Cremer et al., 2018; Krishnan et al., 2018): the difference between the value of the NELBO (Eq. 3) when $\mu_y$ and $\mu_z$ are obtained from the inference networks, and its true minimum when the NELBO is minimized directly over $\mu_y$ and $\mu_z$. Since the NELBO approximates the rate-distortion objective (Section 2), the amortization gap translates into suboptimal performance when amortization is used at compression time. As discussed below, this gap can be closed by refining the output of the inference networks by iterative inference.
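The refinement loop can be sketched in a few lines (a hypothetical one-dimensional model of our own, not the paper's networks or objective): initialize the per-image latent from the amortized encoder, then descend the NELBO directly over that latent.

```python
import numpy as np

def refine_latent(x, encoder, grad_nelbo, steps=200, lr=0.05):
    """Hybrid amortized-iterative inference (toy sketch): initialize the
    per-image latent from the amortized encoder, then run gradient descent
    on the NELBO directly over that latent at compression time."""
    mu = encoder(x)                      # amortized initialization
    for _ in range(steps):
        mu -= lr * grad_nelbo(x, mu)     # iterative refinement
    return mu

# hypothetical 1-D model: prior N(0, 1), decoder g(mu) = mu, squared-error
# distortion with weight lam; NELBO = 0.5*mu^2 + lam*(x - mu)^2
lam = 1.0
nelbo = lambda x, mu: 0.5 * mu**2 + lam * (x - mu)**2
grad_nelbo = lambda x, mu: mu - 2.0 * lam * (x - mu)
encoder = lambda x: 0.3 * x              # a deliberately suboptimal amortized encoder

x = 2.0
mu_refined = refine_latent(x, encoder, grad_nelbo)
# refinement closes the amortization gap: the optimum here is 2*lam*x/(1 + 2*lam)
```

In this toy setting the refined latent reaches the exact minimizer of the NELBO, whereas the amortized initialization does not; the paper applies the same principle to the full objective of Eq. 3.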
Related Work.
The idea of combining amortized inference with iterative optimization has been studied and refined by various authors (Hjelm et al., 2016; Kim et al., 2018; Krishnan et al., 2018; Marino et al., 2018). While not formulated in the language of variational autoencoders and variational inference, Campos et al. (2019) apply a simple version of this idea to compression. Drawing on the connection to hybrid amortizediterative inference, we show that the method can be drastically improved by addressing the discretization gap (Section 3.2) and the marginalization gap (Section 3.3).
Hybrid AmortizedIterative Inference.
We reinterpret the proposal in (Campos et al., 2019) as a basic version of a hybrid amortized-iterative inference idea that changes inference only at compression time but not during model training. Figure 1d illustrates the approach. When compressing a target image $x$, one initializes $\mu_y$ and $\mu_z$ from the trained inference networks $f_y$ and $f_z$ (see Eq. 1). One then treats $\mu_y$ and $\mu_z$ as local variational parameters and minimizes the NELBO (Eq. 3) over $(\mu_y, \mu_z)$ with stochastic gradient descent. This approach thus separates inference at compression time from inference during model training. We show in the next two sections that this simple idea forms a powerful basis for new inference approaches that drastically improve compression performance.

3.2 Discretization Gap and Stochastic Gumbel Annealing (SGA)
Discretization Gap.
Compressing data to a bitstring is an inherently discrete optimization problem. As discrete optimization in high dimensions is difficult, neural compression methods instead optimize some relaxed objective function (such as the NELBO $\tilde{\mathcal{L}}$ in Eq. 3) over continuous representations (such as $\mu_y$ and $\mu_z$), which are then discretized afterwards for entropy coding (see Eq. 1). This leads to a discretization gap: the difference between the true rate-distortion objective $\mathcal{L}$ (Eq. 2) at the discretized representation $(\hat{y}, \hat{z})$, and the relaxed objective $\tilde{\mathcal{L}}$ at the continuous approximation $(\mu_y, \mu_z)$.
Related Work.
Current neural compression methods use a differentiable approximation to discretization during model training, such as the Straight-Through Estimator (STE) (Bengio et al., 2013; Oord et al., 2017; Yin et al., 2019), additive uniform noise (Ballé et al., 2017) (see Eq. 3), stochastic binarization (Toderici et al., 2016), and soft-to-hard quantization (Agustsson et al., 2017). Discretization at compression time was addressed in (Yang et al., 2020) without assuming that the training procedure takes discretization into account, whereas our work makes this additional assumption.

Stochastic Gumbel Annealing (SGA).
Refining the hybrid amortized-iterative inference approach of Section 3.1, for a given image $x$, we aim to find its discrete (in our case, integer) representation $\hat{y}$ that optimizes the rate-distortion objective in Eq. 2, this time reinterpreted as a function of $\hat{y}$ directly. Our proposed annealing scheme approaches this discrete optimization problem with the help of continuous proxy variables $y$ and $z$, which we initialize from the inference networks as in Eq. 1. We limit the following discussion to the latents $y$, treating the hyperlatents $z$ analogously. For each dimension $i$ of $y$, we map the continuous coordinate $y_i$ to an integer coordinate by rounding either up or down. Let $r_i$ be a one-hot vector that indicates the rounding direction, with $r_i = (1, 0)$ for rounding down (denoted $\lfloor y_i \rfloor$) and $r_i = (0, 1)$ for rounding up (denoted $\lceil y_i \rceil$). Thus, the result of rounding is the inner product $r_i \cdot (\lfloor y_i \rfloor, \lceil y_i \rceil)$. Now we let $r_i$ be a Bernoulli variable with a “tempered” distribution

$q_\tau(r_i \mid y_i) \propto \begin{cases} \exp\{-\psi(y_i - \lfloor y_i \rfloor)/\tau\} & \text{if } r_i = (1, 0) \\ \exp\{-\psi(\lceil y_i \rceil - y_i)/\tau\} & \text{if } r_i = (0, 1) \end{cases}$   (4)

with temperature parameter $\tau > 0$, where $\psi : [0, 1) \to \mathbb{R}$ is an increasing function satisfying $\lim_{\alpha \to 1} \psi(\alpha) = \infty$ (this is chosen to ensure continuity of the resulting objective (5); we used $\psi = \operatorname{arctanh}$). Thus, as $y_i$ approaches an integer, the probability of rounding to that integer approaches one. Defining $q_\tau(r \mid y) = \prod_i q_\tau(r_i \mid y_i)$, we thus minimize the stochastic objective

$\mathbb{E}_{q_\tau(r \mid y)\, q_\tau(r' \mid z)} \big[ \mathcal{L}\big(\lfloor y \rceil_r, \lfloor z \rceil_{r'}\big) \big]$   (5)

where we reintroduce the hyperlatents $z$ with their own rounding directions $r'$, and $\lfloor y \rceil_r$ denotes the discrete vector obtained by rounding each coordinate of $y$ according to its rounding direction $r$. At any temperature $\tau$, the objective in Eq. 5 smoothly interpolates the rate-distortion objective in Eq. 2 between all integer points. We minimize Eq. 5 over $(y, z)$ with stochastic gradient descent, propagating gradients through Bernoulli samples via the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2016) using, for simplicity, the same temperature $\tau$ as in $q_\tau$. We note that, in principle, REINFORCE (Williams, 1992) could also be used instead of Gumbel-softmax. We anneal $\tau$ towards zero over the course of optimization, such that the Gumbel approximation becomes exact and the stochastic rounding operation converges to the deterministic one in Eq. 1. We thus name this method “Stochastic Gumbel Annealing” (SGA).

Comparison to Alternatives.
Alternatively, we could start with Eq. 2 and optimize it as a function of the continuous latents, using existing discretization methods to relax rounding; the optimized continuous values would subsequently be rounded to integers. However, our method outperforms all of these alternatives. Specifically, we tried (with shorthands [A1]-[A4] for ‘ablation’, referring to Table 1 of Section 4): [A1] MAP: ignoring rounding during optimization; [A2] Straight-Through Estimator (STE) (Bengio et al., 2013): rounding only on the forward pass of automatic differentiation; [A3] Uniform Noise (Campos et al., 2019): optimizing Eq. 3 over $(\mu_y, \mu_z)$ at compression time; and [A4] Deterministic Annealing: turning the stochastic rounding operations into weighted averages, by pushing the expectation under $q_\tau$ inside the rounding operations of Eq. 5 (resulting in an algorithm similar to (Agustsson et al., 2017)).

We tested the above discretization methods on images from the Kodak data set (Kodak), using a pretrained hyperprior model (Section 2). Figure 2 (left) shows learning curves for the true rate-distortion objective (Eq. 2), evaluated by rounding the intermediate continuous latents throughout optimization. Figure 2 (right) shows the discretization gap, i.e., the difference between the true objective at the rounded latents and each method’s relaxed objective. Our proposed SGA method closes the discretization gap and achieves the lowest true rate-distortion loss among all compared methods, from the same initialization. Deterministic Annealing also closes the gap, and Uniform Noise produces a consistently negative gap (as the initial continuous latents tend to be close to their rounded values), but both converge to worse solutions than SGA. MAP converges to a non-integer solution with a large discretization gap, and STE consistently diverges even with a tiny learning rate, as studied in (Yin et al., 2019) (the fixes via ReLU or clipped ReLU proposed in that paper did not help).
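For concreteness, a minimal NumPy sketch of SGA's tempered stochastic rounding (a toy forward pass of our own, assuming $\psi = \operatorname{arctanh}$; the authors' implementation and the Gumbel-softmax gradient propagation are not shown):

```python
import numpy as np

def sga_round(y, tau, rng):
    """Stochastic rounding step of SGA (toy forward pass, assuming
    psi = arctanh; gradients via Gumbel-softmax are omitted).

    Each coordinate is rounded down or up at random, with probabilities
    that concentrate on the nearer integer as the temperature tau -> 0.
    """
    lo, hi = np.floor(y), np.ceil(y)
    # distances to the two rounding candidates, kept inside arctanh's domain
    d_lo = np.clip(y - lo, 0.0, 1.0 - 1e-6)
    d_hi = np.clip(hi - y, 0.0, 1.0 - 1e-6)
    logits = np.stack([-np.arctanh(d_lo) / tau, -np.arctanh(d_hi) / tau])
    # Gumbel-max sampling of the one-hot rounding direction r
    gumbel = rng.gumbel(size=logits.shape)
    round_down = (logits + gumbel).argmax(axis=0) == 0
    return np.where(round_down, lo, hi)

y = np.array([0.2, 1.9, -0.4])
# at a very low temperature, the sample reduces to nearest-integer rounding
y_hat = sga_round(y, tau=1e-4, rng=np.random.default_rng(0))
```

At high temperature the rounding direction stays genuinely stochastic, which is what lets gradient-based optimization explore between integer points before the anneal.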
3.3 Marginalization Gap and Lossy BitsBack Coding
Marginalization Gap.
To compress an image with the hierarchical model of Section 2, ideally we would only need to encode and transmit its latent representation $\hat{y}$ but not the hyperlatents $\hat{z}$, as only $\hat{y}$ is needed for reconstruction. This would require entropy coding with (a discretization of) the marginal prior $p(y) = \int p(z)\, p(y \mid z)\, dz$, which unfortunately is computationally intractable. Therefore, the standard compression approach reviewed in Section 2 instead encodes some hyperlatent $\hat{z}$ using the discretized hyperprior $\tilde{P}(\hat{z})$ first, and then encodes $\hat{y}$ using the entropy model $P(\hat{y} \mid \hat{z})$. This leads to a marginalization gap, which is the difference between the information content of the transmitted tuple $(\hat{y}, \hat{z})$ and the information content of $\hat{y}$ alone,

$\big( -\log_2 \tilde{P}(\hat{z}) - \log_2 P(\hat{y} \mid \hat{z}) \big) - \big( -\log_2 P(\hat{y}) \big) > 0.$   (6)
Related Work.
A similar marginalization gap has been addressed in lossless compression by bits-back coding (Wallace, 1990; Hinton and Van Camp, 1993), made more practical by the BB-ANS algorithm (Townsend et al., 2019b). Recent work has improved its efficiency in hierarchical latent variable models (Townsend et al., 2019a; Kingma et al., 2019) and extended it to flow models (Ho et al., 2019). To our knowledge, bits-back coding has not yet been used in lossy compression. This is likely because bits-back coding requires Bayesian posterior inference on the decoder side, which seems incompatible with lossy compression, as the decoder does not have access to the undistorted data $x$.
Lossy BitsBack Coding.
We extend the bits-back idea to lossy compression by noting that the discretized latent representation $\hat{y}$ is encoded losslessly (Figure 1b) and is thus amenable to bits-back coding. A complication arises because bits-back coding requires that the encoder’s inference over $z$ can be exactly reproduced by the decoder (see below). This requirement is violated by the inference network $f_z$ in Eq. 1, where $\mu_z$ depends on $\mu_y$, which is not available to the decoder in lossy compression. It turns out that the naive fix of computing $\mu_z$ from the rounded $\hat{y}$ instead would hurt performance by more than what bits-back coding saves (see ablations in Section 4). We propose instead a two-stage inference algorithm that cuts the dependency on the original image after an initial joint inference over the latents and hyperlatents.
Bits-back coding bridges the marginalization gap in Eq. 6 by encoding a limited number of bits of some additional side information (e.g., an image caption or a previously encoded image of a slide show) into the choice of $\hat{z}$, using as entropy model a discretized Gaussian variational distribution $q(z \mid x)$ with mean $\mu_z$ and diagonal standard deviations $\sigma_z$. We obtain $q(z \mid x)$ by extending the last layer of the hyper-inference network $f_z$ (Eq. 1) to now output a tuple $(\mu_z, \sigma_z)$. This replaces the form of the variational distribution of Eq. 3, as the restriction to a box-shaped distribution with fixed width is not necessary in bits-back coding. We train the resulting VAE by minimizing the NELBO. Since $q(z \mid x)$ now has a variable width, the NELBO has an extra term compared to Eq. 3 that subtracts the entropy of $q(z \mid x)$, reflecting the expected number of bits we ‘get back’. Thus, the NELBO is again a relaxed rate-distortion objective, where the net rate is now the compressed file size minus the amount of embedded side information.
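Schematically, the resulting objective reads (a sketch in the notation of Section 2, not the paper's exact equation):

```latex
\mathbb{E}_{q(z \mid x)}\!\left[ -\log_2 p(z) \;-\; \log_2 P(\hat{y} \mid z) \;+\; \lambda \, \| x - g(\hat{y}; \theta) \|_2^2 \right] \;-\; H\!\left[ q(z \mid x) \right]
```

where the subtracted entropy $H[q(z \mid x)]$ accounts for the side-information bits embedded in the choice of $\hat{z}$, so the first two (rate) terms minus $H[q]$ form the net rate.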
Algorithm 1 describes lossy compression and decompression with the trained model. The subroutine encode initializes the variational parameters $\mu_y$, $\mu_z$, and $\sigma_z$ conditioned on the target image $x$ using the trained inference networks. It then jointly performs SGA over $\mu_y$ following Section 3.2, and Black-Box Variational Inference (BBVI) over $\mu_z$ and $\sigma_z$ by minimizing the NELBO. At the end of the routine, the encoder decodes the provided side information $s$ (an arbitrary bitstring) into $\hat{z}$ using $q(z \mid x)$ as entropy model, and then encodes $\hat{z}$ and $\hat{y}$ as usual.

The important step is the subsequent refit: up until this step, the fitted variational parameters $\mu_z$ and $\sigma_z$ depend on the target image $x$ due to their initialization. This would prevent the decoder, which does not have access to $x$, from reconstructing the side information by encoding $\hat{z}$ with the entropy model $q$. The encoder therefore refits $\mu_z$ and $\sigma_z$ in a way that is exactly reproducible based on $\hat{y}$ alone. This is done in the subroutine reproducible_BBVI, which performs BBVI in the prior model $p(z)\, p(y \mid z)$, treating $\hat{y}$ as observed and only $z$ as latent. Although $\mu_z$ and $\sigma_z$ are reset immediately after the joint optimization, optimizing jointly over both $\mu_y$ and $(\mu_z, \sigma_z)$ first allows the method to find a better $\hat{y}$.

The decoder decodes $\hat{z}$ and $\hat{y}$ as usual. Since the subroutine reproducible_BBVI depends only on $\hat{y}$ and uses a fixed random seed, calling it from the decoder yields the exact same entropy model $q$ as used by the encoder, allowing the decoder to recover the side information $s$.
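The reproducibility requirement can be illustrated with a toy stand-in for reproducible_BBVI (a hypothetical one-dimensional conjugate-Gaussian model of our own, not the paper's networks): given the same $\hat{y}$ and the same fixed seed, encoder and decoder arrive at bit-for-bit identical variational parameters.

```python
import numpy as np

def reproducible_bbvi(y_hat, seed=0, steps=300, lr=0.02):
    """Toy reproducible BBVI: fit the mean of q(z) = N(mu, 1) for a
    hypothetical model z ~ N(0, 1), y ~ N(z, 1), treating y_hat as observed.
    The result is a deterministic function of (y_hat, seed), so the decoder
    can rerun the exact same optimization."""
    rng = np.random.default_rng(seed)      # fixed seed, shared by both sides
    y_bar = float(np.mean(y_hat))
    mu = 0.0
    for _ in range(steps):
        z = mu + rng.standard_normal()     # reparameterized sample of z
        # stochastic gradient of E_q[log p(z) + log p(y_bar | z)] w.r.t. mu
        grad = (y_bar - z) - z
        mu += lr * grad
    return mu

y_hat = np.array([1.0, 2.0, 0.0])          # stand-in for the decoded latents
enc_mu = reproducible_bbvi(y_hat, seed=0)  # run on the encoder side
dec_mu = reproducible_bbvi(y_hat, seed=0)  # rerun on the decoder side
assert enc_mu == dec_mu                    # identical entropy model on both sides
```

Conditioning only on `y_hat` and a fixed seed is exactly what makes the entropy model for $\hat{z}$ recomputable without the original image.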
4 Experiments
We demonstrate the empirical effectiveness of our approach by applying two variants of it (with and without bits-back coding) to an established (but not state-of-the-art) base model (Minnen et al., 2018). We improve its performance drastically, achieving substantial average BD rate savings on Kodak and on Tecnick (Asuni and Giachetti, 2014), outperforming the previous state of the art among both classical and neural lossy image compression methods. We conclude with ablation studies.
Name  Explanation and Reference (for baselines)

Ours:
[M1]  SGA  Proposed standalone variant: same trained model as [M3], SGA (Section 3.2) at compression time.
[M2]  SGA + BB  Proposed variant with bits-back coding: both proposed improvements of Sections 3.2 and 3.3.

Baselines:
[M3]  Base Hyperprior  Base model of our two proposals, reviewed in Section 2 of the present paper (Minnen et al., 2018).
[M4]  Context + Hyperprior  Like [M3] but with an extra context model defining the prior (Minnen et al., 2018).
[M5]  Context-Adaptive  Context-Adaptive Entropy Model proposed in (Lee et al., 2019).
[M6]  Hyperprior Scale-Only  Like [M3] but the hyperlatents model only the scale (not the mean) of the prior (Ballé et al., 2018).
[M7]  CAE  Pioneering “Compressive Autoencoder” model proposed in (Theis et al., 2017).
[M8]  BPG 4:4:4  State-of-the-art classical lossy image compression codec ‘Better Portable Graphics’ (Bellard, 2014).
[M9]  JPEG 2000  Classical lossy image compression method (Adams, 2001).

Ablations:
[A1]  MAP  Like [M1] but with continuous optimization followed by rounding instead of SGA.
[A2]  STE  Like [M1] but with straight-through estimation (i.e., rounding only on backpropagation) instead of SGA.
[A3]  Uniform Noise  Like [M1] but with uniform noise injection instead of SGA (see Section 3.1) (Campos et al., 2019).
[A4]  Deterministic Annealing  Like [M1] but with the deterministic version of Stochastic Gumbel Annealing.
[A5]  BB without SGA  Like [M2] but without optimization over the latents at compression time.
[A6]  BB without iterative inference  Like [M2] but without any iterative optimization at compression time.
Proposed Methods.
Table 1 describes all compared methods, labeled [M1]-[M9] for short. We tested two variants of our method ([M1] and [M2]): SGA builds on the exact same model and training procedure as (Minnen et al., 2018) (the Base Hyperprior model [M3]) and changes only how the trained model is used for compression, by introducing hybrid amortized-iterative inference (Campos et al., 2019) (Section 3.1) and Stochastic Gumbel Annealing (Section 3.2). SGA+BB adds to it bits-back coding (Section 3.3), which requires changing the inference model over hyperlatents to admit a more flexible variational distribution, as discussed in Section 3.3; it also lifts a restriction on the shape of the hyperprior in (Minnen et al., 2018), as explained below Eq. 1. The two proposed variants address different use cases: SGA is a “standalone” variant of our method that compresses individual images, while SGA+BB encodes images with additional side information. In all results, we annealed the temperature of SGA with an exponential decay schedule following (Jang et al., 2016) and used Adam (Kingma and Ba, 2015) for optimization; we found good convergence without extensive hyperparameter tuning, and provide details in the Supplementary Material.
Baselines.
We compare to the Base Hyperprior model from (Minnen et al., 2018) without our proposed improvements ([M3] in Table 1), two state-of-the-art neural methods ([M4] and [M5]), two other neural methods ([M6] and [M7]), the state-of-the-art classical codec BPG [M8], and JPEG 2000 [M9]. We reproduced the Base Hyperprior results, and took the other results from (Ballé et al.).
Results.
Figure 4 compares the compression performance of our proposed method to existing baselines on the Kodak dataset, using the standard Peak Signal-to-Noise Ratio (PSNR) quality metric (higher is better), averaged over all images for each considered quality setting. The left panel of Figure 4 plots PSNR vs. bitrate, and the right panel shows the resulting BD rate savings (Bjontegaard, 2001) computed relative to BPG, as a function of PSNR for readability (higher is better). The BD plot cuts off CAE [M7] and JPEG 2000 [M9] at the bottom, which performed worse. Overall, both variants of the proposed method (blue and orange lines in Figure 4) improve substantially over the Base Hyperprior model (brown) and outperform the previous state of the art. We report similar results on the Tecnick dataset (Asuni and Giachetti, 2014) in the Supplementary Material. Figure 5 shows a qualitative comparison of a compressed image (in the order of BPG, SGA (proposed), and the Base Hyperprior), in which our method notably enhances the baseline image reconstruction at a comparable bitrate, while avoiding the unpleasant visual artifacts of BPG.

Ablations.
Figure 4 also compares BD rate improvements of the two variants of our proposal to six alternative choices (ablations [A1]-[A6] in Table 1), measured relative to the Base Hyperprior model (Minnen et al., 2018) (zero line; [M3] in Table 1), on which our proposals build. [A1]-[A4] replace the discretization method of SGA [M1] with the four alternatives discussed at the end of Section 3.2. [A5] and [A6] are ablations of our proposed bits-back variant SGA+BB [M2], which remove iterative optimization over the latents, or over both the latents and the hyperlatent variational parameters, respectively, at compression time. Going from [A3] to [M1] to [M2] traces our proposed improvements over (Campos et al., 2019), by adding first SGA (Section 3.2) and then bits-back coding (Section 3.3). It shows that iterative inference (Section 3.1) and SGA contribute approximately equally to the performance gain, whereas the gain from bits-back coding is smaller. Interestingly, lossy bits-back coding without iterative inference and SGA actually hurts performance [A6]. As Section 3.3 mentioned, this is likely because bits-back coding constrains the posterior over the hyperlatents to be conditioned only on $\hat{y}$ and not on the original image $x$.
5 Discussion
Starting from the variational inference view on data compression, we proposed three enhancements to the standard inference procedure in VAEs: hybrid amortized-iterative inference, Stochastic Gumbel Annealing, and lossy bits-back coding, which translated into dramatic performance gains on lossy image compression. Improved inference provides a promising new direction for better compression, orthogonal to modeling choices (e.g., autoregressive priors (Minnen et al., 2018), which can harm decoding efficiency). Although lossy bits-back coding in the present VAE gave only relatively minor benefits, it may reach its full potential in more hierarchical architectures as in (Kingma et al., 2019). Similarly, carrying out iterative inference also at training time may lead to even further performance improvements, with techniques like iterative amortized inference (Marino et al., 2018).
6 Acknowledgements
We thank Yang Yang for valuable feedback on the manuscript. Yibo Yang acknowledges funding from the Hasso Plattner Foundation. Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FWHTFRM), and Qualcomm.
References
Adams (2001). The JPEG2000 still image compression standard.
Agustsson et al. (2017). Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems.
Alemi et al. (2018). Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.
Asuni and Giachetti (2014). TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms (SAMPLING 1200 RGB set). In STAG: Smart Tools and Apps for Graphics.
Ballé et al. Tensorflow-compression: data compression in TensorFlow.
Ballé et al. (2017). End-to-end optimized image compression. International Conference on Learning Representations.
Ballé et al. (2018). Variational image compression with a scale hyperprior. In ICLR.
Beal and Ghahramani (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, Vol. 7, pp. 453–464.
Bellard (2014). BPG specification. (Accessed June 3, 2020.)
Bengio et al. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Bjontegaard (2001). Calculation of average PSNR differences between RD-curves. VCEG-M33.
Blei et al. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
Campos et al. (2019). Content adaptive optimization for neural image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Cremer et al. (2018). Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pp. 1078–1086.
Habibian et al. (2019). Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042.
Hinton and Van Camp (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13.
Hjelm et al. (2016). Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, pp. 4691–4699.
Ho et al. (2019). Compression with flows via local bits-back coding. In Advances in Neural Information Processing Systems, pp. 3874–3883.
Jang et al. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
Johnston et al. (2019). Computationally efficient neural image compression. arXiv preprint arXiv:1912.08771.
Kim et al. (2018). Semi-amortized variational autoencoders. In International Conference on Machine Learning, pp. 2678–2687.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In ICLR: International Conference on Learning Representations.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kingma et al. (2019). Bit-Swap: recursive bits-back coding for lossless compression with hierarchical latent variables. arXiv preprint arXiv:1905.06845.
Kodak. Kodak lossless true color image suite (PhotoCD PCD0992).
Krishnan et al. (2018). On the challenges of learning with inference networks on sparse, high-dimensional data. In International Conference on Artificial Intelligence and Statistics, pp. 143–151.
Lee et al. (2019). Context-adaptive entropy model for end-to-end optimized image compression. In the 7th International Conference on Learning Representations.
Lombardo et al. (2019). Deep generative video compression. In Advances in Neural Information Processing Systems 32, pp. 9287–9298.
Maddison et al. (2016). The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
Marino et al. (2018). Iterative amortized inference. In Proceedings of the 35th International Conference on Machine Learning, pp. 3403–3412.
Minnen et al. (2018). Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
Oord et al. (2017). Neural discrete representation learning. arXiv preprint arXiv:1711.00937.
Randers-Pehrson (1997). MNG: a multiple-image format in the PNG family. World Wide Web Journal 2 (1), pp. 209–211.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
Theis et al. (2017). Lossy image compression with compressive autoencoders. International Conference on Learning Representations.
Toderici et al. (2016). Variable rate image compression with recurrent neural networks. International Conference on Learning Representations.
Townsend et al. (2019a). HiLLoC: lossless image compression with hierarchical latent variable models. arXiv preprint arXiv:1912.09953.
Townsend et al. (2019b). Practical lossless compression with latent variables using bits back coding. arXiv preprint arXiv:1901.04866.
Wallace (1990). Classification by minimum-message-length inference. In International Conference on Computing and Information, pp. 72–81.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
Yang et al. (2020). Variable-bitrate neural compression via Bayesian arithmetic coding. In International Conference on Machine Learning.
Yin et al. (2019). Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662.
 Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
S1 Model Architecture and Training
As mentioned in the main text, the Base Hyperprior model uses the same architecture as in Table 1 of Minnen et al. [2018], except without the "Context Prediction" and "Entropy Parameters" components (this model was referred to as "Mean & Scale Hyperprior" in that work). The model is largely already implemented in bmshj2018.py (https://github.com/tensorflow/compression/blob/master/examples/bmshj2018.py) from Ballé et al. [2018]; we modified their implementation to double the number of output channels of the HyperSynthesisTransform so that it predicts both the mean and (log) scale of the conditional prior over the latents.
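The channel-doubling modification above can be sketched as follows; this is a minimal numpy illustration (not the actual TensorFlow code), assuming NHWC tensors and a hypothetical function name:

```python
import numpy as np

def split_mean_log_scale(hyper_synthesis_out):
    """Split the doubled output channels of the hyper synthesis transform
    into a predicted mean and scale for the conditional prior over latents.

    Assumes an NHWC tensor whose channel dimension holds 2 * C values:
    the first C channels are the mean, the last C the log scale."""
    n_channels = hyper_synthesis_out.shape[-1]
    assert n_channels % 2 == 0, "expected an even number of output channels"
    mean = hyper_synthesis_out[..., : n_channels // 2]
    log_scale = hyper_synthesis_out[..., n_channels // 2 :]
    scale = np.exp(log_scale)  # exponentiate so the scale is always positive
    return mean, scale
```

Predicting the log of the scale (rather than the scale itself) is a common trick that keeps the scale positive without constrained optimization.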
As mentioned in Sections 3.3 and 4, lossy bits-back in this model makes the following modifications:

The number of output channels of the hyper inference network is doubled, to compute both the mean and diagonal (log) variance of the Gaussian variational posterior over the hyperlatents;

The hyperprior is no longer restricted to the form of a flexible density model convolved with a uniform distribution; instead, it simply uses the flexible density, as described in the Appendix of Ballé et al. [2018].
The above models were trained on CLIC 2018 images for 2 million iterations, using minibatches of eight randomly cropped image patches. We trained models with values of the rate-distortion trade-off parameter λ spaced on a log scale. Following Ballé et al. [2018], we found that increasing the number of latent channels helps avoid the "bottleneck" effect and improves rate-distortion performance at higher rates, so we increased the number of latent channels from 192 to 256 for the models trained with the two largest λ values.
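A log-scale grid of trade-off parameters can be generated with `np.geomspace`; the endpoints and count below are purely hypothetical, since the exact λ values are not stated here:

```python
import numpy as np

# Hypothetical example of lambda values spaced on a log scale: consecutive
# values have a constant ratio rather than a constant difference.
lambdas = np.geomspace(1e-4, 1e-1, num=7)
ratios = lambdas[1:] / lambdas[:-1]  # all equal on a log-scale grid
```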
S2 Hyperparameters of the Various Methods Considered
Given a pretrained model (see above) and the image(s) to be compressed, our experiments in Section 4 explored hybrid amortized-iterative inference to improve compression performance. Below we provide the hyperparameters of each method considered:
Stochastic Gumbel Annealing.
In all experiments using SGA (including the SGA+BB experiments, which optimize over the latents using SGA), we used the Adam optimizer with a fixed initial learning rate, and an exponentially decaying temperature schedule τ(t), where t is the iteration number and a decay constant c controls the speed of the decay. We found that a smaller c generally increases the number of iterations needed for convergence, but can also yield slightly better solutions; in all of our experiments we used a single fixed value of c and obtained good convergence within the allotted number of iterations.
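The two ingredients above, an exponentially decaying temperature and temperature-controlled stochastic rounding, can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: the constants `tau0` and `c` are hypothetical, and the logits here use plain distance-to-integer where the actual method may differ.

```python
import numpy as np

def temperature(t, tau0=0.5, c=0.001):
    """Exponentially decaying temperature schedule; tau0 and c are
    hypothetical placeholder values, not the ones used in the experiments."""
    return tau0 * np.exp(-c * t)

def sga_round_probs(y, tau):
    """Two-class softmax over {floor(y), ceil(y)}: the nearer integer gets
    the larger probability, and as tau -> 0 the distribution concentrates
    on hard (deterministic) rounding of the continuous latent y."""
    r = y - np.floor(y)                      # fractional part in [0, 1)
    logits = np.stack([-r / tau, -(1.0 - r) / tau])
    p = np.exp(logits - logits.max(axis=0))  # numerically stable softmax
    return p / p.sum(axis=0)                 # [p_floor, p_ceil]
```

Early in optimization (large τ) the rounding stays stochastic, which smooths the discrete objective; late in optimization (small τ) it approaches deterministic rounding, closing the gap to the true quantized rate-distortion cost.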
Lossy bits-back.
In the SGA+BB experiments, the joint optimization step of the encode subroutine of our lossy bits-back algorithm performs joint optimization w.r.t. the latents and the parameters of the Gaussian variational distribution over the hyperlatents with Black-box Variational Inference (BBVI), using the reparameterization trick to differentiate through Gumbel-softmax samples for the latents and Gaussian samples for the hyperlatents. We used Adam with a fixed initial learning rate and ran a fixed number of stochastic gradient descent iterations; the optimization over the latents (SGA) used the same temperature annealing schedule as above. The reproducible_BBVI subroutine used Adam with its own initial learning rate and number of stochastic gradient descent iterations; we used the same random seed for the encoder and decoder for simplicity, although a hash of the previously transmitted bits can also be used instead.
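The key requirement on reproducible_BBVI is that the decoder can rerun the encoder's stochastic optimization bit-for-bit, which the shared random seed guarantees. The toy sketch below illustrates only that property; the actual subroutine uses Adam and BBVI, whereas plain SGD on a noisy quadratic stands in here, and all names and constants are hypothetical.

```python
import numpy as np

def reproducible_bbvi_sketch(grad, init, seed, n_steps=500, lr=0.01):
    """Stand-in for a 'reproducible' stochastic optimization: seeding the
    random number generator identically on both sides makes the encoder
    and decoder traverse the exact same sequence of noisy updates."""
    rng = np.random.default_rng(seed)   # shared seed => identical noise
    x = np.array(init, dtype=np.float64)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x -= lr * grad(x, noise)        # noisy gradient step
    return x

# Noisy gradient of the objective (x - 3)^2; both sides share seed 1234.
noisy_grad = lambda x, eps: 2.0 * (x - 3.0) + 0.1 * eps
encoder_result = reproducible_bbvi_sketch(noisy_grad, [0.0], seed=1234)
decoder_result = reproducible_bbvi_sketch(noisy_grad, [0.0], seed=1234)
```

Because both calls draw the same noise sequence, `encoder_result` and `decoder_result` are bit-identical, which is what allows the decoder to recover the exact variational distribution used for bits-back coding.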
Alternative Discretization Methods.
In Figure 2 and the ablation studies of Section 4, all alternative discretization methods were optimized with Adam for the same number of iterations as SGA. The initial learning rate was identical for MAP, Uniform Noise, and Deterministic Annealing, and substantially smaller for STE. All methods were tuned on a best-effort basis to ensure convergence, except that STE consistently encountered convergence issues even with a tiny learning rate (see [Yin et al., 2019]). The rate-distortion results for MAP and STE were computed with early stopping (i.e., using the intermediate solution with the lowest true rate-distortion objective during optimization), to give them the benefit of the doubt. Lastly, the comparisons in Figure 2 used the Base Hyperprior model trained with a single fixed value of λ.
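The early-stopping procedure used for MAP and STE amounts to tracking the best intermediate iterate under the true objective rather than returning the final one. A minimal sketch, with `step` and `true_objective` as hypothetical placeholders for the actual update rule and rate-distortion loss:

```python
def optimize_with_early_stopping(step, true_objective, state, n_iters):
    """Iterate an optimizer but return the best intermediate state as judged
    by the true objective, not the final iterate. This protects methods
    whose iterates oscillate or diverge late in optimization."""
    best_state, best_obj = state, true_objective(state)
    for _ in range(n_iters):
        state = step(state)
        obj = true_objective(state)
        if obj < best_obj:              # keep the best state seen so far
            best_state, best_obj = state, obj
    return best_state, best_obj
```

For example, with `step = lambda x: x + 1` and `true_objective = lambda x: (x - 3) ** 2` starting from 0, the iterates overshoot the minimum, but the returned state is the one that passed through it.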
S3 Additional Results
On Kodak, both SGA+BB and SGA achieve BD rate savings relative to the Base Hyperprior and relative to BPG; the same holds on Tecnick. We note that the bitrates in our rate-distortion results were based on rate estimates from the entropy losses; for SGA, we confirmed that the rate estimate from the prior agrees with (typically within 1% of) the actual file size produced by the codec implementation of Ballé et al. [2018].
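The rate estimate referred to above is the standard information-theoretic one: the negative base-2 log-likelihood of the discretized latents under the prior, normalized by pixel count. A minimal sketch (function name hypothetical):

```python
import numpy as np

def estimated_bpp(likelihoods, num_pixels):
    """Rate estimate in bits per pixel from the entropy loss: the negative
    base-2 log-likelihood of the discretized latents under the prior is the
    ideal code length that an entropy coder (e.g., a range coder) can
    approach, which is why it tracks the actual file size so closely."""
    return float(-np.sum(np.log2(likelihoods)) / num_pixels)
```

For instance, 16 latent values that each have probability 0.5 under the prior cost 16 bits in total, i.e., 4 bits per pixel for a 4-pixel image.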
Below we provide an additional qualitative comparison on the Kodak dataset, and report detailed rate-distortion performance on the Tecnick [Asuni and Giachetti, 2014] dataset.