1 Introduction
Image compression refers to the task of representing images using as little storage (i.e., bits) as possible. While in lossless image compression the compression rate is limited by the requirement that the original image should be perfectly reconstructible, in lossy image compression, a greater reduction in storage is enabled by allowing for some distortion in the reconstructed image. This results in a so-called rate-distortion trade-off, where a balance between the bitrate and the distortion is found by minimizing d + βH, where β > 0 balances the two competing objectives. Recently, deep neural networks (DNNs) trained as image autoencoders for this task have led to promising results, achieving better performance than many traditional techniques for image compression [19, 20, 17, 4, 2, 9]. Another advantage of DNN-based learned compression systems is their adaptability to specific target domains such as aerial images or stereo images, enabling even higher compression rates on these domains.

A key challenge in training such systems is to optimize the bitrate of the latent image representation in the autoencoder. To encode the latent representation using a finite number of bits, it needs to be discretized into symbols (i.e., mapped to a stream of elements from some finite set of values). Since discretization is non-differentiable, this presents challenges for gradient-based optimization methods, and many techniques have been proposed to address them. After discretization, information theory tells us that the correct measure for bitrate is the entropy H of the resulting symbols. Thus the challenge, and the focus of this paper, is how to model H such that we can navigate the rate-distortion trade-off during optimization of the autoencoder.
Our proposed method is based on leveraging context models, which were previously used as techniques to improve coding rates for already-trained models [4, 20, 9, 14], directly as an entropy term in the optimization. We concurrently train the autoencoder and the context model with respect to each other, where the context model learns a convolutional probabilistic model of the image representation in the autoencoder, while the autoencoder uses it for entropy estimation to navigate the rate-distortion trade-off. Furthermore, we generalize our formulation to spatially-aware networks, which use an importance map to spatially attend the bitrate to the most important regions of the compressed representation. The proposed techniques lead to a simple image compression system (code available at https://github.com/fab-jul/imgcomp-cvpr), which achieves state-of-the-art performance when measured with the popular multi-scale structural similarity index (MS-SSIM) distortion metric [23]
, while being straightforward to implement with standard deeplearning toolboxes.
2 Related work
Full-resolution image compression using DNNs has attracted considerable attention recently. DNN architectures commonly used for image compression are autoencoders [17, 4, 2, 9] and recurrent neural networks (RNNs) [19, 20]. The networks are typically trained to minimize the mean-squared error (MSE) between the original and decompressed image [17, 4, 2, 9], or using perceptual metrics such as MS-SSIM [20, 14]. Other notable techniques involve progressive encoding/decoding strategies [19, 20], adversarial training [14], multi-scale image decompositions [14], and generalized divisive normalization (GDN) layers [4, 3].

Context models and entropy estimation—the focus of the present paper—have a long history in the context of engineered compression methods, both lossless and lossy [24, 12, 25, 13, 10]. Most of the recent DNN-based lossy image compression approaches have also employed such techniques in some form. [4] uses a binary context model for adaptive binary arithmetic coding [11]. The works of [20, 9, 14] use learned context models for improved coding performance on their trained models when using adaptive arithmetic coding. [17, 2] use non-adaptive arithmetic coding but estimate the entropy term with an independence assumption on the symbols.
3 Proposed method
Given a set of training images X, we wish to learn a compression system which consists of an encoder, a quantizer, and a decoder. The encoder E maps an image x to a latent representation z = E(x). The quantizer Q discretizes the coordinates of z to L centers, obtaining ẑ with ẑ_i ∈ C = {c_1, …, c_L}, which can be losslessly encoded into a bitstream. The decoder then forms the reconstructed image x̂ = D(ẑ) from the quantized latent representation ẑ, which is in turn (losslessly) decoded from the bitstream. We want the encoded representation ẑ to be compact when measured in bits, while at the same time we want the distortion d(x, x̂) to be small, where d is some measure of reconstruction error, such as MSE or MS-SSIM. This results in the so-called rate-distortion trade-off
d(x, x̂) + β H(ẑ),    (1)
where H(ẑ) denotes the cost of encoding ẑ to bits, i.e., the entropy of ẑ. Our system is realized by modeling E and D as convolutional neural networks (CNNs) (more specifically, as the encoder and decoder, respectively, of a convolutional autoencoder) and minimizing (1) over the training set X, where a large/small β draws the system towards low/high average entropy H. In the next sections, we will discuss how we quantize z and estimate the entropy H(ẑ). We note that since E and D are CNNs, z will be a 3D feature map, but for simplicity of exposition we will denote it as a vector with equally many elements. Thus, z_i refers to the i-th element of the feature map, in raster scan order (row by column by channel).

3.1 Quantization
We adopt the scalar variant of the quantization approach proposed in [2] to quantize z, but simplify it using ideas from [17]. Specifically, given centers C = {c_1, …, c_L}, we use nearest neighbor assignments to compute
ẑ_i = Q(z_i) := arg min_j ‖z_i − c_j‖,    (2)
but rely on (differentiable) soft quantization
z̃_i = Σ_{j=1}^{L} [ exp(−σ‖z_i − c_j‖) / Σ_{l=1}^{L} exp(−σ‖z_i − c_l‖) ] c_j    (3)
to compute gradients during the backward pass. This combines the benefit of [2], where the quantization is restricted to a finite set of learned centers (instead of the fixed (non-learned) integer grid as in [17]), with the simplicity of [17], where a differentiable approximation of quantization is only used in the backward pass, avoiding the need to choose an annealing strategy (i.e., a schedule for σ) as in [2] to drive the soft quantization (3) to the hard assignments (2) during training. In TensorFlow, this is implemented as
ẑ = tf.stop_gradient(Q(z) − z̃) + z̃.    (4)
We note that for forward pass computations, ẑ = Q(z), and thus we will continue writing ẑ for the latent representation.
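The straight-through quantization described above can be sketched in plain Python as a toy scalar version; the center values and σ below are illustrative placeholders, not trained ones:

```python
import math

# Toy scalar sketch of the quantizer: hard nearest-neighbor assignment (2) in
# the forward pass, soft assignment (3) for gradients. Centers and sigma are
# illustrative, not the trained ones from the paper.

def hard_quantize(z, centers):
    # Eq. (2): pick the nearest center.
    return min(centers, key=lambda c: abs(z - c))

def soft_quantize(z, centers, sigma=1.0):
    # Eq. (3): softmax-weighted combination of the centers.
    weights = [math.exp(-sigma * abs(z - c)) for c in centers]
    total = sum(weights)
    return sum((w / total) * c for w, c in zip(weights, centers))

centers = [-1.0, 0.0, 1.0]
z = 0.8
z_hard = hard_quantize(z, centers)              # value used in the forward pass
z_soft = soft_quantize(z, centers, sigma=10.0)  # its gradient is used backward
# Straight-through trick (4): the value equals z_hard, but only z_soft would
# receive gradients, since (z_hard - z_soft) plays the role of tf.stop_gradient.
z_pass = (z_hard - z_soft) + z_soft
```

With a large σ, the soft assignment approaches the hard one, which is why no annealing schedule is needed here: the hard value is used forward regardless.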
3.2 Entropy estimation
To model the entropy H(ẑ) we build on the approach of PixelRNN [22] and factorize the distribution p(ẑ) as a product of conditional distributions
p(ẑ) = Π_i p(ẑ_i | ẑ_{i−1}, …, ẑ_1),    (5)
where the 3D feature volume ẑ is indexed in raster scan order. We then use a neural network P(ẑ), which we refer to as a context model, to estimate each term p(ẑ_i | ẑ_{i−1}, …, ẑ_1):
P_{i,l}(ẑ) ≈ p(ẑ_i = c_l | ẑ_{i−1}, …, ẑ_1),    (6)
where P specifies for every 3D location i in ẑ the probabilities of each symbol in C, with l ∈ {1, …, L}. We refer to the resulting approximate distribution as q(ẑ) := Π_i P_{i, I(ẑ_i)}(ẑ), where I(ẑ_i) denotes the index of ẑ_i in C.
Since the conditional distributions only depend on previous values ẑ_{i−1}, …, ẑ_1, this imposes a causality constraint on the network P: While P may compute all terms P_{i,l} in parallel, it needs to make sure that each such term only depends on previous values ẑ_{i−1}, …, ẑ_1.
The authors of PixelCNN [22, 21] study the use of 2D CNNs as causal conditional models over 2D images in a lossless setting, i.e., treating the RGB pixels as symbols. They show that the causality constraint can be efficiently enforced using masked filters in the convolution. Intuitively, the idea is as follows: If for each layer the causality condition is satisfied with respect to the spatial coordinates of the layer before, then by induction the causality condition will hold between the output layer and the input. Satisfying the causality condition for each layer can be achieved with proper masking of its weight tensor, and thus the entire network can be made causal only through the masking of its weights. Thus, the entire set of probabilities for all (2D) spatial locations and symbol values can be computed in parallel with a fully convolutional network, as opposed to modeling each term separately.

In our case, ẑ is a 3D symbol volume, with as many as K channels. We therefore generalize the approach of PixelCNN to 3D convolutions, using the same idea of masking the filters properly in every layer of the network. This enables us to model P(ẑ) efficiently, with a lightweight 3D-CNN (we use a 4-layer network, compared to 15 layers in [22]) which slides over ẑ, while properly respecting the causality constraint. We refer to the supplementary material for more details.
As in [21], we learn P by training it for maximum likelihood, or equivalently (see [16]) by training P to classify the index I(ẑ_i) of ẑ_i in C with a cross entropy loss:

CE := E_{ẑ∼p(ẑ)} [ −Σ_i log P_{i, I(ẑ_i)}(ẑ) ].    (7)
Using the well-known property of cross entropy as the coding cost incurred when using the wrong distribution q instead of the true distribution p, we can also view the loss CE as an estimate of H(ẑ), since we learn P such that q ≈ p. That is, we can compute
H(ẑ) = E_{ẑ∼p(ẑ)} [ −log p(ẑ) ]    (8)
     = E_{ẑ∼p(ẑ)} [ −Σ_i log p(ẑ_i | ẑ_{i−1}, …, ẑ_1) ]    (9)
     ≈ E_{ẑ∼p(ẑ)} [ −Σ_i log P_{i, I(ẑ_i)}(ẑ) ]    (10)
     = E_{ẑ∼p(ẑ)} [ −log q(ẑ) ]    (11)
     = CE.    (12)
Therefore, when training the autoencoder we can indirectly minimize H(ẑ) through the cross entropy CE. We refer to the argument of the expectation in (7),
C(ẑ) := −Σ_i log P_{i, I(ẑ_i)}(ẑ),    (13)
as the coding cost of the latent image representation, since this reflects the coding cost incurred when using P as a context model with an adaptive arithmetic encoder [11]. From the application perspective, minimizing the coding cost is actually more important than minimizing the (unknown) true entropy, since it reflects the bitrate obtained in practice.
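As a toy numeric illustration of the coding cost (13): assume a context model assigns the following (made-up) probabilities to the four coded symbols of a tiny representation; the coding cost is then the negative log-likelihood in bits:

```python
import math

# Toy illustration of the coding cost (13): the bits an adaptive arithmetic
# coder spends when each symbol z_i is coded with the model probability
# P_{i, I(z_i)}. The probabilities below are made up for the example.
model_probs = [0.5, 0.25, 0.125, 0.125]   # probability of each coded symbol
coding_cost_bits = -sum(math.log2(p) for p in model_probs)
# 1 + 2 + 3 + 3 = 9 bits in total.
```

A sharper model, i.e., one assigning higher probability to the symbols that actually occur, incurs a lower coding cost, which is exactly what minimizing (7) encourages.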
To backpropagate through Q we use the same approach as for the encoder (see (4)). Thus, like the decoder D, the context model P only sees the (discrete) ẑ in the forward pass, whereas the gradient of the soft quantization z̃ is used for the backward pass.

3.3 Concurrent optimization
Given an autoencoder (E, D), we can train P to model the dependencies between the entries of ẑ as described in the previous section by minimizing (7). On the other hand, using P, we can obtain an estimate of H(ẑ) as in (12) and use this estimate to adjust (E, D) such that d(x, x̂) + β H(ẑ) is reduced, thereby navigating the rate-distortion trade-off. Therefore, it is natural to concurrently learn P (with respect to its own loss) and (E, D) (with respect to the rate-distortion trade-off) during training, such that all models which the losses depend on are continuously updated.
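The alternating updates can be sketched as follows; `step` and both loss callables are hypothetical stand-ins, not the paper's API:

```python
# Sketch of the concurrent optimization: per iteration, one gradient step for
# the autoencoder/quantizer on the rate-distortion loss and one for the
# context model on its cross-entropy loss. All callables are stand-ins.

def train_concurrently(batches, rd_loss, ce_loss, step):
    for batch in batches:
        step("E,D,Q", rd_loss(batch))  # rate-distortion trade-off, Eq. (1)
        step("P", ce_loss(batch))      # context-model cross entropy, Eq. (7)
```

Because both steps happen in every iteration, the entropy estimate used by the autoencoder always reflects the current context model, and vice versa.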
3.4 Importance map for spatial bit allocation
Recall that since E and D are CNNs, ẑ is a 3D feature map. For example, if E has three stride-2 convolution layers and the bottleneck has K channels, the dimensions of ẑ will be W/8 × H/8 × K for a W × H input image. A consequence of this formulation is that we are using equally many symbols in ẑ for each spatial location of the input image x. It is known, however, that in practice there is great variability in the information content across spatial locations (e.g., the uniform area of blue sky vs. the fine-grained structure of the leaves of a tree).

This can in principle be accounted for automatically in the trade-off between the entropy and the distortion: the network would learn to output more predictable (i.e., low entropy) symbols for the low information regions, while making room for the use of high entropy symbols in the more complex regions. More precisely, the formulation in (7) already allows for variable bit allocation across spatial regions through the context model P.
However, this arguably requires a quite sophisticated (and hence computationally expensive) context model, and we find it beneficial to instead follow Li et al. [9] by using an importance map to help the CNN attend to different regions of the image with different amounts of bits. While [9] uses a separate network for this purpose, we consider a simplified setting. We take the last layer of the encoder E and add a second, single-channel output y. We expand this single channel into a mask m of the same dimensionality as ẑ as follows:
m_{i,j,k} =  { 1,            if k < y_{i,j}
             { y_{i,j} − k,  if k ≤ y_{i,j} ≤ k + 1        (14)
             { 0,            otherwise,
where y_{i,j} denotes the value of y at spatial location (i, j). The transition value y_{i,j} − k for k ≤ y_{i,j} ≤ k + 1 is such that the mask smoothly transitions from 0 to 1 for non-integer values of y_{i,j}.
We then mask z by pointwise multiplication with the binarization of m, i.e., z ← z ⊙ ⌈m⌉. Since the ceiling operator ⌈·⌉ is not differentiable, we use the identity for the backward pass, as done by [17, 9].

With this modification, we have simply changed the architecture of E slightly, such that it can easily “zero out” portions of columns of ẑ (the rest of the network stays the same, so that (2) still holds, for example). As suggested by [9], the so-obtained structure in ẑ presents an alternative coding strategy: instead of losslessly encoding the entire symbol volume ẑ, we could first (separately) encode the mask m, and then for each column of ẑ only encode the first ⌈y_{i,j}⌉ symbols, since the remaining ones are the constant Q(0), which we refer to as the zero symbol.
The work of [9] uses binary symbols (i.e., L = 2) and assumes independence between the symbols and a uniform prior during training, i.e., each symbol costs 1 bit to encode. The importance map is thus their principal tool for controlling the bitrate, since it lets them avoid encoding all the symbols in the representation. In contrast, we stick to the formulation in (5), where the dependencies between the symbols are modeled during training. We then use the importance map as an architectural constraint, and use their suggested coding strategy to obtain an alternative estimate of the entropy H(ẑ), as follows.
We observe that we can recover ⌈m⌉ from the masked ẑ by counting the number of consecutive zero symbols at the end of each column of ẑ. (If ẑ contained zeros before it was masked, we might overestimate the number of masked entries; however, we can then redefine the corresponding entries of m as 0, which gives the same result after masking.) ⌈m⌉ is therefore a function of the masked ẑ, which means that the conditional entropy H(⌈m⌉ | ẑ) = 0. Now, we have
H(ẑ) = H(ẑ, ⌈m⌉) − H(⌈m⌉ | ẑ)    (15)
     = H(ẑ, ⌈m⌉)    (16)
     = H(ẑ | ⌈m⌉) + H(⌈m⌉).    (17)
If we treat the entropy of the mask, H(⌈m⌉), as constant during optimization of the autoencoder, we can then indirectly minimize H(ẑ) through H(ẑ | ⌈m⌉).
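The observation that ⌈m⌉ is recoverable from the masked ẑ can be checked with a toy column of symbols; here the zero symbol is literally 0 and the helper name is ours:

```python
# Toy check that the binarized mask can be recovered from a masked column by
# counting trailing zero symbols at the end of the column.

def kept_symbols(column, zero_symbol=0):
    n = len(column)
    while n > 0 and column[n - 1] == zero_symbol:
        n -= 1
    return n  # number of leading symbols kept by the mask in this column
```

Running this per spatial column of ẑ reconstructs ⌈m⌉, which is why H(⌈m⌉ | ẑ) = 0 in the derivation above.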
To estimate H(ẑ | ⌈m⌉), we use the same factorization of p as in (5), but since the mask is known, the terms p(ẑ_i | ẑ_{i−1}, …, ẑ_1) are deterministic for the 3D locations in ẑ where the mask is zero. The logs of the corresponding terms in (9) then evaluate to 0. The remaining terms we can model with the same context model P, which results in
H(ẑ | ⌈m⌉) ≈ E_{ẑ∼p(ẑ)} [ −Σ_i ⌈m_i⌉ log P_{i, I(ẑ_i)}(ẑ) ],    (18)
where m_i denotes the i-th element of m (in the same raster scan order as ẑ).
Similar to the coding cost (13), we refer to the argument of the expectation in (18),
MC(ẑ) := −Σ_i ⌈m_i⌉ log P_{i, I(ẑ_i)}(ẑ),    (19)

as the masked coding cost of ẑ.
While the entropy estimate (18) is almost the same quantity as (7) (they differ only by H(⌈m⌉)), it has the benefit of being weighted by ⌈m⌉. The encoder therefore has an obvious path to control the entropy of ẑ: by simply increasing/decreasing the value of y for some spatial location of x, it obtains fewer/more zero entries in ẑ.
When the context model P is trained, however, we still train it with respect to the formulation in (8), so it does not have direct access to the mask m and needs to learn the dependencies over the entire masked symbol volume ẑ. This means that when encoding an image, we can stick to standard adaptive arithmetic coding over the entire bottleneck, without needing to resort to a two-step coding process as in [9], where the mask is first encoded and then the remaining symbols. We emphasize that this approach hinges critically on the context model P and the encoder E being trained concurrently, as this allows the encoder to learn a meaningful (in terms of coding cost) mask with respect to P (see the next section).
In our experiments we observe that during training, the two entropy losses (7) and (18) converge to almost the same value, with the latter being slightly smaller due to H(⌈m⌉) being ignored.
While the importance map is not crucial for optimal rate-distortion performance if the channel depth K is adjusted carefully, we found that we could more easily control the entropy of ẑ through β when using a fixed K, since the network can easily learn to ignore some of the channels via the importance map. Furthermore, in the supplementary material we show that by using multiple importance maps for a single network, one can obtain a single model that supports multiple compression rates.
3.5 Putting the pieces together
We made an effort to describe our formulation and its motivation in detail. While the description is lengthy, putting the resulting pieces together yields a quite straightforward pipeline for learned image compression, as follows.
Given the set of training images X, we initialize (fully convolutional) CNNs E, D, and P, as well as the centers C of the quantizer Q. Then, we train over minibatches of crops from X. At each iteration, we take one gradient step for the autoencoder (E, D) and the quantizer Q, with respect to the rate-distortion trade-off
(1/B) Σ_x [ d(x, x̂) + β MC(ẑ) ],    (20)
which is obtained by combining (1) with the estimates (18) and (19) and taking the batch sample average over the B images x in the minibatch. Furthermore, we take a gradient step for the context model P with respect to its objective (see (7) and (13))
(1/B) Σ_x C(ẑ).    (21)
To compute these two batch losses, we need to perform the following computation for each x in the batch:

- Obtain the compressed (latent) representation z and the importance map y from the encoder: (z, y) = E(x).
- Expand the importance map y to the mask m via (14).
- Mask z, i.e., z ← z ⊙ ⌈m⌉.
- Quantize ẑ = Q(z).
- Compute the context model probabilities P(ẑ).
- Decode x̂ = D(ẑ),
which can be computed in parallel over the minibatch on a GPU since all the models are fully convolutional.
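The per-image computation can be sketched end to end as a single-sample, 1D toy version of (20); every network is replaced by a hypothetical stand-in callable, and the context model here simply returns the model probability of each coded symbol (none of this is the paper's actual architecture):

```python
import math

# Single-sample, 1D toy sketch of the per-image computation and the resulting
# rate-distortion loss (20). All callables are hypothetical stand-ins.

def forward_pass(x, encoder, expand_mask, quantize, context_model, decoder, beta):
    z, y = encoder(x)                 # latent + importance value(s)
    m = expand_mask(y)                # mask via (14)
    z = [zi * math.ceil(mi) for zi, mi in zip(z, m)]   # mask the latent
    z_hat = [quantize(zi) for zi in z]                 # quantize
    probs = context_model(z_hat)      # P_{i, I(z_i)} per coded symbol
    x_hat = decoder(z_hat)            # reconstruct
    mc = -sum(math.ceil(mi) * math.log2(p)
              for mi, p in zip(m, probs))              # masked coding cost (19)
    distortion = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)  # MSE
    return distortion + beta * mc     # single-sample version of (20)
```

Averaging this over a minibatch gives the autoencoder loss (20); the context model's own loss (21) reuses the same forward pass.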
3.6 Relationship to previous methods
We are not the first to use context models for adaptive arithmetic coding to improve the performance of learned deep image compression. The work [20] uses a PixelRNN-like architecture [22] to train a recurrent network as a context model for an RNN-based compression autoencoder. Li et al. [9] extract cuboid patches around each symbol in a binary feature map and feed them to a convolutional context model. Both of these methods, however, only learn the context model after training their system, as a post-processing step to boost coding performance.
In contrast, our method directly incorporates the context model as the entropy term in the rate-distortion objective (1) of the autoencoder, and trains the two concurrently. This comes at little overhead during training, since we adopt a 3D-CNN for the context model, using PixelCNN-inspired [21] masking of the weights of each layer to ensure causality. Adopting the same approach for the context models deployed by [20] or [9] would be non-trivial, since they are not designed for fast feed-forward computation. In particular, while the context model of [9] is also convolutional, its causality is enforced by masking the inputs to the network, as opposed to our masking of the weights of the network. This means their context model needs to be run separately with a proper input cuboid for each symbol in the volume (i.e., not fully convolutionally).
4 Experiments
Architecture
Distortion measure
Training
We use the Adam optimizer [8] with a minibatch size of 30 to train seven models. Each model is trained to maximize MS-SSIM directly. As a baseline, we used the same learning rate (LR) for each model, but found it beneficial to vary it slightly for different models. We fix σ in the smooth approximation (3) used for gradient backpropagation through Q. To make the model land more predictably at a certain bitrate when optimizing (1), we found it helpful to clip the rate term, i.e., replace the entropy term βH with max(t, βH) for a target entropy t, such that the entropy term is “switched off” when it is below t. We found this did not hurt performance. We decay the learning rate by a factor of 10 every two epochs. To obtain models for different bitrates, we adapt the target bitrate t and the number of channels K, while using a moderately large β. We use a small regularization on the weights and note that we achieve very stable training. We trained our models for 6 epochs, which took around 24 h per model on a single GPU. For P, we use a separate LR and the same decay schedule.

Datasets
We train on the ImageNet dataset from the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [15]. As a preprocessing step, we take random crops and randomly flip them. We set aside 100 images from ImageNet as a testing set, ImageNet-Test. Furthermore, we test our method on the widely used Kodak [1] dataset. To assess performance on high-quality full-resolution images, we also test on the datasets B100 [18] and Urban100 [5], commonly used in super-resolution.
Other codecs
We compare to JPEG, using libjpeg (http://libjpeg.sourceforge.net/), and JPEG2000, using the Kakadu implementation (http://kakadusoftware.com/). We also compare to the lesser-known BPG (https://bellard.org/bpg/), which is based on HEVC, the state-of-the-art in video compression, and which outperforms JPEG and JPEG2000. We use BPG in the non-default 4:4:4 chroma format, following [14].
Comparison
Like [14], we proceed as follows to compare to other methods. For each dataset, we compress each image using all our models. This yields a set of (bpp, MS-SSIM) points for each image, which we interpolate to get a curve for each image. We fix a grid of bpp values and average the curves for each image at each bpp grid value (ignoring those images whose bpp range does not include the grid value, i.e., we do not extrapolate). We do this for our method, BPG, JPEG, and JPEG2000. Because code is generally unavailable for the related works, we digitize the Kodak curve from Rippel & Bourdev [14], who have carefully collected the curves from the respective works. With this, we also show the results of Rippel & Bourdev [14], Johnston et al. [7], Ballé et al. [4], and Theis et al. [17]. To validate that our MS-SSIM implementation is correct, we independently generated the BPG curve for Kodak and verified that it matched the one from [14].

Results
Fig. 1 shows a comparison of the aforementioned methods on Kodak. Our method outperforms BPG, JPEG, and JPEG2000, as well as the neural-network-based approaches of Johnston et al. [7], Ballé et al. [4], and Theis et al. [17]. Furthermore, we achieve performance comparable to that of Rippel & Bourdev [14]. This holds for all bpps we tested, from 0.3 bpp to 0.9 bpp. We note that while Rippel & Bourdev and Johnston et al. also train to maximize (MS-)SSIM, the other methods minimize MSE.
In each of the other testing sets, we also outperform BPG, JPEG, and JPEG2000 over the reported bitrates, as shown in Fig. 4.
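The comparison protocol above (interpolate each image's (bpp, MS-SSIM) points, then average on a fixed bpp grid without extrapolating) can be sketched as follows; the function names are ours:

```python
# Sketch of the comparison protocol: per-image (bpp, MS-SSIM) points are
# linearly interpolated, then averaged on a fixed bpp grid, skipping images
# whose bpp range does not cover a grid point (no extrapolation).

def interp(points, x):
    """Piecewise-linear interpolation; None outside the covered bpp range."""
    pts = sorted(points)
    if not (pts[0][0] <= x <= pts[-1][0]):
        return None
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            t = 0.0 if x1 == x0 else (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

def average_curve(per_image_points, bpp_grid):
    curve = []
    for bpp in bpp_grid:
        vals = [v for pts in per_image_points
                if (v := interp(pts, bpp)) is not None]
        curve.append(sum(vals) / len(vals) if vals else None)
    return curve
```

Grid points outside every image's range yield no value at all, matching the no-extrapolation rule.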
In Fig. 5, we compare our approach to BPG, JPEG, and JPEG2000 visually, using very strong compression on kodim21 from Kodak. It can be seen that the output of our network is pleasant to look at. Soft structures like the clouds are very well preserved. BPG appears to handle high frequencies better (see, e.g., the fence) but loses structure in the clouds and in the sea. Like JPEG2000, it produces block artifacts. JPEG breaks down at this rate. We refer to the supplementary material for further visual examples.
Ablation study: Context model
In order to show the effectiveness of the context model, we performed the following ablation study. We trained the autoencoder without the entropy loss, i.e., β = 0 in (20), with a fixed number of centers L and channels K. On Kodak, this model yields an average MS-SSIM of 0.982, at an average rate of 0.646 bpp (calculated assuming that we need log2(L) bits per symbol). We then trained three different context models for this autoencoder, while keeping the autoencoder fixed: a zeroth order context model, which uses a histogram to estimate the probability of each of the L symbols; a first order (one-step prediction) context model, which uses a conditional histogram to estimate the probability of each symbol given the previous symbol (scanning in raster order); and P, i.e., our proposed context model. The resulting average rates are shown in Table 1. Our context model reduces the rate by 10%, even though the autoencoder was optimized using a uniform prior (see the supplementary material for a detailed comparison of Table 1 and Fig. 1).
Model              | rate
-------------------|----------
Baseline (Uniform) | 0.646 bpp
Zeroth order       | 0.642 bpp
First order        | 0.627 bpp
Our context model  | 0.579 bpp

Table 1: Average rates on Kodak for the different context models.
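The two baseline context models from Table 1 can be sketched as simple histogram coders. This is a toy version on a made-up symbol stream; the real models operate on the quantized feature volume:

```python
import math
from collections import Counter

# Toy sketch of the baseline context models from the ablation: a zeroth-order
# model codes each symbol with a global histogram; a first-order model
# conditions on the previous symbol in raster-scan order.

def zeroth_order_bits(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def first_order_bits(symbols):
    pair_counts = Counter(zip(symbols, symbols[1:]))
    prev_counts = Counter(symbols[:-1])
    bits = math.log2(len(set(symbols)))  # first symbol: uniform over alphabet
    for (a, b), c in pair_counts.items():
        bits -= c * math.log2(c / prev_counts[a])
    return bits

s = [0, 0, 1, 0, 0, 1, 0, 0, 1]
# The predictable pattern costs fewer bits under the first-order model.
```

The gap between the two estimates grows with the predictability of the stream, mirroring the rate reductions in Table 1.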
Importance map
As described in detail in Section 3.4, we use an importance map to dynamically alter the number of channels used at different spatial locations to encode an image. To visualize how this helps, we trained two autoencoders, matched in rate: the first uses an importance map and at most K channels to compress an image; the second compresses without an importance map, with a channel depth chosen such that its rate is similar to that of the first. In Fig. 6, we show an image from ImageNet-Test along with the same image compressed to 0.463 bpp by the importance-map model and to 0.504 bpp by the model without one. Furthermore, Fig. 6 shows the importance map produced by the first model, as well as ordered visualizations of all channels of the latent representation for both models. Note how for the importance-map model, channels with larger index are sparser, showing how the model can spatially adapt the number of channels; the model without an importance map uses all channels similarly.
Figure 6: Input image and importance map (top); outputs and latent representations of the models with and without an importance map (bottom).
5 Discussion
Our experiments showed that combining a convolutional autoencoder with a lightweight 3D-CNN as context model and training the two networks concurrently leads to a highly effective image compression system. Not only were we able to clearly outperform state-of-the-art engineered compression methods including BPG and JPEG2000 in terms of MS-SSIM, but we also obtained performance competitive with the current state-of-the-art learned compression method from [14]. In particular, our method outperforms BPG and JPEG2000 in MS-SSIM across four different testing sets (ImageNet-Test, Kodak, B100, Urban100), and does so significantly, i.e., the proposed method generalizes well. We emphasize that our method relies on elementary techniques both in terms of the architecture (standard convolutional autoencoder with importance map, convolutional context model) and the training procedure (minimizing the rate-distortion trade-off and the negative log-likelihood for the context model), while [14] uses highly specialized techniques such as a pyramidal decomposition architecture, adaptive codelength regularization, and multi-scale adversarial training.
The ablation study for the context model showed that our 3D-CNN-based context model is significantly more powerful than the zeroth order (histogram) and first order (one-step prediction) baseline context models. Further, our experiments suggest that the importance map learns to condense the image information into a reduced number of channels of the latent representation without relying on explicit supervision. Notably, the importance map is learned as a part of the image compression autoencoder, concurrently with the autoencoder and the context model, without introducing any optimization difficulties. In contrast, in [9] the importance map is computed using a separate network, learned together with the autoencoder, while the context model is learned separately.
6 Conclusions
In this paper, we proposed the first method for learning a lossy image compression autoencoder concurrently with a lightweight context model by incorporating it into an entropy loss for the optimization of the autoencoder, leading to performance competitive with the current stateoftheart in deep image compression [14].
Future work could explore heavier and more powerful context models, such as those employed in [22, 21]. This could further improve compression performance and allow for sampling of natural images in a “lossy” manner, by sampling ẑ according to the context model and then decoding.
Acknowledgements
This work was supported by ETH Zürich and by NVIDIA through a GPU grant.
References
 [1] Kodak PhotoCD dataset. http://r0k.us/graphics/kodak/.
 [2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. arXiv preprint arXiv:1704.00648, 2017.
 [3] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimization of nonlinear transform codes for perceptual quality. arXiv preprint arXiv:1607.05006, 2016.
 [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.

 [5] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
 [6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 [7] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint arXiv:1703.10114, 2017.
 [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [9] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang. Learning convolutional networks for contentweighted image compression. arXiv preprint arXiv:1703.10553, 2017.
 [10] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–636, 2003.
 [11] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–636, 2003.
 [12] B. Meyer and P. Tischer. Tmwa new method for lossless image compression. ITG FACHBERICHT, pages 533–540, 1997.
 [13] B. Meyer and P. E. Tischer. Glicbawlsgrey level image compression by adaptive weighted least squares. In Data Compression Conference, volume 503, 2001.
 [14] O. Rippel and L. Bourdev. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2922–2930, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
 [16] C. Shalizi. Lecture notes on stochastic processes. http://www.stat.cmu.edu/~cshalizi/754/2006/notes/lecture28.pdf, 2006. [Online; accessed 15Nov2017].

 [17] L. Theis, W. Shi, A. Cunningham, and F. Huszar. Lossy image compression with compressive autoencoders. In ICLR, 2017.
 [18] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution, pages 111–126. Springer International Publishing, Cham, 2015.
 [19] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085, 2015.
 [20] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. arXiv preprint arXiv:1608.05148, 2016.
 [21] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 [22] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.
 [23] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems Computers, 2003, volume 2, pages 1398–1402 Vol.2, Nov 2003.
 [24] M. J. Weinberger, J. J. Rissanen, and R. B. Arps. Applications of universal context modeling to lossless compression of grayscale images. IEEE Transactions on Image Processing, 5(4):575–586, 1996.
 [25] X. Wu, E. Barthel, and W. Zhang. Piecewise 2d autoregression for predictive image coding. In Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, pages 901–904. IEEE, 1998.
Conditional Probability Models for Deep
Image Compression – Suppl. Material
3D probability classifier
As mentioned in Section 3.2, we rely on masked 3D convolutions to enforce the causality constraint in our probability classifier P. In a 2D-CNN, standard 2D convolutions are used in filter banks, as shown in Fig. 7 on the left: a C×H×W-dimensional tensor is mapped to a C′×H′×W′-dimensional tensor using C′ banks of C 2D filters, i.e., the filters can be represented as C′×C×f×f-dimensional tensors. Note that all C input channels are used together, which violates causality: when we encode, we proceed channel by channel.
Using 3D convolutions, a depth dimension D is introduced. In a 3D-CNN, C×D×H×W-dimensional tensors are mapped to C′×D′×H′×W′-dimensional tensors, with C′×C×fd×fh×fw-dimensional filters. Thus, a 3D-CNN slides over the depth dimension, as shown in Fig. 7 on the right. We use such a 3D-CNN for P, with our feature map ẑ as input, treating its channel dimension as the depth dimension, i.e., C = 1 for the first layer.
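The depth-sliding behavior described above can be sketched with a naive single-filter 3D convolution (a minimal NumPy illustration with valid padding and no filter banks; `conv3d_naive` is an illustrative name, not the paper's code, and like most deep-learning "convolutions" it is actually a cross-correlation):

```python
import numpy as np

def conv3d_naive(volume, filt):
    """Naive single-filter 3D convolution (valid padding, cross-correlation).

    volume: (D, H, W) array; filt: (fd, fh, fw) array.
    Unlike a 2D filter bank, which consumes all channels at once, the 3D
    filter also slides over the depth axis, so the output keeps a depth
    dimension of size D - fd + 1.
    """
    D, H, W = volume.shape
    fd, fh, fw = filt.shape
    out = np.zeros((D - fd + 1, H - fh + 1, W - fw + 1))
    for d in range(out.shape[0]):
        for h in range(out.shape[1]):
            for w in range(out.shape[2]):
                # Weighted sum over the local (fd, fh, fw) neighborhood.
                out[d, h, w] = np.sum(volume[d:d + fd, h:h + fh, w:w + fw] * filt)
    return out
```

In the actual model, padding keeps the spatial dimensions and the masking described below enforces causality; this sketch only illustrates that the filter moves along depth as well as height and width.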
To explain how we mask the filters in P, consider the 2D case in Fig. 8. We want to encode all values ẑ_i by iterating over them in raster scan order and computing p(ẑ_i | ẑ_{i−1}, …, ẑ_1). We simplify this: instead of relying on all previously encoded symbols, we use some context around ẑ_i (black square in Fig. 8). To satisfy the causality constraint, this context may only contain values above ẑ_i, or in the same row to the left of ẑ_i (gray cells). By using the filter shown in Fig. 8 in the top right for the first layer of a CNN and the filter shown in Fig. 8 in the bottom right for subsequent layers, we can build a 2D-CNN with a receptive field that forms such a context. We build our 3D-CNN by generalizing this idea to 3D, where we construct the mask for the filter of the first layer as shown in pseudo-code Algorithm 1. The mask for the subsequent layers is constructed analogously by replacing "<" in line 7 with "≤". We use filters of size f×f×f.
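The mask construction can be sketched as follows (a minimal NumPy illustration in the spirit of Algorithm 1; `causal_mask_3d` and its arguments are illustrative names, not the paper's code):

```python
import numpy as np

def causal_mask_3d(fd, fh, fw, include_center=False):
    """Binary mask for a 3D filter of odd size (fd, fh, fw).

    A position is kept only if it precedes the filter center in the
    encoding order (first along w, then h, then d), expressed here as a
    lexicographic comparison on (d, h, w).  include_center=False gives
    the first-layer mask; include_center=True the mask for subsequent
    layers (i.e., "<" replaced by "<=", analogous to PixelCNN masks
    A and B, generalized to 3D).
    """
    center = (fd // 2, fh // 2, fw // 2)
    mask = np.zeros((fd, fh, fw), dtype=np.float32)
    for d in range(fd):
        for h in range(fh):
            for w in range(fw):
                if include_center:
                    keep = (d, h, w) <= center  # subsequent layers
                else:
                    keep = (d, h, w) < center   # first layer
                mask[d, h, w] = float(keep)
    return mask
```

For a 3×3×3 filter, the first-layer mask keeps the 13 positions preceding the center, and the subsequent-layer mask additionally keeps the center itself.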
With this approach, we obtain a 3D-CNN which operates on blocks the size of its receptive field. We can use P to encode ẑ by iterating over it in such blocks, exhausting first axis w, then axis h, and finally axis d (like in Algorithm 1). For each such block, P yields the probability distribution of the central symbol given the symbols in the block. Due to the construction of the masks, this probability distribution only depends on previously encoded symbols.
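The encoding loop can be sketched as follows (a hedged illustration, not the paper's implementation: `predict_probs` stands in for the masked 3D-CNN P, and a real encoder would pass each symbol together with its distribution to an adaptive arithmetic coder rather than collecting them in a list):

```python
import numpy as np

def encode_symbols(z_hat, predict_probs, L):
    """Visit z_hat (D x H x W) in raster scan order, exhausting first
    axis w, then h, then d, and query the probability model for the
    distribution over the L possible symbol values at each position.

    predict_probs(z_hat, d, h, w) -> (L,) array; by construction of the
    masks, it may only depend on previously visited symbols.
    """
    D, H, W = z_hat.shape
    stream = []
    for d in range(D):          # depth (channel) axis, slowest
        for h in range(H):      # rows
            for w in range(W):  # columns, fastest
                probs = predict_probs(z_hat, d, h, w)
                assert probs.shape == (L,)
                # A real coder would feed (symbol, probs) to an
                # arithmetic coder here.
                stream.append((z_hat[d, h, w], probs))
    return stream
```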
Multiple compression rates
It is quite straightforward to obtain multiple operating points in a single network with our framework: we can simply share the network but use multiple importance maps. We did a simple experiment where we trained an autoencoder with 5 different importance maps. In each iteration, one of the importance maps was picked at random, and the target entropy was set to the corresponding value. While not tuned for performance, this already yielded a model competitive with BPG. The following shows the output of the model for each of the 5 operating points (from left to right):
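The sampling scheme in this experiment can be sketched as follows (all names and the example target values are hypothetical, not the paper's settings):

```python
import random

def pick_operating_point(target_entropies, rng=random):
    """Each training iteration picks one of the importance maps (and its
    matching target entropy) uniformly at random, so a single shared
    network learns all operating points."""
    i = rng.randrange(len(target_entropies))
    return i, target_entropies[i]

# Hypothetical example: 5 target entropies (in bits per pixel).
example_targets = [0.1, 0.2, 0.3, 0.4, 0.5]
```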
On the benefit of 3DCNN and joint training
We note that the points from Table 1 (where we trained different entropy models non-jointly, as a post-training step) are not directly comparable with the curve in Fig. 1. This is because these points are obtained by taking the mean of the MS-SSIM and bpp values over the Kodak images for a single model. In contrast, the curve in Fig. 1 is obtained by following the approach of [14], constructing a MS-SSIM vs. bpp curve per image via interpolation (see Comparison in Section 4). In Fig. 9, we show the black curve from Fig. 1, the mean (MS-SSIM, bpp) points achieved by the underlying models, and the points from Tab. 1. We can see that our masked 3D-CNN with joint training gives a significant improvement over the separately trained 3D-CNN, i.e., a 22% reduction in bpp when comparing mean points (the red point is estimated).
Nonrealistic images
In Fig. 10, we compare our approach to BPG on an image from the Manga109 dataset (http://www.manga109.org/). We can see that our approach preserves text well enough to still be legible, but it is not as crisp as in BPG (left zoom). On the other hand, our approach manages to preserve the fine texture on the face better than BPG (right zoom).
Visual examples
The following pages show the first four images of each of our validation sets compressed to a low bitrate, together with outputs from BPG, JPEG 2000, and JPEG compressed to similar bitrates. We ignored all header information for all considered methods when computing the bitrate (here and throughout the paper). We note that the only header our approach requires is the size of the image and an identifier specifying the model.
Overall, our images look pleasant to the eye. We see cases of over-blurring in our outputs, where BPG manages to keep high frequencies due to its more local approach. Examples are the fences in front of the windows in Fig. 14, top, and the text in Fig. 15, top. On the other hand, BPG tends to discard low-contrast high frequencies that our approach keeps in the output, as in the door in Fig. 11, top, or the hair in Fig. 12, bottom. This may be explained by BPG being optimized for MSE, as opposed to our approach being optimized for MS-SSIM.
JPEG looks extremely blocky for most images due to the very low bitrate.
Bitrates of the compressed outputs shown on the following pages (one row per image; each image is shown with our approach, BPG, JPEG 2000, and JPEG):

Image   Ours [bpp]   BPG [bpp]   JPEG 2000 [bpp]   JPEG [bpp]
 1      0.239        0.246       0.242             0.259
 2      0.203        0.201       0.197             0.205
 3      0.165        0.164       0.166             0.166
 4      0.193        0.209       0.194             0.203
 5      0.385        0.394       0.377             0.386
 6      0.365        0.363       0.363             0.372
 7      0.435        0.479       0.437             0.445
 8      0.345        0.377       0.349             0.357
 9      0.355        0.394       0.349             0.378
10      0.263        0.267       0.254             0.266
11      0.284        0.280       0.287             0.288
12      0.247        0.253       0.243             0.252
13      0.494        0.501       0.490             0.525
14      0.298        0.301       0.293             0.315
15      0.315        0.329       0.311             0.321
16      0.363        0.397       0.369             0.372