# Conditional Probability Models for Deep Image Compression

Deep Neural Networks trained as image auto-encoders have recently emerged as a promising direction for advancing the state of the art in image compression. The key challenge in learning such networks is twofold: to deal with quantization, and to control the trade-off between reconstruction error (distortion) and entropy (rate) of the latent image representation. In this paper, we focus on the latter challenge and propose a new technique to navigate the rate-distortion trade-off for an image compression auto-encoder. The main idea is to directly model the entropy of the latent representation by using a context model: a 3D-CNN which learns a conditional probability model of the latent distribution of the auto-encoder. During training, the auto-encoder makes use of the context model to estimate the entropy of its representation, and the context model is concurrently updated to learn the dependencies between the symbols in the latent representation. Our experiments show that this approach yields a state-of-the-art image compression system based on a simple convolutional auto-encoder.

## 1 Introduction

Image compression refers to the task of representing images using as little storage (i.e., bits) as possible. While in lossless image compression the compression rate is limited by the requirement that the original image should be perfectly reconstructible, in lossy image compression, a greater reduction in storage is enabled by allowing for some distortion in the reconstructed image. This results in a so-called rate-distortion trade-off, where a balance is found between the bitrate $R$ and the distortion $d$ by minimizing $d + \beta R$, where $\beta > 0$ balances the two competing objectives. Recently, deep neural networks (DNNs) trained as image auto-encoders for this task have led to promising results, achieving better performance than many traditional techniques for image compression [19, 20, 17, 4, 2, 9]. Another advantage of DNN-based learned compression systems is their adaptability to specific target domains such as aerial images or stereo images, enabling even higher compression rates on these domains. A key challenge in training such systems is to optimize the bitrate of the latent image representation in the auto-encoder. To encode the latent representation using a finite number of bits, it needs to be discretized into symbols (i.e., mapped to a stream of elements from some finite set of values). Since discretization is non-differentiable, this presents challenges for gradient-based optimization methods, and many techniques have been proposed to address them. After discretization, information theory tells us that the correct measure for the bitrate $R$ is the entropy $H$ of the resulting symbols. Thus the challenge, and the focus of this paper, is how to model $H$ such that we can navigate the trade-off $d + \beta H$ during optimization of the auto-encoder.

Our proposed method is based on leveraging context models, which were previously used as techniques to improve coding rates for already-trained models [4, 20, 9, 14], directly as an entropy term in the optimization. We concurrently train the auto-encoder and the context model with respect to each other, where the context model learns a convolutional probabilistic model of the image representation in the auto-encoder, while the auto-encoder uses it for entropy estimation to navigate the rate-distortion trade-off. Furthermore, we generalize our formulation to spatially-aware networks, which use an importance map to spatially attend the bitrate representation to the most important regions in the compressed representation. The proposed techniques lead to a simple image compression system, which achieves state-of-the-art performance when measured with the popular multi-scale structural similarity index (MS-SSIM) distortion metric [23], while being straightforward to implement with standard deep-learning toolboxes.

## 2 Related work

Full-resolution image compression using DNNs has attracted considerable attention recently. DNN architectures commonly used for image compression are auto-encoders [17, 4, 2, 9] and recurrent neural networks (RNNs) [19, 20]. The networks are typically trained to minimize the mean-squared error (MSE) between original and decompressed image [17, 4, 2, 9], or using perceptual metrics such as MS-SSIM [20, 14]. Other notable techniques involve progressive encoding/decoding strategies [19, 20], adversarial training [14], multi-scale image decompositions [14], and generalized divisive normalization (GDN) layers [4, 3].

Context models and entropy estimation—the focus of the present paper—have a long history in the context of engineered compression methods, both lossless and lossy [24, 12, 25, 13, 10]. Most of the recent DNN-based lossy image compression approaches have also employed such techniques in some form. [4] uses a binary context model for adaptive binary arithmetic coding [11]. The works of [20, 9, 14] use learned context models for improved coding performance on their trained models when using adaptive arithmetic coding. [17, 2] use non-adaptive arithmetic coding but estimate the entropy term with an independence assumption on the symbols.

Also related is the work of van den Oord et al. [22, 21], who proposed PixelRNN and PixelCNN, powerful RNN- and CNN-based context models for modeling the distribution of natural images in a lossless setting, which can be used for (learned) lossless image compression as well as image generation.

## 3 Proposed method

Given a set of training images $\mathcal{X}$, we wish to learn a compression system which consists of an encoder, a quantizer, and a decoder. The encoder $E$ maps an image $x$ to a latent representation $z = E(x)$. The quantizer $Q$ discretizes the coordinates of $z$ to $L$ centers $\mathcal{C} = \{c_1, \dots, c_L\}$, obtaining $\hat{z}$ with $\hat{z}_i := Q(z_i) \in \mathcal{C}$, which can be losslessly encoded into a bitstream. The decoder $D$ then forms the reconstructed image $\hat{x} = D(\hat{z})$ from the quantized latent representation $\hat{z}$, which is in turn (losslessly) decoded from the bitstream. We want the encoded representation $\hat{z}$ to be compact when measured in bits, while at the same time we want the distortion $d(x, \hat{x})$ to be small, where $d$ is some measure of reconstruction error, such as MSE or MS-SSIM. This results in the so-called rate-distortion trade-off

$$d(x, \hat{x}) + \beta H(\hat{z}), \tag{1}$$

where $H(\hat{z})$ denotes the cost of encoding $\hat{z}$ to bits, i.e., the entropy of $\hat{z}$. Our system is realized by modeling $E$ and $D$ as convolutional neural networks (CNNs) (more specifically, as the encoder and decoder, respectively, of a convolutional auto-encoder) and minimizing (1) over the training set $\mathcal{X}$, where a large/small $\beta$ draws the system towards low/high average entropy $H$. In the next sections, we will discuss how we quantize $z$ and estimate the entropy $H(\hat{z})$. We note that as $E$ and $D$ are CNNs, $\hat{z}$ will be a 3D feature map, but for simplicity of exposition we will denote it as a vector with equally many elements. Thus, $\hat{z}_i$ refers to the $i$-th element of the feature map, in raster scan order (row by column by channel).

### 3.1 Quantization

We adopt the scalar variant of the quantization approach proposed in [2] to quantize $z$, but simplify it using ideas from [17]. Specifically, given centers $\mathcal{C} = \{c_1, \dots, c_L\}$, we use nearest neighbor assignments to compute

$$\hat{z}_i = Q(z_i) := \arg\min_j \|z_i - c_j\|, \tag{2}$$

but rely on (differentiable) soft quantization

$$\tilde{z}_i = \sum_{j=1}^{L} \frac{\exp(-\sigma \|z_i - c_j\|)}{\sum_{l=1}^{L} \exp(-\sigma \|z_i - c_l\|)} \, c_j \tag{3}$$

to compute gradients during the backward pass. This combines the benefit of [2], where the quantization is restricted to a finite set of learned centers (instead of the fixed, non-learned integer grid as in [17]), with the simplicity of [17], where a differentiable approximation of quantization is only used in the backward pass. This avoids the need to choose an annealing strategy (i.e., a schedule for $\sigma$) as in [2] to drive the soft quantization (3) to hard assignments (2) during training. In TensorFlow, this is implemented as

$$\bar{z}_i = \texttt{tf.stop\_gradient}(\hat{z}_i - \tilde{z}_i) + \tilde{z}_i. \tag{4}$$

We note that for forward pass computations, $\bar{z}_i = \hat{z}_i$, and thus we will continue writing $\hat{z}_i$ for the latent representation.
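The hard and soft quantizers in (2) and (3) can be sketched in a few lines of plain Python for scalar latents. This is only an illustration: the centers, the value $\sigma = 1$, and the function names `hard_quantize` and `soft_quantize` are assumptions made for the example, not values or names from the paper.

```python
import math

def hard_quantize(z, centers):
    """Nearest-neighbor assignment to a center, as in (2)."""
    return min(centers, key=lambda c: abs(z - c))

def soft_quantize(z, centers, sigma=1.0):
    """Softmax-weighted average of the centers, as in (3)."""
    dists = [abs(z - c) for c in centers]
    # Shift by the smallest distance before exponentiating for numerical stability;
    # the shift cancels in the normalized weights.
    d_min = min(dists)
    weights = [math.exp(-sigma * (d - d_min)) for d in dists]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, centers)) / total

centers = [-1.0, 0.0, 1.0]                 # illustrative centers (assumption)
z = 0.3
z_hat = hard_quantize(z, centers)          # value used in the forward pass
z_tilde = soft_quantize(z, centers, 1.0)   # differentiable surrogate for the backward pass
```

In an automatic-differentiation framework, the two would typically be combined with a stop-gradient so that the forward value equals `z_hat` while gradients flow through `z_tilde`. As $\sigma$ grows, `soft_quantize` approaches `hard_quantize`.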

### 3.2 Entropy estimation

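To model the entropy $H(\hat{z})$ we build on the approach of PixelRNN [22] and factorize the distribution $p(\hat{z})$ as a product of conditional distributions

$$p(\hat{z}) = \prod_{i=1}^{m} p(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1), \tag{5}$$

where the 3D feature volume $\hat{z}$ is indexed in raster scan order. We then use a neural network $P(\hat{z})$, which we refer to as a context model, to estimate each term $p(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1)$:

$$P_{i,l}(\hat{z}) \approx p(\hat{z}_i = c_l \mid \hat{z}_{i-1}, \dots, \hat{z}_1), \tag{6}$$

where $P$ specifies for every 3D location $i$ in $\hat{z}$ the probabilities of each symbol in $\mathcal{C}$ with $l = 1, \dots, L$. We refer to the resulting approximate distribution as $q(\hat{z}) := \prod_{i=1}^{m} P_{i,I(\hat{z}_i)}(\hat{z})$, where $I(\hat{z}_i)$ denotes the index of $\hat{z}_i$ in $\mathcal{C}$.

Since the conditional distributions $p(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1)$ only depend on previous values $\hat{z}_{i-1}, \dots, \hat{z}_1$, this imposes a causality constraint on the network $P$: While $P$ may compute $P_{i,l}$ in parallel for $i = 1, \dots, m$ and $l = 1, \dots, L$, it needs to make sure that each such term only depends on previous values $\hat{z}_{i-1}, \dots, \hat{z}_1$.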

The authors of PixelCNN [22, 21] study the use of 2D-CNNs as causal conditional models over 2D images in a lossless setting, i.e., treating the RGB pixels as symbols. They show that the causality constraint can be efficiently enforced using masked filters in the convolution. Intuitively, the idea is as follows: If for each layer the causality condition is satisfied with respect to the spatial coordinates of the layer before, then by induction the causality condition will hold between the output layer and the input. Satisfying the causality condition for each layer can be achieved with proper masking of its weight tensor, and thus the entire network can be made causal only through the masking of its weights. Thus, the entire set of probabilities $P_{i,l}$ for all (2D) spatial locations $i$ and symbol values $l$ can be computed in parallel with a fully convolutional network, as opposed to modeling each term $p(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1)$ separately.

In our case, $\hat{z}$ is a 3D symbol volume, with as many as $K$ channels. We therefore generalize the approach of PixelCNN to 3D convolutions, using the same idea of masking the filters properly in every layer of the network. This enables us to model $P$ efficiently, with a lightweight 3D-CNN (we use a 4-layer network, compared to 15 layers in [22]) which slides over $\hat{z}$, while properly respecting the causality constraint. We refer to the supplementary material for more details.

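As in [21], we learn $P$ by training it for maximum likelihood, or equivalently (see [16]) by training $P_{i,:}$ to classify the index $I(\hat{z}_i)$ of $\hat{z}_i$ in $\mathcal{C}$ with a cross entropy loss:

$$CE := \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[\sum_{i=1}^{m} -\log P_{i,I(\hat{z}_i)}(\hat{z})\right]. \tag{7}$$

Using the well-known property of cross entropy as the coding cost when using the wrong distribution $q(\hat{z})$ instead of the true distribution $p(\hat{z})$, we can also view the $CE$ loss as an estimate of $H(\hat{z})$, since we learn $P$ such that $q \approx p$. That is, we can compute

$$\begin{aligned}
H(\hat{z}) &= \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[-\log p(\hat{z})\right] && \text{(8)} \\
&= \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[\sum_{i=1}^{m} -\log p(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1)\right] && \text{(9)} \\
&\approx \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[\sum_{i=1}^{m} -\log q(\hat{z}_i \mid \hat{z}_{i-1}, \dots, \hat{z}_1)\right] && \text{(10)} \\
&= \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[\sum_{i=1}^{m} -\log P_{i,I(\hat{z}_i)}(\hat{z})\right] && \text{(11)} \\
&= CE && \text{(12)}
\end{aligned}$$

Therefore, when training the auto-encoder we can indirectly minimize $H(\hat{z})$ through the cross entropy $CE$. We refer to the argument in the expectation of (7),

$$C(\hat{z}) := \sum_{i=1}^{m} -\log P_{i,I(\hat{z}_i)}(\hat{z}), \tag{13}$$

as the coding cost of the latent image representation, since this reflects the coding cost incurred when using $P$ as a context model with an adaptive arithmetic encoder [11]. From the application perspective, minimizing the coding cost is actually more important than the (unknown) true entropy, since it reflects the bitrate obtained in practice.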
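To make the coding cost (13) concrete, the following sketch sums the negative log-probabilities that a context model assigns to the symbols that actually occurred. The probability table is made up for illustration, the name `coding_cost_bits` is hypothetical, and base-2 logarithms are assumed so that the result is in bits.

```python
import math

def coding_cost_bits(probs, symbols):
    """Sum over positions of -log2 of the probability assigned to the observed symbol."""
    return sum(-math.log2(p[s]) for p, s in zip(probs, symbols))

# Hypothetical context-model output: one distribution over L = 4 symbols per position.
probs = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
    [0.125, 0.125, 0.25, 0.5],
]
symbols = [0, 2, 3]                       # indices of the symbols that occurred
cost = coding_cost_bits(probs, symbols)   # 1 + 2 + 1 = 4 bits
```

An ideal arithmetic coder driven by these probabilities would spend close to `cost` bits on the three symbols, which is why a sharper context model translates directly into a lower bitrate.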

To backpropagate through $P(\hat{z})$ we use the same approach as for the encoder (see (4)). Thus, like the decoder $D$, $P$ only sees the (discrete) $\hat{z}$ in the forward pass, whereas the gradient of the soft quantization $\tilde{z}$ is used for the backward pass.

### 3.3 Concurrent optimization

Given an auto-encoder $(E, D)$, we can train $P$ to model the dependencies of the entries of $\hat{z}$ as described in the previous section by minimizing (7). On the other hand, using the model $P$, we can obtain an estimate of $H(\hat{z})$ as in (12) and use this estimate to adjust $(E, D)$ such that $d(x, \hat{x}) + \beta H(\hat{z})$ is reduced, thereby navigating the rate-distortion trade-off. Therefore, it is natural to concurrently learn $P$ (with respect to its own loss) and $(E, D)$ (with respect to the rate-distortion trade-off) during training, such that all models which the losses depend on are continuously updated.

### 3.4 Importance map for spatial bit-allocation

Recall that since $E$ and $D$ are CNNs, $\hat{z}$ is a 3D feature map. For example, if $E$ has three stride-2 convolution layers and the bottleneck has $K$ channels, the dimensions of $\hat{z}$ will be $\frac{W}{8} \times \frac{H}{8} \times K$ for an input image of size $W \times H$. A consequence of this formulation is that we are using equally many symbols in $\hat{z}$ for each spatial location of the input image $x$. It is known, however, that in practice there is great variability in the information content across spatial locations (e.g., the uniform area of blue sky vs. the fine-grained structure of the leaves of a tree).

This can in principle be accounted for automatically in the trade-off between the entropy and the distortion, where the network would learn to output more predictable (i.e., low entropy) symbols for the low information regions, while making room for the use of high entropy symbols for the more complex regions. More precisely, the formulation in (7) already allows for variable bit allocation for different spatial regions through the context model $P$.

However, this arguably requires a quite sophisticated (and hence computationally expensive) context model, and we find it beneficial to follow Li et al. [9] instead by using an importance map to help the CNN attend to different regions of the image with different amounts of bits. While [9] uses a separate network for this purpose, we consider a simplified setting. We take the last layer of the encoder $E$, and add a second single-channel output $y$. We expand this single channel $y$ into a mask $m$ of the same dimensionality as $z$ as follows:

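$$m_{i,j,k} = \begin{cases} 1 & \text{if } k + 1 < y_{i,j}, \\ y_{i,j} - k & \text{if } k \le y_{i,j} \le k + 1, \\ 0 & \text{if } k > y_{i,j}, \end{cases} \tag{14}$$

where $y_{i,j}$ denotes the value of $y$ at spatial location $(i, j)$. The transition value for $k \le y_{i,j} \le k + 1$ is such that the mask smoothly transitions from 0 to 1 for non-integer values of $y$.

We then mask $z$ by pointwise multiplication with the binarization of $m$, i.e., $z \leftarrow z \odot \lceil m \rceil$. Since the ceiling operator $\lceil \cdot \rceil$ is not differentiable, as done by [17, 9], we use identity for the backward pass.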
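The mask expansion and binarization can be illustrated for a single spatial location. The sketch below assumes channels are indexed from $k = 0$ and writes the mask as a clipped difference, which is one way to realize the smooth 0-to-1 transition described above; the helper names and the example values are hypothetical.

```python
import math

def expand_importance(y, num_channels):
    """Per-channel mask for one location: clip(y - k, 0, 1) for k = 0, ..., K-1."""
    return [min(1.0, max(0.0, y - k)) for k in range(num_channels)]

def binarize(mask):
    """Ceiling of each entry: 1 where a channel is kept, 0 where it is zeroed out."""
    return [math.ceil(v) for v in mask]

m = expand_importance(2.5, 5)     # [1.0, 1.0, 0.5, 0.0, 0.0]
keep = binarize(m)                # [1, 1, 1, 0, 0]
z_column = [0.7, -1.2, 0.4, 0.9, -0.3]               # illustrative latent column
z_masked = [z * b for z, b in zip(z_column, keep)]   # last two channels become zero
```

With $y = 2.5$, the first three channels survive and the remaining ones are zeroed, so a larger importance value lets the encoder spend more symbols at that location.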

With this modification, we have simply changed the architecture of $E$ slightly such that it can easily “zero out” portions of columns $z_{i,j,:}$ of $z$ (the rest of the network stays the same, so that (2) still holds for example). As suggested by [9], the so-obtained structure in $\hat{z}$ presents an alternative coding strategy: Instead of losslessly encoding the entire symbol volume $\hat{z}$, we could first (separately) encode the mask $\lceil m \rceil$, and then for each column $\hat{z}_{i,j,:}$ only encode the first $\lceil y_{i,j} \rceil$ symbols, since the remaining ones are the constant $Q(0)$, which we refer to as the zero symbol.

Work [9] uses binary symbols (i.e., $\mathcal{C} = \{0, 1\}$) and assumes independence between the symbols and a uniform prior during training, i.e., costing each 1 bit to encode. The importance map is thus their principal tool for controlling the bitrate, since they thereby avoid encoding all the bits in the representation. In contrast, we stick to the formulation in (5) where the dependencies between the symbols are modeled during training. We then use the importance map as an architectural constraint and use their suggested coding strategy to obtain an alternative estimate for the entropy $H(\hat{z})$, as follows.

We observe that we can recover $\lceil m \rceil$ from $\hat{z}$ by counting the number of consecutive zero symbols at the end of each column $\hat{z}_{i,j,:}$. (If $z$ contained zeros before it was masked, we might overestimate the number of zero entries in $\lceil m \rceil$. However, we can redefine those entries of $m$ as 0 and this will give the same result after masking.) $\lceil m \rceil$ is therefore a function of the masked $\hat{z}$, i.e., $\lceil m \rceil = g(\hat{z})$ for $g$ recovering $\lceil m \rceil$ as described, which means that we have $H(\lceil m \rceil \mid \hat{z}) = 0$ for the conditional entropy. Now, we have

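$$\begin{aligned}
H(\hat{z}) &= H(\lceil m \rceil \mid \hat{z}) + H(\hat{z}) && \text{(15)} \\
&= H(\hat{z}, \lceil m \rceil) && \text{(16)} \\
&= H(\hat{z} \mid \lceil m \rceil) + H(\lceil m \rceil). && \text{(17)}
\end{aligned}$$

If we treat the entropy of the mask, $H(\lceil m \rceil)$, as constant during optimization of the auto-encoder, we can then indirectly minimize $H(\hat{z})$ through $H(\hat{z} \mid \lceil m \rceil)$.

To estimate $H(\hat{z} \mid \lceil m \rceil)$, we use the same factorization of $p$ as in (5), but since the mask $\lceil m \rceil$ is known, the symbol is deterministic, $p(\hat{z}_i = Q(0)) = 1$, for the 3D locations $i$ in $\hat{z}$ where the mask is zero. The logarithms of the corresponding terms in (9) then evaluate to 0. The remaining terms can be modeled with the same context model $P_{i,l}(\hat{z})$, which results in

$$H(\hat{z} \mid \lceil m \rceil) \approx \mathbb{E}_{\hat{z} \sim p(\hat{z})}\left[\sum_{i=1}^{m} -\lceil m_i \rceil \log P_{i,I(\hat{z}_i)}(\hat{z})\right], \tag{18}$$

where $\lceil m_i \rceil$ denotes the $i$-th element of $\lceil m \rceil$ (in the same raster scan order as $\hat{z}$).

Similar to the coding cost (13), we refer to the argument in the expectation in (18),

$$MC(\hat{z}) := \sum_{i=1}^{m} -\lceil m_i \rceil \log P_{i,I(\hat{z}_i)}(\hat{z}) \tag{19}$$

as the masked coding cost of $\hat{z}$.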
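The masked coding cost (19) differs from the plain coding cost (13) only in that positions with a zero mask contribute nothing. A small sketch, again with made-up probabilities, a hypothetical function name, and base-2 logarithms assumed:

```python
import math

def masked_coding_cost_bits(probs, symbols, keep):
    """Coding cost where positions with keep == 0 are skipped, as in (19)."""
    return sum(-b * math.log2(p[s]) for p, s, b in zip(probs, symbols, keep))

probs = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
    [0.125, 0.125, 0.25, 0.5],
]
symbols = [0, 2, 3]
keep = [1, 1, 0]            # binarized mask: the last position is masked out
masked = masked_coding_cost_bits(probs, symbols, keep)      # 1 + 2 = 3 bits
full = masked_coding_cost_bits(probs, symbols, [1, 1, 1])   # 1 + 2 + 1 = 4 bits
```

Because the mask multiplies each term, lowering the importance value at a location removes whole terms from the sum, which gives the encoder a direct handle on the rate.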

While the entropy estimate (18) is almost estimating the same quantity as (7) (only differing by $H(\lceil m \rceil)$), it has the benefit of being weighted by $\lceil m_i \rceil$. Therefore, the encoder $E$ has an obvious path to control the entropy of $\hat{z}$, by simply increasing/decreasing the value of $y$ for some spatial location of $x$ and thus obtaining fewer/more zero entries in $m$.

When the context model $P$ is trained, however, we still train it with respect to the formulation in (8), so it does not have direct access to the mask $m$ and needs to learn the dependencies on the entire masked symbol volume $\hat{z}$. This means that when encoding an image, we can stick to standard adaptive arithmetic coding over the entire bottleneck, without needing to resort to a two-step coding process as in [9], where the mask is first encoded and then the remaining symbols. We emphasize that this approach hinges critically on the context model $P$ and the encoder $E$ being trained concurrently, as this allows the encoder to learn a meaningful (in terms of coding cost) mask with respect to $P$ (see the next section).

In our experiments we observe that during training, the two entropy losses (7) and (18) converge to almost the same value, with the latter being slightly smaller due to $H(\lceil m \rceil)$ being ignored.

While the importance map is not crucial for optimal rate-distortion performance if the channel depth $K$ is adjusted carefully, we found that we could more easily control the entropy of $\hat{z}$ through $\beta$ when using a fixed $K$, since the network can easily learn to ignore some of the channels via the importance map. Furthermore, in the supplementary material we show that by using multiple importance maps for a single network, one can obtain a single model that supports multiple compression rates.

### 3.5 Putting the pieces together

We made an effort to carefully describe our formulation and its motivation in detail. While the description is lengthy, when putting the resulting pieces together we get a quite straightforward pipeline for learned image compression, as follows.

Given the set of training images $\mathcal{X}$, we initialize (fully convolutional) CNNs $E$, $D$, and $P$, as well as the centers $\mathcal{C}$ of the quantizer $Q$. Then, we train over minibatches $\mathcal{X}_B = \{x^{(1)}, \dots, x^{(B)}\}$ of crops from $\mathcal{X}$. At each iteration, we take one gradient step for the auto-encoder $(E, D)$ and the quantizer $Q$, with respect to the rate-distortion trade-off

$$\mathcal{L}_{E,D,Q} = \frac{1}{B} \sum_{j=1}^{B} d(x^{(j)}, \hat{x}^{(j)}) + \beta \, MC(\hat{z}^{(j)}), \tag{20}$$

which is obtained by combining (1) with the estimate (18) & (19) and taking the batch sample average. Furthermore, we take a gradient step for the context model $P$ with respect to its objective (see (7) & (13))

$$\mathcal{L}_{P} := \frac{1}{B} \sum_{j=1}^{B} d(x^{(j)}, \hat{x}^{(j)}) + \beta \, C(\hat{z}^{(j)}). \tag{21}$$

To compute these two batch losses, we need to perform the following computation for each $x \in \mathcal{X}_B$:

1. Obtain compressed (latent) representation $z$ and importance map $y$ from the encoder: $(z, y) = E(x)$

2. Expand importance map $y$ to mask $m$ via (14)

3. Mask $z$, i.e., $z \leftarrow z \odot \lceil m \rceil$

4. Quantize $\hat{z} = Q(z)$

5. Compute the context $P(\hat{z})$

6. Decode $\hat{x} = D(\hat{z})$,

which can be computed in parallel over the minibatch on a GPU since all the models are fully convolutional.
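The per-image computation listed above can be sketched for a single latent column. The encoder, decoder, and context model below are toy stand-ins chosen only so that the example runs; they are assumptions for illustration and bear no relation to the actual networks. Channels are assumed to be indexed from 0.

```python
import math

def forward_column(x, encoder, decoder, context_model, centers):
    """Encode, mask, quantize, model, and decode one column of the latent volume."""
    z, y = encoder(x)                                              # latent column and importance value
    m = [min(1.0, max(0.0, y - k)) for k in range(len(z))]         # expand y to a mask, cf. (14)
    z = [zi * math.ceil(mi) for zi, mi in zip(z, m)]               # zero out masked channels
    z_hat = [min(centers, key=lambda c: abs(zi - c)) for zi in z]  # nearest-center quantization, cf. (2)
    probs = context_model(z_hat)                                   # one distribution per symbol
    x_hat = decoder(z_hat)                                         # reconstruction
    return z_hat, m, probs, x_hat

# Toy stand-ins (assumptions for illustration only).
centers = [-1.0, 0.0, 1.0]
toy_encoder = lambda x: ([0.8 * v for v in x], 1.5)
toy_decoder = lambda z_hat: list(z_hat)
toy_context = lambda z_hat: [[1.0 / len(centers)] * len(centers) for _ in z_hat]

z_hat, m, probs, x_hat = forward_column(
    [1.0, -1.0, 1.0], toy_encoder, toy_decoder, toy_context, centers
)
```

From `probs`, `z_hat`, and the binarized mask one can then evaluate the coding cost and the masked coding cost that enter the two batch losses.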

### 3.6 Relationship to previous methods

We are not the first to use context models for adaptive arithmetic coding to improve the performance in learned deep image compression. Work [20] uses a PixelRNN-like architecture [22] to train a recurrent network as a context model for an RNN-based compression auto-encoder. Li et al. [9] extract cuboid patches around each symbol in a binary feature map, and feed them to a convolutional context model. Both these methods, however, only learn the context model after training their system, as a post-processing step to boost coding performance.

In contrast, our method directly incorporates the context model as the entropy term for the rate-distortion term (1) of the auto-encoder, and trains the two concurrently. This is done at little overhead during training, since we adopt a 3D-CNN for the context model, using PixelCNN-inspired [21] masking of the weights of each layer to ensure causality in the context model. Adopting the same approach to the context models deployed by [20] or [9] would be non-trivial since they are not designed for fast feed-forward computation. In particular, while the context model of [9] is also convolutional, its causality is enforced through masking the inputs to the network, as opposed to our masking of the weights of the networks. This means their context model needs to be run separately with a proper input cuboid for each symbol in the volume (i.e., not fully convolutionally).

## 4 Experiments

#### Architecture

Our auto-encoder has an architecture similar to that of [17] but with more layers, and is described in Fig. 2. We adapt the number of channels $K$ in the latent representation for different models. For the context model $P$, we use a simple 4-layer 3D-CNN as described in Fig. 3.

#### Distortion measure

Following [7, 14], we use the multi-scale structural similarity index (MS-SSIM) [23] as the measure of distortion for our models. MS-SSIM reportedly correlates better with human perception of distortion than mean squared error (MSE). We train and test all our models using MS-SSIM.

#### Training

We use the Adam optimizer [8] with a minibatch size of 30 to train seven models. Each model is trained to maximize MS-SSIM directly. As a baseline, we used a learning rate (LR) of for each model, but found it beneficial to vary it slightly for different models. We set in the smooth approximation (3) used for gradient backpropagation through $Q$. To make the model more predictably land at a certain bitrate when optimizing (1), we found it helpful to clip the rate term (i.e., replace the entropy term with ), such that the entropy term is “switched off” when it is below . We found this did not hurt performance. We decay the learning rate by a factor of 10 every two epochs. To obtain models for different bitrates, we adapt the target bitrate and the number of channels $K$, while using a moderately large $\beta$. We use a small regularization on the weights and note that we achieve very stable training. We trained our models for 6 epochs, which took around 24h per model on a single GPU. For $P$, we use a LR of and the same decay schedule.

#### Datasets

We train on the ImageNet dataset from the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [15]. As a preprocessing step, we take random crops, and randomly flip them. We set aside 100 images from ImageNet as a testing set, ImageNetTest. Furthermore, we test our method on the widely used Kodak [1] dataset. To assess performance on high-quality full-resolution images, we also test on the datasets B100 [18] and Urban100 [5], commonly used in super-resolution.

#### Other codecs

We compare to JPEG, using libjpeg, and JPEG2000, using the Kakadu implementation. We also compare to the lesser known BPG, which is based on HEVC, the state-of-the-art in video compression, and which outperforms JPEG and JPEG2000. We use BPG in the non-default 4:4:4 chroma format, following [14].

#### Comparison

Like [14], we proceed as follows to compare to other methods. For each dataset, we compress each image using all our models. This yields a set of (bpp, MS-SSIM) points for each image, which we interpolate to get a curve for each image. We fix a grid of bpp values, and average the curves for each image at each bpp grid value (ignoring those images whose bpp range does not include the grid value, i.e., we do not extrapolate). We do this for our method, BPG, JPEG, and JPEG2000. Due to code being unavailable for the related works in general, we digitize the Kodak curve from Rippel & Bourdev [14], who have carefully collected the curves from the respective works. With this, we also show the results of Rippel & Bourdev [14], Johnston et al. [7], Ballé et al. [4], and Theis et al. [17]. To validate that our estimated MS-SSIM is correctly implemented, we independently generated the BPG curves for Kodak and verified that they matched the one from [14].
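The averaging procedure can be sketched as follows. Linear interpolation is an assumption here (the interpolation scheme is not specified above), the function names are hypothetical, and the per-image points are invented for illustration.

```python
def interpolate(points, bpp):
    """Linearly interpolate (bpp, score) points; return None outside the covered range."""
    points = sorted(points)
    if bpp < points[0][0] or bpp > points[-1][0]:
        return None                      # no extrapolation
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= bpp <= x1:
            if x1 == x0:
                return y0
            t = (bpp - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[0][1]                  # single-point curve hit exactly

def average_curve(per_image_points, grid):
    """Average the per-image curves at each grid value, skipping images that do not cover it."""
    averaged = {}
    for bpp in grid:
        values = [v for v in (interpolate(p, bpp) for p in per_image_points) if v is not None]
        if values:
            averaged[bpp] = sum(values) / len(values)
    return averaged

# Invented (bpp, MS-SSIM) points for two images.
image_a = [(0.2, 0.90), (0.6, 0.98)]
image_b = [(0.4, 0.92), (0.8, 0.96)]
curve = average_curve([image_a, image_b], grid=[0.3, 0.5, 0.7])
```

At 0.3 bpp only the first image contributes and at 0.7 bpp only the second, while at 0.5 bpp both are averaged, which mirrors the rule of ignoring images whose range does not include the grid value.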

#### Results

Fig. 1 shows a comparison of the aforementioned methods for Kodak. Our method outperforms BPG, JPEG, and JPEG2000, as well as the neural network based approaches of Johnston et al. [7], Ballé et al. [4], and Theis et al. [17]. Furthermore, we achieve performance comparable to that of Rippel & Bourdev [14]. This holds for all bpps we tested, from 0.3 bpp to 0.9 bpp. We note that while Rippel & Bourdev and Johnston et al. also train to maximize (MS-)SSIM, the other methods minimize MSE.

In each of the other testing sets, we also outperform BPG, JPEG, and JPEG2000 over the reported bitrates, as shown in Fig. 4.

In Fig. 5, we compare our approach to BPG, JPEG, and JPEG2000 visually, using very strong compression on kodim21 from Kodak. It can be seen that the output of our network is pleasant to look at. Soft structures like the clouds are very well preserved. BPG appears to handle high frequencies better (see, e.g., the fence) but loses structure in the clouds and in the sea. Like JPEG2000, it produces block artifacts. JPEG breaks down at this rate. We refer to the supplementary material for further visual examples.

#### Ablation study: Context model

In order to show the effectiveness of the context model, we performed the following ablation study. We trained the auto-encoder without entropy loss, i.e., $\beta = 0$ in (20), using centers and channels. On Kodak, this model yields an average MS-SSIM of 0.982, at an average rate of 0.646 bpp (calculated assuming that we need $\log_2 L$ bits per symbol). We then trained three different context models for this auto-encoder, while keeping the auto-encoder fixed: a zeroth order context model, which uses a histogram to estimate the probability of each of the symbols; a first order (one-step prediction) context model, which uses a conditional histogram to estimate the probability of each of the symbols given the previous symbol (scanning in raster order); and $P$, i.e., our proposed context model. The resulting average rates are shown in Table 1. Our context model reduces the rate by 10 %, even though the auto-encoder was optimized using a uniform prior (see supplementary material for a detailed comparison of Table 1 and Fig. 1).
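The two histogram baselines can be illustrated with empirical entropies on a symbol stream. This sketch estimates and evaluates the histograms on the same toy stream, ignores the cost of the very first symbol in the first order case, and uses hypothetical function names; it is not the evaluation code behind Table 1.

```python
import math
from collections import Counter

def zeroth_order_bits_per_symbol(symbols):
    """Empirical entropy of the symbol histogram."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def first_order_bits_per_symbol(symbols):
    """Empirical conditional entropy of a symbol given its predecessor."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    prev_counts = Counter(symbols[:-1])
    n = len(symbols) - 1
    return -sum(
        c / n * math.log2(c / prev_counts[prev])
        for (prev, _), c in pair_counts.items()
    )

stream = [0, 1, 0, 1, 0, 1, 0, 1]           # perfectly predictable from the previous symbol
h0 = zeroth_order_bits_per_symbol(stream)   # 1 bit per symbol
h1 = first_order_bits_per_symbol(stream)    # 0 bits per symbol
```

The alternating stream costs one bit per symbol under the histogram model but nothing once the previous symbol is known, which shows in miniature why richer context lowers the rate.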

#### Importance map

As described in detail in Section 3.4, we use an importance map to dynamically alter the number of channels used at different spatial locations to encode an image. To visualize how this helps, we trained two auto-encoders: the first uses an importance map and at most channels to compress an image, and the second compresses without an importance map and with channels (this yields similar rates for the two models). In Fig. 6, we show an image from ImageNetTest along with the same image compressed to 0.463 bpp by the first model and compressed to 0.504 bpp by the second. Furthermore, Fig. 6 shows the importance map produced by the first model, as well as ordered visualizations of all channels of the latent representation for both models. Note how for the model with the importance map, channels with larger index are sparser, showing how the model can spatially adapt the number of channels. The model without the importance map uses all channels similarly.

## 5 Discussion

Our experiments showed that combining a convolutional auto-encoder with a lightweight 3D-CNN as context model and training the two networks concurrently leads to a highly effective image compression system. Not only were we able to clearly outperform state-of-the-art engineered compression methods including BPG and JPEG2000 in terms of MS-SSIM, but we also obtained performance competitive with the current state-of-the-art learned compression method from [14]. In particular, our method outperforms BPG and JPEG2000 in MS-SSIM across four different testing sets (ImageNetTest, Kodak, B100, Urban100), and does so significantly, i.e., the proposed method generalizes well. We emphasize that our method relies on elementary techniques both in terms of the architecture (standard convolutional auto-encoder with importance map, convolutional context model) and training procedure (minimize the rate-distortion trade-off and the negative log-likelihood for the context model), while [14] uses highly specialized techniques such as a pyramidal decomposition architecture, adaptive codelength regularization, and multiscale adversarial training.

The ablation study for the context model showed that our 3D-CNN-based context model is significantly more powerful than the zeroth order (histogram) and first order (one-step prediction) baseline context models. Further, our experiments suggest that the importance map learns to condense the image information into a reduced number of channels of the latent representation without relying on explicit supervision. Notably, the importance map is learned as a part of the image compression auto-encoder concurrently with the auto-encoder and the context model, without introducing any optimization difficulties. In contrast, in [9] the importance map is computed using a separate network, learned together with the auto-encoder, while the context model is learned separately.

## 6 Conclusions

In this paper, we proposed the first method for learning a lossy image compression auto-encoder concurrently with a lightweight context model by incorporating it into an entropy loss for the optimization of the auto-encoder, leading to performance competitive with the current state-of-the-art in deep image compression [14].

Future work could explore heavier and more powerful context models, such as those employed in [22, 21]. This could further improve compression performance and allow for sampling of natural images in a “lossy” manner, by sampling according to the context model and then decoding.

#### Acknowledgements

This work was supported by ETH Zürich and by NVIDIA through a GPU grant.

## Conditional Probability Models for Deep Image Compression – Suppl. Material

#### 3D probability classifier

As mentioned in Section 3.2, we rely on masked 3D convolutions to enforce the causality constraint in our probability classifier . In a 2D-CNN, standard 2D convolutions are used in filter banks, as shown in Fig. 7 on the left: A -dimensional tensor is mapped to a -dimensional tensor using banks of 2D filters, i.e., filters can be represented as -dimensional tensors. Note that all channels are used together, which violates causality: When we encode, we proceed channel by channel.

Using 3D convolutions, a depth dimension is introduced. In a 3D-CNN, -dimensional tensors are mapped to -dimensional tensors, with -dimensional filters. Thus, a 3D-CNN slides over the depth dimension, as shown in Fig. 7 on the right. We use such a 3D-CNN for , where we use as input our -dimensional feature map , using for the first layer.

To explain how we mask the filters in , consider the 2D case in Fig. 8. We want to encode all values by iterating in raster scan order and by computing . We simplify this: instead of relying on all previously encoded symbols, we use some -context around (black square in Fig. 8). To satisfy the causality constraint, this context may only contain values above or in the same row to the left of (gray cells). By using the filter shown in Fig. 8 in the top right for the first layer of a CNN and the filter shown in Fig. 8 in the bottom right for subsequent filters, we can build a 2D-CNN with a receptive field that forms such a context. We build our 3D-CNN by generalizing this idea to 3D, where we construct the mask for the filter of the first layer as shown in the pseudo-code in Algorithm 1. The mask for the subsequent layers is constructed analogously by replacing “” in line 7 with “”. We use filter size .
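The mask construction described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `causal_mask_3d`, the filter size of 3, and the enumeration order (width fastest, then height, then depth) are assumptions. It also assumes that the first-layer mask excludes the central position while the masks of subsequent layers include it, in line with the two 2D filters described for Fig. 8.

```python
import numpy as np

def causal_mask_3d(f_d, f_h, f_w, include_center):
    """Return a (f_d, f_h, f_w) mask of zeros and ones for a 3D filter.

    Filter positions are enumerated with the width index varying fastest,
    then height, then depth (an assumed ordering). Positions that come
    before the central one get a 1; all later positions get a 0.
    Odd filter sizes are assumed so that a central position exists.
    """
    n = f_d * f_h * f_w
    center = n // 2                      # 0-based flat index of the central position
    idx = np.arange(n)
    # First layer: strictly before the center. Subsequent layers: center included.
    flat = idx <= center if include_center else idx < center
    # C-order reshape makes the last axis (width) vary fastest.
    return flat.reshape(f_d, f_h, f_w).astype(np.float32)

# Hypothetical 3x3x3 filters, chosen only for illustration.
first_layer_mask = causal_mask_3d(3, 3, 3, include_center=False)
later_layer_mask = causal_mask_3d(3, 3, 3, include_center=True)
```

In a masked convolution, such a mask would be multiplied element-wise with the filter weights before each forward pass, so that masked-out positions never contribute to the output.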

With this approach, we obtain a 3D-CNN which operates on -dimensional blocks. We can use to encode by iterating over in such blocks, exhausting first axis , then axis , and finally axis (like in Algorithm 1). For each such block, the probability classifier yields the probability distribution of the central symbol given the symbols in the block. Due to the construction of the masks, this probability distribution only depends on previously encoded symbols.
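The causality property can be checked on a toy example. The sketch below is illustrative only: it uses a single masked block rather than the multi-layer receptive field of the actual network, and the helper name `iterate_contexts`, the block size of 3, the zero padding, and the axis order (depth outermost, width innermost) are all assumptions.

```python
import numpy as np

def iterate_contexts(z, size=3):
    """Yield ((d, h, w), masked_block) in encoding order for a (D, H, W) volume.

    The volume is zero-padded so that every symbol has a full block around it.
    The mask keeps only block positions that precede the central one.
    """
    r = size // 2
    padded = np.pad(z, r, mode="constant", constant_values=0)
    n = size ** 3
    mask = (np.arange(n) < n // 2).reshape(size, size, size)
    D, H, W = z.shape
    for d in range(D):            # slowest axis
        for h in range(H):
            for w in range(W):    # fastest axis, exhausted first
                block = padded[d:d + size, h:h + size, w:w + size]
                yield (d, h, w), block * mask

# Toy volume whose values increase in encoding order, so "previously
# encoded" is equivalent to "has a smaller value".
z = np.arange(1, 25).reshape(2, 3, 4)
for (d, h, w), context in iterate_contexts(z):
    visible = context[context > 0]
    assert (visible < z[d, h, w]).all()
```

Because the toy values grow in encoding order, the assertion confirms that every symbol visible through the mask was encoded before the central symbol.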

#### Multiple compression rates

It is quite straightforward to obtain multiple operating points in a single network with our framework: We can simply share the network but use multiple importance maps. We did a simple experiment where we trained an auto-encoder with 5 different importance maps. In each iteration, a random importance map was picked, and the target entropy was set to . While not tuned for performance, this already yielded a model competitive with BPG. The following shows the output of the model for (from left to right):

#### On the benefit of 3D-CNN and joint training

We note that the points from Table 1 (where we trained different entropy models non-jointly as a post-training step) are not directly comparable with the curve in Fig. 1. This is because these points are obtained by taking the mean of the MS-SSIM and bpp values over the Kodak images for a single model. In contrast, the curve in Fig. 1 is obtained by following the approach of [14], constructing an MS-SSIM vs. bpp curve per image via interpolation (see Comparison in Section 4). In Fig. 9, we show the black curve from Fig. 1, as well as the mean (MS-SSIM, bpp) points achieved by the underlying models (). We also show the points from Tab. 1 (). We can see that our masked 3D-CNN with joint training gives a significant improvement over the separately trained 3D-CNN, i.e., a 22% reduction in bpp when comparing mean points (the red point is estimated).
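The difference between the two evaluation protocols can be made concrete with a small sketch. The function names and the toy numbers below are hypothetical, and linear interpolation via `np.interp` is an assumption, since the text only states that interpolation is used.

```python
import numpy as np

def mean_point(bpp, msssim):
    """Mean (bpp, MS-SSIM) over all images for a single model."""
    return float(np.mean(bpp)), float(np.mean(msssim))

def averaged_curve(bpp_per_image, msssim_per_image, bpp_grid):
    """Interpolate each image's MS-SSIM vs. bpp curve onto a common
    bpp grid, then average the interpolated curves over images.

    Each row holds one image's operating points across models,
    sorted by increasing bpp (required by np.interp).
    """
    curves = [np.interp(bpp_grid, b, m)
              for b, m in zip(bpp_per_image, msssim_per_image)]
    return np.mean(curves, axis=0)

# Made-up values for two images (rows) and two models (columns).
bpp = np.array([[0.2, 0.4], [0.4, 0.8]])
msssim = np.array([[0.90, 0.94], [0.92, 0.96]])

point_model_0 = mean_point(bpp[:, 0], msssim[:, 0])
curve_at_grid = averaged_curve(bpp, msssim, np.array([0.4]))
```

In this toy setting, the mean point of the first model lies at a bitrate that neither image actually attains, whereas the averaged curve compares both images at the same bitrate, which is why the two kinds of results cannot be compared directly.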

#### Non-realistic images

In Fig. 10, we compare our approach to BPG on an image from the Manga109 dataset. We can see that our approach preserves text well enough to still be legible, but it is not as crisp as BPG (left zoom). On the other hand, our approach manages to preserve the fine texture on the face better than BPG (right zoom).

#### Visual examples

The following pages show the first four images of each of our validation sets compressed to low bitrate, together with outputs from BPG, JPEG2000 and JPEG compressed to similar bitrates. We ignored all header information for all considered methods when computing the bitrate (here and throughout the paper). We note that the only header our approach requires is the size of the image and an identifier, e.g., , specifying the model.

Overall, our images look pleasant to the eye. We see cases of over-blurring in our outputs, where BPG manages to keep high frequencies due to its more local approach. An example is the fences in front of the windows in Fig. 14, top, or the text in Fig. 15, top. On the other hand, BPG tends to discard low-contrast high frequencies where our approach keeps them in the output, like in the door in Fig. 11, top, or in the hair in Fig. 12, bottom. This may be explained by BPG being optimized for MSE as opposed to our approach being optimized for MS-SSIM.

JPEG looks extremely blocky for most images due to the very low bitrate.