Variable Rate Deep Image Compression With a Conditional Autoencoder

09/11/2019 · Yoojin Choi, et al. · Samsung

In this paper, we propose a novel variable-rate learned image compression framework with a conditional autoencoder. Previous learning-based image compression methods mostly require training separate networks for different compression rates so they can yield compressed images of varying quality. In contrast, we train and deploy only one variable-rate image compression network implemented with a conditional autoencoder. We provide two rate control parameters, i.e., the Lagrange multiplier and the quantization bin size, which are given as conditioning variables to the network. Coarse rate adaptation to a target is performed by changing the Lagrange multiplier, while the rate can be further fine-tuned by adjusting the bin size used in quantizing the encoded representation. Our experimental results show that the proposed scheme provides a better rate-distortion trade-off than the traditional variable-rate image compression codecs such as JPEG2000 and BPG. Our model also shows comparable and sometimes better performance than the state-of-the-art learned image compression models that deploy multiple networks trained for varying rates.


1 Introduction

Image compression is an application of data compression for digital images to lower their storage and/or transmission requirements. Transform coding [8] has been successful in yielding practical and efficient image compression algorithms such as JPEG [27] and JPEG2000 [18]. The transformation converts an input to a latent representation in the transform domain, where lossy compression (typically a combination of quantization and lossless source coding) is simpler and more efficient. For example, JPEG utilizes the discrete cosine transform (DCT) to convert an image into a sparse frequency-domain representation. JPEG2000 replaces the DCT with a discrete wavelet transform.

Deep learning is now leading many performance breakthroughs in various computer vision tasks [13]. Along with this revolutionary progress of deep learning, learned image compression has also attracted significant interest [3, 23, 24, 19, 1, 15, 4, 9, 16, 14]. In particular, non-linear transform coding designed with deep neural networks has advanced to outperform the classical image compression codecs carefully designed and optimized by domain experts, e.g., BPG [5], which is a still-image version of the high efficiency video coding (HEVC) standard [22]. We note that only very recently have a few learning-based image compression schemes reached the performance of the state-of-the-art BPG codec on peak signal-to-noise ratio (PSNR), a metric based on mean squared error (MSE) [16, 14].

Figure 1: Our variable-rate image compression model. We provide two knobs to vary the rate. First, we employ a conditional autoencoder, conditioned on the Lagrange multiplier λ that adapts the rate, and optimize the rate-distortion Lagrangian for various λ values in one conditional model. Second, we train the model for mixed values of the quantization bin size Δ so we can vary the rate by changing Δ.
                      Ours      BPG (4:4:4)  JPEG2000  JPEG
Bits per pixel (BPP)  0.1697    0.1697       0.1702    0.1775
PSNR (dB)             32.2332   31.9404      30.3140   27.3389
MS-SSIM               0.9602    0.9539       0.9369    0.8669
Figure 2: PSNR and MS-SSIM comparison of our model and classical image compression algorithms (BPG, JPEG2000, and JPEG). We adapt the rate by changing the Lagrange multiplier λ and the quantization bin size Δ to match the rate of BPG. In this example, we observe a 0.29 dB PSNR gain over the state-of-the-art BPG codec. A perceptual measure, MS-SSIM, also improves. Visually, our method provides better quality with fewer artifacts than the classical image compression codecs.

The resemblance of non-linear transform coding and autoencoders has been established and exploited for image compression in [3, 23]—an encoder transforms an image (a set of pixels) into a latent representation in a lower-dimensional space, and a decoder performs an approximate inverse transform that converts the latent representation back to the image. The transformation is desired to yield a latent representation with the smallest entropy at a given distortion level, since the entropy is the minimum rate achievable with lossless entropy source coding [7, Section 5.3]. In practice, however, it is generally not straightforward to calculate and optimize the exact entropy of a latent representation. Hence, the rate-distortion (R-D) trade-off is optimized by minimizing an entropy estimate of the latent representation provided by an autoencoder at a target quality. To improve compression efficiency, recent methods have focused on developing accurate entropy estimation models [1, 15, 4, 16, 14] with sophisticated density estimation techniques such as variational Bayes and autoregressive context modeling.

Given a model that provides an accurate entropy estimate of a latent representation, previous autoencoder-based image compression frameworks optimize their networks by minimizing a weighted sum of the rate and the distortion using the method of Lagrange multipliers. The Lagrange multiplier λ introduced in the Lagrangian (see (2)) is treated as a hyper-parameter to train a network for a desired trade-off between the rate and the quality of compressed images. This implies that one needs to train and deploy separate networks for rate adaptation; alternatively, one can re-train a single network while varying the Lagrange multiplier. Either approach is impractical when we operate on a broad range of the R-D curve with fine resolution and the size of each network is large.

In this paper, we suggest training and deploying only one variable-rate image compression network that is capable of rate adaptation. In particular, we propose a conditional autoencoder, conditioned on the Lagrange multiplier, i.e., the network takes the Lagrange multiplier as an input and produces a latent representation whose rate depends on the input value. Moreover, we propose training the network with mixed quantization bin sizes, which allows us to adapt the rate by adjusting the bin size applied to the quantization of a latent representation. Coarse rate adaptation to a target is achieved by varying the Lagrange multiplier in the conditional model, while fine rate adaptation is done by tuning the quantization bin size. We illustrate our variable-rate image compression model in Figure 1.

Conditional autoencoders have been used for conditional generation [21, 26], where the conditioning variables are typically labels, attributes, or partial observations of the target output. In contrast, our conditional autoencoder takes a hyper-parameter of the optimization problem, i.e., the Lagrange multiplier, as its conditioning variable. We essentially solve multiple objectives with one conditional network, instead of solving them individually with separate non-conditional networks (each optimized for one objective), which is, to the best of our knowledge, new.

We also note that variable-rate models using recurrent neural networks (RNNs) were proposed in [24, 9]. However, the RNN-based models require progressive encoding and decoding, depending on the target image quality; the increasing number of iterations needed to obtain a higher-quality image is not desirable in certain applications and platforms. Our variable-rate model is different from the RNN-based models: it is based on a conditional autoencoder that needs no iterative processing, while the quality is controlled by its conditioning variables, i.e., the Lagrange multiplier and the quantization bin size. Our method also shows superior performance over the RNN-based models in [24, 9].

We evaluate the performance of our variable-rate image compression model on the Kodak image dataset [12] for both the objective image quality metric, PSNR, and a perceptual score measured by the multi-scale structural similarity (MS-SSIM) [28]. The experimental results show that our variable-rate model outperforms BPG in both PSNR and MS-SSIM; an example from the Kodak dataset is shown in Figure 2. Moreover, our model shows a comparable and sometimes better R-D trade-off than the state-of-the-art learned image compression models [16, 14] that outperform BPG by deploying multiple networks trained for different target rates.

2 Preliminary

We consider a typical autoencoder architecture consisting of encoder f and decoder g, where x is an input image and ŷ is the quantized latent representation encoded from the input x with quantization bin size Δ; we let ŷ = Δ⌊f(x)/Δ⌉, where ⌊·⌉ denotes element-wise rounding to the nearest integer. For now, we fix Δ = 1. Lossless entropy source coding, e.g., arithmetic coding [7, Section 13.3], is used to generate a compressed bitstream from the quantized representation ŷ. Let x̂ = g(ŷ) denote the reconstructed image, and let p(x) denote the probability density function of the input images.
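As a minimal sketch of this quantization step (an illustration with assumed tensor shapes, not the authors' implementation), rounding the latent to a grid of spacing Δ looks as follows:

```python
import torch

def quantize(z: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Deterministic quantization: round each latent element to the nearest multiple of delta."""
    return delta * torch.round(z / delta)

z = torch.randn(1, 192, 16, 16)   # hypothetical latent produced by an encoder f(x)
y_hat = quantize(z, delta=1.0)    # larger delta -> coarser grid -> lower rate, higher distortion
```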

Deterministic quantization. Suppose that we employ entropy source coding for the quantized latent variable ŷ and achieve its entropy rate. The rate R and the squared L2 distortion D (i.e., the MSE loss) are given by

  R = E_{x∼p(x)}[ −log₂ P(ŷ) ],    D = E_{x∼p(x)}[ ‖x − g(ŷ)‖₂² ],    (1)

where p(x) is the probability density function of all natural images, and P(ŷ) is the probability mass function of ŷ induced from encoder f and p(x), which satisfies P(y) = ∫ p(x) δ(y − Δ⌊f(x)/Δ⌉) dx, where δ denotes the Dirac delta function. Using the method of Lagrange multipliers, the R-D optimization problem is given by

  min_{f,g} { D + λ R },    (2)

for λ > 0; the scalar factor λ in the Lagrangian is called a Lagrange multiplier. The Lagrange multiplier is the factor that selects a specific R-D trade-off point (e.g., see [17]).

Relaxation with universal quantization. The rate and the distortion provided in (1) are not differentiable with respect to the network parameters, due to the rounding operation ⌊·⌉ and the Dirac delta function, and thus it is not straightforward to optimize (2) through gradient descent. It was proposed in [3] to model the quantization error as additive uniform stochastic noise to relax the optimization of (2). The same technique was adopted in [4, 16, 14]. In this paper, we instead propose employing universal quantization [30, 29] to relax the problem (see Remark 2).

Universal quantization dithers every element of the latent z = f(x) with one common uniform random variable as follows:

  ỹ = Δ⌊(z + u)/Δ⌉ − u,    (3)

where the dithering vector u consists of repetitions of a single uniform random variable U with support [−Δ/2, Δ/2]. We fix Δ = 1 just for now. In each dimension, universal quantization is effectively identical in distribution to adding uniform noise independent of the source, although the noise induced from universal quantization is dependent across dimensions. Note that universal quantization is approximated as a linear function of unit slope (i.e., of gradient 1) in the backpropagation of the network training.

Figure 3: The network trained with universal quantization gives higher PSNR than the one trained with additive uniform noise in our experiments on 24 Kodak images.
Remark 1.

To our knowledge, we are the first to adopt universal quantization in the framework of training image compression networks. In [6], universal quantization was used for efficient weight compression of deep neural networks, which is different from our usage here. We observed from our experiments that our relaxation with universal quantization provides some gain over the conventional method of adding independent uniform noise (see Figure 3).
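The two relaxations compared in Figure 3 can be sketched as follows; this is an illustration of (3) under our notational assumptions (a generic latent tensor, a straight-through unit-slope gradient via detach), not the authors' code:

```python
import torch

def universal_quantize(z: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Universal quantization as in (3): dither with a single uniform variable shared by
    all elements, round to the grid of spacing delta, then remove the dither again.
    The quantization error is uniform on [-delta/2, delta/2] and independent of z."""
    u = (torch.rand(()) - 0.5) * delta            # one U ~ Unif(-delta/2, delta/2) for all elements
    y = delta * torch.round((z + u) / delta) - u
    return z + (y - z).detach()                   # unit-slope (identity) gradient in backpropagation

def additive_uniform_noise(z: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Conventional relaxation of [3]: add independent uniform noise to every element."""
    return z + (torch.rand_like(z) - 0.5) * delta
```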

Differentiable R-D cost function. Under the relaxation with universal quantization, similar to (1), the rate and the distortion can be expressed as below:

  R̃ = E_{x,u}[ −log₂ p(ỹ) ],    D̃ = E_{x,u}[ ‖x − g(ỹ)‖₂² ],    (4)

where ỹ is given by (3). The stochastic quantization model makes ỹ have a continuous density p(ỹ), which is a continuous relaxation of P(ŷ), but it is still usually intractable to compute. Thus, we further approximate p(ỹ) with a tractable density q(ỹ) that is differentiable with respect to ỹ and its own model parameters. Then, it follows that

  E[ −log₂ p(ỹ) ] = E[ −log₂ q(ỹ) ] − D_KL(p ‖ q) ≤ E[ −log₂ q(ỹ) ],    (5)

where D_KL denotes the Kullback-Leibler (KL) divergence (e.g., see [7, p. 19]); the equality in (5) holds when q(ỹ) = p(ỹ). The choice of q in our implementation is deferred to Section 4 (see (12)–(14)).

From (2) and (4), replacing the rate with its upper bound in (5), the R-D optimization problem reduces to

  min_{f,g,q} E_{x,u}[ ‖x − g(ỹ)‖₂² + λ (−log₂ q(ỹ)) ],    (6)

for λ > 0. Optimizing a network for different values of λ, one can trade off the quality against the rate.
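For concreteness, a minimal Monte Carlo estimate of (6) on a mini-batch might look as follows, where `neg_log2_q` is assumed to hold the per-element estimate of −log₂ q(ỹ) produced by some entropy model; the placement of λ on the rate term matches (6) above, and this is a sketch rather than the paper's training code:

```python
import torch
import torch.nn.functional as F

def rd_lagrangian(x, x_hat, neg_log2_q, lam):
    """Mini-batch Monte Carlo estimate of (6): MSE distortion plus lambda times
    the estimated rate (bits of the noisy latent under the entropy model q)."""
    distortion = F.mse_loss(x_hat, x, reduction='mean')
    rate = neg_log2_q.sum() / x.shape[0]   # average bits per image
    return distortion + lam * rate
```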

Remark 2.

The objective function in (6) has the same form as that of auto-encoding variational Bayes [11], given that the posterior q(ỹ | x) is uniform. This relation was already established in previous works, and detailed discussions can be found in [3, 4]. Our contribution in this section is to deploy universal quantization (see (3)), which guarantees that the quantization error is uniform and independent of the source distribution, instead of artificially adding uniform noise, when generating random samples of ỹ in the Monte Carlo estimation of (6).

3 Variable rate image compression

To adapt the quality and the rate of compressed images, we basically need to optimize the R-D Lagrangian in (6) for varying values of the Lagrange multiplier λ. That is, one has to train multiple networks, or re-train a network, while varying the Lagrange multiplier λ. Training and deploying multiple networks is not practical, in particular when we want to cover a broad range of the R-D curve with fine resolution and each network is of a large size. In this section, we develop a variable-rate model that is deployed once and can produce compressed images of varying quality and rate, depending on the user's requirements, with no need for re-training.

3.1 Conditional autoencoder

Figure 4: Conditional convolution, conditioned on the Lagrange multiplier λ, which produces a different output depending on the input Lagrange multiplier λ.

To avoid training and deploying multiple networks, we propose training one conditional autoencoder, conditioned on the Lagrange multiplier λ. The network takes λ as a conditioning input parameter, along with the input image, and produces a compressed image whose rate and distortion depend on the conditioning value of λ. To this end, the rate and distortion terms in (4) and (5) are altered into their conditional counterparts R(λ) and D(λ), obtained from the encoder, decoder, and entropy model conditioned on λ, for λ ∈ Λ, where Λ is a pre-defined finite set of Lagrange multiplier values, and then we minimize the following combined objective function:

  min_{f,g,q} Σ_{λ∈Λ} { D(λ) + λ R(λ) }.    (7)

To implement a conditional autoencoder, we develop a conditional convolution, conditioned on the Lagrange multiplier λ, as shown in Figure 4. Let X_i be a 2-dimensional (2-D) input feature map of channel i and Y_j be a 2-D output feature map of channel j. Let W_{j,i} be a 2-D convolutional kernel for input channel i and output channel j. Our conditional convolution yields

  Y_j = s_j(λ) ( Σ_i W_{j,i} ∗ X_i ) + b_j(λ),    (8)

where ∗ denotes 2-D convolution. The channel-wise scaling factor s_j(λ) and the additive bias term b_j(λ) depend on λ by

  s_j(λ) = softplus( u_jᵀ e(λ) ),    b_j(λ) = v_jᵀ e(λ),    (9)

where u_j and v_j are the fully-connected-layer weight vectors of length |Λ| for output channel j; ᵀ denotes the transpose, softplus(x) = log(1 + eˣ), and e(λ) is the one-hot encoding of λ over Λ.
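A PyTorch sketch of the conditional convolution in (8)–(9) is given below; the module name, layer sizes, and the softplus on the scaling factor reflect our reading of the equations and are not taken from a released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalConv2d(nn.Module):
    """Sketch of the conditional convolution in (8)-(9): a shared convolution whose
    per-output-channel scale s_j(lambda) and bias b_j(lambda) are fully-connected
    functions of the one-hot encoded Lagrange multiplier."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, num_lambdas: int, stride: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                              padding=kernel_size // 2, bias=False)
        self.scale_fc = nn.Linear(num_lambdas, out_ch, bias=False)  # u_j in (9)
        self.bias_fc = nn.Linear(num_lambdas, out_ch, bias=False)   # v_j in (9)

    def forward(self, x: torch.Tensor, lam_onehot: torch.Tensor) -> torch.Tensor:
        # lam_onehot: (N, num_lambdas) one-hot encoding of lambda over the set Lambda
        s = F.softplus(self.scale_fc(lam_onehot))   # channel-wise scaling factor (softplus assumed)
        b = self.bias_fc(lam_onehot)                # channel-wise additive bias
        y = self.conv(x)
        return s[:, :, None, None] * y + b[:, :, None, None]
```

For example, with |Λ| = 8 Lagrange multiplier values, `lam_onehot = F.one_hot(torch.tensor([k]), 8).float()` selects the k-th R-D trade-off for a single-image batch.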

Remark 3.

The proposed conditional convolution is similar to the one used in conditional PixelCNN [26]. In [26], the conditioning variables are typically labels, attributes, or partial observations of the target output, while our conditioning variable is the Lagrange multiplier, i.e., the hyper-parameter that trades off the quality against the rate in the compression problem. A gated-convolution structure is presented in [26], but we develop a simpler structure so that the additional computational cost of conditioning is marginal.

3.2 Training with mixed bin sizes

(a) Vary Δ for fixed λ (b) Vary λ for fixed Δ (c) Vary the mixing range of Δ in training
Figure 5: In (a,b), we show how we can adapt the rate in our variable-rate model by changing the Lagrange multiplier λ and the quantization bin size Δ. In (a), we vary Δ within its covered range for each fixed λ in (15). In (b), we change λ in Λ while fixing Δ at some selected values. In (c), we compare PSNR when models are trained for mixed bin sizes of different ranges.

We established a variable-rate conditional autoencoder model, conditioned on the Lagrange multiplier λ, in the previous subsection, but only finitely many discrete points on the R-D curve can be obtained from it, since λ is selected from a pre-determined finite set Λ. (The conditioning part can be modified to take continuous λ values, which however did not produce good results in our trials.) To extend the coverage to the whole continuous range of the R-D curve, we develop another (continuous) knob to control the rate: the quantization bin size.

Recall that in the previous R-D formulation (1), we fixed the quantization bin size Δ = 1, i.e., we simply used rounding to the nearest integer ⌊·⌉ for quantization. In actual inference, we can change the bin size to adapt the rate—the larger the bin size, the lower the rate. However, the performance naturally suffers from mismatched bin sizes in training and inference. For a trained network to be robust and accurate for varying bin sizes, we propose training (or fine-tuning) it with mixed bin sizes.

In training, we draw the uniform noise u in (3) at various noise levels, i.e., for random values of Δ. The range of Δ and the mixing distribution within that range are design choices. In our experiments, we draw Δ at random from a fixed range around Δ = 1 that covers the bin sizes used at inference. The larger the range of Δ, the broader the portion of the R-D curve one network covers, but the performance also degrades. In Figure 5(c), we compare the R-D curves obtained from networks trained with mixed bin sizes of different ranges; we used a fixed λ in training the networks just for this experiment. We found that mixing bin sizes over a moderate range yields the best performance, although the coverage is limited, which is not a problem since large-scale rate adaptation is covered by changing the input Lagrange multiplier λ in our conditional model (see Figure 5(a,b)).

In summary, we solve the following optimization:

  min_{f,g,q} Σ_{λ∈Λ} E_{Δ∼p(Δ)}[ L(λ, Δ) ],    (10)

where p(Δ) is a pre-defined mixing density for Δ, and

  L(λ, Δ) = E_{x,u}[ ‖x − g(ỹ; λ)‖₂² + λ (−log₂ q(ỹ | λ, Δ)) ],  with  ỹ = Δ⌊(f(x; λ) + u)/Δ⌉ − u.    (11)
Remark 4.

In training, we compute neither the summation over λ ∈ Λ nor the expectation over Δ in (10). Instead, we randomly select λ uniformly from Λ and draw Δ from p(Δ) for each image to compute its individual R-D cost, and then we use the average R-D cost per batch as the loss for gradient descent, which makes the training scalable.
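A sketch of this per-image sampling is shown below; `model`, its forward interface, and the particular mixing density for Δ (log-uniform over [0.5, 2]) are illustrative assumptions rather than the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x, lambdas):
    """One gradient step in the spirit of Remark 4: sample lambda and Delta per image,
    compute each image's individual R-D cost, and average over the batch."""
    n = x.shape[0]
    idx = torch.randint(len(lambdas), (n,))              # lambda drawn uniformly from Lambda
    lam = torch.tensor(lambdas)[idx]
    lam_onehot = F.one_hot(idx, len(lambdas)).float()
    delta = 2.0 ** (torch.rand(n) * 2.0 - 1.0)           # example mixing density p(Delta), here over [0.5, 2]

    x_hat, neg_log2_q = model(x, lam_onehot, delta)      # assumed model interface
    distortion = ((x - x_hat) ** 2).flatten(1).mean(dim=1)   # per-image MSE
    rate = neg_log2_q.flatten(1).sum(dim=1)                  # per-image bits under q
    loss = (distortion + lam * rate).mean()                  # batch-averaged R-D cost

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```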

3.3 Inference

Rate adaptation. The rate increases as we decrease the Lagrange multiplier λ and/or the quantization bin size Δ. In Figure 5(a,b), we show how the rate varies as we change λ and Δ. In (a), we change Δ within its covered range for each fixed λ from (15). In (b), we vary λ in Λ while fixing Δ at some selected values. Given a user's target rate, large-scale discrete rate adaptation is achieved by changing λ, while fine continuous rate adaptation can be performed by adjusting Δ for fixed λ. When the R-D curves overlap at the target rate (e.g., see the overlapping BPP region in Figure 5(a)), we select the combination of λ and Δ that produces better performance. (In practice, one can make a set of pre-selected combinations of λ and Δ, similar to the set of quality factors in JPEG or BPG.)

Compression. After selecting λ and Δ, we apply one-hot encoding to λ and use it in all conditional convolutional layers to encode a latent representation of the input. Then, we perform regular deterministic quantization on the encoded representation with the selected quantization bin size Δ. The quantized latent representation is finally encoded into a compressed bitstream with entropy coding, e.g., arithmetic coding; we additionally need to store the values of the conditioning variables, λ and Δ, used in encoding.

Decompression. We decode the compressed bitstream and retrieve the values of λ and Δ used in encoding. We restore the quantized latent representation from the decoded integer values by multiplying them with the quantization bin size Δ. The restored latent representation is then fed to the decoder to reconstruct the image. The value of λ used in encoding is again used in all conditional deconvolutional layers, for conditional generation.
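The compression and decompression procedures can be sketched as follows; `encoder`, `decoder`, and `entropy_coder` are assumed interfaces standing in for the conditional network and the arithmetic coder, not a released API:

```python
import torch
import torch.nn.functional as F

def compress(encoder, entropy_coder, x, lam_index: int, num_lambdas: int, delta: float):
    """Encode with the chosen (lambda, Delta): conditional encoding, deterministic
    quantization to integer symbols, then arithmetic coding (assumed interface)."""
    lam_onehot = F.one_hot(torch.tensor([lam_index]), num_lambdas).float()
    z = encoder(x, lam_onehot)                        # conditional latent representation
    symbols = torch.round(z / delta).to(torch.int32)  # integer symbols on the Delta grid
    return entropy_coder.encode(symbols, lam_index, delta)

def decompress(decoder, entropy_coder, bitstream, num_lambdas: int):
    """Decode: recover (lambda, Delta) and the integer symbols, rescale by Delta,
    and run the conditional decoder."""
    symbols, lam_index, delta = entropy_coder.decode(bitstream)
    lam_onehot = F.one_hot(torch.tensor([lam_index]), num_lambdas).float()
    y_hat = symbols.float() * delta                   # restore the quantized latent
    return decoder(y_hat, lam_onehot)
```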

4 Refined probabilistic model

In this section, we discuss how we refine the baseline model of the previous section to improve the performance. The model refinement is orthogonal to the rate adaptation schemes in Section 3. Starting from (11), we introduce a secondary latent variable z̃, which depends on ỹ and λ, to yield the factorization

  q(ỹ, z̃ | λ, Δ) = q(ỹ | z̃; λ, Δ) q(z̃; λ, Δ).

For compression, we encode ỹ from x, and then we further encode z̃ from ỹ. The encoded representations ỹ and z̃ are entropy-coded based on q(ỹ | z̃) and q(z̃), respectively. For decompression, given the bitstream, we first decode z̃, which is then used to compute q(ỹ | z̃) and to decode ỹ. This model is further refined by introducing autoregressive models for q(ỹ | z̃) and q(z̃) as below:

  q(ỹ | z̃; λ, Δ) = Π_i q(ỹ_i | ỹ_{<i}, z̃; λ, Δ),    q(z̃; λ, Δ) = Π_j q(z̃_j | z̃_{<j}; λ, Δ),    (12)

where ỹ_i is the i-th element of ỹ, and ỹ_{<i} = (ỹ_1, …, ỹ_{i−1}). In Figure 6, we illustrate a graph representation of our refined variable-rate image compression model.

Figure 6: A graph representation of our refined variable-rate image compression model.

In our experiments, we use

  q(ỹ_i | ỹ_{<i}, z̃; λ, Δ) = ( N(μ_i, σ_i²) ∗ Unif(−Δ/2, Δ/2) )(ỹ_i),    (13)

where μ_i and σ_i are functions of (ỹ_{<i}, z̃, λ), and N(μ_i, σ_i²) denotes the Gaussian density with mean μ_i and standard deviation σ_i, i.e., (1/σ_i)φ((· − μ_i)/σ_i), where φ denotes the standard normal density; μ and σ are parameterized with autoregressive neural networks, e.g., consisting of masked convolutions [26], which are also conditioned on λ as in Figure 4. Similarly, we let

  q(z̃_j | z̃_{<j}; λ, Δ) = ( q_j ∗ Unif(−Δ/2, Δ/2) )(z̃_j),    (14)

where z̃_{<j} = (z̃_1, …, z̃_{j−1}), ∗ denotes convolution, and q_j is designed as a univariate density model parameterized with a neural network as described in [4, Appendix 6.1].
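Under this reading of (13), the per-element rate can be evaluated as the Gaussian probability mass falling in the width-Δ bin around each value; a hedged sketch (with μ and σ assumed to come from the conditioned context model):

```python
import torch

def gaussian_bits(y, mu, sigma, delta: float):
    """Per-element -log2 likelihood under a Gaussian convolved with Unif(-delta/2, delta/2),
    i.e., the Gaussian mass falling in the width-delta bin around each value (cf. (13))."""
    normal = torch.distributions.Normal(mu, sigma)
    prob = normal.cdf(y + delta / 2) - normal.cdf(y - delta / 2)
    return -torch.log2(prob.clamp_min(1e-9))
```

Here `mu` and `sigma` would come from the masked-convolution context model conditioned on λ; summing the returned tensor gives the rate estimate used in the Lagrangian.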

Remark 5.

Setting aside the conditioning parts, the refined model can be viewed as a hierarchical autoencoder (e.g., see [25]). It is also similar to the one in [16] with the differences summarized in Appendix A.

5 Experiments

Figure 7: Our network architecture. UnivQuant denotes universal quantization with the quantization bin size Δ. AE and AD are arithmetic encoding and decoding, respectively. Concat implies concatenation. GDN stands for generalized divisive normalization, and IGDN is inverse GDN [2]. The convolution parameters (kernel size, number of channels, and stride) are denoted in each block, where ↑ and ↓ indicate upsampling and downsampling, respectively. CConv denotes conditional convolution, conditioned on the Lagrange multiplier λ (see Figure 4). All convolution and masked convolution blocks employ conditional convolutions. Upsampling convolutions are implemented as deconvolutions. Masked convolutions are implemented as in [26].

We illustrate the network architecture that we used in our experiments in Figure 7. We emphasize that all convolution (including masked convolution) blocks employ conditional convolutions (see Figure 4 in Section 3.1).

Training. For the training dataset, we used the ImageNet ILSVRC 2012 dataset [20]. We resized the training images so that the shorter of the width and the height has a fixed length, and we extracted patches at random locations. In addition to the ImageNet dataset, we used the training dataset provided in the Workshop and Challenge on Learned Image Compression (CLIC, https://www.compression.cc). For the CLIC training dataset, we extracted patches at random locations without resizing. We used the Adam optimizer [10] and trained each model for a fixed number of epochs with a fixed batch size; the learning rate was set to an initial value and decreased twice at fixed epochs during training.

We pre-trained a conditional model that can be conditioned on the different values of the Lagrange multiplier λ in the pre-defined finite set

  Λ = {λ_1, λ_2, …, λ_N},    (15)

for fixed bin size Δ = 1. In pre-training, we used the MSE loss. Then, we re-trained the model for mixed bin sizes; the quantization bin size Δ is selected randomly from the pre-defined mixing density p(Δ) so that training covers the range of bin sizes used at inference. In the re-training with mixed bin sizes, we used one of the MSE, MS-SSIM, and combined MSE+MS-SSIM losses (see Figure 9). We used the same training datasets and the same training procedure for pre-training and re-training. We also trained multiple fixed-rate models for fixed λ and fixed Δ for comparison.

Experimental results. We compare the performance of our variable-rate model to the state-of-the-art learned image compression models from [19, 15, 4, 16, 9, 14] and the classical state-of-the-art variable-rate image compression codec, BPG [5], on the Kodak image set [12]. Some of the previous models were optimized for MSE, and some of them were optimized for a perceptual measure, MS-SSIM. Thus, we compare both measures separately in Figure 8. In particular, we included the results for the RNN-based variable-rate compression model in [9], which were obtained from [4]. All the previous works in Figure 8, except [9], trained multiple networks to get the multiple points in their R-D curves.

Figure 8: PSNR and MS-SSIM comparison to the state-of-the-art image compression models on 24 Kodak images. As in Figure 5(a), we plotted curves from our variable-rate model for the Lagrange multiplier values in Λ of (15), varying the quantization bin size Δ for each λ.

For our variable-rate model, we plotted curves of the same blue color for PSNR and MS-SSIM, respectively, in Figure 8. Each curve corresponds to one of the Lagrange multiplier values in Λ of (15); for each λ, we varied the quantization bin size Δ to obtain the curve. Our variable-rate model outperforms BPG in both PSNR and MS-SSIM. It also performs comparably to, and in some cases better than, the state-of-the-art learned image compression models [16, 14] that outperform BPG by deploying multiple networks trained for varying rates.

Our model shows superior performance over the RNN-based variable-rate model in [9]. The RNN-based model requires multiple encoding/decoding iterations at high rates, implying that the complexity increases as more iterations are needed to achieve better quality. In contrast, our model uses a single iteration, i.e., the encoding/decoding complexity is fixed, at any rate. Moreover, our model can produce any point on the R-D curve with arbitrarily fine resolution by tuning the continuous rate-adaptive parameter, the quantization bin size Δ, whereas the RNN-based model can produce only finitely many points on the R-D curve, depending on how many bits it encodes in each recurrent stage.

Figure 9: PSNR and MS-SSIM comparison on 24 Kodak images for our variable-rate and fixed-rate networks when they are optimized for the MSE, MS-SSIM, and combined MSE+MS-SSIM losses, respectively. In particular, we note that our variable-rate network optimized for MSE outperforms BPG in both PSNR and MS-SSIM measures.

In Figure 9, we compare our variable-rate networks optimized for the MSE, MS-SSIM, and combined MSE+MS-SSIM losses, respectively. We also plotted the results from our fixed-rate networks trained for fixed λ and fixed Δ. Observe that our variable-rate network performs very close to the ones individually optimized for fixed λ and Δ. Here, we emphasize that our variable-rate network optimized for MSE performs better than BPG in both PSNR and MS-SSIM measures.

Figure panels: ground truth, the latent representations ỹ and z̃, and the number of bits assigned to each element of ỹ and z̃ in arithmetic coding.

Five (λ, Δ) settings, from highest to lowest rate:
Bits per pixel (BPP)  1.8027   0.8086   0.6006   0.4132   0.1326
PSNR (dB)             41.3656  36.2535  34.8283  33.1478  29.2833
MS-SSIM               0.9951   0.9863   0.9819   0.9737   0.9249

Figure 10: Our variable-rate image compression outputs for different values of λ and Δ. We also depict the value of each element of the latent representations ỹ and z̃ and the number of bits assigned to it in arithmetic coding.

Figure 10 shows the compressed images generated from our variable-rate model to assess their visual quality. We also depict the number of bits (implicitly) used to represent each element of ỹ and z̃ in arithmetic coding, which are given by the negative log-probabilities under the models in (12)–(14). We randomly selected two and four channels from ỹ and z̃, respectively, and show the code length for each latent representation value in the figure. As we change the conditioning parameters λ and Δ, we can adapt the arithmetic code length that determines the rate of the latent representation. Observe that the smaller the values of λ and/or Δ, the more bits the resulting latent representation requires in arithmetic coding and the higher the rate, as expected.

6 Conclusion

This paper proposed a variable-rate image compression framework with a conditional autoencoder. Unlike the previous learned image compression methods that train multiple networks to cover various rates, we train and deploy one variable-rate model that provides two knobs to control the rate, i.e., the Lagrange multiplier and the quantization bin size, which are given as input to the conditional autoencoder model. Our experimental results showed that the proposed scheme provides better performance than the classical image compression codecs such as JPEG2000 and BPG. Our method also showed comparable and sometimes better performance than the recent learned image compression methods that outperform BPG but need multiple networks trained for different compression rates. We finally note that the proposed conditional neural network can be adopted in deep learning not only for image compression but also in general to solve any optimization problem that can be formulated with the method of Lagrange multipliers.

References

  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pp. 1141–1151.
  • [2] J. Ballé, V. Laparra, and E. P. Simoncelli (2016) Density modeling of images using a generalized normalization transformation. In International Conference on Learning Representations.
  • [3] J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. In International Conference on Learning Representations.
  • [4] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
  • [5] F. Bellard (2014) BPG image format. https://bellard.org/bpg.
  • [6] Y. Choi, M. El-Khamy, and J. Lee (2018) Universal deep neural network compression. In NeurIPS Workshop on Compact Deep Neural Network Representation with Industrial Applications (CDNNRIA).
  • [7] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons.
  • [8] V. K. Goyal (2001) Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18 (5), pp. 9–21.
  • [9] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
  • [10] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
  • [11] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
  • [12] E. Kodak (1993) Kodak lossless true color image suite (PhotoCD PCD0992). http://r0k.us/graphics/kodak.
  • [13] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  • [14] J. Lee, S. Cho, and S. Beack (2019) Context-adaptive entropy model for end-to-end optimized image compression. In International Conference on Learning Representations.
  • [15] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
  • [16] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10794–10803.
  • [17] A. Ortega and K. Ramchandran (1998) Rate-distortion methods for image and video compression. IEEE Signal Processing Magazine 15 (6), pp. 23–50.
  • [18] M. Rabbani (2002) JPEG2000: image compression fundamentals, standards and practice. Journal of Electronic Imaging 11 (2), pp. 286.
  • [19] O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. In Proceedings of the International Conference on Machine Learning, pp. 2922–2930.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [21] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
  • [22] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668.
  • [23] L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. In International Conference on Learning Representations.
  • [24] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
  • [25] J. Tomczak and M. Welling (2018) VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223.
  • [26] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
  • [27] G. K. Wallace (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv.
  • [28] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402.
  • [29] R. Zamir and M. Feder (1992) On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory 38 (2), pp. 428–436.
  • [30] J. Ziv (1985) On universal quantization. IEEE Transactions on Information Theory 31 (3), pp. 344–347.

A Comparison of our refined probabilistic model to [16]

The major difference from [16] is the conditioning on λ (and Δ). Furthermore, there are some differences from [16] in the probabilistic model, which we highlight in Table 1.

Table 1: Comparison of our refined probabilistic model to [16] (columns: probability term, modeling in [16], modeling in ours).

B More example images

As supplementary material, we provide more example images produced by our variable-rate image compression network optimized for the MSE loss. We compare our method to the classical image compression codecs, i.e., JPEG, JPEG2000, and BPG. We adapt and match the compression rate of our variable-rate network to the rate of BPG by adjusting the Lagrange multiplier λ and the quantization bin size Δ. All the examples show that our method outperforms the state-of-the-art BPG codec in both PSNR and MS-SSIM measures at the same bits per pixel (BPP). Visually, our method provides better quality with fewer artifacts than the classical image compression codecs. We put orange boxes to highlight the visual differences in Figure 11 and Figure 13, and the orange-boxed areas are magnified in Figure 12 and Figure 14, respectively.

Ours: BPP 0.2078, PSNR 32.4296 dB, MS-SSIM 0.9543
BPG (4:4:4): BPP 0.2078, PSNR 32.0406 dB, MS-SSIM 0.9488
JPEG2000: BPP 0.2092, PSNR 30.9488 dB, MS-SSIM 0.9342
JPEG: BPP 0.2098, PSNR 28.1758 dB, MS-SSIM 0.8777
Figure 11: PSNR, MS-SSIM, and visual quality comparison of our variable-rate deep image compression method and classical image compression algorithms (BPG, JPEG2000, and JPEG) for the Kodak image 04. Our method outperforms the state-of-the-art BPG codec in both PSNR and MS-SSIM measures. We put orange boxes to highlight the visual differences.
                      Ours      BPG (4:4:4)  JPEG2000  JPEG
Bits per pixel (BPP)  0.2078    0.2078       0.2092    0.2098
PSNR (dB)             32.4296   32.0406      30.9488   28.1758
MS-SSIM               0.9543    0.9488       0.9342    0.8777
Figure 12: Visual quality comparison of our variable-rate deep image compression method and classical image compression algorithms (BPG, JPEG2000, and JPEG) for the Kodak image 04 in the orange-boxed areas of Figure 11.
Ours: BPP 0.1289, PSNR 34.4543 dB, MS-SSIM 0.9695
BPG (4:4:4): BPP 0.1289, PSNR 33.3546 dB, MS-SSIM 0.9593
JPEG2000: BPP 0.1298, PSNR 31.8927 dB, MS-SSIM 0.9482
JPEG: BPP 0.1299, PSNR 27.1270 dB, MS-SSIM 0.8404
Figure 13: PSNR, MS-SSIM, and visual quality comparison of our variable-rate deep image compression method and classical image compression algorithms (BPG, JPEG2000, and JPEG) for the Kodak image 23. Our method outperforms the state-of-the-art BPG codec in both PSNR and MS-SSIM measures. We put orange boxes to highlight the visual differences.
                      Ours      BPG (4:4:4)  JPEG2000  JPEG
Bits per pixel (BPP)  0.1289    0.1289       0.1298    0.1299
PSNR (dB)             34.4543   33.3546      31.8927   27.1270
MS-SSIM               0.9695    0.9593       0.9482    0.8404
Figure 14: Visual quality comparison of our variable-rate deep image compression method and classical image compression algorithms (BPG, JPEG2000, and JPEG) for the Kodak image 23 in the orange-boxed areas of Figure 13.