Image compression is an application of data compression to digital images that lowers their storage and/or transmission requirements. Transform coding has been successful in yielding practical and efficient image compression algorithms such as JPEG and JPEG2000. The transformation converts an input into a latent representation in the transform domain, where lossy compression (typically a combination of quantization and lossless source coding) is more amenable and more efficient. For example, JPEG utilizes the discrete cosine transform (DCT) to convert an image into a sparse frequency-domain representation. JPEG2000 replaces the DCT with an enhanced discrete wavelet transform.
Deep learning is now driving performance breakthroughs in various computer vision tasks. Along with this revolutionary progress, learned image compression has also attracted significant interest [3, 23, 24, 19, 1, 15, 4, 9, 16, 14]. In particular, non-linear transform coding designed with deep neural networks has advanced to the point of outperforming the classical image compression codecs carefully designed and optimized by domain experts, e.g., BPG, which is a still-image version of the high efficiency video coding (HEVC) standard; we note that, very recently, only a few learning-based image compression schemes have reached the performance of the state-of-the-art BPG codec on peak signal-to-noise ratio (PSNR), a metric based on mean squared error (MSE) [16, 14].
| Ground truth | Ours | BPG (4:4:4) | JPEG2000 | JPEG |
| Bits per pixel (BPP) | 0.1697 | 0.1697 | 0.1702 | 0.1775 |
The resemblance of non-linear transform coding to autoencoders has been established and exploited for image compression in [3, 23]: an encoder transforms an image (a set of pixels) into a latent representation in a lower-dimensional space, and a decoder performs an approximate inverse transform that converts the latent representation back to the image. The transformation should yield the latent representation with the smallest entropy at a given distortion level, since the entropy is the minimum rate achievable with lossless entropy source coding [7, Section 5.3]. In practice, however, it is generally not straightforward to calculate and optimize the exact entropy of a latent representation. Hence, the rate-distortion (R-D) trade-off is optimized by minimizing an entropy estimate of the latent representation provided by an autoencoder at a target quality. To improve compression efficiency, recent methods have focused on developing accurate entropy estimation models [1, 15, 4, 16, 14] with sophisticated density estimation techniques such as variational Bayes and autoregressive context modeling.
Given a model that provides an accurate entropy estimate of a latent representation, previous autoencoder-based image compression frameworks optimize their networks by minimizing the weighted sum of the rate and the distortion using the method of Lagrange multipliers. The Lagrange multiplier introduced in the Lagrangian (see (2)) is treated as a hyper-parameter for training a network at a desired trade-off between the rate and the quality of compressed images. This implies that one needs to train and deploy a separate network for each target rate; e.g., one can re-train a network while varying the Lagrange multiplier. However, this is impractical when we want to operate on a broad range of the R-D curve with fine resolution and the size of each network is large.
In this paper, we suggest training and deploying only one variable-rate image compression network that is capable of rate adaptation. In particular, we propose a conditional autoencoder, conditioned on the Lagrange multiplier, i.e., the network takes the Lagrange multiplier as an input and produces a latent representation whose rate depends on the input value. Moreover, we propose training the network with mixed quantization bin sizes, which allows us to adapt the rate by adjusting the bin size applied to the quantization of a latent representation. Coarse rate adaptation to a target is achieved by varying the Lagrange multiplier in the conditional model, while fine rate adaptation is done by tuning the quantization bin size. We illustrate our variable-rate image compression model in Figure 1.
Conditional autoencoders have been studied previously, where the conditioning variables are typically labels, attributes, or partial observations of the target output. In contrast, our conditional autoencoder takes a hyper-parameter of the optimization problem, i.e., the Lagrange multiplier, as its conditioning variable. We essentially solve multiple objectives with one conditional network, instead of solving them individually with separate non-conditional networks (each optimized for one objective), which is new to the best of our knowledge.
We also note that variable-rate models using recurrent neural networks (RNNs) were proposed in [24, 9]. However, the RNN-based models require progressive encoding and decoding, with the number of iterations depending on the target image quality. The increasing number of iterations needed to obtain a higher-quality image is undesirable in certain applications and platforms. Our variable-rate model differs from the RNN-based models: it is based on a conditional autoencoder that needs no multiple iterations, while the quality is controlled by its conditioning variables, i.e., the Lagrange multiplier and the quantization bin size. Our method also shows superior performance over the RNN-based models in [24, 9].
We evaluate the performance of our variable-rate image compression model on the Kodak image dataset for both the objective image quality metric, PSNR, and a perceptual score measured by the multi-scale structural similarity (MS-SSIM). The experimental results show that our variable-rate model outperforms BPG in both PSNR and MS-SSIM; an example from the Kodak dataset is shown in Figure 2. Moreover, our model shows a comparable, and sometimes better, R-D trade-off than the state-of-the-art learned image compression models [16, 14] that outperform BPG by deploying multiple networks trained for different target rates.
We consider a typical autoencoder architecture consisting of an encoder $f_\phi$ and a decoder $g_\theta$, where $x$ is an input image and $\hat z$ is the quantized latent representation encoded from the input with quantization bin size $\Delta$; we let $\hat z = \Delta \lfloor f_\phi(x) / \Delta \rceil$, where $\lfloor \cdot \rceil$ denotes element-wise rounding to the nearest integer. For now, we fix $\Delta = 1$. Lossless entropy source coding, e.g., arithmetic coding [7, Section 13.3], is used to generate a compressed bitstream from the quantized representation $\hat z$. Let $z = f_\phi(x)$, where $p_z$ is the probability density function of $z$.
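The deterministic quantizer above can be sketched in a few lines; the function name and the example bin sizes are illustrative, not from the paper's code.

```python
import numpy as np

def quantize(z, delta=1.0):
    # Round each latent element to the nearest multiple of the bin size delta.
    return delta * np.round(z / delta)

z = np.array([0.4, -1.3, 2.6])
print(quantize(z, 1.0))   # [ 0. -1.  3.]
print(quantize(z, 0.5))   # [ 0.5 -1.5  2.5]
```

With a larger bin size, more latent values collapse onto the same reconstruction level, which lowers the entropy of the quantized representation at the cost of distortion.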
Deterministic quantization. Suppose that we apply entropy source coding to the quantized latent variable $\hat z$ and achieve its entropy rate. The rate and the squared $L_2$ distortion (i.e., the MSE loss) are then given by
$$ R(\phi) = \mathbb{E}_{x \sim p_x}\left[-\log_2 P_{\hat z}(\hat z)\right], \qquad D(\phi, \theta) = \mathbb{E}_{x \sim p_x}\left[\| x - g_\theta(\hat z) \|_2^2\right], \tag{1} $$
where $p_x$ is the probability density function of all natural images, and $P_{\hat z}$ is the probability mass function of $\hat z$ induced from the encoder $f_\phi$ and $p_x$, which satisfies $P_{\hat z}(\hat z) = \int p_x(x)\, \delta(\hat z - \Delta \lfloor f_\phi(x)/\Delta \rceil)\, dx$, where $\delta$ denotes the Dirac delta function. Using the method of Lagrange multipliers, the R-D optimization problem is given by
$$ \min_{\phi, \theta}\; R(\phi) + \lambda\, D(\phi, \theta), \tag{2} $$
for $\lambda > 0$; the scalar factor $\lambda$ in the Lagrangian is called a Lagrange multiplier. The Lagrange multiplier is the factor that selects a specific R-D trade-off point (e.g., see ).
Relaxation with universal quantization. The rate and the distortion in (1) are not differentiable with respect to the network parameters $\phi$, due to the rounding $\lfloor \cdot \rceil$ and the discrete probability mass $P_{\hat z}$, and thus it is not straightforward to optimize (2) through gradient descent. It was proposed in  to model the quantization error as additive uniform stochastic noise to relax the optimization of (2). The same technique was adopted in [4, 16, 14]. In this paper, we instead propose employing universal quantization [30, 29] to relax the problem (see Remark 2).
Universal quantization dithers every element of $z = f_\phi(x)$ with one common uniform random variable as follows:
$$ \tilde z = \Delta \left\lfloor \frac{z + u}{\Delta} \right\rceil - u, \qquad u = (U, U, \dots, U), \tag{3} $$
where the dithering vector $u$ consists of repetitions of a single uniform random variable $U$ with support $[-\Delta/2, \Delta/2]$. We fix $\Delta = 1$ just for now. In each dimension, universal quantization is effectively identical in distribution to adding uniform noise independent of the source, although the noise induced from universal quantization is dependent across dimensions. Note that universal quantization is approximated as a linear function of unit slope (i.e., of gradient $1$) in the backpropagation of the network training.
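As a sketch (function and variable names are ours, not the authors'), universal quantization with a single shared dither can be implemented and checked against its key property: a quantization error that stays within half a bin and is marginally uniform on $[-\Delta/2, \Delta/2]$.

```python
import numpy as np

def universal_quantize(z, delta=1.0, rng=np.random.default_rng(0)):
    # Draw ONE uniform dither value U ~ Unif[-delta/2, delta/2] and share it
    # across all dimensions, as in Eq. (3): round to the grid of z + u, then
    # subtract the dither again.
    u = rng.uniform(-delta / 2, delta / 2)
    return delta * np.round((z + u) / delta) - u

rng = np.random.default_rng(1)
z = rng.normal(size=100_000)
err = universal_quantize(z) - z
# The error never exceeds half a bin and is centered around zero.
assert np.abs(err).max() <= 0.5
assert abs(err.mean()) < 0.01
```

Note that all elements share the same dither draw per call, so the per-dimension errors are uniform but mutually dependent, exactly as described above.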
To our knowledge, we are the first to adopt universal quantization in the framework of training image compression networks. In , universal quantization was used for efficient weight compression of deep neural networks, which is different from our usage here. We observed from our experiments that our relaxation with universal quantization provides some gain over the conventional method of adding independent uniform noise (see Figure 3).
Differentiable R-D cost function. Under the relaxation with universal quantization, and similar to (1), the rate and the distortion can be expressed as
$$ R(\phi) = \mathbb{E}_{x, u}\left[-\log_2 p_{\tilde z}(\tilde z)\right], \tag{4} $$
$$ D(\phi, \theta) = \mathbb{E}_{x, u}\left[\| x - g_\theta(\tilde z) \|_2^2\right], \tag{5} $$
where $\tilde z$ is given by (3). The stochastic quantization model makes $\tilde z$ have a continuous density $p_{\tilde z}$, which is a continuous relaxation of $P_{\hat z}$, but $p_{\tilde z}$ is still usually intractable to compute. Thus, we further approximate $p_{\tilde z}$ by a tractable density $q_{\tilde z}$ that is differentiable with respect to $\tilde z$ and its own parameters $\psi$. Then, it follows that we solve
$$ \min_{\phi, \theta, \psi}\; \mathbb{E}_{x, u}\left[-\log_2 q_{\tilde z}(\tilde z; \psi)\right] + \lambda\, \mathbb{E}_{x, u}\left[\| x - g_\theta(\tilde z) \|_2^2\right], \tag{6} $$
for $\lambda > 0$. Optimizing a network for different values of $\lambda$, one can trade off the quality against the rate.
The objective function in (6) has the same form as that of auto-encoding variational Bayes , given that the posterior is uniform. This relation was already established in previous works, and detailed discussions can be found in [3, 4]. Our contribution in this section is to deploy universal quantization (see (3)) to guarantee that the quantization error is uniform and independent of the source distribution, instead of artificially adding uniform noise, when generating random samples of $\tilde z$ in the Monte Carlo estimation of (6).
3 Variable-rate image compression
To adapt the quality and the rate of compressed images, we basically need to optimize the R-D Lagrangian in (6) for varying values of the Lagrange multiplier $\lambda$. That is, one has to train multiple networks, or re-train one network, while varying the Lagrange multiplier. Training and deploying multiple networks is not practical, in particular when we want to cover a broad range of the R-D curve with fine resolution and each network is of a large size. In this section, we develop a variable-rate model that is deployed only once and can then produce compressed images of varying quality and rate, depending on the user's requirements, with no need for re-training.
3.1 Conditional autoencoder
To avoid training and deploying multiple networks, we propose training one conditional autoencoder, conditioned on the Lagrange multiplier $\lambda$. The network takes $\lambda$ as a conditioning input parameter, along with the input image, and produces a compressed image whose rate and distortion depend on the conditioning value of $\lambda$. To this end, the rate and distortion terms in (4) and (5) are altered into
$$ R(\phi, \psi; \lambda) = \mathbb{E}_{x, u}\left[-\log_2 q_{\tilde z}(\tilde z; \psi, \lambda)\right], \tag{7} $$
$$ D(\phi, \theta; \lambda) = \mathbb{E}_{x, u}\left[\| x - g_\theta(\tilde z, \lambda) \|_2^2\right], \tag{8} $$
for $\lambda \in \Lambda$, where $\Lambda$ is a pre-defined finite set of Lagrange multiplier values, and then we minimize the following combined objective function:
$$ \min_{\phi, \theta, \psi}\; \sum_{\lambda \in \Lambda} \left( R(\phi, \psi; \lambda) + \lambda\, D(\phi, \theta; \lambda) \right). \tag{9} $$
To implement a conditional autoencoder, we develop conditional convolution, conditioned on the Lagrange multiplier $\lambda$, as shown in Figure 4. Let $X_c$ be a 2-dimensional (2-D) input feature map of channel $c$ and $Y_{c'}$ be a 2-D output feature map of channel $c'$. Let $K_{c', c}$ be a 2-D convolutional kernel for input channel $c$ and output channel $c'$. Our conditional convolution yields
$$ Y_{c'} = s_{c'}(\lambda) \sum_{c} \left( K_{c', c} * X_c \right) + b_{c'}(\lambda), $$
where $*$ denotes 2-D convolution. The channel-wise scaling factor $s_{c'}(\lambda)$ and the additive bias term $b_{c'}(\lambda)$ depend on $\lambda$ by
$$ s_{c'}(\lambda) = \mathrm{softplus}\!\left( u_{c'}^\top\, \mathrm{onehot}(\lambda) \right), \qquad b_{c'}(\lambda) = v_{c'}^\top\, \mathrm{onehot}(\lambda), $$
where $u_{c'}$ and $v_{c'}$ are the fully-connected layer weight vectors of length $|\Lambda|$ for output channel $c'$; $\top$ denotes the transpose, and $\mathrm{onehot}(\lambda)$ is the one-hot encoding of $\lambda$ over $\Lambda$.
The proposed conditional convolution is similar to the one proposed by conditional PixelCNN . In , conditioning variables are typically labels, attributes, or partial observations of the target output, while our conditioning variable is the Lagrange multiplier, which is the hyper-parameter that trades off the quality against the rate in the compression problem. A gated-convolution structure is presented in , but we develop a simpler structure so that the additional computational cost of conditioning is marginal.
3.2 Training with mixed bin sizes
|(a) Vary $\Delta$ for fixed $\lambda$|(b) Vary $\lambda$ for fixed $\Delta$|(c) Vary the mixing range of $\Delta$ in training|
In the previous subsection, we established a variable-rate conditional autoencoder model conditioned on the Lagrange multiplier $\lambda$, but only finitely many discrete points on the R-D curve can be obtained from it, since $\lambda$ is selected from a pre-determined finite set $\Lambda$. (The conditioning part can be modified to take continuous values, which however did not produce good results in our trials.) To extend the coverage to the whole continuous range of the R-D curve, we develop another, continuous knob to control the rate: the quantization bin size.
Recall that in the previous R-D formulation (1), we fixed the quantization bin size $\Delta = 1$, i.e., we simply used rounding to the nearest integer for quantization. In actual inference, we can change the bin size to adapt the rate: the larger the bin size, the lower the rate. However, the performance naturally suffers from mismatched bin sizes in training and inference. For a trained network to be robust and accurate over varying bin sizes, we propose training (or fine-tuning) it with mixed bin sizes.
In training, we draw the uniform noise in (3) at various noise levels, i.e., $U \sim \mathrm{Unif}[-\Delta/2, \Delta/2]$ for random $\Delta$. The range of $\Delta$ and the mixing distribution within the range are design choices. In our experiments, we choose $\Delta = 2^b$, where $b$ is uniformly drawn from $[-1, 1]$, so we cover $\Delta \in [2^{-1}, 2^1]$. The larger the range of $\Delta$, the broader the range of the R-D curve a single network is optimized for, but the performance also degrades. In Figure 5(c), we compare the R-D curves obtained from networks trained with mixed bin sizes of different ranges; we used a fixed $\lambda$ in training the networks just for this experiment. We found that mixing bin sizes in $[2^{-1}, 2^1]$ yields the best performance, although the coverage is limited, which is not a problem since we can cover large-scale rate adaptation by changing the input Lagrange multiplier in our conditional model (see Figure 5(a,b)).
In summary, we solve the following optimization:
$$ \min_{\phi, \theta, \psi}\; \sum_{\lambda \in \Lambda} \mathbb{E}_{\Delta \sim p_\Delta}\!\left[ R(\phi, \psi; \lambda, \Delta) + \lambda\, D(\phi, \theta; \lambda, \Delta) \right], \tag{10} $$
where $p_\Delta$ is a pre-defined mixing density for $\Delta$, and $R(\phi, \psi; \lambda, \Delta)$ and $D(\phi, \theta; \lambda, \Delta)$ are the rate and the distortion in (7) and (8), evaluated with quantization bin size $\Delta$, e.g.,
$$ R(\phi, \psi; \lambda, \Delta) = \mathbb{E}_{x, u}\left[-\log_2 q_{\tilde z}(\tilde z; \psi, \lambda, \Delta)\right]. \tag{11} $$
In training, we compute neither the summation over $\Lambda$ nor the expectation over $\Delta$ in (10) exactly. Instead, for each image we select $\lambda$ uniformly at random from $\Lambda$ and draw $\Delta$ from $p_\Delta$ to compute its individual R-D cost, and then we use the average R-D cost per batch as the loss for gradient descent, which makes the training scalable.
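The per-image sampling of $(\lambda, \Delta)$ during training can be sketched as below; the set of $\lambda$ values is hypothetical, while the $\Delta = 2^b$ with $b \sim \mathrm{Unif}[-1, 1]$ scheme follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.array([0.1, 0.05, 0.02, 0.01])   # hypothetical finite set Lambda

def sample_conditions(batch_size):
    # Per image: lambda uniform over the finite set, Delta = 2**b, b ~ U(-1, 1),
    # so the bin sizes cover [1/2, 2].
    lam = rng.choice(lambdas, size=batch_size)
    delta = 2.0 ** rng.uniform(-1.0, 1.0, size=batch_size)
    return lam, delta

lam, delta = sample_conditions(8)
assert np.all(np.isin(lam, lambdas))
assert np.all((delta >= 0.5) & (delta <= 2.0))
```

Averaging the resulting per-image R-D costs over a batch gives an unbiased stochastic estimate of the objective in (10).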
Rate adaptation. The rate increases as we decrease the Lagrange multiplier $\lambda$ and/or the quantization bin size $\Delta$. In Figure 5(a,b), we show how the rate varies as we change $\lambda$ and $\Delta$. In (a), we change $\Delta$ within $[2^{-1}, 2^1]$ for each fixed $\lambda$ from (15). In (b), we vary $\lambda$ in $\Lambda$ while fixing $\Delta$ at some selected values. Given a user's target rate, large-scale discrete rate adaptation is achieved by changing $\lambda$, while fine continuous rate adaptation can be performed by adjusting $\Delta$ for fixed $\lambda$. When the R-D curves overlap at the target rate (e.g., see BPP in Figure 5(a)), we select the combination of $\lambda$ and $\Delta$ that produces the better performance. (In practice, one can make a set of pre-selected combinations of $\lambda$ and $\Delta$, similar to the set of quality factors in JPEG or BPG.)
Compression. After selecting $\lambda$ and $\Delta$, we one-hot encode $\lambda$ and use it in all conditional convolutional layers to encode a latent representation of the input. Then, we perform regular deterministic quantization on the encoded representation with the selected quantization bin size $\Delta$. The quantized latent representation is finally encoded into a compressed bitstream with entropy coding, e.g., arithmetic coding; we additionally need to store the values of the conditioning variables, $\lambda$ and $\Delta$, used in encoding.
Decompression. We decode the compressed bitstream, and we also retrieve the values of $\lambda$ and $\Delta$ used in encoding from the bitstream. We restore the quantized latent representation from the decoded integer values by multiplying them by the quantization bin size $\Delta$. The restored latent representation is then fed to the decoder to reconstruct the image. The value of $\lambda$ used in encoding is again used in all conditional deconvolutional layers for conditional generation.
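Setting the entropy coder aside, the round trip of the quantized latent reduces to integer symbols plus the side information $(\lambda, \Delta)$; a minimal sketch with illustrative names and values:

```python
import numpy as np

def compress(z, delta):
    # Integer symbols to be entropy-coded; (lambda, delta) travel as side info.
    return np.round(z / delta).astype(int)

def decompress(symbols, delta):
    # Restore the quantized latent by multiplying back the bin size.
    return symbols * delta

z = np.array([0.37, -2.4, 1.55])
z_hat = decompress(compress(z, delta=0.5), delta=0.5)
print(z_hat)   # [ 0.5 -2.5  1.5]
assert np.allclose(z_hat, [0.5, -2.5, 1.5])
```

Because the decoder only sees integer symbols, transmitting $\Delta$ (and $\lambda$, for the conditional layers) is what makes the reconstruction unambiguous.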
4 Refined probabilistic model
In this section, we discuss how we refine the baseline model of the previous section to improve the performance. The model refinement is orthogonal to the rate adaptation schemes in Section 3. From (11), we introduce a secondary latent variable $\tilde w$ that depends on $x$ and $\tilde z$, to yield
$$ -\log_2 q_{\tilde z, \tilde w}(\tilde z, \tilde w) = -\log_2 q_{\tilde z | \tilde w}(\tilde z \,|\, \tilde w) - \log_2 q_{\tilde w}(\tilde w). \tag{12} $$
For compression, we encode $\tilde z$ from $x$, and then we further encode $\tilde w$ from $\tilde z$. The encoded representations are entropy-coded based on $q_{\tilde z | \tilde w}$ and $q_{\tilde w}$, respectively. For decompression, we first decode $\tilde w$, which is then used to compute $q_{\tilde z | \tilde w}$ and to decode $\tilde z$. This model is further refined by introducing autoregressive models for $q_{\tilde z | \tilde w}$ and $q_{\tilde w}$ as below:
$$ q_{\tilde z | \tilde w}(\tilde z \,|\, \tilde w) = \prod_i q(\tilde z_i \,|\, \tilde z_{1:i-1}, \tilde w), \tag{13} $$
$$ q_{\tilde w}(\tilde w) = \prod_j q(\tilde w_j \,|\, \tilde w_{1:j-1}), \tag{14} $$
where $\tilde z_i$ is the $i$-th element of $\tilde z$, and $\tilde z_{1:i-1} = (\tilde z_1, \dots, \tilde z_{i-1})$. In Figure 6, we illustrate a graph representation of our refined variable-rate image compression model.
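Under such factorizations, the arithmetic-coding code length is simply the sum of per-element negative log-probabilities; a toy check with hand-picked (hypothetical) probabilities chosen as exact powers of two:

```python
import numpy as np

# Per-element model probabilities q(z_i | z_<i, w) and q(w_j | w_<j);
# the values are made up so the code lengths come out exact.
q_z = np.array([0.25, 0.5, 0.125])
q_w = np.array([0.5, 0.5])

bits_z = -np.log2(q_z).sum()   # 2 + 1 + 3 bits
bits_w = -np.log2(q_w).sum()   # 1 + 1 bits
assert bits_z == 6.0 and bits_w == 2.0
```

This is the quantity visualized per element in Figure 10: sharper (higher-probability) predictions from the autoregressive model directly shorten the bitstream.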
In our experiments, we use a conditional Gaussian model
$$ q(\tilde z_i \,|\, \tilde z_{1:i-1}, \tilde w) = \frac{1}{\sigma_i}\, \phi\!\left( \frac{\tilde z_i - \mu_i}{\sigma_i} \right), $$
where $\mu_i$ and $\sigma_i$ are the mean and the scale for the $i$-th element, and $\phi$ denotes the standard normal density; $\mu_i$ and $\sigma_i$ are parameterized with autoregressive neural networks, e.g., consisting of masked convolutions , which are also conditioned on $\lambda$ as in Figure 4. Similarly, we let $q(\tilde w_j \,|\, \tilde w_{1:j-1})$ be parameterized with a neural network, where the per-element factor is designed as a univariate density model as described in [4, Appendix 6.1].
We illustrate the network architecture that we used in our experiments in Figure 7. We emphasize that all convolution (including masked convolution) blocks employ conditional convolutions (see Figure 4 in Section 3.1).
5 Experiments
Training. For the training dataset, we used the ImageNet ILSVRC 2012 dataset . We resized the training images so that the shorter of the width and the height is , and we extracted patches at random locations. In addition to the ImageNet dataset, we used the training dataset provided in the Workshop and Challenge on Learned Image Compression (CLIC, https://www.compression.cc). For the CLIC training dataset, we extracted patches at random locations without resizing. We used the Adam optimizer  and trained each model for epochs, where each epoch consists of k batches and the batch size is set to . The learning rate was initially set to , and we decreased it to and at and epochs, respectively.
We pre-trained a conditional model that can be conditioned on the different values of the Lagrange multiplier in the pre-defined set $\Lambda$ (see (15)) for fixed bin size $\Delta = 1$. In pre-training, we used the MSE loss. Then, we re-trained the model with mixed bin sizes; the quantization bin size $\Delta = 2^b$ is selected randomly, where $b$ is drawn uniformly between $-1$ and $1$ so that we cover $\Delta \in [2^{-1}, 2^1]$. In the re-training with mixed bin sizes, we used one of the MSE, MS-SSIM, and combined MSE+MS-SSIM losses (see Figure 9). We used the same training datasets and the same training procedure for pre-training and re-training. We also trained multiple fixed-rate models for fixed $\lambda$ and fixed $\Delta$ for comparison.
Experimental results. We compare the performance of our variable-rate model with the state-of-the-art learned image compression models from [19, 15, 4, 16, 9, 14] and the classical state-of-the-art variable-rate image compression codec, BPG , on the Kodak image set . Some of the previous models were optimized for MSE, and some of them were optimized for a perceptual measure, MS-SSIM; thus, we compare the two measures separately in Figure 8. In particular, we included the results for the RNN-based variable-rate compression model in , which were obtained from . All the previous works in Figure 8, except , trained multiple networks to obtain the multiple points of their R-D curves.
For our variable-rate model, we plotted curves of the same blue color for PSNR and MS-SSIM, respectively, in Figure 8. Each curve corresponds to one of the Lagrange multiplier values in (15). For each $\lambda$, we varied the quantization bin size $\Delta$ in $[2^{-1}, 2^1]$ to get each curve. Our variable-rate model outperforms BPG in both PSNR and MS-SSIM. It also performs comparably to, and in some cases better than, the state-of-the-art learned image compression models [16, 14] that outperform BPG by deploying multiple networks trained for varying rates.
Our model shows superior performance over the RNN-based variable-rate model in . The RNN-based model requires multiple encoding/decoding iterations at high rates, implying that the complexity increases as more iterations are needed to achieve better quality. In contrast, our model uses a single iteration, i.e., the encoding/decoding complexity is fixed, at any rate. Moreover, our model can produce any point on the R-D curve with arbitrarily fine resolution by tuning the continuous rate-adaptive parameter, the quantization bin size $\Delta$, whereas the RNN-based model can produce only finitely many points on the R-D curve, depending on how many bits it encodes in each recurrent stage.
In Figure 9, we compare our variable-rate networks optimized for the MSE, MS-SSIM, and combined MSE+MS-SSIM losses, respectively. We also plotted the results from our fixed-rate networks trained for fixed $\lambda$ and $\Delta$. Observe that our variable-rate network performs very close to the ones individually optimized for fixed $\lambda$ and $\Delta$. Here, we emphasize that our variable-rate network optimized for MSE performs better than BPG in both PSNR and MS-SSIM measures.
|# bits assigned to $\tilde z$ in arithmetic coding|
|# bits assigned to $\tilde w$ in arithmetic coding|
| Bits per pixel (BPP) | 1.8027 | 0.8086 | 0.6006 | 0.4132 | 0.1326 |
Figure 10 shows the compressed images generated from our variable-rate model to assess their visual quality. We also depicted the number of bits (implicitly) used to represent each element of $\tilde z$ and $\tilde w$ in arithmetic coding, which are $-\log_2 q(\tilde z_i \,|\, \tilde z_{1:i-1}, \tilde w)$ and $-\log_2 q(\tilde w_j \,|\, \tilde w_{1:j-1})$, respectively, in (12)–(14). We randomly selected two and four channels from $\tilde z$ and $\tilde w$, respectively, and showed the code length for each latent representation value in the figure. As we change the conditioning parameters $\lambda$ and $\Delta$, we can adapt the arithmetic code length that determines the rate of the latent representation. Observe that the smaller the values of $\lambda$ and/or $\Delta$, the more bits the resulting latent representation requires in arithmetic coding and the higher the rate, as expected.
This paper proposed a variable-rate image compression framework with a conditional autoencoder. Unlike the previous learned image compression methods that train multiple networks to cover various rates, we train and deploy one variable-rate model that provides two knobs to control the rate, i.e., the Lagrange multiplier and the quantization bin size, which are given as input to the conditional autoencoder model. Our experimental results showed that the proposed scheme provides better performance than the classical image compression codecs such as JPEG2000 and BPG. Our method also showed comparable and sometimes better performance than the recent learned image compression methods that outperform BPG but need multiple networks trained for different compression rates. We finally note that the proposed conditional neural network can be adopted in deep learning not only for image compression but also in general to solve any optimization problem that can be formulated with the method of Lagrange multipliers.
- (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pp. 1141–1151.
- (2016) Density modeling of images using a generalized normalization transformation. In International Conference on Learning Representations.
- (2017) End-to-end optimized image compression. In International Conference on Learning Representations.
- (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
- (2014) BPG image format. https://bellard.org/bpg
- (2018) Universal deep neural network compression. In NeurIPS Workshop on Compact Deep Neural Network Representation with Industrial Applications (CDNNRIA).
- (2012) Elements of information theory. John Wiley & Sons.
- (2001) Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18 (5), pp. 9–21.
- (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
- (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
- (1993) Kodak lossless true color image suite (PhotoCD PCD0992). http://r0k.us/graphics/kodak
- (2015) Deep learning. Nature 521 (7553), pp. 436–444.
- (2019) Context-adaptive entropy model for end-to-end optimized image compression. In International Conference on Learning Representations.
- (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
- (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10794–10803.
- (1998) Rate-distortion methods for image and video compression. IEEE Signal Processing Magazine 15 (6), pp. 23–50.
- (2002) JPEG2000: image compression fundamentals, standards and practice. Journal of Electronic Imaging 11 (2), pp. 286.
- (2017) Real-time adaptive image compression. In Proceedings of the International Conference on Machine Learning, pp. 2922–2930.
- (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
- (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668.
- (2017) Lossy image compression with compressive autoencoders. In International Conference on Learning Representations.
- (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
- (2018) VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223.
- (2016) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
- (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv.
- (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402.
- (1992) On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory 38 (2), pp. 428–436.
- (1985) On universal quantization. IEEE Transactions on Information Theory 31 (3), pp. 344–347.
A Comparison of our refined probabilistic model to 
B More example images
As supplementary material, we provide more example images produced by our variable-rate image compression network optimized for the MSE loss. We compare our method with the classical image compression codecs, i.e., JPEG, JPEG2000, and BPG. We adapt and match the compression rate of our variable-rate network to the rate of BPG by adjusting the Lagrange multiplier $\lambda$ and the quantization bin size $\Delta$. All the examples show that our method outperforms the state-of-the-art BPG codec in both PSNR and MS-SSIM measures at the same bits per pixel (BPP). Visually, our method provides better quality with fewer artifacts than the classical image compression codecs. We put orange boxes to highlight the visual differences in Figure 11 and Figure 13, and the orange-boxed areas are magnified in Figure 12 and Figure 14, respectively.
| Ground truth | Ours: BPP 0.2078, PSNR 32.4296 dB, MS-SSIM 0.9543 | BPG (4:4:4): BPP 0.2078, PSNR 32.0406 dB, MS-SSIM 0.9488 |
| JPEG2000: BPP 0.2092, PSNR 30.9488 dB, MS-SSIM 0.9342 | JPEG: BPP 0.2098, PSNR 28.1758 dB, MS-SSIM 0.8777 |

| Ground truth | Ours | BPG (4:4:4) | JPEG2000 | JPEG |
| Bits per pixel (BPP) | 0.2078 | 0.2078 | 0.2092 | 0.2098 |
| Ours: BPP 0.1289, PSNR 34.4543 dB, MS-SSIM 0.9695 | BPG (4:4:4): BPP 0.1289, PSNR 33.3546 dB, MS-SSIM 0.9593 |
| JPEG2000: BPP 0.1298, PSNR 31.8927 dB, MS-SSIM 0.9482 | JPEG: BPP 0.1299, PSNR 27.1270 dB, MS-SSIM 0.8404 |

| Ground truth | Ours | BPG (4:4:4) | JPEG2000 | JPEG |
| Bits per pixel (BPP) | 0.1289 | 0.1289 | 0.1298 | 0.1299 |