1 Introduction
In the last few years, deep learning-based image compression [1, 2, 3, 4, 5, 6, 7] has made tremendous progress and has achieved better performance than JPEG2000 and the H.265/HEVC-based BPG image codec [8].
In [3], a generalized divisive normalization (GDN)-based scheme was proposed. The encoding network consists of three stages of convolution, subsampling, and GDN layers. Despite its simple architecture, it outperforms JPEG2000 in both PSNR and SSIM. A compressive autoencoder (AE) framework with residual connections as in ResNet was proposed in [5], where the quantization was replaced by a smooth approximation, and a scaling approach was used to get different rates. In [1], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for both image compression and neural network model compression.
In [2], a deep semantic segmentation-based layered image compression (DSSLIC) scheme was proposed, by taking advantage of the Generative Adversarial Network (GAN). A low-dimensional representation and segmentation map of the input were encoded. Moreover, the residual between the input and the synthesized image was also encoded. It outperforms the BPG codec (in RGB444 format) in both PSNR and MS-SSIM [9] across a large range of bit rates.
In [4], a context model of entropy coding for end-to-end optimized image compression was proposed, and a hyperprior was augmented with the context. This method represents the state of the art in learning-based image compression and outperforms BPG in terms of both PSNR and MS-SSIM. Another context-based model was proposed in [10], where an importance map for locally adaptive bit rate allocation was employed to handle the spatial variation of image content.
All the above-mentioned methods train multiple networks for multiple bit rates, which increases the implementation complexity. Therefore, a variable-rate approach is desired in many scenarios, in which a single neural network model is trained to operate at multiple bit rates with satisfactory performance [6, 7, 11, 12].
In [6], long short-term memory (LSTM)-based recurrent neural networks (RNNs) and residual-based layer coding were used to compress thumbnail images. Better SSIM results than JPEG and WebP were reported. This approach was generalized in [7], which proposed a variable-rate framework for full-resolution images by introducing a gated recurrent unit, residual scaling, and deep learning-based entropy coding. This method can outperform JPEG in terms of PSNR.
In [11], a CNN-based multi-scale decomposition transform was optimized for all scales. Rate allocation algorithms were also applied to determine the optimal scale of each image block. The results in [11] were reported to be better than BPG in MS-SSIM. In [12], a learned progressive image compression model was proposed, in which bit-plane decomposition was adopted. Bidirectional assembling gated units were also introduced to reduce the correlation between different bit-planes [12].
In this paper, we propose a new deep learning-based variable-rate image compression framework, which employs more GDN layers than [3, 4]. Two novel types of GDN-based residual sub-networks are also developed in the encoder and decoder networks, by incorporating the shortcut connection in ResNet [13]. Our scheme uses the stochastic rounding-based scalable quantization as in [14, 15, 6]. To further improve the performance, we encode the residual between the input and the reconstructed image from the decoder network as an enhancement layer. To enable a single model to operate at different bit rates and to learn multi-rate image features, a new variable-rate objective function is introduced that considers the performance at multiple rates. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms state-of-the-art learning-based variable-rate methods as well as all standard codecs, including the H.265/HEVC-based BPG (in all formats), in terms of the MS-SSIM metric.
2 The Proposed Method
The overall framework of the proposed codec is shown in Fig. 1. At the encoder side, two layers of information are encoded: the encoder network output (code map) and the residual image. The code map is a low-dimensional feature map of the original image, obtained by the deep encoder; it is quantized by a uniform scalable quantizer and then encoded by the FLIF lossless codec [16]. To further improve the performance, the reconstruction of the input image from the quantized code map is obtained using the deep decoder, and the residual between the input and the reconstruction is encoded by the BPG codec as an enhancement layer [2]. At the decoder side, the reconstruction from the deep decoder and the decoded residual image are added to get the final reconstruction.
It has been shown that end-to-end optimization of a model including cascades of differentiable linear-nonlinear transforms performs better than the traditional linear ones [3]. One example is the GDN, which is very efficient in Gaussianizing local statistics of natural images. It also provides significant improvements as a prior for image compression when combined with scalable quantization. GDN was first introduced in [3] for a learning-based image compression framework, which had a simple architecture of 3 downsampling convolutions, each followed by a GDN layer.
The architectures of the proposed deep encoder and decoder networks are shown in Figure 2. Several modifications to the GDN-based schemes in [3, 4] are developed. First, we adopt a deeper architecture including 11 convolution layers, followed by GDN or inverse GDN (IGDN) in the encoder and decoder. Second, for deeper learning of image statistics and faster convergence, the concept of identity shortcut connection in ResNet [13] is introduced to some GDN and IGDN layers, denoted by ResGDN and ResIGDN. The architectures of ResGDN and ResIGDN are shown in Figure 3. Unlike the traditional residual blocks where ReLU and batch (or instance) normalization are employed, we utilize GDN and IGDN layers in our residual blocks, which provide better performance and a faster convergence rate.
2.1 Deep Encoder
Let be the input image, the code map is generated using the parametric deep encoder as: , where is the parameter vector to be optimized. The encoder consists of 5 stages where the input to the th stage is denoted by . The image is then represented as , which is the input to the first stage of the encoder. Each stage begins with a convolution layer as:
(1) 
where and are affine and downsampling convolutions, respectively. Each convolution layer is followed by a GDN operation [3] defined as:
(2) 
where and run over channels and is the spatial location of a specific value of a tensor (e.g., ). Except for the first and last stages, a ResGDN transform is applied at the end of each stage:
(3) 
where is composed of two subsequent pairs of affine convolutions (Eq. 1 with ), each followed by a GDN operation (Eq. 2).
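As an illustration of the GDN operation and the shortcut connection described above, the following NumPy sketch uses the standard GDN form from [3], in which each channel is divided by the square root of a learned offset plus a weighted sum of the squared channel values at the same spatial location. The names `gdn`, `res_gdn`, `beta`, `gamma`, and `branch` are assumptions made for this example and are not the paper's notation; the convolutions inside the residual branch are left as a caller-supplied function.

```python
import numpy as np

def gdn(x, beta, gamma):
    """Standard GDN form applied to a tensor x of shape (C, H, W).

    beta has shape (C,) and gamma has shape (C, C); both are assumed
    to be non-negative learned parameters (hypothetical names).
    """
    # norm_i(m, n) = sqrt(beta_i + sum_j gamma_ij * x_j(m, n)^2)
    pooled = np.einsum("ij,jhw->ihw", gamma, x ** 2)
    norm = np.sqrt(beta[:, None, None] + pooled)
    return x / norm

def res_gdn(x, branch):
    """Identity shortcut around a residual branch.

    `branch` is a hypothetical callable standing in for the two
    (affine convolution, GDN) pairs of the residual sub-network.
    """
    return x + branch(x)
```

With `gamma` set to zero and `beta` set to one, `gdn` reduces to the identity, which is a convenient sanity check when experimenting with initializations.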
2.2 Stochastic RoundingBased Quantization
The output of the last stage of the encoder, , represents the code map with channels. Each channel, denoted by , is then quantized to a discrete-valued vector using a stochastic rounding-based uniform scalar quantizer as: where the function is defined as in [6, 14, 15]: where is produced by a uniform random number generator. and respectively represent the quantization step (scale) and the zero-point, which are defined as:
(4) 
where , is the number of bits, and and are the input’s min and max values over the th channel, respectively. The zero-point is an integer ensuring that zero is quantized with no error, which avoids quantization error in common operations such as zero padding [17].
The stochastic rounding approach provides better performance and a better convergence rate than the round-to-nearest algorithm. Stochastic rounding is an unbiased rounding scheme, which maintains a nonzero probability of retaining small parameter values. In other words, it possesses the desirable property that the expected rounding error is zero. As a consequence, the gradient information is preserved and the network is able to learn with low bits of precision without any significant loss in performance.
Each of the quantized code map channels, denoted by , is separately entropy-coded using the FLIF codec [16], which is a state-of-the-art lossless image codec.
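The quantization step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the exact quantizer of the paper: the scale is taken as the channel range divided by the number of levels minus one, the zero-point follows the common affine-quantization convention of [17], and the function names (`stochastic_round`, `quantize`, `dequantize`) are hypothetical. It operates on a single channel whose range is assumed to contain zero.

```python
import numpy as np

def stochastic_round(a, rng):
    """Round down or up at random so that the expected result equals a."""
    floor = np.floor(a)
    eps = rng.uniform(0.0, 1.0, size=a.shape)  # uniform in [0, 1)
    # Round up with probability equal to the fractional part of a.
    return floor + (eps < (a - floor))

def quantize(y, bits, rng):
    """Quantize one channel y to integers in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    y_min, y_max = float(y.min()), float(y.max())
    scale = max((y_max - y_min) / levels, 1e-12)  # guard against a flat channel
    zero_point = int(round(-y_min / scale))       # integer code assigned to 0.0
    q = stochastic_round(y / scale, rng) + zero_point
    q = np.clip(q, 0, levels).astype(np.int64)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to real values."""
    return (q - zero_point) * scale
```

Because an input of exactly zero has no fractional part, it always maps to the integer `zero_point` and dequantizes back to zero, which is the property the zero-point is meant to guarantee.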
2.3 Deep Decoder
At the decoder side, the quantized code map is dequantized using the following function: Given the dequantized code map , the parametric decoder (with the parameter vector ) reconstructs the image as follows: . Similar to the encoder, the decoder network is composed of 5 stages in which all the operations are reversed. Each stage at the decoder network starts with an IGDN operation computed as follows:
(5) 
which is followed by a convolution layer defined as:
(6) 
where denotes transposed convolution used for upsampling the input tensor. As the reverse of the encoder, each convolution at the middle stages is followed by a ResIGDN transform defined as:
(7) 
where consists of two subsequent pairs of an IGDN operation (Eq. 5) followed by an affine convolution (Eq. 6 with ). The reconstructed image is finally obtained from the output of the decoder, represented by .
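For reference, the inverse GDN used on the decoder side can be sketched in the same way as GDN, with the division replaced by a multiplication. The form below is the standard IGDN from [3]; the names `igdn`, `beta`, and `gamma` are assumptions for this example, and the decoder's parameters are learned separately from the encoder's.

```python
import numpy as np

def igdn(x, beta, gamma):
    """Standard inverse GDN form applied to a tensor x of shape (C, H, W).

    beta has shape (C,) and gamma has shape (C, C); both are assumed
    to be non-negative learned parameters (hypothetical names).
    """
    # scale_i(m, n) = sqrt(beta_i + sum_j gamma_ij * x_j(m, n)^2)
    pooled = np.einsum("ij,jhw->ihw", gamma, x ** 2)
    return x * np.sqrt(beta[:, None, None] + pooled)
```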
2.4 Residual Coding
As an enhancement layer to the bitstream, the residual between the input image and the deep decoder’s output is further encoded by the BPG codec, as in [2]. To do this, the minimal and the maximal values of the residual image are first obtained, and the range between them is rescaled to [0, 255], so that we can call the BPG codec directly to encode it as a regular 8-bit image. The minimal and maximal values are also sent to the decoder for inverse scaling after BPG decoding.
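The rescaling around the BPG call can be sketched as follows. The function names are hypothetical, and the BPG encoding and decoding steps themselves are omitted, since they are performed by the external codec.

```python
import numpy as np

def rescale_residual(residual):
    """Map a real-valued residual image to 8-bit values in [0, 255]."""
    r_min, r_max = float(residual.min()), float(residual.max())
    span = max(r_max - r_min, 1e-12)  # guard against a constant residual
    img8 = np.round((residual - r_min) / span * 255.0).astype(np.uint8)
    # r_min and r_max are side information sent to the decoder.
    return img8, r_min, r_max

def restore_residual(img8, r_min, r_max):
    """Inverse scaling applied after the 8-bit image has been decoded."""
    return img8.astype(np.float64) / 255.0 * (r_max - r_min) + r_min
```

Even before any lossy coding, the rounding to 8 bits introduces an error of at most half a step, i.e. the residual range divided by 510.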
2.5 Objective Function and Optimization
Our cost function is a combination of an L2-norm loss, denoted by , and an MS-SSIM loss [18], denoted by , as follows:
(8) 
where and are the optimization parameter vectors of the deep encoder and decoder, respectively, each defined as the full set of its parameters across all 5 stages as: and where . In order to optimize the parameters such that our codec can operate at a variety of bit rates, we propose the following novel variable-rate objective functions for the L2-norm and MS-SSIM losses:
(9) 
where denotes the reconstructed image with bit quantizer (Eq. 4), and can take all possible values in a set . In this paper, is used for training the variable-rate network model. The MS-SSIM metric uses luminance , contrast , and structure to compare the pixels and their neighborhoods in and . Moreover, MS-SSIM operates at multiple scales where the images are iteratively downsampled by factors of , for . Note that other methods optimize for PSNR and MS-SSIM separately in order to get better performance in each of them. Our scheme jointly optimizes for both of them. It will be shown later that we can still achieve satisfactory results in both metrics.
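The idea of summing the distortion over several quantizer bit depths can be sketched as below for the L2 term only. This is a simplified illustration, not the paper's objective: mean squared error stands in for the L2-norm loss, a deterministic round-to-nearest quantizer replaces stochastic rounding so that the result is reproducible, the MS-SSIM term is omitted, and `encode`, `decode`, `uniform_qdq`, and `variable_rate_l2` are hypothetical names.

```python
import numpy as np

def uniform_qdq(y, bits):
    """Quantize and immediately dequantize y with a round-to-nearest quantizer."""
    levels = 2 ** bits - 1
    y_min, y_max = float(y.min()), float(y.max())
    scale = max((y_max - y_min) / levels, 1e-12)
    return np.round((y - y_min) / scale) * scale + y_min

def variable_rate_l2(x, encode, decode, bit_set):
    """Sum the reconstruction error over every bit depth in bit_set."""
    y = encode(x)  # the code map is computed once and shared by all rates
    total = 0.0
    for bits in bit_set:
        x_hat = decode(uniform_qdq(y, bits))
        total += float(np.mean((x - x_hat) ** 2))
    return total
```

A call such as `variable_rate_l2(x, encode, decode, bit_set=(2, 4, 8))` accumulates one error term per bit depth; the bit depths in this call are placeholder values chosen for illustration.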
Our goal is to minimize the objective over the continuous parameters . However, both depend on the quantized values of whose derivative is discontinuous, which makes the quantizer non-differentiable [3]. To overcome this issue, the fact that the exact derivatives of discrete variables are zero almost everywhere is considered, and the straight-through estimator (STE) approach in [19] is employed to approximate the differentiation through discrete variables in the backward pass.
3 Experimental Results
The ADE20K dataset [20] was used for training the proposed model. The images with at least 512 pixels in height or width were used (9272 images in total). All images were rescaled to and to have a fixed size for training. We set the downsampling factor and the channel size to get the code map of size 32×32×8. The deep encoder and decoder models were jointly trained for 200 epochs with mini-batch stochastic gradient descent (SGD) and a mini-batch size of 16. The learning rate of the Adam solver was fixed at 0.00002 for the first 100 epochs and was gradually decreased to zero over the next 100 epochs. All the networks were trained in the RGB domain. The model was trained using 3 different bit rates, i.e., in Eq. 9. However, the trained model can operate at any bit rate in range at test time.
In this section, we compare the performance of the proposed scheme with standard codecs (including JPEG, JPEG2000, WebP, and the H.265/HEVC intra coding-based BPG codec [8]) and the state-of-the-art learning-based variable-rate image compression methods in Cai2018 [11], Zhang2019 [12], and Toderici2017 [7], in which a single network was trained to generate multiple bit rates. We use both PSNR and MS-SSIM [9] as the evaluation metrics.
The comparison results on the popular Kodak image set (averaged over 24 test images) are shown in Figure 4. Different points on the rate-distortion (RD) curve of our variable-rate results are obtained from 5 different bit rates for the code maps in the base layer, i.e., . The corresponding residual images in the enhancement layer are coded by BPG (YUV4:4:4 format) with quantizer parameters of respectively. As shown in Figure 4, our method outperforms the state-of-the-art learning-based variable-rate image compression models and JPEG2000 in terms of both PSNR and MS-SSIM. Our PSNR results are slightly lower than BPG (in both YUV4:2:0 and YUV4:4:4 formats), but we achieve better MS-SSIM, especially at low rates.
The BPG-based residual coding in our scheme is exploited to avoid retraining the entire model for another bit rate and, more importantly, to boost the quality at high BPPs (bits/pixel). For the 5 points (low to high) in Figure 4, the percentage of BPPs used by the residual image is {2%, 34%, 52%, 68%, 76%}. This shows that as the BPP increases, the residual coding contributes more significantly to the RD performance.
One visual example from the Kodak image set is given in Figure 5, in which our results are compared with JPEG2000 and BPG (YUV4:4:4 format). As seen in the example, our proposed method provides the highest visual quality compared to the others. JPEG2000 has poor performance due to the ringing artifacts at edges. The BPG result is smoother compared to JPEG2000, but the details and fine structures (e.g., the grass on the ground) are not preserved in many areas.
In order to evaluate the performance of different components of the proposed framework, several ablation studies were performed. The results are shown in Figure 6.
Code map channel size: Figure 6 shows the results with channel sizes of 4 and 8 (i.e., ). It can be seen that has better results than . In general, we find that a larger code map channel size with smaller quantization bits provides a better performance, because deeper texture information of the input is preserved within the feature map.
GDN vs. ReLU: In order to show the performance of the proposed GDN-based architecture, we compare it with a ReLU-based variant of our model, denoted as . In this model, all GDN and IGDN layers in the deep encoder and decoder are removed; instead, instance normalization followed by ReLU is added to the end of all convolution layers. The last GDN and IGDN layers in the encoder and decoder are replaced by a Tanh layer. As shown in Fig. 6, the models with GDN structure outperform the ones without GDN.
Conventional vs. GDN/IGDN-based residual transforms: In this scenario, the model composed of the proposed ResGDN/ResIGDN transforms (denoted by ResGDN in Figure 6) is compared with the conventional ReLU-based residual block (denoted by ) in which all the GDN/IGDN layers are replaced by ReLU. The results with neither ResReLU nor ResGDN (yellow blocks in Figure 1), denoted by , are also included. As demonstrated in Figure 6, ResGDN achieves better performance compared to the other scenarios.
In terms of complexity, the average processing times of the deep encoder, quantizer, and deep decoder on Kodak are 65 ms, 2 ms, and 51 ms on a TITAN X Pascal GPU, respectively.
4 Conclusion
In this paper, we propose a new variable-rate image compression framework by applying more GDN layers and incorporating the shortcut connection in ResNet. We also use a stochastic rounding-based scalable quantization. To further improve the performance, the residual between the input and the reconstructed image from the decoder network is encoded by BPG as an enhancement layer. A novel variable-rate objective function is proposed. Experimental results show that our variable-rate model can outperform all standard codecs including BPG, as well as state-of-the-art learning-based variable-rate methods, in terms of MS-SSIM. A future topic is the rate allocation optimization between the base layer and the enhancement layer.
References
 [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” arXiv preprint arXiv:1704.00648, 2017.
 [2] M. Akbari, J. Liang, and J. Han, “DSSLIC: Deep semantic segmentation-based layered image compression,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 2042–2046.
 [3] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in Picture Coding Symposium, 2016, pp. 1–5.
 [4] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.
 [5] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.
 [6] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.
 [7] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
 [8] F. Bellard, “BPG image format (http://bellard.org/bpg/),” 2017.
 [9] Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for image quality assessment,” in Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, vol. 2, pp. 1398–1402.
 [10] M. Li, W. Zuo, S. Gu, J. You, and D. Zhang, “Learning content-weighted deep image compression,” arXiv preprint arXiv:1904.00664, 2019.
 [11] C. Cai, L. Chen, X. Zhang, and Z. Gao, “Efficient variable rate image compression with multi-scale decomposition network,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
 [12] Z. Zhang, Z. Chen, J. Lin, and W. Li, “Learned scalable image compression with bidirectional context disentanglement network,” in IEEE International Conference on Multimedia and Expo. IEEE, 2019, pp. 1438–1443.
 [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, 2015, pp. 1737–1746.
 [15] T. Raiko, M. Berglund, G. Alain, and L. Dinh, “Techniques for learning binary stochastic feedforward neural networks,” in International Conference on Learning Representations, 2015.
 [16] J. Sneyers and P. Wuille, “FLIF: Free lossless image format based on MANIAC compression,” in IEEE International Conference on Image Processing, 2016, pp. 66–70.
 [17] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
 [18] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
 [19] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
 [20] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, vol. 1, p. 4.