I Introduction
Image compression aims to represent an image with as few coding bits as possible while preserving the highest achievable pixel-level reconstruction quality. Recently, deep learning (DL) based image compression has become an emerging topic owing to its end-to-end optimization ability. Multiple learning-based image codecs have been proposed at the intersection of deep learning and image coding. Essentially, the deep models are trained to learn the image-to-image mapping between the pristine image and the reconstructed image under a rate-distortion (RD) learning objective.
Different from conventional image coding formats such as JPEG [1], JPEG2000 [2] and BPG [3] (based on High Efficiency Video Coding, HEVC [4]), which use separate submodules for prediction and transform coding, deep codecs formulate the end-to-end learnt network as transform coding [5]. The most representative models adopt an autoencoder (AE) like structure, which generates latent representations that are quantized and entropy coded. For example, Toderici et al. [6] proposed an end-to-end image coding model based on the recurrent neural network (RNN), where the original and residual images are iteratively compressed using an RNN structure. Each RNN iteration produces a better reconstruction at a commensurate bitrate (2 bits) cost. However, the RNN-based method may have limitations in representing high-frequency residuals. Additionally, the lack of explicit entropy estimation during RNN training also constrains its overall RD performance. To address this issue, Theis et al. [7] presented a convolutional neural network (CNN) based AE structure, where the entropy model was approximated using a Gaussian distribution during optimization. In [8], an inpainting-based learning approach was proposed for image compression. Generative adversarial network (GAN) based learning strategies were embedded in the CNN-based framework [9, 10] to improve the perceptual quality of the reconstructed images. In [11], generalized divisive normalization (GDN) [12] was introduced as a substitute for the nonlinear activation in a variational autoencoder (VAE) to decorrelate the channel-wise dependency among latent representations, which significantly improves the coding performance to be competitive with the JPEG2000 standard. Compared with other activation functions, the core advantage of GDN is its full reversibility, which guarantees nearly no information loss for the transform coding. More recently, in [13], a novel distribution parameter estimation method was proposed for entropy coding, which brought additional coding gain.
All the aforementioned approaches have to train a separate autoencoder for each target bit rate, which limits their applicability when the desired bit rate has to be adapted in real time, and/or when it is impractical to store multiple trained models. Inspired by the conventional scalable coding paradigm [14], we propose a scalable autoencoder (SAE) based deep image coding method that addresses this problem by iteratively and incrementally coding the errors using end-to-end trained autoencoders. By cascading the bitstreams generated by each layer of the SAE, variable bitrates or layered bit streams can be obtained while maintaining good RD performance. Furthermore, the proposed SAE structure is compatible with any other learning-based image codec, since each layer of the SAE can be substituted by a different AE-based deep codec.
II Proposed Method
The overall flowchart of the proposed framework is depicted in Fig. 1. The model contains several stacked modules, each of which is an end-to-end optimized AE. The original image is first compressed by the AE in the base layer. Subsequently, each enhance layer takes the difference between the original image and the latest reconstructed image as input and compresses this residue. We train the entire framework in a layer-by-layer manner, i.e., we fix all previous layers when training the current one. The design of the base layer and the enhance layers is described in more detail in the following subsections.
II-A Base Layer
The base layer in the proposed SAE mainly compresses the coarse image content and basic texture information. Given the original image $x$, the encoder $f_e^{(0)}$ of the base-layer AE maps $x$ into the latent representation $y_0$:
$$y_0 = f_e^{(0)}(x) \qquad (1)$$
Subsequently, the quantizer $Q(\cdot)$ is applied to the latent representation to obtain the quantized latent feature $\hat{y}_0 = Q(y_0)$. The bits required to code $\hat{y}_0$ can be approximated by its entropy, which is determined from $P_{\hat{y}_0}$, the marginal probability mass function of $\hat{y}_0$. Following [11], for the purpose of end-to-end differentiability of the loss function, we replace the quantizer during training by adding white noise, uniformly distributed in $(-\tfrac{1}{2}, \tfrac{1}{2})$, to the latent feature $y_0$. We considered two different training objectives. The first optimizes the objective quality measured by the mean square error (MSE), and is trained using the following loss function,
$$L_0^{\mathrm{MSE}} = \mathrm{MSE}(x, \hat{x}_0) + \lambda_0 R(\hat{y}_0) \qquad (2)$$
where $\lambda_0$ is the Lagrange multiplier, $R(\hat{y}_0)$ is the rate of the quantized latent feature, $\hat{x}_0 = f_d^{(0)}(\hat{y}_0)$ is the reconstructed image of the base layer, and $f_d^{(0)}$ represents the base-layer decoder of the AE (shown in Fig. 2). The second training objective optimizes for perceptual quality measured by MS-SSIM [15], which is
$$L_0^{\mathrm{MS\text{-}SSIM}} = D_{\mathrm{MS\text{-}SSIM}}(x, \hat{x}_0) + \lambda'_0 R(\hat{y}_0) \qquad (3)$$
where $D_{\mathrm{MS\text{-}SSIM}}$ is based on [15] and $\lambda'_0$ is the Lagrange multiplier for MS-SSIM. To estimate the rate $R$, we deploy the state-of-the-art entropy model described in [13].
Following [11], the encoder $f_e^{(0)}$ has three convolution layers, each followed by downsampling, with the filter sizes and downsampling step sizes of [11]. GDN is used as the nonlinearity after each convolution layer. The decoder $f_d^{(0)}$ mirrors the structure of $f_e^{(0)}$. The AE structure is depicted in Fig. 2 and is the same as in [11]. Note that the same structure is used in each subsequent enhance layer; the only difference is the number of channels for the latent features, which will be described later.
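The noise-based relaxation of the quantizer and the rate-distortion objective described in this subsection can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the rate is estimated from the empirical symbol histogram rather than from the learned entropy model of [13], and the loss is assumed to take the form distortion plus a multiplier times rate.

```python
import numpy as np

def relaxed_quantize(y, rng):
    """Training-time proxy for rounding: add uniform noise in (-1/2, 1/2)."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

def hard_quantize(y):
    """Test-time quantizer: round each latent value to the nearest integer."""
    return np.round(y)

def empirical_entropy_bits(q):
    """Total bits needed to code the symbols in q under their empirical pmf.

    This stands in for the learned entropy model (an assumption of this sketch).
    """
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(counts * np.log2(p)).sum())

def rd_loss(x, x_hat, rate, lam):
    """Assumed objective: MSE distortion plus lam times the rate."""
    mse = float(np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2))
    return mse + lam * rate

# Toy latent tensor (random values, not produced by a trained encoder).
rng = np.random.default_rng(0)
latent = rng.normal(scale=3.0, size=(4, 4))
noisy = relaxed_quantize(latent, rng)   # differentiable surrogate used in training
symbols = hard_quantize(latent)         # what would actually be entropy coded
bits = empirical_entropy_bits(symbols)
```

The noisy latent stays within half a quantization step of the original value, which is why it serves as a differentiable stand-in for rounding during training.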
II-B Enhance Layers
In the proposed framework, the enhance layers are responsible for iteratively encoding the residues between the original image and the reconstructed image from the previous layers. As shown in Fig. 1, the first enhance layer (enhance1) takes as input the error between the original image $x$ and the reconstructed image $\hat{x}_0$ of the base layer. The formulation of enhance1 can be represented as follows,
$$\hat{r}_1 = f_d^{(1)}\big(Q\big(f_e^{(1)}(x - \hat{x}_0)\big)\big) \qquad (4)$$
where $f_e^{(1)}$ and $f_d^{(1)}$ denote the encoder and decoder of enhance1 and $Q$ is the quantizer. To acquire the reconstruction from the base layer and enhance1, we simply add the outputs of the two layers ($\hat{x}_1 = \hat{x}_0 + \hat{r}_1$). As such, the reconstruction quality is enhanced by the error coded in the enhance layer.
For the subsequent enhance layers, taking the second enhance layer (enhance2) as an example, the input is the residue between the original image $x$ and the reconstruction from the latest layer, $\hat{x}_1$. The reconstructed residue can be described as,
$$\hat{r}_2 = f_d^{(2)}\big(Q\big(f_e^{(2)}(x - \hat{x}_1)\big)\big) \qquad (5)$$
In this case, the reconstruction for enhance2 is obtained by adding the corresponding outputs of the three AEs ($\hat{x}_2 = \hat{x}_0 + \hat{r}_1 + \hat{r}_2$).
The loss function for training the $i$-th enhance layer has the following form,
$$L_i = D(x, \hat{x}_i) + \lambda_i R(\hat{y}_i), \qquad \hat{x}_i = \hat{x}_{i-1} + \hat{r}_i \qquad (6)$$
where $D$ is the distortion (MSE or MS-SSIM based), $\hat{y}_i$ is the quantized latent feature of the $i$-th layer, and $\lambda_i$ is the Lagrange multiplier for the $i$-th layer. As in base layer training, uniform noise is added to the latent variables when training the enhance layers, to relax the loss functions. To obtain rate-distortion optimality, we tried different combinations of the $\lambda_i$'s, which are detailed in the next section.
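The layered residual coding loop described in this subsection can be sketched as follows. Each trained AE is replaced here by a hypothetical stand-in, `toy_codec`, which is just a uniform quantizer with a given step size; only the control flow (code the residue, add the decoded residue to the running reconstruction) reflects the text, and the step sizes are illustrative values.

```python
import numpy as np

def toy_codec(signal, step):
    """Hypothetical stand-in for one trained AE layer: uniform quantization."""
    return np.round(signal / step) * step

def layered_reconstructions(x, steps):
    """Return the cumulative reconstruction after the base layer and each enhance layer."""
    recon = np.zeros_like(x, dtype=float)
    outputs = []
    for step in steps:
        residual = x - recon                        # input to the current layer
        recon = recon + toy_codec(residual, step)   # add the decoded residue
        outputs.append(recon.copy())
    return outputs

# Toy "image" and three layers with progressively finer quantization.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 255.0, size=(8, 8))
layer_steps = [64.0, 16.0, 4.0]
recons = layered_reconstructions(x, layer_steps)
mses = [float(np.mean((x - r) ** 2)) for r in recons]
```

Because each layer only sees what the previous layers failed to reconstruct, truncating the list of layers at any point still gives a valid, coarser reconstruction, which is what makes the bitstream scalable.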
III Training Details
This section presents the training details of the proposed SAE-based image codec. A subset of the ImageNet [16] database, containing 5500 RGB images, is used for training. Another 300 images are used for validation. Training is considered converged when the loss on the validation images becomes stable. We randomly crop a fixed-size region from each training sample to prevent boundary issues. The hyperparameters used during training are listed in Table I. For each SAE layer, we need to train a pair of encoder and decoder as well as an entropy model, iteratively. MSE lr denotes the fixed learning rate for the AE and entropy model in Eq. (2), and Rate lr is the initial learning rate of the entropy model. Rate lr decays exponentially with a factor of 0.96 at a fixed interval of iterations during training. Adam [17] is used as the optimizer for both the AE and the entropy model.
We train each layer successively: given the previous $(i-1)$ layers, we search for the hyperparameters of the $i$-th SAE layer, including the number of feature maps and $\lambda_i$, that achieve the best rate-distortion trade-off.
Table I: Training hyperparameters.
Parameter Name         Value
MSE lr                 0.0001
Rate lr                0.001
Optimizer              Adam [17]
epochs                 1000
batch size             8
training image size
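The exponential learning-rate decay mentioned in the training description can be written as a small schedule function. This is a sketch under assumptions: `decay_steps` is a hypothetical parameter (the decay interval is not given in the text), the value 1000 used below is purely illustrative, and the staircase behaviour is one common convention rather than something the paper specifies.

```python
def decayed_lr(initial_lr, step, decay_steps, decay_rate=0.96, staircase=True):
    """Exponentially decayed learning rate.

    With staircase=True the rate drops by decay_rate once every decay_steps
    iterations; otherwise it decays continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

# Initial rate 0.001 as listed for "Rate lr"; the interval of 1000 is hypothetical.
schedule = [decayed_lr(0.001, s, decay_steps=1000) for s in (0, 999, 1000, 2000)]
```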
III-A Base Layer Training
To achieve a good rate-distortion trade-off for the base layer, and to use as few channels as possible for reduced complexity, we varied the number of feature maps in each of the three convolution layers in the encoder and decoder of the AE, as well as the value of the Lagrange multiplier $\lambda$. We found that using 48 feature maps in all three layers and $\lambda = 3000$ achieved the best result for the MSE-oriented optimization. For the MS-SSIM-oriented optimization, $\lambda = 50$ achieved the best result.
III-B Enhance Layer Training
We train each subsequent enhance layer while fixing the lower layers at their optimized states. We found that the number of feature maps should increase as more enhance layers are added. We suspect that this is because the distribution of the errors becomes more similar to random noise as the enhance layers go deeper, so that more parameters are needed to model and capture such a distribution. The parameter $\lambda$, which balances the contributions of the entropy term and the MSE term in Eq. (6), should decrease layer by layer since more emphasis should be put on minimizing the MSE. Recall that ideally $\lambda$ should be equal to the negative slope of the MSE vs. rate curve, and this slope reduces as the rate increases. Table II summarizes the $\lambda$ values and the number of feature maps for each layer in the trained SAE. These values were selected by an exhaustive search over different combinations of the parameters when training the SAE structure.
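Table II: Lagrange multipliers and number of feature maps for each SAE layer.
Layers                    Base   e1     e2    e3    e4
$\lambda$ (for MSE)       3000   1000   300   100   30
$\lambda$ (for MS-SSIM)   50     30     10    0.5   -
Feature Map Number        48     48     96    144   192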
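The statement that the multiplier should ideally equal the negative slope of the MSE vs. rate curve follows from a one-line optimality condition. The derivation below is a sketch that assumes the per-layer objective has the form distortion plus multiplier times rate; $D(R)$, the best distortion achievable at rate $R$, and $J$ are symbols introduced here for illustration.

```latex
\begin{aligned}
J(R) &= D(R) + \lambda R, \\
\frac{dJ}{dR} = 0 \;&\Longrightarrow\; \lambda = -\frac{dD(R)}{dR}.
\end{aligned}
```

For a convex, decreasing $D(R)$, the magnitude of the slope shrinks as $R$ grows, which is consistent with the multiplier decreasing from the base layer to the deeper enhance layers.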
III-C Entropy Model
In this paper, we reuse the entropy model proposed in [13], which incorporates a hyperprior to effectively capture spatial dependencies in the latent representation generated by each layer of the proposed SAE. Particularly, we deploy different entropy models for different layers in the proposed SAE.
IV Experimental Results
For training and testing, we used the popular DL library TensorFlow [18] and the TensorFlow Compression submodule [19], which is an implementation of [13].
IV-A Experiment Setup
To evaluate the efficiency of the proposed SAE-based image codec, we test the proposed model on the widely used Kodak Lossless True Color Image dataset [20], which contains 24 true color images with resolution $768 \times 512$. The results presented in this section are averages over these 24 images. It is worth noting that all the experiments and comparisons are based on three-channel true color images. The test environment is an Intel i5-7200U CPU with 16 GB RAM and an NVIDIA GTX 1050 Ti GPU.
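The two quantities reported per image in such an evaluation, PSNR and bits per pixel (bpp), can be computed as in the sketch below. The helper names are hypothetical, a peak value of 255 assumes 8-bit images, and the MSE is taken jointly over all three color channels, which is one common convention and not necessarily the exact protocol used for the reported results.

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB, with MSE taken over all pixels and channels."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def bits_per_pixel(num_bytes, height, width):
    """Bitrate of a compressed image in bits per pixel."""
    return 8.0 * num_bytes / (height * width)

def dataset_average(values):
    """Average a per-image metric over the whole test set."""
    return float(np.mean(values))

# Synthetic example: an 8-bit RGB image and a copy that is off by one level everywhere.
image = np.full((512, 768, 3), 100, dtype=np.uint8)
decoded = image + 1
example_psnr = psnr(image, decoded)
example_bpp = bits_per_pixel(49152, 512, 768)
```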
IV-B Rate-distortion Performances
We compare our proposed SAE coder against the algorithms in [11] and [13]. Our method and [11] have the same structure for the encoder and decoder, as illustrated in Fig. 2, while [13] uses one more convolution layer on top of [11]. However, the numbers of feature maps differ among these methods. [11] used 192 feature maps in each layer, while [13] used 128 and 192 for the convolution layers and the bottleneck layer, respectively, at low bitrates, and 192 and 320 at high bitrates. The numbers of feature maps in the proposed SAE differ among the scalable layers and are summarized in Table II.
The proposed method uses the same entropy estimation method as [13], whereas [11] used an older, less efficient method.
Table III: BD-rate and BD-PSNR of the proposed SAE against [11] and [13].
Dataset   vs. [11]              vs. [13]
          BD-rate    BD-PSNR    BD-rate   BD-PSNR
Kodak     -65.2 %    3.38 dB    -0.6 %    0.021 dB
To illustrate the coding performance over a wide range of bitrates, the RD curves of the proposed SAE, [11] and [13] are provided in Fig. 3. We also provide the RD curves obtained by the BPG codec. For each of [13] and SAE, we provide two sets of results, one optimized for MSE and another for MS-SSIM. Table III summarizes the BD-rate and BD-PSNR of the proposed SAE method against [11] and [13], respectively. Compared with [11], the SAE achieves a significant coding gain, with a bitrate reduction of over 65% and a PSNR increase of 3.38 dB. The SAE performance is slightly better than that of [13].
Footnote: Note that when using the BPG codec [3], we set the option to indicate that the input image is in the RGB format. The BPG performance reported in [13] is lower than in Figs. 3 and 4, because the option was set to assume the input is in YCbCr format.
For the proposed SAE model, the points from left to right correspond to the results of the base layer and of the enhance1 to enhance4 layers. Both [13] and the proposed model outperform [11] over the entire bitrate range by a clear margin. This is due to the more efficient entropy coding method used. The SAE coder performs similarly to [13] up to about 0.5 bpp, and then becomes less efficient. This loss of coding efficiency with more layers is expected, as with any scalable coder compared to a non-scalable coder. In fact, it was somewhat surprising that SAE was able to achieve performance similar to (in fact slightly better than) [13] with up to three enhance layers.
Additionally, we provide comparisons based on MS-SSIM in Fig. 4. In general, the MS-SSIM metric is better correlated with perceptual quality than PSNR. It is very encouraging that the proposed SAE method has similar or better performance than [13] over the entire rate range. Moreover, the SAE method achieves much better performance than BPG over the entire bitrate range, consistent with the visual evaluation described below.
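BD-rate figures such as those in Table III are typically obtained with the Bjøntegaard delta method. The sketch below shows one common variant (cubic polynomial fit of log-rate against PSNR, averaged over the overlapping PSNR range); it is not necessarily the exact script used for the reported numbers, and the rate and PSNR points in the example are made up for illustration.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor at equal PSNR.

    Negative values mean the test codec needs fewer bits than the anchor.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    fit_a = np.polyfit(pa, log_ra, 3)
    fit_t = np.polyfit(pt, log_rt, 3)

    # Average each fit over the PSNR interval covered by both curves.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the mean log-rate difference back to a percentage.
    return float((np.exp(avg_t - avg_a) - 1.0) * 100.0)

# Made-up RD points: the test codec reaches each PSNR at half the anchor's rate.
anchor_rates = [0.2, 0.4, 0.8, 1.6]
anchor_psnrs = [28.0, 31.0, 34.0, 37.0]
test_rates = [r / 2.0 for r in anchor_rates]
example_bd = bd_rate(anchor_rates, anchor_psnrs, test_rates, anchor_psnrs)
```

In this synthetic case the test codec uses exactly half the bits at every quality level, so the function returns a BD-rate of about -50%.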
IV-C Visual Evaluations
For visual comparison, Fig. 5 and Fig. 6 present several cropped regions of reconstructed images from the Kodak dataset. Notice that the proposed SAE structure preserves more detail in contour and textured regions while using fewer or a similar number of bits compared with both [11] and [13] at low to intermediate bit rates.
We provide all of the decoded test images produced by the proposed SAE framework and by the methods of [11, 13] at different bitrate points in the supplementary materials.
Footnote: Please visit https://github.com/chuanminj/MIPR2019/ for supplementary materials.
V Conclusion
In this paper, a scalable autoencoder based deep image codec is proposed. The novelty of the paper is that the proposed method does not need to train multiple independent networks to realize different bitrate points. The quantitative and qualitative evaluations show that the proposed method achieves rate-distortion performance similar to a state-of-the-art DL-based method at low to intermediate bit rates in terms of mean square error, and has similar or better performance than the benchmark in terms of perceptual quality over the entire rate range. We should also note that the proposed SAE structure is general. One can simply replace the particular AE structure in Fig. 2 by another structure that provides better coding performance in each layer. In fact, one can also use methods not based on AEs.
Acknowledgment
The authors would like to thank Dr. Johannes Ballé for kindly providing the trained models of [11] and the decoded images of [13] for performance comparison. This work was done by C. Jia and Z. Liu as visiting students at NYU Tandon, sponsored by the China Scholarship Council (CSC), which is gratefully acknowledged.
References
 [1] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
 [2] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001.
 [3] F. Bellard, “BPG image format,” Available: http://bellard.org/bpg/, Accessed: 05-28-2018.
 [4] G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand et al., “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
 [5] V. K. Goyal, “Theoretical foundations of transform coding,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.
 [6] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in CVPR, 2017, pp. 5435–5443.
 [7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.
 [8] M. H. Baig, V. Koltun, and L. Torresani, “Learning to inpaint for image compression,” in Advances in Neural Information Processing Systems, 2017, pp. 1246–1255.
 [9] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
 [10] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” arXiv preprint arXiv:1705.05823, 2017.
 [11] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
 [12] J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” arXiv preprint arXiv:1511.06281, 2015.
 [13] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
 [14] J.-R. Ohm, “Advances in scalable video coding,” Proceedings of the IEEE, vol. 93, no. 1, pp. 42–56, 2005.
 [15] Z. Wang, E. Simoncelli, A. Bovik et al., “Multiscale structural similarity for image quality assessment,” in Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
 [19] J. Ballé, S. J. Hwang, and N. Johnston, https://tensorflow.github.io/compression/, 2018.
 [20] Kodak, http://r0k.us/graphics/kodak/, 1999.