Image compression aims to represent an image with as few coding bits as possible while preserving as much pixel-level reconstruction quality as possible. Recently, deep learning (DL) based image compression has emerged as an active topic owing to its end-to-end optimization ability. Multiple learning-based image codecs have been proposed at the intersection of deep learning and image coding. Essentially, the deep models are trained to learn the image-to-image mapping between the pristine image and the reconstructed image under a rate-distortion (R-D) learning objective.
Different from conventional image coding formats such as JPEG , JPEG2000  and BPG  (based on High Efficiency Video Coding, HEVC ), which utilize separate sub-modules for prediction and transform coding, deep codecs formulate the end-to-end learnt network as transform coding . The most representative models adopt an auto-encoder (AE) like structure, which generates latent representations that are quantized and entropy coded. For example, Toderici et al.  proposed an end-to-end image coding model based on the recurrent neural network (RNN), where the original and residual images are iteratively compressed using an RNN structure. Each RNN iteration produces a better reconstruction at a commensurate bit-rate cost (2-bits). However, the RNN based method might have limitations in representing high-frequency residuals. Additionally, the lack of explicit entropy estimation during RNN training also constrains its overall R-D performance. To address this issue, Theis et al.  presented a convolutional neural network (CNN) based AE structure, where the entropy model was approximated using a Gaussian distribution during optimization. In , an inpainting based learning approach was proposed for image compression. To improve the perceptual quality of the reconstructed images, generative adversarial network (GAN) based learning strategies were embedded in the CNN based framework [9, 10]. In , the generalized divisive normalization (GDN)  was introduced as a substitute for the nonlinear activation in a variational autoencoder (VAE) to de-correlate the channel-wise dependency among latent representations, which significantly improves the coding performance to be competitive with the JPEG2000 standard. Compared with other activation functions, the core advantage of GDN is its full reversibility, which guarantees nearly no information loss for the transform coding. More recently, a novel distribution parameter estimation method was proposed for entropy coding, which has brought additional coding gain.
All the aforementioned approaches have to train a particular auto-encoder for each target bit rate, which can limit their applicability when the desired bit rate has to be adapted in real time, and/or when it is impractical to store multiple trained models. Inspired by the conventional scalable coding paradigm , we propose a scalable auto-encoder (SAE) based deep image coding method that solves this problem by iteratively and incrementally coding the errors using end-to-end trained auto-encoders. By cascading the bitstreams generated by each layer of the SAE, variable bit-rates or layered bit streams can be obtained while maintaining optimal R-D performance. Furthermore, the proposed SAE structure is compatible with any other learning based image codec, since each layer of the SAE can be substituted by a different AE based deep codec.
II Proposed Method
The overall flowchart of the proposed framework is depicted in Fig. 1. The model contains several stacked modules, each of which is an end-to-end optimized AE. The original image is first compressed by the AE in the base layer. Subsequently, each enhance layer takes the difference between the latest reconstructed image and the original image as input and compresses this residue. We train the entire framework in a layer-by-layer manner, which means that all previous layers are fixed when training the current one. The design of the base layer and the enhance layers is introduced in more detail in the subsequent subsections.
II-A Base Layer
The base layer in the proposed SAE mainly compresses the coarse image content and basic texture information. Given the original image, the encoder of the base-layer AE maps it to a latent representation.
Subsequently, a quantizer is applied to the latent representation to obtain the quantized latent feature. The number of bits required to code this feature can be approximated by its entropy, which is determined from its marginal probability mass function. Following , we considered two different training objectives. The first optimizes the objective quality measured by the mean squared error (MSE): its loss function combines the entropy term with the MSE between the original image and the base-layer reconstruction, which is produced by the base-layer decoder of the AE (shown in Fig. 2), and a Lagrange multiplier balances the two terms. The second optimizes the perceptual quality measured by MS-SSIM , with the MSE term replaced by an MS-SSIM-based distortion term.
Following , the encoder has three convolution layers with down-sampling, and GDN provides the non-linearity after each convolution layer. The decoder mirrors the structure of the encoder. The AE structure is depicted in Fig. 2 and is the same as in . Note that the same structure is used in each subsequent enhance layer; the only difference is the number of channels for the latent features, which will be described later.
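As a rough illustration of the base-layer objective, the following Python sketch rounds a toy latent vector, estimates its rate from the empirical entropy of the quantized symbols, and combines rate and MSE into one loss. The function names, the toy latent values, the identity "decoder" (distortion is measured directly on the latent), and the placement of the multiplier `lam` on the rate term are all assumptions made for illustration. The actual codec uses a learned entropy model and measures distortion on the reconstructed image.

```python
import math
from collections import Counter

def quantize(latent):
    """Round each latent value to the nearest integer."""
    return [round(v) for v in latent]

def empirical_entropy_bits(symbols):
    """Average bits per symbol under the empirical marginal PMF."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rd_loss(rate_bits, distortion, lam):
    """Assumed loss form: lam * rate + distortion."""
    return lam * rate_bits + distortion

# Toy latent vector (hypothetical values).
latent = [0.2, 1.7, -0.4, 2.1, 0.9, -1.2, 0.1, 1.6]
q = quantize(latent)
rate = empirical_entropy_bits(q)   # bits per latent symbol
dist = mse(latent, q)              # identity decoder assumed
loss = rd_loss(rate, dist, lam=0.1)
```

A larger `lam` penalizes rate more heavily, so a codec trained under this form would tend toward lower bit-rates and higher distortion.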
II-B Enhance Layers
In the proposed framework, the enhance layers iteratively encode the residues between the original image and the reconstruction from the previous layers. As shown in Fig. 1, the first enhance layer (enhance-1) takes as input the error between the original image and the reconstructed image of the base layer, and produces a reconstructed residue.
To obtain the reconstruction from the base layer and enhance-1, we simply add the outputs of these two layers. As such, the reconstruction quality is enhanced by the error coded in the enhance layer.
For the subsequent enhance layers, taking the second enhance layer (enhance-2) as an example, the input is the residue between the original image and the reconstruction from the latest layer, which enhance-2 codes into another reconstructed residue.
In this case, the reconstruction for enhance-2 is obtained by adding the corresponding outputs of the three AEs.
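The layered reconstruction described above can be sketched in a few lines of Python. This is only a schematic: `toy_codec` is a hypothetical stand-in for one trained AE layer (here a uniform scalar quantizer), and the step sizes and pixel values are made up. What it does reproduce is the data flow, in which each layer codes the residue left by all previous layers and the reconstruction is the running sum of the layer outputs.

```python
def toy_codec(signal, step):
    """Hypothetical stand-in for one AE layer: uniform scalar quantization."""
    return [step * round(v / step) for v in signal]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layered_reconstruct(image, steps):
    """Return the cumulative reconstruction after each layer.

    The first entry of `steps` plays the role of the base layer (its input
    is the image itself, since the running reconstruction starts at zero);
    the remaining entries play the role of enhance layers.
    """
    recon = [0.0] * len(image)
    reconstructions = []
    for step in steps:
        residual = [x - r for x, r in zip(image, recon)]
        decoded = toy_codec(residual, step)
        recon = [r + d for r, d in zip(recon, decoded)]
        reconstructions.append(recon)
    return reconstructions

# Made-up pixel values and step sizes, for illustration only.
image = [12.3, 200.7, 55.1, 90.4, 131.9, 7.6]
recs = layered_reconstruct(image, steps=[32.0, 8.0, 2.0])
errors = [mse(image, r) for r in recs]
```

Decoding only the first entry of `recs` corresponds to the lowest bit-rate point, and each additional layer refines it, which is what makes the bitstream scalable.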
The loss function for training each enhance layer balances an entropy term and an MSE term through a Lagrange multiplier specific to that layer. As in base layer training, uniform noise is added to the latent variables when training the enhance layers, as a relaxation of the loss functions. To obtain rate-distortion optimality, we tried different combinations of the Lagrange multipliers, which are detailed in the next section.
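The uniform-noise relaxation mentioned above can be illustrated as follows. The sketch assumes, as is common practice for this family of codecs, that rounding is used at test time and that additive noise of unit width stands in for it during training. The function names and sample values are hypothetical.

```python
import random

def quantize_train(latent, rng):
    """Training-time proxy: add uniform noise of width 1 instead of rounding."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]

def quantize_test(latent):
    """Test-time quantizer: round to the nearest integer."""
    return [round(v) for v in latent]

rng = random.Random(0)                  # fixed seed for repeatability
latent = [0.2, 1.7, -0.4, 2.1]          # made-up latent values
noisy = quantize_train(latent, rng)     # differentiable surrogate
hard = quantize_test(latent)            # actual quantized symbols
```

Both variants perturb each latent value by at most 0.5, but only the noisy one keeps the loss differentiable with respect to the latent.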
III Training Details
This section presents the training details of the proposed SAE based image codec. A subset of the ImageNet database containing 5500 RGB images is used for training, and another 300 images are used for validation. Training is considered converged when the loss on the validation images becomes stable. We randomly crop a fixed-size region from each training sample to prevent boundary issues. The hyper-parameters of our training procedure are listed in Table I. For each SAE layer, we need to train an encoder-decoder pair as well as an entropy model, iteratively. Two learning rates are used: a fixed learning rate for the AE and entropy model in Eq. (2), and an initial learning rate for the entropy model, which decays exponentially with the parameter 0.96 at a fixed interval of iterations during training. The same optimizer  is used for both the AE and the entropy model.
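A minimal sketch of an exponential learning-rate schedule with decay parameter 0.96 is given below. The initial learning rate and the decay interval used in the example are hypothetical placeholders, since the actual values are not recoverable from the text, and whether the decay is applied in discrete steps (`staircase=True`) or continuously is likewise an assumption.

```python
def decayed_learning_rate(initial_lr, step, decay_steps,
                          decay_rate=0.96, staircase=True):
    """Exponentially decayed learning rate.

    lr = initial_lr * decay_rate ** (step / decay_steps), where the exponent
    is floored to an integer when `staircase` is True.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

# Hypothetical values: initial rate 1e-3, decay applied every 1000 iterations.
lr_start = decayed_learning_rate(1e-3, step=0, decay_steps=1000)
lr_after_one_interval = decayed_learning_rate(1e-3, step=1000, decay_steps=1000)
lr_mid_interval = decayed_learning_rate(1e-3, step=1999, decay_steps=1000)
```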
We train each layer successively: given the previously trained layers, we search for the hyper-parameters of the current SAE layer, including the number of feature maps and the Lagrange multiplier, that achieve the best rate-distortion tradeoff.
Table I: Hyper-parameters of the training procedure.
|training image size|
III-A Base Layer Training
To achieve a good rate-distortion tradeoff for the base layer while using as few channels as possible for reduced complexity, we varied the number of feature maps in each of the three convolution layers in the encoder and decoder of the AE, as well as the value of the Lagrange multiplier. We found that using 48 feature maps in all three layers with a multiplier of 3000 achieved the best result for the MSE-oriented optimization. For the MS-SSIM-oriented optimization, a multiplier of 50 achieved the best result.
III-B Enhance Layer Training
We train each subsequent enhance layer while fixing the lower layers at their optimized states. We found that the number of features should increase as more enhance layers are added. We suspect that this is because the distribution of the errors becomes more similar to random noise as the enhance layers go deeper, so that more parameters are needed to model and capture this distribution. The Lagrange multiplier, which balances the contributions of the entropy term and the MSE term in Eq. (6), should decrease layer by layer since more emphasis should be put on minimizing the MSE. Recall that ideally the multiplier should equal the negative slope of the MSE vs. rate curve, and this value decreases as the rate increases. Table II summarizes the multiplier values and the number of features for each layer in the trained SAE. These values were selected by exhaustive search over different combinations of the parameters when training the SAE structure.
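Table II: Number of feature maps in each layer of the trained SAE.
|Layer||Base||Enhance-1||Enhance-2||Enhance-3||Enhance-4|
|Feature Map Number||48||48||96||144||192|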
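The remark that the ideal multiplier equals the negative slope of the MSE vs. rate curve, and therefore shrinks at higher rates, can be checked on a simple model. The sketch below uses the textbook high-resolution approximation D(R) = variance * 2^(-2R) for a Gaussian source. This model is an assumption introduced here purely for illustration and is not the measured curve of the SAE.

```python
import math

def distortion(rate, variance=1.0):
    """Assumed textbook model: D(R) = variance * 2^(-2R)."""
    return variance * 2.0 ** (-2.0 * rate)

def negative_slope(rate, variance=1.0):
    """Analytic -dD/dR = 2 * ln(2) * variance * 2^(-2R)."""
    return 2.0 * math.log(2.0) * distortion(rate, variance)

# Negative slope at increasing rates (in bits per sample).
rates = (0.25, 0.5, 1.0, 2.0)
slopes = [negative_slope(r) for r in rates]
```

Under this model each additional bit of rate divides the slope by four, which is consistent with choosing a smaller multiplier for each deeper layer.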
III-C Entropy Model
In this paper, we re-use the entropy model proposed in , which incorporates a hyper-prior to effectively capture spatial dependencies in the latent representation generated by each layer of the proposed SAE. Particularly, we deploy different entropy models for different layers in the proposed SAE.
IV Experimental Results
For training and testing, we used the popular DL library Tensorflow and the Tensorflow-compression submodule , which is an implementation of .
IV-A Experiment Set-up
To evaluate the efficiency of the proposed SAE based image codec, we test the proposed model on the widely used Kodak Lossless True Color Image dataset , which contains 24 true color images. The results presented in this section are averaged over these 24 images. It is worth noting that all the experiments and comparisons are based on three-channel true color images. The test environment of this work is an Intel i5 7200U CPU with 16GB RAM and an NVIDIA GTX 1050Ti GPU.
IV-B Rate-distortion Performances
We compare our proposed SAE coder against the algorithms in  and . Our method and  have the same structure for the encoder and decoder, as illustrated in Fig. 2, while  used one more convolution layer on top of . However, the numbers of feature maps differ among these methods.  used 192 feature maps in each layer, while  used 128 and 192 for the convolution layers and the bottleneck layer at low bit-rate points, and 192 and 320 at high bit-rate points, respectively. The numbers of feature maps in the proposed SAE differ among the scalable layers and are summarized in Table II.
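Table III: BD-Rate and BD-PSNR of the proposed SAE against  and .
|Dataset||BD-Rate vs. ||BD-PSNR vs. ||BD-Rate vs. ||BD-PSNR vs. |
|Kodak||-65.2 %||3.38 dB||-0.6 %||0.021 dB|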
To illustrate the coding performance over a wide range of bit-rates, the R-D curves of the proposed SAE,  and  are provided in Fig. 3. We also provide the R-D curves obtained by the BPG codec. For each of  and SAE, we provide two sets of results, one optimized for MSE and another for MS-SSIM. Table III summarizes the BD-Rate and BD-PSNR of the proposed SAE method against  and , respectively. Compared with , the SAE achieved a significant coding gain, with a bit-rate reduction of over 65% and a PSNR increase of 3.38 dB. The SAE performance is slightly better than that of .
Footnote: When using the BPG codec , we set the option to indicate that the input image is in the RGB format. The BPG performance reported in  is lower than in Fig. 3 and 4, because the option was set to assume the input is in YCbCr format.
For the proposed SAE model, the points from left to right correspond to the results of the base layer and of the enhance-1 to enhance-4 layers. Both  and the proposed model outperform  over the entire range of bit-rates by a clear margin, owing to the more efficient entropy coding method used. The SAE coder is similar to  up to about 0.5 bpp, and then becomes less efficient. This loss of coding efficiency with more layers is expected of any scalable coder when compared to a non-scalable coder. In fact, it was somewhat surprising that SAE achieved similar (indeed slightly better) performance compared with , with up to 3 enhancement layers.
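To make the construction of such layered operating points concrete, the sketch below computes the cumulative bit-rate in bits per pixel (bpp) and the PSNR after each decoded layer. All numbers in the example are made-up placeholders and not measurements from the experiments. The helper names are hypothetical, and the peak value of 255 assumes 8-bit images.

```python
import math
from itertools import accumulate

def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given MSE, assuming 8-bit images by default."""
    return 10.0 * math.log10(peak * peak / mse)

def operating_points(layer_bits, layer_mse, num_pixels):
    """Return one (bpp, PSNR) pair per decoded layer.

    layer_bits[k] is the size in bits of the bitstream of layer k alone;
    layer_mse[k] is the MSE after decoding layers 0..k together.
    """
    cumulative_bits = accumulate(layer_bits)
    return [(bits / num_pixels, psnr_from_mse(m))
            for bits, m in zip(cumulative_bits, layer_mse)]

# Placeholder values for illustration only (not experimental results).
points = operating_points(layer_bits=[40000, 60000, 90000],
                          layer_mse=[120.0, 60.0, 30.0],
                          num_pixels=256 * 256)
```

Because the bitstreams are cascaded, the rate of each point is the sum of the rates of all layers decoded so far, which is why the points move to the right as layers are added.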
Additionally, we provide the comparisons based on MS-SSIM in Fig. 4. In general, the MS-SSIM metric is better correlated with the perceptual quality than PSNR. It is very encouraging that the proposed SAE method has similar or better performance than  in the entire rate range. Moreover, the SAE method achieved much better performance than BPG in the entire bit-rate range, consistent with the visual evaluation described below.
IV-C Visual Evaluations
For visual comparisons, Fig. 5 and Fig. 6 present several cropped versions of reconstructed images from the Kodak dataset. Notice that the proposed SAE structure offers more detailed information in contour and textural regions while using fewer or a similar number of bits than both  and  at low to intermediate bit rates.
We have provided all of the decoded test images produced by the proposed SAE framework and by the methods of [11, 13] at different bit-rate points in the supplementary materials, available at https://github.com/chuanminj/MIPR2019/.
In this paper, a scalable auto-encoder based deep image codec is proposed. The novelty of the paper lies in that the proposed method does not need to train multiple independent networks to realize different bit-rate points. The quantitative and qualitative evaluations have shown that the proposed method can achieve rate-distortion performance similar to a state-of-the-art DL-based method at low to intermediate bit rates in terms of mean squared error, and has similar or better performance than the benchmark in terms of perceptual quality over the entire rate range. We should also note that the proposed SAE structure is general. One can simply replace the particular AE structure in Fig. 2 by another structure that can provide better coding performance in each layer. In fact, one can also use methods not based on AEs.
The authors would like to thank Dr. Johannes Ballé for kindly providing their trained models of  and the decoded images of  for performance comparison. This work was done by C. Jia and Z. Liu as visiting students in NYU-Tandon sponsored by China Scholarship Council (CSC), which is gratefully acknowledged.
-  G. K. Wallace, “The jpeg still picture compression standard,” IEEE transactions on consumer electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
-  A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still image compression standard,” IEEE Signal processing magazine, vol. 18, no. 5, pp. 36–58, 2001.
-  F. Bellard, “Bpg image format,” Available: http://bellard.org/bpg/., Accessed: 05-28-2018.
-  G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand et al., “Overview of the high efficiency video coding(hevc) standard,” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649–1668, 2012.
-  V. K. Goyal, “Theoretical foundations of transform coding,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.
-  G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks.” in CVPR, 2017, pp. 5435–5443.
-  L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.
-  M. H. Baig, V. Koltun, and L. Torresani, “Learning to inpaint for image compression,” in Advances in Neural Information Processing Systems, 2017, pp. 1246–1255.
-  E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
-  O. Rippel and L. Bourdev, “Real-time adaptive image compression,” arXiv preprint arXiv:1705.05823, 2017.
-  J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
-  ——, “Density modeling of images using a generalized normalization transformation,” arXiv preprint arXiv:1511.06281, 2015.
-  J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
-  J.-R. Ohm, “Advances in scalable video coding,” Proceedings of the IEEE, vol. 93, no. 1, pp. 42–56, 2005.
-  Z. Wang, E. Simoncelli, A. Bovik et al., “Multi-scale structural similarity for image quality assessment,” in ASILOMAR CONFERENCE ON SIGNALS SYSTEMS AND COMPUTERS, vol. 2. IEEE, 2003, pp. 1398–1402.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
-  J. Ballé, S. J. Hwang, and N. Johnston, https://tensorflow.github.io/compression/, 2018.
-  Kodak, http://r0k.us/graphics/kodak/, 1999.