Layered Image Compression using Scalable Auto-encoder

04/01/2019 ∙ by Chuanmin Jia, et al. ∙ NYU college ∙ Peking University

This paper presents a novel convolutional neural network (CNN) based image compression framework via scalable auto-encoder (SAE). Specifically, our SAE based deep image codec consists of hierarchical coding layers, each of which is an end-to-end optimized auto-encoder. The coarse image content and texture are encoded through the first (base) layer while the consecutive (enhance) layers iteratively code the pixel-level reconstruction errors between the original and former reconstructed images. The proposed SAE structure alleviates the need to train multiple models for different bit-rate points, as required by recently proposed auto-encoder based codecs. The SAE layers can be combined to realize multiple rate points, or to produce a scalable stream. The proposed method has similar rate-distortion performance in the low-to-medium rate range as the state-of-the-art CNN based image codec (which uses different optimized networks to realize different bit rates) over a standard public image dataset. Furthermore, the proposed codec generates better perceptual quality in this bit rate range.




I Introduction

Image compression aims to represent an image with as few coding bits as possible while preserving as much pixel-level reconstruction quality as possible. Recently, deep learning (DL) based image compression has become an emerging topic owing to its end-to-end optimization ability. Multiple learning-based image codecs have been proposed at the intersection of deep learning and image coding. Essentially, the deep models are trained to learn the image-to-image mapping between the pristine image and the reconstructed image under a rate-distortion (R-D) learning objective.

Different from conventional image coding formats such as JPEG [1], JPEG2000 [2] and BPG [3] (based on High Efficiency Video Coding, HEVC [4]), which use separate sub-modules for prediction and transform coding, deep codecs formulate the end-to-end learnt network as transform coding [5]. The most representative models adopt an auto-encoder (AE) like structure, which generates latent representations that are quantized and entropy coded. For example, Toderici et al. [6] proposed an end-to-end image coding model based on the recurrent neural network (RNN), where the original and residual images are iteratively compressed using the RNN structure. Each RNN iteration produces a better reconstruction at a commensurate bit-rate (2-bits) cost. However, the RNN based method might have limitations in representing high-frequency residuals. Additionally, the lack of explicit entropy estimation during RNN training also constrains its overall R-D performance. To address this issue, Theis et al. [7] presented a convolutional neural network (CNN) based AE structure, where the entropy model was approximated using a Gaussian distribution during optimization. In [8], an inpainting based learning approach was proposed for image compression. Generative adversarial network (GAN) based learning strategies were embedded in the CNN based framework [9, 10] to improve the perceptual quality of the reconstructed images. In [11], the generalized divisive normalization (GDN) [12] was introduced as a substitute for the nonlinear activation in a variational autoencoder (VAE) to de-correlate the channel-wise dependency among latent representations, which significantly improves the coding performance to be competitive with the JPEG2000 standard. Compared with other activation functions, the core advantage of GDN is its full reversibility, which guarantees nearly no information loss for the transform coding. More recently, in [13], a novel distribution parameter estimation method was proposed for entropy coding, which brought additional coding gain.

All the aforementioned approaches have to train a separate auto-encoder for each target bit rate, which limits their applicability when the desired bit rate has to be adapted in real time, and/or when it is impractical to store multiple trained models. Inspired by the conventional scalable coding paradigm [14], we propose a scalable auto-encoder (SAE) based deep image coding method that solves this problem by iteratively and incrementally coding the reconstruction errors using end-to-end trained auto-encoders. By cascading the bitstreams generated by each layer of the SAE, variable bit rates or layered bit streams can be obtained while maintaining competitive R-D performance. Furthermore, the proposed SAE structure is compatible with other learning based image codecs, since each layer of the SAE can be substituted by a different AE based deep codec.

II Proposed Method

The overall flowchart of the proposed framework is depicted in Fig. 1. The model contains several stacked modules, each of which is an end-to-end optimized AE. The original image is first compressed by the AE in the base layer. Subsequently, each enhance layer takes the difference between the original image and the latest reconstructed image as input and compresses this residue. We train the entire framework in a layer-by-layer manner, that is, all previous layers are fixed while the current one is trained. The design of the base layer and the enhance layers is introduced in more detail in the following subsections.

Fig. 1: The framework of the proposed SAE based image compression (first three layers are illustrated).

II-A Base Layer

The base layer in the proposed SAE mainly compresses the coarse image content and basic texture information. Given the original image $x$, the encoder $E_0$ of the AE in the base layer encodes $x$ into the latent representation $y_0$:

$$y_0 = E_0(x). \tag{1}$$

Subsequently, the quantizer $Q(\cdot)$ is applied to the latent representation to obtain the quantized latent feature $\hat{y}_0 = Q(y_0)$. The bits required to code $\hat{y}_0$ can be approximated by its entropy, which is determined from $p_{\hat{y}_0}$, the marginal probability mass function of $\hat{y}_0$. Following [11], to keep the loss function end-to-end differentiable, we replace the quantizer during training by adding white noise uniformly distributed in (-1/2, 1/2) to the latent feature $y_0$. We considered two different training objectives. The first one optimizes the objective quality measured by the mean square error (MSE), and is trained using the following loss function,

$$\mathcal{L}_{\mathrm{MSE}} = D_{\mathrm{MSE}}(x, \hat{x}_0) + \lambda_0 R(\hat{y}_0), \qquad \hat{x}_0 = D_0(\hat{y}_0), \tag{2}$$

where $\lambda_0$ is the Lagrange multiplier, $\hat{x}_0$ is the reconstructed image of the base layer and $D_0$ represents the base layer decoder of the AE (shown in Fig. 2). The second training objective optimizes for perceptual quality measured by MS-SSIM [15], which is

$$\mathcal{L}_{\text{MS-SSIM}} = D_{\text{MS-SSIM}}(x, \hat{x}_0) + \lambda'_0 R(\hat{y}_0), \tag{3}$$

where $D_{\text{MS-SSIM}}$ is based on [15] and $\lambda'_0$ is the Lagrange multiplier for MS-SSIM. To estimate the rate $R$, we deploy the state-of-the-art entropy model described in [13].
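As a rough illustration of the two ideas above (noise in place of rounding during training, and a loss that trades distortion against rate), the following Python sketch works on a plain list of latent values. It is not the authors' implementation: the helper names are hypothetical, the rate is approximated by the empirical entropy of the rounded symbols rather than by the learned entropy model of [13], and placing the Lagrange multiplier on the rate term is an assumption.

```python
import math
import random
from collections import Counter


def quantize(latent):
    """Hard quantizer used at test time: round each value to the nearest integer."""
    return [round(v) for v in latent]


def add_uniform_noise(latent, rng):
    """Training-time stand-in for rounding: add noise drawn from U(-1/2, 1/2)."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]


def empirical_entropy_bits(symbols):
    """Entropy in bits per symbol of the empirical probability mass function."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rd_loss(original, reconstruction, rate_bits, lam):
    """Mean squared error plus lam times rate (assumed form of the objective)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    return mse + lam * rate_bits


# Hypothetical latent values, for illustration only.
latent = [0.2, 1.7, -0.4, 2.1, 0.9, 1.2, -0.1, 1.8]
hard = quantize(latent)
noisy = add_uniform_noise(latent, random.Random(0))
print("rounded symbols:", hard)
print("noisy surrogate:", [f"{v:.2f}" for v in noisy])
print("bits per symbol:", f"{empirical_entropy_bits(hard):.3f}")
```

The noisy surrogate stays within half a quantization step of the original value, which is why it can stand in for rounding while keeping the loss differentiable.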

Following [11], the encoder ($E_0$) has three convolution layers, each with its own filter size and down-sampling step size as specified in [11]. GDN is used to provide non-linearity after each convolution layer. The decoder ($D_0$) mirrors the structure of the encoder. The AE structure is depicted in Fig. 2 and is the same as in [11]. Note that the same structure is used in each subsequent enhance layer; the only difference is the number of channels for the latent features, which will be described later.

Fig. 2: The AE in each layer of our proposed SAE. The encoder and decoder are shown in the left and right panels respectively.

II-B Enhance Layers

In the proposed framework, the enhance layers iteratively encode the residues between the original image and the image reconstructed from the previous layers. As shown in Fig. 1, the first enhance layer (enhance-1) takes the error between the original image $x$ and the reconstructed image of the base layer $\hat{x}_0$ as input. The formulation of enhance-1 can be represented as follows.

$$r_1 = x - \hat{x}_0, \qquad \hat{r}_1 = D_1\big(Q(E_1(r_1))\big), \tag{4}$$

where $E_1$ and $D_1$ are the encoder and decoder of enhance-1 and $Q$ is the quantizer. To obtain the reconstruction from the base layer and enhance-1, we simply add the outputs of the two layers ($\hat{x}_1 = \hat{x}_0 + \hat{r}_1$). In this way, the reconstruction quality is enhanced by the error coded in the enhance layer.

For the subsequent enhance layers, taking the second enhance layer (enhance-2) as an example, the input is the residue between the original and the reconstruction from the latest layer, $r_2 = x - \hat{x}_1$. The reconstructed residue can be described as,

$$\hat{r}_2 = D_2\big(Q(E_2(r_2))\big). \tag{5}$$

In this case, the reconstruction for enhance-2 is obtained by adding the corresponding outputs of the three AEs ($\hat{x}_2 = \hat{x}_0 + \hat{r}_1 + \hat{r}_2$).

The loss function for training the $i$-th enhance layer has the following form,

$$\mathcal{L}_i = D_{\mathrm{MSE}}(x, \hat{x}_i) + \lambda_i R(\hat{y}_i), \qquad \hat{x}_i = \hat{x}_{i-1} + \hat{r}_i, \tag{6}$$

where $\lambda_i$ is the Lagrange multiplier for the $i$-th layer and $\hat{y}_i = Q(E_i(r_i))$ is the quantized latent feature of that layer. Since $x - \hat{x}_i = r_i - \hat{r}_i$, the distortion term equals the error in reconstructing the residue. As in base layer training, uniform noise is added to the latent variables when training the enhance layers to relax the loss functions. To obtain rate-distortion optimality, we tried different combinations of the $\lambda_i$'s, which are detailed in the next section.
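The layered scheme of this section can be sketched in a few lines of Python. This is a toy under stated assumptions, not the trained codec: each layer's auto-encoder is replaced by a uniform scalar quantizer, and the function names, step sizes and sample values are hypothetical. What it does show is the structure described above: every layer codes the error left by the layers before it, and decoding the first k streams gives the reconstruction at the k-th rate point.

```python
def encode_layer(signal, step):
    """Toy stand-in for a layer's encoder and quantizer: uniform scalar quantization."""
    return [round(v / step) for v in signal]


def decode_layer(symbols, step):
    """Toy stand-in for a layer's decoder: rescale the integer symbols."""
    return [s * step for s in symbols]


def sae_encode(image, steps):
    """Code the image in the base layer, then code the remaining error layer by layer."""
    streams = []
    reconstruction = [0.0] * len(image)
    for step in steps:
        # The first pass sees the image itself; later passes see the residue.
        residual = [x - r for x, r in zip(image, reconstruction)]
        symbols = encode_layer(residual, step)
        streams.append(symbols)
        decoded = decode_layer(symbols, step)
        reconstruction = [r + d for r, d in zip(reconstruction, decoded)]
    return streams


def sae_decode(streams, steps, num_layers):
    """Sum the decoded outputs of the first num_layers layers (a truncated stream)."""
    reconstruction = [0.0] * len(streams[0])
    for symbols, step in zip(streams[:num_layers], steps[:num_layers]):
        decoded = decode_layer(symbols, step)
        reconstruction = [r + d for r, d in zip(reconstruction, decoded)]
    return reconstruction


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical pixel values and per-layer step sizes, for illustration only.
image = [12.0, 47.0, 83.0, 120.0, 200.0, 33.0]
steps = [32, 8, 2]
streams = sae_encode(image, steps)
for k in range(1, len(steps) + 1):
    print(k, "layer(s): MSE =", mse(image, sae_decode(streams, steps, k)))
```

Because the decoder only sums layer outputs, dropping the later streams still yields a valid (coarser) image, which is what makes the bitstream scalable.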

III Training Details

This section presents the training details of the proposed SAE based image codec. A subset of the ImageNet [16] database containing 5500 RGB images is used for training, and another 300 images are used for validation. Training is considered converged when the loss on the validation images becomes stable. We randomly crop a region of the training image size given in Table I from each training sample to prevent boundary issues. The hyper-parameters of our training procedure are listed in Table I. For each SAE layer, we iteratively train an encoder-decoder pair as well as an entropy model. In Table I, MSE lr is the fixed learning rate for the AE and entropy model in Eq. (2), and Rate lr is the initial learning rate of the entropy model. The Rate lr decays exponentially by a factor of 0.96 at a fixed interval of iterations during training. The optimizer of [17] is used for both the AE and the entropy model.
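The two data-handling details in this paragraph, random cropping and an exponentially decaying learning rate, can be sketched as follows. This is a minimal sketch, assuming a staircase schedule; the decay interval and the crop size are not stated in the text, so `decay_steps`, `crop_h` and `crop_w` are hypothetical parameters, and only the initial rate 0.001 and the factor 0.96 come from the description.

```python
import random


def decayed_learning_rate(initial_lr, decay_rate, decay_steps, step):
    """Staircase exponential decay: scale by decay_rate once every decay_steps iterations."""
    return initial_lr * decay_rate ** (step // decay_steps)


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position from a 2-D list of pixels."""
    top = rng.randint(0, len(image) - crop_h)      # randint is inclusive at both ends
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# decay_steps=1000 is an assumed value, used only to show the shape of the schedule.
for step in (0, 1000, 5000):
    print(step, decayed_learning_rate(0.001, 0.96, 1000, step))

# A tiny 5 x 6 "image" of hypothetical values.
toy_image = [[10 * r + c for c in range(6)] for r in range(5)]
print(random_crop(toy_image, 2, 3, random.Random(1)))
```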

We train each layer successively: given the previous ($i$-1) layers, we search for the hyper-parameters of the $i$-th SAE layer, including the number of feature maps and $\lambda_i$, that achieve the best rate-distortion tradeoff.

Parameter Name Value
MSE lr 0.0001
Rate lr 0.001
epochs 1000
batch size 8
training image size
TABLE I: The hyper-parameters for training SAE.

III-A Base Layer Training

To achieve a good rate-distortion tradeoff for the base layer, and to use as few channels as possible for reduced complexity, we varied the number of feature maps in each of the three convolution layers in the encoder and decoder of the AE, as well as the $\lambda$ value. We found that using 48 feature maps in all three layers and $\lambda$=3000 achieved the best result for the MSE-oriented optimization. For MS-SSIM-oriented optimization, $\lambda$=50 achieved the best result.

III-B Enhance Layer Training

We train each subsequent enhance layer while fixing the lower layers at their optimized states. We found that the number of features should increase as more enhance layers are added. We suspect that this is because the distribution of the errors becomes more similar to random noise as the enhance layers go deeper, so that more parameters are needed to model and capture this distribution. The parameter $\lambda$, which balances the contributions of the entropy term and the MSE term in Eq. (6), should decrease layer by layer since more emphasis should be put on minimizing the MSE. Recall that ideally $\lambda$ should be equal to the negative slope of the MSE vs. rate curve, and this slope reduces as the rate increases. Table II summarizes the $\lambda$ values and the number of features for each layer in the trained SAE. These values were selected by an exhaustive search over different combinations of the parameters when training the SAE structure.
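The claim that the ideal multiplier shrinks as the rate grows can be checked on a textbook model. The sketch below uses the rate-distortion function of a Gaussian source, D(R) = σ² · 2^(-2R), which is an assumption standing in for the codec's measured MSE vs. rate curve; the function names are hypothetical. The negative slope of this curve is 2 ln 2 · D(R), so it decreases monotonically with R.

```python
import math


def gaussian_distortion(rate_bits, variance):
    """Rate-distortion function of a Gaussian source: D(R) = variance * 2^(-2R)."""
    return variance * 2.0 ** (-2.0 * rate_bits)


def lagrange_multiplier(rate_bits, variance):
    """Negative slope -dD/dR of the curve above, i.e. the multiplier matched to this rate."""
    return 2.0 * math.log(2.0) * gaussian_distortion(rate_bits, variance)


# Unit variance is an arbitrary choice; only the trend matters here.
for rate in (0.25, 0.5, 1.0, 2.0):
    print(f"R = {rate:4.2f}  D = {gaussian_distortion(rate, 1.0):.4f}  "
          f"lambda = {lagrange_multiplier(rate, 1.0):.4f}")
```

The same trend is what motivates choosing smaller multipliers for higher layers, which operate at higher cumulative rates.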

Layers               Base   e1     e2    e3    e4
λ (for MSE)          3000   1000   300   100   30
λ (for MS-SSIM)      50     30     10    0.5   -
Feature Map Number   48     48     96    144   192
TABLE II: λ and number of feature maps for each layer of the proposed SAE.

III-C Entropy Model

In this paper, we re-use the entropy model proposed in [13], which incorporates a hyper-prior to effectively capture spatial dependencies in the latent representation generated by each layer of the proposed SAE. Particularly, we deploy different entropy models for different layers in the proposed SAE.
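To make the role of the entropy model concrete, the sketch below estimates the bit cost of quantized latents under a common modelling choice for hyper-prior codecs: each latent is treated as Gaussian with a predicted scale, and the probability of an integer symbol is the Gaussian mass on the unit interval around it. This form is an assumption about how such models are typically built, not a statement of the exact model in [13]; the function names, means and scales are hypothetical.

```python
import math


def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian with the given mean and scale."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_probability(symbol, mean, scale):
    """Probability mass of an integer symbol: Gaussian mass on [symbol - 0.5, symbol + 0.5]."""
    return gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)


def estimated_bits(symbols, means, scales):
    """Sum of -log2 p over all symbols, i.e. the ideal entropy-coded length in bits."""
    return sum(
        -math.log2(symbol_probability(s, m, sc))
        for s, m, sc in zip(symbols, means, scales)
    )


# Hypothetical symbols with per-element scales such as a hyper-prior might predict.
symbols = [0, 1, -2, 0]
means = [0.0, 0.0, 0.0, 0.0]
scales = [0.5, 1.0, 2.0, 4.0]
print(f"{estimated_bits(symbols, means, scales):.2f} bits")
```

A symbol costs few bits when the predicted scale matches its magnitude, which is why a well-fitted, layer-specific model lowers the rate. For symbols very far from the mean the probability can underflow to zero in floating point, so a practical version would clamp it to a small positive floor.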

IV Experimental Results

For training and testing, we used the popular DL library Tensorflow [18] and the Tensorflow-compression submodule [19], which is an implementation of [13].

IV-A Experiment Set-up

To evaluate the efficiency of the proposed SAE based image codec, we test the proposed model on the widely used Kodak Lossless True Color Image dataset [20], which contains 24 true color images with resolution 768×512. The results presented in this section are the average over these 24 images. It is worth noting that all the experiments and comparisons are based on three-channel true color images. The test environment of this work is an Intel i5 7200U CPU with 16GB RAM and an NVIDIA GTX 1050Ti GPU.

IV-B Rate-distortion Performances

We compare our proposed SAE coder against the algorithms in [11] and [13]. Our method and [11] have the same structure for the encoder and decoder, as illustrated in Fig. 2, whereas [13] uses one more convolution layer than [11]. However, the numbers of feature maps differ among these methods. [11] used 192 feature maps in each layer, while [13] used 128 and 192 feature maps for the convolution layers and the bottleneck layer, respectively, at low bit-rate points, and 192 and 320 at high bit-rate points. The numbers of feature maps in the proposed SAE differ among the scalable layers and are summarized in Table II.

The proposed method uses the same entropy estimation method as [13], whereas [11] used an older, less efficient method.

Dataset   vs. [11]               vs. [13]
          BD-Rate    BD-PSNR     BD-Rate    BD-PSNR
Kodak     -65.2 %    3.38 dB     -0.6 %     0.021 dB
TABLE III: The R-D performance of the proposed SAE based image codec.

To illustrate the coding performance over a wide range of bit rates, the R-D curves of the proposed SAE, [11] and [13] are provided in Fig. 3. We also provide the R-D curves obtained by the BPG codec (see Footnote 1). For each of [13] and SAE, we provide two sets of results, one optimized for MSE and another for MS-SSIM. Table III summarizes the BD-Rate and BD-PSNR of the proposed SAE method against [11] and [13], respectively. Compared with [11], the SAE achieves a significant coding gain: the bit rate is reduced by over 65%, or equivalently the PSNR is increased by 3.38 dB. The SAE performance is slightly better than that of [13].

Footnote 1: Note that when using the BPG codec [3], we set the option to indicate that the input image is in the RGB format. The BPG performance reported in [13] is lower than in Fig. 3 and 4, because the option was set to assume the input is in YCbCr format.

For the proposed SAE model, the points from left to right correspond to the results of the base layer and of the enhance-1 to enhance-4 layers. Both [13] and the proposed model outperform [11] over the entire bit-rate range by a clear margin. This is due to the more efficient entropy coding method used. The SAE coder performs similarly to [13] up to about 0.5 bpp, and then becomes less efficient. This loss of coding efficiency with more layers is expected, as with any scalable coder compared to a non-scalable coder. In fact, it was somewhat surprising that the SAE was able to achieve performance similar to (and slightly better than) [13] with up to 3 enhancement layers.

Fig. 3: R-D curves (PSNR) over Kodak dataset.
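The two axes of these R-D curves are simple to compute from a coded file size and a mean squared error. The helper functions below are a minimal sketch with hypothetical names; the peak value of 255 assumes 8-bit samples, and the example file size is made up purely to show the arithmetic for a 768×512 image.

```python
import math


def bits_per_pixel(num_bytes, width, height):
    """Bit rate in bits per pixel (bpp) for a coded file of num_bytes bytes."""
    return 8.0 * num_bytes / (width * height)


def psnr_db(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


# A hypothetical 24576-byte bitstream for one 768 x 512 image.
print(bits_per_pixel(24576, 768, 512), "bpp")
# A hypothetical MSE of 65.025 on the 0-255 scale.
print(round(psnr_db(65.025), 2), "dB")
```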

Additionally, we provide the comparisons based on MS-SSIM in Fig. 4. In general, the MS-SSIM metric is better correlated with the perceptual quality than PSNR. It is very encouraging that the proposed SAE method has similar or better performance than [13] in the entire rate range. Moreover, the SAE method achieved much better performance than BPG in the entire bit-rate range, consistent with the visual evaluation described below.

Fig. 4: R-D curves (MS-SSIM) over Kodak dataset.

IV-C Visual Evaluations

For visual comparisons, Fig. 5 and Fig. 6 present several cropped regions of reconstructed images from the Kodak dataset. Notice that the proposed SAE structure preserves more detail in contour and textured regions while using fewer or a similar number of bits than both [11] and [13] at low to intermediate bit rates. All decoded test images produced by the proposed SAE framework and the methods of [11, 13] at different bit-rate points are provided in the supplementary materials.

(a) [11]: 0.1455 bpp, 24.1610 dB, 0.8827; (b) [13]: 0.1782 bpp, 25.0016 dB, 0.9079; (c) Ours: 0.1305 bpp, 24.9626 dB, 0.9250
(d) [11]: 0.1394 bpp, 25.1619 dB, 0.8739; (e) [13]: 0.1417 bpp, 25.5955 dB, 0.8870; (f) Ours: 0.1166 bpp, 25.8482 dB, 0.9192
(g) [11]: 0.6468 bpp, 28.9845 dB, 0.9725; (h) [13]: 0.4959 bpp, 28.3934 dB, 0.9648; (i) Ours: 0.5495 bpp, 30.9506 dB, 0.9852
(j) [11]: 0.5782 bpp, 30.0684 dB, 0.9680; (k) [13]: 0.4858 bpp, 29.1530 dB, 0.9533; (l) Ours: 0.4932 bpp, 28.7907 dB, 0.9831
Fig. 5: Comparisons of decoded test images by [11], [13] and the proposed method. The last number under each figure is the MS-SSIM value.

(a) [11]: 0.1128 bpp, 26.2665 dB, 0.8831; (b) [13]: 0.1102 bpp, 30.0579 dB, 0.9483; (c) Ours: 0.0996 bpp, 27.0657 dB, 0.9228
(d) [11]: 0.0850 bpp, 29.9462 dB, 0.9361; (e) [13]: 0.0739 bpp, 30.1352 dB, 0.9388; (f) Ours: 0.0811 bpp, 30.6522 dB, 0.9524
(g) [11]: 0.4858 bpp, 30.9572 dB, 0.9644; (h) [13]: 0.4596 bpp, 31.9486 dB, 0.9660; (i) Ours: 0.4336 bpp, 32.6423 dB, 0.9809
(j) [11]: 0.1370 bpp, 24.0123 dB, 0.8881; (k) [13]: 0.1387 bpp, 24.4523 dB, 0.9004; (l) Ours: 0.1130 bpp, 24.6560 dB, 0.9287
Fig. 6: Comparisons of decoded test images by [11], [13] and the proposed method. The last number under each figure is the MS-SSIM value.

V Conclusion

In this paper, a scalable auto-encoder based deep image codec is proposed. The novelty of the paper is that the proposed method does not need to train multiple independent networks to realize different bit-rate points. The quantitative and qualitative evaluations have shown that the proposed method achieves rate-distortion performance similar to a state-of-the-art DL-based method at low to intermediate bit rates in terms of mean-square error, and has similar or better performance than the benchmark in terms of perceptual quality over the entire rate range. We should also note that the proposed SAE structure is general. One can simply replace the particular AE structure in Fig. 2 by another structure that provides better coding performance in each layer. In fact, one can also use methods not based on AEs.


The authors would like to thank Dr. Johannes Ballé for kindly providing their trained models of [11] and the decoded images of [13] for performance comparison. This work was done by C. Jia and Z. Liu as visiting students in NYU-Tandon sponsored by China Scholarship Council (CSC), which is gratefully acknowledged.