Deep Learning-based Image Compression with Trellis Coded Quantization

01/26/2020 ∙ by Binglin Li, et al. ∙ Simon Fraser University ∙ University of Manitoba

Recently, many works have attempted to develop image compression models based on deep learning architectures, where a uniform scalar quantizer (SQ) is commonly applied to the feature maps between the encoder and decoder. In this paper, we propose to incorporate a trellis coded quantizer (TCQ) into a deep learning based image compression framework. A soft-to-hard strategy is applied to allow for back propagation during training. We develop a simple image compression model that consists of three subnetworks (encoder, decoder and entropy estimation), and optimize all of the components in an end-to-end manner. We experiment on two high resolution image datasets, and both show that our model can achieve superior performance at low bit rates. We also compare TCQ and SQ on our proposed baseline model and demonstrate the advantage of TCQ.




1 Introduction

The goal of designing an optimal image codec is to minimize the distortion $D$ between the original image and the reconstructed image subject to a constraint on the bitrate $R$. As the entropy is a lower bound on the bitrate $R$, the optimization can be formulated as minimizing a weighted sum of the distortion and the entropy, where the weight is the tradeoff factor. Recently, many works [4, 18, 14] have attempted to develop image compression models based on deep learning architectures. In these approaches, a uniform scalar quantizer (SQ) is commonly applied to the feature maps between the encoder and decoder. As the codewords of SQ lie on a cubic lattice and the corresponding Voronoi regions are always cubes, SQ cannot achieve the R-D bound [9]. Vector quantization (VQ) has the optimal performance, but its complexity is usually high. The trellis coded quantizer (TCQ) is a structured VQ that can achieve better performance than SQ with modest computational complexity [12]. It is shown in [12] that for memoryless uniform sources, a 4-state TCQ can achieve 0.87 dB higher SNR than SQ at 4 bits/sample.

In this paper, motivated by the superior performance of TCQ over SQ in traditional image coding, we propose to use TCQ to replace the commonly used SQ in a deep learning based image compression model. The soft-to-hard strategy [1] is applied to allow for back propagation during training. To the best of our knowledge, we are the first to investigate the performance of TCQ in a deep learning based image compression framework. Our implementation allows for batch processing amenable to the mini-batch training in deep learning models, which greatly reduces the training time.

Entropy coding can further reduce the bitrate without affecting the reconstruction performance. One way to apply it in a deep learning model is to use an offline entropy coding method during testing [2]. This approach is not optimized for the bitrate, as the network is not explicitly designed to minimize the entropy. In this paper, we adopt PixelCNN++ [16] to model the probability density function of an image $\mathbf{x}$ over the pixels from all channels as $p(\mathbf{x}) = \prod_i p(x_i \mid x_1, \ldots, x_{i-1})$, where the conditional probability only depends on the pixels above and to the left of pixel $x_i$ in the image. A cross-entropy loss is then used to estimate the entropy of the quantized representation, so that the R-D function is minimized jointly.

Our contributions are summarized as follows. We propose to incorporate TCQ into a deep learning based image compression framework. The image compression framework consists of encoder, decoder and entropy estimation subnetworks. They are optimized in an end-to-end manner. We experiment on two commonly used datasets and both show that our model can achieve superior performance at low bit rates. We also compare TCQ and SQ based on the same baseline model and demonstrate the advantage of TCQ.

Figure 1: Overview of our proposed deep image compression model. The encoder has three consecutive strided convolutional layers that reduce the input size from 256×256 to 32×32. After the TCQ is applied, the quantized feature representations are used as input to two branches. One goes to the decoder network, which generates images at resolutions 64×64, 128×128 and then 256×256. The losses at these three resolutions are added together to form the distortion loss. The other branch goes to the entropy model (PixelCNN++), which produces the probability matrix of pixels based on previous pixels and is optimized by the cross-entropy loss. "Conv, $a \times a$, sb" denotes a convolutional layer with $a \times a$ kernels and a stride of b. "Resblock" is from [7] but without BatchNorm layers. "c↓" represents downsampling by a factor of c.

2 Related Work

There has been a line of research on deep learning based image compression, especially autoencoders with a bottleneck to learn compact representations. The encoder maps the image data to the latent space with reduced dimensionality, and the decoder reconstructs the original image from the latent representation.

2.1 Quantization in DNN

Several approximation approaches have been proposed to allow the network to back-propagate through the quantizer during training. In [19, 11], a binarization layer is designed in the forward pass and the gradients are defined based on a proxy of the binarizer. Ballé et al. [4] stochastically round the given values by adding noise and use the resulting continuous function to compute the gradients during the backward pass. Theis et al. [18] extend the binarizer in [19] to integers and use a straight-through estimator in the backward pass. In [1], a soft quantization in both forward and backward passes is proposed. The model needs to learn the centers and changes from soft quantization to hard assignments during training by an annealing strategy. In [14], the authors apply nearest-neighbor assignment to obtain fixed centers, and the soft quantization in [1] is used during the backward pass.

2.2 Image Compression based on DNN

With the quantizer made differentiable, we also need to make the entropy differentiable in order to jointly minimize the bitrate and distortion. For example, in [4, 18], uniform noise is added to the values to be quantized. The density function of this relaxed formulation is continuous and can be used as an approximation of the entropy of the quantized values. In [1], similar to the soft quantization strategy, a soft entropy is designed by summing the partial assignments to each center instead of counting hard assignments. In [14, 11], an entropy coding scheme is trained to learn the dependencies among the symbols in the latent representation by using a context model. These methods allow the R-D function to be optimized jointly.

3 Proposed Approach

Our model follows the encoder-decoder framework. Different from the previous works that apply a uniform scalar quantizer (SQ) after the encoder network, we propose to use trellis coded quantizer (TCQ) to enhance the reconstruction performance. The whole framework is trained jointly with our entropy model.

3.1 Encoder and Decoder

Since our goal is to study the gain of TCQ over SQ, we only use a simple encoding and decoding framework. Our encoder network consists of three convolutional layers with a stride of 2 to downsample the input. Each convolutional layer is followed by a ReLU layer. We remove the BatchNorm [8] layers, as we find that removing them gives better reconstruction performance. We add one more convolutional layer to reduce the channel dimension to a small value, e.g. 8, to get a condensed feature representation. A tanh layer follows to project the representation to continuous values between -1 and 1. Then a quantizer is applied to quantize the feature maps to discrete values. For the decoder network, we use PixelShuffle [17] layers for upsampling. Inspired by [21], we adopt two intermediate losses after the upsampling operations to force the network to generate images progressively from low resolution to high resolution, as shown in Fig. 1.
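To make the PixelShuffle upsampling step concrete, the following is a minimal pure-Python sketch of the sub-pixel rearrangement described in [17]. It assumes the commonly used channel layout in which `r*r` consecutive input channels fill one `r x r` output block; the function name `pixel_shuffle` and the nested-list tensor format are illustrative choices, not the implementation used in the experiments.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed layout: input channel c*r*r + i*r + j supplies the pixel at
    offset (i, j) inside each r x r output block of output channel c.
    """
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channel count must be divisible by r*r"
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w]
    return out


# Four 1x1 channels become one 2x2 channel.
x = [[[0]], [[1]], [[2]], [[3]]]
print(pixel_shuffle(x, 2))  # [[[0, 1], [2, 3]]]
```

Each call trades channel depth for spatial resolution, so a decoder that starts at 32×32 needs three such steps with `r = 2` to reach 256×256.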

3.2 Trellis Coded Quantizer

Figure 2: An example of a 4-state trellis structure.

Figure 3: (a) Indexing method I for TCQ, (b) indexing method II for TCQ, (c) SQ.

Forward Pass: Trellis coded quantization (TCQ) is applied in JPEG2000 [9] Part II. Unlike in JPEG2000, where the input to TCQ is fixed given an image block, when TCQ is embedded in a deep neural network its input is updated in each training iteration. The forward pass for TCQ is similar to the original implementation in [9]. In essence, TCQ aims to find the path with minimum distortion from the first symbol to the last symbol through a particular trellis structure. Figure 2 shows a trellis structure with 4 states. For $R$ bits/symbol, a quantizer with $2^{R+1}$ quantization levels is created. These reconstruction points can be obtained from a uniform quantizer. As the last layer of our encoder is a tanh function, the input range is $[-1, 1]$, which determines the quantization step and the reconstruction points. Next, all the reconstruction levels are partitioned into four subsets from left to right to form four sub-quantizers. Different subsets are then assigned to different branches of the trellis, so that different paths through the trellis try different combinations to encode an input sequence. Each node only needs to record the input branch that has the smallest cost. After obtaining the minimum distortion for the last symbol, we trace back to get the optimal path, shown in red in Fig. 2 for instance. With this optimal path, 1 bit is used to indicate which branch to take for the next symbol, and the remaining $R-1$ bits are used to indicate the index of the codeword within the corresponding sub-quantizer. We call this indexing method I.
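The forward pass described above can be sketched as a small Viterbi search. This is an illustrative sketch under stated assumptions, not the authors' implementation: the branch table `TRELLIS` uses a common 4-state Ungerboeck-style labelling that may differ from the trellis in Fig. 2, the reconstruction levels are taken as the midpoints of $2^{R+1}$ uniform cells on $[-1, 1]$, the subsets are labelled cyclically from left to right, and squared error is used as the distortion. All function names are hypothetical.

```python
import math

# Assumed 4-state trellis: state -> ((subset, next_state), (subset, next_state)).
TRELLIS = {
    0: ((0, 0), (2, 1)),
    1: ((1, 2), (3, 3)),
    2: ((2, 0), (0, 1)),
    3: ((3, 2), (1, 3)),
}


def tcq_levels(rate_bits, lo=-1.0, hi=1.0):
    """2^(R+1) uniform reconstruction levels (cell midpoints, an assumption)."""
    count = 2 ** (rate_bits + 1)
    step = (hi - lo) / count
    return [lo + (j + 0.5) * step for j in range(count)]


def tcq_quantize(seq, rate_bits=2, init_state=0):
    """Return (reconstruction, branch bits, level indices, total squared error)."""
    levels = tcq_levels(rate_bits)
    # Level j belongs to sub-quantizer j mod 4 (left-to-right labelling).
    subsets = [[(j, c) for j, c in enumerate(levels) if j % 4 == d]
               for d in range(4)]
    cost = [math.inf] * 4
    cost[init_state] = 0.0
    back = []  # per symbol: best (previous state, branch, level index) per state
    for x in seq:
        new_cost = [math.inf] * 4
        step_back = [None] * 4
        for s in range(4):
            if cost[s] == math.inf:
                continue
            for branch, (d, nxt) in enumerate(TRELLIS[s]):
                j, c = min(subsets[d], key=lambda jc: (x - jc[1]) ** 2)
                total = cost[s] + (x - c) ** 2
                if total < new_cost[nxt]:  # keep only the cheapest incoming branch
                    new_cost[nxt] = total
                    step_back[nxt] = (s, branch, j)
        cost = new_cost
        back.append(step_back)
    # Trace back from the cheapest final state.
    state = min(range(4), key=lambda s: cost[s])
    total = cost[state]
    path = []
    for step_back in reversed(back):
        prev, branch, j = step_back[state]
        path.append((branch, j))
        state = prev
    path.reverse()
    recon = [levels[j] for _, j in path]
    return recon, [b for b, _ in path], [j for _, j in path], total


recon, branches, level_idx, cost = tcq_quantize([0.1, -0.4, 0.8, 0.3])
```

Under indexing method I, the transmitted symbol would be the branch bit together with the position of the chosen level inside its sub-quantizer, which in this sketch is `j // 4` for level index `j`.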

Backward Pass: To make a quantizer differentiable, the most common approach is to use the straight-through estimator [5], where the derivative of the quantizer is set to 1. However, we find that this backward method tends to converge slowly for TCQ. As TCQ changes the distribution of the input data, this inconsistency may make it hard for the network to update its weights in the right direction. Similar to [14], given reconstruction points $c_1, \ldots, c_K$, we use the differentiable soft quantization during the backward pass:

$$\tilde{z} = \sum_{j=1}^{K} \frac{\exp(-\sigma |z - c_j|)}{\sum_{l=1}^{K} \exp(-\sigma |z - c_l|)} \, c_j,$$

where $z$ is the input to the quantizer and $\sigma$ is a hyperparameter that adjusts the "softness" of the quantization.
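The soft quantization used in the backward pass can be illustrated with a short sketch. It assumes the softmax-weighted form used in [14], with weights proportional to $\exp(-\sigma |z - c_j|)$; the function names, the example centers and the finite-difference gradient are illustrative assumptions (a deep learning framework would obtain the gradient by automatic differentiation instead).

```python
import math


def soft_quantize(z, centers, sigma):
    """Softmax-weighted average of the centers; a larger sigma is closer to hard rounding."""
    logits = [-sigma * abs(z - c) for c in centers]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    return sum(w * c for w, c in zip(weights, centers)) / sum(weights)


def surrogate_gradient(z, centers, sigma, eps=1e-5):
    """Central-difference derivative of soft_quantize, standing in for the
    derivative of the hard quantizer, which is zero almost everywhere."""
    upper = soft_quantize(z + eps, centers, sigma)
    lower = soft_quantize(z - eps, centers, sigma)
    return (upper - lower) / (2 * eps)


centers = [-0.75, -0.25, 0.25, 0.75]  # illustrative reconstruction points
for sigma in (1.0, 10.0, 1000.0):
    print(sigma, soft_quantize(0.3, centers, sigma), surrogate_gradient(0.3, centers, sigma))
```

With a small `sigma` the output is pulled toward the mean of the centers, while a large `sigma` approaches the nearest center, which is why this hyperparameter controls how closely the backward pass mimics the hard quantizer.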

Discussions: One issue for the TCQ implementation is that the time and memory complexity are both proportional to the number of symbols. Previous implementations usually flatten the input block into a sequence. Because pixels within one feature map are more correlated with each other than with pixels in other feature maps, we treat each feature map as a separate input to TCQ. For feature maps of size $B \times C \times H \times W$ ($B$ is the batch size for the network, $C$ is the number of channels, and $H$ and $W$ are the height and width), we reshape them to $B' \times N$, where $B' = B \cdot C$ is the batch size for TCQ and $N = H \cdot W$ is the number of symbols in a feature map, which reduces the processing time.

The other issue is that the conventional indexing method I described above introduces randomness into the indices of a feature map, as shown in Fig. 3 (a). The reason is that the branch bit depends on the optimal path through the trellis and carries no relationship between neighboring symbols. From JPEG2000 [9], the four sub-quantizers $D_0$, $D_1$, $D_2$ and $D_3$ form two union-quantizers $A_0 = D_0 \cup D_2$ and $A_1 = D_1 \cup D_3$. As pointed out in [13], given a node in the trellis, the codeword that can be chosen is either from $A_0$ or from $A_1$. Therefore, because of the particular structure of the trellis, all $R$ bits can be used to represent the indices within the union-quantizer $A_0$, and the same applies to $A_1$. For example, in Fig. 2, assume we receive the initial state during decoding and that this state uses $A_0$. Only the $D_0$ or $D_2$ sub-quantizer can be chosen for this symbol. As the indices for $D_0$ and $D_2$ are all different, we get the corresponding unique codeword from the received bits. We then know which sub-quantizer ($D_0$ or $D_2$) was chosen and, accordingly, the branch number. We call this indexing method II. Fig. 3 (b) gives the indices of a feature map resulting from indexing method II.
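Decoding under indexing method II can be sketched as follows. The sketch assumes the same hypothetical 4-state branch table and midpoint reconstruction levels as a generic textbook TCQ (the trellis in Fig. 2 may be labelled differently), with union-quantizers $A_0 = D_0 \cup D_2$ and $A_1 = D_1 \cup D_3$; the function names are illustrative.

```python
# Assumed 4-state trellis: state -> ((subset, next_state), (subset, next_state)).
# States 0 and 2 draw from A0 (subsets 0 and 2), states 1 and 3 from A1.
TRELLIS = {
    0: ((0, 0), (2, 1)),
    1: ((1, 2), (3, 3)),
    2: ((2, 0), (0, 1)),
    3: ((3, 2), (1, 3)),
}


def tcq_levels(rate_bits, lo=-1.0, hi=1.0):
    """2^(R+1) uniform reconstruction levels (cell midpoints, an assumption)."""
    count = 2 ** (rate_bits + 1)
    step = (hi - lo) / count
    return [lo + (j + 0.5) * step for j in range(count)]


def to_union_indices(level_idx):
    """Method II index: position of a level inside its union-quantizer (R bits)."""
    return [j // 2 for j in level_idx]


def decode_method2(union_idx, init_state, rate_bits=2):
    """Recover codewords and branch bits from union-quantizer indices."""
    levels = tcq_levels(rate_bits)
    state, recon, branches = init_state, [], []
    for u in union_idx:
        parity = TRELLIS[state][0][0] % 2  # 0: state uses A0, 1: state uses A1
        j = 2 * u + parity                 # index among all 2^(R+1) levels
        subset = j % 4                     # which sub-quantizer the level is in
        for branch, (d, nxt) in enumerate(TRELLIS[state]):
            if d == subset:                # the sub-quantizer identifies the branch
                branches.append(branch)
                recon.append(levels[j])
                state = nxt
                break
    return recon, branches


# A valid path from state 0, given as level indices, survives the round trip.
path = [6, 1, 4, 7, 5]
print(decode_method2(to_union_indices(path), init_state=0))
```

The decoder never receives an explicit branch bit: the current state fixes the union-quantizer, the received index selects one codeword in it, and the sub-quantizer containing that codeword implies the branch and hence the next state.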

3.3 Entropy Coding Model

The aforementioned autoencoder model is not optimized for entropy coding. We can model the conditional probability distribution of a symbol based on its context [14]. The context should depend only on previously decoded symbols and must not use later, unseen symbols. We employ the PixelCNN++ [16] model as the entropy coding model. We replace the last layer of the PixelCNN++ implementation with a softmax function so that a cross-entropy loss can be used during training. This loss is viewed as an estimate of the entropy of the quantized latent representation. Assume we have $R$ bits to encode each symbol of a $C \times H \times W$ feature map; the PixelCNN++ model then outputs a probability matrix with $2^R$ probabilities for each symbol. Encoding is done row by row, and each row is processed from left to right. With the probability matrix, we encode the indices of the feature maps by Adaptive Arithmetic Coding (AAC) to get the compressed representation. During decoding, for the first forward pass, we feed the pre-trained PixelCNN++ model a tensor of all zeros. This first forward pass gives the distributions for the entries at the first position of the feature map. We then decode the indices along the channel dimension by AAC. Based on the received initial states, we recover the symbols at this position. The following decoding steps are based on the conditional probability of the entries at the current position given a tensor that holds the decoded symbols at the already decoded locations and zeros elsewhere. Positions are visited in the same raster order as in encoding. As the decoding proceeds, the remaining zeros are progressively replaced by the decoded symbols.
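The link between the cross-entropy loss and the bitrate can be illustrated with a toy example. The sketch below walks a single feature map in raster order, starts from an all-zero tensor as in the decoding procedure, and accumulates $-\log_2 p$ for each symbol, which is the ideal code length an arithmetic coder approaches. The function `toy_context_model` is a hypothetical stand-in for PixelCNN++ that only looks at the left and upper neighbors; it is not the model used in the paper.

```python
import math


def toy_context_model(decoded, row, col, num_symbols):
    """Hypothetical causal model: favors the symbols of the left and upper neighbors."""
    counts = [1.0] * num_symbols  # smoothing so that every symbol stays possible
    if col > 0:
        counts[decoded[row][col - 1]] += 2.0
    if row > 0:
        counts[decoded[row - 1][col]] += 2.0
    total = sum(counts)
    return [c / total for c in counts]


def estimated_bits(symbols, num_symbols):
    """Sum of -log2 p(symbol | context) over one feature map, in raster order."""
    height, width = len(symbols), len(symbols[0])
    decoded = [[0] * width for _ in range(height)]  # all zeros before decoding starts
    bits = 0.0
    for r in range(height):
        for c in range(width):
            probs = toy_context_model(decoded, r, c, num_symbols)
            bits += -math.log2(probs[symbols[r][c]])
            decoded[r][c] = symbols[r][c]  # zeros are replaced progressively
    return bits


smooth = [[0] * 4 for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(estimated_bits(smooth, 4), estimated_bits(checker, 4))
```

A map whose symbols are predictable from their context costs fewer bits than the fixed-length code, while a map that contradicts the model costs more, which is why indices with spatial structure (indexing method II) are easier to compress than indices that look random.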

4 Experiment

4.1 Dataset

We use the ADE20K dataset [22] for training and validation. We test on the Kodak PhotoCD image dataset and the Tecnick SAMPLING dataset [3]. The ADE20K dataset contains 20K training and 2K validation images. The Kodak PhotoCD image dataset and the Tecnick SAMPLING dataset include 24 images of size 512×768 and 100 images of size 1200×1200, respectively.

4.2 Training Details

We crop each input image to 256×256 during training and test on the whole images. During training, we use an initial learning rate of 0.0001 and decrease it by a factor of 0.4 at epochs 80, 100 and 120. Training is stopped at 140 epochs, and we use the model that gives the best validation result for testing. We set the batch size to 18 and run the training on one 12 GB GTX TITAN GPU with the Adam optimizer. We use 4 quantization levels and increase the number of channels to control the bitrate. Compression performance is evaluated by Multi-Scale Structural Similarity (MS-SSIM) as a function of bits per pixel (bpp), and we use the MS-SSIM loss in Eq. 3 during training.

The first term of Eq. 3 is the distortion error and the second term is the cross-entropy loss of the PixelCNN++ model. The tradeoff hyperparameter in Eq. 3 is set to 1.
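The step schedule for the learning rate described in the training details can be written as a small function. This sketch assumes that "decrease by a factor of 0.4" means multiplying the current rate by 0.4 at each listed epoch; the function name and signature are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, factor=0.4, milestones=(80, 100, 120)):
    """Learning rate at a given epoch: multiplied by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


for epoch in (0, 79, 80, 100, 120, 139):
    print(epoch, learning_rate(epoch))
```

Under this reading the rate stays at 0.0001 for the first 80 epochs and ends at 0.0001 × 0.4³ for the last 20 of the 140 epochs.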

4.3 Baselines

We compare our results with conventional codecs and recent deep learning based compression models. JPEG [20] results are obtained from ImageMagick. JPEG2000 results are from the MATLAB implementation, and BPG results are based on the 4:2:0 chroma format. For deep learning based image compression models, we either collect the released test results or read the rate-distortion curves from the published papers.

Figure 4: (a) MS-SSIM/bpp on the Kodak dataset. (b) MS-SSIM/bpp on the Tecnick dataset.

| Quantizer | Kodak PSNR (dB)/bpp | Kodak MS-SSIM/bpp | Tecnick PSNR (dB)/bpp | Tecnick MS-SSIM/bpp |
|---|---|---|---|---|
| SQ | 24.54/0.077 | 0.9028/0.077 | 26.14/0.068 | 0.9326/0.068 |
| TCQ | 24.95/0.076 | 0.9102/0.076 | 26.82/0.066 | 0.9377/0.066 |
| SQ | 25.66/0.117 | 0.9259/0.117 | 27.63/0.104 | 0.9493/0.104 |
| TCQ | 25.85/0.116 | 0.9315/0.116 | 27.86/0.101 | 0.9518/0.101 |
| SQ | 26.23/0.157 | 0.9386/0.157 | 28.32/0.139 | 0.9572/0.139 |
| TCQ | 26.47/0.154 | 0.9427/0.154 | 28.45/0.133 | 0.9592/0.133 |

Table 1: Performance comparisons between TCQ and SQ using the MS-SSIM loss for training.

| Quantizer | Kodak PSNR (dB)/bpp | Kodak MS-SSIM/bpp | Tecnick PSNR (dB)/bpp | Tecnick MS-SSIM/bpp |
|---|---|---|---|---|
| SQ | 24.86/0.064 | 0.8715/0.064 | 25.94/0.054 | 0.9091/0.054 |
| TCQ | 25.23/0.062 | 0.8824/0.062 | 26.66/0.052 | 0.9189/0.052 |
| SQ | 25.81/0.098 | 0.8992/0.098 | 27.35/0.081 | 0.9308/0.081 |
| TCQ | 26.35/0.096 | 0.9090/0.096 | 27.96/0.078 | 0.9373/0.078 |
| SQ | 26.56/0.133 | 0.9178/0.133 | 28.18/0.112 | 0.9427/0.112 |
| TCQ | 26.89/0.130 | 0.9232/0.130 | 28.65/0.110 | 0.9473/0.110 |

Table 2: Performance comparisons between TCQ and SQ using the MSE loss for training.

Figure 5: Qualitative results (MS-SSIM/bpp) for the image kodim16 from the Kodak dataset. (a) Original. (b) JPEG: 0.8177/0.128. (c) JPEG2000: 0.9028/0.118. (d) Ballé [4]: 0.9139/0.117. (e) BPG: 0.9189/0.112. (f) Ours (SQ): 0.9257/0.112. (g) Ours (TCQ): 0.9323/0.109.

4.4 Comparisons with previous works

Fig. 4 compares the results of our approach with those of other image compression algorithms (Theis et al. [18], Ballé et al. [4], Agustsson et al. [1], Johnston et al. [10], Li et al. [11], Mentzer et al. [14], Cheng et al. [6]) on two datasets. Despite the simplicity of our network, our model with TCQ shows superior performance at low bit rates. At high bit rates, our results are comparable to those of previous papers, except for the latest results in Mentzer et al. [14] and Cheng et al. [6]. This is probably because at high bit rates we increase the number of channels of the model but do not fine-tune the training parameters.

4.5 Comparisons between TCQ and SQ

In Tab. 1, we compare the MS-SSIM and PSNR of TCQ and SQ using the MS-SSIM loss for training. At the lowest bit rate (around 0.07 bpp), TCQ is about 0.007 higher in MS-SSIM (0.41 dB in PSNR) than SQ on the Kodak dataset and 0.005 higher in MS-SSIM (0.68 dB in PSNR) on the Tecnick dataset. We notice that at higher bit rates, the performance gap between TCQ and SQ is less obvious. As the number of channels increases, the learning ability of the model improves as well. The type of quantizer may therefore matter less for more complex models.

In Tab. 2, we compare the performance of TCQ and SQ using the MSE loss as the distortion for training, with the tradeoff hyperparameter of Eq. 3 set to 0.01. A similar trend is observed, where TCQ outperforms SQ at the same bit rate.

The PixelCNN++ model used in this paper is not optimal for entropy coding. In [15], a context model along with a hyper-network is used to predict the mean and scale parameters of a set of Gaussian models, which saves more bits than directly using the probability matrix. In our experiment, entropy coding with PixelCNN++ reaches 0.154 bpp for the model with 8 channels, compared to 0.25 bpp before entropy coding, on the Kodak dataset.
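The 0.25 bpp figure before entropy coding follows from the size of the latent representation. The short calculation below assumes 2 bits per symbol (4 quantization levels), a total downsampling factor of 8 from the three stride-2 layers, and no side information such as the initial trellis states; the function name is illustrative.

```python
def fixed_length_bpp(height, width, channels, bits_per_symbol, stride=8):
    """Bits per pixel when every latent symbol is stored with a fixed-length code."""
    latent_symbols = channels * (height // stride) * (width // stride)
    return latent_symbols * bits_per_symbol / (height * width)


print(fixed_length_bpp(256, 256, channels=8, bits_per_symbol=2))  # 0.25
print(fixed_length_bpp(512, 768, channels=8, bits_per_symbol=2))  # 0.25

# Relative saving of the reported 0.154 bpp over the fixed-length 0.25 bpp.
saving = 1 - 0.154 / 0.25
print(round(saving, 3))  # 0.384
```

Under these assumptions, the entropy model removes roughly 38% of the bits of the 8-channel model on Kodak.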

4.6 Qualitative Comparisons

In Fig. 5, we show results from different codecs. Fig. 5 (a) is the original image. In (b), we can clearly see compression artifacts in the JPEG reconstructed image. In (c), (d) and (e), the shape of the cloud is very blurry. For BPG in (e), there are also some block artifacts in the green box sample. We notice that in (b), (c), (d) and (e), the sky lacks the striped cloud patterns at the upper left corner and there are fewer ripples in the areas below the trees. Our results in (f) and (g) have generally better perceptual quality.

5 Conclusion

In this paper, we incorporate TCQ into an end-to-end deep learning based image compression framework. Experiments show that our model can achieve results comparable to previous works. The comparisons between TCQ and SQ show that TCQ improves both PSNR and MS-SSIM over SQ at low bit rates, whether the MSE loss or the MS-SSIM loss is used for training.



  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pp. 1141–1151. Cited by: §1, §2.1, §2.2, §4.4.
  • [2] M. Akbari, J. Liang, and J. Han (2019) Dsslic: deep semantic segmentation-based layered image compression. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2042–2046. Cited by: §1.
  • [3] N. Asuni and A. Giachetti (2014) TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms.. In Eurographics Italian Chapter Conference, Vol. 1, pp. 3. Cited by: §4.1.
  • [4] J. Ballé, V. Laparra, and E. P. Simoncelli (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. Cited by: §1, §2.1, §2.2, Figure 5, §4.4.
  • [5] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.2.
  • [6] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2019) Learning image and video compression through spatial-temporal energy compaction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10071–10080. Cited by: §4.4.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Figure 1.
  • [8] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1945–1953. Cited by: §3.1.
  • [9] (2000-12) Information technology – jpeg 2000 image coding system: core coding system. Standard International Organization for Standardization. Cited by: §1, §3.2, §3.2.
  • [10] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393. Cited by: §4.4.
  • [11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223. Cited by: §2.1, §2.2, §4.4.
  • [12] M. W. Marcellin and T. R. Fischer (1990) Trellis coded quantization of memoryless and gauss-markov sources. IEEE transactions on communications 38 (1), pp. 82–93. Cited by: §1.
  • [13] M. W. Marcellin (1994) On entropy-constrained trellis coded quantization. IEEE Transactions on Communications 42 (1), pp. 14–16. Cited by: §3.2.
  • [14] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402. Cited by: §1, §2.1, §2.2, §3.2, §3.3, §4.4.
  • [15] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780. Cited by: §4.5.
  • [16] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) Pixelcnn++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517. Cited by: §1, §3.3.
  • [17] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §3.1.
  • [18] L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395. Cited by: §1, §2.1, §2.2, §4.4.
  • [19] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085. Cited by: §2.1.
  • [20] G. K. Wallace (1992) The jpeg still picture compression standard. IEEE transactions on consumer electronics 38 (1), pp. xviii–xxxiv. Cited by: §4.3.
  • [21] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324. Cited by: §3.1.
  • [22] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §4.1.