1 Introduction
The goal of designing the optimal image codec is to minimize the distortion $D$ between the original image and the reconstructed image subject to a constraint on the bitrate $R$. As the entropy $H$ is the lower bound of the bitrate $R$, the optimization can be formulated as minimizing $D + \lambda H$, where $\lambda$ is the tradeoff factor. Recently, many works [4, 18, 14] have attempted to develop image compression models based on deep learning architectures. In these approaches, a uniform scalar quantizer (SQ) is commonly applied to the feature maps between the encoder and decoder. As the codewords are distributed in a cube and the corresponding Voronoi regions induced by SQ are always cubic, SQ cannot achieve the rate-distortion (RD) bound [9]. Vector quantization (VQ) has the optimal performance, but its complexity is usually high. The trellis coded quantizer (TCQ) is a structured VQ that achieves better performance than SQ with modest computational complexity [12]. It is shown in [12] that for memoryless uniform sources, a 4-state TCQ can achieve 0.87 dB higher SNR than SQ at 4 bits/sample.
In this paper, motivated by the superior performance of TCQ over SQ in traditional image coding, we propose to use TCQ to replace the commonly used SQ in a deep learning based image compression model. The soft-to-hard strategy [1] is applied to allow for backpropagation during training. To the best of our knowledge, we are the first to investigate the performance of TCQ in a deep learning based image compression framework. Our implementation allows for batch processing amenable to the mini-batch training of deep learning models, which greatly reduces the training time.
Entropy coding can further reduce the bitrate without impacting the reconstruction performance. One way to apply it in a deep learning model is to use an offline entropy coding method during testing [2]. This method is not optimized for the bitrate, as the network is not explicitly designed to minimize the entropy. In this paper, we adopt PixelCNN++ [16] to model the probability density function of an image $x$ over pixels from all channels as $p(x) = \prod_i p(x_i \mid x_{<i})$, where the conditional probability only depends on the pixels above and to the left of the current pixel in the image. A cross-entropy loss is then used to estimate the entropy of the quantized representation, so that the RD function is minimized jointly.
Our contributions are summarized as follows. We propose to incorporate TCQ into a deep learning based image compression framework. The framework consists of encoder, decoder and entropy estimation subnetworks, which are optimized in an end-to-end manner. We experiment on two commonly used datasets, and both show that our model can achieve superior performance at low bit rates. We also compare TCQ and SQ on the same baseline model and demonstrate the advantage of TCQ.
2 Related Work
There has been a line of research on deep learning based image compression, especially autoencoders with a bottleneck to learn compact representations. The encoder maps the image data to the latent space with reduced dimensionality, and the decoder reconstructs the original image from the latent representation.
2.1 Quantization in DNN
Several approximation approaches have been proposed to allow the network to backpropagate through the quantizer during training. In [19, 11], a binarization layer is designed in the forward pass and the gradients are defined based on a proxy of the binarizer. Ballé et al. [4] stochastically round the given values by adding noise and use the new continuous function to compute the gradients during the backward pass. Theis et al. [18] extend the binarizer in [19] to integers and use a straight-through estimator in the backward pass. In [1], a soft quantization in both forward and backward passes is proposed. The model needs to learn the centers and change from soft quantization to hard assignments during training by an annealing strategy. In [14], the authors apply the nearest neighbors to obtain fixed centers, and the soft quantization in [1] is used during the backward pass.
2.2 Image Compression based on DNN
With the quantizer being differentiable, we also need to make the entropy differentiable in order to jointly minimize the bitrate and the distortion. For example, in [4, 18], uniform noise is added to model the effect of the quantizer. The density function of this relaxed formulation is continuous and can be used as an approximation of the entropy of the quantized values. In [1], similar to the soft quantization strategy, a soft entropy is designed by summing up the partial assignments to each center instead of counting. In [14, 11], an entropy coding scheme is trained to learn the dependencies among the symbols in the latent representation by using a context model. These methods allow the RD function to be optimized jointly.
3 Proposed Approach
Our model follows the encoder-decoder framework. Unlike previous works that apply a uniform scalar quantizer (SQ) after the encoder network, we propose to use a trellis coded quantizer (TCQ) to enhance the reconstruction performance. The whole framework is trained jointly with our entropy model.
3.1 Encoder and Decoder
Since our goal is to study the gain of TCQ over SQ, we only use a simple encoding and decoding framework. Our encoder network consists of three convolutional layers with a stride of 2 to downsample the input. Each convolutional layer is followed by a ReLU layer. We remove the BatchNorm [8] layers, as we find that removing them gives better reconstruction performance. We add one more convolutional layer to reduce the channel dimension to a small value, e.g. 8, to get a condensed feature representation. A tanh layer follows to project the features to continuous values between -1 and 1. Then a quantizer is applied to quantize the feature maps to discrete values. For the decoder network, we use the PixelShuffle [17] layer for upsampling. Inspired by [21], we adopt two intermediate losses after each upsampling operation to force the network to generate images progressively from low resolution to high resolution, as shown in Fig. 1.
3.2 Trellis Coded Quantizer
Forward Pass: The trellis coded quantizer (TCQ) is applied in JPEG2000 [9] Part II. Different from JPEG2000, where the input for TCQ is fixed given an image block, the input for TCQ embedded in a deep neural network is updated in each iteration during training. The forward pass for TCQ is similar to the original implementation in [9]. In essence, TCQ aims to find a path with minimum distortion from the start symbol to the last symbol based on the particular diagram structure. Figure 3 shows a trellis structure with 4 states. For $R$ bits/symbol, a quantizer with $2^{R+1}$ quantization levels is created. These reconstruction points can be obtained by a uniform quantizer. As the last layer of our encoder is a tanh function, we have $z_{\min} = -1$ and $z_{\max} = 1$. The quantization step is $\Delta = (z_{\max} - z_{\min}) / 2^{R+1}$. The reconstruction points $c_j$ ($j = 1, \dots, 2^{R+1}$) are spaced uniformly, $\Delta$ apart, over this range. Next, all the reconstruction levels are partitioned from left to right into four subsets ($D_0$, $D_1$, $D_2$, $D_3$) to form four subquantizers. Different subsets are then assigned to different branches of the trellis, so that different paths of the trellis can try different combinations to encode an input sequence. Each node only needs to record the input branch that has the smallest cost. After obtaining the minimum distortion for the last symbol, we trace back to get the optimal path, shown in red in Fig. 3 for instance. With this optimal path, 1 bit is used to indicate which branch to take for the next symbol, and the remaining $R - 1$ bits are used to indicate the index of the codeword within the corresponding subquantizer. We call this indexing method I.
Backward Pass: The most common way to make a quantizer differentiable is to use the straight-through estimator [5], where the derivative of the quantizer is set to 1. However, we find that this backward method tends to converge slowly for TCQ. As TCQ changes the distribution of the input data, this inconsistency may make it hard for the network to update the weights in the right direction. Similar to [14], given the reconstruction points $c_1, \dots, c_L$, we use the differentiable soft quantization during the backward pass:
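$$\tilde{z} = \sum_{j=1}^{L} \frac{\exp(-\sigma \lvert z - c_j \rvert)}{\sum_{l=1}^{L} \exp(-\sigma \lvert z - c_l \rvert)}\, c_j \qquad (1)$$
where $z$ is the input to the quantizer, $c_j$ are the $L$ reconstruction points, and $\sigma$ is a hyperparameter that adjusts the “softness” of the quantization.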
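The soft quantization used in the backward pass can be illustrated with a small sketch. It assumes the softmax-weighted form used in [14], with the absolute distance to each center as the exponent; the function names `soft_quantize` and `hard_quantize`, the four example centers and the sample value 0.3 are hypothetical choices for illustration, not taken from the paper.

```python
import math

def soft_quantize(z, centers, sigma):
    """Softmax-weighted average of the centers; weights decay with |z - c_j|."""
    dists = [abs(z - c) for c in centers]
    d_min = min(dists)  # shift by the smallest distance for numerical stability
    weights = [math.exp(-sigma * (d - d_min)) for d in dists]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, centers)) / total

def hard_quantize(z, centers):
    """Nearest-center assignment, i.e. the limit of soft_quantize for large sigma."""
    return min(centers, key=lambda c: abs(z - c))

# Assumed example: four uniformly spaced centers in (-1, 1).
centers = [-0.75, -0.25, 0.25, 0.75]
for sigma in (1.0, 10.0, 100.0):
    print(sigma, round(soft_quantize(0.3, centers, sigma), 4))
```

A small `sigma` pulls the output toward the mean of the centers, while a large `sigma` approaches the hard nearest-center assignment, which is why the hyperparameter controls the "softness".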
Discussions: One issue for the TCQ implementation is that the time and memory complexity are both proportional to the number of symbols. Previous implementations usually flatten the input block into a sequence. Because pixels in one feature map are more correlated with each other than with pixels in other feature maps, we consider each feature map as one input for TCQ. For feature maps of size $B \times C \times H \times W$ ($B$ is the batch size for the network, $C$ is the number of channels, and $H$ and $W$ are the height and width), we reshape them to size $B' \times N$, where $B' = B \cdot C$ is the batch size for TCQ and $N = H \cdot W$ is the number of symbols in a feature map, which reduces the processing time.
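The forward pass of Section 3.2 can be sketched as a Viterbi search over a 4-state trellis for one flattened feature map. This is a minimal sketch under stated assumptions, not the authors' implementation: the branch labelling in `TRELLIS`, the start state 0, the midpoint placement of the reconstruction points in `make_levels`, and the names `tcq_encode` and `make_levels` are all hypothetical. The trellis of Fig. 3 may label its branches differently.

```python
# Assumed 4-state trellis: TRELLIS[state] = [(subset_id, next_state), ...].
# States 0 and 1 use D0/D2, states 2 and 3 use D1/D3 (illustrative labelling).
TRELLIS = {
    0: [(0, 0), (2, 2)],
    1: [(2, 0), (0, 2)],
    2: [(1, 1), (3, 3)],
    3: [(3, 1), (1, 3)],
}

def make_levels(rate, lo=-1.0, hi=1.0):
    """2^(rate+1) uniformly spaced reconstruction points (midpoints assumed)."""
    n = 2 ** (rate + 1)
    step = (hi - lo) / n
    return [lo + (j + 0.5) * step for j in range(n)]

def tcq_encode(samples, rate=2):
    """Return (quantized sequence, total squared error) of the best trellis path."""
    levels = make_levels(rate)
    subsets = [levels[k::4] for k in range(4)]  # D0..D3, assigned left to right
    inf = float("inf")
    cost = [0.0, inf, inf, inf]  # assumed start in state 0
    history = []
    for sample in samples:
        new_cost = [inf] * 4
        choice = [None] * 4  # per next state: (previous state, chosen codeword)
        for state in range(4):
            if cost[state] == inf:
                continue
            for subset_id, nxt in TRELLIS[state]:
                code = min(subsets[subset_id], key=lambda v: (v - sample) ** 2)
                total = cost[state] + (code - sample) ** 2
                if total < new_cost[nxt]:  # keep only the cheapest incoming branch
                    new_cost[nxt] = total
                    choice[nxt] = (state, code)
        history.append(choice)
        cost = new_cost
    state = min(range(4), key=lambda s: cost[s])
    best = cost[state]
    path = []
    for choice in reversed(history):  # trace back the surviving path
        state, code = choice[state]
        path.append(code)
    path.reverse()
    return path, best

quantized, distortion = tcq_encode([0.1, -0.4, 0.9, -0.95, 0.3])
print(quantized, distortion)
```

Each row of the reshaped input is an independent sequence, so rows can be processed in parallel; the loop above handles a single row for clarity.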
The other issue is that the conventional indexing method I mentioned above brings in randomness for the indices of a feature map, as shown in Fig. 3 (a). The reason is that the branch bit depends on the optimal path in the trellis structure and does not carry any relationship among the symbols. From JPEG2000 [9], we have two union quantizers $A_0 = D_0 \cup D_2$ and $A_1 = D_1 \cup D_3$. As pointed out in [13], given a node in the diagram, the codeword that can be chosen is either from $A_0$ or from $A_1$. Therefore, because of the particular structure of the trellis, all $R$ bits can be used to represent the indices for the union quantizer $A_0$, and the same applies to $A_1$. For example, in Fig. 3, assume we receive the initial state during decoding. Only the $D_0$ or $D_2$ subquantizer can be chosen for this symbol. As the indices for $D_0$ and $D_2$ are all different, we get the corresponding unique codeword based on the received bits. Then we easily know which subquantizer ($D_0$ or $D_2$) is chosen and, accordingly, the branch number. We call this indexing method II. Fig. 3 (b) gives the indices of a feature map resulting from indexing method II.
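Indexing method II can be illustrated with a short round-trip sketch. It assumes the same hypothetical 4-state branch labelling as an Ungerboeck-style trellis in which two states draw from the even-numbered levels (the union of the first and third subsets) and two from the odd-numbered levels; `NEXT_STATE`, `union_quantizer`, `encode_indices`, `decode_indices` and the midpoint levels are illustrative names and choices, not the paper's.

```python
RATE = 2                      # bits per symbol (assumed for the example)
N_LEVELS = 2 ** (RATE + 1)
STEP = 2.0 / N_LEVELS
LEVELS = [-1.0 + (j + 0.5) * STEP for j in range(N_LEVELS)]

# Assumed trellis: NEXT_STATE[state][subset_id] -> next state.
NEXT_STATE = {
    0: {0: 0, 2: 2},
    1: {2: 0, 0: 2},
    2: {1: 1, 3: 3},
    3: {3: 1, 1: 3},
}

def union_quantizer(state):
    """Level numbers usable at `state`: even levels or odd levels."""
    parity = 0 if state in (0, 1) else 1
    return [j for j in range(N_LEVELS) if j % 2 == parity]

def encode_indices(level_path, state=0):
    """Map a valid path of level numbers to RATE-bit union-quantizer indices."""
    indices = []
    for j in level_path:
        indices.append(union_quantizer(state).index(j))
        state = NEXT_STATE[state][j % 4]  # subset id is the level number mod 4
    return indices

def decode_indices(indices, state=0):
    """Recover codewords; the subset (and so the branch) follows from the index."""
    codewords = []
    for idx in indices:
        j = union_quantizer(state)[idx]
        codewords.append(LEVELS[j])
        state = NEXT_STATE[state][j % 4]
    return codewords

indices = encode_indices([0, 6, 1, 4])
print(indices, decode_indices(indices))
```

No separate branch bit is transmitted here: the decoder tracks the state, looks the index up in the union quantizer for that state, and derives the next state from the subset the codeword belongs to.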
3.3 Entropy Coding Model
The aforementioned autoencoder model is not optimized for entropy coding. We can model the conditional probability distribution of a symbol based on its context [14]. The context should only depend on previously decoded symbols and must not use later, unseen symbols. We employ the PixelCNN++ [16] model as the entropy coding model. We replace the last layer of the PixelCNN++ implementation (https://github.com/pclucas14/pixelcnnpp) with a softmax function so that a cross-entropy loss can be used during training. This loss is viewed as an estimate of the entropy of the quantized latent representation. Given $R$ bits per symbol and a $C \times H \times W$ feature map, the PixelCNN++ model outputs a probability distribution over the $2^R$ possible indices at every position. Encoding is done row by row, and each row is ordered from left to right. With these probabilities, we encode the indices of the feature maps by Adaptive Arithmetic Coding (AAC) (https://github.com/nayuki/Reference-arithmetic-coding) to get the compressed representation. During decoding, for the first forward pass, we feed the pretrained PixelCNN++ model a tensor of all zeros. This first forward pass gives the distributions for the entries at the first position in the feature map. Then we decode the indices along the channel dimension by AAC. Based on the received initial states, we recover the symbols at that position. The following decoding steps are based on the conditional probability
(2)
in which the model input is a tensor that holds the symbols decoded so far at their locations and zeros elsewhere. As the decoding proceeds, the remaining zeros are progressively replaced by the decoded symbols.
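The link between the cross-entropy loss and the bitrate can be made concrete with a small sketch. It assumes the standard ideal code length of $-\log_2 p$ bits for a symbol that the context model assigns probability $p$, which an arithmetic coder approaches closely; the function names and the toy probabilities are hypothetical.

```python
import math

def estimated_bits(symbol_probabilities):
    """Ideal code length: sum of -log2(p) over the probabilities of the coded symbols."""
    return sum(-math.log2(p) for p in symbol_probabilities)

def bits_per_pixel(symbol_probabilities, image_height, image_width):
    """Estimated bitrate of the latent representation, normalized by image size."""
    return estimated_bits(symbol_probabilities) / (image_height * image_width)

# Toy example: 16 latent symbols, each predicted with probability 0.5,
# for an 8 x 8 image.
print(bits_per_pixel([0.5] * 16, 8, 8))
```

The better the context model predicts each symbol, the larger the assigned probabilities and the fewer bits are spent, which is what minimizing the cross-entropy term encourages.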
4 Experiment
4.1 Dataset
We use the ADE20K dataset [22] for training and validation. We test on the Kodak PhotoCD image dataset (http://r0k.us/graphics/kodak/) and the Tecnick SAMPLING dataset [3]. The ADE20K dataset contains 20K training and 2K validation images. The Kodak PhotoCD image dataset and the Tecnick SAMPLING dataset include 24 images of size 512×768 and 100 images of size 1200×1200, respectively.
4.2 Training Details
We crop each input image to 256×256 during training and test on the whole images. During training, we use a learning rate of 0.0001 at the beginning and decrease it by a factor of 0.4 at epochs 80, 100 and 120. Training is stopped at 140 epochs, and we use the model that gives the best validation result for testing. We set the batch size to 18 and run the training on one 12 GB GTX TITAN GPU with the Adam optimizer. We use 4 quantization levels and increase the number of channels to control the bitrate. Compression performance is evaluated with Multi-Scale Structural Similarity (MS-SSIM) against bits per pixel (bpp), and we use the MS-SSIM loss in Eq. 3 during training.
(3)
The first term is the distortion error and the second term is the cross-entropy loss of the PixelCNN++ model. The weighting hyperparameter is set to 1.
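The step schedule described in the training details can be written down as a small helper. This is a sketch under one assumption that the text does not settle: the decay is taken to apply from the milestone epoch itself onward. The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(80, 100, 120), gamma=0.4):
    """Step decay: multiply the base rate by gamma once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

for epoch in (0, 79, 80, 100, 120, 139):
    print(epoch, learning_rate(epoch))
```

Most deep learning frameworks ship an equivalent multi-step scheduler, so in practice this logic would be delegated to the framework rather than hand-written.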
4.3 Baselines
We compare our results with conventional codecs and recent deep learning based compression models. JPEG [20] results are obtained from ImageMagick (https://imagemagick.org). JPEG2000 results are from the MATLAB implementation, and BPG results are based on the 4:2:0 chroma format (http://bellard.org/bpg). For deep learning based image compression models, we either collect the released test results or plot the rate-distortion curves from the published papers.
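Table 1: Comparison between TCQ and SQ using the MS-SSIM loss for training.

| Quantizer | Kodak PSNR (dB) / bpp | Kodak MS-SSIM / bpp | Tecnick PSNR (dB) / bpp | Tecnick MS-SSIM / bpp |
| --- | --- | --- | --- | --- |
| SQ | 24.54 / 0.077 | 0.9028 / 0.077 | 26.14 / 0.068 | 0.9326 / 0.068 |
| TCQ | 24.95 / 0.076 | 0.9102 / 0.076 | 26.82 / 0.066 | 0.9377 / 0.066 |
| SQ | 25.66 / 0.117 | 0.9259 / 0.117 | 27.63 / 0.104 | 0.9493 / 0.104 |
| TCQ | 25.85 / 0.116 | 0.9315 / 0.116 | 27.86 / 0.101 | 0.9518 / 0.101 |
| SQ | 26.23 / 0.157 | 0.9386 / 0.157 | 28.32 / 0.139 | 0.9572 / 0.139 |
| TCQ | 26.47 / 0.154 | 0.9427 / 0.154 | 28.45 / 0.133 | 0.9592 / 0.133 |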
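
Table 2: Comparison between TCQ and SQ using the MSE loss for training.

| Quantizer | Kodak PSNR (dB) / bpp | Kodak MS-SSIM / bpp | Tecnick PSNR (dB) / bpp | Tecnick MS-SSIM / bpp |
| --- | --- | --- | --- | --- |
| SQ | 24.86 / 0.064 | 0.8715 / 0.064 | 25.94 / 0.054 | 0.9091 / 0.054 |
| TCQ | 25.23 / 0.062 | 0.8824 / 0.062 | 26.66 / 0.052 | 0.9189 / 0.052 |
| SQ | 25.81 / 0.098 | 0.8992 / 0.098 | 27.35 / 0.081 | 0.9308 / 0.081 |
| TCQ | 26.35 / 0.096 | 0.9090 / 0.096 | 27.96 / 0.078 | 0.9373 / 0.078 |
| SQ | 26.56 / 0.133 | 0.9178 / 0.133 | 28.18 / 0.112 | 0.9427 / 0.112 |
| TCQ | 26.89 / 0.130 | 0.9232 / 0.130 | 28.65 / 0.110 | 0.9473 / 0.110 |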

Figure 5: Qualitative comparison (MS-SSIM / bpp). (a) Original; (b) JPEG, 0.8177 / 0.128; (c) JPEG2000, 0.9028 / 0.118; (d) Ballé [4], 0.9139 / 0.117; (e) BPG, 0.9189 / 0.112; (f) Ours (SQ), 0.9257 / 0.112; (g) Ours (TCQ), 0.9323 / 0.109.
4.4 Comparisons with previous works
Fig. 4 compares our approach with other image compression algorithms (Theis et al. [18], Ballé et al. [4], Agustsson et al. [1], Johnston et al. [10], Li et al. [11], Mentzer et al. [14], Cheng et al. [6]) on two datasets. Despite the simplicity of our network, our model with TCQ shows superior performance at low bit rates. At high bit rates, our results are comparable to those of previous papers, except for the latest results of Mentzer et al. [14] and Cheng et al. [6]. This is probably because, at high bit rates, we increase the number of channels of the model but do not fine-tune the training parameters.
4.5 Comparisons between TCQ and SQ
In Tab. 1, we compare the MS-SSIM and PSNR of TCQ and SQ when the MS-SSIM loss is used for training. At the low bit rate (around 0.07 bpp), TCQ is 0.008 higher in MS-SSIM (0.41 dB in PSNR) than SQ on the Kodak dataset and 0.005 higher in MS-SSIM (0.68 dB in PSNR) on the Tecnick dataset. We notice that at higher bit rates, the performance gap between TCQ and SQ is less obvious. As the number of channels increases, the learning ability of the model improves as well. The type of quantizer may not be that important for more complex models.
In Tab. 2, we compare the performance of TCQ and SQ when the MSE loss is used as the distortion for training, with the weighting hyperparameter set to 0.01. A similar trend is observed, where TCQ outperforms SQ at the same bit rate.
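For readers relating the PSNR columns to the MSE used as the training distortion, the standard conversion is sketched below. The peak value of 255 assumes 8-bit images, which the text does not state explicitly, and the helper names are hypothetical.

```python
import math

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when the PSNR improves by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

# Example: the MSE factor implied by a PSNR gain of 0.41 dB.
print(mse_ratio_for_gain(0.41))
```

Because PSNR is logarithmic in the MSE, a fixed gain in dB corresponds to the same relative MSE reduction at any operating point.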
The PixelCNN++ model used in this paper is not optimal for entropy coding. In [15], a context model along with a hypernetwork is used to predict the mean and scale of a set of Gaussian models, which saves more bits than directly using the probability matrix. In our experiment, the model with 8 channels reaches 0.154 bpp on the Kodak dataset, compared with 0.25 bpp before entropy coding.
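The 0.25 bpp figure before entropy coding can be checked with a short calculation. It assumes that only the three stride-2 convolutions change the spatial resolution (a factor of 8 per side) and that every latent symbol costs log2 of the number of quantization levels; the function name and parameter names are hypothetical.

```python
import math

def pre_entropy_bpp(channels, levels, num_stride2_layers=3):
    """Bits per pixel when every latent symbol is stored with log2(levels) bits."""
    bits_per_symbol = math.log2(levels)
    pixels_per_latent_position = (2 ** num_stride2_layers) ** 2
    return channels * bits_per_symbol / pixels_per_latent_position

raw = pre_entropy_bpp(channels=8, levels=4)
saving = 1.0 - 0.154 / raw  # relative bitrate saved by entropy coding on Kodak
print(raw, saving)
```

With 8 channels and 4 quantization levels this gives 8 × 2 / 64 bits per pixel, which agrees with the 0.25 bpp quoted in the text.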
4.6 Qualitative Comparisons
In Fig. 5, we show results from different codecs. Fig. 5 (a) is the original image. In (b), we can clearly see compression artifacts in the JPEG reconstructed image. In (c), (d) and (e), the shape of the cloud is very blurry. For BPG in (e), there are also some block artifacts in the green box sample. We notice that in (b), (c), (d) and (e), the sky lacks the striped cloud patterns at the upper left corner, and there are fewer ripples in the areas below the trees. Our results in (f) and (g) have generally better perceptual quality.
5 Conclusion
In this paper, we incorporate TCQ into an end-to-end deep learning based image compression framework. Experiments show that our model can achieve results comparable to previous works. The comparisons between TCQ and SQ show that TCQ improves both PSNR and MS-SSIM over SQ at low bit rates, whether the MSE loss or the MS-SSIM loss is used for training.
References
[1] (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pp. 1141–1151.
[2] (2019) DSSLIC: deep semantic segmentation-based layered image compression. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2042–2046.
[3] (2014) TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms. In Eurographics Italian Chapter Conference, Vol. 1, pp. 3.
[4] (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.
[5] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
[6] (2019) Learning image and video compression through spatial-temporal energy compaction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10071–10080.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2017) Batch renormalization: towards reducing mini-batch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1945–1953.
[9] (2000-12) Information technology – JPEG 2000 image coding system: core coding system. Standard, International Organization for Standardization.
[10] (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
[11] (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223.
[12] (1990) Trellis coded quantization of memoryless and Gauss-Markov sources. IEEE Transactions on Communications 38 (1), pp. 82–93.
[13] (1994) On entropy-constrained trellis coded quantization. IEEE Transactions on Communications 42 (1), pp. 14–16.
[14] (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
[15] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
[16] (2017) PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
[17] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
[18] (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
[19] (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085.
[20] (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv.
[21] (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324.
[22] (2017) Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.