1 Introduction
Scalable coding differs from non-scalable coding in that the coded bitstream of scalable coding is partially decodable. That is, scalable image compression allows reconstructing complete images at multiple quality levels by decoding appropriate subsets of the whole bitstream, a property called "bitstream scalability". Compared with a simulcast codec [1], a scalable codec produces a cumulative set of hierarchical representations that can be combined for progressive refinement, instead of a multirate set of signals that are independent of each other. Scalable/progressive image compression is therefore of great significance for image transmission and storage in practice.
With the flourishing of deep learning, DNN-based models for lossy image compression [2, 3, 4, 5, 6, 7, 8, 9] have been widely explored recently. Toderici et al. [4] and Baig et al. [7] study the design of network architectures for deep image compression. Ballé et al. [3] and Agustsson et al. [6] introduce trainable quantization methods to enable end-to-end optimization. The works of [3, 7, 10, 4, 11] investigate context models to improve the compression efficiency of arithmetic coding. In addition, a biologically inspired joint nonlinearity, named generalized divisive normalization (GDN), is proposed in [3], and the structure of side information in learned image compression is well studied in [8, 9]. Notably, Agustsson et al. [12] employ generative models to improve the perceptual performance of learned image compression. Scalability, although it is supported by many prevailing conventional image/video compression standards [13, 14, 15], is a critical property that has not drawn much explicit attention in deep-learning-based schemes.
Among related works based on deep learning, Gregor et al. [16] introduce a novel hierarchical representation of images with a homogeneous deep generative model, which is considered a "conceptual compression" framework rather than a real compressor. The framework proposed by Toderici et al. [4] can be viewed as the first DNN-based image compression model supporting bitstream scalability; it employs recurrent neural networks (RNNs) to iteratively compress the residual of the last reconstruction relative to the original image. However, [4] still suffers from limited rate-distortion performance and a complex encoding-decoding process due to its multi-iteration encoding and decoding.

In this paper, we are devoted to developing a more effective learned scalable image compression scheme. Besides better rate-distortion performance, we also aim to obtain reconstructed images at different quality levels simultaneously via one-pass encoding and decoding. Inspired by the Fine Granularity Scalability (FGS) [14] in the MPEG-4 video standard, we adopt bit-plane decomposition to decompose the information before the input layer of the neural networks. Bit-plane decomposition has an inherent advantage in transforming an image into a hierarchical representation: an RGB image can be transformed into 24 bit-planes losslessly (8 bit-planes per channel). Two significant observations can be made: firstly, the sum of the information entropy [17] (shortly, "entropy") of all bit-planes always exceeds the entropy of the corresponding original image; secondly, different bit-planes are not equal in their entropy. Theoretically, the information carried by a sequence of independent events is the sum of the information carried by each event. Therefore, there must be correlation among the bit-planes, which is hard to exploit well in conventional bit-plane coding. In addition, the information carried by different bit-planes is asymmetrical due to their unequal entropy volumes. In this work, we make the first endeavour to employ deep neural networks to capture the correlation among bit-planes in the coding process. Moreover, for information of different importance for reconstruction, we design a self-consistent architecture that disentangles it to form hierarchical representations with end-to-end optimization.
In summary, we make three main contributions: (1) We propose a new DNN-based framework for learned scalable/progressive image compression, which enables us to obtain compressed results corresponding to multiple bitrates simultaneously through one-pass encoding and decoding. Note that the only previous DNN-based image codec supporting bitstream scalability [4] requires multi-iteration encoding and decoding to obtain compressed results at different quality levels. (2) We propose to bring the idea of bit-plane coding into a learnable scalable image codec, which benefits information decomposition for a more effective hierarchical representation. (3) Within our proposed model, we design an LSTM-based architecture to disentangle the information of different bit-planes and achieve end-to-end optimization for better rate-distortion performance, which goes beyond the regular use of LSTMs [18]. Our proposed method outperforms the state-of-the-art DNN-based scalable image codec by a large margin in both the PSNR and MS-SSIM metrics.
2 Proposed Method
We propose a deep-learning-based framework for scalable/progressive image compression. Within this framework, we adopt bit-plane decomposition to perform a coarse information decomposition and design two bidirectional gated units to disentangle the contextual information precisely.
2.1 Scalable Compression Framework
Bit-plane decomposition. As illustrated in Fig. 1(a), for an RGB image, we transform each channel into $n$ bit-planes through bit-plane decomposition, where $n$ can be viewed as the so-called bit depth. In this paper, we set $n = 8$ for RGB images whose pixels are in the range of [0, 255]. For clarity, we denote the $k$-th bit-plane of the R, G and B channels as $b^R_k$, $b^G_k$ and $b^B_k$, respectively. Denoting the pixel located at $(i, j)$ in channel $c \in \{R, G, B\}$ as $p^c(i, j)$, we can obtain its corresponding value in the $k$-th bit-plane as below:
$$b^c_k(i, j) = \left\lfloor \frac{p^c(i, j)}{2^k} \right\rfloor \bmod 2, \quad k = 0, 1, \ldots, n-1 \tag{1}$$
where $\lfloor \cdot \rfloor$ denotes the floor function, i.e., the greatest integer less than or equal to its argument. Inversely, we can reconstruct the original image from the bit-planes by the following formula:
$$p^c(i, j) = \sum_{k=0}^{n-1} 2^k \, b^c_k(i, j) \tag{2}$$
By the operation described in Eq. (1), the original information of the RGB image is unevenly scattered into eight correlated but heterogeneous subspaces. Moreover, Eq. (2) shows that each bit-plane is of different importance for reconstruction. In addition, since the information entropy of each bit-plane is not equal, the information volume carried by each bit-plane also differs.
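The decomposition of Eq. (1), the lossless reconstruction of Eq. (2), and the unequal per-plane entropy can all be sketched in a few lines of NumPy (a minimal illustration; the function names are our own):

```python
import numpy as np

def to_bitplanes(channel: np.ndarray, n: int = 8):
    # Eq. (1): b_k(i, j) = floor(p(i, j) / 2^k) mod 2, for k = 0 .. n-1
    return [((channel >> k) & 1).astype(np.uint8) for k in range(n)]

def from_bitplanes(planes):
    # Eq. (2): p(i, j) = sum_k 2^k * b_k(i, j) -- lossless reconstruction
    acc = np.zeros_like(planes[0], dtype=np.uint16)
    for k, b in enumerate(planes):
        acc += b.astype(np.uint16) << k
    return acc.astype(np.uint8)

def plane_entropy(plane) -> float:
    # Empirical entropy (bits/pixel) of one binary plane; on natural images
    # the higher-order planes typically carry far less entropy than the
    # lower-order ones, which motivates treating the planes unequally.
    p = plane.mean()
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))
```

A round trip through `to_bitplanes` and `from_bitplanes` is exact, while `plane_entropy` applied per plane exposes the entropy asymmetry discussed above.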
Encoder. Taking the bit-planes as the input of the encoder, we design a multi-branch architecture to learn the hierarchical representations. The network layers in each branch do not share weights with the layers in other branches. As shown in Fig. 1(a), we leverage one convolutional layer to perform a preliminary transformation for each bit-plane independently, followed by three layers consisting of BAG-Units that further transform the carried information and yield the feature-map partitions. Notice that there are bidirectional information flows between the BAG-Units in adjacent branches via their hidden states. In both the first convolutional layers and the BAG-Units, we use convolutions with a stride of 2 to spatially downsample the feature maps. The quantization module includes a convolutional layer with a stride of 1, a tanh activation and the binarization function defined in [2]. At the end of the encoder, we place a switch function in each branch. When a switch is "on", the corresponding feature map is retained as one part of the compressed codes before entropy coding; when it is "off", the corresponding feature map is filled with zero values in the compressed codes before entropy coding. Finally, we apply the entropy coding method of [4] to the codes from each branch individually to obtain the final compressed codes.

In terms of transmission, only the compressed codes whose switches are in the "on" state need to be transmitted from the sender to the receiver. The sum of the rates of all final compressed codes determines the compression rate and the highest reconstructed quality level in our scalable framework. The "basic bitrate", namely the minimal coding rate achievable for one trained model before entropy coding, depends on the size of the binary feature map after quantization per branch in the encoder network. Some related recommended settings are listed in our supplementary materials.
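The switch mechanism amounts to zero-filling the codes of untransmitted branches; a small sketch (with a hypothetical helper name, operating on already-binarized feature maps) is:

```python
import numpy as np

def apply_switches(branch_codes, switches):
    """Zero-fill the codes of branches whose switch is 'off'.

    branch_codes: list of binary feature maps, one per encoder branch.
    switches: list of booleans; only 'on' branches are entropy-coded
    and transmitted, the rest are replaced by all-zero placeholders.
    """
    return [code if on else np.zeros_like(code)
            for code, on in zip(branch_codes, switches)]
```

Turning more switches "on" raises both the total rate and the highest quality level reachable at the decoder.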
Decoder. In our decoder, we leverage convolutional layers with a stride of 1 to adjust the dimensions of the feature maps at the beginning and the end of decoding, respectively. We then also use a multi-branch architecture to disentangle the contextual information for reconstruction during decoding. Different from the downsampling in the encoder, here we use pixel shuffle, a depth-to-space operation, to implement spatial upsampling. The same switch function is used in our decoder to control the quality level of the reconstructed image.
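Pixel shuffle is the standard depth-to-space rearrangement; a NumPy sketch of it (matching the semantics of, e.g., PyTorch's `nn.PixelShuffle`) is:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    # Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r).
    c2, h, w = x.shape
    assert c2 % (r * r) == 0
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into an r x r grid
    x = x.transpose(0, 3, 1, 4, 2)    # interleave the grid with H and W
    return x.reshape(c, h * r, w * r)
```

Each group of $r^2$ channels is rearranged into an $r \times r$ spatial block, so upsampling is learned through the channel dimension instead of through transposed convolutions.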
2.2 Contextual Information Disentanglement
In this section, we elaborate on the architecture design of the BAG-Unit and the IBAG-Unit, which play the role of disentangling contextual information in our scalable compression framework. We design a modified version of the bidirectional convolutional LSTM [20] as the gated unit in the BAG-Unit and the IBAG-Unit, going beyond the regular use of LSTMs.
The information of each bit-plane is heterogeneous with respect to the others. We therefore propose to abandon the recurrent connections of LSTM units by using different units with unshared weights. Mathematically, let $x_n$, $c_n$, $h_n$ and $s_n$ denote the input, cell, hidden and output states of the BAG-Unit/IBAG-Unit in the $n$-th branch (see (b) and/or (c) in Fig. 1). We use arrows above a symbol to distinguish the two directions of the information flows between adjacent BAG-Units/IBAG-Units. For the paired gated units within the BAG-Unit/IBAG-Unit in the $n$-th branch, the cell, hidden and output states are updated as follows (the backward direction is symmetric):
$$\overrightarrow{i}_n = \sigma\!\left(W^{(n)}_{xi} * x_n + W^{(n)}_{hi} * \overrightarrow{h}_{n-1}\right) \tag{3}$$
$$\overrightarrow{f}_n = \sigma\!\left(W^{(n)}_{xf} * x_n + W^{(n)}_{hf} * \overrightarrow{h}_{n-1}\right) \tag{4}$$
$$\overrightarrow{c}_n = \overrightarrow{f}_n \odot \overrightarrow{c}_{n-1} + \overrightarrow{i}_n \odot \tanh\!\left(W^{(n)}_{xc} * x_n + W^{(n)}_{hc} * \overrightarrow{h}_{n-1}\right) \tag{5}$$
$$\overrightarrow{o}_n = \sigma\!\left(W^{(n)}_{xo} * x_n + W^{(n)}_{ho} * \overrightarrow{h}_{n-1}\right) \tag{6}$$
$$\overrightarrow{h}_n = \overrightarrow{o}_n \odot \tanh\!\left(\overrightarrow{c}_n\right) \tag{7}$$
$$s_n = \left[\,\overrightarrow{h}_n \,;\, \overleftarrow{h}_n\,\right] \tag{8}$$
where "$*$" denotes the convolution operator and "$\odot$" denotes element-wise multiplication. The symbols $i_n$, $f_n$ and $o_n$ represent the input gate, forget gate and output gate, respectively, and $x_n$ indicates the input of the gated unit. The $W$'s with different sub- and superscripts denote the weight matrices of the different (unshared) convolutional transformations, and $\sigma$ denotes the sigmoid activation function. The output state $s_n$, which is also the input of the "SE" block, is the result of concatenating the hidden states $\overrightarrow{h}_n$ and $\overleftarrow{h}_n$ of the gated units in the two directions.

The gated units in BAG-Units/IBAG-Units play two important roles in disentangling the information: (1) capturing the correlations among different bit-planes, which helps reduce the rate for compact representations in compression; (2) helping to determine at which level of the feature partitions the information should be expressed, according to its relative importance.
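The updates in Eqs. (3)-(8) can be illustrated with a dense (non-convolutional) NumPy sketch, a simplification that replaces the convolutions with matrix products but keeps the per-branch unshared weights and the two-direction concatenation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x, h_prev, c_prev, W):
    """One direction of the per-branch gated update, Eqs. (3)-(7).
    W holds this branch's own (unshared) weight matrices."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev)                   # input gate
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev)                   # forget gate
    c = f * c_prev + i * np.tanh(W["xc"] @ x + W["hc"] @ h_prev)  # cell state
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev)                   # output gate
    h = o * np.tanh(c)                                            # hidden state
    return h, c

def bidirectional_pass(xs, Ws_fwd, Ws_bwd, dim):
    """Sweep the branches in both directions, then concatenate the two
    hidden states branch-wise as in Eq. (8)."""
    n = len(xs)
    h, c = np.zeros(dim), np.zeros(dim)
    fwd = []
    for k in range(n):                       # forward information flow
        h, c = gated_step(xs[k], h, c, Ws_fwd[k])
        fwd.append(h)
    h, c = np.zeros(dim), np.zeros(dim)
    bwd = [None] * n
    for k in reversed(range(n)):             # backward information flow
        h, c = gated_step(xs[k], h, c, Ws_bwd[k])
        bwd[k] = h
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Unlike a regular LSTM, each "time step" here is a different branch with its own weights, so the sweep shares state but not parameters across bit-planes.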
After the gated units, we employ a "Squeeze-and-Excitation" (SE) module [19] to introduce channel-wise attention for better fusing the information from the two directions. Then we use a convolutional layer with a stride of 1 to perform a further transformation.
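The SE fusion follows the usual squeeze (global average pooling) and excitation (bottleneck gating) pattern of [19]; a dense sketch with hypothetical weight shapes is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(feat, w1, w2):
    """feat: (C, H, W) concatenated bidirectional features.
    w1: (C//r, C) squeeze weights, w2: (C, C//r) excite weights,
    with r a reduction ratio."""
    z = feat.mean(axis=(1, 2))                 # squeeze: per-channel statistic
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # excitation: FC-ReLU-FC-sigmoid
    return feat * s[:, None, None]             # channel-wise reweighting
```

The learned gate `s` rescales each channel, letting the network emphasize whichever direction's features matter more at a given level.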
2.3 Training Algorithm
As a scalable image compression framework, our model needs to be optimized for hierarchical reconstructed results at different quality levels simultaneously during training. Therefore, we use a specific training approach, in which each training step contains a one-pass forward process of the encoder, a multi-pass forward process of the decoder and a one-pass backward process for parameter updating. Specifically, suppose that there are $N$ quality levels in all; the loss function can then be written as:
$$\mathcal{L} = \sum_{i=1}^{N} \lambda_i \, d\!\left(x, \hat{x}_i\right) \tag{9}$$
where $\hat{x}_i$ denotes the reconstructed result at level $i$, obtained from the output of the $i$-th branch, and $d(\cdot,\cdot)$ refers to the distance function related to the distortion metric used for evaluation. We weight the distortions under different code rates with coefficients $\lambda_i$. Typically, we take the L1 norm and MS-SSIM (proposed in [21]) as the mentioned distance functions to train our models in this paper.
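Under an L1 distortion, the multi-level objective of Eq. (9) reduces to a weighted sum over the decoder's outputs; a minimal sketch (where `recons` stands for the list of the $N$ reconstructions) is:

```python
import numpy as np

def scalable_loss(x, recons, lambdas):
    # Eq. (9): weighted sum of per-level distortions (L1 distance here).
    return sum(lam * np.abs(x - xr).mean()
               for lam, xr in zip(lambdas, recons))
```

One backward pass through this sum updates the shared encoder with gradients from every quality level at once.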
3 Experiment Results
3.1 Datasets and Settings
We use two sets of training data to train our proposed model: the COCO dataset [22] and a dataset composed of thirty thousand RGB images we collected from the World Wide Web. For the first dataset, we obtain $s \times s$ image patches (where $s$ can be 32, 64 or 128) for training by adopting the commonly used data augmentation strategies of random cropping and random horizontal flipping (with a probability of 0.5). For the second dataset, each image is first scaled by a random factor in [0.5, 1.5], followed by a random crop and a random horizontal flip (with a probability of 0.5). Then, we filter the obtained image patches using the Sobel operator and the Canny operator to reduce the ratio of training samples with overly simple textures.
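The texture-filtering step can be approximated as follows (a sketch using a gradient-magnitude proxy for the Sobel response; the helper names and the threshold value are our own assumptions, not the paper's settings):

```python
import numpy as np

def edge_density(patch: np.ndarray) -> float:
    # Mean gradient magnitude as a cheap texture-richness score
    # (finite differences stand in for the Sobel operator).
    gy, gx = np.gradient(patch.astype(np.float64))
    return float(np.hypot(gx, gy).mean())

def keep_patch(patch: np.ndarray, thresh: float = 2.0) -> bool:
    # Drop patches whose textures are too simple (hypothetical threshold).
    return edge_density(patch) > thresh
```

Flat patches score near zero and are discarded, skewing the training set toward textured content that is harder to compress.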
We implement a three-stage training procedure with a different patch size at each stage for our proposed models. We first pre-train our model using patches from the first dataset and perform stochastic gradient descent with mini-batches of 32, adopting the Adam optimizer with a fixed learning rate. Then we train our model using patches from the second dataset; at this stage, we again set the mini-batch size to 32 and adopt the Adam optimizer with a decayed learning rate and weight decay. We finally perform fine-tuning with image patches from the second dataset; at this stage, we tune the coefficients $\lambda_i$ of the loss function in a small range to improve the performance at some specific bitrates.

3.2 Rate-distortion Performance
We evaluate our proposed models on the Kodak dataset and illustrate the best rate-distortion performance across multiple trained models under different bitrates in Fig. 2. By involving bit-plane decomposition and disentangling the information with BCD-Net, our proposed model achieves a significant improvement in both the PSNR and MS-SSIM metrics across different bitrates compared to the current state-of-the-art DNN-based scalable image compression model. Relative to the conventional scalable image codec JPEG 2000, our proposed model outperforms it in the MS-SSIM metric across different bitrates, and it also shows an advantage in the PSNR metric at low bitrates.
3.3 Ablation Study
Table 1: Ablation results on the Kodak dataset. Each cell reports PSNR / MS-SSIM at the given rate.

| Rate (bpp) | 0.0625 | 0.125 | 0.1875 | 0.25 |
|---|---|---|---|---|
| (1) Unidirectional encoder-decoder (four E/D direction combinations) | 22.6267 / 0.7630 | 25.3592 / 0.8448 | 27.1178 / 0.8840 | 27.7594 / 0.9016 |
| | 22.6432 / 0.7684 | 25.5035 / 0.8536 | 26.7740 / 0.8830 | 27.4419 / 0.8966 |
| | 24.7268 / 0.7863 | 25.5717 / 0.8183 | 25.6212 / 0.8194 | 25.6195 / 0.8194 |
| | 24.5063 / 0.7894 | 25.5561 / 0.9216 | 25.6693 / 0.8240 | 25.6720 / 0.8244 |
| (2) With the regular use of LSTMs | 25.3584 / 0.8175 | 26.5429 / 0.8707 | 27.2956 / 0.8986 | 27.4378 / 0.9030 |
| (3) w/o bit-plane decomposition | 25.1074 / 0.8203 | 26.5585 / 0.8720 | 27.2947 / 0.8929 | 27.6120 / 0.9036 |
| (4) w/o the "SE" modules | 25.6104 / 0.8163 | 26.8601 / 0.8721 | 27.5019 / 0.8948 | 27.8937 / 0.9051 |
| (5) w/o GDN/IGDN | 25.3693 / 0.8160 | 26.6676 / 0.8715 | 27.3263 / 0.8932 | 27.7023 / 0.9027 |
| (6) Fully-equipped BCD-Net | 25.8295 / 0.8297 | 27.3045 / 0.8785 | 27.9695 / 0.8999 | 28.3327 / 0.9101 |
To further investigate the effectiveness of the technical components within our proposed scheme, we conduct a series of experiments with the following cases: (1) we implement four different combinations of a unidirectional encoder and a unidirectional decoder, where "E" and "D" denote the encoder and the decoder, respectively, and the arrows represent the two directions of the information flow; (2) we take LSTMs with recurrent connections as the gated units inside the BAG-Units and IBAG-Units; (3) we replace the bit-plane decomposition with convolution and slicing operations; (4) we take the SE blocks away from the BAG-Units and IBAG-Units; (5) we replace the GDN and IGDN inside the BAG-Units and IBAG-Units with the leaky ReLU nonlinear activation function. For each experimental case, we train one model with the same basic bitrate in bits per pixel (bpp). All experimental cases are optimized for the PSNR metric under the same training settings. The evaluation results on the Kodak dataset are reported in Table 1.

As shown in Table 1, the rate-distortion performance of the codec declines severely when we apply a unidirectional network topology in the encoder and decoder, which shows that the bidirectional information flow is crucial for context disentanglement. Bidirectional message passing helps determine where information of different importance should be expressed, and it takes into account the correlations among the representations at different levels. The second experiment demonstrates that LSTMs with unshared parameters are more suitable than the regular use of LSTMs with recurrent connections for mapping the heterogeneous information hidden in different subspaces to latent representations. We also find that bit-plane decomposition is better than convolution and slicing operations at providing a coarse but effective information decomposition before the learned transformation. The results of the ablation study further suggest that the "Squeeze-and-Excitation" block leads to better information fusion by introducing channel-wise attention. Additionally, similar to Ballé et al.'s works [3, 8], GDN/IGDN is also effective within our scheme in simplifying learning by Gaussianizing image densities.
4 Conclusions
In this paper, we study deep-learning-based scalable image codecs. We propose to incorporate bit-plane decomposition into a DNN-based compression framework to decompose the original information coarsely. We then design the Bidirectional Context Disentanglement Network (BCD-Net) to learn more effective hierarchical representations for scalable/progressive compression. Consequently, our proposed model can compress and reconstruct images at different quality levels simultaneously through one-pass encoding and decoding. It outperforms the state-of-the-art DNN-based scalable image codecs in both the PSNR and MS-SSIM metrics, and it also outperforms the conventional scalable image codec in the MS-SSIM metric across different bitrates and in the PSNR metric at low bitrates.
References
[1] Steven Ray McCanne and Martin Vetterli, Scalable Compression and Transmission of Internet Multicast Video, University of California, Berkeley, 1996.
[2] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv preprint arXiv:1511.06085, 2015.
[3] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, "End-to-end optimized image compression," in The 5th International Conference on Learning Representations, 2017.
[4] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell, "Full resolution image compression with recurrent neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306-5314.
[5] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár, "Lossy image compression with compressive autoencoders," in The 5th International Conference on Learning Representations, 2017.
[6] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Advances in Neural Information Processing Systems, 2017, pp. 1141-1151.
[7] Mohammad Haris Baig, Vladlen Koltun, and Lorenzo Torresani, "Learning to inpaint for image compression," in Advances in Neural Information Processing Systems, 2017, pp. 1246-1255.
[8] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, "Variational image compression with a scale hyperprior," in The 6th International Conference on Learning Representations, 2018.
[9] David Minnen, Johannes Ballé, and George Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018.
[10] Oren Rippel and Lubomir Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[11] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, "Conditional probability models for deep image compression," arXiv preprint arXiv:1801.04260, 2018.
[12] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool, "Generative adversarial networks for extreme learned image compression," arXiv preprint arXiv:1804.02958, 2018.
[13] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi, "The JPEG 2000 still image compression standard," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36-58, 2001.
[14] Weiping Li, "Overview of fine granularity scalability in MPEG-4 video standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 301-317, 2001.
[15] Yan Ye and Pierre Andrivon, "The scalable extensions of HEVC for ultra-high-definition video delivery," IEEE MultiMedia, vol. 21, no. 3, pp. 58-64, 2014.
[16] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra, "Towards conceptual compression," in Advances in Neural Information Processing Systems, 2016, pp. 3549-3557.
[17] Aaron Wyner, "Recent results in the Shannon theory," IEEE Transactions on Information Theory, vol. 20, no. 1, pp. 2-10, 1974.
[18] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[19] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," arXiv preprint arXiv:1709.01507, 2017.
[20] Qingshan Liu, Feng Zhou, Renlong Hang, and Xiaotong Yuan, "Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification," Remote Sensing, vol. 9, no. 12, p. 1330, 2017.
[21] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik, "Multiscale structural similarity for image quality assessment," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, IEEE, 2003, vol. 2, pp. 1398-1402.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, Springer, 2014, pp. 740-755.