1 Introduction
Image and video, as general media, provide great convenience for sharing information and communicating with each other. Nowadays, however, huge amounts of image and video data need to be stored and transmitted efficiently. Image and video coding techniques have enormously alleviated this problem by compressing these data into a compact yet expressive representation, but the compression efficiency of standard coding gradually fails to satisfy the explosive transmission demands of social media and streaming media that come with the popularization of electronic products such as digital cameras and cellphones [1]. Thus, image and video representation, as well as compression towards higher compression ratios, should be studied in depth, especially using deep learning.
Conventional still image coding [2, 3] has developed from JPEG and JPEG2000 to WebP and BPG, etc. Meanwhile, several recent works, such as [4, 5, 6, 7, 8], are devoted to image compression with deep neural networks. In [6], two collaborative convolutional neural networks form a unified end-to-end learning framework, where one network produces a compact representation for encoding, while the other reconstructs the decoded image. Different from the work of [6], the virtual codec neural network of [7] is learned to bridge the gap between the network ahead of the standard codec and the network after it, so that gradients can be properly passed from the back end to the front end. On the basis of [7], multiple description convolutional neural networks are designed to compress images so as to ensure that an acceptable image can be decoded in an unprioritized network or under transmission congestion [8].
Because our work is highly related to post-processing problems such as artifact removal [9], deblocking [10] and denoising [11], we next introduce several state-of-the-art works [11, 12, 13, 14, 15, 16, 17] on compression artifact removal. In [11], shape-adaptive discrete cosine transform based filtering is developed for denoising and deblocking by introducing the shape of the transform's support in a pointwise adaptive fashion. In [12], after grouping similar 2-D image patches into 3-D data arrays, three successive procedures of 3-D transformation of these arrays, shrinkage of the transform spectrum, and inverse 3-D transformation are conducted to achieve image denoising. In [13], a two-step algorithm reduces artifacts by dictionary learning and total variation regularization. To reduce blocking artifacts and obtain high-quality images, an optimization problem using a constrained non-convex low-rank model is developed within the maximum a posteriori framework [14]. Apart from these filtering and optimization methods [11, 12, 13, 14], there are several convolutional neural network based approaches, such as [15, 16, 17]. Although post-processing can improve coding efficiency, these methods lose sight of the significance of image representation, which can highlight significant pixels or regions before coding in order to protect them. Thus, post-processing and image representation should be combined to further improve image coding efficiency.
Although the works of [6, 7] achieve high image compression gains at low bitrates, they cannot deliver a large coding gain at high bitrates. In this paper, mixed-resolution image representation and compression with deep convolutional neural networks is introduced to efficiently compress images, no matter which bitrate is chosen by the user. The rest of this paper is arranged as follows. First, we introduce the proposed method in Section 2. Second, the experimental results are given in Section 3. At last, Section 4 concludes the paper.
2 The proposed method
Our framework is composed of a feature description neural network (denoted as FDNN), a standard codec (e.g., JPEG), a post-processing neural network (PPNN), and a virtual codec neural network (VCNN). To greatly reduce the amount of image data for storage or transmission, we use the FDNN network to represent the ground-truth image X at low or high resolution before image compression. For simplicity, the FDNN network is expressed as a nonlinear function Y = F(X; θ_F), in which θ_F is the parameter set of the FDNN network. The compression procedure of the standard codec is described as a mapping function Y_c = C(Y; θ_C), where θ_C is the parameter set of the codec. Our PPNN network learns a post-processing function Z = P(Y_c; θ_P) from image Y_c to image Z to remove noise such as blocking artifacts, ringing artifacts and blurring, when Y is represented at high resolution. Here, θ_P is the parameter set of the PPNN network. However, if Y is represented at low resolution, the PPNN network is used to simultaneously de-artifact and up-sample Y_c, so that Z goes from low resolution and low quality to high resolution and high quality.
To combine an image compression standard with convolutional neural network based image representation and post-processing, the intuitive idea is to learn the compression procedure of the codec with a convolutional neural network so as to get an approximation function. Although a convolutional neural network is a powerful tool for approximating nonlinear functions, it is hard to imitate the procedure of image compression, because the quantization operator always leads to serious blocking artifacts and coding distortion. However, as compared to the compressed image Y_c, the post-processed compressed image Z has less distortion: Z loses some detail information, but it does not have obvious ringing or blocking artifacts. Thus, the function of the two successive procedures of the codec and post-processing can be well represented by the VCNN network. To make sure that the gradient can be correctly back-propagated from the PPNN to the FDNN, our VCNN network is proposed to learn a projection function Z_v = V(Y; θ_V) from the image representation Y to the final output Z of the PPNN. Here, θ_V is the parameter set of the VCNN network. This projection closely approximates the two successive procedures: the compression of the standard codec and the convolutional neural network based post-processing. After training the VCNN network, this network is leveraged to supervise the training of our FDNN network.
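To make the role of the VCNN concrete, the following is a deliberately tiny 1-D analogue, not the actual networks of this paper: the FDNN, PPNN and VCNN are replaced by scalar linear maps and the codec by uniform quantization, and rate is ignored. Because rounding has zero gradient almost everywhere, the FDNN parameter can only be updated through the differentiable VCNN path.

```python
import numpy as np

# Toy 1-D analogue of the FDNN -> codec -> PPNN chain (all names are stand-ins):
#   FDNN:  y = a * x           (representation)
#   codec: y_c = quantize(y)   (non-differentiable, like JPEG's quantizer)
#   PPNN:  z = b * y_c         (post-processing)
#   VCNN:  z_v = c * y         (differentiable proxy for codec + post-processing)

def quantize(y, step=0.5):
    # Piecewise constant, so d(quantize)/dy = 0 almost everywhere:
    # gradients cannot reach the FDNN through this path.
    return np.round(y / step) * step

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, size=256)   # stand-in for ground-truth images
a, b, c = 0.5, 1.0, 1.0               # FDNN, PPNN, VCNN "parameters"
lr = 0.05

initial_loss = np.mean((b * quantize(a * x) - x) ** 2)

for _ in range(200):
    y = a * x
    y_c = quantize(y)
    # Sub-problem 1: fit the PPNN so that z = b * y_c matches x (least squares).
    b = np.dot(y_c, x) / np.dot(y_c, y_c)
    z = b * y_c
    # Sub-problem 2: fit the VCNN so that z_v = c * y matches z.
    c = np.dot(y, z) / np.dot(y, y)
    # Sub-problem 3: update the FDNN through the *differentiable* VCNN path,
    # minimizing mean((c * a * x - x)^2) with respect to a.
    grad_a = np.mean(2.0 * (c * a * x - x) * c * x)
    a -= lr * grad_a

final_loss = np.mean((b * quantize(a * x) - x) ** 2)
```

In this toy, the quantizer stops any gradient from reaching a, whereas the VCNN proxy c * y passes a usable gradient, so the reconstruction error of the full chain drops well below its initial value.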
2.1 Objective function
Our framework’s objective function is written as follows:
(θ̂_F, θ̂_P, θ̂_V) = argmin_{θ_F, θ_P, θ_V} L_c(Z, X) + L_g(Z, X) + L_c(Z_v, Z) + L_g(Z_v, Z) + λ (L_c(U(Y), X) + L_g(U(Y), X))    (1)
where X is the ground-truth image, Y = F(X; θ_F) is the representation produced by the FDNN, Z = P(C(Y; θ_C); θ_P) is the post-processed decoded image, and Z_v = V(Y; θ_V) is the VCNN's prediction of Z; θ_F, θ_P, and θ_V are respectively the three parameter sets of the FDNN, PPNN, and VCNN networks, λ is a weighting factor, and U(·) is the linear up-sampling operator if Y and X do not have the same image size; otherwise U(Y) = Y. Here, in order to make the final output image Z similar to X, the objective includes the L1 content loss and the L1 gradient difference loss, which also regularize the FDNN network's training through U(Y); they are written as:
L_c(A, B) = ||A − B||_1    (2)

L_g(A, B) = Σ_{i,j} ( | |A_{i,j} − A_{i−1,j}| − |B_{i,j} − B_{i−1,j}| | + | |A_{i,j} − A_{i,j−1}| − |B_{i,j} − B_{i,j−1}| | )    (3)
where ||·||_1 is the L1 norm, which supervises a convolutional neural network's training better than the L2 norm. This has been reported in [18], which learns to predict subsequent frames from video sequences.
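These two losses are straightforward to compute; below is a NumPy sketch (the function names are ours), with the gradient difference loss taken in the alpha = 1 form of [18]:

```python
import numpy as np

def l1_content_loss(a, b):
    # Eq. (2): sum of absolute pixel differences (L1 norm).
    return np.sum(np.abs(a - b))

def l1_gradient_difference_loss(a, b):
    # Eq. (3): compare the magnitudes of horizontal and vertical
    # finite differences of the two images (alpha = 1, as in [18]).
    dav, dah = np.abs(np.diff(a, axis=0)), np.abs(np.diff(a, axis=1))
    dbv, dbh = np.abs(np.diff(b, axis=0)), np.abs(np.diff(b, axis=1))
    return np.sum(np.abs(dav - dbv)) + np.sum(np.abs(dah - dbh))

img = np.arange(16.0).reshape(4, 4)
shifted = img + 5.0  # a constant offset changes the content loss,
                     # but leaves the gradient difference loss at zero
```

Note that a constant brightness shift is penalized only by the content loss; the gradient difference loss reacts to lost or hallucinated edges, which is why the two are used together.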
Since the standard codec, as a big obstacle, sits between the PPNN network and the FDNN network, it is tough to back-propagate the gradient between them. Therefore, it is a challenging task for the FDNN network to be trained directly under the supervision of the PPNN network. To address this, we learn a nonlinear function from Y to Z in the VCNN network, where the L1 content loss and L1 gradient difference loss of Eqs. (2)-(3) are used to supervise the VCNN network's training. Here, Z_v = V(Y; θ_V) is the result predicted by the VCNN network to approximate Z.
The structural information of the representation U(Y) is always expected to be similar to that of the ground-truth image X, so the SSIM loss [19] also supervises the learning of the FDNN, besides the training loss from the VCNN network. It is defined as follows:
SSIM(p) = ((2 μ_1(p) μ_2(p) + c_1)(2 σ_12(p) + c_2)) / ((μ_1(p)² + μ_2(p)² + c_1)(σ_1(p)² + σ_2(p)² + c_2))    (4)

L_s(U(Y), X) = 1 − (1/N) Σ_p SSIM(p)    (5)
where c_1 and c_2 are two constant values, which respectively equal (0.01 L)² and (0.03 L)², with L the dynamic range of the pixel values [19]. μ_1(p) and σ_1(p)² respectively denote the mean value and the variance of the neighborhood window centered at pixel p in the image U(Y); μ_2(p) and σ_2(p)² are defined similarly for the image X. Meanwhile, σ_12(p) is the covariance between the neighborhood windows centered at pixel p in the images U(Y) and X, and N is the number of pixels. Because the SSIM function is differentiable, the gradient can be efficiently back-propagated during the FDNN network's training.

2.2 Networks
Eight convolutional layers in the FDNN network are used to extract features so as to represent the ground-truth image X as Y. In this network, the weights of the first and the last convolutional layers have a spatial size of 9x9, which makes the receptive field (RF) of the network large enough. In addition, the other six convolutional layers in the FDNN use 3x3 convolution kernels to further enlarge the RF. These convolutional layers increase the nonlinearity of the network, as ReLU activates the output features of these hidden layers. The first seven convolutional layers each output 128 feature maps, but the last layer has only one feature map so as to stay consistent with the ground-truth image X. Each convolutional layer operates with a stride of 1, except that the second layer uses a stride of 2 to down-sample the feature maps, so that from the third to the eighth convolutional layer the convolution is carried out in the low-resolution space to reduce computational complexity. However, the second layer uses a stride of 1 if Y is represented at high resolution, i.e., when the given bitrate is beyond a certain value. All the convolutional layers are followed by a ReLU activation layer, except the last one.

In the PPNN network, we leverage seven convolutional layers to extract features, and each layer is activated by the ReLU function. The kernel size is 9x9 in the first convolutional layer and 3x3 in the remaining six, while each of these layers outputs 128 feature maps. After these layers, one deconvolution layer with kernel size 9x9 and stride 2 up-scales the feature maps from low resolution to high resolution so that the size of the output image matches the ground-truth image. However, if Y is a full-resolution image, the last deconvolution layer is replaced by a convolutional layer with kernel size 9x9 and stride 1.
The VCNN network is designed with the same structure as the PPNN network, because they address the same kind of low-level image processing problem. The role of the VCNN network is to make the representation Y degrade to a post-processed compressed but high-resolution image Z. On the contrary, the functionality of the PPNN network is to improve the quality of the compressed representation Y_c so that the user can receive a high-quality image without coding artifacts after post-processing with the PPNN network at the decoder.
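The claim that the 9x9 outer layers and the stacked 3x3 layers give a large enough receptive field can be checked with the standard RF recursion; the layer list below is our reading of the FDNN description (low-resolution mode, stride 2 at the second layer), and padding is ignored since it does not affect RF size.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, first layer first.
    Standard recursion: rf += (k - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# FDNN as described: 9x9 (stride 1), 3x3 (stride 2),
# five more 3x3 (stride 1), then 9x9 (stride 1).
fdnn = [(9, 1), (3, 2)] + [(3, 1)] * 5 + [(9, 1)]
print(receptive_field(fdnn))  # 47
```

So each output value of this FDNN configuration sees a 47x47 window of the input, which is consistent with the text's goal of a large RF from only eight layers.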
2.3 Learning Algorithm
Due to the difficulty of directly training the whole framework at once, we decompose the learning of the three convolutional neural networks in our framework into three sub-problems. First, we initialize all the parameter sets θ_C, θ_F, θ_P, and θ_V of the codec, FDNN network, PPNN network, and VCNN network. Meanwhile, we use traditional interpolation methods to get an initial representation image Y of the ground-truth image X, which is then compressed by the JPEG codec as the input of the training data set at the beginning. Next, the first sub-problem is to train the PPNN network by updating the parameter set θ_P according to Eqs. (2)-(3). The compressed representation image Y_c obtained from the ground-truth image X and its post-processed compressed image Z predicted by the PPNN network are used for the second sub-problem, the learning of the VCNN, updating the parameter set θ_V with the losses of Eqs. (2)-(3) applied between Z_v and Z. After the VCNN's learning, we fix the parameter set θ_V of the VCNN network and carry on the third sub-problem by updating the parameter set θ_F to train the FDNN network, supervised by Eqs. (4)-(5) together with the loss passed through the VCNN. After the FDNN network's learning, the next iteration begins to train the PPNN network, once the updated description image Y has been compressed by the standard codec. It is worth mentioning that the functionality of the VCNN network is to bridge the great gap between the FDNN and the PPNN. Thus, once the training of our whole framework is finished, the VCNN network is no longer in use; that is to say, only the parameter sets θ_F and θ_P of the FDNN and PPNN networks are employed during testing.

3 Experimental results
3.1 Training details
Our framework for learning a virtual codec neural network to compress images is implemented with TensorFlow [20]. The training data comes from [21], which includes 400 images of size 180x180. We augment these data by cropping, rotating and flipping to build our training dataset, in which the total number of image patches of size 160x160 is 3200 (n = 3200). For testing, as shown in Fig. 1, four images that are widely employed for compressed-image denoising and de-artifacting are used to evaluate the efficiency of the proposed method. We train our model with the Adam optimizer, with beta1 = 0.9 and beta2 = 0.999. The initial learning rate for training the three convolutional neural networks is set to 0.0001; the learning rate decays to half of the initial value once the training step reaches 3/5 of the total steps, and decreases to 1/4 of the initial value when the training step reaches 4/5 of the total steps.

3.2 The quality comparison of different methods
To validate the efficiency of the proposed framework, we compare our method with JPEG, Foi's [11], BM3D [12], DicTV [13], CONCOLOR [14], and Jiang's [6]. Here, both Foi's [11] and BM3D [12] belong to the class of image denoising, and the method of Foi's [11] is specifically designed for deblocking. The approaches of DicTV [13] and CONCOLOR [14] use dictionary learning or a low-rank model to resolve the problems of deblocking and de-artifacting. The results of Foi's [11], BM3D [12], DicTV [13] and CONCOLOR [14] are obtained by strictly using the authors' open code with the parameter settings in their papers. However, the highly related method of Jiang's [6] only provides results for one quality factor, so we re-implement their method with TensorFlow. Meanwhile, to compare fairly with Jiang's [6], we use our FDNN and PPNN to replace their ComCNN and ReCNN networks for training and testing, to remove the effect of network structure design on the experimental results. Besides, we extend Jiang's framework [6] to the mixed-resolution setting so that more comparisons can be conducted between the proposed method and Jiang's [6].
The JPEG implementation in OpenCV is used for all the experimental results. To compare our method with Jiang's [6] in the following, the low-resolution representation is compressed by JPEG with quality factors of 5, 10, and 20 at low bitrates, while the high-resolution representation is assigned quality factors of 10, 20, 30, and 40 for training and testing at high bitrates. Meanwhile, Foi's [11], BM3D [12], DicTV [13], and CONCOLOR [14] compress images with the JPEG codec at quality factors of 2, 3, 4, 5, 10, 15, 20, 25 and 30. Note that the JPEG codec is used in the proposed framework, but our framework can in fact be applied to most existing standard codecs.
We use the Peak Signal-to-Noise Ratio (PSNR) as the objective quality measurement. From Fig. 2, where bpp denotes bits per pixel, it can be clearly observed that the proposed method achieves the best objective performance in terms of PSNR and SSIM, as compared to several state-of-the-art approaches: JPEG [2], Foi's [11], BM3D [12], DicTV [13], CONCOLOR [14], and Jiang's [6]. From the above comparisons, it can be seen that back-propagating the gradient from the post-processing neural network into the feature description network plays a significant role in the effectiveness of feature description and in the compression efficiency when combining neural networks with a standard codec. In a word, by learning a virtual codec neural network, the proposed framework provides a good way to resolve the gradient back-propagation problem in an image compression framework that places a convolutional neural network ahead of a standard codec.
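For reference, PSNR can be computed as below; we also include a simplified single-window SSIM (the evaluated SSIM normally uses the sliding local windows of Eq. (4), so this global variant is only an illustration of the formula):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    # Peak Signal-to-Noise Ratio in dB; higher is better.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(ref, test, max_val=255.0):
    # Simplified SSIM over one global window instead of local ones;
    # constants follow [19]: c1 = (0.01 L)^2, c2 = (0.03 L)^2.
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu1, mu2 = ref.mean(), test.mean()
    v1, v2 = ref.var(), test.var()
    cov = ((ref - mu1) * (test - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2))
```

As a sanity check, a uniform error of 16 gray levels against a 255 peak gives a PSNR of about 24.05 dB, and an image compared with itself gives an SSIM of 1.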
4 Conclusion
In this paper, we propose an end-to-end mixed-resolution image compression framework that resolves the non-differentiability of the quantization function in lossy image compression by learning a virtual codec neural network. Directly learning the whole framework of the proposed method is an intractable problem, so we decompose this challenging optimization problem into three sub-problems. Finally, experimental results have shown the superiority of the proposed method over several state-of-the-art methods.
References
 [1] W. Dai, G. Cheung, N. Cheung, A. Ortega, and O. Au, “Merge frame design for video stream switching using piecewise constant functions,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3489–3504, 2016.
 [2] G. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
 [3] J. Lainema, M. Hannuksela, V. Vadakital, and E. Aksu, “HEVC still image coding and high efficiency image file format,” in IEEE International Conference on Image Processing, Arizona, Sept. 2016.

 [4] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, July 2017.
 [5] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in arXiv: 1703.10553, 2017.
 [6] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
 [7] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Learning a virtual codec based on deep convolutional neural network to compress image,” in arXiv: 1712.05969, 2017.
 [8] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Multiple description convolutional neural networks for image compression,” in arXiv: 1801.06611, 2018.
 [9] L. Zhao, H. Bai, A. Wang, Y. Zhao, and B. Zeng, “Two-stage filtering of compressed depth images with Markov random field,” Signal Processing: Image Communication, vol. 51, pp. 11–22, 2017.
 [10] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614–619, 2003.
 [11] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE Transactions on Image Processing, vol. 16, no. 5, pp. 1395–1411, 2007.
 [12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
 [13] H. Chang, M. Ng, and T. Zeng, “Reducing artifacts in JPEG decompression via a learned dictionary,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 718–728, 2014.
 [14] J. Zhang, R. Xiong, C. Zhao, Y. Zhang, S. Ma, and W. Gao, “CONCOLOR: Constrained non-convex low-rank model for image deblocking,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1246–1259, 2016.
 [15] C. Dong, Y. Deng, C. Chen, and X. Tang, “Compression Artifacts Reduction by a Deep Convolutional Network,” in IEEE International Conference on Computer Vision, Santiago, Dec. 2015.
 [16] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convolutional neural network for image compression artifact suppression,” in International Joint Conference on Neural Networks, Anchorage, AK, USA, May 2017.
 [17] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep generative adversarial compression artifact removal,” in arXiv: 1704.02518, 2017.
 [18] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in arXiv: 1511.05440, 2015.
 [19] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” in arXiv: 1603.04467, 2016.
 [21] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1256–1272, 2017.