Mixed-Resolution Image Representation and Compression with Convolutional Neural Networks

02/02/2018 ∙ by Lijun Zhao, et al. ∙ Beijing Jiaotong University

In this paper, we propose an end-to-end mixed-resolution image compression framework based on convolutional neural networks. First, given an input image, a feature description neural network (FDNN) generates a new representation of the image that a standard codec can compress more efficiently than the input image itself. A post-processing neural network (PPNN) then removes the coding artifacts caused by the codec's quantization. Second, a low-resolution representation is used at low bit-rates for efficient compression, because most of the bits are then spent on the image's structures. At high bit-rates, where most structures already survive compression, more bits should be assigned to image details in a high-resolution representation, because beyond a certain bit-rate the low-resolution representation cannot carry as much information as the high-resolution one. Finally, to resolve the problem of back-propagating the error from the PPNN to the FDNN, we introduce a virtual codec neural network (VCNN) that imitates the combined procedure of standard compression and post-processing. Objective experimental results demonstrate that the proposed method improves by a large margin over several state-of-the-art approaches.


1 Introduction

Image and video data, as general-purpose media, make it very convenient to share information and communicate with each other. However, huge amounts of image and video data now need to be stored and transmitted efficiently. Image and video coding techniques have greatly alleviated this problem by compressing the data into a compact yet expressive form, but with the spread of electronic products such as digital cameras and cell phones, the compression efficiency of standard coding is increasingly unable to meet the explosive transmission demands of social media and streaming media [1]. Thus, the representation and compression of images and video towards higher compression ratios deserve in-depth study, especially with deep learning.

Conventional still-image coding [2, 3] has developed from JPEG and JPEG2000 to WebP, BPG, and others. Meanwhile, several recent works, such as [4, 5, 6, 7, 8], are devoted to image compression with deep neural networks. In [6], two collaborating convolutional neural networks form a unified end-to-end learning framework, in which one network produces a compact representation for encoding and the other reconstructs the decoded image. Unlike [6], the work of [7] learns a virtual codec neural network to bridge the gap between the network ahead of the standard codec and the network after it, so that gradients can be properly passed from the back end to the front end. Building on [7], multiple description convolutional neural networks are designed in [8] to compress images so that an acceptable image can still be decoded over an un-prioritized network or under transmission congestion.

Because our work is closely related to post-processing problems such as artifact removal [9], de-blocking [10], and de-noising [11], we next introduce several state-of-the-art works [11, 12, 13, 14, 15, 16, 17] on compression artifact removal. In [11], shape-adaptive discrete cosine transform-based filtering is developed for de-noising and de-blocking by adapting the shape of the transform's support in a point-wise fashion. In [12], similar 2D image patches are grouped into 3D data arrays, and three successive steps (a 3D transformation of these arrays, shrinkage of the transform spectrum, and an inverse 3D transformation) are carried out to de-noise the image. In [13], a two-step algorithm reduces artifacts by dictionary learning and total variation regularization. To reduce blocking artifacts and obtain a high-quality image, [14] formulates an optimization problem with a constrained non-convex low-rank model within a maximum a posteriori framework. Besides these filtering and optimization methods [11, 12, 13, 14], there are several convolutional neural network-based approaches, such as [15, 16, 17]. Although post-processing can improve coding efficiency, these methods overlook the importance of image representation, which can highlight significant pixels or regions before coding in order to protect them. Thus, post-processing and image representation should be combined to further improve image coding efficiency.

Although the methods of [6, 7] achieve high compression gains at low bit-rates, they do not achieve a large coding gain at high bit-rates. In this paper, mixed-resolution image representation and compression with deep convolutional neural networks is introduced to compress images efficiently regardless of the bit-rate chosen by the user. The rest of this paper is organized as follows. Section 2 introduces the proposed method. Section 3 presents the experimental results. Section 4 concludes the paper.

2 The proposed method

Our framework is composed of a feature description neural network (FDNN), a standard codec (e.g., JPEG), a post-processing neural network (PPNN), and a virtual codec neural network (VCNN). To greatly reduce the amount of image data for storage or transmission, we use the FDNN to represent the ground-truth image at either low or high resolution before compression. For simplicity, the FDNN is expressed as a non-linear function of the ground-truth image, governed by the FDNN's parameter set. The compression procedure of the standard codec is described as a mapping function from this representation to its compressed version, governed by the codec's parameter set. When the representation is at high resolution, the PPNN, with its own parameter set, learns a post-processing function from the compressed representation to the output image in order to remove noise such as blocking artifacts, ringing artifacts, and blurring. When the representation is at low resolution, the PPNN simultaneously removes artifacts and up-samples, turning a low-resolution, low-quality image into a high-resolution, high-quality one.

To combine a standard image codec with CNN-based image representation and post-processing, the intuitive idea is to learn the codec's compression procedure with a convolutional neural network and so obtain an approximation function. Although a convolutional neural network is a powerful tool for approximating nonlinear functions, it is hard for it to imitate image compression, because the quantization operator leads to serious blocking artifacts and coding distortion. However, compared with the compressed image, the post-processed compressed image has less distortion: it loses some detail information, but it has no obvious blocking or other artifacts. Thus, the combined function of the two successive procedures, codec and post-processing, can be well represented by the VCNN. To ensure that the gradient can be correctly back-propagated from the PPNN to the FDNN, the VCNN, with its own parameter set, learns a projection function from the image representation to the final output of the PPNN. This projection approximates the two successive procedures: compression by the standard codec and CNN-based post-processing. Once trained, the VCNN is used to supervise the training of the FDNN.

2.1 Objective function

Our framework’s objective function is written as follows:

(1)

where the three parameter sets are those of the FDNN, PPNN, and VCNN networks, respectively. A linear up-sampling operator is applied when the representation and the ground-truth image do not have the same size; otherwise the representation is left unchanged. Here, to make the final output image similar to the ground-truth image, the objective includes an L1 content loss and an L1 gradient difference loss that regularize the FDNN's training, written as:

(2)
(3)

where ||·||_1 denotes the L1 norm, which supervises the training of convolutional neural networks better than the L2 norm. This was reported in [18], which learns to predict subsequent frames of video sequences.
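The L1 content loss and the L1 gradient difference loss named above can be illustrated in plain Python. This is only a minimal sketch under stated assumptions: images are nested lists of floats, the content loss is taken as the mean absolute pixel difference, and the gradient difference loss compares absolute horizontal and vertical finite differences. The function names and the averaging convention are hypothetical and are not taken from the authors' implementation.

```python
def l1_content_loss(pred, target):
    """Mean absolute difference between two equally sized 2D images."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += abs(p - t)
            count += 1
    return total / count


def l1_gradient_difference_loss(pred, target):
    """Mean absolute difference between finite-difference magnitudes.

    Assumed form (not confirmed by the source): horizontal and vertical
    neighbour differences are taken in both images, their absolute values
    are compared, and the result is averaged. Images need at least 2 pixels.
    """
    h, w = len(pred), len(pred[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:  # horizontal neighbour
                gp = abs(pred[i][j + 1] - pred[i][j])
                gt = abs(target[i][j + 1] - target[i][j])
                total += abs(gp - gt)
                count += 1
            if i + 1 < h:  # vertical neighbour
                gp = abs(pred[i + 1][j] - pred[i][j])
                gt = abs(target[i + 1][j] - target[i][j])
                total += abs(gp - gt)
                count += 1
    return total / count


# A flat (over-smoothed) prediction of a checkerboard target.
target = [[0.0, 1.0], [1.0, 0.0]]
blurred = [[0.5, 0.5], [0.5, 0.5]]
content = l1_content_loss(blurred, target)
gradient = l1_gradient_difference_loss(blurred, target)
```

In this toy case the gradient term penalizes the flat prediction for having lost every edge of the target, which is the kind of blurring the content term alone does not single out.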

Since the standard codec stands as a major obstacle between the PPNN and the FDNN, it is hard to back-propagate the gradient between them. Training the FDNN directly, without supervision from the PPNN, is therefore challenging. To address this, the VCNN learns a nonlinear function from the representation to the post-processed compressed image, with the L1 content loss and the L1 gradient difference loss supervising its training. The VCNN's prediction thus approximates the PPNN's output.

The structural information of the representation is expected to be similar to that of the ground-truth image, so in addition to the training loss from the VCNN, an SSIM loss [19] supervises the learning of the FDNN. It is defined as follows:

(4)
(5)

where the two stabilizing constants take fixed values, and the mean and the variance are computed over the neighborhood window centered at a given pixel of one image. The mean and the variance for the other image are defined similarly. The covariance is computed between the neighborhood windows centered at the same pixel in the two images. Because the SSIM function is differentiable, the gradient can be efficiently back-propagated during the FDNN's training.
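The window statistics described above (two means, two variances, one covariance, and two constants) can be combined in the standard SSIM form, sketched here for a single pair of windows. This is an illustration only: the windows are flat lists of pixel values, the function name is hypothetical, the default constants are placeholders rather than the values used in the paper, and taking the loss as one minus SSIM is an assumed convention.

```python
def ssim_window(x, y, c1=1e-4, c2=9e-4):
    """Standard SSIM between two equally sized windows of pixel values.

    c1 and c2 are placeholder stabilizing constants (assumption).
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss_window(x, y):
    """Assumed loss convention: one minus the SSIM of the two windows."""
    return 1.0 - ssim_window(x, y)


ramp = [0.1, 0.2, 0.3, 0.4]
reversed_ramp = [0.4, 0.3, 0.2, 0.1]
same = ssim_window(ramp, ramp)
different = ssim_window(ramp, reversed_ramp)
```

Every operation here is a sum, product, or division of smooth quantities, which is why a loss built this way can pass gradients back to the network that produced the representation.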

2.2 Networks

The FDNN uses eight convolutional layers to extract features and represent the ground-truth image. The first and last layers use 9x9 kernels, which makes the receptive field (RF) of the network large enough. The other six layers use 3x3 kernels, which further enlarge the RF and, together with the ReLU activations that follow them, increase the nonlinearity of the network. Layers 1-7 each have 128 feature maps, while the last layer has only one feature map, consistent with the ground-truth image. Each convolutional layer has a stride of 1, except the second layer, which uses a stride of 2 to down-sample the feature maps, so that convolution from the third to the eighth layer is carried out in the low-resolution space, reducing computational complexity. However, when the given bit-rate exceeds a certain value and the representation is at high resolution, the second layer uses a stride of 1. Every convolutional layer except the last is followed by a ReLU activation layer.
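The FDNN layout described above can be written down as a small layer table, from which the receptive field and the overall down-sampling factor follow by a standard recurrence. This is a sketch for illustration: the helper names are hypothetical, and padding is left out because the text does not specify it.

```python
def fdnn_layers(low_resolution=True):
    """Return (kernel, stride, out_channels, relu) for the eight FDNN layers."""
    second_stride = 2 if low_resolution else 1
    layers = [(9, 1, 128, True), (3, second_stride, 128, True)]
    layers += [(3, 1, 128, True)] * 5   # layers 3-7
    layers.append((9, 1, 1, False))     # last layer: one map, no ReLU
    return layers


def receptive_field(layers):
    """Return (receptive field size, total stride) of a stack of convolutions."""
    rf, jump = 1, 1
    for kernel, stride, _, _ in layers:
        rf += (kernel - 1) * jump  # each layer widens the RF by (k-1) input steps
        jump *= stride             # later layers step further apart in the input
    return rf, jump


low_rf, low_stride = receptive_field(fdnn_layers(low_resolution=True))
high_rf, high_stride = receptive_field(fdnn_layers(low_resolution=False))
```

Under these assumptions the low-resolution variant sees a 47x47 input neighbourhood per output pixel with a total stride of 2, while the high-resolution variant sees 29x29 with a total stride of 1, which illustrates how the stride-2 layer enlarges the RF while also cutting the cost of the later layers.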

In the PPNN, seven convolutional layers extract features, each activated by a ReLU function. The first layer uses 9x9 kernels and the remaining six use 3x3 kernels; each of these layers outputs 128 feature maps. After these layers, one de-convolution layer with a 9x9 kernel and a stride of 2 up-scales the feature maps from low to high resolution so that the output image matches the size of the ground-truth image. However, if the representation is a full-resolution image, this de-convolution layer is replaced by a convolutional layer with a 9x9 kernel and a stride of 1.

The VCNN has the same structure as the PPNN, because both address the same kind of low-level image processing problem. The role of the VCNN is to degrade the representation into a post-processed, compressed, but high-resolution image. By contrast, the role of the PPNN is to improve the quality of the compressed representation, so that after post-processing at the decoder the user receives a high-quality image without coding artifacts.

2.3 Learning Algorithm

Because it is difficult to train the whole framework directly in one pass, we decompose the learning of the three convolutional neural networks into three sub-problems. First, we initialize the parameter sets of the codec, the FDNN, the PPNN, and the VCNN. Meanwhile, we use a traditional interpolation method to obtain an initial representation of the ground-truth image, which is then compressed by the JPEG codec to serve as the input of the training set at the beginning. The first sub-problem trains the PPNN by updating its parameter set according to Eq. (2-3). The second sub-problem trains the VCNN: the compressed representation image obtained from the ground-truth image and its post-processed compressed image predicted by the PPNN are used to update the VCNN's parameter set based on Eq. (4-5). After the VCNN is trained, we fix its parameter set and carry out the third sub-problem, updating the FDNN's parameter set to train the FDNN. After the FDNN is trained, the updated representation images are compressed by the standard codec, and the next iteration begins by training the PPNN again. It is worth mentioning that the VCNN serves to bridge the gap between the FDNN and the PPNN. Thus, once the whole framework is trained, the VCNN is no longer used; only the parameter sets of the FDNN and the PPNN are employed during testing.
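The alternation between the three sub-problems can be sketched as a control-flow skeleton. This is a hedged outline only: every function name and argument below is hypothetical, the network updates are passed in as callables, and the demo stubs at the bottom stand in for real training steps and a real codec.

```python
def train_framework(images, iterations, initial_representation, codec,
                    train_ppnn, train_vcnn, train_fdnn, fdnn_forward):
    """Alternate the three sub-problems and return the order of updates."""
    log = []
    # Initial representation from a traditional interpolation method.
    representations = [initial_representation(x) for x in images]
    for _ in range(iterations):
        # The standard codec compresses the current representations.
        compressed = [codec(y) for y in representations]
        # Sub-problem 1: update the PPNN on (compressed input, ground truth).
        post_processed = train_ppnn(compressed, images)
        log.append("ppnn")
        # Sub-problem 2: update the VCNN to map representation -> PPNN output.
        train_vcnn(representations, post_processed)
        log.append("vcnn")
        # Sub-problem 3: with the VCNN fixed, update the FDNN through it.
        train_fdnn(images)
        log.append("fdnn")
        # Updated representations feed the codec in the next iteration.
        representations = [fdnn_forward(x) for x in images]
    return log


# Demo with trivial stand-ins (assumptions, not real networks or JPEG).
calls = train_framework(
    images=[[1.2, 2.7]],
    iterations=2,
    initial_representation=lambda x: list(x),
    codec=lambda y: [round(v) for v in y],
    train_ppnn=lambda compressed, targets: compressed,
    train_vcnn=lambda reps, outputs: None,
    train_fdnn=lambda targets: None,
    fdnn_forward=lambda x: list(x),
)
```

The skeleton makes one point of the procedure explicit: the codec is only ever called as a black box, and no gradient is expected to flow through it.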

3 Experimental results

Figure 1: The data-set used for our testing.

3.1 Training details

Our framework, which learns a virtual codec neural network to compress images, is implemented with TensorFlow [20]. The training data come from [21], which contains 400 images of size 180x180. We augment these data by cropping, rotating, and flipping the images to build our training set, which contains 3200 image patches of size 160x160 (n=3200). For testing, four images that are widely used for compressed-image de-noising and artifact removal, shown in Fig. 1, are used to evaluate the proposed method. We train our model with the Adam optimizer, with beta1=0.9 and beta2=0.999. The initial learning rate for training the three convolutional neural networks is 0.0001; it decays to half of the initial value once training reaches 3/5 of the total steps and to 1/4 of the initial value once training reaches 4/5 of the total steps.
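The step-wise learning-rate decay described above can be expressed as a small function. This is a sketch, assuming that "reaches" means the step index is greater than or equal to the threshold; the function name and the integer comparison are illustrative choices.

```python
def learning_rate(step, total_steps, initial=1e-4):
    """Piecewise-constant schedule: full, half, then quarter of the initial rate."""
    if 5 * step >= 4 * total_steps:   # reached 4/5 of training
        return initial / 4
    if 5 * step >= 3 * total_steps:   # reached 3/5 of training
        return initial / 2
    return initial


schedule = [learning_rate(s, 1000) for s in (0, 599, 600, 799, 800, 999)]
```

Comparing `5 * step` against multiples of `total_steps` keeps the thresholds in integer arithmetic, so the switch points do not depend on floating-point rounding.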

3.2 The quality comparison of different methods

To validate the proposed framework, we compare our method with JPEG, Foi's [11], BM3D [12], DicTV [13], CONCOLOR [14], and Jiang's [6]. Foi's [11] and BM3D [12] both belong to the class of image de-noising methods, and Foi's [11] is also specifically designed for de-blocking. DicTV [13] and CONCOLOR [14] use dictionary learning and a low-rank model, respectively, to address de-blocking and artifact removal. The results of Foi's [11], BM3D [12], DicTV [13], and CONCOLOR [14] are obtained strictly with the authors' released code and the parameter settings in their papers. However, the closely related method of Jiang [6] provides only one factor for testing, so we re-implement it with TensorFlow. To compare fairly with Jiang's [6], we replace its ComCNN and ReCNN networks with our FDNN and PPNN for training and testing, which removes the effect of network structure design from the experimental results. In addition, we extend Jiang's framework [6] to mixed resolution so that more comparisons can be made between the proposed method and Jiang's [6].

Figure 2: Objective quality comparison in terms of PSNR for several state-of-the-art approaches. (a-d) are the results for images (a-d) in Fig. 1.

The JPEG implementation in OpenCV is used for all experiments. To compare our method with Jiang's [6], the low-resolution representation is compressed by JPEG with quality factors of 5, 10, and 20 at low bit-rates, while the high-resolution representation is compressed with quality factors of 10, 20, 30, and 40 at high bit-rates, for both training and testing. Meanwhile, for Foi's [11], BM3D [12], DicTV [13], and CONCOLOR [14], images are compressed by the JPEG codec with quality factors of 2, 3, 4, 5, 10, 15, 20, 25, and 30. Note that although the JPEG codec is used in the proposed framework here, the framework can be applied to most existing standard codecs.

We use the peak signal-to-noise ratio (PSNR) as the objective quality measure. Fig. 2, where bpp denotes bits per pixel, shows that the proposed method has the best objective performance on PSNR and SSIM compared with several state-of-the-art approaches: JPEG [2], Foi's [11], BM3D [12], DicTV [13], CONCOLOR [14], and Jiang's [6]. These comparisons indicate that back-propagating the gradient from the post-processing neural network to the feature description network plays a significant role in the effectiveness of the feature description and in compression efficiency when a neural network is combined with a standard codec. In short, by learning a virtual codec neural network, the proposed framework provides a good way to resolve the gradient back-propagation problem in image compression frameworks that place a convolutional neural network ahead of a standard codec.
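The two axes of a rate-distortion comparison such as Fig. 2, PSNR and bpp, can be computed as follows. This is a minimal sketch assuming 8-bit grayscale images stored as nested lists and a compressed size given in bytes; the function names are illustrative.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    total, count = 0.0, 0
    for row_r, row_d in zip(reference, distorted):
        for r, d in zip(row_r, row_d):
            total += (r - d) * (r - d)
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def bits_per_pixel(num_bytes, height, width):
    """Bit-rate of a compressed file in bits per pixel."""
    return 8.0 * num_bytes / (height * width)


black = [[0.0, 0.0], [0.0, 0.0]]
white = [[255.0, 255.0], [255.0, 255.0]]
worst_case = psnr(black, white)
rate = bits_per_pixel(3200, 160, 160)
```

A 160x160 image compressed to 3200 bytes corresponds to 1 bpp, and the black-versus-white pair gives 0 dB, the lowest PSNR possible for 8-bit data.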

4 Conclusion

In this paper, we propose an end-to-end mixed-resolution image compression framework that resolves the non-differentiability of the quantization function in lossy image compression by learning a virtual codec neural network. Directly learning the whole framework is an intractable problem, so we decompose this challenging optimization problem into three learning sub-problems. Experimental results show the superiority of the proposed method over several state-of-the-art methods.

References