Structure-Preserving Image Super-resolution via Contextualized Multi-task Learning

07/26/2017, by Yukai Shi et al., Sun Yat-sen University

Single image super-resolution (SR), which refers to reconstructing a higher-resolution (HR) image from an observed low-resolution (LR) image, has received substantial attention due to its tremendous application potential. Despite the breakthroughs of recently proposed SR methods based on convolutional neural networks (CNNs), their generated results usually fail to preserve structural (high-frequency) details. In this paper, regarding global boundary context and residual context as complementary information for enhancing structural details in image restoration, we develop a contextualized multi-task learning framework to address the SR problem. Specifically, our method first extracts convolutional features from the input LR image and applies one deconvolutional module to interpolate the LR feature maps in a content-adaptive way. The resulting feature maps are then fed into two branched sub-networks. During neural network training, one sub-network outputs salient image boundaries and the HR image, and the other sub-network outputs the local residual map, i.e., the residual difference between the generated HR image and the ground-truth image. On several standard benchmarks (i.e., Set5, Set14 and BSD200), our extensive evaluations demonstrate the effectiveness of our SR method in achieving both higher restoration quality and higher computational efficiency compared with several state-of-the-art SR approaches. The source code and some SR results can be found at: http://hcp.sysu.edu.cn/structure-preserving-image-super-resolution/


I Introduction

Image super-resolution (SR) is a fundamental problem in image processing. Single image SR approaches, which aim at restoring a high-resolution (HR) image only from a single low-resolution (LR) image, have been applied to many image and video analysis tasks, such as video surveillance [1], image-based medical analysis [2], and image/video streaming [3, 4].

Common techniques for single image SR can be roughly categorized into reconstruction-, example- and interpolation-based approaches. Reconstruction-based approaches [5, 6, 7], which restore HR images by deconvolutional methods [6] with a global blur degradation model, usually introduce ringing artifacts around salient structures [7] due to inaccurate blurring kernels in the inverse problem. Example-based approaches [8] boost the amplification factor by using internal or external patch data to guide the image restoration. Recently, Huang et al. [9] proposed to exploit self-similarity for single image SR, which greatly expands the internal patch searching space. Hu et al. [10] proposed a cascaded linear regression technique to model the relationship between HR and LR images. Interpolation-based approaches can achieve an acceptable trade-off between performance and efficiency with a pre-defined kernel. However, pre-defined kernels use fixed weights for interpolation, which inevitably cause blur when the weight definition is inconsistent with image structures. To address this issue, various adaptive interpolations [11, 12, 13] have been proposed, but the improvements in restoration quality are still limited.

The success of deep convolutional neural networks (CNNs) in computer vision tasks has inspired new trends in low-level image restoration research, such as rain/dirt removal [14], noise removal [15], face hallucination [16, 17], hashing [18] and image inpainting [19]. Focusing on learning an end-to-end mapping between LR images and their corresponding HR images, several CNN-based methods [20, 21, 22, 23] have been proposed to perform image SR in a pure data-driven manner. That is, they directly minimize the mean squared error (MSE) between the predicted and ground-truth images in the training stage. Although the restoration performance is significantly improved, structural inconsistency between the LR input and HR output still exists. This is because the human visual system is more sensitive to structural changes, which are difficult to capture with MSE-based loss functions. Recent advances in image SR try to address this issue [24, 25, 26] by introducing feature-based perceptual loss functions into the training stage. However, unwanted artifacts and unreal details are also introduced, which make their SR results look unrealistic.

Considering that single image SR is an ill-posed problem, it is necessary to exploit natural image priors to further improve SR performance. Motivated by recent advances in deep learning research that exploit priors in the form of context information when designing neural networks [27, 28], in this work we propose to design neural networks that investigate two types of image structural information: global structural information, which corresponds to salient boundaries from a global perspective, and residual structural information, which contains noticeable details that are critical to visual quality. The success of the multi-task learning framework inspires us to leverage such structural information in a unified manner. For instance, Yang et al. [29] proposed to utilize the common knowledge (e.g., feature selection functions) of multiple tasks as supplementary information to facilitate decision making. Since the aforementioned structural information is usually regarded as complementary context rather than common knowledge, in this work we concentrate on complementary contextualized multi-task learning for structure-preserving single image SR. In particular, we propose a deep joint contextualized multi-task learning framework, in which three types of image components are imposed as complementary contexts and jointly learned, i.e., the base image content, the boundary map, and the residual map. Besides a convolutional network that learns content-adaptive interpolations to produce the intermediate base image, we impose an auxiliary task to back-propagate the global boundary structural context. Meanwhile, an independent sub-network is introduced to explicitly model the noticeable details and provide residual structural context.

The major contribution of this work is the proposed contextualized multi-task learning framework, which is the first attempt to incorporate joint learning of local, global, and residual contexts into CNNs for single image SR. Other contributions come mainly from the proposed content-adaptive interpolation and the sub-networks for capturing complementary image contents, which enable a better trade-off between restoration quality and the number of network parameters. Extensive experiments on several benchmark datasets (e.g., Set5, Set14 and BSD500) demonstrate that the proposed framework achieves superior performance over most learning-based approaches in terms of both visual quality and quantitative metrics, while also facilitating real-time image SR.

We would like to point out that a preliminary version of this work was reported in [30], which coarsely concatenates content-adaptive interpolation and holistic edge context. In this paper, we inherit the idea of preserving structures and refine the network architecture. A simple yet powerful sub-network is further employed to capture noticeable image details for better visual quality. The whole framework is re-interpreted from the perspective of joint context learning and multi-task learning. Besides, more comparisons with state-of-the-art approaches and more detailed analyses of the proposed modules are added to further verify our statements.

The rest of this paper is organized as follows. Section II briefly reviews existing machine learning-based SR approaches that motivate this work. Section III presents the details of the proposed framework, with a thorough analysis of every component. Section IV describes the training of the framework. Section V reports the experimental results on several public benchmarks, comparing with state-of-the-art alternatives. Finally, Section VI concludes this paper.

II Related Work

II-A Interpolation-based image super-resolution

Interpolation-based approaches typically start from evenly placing the pixels of the LR image onto the HR grid (the integral coordinates in the HR image domain). The basic idea of these approaches is to estimate the unknown pixel values in the HR grid by a weighted average of the surrounding known pixels. Since common pixel changes in a local region can be approximated by continuous functions, various weight definitions have been proposed for image interpolation. For example, bilinear interpolation utilizes local linearity, and bicubic interpolation exploits high-order continuity [31]. However, there are plenty of pixel changes that cannot be described by these pre-defined functions, especially in regions with rich image structures. In this case, structures are blurred due to improper pixel averaging. To address this problem, various adaptive interpolations [11, 12] have been proposed. For instance, van der Walt and Herbst [12] proposed to express polygonal pixel overlap as a linear operator to improve the interpolation performance. But the improvements are still limited.

II-B Multi-task learning in image super-resolution

Decades of research on multi-task learning have demonstrated that learning multiple correlated tasks simultaneously can significantly improve the performance of the main task [32, 33, 34, 35, 36]. In single image SR, there is also a trend of utilizing multi-task learning. For example, Yang et al. [37] proposed multi-task K-SVD learning for image SR, in which example image patches are divided into different groups and K-SVD is applied to every group. It is shown that simultaneously learning multiple dictionaries can lead to better SR quality. Liang et al. [38] proposed a multi-task learning framework that jointly considers the image SR process and the image degeneration process. These works show that the multi-task learning framework is a feasible way of utilizing priors in learning-based image SR.

II-C Deep learning in image super-resolution

Recently, deep learning has achieved significant quality improvements in image SR. For example, Dong et al. [20] utilized a three-layer fully convolutional network to learn the non-linear mapping between HR and LR patches, which has a close relationship to sparse coding. Ren et al. [21] introduced Shepard CNNs to facilitate translation-variant interpolation, which provides a solution to both inpainting and SR. Wang et al. [22] proposed a sparse coding based network for image SR; based on the learned iterative shrinkage and thresholding algorithm (LISTA) [39], they employ a set of neural networks to restore images. Zeng et al. [40] proposed a deep autoencoder for SR, which explores the consistent representations of HR and LR images and demonstrates superior efficiency compared to similar methods based on sparse representation. Kumar et al. [41] studied several factors that affect the training phase in order to facilitate learning-based SR with fewer training samples. The models of these methods, although proposed from different aspects, are trained to minimize the squared error w.r.t. the ground-truth HR image, which is not necessarily correlated with good perceptual quality. Bruna et al. [24] referred to this problem as regression to the mean. Their proposed solution is a conditional generative model, which demonstrates improvement in visual quality, but at a high time cost in both training and testing.

More recently, researchers have noticed the importance of image details and made various attempts to exploit them. Kim et al. [23, 42] further improved SR quality with different network architectures such as very deep and recursive network structures. However, these methods heavily rely on very deep networks with plenty of parameters, e.g., a 20-layer convolutional neural network [43]. In addition, perceptual losses have been proposed for CNNs [24, 26], which move the loss from the image space to the high-level feature space of a pre-trained VGG-net [43]. At the same time, Ledig et al. [25] proposed to apply an adversarial network to the task of SR, which produces more image details but a lower PSNR score. More related to our work, there are several attempts to accelerate image SR. By developing a sub-pixel convolutional layer, Shi et al. [3] used a single model to handle real-time image SR. Similarly, Dong et al. [44] applied convolutional layers to the LR image and upscaled it with deconvolution. Both promise low computational complexity, but there still exists plenty of room for performance improvement.

III Contextualized Multi-task Learning

Fig. 1: The architecture of our contextualized multi-task deep learning framework for single image super-resolution. Given an input LR image, our framework first extracts its convolutional features and applies one deconvolutional module to interpolate the feature maps in a content-adaptive way. The resulting maps are then fed into two branched sub-networks, which incorporate global boundary context and residual context, respectively. Specifically, during the neural network training, one sub-network outputs salient image boundaries and the intermediate HR image; the other sub-network outputs the local residual map, i.e., the residual difference between the generated HR image and the ground-truth image. The final HR estimation is obtained by fusing the intermediate HR image and the local residual map.

In this section, we present the details of our framework. As sketched in Fig. 1, the proposed framework includes three components: feature extraction, content-adaptive interpolation, and multi-task estimation.

Component        | Feature Extraction          | Interpolation-1 | BCN        | Interpolation-2 | RCN
layer            | conv  conv   conv    conv   | deconv          | conv  conv | deconv          | conv  conv
filter size      | 5     3      3       1      | 11              | 3     3    | 11              | 3     3
output channels  | 16    32     128     8      | 8               | 12    2    | 8               | 12    1
output size      | 128   124    124     124    | 372             | 372   370  | 372             | 372   370
parameters       | 400   4,608  36,864  1,024  | 7,744           | 864   216  | 7,744           | 864   108
TABLE I: Detailed setup of each component in our framework. The five rows below the component row list the layer type, filter size, number of output channels, size of the output feature maps, and number of parameters, respectively. The content-adaptive interpolation layers for RCN and BCN are "Interpolation-1" and "Interpolation-2", respectively. Note that this table takes a magnification factor of 3 and a fixed input resolution as an example of the parameter setup.

III-A Feature Extraction

Inspired by Pyramid-Net [45], we design a pyramid network structure for feature extraction. That is, there are 3 convolutional layers with 16, 32 and 128 kernels, respectively. The detailed setup is summarized in Table I. The first layer, with a 5×5 kernel, is designed to have a large receptive field that captures as much image information as possible, as illustrated in [46]. The other two layers use 3×3 kernels for better efficiency, as in [47]. Note that we focus on extracting features from the original LR images instead of the interpolated images. Thanks to the decreased cost of convolutional operations on small feature maps, the proposed feature extraction can significantly accelerate the pipeline without an obvious quality drop. Since the LR image has been represented as high-dimensional feature maps by the first 3 layers, the computational cost would become rather high if we fed these high-dimensional feature maps into content-adaptive interpolation directly. Therefore, we apply a shrinking layer with 8 kernels of size 1×1 to reduce the feature dimension. Note that the kernel number is empirically chosen for a reasonable trade-off between effectiveness and efficiency. Benefiting from the shrinking layer, our model not only avoids parameter explosion but also improves restoration efficiency.
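To make the layer configuration concrete, the following is a minimal PyTorch sketch of the feature extraction stage following Table I (a 5×5 convolution with 16 kernels, two 3×3 convolutions with 32 and 128 kernels, and a 1×1 shrinking layer with 8 kernels). The class name, the ReLU activations and the size-preserving padding are our assumptions, not details specified by the paper.

import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Pyramid-style feature extractor followed by a 1x1 shrinking layer.
    Channel widths follow Table I; padding/activation choices are assumptions."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # wide receptive field on the LR input
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 8, kernel_size=1),              # shrinking layer: 128 -> 8 channels
            nn.ReLU(inplace=True),
        )

    def forward(self, lr_luminance: torch.Tensor) -> torch.Tensor:
        # lr_luminance: (N, 1, h, w) luminance channel of the LR input
        return self.body(lr_luminance)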

Fig. 4: A comparison between image interpolations by bicubic and learned kernels.

III-B Content-adaptive Interpolation

The second component is one deconvolutional layer, which is used to interpolate the LR feature maps in a content-adaptive way. The deconvolutional layer has 8 kernels, whose spatial size is determined by the upscaling factor following the principles of bicubic interpolation. That is, the kernel should be large enough to cover the second pixel around the anchor pixel in the HR grid; for example, the deconvolutional kernel is of size 11×11 for the upscaling factor of 3 (see Table I), and it grows or shrinks accordingly for the factors of 4 and 2. In this way, the deconvolutional layer can be regarded as a neural network implementation of standard image interpolation. Let I_H be the HR image defined on the HR grid. We construct another HR image I_0 by evenly placing the LR pixels in the HR grid with identical pixel intervals. Then, standard interpolation can be written as:

I_H(i, j) = \sum_{(m, n) \in N(i, j)} W(i - m, j - n) \, I_0(m, n),    (1)

where (i, j) and (m, n) are the pixel indices in the HR grid, N(i, j) represents the subset of neighbouring pixels around pixel (i, j), and W is the pre-defined weight for interpolation. Note that I_0(m, n) is non-zero only when it comes from a pixel in the LR image.

With these definitions, we re-formulate the interpolation process as a basic component of a deconvolutional layer, i.e.,

I_H(i, j) = f\Big( \sum_{(m, n) \in N(i, j)} K(i - m, j - n) \, I_0(m, n) + b \Big),    (2)

where f represents the activation function, K is the deconvolutional kernel, I_0(m, n) represents the pixel of I_0 that contributes to pixel (i, j), and b is the bias.

In the proposed content-adaptive interpolation, we use multiple deconvolutional kernels in a similar fashion. That is, we evenly place the LR image in the HR grid to construct I_0. Then,

F_k = f( K_k \ast I_0 + b_k ),    (3)

where the subscript k represents the kernel index, "\ast" represents the convolutional operator, and F_k is the k-th output map of the layer. In this way, content-adaptive image interpolation can be accomplished via a deconvolutional layer whose kernels are learned from sufficient training data. Note that the deconvolutional layer is in the middle of the proposed network, which is different from other CNN-based SR methods [21, 20] that use deconvolution as the last layer. It is shown empirically that the proposed network can achieve good restoration quality with a reasonable increase in network parameters.
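As a rough illustration of Eq. (3), the sketch below realizes content-adaptive interpolation as a single learnable transposed-convolution (deconvolution) layer in PyTorch, using the 11×11 kernel, stride 3 and 8 channels of Table I for the ×3 setting; the padding value, chosen so that the output is exactly 3 times larger, is our assumption.

import torch
import torch.nn as nn

class ContentAdaptiveInterp(nn.Module):
    """Learned interpolation: a transposed convolution upsamples the 8-channel
    feature maps by the target factor (x3 here, per Table I)."""
    def __init__(self, channels: int = 8, factor: int = 3, kernel_size: int = 11):
        super().__init__()
        padding = (kernel_size - factor) // 2  # keeps output size = factor * input size
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=kernel_size,
                                         stride=factor, padding=padding)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.deconv(feat)

# Example: 8-channel LR feature maps of size 40x40 become 120x120 after x3 interpolation.
feat = torch.randn(1, 8, 40, 40)
print(ContentAdaptiveInterp()(feat).shape)  # torch.Size([1, 8, 120, 120])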

To compare the proposed network with the bicubic interpolation, we construct a small network which only has one deconvolutional layer to learn an adaptive kernel, taking BSD300 as training data and bicubic interpolation parameters for initialization. The intensity changes of bicubic and our learned kernels are visualized in Fig. 4, which illustrates that the learned kernel contains more high-frequency components. Meanwhile, the restoration results also indicate that the learned kernel leads to a superior restoration quality with more recovered details compared to the bicubic kernel. Thus, the effectiveness of the proposed adaptive interpolation is verified.

III-C Contextualized Multi-task Learning

Fig. 7: Example images with salient boundaries. (a) Original images. (b) Manually labeled edge maps.

Inspired by multi-task learning principles, we attempt to introduce auxiliary knowledge into the SR problem.

Fig. 8: Illustration of several representative feature maps produced by the first three layers of feature extraction. The top row and bottom row show image-like and edge-like features, respectively.

Global Boundary Context: We develop a Boundary Context sub-Network (BCN) to preserve the salient boundaries that represent global image structures. BCN consists of two convolutional layers with 3×3 kernels, where one layer has 12 kernels and the other has 2 kernels. In the training phase of BCN, we propose to exploit salient image boundaries by regarding edge detection as a joint task of HR image restoration. In particular, we introduce an auxiliary term into the objective function, which computes the error between the predicted and the human-labeled edge/boundary maps. These boundary maps are from the Berkeley Segmentation Dataset (BSD) [48]. Note that there are multiple boundary maps per image in the BSD500 dataset; we use their summation for better visualization and show examples in Fig. 7.

With the two tasks of image restoration and edge detection, image components and structural features are first extracted and enlarged by content-adaptive interpolation before being fed into the BCN. Several representative samples of the extracted feature maps are shown in Fig. 8, in which the top row and bottom row show image-like and edge-like features, respectively. This implies that these layers simultaneously extract both image components and structural features, making it possible to produce the base image and boundary maps in the HR image domain.

Through joint optimization in an end-to-end manner, feature extraction, content-adaptive interpolation and BCN can provide complementary context information to each other. In this way, structure-aware feature representations can be learned together with the content-adaptive interpolation.

Residual Context: Since it pays close attention to generating the HR image with salient boundaries, the concatenated BCN might fail to restore some subtle but noticeable structures. Motivated by the recent residual learning paradigm [23, 49], we attempt to address this issue by employing a Residue Context sub-Network (RCN). The objective of the RCN is to synthesize a residual image, defined as the difference between the interpolated HR image and the ground-truth HR image. In contrast to using the bicubic-interpolated HR image as in [23] and [49], our model uses the intermediate HR image provided by BCN. This brings two benefits: i) higher image SR performance, since the HR image provided by BCN already achieves performance comparable to state-of-the-art methods, so RCN can focus on remedying the overlooked information for higher SR quality; ii) a lightweight network architecture for RCN, since the interpolated image we use contains significantly richer information than the bicubic one, which makes synthesizing the residual image much easier than in [23] and [49]. As illustrated in Fig. 1, the architecture of RCN is the same as that of the concatenated BCN.

For the joint optimization of content-adaptive interpolation, BCN and RCN, we develop a fusion layer to merge the intermediate outputs of RCN and BCN in a data-driven way. In particular, the final HR image of our framework is obtained by:

\hat{I} = w \ast \big[ \hat{Y}, \hat{R} \big],    (4)

where w denotes a convolutional filter applied to the two merged maps, \hat{Y} is the intermediate HR image provided by BCN, and \hat{R} is the residual image synthesized by RCN. In this way, the parameters of w can be adaptively updated during the learning process.
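To make the two branched sub-networks and the fusion of Eq. (4) concrete, here is a minimal PyTorch sketch under stated assumptions: BCN and RCN each use two 3×3 convolutional layers with 12 intermediate channels (per Table I), BCN splits its 2-channel output into the intermediate HR image and the boundary map, and the fusion layer is a learnable convolution over the concatenated intermediate HR image and residual map. The class names, activations and the 3×3 fusion filter size are our assumptions.

import torch
import torch.nn as nn

class BCN(nn.Module):
    """Boundary Context sub-Network: 3x3 conv (12 ch) -> 3x3 conv (2 ch),
    emitting an intermediate HR image and a salient-boundary map."""
    def __init__(self, in_ch: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 12, 3, padding=1)
        self.conv2 = nn.Conv2d(12, 2, 3, padding=1)

    def forward(self, x):
        out = self.conv2(torch.relu(self.conv1(x)))
        hr_intermediate, boundary = out[:, :1], out[:, 1:]
        return hr_intermediate, boundary

class RCN(nn.Module):
    """Residue Context sub-Network: same structure, single-channel residual output."""
    def __init__(self, in_ch: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 12, 3, padding=1)
        self.conv2 = nn.Conv2d(12, 1, 3, padding=1)

    def forward(self, x):
        return self.conv2(torch.relu(self.conv1(x)))

class Fusion(nn.Module):
    """Data-driven fusion of Eq. (4): a learnable convolution applied to the
    concatenation of the intermediate HR image and the residual map."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, hr_intermediate, residual):
        return self.fuse(torch.cat([hr_intermediate, residual], dim=1))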

IV Framework Training

The proposed framework is jointly optimized on a set of (LR image, HR image, HR edge map) triplets. (In the BSD datasets, more than one boundary map is provided for every image, and all of them are used in our training process. Since multiple boundary maps are handled in the same way, in this subsection we focus on the case of one boundary map for simplicity.) For convenience, we use X, Y and E to represent the LR image, HR image and boundary map, respectively. Given the input X, the objective of our model is to reconstruct an HR image similar to Y and predict a boundary map similar to E.

The parameters of our model can be divided into four disjoint parts, i.e., W = {W_c, W_h, W_e, W_r}, where W_c and W_r denote the parameters of content-adaptive interpolation and RCN, respectively (the parameters of the feature extraction stage are merged into the content-adaptive interpolation part W_c). For BCN, we use W_h and W_e to represent the specific weights for generating the intermediate HR image and the boundary maps, respectively. Since the parameters are separable, we propose to train our model in three iterative steps. First, we jointly train content-adaptive interpolation and BCN until convergence. Second, fixing the parameters of content-adaptive interpolation and BCN, we update the parameters of RCN. Third, we jointly optimize content-adaptive interpolation, BCN and RCN. Specifically, content-adaptive interpolation and BCN are trained according to the following objective function:

L(W_c, W_h, W_e) = L_Y(W_c, W_h) + \lambda \, L_E(W_c, W_e),    (5)

where L_Y and L_E represent the HR image reconstruction objective and the boundary prediction objective, respectively. The balance weight \lambda controls the relative importance of L_Y and L_E, and is empirically set to 1 in all our experiments. Both L_Y and L_E take the form of the mean squared error (MSE), i.e.,

L_Y(W_c, W_h) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{Y}_i - Y_i \|^2,    (6)

and

L_E(W_c, W_e) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{E}_i - E_i \|^2,    (7)

where \hat{Y}_i and \hat{E}_i denote the reconstructed HR image and the predicted boundary map, respectively, i represents the sample index, and N is the number of training triplets. For simplicity, we use \hat{Y}_i to denote the network output for the input X_i, and similarly for \hat{E}_i. Note that when multiple boundary maps are available, there will be more edge prediction objectives.

1: Training LR images {X_i}; HR images {Y_i}; boundary maps {E_i};
2: while the loss in Eq. (5) has not converged do
3:     t ← t + 1;
4:     Randomly select a subset of LR images, HR images and boundary maps from the training set;
5:     for all samples in the selected subset do
6:         Obtain \hat{Y}_i and \hat{E}_i via forward propagation;
7:         Update {W_c, W_h, W_e} via the intermediate HR output and boundary output, using Eqs. (6) and (7);
8:     end for
9: end while
10: while the loss in Eq. (8) has not converged do
11:     t ← t + 1;
12:     for all samples in the selected subset do
13:         Obtain \hat{R}_i via forward propagation;
14:         Update W_r via the residual output and the intermediate HR output, using Eq. (8);
15:     end for
16: end while
Algorithm 1 Contextualized Multi-task Learning.

The loss function for training RCN is defined as:

L_R(W_r) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{R}_i - ( Y_i - \hat{Y}_i ) \|^2.    (8)

Finally, the whole framework is optimized by employing the standard back-propagation algorithm with the objective

L(W) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{I}_i - Y_i \|^2,    (9)

where \hat{I}_i, the output of the fusion layer, is the final HR image in the testing phase.

The whole training phase is summarized as Algorithm 1, which accords with the pipeline of our proposed framework in Fig. 1.
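For illustration, the sketch below implements one update of each of the first two training stages of Algorithm 1 in PyTorch. The loss structure follows Eqs. (5)-(8); the model interface (forward_bcn, forward_rcn), the optimizer handling and the batching are assumptions made for this sketch, not the authors' exact implementation.

import torch
import torch.nn.functional as F

lambda_e = 1.0  # balance weight between reconstruction and boundary objectives (Eq. 5)

def step_stage1(net, optimizer, lr_batch, hr_batch, edge_batch):
    """Stage 1: jointly train content-adaptive interpolation and BCN (Eqs. 5-7)."""
    hr_pred, edge_pred = net.forward_bcn(lr_batch)        # assumed interface
    loss = F.mse_loss(hr_pred, hr_batch) + lambda_e * F.mse_loss(edge_pred, edge_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def step_stage2(net, optimizer_rcn, lr_batch, hr_batch):
    """Stage 2: freeze interpolation/BCN and train RCN to predict the residual (Eq. 8)."""
    with torch.no_grad():
        hr_pred, _ = net.forward_bcn(lr_batch)            # intermediate HR image from BCN
    residual_target = hr_batch - hr_pred                  # residual w.r.t. the ground truth
    residual_pred = net.forward_rcn(lr_batch)             # assumed interface
    loss = F.mse_loss(residual_pred, residual_target)
    optimizer_rcn.zero_grad()
    loss.backward()
    optimizer_rcn.step()
    return loss.item()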

V Experiments

V-A Experiment Setting

Datasets: All experiments are evaluated on three challenging benchmarks, i.e., Set5 [50], Set14 [51] and BSD500 [48]. The BSD500 dataset consists of 500 natural images with human annotations of the corresponding boundaries. We use the 300 images from its training and validation sets for training. The remaining 200 images of BSD500 form a widely used benchmark called BSD200. Besides, the Set5 and Set14 datasets are also adopted as testing sets by other state-of-the-art methods such as [20, 22, 23]. Thus, we conduct experiments on these three benchmarks.

Test set          Set5                    Set14                   BSD200
Scaling factor    ×2     ×3     ×4        ×2     ×3     ×4        ×2     ×3     ×4
Bicubic           33.66  30.39  28.42     30.23  27.54  26.00     29.43  27.18  25.92
A+ [52]           36.55  32.59  30.28     32.28  29.13  27.32     31.44  28.36  26.83
SRCNN [20]        36.34  32.59  30.09     32.18  29.00  27.20     31.38  28.28  26.73
SRF [53]          36.89  32.72  30.35     32.52  29.23  27.41     31.66  28.45  26.89
FSRCNN [44]       36.94  33.06  30.55     32.54  29.37  27.50     31.73  28.55  26.92
SCN [22]          36.93  33.10  30.86     32.56  29.41  27.64     31.63  28.54  27.02
ShCNN [21]        36.83  32.88  30.46     32.48  29.39  27.51     31.75  28.60  26.95
Proposed          37.17  33.45  31.11     32.77  29.63  27.79     31.81  28.67  27.11
TABLE II: Quantitative comparisons among different methods in terms of PSNR (dB), in which the underline indicates the second place and bold face represents the first place.
Test set          Set5                       Set14                      BSD200
Scaling factor    ×2      ×3      ×4         ×2      ×3      ×4         ×2      ×3      ×4
Bicubic           0.9299  0.8682  0.8104     0.8687  0.7736  0.7019     0.8524  0.7469  0.6727
A+ [52]           0.9544  0.9088  0.8603     0.9056  0.8188  0.7491     0.8966  0.7945  0.7171
SRCNN [20]        0.9521  0.9033  0.8530     0.9039  0.8145  0.7413     0.8835  0.7794  0.7018
SRF [53]          0.9536  0.9046  0.8529     0.9042  0.8168  0.7457     0.9011  0.8053  0.7332
FSRCNN [44]       0.9552  0.9128  0.8619     0.9080  0.8231  0.7509     0.9064  0.8123  0.7378
SCN [22]          0.9571  0.9112  0.8644     0.9093  0.8246  0.7541     0.9058  0.8139  0.7403
ShCNN [21]        0.9551  0.9109  0.8638     0.9079  0.8239  0.7530     0.9069  0.8144  0.7407
Proposed          0.9583  0.9175  0.8736     0.9109  0.8269  0.7594     0.9074  0.8182  0.7460
TABLE III: Quantitative comparisons among different methods in terms of SSIM, in which the underline indicates the second place and bold face represents the first place.

Implementation details: In the training phase, we first convert each original color image to a grayscale image by extracting the luminance component in the YCbCr color space. Then, we downscale the training images by the requested scaling factors (e.g., 2, 3, and 4) to obtain the LR images. The LR images are cropped into a set of patches with a stride of 4, and the patch size is set to be the same as the receptive field. The corresponding HR images and boundary maps are cropped with respect to the scaling factors. Before training, we initialize the network parameters from a zero-mean Gaussian distribution with a fixed standard deviation. For the pre-training of the proposed model, we use the 91-images [8] and PASCAL VOC2012 [54] datasets, which together contain 13,487 clear images. Specifically, the model is pre-trained on LR and HR image pairs following the same strategy as [20]. Since the feature extraction stage employs a pyramid structure, we speed it up with the help of Factorized CNN [55]. In the training on the BSD300 dataset, the last layer and the remaining layers use separate fixed learning rates. To increase the number of training samples, we also employ data augmentation on the BSD300 dataset, as reported in [22].
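As a rough sketch of this data preparation step (not the authors' exact pipeline), the code below extracts the YCbCr luminance channel, downscales it by the requested factor to form the LR counterpart, and crops aligned LR/HR patch pairs with a stride of 4; the LR patch size of 17 and the use of bicubic downscaling are assumptions made for illustration.

import numpy as np
from PIL import Image

def make_patch_pairs(path, factor=3, lr_patch=17, stride=4):
    """Generate aligned (LR, HR) luminance patch pairs for training (sketch)."""
    y = np.asarray(Image.open(path).convert("YCbCr"))[..., 0].astype(np.float32) / 255.0
    h, w = (y.shape[0] // factor) * factor, (y.shape[1] // factor) * factor
    hr = y[:h, :w]  # crop so the size is divisible by the scaling factor
    lr = np.asarray(
        Image.fromarray((hr * 255).astype(np.uint8)).resize((w // factor, h // factor),
                                                            Image.BICUBIC),
        dtype=np.float32) / 255.0
    pairs = []
    for i in range(0, lr.shape[0] - lr_patch + 1, stride):
        for j in range(0, lr.shape[1] - lr_patch + 1, stride):
            lr_p = lr[i:i + lr_patch, j:j + lr_patch]
            hr_p = hr[i * factor:(i + lr_patch) * factor,
                      j * factor:(j + lr_patch) * factor]
            pairs.append((lr_p, hr_p))
    return pairs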

Methods Parameter number PSNR
SRCNN [20] 57,184 32.59
FSRCNN [44] 15,740 33.06
VDSR [23] 664,704 33.66
Ours 60,436 33.45
Deeper ours 594,964 33.80
TABLE IV: Comparison on parameter number and PSNR performance on Set5 with a scaling factor of 3.
Fig. 9: The efficiency analysis for the scaling factor of 3 on the Set5 dataset.

Methods and metrics: We compare our model with several recent state-of-the-art methods, including the three-layer CNN (SRCNN) [20], super-resolution forests (SRF) [53], the sparse coding-based network (SCN) [22], adjusted anchored neighborhood regression (A+) [52], the Shepard interpolation neural network (ShCNN) [21], the very deep convolutional network (VDSR) [23], and the fast convolutional network for SR (FSRCNN) [44]. For fair comparisons, we employ the popular PSNR and SSIM metrics for evaluation. To evaluate the structure-preserving capability, we additionally introduce a metric called "EPSNR", which is formulated as:

EPSNR = 10 \log_{10} \left( \frac{MAX^2}{\frac{1}{|\Omega|} \sum_{p \in \Omega} \big( I_{gt}(p) - I_{sr}(p) \big)^2 } \right),    (10)

where MAX = 255 is used for 8-bit images, I_{gt} and I_{sr} denote the ground-truth and the produced HR images, respectively, \Omega indicates the pixels whose distances to their closest boundary are less than 2 pixels, and p is the pixel index. EPSNR is thus expected to better reflect image fidelity on edge regions.
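To make the EPSNR definition operational, here is a small NumPy/SciPy sketch under stated assumptions: the boundary map is given as a binary mask of labeled boundary pixels, and the 2-pixel neighborhood is obtained with a Euclidean distance transform.

import numpy as np
from scipy.ndimage import distance_transform_edt

def epsnr(ground_truth, restored, boundary_map, max_val=255.0, radius=2):
    """Edge-region PSNR (Eq. 10): the MSE is accumulated only over pixels whose
    distance to the closest labeled boundary pixel is less than `radius`."""
    # Distance of every pixel to the nearest boundary pixel (boundary_map > 0).
    dist = distance_transform_edt(boundary_map == 0)
    mask = dist < radius
    mse = np.mean((ground_truth[mask].astype(np.float64) -
                   restored[mask].astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)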

Fig. 18: Visual comparison on the “Zebra” image from Set14 (factor 3), where the PSNR and SSIM are separated by “/”.
Fig. 19: Visual comparisons on the “Butterfly” image from Set5 (factor 4), where the PSNR and SSIM are separated by “/”.
Fig. 26: Visual comparison among Bicubic, ShCNN, our proposed method and the SRGAN methods. Note that 'SRGAN-1' denotes the adversarial network with the MSE-based content loss only, and 'SRGAN-2' denotes the adversarial network with the perceptual loss mentioned in [25].

We have also investigated the model complexity in terms of the number of parameters. Two profiles of our model are used, i.e., the common model (denoted as "ours") used in the above comparisons, and a model with a much deeper architecture (denoted as "deeper ours"). In the "deeper ours" profile, we only increase the number of convolutional layers in the feature extraction stage from 4 to 18, so that our model has a similar number of parameters to VDSR. Both profiles can be accelerated by cuDNN [47]. All the CNN-based methods are compared on the Set5 dataset with a scaling factor of 3. The results in Table IV demonstrate that the performance of our model keeps increasing as the number of parameters increases. Using a comparable number of network parameters, our model achieves a PSNR gain of 0.14 dB over VDSR. Since fewer parameters benefit both the training and testing phases, we recommend our model with the common profile. Fig. 9 illustrates the efficiency of all the compared methods using a "time-quality" diagram. Our model with the common profile runs nearly 2 times faster than VDSR while maintaining the second best SR performance, which makes it quite suitable for lightweight and fast implementation on consumer-grade devices. For applications that require extremely high SR quality, the "deeper ours" profile is a good choice.

Fig. 31: Visual results of our model on real-world cases. The upper row shows the case of video surveillance and the lower row shows the case of a mobile device. For clearer comparisons, please zoom in on the electronic version of this paper.

Some promising examples are visualized in Fig. 18 and Fig. 19. For better viewing, we interpolate the chrominance components by the bicubic method to generate color images. To clearly demonstrate the differences, we choose one patch from each image and attach it below. Compared to other methods, our model produces images with sharper and clearer boundaries.

Visual Comparison with SRGAN: We compare our method with the super-resolution generative adversarial network (SRGAN) [25]. Thanks to its adversarial loss, SRGAN obtains promising performance. However, it still has problems in recovering real details, as verified by the comparisons shown in Fig. 26. The enlarged patches in Fig. 26 (c) and (d) show that some waterdrops that exist in the ground-truth image disappear in the results produced by the SRGAN methods, whereas these waterdrops are captured by our method and ShCNN. As pointed out in [56], SRGAN tends to bring in similar textures instead of recovering real details. Therefore, our proposed framework performs better than SRGAN in recovering accurate details.

Discussion on real-world cases: To further justify the effectiveness of our method, we move one step forward and deal with images from video surveillance and a mobile device. Specifically, we apply our model to real-world images with a scaling factor of 3. In Fig. 31, "Original" indicates the original images and "Proposed" represents the images processed by our model. As one can observe from the results shown in Fig. 31, the "Proposed" results contain fewer artifacts than the "Original" images. This demonstrates the robustness of our method towards real-world challenges.

V-B Ablation Study

In this subsection, we conduct detailed analyses of the proposed modules, i.e., content-adaptive interpolation, BCN and RCN, for a better understanding of our framework. We hope such analyses can lead to new insights into image restoration research.

Content-adaptive interpolation: One of the major differences between our model and SRCNN [20] is the employment of the deconvolutional layer. To demonstrate the superiority of our design, we train several fully convolutional networks (FCNs) with various numbers of layers for comparison. Specifically, we increase the number of middle layers from 5 to 16, resulting in FCN-5, FCN-9, FCN-12, and FCN-16. These FCNs follow the bicubic upsampling strategy as in SRCNN [20]. Our content-adaptive interpolation model consists of five convolutional layers and one deconvolutional layer, covering the feature extraction stage, the content-adaptive interpolation and BCN. We remove the boundary prediction objective in order to isolate the effectiveness of content-adaptive interpolation. By comparing this model with the FCNs on the Set5 dataset with a scaling factor of 3, we obtain the results shown in Table V.

Fig. 32: The PSNR curves generated by models trained with and without edge prediction objective.

These results indicate that although the SR performance of the FCNs keeps increasing with the network depth, even the 16-layer FCN cannot outperform content-adaptive interpolation. In contrast, our content-adaptive interpolation network, which has only 6 layers, surpasses these FCNs by a clear margin; more specifically, it outperforms FCN-16 by 0.32 dB. This explicitly verifies the superiority of the content-adaptive interpolation.

Module FCN-5 FCN-9 FCN-12 FCN-16 LSPM
PSNR (dB) 32.75 32.82 32.86 32.97 33.29
TABLE V: Comparison between content-adaptive interpolation and FCNs on Set5 dataset with a scaling factor of 3. We remove the edge prediction objective to justify the effectiveness of content-adaptive interpolation.

Global Boundary Context: The proposed BCN is motivated by the paradigm of multi-task learning, which incorporates edge estimation as a co-task of HR image generation. Therefore, we analyze it by comparing the SR performance with and without the edge prediction objective. Since the BSD200 dataset contains manually labeled boundary maps, we can easily compute the EPSNR on it. We compare two profiles of our model on this dataset with a scaling factor of 3 using both the PSNR and EPSNR metrics. By removing the boundary prediction objective, we degrade BCN into single-task learning and denote it as "ours w/o boundary". As illustrated in Table VI, the PSNR and EPSNR gains indicate the benefit of multi-task learning. Because the boundaries only occupy a small portion of the whole image, the improvement in overall PSNR is minor. However, the large improvement in EPSNR verifies the effectiveness of BCN. Another benefit of incorporating the boundary prediction objective is the acceleration of the training process. As shown in the PSNR curves of Fig. 32, the edge prediction objective not only accelerates the convergence, but also contributes to a higher restoration quality.

Methods PSNR (dB) EPSNR (dB)
Bicubic 27.18 (+0.00) 22.71 (+0.00)
A+ [52] 28.36 (+1.21) 24.28 (+1.57)
SRCNN [20] 28.28 (+1.10) 24.24 (+1.53)
SRF [53] 28.45 (+1.27) 24.27 (+1.56)
SCN [22] 28.54 (+1.36) 24.29 (+1.58)
ShCNN [21] 28.60 (+1.42) 24.32 (+1.61)
Ours w/o boundary 28.68 (+1.46) 24.36 (+1.65)
Ours 28.69 (+1.47) 24.43 (+1.72)
TABLE VI: Comparisons on BSD200 dataset with a scaling factor of 3.

Local Residue Context: We design RCN to provide complementary information for image SR. Therefore, the SR performance of our model will be degraded if RCN is removed. To verify our statement, we use another profile named “ours w/o RCN”, which is very similar to the previous version of this work [30], to conduct more comparisons on the aforementioned datasets with a scaling factor of 3. Table VII reports the comparison results. It is shown that, although content-adaptive interpolation and BCN can produce HR image of high quality, the SR performance can still be further improved. The improvement on PSNR is minor because PSNR is a squared error-based metric, which is difficult to reveal subtle structure differences. In contrast, because SSIM concentrates on structure similarity, the improvement on SSIM is more significant.

Test set              Set5       Set14      BSD200
PSNR   Ours w/o RCN   33.36 dB   29.57 dB   28.63 dB
       Ours           33.47 dB   29.64 dB   28.69 dB
SSIM   Ours w/o RCN   0.9162     0.8255     0.8176
       Ours           0.9176     0.8273     0.8183
TABLE VII: Comparisons between our model with and without RCN in terms of PSNR (top) and SSIM (bottom).

VI Conclusion and Future Work

In this paper, we have proposed a novel contextualized multi-task deep learning framework to address single image super-resolution. Our neural network model incorporates global boundary context and residual context to super-resolve images while well preserving their structural details. Moreover, we have introduced "content-adaptive interpolation", which leverages a set of filters that are adapted to the training samples. Different from the kernel estimation in blind image SR, which usually employs only a single filter, our proposed content-adaptive interpolation has more filtering parameters and can be more conveniently embedded into CNNs. Our extensive experiments suggest that the proposed method outperforms other leading image super-resolution approaches and achieves state-of-the-art performance in terms of both popular evaluation metrics and visual quality.

There are several directions in which to extend our method. First, we are considering introducing a perceptual loss into the multi-task optimization, aiming to better capture realistic and meaningful image details. Second, we shall generalize this framework to video data by taking spatio-temporal coherency into consideration. Third, since exploiting additional common knowledge in deep neural networks would be an interesting direction, we intend to utilize complementary spatio-temporal contexts as privileged information for video SR, as suggested by Yan et al. [34].

Acknowledgements

This work was partially supported by NSFC (No. 61602533), the Fundamental Research Funds for the Central Universities, the Hong Kong Scholars Program, and the Hong Kong Polytechnic University Mainland University Joint Supervision Scheme. We gratefully acknowledge NVIDIA for GPU donations.

References

  • [1] J. Jiang, R. Hu, Z. Wang, and Z. Han, “Noise robust face hallucination via locality-constrained representation,” IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1268–1281, Aug 2014.
  • [2] H. He, S. Mandal, A. Buehler, X. L. Deán-Ben, D. Razansky, and V. Ntziachristos, “Improving optoacoustic image quality via geometric pixel super-resolution approach,” IEEE Transactions on Medical Imaging, vol. 35, no. 3, pp. 812–818, March 2016.
  • [3] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
  • [4] Yaniv Romano, John Isidoro, and Peyman Milanfar, “Raisr: rapid and accurate image super resolution,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 110–125, 2017.
  • [5] Michal Irani and Shmuel Peleg, “Improving resolution by image registration,” CVGIP: Graphical models and image processing, vol. 53, no. 3, pp. 231–239, 1991.
  • [6] Qi Shan, Zhaorong Li, Jiaya Jia, and Chi-Keung Tang, “Fast image/video upsampling,” ACM Transactions on Graphics (TOG), vol. 27, no. 5, pp. 153, 2008.
  • [7] Tomer Michaeli and Michal Irani, “Nonparametric blind super-resolution,” in ICCV, 2013, pp. 945–952.
  • [8] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [9] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
  • [10] Yanting Hu, Nannan Wang, Dacheng Tao, Xinbo Gao, and Xuelong Li, “Serf: A simple, effective, robust, and fast image super-resolver from cascaded linear regression,” IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4091–4102, 2016.
  • [11] Jinyu Chu, Ju Liu, Jianping Qiao, Xiaoling Wang, and Yujun Li, “Gradient-based adaptive interpolation in super-resolution image restoration,” in Signal Processing, 2008. ICSP 2008. 9th International Conference on. IEEE, 2008, pp. 1027–1030.
  • [12] Stéfan J van der Walt and BM Herbst, “A polygon-based interpolation operator for super-resolution imaging,” arXiv preprint arXiv:1210.3404, 2012.
  • [13] Keze Wang, Liang Lin, Jiangbo Lu, Chenglong Li, and Keyang Shi, “Pisa: Pixelwise image saliency by aggregating complementary appearance contrast measures with edge-preserving coherence,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3019–3033, Oct 2015.
  • [14] David Eigen, Dilip Krishnan, and Rob Fergus, “Restoring an image taken through a window covered with dirt or rain,” in ICCV. IEEE, 2013, pp. 633–640.
  • [15] Viren Jain and Sebastian Seung, “Natural image denoising with convolutional networks,” in Advances in Neural Information Processing Systems, 2009, pp. 769–776.
  • [16] Nannan Wang, Dacheng Tao, Xinbo Gao, Xuelong Li, and Jie Li, “A comprehensive survey to face hallucination,” International journal of computer vision, vol. 106, no. 1, pp. 9–30, 2014.
  • [17] Qingxing Cao, Liang Lin, Yukai Shi, Xiaodan Liang, and Guanbin Li, “Attention-aware face hallucination via deep reinforcement learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [18] Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
  • [19] Junyuan Xie, Linli Xu, and Enhong Chen, “Image denoising and inpainting with deep neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 341–349.
  • [20] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision–ECCV 2014, pp. 184–199. Springer, 2014.
  • [21] Jimmy SJ. Ren, Li Xu, Qiong Yan, and Wenxiu Sun, “Shepard convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015.
  • [22] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang, “Deep networks for image super-resolution with sparse prior,” arXiv preprint arXiv:1507.08905, 2015.
  • [23] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” arXiv preprint arXiv:1511.04587, 2015.
  • [24] Joan Bruna, Pablo Sprechmann, and Yann LeCun, “Super-resolution with deep convolutional sufficient statistics,” arXiv preprint arXiv:1511.05666, 2015.
  • [25] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
  • [26] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
  • [27] Si Liu, Xiaodan Liang, Luoqi Liu, and Ke Lu, “Fashion parsing with video context,” IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1347–1358, 2015.
  • [28] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan, “Human parsing with contextualized convolutional neural network,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1386–1394.
  • [29] Yi Yang, Zhigang Ma, Alexander G Hauptmann, and Nicu Sebe, “Feature selection for multimedia analysis by sharing information among multiple tasks,” IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 661–669, 2013.
  • [30] Yukai Shi, Keze Wang, Li Xu, and Liang Lin, “Local- and holistic- structure preserving image super resolution via deep joint component learning,” in IEEE international Conference on Multimedia and Expo. IEEE, 2016, vol. 1.
  • [31] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics Speech & Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
  • [32] R. Caruana, “Multitask learning,” Machine learning, vol. 28, pp. 41–75, 1997.
  • [33] Keze Wang, Shengfu Zhai, Hui Cheng, Xiaodan Liang, and Liang Lin, “Human pose estimation from depth images via inference embedded multi-task learning,” in Proceedings of the ACM International Conference on Multimedia (ACM MM), 2016.
  • [34] Y. Yan, F. Nie, W. Li, C. Gao, Y. Yang, and D. Xu, “Image classification by cross-media active learning with privileged information,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2494–2502, Dec 2016.
  • [35] Keze Wang, Liang Lin, Wangmeng Zuo, Shuhang Gu, and Lei Zhang, “Dictionary pair classifier driven convolutional neural networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [36] Jun Yu, Baopeng Zhang, Zhengzhong Kuang, Dan Lin, and Jianping Fan, “iPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 5, pp. 1005–1016, 2017.
  • [37] S. Yang, Z. Liu, M. Wang, F. Sun, and L. Jiao, “Multitask learning and sparse representation based super-resolution reconstruction of synthetic aperture radar images,” in 2011 International Workshop on Multi-Platform/Multi-Sensor Remote Sensing and Mapping, 2011, pp. 1–5.
  • [38] Y. Liang, J. Wang, S. Zhang, and Y. Gong, “Incorporating image degeneration modeling with multitask learning for image super-resolution,” in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 2110–2114.
  • [39] Karol Gregor and Yann LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
  • [40] Kun Zeng, Jun Yu, Ruxin Wang, Cuihua Li, and Dacheng Tao, “Coupled deep autoencoder for single image super-resolution,” IEEE transactions on cybernetics, vol. 47, no. 1, pp. 27–37, 2017.
  • [41] N. Kumar and A. Sethi, “Fast learning-based single image super-resolution,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1504–1515, Aug 2016.
  • [42] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” arXiv preprint arXiv:1511.04491, 2015.
  • [43] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [44] Chao Dong, Chen Change Loy, and Xiaoou Tang, “Accelerating the super-resolution convolutional neural network,” in European Conference on Computer Vision. Springer, 2016, pp. 391–407.
  • [45] Dongyoon Han, Jiwhan Kim, and Junmo Kim, “Deep pyramidal residual networks,” arXiv preprint arXiv:1610.02915, 2016.
  • [46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [47] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
  • [48] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
  • [49] Shuhang Gu, Wangmeng Zuo, Qi Xie, Deyu Meng, Xiangchu Feng, and Lei Zhang, “Convolutional sparse coding for image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1823–1831.
  • [50] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
  • [51] Roman Zeyde, Michael Elad, and Matan Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, pp. 711–730. Springer, 2012.
  • [52] Radu Timofte, Vincent De Smet, and Luc Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.
  • [53] Samuel Schulter, Christian Leistner, and Horst Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
  • [54] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • [55] Min Wang, Baoyuan Liu, and Hassan Foroosh, “Factorized convolutional neural networks,” 2016.
  • [56] Mehdi Sajjadi, Bernhard Schölkopf, and Michael Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” arXiv preprint arXiv:1612.07919, 2016.