Super-Resolution with Deep Adaptive Image Resampling

12/18/2017 · by Xu Jia et al., Institute of Computing Technology, Chinese Academy of Sciences

Deep learning based methods have recently pushed the state of the art on the problem of Single Image Super-Resolution (SISR). In this work, we revisit the more traditional interpolation-based methods that were popular before, now with the help of deep learning. In particular, we propose to use a Convolutional Neural Network (CNN) to estimate spatially variant interpolation kernels and to apply the estimated kernels adaptively at each position in the image. The whole model is trained in an end-to-end manner. We explore two ways to improve the results for large upscaling factors, and propose a recursive extension of our basic model. This achieves results that are on par with state-of-the-art methods. We visualize the estimated adaptive interpolation kernels to gain more insight into the effectiveness of the proposed method. We also extend the method to the task of joint image filtering and again achieve state-of-the-art performance.


1 Introduction

Reconstructing a high-resolution (HR) image from a low-resolution (LR) input is a classic computer vision problem, referred to as Single Image Super-Resolution (SISR). Research on SISR receives a lot of attention because of its wide range of applications, such as surveillance, medical imaging and remote sensing, where high-frequency details are required. The main difficulty of SISR lies in the fact that it is an ill-posed problem: the high-frequency information is missing, and many possible solutions are consistent with a given low-resolution input. Therefore, additional assumptions have to be made about the formation of HR images. A common key assumption for this task is that the high-frequency information is redundant and can be reconstructed either from the given LR image or from external exemplars.

Long-standing, basic methods for SISR are general interpolation-based methods, such as bilinear interpolation, bicubic interpolation and Lanczos resampling [4], motivated either by the sampling theorem or by spline theory. While they have a strong theoretical basis, they assume a band-limited continuous signal and apply a fixed interpolation kernel to the LR image to achieve the upscaling. As a result, they cannot adapt to the image content, often producing aliasing artefacts or over-smoothed regions. To address this issue, several works [52, 42, 8] have proposed edge-guided image interpolation methods, which use prior information about images as regularization so that they can upscale an image while keeping its edges sharp.

More recently, learning based, i.e. data-driven, methods have become more popular. These include dictionary-based methods [5, 48, 43, 44, 47], which explicitly learn a dictionary mapping between the LR and HR spaces. Once the mapping is learned, the coding coefficients computed for the LR image are reused with the HR dictionary to produce the super-resolved result. Another family of data-driven methods are deep learning based models [9, 10, 37]. Building on the powerful capability of deep neural networks to approximate arbitrary functions, these methods learn an implicit mapping between LR and HR images, typically with a fully-convolutional network trained in an end-to-end manner. Deeper networks [22, 23, 33, 28] have been proposed to further improve the performance and currently define the state of the art.

Figure 1: Network architecture comparison between (a) SRCNN [9], (b) VDSR [22] and (c) our method.

In this work, instead of further increasing the network depth, we revisit the idea of interpolation-based methods, now with the help of deep learning, aiming at an effective and interpretable model. We compute each pixel in the HR image by adaptive interpolation, i.e. a weighted average of the nearby pixels in the corresponding LR image, with weights that are not fixed but depend on the image content at that position. The interpolation kernels are therefore spatially variant and content-aware. For example, in smooth regions there is little variance among neighboring pixels, so a uniform kernel may do a reasonable job; for a region with an edge or rich texture, however, a specially designed combination of neighboring pixels is required for its interpolation.
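Written out (with notation that is ours, as a minimal illustration of the idea), a super-resolved pixel is a content-dependent weighted average of LR pixels:

$$\hat{Y}(p) \;=\; \sum_{q \in \mathcal{N}(p)} w_p(q)\, X(q),$$

where $X$ is the LR image, $\mathcal{N}(p)$ is the LR neighborhood of HR position $p$, and the weights $w_p$ are predicted from the image content rather than fixed, as they would be for bicubic interpolation.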

Instead of using hand-designed kernels for the filtering/interpolation, we propose to use a deep neural network to learn good interpolation kernels in a data-driven fashion. For this, we build on the recently proposed Dynamic Filter Network architecture [21, 12, 46]. Once the adaptive interpolation kernels are estimated, they are used in an adaptive image resampling layer that carries out the actual filtering operation (see Figure 1 (c)): the estimated kernels are applied to a nearest-neighbor-interpolated low-resolution image to obtain the super-resolved result. The adaptive image resampling module is differentiable and allows end-to-end training of the whole model.

The performance of interpolation-based methods drops as the upscaling factor increases: when the upscaling factor is large, there is little correlation among nearby pixels, and non-local methods then outperform local linear filtering methods. We explore two ways to reduce this degradation of interpolation-based methods at large upscaling factors: an atrous spatial pyramid and progressive upsampling. In addition, the deep adaptive image resampling can be applied to a previously obtained super-resolution result several times, i.e. in a recursive fashion, to further improve performance.

The proposed methods are evaluated on four super-resolution benchmark datasets and perform favorably compared to state-of-the-art methods. We visualize the estimated interpolation kernels and shed some light on why the proposed method works well. In addition, we show that the proposed method naturally extends to the joint image filtering task and again obtains very good performance.

2 Related Work

Deep Learning for Super-Resolution

Recently, many works have addressed the task of SISR with Convolutional Neural Networks (CNN). One pioneering work is the Super-Resolution Convolutional Neural Network (SRCNN, see Figure 1 (a)) [9, 10], which implicitly learns a mapping between LR and HR images using a fully-convolutional network. It uses bicubic interpolation as a pre-processing step and feeds the interpolated result to the network. This slows down processing and increases memory requirements, as all convolution operations are done on HR images. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [37] addresses this issue by feeding the small LR image to the network and postponing the upscaling to just before the output layer, by means of a newly proposed sub-pixel layer. Inspired by the success of very deep networks in recognition tasks [26, 39, 40], Kim et al. [22, 23] proposed Very Deep Super-Resolution (VDSR, see Figure 1 (b)), increasing the network depth to 20 layers. Moreover, inspired by ResNet [15], they predict the residual between the bicubic interpolation result and the HR image instead of directly predicting the HR image, which eases the training process. Both steps further improve performance. In [33], skip connections are added between the convolutional and deconvolutional layers of very deep convolutional encoder-decoder networks for faster convergence and more detailed restoration. Lai et al. [27] proposed the Deep Laplacian Pyramid Network, which performs the upscaling progressively, from small to large upscaling factors. Very recently, SRResNet [28], EDSR [30] and DRRN [41] proposed to use not only the residual connection at the last layer but also local residual connections in intermediate layers, as in the ResNet [15] and DenseNet [17] architectures, to further improve performance. A comparison between our network architecture and two popular ones is shown in Figure 1. Ours is most similar to the VDSR architecture, except that we perform an interpolation instead of an addition at the end.

Adaptive Convolution

Very recently, several works have proposed to make the traditional convolutional layer more adaptive to its input, as we do. In the context of image classification, Jeon and Kim [20] introduced an active convolution unit, which allows a convolutional layer to have a flexible shape. In [7], convolutional layers are further modified such that each position has an adaptive receptive field, giving good performance on both object detection and semantic segmentation. Recently, [12, 21, 46] simultaneously proposed the Dynamic Filter Network, which models spatial transformations with a single convolution step for the task of video prediction, with the filters conditioned not only on the input but also on the position in the image. Niklaus et al. [34] extended this work to video frame interpolation by replacing the single 2D convolution with two separable 1D convolutions. Our work is one of the first to relate the idea of adaptive convolution to the SISR task. The most similar work in this context is by Riegler et al. [35]; however, they modified SRCNN by conditioning the parameters of its first convolutional layer on the input image in order to handle different blur kernels for different images. This requires a different setup, so it cannot be compared against directly.

3 Deep Adaptive Image Resampling

In this section, we describe our proposed deep adaptive image resampling model (section 3.1) and several refinements thereof (section 3.2). We also extend it to the joint image filtering task (section 3.3).

3.1 The Basic Model

Our model is composed of two parts: one module to estimate the adaptive image interpolation kernels, and another module applying the interpolation kernels to the LR input to produce the super-resolved result. The full architecture is shown in Figure 1 (c).

Adaptive interpolation kernels. Instead of using a fixed, blind interpolation kernel for every image and every position, we propose a data-driven method to compute a content-aware interpolation kernel separately for each position in the image. We use a fully convolutional network (FCN) [31] to compute the weights of the interpolation kernels. Our FCN consists of several standard convolutional layers and an upsampling layer. The convolutional layers learn to model the local context of each position in the LR image. The output is a set of feature maps $F \in \mathbb{R}^{h \times w \times s^2 r^2}$, where $h$ and $w$ are the height and width of the LR input, $s$ is the spatial size of the interpolation kernels and $r$ is the upscaling factor. $F$ has the same spatial resolution as the LR input. To adapt its spatial resolution to the HR image, we add an upsampling layer, which can be implemented as either a sub-pixel layer [37] or a fractionally-strided convolutional layer [31].

Figure 2: Demonstration of the adaptive image resampling layer for upscaling by a factor 2 and filter size 3x3.

The upsampled interpolation kernels are denoted as $K \in \mathbb{R}^{H \times W \times s^2}$, where $H = rh$ and $W = rw$. Each spatial position in $K$ corresponds to a vector of dimension $s^2$, which can be reshaped to a filter of size $s \times s$ that acts as the interpolation kernel at that position. Each kernel combines the nearby pixels in the LR input to reconstruct the corresponding pixel in the HR image. The kernel estimation module is thus expected to learn which elements of a neighborhood contribute to the reconstruction of a given pixel, and by how much.
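As an illustration, the kernel estimation module can be sketched in PyTorch roughly as follows. Layer counts, names and shapes are our assumptions; the paper specifies 64 filters per layer, a default kernel size of 5 and a sub-pixel upsampling layer [37]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelEstimator(nn.Module):
    """Sketch of the interpolation-kernel estimation FCN (names and shapes
    are our assumptions). Outputs s*s kernel weights per HR position."""
    def __init__(self, s=5, r=2, n_layers=10, in_ch=1):
        super().__init__()
        self.r = r
        layers = [nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(n_layers - 2):
            layers += [nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True)]
        # Final layer: s*s kernel weights for each of the r*r sub-pixel
        # positions; pixel_shuffle rearranges them to HR resolution.
        layers += [nn.Conv2d(64, s * s * r * r, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x_lr):
        f = self.body(x_lr)                  # F: (B, s^2 * r^2, h, w)
        return F.pixel_shuffle(f, self.r)    # K: (B, s^2, r*h, r*w)
```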

Adaptive image resampling operation. Once the interpolation kernels are estimated, they are adaptively applied to the corresponding positions of the LR input image to reconstruct the HR image. Nearby pixels in the HR image may be resampled from the same set of pixels in the LR input yet obtain different intensity values, as each pixel in the HR image space has its own interpolation kernel.

We first resize the LR input image to the size of the HR image using the nearest-neighbor method, resulting in $\tilde{X}$. Now $\tilde{X}$ has the same spatial size as $K$ and the HR target, which is convenient for the implementation of the adaptive resampling (filtering) operation and its further extensions. Yet directly applying the interpolation kernels to consecutive elements of $\tilde{X}$ does not make sense, since neighboring elements of $\tilde{X}$ contain repeated pixels (see Figure 2). To apply the estimated kernels to the correct set of pixels within a local region of $\tilde{X}$, we need to upscale the interpolation kernels as well, i.e. apply them to elements at a certain interval in $\tilde{X}$, as shown in Figure 2 and Equation 1:

$$\hat{Y}(i,j) \;=\; \sum_{u=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor} \; \sum_{v=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor} K_{i,j}(u,v)\, \tilde{X}(i+ur,\; j+vr). \qquad (1)$$

This is similar to the concept of atrous convolution, widely used for semantic segmentation [6, 50], but unlike atrous convolution it is not translation invariant. The interval $r$ corresponds to the sampling-rate parameter of atrous convolution. Under this scheme, in contrast to traditional interpolation methods, each position in the HR image has its own interpolation kernel, which can adapt to the appearance and semantics at that position.
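For concreteness, a minimal PyTorch sketch of the resampling operation of Equation 1, under our reading of the paper (it reuses the KernelEstimator sketch above and assumes an odd kernel size $s$):

```python
def adaptive_resample(x_lr, kernels, r):
    """Apply spatially variant kernels to the NN-upsampled input (Eq. 1).
    x_lr: (B, 1, h, w); kernels: (B, s*s, r*h, r*w). The kernels are applied
    with interval (dilation) r so that each HR pixel mixes distinct LR
    pixels rather than NN-repeated copies."""
    B, _, h, w = x_lr.shape
    s = int(kernels.shape[1] ** 0.5)
    x_nn = F.interpolate(x_lr, scale_factor=r, mode='nearest')   # X-tilde
    # Extract the s x s neighborhood at interval r around every HR position.
    patches = F.unfold(x_nn, kernel_size=s, dilation=r,
                       padding=r * (s // 2))                     # (B, s*s, rh*rw)
    patches = patches.view(B, s * s, r * h, r * w)
    # Per-position weighted sum = adaptive interpolation.
    return (patches * kernels).sum(dim=1, keepdim=True)          # Y-hat
```

For a ×2 model, `adaptive_resample(x, KernelEstimator(s=5, r=2)(x), r=2)` would produce the super-resolved output; since both steps are differentiable, the pipeline trains end-to-end.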

3.2 Further Improvements

Interpolation for larger upscaling factors. Even if the interpolation kernels are estimated with a deep neural network with relatively large receptive fields, the following filtering operation is still a locally linear model. The elements used for interpolation are limited by the size of the filters. When the upscaling factor gets larger, the correlation between a pixel and its neighbors in the low resolution image becomes smaller. Therefore, the relative performance of interpolation based methods drops as the upscaling factor increases. To reduce this degeneration, we explore two alternatives: i) increasing the size of the interpolation kernels, and ii) doing the upsampling in a progressive way.

Figure 3: Demonstration of the atrous spatial pyramid interpolation kernel with filter size 3x3 and interval 1, 2 and 3.

For the first approach, sampling from a larger neighborhood would naively require many more parameters and much more memory. To alleviate this, we borrow the idea of Atrous Spatial Pyramid Pooling (ASPP) from Deeplab-v2 [6], originally proposed to increase receptive fields for semantic segmentation. Similarly, we want the interpolation kernel to cover a larger neighborhood, especially when the upscaling factor is large. This can be done by applying the estimated filters to the NN-interpolated LR image at several intervals $tr$, $t = 1, \dots, T$, i.e.

$$\hat{Y}(i,j) \;=\; \sum_{t=1}^{T} \sum_{u,v} K^{(t)}_{i,j}(u,v)\, \tilde{X}(i+utr,\; j+vtr). \qquad (2)$$

The sum of the filters over all intervals composes one large interpolation kernel, as shown in Figure 3: the kernel is sparse but covers a large neighborhood. This way, the range of the local context is enlarged without drastically increasing the number of parameters or the memory footprint.
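Under the same assumptions, Equation 2 can be sketched by summing several resampling passes with growing intervals, one estimated kernel set per pyramid level (that the levels come from, e.g., parallel output heads is our assumption):

```python
def asp_resample(x_lr, kernel_list, r):
    """Atrous-spatial-pyramid resampling (Eq. 2, sketch): level t uses
    interval t*r; the per-level results are summed."""
    B, _, h, w = x_lr.shape
    x_nn = F.interpolate(x_lr, scale_factor=r, mode='nearest')
    out = x_nn.new_zeros(B, 1, r * h, r * w)
    for t, k in enumerate(kernel_list, start=1):   # intervals r, 2r, 3r, ...
        s = int(k.shape[1] ** 0.5)
        d = t * r
        patches = F.unfold(x_nn, kernel_size=s, dilation=d,
                           padding=d * (s // 2))
        out = out + (patches.view(B, s * s, r * h, r * w) * k).sum(1, keepdim=True)
    return out
```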

Alternatively, we can decompose a large upscaling factor into several upsampling operations with smaller factors. As mentioned in [27], progressive upsampling makes the super-resolution task easier by dividing it into several sub-problems. Take ×4 super-resolution as an example. For progressive upsampling with our model, we first feed the LR image to the model and produce a ×2 super-resolved image $\hat{Y}_1$. At this stage, the model only needs pixels within a local neighborhood to interpolate the pixels of a downsampled version of the high-resolution image. At the next stage, we feed $\hat{Y}_1$ to another network to estimate a second set of interpolation kernels. To avoid drifting away from the content of the original low-resolution image, we concatenate the intermediate result $\hat{Y}_1$ with the nearest-neighbor resized input $\tilde{X}$. The final super-resolved result is obtained by applying the second-stage interpolation kernels to $\hat{Y}_1$.
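A sketch of this two-stage ×4 scheme with the same assumed building blocks (net1 and net2 are hypothetical KernelEstimator instances; the second takes two input channels):

```python
def progressive_x4(x_lr, net1, net2):
    """Progressive x4 super-resolution as two x2 stages (sketch).
    net1 = KernelEstimator(r=2); net2 = KernelEstimator(r=2, in_ch=2)."""
    y1 = adaptive_resample(x_lr, net1(x_lr), r=2)               # x2 intermediate
    x_nn = F.interpolate(x_lr, scale_factor=2, mode='nearest')  # anchor on input
    k2 = net2(torch.cat([y1, x_nn], dim=1))                     # 2nd-stage kernels
    return adaptive_resample(y1, k2, r=2)                       # x4 in total
```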

Recursive image resampling. Finally, note that the proposed adaptive image resampling can be applied several times, recursively, to further refine the super-resolved result. In this case, the initial super-resolved result has already filled in some of the details missing from the LR image. This intermediate result is concatenated with the nearest-neighbor resized image and fed to another interpolation kernel estimation module used during the recursion. The estimated interpolation kernels are then applied to the intermediate super-resolved result to refine it further. This process is repeated multiple times, with the interpolation kernel estimation modules sharing parameters. This is reminiscent of, but different from, the recursive layer of [23] and the recursive block of [41]. In [23], each recursive output estimates a level of residual, and the final result is an ensemble of all recursive outputs and the initial bicubic interpolation result. In [41], multi-path residual blocks compute a residual, and the final result is the sum of that residual and the bicubic interpolation result. In our work, the adaptive image resampling is simply repeated, each time operating on the previous super-resolved result.
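The recursion can be sketched in the same vein; that the refinement kernels are applied densely (interval 1, since the intermediate result is already at HR resolution) is our reading:

```python
def recursive_refine(x_lr, net0, net_rec, r, steps=2):
    """Recursive resampling (sketch). net0 = KernelEstimator(r=r);
    net_rec = KernelEstimator(r=1, in_ch=2), shared across iterations."""
    x_nn = F.interpolate(x_lr, scale_factor=r, mode='nearest')
    y = adaptive_resample(x_lr, net0(x_lr), r)       # initial result
    B, _, H, W = y.shape
    for _ in range(steps):                           # Recur{steps}_10
        k = net_rec(torch.cat([y, x_nn], dim=1))     # refinement kernels
        s = int(k.shape[1] ** 0.5)
        # Dense application (interval 1): y has no repeated pixels.
        patches = F.unfold(y, kernel_size=s, padding=s // 2)
        y = (patches.view(B, s * s, H, W) * k).sum(1, keepdim=True)
    return y
```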

3.3 Joint Filtering

The proposed adaptive image resampling method is not restricted to the SISR setting; it extends easily to joint image filtering tasks such as depth image super-resolution. In this case, the input to the interpolation kernel estimation module includes an additional guidance image, which provides auxiliary information for the filtering process. With the help of the guidance image, the model can learn better filters by taking the guidance content into account. Traditional methods in this context, such as joint bilateral upsampling [25] and guided image filtering [14], also perform spatially variant filtering: they compute the output at a pixel as a weighted average of nearby pixels, with weights estimated from the guidance image. Instead of hand-designing the function that computes the filter kernel, we compute the kernel with a deep neural network in a data-driven way. Deep neural networks can model more complex mappings than the hand-designed functions of [25, 14]; the proposed model can therefore take full advantage of the guidance image and the filtering input, integrating them to produce adaptive filters. Different from the SISR setup, the kernel estimation module here takes both the nearest-neighbor resized input $\tilde{X}$ and the guidance image $G$, and the estimated filters are applied to $\tilde{X}$ to reconstruct the high-resolution image. This results in sharper edges and fewer unwanted gradient-reversal artefacts.
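A sketch of the joint-filtering variant under the same assumptions (a four-channel estimator input of NN-resized depth plus RGB guidance, with the kernels applied to the depth only, following Equation 1):

```python
def joint_depth_upsample(depth_lr, guide_hr, net, r):
    """Guided depth super-resolution (sketch). net = KernelEstimator(r=1,
    in_ch=4), since its input is already at HR resolution."""
    B, _, h, w = depth_lr.shape
    d_nn = F.interpolate(depth_lr, scale_factor=r, mode='nearest')  # X-tilde
    k = net(torch.cat([d_nn, guide_hr], dim=1))    # guidance-aware kernels
    s = int(k.shape[1] ** 0.5)
    patches = F.unfold(d_nn, kernel_size=s, dilation=r, padding=r * (s // 2))
    return (patches.view(B, s * s, r * h, r * w) * k).sum(1, keepdim=True)
```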

4 Experiments

Figure 4: Visualization of the feature maps corresponding to the estimated interpolation kernels for ×3 super-resolution: (a) HR image, (b) NN-resized image, (c) our result, (d) estimated filters. Note that we only visualize the inner part of the kernel, since most of the outer part is close to zero. It is recommended to view (d) zoomed in, in the electronic version.

In this section, we evaluate our models on several widely used single image super-resolution benchmark datasets and visualize the interpolation kernels learned in the proposed model. We also apply it to the joint image filtering task.

Datasets. We use 291 images as our training data: 91 images from Yang et al. [49] and 200 images from the training set of the Berkeley Segmentation Dataset [1]. The data is augmented by rotation (90, 180 and 270 degrees), scaling (scale factors of 0.6, 0.7, 0.8 and 0.9) and horizontal flipping. HR patches are cropped from the augmented data for the ×2, ×3 and ×4 super-resolution tasks, and the corresponding LR patches are obtained by bicubic downsampling. The proposed method is evaluated on four widely used benchmarks: Set5 [3], Set14 [51], BSD100 [1] and Urban100 [18], with PSNR and SSIM [45] as measures.

Implementation details. For the FCN used in our interpolation kernel estimation module, we find a stack of consecutive standard convolutional layers with ReLU activations, without max-pooling or striding, to be effective. Similar to [22], the number of filters in all convolutional layers is 64, except for the final convolutional layer that produces the adaptive interpolation kernels, whose width depends on the kernel size. Unless otherwise mentioned, the interpolation kernel size is set to 5. We use a sub-pixel layer [37] as the upsampling layer. The mean absolute error between prediction and ground truth, i.e. the L1 loss $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\|\hat{Y}_i - Y_i\|_1$, is used as our loss function. All models are initialized using the method proposed in [13] and trained for 200,000 iterations with mini-batches of size 16. The Adam optimizer [24] is used to optimize the parameters. The learning rate is initially set to 1e-4 and halved every 50,000 iterations.
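Putting the pieces together, the training setup described above can be sketched as follows (the data loader and the ×2 wiring are illustrative; the Adam moment/epsilon values shown are the optimizer's common defaults, an assumption here):

```python
net = KernelEstimator(s=5, r=2, n_layers=10)
opt = torch.optim.Adam(net.parameters(), lr=1e-4,
                       betas=(0.9, 0.999), eps=1e-8)   # assumed defaults
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50_000, gamma=0.5)

# loader: an iterable of (LR, HR) patch pairs, mini-batches of 16 (not shown).
for it, (x_lr, y_hr) in enumerate(loader):
    y_hat = adaptive_resample(x_lr, net(x_lr), r=2)
    loss = F.l1_loss(y_hat, y_hr)      # mean absolute error (L1) loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                       # halve the lr every 50k iterations
    if it + 1 == 200_000:
        break
```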

Dataset | Scale | DAIR_5 | DAIR_10 | DAIR_20 | FCN_20 | ASP2_20 | ASP3_20 | Prog_10 | Recur1_10 | Recur2_10
Set5 | ×2 | 37.42/0.9582 | 37.61/0.9591 | 37.61/0.9592 | 37.43/0.9583 | – | – | – | 37.69/0.9594 | 37.75/0.9597
Set5 | ×3 | 33.47/0.9193 | 33.71/0.9217 | 33.83/0.9228 | 33.57/0.9200 | – | – | – | 33.82/0.9230 | 33.87/0.9232
Set5 | ×4 | 31.19/0.8806 | 31.31/0.8830 | 31.28/0.8828 | 31.14/0.8784 | 31.27/0.8833 | 31.31/0.8844 | 31.47/0.8858 | 31.29/0.8846 | 31.43/0.8861
Set14 | ×2 | 32.92/0.9117 | 33.07/0.9130 | 33.12/0.9131 | 32.93/0.9116 | – | – | – | 33.16/0.9134 | 33.21/0.9139
Set14 | ×3 | 29.66/0.8298 | 29.74/0.8317 | 29.78/0.8317 | 29.71/0.8302 | – | – | – | 29.81/0.8326 | 29.85/0.8331
Set14 | ×4 | 27.89/0.7651 | 27.97/0.7677 | 27.80/0.7675 | 27.83/0.7630 | 28.00/0.7673 | 27.96/0.7675 | 28.04/0.7690 | 27.98/0.7681 | 28.06/0.7700
Table 1: Ablation study on Set5 and Set14 (PSNR/SSIM). DAIR_n is the basic model with n layers, FCN_20 the baseline without adaptive resampling, ASP*_20 and Prog_10 the large-upscaling-factor variants (×4 only), and Recur*_10 the recursive model.

4.1 Filter Visualization

To see how the model exploits the adaptivity of our image resampling, we take ×3 super-resolution as an example and visualize the estimated interpolation kernels (see Figure 4). Instead of directly visualizing the filter at each position, we visualize the feature maps that correspond to the filters. The feature maps have 25 channels, where each channel corresponds to one element of a 5×5 filter. As shown in Figure 4, the middle feature map has higher values than the others, indicating that the nearest neighbor contributes most to the interpolation, consistent with traditional interpolation methods. Edge regions clearly stand out in these feature maps, indicating that they are treated differently and that the interpolation kernels do indeed adapt to the image content.
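The visualization itself is straightforward given the sketches above: each of the 25 channels of the kernel tensor is rendered as one heat map (a hypothetical snippet; matplotlib and a grayscale LR input x_lr are assumed):

```python
import matplotlib.pyplot as plt

# x_lr: (1, 1, h, w) grayscale LR input (assumed to be loaded elsewhere).
k = KernelEstimator(s=5, r=3, in_ch=1)(x_lr)     # (1, 25, 3h, 3w)
fig, axes = plt.subplots(5, 5, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(k[0, i].detach().cpu().numpy())    # channel i = one kernel element
    ax.axis('off')
plt.tight_layout()
plt.show()
```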

(a) GT (b) NN (c) Result
(d) GT (e) NN (f) Result (g) Filters
Figure 5: Visualization of the learned filters for different regions for ×3 super-resolution. The rows from top to bottom correspond to the red, blue and green regions, respectively. Each cell of the grids in (g) corresponds to one filter. It is recommended to zoom in for details.

Furthermore, the middle feature map and those next to it show distinct patterns: vertical stripes for its left and right neighbors, and horizontal stripes for its top and bottom neighbors. This reflects the variation in the relative location of an HR pixel with respect to its nearest pixel in the LR image. The elements next to the nearest neighbor complement each other to obtain the best combination, especially on edges; this increases the contrast between the two sides of an edge, so the edge looks sharper.

To see how the filters adapt to different regions, Figure 5 shows example filters for a smooth region, a textured region and a region with a strong edge. Each cell of the grid in column (g) corresponds to one filter. For a smooth region, the distribution of filters follows a regular pattern: a few patterns appear at different positions, and each pattern repeats with an interval equal to the upscaling factor 3 (the filter values are not exactly equal, but very close); an example is the cell with a dark red dot in the center and its 8 neighbors. This is easy to understand: pixels at different positions of a smooth region in a low-resolution image look similar, so they can be handled in a similar way for high-resolution reconstruction. This is not the case for regions with rich texture or strong edges. Filters in such regions do not follow a regular pattern but become more complicated; spatially invariant interpolation kernels do not work well there. The kernels need to adapt to the variation within the region, such as edges and texture.

4.2 Ablation Study

In this section, we study the effects of different components of our method described in Section 3.

Number of layers. First, we study the effect of the number of layers used to estimate the adaptive interpolation kernels. We experiment with 5, 10 and 20 layers for the kernel estimation module, referred to as DAIR_5, DAIR_10 and DAIR_20. Table 1 shows that the deeper the network for that module, the better the super-resolved result. However, the difference between the models with 10 and 20 convolutional layers is small. To keep the model compact yet effective, we use 10 convolutional layers per stage in our progressive and recursive models (see below).

Adaptive image resampling. We compare the proposed model with a model without the adaptive image resampling operation, FCN_20. Its architecture is similar to SRCNN [9] in that it is simply a fully convolutional network, but it differs in two respects: it starts from a nearest-neighbor resized image instead of a bicubic one, and it has the same number of convolutional layers as DAIR_20, which is more than SRCNN.

Dataset | Scale | Bicubic | Lanczos3 | SRCNN [9] | FSRCNN [10] | VDSR [22] | DRCN [23] | LapSRN [27] | DRRN [41] | Ours
Set5 | ×2 | 33.66/0.9299 | 34.32/0.9365 | 36.66/0.9542 | 37.05/0.956 | 37.53/0.9587 | 37.63/0.9588 | 37.52/0.959 | 37.74/0.9591 | 37.75/0.9597
Set5 | ×3 | 30.89/0.8682 | 30.82/0.8754 | 32.75/0.9090 | 33.18/0.914 | 33.66/0.9213 | 33.82/0.9226 | – | 34.03/0.9244 | 33.87/0.9232
Set5 | ×4 | 28.42/0.8104 | 28.80/0.8178 | 30.48/0.8628 | 30.72/0.866 | 31.35/0.8838 | 31.53/0.8854 | 31.54/0.885 | 31.68/0.8888 | 31.47/0.8865
Set14 | ×2 | 30.24/0.8688 | 30.69/0.8791 | 32.45/0.9067 | 32.66/0.909 | 33.03/0.9124 | 33.04/0.9118 | 33.08/0.913 | 33.23/0.9136 | 33.21/0.9139
Set14 | ×3 | 27.55/0.7742 | 27.83/0.7830 | 29.30/0.8215 | 29.37/0.824 | 29.77/0.8314 | 29.76/0.8311 | – | 29.96/0.8349 | 29.85/0.8331
Set14 | ×4 | 26.00/0.7027 | 26.23/0.7098 | 27.50/0.7513 | 27.61/0.755 | 28.01/0.7674 | 28.02/0.7670 | 28.19/0.772 | 28.21/0.7720 | 28.07/0.7701
BSD100 | ×2 | 29.56/0.8431 | 29.92/0.8551 | 31.36/0.8879 | 31.53/0.892 | 31.90/0.8960 | 31.85/0.8942 | 31.80/0.895 | 32.05/0.8973 | 32.00/0.8974
BSD100 | ×3 | 27.21/0.7385 | 27.41/0.7481 | 28.41/0.7863 | 28.53/0.791 | 28.82/0.7976 | 28.80/0.7963 | – | 28.95/0.8004 | 28.87/0.7991
BSD100 | ×4 | 25.96/0.6675 | 26.13/0.6754 | 26.90/0.7101 | 26.98/0.715 | 27.29/0.7251 | 27.23/0.7233 | 27.32/0.728 | 27.38/0.7284 | 27.25/0.7263
Urban100 | ×2 | 26.88/0.8403 | 27.25/0.8503 | 29.50/0.8946 | 29.88/0.902 | 30.76/0.9140 | 30.75/0.9133 | 30.41/0.910 | 31.23/0.9188 | 31.08/0.9176
Urban100 | ×3 | 24.46/0.7349 | 24.68/0.7430 | 26.24/0.7989 | 26.43/0.808 | 27.14/0.8279 | 27.15/0.8276 | – | 27.53/0.8378 | 27.24/0.8317
Urban100 | ×4 | 23.14/0.6577 | 23.32/0.6641 | 24.52/0.7221 | 24.62/0.728 | 25.18/0.7524 | 25.14/0.7510 | 25.21/0.756 | 25.44/0.7638 | 25.13/0.7549
Table 2: Quantitative results on benchmark datasets (PSNR/SSIM). "–" indicates no reported result.
Figure 6: Qualitative comparison on ×3 single image super-resolution. Crops are compared across Ground Truth, Bicubic, SRCNN, VDSR, DRCN, LapSRN and Ours, with PSNR/SSIM reported for each crop.

The numbers in Table 1 show that the method with the adaptive resampling module is clearly better than the one without. Instead of directly generating new pixel values in the HR image space, we interpolate nearby pixels from the LR image space. This shares a similar philosophy with the global residual connection at the last layer of [22, 23], which eases training and has been shown to improve performance.

Large upscaling factor. As mentioned in Section 3, we explored two ways to address the weakness of interpolation-based methods at large upscaling factors. The first is to use the sum of an atrous spatial pyramid to approximate a large interpolation kernel, so that it can combine information from a large neighborhood. We experimented with 2-level and 3-level atrous spatial pyramids, denoted ASP2_20 and ASP3_20 in Table 1. We find that the atrous spatial pyramid somewhat improves performance, as its approximated kernels cover a larger context. The alternative is progressive upsampling, i.e. upsampling a low-resolution image to an intermediate resolution and then further to the target resolution. The results show that progressive upsampling significantly improves ×4 super-resolution. We conclude that progressive upsampling is the more effective option for large upscaling factors and use it in our final ×4 model.

Recursive refinement. We first apply a basic deep adaptive image resampling module with 10 convolutional layers to the low-resolution image and obtain an initial result. Another adaptive image resampling module with 10 convolutional layers is then recursively applied to the previous result. We experiment with one (Recur1_10) and two (Recur2_10) recursions. Recur1_10 and Recur2_10 have the same number of parameters; the difference is the number of times the adaptive image resampling is applied. Table 1 shows that more iterations further refine the super-resolution result. We use Recur2_10 in our final model.

Figure 7: Qualitative comparison with JBU and GF on depth map super-resolution: (a) guidance, (b) GT, (c) GF, (d) JBU, (e) ours.

4.3 Comparison with State-of-the-Art Methods

We use Recur2_10 as our final model for ×2 and ×3 super-resolution, and combine Prog_10 and Recur2_10 into the final model for ×4 super-resolution. We compare the proposed deep adaptive image resampling method with several state-of-the-art methods: SRCNN [9], FSRCNN [10], VDSR [22], DRCN [23], LapSRN [27] and DRRN [41]. As shown in Table 2, our method achieves performance competitive with the state of the art on the four benchmarks, especially for small upscaling factors. We also show visual comparisons: in Figure 6, the proposed method performs well on textured regions and regions with strong edges.

4.4 Joint Image Filtering

To demonstrate the effectiveness of the proposed model for joint image filtering, we carry out an experiment on depth map super-resolution, using the basic model with 10 convolutional layers, i.e. DAIR_10. A downsampled depth map is first resized using nearest-neighbor interpolation and then concatenated with a guidance RGB image as the input to the interpolation kernel estimation module. The high-resolution reconstruction is computed by applying the estimated adaptive interpolation kernels to the nearest-neighbor interpolation of the downsampled depth map.

Similar to [29], we collect training data by cropping patch pairs from 1,000 RGB images and depth maps of the NYU-v2 dataset [38], for the ×4, ×8 and ×16 upsampling settings. The depth-map patches are downsampled with the nearest-neighbor method to form the low-resolution input, and the RGB patches serve as guidance. Once trained, the model is evaluated on two datasets: the remaining 449 images of NYU-v2, and the Middlebury dataset [16, 36] with missing values filled in by Lu et al. [32].

We compare the proposed method with several joint image filtering methods: JBU [25], GF [14], TGV [11], MSG-Net [19], FBS [2] and DJF [29]. Quantitative results are shown in Table 3, with root mean squared error (RMSE) as the evaluation metric; the results of the other methods are taken from [29].

Method | Middlebury ×4 | Middlebury ×8 | Middlebury ×16 | NYU-v2 ×4 | NYU-v2 ×8 | NYU-v2 ×16
Bicubic | 4.44 | 7.58 | 11.87 | 8.16 | 14.22 | 22.32
GF [14] | 4.01 | 7.22 | 11.70 | 7.32 | 13.62 | 22.03
JBU [25] | 2.44 | 3.81 | 6.13 | 4.07 | 8.29 | 13.35
TGV [11] | 3.39 | 5.41 | 12.03 | 6.98 | 11.23 | 28.13
MSG-Net [19] | 1.79 | 3.39 | 5.87 | 3.78 | 6.37 | 11.16
FBS [2] | 2.58 | 4.19 | 7.30 | 4.29 | 8.94 | 14.59
DJF [29] | 2.14 | 3.77 | 6.12 | 3.54 | 6.20 | 10.21
Ours | 1.79 | 3.27 | 6.08 | 2.67 | 5.86 | 10.03
Table 3: Quantitative comparison on depth map super-resolution, on Middlebury [16, 36] and NYU-v2 [38] (RMSE; lower is better).

We can see that the proposed method performs favorably against state-of-the-art methods. Compared with [19], which uses multi-scale guidance, our method is simpler and more effective. Moreover, our method, trained on the NYU-v2 dataset, also generalizes well when evaluated on the Middlebury dataset. We attribute this to the adaptive nature of the proposed method. Similar to the SISR experiments, we observe that performance somewhat decreases for large upscaling factors, especially for ×16 upsampling. For large upscaling factors, other deep learning based methods may learn to directly predict the depth map from the guidance image, which is impossible with our approach. We also show a qualitative comparison with filter-based methods in Figure 7. As shown, JBU and GF, both based on hand-designed spatially variant filters, can adapt to the image content to some extent. However, they tend to pay too much attention to strong edges, even when the two sides of an edge have the same depth, and they are prone to produce artefacts on boundaries. The deep neural network in our model captures more complex relations among the LR input, the guidance and the HR output, making the proposed method more powerful than methods based on hand-designed spatially variant filters.

5 Conclusion

In this paper we propose a Deep Adaptive Image Resampling method for the image super-resolution task. Spatially variant interpolation kernels are estimated with a convolutional neural network and then applied to a low-resolution image to reconstruct the high-resolution image. We demonstrate the effectiveness of the proposed method by evaluating it on both single image super-resolution and joint image filtering tasks. Visualization of the estimated interpolation kernels gives more insight into the effectiveness of the proposed method.

References

  • [1] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898–916, 2011.
  • [2] J. T. Barron and B. Poole. The fast bilateral solver. In ECCV, 2016.
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
  • [4] T. Blu, P. Thévenaz, and M. Unser. Linear interpolation revitalized. TIP, 13(5):710–719, 2004.
  • [5] H. Chang, D. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.
  • [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
  • [8] D. Zhou, X. Shen, and W. Dong. Image zooming using directional cubic convolution interpolation. IET Image Processing, 6(6):627–634, 2012.
  • [9] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 38(2):295–307, 2016.
  • [10] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
  • [11] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In ICCV, 2013.
  • [12] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
  • [13] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • [14] K. He, J. Sun, and X. Tang. Guided image filtering. TPAMI, 35(6):1397–1409, 2013.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In CVPR, 2007.
  • [17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [18] J. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
  • [19] T. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In ECCV, 2016.
  • [20] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
  • [21] X. Jia, B. D. Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
  • [22] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [23] J. Kim, J. K. Lee, and K. M. Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [25] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [27] W. Lai, J. Huang, N. Ahuja, and M. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
  • [28] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [29] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In ECCV, 2016.
  • [30] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, 2017.
  • [31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [32] S. Lu, X. Ren, and F. Liu. Depth enhancement via low-rank matrix completion. In CVPR, 2014.
  • [33] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, 2016.
  • [34] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017.
  • [35] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In ICCV, 2015.
  • [36] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In CVPR, 2007.
  • [37] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
  • [38] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [41] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.
  • [42] H. Takeda, S. Farsiu, and P. Milanfar. Kernel regression for image processing and reconstruction. TIP, 16(2):349–366, 2007.
  • [43] R. Timofte, V. D. Smet, and L. J. V. Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, 2013.
  • [44] R. Timofte, V. D. Smet, and L. J. V. Gool. A+: adjusted anchored neighborhood regression for fast super-resolution. In ACCV, 2014.
  • [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
  • [46] T. Xue, J. Wu, K. L. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
  • [47] C. Yang, C. Ma, and M. Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
  • [48] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In CVPR, 2008.
  • [49] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. TIP, 19(11):2861–2873, 2010.
  • [50] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [51] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, 2010.
  • [52] L. Zhang and X. Wu. An edge-guided image interpolation algorithm via directional filtering and data fusion. TIP, 15(8):2226–2238, 2006.