1 Introduction
Reconstructing a high-resolution (HR) image from a low-resolution (LR) input is a classic computer vision problem, referred to as Single Image Super-Resolution (SISR). Research on SISR receives a lot of attention because of its wide range of applications, such as surveillance, medical imaging and remote sensing, where high-frequency details are required. The main difficulty with SISR lies in the fact that it is an ill-posed problem: the high-frequency information is missing and there are many possible solutions that are all consistent with the given low-resolution input. Therefore, additional assumptions have to be made regarding the formation of HR images. A common key assumption for this task is that the high-frequency information is redundant and can be reconstructed either from the given LR image or from external exemplars.
Longstanding, basic methods for SISR are general interpolation-based methods, such as bilinear interpolation, bicubic interpolation and Lanczos resampling [4]. These methods are motivated either by the sampling theorem or by spline theory. While they have a strong theoretical basis, they assume a band-limited continuous signal and apply a fixed interpolation kernel to the LR image to achieve the upscaling. As a result, they cannot adapt to the image content, often resulting in aliasing artefacts or over-smoothed regions. To address this issue, several works [52, 42, 8] have proposed edge-guided image interpolation methods. They use prior information about the images as regularization such that they can upscale the image while keeping the edges sharp.
More recently, learning-based, i.e. data-driven, methods have become more popular. These include dictionary-based methods [5, 48, 43, 44, 47], which explicitly learn a dictionary mapping between LR space and HR space. Once the mapping is learned, the coding coefficients computed for the LR image are reused for the HR image to produce the super-resolved result. Another family of data-driven methods are deep learning based models [9, 10, 37]. Building on the powerful capability of deep neural networks to approximate arbitrary functions, these methods learn an implicit mapping between LR and HR images, typically with a fully convolutional network trained end-to-end. Deeper networks [22, 23, 33, 28] have been proposed to further improve the performance and currently define the state of the art.
In this work, instead of further increasing the network depth, we revisit the idea of interpolation-based methods, but now with the help of deep learning, aiming at an effective and insightful model. We compute a pixel in the HR image using adaptive interpolation, i.e. a weighted average of the nearby pixels in the corresponding LR image, with weights that are not fixed but depend on the image content at that position. Therefore, the interpolation kernels are spatially variant and content-aware. For example, in smooth regions there is not much variance among pixels in a neighborhood, so a uniform kernel might do a reasonable job; however, for a region with an edge or rich texture, a specially-designed combination of neighboring pixels is required for its interpolation.
Instead of using hand-designed kernels for the filtering/interpolation, we propose to use a deep neural network to learn good interpolation kernels in a data-driven fashion. For this, we build on the recently proposed Dynamic Filter Network architecture [21, 12, 46]. Once the adaptive interpolation kernels are estimated, we use them in an adaptive image resampling layer which carries out the actual filtering operation (see Figure 1 (c)): the estimated interpolation kernels are applied to a (nearest-neighbor interpolated) low resolution image to obtain the super-resolved result. The adaptive image resampling module is differentiable and allows end-to-end training of the whole model.
The performance of interpolation-based methods drops as the upscaling factor increases: when the upscaling factor is large, there is little correlation among nearby pixels, and non-local methods then perform better than local linear filtering methods. We explore two ways to reduce this degradation for interpolation-based methods at large upscaling factors: an atrous spatial pyramid and progressive upsampling. In addition, the deep adaptive image resampling can be applied to the previously obtained super-resolution result several times, i.e. in a recursive fashion, to further improve the performance.
The proposed methods are evaluated on four super-resolution benchmark datasets and perform favorably compared to state-of-the-art methods. We visualize the estimated interpolation kernels and shed some light on why the proposed method works well. In addition, we show that the proposed method naturally extends to the joint image filtering task, where it again obtains very good performance.
2 Related Work
Deep Learning for Super-Resolution
Recently, many works have addressed the task of SISR with Convolutional Neural Networks (CNNs). One pioneering work is the Super-Resolution Convolutional Neural Network (SRCNN, see Figure 1 (a)) [9, 10]. It implicitly learns a mapping between LR and HR images using a fully convolutional network. It takes bicubic interpolation as a preprocessing step and feeds the interpolated result to the network for super-resolution. This slows down processing and increases the memory requirement, as all convolution operations are done on HR images. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [37] addresses this issue by feeding the small LR image to the network and postponing the upscaling until just before the output layer, by means of a newly proposed sub-pixel layer. Inspired by the success of very deep networks in recognition tasks [26, 39, 40], Kim et al. [22, 23] proposed Very Deep Super-Resolution (VDSR, see Figure 1 (b)), increasing the network depth to 20 layers. Moreover, inspired by ResNet [15], they predict the residual between the bicubic interpolation result and the HR image instead of directly predicting the HR image, which eases the training process. Both steps further improve performance. In [33], skip connections are added between the convolutional and deconvolutional layers of very deep convolutional encoder-decoder networks for faster convergence and more detailed restoration. Lai et al. [27] proposed the Deep Laplacian Pyramid Network, which performs the upscaling progressively from a small upscaling factor to a large one. Very recently, SRResNet [28], EDSR [30] and DRRN [41] proposed to use not only the residual connection in the last layer but also local residual connections in intermediate layers, as in the ResNet [15] and DenseNet [17] architectures, to further improve the performance. A comparison between our network architecture and two popular ones is shown in Figure 1. Ours is most similar to the VDSR architecture, except that we perform interpolation instead of an addition operation.
Adaptive Convolution
Very recently, several works have proposed to modify the traditional convolutional layer and make it more adaptive to the input, as we do. In the context of image classification, Jeon and Kim [20] introduced an active convolution unit, which allows a convolutional layer to have a flexible shape. In [7], convolutional layers are further modified such that each position has an adaptive receptive field, which gives good performance on both object detection and semantic segmentation. Recently, [12, 21, 46] simultaneously proposed the Dynamic Filter Network to model spatial transformations with a single convolution step for the task of video prediction, with the model conditioned not only on different inputs but also on different positions in the image. Niklaus et al. [34] extended this work to video interpolation by replacing a single 2D convolution step with two separable 1D convolutions. Our work is one of the few to relate the idea of adaptive convolution to the SISR task. One similar work in this context is by Riegler et al. [35]. However, they modified SRCNN by conditioning the parameters of its first convolutional layer on the input image in order to handle different blur kernels for different images. This requires a different setup, so it cannot directly be compared against.
3 Deep Adaptive Image Resampling
In this section, we describe our proposed deep adaptive image resampling model (Section 3.1) and several further refinements thereof (Section 3.2). In addition, we extend it to the joint image filtering task (Section 3.3).
3.1 The Basic Model
Our model is composed of two parts: one module estimates the adaptive image interpolation kernels, and another module applies the interpolation kernels to the LR input to produce the super-resolved result. The full architecture is shown in Figure 1 (c).
Adaptive interpolation kernels. Instead of using a fixed, blind interpolation kernel for every image and every position, we propose to use a data-driven method to compute a content-aware interpolation kernel separately for each position in the image. We use a fully convolutional network (FCN) [31] to compute the weights of the interpolation kernels. Our FCN consists of several standard convolutional layers and an upsampling layer. The convolutional layers in the FCN learn to model local context for each position in an LR image. Its output is a set of feature maps denoted as $F \in \mathbb{R}^{h \times w \times s^2 k^2}$, where $h$ and $w$ are the height and width of the LR input, $k \times k$ is the spatial size of the interpolation kernels and $s$ is the upscaling factor. $F$ has the same spatial resolution as the LR input. To adapt its spatial resolution to HR images, we add an upsampling layer, which can be implemented as either a sub-pixel layer [37] or a fractionally-strided convolutional layer [31]. The upsampled interpolation kernels are denoted as $\hat{F} \in \mathbb{R}^{H \times W \times k^2}$, where $H = sh$ and $W = sw$. Each spatial position in $\hat{F}$ corresponds to a vector of dimension $k^2$. It can be reshaped to a filter of size $k \times k$ and works as an interpolation kernel at that position. The interpolation kernel combines the nearby pixels in the LR input and reconstructs the corresponding pixel in the HR image. The interpolation estimation module is expected to learn which elements in a neighborhood contribute to the reconstruction of a certain pixel and how much each of them contributes.
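To make the kernel-estimation module concrete, the following is a minimal PyTorch sketch, not the authors' exact implementation: a stack of convolutional layers at LR resolution, followed by a sub-pixel (pixel-shuffle) layer that rearranges the $s^2 k^2$ output channels into $k^2$ kernel weights per HR position. The 3x3 convolution size and the absence of any normalization of the predicted weights are our assumptions.

```python
import torch
import torch.nn as nn

class KernelEstimator(nn.Module):
    """Sketch of the interpolation-kernel estimation FCN.

    Assumptions: 3x3 convolutions, no normalization of the predicted
    kernel weights (the paper leaves both unspecified here)."""
    def __init__(self, in_ch=1, scale=3, k=5, n_layers=10, width=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(n_layers - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        # Last layer outputs s^2 * k^2 channels: one k x k kernel per
        # sub-pixel position of each LR location.
        layers += [nn.Conv2d(width, scale ** 2 * k ** 2, 3, padding=1)]
        self.body = nn.Sequential(*layers)
        # Sub-pixel layer [37]: (B, s^2 k^2, h, w) -> (B, k^2, s h, s w).
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lr):
        return self.shuffle(self.body(lr))  # \hat{F}: per-pixel kernel weights
```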
Adaptive image resampling operation. Once the interpolation kernels are estimated, they are adaptively applied to the corresponding positions in the LR input image to reconstruct the HR image. Nearby pixels in the HR image may be resampled from the same set of pixels in the LR input, yet obtain different intensity values, as each pixel in the HR image space has its own interpolation kernel.
We first resize the LR input image to the same size as the HR image using the nearest neighbor method, resulting in $\tilde{I}$. Now $\tilde{I}$ has the same size as $\hat{F}$ and the HR image, which is convenient for the implementation of the adaptive resampling (filtering) operation and further extensions. Yet directly applying the interpolation kernels to consecutive elements in $\tilde{I}$ does not make sense, since neighboring elements in $\tilde{I}$ include repeated pixels (see Figure 2). To apply the estimated kernels to the correct set of pixels within a local region in $\tilde{I}$, we need to upscale the interpolation kernels as well, i.e. apply them to elements with interval $s$ in $\tilde{I}$, as shown in Figure 2 and Equation 1:
$$\hat{I}(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} \hat{F}_{x, y}(i, j)\, \tilde{I}(x + i s,\; y + j s), \qquad r = \lfloor k/2 \rfloor \quad (1)$$
This is similar to the concept of atrous convolution, widely used for semantic segmentation [6, 50], but unlike atrous convolution it is not translation invariant. The interval $s$ corresponds to the sampling rate parameter in atrous convolution. Using this scheme, different from traditional interpolation methods, each position in the HR image has a different interpolation kernel which is able to adapt to the appearance and semantics of that position.
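The resampling of Equation 1 can be implemented compactly with an unfold (im2col) operation with dilation; below is a minimal sketch under the same assumptions as above (single-channel input, kernels from the estimator sketched earlier).

```python
import torch
import torch.nn.functional as F

def adaptive_resample(lr, kernels, scale, k=5):
    """Apply per-pixel k x k kernels to the NN-resized LR image with
    sampling interval `scale` (Eq. 1). `kernels` has shape (B, k*k, sh, sw),
    e.g. the output of the KernelEstimator sketch above."""
    nn_up = F.interpolate(lr, scale_factor=scale, mode='nearest')  # \tilde{I}
    # Gather k x k neighbourhoods with dilation = scale, so each HR pixel
    # sees distinct LR pixels rather than NN-repeated copies (Figure 2).
    pad = (k // 2) * scale
    patches = F.unfold(nn_up, kernel_size=k, dilation=scale, padding=pad)
    b, _, H, W = kernels.shape
    # Weighted sum of each neighbourhood with its own kernel.
    out = (patches * kernels.view(b, k * k, H * W)).sum(dim=1)
    return out.view(b, 1, H, W)
```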
3.2 Further Improvements
Interpolation for larger upscaling factors. Even though the interpolation kernels are estimated by a deep neural network with relatively large receptive fields, the subsequent filtering operation is still a locally linear model. The elements used for interpolation are limited by the size of the filters. When the upscaling factor gets larger, the correlation between a pixel and its neighbors in the low resolution image becomes smaller. Therefore, the relative performance of interpolation-based methods drops as the upscaling factor increases. To reduce this degradation, we explore two alternatives: i) increasing the size of the interpolation kernels, and ii) doing the upsampling in a progressive way.
For the first approach, sampling from a larger neighborhood directly would require many more parameters and more memory. To alleviate this problem, we borrow the idea of Atrous Spatial Pyramid Pooling (ASPP) from Deeplab-v2 [6], originally proposed to increase receptive fields for semantic segmentation. Similarly, we want the interpolation kernel to cover a larger neighborhood, especially when the upscaling factor is large. This can be done by applying the estimated filters to the NN-interpolated LR image with different intervals $d$, i.e.
$$\hat{I}(x, y) = \sum_{d \in \mathcal{D}} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \hat{F}^{(d)}_{x, y}(i, j)\, \tilde{I}(x + i d,\; y + j d) \quad (2)$$
where $\mathcal{D}$ is a set of sampling intervals (multiples of $s$) and $\hat{F}^{(d)}$ the kernels estimated for interval $d$. The sum of the filters over all intervals composes a large interpolation kernel, as shown in Figure 3. The interpolation kernel is sparse but covers a large neighborhood. This way, the range of the local context is enlarged without drastically increasing the number of parameters and memory.
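Equation 2 amounts to summing several resampling passes, one per sampling interval. Continuing the sketch above (same imports), and assuming one predicted kernel set per pyramid level:

```python
def aspp_resample(lr, kernels_per_level, intervals, scale, k=5):
    """Atrous-spatial-pyramid variant (Eq. 2): the per-level results are
    summed, composing a large but sparse interpolation kernel."""
    nn_up = F.interpolate(lr, scale_factor=scale, mode='nearest')
    b, _, H, W = nn_up.shape
    out = torch.zeros(b, 1, H, W, device=lr.device, dtype=lr.dtype)
    for kernels, d in zip(kernels_per_level, intervals):  # e.g. d = s, 2s, ...
        pad = (k // 2) * d
        patches = F.unfold(nn_up, kernel_size=k, dilation=d, padding=pad)
        level = (patches * kernels.view(b, k * k, H * W)).sum(dim=1)
        out = out + level.view(b, 1, H, W)
    return out
```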
Alternatively, we can also decompose a large upscaling factor into several upsampling operations with smaller upscaling factors. As mentioned in [27], progressive upsampling makes the super-resolution task easier by dividing it into several subproblems. Take ×4 super-resolution as an example. For progressive upsampling with our model, we first feed the LR image to the model and produce a ×2 super-resolved image. At this stage, the model just uses pixels within a local neighborhood to interpolate the pixels of a downsampled version of the high resolution image. At the next stage, we feed this result to another network to estimate another set of interpolation kernels. To avoid drifting away from the content in the original low resolution image, we concatenate the intermediate super-resolution result with the nearest-neighbor resized input. The final super-resolved result is obtained by applying the second stage of estimated interpolation kernels to the intermediate result.
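A sketch of this two-stage ×4 pipeline, continuing the code above; `stage1` and `stage2` are hypothetical kernel estimators (`stage2` taking a 2-channel input), and exactly how the second stage consumes the concatenated input is our assumption:

```python
def progressive_x4(lr, stage1, stage2, k=5):
    """Two-stage x4 super-resolution: x2 first, then x2 again. The second
    kernel estimator sees the intermediate result concatenated with the
    NN-resized input, keeping it anchored to the original LR content."""
    sr_x2 = adaptive_resample(lr, stage1(lr), scale=2, k=k)
    nn_x2 = F.interpolate(lr, scale_factor=2, mode='nearest')
    kernels = stage2(torch.cat([sr_x2, nn_x2], dim=1))      # 2-channel input
    return adaptive_resample(sr_x2, kernels, scale=2, k=k)  # final x4 result
```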
Recursive image resampling. Finally, note that the proposed adaptive image resampling can be applied several times in a recursive way to further refine the super-resolved result. In this case, the initial super-resolved result has already filled in some of the details that were missing in the LR image. This intermediate result is concatenated with the nearest-neighbor resized input and sent to another interpolation kernel estimation module used during the recursive process. The estimated interpolation kernels are then applied to the intermediate super-resolved result to refine it further. This process is repeated multiple times, with shared parameters across the interpolation kernel estimation modules. This is reminiscent of, but different from, the recursive layer proposed in [23] and the recursive block proposed in [41]. In [23], each recursive output estimates a level of residual and the final result is an ensemble of all recursive outputs and the initial bicubic interpolation result. In [41], multi-path residual blocks compute a residual and the final result is the sum of the residual and the bicubic interpolation result. In our work, the adaptive image resampling is simply repeated multiple times based on the previous super-resolved result.
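The recursive refinement can be sketched similarly (continuing the code above). Here the kernel-estimation module `refine_net` is shared across steps and, as an assumption, the refinement kernels are applied at HR resolution with interval 1 and without a sub-pixel layer:

```python
def recursive_refine(lr, first_net, refine_net, scale, steps=2, k=5):
    """Recursive image resampling: re-estimate kernels from the previous
    result (concatenated with the NN-resized input) and resample again.
    `refine_net` is shared by all steps and outputs (B, k*k, H, W)."""
    nn_up = F.interpolate(lr, scale_factor=scale, mode='nearest')
    sr = adaptive_resample(lr, first_net(lr), scale, k)     # initial result
    for _ in range(steps):                                  # Recur{steps}_10
        kernels = refine_net(torch.cat([sr, nn_up], dim=1)) # shared weights
        patches = F.unfold(sr, kernel_size=k, padding=k // 2)  # interval 1
        b, _, H, W = sr.shape
        sr = (patches * kernels.view(b, k * k, H * W)).sum(1).view(b, 1, H, W)
    return sr
```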
3.3 Joint Filtering
The proposed adaptive image resampling method is not restricted to the SISR setting. It can easily be extended to joint image filtering tasks such as depth image super-resolution. In this case, the input to the interpolation kernel estimation module includes an additional guidance image which provides auxiliary information for the filtering process. With the help of the guidance image, the model can learn better filters by considering the content of the guidance image. Traditional methods proposed in this context, such as joint bilateral upsampling [25] and guided image filtering [14], perform spatially-variant filtering operations as well. They compute the output at a pixel as a weighted average of nearby pixels, with the weights estimated from the guidance image. Instead of hand-designing a function to compute the filter kernel, we use a deep neural network to compute the kernel in a data-driven way. Deep neural networks are able to model more complex mappings than the hand-designed functions used in [25, 14]. Therefore, the proposed model can take full advantage of the guidance image and the filtering input, and integrate them to produce adaptive filters. Different from the SISR setup, here the input to the kernel estimation module is the concatenation $[\tilde{I}, G]$, where $G$ is the guidance image. The estimated filters are then applied to $\tilde{I}$ to reconstruct the high resolution image $\hat{I}$. This results in sharper edges and fewer unwanted gradient reversal artefacts.
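Continuing the sketches above, the joint-filtering variant only changes what the kernel estimator sees; the function name and the choice of sampling interval are our assumptions:

```python
def joint_depth_upsample(depth_lr, guidance_hr, kernel_net, scale, k=5):
    """Guided depth upsampling sketch: kernels are estimated from the
    NN-resized depth map concatenated with the RGB guidance, then applied
    to the NN-resized depth map (interval = scale, as in Sec. 3.1)."""
    depth_nn = F.interpolate(depth_lr, scale_factor=scale, mode='nearest')
    kernels = kernel_net(torch.cat([depth_nn, guidance_hr], dim=1))  # 4 channels in
    pad = (k // 2) * scale
    patches = F.unfold(depth_nn, kernel_size=k, dilation=scale, padding=pad)
    b, _, H, W = depth_nn.shape
    out = (patches * kernels.view(b, k * k, H * W)).sum(dim=1)
    return out.view(b, 1, H, W)
```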
4 Experiments
Figure 4: (a) HR image, (b) NN resized image, (c) our result, (d) estimated filters.
In this section, we evaluate our models on several widely used single image super-resolution benchmark datasets and visualize the interpolation kernels learned by the proposed model. We also apply it to the joint image filtering task.
Datasets
We use 291 images as our training data: 91 images from Yang et al. [49] and 200 images from the training set of the Berkeley Segmentation Dataset [1]. The data is augmented by rotation (90, 180, 270 degrees), scaling (scale factors of 0.6, 0.7, 0.8, 0.9) and horizontal flipping. Patches are cropped from the augmented data for the ×2, ×3 and ×4 super-resolution tasks, respectively, and downsampled with the bicubic resizing method to form the LR inputs. The proposed method is evaluated on four widely used benchmarks: Set5 [3], Set14 [51], BSD100 [1] and Urban100 [18], with PSNR and SSIM [45] as measures.
Implementation details. For the FCN used in our interpolation kernel estimation module, we find a stack of consecutive standard convolutional layers with ReLU activations, but without max-pooling or striding, to be effective. Similar to [22], the number of filters for all convolutional layers is 64, except for the convolutional layer that produces the adaptive interpolation kernels, whose width depends on the size of the kernel. Unless otherwise mentioned, the interpolation kernel size is set to 5 (i.e. 5×5 kernels). We use a sub-pixel layer [37] as the upsampling layer in this work. The mean absolute error (L1 loss) between the super-resolved result and the ground truth is used as our loss function. All models are initialized using the method proposed in [13] and trained for 200,000 iterations with mini-batches of size 16. The Adam optimizer [24] is used to optimize the parameters. The learning rate is initially set to 1e-4 and halved every 50,000 iterations.
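For completeness, a minimal training-loop sketch matching these settings (L1 loss, Adam with initial learning rate 1e-4 halved every 50,000 iterations, batch size 16). `model` and `loader` are hypothetical names, and Adam hyperparameters not stated above are left at library defaults (an assumption):

```python
import torch
import torch.nn.functional as F

def train(model, loader, total_iters=200_000):
    """Training-loop sketch; `loader` yields (lr, hr) patch pairs in
    mini-batches of 16, `model` maps an LR batch to an SR batch."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # betas: defaults (assumption)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50_000, gamma=0.5)
    it = 0
    while it < total_iters:
        for lr_img, hr_img in loader:
            loss = F.l1_loss(model(lr_img), hr_img)      # mean absolute error
            opt.zero_grad(); loss.backward(); opt.step(); sched.step()
            it += 1
            if it >= total_iters:
                return
```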

Table 1: Ablation results (PSNR/SSIM). DAIR_* are basic models, FCN_20 is the baseline without adaptive resampling, ASP*_20 and Prog_10 are the large-upscaling-factor variants, and Recur*_10 are the recursive models.

Dataset  Scale  DAIR_5  DAIR_10  DAIR_20  FCN_20  ASP2_20  ASP3_20  Prog_10  Recur1_10  Recur2_10
Set5  ×2  37.42/0.9582  37.61/0.9591  37.61/0.9592  37.43/0.9583  —  —  —  37.69/0.9594  37.75/0.9597
Set5  ×3  33.47/0.9193  33.71/0.9217  33.83/0.9228  33.57/0.9200  —  —  —  33.82/0.9230  33.87/0.9232
Set5  ×4  31.19/0.8806  31.31/0.8830  31.28/0.8828  31.14/0.8784  31.27/0.8833  31.31/0.8844  31.47/0.8858  31.29/0.8846  31.43/0.8861
Set14  ×2  32.92/0.9117  33.07/0.9130  33.12/0.9131  32.93/0.9116  —  —  —  33.16/0.9134  33.21/0.9139
Set14  ×3  29.66/0.8298  29.74/0.8317  29.78/0.8317  29.71/0.8302  —  —  —  29.81/0.8326  29.85/0.8331
Set14  ×4  27.89/0.7651  27.97/0.7677  27.80/0.7675  27.83/0.7630  28.00/0.7673  27.96/0.7675  28.04/0.7690  27.98/0.7681  28.06/0.7700
4.1 Filter Visualization
To see how the model exploits the adaptivity associated with our image resampling, we take ×3 super-resolution as an example and visualize the estimated interpolation kernels (see Figure 4). Here, instead of directly visualizing the filters at each position, we visualize the feature maps that correspond to the filters. The feature maps have 25 channels, where each channel corresponds to one element of a 5×5 filter. As shown in Figure 4, the feature map in the middle has higher values than the others. This indicates that the nearest neighbor contributes most to the interpolation, which is consistent with traditional interpolation methods. The edge regions clearly stand out in those feature maps, indicating that they are treated differently and that the interpolation kernels do indeed adapt to the image content.
Figure 5: (a) GT, (b) NN, (c) result; (d) GT, (e) NN, (f) result, (g) filters.
Further, we can see that the middle feature map and the ones next to it show certain patterns: vertical stripes for its left and right neighbors and horizontal stripes for its top and bottom neighbors. This reflects the variation due to the relative location of an HR pixel with respect to its nearest region in the LR image. The elements next to the nearest neighbor complement each other to obtain the best combination, especially on the edges. This increases the contrast between the two sides of an edge, making the edge look sharper.
To see how the filters adapt to different regions, we show some example filters that correspond to a smooth region, a textured region and a region with a strong edge in Figure 5. Each cell in the grid in column (g) corresponds to a filter. We find that for a smooth region the distribution of filters follows a regular pattern: there are several patterns at different positions, and one pattern repeatedly appears with an interval equal to the upscaling factor 3 (note that the values of the filters are not exactly the same, but very close), for example, the cell with a dark red dot in the center and its 8 neighbors. This is easy to understand: pixels at different positions in a smooth region of a low resolution image look similar, so they can be handled in a similar way for high resolution reconstruction. However, this is not the case for regions with rich textures and strong edges. Filters in such regions do not follow a regular pattern but become more complicated; spatially invariant interpolation kernels would not work well there. The kernels need to adapt to the variation within the region, such as along an edge or texture.
4.2 Ablation Study
In this section, we study the effects of different components of our method described in Section 3.
Number of layers.
First, we study the effect of the number of layers used to estimate the adaptive interpolation kernels. We experiment with 5, 10 and 20 layers for the adaptive interpolation kernel module, referred to as DAIR_5, DAIR_10 and DAIR_20. Table 1 shows that the deeper the network for that module, the better the super-resolved result. However, the difference between the models with 10 and 20 convolutional layers is small. To have a compact and effective model, we use 10 convolutional layers for each stage of our progressive and recursive models (see below).
Adaptive image resampling. We compare the proposed model with a model without the adaptive image resampling operation, FCN_20. Its architecture is similar to that of SRCNN [9] in that it is simply a fully convolutional network, but it differs in two respects: it starts from a nearest-neighbor resized image instead of a bicubic-interpolated one, and it has the same number of convolutional layers as DAIR_20, which is more than SRCNN.
Table 2: Comparison with state-of-the-art methods (PSNR/SSIM).

Dataset  Scale  Bicubic  Lanczos3  SRCNN [9]  FSRCNN [10]  VDSR [22]  DRCN [23]  LapSRN [27]  DRRN [41]  Ours
Set5  ×2  33.66/0.9299  34.32/0.9365  36.66/0.9542  37.05/0.956  37.53/0.9587  37.63/0.9588  37.52/0.959  37.74/0.9591  37.75/0.9597
Set5  ×3  30.89/0.8682  30.82/0.8754  32.75/0.9090  33.18/0.914  33.66/0.9213  33.82/0.9226  —  34.03/0.9244  33.87/0.9232
Set5  ×4  28.42/0.8104  28.80/0.8178  30.48/0.8628  30.72/0.866  31.35/0.8838  31.53/0.8854  31.54/0.885  31.68/0.8888  31.47/0.8865
Set14  ×2  30.24/0.8688  30.69/0.8791  32.45/0.9067  32.66/0.909  33.03/0.9124  33.04/0.9118  33.08/0.913  33.23/0.9136  33.21/0.9139
Set14  ×3  27.55/0.7742  27.83/0.7830  29.30/0.8215  29.37/0.824  29.77/0.8314  29.76/0.8311  —  29.96/0.8349  29.85/0.8331
Set14  ×4  26.00/0.7027  26.23/0.7098  27.50/0.7513  27.61/0.755  28.01/0.7674  28.02/0.7670  28.19/0.772  28.21/0.7720  28.07/0.7701
BSD100  ×2  29.56/0.8431  29.92/0.8551  31.36/0.8879  31.53/0.892  31.90/0.8960  31.85/0.8942  31.80/0.895  32.05/0.8973  32.00/0.8974
BSD100  ×3  27.21/0.7385  27.41/0.7481  28.41/0.7863  28.53/0.791  28.82/0.7976  28.80/0.7963  —  28.95/0.8004  28.87/0.7991
BSD100  ×4  25.96/0.6675  26.13/0.6754  26.90/0.7101  26.98/0.715  27.29/0.7251  27.23/0.7233  27.32/0.728  27.38/0.7284  27.25/0.7263
Urban100  ×2  26.88/0.8403  27.25/0.8503  29.50/0.8946  29.88/0.902  30.76/0.9140  30.75/0.9133  30.41/0.910  31.23/0.9188  31.08/0.9176
Urban100  ×3  24.46/0.7349  24.68/0.7430  26.24/0.7989  26.43/0.808  27.14/0.8279  27.15/0.8276  —  27.53/0.8378  27.24/0.8317
Urban100  ×4  23.14/0.6577  23.32/0.6641  24.52/0.7221  24.62/0.728  25.18/0.7524  25.14/0.7510  25.21/0.756  25.44/0.7638  25.13/0.7549
Figure 6: Visual comparison (PSNR/SSIM). First example: Bicubic 20.43/0.4914, SRCNN 21.06/0.5735, VDSR 21.34/0.6034, DRCN 21.36/0.6025, LapSRN 21.28/0.6018, ours 21.30/0.6046. Second example: Bicubic 18.07/0.6839, SRCNN 21.27/0.8130, VDSR 21.22/0.8629, DRCN 21.25/0.8621, LapSRN 21.18/0.8624, ours 21.31/0.8649.
The numbers in Table 1 show that the method with the adaptive resampling module is clearly better than the one without. Instead of directly generating a new value for each pixel in HR image space, we interpolate nearby pixels in LR image space. This shares a similar philosophy with the global residual connection in the last layer of [22, 23], which makes the training easier and has been shown to improve performance.
Large upscaling factor. As mentioned in Section 3, we explored two ways to address the weakness of interpolation-based methods at large upscaling factors. One is to use the sum over an atrous spatial pyramid to approximate a large interpolation kernel, such that it can combine information from a large neighborhood. We experimented with 2-level and 3-level atrous spatial pyramids, denoted ASP2_20 and ASP3_20 in Table 1. The atrous spatial pyramid somewhat improves the performance, because its approximated kernels cover larger context. The alternative is progressive upsampling, i.e. to progressively upsample a low-resolution image to an intermediate resolution and further upsample that to high resolution. The results show that progressive upsampling significantly improves the result for ×4 super-resolution. We conclude that progressive upsampling is the more effective option for large upscaling factors, and it is used in our final model for ×4 super-resolution.
Recursive refinement. We first apply a basic deep adaptive image resampling module with 10 convolutional layers to the low resolution image and obtain an initial result. Then another adaptive image resampling module with 10 convolutional layers is recursively applied to the previous result. We experiment with one (Recur1_10) and two (Recur2_10) recursion steps. The number of parameters of Recur1_10 and Recur2_10 is the same; the difference is the number of times the adaptive image resampling is applied. Table 1 shows that more recursion steps further refine the super-resolution result. We use Recur2_10 in our final model.
Figure 7: Qualitative comparison on depth map super-resolution: (a) guidance, (b) GT, (c) GF, (d) JBU, (e) ours.
4.3 Comparison with State-of-the-Art Methods
We use Recur2_10 as our final model for ×2 and ×3 super-resolution and combine Prog_10 and Recur2_10 as the final model for ×4 super-resolution. We compare the proposed deep adaptive image resampling method with several state-of-the-art methods: SRCNN [9], FSRCNN [10], VDSR [22], DRCN [23], LapSRN [27] and DRRN [41]. As shown in Table 2, our method achieves performance competitive with the state of the art on all four benchmarks, especially for small upscaling factors. We also show some visual comparisons: in Figure 6, the proposed method performs well on textured regions and regions with strong edges.
4.4 Joint Image Filtering
To demonstrate the effectiveness of the proposed model for joint image filtering, we carry out an experiment on depth map super-resolution, using the basic model with 10 convolutional layers, i.e. DAIR_10. A downsampled depth map is first resized using nearest neighbor interpolation and then concatenated with a guidance RGB image as the input to the interpolation kernel generation module. The high resolution reconstruction is computed by applying the estimated adaptive interpolation kernels to the nearest-neighbor interpolation of the downsampled depth map.
Similar to [29], we collect the training data by cropping patch pairs from 1,000 RGB images and depth maps in the NYUv2 dataset [38], with the patch size depending on the upsampling factor (×4, ×8 or ×16). The depth map patches are downsampled using the nearest neighbor method to form the low-resolution input, and the RGB patches are used as guidance. Once the model is trained, it is evaluated on two datasets: the remaining 449 images of the NYUv2 dataset, and the Middlebury dataset [16, 36] with missing values filled in by Lu et al. [32].
We compare the proposed method with several joint image filtering methods: JBU [25], GF [14], TGV [11], MSG-Net [19], FBS [2] and DJF [29]. Quantitative results are shown in Table 3, with root mean squared error (RMSE) as the evaluation metric (the results of the other methods are obtained from [29]).
Table 3: RMSE on depth map super-resolution (lower is better).

Method        Middlebury [16, 36]      NYUv2 [38]
              ×4    ×8    ×16          ×4    ×8     ×16
Bicubic       4.44  7.58  11.87        8.16  14.22  22.32
GF [14]       4.01  7.22  11.70        7.32  13.62  22.03
JBU [25]      2.44  3.81  6.13         4.07  8.29   13.35
TGV [11]      3.39  5.41  12.03        6.98  11.23  28.13
MSG-Net [19]  1.79  3.39  5.87         3.78  6.37   11.16
FBS [2]       2.58  4.19  7.30         4.29  8.94   14.59
DJF [29]      2.14  3.77  6.12         3.54  6.20   10.21
Ours          1.79  3.27  6.08         2.67  5.86   10.03
We can see that the proposed method performs favorably against state-of-the-art methods. Compared with [19], which uses multi-scale guidance, our method is simpler and more effective. Moreover, our method, which is trained on the NYUv2 dataset, also shows good generalization when evaluated on the Middlebury dataset. We attribute this to the adaptive nature of the proposed method. As in the SISR experiments, we observe that the performance decreases somewhat for large upscaling factors, especially for ×16 upsampling. For large upscaling factors, other deep learning based methods may learn to directly predict the depth map from the guidance image, which is impossible with our approach. We also show a qualitative comparison with filter-based methods in Figure 7. As shown, JBU and GF, both based on hand-designed spatially variant filters, can adapt to the content of the image to some extent. However, they tend to overemphasize strong edges even when the two sides of the edge have the same depth, and they are prone to producing artefacts on boundaries. The deep neural network used in our model is able to capture more complicated relations among the LR input, the guidance and the HR output. Hence the proposed method is more powerful than methods based on hand-designed spatially variant filters.
5 Conclusion
In this paper we propose a Deep Adaptive Image Resampling method for image super-resolution. Spatially variant interpolation kernels are estimated with a convolutional neural network and then applied to a low resolution image to reconstruct the high resolution image. We demonstrate the effectiveness of the proposed method on both single image super-resolution and joint image filtering tasks. Visualization of the estimated interpolation kernels gives further insight into why the proposed method works.
References
 [1] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898–916, 2011.
 [2] J. T. Barron and B. Poole. The fast bilateral solver. In ECCV, 2016.
 [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. AlberiMorel. Lowcomplexity singleimage superresolution based on nonnegative neighbor embedding. In BMVC, 2012.
 [4] T. Blu, P. Thévenaz, and M. Unser. Linear interpolation revitalized. TIP, 13(5):710–719, 2004.
 [5] H. Chang, D. Yeung, and Y. Xiong. Superresolution through neighbor embedding. In CVPR, 2004.
 [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.
 [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
 [8] D. Zhou, X. Shen, and W. Dong. Image zooming using directional cubic convolution interpolation. IET Image Processing, 6(6):627–634, 2012.
 [9] C. Dong, C. C. Loy, K. He, and X. Tang. Image superresolution using deep convolutional networks. TPAMI, 38(2):295–307, 2016.
 [10] C. Dong, C. C. Loy, and X. Tang. Accelerating the superresolution convolutional neural network. In ECCV, 2016.
 [11] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. In ICCV, 2013.
 [12] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
 [13] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [14] K. He, J. Sun, and X. Tang. Guided image filtering. TPAMI, 35(6):1397–1409, 2013.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [16] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In CVPR, 2007.
 [17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [18] J. Huang, A. Singh, and N. Ahuja. Single image superresolution from transformed selfexemplars. In CVPR, 2015.
 [19] T. Hui, C. C. Loy, and X. Tang. Depth map superresolution by deep multiscale guidance. In ECCV, 2016.
 [20] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
 [21] X. Jia, B. D. Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
 [22] J. Kim, J. K. Lee, and K. M. Lee. Accurate image superresolution using very deep convolutional networks. In CVPR, 2016.
 [23] J. Kim, J. K. Lee, and K. M. Lee. Deeplyrecursive convolutional network for image superresolution. In CVPR, 2016.
 [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [25] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [27] W. Lai, J. Huang, N. Ahuja, and M. Yang. Deep laplacian pyramid networks for fast and accurate superresolution. In CVPR, 2017.
 [28] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photorealistic single image superresolution using a generative adversarial network. In CVPR, 2017.
 [29] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In ECCV, 2016.
 [30] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image superresolution. In CVPR Workshops, 2017.
 [31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [32] S. Lu, X. Ren, and F. Liu. Depth enhancement via lowrank matrix completion. In CVPR, 2014.
 [33] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoderdecoder networks with symmetric skip connections. In NIPS, 2016.
 [34] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017.
 [35] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for nonblind single image superresolution. In ICCV, 2015.
 [36] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In CVPR, 2007.
 [37] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Realtime single image and video superresolution using an efficient subpixel convolutional neural network. In CVPR, 2016.
 [38] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
 [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [41] Y. Tai, J. Yang, and X. Liu. Image superresolution via deep recursive residual network. In CVPR, 2017.
 [42] H. Takeda, S. Farsiu, and P. Milanfar. Kernel regression for image processing and reconstruction. TIP, 16(2):349–366, 2007.
 [43] R. Timofte, V. D. Smet, and L. J. V. Gool. Anchored neighborhood regression for fast examplebased superresolution. In ICCV, 2013.
 [44] R. Timofte, V. D. Smet, and L. J. V. Gool. A+: adjusted anchored neighborhood regression for fast superresolution. In ACCV, 2014.
 [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
 [46] T. Xue, J. Wu, K. L. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
 [47] C. Yang, C. Ma, and M. Yang. Singleimage superresolution: A benchmark. In ECCV, 2014.
 [48] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image superresolution as sparse representation of raw image patches. In CVPR, 2008.
 [49] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image superresolution via sparse representation. TIP, 19(11):2861–2873, 2010.
 [50] F. Yu and V. Koltun. Multiscale context aggregation by dilated convolutions. In ICLR, 2016.
 [51] R. Zeyde, M. Elad, and M. Protter. On single image scaleup using sparserepresentations. In Curves and Surfaces, 2010.
 [52] L. Zhang and X. Wu. An edge-guided image interpolation algorithm via directional filtering and data fusion. TIP, 15(8):2226–2238, 2006.