Deep Adaptive Inference Networks for Single Image Super-Resolution

04/08/2020 ∙ by Ming Liu, et al. ∙ Microsoft ∙ Harbin Institute of Technology

Recent years have witnessed tremendous progress in single image super-resolution (SISR) owing to the deployment of deep convolutional neural networks (CNNs). For most existing methods, the computational cost of each SISR model is independent of local image content, hardware platform and application scenario. Nonetheless, a content- and resource-adaptive model is preferable, and it is encouraging to apply simpler and more efficient networks to easier regions with fewer details and to scenarios with restricted efficiency constraints. In this paper, we take a step forward to address this issue by leveraging adaptive inference networks for deep SISR (AdaDSR). In particular, our AdaDSR involves an SISR model as backbone and a lightweight adapter module which takes image features and the resource constraint as input and predicts a map of local network depth. Adaptive inference can then be performed with the support of efficient sparse convolution, where only a fraction of the layers in the backbone is executed at a given position according to its predicted depth. Network learning is formulated as the joint optimization of reconstruction and network depth losses. In the inference stage, the average depth can be flexibly tuned to meet a range of efficiency constraints. Experiments demonstrate the effectiveness and adaptability of our AdaDSR in contrast to its counterparts (e.g., EDSR and RCAN).




1 Introduction

Image super-resolution, which aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, is a representative low-level vision task with many real-world applications such as medical imaging [shi2013cardiac], surveillance [zou2011very] and entertainment [old_film]. Recently, driven by the development of deep convolutional neural networks (CNNs), tremendous progress has been made in single image super-resolution (SISR). On the one hand, the quantitative performance of SISR has been continuously improved by many outstanding representative models such as SRCNN [SRCNN], VDSR [VDSR], SRResNet [SRResNet], EDSR [EDSR], RCAN [RCAN], SAN [SAN], etc. On the other hand, considerable attention has also been given to several other issues in SISR, including visual quality [SRResNet], degradation model [DPSR], and blind SISR [zhang2019multiple].

Figure 1: Illustration of our motivation and performance. (a) and (b) show an LR image and the depth map predicted by our AdaDSR model, with three representative patches of varying SISR difficulty marked out. (c) Depth vs. performance: the performance of EDSR models with different numbers of residual blocks on these patches. (d) Comparison on Set5 (×2): two versions of our AdaDSR against their backbones. Please zoom in for better observation, and refer to the supplementary materials for more comparisons under other conditions.

Despite the unprecedented success of deep SISR, the computational cost of most existing networks is still independent of image content and application scenarios. Given such an SISR model, once training is finished, the inference process is deterministic and depends only on the model architecture and the input image size. Instead of deterministic inference, it is appealing to make the inference adaptive to local image content. To illustrate this point, Fig. 1(c) shows the SISR results of three image patches using EDSR [EDSR] with different numbers of residual blocks. It can be seen that EDSR with 8 residual blocks is sufficient to super-resolve a smooth patch with few textures. In contrast, at least 24 residual blocks are required for the patch with rich details. Consequently, treating the whole image equally and processing all regions with an identical number of residual blocks wastes computation on the easy regions. Thus, it is encouraging to develop a spatially adaptive inference method for a better tradeoff between accuracy and efficiency.

Moreover, an SISR model may be deployed to diverse hardware platforms. Even on a given hardware device, the model can be run under different battery conditions or workloads, and has to meet various efficiency constraints. One natural solution is to design and train numerous deep SISR models in advance, and dynamically select the appropriate one according to the hardware platform and efficiency constraints. Nonetheless, both the training and the storage of multiple deep SISR models are expensive, which greatly limits their practical application in scenarios with highly dynamic efficiency constraints. Instead, we propose to address this issue by further making the inference method adaptive to efficiency constraints.

To make the learned model adapt to local image content and efficiency constraints, this paper presents a kind of adaptive inference network for deep SISR, i.e., AdaDSR. Considering that stacked residual blocks have been widely adopted in representative SISR models [SRResNet, EDSR, RCAN], AdaDSR introduces a lightweight adapter module which takes image features as input and produces a map of local network depth. Given a position with a predicted local network depth, only the leading residual blocks up to that depth need to be computed in the testing stage. Thus, our AdaDSR can apply shallower networks to smooth regions (i.e., lower depth) and exploit deeper ones for regions with detailed textures (i.e., higher depth), thereby benefiting the tradeoff between accuracy and efficiency. Taking all the positions into account, sparse convolution can be adopted to facilitate efficient and adaptive inference.

We further improve AdaDSR to be adaptive to efficiency constraints. Note that the average of the depth map can be used as an indicator of inference efficiency. For simplicity, the efficiency constraint imposed by the hardware platform and application scenario can be represented as a specific desired depth. Thus, we also take the desired depth as an input of the adapter module, and require the average of the predicted depth map to approximate the desired depth. The learning of AdaDSR can then be formulated as the joint optimization of reconstruction and network depth losses. After training, we can dynamically set the desired depth value to accommodate various application scenarios, and then adopt our AdaDSR to meet the efficiency constraints.

Experiments are conducted to assess our AdaDSR. Without loss of generality, we adopt the EDSR [EDSR] model as the backbone of our AdaDSR (denoted by AdaEDSR). It can be observed from Fig. 1(b) that the predicted depth map has smaller depth values for smooth regions and higher ones for regions with rich small-scale details. As shown in Fig. 1(d), our AdaDSR can be flexibly tuned to meet various efficiency constraints (e.g., FLOPs) by specifying proper desired depth values. In contrast, most existing SISR methods can only be performed with deterministic inference and fixed computational cost. Quantitative and qualitative results further show the effectiveness and adaptability of our AdaDSR in comparison to state-of-the-art deep SISR methods. Furthermore, we also take another representative SISR model, RCAN [RCAN], as the backbone (denoted by AdaRCAN), which illustrates the generality of our AdaDSR. Considering the training efficiency, ablation analyses are performed on AdaDSR with the EDSR backbone (i.e., AdaEDSR).

To sum up, the contributions of this work include:

  • We present adaptive inference networks for deep SISR, i.e., AdaDSR, which augment the backbone with a lightweight adapter module that produces a local depth map for spatially adaptive inference.

  • Both image features and the desired depth are taken as input of the adapter, and the reconstruction loss is combined with a depth loss for network learning, so that AdaDSR, equipped with sparse convolution, can adapt to various efficiency constraints.

  • Experiments show that our AdaDSR achieves a better tradeoff between accuracy and efficiency than its counterparts (e.g., EDSR and RCAN), and can adapt to different efficiency constraints without training from scratch.

2 Related Work

In this section, we briefly review several topics relevant to our AdaDSR, including deep SISR models and adaptive inference methods.

2.1 Deep Single Image Super-Resolution

Dong et al. introduce a three-layer convolutional network in their pioneering work SRCNN [SRCNN]. Since then, the quantitative performance of SISR has been continuously promoted with the rapid development of CNNs. Kim et al. [VDSR] further propose a deeper model named VDSR with residual blocks and adjustable gradient clipping. Liu et al. [MWCNN] propose MWCNN, which accelerates inference and enlarges the receptive field by deploying a U-Net [U-Net] like architecture, in which multi-scale wavelet transformation is applied rather than traditional down-sampling or up-sampling modules to avoid information loss.

These methods take interpolated LR images as input, resulting in a heavy computation burden, so many recent SISR methods choose to increase the spatial resolution via PixelShuffle [PixelShuffle] at the tail of the model. SRResNet [SRResNet], EDSR [EDSR] and WDSR [WDSR] follow this setting and have a deep main body built by stacking several identical residual blocks [ResNet] before the tail component, and they obtain better performance and efficiency by modifying the architecture of the residual blocks. Zhang et al. [RCAN] build a very deep (more than 400 layers) yet narrow (64 channels vs. 256 channels in EDSR) RCAN model and learn a content-related weight for each feature channel inside the residual blocks. Dai et al. [SAN] propose SAN to obtain a better feature representation via a second-order attention model, and a non-locally enhanced residual group is incorporated to capture long-distance features.

Apart from the fidelity track, considerable attention has also been given to several other issues in SISR. For example, SRGAN [SRResNet] incorporates an adversarial loss to improve perceptual quality, DPSR [DPSR] proposes a new degradation model and performs super-resolution and deblurring simultaneously, and Zhang et al. [zhang2019multiple] solve the real image SISR problem in an unsupervised manner by taking advantage of generative adversarial networks. In addition, lightweight networks such as IDN [IDN] and CARN [CARN] have been proposed, but most lightweight models are accelerated at the cost of quantitative performance. In this paper, we propose an AdaDSR model, which achieves a better tradeoff between accuracy and efficiency.

2.2 Adaptive Inference

Traditional deterministic CNNs tend to be too inflexible to meet the various requirements of applications. As a remedy, many adaptive inference methods have been explored in recent years. Inspired by [bengio2013better], Upchurch et al. [upchurch2017deep] propose to learn an interpolation of deep features extracted by a pre-trained model, and manipulate the attributes of facial images. Shoshan et al. [DynamicNet] further propose a dynamic model named DynamicNet by deploying tuning blocks alongside the backbone model, and linearly manipulate the features to learn an interpolation of two objectives, which can be tuned to explore the whole objective space during the inference phase. Similarly, CFSNet [CFSNet] implements a continuous transition between different objectives, and automatically learns the trade-off between perception and distortion for SISR.

Some methods also leverage adaptive inference to obtain computationally efficient models. Li et al. [li2019improved] deploy multiple classifiers between the main blocks, and the last one acts as a teacher net to guide the previous ones. During the inference phase, the confidence score of a classifier indicates whether to perform the next block and the corresponding classifier. Figurnov et al. [patch_adaptive] predict a stop score for the patches, which determines whether to skip the subsequent layers, indicating that different regions have unequal importance for detection tasks. Therefore, skipping layers at less important regions can save inference time. Yu et al. [Path-restore] propose to build a denoising model with several multi-path blocks, and in each block, a path finder is deployed to select a proper path for each image patch. These methods are similar to our AdaDSR; however, they perform adaptive inference at the patch level, and the adaptation depends only on the features. In this paper, our AdaDSR implements pixel-wise adaptive inference via sparse convolution and is manually controllable to meet various resource constraints.

3 Proposed Method

This section presents our AdaDSR model for single image super-resolution. To begin with, we equip the backbone with a network depth map to facilitate spatially variant inference. Then, sparse convolution is introduced to speed up the inference by omitting the unnecessary computation. Furthermore, a lightweight adapter module is deployed to predict the network depth map. Finally, the overall network structure (see Fig. 2) and learning objective are provided.

Figure 2: Overall illustration of AdaDSR. On the bottom are diagrams showing AdaEDSR and AdaRCAN, respectively. On the top left, a five-layer adapter takes the feature map as input, and the weight of its first convolution is tuned by the desired depth on the fly. The adapter generates a depth map (with one channel for AdaEDSR and 10 channels for AdaRCAN). Each channel of the depth map is delivered to a group of sparse residual blocks (as shown on the top right). Only a fraction of the positions (marked by dark blue) require computation.

3.1 AdaDSR with Spatially Variant Network Depth

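Single image super-resolution aims at learning a mapping to reconstruct the high-resolution image $\hat{\mathbf{y}}$ from its low-resolution (LR) observation $\mathbf{x}$, and can be written as,

$$\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}; \Theta), \tag{1}$$

where $\mathcal{F}$ denotes the SISR network with the network parameters $\Theta$. In this work, we consider a representative category of deep SISR networks that consist of three major modules, i.e., feature extraction $\mathcal{F}_e$, residual blocks, and HR reconstruction $\mathcal{F}_r$. Several representative SISR models, e.g., SRResNet [SRResNet], EDSR [EDSR], and RCAN [RCAN], belong to this category. Using EDSR as an example, we let $\mathbf{z}_0 = \mathcal{F}_e(\mathbf{x})$. The output of the $D$ residual blocks can then be formulated as,

$$\mathbf{z}_D = \mathbf{z}_0 + \sum_{l=1}^{D} \mathcal{F}_l(\mathbf{z}_{l-1}; \Theta_l), \tag{2}$$

where $\Theta_l$ denotes the network parameters associated with the $l$-th residual block. Given the output $\mathbf{z}_{l-1}$ of the $(l-1)$-th residual block, the $l$-th residual block can be written as $\mathbf{z}_l = \mathbf{z}_{l-1} + \mathcal{F}_l(\mathbf{z}_{l-1}; \Theta_l)$. Finally, the reconstructed HR image can be obtained by $\hat{\mathbf{y}} = \mathcal{F}_r(\mathbf{z}_D)$.

As shown in Fig. 1, the difficulty of super-resolution is spatially variant. For example, it is not necessary to go through all the residual blocks in Eqn. (2) to reconstruct smooth regions, whereas regions with rich and detailed textures generally require more residual blocks for high quality reconstruction. Therefore, we introduce a 2D network depth map $\mathbf{d}$, with entries $d_{ij}$, which has the same spatial size as $\mathbf{z}_0$. Intuitively, the network depth is smaller for smooth regions and larger for regions with rich details. To facilitate spatially adaptive inference, we modify Eqn. (2) as,

$$\mathbf{z}_D = \mathbf{z}_0 + \sum_{l=1}^{D} \mathcal{G}_l(\mathbf{d}) \circ \mathcal{F}_l(\mathbf{z}_{l-1}; \Theta_l), \tag{3}$$

where $\circ$ denotes the entry-wise product. Here, $\mathcal{G}_l(\mathbf{d})$ is applied entry-wise and is defined as,

$$\mathcal{G}_l(d_{ij}) = \begin{cases} 0, & d_{ij} \le l - 1, \\ 1, & d_{ij} \ge l, \\ d_{ij} - (l - 1), & \text{otherwise}. \end{cases} \tag{4}$$

Let $\lceil \cdot \rceil$ be the ceiling function; thus, the last $D - \lceil d_{ij} \rceil$ residual blocks need not be computed for a position with network depth $d_{ij}$. Given the 2D network depth map $\mathbf{d}$, we can then exploit Eqn. (3) to conduct spatially adaptive inference.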
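The depth-to-mask rule can be sketched in a few lines of Python. This is only an illustration: the piecewise-linear form used below (0 for blocks beyond the predicted depth, 1 for blocks fully within it, and the fractional remainder in between) is an assumption chosen to agree with the ceiling-function remark above, and the names `block_mask` and `active_blocks` as well as the example depth map are hypothetical.

```python
def block_mask(depth, l):
    """Mask value for residual block l (1-based) at a position with the given depth."""
    if depth <= l - 1:      # block lies entirely beyond the predicted depth: skip it
        return 0.0
    if depth >= l:          # block lies entirely within the predicted depth: keep it
        return 1.0
    return depth - (l - 1)  # fractional weight for the last, partially used block


def active_blocks(depth, num_blocks):
    """Indices of the residual blocks that must be computed at this position."""
    return [l for l in range(1, num_blocks + 1) if block_mask(depth, l) > 0.0]


# Hypothetical 2x2 depth map for a backbone with D = 8 residual blocks.
depth_map = [[0.0, 2.5], [8.0, 3.0]]
D = 8
blocks_per_position = [[len(active_blocks(d, D)) for d in row] for row in depth_map]
```

Under this assumed form, the number of computed blocks at a position equals the ceiling of its depth, and the mask values over all blocks sum to the depth itself, so the average of the depth map directly tracks the amount of work done.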

Figure 3: An example to illustrate the im2col [im2col] based sparse convolution. The three operators shown denote convolution, entry-wise product and matrix multiplication, respectively. The input feature, convolution kernel and output feature belong to the standard convolution operation, which can be implemented by arbitrary convolution algorithms, while the reorganized input and kernel matrices are produced from them during the im2col procedure. Given the mask, the shaded rows of the reorganized input can be safely ignored in the im2col based sparse convolution, which reduces the amount of computation compared with sparse convolution built on standard convolution (as shown in the upper half).

3.2 Sparse Convolution for Efficient Inference

Let $\mathbf{M}$ (e.g., the mask given by Eqn. (4) for the $l$-th residual block) be a mask indicating the positions where the convolution activations should be kept. As shown in Fig. 3, for some convolution implementations such as fast Fourier transform (FFT) [fft-1, fft-2] and Winograd [winograd] based algorithms, one should first perform the standard convolution to obtain the whole output feature map by $\mathbf{O} = \mathbf{W} * \mathbf{F}$. Here, $\mathbf{F}$, $\mathbf{W}$ and $*$ denote the input feature map, convolution kernel and convolution operation, respectively. Then the sparse result can be represented by $\mathbf{M} \circ \mathbf{O}$, where $\circ$ is the entry-wise product. Such implementations meet the requirement of spatially adaptive inference, but they retain the same computational complexity as the standard convolution.

Instead, we adopt the im2col [im2col] based sparse convolution for efficient adaptive inference. As shown in Fig. 3, the patch of the input feature map related to a point in the output feature map is organized as a row of a matrix, and the convolution kernel is likewise flattened into a vector. The convolution operation is thereby transformed into a matrix multiplication problem, which is highly optimized in many Basic Linear Algebra Subprograms (BLAS) libraries, and the result can then be organized back into the output feature map. Given the mask, we can simply skip a row when constructing the reorganized input matrix if its mask value is zero (see the shaded rows in Fig. 3), so the corresponding computation is skipped as well. Thus, the spatially adaptive inference in Eqn. (3) can be efficiently implemented via the im2col and col2im procedures. Moreover, the efficiency improves further when more rows are masked out, i.e., when the average depth of the depth map is smaller.
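The row-skipping idea can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map, a square odd-sized kernel, stride 1 and zero padding; the function name `im2col_sparse_conv` is hypothetical, and a practical implementation would operate on multi-channel tensors and hand the matrix product to a BLAS or GPU routine.

```python
def im2col_sparse_conv(feature, kernel, mask):
    """Masked convolution that only builds im2col rows where the mask is non-zero.

    Returns the masked output map and the number of rows actually computed.
    """
    h, w = len(feature), len(feature[0])
    k = len(kernel)
    pad = k // 2
    weights = [kernel[a][b] for a in range(k) for b in range(k)]  # flattened kernel

    positions, rows = [], []
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 0:
                continue  # skipped position: no row is built, no multiplication done
            row = []
            for a in range(k):
                for b in range(k):
                    y, x = i + a - pad, j + b - pad
                    inside = 0 <= y < h and 0 <= x < w
                    row.append(feature[y][x] if inside else 0.0)  # zero padding
            positions.append((i, j))
            rows.append(row)

    # Matrix-vector product over the kept rows only.
    values = [sum(r * wt for r, wt in zip(row, weights)) for row in rows]

    # col2im step: scatter the results back; skipped positions stay zero.
    out = [[0.0] * w for _ in range(h)]
    for (i, j), v in zip(positions, values):
        out[i][j] = mask[i][j] * v
    return out, len(rows)


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
mask = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
out, computed_rows = im2col_sparse_conv(feature, box_kernel, mask)
```

Here only two of the nine positions produce a row, so the multiplication cost scales with the number of non-zero mask entries rather than with the image size.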

It is worth noting that sparse convolution has been suggested in many works and evaluated in image classification [SparseWinogradConv], object detection [SparseCNN, SBNet], model pruning [FasterCNN] and 3D semantic segmentation [SparseConvNet] tasks. [SparseCNN] and [SBNet] are based on the im2col and Winograd algorithms, respectively; however, these methods implement patch-level sparse convolution. [SparseConvNet] designs a new data structure for sparse convolution and constructs a whole CNN framework to suit the designed data structure, making it incompatible with standard methods. [SparseWinogradConv] incorporates sparsity into the Winograd algorithm, which is mathematically equivalent to neither the vanilla CNN nor the conventional Winograd CNN. The most relevant work [FasterCNN] skips unnecessary points when traversing all spatial positions and achieves pixel-level sparse convolution, but it is implemented on serial devices (e.g., CPUs) via for-loops. In this work, we use im2col based sparse convolution, which combines this intuitive idea with the im2col algorithm, and deploy the proposed model on parallel platforms (e.g., GPUs). To the best of our knowledge, this is the first attempt to deploy pixel-wise sparse convolution on the SISR task and achieve image content and resource adaptive inference.

3.3 Lightweight Adapter Module

In this subsection, we introduce a lightweight adapter module to predict a 2D network depth map $\mathbf{d}$. In order to adapt to local image content, the adapter module is required to produce a lower network depth for smooth regions and a higher depth for detailed regions. Let $\bar{d}$ be the average value of $\mathbf{d}$, and $d$ be the desired network depth. To make the model adaptive to efficiency constraints, we also take the desired network depth $d$ into account, and require that a decrease of $d$ results in a smaller $\bar{d}$, i.e., better inference efficiency.

As shown in Fig. 2, the adapter module takes the feature map $\mathbf{z}_0$ as the input and is comprised of four convolution layers with PReLU nonlinearity followed by another convolution layer with ReLU nonlinearity. Let $\mathbf{d}$ denote the output of the adapter. We then use Eqn. (4) to generate the mask for each residual block. Note that this mask may not be binary, but it contains many zeros. Thus, we can construct a sparse residual block which omits the computation for the regions with zero mask values to facilitate efficient adaptive inference. To meet the efficiency constraint, we also take the desired network depth $d$ as an input to the adapter, and predict the network depth map by

$$\mathbf{d} = \mathcal{P}(\mathbf{z}_0, d; \Theta_a), \tag{5}$$

where $\mathcal{P}$ denotes the adapter module and $\Theta_a$ its network parameters. Specifically, we make the first convolution layer of the adapter adjustable by modulating its weight according to the desired depth $d$, so that the adapter is able to meet the aforementioned depth-oriented constraints.

3.4 Network Architecture and Learning Objective

Network Architecture. As shown in Fig. 2, our proposed AdaDSR is comprised of a backbone SISR network and a lightweight adapter module to facilitate image content and efficiency adaptive inference. Without loss of generality, in this section we take EDSR [EDSR] as the backbone to illustrate the network architecture; it is feasible to apply our AdaDSR to other representative SISR models [SRResNet, WDSR, RCAN] built from a number of residual blocks [ResNet]. Following [EDSR], the backbone involves 32 residual blocks, each of which has two $3 \times 3$ convolution layers with stride 1, padding 1 and 256 channels with ReLU nonlinearity. Another convolution layer is deployed right behind the residual blocks. The feature extraction module is a convolution layer, and the reconstruction module is comprised of an upsampling unit to enlarge the features followed by a convolution layer which reconstructs the output image. The upsampling unit is composed of a series of Convolution-PixelShuffle [PixelShuffle] layers according to the super-resolution scale. Besides, the lightweight adapter module takes both the feature map and the desired network depth as input, and consists of five convolution layers to produce a one-channel network depth map.

It is worth noting that we implement two versions of AdaDSR. The first takes EDSR [EDSR] as backbone and is denoted by AdaEDSR. To further show the generality of the proposed AdaDSR and compare against state-of-the-art methods, we also take RCAN [RCAN] as backbone and implement an AdaRCAN model. The main difference is that RCAN replaces the 32 residual blocks with 10 residual groups, and each residual group is composed of 20 residual blocks equipped with channel attention. Therefore, we modify the adapter to generate 10 depth maps simultaneously, each of which is applied to one residual group.

Learning Objective. The learning objective of our AdaDSR includes a reconstruction loss term and a network depth loss term to achieve a proper tradeoff between SISR performance and efficiency. In terms of the SISR performance, we adopt the reconstruction loss defined on the super-resolved output and the ground-truth high-resolution image,


where $\mathbf{y}$ and $\hat{\mathbf{y}}$ respectively represent the high-resolution ground-truth and the image super-resolved by our AdaDSR. Considering the efficiency constraint, we require the average $\bar{d}$ of the predicted network depth map to approximate the desired depth $d$, and then introduce the following network depth loss,


To sum up, the overall learning objective of our AdaDSR is formulated as,


where $\lambda$ is a tradeoff hyper-parameter that is kept fixed in all our experiments.
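As an illustrative sketch of how the two terms can be combined, the snippet below assumes a mean absolute error for the reconstruction term and an absolute difference between the average predicted depth and the desired depth for the depth term. Both forms are assumptions made for illustration, as are the function names and the weight value used in the example; images and depth maps are passed as flat lists of numbers.

```python
def reconstruction_loss(sr, hr):
    """Mean absolute error between super-resolved and ground-truth pixels (assumed form)."""
    return sum(abs(a - b) for a, b in zip(sr, hr)) / len(hr)


def depth_loss(depth_map, desired_depth):
    """Deviation of the average predicted depth from the desired depth (assumed form)."""
    mean_depth = sum(depth_map) / len(depth_map)
    return abs(mean_depth - desired_depth)


def total_loss(sr, hr, depth_map, desired_depth, weight):
    """Reconstruction term plus the depth term scaled by a tradeoff weight."""
    return reconstruction_loss(sr, hr) + weight * depth_loss(depth_map, desired_depth)


# Hypothetical values: two pixels, two depth entries, tradeoff weight 0.5.
loss = total_loss([0.5, 0.5], [0.0, 1.0], [2.0, 4.0], desired_depth=2.0, weight=0.5)
```

A larger weight pushes the adapter to honor the requested depth more strictly, while a smaller one favors reconstruction quality.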

4 Experiments

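| Method | Scale | Set5 (PSNR / SSIM) | Set14 (PSNR / SSIM) | B100 (PSNR / SSIM) | Urban100 (PSNR / SSIM) | Manga109 (PSNR / SSIM) |
|---|---|---|---|---|---|---|
| Bicubic | ×2 | 33.66 / 0.9299 | 30.24 / 0.8688 | 29.56 / 0.8431 | 26.88 / 0.8403 | 30.80 / 0.9339 |
| SRCNN | ×2 | 36.66 / 0.9542 | 32.45 / 0.9067 | 31.36 / 0.8879 | 29.50 / 0.8946 | 35.60 / 0.9663 |
| VDSR | ×2 | 37.53 / 0.9590 | 33.05 / 0.9130 | 31.90 / 0.8960 | 30.77 / 0.9140 | 37.22 / 0.9750 |
| EDSR | ×2 | 38.11 / 0.9602 | 33.92 / 0.9195 | 32.32 / 0.9013 | 32.93 / 0.9351 | 39.10 / 0.9773 |
| AdaEDSR | ×2 | 38.21 / 0.9611 | 33.97 / 0.9208 | 32.35 / 0.9017 | 32.91 / 0.9353 | 39.11 / 0.9778 |
| RDN | ×2 | 38.24 / 0.9614 | 34.01 / 0.9212 | 32.34 / 0.9017 | 32.89 / 0.9353 | 39.18 / 0.9780 |
| RCAN | ×2 | 38.27 / 0.9614 | 34.12 / 0.9216 | 32.41 / 0.9027 | 33.34 / 0.9384 | 39.44 / 0.9786 |
| SAN | ×2 | 38.31 / 0.9620 | 34.07 / 0.9213 | 32.42 / 0.9028 | 33.10 / 0.9370 | 39.32 / 0.9792 |
| AdaRCAN | ×2 | 38.28 / 0.9615 | 34.12 / 0.9216 | 32.41 / 0.9026 | 33.29 / 0.9380 | 39.44 / 0.9785 |
| Bicubic | ×3 | 30.39 / 0.8682 | 27.55 / 0.7742 | 27.21 / 0.7385 | 24.46 / 0.7349 | 26.95 / 0.8556 |
| SRCNN | ×3 | 32.75 / 0.9090 | 29.30 / 0.8215 | 28.41 / 0.7863 | 26.24 / 0.7989 | 30.48 / 0.9117 |
| VDSR | ×3 | 33.67 / 0.9210 | 29.78 / 0.8320 | 28.83 / 0.7990 | 27.14 / 0.8290 | 32.01 / 0.9340 |
| EDSR | ×3 | 34.65 / 0.9280 | 30.52 / 0.8462 | 29.25 / 0.8093 | 28.80 / 0.8653 | 34.17 / 0.9476 |
| AdaEDSR | ×3 | 34.65 / 0.9288 | 30.57 / 0.8463 | 29.27 / 0.8091 | 28.78 / 0.8649 | 34.16 / 0.9482 |
| RDN | ×3 | 34.71 / 0.9296 | 30.57 / 0.8468 | 29.26 / 0.8093 | 28.80 / 0.8653 | 34.13 / 0.9484 |
| RCAN | ×3 | 34.74 / 0.9299 | 30.65 / 0.8482 | 29.32 / 0.8111 | 29.09 / 0.8702 | 34.44 / 0.9499 |
| SAN | ×3 | 34.75 / 0.9300 | 30.59 / 0.8476 | 29.33 / 0.8112 | 28.93 / 0.8671 | 34.30 / 0.9494 |
| AdaRCAN | ×3 | 34.79 / 0.9302 | 30.65 / 0.8481 | 29.33 / 0.8111 | 29.03 / 0.8689 | 34.49 / 0.9498 |
| Bicubic | ×4 | 28.42 / 0.8104 | 26.00 / 0.7027 | 25.96 / 0.6675 | 23.14 / 0.6577 | 24.89 / 0.7866 |
| SRCNN | ×4 | 30.48 / 0.8628 | 27.50 / 0.7513 | 26.90 / 0.7101 | 24.52 / 0.7221 | 27.58 / 0.8555 |
| VDSR | ×4 | 31.35 / 0.8830 | 28.02 / 0.7680 | 27.29 / 0.0726 | 25.18 / 0.7540 | 28.83 / 0.8870 |
| EDSR | ×4 | 32.46 / 0.8968 | 28.80 / 0.7876 | 27.71 / 0.7420 | 26.64 / 0.8033 | 31.02 / 0.9148 |
| AdaEDSR | ×4 | 32.49 / 0.8977 | 28.76 / 0.7865 | 27.71 / 0.7410 | 26.58 / 0.8011 | 30.96 / 0.9150 |
| RDN | ×4 | 32.47 / 0.8990 | 28.81 / 0.7871 | 27.72 / 0.7419 | 26.61 / 0.8028 | 31.00 / 0.9151 |
| RCAN | ×4 | 32.63 / 0.9002 | 28.87 / 0.7889 | 27.77 / 0.7436 | 26.82 / 0.8087 | 31.22 / 0.9173 |
| SAN | ×4 | 32.64 / 0.9003 | 28.92 / 0.7888 | 27.78 / 0.7436 | 26.79 / 0.8068 | 31.18 / 0.9169 |
| AdaRCAN | ×4 | 32.61 / 0.8998 | 28.88 / 0.7883 | 27.77 / 0.7428 | 26.80 / 0.8067 | 31.22 / 0.9172 |

Table 1: Quantitative results in comparison with the state-of-the-art methods. Best three methods are highlighted by red, blue and green, respectively.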
| Method | Scale | Set5 (GFLOPs / ms) | Set14 (GFLOPs / ms) | B100 (GFLOPs / ms) | Urban100 (GFLOPs / ms) | Manga109 (GFLOPs / ms) |
|---|---|---|---|---|---|---|
| SRCNN | ×2 | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2 |
| VDSR | ×2 | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8 |
| EDSR | ×2 | 1338.8 / 395.7 | 2552.2 / 630.7 | 1776.9 / 469.9 | 8041.1 / 2163.8 | 9891.4 / 2554.5 |
| AdaEDSR | ×2 | 650.6 / 312.3 | 1397.3 / 489.8 | 965.3 / 371.5 | 4844.9 / 1655.2 | 5208.4 / 1864.5 |
| RDN | ×2 | 801.1 / 345.9 | 1527.3 / 617.1 | 1063.3 / 407.7 | 4811.9 / 2198.5 | 5919.2 / 3417.1 |
| RCAN | ×2 | 577.9 / 633.2 | 1101.8 / 813.6 | 767.0 / 607.0 | 3471.2 / 1955.0 | 4270.0 / 2342.9 |
| SAN | ×2 | 3835.9 / 1276.0 | 17500.4 / 3314.0 | 3943.2 / 1637.8 | 372727.5 / N/A | 645359.4 / N/A |
| AdaRCAN | ×2 | 469.1 / 614.5 | 925.5 / 751.9 | 649.2 / 606.3 | 2907.2 / 1749.2 | 3300.7 / 2034.2 |
| SRCNN | ×3 | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2 |
| VDSR | ×3 | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8 |
| EDSR | ×3 | 699.1 / 251.0 | 1305.7 / 341.7 | 924.1 / 259.6 | 3984.0 / 957.9 | 4904.0 / 1271.1 |
| AdaEDSR | ×3 | 504.8 / 231.4 | 1013.5 / 302.0 | 722.8 / 232.0 | 3314.2 / 858.4 | 3695.9 / 1023.3 |
| RDN | ×3 | 437.1 / 195.0 | 816.3 / 290.3 | 577.8 / 190.2 | 2490.9 / 1045.3 | 3066.1 / 1404.0 |
| RCAN | ×3 | 328.5 / 553.6 | 613.5 / 551.7 | 434.2 / 511.2 | 1872.0 / 1029.2 | 2304.2 / 1208.7 |
| SAN | ×3 | 463.2 / 582.2 | 1930.1 / 992.9 | 517.5 / 600.7 | 36735.2 / 5416.2 | 61976.0 / 8194.4 |
| AdaRCAN | ×3 | 277.7 / 572.6 | 512.9 / 559.1 | 369.3 / 523.8 | 1596.3 / 968.1 | 1842.2 / 1107.1 |
| SRCNN | ×4 | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2 |
| VDSR | ×4 | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8 |
| EDSR | ×4 | 501.9 / 214.6 | 908.8 / 239.8 | 655.7 / 240.8 | 2699.4 / 640.0 | 3297.6 / 762.2 |
| AdaEDSR | ×4 | 371.7 / 181.1 | 716.8 / 215.4 | 508.5 / 195.2 | 2265.8 / 563.3 | 2588.4 / 656.0 |
| RDN | ×4 | 337.9 / 128.3 | 611.9 / 163.7 | 441.5 / 132.9 | 1817.3 / 512.4 | 2220.9 / 646.2 |
| RCAN | ×4 | 270.1 / 546.9 | 489.0 / 505.7 | 352.8 / 490.0 | 1452.5 / 684.7 | 1774.4 / 843.8 |
| SAN | ×4 | 159.4 / 482.7 | 522.2 / 568.5 | 190.9 / 445.2 | 7770.0 / 2258.0 | 12858.7 / 3174.3 |
| AdaRCAN | ×4 | 227.5 / 561.6 | 418.1 / 524.9 | 304.8 / 520.0 | 1263.0 / 659.8 | 1463.3 / 712.8 |

Table 2: Inference efficiency (FLOPs in G, time in ms) in comparison with the state-of-the-art methods. Note that the GPU memory is not enough to run SAN [SAN] with scale ×2 on Urban100 and Manga109 datasets.
Figure 4: Visual comparison for SR on Urban100 dataset. Note that the depth map of AdaRCAN is an average of the 10 groups. Kindly refer to the supplementary materials for more results.

4.1 Implementation Details

Model Training. For training our AdaDSR model, we use the 800 training images and the first five validation images from the DIV2K dataset [div2k] as training and validation sets, respectively. The input and output images are in RGB color space, and the input images are obtained by the bicubic degradation model. Following previous works [SRResNet, EDSR, RCAN], during training we subtract the mean value of the DIV2K dataset on RGB channels and apply data augmentation on training images, including random horizontal flip, random vertical flip and rotation. The AdaDSR model is optimized by the Adam [adam] algorithm for 800 epochs, with 16 LR patches in each iteration. The learning rate is halved after every 200 epochs. During training, the desired depth $d$ is randomly sampled from a range determined by $D$, where $D$ is 32 and 20 for AdaEDSR and AdaRCAN, respectively. Note that, because the data structure of sparse convolution is identical to that of standard convolution, we can use the pretrained backbone model to initialize the AdaDSR model, which improves training stability and saves training time.

Model Evaluation. Following previous works [SRResNet, EDSR, RCAN], we use PSNR and SSIM as model evaluation metrics, and five standard benchmark datasets (i.e., Set5 [set5], Set14 [set14], B100 [b100], Urban100 [urban100] and Manga109 [manga109]) are employed as test sets. The PSNR and SSIM indices are calculated on the luminance channel (a.k.a. Y channel) of the YCbCr color space, ignoring a boundary whose width in pixels equals the scale factor. Furthermore, computation efficiency is evaluated by FLOPs and inference time. For a fair comparison with the competing methods, when measuring the running time, we implement all competing methods in our framework and replace the convolution layers of the main body with im2col [im2col] based convolutions. All evaluations are conducted in the PyTorch [pytorch] environment running on a single Nvidia TITAN RTX GPU. The source code and pre-trained models are publicly available at
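The Y-channel PSNR protocol described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used for the reported numbers: it assumes 8-bit RGB inputs given as nested lists of `(r, g, b)` tuples and the ITU-R BT.601 studio-range luminance conversion that is common in super-resolution evaluation; the function names are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """Luminance of an 8-bit RGB pixel under ITU-R BT.601 (studio range 16..235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr, hr, border):
    """PSNR on the Y channel, ignoring `border` pixels on every side of the image."""
    h, w = len(hr), len(hr[0])
    squared_error, count = 0.0, 0
    for i in range(border, h - border):
        for j in range(border, w - border):
            diff = rgb_to_y(*sr[i][j]) - rgb_to_y(*hr[i][j])
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)
```

For a ×2 model, for example, `border` would be set to 2 so that a two-pixel frame around the image is excluded from the score.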

4.2 Comparison with State-of-the-arts

To evaluate the effectiveness of our AdaDSR model, we first compare AdaDSR with the backbone EDSR [EDSR] and RCAN [RCAN] models as well as four other state-of-the-art methods, i.e., SRCNN [SRCNN], VDSR [VDSR], RDN [RDN] and SAN [SAN]. In Tables 1 and 2, the desired depth is set to 32 and 20 for AdaEDSR and AdaRCAN, respectively, i.e., the number of residual blocks in EDSR and that of each group in RCAN. Note that all visual results of other methods given in this section are generated by the officially released models, while the FLOPs and inference time are evaluated in our framework.

As shown in Table 1, both AdaEDSR and AdaRCAN perform favorably against their counterparts EDSR and RCAN in terms of quantitative PSNR and SSIM metrics. Besides, Table 2 shows that, although the adapter module introduces extra computation cost, it is very lightweight compared with the backbone super-resolution model, and its deployment greatly reduces the computation of the whole model, resulting in lower FLOPs and faster inference, especially on large images (e.g., Urban100 and Manga109). Note that SAN [SAN] performs similarly to RCAN and AdaRCAN, yet its computation cost is too heavy on large images.

Apart from the quantitative comparison, visual results are also given in Fig. 4. One can see that AdaEDSR and AdaRCAN are able to generate super-resolved images whose visual quality is similar to or better than that of their counterparts. Kindly refer to the supplementary materials for more qualitative results. We also show the pixel-wise depth map of AdaRCAN (due to the space limit, we show the average of the depth maps for the 10 groups of AdaRCAN) to discuss the relationship between the processed image and the depth map. As we can see from Fig. 4, a greater depth is predicted for regions with detailed textures, while most of the computation in smooth areas can be omitted for efficiency, which is intuitive and verifies our discussion in Sec. 1.

Considering both quantitative and qualitative results, our AdaDSR can achieve performance comparable to state-of-the-art methods while greatly reducing the amount of computation. For further analysis of the adaptive adjustment of the desired depth, please refer to Sec. 4.3.

4.3 Adaptive Inference with Varying Depth

Taking both the feature map and the desired depth as input, the adapter module is able to predict an image-content-adaptive network depth map while satisfying the computation efficiency constraints. Consequently, our AdaDSR can be flexibly tuned to meet various efficiency constraints on the fly. In comparison, the competing methods are based on deterministic inference and can only be run at a fixed complexity. As shown in Fig. 5, we evaluate our AdaDSR model with different desired depths (i.e., 8, 16, 24, 32 for AdaEDSR and 5, 10, 15, 20 for AdaRCAN), and record the corresponding FLOPs and PSNR values on Set5. Please refer to the supplementary materials for more results.

From the figures, we can draw several conclusions. First, our AdaDSR can be tuned through the desired-depth hyper-parameter, resulting in a curve in the figures rather than a single point as for the competing methods. With an increasing desired depth, AdaDSR requires more computation resources and generates better super-resolved images. It is worth noting that AdaDSR taps the potential of the backbone models and can obtain performance comparable to the well-trained backbone model when a higher desired depth is set. Furthermore, AdaDSR reaches the saturation point at relatively low FLOPs, which indicates that a shallower model is sufficient for most regions. Experiments on both versions (i.e., AdaEDSR and AdaRCAN) verify the effectiveness and generality of our adapter module.
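The position-wise behaviour discussed above can be illustrated with a small sketch. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `adaptive_residual_stack`, the toy per-pixel blocks, and the hard rule that block l runs wherever the predicted depth exceeds l are all hypothetical simplifications (real residual blocks use spatial convolutions, and the predicted depth need not be an integer).

```python
import numpy as np

def adaptive_residual_stack(x, depth_map, blocks):
    """Apply residual blocks only where the predicted depth requires them.

    x:         feature map of shape (H, W, C)
    depth_map: per-pixel depth of shape (H, W), values in [0, len(blocks)]
    blocks:    list of callables mapping (N, C) features to (N, C) residuals
               (assumed per-pixel here to keep the sketch short)
    """
    out = x.copy()
    executed = 0  # number of (position, block) pairs actually computed
    for l, block in enumerate(blocks):
        mask = depth_map > l  # positions that still need block l
        if not mask.any():
            break  # every position has reached its predicted depth
        out[mask] = out[mask] + block(out[mask])
        executed += int(mask.sum())
    return out, executed

# Toy usage: a 4x4 image whose top-left 2x2 corner is "textured" (depth 3)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
depth_map = np.zeros((4, 4))
depth_map[:2, :2] = 3
blocks = [lambda f: 0.1 * f for _ in range(4)]
y, executed = adaptive_residual_stack(x, depth_map, blocks)
print(executed, "of", 4 * 4 * len(blocks))  # prints "12 of 64"
```

In this toy setting only the four textured positions pass through three blocks each, so 12 of the 64 possible block evaluations are performed, which mirrors why a lower average depth translates into lower FLOPs.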

Figure 5: Comparison against state-of-the-art methods in terms of FLOPs and PSNR on Set5, with one panel per scale factor: (a), (b) and (c). Note that SAN is omitted at one scale because its computation cost there is 3835.9 GFLOPs, far more than that of the other methods.

5 Ablation Analysis

Considering the training efficiency in a multi-GPU environment, we perform the ablation analysis with the EDSR backbone. Without loss of generality, we use the AdaEDSR model at a single scale factor.

Model           PSNR (dB)   SSIM     FLOPs (G)   Time (ms)
EDSR (8)        38.05       0.9607    408.23     147.2
EDSR (16)       38.11       0.9610    718.41     230.7
EDSR (24)       38.15       0.9612   1028.58     305.8
EDSR (32)       38.16       0.9611   1338.76     395.7
FAdaEDSR (8)    38.17       0.9609    504.87     280.6
FAdaEDSR (16)   38.21       0.9611    719.62     327.7
FAdaEDSR (24)   38.23       0.9613    997.95     366.8
FAdaEDSR (32)   38.24       0.9613   1358.30     402.9
AdaEDSR (8)     38.10       0.9605    329.50     169.6
AdaEDSR (16)    38.17       0.9608    472.90     217.0
AdaEDSR (24)    38.19       0.9610    574.85     243.8
AdaEDSR (32)    38.21       0.9611    650.65     312.3
Table 3: Quantitative evaluation of EDSR and AdaEDSR variants on Set5.

EDSR variants. To begin with, we train EDSR variants in our framework, i.e., EDSR (8), EDSR (16), EDSR (24) and EDSR (32), by setting the number of residual blocks to 8, 16, 24 and 32, respectively. Note that EDSR (32) performs slightly better than the released EDSR model, so we use it for a fair comparison. The quantitative results on Set5 are given in Table 3. Comparing all EDSR variants, one can generally observe performance gains as the model depth grows.

Besides, as previously illustrated in Fig. 1(c), a shallow model is sufficient for smooth areas, while regions with rich textures usually require a deep model for better reconstruction of the details. Taking advantage of this phenomenon, AdaDSR uses a lightweight adapter to predict a suitable depth for each area according to its difficulty and the resource constraints, and achieves a better efficiency-performance tradeoff, resulting in curves that lie to the top left of the corresponding counterparts, as shown in Figs. 1(d) and 5. Detailed data can be found in Table 3.

AdaEDSR variants. We further implement several AdaEDSR variants, i.e., FAdaEDSR (8), FAdaEDSR (16), FAdaEDSR (24) and FAdaEDSR (32), which are trained with a fixed desired depth of 8, 16, 24 and 32, respectively, and whose adapter module takes only the image features as input. These models are trained under the same settings as AdaEDSR, except for the fixed desired depth in the learning objective. As shown in Table 3, with the per-pixel depth map, these models obtain better quantitative results than the EDSR variants at similar computation cost.

It is worth noting that FAdaEDSR (32) achieves performance comparable to RDN [RDN], which clearly shows the effectiveness of the predicted network depth map. Furthermore, we also show the performance of our AdaEDSR model in Table 3. Specifically, the number in parentheses after AdaEDSR denotes the desired depth set at test time. One can see that although the quantitative performance is slightly worse than that of FAdaEDSR, AdaEDSR is more computationally efficient and can be flexibly tuned in the testing phase, indicating that AdaDSR achieves adaptive inference with only a minor sacrifice of performance.
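To make this trade-off concrete, the snippet below recomputes the PSNR difference and relative FLOPs of the depth-32 models against EDSR (32), using values copied from Table 3. The dictionary layout and variable names are illustrative choices only.

```python
# (PSNR in dB, FLOPs in G) for the depth-32 rows of Table 3
table3 = {
    "EDSR (32)": (38.16, 1338.76),
    "FAdaEDSR (32)": (38.24, 1358.30),
    "AdaEDSR (32)": (38.21, 650.65),
}

base_psnr, base_flops = table3["EDSR (32)"]
for name, (psnr, flops) in table3.items():
    psnr_gain = psnr - base_psnr   # positive means better than EDSR (32)
    rel_flops = flops / base_flops  # fraction of the EDSR (32) FLOPs
    print(f"{name}: {psnr_gain:+.2f} dB, {rel_flops:.1%} of EDSR (32) FLOPs")
```

With these numbers, AdaEDSR (32) uses slightly under half of the FLOPs of EDSR (32) while reporting a PSNR 0.05 dB higher, whereas FAdaEDSR (32) gains 0.08 dB at roughly the same FLOPs.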

6 Conclusion

In this paper, we revisit the relationship between model depth and quantitative performance on the single image super-resolution task, and present the AdaDSR model, which incorporates a lightweight adapter module and sparse convolution into deep SISR networks. The adapter module predicts an image-content-oriented network depth map, whose value is higher in regions with detailed textures and lower in smooth areas. According to the predicted depth, only a fraction of the residual blocks are executed at each position by using im2col-based sparse convolution. Furthermore, the parameters of the adapter module are adjustable on the fly according to the desired depth, so that the AdaDSR model can be tuned to meet various efficiency constraints in the inference phase. Experimental results show the effectiveness and adaptability of our AdaDSR model, and indicate that AdaDSR can obtain state-of-the-art performance while adapting to a range of efficiency requirements.
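As a rough illustration of the im2col-based sparse convolution mentioned above, the sketch below gathers 3x3 patches only at active positions and evaluates them with a single matrix multiplication. It is a NumPy sketch under assumptions (stride 1, zero padding, channels-last layout, the hypothetical name `sparse_conv3x3`), not the authors' implementation.

```python
import numpy as np

def sparse_conv3x3(x, weight, mask):
    """3x3, stride-1, zero-padded convolution evaluated only where mask is True.

    x:      input features, shape (H, W, C_in)
    weight: kernel, shape (3, 3, C_in, C_out)
    mask:   boolean array, shape (H, W); inactive outputs are left at zero
    """
    H, W, C_in = x.shape
    C_out = weight.shape[-1]
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    ys, xs = np.nonzero(mask)  # coordinates of active positions only
    # im2col restricted to active positions: one row of 9 * C_in values per pixel
    cols = np.stack(
        [padded[ys + dy, xs + dx] for dy in range(3) for dx in range(3)],
        axis=1,
    ).reshape(len(ys), 9 * C_in)
    out = np.zeros((H, W, C_out), dtype=x.dtype)
    # One matrix product covers all active positions; the rest are skipped
    out[ys, xs] = cols @ weight.reshape(9 * C_in, C_out)
    return out

# Toy usage: compute the convolution on a small active region only
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6, 3))
w = rng.standard_normal((3, 3, 3, 4))
mask = np.zeros((5, 6), dtype=bool)
mask[1:3, 2:5] = True
y = sparse_conv3x3(x, w, mask)
print(y.shape, int(mask.sum()), "active positions")
```

The cost of the matrix product scales with the number of active positions rather than with the image size, which is the property that lets a low predicted depth in smooth areas reduce computation.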