AdaDSR
Deep Adaptive Inference Networks for Single Image Super-Resolution
Recent years have witnessed tremendous progress in single image super-resolution (SISR) owing to the deployment of deep convolutional neural networks (CNNs). For most existing methods, the computational cost of each SISR model is independent of local image content, hardware platform and application scenario. Nonetheless, a content- and resource-adaptive model is preferable, and it is encouraging to apply simpler and more efficient networks to the easier regions with fewer details and to scenarios with restricted efficiency constraints. In this paper, we take a step forward to address this issue by leveraging adaptive inference networks for deep SISR (AdaDSR). In particular, our AdaDSR involves an SISR model as backbone and a lightweight adapter module which takes image features and a resource constraint as input and predicts a map of local network depth. Adaptive inference can then be performed with the support of efficient sparse convolution, where only a fraction of the layers in the backbone are executed at a given position according to its predicted depth. The network learning can be formulated as the joint optimization of reconstruction and network depth losses. In the inference stage, the average depth can be flexibly tuned to meet a range of efficiency constraints. Experiments demonstrate the effectiveness and adaptability of our AdaDSR in contrast to its counterparts (e.g., EDSR and RCAN).
Image super-resolution, which aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, is a representative low-level vision task with many real-world applications such as medical imaging [shi2013cardiac], surveillance [zou2011very] and entertainment [old_film]. Recently, driven by the development of deep convolutional neural networks (CNNs), tremendous progress has been made in single image super-resolution (SISR). On the one hand, the quantitative performance of SISR has been continuously improved by many outstanding representative models such as SRCNN [SRCNN], VDSR [VDSR], SRResNet [SRResNet], EDSR [EDSR], RCAN [RCAN], SAN [SAN], etc. On the other hand, considerable attention has also been given to several other issues in SISR, including visual quality [SRResNet], degradation models [DPSR], and blind SISR [zhang2019multiple].
Despite the unprecedented success of SISR, for most existing networks the computational cost of each model is still independent of image content and application scenario. Given such an SISR model, once training is finished, the inference process is deterministic and depends only on the model architecture and the input image size. Instead of deterministic inference, it is appealing to make the inference adaptive to local image content. To illustrate this point, Fig. 1(c) shows the SISR results of three image patches using EDSR [EDSR] with different numbers of residual blocks. It can be seen that EDSR with 8 residual blocks is sufficient to super-resolve a smooth patch with few textures. In contrast, at least 24 residual blocks are required for the patch with rich details. Consequently, treating the whole image equally and processing all regions with an identical number of residual blocks leads to wasted computational resources. Thus, it is encouraging to develop a spatially adaptive inference method for a better trade-off between accuracy and efficiency.
Moreover, an SISR model may be deployed on diverse hardware platforms. Even for a given hardware device, the model can be run under different battery conditions or workloads, and has to meet various efficiency constraints. One natural solution is to design and train numerous deep SISR models in advance, and dynamically select the appropriate one according to the hardware platform and efficiency constraints. Nonetheless, both the training and storage of multiple deep SISR models are expensive, greatly limiting their practical application to scenarios with highly dynamic efficiency constraints. Instead, we suggest addressing this issue by further making the inference adaptive to efficiency constraints.
To make the learned model adapt to local image content and efficiency constraints, this paper presents a kind of adaptive inference network for deep SISR, i.e., AdaDSR. Considering that stacked residual blocks have been widely adopted in representative SISR models [SRResNet, EDSR, RCAN], AdaDSR introduces a lightweight adapter module which takes image features as input and produces a map of local network depth. Given a position $p$ with local network depth $d(p)$, only the first $\lceil d(p) \rceil$ blocks are required to be computed in the testing stage. Thus, our AdaDSR can apply shallower networks to smooth regions (i.e., lower depth) and exploit deeper ones for regions with detailed textures (i.e., higher depth), thereby benefiting the trade-off between accuracy and efficiency. Taking all positions into account, sparse convolution can be adopted to facilitate efficient and adaptive inference.
We further improve AdaDSR to be adaptive to efficiency constraints. Note that the average of the depth map can be used as an indicator of inference efficiency. For simplicity, the efficiency constraint of a hardware platform and application scenario can be represented as a specific desired depth. Thus, we also take the desired depth as input to the adapter module, and require the average of the predicted depth map to approximate the desired depth. The learning of AdaDSR can then be formulated as the joint optimization of reconstruction and network depth losses. After training, we can dynamically set the desired depth to accommodate various application scenarios, and then adopt our AdaDSR to meet the efficiency constraints.
Experiments are conducted to assess our AdaDSR. Without loss of generality, we adopt the EDSR [EDSR] model as the backbone of our AdaDSR (denoted by AdaEDSR). It can be observed from Fig. 1(b) that the predicted depth map has smaller depth values for smooth regions and higher ones for regions with rich small-scale details. As shown in Fig. 1(d), our AdaDSR can be flexibly tuned to meet various efficiency constraints (e.g., FLOPs) by specifying proper desired depth values. In contrast, most existing SISR methods can only be performed with deterministic inference and fixed computational cost. Quantitative and qualitative results further show the effectiveness and adaptability of our AdaDSR in comparison to state-of-the-art deep SISR methods. Furthermore, we also take another representative SISR model, RCAN [RCAN], as the backbone (denoted by AdaRCAN), which illustrates the generality of our AdaDSR. Considering training efficiency, ablation analyses are performed on AdaDSR with the EDSR backbone (i.e., AdaEDSR).
To sum up, the contributions of this work include:
We present adaptive inference networks for deep SISR, i.e., AdaDSR, which augments the backbone with a lightweight adapter module to produce a local depth map for spatially adaptive inference.
Both image features and the desired depth are taken as input to the adapter, and the reconstruction loss is combined with a depth loss for network learning, thereby making AdaDSR, equipped with sparse convolution, adaptive to various efficiency constraints.
Experiments show that our AdaDSR achieves a better trade-off between accuracy and efficiency than its counterparts (e.g., EDSR and RCAN), and can adapt to different efficiency constraints without training from scratch.
In this section, we briefly review several topics relevant to our AdaDSR, including deep SISR models and adaptive inference methods.
Dong et al. introduce a three-layer convolutional network in their pioneering work SRCNN [SRCNN]; since then, the quantitative performance of SISR has been continuously promoted with the rapid development of CNNs. Kim et al. [VDSR] further propose a deeper model named VDSR with residual learning and adjustable gradient clipping. Liu et al. [MWCNN] propose MWCNN, which accelerates inference and enlarges the receptive field by deploying a U-Net [UNet] like architecture, where multi-scale wavelet transformation is applied instead of traditional downsampling and upsampling modules to avoid information loss. These methods take interpolated LR images as input, resulting in a heavy computational burden, so many recent SISR methods instead increase the spatial resolution via PixelShuffle [PixelShuffle] at the tail of the model. SRResNet [SRResNet], EDSR [EDSR] and WDSR [WDSR] follow this setting and have a deep main body that stacks several identical residual blocks [ResNet] before the tail component, obtaining better performance and efficiency by modifying the architecture of the residual blocks. Zhang et al. [RCAN] build a very deep (more than 400 layers) yet narrow (64 channels vs. 256 channels in EDSR) RCAN model and learn a content-related weight for each feature channel inside the residual blocks. Dai et al. [SAN] propose SAN to obtain better feature representations via a second-order attention model, and a non-locally enhanced residual group is incorporated to capture long-distance features.
Apart from the fidelity track, considerable attention has also been given to several other issues in SISR. For example, SRGAN [SRResNet] incorporates an adversarial loss to improve perceptual quality, DPSR [DPSR] proposes a new degradation model and performs super-resolution and deblurring simultaneously, and Zhang et al. [zhang2019multiple] solve the real-image SISR problem in an unsupervised manner by taking advantage of generative adversarial networks. In addition, lightweight networks such as IDN [IDN] and CARN [CARN] have been proposed, but most lightweight models are accelerated at the cost of quantitative performance. In this paper, we propose the AdaDSR model, which achieves a better trade-off between accuracy and efficiency.
Traditional deterministic CNNs tend to be less flexible in meeting the various requirements of real applications. As a remedy, many adaptive inference methods have been explored in recent years. Inspired by [bengio2013better], Upchurch et al. [upchurch2017deep] propose to learn an interpolation of deep features extracted by a pretrained model in order to manipulate the attributes of facial images. Shoshan et al. [DynamicNet] further propose a dynamic model named DynamicNet by deploying tuning blocks alongside the backbone model, and linearly manipulate the features to learn an interpolation of two objectives, which can be tuned to explore the whole objective space during the inference phase. Similarly, CFSNet [CFSNet] implements a continuous transition between different objectives, and automatically learns the trade-off between perception and distortion for SISR.

Some methods also leverage adaptive inference to obtain computationally efficient models. Li et al. [li2019improved] deploy multiple classifiers between the main blocks, where the last one serves as a teacher network to guide the previous ones. During the inference phase, the confidence score of a classifier indicates whether to execute the next block and its corresponding classifier. Figurnov et al. [patch_adaptive] predict a stop score for image patches, which determines whether to skip the subsequent layers, indicating that different regions are of unequal importance for detection tasks; skipping layers in less important regions therefore saves inference time. Yu et al. [Pathrestore] build a denoising model with several multi-path blocks, where in each block a path finder selects a proper path for each image patch. These methods are similar to our AdaDSR; however, they perform adaptive inference at the patch level, and the adaptation depends only on the features. In contrast, our AdaDSR implements pixel-wise adaptive inference via sparse convolution and is manually controllable to meet various resource constraints.

This section presents our AdaDSR model for single image super-resolution. To begin with, we equip the backbone with a network depth map to facilitate spatially variant inference. Then, sparse convolution is introduced to speed up the inference by omitting unnecessary computation. Furthermore, a lightweight adapter module is deployed to predict the network depth map. Finally, the overall network structure (see Fig. 2) and learning objective are provided.
Single image super-resolution aims at learning a mapping to reconstruct the high-resolution image $I^{HR}$ from its low-resolution (LR) observation $I^{LR}$, which can be written as,

\hat{I}^{HR} = \mathcal{F}(I^{LR}; \Theta)   (1)
where $\mathcal{F}(\cdot)$ denotes the SISR network with network parameters $\Theta$. In this work, we consider a representative category of deep SISR networks that consist of three major modules, i.e., feature extraction $\mathcal{F}_{fe}$, $D$ stacked residual blocks $\{\mathcal{F}_b^i\}_{i=1}^{D}$, and HR reconstruction $\mathcal{F}_{hr}$. Several representative SISR models, e.g., SRResNet [SRResNet], EDSR [EDSR], and RCAN [RCAN], belong to this category. Using EDSR as an example, we let $D = 32$. Denoting by $f_0 = \mathcal{F}_{fe}(I^{LR})$ the extracted feature, the output of the residual blocks can then be formulated as,

f_D = \mathcal{F}_b^D \circ \mathcal{F}_b^{D-1} \circ \cdots \circ \mathcal{F}_b^1 (f_0)   (2)

where the $i$-th residual block $\mathcal{F}_b^i$ has network parameters $\Theta_b^i$. Given the output $f_{i-1}$ of the $(i-1)$-th residual block, the $i$-th residual block can be written as $f_i = \mathcal{F}_b^i(f_{i-1}) = f_{i-1} + \mathcal{R}^i(f_{i-1})$, where $\mathcal{R}^i$ denotes the residual branch. Finally, the reconstructed HR image can be obtained by $\hat{I}^{HR} = \mathcal{F}_{hr}(f_D)$.
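The three-module decomposition around Eqn. (2) can be sketched in a few lines of numpy. This is a toy illustration only: the "layers" below are scalar placeholders of our own choosing (assumptions), not the real EDSR convolutions, and serve purely to show how the residual-block composition chains $f_i = f_{i-1} + \mathcal{R}^i(f_{i-1})$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # number of residual blocks (32 in EDSR)

def feature_extraction(lr):
    return lr * 0.5                      # stands in for the first conv layer

def residual_body(f0, blocks):
    f = f0
    for R in blocks:                     # block i: f_i = f_{i-1} + R(f_{i-1})
        f = f + R(f)
    return f                             # f_D, the output of the main body

def hr_reconstruction(fD):
    return fD * 2.0                      # stands in for upsampling + conv

# Each toy residual branch just scales its input by a small random factor.
blocks = [lambda x, w=w: w * x for w in rng.uniform(0.0, 0.1, size=D)]
lr = np.ones((8, 8))
sr = hr_reconstruction(residual_body(feature_extraction(lr), blocks))
```

Note that the deterministic loop over `blocks` runs every block at every position; the depth map introduced next makes exactly this loop spatially adaptive.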
As shown in Fig. 1, the difficulty of super-resolution is spatially variant. For example, it is not necessary to go through all the residual blocks in Eqn. (2) to reconstruct the smooth regions, whereas regions with rich and detailed textures generally require more residual blocks for high-quality reconstruction. Therefore, we introduce a 2D network depth map $d$ ($0 \le d(p) \le D$) which has the same spatial size as $f_0$. Intuitively, the network depth is smaller for smooth regions and larger for regions with rich details. To facilitate spatially adaptive inference, we modify Eqn. (2) as,

f_i = f_{i-1} + m_i \odot \mathcal{R}^i(f_{i-1}), \quad i = 1, \ldots, D   (3)

where $\odot$ denotes the entry-wise product. Here, the mask $m_i$ is defined as,

m_i(p) = \min(\max(d(p) - i + 1, 0), 1)   (4)

Let $\lceil \cdot \rceil$ be the ceiling function; thus, the last $D - \lceil d(p) \rceil$ residual blocks are not required to be computed for a position $p$ with network depth $d(p)$. Given the 2D network depth map $d$, we can then exploit Eqn. (3) to conduct spatially adaptive inference.
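The depth-to-mask conversion can be sketched as follows. We assume the fractional mask $m_i = \mathrm{clip}(d - i + 1, 0, 1)$, which is consistent with the ceiling rule above: block $i$ is fully computed where $d \ge i$, partially weighted where $i - 1 < d < i$, and skipped where $d \le i - 1$. The toy depth values are illustrative.

```python
import numpy as np

def depth_to_masks(d, D):
    """Convert a 2D network depth map d (H x W) into D per-block masks,
    assuming the fractional definition m_i = clip(d - i + 1, 0, 1)."""
    return [np.clip(d - i + 1.0, 0.0, 1.0) for i in range(1, D + 1)]

# Toy 2x2 depth map: e.g., a smooth pixel gets depth 1.5, a detailed one 4.
d = np.array([[1.5, 4.0],
              [0.0, 2.0]])
masks = depth_to_masks(d, D=4)
# At the pixel with depth 1.5, blocks 3 and 4 (the last D - ceil(d) = 2
# blocks) receive zero mask and can be skipped entirely.
```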
Let $m_i$ (i.e., the mask for the $i$-th residual block) indicate the positions where the convolution activations should be kept. As shown in Fig. 3, for some convolution implementations such as fast Fourier transform (FFT) [fft1, fft2] and Winograd [winograd] based algorithms, one should first perform the standard convolution to obtain the whole output feature map by $y = x * k$, where $x$, $k$ and $*$ denote the input feature map, convolution kernel and convolution operation, respectively; the sparse result can then be represented by $m_i \odot y$. However, such implementations meet the requirement of spatially adaptive inference while maintaining the same computational complexity as the standard convolution, i.e., no computation is saved.

Instead, we adopt im2col [im2col] based sparse convolution for efficient adaptive inference. As shown in Fig. 3, the patch from $x$ related to a point in $y$ is organized as a row in a matrix $X$, and the convolution kernel $k$ is also converted into a vector. The convolution operation is thereby transformed into a matrix multiplication problem, which is highly optimized in many Basic Linear Algebra Subprograms (BLAS) libraries, and the result can be organized back into the output feature map. Given the mask $m_i$, we can simply skip the corresponding row when constructing the reorganized input feature matrix $X$ if it has a zero mask value (see the shaded rows of $X$ in Fig. 3), so the associated computation is skipped as well. Thus, the spatially adaptive inference in Eqn. (3) can be efficiently implemented via the im2col and col2im procedure. Moreover, efficiency improves further as more rows are masked out, i.e., when the average depth of $d$ is smaller.

It is worth noting that sparse convolution has been suggested in many works and evaluated on image classification [SparseWinogradConv], object detection [SparseCNN, SBNet], model pruning [FasterCNN] and 3D semantic segmentation [SparseConvNet] tasks. [SparseCNN] and [SBNet] are based on the im2col and Winograd algorithms respectively; however, these methods implement patch-level sparse convolution. [SparseConvNet] designs a new data structure for sparse convolution and constructs a whole CNN framework to suit the designed data structure, making it incompatible with standard methods. [SparseWinogradConv] incorporates sparsity into the Winograd algorithm, which is mathematically equivalent to neither the vanilla CNN nor the conventional Winograd CNN. The most relevant work [FasterCNN] skips unnecessary points when traversing all spatial positions and achieves pixel-level sparse convolution, but is implemented on serial devices (i.e., CPUs) via for-loops. In this work, we combine this intuitive idea with the im2col algorithm, and deploy the proposed model on parallel platforms (i.e., GPUs). To the best of our knowledge, this is the first attempt to deploy pixel-wise sparse convolution for the SISR task and achieve image content and resource adaptive inference.
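The im2col-based sparse convolution can be sketched in numpy as below. This is an illustrative single-channel sketch, not the paper's GPU implementation: `im2col` and `sparse_conv2d` are hypothetical helpers, and the "convolution" is the CNN-style cross-correlation (no kernel flip). The key point is that rows of the im2col matrix with zero mask are filtered out before the matrix multiplication, so their cost is skipped.

```python
import numpy as np

def im2col(x, k):
    """Rearrange k x k patches of a zero-padded 2D input into rows (H*W, k*k)."""
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p)
    rows = np.empty((H * W, k * k))
    for i in range(H):
        for j in range(W):
            rows[i * W + j] = xp[i:i + k, j:j + k].ravel()
    return rows

def sparse_conv2d(x, kernel, mask):
    """im2col-based sparse convolution: rows whose mask is zero are never
    multiplied, so only the active positions enter the GEMM."""
    H, W = x.shape
    k = kernel.shape[0]
    rows = im2col(x, k)
    keep = mask.ravel() > 0
    out = np.zeros(H * W)
    out[keep] = rows[keep] @ kernel.ravel()   # GEMM over active rows only
    return out.reshape(H, W)

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
mask = (rng.random((6, 6)) > 0.5).astype(float)

dense = sparse_conv2d(x, kernel, np.ones_like(mask))  # full convolution
sparse = sparse_conv2d(x, kernel, mask)
# The sparse result agrees with the masked dense result at every kept position.
assert np.allclose(sparse, dense * mask)
```

In practice the row filtering maps directly onto a batched GEMM call in a BLAS library, which is why the saving translates into wall-clock speedup on parallel hardware.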
In this subsection, we introduce a lightweight adapter module to predict the 2D network depth map $d$. In order to adapt to local image content, the adapter module is required to produce lower network depth for smooth regions and higher depth for detailed regions. Let $\mathrm{avg}(d)$ be the average value of $d$, and $\bar{d}$ be the desired network depth. To make the model adaptive to efficiency constraints, we also take the desired network depth into account, and require that a decrease of $\bar{d}$ results in a smaller $\mathrm{avg}(d)$, i.e., better inference efficiency.
As shown in Fig. 2, the adapter module takes the feature map $f_0$ as input and is comprised of four convolution layers with PReLU nonlinearity followed by another convolution layer with ReLU nonlinearity, whose output is the depth map $d$. We then use Eqn. (4) to generate the mask $m_i$ for each residual block. It is noted that $m_i$ may not be a binary mask but contains many zeros. Thus, we can construct a sparse residual block which omits the computation for regions with zero mask values to facilitate efficient adaptive inference. To meet the efficiency constraint, we also take the desired network depth $\bar{d}$ as input to the adapter, and predict the network depth map by

d = \mathcal{F}_{a}(f_0, \bar{d}; \Theta_{a})   (5)

where $\Theta_{a}$ denotes the network parameters of the adapter module. Specifically, denoting the weight of the first convolution layer in the adapter by $W$, we make the convolution adjustable by rescaling $W$ according to the desired depth $\bar{d}$, so that the adapter is able to meet the aforementioned constraints.
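A per-pixel sketch of such an adapter is given below. This is a hypothetical miniature, not the paper's five-layer module: the "convolutions" are 1x1 (a per-pixel MLP), and the specific choice of scaling the first layer's weights by $\bar{d}/D$ is our own assumption for illustrating how the desired depth can modulate the prediction. The final ReLU keeps predicted depths non-negative, and clipping bounds them by $D$.

```python
import numpy as np

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def adapter(features, d_bar, D, W1, W2):
    """Hypothetical per-pixel adapter on a (H, W, C) feature map.
    Assumption: the first layer's weights are scaled by d_bar / D, so a
    smaller desired depth shrinks the predicted depth map."""
    h = prelu(features @ (W1 * d_bar / D))
    d = np.maximum(h @ W2, 0.0).squeeze(-1)   # ReLU -> one-channel depth map
    return np.minimum(d, D)                    # depths cannot exceed D

rng = np.random.default_rng(2)
H, W, C = 4, 4, 8
features = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C, C)) * 0.5
W2 = np.abs(rng.standard_normal((C, 1)))

d_low = adapter(features, d_bar=8, D=32, W1=W1, W2=W2)    # tight budget
d_high = adapter(features, d_bar=32, D=32, W1=W1, W2=W2)  # loose budget
```

Because PReLU and ReLU are positively homogeneous, the scaled weights make the predicted map shrink monotonically with $\bar{d}$ in this sketch, mirroring the requirement that a smaller desired depth yields a smaller average predicted depth.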
Network Architecture. As shown in Fig. 2, our proposed AdaDSR is comprised of a backbone SISR network and a lightweight adapter module to facilitate image content and efficiency adaptive inference. Without loss of generality, in this section we take EDSR [EDSR] as the backbone to illustrate the network architecture; it is also feasible to apply our AdaDSR to other representative SISR models [SRResNet, WDSR, RCAN] built from a number of residual blocks [ResNet]. Following [EDSR], the backbone involves 32 residual blocks, each of which has two 3x3 convolution layers (stride 1, padding 1, 256 channels) with ReLU nonlinearity. Another 3x3 convolution layer is deployed right behind the residual blocks. The feature extraction module is a convolution layer, and the reconstruction module is comprised of an upsampling unit to enlarge the features followed by a convolution layer which reconstructs the output image. The upsampling unit is composed of a series of convolution-PixelShuffle [PixelShuffle] pairs according to the super-resolution scale. Besides, the lightweight adapter module takes both the feature map and the desired network depth as input, and consists of five convolution layers to produce a one-channel network depth map.

It is worth noting that we implement two versions of AdaDSR. The first takes EDSR [EDSR] as the backbone and is denoted by AdaEDSR. To further show the generality of the proposed AdaDSR and compare against state-of-the-art methods, we also take RCAN [RCAN] as the backbone and implement an AdaRCAN model. The main difference is that RCAN replaces the 32 residual blocks with 10 residual groups, each of which is composed of 20 residual blocks equipped with channel attention. Therefore, we modify the adapter to generate 10 depth maps simultaneously, each of which is deployed to a residual group.
Learning Objective. The learning objective of our AdaDSR includes a reconstruction loss term and a network depth loss term to achieve a proper trade-off between SISR performance and efficiency. In terms of SISR performance, we adopt the reconstruction loss defined on the super-resolved output and the ground-truth high-resolution image,

\mathcal{L}_{rec} = \| \hat{I}^{HR} - I^{HR} \|_1   (6)

where $I^{HR}$ and $\hat{I}^{HR}$ respectively represent the high-resolution ground-truth and the image super-resolved by our AdaDSR. Considering the efficiency constraint, we require the average of the predicted network depth map to approximate the desired depth $\bar{d}$, and introduce the following network depth loss,

\mathcal{L}_{depth} = \left( \mathrm{avg}(d) - \bar{d} \right)^2   (7)

To sum up, the overall learning objective of our AdaDSR is formulated as,

\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{depth}   (8)

where $\lambda$ is a trade-off hyperparameter that is kept fixed in all our experiments.
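The joint objective can be sketched in numpy as follows. Assumptions here: an L1 reconstruction term, a squared penalty pulling the mean predicted depth toward the desired depth, and an illustrative value `lam = 0.01` (the paper's actual value is not reproduced here).

```python
import numpy as np

def adadsr_loss(sr, hr, depth_map, d_bar, lam=0.01):
    """Joint objective sketch: L1 reconstruction loss plus a squared
    penalty on the gap between the mean predicted depth and the desired
    depth d_bar. lam is an assumed trade-off weight."""
    l_rec = np.mean(np.abs(sr - hr))                 # reconstruction term
    l_depth = (np.mean(depth_map) - d_bar) ** 2      # depth term
    return l_rec + lam * l_depth

hr = np.zeros((4, 4))
sr = np.full((4, 4), 0.1)          # constant reconstruction error of 0.1
depth = np.full((4, 4), 20.0)      # predicted mean depth 20
loss = adadsr_loss(sr, hr, depth, d_bar=16.0, lam=0.01)
# l_rec = 0.1, l_depth = (20 - 16)^2 = 16, loss = 0.1 + 0.01 * 16 = 0.26
```

During training, the depth term pushes the adapter toward the requested budget while the reconstruction term keeps depth where the content needs it.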
Quantitative comparison (PSNR / SSIM) on the five benchmark datasets.

Scale x2:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
Bicubic | 33.66 / 0.9299 | 30.24 / 0.8688 | 29.56 / 0.8431 | 26.88 / 0.8403 | 30.80 / 0.9339
SRCNN [SRCNN] | 36.66 / 0.9542 | 32.45 / 0.9067 | 31.36 / 0.8879 | 29.50 / 0.8946 | 35.60 / 0.9663
VDSR [VDSR] | 37.53 / 0.9590 | 33.05 / 0.9130 | 31.90 / 0.8960 | 30.77 / 0.9140 | 37.22 / 0.9750
EDSR [EDSR] | 38.11 / 0.9602 | 33.92 / 0.9195 | 32.32 / 0.9013 | 32.93 / 0.9351 | 39.10 / 0.9773
AdaEDSR | 38.21 / 0.9611 | 33.97 / 0.9208 | 32.35 / 0.9017 | 32.91 / 0.9353 | 39.11 / 0.9778
RDN [RDN] | 38.24 / 0.9614 | 34.01 / 0.9212 | 32.34 / 0.9017 | 32.89 / 0.9353 | 39.18 / 0.9780
RCAN [RCAN] | 38.27 / 0.9614 | 34.12 / 0.9216 | 32.41 / 0.9027 | 33.34 / 0.9384 | 39.44 / 0.9786
SAN [SAN] | 38.31 / 0.9620 | 34.07 / 0.9213 | 32.42 / 0.9028 | 33.10 / 0.9370 | 39.32 / 0.9792
AdaRCAN | 38.28 / 0.9615 | 34.12 / 0.9216 | 32.41 / 0.9026 | 33.29 / 0.9380 | 39.44 / 0.9785

Scale x3:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
Bicubic | 30.39 / 0.8682 | 27.55 / 0.7742 | 27.21 / 0.7385 | 24.46 / 0.7349 | 26.95 / 0.8556
SRCNN [SRCNN] | 32.75 / 0.9090 | 29.30 / 0.8215 | 28.41 / 0.7863 | 26.24 / 0.7989 | 30.48 / 0.9117
VDSR [VDSR] | 33.67 / 0.9210 | 29.78 / 0.8320 | 28.83 / 0.7990 | 27.14 / 0.8290 | 32.01 / 0.9340
EDSR [EDSR] | 34.65 / 0.9280 | 30.52 / 0.8462 | 29.25 / 0.8093 | 28.80 / 0.8653 | 34.17 / 0.9476
AdaEDSR | 34.65 / 0.9288 | 30.57 / 0.8463 | 29.27 / 0.8091 | 28.78 / 0.8649 | 34.16 / 0.9482
RDN [RDN] | 34.71 / 0.9296 | 30.57 / 0.8468 | 29.26 / 0.8093 | 28.80 / 0.8653 | 34.13 / 0.9484
RCAN [RCAN] | 34.74 / 0.9299 | 30.65 / 0.8482 | 29.32 / 0.8111 | 29.09 / 0.8702 | 34.44 / 0.9499
SAN [SAN] | 34.75 / 0.9300 | 30.59 / 0.8476 | 29.33 / 0.8112 | 28.93 / 0.8671 | 34.30 / 0.9494
AdaRCAN | 34.79 / 0.9302 | 30.65 / 0.8481 | 29.33 / 0.8111 | 29.03 / 0.8689 | 34.49 / 0.9498

Scale x4:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
Bicubic | 28.42 / 0.8104 | 26.00 / 0.7027 | 25.96 / 0.6675 | 23.14 / 0.6577 | 24.89 / 0.7866
SRCNN [SRCNN] | 30.48 / 0.8628 | 27.50 / 0.7513 | 26.90 / 0.7101 | 24.52 / 0.7221 | 27.58 / 0.8555
VDSR [VDSR] | 31.35 / 0.8830 | 28.02 / 0.7680 | 27.29 / 0.7260 | 25.18 / 0.7540 | 28.83 / 0.8870
EDSR [EDSR] | 32.46 / 0.8968 | 28.80 / 0.7876 | 27.71 / 0.7420 | 26.64 / 0.8033 | 31.02 / 0.9148
AdaEDSR | 32.49 / 0.8977 | 28.76 / 0.7865 | 27.71 / 0.7410 | 26.58 / 0.8011 | 30.96 / 0.9150
RDN [RDN] | 32.47 / 0.8990 | 28.81 / 0.7871 | 27.72 / 0.7419 | 26.61 / 0.8028 | 31.00 / 0.9151
RCAN [RCAN] | 32.63 / 0.9002 | 28.87 / 0.7889 | 27.77 / 0.7436 | 26.82 / 0.8087 | 31.22 / 0.9173
SAN [SAN] | 32.64 / 0.9003 | 28.92 / 0.7888 | 27.78 / 0.7436 | 26.79 / 0.8068 | 31.18 / 0.9169
AdaRCAN | 32.61 / 0.8998 | 28.88 / 0.7883 | 27.77 / 0.7428 | 26.80 / 0.8067 | 31.22 / 0.9172
Efficiency comparison (FLOPs (G) / Time (ms)) on the five benchmark datasets.

Scale x2:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
SRCNN [SRCNN] | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2
VDSR [VDSR] | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8
EDSR [EDSR] | 1338.8 / 395.7 | 2552.2 / 630.7 | 1776.9 / 469.9 | 8041.1 / 2163.8 | 9891.4 / 2554.5
AdaEDSR | 650.6 / 312.3 | 1397.3 / 489.8 | 965.3 / 371.5 | 4844.9 / 1655.2 | 5208.4 / 1864.5
RDN [RDN] | 801.1 / 345.9 | 1527.3 / 617.1 | 1063.3 / 407.7 | 4811.9 / 2198.5 | 5919.2 / 3417.1
RCAN [RCAN] | 577.9 / 633.2 | 1101.8 / 813.6 | 767.0 / 607.0 | 3471.2 / 1955.0 | 4270.0 / 2342.9
SAN [SAN] | 3835.9 / 1276.0 | 17500.4 / 3314.0 | 3943.2 / 1637.8 | 372727.5 / N/A | 645359.4 / N/A
AdaRCAN | 469.1 / 614.5 | 925.5 / 751.9 | 649.2 / 606.3 | 2907.2 / 1749.2 | 3300.7 / 2034.2

Scale x3:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
SRCNN [SRCNN] | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2
VDSR [VDSR] | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8
EDSR [EDSR] | 699.1 / 251.0 | 1305.7 / 341.7 | 924.1 / 259.6 | 3984.0 / 957.9 | 4904.0 / 1271.1
AdaEDSR | 504.8 / 231.4 | 1013.5 / 302.0 | 722.8 / 232.0 | 3314.2 / 858.4 | 3695.9 / 1023.3
RDN [RDN] | 437.1 / 195.0 | 816.3 / 290.3 | 577.8 / 190.2 | 2490.9 / 1045.3 | 3066.1 / 1404.0
RCAN [RCAN] | 328.5 / 553.6 | 613.5 / 551.7 | 434.2 / 511.2 | 1872.0 / 1029.2 | 2304.2 / 1208.7
SAN [SAN] | 463.2 / 582.2 | 1930.1 / 992.9 | 517.5 / 600.7 | 36735.2 / 5416.2 | 61976.0 / 8194.4
AdaRCAN | 277.7 / 572.6 | 512.9 / 559.1 | 369.3 / 523.8 | 1596.3 / 968.1 | 1842.2 / 1107.1

Scale x4:
Method | Set5 | Set14 | B100 | Urban100 | Manga109
SRCNN [SRCNN] | 6.1 / 43.0 | 12.3 / 7.9 | 8.2 / 4.4 | 41.4 / 19.2 | 51.6 / 24.2
VDSR [VDSR] | 70.5 / 86.9 | 143.0 / 88.7 | 95.4 / 58.0 | 481.6 / 301.0 | 599.5 / 368.8
EDSR [EDSR] | 501.9 / 214.6 | 908.8 / 239.8 | 655.7 / 240.8 | 2699.4 / 640.0 | 3297.6 / 762.2
AdaEDSR | 371.7 / 181.1 | 716.8 / 215.4 | 508.5 / 195.2 | 2265.8 / 563.3 | 2588.4 / 656.0
RDN [RDN] | 337.9 / 128.3 | 611.9 / 163.7 | 441.5 / 132.9 | 1817.3 / 512.4 | 2220.9 / 646.2
RCAN [RCAN] | 270.1 / 546.9 | 489.0 / 505.7 | 352.8 / 490.0 | 1452.5 / 684.7 | 1774.4 / 843.8
SAN [SAN] | 159.4 / 482.7 | 522.2 / 568.5 | 190.9 / 445.2 | 7770.0 / 2258.0 | 12858.7 / 3174.3
AdaRCAN | 227.5 / 561.6 | 418.1 / 524.9 | 304.8 / 520.0 | 1263.0 / 659.8 | 1463.3 / 712.8
Model Training. For training our AdaDSR model, we use the 800 training images and the first five validation images of the DIV2K dataset [div2k] as the training and validation sets, respectively. The input and output images are in RGB color space, and the input images are obtained by the bicubic degradation model. Following previous works [SRResNet, EDSR, RCAN], during training we subtract the mean RGB value of the DIV2K dataset and apply data augmentation to the training images, including random horizontal flip, random vertical flip and rotation. The AdaDSR model is optimized by the Adam [adam] algorithm for 800 epochs, with 16 LR patches in each iteration. The learning rate decays by half after every 200 epochs. During training, the desired depth $\bar{d}$ is randomly sampled, where the maximum depth $D$ is 32 and 20 for AdaEDSR and AdaRCAN, respectively. Note that, since the data structure of the sparse convolution is identical to that of the standard convolution, we can use the pretrained backbone model to initialize the AdaDSR model to improve training stability and save training time.

Model Evaluation. Following previous works [SRResNet, EDSR, RCAN], we use PSNR and SSIM as evaluation metrics, and five standard benchmark datasets (i.e., Set5 [set5], Set14 [set14], B100 [b100], Urban100 [urban100] and Manga109 [manga109]) are employed as test sets. The PSNR and SSIM indices are calculated on the luminance channel (a.k.a. Y channel) of the YCbCr color space, with scale pixels on the boundary ignored. Furthermore, computational efficiency is evaluated by FLOPs and inference time. For a fair comparison with the competing methods, when counting the running time we implement all competing methods in our framework and replace the convolution layers of the main body with im2col [im2col] based convolutions. All evaluations are conducted in the PyTorch [pytorch] environment running on a single Nvidia TITAN RTX GPU. The source code and pretrained models are publicly available at https://github.com/csmliu/AdaDSR.

To evaluate the effectiveness of our AdaDSR model, we first compare AdaDSR (note that in Tables 1 and 2 the desired depth is set to 32 and 20 for AdaEDSR and AdaRCAN respectively, i.e., the number of residual blocks in EDSR and that of each group in RCAN) with the backbone EDSR [EDSR] and RCAN [RCAN] models as well as four other state-of-the-art methods, i.e., SRCNN [SRCNN], VDSR [VDSR], RDN [RDN] and SAN [SAN]. Note that all visual results of other methods given in this section are generated by the officially released models, while the FLOPs and inference time are evaluated in our framework.
As shown in Table 1, both AdaEDSR and AdaRCAN perform favorably against their counterparts EDSR and RCAN in terms of the quantitative PSNR and SSIM metrics. Besides, as can be seen from Table 2, although the adapter module introduces extra computation, it is very lightweight compared to the backbone super-resolution model, and its deployment greatly reduces the computation of the whole model, resulting in lower FLOPs and faster inference, especially on large images (e.g., Urban100 and Manga109). Note that SAN [SAN] performs on par with RCAN and AdaRCAN, yet its computational cost is prohibitive on large images.
Apart from the quantitative comparison, visual results are given in Fig. 4. One can see that AdaEDSR and AdaRCAN are able to generate super-resolved images of similar or better visual quality compared with their counterparts. Kindly refer to the supplementary materials for more qualitative results. We also show the pixel-wise depth map of AdaRCAN (due to space limits, we show the average of the depth maps of its 10 groups) to discuss the relationship between the processed image and the depth map. As can be seen from Fig. 4, greater depth is predicted for regions with detailed textures, while most of the computation in smooth areas can be omitted for efficiency, which is intuitive and verifies our discussion in Sec. 1.
Considering both quantitative and qualitative results, our AdaDSR achieves performance comparable to state-of-the-art methods while greatly reducing the amount of computation. For further analysis on the adaptive adjustment of the desired depth $\bar{d}$, please refer to Sec. 4.3.
Taking both the feature map and the desired depth as input, the adapter module is able to predict an image content adaptive network depth map while satisfying the computational efficiency constraints. Consequently, our AdaDSR can be flexibly tuned to meet various efficiency constraints on the fly. In comparison, the competing methods are based on deterministic inference and can only be performed with a fixed complexity. As shown in Fig. 5, we evaluate our AdaDSR model with different desired depths (i.e., 8, 16, 24, 32 for AdaEDSR and 5, 10, 15, 20 for AdaRCAN), and record the corresponding FLOPs and PSNR values on Set5. For more results, please refer to the supplementary materials.
From the figures, we can draw several conclusions. First, our AdaDSR can be tuned via the desired depth $\bar{d}$, resulting in a curve in the figures rather than a single point as for the competing methods. With an increasing desired depth $\bar{d}$, AdaDSR requires more computational resources and generates better super-resolved images. It is worth noting that AdaDSR taps the potential of the backbone models, and can obtain performance comparable to the well-trained backbone model when a higher $\bar{d}$ is set. Furthermore, AdaDSR reaches the saturation point with relatively lower FLOPs, which indicates that a shallower model is sufficient for most regions. Experiments on both versions (i.e., AdaEDSR and AdaRCAN) verify the effectiveness and generality of our adapter module.
Considering the training efficiency in a multi-GPU environment, we perform the ablation analysis with the EDSR backbone. Without loss of generality, we select the AdaEDSR model with scale ×2.
Method | PSNR (dB) | SSIM | FLOPs (G) | Time (ms)
EDSR (8) | 38.05 | 0.9607 | 408.23 | 147.2
EDSR (16) | 38.11 | 0.9610 | 718.41 | 230.7
EDSR (24) | 38.15 | 0.9612 | 1028.58 | 305.8
EDSR (32) | 38.16 | 0.9611 | 1338.76 | 395.7
FAdaEDSR (8) | 38.17 | 0.9609 | 504.87 | 280.6
FAdaEDSR (16) | 38.21 | 0.9611 | 719.62 | 327.7
FAdaEDSR (24) | 38.23 | 0.9613 | 997.95 | 366.8
FAdaEDSR (32) | 38.24 | 0.9613 | 1358.30 | 402.9
AdaEDSR (8) | 38.10 | 0.9605 | 329.50 | 169.6
AdaEDSR (16) | 38.17 | 0.9608 | 472.90 | 217.0
AdaEDSR (24) | 38.19 | 0.9610 | 574.85 | 243.8
AdaEDSR (32) | 38.21 | 0.9611 | 650.65 | 312.3
EDSR variants. To begin with, we train EDSR variants in our framework, i.e., EDSR (8), EDSR (16), EDSR (24) and EDSR (32), by setting the number of residual blocks to 8, 16, 24 and 32, respectively. Note that EDSR (32) performs slightly better than the released EDSR model, so we use it for a fair comparison. The quantitative results on Set5 are given in Table 3. Comparing all EDSR variants, one can generally observe performance gains as the model depth grows.
Besides, as previously illustrated in Fig. 1(c), a shallow model is sufficient for smooth areas, while regions with rich textures usually require a deep model to better reconstruct the details. Taking advantage of this phenomenon, AdaDSR, with its lightweight adapter, predicts a suitable depth for each area according to its difficulty and the resource constraints, and achieves a better efficiency-performance trade-off, yielding curves at the top left of their corresponding counterparts as shown in Figs. 1(d) and 5. Detailed data can be found in Table 3.
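This per-region depth selection can be emulated densely with masking: residual block i updates only the positions whose predicted depth exceeds i, while exhausted positions keep their previous value. The actual AdaDSR realizes the savings with im2col-based sparse convolution; the EDSR-style residual block below is a generic stand-in assumed for illustration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Generic EDSR-style residual block (illustrative stand-in)."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

def adaptive_forward(blocks, feat, depth_map):
    """Apply block i only where depth_map > i (dense emulation).

    Positions whose predicted depth is exhausted are frozen, so
    shallow regions skip the deeper blocks entirely."""
    for i, block in enumerate(blocks):
        mask = (depth_map > i).float()          # (B,1,H,W), 1 = still active
        feat = mask * block(feat) + (1 - mask) * feat
    return feat
```

This masked form computes every block densely and so saves no FLOPs by itself; it only reproduces the adaptive-depth output, which is why a sparse convolution is needed for real speedups.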
AdaEDSR variants. We further implement several AdaEDSR variants, i.e., FAdaEDSR (8), FAdaEDSR (16), FAdaEDSR (24) and FAdaEDSR (32), which are trained with a fixed depth of 8, 16, 24 and 32, respectively; their adapter modules take only the image features as input. These models are trained under the same settings as AdaEDSR (except for the fixed depth in the learning objective). As shown in Table 3, with the per-pixel depth map, these models obtain much better quantitative results than the EDSR variants at a similar computational cost.
It is worth noting that FAdaEDSR (32) achieves performance comparable to RDN [RDN], which clearly shows the effectiveness of the predicted network depth map. Furthermore, we also report the performance of our AdaEDSR model in Table 3, where AdaEDSR (d) denotes that the desired depth is set to d at test time. Although its quantitative performance is slightly worse than FAdaEDSR's, AdaEDSR is more computationally efficient and can be flexibly tuned in the testing phase, indicating that AdaDSR achieves adaptive inference with only a minor sacrifice in performance.
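The joint optimization of reconstruction and network depth losses mentioned above can be sketched as a two-term objective; the ℓ1 reconstruction term, the absolute-difference depth term, and the weight `lam` are assumptions on our part, not the paper's exact formulation:

```python
import torch

def adadsr_loss(sr, hr, depth_map, desired_depth, lam=0.1):
    """Joint objective (illustrative): reconstruction loss plus a
    depth loss steering the mean predicted depth toward the
    desired depth supplied at training time."""
    rec = torch.abs(sr - hr).mean()                       # L1 reconstruction
    depth = torch.abs(depth_map.mean() - desired_depth)   # depth constraint
    return rec + lam * depth
```

The fixed-depth FAdaEDSR variants correspond to training this objective with a single constant desired depth, while AdaEDSR samples it so the model generalizes across depth budgets.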
In this paper, we revisit the relationship between model depth and quantitative performance on the single image super-resolution task, and present AdaDSR, which incorporates a lightweight adapter module and sparse convolution into deep SISR networks. The adapter module predicts an image-content-oriented network depth map whose values are higher in regions with detailed textures and lower in smooth areas. According to the predicted depth, only a fraction of the residual blocks are executed at each position by using im2col-based sparse convolution. Furthermore, the adapter module takes the desired depth as an additional input and is thus adjustable on the fly, so that the AdaDSR model can be tuned to meet various efficiency constraints in the inference phase. Experimental results show the effectiveness and adaptability of our AdaDSR model, and indicate that AdaDSR obtains state-of-the-art performance while adapting to a range of efficiency requirements.