AIM 2020 Challenge on Real Image Super-Resolution: Methods and Results

09/25/2020 ∙ by Pengxu Wei, et al. ∙ HUAWEI Technologies Co., Ltd. NetEase, Inc. Los Alamos National Laboratory Microsoft Université Libre de Bruxelles ETH Zurich Harbin Institute of Technology IEEE Tencent QQ Hong Kong Polytechnic University Baidu, Inc. USTC NetEase, Inc SUN YAT-SEN UNIVERSITY 7

This paper introduces the real image Super-Resolution (SR) challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2020. This challenge involves three tracks to super-resolve an input image for ×2, ×3 and ×4 scaling factors, respectively. The goal is to attract more attention to realistic image degradation for the SR task, which is much more complicated and challenging, and contributes to real-world image super-resolution applications. 452 participants were registered for three tracks in total, and 24 teams submitted their results. They gauge the state-of-the-art approaches for real image SR in terms of PSNR and SSIM.




1 Introduction

Single image super-resolution (SR) reconstructs high-resolution (HR) images from low-resolution (LR) counterparts that suffer from image quality degradations [glasner2009super][yang2010image]. Instead of imposing higher requirements on hardware devices and sensors, it can be applied to many practical scenarios, such as video surveillance, satellite imaging and medical imaging. As a fundamental research topic, SR has attracted long-standing and considerable attention in the computer vision community.

With the emergence of deep learning, convolutional neural network (CNN) based SR methods inherit the powerful capacity of deep learning and have achieved remarkable performance improvements. Nevertheless, so far, this progress has been driven mainly by the supervised learning of models from LR images and their HR counterparts. Because bicubic downsampling is usually adopted to simulate the LR images, the learned deep SR models perform much less effectively in real-world SR applications, where the image degradation is much more complicated.

To mitigate this issue, several real SR datasets have recently been built, e.g., City100 [city100] and SR-RAW [zoomlearn]. The images in City100 were captured from printed postcards in an indoor environment, which limits their ability to capture the complicated image and degradation characteristics of natural scenes. The images in SR-RAW were collected in the real world, and a contextual bilateral loss was proposed to address the misalignment problem in the dataset. Besides, Cai et al. [cai2019toward] released another real image SR dataset, named RealSR, which was captured with two DSLR cameras, and proposed the LP-KPN method in a Laplacian pyramid framework. Considering the complex image degradation across different scenes and devices, a large-scale diverse real SR dataset, named DRealSR [CDC], was released to further promote research on real-world image SR. Images in DRealSR were captured by five different DSLR cameras and pose more challenging image degradation. In [CDC], the proposed component divide-and-conquer (CDC) model builds a baseline hourglass SR network (HGSR) in a stacked architecture, explores the different reconstruction difficulties of three low-level image components inspired by corner point detection, i.e., flat regions, edges and corner points, and trains the model with an intermediate supervision strategy. Besides, its gradient-weighted (GW) loss drives the model to adapt its learning objectives to the reconstruction difficulties of the three image components, and it can be flexibly applied to any SR model.

Jointly with the Advances in Image Manipulation (AIM) 2020 workshop, we organize the AIM Challenge on Real-world Image Super-Resolution. Specifically, this challenge concerns real-world single image super-resolution (SISR), which poses two challenging issues [CDC]: (1) degradation that is more complex than bicubic downsampling, and (2) diverse degradation processes among devices. The aim is to learn a generic model that super-resolves LR images captured in practical scenarios. To achieve this goal, paired LR and HR images are captured by various DSLR cameras and provided for training. They are randomly selected from the DRealSR dataset. Images for training, validation and testing are captured in the same way with the same set of cameras.

This challenge is one of the AIM 2020 associated challenges on: scene relighting and illumination estimation [elhelou2020aim_relighting], image extreme inpainting [ntavelis2020aim_inpainting], learned image signal processing pipeline [ignatov2020aim_ISP], rendering realistic bokeh [ignatov2020aim_bokeh], real image super-resolution [wei2020aim_realSR], efficient super-resolution [zhang2020aim_efficientSR], video temporal super-resolution [son2020aim_VTSR] and video extreme super-resolution [fuoli2020aim_VXSR].

2 AIM 2020 Challenge on Real Image Super-Resolution

The objectives of the AIM 2020 challenge on real image super-resolution are: (i) to further promote research on real image SR; (ii) to fully evaluate different SR approaches at different scale factors; (iii) to offer an opportunity for communication between academic and industrial participants.

2.1 DRealSR Dataset

DRealSR [CDC] is a large-scale real-world image super-resolution dataset (publicly available at Conquer-for-Real-World-Image-Super-Resolution). Only half of the images in DRealSR are randomly selected for this challenge. These images were captured with five DSLR cameras (i.e., Canon, Sony, Nikon, Olympus and Panasonic) in natural indoor and outdoor scenes without moving objects, e.g., advertising posters, plants, offices and buildings. The HR-LR image pairs are aligned. Registration on Codalab is required to access the training and validation data and to submit SR results. Details of the dataset in this challenge are given in Table 1.

Scale  Split       Type             Number  Size (LR)   Evaluation
×2     Train       Cropped Patches  19,000  380×380     PSNR, SSIM (on RGB channels)
       Validation  Aligned Images   20      2000×3000
       Test        Aligned Images   60
×3     Train       Cropped Patches  19,000  272×272
       Validation  Aligned Images   20      1300×2000
       Test        Aligned Images   60
×4     Train       Cropped Patches  19,000  192×192
       Validation  Aligned Images   20      1000×1250
       Test        Aligned Images   60
Table 1: Details of the dataset for the challenge

2.2 Track and Competition

Tracks. The challenge uses the newly released DRealSR dataset and has three tracks corresponding to the ×2, ×3 and ×4 upscaling factors. The aim is to obtain a network design or solution capable of producing high-quality results with the best fidelity to the reference ground truth.

Challenge phases. (1) Development phase: HR images from DRealSR have 4000×6000 pixels on average. For the convenience of model training, images are cropped into patches. LR image patches are 380×380 for the ×2 scale factor, 272×272 for the ×3 scale factor and 192×192 for the ×4 scale factor. (2) Testing phase: In the final test phase, participants have access to the LR images for the three tracks, submit their SR results to the Codalab evaluation server and email their code and factsheets to the organizers. The organizers checked all the SR results and the provided code to obtain the final results.

Team                 PSNR    SSIM   Score   Ensemble    ExtraData  Loss
Track 1 (×2)
Baidu                33.446  0.927  0.7736  Model+Self  False      + SSIM
CETC-CSKT            33.314  0.925  0.7702  Model+Self  False
OPPO_CAMERA          33.309  0.924  0.7699  Model+Self  False      + SSIM + MS-SSIM
AiAiR                33.263  0.924  0.7695  Model+Self  True       Clip
TeamInception        33.232  0.924  0.7690  Model+Self  True       + MS-SSIM + VGG
Noah_TerminalVision  33.289  0.923  0.7686  Self        False      adaptive robust loss
DeepBlueAI           33.177  0.924  0.7681  Self        False      /
ALONG                33.098  0.924  0.7674  Self        False      +
LISA-ULB             32.987  0.923  0.7659  /           False      + SSIM
lyl                  32.937  0.921  0.7635  /           False
GDUT-SL              32.973  0.920  0.7634  Model       False
MCML-Yonsei          32.903  0.919  0.7612  None        False
Kailos               32.708  0.920  0.7601  Self        False      + wavelet loss
qwq                  31.640  0.913  0.7436  None        False      + SSIM
debut_kele           31.236  0.889  0.7196  None        True       /
EDSR*                31.220  0.889  0.7194  /           /          /
RRDN_IITKGP          29.851  0.845  0.6696  None        True       /
Track 2 (×3)
Baidu                30.950  0.876  0.7063  Model+Self  False      + SSIM
CETC-CSKT            30.765  0.871  0.7005  Model+Self  False
OPPO_CAMERA          30.537  0.870  0.6966  Model+Self  False      + SSIM + MS-SSIM
Noah_TerminalVision  30.564  0.866  0.6941  Self        False      adaptive robust loss
MCML-Yonsei          30.477  0.866  0.6931  Self        False
TeamInception        30.418  0.866  0.6928  Model+Self  True       + MS-SSIM + VGG
ALONG                30.375  0.866  0.6922  Self        False      +
DeepBlueAI           30.302  0.867  0.6918  Self        False      /
lyl                  30.365  0.864  0.6905  /           False
Kailos               30.130  0.866  0.6900  Self        False      + wavelet loss
qwq                  29.266  0.852  0.6694  None        False      + SSIM
EDSR*                28.763  0.821  0.6383  /           /          /
anonymous            18.190  0.825  0.5357  /           False      /
Track 3 (×4)
Baidu                31.396  0.875  0.7099  Model+Self  False      + SSIM
ALONG                31.237  0.874  0.7075  Self        False      +
CETC-CSKT            31.123  0.874  0.7066  Model+Self  False
SR-IM                31.174  0.873  0.7057  Self        False      /
DeepBlueAI           30.964  0.874  0.7044  Self        False      /
JNSR                 30.999  0.872  0.7035  Model+Self  True       /
OPPO_CAMERA          30.86   0.874  0.7033  Model+Self  False      + SSIM + MS-SSIM
Kailos               30.866  0.873  0.7032  Self        False      + wavelet loss
SR_DLu               30.605  0.866  0.6944  Self        False      /
Noah_TerminalVision  30.587  0.866  0.6944  Self        False      adaptive robust loss
Webbzhou             30.417  0.867  0.6936  None        False      /
TeamInception        30.347  0.868  0.6935  Model+Self  True       + MS-SSIM + VGG
lyl                  30.319  0.866  0.6911  /           False
MCML-Yonsei          30.420  0.864  0.6906  Self        False
MoonCloud            30.283  0.864  0.6898  Model+Self  True       /
qwq                  29.588  0.855  0.6748  None        False      + SSIM
SrDance              29.595  0.852  0.6729  /           True       MAE+VGG+GAN loss
MLP_SR               28.619  0.831  0.6457  Self        True       GAN,TV,,SSIM,MS-SSIM,Cycle
EDSR*                28.212  0.824  0.6356  /           /          /
RRDN_IITKGP          27.971  0.809  0.6201  None        True       /
congxiaofeng         26.392  0.826  0.6187  None        False
Table 2: Evaluation results in the final testing phase. “Score” indicates the weighted score (Eq. 1), i.e., the evaluation metric for the challenge. For “Ensemble”, “Model” and “Self” indicate the model ensemble and the self-ensemble, respectively. “/” indicates that those items are not provided by participants. We also provide results of “EDSR*” for comparison with the same challenge dataset.

Evaluation protocol. The evaluation compares the super-resolved images with the reference ground-truth images. We use the standard peak signal-to-noise ratio (PSNR) and, complementarily, the structural similarity (SSIM) index, as often employed in the literature. PSNR and SSIM implementations are found in most image processing toolboxes. For each dataset, we report the average results (i.e., PSNR and SSIM) over all the processed images belonging to it, and rank submissions by the weighted value of normalized PSNR and SSIM, which is defined as follows,

$$\text{Score} = 0.5 \times \frac{\text{PSNR}}{50} + 0.5 \times \frac{\text{SSIM} - 0.4}{0.6}. \qquad (1)$$
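The ranking metric can be sketched in a few lines. This is a minimal sketch, not the organizers' evaluation code: it assumes the weighted score takes the form 0.5 · PSNR/50 + 0.5 · (SSIM − 0.4)/0.6, which reproduces the Score column of Table 2 to within rounding, and the function names and the 8-bit peak value are illustrative choices.

```python
import numpy as np


def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def weighted_score(mean_psnr, mean_ssim):
    """Equal-weight combination of normalized PSNR and SSIM.

    Assumed normalization: PSNR / 50 and (SSIM - 0.4) / 0.6.
    """
    return 0.5 * mean_psnr / 50.0 + 0.5 * (mean_ssim - 0.4) / 0.6
```

For example, `weighted_score(33.446, 0.927)` agrees with the 0.7736 listed for the top Track 1 entry up to rounding of the inputs.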


3 Challenge Results

There are 174, 128 and 168 registered participants for three tracks, respectively. In total, 24 teams submitted their super-resolution results; 10, 2 and 11 teams submitted results of one, two and three tracks, respectively. Among those submitted results of one track, seven teams are for scale factor. Details of final testing results are provided in Table 2. It mainly reports the final evaluation results and model training details.

According to the weighted score defined in Sec. 2.2, the leading entries for Tracks 1, 2 and 3 are all from team Baidu. For Tracks 1 and 2, the CETC-CSKT and OPPO_CAMERA teams take the second and third places, respectively. For Track 3, ALONG and CETC-CSKT take the second and third places, respectively. Among the solutions submitted to the challenge, some interesting trends can be observed, as follows.

Network Architecture. All the teams utilize deep neural networks for super-resolution. The architecture of the deep network greatly affects the quality of the super-resolved images. Several teams, e.g., TeamInception, construct a network with a residual structure to reduce the difficulty of optimization, while OPPO_CAMERA connects the input to the output with a trainable convolution layer. CETC-CSKT further proposes to pre-train the trainable layer in the skip branch. Several teams, such as DeepBlueAI and SR-IM, apply a channel attention module in their networks, while others, like TeamInception and Noah_TerminalVision, employ both spatial attention and channel attention at the feature level.

Data Augmentation. Most solutions conduct the data augmentation by randomly flipping and rotating images by 90 degrees. The newly proposed CutBlur method was employed by ALONG and OPPO_CAMERA and performance improvements are reported by these teams.
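The flip-and-rotate augmentation is simple, but for paired data the same transform must be applied to the LR patch and its HR counterpart. The sketch below illustrates this under the assumption that patches are NumPy arrays in (H, W, C) layout; `augment_pair` is a hypothetical helper, not code from any team.

```python
import numpy as np


def augment_pair(lr, hr, rng):
    """Apply one random flip / 90-degree rotation to both images of a pair."""
    if rng.random() < 0.5:          # horizontal flip
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    if rng.random() < 0.5:          # vertical flip
        lr, hr = lr[::-1], hr[::-1]
    k = int(rng.integers(0, 4))     # number of 90-degree rotations
    lr = np.rot90(lr, k, axes=(0, 1))
    hr = np.rot90(hr, k, axes=(0, 1))
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)
```

Because the two patches share every random draw, the spatial correspondence between LR and HR pixels is preserved.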

Ensemble Strategy. Most solutions adopt ×8 self-ensemble. Some solutions also perform model ensemble by fusing results from models with different training parameters, or even different architectures.
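A minimal sketch of ×8 self-ensemble follows: the input is transformed by the eight combinations of flip and 90-degree rotation, each version is super-resolved, the transforms are undone, and the results are averaged. It assumes `model` is any callable mapping an (H, W, C) array to its super-resolved (sH, sW, C) array; the function name is illustrative.

```python
import numpy as np


def self_ensemble_x8(model, lr):
    """Average model outputs over 4 rotations x {no flip, horizontal flip}."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            x = lr[:, ::-1] if flip else lr
            x = np.rot90(x, k, axes=(0, 1))
            y = model(x)
            # Undo the transforms in reverse order: rotation first, then flip.
            y = np.rot90(y, -k, axes=(0, 1))
            y = y[:, ::-1] if flip else y
            outputs.append(y)
    return np.mean(outputs, axis=0)
```

The cost is eight forward passes per image, which is why self-ensemble is used at test time only.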


All the teams used PyTorch to conduct their experiments, except one team that used TensorFlow.

4 Challenge Methods and Teams


The Baidu team proposed to apply a Neural Architecture Search (NAS) approach to select variants of their previous dense residual model as well as the RCAN model [zhihong2020aim]. To accelerate the search, Gaussian Process based Neural Architecture Search (GP-NAS) was applied as in [Li_2020_CVPR]. Specifically, given the hyper-parameters of GP-NAS, the performance of any architecture in the search space can be predicted effectively, so the NAS process is converted to hyper-parameter estimation. By mutual information maximization, the Baidu team can sample networks efficiently. Based on the performance of the sampled networks, the posterior distribution of the hyper-parameters is gradually and efficiently updated. From the estimated hyper-parameters, the architecture with the best performance can be obtained.

Figure 1: The dense residual network architecture of the Baidu team for image Super-Resolution

The backbone model of the proposed method is a deep dense residual network originally developed for raw image demosaicking and denoising. As depicted in Fig. 1, in addition to the shallow feature convolution at the front and the upsampler at the end, the proposed network consists of a stack of dense residual blocks (DRBs). The input convolution layer converts the 3-channel LR input to F-channel shallow features. Each middle DRB includes several stages of double convolution layers, and the outputs of all stages are concatenated before being convolved back to F channels. An additional channel-attention layer is included at the end of each block, similar to RCAN [RCAN]. Each block includes two types of skip connections, the block skip connection (BSC) and the inter-block skip connection (IBSC). The BSC is the shortcut between the input and output of a block, while the IBSC includes two shortcuts from the input of one block to the two stages inside another block. The various skip connections, especially the IBSC, combine features with a large range of receptive fields. The last block is an enhanced upsampler that transforms the F-channel LR features to the estimated 3-channel SR image. This dense residual network has three main hyper-parameters: the number of feature channels F, the number of DRB layers and the number of stages in each DRB. Together, these three hyper-parameters define the search space for NAS.

During training, a 120×120 patch is randomly cropped from each training image in each epoch and augmented with flipping and transposing. A mixed loss that includes a multi-scale structural similarity (MS-SSIM) term is used for training. For the experiments, the model candidate search scheme using GP-NAS was implemented in PaddlePaddle [Paddle], and the final training of the searched models was conducted using PyTorch. A multi-level ensemble scheme is proposed for testing, including self-ensemble for patches, as well as patch-ensemble and model-ensemble for full-size images. The proposed method proved highly effective, generating impressive testing results on all three tracks of the AIM 2020 Real Image Super-Resolution Challenge.


Figure 2: Framework of Adaptive Dense Connection Super Resolution reconstruction (ADCSR) for the CETC-CSKT team

The CETC-CSKT team proposed Adaptive Dense Connection Super Resolution reconstruction (ADCSR) [xie2019adaptive]. The algorithm is divided into BODY and SKIP. The BODY part improves the utilization of convolution features through adaptive dense connections. An adaptive sub-pixel reconstruction module (AFSC) is also proposed to reconstruct the features of the BODY output. Because SKIP is pre-trained in advance, the BODY part focuses on high-frequency feature learning. For Track 1 (×2), spatial attention is added after each residual block. The architecture is shown in Fig. 2. Self-ensemble is used as in EDSR [EDSR]. The test image is divided into pixel blocks for reconstruction. Finally, only input is used for splicing to reduce the edge difference of blocks.

The proposed ADCSR uses the first 18,900 training samples for training and holds out the last 100 as a validation set during training. The input image block size is . SKIP is trained separately, and then the entire network is trained jointly. The initial learning rate is . When the learning rate drops to , the training stops. loss is utilized to optimize the proposed model. The model is trained on four NVIDIA RTX 2080 Ti GPUs, with PyTorch 1.1.0, CUDA 10.0 and cuDNN 7.5.0 as the deep learning environment.


Figure 3: The detailed network architecture of the proposed network for the OPPO_CAMERA team
Figure 4: The proposed BRB and MAB for the OPPO_CAMERA team. The top of the figure shows the basic convolution structure of the proposed network with the dense connection. The middle of the figure shows the basic residual block. The bottom of the figure presents the channel attention mechanism of the network.

The OPPO_CAMERA team proposed a Self-Calibrated Attention Neural Network for Real-World Super Resolution. As shown in Fig. 3, the proposed model consists of four integral components, i.e., feature extraction, residual-in-residual deep feature extraction, upsampling and reconstruction. It employs the same residual structure and dense connections as DRLN [anwar2019densely]. A longer skip connection is also added to connect the input to the output with a trainable parameter, which greatly reduces the difficulty of optimization, so that the network pays more attention to learning the high-frequency parts of images. As shown in Fig. 4, three Basic Residual Blocks (BRBs) form a Large Residual Block (LRB) with dense connections. Self-Calibration convolution (SCC) [liu2020improving], shown at the top of Fig. 4, is adopted as the basic unit in order to expand the receptive field. Unlike conventional convolution, SCC enables each spatial location to interact with information from nearby regions and channels. Dense connections are established between the Self-Calibration convolution blocks (SCCBs), and each densely connected residual block has three SCCBs. To incorporate channel information efficiently, an attention block with multi-scale feature integration is added to every basic residual block, as in DRLN [anwar2019densely]. For network optimization, a pixel-wise loss is used. To improve fidelity, SSIM and MS-SSIM losses are also used as a structure loss. The total loss combines the pixel loss and the structure loss.

For training, the proposed method randomly splits the training data into a training set and a validation set with a ratio of 18500:500. Because it significantly improves results on the real-world SR task, CutBlur [yoo2020rethinking] is applied to augment the training images. The self-ensemble and parameter-fusion strategies noticeably improve the fidelity indices (PSNR and SSIM) and also reduce noise in the resulting images. Self-ensemble (×8) is used as explained in RCAN [zhang2018image], and the corresponding parameters of the last 3 models are fused to derive a fused model, as described in [shang2020perceptual]. Experiments are conducted on a Tesla V100 GPU.
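One common way to fuse the parameters of several checkpoints is to average them element-wise. The sketch below shows that idea only; the text does not state the exact fusion rule, so the uniform average is an assumption, and checkpoints are represented here as plain dictionaries of NumPy arrays rather than framework-specific state dicts.

```python
import numpy as np


def fuse_parameters(checkpoints):
    """Element-wise uniform average of matching parameters (assumed rule).

    checkpoints: list of dicts mapping parameter name -> array, all with
    identical keys and shapes.
    """
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}
```

Unlike model ensembling, the fused model costs a single forward pass at inference time.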


Figure 5: OADDet and Deep-OADDet for the AiAiR team.
Figure 6: Overall architectures of OADDet and Deep-OADDet for the AiAiR team.

The AiAiR team proposed Orientation-Aware Convolutions Meet Dual Path Enhancement Network (OADDet). Their method consists of four basic models (model ensemble): OADDet, Deep-OADDet, the original EDSR [Lim_2017_CVPR_Workshops] and the original DRLN [anwar2019densely]. The core modules of OADDet, illustrated in Figure 5, are borrowed from DDet [shi2020ddet], Inception [inception] and OANet [OANet] with minor modifications, such as fewer attention modules, removed skip connections and LeakyReLU in place of ReLU. The overall architectures are similar to DDet [shi2020ddet]. It is found that redundant attention modules damage performance and slow down training. Therefore, attention modules are only applied to the last few blocks of the backbone network and the last layer of the shallow network. Similar to RealSR [cai2019toward], PixelConv is also utilized, which contributes to dB improvement on the validation set.

  • The training process generally consists of four stages on three different datasets. The total training time is about 2000 GPU hours on V100.

  • OADDet models are trained from scratch, and DIV2K pre-trained EDSR/DRLN models are downloaded from the official links.

  • The DIV2K dataset is used to pre-train the OADDet models, and manually washed AIM2020 datasets are used to fine-tune all models (further details in GitHub README).

  • Four models are trained using three different strategies:

    1) For OADDet: Pre-training on DIV2K (300 epochs), then fine-tuning on the original AIM2020 x2 dataset (600 epochs) and the AIM2020 washed x2 dataset (100 epochs).

    2) For Deep-OADDet: Pre-training on DIV2K (30 epochs), then fine-tuning on the AIM2020 washed x2+x3 dataset (350 epochs), the AIM2020 washed x2 dataset (350 epochs) and the AIM2020 washed x2 dataset (100 epochs).

    3) For EDSR/DRLN: Starting from the DIV2K well-trained models, then fine-tuning on the washed AIM2020 x2 dataset (1000 epochs).

  • Self-ensemble (), model-ensemble (four models) and the proposed “crop-ensemble” are conducted (further details in GitHub README Reproduce x2 test dataset results).

  • OADDet enjoys a more stable and faster training process than OANet, which introduces too many attention modules at the early stage of the network. DDet proposes to use dynamic PixelConv with kernelsize=5,7,9; however, kernelsize=3,5,7 was found to work better during training and testing.


The TeamInception team proposed Learning Enriched Features for Real Image Restoration and Enhancement. MIRNet, recently introduced in [Zamir2020MIRNet], is utilized with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. As shown in Fig. 7, MIRNet, whose code is publicly available, has a multi-scale residual block (MRB) containing several key elements: (a) parallel multi-resolution convolution streams for extracting (fine-to-coarse) semantically-richer and (coarse-to-fine) spatially-precise feature representations, (b) information exchange across multi-resolution streams, (c) attention-based aggregation of features arriving from multiple streams, and (d) dual-attention units to capture contextual information in both spatial and channel dimensions.

The MRB consists of multiple (three in this work) fully-convolutional streams connected in parallel. It allows information exchange across parallel streams in order to consolidate the high-resolution features with the help of low-resolution features, and vice versa. Each component of MRB is described as follows.

Figure 7: Framework of the network MIRNet (recently introduced in [Zamir2020MIRNet]) for the TeamInception team.

Selective kernel feature fusion (SKFF). The SKFF module performs dynamic adjustment of receptive fields via two operations, Fuse and Select, as illustrated in Fig. 8. The fuse operator generates global feature descriptors by combining the information from multi-resolution streams. The select operator uses these descriptors to recalibrate the feature maps (of different streams) followed by their aggregation. Details of both operators for the three-stream case are elaborated as follows. (1) Fuse: SKFF receives inputs from three parallel convolution streams carrying different scales of information. We first combine these multi-scale features using an element-wise sum as: $L = L_1 + L_2 + L_3$. We then apply global average pooling (GAP) across the spatial dimension of $L \in \mathbb{R}^{H \times W \times C}$ to compute channel-wise statistics $s \in \mathbb{R}^{1 \times 1 \times C}$. Next, a channel-downscaling convolution layer is used to generate a compact feature representation $z \in \mathbb{R}^{1 \times 1 \times r}$, where $r = C/8$ for our experiments. Finally, the feature vector $z$ passes through three parallel channel-upscaling convolution layers (one for each resolution stream) and provides us with three feature descriptors $v_1$, $v_2$ and $v_3$, each with dimensions $1 \times 1 \times C$. (2) Select: this operator applies the softmax function to $v_1$, $v_2$ and $v_3$, yielding attention activations $s_1$, $s_2$ and $s_3$ that we use to adaptively recalibrate multi-scale feature maps $L_1$, $L_2$ and $L_3$, respectively. The overall process of feature recalibration and aggregation is defined as: $U = s_1 \cdot L_1 + s_2 \cdot L_2 + s_3 \cdot L_3$. Note that the SKFF uses fewer parameters than aggregation with concatenation but generates more favorable results.
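The Fuse and Select operators can be sketched with plain array operations, since a 1×1 convolution applied to a pooled channel vector reduces to a matrix product. This is an illustrative sketch rather than the MIRNet implementation: the weight matrices `w_down` and `w_up`, the ReLU after the downscaling layer and the function names are assumptions.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def skff(streams, w_down, w_up):
    """Selective kernel feature fusion over same-shape feature maps.

    streams: list of (H, W, C) arrays, one per resolution stream
    w_down:  (C, r) channel-downscaling weights
    w_up:    list of (r, C) channel-upscaling weights, one per stream
    """
    fused = np.sum(streams, axis=0)            # Fuse: element-wise sum
    s = fused.mean(axis=(0, 1))                # global average pooling -> (C,)
    z = np.maximum(s @ w_down, 0.0)            # compact descriptor (ReLU assumed)
    v = np.stack([z @ w for w in w_up])        # one descriptor per stream, (n, C)
    attn = softmax(v, axis=0)                  # Select: softmax across streams
    out = sum(a[None, None, :] * f for a, f in zip(attn, streams))
    return out, attn
```

The softmax is taken across streams for each channel, so the per-channel attention weights of the streams sum to one.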

Dual attention unit (DAU). While the SKFF block fuses information across multi-resolution branches, we also need a mechanism to share information within a feature tensor, both along the spatial and the channel dimensions. The dual attention unit (DAU) is proposed to extract features in the convolutional streams. The schematic of DAU is shown in Fig. 9. The DAU suppresses less useful features and only allows more informative ones to pass further. This feature recalibration is achieved by using channel attention [hu2018squeeze] and spatial attention [woo2018cbam] mechanisms. (1) The channel attention (CA) branch exploits the inter-channel relationships of the convolutional feature maps by applying squeeze and excitation operations [hu2018squeeze]. Given a feature map $M \in \mathbb{R}^{H \times W \times C}$, the squeeze operation applies global average pooling across spatial dimensions to encode global context, thus yielding a feature descriptor $d \in \mathbb{R}^{1 \times 1 \times C}$. The excitation operator passes $d$ through two convolutional layers followed by sigmoid gating and generates activations $\hat{d} \in \mathbb{R}^{1 \times 1 \times C}$. Finally, the output of the CA branch is obtained by rescaling $M$ with the activations $\hat{d}$. (2) The spatial attention (SA) branch is designed to exploit the inter-spatial dependencies of convolutional features. The goal of SA is to generate a spatial attention map and use it to recalibrate the incoming features $M$. To generate the spatial attention map, the SA branch first independently applies global average pooling and max pooling operations on features $M$ along the channel dimension and concatenates the outputs to form a feature map $f \in \mathbb{R}^{H \times W \times 2}$. The map $f$ is passed through a convolution and sigmoid activation to obtain the spatial attention map $\hat{f} \in \mathbb{R}^{H \times W \times 1}$, which is used to rescale $M$.
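The two attention branches can be illustrated with a small NumPy sketch. It is not the team's implementation: the weights `w1`, `w2`, `w_avg` and `w_max` are hypothetical, the hidden ReLU is assumed, and the spatial branch uses a 1×1 convolution over the two pooled maps as a stand-in for whatever kernel size the network actually uses. How the two branch outputs are merged is not covered here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style gating of an (H, W, C) feature map."""
    d = feat.mean(axis=(0, 1))            # squeeze: global average pooling, (C,)
    hidden = np.maximum(d @ w1, 0.0)      # first 1x1 convolution (ReLU assumed)
    gate = sigmoid(hidden @ w2)           # second 1x1 convolution + sigmoid, (C,)
    return feat * gate                    # rescale every channel


def spatial_attention(feat, w_avg, w_max):
    """Gate each spatial location using channel-wise average and max pooling."""
    avg_map = feat.mean(axis=2)           # (H, W)
    max_map = feat.max(axis=2)            # (H, W)
    # 1x1 convolution over the 2-channel concatenation (simplifying assumption)
    gate = sigmoid(w_avg * avg_map + w_max * max_map)
    return feat * gate[..., None]         # rescale every location
```

Both branches keep the input shape, so their outputs can be combined downstream by any shape-preserving operation.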

Figure 8: Schematic for selective kernel feature fusion (SKFF) for the TeamInception team. It operates on features from multiple convolutional streams.
Figure 9: Dual attention unit incorporating spatial and channel attention mechanisms for the TeamInception team.

For training, the loss combines several terms, including multi-scale SSIM and VGG loss functions (Eq. 2). The VGG loss uses the features of the conv2 layer after ReLU in the pre-trained VGG-16 network. Three RRGs are utilized, each of which contains MRBs. MRB consists of parallel streams with channel dimensions of at resolutions , respectively. Each stream has DAUs. Patches with the size of are cropped. Horizontal and vertical flips are employed for data augmentation. The model is trained from scratch with the Adam optimizer (, and ) for iterations. The initial learning rate is and the batch size is . The cosine annealing strategy is employed to steadily decrease the learning rate from the initial value to during training.
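Cosine annealing without restarts follows a half-cosine curve from the initial learning rate down to a minimum value. The sketch below shows the schedule itself; the function name is illustrative, and the learning rates and step counts are caller-supplied because the team's actual values are not given here.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_init, lr_min):
    """Learning rate at `step` for a single cosine decay from lr_init to lr_min."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate changes slowly at the start and end of training and fastest in the middle, which avoids the abrupt drops of step schedules.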

At inference time, the self-ensemble strategy [2] is employed. For each test image, a set of 8 images is created: the original, its flipped version, the three versions rotated by 90°, 180° and 270°, and the flipped version of each rotation. These transformed images are passed through the model to obtain super-resolved outputs; the transformations are then undone and the outputs are averaged to obtain the final image. To fuse results, three variants of the proposed network are trained with different loss functions (Eq. 2): (1) only the first term, (2) the first two terms, and (3) all the terms. For the variant 2, and ; for the variant 3, and , .

Given an image, the generated self-ensembled results with each of these three networks are averaged to obtain the final image. Results with self-ensemble strategy and fusion are reported in Table 3. With 4 Tesla-V100 GPUs, it takes 3 days to train the network. The time required to process a test image of size is 2 seconds (single method), 30 seconds (self-ensemble) and 87 seconds (fusion).

Method   PSNR
SM       29.72
SM       29.83
SM       29.89
SM + F   30.08
SE + F   30.25
Table 3: Results on the validation set for the scale factor for the TeamInception team. Comparison of the single method (SM), self-ensemble (SE) and fusion (F) on the validation set.


The Noah_TerminalVision team proposed Super Resolution with weakly-paired data using an Adaptive Robust Loss. The network is based on RRDBNet with 23 Residual-in-Residual Dense Blocks. To alleviate the adverse effect of misalignment in the training data, which are not perfectly aligned, the Adaptive Robust Loss Function proposed by Jon Barron [barron2019general] is used to address this weakly-paired training problem. For Track 3, the network additionally uses a spatial attention module and an efficient channel attention module. The spatial attention module is borrowed from EDVR [wang2019edvr] and the efficient channel attention module is borrowed from ECA-Net [wang2020eca]. The self-ensemble strategy runs inference on the 90/180/270-degree rotated versions of the original and flipped inputs and then averages the results.
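The general robust loss of [barron2019general] has a closed form controlled by a shape parameter α and a scale c. The sketch below evaluates it for a fixed α; in the adaptive variant α is learned together with the network, which is not shown. The function name and the choice to special-case only α = 2 and α = 0 are illustrative.

```python
import numpy as np


def general_robust_loss(x, alpha, c):
    """General robust loss on residual x with shape alpha and scale c > 0.

    alpha = 2 gives a scaled L2 loss, alpha = 1 a Charbonnier-like loss,
    alpha = 0 a Cauchy (log) loss; smaller alpha down-weights large residuals.
    """
    x = np.asarray(x, dtype=np.float64)
    z = (x / c) ** 2
    if alpha == 2.0:                      # limit as alpha -> 2
        return 0.5 * z
    if alpha == 0.0:                      # limit as alpha -> 0
        return np.log1p(0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)
```

For misaligned training pairs, a smaller α limits the influence of the large residuals caused by misalignment, which is the motivation for using this loss with weakly-paired data.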

Only training pairs with a high PSNR score (29) were used for training. The learning rate is 2e-4, the patch size of inputs is and the batch size is 4. The CosineAnnealingLR_Restart learning rate scheme is employed and the restart period is 250,000 steps. Due to GPU memory constraints, each input image is tested patch-wise. The crop window is of size 120×120, and a stride of 110×110 was used to collect patches.
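Patch-wise testing with a 120-pixel window and a 110-pixel stride can be sketched as follows. This is an illustration, not the team's code: averaging in the overlapping regions is an assumption (the merging rule is not stated), and `model` is any callable that maps an (h, w, C) patch to its (h·scale, w·scale, C) output.

```python
import numpy as np


def tile_starts(length, window, stride):
    """Start offsets that cover [0, length) with windows of the given size."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)    # final window flush with the border
    return starts


def tiled_super_resolve(model, lr, scale, window=120, stride=110):
    """Super-resolve an (H, W, C) image patch by patch, averaging overlaps."""
    h, w, c = lr.shape
    acc = np.zeros((h * scale, w * scale, c), dtype=np.float64)
    weight = np.zeros((h * scale, w * scale, 1), dtype=np.float64)
    for top in tile_starts(h, window, stride):
        for left in tile_starts(w, window, stride):
            patch = lr[top:top + window, left:left + window]
            ph, pw = patch.shape[:2]
            ys, xs = top * scale, left * scale
            acc[ys:ys + ph * scale, xs:xs + pw * scale] += model(patch)
            weight[ys:ys + ph * scale, xs:xs + pw * scale] += 1.0
    return acc / weight
```

With these settings, neighboring windows overlap by 10 LR pixels, which helps hide seams at patch borders.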


The DeepBlueAI team proposed a solution based on RCAN [RCAN], implemented in PyTorch. In the RIR structure, G=10 and C=128, and each RG contains 20 RCABs. The model is trained from scratch, which takes about 4 days on 4×32G Tesla V100 GPUs. For training, all the training images are augmented by random horizontal flips and 90° rotations. In each training batch, LR color patches of size 64×64 are extracted as inputs. The initial learning rate is set to and the learning rate of each parameter group uses a cosine annealing schedule with total iterations and without restart. For testing, each low-resolution image is flipped and rotated to generate seven augmented inputs, and the trained RCAN model generates the corresponding super-resolved images. An inverse transform is applied to those output images to restore the original geometry. The transformed outputs are averaged to yield the self-ensemble result.


The ALONG team proposed a Dual Path Network with high-frequency guidance for real-world image super-resolution. The proposed method follows the main structure of RCAN [RCAN] and utilizes the guided filter to extract the detail layer and restore high-frequency details. As illustrated in Figure 10, the original feature extraction path with channel attention contains many share-source skip connections. These connections let the abundant low-frequency information bypass the main path, which facilitates training a deeper network. Compared with previous simulated datasets, the image degradation process for real SR is much more complicated: low-resolution images lose more high-frequency information and look blurry. Inspired by image deblurring work [wang2019edvr, zhou2019davanet, zhou2019spatio], a pre-deblur module is used before the residual groups to pre-process blurry inputs and improve super-resolution accuracy. Specifically, the input image is first down-sampled with strided convolution layers; an upsampling layer at the end then resizes the features back to the original input resolution. The proposed dual path network restores fine details through an additional branch that focuses on high-frequency reconstruction. The input LR image is decomposed with the guided filter, an edge-preserving low-pass filter [he2012guided], to obtain the detail layer. A high-frequency module is then applied to the detail layer, so that the output focuses on restoring high-frequency details.
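To make the base/detail decomposition concrete, here is a minimal sketch of a self-guided filter on a single-channel image, with the detail layer taken as the input minus the filtered output. Using the image as its own guide, the radius `r` and the regulariser `eps` are assumptions; the team's actual settings are not given.

```python
import numpy as np

def box_filter(x, r):
    """Mean over a (2r+1)x(2r+1) window with edge padding, via an integral image."""
    k = 2 * r + 1
    p = np.pad(x, r, mode="edge")
    s = np.pad(np.cumsum(np.cumsum(p, axis=0), axis=1), ((1, 0), (1, 0)))
    return (s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]) / (k * k)

def guided_decompose(img, r=4, eps=1e-2):
    """Split a grayscale image into a base layer and a detail layer."""
    img = np.asarray(img, dtype=np.float64)
    mean = box_filter(img, r)
    var = box_filter(img * img, r) - mean * mean
    a = var / (var + eps)      # near 1 around edges, near 0 in flat regions
    b = (1.0 - a) * mean
    base = box_filter(a, r) * img + box_filter(b, r)
    return base, img - base    # detail layer holds the high-frequency residue
```

In flat regions the base layer reduces to the local mean, so the detail layer there is close to zero, while edges are largely preserved in the base layer.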

Besides, a variety of data augmentation strategies are combined to achieve competitive results in different tracks, including Cutout [devries2017improved], CutMix [yun2019cutmix], Mixup [zhang2017mixup], CutMixup, RGB permutation and Blend. In addition, inspired by [yoo2020rethinking], CutBlur is used; unlike Cutout, it can utilize the entire image information while enjoying a regularization effect due to the varied samples of random HR ratios and locations. The experimental results also show that a reasonable combination of data augmentations can improve the model performance without additional computation cost in the test phase. The model is trained with 8 2080Ti GPUs with 11G memory each. Pseudo ensemble is also employed: the inputs are flipped/rotated and the HR results are aligned and averaged for enhanced prediction.
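A minimal sketch of the CutBlur idea mentioned above: a randomly placed rectangle from the HR image is pasted into the upsampled LR image, so the network sees inputs with mixed resolution. Only one direction (HR into LR) is shown, and the function name, the `ratio` argument and the returned box are illustrative assumptions.

```python
import numpy as np

def cutblur(lr_up, hr, ratio, rng):
    """Paste a random HR rectangle into the upsampled LR image.

    `lr_up` and `hr` must share the same (H, W, C) shape; `ratio` is the
    side-length fraction of the pasted rectangle.
    """
    h, w = hr.shape[:2]
    ch, cw = int(h * ratio), int(w * ratio)
    y = int(rng.integers(0, h - ch + 1))   # random top-left corner
    x = int(rng.integers(0, w - cw + 1))
    out = lr_up.copy()
    out[y:y + ch, x:x + cw] = hr[y:y + ch, x:x + cw]
    return out, (y, x, ch, cw)
```

Unlike Cutout, no pixels are erased: every location still carries image content, either at LR or at HR quality.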

Figure 10: RCAN for the Real Image Super-Resolution (RCANv2) for the ALONG team.


Figure 11: The architecture of the proposed network by the LISA-ULB team.

The LISA-ULB team proposed VCycles BackProjection networks generation two (VCBPv2), which utilizes an iterative error-correcting feedback mechanism to guide the reconstruction of the final SR output. As shown in Figure 11, the proposed network is composed of an outer loop of 10 cycles and an inner loop of 3 cycles. The input of the proposed VCBPv2 is the LR image and its upsampled counterpart. The upsample and downsample modules iteratively transform features between high- and low-resolution space as residuals for error correction. The decoder at the end reconstructs the SR image from the corrected features.

The model is trained using AdamW optimizer with learning rate of and halved at every 400 epochs, then the training is followed by SGDM optimizer. Equally weighted and SSIM loss is adopted for training.


The lyl team proposed a coarse-to-fine network (CFN) for progressive super-resolution. As shown in Fig.12, based on the Laplacian pyramid framework, the proposed model takes an LR image as input and progressively predicts residual images at levels. is the scale factor, , where .

was adopted to optimize the proposed network. Each level of the proposed CFN was supervised by different scales of HR images.

Figure 12: The architecture of the proposed network by the lyl team.


The GDUT-SL team used the RRDBNet of ESRGAN [ESRGAN] to perform super-resolution. A typical RRDB block has 3 dense blocks, each of which includes 5 Conv layers with Leaky-ReLU and no BN layers. The RRDB number was set to 23. Two UpConv layers are used for upsampling. Different from ESRGAN, the GDUT-SL team replaced the activation function with ReLU to obtain better PSNR results.

Residual scaling and smaller initialization were adopted to facilitate training a deep architecture. In the training phase, the mini-batch size was set to 16, with an image size of 96×96. 20 promising models were selected for the model ensemble.


Figure 13: Overview of the network for the MCML-Yonsei team.
Figure 14: Resblock with Attention Module for the MCML-Yonsei team.

As shown in Fig.13, the MCML-Yonsei team proposed an attention-based multi-scale deep residual network based on MDSR [EDSR], which shares most of the parameters across different scales. In order to utilize various features in each real image adaptively, the MCML-Yonsei team added an attention module to the existing Resblock. As shown in Fig.14, the attention module is based on MAMNet [MAMNet], where the global variance pooling was replaced with total variation pooling.
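One plausible reading of total variation pooling is a per-channel statistic that sums the absolute differences between neighbouring activations. The sketch below follows that reading; the anisotropic form and the normalisation by the spatial size are assumptions, since the exact definition is not given.

```python
import numpy as np

def total_variation_pool(feat):
    """Per-channel anisotropic total variation of a (C, H, W) feature map -> (C,)."""
    dh = np.abs(feat[:, 1:, :] - feat[:, :-1, :]).sum(axis=(1, 2))  # vertical diffs
    dw = np.abs(feat[:, :, 1:] - feat[:, :, :-1]).sum(axis=(1, 2))  # horizontal diffs
    return (dh + dw) / (feat.shape[1] * feat.shape[2])
```

Like global variance pooling, this yields one scalar per channel, but it responds to local edges and texture rather than to the overall spread of values.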

They initialized all parameters except the attention module with the pre-trained MDSR, which was optimized for bicubic downsampling based training data. The mini-batch size was set to 16 and the patch size was set to 48. They subtracted the mean of each R, G, B channel of the train set for data normalization. The learning rate was initially set to , and it decayed at the 15k steps. The total training step was 20k.


Figure 15: Overview of the proposed method for the kailos team.

The kailos team proposed RRDB Network with Attention mechanism using Wavelet loss for Single Image Super-Resolution. The loss function consisted of conventional loss and novel wavelet loss . The conventional loss is given as , where is reconstructed image and is ground truth image.

A wavelet transform can separate the signal features along the low and high frequency components. Most of the energy distribution in the signal, such as global structure and color distribution, is concentrated in the low frequency components. On the other hand, the high frequency components include signal patterns and image textures. Since both frequency components have different characteristics, a different loss function must be applied to each component. Therefore, the proposed novel wavelet loss is the sum of loss for high frequency components and loss for low frequency components given as , , and , where denotes the stage of wavelet transform and and are high and low frequency decomposition filters, respectively.

In the experiment, is 2 and Haar wavelet filters are used as wavelet decomposition filters. Therefore, a total loss is defined by , where denotes the regularization parameter and was used in the proposed method. Fig.16 shows an overview of the proposed method. Adam optimizer was used in training process, and the size of image patch was the quarter size of training data.
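The structure of such a wavelet loss can be sketched with a two-stage Haar decomposition. The norms applied to each band are not recoverable from the text, so the choice below (mean absolute error on the high-frequency sub-bands, mean squared error on the low-frequency sub-band, summed over stages with equal weight) is purely an assumption for illustration.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal 2-D Haar transform (H and W must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    highs = ((a - b + c - d) / 2.0,   # horizontal detail
             (a + b - c - d) / 2.0,   # vertical detail
             (a - b - c + d) / 2.0)   # diagonal detail
    return low, highs

def wavelet_loss(pred, target, stages=2):
    """Sum of per-stage high- and low-frequency losses (norms are assumed)."""
    low_p = np.asarray(pred, dtype=np.float64)
    low_t = np.asarray(target, dtype=np.float64)
    loss = 0.0
    for _ in range(stages):
        low_p, highs_p = haar_step(low_p)
        low_t, highs_t = haar_step(low_t)
        loss += sum(np.mean(np.abs(hp - ht)) for hp, ht in zip(highs_p, highs_t))
        loss += np.mean((low_p - low_t) ** 2)
    return loss
```

A constant brightness offset between prediction and target only affects the low-frequency term, while texture errors show up in the high-frequency terms.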


Figure 16: The total learning diagram for the qwq team. In the upsample network, they used features from 0.25, 1, 2 and 4(HR) five scales.

The qwq team proposed a Multi-Scale Network based on RCAN [RCAN]. As shown in Fig[1], the multi-scale mechanism was integrated into the base block of RCAN in order to enlarge the receptive field. Dual Loss was adopted for training. MixCorrupt augmentation, which is specially designed for the real-world scenario, was also applied, since it allows the network to learn robust SR results from different degradations.


The RRDN_IITKGP team used a GAN-based Residual in Residual Dense Network [ESRGAN], where the model is pre-trained on another dataset and evaluated on the challenge dataset.


Figure 17: Structure of Frequency-aware Network (FAN) for the SR-IM team. There are three branches, representing the high frequency, middle frequency and low frequency components. The gate attention is used to adaptively select the required frequency components.

The SR-IM team proposed a frequency-aware network (FAN), as shown in Fig.17. A hierarchical feature extractor (HFE) is utilized to extract high-, middle- and low-frequency representations. The basic unit of the body consists of a residual dense block and a channel attention module. Finally, the three branches are fused into one super-resolved image by the gate and fusion module.

The mini-batch size was set to 8 and the patch size was set to 160 during training. They used Adam optimizer with an initial learning rate of 0.0001. The learning rate decayed by a factor of 0.5 every 30 epochs. The entire training time is about 48 hours.
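The step-decay schedule described above can be written as a one-line function. The sketch assumes that epochs are counted from zero and that the decay is applied exactly at multiples of 30; the function name is illustrative.

```python
def step_lr(epoch, base_lr=1e-4, gamma=0.5, step=30):
    """Learning rate after `epoch` epochs when it is halved every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

For example, epochs 0 to 29 use 1e-4, epochs 30 to 59 use 5e-5, and epochs 60 to 89 use 2.5e-5.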


The JNSR team utilized EDSR [EDSR] and DRLN [anwar2019densely] to perform model ensemble. Both EDSR and DRLN were trained on the AIM2020 dataset, and the best models were chosen for the ensemble.


The SR_DL team proposed an attention back projection network (ABPN++), as shown in Fig.18. The proposed ABPN++ network first conducts feature extraction to expand the feature space of the input LR image. Then the densely connected enhanced down- and up-sampling back projection blocks perform up- and down-sampling of the feature maps. The Cross-scale Attention Block (CAB) takes the outputs from the down-sampling back projection blocks to compute the cross-correlation for feature fusion. Finally, the Refined Back Projection Block works as a final refinement that estimates the feature residuals between the input LR and predicted LR images for update. The complete network includes 10 down- and up-sampling back projection blocks, 2 feature extraction blocks and 1 refined back projection block. Each back projection block is made of 5 convolutional layers. The kernel number is 32 for all convolution and deconvolution layers. For the down- and up-sampling convolution layers, the kernel size is 6, the stride is 4 and the padding is 1.
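The kernel, stride and padding values above can be checked against the standard output-size formulas for strided convolution and transposed convolution. The sketch assumes no dilation and no output padding, and the helper names are illustrative.

```python
def conv_out(size, kernel=6, stride=4, padding=1):
    """Spatial size after a strided convolution (no dilation assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel=6, stride=4, padding=1):
    """Spatial size after a transposed convolution (no output padding assumed)."""
    return (size - 1) * stride - 2 * padding + kernel
```

Under these assumptions a 48-pixel feature map is up-sampled to 192 pixels and a 192-pixel map is down-sampled back to 48, which matches a ×4 back projection step.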

The mini-batch size was set to 16 and the LR patch size was set to 48 during training. The learning rate is fixed to 1e-4 for all layers for iterations in total as the first stage. Then the batch size increases to 32 for iterations as fine-tuning.

Figure 18: (a): ABPN++: Attention based Back Projection Network for image super-resolution. (b): the proposed Cross-scale Attention Block by the SR_DL team.


The Webbzhou team fine-tuned the pre-trained RRDB [ESRGAN] on the challenge dataset.


The MoonCloud team utilized RCAN [RCAN] for the challenge. A total of six models were used for model ensemble. Three of them were trained on the challenge dataset with a scale of 4. The other three were first trained on the challenge dataset with a scale of 3 and then fine-tuned on the dataset with a scale of 4. The final outputs were obtained by averaging the outputs of these six models.


The SrDance team utilized RRDB [ESRGAN]. A new training strategy was adopted for model optimization. The model was first pre-trained on the DIV2K dataset. Then they trained their model by randomly picking one image in the dataset and randomly cropping a few patches from it, which is similar to stochastic gradient descent. Second, when the model stepped, they trained on 10 pictures, with one patch taken from each picture and fed to the model.


Figure 19: Illustration of the structure of SR approach setup proposed by the MLP_SR team.

The MLP_SR team proposed Deep Cyclic Generative Adversarial Residual Convolutional Networks for Real Image Super-Resolution, as shown in Fig.19. The SR generator network [Umer2020DeepGA] was trained in a GAN framework by using the LR images with their corresponding HR images, with pixel-wise supervision in the clean HR target domain, while maintaining the cyclic consistency between the LR and HR domains.


The congxiaofeng team proposed RDB-P SRNet, which contains several residual-dense blocks with pixel shuffle for upsampling. The network was inspired by RDN [RDN].


The debut_kele team proposed Enhanced Deep Residual Networks for real image super-resolution.


We thank the AIM 2020 sponsors: Huawei, MediaTek, Google, NVIDIA, Qualcomm and Computer Vision Lab (CVL) ETH Zurich. This work was partially supported by the National Key Research and Development Project, the Fundamental Research Funds for the Central Universities under Grant No.19lgpy228, and the China Postdoctoral Science Foundation (2020M672968).

A. Teams and affiliations

AIM2020 team

Title: AIM 2020 Real Image Super-Resolution Challenge


Members: Pengxu Wei, Hannan Lu, Radu Timofte, Liang Lin, Wangmeng Zuo


Sun Yat-sen University

Harbin Institute of Technology University

Computer Vision Lab, ETH Zurich, Switzerland


Title: Real Image Super Resolution via Heterogeneous Model Ensemble using GP-NAS

Members: Zhihong Pan, Baopu Li, Teng Xi, Yanwen Fan, Gang Zhang, Jingtuo Liu, Junyu Han, Errui Ding


Baidu Research (USA)

Department of Computer Vision Technology (VIS), Baidu Incorporation


Title: Adaptive dense connection super resolution reconstruction

Members: Tangxin Xie, Yi Shen, Jialiang Zhang, Yu Jia, Liang Cao, Yan Zou

Affiliation: China Electronic Technology Cyber Security Co., Ltd.


Title: Self-Calibrated Attention Neural Network for Real-World Super Resolution

Members: Kaihua Cheng, Chenhuan Wu

Affiliation: Guangdong OPPO Mobile Telecommunications Corp., Ltd


Title: Dual Path Network with High Frequency Guided for Real World Image Super-Resolution

Members: Yue Lin, Cen Liu, Yunbo Peng

Affiliation: NetEase Games AI Lab


Title: Super Resolution with weakly-paired data using an Adaptive Robust Loss

Members: Xueyi Zou

Affiliation: Noah’s Ark Lab, Huawei


Title: A solution based on RCAN

Members: Zhipeng Luo, Yuehan Yao, Zhenyu Xu

Affiliation: DeepBlue Technology (Shanghai) Co., Ltd


Title: Learning Enriched Features for Real Image Restoration and Enhancement

Members: Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan


Inception Institute of Artificial Intelligence (IIAI)


Title: Multi-scale Dynamic Residual Network Using Total Variation for Real Image Super-Resolution

Members: Keon-Hee Ahn, Jun-Hyuk Kim, Jun-Ho Choi, Jong-Seok Lee

Affiliation: Yonsei University


Title: Coarse to Fine Pyramid Networks for Progressive image super-resolution

Members: Tongtong Zhao, Shanshan Zhao

Affiliation: Dalian Maritime University


Title: RRDB Network with Attention mechanism using Wavelet loss for Single Image Super-Resolution

Members: Yoseob Han, Byung-Hoon Kim, JaeHyun Baek


Los Alamos National Laboratory (LANL)

Korea Advanced Institute of Science and Technology (KAIST)

Amazon Web Services (AWS)


Title: Dual Learning for SR using Multi-Scale Network

Members: Haoning Wu, Dejia Xu

Affiliation: Peking University


Title: OADDet: Orientation-aware Convolutions Meet Dual Path Enhancement Network

Members: Bo Zhou, Haodong Yu


Jiangnan University

Karlsruher Institut fuer Technologie


Title: Dual Path Enhancement Network

Members: Bo Zhou

Affiliation: Jiangnan University


Title: Training Strategy Optimization

Members: Wei Guan, Xiaobo Li, Chen Ye

Affiliation: Tongji University


Title: Ensemble of RRDB for Image Restoration

Members: Hao Li, Haoyu Zhong, Yukai Shi, Zhijing Yang, Xiaojun Yang

Affiliation: Guangdong University of Technology


Title: Mixed Residual Channel Attention

Members: Haoyu Zhong, Yukai Shi, Xiaojun Yang, Zhijing Yang

Affiliation: Guangdong University of Technology


Title: FAN: Frequency-aware network for image super-resolution

Members: Xin Li, Xin Jin, Yaojun Wu, Yingxue Pang, Sen Liu

Affiliation: University of Science and Technology of China


Title: ABPN++: Attention based Back Projection Network for image super-resolution

Members: Zhi-Song Liu, Li-Wen Wang, Chu-Tak Li, Marie-Paule Cani, Wan-Chi Siu


LIX - Computer science laboratory at the Ecole polytechnique [Palaiseau]

Center of Multimedia Signal Processing, The Hong Kong Polytechnic University


Title: RRDB for Real World Super-Resolution

Members: Yuanbo Zhou

Affiliation: Fuzhou University, Fujian Province, China


Title: Deep Cyclic Generative Adversarial Residual Convolutional Networks for Real Image Super-Resolution

Members: Rao Muhammad Umer, Christian Micheloni

Affiliation: University Of Udine, Italy


Title: RDB-P SRNet: Residual-dense block with pixel shuffle

Members: Xiaofeng Cong

Affiliation: (Not provided)


Title: A GAN based Residual in Residual Dense Network

Members: Rajat Gupta

Affiliation: Indian Institute of Technology


Title: Self-supervised Learning for Pretext Training

Members: Kele Xu, Hengxing Cai, Yuzhong Liu

Affiliation: National University of Defense Technology


Title: VCBPv2 - VCycles Backprojection Upscaling Network

Members: Feras Almasri, Thomas Vandamme, Olivier Debeir

Affiliation: Université Libre de Bruxelles, LISA department