Single image super-resolution (SR) reconstructs high-resolution (HR) images from their low-resolution (LR) counterparts, which suffer from image quality degradations [glasner2009super][yang2010image]. Because it does not impose higher requirements on hardware devices and sensors, it is applicable to many practical scenarios, such as video surveillance, satellite imaging and medical imaging. As a fundamental research topic, SR has attracted long-standing and considerable attention in the computer vision community.
Recent SR methods inherit the powerful capacity of deep learning and have achieved remarkable performance improvements. Nevertheless, this progress has so far been driven mainly by supervised learning from LR images and their HR counterparts. Because bicubic downsampling is usually adopted to simulate the LR images, the learned deep SR models are much less effective in real-world SR applications, where the image degradation is much more complicated.
To mitigate this issue, several real SR datasets have recently been built, such as City100 [city100] and SR-RAW [zoomlearn]. The images in City100 were captured from printed postcards in an indoor environment, so they are limited in capturing the complicated image and degradation characteristics of natural scenes. The images in SR-RAW were collected in the real world, and a contextual bilateral loss was proposed to address the misalignment problem in the dataset. Besides, Cai et al. [cai2019toward] released another real image SR dataset, named RealSR, which was captured with two DSLR cameras, and proposed the LP-KPN method in a Laplacian pyramid framework. Considering the complex image degradation across different scenes and devices, a large-scale diverse real SR dataset, named DRealSR [CDC], was released to further promote research on real-world image SR. The images of DRealSR were captured by five different DSLR cameras and pose more challenging image degradation. In [CDC], the proposed component divide-and-conquer (CDC) model builds a baseline hourglass SR network (HGSR) in a stacked architecture, explores the different reconstruction difficulties of three low-level image components inspired by corner point detection, i.e., flat regions, edges and corner points, and trains the model with an intermediate supervision strategy. Its gradient-weighted (GW) loss also drives the model to adapt its learning objectives to the reconstruction difficulties of the three image components, and can be applied to any SR model.
Jointly with the Advances in Image Manipulation (AIM) 2020 workshop, we organize the AIM Challenge on Real-world Image Super-Resolution. Specifically, this challenge concerns real-world SISR, which poses two challenging issues [CDC]: (1) more complex degradation than bicubic downsampling, and (2) diverse degradation processes among devices. The aim is to learn a generic model that super-resolves LR images captured in practical scenarios. To this end, paired LR and HR images captured by various DSLR cameras are provided for training. They are randomly selected from the DRealSR dataset. Images for training, validation and testing are captured in the same way with the same set of cameras.
This challenge is one of the AIM 2020 associated challenges on: scene relighting and illumination estimation[elhelou2020aim_relighting], image extreme inpainting[ntavelis2020aim_inpainting], learned image signal processing pipeline[ignatov2020aim_ISP], rendering realistic bokeh[ignatov2020aim_bokeh], real image super-resolution[wei2020aim_realSR], efficient super-resolution[zhang2020aim_efficientSR], video temporal super-resolution[son2020aim_VTSR] and video extreme super-resolution[fuoli2020aim_VXSR].
2 AIM 2020 Challenge on Real Image Super-Resolution
The objectives of the AIM 2020 challenge on real image super-resolution are: (i) to further promote research on real image SR; (ii) to fully evaluate different SR approaches on different scale factors; (iii) to offer an opportunity for communication between academic and industrial participants.
2.1 DRealSR Dataset
DRealSR [CDC] is a large-scale real-world image super-resolution dataset (publicly available at https://github.com/xiezw5/Component-Divide-and-Conquer-for-Real-World-Image-Super-Resolution). Only half of the images in DRealSR are randomly selected for this challenge. These images were captured with five DSLR cameras (i.e., Canon, Sony, Nikon, Olympus and Panasonic) in natural scenes and cover indoor and outdoor scenes without moving objects, e.g., advertising posters, plants, offices and buildings. The HR-LR image pairs are aligned. To get access to the training and validation data and to submit SR results, registration on Codalab (https://competitions.codalab.org) is required. Details of the dataset in this challenge are given in Table 1.
2.2 Track and Competition
Tracks. The challenge uses the newly released DRealSR dataset and has three tracks corresponding to the ×2, ×3 and ×4 upscaling factors. The aim is to obtain a network design or solution capable of producing high-quality results with the best fidelity to the reference ground truth.
Challenge phases. (1) Development phase: HR images from DRealSR have 4000×6000 pixels on average. For the convenience of model training, images are cropped into patches. For the ×2 scale factor, LR image patches are 380×380; for the ×3 scale factor, LR image patches are 272×272; for the ×4 scale factor, LR image patches are 192×192. (2) Testing phase: In the final test phase, participants have access to the LR images for the three tracks, submit their SR results to the Codalab evaluation server and email their code and factsheets to the organizers. The organizers checked all the SR results and the provided code to obtain the final results.
Track 1 (×2)
| Team | PSNR | SSIM | Weighted score | Ensemble | | Loss |
|---|---|---|---|---|---|---|
| OPPO_CAMERA | 33.309 | 0.924 | 0.7699 | Model+Self | False | + SSIM + MS-SSIM |
| TeamInception | 33.232 | 0.924 | 0.7690 | Model+Self | True | + MS-SSIM + VGG |
| Noah_TerminalVision | 33.289 | 0.923 | 0.7686 | Self | False | adaptive robust loss |
| Kailos | 32.708 | 0.920 | 0.7601 | Self | False | + wavelet loss |

Track 2 (×3)
| Team | PSNR | SSIM | Weighted score | Ensemble | | Loss |
|---|---|---|---|---|---|---|
| OPPO_CAMERA | 30.537 | 0.870 | 0.6966 | Model+Self | False | + SSIM + MS-SSIM |
| Noah_TerminalVision | 30.564 | 0.866 | 0.6941 | Self | False | adaptive robust loss |
| TeamInception | 30.418 | 0.866 | 0.6928 | Model+Self | True | + MS-SSIM + VGG |
| Kailos | 30.130 | 0.866 | 0.6900 | Self | False | + wavelet loss |

Track 3 (×4)
| Team | PSNR | SSIM | Weighted score | Ensemble | | Loss |
|---|---|---|---|---|---|---|
| OPPO_CAMERA | 30.86 | 0.874 | 0.7033 | Model+Self | False | + SSIM + MS-SSIM |
| Kailos | 30.866 | 0.873 | 0.7032 | Self | False | + wavelet loss |
| Noah_TerminalVision | 30.587 | 0.866 | 0.6944 | Self | False | adaptive robust loss |
| TeamInception | 30.347 | 0.868 | 0.6935 | Model+Self | True | + MS-SSIM + VGG |
| MoonCloud | 30.283 | 0.864 | 0.6898 | Model+Self | True | / |
Table 2: Final testing results. The weighted score is the evaluation metric for the challenge. For “Ensemble”, “model” and “self” indicate the model ensemble and the self-ensemble, respectively. “/” indicates that those items are not provided by participants. We also provide results of “EDSR*” for comparison with the same challenge dataset.
The evaluation compares the super-resolved images with the reference ground-truth images. We use the standard peak signal-to-noise ratio (PSNR) and, complementarily, the structural similarity (SSIM) index, as often employed in the literature. PSNR and SSIM implementations are found in most image processing toolboxes. For each dataset, we report the average results (i.e., PSNR and SSIM) over all the processed images belonging to it and employ for ranking the weighted value of normalized PSNR and SSIM, which is defined as follows,
$$\text{Weighted\_value} = 0.5 \times \frac{\text{PSNR}}{50} + 0.5 \times \frac{\text{SSIM} - 0.4}{0.6}.$$
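A weighted score of this kind can be sketched in a few lines. The normalization constants below (PSNR divided by 50, SSIM shifted by 0.4 and divided by 0.6, equal weights of 0.5) are assumptions; they reproduce the scores listed in Table 2 up to the rounding of the reported SSIM values. The function name `weighted_score` is hypothetical.

```python
def weighted_score(psnr: float, ssim: float) -> float:
    """Equal-weight combination of normalized PSNR and SSIM.

    The constants 50, 0.4 and 0.6 are assumed normalization values,
    chosen because they match the scores reported in Table 2.
    """
    psnr_norm = psnr / 50.0           # assumed PSNR normalization
    ssim_norm = (ssim - 0.4) / 0.6    # assumed SSIM normalization
    return 0.5 * psnr_norm + 0.5 * ssim_norm


# Average PSNR/SSIM of one Track 1 entry from Table 2.
score = weighted_score(33.309, 0.924)
```

Because SSIM is reported with three decimals only, recomputed scores can differ from the tabulated ones in the last digit.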
3 Challenge Results
There are 174, 128 and 168 registered participants for three tracks, respectively. In total, 24 teams submitted their super-resolution results; 10, 2 and 11 teams submitted results of one, two and three tracks, respectively. Among those submitted results of one track, seven teams are for scale factor. Details of final testing results are provided in Table 2. It mainly reports the final evaluation results and model training details.
According to the weighted score defined in Sec. 2.2, the leading entries for Tracks 1, 2 and 3 are all from team Baidu. For Tracks 1 and 2, CETC-CSKT and OPPO_CAMERA win the second and third places, respectively. For Track 3, ALONG and CETC-CSKT win the second and third places, respectively. Among the solutions for the challenge, some interesting trends can be observed as follows.
All the teams utilize deep neural networks for super-resolution, and the network architecture greatly affects the quality of the super-resolved images. Several teams, e.g., TeamInception, construct a network with a residual structure to reduce the difficulty of optimization, while OPPO_CAMERA connects the input to the output with a trainable convolution layer. CETC-CSKT further proposes to pre-train the trainable layer in the skip branch in advance. Several teams, such as DeepBlueAI and SR-IM, apply a channel attention module in their networks, while others, such as TeamInception and Noah_TerminalVision, employ both spatial attention and channel attention at the feature level.
Data Augmentation. Most solutions conduct the data augmentation by randomly flipping and rotating images by 90 degrees. The newly proposed CutBlur method was employed by ALONG and OPPO_CAMERA and performance improvements are reported by these teams.
Ensemble Strategy. Most solutions adopted self-ensemble (×8). Some solutions also performed model-ensemble by fusing results from models with different training parameters, or even different architectures.
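The ×8 self-ensemble can be sketched as follows: the input is transformed by every combination of a horizontal flip and a rotation by a multiple of 90 degrees, each variant is super-resolved, the geometric transform is undone, and the eight outputs are averaged. This is a minimal sketch on plain 2-D lists; `nearest_x2` is a hypothetical stand-in for a trained SR network and is not any team's model.

```python
from typing import Callable, List

Image = List[List[float]]


def rot90(img: Image) -> Image:
    """Rotate a 2-D image by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def flip(img: Image) -> Image:
    """Mirror a 2-D image left-right."""
    return [row[::-1] for row in img]


def self_ensemble(model: Callable[[Image], Image], lr: Image) -> Image:
    """Average the model outputs over 8 flip/rotation variants of the input."""
    outputs = []
    for flipped in (False, True):
        for quarter_turns in range(4):
            x = flip(lr) if flipped else lr
            for _ in range(quarter_turns):
                x = rot90(x)
            y = model(x)
            # Undo the geometry: complete the full turn, then un-flip.
            for _ in range((4 - quarter_turns) % 4):
                y = rot90(y)
            if flipped:
                y = flip(y)
            outputs.append(y)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(cols)]
            for i in range(rows)]


def nearest_x2(img: Image) -> Image:
    """Hypothetical stand-in for an SR network: nearest-neighbour x2 upscaling."""
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]


sr = self_ensemble(nearest_x2, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

For a model that is already equivariant to flips and rotations, as the stand-in is, the ensemble returns the plain model output; a gain can only appear when the individual predictions differ.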
4 Challenge Methods and Teams
The Baidu team proposed to apply a Neural Architecture Search (NAS) approach to select among variations of their previous dense residual model as well as the RCAN model [zhihong2020aim]. In order to accelerate the search process, Gaussian Process based Neural Architecture Search (GP-NAS) was applied as in [Li_2020_CVPR]. Specifically, given the hyper-parameters of GP-NAS, the performance of any architecture in the search space can be predicted effectively. The NAS process is then converted into hyper-parameter estimation. By mutual information maximization, the Baidu team can efficiently sample networks. Based on the performance of the sampled networks, the posterior distribution of the hyper-parameters can be gradually and efficiently updated. Based on the estimated hyper-parameters, the architecture with the best performance can be obtained.
The backbone model of the proposed method is a deep dense residual network originally developed for raw image demosaicking and denoising. As depicted in Fig. 1, in addition to the shallow feature convolution at the front and the upsampler at the end, the proposed network consists of a stack of dense residual blocks (DRBs). The input convolution layer converts the 3-channel LR input to F-channel shallow features. Each of the middle DRBs includes several stages of double convolution layers, and the outputs of all stages are concatenated before being convolved back to F channels. An additional channel-attention layer is included at the end of each block, similar to RCAN [RCAN]. Two types of skip connections are included in each block: the block skip connection (BSC) and the inter-block skip connection (IBSC). The BSC is the shortcut between the input and output of a block, while the IBSC includes two shortcuts from the input of one block to the two stages inside another block. The various skip connections, especially the IBSC, are included to combine features with a large range of receptive fields. The last block is an enhanced upsampler that transforms the F-channel LR features to the estimated 3-channel SR image. This dense residual network has three main hyper-parameters: the number of feature channels F, the number of DRB layers and the number of stages in each DRB. Together, these three hyper-parameters construct the search space for NAS.
During training, a 120×120 patch is randomly cropped from each training image in each epoch and augmented with flipping and transposing. A mixed loss that includes multi-scale structural similarity (MS-SSIM) is used for training. For the experiments, the new model candidate search scheme using GP-NAS was implemented in PaddlePaddle [Paddle], and the final training of the searched models was conducted using PyTorch. A multi-level ensemble scheme is proposed for testing, including self-ensemble for patches, as well as patch-ensemble and model-ensemble for full-size images. The proposed method is validated to be highly effective, generating impressive testing results on all three tracks of the AIM 2020 Real Image Super-Resolution Challenge.
The CETC-CSKT team proposed Adaptive Dense Connection Super-Resolution reconstruction (ADCSR) [xie2019adaptive]. The algorithm is divided into BODY and SKIP. The BODY part improves the utilization of convolution features through adaptive dense connections. An adaptive sub-pixel reconstruction module (AFSC) is also proposed to reconstruct the features output by BODY. By pre-training SKIP in advance, the BODY part focuses on high-frequency feature learning. For Track 1 (×2), spatial attention is added after each residual block. The architecture is shown in Fig. 2. Self-ensemble is used as in EDSR [EDSR]. The test image is divided into pixel blocks for reconstruction. Finally, only input is used for splicing to reduce the edge difference of blocks.
The proposed ADCSR uses the first 18900 training samples for training and holds out the last 100 as a test set during training. The input image block size is . SKIP is trained separately, and then the entire network is trained jointly. The initial learning rate is . When the learning rate drops to , the training stops. loss is utilized to optimize the proposed model. The model is trained with 4 NVIDIA RTX 2080Ti GPUs. PyTorch 1.1.0 + CUDA 10.0 + cuDNN 7.5.0 is selected as the deep learning environment.
The OPPO_CAMERA team proposed a Self-Calibrated Attention Neural Network for real-world super-resolution. As shown in Fig. 3, the proposed model consists of four integral components [anwar2019densely]. A longer skip connection with a trainable parameter is also added to connect the input to the output, which greatly reduces the difficulty of optimization, so that the network pays more attention to learning the high-frequency parts of images. As shown in Fig. 4, three Basic Residual Blocks (BRBs) form a Large Residual Block (LRB) with dense connections. Self-Calibration convolution (SCC) [liu2020improving], shown at the top of Fig. 4, is adopted as the basic unit in order to expand the receptive field. Unlike conventional convolution, SCC enables each point in space to interact with information from nearby regions and channels. Dense connections are established between the Self-Calibration convolution blocks (SCCBs), and each densely connected residual block has three SCCBs. To incorporate channel information efficiently, an attention block with multi-scale feature integration is added in every basic residual block, as in DRLN [anwar2019densely]. For network optimization, a pixel-wise loss is used. In order to improve the fidelity, SSIM and MS-SSIM losses are also used as a structure loss. The total loss combines the pixel loss and the structure loss.
For training, the training data are split randomly into a training set and a validation set with a ratio of 18500:500. Because of its significant improvement on the real-world SR task, CutBlur [yoo2020rethinking] is applied to augment the training images. The self-ensemble and parameter-fusion strategies clearly improve the fidelity indices (PSNR and SSIM) and also reduce noise in the result images. Self-ensemble (×8) is used as explained in RCAN [zhang2018image], and the corresponding parameters of the last 3 models are fused to derive a fused model, as described in [shang2020perceptual]. Experiments are conducted on a Tesla V100 GPU.
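Fusing the corresponding parameters of several checkpoints can be sketched as below. The text does not give the fusion rule, so this sketch assumes a plain element-wise average with equal weights; the helper name `fuse_parameters` and the flat-list parameter layout are hypothetical.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def fuse_parameters(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average corresponding parameters element-wise (equal weights assumed)."""
    count = len(checkpoints)
    fused: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        fused[name] = [sum(values) / count for values in columns]
    return fused


# Toy example with three "last" checkpoints of the same architecture.
last_three = [
    {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]},
    {"conv.weight": [2.0, 4.0], "conv.bias": [3.0]},
    {"conv.weight": [3.0, 6.0], "conv.bias": [6.0]},
]
fused_model = fuse_parameters(last_three)
```

Unlike model-ensemble, this kind of fusion yields a single set of weights, so inference cost stays that of one model.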
The AiAiR team proposed OADDet, in which orientation-aware convolutions meet a dual path enhancement network. Their method consists of four basic models (model ensemble): OADDet, Deep-OADDet, the original EDSR [Lim_2017_CVPR_Workshops] and the original DRLN [anwar2019densely]. The core modules of OADDet, illustrated in Figure 5, are borrowed from DDet [shi2020ddet], Inception [inception] and OANet [OANet] with minor improvements, such as fewer attention modules, removed skip connections and LeakyReLU in place of ReLU. The overall architecture is similar to DDet [shi2020ddet]. Redundant attention modules were found to damage the performance and slow down the training process. Therefore, attention modules are only applied to the last few blocks of the backbone network and the last layer of the shallow network. Similar to RealSR [cai2019toward], PixelConv is also utilized, which improves the PSNR on the validation set.
The training process generally consists of four stages on three different datasets. The total training time is about 2000 GPU hours on V100.
OADDet models are trained from scratch, while the DIV2K pre-trained EDSR/DRLN models are downloaded from the official links.
The DIV2K dataset is used to pre-train the OADDet models, and manually washed AIM2020 datasets are used to fine-tune all models (further details in GitHub README).
Four models are trained using three different strategies:
1) For OADDet: Pre-training on DIV2K (300 epochs) then fine-tuning on original AIM2020 x2 dataset (600 epochs) and AIM2020 washed x2 dataset (100 epochs).
2) For Deep-OADDet: Pre-training on DIV2K (30 epochs) then fine-tuning on AIM2020 washed x2+x3 dataset (350 epochs), AIM2020 washed x2 dataset (350 epochs) and AIM2020 washed x2 dataset (100 epochs).
3) For EDSR/DRLN: Starting from the DIV2K pre-trained models then fine-tuning on washed AIM2020 x2 dataset (1000 epochs).
Self-ensemble, model-ensemble (four models) and the proposed “crop-ensemble” are conducted (further details in GitHub README, Reproduce x2 test dataset results).
OADDet enjoys a more stable and faster training process than OANet, which introduces too many attention modules at the early stage of the network. DDet proposes to use dynamic PixelConv with kernel sizes 5, 7 and 9; however, kernel sizes 3, 5 and 7 were found to work better during training and testing.
The TeamInception team proposed learning enriched features for real image restoration and enhancement. MIRNet, recently introduced in [Zamir2020MIRNet], is utilized with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. As shown in Fig. 7, MIRNet (code publicly available at https://github.com/swz30/MIRNet) has a multi-scale residual block (MRB) containing several key elements: (a) parallel multi-resolution convolution streams for extracting (fine-to-coarse) semantically-richer and (coarse-to-fine) spatially-precise feature representations, (b) information exchange across multi-resolution streams, (c) attention-based aggregation of features arriving from multiple streams, and (d) dual-attention units to capture contextual information in both spatial and channel dimensions.
The MRB consists of multiple (three in this work) fully-convolutional streams connected in parallel. It allows information exchange across parallel streams in order to consolidate the high-resolution features with the help of low-resolution features, and vice versa. Each component of MRB is described as follows.
Selective kernel feature fusion (SKFF). The SKFF module performs dynamic adjustment of receptive fields via two operations, Fuse and Select, as illustrated in Fig. 8. The fuse operator generates global feature descriptors by combining the information from multi-resolution streams. The select operator uses these descriptors to recalibrate the feature maps (of different streams), followed by their aggregation. Details of both operators for the three-stream case are as follows. (1) Fuse: SKFF receives inputs from three parallel convolution streams carrying different scales of information. We first combine these multi-scale features using an element-wise sum as $L = L_1 + L_2 + L_3$. We then apply global average pooling (GAP) across the spatial dimension of $L \in \mathbb{R}^{H \times W \times C}$ to compute channel-wise statistics $s \in \mathbb{R}^{1 \times 1 \times C}$. Next, a channel-downscaling convolution layer is used to generate a compact feature representation $z \in \mathbb{R}^{1 \times 1 \times r}$, where $r = C/8$ for our experiments. Finally, the feature vector $z$ passes through three parallel channel-upscaling convolution layers (one for each resolution stream) and provides us with three feature descriptors $v_1$, $v_2$ and $v_3$, each with dimensions $1 \times 1 \times C$. (2) Select: this operator applies the softmax function to $v_1$, $v_2$ and $v_3$, yielding attention activations $s_1$, $s_2$ and $s_3$ that we use to adaptively recalibrate the multi-scale feature maps $L_1$, $L_2$ and $L_3$, respectively. The overall process of feature recalibration and aggregation is defined as $U = s_1 \cdot L_1 + s_2 \cdot L_2 + s_3 \cdot L_3$. Note that the SKFF uses fewer parameters than aggregation with concatenation but generates more favorable results.
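The Fuse and Select steps can be illustrated with a small numeric sketch. It assumes that all streams have already been brought to the same resolution, represents each feature map as a list of channels holding flattened pixel values, and replaces the 1×1 convolutions by plain matrices (`w_down`, `w_up`) without bias or nonlinearity. These simplifications and all names are assumptions for illustration, not the MIRNet implementation.

```python
import math
from typing import List

Feature = List[List[float]]   # [channel][pixel], spatial dims flattened
Matrix = List[List[float]]


def matvec(weights: Matrix, vector: List[float]) -> List[float]:
    """Apply a 1x1 convolution to a 1x1xC descriptor, i.e. a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def skff(streams: List[Feature], w_down: Matrix, w_up: List[Matrix]) -> Feature:
    """Fuse the streams into descriptors, then select with a per-channel softmax."""
    n_ch, n_px = len(streams[0]), len(streams[0][0])
    # Fuse: element-wise sum across streams, then global average pooling.
    fused = [[sum(s[c][p] for s in streams) for p in range(n_px)] for c in range(n_ch)]
    stats = [sum(channel) / n_px for channel in fused]
    compact = matvec(w_down, stats)                    # channel downscaling
    descriptors = [matvec(w, compact) for w in w_up]   # one descriptor per stream
    # Select: softmax across streams for each channel, then weighted aggregation.
    out: Feature = []
    for c in range(n_ch):
        logits = [d[c] for d in descriptors]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        attn = [e / total for e in exps]
        out.append([sum(a * s[c][p] for a, s in zip(attn, streams)) for p in range(n_px)])
    return out


# One channel, two pixels, three streams; zero up-weights give uniform attention.
streams = [[[3.0, 0.0]], [[0.0, 3.0]], [[3.0, 3.0]]]
uniform = skff(streams, w_down=[[1.0]], w_up=[[[0.0]], [[0.0]], [[0.0]]])
```

With uniform attention the output is the plain mean of the streams; learned up-projection weights shift the per-channel mixture toward the more informative resolution.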
Dual attention unit (DAU).
While the SKFF block fuses information across multi-resolution branches, we also need a mechanism to share information within a feature tensor, both along the spatial and the channel dimensions. The dual attention unit (DAU) is proposed to extract features in the convolutional streams. The schematic of the DAU is shown in Fig. 9. The DAU suppresses less useful features and only allows more informative ones to pass further. This feature recalibration is achieved by using channel attention [hu2018squeeze] and spatial attention [woo2018cbam] mechanisms. (1) The channel attention (CA) branch exploits the inter-channel relationships of the convolutional feature maps by applying squeeze and excitation operations [hu2018squeeze]. Given a feature map $M \in \mathbb{R}^{H \times W \times C}$, the squeeze operation applies global average pooling across the spatial dimensions to encode global context, thus yielding a feature descriptor $d \in \mathbb{R}^{1 \times 1 \times C}$. The excitation operator passes $d$ through two convolutional layers followed by sigmoid gating and generates activations $\hat{d} \in \mathbb{R}^{1 \times 1 \times C}$. Finally, the output of the CA branch is obtained by rescaling $M$ with the activations $\hat{d}$. (2) The spatial attention (SA) branch is designed to exploit the inter-spatial dependencies of convolutional features. The goal of SA is to generate a spatial attention map and use it to recalibrate the incoming features $M$. To generate the spatial attention map, the SA branch first independently applies global average pooling and max pooling operations on the features $M$ along the channel dimension and concatenates the outputs to form a feature map $f \in \mathbb{R}^{H \times W \times 2}$. The map $f$ is passed through a convolution and sigmoid activation to obtain the spatial attention map $\hat{f} \in \mathbb{R}^{H \times W \times 1}$, which is used to rescale $M$.
For training, a loss (Eq. 2) combining three terms, including multi-scale SSIM and VGG losses, is used.
The VGG loss uses the features of the conv2 layer after ReLU in the pre-trained VGG-16 network. Three RRGs are utilized, each of which contains MRBs. MRB consists of parallel streams with channel dimensions of at resolutions , respectively. Each stream has DAUs. Patches with the size of are cropped. Horizontal and vertical flips are employed for data augmentation. The model is trained from scratch with the Adam optimizer (, and ) for iterations. The initial learning rate is and the batch size is . The cosine annealing strategy is employed to steadily decrease the learning rate from the initial value to during training.
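A cosine annealing schedule without restarts can be sketched as follows. The numeric values in the example are placeholders chosen for illustration only, since the actual initial and final learning rates and the iteration count are not recoverable from the text; the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_init: float, lr_min: float) -> float:
    """Decay the learning rate from lr_init to lr_min along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, not the team's settings.
schedule = [cosine_annealing_lr(s, 100, lr_init=1.0, lr_min=0.1) for s in (0, 50, 100)]
```

The rate changes slowly at the beginning and end of training and fastest in the middle, which is the "steady" decrease described above.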
At inference time, the self-ensemble strategy is employed. For each test image, a set of the following 8 images is created: the original, its flipped version, the three rotated versions (by 90°, 180° and 270°), and the flipped version of each rotation. These transformed images are passed through the model to obtain super-resolved outputs. The transformations are then undone and the outputs are averaged to obtain the final image. To fuse results, three variants of the proposed network are trained with different loss functions (Eq. 2): (1) only the first term, (2) the first two terms, and (3) all the terms. For the variant 2, and ; for the variant 3, and , .
Given an image, the generated self-ensembled results with each of these three networks are averaged to obtain the final image. Results with self-ensemble strategy and fusion are reported in Table 3. With 4 Tesla-V100 GPUs, it takes 3 days to train the network. The time required to process a test image of size is 2 seconds (single method), 30 seconds (self-ensemble) and 87 seconds (fusion).
| Method | PSNR |
|---|---|
| SM + F | 30.08 |
| SE + F | 30.25 |
The Noah_TerminalVision team proposed super-resolution with weakly-paired data using an adaptive robust loss. The network is based on RRDBNet with 23 Residual-in-Residual Dense Blocks. For Track 3, it additionally uses a spatial attention module borrowed from EDVR [wang2019edvr] and an efficient channel attention module borrowed from ECA-Net [wang2020eca]. Because the training data are not perfectly aligned, the adaptive robust loss function proposed by Jon Barron [barron2019general] is used to alleviate the effect of misalignment and to address the weakly-paired training problem. For self-ensemble, inference is run on the 90/180/270-degree rotations of the original and flipped inputs, and the results are averaged.
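The general robust loss of [barron2019general] can be sketched for a single residual as below. This sketch evaluates the loss for a fixed shape parameter `alpha` and scale `c`; in the adaptive variant these are learned together with the network, which is not shown here. How the team applied it per pixel or per channel is not stated, so the function is only an illustration of the loss family.

```python
import math


def general_robust_loss(residual: float, alpha: float, c: float) -> float:
    """General robust loss for one residual, with shape alpha and scale c > 0."""
    z = (residual / c) ** 2
    if alpha == 2.0:                      # limit case: scaled L2 loss
        return 0.5 * z
    if alpha == 0.0:                      # limit case: Cauchy / Lorentzian loss
        return math.log(0.5 * z + 1.0)
    if alpha == float("-inf"):            # limit case: Welsch loss
        return 1.0 - math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


# alpha = 1 gives a smoothed L1 (Charbonnier-like) penalty: sqrt(z + 1) - 1.
penalty = general_robust_loss(math.sqrt(3.0), alpha=1.0, c=1.0)
```

Smaller values of `alpha` flatten the penalty for large residuals, which is why such a loss tolerates pixels that disagree because of misalignment rather than reconstruction error.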
Only training pairs with a high PSNR score (29) were used for training. The learning rate is 2e-4 and the batch size is 4. The CosineAnnealingLR_Restart learning rate scheme is employed with a restart period of 250,000 steps. Due to the GPU memory constraint, each input image is tested patch-wise: the crop window has a size of 120×120, and a stride of 110×110 is used to collect patches.
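The patch-wise testing layout can be sketched by computing the top-left corners of the crop windows along one image axis. The window of 120 and the stride of 110 come from the text; placing a final window flush with the image border when the stride does not divide evenly is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import List


def tile_origins(length: int, window: int = 120, stride: int = 110) -> List[int]:
    """Start offsets of overlapping windows that cover [0, length) along one axis."""
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] != length - window:
        # Assumed border handling: add one last window ending at the border.
        origins.append(length - window)
    return origins


# Crop positions for a 300 x 340 LR image (rows x columns).
row_starts = tile_origins(300)
col_starts = tile_origins(340)
patches = [(r, c) for r in row_starts for c in col_starts]
```

With these settings neighbouring windows overlap by 10 pixels, which leaves room to blend or discard patch borders when the super-resolved patches are stitched together.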
The DeepBlueAI team proposed a solution based on RCAN [RCAN], implemented in PyTorch. In each RG, the RCAB number is 20, with G=10 and C=128 in the RIR structure. The model is trained from scratch, which takes about 4 days with 4×32G Tesla V100 GPUs. For training, all the training images are augmented by random horizontal flips and 90° rotations. In each training batch, LR color patches of size 64×64 are extracted as inputs. The learning rate of each parameter group follows a cosine annealing schedule without restart. For testing, each low-resolution image is flipped and rotated to generate seven augmented inputs; with the trained RCAN model, the corresponding super-resolved images are generated. An inverse transform is applied to those output images to restore the original geometry. The transformed outputs are averaged all together to yield the self-ensemble result.
The ALONG team proposed a dual path network with high-frequency guidance for real-world image super-resolution. The proposed method follows the main structure of RCAN [RCAN] and uses the guided filter to extract the detail layer and restore high-frequency details. As illustrated in Figure 10, the original feature extraction path with channel attention contains many share-source skip connections. Thanks to these connections, the abundant low-frequency information can be bypassed, which facilitates training a deeper network. Compared with previous simulated datasets, the image degradation process for real SR is much more complicated. Low-resolution images lose more high-frequency information and look blurry. Inspired by image deblurring work [wang2019edvr, zhou2019davanet, zhou2019spatio], a pre-deblur module is used before the residual groups to pre-process blurry inputs and improve super-resolution accuracy. Specifically, the input image is first down-sampled with strided convolution layers; an upsampling layer at the end then resizes the features back to the original input resolution. The proposed dual path network restores fine details by decomposing the input image and focusing on the detail layer: an additional branch focuses on high-frequency reconstruction. The detail layer of the input LR image is obtained using the guided filter, an edge-preserving low-pass filter [he2012guided]. A high-frequency module is then applied to the detail layer, so that the output can focus on restoring high-frequency details.
Besides, a variety of data augmentation strategies are combined to achieve competitive results in the different tracks, including Cutout [devries2017improved], CutMix [yun2019cutmix], Mixup [zhang2017mixup], CutMixup, RGB permutation and Blend. In addition, CutBlur [yoo2020rethinking] is used; unlike Cutout, it can utilize the entire image information while enjoying a regularization effect from the varied samples of random HR ratios and locations. The experimental results also show that a reasonable combination of data augmentations can improve the model performance without additional computation cost in the test phase. The model is trained with 8 2080Ti GPUs with 11 GB of memory each. A pseudo ensemble is also employed: the inputs are flipped/rotated, and the HR results are aligned and averaged for enhanced prediction.
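The idea behind CutBlur can be sketched on single-channel images stored as nested lists. The sketch assumes that the LR image has already been upsampled to the HR size (`lr_up`), picks a random rectangle, and pastes that region from one image into the other in a randomly chosen direction. The 50% direction choice and the `ratio` parameter are illustrative assumptions, not the settings of any team.

```python
import random
from typing import List

Image = List[List[float]]


def cutblur(hr: Image, lr_up: Image, ratio: float, rng: random.Random) -> Image:
    """Swap one random rectangle between an HR image and its upsampled LR version."""
    height, width = len(hr), len(hr[0])
    cut_h = max(1, int(height * ratio))
    cut_w = max(1, int(width * ratio))
    top = rng.randrange(height - cut_h + 1)
    left = rng.randrange(width - cut_w + 1)
    if rng.random() < 0.5:
        base, patch = lr_up, hr      # paste an HR region into the LR input
    else:
        base, patch = hr, lr_up      # paste an LR region into the HR image
    out = [row[:] for row in base]   # copy, so the inputs stay untouched
    for i in range(top, top + cut_h):
        out[i][left:left + cut_w] = patch[i][left:left + cut_w]
    return out


hr_img = [[1.0] * 4 for _ in range(4)]
lr_img = [[0.0] * 4 for _ in range(4)]
mixed = cutblur(hr_img, lr_img, ratio=0.5, rng=random.Random(0))
```

The mixed image serves as the network input while the clean HR image remains the target, so every pixel of the input still carries image content, in contrast to Cutout, which erases a region.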
The LISA-ULB team proposed VCycles BackProjection networks generation two (VCBPv2), which utilized an iterative error correcting feedback mechanism to guide the reconstruction of the final SR output. As shown in Figure 11, the proposed network is composed of an outer loop of 10 cycles and an inner loop of 3 cycles. The input of the proposed VCBPv2 is the LR image and the upsampled counterpart. The upsample and downsample modules iteratively transform features between high- and low-resolution space as residual for error correction. The decoder in the end reconstructs the corrected feature to SR image.
The model is trained using the AdamW optimizer with the learning rate halved every 400 epochs; training then continues with the SGDM optimizer. An equally weighted loss that includes an SSIM term is adopted for training.
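A step schedule that halves the learning rate every 400 epochs can be written in one line. The initial learning rate is passed in as a parameter because its value is not recoverable from the text; the value used in the example is a placeholder.

```python
def halved_lr(initial_lr: float, epoch: int, period: int = 400) -> float:
    """Learning rate after halving once per completed period of epochs."""
    return initial_lr * 0.5 ** (epoch // period)


# Placeholder initial rate of 1.0, shown at a few epochs.
rates = [halved_lr(1.0, e) for e in (0, 399, 400, 800)]
```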
The lyl team proposed a coarse-to-fine network (CFN) for progressive super-resolution. As shown in Fig. 12, based on the Laplacian pyramid framework, the proposed model takes an LR image as input and progressively predicts residual images at multiple levels, whose number is determined by the scale factor.
was adopted to optimize the proposed network. Each level of the proposed CFN was supervised by different scales of HR images.
The GDUT-SL team used the RRDBNet of ESRGAN [ESRGAN] to perform super-resolution. A typical RRDB block has 3 dense blocks, each including 5 Conv layers with Leaky-ReLU and no BN layers. The RRDB number was set to 23. Two UpConv layers are used for upsampling. Different from ESRGAN, the GDUT-SL team replaced the activation function with ReLU to obtain better PSNR results.
Residual scaling and smaller initialization were adopted to facilitate training a deep architecture. In the training phase, the mini-batch size was set to 16, with an image size of 96×96. 20 promising models were selected for model-ensemble.
As shown in Fig. 13, the MCML-Yonsei team proposed an attention-based multi-scale deep residual network based on MDSR [EDSR], which shares most of its parameters across different scales. In order to utilize the various features in each real image adaptively, the MCML-Yonsei team added an attention module to the existing ResBlock. As shown in Fig. 14, the attention module is based on MAMNet [MAMNet], where the global variance pooling was replaced with total variation pooling.
They initialized all parameters except those of the attention module with the pre-trained MDSR, which was optimized on training data generated by bicubic downsampling. The mini-batch size was set to 16 and the patch size was set to 48. They subtracted the mean of each R, G, B channel of the training set for data normalization. The learning rate was decayed at 15k steps, and training ran for 20k steps in total.
The kailos team proposed an RRDB Network with Attention mechanism using Wavelet loss for Single Image Super-Resolution. The loss function consisted of a conventional loss and a novel wavelet loss. The conventional loss is computed between the reconstructed image and the ground-truth image.
A wavelet transform can separate signal features into low- and high-frequency components. Most of the signal energy, such as global structure and color distribution, is concentrated in the low-frequency components, whereas the high-frequency components carry signal patterns and image textures. Since the two kinds of components have different characteristics, a different loss function is applied to each. The proposed wavelet loss is therefore the sum of a loss on the high-frequency components and a loss on the low-frequency components, computed over the stages of the wavelet transform with the high- and low-frequency decomposition filters, respectively.
In the experiment, the number of wavelet transform stages is 2 and Haar wavelet filters are used as the wavelet decomposition filters. The total loss combines the conventional loss and the wavelet loss, balanced by a regularization parameter. Fig.16 shows an overview of the proposed method. The Adam optimizer was used in the training process, and the image patch size was a quarter of the size of the training images.
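A two-stage Haar wavelet loss of this kind can be sketched as below. The text here does not say which norm is applied to which band, so the sketch assumes an L1 penalty on the high-frequency bands and an L2 penalty on the low-frequency band at every stage; the function names are hypothetical as well. Image height and width must be divisible by 2 to the power of the number of stages.

```python
def haar_step(img):
    """One 2D Haar stage on an H x W image with even H and W.

    Returns the low-frequency band LL and the high-frequency bands [LH, HL, HH].
    """
    h, w = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2.0
            lh[i][j] = (a - b + c - d) / 2.0
            hl[i][j] = (a + b - c - d) / 2.0
            hh[i][j] = (a - b - c + d) / 2.0
    return ll, [lh, hl, hh]


def mean_abs(x, y):
    """Mean absolute difference between two equally sized 2D lists."""
    diffs = [abs(p - q) for xr, yr in zip(x, y) for p, q in zip(xr, yr)]
    return sum(diffs) / len(diffs)


def mean_sq(x, y):
    """Mean squared difference between two equally sized 2D lists."""
    diffs = [(p - q) ** 2 for xr, yr in zip(x, y) for p, q in zip(xr, yr)]
    return sum(diffs) / len(diffs)


def wavelet_loss(pred, target, stages=2):
    """Sum of high-band (L1) and low-band (L2) losses over Haar stages.

    The choice of norms and the per-stage low-band term are assumptions.
    """
    loss_high, loss_low = 0.0, 0.0
    for _ in range(stages):
        pred, pred_high = haar_step(pred)
        target, target_high = haar_step(target)
        for p_band, t_band in zip(pred_high, target_high):
            loss_high += mean_abs(p_band, t_band)
        loss_low += mean_sq(pred, target)
    return loss_high + loss_low
```

A constant brightness offset between prediction and target affects only the low-frequency term, whereas texture errors show up in the high-frequency term, which is the separation the wavelet loss relies on.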
The qwq team proposed a Multi-Scale Network based on RCAN [RCAN]. As shown in the corresponding figure, the multi-scale mechanism was integrated into the base block of RCAN in order to enlarge the receptive field. Dual Loss was adopted for training. MixCorrupt augmentation, which is specially designed for the real-world scenario, was also applied because it allows the network to learn robust SR results from different degradations.
The RRDN_IITKGP team used a GAN-based Residual in Residual Dense Network [ESRGAN], where the model was pre-trained on another dataset and evaluated on the challenge dataset.
The SR-IM team proposed a frequency-aware network (FAN), as shown in Fig.17. A hierarchical feature extractor (HFE) is utilized to extract the high, middle and low representations. The basic unit of the body consists of a residual dense block and a channel attention module. Finally, the three branches are fused into one super-resolved image by the gate and fusion module.
The mini-batch size was set to 8 and the patch size was set to 160 during training. They used Adam optimizer with an initial learning rate of 0.0001. The learning rate decayed by a factor of 0.5 every 30 epochs. The entire training time is about 48 hours.
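The step schedule described above (initial rate 0.0001, multiplied by 0.5 every 30 epochs) can be written as a one-line function. The sketch assumes epochs are counted from 0 and that the decay is applied at epoch boundaries; the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.5, step=30):
    """Learning rate at `epoch` when decaying by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)
```

For example, epochs 0 to 29 use 1e-4, epochs 30 to 59 use 5e-5, and epochs 60 to 89 use 2.5e-5.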
The JNSR team utilized EDSR [EDSR] and DRLN [anwar2019densely] to perform model ensemble. EDSR and DRLN were trained on the AIM 2020 dataset, and the best models were chosen for the ensemble.
The SR_DL team proposed an attention back projection network (ABPN++), as shown in Fig.18. The proposed ABPN++ network first conducts feature extraction to expand the feature space of the input LR image. Then the densely connected enhanced down- and up-sampling back projection blocks perform up- and down-sampling of the feature maps. The Cross-scale Attention Block (CAB) takes the outputs from the down-sampling back projection blocks to compute the cross-correlation for feature fusion. Finally, the Refined Back Projection Block works as a final refinement that estimates the feature residuals between the input LR and predicted LR images for the update. The complete network includes 10 down- and up-sampling back projection blocks, 2 feature extraction blocks and 1 refined back projection block. Each back projection block is made of 5 convolutional layers. The kernel number is 32 for all convolution and deconvolution layers. For the down- and up-sampling convolution layers, the kernel size is 6, the stride is 4 and the padding is 1.
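The choice of kernel size 6, stride 4 and padding 1 yields an exact ×4 change in spatial size in both directions, which can be checked with the standard output-size formulas. The sketch assumes no dilation and no output padding; the function names are hypothetical.

```python
def conv_out(size, kernel=6, stride=4, padding=1):
    """Spatial output size of a strided convolution (down-sampling)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=6, stride=4, padding=1):
    """Spatial output size of a transposed convolution (up-sampling)."""
    return (size - 1) * stride - 2 * padding + kernel
```

For a feature map of side 48, `deconv_out(48)` gives 192 and `conv_out(192)` gives 48 again, so a back projection block can alternate between the LR and HR feature resolutions without any cropping or extra padding.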
The mini-batch size was set to 16 and the LR patch size was set to 48 during training. The learning rate is fixed to 1e-4 for all layers throughout the first stage. The batch size is then increased to 32 for a fine-tuning stage.
The Webbzhou team fine-tuned the pre-trained RRDB [ESRGAN] on the challenge dataset.
The MoonCloud team utilized RCAN [RCAN] for the challenge. In total, 6 models were used for model ensemble. Three of them were trained on the challenge dataset with a scale of 4. The other three were first trained on the challenge dataset with a scale of 3 and then fine-tuned on the dataset with a scale of 4. The final outputs were obtained by averaging the outputs of these six models.
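Averaging the outputs of several models, as described above, amounts to a pixel-wise mean. A minimal sketch, assuming all predictions share the same size and are weighted equally (the function name is hypothetical):

```python
def average_ensemble(outputs):
    """Pixel-wise mean of equally sized H x W predictions from several models."""
    n = len(outputs)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / n for j in range(w)]
            for i in range(h)]
```

With six models, `outputs` would hold six super-resolved images of the same test input, and the returned image is the ensemble result.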
The SrDance team utilized RRDB [ESRGAN]. A new training strategy was adopted for model optimization. The model was first pre-trained on the DIV2K dataset. They then trained the model by randomly picking one image from the dataset and randomly cropping a few patches from it, which resembles stochastic gradient descent. Second, at each model step, they trained on 10 images, with one patch cropped from each image and fed to the model.
The MLP_SR team proposed Deep Cyclic Generative Adversarial Residual Convolutional Networks for Real Image Super-Resolution, as shown in Fig.19. The SR generator [Umer2020DeepGA] network was trained in a GAN framework using the LR images and their corresponding HR images, with pixel-wise supervision in the clean HR target domain, while maintaining the cyclic consistency between the LR and HR domains.
The congxiaofeng team proposed RDB-P SRNet, which contains several residual-dense blocks with pixel shuffle for upsampling. The network was inspired by RDN [RDN].
The debut_kele team proposed Enhanced Deep Residual Networks for real image super-resolution.
We thank the AIM 2020 sponsors: Huawei, MediaTek, Google, NVIDIA, Qualcomm and Computer Vision Lab (CVL) ETH Zurich. This work was partially supported by the National Key Research and Development Project, the Fundamental Research Funds for the Central Universities under Grant No.19lgpy228, and the China Postdoctoral Science Foundation (2020M672968).
A. Teams and affiliations
Title: AIM 2020 Real Image Super-Resolution Challenge
Members: Pengxu Wei, Hannan Lu, Radu Timofte, Liang Lin, Wangmeng Zuo
Sun Yat-sen University
Harbin Institute of Technology
Computer Vision Lab, ETH Zurich, Switzerland
Title: Real Image Super Resolution via Heterogeneous Model Ensemble using GP-NAS
Members: Zhihong Pan, Baopu Li, Teng Xi, Yanwen Fan, Gang Zhang, Jingtuo Liu, Junyu Han, Errui Ding
Baidu Research (USA)
Department of Computer Vision Technology (VIS), Baidu Incorporation
Title: Adaptive dense connection super resolution reconstruction
Members: Tangxin Xie, Yi Shen, Jialiang Zhang, Yu Jia, Liang Cao, Yan Zou
Affiliation: China Electronic Technology Cyber Security Co., Ltd.
Title: Self-Calibrated Attention Neural Network for Real-World Super Resolution
Members: Kaihua Cheng, Chenhuan Wu
Affiliation: Guangdong OPPO Mobile Telecommunications Corp., Ltd
Title: Dual Path Network with High Frequency Guided for Real World Image Super-Resolution
Members: Yue Lin, Cen Liu, Yunbo Peng
Affiliation: NetEase Games AI Lab
Title: Super Resolution with weakly-paired data using an Adaptive Robust Loss
Members: Xueyi Zou
Affiliation: Noah’s Ark Lab, Huawei
Title: A solution based on RCAN
Members: Zhipeng Luo, Yuehan Yao, Zhenyu Xu
Affiliation: DeepBlue Technology (Shanghai) Co., Ltd
Title: Learning Enriched Features for Real Image Restoration and Enhancement
Members: Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan
Inception Institute of Artificial Intelligence (IIAI)
Title: Multi-scale Dynamic Residual Network Using Total Variation for Real Image Super-Resolution
Members: Keon-Hee Ahn, Jun-Hyuk Kim, Jun-Ho Choi, Jong-Seok Lee
Affiliation: Yonsei University
Title: Coarse to Fine Pyramid Networks for Progressive image super-resolution
Members: Tongtong Zhao, Shanshan Zhao
Affiliation: Dalian Maritime University
Title: RRDB Network with Attention mechanism using Wavelet loss for Single Image Super-Resolution
Members: Yoseob Han, Byung-Hoon Kim, JaeHyun Baek
Los Alamos National Laboratory (LANL)
Korea Advanced Institute of Science and Technology (KAIST)
Amazon Web Services (AWS)
Title: Dual Learning for SR using Multi-Scale Network
Members: Haoning Wu, Dejia Xu
Affiliation: Peking University
Title: OADDet: Orientation-aware Convolutions Meet Dual Path Enhancement Network
Members: Bo Zhou, Haodong Yu
Karlsruher Institut fuer Technologie
Title: Dual Path Enhancement Network
Members: Bo Zhou
Affiliation: Jiangnan University
Title: Training Strategy Optimization
Members: Wei Guan, Xiaobo Li, Chen Ye
Affiliation: Tongji University
Title: Ensemble of RRDB for Image Restoration
Members: Hao Li, Haoyu Zhong, Yukai Shi, Zhijing Yang, Xiaojun Yang
Affiliation: Guangdong University of Technology
Title: Mixed Residual Channel Attention
Members: Haoyu Zhong, Yukai Shi, Xiaojun Yang, Zhijing Yang
Affiliation: Guangdong University of Technology
Title: FAN: Frequency-aware network for image super-resolution
Members: Xin Li, Xin Jin, Yaojun Wu, Yingxue Pang, Sen Liu
Affiliation: University of Science and Technology of China
Title: ABPN++: Attention based Back Projection Network for image super-resolution
Members: Zhi-Song Liu, Li-Wen Wang, Chu-Tak Li, Marie-Paule Cani, Wan-Chi Siu
LIX - Computer science laboratory at the Ecole polytechnique [Palaiseau]
Center of Multimedia Signal Processing, The Hong Kong Polytechnic University
Title: RRDB for Real World Super-Resolution
Members: Yuanbo Zhou
Affiliation: Fuzhou University, Fujian Province, China
Title: Deep Cyclic Generative Adversarial Residual Convolutional Networks for Real Image Super-Resolution
Members: Rao Muhammad Umer, Christian Micheloni
Affiliation: University of Udine, Italy
Title: RDB-P SRNet: Residual-dense block with pixel shuffle
Members: Xiaofeng Cong
Affiliation: (Not provided)
Title: A GAN based Residual in Residual Dense Network
Members: Rajat Gupta
Affiliation: Indian Institute of Technology
Title: Self-supervised Learning for Pretext Training
Members: Kele Xu, Hengxing Cai, Yuzhong Liu
Affiliation: National University of Defense Technology
Title: VCBPv2 - VCycles Backprojection Upscaling Network
Members: Feras Almasri, Thomas Vandamme, Olivier Debeir
Affiliation: Université Libre de Bruxelles, LISA department