Simple realization of papers in oppo Research Institute super score competition.
Perceptual Extreme Super-Resolution for single image is extremely difficult, because the texture details of different images vary greatly. To tackle this difficulty, we develop a super resolution network with receptive field block based on Enhanced SRGAN. We call our network RFB-ESRGAN. The key contributions are listed as follows. First, for the purpose of extracting multi-scale information and enhance the feature discriminability, we applied receptive field block (RFB) to super resolution. RFB has achieved competitive results in object detection and classification. Second, instead of using large convolution kernels in multi-scale receptive field block, several small kernels are used in RFB, which makes us be able to extract detailed features and reduce the computation complexity. Third, we alternately use different upsampling methods in the upsampling stage to reduce the high computation complexity and still remain satisfactory performance. Fourth, we use the ensemble of 10 models of different iteration to improve the robustness of model and reduce the noise introduced by each individual model. Our experimental results show the superior performance of RFB-ESRGAN. According to the preliminary results of NTIRE 2020 Perceptual Extreme Super-Resolution Challenge, our solution ranks first among all the participants.READ FULL TEXT VIEW PDF
Simple realization of papers in oppo Research Institute super score competition.
Single image super-resolution (SISR) is a task to generate high-resolution (HR) image with a single low-resolution image. The algorithms of SISR can be divided into three categories: interpolation-based methods, reconstruction-based methods, and learning-based methods. Interpolation-based SISR methods are very speedy and straightforward, such as bicubic interpolation  and Lanczos resampling . But some works have shown that interpolation methods would lose the details of images [31, 7]. Reconstruction-based SR methods [5, 24, 30] adopt sophisticated prior knowledge to restrict the possible solution space with the advantage of generating flexible and sharp details 
. However, as the scale factor increases, the performance of reconstruction-based SR methods decreases, and reconstruction-based SISR methods typically cost a lot of time. Learning-based SISR methods usually use machine learning algorithms to get the model which produces the mapping from low resolution to high resolution images. The learning-based methods has attracted much attention owning to their outstanding performance and fast computation. Such as Markov random field method, neighbor embedding method , sparse coding methods [31, 33, 27]
, and random forest method
. Recently, many deep learning based methods have been proposed to solve the SISR problem, and deep learning based SISR methods have demonstrated great superiority to other SISR methods.
Recently, deep learning algorithms have been widely used in different fields. Super-resolution CNN (SRCNN) 
is the first work to solve SISR problem using neural network method, it reportedly demonstrated vast superiority over traditional methods. The main reason it achieves good results is the CNN’s strong capability of learning rich features from big data in an end-to-end manner. After SRCNN was proposed, VDSR further use deep model to solve SISR problem, it has 20 layers in the network. EDSR 
proposed to remove the batch normalization (BN) layer in model, for BN layer will introduce a shift to the feature, and this shift may be harmful to the final performance. RCAN was proposed using the channel attention in SISR problem. However, these methods’ objective function has largely focused on minimizing the mean squared reconstruction error, which lead to the SR results lack of high-frequency details. To address this problem, super-resolution using generative adversarial network (SRGAN)  has been proposed, which can recover the finer texture details even with large upscaling factors. Enhanced super-resolution generative adversarial networks (ESRGAN) 
was proposed to further improve the performance of deep learning based SISR model. With the powerful feature extraction capabilities of deep learning models and the generative adversarial method, deep learning-based methods can effectively recover the finer details and textures.
NTIRE 2020 Perceptual Extreme Super-Resolution Challenge, that is, the task of super-resolving (increasing the resolution) an input image with a magnification factor 16 based on a set of prior examples of low and corresponding high resolution images.The aim is to obtain a model capable to produce high resolution results with the best perceptual quality and similarity to the ground truth. There are two difficulties in this challenge. First, we need to develop a model that can effectively recover the finer details and textures of low resolution image, and make the results be both photo-realistic and with high perceptual quality. Second, we need to minimize time complexity as much as possible while keep the satisfactory results at the same time.
In this work, we proposed to use multi-scale Receptive Fields Block (RFB) in the generative network to restore the finer details and textures of the super-resolution image. RFB can extract different scale features from previous feature map, which means it can extract the coarse and fine features from input LR images. To reduce time complexity and still maintain satisfactory performance, RFB use several small kernels instead of large kernels, and we alternately use different upsampling methods in um-sampling stage of the generative network. Finally in the testing phase, we use model fusion to improve the robustness and stability of the model to different test images.
Single Image Super-Resolution. Since the pioneer work of SRCNN , deep learning based methods have brought significant improvement in image super-resolution [17, 20, 35, 19, 29]. For image super-resolution, VDSR  reveals that increasing network depth shows a significant improvement in SISR. EDSR  abandoned batch normalization (BN) layers to prevent BN artifacts of SR images. Perceptual loss  was first proposed in the field of style transfer. SRGAN  use the perceptual loss to reduce the gap between SR images and human visual perception, and achieved very good results. ESRGAN 
introduced the Residual-in-Residual Dense Block (RRDB) into generative network, and proposed to let the discriminator predict relative realness instead of the absolute value in SISR. Our RFB-ESRGAN use a deep neural network without BN layers as the backbone of the generative network, also benefit from the RRDB and use relative realness in loss function instead of the absolute value.
Multi-scale Receptive Fields. GoogleNet  increase the width of the network in classification field, use multi-scale kernels to extract different scale features. After the pioneer work of GoogleNet, many other deep networks have tried to use multi-scale kernels to increase the diversity of features of the network in different network structures, and achieved good results. Inspired by the multi-scale kernels and the structure of Receptive Fields (RFs) in human visual systems, RFB-SSD  proposed Receptive Fields Block (RFB) for object detection. In our work, we introduce RFB into our generative network for super-resolution.
Upsampling Methods. In the early deep learning based SISR, most works put the upsampling stage in the front of the models, like SRCNN , VDSR . It will make the model very large, and cost a lot of time in test phase. FSRCNN  make the upsampling stage in the end of the model, this make the input size more small and the model more deeper possible. FSRCNN use deconvolution for upsampling, while ESRGAN  and some other works use nearest interpolation for upsampling. ESPCN  proposed the sub-pixel method for upsampling to reduce the time complexity. For RFB-ESRGAN, we alternately use nearest interpolation and sub-pixel convolution for upsampling. Here is our thought, nearest interpolation method focus on the computation in space dimension, while the sub-pixel convolution method focus on the computation in depth dimension. The alternative use of them allows for full communication of information between depth and space.
Minimize Time Complexity. For the purpose of minimize time complexity many networks design tricks have been proposed. GoogleNet  uses bottleneck layers to reduce the time complexity. MobileNet  uses depth-wise separable convolution to speed up the model running on edge devices. In our work, RFB uses small kernels to instead of large kernels, and we also alternately use nearest and sub-pixel methods in upsampling stage. Thus, we can minimize the time complexity of the model as much as possible while keep satisfactory performance at the same time.
Extreme single image super-resolution reconstruction aims to recover lost high-frequency (rich detail) while maintaining content consistency 
. Most SR network architectures are designed based on improving the PSNR (Peak Signal-to-Noise Ratio) value. However, the images reconstructed by PSNR-oriented methods are particularly smooth and lack high-frequency details. Perceptual-driven methods have been proposed to improve perceptual quality of SR results. Generative adversarial network is introduced to SR to generate results more naturally. SRGAN  and ESRGAN  significantly improves the overall perceptual quality of SR outputs over PSNR-oriented methods. We proposed a novel Super Resolution Network based on ESRGAN named RFB-ESRGAN.
The proposed network structure consists of 5 parts shown in Fig. 1, namely the first convolution module, the Trunk-a module, the Trunk-RFB module, upsampling module and the final convolution module.
The first convolution module is a convolution layer with a kernel size of , which can be formulated as equation (1). where denotes the first convolution function for the input LR image .
For perceptual extreme SR task we introduced RRFDBs (Fig. 3) in our work, where we assemble RFB (Fig. 4) in it. The RFB is consist of vary scale convolution filters, such we can restore rich image details for super resolution. Define the function of th RRFDB in Trunk-RFB as . The output of several stacked RRFDBs can be given by equation (3).
The output of Trunk-RFB module is fed to a single RFB block and the upsampling module. In the upsampling phase, we alternately use Nearest Neighborhood Interpolation and Sub-pixel Convolution shown in Fig. 5. The output of upsampling module can formulated as equation (4).where means the function of RFB, means the function of Nearest Neighborhood Interpolation, means the function of Sub-pixel Convolution.
Final convolution module consists of two layers of convolution with kernel size . Use and represent the functions of final two convolution layers, the final super resolution results can be given as equation (5).
For perceptual extreme super resolution task, RFB-ESRGAN proposed to extract multi-scale receptive fields feature for restoring details of the SR images. For this purpose, we need to assemble vary sizes of convolution filter into the generative network, such as , , . But large convolution kernel will greatly increase the time complexity of the model, it is needed to use small filters instead of large filters. In our work, we introduce the Receptive Fields Block (RFB) 
to assemble the RFB-ESRGAN. RFB has been proposed to strengthen the deep features learned from lightweight CNN models. Specifically, RFB makes use of multi-branch pooling with varying kernels corresponding to reception fields of different sizes, applies dilated convolution layers to control their eccentricities, and reshapes them to generate final representation. Here, the RFB is used in RRFDBs to remain the deep rich features for restoring the details of super resolution image.
In RFB-ESRGAN, the trunk-RFB is stacked of 8 Residual of Receptive Field Dense Blocks (RRFDBs), and each RRFDB contains 5 RFBs (Fig. 3). The composition structure of RFB is shown in Fig. 4. RFB highlights the relationship between receptive filed size and eccentricity in a daisy-shape configuration, where bigger weights are assigned to the positions nearer to the center by smaller kernels, claiming that they are more important than the farther ones. This makes RFB more effect on simulating the human visual system than the other multi-scale receptive fields methods like the Inception family , ASPP, and Deformable CNN . In the RFB, instead of large kernels such as , it uses the combination of small kernels (), which can effectively reduce the amount of parameters and time complexity. Besides, such substitutions enable RFB to extract very detailed features especially line features, such as hair, skin texture, edge, etc. This makes RFB exactly what we need for extracting multi-scale features and minimizing time complexity at the same time. The most important reason to use RFB is the ability of extracting the very detailed features, which is exactly what is needed in the field of image reconstruction.
To make RFB suitable for our RFB-ESRGAN, we drop all the batch normalization layers in RFB. In addition, we use Leaky Relu instead of Relu as the activation function of the whole RFB, while the activation functions in each branch are still Relu.
In the upsampling phase, instead of only use Nearest Neighborhood Interpolation (NNI) or Sub-pixel Convolution (SPC) , we alternately use NNI and SPC. NNI performs spatial transformation on input features, and the RFB after NNI makes the results of NNI’s spatial transformation fully affect on depth. SPC makes depth to space transformation, and the RFB after SPC makes the results of SPC’s depth to space transformation fully affect on space. Use them alternately will improve the information communication between space and depth. Also, the use of SPC will reduce the amount of parameters and time complexity.
We apply GAN loss that used in ESRGAN  on RFB-ESRGAN, which results in the following loss for generative network and discriminator network. Generative loss function of RFB-ESRGAN contains three terms: VGG loss which has been successfully applied on other tasks such as image synthesis and style transfer. The purpose of VGG loss here is encouraging our network to restore the high-frequency content for perceptually satisfaction. We use the pretrained VGG model to extract the feature representation of and , denotes the images generated by RFB-ESRGAN, denotes the ground truth high resolution images. Adversarial loss for encouraging our network to favor solutions that reside on the manifold of natural images. Pixel loss used to restrict the generation of too much high-frequency content. Use denotes the training dataset, describes the discriminator network function, describes the generative network function, and represents L1 loss. can be formulated as equation (6).
Here describes the input low resolution image. Pixel loss is the manhattan distance between reconstructed image and the reference ground truth image , shown as equation (7).
VGG loss is the manhattan distance between the VGG feature representations of a reconstructed image and the reference ground truth image , shown as equation (8).
Where represents the feature map of th layer in pretrained VGG model. Use represents the difference between the realistic degree of reconstructed image and reference ground truth image , the difference between and shown as (9). The adversarial loss can be formulated as equation (10).
is the sigmoid function andrepresents the average operation of all data in a mini-batch.
With pixel loss, VGG loss, and adversarial loss, we can formulate the generative loss of RFB-ESRGAN shown as equation (11).
Discriminator loss function of RFB-ESRGAN contains two terms: Real Loss for encouraging the real image is more realistic than fake image, shown as (12). Fake loss for encouraging the fake image is less realistic than real image, shown as equation (13).
With the real loss and fake loss , the loss function of discriminator can be formulated as equation (14).
Different from ESRGAN , which fuses the parameters of PSNR-oriented model and GAN-based model . In order to extremely improve the perceptual performance of the reconstructed image, we fuse the model without any PSNR-oriented model. The final model is ensemble of 10 GAN-based models with the best perceptual performance among all recorded models in GAN training stage. We fuse all the corresponding parameters of top 10 models to derive an ensemble model , whose parameters are:
where represents the parameters of , represents the parameters of , and is set as for NTIRE 2020 Perceptual Extreme Super-Resolution Challenge. The final ensemble model can effectively reduce the noise of reconstructed images and be more robust for different test images. We also attempt to fuse the models with more GAN-based models. For instance, use 20 or 40 best GAN-based models for ensemble. We find that, the ensemble model with more GAN-based models can reduce the noise of reconstructed images a little more. However, it doesn’t further improve the model’s perceptual performance of ensemble model. Instead, with more models for ensemble has a negative impact on perceptual performance. We balanced the performance of different numbers of fusion models, and finally chose to use 10 models for ensemble.
For NTIRE 2020 Perceptual Extreme Super-Resolution Challenge, all experiments are performed with a scaling factor of 16 between LR and HR images. We obtained the corresponding LR images via default setting (bicubic interpolation) of Matlab function ”imresize” with scale factor 16. The mini-batch size is set as 16. The spatial size of cropped HR patch is , and spatial size of corresponding input LR image is .
The training process can be divided into two stages. First stage, a PSNR-oriented model with loss is trained. The learning rate is initialized as and decayed by a factor of every of mini-batch steps. Second stage (GAN-based training stage), after fully trained of PSNR-oriented model, generative network is initialized with the parameters of pre-trained PSNR-oriented model and trained using the generative loss function (11) and adversarial loss function (10). In the generative loss function, is set as 10 and is set as . The learning rate is set as and halved at iterations. During the GAN-based training stage, parameters of generative network is recorded every 5000 iteration.
For optimization, we use Adam  with and
. The generative network and discriminator network are alternately updated. we implement our models with Pytorch framework and train them using Tesla V100 GPUs. There are 20.5M parameters in RFB-ESRGAN model, and it costs 0.82s using one Tesla V100 GPU for processing per image with 128x128 pixels.
NTIRE 2020 Perceptual Extreme Super-Resolution Challenge has provided DIV8K dataset  for training, which includes 1,500 HR images with high resolution vary from 2K to 8K, we use 1,400 images for training and the rest 100 images for validation. In order to enrich our training dataset, we added other datasets, including 800 images from DIV2k dataset , 2,650 images from Flickr2K  dataset and 785 images from OST dataset .
Our models are trained with RGB channels. For data augmenting, the training images are random flipped with horizontal and random rotated with 90 degree. The result models are evaluated on DIV8K dataset provided by NTIRE 2020 Perceptual Extreme Super-Resolution Challenge.
We have compared our final models on DIV8K with PSNR-oriented method RCNN, and also with perceptual-driven approach ESRGAN. Because the original RCNN and ESRGAN didn’t adjust to 16 scale super resolution task, we finetuned them on datasets same as the proposed RFB-ESRGAN. We present some representative qualitative results in Fig. 6.
From Fig. 6, we can observe that, our proposed RFB-ESRGAN outperforms previous approaches in both similar to ground truth and details. For instance, RFB-ESRGAN can produce sharper and clearer textures than PSNR-oriented method RCAN. The PSNR-oriented methods always tend to produce smooth and blurry SR images, which is not friendly to human visual perception. RFB-ESRGAN is capable of generating sharper and more natural details than ESRGAN. The fur textures of cat (see image 1608) are more real, the textures of plants and buildings (see image 1643) are more natural.
We also compare the results on 1,401-1,500 images from DIV8K, which haven’t been used for training. PSNR, SSIM, LPIPS and PI were calculated to evaluate the sharpness and fidelity of results. The results are shown in Tab. 1, in which the arrows indicate if high or low values are desired. Besides, our solution RFB-ESRGAN won the NTIRE 2020 Perceptual Extreme Super-Resolution Challenge according to preliminary results. We present the top 6 results from the Challenge in Tab. 2, more information on the evaluation and competing methods can be found in the challenge report .
In order to study the effects of each component of the proposed RFB-ESRGAN, we remove the different components of RFB-ESRGAN to measure the influence of it. The overall visual comparison is shown in Fig. 7. Each column images represents the super resolution results of the model with configurations shown in the top. Among them, Configurations of column represents the model use only Subpixel Convolution for upsampling, column represents the model use only Nearest Neighbor Interpolation for upsampling, colunn represents the model Alternately use Subpixel Convolution and Nearest Neighbor Interpolation for upsampling. Detailed of ablation study is provided as follows.
RFB. In order to prove the effect of RFB, we remove all RFBs in the model while keep the entire structure of the model unchanged. From some cases of column, we can observe that the textures of hair from people and fur from cat are too rough, and some with wrong direction. While the results of RFB-ESRGAN in column achieve fine and smooth hair and fur.
Methods for Upsampling. We have Alternately used Nearest Neighbor Interpolation (NNI) and Subpixel Convolution (SPC) in upsampling stage, shown in Fig. 5. In order to demonstrate the effect of this upsampling methods, we test the upsampling methods of using only NNI in column and using only SPC in column. As shown in the column, results of the method with only NNI are more blurry than the other upsampling methods. While using only SPC, the textures of some cases are too sharp and not natural (see image 1608 and 1643 in column), and also some unreal artifacts have been generated (see image 1617 in column). It can be observed our upsampling method yields the most clear and realistic results.
Ensemble. To evaluate the effect of model ensemble, we compare the SR results with model ensemble and without model ensemble. From column, we can observe that the results without ensemble have obvious noise though the textures are sharper and clear. While most noises can be eliminated by model ensemble as shown in . The hair textures become more natural (see image 1601 and image 1645), and the noise is suppressed to some extent (see image 1617 and 1643).
Besides, we have calculated PSNR, SSIM, LPIPS and PI on the results of 1,401-1,500 images form DIV8K, which haven’t been used for training. The results are shown in Tab. 3. The configuration of each column is as shown as Fig. 7.
We proposed RFB-ESRGAN for single image extreme perceptual super resolution problem. For 16 scale super resolution, we proposed using multi-scale receptive fields for extracting multi-scale features of LR image. In addition, we proposed using small convolution kernels to extract detailed features of input image for reconstructing the detailed features of SR image. We have also proposed using nearest interpolation and sub-pixel convolution alternately for improving the information exchange between spacial and depth, and reducing the amount of parameters in upsampling stage. Our experiments and the results of NTIRE 2020 Perceptual Extreme Super-Resolution Challenge have demonstrate the effectiveness of our solution for perceptual extreme super-resolution.
Accelerating the super-resolution convolutional neural network.In European conference on computer vision, pages 391–407. Springer, 2016.