Progressive Perception-Oriented Network for Single Image Super-Resolution

by Zheng Hui, et al.
Alibaba Cloud
Xidian University

Recently, it has been shown that deep neural networks can significantly improve the performance of single image super-resolution (SISR). Numerous studies have focused on raising the quantitative quality of super-resolved (SR) images. However, methods that target PSNR maximization usually produce overly smooth images at large upscaling factors. Generative adversarial networks (GANs) can mitigate this issue and produce impressive results with synthetic high-frequency textures. Nevertheless, GAN-based approaches tend to add fake textures and even artifacts to make the SR image look visually higher-resolution. In this paper, we propose a novel perceptual image super-resolution method that progressively generates visually high-quality results by constructing a stage-wise network. Specifically, the first stage concentrates on minimizing pixel-wise error, and the second stage utilizes the features extracted by the first stage to pursue results with better structural retention. The final stage employs fine structure features distilled by the second stage to produce more realistic results. In this way, we can maintain the pixel-level and structure-level information in the perceptual image as much as possible. It is worth noting that the proposed method can build three types of images in a single feed-forward process. We also explore a new generator that adopts multi-scale hierarchical feature fusion. Extensive experiments on benchmark datasets show that our approach is superior to the state-of-the-art methods. Code is available at




I Related Work

In this section, we focus on deep neural network approaches to solve the SR problem.

I-A Deep learning-based super-resolution

The pioneering work was done by Dong et al. [1, 2], who proposed SRCNN for the SISR task, which outperformed conventional algorithms. To further improve accuracy, Kim et al. proposed two deep networks, i.e., VDSR [4] and DRCN [5], which apply global residual learning and recursive layers, respectively, to the SR problem. Tai et al. [7] developed a deep recursive residual network (DRRN) to reduce the model size of very deep networks by using a parameter sharing mechanism. The same authors also designed a very deep end-to-end persistent memory network (MemNet) [10] for image restoration, which tackles the long-term dependency problem in previous CNN architectures. The aforementioned methods take interpolated LR images as inputs, which inevitably increases the computational complexity and often results in visible reconstruction artifacts [8].

To speed up deep learning-based SR approaches, Shi et al. [6] proposed an efficient sub-pixel convolutional neural network (ESPCN), which extracts features in the LR space and magnifies the spatial resolution at the end of the network with an efficient sub-pixel convolution layer. Afterward, Dong et al. [3] developed a fast SRCNN (FSRCNN), which employs transposed convolution to upscale and aggregate the LR space features. However, these two methods fail to learn complicated mappings due to their limited model capacity. EDSR [9], the winning solution of NTIRE2017 [25], was presented by Lim et al. and is far superior in performance to previous models. To alleviate the difficulty of the SR task with large scaling factors, Lai et al. [8] proposed LapSRN, which progressively reconstructs multiple SR images at different scales in one feed-forward network. Tong et al. [11] presented a network for SR that employs dense skip connections and demonstrated that combining features at different levels helps improve SR performance. Recently, Zhang et al. [13] extended this idea and proposed a residual dense network (RDN), whose core component is the residual dense block (RDB), which extracts abundant local features via densely connected convolutional layers. Furthermore, the same authors proposed very deep residual channel attention networks (RCAN) [14], which verified that a very deep network can effectively improve SR performance and demonstrated the advantages of the channel attention mechanism. To balance execution speed and performance, IDN [12] and CARN [26] were proposed by Hui et al. and Ahn et al., respectively. More concretely, Hui et al. constructed a deep but compact network, which mainly exploits and fuses different types of features, while Ahn et al. designed a cascading network architecture. The main idea is to add multiple cascading connections from each intermediary layer to the others. Such connections help this model perform SISR accurately and efficiently.
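The sub-pixel convolution used by ESPCN [6] ends with a channel-to-space rearrangement, often called pixel shuffle. The sketch below illustrates that rearrangement on nested Python lists. The function name `pixel_shuffle` and the channel ordering are assumptions chosen for illustration (the ordering follows the convention commonly used in deep learning frameworks); it is not taken from the ESPCN implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Channel index c*r*r + i*r + j of the input supplies the pixel at
    offset (i, j) inside each r x r output cell of output channel c.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 feature maps become a single 2x2 map for r = 2.
features = [[[1]], [[2]], [[3]], [[4]]]
upscaled = pixel_shuffle(features, 2)
```

Because all convolutions before this step run at LR resolution, the cost of the network grows with the LR size rather than the HR size, which is the efficiency argument made above.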

I-B Super-resolution considering naturalness

SRGAN [17], a landmark work in perceptual-driven SR, was proposed by Ledig et al. This approach is the first attempt to apply the GAN [22] framework to SR, where the generator is composed of residual blocks [27]. To improve the naturalness of the images, perceptual and adversarial losses were used to train the model in SRGAN. Sajjadi et al. [18] explored a local texture matching loss and further improved the visual quality of the synthesized images. Park et al. [28] developed a GAN-based SISR method that produces realistic results by attaching an additional discriminator that works in the feature domain. Mechrez et al. [19] defined the Contextual loss, which measures the similarity between the generated image and a target image by comparing the statistical distributions in feature space. Wang et al. [20] enhanced SRGAN in three key components: network architecture, adversarial loss, and perceptual loss. A variant of Enhanced SRGAN (ESRGAN) won first place in the PIRM2018-SR Challenge [29]. Choi et al. [30] introduced two quantitative score predictor networks to help the generator improve the perceptual quality of the upscaled images.

II Proposed Method

II-A The proposed PSNR-oriented SR model

Single image super-resolution aims to estimate the SR image $I_{SR}$ from its LR counterpart $I_{LR}$. The overall structure of the proposed basic model (RFN) is shown in Figure 1. This network mainly consists of two parts: the content feature extraction module (CFEM) and the reconstruction part. The first part extracts content features for the conventional image SR task (pursuing a high PSNR value), and the second part reconstructs the image from these content-related features. The first procedure can be expressed by

$$F_c = H_{CFEM}(I_{LR}), \tag{1}$$

where $H_{CFEM}(\cdot)$ denotes the content feature extractor, i.e., CFEM. Then, $F_c$ is sent to the content reconstruction module (CRM) $H_{CRM}(\cdot)$,

$$I_{SR_c} = H_{CRM}(F_c) = H_{RFN}(I_{LR}), \tag{2}$$

where $H_{RFN}(\cdot)$ denotes the function of our RFN.

The basic model is optimized with the MAE loss function, following previous works [9, 13, 14]. Given a training set $\{I_{LR}^i, I_{HR}^i\}_{i=1}^{N}$, where $N$ is the number of training images and $I_{HR}^i$ is the ground-truth high-resolution image of the low-resolution image $I_{LR}^i$, the loss function of our basic SR model is

$$L_c(\Theta_c) = \frac{1}{N}\sum_{i=1}^{N} \left\| H_{RFN}(I_{LR}^i) - I_{HR}^i \right\|_1, \tag{3}$$

where $\Theta_c$ denotes the parameter set of our content-oriented branch (COBranch), i.e., RFN.
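As a small illustration of the MAE objective used for the COBranch, the sketch below averages the absolute pixel error over a batch. It is a simplified stand-in, not the authors' training code: images are represented as flat lists of pixel values, the name `mae_loss` is hypothetical, and the error is averaged per pixel (which differs from a per-image L1 norm only by a constant factor).

```python
def mae_loss(sr_batch, hr_batch):
    """Mean absolute error over every pixel of every image in the batch.

    sr_batch and hr_batch are lists of images; each image is a flat list
    of pixel values of equal length.
    """
    total, count = 0.0, 0
    for sr, hr in zip(sr_batch, hr_batch):
        for s, h in zip(sr, hr):
            total += abs(s - h)
            count += 1
    return total / count


# Two toy "images" of two pixels each.
loss = mae_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
```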

II-B Progressive perception-oriented SR model

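As depicted in Figure 2, based on the content features extracted by the CFEM, we design an SFEM to distill structure-related information for restoring images with the SRM. This process can be expressed by

$$I_{SR_s} = H_{SRM}(F_s) = H_{SRM}(H_{SFEM}(F_c)), \tag{4}$$

where $H_{SRM}(\cdot)$ and $H_{SFEM}(\cdot)$ denote the functions of SRM and SFEM, respectively, and $F_c$ denotes the content features of Equation 1. To this end, we employ the multi-scale structural similarity index (MS-SSIM) and the multi-scale $L_1$ loss to optimize this branch. SSIM is defined as

$$\mathrm{SSIM}(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} = l(x, y) \cdot cs(x, y), \tag{5}$$

where $\mu_x$, $\mu_y$ are the means, $\sigma_x^2$, $\sigma_y^2$ are the variances, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$, $C_2$ are constants. Given multiple scales through a process of $M$ stages of downsampling, MS-SSIM is defined as

$$\mathrm{MS\text{-}SSIM}(x, y) = l_M^{\alpha}(x, y) \cdot \prod_{j=1}^{M} cs_j^{\beta_j}(x, y), \tag{6}$$

where $l_M$ and $cs_j$ are the terms defined in Equation 5 at scale $M$ and $j$, respectively. The exponents $\alpha$ and $\beta_j$ are set following [31]. The MS-SSIM loss of our structure branch can then be expressed by

$$L_{ms}^{s} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \mathrm{MS\text{-}SSIM}\left(H_{SB}(F_c^i), I_{HR}^i\right)\right), \tag{7}$$

where $H_{SB}(\cdot)$ represents the cascade of SFEM and SRM (light red area in Figure 5) and $F_c^i$ denotes the content features (see Equation 1) corresponding to the $i$-th training sample in a batch. Thus, the total loss function of this branch can be formulated as

$$L_s(\Theta_s) = L_{ms}^{1} + \lambda L_{ms}^{s}, \tag{8}$$

where $L_{ms}^{1}$ is the multi-scale $L_1$ loss, $\lambda$ is a scalar value that balances the two losses, and $\Theta_s$ denotes the parameter set of the structure-oriented branch (SOBranch). The remaining hyperparameters of this loss are set empirically.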
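The SSIM and MS-SSIM definitions (Equations 5 and 6) can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions, not the loss used for training: signals are 1-D lists, the statistics are global rather than computed in local windows, downsampling is pairwise averaging, and the constants `c1`, `c2` as well as the exponents passed as `betas` are placeholder values chosen for the example.

```python
def ssim_terms(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Return the luminance term l and the contrast-structure term cs.

    x and y are flat lists of values in [0, 1]; global statistics are an
    assumption made for brevity (practical SSIM uses local windows).
    """
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    cs = (2 * cov + c2) / (var_x + var_y + c2)
    return l, cs


def downsample2(x):
    """Halve the resolution of a 1-D signal by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def ms_ssim(x, y, betas, alpha=1.0):
    """l at the coarsest scale times the product of cs_j ** beta_j.

    Assumes every cs_j is positive; a negative value would need clamping
    before the fractional power.
    """
    score = 1.0
    for j, beta in enumerate(betas):
        l, cs = ssim_terms(x, y)
        score *= cs ** beta
        if j == len(betas) - 1:
            score *= l ** alpha  # luminance is only used at the last scale
        else:
            x, y = downsample2(x), downsample2(y)
    return score


def ms_ssim_loss(x, y, betas):
    """Loss form used for structure-oriented training: 1 - MS-SSIM."""
    return 1.0 - ms_ssim(x, y, betas)


reference = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
darker = [0.5 * v for v in reference]
example_betas = [0.2, 0.3, 0.5]  # hypothetical exponents, three scales
same_score = ms_ssim(reference, reference, example_betas)
darker_score = ms_ssim(reference, darker, example_betas)
```

An identical pair scores 1 (zero loss), while the darkened copy scores lower because both its contrast term at the finest scale and its luminance term at the coarsest scale fall below 1.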

Similarly, to obtain photo-realistic images, we take the structure-related features refined by SFEM and send them to our perception feature extraction module (PFEM). This avoids re-extracting features from the image domain, and the extracted features contain abundant, high-quality structural information, which greatly helps the perceptual-oriented branch (POBranch, see Figure 5) generate visually plausible SR images while maintaining the basic structure. Concretely, the structural feature $F_s$ is fed into PFEM,

$$I_{SR_p} = H_{PRM}(F_p) = H_{PRM}(H_{PFEM}(F_s)), \tag{9}$$

where $H_{PRM}(\cdot)$ and $H_{PFEM}(\cdot)$ indicate PRM and PFEM as shown in Figure 2, respectively. To pursue a better visual effect, we adopt the Relativistic GAN [32] as in [20]. Given a real image $x_r$ and a fake one $x_f$, the relativistic discriminator intends to estimate the probability that $x_r$ is more realistic than $x_f$. In a standard GAN, the discriminator can be defined, in terms of the non-transformed layer $C(x)$, as $D(x) = \sigma(C(x))$, where $\sigma$ is the sigmoid function. The Relativistic average Discriminator (RaD, denoted by $D_{Ra}$) [32] can be formulated as $D_{Ra}(x_r, x_f) = \sigma(C(x_r) - \mathbb{E}_{x_f}[C(x_f)])$ if $x_r$ is real. Here, $\mathbb{E}_{x_f}[C(x_f)]$ is the average over all fake data in a batch. The discriminator loss is defined by

$$L_D^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]. \tag{10}$$

The corresponding adversarial loss for the generator is

$$L_G^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(D_{Ra}(x_f, x_r)\right)\right], \tag{11}$$

where $x_f$ represents the generated images at the current perception-maximization stage, i.e., $I_{SR_p}$ in Equation 9.
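The relativistic average losses can be sketched numerically as below. This is an illustrative sketch following the general Relativistic average GAN formulation of [32], not the authors' code: the inputs are assumed to be raw (pre-sigmoid) discriminator outputs for one batch, and the names `ragan_losses`, `c_real`, and `c_fake` are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _mean(values):
    return sum(values) / len(values)


def ragan_losses(c_real, c_fake):
    """Return (discriminator loss, generator loss) for one batch.

    c_real / c_fake are lists of raw discriminator outputs for real and
    generated images. Each sample is compared against the batch average
    of the opposite class.
    """
    mean_real, mean_fake = _mean(c_real), _mean(c_fake)
    d_real = [_sigmoid(c - mean_fake) for c in c_real]  # real vs. average fake
    d_fake = [_sigmoid(c - mean_real) for c in c_fake]  # fake vs. average real
    loss_d = (-_mean([math.log(p) for p in d_real])
              - _mean([math.log(1.0 - p) for p in d_fake]))
    loss_g = (-_mean([math.log(1.0 - p) for p in d_real])
              - _mean([math.log(p) for p in d_fake]))
    return loss_d, loss_g


# A discriminator that cannot separate the classes gives 2*ln(2) for both.
undecided = ragan_losses([0.0, 0.0], [0.0, 0.0])
# A confident discriminator has a small loss, the generator a large one.
confident = ragan_losses([2.0, 2.0], [-2.0, -2.0])
```

Unlike the standard GAN generator loss, the generator term here also depends on the real samples, so gradients flow from both real and generated data.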

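The VGG loss, which has been investigated in recent SR works [16, 17, 18, 20] for better visual quality, is also introduced in this stage. We calculate the VGG loss based on the “conv5_4” layer of VGG19 [21],

$$L_{vgg} = \frac{1}{V}\sum_{i=1}^{C}\left\| \phi_i(I_{HR}) - \phi_i(I_{SR_p}) \right\|_1, \tag{12}$$

where $V$ and $C$ indicate the tensor volume and channel number of the feature maps, respectively, $\phi_i$ denotes the $i$-th channel of the feature maps extracted from the hidden layer of the VGG19 model, and $I_{SR_p}$ is the output of the POBranch. Therefore, the total loss for the perception stage is

$$L_p(\Theta_p) = L_{vgg} + \eta L_G^{Ra}, \tag{13}$$

where $L_G^{Ra}$ is the adversarial loss for the generator, $\eta$ is the coefficient that balances these loss functions, and $\Theta_p$ denotes the trainable parameters of the POBranch.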
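The way the three branches share features in a single feed-forward pass, and which modules are updated in each training stage, can be sketched as follows. This is a structural sketch only: the modules are passed in as plain callables, the dictionary keys mirror the module names used in the text, and the assumption that the POBranch consists of PFEM and PRM is inferred from the description above rather than stated by it.

```python
def ppon_forward(lr_image, modules):
    """Produce the content, structure, and perceptual outputs in one pass.

    `modules` maps the names CFEM, CRM, SFEM, SRM, PFEM, PRM to callables.
    Each later branch reuses the features of the previous one instead of
    re-extracting them from the image.
    """
    f_c = modules["CFEM"](lr_image)   # content features
    sr_c = modules["CRM"](f_c)        # PSNR-oriented output
    f_s = modules["SFEM"](f_c)        # structure features from content features
    sr_s = modules["SRM"](f_s)        # structure-oriented output
    f_p = modules["PFEM"](f_s)        # perceptual features from structure features
    sr_p = modules["PRM"](f_p)        # perception-oriented output
    return sr_c, sr_s, sr_p


# Modules updated in each of the three training stages; all others are frozen.
STAGE_TRAINABLE = {
    1: ("CFEM", "CRM"),   # COBranch
    2: ("SFEM", "SRM"),   # SOBranch
    3: ("PFEM", "PRM"),   # POBranch (assumed composition)
}


def frozen_modules(stage):
    """Names of the modules whose parameters stay fixed in a given stage."""
    everything = [name for names in STAGE_TRAINABLE.values() for name in names]
    return [name for name in everything if name not in STAGE_TRAINABLE[stage]]


# Stub modules that record which blocks a signal passed through.
def _tracer(name):
    return lambda path: path + [name]


stubs = {name: _tracer(name) for name in ("CFEM", "CRM", "SFEM", "SRM", "PFEM", "PRM")}
sr_c, sr_s, sr_p = ppon_forward([], stubs)
```

With the tracing stubs, `sr_p` records the path CFEM, SFEM, PFEM, PRM, which makes explicit that the perceptual output never passes through the reconstruction modules of the earlier branches.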

II-C Residual-in-residual fusion block

We now give more details about our proposed RRFB structure (see Figure 3(b)), which consists of multiple hierarchical feature fusion blocks (HFFB) (see Figure 3(a)). Different from the frequently used residual block in SR, we strengthen its representational ability by introducing a spatial pyramid of dilated convolutions [24]. Specifically, we apply $K$ dilated convolutional kernels in parallel, each with a dilation rate of $k$, $k = 1, \dots, K$. Because these dilated convolutions have different receptive fields, we can aggregate them to obtain multi-scale features. As shown in Figure 4, a single dilated convolution with a dilation rate of 3 (yellow block) looks sparse. To acquire an effective receptive field, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them. A simple example is illustrated in Figure 4. To explain this hierarchical feature fusion process clearly, the output of the dilated convolution with a dilation rate of $k$ is denoted by $F_k$. In this way, the concatenated multi-scale features can be expressed by

$$H = [F_1,\; F_1 + F_2,\; \dots,\; F_1 + F_2 + \cdots + F_K]. \tag{14}$$

After collecting these multi-scale features, we fuse them through a convolution $C_f$, that is, $C_f(H)$. Finally, the local skip connection with residual scaling is utilized to complete our HFFB.
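The hierarchical addition before concatenation can be illustrated with a few lines of Python. The sketch below is an assumption-laden toy: each branch output is a flat list standing in for a feature map, the names `hierarchical_fusion` and `effective_kernel_size` are hypothetical, and the fusion convolution and residual scaling that complete the block are omitted.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def hierarchical_fusion(branch_outputs):
    """Cumulatively add branch outputs, then stack the partial sums.

    branch_outputs[k] is the output of the dilated convolution with the
    (k+1)-th dilation rate. The result is the list of partial sums that
    would be concatenated along the channel axis.
    """
    fused, running = [], None
    for feat in branch_outputs:
        if running is None:
            running = list(feat)
        else:
            running = [a + b for a, b in zip(running, feat)]
        fused.append(list(running))
    return fused


# Three branches with constant toy responses.
stacked = hierarchical_fusion([[1, 1], [2, 2], [3, 3]])
extents = [effective_kernel_size(3, d) for d in (1, 2, 3)]
```

Each partial sum mixes the sparse response of a larger dilation rate with the denser responses of the smaller ones, which is why the summed maps cover the receptive field more densely than a single dilated convolution.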

Fig. 4: A diagrammatic sketch of the addition of multiple dilated convolutions. The middle sub-figure shows a dilated convolution with a dilation rate of 2. For the same receptive field, the sum of dilated convolutions is denser than a single dilated convolution.
Fig. 5: The training scheme for our PPON. The light green region (COBranch) in the first row is our basic model RFN. The light red and yellow areas represent the SOBranch and POBranch described in Section II-B, respectively. The entire training process is divided into 3 stages. A module marked with a small lock has its parameters frozen.

III Experiments

III-A Datasets and Training Details

We use the DIV2K dataset [25], which consists of 1,000 high-quality RGB images (800 training images, 100 validation images, and 100 test images) with 2K resolution. To increase the diversity of training images, we also use the Flickr2K dataset [9], consisting of 2,650 2K-resolution images. In this way, we have 3,450 high-resolution images for training. LR training images are obtained by downscaling the HR images with a scaling factor of ×4 using the bicubic interpolation function in MATLAB. Fixed-size HR image patches are randomly cropped from the HR images to train the proposed model, and the mini-batch size is set to 25. Data augmentation is performed on the 3,450 training images via random horizontal flips and 90-degree rotations. For evaluation, we use six widely used benchmark datasets: Set5 [33], Set14 [34], BSD100 [35], Urban100 [36], Manga109 [37], and the PIRM dataset [29]. The SR results are evaluated with PSNR, SSIM [15], learned perceptual image patch similarity (LPIPS) [38], and perceptual index (PI) on the Y (luminance) channel, where PI is based on the no-reference image quality measures of Ma et al. [39] and NIQE [40], i.e., $\mathrm{PI} = \frac{1}{2}\left((10 - \mathrm{Ma}) + \mathrm{NIQE}\right)$. The lower the values of LPIPS and PI, the better.
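Two of the evaluation measures are simple enough to sketch directly. The snippet below computes PSNR from the mean squared error and combines precomputed Ma and NIQE scores into the perceptual index as defined for the PIRM2018-SR Challenge [29]. It is a minimal sketch: images are flat lists of luminance values, the peak value of 255 is an assumption about the value range, and the Ma and NIQE scores are taken as given inputs rather than computed here.

```python
import math


def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(sr, hr)) / len(hr)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def perceptual_index(ma_score, niqe_score):
    """PI = ((10 - Ma) + NIQE) / 2; lower means better perceived quality."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)


worst_case = psnr([0.0, 0.0], [255.0, 255.0])
identical = psnr([10.0, 20.0], [10.0, 20.0])
example_pi = perceptual_index(ma_score=8.0, niqe_score=4.0)  # toy inputs
```

Note that PI needs no ground-truth image, whereas PSNR, SSIM, and LPIPS all compare the result against the HR reference.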

As depicted in Figure 5, the training process is divided into three phases. First, we train the COBranch with Equation 3. The learning rate is decreased by a factor of 2 every 1000 epochs. Then, we fix the parameters of the COBranch and train only the SOBranch through the loss function in Equation 8. This process is illustrated in the second row of Figure 5. During this stage, the learning rate is halved every 250 epochs. Similarly, we finally train only the POBranch by Equation 13, with the same learning rate scheme as in the second phase. All stages are trained with the ADAM optimizer [41]. We use the PyTorch framework to implement our models and train them on NVIDIA GTX 1080Ti GPUs.
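The step-wise learning rate decay described for the three phases can be written as a one-line schedule. In the sketch below the function name `step_lr` is hypothetical and the initial learning rate of 1.0 is a placeholder used only to make the halving visible; it is not the value used for training.

```python
def step_lr(initial_lr, epoch, step, factor=0.5):
    """Learning rate after `epoch` epochs when it is multiplied by
    `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


# Phase 1: halve every 1000 epochs. Phases 2 and 3: halve every 250 epochs.
phase1 = [step_lr(1.0, e, 1000) for e in (0, 999, 1000, 2500)]
phase2 = [step_lr(1.0, e, 250) for e in (0, 250, 500)]
```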

The HFFB structure uses a fixed number of dilated convolutions, each with 32 filters, as shown in Figure 3(a). In each RRFB, we set the number of HFFBs to 3. In COBranch, we apply 24 RRFBs, while only 2 RRFBs are employed in both SOBranch and POBranch. All standard convolutional layers have 64 filters and share the same kernel size, except for the one at the end of HFFB, which uses a different kernel size. The residual scaling parameter and the negative slope of LReLU are kept fixed throughout.

III-B Model analysis

Fig. 6: PSNR performance and number of parameters. The results are evaluated on the Set5 dataset for a scaling factor of ×4.

Model Parameters. We compare the trade-off between performance and model size in Figure 6. Among the nine models, RFN and RCAN show higher PSNR values than others. In particular, RFN records the best performance in Set5. It should be noted that RFN uses fewer parameters than RCAN to achieve this performance. It means that RFN can better balance performance and model size.

Dilated convolution
Hierarchical fusion
PSNR on Set5 31.68 31.69 31.63 31.72
TABLE I: Investigations of dilated convolution and hierarchical fusion. These models are trained 200k iterations with DIV2K training dataset.

Study of dilated convolution and hierarchical feature fusion. We first remove the hierarchical feature fusion structure. Furthermore, to investigate the effect of dilated convolution, we replace it with ordinary convolutions. For quick validation, a reduced number of RRFBs is used in CFEM, and this network is called RFN_mini. We conduct the training process with the DIV2K dataset, and the results are shown in Table I. As the number of RRFBs increases, the benefit accumulates (see Table II).

Method N_blocks Set5 Set14 BSD100 Urban100
w/o dilation 2 32.05 28.51 27.52 25.91
RFN_Mini 2 32.07 28.53 27.53 25.91
w/o dilation 4 32.18 28.63 27.59 26.16
RFN_Mini 4 32.26 28.67 27.60 26.23
TABLE II: Investigations of dilated convolution. Above models are trained 300k iterations with DIV2K training dataset.
Fig. 7: Ablation study of progressive structure on image 71 from PIRM_Val. Panels: HR, w/o CRM & SOBranch, w/o SOBranch, PPON.
Fig. 8: The feature maps of CFEM, SFEM, and PFEM are visualized from left to right. Best viewed with zoom-in.
Item                                  w/o CRM & SOBranch                  w/o SOBranch                        PPON
Memory footprint (M)                  7,357                               6,311                               6,317
Training time (sec/epoch)             1,063                               541                                 567
PIRM_Val (PSNR / SSIM / LPIPS / PI)   25.61 / 0.6802 / 0.1287 / 2.2857    26.32 / 0.6981 / 0.1250 / 2.2282    26.20 / 0.6995 / 0.1194 / 2.2353
PIRM_Test (PSNR / SSIM / LPIPS / PI)  25.47 / 0.6667 / 0.1367 / 2.2055    26.16 / 0.6831 / 0.1309 / 2.1704    26.01 / 0.6831 / 0.1273 / 2.1511
TABLE III: Ablation study of progressive structure.

III-C Progressive structure analysis

Fig. 9: An example of the structure distortion. The image is from the BSD100 dataset [35]. Panels: HR, SRGAN [17], ENet [18], CX [19], SuperSR [20], ESRGAN [20], PPON (Ours).
Fig. 10: A comparison of the visual effects between the three branch outputs. SRc, SRs, and SRp are outputs of the COBranch, SOBranch, and POBranch, respectively. The image is from the PIRM_Val dataset [29].

We observe that perceptual-driven SR results produced by GAN-based approaches [17, 18, 19] often suffer from structure distortion, as illustrated in Figure 9. To alleviate this problem, we explicitly add structural information through the progressive architecture described in the main manuscript. To make this progressive practice easier to understand, we show an example in Figure 10. From this picture, we can see that the difference between SRc and SRp is mainly reflected in the sharper texture of SRp, while the remaining components are substantially the same. Based on this viewpoint, we naturally design the progressive topology structure, i.e., gradually adding high-frequency details.

To verify that the feature maps extracted by the CFEM, SFEM, and PFEM are related, we visualize the intermediate feature maps in Figure 8. The feature maps distilled by the three extraction modules are similar, so features extracted in the previous stage can be utilized in the current phase. In addition, the feature maps in the third sub-figure contain more texture information, which is instructive for the reconstruction of visually high-quality images. To verify the necessity of the progressive structure, we remove CRM and SOBranch from PPON (i.e., changing to a normal structure, similar to ESRGAN [20]). We observe that PPON without CRM & SOBranch cannot generate clear structural information, while PPON can better recover it. Table III suggests that our progressive structure greatly improves the fidelity measured by PSNR and SSIM while also improving perceptual quality. It also indicates that fewer updatable parameters not only occupy less memory but also enable faster training.

In our work, a small number of learnable parameters (1.3M) suffices for task migration (i.e., from structure-aware to perceptual-aware), while ESRGAN [20] uses 16.7M to generate perceptual results. Put simply, we explicitly decompose a task into three subtasks (content, structure, perception). This approach is similar to how a human paints: first sketching the lines, then adding details. Our topology can easily support migration to similar tasks and inference of multiple tasks according to specific needs.

III-D Comparisons with state-of-the-art methods

Method Set5 Set14 B100 Urban100 Manga109 (each dataset lists PSNR followed by SSIM)
Bicubic 28.42 0.8104 26.00 0.7027 25.96 0.6675 23.14 0.6577 24.89 0.7866
SRCNN [1] 30.48 0.8628 27.50 0.7513 26.90 0.7101 24.52 0.7221 27.58 0.8555
FSRCNN [3] 30.72 0.8660 27.61 0.7550 26.98 0.7150 24.62 0.7280 27.90 0.8610
VDSR [4] 31.35 0.8838 28.01 0.7674 27.29 0.7251 25.18 0.7524 28.87 0.8865
DRCN [5] 31.53 0.8854 28.02 0.7670 27.23 0.7233 25.14 0.7510 28.93 0.8854
LapSRN [8] 31.54 0.8852 28.09 0.7700 27.32 0.7275 25.21 0.7562 29.02 0.8900
MemNet [10] 31.74 0.8893 28.26 0.7723 27.40 0.7281 25.50 0.7630 29.42 0.8942
IDN [12] 31.82 0.8903 28.25 0.7730 27.41 0.7297 25.41 0.7632 29.41 0.8936
EDSR [9] 32.46 0.8968 28.80 0.7876 27.71 0.7420 26.64 0.8033 31.02 0.9148
SRMDNF [42] 31.96 0.8925 28.35 0.7772 27.49 0.7337 25.68 0.7731 30.09 0.9024
D-DBPN [43] 32.47 0.8980 28.82 0.7860 27.72 0.7400 26.38 0.7946 30.91 0.9137
RDN [13] 32.47 0.8990 28.81 0.7871 27.72 0.7419 26.61 0.8028 31.00 0.9151
MSRN [44] 32.07 0.8903 28.60 0.7751 27.52 0.7273 26.04 0.7896 30.17 0.9034
CARN [26] 32.13 0.8937 28.60 0.7806 27.58 0.7349 26.07 0.7837 30.47 0.9084
RCAN [14] 32.63 0.9002 28.87 0.7889 27.77 0.7436 26.82 0.8087 31.22 0.9173
SRFBN [45] 32.47 0.8983 28.81 0.7868 27.72 0.7409 26.60 0.8015 31.15 0.9160
SAN [46] 32.64 0.9003 28.92 0.7888 27.78 0.7436 26.79 0.8068 31.18 0.9169
RFN(Ours) 32.71 0.9007 28.95 0.7901 27.83 0.7449 27.01 0.8135 31.59 0.9199
S-RFN(Ours) 32.66 0.9022 28.86 0.7946 27.74 0.7515 26.95 0.8169 31.51 0.9211
TABLE IV: Quantitative evaluation results in terms of PSNR and SSIM. Red and blue colors indicate the best and second best methods, respectively. Here, S-RFN is the combination of RFN and SOBranch.
Dataset Scores SRGAN [17] ENet [18] CX [19] EPSR2 [47] EPSR3 [47] NatSR [48] ESRGAN [20] PPON (Ours)
Set5 PSNR 29.43 28.57 29.12 31.24 29.59 31.00 30.47 30.84
SSIM 0.8356 0.8103 0.8323 0.8650 0.8415 0.8617 0.8518 0.8561
PI 3.3554 2.9261 3.2947 4.1123 3.2571 4.1875 3.7550 3.4590
LPIPS 0.0837 0.1014 0.0806 0.0978 0.0889 0.0943 0.0748 0.0664
Set14 PSNR 26.12 25.77 26.06 27.77 26.36 27.53 26.28 26.97
SSIM 0.6958 0.6782 0.7001 0.7440 0.7097 0.7356 0.6984 0.7194
PI 2.8816 3.0176 2.7590 3.0246 2.6981 3.1138 2.9259 2.7741
LPIPS 0.1488 0.1620 0.1452 0.1861 0.1576 0.1765 0.1329 0.1176
B100 PSNR 25.18 24.94 24.59 26.28 25.19 26.45 25.32 25.74
SSIM 0.6409 0.6266 0.6440 0.6905 0.6468 0.6835 0.6514 0.6684
PI 2.3513 2.9078 2.2501 2.7458 2.1990 2.7746 2.4789 2.3775
LPIPS 0.1843 0.2013 0.1881 0.2474 0.2474 0.2115 0.1614 0.1597
PIRM_Val PSNR N/A 25.07 25.41 27.35 25.46 27.03 25.18 26.20
SSIM N/A 0.6459 0.6747 0.7277 0.6657 0.7199 0.6596 0.6995
PI N/A 2.6876 2.1310 2.3880 2.0688 2.4758 2.5550 2.2353
LPIPS N/A 0.1667 0.1447 0.1750 0.1869 0.1648 0.1443 0.1194
PIRM_Test PSNR N/A 24.95 25.31 27.04 25.35 26.95 25.04 26.01
SSIM N/A 0.6306 0.6636 0.7068 0.6535 0.7090 0.6454 0.6831
PI N/A 2.7232 2.1133 2.2752 2.0131 2.3772 2.4356 2.1511
LPIPS N/A 0.1776 0.1519 0.1739 0.1902 0.1712 0.1523 0.1273
TABLE V: Results on public benchmark datasets, PIRM_Val, and PIRM_Test for existing perceptual quality specific methods and our proposed PPON. Red color indicates the best performance and blue color indicates the second best performance.

We compare our RFN with state-of-the-art methods: SRCNN [1, 2], FSRCNN [3], VDSR [4], DRCN [5], LapSRN [8], MemNet [10], IDN [12], EDSR [9], SRMDNF [42], D-DBPN [43], RDN [13], MSRN [44], CARN [26], RCAN [14], and SRFBN [45]. Table IV shows quantitative comparisons for SR. Our RFN performs the best in terms of PSNR on all the datasets, and the proposed S-RFN shows clear advantages with regard to SSIM. In Figure 11, we present visual comparisons on different datasets. For image “img_011”, we observe that most of the compared methods cannot recover the lines and suffer from blurring artifacts. In contrast, our RFN alleviates this phenomenon and restores more details.

Fig. 11: Visual comparisons for SR with RFN on Urban100 and Manga109 datasets. Each example shows HR, VDSR [4], LapSRN [8], DRRN [7], MemNet [10], EDSR [9], RDN [13], CARN [26], RCAN [14], and RFN (Ours), annotated with PSNR/SSIM.
Fig. 12: Qualitative comparisons of perceptual-driven SR methods with our results at a scaling factor of 4. Here, SuperSR is the variant of ESRGAN that won first place in the PIRM2018-SR Challenge. Examples are taken from BSD100, PIRM_Val, and PIRM_Test; panels show HR, SRGAN [17], ENet [18], CX [19], EPSR3 [47], SuperSR [20], ESRGAN [20], S-RFN (Ours), PPON_128 (Ours), and PPON (Ours), annotated with PSNR/LPIPS.

Table V shows our quantitative evaluation results compared with perceptual-driven state-of-the-art approaches: SRGAN [17], ENet [18], CX [19], EPSR [47], NatSR [48], and ESRGAN [20]. The proposed PPON achieves the best LPIPS while keeping reasonable PSNR values. For image “86” in Figure 12, the result generated by S-RFN is blurred but has fine structure. Building on S-RFN, our PPON can synthesize realistic textures while retaining this structure, which also validates the effectiveness of the proposed progressive architecture.

III-E The choice of main evaluation metric

Fig. 13: A visual comparison with the state-of-the-art perceptual image SR algorithms on image 296059 from BSD100. Values are (PSNR / LPIPS / PI).
HR: (∞ / 0 / 2.3885)
SRGAN [17]: (28.96 / 0.1564 / 2.6015)
ENet [18]: (29.18 / 0.1432 / 2.8138)
CX [19]: (28.57 / 0.1563 / 2.3492)
EPSR2 [47]: (30.47 / 0.2046 / 3.2575)
EPSR3 [47]: (29.02 / 0.1911 / 2.2666)
SuperSR [20]: (29.80 / 0.1703 / 2.2913)
ESRGAN [20]: (29.38 / 0.1333 / 2.3481)
S-RFN (Ours): (31.40 / 0.3314 / 4.7222)
PPON (Ours): (29.26 / 0.1305 / 2.5130)

We consider LPIPS [38] and PI [29] as our evaluation indices of perceptual image SR. As illustrated in Figure 13, the PI score of EPSR3 (2.2666) is even better than that of HR (2.3885), but the EPSR3 result looks unnatural and lacks proper texture and structure. The perceptual quality of ESRGAN and our PPON is superior to that of EPSR3, which is in accordance with the corresponding LPIPS values. The results of S-RFN and PPON demonstrate that both PI and LPIPS are able to identify blurry images. From the images of EPSR3, SuperSR, and the ground truth (HR), we can see that a lower PI value does not necessarily mean better image quality. Compared with the image generated by ESRGAN [20], the proposed PPON achieves a better visual effect with more structure information, which corresponds to its lower LPIPS value. Because PI (a no-reference measure) is not sensitive to deformation in our experiments and cannot reflect the similarity with the ground truth, we take LPIPS as our primary perceptual measure and PI as a secondary metric.

In addition, we performed a MOS (mean opinion score) test to further validate the effectiveness of our PPON. Specifically, raters assign an integer score on a scale ranging from bad quality to excellent quality. To ensure the reliability of the results, we provide the raters with the test images and the original HR images at the same time. The ground-truth images serve as the reference, and the raters score the test images relative to them. The average MOS results are shown in Table VI.

Method CX ESRGAN S-RFN PPON
MOS 2.42 3.23 1.82 3.58
PSNR 25.41 25.18 28.63 26.20
SSIM 0.6747 0.6596 0.7913 0.6995
TABLE VI: Comparison of CX, ESRGAN, S-RFN, and PPON.

III-F The influence of training patch size

Method PIRM_Val (LPIPS / PI) PIRM_Test (LPIPS / PI)
ESRGAN [20] 0.1443 / 2.5550 0.1523 / 2.4356
PPON_128 (Ours) 0.1241 / 2.3026 0.1321 / 2.2080
PPON (Ours) 0.1194 / 2.2736 0.1273 / 2.1770
TABLE VII: Quantitative evaluation of different perceptual-driven SR methods in LPIPS and PI. PPON_128 indicates the POBranch trained with 128×128 image patches. The best and second best results are highlighted and underlined, respectively.

In ESRGAN [20], the authors mentioned that a larger training patch size costs more training time and consumes more computing resources. Thus, they used different patch sizes for PSNR-oriented and perceptual-driven methods. In our main manuscript, we train the COBranch, SOBranch, and POBranch with image patches of the same size. Here, we further explore the influence of larger patches in the perceptual image generation stage. It is important to note that training a perceptual-driven model requires more GPU memory and computing resources than training a PSNR-oriented model, since the VGG model and the discriminator need to be loaded during training. Therefore, larger patches are hard to use when optimizing ESRGAN [20], because its large generator and discriminator both have to be updated. Because our POBranch contains very few parameters, we can employ larger training patches and achieve better results, as shown in Table VII. The discriminators are illustrated in Figure 14. For a fair comparison with ESRGAN [20], we retrain our POBranch with 128×128 patches and provide the results in Table VII.

Fig. 14: The network structure of the discriminators. (a) Discriminator for training patches in PPON_128. (b) Discriminator for training patches in PPON. The output size is scaled down by stride 2, and the parameter of LReLU is
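The caption states that the discriminator's output size is scaled down by stride 2. As a rough sketch, the code below applies the standard convolution output-size formula to a stack of stride-2 layers. The kernel size, padding, and layer count are assumptions chosen for illustration and are not read from Figure 14.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def downsampled_sizes(patch_size, num_stride2_layers, kernel=4, padding=1):
    """Feature-map side lengths after each assumed stride-2 convolution."""
    sizes = [patch_size]
    for _ in range(num_stride2_layers):
        sizes.append(conv_output_size(sizes[-1], kernel, 2, padding))
    return sizes


# Illustrative: a 128x128 input passed through five assumed stride-2 layers.
sizes_128 = downsampled_sizes(128, 5)
```

With these assumed settings each stride-2 layer halves the side length, so a larger training patch leaves a larger feature map at the same depth. This is one reason a discriminator may need a different structure for a different patch size.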


IV Conclusion

In this paper, we propose a progressive perception-oriented network (PPON) for better perceptual image SR. Concretely, three branches are developed to learn the content, structure, and perceptual details, respectively. By adopting a stage-by-stage training scheme, we steadily obtain promising results. It is worth mentioning that these three branches are not independent: the features extracted by one branch are reused by the next. For example, the extracted features and output images of the content-oriented branch are exploited by the structure-oriented branch. Extensive experiments on both traditional SR and perceptual SR demonstrate the effectiveness of the proposed PPON.


This work was supported in part by the National Natural Science Foundation of China under Grants 61432014, 61772402, U1605252, 61671339, and 61871308, in part by the National Key Research and Development Program of China under Grant 2016QY01W0200, and in part by the National High-Level Talents Special Support Program of China under Grant CS31117200001.


  • [1] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.
  • [2] ——, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
  • [3] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016, pp. 391–407.
  • [4] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016, pp. 1646–1654.
  • [5] ——, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016, pp. 1637–1645.
  • [6] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016, pp. 1874–1883.
  • [7] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017, pp. 3147–3155.
  • [8] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in CVPR, 2017, pp. 624–632.
  • [9] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPR Workshop, 2017, pp. 136–144.
  • [10] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in ICCV, 2017, pp. 3147–3155.
  • [11] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017, pp. 4799–4807.
  • [12] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in CVPR, 2018, pp. 723–731.
  • [13] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in CVPR, 2018, pp. 2472–2481.
  • [14] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in ECCV, 2018, pp. 286–301.
  • [15] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [16] J. Johnson, A. Alahi, and F.-F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016, pp. 694–711.
  • [17] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017, pp. 4681–4690.
  • [18] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” in ICCV, 2017, pp. 4491–4500.
  • [19] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, “Maintaining natural image statistics with the contextual loss,” in ACCV, 2018.
  • [20] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang, “Esrgan: Enhanced super-resolution generative adversarial networks,” in ECCV Workshop, 2018.
  • [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
  • [23] X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in CVPR, 2018, pp. 606–615.
  • [24] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in ECCV, 2018, pp. 552–568.
  • [25] R. Timofte, E. Agustsson, L. V. Gool, M.-H. Yang, L. Zhang, et al., “Ntire 2017 challenge on single image super-resolution: Methods and results,” in CVPR Workshop, 2017, pp. 1110–1121.
  • [26] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in ECCV, 2018, pp. 252–268.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [28] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, “Srfeat: Single image super-resolution with feature discrimination,” in ECCV, 2018, pp. 439–455.
  • [29] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “2018 pirm challenge on perceptual image super-resolution,” in ECCV Workshop, 2018.
  • [30] J.-H. Choi, J.-K. Kim, M. Cheon, and J.-S. Lee, “Deep learning-based image super-resolution considering quantitative and perceptual quality,” in ECCV Workshop, 2018.
  • [31] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, 2003, pp. 1398–1402.
  • [32] A. Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard gan,” in ICLR, 2019.
  • [33] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012, pp. 135.1–135.10.
  • [34] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, 2010, pp. 711–730.
  • [35] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in CVPR, 2001, pp. 416–423.
  • [36] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015, pp. 5197–5206.
  • [37] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21 811–21 838, 2017.
  • [38] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
  • [39] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Computer Vision and Image Understanding, vol. 158, pp. 1–16, 2017.
  • [40] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.
  • [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
  • [42] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018, pp. 3262–3271.
  • [43] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in CVPR, 2018, pp. 1664–1673.
  • [44] J. Li, F. Fang, K. Mei, and G. Zhang, “Multi-scale residual network for image super-resolution,” in ECCV, 2018, pp. 517–532.
  • [45] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in CVPR, 2019, pp. 3867–3876.
  • [46] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in CVPR, 2019, pp. 11 065–11 074.
  • [47] S. Vasu, N. T. Madam, and R. A.N., “Analyzing perception-distortion tradeoff using enhanced perceptual super-resolution network,” in ECCV Workshop, 2018.
  • [48] J. W. Soh, G. Y. Park, J. Jo, and N. I. Cho, “Natural and realistic single image super-resolution with explicit natural manifold discrimination,” in CVPR, 2019, pp. 8122–8131.