Progressive Perception-Oriented Network for Single Image Super-Resolution
Recently, it has been shown that deep neural networks can significantly improve the performance of single image super-resolution (SISR). Numerous studies have focused on raising the quantitative quality of super-resolved (SR) images. However, methods that target PSNR maximization usually produce overly smooth images at large upscaling factors. The introduction of generative adversarial networks (GANs) can mitigate this issue and shows impressive results with synthetic high-frequency textures. Nevertheless, these GAN-based approaches tend to add fake textures and even artifacts to make the SR image appear to have higher resolution. In this paper, we propose a novel perceptual image super-resolution method that progressively generates visually high-quality results by constructing a stage-wise network. Specifically, the first stage concentrates on minimizing pixel-wise error, and the second stage utilizes the features extracted by the previous stage to pursue results with better structural retention. The final stage employs fine structure features distilled by the second stage to produce more realistic results. In this way, we can maintain the pixel-level and structure-level information in the perceptual image as much as possible. It is worth noting that the proposed method can build three types of images in a feed-forward process. Also, we explore a new generator that adopts multi-scale hierarchical feature fusion. Extensive experiments on benchmark datasets show that our approach is superior to the state-of-the-art methods. Code is available at https://github.com/Zheng222/PPON.
In this section, we focus on deep neural network approaches to solve the SR problem.
The pioneering work was done by Dong et al. [1, 2], who proposed SRCNN for the SISR task, which outperformed conventional algorithms. To further improve the accuracy, Kim et al. proposed two deep networks, VDSR and DRCN, which apply global residual learning and recursive layers, respectively, to the SR problem. Tai et al. developed a deep recursive residual network (DRRN) to reduce the model size of very deep networks by using a parameter sharing mechanism. Another work by the same authors is a very deep end-to-end persistent memory network (MemNet) for image restoration, which tackles the long-term dependency problem of previous CNN architectures. The aforementioned methods take interpolated LR images as inputs, which inevitably increases the computational complexity and often results in visible reconstruction artifacts.
To speed up deep learning-based SR approaches, Shi et al. proposed an efficient sub-pixel convolutional neural network (ESPCN), which extracts features in the LR space and magnifies the spatial resolution at the end of the network with an efficient sub-pixel convolution layer. Afterward, Dong et al. developed a fast SRCNN (FSRCNN), which employs a transposed convolution to upscale and aggregate the LR-space features. However, these two methods fail to learn complicated mappings due to their limited model capacity. EDSR, the winning solution of NTIRE2017, was presented by Lim et al. and is far superior in performance to previous models. To alleviate the difficulty of the SR task with large scaling factors, Lai et al. proposed LapSRN, which progressively reconstructs multiple SR images at different scales in one feed-forward network. Tong et al. presented a network for SR that employs dense skip connections, demonstrating that combining features at different levels helps improve SR performance. Recently, Zhang et al. extended this idea and proposed a residual dense network (RDN), whose core component is the residual dense block (RDB), which extracts abundant local features via densely connected convolutional layers. Furthermore, the same authors proposed very deep residual channel attention networks (RCAN), verifying that a very deep network can effectively improve SR performance and showing the advantages of the channel attention mechanism. To balance execution speed and performance, IDN and CARN were proposed by Hui et al. and Ahn et al., respectively. More concretely, Hui et al. constructed a deep but compact network that mainly exploits and fuses different types of features, while Ahn et al. designed a cascading network architecture whose main idea is to add multiple cascading connections from each intermediate layer to the others. Such connections help the model perform SISR accurately and efficiently.
SRGAN, a landmark work in perceptual-driven SR, was proposed by Ledig et al. This approach is the first attempt to apply the GAN framework to SR, where the generator is composed of residual blocks. To improve the naturalness of the images, perceptual and adversarial losses were used to train the model in SRGAN. Sajjadi et al. explored a local texture matching loss and further improved the visual quality of the generated images. Park et al. developed a GAN-based SISR method that produces realistic results by attaching an additional discriminator that works in the feature domain. Mechrez et al. defined the Contextual loss, which measures the similarity between the generated image and a target image by comparing the statistical distributions of their features. Wang et al. enhanced SRGAN in three key components: network architecture, adversarial loss, and perceptual loss. A variant of their Enhanced SRGAN (ESRGAN) won first place in the PIRM2018-SR Challenge. Choi et al. introduced two quantitative score predictor networks to help the generator improve the perceptual quality of the upscaled images.
Single image super-resolution aims to estimate the SR image $I_{SR}$ from its LR counterpart $I_{LR}$. The overall structure of the proposed basic model (RFN) is shown in Figure 1. This network mainly consists of two parts: a content feature extraction module (CFEM) and a reconstruction part. The first part extracts content features for the conventional image SR task (pursuing a high PSNR value), and the second part reconstructs the image from these content-related features. The first procedure can be expressed by

$$F_c = H_{CFEM}(I_{LR}), \tag{1}$$

where $H_{CFEM}(\cdot)$ denotes the content feature extractor, i.e., CFEM. Then, $F_c$ is sent to the content reconstruction module (CRM) $H_{CRM}$,

$$I_c = H_{CRM}(F_c) = H_{RFN}(I_{LR}), \tag{2}$$

where $H_{RFN}(\cdot)$ denotes the function of our RFN.

The basic model is optimized with the MAE loss function, following previous works [9, 13, 14]. Given a training set $\{I_{LR}^i, I_{HR}^i\}_{i=1}^{N}$, where $N$ is the number of training images and $I_{HR}^i$ is the ground-truth high-resolution image of the low-resolution image $I_{LR}^i$, the loss function of our basic SR model is

$$L_c(\Theta_c) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{RFN}(I_{LR}^i) - I_{HR}^i \right\|_1, \tag{3}$$

where $\Theta_c$ denotes the parameter set of our content-oriented branch (COBranch), i.e., RFN.
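To make the MAE objective concrete, the following framework-free sketch computes a mean absolute error over a toy batch. It is only an illustration under stated assumptions: images are flattened into plain Python lists, the error is averaged per pixel (a constant rescaling of a per-image L1 norm), and the name `mae_loss` is hypothetical rather than taken from the authors' code.

```python
def mae_loss(sr_batch, hr_batch):
    """Mean absolute error over all pixels of all images in a batch.

    sr_batch, hr_batch: lists of flattened images (lists of floats).
    Averaging per pixel is an assumption made for this sketch.
    """
    total, count = 0.0, 0
    for sr, hr in zip(sr_batch, hr_batch):
        for s, h in zip(sr, hr):
            total += abs(s - h)
            count += 1
    return total / count


# Two toy 3-pixel "images" standing in for super-resolved outputs and targets.
sr = [[0.2, 0.5, 0.9], [0.1, 0.4, 0.8]]
hr = [[0.0, 0.5, 1.0], [0.1, 0.6, 0.8]]
loss = mae_loss(sr, hr)
```

In a real training loop the same quantity would normally be computed on tensors by the deep learning framework so that gradients can flow back to the network parameters.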
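As depicted in Figure 2, based on the content features extracted by the CFEM, we design a structure feature extraction module (SFEM) to distill structure-related information for restoring images with the structure reconstruction module (SRM). This process can be expressed by

$$I_s = H_{SRM}(F_s) = H_{SRM}(H_{SFEM}(F_c)), \tag{4}$$

where $H_{SRM}$ and $H_{SFEM}$ denote the functions of SRM and SFEM, respectively, $F_c$ denotes the content features (see Equation 1), and $F_s$ denotes the structure features. To this end, we employ the multi-scale structural similarity index (MS-SSIM) and a multi-scale $L_1$ loss to optimize this branch. SSIM is defined as

$$\mathrm{SSIM}(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} = l(x, y) \cdot cs(x, y), \tag{5}$$

where $\mu_x$, $\mu_y$ are the means, $\sigma_x^2$, $\sigma_y^2$ are the variances, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$, $C_2$ are constants. Given multiple scales obtained through $M$ stages of downsampling, MS-SSIM is defined as

$$\text{MS-SSIM}(x, y) = l_M^{\alpha}(x, y) \cdot \prod_{j=1}^{M} cs_j^{\beta_j}(x, y), \tag{6}$$

where $l_M$ and $cs_j$ are the terms of Equation 5 evaluated at scale $M$ and scale $j$, respectively, and $\alpha$, $\beta_j$ are exponents that weight them. The MS-SSIM loss is then

$$L_{ms}(\Theta_s) = 1 - \frac{1}{B} \sum_{i=1}^{B} \text{MS-SSIM}\left(H_{SOB}(F_c^i), I_{HR}^i\right), \tag{7}$$

where $H_{SOB}$ represents the cascade of SFEM and SRM (light red area in Figure 5), $B$ is the batch size, and $F_c^i$ denotes the content features (see Equation 1) corresponding to the $i$-th training sample in a batch. Thus, the total loss function of this branch can be formulated as

$$L_s(\Theta_s) = L_{ms} + \lambda L_{ml1}, \tag{8}$$

where $L_{ml1}$ is the multi-scale $L_1$ loss, $\lambda$ is a scalar value that balances the two losses, and $\Theta_s$ denotes the parameter set of the structure-oriented branch (SOBranch). Here, the hyperparameters are set empirically.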
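As a rough numerical illustration of the structural term, the sketch below evaluates single-scale SSIM from global statistics and combines `1 - SSIM` with an L1 term. It simplifies heavily: there are no Gaussian windows and no multiple scales, the constants `c1` and `c2` use the conventional values for signals in [0, 1], and the weight `lam=0.1` is a placeholder rather than the value used in the paper.

```python
def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM computed from global statistics of two signals.

    SSIM = (2*mu_x*mu_y + c1) / (mu_x^2 + mu_y^2 + c1)
         * (2*cov_xy   + c2) / (var_x + var_y   + c2)
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure


def structure_loss(x, y, lam=0.1):
    """(1 - SSIM) plus a weighted L1 term; lam is a hypothetical weight."""
    l1 = sum(abs(a - b) for a, b in zip(x, y)) / len(x)
    return (1.0 - ssim_global(x, y)) + lam * l1


reference = [0.1, 0.4, 0.7, 0.9]
blurred = [0.3, 0.5, 0.6, 0.7]
same_score = ssim_global(reference, reference)
blur_score = ssim_global(reference, blurred)
blur_loss = structure_loss(reference, blurred)
```

Identical signals give an SSIM of 1 and a loss of 0, while the flattened signal scores lower, which is the behavior the structure-oriented objective relies on.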
Similarly, to obtain photo-realistic images, we take the structure-related features refined by the SFEM and send them to our perception feature extraction module (PFEM). The merit of this practice is that it avoids re-extracting features from the image domain. Moreover, these features contain abundant, high-quality structural information, which greatly helps the perceptual-oriented branch (POBranch, see Figure 5) generate visually plausible SR images while maintaining the basic structure. Concretely, the structural feature $F_s$ is fed into the PFEM,

$$I_p = H_{PRM}(F_p) = H_{PRM}(H_{PFEM}(F_s)), \tag{9}$$

where $H_{PRM}$ and $H_{PFEM}$ denote the functions of the perceptual reconstruction module (PRM) and PFEM, respectively, and $F_p$ denotes the perceptual features.
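Given a real image $x_r$ and a generated (fake) image $x_f$, the relativistic discriminator intends to estimate the probability that $x_r$ is more realistic than $x_f$. In a standard GAN, the discriminator can be defined, in terms of the non-transformed layer $C(x)$, as $D(x) = \sigma(C(x))$, where $\sigma$ is the sigmoid function. The Relativistic average Discriminator (RaD, denoted by $D_{Ra}$) can be formulated as $D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)]\right)$ if $x_r$ is real. Here, $\mathbb{E}_{x_f}[C(x_f)]$ is the average of all fake data in a batch. The discriminator loss is defined by

$$L_D = -\mathbb{E}_{x_r}\left[\log D_{Ra}(x_r, x_f)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]. \tag{10}$$

The corresponding adversarial loss for the generator is

$$L_{adv} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log D_{Ra}(x_f, x_r)\right], \tag{11}$$

where $x_f$ represents the generated images at the current perception-maximization stage, i.e., $I_p$ in Equation 9.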
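The relativistic average formulation can be sketched without any deep learning framework. The snippet below assumes that the raw (pre-sigmoid) discriminator outputs for a batch of real and fake images are already available as plain lists; the function name `rad_losses` and the toy scores are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rad_losses(c_real, c_fake):
    """Relativistic average GAN losses from raw discriminator outputs C(x).

    c_real: raw scores for real images, c_fake: raw scores for generated ones.
    Returns (discriminator_loss, generator_adversarial_loss).
    """
    mean_real = sum(c_real) / len(c_real)
    mean_fake = sum(c_fake) / len(c_fake)
    # D_Ra(x_r, x_f): how much more realistic a real image is than the average fake.
    d_real = [sigmoid(c - mean_fake) for c in c_real]
    # D_Ra(x_f, x_r): how much more realistic a fake image is than the average real.
    d_fake = [sigmoid(c - mean_real) for c in c_fake]

    def mean(values):
        return sum(values) / len(values)

    loss_d = -mean([math.log(p) for p in d_real]) - mean([math.log(1.0 - p) for p in d_fake])
    loss_g = -mean([math.log(1.0 - p) for p in d_real]) - mean([math.log(p) for p in d_fake])
    return loss_d, loss_g


# A discriminator that cannot tell real from fake: both losses equal 2*ln(2).
tied_d, tied_g = rad_losses([0.0, 0.0], [0.0, 0.0])
# A discriminator that separates the two sets well: low loss_d, high loss_g.
sep_d, sep_g = rad_losses([2.0, 3.0], [-2.0, -3.0])
```

Unlike the standard GAN generator loss, the generator term here depends on both the real and the fake scores, so both sets of images contribute to the generator update.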
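The VGG loss, which has been investigated in recent SR works [16, 17, 18, 20] for better visual quality, is also introduced in this stage. We calculate the VGG loss based on the “conv5_4” layer of VGG19,

$$L_{vgg} = \frac{1}{V} \sum_{i=1}^{C} \left\| \phi_i(I_{HR}) - \phi_i(I_p) \right\|_1, \tag{12}$$

where $V$ and $C$ indicate the tensor volume and channel number of the feature maps, respectively, $\phi_i$ denotes the $i$-th channel of the feature maps extracted from the hidden layer of the VGG19 model, and $I_p$ is the output of the POBranch. Therefore, the total loss for the perception stage is

$$L_p(\Theta_p) = L_{vgg} + \lambda L_{adv}, \tag{13}$$

where $L_{adv}$ is the adversarial loss for the generator, $\lambda$ is the coefficient that balances these loss functions, and $\Theta_p$ denotes the training parameters of the POBranch.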
We now give more details about our proposed RRFB structure (see Figure 3(b)), which consists of multiple hierarchical feature fusion blocks (HFFB) (see Figure 3(a)). Different from the residual block frequently used in SR, we strengthen its representational ability by introducing a spatial pyramid of dilated convolutions. Specifically, we apply $K$ dilated convolutional kernels simultaneously, each with a dilation rate of $k$, $k = 1, \dots, K$. Because these dilated convolutions have different receptive fields, we can aggregate them to obtain multi-scale features. As shown in Figure 4, a single dilated convolution with a dilation rate of 3 (yellow block) samples its input sparsely. To acquire an effective receptive field, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them. A simple example is illustrated in Figure 4. To explain this hierarchical feature fusion process clearly, the output of the dilated convolution with a dilation rate of $k$ is denoted by $F_k$. In this way, the concatenated multi-scale features $H$ can be expressed by

$$H = \left[F_1,\; F_1 + F_2,\; \dots,\; F_1 + F_2 + \dots + F_K\right]. \tag{14}$$

After collecting these multi-scale features, we fuse them through a convolution $\mathcal{C}$, that is, $\mathcal{C}(H)$. Finally, a local skip connection with residual scaling is used to complete our HFFB.
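The hierarchical addition followed by concatenation can be illustrated on toy data. The sketch below assumes that the cumulative sum starts from the branch with the smallest dilation rate and represents each branch output as a short list of floats; in a real network these would be feature maps concatenated along the channel dimension, and the function name is hypothetical.

```python
def hierarchical_feature_fusion(branch_outputs):
    """Cumulatively add branch outputs, then concatenate the partial sums.

    branch_outputs: list of K equal-length feature vectors F_1..F_K,
    ordered by increasing dilation rate.
    Returns [F_1, F_1+F_2, ..., F_1+...+F_K] flattened into one list.
    """
    fused = []
    running = [0.0] * len(branch_outputs[0])
    for feat in branch_outputs:
        # Add the current branch onto the sum of all smaller-dilation branches.
        running = [r + f for r, f in zip(running, feat)]
        fused.extend(running)
    return fused


f1 = [1.0, 2.0]      # stand-in for the output at dilation rate 1
f2 = [10.0, 20.0]    # dilation rate 2
f3 = [100.0, 200.0]  # dilation rate 3
h = hierarchical_feature_fusion([f1, f2, f3])
```

Each later slot therefore also contains the responses of all smaller dilation rates, which fills in positions that a sparse, large-dilation kernel would otherwise skip.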
We use the DIV2K dataset, which consists of 1,000 high-quality RGB images (800 training images, 100 validation images, and 100 test images) with 2K resolution. To increase the diversity of training images, we also use the Flickr2K dataset, consisting of 2,650 2K-resolution images. In this way, we have 3,450 high-resolution images for training. LR training images are obtained by downscaling the HR images by the target scaling factor using the bicubic interpolation function in MATLAB. Fixed-size HR image patches are randomly cropped from the HR images as training targets for our proposed model, and the mini-batch size is set to 25. Data augmentation, consisting of random horizontal flips and 90-degree rotations, is performed on the 3,450 training images. For evaluation, we use six widely used benchmark datasets: Set5, Set14, BSD100, Urban100, Manga109, and the PIRM dataset. The SR results are evaluated with PSNR, SSIM, learned perceptual image patch similarity (LPIPS), and perceptual index (PI) on the Y (luminance) channel. PI is based on the no-reference image quality measures of Ma et al. and NIQE, i.e., $\mathrm{PI} = \frac{1}{2}\left((10 - \mathrm{Ma}) + \mathrm{NIQE}\right)$. The lower the values of LPIPS and PI, the better.
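For reference, the perceptual index combines the two no-reference scores in a single line. The sketch below assumes the PIRM 2018 challenge definition, PI = 0.5 * ((10 - Ma) + NIQE), and takes the two scores as precomputed numbers; computing Ma and NIQE themselves requires the original evaluation tools and is not shown.

```python
def perceptual_index(ma_score, niqe_score):
    """Perceptual index as used in the PIRM 2018 challenge (lower is better).

    ma_score: score of Ma et al. (higher means better quality, range about 0-10).
    niqe_score: NIQE score (lower means better quality).
    """
    return 0.5 * ((10.0 - ma_score) + niqe_score)


# Hypothetical scores for two images, used only to show the direction of the metric.
pi_sharp = perceptual_index(ma_score=8.0, niqe_score=4.0)
pi_blurry = perceptual_index(ma_score=5.0, niqe_score=7.0)
```

Because neither input looks at the ground truth, a low PI alone does not guarantee fidelity to the HR image.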
In the first stage, the COBranch is trained with a learning rate that is halved every 1000 epochs. We then fix the parameters of the COBranch and train only the SOBranch with the loss function in Equation 8. This process is illustrated in the second row of Figure 5. During this stage, the learning rate is halved every 250 epochs. Similarly, we finally train only the POBranch with Equation 13, using the same learning rate scheme as in the second phase. All stages are trained with the ADAM optimizer. We implement our model in PyTorch and train it on NVIDIA GTX 1080Ti GPUs.
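The step-wise halving of the learning rate can be written as a small helper. The halving intervals (1000 epochs for the first stage, 250 for the later stages) come from the text; the initial learning rate of `2e-4` is a placeholder chosen for illustration, and the function name is hypothetical.

```python
def step_lr(initial_lr, epoch, step_epochs, factor=0.5):
    """Learning rate after `epoch` epochs when it is multiplied by `factor`
    once every `step_epochs` epochs."""
    return initial_lr * factor ** (epoch // step_epochs)


BASE_LR = 2e-4  # placeholder value, not taken from the text

# Stage 1 (COBranch): halve every 1000 epochs.
stage1_lr = [step_lr(BASE_LR, e, step_epochs=1000) for e in (0, 999, 1000, 2000)]
# Stages 2 and 3 (SOBranch, POBranch): halve every 250 epochs.
stage2_lr = [step_lr(BASE_LR, e, step_epochs=250) for e in (0, 250, 500)]
```

In PyTorch the same schedule is usually expressed with a step-based scheduler attached to the optimizer of the branch being trained, while the parameters of the earlier branches stay frozen.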
The HFFB uses a fixed number of dilated convolutions, all with the same kernel size and 32 filters, as shown in Figure 3(a). In each RRFB, we set the number of HFFBs to 3. In the COBranch, we apply 24 RRFBs, while only 2 RRFBs are employed in each of the SOBranch and POBranch. All standard convolutional layers have 64 filters and share the same kernel size, except for the one at the end of each HFFB. The residual scaling parameter and the negative slope of LReLU are fixed constants.
Model Parameters. We compare the trade-off between performance and model size in Figure 6. Among the nine models, RFN and RCAN show higher PSNR values than the others. In particular, RFN records the best performance on Set5. It should be noted that RFN uses fewer parameters than RCAN to achieve this performance, which indicates that RFN better balances performance and model size.
|PSNR on Set5||31.68||31.69||31.63||31.72|
Study of dilated convolution and hierarchical feature fusion. We first remove the hierarchical feature fusion structure. Furthermore, to investigate the function of dilated convolution, we replace the dilated convolutions with ordinary convolutions. For quick validation, a reduced number of RRFBs is used in the CFEM, and this network is called RFN_mini. We conduct the training process with the DIV2K dataset, and the results are reported in Table I. As the number of RRFBs increases, the benefits accumulate (see Table II).
|Item||w/o CRM & SOBranch||w/o SOBranch||PPON|
|Memory footprint (M)||7,357||6,311||6,317|
|Training time (sec/epoch)||1,063||541||567|
|PIRM_Val (PSNR / SSIM / LPIPS / PI)||25.61 / 0.6802 / 0.1287 / 2.2857||26.32 / 0.6981 / 0.1250 / 2.2282||26.20 / 0.6995 / 0.1194 / 2.2353|
|PIRM_Test (PSNR / SSIM/ LPIPS / PI)||25.47 / 0.6667 / 0.1367 / 2.2055||26.16 / 0.6831 / 0.1309 / 2.1704||26.01 / 0.6831 / 0.1273 / 2.1511|
[Figure: visual comparison of HR, SRGAN, ENet, CX, SuperSR, ESRGAN, and PPON (Ours).]
We observe that perceptual-driven SR results produced by GAN-based approaches [17, 18, 19] often suffer from structure distortion as illustrated in Figure 9. To alleviate this problem, we explicitly add structural information through our devised progressive architecture described in the main manuscript. To make it easier to understand this progressive practice, we show an example in Figure 10. From this picture, we can see that the difference between SRc and SRp is mainly reflected in the sharper texture of SRp. Therefore, the remaining component is substantially the same. Based on this viewpoint, we naturally design the progressive topology structure, i.e., gradually adding high-frequency details.
To validate that the feature maps extracted by the CFEM, SFEM, and PFEM are related to one another, we visualize the intermediate feature maps in Figure 8. From this picture, we can see that the feature maps distilled by the three extraction modules are similar. Thus, features extracted in the previous stage can be reused in the current stage. In addition, the feature maps in the third sub-figure contain more texture information, which is helpful for reconstructing visually high-quality images. To verify the necessity of the progressive structure, we remove the CRM and SOBranch from PPON (i.e., changing to a normal structure, similar to ESRGAN). We observe that PPON without CRM & SOBranch cannot generate clear structural information, while PPON recovers it better. Table III suggests that our progressive structure can greatly improve the fidelity measured by PSNR and SSIM while also improving perceptual quality. It also indicates that having fewer updatable parameters not only occupies less memory but also leads to faster training.
In our work, only a few learnable parameters (1.3M) are needed to complete the task migration (i.e., from structure-aware to perceptual-aware) well, while ESRGAN uses 16.7M parameters to generate perceptual results. Put simply, we explicitly decompose the task into three subtasks (content, structure, perception). This approach is similar to how a person paints: first sketching the lines, then adding details. Our topology can easily achieve migration between similar tasks and inference of multiple tasks according to specific needs.
|Dataset||Scores||SRGAN ||ENet ||CX ||||||NatSR ||ESRGAN ||PPON (Ours)|
We compare our RFN with state-of-the-art methods: SRCNN [1, 2], FSRCNN, VDSR, DRCN, LapSRN, MemNet, IDN, EDSR, SRMDNF, D-DBPN, RDN, MSRN, CARN, RCAN, and SRFBN. Table IV shows quantitative comparisons for SR. It can be seen that our RFN performs the best in terms of PSNR on all the datasets, and the proposed S-RFN shows great advantages with regard to SSIM. In Figure 11, we present visual comparisons on different datasets. For image “img_011”, we observe that most of the compared methods cannot recover the lines and suffer from blurring artifacts. In contrast, our RFN can slightly alleviate this phenomenon and restore more details.
Table V shows our quantitative evaluation results compared with perceptual-driven state-of-the-art approaches: SRGAN, ENet, CX, EPSR, NatSR, and ESRGAN. The proposed PPON achieves the best LPIPS while keeping competitive PSNR values. For image “86” in Figure 12, the result generated by S-RFN is blurred but has a fine structure. Building on S-RFN, our PPON can synthesize realistic textures while retaining this structure, which also validates the effectiveness of the proposed progressive architecture.
We consider LPIPS (https://github.com/richzhang/PerceptualSimilarity) and PI (https://github.com/roimehrez/PIRM2018) as our evaluation indices for perceptual image SR. As illustrated in Figure 13, the PI score of EPSR3 (2.2666) is even better than that of HR (2.3885), yet the EPSR3 result looks unnatural and lacks proper texture and structure. The perceptual quality of the ESRGAN and PPON results is superior to that of EPSR3, which is in accordance with the corresponding LPIPS values. The results of S-RFN and PPON show that both PI and LPIPS are able to distinguish blurry images. From the images of EPSR3, SuperSR, and the ground truth (HR), it is clear that a lower PI value does not necessarily mean better image quality. Compared with the image generated by ESRGAN, the proposed PPON achieves a better visual effect with more structure information, which corresponds to its lower LPIPS value. Because PI (a no-reference measure) was not sensitive to deformation in our experiments and cannot reflect the similarity with the ground truth, we take LPIPS as our primary perceptual measure and PI as a secondary metric.
In addition, we performed a MOS (mean opinion score) test to further validate the effectiveness of our PPON. Specifically, we ask a group of raters to assign an integer score on a scale from bad quality to excellent quality. To ensure the reliability of the results, we provide the raters with the test images and the original HR images at the same time. The ground-truth images serve as the reference, and the raters score the test images relative to them. The average MOS results are shown in Table VI.
|LPIPS / PI||LPIPS/ PI|
|ESRGAN ||0.1443 / 2.5550||0.1523 / 2.4356|
|PPON_128 (Ours)||0.1241 / 2.3026||0.1321 / 2.2080|
|PPON (Ours)||0.1194 / 2.2736||0.1273 / 2.1770|
In ESRGAN, the authors mentioned that a larger training patch size costs more training time and consumes more computing resources. Thus, they used different patch sizes for PSNR-oriented methods and for perceptual-driven methods. In our main manuscript, we train the COBranch, SOBranch, and POBranch with image patches of the same size. Here, we further explore the influence of larger patches in the perceptual image generation stage. It is important to note that training a perceptual-driven model requires more GPU memory and computing resources than a PSNR-oriented model, since the VGG model and the discriminator need to be loaded during training. Therefore, larger patches are hard to use in optimizing ESRGAN because of the large generator and discriminator that must be updated. Because our POBranch contains very few parameters, we can employ larger training patches and achieve better results, as shown in Table VII. The discriminators are illustrated in Figure 14. For a fair comparison with ESRGAN, we also retrain our POBranch with smaller patches (PPON_128) and provide the results in Table VII.
[Figure 14: The network structure of the discriminators. The output size is scaled down by stride 2.]
In this paper, we propose a progressive perception-oriented network (PPON) for better perceptual image SR. Concretely, three branches are developed to learn the content, structure, and perceptual details, respectively. By adopting a stage-by-stage training scheme, we can steadily obtain promising results. It is worth mentioning that these three branches are not independent: the features extracted by and the output images of the content-oriented branch are exploited by the structure-oriented branch. Extensive experiments on both traditional SR and perceptual SR demonstrate the effectiveness of our proposed PPON.
This work was supported in part by the National Natural Science Foundation of China under Grant 61432014, 61772402, U1605252, 61671339 and 61871308, in part by the National Key Research and Development Program of China under Grant 2016QY01W0200, in part by National High-Level Talents Special Support Program of China under Grant CS31117200001.
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.