
Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution

by Jie Liang, et al.

Single image super-resolution (SISR) with generative adversarial networks (GAN) has recently attracted increasing attention due to its potential to generate rich details. However, the training of GAN is unstable, and it often introduces many perceptually unpleasant artifacts along with the generated details. In this paper, we demonstrate that it is possible to train a GAN-based SISR model which stably generates perceptually realistic details while inhibiting visual artifacts. Based on the observation that the local statistics (e.g., residual variance) of artifact areas often differ from those of areas with perceptually friendly details, we develop a framework to discriminate between GAN-generated artifacts and realistic details, and consequently generate an artifact map to regularize and stabilize the model training process. Our proposed locally discriminative learning (LDL) method is simple yet effective; it can be easily plugged into off-the-shelf SISR methods and boost their performance. Experiments demonstrate that LDL outperforms the state-of-the-art GAN-based SISR methods, achieving not only higher reconstruction accuracy but also superior perceptual quality on both synthetic and real-world datasets. Code and models are publicly available.



1 Introduction

Figure 1: Three representative types of SISR regions generated by ESRGAN [wang2018esrgan]. For each example, the left is an LR patch and the right is its GAN-SR result. Type A patches represent regions that are easy to super-resolve, e.g., smooth and large-scale structural areas, where the main structures are preserved in the LR input. In contrast, patches of types B and C contain fine-scale details, which are hard to restore faithfully due to signal aliasing in the LR inputs. The results of texture-like type B patches are perceptually realistic despite pixel-wise differences from the ground-truth, since the patterns are naturally irregular and observers hold only weak priors about them. However, the results of type C patches exhibit perceptually unpleasant visual artifacts, since the overshoot pixels and distorted structures are salient to human perception.

Single image super-resolution (SISR) [dong2014learning, johnson2016perceptual, ledig2017photo, sajjadi2017enhancenet, soh2019natural, kim2016deeply, sun2010gradient, zhang2021designing, wang2021realesrgan, shi2016real, wang2018esrgan, yan2015single, zhang2018image, zhang2018residual, jo2021tackling], which aims to reconstruct a high-resolution (HR) image from its low-resolution (LR) observation, is a popular yet challenging research topic in low-level computer vision. It has become prevalent to train deep neural networks (DNNs) for SISR. Many DNN-based SISR models [dong2014learning, zhang2018residual, anwar2019drln, niu2020single, wang2021learning] are trained with pixel-wise ℓ1 and ℓ2 losses and/or local window based metrics (such as SSIM [wang2004image]). It is well-known that though these losses can yield high PSNR and SSIM indices, they can hardly produce rich image details [blau2018perception, ledig2017photo].

With the rapid development of generative adversarial networks (GAN) [goodfellow2014generative, jolicoeur2018relativistic], GAN-based SISR (GAN-SR for short) has recently attracted significant attention for its potential to recover sharp images with rich details [ledig2017photo, sajjadi2017enhancenet, wang2018esrgan, soh2019natural, zhang2020deep]. Though great progress has been made, adversarial training is unstable and often introduces unpleasant visual artifacts [ledig2017photo, zhang2020deep]. As users mostly expect rich and realistic details in SISR results [prashnani2018pieapp, ding2020image, jinjin2020pipal], how to inhibit the visual artifacts of GAN-SR without affecting the realistic details becomes a key issue. Unfortunately, details and artifacts are often entangled in the high-frequency components of images. As a result, optimizing one of them often harms the other under existing frameworks [blau2018perception, ledig2017photo, wang2018esrgan, ma2020structure].

In order to address the above-mentioned challenges, we investigate GAN-SR methods in depth and categorize their results into three typical types of regions, as illustrated in Figure 1. Specifically, type A patches (e.g., flat sky, long edges) are easy to reconstruct since they are smooth or contain only large-scale structures. In contrast, it is difficult to produce high-fidelity SISR results for patches of types B and C because they contain many fine-scale details and suffer from signal aliasing in the degradation process, where most high-frequency components are lost. Fortunately, for texture-like type B patches (e.g., animal fur, tree leaves in the distance), the pixels are randomly distributed so that the differences between SR results and ground truth are insensitive to human perception. Therefore, rich details generated by GAN-SR methods can lead to better perceptual quality in these regions. However, patches of type C (e.g., thin twigs, dense windows on a building) contain many fine-scale regular structures or sharp transitions among adjacent pixels. The distorted structures and overshoot pixels generated by GAN-SR methods are easily perceived by observers as unpleasant artifacts.

Based on the above analysis, we can see that to obtain perceptually realistic SISR results, the visual artifacts in type C regions should be inhibited, while the realistic details generated in type A and type B regions should be preserved. To achieve this goal, we analyze the local statistics of the three types of GAN-SR regions, and find that the local variance of the residuals between SISR results and ground truth HR images can serve as an effective feature to distinguish unpleasant artifacts from realistic details. Accordingly, we construct a pixel-wise map indicating the probability of each pixel being an artifact based on the local and patch-level residual variances. We further refine the discrimination map via a model ensemble strategy to encourage a stable and accurate optimization direction toward high-fidelity reconstruction. Based on the refined map, we design a Locally Discriminative Learning (LDL) framework to penalize the artifacts without affecting realistic details.

To sum up, in this paper we first analyze GAN-SR results and the instability of model training. We then propose to explicitly discriminate visual artifacts from realistic details, and design an LDL framework to regularize the adversarial training. Our method is simple yet effective, and it can be easily plugged into off-the-shelf GAN-SR methods. It provides a novel way to suppress the artifacts in GAN-SR while generating rich realistic details. We conduct extensive experiments on synthetic and real-world SISR tasks, and LDL demonstrates clear improvements over state-of-the-art methods both quantitatively and qualitatively.

2 Related work

Since the pioneering work of SRCNN [dong2014learning], which first introduced a three-layer convolutional neural network (CNN) for SISR, a number of CNN based SISR models have been proposed. They can be roughly divided into signal fidelity-oriented ones [zhang2018residual, anwar2019drln, niu2020single, wang2021learning] and perceptual quality-oriented ones [johnson2016perceptual, ledig2017photo, sajjadi2017enhancenet, soh2019natural, wang2018esrgan, liang2021hierarchical], depending on the losses and training strategies they employ.

Signal fidelity-oriented SISR methods. SISR methods in this category adopt pixel-wise distance measures (such as the ℓ1 and ℓ2 losses) and local structural similarity measures (such as SSIM [wang2004image]) to optimize the signal fidelity between the SISR outputs and the HR ground-truth. Since SRCNN [dong2014learning], researchers have made remarkable progress by stacking more convolution layers [kim2016accurate, kim2016deeply] and designing more complex building blocks [lim2017enhanced, tai2017memnet] and connections [ledig2017photo, zhang2018residual, tong2017image]. For instance, benefiting from a very deep network, effective residual connections and channel attention, RCAN [zhang2018image] achieves superior reconstruction accuracy (e.g., PSNR). However, due to the ill-posedness of the SISR problem, optimizing pixel-wise losses tends to find a blurry result that is the average of many possible solutions [sajjadi2017enhancenet, soh2019natural, blau2018perception]. The SSIM loss better preserves local image structures but hardly reproduces fine details.

Perceptual quality-oriented SISR methods. To improve the perceptual quality of SISR images, Johnson et al. [johnson2016perceptual] proposed a perceptual loss that calculates the distance between HR and SISR results in the VGG feature space. To tackle the difficulties of signal fidelity-oriented methods in reproducing image details, most recent works have resorted to GAN techniques [goodfellow2014generative] for their capability to generate desired images by discriminating between image distributions [wang2018recovering, rad2019srobb, fuoli2021fourier]. For example, Ledig et al. [ledig2017photo] proposed SRGAN with adversarial training on top of the SRResNet generator. To improve the visual quality, Wang et al. [wang2018esrgan] proposed ESRGAN by introducing the Residual-in-Residual Dense Block (RRDB) along with other improvements on adversarial training and the perceptual loss. RRDB has been employed as a standard backbone in many state-of-the-art GAN-SR methods [wang2021realesrgan, zhang2021designing, ma2020structure].

Figure 2: An illustration of possible optimization directions of GAN-SR models. The patch in the center is obtained by an SISR model pre-trained with only the pixel-wise reconstruction loss, while the patches in red and yellow boxes are possible GAN-SR results under adversarial losses.

Zhang et al. [zhang2020deep] proposed a trainable unfolding network, termed USRGAN, which integrates the merits of traditional model-based methods and CNN-based ones. Ma et al. [ma2020structure] introduced gradient guidance via an additional branch in the network. By alleviating the structural distortion and inconsistency problem, the proposed SPSR method achieves leading performance among GAN-SR methods on synthetic data. Nonetheless, one key issue of all existing GAN-SR works is that they produce many unpleasant visual artifacts due to the instability of adversarial training.

Remarks. As indicated in [blau2018perception], both signal fidelity- and perceptual quality-oriented SISR methods are subject to a perception-distortion trade-off; that is, improving either the perceptual quality or the signal fidelity will harm the other under existing training strategies. Empirical evidence likewise shows that inhibiting artifacts can limit the generation of details. In this paper, we propose to regularize the adversarial training by explicitly discriminating artifacts from realistic details, which effectively addresses this dilemma. Recent works, e.g., BSRGAN [zhang2021designing] and RealESRGAN [wang2021realesrgan], have also recognized the significance of the real-world image SR task. As a plug-and-play module, our method can be easily extended to this challenging task. The experimental results demonstrate its strong generalization performance in generating realistic details while inhibiting artifacts.

3 Methodology

3.1 GAN-SR induced visual artifacts

Figure 3: Toy examples of the GAN-SR results on three types of regions. The LR patches are obtained by applying average pooling on the HR patches. The large-scale structure in the type A patch can be well reproduced with good fidelity and perceptual quality. Though the pixels in the texture-like type B patch are not faithfully reconstructed, the perceptual quality of the reconstructed patch is acceptable due to the random distribution of pixels in the HR patch. However, for type C patches, visually unpleasant artifacts are perceived in the GAN-SR results since the fine-scale yet regular structures are destroyed.

Most of the existing GAN-SR methods [ledig2017photo, wang2018esrgan] are trained using a weighted combination of three losses:

$$\mathcal{L}_{\mathrm{GAN\text{-}SR}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{p}\,\mathcal{L}_{\mathrm{percep}} + \lambda_{a}\,\mathcal{L}_{\mathrm{adv}}, \quad (1)$$

where $\mathcal{L}_{\mathrm{rec}}$ indicates the pixel-wise reconstruction loss such as the ℓ1 and ℓ2 distances, $\mathcal{L}_{\mathrm{percep}}$ is the perceptual loss [johnson2016perceptual, ledig2017photo] measuring the distance in the VGG feature space, and $\mathcal{L}_{\mathrm{adv}}$ denotes the adversarial loss [goodfellow2014generative, wang2018esrgan]. $\lambda_{p}$ and $\lambda_{a}$ are balancing parameters, set as in ESRGAN [wang2018esrgan].

According to the pioneering work of SRGAN [ledig2017photo], using only the reconstruction loss results in a blurred average of all possible HR images, while the adversarial loss can push the SISR solution away from this blurred average, generating more details. Unfortunately, GAN-SR models also generate many perceptually unpleasant artifacts in addition to the details. An intuitive illustration is shown in Figure 2. Since SISR is an ill-posed task, one LR input corresponds to many possible HR counterparts scattered in the high-dimensional image space. Starting from the blurry solution (the center patch in Figure 2) generated by an SISR model pre-trained using only the reconstruction loss, the adversarial loss can update it along many possible directions, some yielding perceptually pleasant results (in yellow boxes) and some producing unpleasant ones (in red boxes). This leads to an unstable optimization process that may generate artifacts along with details.

The above situation varies among different image regions, as discussed with Figure 1. To better understand how GAN-SR generates visual artifacts in different areas of an image, in Figure 3 we show toy examples of the three types of patches. We see that for the type A patch, the large-scale structure is preserved in its LR version and the HR patch can be easily reproduced with good fidelity and perceptual quality. For the texture-like type B patch, though it is not faithfully reconstructed pixel-wise, the perceptual quality of the GAN-SR output is acceptable. This is mainly because the pixels in texture-like patches are often randomly distributed within a relatively small range, so that it is hard for human eyes to perceive the pixel-wise differences. In contrast, type C patches have regular and sharp transitions, while the local patterns are lost in the LR patch after degradation. The largely varied and even contradictory HR targets lead to unstable adversarial training, and the irregular and unnatural patterns in the GAN-SR results are easily perceived by observers as artifacts.

In Figure 4, we further investigate the training stability of GAN-SR methods on different patches, including the flat sky (type A), animal fur (type B) and thin twigs (type C) in Figure 1. We calculate the mean absolute difference (MAD) of the intermediate GAN-SR outputs at two different iterations, i.e., $\mathrm{MAD}(k) = \mathrm{mean}\big(|\hat{\mathbf{y}}^{(k+\Delta)} - \hat{\mathbf{y}}^{(k)}|\big)$, where $\hat{\mathbf{y}}^{(k)}$ is the GAN-SR result at iteration $k$ and we set $\Delta$ to 5,000. The curves of MAD vs. $k$ for ESRGAN [wang2018esrgan] are plotted as solid lines. As can be seen, the training process of the type A patch is stable (small value and variation of MAD). Type B shows larger variation, indicating higher uncertainty during optimization. Type C has the largest variation and instability, implying that many possible GAN-SR solutions of type C are scattered in a large space, as illustrated in Figure 2.
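As a minimal sketch, the MAD statistic used above can be computed as follows (plain NumPy; the function name is ours, not the authors'):

```python
import numpy as np

def mad(sr_a, sr_b):
    """Mean absolute difference between two intermediate GAN-SR outputs.

    A small, slowly varying MAD across training iterations indicates a
    stable optimization trajectory; large fluctuations indicate instability.
    """
    return float(np.mean(np.abs(sr_a.astype(np.float64) - sr_b.astype(np.float64))))
```

Evaluating this on outputs saved 5,000 iterations apart and plotting the values against the iteration index reproduces curves of the kind shown in Figure 4.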

Figure 4: Training stability of ESRGAN [wang2018esrgan] and our LDL on different patches. The patches of flat sky (type A), animal fur (type B) and thin twigs (type C) in Figure 1 are used. The mean absolute differences (MAD) of intermediate GAN-SR results between iterations k and k+5000 are plotted.

3.2 Discriminating artifacts from realistic details

According to the investigations in Section 3.1, we should inhibit the generation of artifacts in type C patches while preserving the realistic details in type A and B patches. To achieve this challenging goal, we carefully design a pixel-wise map to discriminate artifacts from realistic details, as well as a learning strategy to stabilize the training of GAN-SR models. The whole procedure of map generation is illustrated in Figure 5 using three patches.

Discrimination of artifacts. Suppose the resolution of a full-color SISR output $\hat{\mathbf{y}}$ is $H \times W \times 3$. Our goal is to find a pixel-wise map $\mathbf{M} \in \mathbb{R}^{H \times W}$, where $\mathbf{M}(i,j)$ indicates the probability of pixel $(i,j)$ being an artifact pixel. Considering that both artifacts and details belong to high-frequency image components, we first calculate the residual between the ground truth image $\mathbf{y}$ and the SISR result $\hat{\mathbf{y}}$ to extract the high-frequency components:

$$\mathbf{R} = \mathbf{y} - \hat{\mathbf{y}}. \quad (2)$$
As shown in the residual column of Figure 5, most pixels in the smooth type A patch have very small residuals. Both type B and type C patches have large residuals, while the distribution of residuals in patch B is much more random. Based on the observation that artifacts usually consist of overshoot pixel values, we propose to calculate the local variance of the residual map as the primary map to indicate artifact pixels:

$$\mathbf{M}_{\mathrm{var}}(i,j) = \mathrm{var}\big(\mathbf{R}_{w \times w}(i,j)\big), \quad (3)$$

where $\mathrm{var}(\cdot)$ represents the variance operator, $\mathbf{R}_{w \times w}(i,j)$ is the local window of $\mathbf{R}$ centered at $(i,j)$, and $w$ denotes the local window size, which is set empirically.

As shown in the $\mathbf{M}_{\mathrm{var}}$ column of Figure 5, the primary map can effectively detect the artifact pixels in patch C. However, since the local variance is calculated with a very small receptive field, it is unstable for discriminating artifacts from edges and textures. Some pixels in patches A and B will also have large responses, causing wrong punishment on the generation of realistic details. To address this issue, we further calculate a stable patch-level variance from the whole residual map as follows:

$$\sigma_{\mathbf{R}} = \big(\mathrm{var}(\mathbf{R})\big)^{a}, \quad (4)$$

where the exponent $a$ scales the global variance to an appropriate range and is fixed throughout our experiments. In general, type A patches have smaller $\sigma_{\mathbf{R}}$ values than type B and type C patches, while type C patches have the largest values. By using $\sigma_{\mathbf{R}}$ to scale the primary map as $\mathbf{M} = \sigma_{\mathbf{R}} \cdot \mathbf{M}_{\mathrm{var}}$, a more reliable artifact map can be obtained. As shown in the $\mathbf{M}$ column of Figure 5, the over-punishment issue on patches A and B is mostly addressed, while the artifacts in patch C are still identified.
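A minimal NumPy sketch of the map construction above (the residual, its local-variance map, and the patch-level scaling); the window size w=7 and exponent a=0.5 are illustrative placeholders, since the exact values are elided in this text:

```python
import numpy as np

def local_variance(residual, w=7):
    """Primary artifact map: variance of the residual inside a w-by-w
    window centered at each pixel (w=7 is an assumed window size)."""
    pad = w // 2
    r = np.pad(residual, pad, mode="reflect")
    H, W = residual.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = r[i:i + w, j:j + w].var()
    return out

def artifact_map(gt, sr, w=7, a=0.5):
    """Residual -> local-variance map -> patch-level scaling (a is an
    assumed exponent for the global-variance scaling factor)."""
    R = gt - sr                       # residual between ground truth and SR
    M_var = local_variance(R, w)      # primary, pixel-wise variance map
    sigma = np.var(R) ** a            # stable patch-level scaling factor
    return sigma * M_var              # scaled, more reliable artifact map
```

Smooth type A regions yield a near-zero map, while regions containing overshoot pixels receive large values, matching the behavior illustrated in Figure 5.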

Figure 5: Visualization of the generation process of the artifact map. The columns show, respectively, the SISR output of a GAN-SR method, the ground truth patch, the absolute value of the residual between them, the primary map calculated by Eq. (3), the map scaled by the factor computed from Eq. (4), and the refined map of Eq. (6). The last column shows, as white pixels, the locations where the refined map imposes a penalty.

Stabilization and refinement. Although the map $\mathbf{M}$ can discriminate the artifacts in different types of patches, it may still over-penalize the realistic details in patch C, and slightly penalize the generation of high-fidelity details in patches A and B, especially in the early training stages. To alleviate this problem, we further stabilize the training process and refine the artifact map.

Specifically, denote by $\theta$ the parameters of the GAN-SR model optimized via gradient descent on-the-fly. We use the exponential moving average (EMA) technique to temporally ensemble a more stable model $\theta_{E}$ from $\theta$ as:

$$\theta_{E}^{(t)} = \beta\,\theta_{E}^{(t-1)} + (1-\beta)\,\theta^{(t)}, \quad (5)$$

where $\beta$ is the weighting parameter, set as in prior arts of EMA [karras2019style, karras2020analyzing]. Compared to $\theta$, $\theta_{E}$ is more reliable in alleviating the generation of random artifacts.
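The EMA update above amounts to the following per-parameter one-liner (β = 0.999 is an illustrative value in line with common GAN practice; the value used by the authors is not stated in this text):

```python
def ema_update(ema_params, params, beta=0.999):
    """Temporal ensembling of model weights:
    theta_E <- beta * theta_E + (1 - beta) * theta."""
    return {name: beta * ema_params[name] + (1 - beta) * params[name]
            for name in params}
```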

With $\theta_{E}$, we can further refine the artifact map to alleviate the penalty on the generation of realistic details during optimization. Denote by $\hat{\mathbf{y}}$ and $\hat{\mathbf{y}}_{E}$ the outputs of the two GAN-SR models. Usually, the output of the ensemble model, i.e., $\hat{\mathbf{y}}_{E}$, has few artifacts, while $\hat{\mathbf{y}}$ may contain more details and artifacts simultaneously. We then calculate the two residual maps $\mathbf{R} = \mathbf{y} - \hat{\mathbf{y}}$ and $\mathbf{R}_{E} = \mathbf{y} - \hat{\mathbf{y}}_{E}$, and refine the artifact map by:

$$\mathbf{M}_{\mathrm{ref}}(i,j) = \begin{cases} \mathbf{M}(i,j), & |\mathbf{R}(i,j)| > |\mathbf{R}_{E}(i,j)|, \\ 0, & \text{otherwise}. \end{cases} \quad (6)$$

That is, the refined map will only penalize the pixels where $|\mathbf{R}(i,j)| > |\mathbf{R}_{E}(i,j)|$. At locations where the residuals of $\hat{\mathbf{y}}$ are smaller than those of $\hat{\mathbf{y}}_{E}$, the model is updated toward the correct direction and should not be penalized. The refined map $\mathbf{M}_{\mathrm{ref}}$ and the corresponding location map are shown in the last two columns of Figure 5. We see that the locations of fine textures and desirable edges are removed from the refined artifact map, so that the penalty can be imposed more precisely on the artifact pixels.
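The refinement step above is a simple element-wise gate; a NumPy sketch (function name ours):

```python
import numpy as np

def refine_artifact_map(M, R, R_ema):
    """Keep the penalty only where the on-the-fly model's residual magnitude
    exceeds that of the EMA model; elsewhere the update already moves in the
    right direction and is left unpenalized."""
    return np.where(np.abs(R) > np.abs(R_ema), M, 0.0)
```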

3.3 Loss and learning strategy

Given the refined artifact map $\mathbf{M}_{\mathrm{ref}}$, we propose an artifact discrimination loss as follows:

$$\mathcal{L}_{\mathrm{artif}} = \big\| \mathbf{M}_{\mathrm{ref}} \odot (\hat{\mathbf{y}} - \mathbf{y}) \big\|_{1}. \quad (7)$$

The loss $\mathcal{L}_{\mathrm{artif}}$ can be easily introduced into existing GAN-SR models, and the final loss function is:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{GAN\text{-}SR}} + \lambda\,\mathcal{L}_{\mathrm{artif}}, \quad (8)$$

where $\mathcal{L}_{\mathrm{GAN\text{-}SR}}$ is defined in Eq. (1) and $\lambda$ is a weighting parameter, fixed in all our experiments.
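A sketch of how the artifact penalty is added to the overall objective; the map-weighted L1 form of the penalty and lam=1.0 are illustrative assumptions where the extracted text elides the exact expression:

```python
import numpy as np

def ldl_loss(gan_sr_loss, M_ref, sr, gt, lam=1.0):
    """Total objective: the standard GAN-SR loss plus the artifact
    discrimination penalty, an M_ref-weighted L1 term on the SR error."""
    l_artif = np.mean(M_ref * np.abs(sr - gt))  # penalize only flagged pixels
    return gan_sr_loss + lam * l_artif
```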

Figure 6: Overall learning pipeline of the proposed LDL method.

The pipeline of the proposed locally discriminative learning (LDL) method is shown in Figure 6. The LR input is fed into two models, i.e., the on-the-fly model and its EMA ensemble, to output $\hat{\mathbf{y}}$ and $\hat{\mathbf{y}}_{E}$, respectively. The artifact map $\mathbf{M}_{\mathrm{ref}}$ is then constructed using the ground-truth image $\mathbf{y}$ as well as $\hat{\mathbf{y}}$ and $\hat{\mathbf{y}}_{E}$. After that, the loss is calculated based on $\mathbf{M}_{\mathrm{ref}}$, $\hat{\mathbf{y}}$ and $\mathbf{y}$. Finally, the on-the-fly model is optimized, and its parameters are temporally ensembled into the EMA model. This process is iterated until convergence.
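The iteration described above can be sketched end-to-end as follows; model, ema_model, base_loss and make_map are placeholder callables for illustration, not the authors' API:

```python
import numpy as np

def ldl_iteration(x, y, model, ema_model, base_loss, make_map, lam=1.0):
    """One LDL training step: run both models, build the refined artifact
    map from their residuals, and add the weighted artifact penalty to the
    standard GAN-SR loss."""
    sr = model(x)                                # on-the-fly model output
    sr_ema = ema_model(x)                        # EMA-ensembled model output
    M_ref = make_map(y - sr, y - sr_ema)         # refined artifact map
    l_artif = np.mean(M_ref * np.abs(sr - y))    # artifact discrimination loss
    return base_loss(sr, y) + lam * l_artif
```

In actual training, the returned loss would be backpropagated through the on-the-fly model only, with the EMA weights updated afterwards.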

With the proposed LDL, we train the same RRDB backbone [wang2018esrgan] and plot the MAD curves of the intermediate GAN-SR outputs in Figure 4 as dashed lines. As can be seen, our LDL method has much better stability than ESRGAN in model learning, especially for type B and type C patches, resulting in much smaller MAD values and variations.

Metrics Benchmark SFTGAN [wang2018recovering] SRGAN [ledig2017photo] SRResNet [ledig2017photo]+LDL ESRGAN [wang2018esrgan] USRGAN [zhang2020deep] SPSR [ma2020structure] RRDB [wang2018esrgan]+LDL (DIV2K) RRDB [wang2018esrgan]+LDL (DF2K) SwinIR [liang2021swinir]+GAN SwinIR [liang2021swinir]+LDL
Training Dataset ImageNet+OST DIV2K DIV2K DF2K+OST DF2K DIV2K DIV2K DF2K DF2K DF2K
LPIPS Set5 0.0800 0.0753 0.0759 0.0758 0.0795 0.0647 0.0670 0.0691 0.0656 0.0655
Set14 0.1313 0.1327 0.1303 0.1241 0.1347 0.1207 0.1207 0.1132 0.1160 0.1091
Manga109 0.0716 0.0707 0.0673 0.0649 0.0630 0.0672 0.0553 0.0544 0.0542 0.0469
General100 0.0947 0.0964 0.0898 0.0879 0.0937 0.0862 0.0790 0.0796 0.0796 0.0740
Urban100 0.1343 0.1439 0.1330 0.1229 0.1330 0.1184 0.1096 0.1084 0.1077 0.1021
DIV2K100 0.1331 0.1257 0.1172 0.1154 0.1325 0.1099 0.1011 0.0999 0.1038 0.0944
DISTS Set5 0.1085 0.1003 0.1010 0.0949 0.1045 0.0921 0.0917 0.0919 0.0930 0.0899
Set14 0.1133 0.1067 0.1016 0.0951 0.0997 0.0920 0.0935 0.0866 0.0930 0.0869
Manga109 0.0646 0.0557 0.0523 0.0471 0.0471 0.0463 0.0404 0.0355 0.0365 0.0315
General100 0.0992 0.0982 0.0939 0.0874 0.0931 0.0884 0.0827 0.0801 0.0835 0.0794
Urban100 0.1062 0.1081 0.0989 0.0880 0.0975 0.0849 0.0822 0.0793 0.0835 0.0800
DIV2K100 0.0736 0.0663 0.0624 0.0593 0.0645 0.0546 0.0528 0.0526 0.0531 0.0507
FID Set5 39.261 31.507 27.542 27.215 37.006 30.904 25.288 24.803 35.401 27.955
Set14 60.493 63.945 52.080 54.933 55.635 53.867 49.577 43.454 48.910 46.057
Manga109 21.464 11.948 12.652 11.552 10.658 10.662 9.855 10.161 9.703 8.680
General100 36.845 33.868 32.737 29.843 32.959 30.159 27.506 27.211 27.557 25.304
Urban100 21.370 22.162 21.512 20.345 21.555 18.672 17.758 16.351 17.555 16.282
DIV2K100 18.183 13.922 14.823 13.557 14.031 13.754 12.145 12.121 12.736 12.075
PSNR Set5 30.057 29.920 30.527 30.438 30.910 30.397 30.985 31.033 30.873 31.028
Set14 26.743 26.839 27.278 26.594 27.405 26.860 27.491 27.228 27.282 27.526
Manga109 28.167 28.110 28.664 28.413 28.753 28.561 29.407 29.620 29.345 30.143
General100 29.159 29.327 29.775 29.425 30.001 29.424 30.232 30.289 30.104 30.441
Urban100 24.338 24.410 24.745 24.365 24.891 24.804 25.498 25.459 25.736 26.231
DIV2K100 28.085 28.165 28.602 28.175 28.787 28.182 28.951 28.819 28.784 29.117
SSIM Set5 0.8483 0.8478 0.8570 0.8523 0.8657 0.8443 0.8626 0.8611 0.8655 0.8611
Set14 0.7175 0.7252 0.7366 0.7144 0.7486 0.7254 0.7476 0.7358 0.7407 0.7478
Manga109 0.8562 0.8632 0.8702 0.8595 0.8717 0.8590 0.8746 0.8734 0.8796 0.8880
General100 0.8060 0.8074 0.8164 0.8095 0.8241 0.8091 0.8277 0.8280 0.8305 0.8347
Urban100 0.7235 0.7302 0.7409 0.7341 0.7503 0.7474 0.7673 0.7661 0.7786 0.7918
DIV2K100 0.7707 0.7745 0.7855 0.7759 0.7941 0.7720 0.7951 0.7897 0.7911 0.8011
Table 1: Quantitative comparison between GAN-SR methods and the proposed LDL. Three groups of comparisons are made based on the employed backbone networks: SRResNet-like backbones for the first 3 columns, the RRDB backbone for the middle 5, and the SwinIR backbone for the last 2. The best results of each group are highlighted in bold. ↑ and ↓ mean that a larger or smaller score is better, respectively.

4 Experimental results

4.1 Experiment setup

Figure 7: Visual comparison (best viewed zoomed in on screen) with state-of-the-art GAN-SR methods that use RRDB [wang2018esrgan] as the backbone, including ESRGAN [wang2018esrgan], USRGAN [zhang2020deep], SPSR [ma2020structure] and our RRDB+LDL. As can be seen, our method has clear advantages in reconstructing realistic details and inhibiting artifacts. More visual comparisons can be found in the supplementary materials.

Backbones and compared methods. We validate the effectiveness of the proposed LDL method on top of three representative backbone networks, i.e., SRResNet [ledig2017photo], RRDB [wang2018esrgan] and SwinIR [liang2021swinir], resulting in SRResNet+LDL, RRDB+LDL and SwinIR+LDL. SRResNet is a light-weight network, and we compare SRResNet+LDL against SRGAN [ledig2017photo] and SFTGAN [wang2018recovering], which have comparable numbers of parameters. RRDB is widely used in recent GAN-SR methods [wang2018esrgan, zhang2020deep, ma2020structure] for its competitive performance. We compare RRDB+LDL against ESRGAN [wang2018esrgan], USRGAN [zhang2020deep] and SPSR [ma2020structure], which all use RRDB as the backbone. Very recently, SwinIR has reported excellent SISR performance using the Swin Transformer architecture [liu2021swin]. We also train SwinIR with the standard GAN-SR losses (SwinIR+GAN) and with our LDL (SwinIR+LDL), and compare their performance. We further validate LDL for real-world SISR by applying it to RealESRGAN [wang2021realesrgan], and compare the obtained RealESRGAN+LDL model with both the RealESRGAN and BSRGAN [zhang2021designing] models.

Training datasets and settings. Following prior arts [ledig2017photo, wang2018esrgan, ma2020structure], we conduct experiments with a ×4 scaling factor in both synthetic (downsampled using the MATLAB bicubic kernel) and real-world settings. We also report GAN-SR results on synthetic data in the supplementary materials. We use the same data augmentation, discriminator and optimizer settings as in ESRGAN [wang2018esrgan]. We train our models on either the DIV2K [agustsson2017ntire] or the DF2K [lim2017enhanced, timofte2017ntire] dataset with fixed-size HR patches. We implement the experiments on NVIDIA GTX 2080Ti GPUs with PyTorch. We initialize the generator with a pretrained fidelity-oriented model, and calculate the perceptual loss as in [wang2021realesrgan] for both the synthetic and real-world settings.

Evaluation benchmarks and metrics. We employ six benchmarks for evaluation: Set5 [bevilacqua2012low], Set14 [zeyde2010single], Manga109 [matsui2017sketch], General100 [dong2016accelerating], Urban100 [huang2015single] and DIV2K100 [agustsson2017ntire]. We compare the GAN-SR results in terms of both perceptual quality and reconstruction accuracy. For the former, we employ LPIPS [zhang2018unreasonable], DISTS [ding2020image] and FID [heusel2017gans] as metrics. LPIPS and DISTS have been validated as effective for evaluating GAN-SR results [jinjin2020pipal], and FID is widely used to evaluate perceptual quality in image generation tasks [karras2019style]. For the latter, we compute PSNR and SSIM indices on the Y channel in the YCbCr space.
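For reference, Y-channel PSNR can be computed as below; the BT.601 luma coefficients follow the common MATLAB/YCbCr convention for 8-bit images, and the helper names are ours:

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of an 8-bit RGB image, ITU-R BT.601 convention:
    Y = 16 + (65.481 R + 128.553 G + 24.966 B) / 255 for inputs in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(a, b):
    """PSNR between the Y channels of two images, with peak value 255."""
    mse = np.mean((rgb_to_y(a) - rgb_to_y(b)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(255.0 ** 2 / mse))
```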

4.2 Comparison with state-of-the-arts

Quantitative comparison. Table 1 quantitatively compares the state-of-the-art GAN-SR methods and our LDL. Our proposed LDL scheme improves both the perceptual quality (LPIPS, DISTS, FID) and the reconstruction accuracy (PSNR, SSIM) on most benchmarks with all three backbones, i.e., SRResNet, RRDB and SwinIR.

Specifically, among the three light-weight models, SRResNet+LDL outperforms SFTGAN and SRGAN on most benchmarks in terms of the perceptual quality metrics LPIPS, DISTS and FID, and it outperforms them on all benchmarks in terms of reconstruction accuracy (PSNR and SSIM).

Figure 8: Visual comparison (best viewed zoomed in on screen) with state-of-the-art real-world SISR methods, including BSRGAN [zhang2021designing] and RealESRGAN [wang2021realesrgan]. The training setting of RealESRGAN+LDL is the same as RealESRGAN except for the addition of the proposed artifact discrimination loss. More visual comparisons with different backbones can be found in the supplementary materials.

For the CNN based backbone RRDB, we train the GAN-SR models on DIV2K and DF2K, respectively, to be consistent with the employed competing models. Among the three competing methods, SPSR performs the best in terms of perceptual quality metrics since it benefits from the additional network branch that restores the gradient map of images. By explicitly discriminating artifacts and regularizing the adversarial training, LDL achieves clear improvements over SPSR, e.g., reducing LPIPS from 0.1099 to 0.1011 (about 8%) on the DIV2K validation set when trained on DIV2K. USRGAN achieves the best reconstruction accuracy among the three competing methods since it integrates learning-based and model-based strategies. Compared to USRGAN, LDL not only achieves much better reconstruction accuracy on all benchmarks, but also improves the perceptual quality indices. This validates that LDL can simultaneously inhibit visual artifacts and generate more high-fidelity details.

For the transformer-based backbone SwinIR, we see that SwinIR+GAN outperforms the CNN based methods on most benchmarks in terms of both perceptual quality and reconstruction accuracy, demonstrating the potential of transformer-based architectures for GAN-SR. As expected, SwinIR+LDL further improves on SwinIR+GAN on most benchmarks, demonstrating the generalization capacity of LDL across network architectures.

Qualitative comparison. Figure 7 presents visual comparisons among the GAN-SR methods using the RRDB backbone. Similar conclusions to the quantitative comparisons can be drawn. LDL generates far fewer visual artifacts than ESRGAN, USRGAN and SPSR, especially in regions with fine-scale aliased structures. In addition, by regularizing the adversarial training process, LDL is able to reconstruct more details with high fidelity, such as areas with regular patterns (e.g., the lines on windows and the grid on the bridge). These improvements make LDL a practical GAN-SR solution for image quality enhancement.

#  LPIPS↓  PSNR↑
1  0.1154  28.175
2  0.1020  28.740
3  0.1006  28.678
4  0.1001  28.761
5  0.0999  28.819
Table 2: Ablation study on the different components of the proposed LDL method; configurations #1-#5 progressively add the components described in Section 4.4. Results are obtained by RRDB+LDL trained on DF2K and evaluated on the DIV2K validation set.

4.3 Applications to real-world SISR

To demonstrate the generalization capability of the proposed LDL, we also apply it to the real-world SISR task. Compared to SISR on synthetic LR images, SISR on real-world LR images faces unknown and much more complicated degradations [zhang2021designing]. We introduce the proposed artifact discrimination loss into the RealESRGAN method [wang2021realesrgan] and keep all other settings unchanged to train our RealESRGAN+LDL model. Since there is no ground-truth, we show qualitative comparisons with RealESRGAN and BSRGAN in Figure 8. As can be seen in the area of dense windows, RealESRGAN introduces unpleasant artifacts, while BSRGAN produces relatively smooth structures. In contrast, our LDL suppresses the generation of artifacts while encouraging sharp details. In the area of twigs, LDL improves the generation of fine details, benefiting from the explicit and accurate discrimination between artifacts and realistic details.

4.4 Ablation study

We conduct ablation studies to investigate the roles of the major components of our LDL method, including the primary artifact map $\mathbf{M}_{\mathrm{var}}$, the globally scaled map of Eq. (4), the refined map of Eq. (6) and the EMA model. Results are reported in Table 2. #1 gives the baseline performance when none of the above operations is used. By introducing the primary map in #2, we observe a clear performance gain in both perceptual quality and reconstruction accuracy. This demonstrates the effectiveness of explicitly discriminating and penalizing the visual artifacts in GAN-SR. The usage of the global scaling in #3 and the map refinement in #4 each further improves the performance. Finally, by using the stable EMA model during testing in #5, we achieve an additional performance gain, as expected.

4.5 Limitations

Although the proposed LDL is effective in improving both the perceptual quality and reconstruction accuracy of SISR outputs, it still has limitations in discriminating visual artifacts in regions suffering from heavy aliasing. Taking the last row of Figure 7 as an example, some artifacts remain around the dense windows in our result. In this paper, we discussed how artifacts are generated by GAN-SR methods and proposed a simple attempt to tackle this problem, while we believe more effective designs for artifact discrimination and detail generation exist.

5 Conclusion

In this paper, we analyzed how visual artifacts are generated by GAN-based SISR methods, and proposed a locally discriminative learning (LDL) strategy to address this issue. A framework to discriminate visual artifacts from realistic details during GAN-SR model training was carefully designed, and an artifact map was generated to explicitly penalize the artifacts without sacrificing the realistic details. The proposed LDL method can be easily plugged into different off-the-shelf GAN-SR models for both synthetic and real-world SISR tasks. Extensive experiments on widely used datasets demonstrated that LDL outperforms existing GAN-SR methods both quantitatively and qualitatively.