Single image super-resolution (SISR) [dong2014learning, johnson2016perceptual, ledig2017photo, sajjadi2017enhancenet, soh2019natural, kim2016deeply, sun2010gradient, zhang2021designing, wang2021realesrgan, shi2016real, wang2018esrgan, yan2015single, zhang2018image, zhang2018residual, jo2021tackling], which aims to reconstruct a high-resolution (HR) image from its low-resolution (LR) observation, is a popular yet challenging research topic in low-level computer vision. It has become prevalent to train deep neural networks (DNNs) for SISR, and many DNN-based SISR models [dong2014learning, zhang2018residual, anwar2019drln, niu2020single, wang2021learning] are trained with pixel-wise ℓ1 and ℓ2 losses and/or local window based metrics (such as SSIM [wang2004image]). It is well-known that though these losses can induce high PSNR and SSIM indices, they can hardly produce rich image details [blau2018perception, ledig2017photo].
With the rapid development of generative adversarial networks (GAN) [goodfellow2014generative, jolicoeur2018relativistic], GAN-based SISR (GAN-SR for short) has recently attracted significant attention for its potential to recover sharp images with rich details [ledig2017photo, sajjadi2017enhancenet, wang2018esrgan, soh2019natural, zhang2020deep]. Though great progress has been achieved, adversarial training is unstable and often introduces unpleasant visual artifacts [ledig2017photo, zhang2020deep]. As users mostly expect rich and realistic details in SISR results [prashnani2018pieapp, ding2020image, jinjin2020pipal], how to inhibit the visual artifacts of GAN-SR without affecting the realistic details becomes a key issue. Unfortunately, details and artifacts are often entangled in the high-frequency components of images. As a result, optimizing one of them often harms the other under existing frameworks [blau2018perception, ledig2017photo, wang2018esrgan, ma2020structure].
To address the above challenges, we investigate the GAN-SR methods in depth and categorize their results into three typical types of regions, as illustrated in Figure 1. Specifically, type A patches (e.g., flat sky, long edges) are easy to reconstruct since they are smooth or contain only large-scale structures. In contrast, it is difficult to produce high-fidelity SISR results for patches of type B and type C because they contain many fine-scale details and suffer from signal aliasing in the degradation process, where most high-frequency components are lost. Fortunately, for texture-like type B patches (e.g., animal fur, tree leaves in the distance), the pixels are randomly distributed so that the differences between SR results and ground truth are hardly perceptible to human observers. Therefore, rich details generated by GAN-SR methods can lead to better perceptual quality in these regions. However, patches of type C (e.g., thin twigs, dense windows in a building) contain many fine-scale regular structures or sharp transitions among adjacent pixels. The distorted structures and overshoot pixels generated by GAN-SR methods can be easily perceived by observers as unpleasant artifacts.
Based on the above analysis, we can see that to obtain perceptually realistic SISR results, the visual artifacts in type C regions should be inhibited, while the realistic details generated in type A and type B regions should be preserved. To achieve this goal, we analyze the local statistics of the three types of GAN-SR regions, and find that the local variance of the residual between SISR results and ground truth HR images can serve as an effective feature to distinguish unpleasant artifacts from realistic details. Accordingly, we construct a pixel-wise map indicating the probability of each pixel being an artifact based on the local and patch-level residual variances. We further refine the discrimination map via a model ensemble strategy to encourage a stable and accurate optimization direction toward high-fidelity reconstruction. Based on the refined map, we design a Locally Discriminative Learning (LDL) framework to penalize the artifacts without affecting realistic details.
To sum up, in this paper we first analyze the GAN-SR results and the instability of model training. We then propose to explicitly discriminate visual artifacts from realistic details, and design an LDL framework to regularize the adversarial training. Our method is simple yet effective, and it can be easily plugged into off-the-shelf GAN-SR methods. It provides a novel way to suppress the artifacts in GAN-SR while generating rich realistic details. We conduct extensive experiments on synthetic and real-world SISR tasks, and LDL demonstrates clear improvements over state-of-the-art methods both quantitatively and qualitatively.
2 Related work
Since the pioneering work of SRCNN [dong2014learning], which first introduced a three-layer convolutional neural network (CNN) for SISR, a number of CNN-based SISR models have been proposed. They can be roughly divided into signal fidelity-oriented ones [zhang2018residual, anwar2019drln, niu2020single, wang2021learning] and perceptual quality-oriented ones [johnson2016perceptual, ledig2017photo, sajjadi2017enhancenet, soh2019natural, wang2018esrgan, liang2021hierarchical], depending on the losses and training strategies they employ.
Signal fidelity-oriented SISR methods. SISR methods in this category adopt pixel-wise distance measures (such as the ℓ1 and ℓ2 losses) and local structural similarity measures (such as SSIM [wang2004image]) to optimize the signal fidelity between the SISR outputs and the HR ground-truth. Since SRCNN [dong2014learning], researchers have made remarkable progress by stacking more convolution layers [kim2016accurate, kim2016deeply] and designing more complex building blocks [lim2017enhanced, tai2017memnet] and connections [ledig2017photo, zhang2018residual, tong2017image]. For instance, benefiting from a very deep network, effective residual connections, and channel attention, RCAN [zhang2018image] achieves superior performance on reconstruction accuracy (e.g., PSNR). However, due to the ill-posedness of the SISR problem, optimizing the pixel-wise losses tends to find a blurry result that is the average of many possible solutions [sajjadi2017enhancenet, soh2019natural, blau2018perception]. The SSIM loss can better preserve local image structures, but it can hardly reproduce fine details.
Perceptual quality-oriented SISR methods. To improve the perceptual quality of SISR images, Johnson et al. [johnson2016perceptual] proposed a perceptual loss by calculating the distance between HR and SISR results in the VGG feature space. To tackle the difficulties of signal fidelity-oriented methods in reproducing image details, most recent works have resorted to using the GAN techniques [goodfellow2014generative] for their capability to generate desired images by discriminating between image distributions [wang2018recovering, rad2019srobb, fuoli2021fourier]. For example, Ledig et al. [ledig2017photo] proposed SRGAN with adversarial training on top of the SRResNet generator. To improve the visual quality, Wang et al. [wang2018esrgan] proposed the ESRGAN by introducing the Residual-in-Residual Dense Block (RRDB) along with other improvements on adversarial training and perceptual loss. RRDB has been employed as a standard backbone in many state-of-the-art GAN-SR methods [wang2021realesrgan, zhang2021designing, ma2020structure].
Zhang et al. [zhang2020deep] proposed a trainable unfolding network, termed USRGAN, which integrates the merits of traditional model-based methods and CNN-based ones. Ma et al. [ma2020structure] introduced gradient guidance via an additional branch in the network. By alleviating the structural distortion and inconsistency problem, the proposed SPSR method achieves leading performance among GAN-SR methods on synthetic data. Nonetheless, one key issue of all existing GAN-SR works is that they produce many unpleasant visual artifacts due to the instability of adversarial training.
Remarks. As indicated in [blau2018perception], both signal fidelity- and perceptual quality-oriented SISR methods fall into a perception-distortion trade-off; that is, improving either the perceptual quality or the signal fidelity will hurt the other under existing training strategies. Empirical experience also tells us that inhibiting the artifacts can limit the generation of details. In this paper, we propose to regularize the adversarial training by explicitly discriminating the artifacts from realistic details, which effectively addresses the dilemma. Recent works, e.g., BSRGAN [zhang2021designing] and RealESRGAN [wang2021realesrgan], have also recognized the significance of the real-world image SR task. As a plug-and-play module, our method can be easily extended to such a challenging task. The experimental results demonstrate its strong generalization performance in generating realistic details while inhibiting artifacts.
3.1 GAN-SR induced visual artifacts
Most of the existing GAN-SR methods [ledig2017photo, wang2018esrgan] are trained using a weighted combination of three losses:

ℓ = ℓ_rec + λ_p · ℓ_per + λ_a · ℓ_adv, (1)

where ℓ_rec indicates the pixel-wise reconstruction loss such as the ℓ1 and ℓ2 distances, ℓ_per is the perceptual loss [johnson2016perceptual, ledig2017photo] measuring the feature distance in the VGG feature space, and ℓ_adv denotes the adversarial loss [goodfellow2014generative, wang2018esrgan]. λ_p and λ_a are balancing parameters, which are usually set as in ESRGAN [wang2018esrgan].
According to the pioneering work of SRGAN [ledig2017photo], using only the reconstruction loss ℓ_rec will result in a blurred average of all possible HR images, while the adversarial loss ℓ_adv can push the SISR solution away from this blurred average, generating more details. Unfortunately, GAN-SR models also generate many perceptually-unpleasant artifacts in addition to the details. An intuitive illustration is shown in Figure 2. Since SISR is an ill-posed task, one LR input corresponds to many possible HR counterparts scattered in the high-dimensional image space. Starting from the blurry solution (the center patch in Figure 2) generated by an SISR model pre-trained using only ℓ_rec, the adversarial loss can update it along many possible directions, some yielding perceptually pleasant results (in yellow boxes) and some producing unpleasant ones (in red boxes). This leads to an unstable optimization process that may generate artifacts along with details.
The above situation can vary among different image regions, as discussed in Figure 1. To better understand how GAN-SR generates visual artifacts in different areas of an image, in Figure 3 we show toy examples of the three types of patches. We see that for the type A patch, the large-scale structure is preserved in its LR version and the HR patch can be easily reproduced with good fidelity and perceptual quality. For the texture-like type B patch, though it is not reconstructed pixel-wise faithfully, the perceptual quality of the GAN-SR output is not bad. This is mainly because the pixels in texture-like patches are often randomly distributed in a relatively small range, so that it is hard for human eyes to perceive the pixel-wise difference. In contrast, type C patches have regular and sharp transitions, while the local patterns are lost in the LR patch after degradation. The largely varied and even contradictory HR targets lead to unstable adversarial training, and the irregular and unnatural patterns in the GAN-SR results can be easily perceived by observers as artifacts.
In Figure 4, we further investigate the training stability of GAN-SR methods on different patches, including the flat sky (type A), animal fur (type B) and thin twigs (type C) in Figure 1. We calculate the mean absolute difference (MAD) of the intermediate GAN-SR outputs at two different iterations, i.e., MAD(t) = mean(|I_SR^{t+Δt} − I_SR^{t}|), where I_SR^{t} is the GAN-SR result at iteration t and Δt is a fixed iteration interval. The curves of MAD vs. t for ESRGAN [wang2018esrgan] are plotted as solid lines. As can be seen, the training process of the type A patch is stable (small value and variation of MAD). Type B shows larger variation, indicating higher uncertainty during optimization. Type C has the largest variation and instability, implying that many possible GAN-SR solutions of type C are available in a large space, as illustrated in Figure 2.
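The MAD measure above is straightforward to reproduce; a minimal numpy sketch (the function name `mad` is ours):

```python
import numpy as np

def mad(sr_a, sr_b):
    """Mean absolute difference between two intermediate SR outputs.

    sr_a, sr_b: float arrays of the same shape, e.g. (H, W) or (H, W, 3),
    such as generator outputs at iterations t and t + dt.
    """
    a = np.asarray(sr_a, dtype=np.float64)
    b = np.asarray(sr_b, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))
```

Plotting `mad` over training iterations for a fixed interval Δt yields curves like those in Figure 4: small, flat values indicate stable optimization, while large fluctuations indicate instability.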
3.2 Discriminating artifacts from realistic details
According to the investigations in Section 3.1, we should inhibit the generation of artifacts in type C patches while preserving the realistic details in type A and B patches. To achieve this challenging goal, we carefully design a pixel-wise map to discriminate artifacts from realistic details, as well as a learning strategy to stabilize the training of GAN-SR models. The whole procedure of map generation is illustrated in Figure 5 using three patches.
Discrimination of artifacts. Suppose that the SISR output I_SR is a full-color image of resolution H×W. Our goal is to find a pixel-wise map M of size H×W, where M(i, j) indicates the probability of pixel (i, j) being an artifact pixel. Considering that both artifacts and details belong to high-frequency image components, we first calculate the residual between the ground truth image I_HR and the SISR result I_SR to extract the high-frequency components:

R = I_HR − I_SR. (2)
As shown in the residual maps in Figure 5, most pixels in the smooth type A patch have very small residuals. Both type B and type C patches have large residuals, while the distribution of residuals in patch B is much more random. Based on the observation that artifacts usually consist of overshoot pixel values, we propose to calculate the local variance of the residual map as the primary map to indicate artifact pixels:

M_var(i, j) = Var(R(i−k : i+k, j−k : j+k)), (3)

where Var(·) represents the variance operator and k denotes the local window size, which we set empirically.
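The local-variance map can be computed efficiently with sliding-window means via the identity Var = E[R²] − (E[R])². A minimal sketch (the window size `k = 7` is an illustrative choice, since the paper sets it empirically):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(residual, k=7):
    """Primary artifact map: per-pixel variance of the residual in a k x k window.

    Uses Var = E[R^2] - (E[R])^2 with box-filtered local means; the result is
    clipped at zero to absorb tiny negative values from floating-point error.
    """
    r = np.asarray(residual, dtype=np.float64)
    mean = uniform_filter(r, size=k)
    mean_sq = uniform_filter(r * r, size=k)
    return np.clip(mean_sq - mean * mean, 0.0, None)
```

Because the variance is computed in a small window, sharp overshoot pixels produce strong local responses, which is exactly the behavior exploited by the primary map.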
As shown in Figure 5, the primary map M_var can effectively detect the artifact pixels in patch C. However, since the local variance is calculated with a very small receptive field, it is unstable for discriminating artifacts from edges and textures. Some pixels in patches A and B will also have large responses, causing wrong punishment of the generation of realistic details. To address this issue, we further calculate a stable patch-level variance from the whole residual map as follows:
v = (Var(R))^a, (4)

where the exponent a scales the global variance to an appropriate range; we fix a throughout our experiments. In general, type A patches have smaller v values than type B and type C patches, while type C patches have the largest values. By using v to scale the primary map as M = v · M_var, a more reliable artifact map can be obtained. As shown in Figure 5, the over-punishment issue on patches A and B is mostly addressed, while the artifacts in patch C are still identified.
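Combining the two steps, one plausible implementation of the globally scaled artifact map follows; the window size `k` and exponent `a` are illustrative values, not the paper's:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def artifact_map(residual, k=7, a=0.5):
    """Globally scaled artifact map: M = (Var(R) ** a) * M_var.

    The patch-level variance Var(R) of the whole residual map rescales the
    pixel-level variance map M_var, so texture-like patches with small global
    variance receive a smaller penalty than heavily distorted ones.
    """
    r = np.asarray(residual, dtype=np.float64)
    mean = uniform_filter(r, size=k)
    m_var = np.clip(uniform_filter(r * r, size=k) - mean * mean, 0.0, None)
    global_scale = float(np.var(r)) ** a  # patch-level variance, rescaled by a
    return global_scale * m_var
```

Intuitively, the global factor suppresses the map on smooth (type A) and texture-like (type B) patches, while leaving the strong local responses on type C artifacts intact.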
Stabilization and refinement. Although the map can discriminate the artifacts in different types of patches, it may still over-penalize the realistic details in patch C, and slightly penalize the generation of high-fidelity details in patches A and B, especially at the early training stages. To alleviate this problem, we further stabilize the training process and refine the artifact map.
Specifically, denote by G the GAN-SR model optimized via gradient descent on-the-fly. We use the exponential moving average (EMA) technique to temporally ensemble a more stable model G_EMA from G as:

θ_EMA^{t} = β · θ_EMA^{t−1} + (1 − β) · θ^{t}, (5)

where θ denotes the model parameters and β is the weighting parameter, which we set as in prior arts of EMA [karras2019style, karras2020analyzing]. Compared to G, G_EMA is more reliable in alleviating the generation of random artifacts.
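The EMA update can be sketched as follows; parameters are represented as a plain dict here, and `beta = 0.999` is a common choice in the EMA literature rather than a value taken from this paper:

```python
def ema_update(theta_ema, theta, beta=0.999):
    """One EMA step: theta_ema <- beta * theta_ema + (1 - beta) * theta.

    theta_ema, theta: dicts mapping parameter names to values (floats or
    arrays). Returns the updated ensemble parameters without mutating inputs.
    """
    return {k: beta * theta_ema[k] + (1.0 - beta) * theta[k] for k in theta_ema}
```

Calling `ema_update` once per training iteration keeps G_EMA a slowly moving average of G, so its outputs fluctuate far less than those of the on-the-fly model.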
With G_EMA, we can further refine the artifact map to alleviate the penalty on the generation of realistic details during optimization. Denote by I_SR and I_EMA the outputs of the two GAN-SR models G and G_EMA. Usually, the output I_EMA of the ensemble model has few artifacts, while I_SR may contain more details and artifacts simultaneously. We then calculate the two residual maps R = I_HR − I_SR and R_EMA = I_HR − I_EMA, and refine the artifact map by:

M_refine(i, j) = M(i, j) if |R(i, j)| > |R_EMA(i, j)|, and 0 otherwise. (6)

That is, the refined map will only penalize the pixels where |R(i, j)| > |R_EMA(i, j)|. At locations where the residuals of I_SR are smaller than those of I_EMA, the model is updated toward the correct direction and should not be penalized. The refined map and the location map are shown in the last two columns of Figure 5. We see that the locations of fine textures and desirable edges are removed from the refined artifact map so that the penalty can be imposed more precisely on the artifact pixels.
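A sketch of this refinement step, assuming the map is kept only where the on-the-fly output has a larger absolute residual than the EMA output:

```python
import numpy as np

def refine_map(m, i_hr, i_sr, i_ema):
    """Refined artifact map: zero out entries where G already beats G_EMA.

    Keeps m(i, j) only where |I_HR - I_SR| > |I_HR - I_EMA|, i.e. where the
    current output is worse than the stable ensemble output; elsewhere the
    model is moving in the right direction and is not penalized.
    """
    r = np.abs(np.asarray(i_hr, dtype=np.float64) - np.asarray(i_sr, dtype=np.float64))
    r_ema = np.abs(np.asarray(i_hr, dtype=np.float64) - np.asarray(i_ema, dtype=np.float64))
    return np.where(r > r_ema, m, 0.0)
```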
3.3 Loss and learning strategy
Given the refined artifact map M_refine, we propose an artifact discrimination loss as follows:

ℓ_dis = ||M_refine ⊙ (I_HR − I_SR)||_1, (7)

where ⊙ denotes element-wise multiplication. ℓ_dis can be easily introduced into existing GAN-SR models, and the final loss function is:

ℓ_final = ℓ + λ_dis · ℓ_dis, (8)

where ℓ is defined in Eq. (1) and λ_dis is a weighting parameter, which we simply fix in all our experiments.
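Assuming the artifact discrimination loss is a map-weighted ℓ1 distance as described, the final objective can be sketched as follows (the weight `lam` and both function names are illustrative):

```python
import numpy as np

def artifact_loss(m_refine, i_hr, i_sr):
    """Artifact discrimination loss: map-weighted l1 distance to ground truth,
    averaged over pixels. Pixels with m_refine = 0 contribute nothing."""
    diff = np.abs(np.asarray(i_hr, dtype=np.float64) - np.asarray(i_sr, dtype=np.float64))
    return float(np.mean(m_refine * diff))

def ldl_loss(base_loss, m_refine, i_hr, i_sr, lam=1.0):
    """Final objective: base GAN-SR loss plus the weighted artifact loss."""
    return base_loss + lam * artifact_loss(m_refine, i_hr, i_sr)
```

Because the map is zero at detail locations, the extra term only steers gradients at pixels flagged as artifacts, leaving the adversarial detail generation elsewhere untouched.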
The pipeline of the proposed locally discriminative learning (LDL) method is shown in Figure 6. The input LR image is fed into the two models G and G_EMA to output I_SR and I_EMA, respectively. The artifact map M_refine is then constructed using the ground-truth image I_HR as well as I_SR and I_EMA. After that, the loss is calculated based on I_HR, I_SR and M_refine. Finally, the model G is optimized, and its parameters are temporally ensembled into G_EMA. This process is iterated until convergence.
With the proposed LDL, we train the same RRDB backbone [wang2018esrgan] and plot the MAD curves of intermediate GAN-SR outputs in Figure 4 using dashed lines. As can be seen, our LDL method has much better stability than ESRGAN in model learning, especially for the type B and type C patches, resulting in much smaller MAD values and variations.
Table 1: compared models and their training datasets.

| Method | Training Dataset |
|---|---|
| SFTGAN [wang2018recovering] | ImageNet + OST |
| SRGAN [ledig2017photo] | DIV2K |
| SRResNet [ledig2017photo]+LDL | DIV2K |
| ESRGAN [wang2018esrgan] | DF2K + OST |
| USRGAN [zhang2020deep] | DF2K |
| SPSR [ma2020structure] | DIV2K |
| RRDB [wang2018esrgan]+LDL | DIV2K |
| RRDB [wang2018esrgan]+LDL | DF2K |
| SwinIR [liang2021swinir]+ | DF2K |
| SwinIR [liang2021swinir]+LDL | DF2K |
4 Experimental results
4.1 Experiment setup
Backbones and compared methods. We validate the effectiveness of the proposed LDL method on top of three representative backbone networks, i.e., SRResNet [ledig2017photo], RRDB [wang2018esrgan] and SwinIR [liang2021swinir], resulting in SRResNet+LDL, RRDB+LDL and SwinIR+LDL. SRResNet is a light-weight network, and we compare SRResNet+LDL against SRGAN [ledig2017photo] and SFTGAN [wang2018recovering], which have a comparable number of parameters. RRDB is widely used in recent GAN-SR methods [wang2018esrgan, zhang2020deep, ma2020structure] for its competitive performance. We compare RRDB+LDL against ESRGAN [wang2018esrgan], USRGAN [zhang2020deep] and SPSR [ma2020structure], which all use RRDB as the backbone. Very recently, SwinIR has reported excellent SISR performance by using the Swin Transformer architecture [liu2021swin]. We also train SwinIR with the standard GAN-SR losses (denoted as SwinIR+) and with our LDL (SwinIR+LDL), and compare their performance. We further validate LDL for real-world SISR by applying it to RealESRGAN [wang2021realesrgan], and compare the obtained RealESRGAN+LDL model with both the RealESRGAN and BSRGAN [zhang2021designing] models.
Training datasets and settings. Following prior arts [ledig2017photo, wang2018esrgan, ma2020structure], we conduct both synthetic (LR images downsampled using the MATLAB bicubic kernel) and real-world experiments with the scaling factor used in these prior arts. We also report GAN-SR results on synthetic data in the supplementary materials. We use the same data augmentation, discriminator and optimizer settings as in ESRGAN [wang2018esrgan]. We train our models on either the DIV2K [agustsson2017ntire] or the DF2K [lim2017enhanced, timofte2017ntire] dataset, cropping HR patches at a fixed resolution. We implement the experiments on NVIDIA RTX 2080Ti GPUs with PyTorch, using a fixed batch size per GPU. We initialize the generator with a pretrained fidelity-oriented model, and calculate the perceptual loss as in [wang2021realesrgan] for both the synthetic and real-world settings. The learning rate and the number of training iterations are fixed across experiments.
Evaluation benchmarks and metrics. We employ six benchmarks for evaluation, including Set5 [bevilacqua2012low], Set14 [zeyde2010single], Manga109 [matsui2017sketch], General100 [dong2016accelerating], Urban100 [huang2015single] and DIV2K100 [agustsson2017ntire]. We compare the GAN-SR results in terms of both perceptual quality and reconstruction accuracy. For the former, we employ LPIPS [zhang2018unreasonable], DISTS [ding2020image] and FID [heusel2017gans] as metrics. LPIPS and DISTS have been validated as effective for evaluating GAN-SR results [jinjin2020pipal], and FID is widely used to evaluate perceptual quality in image generation tasks [karras2019style]. For the latter, we compute PSNR and SSIM indices on the Y channel in the YCbCr space.
4.2 Comparison with state-of-the-arts
Quantitative comparison. Table 1 quantitatively compares the state-of-the-art GAN-SR methods and our LDL. We can see that our proposed LDL scheme improves both the perceptual quality (LPIPS, DISTS, FID) and the reconstruction accuracy (PSNR, SSIM) on most benchmarks under all three backbones, i.e., SRResNet, RRDB and SwinIR.
Specifically, for the three light-weight models, SRResNet+LDL outperforms SFTGAN and SRGAN on most benchmarks in terms of the perceptual quality metrics LPIPS, DISTS and FID, and it outperforms them on all benchmarks in terms of reconstruction accuracy, with clear PSNR and SSIM gains over the second best method.
For the CNN-based backbone RRDB, we train the GAN-SR models on DIV2K and DF2K, respectively, to be consistent with the employed competing models. We can see that among the three competing methods, SPSR performs the best in terms of perceptual quality metrics, since it benefits from the additional network branch that restores the gradient map of images. By explicitly discriminating artifacts and regularizing the adversarial training, LDL achieves clear improvements over SPSR, e.g., a lower (better) LPIPS score on the DIV2K validation set. USRGAN achieves the best reconstruction accuracy among the three competing methods since it integrates learning-based and model-based strategies. Compared to USRGAN, LDL not only achieves much better reconstruction accuracy on all benchmarks, but also improves the perceptual indices. This validates that LDL can simultaneously inhibit the visual artifacts and generate more high-fidelity details.
For the transformer-based backbone SwinIR, we see that SwinIR+ outperforms the CNN-based methods on most benchmarks in terms of both perceptual quality and reconstruction accuracy, demonstrating the potential of transformer-based architectures for GAN-SR. As expected, SwinIR+LDL further improves on SwinIR+ on most benchmarks, demonstrating the generalization capacity of LDL across different network architectures.
Qualitative comparison. Figure 7 presents visual comparisons among the GAN-SR methods using the RRDB backbone. Similar conclusions to the quantitative comparisons can be drawn. LDL generates far fewer visual artifacts than ESRGAN, USRGAN and SPSR, especially in regions with fine-scale aliasing structures. In addition, by regularizing the adversarial training process, LDL is able to reconstruct more details with high fidelity, such as areas with regular patterns (e.g., the lines on windows and the grid on the bridge). These improvements make LDL a practical GAN-SR solution for image quality enhancement.
4.3 Applications to real-world SISR
To demonstrate the generalization capability of the proposed LDL, we also apply it to the real-world SISR task. Compared to SISR on synthetic LR images, SISR on real-world LR images faces unknown and much more complicated degradations [zhang2021designing]. We introduce the proposed artifact discrimination loss into the RealESRGAN method [wang2021realesrgan] and keep all other settings unchanged to train our RealESRGAN+LDL model. Since there are no ground-truth images, we show qualitative comparisons with RealESRGAN and BSRGAN in Figure 8. As can be seen in the area of dense windows, RealESRGAN introduces unpleasant artifacts, while BSRGAN produces relatively smooth structures. In contrast, our LDL suppresses the generation of artifacts while encouraging sharp details. In the area of twigs, the proposed LDL improves the generation of fine details, benefiting from the explicit and accurate discrimination between artifacts and realistic details.
4.4 Ablation study
We conduct ablation studies to investigate the roles of the major components in our LDL method, including the primary artifact map, the globally scaled map in Eq. (4), the refined map in Eq. (6) and the EMA model. Results are reported in Table 2. #1 gives the baseline performance when none of the above operations is used. By introducing the primary map in #2, we observe a clear performance gain in both perceptual quality and reconstruction accuracy. This demonstrates the effectiveness of explicitly discriminating and penalizing the visual artifacts in GAN-SR. The globally scaled map in #3 and the refined map in #4 each further improve the performance. Finally, by using the stable EMA model during testing in #5, we achieve a further performance gain as expected.
Although the proposed LDL is effective in improving both the perceptual quality and reconstruction accuracy of SISR outputs, it still has limitations in discriminating visual artifacts in regions suffering from heavy aliasing. Taking the last row of Figure 7 as an example, some artifacts still remain around the dense windows in our result. In this paper, we discussed how artifacts are generated by GAN-SR methods and proposed a simple attempt to tackle this problem, while we believe more effective designs for artifact discrimination and detail generation exist.
In this paper, we analyzed how the visual artifacts were generated in the GAN-based SISR methods, and proposed a locally discriminative learning (LDL) strategy to address this issue. A framework to discriminate visual artifacts from realistic details during the GAN-SR model training process was carefully designed, and an artifact map was generated to explicitly penalize the artifacts without sacrificing the realistic details. The proposed LDL method can be easily plugged into different off-the-shelf GAN-SR models for both synthetic and real-world SISR tasks. Extensive experiments on the widely used datasets demonstrated that LDL outperforms the existing GAN-SR methods both quantitatively and qualitatively.