Single image super-resolution (SISR) [SISR, ntire2017, ntire2018, ntire2019, deepsr] has been an active topic for decades because of its practical value in improving the visual quality of digital images. SISR addresses the ill-posed problem of estimating a super-resolution (SR) image from its low-resolution (LR) counterpart. Benefiting from the success of deep learning, the performance of SISR has been consistently improved by carefully designed architectures [SRCNN, FSRCNN, ESPCN, VDSR, LapSRN, EDSR, CARN, RCAN, RDN, towardRwSR] and diverse loss functions [perceptual, SRGAN, enhancenet, ESRGAN, zoomToLearn]. Despite this recent progress, the gap between papers and practical applications is still huge. Although these methods work well on common benchmark datasets, they often fail on real-world images captured in the wild. Recent research attributes this to the lack of natural LR and high-resolution (HR) image pairs and develops different ways to acquire such image pairs [zoomToLearn, camLensSR, towardsRaw]. Even so, the diversity of real-world images and the complexity of image degradation are still far beyond our comprehension, which greatly complicates the mapping from LR to HR images. To this end, domain-specific super-resolution [fsrnet, SFTGAN, superFan, TDAE, yang2018hallucinating, yu2016ultra, lee2018attribute, SICNN, MTUN, learnDegradationFirst] simplifies the problem by targeting limited image domains. By leveraging the domain-specific information in training sets, this line of methods greatly improves the generalization and robustness of image super-resolution, especially for large-scale super-resolution.
However, image super-resolution suffers from a fundamental problem: the curse of dimensionality, which states that the distances between arbitrary data points become indiscernible when the dimension is sufficiently high [NN99, Aggarwal01]. Existing algorithmic frameworks rely on norm-based distances (e.g., the $\ell_1$ and $\ell_2$ norms) to measure the discrepancy between super-resolved images and ground-truth images. For image super-resolution, however, the resulting dimension is very high. For example, the dimension is 1e+6 for a one-megapixel image. Therefore, the difficulty caused by the curse of dimensionality is present in every current algorithm that applies distance-based reconstruction errors.
Fortunately, GAN-based models [goodfellow2014GAN] are capable of recovering visually plausible images. In particular, the recent style-based generator architecture for GANs (StyleGAN) [styleGAN, styleGAN2] can generate large-scale photo-realistic images from a low-dimensional random vector. However, GAN-based models usually suffer from artifacts, whereas PSNR-oriented models produce images that are more faithful but less visually plausible. Both problems get worse as the SR scale factor increases. To leverage the strength of GAN models, we propose a novel distance-free algorithm for domain-specific image super-resolution. Our main contributions are summarized as follows:
To deal with the curse of dimensionality, we apply progressive growing of the neural architecture with skip connections instead of distance metrics. The semantic contents of high-resolution images are progressively derived from the low-resolution ones, and their fidelity is guaranteed by the fade-in layer. In this way, we remove norm-based pixel losses such as $\ell_1$ and $\ell_2$, which helps us obtain high-quality reconstructions with high fidelity to source images.
Adversarial learning is harnessed to enforce the outputs of the network to be highly photo-realistic. Different from existing SR methods, we use only the GAN loss; no distance-based pixel loss is involved in our framework. Therefore, our algorithm does not suffer from artifacts like blurriness and checkerboard patterns.
Our distance-free framework may benefit not only image super-resolution but also related tasks such as image denoising and image deblurring.
To the best of our knowledge, our algorithm is the first that can produce perceptually faithful, high-quality super-resolution images without using distance-based pixel losses. In particular, our algorithm favors perceptually accurate reconstruction of the ground truth over pixel-wise precision. To highlight this critical difference from conventional image super-resolution, we use the concept of perceptual image super-resolution. From a practical point of view, we think that perceptual image super-resolution is a more plausible and sound formulation of the image super-resolution task. The characteristics of perceptual image super-resolution are illustrated in detail in Section 5.
2 Related Work
Deep CNN for Super-Resolution. As a pioneering deep-learning-based SR method, Super-Resolution Convolutional Neural Network (SRCNN) [SRCNN] outperforms traditional algorithms with a deep convolutional neural network. Instead of taking upsampled images as input, Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [ESPCN] and Fast Super-Resolution Convolutional Neural Network (FSRCNN) [FSRCNN] upsample images at the end of the network. The reduction in the number of operations at large scale makes them less time-consuming than SRCNN. To better harness the power of deep-learning models, Kim et al. proposed Very Deep Super-Resolution (VDSR) [VDSR], a deep network that introduces residual learning to ease the training process and achieves a remarkable improvement in accuracy. By removing unnecessary modules from conventional residual networks, Lim et al. [EDSR] proposed the Enhanced Deep Super-Resolution network (EDSR) and the Multi-scale Deep Super-Resolution System (MDSR), which also achieve significant improvement.
To improve the visual quality of SR results, perceptual loss [perceptual] is applied by minimizing the error in the feature space of a well-trained model instead of directly in the pixel space. Besides, contextual loss [contextual] is developed to generate images using an objective that focuses on the contextual similarity rather than merely spatial location of pixels. Ledig et al. introduced Residual Neural Network (ResNet) [Resnet] to construct a deeper network named SRResNet [SRGAN]. In this work, they also proposed GAN for Image Super-Resolution (SRGAN) using perceptual loss and GAN loss to achieve photo-realistic SR. Following this approach, both SISR through automated texture synthesis (EnhanceNet) [enhancenet] and Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) [ESRGAN] use GAN-based models to attain visually plausible SR results. Recovering realistic texture in image super-resolution by deep Spatial Feature Transform (SFTGAN) [SFTGAN] finds that more photo-realistic results can be obtained with a correct categorical prior.
In the aforementioned literature, photo-realism is usually attained by adversarial training with GANs, but the predicted results may not be faithful reconstructions and sometimes contain unpleasing artifacts. As Blau et al. [blau2018perception] have proved, distortion and perceptual quality are at odds with each other.
Domain-Specific Super-Resolution. As for domain-specific SR, face SR, a.k.a. face hallucination, has been the most widely studied subject. These methods utilize facial priors explicitly or implicitly. Super-resolution of real-world low resolution faces in arbitrary poses with GANs (Super-FAN) [superFan] and Multi-Task Upsampling Network (MTUN) [MTUN] both apply facial landmarks from Face Alignment Network (FAN) [FAN] to guarantee consistency in end-to-end learning. Face Super-Resolution Network (FSRNet) [fsrnet] uses not only facial landmark heatmaps but also face parsing maps as prior constraints. Super-Identity Convolutional Neural Network (SICNN) [SICNN] uses a super-identity loss function to recover the person's identity. Transformative-Discriminative Autoencoder (TDAE) [TDAE] adopts a decoder-encoder-decoder framework to handle noisy face images. Although these methods obtain better SR results for faces, they settle for generating low-resolution images, which greatly limits the visual quality of the SR images.
3 Curse of Dimensionality
The principal argument about the issue in high-dimensional spaces is that the concept of distance-based nearest neighbors is no longer meaningful when the dimension becomes sufficiently high [NN99, Aggarwal01]. This proposition is so important that it is worth writing rigorously, i.e.,

$$\lim_{d\to\infty} \frac{D_{\max} - D_{\min}}{D_{\min}} = 0, \qquad (1)$$

where $D_{\max}$ and $D_{\min}$ represent the maximum distance and the minimum distance to a query point in the given data set, respectively, and $d$ is the dimension. The above limit implies that for sufficiently large $d$, the data points become spatially indiscernible by norm-based distance measures in high-dimensional spaces. To illustrate this effect, we compute the ratio of distance discrepancy in formula (1) in various dimensions. Figure 1 shows that the distance ratio converges toward zero with negligible standard deviation beyond ten thousand dimensions.
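The effect described above can be reproduced with a small numerical experiment. The following is a minimal sketch, not the authors' script: it assumes points drawn uniformly from the unit hypercube, Euclidean distances, and a single random query point, and the function name `relative_contrast` is hypothetical.

```python
import numpy as np


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption: points are uniform in the unit hypercube and distances are
    Euclidean; the setup behind the paper's Figure 1 may differ.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))  # candidate neighbors
    query = rng.random(dim)             # query point
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000, 10000):
        print(f"d={dim:>5}  ratio={relative_contrast(dim):.4f}")
```

Under these assumptions the ratio shrinks steadily as the dimension grows, which is the behavior the limit predicts.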
The curse of dimensionality suggests that it is risky to adopt norm-based measures or analogous counterparts to quantify the discrepancy between super-resolution images, e.g., the reconstruction error. In our opinion, the subtle artifacts produced by SR networks are partially due to this reason. From the viewpoint of dimensionality, we should avoid distance measures such as $\ell_1$, $\ell_2$, and perceptual loss for very high-dimensional super-resolution tasks. In this paper, we propose an alternative that replaces the distance measures in image super-resolution.
4 Progressive Adversarial Network
The proposed Progressive Adversarial Network (PAN) aims to estimate an SR image from its LR counterpart, synthesizing a photo-realistic, faithful, high-quality image while preserving content consistency with the LR image. At the same time, the algorithm is expected to generalize well to unseen real-world data. To this end, we need to achieve the following two goals.
Control the reconstruction error without distance measures, as explained in Section 3. We fulfill this condition with progressive growing of a partial U-Net.
Maintain the high-quality photo-realistic effect. This goal is attained by adversarial learning. A GAN architecture with random noise injection is designed to synthesize high-quality images.
The details are given in the following sub-sections.
4.1 Progressive Super-Resolution
In our distance-free framework, we harness progressive growing of a partial U-Net to guarantee the faithful generation of super-resolved images. Progressive learning has shown good potential in ProGAN [progressiveGAN] and StyleGAN [styleGAN], and the idea is intuitively suitable for the super-resolution task [progressiveSuperRes18]. The key role of progressive super-resolution is that the high-resolution image can be generated gradually from the low-resolution one. In this way, reconstruction accuracy can be assured without the constraint of distance-based losses, thus freeing our algorithm from the curse of dimensionality. Formally, the process of progressive growing can be expressed as:
where means that the size of is larger than that of and the other is the same, and is the corresponding generator. The parameter of () is learned with the pretrained for each resolution.
When the size of the output image is no more than that of the input one, U-Net [Unet] is employed as the generator, which contains the encoder-decoder parts and skip connections. These connections preserve information from different abstraction levels and transfer it from the encoder to the decoder. Otherwise, only progressive growing of image resolution is involved. An overview of the proposed PAN is shown in Fig. 2.
To be specific, our progressive super-resolution is two-fold. First, the fade-in layer is introduced to ease training at higher resolutions. As illustrated in Fig. 3, the toRGB layer projects feature maps to images and the fromRGB layer performs the reverse mapping. The fade-in weight increases linearly from 0 to 1 as training proceeds, so that the new layer fades into the network smoothly. Second, both the input LR images and the real images used to train the discriminator are provided in a coarse-to-fine way. Namely, our training starts at a resolution of $8\times 8$ pixels, so the input image and the ground-truth image are both degraded to $8\times 8$ pixels, which makes it much easier for the generator to produce “real” images and succeed in fooling the discriminator.
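The fade-in mechanism can be illustrated with a short sketch. This is an assumption-laden illustration in the style of ProGAN rather than the authors' implementation: the names `fade_in`, `linear_alpha`, and `upsample_nearest_2x` are hypothetical, images are plain (H, W, C) arrays, and nearest-neighbor upsampling is assumed for the coarse branch.

```python
import numpy as np


def upsample_nearest_2x(img):
    """Double height and width of an (H, W, C) image by pixel repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def linear_alpha(step, fade_steps):
    """Fade-in weight that grows linearly from 0 to 1 over fade_steps."""
    return min(1.0, max(0.0, step / fade_steps))


def fade_in(prev_rgb, new_rgb, alpha):
    """Blend the upsampled coarse output with the new fine output.

    prev_rgb: toRGB output of the previous (coarser) resolution, (H, W, C).
    new_rgb:  toRGB output of the newly added block, (2H, 2W, C).
    alpha:    fade-in weight in [0, 1]; 0 ignores the new block entirely.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * upsample_nearest_2x(prev_rgb) + alpha * new_rgb


if __name__ == "__main__":
    coarse = np.zeros((8, 8, 3))
    fine = np.ones((16, 16, 3))
    alpha = linear_alpha(step=150_000, fade_steps=600_000)
    print(alpha, fade_in(coarse, fine, alpha).mean())
```

With `alpha` at 0 the network behaves exactly like the previously trained coarser network, which is why the new layer does not disturb what has already been learned.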
4.2 Adversarial Learning without Distance Measures
To obtain photo-realistic SR images, the conventional way is to resort to adversarial learning. In general, existing algorithms follow the GAN framework of Pix2Pix [PIX2PIX], which first proposed combining the adversarial loss with a norm-based regularization loss, so that the generator is trained not only to fool the discriminator but also to generate images as close to the ground truth as possible. As analyzed in Section 3, distance measures such as the $\ell_1$ and $\ell_2$ norms lead to ambiguous artifacts due to the curse of dimensionality. Typically, norm-based reconstruction losses aim to achieve a higher Peak Signal-to-Noise Ratio (PSNR) but usually lead to blurry images. In addition, balancing different losses, e.g., adversarial loss, reconstruction loss, and perceptual loss, is a very tricky step. For our distance-free algorithm, however, we only need a GAN model that is capable of significantly enhancing the visual quality of SR images. We use the non-saturating loss [goodfellow2014GAN] with $R_1$ regularization [regularizationR1] as the loss function. The discriminator loss is defined as:
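$$\mathcal{L}_D = -\mathbb{E}_{x}\big[\log D(x)\big] - \mathbb{E}_{y}\big[\log\big(1 - D(G(y))\big)\big] + \frac{\gamma}{2}\,\mathbb{E}_{x}\big[\|\nabla_x D(x)\|_2^2\big], \qquad (3)$$

where $x$ denotes a real HR image, $y$ denotes the LR input of the generator $G$, and $\gamma$ is the hyper-parameter that weighs the $R_1$ regularization. The adversarial loss for the generator is the non-saturating loss:

$$\mathcal{L}_G = -\mathbb{E}_{y}\big[\log D(G(y))\big]. \qquad (4)$$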
Following StyleGAN [styleGAN], the structure of the blocks that encode and decode features is shown in Fig. 3. The encoder and the discriminator are similar in structure, while the decoder and the discriminator are mirror images of each other.
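The objectives of this subsection can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' training code: it uses the standard logistic form of the non-saturating loss on raw discriminator logits, takes the gradient of the discriminator output with respect to real images as a precomputed array (in practice an autodiff framework would supply it), and the names `discriminator_loss`, `generator_loss`, and `gamma` are hypothetical.

```python
import numpy as np


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.logaddexp(0.0, t)


def discriminator_loss(real_logits, fake_logits, real_grads, gamma):
    """Non-saturating discriminator loss with an R1 penalty.

    real_logits: D(x) on real HR images, shape (N,).
    fake_logits: D(G(y)) on generated images, shape (N,).
    real_grads:  gradient of D(x) with respect to x, shape (N, ...);
                 assumed to be computed elsewhere by autodiff.
    gamma:       weight of the R1 term (hypothetical parameter name).
    """
    # -log(sigmoid(t)) == softplus(-t); -log(1 - sigmoid(t)) == softplus(t)
    adversarial = softplus(-real_logits).mean() + softplus(fake_logits).mean()
    squared_norms = (real_grads.reshape(len(real_grads), -1) ** 2).sum(axis=1)
    return adversarial + 0.5 * gamma * squared_norms.mean()


def generator_loss(fake_logits):
    """Non-saturating generator loss: -log(sigmoid(D(G(y))))."""
    return softplus(-fake_logits).mean()


if __name__ == "__main__":
    logits = np.zeros(4)
    grads = np.zeros((4, 3, 2, 2))
    print(discriminator_loss(logits, logits, grads, gamma=10.0))
    print(generator_loss(logits))
```

Note that neither function compares generated pixels with ground-truth pixels; the only signal reaching the generator is the discriminator's judgment, which is the sense in which the framework is distance-free. The value `gamma=10.0` in the usage lines is purely illustrative.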
4.3 Random Noise Injection
In StyleGAN [styleGAN], the authors discover an appealing property of random noise in enhancing the quality of the generator. The photo-realistic details of generated images can be significantly improved by injecting random noise into the feature maps of the generator, i.e.

$$x_i \leftarrow x_i + w_i\,\epsilon, \qquad (5)$$

where $x_i$ is the $i$-th channel of the feature maps, $\epsilon$ is random noise of the same size as $x_i$, and $w_i$ is the learnable parameter that scales the noise. For our PAN architecture, we adopt this type of noise injection as well. The random noise mainly acts as a regularizer of the deep neural network [Noh2017noise, You2018noise], thus stabilizing the training process and facilitating convergence.
It is worth emphasizing that the advantage of noise injection cannot be achieved for GAN models with distance measures. The noise is randomly sampled for each update during training, and different noise samples alter the details of the generated images. For such models, this operation amounts to randomly perturbing the distance losses, thus hindering the convergence of the algorithm. However, our PAN framework is free from this negative influence and is able to take advantage of noise injection, as StyleGAN does.
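A minimal sketch of StyleGAN-style noise injection is given below for illustration. It assumes channel-first feature maps of shape (C, H, W), one Gaussian noise image shared by all channels, and a per-channel learnable scale; the function name `inject_noise` and the argument names are hypothetical.

```python
import numpy as np


def inject_noise(features, scale, rng):
    """Add per-channel scaled noise to feature maps.

    features: array of shape (C, H, W).
    scale:    per-channel scaling weights, shape (C,); learnable in a real
              network, a plain array in this sketch.
    rng:      numpy random Generator; a fresh noise image is drawn on
              every call, i.e. for every update.
    """
    _, height, width = features.shape
    noise = rng.standard_normal((1, height, width))  # shared across channels
    return features + scale[:, None, None] * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = np.ones((3, 4, 4))
    out = inject_noise(feats, np.array([0.0, 0.1, 0.2]), rng)
    print(out.shape, np.abs(out - feats).max(axis=(1, 2)))
```

Because a new noise image is drawn at every call, two forward passes on the same input produce slightly different details, which is the source of the perturbation discussed above for models trained with distance losses.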
5 Experiments
In this section, we first explain some training details and compare PAN with state-of-the-art SR methods on three commonly used image quality metrics: PSNR, Structural Similarity Index (SSIM) [SSIM], and Natural Image Quality Evaluator (NIQE) [NIQE]. As noted in previous work [Chen2018dark, Yu2018exposure], these pixel-based metrics sometimes cannot accurately reflect the visual quality of the resulting images. To remedy this, we also use two metrics designed for GANs, Fréchet Inception Distance (FID) [FID] and Sliced Wasserstein Distance (SWD) [progressiveGAN], to estimate realism by measuring how closely our SR results resemble real face images. Then we conduct ablation experiments to investigate how each part of the network influences the capability of performing super-resolution.
In fact, pixel-level metrics such as PSNR, SSIM, and NIQE are not suitable for the perceptual image super-resolution we propose, because perceptual image super-resolution aims to recover large images that are perceived as holistically accurate rather than pixel-wise precise. The results of these metrics serve only as a reference for the reconstruction precision of our algorithm.
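For reference, PSNR is a direct function of the mean squared pixel error, which is why it rewards pixel-wise precision rather than holistic appearance. The sketch below uses the standard definition; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions of this example.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two same-sized images."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = np.zeros((4, 4))
    est = np.full((4, 4), 25.5)
    print(psnr(ref, est))
```

Any deviation in fine texture, even one that is visually indistinguishable, increases the mean squared error and therefore lowers the score.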
5.1 Implementation Details
As shown in Fig. 2, our model super-resolves small images with a scaling factor of 8 (scaling images from to ). We start with resolution and stabilize the network in 600k iterations at each resolution (8, 32, 64, 128, 256, 512, and 1024). Every time right after doubling the resolution, the new layer fades in smoothly and it takes 600k iterations to be completely integrated into the network. The entire network is trained in an end-to-end manner using loss function in Eq. (3) and Eq. (4) alternately with . Adam optimizer is used with , , =1e-8 and learning rate for different resolutions follows this setting . We use different minibatch sizes according to to avoid out-of-memory problem. Our training set and test set are Flickr-Faces-HQ (FFHQ) dataset [styleGAN] and CelebA-HQ [progressiveGAN], respectively. It takes about 5 days to complete the training process using 4 Tesla P40 GPUs.
5.2 Quantitative Evaluation
We compare our model with bicubic upscaling and state-of-the-art SISR methods, including ESRGAN [ESRGAN], Residual Channel Attention Networks (RCAN) [RCAN], and Cascading Residual Network (CARN) [CARN], on the first 5000 images of CelebA-HQ. Because the $8\times$ scale is not supported in the official implementations of ESRGAN and CARN, our evaluation is performed at the $4\times$ scale. Our model originally produces an $8\times$ SR image. To compare with other methods, we apply $2\times 2$ average pooling to our results to obtain a $4\times$ SR image. Quantitative comparisons are given in Table 1.
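The average pooling used to bring the PAN output down to the evaluation scale can be sketched as follows. This is an illustrative helper, not the authors' evaluation code: the name `average_pool` is hypothetical, images are assumed to be (H, W, C) arrays, and the window size `k` is a parameter whose default of 2 halves each side.

```python
import numpy as np


def average_pool(img, k=2):
    """Downscale an (H, W, C) image by averaging non-overlapping k x k blocks."""
    height, width, channels = img.shape
    if height % k or width % k:
        raise ValueError("image sides must be divisible by k")
    blocks = img.reshape(height // k, k, width // k, k, channels)
    return blocks.mean(axis=(1, 3))


if __name__ == "__main__":
    img = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
    print(average_pool(img)[..., 0])
```

Pooling a higher-resolution output in this way slightly smooths it, so the comparison at the lower scale is, if anything, conservative with respect to fine detail.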
Of all the compared methods, our model performs best on FID and SWD and second best on NIQE. Indeed, our method significantly reduces distortion and greatly improves perceptual quality, as shown in Fig. 4. We observe that ESRGAN tends to produce over-sharpened images with ringing artifacts, while RCAN and CARN cannot recover details and suffer from a blurry appearance. In contrast, our PAN produces SR images that are much more photo-realistic than the others, especially when we zoom in and examine image details. For example, ESRGAN fails to recover the teeth shape (the second row) and the eyelash style (the last row), and over-sharpens the textural details of the eye (the first row) and the ear (the third row). Our PAN algorithm instead generates results that are more perceptually faithful to the ground truth than those of ESRGAN, even though ESRGAN’s PSNR, SSIM, and NIQE are better than ours.
What’s more intriguing is that our method frequently generates SR results even visually better than original HR images, as the second face example shows in Fig. 4. This unique strength stems from the perceptual characteristic of our algorithm without the constraint of pixel losses, which is beyond the capability of existing state-of-the-art algorithms that optimize distance-based reconstruction errors. These results solidly support the plausibility of perceptual image super-resolution.
To further demonstrate the power of our method, we compare our results with RCAN at different SR scales. (Among the compared state-of-the-art methods, only RCAN’s official implementation directly supports $8\times$ super-resolution, so we compare only with RCAN in this experiment.) As shown in Fig. 5(b), when we increase the scale factor, the FID of the RCAN results increases dramatically, which means a significant degradation in image quality. We attribute this to the high-dimensional output at the end of the network, which typically has a large tile size. By comparison, our algorithm is quite stable across scales and consequently gives a much better SR result at high resolution, implying that PAN is rather robust to varying dimension. The images in Fig. 5(a) also confirm the quantitative superiority of PAN. The obvious quality difference can be revealed by zooming in on the facial details.
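Since FID drives this comparison, it may help to recall how the distance itself is computed. The sketch below implements only the Fréchet distance between two Gaussians fitted to feature vectors; the full FID additionally requires features from a pretrained Inception network, which is outside this example. The function names `gaussian_stats` and `frechet_distance` are hypothetical.

```python
import numpy as np


def gaussian_stats(features):
    """Mean and covariance of an (N, D) array of feature vectors."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Uses Tr((S1 S2)^(1/2)) = sum of square roots of the eigenvalues of
    S1 S2, which are real and non-negative for covariance matrices.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Clip tiny negative or imaginary parts caused by round-off.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.standard_normal((500, 3))
    fake = rng.standard_normal((500, 3)) + 0.5
    print(frechet_distance(*gaussian_stats(real), *gaussian_stats(fake)))
```

Lower values indicate that the two feature distributions are closer, so an increasing FID corresponds to generated images drifting away from the statistics of real images.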
More results are shown at https://lonew.github.io/pan-sr/.
Figure 5: (a) Super-resolution images. (b) FID accuracy.
5.3 GAN Loss
To verify the advantage of the GAN loss, we compare it with the $\ell_1$ and $\ell_2$ losses. This evaluation is performed on SR results. Quantitative comparisons are given in Table 2. The model with the $\ell_1$ or $\ell_2$ loss has better PSNR and SSIM but generates blurry images with zipper artifacts, while the model with the GAN loss produces clear and sharp images with better details. Although tiny details such as hair and wrinkles are not identical to those of the original HR image, the overall appearance stays visually the same, and it is really hard to tell the difference, as shown in Fig. 6. The difference in tiny details is caused by random noise injection. However, the random noise is also part of what makes the recovered SR images more holistically photo-realistic.
5.4 Progressive Learning
Intuitively, progressive learning makes the training process more stable and thus helps the model converge to a considerably better optimum. To verify this, we train our PAN algorithm without progressive learning. Table 2 illustrates the effectiveness of progressive learning quantitatively: the SR images from the non-progressively trained network are clearly degraded. In Fig. 7, we observe that the progressive version of PAN largely improves the visual quality and reduces the distortion in image details.
5.5 Random Noise Injection
StyleGAN shows that random noise leads to small stochastic variations in generated images. In our networks, we add random noise to the feature maps at each resolution. Instead of rigidly following the distribution of the LR image, noise injection allows the generator to produce more variation in image details, which helps generate more photo-realistic and texture-rich images to fool the discriminator. To see what happens without this noise, we train a noise-free version of PAN. Table 2 shows that noise injection improves only the NIQE metric and, on the contrary, makes FID and SWD slightly worse. In fact, however, the noise has a large influence in making the SR images more natural and realistic without changing their holistic appearance, as shown in Fig. 8.
5.6 Skip Connection
How do skip connections affect the result of SR? What do we lose if we bypass some skip connections? To gain some insight into this problem, we train PAN with different levels of skip connections on scale (from to ). In Fig. 9, the images on the left are the results using all skip connections; moving to the right, the skip connections are omitted layer by layer. We observe that skip connections over coarse spatial resolutions bring high-level semantic consistency, such as the same pose and a similar general hair style, while fine-resolution skip connections introduce detail consistency, such as expression and wrinkles. Besides, the model with fewer skip connections seems harder to converge and tends to produce more artifacts, meaning that these connections indeed make a profound difference to our SR model.
It is well known that large-scale SR is a very challenging endeavor and is prone to producing unrealistic images. However, focusing on a domain-specific dataset makes it possible for us to hallucinate details appropriately and faithfully, so our method can be easily extended to handle SR problems of different scales. Besides, our method does not need face-specific information such as face landmarks and parsing maps, which makes it applicable to other object categories. We also try our model on the LSUN cat dataset [LSUN].
As illustrated in Fig. 10, our results largely outperform those of other methods perceptually. Although the higher variation in the dataset results in some artifacts, we notice that these artifacts are usually located in the background, which suggests that our model has learned what a cat is and how to super-resolve it. Moreover, thanks to the memes in the training set, which contain plenty of text, some text regions in the background are also super-resolved well when we super-resolve a cat. So a text-specific SR model may also work using our method.
Since real-world low- and high-resolution image pairs are not readily available, the most common strategy for learning super-resolution models is to first downscale images in order to create the corresponding training pairs. As a consequence, however, the resulting low-resolution image is clean and almost noise-free. This often leads to dramatic artifacts when the algorithm is applied to images that come straight from the internet or from different cameras. To tackle this, we apply an online random image degradation to the LR images during training. The degradation turns a clean image into a noisy, blurry, low-quality image by combining continuous point spread functions caused by camera aperture and handshake, downsampling, noise, and JPEG compression. By randomly introducing this degradation into the input LR images during training, the generalization of our model to low-quality images is significantly improved. Notably, compared with the low-quality LR image, our SR results have better details and less noise, two properties that are traditionally thought to be at odds with each other. The results in Fig. 11 further show that our model is robust in practical applications.
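An online degradation of this kind might be sketched as below. This is only an illustration under several assumptions that the text does not confirm: the order blur, downsample, add noise is assumed; a single isotropic Gaussian kernel stands in for the aperture point spread function; the handshake blur and the JPEG compression step are omitted; images are single-channel arrays in [0, 1]; and all function names and parameter ranges are hypothetical.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """Normalized 2-D isotropic Gaussian kernel of shape (size, size)."""
    axis = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (axis / sigma) ** 2)
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def blur(img, kernel):
    """Filter an (H, W) image with a symmetric kernel, edge-padded."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    height, width = img.shape
    out = np.zeros((height, width), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i:i + height, j:j + width]
    return out


def degrade(img, rng, scale=2, sigma_range=(0.5, 2.0),
            noise_std_range=(0.0, 0.05)):
    """Randomly blur, downsample, and add Gaussian noise to a clean image."""
    sigma = rng.uniform(*sigma_range)          # random blur strength
    blurred = blur(img, gaussian_kernel(7, sigma))
    low = blurred[::scale, ::scale]            # simple decimation
    noise_std = rng.uniform(*noise_std_range)  # random noise level
    noisy = low + rng.normal(0.0, noise_std, low.shape)
    return np.clip(noisy, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((16, 16))
    print(degrade(clean, rng).shape)
```

Drawing fresh blur and noise parameters for every training sample is what makes the degradation "online": the network rarely sees the same corruption twice.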
In this paper, we propose the Progressive Adversarial Network (PAN) to produce domain-specific SR images with high fidelity and high quality. To circumvent the curse of dimensionality in image super-resolution, we find an alternative way to generate high-resolution images instead of using $\ell_1$/$\ell_2$ losses. Progressive growing of the network architecture and a partial U-Net are applied together to achieve reconstruction precision. Adversarial learning with random noise injection in the generator is performed to promote the high-fidelity and photo-realistic quality of super-resolved images. Extensive experiments demonstrate the effectiveness and robustness of PAN.
The super-resolution images recovered by our algorithm are perceptually accurate and plausible rather than precise in terms of pixel-wise reconstruction, in the sense that the super-resolved images are visually perceived as identical to the ground truth although tiny details may differ. We call this new form of image super-resolution perceptual image super-resolution.
We also see some promising results on other image-to-image translation problems such as denoising and deblurring. The principle of our algorithm is applicable to these tasks as well. We will study these problems in future work.