Fine-grained Attention and Feature-sharing Generative Adversarial Networks for Single Image Super-Resolution
Traditional super-resolution methods that aim to minimize the mean square error usually produce images with over-smoothed and blurry edges, due to the loss of high-frequency details. In this paper, we propose two novel techniques within the generative adversarial network framework to produce photo-realistic images for image super-resolution. Firstly, instead of producing a single score to discriminate between real and fake images, we propose a variant, called Fine-grained Attention Generative Adversarial Network for image super-resolution (FASRGAN), to discriminate each pixel between real and fake. FASRGAN adopts a Unet-like network as the discriminator with two outputs: an image score and an image score map. The score map has the same spatial size as the HR/SR images, serving as the fine-grained attention to represent the degree of reconstruction difficulty for each pixel. Secondly, instead of using different networks for the generator and the discriminator in the SR problem, we use a feature-sharing network (Fs-SRGAN) for both the generator and the discriminator. By network sharing, certain information is shared between the generator and the discriminator, which in turn can improve the ability to produce high-quality images. Quantitative and visual comparisons with the state-of-the-art methods on the benchmark datasets demonstrate the superiority of our methods. The application of the super-resolved images to object recognition further demonstrates the reconstruction capability of the proposed methods and the quality of their super-resolution results.
Single image super-resolution (SISR), which aims to recover a high-resolution (HR) image from its low-resolution (LR) version, has been an active research topic in computer graphics and vision for decades. SISR has also attracted increasing attention in both academia and industry, with applications in various fields such as medical imaging, security surveillance, remote sensing, and object recognition. However, SISR is a typical ill-posed problem due to the irreversible image degradation process, i.e., multiple HR images can correspond to one single LR image. Learning the mapping between HR and LR images plays an important part in addressing this problem.
Recently, deep convolutional neural networks (CNNs) have shown great success in many vision tasks, such as image classification, object detection, and image restoration. Dong et al. first proposed a three-layer CNN for single image super-resolution (SRCNN) to directly learn the complex non-linear mapping from LR to HR images. Since then, CNN-based methods have been dominant for the SR problem because they greatly improve the reconstruction performance. Kumar et al. tapped into the ability of polynomial neural networks to hierarchically learn refinements of a function that maps LR to HR patches. VDSR obtained remarkable performance by increasing the depth of the network to 20, proving the importance of network depth for detecting effective image features. FSRCNN accelerated the network training by directly extracting features from LR images instead of interpolated images, which greatly reduced the computation cost. Yang et al. proposed a deep recurrent fusion network (DRFN) for SR with large-scale factors, which used transposed convolution to jointly extract and upsample raw features from the input and used multi-level fusion for reconstruction. EDSR removed the unnecessary batch normalization layers in the ResNet architecture and widened the channels. EDSR significantly improved the performance of SISR and won first place in the NTIRE 2017 Super-Resolution Challenge. More recent SISR methods build on EDSR. For example, Zhang et al. introduced the Residual Dense Network (RDN) to extract hierarchical features, proving the effectiveness of the residual dense architecture. RCAN applied a residual-in-residual structure to construct a very deep network and used a channel attention mechanism to adaptively rescale features.
The aforementioned methods are optimized by minimizing the mean squared error (MSE) between the recovered SR image and the corresponding HR image. Such methods are designed to maximize the peak signal-to-noise ratio (PSNR). However, PSNR-oriented methods typically produce over-smoothed edges and lose tiny textures. In order to produce photo-realistic SR images, Ledig et al. first introduced residual learning within the generative adversarial network (GAN) framework to decrease the distance between the distributions of real images and SR images. Yan et al. proposed a novel full-reference image quality assessment (FR-IQA) approach for SISR, i.e., a loss function called SR-IQA. It was combined with a norm-based loss to guide their proposed SISR network to achieve better results. ESRGAN further extends the network to produce more photo-realistic images. However, as shown in Fig. 1, the discriminator in these GAN-based methods only outputs a single score for the whole input SR/HR image, which is a coarse way to guide the generator. Furthermore, the previous GAN-based methods typically use two independent networks for the generator and the discriminator, to generate photo-realistic images and to discriminate between the HR image and the generated SR image, respectively. However, the shallow parts (first several layers) of the two networks both aim at extracting tiny features such as corners and edges, which we believe should be correlated.
To address these limitations, we propose two novel techniques in the GAN framework for image super-resolution: a fine-grained attention mechanism for the discriminator and a feature-sharing network component for both the generator and the discriminator. Specifically, we use a Unet-like discriminator (Fig. 2) to introduce fine-grained attention into the GAN (FASRGAN). Our discriminator produces two outputs, a score for the whole input image and a fine-grained score map covering every pixel in the image. The score map has the same spatial size as the input image, and measures the degree of difference at each pixel between the generated and the true distributions. To produce higher-quality images, we incorporate the score map into the loss function as an attention to make the generator pay more attention to the parts of the image that are difficult to reconstruct, instead of treating all parts equally. In addition, we propose a feature-sharing mechanism (Fig. 3) to align the shallow feature extraction of the generator and the discriminator (Fs-SRGAN). This novel structure can significantly reduce the number of parameters and improve the performance.
Overall, our main contributions are three-fold:
We propose a novel Unet-like discriminator to generate a score of the whole image as well as a pixel-wise score map of the input image. We further incorporate the score map into the loss function as the attention mechanism for the generator. This attention mechanism makes the generator focus on the parts of an image that are difficult to generate.
We introduce a feature-sharing mechanism to define the shallow feature extraction for the generator and the discriminator. This reduces the number of model parameters and helps the generator and the discriminator extract more useful features, which can also improve the model performance.
The proposed two components are general, and can be applied to other GAN-based SR models. Extensive experiments on benchmark datasets illustrate the superiority of our proposed methods compared with current state-of-the-art methods.
The remainder of the paper is organized as follows. Section II describes related works. The proposed GAN-based methods are presented in Section III. Experimental results are discussed in Section IV. Finally, the conclusions are drawn in Section V.
Traditional SISR methods are exemplar or dictionary based. However, these methods are limited by the size of datasets or dictionaries, and are usually time-consuming. These shortcomings can be greatly alleviated by the recent CNN-based methods.
In their pioneering work, Dong et al. applied a convolutional neural network with three layers to SISR, namely SRCNN, to learn a mapping from LR to HR images in an end-to-end manner. Kim et al. increased the depth of the network and introduced residual learning to the SISR network, called VDSR. VDSR achieved great improvement in accuracy compared to SRCNN. Later, Kim et al. used a deeply-recursive convolutional network (DRCN), which has a very deep recursive layer, to reconstruct the SR image. DRRN introduced recursive blocks for stabilizing the training. However, the inputs of all these methods are interpolated LR images with the same size as the HR images, which greatly increases the computational complexity and loses some details. FSRCNN extracted features from the original LR images and upscaled the spatial size by upsampling layers at the tail of the network. This architecture is widely used in subsequent image super-resolution methods. Various advanced upsampling structures have been proposed recently, for instance, the deconvolutional layer [18, 19], sub-pixel convolution, and EUSR. LapSRN and MSLapSRN progressively reconstructed an HR image at increasing scales of an input image using the Laplacian pyramid structure. MRFN employed a multi-receptive-field module to extract different features from different receptive fields and fused them with a module for learning object/part-dependent mappings. It also proposed a two-parameter training loss (Weighted Huber) to adaptively adjust the value of the back-propagated derivative according to the residual value. Lim et al. proposed a very large network (EDSR) and its multi-scale version (MDSR), which removed the unnecessary batch normalization layers in the ResNet and greatly improved super-resolution performance. D-DBPN introduced an error-correcting feedback mechanism to learn relationships between LR features and SR features. ZSSR uses an unsupervised method to learn the mapping between HR images and LR images.
SRMDNF tackled multiple degradation problems in a single network by treating degradation maps and images as inputs. RDN combined dense and residual connections to make full use of the information in LR images. Different from RDN, MS-RHDN proposed multi-scale residual hierarchical dense networks to extract multi-scale and hierarchical feature maps. RNAN utilized both local and non-local architectures to bias the most informative feature components. Meta-SR, proposed by Hu et al., was the first to solve the problem of arbitrary scale factor super-resolution within a single model.
The aforementioned methods aim to achieve high PSNR and SSIM values. However, optimizing for these criteria usually causes heavily over-smoothed edges and artifacts. Images generated by these MSE-based SR methods lose various high-frequency details and have poor perceptual quality. To generate more photo-realistic images, Ledig et al. first introduced the generative adversarial network into image super-resolution, called SRGAN. SRGAN combined a perceptual loss and an adversarial loss to improve the realism of generated images. However, some visually implausible artifacts could still be found in some generated images. To reduce the artifacts, EnhanceNet combined a pixel-wise loss in the image space, a perceptual loss in the feature space, a texture matching loss, and an adversarial loss. The texture matching loss helped to generate more realistic textures. Yan et al. first trained a novel full-reference image quality assessment (FR-IQA) approach for SISR, then employed the proposed loss function (SR-IQA) to train their SR network, which contains their proposed highway unit. In addition, they integrated the SR-IQA loss into a GAN-based SR method to achieve better results in both accuracy and perceptual quality. Dahl et al. proposed a pixel recursive super-resolution model, an extension of PixelCNNs [35, 36], to reconstruct super-resolved face images. The contextual loss is a kind of perceptual loss that makes the generated images as similar as possible to the ground truth images. Cheon et al. utilized the DCT transformation to make the generated images closer to the ground truth images in the frequency domain, reducing the blurry-edge effects caused by the pixel loss. Based on SRGAN, ESRGAN substituted the standard residual block with a residual-in-residual dense block, removed batch normalization layers, used VGG features before activation for the perceptual loss, and replaced the standard discriminator with the Relativistic Discriminator proposed in RaGAN. In addition, ESRGAN used network interpolation to balance the MSE loss and perceptual quality. Notably, ESRGAN won first place in the 2018 PIRM Challenge on Perceptual Image Super-Resolution, which targeted images of high perceptual quality.
Our methods aim to reconstruct a high-resolution image $I^{SR} \in \mathbb{R}^{rW \times rH \times C}$ from a low-resolution image $I^{LR} \in \mathbb{R}^{W \times H \times C}$, where $W$ and $H$ are the width and height of the LR image, $r$ is the upscaling factor, and $C$ is the number of channels of the color space. This section details our two strategies within the GAN framework for image super-resolution in order: FASRGAN and Fs-SRGAN. We propose a fine-grained attention in FASRGAN to make the generator focus on the difficult parts of image reconstruction instead of treating every part equally. At the same time, we propose a feature-sharing mechanism in Fs-SRGAN by sharing the shallow feature extraction of the generator and the discriminator. These two strategies contribute to the overall perceptual quality for SR. For simplicity, we use the same network architecture as ESRGAN for the generator.
Our proposed fine-grained attention GAN (FASRGAN) designs a specific discriminator for SISR. As discussed above and shown in Fig. 1, the discriminator in a standard GAN-based SR model outputs a single score for the whole input SR/HR image. This is a coarse way to judge an input image and cannot discriminate local features of the input. To tackle this problem, the proposed FASRGAN defines a Unet-like discriminator with two outputs, corresponding to a score for the whole image and a fine-grained score map, respectively. The score map has the same spatial size as the input image and is used for pixel-wise discrimination. The proposed discriminator is illustrated in Fig. 2.
The Unet-like discriminator with two outputs can be divided into two parts: an encoder and a decoder.
Encoder. Similar to the standard discriminator D in SRGAN, the encoder part of the proposed Unet-like discriminator uses a standard max-pooling layer with a stride of 2 to reduce the spatial size of a feature map and increase the receptive field. At the same time, the number of channels is increased to improve the representation ability. At the end of the encoder, two fully connected layers are added to output a score, measuring the overall probability of an input image being real or fake. We further enhance the discriminator based on the Relativistic GAN, which has also been used in ESRGAN. The loss function is defined as:
$$L_{D}^{adv} = -\mathbb{E}_{x_r}\left[\log D_{Ra}(x_r, x_f)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right], \tag{1}$$
where $x_r$ and $x_f$ stand for the ground truth image and the generated SR image, respectively. $D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)]\right)$ refers to the function of the relativistic discriminator, which tries to predict the probability that a real image is more realistic than a fake one; $C(x)$ is the discriminator output before the sigmoid function and $\sigma$ is the sigmoid function.
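As an illustration, the relativistic discriminator loss described above can be sketched in plain Python on lists of raw (pre-sigmoid) discriminator outputs. This is a minimal sketch, not the authors' PyTorch implementation: the function names `log_sigmoid` and `relativistic_d_loss` are hypothetical, and the expectations are assumed to be simple batch averages.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss on raw scores C(x)."""
    mean_real = sum(real_logits) / len(real_logits)
    mean_fake = sum(fake_logits) / len(fake_logits)
    # Real images should score higher than the average fake image.
    loss_real = -sum(log_sigmoid(c - mean_fake) for c in real_logits) / len(real_logits)
    # Fake images should score lower than the average real image;
    # log(1 - sigmoid(z)) equals log(sigmoid(-z)).
    loss_fake = -sum(log_sigmoid(-(c - mean_real)) for c in fake_logits) / len(fake_logits)
    return loss_real + loss_fake


confused = relativistic_d_loss([0.0, 0.0], [0.0, 0.0])
separated = relativistic_d_loss([3.0, 4.0], [-3.0, -4.0])
```

When the discriminator cannot tell the two sets apart (all raw scores equal), each term evaluates to log 2; the loss shrinks as real scores rise above the fake average and fake scores fall below the real average.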
Decoder. We exploit an upsampling layer to extend the spatial size of feature maps as shown in Fig. 2. To make full use of features, we concatenate the previous outputs, which have the same spatial size as current ones. As shown in Fig. 2, the feature maps at the end of the decoder have the same spatial size as input images. Finally, we use the sigmoid function to produce a score map that provides pixel-wise discrimination between real and fake pixels of an input image. The loss function is defined as:
where $M_r$ and $M_f$ represent the score maps of the HR image and the generated SR image, respectively.
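One way to train such a score map is a per-pixel binary cross-entropy that pushes the scores of HR pixels toward 1 and those of SR pixels toward 0. The sketch below assumes this form, which may differ from the exact loss used in the paper; `score_map_d_loss` and the `eps` guard are illustrative names, and score maps are represented as nested lists of sigmoid outputs.

```python
import math


def score_map_d_loss(real_map, fake_map, eps=1e-12):
    """Assumed per-pixel cross-entropy on sigmoid score maps."""
    def mean_log(score_map, transform):
        # Average log-likelihood over every pixel of the map.
        values = [math.log(transform(p) + eps) for row in score_map for p in row]
        return sum(values) / len(values)

    # HR pixels should be scored as real (close to 1) ...
    real_term = -mean_log(real_map, lambda p: p)
    # ... and SR pixels as fake (close to 0).
    fake_term = -mean_log(fake_map, lambda p: 1.0 - p)
    return real_term + fake_term


undecided = score_map_d_loss([[0.5, 0.5]], [[0.5, 0.5]])
confident = score_map_d_loss([[0.9, 0.9]], [[0.1, 0.1]])
```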
The score map generated by the Unet-like discriminator gives pixel-wise discrimination scores of an input image, with values in $[0, 1]$. The higher the score, the closer the pixel is to a real image. In this manner, the score map can indicate which parts of an image are more difficult to generate and which parts are easier. For instance, the structured background of an image is often simpler to generate, and we expect the discriminator to convey this to the generator when the generator is updated. In other words, the parts with lower scores (more difficult to generate) should receive more attention when updating the generator. As a result, we incorporate the score map as a fine-grained attention mechanism into the loss function, which without attention is defined as:
where $G$ represents the function of the generator, $\theta$ denotes the parameters of the generator, and $I_i$ means the $i$-th image. This function treats every position in the image equally. Our proposed fine-grained attention loss function is the following weighted function:
where $M_f$ is the score map of the generated image given by the discriminator.
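The weighting idea can be sketched as follows, assuming an L1 pixel loss and a per-pixel weight of one minus the score, so that low-scoring (hard) pixels count more. Both choices are assumptions made for illustration, since only the verbal description is available here; the helper names are hypothetical and images are nested lists of pixel values.

```python
def l1_loss(sr, hr):
    """Plain L1 loss: every pixel contributes equally."""
    diffs = [abs(s - h) for row_s, row_h in zip(sr, hr) for s, h in zip(row_s, row_h)]
    return sum(diffs) / len(diffs)


def attention_l1_loss(sr, hr, fake_score_map):
    """L1 loss weighted by (1 - score): hard pixels (low score) weigh more."""
    weighted = [
        (1.0 - m) * abs(s - h)
        for row_s, row_h, row_m in zip(sr, hr, fake_score_map)
        for s, h, m in zip(row_s, row_h, row_m)
    ]
    return sum(weighted) / len(weighted)


sr = [[0.2, 0.8]]
hr = [[0.0, 1.0]]
score_map = [[0.9, 0.1]]  # first pixel looks real, second looks fake
plain = l1_loss(sr, hr)
weighted = attention_l1_loss(sr, hr, score_map)
```

Both pixels have the same absolute error here, but the second one, which the discriminator scored as fake, contributes nine times more to the weighted loss. With an all-zero score map the weighted loss reduces to the plain L1 loss.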
Our fine-grained attention mechanism is general and can be applied to various GAN-based methods. In this paper, we use a stack of Residual-in-Residual Dense Blocks (RRDBs), the basic building block of ESRGAN, to define our generator. The training objective of our generator consists of several losses, described below:
Perceptual Loss. The perceptual loss  aims to make the SR image close to the corresponding HR image based on high-level features extracted from a pre-trained network. We consider both the SR and HR images as the input to the pre-trained VGG19 and extract the VGG19-54 layer features. The perceptual loss is defined as:
where $\phi$ is the function of VGG, $I_i$ is the $i$-th image, and $G(\cdot)$ is the function of the generator.
The discriminator contains two outputs, a whole estimation of the entire image and the pixel-wise fine-grained estimations of an input image. The total adversarial loss function for the generator is defined as:
As shown in Eq. 2, the discriminator tries to distinguish the real and fake images in a fine-grained way, while the generator aims to fool the discriminator. Thus the fine-grained loss function of the generator is the symmetrical form of Eq. 2:
The adversarial loss of the generator on the whole-image score is likewise the symmetrical form of Eq. 1 and is defined as:
Combining the above losses and the attention loss, the total loss of the generator is:
where the three coefficients balance the different loss terms.
In standard GANs, the generator and the discriminator are usually defined as two independent networks. In our problem, we observe that the shallow parts of these two networks always extract local textures such as edges and corners. To reflect this, we propose a new network structure (Fs-SRGAN) that shares the shallow-feature-extraction parts of the generator and the discriminator. Consequently, our Fs-SRGAN contains three parts: a shared feature extractor, a generator, and a discriminator, as shown in Fig. 3.
The feature-sharing mechanism allows the generator and the discriminator to jointly optimize the shallow feature extractor. Similar to FASRGAN, we adopt the RRDB, the basic block of ESRGAN, as the basic structure. The shared feature extractor contains RRDBs to extract helpful feature maps for both the generator and the discriminator, described as follows:
$$F_{shared} = H_{shared}(x),$$
where $F_{shared}$ denotes the shallow shared feature maps extracted by the shared part, $H_{shared}$ represents the function of the shared feature extractor, and $x$ is the input. For the generator, the input is an LR image, while for the discriminator it is the SR image or the HR image. The input sizes of the generator and the discriminator are therefore different. We apply a fully convolutional network that keeps the spatial size of the feature maps unchanged, so the different input sizes do not matter.
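The claim that a fully convolutional extractor is indifferent to input size can be checked with a toy example. The sketch below applies one zero-padded, stride-1 3x3 filter (a hypothetical stand-in for the shared RRDB-based extractor, with made-up weights) to a small and a large single-channel input; in both cases the output keeps the spatial size of its input, and the same weights serve both.

```python
def conv3x3_same(image, kernel):
    """3x3 filter with zero padding and stride 1 (cross-correlation, as in
    deep-learning frameworks); the output has the same size as the input."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    # Positions outside the image count as zeros (padding).
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


# Hypothetical shared weights: a Laplacian-like edge filter.
shared_kernel = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]

lr_input = [[float(x + y) for x in range(4)] for y in range(4)]    # "LR" sized
hr_input = [[float(x + y) for x in range(16)] for y in range(16)]  # "HR/SR" sized

lr_features = conv3x3_same(lr_input, shared_kernel)
hr_features = conv3x3_same(hr_input, shared_kernel)
```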
The remaining parts of the generator and the discriminator are the same as those in standard GAN-based methods, except that the inputs are feature maps instead of images, as shown in Fig. 3.
Generator. The generator in SR generally contains three parts: shallow feature extraction, deep feature extraction, and reconstruction. Similar to the shared shallow feature extraction, we adopt the RRDB as the basic part of deep feature extraction, except that more RRDBs are used to increase the depth of the network in order to extract more high-frequency features for reconstruction. The reconstruction part utilizes an upsampling layer to upscale the feature maps and a Conv layer to reconstruct an SR image. The loss function of the generator includes the adversarial loss, the pixel-based loss, and the perceptual loss, similar to Eq. 9, except that there is no attention loss.
Discriminator. Because the discriminator is a classification network that distinguishes the SR and HR images, we apply a structure similar to the VGG network as the discriminator. To avoid information loss, we replace the pooling layer with a Conv layer with a stride of 2 to decrease the size of the feature map. At the tail of the discriminator, we use a Conv layer to transform the feature map into a one-dimensional vector, then use two fully connected layers to output a classification score in $[0, 1]$. A score closer to 1 means more real, otherwise more fake. The loss function of the discriminator is defined as follows:
where $D$ is the discriminator function and $G$ is the function of the generator.
In this section, we first describe our model training details, then provide quantitative and visual comparisons with several state-of-the-art methods on benchmark datasets for our two proposed novel methods, FASRGAN and Fs-SRGAN. We further combine the fine-grained attention and the feature-sharing mechanisms into one single model, termed FAFs-SRGAN.
The DIV2K dataset was proposed in the NTIRE 2017 Challenge on Single Image Super-Resolution and has been widely used in previous SR methods. It contains 800, 100, and 100 images of 2K resolution for training, validation, and testing, respectively. We use the training set of DIV2K for training. The LR images are obtained by bicubic downsampling (BI) from the source high-resolution images. For testing, we also use five standard benchmark datasets: Set5, Set14, BSD100, Urban100, and Manga109. Blau et al. showed that perceptual quality does not always improve as the PSNR value increases and revealed the trade-off between average distortion and perceptual quality. We adopt the perceptual index (PI) and root mean square error (RMSE) as our quantitative measurements, where PI measures the perceptual quality of the SR image and RMSE measures the reconstruction error between the HR image and the SR image. For both PI and RMSE, lower values mean better results.
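For reference, RMSE and its relation to PSNR can be computed as below. This is a minimal sketch on nested lists of pixel values, assuming an 8-bit peak value of 255; the function names are illustrative. PI is not reproduced here because it depends on separate no-reference quality estimators.

```python
import math


def rmse(sr, hr):
    """Root mean square error between two images of equal size."""
    squared = [(s - h) ** 2 for row_s, row_h in zip(sr, hr) for s, h in zip(row_s, row_h)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(sr, hr, peak=255.0):
    """PSNR in dB, derived from RMSE; infinite for identical images."""
    error = rmse(sr, hr)
    return float("inf") if error == 0 else 20.0 * math.log10(peak / error)
```

Because PSNR is a monotonically decreasing function of RMSE, a method with lower RMSE always has higher PSNR on the same image, which is why RMSE alone suffices as the distortion axis.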
In training, images are augmented by rotating and flipping. The batch size is set to 16. Our methods are trained based on image patches and optimized with the ADAM optimizer. The hyperparameters $\beta_1$ and $\beta_2$ in the ADAM optimizer are set to and . We randomly crop patches from LR images as the input of the network. The generator is pre-trained by the loss function. Following [6, 22, 20, 9, 29], the initial learning rate is set to , and then decays to half every iterations. In Fs-SRGAN, we set the number of RRDBs in the shared feature extractor as . We implement our models with the PyTorch framework on a Titan Xp GPU.
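The step schedule described above (halve the learning rate every fixed number of iterations) can be written as a short function. The numeric values in the usage lines are placeholders chosen for illustration, not the settings used in the paper, and `halving_lr` is a hypothetical name.

```python
def halving_lr(initial_lr, step, iteration):
    """Learning rate after `iteration` updates when it halves every `step` updates."""
    return initial_lr * 0.5 ** (iteration // step)


# Placeholder values for illustration only (not taken from the paper).
example_initial_lr = 1e-4
example_step = 50_000
schedule = [halving_lr(example_initial_lr, example_step, it)
            for it in (0, 49_999, 50_000, 100_000)]
```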
We first present the quantitative comparisons between our methods and the state-of-the-art methods. The results are shown in Fig. 4. These methods can be roughly divided into two categories: the top-left and the bottom-right. Methods in the top-left part are mostly MSE-based, with low RMSE and high PI scores due to over-smoothed edges and a lack of high-frequency details. The bottom-right category includes the GAN-based methods, such as SRGAN, EnhanceNet, ESRGAN, and our methods. These methods usually produce images of high visual quality even though their RMSE values are larger than those of the MSE-based methods. Among these methods, our FASRGAN and Fs-SRGAN achieve better visual quality and lower reconstruction error. As shown in Fig. 4, FAFs-SRGAN attains the best reconstruction accuracy among all the GAN-based methods. FAFs-SRGAN also achieves a lower PI value. Fig. 5 plots the curves of PI values during the training of our proposed methods on Set14. We observe that the training process of FASRGAN is more stable and obtains better perceptual quality. The average PI value of Fs-SRGAN is higher than that of FASRGAN. As mentioned above, Fs-SRGAN contains fewer RRDBs than FASRGAN. We speculate that the smaller number of RRDBs causes the higher PI values. FAFs-SRGAN, which adds the fine-grained attention mechanism to Fs-SRGAN, obtains lower PI values than Fs-SRGAN.
We compare our final models on several public benchmark datasets with the state-of-the-art MSE-based methods: SRCNN , FSRCNN , EDSR , SRMDNF , RDN , and GAN-based approaches: SRGAN , EnhanceNet , ESRGAN . We conduct comparisons with our two methods respectively.
Some representative qualitative results are presented in Fig. 6. PSNR (evaluated on the luminance channel in YCbCr color space) and the perceptual index used in the 2018 PIRM-SR Challenge are also provided for reference.
As shown in Fig. 6, our proposed FASRGAN outperforms previous methods by a large margin. Images generated by FASRGAN contain more fine-grained textures and details. For example, for image '0801' of DIV2K, MSE-based methods tend to generate blurry results, while results from SRGAN and EnhanceNet tend to be noisy, and results from ESRGAN have blurry and over-smoothed edges. FASRGAN produces sharper and more natural textures of the penguin beak. The cropped parts of images '0828' and 'YumeirCooking' are full of textures. All the compared MSE-based methods suffer from heavy blurring artifacts, failing to recover the structure and the gaps of the stripes. SRGAN, EnhanceNet, and ESRGAN generate high-frequency noise and wrong textures, while our FASRGAN recovers them more correctly, producing more faithful results that are closer to the HR images. For image 'img_093' in Urban100, the cropped parts generated by the compared methods contain heavy blurring artifacts and lines with wrong directions. By contrast, our FASRGAN better alleviates the artifacts and recovers the zebra crossing with the correct structure. These comparisons demonstrate the strong ability of FASRGAN to produce more photo-realistic and high-quality super-resolution images.
We further compare our Fs-SRGAN with the state-of-the-art methods in Fig. 7. Our Fs-SRGAN produces SR images with better sharpness and details than the other methods. For image 'baboon', the cropped parts generated by the MSE-based methods are over-smoothed. Previous GAN-based methods not only fail to produce clear whiskers but also introduce a lot of unpleasant noise. ESRGAN generates too many whiskers, which do not appear in the original HR image. Our Fs-SRGAN produces more correct whiskers. For images '0812' and 'img_069', MSE-based methods still suffer from heavy blurring artifacts and generate unnatural results. GAN-based methods cannot maintain the structures of the stairs or the train tracks. Our proposed Fs-SRGAN outperforms the compared methods and produces images closer to the original HR images. For image '0879', our Fs-SRGAN recovers tiny window textures that look more natural, while previous methods still have difficulty producing high-quality SR images. This also indicates that sharing the shallow feature extractor between the generator and the discriminator is beneficial.
Table I. Columns: Evaluation, Top-1 error, Top-5 error.
To further demonstrate the quality of our generated SR images, we treat them as a pre-processing step for other high-level computer vision tasks such as object recognition, image classification and so on. In this section, we use the same setting as EnhanceNet and evaluate the object recognition performance with the generated images by our methods and other state-of-the-art methods: SRCNN, FSRCNN , SRGAN , EnhanceNet .
We use a ResNet-50 pre-trained on ImageNet as the evaluation model and take the first 1000 images in the ImageNet CLS-LOC validation dataset for evaluation. The test images are first down-sampled by bicubic interpolation and then upscaled by our methods and the compared methods. These SR images are then used as inputs to the ResNet-50 model to calculate their Top-1 and Top-5 errors. As shown in Table I, both of our proposed methods and the variant FAFs-SRGAN achieve better accuracy than the state-of-the-art methods. Among these three methods, FASRGAN achieves the lowest Top-1 and Top-5 errors, demonstrating the effectiveness of both the fine-grained attention and the feature-sharing mechanisms.
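Top-1 and Top-5 errors can be computed from per-class scores as sketched below. The function name `top_k_error` and the toy scores are illustrative; in the experiment the scores would come from the pre-trained classifier applied to the SR images.

```python
def top_k_error(scores_per_image, labels, k):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example with three images and three classes.
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_error(toy_scores, toy_labels, k=1)
top2 = top_k_error(toy_scores, toy_labels, k=2)
```

The Top-k error can only decrease or stay the same as k grows, since a larger k gives the classifier more chances to include the true label.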
Table II. Columns: Evaluation, Top-1 error, Top-5 error.
To further illustrate the effectiveness of our models, we use another model pre-trained on ImageNet, VGG-19. The experimental setting is the same as that of ResNet-50, and the results are shown in Table II. Among the SR methods, our models still obtain better results, and Fs-SRGAN obtains the lowest Top-1 and Top-5 errors.
In order to study the effects of the two mechanisms in the proposed methods, we conduct ablation experiments by removing each mechanism in turn and examining the differences. The overall visual comparisons are illustrated in Fig. 8 and Fig. 9. A detailed discussion is provided as follows.
We first remove the fine-grained attention (FA) mechanism from FASRGAN. An obvious performance decrease can be observed in Fig. 8. For image 'img_009', the model without the FA mechanism introduces some unnatural noise and undesired edges, while FASRGAN maintains the structure and produces high-quality SR images. For image 'img_004', the FA mechanism reduces the deformation that appears in the image produced by the model without the FA mechanism. The images generated by FASRGAN are closer to the original HR images. The visual analysis indicates the effectiveness and benefit of the FA mechanism in removing unpleasant and unnatural artifacts.
Fig. 9 shows the results of removing the feature-sharing (Fs) mechanism and using two independent networks as the generator and the discriminator. We can observe that Fs-SRGAN outperforms SRGAN and the model without the Fs mechanism by a large margin. Removing the feature-sharing mechanism tends to introduce unpleasant artifacts. For image 'zebra', by employing the Fs mechanism, Fs-SRGAN alleviates heavy artifacts and noise. For image 'img_083', characters in the cropped image generated by Fs-SRGAN are clearer and more recognizable thanks to the Fs mechanism.
To further study the effect of the depth of the shared feature extractor in Fs-SRGAN, we vary the number of RRDBs in both the shared shallow feature extractor and the deep feature extractor. As shown in Fig. 10, increasing the number of RRDBs in the shared part from 1 to 2 decreases the performance, as reflected in the reconstruction accuracy. However, when the number is increased to 5, the resulting model, E5G12, performs better in visual quality.
We propose two GAN-based models, FASRGAN and Fs-SRGAN, for SISR to overcome the limitations of existing methods. FASRGAN introduces a fine-grained attention mechanism into the GAN framework, where the discriminator has two outputs: a measure of the quality of the overall input and a fine-grained attention estimate for the input. The fine-grained attention delivers fine-grained supervision to the generator to ensure the generation of pixel-wise photo-realistic images. Fs-SRGAN shares the shallow feature extractor between the generator and the discriminator, reducing the number of parameters and improving the reconstruction performance. These two mechanisms are general and could be applied to other GAN-based SR models. Comparisons with other state-of-the-art methods on benchmark datasets demonstrate the effectiveness of our proposed methods.