Deep Learning-based Image Super-Resolution Considering Quantitative and Perceptual Quality

09/13/2018 · Jun-Ho Choi, et al. · Yonsei University

Recently, it has been shown that in super-resolution there exists a tradeoff between the quantitative and perceptual quality of super-resolved images, which correspond to the similarity to the ground-truth images and the naturalness, respectively. In this paper, we propose a novel super-resolution method that can improve the perceptual quality of the upscaled images while preserving the conventional quantitative performance. The proposed method employs a deep network for multi-pass upscaling together with a discriminator network and two qualitative score predictor networks. Experimental results demonstrate that the proposed method achieves a good balance of the quantitative and perceptual quality, showing more satisfactory results than existing methods.



1 Introduction

Single-image super-resolution, which is the task of increasing the spatial resolution of low-resolution images, has been widely studied in recent decades. One simple solution is to employ interpolation methods such as nearest-neighbor and bicubic upsampling. However, their outputs are largely blurry because fine details of the images cannot be recovered. Therefore, many researchers have investigated how to effectively restore high-frequency details. Nevertheless, the task remains highly challenging due to the lack of information in the low-resolution images, i.e., it is an ill-posed problem [17].

Until the mid-2010s, feature extraction-based methods were proposed, including sparse coding [37], neighbor embedding [18], and Bayes forest [29]. After that, the emergence of deep learning for visual representation [7], which was triggered by an image classification challenge (i.e., ImageNet) [16], also flowed into the field of super-resolution [38]. For instance, the super-resolution convolutional neural network (SRCNN) model proposed by Dong et al. [5] introduced convolutional layers and showed better performance than the previous methods.

Figure 1: Example results obtained for an image of the PIRM dataset [1]. (a) Ground-truth. (b) Upscaled by bicubic interpolation. (c) Upscaled without perceptual consideration. (d) Upscaled with perceptual consideration.

To build a deep learning-based super-resolution model, it is necessary to define loss functions as the training objectives. Loss functions measuring pixel-by-pixel differences between the ground-truth and upscaled images are frequently considered, including the mean squared error and mean absolute error [38]. They mainly aim at guaranteeing the quantitative fidelity of the obtained images, which can be evaluated by quantitative quality measures such as the peak signal-to-noise ratio (PSNR), root mean squared error (RMSE), and structural similarity (SSIM) [36]. Figure 1 (c) shows an example image generated by a deep learning-based super-resolution model, enhanced upscaling super-resolution (EUSR) [14], from the downscaled version of Figure 1 (a). Compared to the image upscaled by bicubic interpolation shown in Figure 1 (b), the image generated by the deep learning-based method follows the overall appearance of the original image with sharper boundaries of the objects and scenery.
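To make these quantities concrete, the following minimal TensorFlow sketch (not the authors' code) shows the two common pixel-wise objectives and the built-in PSNR/SSIM measures; the random tensors merely stand in for a ground-truth/upscaled image pair.

```python
import tensorflow as tf

def l1_loss(hr, sr):
    # Mean absolute error between ground-truth (hr) and upscaled (sr) images.
    return tf.reduce_mean(tf.abs(hr - sr))

def l2_loss(hr, sr):
    # Mean squared error, the other commonly used pixel-wise objective.
    return tf.reduce_mean(tf.square(hr - sr))

# Quantitative quality metrics on images with pixel values in [0, 255].
hr = tf.random.uniform((1, 96, 96, 3), maxval=255.0)
sr = tf.random.uniform((1, 96, 96, 3), maxval=255.0)
print(float(l1_loss(hr, sr)), float(l2_loss(hr, sr)))
print(float(tf.image.psnr(hr, sr, max_val=255.0)[0]))
print(float(tf.image.ssim(hr, sr, max_val=255.0)[0]))
```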

Although existing methods based on minimizing pixel-by-pixel differences achieve great performance in a quantitative viewpoint, they do not ensure naturalness of the output images. For example, fine details of trees and houses are not sufficiently recovered in Figure 1 (c). To improve the naturalness of the images, two approaches have been proposed in the literature: using generative adversarial networks (GANs) [6] and employing intermediate features of the common image classification network models. For example, Ledig et al. [17] proposed a super-resolution model named SRGAN, which employs a discriminator network and trains the model to minimize differences of the intermediate features of VGG19 [31] when the ground-truth and upscaled images are inputted. It is known that these methods enhance perceptual performance significantly [4]. Here, the perceptual performance can be measured by the metrics for visual quality assessment such as blind/referenceless image spatial quality evaluator (BRISQUE) [24] and naturalness image quality evaluator (NIQE) [25].

However, two issues remain unresolved in these approaches. First, although they improve the naturalness of the images, perceptual quality is only indirectly considered, and thus the improvement may be limited. The network models used for extracting intermediate features are designed for image classification tasks; thus, forcing the features to be similar does not guarantee perceptually improved results. In addition, it is possible that the discriminator network learns criteria that differentiate generated images from real ones but are not related to perceptual aspects. For instance, when the trained discriminator relies on just finding high-frequency components, the super-resolution model may add unexpected textures in low-frequency regions such as the ground and sky.

Second, these approaches tend to sacrifice a large amount of quantitative quality. For example, the SRGAN-based models achieve better perceptual performance than the other models in terms of BRISQUE and NIQE, but they record worse quantitative quality, showing larger RMSE values [4]. Since the primary objective of the super-resolution task is to make the upscaled images identical to the ground-truth high-resolution images, it is necessary to properly regularize the upscaling modules to keep a balance between the quantitative and perceptual quality.

In this paper, we propose a novel super-resolution method named “Four-pass perceptual super-resolution with enhanced upscaling (4PP-EUSR),” which is based on the recently proposed EUSR model [14]. Our model aims at resolving the aforementioned issues in two ways. First, our model employs so-called “multi-pass upscaling” during the training phase, where multiple upscaled images, produced by passing the given low-resolution image through the multiple upscaling paths of our model, are used in order to consider various possible characteristics of upscaled images. Second, we employ qualitative score predictors, which directly evaluate the aesthetic and subjective quality scores of the upscaled images. This architecture ensures high perceptual quality while preserving the quantitative performance of the upscaled images, as exemplified in Figure 1 (d).

The rest of the paper is organized as follows. First, we provide a brief review of the related work in Section 2. Then, an overview of the proposed method is given in Section 3, including the base deep learning model, multi-pass upscaling for training, structure of the discriminator, and structures of the qualitative score predictors. We explain training procedures of our model with the employed loss functions in Section 4. In-depth experimental analysis of our results is shown in Section 5. Finally, we conclude our work in Section 6.

2 Related work

In this section, we review the related work of deep learning-based super-resolution in two branches: super-resolution models without and with consideration of naturalness.

2.1 Deep learning-based super-resolution

One of the earliest super-resolution models based on deep learning is SRCNN, which was proposed by Dong et al. [5]. The model takes an image upscaled by bicubic interpolation and enhances it via a few convolutional layers. Kim et al. proposed the very deep super-resolution (VDSR) model [13], which consists of 20 convolutional layers. In recent years, residual blocks having shortcut connections [9] have become common in super-resolution models. For example, Ledig et al. [17] proposed a model named SRResNet, which contains 16 residual blocks with batch normalization [11] and parametric ReLU activation [8]. Lim et al. [19] developed two super-resolution models for the NTIRE 2017 single-image super-resolution challenge [34]: the enhanced deep super-resolution (EDSR) model for single-scale super-resolution and the multi-scale deep super-resolution (MDSR) model for multi-scale super-resolution. They found that removing batch normalization and blending outputs generated from geometrically transformed inputs help improve the overall quantitative quality. Recently, Kim and Lee [14] suggested a multi-scale super-resolution method named EUSR, which consists of so-called “enhanced upscaling modules” and performed well in the NTIRE 2018 single-image super-resolution challenge [35].

2.2 Super-resolution considering naturalness

Along with ensuring high quantitative quality in terms of PSNR, RMSE, or SSIM, the naturalness of the upscaled images, which can be measured by quality metrics such as BRISQUE and NIQE, has also been considered in some studies. There are two common approaches: employing GANs [6] and employing image classifiers. In the former approach, the discriminator network tries to distinguish the ground-truth images from the upscaled images, and the super-resolution model is trained to fool the discriminator so that it cannot distinguish the upscaled images properly. When an image classifier is used, the super-resolution model is trained to minimize the difference of the features obtained at intermediate layers of the classifier for the ground-truth and upscaled images. For example, Johnson et al. [12] used the trained VGG16 network to extract intermediate features and regarded the squared Euclidean distance between them as the loss function. Ledig et al. [17] employed an adversarial network and differences of the features obtained from the trained VGG19 network to calculate the losses of their super-resolution model (i.e., SRResNet); the resulting model is named SRGAN. Mechrez et al. [22] defined the so-called “contextual loss,” which compares the statistical distributions of intermediate features obtained from the trained VGG19 model, to train their super-resolution model. These models focus on ensuring naturalness of the upscaled images but tend to sacrifice a large amount of the quantitative quality [4].

3 Overview of the proposed method

Figure 2: Overview of the proposed method. First, our super-resolution model (Section 3.1) generates three upscaled images via multi-pass upscaling (Section 3.2). The discriminator tries to differentiate the upscaled images from the ground-truth (Section 3.3). The two qualitative score predictors measure the aesthetic and subjective quality scores, respectively (Section 3.4). The outputs of the discriminator and the score predictors are used to update the super-resolution model.

The architecture of the proposed method can be decomposed into four components (Figure 2): a multi-scale upscaling model, the multi-pass employment of this model during training, a discriminator, and qualitative score predictors.

3.1 Enhanced upscaling super-resolution

Figure 3: Structure of the EUSR model [14].

The basic structure of our model comes from the EUSR model [14], which is shown in Figure 3. It mainly consists of three parts: scale-aware feature extraction, shared feature extraction, and enhanced upscaling. First, the scale-aware feature extraction part extracts low-level features from the input image by using so-called “local residual blocks.” Then, a residual module in the shared feature extraction part, which consists of local residual blocks and a convolutional layer, extracts higher-level features regardless of the scale factor. Finally, the resulting features are upscaled via “enhanced upscaling modules,” where each module increases the spatial resolution of the input by a factor of 2. Thus, the ×2, ×4, and ×8 upscaling paths have one, two, and three enhanced upscaling modules, respectively. The configurable parameters of the EUSR model are the number of output channels of the first convolutional layer, the number of local residual blocks in the shared feature extraction part, and the number of local residual blocks in the enhanced upscaling modules. We consider EUSR as our base upscaling model because it is one of the state-of-the-art approaches supporting multi-scale super-resolution, which enables generating multiple upscaled images from a single model.
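The following Keras sketch illustrates the general idea of a local residual block and an enhanced upscaling module that doubles the resolution via pixel shuffling; the block counts, filter widths, and the simplified scale-aware branch are assumptions for illustration, and the exact EUSR configuration should be taken from [14].

```python
import tensorflow as tf
from tensorflow.keras import layers

def local_residual_block(x, filters=64):
    # Conv-ReLU-Conv with a shortcut connection.
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    return layers.Add()([x, y])

def enhanced_upscaling_module(x, filters=64, num_blocks=1):
    # Each module doubles the spatial resolution: residual block(s),
    # channel expansion, and pixel shuffling (depth-to-space).
    for _ in range(num_blocks):
        x = local_residual_block(x, filters)
    x = layers.Conv2D(filters * 4, 3, padding='same')(x)
    return layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)

# A toy x4 path: 48x48 input -> 192x192 output.
inp = layers.Input((48, 48, 3))
feat = layers.Conv2D(64, 3, padding='same')(inp)   # scale-aware feature extraction (simplified)
for _ in range(4):                                 # shared feature extraction (simplified)
    feat = local_residual_block(feat)
x = enhanced_upscaling_module(feat)                # x2
x = enhanced_upscaling_module(x)                   # x4 (a x8 path would stack three modules)
out = layers.Conv2D(3, 3, padding='same')(x)
toy_eusr_x4 = tf.keras.Model(inp, out)
```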

3.2 Multi-pass upscaling

The original EUSR model supports multi-scale super-resolution by factors of 2, 4, and 8. During the training phase, our model utilizes all these upscaling paths to produce three output images, all with the same effective upscaling factor of 4 for a given image, as follows (Figure 4). The first one is directly generated from the ×4 path. The second one is generated by passing the given image through the ×2 path two times. The third one is generated via bicubic downscaling of the image obtained from the ×8 path by a factor of 2. Thus, the model is employed four times for each input image.

Figure 4: Multi-pass upscaling process, which produces three upscaled images by a factor of 4 from a shared pre-trained EUSR model.

This enables the model to handle various upscaling scenarios. The model has to learn to reduce artifacts that may occur during direct upscaling via the ×4 path, two-pass upscaling via the ×2 path, and upscaling via the ×8 path followed by downscaling. This prevents the model from overfitting to specific patterns.
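A minimal sketch of this multi-pass procedure is given below; `model(x, scale=s)` is a hypothetical wrapper around the three upscaling paths of the shared pre-trained network.

```python
import tensorflow as tf

def multi_pass_x4(model, lr_image):
    # Three x4 outputs from one multi-scale model; `model(x, scale=s)` is a
    # hypothetical wrapper around the x2/x4/x8 paths of the shared network.
    out_direct = model(lr_image, scale=4)                    # x4 path, one pass
    out_twopass = model(model(lr_image, scale=2), scale=2)   # x2 path, two passes
    out_x8 = model(lr_image, scale=8)                        # x8 path, one pass
    target = tf.shape(out_x8)[1:3] // 2                      # halve height and width
    out_downscaled = tf.image.resize(out_x8, target, method='bicubic')
    # The model is therefore employed four times per input image.
    return out_direct, out_twopass, out_downscaled
```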

3.3 Discriminator network

Figure 5: Structure of the discriminator network.

Our method employs a discriminator network during the training phase, which is designed to distinguish generated images from the ground-truth images. While the discriminator tries its best to identify the upscaled images, the super-resolution model is trained to make it difficult for the discriminator to differentiate them from the ground-truth images. This helps our upscaling model generate more natural images [17, 22]. Inspired by SRGAN [17], our discriminator network consists of several convolutional layers followed by LeakyReLU activations and two fully-connected layers, as shown in Figure 5. The final sigmoid activation outputs the probability that the input image is real rather than fake. Note that our discriminator network does not employ batch normalization [11], because the batch size is too small for it to be effective. In addition, it contains two more convolutional layers than the original SRGAN discriminator due to the different size of the input image patches.
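The sketch below illustrates a discriminator of this kind; the number of stages, the filter widths, and the LeakyReLU slope of 0.2 are assumptions for illustration and do not reproduce the exact configuration of Figure 5.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(input_size=192, base_filters=64):
    # Strided convolutions with LeakyReLU (no batch normalization),
    # followed by two fully-connected layers and a sigmoid output.
    inp = layers.Input((input_size, input_size, 3))
    x = inp
    for filters in [base_filters, base_filters * 2,
                    base_filters * 4, base_filters * 8]:
        x = layers.Conv2D(filters, 3, strides=1, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Conv2D(filters, 3, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU(0.2)(x)
    logit = layers.Dense(1)(x)                  # real/fake logit
    prob = layers.Activation('sigmoid')(logit)  # probability the input is real
    return tf.keras.Model(inp, prob)
```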

3.4 Qualitative score predictors

One of our main ideas for perceptually improved super-resolution is to utilize deep learning models that assess the perceptual quality of images, instead of general image classifiers. For this, we employ two deep networks that predict aesthetic and subjective quality scores of images, respectively. To build the networks, we utilize the neural image assessment (NIMA) approach [33], which predicts the quality score of a given image. This approach replaces the last layer of a well-known image classifier such as VGG [31] or Inception-v3 [32] with a fully-connected layer with softmax activation, which produces probabilities of 10 score classes. In our implementation, MobileNetV2 [30] is used as the base image classifier, because it is much faster than the other image classifiers and supports various sizes of input images.
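A minimal sketch of such a NIMA-style predictor is shown below, assuming a Keras MobileNetV2 backbone with ImageNet weights; the helper `mean_score` converts a predicted distribution into a single score.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_score_predictor(input_size=192):
    # MobileNetV2 backbone (ImageNet weights, width multiplier 1.0) with the
    # classification head replaced by a 10-way softmax over score classes.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3),
        include_top=False, weights='imagenet', alpha=1.0)
    x = layers.GlobalAveragePooling2D()(base.output)     # 1,280-dimensional representation
    scores = layers.Dense(10, activation='softmax')(x)   # probabilities of score classes 1..10
    return tf.keras.Model(base.input, scores)

def mean_score(score_distribution):
    # Expected score from a predicted distribution over classes 1..10.
    return tf.reduce_sum(score_distribution * tf.range(1.0, 11.0), axis=-1)
```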

We build two score predictors: one for predicting aesthetic scores and the other for predicting subjective scores. For the aesthetic score predictor, we employ the AVA dataset [26], which contains aesthetic user ratings of the images shared on DPChallenge (http://www.dpchallenge.com). For the subjective score predictor, we use the TID2013 dataset [27], which consists of subjective quality evaluation results for test images degraded by various distortion types (e.g., compression, noise, and blurring). While the AVA dataset provides exact score distributions, the TID2013 dataset only provides the mean and standard deviation of the scores. Therefore, we approximate a Gaussian distribution with the given mean and standard deviation to train the network on TID2013. In addition, we linearly adjust the score range of the TID2013 dataset to match that of the AVA dataset. After training the predictors, we use only the mean values of the predicted score distributions to enhance the perceptual quality of the upscaled images.
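The following sketch shows one way to discretize a Gaussian with a given mean and standard deviation into a 10-class score distribution; the exact discretization and rescaling used in the paper are not specified here, so the bin edges below are an assumption.

```python
import numpy as np
from math import erf, sqrt

def gaussian_score_distribution(mean, std, num_bins=10):
    # Discretize N(mean, std^2) into probabilities of integer score
    # classes 1..num_bins using bin edges 0.5, 1.5, ..., num_bins + 0.5.
    def cdf(x):
        return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0) + 1e-8)))
    edges = np.arange(0.5, num_bins + 1.0)
    probs = np.diff([cdf(e) for e in edges])
    return probs / probs.sum()   # renormalize the truncated tails

# Example: a TID2013 rating already rescaled to the 1-10 range of AVA.
print(gaussian_score_distribution(mean=6.2, std=1.1))
```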

4 Training details

We train our model in three phases: pre-training the EUSR model, building the qualitative score predictors, and training the EUSR model in a perceptual manner. Our method is implemented on the TensorFlow framework [2].

4.1 Pre-training multi-scale super-resolution model

In our method, we employ 32 local residual blocks in the residual module of the EUSR model and one local residual block in each enhanced upscaling module. The EUSR model is first pre-trained with the training set of the DIV2K dataset [35] (i.e., 800 images) using the L1 reconstruction loss as in [14]. For each training step, 16 image patches having a size of 48×48 pixels are obtained by randomly cropping the training images. Then, one of the upscaling paths (i.e., ×2, ×4, or ×8) is randomly selected and trained at that step. For instance, when the ×2 path is selected, the parameters of that path are trained to generate upscaled images having a size of 96×96 pixels. The Adam optimization method [15] is used to update the parameters. A total of 1,000,000 training steps are executed, and the learning rate is reduced by a half every 200,000 steps.
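A simplified training-step sketch under these settings is given below; the initial learning rate is only a placeholder, `model(x, scale=s)` is a hypothetical wrapper around the multi-scale network, and the bicubic degradation stands in for the low-resolution DIV2K inputs.

```python
import tensorflow as tf

# Learning rate halved every 200,000 steps (the initial value below is a
# placeholder; the paper's exact value is not reproduced here).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=200000, decay_rate=0.5,
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

def pretrain_step(model, hr_image, patch_size=48):
    # One pre-training step on a single HR image: random crop, random
    # scale among x2/x4/x8, bicubic degradation as a stand-in for the
    # DIV2K LR input, and the L1 reconstruction loss.
    scale = int(tf.random.shuffle([2, 4, 8])[0])
    hr_patch = tf.image.random_crop(
        hr_image, (patch_size * scale, patch_size * scale, 3))
    hr_patch = hr_patch[tf.newaxis]                                # add batch axis
    lr_patch = tf.image.resize(hr_patch, (patch_size, patch_size),
                               method='bicubic')
    with tf.GradientTape() as tape:
        sr_patch = model(lr_patch, scale=scale)                    # hypothetical wrapper
        loss = tf.reduce_mean(tf.abs(hr_patch - sr_patch))         # L1 loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```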

4.2 Training qualitative score predictors

Along with pre-training EUSR, we also train the qualitative score predictors explained in Section 3.4. As the base image classifier, we employ MobileNetV2 [30] pre-trained on the ImageNet dataset [28] with a width multiplier of 1. In the original procedure of training NIMA [33], the input image is rescaled to 256×256 pixels without considering the aspect ratio and then randomly cropped to 224×224 pixels, which is the input size of VGG19 [31] and Inception-v3 [32]. However, these rescaling and cropping processes are not considered in our case because the MobileNetV2 model does not limit the size of an input image. Instead, we set the input resolution of MobileNetV2 to 192×192 pixels, which is the output size of the 4PP-EUSR model for input patches having a size of 48×48 pixels. In addition, we do not employ the rescaling step and only employ the cropping step to make the input image have a size of 192×192 pixels, because the objective of our score predictors is to evaluate the quality of patches, not the whole given image.

As the loss function for training the qualitative score predictors, we employ the squared Earth mover's distance defined in [10] as

l_{EMD}(Y, \hat{Y}) = \frac{1}{N} \sum_{k=1}^{N} \left( \mathrm{CDF}_{p}(k) - \mathrm{CDF}_{\hat{p}}(k) \right)^2,    (1)

where Y and \hat{Y} are the ground-truth and upscaled images, respectively, p and \hat{p} are the probability distributions of the qualitative scores obtained from the predictor for the two images, respectively, and \mathrm{CDF}_{p}(k) is the k-th element of the cumulative distribution function of the input distribution over the N = 10 score classes. The Adam optimization method [15] is used to train the parameters.
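Equation (1) translates directly into a few lines of TensorFlow, as sketched below for two score distributions over N ordered classes.

```python
import tensorflow as tf

def squared_emd_loss(p_true, p_pred):
    # Squared Earth mover's distance between two score distributions over
    # N ordered classes (Eq. (1)): mean squared difference of their CDFs.
    cdf_true = tf.cumsum(p_true, axis=-1)
    cdf_pred = tf.cumsum(p_pred, axis=-1)
    return tf.reduce_mean(tf.square(cdf_true - cdf_pred), axis=-1)
```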

For the aesthetic score predictor, we use about 5,000 images of the AVA dataset [26] for validation and the remaining 250,000 images for training. We first train the new last fully-connected layer for five epochs with a batch size of 128 while freezing all other layers. Then, all layers are fine-tuned for five epochs with a batch size of 32. For the validation images cropped at the center, the predictor achieves an average squared Earth mover's distance of 0.075.

For the subjective score predictor, we use the first three reference images and their degraded versions in the TID2013 dataset [27] (corresponding to 360 score distributions) for validation and the remaining 22 reference images and their degraded versions (corresponding to 2,640 score distributions) for training. Similarly to the aesthetic score predictor, we first train the new last fully-connected layer for 100 epochs with a batch size of 128 while freezing all other layers. Then, the whole network is fine-tuned for 100 epochs with a batch size of 32. For the validation images cropped at the center, the predictor achieves a Spearman's rank correlation coefficient (SROCC) of 0.780.
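The two-stage procedure used for both predictors (train the new head first, then fine-tune everything) can be sketched as follows; the learning rates are placeholders, the datasets are assumed to be already-batched tf.data pipelines, and the epoch counts would be five for the aesthetic predictor and 100 for the subjective predictor, as described above.

```python
import tensorflow as tf

def train_head_then_finetune(predictor, head_data, finetune_data, loss_fn,
                             head_epochs=5, finetune_epochs=5):
    # Stage 1: train only the new last fully-connected layer.
    for layer in predictor.layers[:-1]:
        layer.trainable = False
    predictor.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # placeholder rate
                      loss=loss_fn)
    predictor.fit(head_data, epochs=head_epochs)

    # Stage 2: unfreeze everything and fine-tune the whole network,
    # typically with a smaller learning rate (also a placeholder here).
    for layer in predictor.layers:
        layer.trainable = True
    predictor.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=loss_fn)
    predictor.fit(finetune_data, epochs=finetune_epochs)
```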

4.3 Training super-resolution model

Finally, we fine-tune the pre-trained EUSR model together with the discriminator network using the two trained qualitative score predictors. At each training step, the 4PP-EUSR model outputs three upscaled images by a factor of 4. Then, the discriminator is trained to differentiate the ground-truth and upscaled images based on the sigmoid cross entropy loss as in [17]. After updating parameters of the discriminator, the 4PP-EUSR model is trained with six losses defined as follows.

  • Reconstruction loss (l_r). The reconstruction loss represents the main objective of the super-resolution task: each pixel value of the super-resolved image must be as close as possible to that of the ground-truth image. In our model, this loss is measured by the pixel-by-pixel L1 loss between the ground-truth and generated images, i.e.,

    l_r = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| Y_{(x,y)} - \hat{Y}_{(x,y)} \right|,    (2)

    where W and H are the width and height of the images, respectively, and Y_{(x,y)} and \hat{Y}_{(x,y)} are the pixel values at position (x,y) of the ground-truth and upscaled images, respectively.

  • Adversarial loss (l_a). The output of the discriminator network is used to train the super-resolution model towards enhancing perceptual quality, which is denoted as the adversarial loss. It is calculated from the sigmoid cross entropy of the logits obtained from the discriminator for the upscaled images [17]:

    l_a = -\log D(\hat{Y}),    (3)

    where D(\hat{Y}) is the output of the discriminator for the upscaled image \hat{Y}, which represents the probability that the given image is a real one.

  • Aesthetic score loss (l_as). We obtain the aesthetic scores of both the ground-truth and upscaled images from the trained aesthetic score predictor. Then, we define the aesthetic score loss as the weighted difference between the scores, i.e.,

    l_{as} = \frac{\eta_a \, s_a(Y) - s_a(\hat{Y})}{s_{\max}},    (4)

    where s_a(Y) and s_a(\hat{Y}) are the predicted aesthetic scores of the ground-truth and upscaled images, respectively, and s_{\max} is the maximum aesthetic score, which is 10 in our case. The weight \eta_a controls the expected level of aesthetic quality of the upscaled image; for example, \eta_a > 1 forces the model to generate an image that is even perceptually better than the ground-truth image. In our experiments, we set \eta_a to 0.8.

  • Aesthetic representation loss (l_ar). Inspired by [17], we also define the aesthetic representation loss, which is the L2 loss between the intermediate outputs of the “global average pooling” layer in the aesthetic score predictor for the ground-truth and upscaled images:

    l_{ar} = \frac{1}{M} \sum_{i=1}^{M} \left( f_a^{(i)}(Y) - f_a^{(i)}(\hat{Y}) \right)^2,    (5)

    where f_a^{(i)}(Y) and f_a^{(i)}(\hat{Y}) are the i-th values of the intermediate outputs for the ground-truth and upscaled images, respectively, and M is the length of each intermediate output, which is 1,280 [30].

  • Subjective score loss (l_ss). In the same manner as the aesthetic score loss, we calculate the subjective score loss using the trained subjective score predictor, i.e.,

    l_{ss} = \frac{\eta_s \, s_s(Y) - s_s(\hat{Y})}{s_{\max}},    (6)

    where s_s(Y) and s_s(\hat{Y}) are the predicted subjective scores of the ground-truth and upscaled images, respectively, and s_{\max} is the maximum subjective score, which is 10 in our case. Similarly to \eta_a, the weight \eta_s controls the expected level of subjective quality of the upscaled image and is set to 0.8 in our experiments.

  • Subjective representation loss (l_sr). In the same manner as the aesthetic representation loss, we calculate the subjective representation loss using the subjective score predictor as

    l_{sr} = \frac{1}{M} \sum_{i=1}^{M} \left( f_s^{(i)}(Y) - f_s^{(i)}(\hat{Y}) \right)^2,    (7)

    where f_s^{(i)}(Y) and f_s^{(i)}(\hat{Y}) are the i-th values of the intermediate outputs at the “global average pooling” layer for the ground-truth and upscaled images, respectively.

The losses are calculated for all three upscaled images and then averaged.
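A sketch of the six generator-side losses for one ground-truth/upscaled pair is given below; it assumes hypothetical `aesthetic_net` and `subjective_net` wrappers that return both the predicted mean score and the 1,280-dimensional pooled features, and it uses the reconstructed forms of (2)–(7) above.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def generator_losses(hr, sr, discriminator, aesthetic_net, subjective_net,
                     eta_a=0.8, eta_s=0.8, s_max=10.0):
    # Reconstruction loss (Eq. 2): pixel-by-pixel L1 difference.
    l_r = tf.reduce_mean(tf.abs(hr - sr))

    # Adversarial loss (Eq. 3): cross entropy against the "real" label,
    # i.e. -log D(sr) for a sigmoid discriminator output.
    d_sr = discriminator(sr)
    l_a = bce(tf.ones_like(d_sr), d_sr)

    # Aesthetic score and representation losses (Eqs. 4 and 5).
    s_hr, f_hr = aesthetic_net(hr)
    s_sr, f_sr = aesthetic_net(sr)
    l_as = tf.reduce_mean(eta_a * s_hr - s_sr) / s_max
    l_ar = tf.reduce_mean(tf.square(f_hr - f_sr))

    # Subjective score and representation losses (Eqs. 6 and 7).
    q_hr, g_hr = subjective_net(hr)
    q_sr, g_sr = subjective_net(sr)
    l_ss = tf.reduce_mean(eta_s * q_hr - q_sr) / s_max
    l_sr = tf.reduce_mean(tf.square(g_hr - g_sr))

    return l_r, l_a, l_as, l_ar, l_ss, l_sr
```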

While only the training set of the DIV2K dataset is employed during the pre-training phase, we use both the training and validation sets in this training phase, which contain a total of 900 images. In addition, we also use versions of these images downscaled by factors of 2 and 8 as additional ground-truth and input images, respectively. Thus, a total of 1,800 images are used for training. The Adam optimization method [15] is used to train both the 4PP-EUSR and the discriminator. At every training step, two input image patches are selected, which results in six upscaled images being generated. Thus, the effective batch sizes of the upscaling and discriminative models are six and eight (i.e., two ground-truth and six upscaled images), respectively. A total of 200,000 steps are executed with separate learning rates for the 4PP-EUSR and the discriminator.

5 Results

In this section, we report the results of four experiments: comparing the performance of our method with other state-of-the-art super-resolution models, comparing the outputs obtained from different upscaling paths, investigating the roles of the loss functions, and comparing the results obtained with different combinations of the loss weights. For the first three experiments, we train our model with the following weighted sum of the six losses defined in Section 4.3:

l = \lambda_r l_r + \lambda_a l_a + \lambda_{as} l_{as} + \lambda_{ar} l_{ar} + \lambda_{ss} l_{ss} + \lambda_{sr} l_{sr},    (8)

where the weights are empirically determined to ensure a high perceptual improvement while minimizing the degradation of quantitative performance.

We evaluate the super-resolution performance on the Set5 [3], Set14 [39], and BSD100 [21] datasets, which contain 5, 14, and 100 images, respectively. We employ four performance metrics that are widely used in the literature: PSNR, SSIM [36], NIQE [25], and a no-reference super-resolution (SR) score proposed by Ma et al. [20]. PSNR and SSIM measure the quantitative quality, while NIQE and the SR score measure the perceptual quality. For NIQE, lower values mean better quality; for PSNR, SSIM, and the SR score, higher values mean better quality. All quality metrics are calculated on the Y channel of the YCbCr color space converted from RGB, after cropping 4 pixels from each border, as in many existing studies [17, 14, 19].
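The quantitative part of this protocol (Y-channel PSNR/SSIM with border cropping) can be sketched as follows; the BT.601 luma coefficients are an assumption about the conversion, and NIQE as well as the SR score [20] require separate implementations that are omitted here.

```python
import tensorflow as tf

def luminance(rgb):
    # Y channel of YCbCr from RGB in [0, 1], using BT.601 coefficients.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def evaluate_pair(hr, sr, border=4):
    # Crop `border` pixels from each side and measure PSNR/SSIM on Y.
    y_hr = luminance(hr)[..., tf.newaxis][:, border:-border, border:-border, :]
    y_sr = luminance(sr)[..., tf.newaxis][:, border:-border, border:-border, :]
    return (tf.image.psnr(y_hr, y_sr, max_val=1.0),
            tf.image.ssim(y_hr, y_sr, max_val=1.0))
```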

5.1 Comparison with existing models

Models # parameters Multi-scale structure Using reconstruction loss Using adversarial loss Using feature-based loss Using perceptual loss
SRResNet-MSE 1.5M No Yes No No No
SRResNet-VGG22 1.5M No No No Yes No
EDSR 43.1M No Yes No No No
MDSR 8.0M Yes Yes No No No
EUSR 6.3M Yes Yes No No No
SRGAN-MSE 1.5M No Yes Yes No No
SRGAN-VGG22 1.5M No Yes* Yes Yes No
SRGAN-VGG54 1.5M No Yes* Yes Yes No
CX 1.5M No Yes Yes Yes No
4PP-EUSR (Ours) 6.3M Yes Yes Yes Yes Yes
  * For pre-training only

Table 1: Properties of the baseline and our models with respect to the number of parameters, multi-scale structure, and loss functions.

We first compare the result images obtained from the ×4 path of our model with those by the following existing super-resolution models.

  • Bicubic interpolation. It is a traditional upscaling method, which interpolates pixel values based on values of their adjacent pixels.

  • SRResNet [17]. This single-scale super-resolution model consists of several residual blocks. Its two variants are considered: the SRResNet-MSE model is trained with the mean-squared loss, and the SRResNet-VGG22 model is trained with the Euclidean distance-based loss on the output of the second conv3-128 layer of VGG19. Their results are retrieved from the authors' supplementary material (https://twitter.box.com/s/lcue6vlrd01ljkdtdkhmfvk7vtjhetog).

  • EDSR [19]. This model also consists of residual blocks similarly to SRResNet, but does not employ batch normalization to improve the performance. In addition, the upscaled results are obtained by a so-called “geometric self-ensemble” strategy, which obtains eight geometrically transformed versions of the input image via flipping and rotation and blends the model outputs for them. The compared results are obtained from a model trained on the DIV2K dataset, which is provided by the authors (https://cv.snu.ac.kr/research/EDSR/model_pytorch.tar).

  • MDSR [19]. It is an extended version of EDSR, which supports multiple factors of upscaling. We obtain the upscaled images from the ×4 path of the MDSR model trained on the DIV2K dataset [35]. The trained model is provided by the authors (https://cv.snu.ac.kr/research/EDSR/model_pytorch.tar).

  • EUSR [14]. This is the base model of 4PP-EUSR, which supports multi-scale super-resolution and consists of optimized residual modules as explained in Section 3.1. We consider the pre-trained EUSR model described in Section 4.1 as a baseline.

  • SRGAN [17]. The SRGAN model is an extended version of the SRResNet model, where a discriminator network is added to improve the perceptual quality of the upscaled outputs. We consider three SRGAN models, which use different content loss functions to train the generator: SRGAN-MSE (the mean-squared loss), SRGAN-VGG22 (the Euclidean distance-based loss for the output of the second conv3-128 layer of VGG19), and SRGAN-VGG54 (the Euclidean distance-based loss for the output of the fourth conv3-512 layer of VGG19). The compared results are retrieved from the authors' supplementary material (https://twitter.box.com/s/lcue6vlrd01ljkdtdkhmfvk7vtjhetog).

  • CX [22]. This model is based on SRGAN but employs an additional loss function, the contextual loss [23], which measures the cosine distance between the VGG19 features for the ground-truth and upscaled images. The compared results are retrieved from the authors' website (http://cgm.technion.ac.il/people/Roey/index.html).

Table 1 compares the properties of the baselines and ours, including the number of model parameters, the existence of a multi-scale structure, whether the reconstruction loss is used, whether a discriminator is employed, whether features obtained from well-known image classifiers (e.g., VGG19) are compared, and whether perceptual scores are used. First, the EDSR model has the largest number of parameters among the compared models, while the SRResNet, SRGAN, and CX models have the smallest. Our model contains a slightly smaller number of parameters than the MDSR model. In terms of the multi-scale structure, MDSR, EUSR, and our model utilize multiple scaling factors, while the other models are based on single-scale super-resolution. Although all the models except SRResNet-VGG22 employ the reconstruction loss, the SRGAN-VGG22 and SRGAN-VGG54 models use it only for pre-training. In addition, the SRGANs, CX, and our model employ discriminator networks and use them for adversarial losses. SRResNet-VGG22, SRGAN-VGG22, SRGAN-VGG54, and CX employ VGG19 as an additional network to use its intermediate outputs as feature-based losses, whereas our model employs the MobileNetV2-based networks instead of VGG19. Finally, ours estimates the aesthetic and subjective quality scores of the ground-truth and upscaled images for calculating perceptual losses.

Set5 PSNR (dB) SSIM NIQE SR score
Bicubic 28.4178 0.8098 8.5404 3.7702
CX 29.1017 0.8298 4.5461 7.9566
SRGAN-VGG54 29.4103 0.8338 4.6509 7.9401
SRGAN-VGG22 29.8714 0.8351 4.9186 7.5343
SRResNet-VGG22 30.5012 0.8689 6.9054 6.3357
SRGAN-MSE 30.6657 0.8589 4.9969 7.3082
4PP-EUSR (Ours) 31.0846 0.8652 5.6068 7.1491
SRResNet-MSE 32.0576 0.8919 7.1938 5.4112
EUSR 32.3519 0.8957 7.0703 5.1726
MDSR 32.5325 0.8978 7.1107 5.1094
EDSR 32.6296 0.8987 7.2347 5.2107
Set14 PSNR (dB) SSIM NIQE SR score
CX 26.0109 0.6999 3.4603 7.9423
Bicubic 26.0906 0.7050 7.7637 3.6608
SRGAN-VGG54 26.1138 0.6957 3.8745 8.1112
SRGAN-VGG22 26.5294 0.7121 4.2205 7.9829
SRGAN-MSE 27.0058 0.7187 4.0054 7.8770
SRResNet-VGG22 27.2718 0.7419 7.0235 7.0931
4PP-EUSR (Ours) 27.6222 0.7419 4.1724 7.7659
SRResNet-MSE 28.5900 0.7819 6.0751 5.6482
EUSR 28.7502 0.7860 6.1679 5.4665
MDSR 28.8947 0.7886 6.2667 5.3110
EDSR 28.9533 0.7904 6.3053 5.3788
BSD100 PSNR (dB) SSIM NIQE SR score
CX 24.5813 0.6440 3.3009 8.8007
SRGAN-VGG54 25.1762 0.6408 3.4070 8.7045
SRGAN-VGG22 25.6972 0.6603 3.7500 8.4879
Bicubic 25.9566 0.6693 7.7120 3.7225
SRGAN-MSE 25.9809 0.6429 4.0316 8.4276
SRResNet-VGG22 26.3218 0.6940 7.8053 7.4387
4PP-EUSR (Ours) 26.5707 0.6900 3.6976 8.2089
SRResNet-MSE 27.6013 0.7372 6.2400 5.8067
EUSR 27.6743 0.7403 6.4225 5.8082
MDSR 27.7710 0.7427 6.5379 5.6902
EDSR 27.7960 0.7441 6.4319 5.7791
Table 2: Performance comparison of the baselines and our model evaluated on the Set5 [3], Set14 [39], and BSD100 [21] datasets. The models are sorted by PSNR in an ascending order.
Figure 6: PSNR and NIQE values of the baselines and our model for the BSD100 dataset [21].
Figure 7: Images reconstructed by the baselines and our model (ground-truth, bicubic, SRResNet-MSE, SRResNet-VGG22, EDSR, MDSR, EUSR, SRGAN-MSE, SRGAN-VGG22, SRGAN-VGG54, CX, and 4PP-EUSR (Ours)). The input and ground-truth images are from the Set14 dataset [39].
Figure 8: Images reconstructed by the baselines and our model (ground-truth, bicubic, SRResNet-MSE, SRResNet-VGG22, EDSR, MDSR, EUSR, SRGAN-MSE, SRGAN-VGG22, SRGAN-VGG54, CX, and 4PP-EUSR (Ours)). The input and ground-truth images are from the BSD100 dataset [21].

Table 2 shows the performance comparison of the baselines and ours evaluated on the three datasets. First of all, the bicubic interpolation introduces a large amount of distortion, which results in low PSNR values, and the upscaled images have poor perceptual quality, according to the high NIQE values and low SR scores. The models that do not employ a discriminator network (i.e., SRResNet, EDSR, MDSR, and EUSR) achieve better quantitative quality than the others, showing higher PSNR values, but their perceptual quality is worse than that of all other models except the bicubic interpolation, showing higher NIQE values and lower SR scores. The models considering perceptual quality (i.e., SRGAN and CX) have similar or only slightly higher PSNR values in comparison to the bicubic interpolation, but their perceptual quality is far better, according to the much lower NIQE values and higher SR scores. Our model (i.e., 4PP-EUSR) always records higher PSNR values than the other discriminator-based models, which means that ours generates quantitatively better upscaled outputs. At the same time, our model achieves perceptual quality similar to that of SRGAN-MSE in terms of NIQE and SR score. For instance, for the BSD100 dataset, the NIQE values of our model and SRGAN-MSE are 3.6976 and 4.0316, respectively, and the SR scores are 8.2089 and 8.4276, respectively. This appears more clearly in Figure 6, which compares the baselines and our model with respect to the PSNR and NIQE values measured on the BSD100 dataset. It confirms that our model achieves a proper balance between the quantitative and perceptual quality of the upscaled images.

Figure 7 shows example images upscaled by the different methods. Enlarged images of the regions marked by red rectangles are also shown, where high-frequency textures are expected as in the ground-truth image. First, the bicubic interpolation fails to resolve the textures, producing a highly blurred output. The SRResNet-based, EDSR, MDSR, and EUSR models produce richer textures in that region, but they are still largely blurry. The output of SRResNet-VGG22 shows more distinctive textures, which is due to the employment of a different loss function (i.e., differences of VGG19 features). Thanks to the adversarial loss, the other models, including the SRGANs, CX, and 4PP-EUSR, generate much better outputs in terms of perceptual quality at the cost of quantitative quality. Among them, SRGAN-VGG54 and CX recover the most detailed textures, while SRGAN-MSE produces blurry textures. Our model, 4PP-EUSR, restores the textures more clearly than SRGAN-VGG22 and less distinctly than SRGAN-VGG54. Nevertheless, ours achieves better quantitative quality than all the SRGANs in terms of PSNR in Table 2.

Another comparison shown in Figure 8 further supports the importance of considering both the quantitative and perceptual quality. Similarly to Figure 7, the bicubic interpolation produces the worst output among all methods, the models employing only the reconstruction loss (i.e., the SRResNets, EDSR, MDSR, and EUSR) flatten most of the textured areas, and the rest (i.e., the SRGANs, CX, and ours) produce outputs having detailed textures. However, the SRGAN and CX models tend to exaggerate the indistinct textures on the ground and airplane regions, introducing sizzling artifacts. For example, the SRGAN-MSE model adds a considerable amount of undesirable noise over the whole image. On the other hand, thanks to the cooperation of the loss functions, our model successfully recovers much of the texture without any prominent artifacts.

5.2 Comparing upscaling paths

As described in Section 3.2 and shown in Figure 4, our model produces three upscaled images by utilizing all the upscaling paths: by passing through the ×4 path, by passing twice through the ×2 path, and by passing through the ×8 path and then downscaling via bicubic interpolation. Here, we compare the results obtained from the different upscaling paths to examine what our model learns in each case.

Figure 9: Images reconstructed by the different upscaling paths of our model (ground-truth, ×4 path, ×2 path applied twice, and ×8 path followed by downscaling). The input and ground-truth images are from the BSD100 dataset [21].
Set5 PSNR (dB) SSIM NIQE SR score
×4 31.0846 0.8652 5.6068 7.1491
×2 – ×2 31.1722 0.8685 6.9494 7.2152
×8 – downscale 30.9807 0.8620 5.8166 7.4754
Set14 PSNR (dB) SSIM NIQE SR score
×4 27.6222 0.7419 4.1724 7.7659
×2 – ×2 27.7927 0.7487 4.8388 7.7259
×8 – downscale 27.5065 0.7382 4.7569 7.8919
BSD100 PSNR (dB) SSIM NIQE SR score
×4 26.5707 0.6900 3.6976 8.2089
×2 – ×2 26.6894 0.6962 4.9062 8.1887
×8 – downscale 26.4949 0.6913 4.6888 8.4096
Table 3: Performance comparison of the outputs obtained from the three different upscaling paths of the 4PP-EUSR model. The results are for the Set5 [3], Set14 [39], and BSD100 [21] datasets.

Table 3 compares the performance of the three upscaling paths of our model. While the PSNR, SSIM, and SR score values are very similar among the three cases, the ×4 path shows the best performance in terms of NIQE. This implies that upscaling via the ×2 path twice or via the ×8 path followed by downscaling is more difficult than direct upscaling via the ×4 path.

Figure 9 shows an example result with large differences among the three cases. The appearances of the textures in the enlarged regions differ depending on the upscaling path, although the overall patterns of the textures follow those of the ground-truth image. First, the output obtained by the two-pass upscaling using the ×2 path contains grid-like textures. One possible reason is the uncertainty in the order of passing: the model does not know whether the current input is being processed in the first or the second pass, so the two-pass upscaling is not fully optimized. Second, the output obtained from the ×8 path with downscaling has unexpected white and black pixels, which resemble salt-and-pepper noise. Since such noise tends to be removed by downscaling, avoiding it is not strongly enforced during the training of the ×8 path. These results show that each upscaling path of our model learns a different strategy for super-resolution, and thus the model is trained to cope with various types of textures via the shared part of the upscaling paths (i.e., the intermediate residual module shown in Figure 3).

5.3 Roles of loss functions

Our model employs multiple types of loss functions, as described in Section 4.3. To analyze the role of each loss function, we conduct an experiment in which our model is trained with specific loss functions excluded. In detail, we obtain the models trained without l_r, without l_a, without l_as and l_ar, and without l_ss and l_sr.

Figure 10: Images reconstructed by our models trained with specific loss functions excluded (ground-truth, with all losses, without l_r, without l_a, without l_as and l_ar, and without l_ss and l_sr). The input and ground-truth images are from the Set14 dataset [39].
Set5 PSNR (dB) SSIM NIQE SR score
With all losses 31.0846 0.8652 5.6068 7.1491
Without l_r 29.3258 0.8374 5.2218 8.2387
Without l_a 31.9916 0.8874 6.7405 5.9226
Without l_as, l_ar 30.8192 0.8586 5.2113 7.6089
Without l_ss, l_sr 31.1930 0.8696 5.2472 7.0584
Set14 PSNR (dB) SSIM NIQE SR score
With all losses 27.6222 0.7419 4.1724 7.7659
Without l_r 26.3267 0.7104 4.0134 8.0804
Without l_a 28.5599 0.7777 5.1927 6.2370
Without l_as, l_ar 27.5422 0.7339 4.1143 7.8034
Without l_ss, l_sr 27.5157 0.7354 4.3370 7.6618
BSD100 PSNR (dB) SSIM NIQE SR score
With all losses 26.5707 0.6900 3.6976 8.2089
Without l_r 25.0553 0.6476 3.8430 8.7373
Without l_a 27.4977 0.7310 5.1779 6.4945
Without l_as, l_ar 26.3547 0.6755 4.1764 8.4308
Without l_ss, l_sr 26.3968 0.6840 4.0300 8.3324
Table 4: Performance comparison of the 4PP-EUSR models trained by excluding specific loss functions. The models are evaluated on the Set5 [3], Set14 [39], and BSD100 [21] datasets.

Table 4 shows the PSNR, SSIM, NIQE, and SR score values of the trained models. First, excluding l_r decreases the quantitative quality of the upscaled images, showing smaller PSNR values, and increases the perceptual quality, showing smaller NIQE values and larger SR scores, in comparison to the model trained with all losses. Excluding l_a results in the opposite outcome: it increases the quantitative quality (i.e., larger PSNR values) and decreases the perceptual quality (i.e., larger NIQE values and smaller SR scores). Excluding the aesthetic losses (i.e., l_as and l_ar) or the subjective losses (i.e., l_ss and l_sr) also affects the performance, slightly for Set5 and Set14 but largely in terms of NIQE for BSD100.

Figure 10 shows example output images, where the roles of the loss functions can be observed more clearly. First, the image obtained from the model trained without the reconstruction loss (i.e., l_r) contains the most distinct textures among all, but its overall color distribution deviates slightly from that of the ground-truth image. On the other hand, the result generated by the model trained without the adversarial loss (i.e., l_a) preserves the overall structure of the ground-truth image, while its details are blurrier than those of the others. The output of the model trained without the subjective loss functions contains more lattice-like textures than that of the model trained without the aesthetic loss functions. This implies that the aesthetic losses contribute to the restoration of highly structured textures, while the subjective losses are helpful for constructing dispersed high-frequency textures. Finally, the image obtained by training with all the proposed loss functions is the most reliable and natural.

5.4 Comparing different loss weights

Finally, we train our model with different weights of the loss functions. Specifically, we alter the weight of the reconstruction loss in (8) as

l = \lambda_r' l_r + \lambda_a l_a + \lambda_{as} l_{as} + \lambda_{ar} l_{ar} + \lambda_{ss} l_{ss} + \lambda_{sr} l_{sr},    (9)

where only \lambda_r' is varied across three values while the other weights are kept as in (8); the intermediate value of \lambda_r' equals the \lambda_r used in (8). We can expect that a larger \lambda_r' leads the model to be trained towards producing outputs with better quantitative quality.

Figure 11: Images reconstructed by our models trained with different combinations of the loss weights (the ground-truth image and the outputs for the three \lambda_r' settings). The input and ground-truth images are from the BSD100 dataset [21].
Set5 PSNR (dB) SSIM NIQE SR score
Largest \lambda_r' 31.6114 0.8758 5.9336 6.2262
Intermediate \lambda_r' (as in (8)) 31.0846 0.8652 5.6068 7.1491
Smallest \lambda_r' 30.2427 0.8488 4.9660 8.1779
Set14 PSNR (dB) SSIM NIQE SR score
Largest \lambda_r' 28.2063 0.7591 4.5293 7.0402
Intermediate \lambda_r' (as in (8)) 27.6222 0.7419 4.1724 7.7659
Smallest \lambda_r' 26.8633 0.7199 3.9819 8.0856
BSD100 PSNR (dB) SSIM NIQE SR score
Largest \lambda_r' 27.1468 0.7094 4.3867 7.2518
Intermediate \lambda_r' (as in (8)) 26.5707 0.6900 3.6976 8.2089
Smallest \lambda_r' 25.8207 0.6653 3.7586 8.6812
Table 5: Performance comparison of our models trained with different combinations of the loss weights. The models are evaluated on the Set5 [3], Set14 [39], and BSD100 [21] datasets.

Table 5 presents the performance of our models trained with different \lambda_r' values. As expected, decreasing the contribution of the reconstruction loss with a smaller \lambda_r' results in lower PSNR and SSIM values. On the other hand, the NIQE values decrease and the SR scores increase, which indicates improved perceptual quality. These observations appear as visual differences in the upscaled images shown in Figure 11. When we examine the enlarged regions where high-frequency textures are expected, decreased \lambda_r' values affect the sharpness of the output images due to the relatively larger contributions of the adversarial and perceptual losses. These results confirm that there is a tradeoff between quantitative and perceptual quality, as mentioned in [4], and that our model can control the priorities of these quality measures by adjusting the weights of the loss functions.

6 Conclusion

In this paper, we proposed a perceptually improved super-resolution method, which employs multi-pass image restoration via a multi-scale super-resolution model and trains the model with a discriminator network and two qualitative score predictors. The results showed that our model successfully recovers the original textures in a perceptual manner while preventing quantitative quality degradation.

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the “ICT Consilience Creative Program” (IITP-2018-2017-0-01015) supervised by the IITP (Institute for Information & communications Technology Promotion). In addition, this work was also supported by the IITP grant funded by the Korea government (MSIT) (R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).

References

  • [1] PIRM-SR: Challenge on perceptual super-resolution. In: Proceedings of the European Conference on Computer Vision Workshops (2018)

  • [2] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. pp. 265–283 (2016)

  • [3] Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: Proceedings of the British Machine Vision Conference. pp. 1–10 (2012)
  • [4] Blau, Y., Michaeli, T.: The perception-distortion tradeoff. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6228–6237 (2018)

  • [5] Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Proceedings of the European Conference on Computer Vision. pp. 184–199 (2014)
  • [6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
  • [7] Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., Lew, M.S.: Deep learning for visual understanding: A review. Neurocomputing 187, 27–48 (2016)
  • [8] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
  • [9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [10] Hou, L., Yu, C.P., Samaras, D.: Squared Earth mover’s distance-based loss for training deep neural networks. arXiv:1611.05916 pp. 1–9 (2016)
  • [11] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning. pp. 448–456 (2015)
  • [12] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision. pp. 694–711 (2016)
  • [13] Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1646–1654 (2016)
  • [14] Kim, J.H., Lee, J.S.: Deep residual network with enhanced upscaling module for super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 913–921 (2018)
  • [15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations. pp. 1–13 (2015)
  • [16] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
  • [17] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A.P., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4681–4690 (2017)
  • [18] Li, X., He, H., Yin, Z., Chen, F., Cheng, J.: Single image super-resolution via subspace projection and neighbor embedding. Neurocomputing 139, 310–320 (2014)
  • [19] Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proccedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 136–144 (2017)
  • [20] Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding 158, 1–16 (2017)
  • [21] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 416–423 (2001)
  • [22] Mechrez, R., Talmi, I., Shama, F., Zelnik-Manor, L.: Maintaining natural image statistics with the contextual loss. arXiv:1803.04626 pp. 1–16 (2018)
  • [23] Mechrez, R., Talmi, I., Zelnik-Manor, L.: The contextual loss for image transformation with non-aligned data. arXiv:1803.02077 pp. 1–16 (2018)
  • [24] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing 21(12), 4695–4708 (2012)
  • [25] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20(3), 209–212 (2013)
  • [26] Murray, N., Marchesotti, L., Perronnin, F.: AVA: A large-scale database for aesthetic visual analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2408–2415 (2012)
  • [27] Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., et al.: Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication 30, 57–77 (2015)
  • [28] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
  • [29] Salvador, J., Perez-Pellitero, E.: Naive Bayes super-resolution forest. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 325–333 (2015)

  • [30] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)
  • [31] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 pp. 1–14 (2014)
  • [32] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2818–2826 (2016)
  • [33] Talebi, H., Milanfar, P.: NIMA: Neural image assessment. IEEE Transactions on Image Processing 27(8), 3998–4011 (2018)
  • [34] Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L., Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M., et al.: NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1110–1121 (2017)
  • [35] Timofte, R., Gu, S., Wu, J., Van Gool, L., Zhang, L., Yang, M.H., et al.: NTIRE 2018 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 965–976 (2018)
  • [36] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  • [37] Yang, S., Liu, Z., Wang, M., Sun, F., Jiao, L.: Multitask dictionary learning and sparse representation based single-image super-resolution reconstruction. Neurocomputing 74(17), 3193–3203 (2011)
  • [38] Yang, W., Zhang, X., Tian, Y., Wang, W., Xue, J.H.: Deep learning for single image super-resolution: A brief review. arXiv:1808.03344 pp. 1–15 (2018)
  • [39] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Proceedings of the International Conference on Curves and Surfaces. pp. 711–730 (2010)