Image restoration (IR) is a classic low-level vision problem, which aims at reconstructing high-quality images from distorted low-quality inputs. Typical IR tasks include image super-resolution (SR), denoising, compression artifact reduction, etc. The rapid progress in deep learning has produced a steady stream of promising IR algorithms that can generate less-distorted or perceptually friendly images. Nevertheless, one of the key bottlenecks that restrict the future development of IR methods is the “evaluation mechanism”. Although it is nearly effortless for human eyes to distinguish perceptually better images, it is challenging for an algorithm to fairly measure visual quality. In this work, we focus on the analysis of existing evaluation methods and propose a new image quality assessment (IQA) dataset, which not only includes the most recent IR methods but also has the largest scale and diversity. We first state the motivation as follows.
IR methods are generally evaluated by measuring the similarity between the reconstructed images and ground-truth images via IQA metrics, such as PSNR (Group et al., 2000) and SSIM (Wang et al., 2004). Recently, some no-reference IQA methods, such as Ma (Ma et al., 2017) and Perceptual Index (PI) (Blau and Michaeli, 2018), have been introduced to evaluate recent perceptual-driven methods. To some extent, these IQA methods are the chief reason for the considerable progress of the IR field. However, while new algorithms have been continuously improving IR performance, we notice an increasing inconsistency between quantitative results and perceptual quality. For example, Blau and Michaeli (2018) reveal that superior PSNR values do not always accord with better visual quality. Although Blau et al. (2018) suggest that PI is more relevant to human judgement, algorithms that perform well in terms of PI (e.g., ESRGAN (Wang et al., 2018b) and RankSRGAN (Zhang et al., 2019)) can still produce images with obvious unrealistic artifacts. These conflicts lead us to rethink the evaluation methods for IR tasks.
An important reason for this situation is the invention of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and GAN-based IR methods (Wang et al., 2018b; Gu et al., 2020b), which bring completely new characteristics to the output images. In general, these methods often fabricate seemingly realistic yet fake details and textures. This presents a great challenge for existing IQA methods, which cannot distinguish GAN-generated textures from noise and real details. Two questions naturally arise: (1) Can existing IQA methods objectively evaluate current IR methods, especially GAN-based methods? (2) With the focus on beating benchmarks defined by flawed IQA methods, are we getting better IR algorithms? A few works have made early attempts to answer these questions by proposing new benchmarks for IR and IQA methods. Yang et al. (2014) conduct a comprehensive evaluation of traditional SR algorithms. Blau and Michaeli (2018) analyze the perception-distortion trade-off phenomenon and suggest the use of multiple IQA methods. However, these prior studies usually rely on unreliable human ratings of image quality and cover an insufficient range of IR and IQA methods. In particular, the results of GAN-based methods are missing from the above works.
To get to the heart of this problem, we need a better understanding of the new challenges brought by GANs. The first step is to build a new IQA dataset with the outputs of GAN-based algorithms. An IQA dataset includes a large number of distorted images with visual quality levels annotated by humans. It can be used to measure the consistency between the predictions of an IQA method and human judgement. In this work, we contribute a novel IQA dataset, namely the Perceptual Image Processing ALgorithms dataset (PIPAL). The proposed PIPAL dataset differs from previous datasets in four aspects: (1) In addition to traditional distortion types (e.g., Gaussian noise/blur), PIPAL contains the outputs of several kinds of IR algorithms, including traditional algorithms, deep-learning-based algorithms, and GAN-based algorithms. In particular, this is the first time that the results of GAN-based algorithms appear in an IQA dataset. (2) We employ the Elo rating system (Elo, 1978) to assign subjective scores, which involves more than 1.13 million human judgements in total. Compared with existing rating systems (e.g., five gradations (Sheikh et al., 2006a) and the Swiss system (Ponomarenko et al., 2009)), the Elo rating system provides much more reliable probability-based rating results. Furthermore, it is easily extensible, which allows users to update the dataset by directly adding new distortion types. (3) The proposed dataset contains 29K images in total, including 250 high-quality reference images, each of which has 116 distortions. To date, PIPAL is the largest IQA dataset with complete subjective scoring. (4) We also contribute a new open-source web-based rating software called the Image Quality Opinion Scoring (IQOS) system, which allows users to easily assign subjective scores to their own images.
Table 1: Comparison of IQA datasets.
|Dataset||# Ref. images||Image type||Distortion type||# Distort. types||# Distort. images||# Human judgements||Judgement type|
|LIVE||29||image||traditional||5||0.8k||25k||MOS (Five gradations)|
|CSIQ||30||image||traditional||6||0.8k||5k||MOS (Direct ranking)|
|TID2008||25||image||traditional||17||1.7k||256k||MOS (Swiss system)|
|TID2013||25||image||traditional||25||3.0k||524k||MOS (Swiss system)|
|BAPPS||187.7k||patch (64×64)||trad. alg. outputs||425||375.4k||484.3k||2 Alternative Choice|
|PieAPP||200||patch (256×256)||trad. alg. outputs||75||20.3k||2492k||Prob. of Preference|
|PIPAL (Ours)||250||patch (288×288)||trad. alg. outputs including GAN||40||29k||1.13m||MOS (Elo rating system)|
With the PIPAL dataset, we are able to answer the aforementioned two questions. First, we build a benchmark using the proposed PIPAL dataset for the existing IQA methods to explore whether they can objectively evaluate current IR methods. Experiments indicate that PIPAL poses challenges for these IQA methods, and that evaluating IR algorithms using only existing metrics is not appropriate. Our research also shows that, compared with the widely used metrics (e.g., PSNR and PI), PieAPP (Prashnani et al., 2018), LPIPS (Zhang et al., 2018c) and WaDIQaM (Bosse et al., 2017) are relatively more suitable for evaluating IR algorithms, especially GAN-based algorithms. We also study the characteristics and difficulties of GAN-based distortions by comparing them with some well-studied traditional distortions. Based on the results, we argue that the low tolerance of existing IQA methods toward spatial misalignment may be a key reason for their performance drop. To answer the second question, we review the development of SR algorithms in recent years. We find that none of the existing IQA methods is always effective in evaluating SR algorithms. As new IR technologies are invented, the corresponding evaluation methods also need to be adjusted in order to continuously promote the development of the IR field.
Finally, we shed light on how to improve IQA performance on GAN-based distortion. We argue that existing IQA methods perform unsatisfactorily on GAN-based distortion partially because of their low tolerance to spatial misalignment. Inspired by this finding, we improve the performance of an IQA network on GAN-based distortion by explicitly considering this misalignment. First, we replace the original pooling layers in IQA networks with pooling layers that avoid aliasing during downsampling and make the extracted image features shift-invariant. Second, we propose a new feature comparison operation called the Space Warping Difference (SWD) layer, which compares features not only at the corresponding position but also within a small range around it. This operation explicitly makes the comparison robust to small displacements. By employing these pooling layers and the SWD layers, we propose the space warping difference IQA network (SWDN). Extensive experimental results demonstrate that the proposed components are effective and that SWDN achieves state-of-the-art performance on GAN-based distortion.
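To make the idea of tolerance to spatial misalignment concrete, the following toy sketch contrasts a strictly position-wise comparison with a comparison that also searches a small neighbourhood. It is only an illustration under loud assumptions: the function names, the use of a minimum over the window, and the tiny 2D "feature maps" are all hypothetical, and this is not the SWD layer itself, whose exact formulation is not given in this part of the paper.

```python
# Toy illustration only: names and the min-over-window rule are assumptions,
# not the formulation of the SWD layer.

def pointwise_diff(ref, dist):
    """Mean absolute difference at corresponding positions."""
    h, w = len(ref), len(ref[0])
    total = sum(abs(ref[i][j] - dist[i][j]) for i in range(h) for j in range(w))
    return total / (h * w)


def window_min_diff(ref, dist, radius=1):
    """For each position, take the smallest absolute difference found within a
    (2 * radius + 1)^2 window of `dist` around that position, then average."""
    h, w = len(ref), len(ref[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            best = float("inf")
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # ignore out-of-bounds offsets
                        best = min(best, abs(ref[i][j] - dist[y][x]))
            total += best
    return total / (h * w)


# A vertical edge, and the same edge displaced by one pixel to the right.
reference = [[0, 0, 1, 1] for _ in range(4)]
shifted = [[0, 0, 0, 1] for _ in range(4)]

print(pointwise_diff(reference, shifted))   # penalises the one-pixel shift
print(window_min_diff(reference, shifted))  # tolerant to the one-pixel shift
```

In this toy case the position-wise difference is non-zero even though the two maps contain the same structure, whereas the windowed comparison treats the displaced edge as a match.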
We note that a shorter conference version of this paper appeared in Gu et al. (2020a). Compared with the conference version, this manuscript includes the following additional contents: (1) We include more details of the PIPAL dataset in Section 3, including examples of the reference images in Figure 1, the distribution of the subjective scores in Figure 4, and a new image quality opinion scoring system with details about the rating process in Section 3.4. (2) We include more results and discussion in Sections 4.1 and 4.2. In particular, we present more details about the “counter-example” experiment in Section 4.2. (3) We introduce a new IQA network called SWDN in Section 5.1. In Section 5.2, we conduct both comparison and ablation experiments to demonstrate the effectiveness of the proposed layers and the SWDN network.
2 Related Work
Image Restoration (IR).
As a fundamental computer vision problem, IR aims at recovering a high-quality image from its degraded observations. Common degradation processes include noise corruption, blurring, downsampling, etc. Each type of image degradation has a corresponding IR task. For example, image SR aims at recovering the high-resolution image from its low-resolution observation, and image denoising aims at removing unpleasant noise. In past decades, plenty of IR algorithms have been proposed to continuously improve the performance. The early algorithms use hand-crafted features (Dabov et al., 2007; Yang et al., 2010) or exploit image priors (Timofte et al., 2013, 2014) in optimization problems to reconstruct images. Since the pioneering work on using Convolutional Neural Networks (CNNs) to learn the IR mapping (Jain and Seung, 2009; Dong et al., 2015b), deep-learning-based algorithms have dominated IR research due to their remarkable performance and usability. Recently, with the invention of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), GAN-based IR methods (Sajjadi et al., 2017; Zhang et al., 2019; Gu et al., 2020b) no longer aim only at higher PSNR performance but also try to achieve a better perceptual effect. However, these image restoration algorithms are not perfect. Their results also include various image defects, which are different from the traditional distortions that are often discussed in previous IQA research. With the development of IR algorithms and the emergence of new technologies (e.g., GAN-based algorithms), evaluating the results of these algorithms becomes more and more challenging. In this paper, we mainly focus on the restoration of low-resolution images, noisy images, and images degraded by both resolution reduction and noise.
Image Quality Assessment (IQA).
IQA methods were developed to measure the perceptual quality of images that may be degraded during acquisition, compression, reproduction, and post-processing operations. According to the usage scenario, IQA methods can be divided into full-reference methods (FR-IQA) and no-reference methods (NR-IQA). FR-IQA methods measure the similarity between two images from the perspective of information or perceptual feature similarity, and have been widely used in the evaluation of image/video coding, restoration and communication quality. Beyond the most widely used PSNR, FR-IQA methods follow a long line of works that can be traced back to SSIM (Wang et al., 2004), which first introduced structural information into the measurement of image similarity. After that, various FR-IQA methods have been proposed to bridge the gap between the results of IQA methods and human judgements (Zhang et al., 2011; Sheikh et al., 2005; Zhang et al., 2014). As in other computer vision problems, advanced data-driven methods have also motivated the investigation of applications of IQA (Zhang et al., 2018c; Prashnani et al., 2018). In addition to the above FR-IQA methods, NR-IQA methods have been proposed to assess image quality without a reference image. Some popular NR-IQA methods include NIQE (Mittal et al., 2012b), Ma et al. (2017), BRISQUE (Mittal et al., 2012a), and PI (Blau and Michaeli, 2018). In some recent works, NR-IQA and FR-IQA methods are combined to measure IR algorithms (Blau and Michaeli, 2018). Despite the progress of IQA methods, only a few of them (e.g., PSNR, SSIM and PI) are used to evaluate IR methods.
In order to evaluate and develop IQA methods, many datasets have been proposed, such as LIVE (Sheikh et al., 2006a), CSIQ (Larson and Chandler, 2010), TID2008 and TID2013 (Ponomarenko et al., 2009, 2015). These datasets provide both distorted images and the corresponding subjective scores, and they have served as baselines for evaluation of IQA methods. These datasets are mainly distinguished from each other in three aspects: (1) the collecting of the reference images, (2) the number of distortions included and their types and (3) the collection strategy of subjective score. A quick comparison of these datasets can be found in Table 1. In addition, there are also some perceptual similarity datasets such as PieAPP (Prashnani et al., 2018), and BAPPS (Zhang et al., 2018c), which only contain pairwise judgements of distorted images.
|Traditional||Gaussian blur, motion blur, image compression, Gaussian noise, spatial warping, bilateral filter, comfort noise.|
|Super-Resolution||interpolation method, traditional methods, SR with kernel mismatch, PSNR-oriented methods, GAN-based methods.|
|Denoising||mean filtering, traditional methods, deep learning-based methods.|
|Mixture Restoration||SR of noisy images, SR after denoising, SR after compression noise removal|
3 Perceptual Image Processing ALgorithms (PIPAL) Dataset
We first describe the peculiarities of the proposed dataset from the aforementioned aspects of (1) the collecting of the reference images, (2) the number of distortions and their types, and (3) the collecting of subjective score, respectively. Then, we present a new image quality opinion scoring (IQOS) system for collecting user ratings.
3.1 Collection of reference images.
We select 250 image patches from two high-quality image datasets (DIV2K (Agustsson and Timofte, 2017) and Flickr2K (Timofte et al., 2017)) as the reference images. When selecting them, we mainly focus on representative texture areas that are relatively hard to restore. Figure 1 shows an overview of the selected reference images. As one can see, the selected reference images are representative of a wide variety of real-world textures, including but not limited to buildings, trees, grass, animal fur, human faces, text, and artificial textures. In our dataset, the images are of size $288 \times 288$, which meets the requirements of most IQA methods.
3.2 Distortion Types.
In our dataset, we have 40 distortion types, which can be divided into four sub-types. An overview of these distortion types is shown in Table 10. The first sub-type includes some traditional distortions (e.g., blur, noise, and compression), which are usually produced by basic low-level image editing operations. In previous datasets such as PieAPP and TID2013, these distortions can be very severe. In our dataset, however, we limit the severity of these distortions, as we want them to be comparable to IR results, which are unlikely to be of very low quality. The second sub-type includes the SR results from existing algorithms. Although some recent datasets (Zhang et al., 2018c; Prashnani et al., 2018) have covered some SR results, they include fewer algorithms and fewer algorithm types than our dataset. We divide the selected SR algorithms into three categories: traditional algorithms, PSNR-oriented algorithms, and GAN-based algorithms. The results of traditional algorithms can be understood, to some extent, as loss of detail. The PSNR-oriented algorithms are usually based on deep-learning technology. Compared with the traditional algorithms, their outputs tend to have sharper edges and higher PSNR performance. The outputs of GAN-based algorithms are more complicated and challenging for IQA methods. They cannot be characterized simply as detail loss, because they usually contain texture-like noise, nor simply as noise, because this texture-like noise is similar to the ground truth in appearance but is not accurate. An example of GAN-based distortions is shown in Figure 2. Measuring the similarity of incorrect yet similar features is of great importance to the development of perceptual SR. The third sub-type includes the outputs of several denoising algorithms. Similar to image SR, the denoising algorithms used include both model-based algorithms and deep-learning-based algorithms. In addition to Gaussian noise removal, we also include JPEG compression noise removal results. Finally, the fourth sub-type includes the restoration results of mixed degradations. As revealed in Zhang et al. (2018b) and Qian et al. (2019), performing denoising and SR sequentially brings new artifacts or different blur effects that barely occur in other IR tasks.
In summary, we have 40 different distortion types and 116 different distortion levels, resulting in 29k distorted images in total (250 reference images with 116 distortions each). Note that although the number of distortion types is smaller than in some existing datasets, our dataset contains many new distortion types and, especially, a large number of IR algorithm results and GAN results. This allows the proposed dataset to provide a more objective benchmark for not only IQA methods but also IR methods.
3.3 Elo Rating for Mean Opinion Score.
Given the distorted images, a Mean Opinion Score (MOS) is provided for each of them. In the literature, several methodologies are used to assess the visual quality of an image (Ponomarenko et al., 2015; Zhang et al., 2018c). Early datasets (Sheikh et al., 2006b) use the five-gradation rating method, in which images are assigned to five categories such as “Bad”, “Poor”, “Fair”, “Good”, and “Excellent”. This method may lead to a large bias when the raters do not have enough experience. In recent years, datasets usually collect MOS through a large number of pairwise selections using the Swiss rating system (Ponomarenko et al., 2015). However, the way this pairwise MOS is calculated makes it dependent on a specific dataset, which means that the MOS scores of two distorted images can change significantly when they are included in two different datasets. In order to eliminate this set-dependence effect, Prashnani et al. (2018) propose to build a dataset based only on the probability of pairwise preference, which provides a more accurate propensity probability. However, this approach not only requires a large number of human judgements, but also cannot provide MOS scores for distortion types, which are important for building a benchmark for IR algorithms. In the proposed dataset, we employ the Elo rating system (Elo, 1978) to bring pairwise preference probability and the rating system together. The use of the Elo system provides reliable human ratings with fewer human judgements.
The Elo rating system is a statistics-based rating method that was first proposed for assessing the level of chess players. We assume that the rater preference between two images $I_A$ and $I_B$ follows a logistic distribution parameterized by their Elo scores (Elo, 2008). Given their Elo scores $R_A$ and $R_B$, the expected probability that a user prefers $I_A$ to $I_B$ is given by:

$$P_{A>B} = \frac{1}{1 + 10^{(R_B - R_A)/M}}, \tag{1}$$

where $M$ is the parameter of the distribution. In our dataset we use $M = 400$, which is a widely used setting in chess games. Then, the probability that a user prefers $I_B$ to $I_A$ takes a symmetrical form:

$$P_{B>A} = \frac{1}{1 + 10^{(R_A - R_B)/M}}. \tag{2}$$

Obviously, we have $P_{A>B} + P_{B>A} = 1$. Once a rater makes a choice, we update the Elo scores of both $I_A$ and $I_B$ using the following rules:

$$R_A' = R_A + K\,(S_A - P_{A>B}), \qquad R_B' = R_B + K\,(S_B - P_{B>A}), \tag{3}$$

where $K$ is the maximum change of the Elo score in one step. In our dataset, $K$ is set to 16. $S_A$ indicates whether $I_A$ is chosen: $S_A = 1$ if $I_A$ wins and $S_A = 0$ if $I_A$ fails; $S_B$ is defined in the same way, so that $S_A + S_B = 1$. With thousands of human judgements, the Elo score of each distorted image will converge. The average of the Elo scores in the last few steps is assigned as the MOS subjective score. The averaging operation aims at reducing the randomness of the Elo score changes.

An example might help in understanding the Elo system. Suppose that $R_A < R_B$, so that $P_{A>B} < 0.5 < P_{B>A}$. In this situation, if $I_A$ is chosen, the updated Elo score for $I_A$ will be $R_A + K(1 - P_{A>B})$ and the new score for $I_B$ is $R_B - K P_{B>A}$; if $I_B$ is chosen, the new scores will be $R_A - K P_{A>B}$ and $R_B + K(1 - P_{B>A})$. Note that because the expected probabilities of the two images being chosen are different, the changes in the Elo scores are different. This indicates that when the quality gap is large, the winner gains little from beating the worse image. According to Eq. (1), a score difference of 200 indicates a 76% chance of winning, and a difference of 400 indicates a chance of more than 90%. At first, we assign an Elo score of 1400 to each distorted image. After numerous human judgements (in our dataset, we have 1.13 million human judgements), the Elo score of each image is collected. Figure 3 shows some examples of distortions and their corresponding MOS scores, and Figure 4 shows the distribution of MOS scores in the PIPAL dataset.
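The update rule described above can be sketched in a few lines of Python. This is a minimal sketch rather than the code used to build the dataset: the function names and the two example scores (1400 and 1600) are hypothetical choices made for illustration, while the scale 400, the step size 16 and the logistic form of the expected preference follow the text.

```python
# Minimal Elo sketch. Function names and example scores are illustrative
# assumptions; the constants 400 and 16 are the values stated in the text.

M = 400.0  # scale parameter of the logistic distribution
K = 16.0   # maximum Elo score change in one step


def expected_preference(r_a, r_b, m=M):
    """Expected probability that a rater prefers image A (score r_a)
    over image B (score r_b)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / m))


def elo_update(r_a, r_b, a_chosen, k=K, m=M):
    """Return the updated pair (r_a, r_b) after one pairwise judgement."""
    p_a = expected_preference(r_a, r_b, m)
    p_b = 1.0 - p_a
    s_a = 1.0 if a_chosen else 0.0  # 1 if A is chosen, 0 otherwise
    s_b = 1.0 - s_a
    return r_a + k * (s_a - p_a), r_b + k * (s_b - p_b)


# Hypothetical scores: image B leads image A by 200 points.
r_a, r_b = 1400.0, 1600.0
print(round(expected_preference(r_b, r_a), 2))  # chance that B is preferred
print(elo_update(r_a, r_b, a_chosen=True))      # unexpected outcome, large change
print(elo_update(r_a, r_b, a_chosen=False))     # expected outcome, small change
```

With these hypothetical scores, an unexpected win for the lower-rated image moves both scores by roughly 12 points, whereas the expected outcome moves them by roughly 4 points, and the sum of the two scores is unchanged in either case.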
Another advantage of employing the Elo system is that our dataset is dynamic and can be extended in the future. The Elo system has been widely used to evaluate the relative level of players in electronic games (Wang et al., 2015), where the players are constantly changing and the Elo system can provide ratings for new players within a few games. Recall that one of the chief reasons that these IQA methods are facing challenges is the invention of GANs and GAN-based IR methods. What if other novel image generation technologies are proposed in the future? Do people need to build a new dataset to include those new algorithms? With the extensible nature of the Elo system, one can easily add new distortions to this dataset and follow the same rating process. The Elo system will automatically adjust the Elo scores of all the distortions without re-rating the old ones.
|Ma et al.||0.4526||0.4963||0.3676||0.6176||0.7124||0.0545|
3.4 Image Quality Opinion Scoring (IQOS) System.
Many prior works have proposed platforms and software to collect subjective evaluations, such as Amazon Mechanical Turk (MTurk) and Versus (Vuong et al., 2018). However, the Elo rating process is dynamic and cannot simply be carried out with MTurk. In this work, we develop a new web-based image quality opinion scoring system (IQOS), which integrates both the Swiss rating system and the Elo rating system. The back-end of this system is powered by the Flask framework and the web-based user interface is powered by Vue.js. A screenshot of the interface is shown in Figure 5. For each reference image, we show two of its distorted images. The raters are required to choose the distorted image that differs less from the reference image by directly clicking on it. We ask the raters to make their decision in a short time. For two images that cannot be easily distinguished, we ask raters to choose either of them at random. When the number of judgements is large enough, the probabilities of the two images being selected will be close. The Elo system will then assign these images very close Elo scores, which is consistent with the fact that they do have similar perceptual quality. We also provide an overview for each reference image to show its position in the original whole image. This software is easy to use, and we will make it open-source as a contribution to the community.
In our work, more than 500 raters have participated in the experiments, both in a controlled laboratory environment and via the Internet. The raters consist of both professional data annotation teams and volunteers. The observation conditions are controlled within a reasonable range. Note that we do not strictly follow the settings recommended by the ITU (ITU-T, 2012). As these IQA methods are designed for measuring visual quality under unknown conditions in practice, visualizing and analyzing image quality under slightly varying conditions can provide a reasonably good verification of the IQA methods. The non-identical conditions of the experiments reflect how visual quality is evaluated in real practice by different users.
In this section, we conduct a comprehensive study using the proposed PIPAL dataset. We first build a benchmark for IQA methods. Through this benchmark, we can answer the question “can existing IQA methods objectively evaluate recent IR algorithms?” We then build a benchmark for some recent SR algorithms to explore the relationship between the development of IQA methods and IR research. This allows us to answer the question “are we getting better IR algorithms by beating benchmarks on these IQA methods?” Finally, we study the characteristics of GAN-based distortions by comparing them with other existing distortion types.
4.1 Evaluation on IQA Methods
We select a set of commonly used IQA methods to build a benchmark. For the FR-IQA methods, we include: PSNR, NQM (Damera-Venkata et al., 2000), UQI (Wang and Bovik, 2002), SSIM (Wang et al., 2004), MS-SSIM (Wang et al., 2003), IFC (Sheikh et al., 2005), VIF (Sheikh and Bovik, 2006), VSNR-FR (Chandler and Hemami, 2007), RFSIM (Zhang et al., 2010), GSM (Liu et al., 2011), SR-SIM (Zhang and Li, 2012), FSIM and FSIM$_C$ (Zhang et al., 2011), SFF (Chang et al., 2013), VSI (Zhang et al., 2014), SCQI (Bae and Kim, 2016), LPIPS-Alex and -VGG (Zhang et al., 2018c; in this work, we use the 0.1 version of LPIPS), PieAPP (Prashnani et al., 2018), DISTS (Ding et al., 2020), and WaDIQaM (Bosse et al., 2017). We also include some popular NR-IQA methods: NIQE (Mittal et al., 2012b), Ma (Ma et al., 2017), and PI (Blau and Michaeli, 2018). Among them, PI is derived from a combination of NIQE and Ma. Note that these NR-IQA methods are designed to measure the intrinsic quality of images, which is different from measuring perceptual similarity; thus a direct comparison between these methods and FR-IQA methods is unfair. All these methods are implemented using the released code.
As in many previous works, we evaluate IQA methods mainly using Spearman rank order correlation coefficients (SRCC) (Sheikh et al., 2006b) and Kendall rank order correlation coefficients (KRCC) (Kendall and Stuart, 1977). These two indexes evaluate the monotonicity of methods: whether the scores of high-quality images are higher (or lower) than low-quality images. We also include the Pearson linear correlation coefficient (PLCC) results, which are often used to evaluate the accuracy of methods. Before calculating PLCC index, we perform the third-order polynomial nonlinear regression. The complete numerical results and full discussion will be shown in Appendix A.
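The three indexes above can be illustrated with a small, dependency-free sketch. It is not the evaluation code behind the reported numbers: the function names are hypothetical, KRCC is computed here in its tau-b form (an assumption about tie handling), and the cubic regression is a plain least-squares polynomial fit applied to standardised inputs before PLCC is computed.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def krcc(x, y):
    """Kendall rank correlation (tau-b form, assumed here for tie handling)."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx * dy > 0:
                conc += 1
            elif dx * dy < 0:
                disc += 1
            elif dx == 0 and dy != 0:
                only_x_tied += 1
            elif dy == 0 and dx != 0:
                only_y_tied += 1
    return (conc - disc) / math.sqrt((conc + disc + only_x_tied) * (conc + disc + only_y_tied))


def cubic_fit(x, y):
    """Least-squares third-order polynomial; returns a function x -> fitted y."""
    n = len(x)
    mx = sum(x) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n) or 1.0
    z = [(a - mx) / sx for a in x]  # standardise inputs for better conditioning
    A = [[sum(v ** (i + j) for v in z) for j in range(4)] for i in range(4)]
    b = [sum((v ** i) * t for v, t in zip(z, y)) for i in range(4)]
    for col in range(4):  # Gaussian elimination with partial pivoting
        piv = max(range(col, 4), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 4):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * 4
    for i in range(3, -1, -1):  # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 4))) / A[i][i]
    return lambda v: sum(c * ((v - mx) / sx) ** i for i, c in enumerate(coef))


def plcc_after_cubic_fit(pred, mos):
    """PLCC between the subjective scores and the cubic-regressed predictions."""
    fit = cubic_fit(pred, mos)
    return pearson([fit(p) for p in pred], mos)


# Hypothetical objective scores and subjective scores (monotone but non-linear).
pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mos = [p ** 3 for p in pred]
print(srcc(pred, mos), krcc(pred, mos), pearson(pred, mos), plcc_after_cubic_fit(pred, mos))
```

In this made-up example the relation is monotone but non-linear, so the rank-based indexes are perfect while the raw linear correlation is not; applying the regression before computing PLCC removes that penalty for non-linearity.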
We first evaluate the IQA methods using all types of distortions in the PIPAL dataset. A convenient presentation of both SRCC and KRCC rank coefficients is shown in Figure 6. The first conclusion is that even the best IQA method, PieAPP, achieves an SRCC of only about 0.71 over all distortion types, which is much lower than its performance on the TID2013 dataset (about 0.90). This indicates that the proposed PIPAL dataset is challenging for existing IQA methods and that there is much room for future improvement. Moreover, a high overall correlation does not necessarily indicate high performance on each sub-type of distortion. As the focus of this paper, we want to analyze the performance of IQA methods on IR results, especially the outputs of GAN-based algorithms. Specifically, we take the SR sub-type as an example and show the performance of IQA methods in evaluating SR algorithm distortions. In Table 3, we show the SRCC results with respect to different distortion sub-types, including traditional distortions, denoising outputs, all SR outputs, traditional SR outputs, PSNR-oriented SR outputs, and GAN-based outputs. Analysis of Table 3 leads to the following conclusions. First, although they perform well in evaluating traditional and PSNR-oriented SR algorithms, all of these IQA methods suffer from a severe performance drop when evaluating GAN-based distortion. This confirms the conclusion of Blau and Michaeli (2018) that higher PSNR may be related to lower perceptual performance for GAN-based algorithms. Second, despite the severe performance drop, several IQA methods (e.g., LPIPS, PieAPP, and DISTS) still outperform the others. Notably, these methods are all recent works based on deep networks. We also show the scatter distributions of the subjective scores vs. the predicted scores of some representative IQA methods on the PIPAL dataset in Figure 8. The curves shown in Figure 8 were obtained by third-order polynomial non-linear fitting.
For the nonlinear regression, we follow the suggestion of Sheikh et al. (2006b). One can see that for the distortions without GAN effects, the objective scores have a higher correlation with the subjective scores than for GAN-based distortion.
We then present the analysis of IQA methods as IR performance measures. In Figure 7, we show the scatter plots of subjective scores vs. some commonly used image quality metrics and their correlations for 23 SR algorithms. Among them, PSNR, SSIM and FSIM are the most common measures, IFC is suggested by Yang et al. (2014) for its good performance on early algorithms, and NIQE and PI are suggested in recent works (Blau and Michaeli, 2018) for their good performance on GAN-based algorithms. LPIPS (Zhang et al., 2018c) and PieAPP (Prashnani et al., 2018) are selected according to our benchmark. As can be seen, PSNR, SSIM and IFC are anti-correlated with the subjective scores and are thus inappropriate for evaluating GAN-based algorithms. NIQE and PI show moderate performance in evaluating IR algorithms, while LPIPS and PieAPP are the most correlated. Note that, unlike the work of Blau and Michaeli (2018), where perceptual quality is evaluated only based on whether the image looks real, we collect subjective scores based on the perceptual similarity with the ground truth. Therefore, for evaluating the performance of IR algorithms from the perspective of reconstruction, the suggestions given by our work are more appropriate. These results show a similar trend to the results presented in Figure 8.
4.2 Evaluation on IR Methods
One of the most important applications of IQA technology is to evaluate IR algorithms. As a means of comparing performance, IQA methods have been the chief reason for the progress in the IR field. However, evaluating IR methods only with specific IQA methods also narrows the focus of IR research and turns it into a competition on quantitative numbers alone (e.g., PSNR competitions (Timofte et al., 2018; Cai et al., 2019) and the PI competition (Blau et al., 2018)). As stated above, existing IQA methods may be inadequate for evaluating IR algorithms. This raises the question: with the focus on beating benchmarks defined by flawed IQA methods, are we getting better IR algorithms? To answer this question, we take the SR task as a representative and select 12 SR algorithms to build a benchmark. These are all representative algorithms, selected from 2013 to the present. We evaluate: YY (Yang and Yang, 2013), TSG (Timofte et al., 2013), A+ (Timofte et al., 2014), SRCNN (Dong et al., 2015b), FSRCNN (Dong et al., 2016), VDSR (Kim et al., 2016), EDSR (Lim et al., 2017), SRGAN (Ledig et al., 2017), RCAN (Zhang et al., 2018d), BOE (Navarrete Michelini et al., 2018), ESRGAN (Wang et al., 2018b), and RankSRGAN (Zhang et al., 2019). The results are shown in Table 9. We present more algorithms in Appendix A. One can observe that before 2017 (when GAN was applied to SR) the PSNR performance improves continuously. In particular, the deep-learning-based algorithms improve PSNR by about 1.4dB. These efforts do improve the subjective performance: the average MOS value increases by about 90 in 4 years. After SRGAN was proposed, the PSNR decreased by about 2.6dB compared to the state-of-the-art PSNR performance at that time (EDSR), but the MOS value suddenly increased by about 50. In contrast, RCAN was proposed to surpass EDSR in terms of PSNR. Its PSNR performance is only a little higher than that of EDSR, but its MOS score is even lower than that of EDSR.
After it was noticed that the mainstream metrics (PSNR and SSIM) conflicted with subjective performance, PI was proposed to evaluate perceptual SR algorithms (Blau and Michaeli, 2018). Since then, ESRGAN and RankSRGAN have continuously improved PI performance. Among them, the latest RankSRGAN achieves the current state of the art in terms of PI. However, it is worth noting that ESRGAN has the highest subjective score, although it has no advantage in PI and NIQE compared with RankSRGAN. Efforts to improve the PI value show limited effects and have failed to continuously improve MOS performance after ESRGAN. These observations inspire us in two aspects: (1) None of the existing IQA methods is always effective in evaluation. With the development of IR technology, new IQA methods need to be proposed accordingly. (2) Excessively optimizing performance on a specific IQA metric may cause a decrease in perceptual quality.
We conduct experiments to explore the outcomes of excessively optimizing IR algorithm performance on an IQA metric by generating “counter-examples”. Conceptually speaking, even one counter-example is sufficient to disprove an IQA method as an IR metric, because algorithms may take advantage of this vulnerability to achieve numerical superiority. We obtain these examples by gradient-based optimization that maximizes the quality predicted by certain IQA methods. According to Blau and Michaeli (2018), distortion and perceptual quality are at odds with each other. To simulate a situation with a perception-distortion trade-off, we constrain the PSNR value to be less than or equal to that of the initial distorted image during optimization. Given a reference image $I_{ref}$ and an initial image $I_0$, we solve the following problem to obtain a possible “counter-example” $I^*$ for a differentiable IQA method $f$:

$$I^* = \arg\min_{I} f(I, I_{ref}) \quad \text{s.t.} \quad \mathrm{PSNR}(I, I_{ref}) \le \mathrm{PSNR}(I_0, I_{ref}), \tag{6}$$

where the constraint ensures that the PSNR value of the result will not increase. We employ the projected gradient method (Boyd et al., 2004) to solve this optimization problem. Its basic idea is to take a step in the descent direction and then project the result onto the feasible region. Because the constraint is convex, the projection to the convex set is unique. The specific iteration formula is as follows:

$$I_{t+1} = P_{\Omega}\big(I_t - \lambda \nabla_{I} f(I_t, I_{ref})\big),$$

where $\Omega$ is the feasible region that satisfies Eq. (6), $P_{\Omega}$ denotes the projection onto $\Omega$, and $\lambda$ is the learning rate. In our experiment, we use the output of ESRGAN as the initial image. Note that for some IQA methods a higher value indicates better perceptual quality; for these methods we perform gradient ascent instead of gradient descent to achieve better quality.
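The projected gradient procedure described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the implementation used in the experiments: the "metric" is a hypothetical weighted squared error standing in for a differentiable IQA method, and the PSNR constraint is expressed through the equivalent condition that the squared error to the reference must not fall below that of the initial image. The projection rescales the error radially back onto the boundary of that set.

```python
import numpy as np


def project_psnr_not_higher(img, ref, min_sq_err):
    """Project img onto {I : ||I - ref||^2 >= min_sq_err}."""
    diff = img - ref
    sq_err = float(np.sum(diff ** 2))
    if sq_err >= min_sq_err:
        return img  # already feasible
    if sq_err == 0.0:
        raise ValueError("projection is undefined at the reference image")
    # Rescale the error so that its squared norm equals min_sq_err.
    return ref + diff * np.sqrt(min_sq_err / sq_err)


def counter_example(ref, init, grad_fn, step=0.1, iters=200):
    """Minimize a metric by gradient steps followed by projection."""
    min_sq_err = float(np.sum((init - ref) ** 2))
    img = init.copy()
    for _ in range(iters):
        img = img - step * grad_fn(img, ref)
        img = project_psnr_not_higher(img, ref, min_sq_err)
    return img


# Hypothetical stand-in for an IQA metric: lower means "better quality".
rng = np.random.default_rng(0)
ref = rng.random((8, 8))
init = ref + 0.1 * rng.standard_normal((8, 8))
weights = np.linspace(0.1, 2.0, 64).reshape(8, 8)


def toy_metric(img, ref):
    return float(np.sum(weights * (img - ref) ** 2))


def toy_metric_grad(img, ref):
    return 2.0 * weights * (img - ref)


result = counter_example(ref, init, toy_metric_grad)
print(toy_metric(init, ref), toy_metric(result, ref))
```

In this toy setting the optimizer moves the error into pixels that the metric weights lightly, so the metric value improves although the PSNR does not, which is the kind of vulnerability that a counter-example is meant to expose.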
The results are shown in Figure 9. We can see that some images show superior numerical performance when evaluated using certain IQA methods, but may not be dominant in other metrics. Their best-cases also show different visual effects. Even for some IQA methods (LPIPS and DISTS) with good performance on GAN-based distortion, their best-cases still contain serious artifacts. This shows the risk of evaluating IR algorithms using these IQA methods. This also indicates that evaluating and developing new IQA methods plays an important role in future research.
4.3 Discussion of GAN-based distortion
Recall that LPIPS, DISTS and PieAPP perform relatively well in evaluating GAN-based distortion. The effectiveness of these methods may be attributed to the following reasons. Compared with other IQA methods, deep-learning-based IQA methods can extract image features more effectively. For traditional distortion types, such as blur, compression and noise, the distorted images usually disobey the prior distribution of natural images. Early methods can assess these images by measuring low-level statistical features such as image contrast, gradient, and structural information. These strategies are also effective for the outputs of traditional and PSNR-oriented algorithms. However, most of these strategies fail in the case of GAN-based distortion, as such distortions may have image statistics similar to those of the reference image; the way in which GAN-based distortion differs from the reference image is less apparent. In this case, deep networks can capture these less apparent features and distinguish such distortions to some extent.
To explore the characteristics and difficulties of GAN-based distortion, we compare it with some well-studied distortions. Note that, for a good IQA method, the subjective scores in the scatter plot should increase consistently with the objective values, and the samples should be well clustered along the fitted curve. In the case of two distortion types, if the IQA method behaves similarly for both of them, their samples on the scatter plot will also be well clustered and overlapping. For example, additive Gaussian noise and lossy image compression are well handled by most IQA methods. When calculating the objective values using FSIM, samples of both distortions are clustered, as shown in Figure 10 (a). This indicates that FSIM can adequately characterize the visual quality of an image damaged by these two types of distortion. We then study GAN-based distortion by comparing it with some existing distortion types using FSIM. Figure 10 (b) shows the results for GAN-based distortion and additive Gaussian noise, and Figure 10 (c) shows the results for GAN-based distortion and Gaussian blur. It can be seen that the samples of Gaussian noise and Gaussian blur barely intersect with the GAN-based samples. FSIM largely underestimates the visual quality of GAN-based distortion. In Figure 10 (d), we show the results for GAN-based distortion and spatial warping distortion. As can be seen, these two distortion types behave unexpectedly similarly. FSIM cannot handle them and shows the same random and diffuse state. The quantitative results also verify this phenomenon. For the spatial warping distortion type, the SRCC of FSIM is 0.3070, which is close to its SRCC on GAN-based distortion, 0.4047. We thus argue that spatial warping distortion and GAN-based distortion pose similar challenges to FSIM.
As revealed in experimental psychology (Kolers, 1962; Kahneman, 1968), the interaction or mutual interference between visual information may cause the Visual Masking effects. According to this theory, some key reasons that IQA methods tend to underestimate both GAN-based distortion and spatial warping distortion are as follows. Firstly, for the edges with strong intensity change, the human visual system (HVS) is sensitive to the contour and shape, but not sensitive to the error and misalignment of the edges. Secondly, the ability of HVS to distinguish texture decreases in the region with dense textures. When the extracted features of the textures are similar, the HVS will ignore part of the subtle differences and misalignment of textures. However, both the traditional and deep-learning-based IQA methods require a good alignment for the inputs. This partially causes the drop of performance of these IQA methods on GAN-based distortion. This finding provides an insight that if we explicitly consider the spatial misalignment of the inputs, we may improve the performance on GAN-based distortion. We will discuss this possibility in Sec 5.
5 Improving IQA Networks
As stated in Sec 4.3, we argue that deep-learning-based methods outperform others due to their excellent feature extraction capacity. However, when it comes to distortions with spatial misalignment, such as GAN-based distortion, these methods obtain unsatisfactory performance, partially because of their low tolerance to misalignment. This finding suggests that if we explicitly consider spatial misalignment, we may improve the performance of IQA methods on GAN-based distortion. In this section, we explore this possibility by introducing anti-aliasing pooling layers and a spatially robust comparison operation into the IQA network.
5.1 Anti-aliasing Pooling and Space Warping Difference
FR-IQA networks can be roughly divided into three sub-models: feature extraction, feature comparison and subjective score regression. The feature extraction sub-model extracts image features by cascaded convolution operations. The feature comparison sub-model compares the features of two images; the most direct and commonly used method is to calculate the differences between the image feature vectors (Zhang et al., 2018c). Finally, the subjective score regression sub-model calculates the final similarity scores based on the feature differences. If we want the IQA networks to be robust to small misalignment, the feature extraction and feature comparison sub-models should at least be invariant to this misalignment/shift. However, these two parts are usually not shift-invariant in existing deep IQA networks. We discuss the feature extraction and feature comparison sub-models in turn.
For the feature extraction sub-model, standard convolution operations should in principle be shift-invariant. However, some commonly used downsampling layers, such as max/average pooling and strided convolution, ignore the sampling theorem (Azulay and Weiss, 2019) and are therefore no longer shift-invariant. These downsampling layers are widely used in deep networks such as VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al., 2012), which are popular backbone architectures for feature extraction (e.g., in LPIPS and DISTS). As suggested by Zhang (2019), one can fix this by introducing anti-aliasing pooling. In our work, we follow Hénaff and Simoncelli (2016): to avoid aliasing when downsampling by a factor of two, the Nyquist theorem requires blurring with a filter whose cutoff frequency is below $\pi/2$ radians/sample. We therefore introduce the $\ell_2$ pooling:

$$P(x) = \sqrt{g * (x \odot x)},$$

where $\odot$ represents the Hadamard product, $*$ represents the convolution operation, and $g$ is the blurring kernel, implemented by a Hanning window that approximately enforces the Nyquist criterion. In this work, we replace all the max pooling layers with $\ell_2$ pooling layers in our IQA network to avoid possible aliasing in feature extraction.
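The anti-aliased pooling described above (square the input, blur it with a Hanning kernel, subsample, take the square root), commonly referred to as $\ell_2$ pooling, can be sketched for a single 2-D feature channel as follows. This is a minimal NumPy illustration and not the network layer itself; the 3x3 kernel size, the stride of 2 and the edge padding are assumptions made for the example.

```python
import numpy as np


def hanning_kernel(size=3):
    """Normalized 2-D Hanning window (the zero end points are dropped)."""
    window = np.hanning(size + 2)[1:-1]
    kernel = np.outer(window, window)
    return kernel / kernel.sum()


def l2_pool(x, kernel, stride=2):
    """sqrt(blur(x * x)), evaluated only at every `stride`-th position."""
    k = kernel.shape[0]
    pad = k // 2
    # Edge padding is an assumption; other border handling is possible.
    squared = np.pad(x * x, pad, mode="edge")
    rows = range(0, x.shape[0], stride)
    cols = range(0, x.shape[1], stride)
    out = np.empty((len(rows), len(cols)))
    for oi, i in enumerate(rows):
        for oj, j in enumerate(cols):
            patch = squared[i:i + k, j:j + k]  # window centred on (i, j)
            out[oi, oj] = np.sqrt(np.sum(kernel * patch))
    return out


kernel = hanning_kernel()
features = np.arange(36, dtype=float).reshape(6, 6)
pooled = l2_pool(features, kernel)
print(pooled.shape)
```

Because every output value is a smooth weighted average over a neighborhood instead of a hard maximum, a one-pixel shift of the input changes the output gradually, which is the property the feature extractor needs.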
We then discuss the feature comparison sub-model. In most existing IQA networks, the comparison is conducted with element-wise subtraction or a Euclidean distance between the two extracted features. These operations require good alignment and are sensitive to spatial shift. Here, we explicitly consider robustness to spatial misalignment and propose a Space Warping Difference (SWD) layer, which compares features not only at the corresponding position but also within a small range around it. The SWD layer is illustrated in Figure 11. For two images $I_1$ and $I_2$ and their extracted features $F_1$ and $F_2$, the SWD layer is formulated as follows:

$$\mathrm{SWD}(F_1, F_2)_{(i,j)} = F_1^{(i,j)} - F_2^{(u,v)},$$

where $p$ indicates the searching range, $F^{(i,j)}$ indicates the feature vector at location $(i,j)$, and $(u,v)$ indicates the location at which the distance between the two feature vectors achieves its minimum value:

$$(u,v) = \arg\min_{(m,n) \in \mathcal{N}_p(i,j)} \big\| F_1^{(i,j)} - F_2^{(m,n)} \big\|_2^2,$$

with $\mathcal{N}_p(i,j)$ denoting the set of locations within the searching range $p$ around $(i,j)$. Because a large $p$ makes the computational complexity rise sharply, we choose $p = 3$ in our work for a better performance-speed trade-off.
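The comparison performed by the SWD layer can be illustrated with a brute-force sketch: for every location of the first feature map, search a small window of the second feature map for the closest feature vector and return the difference to it. The `radius` parameter below is a hypothetical way of expressing the searching range (a window of `2 * radius + 1` positions per axis) and is an assumption of this example, as is the loop-based implementation.

```python
import numpy as np


def swd(f1, f2, radius=1):
    """Difference of each f1 vector to its nearest f2 vector in a window.

    f1 and f2 have shape (channels, height, width).
    """
    _, h, w = f1.shape
    out = np.empty_like(f1)
    for i in range(h):
        for j in range(w):
            best_diff, best_dist = None, np.inf
            # Search the window around (i, j), clipped at the borders.
            for u in range(max(0, i - radius), min(h, i + radius + 1)):
                for v in range(max(0, j - radius), min(w, j + radius + 1)):
                    diff = f1[:, i, j] - f2[:, u, v]
                    dist = float(np.dot(diff, diff))
                    if dist < best_dist:
                        best_dist, best_diff = dist, diff
            out[:, i, j] = best_diff
    return out


# Toy check: the second feature map is the first one shifted by one pixel.
rng = np.random.default_rng(0)
feat_a = rng.random((4, 6, 6))
feat_b = np.roll(feat_a, 1, axis=2)

plain = feat_a - feat_b          # element-wise comparison
warped = swd(feat_a, feat_b, 1)  # shift-tolerant comparison
print(np.abs(plain).mean(), np.abs(warped[:, :, :-1]).mean())
```

For the one-pixel shift in this toy case, the element-wise difference is large everywhere, whereas the SWD output vanishes wherever the shifted counterpart lies inside the window. With `radius=0` the layer reduces to plain subtraction.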
|Method||PIPAL (full set)||TID2013||Trad. & PSNR. SR||GAN distort.||PIPAL (test set)|
|Ma et al.||0.4216||0.2951||0.3496||0.2417||0.6774||0.4863||0.0545||0.0363||0.2843||0.2040|
Features of classification networks are useful for building perceptual metrics. We employ an AlexNet pre-trained on ImageNet (Russakovsky et al., 2015) as the feature extraction backbone and replace all its max pooling layers with the proposed $\ell_2$ pooling layers. Using the feature extraction sub-model, we obtain features $F_1^k$ and $F_2^k$ after the $k$-th convolution layer for the two input images $I_1$ and $I_2$, respectively. Note that the feature extraction backbone is replaceable; alternative choices include VGG (Simonyan and Zisserman, 2014), SqueezeNet, and so on. For each pair of extracted features, we use the SWD layer to calculate the difference $D^k = \mathrm{SWD}(F_1^k, F_2^k)$. We then perform perceptual score regression with a small sub-network consisting of two convolution layers. This sub-network takes the feature difference $D^k$ as input and gives a predicted similarity score $s^k$ for each feature pair. Finally, we sum up all the $s^k$ to obtain the final perceptual similarity $S = \sum_k s^k$.
We train our network using the BAPPS dataset (Zhang et al., 2018c). We employ a ranking cross-entropy loss to train the SWDN network. The training procedure is illustrated in Figure 13. At each step, we randomly choose one reference image $I_{ref}$ and two of its distorted images $I_A$ and $I_B$. We calculate the distances $d_A$ and $d_B$ between the reference image and the two distorted images. Given these two distances, we train a small network $G$ on top to map $(d_A, d_B)$ to a probability $\hat{h} = G(d_A, d_B)$, where $\hat{h}$ represents the predicted preference between $I_A$ and $I_B$, and $h$ is the ground-truth preference probability provided in the BAPPS dataset. The architecture of $G$ consists of two 32-unit fully connected layers with ReLU activations, followed by a 1-unit fully connected layer and a sigmoid activation. The final loss function is as follows:

$$\mathcal{L}(I_{ref}, I_A, I_B, h) = -h \log \hat{h} - (1 - h) \log (1 - \hat{h}).$$
We note that the proposed PIPAL dataset can also be used to train our network. In that case, the ground-truth preference probability is obtained from the Elo scores of the two distorted images $I_A$ and $I_B$ using Eq. (1). For optimization, we use Adam (Kingma and Ba, 2014). We implement our models with the PyTorch framework (Paszke et al., 2019), and the whole training process takes about 12 hours.
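The training signal described above can be sketched in a few lines. The example assumes that the preference probability derived from Elo scores follows the standard Elo logistic expectation with a scale of 400, and that the ranking cross-entropy is the binary cross-entropy between the predicted and the ground-truth preference; both are assumptions of this illustration, and the function names are hypothetical.

```python
import math


def elo_preference(score_a, score_b, scale=400.0):
    """Probability that image A is preferred over image B.

    Standard Elo expectation; the scale of 400 is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


def ranking_cross_entropy(pred, target, eps=1e-12):
    """Binary cross-entropy between predicted and target preference."""
    pred = min(max(pred, eps), 1.0 - eps)  # guard against log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


# Hypothetical Elo scores of two distorted versions of one reference image.
target = elo_preference(1550.0, 1450.0)
for pred in (0.2, 0.5, target, 0.9):
    print(f"pred={pred:.3f}  loss={ranking_cross_entropy(pred, target):.4f}")
```

The loss is smallest when the predicted preference equals the target probability, so the network is pushed to reproduce how strongly observers prefer one image, not only which one they prefer.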
|No.||Pooling||SWD Layer||PIPAL (full set)||PIPAL (GAN)|
In this section, we experimentally verify the effectiveness of the proposed SWDN network. We compare it with some commonly used IQA methods, including PSNR, SSIM, FSIM, NIQE, Ma, PieAPP, DISTS, LPIPS-VGG, and LPIPS-Alex. We test these IQA methods on the proposed PIPAL dataset. We also report results on the TID2013 dataset to show their performance on traditional distortion types. The results are shown in Table 5. Among the tested IQA methods, PieAPP and DISTS are trained with other datasets. Although PieAPP achieves the best SRCC performance on both the PIPAL dataset and TID2013, the training data of these methods are not publicly available, so a direct comparison with them is unfair. LPIPS-VGG and LPIPS-Alex are trained using BAPPS, the same dataset as ours; thus they are suitable baselines for the proposed algorithm. As can be seen, the proposed SWDN obtains comparable performance on both the PIPAL dataset and the TID2013 dataset. On GAN-based distortion, the proposed SWDN achieves better performance than the existing IQA methods. In particular, SWDN outperforms LPIPS-Alex and LPIPS-VGG. Note that LPIPS-Alex uses the same feature extraction backbone architecture as SWDN, which indicates that the proposed strategies are effective.
The proposed PIPAL dataset can also be used to train the SWDN network. We split the PIPAL dataset into training and testing sets. For the reference images, we assign 100 reference images to the testing set and 150 images to the training set. For the distorted images, we randomly assign half of the distorted images of the SR distortion sub-type to the testing set, including the outputs of traditional, PSNR-oriented, and GAN-based SR algorithms. For comparison, we train SWDN and LPIPS-Alex using the PIPAL training set. The results are shown in the last two rows of Table 5 and are marked with “*”. As one can observe, compared with the versions trained using the BAPPS dataset, both LPIPS-Alex and SWDN trained on the PIPAL dataset achieve better performance on TID2013 and the PIPAL test set, which indicates that building a large-scale dataset with GAN-based distortion can directly help to obtain better IQA methods. When using the same training dataset, the proposed SWDN also outperforms LPIPS-Alex.
To investigate the behavior of the proposed SWDN, we conduct ablation studies to show the effect of its different components. For the baseline architecture, we replace the $\ell_2$ pooling layers with the original max-pooling layers and remove the SWD layers. The results of the 8 experiments are shown in Table 6. First, as shown by experiments (1) and (2) in Table 6, the $\ell_2$ pooling layers improve the performance on GAN-based distortion at a small cost. Second, we show the results with only SWD layers, where the hyper-parameter $p$ (the searching range) varies from 1 to 5. As can be observed, for all values of $p$, the proposed SWD layer improves the performance on GAN-based distortion. When $p$ is set to 3, the network with SWD layers achieves the best performance. Thus, we choose $p = 3$ for a better performance-speed trade-off. Finally, we combine the $\ell_2$ pooling layers and the SWD layers to form the final SWDN network; the results are shown in Table 6 (8). The final SWDN network achieves the best performance on the GAN-based distortion type.
In this paper, we contribute a novel large-scale IQA dataset, namely the PIPAL dataset. PIPAL contains 250 reference images, 40 distortion types, 29k distorted images, and more than one million human ratings. In particular, we include the outputs of 24 GAN-based algorithms as a new GAN-based distortion type. We employ the Elo rating system to assign the Mean Opinion Scores (MOS). Based on the PIPAL dataset, we establish benchmarks for both IQA methods and SR algorithms. Our results indicate that existing IQA methods face challenges in evaluating perceptual IR algorithms, especially GAN-based algorithms. With the development of IR technology, new IQA methods need to be proposed accordingly. We also propose a new IQA network called the Space Warping Difference Network, which consists of $\ell_2$ pooling layers and novel Space Warping Difference layers. Experiments demonstrate the effectiveness of the proposed network.
Acknowledgements. This work is partially supported by the National Natural Science Foundation of China (61906184), Science and Technology Service Network Initiative of Chinese Academy of Sciences (KFJ-STS-QYZX-092), Shenzhen Basic Research Program (JSGG20180507182100698, CXB20110422-0032A), the Joint Lab of CAS-HK, Shenzhen Institute of Artificial Intelligence and Robotics for Society, and SenseTime Research.
- Agustsson and Timofte (2017) Agustsson E, Timofte R (2017) Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 126–135
- Azulay and Weiss (2019) Azulay A, Weiss Y (2019) Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research 20(184):1–25
- Bae and Kim (2016) Bae SH, Kim M (2016) A novel image quality assessment with globally and locally consilient visual quality perception. IEEE Transactions on Image Processing 25(5):2392–2406
- Bisberg and Cardona-Rivera (2019) Bisberg AJ, Cardona-Rivera RE (2019) Scope: Selective cross-validation over parameters for elo. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol 15, pp 116–122
- Blau and Michaeli (2018) Blau Y, Michaeli T (2018) The perception-distortion tradeoff. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6228–6237
- Blau et al. (2018) Blau Y, Mechrez R, Timofte R, Michaeli T, Zelnik-Manor L (2018) The 2018 pirm challenge on perceptual image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 0–0
- Bosse et al. (2017) Bosse S, Maniry D, Müller KR, Wiegand T, Samek W (2017) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27(1):206–219
- Boyd et al. (2004) Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press
- Cai et al. (2019) Cai J, Gu S, Timofte R, Zhang L (2019) Ntire 2019 challenge on real image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 0–0
- Chandler and Hemami (2007) Chandler DM, Hemami SS (2007) Vsnr: A wavelet-based visual signal-to-noise ratio for natural images. IEEE transactions on image processing 16(9):2284–2298
- Chang et al. (2013) Chang HW, Yang H, Gan Y, Wang MH (2013) Sparse feature fidelity for perceptual image quality assessment. IEEE Transactions on Image Processing 22(10):4007–4018
- Choi et al. (2018) Choi JH, Kim JH, Cheon M, Lee JS (2018) Deep learning-based image super-resolution considering quantitative and perceptual quality. arXiv preprint arXiv:1809.04789
- Dabov et al. (2007) Dabov K, Foi A, Katkovnik V, Egiazarian K (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing 16(8):2080–2095
- Damera-Venkata et al. (2000) Damera-Venkata N, Kite TD, Geisler WS, Evans BL, Bovik AC (2000) Image quality assessment based on a degradation model. IEEE transactions on image processing 9(4):636–650
- Ding et al. (2020) Ding K, Ma K, Wang S, Simoncelli EP (2020) Image quality assessment: Unifying structure and texture similarity. CoRR abs/2004.07728, URL https://arxiv.org/abs/2004.07728
- Dong et al. (2015a) Dong C, Deng Y, Change Loy C, Tang X (2015a) Compression artifacts reduction by a deep convolutional network. In: Proceedings of the IEEE International Conference on Computer Vision, pp 576–584
- Dong et al. (2015b) Dong C, Loy CC, He K, Tang X (2015b) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2):295–307
- Dong et al. (2016) Dong C, Loy CC, Tang X (2016) Accelerating the super-resolution convolutional neural network. In: European conference on computer vision, Springer, pp 391–407
- Elo (1978) Elo AE (1978) The rating of chessplayers, past and present. Arco Pub.
- Elo (2008) Elo AE (2008) Logistic probability as a rating basis. The Rating of Chessplayers, Past&Present Bronx NY 10453
- Goodfellow et al. (2014) Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
- Group et al. (2000) Group VQE, et al. (2000) Final report from the video quality experts group on the validation of objective models of video quality assessment. In: VQEG meeting, Ottawa, Canada, March, 2000
- Gu et al. (2019) Gu J, Lu H, Zuo W, Dong C (2019) Blind super-resolution with iterative kernel correction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1604–1613
- Gu et al. (2020a) Gu J, Cai H, Chen H, Ye X, Ren JS, Dong C (2020a) Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In: Proceedings of the European Conference on Computer Vision (ECCV)
- Gu et al. (2020b) Gu J, Shen Y, Zhou B (2020b) Image processing using multi-code gan prior. In: Proceedings of the IEEE International Conference on Computer Vision
- Hénaff and Simoncelli (2016) Hénaff OJ, Simoncelli E (2016) Geodesics of learned representations. In: International Conference on Learning Representations
- ITU-T (2012) ITU-T P (2012) 1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. ITU-T Recommendation p 1401
- Jain and Seung (2009) Jain V, Seung S (2009) Natural image denoising with convolutional networks. In: Advances in neural information processing systems, pp 769–776
- Kahneman (1968) Kahneman D (1968) Method, findings, and theory in studies of visual masking. Psychological Bulletin 70(6p1):404
- Kendall and Stuart (1977) Kendall M, Stuart A (1977) The advanced theory of statistics; charles griffin & co. Ltd(London) 83:62013
- Kim et al. (2016) Kim J, Kwon Lee J, Mu Lee K (2016) Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1646–1654
- Kim and Lee (2018) Kim JH, Lee JS (2018) Deep residual network with enhanced upscaling module for super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
- Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
- Kolers (1962) Kolers PA (1962) Intensity and contour effects in visual masking. Vision Research 2(9-10):277–IN4
- Krizhevsky et al. (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
- Larson and Chandler (2010) Larson EC, Chandler DM (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19(1):011006
- Ledig et al. (2017) Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4681–4690
- Lim et al. (2017) Lim B, Son S, Kim H, Nah S, Mu Lee K (2017) Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 136–144
- Liu et al. (2011) Liu A, Lin W, Narwaria M (2011) Image quality assessment based on gradient similarity. IEEE Transactions on Image Processing 21(4):1500–1512
- Ma et al. (2017) Ma C, Yang CY, Yang X, Yang MH (2017) Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding 158:1–16
- Mittal et al. (2012a) Mittal A, Moorthy AK, Bovik AC (2012a) No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21(12):4695–4708
- Mittal et al. (2012b) Mittal A, Soundararajan R, Bovik AC (2012b) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20(3):209–212
- Navarrete Michelini et al. (2018) Navarrete Michelini P, Zhu D, Liu H (2018) Multi-scale recursive and perception-distortion controllable image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV)
- Paszke et al. (2019) Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems, pp 8026–8037
- Ponomarenko et al. (2009) Ponomarenko N, Lukin V, Zelensky A, Egiazarian K, Carli M, Battisti F (2009) Tid2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10(4):30–45
- Ponomarenko et al. (2015) Ponomarenko N, Jin L, Ieremeiev O, Lukin V, Egiazarian K, Astola J, Vozel B, Chehdi K, Carli M, Battisti F, et al. (2015) Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication 30:57–77
- Prashnani et al. (2018) Prashnani E, Cai H, Mostofi Y, Sen P (2018) Pieapp: Perceptual image-error assessment through pairwise preference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1808–1817
- Qian et al. (2019) Qian G, Gu J, Ren JS, Dong C, Zhao F, Lin J (2019) Trinity of pixel enhancement: a joint solution for demosaicking, denoising and super-resolution. arXiv preprint arXiv:1905.02538
- Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115(3):211–252
- Sajjadi et al. (2017) Sajjadi MS, Scholkopf B, Hirsch M (2017) Enhancenet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4491–4500
- Sheikh and Bovik (2006) Sheikh HR, Bovik AC (2006) Image information and visual quality. IEEE Transactions on image processing 15(2):430–444
- Sheikh et al. (2005) Sheikh HR, Bovik AC, De Veciana G (2005) An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on image processing 14(12):2117–2128
- Sheikh et al. (2006a) Sheikh HR, Sabir MF, Bovik AC (2006a) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15(11):3440–3451
- Sheikh et al. (2006b) Sheikh HR, Sabir MF, Bovik AC (2006b) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15(11):3440–3451
- Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
- Timofte et al. (2013) Timofte R, De Smet V, Van Gool L (2013) Anchored neighborhood regression for fast example-based super-resolution. In: Proceedings of the IEEE international conference on computer vision, pp 1920–1927
- Timofte et al. (2014) Timofte R, De Smet V, Van Gool L (2014) A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Asian conference on computer vision, Springer, pp 111–126
- Timofte et al. (2017) Timofte R, Agustsson E, Van Gool L, Yang MH, Zhang L (2017) Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 114–125
- Timofte et al. (2018) Timofte R, Gu S, Wu J, Van Gool L (2018) Ntire 2018 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 852–863
- Vasu et al. (2018) Vasu S, TM N, Rajagopalan A (2018) Analyzing perception-distortion tradeoff using enhanced perceptual super-resolution network. In: European Conference on Computer Vision (ECCV) Workshops
- Vu et al. (2018) Vu T, Luu TM, Yoo CD (2018) Perception-enhanced image super-resolution via relativistic generative adversarial networks. In: The European Conference on Computer Vision (ECCV) Workshops
- Vuong et al. (2018) Vuong J, Kaur S, Heinrich J, Ho BK, Hammang CJ, Baldi BF, O’Donoghue SI (2018) Versus—a tool for evaluating visualizations and image quality using a 2afc methodology. Visual Informatics 2(4):225–234
- Wang et al. (2015) Wang H, Yang HT, Sun CT (2015) Thinking style and team competition game performance and enjoyment. IEEE Transactions on Computational Intelligence and AI in Games 7(3):243–254
- Wang et al. (2018a) Wang X, Yu K, Dong C, Change Loy C (2018a) Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 606–615
- Wang et al. (2018b) Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Loy CC (2018b) Esrgan: Enhanced super-resolution generative adversarial networks. In: European Conference on Computer Vision, Springer, pp 63–79
- Wang and Bovik (2002) Wang Z, Bovik AC (2002) A universal image quality index. IEEE signal processing letters 9(3):81–84
- Wang et al. (2003) Wang Z, Simoncelli EP, Bovik AC (2003) Multiscale structural similarity for image quality assessment. In: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol 2, pp 1398–1402
- Wang et al. (2004) Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600–612
- Xu et al. (2018) Xu J, Zhang L, Zhang D (2018) A trilateral weighted sparse coding scheme for real-world image denoising. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 20–36
- Yang and Yang (2013) Yang CY, Yang MH (2013) Fast direct super-resolution by simple functions. In: Proceedings of the IEEE international conference on computer vision, pp 561–568
- Yang et al. (2014) Yang CY, Ma C, Yang MH (2014) Single-image super-resolution: A benchmark. In: European Conference on Computer Vision, Springer, pp 372–386
- Yang et al. (2010) Yang J, Wright J, Huang TS, Ma Y (2010) Image super-resolution via sparse representation. IEEE transactions on image processing 19(11):2861–2873
- Zhang et al. (2017) Zhang K, Zuo W, Chen Y, Meng D, Zhang L (2017) Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26(7):3142–3155
- Zhang et al. (2018a) Zhang K, Zuo W, Zhang L (2018a) Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing 27(9):4608–4622
- Zhang et al. (2018b) Zhang K, Zuo W, Zhang L (2018b) Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3262–3271
- Zhang and Li (2012) Zhang L, Li H (2012) Sr-sim: A fast and high performance iqa index based on spectral residual. In: 2012 19th IEEE international conference on image processing, IEEE, pp 1473–1476
- Zhang et al. (2010) Zhang L, Zhang L, Mou X (2010) Rfsim: A feature based image quality assessment metric using riesz transforms. In: 2010 IEEE International Conference on Image Processing, IEEE, pp 321–324
- Zhang et al. (2011) Zhang L, Zhang L, Mou X, Zhang D (2011) Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing 20(8):2378–2386
- Zhang et al. (2014) Zhang L, Shen Y, Li H (2014) Vsi: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing 23(10):4270–4281
- Zhang (2019) Zhang R (2019) Making convolutional networks shift-invariant again. In: International Conference on Machine Learning, pp 7324–7334
- Zhang et al. (2018c) Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018c) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 586–595
- Zhang et al. (2019) Zhang W, Liu Y, Dong C, Qiao Y (2019) Ranksrgan: Generative adversarial networks with ranker for image super-resolution
- Zhang et al. (2018d) Zhang Y, Li K, Li K, Wang L, Zhong B, Fu Y (2018d) Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 286–301
Appendix A More Results
In this section, we provide more benchmark results. We evaluate IQA methods using the Spearman rank order correlation coefficient (SRCC) (Sheikh et al., 2006b) and the Kendall rank order correlation coefficient (KRCC) (Kendall and Stuart, 1977). These two indexes evaluate the monotonicity of the methods: whether the scores of high-quality images are higher (or lower) than those of low-quality images. We also provide the Pearson linear correlation coefficient (PLCC) results. The PLCC index evaluates the accuracy of the methods. Before calculating the PLCC index, we perform a nonlinear regression that fits the objective scores to the subjective scores with a third-order polynomial. The KRCC results are shown in Table 7 and the PLCC results are shown in Table 8.
In the paper, we prefer to analyze SRCC and KRCC because the PLCC index may overestimate the performance when the IQA method cannot effectively indicate image similarity. As shown in Table 8, some IQA methods, such as SR-SIM, IFC and MAD, obtain high PLCC performance, which is inconsistent with the conclusion reached by observing SRCC performance. We argue that this is because these methods fail to predict image quality when the Elo score is high, so the samples with low Elo scores dominate the PLCC performance. We show the scatter plots of SR-SIM, IFC and MAD in Figure 14. As one can see, the fitted curves of all three IQA methods flatten in the region of better image quality, which means that they fail to distinguish quality levels there. After the nonlinear fitting, the abscissa values of these samples are concentrated in a small interval and thus have little influence on the calculation of the PLCC index. We verify this phenomenon through an experiment: Figure 14 also shows the result of removing these samples, and the PLCC performance does not change significantly. This indicates that PLCC is inappropriate for this evaluation.
Finally, we show more results for the SR benchmark in Table 9, including more algorithms and IQA methods.
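The three indexes described in this appendix can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the tables: the function names are hypothetical, KRCC is computed as the tau-a variant (tied pairs count as neither concordant nor discordant), and the third-order polynomial is fitted by ordinary least squares through the normal equations.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def krcc(x, y):
    """Kendall tau-a (assumed variant): (concordant - discordant) / all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def polyfit3(x, y):
    """Least-squares cubic fit; returns [c0, c1, c2, c3] for c0 + c1*x + ..."""
    m = 4
    a = [[sum(xi ** (r + c) for xi in x) for c in range(m)] for r in range(m)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(m)]
    # Gaussian elimination with partial pivoting on the normal equations.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(a[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = (b[r] - tail) / a[r][r]
    return coef

def plcc_after_fit(objective, subjective):
    """PLCC between cubic-mapped objective scores and subjective scores."""
    c = polyfit3(objective, subjective)
    fitted = [c[0] + c[1] * o + c[2] * o ** 2 + c[3] * o ** 3 for o in objective]
    return pearson(fitted, subjective)
```

Because SRCC and KRCC depend only on the ordering of the scores, they are unaffected by any monotonic mapping, whereas PLCC depends on the fitted mapping and on how the samples are distributed along the fitted curve.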
Appendix B Details of the Distortion Types
Recall that we have 40 distortion types and 116 levels of distortion for each reference image in the PIPAL dataset. In addition to the traditional distortion types, we also include the outputs of a variety of real IR algorithms. In Table 10, we present the details of these distortion types, including the selected algorithms and the parameters to create distorted images (such as the SR factors and noise levels) and the implementation details of the traditional distortions in the proposed dataset.
For the spatial warping distortion, we apply local spatial shift to each pixel on several randomly selected regions in the image. Given a line with starting point and termination point , the per-pixel shift map around this line can be computed using the following formula:
where is the original location of the point , and is the radius of influence. The source pixels are then resampled to form the target pixel. For an input image, we randomly select warping points and warping distance (distance between point and ) to perform locally spatial warping. We set four distortion levels for spatial warping with parameters , and .
Appendix C Reliability and Expandability of Elo System
In the PIPAL dataset, we employ the Elo rating system to obtain mean opinion scores (MOS) for distorted images. The Elo rating system is a statistics-based rating method that brings pairwise preference probabilities and a rating system together. In this section, we demonstrate the effectiveness and expandability of the Elo rating system with quantitative experiments. We also perform ablation studies on some of the hyper-parameters of the Elo system.
To validate the Elo rating system, we build a simulation set consisting of 150 populations and assign ground-truth scores to them. These scores follow two constraints: (1) The ranking is transitive: if and , then . (2) Given a ranking, the expectation of any comparison should be consistent with the ranking order: if , then the probability of is higher than the probability of . We simulate the process of scoring these populations using the Elo system. We then use the Spearman rank correlation coefficient between the Elo rating results and the ground truth to measure the effectiveness of the Elo system.
We first evaluate the Elo system with different parameters of . As can be seen in Figure 15, all the experiments converge quickly. When becomes smaller, the convergence becomes slower but more stable. Considering the trade-off between convergence speed and rating accuracy, we use to build our dataset. In Figure 16, we show the experiments with different sampling strategies. Recall that in our dataset, we intentionally select distorted images with similar Elo scores for users to judge. The experiments show that selecting images with similar scores converges faster and to a better result than random selection. We also conduct experiments to show the effect of parameter . As can be seen in Figure 17, the choice of does not affect the convergence performance and accuracy of the Elo system. However, changing enables us to control the win probability estimation based on Elo scores. Figure 18 illustrates how tuning affects probability estimation by changing the curve’s kurtosis. In general, a lower indicates that we can be more confident in our predictions, since they are less susceptible to random events. The same conclusion can be found in prior work (Bisberg and Cardona-Rivera, 2019).
Finally, we experimentally show the expandability of the Elo system. We first build a set of 150 populations and perform Elo rating. When the Elo rating process converges, we add another 40 populations and keep updating the Elo scores for all of the samples. The curve is shown in Figure 19. As one can see, the Elo system quickly adjusts to the new samples without losing the old Elo scores. This experiment demonstrates that the Elo system used here is expandable.
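The simulation described in this appendix can be sketched as follows. This is a minimal illustration that assumes the standard Elo formulation (a logistic win expectation with a scale parameter and an update step size); the names `k`, `scale`, `simulate` and `spearman`, the default values 16 and 400, the initial rating 1400 and the uniformly random pairing are all assumptions made for the sketch, not settings taken from the dataset.

```python
import math
import random

def expected_win(r_a, r_b, scale=400.0):
    """Standard Elo expectation that A beats B (scale is an assumed default)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, a_wins, k=16.0, scale=400.0):
    """Return the updated (r_a, r_b) after one comparison; k is the step size."""
    p_a = expected_win(r_a, r_b, scale)
    delta = k * ((1.0 if a_wins else 0.0) - p_a)
    return r_a + delta, r_b - delta

def simulate(truth, n_rounds, k=16.0, scale=400.0, seed=0):
    """Rate items by random pairwise comparisons drawn from ground-truth scores."""
    rng = random.Random(seed)
    ratings = [1400.0] * len(truth)  # assumed initial rating for every item
    for _ in range(n_rounds):
        a, b = rng.sample(range(len(truth)), 2)
        # The simulated judge prefers A with the probability implied by the truth.
        a_wins = rng.random() < expected_win(truth[a], truth[b], scale)
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins, k, scale)
    return ratings

def _ranks(values):
    """1-based ranks with ties sharing the average position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in range(i, j + 1):
            ranks[order[idx]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

if __name__ == "__main__":
    truth = [1000.0 + 10.0 * i for i in range(150)]  # hypothetical ground truth
    ratings = simulate(truth, n_rounds=20000)
    print("SRCC between Elo ratings and ground truth:", spearman(ratings, truth))
```

Each update adds to one rating exactly what it subtracts from the other, so the mean rating stays fixed and only the relative ordering evolves. The expandability experiment can be imitated by appending new items at the initial rating and continuing to call `elo_update` on the enlarged set.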
|Distortion Sub-type||Distortion types||Implementation Detail|
|Traditional Distortions||1. Median filter denoising||The noise levels of the salt-and-pepper noise are 0.08 and 0.16.|
|2. Linear motion blur||Fixed width with two different directions.|
|3. JPEG and JPEG 2000||The quality levels for JPEG are 10 and 20. The quality levels for JPEG 2000 are 20, 40 and 60.|
|4. Color quantization||We first use the MATLAB multithresh function to segment the image and then quantize it with intensity levels of 3 and 7.|
|5. Gaussian noise||The Gaussian noise levels are 10, 15 and 25.|
|6. Gaussian blur||The Gaussian blur with and .|
|7. Bilateral filtering||The range parameter and the spatial parameter are set to and .|
|8. Spatial warping||See Appendix B for details.|
|9. Comfort noise||We use the implementation from TID2013 and we have 4 levels.|
|Traditional SR||10. Interpolation||Bicubic upsampling and .|
|11. A+ (Timofte et al., 2014)||SR factors of , and .|
|12. YY (Yang and Yang, 2013)||SR factors of , and .|
|13. TSG (Timofte et al., 2013)||SR factors of , and .|
|14. YWHM (Yang et al., 2010)||SR factor of .|
|PSNR-oriented SR||15. SRCNN (Dong et al., 2015b)||SR factors of , and .|
|16. FSRCNN (Dong et al., 2016)||SR factors of , and .|
|17. VDSR (Kim et al., 2016)||SR factors of , and .|
|18. EDSR (Lim et al., 2017)||SR factors of , and .|
|19. RCAN (Zhang et al., 2018d)||SR factors of , , and .|
|SR with kernel mismatch||20. SFTMD (Gu et al., 2019)||For SR, the LR images are blurred with Gaussian blur with , SR using the Gaussian kernel with . For SR, the LR images are blurred with Gaussian blur with , SR using the Gaussian kernel with .|
|GAN-based SR||21. EnhanceNet (Sajjadi et al., 2017)||SR factors of .|
|22. SRGAN (Ledig et al., 2017)||SR factors of , and .|
|23. SFTGAN (Wang et al., 2018a)||SR factors of .|
|24. ESRGAN (Wang et al., 2018b)||SR factors of , and . For SR, we use the network interpolation with for four different GAN effects.|
|25. BOE (Navarrete Michelini et al., 2018)||SR factors of . We choose three models , and released by the authors for different GAN effects.|
|26. EPSR (Vasu et al., 2018)||SR factors of . We choose three models , and released by the authors for different GAN effects.|
|27. PESR (Vu et al., 2018)||SR factors of . We use the network interpolation with for two different GAN effects.|
|28. EUSR (Kim and Lee, 2018)||SR factors of .|
|29. MCML (Choi et al., 2018)||SR factors of .|
|30. RankSRGAN (Zhang et al., 2019)||SR factors of . We choose three different models (RankSRGAN-PI, RankSRGAN-MA and RankSRGAN-NIQE) for different GAN effects.|
|Denoising||31. DnCNN (Zhang et al., 2017)||Gaussian noise removal, the noise levels: 25 and 50.|
|32. FFDNet (Zhang et al., 2018a)||Gaussian noise removal, the noise levels: 25, 50 and 80.|
|33. TWSC (Xu et al., 2018)||Gaussian noise removal, the noise levels: 25, 50 and 80.|
|34. BM3D (Dabov et al., 2007)||Gaussian noise removal, the noise levels: 25, 50 and 80.|
|35. ARCNN (Dong et al., 2015a)||JPEG compression removal, the quality levels: 10 and 30.|
|SR and Denoising Joint Problem||36. BM3D + EDSR||We first perform Gaussian noise removal with noise levels of 25 and 50 with BM3D, and then perform EDSR with SR factors of 2 and 3.|
|37. DnCNN + EDSR||We first perform Gaussian noise removal with noise levels of 25 and 50 with DnCNN, and then perform EDSR with SR factors of 2 and 3.|
|38. ARCNN + EDSR||We first perform JPEG compression removal with quality levels of 10, 20 and 40 with ARCNN, and then perform EDSR with SR factor of 2.|
|39. noise + EDSR||We first add Gaussian noise with noise levels: 1.5 and 3, and then perform SR with EDSR with SR factor of 3.|
|40. noise + ESRGAN||We first add Gaussian noise with noise level 0.1, and then perform SR with ESRGAN with SR factor of 4.|
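As an illustration of how one of the traditional distortions in the table can be produced, the sketch below implements distortion type 1 (salt-and-pepper noise followed by median filter denoising) on a grayscale image stored as a list of rows. It is a hedged example only: treating the noise level as the fraction of corrupted pixels, the 3×3 window, the border replication and the function names are all assumptions, not the implementation used to build the dataset.

```python
import random
import statistics

def add_salt_pepper(img, density, rng):
    """Corrupt each pixel with probability `density` (assumed meaning of the
    noise level), setting it to 255 (salt) or 0 (pepper) with equal chance."""
    out = [row[:] for row in img]
    for y in range(len(out)):
        for x in range(len(out[0])):
            if rng.random() < density:
                out[y][x] = 255 if rng.random() < 0.5 else 0
    return out

def median_filter3(img):
    """3x3 median filter with border replication (assumed window size)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = statistics.median(window)
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    reference = [[128] * 8 for _ in range(8)]  # hypothetical flat test image
    for level in (0.08, 0.16):  # the two noise levels listed in the table
        distorted = median_filter3(add_salt_pepper(reference, level, rng))
        changed = sum(d != r for dr, rr in zip(distorted, reference)
                      for d, r in zip(dr, rr))
        print(f"noise level {level}: {changed} pixels differ after filtering")
```

Isolated impulses are removed completely by the median, while clusters of corrupted pixels survive as residual artifacts, which is what makes the filtered image a distorted version of the reference.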