PIPAL: a Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration

Image quality assessment (IQA) is a key factor in the rapid development of image restoration (IR) algorithms. The most recent IR methods based on Generative Adversarial Networks (GANs) have achieved significant improvements in visual performance, but they also present great challenges for quantitative evaluation. Notably, we observe an increasing inconsistency between perceptual quality and the evaluation results. We therefore raise two questions: (1) Can existing IQA methods objectively evaluate recent IR algorithms? (2) When focusing on beating current benchmarks, are we getting better IR algorithms? To answer these questions and promote the development of IQA methods, we contribute a large-scale IQA dataset, called the Perceptual Image Processing Algorithms (PIPAL) dataset. In particular, this dataset includes the results of GAN-based methods, which are missing in previous datasets. We collect more than 1.13 million human judgments to assign subjective scores to PIPAL images using the more reliable Elo rating system. Based on PIPAL, we present new benchmarks for both IQA and super-resolution methods. Our results indicate that existing IQA methods cannot fairly evaluate GAN-based IR algorithms. While using appropriate evaluation methods is important, IQA methods should also be updated along with the development of IR algorithms. Finally, we improve the performance of IQA networks on GAN-based distortions by introducing anti-aliasing pooling. Experiments show the effectiveness of the proposed method.


1 Introduction

Image restoration (IR) is a classic low-level vision problem that aims to reconstruct high-quality images from distorted low-quality inputs. Typical IR tasks include image super-resolution (SR), denoising, enhancement, etc. The whirlwind of deep-learning progress has produced a steady stream of promising IR algorithms that can generate less-distorted or perceptually friendly images. Nevertheless, one of the key bottlenecks that restrict the future development of IR methods is the “evaluation mechanism”. Although it is nearly effortless for human eyes to distinguish perceptually better images, it is challenging for an algorithm to measure visual quality fairly. In this work, we focus on the analysis of existing evaluation methods and introduce a new image quality assessment (IQA) dataset, which not only includes the most recent IR methods but also has the largest scale and diversity. We first state the motivation as follows.

IR methods are generally evaluated by measuring the similarity between the reconstructed images and ground-truth images via IQA metrics, such as PSNR [psnr] and SSIM [ssim]. Recently, some no-reference IQA methods, such as Ma [ma2017learning] and the Perceptual Index (PI) [blau2018perception], have been introduced to evaluate recent perceptual-oriented algorithms. To some extent, these IQA methods are the chief reason for the considerable progress of the IR field. However, while new algorithms have been continuously improving IR performance, we notice an increasing inconsistency between quantitative results and perceptual quality. For example, the literature [blau2018perception] reveals that the superiority of PSNR values does not always accord with better visual quality. Although Blau et al. suggest that PI is more relevant to human judgment, algorithms that perform well in terms of PI (e.g., ESRGAN [wang2018esrgan] and RankSRGAN [zhang2019ranksrgan]) can still produce images with obvious unrealistic artifacts. These conflicts lead us to rethink the evaluation methods for IR tasks.
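For concreteness, the following minimal sketch (assuming a recent scikit-image and NumPy; the file names are hypothetical placeholders) shows how PSNR and SSIM are typically computed between a restored image and its ground truth:

```python
import numpy as np
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical file names; any aligned ground-truth / restored image pair works.
gt = io.imread("ground_truth.png").astype(np.float64) / 255.0
restored = io.imread("restored.png").astype(np.float64) / 255.0

# PSNR: a log-scaled function of the mean squared error between the two images.
psnr = peak_signal_noise_ratio(gt, restored, data_range=1.0)

# SSIM: local luminance/contrast/structure comparison, averaged over the image.
ssim = structural_similarity(gt, restored, data_range=1.0, channel_axis=-1)

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```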

An important reason for this situation is the invention of Generative Adversarial Networks (GANs) [goodfellow2014generative] and GAN-based IR methods [wang2018esrgan, gu2019image], which bring completely new characteristics to the output images. In general, these methods often fabricate seemingly realistic yet fake details and textures. This presents a great challenge for existing IQA methods, which cannot distinguish GAN-generated textures from noise and real details. We naturally raise two questions: (1) Can existing IQA methods objectively evaluate current IR methods, especially GAN-based methods? (2) With the focus on beating benchmarks built on flawed IQA methods, are we getting better IR algorithms? A few works have made early attempts to answer these questions by proposing new benchmarks for IR and IQA methods. Yang et al. [yang2014single] conduct a comprehensive evaluation of traditional SR algorithms. Blau et al. [blau2018perception] analyze the perception-distortion trade-off phenomenon and suggest the use of multiple IQA methods. However, these prior studies usually rely on less reliable human ratings of image quality and generally cover an insufficient number of IR/IQA methods. In particular, the results of GAN-based methods are missing from the above works.

To touch the heart of this problem, we need a better understanding of the new challenges brought by GANs. The first step is to build a new IQA dataset that includes GAN-based algorithms. An IQA dataset contains many distorted images whose visual quality levels are annotated by humans. It can be used to measure the consistency between the predictions of an IQA method and human judgment. In this work, we contribute a novel IQA dataset, namely the Perceptual Image Processing ALgorithms dataset (PIPAL). The proposed dataset is distinguished from previous datasets in three aspects: (1) In addition to traditional distortion types (e.g., Gaussian noise/blur), PIPAL contains the outputs of several kinds of IR algorithms, including traditional algorithms, deep-learning-based algorithms, and GAN-based algorithms. In particular, this is the first time the results of GAN-based algorithms appear in an IQA dataset. (2) We employ the Elo rating system [elo1978rating] to assign subjective scores, involving more than 1.13 million human judgments. Compared with existing rating systems (e.g., five gradations [live] and the Swiss system [tid2008]), the Elo rating system provides much more reliable probability-based rating results. Furthermore, it has good extensibility, allowing users to update the dataset by directly adding new distortion types. (3) The proposed dataset contains 29k images in total, including 250 high-quality reference images, each of which has 116 distortions. To date, PIPAL is the largest IQA dataset with complete subjective scoring.

With the PIPAL dataset, we are able to answer the questions above. (1) We build a benchmark using the proposed PIPAL dataset for existing IQA methods. Experiments indicate that PIPAL poses challenges for these IQA methods. Evaluating IR algorithms only using existing metrics is not appropriate. Our research also shows that compared with the widely-used metrics (e.g., PSNR and PI), PieAPP [prashnani2018pieapp] and LPIPS [zhang2018unreasonable] are more suitable for evaluating IR algorithms, especially GAN-based algorithms. (2) We then review the development of SR algorithms in recent years. The results show that the recent SR algorithms achieve great progress in the average subjective image quality scores. However, we find that none of the existing IQA methods is always effective in evaluating SR algorithms. With the invention of new IR technologies, the corresponding evaluation methods also need to be adjusted to continuously promote the development of the IR field. (3) We also study the characteristics of GAN-based distortion by comparing them with some well-studied traditional distortions. Based on the results, we argue that existing IQA methods’ low tolerance toward spatial misalignment may be one of the key reasons for their performance drop. By introducing anti-aliasing pooling to the existing IQA networks, we are able to improve their performance on GAN-based distortions.

2 Related Work

Image Restoration.

As a fundamental computer vision problem, IR aims at recovering a high-quality image from its degraded observations. In past decades, plenty of IR algorithms have been proposed to continuously improve performance. Early algorithms use hand-crafted features [bm3d, ywhm2010] or exploit image priors [tsg2013, a+2014] in optimization problems to reconstruct images. Since the pioneering work of using Convolutional Neural Networks (CNNs) to learn IR mappings [jain2009natural, srcnn2014], deep-learning-based algorithms have dominated IR research due to their remarkable performance and usability [feng2019suppressing, gu2019blind]. Recently, with the invention of GANs [goodfellow2014generative], GAN-based IR methods [enhancenet2017, zhang2019ranksrgan] are not limited to obtaining higher PSNR performance but instead aim for better perceptual quality. However, these IR algorithms are not perfect. Their results also contain various image defects, which differ from the traditional distortions often discussed in previous IQA research. With the development of IR algorithms and the emergence of new technologies, evaluating the results of these algorithms becomes more and more challenging. In this paper, we mainly focus on the restoration of low-resolution images, noisy images, and images degraded by both resolution reduction and noise.

| Dataset | # Ref. images | Image type | Distortion types | # Distort. types | # Distort. images | # Human judgments | Judgment type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LIVE [live] | 29 | image | traditional | 5 | 0.8k | 25k | MOS (Five gradations) |
| CSIQ [csiq] | 30 | image | traditional | 6 | 0.8k | 5k | MOS (Direct ranking) |
| TID2008 [tid2008] | 25 | image | traditional | 17 | 1.7k | 256k | MOS (Swiss system) |
| TID2013 [tid2013] | 25 | image | traditional | 24 | 3.0k | 524k | MOS (Swiss system) |
| BAPPS* [zhang2018unreasonable] | 187.7k | patch (256×256) | trad. + alg. outputs | 425 | 375.4k | 484.3k | Prob. of Preference |
| PieAPP* [prashnani2018pieapp] | 200 | patch (256×256) | trad. + alg. outputs | 75 | 20.3k | 2.3m | Prob. of Preference |
| PIPAL (Ours) | 250 | patch (288×288) | trad. + alg. outputs, including GAN | 40 | 29k | 1.13m | MOS (Elo rating system) |
Table 1: Comparison with previous datasets. We include the outputs of GAN-based algorithms as a novel distortion type. Note that BAPPS [zhang2018unreasonable] and PieAPP are perceptual similarity datasets (as opposed to IQA datasets) and are marked with “*”.
Image Quality Assessment.

IQA methods were developed to measure the perceptual quality of images after degradation or post-processing operations. According to different usage scenarios, IQA methods can be divided into full-reference methods (FR-IQA) and no-reference methods (NR-IQA). FR-IQA methods measure the similarity between two images from the perspective of information or perceptual feature similarity, and have been widely used in evaluating image/video coding, restoration, and communication quality. Beyond the most widely used PSNR, FR-IQA methods follow a long line of work that can be traced back to SSIM [ssim], which first introduces structural information in measuring image similarity. After that, various FR-IQA methods have been proposed to bridge the gap between the results of IQA methods and human judgments. Similar to other computer vision problems, advanced data-driven methods have also motivated the investigation of IQA applications [zhang2018unreasonable, prashnani2018pieapp]. In addition to the above FR-IQA methods, NR-IQA methods are proposed to assess image quality without a reference image. Some popular NR-IQA methods include NIQE [niqe], Ma et al. [ma2017learning], BRISQUE [brisque], and PI [blau2018perception]. In some recent works, NR-IQA and FR-IQA methods are combined to measure IR algorithms [blau2018perception]. Despite the progress of IQA methods, only a few (e.g., PSNR, SSIM, and PI) are frequently used to evaluate IR methods.

Image Quality Assessment Datasets.

In order to evaluate and develop IQA methods, many datasets have been proposed, such as LIVE [live], CSIQ [csiq], TID2008, and TID2013 [tid2008, tid2013]. There are also perceptual similarity datasets such as PieAPP [prashnani2018pieapp] and BAPPS [zhang2018unreasonable]. These datasets provide both distorted images and the corresponding subjective scores, and they have served as baselines for the evaluation of IQA methods. IQA datasets are mainly distinguished from each other in three aspects: (1) the collection of reference images, (2) the number and types of distortions included, and (3) the strategy for collecting subjective scores. A quick comparison of these datasets can be found in Table 1.

3 Perceptual Image Processing ALgorithms Dataset

We now describe the characteristics of the proposed dataset from the aforementioned aspects: (1) the collection of reference images, (2) the number and types of distortions, and (3) the collection of subjective scores.

Collection of reference images.

In the proposed dataset, we select 250 image patches from two high-quality image datasets – DIV2K [div2k] and Flickr2K [timofte2017ntire]. We mainly focus on areas that are relatively hard to restore, such as high-frequency textures. Thus, we crop patches of representative texture areas from the selected images. The selected reference images cover a wide variety of real-world textures, including but not limited to buildings, trees and grasses, animal fur, human faces, text, and artificial textures. The size of these images is 288×288, which meets the requirements of most IQA methods.

| Sub-type | Distortion Types |
| --- | --- |
| Traditional | Gaussian blur, motion blur, image compression, Gaussian noise, spatial warping, bilateral filter, comfort noise |
| Super-Resolution | interpolation method, traditional methods, SR with kernel mismatch, PSNR-oriented methods, GAN-based methods |
| Denoising | mean filtering, traditional methods, deep-learning-based methods |
| Mixture Restoration | SR of noisy images, SR after denoising, SR after compression noise removal |
Table 2: Our distortion types. In addition to existing distortions, we include distortions produced by 19 different GAN-based algorithms.
Image Distortions.

In our dataset, we have 40 distortion types, which can be divided into four sub-types. An overview of these distortion types is shown in Table 2. The first sub-type includes some traditional distortions (e.g., blur, noise, and compression), which are usually produced by basic low-level image editing operations. In some datasets, these distortions can be very severe; in our dataset, however, we limit the severity of these distortions because we want them to be comparable to IR results, which are unlikely to be of very low quality. The second sub-type includes the SR results of many real algorithms. Although some recent datasets [zhang2018unreasonable, prashnani2018pieapp] cover some SR results, they include far fewer algorithms and algorithm types than our dataset. We divide the SR algorithms used into three categories – traditional algorithms, PSNR-oriented algorithms, and GAN-based algorithms. The results of traditional algorithms can, to some extent, be understood as loss of detail. The PSNR-oriented algorithms are usually based on deep-learning technology. Compared with traditional algorithms, their outputs tend to have sharper edges and higher PSNR performance. The outputs of GAN-based algorithms are more complicated and challenging for IQA methods. They cannot simply be characterized as detail loss, since they usually contain texture-like noise, nor as ordinary noise, since this texture-like noise resembles the ground truth to some extent, just not accurately. An example of GAN-based distortion is shown in Figure 1. Measuring the similarity of incorrect yet similar-looking textures is of great importance to the development of perceptual SR. The third sub-type includes the outputs of several denoising algorithms. Similar to image SR, the denoising algorithms used include both model-based and deep-learning-based algorithms. In addition to Gaussian noise, we also include JPEG compression noise removal results. Finally, we include restoration results for mixed image degradations. As revealed in [zhang2018learning, qian2019trinity], performing denoising and SR sequentially or jointly introduces new artifacts or different blur effects that barely occur in other IR tasks.

In summary, we have 40 different distortion types and 116 different distortion levels, giving 29k distorted images in total. Note that although the number of distortion types is smaller than in some existing datasets, our dataset contains many new distortion types and, in particular, a large number of results from real algorithms, including GAN-based ones. This allows the proposed dataset to provide a more objective benchmark not only for IQA methods but also for IR methods.
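The exact degradation settings used to build PIPAL are not reproduced here; the sketch below only illustrates, with assumed example parameters, how traditional distortions of the first sub-type (Gaussian blur, Gaussian noise, JPEG compression) are commonly generated:

```python
import io

import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

def gaussian_blur(img: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    # Blur the spatial dimensions only; sigma is an assumed example value.
    return gaussian_filter(img, sigma=(sigma, sigma, 0))

def gaussian_noise(img: np.ndarray, std: float = 0.05) -> np.ndarray:
    # Additive white Gaussian noise on an image scaled to [0, 1].
    return np.clip(img + np.random.normal(0.0, std, img.shape), 0.0, 1.0)

def jpeg_compress(img: np.ndarray, quality: int = 30) -> np.ndarray:
    # Round-trip through JPEG encoding to introduce compression artifacts.
    buf = io.BytesIO()
    Image.fromarray((img * 255).astype(np.uint8)).save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(buf)).astype(np.float64) / 255.0

# "reference_patch.png" is a hypothetical 288x288 reference patch.
reference = np.asarray(Image.open("reference_patch.png")).astype(np.float64) / 255.0
distorted = [gaussian_blur(reference), gaussian_noise(reference), jpeg_compress(reference)]
```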

Figure 1: Visualizing different distortions. Unlike the distortions in the upper row, which do not follow the natural image distribution, the GAN-based outputs are actually similar to natural images. However, their details are wrong.
Elo Rating for Mean Opinion Score.

Given the distorted images, a Mean Opinion Score (MOS) must be provided for each of them. Several methodologies have been used to assess the visual quality of an image [live, tid2013, zhang2018unreasonable, prashnani2018pieapp]. Early datasets [live] use the “five-gradations rating” method, where images are directly assigned to five categories. This method results in a large bias when the user has little experience. In recent years, datasets usually collect MOS through a large number of pairwise selections using the Swiss rating system [tid2008, tid2013]. However, as revealed in [prashnani2018pieapp], the way this pairwise MOS is calculated makes it dependent on the specific image set, which means the MOS scores of two distorted images can change considerably when they are included in two different datasets. In order to eliminate this set-dependence effect, Prashnani et al. [prashnani2018pieapp] propose to build a dataset based only on the probability of pairwise preference. This method provides a more accurate preference probability; however, it not only requires a large number of human judgments but also cannot provide MOS values for distortion types, which are important for building benchmarks. In the proposed dataset, we employ the Elo rating system [elo1978rating] to bring pairwise preference probability and a rating system together. The use of the Elo system not only provides reliable human ratings but also reduces the number of required human judgments.

The Elo rating system is a statistics-based rating method originally proposed for assessing the level of chess players. We assume that the user preference between two images $A$ and $B$ follows a Logistic distribution parameterized by their Elo scores [elo20088]. Given their Elo scores $R_A$ and $R_B$, the expected probability of preference is given by:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/s}}, \qquad E_B = \frac{1}{1 + 10^{(R_A - R_B)/s}}, \tag{1}$$

where $E_A$ indicates the probability that one user would prefer $A$ to $B$, and $s$ is the parameter of the distribution. In our dataset we use $s = 400$. Once the user makes a choice, we update the Elo scores of both $A$ and $B$ using the following rule:

$$R_A \leftarrow R_A + K(S_A - E_A), \qquad R_B \leftarrow R_B + K(S_B - E_B), \tag{2}$$

where $K$ is the change step in one judgment and is set to 16, and $S_A$ indicates whether $A$ is chosen: $S_A = 1$ if $A$ wins and $S_A = 0$ if $A$ loses. With thousands of human judgments, the Elo score of each distorted image converges. The average of the Elo scores over the last few steps is assigned as the MOS subjective score. The averaging operation reduces the randomness of the Elo updates.

An example might help to understand the Elo system. Suppose image $A$ has a lower Elo score than image $B$, so that $E_A < 0.5 < E_B$. If $A$ is chosen, its score increases by $K(1 - E_A)$ while $B$'s score decreases by $K E_B$; if $B$ is chosen, the changes are much smaller, because $B$ was already expected to win. In other words, since the expected probabilities of being chosen differ, the resulting changes in Elo score also differ: when the quality gap is large, the winner gains little from beating a bad image. According to Eq. (1), a score difference of 200 indicates a 76% chance to win, and a difference of 400 indicates a chance of more than 90%. Initially, we assign an Elo score of 1400 to each distorted image. After numerous human judgments (in our dataset, 1.13 million), the Elo score of each image is collected.
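As a minimal sketch of the Elo update in Eqs. (1) and (2) (the parameter values $s = 400$, $K = 16$, and the initial score of 1400 follow the text above; the example judgment is hypothetical):

```python
S_PARAM = 400.0  # scale of the Logistic distribution in Eq. (1)
K_STEP = 16.0    # change step per judgment in Eq. (2)

def expected_preference(r_a: float, r_b: float) -> float:
    """Probability that a user prefers image A over image B, Eq. (1)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / S_PARAM))

def elo_update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
    """Update both Elo scores after one pairwise judgment, Eq. (2)."""
    e_a = expected_preference(r_a, r_b)
    e_b = 1.0 - e_a
    s_a, s_b = (1.0, 0.0) if a_wins else (0.0, 1.0)
    return r_a + K_STEP * (s_a - e_a), r_b + K_STEP * (s_b - e_b)

# Hypothetical judgment: both images start at the initial score of 1400.
r_a, r_b = elo_update(1400.0, 1400.0, a_wins=True)
print(r_a, r_b)  # A gains 8 points and B loses 8, since the expected preference was 0.5
```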

Another advantage of employing the Elo system is that our dataset can be dynamic and extended in the future. The Elo system has been widely used to evaluate the relative level of players in electronic games, where the players are constantly changing, and it can provide ratings for new players within a few games. Recall that one of the chief reasons existing IQA methods are facing challenges is the invention of GANs and GAN-based IR methods. What if other novel image generation technologies are proposed in the future? Would people need to build a new dataset to include those new algorithms? With the extensibility of the Elo system, one can easily add new distortion types to this dataset and follow the same rating process. The Elo system will automatically adjust the Elo scores of all distortions without re-rating the old ones.

4 Results

In this section, we conduct a comprehensive study using the proposed PIPAL dataset. We first build a benchmark for IQA methods. Through this benchmark, we can answer the question “Can existing IQA methods objectively evaluate recent IR algorithms?” We then build a benchmark for some recent SR algorithms to explore the relationship between the development of IQA methods and IR research, which answers the question “Are we getting better IR algorithms by beating benchmarks on these IQA methods?” Finally, we study the characteristics of GAN-based distortion by comparing it with some existing distortion types. We also improve the performance of IQA networks on GAN-based distortions by introducing anti-aliasing pooling layers.

| Method | Traditional Distortion | Denoising | SR Full | Traditional SR | PSNR-oriented SR | GAN-based SR |
| --- | --- | --- | --- | --- | --- | --- |
| PSNR | 0.3589 | 0.4542 | 0.4099 | 0.4782 | 0.5462 | 0.2839 |
| NQM | 0.2561 | 0.5650 | 0.4742 | 0.5374 | 0.6462 | 0.3410 |
| UQI | 0.3455 | 0.6246 | 0.5257 | 0.6087 | 0.7060 | 0.3385 |
| SSIM | 0.3910 | 0.6684 | 0.5209 | 0.5856 | 0.6897 | 0.3388 |
| MS-SSIM | 0.3967 | 0.6942 | 0.5596 | 0.6527 | 0.7528 | 0.3823 |
| IFC | 0.3708 | 0.7440 | 0.5651 | 0.7062 | 0.8244 | 0.3217 |
| VIF | 0.4516 | 0.7282 | 0.5917 | 0.6927 | 0.7864 | 0.3857 |
| VSNR-FR | 0.4030 | 0.5938 | 0.5086 | 0.6146 | 0.7076 | 0.3128 |
| RFSIM | 0.3450 | 0.4520 | 0.4232 | 0.4593 | 0.5525 | 0.2951 |
| GSM | 0.5645 | 0.6076 | 0.5361 | 0.6074 | 0.6904 | 0.3523 |
| SR-SIM | 0.6036 | 0.6727 | 0.6094 | 0.6561 | 0.7476 | 0.4631 |
| FSIM | 0.5760 | 0.6882 | 0.5896 | 0.6515 | 0.7381 | 0.4090 |
| FSIMc | 0.5724 | 0.6866 | 0.5872 | 0.6509 | 0.7374 | 0.4058 |
| VSI | 0.4993 | 0.5745 | 0.5475 | 0.6086 | 0.6938 | 0.3706 |
| MAD | 0.3769 | 0.7005 | 0.5424 | 0.6720 | 0.7575 | 0.3494 |
| LPIPS-Alex | 0.5935 | 0.6688 | 0.5614 | 0.5487 | 0.6782 | 0.4882 |
| LPIPS-VGG | 0.4087 | 0.7197 | 0.6119 | 0.6077 | 0.7329 | 0.4816 |
| PieAPP | 0.6893 | 0.7435 | 0.7172 | 0.7352 | 0.8097 | 0.5530 |
| WaDIQaM | 0.6127 | 0.7157 | 0.6621 | 0.6944 | 0.7628 | 0.5343 |
| DISTS | 0.6213 | 0.7190 | 0.6544 | 0.6685 | 0.7733 | 0.5527 |
| NIQE | 0.1107 | -0.0059 | 0.0320 | 0.0599 | 0.1521 | 0.0155 |
| Ma et al. | 0.4526 | 0.4963 | 0.3676 | 0.6176 | 0.7124 | 0.0545 |
| PI | 0.3631 | 0.3107 | 0.1953 | 0.4833 | 0.5710 | 0.0187 |
Table 3: SRCC results with respect to different distortion sub-types. ↑ means the higher the better, while ↓ means the lower the better. A higher coefficient indicates better agreement with perceptual scores. The values with top-3 performance are marked in bold.
Figure 2: Quantitative comparison of IQA methods. The right figure is a zoomed-in view. A higher coefficient matches perceptual scores better.
Figure 3: Analysis of IQA methods in evaluating IR methods. The first row shows scatter plots of MOS scores vs. IQA metric values for all SR algorithms. The second row gives scatter plots for GAN-based SR algorithms.

4.1 Evaluations on IQA Methods

We select a set of commonly used IQA methods to build the benchmark. For the FR-IQA methods, we include: PSNR [psnr], NQM [nqm], UQI [uqi], SSIM [ssim], MS-SSIM [ms-ssim], IFC [ifc], VIF [vif], VSNR-FR [vsnr], RFSIM [rfsim], GSM [gsm], SR-SIM [sr-sim], FSIM and FSIMc [fsim], SFF [sff], VSI [vsi], SCQI [scqi], LPIPS-Alex and LPIPS-VGG [zhang2018unreasonable], PieAPP [prashnani2018pieapp], WaDIQaM [wadiqam], and DISTS [dists]. We also include some popular NR-IQA methods: NIQE [niqe], Ma [ma2017learning], and PI [blau2018perception]. All these methods are computed using the official implementations released by the authors. As in many previous works [live], we evaluate IQA methods mainly using the Spearman rank order correlation coefficient (SRCC) and the Kendall rank order correlation coefficient (KRCC) [kendall1977advanced]. These two indices evaluate the monotonicity of methods: whether the scores of high-quality images are higher (or lower) than those of low-quality images.

We first evaluate the IQA methods using all types of distortions in the PIPAL dataset. Both SRCC and KRCC rank coefficients are shown in Figure 2. The first conclusion is that even the best IQA method (i.e., PieAPP) achieves an SRCC of only 0.71, which is much lower than its performance on the TID2013 dataset (about 0.90). This indicates that the proposed PIPAL dataset is challenging for existing IQA methods and that there is large room for future improvement. Moreover, a high overall correlation does not necessarily indicate high performance on each sub-type of distortion. As the focus of this paper, we want to analyze the performance of IQA methods on IR results, especially the outputs of GAN-based algorithms. Specifically, we take the SR sub-type as an example and show the performance of IQA methods in evaluating SR algorithms. In Table 3, we show the SRCC results with respect to different distortion sub-types, including traditional distortions, denoising outputs, all SR outputs, and the outputs of traditional SR, PSNR-oriented SR, and GAN-based SR algorithms. Analysis of Table 3 leads to the following conclusions. First, although they perform well in evaluating traditional and PSNR-oriented SR algorithms, almost all IQA methods suffer a severe performance drop when evaluating GAN-based algorithms. This confirms the conclusion of Blau et al. [blau2018perception] that less distortion (e.g., higher PSNR values) may be related to lower perceptual performance for GAN-based IR algorithms. Second, despite the severe performance drop, several IQA methods still outperform the others on GAN-based algorithms. Coincidentally, they are all recent works based on deep networks.
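As a concrete illustration of this evaluation protocol, the following sketch (assuming SciPy is available; the score arrays are hypothetical placeholders) computes SRCC and KRCC between the predictions of an IQA method and the MOS values:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Hypothetical placeholders: one MOS value and one predicted IQA score per
# distorted image. In practice these would come from the PIPAL annotations and
# from running an IQA method on the corresponding images.
mos_scores = np.array([1420.3, 1355.1, 1510.8, 1462.5, 1390.0])
iqa_scores = np.array([0.42, 0.55, 0.21, 0.30, 0.49])  # e.g., a distance-like metric

# SRCC and KRCC measure monotonic agreement; the sign indicates whether the
# metric increases or decreases with perceptual quality.
srcc, _ = spearmanr(iqa_scores, mos_scores)
krcc, _ = kendalltau(iqa_scores, mos_scores)
print(f"SRCC: {srcc:.4f}, KRCC: {krcc:.4f}")
```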

We next present the analysis of IQA methods as IR evaluation metrics. In Figure 3, we show scatter plots of subjective scores vs. the average values of some commonly used image quality metrics for 23 SR algorithms. Among them, PSNR and SSIM are the most common measures; IFC is suggested by Yang et al. [yang2014single]; NIQE and PI are suggested in recent works [blau2018perception, zhang2019ranksrgan] for their good performance on GAN-based SR algorithms; and LPIPS [zhang2018unreasonable] and PieAPP [prashnani2018pieapp] are selected according to our benchmark. As can be seen, although widely used, PSNR, SSIM, and IFC are anti-correlated with the subjective scores and are thus inappropriate for evaluating GAN-based algorithms. It is worth noting that IFC shows good performance on denoising, traditional SR, and PSNR-oriented SR according to Table 3, but drops severely on GAN-based distortions. NIQE and PI show moderate performance in evaluating IR algorithms, and LPIPS and PieAPP are the most correlated. Note that, different from the work of Blau et al. [blau2018perception], where perceptual quality is collected only based on whether the image looks real, we collect subjective scores based on the perceptual similarity to the ground truth. Therefore, when evaluating the performance of IR algorithms from the perspective of reconstructing the ground truth, the suggestions given by our work are more appropriate.

| Method | Year | PSNR↑ | SSIM↑ | Ma↑ | NIQE↓ | PI↓ | LPIPS↓ | MOS↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YY [yy2013] | 2013 | 23.35 | 0.6897 | 4.5486 | 6.4174 | 5.9344 | 0.3574 | 1367.71 |
| TSG [tsg2013] | 2013 | 23.55 | 0.6775 | 4.1298 | 6.4163 | 6.1433 | 0.3570 | 1387.24 |
| A+ [a+2014] | 2014 | 23.82 | 0.6919 | 4.3852 | 6.3645 | 5.9897 | 0.3491 | 1354.52 |
| SRCNN [srcnn2014] | 2014 | 23.93 | 0.6966 | 4.6094 | 6.5657 | 5.9781 | 0.3316 | 1363.68 |
| FSRCNN [fsrcnn2016] | 2016 | 24.07 | 0.7013 | 4.6686 | 6.9985 | 6.1649 | 0.3281 | 1367.49 |
| VDSR [vdsr2016] | 2016 | 24.13 | 0.6984 | 4.7799 | 7.4436 | 6.3319 | 0.3484 | 1364.90 |
| EDSR [edsr2017] | 2017 | 25.17 | 0.7541 | 5.7634 | 6.4560 | 5.3463 | 0.3016 | 1447.44 |
| SRGAN [srgan2017] | 2017 | 22.57 | 0.6494 | 8.4215 | 3.9527 | 2.7656 | 0.2687 | 1494.14 |
| RCAN [rcan2018] | 2018 | 25.21 | 0.7569 | 5.9260 | 6.4121 | 5.2430 | 0.2992 | 1455.31 |
| BOE [boe2018] | 2018 | 22.68 | 0.6582 | 8.5209 | 3.7945 | 2.6368 | 0.2933 | 1481.51 |
| ESRGAN [wang2018esrgan] | 2018 | 22.51 | 0.6566 | 8.3424 | 4.7821 | 3.2198 | 0.2517 | 1534.25 |
| RankSRGAN [zhang2019ranksrgan] | 2019 | 22.11 | 0.6392 | 8.6882 | 3.8155 | 2.5636 | 0.2755 | 1518.29 |
Table 4: The SR results. The years of publication are also provided. The bolded values are the top 2 values and the superscripts indicate the ranking

4.2 Evaluations on IR Methods

One of the most important applications of IQA technology is to evaluate IR algorithms. IQA methods have been a chief driver of progress in the IR field as a means of comparing performance. However, evaluating IR methods only with specific IQA methods also narrows the focus of IR research and turns it into a competition only on quantitative numbers (e.g., PSNR competitions [timofte2017ntire, cai2019ntire] and the PI competition [blau20182018]). As stated above, existing IQA methods may be inadequate for evaluating IR algorithms. We wonder: with the focus on beating benchmarks built on flawed IQA methods, are we getting better IR algorithms? To answer this question, we take the SR task as a representative and select 12 SR algorithms to build a benchmark; more algorithms are presented in the Supplementary Material. These are all representative algorithms, selected from the pre-deep-learning era (since 2013) to the present. The results are shown in Table 4. One can observe that before 2017 (when GANs were applied to SR), PSNR performance improved continuously. In particular, the deep-learning-based algorithms improve PSNR by about 1.4dB. These efforts do improve subjective performance – the average MOS value increases by about 90 over four years. After SRGAN was proposed, PSNR decreased by about 2.6dB compared with the state-of-the-art PSNR performance at that time (EDSR), but the MOS value suddenly increased by about 50. In contrast, RCAN was proposed to defeat EDSR in terms of PSNR; its PSNR performance is slightly higher than EDSR's, but its MOS score is even lower than EDSR's. When it was noted that the mainstream metrics (PSNR and SSIM) conflicted with subjective performance, PI was proposed to evaluate perceptual SR algorithms [blau2018perception]. After that, ESRGAN and RankSRGAN continuously improved PI performance. Among them, the latest RankSRGAN achieves the current state-of-the-art in terms of PI and NIQE. However, ESRGAN has the highest subjective performance, yet has no advantage in terms of PI and NIQE compared with RankSRGAN. Efforts to improve the PI value show limited effect and have failed to continuously improve MOS scores after ESRGAN. These observations inspire us in two respects. First, none of the existing IQA methods is always effective in evaluation. With the development of IR technology, new IQA methods need to be proposed accordingly. Second, excessively optimizing performance on a specific IQA method may cause a decrease in perceptual quality.

We conduct experiments to explore this possibility by performing gradient ascent/descent on certain IQA methods. According to Blau et al. [blau2018perception], distortion and perceptual quality are at odds with each other. In order to simulate a situation where there is a perception-distortion trade-off, we constrain the PSNR value to be equal to that of the initial distorted image during optimization. We use the output of ESRGAN as the initial image, and the results are shown in Figure 4. We can see that some images show superior numerical performance when evaluated using certain IQA methods, but may not be dominant in other metrics. Their best-cases also show different visual effects. Even for some IQA methods with good performance on GAN-based distortion (LPIPS and DISTS), their best-cases still contain serious artifacts. This indicates that evaluating and developing new IQA methods plays an important role in future research.

Figure 4: Best-case images with respect to different IQA methods, with identical PSNR. These are computed by gradient ascent/descent optimization on certain IQA methods
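The following is a minimal sketch of this kind of PSNR-constrained optimization, assuming PyTorch and the lpips package; it is not the authors' exact procedure, the input tensors are random stand-ins for an ESRGAN output and its ground truth, and the residual-rescaling projection is just one simple way to hold PSNR fixed:

```python
import torch
import lpips  # the LPIPS perceptual metric (pip install lpips)

def project_to_fixed_psnr(x: torch.Tensor, gt: torch.Tensor, target_mse: float) -> torch.Tensor:
    # Rescale the residual so that the MSE to the ground truth (and hence the
    # PSNR) stays equal to that of the initial distorted image.
    residual = x - gt
    current_mse = residual.pow(2).mean()
    return gt + residual * torch.sqrt(target_mse / (current_mse + 1e-12))

# Hypothetical stand-in tensors in [-1, 1] with shape (1, 3, H, W).
gt = torch.rand(1, 3, 288, 288) * 2 - 1
init = torch.clamp(gt + 0.1 * torch.randn_like(gt), -1, 1)
target_mse = (init - gt).pow(2).mean().item()

metric = lpips.LPIPS(net="alex")  # lower LPIPS is better
x = init.clone().requires_grad_(True)
optimizer = torch.optim.Adam([x], lr=1e-3)

for _ in range(200):
    optimizer.zero_grad()
    loss = metric(x, gt).mean()   # gradient descent on the IQA score
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        x.copy_(project_to_fixed_psnr(x, gt, target_mse))  # enforce the PSNR constraint
```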

4.3 Discussion of GAN-based Distortion

Recall that LPIPS, PieAPP, and DISTS perform relatively well in evaluating GAN-based distortion. The effectiveness of these methods may be attributed to the following reasons. Compared with other IQA methods, deep-learning-based IQA methods can extract image features more effectively. For traditional distortion types, such as blur, compression, and noise, the distorted images usually deviate from the distribution of natural images. Early IQA methods can assess these images by measuring low-level statistical image features such as image gradients and structural information. These strategies are also effective for the outputs of traditional and PSNR-oriented algorithms. However, most of these strategies fail in the case of GAN-based distortion, as the way GAN-based distortions differ from the reference images is less apparent; they may have image statistics similar to those of the reference image. In this case, deep networks are able to capture these less apparent features and distinguish such distortions to some extent.
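A rough sketch of this idea (not LPIPS itself, and omitting its learned per-channel weights): compare images in the feature space of a pretrained VGG network, assuming a recent torchvision is available.

```python
import torch
from torchvision import models

# Use the first few VGG-16 convolutional stages as a fixed feature extractor.
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def deep_feature_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """L2 distance between channel-normalized deep features of two images."""
    fx, fy = vgg_features(x), vgg_features(y)
    fx = fx / (fx.norm(dim=1, keepdim=True) + 1e-10)
    fy = fy / (fy.norm(dim=1, keepdim=True) + 1e-10)
    return (fx - fy).pow(2).sum(dim=1).mean()

# Hypothetical inputs: ImageNet-normalized tensors of shape (1, 3, 288, 288).
reference = torch.rand(1, 3, 288, 288)
distorted = torch.rand(1, 3, 288, 288)
print(deep_feature_distance(reference, distorted).item())
```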

Figure 5: Examples of scatter plots for pairs of distortion types. For distortion types that are easy to measure, the samples are well clustered along the fitted curve. For those that are difficult for the IQA method, the samples are not well clustered. The samples of distortion types with similar behavior overlap with each other.

In order to explore the characteristics of GAN-based distortion, we compare it with some well-studied distortions. As stated in [tid2013], for a good IQA method, the subjective scores in the scatter plot should increase monotonically with the objective values, and the samples should be well clustered along the fitted curve. For two distortion types, if the IQA method behaves similarly on both, their samples in the scatter plot will also be well clustered and overlapped. For example, additive Gaussian noise and lossy compression are well-studied distortion types for most IQA methods. When the objective values are computed using FSIM, samples of both distortions cluster near the curve, as shown in Figure 5 (a). This indicates that FSIM can adequately characterize the visual quality of an image damaged by these two types of distortion. We then study GAN-based distortion by comparing it with some existing distortion types using FSIM: Figure 5 (b) shows the result for GAN-based distortion and compression noise, and Figure 5 (c) shows the result for GAN-based distortion and Gaussian blur. It can be seen that the samples of compression noise and Gaussian blur barely intersect with the GAN-based samples; FSIM largely underestimates the visual quality of GAN-based distortion. In Figure 5 (d), we show the result for GAN-based distortion and spatial warping distortion. As can be seen, these two distortion types behave unexpectedly similarly: FSIM cannot handle either of them and presents the same random and diffused state. The quantitative results also verify this phenomenon. For the spatial warping distortion type, the SRCC of FSIM is 0.31, which is close to its performance on GAN-based distortion, 0.41. Thus we argue that spatial warping distortion and GAN-based distortion pose similar challenges to FSIM.

As revealed in experimental psychology [kolers1962intensity, kahneman1968method], mutual interference between visual information may cause visual masking effects. According to this theory, some key reasons why IQA methods tend to underestimate both GAN-based distortion and spatial warping distortion are as follows. First, for edges with strong intensity changes, the human visual system (HVS) is sensitive to the contour and shape, but not sensitive to errors and misalignment of the edges. Second, the ability of the HVS to distinguish texture decreases in regions with dense textures. When the extracted features of the textures are similar, the HVS ignores part of the subtle differences and misalignment of the textures. However, most traditional and deep-learning-based IQA methods require the inputs to be well aligned. This partially explains the performance drop of these IQA methods on GAN-based distortion.

This finding provides the insight that if we explicitly account for spatial misalignment, we may improve the performance of IQA methods on GAN-based distortion. We explore this possibility by introducing anti-aliasing pooling layers into IQA networks. IQA networks extract features by cascaded convolution operations. If we want IQA networks to be robust to small misalignment, the extracted features should at least be invariant to such misalignment/shift. CNNs might be expected to be shift-invariant, since standard convolution operations commute with shifts. However, IQA networks are usually not shift-invariant, because some commonly used downsampling layers, such as max/average pooling and strided convolution, ignore the sampling theorem [zhang2019making]. These operations are employed in the VGG [simonyan2014very] and AlexNet [krizhevsky2012imagenet] networks, which are popular backbone architectures for feature extraction (e.g., in LPIPS and DISTS). As suggested by Zhang [zhang2019making], one can fix this by introducing anti-aliasing pooling. We conduct this experiment based on LPIPS-Alex and introduce l2 pooling [dists] and BlurPool [zhang2019making] layers to replace its max pooling layers. l2 pooling addresses the problem by low-pass filtering before downsampling, and BlurPool further improves performance by low-pass filtering between the dense max operation and subsampling. The results are shown in Table 5. We observe increased correlation on both the PIPAL full set and the GAN-based distortion subset. The results demonstrate the effectiveness of anti-aliasing pooling and indicate that the lack of robustness to small misalignment is one of the reasons for the performance decline on GAN-based distortion.
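A minimal sketch of the two anti-aliasing replacements for max pooling discussed above, written directly in PyTorch; a simple 3-tap binomial filter stands in for the exact low-pass windows used by [dists] and [zhang2019making], and the integration into LPIPS-Alex is not shown:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binomial_kernel(channels: int) -> torch.Tensor:
    # Separable 3-tap binomial low-pass filter, replicated once per channel.
    coeffs = torch.tensor([1.0, 2.0, 1.0])
    k = torch.outer(coeffs, coeffs)
    return (k / k.sum()).view(1, 1, 3, 3).repeat(channels, 1, 1, 1)

class BlurPool(nn.Module):
    """Dense max, then low-pass filter, then subsample (anti-aliased max pooling)."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.register_buffer("kernel", binomial_kernel(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.max_pool2d(x, kernel_size=2, stride=1)          # dense max operation
        return F.conv2d(x, self.kernel, stride=self.stride,   # blur, then subsample
                        padding=1, groups=x.shape[1])

class L2Pool(nn.Module):
    """l2 pooling: low-pass filter the squared activations, then take the square root."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.register_buffer("kernel", binomial_kernel(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.conv2d(x.pow(2), self.kernel, stride=self.stride,
                       padding=1, groups=x.shape[1])
        return torch.sqrt(out + 1e-12)

# Replacing nn.MaxPool2d(kernel_size=2, stride=2) in the backbone with
# BlurPool(channels) or L2Pool(channels) makes the features more robust to small shifts.
```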

| Test Set | LPIPS baseline (SRCC / KRCC) | LPIPS + l2 pooling (SRCC / KRCC) | LPIPS + BlurPool (SRCC / KRCC) |
| --- | --- | --- | --- |
| PIPAL (full set) | 0.5604 / 0.3910 | 0.5816 / 0.4080 | 0.5918 / 0.4160 |
| PIPAL GAN-based distort. | 0.4862 / 0.3339 | 0.4942 / 0.3394 | 0.5135 / 0.3549 |
Table 5: The SRCC and KRCC performance of LPIPS with different pooling layers. The anti-aliasing pooling layers (l2 pooling [dists] and BlurPool [zhang2019making]) improve the performance both on the PIPAL full set and GAN-based distortion subset

5 Conclusion

In this paper, we construct a novel IQA dataset, namely PIPAL, and establish benchmarks for both IQA methods and IR algorithms. Our results indicate that existing IQA methods face challenges in evaluating perceptual IR algorithms, especially GAN-based algorithms. We also shed light on improving IQA networks by introducing anti-aliasing pooling layers. Experiments demonstrate the effectiveness of the proposed strategy.

5.0.1 Acknowledgement.

This work is partially supported by SenseTime Group Limited, the National Natural Science Foundation of China (61906184), Science and Technology Service Network Initiative of Chinese Academy of Sciences (KFJ-STS-QYZX-092), Shenzhen Basic Research Program (JSGG20180507182100698, CXB201104220032A), the Joint Lab of CAS-HK, Shenzhen Institute of Artificial Intelligence and Robotics for Society. The corresponding author is Chao Dong.

References