A Perceptually Weighted Rank Correlation Indicator for Objective Image Quality Assessment

05/15/2017 ∙ by Qingbo Wu, et al. ∙ The Chinese University of Hong Kong

In the field of objective image quality assessment (IQA), Spearman's ρ and Kendall's τ are the two most popular rank correlation indicators. Both assign a uniform weight to all quality levels and assume that every pair of images is sortable. They are successful in measuring the average accuracy of an IQA metric in ranking multiple processed images, but they ignore two important perceptual properties. Firstly, the sorting accuracy (SA) of high quality images is usually more important than that of poor quality ones in many real world applications, where only the top-ranked images are pushed to the users. Secondly, due to the subjective uncertainty in making judgements, two perceptually similar images are usually hardly sortable, and their ranks do not contribute to the evaluation of an IQA metric. To compare different IQA algorithms more accurately, we explore a perceptually weighted rank correlation indicator in this paper, which rewards the capability of correctly ranking high quality images and suppresses the attention towards insensitive rank mistakes. More specifically, we focus on activating 'valid' pairwise comparisons of image quality, whose difference exceeds a given sensory threshold (ST). Meanwhile, each image pair is assigned a unique weight, which is determined by both the quality level and the rank deviation. By modifying the perception threshold, we can illustrate the sorting accuracy with a more sophisticated SA-ST curve, rather than a single rank correlation coefficient. The proposed indicator offers a new insight for interpreting visual perception behaviors. Furthermore, the applicability of our indicator is validated in recommending robust IQA metrics for both degraded and enhanced image data.




I Introduction

In most image processing applications, human eyes are the ultimate receiver of the visual signal, so a subjective test offers the ideal evaluation of the processed results. However, given the massive amount of image data generated in daily life, it has become impossible to annotate image quality manually with a large number of human subjects, which is time consuming, troublesome and expensive [1]. As an alternative, it is more economical and practical to rely on well-designed objective image quality assessment (IQA) metrics.

Given multiple processed or contaminated images, a main target of objective IQA is to correctly sort their quality levels via a computational model, so that the result approaches the rank-order produced by subjective judgement. A reliable objective IQA model not only serves as an image quality monitor, but also plays an important role in optimizing various quality driven applications, such as image/video coding [2], image fusion [3], contrast enhancement [4], and so on. With the boom of perceptually friendly image/video processing systems [5, 6, 7], recent decades have witnessed growing interest in the development of IQA algorithms. A large number of subject-rated databases (such as LIVE II [8], TID2013 [9], CSIQ [10], ChallengeDB [11]) and objective metrics (including full-reference [12], reduced-reference [13], and no-reference models [14, 15]) have been successively proposed to interpret the human perception of image quality from different perspectives. Thanks to these research efforts, many exciting findings and technologies have been verified to be effective in modeling perceptual image quality, such as the perceptual visibility threshold [16, 17], visual attention psychology [18, 19], structural similarity measurement [20, 21, 22], natural scene statistics [23, 24], semantic obviousness [25], and so on. Such promising progress offers us a wide variety of models to address the IQA problem. Meanwhile, it has also become a sweet nuisance to evaluate and compare them fairly, which is crucial for guiding the choice of metric in various quality driven applications in the real world.

At present, the most common evaluation indicators for IQA are two classic rank correlation statistics, i.e., Spearman’s ρ and Kendall’s τ [26]. This is not surprising, because the mean opinion score (MOS) collected from human subjects is typically represented by an ordinal variate in terms of a rating scale [27, 28]. Computing the rank correlation between the IQA predictions and the MOS is a straightforward way to measure a metric’s capability of sorting out perceptually preferred images. However, there is little research discussing the suitability of directly applying Spearman’s ρ and Kendall’s τ to the IQA task, even though they are successful in many other fields, such as medicine [29], biology [30], neuroscience [31], and so on.

Fig. 1: The subjective comparison between two rank results for the denoised images. The first row gives the ground-truth rank in terms of image quality, where the rank-orders are labeled in the top right corner of each image. The second and third rows show the rank results whose sorting errors occur in the top-rank and bottom-rank, respectively. Both the second and third rows share the same SRCC and KRCC, which are 0.500 and 0.333, respectively. The recommended denoising results in terms of different ranks are highlighted by the light blue window.

In reviewing these classic indicators [26], there are two rarely mentioned but noteworthy features, i.e., the uniform weighting and hard ranking principles, which assume that sorting mistakes at different quality levels share the same significance and that each pair of images is sortable. Under these hypotheses, two important perceptual properties are neglected in the IQA task, which easily results in a mismatch between the rank correlation evaluation and human preference. Firstly, in many real world applications, the sorting accuracy of high quality images usually plays a more important role than that of poor quality ones, because only the top-ranked images are pushed to the users. To illustrate this feature, in Fig. 1, we compare two rank results of several denoised images. In this example, only one pair of images is mistakenly sorted in each of the second and third rows, so the two rows are considered to share the same sorting accuracy in terms of Spearman’s Rank Correlation Coefficient (SRCC) and Kendall’s Rank Correlation Coefficient (KRCC). However, as shown in Fig. 1, the recommended denoising result from the third row clearly outperforms the one in the second row. When the sorting error occurs in the top ranks, a perceptually poor image is more likely to be pushed to the user, as shown in the second row. By contrast, a bottom-rank error among the poor quality images usually has negligible impact on the recommended result, and the third row still outputs the perceptually best denoising result. Secondly, due to the inherent uncertainty in making judgements [32, 33], the ranking of two images cannot provide a reliable measurement of the human perception of image quality when they share similar quality levels. For clarity, in Fig. 2, we compare two compressed images whose DMOSs are 13.72 and 13.70, respectively. Consider two IQA metrics whose predictions order these two images in opposite ways. When we follow the hard ranking principle of SRCC and KRCC, it is easy to conclude that the metric agreeing with the DMOS order is better than the other one. But, in terms of human perception, we cannot be certain whose prediction is better. In particular, we conducted a subjective investigation across 30 human subjects, who were required to give a binary response, i.e., Fig. 2 (a) is better or worse than Fig. 2 (b) in terms of image quality. Unsurprisingly, the subjects were sharply divided: 16 votes preferred Fig. 2 (a) and the others disagreed with them. The only agreement among all subjects was that it is too hard to make a certain judgement in distinguishing Figs. 2 (a) and (b), which means that the hard ranks between perceptually similar images do not contribute to a fair comparison of different IQA metrics.
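The tie between the two rows of Fig. 1 can be checked numerically. The sketch below assumes three images per row, which is the smallest setting consistent with the reported SRCC of 0.500 and KRCC of 0.333 for a single adjacent swap. The rank lists `top_swap` and `bottom_swap` are illustrative and are not read from the figure.

```python
from itertools import combinations


def sign(z):
    """Return -1, 0 or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def spearman_rho(p, q):
    """Spearman's rho for two tie-free rank lists of equal length."""
    n = len(p)
    d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(p, q):
    """Kendall's tau for two tie-free rank lists of equal length."""
    n = len(p)
    s = sum(sign(p[i] - p[j]) * sign(q[i] - q[j])
            for i, j in combinations(range(n), 2))
    return 2 * s / (n * (n - 1))


truth = [1, 2, 3]        # ground-truth ranks (assumed three images)
top_swap = [2, 1, 3]     # error between the two best images
bottom_swap = [1, 3, 2]  # error between the two worst images

for name, pred in (("top swap", top_swap), ("bottom swap", bottom_swap)):
    print(name, round(spearman_rho(truth, pred), 3),
          round(kendall_tau(truth, pred), 3))
```

Both swaps receive identical SRCC and KRCC values, although only the top swap changes which image would be recommended first.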

(a) Compressed house.bmp (DMOS = 13.72)
(b) Compressed buildings.bmp (DMOS = 13.70)
Fig. 2: The subjective comparison between two perceptually similar images from LIVE II databases. (a) shows the JPEG version of house.bmp, which is compressed at the bit rate of 1.40 bpp. (b) shows the JPEG version of buildings.bmp, whose bit rate is 1.78 bpp.

Although the two problems mentioned above are widespread in the field of IQA, very little research has discussed alternative rank correlation indicators other than SRCC and KRCC. In [34, 35], Wang et al. proposed an ingenious maximum differentiation (MAD) solution for comparing different IQA metrics, where each pair of IQA metrics compete with each other by detecting the most noticeable outliers of its competitor. Subsequently, two new indicators, namely the aggressiveness and resistance matrices, were designed for the group MAD (gMAD), which compute the average subjective scores collected from pair comparison. It is worth noting that gMAD was initially designed to accelerate the comparison of different IQA algorithms across a large-scale test set. The perceptual importance variance of different quality levels and the subjective uncertainty are not fully considered by the aggressiveness and resistance indicators either. In addition, the gMAD methodology requires additional manual work for rating the difference between paired images when evaluating a new IQA metric, which considerably limits the usability of the aggressiveness and resistance indicators.

To address the aforementioned challenges for IQA metric evaluation, we propose an easy-to-use perceptually weighted rank correlation (PWRC) indicator in this paper. In comparison with the popular SRCC and KRCC, the proposed indicator is characterized by three new features which are summarized in the following:

  1. Nonuniform weighting: To capture the perceptual importance variance in computing the rank correlation, we assign different weights to the rank mistakes occurring in different quality levels. For a pair of images, the weight is jointly determined by the maximum rank of their MOS/DMOS and the difference between their MOS/DMOS. That is, we focus on penalizing great rank mistakes in the high quality levels.

  2. Selective activation: In view of the subjective uncertainty for comparing perceptually similar images, we only count the perceptible sorting mistakes for the pairwise images whose MOS/DMOS difference exceeds a given sensory threshold (ST). In this way, the proposed indicator could suppress the impact of insensitive sorting errors in measuring the rank correlation between IQA predictions and human-rated MOS/DMOS.

  3. Graphical representation: Unlike the traditional numerical representation that shows rank performance with a single value, we utilize a graphical plot to illustrate how the sorting accuracy (SA) varies with a changeable ST, which is named the SA-ST curve. The graphical representation compares different IQA metrics more intuitively. Meanwhile, the SA-ST curve is capable of measuring the superiority of an IQA metric under different confidence intervals, which is beyond both SRCC and KRCC.

Extensive experiments on five publicly available databases show that the proposed PWRC indicator can better distinguish different IQA metrics even when they share the same number of mistakenly ranked image pairs. Furthermore, we also validate the applicability of the proposed indicator in recommending robust IQA metrics for both degraded and enhanced images.

The remainder of this paper is organized as follows. Section II reviews the limitations of the classic SRCC and KRCC statistics. Then, the proposed PWRC indicator is introduced in Section III. We discuss the experimental results in Section IV. Finally, this paper is concluded in Section V.

II Rank Correlation Measurement Based on Equal Weighting

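Given $n$ objects described by two variables x and y, with observations $x_1, \dots, x_n$ and $y_1, \dots, y_n$, Kendall et al. [26] designed a generalized correlation coefficient $\Gamma$, which is given by

$$\Gamma = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} b_{ij}}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^{2} \, \sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}^{2}}}, \tag{1}$$

where $a_{ij}$ denotes the x-correlation variable between $x_i$ and $x_j$, and $b_{ij}$ denotes the y-correlation variable between $y_i$ and $y_j$. Both $a_{ij}$ and $b_{ij}$ are required to be anti-symmetric, i.e., to satisfy $a_{ij} = -a_{ji}$ and $b_{ij} = -b_{ji}$.

By properly setting $a_{ij}$ and $b_{ij}$, we can easily deduce Spearman’s ρ and Kendall’s τ from Eq. (1). Let $a_{ij}$ and $b_{ij}$ represent the rank deviation of a pair of observation values,

$$a_{ij} = p_j - p_i, \qquad b_{ij} = q_j - q_i, \tag{2}$$

where $p_i$ denotes the rank of $x_i$ in x, and $q_i$ denotes the rank of $y_i$ in y. Then, we can obtain ρ as

$$\rho = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} (p_j - p_i)(q_j - q_i)}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} (p_j - p_i)^{2} \, \sum_{i=1}^{n}\sum_{j=1}^{n} (q_j - q_i)^{2}}}. \tag{3}$$

Assuming there are no tied ranks in x and y, the rank variables $p_i$ and $q_i$ share equally distributed values ranging in $\{1, 2, \dots, n\}$. Under this assumption, it is easy to rewrite Eq. (3) in the more familiar form of Spearman’s ρ according to [26], i.e.,

$$\rho = 1 - \sum_{i=1}^{n} w \, d_i^{2}, \tag{4}$$

where $w = \frac{6}{n(n^{2}-1)}$, $d_i = p_i - q_i$, and $p$ denotes the ground-truth ranks represented by the sequential values $p_i = i$.

It is clear that Spearman’s ρ assigns the same weight $w$ to each rank mistake term $d_i^{2}$. This reflects the subjective uncertainty to some extent, since an insensitive rank mistake results in a smaller $d_i^{2}$. However, the perceptual importance variance is completely ignored by its uniform weight $w$.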

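Similarly, when we set $a_{ij}$ and $b_{ij}$ to the binary rank discriminators, i.e.,

$$a_{ij} = \mathrm{sgn}(p_j - p_i), \qquad b_{ij} = \mathrm{sgn}(q_j - q_i), \tag{5}$$

we can deduce Kendall’s τ as

$$\tau = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sgn}(p_j - p_i)\,\mathrm{sgn}(q_j - q_i), \tag{6}$$

where $i, j \in \{1, \dots, n\}$, and $\mathrm{sgn}(\cdot)$ is a binary signum function defined by

$$\mathrm{sgn}(z) = \begin{cases} 1, & z > 0 \\ -1, & z < 0. \end{cases} \tag{7}$$

It is seen that Kendall’s τ works like an outlier detector, which tries to capture the disagreement of the rank-orders between $p$ and $q$. Because Eq. (6) only counts the signs of pairwise rank differences, it is unable to measure the perceptibility of an outlier, which is determined by the severity of the sorting error. Meanwhile, the uniform weight $\frac{2}{n(n-1)}$ is also applied to each pair of ranked samples in Eq. (6), which does not consider the perceptual importance variance across different rank levels.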
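As a small numerical check of the derivation in this section, the sketch below evaluates the generalized coefficient of Eq. (1) with two pairwise scores: the raw rank difference, which should reproduce Spearman’s ρ, and the sign of the rank difference, which should reproduce Kendall’s τ. The function names and the example rank lists are illustrative assumptions, and tie-free ranks are assumed throughout.

```python
import math


def sign(z):
    """Return -1, 0 or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def identity(z):
    """Pairwise score that keeps the rank difference itself."""
    return z


def generalized_gamma(p, q, score):
    """Generalized correlation coefficient over all ordered pairs (i, j).

    `score` maps a rank difference to the anti-symmetric pairwise
    variable: `identity` for Spearman, `sign` for Kendall.
    """
    n = len(p)
    num = sum_a2 = sum_b2 = 0.0
    for i in range(n):
        for j in range(n):
            a = score(p[j] - p[i])
            b = score(q[j] - q[i])
            num += a * b
            sum_a2 += a * a
            sum_b2 += b * b
    return num / math.sqrt(sum_a2 * sum_b2)


p = [1, 2, 3, 4, 5]  # ground-truth ranks
q = [2, 1, 3, 5, 4]  # predicted ranks with two adjacent swaps

rho = generalized_gamma(p, q, identity)
tau = generalized_gamma(p, q, sign)
print(rho, tau)
```

For these lists the closed forms give ρ = 1 - 6·4/(5·24) = 0.8 and τ = (8 - 2)/10 = 0.6, which the generalized coefficient matches up to floating-point error.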

III Perceptually Weighted Rank Correlation Indicator

To better interpret visual perception behaviors in measuring rank correlation, we propose to assign different weights to each pair of ranked samples by considering both the degree of the sorting error and the level of the mistaken ranks. Meanwhile, the perceptibility of the mistaken ranks is also identified via a mutable sensory threshold.

Suppose x and y are the continuous-valued MOSs/DMOSs and IQA predictions, respectively. We first transform them to the discrete rank-orders p and q. Then, the proposed indicator separates the task of rank correlation measurement into three steps: comparison activation, outlier detection, and importance measurement.

Before comparing the agreement between each pair of ($p_i$, $p_j$) and ($q_i$, $q_j$), we need to judge whether this comparison is perceptually sensitive or not. When the MOS/DMOS difference between $x_i$ and $x_j$ exceeds the sensory threshold, we activate the comparison between ($p_i$, $p_j$) and ($q_i$, $q_j$). As discussed in [36, 37, 38, 39, 40], visual sensitivity covers a wide dynamic range across different ages, genders, consciousness, and so on. So, we sample multiple sensory thresholds to evaluate the rank correlation through an activation function. Next, the outlier detection focuses on identifying discordant rank-orders between ($p_i$, $p_j$) and ($q_i$, $q_j$). Given all of the activated mistaken rank-orders, we need to assign different weights to them by measuring their perceptual importance. Both the rank-order deviation of (p, q) and the location of the mistaken rank are considered in the importance measurement. Finally, by combining these three components together, we obtain an overall sorting accuracy indicator, i.e.,


where is a combination function for fusing three rank correlation related components.

In defining the rank correlation indicator , we would like it to possess both the usability and visual perception features, which should satisfy four attributes.

  1. Symmetry: ;

  2. Boundness: ;

  3. Ambiguity: Given a pair of inputs x and y, their rank correlation is non-unique and varies with a changeable sensory threshold ;

  4. Unique maximum: , if and only if x = y and all elements of x satisfy when .

At first, we define the activation function as


where the constant is used to control the steepness of the activation curve. When grows larger, the activation function would get close to the unit step function. In comparison with the hard activation by unit step function, the proposed soft activation in Eq. (9) could better model the progressive change of visual sensitivity [41, 42].
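The behavior described for the soft activation can be illustrated with a logistic function of the MOS/DMOS difference. The logistic form, the function names and the parameter names `threshold` and `steepness` are assumptions made for this sketch and should not be read as the exact expression of Eq. (9). It only shows how a larger steepness constant moves a soft activation towards the unit step function.

```python
import math


def soft_activation(score_diff, threshold, steepness):
    """Logistic activation of a pairwise comparison (assumed form).

    Returns a value in [0, 1] that grows as |score_diff| exceeds
    `threshold`. The two branches keep math.exp from overflowing.
    """
    z = steepness * (abs(score_diff) - threshold)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hard_activation(score_diff, threshold):
    """Unit step activation, the limiting case for a large steepness."""
    return 1.0 if abs(score_diff) > threshold else 0.0


for k in (0.1, 1.0, 10.0):
    # Activation of a pair whose scores differ by 12 under a threshold of 10.
    print(k, soft_activation(12.0, 10.0, k))
```

With a small steepness, a pair just above the threshold is only partially counted. With a large one, the pair is counted almost fully, as under the hard step.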

Following the Gaussian distribution assumption about human opinion score

[43], we consider the MOSs/DMOSs variables satisfy and deduce an appropriate

via interval estimation, where


are the mean value and standard deviation of human opinion scores for the

th image. Based on the three-sigma rule [44], we know that there are about 95% human opinion scores lying within two standard deviations away from a MOS/DMOS in evaluating each image. In Fig. 3, we show the distribution of human opinion scores, which is modeled by a Gaussian distribution. The sensory threshold adjusts the mean value of this Gaussian distribution to . When another MOS/DMOS goes beyond by the interval of , we believe that these two images are perceived differently at the confidence level of 0.95, which results the equation


where is approximately equal to . It is noted that different IQA databases usually collect the MOS/DMOS and standard deviation at different scales. To obtain a general setting for , we normalize all subjective scores to the range of [0, 100]. Let denote the scaling factor and denote the bias term for x. Then, the normalization for x is defined by


Meanwhile, all standard deviations are normalized by


In this paper, the mean value of across three popular IQA databases, i.e., LIVE II [8], TID2013 [9] and ChallengeDB [11], is utilized to compute , where and .
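The normalization step can be sketched as a linear map with a scaling factor and a bias term. The min-max choice below, which sends the smallest score to 0 and the largest to 100, is an assumption made for illustration, as are the function names and the sample scores. Standard deviations are multiplied by the scaling factor only, because a bias does not change spread.

```python
def fit_normalizer(scores, lo=0.0, hi=100.0):
    """Return (scale, bias) mapping min(scores) to lo and max(scores) to hi."""
    s_min, s_max = min(scores), max(scores)
    scale = (hi - lo) / (s_max - s_min)
    bias = lo - scale * s_min
    return scale, bias


def normalize_scores(scores, scale, bias):
    """Apply the linear map scale * s + bias to every subjective score."""
    return [scale * s + bias for s in scores]


def normalize_stds(stds, scale):
    """Rescale standard deviations; the bias term does not affect them."""
    return [abs(scale) * s for s in stds]


# Hypothetical MOS values on a 1-to-5 rating scale with their deviations.
mos = [1.0, 3.0, 5.0]
std = [0.5, 0.8, 0.4]

scale, bias = fit_normalizer(mos)
print(normalize_scores(mos, scale, bias))
print(normalize_stds(std, scale))
```

The same scale and bias would then be applied to the predicted qualities, so that thresholds expressed on the [0, 100] range mean the same thing across databases.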

Fig. 3: The distribution of human opinion scores.

The outlier detection function borrows the idea of sign comparison from Kendall’s τ, i.e., it takes the product

$$\mathrm{sgn}(p_i - p_j)\,\mathrm{sgn}(q_i - q_j), \tag{13}$$

where each pair of mistaken rank-orders is labeled by -1. Otherwise, the outlier detection function labels the pair by 1. Clearly, more discordant pairs between ($p_i$, $p_j$) and ($q_i$, $q_j$) lead to more negative outputs from Eq. (13).

In comparing each pair of rank-orders (, ) and (, ), we assign different weights to them by measuring their perceptual importance. Two factors, i.e., rank deviation and rank level, are considered for computing the weight. Particularly, the ground-truth label has been sequentially ranked as . We represent the normalized rank deviation term between and by . The normalized rank level term is . Both and locate in the range [0, 1]. Then, the importance measurement function is defined as


where when , and . Particularly, a larger weight would be assigned to the predicted rank-order which deviates more from a high level ground-truth rank . For clarity, the weight variation across different rank deviations and rank levels is shown in Fig. 4.

Fig. 4: The illustration of importance measurement function .

Finally, we combine all three components in Eqs. (9), (13) and (14) to compute the proposed PWRC indicator, i.e.,


It is clear that Eq. (15) satisfies all four attributes required of an easy-to-use and perception-aware rank correlation indicator. In particular, Kendall’s τ corresponds to a special case of Eq. (15) in which the comparison activation and importance measurement functions are set to constants.
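The way the three components interact can be sketched with a generic weighted sign correlation. The normalization by the total weight, the callables `always_on`, `uniform_weight` and `top_heavy_weight`, and the sample data are all hypothetical choices for illustration; they are not the functions of Eqs. (9), (14) and (15). The sketch only demonstrates two properties stated in the text: constant activation and weight recover Kendall’s τ, and a nonuniform weight separates a top-rank error from a bottom-rank error.

```python
from itertools import combinations


def sign(z):
    """Return -1, 0 or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def weighted_rank_correlation(x, p, q, activation, weight):
    """Sum of activation * concordance * weight, divided by total weight."""
    num = den = 0.0
    for i, j in combinations(range(len(x)), 2):
        a = activation(abs(x[i] - x[j]))               # comparison activation
        d = sign(p[i] - p[j]) * sign(q[i] - q[j])      # outlier detection
        w = weight(p[i], p[j])                         # importance measurement
        num += a * d * w
        den += w
    return num / den


def always_on(score_diff):
    """Constant activation: every pair is compared."""
    return 1.0


def uniform_weight(pi, pj):
    """Constant importance: every pair counts equally."""
    return 1.0


def top_heavy_weight(pi, pj):
    """Hypothetical weight favoring pairs that involve a top-ranked image."""
    return 1.0 / min(pi, pj)


x = [90.0, 70.0, 50.0, 30.0, 10.0]  # hypothetical MOS values
p = [1, 2, 3, 4, 5]                 # ground-truth ranks (1 is best)
q_top = [2, 1, 3, 4, 5]             # error between the two best images
q_bottom = [1, 2, 3, 5, 4]          # error between the two worst images

print(weighted_rank_correlation(x, p, q_top, always_on, uniform_weight))
print(weighted_rank_correlation(x, p, q_top, always_on, top_heavy_weight))
print(weighted_rank_correlation(x, p, q_bottom, always_on, top_heavy_weight))
```

Under the uniform setting both predicted ranks score the same value as Kendall’s τ, while the top-heavy weight penalizes the top-rank error more than the bottom-rank one.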

To highlight the difference between PWRC and existing rank correlation indicators, we investigate the evaluation accuracy of different indicators on a set of synthetic data. Assuming that there are five MOS values, i.e., , , and the rank of each MOS is given by . For comparison with the single-valued indicators SRCC and KRCC, the activation function in PWRC is first set to a constant, i.e., . In this investigation, we set the ground-truth rank of x to a descending order permutation

, which is widely used in kinds of image retrieval and enhancement applications. Then, ten predicted ranks indexed by

are evaluated based on different indicators, where the detail results are shown in Table I. For analysis, the total number of mistaken rank-orders, i.e., , is also calculated for each predicted rank, where is a window function and defined by

TABLE I: Comparison of different rank correlation indicators

It is seen that all rank correlation indicators gradually decrease when the number of mistaken rank-orders increases from 0 to 10. If this number is fixed, the SRCC and KRCC stay the same even when the predicted ranks are different. By contrast, the proposed PWRC can further tell the difference between predicted ranks that share the same number of mistaken rank-orders. To verify the significance of this subtle discriminability, we analyze the average MOS difference between the top $N$ ranked samples and the remaining ones, which is computed by


where a larger value means that the predicted rank is better for recommending high quality images. Since $N$ typically varies across different applications, we further calculate the mean value of this difference over $N$ by


As shown in Table I, the mean MOS difference monotonically decreases from the predicted rank S1 to S10. Only the proposed PWRC captures this change, which is crucial for recommending reliable IQA metrics to various quality driven applications.
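The benchmark described above can be sketched as follows: for a given $N$, take the $N$ samples that the prediction ranks highest, and subtract the mean MOS of the remaining samples from the mean MOS of this top group. Averaging over $N$ from 1 to one less than the number of samples is an assumption of this sketch, and the MOS values and function names are hypothetical.

```python
def top_n_mos_gap(mos, predicted_rank, n_top):
    """Mean MOS of the n_top best-ranked samples minus mean MOS of the rest."""
    order = sorted(range(len(mos)), key=lambda i: predicted_rank[i])
    top = [mos[i] for i in order[:n_top]]
    rest = [mos[i] for i in order[n_top:]]
    return sum(top) / len(top) - sum(rest) / len(rest)


def mean_mos_gap(mos, predicted_rank):
    """Average of the top-N gap over N = 1 .. len(mos) - 1 (assumed range)."""
    n = len(mos)
    gaps = [top_n_mos_gap(mos, predicted_rank, k) for k in range(1, n)]
    return sum(gaps) / len(gaps)


mos = [90.0, 70.0, 50.0, 30.0, 10.0]  # hypothetical MOS values
perfect = [1, 2, 3, 4, 5]             # prediction that matches the MOS order
top_swap = [2, 1, 3, 4, 5]            # the two best images are exchanged

print(mean_mos_gap(mos, perfect))
print(mean_mos_gap(mos, top_swap))
```

A prediction that pushes a weaker image to the first position lowers the gap, which is why this quantity serves as a reference for the push capability of a ranking.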

After the comparison between single-valued indicators, we further illustrate the ambiguity of the proposed PWRC by resorting to Eq. (9). In particular, twenty evenly spaced sensory thresholds ranging in [0, 100] are investigated in this section, which produce an SA-ST curve for each predicted rank. As shown in Fig. 5, the PWRC indicator changes dynamically when the sensory threshold increases from 0 to 100. Two important features of PWRC are revealed by this observation.

Firstly, the PWRC is non-monotonic with respect to an increasing threshold, which may lead to different evaluation results when comparing two predicted ranks. For example, in Fig. 5, the predicted rank S4 is superior to S5 when the threshold is smaller than 10. As the threshold increases from 15 to 30, S4 becomes inferior to S5. When the threshold is larger than 35, their performances become close to each other. Similar observations can also be found between S7 and S8. The main reasons behind this observation lie in the contrary effects of inactivating different outlier detection responses and in the nonuniform weighting of different quality levels. More specifically, given two pairs of correctly and mistakenly ranked images, their inactivation brings a PWRC reduction and elevation, respectively. In addition, when the inactivation is applied to high quality image pairs, the PWRC reduction or elevation is more significant. Otherwise, the PWRC change is relatively small. This feature enables PWRC to evaluate an IQA metric more comprehensively, where a changeable threshold can capture the visual sensitivities across different genders, ages and so on [36, 37].

Secondly, although the PWRC shows different variation tendencies for various predicted ranks, the curves tend to approach each other as the threshold increases. This feature enhances the fault tolerance of PWRC with respect to unsortable images, whose MOSs/DMOSs are close to each other. More specifically, if an IQA metric’s PWRC outperforms the others at a larger threshold, the confidence of its superiority is higher. By accumulating PWRC across a given threshold range, we derive a confidence-aware rank correlation measurement from the area under the curve (AUC), which is defined by


where and . is the set of all normalized standard deviations associated to the MOSs/DMOSs in each IQA database. For clarity, an illustration of is shown in Fig. 6.
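Accumulating PWRC over a threshold range amounts to a numerical integration of the SA-ST curve. The sketch below uses the trapezoidal rule on sampled thresholds. It assumes that the thresholds are sorted in ascending order and that the range limits coincide with sampled points, and it applies no further normalization. The function name and the sample curve are hypothetical.

```python
def sa_st_auc(thresholds, pwrc_values, t_min, t_max):
    """Trapezoidal area under the SA-ST curve restricted to [t_min, t_max]."""
    points = [(t, v) for t, v in zip(thresholds, pwrc_values)
              if t_min <= t <= t_max]
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += 0.5 * (v0 + v1) * (t1 - t0)
    return area


# Hypothetical SA-ST samples for one IQA metric.
thresholds = [0.0, 10.0, 20.0, 30.0, 40.0]
pwrc_values = [0.80, 0.70, 0.60, 0.55, 0.50]

print(sa_st_auc(thresholds, pwrc_values, 10.0, 30.0))
```

Comparing two metrics by this area rewards a metric that stays ahead over the whole threshold range rather than at a single threshold.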

Fig. 5: The PWRC variation under different sensory thresholds .

IV Experiments

IV-A Protocols

In this section, we investigate the performance of different rank correlation indicators on five publicly available IQA databases, which include LIVE II [8], TID2013 [9], ChallengeDB [11], IVCDehazing [45] and ESPL-LIVE [46]. Each database explores the human perception of image quality towards different visual contents. The LIVE II and TID2013 databases collect human opinion scores on thousands of images contaminated by artificially simulated distortions, such as JPEG2000, JPEG, additive Gaussian white noise, Gaussian blur, fast fading and so on. In ChallengeDB, there are 1162 authentically distorted images collected from a wide variety of mobile camera devices, including smart phones and tablets. The IVCDehazing database focuses on studying the IQA problem for images enhanced under foggy weather. For the ESPL-LIVE database, large-scale subjective scores are collected on 1811 high dynamic range (HDR) images, which are created by different tone mapping and multi-exposure fusion techniques.

Fig. 6: The computed under a given sensory threshold range.

For comparison, we use different rank correlation indicators to evaluate thirteen popular IQA algorithms, which include six full-reference (FR) metrics, i.e., PSNR [47], IWPSNR [48], SSIM [20], IWSSIM [12], MSSSIM [49], FSIM [50], and seven no-reference (NR) metrics, i.e., BIQI [51], BLIINDS II [52], BRISQUE [53], DIIVINE [54], NFERM [55], M3 [56] and TCLT [57].

Since the NR-IQA algorithms require an additional training process, we follow the random split criterion in [51, 52, 54, 53, 55, 56, 57] and separate each IQA database into non-overlapping training and testing sets 1000 times. More specifically, in each trial, we randomly choose a subset of the reference images and their contaminated or enhanced versions to construct the training set. The testing set is composed of the remaining images, whose visual contents are different from those of the training set. Note that the images collected in the ChallengeDB database do not include any reference images or overlapping visual contents. So, we straightforwardly divide it into two parts for training and testing, respectively. In addition, all FR-IQA algorithms are evaluated on the same test sets as the NR-IQA metrics, which ensures a fair comparison under a consistent experimental setup.
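A content-based split of this kind can be sketched as below. The split is drawn over reference images, so that every processed version of a reference lands on the same side. The function name, the file names and the rounding of the training share are assumptions for illustration.

```python
import random


def split_by_reference(ref_of_image, train_ratio, rng):
    """Split images into train/test lists without sharing reference content.

    `ref_of_image` maps each processed image to its reference image.
    At least one reference is kept on each side of the split.
    """
    refs = sorted(set(ref_of_image.values()))
    rng.shuffle(refs)
    n_train = max(1, min(len(refs) - 1, round(train_ratio * len(refs))))
    train_refs = set(refs[:n_train])
    train = [img for img, ref in ref_of_image.items() if ref in train_refs]
    test = [img for img, ref in ref_of_image.items() if ref not in train_refs]
    return train, test


# Hypothetical database: 5 reference images with 3 processed versions each.
ref_of_image = {f"ref{r}_v{d}.bmp": f"ref{r}"
                for r in range(5) for d in range(3)}

rng = random.Random(0)  # fixed seed so that a trial can be repeated
train, test = split_by_reference(ref_of_image, 0.8, rng)
print(len(train), len(test))
```

Repeating the call with fresh random states gives the independent trials over which an indicator can be summarized, for example by its median.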

In this section, three split ratios are investigated, where the training set takes up 80, 50, and 20 percent of the images in each database. The median values of SRCC, KRCC and the proposed confidence-aware PWRC measurement across the 1000 random split trials are used to rank different IQA algorithms. Since quality-driven applications prefer an accurate push capability for high quality images, we employ the mean MOS difference between the top-ranked images and the rest as the benchmark for ranking different IQA metrics. Given a rank correlation indicator, the disagreements between its rank results and those of the benchmark are used for measuring the push accuracy, where more disagreements mean worse accuracy.

IV-B Implementation Details

As discussed in Section III, we compute the proposed PWRC on normalized MOS/DMOS values and standard deviations. Given an IQA database, the scaling factor and bias term are computed on the set of all MOS/DMOS values. Both the ground-truth subjective scores and the predicted image qualities are normalized by Eqs. (11) and (12). Then, we compute the confidence-aware indicator from PWRC curve to rank different IQA metrics, where a sensory threshold range [, ] is required as shown in Eq. (19). It is noted that different IQA databases would produce different , and [, ]. For clarity, we summarize the parameters of different databases in Table II, where a larger MOS/DMOS standard deviation would require wider sensory threshold range for computing .

(a) LIVE II database with 80% training data
(b) TID2013 database with 80% training data
(c) ChallengeDB database with 80% training data
(d) LIVE II database with 50% training data
(e) TID2013 database with 50% training data
(f) ChallengeDB database with 50% training data
(g) LIVE II database with 20% training data
(h) TID2013 database with 20% training data
(i) ChallengeDB database with 20% training data
Fig. 7: The SA-ST curves of different IQA algorithms tested on three databases. (a)-(c) show the PWRC performances on LIVE II, TID2013, and ChallengeDB, respectively, where the training sets take up 80% images in each database. (d)-(f) give the rank correlation results whose training sets occupy 50% images. (g)-(i) report the ranking accuracy when training sets occupy 20% images.
TABLE II: The parameters for PWRC on different databases

IV-C Evaluation on Degraded Images

TABLE III: Comparison of different rank correlation indicators on LIVE II database
TABLE IV: Comparison of different rank correlation indicators on TID2013 database
TABLE V: Comparison of different rank correlation indicators on ChallengeDB database

In this section, we first employ the proposed PWRC to evaluate the rank performance of different IQA metrics on degraded images, which is investigated on the LIVE II, TID2013 and ChallengeDB databases, respectively. To give an intuitive comparison, the SA-ST curves of different IQA algorithms are shown in Fig. 7. Note that the ChallengeDB database does not contain uncontaminated reference images, so only the results of NR-IQA algorithms are reported for it in Fig. 7. It is seen that existing IQA metrics work well on artificially simulated distortions, where most PWRC results are larger than 0.5 on the LIVE II and TID2013 databases. However, in coping with authentically distorted images, the performance of all IQA metrics is very poor, with PWRC results all smaller than 0.4 on the ChallengeDB database. In addition, the performance of the training based NR-IQA algorithms gradually decreases when the ratio of the training set reduces from 80% to 20%. By contrast, the FR-IQA metrics achieve more robust results, and similar results can also be found in [35]. As shown in Fig. 7 (a), all NR-IQA algorithms, apart from BIQI, achieve PWRC performance comparable to that of the FR-IQA metrics when the training set takes up 80% of the images. In Fig. 7 (g), however, the FR-IQA algorithms outperform most NR-IQA algorithms when the training set ratio drops to 20%.

To quantitatively compare the rank results of different IQA metrics, we further report their SRCC, KRCC, and confidence-aware PWRC values on the LIVE II, TID2013 and ChallengeDB databases, which are shown in Tables III, IV and V. In particular, we highlight pairs of metrics in boldface and italic when their rank results are inconsistent between the benchmark and the other indicators. Each mistaken pair is labeled by the same color, such as red, blue and green. As shown in Table III, when 80% of the images are used for training on the LIVE II database, the SRCC considers that BLIINDS II and BRISQUE have the same rank accuracy, since both reported values are 0.925. However, the benchmark value of BRISQUE is clearly better than that of BLIINDS II. Both the KRCC and the proposed indicator report rank results consistent with the benchmark. When IW-PSNR is compared with NFERM, the SRCC prefers the FR-IQA metric IW-PSNR, which is also opposite to the benchmark and the other indicators. When the training set ratio is set to 50%, both the SRCC and KRCC recommend SSIM over M3. However, M3 achieves a higher benchmark value, and the proposed indicator correctly reflects this superiority. A similar result is also reported in comparing M3 with TCLT when we use 20% of the images for training. Meanwhile, more disagreements between the benchmark and SRCC/KRCC can be found in Tables IV and V.

Across all experiments on the LIVE II, TID2013 and ChallengeDB databases, the confidence-aware measurement computed from our proposed PWRC achieves rank results that are fully consistent with the benchmark, with no disagreement between them. This verifies that the proposed PWRC indicator is beneficial for pushing perceptually preferred images from the degraded data, where the top ranked images possess a higher average MOS.

IV-D Evaluation on Enhanced Images

(a) IVCDehazing database with 80% training data
(b) IVCDehazing database with 50% training data
(c) IVCDehazing database with 20% training data
(d) ESPL-LIVE database with 80% training data
(e) ESPL-LIVE database with 50% training data
(f) ESPL-LIVE database with 20% training data
Fig. 8: The SA-ST curves of different IQA algorithms tested on the enhanced images. (a)-(c) show the performance on the IVCDehazing database, where the training sets take up 80%, 50% and 20% of the images, respectively. (d)-(f) give the results on the ESPL-LIVE database, whose training set ratios are 80%, 50% and 20%, respectively.

A reliable IQA metric plays a crucial role in many quality-driven applications. In this section, we use the proposed PWRC indicator to recommend appropriate IQA metrics for two image enhancement applications, i.e., image dehazing and HDR image reconstruction, which are investigated on the IVCDehazing and ESPL-LIVE databases, respectively. More specifically, the IQA metric serves as a selector among different enhancement algorithms, and the PWRC aims to find the most robust selector among the candidate IQA metrics. It is worth noting that there is no perfect reference for an enhanced image. We therefore select seven state-of-the-art NR-IQA algorithms as the candidate metrics: BIQI [51], BLIINDS-II [52], BRISQUE [53], DIIVINE [54], NFERM [55], M3 [56] and TCLT [57].

For fair comparison, different enhancement algorithms should be applied to the same visual content. In view of this fact, we develop a slightly deformed average MOS difference to represent the benchmark of each IQA metric in pushing perceptually preferred images. Given candidate algorithms and raw images, we use to denote the performance of the th IQA metric for ranking enhanced versions of the th raw image, where could be computed via Eq. (18) by setting to . Then, the overall performance of the th IQA metric could be defined by


where the similar measurement could be found in the rational test of [35].

Similar to , we use the image-wise averaged rank correlation indicators to sort different enhancement algorithms


where the area under the curve values computed from is denoted by . For comparison, the deformed indicators and are also tested in this section, which can be computed by replacing with and in Eq. (21), respectively.
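The image-wise averaging described above might be sketched as follows, with Kendall's tau standing in for the indicator. The grouping format, the function names and the scores are assumptions made for illustration only, and the sketch is not a reproduction of Eq. (21).

```python
from itertools import combinations


def kendall_tau(pred, mos):
    """Kendall's tau-a over all pairs of one image group."""
    score, pairs = 0, 0
    for i, j in combinations(range(len(mos)), 2):
        s = (pred[i] - pred[j]) * (mos[i] - mos[j])
        score += (s > 0) - (s < 0)
        pairs += 1
    return score / pairs


def imagewise_average(indicator, groups):
    """Average an indicator over raw images.

    groups: list of (pred, mos) tuples, one per raw image, where each list
    holds the scores of that image's enhanced versions.
    """
    values = [indicator(pred, mos) for pred, mos in groups]
    return sum(values) / len(values)


# Toy data (hypothetical): two raw images, three enhanced versions each.
groups = [
    ([0.2, 0.5, 0.9], [30, 45, 70]),  # perfectly ranked, tau = 1
    ([0.8, 0.3, 0.6], [20, 55, 60]),  # two of three pairs reversed
]

print(imagewise_average(kendall_tau, groups))
```

Computing the indicator within each raw image before averaging keeps the comparison restricted to enhanced versions of the same visual content, which is the fairness requirement stated above.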

Similar to the test on degraded images, we first show the SA-ST curves of different IQA metrics in Fig. 8. The existing NR-IQA metrics work poorly in recommending high-quality dehazed or HDR images, and their values are all very small. In addition, in comparison with Fig. 7, there are clearly more intersections between different SA-ST curves in Fig. 8. For example, in Fig. 8 (a), BRISQUE outperforms M3 when is smaller than 40. However, because BRISQUE descends much faster than M3, its values become lower than those of M3 when is larger than 40. Similar observations can be made for NFERM and BLIINDS II in Fig. 8 (b), and intersections between BRISQUE and BLIINDS II also occur in Figs. 8 (e) and (f). The intersections between different SA-ST curves occur more frequently for the IQA metrics whose locate in low values, as shown in Figs. 8 (a)-(c). This is because two factors can result in a low PWRC value: 1) a large number of correct ranks located at low quality levels, or 2) a small number of correct ranks located at high quality levels. In case 1), the SA-ST curve descends relatively slowly, because the changes at low quality levels only slightly impair due to their low weights. By contrast, in case 2), the SA-ST curve descends faster due to the larger weights assigned to high quality levels in . This subtle discriminability enables to evaluate each IQA metric from a more comprehensive perspective. In the following, by means of , we can derive a single-valued measurement from under the given confidence intervals.
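The interplay between quality-dependent weights and the sensory threshold can be made concrete with a deliberately simplified score. The sketch below is not the PWRC formula of this paper: the weight (the larger MOS of the pair), the activation rule (MOS gap above the threshold), the normalisation and all names are assumptions chosen only to show how a threshold sweep yields an SA-ST-like curve and how an area under that curve gives a single value.

```python
from itertools import combinations


def weighted_sorting_accuracy(pred, mos, threshold):
    """Simplified, illustrative score (NOT the paper's PWRC formula).

    A pair is activated only if its MOS gap exceeds `threshold`. Each pair
    is weighted by the larger MOS of the two (assumed weighting), and the
    signed agreements are normalised by the weight of all pairs.
    """
    total, score = 0.0, 0.0
    for i, j in combinations(range(len(mos)), 2):
        weight = max(mos[i], mos[j])
        total += weight
        if abs(mos[i] - mos[j]) > threshold:
            s = (pred[i] - pred[j]) * (mos[i] - mos[j])
            score += weight * ((s > 0) - (s < 0))
    return score / total


def sa_st_curve(pred, mos, thresholds):
    """Score sampled at each sensory threshold."""
    return [weighted_sorting_accuracy(pred, mos, t) for t in thresholds]


def area_under_curve(xs, ys):
    """Trapezoidal rule over the sampled curve."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
               for k in range(len(xs) - 1))


# Toy data (hypothetical): larger values mean better quality.
mos = [10, 20, 30, 40, 50]
metric_a = [1, 2, 3, 5, 4]  # swaps the two best images
metric_b = [2, 1, 3, 4, 5]  # swaps the two worst images
thresholds = [0, 10]

curve_a = sa_st_curve(metric_a, mos, thresholds)
curve_b = sa_st_curve(metric_b, mos, thresholds)
print(curve_a, area_under_curve(thresholds, curve_a))
print(curve_b, area_under_curve(thresholds, curve_b))
```

At a threshold of 0 the mistake on the two best images costs `metric_a` more than the mistake on the two worst images costs `metric_b`, so its curve starts lower. Once the threshold reaches the MOS gap of the mistaken pairs, both mistakes are treated as perceptually insensitive and the two curves meet, so `metric_a` loses less as the threshold grows.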

In Tables VI and VII, we report the detailed accuracy of different IQA metrics on the IVCDehazing and ESPL-LIVE databases, respectively. Similar to the observations in [45, 46], existing NR-IQA metrics work poorly in ranking the quality of enhanced images. As shown in Table VI, many IQA metrics produce negative values. The results in Table VII are slightly better, but the values are still very small. All three rank correlation indicators reflect the poor accuracy of these algorithm selectors, where , and are all very low. However, they disagree on the detailed ranks of the IQA metrics, and the discordant IQA pairs are highlighted in boldface and italic in Tables VI and VII.

TABLE VI: Comparison of different rank correlation indicators on IVCDehazing database
TABLE VII: Comparison of different rank correlation indicators on ESPL-LIVE HDR database

As shown in Table VI, when 80% of the dehazing images are used for training, both and recommend BRISQUE instead of M3 or TCLT, which is consistent with , whereas the prefers TCLT. In Table VII, when 50% of the HDR images are used for training, both and consider BRISQUE better than BLIINDS II, and only the proposed indicator is consistent with . A similar situation is found when comparing BRISQUE with BLIINDS II at a training set ratio of 20%. These experiments confirm that the proposed indicator works better for recommending robust IQA metrics for enhanced images, whose top-ranked samples present higher average perceptual quality.

V Conclusion

In this paper, we propose a perceptually weighted rank correlation (PWRC) indicator to fairly compare different IQA algorithms. Inspired by two important visual perception properties, i.e., perceptual importance variation and subjective uncertainty, we develop the nonuniform weighting and adaptive activation schemes to evaluate the rank accuracy of each IQA algorithm. More specifically, a larger weight would be assigned to the image pair with higher quality level and greater rank deviation. Meanwhile, the comparison between two images would be activated only if their rank deviation exceeds the given sensory threshold, which suppresses the interference from perceptually unsortable image pairs. Extensive experiments on five publicly available IQA databases show that the proposed indicator is more consistent with human perception and works better for recommending perceptually preferred images.


  • [1] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, and P. Tran-Gia, “Best practices for QoE crowdtesting: QoE assessment with crowdsourcing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 541–558, 2014.
  • [2] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao, “Perceptual video coding based on ssim-inspired divisive normalization,” IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1418–1429, April 2013.
  • [3] K. Ma, K. Zeng, and Z. Wang, “Perceptual quality assessment for multi-exposure image fusion,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3345–3356, 2015.
  • [4] K. Gu, G. Zhai, X. Yang, W. Zhang, and C. W. Chen, “Automatic contrast enhancement technology with saliency preservation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 9, pp. 1480–1494, 2015.
  • [5] Z. Wang, “Applications of objective image quality assessment methods [applications corner],” IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 137–142, Nov 2011.
  • [6] A. Zhang, J. Chen, L. Zhou, and S. Yu, “Graph theory-based QoE-driven cooperation stimulation for content dissemination in device-to-device communication,” IEEE Transactions on Emerging Topics in Computing, vol. 4, no. 4, pp. 556–567, Oct 2016.
  • [7] W. Zhu, P. Cui, Z. Wang, and G. Hua, “Multimedia big data computing,” IEEE MultiMedia, vol. 22, no. 3, pp. 96–c3, July 2015.
  • [8] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, LIVE Image Quality Assessment Database Release 2, [Online]. Available: http://live.ece.utexas.edu/research/quality.
  • [9] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57 – 77, 2015.
  • [10] E. C. Larson and D. M. Chandler, Categorical image quality (CSIQ) database, [Online]. Available: http://vision.okstate.edu/csiq.
  • [11] D. Ghadiyaram and A. Bovik, LIVE In the Wild Image Quality Challenge Database, 2015, [Online]. Available: http://live.ece.utexas.edu/research/ChallengeDB/index.html.
  • [12] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
  • [13] Z. Wang and E. P. Simoncelli, “Reduced-reference image quality assessment using a wavelet-domain natural image statistic model,” in Electronic Imaging, 2005, pp. 149–159.
  • [14] Z. Wang and A. C. Bovik, “Reduced-and no-reference image quality assessment,” IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, 2011.
  • [15] Q. Wu, H. Li, Z. Wang, F. Meng, B. Luo, W. Li, and K. N. Ngan, “Blind image quality assessment based on rank-order regularized regression,” IEEE Transactions on Multimedia, 2017, In Press.
  • [16] E. B. Newman, “The validity of the just noticeable difference as a unit of psychological magnitude,” Transactions of the Kansas Academy of Science (1903-), vol. 36, pp. 172–175, 1933.
  • [17] M. K. Stern and J. H. Johnson, Just Noticeable Difference.   John Wiley & Sons, Inc., 2010.
  • [18] H. Liu and I. Heynderickx, “Visual attention in objective image quality assessment: Based on eye-tracking data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 7, pp. 971–982, 2011.
  • [19] L. Zhang, Y. Shen, and H. Li, “Vsi: A visual saliency-induced index for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4270–4281, 2014.
  • [20] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
  • [21] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185–1198, 2011.
  • [22] M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey, “Complex wavelet structural similarity: A new image similarity index,” IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2385–2401, 2009.
  • [23] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2117–2128, 2005.
  • [24] Y. Fang, K. Ma, Z. Wang, W. Lin, Z. Fang, and G. Zhai, “No-reference quality assessment of contrast-distorted images based on natural scene statistics,” IEEE Signal Processing Letters, vol. 22, no. 7, pp. 838–842, 2015.
  • [25] P. Zhang, W. Zhou, L. Wu, and H. Li, “SOM: Semantic obviousness metric for image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 2394–2402.
  • [26] M. Kendall and J. Gibbons, Rank correlation methods, 5th ed., ser. A Charles Griffin Book.   London: E. Arnold, 1990.
  • [27] I.-T. R. P.10, “Vocabulary for performance and quality of service,” https://www.itu.int/rec/T-REC-P.10/en, July 2006.
  • [28] Q. Huynh-Thu, M. N. Garcia, F. Speranza, P. Corriveau, and A. Raake, “Study of rating scales for subjective quality assessment of high-definition video,” IEEE Transactions on Broadcasting, vol. 57, no. 1, pp. 1–14, March 2011.
  • [29] W. Daniel and C. Cross, Biostatistics: A Foundation for Analysis in the Health Sciences, ser. Wiley Series in Probability and Statistics.   Wiley, 2013.
  • [30] J. H. Pechmann, D. E. Scott et al., “Declining amphibian populations: the problem of separating human impacts from natural fluctuations,” Science, vol. 253, no. 5022, p. 892, 1991.
  • [31] J. B. Brewer, Z. Zhao, J. E. Desmond, G. H. Glover, and J. D. Gabrieli, “Making memories: brain activity that predicts how well visual experience will be remembered,” Science, vol. 281, no. 5380, pp. 1185–1187, 1998.
  • [32] M. Hsu, M. Bhatt, R. Adolphs, D. Tranel, and C. F. Camerer, “Neural systems responding to degrees of uncertainty in human decision-making,” Science, vol. 310, no. 5754, pp. 1680–1683, 2005.
  • [33] C. Camerer and M. Weber, “Recent developments in modeling preferences: Uncertainty and ambiguity,” Journal of Risk and Uncertainty, vol. 5, no. 4, pp. 325–370, 1992.
  • [34] Z. Wang and E. P. Simoncelli, “Maximum differentiation (mad) competition: A methodology for comparing computational models of perceptual quantities,” Journal of Vision, vol. 8, no. 12, pp. 8–8, 2008.
  • [35] K. Ma, Q. Wu, Z. Wang, Z. Duanmu, H. Yong, H. Li, and L. Zhang, “Group mad competition-a new methodology to compare objective image quality models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1664–1673.
  • [36] J. A. Swets, “Is there a sensory threshold?” Science, vol. 134, no. 3473, pp. 168–177, 1961.
  • [37] C. Owsley, R. Sekuler, and D. Siemsen, “Contrast sensitivity throughout adulthood,” Vision research, vol. 23, no. 7, pp. 689–699, 1983.
  • [38] L. Chaieb, A. Antal, and W. Paulus, “Gender-specific modulation of short-term neuroplasticity in the visual cortex induced by transcranial direct current stimulation,” Visual neuroscience, vol. 25, no. 01, pp. 77–81, 2008.
  • [39] B. H. Crawford, “The change of visual sensitivity with time,” Proceedings of the Royal Society of London B: Biological Sciences, vol. 123, no. 830, pp. 69–89, 1937.
  • [40] D. Alais, J. Cass, R. P. O’Shea, and R. Blake, “Visual sensitivity underlying changes in visual consciousness,” Current Biology, vol. 20, no. 15, pp. 1362 – 1367, 2010.
  • [41] A. I. Jack, G. H. Patel, S. V. Astafiev, A. Z. Snyder, E. Akbudak, G. L. Shulman, and M. Corbetta, “Changing human visual field organization from early visual to extra-occipital cortex,” PLoS One, vol. 2, no. 5, p. e452, 2007.
  • [42] A. P. Venkataraman, S. Winter, P. Unsbo, and L. Lundström, “Blur adaptation: Contrast sensitivity changes and stimulus extent,” Vision Research, vol. 110, Part A, pp. 100–106, 2015.
  • [43] P. Engeldrum, Psychometric Scaling: A Toolkit for Imaging Systems Development.   Imcotek Press, 2000.
  • [44] F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994.
  • [45] K. Ma, W. Liu, and Z. Wang, “Perceptual evaluation of single image dehazing algorithms,” in IEEE International Conference on Image Processing, Sept 2015, pp. 3600–3604.
  • [46] D. Kundu, D. Ghadiyaram, A. C. Bovik, and B. L. Evans, ESPL-LIVE HDR Image Quality Database, [Online]. Available: http://signal.ece.utexas.edu/~debarati/HDRDatabase.zip.
  • [47] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? a new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, Jan 2009.
  • [48] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185–1198, May 2011.
  • [49] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, Nov 2003, pp. 1398–1402.
  • [50] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug 2011.
  • [51] A. Moorthy and A. Bovik, “A two-step framework for constructing blind image quality indices,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 513–516, 2010.
  • [52] M. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, Aug. 2012.
  • [53] A. Mittal, A. Moorthy, and A. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, Dec 2012.
  • [54] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
  • [55] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, Jan 2015.
  • [56] W. Xue, X. Mou, L. Zhang, A. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850–4862, Nov 2014.
  • [57] Q. Wu, H. Li, F. Meng, K. N. Ngan, B. Luo, C. Huang, and B. Zeng, “Blind image quality assessment based on multichannel feature fusion and label transfer,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 425–440, March 2016.