Uncertainty-Aware Blind Image Quality Assessment in the Laboratory and Wild

05/28/2020 ∙ by Weixia Zhang, et al. ∙ City University of Hong Kong ∙ Shanghai Jiao Tong University

Performance of blind image quality assessment (BIQA) models has been significantly boosted by end-to-end optimization of feature engineering and quality regression. Nevertheless, due to the distributional shifts between images simulated in the laboratory and captured in the wild, models trained on databases with synthetic distortions remain particularly weak at handling realistic distortions (and vice versa). To confront the cross-distortion-scenario challenge, we develop a unified BIQA model and an effective approach of training it for both synthetic and realistic distortions. We first sample pairs of images from the same IQA databases and compute a probability that one image of each pair is of higher quality as the supervisory signal. We then employ the fidelity loss to optimize a deep neural network for BIQA over a large number of such image pairs. We also explicitly enforce a hinge constraint to regularize uncertainty estimation during optimization. Extensive experiments on six IQA databases show the promise of the learned method in blindly assessing image quality in the laboratory and wild. In addition, we demonstrate the universality of the proposed training strategy by using it to improve existing BIQA models.




I Introduction

With the inelastic demand for processing massive Internet images, it is of paramount importance to develop computational image quality models to monitor, maintain, and enhance the perceived quality of the output images of various image processing systems [wang2006modern]. High degrees of consistency between model predictions and human opinions of image quality have been achieved in the full-reference regime, where distorted images are compared to their reference images of pristine quality [wang2003multiscale]. When such reference information is not available, no-reference or blind image quality assessment (BIQA), which relies solely on distorted images, becomes more practical yet more challenging. Recently, deep learning-based BIQA models have experienced an impressive series of successes due to joint optimization of feature representation and quality prediction. However, these models remain notoriously weak at cross-distortion-scenario generalization [zhang2020blind]. That is, models trained on images simulated in the laboratory cannot deal with images captured in the wild. Similarly, models optimized for realistic distortions (e.g., sensor noise and poor exposure) do not work well for synthetic distortions (e.g., Gaussian blur and JPEG compression).

Fig. 1: Images with approximately the same linearly re-scaled MOS exhibit drastically different perceptual quality. If the human scores are in the form of DMOSs, we first negate the values and then apply linear re-scaling. The images are sampled from (a) LIVE [sheikh2006statistical], (b) CSIQ [larson2010most], (c) KADID-10K [lin2019kadid], (d) BID [ciancio2011no], (e) LIVE Challenge [ghadiyaram2016massive], and (f) KonIQ-10K [hosu2020koniq]. It is not hard to observe that image (f) has clearly superior quality to the other five. All images are cropped for better visibility.

A seemingly straightforward method of adapting to the distributional shifts between synthetic and realistic distortions is to directly combine multiple IQA databases for training. However, existing databases have different perceptual scales due to differences in subjective testing methodologies. For example, the CSIQ database [larson2010most] used a multiple stimuli absolute category rating in a well-controlled laboratory environment and reports difference mean opinion scores (DMOSs) on one numerical range, whereas the LIVE Challenge Database used a single stimulus continuous quality rating in an unconstrained crowdsourcing platform and reports MOSs on a different range. This means that a separate subjective experiment on images sampled from each database is required for perceptual scale realignment [sheikh2006statistical, larson2010most]. To make this point more explicit, we linearly re-scaled the subjective scores of each of six databases [sheikh2006statistical, larson2010most, lin2019kadid, ciancio2011no, ghadiyaram2016massive, hosu2020koniq] to a common range, with a larger value indicating higher quality. Fig. 1 shows sample images that have approximately the same re-scaled MOS. As expected, they appear to have drastically different perceptual quality. A more promising design methodology for unified BIQA is to build a prior probability model for natural undistorted images as the reference distribution, to which a test distribution that summarizes the distorted image can be compared. The award-winning BIQA model NIQE [mittal2013making] is a specific instantiation of this idea, but is only capable of handling a small number of synthetic distortions.

In addition to training BIQA models with (D)MOSs, there is another type of supervisory signal, the variance of human opinions, which we believe is beneficial for BIQA but has not been explored, to the best of our knowledge. Generally, humans tend to give more consistent ratings (i.e., smaller variances) to images at the two ends of the quality range, while assessing images in the mid-quality range with less certainty (see Fig. 3). Therefore, it is reasonable to expect image quality models to behave similarly. Moreover, previous methods [wu2018blind] have enjoyed the benefits of modeling the uncertainty of quality prediction for subsequent applications.

In this paper, we take steps toward developing unified uncertainty-aware BIQA models for both synthetic and realistic distortions. Our contributions include:

  • A training strategy that allows differentiable BIQA models to be learned on multiple IQA databases (of different distortion scenarios) simultaneously. In particular, we first sample and combine pairs of images within each database. For each pair, we leverage the human-annotated (D)MOSs and variances to compute, as the supervisory signal, the probability that one image is of better perceptual quality than the other. The resulting training set bypasses additional subjective testing for perceptual scale realignment. We then use a pairwise learning-to-rank algorithm with the fidelity loss [tsai2007frank] to drive the learning of computational models for BIQA.

  • A regularizer that enforces a hinge constraint on the learned uncertainty using the variance of human opinions as guidance. This enables BIQA models to mimic the uncertain aspects of humans when performing the quality assessment task.

  • A Unified No-reference Image Quality and Uncertainty Evaluator (UNIQUE) based on a deep neural network (DNN) that significantly outperforms state-of-the-art BIQA models on six IQA databases (see Table I) covering both synthetic and realistic distortions. We also verify its generalizability in a challenging cross-data setting and via the group maximum differentiation (gMAD) competition methodology [ma2020group].

| Database | # of Images | Scenario | Annotation | Range | Subjective Testing Methodology |
|---|---|---|---|---|---|
| LIVE [sheikh2006statistical] | 779 | Synthetic | DMOS, Variance | | Single stimulus continuous quality rating |
| CSIQ [larson2010most] | 866 | Synthetic | DMOS, Variance | | Multi stimulus absolute category rating |
| KADID-10K [lin2019kadid] | 10,125 | Synthetic | MOS, Variance | | Double stimulus absolute category rating with crowdsourcing |
| BID [ciancio2011no] | 586 | Realistic | MOS, Variance | | Single stimulus continuous quality rating |
| LIVE Challenge [ghadiyaram2016massive] | 1,162 | Realistic | MOS, Variance | | Single stimulus continuous quality rating with crowdsourcing |
| KonIQ-10K [hosu2020koniq] | 10,073 | Realistic | MOS, Variance | | Single stimulus absolute category rating with crowdsourcing |

TABLE I: Comparison of subject-rated IQA databases. MOS stands for mean opinion score. DMOS is inversely proportional to MOS

| Model | RankIQA [liu2017rankiqa] | DB-CNN [zhang2020blind] | dipIQ [ma2017dipiq] | Ma19 [ma2019blind] | Gao15 [gao2015learning] | UNIQUE |
|---|---|---|---|---|---|---|
| Source | DS | DS | FR | FR | (D)MOS | (D)MOS+Variance |
| Scenario | Synthetic | Synthetic | Synthetic | Synthetic | Synthetic | Synthetic+Realistic |
| Annotation | Binary | Categorical | Binary | Binary | Binary | Continuous |
| Loss Function | Hinge variant | Cross entropy | Cross entropy | Cross entropy variant | Hinge | Fidelity+Hinge |
| Ranking Stage | Pre-training | Pre-training | Prediction | Prediction | Prediction | Prediction |

TABLE II: Summary of ranking-based BIQA models. DS: distortion specification characterized by distortion parameters. FR: full-reference IQA model predictions

II Related Work

In this section, we give a review of existing BIQA models over the last two decades.

II-A BIQA as Regression

Early attempts at BIQA were tailored to specific synthetic distortions, such as JPEG compression [wang2002no] and JPEG2000 compression [marziliano2004perceptual]. Later models aimed for general-purpose BIQA [moorthy2011blind, ye2012unsupervised, mittal2012no, xu2016blind, ghadiyaram2017perceptual], with the underlying assumption that statistics extracted from natural images are highly regular [simoncelli2001natural] and distortions will break such statistical regularities. Based on natural scene statistics (NSS), a quality prediction function can be produced using standard supervised learning tools. Of particular interest is NIQE [mittal2013making], which is arguably the first unified BIQA model with the goal of capturing arbitrary distortions. However, the NSS model used in NIQE is not sensitive to image “unnaturalness” introduced by realistic distortions. Zhang et al. [zhang2015feature] extended NIQE [mittal2013making] by exploiting a more powerful set of NSS for local quality prediction. However, the generalization to realistic distortions is still quite limited.

Joint optimization of feature engineering and quality regression enabled by deep learning has significantly improved the performance of BIQA in recent years. The apparent conflict between the small number of subjective ratings and the large number of learnable model parameters may be alleviated in three ways. The first method is transfer learning [zeng2018blind], which directly fine-tunes DNNs pre-trained for object recognition. The second method is patch-based training, which assigns a local quality score to an image patch transferred from the corresponding global quality score [kang2014convolutional, bosse2016deep]. The third method is quality-aware pre-training, which automatically generates a large amount of labeled data by exploiting specifications of distortion processes or quality estimates of full-reference models [Ma2018End, liu2017rankiqa, zhang2020blind]. Despite impressive correlation numbers on individual databases of either synthetic or realistic distortions, DNN-based BIQA models generalize poorly across distortion scenarios, and can also be easily falsified in the gMAD competition [wang2020active].

II-B BIQA as Ranking

There are also methods that cast BIQA as a learning-to-rank problem, where relative ranking information can be obtained from distortion specifications [liu2017rankiqa, Ma2018End, zhang2020blind], full-reference IQA models [ma2017dipiq, ma2019blind], and human data [gao2015learning]. Liu et al. [liu2017rankiqa] and Zhang et al. [zhang2020blind] inferred discrete ranking information from images of the same content and distortion but at different levels for BIQA model pre-training. Different from [liu2017rankiqa, zhang2020blind], the proposed UNIQUE explores continuous ranking information from (D)MOSs and variances in the stage of final quality prediction. Ma et al. [ma2017dipiq, ma2019blind] extracted binary ranking information from full-reference IQA methods to guide the optimization of BIQA models. Since full-reference methods can only be applied to synthetic distortions, where the reference images are available, it is not trivial to extend the methods in [ma2017dipiq, ma2019blind] to realistic distortions. The closest work to ours is due to Gao et al. [gao2015learning], who computed binary rankings from MOSs. However, they neither performed end-to-end optimization of BIQA nor explored the idea of combining multiple IQA databases via pairwise rankings. As a result, their method only achieves reasonable performance on a limited number of synthetic distortions. UNIQUE takes a step further to be uncertainty-aware, learning from human behavior when evaluating image quality. We summarize ranking-based BIQA methods in Table II.

Fig. 2: Illustration of the proposed training strategy, which involves two steps: IQA database combination and pairwise learning-to-rank model estimation. The training image pairs are randomly sampled within individual IQA databases and then combined. The optimization is driven by the fidelity and hinge losses.

II-C Uncertainty-Aware BIQA

Learning uncertainty is helpful to understand and analyze model predictions. In Bayesian machine learning, uncertainty may come from two parts: one inherent in the data (i.e., data/aleatoric uncertainty) and the other in the learned parameters (i.e., model/epistemic uncertainty) [kendall2017what]. In the context of BIQA, Huang et al. [huang2019convolutional] modeled the uncertainty of patch quality to alleviate the label noise problem in patch-based training. Wu et al. [wu2018blind] employed a sparse Gaussian process for quality regression, where the data uncertainty can be jointly estimated without supervision. In contrast, UNIQUE assumes Thurstone’s model [thurstone1927law], and learns the data uncertainty with direct supervision, aiming for an effective BIQA model with a probability interpretation.

III Training UNIQUE

In this section, we first present the proposed training strategy, consisting of IQA database combination and pairwise learning-to-rank model estimation (see Fig. 2). We then describe the details of the UNIQUE model for unified uncertainty-aware BIQA.

III-A IQA Database Combination

Our goal is to combine IQA databases for training while avoiding extra subjective experiments for perceptual scale realignment. To achieve this, we first randomly sample $n_j$ pairs of images from the $j$-th database. For each image pair $(x, y)$, we infer the relative ranking information from the corresponding MOSs and variances. Specifically, under Thurstone’s model [thurstone1927law], we assume that the true perceptual quality $q(x)$ of image $x$ follows a Gaussian distribution with mean $\mu(x)$ and variance $\sigma^2(x)$ collected via subjective testing. Assuming the variability of quality across images is uncorrelated, the quality difference, $q(x) - q(y)$, is also Gaussian with mean $\mu(x) - \mu(y)$ and variance $\sigma^2(x) + \sigma^2(y)$. The probability that $x$ has higher perceptual quality than $y$ can be calculated from the Gaussian cumulative distribution function $\Phi(\cdot)$, which admits a closed-form solution:

$$p(x, y) = \Pr\big(q(x) \ge q(y)\big) = \Phi\left(\frac{\mu(x) - \mu(y)}{\sqrt{\sigma^2(x) + \sigma^2(y)}}\right). \tag{1}$$

Combining pairs of images from $m$ databases, we are able to build a training set $\mathcal{D} = \{(x_i, y_i), p_i\}_{i=1}^{N}$, where $N = \sum_{j=1}^{m} n_j$. Our database combination approach allows future IQA databases to be added with essentially no cost.
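
The database combination step can be illustrated with a minimal sketch. It assumes each image comes with a MOS (larger is better; DMOSs would be negated first) and a strictly positive std, and that pairs are drawn only within a database. The toy databases, image identifiers, and function names below are hypothetical and only serve the illustration.

```python
import math
import random

def preference_probability(mu_x, sigma_x, mu_y, sigma_y):
    """P(x has higher quality than y) for independent Gaussian qualities.

    Assumes at least one std is strictly positive.
    """
    z = (mu_x - mu_y) / math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_pairs(database, n_pairs, seed=0):
    """Sample image pairs within ONE database and label each with a probability."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        (id_x, mu_x, sd_x), (id_y, mu_y, sd_y) = rng.sample(database, 2)
        pairs.append((id_x, id_y, preference_probability(mu_x, sd_x, mu_y, sd_y)))
    return pairs

# Hypothetical toy databases: (image id, MOS, std) on two unrelated rating scales.
db_a = [("a1", 0.9, 0.10), ("a2", 0.5, 0.20), ("a3", 0.2, 0.10)]
db_b = [("b1", 80.0, 10.0), ("b2", 55.0, 18.0), ("b3", 30.0, 12.0)]

# Scores are never compared across databases; only the probabilities are pooled.
training_set = sample_pairs(db_a, 4) + sample_pairs(db_b, 4)
```

Because each label is a probability computed inside a single database, the two rating scales never need to be aligned.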

III-B Model Estimation

Given $\mathcal{D}$ as the training set, we aim to learn two differentiable functions $f_w(\cdot)$ and $\sigma_w(\cdot)$, parameterized by a vector $w$, which accept an image of arbitrary input size, and compute the quality score and uncertainty, respectively. Similar to Section III-A, we assume the true perceptual quality $q(x)$ obeys a Gaussian distribution with mean and variance now estimated by $f_w(x)$ and $\sigma_w^2(x)$, respectively. The probability of preferring $x$ over $y$ perceptually in an image pair is

$$\hat{p}_w(x, y) = \Phi\left(\frac{f_w(x) - f_w(y)}{\sqrt{\sigma_w^2(x) + \sigma_w^2(y)}}\right). \tag{2}$$


It remains to specify a similarity measure between the ground truth probability $p(x, y)$ and the predicted probability $\hat{p}_w(x, y)$ as the objective for model estimation. In machine learning, cross-entropy may be the de facto measure for this purpose, but has several drawbacks [tsai2007frank]. First, the minimum of the cross-entropy loss is not exactly zero, except for the ground truth $p(x, y) = 0$ and $p(x, y) = 1$. This may hinder the learning of image pairs with $p(x, y)$ close to $0.5$. Second, the cross-entropy loss is unbounded from above, which may over-penalize some hard training examples, therefore biasing the learned models. To address these problems, we choose the fidelity loss [tsai2007frank], which originated in quantum physics as a measure of the difference between two quantum states [birrell1984quantum]:

$$\ell_F(x, y, p; w) = 1 - \sqrt{p(x, y)\,\hat{p}_w(x, y)} - \sqrt{\big(1 - p(x, y)\big)\big(1 - \hat{p}_w(x, y)\big)}. \tag{3}$$
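
The two drawbacks of cross-entropy named above can be checked numerically. The sketch below assumes the standard form of the fidelity loss, one minus the Bhattacharyya-style overlap of the two Bernoulli distributions, and compares it with binary cross-entropy; the function names and the clamping constant `eps` are illustrative choices.

```python
import math

def fidelity_loss(p, p_hat):
    """Fidelity loss between ground truth p and prediction p_hat (both in [0, 1])."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))

def cross_entropy(p, p_hat, eps=1e-12):
    """Binary cross-entropy; p_hat is clamped to avoid log(0)."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(p * math.log(p_hat) + (1.0 - p) * math.log(1.0 - p_hat))

# A perfect prediction of an ambiguous pair (p = 0.5):
perfect_fidelity = fidelity_loss(0.5, 0.5)   # zero
perfect_ce = cross_entropy(0.5, 0.5)         # log(2), not zero

# A badly wrong prediction of a confident pair (p = 1):
worst_fidelity = fidelity_loss(1.0, 0.0)     # bounded by 1
large_ce = cross_entropy(1.0, 1e-9)          # grows without bound as p_hat -> 0
```

The fidelity loss reaches zero for any correctly predicted probability and never exceeds one, which is the behavior the text relies on.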


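Joint estimation of image quality and uncertainty will introduce scaling ambiguity. More precisely, if we make the scaling $f_w(\cdot) \rightarrow \alpha f_w(\cdot)$ and $\sigma_w(\cdot) \rightarrow \alpha \sigma_w(\cdot)$, then the probability given by Eq. (2) is unchanged. Our preliminary results [zhang2020learning] showed that the $\sigma_w(\cdot)$ learned by optimizing Eq. (3) alone neither resembles any aspects of human behavior in BIQA, nor reveals new statistical properties of natural images. To resolve the scaling ambiguity and provide $\sigma_w(\cdot)$ with direct supervision, we enforce an explicit regularizer of $\sigma_w(\cdot)$ by taking advantage of the ground truth std $\sigma(\cdot)$. Note that the ground truth stds across IQA databases are not comparable, which prevents the use of their absolute values. Similarly, for each pair $(x, y)$, we infer a binary label $t$ for uncertainty learning, where $t = 1$ if $\sigma(x) \ge \sigma(y)$ and $t = -1$ otherwise. We define the regularizer using the hinge loss:

$$\ell_H(x, y, t; w) = \max\big(0,\ \xi - t\,(\sigma_w(x) - \sigma_w(y))\big), \tag{4}$$

where the margin $\xi$ sets a specific scale for BIQA models to work with. The augmented training set becomes $\mathcal{D} = \{(x_i, y_i), p_i, t_i\}_{i=1}^{N}$. During training, we sample a mini-batch $\mathcal{B}$ from $\mathcal{D}$ in each iteration, and use stochastic gradient descent to update the parameter vector $w$ by minimizing the following empirical loss:

$$\ell(\mathcal{B}; w) = \frac{1}{|\mathcal{B}|} \sum_{\{(x, y), p, t\} \in \mathcal{B}} \Big[\ell_F(x, y, p; w) + \lambda\,\ell_H(x, y, t; w)\Big], \tag{5}$$

where $|\mathcal{B}|$ denotes the cardinality of $\mathcal{B}$ and $\lambda$ trades off the two terms.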
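
The scaling ambiguity and the combined objective can be sketched on plain numbers. This is an illustration only: it assumes the hinge regularizer takes the usual pairwise form with labels in {+1, -1}, and the margin and trade-off values passed at the bottom are placeholders rather than the settings used in the experiments.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predicted_probability(f_x, s_x, f_y, s_y):
    """Probability that x is preferred over y, from predicted means and stds."""
    return normal_cdf((f_x - f_y) / math.sqrt(s_x ** 2 + s_y ** 2))

def fidelity_loss(p, p_hat):
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))

def hinge_loss(s_x, s_y, t, margin):
    """t = +1 if the human std of x is at least that of y, else -1."""
    return max(0.0, margin - t * (s_x - s_y))

def batch_loss(batch, margin, lam):
    """Mean of fidelity + lam * hinge over (f_x, s_x, f_y, s_y, p, t) tuples."""
    total = 0.0
    for f_x, s_x, f_y, s_y, p, t in batch:
        p_hat = predicted_probability(f_x, s_x, f_y, s_y)
        total += fidelity_loss(p, p_hat) + lam * hinge_loss(s_x, s_y, t, margin)
    return total / len(batch)

# Scaling ambiguity: multiplying all means and stds by 3 leaves the probability unchanged,
# so the fidelity loss alone cannot pin down the scale of the predicted std.
p_original = predicted_probability(1.0, 0.5, 0.2, 0.3)
p_scaled = predicted_probability(3.0, 1.5, 0.6, 0.9)

# Hypothetical mini-batch and placeholder hyperparameters.
toy_batch = [(1.0, 0.5, 0.2, 0.3, 0.9, 1), (0.1, 0.2, 0.4, 0.6, 0.3, -1)]
loss = batch_loss(toy_batch, margin=0.1, lam=1.0)
```

The hinge term, unlike the fidelity term, does change under rescaling of the stds, which is how the margin fixes a working scale.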

Fig. 3: Scatter plots of means against stds of human quality opinions of images from six IQA databases, including (a) LIVE [sheikh2006statistical], (b) CSIQ [larson2010most], (c) KADID-10K [lin2019kadid], (d) BID [ciancio2011no], (e) LIVE Challenge [ghadiyaram2016massive], and (f) KonIQ-10K [hosu2020koniq]. There is a clear trend that humans are more consistent (i.e., confident) in rating low-quality and high-quality images than mid-quality images, giving rise to arch-like shapes.

III-C Specification of UNIQUE

We use ResNet-34 [he2016deep] as the backbone of UNIQUE due to its good balance between model complexity and representation capability. The pairwise learning-to-rank framework composed of two streams is shown in Fig. 2. Each stream is implemented by a DNN, consisting of a stage of convolution, batch normalization, ReLU nonlinearity, and max-pooling, followed by four residual blocks (see Table III). To generate a fixed-length image representation regardless of input resolution and summarize higher-order spatial statistics, we replace the first-order average pooling in the original ResNet with a second-order bilinear pooling, which has been empirically proven useful in object recognition [lin2015bilinear] and BIQA [zhang2020blind]. We flatten the spatial dimensions of the feature representation after the last convolution to obtain $z \in \mathbb{R}^{s \times c}$, where $s$ and $c$ denote the spatial and channel dimensions, respectively. The bilinear pooling can be defined as

$$\bar{z} = z^{\top} z. \tag{6}$$

We further flatten $\bar{z}$, and append a fully connected layer with two outputs to represent $f_w(x)$ and $\sigma_w(x)$, respectively. The network parameters of the two streams are shared during the entire optimization process.
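
To see why bilinear pooling yields a fixed-length representation, the sketch below computes the channel-by-channel product on small nested lists. It is written in pure Python for readability, not speed, and the function name is hypothetical. With the 512 output channels of ResNet-34's last residual block, the flattened vector has 512 × 512 = 262,144 entries, which matches the input size of the fully connected layer in Table III.

```python
def bilinear_pool(z):
    """Second-order pooling of an s x c feature matrix (list of s rows, c channels each).

    Returns the c x c matrix z^T z flattened row by row into a list of length c * c.
    """
    c = len(z[0])
    pooled = [[sum(row[i] * row[j] for row in z) for j in range(c)] for i in range(c)]
    return [value for pooled_row in pooled for value in pooled_row]

# Two inputs with different numbers of spatial positions (s = 3 and s = 5), same c = 2.
small = bilinear_pool([[1, 2], [3, 4], [5, 6]])
large = bilinear_pool([[1, 0], [0, 1], [2, 2], [1, 3], [0, 0]])

# The output length depends only on the number of channels.
fc_input_size = 512 * 512
```

The spatial dimension is summed out, so images of any resolution map to a vector of the same length.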

| Layer Name | Network Parameter |
|---|---|
| Convolution | 7×7, 64, stride 2 |
| Max Pooling | 3×3, stride 2 |
| Residual Block 1 | [3×3, 64; 3×3, 64] × 3 |
| Residual Block 2 | [3×3, 128; 3×3, 128] × 4 |
| Residual Block 3 | [3×3, 256; 3×3, 256] × 6 |
| Residual Block 4 | [3×3, 512; 3×3, 512] × 3 |
| Bilinear Pooling | |
| Fully Connected Layer | 262,144 × 2 |

TABLE III: The network architecture of UNIQUE based on ResNet-34 [he2016deep]. The nonlinearity and normalization layers are omitted for brevity

| Model | LIVE [sheikh2006statistical] | CSIQ [larson2010most] | KADID-10K [lin2019kadid] | BID [ciancio2011no] | CLIVE [ghadiyaram2016massive] | KonIQ-10K [hosu2020koniq] |
|---|---|---|---|---|---|---|
| MS-SSIM [wang2003multiscale] | 0.951 / 0.941 | 0.910 / 0.897 | 0.821 / 0.819 | - | - | - |
| NLPD [Laparra:17] | 0.942 / 0.937 | 0.937 / 0.930 | 0.822 / 0.821 | - | - | - |
| DISTS [ding2020iqa] | 0.955 / 0.955 | 0.944 / 0.946 | 0.892 / 0.892 | - | - | - |
| NIQE [mittal2013making] | 0.906 / 0.908 | 0.632 / 0.726 | 0.374 / 0.428 | 0.468 / 0.461 | 0.464 / 0.515 | 0.521 / 0.529 |
| ILNIQE [zhang2015feature] | 0.907 / 0.912 | 0.832 / 0.873 | 0.531 / 0.573 | 0.516 / 0.533 | 0.469 / 0.536 | 0.507 / 0.534 |
| dipIQ [ma2017dipiq] | 0.940 / 0.933 | 0.511 / 0.778 | 0.304 / 0.402 | 0.009 / 0.346 | 0.187 / 0.290 | 0.228 / 0.437 |
| Ma19 [ma2019blind] | 0.935 / 0.934 | 0.917 / 0.926 | 0.466 / 0.500 | 0.316 / 0.348 | 0.348 / 0.400 | 0.365 / 0.416 |
| MEON (LIVE) [Ma2018End] | - | 0.726 / 0.787 | 0.234 / 0.410 | 0.100 / 0.217 | 0.378 / 0.477 | 0.145 / 0.242 |
| deepIQA (LIVE) [bosse2016deep] | - | 0.645 / 0.730 | 0.270 / 0.309 | -0.043 / 0.127 | 0.076 / 0.162 | -0.064 / 0.088 |
| deepIQA (TID2013) | 0.834 / 0.853 | 0.679 / 0.771 | 0.559 / 0.573 | 0.236 / 0.282 | 0.048 / 0.150 | 0.182 / 0.272 |
| RankIQA (LIVE) [liu2017rankiqa] | - | 0.711 / 0.790 | 0.436 / 0.488 | 0.324 / 0.350 | 0.451 / 0.503 | 0.617 / 0.631 |
| RankIQA (TID2013) | 0.599 / 0.633 | 0.667 / 0.778 | 0.477 / 0.504 | 0.217 / 0.306 | 0.289 / 0.313 | 0.460 / 0.471 |
| PQR (BID) [zeng2018blind] | 0.663 / 0.673 | 0.522 / 0.612 | 0.321 / 0.403 | - | 0.691 / 0.740 | 0.614 / 0.650 |
| PQR (KADID-10K) | 0.916 / 0.927 | 0.803 / 0.870 | - | 0.358 / 0.405 | 0.439 / 0.501 | 0.485 / 0.484 |
| PQR (KonIQ-10K) | 0.741 / 0.760 | 0.710 / 0.737 | 0.583 / 0.604 | 0.729 / 0.739 | 0.766 / 0.826 | - |
| PaQ-2-PiQ (LIVE Patch) [ying2020from] | 0.472 / 0.559 | 0.555 / 0.658 | 0.379 / 0.429 | 0.682 / 0.713 | 0.719 / 0.778 | 0.722 / 0.735 |
| DB-CNN (CSIQ) [zhang2020blind] | 0.855 / 0.854 | - | 0.501 / 0.569 | 0.329 / 0.382 | 0.451 / 0.472 | 0.499 / 0.515 |
| DB-CNN (LIVE Challenge) | 0.723 / 0.754 | 0.691 / 0.685 | 0.488 / 0.529 | 0.809 / 0.832 | - | 0.770 / 0.825 |
| UNIQUE (All databases) | 0.969 / 0.968 | 0.902 / 0.927 | 0.878 / 0.876 | 0.858 / 0.873 | 0.854 / 0.890 | 0.896 / 0.901 |

TABLE IV: Median SRCC and PLCC results (shown as SRCC / PLCC) across ten sessions. The databases used for training models that require human data are included in the bracket

IV Experiments

In this section, we first present the experimental setups, including the IQA database selection, the evaluation protocols, and the training details of UNIQUE. We then compare UNIQUE with several state-of-the-art BIQA models on existing IQA databases and using gMAD competition [ma2020group]. Moreover, we verify the superiority of the proposed training strategy by comparing it with alternative training schemes, testing it in a cross-database setting, and using it to improve existing BIQA models.

IV-A Experimental Setups

We choose six IQA databases (summarized in Table I), among which LIVE [sheikh2006statistical], CSIQ [larson2010most], and KADID-10K [lin2019kadid] contain synthetic distortions, while LIVE Challenge [ghadiyaram2016massive], BID [ciancio2011no], and KonIQ-10K [hosu2020koniq] include realistic distortions. All selected databases provide the stds of subjective quality scores (see Fig. 3). We exclude TID2013 [ponomarenko2013color] in our experiments because the MOS is computed by the number of winning times in a maximum of nine paired comparisons without suitable psychometric scaling [mikhailiuk2018psychometric], and does not satisfy the Gaussian assumption. We refer the interested readers to our preliminary work [zhang2020learning] for the results on TID2013 [ponomarenko2013color].

We randomly sample images from each database to construct the training set and leave the remaining for testing. Regarding synthetic databases LIVE, CSIQ, and KADID-10K, we split the training and test sets according to the reference images in order to ensure content independence. We adopt two performance criteria: Spearman rank-order correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC), which measure prediction monotonicity and precision, respectively. To reduce the bias caused by the randomness in training and test set splitting, we repeat this procedure ten times, and report median SRCC and PLCC results.
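
The two evaluation criteria can be computed as follows. This is a self-contained sketch in pure Python rather than the evaluation code used for the reported numbers; tied values receive average ranks, and the toy scores are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank-order correlation coefficient (SRCC): PLCC of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical scores: predictions are monotone in MOS but not linear in it.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]
srcc, plcc = spearman(mos, pred), pearson(mos, pred)
```

In this toy case the SRCC is perfect while the PLCC is not, which illustrates why the two criteria are reported together: one measures monotonicity, the other linear agreement.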

We train UNIQUE on image pairs using Adam [Kingma2014adam] by minimizing the objective in Eq. (5). The margin of the hinge loss and the trade-off parameter are fixed in all experiments; empirically, we find that the performance is insensitive to the two hyperparameters. A softplus function is applied to constrain the predicted std to be positive. The parameters of UNIQUE based on ResNet-34 [he2016deep] are initialized with the weights pre-trained on ImageNet [deng2009imagenet]. The parameters of the last fully connected layer are initialized by He’s method [he2015delving]. The learning rate is decayed by a constant factor every three epochs, and we train UNIQUE for twelve epochs. A warm-up training strategy is adopted: only the last fully connected layer is trained in the first three epochs; for the remaining epochs, we fine-tune the entire network, with a separate mini-batch size for each phase. During training, we re-scale and crop the images to a fixed size, keeping their aspect ratios. In all experiments, we test on images of original size. We implement UNIQUE using PyTorch, and will make the source code publicly available.
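
The softplus constraint on the predicted std can be sketched in a few lines. This is a plain-Python illustration of the standard function log(1 + exp(x)), written in a numerically stable form; it is not taken from the UNIQUE implementation.

```python
import math

def softplus(x):
    """Smooth map from any real number to a non-negative one: log(1 + exp(x)).

    The rearranged form avoids overflow for large |x|. In exact arithmetic the
    result is strictly positive; in floating point it can underflow to 0.0 for
    very negative x.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Raw network outputs of either sign become valid (positive) stds.
raw_outputs = [-5.0, 0.0, 5.0]
stds = [softplus(v) for v in raw_outputs]
```

For large positive inputs softplus is close to the identity, so it constrains the sign without saturating the output.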

IV-B Main Results

Fig. 4: gMAD competition results between PQR [zeng2018blind] and UNIQUE. (a) Fixed UNIQUE at the low-quality level. (b) Fixed UNIQUE at the high-quality level. (c) Fixed PQR at the low-quality level. (d) Fixed PQR at the high-quality level.
Fig. 5: gMAD competition results between DB-CNN [zhang2020blind] and UNIQUE. (a) Fixed UNIQUE at the low-quality level. (b) Fixed UNIQUE at the high-quality level. (c) Fixed DB-CNN at the low-quality level. (d) Fixed DB-CNN at the high-quality level.
Fig. 6: Scatter plots of means against stds of images from six IQA databases predicted by UNIQUE with the hinge loss: (a) LIVE, (b) CSIQ, (c) KADID-10K, (d) BID, (e) LIVE Challenge, and (f) KonIQ-10K. $f_w(x)$ and $\sigma_w(x)$ indicate the predicted mean and std of image $x$, respectively.
Fig. 7: Scatter plots of means against stds of images from six IQA databases predicted by UNIQUE without the hinge loss: (a) LIVE, (b) CSIQ, (c) KADID-10K, (d) BID, (e) LIVE Challenge, and (f) KonIQ-10K.

IV-B1 Correlation Results

We compare the performance of UNIQUE against three full-reference IQA measures - MS-SSIM [wang2003multiscale], NLPD [Laparra:17] and DISTS [ding2020iqa], and ten BIQA models, including four knowledge-driven methods that do not require MOSs for training - NIQE [mittal2013making], ILNIQE [zhang2015feature], dipIQ [ma2017dipiq] and Ma19 [ma2019blind], and six data-driven DNN-based methods - MEON [Ma2018End], deepIQA [bosse2016deep], RankIQA [liu2017rankiqa], PQR [zeng2018blind], PaQ-2-PiQ [ying2020from] and DB-CNN [zhang2020blind]. For the competing models, we either use the publicly available implementations or re-train them on the specific databases using the training codes provided by the corresponding authors [zeng2018blind, zhang2020blind]. The median SRCC and PLCC results across ten sessions are listed in Table IV. NIQE and its improved version ILNIQE do not perform well on realistic distortions and challenging synthetic distortions in KADID-10K [lin2019kadid], despite the original goal of handling arbitrary distortions. dipIQ and Ma19 may only be able to deal with distortions included in the training set. This highlights the difficulty that distortion-aware BIQA methods have in handling unseen distortions.

Fig. 8: Visual examples from different databases aligned in the learned perceptual scale. (a)-(d) Images with good predicted quality. (e)-(h) Images with fair predicted quality. (i)-(l) Images with poor predicted quality. In each row, images are arranged from left to right in descending order of predicted quality. Images are cropped for better visibility. Panels, with predicted quality $f_w(x)$: (a) LIVE Challenge, 2.641; (b) KADID-10K, 2.626; (c) KonIQ-10K, 2.520; (d) BID, 1.548; (e) KonIQ-10K, 0.772; (f) CSIQ, 0.417; (g) LIVE, 0.264; (h) BID, 0.107; (i) LIVE Challenge, -2.055; (j) KADID-10K, -2.388; (k) LIVE, -2.785; (l) CSIQ, -2.787.

We observe similar phenomena for deep learning-based methods when facing the cross-distortion-scenario challenge. Despite large-scale pre-training, MEON fine-tuned on LIVE [sheikh2006statistical] does not generalize to other databases with different distortion types and scenarios. Being exposed to more synthetic distortion types in TID2013 [ponomarenko2013color], deepIQA and RankIQA achieve better performance on KADID-10K [lin2019kadid] than MEON. Equipped with a second-order pooling, DB-CNN trained on LIVE Challenge [ghadiyaram2016massive] is reasonably good at cross-distortion-scenario generalization. PQR trains two probabilistic models on BID [ciancio2011no] and KonIQ-10K [hosu2020koniq]. Not surprisingly, PQR trained on KonIQ-10K, with greater content and distortion diversity, generalizes much better to the remaining databases than the one trained on BID. The most recent DNN-based method, PaQ-2-PiQ, aims for local quality prediction, but only delivers reasonable performance on databases captured in the wild. Enabled by the proposed training strategy, UNIQUE is able to train on multiple databases simultaneously, and outperforms all competing BIQA models on five of the six databases, often by a large margin; the exception is CSIQ, where Ma19 attains a slightly higher SRCC (0.917 vs. 0.902 in Table IV). It also shows competitive performance when compared with state-of-the-art full-reference IQA models on synthetic databases.

| SRCC | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
|---|---|---|---|---|---|---|---|
| UNIQUE trained without the hinge loss | 0.969 | 0.887 | 0.879 | 0.872 | 0.856 | 0.898 | 0.889 |
| UNIQUE trained with the hinge loss | 0.969 | 0.902 | 0.878 | 0.858 | 0.854 | 0.896 | 0.888 |

| Fidelity Loss | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
|---|---|---|---|---|---|---|---|
| UNIQUE trained without the hinge loss | 0.131 | 0.086 | 0.048 | 0.058 | 0.015 | 0.026 | 0.041 |
| UNIQUE trained with the hinge loss | 0.073 | 0.042 | 0.020 | 0.027 | 0.014 | 0.010 | 0.018 |

TABLE V: Median results of UNIQUE trained with and without the hinge loss across ten sessions. The results in the last column are computed by the weighted average across all databases according to the number of images in each database
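
The "Weighted" column of Table V can be reproduced as a short worked computation. The sketch assumes the weights are the full database sizes listed in Table I; if the weights were test-set sizes under a constant split ratio, the result would be the same.

```python
# Number of images per database, taken from Table I.
sizes = {"LIVE": 779, "CSIQ": 866, "KADID-10K": 10125,
         "BID": 586, "LIVE Challenge": 1162, "KonIQ-10K": 10073}

# Per-database SRCC values, taken from Table V.
srcc_with_hinge = {"LIVE": 0.969, "CSIQ": 0.902, "KADID-10K": 0.878,
                   "BID": 0.858, "LIVE Challenge": 0.854, "KonIQ-10K": 0.896}
srcc_without_hinge = {"LIVE": 0.969, "CSIQ": 0.887, "KADID-10K": 0.879,
                      "BID": 0.872, "LIVE Challenge": 0.856, "KonIQ-10K": 0.898}

def weighted_average(values, weights):
    """Average of per-database values, weighted by the number of images."""
    total = sum(weights.values())
    return sum(values[name] * weights[name] for name in weights) / total

weighted_with = round(weighted_average(srcc_with_hinge, sizes), 3)
weighted_without = round(weighted_average(srcc_without_hinge, sizes), 3)
```

Because KADID-10K and KonIQ-10K together hold most of the images, the weighted figure is dominated by those two databases.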

| Database | LIVE | CSIQ | BID | LIVE Challenge |
|---|---|---|---|---|
| NIQE | 0.906 | 0.627 | 0.459 | 0.449 |
| ILNIQE | 0.898 | 0.815 | 0.496 | 0.439 |
| dipIQ | 0.938 | 0.527 | 0.019 | 0.177 |
| Ma19 | 0.919 | 0.915 | 0.295 | 0.330 |
| PQR_s | 0.902 | 0.765 | 0.304 | 0.408 |
| PQR_r | 0.729 | 0.707 | 0.751 | 0.763 |
| DB-CNN_s | 0.916 | 0.751 | 0.602 | 0.531 |
| DB-CNN_r | 0.820 | 0.724 | 0.785 | 0.723 |
| UNIQUE | 0.917 | 0.830 | 0.783 | 0.786 |

TABLE VI: SRCC results on the four IQA databases under the cross-database setup. The subscripts “s” and “r” stand for models trained on KADID-10K [lin2019kadid] and KonIQ-10K [hosu2020koniq], respectively. UNIQUE is trained on KADID-10K and KonIQ-10K simultaneously

IV-B2 gMAD Competition Results

The gMAD competition [ma2020group] is a complementary methodology for IQA model comparison on large-scale databases without human annotations. Focusing on falsifying perceptual models in the most efficient way, gMAD seeks pairs of images that one model predicts to be of similar quality, while another model rates them as substantially different. To build the playground for gMAD, we combine all synthetically distorted images in the Waterloo Exploration Database [ma2017waterloo] with a corpus of realistically distorted images from SPAQ [fang2020perceptual]. We first let UNIQUE compete against PQR [zeng2018blind] trained on the entire KonIQ-10K [hosu2020koniq] in Fig. 4. The pair of images in (a) exhibits similarly poor quality, which is in close agreement with UNIQUE. However, PQR favors the top JPEG compressed image, exposing its weakness in capturing synthetic distortions. When the roles of the two models are switched, UNIQUE consistently spots the failures of PQR (see (c) and (d)), suggesting that UNIQUE is better able to assess image quality in the laboratory and wild.
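
The pair search at the heart of gMAD can be sketched as follows. This is a simplified illustration, not the procedure of [ma2020group]: it assumes both models' scores are precomputed for every image in the pool, and it treats "similar quality" as falling within a tolerance `tol` of a chosen quality level. All names and scores are hypothetical.

```python
def gmad_pair(defender_scores, attacker_scores, level, tol):
    """Find the pair the defender rates alike but the attacker rates most differently.

    Returns (index preferred by the attacker, index disliked by the attacker),
    or None if fewer than two images lie within `tol` of `level` for the defender.
    """
    candidates = [i for i, score in enumerate(defender_scores)
                  if abs(score - level) <= tol]
    if len(candidates) < 2:
        return None
    best = max(candidates, key=lambda i: attacker_scores[i])
    worst = min(candidates, key=lambda i: attacker_scores[i])
    return best, worst

# Hypothetical scores of five images under two models.
defender = [0.10, 0.12, 0.90, 0.11, 0.88]
attacker = [0.20, 0.80, 0.50, 0.40, 0.90]

# Images 0, 1, and 3 look equally poor to the defender; the attacker disagrees most on (1, 0).
pair_low = gmad_pair(defender, attacker, level=0.10, tol=0.05)
```

If human viewers agree with the attacker on the returned pair, the defender is falsified at that quality level; if they see the two images as similar, the defender survives the attack.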

| SRCC | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
|---|---|---|---|---|---|---|---|
| Baseline (LIVE) | 0.951 | 0.721 | 0.475 | 0.632 | 0.502 | 0.688 | 0.596 |
| Baseline (CSIQ) | 0.921 | 0.863 | 0.483 | 0.510 | 0.457 | 0.638 | 0.577 |
| Baseline (KADID-10K) | 0.877 | 0.749 | 0.780 | 0.498 | 0.515 | 0.607 | 0.693 |
| Baseline (BID) | 0.589 | 0.558 | 0.298 | 0.843 | 0.731 | 0.722 | 0.533 |
| Baseline (LIVE Challenge) | 0.535 | 0.504 | 0.312 | 0.849 | 0.842 | 0.773 | 0.563 |
| Baseline (KonIQ-10K) | 0.832 | 0.640 | 0.540 | 0.765 | 0.726 | 0.887 | 0.716 |
| Linear re-scaling (All databases) | 0.935 | 0.821 | 0.870 | 0.809 | 0.799 | 0.868 | 0.865 |
| Binary labeling (All databases) | 0.963 | 0.863 | 0.864 | 0.854 | 0.860 | 0.898 | 0.881 |
| UNIQUE (All databases) | 0.969 | 0.902 | 0.878 | 0.858 | 0.854 | 0.896 | 0.888 |

TABLE VII: Median SRCC results across ten sessions among different training strategies

| SRCC | LIVE [sheikh2006statistical] | CSIQ [larson2010most] | KADID-10K [lin2019kadid] | BID [ciancio2011no] | LIVE Challenge [ghadiyaram2016massive] | KonIQ-10K [hosu2020koniq] |
|---|---|---|---|---|---|---|
| DB-CNN (All databases) | 0.956 | 0.912 | 0.891 | 0.827 | 0.836 | 0.860 |
| Improvements over DB-CNN (CSIQ) | 11.81% | - | 77.84% | 151.37% | 85.37% | 72.34% |
| Improvements over DB-CNN (LIVE Challenge) | 32.22% | 31.98% | 82.58% | 2.22% | - | 11.69% |

TABLE VIII: Improved SRCC results of DB-CNN across ten sessions enabled by the proposed training strategy

We then let UNIQUE compete with DB-CNN [zhang2020blind], which has demonstrated competitive gMAD performance against other DNN-based BIQA models [Ma2018End, bosse2016deep] on the Waterloo Exploration Database. In Fig. 5 (a) and (b), we observe that UNIQUE survives the attacks by DB-CNN: the selected pairs contain images of similar quality according to human perception. DB-CNN [zhang2020blind] fails to penalize the top image in (a), which is severely degraded by a combination of out-of-focus blur and over-exposure. When UNIQUE serves as the attacker, it is able to falsify DB-CNN by finding the counterexamples in (c) and (d). This further validates that UNIQUE aligns images across distortion scenarios well in a learned perceptual scale.

IV-B3 Uncertainty Estimation Results

UNIQUE not only computes image quality scores, but also quantifies the uncertainty of such estimates. We test the learned uncertainty function both quantitatively and qualitatively. To do so, we construct a baseline version of UNIQUE, which is supervised by the fidelity loss only. That is, no direct supervision of the uncertainty estimate is provided, and training may suffer from the scaling ambiguity. In addition to SRCC, we also adopt the fidelity loss as a second and more suitable quantitative measure, as it takes the ground truth uncertainty into account during evaluation. Table V shows the median results across ten sessions. With the hinge loss added as an explicit regularizer, UNIQUE performs slightly worse in terms of the weighted SRCC, but significantly better in terms of the fidelity loss. This suggests that the hinge loss is helpful in regularizing the uncertainty learning of UNIQUE. We also draw the scatter plots of the learned uncertainty as a function of the predicted quality in Fig. 6. With the hinge regularizer, the learned uncertainty on all databases exhibits human-like behavior, in that UNIQUE tends to assess images at the two ends of the quality range with higher confidence (i.e., lower uncertainty). In contrast, without the hinge regularizer, the learned uncertainty is less interpretable, and may seem counterintuitive (see Fig. 7 (a) and (b)).
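To make the role of the fidelity loss as an evaluation measure concrete, the following is a minimal sketch. It assumes a Thurstone-style Gaussian model in which the probability that one image is of higher quality than another depends on the difference of their quality values, normalized by the combined variance. The function names and the exact Gaussian form are assumptions of this sketch and may differ from the formulation used to produce Table V.

```python
import math


def pairwise_probability(q_x: float, q_y: float,
                         var_x: float, var_y: float) -> float:
    """Probability that image x is of higher quality than image y,
    assuming independent Gaussian quality (an assumption of this sketch).

    q_x, q_y: quality values (MOSs or model predictions).
    var_x, var_y: the corresponding variances (squared uncertainties).
    """
    z = (q_x - q_y) / math.sqrt(var_x + var_y)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fidelity_loss(p: float, p_hat: float) -> float:
    """Fidelity loss between a ground truth probability p and a
    predicted probability p_hat; zero if and only if they coincide."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))


# Ground truth from hypothetical MOSs and standard deviations.
p = pairwise_probability(60.0, 55.0, 8.0 ** 2, 6.0 ** 2)
# Prediction with the same quality gap but an overconfident uncertainty.
p_hat = pairwise_probability(60.0, 55.0, 1.0 ** 2, 1.0 ** 2)
loss = fidelity_loss(p, p_hat)
```

In this toy case both probabilities exceed 0.5, so the predicted ranking of the pair is correct and a rank-based measure such as SRCC would not register a problem. The fidelity loss is still positive because the predicted uncertainty is too small, which is why it is the more informative measure for uncertainty estimation.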

IV-B4 Qualitative Results

We conduct a qualitative analysis of UNIQUE by sampling images across different databases, as shown in Fig. 8. Although the proposed training strategy does not generate pairs of images from two different databases, the optimized UNIQUE is capable of aligning images from different databases in a perceptually meaningful way. In particular, synthetically distorted images with severity levels that are unlikely to be encountered in the real world receive the lowest quality scores, which also conforms to our observation that the quality ranges of LIVE [sheikh2006statistical], CSIQ [larson2010most] and KADID-10K [lin2019kadid] are relatively broader.

IV-C Ablation Studies

IV-C1 Performance in a Cross-Database Setting

We test UNIQUE in a more challenging cross-database setting. Specifically, we construct another training set using image pairs sampled from the full KADID-10K [lin2019kadid] and KonIQ-10K [hosu2020koniq] databases, and re-train UNIQUE with the procedure described in Sec. IV-A. For comparison, we also re-train two top-performing DNN-based methods, PQR [zeng2018blind] and DB-CNN [zhang2020blind], on the full KADID-10K [lin2019kadid] and KonIQ-10K [hosu2020koniq] databases. The full LIVE [sheikh2006statistical], CSIQ [larson2010most], BID [ciancio2011no], and LIVE Challenge [ghadiyaram2016massive] databases are employed as the test sets. It is clear from Table VI that UNIQUE achieves significantly better performance than the four knowledge-driven models and the two DNN-based models. The training image pairs from the two databases effectively provide mutual regularization, guiding UNIQUE to a better local optimum. This experiment provides strong evidence that UNIQUE, empowered by the proposed training strategy, generalizes to both synthetic and realistic distortions.

IV-C2 Performance of Different Training Strategies

The key idea of UNIQUE to meet the cross-distortion-scenario challenge is to train it on multiple databases. Here we compare different training strategies against the proposed one. We first treat BIQA as a standard regression problem, and train six variants of UNIQUE on the six databases separately using the mean squared error (MSE) as the loss. Next, we exploit the idea of training BIQA models on multiple databases simultaneously. As discussed in Sec. I, we linearly re-scale the MOSs of all databases to the same range, and re-train UNIQUE. We also construct pairs of images within individual databases for training, and compute binary labels using MOSs: for an image pair, the ground truth label is one if the first image has the higher MOS and zero otherwise. The results are listed in Table VII. We find that regression-based models perform favorably on the training databases and generalize reasonably to databases of similar distortion scenarios, but fail to generalize across distortion scenarios. Training on multiple databases leads to a significant performance boost. Linear re-scaling offers the least improvement due to the absence of perceptual scale realignment [perez2020pairwise]. Without the label noise problem (see Fig. 1), binary labeling achieves better performance, but is still inferior to the proposed training strategy in terms of the weighted SRCC.
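The two alternative supervisory signals compared above can be sketched as follows. The target range `[0, 1]` is a placeholder, and resolving MOS ties in favor of the first image is an assumption of this sketch. The helper names `rescale_mos` and `binary_label` are hypothetical.

```python
from typing import List


def rescale_mos(mos: List[float], old_min: float, old_max: float,
                new_min: float = 0.0, new_max: float = 1.0) -> List[float]:
    """Linearly map MOSs from [old_min, old_max] to [new_min, new_max].

    The default target range is a placeholder chosen for illustration.
    """
    if old_max <= old_min:
        raise ValueError("old_max must be greater than old_min")
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (m - old_min) * scale for m in mos]


def binary_label(mos_x: float, mos_y: float) -> float:
    """Binary ranking label for an image pair from the same database.

    Returns 1.0 if x has the higher MOS and 0.0 otherwise; ties go to
    the first image, which is an assumption of this sketch.
    """
    return 1.0 if mos_x >= mos_y else 0.0


# MOSs on a hypothetical 1-to-5 scale, mapped to the placeholder range.
rescaled = rescale_mos([1.0, 3.0, 5.0], old_min=1.0, old_max=5.0)

# A clear-cut pair and a near-tie receive equally confident labels.
clear_pair = binary_label(70.0, 30.0)
near_tie = binary_label(50.1, 50.0)
```

Linear re-scaling equalizes only the numeric ranges; it does not establish that equal re-scaled values correspond to equal perceived quality across databases. Binary labels avoid cross-database comparison altogether, but they discard the degree of preference: the near-tie above is labeled as confidently as the clear-cut pair.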

IV-C3 Improving Existing BIQA Models

The proposed training strategy is model-agnostic, meaning that it can be used to fine-tune existing differentiable BIQA models for improved performance. Here we implement this idea by applying the proposed training strategy to DB-CNN [zhang2020blind]. The SRCC results are shown in Table VIII. We find that DB-CNN fine-tuned by the proposed training strategy achieves significantly better performance than the original versions trained on CSIQ and LIVE Challenge, where the improvement can be as high as 151.37%. In addition, we endow DB-CNN with the capability to estimate the uncertainty, as evidenced by the fidelity loss it achieves.

V Conclusion

We have introduced a unified uncertainty-aware BIQA model, UNIQUE, and a method of training it on multiple IQA databases simultaneously. We also proposed an uncertainty regularizer, which enables direct supervision from the ground truth human uncertainty. UNIQUE is the first of its kind with superior cross-distortion-scenario generalization. We believe these performance improvements arise because of 1) the continuous ranking annotation that provides a more accurate supervisory signal, 2) the fidelity loss that assigns appropriate penalties to image pairs with different probabilities, and 3) the hinge regularizer that offers better statistical modeling of the uncertainty. In the future, we hope that the proposed learning strategy will become a standard solution for existing and next-generation BIQA models to meet the cross-distortion-scenario challenge. We would also like to explore the limits of the proposed learning strategy, towards universal visual quality assessment of digital images and videos in various multimedia applications.