With the ever-increasing demand for processing massive numbers of Internet images, it is of paramount importance to develop computational image quality models to monitor, maintain, and enhance the perceived quality of the output images of various image processing systems [wang2006modern]. High degrees of consistency between model predictions and human opinions of image quality have been achieved in the full-reference regime, where distorted images are compared to their reference images of pristine quality [wang2003multiscale]. When such reference information is not available, no-reference or blind image quality assessment (BIQA), which relies solely on distorted images, becomes more practical yet more challenging. Recently, deep learning-based BIQA models have experienced an impressive series of successes due to joint optimization of feature representation and quality prediction. However, these models remain notoriously weak at cross-distortion-scenario generalization [zhang2020blind]. That is, models trained on images with distortions simulated in the laboratory cannot deal with images captured in the wild. Similarly, models optimized for realistic distortions (e.g., sensor noise and poor exposure) do not work well for synthetic distortions (e.g., Gaussian blur and JPEG compression).
A seemingly straightforward method of adapting to the distributional shift between synthetic and realistic distortions is to directly combine multiple IQA databases for training. However, existing databases have different perceptual scales due to differences in subjective testing methodologies. For example, the CSIQ database [larson2010most] used multiple-stimulus absolute category rating in a well-controlled laboratory environment, yielding difference mean opinion scores (DMOSs), whereas the LIVE Challenge database [ghadiyaram2016massive] used single-stimulus continuous quality rating on an unconstrained crowdsourcing platform, yielding MOSs on an incompatible scale. This means that a separate subjective experiment on images sampled from each database is required for perceptual scale realignment [sheikh2006statistical, larson2010most]. To make this point explicit, we linearly re-scaled the subjective scores of each of six databases [sheikh2006statistical, larson2010most, lin2019kadid, ciancio2011no, ghadiyaram2016massive, hosu2020koniq] to the same range, with a larger value indicating higher quality. Fig. 1 shows sample images that have approximately the same re-scaled MOS. As expected, they appear to have drastically different perceptual quality. A more promising design methodology for unified BIQA is to build a prior probability model for natural undistorted images as the reference distribution, to which a test distribution that summarizes the distorted image can be compared. The award-winning BIQA model - NIQE [mittal2013making] - is a specific instantiation of this idea, but is only capable of handling a small number of synthetic distortions.
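The linear re-scaling used for the illustration in Fig. 1 can be sketched as follows. This is a minimal sketch: the function name, the target range, and the example scores are illustrative assumptions, not the paper's exact procedure.

```python
def rescale(scores, lo=0.0, hi=100.0):
    """Linearly map raw (D)MOSs onto a common [lo, hi] range.

    Note: this only stretches each database onto the same numeric range;
    it does NOT perform perceptual scale realignment, which is exactly
    why images with the same re-scaled score can look very different.
    """
    s_min, s_max = min(scores), max(scores)
    return [lo + (hi - lo) * (s - s_min) / (s_max - s_min) for s in scores]
```

For DMOSs (where larger means worse quality), the mapping would additionally be flipped so that a larger re-scaled value always indicates higher quality.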
In addition to training BIQA models with (D)MOSs, there is another type of supervisory signal - the variance of human opinions - which we believe is beneficial for BIQA but has not been explored, to the best of our knowledge. Generally, humans tend to give more consistent ratings (i.e., smaller variances) to images at the two ends of the quality range, while assessing images in the mid-quality range with less certainty (see Fig. 3). Therefore, it is reasonable to expect image quality models to behave similarly. Moreover, previous methods [wu2018blind] have enjoyed the benefits of modeling the uncertainty of quality predictions in subsequent applications.
In this paper, we take steps toward developing unified uncertainty-aware BIQA models for both synthetic and realistic distortions. Our contributions include:
A training strategy that allows differentiable BIQA models to be trained on multiple IQA databases (covering different distortion scenarios) simultaneously. In particular, we first sample and combine pairs of images within each database. For each pair, we leverage the human-annotated (D)MOSs and variances to compute, as the supervisory signal, the probability that one image is of better perceptual quality than the other. The resulting training set bypasses additional subjective testing for perceptual scale realignment. We then use a pairwise learning-to-rank algorithm with the fidelity loss [tsai2007frank] to drive the learning of computational models for BIQA.
A regularizer that enforces a hinge constraint on the learned uncertainty using the variance of human opinions as guidance. This enables BIQA models to mimic the uncertain aspects of humans when performing the quality assessment task.
A Unified No-reference Image Quality and Uncertainty Evaluator (UNIQUE) based on a deep neural network (DNN) that significantly outperforms state-of-the-art BIQA models on six IQA databases (see Table I) covering both synthetic and realistic distortions. We also verify its generalizability in a challenging cross-database setting and via the group maximum differentiation (gMAD) competition methodology [ma2020group].
| Database | # of Images | Scenario | Annotation | Subjective Testing Methodology |
| --- | --- | --- | --- | --- |
| LIVE [sheikh2006statistical] | 779 | Synthetic | DMOS, variance | Single-stimulus continuous quality rating |
| CSIQ [larson2010most] | 866 | Synthetic | DMOS, variance | Multiple-stimulus absolute category rating |
| KADID-10K [lin2019kadid] | 10,125 | Synthetic | MOS, variance | Double-stimulus absolute category rating with crowdsourcing |
| BID [ciancio2011no] | 586 | Realistic | MOS, variance | Single-stimulus continuous quality rating |
| LIVE Challenge [ghadiyaram2016massive] | 1,162 | Realistic | MOS, variance | Single-stimulus continuous quality rating with crowdsourcing |
| KonIQ-10K [hosu2020koniq] | 10,073 | Realistic | MOS, variance | Single-stimulus absolute category rating with crowdsourcing |
| Model | Loss Function |
| --- | --- |
| RankIQA [liu2017rankiqa] | Hinge variant |
| DB-CNN [zhang2020blind] | Cross entropy |
| dipIQ [ma2017dipiq] | Cross entropy |
| Ma19 [ma2019blind] | Cross-entropy variant |
| Gao15 [gao2015learning] | Hinge |
| UNIQUE | Fidelity + Hinge |
II Related Work
In this section, we give a review of existing BIQA models over the last two decades.
II-A BIQA as Regression
Early attempts at BIQA were tailored to specific synthetic distortions, such as JPEG compression [wang2002no] and JPEG2000 compression [marziliano2004perceptual]. Later models aimed for general-purpose BIQA [moorthy2011blind, ye2012unsupervised, mittal2012no, xu2016blind, ghadiyaram2017perceptual], with the underlying assumption that statistics extracted from natural images are highly regular [simoncelli2001natural]
and that distortions break such statistical regularities. Based on natural scene statistics (NSS), a quality prediction function can be learned using standard supervised learning tools. Of particular interest is NIQE [mittal2013making], which is arguably the first unified BIQA model aiming to capture arbitrary distortions. However, the NSS model used in NIQE is not sensitive to the image “unnaturalness” introduced by realistic distortions. Zhang et al. [zhang2015feature] extended NIQE [mittal2013making] by exploiting a more powerful set of NSS features for local quality prediction, but the generalization to realistic distortions remains quite limited.
Joint optimization of feature engineering and quality regression enabled by deep learning has significantly improved the performance of BIQA in recent years. The apparent conflict between the small number of subjective ratings and the large number of learnable model parameters may be alleviated in three ways. The first is transfer learning [zeng2018blind], which directly fine-tunes DNNs pre-trained for object recognition. The second is patch-based training, which assigns to each image patch a local quality score transferred from the corresponding global quality score [kang2014convolutional, bosse2016deep]. The third is quality-aware pre-training, which automatically generates a large amount of labeled data by exploiting specifications of distortion processes or quality estimates of full-reference models [Ma2018End, liu2017rankiqa, zhang2020blind]. Despite impressive correlation numbers on individual databases of either synthetic or realistic distortions, DNN-based BIQA models struggle with cross-distortion-scenario generalization, and can be easily falsified in the gMAD competition [wang2020active].
II-B BIQA as Ranking
There are also methods that cast BIQA as a learning-to-rank problem, where relative ranking information can be obtained from distortion specifications [liu2017rankiqa, Ma2018End, zhang2020blind], full-reference IQA models [ma2017dipiq, ma2019blind], and human data [gao2015learning]. Liu et al. [liu2017rankiqa] and Zhang et al. [zhang2020blind] inferred discrete ranking information from images of the same content and distortion but at different levels for BIQA model pre-training. Different from [liu2017rankiqa, zhang2020blind], the proposed UNIQUE explores continuous ranking information from (D)MOSs and variances in the stage of final quality prediction. Ma et al. [ma2017dipiq, ma2019blind] extracted binary ranking information from full-reference IQA methods to guide the optimization of BIQA models. Since full-reference methods can only be applied to synthetic distortions, where the reference images are available, it is not trivial to extend the methods in [ma2017dipiq, ma2019blind] to realistic distortions. The closest work to ours is due to Gao et al. [gao2015learning], who computed binary rankings from MOSs. However, they neither performed end-to-end optimization of BIQA nor explored the idea of combining multiple IQA databases via pairwise rankings. As a result, their method only achieves reasonable performance on a limited number of synthetic distortions. UNIQUE takes a step further to be uncertainty-aware, learning from human behavior when evaluating image quality. We summarize ranking-based BIQA methods in Table II.
II-C Uncertainty-Aware BIQA
Learning uncertainty is helpful for understanding and analyzing model predictions. In Bayesian machine learning, uncertainty may come from two sources: one inherent in the data (i.e., data/aleatoric uncertainty) and the other in the learned parameters (i.e., model/epistemic uncertainty) [kendall2017what]. In the context of BIQA, Huang et al. [huang2019convolutional] modeled the uncertainty of patch quality to alleviate the label noise problem in patch-based training. Wu et al. [wu2018blind] employed a sparse Gaussian process for quality regression, where the data uncertainty can be jointly estimated without supervision. In contrast, UNIQUE assumes Thurstone's model [thurstone1927law] and learns the data uncertainty with direct supervision, aiming for an effective BIQA model with a probabilistic interpretation.
III Training UNIQUE
In this section, we first present the proposed training strategy, consisting of IQA database combination and pairwise learning-to-rank model estimation (see Fig. 2). We then describe the details of the UNIQUE model for unified uncertainty-aware BIQA.
III-A IQA Database Combination
Our goal is to combine IQA databases for training while avoiding extra subjective experiments for perceptual scale realignment. To achieve this, we randomly sample pairs of images from each database. For each image pair $(x, y)$, we infer the relative ranking information from the corresponding (D)MOSs and variances. Specifically, under Thurstone's model [thurstone1927law], we assume that the true perceptual quality $q(x)$ of image $x$ follows a Gaussian distribution with mean $\mu_x$ and variance $\sigma_x^2$ collected via subjective testing. Assuming the variability of quality across images is uncorrelated, the quality difference $q(x) - q(y)$ is also Gaussian with mean $\mu_x - \mu_y$ and variance $\sigma_x^2 + \sigma_y^2$. The probability that $x$ has higher perceptual quality than $y$ can then be calculated from the Gaussian cumulative distribution function, which admits a closed-form solution:
$$
p(x, y) = \Pr\left(q(x) \ge q(y)\right) = \Phi\left(\frac{\mu_x - \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2}}\right),
$$
where $\Phi(\cdot)$ denotes the standard Normal cumulative distribution function.
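This closed-form probability can be sketched numerically using only the standard error function; the function and argument names below are illustrative:

```python
import math

def prob_better(mu_x, var_x, mu_y, var_y):
    """P(x is perceived better than y) under a Thurstone-style model:
    the quality difference is N(mu_x - mu_y, var_x + var_y), so the
    probability is the standard Normal CDF of the standardized difference."""
    z = (mu_x - mu_y) / math.sqrt(var_x + var_y)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note the symmetry `prob_better(x, y) + prob_better(y, x) = 1`, and that equal means always give probability 0.5 regardless of the variances.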
Combining the sampled pairs of images from all databases, we build a training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}), p^{(i)}\}_{i=1}^{N}$. Our database combination approach allows future IQA databases to be added at essentially no cost.
III-B Model Estimation
Given the training set $\mathcal{D}$, we aim to learn two differentiable functions $f_w(\cdot)$ and $\sigma_w(\cdot)$, parameterized by a vector $w$, which accept an image of arbitrary size and compute its quality score and uncertainty, respectively. As in Section III-A, we assume the true perceptual quality obeys a Gaussian distribution, with the mean and variance now estimated by $f_w(x)$ and $\sigma_w^2(x)$, respectively. The probability of preferring $x$ over $y$ perceptually in an image pair $(x, y)$ is
$$
\hat{p}(x, y) = \Phi\left(\frac{f_w(x) - f_w(y)}{\sqrt{\sigma_w^2(x) + \sigma_w^2(y)}}\right).
$$
It remains to specify a similarity measure between the probabilities $p$ and $\hat{p}$ as the objective for model estimation. In machine learning, cross entropy may be the de facto measure for this purpose, but it has several drawbacks [tsai2007frank]. First, the minimum of the cross-entropy loss is not exactly zero except when the ground truth $p \in \{0, 1\}$, which may hinder learning on image pairs with $p$ close to $0.5$. Second, the cross-entropy loss is unbounded from above, which may over-penalize some hard training examples and therefore bias the learned models. To address these problems, we choose the fidelity loss [tsai2007frank], which originates from quantum physics as a measure of the difference between two quantum states [birrell1984quantum]:
$$
\ell_F(x, y; w) = 1 - \sqrt{p\,\hat{p}(x, y)} - \sqrt{\left(1 - p\right)\left(1 - \hat{p}(x, y)\right)}.
$$
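In code, the fidelity loss is a one-liner; the sketch below uses illustrative names and plain floats rather than tensors:

```python
import math

def fidelity_loss(p, p_hat):
    """Fidelity loss between the ground-truth preference probability p
    and the predicted probability p_hat.

    Unlike cross entropy, it is exactly zero whenever p_hat == p (for any
    p in [0, 1], not only p in {0, 1}), and it is bounded above by 1."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))
```

The boundedness means a single hard pair cannot dominate the mini-batch gradient, which is the motivation given above for preferring it over cross entropy.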
Joint estimation of image quality and uncertainty introduces a scaling ambiguity. More precisely, if we apply the scaling $f_w \mapsto a f_w + b$ and $\sigma_w \mapsto a \sigma_w$ for any $a > 0$, the probability given by Eq. (2) is unchanged. Our preliminary results [zhang2020learning] showed that the $\sigma_w$ learned by optimizing the fidelity loss alone neither resembles any aspect of human behavior in BIQA nor reveals new statistical properties of natural images. To resolve the scaling ambiguity and provide $\sigma_w$ with direct supervision, we enforce an explicit regularizer on $\sigma_w$ by taking advantage of the ground-truth standard deviations (stds). Note that the ground-truth stds across IQA databases are not comparable, which prevents the use of their absolute values. Instead, for each pair $(x, y)$, we infer a binary label $s$ for uncertainty learning, where $s = 1$ if $\sigma_x \ge \sigma_y$ and $s = -1$ otherwise. We define the regularizer using the hinge loss:
$$
\ell_H(x, y; w) = \max\left(0,\, \xi - s\left(\sigma_w(x) - \sigma_w(y)\right)\right),
$$
where the margin $\xi$ sets a specific scale for BIQA models to work with. The augmented training set becomes $\mathcal{D} = \{(x^{(i)}, y^{(i)}), p^{(i)}, s^{(i)}\}_{i=1}^{N}$. During training, we sample a mini-batch $\mathcal{B}$ from
$\mathcal{D}$ in each iteration, and use stochastic gradient descent to update the parameter vector $w$ by minimizing the following empirical loss:
$$
\ell(\mathcal{B}; w) = \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \ell_F(x, y; w) + \lambda\, \ell_H(x, y; w),
$$
where $|\mathcal{B}|$ denotes the cardinality of $\mathcal{B}$ and $\lambda$ trades off the two terms.
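The hinge regularizer and the per-pair objective can be sketched as follows; the margin and trade-off values are illustrative assumptions, and the fidelity term is passed in precomputed to keep the sketch self-contained:

```python
def hinge_regularizer(sigma_x_hat, sigma_y_hat, s, margin=0.1):
    """Hinge penalty on predicted uncertainties.

    s = +1 if the human std of x is >= that of y, else -1.
    The penalty is zero only when the predicted stds respect the human
    ordering by at least the margin, which fixes the scale of sigma."""
    return max(0.0, margin - s * (sigma_x_hat - sigma_y_hat))

def pair_objective(fidelity_term, sigma_x_hat, sigma_y_hat, s,
                   lam=1.0, margin=0.1):
    """Per-pair empirical loss: fidelity term plus lam-weighted hinge term.
    Averaging this over a mini-batch gives the training objective."""
    return fidelity_term + lam * hinge_regularizer(
        sigma_x_hat, sigma_y_hat, s, margin)
```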
III-C Specification of UNIQUE
We use ResNet-34 [he2016deep] as the backbone of UNIQUE due to its good balance between model complexity and representation capability. The pairwise learning-to-rank framework, composed of two streams, is shown in Fig. 2. Each stream is implemented by a DNN, consisting of a stage of convolution, batch normalization [ioffe2015batch], ReLU, and max-pooling, followed by four stages of residual blocks (see Table III). To generate a fixed-length image representation regardless of input resolution and to summarize higher-order spatial statistics, we replace the first-order average pooling in the original ResNet with second-order bilinear pooling, which has been empirically proven useful in object recognition [lin2015bilinear] and BIQA [zhang2020blind]. We flatten the spatial dimensions of the feature representation after the last convolution to obtain $z \in \mathbb{R}^{S \times C}$, where $S$ and $C$ denote the spatial and channel dimensions, respectively. The bilinear pooling can be defined as
$$
\bar{z} = z^{\top} z \in \mathbb{R}^{C \times C}.
$$
We further flatten $\bar{z}$ and append a fully connected layer with two outputs to represent $f_w(x)$ and $\sigma_w(x)$, respectively. The network parameters of the two streams are shared during the entire optimization process.
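The second-order pooling above can be sketched in pure Python for clarity (a real implementation would operate on tensors; names are illustrative):

```python
def bilinear_pool(z):
    """Second-order (bilinear) pooling of a flattened feature map.

    z is a list of S spatial positions, each a length-C feature vector.
    Returns the C x C matrix z^T z, flattened row-major to length C*C.
    The output length is fixed by C alone, independent of S, which is
    why the representation length does not depend on input resolution."""
    S, C = len(z), len(z[0])
    out = [0.0] * (C * C)
    for s in range(S):
        for i in range(C):
            for j in range(C):
                out[i * C + j] += z[s][i] * z[s][j]
    return out
```

With the C = 512 channels of ResNet-34's last stage, the flattened output has length 512 × 512 = 262,144, matching the input dimension of the final fully connected layer in Table III.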
| Layer Name | Network Parameter |
| --- | --- |
| Convolution | 7×7, 64, stride 2 |
| Max Pooling | 3×3, stride 2 |
| Residual Block 1 | [3×3, 64] × 3 |
| Residual Block 2 | [3×3, 128] × 4 |
| Residual Block 3 | [3×3, 256] × 6 |
| Residual Block 4 | [3×3, 512] × 3 |
| Fully Connected Layer | 262,144 × 2 |
| Model | LIVE [sheikh2006statistical] | CSIQ [larson2010most] | KADID-10K [lin2019kadid] | BID [ciancio2011no] | CLIVE [ghadiyaram2016massive] | KonIQ-10K [hosu2020koniq] |
| --- | --- | --- | --- | --- | --- | --- |
| MEON (LIVE) [Ma2018End] | – / – | 0.726 / 0.787 | 0.234 / 0.410 | 0.100 / 0.217 | 0.378 / 0.477 | 0.145 / 0.242 |
| deepIQA (LIVE) [bosse2016deep] | – / – | 0.645 / 0.730 | 0.270 / 0.309 | -0.043 / 0.127 | 0.076 / 0.162 | -0.064 / 0.088 |
| RankIQA (LIVE) [liu2017rankiqa] | – / – | 0.711 / 0.790 | 0.436 / 0.488 | 0.324 / 0.350 | 0.451 / 0.503 | 0.617 / 0.631 |
| PQR (BID) [zeng2018blind] | 0.663 / 0.673 | 0.522 / 0.612 | 0.321 / 0.403 | – / – | 0.691 / 0.740 | 0.614 / 0.650 |
| PaQ-2-PiQ (LIVE Patch) [ying2020from] | 0.472 / 0.559 | 0.555 / 0.658 | 0.379 / 0.429 | 0.682 / 0.713 | 0.719 / 0.778 | 0.722 / 0.735 |
| DB-CNN (CSIQ) [zhang2020blind] | 0.855 / 0.854 | – / – | 0.501 / 0.569 | 0.329 / 0.382 | 0.451 / 0.472 | 0.499 / 0.515 |
| DB-CNN (LIVE Challenge) | 0.723 / 0.754 | 0.691 / 0.685 | 0.488 / 0.529 | 0.809 / 0.832 | – / – | 0.770 / 0.825 |
| UNIQUE (All databases) | 0.969 / 0.968 | 0.902 / 0.927 | 0.878 / 0.876 | 0.858 / 0.873 | 0.854 / 0.890 | 0.896 / 0.901 |

Each cell lists SRCC / PLCC.
IV Experiments

In this section, we first present the experimental setups, including IQA database selection, evaluation protocols, and training details of UNIQUE. We then compare UNIQUE with several state-of-the-art BIQA models on existing IQA databases and via the gMAD competition [ma2020group]. Moreover, we verify the effectiveness of the proposed training strategy by comparing it with alternative training schemes, testing it in a cross-database setting, and using it to improve existing BIQA models.
IV-A Experimental Setups
We choose six IQA databases (summarized in Table I), among which LIVE [sheikh2006statistical], CSIQ [larson2010most], and KADID-10K [lin2019kadid] contain synthetic distortions, while BID [ciancio2011no], LIVE Challenge [ghadiyaram2016massive], and KonIQ-10K [hosu2020koniq] include realistic distortions. All selected databases provide the stds of subjective quality scores (see Fig. 3). We exclude TID2013 [ponomarenko2013color] from our experiments because its MOSs are computed from the number of wins in at most nine paired comparisons without suitable psychometric scaling [mikhailiuk2018psychometric], and therefore do not satisfy the Gaussian assumption. We refer interested readers to our preliminary work [zhang2020learning] for results on TID2013 [ponomarenko2013color].
We randomly sample images from each database to construct the training set and leave the remainder for testing. For the synthetic databases LIVE, CSIQ, and KADID-10K, we split the training and test sets according to the reference images to ensure content independence. We adopt two performance criteria: the Spearman rank-order correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC), which measure prediction monotonicity and precision, respectively. To reduce the bias caused by randomness in training/test set splitting, we repeat this procedure ten times and report median SRCC and PLCC results.
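For reference, both criteria can be sketched in a few lines; tie handling is omitted here for brevity (a library routine such as scipy.stats.spearmanr would normally be used):

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson on the ranks.
    This sketch assumes no tied scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))
```

SRCC is invariant to any monotonic mapping of the predictions, which is why it measures monotonicity, while PLCC is sensitive to the linearity of the fit.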
We train UNIQUE on image pairs using Adam [Kingma2014adam] by minimizing the objective in Eq. (5). The margin $\xi$ of the hinge loss and the trade-off parameter $\lambda$ are fixed throughout; empirically, we find that the performance is insensitive to these two hyperparameters. A softplus function is applied to constrain the predicted std $\sigma_w$ to be positive. The parameters of UNIQUE based on ResNet-34 [he2016deep] are initialized with the weights pre-trained on ImageNet [deng2009imagenet]. The parameters of the last fully connected layer are initialized by He's method [he2015delving]. The initial learning rate is decayed by a fixed factor every three epochs, and we train UNIQUE for twelve epochs in total. A warm-up strategy is adopted: only the last fully connected layer is trained in the first three epochs; for the remaining epochs, we fine-tune the entire network. During training, we re-scale and crop the images while keeping their aspect ratios. In all experiments, we test on images of their original size. We implement UNIQUE in PyTorch, and will make the source code publicly available.
IV-B Main Results
IV-B1 Correlation Results
We compare the performance of UNIQUE against three full-reference IQA measures - MS-SSIM [wang2003multiscale], NLPD [Laparra:17], and DISTS [ding2020iqa] - and ten BIQA models, including four knowledge-driven methods that do not require MOSs for training - NIQE [mittal2013making], IL-NIQE [zhang2015feature], dipIQ [ma2017dipiq], and Ma19 [ma2019blind] - and six data-driven DNN-based methods - MEON [Ma2018End], deepIQA [bosse2016deep], RankIQA [liu2017rankiqa], PQR [zeng2018blind], PaQ-2-PiQ [ying2020from], and DB-CNN [zhang2020blind]. For the competing models, we either use the publicly available implementations or re-train them on the specific databases using the training codes provided by the corresponding authors [zeng2018blind, zhang2020blind]. The median SRCC and PLCC results across ten sessions are listed in Table IV. NIQE and its improved version IL-NIQE do not perform well on realistic distortions or on the challenging synthetic distortions in KADID-10K [lin2019kadid], despite their original goal of handling arbitrary distortions. dipIQ and Ma19 may only be able to handle distortions included in their training sets. This highlights the difficulty distortion-aware BIQA methods face with unseen distortions.
We observe similar phenomena for deep learning-based methods when facing the cross-distortion-scenario challenge. Despite large-scale pre-training, MEON fine-tuned on LIVE [sheikh2006statistical] does not generalize to other databases with different distortion types and scenarios. Being exposed to more synthetic distortion types in TID2013 [ponomarenko2013color], deepIQA and RankIQA achieve better performance on KADID-10K [lin2019kadid] than MEON. Equipped with second-order pooling, DB-CNN trained on LIVE Challenge [ghadiyaram2016massive] is reasonably good at cross-distortion-scenario generalization. PQR trains two probabilistic models on BID [ciancio2011no] and KonIQ-10K [hosu2020koniq]. Not surprisingly, PQR trained on KonIQ-10K, with its greater content and distortion diversity, generalizes much better to the remaining databases than the one trained on BID. The most recent DNN-based method, PaQ-2-PiQ, aims for local quality prediction but only delivers reasonable performance on databases captured in the wild. Enabled by the proposed training strategy, UNIQUE trains on multiple databases simultaneously and outperforms all competing models on all six databases by a large margin. It also shows competitive performance compared to state-of-the-art full-reference IQA models on the synthetic databases.
| SRCC | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UNIQUE trained without the hinge loss | 0.969 | 0.887 | 0.879 | 0.872 | 0.856 | 0.898 | 0.889 |
| UNIQUE trained with the hinge loss | 0.969 | 0.902 | 0.878 | 0.858 | 0.854 | 0.896 | 0.888 |

| Fidelity Loss | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UNIQUE trained without the hinge loss | 0.131 | 0.086 | 0.048 | 0.058 | 0.015 | 0.026 | 0.041 |
| UNIQUE trained with the hinge loss | 0.073 | 0.042 | 0.020 | 0.027 | 0.014 | 0.010 | 0.018 |
IV-B2 gMAD Competition Results
The gMAD competition [ma2020group] is a complementary methodology for IQA model comparison on large-scale databases without human annotations. Focusing on falsifying perceptual models in the most efficient way, gMAD seeks pairs of images of similar quality according to one model but substantially different quality according to another. To build the playground for gMAD, we combine all synthetically distorted images in the Waterloo Exploration Database [ma2017waterloo] with a corpus of realistically distorted images from SPAQ [fang2020perceptual]. We first let UNIQUE compete against PQR [zeng2018blind] trained on the entire KonIQ-10K [hosu2020koniq] in Fig. 4. The pair of images in (a) exhibits similar poor quality, in close agreement with UNIQUE. However, PQR favors the top JPEG-compressed image, exposing its weakness in capturing synthetic distortions. When the roles of the two models are switched, UNIQUE consistently spots the failures of PQR (see (c) and (d)), suggesting that UNIQUE is better able to assess image quality both in the laboratory and in the wild.
| SRCC | LIVE | CSIQ | KADID-10K | BID | LIVE Challenge | KonIQ-10K | Weighted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (LIVE Challenge) | 0.535 | 0.504 | 0.312 | 0.849 | 0.842 | 0.773 | 0.563 |
| Linear re-scaling (All databases) | 0.935 | 0.821 | 0.870 | 0.809 | 0.799 | 0.868 | 0.865 |
| Binary labeling (All databases) | 0.963 | 0.863 | 0.864 | 0.854 | 0.860 | 0.898 | 0.881 |
| UNIQUE (All databases) | 0.969 | 0.902 | 0.878 | 0.858 | 0.854 | 0.896 | 0.888 |
| SRCC | LIVE [sheikh2006statistical] | CSIQ [larson2010most] | KADID-10K [lin2019kadid] | BID [ciancio2011no] | LIVE Challenge [ghadiyaram2016massive] | KonIQ-10K [hosu2020koniq] |
| --- | --- | --- | --- | --- | --- | --- |
| DB-CNN (All databases) | 0.956 | 0.912 | 0.891 | 0.827 | 0.836 | 0.860 |
| Improvements over DB-CNN (CSIQ) | 11.81% | – | 77.84% | 151.37% | 85.37% | 72.34% |
| Improvements over DB-CNN (LIVE Challenge) | 32.22% | 31.98% | 82.58% | 2.22% | – | 11.69% |
We then let UNIQUE compete with DB-CNN [zhang2020blind], which has demonstrated competitive gMAD performance against other DNN-based BIQA models [Ma2018End, bosse2016deep] on the Waterloo Exploration Database. In Fig. 5 (a) and (b), we observe that UNIQUE successfully survives the attacks by DB-CNN, producing pairs of images of similar quality according to human perception. DB-CNN [zhang2020blind] fails to penalize the top image in (a), which is severely degraded by a combination of out-of-focus blur and over-exposure. When UNIQUE serves as the attacker, it is able to falsify DB-CNN by finding the counterexamples in (c) and (d). This further validates that UNIQUE aligns images well across distortion scenarios in a learned perceptual scale.
IV-B3 Uncertainty Estimation Results
UNIQUE not only computes image quality scores but also enables uncertainty quantification of such estimates. We test the learned uncertainty function both quantitatively and qualitatively. To do so, we construct a baseline version of UNIQUE supervised by the fidelity loss only. That is, no direct supervision of the uncertainty is provided, and training may suffer from the scaling ambiguity. In addition to SRCC, we adopt the fidelity loss as a second and more suitable quantitative measure, as it takes the ground-truth uncertainty into account during evaluation. Table V shows the median results across ten sessions. Adding the hinge loss as an explicit regularizer, UNIQUE presents slightly inferior performance in terms of the weighted SRCC, but is significantly better in terms of the fidelity loss. This suggests that the hinge loss is helpful in regularizing the uncertainty learning of UNIQUE. We also draw scatter plots of the learned uncertainty as a function of the predicted quality in Fig. 6. With the hinge regularizer, the learned uncertainty on all databases exhibits human-like behavior, in that UNIQUE tends to assess images at the two ends of the quality range with higher confidence (i.e., lower uncertainty). In contrast, without the hinge regularizer, the learned uncertainty is less interpretable and may seem counterintuitive (see Fig. 7 (a) and (b)).
IV-B4 Qualitative Results
We conduct a qualitative analysis of UNIQUE by sampling images across different databases, as shown in Fig. 8. Although the proposed training strategy does not generate pairs of images from two different databases, the optimized UNIQUE is capable of aligning images from different databases in a perceptually meaningful way. In particular, synthetically distorted images with severity levels that one may rarely encounter in the real world receive the lowest quality scores, which also conforms to our observation that the quality ranges spanned by LIVE [sheikh2006statistical], CSIQ [larson2010most], and KADID-10K [lin2019kadid] are relatively broader.
IV-C Ablation Studies
IV-C1 Performance in a Cross-Database Setting
We test UNIQUE in a more challenging cross-database setting. Specifically, we construct another training set using image pairs sampled from the full KADID-10K [lin2019kadid] and KonIQ-10K [hosu2020koniq] databases, and re-train UNIQUE following the procedure described in Section IV-A. For comparison, we also re-train two top-performing DNN-based methods - PQR [zeng2018blind] and DB-CNN [zhang2020blind] - on the same two full databases. The full LIVE [sheikh2006statistical], CSIQ [larson2010most], BID [ciancio2011no], and LIVE Challenge [ghadiyaram2016massive] databases are employed as the test sets. It is clear from Table VI that UNIQUE achieves significantly better performance than the four knowledge-driven models and the two DNN-based models. The training image pairs from the two databases effectively provide mutual regularization, guiding UNIQUE to a better local optimum. This experiment provides strong evidence that UNIQUE, empowered by the proposed training strategy, generalizes to both synthetic and realistic distortions.
IV-C2 Performance of Different Training Strategies
The key idea behind UNIQUE's handling of the cross-distortion-scenario challenge is to train it on multiple databases. Here we compare the proposed training strategy against alternatives. We first treat BIQA as a standard regression problem and train six variants of UNIQUE on the six databases separately using the mean squared error (MSE) as the loss. Next, we exploit the idea of training BIQA models on multiple databases simultaneously. As discussed in Sec. I, we linearly re-scale the MOSs to the same range and re-train UNIQUE. We also construct pairs of images within individual databases for training and compute binary labels from the MOSs: for an image pair $(x, y)$, the ground truth $p = 1$ if $\mu_x \ge \mu_y$ and $p = 0$ otherwise. The results are listed in Table VII. We find that regression-based models perform favorably on their training databases and generalize reasonably to databases of similar distortion scenarios, but fail at cross-distortion-scenario generalization. Training on multiple databases leads to a significant performance boost. Linear re-scaling offers the least improvement due to the absence of perceptual scale realignment [perez2020pairwise]. By avoiding the resulting label noise (see Fig. 1), binary labeling achieves better performance, but is still inferior to the proposed training strategy in terms of the weighted SRCC.
IV-C3 Improving Existing BIQA Models
The proposed training strategy is model-agnostic, meaning that it can be used to fine-tune existing differentiable BIQA models for improved performance. Here we implement this idea by applying the proposed training strategy to DB-CNN [zhang2020blind]. The SRCC results are shown in Table VIII. We find that DB-CNN fine-tuned by the proposed training strategy achieves significantly better performance than the original versions trained on CSIQ and LIVE Challenge alone, with substantial improvements on databases outside the original training scenario. In addition, we endow DB-CNN with the capability to estimate uncertainty, as evidenced by a low fidelity loss.
V Conclusion

We have introduced a unified uncertainty-aware BIQA model - UNIQUE - and a method for training it on multiple IQA databases simultaneously. We also proposed an uncertainty regularizer, which enables direct supervision from the ground-truth human uncertainty. UNIQUE is the first of its kind with superior cross-distortion-scenario generalization. We believe these performance improvements arise from 1) the continuous ranking annotation that provides a more accurate supervisory signal, 2) the fidelity loss that assigns appropriate penalties to image pairs with different probabilities, and 3) the hinge regularizer that offers better statistical modeling of the uncertainty. In the future, we hope that the proposed learning strategy will become a standard solution for existing and next-generation BIQA models to meet the cross-distortion-scenario challenge. We also would like to explore the limits of the proposed learning strategy, toward universal visual quality assessment of digital images and videos in various multimedia applications.