Objectively assessing image quality is of fundamental importance due in part to the massive expansion of online image volume. Objective image quality assessment (IQA) has become an active research topic over the last decade, with a large variety of IQA models proposed [1, 2]. They can be categorized into full-reference models (FR, where the reference image is fully available when evaluating a distorted image), reduced-reference models (RR, where only partial information about the reference image is available), and blind/no-reference models (NR, where the reference image is not accessible). In many real-world applications, reference images are unavailable, making blind IQA (BIQA) models highly desirable in practice.
Many BIQA models are developed by supervised learning [6, 7, 8, 9, 10, 11, 12, 13, 14] and share a common two-stage structure: 1) perception- and/or distortion-relevant features are extracted from the test image; and 2) a quality prediction function
is learned by statistical machine learning algorithms. The performance and robustness of these approaches rely heavily on the quality and quantity of the ground truth data used for training. The most common type of ground truth data is the mean opinion score (MOS), which is the average of quality ratings given by multiple subjects. Therefore, these models are often referred to as opinion-aware BIQA (OA-BIQA) models, and they may incur the following drawbacks. First, collecting MOSs via subjective testing is slow, cumbersome, and expensive. As a result, even the largest publicly available IQA database, TID2013, provides only 3,000 images with MOSs. This limited number of training images is extremely sparsely distributed in the entire image space, whose dimension equals the number of pixels and is typically on the order of millions. As such, the generalizability of BIQA models learned from small training samples is questionable on real-world images. Second, among thousands of sample images, only a few dozen source reference images can be included, considering the combinations of reference images, distortion types, and distortion levels. For example, the TID2013 database includes only 25 source images. It is extremely unlikely that this limited number of reference images sufficiently represents the variations that exist in real-world images. Third, since these BIQA models are trained on individual images to make independent quality predictions, the cost function is blind to the relative perceptual order between images. As a result, the learned models are weak at ordering images with respect to their perceptual quality.
In this paper, we show that a vast amount of reliable training data in the form of so-called quality-discriminable image pairs (DIP) can be generated by exploiting large-scale databases with diverse image content. Each DIP is associated with a perceptual uncertainty measure to indicate the confidence level of its quality discriminability. We show that such DIPs can be generated at very low cost without resorting to subjective testing. We then employ RankNet, a neural network-based pairwise learning-to-rank (L2R) algorithm [17, 18], to learn an opinion-unaware BIQA (OU-BIQA, meaning that no subjective opinions are used for training) model by incorporating the uncertainty measure into the loss function. Extensive experiments on four benchmark IQA databases demonstrate that the DIP inferred quality (dipIQ) indices significantly outperform previous OU-BIQA models. We also conduct another set of experiments in which we train the dipIQ indices using different feature representations as inputs and compare them with OA-BIQA models using the same representations. The generalizability and robustness of dipIQ are improved across all four IQA databases and verified by the group MAximum Differentiation (gMAD) competition method, which examines image pairs optimally selected from the Waterloo Exploration Database. Furthermore, we extend the proposed pairwise L2R approach for OU-BIQA to a listwise one by invoking ListNet (a listwise L2R extension of RankNet) and transforming DIPs into quality-discriminable image lists (DIL) for training. The resulting DIL inferred quality (dilIQ) index leads to an additional performance gain.
The remainder of the paper is organized as follows. BIQA models and typical L2R algorithms are reviewed and categorized in Section II. The proposed dipIQ approach is introduced in Section III. Experimental results using dipIQ on four benchmark IQA databases compared with state-of-the-art BIQA models are presented in Section IV, followed by an extension to the dilIQ model in Section V. We conclude the paper in Section VI.
II Related Work
We first review existing BIQA models according to their two-stage structure: feature extraction and quality prediction model learning. We then review typical L2R algorithms. Details of RankNet are provided in Section III.
II-A Existing BIQA Models
From the feature extraction point of view, three types of knowledge can be exploited to craft useful features for BIQA. The first is knowledge about our visual world that summarizes the statistical regularities of undistorted images. The second is knowledge about degradation, which can then be explicitly taken into account to build features for particular artifacts, such as blocking [22, 23, 24], blurring [25, 26, 27] and ringing [28, 29, 30]. The third is knowledge of the human visual system (HVS) , namely perceptual models derived from visual physiological and psychophysical studies [32, 33, 34, 35]. Natural scene statistics (NSS), which seek to capture the natural statistical behavior of images, embody the three-fold modeling in a rather elegant way . NSS can be extracted directly in the spatial domain or in transform domains such as DFT, DCT, and wavelets [36, 37].
In the spatial domain, the intensity variance in smooth regions close to edges can indicate ringing artifacts. Step edge detectors that operate at block boundaries measure the severity of discontinuities caused by JPEG compression. The sample entropy of intensity histograms is used to identify image anisotropy [40, 41]. The responses of image gradients and the Laplacian of Gaussian operators are jointly modeled to describe the destruction of the statistical naturalness of images. The singular value decomposition of local image gradient matrices may provide a quantitative measure of image content. Mean-subtracted and contrast-normalized pixel value statistics have also been modeled using a generalized Gaussian distribution (GGD) [8, 43, 44, 45], inspired by the adaptive gain control mechanism observed in neurons.
Statistical modeling in the wavelet domain resembles the early visual system , and natural images exhibit statistical regularities in the wavelet space. Specifically, it is widely acknowledged that the marginal distribution of wavelet coefficients of a natural image (regardless of content) has a sharp peak near zero and heavier than Gaussian tails. Therefore, statistics of raw [46, 4, 6, 47] and normalized [48, 49] wavelet coefficients, and wavelet coefficient correlations in the neighborhood [29, 10, 50, 51, 52] can be individually or jointly modeled as image naturalness measurements. The phase information of wavelet coefficients, for example expressed as the local phase coherence, is exploited to describe the perception of blur  and sharpness .
In the DFT domain, blur kernels can be efficiently estimated [50, 54, 51] to quantify the degree of image blurring. The regular peaks at characteristic frequencies can be used to identify blocking artifacts [23, 55]. Moreover, it is generally hypothesized that most perceptual information in an image is stored in the Fourier phase rather than the Fourier amplitude [56, 57]. Phase congruency is such a feature, identifying perceptually significant image features at spatial locations where Fourier components are maximally in phase.
In the DCT domain, the kurtosis of the AC coefficients can be used to quantify structure statistics. In addition, the AC coefficients can also be jointly modeled using a GGD.
There is a growing interest in learning features for BIQA. Ye et al.
learned quality filters on image patches using K-means clustering and adopted filter responses as features. They then took one step further by supervised filter learning . Xue et al.  proposed a quality-aware clustering scheme on the high frequencies of raw patches, guided by an FR-IQA measure . Kang et al.
investigated a convolutional neural network to jointly learn features and nonlinear mappings for BIQA.
From the model learning perspective, SVR [63, 64] is the most commonly used tool to learn the quality prediction function for BIQA [6, 10, 52, 9, 45, 12]. The capabilities of neural networks to pre-train a model without labels and to easily scale up have also been exploited for this purpose [40, 62, 51, 47]. Another typical quality regression approach is the example-based method, which predicts the quality of a test image as the weighted average of training image quality scores, where the weights encode the perceptual similarity between the test and training images [52, 60, 14]. Saad et al. jointly modeled the extracted features and MOS using a multivariate Gaussian distribution and performed prediction by maximizing the conditional probability [59, 7]. Similar probabilistic modeling strategies have been investigated [43, 65]. Pairwise L2R algorithms have also been used to learn BIQA models [66, 67]. However, in these methods, DIP generation relies solely on MOS availability, which limits the number of DIPs produced. Moreover, their performance is inferior to that of existing BIQA methods. Other advanced learning algorithms include topic modeling, Gaussian processes, and multi-kernel learning [69, 67].
II-B Existing L2R Algorithms
Existing L2R algorithms can be broadly classified into three categories based on the training data format and loss function: pointwise, pairwise, and listwise approaches. Excellent surveys of L2R algorithms are available in the literature; here we only provide a brief overview.
Pointwise approaches assume that each instance’s importance degree is known. The loss function usually examines the prediction accuracy of each individual instance. In an early attempt on L2R, Fuhr 
adopted a linear regression with a polynomial feature expansion to learn the score function. Cossock and Zhang  utilized a similar formulation with some theoretical justifications for the use of the least squares loss function. Nallapati 
formulated L2R as a classification problem and investigated the use of maximum entropy and support vector machines (SVMs) to classify each instance into two classes—relevant or irrelevant. Ordinal regression-based pointwise L2R algorithms have also been proposed such as PRanking and SVM-based large margin principles .
Pairwise approaches assume that the relative order between two instances is known or can be inferred from other ground truth formats. The goal is to minimize the number of misclassified instance pairs. In the extreme case, if all instance pairs are correctly classified, all instances will be correctly ranked. In RankSVM, Joachims creatively generated training pairs from clickthrough data and reformulated SVM to learn the score function from instance pairs. Proposed in 2005, RankNet was probably the first L2R algorithm used by commercial search engines; its skeleton is a classical neural network with a weight-sharing scheme. Tsai et al. replaced RankNet's loss function with a fidelity loss originating from quantum physics. In this paper, RankNet is adopted as the default pairwise L2R algorithm to learn OU-BIQA models, for reasons that will be described later. RankBoost is another well-known pairwise L2R algorithm, based on AdaBoost with an exponential loss.
Listwise approaches provide the opportunity to directly optimize ranking performance criteria. Representative algorithms include SoftRank and RankGP, among others. Another subset of listwise approaches chooses to optimize listwise ranking losses. For example, as a direct extension of RankNet, ListNet
duplicates RankNet’s structure to accommodate an instance list as input and optimizes a ranking loss based on the permutation probability distribution. In this paper, we also employ ListNet to learn OU-BIQA models as an extension of the proposed pairwise L2R approach.
III Proposed Pairwise L2R Approach for OU-BIQA
In this section, we elaborate on the proposed pairwise L2R approach to learning OU-BIQA models. First, we propose an automatic DIP generation engine, where each DIP is associated with an uncertainty measure to quantify the confidence level of its quality discriminability. Second, we detail RankNet and extend its capability to learn from the generated DIPs with uncertainty.
III-A DIP Generation
Our automatic DIP generation engine is described as follows. We first choose three best-trusted FR-IQA models, namely MS-SSIM, VIF, and GMSD. A logistic nonlinear function is adopted to map the predictions of the three models onto the MOS scale of the LIVE database. After that, the score ranges of the three models roughly coincide, with higher values indicating better perceptual quality. We associate each candidate image pair with a nonnegative score difference, equal to the smallest of the three FR model score differences. Intuitively, the perceptual uncertainty of quality discriminability should decrease monotonically as this score difference increases. By varying the score difference, we can generate DIPs with different uncertainty levels. To quantify the level of uncertainty, we employ a raised-cosine function of the score difference. The uncertainty value lies in [0, 1], with a higher value indicating a greater degree of uncertainty, and the function falls to zero at a constant threshold, above which the uncertainty remains zero. In the current implementation, we fix this threshold, whose legitimacy can be validated from two sources. First, the average standard deviation of MOSs on LIVE is approximately half of the threshold, therefore guaranteeing the perceived discriminability of two images. Second, based on the subjective experiments conducted by Gao et al. on LIVE, the consistency between subjects on the relative quality of a pair increases with the absolute score difference and, when the difference exceeds the threshold, the consistency saturates. Fig. 1 shows the shape of the uncertainty function and some representative DIPs, where the left images have better quality in terms of the three chosen FR-IQA models. All the DIPs shown are generated from the training image set that will be described later. It is clear that a score difference close to zero produces the highest level of uncertainty of quality discriminability. Careful inspection of Fig. 1(a) and Fig. 1(b) reveals that the uncertainty manifests itself in two ways. First, the right image in Fig. 1(a) has better perceived quality to many human observers than the left one, which disagrees with the three FR-IQA models. Second, both images in Fig. 1(b) have distortions that are barely perceptible to the human eye; in other words, they have very similar perceptual quality. The perceptual uncertainty generally decreases as the score difference increases, and once the difference exceeds the threshold, the DIP is clearly discriminable, further justifying the choice of threshold.
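As a concrete sketch, the raised-cosine weighting described above can be written as a small function. Here `threshold` is a stand-in for the paper's constant, which we treat as a free parameter; the shape of the function, not the particular value, is the point:

```python
import math

def uncertainty(score_diff, threshold):
    """Raised-cosine uncertainty of a DIP's quality discriminability.

    score_diff: smallest quality-score difference among the three
    FR-IQA models (nonnegative). Returns a value in [0, 1]; 1 means a
    maximally uncertain (near-indistinguishable) pair, 0 means a
    clearly discriminable one. Above `threshold` the uncertainty is zero.
    """
    if score_diff >= threshold:
        return 0.0
    # Decays smoothly from 1 at score_diff = 0 to 0 at score_diff = threshold.
    return 0.5 * (1.0 + math.cos(math.pi * score_diff / threshold))
```

Pairs with a large FR-model score gap thus receive zero uncertainty, while near-ties are flagged as unreliable training examples.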
III-B RankNet
Given a number of DIPs, a pairwise L2R algorithm would make use of their perceptual order to learn quality models while taking the inherent perceptual uncertainty into account. Here, we revisit RankNet , a pairwise L2R algorithm that was the first of its kind used by commercial search engines . We extend it to learn from DIPs associated with uncertainty. Fig. 2
shows RankNet's architecture, which is based on classical neural networks and has two parallel streams to accommodate a pair of inputs. The two-stream weights are shared, which is achieved by using the same initializations and the same gradients during backpropagation. The quality prediction function, namely the dipIQ index, is implemented by one of the streams, and the loss function is defined on a pair of images. Specifically, the difference between the outputs of the first and second streams is converted to a probability via a logistic (sigmoid) function, based on which we define a cross entropy loss. The ground truth label associated with a training pair, consisting of the i-th and j-th images, is always 0 or 1 for the DIPs described in Section III-A, indicating that the quality of the i-th image is worse or better than that of the j-th one. Within the mini-batch stochastic gradient minimization framework, we define the batch-level loss over the DIPs in the current batch, using the perceptual uncertainty of each DIP as a weighting factor. As Eq. (4) makes clear, DIPs with higher uncertainty contribute less to the overall loss. With some derivations, the gradient of the batch-level loss with respect to the model parameters follows in closed form.
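The pair probability, cross entropy, and uncertainty weighting just described can be sketched as follows. The `(1 - u)` weighting is one plausible reading of "higher uncertainty contributes less" and not necessarily the paper's exact form:

```python
import math

def pair_probability(s_i, s_j):
    """RankNet: probability that image i has better quality than
    image j, given the two stream outputs s_i and s_j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def pair_loss(s_i, s_j, label):
    """Cross-entropy loss for one DIP; label is 1 if image i is the
    better-quality member of the pair, 0 otherwise."""
    p = pair_probability(s_i, s_j)
    return -label * math.log(p) - (1 - label) * math.log(1 - p)

def batch_loss(pairs):
    """Uncertainty-weighted batch loss. Each element of `pairs` is
    (s_i, s_j, label, uncertainty); pairs with higher uncertainty
    contribute less via a hypothetical (1 - u) weight."""
    return sum((1.0 - u) * pair_loss(si, sj, y, ) if False else
               (1.0 - u) * pair_loss(si, sj, y) for si, sj, y, u in pairs)
```

A maximally uncertain pair (u = 1) drops out of the loss entirely, while a clearly discriminable pair (u = 0) contributes its full cross-entropy term.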
In the case of a linear dipIQ model containing no hidden layers and no nonlinear activations, Eq. (3) reduces to a form that is easily recognized as logistic regression. The convexity of Eq. (6) ensures the global optimality of the solution. We investigate both linear and nonlinear dipIQ models with the cross entropy loss. In fact, any measure of the distance between probability distributions can be adopted as an alternative. For example, Tsai et al. proposed a fidelity loss originating from quantum physics. We find in our experiments that the fidelity loss impairs performance, so we use the cross entropy loss throughout the paper.
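Writing the linear model explicitly (our notation: $\mathbf{w}$ for the linear weights and $\mathbf{x}_i$ for the feature vector of the $i$-th image), the pair probability and the loss for a pair labeled $\bar{y}_{ij}=1$ take the standard logistic-regression form:

```latex
P_{ij} = \frac{1}{1 + e^{-\mathbf{w}^{\top}(\mathbf{x}_i - \mathbf{x}_j)}},
\qquad
C_{ij} = -\log P_{ij} = \log\!\left(1 + e^{-\mathbf{w}^{\top}(\mathbf{x}_i - \mathbf{x}_j)}\right),
```

i.e., logistic regression on the feature difference $\mathbf{x}_i - \mathbf{x}_j$ with a fixed positive label, which is convex in $\mathbf{w}$.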
We select RankNet  as our first choice of pairwise L2R algorithm for two reasons. First, it is capable of handling a large number (millions) of training samples using stochastic or mini-batch gradient descent algorithms. By contrast, the training of other pairwise L2R methods such as RankSVM , even with a linear kernel, is painfully slow. Second, since RankNet  embodies classical neural network architectures, we embrace the latest advances in training deep neural networks [87, 88] and can easily upscale the network by adding more hidden layers to learn powerful nonlinear quality prediction functions.
In this section, we first provide thorough implementation details of RankNet  to learn OU-BIQA models. We then describe the experimental protocol based on which a fair comparison is conducted between dipIQ and state-of-the-art BIQA models. After that, we discuss how to extend the proposed pairwise L2R approach for OU-BIQA to a listwise one that could possibly boost the performance.
IV-A Implementation Details
IV-A1 Training Set Construction
We collect high-quality, high-resolution natural images to represent scenes we see in the real world. They can be roughly clustered into seven groups: human, animal, plant, landscape, cityscape, still-life, and transportation. Sample source images are shown in Fig. 3. We preprocess each source image by down-sampling it with a bicubic kernel so that its maximum height or width is bounded. Following the procedures described in the literature, we add four distortion types, namely JPEG and JPEG2000 (JP2K) compression, white Gaussian noise contamination (WN), and Gaussian blur (BLUR), each with five distortion levels. As a result, our training set consists of the source images together with their distorted versions. We randomly hold out a subset of source images and their corresponding distorted images as the validation set. From the remaining images, we adopt the proposed DIP generation engine to produce millions of DIPs, which constitute our training set.
IV-A2 Base Feature
We adopt CORNIA features to represent test images because they appear to be highly competitive in a recent gMAD competition on the Waterloo Exploration Database. In addition, a top-performing OU-BIQA model, BLISS, also chooses CORNIA features as input and trains on synthetic scores. As such, we offer a fair test bed to compare dipIQ, learned by a pairwise L2R approach (RankNet), against BLISS, learned by a regression method (SVR).
IV-A3 RankNet Instantiation
We investigate both linear and nonlinear dipIQ models. The input dimension of RankNet equals the feature dimension of CORNIA. The loss layer is implemented by the cross entropy function in Eq. (3). For the linear dipIQ, the input layer is directly connected to the output layer without hidden layers or nonlinear transforms; the use of the cross entropy loss then ensures the convexity of the optimization problem. For the nonlinear dipIQ, we add hidden layers with nonlinear activation functions. We choose the number of nodes in the third hidden layer to be three so that we can visualize the three-dimensional embedding of test images. Other choices are somewhat ad hoc, and a more careful exploration of alternative architectures could potentially lead to significant performance improvements.
The RankNet training procedure generally follows Simonyan and Zisserman. Specifically, the training is carried out by optimizing the cross entropy function using mini-batch gradient descent with momentum. The weights of the two streams in RankNet are shared. The batch size, momentum, weight decay penalty, and learning rate are all held fixed. Since we have plenty of DIPs (on the order of millions) for training, each DIP is exposed to the learning algorithm once and only once. The learning stops when the entire set of DIPs has been swept. The weights that achieve the lowest validation set loss are used for testing.
IV-B Experimental Protocol
Four IQA databases are used to compare dipIQ with state-of-the-art BIQA measures: LIVE, CSIQ, TID2013, and the Waterloo Exploration Database. The first three are small subject-rated IQA databases that are widely adopted to benchmark objective IQA models; each test image is associated with an MOS representing its perceptual quality. In our experiments, we only consider distortion types shared by all four databases, namely JP2K, JPEG, WN, and BLUR, and use the corresponding subsets of LIVE, CSIQ, and TID2013. The Exploration database contains 4,744 reference and 94,880 distorted images. Although the MOS of each test image is not available in the Exploration database, innovative evaluation criteria are employed to compare BIQA measures, as specified next.
IV-B2 Evaluation Criteria
We use five evaluation criteria to compare the performance of BIQA measures. The first two are included in previous tests carried out by the Video Quality Experts Group (VQEG). The others are introduced to accommodate image databases without MOSs. Details are given as follows.
Spearman's rank-order correlation coefficient (SRCC) is defined as
$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$
where $n$ is the number of images in a database and $d_i$ is the difference between the $i$-th image's ranks in the MOS and model prediction orderings.
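The formula can be verified with a short stand-alone implementation (assuming untied, 1-based ranks, so the classic Spearman expression applies):

```python
def srcc(mos_ranks, pred_ranks):
    """Spearman's rank-order correlation from two rank lists
    (no ties): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(mos_ranks)
    d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(mos_ranks, pred_ranks))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

Identical orderings give 1, fully reversed orderings give -1.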
Pearson linear correlation coefficient (PLCC) is computed by
$$\mathrm{PLCC} = \frac{\sum_{i}(s_i - \bar{s})(q_i - \bar{q})}{\sqrt{\sum_{i}(s_i - \bar{s})^2}\sqrt{\sum_{i}(q_i - \bar{q})^2}},$$
where $s_i$ and $q_i$ stand for the MOS and model prediction of the $i$-th image, respectively, and $\bar{s}$ and $\bar{q}$ are their means.
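A corresponding stand-alone sketch for PLCC:

```python
import math

def plcc(mos, pred):
    """Pearson linear correlation between MOS values and
    (nonlinearly mapped) model predictions."""
    n = len(mos)
    mean_m = sum(mos) / n
    mean_p = sum(pred) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(mos, pred))
    var_m = sum((m - mean_m) ** 2 for m in mos)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_m * var_p)
```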
Pristine/distorted image discriminability test (D-test) considers pristine and distorted images as two distinct classes and measures how well an IQA model is able to separate them. More specifically, the indices of pristine and distorted images are grouped into two sets, and a threshold $T$ on the predicted quality scores is adopted to classify the images, from which the average correct classification rate of the two classes is computed. The value of $T$ is optimized to yield the maximum correct classification rate, which results in a discriminability index $D$. $D$ lies in $[0, 1]$, with a larger value indicating better separability between pristine and distorted images.
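The threshold search can be sketched as follows; the exact classification rule in the original definition may differ in detail (here, higher predicted scores are treated as pristine and the two per-class rates are averaged):

```python
def d_index(pristine_scores, distorted_scores):
    """Discriminability index sketch: sweep a threshold T over all
    predicted scores and return the best average per-class
    correct-classification rate, treating scores >= T as 'pristine'."""
    candidates = sorted(pristine_scores + distorted_scores)
    best = 0.0
    for t in candidates:
        rate_p = sum(s >= t for s in pristine_scores) / len(pristine_scores)
        rate_d = sum(s < t for s in distorted_scores) / len(distorted_scores)
        best = max(best, 0.5 * (rate_p + rate_d))
    return best
```

A model whose score ranges for the two classes do not overlap achieves D = 1.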
Listwise ranking consistency test (L-test) evaluates the robustness of IQA models when rating images with the same content and the same distortion type but different distortion levels. The assumption is that the quality of an image degrades monotonically with the increase of the distortion level for any distortion type. Given a database with multiple source images, distortion types, and distortion levels, the average SRCC between the distortion levels and the corresponding distortion/quality scores given by a model, computed over all sets of images that share the same source image and the same distortion type, is used to quantify the ranking consistency.
Pairwise preference consistency test (P-test) compares the performance of IQA models on a number of DIPs, whose generation is similar to that described in Section III-A but with a stricter rule. A good IQA model should give concordant preferences with respect to the DIPs. Assuming that an image database contains $N$ DIPs and that an IQA model predicts the correct preference on $N_c$ of them (the concordant pairs), the pairwise preference consistency ratio is defined as
$$P = \frac{N_c}{N}.$$
$P$ lies in $[0, 1]$, with a higher value indicating better performance. We also denote the number of incorrect preference predictions as $N - N_c$.
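The ratio is straightforward to compute given any quality model and a set of DIPs ordered as (better, worse):

```python
def p_test(model, dips):
    """Pairwise preference consistency ratio: fraction of DIPs
    (better_image, worse_image) on which the model's predicted
    quality agrees with the known preference."""
    concordant = sum(model(better) > model(worse) for better, worse in dips)
    return concordant / len(dips)
```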
SRCC and PLCC are applied to LIVE, CSIQ, and TID2013, while the D-test, L-test, and P-test are applied to the Waterloo Exploration Database. Note that the use of PLCC requires a nonlinear function to map raw model predictions to the MOS scale. Following Mittal et al. and Ye et al., in our experiments we randomly choose a subset of reference images along with their corresponding distorted versions to estimate the mapping, and use the remaining images for testing. This procedure is repeated a number of times, and the median SRCC and PLCC values are reported.
IV-C Experimental Results
IV-C1 Comparison with FR and OU-BIQA Models
We compare dipIQ with two well-known FR-IQA models, PSNR (whose value is clipped at a finite maximum in dB in order to perform a reasonable parameter estimation) and SSIM (whose implementation used in the paper involves a down-sampling process), and with previous OU-BIQA models, including QAC, NIQE, ILNIQE, and BLISS. The implementations of QAC, NIQE, and ILNIQE are obtained from the original authors. To the best of our knowledge, the complete implementation of BLISS is not publicly available. Therefore, to make a fair comparison, we train BLISS on the same reference images and their distorted versions that have been used to train dipIQ. The labels are synthesized following the original method, and the training toolbox and parameter settings are inherited from the original paper.
Tables I, II, and III list comparison results between dipIQ and existing OU-BIQA models in terms of median SRCC and PLCC values on LIVE, CSIQ, and TID2013, respectively. Both the linear and nonlinear dipIQ models outperform all previous OU-BIQA models on LIVE and CSIQ, and are comparable to ILNIQE on TID2013. Although the linear dipIQ and BLISS both learn a linear prediction function using CORNIA features as inputs, we observe consistent performance gains of dipIQ over BLISS across all three databases. This may be because dipIQ learns from more reliable data (DIPs) with uncertainty weighting, whereas the training labels (synthetic scores) for BLISS are noisier, as exemplified in Fig. 4. It is not hard to observe that Fig. 4(a) has clearly worse perceptual quality than Fig. 4(b), which in turn has approximately the same quality as Fig. 4(c). Both cases are in disagreement with the synthetic scores.
To ascertain that the improvement of dipIQ is statistically significant, we carry out a two-sample t-test at a fixed confidence level between the PLCC values obtained by different models on LIVE. After comparing every possible pair of OU-BIQA models, the results are summarized in Table V, where the symbol "1" means the row model performs significantly better than the column model, the symbol "0" means the opposite, and the symbol "-" indicates that the row and column models are statistically indistinguishable. It can be observed that the nonlinear dipIQ is statistically better than the linear dipIQ, which in turn is better than all previous OU-BIQA models.
Table IV shows the results on the Waterloo Exploration Database. Both dipIQ models outperform all previous OU-BIQA models in the D-test and P-test, and are competitive in the L-test, where their performance is slightly inferior to NIQE and ILNIQE. By learning from examples with a variety of image content, dipIQ drives the number of incorrect preference predictions in the P-test down to a tiny fraction of the more than one billion candidate DIPs.
In order to gain intuition on why the generalizability of dipIQ is excellent even without MOSs for training, we visualize the three-dimensional embedding of the LIVE database in Fig. 5, using the learned features from the third hidden layer of the nonlinear dipIQ. We can see that the learned representation clusters test images according to distortion type while aligning them with respect to their perceptual quality in a meaningful way, where high-quality images are clustered together regardless of image content.
IV-C2 Comparison with OA-BIQA Models
In the second set of experiments, we train dipIQ using different feature representations as inputs and compare it with OA-BIQA models that use the same representations and MOSs for training. BRISQUE and DIIVINE are selected as representative features extracted from the spatial and wavelet domains, respectively. We also compare dipIQ with CORNIA, whose features are adopted as the default input to dipIQ. We re-train BRISQUE, DIIVINE, and CORNIA on the LIVE database, with learning tools and parameter settings following their respective papers. We adjust the dimension of the input layer of dipIQ to accommodate features of different dimensions and train the models on the reference images and their distorted versions, as described in Section IV-A. All models are tested on CSIQ, TID2013, and the Exploration database. From Tables VI, VII, and VIII, we observe that dipIQ consistently performs better than the corresponding OA-BIQA model on CSIQ and the Exploration database, and is comparable on TID2013. The reason we do not obtain noticeable performance gains on TID2013 may be that TID2013 contains reference images originating from LIVE, on which the OA-BIQA models have been trained; this creates dependencies between the training and testing sets. We may also draw conclusions about the effectiveness of the feature representations based on their performance under the same pairwise L2R framework: generally speaking, CORNIA features perform better than BRISQUE features, which perform better than DIIVINE features.
We further compare dipIQ and BRISQUE using the gMAD competition methodology on the Waterloo Exploration Database. Specifically, we first find a pair of images that have the maximum and minimum dipIQ values from a subset of images in the Exploration database that BRISQUE rates as having the same quality. We then repeat this procedure with the roles of dipIQ and BRISQUE exchanged. The two image pairs are shown in Fig. 6, from which we conclude that the images in the first row exhibit approximately the same perceptual quality (in agreement with dipIQ) and those in the second row have drastically different perceptual quality (in disagreement with BRISQUE). This verifies that the robustness of dipIQ is significantly improved over BRISQUE, which is trained on the same feature representation but with MOSs. Similar gMAD competition results are obtained across all quality levels, and for dipIQ versus DIIVINE and dipIQ versus CORNIA.
In summary, the proposed pairwise L2R approach is shown to learn OU-BIQA models with improved generalizability and robustness compared with OA-BIQA models that use the same feature representations and MOSs for training.
V Listwise L2R Approach for OU-BIQA
In this section, we extend the proposed pairwise L2R approach for OU-BIQA to a listwise one. Specifically, we first construct three-element DILs by concatenating DIPs. Given two DIPs, one formed by the i-th and j-th images and the other by the j-th and k-th images, with the same level of uncertainty, we create a list with a ground truth label indicating that the quality of the i-th image is better than that of the j-th image, whose quality is in turn better than that of the k-th image. The uncertainty level is transferred as well. We then employ ListNet, a listwise L2R extension of RankNet, to learn OU-BIQA models. The major differences between ListNet and RankNet are twofold. First, ListNet can have multiple streams with the same weights to accommodate a list of inputs, where each stream is implemented by a classical neural network architecture similar to RankNet, as shown in Fig. 2. In this paper, we instantiate a three-stream ListNet to fit three-element DILs. Second, the loss function of ListNet is defined using the concept of permutation probability. More specifically, a permutation on a list of instances is a bijection from the set of list positions to itself, mapping each position to the instance placed there. Over the set of all possible permutations, we define the probability of a permutation given the list of predicted scores as
which satisfies and as proved in . The loss function can then be defined as the cross entropy function between the ground truth and permutation probabilities
When , the loss function of ListNet  in Eq. (14) becomes equivalent to that of RankNet  in Eq. (3). In the case of three-element DILs, we have , if and otherwise. Therefore, the loss function in Eq. (14) can be simplified as
based on which we define the batch-level loss as

$$L = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \bar{p}_t\, \ell_t, \qquad (16)$$

where $\mathcal{B}$ denotes a mini-batch of DILs and $\bar{p}_t$ is the uncertainty level of the $t$-th list, transferred from the corresponding DIPs. The gradient of Eq. (16) w.r.t. the model parameters can be easily derived. Note that ListNet does not add new parameters.
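To make the definitions concrete, the permutation probability and the simplified three-element loss can be written out in plain Python (a minimal sketch with our own function names; the network producing the scores and the gradient computation are omitted):

```python
import itertools
import math

def permutation_prob(scores, perm):
    """Permutation probability from the ListNet definition: a product over
    positions of exp(score of the placed instance), normalized over the
    instances not yet placed."""
    p = 1.0
    for j in range(len(perm)):
        num = math.exp(scores[perm[j]])
        den = sum(math.exp(scores[perm[k]]) for k in range(j, len(perm)))
        p *= num / den
    return p

def dil_loss(s1, s2, s3):
    """Simplified three-element DIL loss: the ground-truth distribution is a
    delta at the true ordering, so the cross entropy reduces to the negative
    log probability of that single permutation."""
    return -math.log(permutation_prob([s1, s2, s3], (0, 1, 2)))
```

One can check numerically that the probabilities over all six permutations of a three-element list sum to one, and that the full cross-entropy loss with a delta ground truth coincides with `dil_loss`.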
We generate millions of DILs from the available DIPs as the training data for ListNet. The training procedure is exactly the same as that for RankNet. Training stops when the entire set of image lists has been swept once. The weights that achieve the lowest validation set loss are used for testing.
We list the comparison results between dilIQ trained by ListNet and the baseline dipIQ on LIVE, CSIQ, TID2013, and the Exploration database in Tables IX, X, XI, and XII, respectively. Remarkable performance improvements have been achieved on CSIQ and TID2013. This may be because the ranking position information is made explicit to the learning process. dilIQ is comparable to dipIQ on LIVE and the Exploration database.
VI. Conclusion and Future Work
In this paper, we have proposed an OU-BIQA model, namely dipIQ, using RankNet. The inputs to dipIQ training are an enormous number of DIPs, obtained not through expensive subjective testing but generated automatically at low cost with the help of the most trusted FR-IQA models. Extensive experimental results demonstrate the effectiveness of the proposed dipIQ indices, which achieve higher accuracy and improved robustness to content variations. We also learn an OU-BIQA model, namely dilIQ, using a listwise L2R approach, which achieves an additional performance gain.
The current work opens the door to a new class of OU-BIQA models and can be extended in many ways. First, novel image pair and list generation engines may be developed to account for situations where reference images are unavailable (or never existed). Second, advanced L2R algorithms are worth exploring to further improve quality prediction performance. Third, in practice, a pair of images may be regarded as having indiscriminable quality. Such knowledge could be obtained either from subjective testing (e.g., paired comparison between images) or from the image source (e.g., two pristine images acquired from the same source), and is informative in constraining the behavior of an objective quality model. The current learning framework needs to be extended in order to learn from such quality-indiscriminable image pairs. Fourth, given the powerful DIP generation engine developed in this work and the remarkable success of recent deep convolutional neural networks, it may become feasible to develop end-to-end BIQA models that bypass the feature extraction stage and achieve even stronger robustness and generalizability.
The authors would like to thank Zhengfang Duanmu for suggestions on the efficient implementation of RankNet, and the anonymous reviewers for constructive comments. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, and the Australian Research Council Projects FT-130101457, DP-140102164, and LP-150100671. K. Ma was partially supported by the CSC.
-  H. R. Wu and K. R. Rao, Digital Video Image Quality and Perceptual Coding. CRC press, 2005.
-  Z. Wang and A. C. Bovik, Modern Image Quality Assessment. Morgan & Claypool Publishers, 2006.
-  S. J. Daly, “Visible differences predictor: An algorithm for the assessment of image fidelity,” in SPIE/IS&T Symposium on Electronic Imaging: Science and Technology, 1992, pp. 2–15.
-  Z. Wang, G. Wu, H. R. Sheikh, E. P. Simoncelli, E.-H. Yang, and A. C. Bovik, “Quality-aware images,” IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1680–1689, Jun. 2006.
-  Z. Wang and A. C. Bovik, “Reduced- and no-reference image quality assessment: The natural scene statistic model approach,” IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, Nov. 2011.
-  A. K. Moorthy and A. C. Bovik, “A two-step framework for constructing blind image quality indices,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 513–516, May 2010.
-  M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, Aug. 2012.
-  A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
-  P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1098–1105.
-  A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
-  Q. Wu, Z. Wang, and H. Li, “A highly efficient method for blind image quality assessment,” in IEEE International Conference on Image Processing, 2015, pp. 339–343.
-  W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850–4862, Nov. 2014.
-  K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, Jan. 2015.
-  Q. Wu, H. Li, F. Meng, K. N. Ngan, B. Luo, C. Huang, and B. Zeng, “Blind image quality assessment based on multi-channel features fusion and label transfer,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 425–440, Mar. 2016.
-  N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57–77, Jan. 2015.
-  C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in International Conference on Machine Learning, 2005, pp. 89–96.
-  T.-Y. Liu, “Learning to rank for information retrieval,” Foundations and Trends in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
-  L. Hang, “A short introduction to learning to rank,” IEICE Transactions on Information and Systems, vol. 94, no. 10, pp. 1854–1862, Oct. 2011.
-  K. Ma, Q. Wu, Z. Wang, Z. Duanmu, H. Yong, H. Li, and L. Zhang, “Group MAD competition: A new methodology to compare objective image quality models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1664–1673.
-  K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
-  Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank: From pairwise approach to listwise approach,” in International Conference on Machine Learning, 2007, pp. 129–136.
-  H. R. Wu and M. Yuen, “A generalized block-edge impairment metric for video coding,” IEEE Signal Processing Letters, vol. 4, no. 11, pp. 317–320, Nov. 1997.
-  Z. Wang, A. C. Bovik, and B. L. Evan, “Blind measurement of blocking artifacts in images,” in IEEE International Conference on Image Processing, 2000, pp. 981–984.
-  S. Liu and A. C. Bovik, “Efficient DCT-domain blind measurement and reduction of blocking artifacts,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1139–1149, Dec. 2002.
-  H. Tong, M. Li, H. Zhang, and C. Zhang, “Blur detection for digital images using wavelet transform,” in IEEE International Conference on Multimedia and Expo, 2004, pp. 17–20.
-  Z. Wang and E. P. Simoncelli, “Local phase coherence and the perception of blur,” in Advances in Neural Information Processing Systems, 2003.
-  X. Zhu and P. Milanfar, “A no-reference sharpness metric sensitive to blur and noise,” in International Workshop on Quality of Multimedia Experience, 2009, pp. 64–69.
-  S. Oğuz, Y. Hu, and T. Q. Nguyen, “Image coding ringing artifact reduction using morphological post-filtering,” in IEEE Workshop on Multimedia Signal Processing, 1998, pp. 628–633.
-  H. R. Sheikh, A. C. Bovik, and L. Cormack, “No-reference quality assessment using natural scene statistics: JPEG2000,” IEEE Transactions on Image Processing, vol. 14, no. 11, pp. 1918–1927, Nov. 2005.
-  H. Liu, N. Klomp, and I. Heynderickx, “A no-reference metric for perceived ringing artifacts in images,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 4, pp. 529–539, Apr. 2010.
-  B. A. Wandell, Foundations of Vision. Sinauer Associates, 1995.
-  D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of physiology, vol. 160, no. 1, pp. 106–154, Jan. 1962.
-  D. J. Heeger, “Normalization of cell responses in cat striate cortex,” Visual Neuroscience, vol. 9, no. 02, pp. 181–197, Aug. 1992.
-  D. J. Field, “What is the goal of sensory coding?” Neural Computation, vol. 6, no. 4, pp. 559–601, Jul. 1994.
-  W. S. Geisler and R. L. Diehl, “Bayesian natural selection and the evolution of perceptual systems,” Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 357, no. 1420, pp. 419–448, Apr. 2002.
-  E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger, “Shiftable multiscale transforms,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 587–607, Mar. 1992.
-  S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674–693, Jul. 1989.
-  X. Li, “Blind image quality assessment,” in IEEE International Conference on Image Processing, 2002, pp. 449–452.
-  P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, “Perceptual blur and ringing metrics: Application to JPEG2000,” Signal Processing: Image Communication, vol. 19, no. 2, pp. 163–172, Feb. 2004.
-  C. Li, A. C. Bovik, and X. Wu, “Blind image quality assessment using a general regression neural network,” IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 793–799, May 2011.
-  Y. Fang, K. Ma, Z. Wang, W. Lin, and G. Zhai, “No-reference quality assessment of contrast-distorted images based on natural scene statistics,” IEEE Signal Processing Letters, vol. 22, no. 7, pp. 838–842, Jul. 2015.
-  X. Zhu and P. Milanfar, “Automatic parameter selection for denoising algorithms using a no-reference measure of image content,” IEEE Transactions on Image Processing, vol. 19, no. 12, pp. 3116–3132, Dec. 2010.
-  A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, Mar. 2013.
-  A. Mittal, G. S. Muralidhar, J. Ghosh, and A. C. Bovik, “Blind image quality assessment without human training using latent quality factors,” IEEE Signal Processing Letters, vol. 19, no. 2, pp. 75–78, Feb. 2012.
-  P. Ye, J. Kumar, L. Kang, and D. Doermann, “Real-time no-reference image quality assessment based on filter learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 987–994.
-  Z. Wang and E. P. Simoncelli, “Reduced-reference image quality assessment using a wavelet-domain natural image statistic model,” in Human Vision and Electronic Imaging, 2005, pp. 149–159.
-  W. Hou, X. Gao, D. Tao, and X. Li, “Blind image quality assessment via deep learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1275–1286, Jun. 2015.
-  Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 202–211, Apr. 2009.
-  A. Rehman and Z. Wang, “Reduced-reference image quality assessment by structural similarity estimation,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3378–3389, Aug. 2012.
-  H. Tang, N. Joshi, and A. Kapoor, “Learning a blind measure of perceptual image quality,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 305–312.
-  ——, “Blind image quality assessment using semi-supervised rectifier networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2877–2884.
-  P. Ye and D. Doermann, “No-reference image quality assessment using visual codebooks,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3129–3138, Jul. 2012.
-  R. Hassen, Z. Wang, and M. M. Salama, “Image sharpness assessment based on local phase coherence,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2798–2810, Jul. 2013.
-  L. Xu and J. Jia, “Two-phase kernel estimation for robust motion deblurring,” in European Conference on Computer Vision, 2010, pp. 157–170.
-  Z. Wang, H. R. Sheikh, and A. C. Bovik, “No-reference perceptual quality assessment of JPEG compressed images,” in IEEE International Conference on Image Processing, vol. 1, 2002, pp. 477–480.
-  T. S. Huang, J. W. Burnett, and A. G. Deczky, “The importance of phase in image processing filters,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 6, pp. 529–542, Dec. 1975.
-  A. V. Oppenheim and J. S. Lim, “The importance of phase in signals,” Proceedings of the IEEE, vol. 69, no. 5, pp. 529–541, May 1981.
-  P. Kovesi, “Image features from phase congruency,” Journal of Computer Vision Research, vol. 1, no. 3, pp. 1–26, Jun. 1999.
-  M. A. Saad, A. C. Bovik, and C. Charrier, “A DCT statistics-based blind image quality index,” IEEE Signal Processing Letters, vol. 17, no. 6, pp. 583–586, Jun. 2010.
-  W. Xue, L. Zhang, and X. Mou, “Learning without human scores for blind image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 995–1002.
-  L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
-  L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1733–1740.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995.
-  B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, “New support vector algorithms,” Neural Computation, vol. 12, no. 5, pp. 1207–1245, May 2000.
-  L. Zhang, L. Zhang, and A. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, Aug. 2015.
-  L. Xu, W. Lin, J. Li, X. Wang, Y. Yan, and Y. Fang, “Rank learning on training set selection and image quality assessment,” in IEEE International Conference on Multimedia and Expo, 2014, pp. 1–6.
-  F. Gao, D. Tao, X. Gao, and X. Li, “Learning to rank for blind image quality assessment,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2275–2290, Oct. 2015.
-  T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1, pp. 177–196, Jan. 2001.
-  X. Gao, F. Gao, D. Tao, and X. Li, “Universal blind image quality assessment metrics via natural scene statistics and multiple kernel learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 2013–2026, Dec. 2013.
-  N. Fuhr, “Optimum polynomial retrieval functions based on the probability ranking principle,” ACM Transactions on Information Systems, vol. 7, no. 3, pp. 183–204, Jul. 1989.
-  D. Cossock and T. Zhang, “Subset ranking using regression,” in Conference on Learning Theory, 2006, pp. 605–619.
-  R. Nallapati, “Discriminative models for information retrieval,” in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 64–71.
-  K. Crammer and Y. Singer, “Pranking with ranking,” in Advances in Neural Information Processing Systems, 2002, pp. 641–647.
-  A. Shashua and A. Levin, “Ranking with large margin principle: Two approaches,” in Advances in Neural Information Processing Systems, 2002, pp. 937–944.
-  T. Joachims, “Optimizing search engines using clickthrough data,” in Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 133–142.
-  M. F. Tsai, T. Y. Liu, T. Qin, H. H. Chen, and W. Y. Ma, “FRank: A ranking method with fidelity loss,” in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 383–390.
-  Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of Machine Learning Research, vol. 4, no. 6, pp. 170–178, Nov. 2003.
-  Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in European Conference on Computational Learning Theory, 1995, pp. 23–37.
-  M. Taylor, J. Guiver, S. Robertson, and T. Minka, “SoftRank: optimizing non-smooth rank metrics,” in ACM International Conference on Web Search and Data Mining, 2008, pp. 77–86.
-  Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vector method for optimizing average precision,” in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 271–278.
-  J.-Y. Yeh, J.-Y. Lin, H.-R. Ke, and W.-P. Yang, “Learning to rank for information retrieval using genetic programming,” in SIGIR Workshop on Learning to Rank for Information Retrieval, 2007, pp. 1–8.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in IEEE Asilomar Conference on Signals, Systems and Computers, 2003, pp. 1398–1402.
-  H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
-  W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684–695, Feb. 2014.
-  H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, Nov. 2006.
-  H. R. Sheikh, Z. Wang, A. C. Bovik, and L. K. Cormack, Image and Video Quality Assessment Research at LIVE [Online]. Available: http://live.ece.utexas.edu/research/quality/.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores: blind image quality assessment based on synthetic scores,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4241–4248.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning, 2010, pp. 807–814.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representation, 2015.
-  E. C. Larson and D. M. Chandler, “Most apparent distortion: Full-reference image quality assessment and the role of strategy,” SPIE Journal of Electronic Imaging, vol. 19, no. 1, pp. 1–21, Jan. 2010.
-  VQEG, Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment 2000 [Online]. Available: http://www.vqeg.org.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
-  ——, The SSIM Index for Image Quality Assessment [Online]. Available: https://ece.uwaterloo.ca/~z70wang/research/ssim/.