Blind image quality assessment (BIQA) aims to predict the perceptual quality of a visual image without reference to its original counterpart. The majority of BIQA models [1, 2, 3, 4] have been developed in well-controlled laboratory environments, with feature representations adapted to common synthetic distortions (e.g., Gaussian blur and JPEG compression). Only recently has BIQA of images captured in the wild with realistic distortions become an active research topic. Poor lighting conditions, sensor limitations, lens imperfections, and amateur manipulations are the main sources of distortion in this scenario, and they are generally more complex and difficult to simulate. As a result, models trained on databases of synthetic distortions (e.g., LIVE and TID2013) are not capable of handling databases of realistic distortions (e.g., LIVE Challenge and KonIQ-10k). Similarly, models optimized for realistic distortions do not work well on synthetic distortions.
Very limited effort has been devoted to developing unified BIQA models for both synthetic and realistic distortions. Mittal et al. based their NIQE method on a prior probability model of natural undistorted images, aiming for strong generalizability to unseen distortions. Unfortunately, NIQE is only able to handle a small set of synthetic distortions. Zhang et al. extended NIQE by extracting more powerful natural scene statistics for local quality prediction. Combining multiple IQA databases for training seems a simple and plausible solution. However, existing databases have different perceptual scales due to differences in subjective testing methodologies (see Table I). A separate subjective experiment on images sampled from each database is then required for scale realignment [6, 10]. To verify this, we linearly re-scale the mean opinion scores (MOSs) of each of the six databases to a common range, and show sample images with approximately the same re-scaled MOS in Fig. 1; as expected, they exhibit dramatically different perceptual quality. Using these noisy re-scaled MOSs for training results in sub-optimal performance (see Table II).
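For reference, the naive re-scaling baseline can be sketched as a min-max mapping (a minimal sketch; the exact target range and mapping used in the letter are not specified, so the `[0, 100]` default here is an assumption):

```python
import numpy as np

def rescale_mos(mos, lo=0.0, hi=100.0):
    """Linearly map a database's raw (D)MOS values onto a common [lo, hi] range.

    This is the naive realignment baseline: it matches the endpoints of each
    database's scale but ignores the perceptual differences between databases.
    """
    mos = np.asarray(mos, dtype=float)
    return lo + (hi - lo) * (mos - mos.min()) / (mos.max() - mos.min())

# Example: TID2013-style MOSs in [0, 9] mapped onto [0, 100]
vals = rescale_mos([0.0, 4.5, 9.0])
assert vals.tolist() == [0.0, 50.0, 100.0]
```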
|Database|Scenario|Annotation|Range|
|---|---|---|---|
|LIVE|synthetic|DMOS|[0, 100]|
|CSIQ|synthetic|DMOS|[0, 1]|
|TID2013|synthetic|MOS|[0, 9]|
|BID|realistic|MOS|[0, 5]|
|LIVE Challenge|realistic|MOS|[0, 100]|
|KonIQ-10k|realistic|MOS|[1, 5]|
In addition to absolute MOSs, recent methods also exploit relative ranking information to learn BIQA models from three major sources: distortion specifications [14, 3], full-reference IQA models [15, 16, 17], and human data. Liu et al. and Ma et al. extracted ranking information from images with the same content and distortion type but different distortion levels to pre-train deep neural networks (DNNs) for BIQA. Their methods can only be applied to synthetic distortions, whose degradation processes are exactly specified. Ye et al. and Ma et al. [16, 17] supervised the learning of BIQA models with ranking information from full-reference IQA models. Their methods cannot be extended to realistic distortions either, because the reference images are unavailable or may not even exist, so full-reference models cannot compute quality values. The closest work to ours is due to Gao et al., who inferred pairwise rank information from MOSs. However, their model is not end-to-end optimized and only delivers reasonable performance on a small set of synthetic distortions. Moreover, they did not explore the idea of combining multiple IQA databases via relative rankings.
In this letter, we aim to develop a unified BIQA model that handles both synthetic and realistic distortions with a single set of model parameters. To achieve this, we describe a simple technique for training BIQA models on multiple databases without performing additional subjective experiments to bridge their perceptual gaps. Specifically, we generate and combine image pairs within individual databases, whose ground-truth binary labels are inferred from the corresponding MOSs and indicate which of the two images in each pair has higher quality. We then learn a DNN for BIQA using a pairwise learning-to-rank algorithm. Through extensive experiments on six databases covering both synthetic and realistic distortions, we find that our BIQA model outperforms existing ones by a large margin, especially in the cross-distortion-scenario setting. We further verify its generalizability by group maximum differentiation (gMAD) competition.
In this section, we detail the construction of the training set from multiple IQA databases, followed by the pairwise learning procedure and the network specification.
II-A Training Set Construction
Given $N$ subject-rated IQA databases, we randomly sample $n_m$ image pairs from the $m$-th database. For each pair $(x, y)$, we infer its relative ranking from the corresponding absolute MOSs and compute a binary label $p \in \{0, 1\}$, where $p = 1$ indicates that $x$ is of higher perceptual quality and $p = 0$ indicates the opposite. By doing so, we are able to combine an arbitrary number of IQA databases without performing any realignment experiment, and generate a training set $\mathcal{D} = \{(x_i, y_i), p_i\}$.
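The pair construction above can be sketched in Python (the dictionary-based database format and the `build_pairs` helper are illustrative assumptions; ties are labeled 0 here for simplicity):

```python
import random

def build_pairs(databases, pairs_per_db=1000, seed=0):
    """Sample image pairs within each database and label them by MOS comparison.

    `databases` is a list of dicts mapping image id -> MOS (hypothetical format).
    Pairs are drawn within a single database only, so no cross-database
    perceptual-scale realignment is needed.
    """
    rng = random.Random(seed)
    training_set = []
    for mos in databases:
        ids = list(mos)
        for _ in range(pairs_per_db):
            x, y = rng.sample(ids, 2)
            # p = 1 means x has the higher MOS (better perceptual quality)
            p = 1.0 if mos[x] > mos[y] else 0.0
            training_set.append((x, y, p))
    return training_set
```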
|Model (training database)|LIVE|CSIQ|TID2013|BID|LIVE Challenge|KonIQ-10k|
|---|---|---|---|---|---|---|
|MEON (LIVE)|–|0.726|0.378|0.100|0.378|0.145|
|deepIQA (TID2013)|0.833|0.687|–|0.120|0.133|0.169|
|PQR (BID)|0.634|0.559|0.374|–|0.680|0.636|
|PQR (KonIQ-10k)|0.729|0.709|0.530|0.755|0.770|–|
|DB-CNN (TID2013)|0.883|0.817|–|0.409|0.414|0.518|
|DB-CNN (LIVE Challenge)|0.709|0.691|0.403|0.762|–|0.754|
|Ours (LIVE Challenge)|0.537|0.517|0.392|0.852|0.856|0.796|
|Ours (LIVE+LIVE Challenge)|0.955|0.785|0.593|0.838|0.840|0.826|
II-B Learning-to-Rank for BIQA
Given the training set $\mathcal{D}$, our goal is to learn a differentiable function $f_\theta(\cdot)$, parameterized by a vector $\theta$, which takes an image $x$ as input and computes an overall quality value $f_\theta(x)$. Specifically, we make use of the Bradley-Terry model and assume that the true perceptual quality $q(x)$ follows a Gumbel distribution, whose location is determined by $f_\theta(x)$. The quality difference $q(x) - q(y)$ is then a logistic random variable, and the probability that $x$ is perceived as better than $y$ can be computed from the logistic cumulative distribution function, which has a closed form
$$\hat{p}(x, y) = \frac{1}{1 + \exp\left(-\left(f_\theta(x) - f_\theta(y)\right)\right)}.$$
Combined with the true label $p$, we formulate the binary cross-entropy loss as
$$\ell(x, y, p; \theta) = -p \log \hat{p}(x, y) - (1 - p) \log\left(1 - \hat{p}(x, y)\right).$$
In practice, we sample a mini-batch $\mathcal{B}$ from $\mathcal{D}$ in each iteration and use a variant of the stochastic gradient descent method to adjust the parameter vector $\theta$ by minimizing the empirical loss
$$L(\mathcal{B}; \theta) = \frac{1}{|\mathcal{B}|} \sum_{(x, y, p) \in \mathcal{B}} \ell(x, y, p; \theta),$$
where $|\mathcal{B}|$ denotes the cardinality of $\mathcal{B}$. During backpropagation, the gradient of $\ell$ with respect to $\theta$ is derived as
$$\frac{\partial \ell}{\partial \theta} = \left(\hat{p}(x, y) - p\right)\left(\frac{\partial f_\theta(x)}{\partial \theta} - \frac{\partial f_\theta(y)}{\partial \theta}\right).$$
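The pairwise loss and its gradient with respect to the predicted quality values can be sketched as follows (a minimal scalar version; in practice the gradient is propagated through the network automatically by the deep learning framework):

```python
import math

def pairwise_bce(fx, fy, p):
    """RankNet-style pairwise binary cross-entropy under the Bradley-Terry model.

    fx, fy: predicted quality values f(x), f(y); p: binary label (1 if x is better).
    Returns (loss, gradient of loss w.r.t. fx); the gradient w.r.t. fy is the negation.
    """
    p_hat = 1.0 / (1.0 + math.exp(-(fx - fy)))  # logistic CDF of the quality difference
    eps = 1e-12                                  # guard against log(0)
    loss = -p * math.log(p_hat + eps) - (1 - p) * math.log(1 - p_hat + eps)
    grad_fx = p_hat - p                          # d loss / d f(x)
    return loss, grad_fx
```

Note that equal predictions give `p_hat = 0.5` and a loss of `log 2`, and the gradient vanishes only when the predicted preference probability matches the label.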
II-C Network Specification
Due to the successes of DNNs in various computer vision and image processing applications, we adopt a ResNet as the backbone to construct our quality prediction function $f_\theta$. The two-stream learning framework is shown in Fig. 2. To better summarize the spatial statistics and generate a fixed-length representation regardless of image size, we replace the first-order average pooling in the original ResNet with a second-order bilinear pooling, which has been proven effective in visual recognition and BIQA. Denoting the spatially flattened feature representation after the last group of layers by $X \in \mathbb{R}^{S \times C}$, where $S$ and $C$ represent the spatial and channel dimensions, respectively, we define the bilinear pooling as
$$B = X^\top X.$$
We flatten the bilinear feature and append a fully connected layer to compute a scalar that represents the perceptual quality of the input image. The weights of the two streams are shared during the entire optimization process.
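A minimal sketch of the bilinear pooling step (the 1/S averaging is an assumption; the key property is that the output dimension depends only on the channel count C, not on the spatial size S):

```python
import numpy as np

def bilinear_pool(x):
    """Second-order (bilinear) pooling of a spatially flattened conv feature map.

    x: array of shape (S, C) -- S spatial positions, C channels.
    Returns a length C*C vector of second-order statistics; the output size is
    independent of S, so images of any size yield a fixed-length representation.
    """
    b = x.T @ x / x.shape[0]  # (C, C); the 1/S normalization is an assumption
    return b.reshape(-1)

# Feature maps with different spatial sizes map to the same dimensionality.
f1 = bilinear_pool(np.random.randn(49, 8))   # e.g., a 7x7 spatial grid
f2 = bilinear_pool(np.random.randn(100, 8))  # e.g., a 10x10 grid
assert f1.shape == f2.shape == (64,)
```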
In this section, we first describe the training and testing procedures. We then compare the performance of the optimized model to a set of state-of-the-art BIQA models. Finally, we report the gMAD competition results of our method against top-performing models.
III-A Model Training
We generate the training set and conduct comparison experiments on six IQA databases: LIVE, CSIQ, TID2013, LIVE Challenge, BID, and KonIQ-10k. The first three are synthetic, while the last three are realistic. More information about these databases can be found in Table I. We randomly sample a subset of images from each database for training and leave the rest for evaluation. For LIVE, CSIQ, and TID2013, we split the training and test sets according to the reference images to guarantee content independence between the two sets. From the six databases, we are able to generate a large number of image pairs. During testing, we use the Spearman rank-order correlation coefficient (SRCC) to quantify performance on individual databases.
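SRCC can be computed as the Pearson correlation of the ranks; below is a minimal sketch without tie handling (in practice, `scipy.stats.spearmanr` is the usual choice):

```python
def srcc(a, b):
    """Spearman rank-order correlation: Pearson correlation of the rank vectors.

    A minimal sketch that ignores tie handling (tied values get an arbitrary
    but consistent order); use scipy.stats.spearmanr for production work.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Any strictly increasing mapping gives perfect rank agreement (close to 1.0).
assert abs(srcc([1, 2, 3, 4], [10, 20, 30, 40]) - 1.0) < 1e-9
```

Because SRCC depends only on ranks, it is invariant to any monotonic re-scaling of the predicted scores, which suits a model trained purely on pairwise comparisons.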
We train the network for eight epochs using a variant of stochastic gradient descent, decaying the initial learning rate by a fixed factor every two epochs. A warm-up training strategy is adopted: only the randomly initialized last fully connected layer is learned in the first two epochs; for the remaining epochs, we fine-tune the entire network. In all experiments, we test on images of their original size. To reduce the bias caused by randomness in the training and test set splitting, we repeat this process ten times and report the median SRCC results.
III-B Model Performance
III-B1 Main Results
We compare our method with three knowledge-driven BIQA models that do not require MOSs for training (NIQE, ILNIQE, and dipIQ) and four data-driven DNN-based models (MEON, deepIQA, PQR, and DB-CNN). The implementations are obtained from the respective authors. The results are listed in Table II, from which we make several interesting observations. First, our method significantly outperforms the three knowledge-driven models. Although NIQE and its feature-enriched version ILNIQE are designed for arbitrary distortion types, they do not perform well on realistic distortions or on the challenging synthetic distortions in TID2013. dipIQ is only able to handle distortion types that have been seen during training. As a result, its performance on all databases except LIVE is particularly weak, suggesting that it is difficult for distortion-aware BIQA methods to handle unseen distortions.
We then compare our model with four recent DNN-based methods. Since previous data-driven models can only be trained on one IQA database, we indicate in the table the database used to train each model. Despite being pre-trained on a large number of synthetically distorted images, MEON fine-tuned on LIVE does not generalize to other databases with different distortion types and scenarios. Although trained with more synthetic distortion types in TID2013, deepIQA performs slightly worse on CSIQ and LIVE Challenge than MEON due to label noise in patch-based training. By bilinearly pooling two feature representations that are sensitive to synthetic and realistic distortions, respectively, DB-CNN trained on TID2013 achieves reasonable performance on databases built in the wild. Based on a probabilistic formulation, PQR targets realistic distortions and trains two models on BID and KonIQ-10k separately. Benefiting from a larger number of training images with more diverse content and distortion variations, the model trained on KonIQ-10k generalizes much better to the remaining databases than the one trained on BID. Our method performs significantly better than all competing models on all six databases, and is close to two full-reference models (MS-SSIM and NLPD). We believe this improvement arises because the proposed learning technique allows us to train our method on multiple databases simultaneously, adapting the feature representation to multiple distortion scenarios. In addition, the weights pre-trained on object recognition help prevent our method from over-fitting to any single pattern within individual databases.
III-B2 gMAD Competition Results
We further qualitatively examine the generalizability of our method using gMAD competition on the large-scale Waterloo Exploration Database. We first let our method compete with the best knowledge-driven BIQA model, ILNIQE, in Fig. 3. Our method successfully falsifies ILNIQE. Specifically, ILNIQE treats the clear image (c) and the extremely JPEG-compressed image (d) as having the same visual quality, which is in strong disagreement with human perception. Meanwhile, our method survives the attack by ILNIQE, as images (a) and (b) appear to have similar perceptual quality. We then compare our method with the top-performing DNN-based model, DB-CNN, in Fig. 4. We find that our method roughly matches DB-CNN on the Waterloo Exploration Database. DB-CNN finds a counterexample, indicating that our method fails to handle the extremely blurred image (b). Similarly, our method spots the weakness of DB-CNN in dealing with Gaussian noise (image (d)). Nevertheless, if we incorporated realistically distorted images into the competition, our method might find stronger failure cases of DB-CNN.
III-B3 Ablation Study
We first investigate how our method behaves when we gradually add more databases into training. From Table II, we observe that when trained on a single synthetic/realistic database, our method does not generalize to realistic/synthetic databases, which confirms previous empirical studies. When trained on the LIVE and LIVE Challenge databases, we observe improved performance on the three synthetic databases and the large-scale KonIQ-10k database, which verifies the effectiveness of the proposed training technique. The training image pairs from the two databases effectively provide mutual regularization, guiding the network to a better local optimum. When trained on all six databases, our method delivers the best performance across databases. We also train a baseline model using linearly re-scaled MOSs of the six databases. Compared to the proposed method, the performance of the baseline model drops significantly due to the inaccuracy of linear re-scaling (see Fig. 1).
We have introduced a BIQA model and a method of training it on multiple IQA databases simultaneously. Our BIQA model is the first of its kind to handle both synthetic and realistic distortions with a single set of model parameters. The proposed learning technique is model-agnostic, meaning that it can be combined with other data-driven BIQA models, especially advanced ones (e.g., DB-CNN), for improved performance. In addition, it is straightforward to incorporate more image pairs into training as new IQA databases become available. We hope that the proposed learning technique will become a standard tool for existing and next-generation BIQA models to meet the cross-distortion-scenario challenge.
-  A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
-  P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1098–1105.
-  K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1202–1213, Mar. 2018.
-  S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018.
-  D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 372–387, Jan. 2016.
-  H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006.
-  N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Process. Image Commun., vol. 30, pp. 57–77, Jan. 2015.
-  H. Lin, V. Hosu, and D. Saupe, “KonIQ-10K: Towards an ecologically valid and large-scale IQA database,” CoRR, vol. abs/1803.08480, 2018.
-  W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Trans. Circuits and Syst. Video Technol., to appear.
-  E. C. Larson and D. M. Chandler, “Most apparent distortion: Full-reference image quality assessment and the role of strategy,” J. Electron. Imaging, vol. 19, no. 1, pp. 1–21, Jan. 2010.
-  A. Ciancio, A. L. N. T. da Costa, E. A. da Silva, A. Said, R. Samadani, and P. Obrador, “No-reference blur assessment of digital pictures based on multifeature classifiers,” IEEE Trans. Image Process., vol. 20, no. 1, pp. 64–75, Jan. 2011.
-  A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Mar. 2013.
-  L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2579–2591, Aug. 2015.
-  X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in IEEE Int. Conf. Comput. Vis., 2017, pp. 1040–1049.
-  P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores: Blind image quality assessment based on synthetic scores,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 4241–4248.
-  K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs,” IEEE Trans. Image Process., vol. 26, no. 8, pp. 3951–3964, Aug. 2017.
-  K. Ma, X. Liu, Y. Fang, and E. P. Simoncelli, “Blind image quality assessment by learning from multiple annotators,” in IEEE Int. Conf. Image Proc., to appear.
-  F. Gao, D. Tao, X. Gao, and X. Li, “Learning to rank for blind image quality assessment,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2275–2290, Oct. 2015.
-  C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Int. Conf. Mach. Learn., 2005, pp. 89–96.
-  K. Ma, Z. Duanmu, Z. Wang, Q. Wu, W. Liu, H. Yong, H. Li, and L. Zhang, “Group maximum differentiation competition: Model comparison with few samples,” IEEE Trans. Pattern. Anal. Mach. Intell., to appear.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in IEEE Asilomar Conf. on Signals, Syst. and Comput., 2003, pp. 1398–1402.
-  V. Laparra, A. Berardino, J. Ballé, and E. P. Simoncelli, “Perceptually optimized image rendering,” J. of Opt. Soc. of Am. A, vol. 34, no. 9, pp. 1511–1525, Sep. 2017.
-  H. Zeng, L. Zhang, and A. C. Bovik, “A probabilistic quality representation approach to deep blind image quality prediction,” CoRR, vol. abs/1708.08190, 2017.
-  R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, Dec. 1952.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” in IEEE Int. Conf. Comput. Vis., 2015, pp. 1449–1457.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
-  V. Laparra, J. Ballé, A. Berardino, and E. P. Simoncelli, “Perceptual image quality assessment using a normalized Laplacian pyramid,” Electron. Imaging, vol. 2016, no. 16, pp. 1–6, Feb. 2016.
-  K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 1004–1016, Feb. 2017.