Learning to Blindly Assess Image Quality in the Laboratory and Wild

07/01/2019 · Weixia Zhang et al. · Shanghai Jiao Tong University · City University of Hong Kong

Previous models for blind image quality assessment (BIQA) can only be trained (or fine-tuned) on one subject-rated database due to the difficulty of combining multiple databases with different perceptual scales. As a result, models trained in a well-controlled laboratory environment with synthetic distortions fail to generalize to realistic distortions, whose data distribution is different. Similarly, models optimized for images captured in the wild do not account for images simulated in the laboratory. Here we describe a simple technique of training BIQA models on multiple databases simultaneously without additional subjective testing for scale realignment. Specifically, we first create and combine image pairs within individual databases, whose ground-truth binary labels are computed from the corresponding mean opinion scores, indicating which of the two images is of higher quality. We then train a deep neural network for BIQA by learning-to-rank massive such image pairs. Extensive experiments on six databases demonstrate that our BIQA method based on the proposed learning technique works well for both synthetic and realistic distortions, outperforming existing BIQA models with a single set of model parameters. The generalizability of our method is further verified by group maximum differentiation (gMAD) competition.


I Introduction

Blind image quality assessment (BIQA) aims to predict the perceptual quality of a visual image without reference to its original counterpart. The majority of BIQA models [1, 2, 3, 4] have been developed in well-controlled laboratory environments, whose feature representations are adapted to common synthetic distortions (e.g., Gaussian blur and JPEG compression). Only recently has BIQA of images captured in the wild with realistic distortions become an active research topic [5]. Poor lighting conditions, sensor limitations, lens imperfections, and amateur manipulations are the main sources of distortions in this scenario, which are generally more complex and difficult to simulate. As a result, models trained on databases of synthetic distortions (e.g., LIVE [6] and TID2013 [7]) are not capable of handling databases of realistic distortions (e.g., LIVE Challenge [5] and KonIQ-10k [8]). Similarly, models optimized for realistic distortions do not work well for synthetic distortions [9].

Fig. 1: Images with approximately the same linearly re-scaled MOS exhibit dramatically different perceptual quality. If the ground truth is in the form of difference MOSs (DMOSs), we first negate the values and then apply linear re-scaling. The images are sampled from (a) LIVE [6], (b) CSIQ [10], (c) TID2013 [7], (d) BID [11], (e) LIVE Challenge [5], and (f) KonIQ-10k [8].

Very limited effort has been devoted to developing unified BIQA models for both synthetic and realistic distortions. Mittal et al. [12] based their NIQE method on a prior probability model of natural undistorted images, aiming for strong generalizability to unseen distortions. Unfortunately, NIQE is only able to handle a small set of synthetic distortions. Zhang et al. [13] extended NIQE [12] by extracting more powerful natural scene statistics for local quality prediction. Combining multiple IQA databases for training seems a simple and plausible solution. However, existing databases have different perceptual scales due to differences in subjective testing methodologies (see Table I), so a separate subjective experiment on images sampled from each database is required for scale realignment [6, 10]. To verify this, we linearly re-scale the mean opinion scores (MOSs) of each of the six databases to a common range and show in Fig. 1 sample images that have approximately the same re-scaled MOS; as expected, they exhibit dramatically different perceptual quality. Using these noisy re-scaled MOSs for training results in sub-optimal performance (see Table II).

Database | Scenario | Annotation | Range
LIVE [6] | synthetic | DMOS | [0, 100]
CSIQ [10] | synthetic | DMOS | [0, 1]
TID2013 [7] | synthetic | MOS | [0, 9]
BID [11] | realistic | MOS | [0, 5]
LIVE Challenge [5] | realistic | MOS | [0, 100]
KonIQ-10k [8] | realistic | MOS | [1, 5]
TABLE I: Comparison of subject-rated IQA databases. MOS stands for mean opinion score; DMOS (difference MOS) is inversely related to MOS, i.e., a higher DMOS indicates lower quality.

In addition to absolute MOSs, recent methods also exploit relative ranking information to learn BIQA models from three major sources: distortion specifications [14, 3], full-reference IQA models [15, 16, 17], and human data [18]. Liu et al. [14] and Ma et al. [3] extracted ranking information from images with the same content and distortion type but different distortion levels to pre-train deep neural networks (DNNs) for BIQA. Their methods can only be applied to synthetic distortions, whose degradation processes are exactly specified. Ye et al. [15] and Ma et al. [16, 17] supervised the learning of BIQA models with ranking information from full-reference IQA models. Their methods cannot be extended to realistic distortions either, because the reference images are not available or may not even exist for full-reference models to compute quality values. The closest work to ours is due to Gao et al. [18], who inferred pairwise rank information from MOSs. However, their model is not end-to-end optimized and only delivers reasonable performance on a small set of synthetic distortions. Moreover, they did not explore the idea of combining multiple IQA databases via relative rankings.

In this letter, we aim to develop a unified BIQA model for both synthetic and realistic distortions with a single set of model parameters. To achieve this, we describe a simple technique of training BIQA models on multiple databases without performing additional subjective experiments to bridge their perceptual gaps. Specifically, we generate and combine image pairs within individual databases, whose ground-truth binary labels are inferred from the corresponding MOSs, indicating which of the two in each pair has better quality. We then learn a DNN for BIQA using a pairwise learning-to-rank algorithm [19]. Through extensive experiments on six databases covering both synthetic and realistic distortions, we find that our BIQA model outperforms existing ones by a clear margin, especially in the cross-distortion-scenario setting. We further verify its generalizability by group maximum differentiation (gMAD) competition [20].

II Method

In this section, we detail the construction of the training set from multiple IQA databases, followed by the pairwise learning procedure and the network specification.

II-A Training Set Construction

Given $N$ subject-rated IQA databases, we randomly sample from the $i$-th database $M_i$ image pairs $\{(x^{(i,j)}, y^{(i,j)})\}_{j=1}^{M_i}$. For each pair $(x, y)$, we infer its relative ranking from the corresponding absolute MOSs, and compute a binary label $p \in \{0, 1\}$, where $p = 1$ indicates that $x$ is of higher perceptual quality than $y$ and $p = 0$ indicates the opposite. By doing so, we are able to combine an arbitrary number of IQA databases without performing any realignment experiment, and generate a training set $\mathcal{D} = \{(x^{(k)}, y^{(k)}, p^{(k)})\}_{k=1}^{M}$, where $M = \sum_{i=1}^{N} M_i$.
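To make this construction concrete, the following minimal Python sketch assumes each database is provided as a list of (image, MOS) records and that DMOS-annotated databases are negated first; the function name and the per-database pair count are illustrative placeholders, not details from the letter.

```python
import random

def build_pairs(databases, pairs_per_db=10000, dmos_databases=()):
    """Sample image pairs within each database and label them by MOS comparison.

    databases:      dict mapping database name -> list of (image_path, mos) records.
    dmos_databases: names of databases annotated with DMOS (lower is better), whose
                    scores are negated so that a larger value always means better quality.
    pairs_per_db:   an arbitrary placeholder, not the count used in the letter.
    """
    training_set = []
    for name, records in databases.items():
        sign = -1.0 if name in dmos_databases else 1.0
        for _ in range(pairs_per_db):
            (x, mos_x), (y, mos_y) = random.sample(records, 2)
            p = 1.0 if sign * mos_x > sign * mos_y else 0.0  # p = 1 means x beats y
            training_set.append((x, y, p))
    return training_set
```

Because pairs are formed only within a database, MOSs from different databases are never compared directly, which is what removes the need for scale realignment.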

Model | LIVE [6] | CSIQ [10] | TID2013 [7] | BID [11] | LIVE Challenge [5] | KonIQ-10k [8]
MS-SSIM [21] | 0.951 | 0.910 | 0.790 | – | – | –
NLPD [22] | 0.942 | 0.937 | 0.798 | – | – | –
NIQE [12] | 0.906 | 0.632 | 0.343 | 0.468 | 0.464 | 0.521
ILNIQE [13] | 0.907 | 0.832 | 0.658 | 0.516 | 0.469 | 0.507
dipIQ [16] | 0.940 | 0.511 | 0.453 | 0.009 | 0.187 | 0.228
MEON (LIVE) [3] | – | 0.726 | 0.378 | 0.100 | 0.378 | 0.145
deepIQA (TID2013) [4] | 0.833 | 0.687 | – | 0.120 | 0.133 | 0.169
PQR (BID) [23] | 0.634 | 0.559 | 0.374 | – | 0.680 | 0.636
PQR (KonIQ-10k) [23] | 0.729 | 0.709 | 0.530 | 0.755 | 0.770 | –
DB-CNN (TID2013) [9] | 0.883 | 0.817 | – | 0.409 | 0.414 | 0.518
DB-CNN (LIVE Challenge) [9] | 0.709 | 0.691 | 0.403 | 0.762 | – | 0.754
Linear re-scaling | 0.924 | 0.807 | 0.746 | 0.842 | 0.824 | 0.880
Ours (LIVE) | 0.948 | 0.732 | 0.576 | 0.598 | 0.533 | 0.735
Ours (LIVE Challenge) | 0.537 | 0.517 | 0.392 | 0.852 | 0.856 | 0.796
Ours (LIVE+LIVE Challenge) | 0.955 | 0.785 | 0.593 | 0.838 | 0.840 | 0.826
Ours | 0.957 | 0.867 | 0.806 | 0.851 | 0.853 | 0.892
TABLE II: Median SRCC results across ten sessions on six IQA databases covering both synthetic and realistic distortions. For models that rely on human annotations, the database used for training is given in parentheses. Entries marked "–" are not reported (full-reference models cannot be evaluated on the realistic databases, and single-database competing models are not tested on their own training database).

II-B Learning-to-Rank for BIQA

Given the training set $\mathcal{D}$, our goal is to learn a differentiable function $f_w(\cdot)$, parameterized by a vector $w$, which takes an image $x$ as input and computes an overall quality value $f_w(x)$ [19]. Specifically, we make use of the Bradley-Terry model [24] and assume that the true perceptual quality $q(x)$ follows a Gumbel distribution, whose location is determined by $f_w(x)$. The quality difference $q(x) - q(y)$ is then a logistic random variable, and the probability that $x$ is of higher perceptual quality than $y$ can then be computed from the logistic cumulative distribution function, which has a closed form

$$\hat{p}(x, y) = \frac{1}{1 + \exp\big(-(f_w(x) - f_w(y))\big)}, \qquad (1)$$

where $\hat{p}(x, y)$ denotes the estimated probability that $x$ is perceived to have higher quality than $y$. Combined with the true label $p$, we formulate the binary cross-entropy loss as

$$\ell(x, y, p; w) = -\big[p \log \hat{p}(x, y) + (1 - p) \log\big(1 - \hat{p}(x, y)\big)\big]. \qquad (2)$$

In practice, we sample a mini-batch $\mathcal{B}$ from $\mathcal{D}$ in each iteration and use a variant of the stochastic gradient descent method to adjust the parameter vector $w$ by minimizing the empirical loss

$$\mathcal{L}(\mathcal{B}; w) = \frac{1}{|\mathcal{B}|} \sum_{(x, y, p) \in \mathcal{B}} \ell(x, y, p; w), \qquad (3)$$

where $|\mathcal{B}|$ represents the cardinality of $\mathcal{B}$. During backpropagation, the gradient of $\ell$ with respect to $w$ is derived as

$$\frac{\partial \ell}{\partial w} = \big(\hat{p}(x, y) - p\big) \left(\frac{\partial f_w(x)}{\partial w} - \frac{\partial f_w(y)}{\partial w}\right). \qquad (4)$$
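For illustration, Eqs. (1)-(3) map directly onto a numerically stable PyTorch loss. The sketch below is ours, not code released with the letter; `model` stands for any network that produces one quality score per image.

```python
import torch.nn.functional as F

def pairwise_ranking_loss(model, x, y, p):
    """Pairwise binary cross-entropy of Eqs. (1)-(3).

    x, y: tensors of shape (B, 3, H, W) holding the two images of each pair.
    p:    tensor of shape (B,) with ground-truth labels in {0, 1}.
    """
    diff = model(x).squeeze(-1) - model(y).squeeze(-1)   # f_w(x) - f_w(y)
    # binary_cross_entropy_with_logits fuses the sigmoid of Eq. (1) with the loss of
    # Eq. (2) and averages over the mini-batch as in Eq. (3); its gradient with respect
    # to diff is sigmoid(diff) - p, consistent with Eq. (4).
    return F.binary_cross_entropy_with_logits(diff, p)
```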

II-C Network Specification

Due to the successes of DNNs in various computer vision and image processing applications, we adopt a ResNet [25] as the backbone to construct our quality prediction function $f_w(\cdot)$. The two-stream learning framework is shown in Fig. 2. Each stream consists of a stage of convolution, batch normalization, ReLU nonlinearity, and max-pooling, followed by four groups of layers based on the bottleneck architecture [25]. To better summarize the spatial statistics and generate a fixed-length representation regardless of image size, we replace the first-order average pooling in the original ResNet with a second-order bilinear pooling [9], which has been proven effective in visual recognition [26] and BIQA [9]. Denoting the spatially flattened feature representation after the last group of layers by $Z \in \mathbb{R}^{S \times C}$, where $S$ and $C$ represent the spatial and channel dimensions, respectively, we define the bilinear pooling as

$$\bar{Z} = Z^{T} Z. \qquad (5)$$

We flatten $\bar{Z}$ and append a fully connected layer to compute a scalar that represents the perceptual quality of the input image. The weights of the two streams are shared during the entire optimization process.
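A hedged PyTorch sketch of this quality head is given below, using the ResNet-34 backbone mentioned in Sec. III-A and the ImageNet-pre-trained initialization discussed in Sec. III-B. The class name and the signed square-root/L2 normalization after pooling are our assumptions rather than details taken from the letter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class BilinearQualityNet(nn.Module):
    """ResNet-34 backbone with second-order (bilinear) pooling and a linear head."""

    def __init__(self):
        super().__init__()
        backbone = resnet34(weights="IMAGENET1K_V1")  # ImageNet-pre-trained weights
        # Keep everything up to (but excluding) global average pooling and the classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        c = 512  # channel dimension of the last ResNet-34 feature map
        self.fc = nn.Linear(c * c, 1)

    def forward(self, x):
        z = self.features(x)                       # (B, C, H, W)
        b, c, h, w = z.shape
        z = z.view(b, c, h * w).transpose(1, 2)    # Z in R^{S x C}, with S = H * W
        pooled = torch.bmm(z.transpose(1, 2), z)   # Eq. (5): Z^T Z, shape (B, C, C)
        pooled = pooled.view(b, -1)                # flatten before the fully connected layer
        # Signed square-root and L2 normalization are common after bilinear pooling;
        # treating them as part of this model is our assumption, not the letter's.
        pooled = torch.sign(pooled) * torch.sqrt(torch.abs(pooled) + 1e-8)
        pooled = F.normalize(pooled)
        return self.fc(pooled)                     # one scalar quality score per image
```

Since the two streams share weights, pairwise training simply evaluates this single module on both images of a pair, as in the loss sketch above.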

Fig. 2: The two-stream framework for learning the quality prediction function $f_w(\cdot)$. The training image pairs are randomly sampled within individual IQA databases.

III Experiments

In this section, we first describe the training and testing procedures. We then compare the performance of the optimized method with a set of state-of-the-art BIQA models. Finally, we report the gMAD competition results of our method against top-performing models.

III-A Model Training

We generate the training set and conduct comparison experiments on six IQA databases: LIVE [6], CSIQ [10], TID2013 [7], LIVE Challenge [5], BID [11], and KonIQ-10k [8]. The first three contain synthetic distortions, while the last three contain realistic distortions. More information about these databases can be found in Table I. We randomly sample images from each database for training and leave the rest for evaluation. For LIVE, CSIQ, and TID2013, we split the training and test sets according to the reference images to guarantee content independence between the two sets. From the six databases, we are able to generate a large number of image pairs. During testing, we use the Spearman rank-order correlation coefficient (SRCC) to quantify performance on individual databases.
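As a small illustration of this protocol, the snippet below sketches a content-independent split for a synthetic database and the SRCC computation with SciPy; the 80/20 ratio and the record layout are placeholders, not values taken from the letter.

```python
import random
from scipy import stats

def split_by_reference(records, train_ratio=0.8, seed=0):
    """Content-independent split for a synthetic database.

    records: list of (image_path, reference_id, mos) tuples.
    The 80/20 ratio and the record layout are illustrative placeholders.
    """
    refs = sorted({ref for _, ref, _ in records})
    random.Random(seed).shuffle(refs)
    train_refs = set(refs[: int(train_ratio * len(refs))])
    train = [r for r in records if r[1] in train_refs]
    test = [r for r in records if r[1] not in train_refs]
    return train, test

def srcc(predictions, mos):
    """Spearman rank-order correlation between model predictions and MOSs."""
    return stats.spearmanr(predictions, mos).correlation
```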

We adopt ResNet-34 [25] as the backbone of $f_w(\cdot)$ and train it using the Adam optimizer [27] for eight epochs, decaying the initial learning rate by a fixed factor every two epochs. A warm-up training strategy is adopted: only the randomly initialized last fully connected layer is learned in the first two epochs; for the remaining epochs, we fine-tune the entire network. In all experiments, we test on images of their original size. To reduce the bias caused by the randomness in the training and test set splitting, we repeat this process ten times and report the median SRCC results.
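One possible outline of this warm-up schedule is sketched below; the learning rate, decay factor, and the assumption that the quality head is an attribute named `fc` are placeholders, since the exact settings are not reproduced above.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=8, warmup_epochs=2, lr=1e-4, decay=0.1):
    """Warm-up schedule: learn only the final fc layer first, then fine-tune all layers."""
    # lr and decay are placeholder values; the letter's exact settings are not shown above.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=decay)
    for epoch in range(epochs):
        warmup = epoch < warmup_epochs
        for name, param in model.named_parameters():
            # During warm-up, only the randomly initialized head (here named "fc") is updated.
            param.requires_grad = name.startswith("fc") or not warmup
        for x, y, p in loader:  # image pairs and binary labels from Sec. II-A
            diff = model(x).squeeze(-1) - model(y).squeeze(-1)
            loss = F.binary_cross_entropy_with_logits(diff, p)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```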

III-B Model Performance

III-B1 Main Results

We compare our method with three knowledge-driven BIQA models that do not require MOSs for training (NIQE [12], ILNIQE [13], and dipIQ [16]) and four data-driven DNN-based models (MEON [3], deepIQA [4], PQR [23], and DB-CNN [9]). The implementations are obtained from the respective authors. The results are listed in Table II, from which we make several interesting observations. First, our method significantly outperforms the three knowledge-driven models. Although NIQE [12] and its feature-enriched version ILNIQE [13] are designed for arbitrary distortion types, they do not perform well on realistic distortions or on the challenging synthetic distortions in TID2013 [7]. dipIQ [16] is only able to handle distortion types that have been seen during training. As a result, its performance on all databases except LIVE is particularly weak, which suggests that it may be difficult for distortion-aware BIQA methods to handle unseen distortions.

We then compare our model with four recent DNN-based methods. Since previous data-driven models can only be trained on one IQA database, the database used to train each model is indicated in Table II. Despite being pre-trained on a large number of synthetically distorted images, MEON fine-tuned on LIVE [6] does not generalize to other databases with different distortion types and scenarios. Although trained with more synthetic distortion types in TID2013 [7], deepIQA [4] performs slightly worse on CSIQ [10] and LIVE Challenge [5] than MEON due to label noise in patch-based training. By bilinearly pooling two feature representations that are sensitive to synthetic and realistic distortions, respectively, DB-CNN [9] trained on TID2013 achieves reasonable performance on databases built in the wild. Based on a probabilistic formulation, PQR [23] targets realistic distortions and trains two models on BID [11] and KonIQ-10k [8] separately. Benefiting from a larger number of training images with more diverse content and distortion variations, the model trained on KonIQ-10k generalizes much better to the remaining databases than the one trained on BID. Our method performs significantly better than all competing models on all six databases, and is close to two full-reference models (MS-SSIM [21] and NLPD [28]). We believe this improvement arises because the proposed learning technique allows us to train our method on multiple databases simultaneously, adapting the feature representation to multiple distortion scenarios. In addition, the weights pre-trained for object recognition help prevent our method from over-fitting to any single pattern within individual databases.

Fig. 3: gMAD competition between our method and ILNIQE [13]. (a)/(b): best/worst quality images according to ILNIQE, with near-identical quality reported by our method. (c)/(d): best/worst quality images according to our method, with near-identical quality reported by ILNIQE.
Fig. 4: gMAD competition between our method and DB-CNN [9]. (a)/(b): best/worst quality images according to DB-CNN, with near-identical quality reported by our method. (c)/(d): best/worst quality images according to our method, with near-identical quality reported by DB-CNN.

III-B2 gMAD Competition Results

We further examine the generalizability of our method qualitatively using gMAD competition [20] on the large-scale Waterloo Exploration Database [29]. We first let our method compete with the best-performing knowledge-driven BIQA model, ILNIQE [13], in Fig. 3. Our method successfully falsifies ILNIQE. Specifically, ILNIQE assigns the same visual quality to the clear image (c) and the extremely JPEG-compressed image (d), which is in strong disagreement with human perception. Meanwhile, our method survives the attack by ILNIQE: images (a) and (b) appear to have similar perceptual quality. We then compare our method with the top-performing DNN-based model DB-CNN [9] in Fig. 4. We find that our method roughly matches DB-CNN on the Waterloo Exploration Database. DB-CNN finds a counterexample, indicating that our method fails to handle the extremely blurred image (b). Similarly, our method spots a weakness of DB-CNN in dealing with Gaussian noise (image (d)). Nevertheless, if we were to incorporate realistically distorted images into the competition, our method might find stronger failure cases of DB-CNN.

III-B3 Ablation Study

We first investigate how our method behaves when we gradually add more databases into training. From Table II, we observe that when trained on a single synthetic/realistic database, our method does not generalize to realistic/synthetic databases, which confirms previous empirical studies. When trained on the LIVE and LIVE Challenge databases, we observe improved performance on the three synthetic databases and the large-scale KonIQ-10k database, which verifies the effectiveness of the proposed training technique. The training image pairs from the two databases effectively provide mutual regularization, guiding the network to a better local optimum. When trained on all six databases, our method delivers the best performance across databases. We also train a baseline model using linearly re-scaled MOSs of the six databases. Compared to the proposed method, the performance of the baseline model drops significantly due to the inaccuracy of linear re-scaling (see Fig. 1).

IV Conclusion

We have introduced a BIQA model and a method of training it on multiple IQA databases simultaneously. Our BIQA model is the first of its kind to handle both synthetic and realistic distortions with a single set of model parameters. The proposed learning technique is model-agnostic, meaning that it can be combined with other data-driven BIQA models, especially advanced ones (e.g., DB-CNN [9]), for improved performance. In addition, it is straightforward to incorporate more image pairs into training when new IQA databases become available. We hope that the proposed learning technique will become a standard tool for existing and next-generation BIQA models to meet the cross-distortion-scenario challenge.

References

  • [1] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
  • [2] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1098–1105.
  • [3] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1202–1213, Mar. 2018.
  • [4] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018.
  • [5] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 372–387, Jan. 2016.
  • [6] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006.
  • [7] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Process. Image Commun., vol. 30, pp. 57–77, Jan. 2015.
  • [8] H. Lin, V. Hosu, and D. Saupe, “KonIQ-10K: Towards an ecologically valid and large-scale IQA database,” CoRR, vol. abs/1803.08480, 2018.
  • [9] W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Trans. Circuits Syst. Video Technol., to appear.
  • [10] E. C. Larson and D. M. Chandler, “Most apparent distortion: Full-reference image quality assessment and the role of strategy,” J. Electron. Imaging, vol. 19, no. 1, pp. 1–21, Jan. 2010.
  • [11] A. Ciancio, A. L. N. T. da Costa, E. A. da Silva, A. Said, R. Samadani, and P. Obrador, “No-reference blur assessment of digital pictures based on multifeature classifiers,” IEEE Trans. Image Process., vol. 20, no. 1, pp. 64–75, Jan. 2011.
  • [12] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Mar. 2013.
  • [13] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2579–2591, Aug. 2015.
  • [14] X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in IEEE Int. Conf. Comput. Vis., 2017, pp. 1040–1049.
  • [15] P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores: Blind image quality assessment based on synthetic scores,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 4241–4248.
  • [16] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs,” IEEE Trans. Image Process., vol. 26, no. 8, pp. 3951–3964, Aug. 2017.
  • [17] K. Ma, X. Liu, Y. Fang, and E. P. Simoncelli, “Blind image quality assessment by learning from multiple annotators,” in IEEE Int. Conf. Image Proc., to appear.
  • [18] F. Gao, D. Tao, X. Gao, and X. Li, “Learning to rank for blind image quality assessment,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2275–2290, Oct. 2015.
  • [19] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Int. Conf. Mach. Learn., 2005, pp. 89–96.
  • [20] K. Ma, Z. Duanmu, Z. Wang, Q. Wu, W. Liu, H. Yong, H. Li, and L. Zhang, “Group maximum differentiation competition: Model comparison with few samples,” IEEE Trans. Pattern. Anal. Mach. Intell., to appear.
  • [21] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in IEEE Asilomar Conf. on Signals, Syst. and Comput., 2003, pp. 1398–1402.
  • [22] V. Laparra, A. Berardino, J. Ballé, and E. P. Simoncelli, “Perceptually optimized image rendering,” J. of Opt. Soc. of Am. A, vol. 34, no. 9, pp. 1511–1525, Sep. 2017.
  • [23] H. Zeng, L. Zhang, and A. C. Bovik, “A probabilistic quality representation approach to deep blind image quality prediction,” CoRR, vol. abs/1708.08190, 2017.
  • [24] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, Dec. 1952.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
  • [26] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” in IEEE Int. Conf. Comput. Vis., 2015, pp. 1449–1457.
  • [27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [28] V. Laparra, J. Ballé, A. Berardino, and E. P. Simoncelli, “Perceptual image quality assessment using a normalized Laplacian pyramid,” Electron. Imaging, vol. 2016, no. 16, pp. 1–6, Feb. 2016.
  • [29] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 1004–1016, Feb. 2017.