Images have come to dominate online media and social networks because of advances in capture, storage, streaming, and display technologies. Every day, billions of images are shared online on average. To share all these images within a limited bandwidth, we need to design compact representations that not only satisfy bandwidth requirements but also maintain perceived quality. The ever-increasing number of images makes it impossible to assess the perceived quality of all publicly available images subjectively. Therefore, there is a pressing need to automatically assess the quality of experience (QoE), whose definition depends on the application; for imaging applications, the core of QoE is image quality.
In the research community, image quality assessment is modeled by mapping images to subjective scores. Quality estimators are designed to process an image or images to provide a quality score. Fidelity-based approaches measure pixel-wise differences, and they can be extended with visual system-based characteristics to obtain a more perceptually correlated quality score [1, 2, 3]. Instead of quantifying pixel-wise differences, measuring structural similarity is also commonly used in the literature [4, 5, 6]. Even though the majority of the quality estimators use only grayscale images or intensity channels, color information is also used in the literature [7, 8, 9]. In addition to these hand-crafted quality estimators, methods based on modeling natural scene statistics and data-driven learning are also used in the literature [10, 11, 12, 13]. The majority of the quality estimators refer to visual system characteristics, but none of them is a comprehensive model of the perception process. Existing image quality estimators differ from each other in various ways. However, all these methods fundamentally map pixels to subjective scores. Moreover, even though some methods are less perceptually correlated than others, they can still contain additional information that cannot be provided by better performing methods. Therefore, multiple methods can be fused to boost the overall performance. Boosting was initially discussed in  and  to investigate whether it is possible to obtain strong learners from weak learners. In , the authors describe a method for converting a weak learning algorithm into a strong one that obtains arbitrarily high accuracy.
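The weak-to-strong idea can be illustrated with a generic boosting sketch. This is a regression analogue on toy data, not the fusion scheme of this paper; the target function and learner depths are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy regression task: a single shallow tree underfits the sine target.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (300, 1))
y = np.sin(3 * X[:, 0])

weak = DecisionTreeRegressor(max_depth=3).fit(X, y)  # a single weak learner
# AdaBoost combines many such shallow trees (its default base learner)
# into a stronger regressor.
strong = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)
```

Comparing the two fits (e.g. via `score`) shows the boosted ensemble explaining more of the target's variance than the single weak learner.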
Based on the boosting discussion, we can also convert image quality assessment algorithms with poor performance into highly perceptually correlated quality estimators. In , the authors analyze the performance of multiple methods and combine two methods linearly to obtain a hybrid quality estimator, whose linear weights are selected by exploring a parameter space. The authors in  propose a regression-based approach to fuse the quality estimates of multiple methods non-linearly. In addition to estimating a quality score directly, hand-crafted features are used to classify distortion types, and a separate regression stage learns the mapping function for each distortion type. In , the multi-method fusion is extended with a method selection algorithm to reduce the overall complexity. The authors in  follow a regression-based approach to obtain two types of image quality estimators, trained separately with features of existing quality estimators and with hand-crafted features that measure degradations overlooked by the existing features. The scores of individual quality estimators are fused with a support vector regression stage along with a statistical testing-based selection mechanism. A similar parallel boosting approach based on support vector regression is also used for stereoscopic image quality assessment in .
Multi-method fusion approaches are promising when considered as a framework, which can lead to more comprehensive quality estimators as the boosting methods and the fused image quality assessment algorithms improve. The framework should not be limited to specific quality estimators, distortion types, or learning methods. To investigate the generalizability of fusing quality estimators, we use image quality assessment algorithms that can be grouped into different categories as fidelity-based, perceptually-extended fidelity-based, structural similarity-based, color-based, and learning-based. In addition to performing boosting with support vector machines, we also use neural networks. Fused methods are trained and tested over three different databases that include compression, noise, communication, blur, color, global, and local distortion types. Finally, the performances of fused methods on different databases are measured using linearity-, accuracy-, and ranking-based metrics. First, we investigate the performance of regressed versions of the existing methods and compare them with the existing methods. Then, we analyze the performance of multi-method fusion that uses all the quality estimators. Finally, the performance of boosted methods is analyzed with respect to the number of fused methods. To the best of our knowledge, only a few studies fuse multiple methods, and they use a single type of architecture for regression. In this work, in addition to commonly used support vector machines, we propose using neural networks in the multi-method fusion. In Section II, we describe the experimental setup, which includes brief descriptions of the used image quality estimators, boosting methods, databases, data partitioning, number of experiments, and performance metrics. We discuss the experimental results in Section III and conclude our discussion in Section IV.
II Experimental Setup
II-A Image Quality Estimators
II-A1 Fidelity-based
Fidelity attributes quantify the changes in a degraded image with respect to a reference image, and they are commonly preferred in image and video coding standards for rate-distortion optimization because of their low computational complexity and ease of implementation. The intuitive way to measure the fidelity of an image is to directly compare it with its distortion-free version, if available. Mean square error (MSE) is a commonly used pixel-wise fidelity measure, calculated by taking the pixel-wise difference between the images, squaring it, and averaging over all pixels. The MSE is normalized by the squared peak value of the image range and mapped with a logarithmic function to obtain the peak signal-to-noise ratio (PSNR), which is one of the quality estimators used in the boosting operations.
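The MSE-to-PSNR computation described above can be sketched as follows (assuming 8-bit images, i.e. a peak value of 255):

```python
import numpy as np

def mse(reference, distorted):
    """Mean square error: average of squared pixel-wise differences."""
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak is the image dynamic range."""
    err = mse(reference, distorted)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / err)
```

For example, two 8-bit images whose pixels differ uniformly by 10 levels have an MSE of 100 and a PSNR of roughly 28 dB.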
II-A2 Perceptually Extended Fidelity-based
Image quality metrics use known characteristics of the visual system to make perceptual quality assessment more accurate. The authors in  extend PSNR by removing the mean shift, stretching contrast block-wise, and quantizing DCT coefficients with the compression table proposed by JPEG. These extensions make PSNR more compatible with the human visual system, and the extended metric is named PSNR-HVS. Accounting for contrast masking is also added to the metric, and the modified version is named PSNR-HVS-M . These metrics are further extended by adding sensitivity to contrast change and mean shift (PSNR-HA, PSNR-HMA) as explained in , both of which are used in the boosting operations.
II-A3 Structural Similarity-based
Structural similarity is commonly obtained by quantifying the similarity between mean-subtracted and divisively normalized images. The authors in  propose a full-reference metric (SSIM) based on the comparison of a reference and a distorted image in terms of luminance, contrast, and structure in the spatial domain. These structure-based methods are also extended to multi-scale (MS-SSIM) , complex domain (CW-SSIM) , and information-weighted (IW-SSIM)  versions. All of these structural similarity methods are used in the boosting operations. Moreover, we also use spectral similarity in the boosting .
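The luminance-contrast-structure comparison behind SSIM can be sketched in a simplified form. This version compares statistics over the whole image rather than over the sliding window used by the actual metric, so it is an approximation for illustration only:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Simplified single-window SSIM: luminance, contrast, and structure
    are compared over the whole image instead of local windows."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical images score 1.0, and a uniform intensity shift lowers the score through the luminance term even though the structure is unchanged.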
II-A4 Color-based
The human visual system (HVS) is more sensitive to changes in intensity than to changes in color . Although color may not be as informative as intensity, it can still contain additional information. An intuitive way to use color information in image quality assessment is pixel-wise fidelity. FSIMc  and PerSIM  introduce color information by computing pixel-wise fidelity over chroma channels in the L*a*b* color space. In addition to the color-based similarity, FSIMc computes similarity based on phase congruency and gradient magnitude, and PerSIM computes similarity based on band-pass features obtained from the contrast sensitivity formulation of the retinal ganglion cells. FSIMc and PerSIM are used in the boosting operations.
II-A5 Learning-based
It is not possible to handcraft a comprehensive quality estimator that covers all aspects of the visual system. Therefore, data-driven approaches can be used to design quality estimators. The majority of the data-driven approaches require distortion-specific images or subjective scores in the training, which can bias the performance of boosting methods. Therefore, we use the data-driven quality estimator UNIQUE, which is trained solely with generic images in an unsupervised fashion. Images are pre-processed with a mean subtraction stage, a whitening operation, and color space transformations to obtain more descriptive representations in terms of structure and color. These representations are fed to a linear decoder to obtain sparse representations. An objective score is obtained by comparing the sparse representations in terms of monotonic behavior.
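A loose sketch of this pipeline shape follows. The random projection matrix is only a placeholder for UNIQUE's learned linear decoder, the whitening and color transform stages are omitted, and the hard threshold merely stands in for sparsification; none of these are the authors' actual trained components:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Placeholder decoder weights; UNIQUE learns its decoder from generic images.
W = rng.standard_normal((128, 192))

def sparse_code(patch_vec, threshold=1.0):
    """Mean subtraction, projection through the (here random) decoder,
    and suppression of weak activations as a stand-in for sparsity."""
    z = patch_vec - patch_vec.mean()
    a = W @ z
    a[np.abs(a) < threshold] = 0.0
    return a

def unique_style_score(ref_vec, dist_vec):
    """Compare the two sparse representations via their monotonic
    (rank-order) agreement, as UNIQUE does with Spearman correlation."""
    rho, _ = spearmanr(sparse_code(ref_vec), sparse_code(dist_vec))
    return rho
```

An undistorted patch scores a perfect rank agreement of 1.0 against itself, and degradations reduce the agreement.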
II-B Boosting Methods
Rather than using specifically tuned deep networks or complicated architectures, we analyze the effect of boosting through two off-the-shelf methods: a generic neural network and a support vector machine. The only parameter that we adjust in the neural network architecture is the number of neurons in its single hidden layer, which is set to the total number of quality estimators used in the experiments. By default, we use mean square error as the cost function and Levenberg-Marquardt as the training function, which does not necessarily guarantee a global minimum. The default configuration of the support vector machine includes sequential minimal optimization (SMO) as the solver and a linear kernel.
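The two fusion regressors can be sketched with scikit-learn. The data here are synthetic stand-ins for metric outputs and subjective scores, and the optimizer differs from the text: scikit-learn offers LBFGS/Adam rather than Levenberg-Marquardt, so this is an approximation of the described setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Toy fusion data: each row holds the scores of n_estimators quality metrics
# for one image; targets play the role of subjective scores (MOS).
rng = np.random.default_rng(1)
n_images, n_estimators = 200, 11
X = rng.uniform(0, 1, (n_images, n_estimators))
y = X.mean(axis=1) + 0.05 * rng.standard_normal(n_images)  # synthetic MOS

# Single hidden layer with as many neurons as fused estimators, MSE cost.
nn = MLPRegressor(hidden_layer_sizes=(n_estimators,), solver="lbfgs",
                  max_iter=2000, random_state=0).fit(X, y)

# Linear-kernel support vector regression as the second boosting method.
svm = SVR(kernel="linear").fit(X, y)
```

Both fused regressors map the vector of individual metric scores to a single quality estimate, which is the core of the multi-method fusion experiments.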
II-C Databases
In the performance comparison of the quality estimators, we use the LIVE , the multiply distorted LIVE (MULTI) , and the TID 2013 (TID13)  databases. The distortion types in these databases can be grouped into seven categories as given in Table I. JPEG, JPEG2000, and lossy compression of noisy images are included in the compression category. The noise category consists of Gaussian noise, additive noise in color components that is more intensive than additive noise in the luminance component, spatially correlated noise, masked noise, high frequency noise, impulse noise, quantization noise, image denoising, multiplicative Gaussian noise, comfort noise, and lossy compression of noisy images. JPEG and JPEG2000 transmission errors are included in the communication category, and Gaussian blur and sparse sampling and reconstruction are in the blur category. The color category consists of change of color saturation, image color quantization with dither, and chromatic aberrations. Intensity shift and contrast change are included in the global category, and non-eccentricity pattern noise and local block-wise distortions are in the local category.
II-D Data Partition and Number of Experiments
In the experiments, the performance of the quality estimators is measured with k-fold cross validation, in which k is set to . At each iteration,  of the total images in each database are selected as the test set. In Section III-B, we test the performance of methods boosted with a neural network and a support vector machine. Each method is trained and tested  times. The test set in each iteration is also used to measure the performance of existing quality estimators. Since there are  different quality estimators,  boosting methods, and  runs, we report the average performance of existing quality estimators over  runs in Section III-A.
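The k-fold protocol can be sketched as follows; k and the database size are illustrative, since the exact values are not specified here:

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(100)  # e.g. image indices of one database (illustrative)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 as an example

for fold, (train_idx, test_idx) in enumerate(kfold.split(indices)):
    # Train the boosting model on train_idx, evaluate on test_idx.
    # Train and test folds never overlap within an iteration.
    assert len(set(train_idx) & set(test_idx)) == 0
```

Across the k iterations, every image appears in the test set exactly once, so each method is evaluated on the whole database without train/test leakage.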
II-E Performance Metrics
We use accuracy-, linearity-, ranking-, and statistical significance-based metrics in the performance analysis and comparison. Before the metric calculations, a mapping operation is performed between objective and subjective scores as suggested in . The mapping formulation used in the simulations can be expressed as

\hat{Q} = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (Q_r - \beta_3)\right)} \right) + \beta_4 Q_r + \beta_5,

where \hat{Q} is an estimated score, Q_r is a regressed output, and the \beta_i are tuning parameters that are set according to the relationship between the quality estimates and the subjective scores.
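Such a calibration can be fitted to (objective, subjective) score pairs with a nonlinear least-squares solver. Here we assume the widely used five-parameter logistic form and synthetic data, since the exact fitting setup is not specified:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_map(q, b1, b2, b3, b4, b5):
    """Monotonic logistic-plus-linear mapping from objective scores q
    to the subjective score scale."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

# Synthetic (objective, subjective) pairs standing in for metric scores and MOS.
q = np.linspace(0, 1, 50)
mos = 4.0 * q + 0.5 + 0.02 * np.random.default_rng(2).standard_normal(50)

# Fit the five tuning parameters before computing the performance metrics.
p0 = [np.ptp(mos), 10.0, float(np.mean(q)), 0.1, float(np.mean(mos))]
params, _ = curve_fit(logistic_map, q, mos, p0=p0, maxfev=20000)
mapped = logistic_map(q, *params)
```

After fitting, the mapped scores live on the subjective scale, so the accuracy, linearity, and ranking metrics below can be computed directly against the subjective scores.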
II-E1 Root Mean Square Error
Root mean square error measures the accuracy of the quality estimators as

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{Q}_i - S_i \right)^2},

where \hat{Q}_i is an estimated score and S_i is a subjective score corresponding to an image indexed with i, and N is the total number of images.
II-E2 Pearson Correlation Coefficient
The Pearson correlation coefficient is used to measure the linearity of the predictions, which is formulated as

PLCC = \frac{\sum_{i=1}^{N} (\hat{Q}_i - \bar{Q})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{N} (\hat{Q}_i - \bar{Q})^2} \sqrt{\sum_{i=1}^{N} (S_i - \bar{S})^2}},

where \hat{Q}_i is an estimated score and S_i is a subjective score corresponding to an image indexed with i, \bar{Q} and \bar{S} are the corresponding averages, and N is the total number of images.
II-E3 Spearman Correlation Coefficient
The Spearman correlation coefficient is used to measure the monotonic relationship between quality estimates and subjective scores. Instead of the exact values, the ranks of the values are used. The Spearman correlation coefficient is given as

SROCC = 1 - \frac{6 \sum_{i=1}^{N} (q_i - s_i)^2}{N (N^2 - 1)},

where q_i is a rank assigned to an estimated score \hat{Q}_i and s_i is a rank assigned to a subjective score S_i, which correspond to an image indexed with i, and N is the total number of images.
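The three metrics can be computed directly with scipy:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(estimated, subjective):
    """Accuracy (RMSE), linearity (PLCC), and ranking (SROCC) of the
    mapped quality estimates against the subjective scores."""
    estimated = np.asarray(estimated, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    rmse = np.sqrt(np.mean((estimated - subjective) ** 2))
    plcc, _ = pearsonr(estimated, subjective)
    srocc, _ = spearmanr(estimated, subjective)
    return rmse, plcc, srocc
```

A perfect estimator yields an RMSE of 0 and Pearson and Spearman coefficients of 1.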
II-E4 Statistical Significance
In order to assess the significance of the difference between correlation coefficients, we use the statistical significance tests suggested in ITU-T Rec. P.1401 .
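A common way to test such a difference, and the basis of the approach in P.1401, is the Fisher z-transformation of the correlation coefficients; the exact thresholds and corrections in the Recommendation may differ from this simplified sketch:

```python
import math

def correlations_differ(r1, r2, n1, n2, z_crit=1.96):
    """Two-sided test on the difference between two correlation
    coefficients using the Fisher z-transformation; z_crit = 1.96
    corresponds to a 95% confidence level."""
    z1 = math.atanh(r1)  # Fisher z-transform of each coefficient
    z2 = math.atanh(r2)
    # Standard error of the difference of two independent z values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return abs(z1 - z2) / se > z_crit
```

With 200 images per test, a jump from 0.5 to 0.95 is flagged as significant, whereas 0.89 versus 0.90 is not.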
III Experimental Results
III-A Part 1
We report the performance of existing quality estimators in Table II. In terms of root mean square error and Pearson correlation, the best performing methods are PSNR-HMA in the LIVE database and SR-SIM in the MULTI database. In terms of Spearman correlation, IW-SSIM is the best performing method in the LIVE and the MULTI databases. UNIQUE is the best performing quality estimator in terms of all the metrics in the TID13 database.
Neural network-based regression results are given in Table III. Neural networks trained with fidelity-, perceptually-extended fidelity-, and perceptual similarity-based methods enhance performance in some categories and slightly degrade it in others. In terms of root mean square error and Pearson correlation, neural networks lead to significant or minor enhancements for structural, spectral, unsupervised learning-based, and feature-based similarity methods. In terms of Spearman correlation, neural networks mostly lead to minor changes, apart from some major changes in the TID13 database. After the neural network-based regression, IW-SSIM becomes the best performing quality estimator in terms of root mean square error and Pearson correlation in the LIVE and the MULTI databases. In the TID13 database, SR-SIM becomes the best performing quality estimator in terms of root mean square error and Pearson correlation after the neural network-based regression. We also perform support vector machine-based regression and report the results in Table IV. Support vector regression does not lead to significant changes when it is trained with only one method, and the best performing methods are the same as among the existing methods. In Table V, we report the best performance values of existing and regressed methods. Moreover, we also report the performances of neural network- and support vector machine-based boosting. Existing methods regressed with neural networks perform better than existing methods in all the categories other than Spearman in the LIVE database, and the performances of existing methods regressed with support vector machines are similar to those of the existing methods. Support vector machine-based boosting performs better than existing and regressed existing methods in the MULTI and the TID13 databases, whereas in the LIVE database, it is better in some categories and worse in others. Neural network-based boosting leads to the best performances in all the categories.
III-B Part 2
In this section, we discuss the relative performance change as a consequence of adding new methods into the boosting algorithms. We start with the worst performing methods in each category and add the next best method into the boosting at each step. Based on the results in Table II, we rank the methods for each database in descending order in the root mean square error category, and in ascending order in the Pearson and Spearman correlation categories. The results are given in Fig. 1, in which the lengths of the main bars correspond to the mean values and the thin bars plotted over the main bars show the standard deviations. We plot a horizontal black line in the correlation figures, above which the increase in correlation coefficients becomes statistically significant with respect to the regressed worst performing quality estimator. Red bars correspond to the performance of support vector machine-based boosting and blue bars correspond to neural network-based boosting.
As the number of fused methods increases, there is a general decrease in root mean square error and an increase in Pearson and Spearman correlations. Neural network-based boosting outperforms support vector machine-based boosting in terms of root mean square error in all the boosting scenarios when two or more methods are fused. Both Pearson and Spearman correlations follow a non-decreasing behavior with respect to the number of fused methods, apart from a few exceptions. In terms of Pearson correlation, neural network-based boosting outperforms support vector machine-based boosting in all the boosting scenarios. In terms of Spearman correlation, the worst performing quality estimators regressed with support vector machines perform slightly better than their neural network-based counterparts in the LIVE and the MULTI databases. However, in all the other scenarios, neural network-based boosting outperforms support vector machine-based boosting.
IV Conclusion
We analyze the effect of boosting in image quality assessment using multi-method fusion. Experimental results show that boosting-based methods outperform the existing best performing methods in  out of  comparisons, and that neural network-based boosting outperforms support vector machine-based boosting when two or more methods are fused. Based on these observations, we can claim that boosting generally enhances the performance of image quality assessment algorithms and that the enhancement level depends on the type of boosting strategy. Moreover, boosting the worst performing quality estimator with two or more additional methods leads to statistically significant improvements in all the scenarios, independent of the boosting technique.
-  K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, and M. Carli, “New full-reference quality metrics based on HVS,” in the proceedings of VPQM, 2006.
-  N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On between-coefficient contrast masking of DCT basis functions,” in the proceedings of VPQM, 2007, pp. 1–4.
-  N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, and M. Carli, “Modified image visual quality metrics for contrast change and mean shift accounting,” in the proceedings of CADSM, 2011.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multi-scale structural similarity for image quality assessment,” the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 9–13, 2004.
-  Z. Wang and E. P. Simoncelli, “Translation insensitive image similarity in complex wavelet domain,” vol. II, pp. 573–576, 2005.
-  Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment.,” IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185–98, May 2011.
-  M. Carnec, P. Le Callet, and D. Barba, “Objective quality assessment of color images based on a generic perceptual reduced reference,” vol. 4, pp. 239–256, 2008.
-  L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
-  D. Temel and G. Alregib, “PerSIM: Multi-Resolution Image Quality Assessment in the Perceptually Uniform Color Domain,” in the proceedings of ICIP, 2015.
-  H. R. Sheikh, A. C. Bovik, and G. de Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics.,” IEEE Transactions on Image Processing , vol. 14, no. 12, pp. 2117–28, Dec. 2005.
-  P. Le Callet, C. Viard-Gaudin, and D. Barba, “A convolutional neural network approach for objective video quality assessment,” IEEE Transactions on Neural Networks, vol. 17, no. 5, pp. 1316–1327, Sept. 2006.
-  L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” the proceedings of CVPR, 2014.
-  D. Temel, M. Prabhushankar, and G. AlRegib, “UNIQUE: Unsupervised image quality estimation,” IEEE Signal Processing Letters, 2016.
-  M. Kearns, “Thoughts on hypothesis boosting,” Machine Learning Class Project, 1988.
-  M. Kearns and L. Valiant, “Cryptographic limitations on learning boolean formulae and finite automata,” J. ACM, vol. 41, no. 1, pp. 67–95, Jan. 1994.
-  R. E. Schapire, “The strength of weak learnability,” Mach. Learn., vol. 5, no. 2, pp. 197–227, July 1990.
-  A. Leontaris, P. C. Cosman, and A. R. Reibman, “Quality evaluation of motion-compensated edge artifacts in compressed video,” IEEE Transactions on Image Processing, vol. 16, no. 4, April 2007.
-  T. J. Liu, W. Lin, and C. C. J. Kuo, “A multi-metric fusion approach to visual quality assessment,” in Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, Sept 2011, pp. 72–77.
-  T. J. Liu, W. Lin, and C. C. J. Kuo, “Image quality assessment using multi-method fusion,” IEEE Transactions on Image Processing, vol. 22, no. 5, pp. 1793–1807, May 2013.
-  T. J. Liu, K. H. Liu, J. Y. Lin, W. Lin, and C. C. J. Kuo, “A ParaBoost method to image quality assessment,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–15, 2015.
-  H. Ko, R. Song, and C.-C. J. Kuo, “A ParaBoost stereoscopic image quality assessment (PBSIQA) system,” arXiv, 2016.
-  L. Zhang and H. Li, “SR-SIM: A fast and high performance IQA index based on spectral residual,” in 2012 19th IEEE International Conference on Image Processing, Sept 2012, pp. 1473–1476.
-  C. J. V. Lambrecht, Vision models and applications to image and video processing , Kluwer Academic Publishers, 2001.
-  H. R. Sheikh, L. Cormack, and A. C. Bovik, “LIVE Image Quality Assessment Database Release 2,” http://live.ece.utexas.edu/research/quality, 2006.
-  D. Jayaraman, A. Mittal, A. K. Moorthy, and A. C. Bovik, “Objective quality assessment of multiply distorted images,” in the proceedings of the Asilomar Conference on Signals, Systems and Computers, 2012.
-  N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, “Image database TID2013: Peculiarities, results and perspectives,” Signal Processing: Image Communication, vol. 30, pp. 57–77, 2015.
-  H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, 2006.
-  ITU, “Statistical analysis, evaluation and reporting guidelines of quality measurements,” ITU-T Rec. P.1401, 2012.