Exploiting High-Level Semantics for No-Reference Image Quality Assessment of Realistic Blur Images

10/18/2018
by   Dingquan Li, et al.
Peking University

To guarantee a satisfying Quality of Experience (QoE) for consumers, image quality must be measured efficiently and reliably. Neglecting high-level semantic information may result in predicting a clear blue sky as bad quality, which is inconsistent with human perception. Therefore, in this paper, we tackle this problem by exploiting high-level semantics and propose a novel no-reference image quality assessment method for realistic blur images. Firstly, the whole image is divided into multiple overlapping patches. Secondly, each patch is represented by the high-level feature extracted from a pre-trained deep convolutional neural network model. Thirdly, three different kinds of statistical structures are adopted to aggregate the information from different patches, built on common statistics (i.e., the mean & standard deviation, quantiles and moments). Finally, the aggregated features are fed into a linear regression model to predict the image quality. Experiments show that, compared with low-level features, high-level features indeed play a more critical role in resolving the aforementioned challenging problem of quality estimation. Besides, the proposed method significantly outperforms the state-of-the-art methods on two realistic blur image databases and achieves comparable performance on two synthetic blur image databases.


1. Introduction

In the era of big data, images have become the primary carrier of information in daily life. Before being ultimately received by a human observer, digital images may suffer from a variety of distortions. Quality of Experience (QoE), whose goal is to provide a satisfying end-user experience, has thus drawn increasing attention. A critical precondition for reaching this goal is image quality assessment (IQA). The most reliable way to assess image quality is subjective rating, but it is cumbersome, expensive and difficult to carry out in practice. Thus, objective IQA methods that can automatically predict image quality efficiently and effectively are needed. Objective IQA can be categorized into full-reference IQA (FR-IQA) (Wang et al., 2004; You et al., 2011), reduced-reference IQA (RR-IQA) (Zhai et al., 2012; Qi et al., 2015) and no-reference IQA (NR-IQA) (Gu et al., 2015b; Zhang and Kuo, 2014). Due to the unavailability of the reference image in most practical applications, NR-IQA is preferable but also more challenging.

In this paper, we focus on NR-IQA of realistic blur images. Blur is often induced by the following causes: (1) out-of-focus, (2) relative motion between the camera and the objects (object motion & camera shake), (3) non-ideal imaging systems (e.g., lens aberration), (4) atmospheric turbulence, and (5) image post-processing steps (such as compression and denoising) (Ferzli and Karam, 2009; Narvekar and Karam, 2011; Hassen et al., 2013). Except for the intentional blur of Bokeh, which strengthens a photo's expressiveness, unintentional blur definitely impairs image quality.

(a) MOS=4.0637
(b) MOS=2.3413
Figure 1. The two images are from BID (Ciancio et al., 2011), and larger MOS indicates better subjective image quality. The three traditional methods (MDWE (Marziliano et al., 2002), FISH (Vu and Chandler, 2012), LPC (Hassen et al., 2013)) predict that (a) is worse than (b). Our method predicts that (a) is better than (b), which is consistent with subjective ratings.

Traditional NR-IQA methods for blur images are mainly based on the assumptions that blur leads to the spread of edges (e.g., MDWE (Marziliano et al., 2002)), the reduction of high-frequency energy (e.g., FISH (Vu and Chandler, 2012)) or the loss of local phase coherence (e.g., LPC (Hassen et al., 2013)). However, these methods neglect the high-level semantic information, and can distinguish neither intrinsically flat regions from blurry regions, nor structures with blurring from structures without it. As a result, as shown in Figure 1, they predict the quality of a clear sky as worse than that of a blurry mouse, which is not consistent with human perception.

In this work, we tackle the problem by exploiting high-level semantic features extracted from pre-trained deep convolutional neural network (DCNN) models. First of all, since the pre-trained DCNN models (e.g., AlexNet (Krizhevsky et al., 2012)) require a fixed input size, we need to determine how to represent an image. We compare four different image representations, and find that the multi-patch representation is significantly better than the others. Secondly, we need to decide which pre-trained DCNN model, and which of its layers, to extract image features from. We first explore the effectiveness of features extracted from different layers of the same pre-trained DCNN model, and find the high-level features from the second or third top layer more effective for realistic blur image assessment. Then we investigate the impact of different pre-trained models, and find the one using residual learning (i.e., ResNet-50 (He et al., 2016)) more suitable for NR-IQA of realistic blur images. Thirdly, as a result of the multi-patch representation, we derive a set of features for an image. So another question arises: how to aggregate a set of extracted features?

One simple way is to use the mean feature vector to represent the feature set. However, this loses important information (e.g., the standard deviation in each dimension) of the feature set. So we propose three different statistical structures for feature aggregation, namely, mean&std aggregation, quantile aggregation and moment aggregation. As the dimension of the aggregated feature is still very high, we finally feed the aggregated feature into a linear regression model, known as partial least squares regression (PLSR) (Rosipal and Krämer, 2006), to predict the image quality.

Experiments are conducted on two realistic blur image databases (BID (Ciancio et al., 2011) and CLIVE (Ghadiyaram and Bovik, 2016)), as well as two synthetic blur image databases (TID2008 (Ponomarenko et al., 2009) and LIVE (Sheikh et al., 2006)). Our best proposal, named the Semantic Feature Aggregation metric using PLSR (SFA-PLSR), is compared with the state-of-the-art methods. Experiments show that our method significantly outperforms the state-of-the-art on BID and CLIVE, and achieves comparable performance on TID2008 and LIVE. The good generalization ability of SFA-PLSR is validated by cross dataset evaluation. We have also experimentally shown that high-level semantic features indeed play a more critical role than low-level features in resolving the challenging issue of NR-IQA for realistic blur images (see Figure 1). This suggests a new perspective on blur perception in terms of semantic loss.

The remainder of this paper is organized as follows. Section 2 reviews related work on NR-IQA of blur images. Section 3 introduces the benchmark databases and performance criteria. Section 4 describes our method in detail. Section 5 discusses the experimental results and analysis. Conclusions are drawn in Section 6.

2. Related Work

2.1. Learning-free Methods

Learning-free methods exploit the characteristics of blur: the spread of edges, smoothing effects, the reduction of high-frequency components, or the loss of phase coherence.

The spread of edges can be used as a cue for blur estimation. Marziliano et al. (Marziliano et al., 2002) used the average edge spread over all detected vertical Sobel edge locations as a quality metric for blur images. This can be further improved by incorporating the concept of just noticeable blur (JNB) (Ferzli and Karam, 2009) to adapt to the perception of the human visual system (HVS). Since blur is not likely to be perceived when the edge width is small enough (below the width corresponding to the JNB), Narvekar and Karam (Narvekar and Karam, 2011) defined the quality score as the percentage of edges whose blur cannot be detected.

The smoothing effect of the blur process is also useful information for NR-IQA. Gu et al. (Gu et al., 2015a) estimated image quality based on the energy differences and contrast differences of the locally estimated coefficients in the autoregressive parameter space. Bahrami and Kot (Bahrami and Kot, 2014) considered the content-based weighting distribution of the maximum local variation, which was modeled by the generalized Gaussian distribution (GGD); the estimated standard deviation was then used as an indicator of image quality. Later, they also parameterized the image total variation distribution, and predicted image quality using the standard deviation modified by the shape parameter to account for image content variation (Bahrami and Kot, 2016).

Image blur results in the reduction of high-frequency components. Vu and Chandler (Vu and Chandler, 2012) estimated image quality using a weighted average of the log-energies of the high-frequency coefficients. In (Vu et al., 2012), they generated a quality map based on the geometric mean of spectral and spatial measures. In view of the reduction of high-frequency components, the spectral measure was initially defined as the slope of the local magnitude spectrum, then rectified by a sigmoid function to account for the HVS. To further consider the contrast effect, the spatial measure was calculated by the local total variation. Sang et al. (Sang et al., 2014) estimated image quality using the exponent of the truncated singular value curve of an image. Li et al. (Li et al., 2016a) considered the moment energy, which can be affected by noticeable blur.

Blur also causes the loss of phase coherence, which gives a different perspective for understanding blur perception (Wang and Simoncelli, 2003). So Hassen et al. (Hassen et al., 2013) estimated image quality based on the strength of the local phase coherence near edges and lines.

2.2. Learning-based Methods

Traditional learning-free methods cannot accurately capture the diversity of the blur process or the complexity of the HVS, so machine learning techniques have recently appeared in the IQA field. Learning-based methods mainly consist of two steps: feature extraction and quality prediction. In terms of feature extraction, these methods fall into two classes: those using hand-crafted features and those using learnt features.

Features can be manually designed using the natural scene statistics (NSS) of the image. NSS models of image coefficients in the spatial domain, wavelet domain and DCT domain are utilized in (Mittal et al., 2012; Moorthy and Bovik, 2011; Saad et al., 2012), respectively, to extract quality relevant features. Tang et al. (Tang et al., 2011) derived a set of low-level image quality features from NSS models, texture characteristics and blur/noise estimation. Ciancio et al. (Ciancio et al., 2011) used a neural network to combine eight existing methods and low-level features for blur image quality assessment. Oh et al. (Oh et al., 2014) evaluated the quality of camera-shaken images by mapping spectral direction and shape features with support vector regression (SVR). Li et al. (Li et al., 2017) took gradient similarity, singular value similarity and DCT domain entropies as quality features in a multi-scale framework. Li et al. (Li et al., 2016b) jointly considered structural and luminance information in predicting image quality, where the structural information was described by the local binary pattern distribution and the luminance information was portrayed by the distribution of normalized luminance magnitudes.

Machine learning techniques can also learn quality relevant features. Li et al. (Li et al., 2016c) and Lu et al. (Lu et al., 2016) extracted learnt features based on dictionary learning. Visual codebooks are used to learn quality features in (Ye et al., 2012; Xu et al., 2016). Convolutional neural networks (CNN) have also been used to learn quality relevant features in NR-IQA (Kang et al., 2014; Bosse et al., 2016; Yu et al., 2017; Kim and Lee, 2017; Siahaan et al., 2016; Sun et al., 2016). Kang et al. (Kang et al., 2014) integrated feature learning and patch quality prediction into an end-to-end network, and the image quality was estimated by the average score of all sampled patches. Following (Kang et al., 2014), a deeper network was used and weights for patch scores were also integrated into the learning process (Bosse et al., 2016). In (Yu et al., 2017), a CNN was used to learn features and a general regression neural network was used as the predictor. In (Kim and Lee, 2017), a sub-network was first trained on patches using FR-IQA scores, and then a whole network from images to quality was trained.

The works most related to ours are (Siahaan et al., 2016; Sun et al., 2016), which resize the image to meet the required input size of the pre-trained AlexNet so as to extract image features. Our work differs from them mainly in three ways. (1) Unlike (Siahaan et al., 2016; Sun et al., 2016), we use multiple overlapping image patches instead of the resized image to represent the image, which avoids introducing deformation while covering the whole image content; correspondingly, we propose three effective statistical structures for feature aggregation. (2) The features extracted from the pre-trained DCNN model in (Siahaan et al., 2016; Sun et al., 2016) are only used as an auxiliary to boost the performance of methods based on low-level features, whereas our aggregated semantic features are directly used as quality relevant features. (3) We focus on realistic blur, and since residual images contain important cues about image blur, the residual learning based network (ResNet-50 (He et al., 2016)) is selected as the feature extractor instead of the networks without residual learning used in (Siahaan et al., 2016; Sun et al., 2016).

3. Benchmark Databases and Performance Criteria

3.1. Benchmark Databases

In this work, we consider two realistic image databases (BID (Ciancio et al., 2011) and CLIVE (Ghadiyaram and Bovik, 2016)), as well as two synthetic blur image datasets from TID2008 (Ponomarenko et al., 2009) and LIVE (Sheikh et al., 2006).

BID includes 586 realistic blur images taken in the real world under a variety of scenes, lighting conditions, camera apertures and exposure times. Subjective quality scores are provided in the form of mean opinion scores (MOS) ranging from 0 to 5.

CLIVE includes 1162 realistic distorted images captured using real-world mobile cameras, most of which suffer from motion blur or out-of-focus blur. Subjective quality scores are provided in the form of MOS ranging from 0 to 100.

TID2008 contains 1700 distorted images, of which we only consider the 100 Gaussian blur images. There are 25 reference images and 4 blur kernels for each reference image. Subjective quality scores are provided in the form of MOS ranging from 0 to 9.

LIVE contains 779 distorted images, of which we only consider the 145 Gaussian blur images. There are 29 reference images and 5 blur kernels for each reference image. Subjective quality scores are provided in the form of Difference of MOS (DMOS) ranging from 0 to 100.

3.2. Performance Evaluation Criteria

Three evaluation criteria are chosen to evaluate the performance of NR-IQA methods: Spearman's rank-order correlation coefficient (SROCC), Pearson's linear correlation coefficient (PLCC) and root mean square error (RMSE). PLCC and RMSE measure prediction accuracy, while SROCC measures prediction monotonicity. Larger PLCC/SROCC and smaller RMSE indicate better performance. Before calculating the PLCC and RMSE values of the learning-free methods, a nonlinear fitting is needed to map the objective scores to the scale of the subjective scores. In this paper, we adopt the following four-parameter logistic function recommended in (VQEG, 2000):

$$q(s) = \beta_2 + \frac{\beta_1 - \beta_2}{1 + \exp\left(-\frac{s - \beta_3}{\left|\beta_4\right|}\right)}, \qquad (1)$$

where $s$ denotes the objective score, $q(s)$ the mapped score, and $\beta_1$ to $\beta_4$ are free parameters to be determined during the curve fitting process.
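As a concrete illustration of this evaluation protocol, the following Python sketch (our own illustration, not the authors' code) fits the four-parameter logistic with SciPy and then computes the three criteria:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(s, beta1, beta2, beta3, beta4):
    # Four-parameter logistic mapping of Eq. (1), recommended by VQEG (2000).
    return beta2 + (beta1 - beta2) / (1.0 + np.exp(-(s - beta3) / np.abs(beta4)))

def evaluate(objective, subjective):
    # objective, subjective: 1-D numpy arrays of scores for the test images.
    srocc = spearmanr(objective, subjective)[0]  # rank-based, no mapping needed
    # Fit the nonlinear mapping before computing PLCC and RMSE.
    p0 = [subjective.max(), subjective.min(), objective.mean(), objective.std()]
    params, _ = curve_fit(logistic4, objective, subjective, p0=p0, maxfev=10000)
    mapped = logistic4(objective, *params)
    plcc = pearsonr(mapped, subjective)[0]
    rmse = float(np.sqrt(np.mean((mapped - subjective) ** 2)))
    return srocc, plcc, rmse
```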

Monte-Carlo cross validation is used for the learning-based methods. For each database, 80% of the data are used for training and 20% for testing, and no "original image" is shared between the training and testing data. This procedure is repeated 1000 times and the median or mean values are reported. It should be noted that the learning-free methods are tested on the same data as the learning-based methods. Besides, we should specifically point out that the training data of BID are used in Section 4 for the comparative study.
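The content-aware split can be sketched with scikit-learn's GroupShuffleSplit (an illustration under our own naming; the paper does not specify its tooling), which keeps all images sharing the same original content on one side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def monte_carlo_splits(groups, n_repeats=1000, test_size=0.2, seed=0):
    # groups[i] identifies the original image behind sample i, so no
    # original image is shared between training and testing data.
    splitter = GroupShuffleSplit(n_splits=n_repeats, test_size=test_size,
                                 random_state=seed)
    placeholder = np.zeros(len(groups))
    for train_idx, test_idx in splitter.split(placeholder, groups=groups):
        yield train_idx, test_idx
```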

4. The Proposed Method

The framework of the proposed method is shown in Figure 2, including four steps: image representation, feature extraction, feature aggregation, and quality prediction. In this section, we will conduct an in-depth comparative study to determine the best choice for each step.

Figure 2. The overall framework of the proposed method, which mainly includes four steps: image representation, feature extraction, feature aggregation, and quality prediction.

4.1. Image Representation

The pre-trained DCNN models (e.g., AlexNet) require a fixed input size. To meet this requirement, images can be cropped or resized to the fixed size. Since the resizing operation can introduce geometric deformation, which may change the image quality, it is not a good choice. Meanwhile, cropping only the central patch is not enough to cover the information of a large image. Because of these two issues, we consider using multiple overlapping patches to represent the image, which not only covers the information of the whole image but also avoids introducing geometric deformation.

We compare the impact of four different image representations: cropping, scaling, padding and the multi-patch representation. The cropping representation uses the central patch to represent the image. The padding representation preserves the aspect ratio by resizing the larger dimension to the required length and then zero-padding the smaller dimension. The scaling representation directly resizes the image without keeping the original aspect ratio. The multi-patch representation generates multiple overlapping patches that are uniformly sampled over the whole image with a fixed sampling stride. (There is no significant performance variation among different sampling strides as long as the whole image is covered, so the stride is simply fixed to half of the patch size.)
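A minimal sketch of the multi-patch sampling described above (our own illustration; the patch size equals the fixed input size of the chosen DCNN, e.g., 224×224):

```python
import numpy as np

def sample_patches(image, patch_size=224):
    """Uniformly sample overlapping patches covering the whole image."""
    h, w = image.shape[:2]
    assert min(h, w) >= patch_size, "image smaller than the network input"
    stride = patch_size // 2  # half the patch size (see the note above)
    ys = list(range(0, h - patch_size + 1, stride))
    xs = list(range(0, w - patch_size + 1, stride))
    # Ensure the last row/column of patches reaches the image border.
    if ys[-1] != h - patch_size:
        ys.append(h - patch_size)
    if xs[-1] != w - patch_size:
        xs.append(w - patch_size)
    return np.stack([image[y:y + patch_size, x:x + patch_size]
                     for y in ys for x in xs])
```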

Figure 3. Comparison of different image representations. No matter which layer is used to extract features, the multi-patch representation achieves the best performance.

To perform the comparison, we need a baseline for feature extraction, feature aggregation and quality prediction. Before the comparative study on the following steps, we choose the classical pre-trained DCNN model AlexNet and extract features from the frequently-used fully connected layers (i.e., fc6, fc7 and fc8). For feature aggregation, we choose the mean feature vector for simplicity. PLSR is used for quality prediction. The comparative study is conducted on the training data of BID, where 20% of the training data are used as validation data and the performance on the validation set is used for comparison (the same below). It can be seen from Figure 3 that (1) the cropping representation obtains the worst performance, (2) since the resizing operation keeps most of the image information, the padding and scaling representations achieve better performance than cropping, and (3) the multi-patch representation significantly outperforms the other three. So we adopt the multi-patch representation in our framework.

4.2. Feature Extraction

Given an image $I$, we represent it with a set of multiple overlapping patches $\{P_1, P_2, \ldots, P_N\}$, and then feed these patches into an off-the-shelf DCNN model to extract features. For each patch $P_n$, the extracted feature is denoted by

$$f_n = \phi_l\left(P_n; \theta\right), \quad n = 1, \ldots, N, \qquad (2)$$

where $l$ indicates which layer (e.g., the fc6 layer in AlexNet) is used to extract features and $\theta$ denotes the trained network parameters.
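The paper extracts features with Caffe; as a hedged substitution for illustration only, the sketch below uses PyTorch/torchvision to realize Eq. (2) with ResNet-50 (the model chosen later in this section), taking the globally average-pooled output as the feature of each patch:

```python
import torch
import torchvision.models as models

# Truncate a pre-trained ResNet-50 just after its global average pooling;
# this plays the role of phi_l(.; theta) in Eq. (2).
resnet = models.resnet50(pretrained=True).eval()
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc

@torch.no_grad()
def extract_features(patches):
    # patches: float tensor of shape (N, 3, 224, 224), already normalized
    # with the ImageNet mean/std expected by the pre-trained model.
    feats = extractor(patches)  # (N, 2048, 1, 1) after global average pooling
    return feats.flatten(1)     # (N, 2048): one feature vector per patch
```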

The role of high-level semantics: Pre-trained DCNN models for image classification or scene recognition have encoded semantics in their high-level features. Here, we conduct a comparative study to investigate the role of high-level semantics in NR-IQA of realistic blur images. We take AlexNet as the pre-trained model and extract features of multiple patches from its different layers (conv1 to conv5 and fc6 to fc8; since the response of a convolutional layer is a set of feature maps, we derive features from the convolutional layers by global average pooling). PLSR maps the mean feature vector to the quality score. From the plot in Figure 4, we have the following observations. First, high-level features are better than low-level features, which indicates that high-level semantic features play an important role in NR-IQA of realistic blur images. However, the feature extracted from the top layer (fc8) is slightly worse than those from the second and third top layers (fc7 and fc6). This is because the top layer is directly linked to the classifier, so the extracted feature is task-specific and may contain mostly classification information. The third top layer (fc6), close to the last convolutional layer, achieves the best performance in terms of SROCC. Therefore, in our framework, we extract features from the second or third top layer, close to the last convolutional layer.

Figure 4. [Best viewed in color.] Mean and standard deviation of SROCC and PLCC. x-axis indicates from which layer (in AlexNet) we extract features. The curve indicates the mean values and the error bars indicate the standard deviations.

Impact of different pre-trained DCNN models: We also compare different pre-trained DCNN models in the proposed framework, including AlexNet (Krizhevsky et al., 2012), GoogleNet (Szegedy et al., 2015) and ResNet-50 (He et al., 2016), where the features are extracted from a high-level layer close to the last convolutional layer of each model (fc6 for AlexNet and the final pooling layer for GoogleNet and ResNet-50). The quality prediction step is still based on PLSR. Figure 5 shows the performance values, from which we can observe that ResNet-50 achieves the best performance. It has been shown that the residual image contains important information for capturing quality relevant features (Yan et al., 2016). Besides, image blur can be more easily captured in residual images. So the significant gain of ResNet-50 may be due to residual learning, and we choose ResNet-50 as the feature extractor.

Figure 5. Comparison of different pre-trained DCNN models.

4.3. Feature Aggregation

With the extracted features $\{f_n\}_{n=1}^{N}$, we need to aggregate them into a single one. One straightforward way is to concatenate all these features into a long feature vector, i.e.,

$$\bar{f} = f_1 \oplus f_2 \oplus \cdots \oplus f_N, \qquad (3)$$

where $\oplus$ is the concatenation operator.

However, this results in a feature space of very high dimension. Besides, the dimension of the concatenated feature vector depends on the number of patches, which differs among images with different resolutions. To avoid this, we can take the mean value of all features in each dimension, that is,

$$\mu_i = \frac{1}{N}\sum_{n=1}^{N} f_n^{(i)}, \quad i = 1, \ldots, D, \qquad (4)$$

where $f_n^{(i)}$ is the $i$-th element of $f_n$ and $D$ is the dimension of $f_n$.

The mean aggregation structure loses important information (e.g., the standard deviation in each dimension) of the feature set. So we propose three different statistical structures for feature aggregation, namely, mean&std aggregation, quantile aggregation and moment aggregation.

Mean&std aggregation: The standard deviation in each dimension is further considered, and the first aggregated feature $\tilde{f}_1$ is obtained by:

$$\tilde{f}_1 = \left[\mu_1, \ldots, \mu_D, \sigma_1, \ldots, \sigma_D\right]^{T}, \quad \sigma_i = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(f_n^{(i)} - \mu_i\right)^2}, \qquad (5)$$

where $f_n^{(i)}$ is the $i$-th element of $f_n$.

Quantile aggregation: Quantiles are important order statistics. We consider the widely used quartiles: the min, the median and the max are the zeroth, second, and fourth quartiles, respectively. We denote the zeroth to fourth quartiles of $\{f_n^{(i)}\}_{n=1}^{N}$ as $Q_0^{(i)}$ to $Q_4^{(i)}$, respectively. So the second aggregated feature $\tilde{f}_2$ based on quartiles can be defined as:

$$\tilde{f}_2 = \left[Q_0^{(1)}, \ldots, Q_0^{(D)}, Q_1^{(1)}, \ldots, Q_4^{(D)}\right]^{T}. \qquad (6)$$

Moment aggregation: Moments also play an important role in describing the statistics of a distribution. The mean is actually the first-order origin moment. In order to balance the need for more information against the dimension of the feature space, we further consider the $k$-th root of the central moment of order $k$ for $k = 2, 3, 4$ (note that the first central moment is zero, and here the second central moment is the variance computed using a divisor of $N$ rather than $N-1$), and obtain the third aggregated feature $\tilde{f}_3$:

$$\tilde{f}_3 = \left[\mu_1, \ldots, \mu_D, \sqrt{m_2^{(1)}}, \ldots, \sqrt{m_2^{(D)}}, \sqrt[3]{m_3^{(1)}}, \ldots, \sqrt[4]{m_4^{(D)}}\right]^{T}, \quad m_k^{(i)} = \frac{1}{N}\sum_{n=1}^{N}\left(f_n^{(i)} - \mu_i\right)^k. \qquad (7)$$

The aforementioned three statistical structures for feature aggregation result in $2D$-, $5D$- and $4D$-dimensional feature vectors, respectively. An example of these aggregation structures is shown in Figure 6.

Figure 6. [Best viewed in color.] An example of the three statistical structures for feature aggregation. The input is $N$ features $\{f_n\}_{n=1}^{N}$, where the feature dimension is $D$. $Q_0$ to $Q_4$ indicate the five quartiles, and $\sqrt[k]{m_k}$ represents the $k$-th root of the central moment of order $k$ ($k = 2, 3, 4$). Not all connections between the input and the statistical functions are shown, for clarity.
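The three aggregation structures of Eqs. (5)-(7) can be sketched compactly in NumPy (our own illustration) given the N×D matrix of patch features:

```python
import numpy as np

def aggregate(feats):
    # feats: (N, D) array, one D-dimensional feature per patch.
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)                 # divisor N, as in Eq. (5)
    mean_std = np.concatenate([mu, sigma])    # 2D dims, Eq. (5)

    quartiles = np.percentile(feats, [0, 25, 50, 75, 100], axis=0)
    quantile = quartiles.reshape(-1)          # 5D dims, Eq. (6)

    centered = feats - mu
    roots = [np.sign(m) * np.abs(m) ** (1.0 / k)  # signed k-th root of m_k
             for k, m in ((2, (centered ** 2).mean(axis=0)),
                          (3, (centered ** 3).mean(axis=0)),
                          (4, (centered ** 4).mean(axis=0)))]
    moment = np.concatenate([mu] + roots)     # 4D dims, Eq. (7)
    return mean_std, quantile, moment
```

The signed root handles the odd-order central moment, which can be negative.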

4.3.1. Contribution of Different Statistical Aggregation Structures

We compare the mean aggregation (baseline) with the three proposed statistical aggregation structures. ResNet-50 is used as the feature extractor of the multiple patches and PLSR is used as the regression model. Table 1 summarizes the median values of SROCC, PLCC and RMSE. The best result comes from the ensemble of the three statistical structures and is marked in boldface. We can see that the three proposed statistical structures bring a significant gain over the baseline, which verifies the effectiveness of the proposed aggregation structures in capturing the information of the feature set.

Aggregated Feature (dim.)   SROCC   PLCC    RMSE
mean (D)                    0.7577  0.7673  0.8283
mean&std (2D)               0.8022  0.8174  0.7333
quantile (5D)               0.8109  0.8254  0.7135
moment (4D)                 0.8100  0.8254  0.7171
—                           0.8123  0.8269  0.7116
—                           0.8127  0.8270  0.7121
average-quality             0.8154  0.8305  0.7055
Table 1. Comparison among different aggregation structures. "average-quality" means averaging the scores predicted from the three proposed structures.

4.4. Quality Prediction

With the help of the statistical structures for feature aggregation, we reduce the dimension of the feature space and make it independent of the number of patches. However, the per-patch feature dimension $D$ of a pre-trained DCNN is still large ($D = 2048$ in ResNet-50's pool5 layer). Since the dimension of the feature space is much larger than the number of our training samples, we consider a linear regression model. Specifically, partial least squares regression (PLSR) (Rosipal and Krämer, 2006) is adopted in our work because of its low complexity and remarkable capability to handle high-dimensional data. PLSR reduces the input high-dimensional features to several uncorrelated latent components and then performs least squares regression on these components. There is only one parameter in PLSR (the number of components), which can be determined by cross validation.
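The paper uses MATLAB's plsregress; a comparable Python sketch with scikit-learn's PLSRegression (our substitution, with a hypothetical candidate set for the number of components) is:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

def fit_plsr(features, mos, candidate_p=(1, 2, 5, 10, 20, 30)):
    # candidate_p is a hypothetical search set for illustration; the paper
    # fixes p once by 5-fold cross-validation on the training data.
    search = GridSearchCV(PLSRegression(),
                          {"n_components": list(candidate_p)}, cv=5)
    search.fit(features, mos)
    return search.best_estimator_
```

One such model is trained per aggregation structure, and their predicted scores are averaged at test time (the "average-quality" ensemble of Table 1).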

After the above investigations, we obtain our best proposal, dubbed the Semantic Feature Aggregation metric using PLSR (SFA-PLSR). It uses multiple overlapping patches to represent images, extracts features from the pool5 layer of the pre-trained ResNet-50 model, and averages the scores predicted from the mean&std, quantile and moment aggregations.
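Putting the pieces together, the SFA-PLSR prediction for one image reads as the following sketch, reusing the illustrative helpers defined above (to_tensor stands for the patch preprocessing expected by the pre-trained model, i.e., ImageNet normalization and NCHW layout, and is left out for brevity):

```python
def predict_quality(image, plsr_models):
    # plsr_models: three trained PLSR models, one per aggregation structure.
    patches = sample_patches(image)               # image representation
    feats = extract_features(to_tensor(patches))  # ResNet-50 pool5 features
    aggregated = aggregate(feats.cpu().numpy())   # mean&std, quantile, moment
    scores = [m.predict(f.reshape(1, -1)).item()
              for m, f in zip(plsr_models, aggregated)]
    return sum(scores) / len(scores)              # average-quality ensemble
```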

5. Experiments

In the following parts, we compare the performance of the proposed SFA-PLSR method with state-of-the-art NR-IQA methods in both intra-database and inter-database scenarios. As for the software platform, we use the Caffe (Jia et al., 2014) framework to extract features from the pre-trained DCNN model. PLSR is performed by the MATLAB function plsregress, and its parameter (the number of components) is set globally based on 5-fold cross-validation using the training data of a single run (on BID).

5.1. Performance Comparison

In this part, we compare the performance of SFA-PLSR with ten existing (from classical to the most recent) NR-IQA methods of blur images: MDWE (Marziliano et al., 2002), CPBD (Narvekar and Karam, 2011), FISH (Vu and Chandler, 2012), S3 (Vu et al., 2012), LPC (Hassen et al., 2013), MLV (Bahrami and Kot, 2014), ARISM (Gu et al., 2015a), BIBLE (Li et al., 2016a), SPARISH (Li et al., 2016c) and RISE (Li et al., 2017). Five remarkable general-purpose NR-IQA methods, including BRISQUE (Mittal et al., 2012), Kang's CNN (Kang et al., 2014), FRIQUEE (Ghadiyaram and Bovik, 2017), NRSL (Li et al., 2016b), and S-HOSA (Siahaan et al., 2016), are also taken for comparison.

Table 2 reports the median SROCC, PLCC and RMSE over 1000 runs on the four databases. We also report the weighted-average SROCC over all four databases as the overall performance, where the weights are proportional to the database sizes (see the last column of Table 2). Among the ten NR-IQA methods of blur images, the first nine fail on the two realistic databases (SROCC below 0.5 on both BID and CLIVE) due to their neglect of global semantic information; RISE achieves the best performance among them on BID. The proposed SFA-PLSR significantly outperforms the others on BID and CLIVE in both prediction accuracy (PLCC, RMSE) and monotonicity (SROCC). As for the general-purpose NR-IQA methods, FRIQUEE and S-HOSA achieve better performance on the realistic databases than the others. Kang's CNN (Kang et al., 2014) does not perform well because it assumes that patch quality equals image quality, which is not true for these two realistic image datasets. On TID2008 and LIVE, there are fewer than 30 images with different contents, which is much smaller than BID (586) and CLIVE (1162), so the role of semantic information is weakened and the impact of low-level features is enhanced. Nevertheless, SFA-PLSR still achieves comparable performance on the two synthetic databases, and it also achieves the best overall performance.

Category | Method | BID (Ciancio et al., 2011): SROCC, PLCC, RMSE | CLIVE (Ghadiyaram and Bovik, 2016): SROCC, PLCC, RMSE | TID2008 (Ponomarenko et al., 2009): SROCC | LIVE (Sheikh et al., 2006): SROCC | Overall: SROCC
NR-IQA of blur images MDWE (Marziliano et al., 2002) 0.3067 0.3538 1.1639 0.4313 0.4988 17.5025 0.8556 0.9188 0.4514
CPBD (Narvekar and Karam, 2011) 0.0202 0.2181 1.2166 0.3027 0.4026 18.4602 0.8723 0.9390 0.2945
FISH (Vu and Chandler, 2012) 0.4736 0.4853 1.0894 0.4865 0.5380 17.0310 0.8737 0.9008 0.5323
S3 (Vu et al., 2012) 0.4109 0.4471 1.1177 0.4034 0.4864 17.6224 0.8650 0.9515 0.4686
LPC (Hassen et al., 2013) 0.3150 0.4053 1.1408 0.1483 0.3490 18.9205 0.8805 0.9469 0.2922
MLV (Bahrami and Kot, 2014) 0.3169 0.3750 1.1561 0.3412 0.4076 18.4350 0.8977 0.9431 0.4058
ARISM (Gu et al., 2015a) 0.0151 0.1929 1.2245 0.2427 0.3554 18.8947 0.8851 0.9585 0.2601
BIBLE (Li et al., 2016a) 0.3609 0.3923 1.1469 0.4260 0.5178 17.3007 0.9114 0.9638 0.4703
SPARISH (Li et al., 2016c) 0.3071 0.3555 1.1659 0.4015 0.4843 17.6702 0.9126 0.9638 0.4403
RISE (Li et al., 2017) 0.5839 0.6017 0.9936 - - - 0.9218 0.9493 0.6833
Proposed SFA-PLSR 0.8269 0.8401 0.6854 0.8130 0.8313 11.3905 0.9098 0.9523 0.8321
General purpose NR-IQA BRISQUE (Mittal et al., 2012) 0.5795 0.5754 1.0624 0.5950 0.6195 16.0273 0.8737 0.8892 0.6258
Kang’s CNN (Kang et al., 2014) 0.4818 0.4977 1.1030 0.4964 0.5218 17.8567 0.9000 0.9429 0.5448
FRIQUEE (Ghadiyaram and Bovik, 2017) 0.7359 0.7477 0.8433 0.6916 0.7069 14.4244 0.9261 0.9515 0.7353
NRSL (Li et al., 2016b) 0.638 0.663 0.931 0.631 0.654 15.317 - 0.959 0.658
S-HOSA (Siahaan et al., 2016) 0.6869 0.6913 0.9112 0.7051 0.7241 14.0237 0.8729 0.9469 0.7258
  • The results of RISE and NRSL are taken from their original papers. The codes of Kang's CNN and S-HOSA were implemented by ourselves following the details in their papers, and the codes of the other compared methods are from the original authors.

Table 2. Performance comparison on four databases. In each column, the best performance value is marked in boldface and the second best performance value is underlined. The last column indicates the weighted-average of SROCC over all four databases, where the weights are proportional to the database-sizes.
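The Overall column can be reproduced directly from the database sizes; e.g., for SFA-PLSR:

```python
# Weighted-average SROCC, weights proportional to the database sizes.
sizes = {"BID": 586, "CLIVE": 1162, "TID2008": 100, "LIVE": 145}
srocc = {"BID": 0.8269, "CLIVE": 0.8130, "TID2008": 0.9098, "LIVE": 0.9523}
overall = sum(sizes[d] * srocc[d] for d in sizes) / sum(sizes.values())
print(round(overall, 4))  # 0.8321, matching the Overall column for SFA-PLSR
```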

5.2. Cross Dataset Evaluation

In this subsection, we test the generalization capability of the learning-based methods through cross dataset evaluation. Since learning-based methods assume that testing images and training images have a similar distribution, we conduct cross dataset evaluation separately on the realistic databases (BID and CLIVE) and the synthetic databases (TID2008 and LIVE). It should be noted that CLIVE contains 383 images resized from BID images, so we exclude these 383 images from CLIVE in the cross dataset experiments.

We compare our method with RISE (the compared NR-IQA method of blur images with the best overall performance), as well as FRIQUEE and S-HOSA (the best two general-purpose NR-IQA methods). The SROCC values are provided in Table 3. It can be seen that our method performs better than RISE, FRIQUEE and S-HOSA, which demonstrates the database independence and robustness of the proposed SFA-PLSR method.

train test RISE FRIQUEE S-HOSA Ours
BID CLIVE - 0.3571 0.4767 0.5729
CLIVE BID - 0.3886 0.3433 0.6838
TID2008 LIVE 0.8638 0.8690 0.8950 0.9166
LIVE TID2008 0.9138 0.8727 0.8612 0.9243
  • The results of cross dataset evaluation on the two realistic blur datasets were not reported in the original paper of RISE.

Table 3. SROCC values in cross dataset evaluation.

5.3. Impact of Training Ratio

In order to gain an intuitive understanding of how the training ratio affects the performance of our method, we also test SFA-PLSR with different training ratios (from 10% to 90% in steps of 10%). Figure 7 clearly shows that the performance improves quickly as the training ratio increases up to about 30%. Even when only 40% of the images are used for training, the PLCC values are still close to 0.8. This is helpful in real-world applications, where only a relatively small number of images are labeled.

Figure 7. The PLCC of SFA-PLSR with different training ratios.

5.4. Confidence Band and Failure Cases

In this part, we further consider the prediction consistency of the proposed method and FRIQUEE (the compared method with the best overall performance). The green regions shown in Figure 8(a) and (b) are the confidence bands on BID; scatter points outside the band are regarded as outliers. It can be seen that FRIQUEE has more outliers than our method SFA-PLSR. The median outlier ratios (OR) over 1000 runs are 5.98% for SFA-PLSR and 11.11% for FRIQUEE, which indicates that our method is more consistent with human perception. The outliers correspond to failure cases, and the worst case of our method is shown in Figure 8(c): the picture suffers from a complex mixture of distortions. To overcome this type of failure case, more cues should be considered, such as saturation and ghosting.

(a) SFA-PLSR
(b) FRIQUEE
(c) Failure case
Figure 8. (a) SFA-PLSR scores and the confidence band on BID, (b) FRIQUEE scores and the confidence band on BID, and (c) a failure case.

6. Conclusion

In this paper, we propose a novel NR-IQA method for realistic blur images, which statistically aggregates high-level semantic features extracted from pre-trained deep convolutional neural networks. The top performance and strong generalization capability of our method are validated by comparison with several state-of-the-art methods on two realistic image databases (BID, CLIVE) and two synthetic image databases (TID2008, LIVE). Experiments also show that high-level semantics indeed play a more critical role than low-level features in NR-IQA of realistic blur images. In future work, we will consider our method in a coarse-to-fine multi-scale framework, since object scale also plays a role in human blur perception.

Acknowledgements.
This work was partially supported by National Basic Research Program of China (973 Program) under contract 2015CB351803, the National Natural Science Foundation of China under contracts 61210005, 61390514, 61421062, 61527804, 61572042, 61520106004 and Sino-German Center (GZ 1025).

References

  • Bahrami and Kot (2014) Khosro Bahrami and Alex C. Kot. 2014. A fast approach for no-reference image sharpness assessment based on maximum local variation. IEEE Signal Processing Letters 21, 6 (2014), 751–755.
  • Bahrami and Kot (2016) Khosro Bahrami and Alex C. Kot. 2016. Efficient image sharpness assessment based on content aware total variation. IEEE Transactions on Multimedia 18, 8 (2016), 1568–1578.
  • Bosse et al. (2016) Sebastian Bosse, Dominique Maniry, Thomas Wiegand, and Wojciech Samek. 2016. A deep neural network for image quality assessment. In IEEE International Conference on Image Processing. 3773–3777.
  • Ciancio et al. (2011) Alexandre Ciancio, André Luiz N. Targino da Costa, Eduardo A. B. da Silva, Amir Said, Ramin Samadani, and Pere Obrador. 2011. No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Transactions on Image Processing 20, 1 (2011), 64–75.
  • Ferzli and Karam (2009) Rony Ferzli and Lina J. Karam. 2009. A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB). IEEE Transactions on Image Processing 18, 4 (2009), 717–728.
  • Ghadiyaram and Bovik (2016) Deepti Ghadiyaram and Alan C. Bovik. 2016. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25, 1 (2016), 372–387.
  • Ghadiyaram and Bovik (2017) Deepti Ghadiyaram and Alan C. Bovik. 2017. Perceptual quality prediction on authentically distorted images using a bag of features approach. Journal of Vision 17, 32 (2017).
  • Gu et al. (2015a) Ke Gu, Guangtao Zhai, Weisi Lin, Xiaokang Yang, and Wenjun Zhang. 2015a. No-reference image sharpness assessment in autoregressive parameter space. IEEE Transactions on Image Processing 24, 10 (2015), 3218–3231.
  • Gu et al. (2015b) Ke Gu, Guangtao Zhai, Xiaokang Yang, and Wenjun Zhang. 2015b. Using free energy principle for blind image quality assessment. IEEE Transactions on Multimedia 17, 1 (2015), 50–63.
  • Hassen et al. (2013) Rania Hassen, Zhou Wang, and Magdy M. A. Salama. 2013. Image sharpness assessment based on local phase coherence. IEEE Transactions on Image Processing 22, 7 (2013), 2798–2810.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM MM. 675–678.
  • Kang et al. (2014) Le Kang, Peng Ye, Yi Li, and David Doermann. 2014. Convolutional neural networks for no-reference image quality assessment. In CVPR. 1733–1740.
  • Kim and Lee (2017) Jongyoo Kim and Sanghoon Lee. 2017. Fully deep blind image quality predictor. IEEE Journal of Selected Topics in Signal Processing 11, 1 (January 2017), 206–220.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
  • Li et al. (2016a) Leida Li, Weisi Lin, Xuesong Wang, Gaobo Yang, Khosro Bahrami, and Alex C Kot. 2016a. No-reference image blur assessment based on discrete orthogonal moments. IEEE Transactions on Cybernetics 46, 1 (2016), 39–50.
  • Li et al. (2016c) Leida Li, Dong Wu, Jinjian Wu, Haoliang Li, Weisi Lin, and Alex C Kot. 2016c. Image sharpness assessment by sparse representation. IEEE Transactions on Multimedia 18, 6 (2016), 1085–1097.
  • Li et al. (2017) Leida Li, Wenhan Xia, Weisi Lin, Yuming Fang, and Shiqi Wang. 2017. No-reference and robust image sharpness evaluation based on multi-scale spatial and spectral features. IEEE Transactions on Multimedia (2017).
  • Li et al. (2016b) Qiaohong Li, Weisi Lin, Jingtao Xu, and Yuming Fang. 2016b. Blind image quality assessment using statistical structural and luminance features. IEEE Transactions on Multimedia 18, 12 (2016), 2457–2469.
  • Lu et al. (2016) Qingbo Lu, Wengang Zhou, and Houqiang Li. 2016. A no-reference image sharpness metric based on structural information using sparse representation. Information Sciences 369 (2016), 334–346.
  • Marziliano et al. (2002) Pina Marziliano, Frederic Dufaux, Stefan Winkler, and Touradj Ebrahimi. 2002. A no-reference perceptual blur metric. In IEEE International Conference on Image Processing, Vol. 3. 57–60.
  • Mittal et al. (2012) Anish Mittal, Anush Krishna Moorthy, and Alan C. Bovik. 2012. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing 21, 12 (2012), 4695–4708.
  • Moorthy and Bovik (2011) Anush Krishna Moorthy and Alan C. Bovik. 2011. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Transactions on Image Processing 20, 12 (2011), 3350–3364.
  • Narvekar and Karam (2011) Niranjan D. Narvekar and Lina J. Karam. 2011. A no-reference image blur metric based on the cumulative probability of blur detection (CPBD). IEEE Transactions on Image Processing 20, 9 (2011), 2678–2683.
  • Oh et al. (2014) Taegeun Oh, Jincheol Park, Kalpana Seshadrinathan, Sanghoon Lee, and Alan C. Bovik. 2014. No-reference sharpness assessment of camera-shaken images by analysis of spectral structure. IEEE Transactions on Image Processing 23, 12 (2014), 5428–5439.
  • Ponomarenko et al. (2009) Nikolay Ponomarenko, Vladimir Lukin, Alexander Zelensky, Karen Egiazarian, Marco Carli, and Federica Battisti. 2009. TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10, 4 (2009), 30–45.
  • Qi et al. (2015) Feng Qi, Debin Zhao, and Wen Gao. 2015. Reduced reference stereoscopic image quality assessment based on binocular perceptual information. IEEE Transactions on Multimedia 17, 12 (2015), 2338–2344.
  • Rosipal and Krämer (2006) Roman Rosipal and Nicole Krämer. 2006. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection. Springer, 34–51.
  • Saad et al. (2012) Michele A. Saad, Alan C. Bovik, and Christophe Charrier. 2012. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing 21, 8 (2012), 3339–3352.
  • Sang et al. (2014) Qingbing Sang, Huixin Qi, Xiaojun Wu, Chaofeng Li, and Alan C. Bovik. 2014. No-reference image blur index based on singular value curve. Journal of Visual Communication and Image Representation 25, 7 (2014), 1625 – 1630.
  • Sheikh et al. (2006) Hamid R. Sheikh, Muhammad F. Sabir, and Alan C. Bovik. 2006. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15, 11 (2006), 3440–3451.
  • Siahaan et al. (2016) Ernestasia Siahaan, Alan Hanjalic, and Judith A. Redi. 2016. Augmenting blind image quality assessment using image semantics. In IEEE International Symposium on Multimedia. 307–312.
  • Sun et al. (2016) Cuirong Sun, Houqiang Li, and Weiping Li. 2016. No-reference image quality assessment based on global and local content perception. In Visual Communications and Image Processing. 1–4.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR. 1–9.
  • Tang et al. (2011) Huixuan Tang, Neel Joshi, and Ashish Kapoor. 2011. Learning a blind measure of perceptual image quality. In CVPR. 305–312.
  • VQEG (2000) VQEG. 2000. Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. (2000).
  • Vu et al. (2012) Cuong T. Vu, Thien D. Phan, and Damon M. Chandler. 2012. S3: A spectral and spatial measure of local perceived sharpness in natural images. IEEE Transactions on Image Processing 21, 3 (2012), 934–945.
  • Vu and Chandler (2012) Phong V. Vu and Damon M. Chandler. 2012. A fast wavelet-based algorithm for global and local image sharpness estimation. IEEE Signal Processing Letters 19, 7 (2012), 423–426.
  • Wang et al. (2004) Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
  • Wang and Simoncelli (2003) Zhou Wang and Eero P Simoncelli. 2003. Local phase coherence and the perception of blur. In NIPS. 1435–1442.
  • Xu et al. (2016) Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. 2016. Blind Image Quality Assessment Based on High Order Statistics Aggregation. IEEE Transactions on Image Processing 25, 9 (2016), 4444–4457.
  • Yan et al. (2016) Jia Yan, Weixia Zhang, and Tianpeng Feng. 2016. Blind image quality assessment based on natural redundancy statistics. In ACCV. 3–18.
  • Ye et al. (2012) Peng Ye, Jayant Kumar, Le Kang, and David Doermann. 2012. Unsupervised feature learning framework for no-reference image quality assessment. In CVPR. 1098–1105.
  • You et al. (2011) Junyong You, Touradj Ebrahimi, and Andrew Perkis. 2011. Modeling motion visual perception for video quality assessment. In ACM MM. 1293–1296.
  • Yu et al. (2017) Shaode Yu, Shibin Wu, Fan Jiang, Leida Li, Yaoqin Xie, and Lei Wang. 2017. A shallow convolutional neural network for blind image sharpness assessment. PLoS ONE (2017).
  • Zhai et al. (2012) Guangtao Zhai, Xiaolin Wu, Xiaokang Yang, Weisi Lin, and Wenjun Zhang. 2012. A psychovisual quality metric in free-energy principle. IEEE Transactions on Image Processing 21, 1 (2012), 41–52.
  • Zhang and Kuo (2014) Jiangyang Zhang and C-C Jay Kuo. 2014. An objective quality of experience (QoE) assessment index for retargeted images. In ACM MM. 257–266.