Deep Optimization model for Screen Content Image Quality Assessment using Neural Networks

03/02/2019 ∙ by Xuhao Jiang, et al. ∙ 8

In this paper, we propose a novel quadratic optimized model based on a deep convolutional neural network (QODCNN) for full-reference (FR) and no-reference (NR) screen content image (SCI) quality assessment. Unlike traditional CNN methods, which take all image patches as training data and use average quality pooling, our model is optimized in three steps. In the first step, an end-to-end deep CNN is trained to preliminarily predict the image visual quality, and batch normalization (BN) layers and l2 regularization are employed to improve the speed and performance of network fitting. In the second step, the pretrained model is fine-tuned to achieve better performance, based on an analysis of the raw training data. In the third step, an adaptive weighting method is proposed to fuse local quality, inspired by the perceptual property that the human visual system (HVS) is sensitive to image patches containing texture and edge information. The novelty of our algorithm can be summarized as follows: 1) considering the correlation between local quality and the subjective differential mean opinion score (DMOS), the Euclidean distance is utilized to measure the effectiveness of image patches, and the pretrained model is fine-tuned with more effective training data; 2) an adaptive pooling approach is employed to fuse the patch quality of textual and pictorial regions, whose feature is extracted only from distorted images, is strongly robust to noise, and is effective for both FR and NR IQA; 3) considering the characteristics of SCIs, a deep and valid network architecture is designed for both NR and FR visual quality evaluation of SCIs. Experimental results verify that our model outperforms both current no-reference and full-reference image quality assessment methods on the benchmark screen content image quality assessment database (SIQAD).




I Introduction

Screen content pictures have become common in daily life with the rapid development of multimedia and social networks. Numerous consumer applications, such as Facebook, Twitter, remote control and more, involve computer-generated screen content images (SCIs). Fig. 1(a)-(b) shows two typical images: one is a SCI and the other is a natural image (NI). There are significant differences between these two images. NIs have rich colors that change slowly, while SCIs contain more thin lines, sharp edges and little color variance because of the massive presence of text and computer-generated graphics. During acquisition, processing, compression, storage, transmission and reproduction, various types of distortion may be introduced into digital images, and the visual quality of the images is degraded as a result. Image quality assessment (IQA) aims to evaluate image quality objectively, so that humans do not have to spend much time judging subjective image quality. IQA methods can also be used to optimize image processing algorithms. Therefore, IQA plays a very important role in the image processing community.

Fig. 1: (a) is an example of a screen content image and (b) is an example of a natural image.

A number of IQA methods for NIs have been designed in recent years, including full-reference (FR), reduced-reference (RR) and no-reference (NR) methods. By considering the characteristics of the HVS, many FR approaches have become highly consistent with subjective quality scores, including structural similarity (SSIM) [1], multi-scale SSIM (MS-SSIM) [2], information weighted SSIM (IW-SSIM) [3], feature similarity (FSIM) [4], visual saliency-based index (VSI) [5], gradient magnitude similarity deviation (GMSD) [6], and gradient SSIM (GSIM) [7]. For RR methods, such as [8]-[10], only partial information of the reference images is used for IQA. However, in many practical cases the reference image, or even part of its information, is unavailable when evaluating image quality. For NR methods, only the distorted images are employed for IQA. Generally, NR IQA algorithms extract specific features from distorted images and train a regression model on these features and the subjective ratings by machine learning.


Most existing IQA approaches devised for NIs are not effective for SCI quality evaluation, because they do not take into account the differences in image content and characteristics between SCIs and NIs. Recognizing this difficulty, others have instead developed a variety of IQA approaches tailored to SCIs. Some representative FR IQA works [17]-[22] have been published and achieve good results. Yang et al. propose an FR-IQA model based on SCI segmentation [17]. It provides an effective segmentation method to distinguish the pictorial and textual regions, and an activity weighting strategy is employed to fuse the visual quality scores of the entire, textual and pictorial regions into the overall quality score. Based on this segmentation method, structural features based on gradient information and luminance features are extracted for similarity computation to obtain the visual quality of SCIs [19]. In [20], Gu et al. propose an FR metric which mainly relies on simple convolution operators to detect salient areas. Ni et al. [21] design an FR metric based on local similarities extracted with Gabor filters in the LMN color space. These FR-IQA methods achieve superior performance in SCI quality evaluation. Several NR IQA methods for SCIs have also been proposed [23], [24]. Shao et al. [23] propose a blind quality predictor for SCIs that explores the issue from the perspective of sparse representation. Local sparse representation and global sparse representation are conducted for textual and pictorial regions, respectively. Then, the local and global quality scores are estimated and combined into a total one. Another effective approach to no-reference SCI quality assessment is presented in [24], which obtains an overall quality score by extracting features from the histograms of texture and luminance and training on these features with SVR.

With the development of CNNs, many models [25]-[32] have started to build neural networks for NI quality assessment, and have achieved superior performance. These methods utilize image patches as a form of data augmentation, and design special patch-level neural networks for NI IQA. Kang et al. [25] propose a CNN-based method to accurately predict NI quality without reference images. Bosse's method [32] lets the CNN learn both the local quality and the local weights, and then fuses the local quality into a global quality using the local weights. This work mainly considers the relative importance of local quality to the global quality estimate. However, these CNN-based methods still do not consider the special characteristics of screen content images, where textual regions attract more attention than pictorial regions. Zhang et al. [33] propose an FR-IQA model for SCIs that takes the fusion of textual and pictorial regions into consideration: the quality of pictorial and textual regions is evaluated separately and fused with a region-size-adaptive quality-fusion strategy. Zuo et al. propose an NR method using classification models for SCI quality assessment in [34]. A novel classification network is trained on the distorted images to obtain a practical model, and the weightings of textual regions and pictorial regions are determined according to the gradient entropy, which adapts to the characteristics of screen content images. Chen et al. [35] propose a naturalization module to transfer IQA of NIs to IQA of SCIs. These CNN-based IQA methods for SCIs have their limitations. They divide SCIs into image patches to obtain enough training data, and utilize DMOSs as ground truth. This brings two problems: they lack reliable ground truth for image patches, and they lack an effective strategy for fusing local quality.

In this work, we propose a novel algorithm for both NR and FR IQA of screen content images that addresses the limitations of previous patch-level CNN-based models for SCIs. Unlike traditional patch-level CNN-based methods, our model selects, as effective input data, only those image patches whose quality is relatively close to the DMOS. For this purpose, a two-step training strategy is devised. In the first step, the network is trained with all of the image patches to obtain a pretrained model, which is then used to predict the quality scores of the training image patches. Then, the Euclidean distance is employed to evaluate the effectiveness of the training image patches, and the pretrained model is fine-tuned with the selected image patches to obtain a more accurate model. On this basis, an efficient and adaptive weighting method is designed to fuse the visual quality of textual and pictorial regions, taking into account the effect of different image patch content. The main contributions of our method are described as follows:

1) Considering the characteristics of SCIs, a deep and valid network architecture is designed for both NR and FR visual quality evaluation of SCIs. Moreover, reference information is extracted by independent layers and concatenated with distorted information in a shallow layer for FR-IQA. This ensures that the network learns the feature differences.

2) Considering the connection between the histogram distribution of local quality and DMOSs, the Euclidean distance between local quality and DMOSs is utilized to evaluate the effectiveness of training image patches. A training data selection based on this effectiveness is proposed to fine-tune the pretrained model and obtain a higher-performance model.

3) A noise-robust index, the variance of local standard deviation (VLSD), is utilized to distinguish textual and pictorial regions of SCIs and to measure the patch weights of the two regions. Our proposed adaptive weighting method using VLSD is appropriate for fusing local quality under different types and degrees of distortion.

The rest of this paper is organized as follows. Section II provides a brief review of the related work. In Section III, an effective CNN-based NR-IQA algorithm for SCI is proposed. Section IV shows experimental results and compares performance of the proposed algorithm with the state-of-the-art methods. Finally, conclusions are given in Section V.

Fig. 2: Framework of the proposed algorithm.

II Related Work

II-A Training Patch-level Deep IQA Model with More Reliable Ground-truth

The traditional CNN-based approaches [25], [32] work with image patches by assigning the subjective DMOS of an image to all patches within it. These approaches suffer from the limitation that the local quality of image patches within a large image varies even when the distortion is applied homogeneously [1]. Therefore, some works [27], [36] make use of FR-IQA methods for quality annotation. Kim and Lee [36] pretrain the model with the local scores predicted by an FR-IQA approach as the ground truth and fine-tune it with DMOSs. Inspired by this work, Bare et al. [27] devise an accurate deep model which utilizes FSIM [4] to generate training labels for image patches and adopts a deep residual network [37], which shows a strong ability to extract features in classification and regression tasks. Compared with approaches using DMOSs, utilizing the local scores of an FR-IQA model achieves better performance, since network training benefits when each image patch is labeled with a more accurate score. However, this brings a new problem: the accuracy of the FR-IQA model limits the performance of the trained network.

Inspired by these works, we observe that the local quality within a SCI (e.g., the quality of a patch within a large image) varies greatly, because SCIs contain complex content including textual and pictorial regions. However, when three high-performance FR-IQA methods for NIs and SCIs are used to predict the quality of image patches, their performance with an average pooling strategy is poor, as shown in Section III-B. The main reason is that the local quality of SCIs varies greatly due to the different characteristics of the pictorial and textual regions in SCIs. Therefore, training a deep model with local scores predicted by FR models is not appropriate for SCIs.

II-B Salience and Attention for IQA

Image saliency is one of the most popular topics in computer vision because it reflects the characteristics of the HVS. This leads to the idea of combining saliency models with IQA models. Zhang et al. [5] devise the VSI FR-IQA model, which takes the local saliency from reference and distorted images as feature maps and combines it with similarity from local gradients and local chrominance. Saliency detection is difficult in noisy images, since the HVS is sensitive to noise. Bosse et al. [32] first propose a learning model that combines saliency with an NR-IQA model. Their model contains two sub-networks that separately learn local quality and local weight. The image visual quality is then evaluated by weighting the local quality of each region with the corresponding local saliency.


However, we observe that learning local weights clearly improves IQA performance for NIs but has little effect on IQA for SCIs. For NIs, a CNN-based model can precisely predict local quality and shows high IQA performance. For SCIs, a CNN-based model does not reach the performance it achieves on NI quality assessment. The local weights of the learned model are highly correlated with local quality, and thus the weight prediction will be poor without accurate quality prediction.

Fig. 3: An illustration of the architecture of our CNN model.

III Proposed Method

In this section, the proposed QODCNN for SCIs is described in detail. As shown in Fig. 2, QODCNN consists of three sub-steps: training, fine-tuning and post-processing. Firstly, the designed CNN is trained with all the image patches to obtain an initial model for SCI visual quality assessment, which learns the features of image distortion and can effectively predict the quality of SCIs. Secondly, the pre-trained CNN model predicts the quality scores of all the training image patches, and a data selection is applied according to the predicted scores. Fine-tuning the network with the selected data yields a more precise and valid model; this second stage is the first optimization. Thirdly, considering the different importance of textual and pictorial regions for IQA of SCIs, the VLSD is designed to distinguish textual and pictorial regions, measure the local weights of image patches and fuse local quality; this is the second optimization. Finally, a learned model is obtained that effectively evaluates the visual quality of SCIs.

III-A Network Architecture

Fig. 4: Samples of predicted local quality. The first and third rows show the distorted SCIs; the second and fourth rows show the corresponding local quality histograms. (a-d, i-k) are distorted SCIs with the most serious noise of GN, GB, MB, CC, JPEG, JPEG2000, and LSC in SIQAD.

The design of the proposed QODCNN architecture is shown in Fig. 3. The proposed model consists of eight convolutional layers, four max pooling layers, one concatenation layer and two fully connected layers. Each convolutional layer has a filter with a stride of 1 pixel, and each pooling layer has a kernel with a stride of 2 pixels. For each convolutional layer, zeros are padded around the border and a BN [38] layer is added to improve network training performance. The output feature map of the BN layer is calculated by Eq. 2,

$$y = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \qquad (2)$$

where $x$ is the input feature map of the batch normalization layer and $y$ is the output, $\mu$ and $\sigma^2$ respectively denote the mean and variance of the input map, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are two parameters updated in training. The rectified linear unit (ReLU) is added as the activation function after the normalization layers. Feature maps extracted by convolutional layers and pooling layers are named, and the precise configurations are listed in Fig. 3.
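The normalization and activation described above can be sketched in a few lines of NumPy. This is only an illustration of the standard batch normalization formula followed by ReLU, not the authors' TensorFlow code; the function name `batch_norm_relu`, the channels-last layout and the value of `eps` are assumptions made for the example.

```python
import numpy as np

def batch_norm_relu(x, gamma, beta, eps=1e-5):
    """Normalize a batch of feature maps per channel, then apply ReLU.

    x has shape (batch, height, width, channels); gamma and beta have
    shape (channels,). eps is an assumed stability constant.
    """
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # per-channel mean
    var = x.var(axis=(0, 1, 2), keepdims=True)   # per-channel variance
    y = gamma * (x - mu) / np.sqrt(var + eps) + beta
    return np.maximum(y, 0.0)                    # ReLU activation

# Toy batch: 8 feature maps of size 4x4 with 2 channels.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 2))
out = batch_norm_relu(features, gamma=np.ones(2), beta=np.zeros(2))
```

During training the statistics come from the current mini-batch, as here; at inference time frameworks typically substitute running averages, which this sketch does not model.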

For FR-IQA, this model extracts the feature maps of reference image patches and distorted image patches, and fuses these maps with a concatenation layer in a shallow layer of the network. The fused feature maps are regressed by the remaining network layers. For NR-IQA, the branch that extracts feature maps of reference images is removed, and thus the adjusted model extracts features only from the distorted images. The loss function for both NR and FR models is computed over the image patches of an input mini-batch, comparing the network output for each input image patch with the ground truth of that patch.

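The CNN parameters are learned end-to-end by minimizing the sum of the loss function and the l2 regularization term for the predicted quality, on all training tuples:

$$\min_{w} \; L(w) + \lambda \lVert w \rVert_2^2$$

where $L$ is the loss function, $\lambda$ represents the penalty factor and $w$ denotes the weights of the CNN model.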

In our model, a combination of two convolutional layers and one pooling layer is employed to obtain a larger receptive field and extract features with less data. Considering that SCIs contain a lot of edge and gradient information, max pooling is applied to capture texture changes degraded by noise. For FR-IQA, reference information is extracted by independent layers and concatenated with distorted information in a shallow layer. This ensures that the network learns the differences between features extracted from distorted and reference SCIs rather than the differences between the distorted and reference SCIs themselves. Compared with features in deep layers, features in shallow layers retain a large amount of the original information with less information loss. In addition, BN layers and l2 regularization significantly improve the speed of the model's regression and the performance of fitting. The loss used here differs from the one adopted by most existing models for NI quality assessment: considering the large statistical differences between image patches of SCIs, it is chosen to reduce the impact of abnormal image patches.

III-B A Two-step Training Strategy

Most existing patch-level CNN-based models face the problem that the local quality of a large image is labeled with an inaccurate quality score. This problem is more serious when a patch-level CNN model is employed to predict the visual quality of SCIs. Compared with NIs, SCIs contain more complex content consisting of pictorial and textual regions. Some approaches [27], [36] utilize FR-IQA methods to predict the local quality of natural scene images. However, this has limited effect for SCI visual quality evaluation. Here, SSIM [1], FSIM [4] and SQMS [20] are tested on the SIQAD database [17]: each of the three FR-IQA methods predicts the local quality of SCIs, and the local quality is fused with average pooling to obtain the image quality. From Table I, it can be observed that these methods perform poorly because of the large differences between image patches of SCIs. Therefore, we address this problem through the SCI patches themselves, not through the corresponding labels.

Method      PLCC    SRCC    RMSE
SSIM [1]    0.7630  0.7602  10.8837
FSIM [4]    0.8295  0.8328   9.2777
SQMS [20]   0.8104  0.8156  10.2790
TABLE I: Patch-Level Performance of Three FR-IQA Methods

In our model, a CNN trained with more efficient data (TMED) is used to solve this problem, in two training steps. In the first step, an initial model is obtained by training the neural network of Section III-A with all the image patches for SCI quality assessment. To test the effectiveness of the pretrained model, it is used to predict the scores of all the training image patches, and the corresponding distributions of these predicted patch scores are provided in Fig. 4. Fig. 4 lists the distorted images of seven different distortion types and the corresponding local quality histograms. For Gaussian noise (GN), Gaussian blur (GB), motion blur (MB), contrast change (CC), JPEG compression (JC), JPEG2000 compression (J2C) and layer segmentation-based coding (LSC) distortions, the most serious distortion level is analyzed as a typical example. These histograms show that the predicted local quality of training image patches is distributed around the DMOSs. This also verifies the view mentioned in the introduction that the local quality of a large image varies. It can be noted that the local quality scores of some image patches are far from the DMOSs, which damages the performance of the learned model.

Inspired by the histograms of local quality, a training data selection is more reasonable and benefits the learning of the deep model, since a CNN can be learned better with training data labeled with precise ground truth. Data selection discards those image patches whose local quality deviates from the ground truth and keeps those whose local quality is close to the DMOS. Therefore, the Euclidean distance is employed to evaluate the effectiveness of training image patches and is calculated as follows,

$$E = \sqrt{(s - g)^2} \qquad (4)$$

where $s$ is the predicted score of an image patch, $g$ is the ground truth of the image patch and $E$ denotes the effectiveness index. In order to maintain enough information from each image, image patches are selected from each image with a fixed ratio. The ratio is computed by

$$r = \frac{N_s}{N_a} \qquad (5)$$

where $r$ represents the data selection ratio, $N_s$ and $N_a$ are the number of selected image patches and the number of all image patches of a SCI, and $T$ denotes the adaptive threshold on $E$ that determines which patches are selected. In the second step, the pre-trained model is fine-tuned with the selected training data to obtain a more effective and higher-performance model.
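The selection step can be illustrated with a small sketch. It assumes that, for each image, the patches with the smallest distance to the DMOS are kept until the requested ratio is reached, so the adaptive threshold is simply the distance of the last kept patch. The function name `select_patches` and the toy scores are hypothetical.

```python
import numpy as np

def select_patches(pred_scores, dmos, ratio=0.7):
    """Return the indices of the patches of one image kept for fine-tuning."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    # For scalar scores the Euclidean distance reduces to |prediction - DMOS|.
    effectiveness = np.sqrt((pred_scores - dmos) ** 2)
    n_keep = max(1, int(round(ratio * len(pred_scores))))
    # Keep the n_keep patches closest to the DMOS (stable sort for ties).
    order = np.argsort(effectiveness, kind="stable")
    return np.sort(order[:n_keep])

# Ten predicted patch scores for an image whose DMOS is 50.
scores = [41.0, 55.0, 48.5, 70.0, 50.5, 49.0, 30.0, 52.0, 47.0, 51.0]
kept = select_patches(scores, dmos=50.0, ratio=0.7)
```

Selecting per image, rather than over the pooled training set, keeps every image represented in the fine-tuning data, which matches the fixed-ratio motivation given in the text.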

III-C Pooling with A Novel Weighting Method

Most patch-level CNN-based models [25], [27] compute global quality from local quality by average pooling. However, average pooling of local quality estimates does not consider the spatially varying perceptual relevance of local quality. Especially for SCIs, the HVS is sensitive to edge information, which means that textual regions have higher weights than pictorial regions for IQA. It is very difficult to distinguish these two kinds of regions from distorted images alone. Fang's method [19] obtains global quality by weighting local quality with gradient entropy. Using gradient entropy to distinguish textual and pictorial regions is useful for an FR-IQA model, but it is difficult for NR-IQA since entropy is sensitive to noise.

Fig. 5: (a) is the map with a smooth processing of a SCI distorted by Gaussian noise; (b) and (c) are of different pixel distributions of pictorial region and textual region in the smoothed image.

In our model, weighting local quality with VLSD (WLQVLSD) is proposed for the first time for SCI quality assessment. Fig. 5(a) depicts the map of a typical Gaussian-noise-distorted SCI after a smoothing operation named local standard deviation (LSD), and Figs. 5(b) and (c) depict the histograms of a pictorial region and a textual region. LSD is applied to each image to indicate the structural complexity and reduce the impact of noise. The value of the output feature map is calculated as,

$$\mu(i,j) = \sum_{k}\sum_{l} w_{k,l}\, I(i+k, j+l)$$

$$\sigma(i,j) = \sqrt{\sum_{k}\sum_{l} w_{k,l}\,\bigl[I(i+k, j+l) - \mu(i,j)\bigr]^2} \qquad (7)$$

where $w$ is a 2D circularly-symmetric Gaussian weighting function, $I(i,j)$ is the pixel at position $(i,j)$ in the image, $\mu(i,j)$ is the mean value of $I$ within a local window centered at $(i,j)$, and $\sigma(i,j)$ is the output pixel at the corresponding position. The window size is fixed in our implementation. As shown in Fig. 5(a), the noise is weakened after smoothing, and plenty of thin lines are left in the image, which helps to distinguish pictorial from textual regions.
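A minimal NumPy sketch of a Gaussian-weighted local standard deviation map follows. It assumes the common formulation in which both the local mean and the local deviation use the same normalized Gaussian window; the window size of 7, the `sigma` of 1.5 and the reflective border padding are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_window(size=7, sigma=1.5):
    """Normalized 2D circularly-symmetric Gaussian weights (size must be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()

def local_std_map(image, size=7, sigma=1.5):
    """Gaussian-weighted local standard deviation at every pixel."""
    image = np.asarray(image, dtype=float)
    w = gaussian_window(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")
    # All size x size neighbourhoods, shape (H, W, size, size).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    mu = (windows * w).sum(axis=(2, 3))                       # local mean
    var = (w * (windows - mu[..., None, None]) ** 2).sum(axis=(2, 3))
    return np.sqrt(np.maximum(var, 0.0))                      # local deviation

# A flat region has no structure; a vertical step edge does.
flat = np.full((16, 16), 0.5)
edge = np.zeros((16, 16))
edge[:, 8:] = 1.0
lsd_flat = local_std_map(flat)
lsd_edge = local_std_map(edge)
```

The map is close to zero in flat regions and large near sharp transitions, which is the behaviour the text relies on to make thin lines and text strokes stand out.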

Second, two typical regions are marked in Fig. 5(a). It can be seen that the histogram of the textual region is relatively scattered, whereas the histogram of the pictorial region is relatively concentrated. Considering these differences in histogram distribution, the variance of LSD is taken as the feature that describes the image content. The value of the variance is calculated by,

$$V = \frac{1}{N} \sum_{(i,j)} \bigl(\sigma(i,j) - \bar{\sigma}\bigr)^2 \qquad (8)$$

where $N$ is the number of pixels in the image patch, $\bar{\sigma}$ is the mean value of the local deviation map and $\sigma(i,j)$ is the pixel at position $(i,j)$ in the local deviation map, calculated in Eq. 7. The reference SCI and the corresponding VLSD maps of three distortion types (GN, CC and JPEG) are shown in Fig. 6. To demonstrate the noise robustness of VLSD, the distorted SCIs with the slightest and the most serious distortion are used as examples. As seen in Fig. 6, two typical areas are marked with boxes of two different colors, where yellow boxes represent pictorial regions and blue boxes represent textual regions. It can be observed that textual regions obtain larger values than pictorial regions in all VLSD maps. This shows that VLSD can effectively distinguish pictorial and textual regions and measure the local weights of SCIs. Moreover, VLSD is strongly robust to noise. Compared with gradient entropy, VLSD better distinguishes textual regions from pictorial regions, and is robust to distortion type and intensity. Thus the VLSD is employed to measure the importance of local regions in a large image.

Fig. 6: The visual samples of VLSD maps. (a) is a reference SCI and (b-g) are VLSD maps of six distorted SCIs. (b, c), (d, e) and (f, g) are VLSD map pairs with the slightest and most serious noise of GN, CC and JPEG in SIQAD.

Finally, the scores predicted by the fine-tuned CNN and the corresponding VLSD values of the image patches are obtained. A weighting method is applied to fuse the quality of textual and pictorial regions, which is calculated as,

$$Q = \frac{\sum_{i=1}^{M} v_i \, q_i}{\sum_{i=1}^{M} v_i}$$

where $q_i$ and $v_i$ are the score of the $i$th patch and its variance value calculated based on Eq. 8, $M$ is the number of patches of the test image, and $Q$ is the final score of the test image.
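As a hedged illustration of the fusion step, the sketch below computes the variance of an LSD patch and then pools patch scores as a weighted average with those variances as weights. The weighted-average form, the fallback to plain averaging when all weights vanish, and the toy LSD patches are assumptions made for the example.

```python
import numpy as np

def vlsd(lsd_patch):
    """Variance of a local-standard-deviation patch (population variance)."""
    lsd_patch = np.asarray(lsd_patch, dtype=float)
    return float(np.mean((lsd_patch - lsd_patch.mean()) ** 2))

def vlsd_weighted_score(patch_scores, patch_vlsd, eps=1e-12):
    """Fuse patch scores into one image score using VLSD values as weights."""
    q = np.asarray(patch_scores, dtype=float)
    v = np.asarray(patch_vlsd, dtype=float)
    if v.sum() < eps:
        # Assumed fallback: a completely flat image gets average pooling.
        return float(q.mean())
    return float((v * q).sum() / v.sum())

# Toy LSD patches: scattered values (text-like) vs concentrated (picture-like).
textual_lsd = np.tile([0.0, 0.4], (8, 4))
pictorial_lsd = np.tile([0.19, 0.21], (8, 4))
weights = [vlsd(textual_lsd), vlsd(pictorial_lsd)]
image_score = vlsd_weighted_score([60.0, 30.0], weights)
```

With these toy values the text-like patch dominates the fused score, reflecting the claim that textual regions should carry more weight than pictorial ones.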

III-D Training

Before training our model, grayscale conversion and data augmentation are applied by dividing large color images into gray image patches of a fixed size. TensorFlow is used as the training toolbox, and two databases [17], [18] are used to train and test our model. Both the pre-training and fine-tuning steps adopt the Adam optimization algorithm [40] with a mini-batch size of 64, and employ DMOSs as the ground truth for training. A fixed penalty factor is used for the regularization. In the pre-training stage, the learning rate is changed at intervals of ten epochs. For fine-tuning, we fine-tune the pre-trained model with the same learning rate conditions. After training for two hundred epochs, the final model is obtained to predict visual quality.
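The preprocessing can be sketched as follows. The patch size of 32, the non-overlapping stride and the BT.601 grayscale weights are assumptions chosen for illustration; the text above does not preserve the actual patch size. Each patch inherits the DMOS of its source image, as described for the training labels.

```python
import numpy as np

def to_gray(rgb):
    """Luma-style grayscale conversion using ITU-R BT.601 weights (assumed)."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ np.array([0.299, 0.587, 0.114])

def extract_patches(gray, patch_size=32, stride=32):
    """Cut a grayscale image into patches; partial border patches are dropped."""
    h, w = gray.shape
    patches = [
        gray[r:r + patch_size, c:c + patch_size]
        for r in range(0, h - patch_size + 1, stride)
        for c in range(0, w - patch_size + 1, stride)
    ]
    return np.stack(patches)

# Hypothetical 100x130 color image with a DMOS of 45.
image = np.random.default_rng(0).integers(0, 256, size=(100, 130, 3))
patches = extract_patches(to_gray(image))
dmos_labels = np.full(len(patches), 45.0)  # every patch inherits the image DMOS
```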

IV Experimental Results

IV-A Database and Evaluation Methodology

To verify the effectiveness of the proposed QODCNN, two screen content image databases, SIQAD [17] and SCID [18], are used to conduct the comparison experiments. The SIQAD database has 20 reference screen content images and 980 distorted screen content images, which contain seven types of distortion (GN, GB, MB, CC, JC, J2C and LSC), each with seven levels of distortion. The SCID database is used for cross-database experiments. Six distortion types (GN, GB, MB, CC, JC, and J2C) at five different levels, which are common to the two databases, are considered. This leaves 1200 test SCIs in SCID.

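In most cases, three typical criteria are adopted to evaluate the performance of IQA algorithms: the Pearson Linear Correlation Coefficient (PLCC), Spearman's Rank-order Correlation Coefficient (SRCC) and Root Mean Square Error (RMSE). The closer the values of PLCC and SRCC are to 1, and the smaller the value of RMSE, the more accurate the algorithm. Given the $i$th image in the database (with $N$ images in total), $x_i$ and $y_i$ are its objective and subjective scores, $\bar{x}$ and $\bar{y}$ are the mean values of $x_i$ and $y_i$, and $d_i$ is the difference between the ranks of the $i$th image in the subjective and objective results. PLCC, SRCC and RMSE are defined as follows,

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}$$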
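The three criteria can be computed directly from their standard definitions. The sketch below is a plain NumPy illustration; the rank helper assumes there are no tied scores, and the score lists are made-up values for demonstration.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def ranks(values):
    """Ranks 1..N of the values; assumes no ties."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def srcc(x, y):
    """Spearman rank-order correlation via the rank-difference formula."""
    d = ranks(np.asarray(x, dtype=float)) - ranks(np.asarray(y, dtype=float))
    n = len(d)
    return float(1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1)))

def rmse(x, y):
    """Root mean square error between objective and subjective scores."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

objective = [30.0, 42.0, 55.0, 61.0, 80.0]    # hypothetical predictions
subjective = [28.0, 45.0, 50.0, 66.0, 79.0]   # hypothetical DMOS values
results = (plcc(objective, subjective),
           srcc(objective, subjective),
           rmse(objective, subjective))
```

In practice PLCC and RMSE are usually computed after the nonlinear mapping of objective scores, while SRCC is unaffected by any monotonic mapping.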


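A five-parameter mapping function [50] is employed to nonlinearly regress the quality scores into a common space as follows,

$$f(x) = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (x - \beta_3)}} \right) + \beta_4 x + \beta_5$$

where $\beta_1, \dots, \beta_5$ are parameters to be computed with a curve fitting process.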
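For illustration, the widely used five-parameter logistic mapping can be written as a small function. The parameter values below are arbitrary placeholders chosen for the example; in an actual evaluation they would be obtained by fitting the function to the subjective scores, for instance with a nonlinear least-squares routine.

```python
import numpy as np

def logistic5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping from objective scores to the DMOS scale."""
    x = np.asarray(x, dtype=float)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# Placeholder parameters, not fitted values.
raw = np.array([0.2, 0.5, 0.8])
mapped = logistic5(raw, b1=40.0, b2=6.0, b3=0.5, b4=10.0, b5=45.0)
```

At `x = b3` the logistic term vanishes, so the mapped value is simply `b4 * b3 + b5`, which gives a quick sanity check on an implementation.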

IV-B Performance Evaluation on SIQAD

For performance evaluation, the proposed QODCNN model is trained on the SIQAD database. To distinguish the NR and FR models, the NR model is named QODCNN-NR and the FR model is named QODCNN-FR. For SIQAD, the 980 distorted SCIs are randomly divided into two subsets according to image content. The training set contains the 784 distorted SCIs associated with 16 reference images (80% of the data), and the remaining 196 distorted SCIs associated with 4 reference images are used as the testing set (20% of the data). The random training-testing partition is repeated 10 times, and the average performance over the ten experiments is reported as the overall performance.
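A content-based split of this kind can be sketched as below: reference images, not distorted images, are shuffled, so that all distorted versions of one reference land on the same side of the split. The function name `split_by_reference`, the use of seeds 0 to 9 and the assumption of 49 distorted images per reference (980 / 20) are illustrative.

```python
import numpy as np

def split_by_reference(ref_ids, n_train_refs=16, seed=0):
    """Split distorted-image indices so that no reference is shared."""
    ref_ids = np.asarray(ref_ids)
    refs = np.unique(ref_ids)
    rng = np.random.default_rng(seed)
    train_refs = rng.permutation(refs)[:n_train_refs]
    train_mask = np.isin(ref_ids, train_refs)
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)

# 20 reference images, each with 49 distorted versions (980 images in total).
ref_ids = np.repeat(np.arange(20), 49)
splits = [split_by_reference(ref_ids, seed=s) for s in range(10)]
```

Splitting by reference avoids content leakage: the test set then measures how well the model generalizes to unseen screen content rather than to unseen distortions of familiar content.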

In the second stage of our proposed QODCNN model, the proportion of data selected influences the performance of the fine-tuned model. Here, considering the differences in local quality within a large image, 10% to 80% of the image patches of each image in the second-stage training set are selected for training by adjusting the threshold of Eq. 5, and all of the image patches in the testing set are utilized to evaluate the performance. Experiments are repeated for different training-testing sets of the first stage and different proportions of training data. For each proportion of training data, the average performance of ten results is calculated as the overall performance. The best performance is obtained when 70% of the training data is used to fine-tune the pre-trained model, as can be seen in Table V and Fig. 7 in Section IV-D-1. Thus, a proportion of 70% is applied for the experiments on SIQAD and the cross-database experiments.



FR method          PLCC    SRCC    RMSE
PSNR               0.5869  0.5608  11.5859
SSIM [1]           0.7561  0.7566   9.3676
GMSD [6]           0.7259  0.7305   9.4684
SPQA [17]          0.8584  0.8416   7.3421
ESIM [18]          0.8788  0.8632   6.8310
SQMS [20]          0.8872  0.8803   6.6039
GFM [21]           0.8828  0.8735   6.7234
SFUW [19]          0.8910  0.8800   6.4990
MDOGS [22]         0.8839  0.8822   6.6951
CNN-SQE [33]       0.9040  0.8940   6.1150
QODCNN-FR          0.9142  0.9066   5.8015

NR method          PLCC    SRCC    RMSE
BLINDS-II [11]     0.7255  0.6813   9.4991
BRISQUE [12]       0.7708  0.7237   8.1342
BLIQUP-SCI [23]    0.7705  0.7990  10.0213
NRLT [24]          0.8442  0.8202   7.5957
CNN-Kang [25]      0.8487  0.8091   7.4472
WaDIQaM-NR [32]    0.8594  0.8522   7.0570
PICNN [35]         0.896   0.897    6.790
QODCNN-NR          0.9008  0.8888   6.2258
TABLE II: Experimental results of proposed and other existing FR and NR methods on SIQAD database
Metric  Type     PSNR    SSIM    GMSD    SPQA    ESIM    SQMS    SFUW    MDOGS   CNN-SQE  QODCNN-NR  QODCNN-FR
PLCC    GN       0.905   0.881   0.899   0.892   0.899   0.900   0.887   0.898   -        0.913      0.918
PLCC    GB       0.860   0.901   0.910   0.906   0.923   0.912   0.923   0.920   -        0.925      0.934
PLCC    MB       0.704   0.806   0.844   0.831   0.889   0.867   0.878   0.842   -        0.889      0.907
PLCC    CC       0.753   0.744   0.783   0.799   0.764   0.803   0.829   0.801   -        0.837      0.866
PLCC    JC       0.770   0.749   0.775   0.770   0.800   0.786   0.757   0.789   -        0.830      0.848
PLCC    J2C      0.789   0.775   0.851   0.825   0.789   0.826   0.815   0.861   -        0.818      0.857
PLCC    LSC      0.781   0.731   0.856   0.796   0.792   0.813   0.759   0.832   -        0.867      0.897
PLCC    Overall  0.587   0.756   0.726   0.858   0.879   0.887   0.891   0.884   0.904    0.901      0.914
SRCC    GN       0.879   0.870   0.886   0.882   0.876   0.886   0.869   0.888   0.893    0.905      0.907
SRCC    GB       0.858   0.892   0.912   0.902   0.924   0.915   0.917   0.919   0.924    0.916      0.921
SRCC    MB       0.713   0.804   0.844   0.826   0.894   0.869   0.874   0.835   0.904    0.871      0.895
SRCC    CC       0.683   0.641   0.544   0.615   0.611   0.695   0.722   0.664   0.665    0.700      0.778
SRCC    JC       0.757   0.758   0.771   0.767   0.799   0.789   0.750   0.786   0.847    0.815      0.829
SRCC    J2C      0.775   0.760   0.844   0.815   0.783   0.819   0.812   0.862   0.862    0.795      0.835
SRCC    LSC      0.793   0.737   0.859   0.800   0.796   0.829   0.754   0.851   0.887    0.882      0.898
SRCC    Overall  0.561   0.757   0.731   0.842   0.863   0.880   0.880   0.882   0.894    0.889      0.907
RMSE    GN       6.338   7.068   6.521   6.739   6.827   6.921   6.876   6.558   -        6.150      5.963
RMSE    GB       7.738   6.570   6.310   6.430   5.827   6.611   5.592   5.964   -        5.772      5.454
RMSE    MB       9.229   7.697   6.982   7.222   5.964   7.204   6.236   7.012   -        5.762      5.251
RMSE    CC       8.282   8.412   7.828   7.618   8.114   7.743   7.048   7.528   -        6.939      6.381
RMSE    JC       6.000   6.230   5.941   6.000   5.640   5.983   6.143   5.779   -        5.460      5.141
RMSE    J2C      6.382   6.591   5.459   5.871   6.388   6.050   6.023   5.293   -        6.000      5.286
RMSE    LSC      5.330   5.825   4.411   5.166   5.215   5.104   5.555   4.738   -        4.338      3.857
RMSE    Overall  11.590  9.368   9.642   7.342   6.831   7.297   6.499   6.695   6.115    6.226      5.801
TABLE III: Experimental Results of Our Proposed Models and Other State-of-the-art FR Models on Different Distortion Types on SIQAD.
PLCC GN 0.955 0.936 0.954 0.958 0.956 - 0.949 0.960
GB 0.778 0.871 0.797 0.836 0.870 - 0.845 0.866
MB 0.763 0.880 0.834 0.827 0.882 - 0.812 0.849
CC 0.755 0.708 0.811 0.878 0.791 - 0.752 0.817
JC 0.839 0.859 0.935 0.915 0.942 - 0.935 0.942
J2C 0.918 0.859 0.943 0.946 0.946 - 0.890 0.940
Overall 0.716 0.747 0.851 0.697 - 0.827 0.849 0.882
SRCC GN 0.944 0.917 0.934 0.946 0.946 - 0.947 0.938
GB 0.776 0.870 0.799 0.822 0.870 - 0.829 0.856
MB 0.756 0.859 0.815 0.801 0.861 - 0.803 0.828
CC 0.732 0.679 0.715 0.816 0.618 - 0.569 0.687
JC 0.833 0.850 0.934 0.914 0.946 - 0.929 0.935
J2C 0.907 0.846 0.928 0.931 0.936 - 0.865 0.916
Overall 0.673 0.716 0.843 0.668 - 0.822 0.848 0.876
TABLE IV: Cross-Database Evaluation (both SRCC and PLCC) of Our Proposed Models and Other FR Models on SCID.

IV-B1 Full-Reference Image Quality Assessment

The results of the QODCNN-FR model are shown in Table II and compared with other state-of-the-art FR models: PSNR, SSIM [1], GMSD [6], SPQA [17], ESIM [18], SQMS [20], GFM [21], SFUW [19], MDOGS [22] and CNN-SQE [33]. The last seven models are designed specifically for SCIs. Table II shows that FR methods designed for SCIs achieve higher performance than FR metrics designed for NIs. The reason is that these SCI models (SPQA, ESIM, SQMS, GFM, SFUW, MDOGS and CNN-SQE), unlike FR-IQA models for NIs, account for how the HVS responds to local areas consisting of pictorial and textual regions.

Among all FR metrics, QODCNN-FR obtains the best overall performance. In absolute terms, its SRCC is 2.44% higher than that of MDOGS and 1.26% higher than that of CNN-SQE, and its PLCC is 2.32% higher than that of SFUW and 1.02% higher than that of CNN-SQE. In addition, our FR model fully exploits the feature extraction and generalization ability of CNNs, whereas CNN-SQE uses a CNN only to distinguish pictorial from textual regions and otherwise relies mainly on traditional techniques.

IV-B2 No-Reference Image Quality Assessment

To demonstrate the performance of the proposed QODCNN-NR model, it is compared with the FR-IQA models above and with the following state-of-the-art NR perceptual quality evaluation methods: BLINDS-II [11], BRISQUE [12], BLIQUP-SCI [23], NRLT [24], CNN-Kang [25], WaDIQaM-NR [32] and PICNN [35]. Among these NR-IQA approaches, BLIQUP-SCI, NRLT and PICNN are designed for IQA of SCIs. In addition, CNN-SQE, CNN-Kang, WaDIQaM-NR and PICNN are CNN-based methods. Among traditional NR methods, NRLT shows excellent performance by utilizing global-scope statistical luminance and texture features.

The proposed QODCNN-NR model achieves the best performance on visual quality evaluation of SCIs compared with traditional FR and NR methods. Among CNN-based methods, our NR model outperforms CNN-Kang, WaDIQaM-NR and PICNN, and comes very close to the CNN-SQE FR model. In absolute terms, the PLCC of our model is 5.66% higher than that of the NRLT NR model, 1.69% higher than that of the MDOGS FR model, 5.21% higher than that of the CNN-Kang NR model, 4.14% higher than that of WaDIQaM-NR and 0.48% higher than that of PICNN. Although PICNN performs close to QODCNN-NR on SIQAD, since both models are designed for SCIs, the proposed QODCNN-NR generalizes better, as the cross-database results in Table IV show. The main reason is that our method fully exploits the feature extraction ability of CNNs and considers the divergence of local quality within a large SCI. In addition, our method fuses local quality into an image-level visual quality score using an adaptive and effective weighting method.
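The PLCC gaps quoted above can be checked directly against the NR entries of Table II. The following is a minimal sketch of that arithmetic; the dictionary `plcc` and the helper `gap_in_percent` are illustrative names, and only values that appear explicitly in Table II are used.

```python
# PLCC values copied from the NR rows of Table II (SIQAD).
plcc = {
    "NRLT": 0.8442,
    "CNN-Kang": 0.8487,
    "WaDIQaM-NR": 0.8594,
    "PICNN": 0.896,
    "QODCNN-NR": 0.9008,
}


def gap_in_percent(ours: float, other: float) -> float:
    """Absolute PLCC difference, scaled by 100 and rounded to two decimals."""
    return round((ours - other) * 100, 2)


# Gap between QODCNN-NR and every other NR method in the dictionary.
gaps = {
    name: gap_in_percent(plcc["QODCNN-NR"], value)
    for name, value in plcc.items()
    if name != "QODCNN-NR"
}

for name, gap in gaps.items():
    print(f"QODCNN-NR vs {name}: +{gap:.2f}")
```

The gaps are plain differences between correlation coefficients rather than relative improvements, which is why a difference of 0.0566 is reported as 5.66.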

80% 70% 60% 50% 40% 30% 20% 10%
PLCC 0.8849 0.8854 0.8908 0.8882 0.8858 0.8863 0.8877 0.8853 0.8830
SRCC 0.8650 0.8656 0.8706 0.8695 0.8671 0.8678 0.8694 0.8640 0.8647
RMSE 6.6930 6.6697 6.5924 6.5907 6.6426 6.6525 6.5936 6.6901 6.7400
TABLE V: Performance of QODCNN-NR-II with Different Proportion Training Data

IV-B3 Performance Comparison on Individual Distortion

The performance of our models and other FR models on each individual distortion type is shown in Table III. GMSD, ESIM, SFUW and MDOGS perform better on individual distortion types than the other traditional methods, mainly because these approaches consider edge and gradient information when predicting the visual quality of SCIs. The CNN-based learning models are superior to traditional methods: our FR model performs well on all distortion types, and even the QODCNN-NR model outperforms traditional FR approaches and is competitive on most distortion types. In particular, the proposed NR and FR models perform especially well on GN, CC, JC and LSC.
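Per-distortion figures such as those in Table III are obtained by evaluating each distortion subset separately. The sketch below groups predictions by distortion type and computes the SRCC within each group, in pure Python. The helper names and the six records are made up for illustration and are not taken from SIQAD.

```python
from collections import defaultdict


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def srcc_per_distortion(records):
    """records: iterable of (distortion_type, predicted_score, dmos)."""
    groups = defaultdict(lambda: ([], []))
    for distortion, predicted, dmos in records:
        groups[distortion][0].append(predicted)
        groups[distortion][1].append(dmos)
    return {d: srcc(p, s) for d, (p, s) in groups.items()}


# Synthetic records (hypothetical values, for illustration only).
records = [
    ("GN", 30.0, 32.0), ("GN", 45.0, 44.0), ("GN", 60.0, 63.0),
    ("CC", 40.0, 50.0), ("CC", 55.0, 45.0), ("CC", 70.0, 72.0),
]
per_type = srcc_per_distortion(records)
print(per_type)
```

In this toy data the GN predictions preserve the DMOS ordering exactly, while two CC predictions are swapped, so the CC subset receives a lower rank correlation.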

IV-C Cross-Database Evaluation

To verify the generalization of the proposed learning models, the CNN models are trained on SIQAD and tested on SCID. Because the two databases contain different distortion types, results are reported on the 6 distortion types they have in common: GN, GB, MB, CC, JC and J2C. All distorted SCIs of these six distortion types in SIQAD are used as the training set. For testing, the common practice of Mittal et al. [12] and Ye et al. [43] is followed: the 80% of distorted SCIs associated with 32 reference SCIs of SCID are randomly chosen to estimate the parameters of the nonlinear function (Eq. 13), and the remaining 20% of distorted SCIs are used for testing. This operation is repeated 1000 times, and the median performance is reported.
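The repeated split protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the paper: Eq. 13 is not reproduced in this section, so a least-squares line is swapped in as a stand-in for the nonlinear mapping, only RMSE is computed, and all function names and the synthetic samples are hypothetical.

```python
import random
import statistics


def fit_linear(x, y):
    """Least-squares line y ~ a*x + b (stand-in for the mapping of Eq. 13)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(pred, target):
    """Root mean squared error between two equal-length lists."""
    return (sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)) ** 0.5


def median_rmse(samples, n_fit_refs, repeats=1000, seed=0):
    """samples: list of (reference_id, raw_score, dmos).

    Each repetition splits by reference image, fits the mapping on the
    images of the chosen references and tests on the remaining ones.
    """
    rng = random.Random(seed)
    ref_ids = sorted({ref for ref, _, _ in samples})
    results = []
    for _ in range(repeats):
        fit_refs = set(rng.sample(ref_ids, n_fit_refs))
        fit = [(s, d) for r, s, d in samples if r in fit_refs]
        test = [(s, d) for r, s, d in samples if r not in fit_refs]
        a, b = fit_linear([s for s, _ in fit], [d for _, d in fit])
        mapped = [a * s + b for s, _ in test]
        results.append(rmse(mapped, [d for _, d in test]))
    return statistics.median(results)


# Synthetic data: 10 references, 5 distortion levels each (hypothetical).
samples = [
    (ref, float(level), 2.0 * level + 5.0 + (ref % 2))
    for ref in range(10)
    for level in range(1, 6)
]
result = median_rmse(samples, n_fit_refs=8, repeats=50)
print(f"median RMSE over 50 splits: {result:.3f}")
```

Splitting by reference image rather than by distorted image keeps all versions of one scene on the same side of the split, so the mapping is never fitted on content that also appears in the test portion.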

The cross-database performance of the proposed NR and FR models is compared with the following FR methods and CNN-based NR model: PSNR, SSIM [1], GMSD [6], VSI [5], ESIM [18] and PICNN [35]. Table IV shows that our models achieve better performance than PSNR, SSIM, GMSD, VSI and PICNN, and that our FR model performs similarly to ESIM, which is designed for SCIs. Comparing the two CNN-based NR models, QODCNN-NR and PICNN, our NR model obtains a significant performance improvement, which indicates stronger generalization ability.

IV-D Performance Analysis

Fig. 7: Comparison of prediction performance under different proportion training data.
Fig. 8: Comparison of prediction performance in three stages of both FR and NR models on SIQAD.
Fig. 9: Comparison of prediction cross-database performance in three stages of both FR and NR models on SCID.

To demonstrate the effectiveness of the two optimizations, TMED and WLQVLSD, the models of each stage are named as follows. The pretrained NR-IQA and FR-IQA models of the first training stage, which use average weighting, are named QODCNN-NR-I and QODCNN-FR-I. The fine-tuned models of the second stage, which also use average weighting, are denoted QODCNN-NR-II and QODCNN-FR-II. The fine-tuned models that employ the VLSD weighting method are named QODCNN-NR-III and QODCNN-FR-III.

IV-D1 Influence of Training Data Selection Proportion

To choose an appropriate proportion of training data, proportions from 80% to 10% are considered for NR-IQA. The experimental results on SIQAD are reported in Table V, and the cross-database results are shown in Fig. 7. Table V and Fig. 7 show that QODCNN-NR-II achieves better performance than QODCNN-NR-I except when a proportion of 20% is applied. In particular, QODCNN-NR-II obtains the best performance when 70% of the training data are used to fine-tune the pretrained model. Since the local quality distribution is similar for FR and NR models, 70% of the training data are also employed for the second stage of the FR-IQA models.
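The role of the selection proportion can be pictured with a small sketch: patches are ranked by the distance between the pretrained model's patch prediction and the DMOS of the source image, and a fixed fraction is kept for fine-tuning. This assumes that a smaller distance marks a more effective patch, which is an interpretation of the method rather than its exact definition. The function name, patch identifiers and scores are hypothetical.

```python
def select_patches(patches, keep_ratio):
    """Keep the fraction of patches whose prediction is closest to the DMOS.

    patches: list of (patch_id, predicted_score, image_dmos).
    keep_ratio: fraction in (0, 1], e.g. 0.7 for the 70% setting.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumption: a smaller |prediction - DMOS| marks a more effective patch.
    # For scalar scores the Euclidean distance is the absolute difference.
    ranked = sorted(patches, key=lambda p: abs(p[1] - p[2]))
    n_keep = max(1, round(keep_ratio * len(ranked)))
    return [patch_id for patch_id, _, _ in ranked[:n_keep]]


# Ten hypothetical patches from one image with DMOS 40.0.
offsets = [1.0, 15.0, -1.5, -20.0, 0.2, 3.0, -4.0, 9.0, -0.5, 6.0]
patches = [(f"p{i}", 40.0 + off, 40.0) for i, off in enumerate(offsets)]

kept = select_patches(patches, keep_ratio=0.7)
print(kept)
```

With the 70% setting, the three patches whose predictions deviate most from the image DMOS (here `p7`, `p1` and `p3`) are left out of the fine-tuning set.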

IV-D2 Effectiveness of Two Optimizations

To show the advantage of the two optimizations, the performance of the models in the three stages is shown in Fig. 8 and Fig. 9. Performance improves when TMED and WLQVLSD are employed, for both FR and NR models and on both databases. The experimental results therefore show that the two optimizations, TMED and WLQVLSD, are effective for visual perceptual prediction of SCIs. In addition, the NR model improves more than the FR model. We attribute this to the FR CNN-based model already predicting image quality more precisely because it uses reference information. TMED and WLQVLSD both aim to improve the accuracy of the model, so the higher the accuracy of the model before optimization, the smaller the remaining margin for improvement.
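The difference between the average pooling of stages I and II and the weighted pooling of stage III can be sketched as below. The VLSD weight itself is defined earlier in the paper and is not recomputed here; the weights are simply passed in, and the scores and weights in the example are hypothetical.

```python
def average_pooling(patch_scores):
    """Image score as the plain mean of the patch scores (stages I and II)."""
    return sum(patch_scores) / len(patch_scores)


def weighted_pooling(patch_scores, weights, eps=1e-12):
    """Image score as a weight-normalized sum of patch scores (stage III).

    weights: one non-negative weight per patch (e.g. a texture/edge index).
    Falls back to average pooling if all weights are (near) zero.
    """
    if len(patch_scores) != len(weights):
        raise ValueError("patch_scores and weights must have equal length")
    total = sum(weights)
    if total <= eps:
        return average_pooling(patch_scores)
    return sum(w * q for w, q in zip(weights, patch_scores)) / total


# Hypothetical patch scores; the flat first patch gets zero weight.
scores = [30.0, 50.0, 70.0]
weights = [0.0, 1.0, 3.0]

pooled_avg = average_pooling(scores)
pooled_weighted = weighted_pooling(scores, weights)
print(pooled_avg, pooled_weighted)
```

Normalizing by the sum of the weights keeps the pooled score on the same scale as the patch scores, so the outputs of the two pooling schemes remain directly comparable.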

IV-D3 Reduce Overfitting

Overfitting is an important problem in machine learning, and two measures are adopted in our model to address it. First, data augmentation is employed to generate a large amount of training data by cropping each image into small patches. Second, in the model architecture, BN layers are added to improve the learning ability of the network and regularization is used to generate a sparse model; both are effective ways to reduce overfitting. Both measures were tested and found to be valid, and they help to improve generalization ability. Our experiments on SIQAD and SCID verify that our models achieve excellent performance on visual quality evaluation of SCIs.
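The patch-based augmentation can be illustrated with a minimal cropping routine. The patch size and stride below are hypothetical and chosen only to keep the example small; the image is a nested list so that the sketch needs no external libraries.

```python
def crop_patches(image, patch_size, stride):
    """Crop a 2-D image (list of rows) into square patches.

    Patches that would extend past the image border are skipped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A 4 x 6 toy "image" whose pixel values encode their position.
image = [[r * 6 + c for c in range(6)] for r in range(4)]

patches = crop_patches(image, patch_size=2, stride=2)
print(len(patches), patches[0], patches[-1])
```

A stride smaller than the patch size would produce overlapping patches and hence more training samples from the same image.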

V Conclusion

In this paper, a neural network-based model with two optimizations is presented for full-reference and no-reference image quality assessment of SCIs. The QODCNN model consists of three steps. In the first step, an effective CNN model is proposed to predict the visual quality of SCIs for both FR and NR, employing a concatenate layer to control the input of reference information. In the second step, the Euclidean distance between DMOSs and predicted scores is employed to select more effective data, and the pretrained model is fine-tuned with these data. In the third step, local weights are measured using VLSD, a noise-robust index, and applied to fuse local quality into image visual quality. Experimental results on SIQAD demonstrate that the proposed FR and NR models achieve the best performance compared with state-of-the-art approaches. In addition, cross-database experimental results on SCID illustrate the strong generalization ability of our models and the effectiveness of the two optimizations, TMED and WLQVLSD.


  • [1] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
  • [2] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. IEEE Asilomar Conf. Signals, Syst, Comput, pp. 1398-1402, Nov. 2003.
  • [3] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185-1198, May 2011.
  • [4] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378-2386, Aug. 2011.
  • [5] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index for perceptual image quality assessment,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4270-4281, Aug. 2014.
  • [6] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684-695, Feb. 2014.
  • [7] A. Liu, W. Lin and M. Narwaria, “Image Quality Assessment Based on Gradient Similarity,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1500-1512, April 2012.
  • [8] J. Wu, G. Shi, W. Lin and X. Wang, “Reduced-reference image quality assessment with orientation selectivity based visual pattern,” in IEEE China Summit and International Conference on Signal and Information Processing, pp. 663-666, 2015.
  • [9] S. Wang, K. Gu, X. Zhang, W. Lin, S. Ma and W. Gao, “Reduced-Reference Quality Assessment of Screen Content Images,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 1, pp. 1-14, Jan. 2018.
  • [10] L. Ma, S. Li and K. N. Ngan, “Reduced-Reference Video Quality Assessment of Compressed Video Sequences,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 10, pp. 1441-1456, Oct. 2012.
  • [11] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339-3352, Aug. 2012.
  • [12] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, Dec. 2012.
  • [13] Y. Zhang, A. K. Moorthy, D. M. Chandler, and A. C. Bovik, “C-DIIVINE: No-reference image quality assessment based on local magnitude and phase statistics of natural scenes,” Signal Processing: Image Communication, vol. 29, no. 7, pp. 725-747, Aug. 2014.
  • [14] Q. Wu, Z. Wang, and H. Li, “A highly efficient method for blind image quality assessment,” in IEEE International Conference on Image Processing, pp. 339-343, Sep. 2015.
  • [15] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a completely blind image quality analyzer,” IEEE Signal Processing Letters, vol. 22, no. 3, pp. 209-212, Mar. 2013.
  • [16] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579-2591, Aug. 2015.
  • [17] H. Yang, Y. Fang and W. Lin, “Perceptual Quality Assessment of Screen Content Images,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4408-4421, Nov. 2015.
  • [18] Z. Ni, L. Ma, H. Zeng, J. Chen, C. Cai and K. K. Ma, “ESIM: Edge Similarity for Screen Content Image Quality Assessment,” IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4818-4831, Oct. 2017.
  • [19] Y. Fang, J. Yan, J. Liu, S. Wang, Q. Li and Z. Guo, “Objective Quality Assessment of Screen Content Images by Uncertainty Weighting,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2016-2027, April 2017.
  • [20] K. Gu et al., “Saliency-Guided Quality Assessment of Screen Content Images,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1098-1110, June 2016.
  • [21] Z. Ni, H. Zeng, L. Ma, J. Hou, J. Chen and K. Ma, “A Gabor Feature-Based Quality Assessment Model for the Screen Content Images,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4516-4528, Sept. 2018.
  • [22] Y. Fu, H. Zeng, L. Ma, Z. Ni, J. Zhu and K. Ma, “Screen Content Image Quality Assessment Using Multi-Scale Difference of Gaussian,” IEEE Transactions on Circuits and Systems for Video Technology. doi: 10.1109/TCSVT.2018.2854176, 2018.
  • [23] F. Shao, Y. Gao, F. Li and G. Jiang, “Toward a Blind Quality Predictor for Screen Content Images,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, DOI. 10.1109/TSMC.2017.2676180, 2018.
  • [24] Y. Fang, J. Yan, L. Li, J. Wu and W. Lin, “No Reference Quality Assessment for Screen Content Images with Both Local and Global Feature Representation,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1600-1610, April 2018.
  • [25] L. Kang, P. Ye, Y. Li and D. Doermann, “Convolutional Neural Networks for No-Reference Image Quality Assessment,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1733-1740, 2014.
  • [26] Y. Li, X. Ye, and Y. Li, “Image quality assessment using deep convolutional networks,” AIP Advances, vol. 7, no. 12, p. 125324, 2017.
  • [27] B. Bare, K. Li and B. Yan, “An accurate deep convolutional neural networks model for no-reference image quality assessment,” in IEEE International Conference on Multimedia and Expo, pp. 1356-1361, 2017.
  • [28] H. Wang, L. Zuo and J. Fu, “Distortion recognition for image quality assessment with convolutional neural network,” in IEEE International Conference on Multimedia and Expo, pp. 1-6, 2016.
  • [29] J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang and A. C. Bovik, “Deep Convolutional Neural Models for Picture-Quality Prediction: Challenges and Solutions to Data-Driven Image Quality Assessment,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 130-141, Nov. 2017.
  • [30] J. Kim, A. D. Nguyen and S. Lee, “Deep CNN-Based Blind Image Quality Predictor,” IEEE Transactions on Neural Networks and Learning Systems. doi: 10.1109/TNNLS.2018.2829819, 2018.
  • [31] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang and W. Zuo, “End-to-End Blind Image Quality Assessment Using Deep Neural Networks,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1202-1213, March 2018.
  • [32] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand and W. Samek, “Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206-219, Jan. 2018.
  • [33] Y. Zhang, D. M. Chandler and X. Mou, “Quality Assessment of Screen Content Images via Convolutional-Neural-Network-Based Synthetic/Natural Segmentation,” IEEE Transactions on Image Processing. doi: 10.1109/TIP.2018.2851390, 2018.
  • [34] L. Zuo, H. Wang and J. Fu, “Screen content image quality assessment via convolutional neural network,” in IEEE International Conference on Image Processing, pp. 2082-2086, 2016.
  • [35] J. Chen, L. Shen, L. Zheng and X. Jiang, “Naturalization Module in Neural Networks for Screen Content Image Quality Assessment,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1685-1689, Nov. 2018.
  • [36] J. Kim and S. Lee, “Fully deep blind image quality predictor,” IEEE Journal of Selected Topics Signal Processing, vol. 11, no. 1, pp. 206-220, Feb. 2017.
  • [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
  • [38] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in International Conference On Machine Learning, 2015.
  • [39] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference On Machine Learning, 2010.
  • [40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, Dec. 2014.
  • [41] Methodology for the Subjective Assessment of the Quality of Television Pictures, document Rec. ITU-R BT.500-11, International Telecommunications Union, 2012.
  • [42] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “Learning a blind quality evaluation engine of screen content images,” Neurocomputing, vol. 196, pp. 140-149, Jul. 2016.
  • [43] P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores: Blind image quality assessment based on synthetic scores,” in IEEE Conference Computer Vision and Pattern Recognition, pp. 4241-4248, 2014.