Recent years have witnessed a surge of applications based on deep learning, which primarily rely on training neural networks with large-scale labelled data. No-reference image quality assessment (NRIQA), which operates without a pristine reference, has also greatly benefited from this principled pipeline [moorthy2011blind, gu2014using, mittal2012no, kang2014convolutional]. However, the strong assumption that the training and testing data are drawn from closely aligned feature spaces and distributions creates the risk of poor generalization; as a consequence, inaccurate predictions are obtained on images whose statistics differ dramatically from those in the training set. Moreover, in real application scenarios, obtaining ground-truth labels via subjective testing for every content of interest is time-consuming and cumbersome. As such, although there are numerous labelled image quality assessment datasets based on natural images, emerging image creation routines and methods bring new challenges to NRIQA based on the available labelled data.
In this paper, we answer the question of whether the quality of natural scenes is transferable by designing a novel NRIQA method based on unsupervised domain adaptation. There are three reasons why natural images (NIs) are treated as the source domain in this study. First, there are numerous datasets of natural images with subjective ratings, and the need for transferring from statistically regular natural images to other types of content is becoming more pronounced. Second, there is an increasing consensus that the human visual system (HVS) evolves with natural scene statistics, such that transferring from natural to unnatural scenes could closely resemble human behavioral responses when evaluating the quality of artificially created images. Third, it is widely acknowledged that artificially created images do not follow natural scene statistics, so it is meaningful to investigate the transfer capability, since such a methodology could feasibly be extended to many other scenarios. As such, in contrast to natural images, which form the source domain, we choose screen content images (SCIs) as the target domain. Fig. 1 shows examples of SCIs and NIs in the SIQAD [yang2015perceptual] and TID2013 [ponomarenko2015image] databases, respectively. Interestingly, although similar levels of distortion are injected into the SCIs and NIs, the quality ranking between the distorted SCIs and that between the distorted NIs are opposite. The underlying reason can be attributed to the different statistical properties of the images. Therefore, given the subjective ratings of the natural images only, it is quite a challenging task to transfer the quality of natural scenes to unnatural screen content.
The transferability of quality prediction differs substantially from that of other computer vision tasks (e.g., object and action recognition). Quality assessment, whose aim is to match human measurements of perceptual quality, relies heavily on image content. To tackle this problem, we propose to leverage domain adaptation (DA), in an effort to learn an NRIQA model specifically for SCIs (target domain) from the NIs (source domain) and the corresponding ground-truth ratings of the NIs. This scenario falls into unsupervised domain adaptation, which has been widely studied in the literature [ben2007analysis, ben2010theory, fernando2013unsupervised]. However, directly transferring a quality prediction model from NIs to SCIs is difficult due to the underlying differences in their characteristics. Instead of forcing the model to predict quality accurately on both domains, we propose to explore the transferability of the pair-wise relationship via learning to rank, such that a model inferring the quality rank of a pair of images can be learned. More concretely, discriminable image pairs from the source and target domains are selected to learn the ranking model. Grounded in work that embeds DA in the process of representation learning, a domain-invariant feature that accounts for the ranking is expected to be learned, such that with the reduction of domain shift, the knowledge learned in the source domain can be transferred to the target domain and significantly improve performance. To this end, we introduce two complementary losses to explicitly regularize the feature space of the pair-wise relationship in a progressive manner. For feature extraction, we introduce a maximum mean discrepancy (MMD) loss to reduce the pair-wise feature discrepancy between the source and target domains, such that a latent feature space can be shared. Regarding the classifier, we propose a center-based loss to rectify the classifier on the target domain, which further improves the performance of our model. The superior performance of the proposed scheme provides useful evidence of transferability from the source domain to the target domain, and this paradigm can also be extended in multiple ways to predict the quality of images/videos in a specific domain without deliberately acquiring subjective ratings for training.
II Related Works
II-A NRIQA for NIs
Conventional NRIQA methods rely on the theory that natural scene statistics (NSS) govern the perception of natural images, such that distortion is reflected by the destruction of naturalness. In [moorthy2011blind], the un-naturalness of distorted images is characterized based on NSS in the wavelet domain. Saad et al. established an NSS model in the discrete cosine transform (DCT) domain, where quality is predicted by Bayesian inference [saad2012blind]. Different from quality regression, Hou et al. designed a deep learning model that classifies NSS features into five grades, corresponding to five quality levels [hou2014blind]. In general, deep learning based methods rely on large-scale training samples with subjective ratings as label information [kang2014convolutional, kim2016fully, bosse2017deep, bianco2018use, gu2019blind, fu2016blind, kim2018multiple]. To compensate for insufficient training data, extra synthetic databases have also been exploited [zhang2018blind, ma2017end], in which a distortion type identification network serves as "prior knowledge" and is combined with the quality prediction network. Different from learning with a single image, ranking based methods [liu2017rankiqa] have also been proposed to enrich the training data. However, to straightforwardly acquire the rank information, the image content within one pair is usually required to be identical, which limits the capacity for cross-content quality prediction.
II-B NRIQA for SCIs
Due to the distinct statistics of SCIs, numerous NRIQA methods have been specifically developed for them. In [gu2017no], four types of features, including picture complexity, screen content statistics, global brightness and sharpness of details, are extracted for SCI quality prediction. In [fang2017no], inspired by perception characteristics, Fang et al. designed a quality assessment method combining local and global texture features with luminance features. Driven by the hypothesis that the HVS is highly sensitive to sharp edges, in [zheng2019no] the regions of an SCI are divided into sharp-edge and non-sharp-edge regions, such that hybrid region based features are extracted for no-reference SCI quality assessment. Benefiting from the powerful feature extraction capability of CNNs, Zuo et al. proposed a CNN based framework with two sub-networks, where one sub-network produces the local quality of each image patch and the other is responsible for image-level quality fusion [zuo2016screen].
II-C Domain Adaptation
Transfer learning has emerged as an effective technique for addressing the lack of sufficient labels in a target domain by learning from a labelled source domain. However, the gap between the source and target domains tends to cause performance degradation. To alleviate this issue, domain adaptation (DA) has received considerable attention. Common practices minimize the feature discrepancy between the source and target domains via the maximum mean discrepancy (MMD) [long2017deep, yan2017mind], correlation alignment (CORAL) [sun2016deep, peng2018synthetic], or the Kullback-Leibler (KL) divergence [zhuang2015supervised]. Another line of research confuses a domain discriminator through adversarial learning. In [ganin2014unsupervised], a gradient reversal layer is proposed for domain-invariant feature learning. Generative adversarial networks (GANs) have also been adopted for DA [liu2016coupled, isola2017image] to transform the appearance of source samples, in an effort to make them similar to target samples. However, these methods only constrain the distributions in the feature space while neglecting the distributions of classification results in the label space, possibly leading to inadequate DA.
III The Proposed Scheme
Our goal is to learn an NRIQA model that captures and quantifies the regularities of SCIs based on ground-truth ratings of NIs. Instead of simply transferring continuous quality ratings, we learn to rank the quality of image pairs, which has been regarded as an alternative yet promising paradigm for conveying quality, but has not been fully exploited in DA based quality prediction tasks. First, it is generally acknowledged that the naturalness statistics of SCIs are highly different from those of NIs. In this regard, an empirical experiment is conducted, as shown in Fig. 3. In particular, we compute the naturalness distributions of NIs and SCIs; it is apparent that the NIs follow a Gaussian distribution that is highly different from the statistics of SCIs, due to the fact that SCIs are computer generated rather than captured by optical cameras. More specifically, following [mittal2012making], the naturalness value is calculated as follows,

$$\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + 1},$$

where $i$ and $j$ are the spatial indices in an image $I$. The mean $\mu(i,j)$ and deviation $\sigma(i,j)$ are computed as follows,

$$\mu(i,j) = \sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\, I(i+k, j+l),$$

$$\sigma(i,j) = \sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\,\big(I(i+k, j+l) - \mu(i,j)\big)^2},$$

where $w = \{w_{k,l}\}$ is a 2D circularly-symmetric Gaussian weighting function. As such, we transform the DA based quality prediction into a relatively less ambiguous task. This aligns with the cognitive process, as it is usually much more straightforward to compare a pair of images than to provide rating scales (e.g., 5-star). Second, learning from image pairs greatly expands the training samples, which further alleviates the over-fitting problem to some extent. Third, it is quite feasible to obtain global quality predictions by aggregating pairwise comparison data. These motivations inspire us to explore the transferability of the pairwise relationship between the source and target domains.
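As an illustration, the mean-subtracted contrast-normalized (MSCN) computation above can be sketched in plain NumPy. The window size and Gaussian width below are illustrative assumptions (common choices in the NIQE/BRISQUE literature), not necessarily the paper's exact settings:

```python
import numpy as np

def gaussian_window(size=7, sigma=7 / 6):
    """2D circularly-symmetric Gaussian weighting function, normalized to sum to 1.
    size and sigma are illustrative assumptions."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

def mscn(image, C=1.0):
    """MSCN (naturalness) coefficients of a grayscale image: subtract the local
    Gaussian-weighted mean and divide by the local deviation plus a constant."""
    w = gaussian_window()
    k = w.shape[0] // 2
    padded = np.pad(image.astype(np.float64), k, mode="reflect")
    H, W = image.shape
    mu = np.empty((H, W))
    sigma = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 2 * k + 1, j:j + 2 * k + 1]
            mu[i, j] = (w * patch).sum()
            sigma[i, j] = np.sqrt((w * (patch - mu[i, j]) ** 2).sum())
    return (image - mu) / (sigma + C)
```

For an NI, the histogram of these coefficients is close to Gaussian; deviations from that shape are what Fig. 3 contrasts between NIs and SCIs.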
Given NI pairs and their labels in an NI dataset as the source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, and unlabelled SCI pairs as the target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$, our task is to learn a feature extractor $G$ and a ranking classifier $C$ such that the expected target risk

$$\epsilon_t = \mathbb{E}_{x^t \sim \mathcal{D}_t}\big[\mathcal{L}\big(C(G(x^t)), y^t\big)\big]$$

can be minimized with a certain classification loss function $\mathcal{L}$, where $y^t$ denotes the corresponding ground truth of $x^t$ from the target domain. To be specific, denoting the two images of a pair as $I_a$ and $I_b$ and assuming their quality values are $q_a$ and $q_b$, the probability of the ranking classification $p(I_a \succ I_b)$ can be estimated as follows:

$$p(I_a \succ I_b) = \frac{e^{q_a}}{e^{q_a} + e^{q_b}}.$$
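This pairwise ranking probability is a Bradley-Terry/softmax form over the two quality values. A minimal sketch (with hypothetical quality values q_a and q_b), written in a numerically stable way:

```python
import math

def rank_probability(q_a, q_b):
    """Probability that image a is ranked higher-quality than image b under a
    softmax over the two quality values (Bradley-Terry style).
    exp(q_a) / (exp(q_a) + exp(q_b)) rewritten as a stable sigmoid."""
    return 1.0 / (1.0 + math.exp(q_b - q_a))
```

Equal quality values yield probability 0.5, and the two orderings are complementary, which is what allows a binary classifier to stand in for the ranking.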
In [ben2007analysis, ben2010theory], Ben-David et al. proved that the upper bound of the empirical risk on the target domain is jointly determined by the empirical risk on the source domain and the discrepancy between the source and target domains. In this paper, we model the distance between the source and target domains by considering the joint distribution of data pairs, which can be formulated as follows:

$$d(\mathcal{D}_s, \mathcal{D}_t) = d\big(P_s(f_r), P_t(f_r)\big) + d\big(Q_s(y \mid f_r), Q_t(y \mid f_r)\big),$$

where $P_s(f_r)$ and $P_t(f_r)$ denote the marginal distributions of the ranking features $f_r$ in the source and target domains, respectively. Moreover, $Q_s(y \mid f_r)$ and $Q_t(y \mid f_r)$ denote the conditional distributions of the ranking outputs given the ranking features in the source and target domains, respectively. The distribution discrepancy measurement will be introduced in Section 3.4.
To minimize the empirical risk on the target domain, we propose to reduce the domain distribution discrepancy by jointly constraining the distance between the marginal feature distributions via an MMD loss and the distance between the conditional output distributions via a classifier rectification loss. The empirical risk on the source domain is minimized with the cross-entropy loss for pairwise ranking classification based on the labelled NI pairs.
III-B Discriminable Image Pairs Selection
One may consider generating rankings from arbitrary images of the source and target domains to create the training set for ranking. However, we argue that this is not optimal, for two reasons. First, intrinsically ambiguous pairs are difficult to distinguish even for the HVS, particularly when the quality scores of the image pair are extremely close. As such, the labels of such hard sample pairs may not be credible [ma2017dipiq] and can hinder model convergence when involved during training. Second, forcing the network to distinguish indiscriminable image pairs is likely to cause over-fitting to the source domain, resulting in negative transfer to the target domain. To this end, we propose to select NI pairs whose quality score difference is governed by a threshold, rendering discriminable pairs instead of random pairs as follows,

$$|q_i - q_j| \ge T \cdot (q_{\max} - q_{\min}),$$

where $T$ is a pre-set ratio of the score gap between the maximum quality score $q_{\max}$ and the minimum quality score $q_{\min}$ in the dataset. Regarding SCIs, we also select discriminable image pairs by Eqn. (7), in order to align the quality scales of the source and target domains. As ground truth is not available in the target domain, we use the DB-CNN model [zhang2018blind], pre-trained on an NI dataset (TID2013), to predict pseudo ratings of SCIs as guidance. It is worth mentioning that although the predicted quality values may not be sufficiently accurate, they only serve as guidance through the differences between image pairs, such that discriminable image pairs can be selected by a predefined threshold. We also find this selection process necessary for classifier rectification, making it easier for our model to learn the center features shared by NIs and SCIs.
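The selection rule in Eqn. (7) can be sketched as follows; `select_discriminable_pairs` and its `ratio` argument are illustrative names, with the default of 0.2 taken from the implementation details:

```python
import itertools

def select_discriminable_pairs(scores, ratio=0.2):
    """Select index pairs whose (pseudo-)quality gap is at least a fixed
    fraction of the dataset's score range; near-ties are discarded as
    indiscriminable. scores may be ground-truth MOS (source domain) or
    pseudo ratings from a pre-trained model (target domain)."""
    gap = ratio * (max(scores) - min(scores))
    pairs = []
    for i, j in itertools.combinations(range(len(scores)), 2):
        if abs(scores[i] - scores[j]) >= gap:
            # binary label: 1 if the first image has higher quality
            label = 1 if scores[i] > scores[j] else 0
            pairs.append((i, j, label))
    return pairs
```

The same routine serves both domains, which is what aligns the quality scale of the pseudo-labelled SCI pairs with the labelled NI pairs.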
III-C Architecture for Ranking
As shown in Fig. 2, our aim is to transfer the quality relationship from NI pairs to SCI pairs with a pairwise ranking model based on DA. The architecture of our model mainly consists of two parts: a ranking feature generator G and a ranking feature classifier C. For the ranking feature generator, we first use a pre-trained CNN on the selected discriminable image pairs to extract a feature from each single image. The CNN adopted is the tailored SCNN introduced in [zhang2018blind], which is pretrained on NIs with different distortion types. Here, only the layers up to the second-to-last are used as the feature encoder. There are two reasons why we choose SCNN as our encoder. First, SCNN is a much lighter network than other pretrained networks such as VGG [simonyan2014very] or ResNet [he2016deep]. Second, the prior knowledge of distortion types that SCNN carries makes our model more stable during training, as evidenced by our experiments. We denote the distortion features extracted from the two images $I_a$ and $I_b$ of an image pair as $f_a$ and $f_b$. With the assumption that the ranking information of an image pair can be acquired by measuring their differences on each distortion type, we use the relative distance of their features as the ranking feature $f_r$,

$$f_r = f_a - f_b.$$
Subsequently, for the ranking feature classification, a classifier consisting of a fully connected layer and a softmax layer is trained with a binary cross-entropy loss,

$$\mathcal{L}_{ce} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big],$$

where $n$ is the batch size, $i$ indexes the input pairs with binary labels $y_i$, and $p_i$ represents the predicted probability that the first image has better quality than the second one in the $i$-th input pair.
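A minimal NumPy sketch of the ranking head described above (difference feature, fully connected layer, softmax, binary cross entropy). Function and variable names are our own, and treating the second softmax column as "first image is better" is an assumption for illustration:

```python
import numpy as np

def ranking_forward(feat_a, feat_b, W, b):
    """Sketch of the ranking head: the ranking feature is the difference of the
    two per-image distortion features; a fully connected layer plus softmax
    yields two class probabilities. Column 1 is read as the probability that
    the first image has better quality (an assumed convention)."""
    f_r = feat_a - feat_b                                  # ranking feature
    logits = f_r @ W + b                                   # fully connected layer, 2 classes
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = z / z.sum(axis=-1, keepdims=True)
    return f_r, probs

def bce_loss(probs, labels):
    """Binary cross entropy on the 'first image is better' probability."""
    p = probs[:, 1]
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

Because the ranking feature is antisymmetric in the pair order, swapping the two images (with zero bias) simply swaps the two class probabilities.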
III-D Domain Alignment
In this subsection, we introduce how the source and target domains are aligned to transfer quality from the source to the target domain. Though it is generally acknowledged that there are dramatic differences between the source and target domains in terms of the statistics presumably perceived by the HVS, the shareable feature responses, which are transferable and subject to being learned with DA, originate from the relative quality rank across content and distortion types. As discussed above, we propose to conduct domain adaptation by jointly considering the marginal distribution of the ranking features and the distribution of the ranking outputs conditioned on those features.
Regarding the marginal distribution of the ranking features, we propose to reduce the discrepancy between the source and target feature distributions with a maximum mean discrepancy (MMD) loss, which can be formulated as

$$\mathcal{L}_{mmd} = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} \phi\big(f_{r,i}^s\big) - \frac{1}{n_t}\sum_{j=1}^{n_t} \phi\big(f_{r,j}^t\big) \right\|_{\mathcal{H}}^2,$$

where $n_s$ and $n_t$ indicate the numbers of NI and SCI samples in a batch, respectively, and $\phi(\cdot)$ is a function that maps the features into a reproducing kernel Hilbert space (RKHS) [gretton2012kernel]. We apply the kernel trick with a Gaussian kernel [gretton2012kernel] to compute $\mathcal{L}_{mmd}$, setting the kernel bandwidth to the median distance between all pairwise data points in the batch, such that the discrepancy of the marginal distributions is minimized.
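The MMD loss with a Gaussian kernel and the median bandwidth heuristic can be sketched as follows. This is a biased empirical estimate and assumes the batch contains at least two distinct feature vectors:

```python
import numpy as np

def mmd_gaussian(Xs, Xt):
    """Squared MMD between source (Xs) and target (Xt) ranking features,
    rows = samples. Gaussian kernel; bandwidth set by the median heuristic
    over all nonzero pairwise squared distances in the joint batch."""
    X = np.vstack([Xs, Xt])
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    bw = np.median(d2[d2 > 0])                            # median heuristic bandwidth
    K = np.exp(-d2 / bw)
    ns, nt = len(Xs), len(Xt)
    Kss, Ktt, Kst = K[:ns, :ns], K[ns:, ns:], K[:ns, ns:]
    # biased V-statistic estimate of ||mean embedding difference||^2 in the RKHS
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()
```

Identical batches give a discrepancy of zero, and shifting one batch away from the other increases it, which is the behavior the loss penalizes during training.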
Regarding the conditional distribution of the ranking outputs given the ranking features, we propose a classifier rectification loss to improve the discrimination capability of the classifier on the target domain. More concretely, we first apply a center loss [wen2016discriminative] to learn a center for the ranking features of each class on the source domain as follows:

$$\mathcal{L}_{ct} = \frac{1}{2}\sum_{i=1}^{n_s}\Big(\mathbb{1}[y_i = 0]\,\big\|f_{r,i}^s - c_0\big\|_2^2 + \mathbb{1}[y_i = 1]\,\big\|f_{r,i}^s - c_1\big\|_2^2\Big),$$

where $\mathbb{1}[\cdot]$ equals 1 if the condition is satisfied and 0 otherwise. In addition, $c_0$ and $c_1$ are the learned class-specific centers. The center loss simultaneously learns the center of each class and penalizes the distances between the features and their corresponding centers. In particular, with the centers acquired from the source domain, they can be further applied to cluster the ranking features in the target domain,
$$\mathcal{L}_{rec} = \frac{1}{n_t}\sum_{j=1}^{n_t}\Big(p_j^0\,\big\|f_{r,j}^t - c_0\big\|_2^2 + \big(1 - p_j^0\big)\,\big\|f_{r,j}^t - c_1\big\|_2^2\Big),$$

where $p_j^0$ indicates the likelihood of the $j$-th sample being classified to class 0. There are two advantages of imposing $\mathcal{L}_{rec}$. First, wrong classification results are rectified based on the distances between the features and the two centers in the target domain, as $\mathcal{L}_{rec}$ decreases when the ranking features are classified to their closest centers with high probability. Second, the center of each class is updated gradually as the ranking features of NIs and SCIs vary. This further improves feature separability, and finally two class-specific centers shared by NIs and SCIs can be acquired.
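The two losses above can be sketched as follows. The probability-weighted form of the rectification loss is our reading of the text (confident assignments to the nearest center lower the loss), not a verified reproduction of the paper's exact formula:

```python
import numpy as np

def center_loss(features, labels, centers):
    """Source-domain center loss: squared distance of each ranking feature to
    the center of its own class (centers has shape (2, d))."""
    diffs = features - centers[labels]
    return 0.5 * np.mean((diffs ** 2).sum(axis=1))

def rectification_loss(features, probs_class0, centers):
    """Target-domain rectification (assumed probability-weighted form): each
    feature is pulled toward the two centers in proportion to the classifier's
    class likelihoods, so confidently assigning a feature to its nearest
    center yields a small loss."""
    d0 = ((features - centers[0]) ** 2).sum(axis=1)
    d1 = ((features - centers[1]) ** 2).sum(axis=1)
    return np.mean(probs_class0 * d0 + (1 - probs_class0) * d1)
```

In training, the centers themselves would be updated alongside the network parameters, as in the original center-loss formulation.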
III-E Quality Prediction and Model Refining
The total loss function can be summarized as follows,

$$\mathcal{L} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{mmd} + \lambda_3 \mathcal{L}_{rec},$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weighting factors. Given the trained ranking model, it is further applied to predict the quality of SCIs. To precisely obtain the quality of an image $I_i$, we compare it with all the remaining images, and its quality value is given by,

$$q_i = \frac{1}{N - 1}\sum_{j \ne i} r_{ij},$$

where $N$ is the number of all the SCIs and $r_{ij} \in \{0, 1\}$ is the binary comparison result between image $I_i$ and image $I_j$. In particular, $r_{ij} = 1$ means the quality of $I_i$ is better than that of $I_j$, and vice versa. After obtaining $q_i$, the discriminable image pair re-selection can be governed by the predicted quality instead of the pre-trained DB-CNN model, such that our model can be refined by retraining. This refining operation is performed iteratively until the number of iterations reaches a maximum value or the average difference of quality (ADQ) predicted by the current and previous models falls below a given threshold. Algorithm 1 summarizes the training procedure.
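The aggregation of pairwise comparisons into per-image quality can be sketched as follows; normalizing by the number of opponents (i.e., reading the score as a win rate) is an assumption for illustration:

```python
import numpy as np

def aggregate_quality(rank_fn, n):
    """Quality of each of n images as its win rate against all other images,
    where rank_fn(i, j) returns 1 if image i is predicted to be better than
    image j, and 0 otherwise (e.g., the trained ranking classifier)."""
    q = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i] += rank_fn(i, j)
    return q / (n - 1)
```

These aggregated scores can then drive the discriminable-pair re-selection in the refining loop, replacing the initial DB-CNN pseudo ratings.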
IV Experimental Results
Table 1 (excerpt). Performance on the target domain SIQAD:

| Method | Category | PLCC | MAE | RMSE | SRCC | KRCC |
|---|---|---|---|---|---|---|
| NIQE | Conventional NRIQA | 0.2967 | 11.059 | 13.548 | 0.2863 | 0.1963 |
| Rank | Deep learning based (trained on TID2013) | 0.2547 | 11.257 | 13.719 | 0.2303 | 0.1557 |
| Rank | Deep learning based (trained on LIVE) | 0.1745 | 11.486 | 13.970 | 0.1695 | 0.1149 |
To show the effectiveness of our method, we evaluate our model based on four datasets, including two for NIs (TID2013 [ponomarenko2015image] and LIVE [sheikh2003image]) and two for SCIs (SIQAD [yang2015perceptual] and QACS [wang2016subjective]).
TID2013. The TID2013 dataset consists of 3000 NIs derived from 25 reference images. The reference images are corrupted by 24 distortion types, each at 5 levels. The MOS of each image, in the range [0, 9], was collected from 985 observers from five countries (Finland, France, Italy, Ukraine, and USA).
LIVE. The LIVE IQA database includes 982 distorted NIs and 29 reference images. Five distortion types are included: JPEG and JPEG2000 (JP2K) compression, additive white Gaussian noise (WN), Gaussian blur (BLUR), and Rayleigh fast-fading channel distortion (FF). The DMOS value of each distorted image is provided in the range [0, 100].
SIQAD. SIQAD is an SCI dataset that contains 20 source and 980 distorted SCIs. It involves seven distortion types: Gaussian noise (GN), Gaussian blur (GB), motion blur (MB), contrast change (CC), JPEG, JPEG2000 (JPEG2K) and layer segmentation based coding (LSC). Each distortion type is applied at seven degradation levels.
QACS. Compared with SIQAD, the QACS database emphasizes compression distortions from two codecs based on the high efficiency video coding (HEVC) standard [sullivan2012overview]: HEVC and its screen content coding (SCC) extension [shi2015study]. For simplicity, the HEVC-SCC extension is denoted as SCC here. This dataset contains 24 source and 492 compressed SCIs. Each SCI is compressed with 11 QP values ranging from 30 to 50 and rated by twenty subjects using a single-stimulus protocol.
We adopt four different settings to verify the transferable capability, including: 1) source domain: TID2013, target domain: SIQAD; 2) source domain: LIVE, target domain: SIQAD; 3) source domain: TID2013, target domain: QACS; 4) source domain: LIVE, target domain: QACS.
IV-B Implementation Details
We implement our model in PyTorch. As discussed in subsection 3.2, we set the quality ratio in Eqn. (7) to 0.2 for discriminable image pair selection. The selected images are resized to a fixed resolution as the inputs of our network. The batch size in the training phase is 16, and we adopt the Adam optimizer. The learning rate is fixed to 0.00005 with a weight decay of 0.001. The weighting parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ in Eqn. (13) are set to 1.0, 0.2 and 0.001 for all experiments. For model refining, we set the maximum number of iterations to 10 and the ADQ threshold to 0.002.
Five evaluation metrics are reported for each experimental setting, including:
Spearman rank correlation coefficient (SRCC), a nonparametric measure given by

$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)},$$

where $N$ is the number of test images and $d_i$ is the rank difference between the objective and subjective predictions of the $i$-th image.

Pearson linear correlation coefficient (PLCC), which is obtained after a nonlinear mapping between the objective and subjective scores with a logistic regression function,

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (s_i - \bar{s})(\hat{q}_i - \bar{\hat{q}})}{\sqrt{\sum_{i=1}^{N} (s_i - \bar{s})^2}\,\sqrt{\sum_{i=1}^{N} (\hat{q}_i - \bar{\hat{q}})^2}},$$

where $s_i$ and $\hat{q}_i$ stand for the subjective and mapped objective values of the $i$-th image, respectively, and $\bar{s}$ and $\bar{\hat{q}}$ represent the means of $s_i$ and $\hat{q}_i$ over the testing set.

Kendall rank correlation coefficient (KRCC), which measures the ordinal association between two quantities,

$$\mathrm{KRCC} = \frac{N_c - N_d}{\frac{1}{2}N(N - 1)},$$

where $N_c$ and $N_d$ are the numbers of concordant and discordant pairs, respectively.

Mean absolute error (MAE), which measures the prediction accuracy based on the absolute distance after converting the objective scores,

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \big|s_i - \hat{q}_i\big|.$$

Root mean square error (RMSE), which measures the standard deviation of the prediction errors,

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(s_i - \hat{q}_i\big)^2}.$$
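The five metrics can be computed as follows; this sketch uses plain NumPy, assumes tie-free scores for the rank-based measures, and omits the logistic mapping applied before PLCC:

```python
import numpy as np

def srcc(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    rx = np.argsort(np.argsort(x))   # 0-based ranks, valid when scores are distinct
    ry = np.argsort(np.argsort(y))
    d = rx - ry
    n = len(x)
    return 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

def plcc(x, y):
    """Pearson linear correlation (applied after any nonlinear mapping)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def krcc(x, y):
    """Kendall rank correlation from concordant/discordant pair counts."""
    n, nc, nd = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            nc += s > 0
            nd += s < 0
    return (nc - nd) / (0.5 * n * (n - 1))

def mae(x, y):
    return np.mean(np.abs(np.asarray(x, float) - np.asarray(y, float)))

def rmse(x, y):
    return np.sqrt(np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2))
```

In practice, library routines (e.g., `scipy.stats.spearmanr` and `scipy.stats.kendalltau`) handle ties and are preferable for reporting.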
IV-C Quality Prediction Performance
In this subsection, we evaluate the performance of our method under four different settings to further verify its effectiveness. We compare the proposed method with both conventional and deep learning based NRIQA measures, including NIQE [mittal2012making], PIQE [venkatanath2015blind], BRISQUE [mittal2012no], Rank [liu2017rankiqa], MEON [ma2017end], and Bilinear [zhang2018blind]. In particular, the conventional methods are pre-trained on NIs, and the deep learning based methods are trained with data from the corresponding source domain.
First, we treat the TID2013 dataset as the source domain and the SIQAD dataset as the target domain to train our model. The results are shown in Table 1, from which we find that our method achieves the best performance. We also visualize the MOS and predicted scores in Fig. 4, which shows that the proposed method demonstrates a stronger linear relationship between quality predictions and MOS than all the conventional and deep learning based methods. To explore the influence of the source domain, we also conduct experiments replacing TID2013 with the LIVE dataset. From Table 1, we find that our method still achieves the best performance. However, compared with TID2013 as the source domain, the performance degrades to some extent, as more distortion types are involved in the TID2013 dataset, such that more distortion-relevant priors and knowledge can be transferred to the target domain. Among the compared methods, the conventional method NIQE [mittal2012making] provides the worst performance on SCIs, as it is based on the statistical characteristics of NIs. This further verifies our assumption that the dramatic differences in statistics lead to the difficulty of quality transfer.
To further explore the generalization ability of our method, we adopt another SCI dataset, QACS, as the target domain. As discussed in Section 4.1, this dataset considers two distortion types, HEVC and its SCC extension, such that the distortions injected into these SCIs are closer to real application scenarios. The experimental results are shown in Table 2. Compared with the results of Table 1, our method still leads by a large margin. Moreover, the performance improvement on the QACS dataset is much larger than on the SIQAD dataset, as only one distortion type exists in each QACS setting, so the feature centers can be more easily learned. Although the Bilinear method achieves the second best results, it is a heavy-weight method (139.8 M parameters), as two networks (SCNN and VGG16) are adopted. By contrast, our method only utilizes SCNN (1.6 M), which is much lighter.
Table 2. Performance on QACS (HEVC) and QACS (SCC), comparing conventional methods with deep learning based methods trained on TID2013 and on LIVE.
IV-D Ablation Study
In this subsection, to reveal the functionalities of different modules in the proposed method, we perform the ablation study based on the first setting (source domain: TID2013, target domain: SIQAD), and the results are shown in Table 3.
First, we train the SCNN for quality prediction with the source domain data, leading to a model, MOSPre., that is directly applied to the target domain for testing. As expected, a significant performance drop is observed, since the relationship between the two domains has not been exploited. Subsequently, we explore DA based on the MOSPre. model by imposing the MMD loss, leading to the model MOSPre.+MMD. Interestingly, the performance degrades compared to MOSPre., as negative transfer occurs when the MMD loss is introduced. This phenomenon verifies our assumption that the MOS prediction model trained on NIs is difficult to transfer straightforwardly to SCIs due to the large discrepancy in quality-relevant deterministic and statistical characteristics. By contrast, we adopt the ranking mechanism for quality transfer. Although the models trained for pairwise ranking (RankPre. and RankPre.+MMD) cannot achieve the same level of accuracy as MOSPre. on SIQAD, the negative transfer phenomenon is largely alleviated, implying that the ranking relationship between NIs and SCIs can be shared to a certain extent. Based on the MMD and rectification losses, we further enhance the model, following the theory of DA, by aligning the joint distribution between the source and target domains, which significantly improves performance. This provides further evidence of the effectiveness of the proposed method.
V Conclusions

We have presented a new NRIQA method based on unsupervised domain adaptation, in an effort to explore the transfer capability of natural image quality. The proposed method is grounded in unsupervised domain adaptation, equips the model with the transferability of the pair-wise relationship, and performs well on the target domain for specific application scenarios. The proposed method attempts to fill the gap between the statistics of SCIs and NIs through ranking based relationship modeling, and the loss functions that minimize the feature discrepancy and rectify the classifier to accommodate the target domain lead to noticeable improvements in prediction accuracy.
Recent years have witnessed a surge of images/videos that are not purely captured by optical cameras, such as screen, gaming, and mixed content. In particular, with the fast development of artificial intelligence, numerous images and videos are generated with the aid of deep generative networks. As such, it is expected that the methodology and philosophy of the proposed method can play important roles in predicting the quality of these emerging domains. Moreover, rather than providing a DA based quality measure only, we would also like to emphasize that the generalization capability of NRIQA models could be further improved by investigating the shareable knowledge and priors between different domains. It is also of interest to extend the current approach to other transferable tasks in quality assessment, such as distortion type and viewing condition. Finally, it is imperative to study quality assessment in the scenario where only a few samples in the target domain are labelled with subjective ratings, to meet the grand challenges faced by NRIQA in different real-world applications.