No-Reference Image Quality Assessment by Hallucinating Pristine Features

by   Baoliang Chen, et al.
City University of Hong Kong

In this paper, we propose a no-reference (NR) image quality assessment (IQA) method via feature level pseudo-reference (PR) hallucination. The proposed quality assessment framework is grounded on the prior models of natural image statistical behaviors and rooted in the view that the perceptually meaningful features could be well exploited to characterize the visual quality. Herein, the PR features from the distorted images are learned by a mutual learning scheme with the pristine reference as the supervision, and the discriminative characteristics of PR features are further ensured with the triplet constraints. Given a distorted image for quality inference, the feature level disentanglement is performed with an invertible neural layer for final quality prediction, leading to the PR and the corresponding distortion features for comparison. The effectiveness of our proposed method is demonstrated on four popular IQA databases, and superior performance on cross-database evaluation also reveals the high generalization capability of our method. The implementation of our method is publicly available on




I Introduction

Image quality assessment (IQA), which aims to establish a quantitative connection between an input image and its perceptual quality, serves as a key component in a wide range of computer vision applications [guo2017building, liu2017quality, zhang2017learning]. Typical full-reference (FR) IQA models predict image quality by measuring the deviation of an image from its pristine-quality counterpart (the reference). The pioneering studies date back to the 1970s, and a series of visual fidelity measures have been investigated [mannos1974effects]. FR quality measures have since been developed with demonstrated success, including the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [wang2004image], Multiscale SSIM (MS-SSIM) [wang2003multiscale], Visual Saliency-Induced Index (VSI) [zhang2014vsi], Most Apparent Distortion (MAD) [larson2010most] and Visual Information Fidelity (VIF) [sheikh2006image]. Unfortunately, in the vast majority of practical applications, reference images are absent or difficult to obtain, leading to a rapidly growing demand for no-reference (NR) IQA methods. Compared with FR-IQA, NR-IQA is a more challenging task due to the lack of pristine reference information.

In the literature, numerous NR-IQA methods have been proposed based on the hypothesis that natural scenes possess certain statistical properties. Quality can thus be assessed by measuring the deviation of the statistics between distorted and pristine images [moorthy2011blind, saad2012blind, hou2014blind]. With the development of deep learning technologies, image quality can also be inferred by learning from labeled image data [kang2014convolutional, kim2016fully, bosse2017deep, bianco2018use, gu2019blind, fu2016blind, kim2018multiple]. However, such data-driven methods rely heavily on large-scale training samples. Recently, the free-energy based brain theory [friston2006free, friston2010free, gu2014using, zhai2011psychovisual] has provided a novel solution for NR-IQA from the Bayesian view. In particular, the free-energy theory reveals that the human visual system (HVS) always attempts to reduce uncertainty and explains the perceived visual stimulus with an internal generative model.

Rooted in the widely accepted view that intrinsic, perceptually meaningful and learnable features govern image quality, in this work we focus on the generative model at the feature level for NR-IQA. This avoids modeling the image signal space, which is still poorly understood. Herein, we propose to learn a new NR-IQA measure named FPR, which infers quality through Feature-level PR information. The underlying design philosophy of our method is to learn a quality-specific PR feature instead of a restoration-specific PR feature. In this vein, we avoid designing a dedicated network for PR image generation, which remains a very challenging task. To verify the performance of our method, we conduct both intra-database and cross-database experiments on four databases: TID2013 [ponomarenko2015image], LIVE [sheikh2006statistical], CSIQ [larson2010most] and KADID-10k [2019KADID]. Experimental results demonstrate the superior performance of our method over existing state-of-the-art models. The main contributions of this paper are summarized as follows:

  • We propose a novel NR-IQA framework based on PR information constructed at the feature level. This scheme aims to infer pristine features that are quality-aware, learnable and discriminative.

  • We learn the PR feature by a mutual learning strategy, leveraging the reference information. To improve the discrimination capability between the estimated PR feature and the distortion feature, a triplet loss is further adopted.

  • We develop an aggregation strategy for the predicted scores of different patches in an image. The strategy benefits from gated recurrent units (GRU) and generates an attention map of the testing image for quality aggregation.

II Related Works

Due to the lack of reference information, existing NR-IQA measures can be classified into two categories: quality-aware feature extraction based NR-IQA and discrepancy estimation based NR-IQA. In the first category, quality-aware features are extracted based on a natural scene statistics (NSS) model or a data-driven model, and the quality is finally predicted by a regression module. In the second category, one or more PR images are first constructed, and then the discrepancy between the input image and its PR image(s) is measured. The philosophy is that the larger the discrepancy, the worse the quality of the image. Herein, we provide an overview of the two categories of NR-IQA models as well as of mutual learning methods.

II-A Quality-Aware Feature Extraction based NR-IQA

Typically, conventional NR-IQA methods extract quality-aware features based on natural scene statistics (NSS) and predict image quality by evaluating the destruction of naturalness. In [mittal2012no, ye2013real], based on the Mean-Subtracted Contrast-Normalized (MSCN) coefficients, the NSS are modeled with a generalized Gaussian distribution and the quality is estimated from the distribution discrepancy. The NSS features have also been exploited in the wavelet domain [moorthy2010two, hou2014blind, tang2011learning, wang2006quality]. In [tang2011learning], to discriminate degraded from undegraded images, the complex pyramid wavelet transform is performed and the magnitudes and phases of the wavelet coefficients are characterized as the NSS descriptor. Analogously, in [saad2012blind], the discrete cosine transform (DCT) is introduced for NSS model construction, leading to a Bayesian inference based NR-IQA model. Considering that structural information is highly quality-relevant, the joint statistics of the gradient magnitude and the Laplacian of Gaussian response are utilized in [xue2014blind] to model statistical naturalness. In [], hybrid features consisting of texture components, structure information and intra-prediction modes are extracted and unified for adaptive bitrate estimation.

Recently, there has been a surge of interest in deep-feature extraction for NR-IQA. In [kang2014convolutional], a shallow ConvNet is first utilized for patch-based NR-IQA learning. This work is extended by DeepBIQ [bianco2018use], where a pre-trained convolutional neural network (CNN) is fine-tuned for generic image description. Instead of learning only from the quality score, a multi-task CNN was proposed in [kang2015simultaneous], in which quality estimation and distortion identification are learned simultaneously for a better measure of quality degradation. However, although these deep-learning based methods have achieved substantial performance improvements, insufficient training data usually causes over-fitting. To alleviate this issue, extra synthetic databases, e.g., [zhang2018blind, ma2017end, chen2021no], have been proposed for learning more generalized models. The training data can also be enriched by ranking learning. In [liu2017rankiqa, niu2019siamese, ying2020quality, chen2021no], the quality scores of an image pair are regressed and ranked, leading to more quality-sensitive feature extraction.

Fig. 1: The framework of our proposed method. Two feature extractors are utilized in the training phase: the quality-embedding feature extractor and the integrity feature extractor. The quality-embedding feature extractor extracts quality-embedding features from the distorted image and from its reference image, respectively. The integrity feature extractor extracts, from a single distorted image, a feature that fuses the information of a PR feature and a distortion feature, under the guidance of the two quality-embedding features. The quality of the distorted image can then be regressed from the learned features. Finally, we further propose a GRU based quality aggregation module for patch-wise quality score aggregation. In the testing phase, only the testing image (without reference) is available for quality prediction based on the proposed NR-IQA model.

II-B Discrepancy Estimation based NR-IQA

The NR-IQA problem can be converted to the FR-IQA problem when the reference image can be inferred through generative models. In [lin2018hallucinated], the PR image is generated by a quality-aware generative network, and then the discrepancy between the distorted image and the PR image is measured for quality regression. In contrast to constructing a PR image with perfect quality, the reference information provided by a PR image that suffers from the severest distortion was explored in [min2017blind], and an NR-IQA metric was developed by measuring the similarity between the structures of the distorted and PR images. In [jiang2020no], both a pristine reference image (generated via a restoration model) and a severely distorted image (generated via a degradation model) are utilized for quality prediction. Analogously, by comparing a distorted image with its two bidirectional PRs, bilateral distance (error) maps are extracted in [hu2020tspr].

II-C Mutual Learning

Mutual learning is closely related to dual learning and collaborative learning, as all three encourage the models to teach each other during training. For example, dual learning was adopted in [he2016dual], where two cross-lingual translation models are forced to learn from each other through reinforcement learning. In contrast to dual learning, the models in collaborative learning learn the same task. In [batra2017cooperative], multiple student models are expected to learn the same classification task while their inputs are sampled from different domains. Different from dual learning and collaborative learning, both the tasks and the inputs of the models in mutual learning are identical. For example, deep mutual learning was utilized in [zhang2018deep], where two student models are forced to learn the classification collaboratively through a Kullback-Leibler (KL) loss. This work was further extended in [lai2019adversarial], with the KL loss replaced by a generative adversarial network. In our method, the mutual learning strategy is adopted to improve the learnability of the PR feature, and we further impose a triplet constraint on the output features, significantly enhancing their discriminability.


III The Proposed Scheme

We aim to learn an NR-IQA measure by hallucinating the PR features. In the training stage, given the pristine reference, we build an FR-IQA model from the distorted images and the corresponding pristine reference images. The PR feature is subsequently learned in a mutual way. Finally, GRU based quality aggregation is performed to obtain the final quality score.

III-A PR Feature Learning

As shown in Fig. 1, in the training phase, we learn to hallucinate the PR feature from a single distorted image under the guidance of the pristine reference feature. In particular, the pristine reference feature is generated by an FR model based on the quality-embedding feature extractor, and the PR feature is decomposed from the integrity feature, which is regarded as a fusion feature that contains the entire information of the PR feature and the distortion feature.

In general, there are two properties a desired PR feature should possess. First, the PR feature should be learnable. Learning the PR information from the distorted image is, generally speaking, an ill-posed problem. Although the reference image can be exploited during training, the PR feature may not be learnable simply by forcing the inferred PR feature to be close to the feature of the reference image. As such, the learning capability of the NR-IQA network should be carefully considered during reference feature generation. Second, the PR feature should be sufficiently discriminative when compared with the distortion feature. Enhancing the discriminability improves the quality sensitivity of the PR feature and subsequently promotes the prediction performance.

The proposed method is conceptually appealing in the sense of learnability and discriminability. Regarding learnability, a mutual learning strategy is adopted. As shown in Fig. 1, in the training process, the paired images, i.e., the distorted image and its corresponding reference, are fed into the quality-embedding feature extractor, generating the reference feature and the distortion feature. The integrity feature extractor, which accepts the distorted image only, is encouraged to generate an integrity feature that contains the full information of the pseudo reference feature and the distortion feature. The mutual learning strategy enables the integrity feature extractor and the quality-embedding feature extractor to be learned simultaneously under the feature distance constraint, so that a more learnable reference feature can be generated by the FR model. The connection among the integrity feature, the pseudo reference feature and the distortion feature is constructed by an invertible layer consisting of three invertible neural networks (INNs). Through the INNs, the integrity feature can be disentangled into a pseudo reference feature and a distortion feature without losing any information, owing to the invertibility of the INNs.
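The lossless property of an invertible layer can be illustrated with a minimal sketch. The additive coupling construction below is a standard invertible building block swapped in for illustration; it is not the INN design of this paper (which is given in Fig. 2), and the names `_shift`, `coupling_forward` and `coupling_inverse` are hypothetical. A feature vector is transformed, split into two halves standing in for the PR part and the distortion part, and then recovered from those two halves.

```python
import math


def _shift(half):
    """Hypothetical stand-in for a learned sub-network; any function of `half` works."""
    return [math.tanh(3.0 * v) + 0.5 for v in half]


def coupling_forward(x):
    """Additive coupling: keep the first half, shift the second half by f(first half)."""
    if len(x) % 2 != 0:
        raise ValueError("feature length must be even")
    mid = len(x) // 2
    x1, x2 = x[:mid], x[mid:]
    y2 = [b + s for b, s in zip(x2, _shift(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Exact inverse: the first half is unchanged, so the same shift can be subtracted."""
    if len(y) % 2 != 0:
        raise ValueError("feature length must be even")
    mid = len(y) // 2
    y1, y2 = y[:mid], y[mid:]
    x2 = [b - s for b, s in zip(y2, _shift(y1))]
    return y1 + x2


# Disentangle a toy "integrity feature" into two parts and recover it again.
integrity = [0.2, -1.0, 0.7, 3.0]
z = coupling_forward(integrity)
pr_part, distortion_part = z[:2], z[2:]          # illustrative split only
recovered = coupling_inverse(pr_part + distortion_part)
```

Because the inverse exists in closed form, the two parts together carry exactly the information of the input, which is the property the disentanglement relies on.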

To strengthen the discriminative capability, a triplet loss [schroff2015facenet] is further utilized as the distance measure between the reference features (the pristine reference feature and the PR feature) and the corresponding distortion features (one from each feature extractor), which is expressed as follows,


The loss is indexed by the input patch within a batch, depends on the batch size, and enforces a margin between positive and negative pairs. With this loss, on the one hand, the distance between the reference feature and the PR feature is reduced. On the other hand, the discrepancies between the reference/PR feature and the two distortion features are enlarged.
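A minimal sketch of a triplet loss of this kind is given below, following the hinge form of [schroff2015facenet] with squared Euclidean distances. The pairing used in the usage lines (reference feature as anchor, PR feature as positive, a distortion feature as negative), the averaging over the batch, and the names `squared_l2` and `triplet_loss` are assumptions made for illustration, not the exact formulation of this paper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchors, positives, negatives, margin):
    """Batch mean of max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    terms = [
        max(0.0, squared_l2(a, p) - squared_l2(a, n) + margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(terms) / len(terms)


# Toy batch of one patch: reference feature, hallucinated PR feature, distortion feature.
reference = [[0.0, 0.0]]
pseudo_reference = [[0.0, 1.0]]
far_distortion = [[3.0, 4.0]]    # already well separated: the hinge is inactive
near_distortion = [[1.0, 0.0]]   # as close as the positive: the margin is violated

loss_far = triplet_loss(reference, pseudo_reference, far_distortion, margin=1.0)
loss_near = triplet_loss(reference, pseudo_reference, near_distortion, margin=1.0)
```

Minimizing such a term pulls the PR feature toward the reference feature and pushes the distortion feature away until the margin is satisfied, after which the term contributes no gradient.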

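As illustrated in Fig. 1, to keep the relationship between the PR feature and its distortion feature consistent with that between the reference feature and its distortion feature, we concatenate each of the two reference-type features with its corresponding distortion feature, and the two concatenated features are used for quality prediction through a shared quality aggregation module.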

III-B GRU Based Quality Aggregation

To aggregate the predicted quality scores of the patches in an image, the aggregation module should be invariant to the number of patches. In this paper, we propose a GRU based quality score aggregation module, as shown in Fig. 1. More specifically, for the two concatenated features, two sub-branches are adopted for quality prediction. The first branch is a fully-connected (FC) layer, which is responsible for patch-wise quality prediction with the patch-wise concatenated feature as input. The other sub-branch consists of one GRU layer and one FC layer. Different from [bosse2017deep], the inputs of the GRU layer are the features of all the patches in an image. With the GRU, the long-term dependencies between different patches can be modeled and synthesized, and we then normalize the output weights from the last FC layer to generate the final attention map,


where the index runs over the patches in an image, up to the number of patches, and each patch receives an attention weight. Finally, the global image quality can be estimated as


where each patch contributes its predicted quality. As shown in Fig. 1, we also adopt the same strategy (network) for the quality aggregation of the patch-wise integrity feature. Due to the distinct representations of the fused feature and the concatenated features, the parameters of the two aggregation modules are not shared during model training.
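The aggregation step can be sketched as follows, assuming the attention weights are obtained by dividing non-negative raw weights by their sum and that the image score is the attention-weighted sum of the patch scores. The exact normalization used in this paper is not reproduced here, and the GRU and FC layers that would produce the raw weights are omitted; `aggregate_quality` and its `eps` parameter are hypothetical names.

```python
def aggregate_quality(patch_scores, raw_weights, eps=1e-8):
    """Return (image_score, attention) for any number of patches.

    patch_scores: per-patch quality predictions.
    raw_weights:  non-negative per-patch weights (assumed to come from the
                  GRU + FC branch, which is not modeled here).
    eps:          small constant that avoids division by zero.
    """
    if len(patch_scores) != len(raw_weights):
        raise ValueError("one weight is required per patch")
    total = sum(raw_weights) + eps
    attention = [w / total for w in raw_weights]
    image_score = sum(a * q for a, q in zip(attention, patch_scores))
    return image_score, attention


# Two patches: the second is weighted three times as heavily as the first.
score, attention = aggregate_quality([10.0, 30.0], [1.0, 3.0])

# The same function handles a different patch count without any change.
score_many, _ = aggregate_quality([20.0] * 7, [0.5] * 7)
```

Since the weights are normalized over however many patches are present, the output stays on the same scale as the patch scores regardless of image size.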

III-C Objective Function

In summary, the objective function of our proposed method includes two triplet losses and three quality regression losses. In particular, for quality regression, optimization with the mean absolute error (MAE) is less sensitive to outliers than optimization with the mean squared error (MSE), leading to a more stable training process. Consequently, the objective function is given by,


Here the losses are indexed by the patch and by the image within a batch, and the regression target is the Mean Opinion Score (MOS) provided by the training set. The three quality scores are predicted from the two concatenated features and the integrity feature, respectively. It is also worth noting that the reference-based features need not be extracted in the testing phase, where only features derived from the distorted image are used for the final quality prediction. The computational complexity in the testing phase is therefore considerably lower than that of the network used in the training phase.
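The remark on outlier sensitivity can be checked with a small numerical sketch. The scores below are made-up toy values, not data from any IQA database, and `mae` and `mse` are plain reference implementations written for this illustration.

```python
def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


target = [50.0, 60.0, 70.0, 80.0]          # toy subjective scores
clean = [52.0, 58.0, 71.0, 79.0]           # small errors on every sample
with_outlier = [52.0, 58.0, 71.0, 40.0]    # one grossly wrong prediction

mae_clean, mse_clean = mae(clean, target), mse(clean, target)
mae_out, mse_out = mae(with_outlier, target), mse(with_outlier, target)

# Growth factor of each loss caused by the single outlier.
mae_growth = mae_out / mae_clean
mse_growth = mse_out / mse_clean
```

In this toy case the single outlier inflates the MSE far more than the MAE, so one badly predicted or noisily labeled sample dominates an MSE-driven update much more strongly.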

| Database | # of Ref. Images | # of Images | Distortion Types |
|---|---|---|---|
| TID2013 [ponomarenko2015image] | 25 | 3,000 | Additive Gaussian noise; Additive noise in color components; Spatially correlated noise; Masked noise; High frequency noise; Impulse noise; Quantization noise; Gaussian blur; Image denoising; JPEG compression; JPEG2000 compression; JPEG transmission errors; JPEG2000 transmission errors; Non-eccentricity pattern noise; Local block-wise distortions of different intensity; Mean shift (intensity shift); Contrast change; Change of color saturation; Multiplicative Gaussian noise; Comfort noise; Lossy compression of noisy images; Image color quantization with dither; Chromatic aberrations; Sparse sampling and reconstruction |
| LIVE [sheikh2003image] | 29 | 982 | JPEG and JPEG2000 compression; Additive white Gaussian noise; Gaussian blur; Rayleigh fast-fading channel distortion |
| CSIQ [larson2010most] | 30 | 866 | JPEG compression; JP2K compression; Gaussian blur; Gaussian white noise; Gaussian pink noise and contrast change |
| KADID-10k [2019KADID] | 81 | 10,125 | Blurs (Gaussian, lens, motion); Color related (diffusion, shifting, quantization, over-saturation and desaturation); Compression (JPEG2000, JPEG); Noise related (white, white with color, impulse, multiplicative white noise + denoise); Brightness changes (brighten, darken, shifting the mean); Spatial distortions (jitter, non-eccentricity patch, pixelate, quantization, color blocking); Sharpness and Contrast |

TABLE I: Descriptions of the four IQA databases.

IV Experimental Results

IV-1 IQA Databases

Since our model is trained in a paired manner, the reference image should be available during the training phase. As such, to validate the proposed method, we evaluate our model on four synthetic IQA databases including: TID2013 [ponomarenko2015image], LIVE [sheikh2003image], CSIQ [larson2010most] and KADID-10k [2019KADID]. More details are provided in Table I.

TID2013. The TID2013 database consists of 3,000 images obtained from 25 pristine reference images. The pristine images are corrupted by 24 distortion types, each at 5 levels. The image quality is rated with a double-stimulus procedure, and the MOS values lie in the range [0, 9], where a larger MOS indicates better visual quality.

LIVE. The LIVE IQA database includes 982 distorted natural images and 29 reference images. Five different distortion types are included: JPEG and JPEG2000 compression, additive white Gaussian noise (WN), Gaussian blur (BLUR), and Rayleigh fast-fading channel distortion (FF). Different from the construction of TID2013, a single-stimulus rating procedure is adopted, producing difference mean opinion scores (DMOS) in the range from 0 to 100, where a lower DMOS value represents better image quality.

CSIQ. The CSIQ database contains 30 reference images and 866 distorted images. This database involves six distortion types: JPEG compression, JP2K compression, Gaussian blur, Gaussian white noise, Gaussian pink noise and contrast change. The images are rated by 35 different observers and the DMOS results are normalized into the range [0, 1].

KADID-10k. In this database, 81 pristine images are included and each pristine image is degraded by 25 distortion types in 5 levels. All the images are resized into the same resolution (512×384). For each distorted image, 30 reliable degradation category ratings have been obtained by crowdsourcing.

Fig. 2: Illustration of the network architectures for the quality-embedding feature extractor, INNs, integrity feature extractor and quality aggregation module.

IV-2 Implementation Details

We implement our model in PyTorch [paszke2019pytorch]. In Fig. 2, we show the layer-wise network design of our proposed method. We crop non-overlapping image patches of a fixed size. The number of image pairs in a batch is set to 32. We adopt the Adam optimizer [kingma2014adam] for optimization. The learning rate is fixed to 1e-4, with a weight decay of 1e-4. The two weighting parameters in Eqn. (4) are set to 20.0 and 2.0, respectively. We duplicate the samples 16 times in a batch to augment the data. The maximum number of epochs is set to 1,000.
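Non-overlapping patch cropping can be sketched as below. The patch size is left as a parameter, since its value is not stated in this text; dropping the incomplete tiles at the right and bottom borders is an assumption, and `crop_patches` is a hypothetical helper operating on a plain list-of-rows image rather than on a tensor.

```python
def crop_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles.

    Border regions that do not fill a whole tile are dropped (an assumption).
    Tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


# Toy 5 x 7 single-channel "image" whose pixel value encodes its position.
toy = [[r * 7 + c for c in range(7)] for r in range(5)]
tiles = crop_patches(toy, patch=2)   # 2 tile rows x 3 tile columns
```

Because the stride equals the patch size, every retained pixel belongs to exactly one patch.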

It should be mentioned that all the experimental settings are fixed in both intra-database and cross-database training. For the intra-database evaluation, we randomly split the dataset into a training set, a validation set and a testing set by reference image, to guarantee that there is no content overlap among the three sets. In particular, 60%, 20% and 20% of the images are used for training, validation and testing, respectively. We discard the 25th reference image and its distorted versions in TID2013, as they are not natural images. The intra-database experimental results are reported based on 10 random splits. To make errors and gradients comparable across databases, we linearly map the MOS/DMOS ranges of the other three databases (TID2013, CSIQ, KADID-10k) to the DMOS range [0, 100], which is the same as that of the LIVE database. Two evaluation metrics are reported for each experimental setting: the Spearman rank correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC). The PLCC evaluates the prediction accuracy and the SRCC indicates the prediction monotonicity.
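The evaluation protocol above can be sketched with the standard library only. All function names are hypothetical; whether MOS-type scores are flipped so that lower means better (matching DMOS) is not stated in the text and is exposed here as the `higher_is_better` flag; tied values receive average ranks in the SRCC, which is the usual convention.

```python
import random


def split_by_reference(ref_ids, seed, fractions=(0.6, 0.2, 0.2)):
    """Split reference ids so that no image content is shared between the sets."""
    refs = sorted(set(ref_ids))
    random.Random(seed).shuffle(refs)
    n_train = round(fractions[0] * len(refs))
    n_val = round(fractions[1] * len(refs))
    return (set(refs[:n_train]),
            set(refs[n_train:n_train + n_val]),
            set(refs[n_train + n_val:]))


def to_dmos_scale(score, lo, hi, higher_is_better):
    """Linearly map a score from [lo, hi] to [0, 100], where lower means better."""
    unit = (score - lo) / (hi - lo)
    if higher_is_better:      # assumption: MOS-type scores are flipped
        unit = 1.0 - unit
    return 100.0 * unit


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Ten reference images, five distorted versions each.
train, val, test = split_by_reference([i % 10 for i in range(50)], seed=0)
```

Splitting on reference ids rather than on individual images is what prevents a distorted version of a training image from appearing in the test set. A monotonic but nonlinear prediction yields an SRCC of 1 while its PLCC stays below 1, which is why the two metrics are reported together.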

| Method | LIVE SRCC | LIVE PLCC | CSIQ SRCC | CSIQ PLCC | TID2013 SRCC | TID2013 PLCC |
|---|---|---|---|---|---|---|
| BRISQUE [mittal2012no] | 0.939 | 0.935 | 0.746 | 0.829 | 0.604 | 0.694 |
| M3 [xue2014blind] | 0.951 | 0.950 | 0.795 | 0.839 | 0.689 | 0.771 |
| FRIQUEE [ghadiyaram2017perceptual] | 0.940 | 0.944 | 0.835 | 0.874 | 0.68 | 0.753 |
| CORNIA [ye2012unsupervised] | 0.947 | 0.950 | 0.678 | 0.776 | 0.678 | 0.768 |
| HOSA [xu2016blind] | 0.946 | 0.947 | 0.741 | 0.823 | 0.735 | 0.815 |
| Le-CNN [kang2014convolutional] | 0.956 | 0.953 | - | - | - | - |
| BIECON [kim2016fully] | 0.961 | 0.962 | 0.815 | 0.823 | 0.717 | 0.762 |
| DIQaM-NR [bosse2017deep] | 0.960 | 0.972 | - | - | 0.835 | 0.855 |
| WaDIQaM-NR [bosse2017deep] | 0.954 | 0.963 | - | - | 0.761 | 0.787 |
| ResNet-ft [kim2017deep] | 0.950 | 0.954 | 0.876 | 0.905 | 0.712 | 0.756 |
| IW-CNN | 0.963 | 0.964 | 0.812 | 0.791 | 0.800 | 0.802 |
| DB-CNN | 0.968 | 0.971 | 0.946 | 0.959 | 0.816 | 0.865 |
| CaHDC [wu2020end] | 0.965 | 0.964 | 0.903 | 0.914 | 0.862 | 0.878 |
| HyperIQA [su2020blindly] | 0.962 | 0.966 | 0.923 | 0.942 | 0.729 | 0.775 |
| FPR | 0.967 | 0.968 | 0.948 | 0.956 | 0.872 | 0.887 |
| FPR (FR) | 0.967 | 0.977 | 0.962 | 0.966 | 0.897 | 0.868 |

TABLE II: Performance evaluation on the LIVE, CSIQ and TID2013 databases. The top two results are highlighted in boldface.

| Metric | BIQI [moorthy2010two] | BLIINDS-II [saad2012blind] | BRISQUE [mittal2012no] | CORNIA [ye2012unsupervised] | DIIVINE [moorthy2011blind] | HOSA [xu2016blind] | SSEQ [liu2014no] | (fine-tune) [2019KADID] | FPR | FPR (FR) |
|---|---|---|---|---|---|---|---|---|---|---|
| PLCC | 0.460 | 0.559 | 0.554 | 0.580 | 0.532 | 0.653 | 0.463 | 0.734 | 0.901 | 0.940 |
| SRCC | 0.431 | 0.527 | 0.519 | 0.541 | 0.489 | 0.609 | 0.424 | 0.731 | 0.899 | 0.941 |

TABLE III: Performance evaluation on the KADID-10k database. The top two results are highlighted in boldface.

| Dataset | Dist. Type | DIIVINE [moorthy2011blind] | BLIINDS-II [saad2012blind] | BRISQUE [mittal2012no] | CORNIA [ye2012unsupervised] | HOSA [xu2016blind] | WaDIQaM-NR [bosse2017deep] | DIQA [kim2018deep] | BPRI (c) [min2017blind] | BPRI (p) [min2017blind] | TSPR [hu2020tspr] | FPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LIVE | WN | 0.988 | 0.947 | 0.979 | 0.976 | 0.975 | 0.970 | 0.988 | 0.984 | 0.985 | 0.972 | 0.987 |
| | GB | 0.923 | 0.915 | 0.951 | 0.969 | 0.954 | 0.960 | 0.962 | 0.927 | 0.924 | 0.978 | 0.979 |
| | JPEG | 0.921 | 0.950 | 0.965 | 0.955 | 0.954 | 0.964 | 0.976 | 0.967 | 0.967 | 0.947 | 0.932 |
| | JP2K | 0.922 | 0.930 | 0.914 | 0.943 | 0.954 | 0.949 | 0.961 | 0.908 | 0.907 | 0.950 | 0.965 |
| CSIQ | WN | 0.866 | 0.760 | 0.682 | 0.664 | 0.604 | 0.929 | 0.835 | 0.931 | 0.936 | 0.910 | 0.942 |
| | GB | 0.872 | 0.877 | 0.808 | 0.836 | 0.841 | 0.958 | 0.870 | 0.904 | 0.900 | 0.908 | 0.939 |
| | JPEG | 0.800 | 0.899 | 0.846 | 0.869 | 0.733 | 0.921 | 0.931 | 0.918 | 0.930 | 0.944 | 0.969 |
| | JP2K | 0.831 | 0.867 | 0.817 | 0.846 | 0.818 | 0.886 | 0.927 | 0.863 | 0.862 | 0.896 | 0.967 |
| TID2013 | WN | 0.855 | 0.647 | 0.858 | 0.817 | 0.817 | 0.843 | 0.915 | 0.918 | 0.918 | 0.876 | 0.931 |
| | GB | 0.834 | 0.837 | 0.814 | 0.840 | 0.870 | 0.861 | 0.912 | 0.873 | 0.859 | 0.837 | 0.912 |
| | JPEG | 0.628 | 0.836 | 0.845 | 0.896 | 0.986 | 0.931 | 0.875 | 0.907 | 0.910 | 0.913 | 0.906 |
| | JP2K | 0.853 | 0.888 | 0.893 | 0.901 | 0.902 | 0.932 | 0.912 | 0.883 | 0.868 | 0.935 | 0.900 |

TABLE IV: SRCC comparison in three databases on four common distortion types. The top two results are highlighted in boldface.

IV-A Quality Prediction on Intra-database

IV-A1 Overall Performance on Individual Database

In this sub-section, we compare our method with other state-of-the-art NR-IQA methods, including BRISQUE [mittal2012no], M3 [xue2014blind], FRIQUEE [ghadiyaram2017perceptual], CORNIA [ye2012unsupervised], DIIVINE [moorthy2011blind], BLIINDS-II [saad2012blind], HOSA [xu2016blind], Le-CNN [kang2014convolutional], BIECON [kim2016fully], WaDIQaM [bosse2017deep], ResNet-ft [kim2017deep], IW-CNN [kim2017deep], DB-CNN [kim2017deep], CaHDC [wu2020end] and HyperIQA [su2020blindly]. The comparison results are shown in Tables II and III. All our experiments are conducted over ten random train-test splits, and the median SRCC and PLCC values are reported as the final statistics. From the tables, we can observe that all competing models achieve comparable performance on the LIVE database, while the performance varies on the more challenging databases, TID2013 and KADID-10k. Compared with hand-crafted methods such as BRISQUE, FRIQUEE and CORNIA, the CNN-based methods achieve superior performance on different databases, revealing that features relevant to human perception can be learned from the training set. Moreover, we find that our method achieves the best performance on the TID2013, KADID-10k and CSIQ databases in terms of both SRCC and PLCC. Although our method takes second place on the LIVE database, the results are still comparable (SRCC 0.967 vs. 0.968) to those of the best model, DB-CNN. Furthermore, different from DB-CNN, our proposed method does not require external databases for training.

As we train our model in a paired manner, FR results can also be acquired during testing by involving the reference image. Herein, we also provide the FR results, denoted as FPR (FR), in Tables II and III. From the tables, we can observe that our FR model achieves higher performance than our NR model, as the pristine image provides more accurate reference information for quality evaluation. We also observe that the performance of our FR model is not as good as that of some other FR models, e.g., WaDIQaM-FR [bosse2017deep]. We believe this is reasonable, as the learning capability of our NR model must be considered simultaneously during the extraction of the reference feature. Furthermore, we compare our method with two PR image based NR-IQA methods, BPRI [min2017blind] and TSPR [hu2020tspr]. Following the experimental setting in [hu2020tspr], the four distortion types shared by the TID2013, LIVE and CSIQ databases, i.e., JPEG, GB, WN and JP2K, are used for performance comparison. The results are presented in Table IV. From the table, we can observe that our method achieves the best performance in most settings and significantly outperforms both methods. These results reveal the effectiveness of our PR information constructed at the feature level. Moreover, compared with the method in [hu2020tspr], where a generative adversarial network (GAN) is utilized to restore the reference information at the image level, the lightweight network utilized in our method can significantly reduce the inference time.

IV-A2 Performance on Individual Distortion Type

To further explore the behavior of our proposed method, we present the performance on individual distortion types and compare it with several competing NR-IQA models. The results of the experiments performed on the TID2013 database and the LIVE database are shown in Table V and Table VI, respectively. For each database, the average SRCC values over the ten splits described above are reported. As shown in Table V, our method achieves the highest accuracy on most distortion types (over 60% of the subsets). By contrast, lower SRCC values are obtained on some specific distortion types, e.g., mean shift. The reason may lie in the difficulty of PR feature hallucination when valuable information is buried by severe distortion. It is worth noting that our method achieves significant performance improvements on some noise-relevant distortion types (e.g., additive Gaussian noise, masked noise) and compression-relevant distortion types (e.g., JPEG compression, JPEG 2000 compression). This result is consistent with the performance on the LIVE database, verifying the capacity of our model to restore the PR features from different distortion types.

SRCC BRISQUE [mittal2012no] M3 [xue2014blind] FRIQUEE [ghadiyaram2017perceptual] CORNIA [ye2012unsupervised] HOSA [xu2016blind] MEON [ma2017end] DB-CNN [kim2017deep] FPR
Additive Gaussian noise 0.711 0.766 0.730 0.692 0.833 0.813 0.790 0.953 [t]
Additive noise in color components 0.432 0.56 0.573 0.137 0.551 0.722 0.700 0.897
Spatially correlated noise 0.746 0.782 0.866 0.741 0.842 0.926 0.826 0.967
Masked noise 0.252 0.577 0.345 0.451 0.468 0.728 0.646 0.876
High frequency noise 0.842 0.900 0.847 0.815 0.897 0.911 0.879 0.934
Impulse noise 0.765 0.738 0.730 0.616 0.809 0.901 0.708 0.779
Quantization noise 0.662 0.832 0.764 0.661 0.815 0.888 0.825 0.920
Gaussian blur 0.871 0.896 0.881 0.850 0.883 0.887 0.859 0.833
Image denoising 0.612 0.709 0.839 0.764 0.854 0.797 0.865 0.944
JPEG compression 0.764 0.844 0.813 0.797 0.891 0.850 0.894 0.923
JPEG 2000 compression 0.745 0.885 0.831 0.846 0.919 0.891 0.916 0.923
JPEG transmission errors 0.301 0.375 0.498 0.694 0.73 0.746 0.772 0.797
JPEG 2000 transmission errors 0.748 0.718 0.660 0.686 0.710 0.716 0.773 0.752
Non-eccentricity pattern noise 0.269 0.173 0.076 0.200 0.242 0.116 0.270 0.559
Local block-wise distortions 0.207 0.379 0.032 0.027 0.268 0.500 0.444 0.265
Mean shift 0.219 0.119 0.254 0.232 0.211 0.177 -0.009 0.009
Contrast change -0.001 0.155 0.585 0.254 0.362 0.252 0.548 0.699
Change of color saturation 0.003 -0.199 0.589 0.169 0.045 0.684 0.631 0.409
Multiplicative Gaussian noise 0.717 0.738 0.704 0.593 0.768 0.849 0.711 0.887
Comfort noise 0.196 0.353 0.318 0.617 0.622 0.406 0.752 0.830
Lossy compression of noisy images 0.609 0.692 0.641 0.712 0.838 0.772 0.860 0.982
Color quantization with dither 0.831 0.908 0.768 0.683 0.896 0.857 0.833 0.901
Chromatic aberrations 0.615 0.570 0.737 0.696 0.753 0.779 0.732 0.768
Sparse sampling and reconstruction 0.807 0.893 0.891 0.865 0.909 0.855 0.902 0.887
TABLE V: Average SRCC results of individual distortion types on TID2013 database. The top two results are highlighted in boldface.
SRCC
BRISQUE [mittal2012no] 0.965 0.929 0.982 0.964 0.828
M3 [xue2014blind] 0.966 0.930 0.986 0.935 0.902
FRIQUEE [ghadiyaram2017perceptual] 0.947 0.919 0.983 0.937 0.884
CORNIA [ye2012unsupervised] 0.947 0.924 0.958 0.951 0.921
HOSA [xu2016blind] 0.954 0.935 0.975 0.954 0.954
dipIQ [ma2017dipiq] 0.969 0.956 0.975 0.940 -
DB-CNN [kim2017deep] 0.972 0.955 0.980 0.935 0.930
FPR 0.962 0.960 0.986 0.959 0.966
PLCC
BRISQUE [mittal2012no] 0.971 0.940 0.989 0.965 0.894
M3 [xue2014blind] 0.977 0.945 0.992 0.947 0.920
FRIQUEE [ghadiyaram2017perceptual] 0.955 0.935 0.991 0.949 0.936
CORNIA [ye2012unsupervised] 0.962 0.944 0.974 0.961 0.943
HOSA [xu2016blind] 0.967 0.949 0.983 0.967 0.967
dipIQ [ma2017dipiq] 0.980 0.964 0.983 0.948 -
DB-CNN [kim2017deep] 0.986 0.967 0.988 0.956 0.961
FPR 0.960 0.971 0.991 0.971 0.977
TABLE VI: Average SRCC and PLCC results of individual distortion types on the LIVE database. The top two results are highlighted in boldface.

IV-B Cross-Database Evaluation

To verify the generalization capability of our FPR model, we further evaluate it in cross-database settings. We compare our method with seven NR-IQA methods: BRISQUE, M3, FRIQUEE, CORNIA, HOSA, and two CNN-based counterparts, DIQaM-NR and HyperIQA. The results of DIQaM-NR are taken from the original paper, and we retrain HyperIQA using the source code provided by its authors. All experiments are conducted with one database as the training set and the other two databases as testing sets. The experimental results are presented in Table VII, from which we can see that a model trained on LIVE (CSIQ) generalizes more easily to CSIQ (LIVE), as the two databases contain similar distortion types. However, generalizing a model trained on CSIQ or LIVE to the TID2013 database is a much more difficult task, due to the unseen distortion types in TID2013. Despite this, our method still achieves a high SRCC in these two settings, demonstrating its superior generalization capability.
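
The protocol above (train on one database, test on each of the other two) can be sketched as a simple loop. The snippet below is only an illustration under stated assumptions: the "model" is a stand-in least-squares line on a scalar feature rather than the FPR network, the database names "A", "B", "C" and all values are invented, and `fit_linear` and `cross_database_srcc` are hypothetical helper names.

```python
def ranks(v):
    """1-based average ranks (tied values share the mean rank)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r


def srcc(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def fit_linear(samples):
    """Stand-in 'model': least-squares line from a scalar feature to MOS."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in samples) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: my + slope * (x - mx)


def cross_database_srcc(databases, fit):
    """Train on each database in turn and test on every other one."""
    results = {}
    for train_name, train_samples in databases.items():
        model = fit(train_samples)
        for test_name, test_samples in databases.items():
            if test_name == train_name:
                continue  # intra-database results are reported separately
            preds = [model(x) for x, _ in test_samples]
            mos = [y for _, y in test_samples]
            results[(train_name, test_name)] = srcc(preds, mos)
    return results


# Toy data: (scalar feature, MOS) pairs, invented for illustration.
toy_dbs = {
    "A": [(0.1, 1.0), (0.5, 2.1), (0.9, 3.0)],
    "B": [(0.2, 1.5), (0.4, 1.9), (0.8, 3.2)],
    "C": [(0.3, 2.0), (0.6, 2.5), (0.7, 2.4)],
}
table = cross_database_srcc(toy_dbs, fit_linear)
```

With three databases the loop yields six train/test pairs, matching the six result columns per method in a table of this kind. Because SRCC depends only on rank order, no score rescaling between databases is needed before computing it.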

Training LIVE CSIQ TID2013
BRISQUE [mittal2012no] 0.562 0.358 0.847 0.454 0.790 0.590
M3 [xue2014blind] 0.621 0.344 0.797 0.328 0.873 0.605
FRIQUEE [ghadiyaram2017perceptual] 0.722 0.461 0.879 0.463 0.755 0.635
CORNIA [ye2012unsupervised] 0.649 0.360 0.853 0.312 0.846 0.672
HOSA [xu2016blind] 0.594 0.361 0.773 0.329 0.846 0.612
DIQaM-NR [bosse2017deep] 0.681 0.392 - - - 0.717
HyperIQA [wu2020end] 0.697 0.538 0.905 0.554 0.839 0.543
Ours 0.620 0.433 0.895 0.522 0.884 0.732
TABLE VII: SRCC comparison on different cross-database settings. The numbers in bold are the best results.

IV-C Ablation Study

In this subsection, to reveal the functionality of the different modules in our proposed method, we perform an ablation study on the TID2013 database. Consistent with the intra-database experimental setting, 60%, 20% and 20% of the images in TID2013 are used for the training, validation and testing sets, respectively, without content overlap. Herein, we report the ablation results for one fixed split only, in Table VIII. In particular, we first ablate the PR and INN modules from our model and retain the Integrity Feature Extractor and the GRU-based Quality Aggregation modules. The performance drops dramatically, because no extra constraint is introduced to prevent overfitting. We then replace the INN module by directly concatenating the learned pseudo-reference feature and the distortion feature for quality regression, which gives the second ablation setting. The lower SRCC (0.86 vs. 0.89) reveals that a more generalized model can be learned with our INN module. As described before, the triplet loss is adopted to learn more discriminative features. We therefore ablate the triplet loss in our third experiment. Again, the significant performance drop reveals that discriminative PR feature learning is vital for high-accuracy quality prediction. Finally, three patch score aggregation modules are compared in Table VIII. The superior performance of the GRU-based module further demonstrates its effectiveness.
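
A split "without content overlap" is typically obtained by partitioning the reference images rather than the distorted images, so that every distorted version of a given content lands in the same subset. The sketch below illustrates this idea; it is not the authors' splitting code. The function name `split_by_content`, the seed, and the image naming scheme are assumptions. The toy layout mirrors TID2013, which contains 25 reference images, each with 24 distortion types at 5 levels.

```python
import random


def split_by_content(image_ids_by_reference, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign whole reference contents to train/val/test (no content overlap)."""
    refs = sorted(image_ids_by_reference)
    random.Random(seed).shuffle(refs)  # fixed seed gives one fixed split
    n_train = round(ratios[0] * len(refs))
    n_val = round(ratios[1] * len(refs))
    parts = {
        "train": refs[:n_train],
        "val": refs[n_train:n_train + n_val],
        "test": refs[n_train + n_val:],
    }
    # Expand each group of references into its distorted images.
    return {
        name: [img for ref in group for img in image_ids_by_reference[ref]]
        for name, group in parts.items()
    }


# Hypothetical identifiers: 25 contents x 24 distortion types x 5 levels.
toy = {
    f"ref{r:02d}": [
        f"ref{r:02d}_d{t:02d}_l{l}" for t in range(24) for l in range(5)
    ]
    for r in range(25)
}
splits = split_by_content(toy)
```

Splitting at the content level keeps the 60/20/20 proportion at the image level as well, since every reference contributes the same number of distorted images.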

Exp.ID PR INN Patch Aggregation SRCC
Mean Weighted GRU
1 0.670
2 0.859
3 0.772
4 0.869
5 0.848
6 0.887
TABLE VIII: SRCC performance with ablation studies performed on the TID2013 database.
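
To make the difference between the simpler aggregation options concrete, the following sketch contrasts mean and weighted pooling of patch-wise scores. It is an illustration only: the function names, the clamping of weights, and the toy numbers are assumptions, and the GRU-based variant is not sketched here because it requires a trained recurrent module.

```python
def mean_aggregate(patch_scores):
    """Image score as the plain average of patch scores."""
    return sum(patch_scores) / len(patch_scores)


def weighted_aggregate(patch_scores, patch_weights, eps=1e-6):
    """Image score as a weighted average of patch scores.

    Weights are clamped to be non-negative and offset by eps so that the
    normalizer never vanishes (an assumption of this sketch).
    """
    w = [max(x, 0.0) + eps for x in patch_weights]
    return sum(s * wi for s, wi in zip(patch_scores, w)) / sum(w)


# Toy patch scores and weights, invented for illustration.
scores = [1.0, 2.0, 3.0]
uniform = weighted_aggregate(scores, [1.0, 1.0, 1.0])
focused = weighted_aggregate(scores, [0.0, 0.0, 1.0])
plain = mean_aggregate(scores)
```

With uniform weights the weighted form reduces to the mean, while concentrating the weight on one patch pulls the image score toward that patch. A recurrent aggregator differs from both in that it can take the order and context of patches into account.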

IV-D Feature Visualization

To better understand the performance of our proposed method, we visualize the quality-relevant features: the reference feature, the pseudo-reference feature, and the distortion feature. More specifically, we first train two models with our method, one on the TID2013 database and one on the LIVE database. Then 900 image pairs are randomly sampled from each database for testing. For each database, we reduce the dimensions of the three features to three by t-SNE [maaten2008visualizing], and the results are visualized in Fig. 3. As shown in Fig. 3, the discrepancy between the reference feature and the pseudo-reference feature is small, owing to the mutual learning strategy. By contrast, a large discrepancy is obtained between the pseudo-reference feature and the distortion feature as a result of the triplet loss, leading to better performance.

Fig. 3: t-SNE visualization of the features extracted from TID2013 and LIVE databases.
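
The qualitative observation from the visualization can also be checked numerically by comparing average distances between corresponding feature vectors. The snippet below is a hedged sketch of such a check, not part of the paper's analysis: `mean_paired_distance` is a hypothetical helper, Euclidean distance is an assumed choice of metric, and the two-dimensional toy features are invented.

```python
import math


def mean_paired_distance(feats_a, feats_b):
    """Average Euclidean distance between corresponding feature vectors."""
    if len(feats_a) != len(feats_b):
        raise ValueError("feature sets must be paired one-to-one")
    total = 0.0
    for a, b in zip(feats_a, feats_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(feats_a)


# Toy features for two image pairs, invented for illustration.
reference = [[0.0, 0.0], [1.0, 1.0]]
pseudo_reference = [[0.0, 0.1], [1.0, 0.9]]
distortion = [[3.0, 4.0], [4.0, 5.0]]

d_ref_pr = mean_paired_distance(reference, pseudo_reference)
d_pr_dist = mean_paired_distance(pseudo_reference, distortion)
```

A small reference-to-pseudo-reference distance together with a much larger pseudo-reference-to-distortion distance would correspond to the pattern described for Fig. 3.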

V Conclusions

In this paper, we propose a novel NR-IQA method named FPR that restores the reference information at the feature level. Image quality is evaluated by measuring the discrepancy at the feature level, and the PR feature is inferred by means of INNs. The mutual learning strategy and the triplet loss ensure the learnability and discriminability of the PR features. To aggregate the patch-wise quality scores of an image, a GRU-based quality aggregation module is further proposed. The superior performance on four synthetic databases demonstrates the effectiveness of our model.