Learning Transformer Features for Image Quality Assessment

by   Chao Zeng, et al.
City University of Hong Kong

Objective image quality assessment is a challenging task that aims to measure the quality of a given image automatically. Depending on the availability of a reference image, IQA is divided into Full-Reference (FR) and No-Reference (NR) tasks. Most deep learning approaches regress quality from deep features extracted by Convolutional Neural Networks. For the FR task, another option is to conduct a statistical comparison on deep features. In all these methods, non-local information is usually neglected. In addition, the relationship between the FR and NR tasks is less explored. Motivated by the recent success of transformers in modeling contextual information, we propose a unified IQA framework that utilizes a CNN backbone and a transformer encoder to extract features. The proposed framework is compatible with both FR and NR modes and allows for a joint training scheme. Evaluation experiments on three standard FR IQA datasets (LIVE, CSIQ, and TID2013) and on KONIQ-10K show that the proposed model achieves state-of-the-art FR performance. In addition, comparable NR performance is achieved in extensive experiments, and the results show that NR performance can be improved by the joint training scheme.





1 Introduction

The twenty-first century has witnessed the prospering of the Internet. People nowadays not only live in the physical world but also enjoy a virtual life built with information technology. Multimedia documents in various forms, such as images and videos, play a major role in this online world. Every day, a huge number of images are uploaded or generated on social media and websites such as Facebook, Google, and Flickr. In pursuit of a high-quality life, people tend to seek visual contents of higher quality. In this scenario, the ability to predict the perceptual quality of images is becoming significant in a wide range of applications such as image compression, video coding and transmission, image enhancement, and image restoration.

The Image Quality Assessment (IQA) task aims to enable a computer system to recognize the perceptual quality level of visual contents. According to the availability of a reference image for the quality prediction of the target image, IQA methods can be divided into three categories: Full-Reference (FR) [bosse2017deep, ahn2021deep, shi2021region, prashnani2018pieapp, ding2020image, zhang2018unreasonable, cheon2021perceptual], Reduced-Reference (RR) [soundararajan2011rred, wu2016orientation, zheng2021learning], and No-Reference (NR) methods [talebi2018nima, wang2021active, zhu2020metaiqa, you2021transformer, ke2021musiq, zhu2021saliency, golestaneh2021no, su2020blindly, gu2020giqa]. For the FR branch, a variety of methods have been proposed to align better with the human perception mechanism than the naïve pixel-level Mean Square Error (MSE), among which the Structural Similarity Index (SSIM) [wang2004image] has remained a gold standard in the research community. Moreover, recent years have witnessed the wide application of Deep Neural Networks in Computer Vision. Early IQA methods often rely on handcrafted features to predict image quality [zhang2011fsim, xue2013gradient, sheikh2006image]. With the powerful representation capability of deep convolutional neural networks, sophisticated design of handcrafted features for quality prediction becomes unnecessary.

However, most current deep models only learn local convolutional features inherited from a network pre-trained on recognition tasks, such as VGG [simonyan2014very] or ResNet [he2016deep]. In real-world scenarios, various distortion types can be applied either locally or globally within an image. The conventional scheme for deep IQA models, a CNN extractor followed by an MLP score regressor, only considers the final representation of the visual content, and attention among different feature levels is often ignored.

In this paper, we introduce a hybrid framework consisting of CNN layers and a transformer encoder. To learn better features for the task of image quality assessment, we build a transformer encoder upon a CNN feature extractor. Furthermore, to make the metric injective, we also add the original pixels as part of the features. To make the designed model applicable to broad real-world scenarios for image quality assessment, we propose a framework that is compatible with both FR and NR tasks. To make FR and NR quality prediction compatible in the same framework, we address the two tasks at different levels of the working pipeline. Specifically, inspired by SSIM, we use feature-level statistics to calculate the FR scores between reference and distorted images. In contrast, for the NR phase, we address the task as a classification problem instead of directly regressing to a single score. To alleviate the input difference between the two tasks, we employ a Siamese network structure to extract proper features for quality assessment. Since the model allows for both FR and NR settings, we can optimize it with different datasets, even though some do not provide reference images.

The contributions of this paper are summarized as follows:

1. We introduce an end-to-end deep learning-based method that is compatible with both FR and NR image quality measurement. Previous learning-based methods generally use a CNN feature extractor together with a score regressor to produce the quality score. Since the reference image is missing in the NR setting, the features available to the score regressor differ substantially between the two settings, which makes it challenging to combine the FR and NR IQA tasks in a unified framework. To achieve this, we propose to address the FR task as a comparison problem at the feature level and the NR task as a classification problem.

2. We propose a transformer encoder following the CNN layers as a complementary feature extractor for the learning process. CNN layers are good at extracting local visual contexts, while non-local information is neglected. To address this issue, we utilize the popular transformer model to capture the non-local information.

3. We carry out image quality experiments for both FR and NR settings on several standard IQA datasets to show the effectiveness of the proposed method.

In the following sections, we first give a brief review of existing image quality assessment models. We then introduce our TFIQA method, followed by experiments and result analysis. Finally, we conclude the paper.

2 Related Work

2.1 Conventional and Convolutional IQA models

The simplest and best-known metrics for image quality are the MSE distance and the PSNR between the reference image and the distorted image. However, though these metrics are simple and convenient for optimization, they have been proven to correlate poorly with human judgments. By considering more statistical information between the reference and distorted images, SSIM [wang2004image] was proposed as a full-reference model that introduces structural similarity for image quality assessment. It has shown good alignment with the human visual system and has inspired many later works such as FSIM [zhang2011fsim] and MS-SSIM [wang2003multiscale]. Another branch of conventional IQA models comprises the fidelity-based ones, such as VSI [zhang2014vsi], MAD [larson2010most], and VIF [sheikh2006image].

In recent years, the research community has witnessed great progress driven by deep learning. With the help of deep models, the focus of image quality assessment research has shifted from the pixel level or hand-crafted feature level to the automatically learned deep feature level. The LPIPS model introduced by Zhang et al. [zhang2018unreasonable] has shown the effectiveness of applying deep features learned from recognition tasks to the FR-IQA task. Ding et al. propose DISTS [ding2020image], showing even better performance by using structural similarity computation and multi-scale features for FR tasks. DeepIQA [bosse2017deep] employs a CNN as a feature extractor and uses two separate fully connected networks to predict patch attention weights and patch quality, producing the global image quality score with attention-aware pooling over local scores. Su et al., inspired by hypernetworks in vision tasks, propose HyperIQA [su2020blindly] for the NR-IQA task, which adaptively generates parameters of the quality prediction network according to the input image. Zhu et al. propose MetaIQA [zhu2020metaiqa], which leverages meta-learning to learn IQA models that adapt to different distortion types.

To overcome the constraints of small datasets and the overfitting issue in the IQA task, a variety of multi-task learning-based models have been proposed [kang2015simultaneous, xu2016multi, ma2017end]. Another branch of IQA methods that alleviates the limitation of annotated datasets is the ranking-based one [gao2015learning, ma2017dipiq, prashnani2018pieapp, liu2017rankiqa, ma2019blind]. Ranking-based methods first build quality-ranking datasets (image pairs or lists) and then use these enhanced forms of datasets to train a more complex network for the IQA task. In recent years, GAN models have proven effective at generating high-quality images of various resolutions. Some recent works propose to apply GANs to the IQA task by learning a mapping from the distorted image to a hallucinated reference image, guiding the IQA model to learn perceptual differences better [lin2018hallucinated, pan2018blind, ren2018ran4iqa].

2.2 Transformer based IQA models

Transformers [vaswani2017attention] were originally proposed for NLP tasks as a generalized non-local attention mechanism to learn contexts among tokens of arbitrary distance. They have shown excellent performance on language modeling compared to conventional RNN-based models. Beyond language modeling, transformers have shown their effectiveness on vision tasks as well. The Vision Transformer successfully applied a hybrid framework, consisting of a transformer encoder and a CNN extractor, to image recognition. Inspired by ViT [dosovitskiy2020image], TRIQ [you2021transformer] re-defines the classification token as a quality embedding token for the NR-IQA task. To deal with images of different resolutions, it provides sufficient token positions to represent the encoded image features. MUSIQ [ke2021musiq] addresses the resolution issue with a different approach, utilizing a patch-based multi-scale mechanism in the transformer encoding process. As in TRIQ, MLP layers are used for the final quality score prediction. In contrast, to deal with the dual input images in the FR setting, the IQT model [cheon2021perceptual] adopts the complete transformer, including both encoder and decoder. The CNN backbone first extracts the deep features of both reference and distorted images. The difference features are then re-encoded by the transformer encoder and fed to the decoder as contexts, and the deep features of the reference image are also input to the decoder as query information. Finally, the quality embedding token from the decoder is used for the final quality prediction with an MLP module. Compared to IQT, we propose a more lightweight model for the FR task with only a transformer encoder module. Moreover, instead of learning regression layers for score prediction, we learn better features and predict quality scores by feature-level statistical comparison.

2.3 Attention Mechanism for IQA

As in recognition tasks, spatial and channel attention can also benefit the IQA task [ding2020image]. To account for the different impacts of different image regions on quality measurements, DeepIQA [bosse2017deep] designs two network branches: one predicts patch importance weights, and the other predicts patch scores. This kind of attention can be categorized as spatial attention, and it is intuitive that different image regions contribute differently to the overall image quality, especially for authentically distorted images. The weighting mechanism applied to deep CNN features in LPIPS [zhang2018unreasonable] and DISTS [ding2020image] shows the contribution of attention at the channel level. As mentioned above for transformer-based models, the self-attention embedded in the transformer encoder and the cross-attention between encoder and decoder have also proven effective in the IQA task [cheon2021perceptual]. This paper likewise adopts compelling deep CNN features as local information and extends them with token-level attention in the transformer encoding process to obtain non-local information for predicting final quality. Considering the different settings of the FR and NR tasks, we propose a unified framework with Siamese encoding networks and settle the two tasks at different levels of the pipeline.

3 Methodology

Figure 1: Overview of the proposed framework. Our method utilizes both CNN layers and transformer encoder layers to extract image features. Specifically, it extends the CNN layers with transformer encoder layers and combines the deep features from both CNN and transformer via an attention mechanism, which distinguishes our method from conventional CNN-based IQA models. 'conv' denotes a convolutional layer; the remaining blocks denote transformer encoder layers.

As shown in Fig. 1, the architecture of our proposed model mainly consists of five modules: a CNN encoder, a transformer encoder, an attention module, a structural-similarity module for the FR score, and an NR classifier. The model uses both a full-reference dataset that provides reference-distortion image pairs and a no-reference dataset that only has distorted images. In the training phase, the CNN encoder first converts the input raw images into deep features, which are the inputs of the shallow transformer encoder. We apply the attention mechanism to both the CNN channels and the transformer encoder layers. For the FR branch, the feature similarity module compares the deep features of reference and distorted images at different channels or layers, and the attention module assigns adaptive weights to those feature maps at different levels. The feature similarity module utilizes SSIM-like structural similarity between feature maps based on mean and covariance statistics. For the NR branch, we add an MLP head as a classifier after the transformer encoder to predict five quality-level probabilities.

Our inspiration comes from the trending research on transformers [vaswani2017attention]. In conventional image quality assessment models, a CNN is usually employed as the feature extractor and an MLP as the score regressor. However, CNNs are good at fusing local contextual information but ignore global and non-local information. Inspired by the success of transformers in modeling contexts, we propose to use a transformer encoder to refine the features encoded by CNN layers. Also, from the perspective of multi-task learning, the NR and FR tasks are intuitively inter-correlated, sharing the goal of image quality assessment. Based on this consideration, we re-design the image quality assessment framework to make it compatible with training both FR and NR branches simultaneously.

Our model aims to learn better deep features for image quality assessment and encourage joint training between the two branches.

3.1 CNN Backbone

For visual feature extraction, we use the conventional VGG16 [simonyan2014very] as the backbone.

More specifically, following the practice of DISTS [ding2020image], we use feature maps from five different CNN stages.

In the second stage of transformer encoding, the CNN features are reshaped into tokens for the transformer model. The information from the feature maps of different layers is fused by the attention module, especially for the FR branch. During training, the weights of the CNN extractor are fixed.

The feature extraction process can be summarized by the following equation:

    F = \{f^{(1)}, \dots, f^{(5)}\} = \mathrm{VGG16}(x),

where F stands for the global visual features extracted from the five stages of the pretrained VGG16 network for an input image x.

3.2 Transformer Encoder

In conventional deep image quality assessment models, the image feature extractor mainly consists of CNN layers. In this work, we extend the CNN encoding layers with transformer encoding layers. To avoid overfitting, we use a shallow transformer encoder for feature learning at this stage. Different from previous work [you2021transformer, ke2021musiq, cheon2021perceptual, zhu2021saliency], which either uses the transformer encoder only to encode distorted images for the NR task or to encode the difference features for the FR task, we adopt a Siamese network structure for the extended encoding process. In other words, we use a shared transformer encoder for both reference and distorted images to distinguish between images of different visual quality.

For the structure of the transformer encoder, we follow the conventional configuration of TRIQ [you2021transformer]. First, the input features of shape H×W×C are converted to N×D by a 1×1 convolution as a feature projection. Here, H, W, and C stand for the height, width, and number of channels of the deep CNN features, N is the number of transformer tokens, and D is the dimension of the transformer encoder. For the NR branch, to predict the quality score from the encoded features, we also add a quality token and concatenate it with the image feature tokens. To retain the positional information of the tokens, we then add learnable positional embeddings to the token embeddings.
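The projection, quality-token, and positional-embedding steps can be sketched in PyTorch as follows. This is an assumed illustration rather than the authors' implementation: the module choices (e.g. `nn.TransformerEncoder`) and the default token count are ours, while the hyper-parameter values mirror the FR configuration reported later in the paper:

```python
import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Project HxWxC CNN features to N x D tokens, prepend a quality token,
    add learnable positional embeddings, then run a shallow transformer encoder."""
    def __init__(self, c_in=512, d=256, n_tokens=196, layers=2, heads=4, mlp=1024):
        super().__init__()
        self.proj = nn.Conv2d(c_in, d, kernel_size=1)        # 1x1 conv projection
        self.quality_token = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=mlp, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, f):                                    # f: (B, C, H, W)
        b = f.shape[0]
        tokens = self.proj(f).flatten(2).transpose(1, 2)     # (B, H*W, D)
        q = self.quality_token.expand(b, -1, -1)             # quality token per image
        z = torch.cat([q, tokens], dim=1) + self.pos         # prepend token, add pos. emb.
        return self.encoder(z)                               # (B, N+1, D)

z = TokenEncoder()(torch.randn(2, 512, 14, 14))              # 14*14 = 196 tokens
```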

Figure 2: The inner details of the transformer encoder for feature learning.

The above encoding process can be summarized by the following equations:

    z_0 = [t_q; f_1 E; f_2 E; \dots; f_N E] + E_{pos},
    z_l = \mathrm{TransformerLayer}(z_{l-1}), \quad l = 1, \dots, L,

where t_q is the quality token, E is the 1×1 projection, E_{pos} denotes the learnable positional embeddings, and z_L represents the final token embeddings output by the encoder.


After this finer encoding process for the input images, the model now has collected the relevant features to predict the quality score for FR or NR branches.

3.3 Attentive Feature Comparison for FR branch

Inspired by LPIPS and DISTS, we propose an attention module to better predict the quality score for the FR branch. This module combines transformer token attention and CNN channel attention for FR image quality assessment. In addition, to make the FR branch model injective, we also take the original image pixels into consideration.

Figure 3: The inner details of configuration for the FR branch.

The attention module is implemented as learnable parameters, constrained to positive values, that assign different importance weights to image pixel channels and to features at different levels.

As mentioned previously, we extend the widely used CNN layers with a transformer encoder. For the feature comparison, we adopt SSIM-like structural statistics of the extracted features, which have proven effective for the FR task in DISTS [ding2020image]. The modeling process can be expressed with the following equations:

    q(x, y) = \sum_i \sum_j \left[ \alpha_{ij}\, l(f_{ij}^x, f_{ij}^y) + \beta_{ij}\, s(f_{ij}^x, f_{ij}^y) \right],

    l(f_{ij}^x, f_{ij}^y) = \frac{2 \mu_{ij}^x \mu_{ij}^y + c_1}{(\mu_{ij}^x)^2 + (\mu_{ij}^y)^2 + c_1},

    s(f_{ij}^x, f_{ij}^y) = \frac{2 \sigma_{ij}^{xy} + c_2}{(\sigma_{ij}^x)^2 + (\sigma_{ij}^y)^2 + c_2},

where \alpha_{ij} and \beta_{ij} are the weights applied to the mean term l and the structure term s at different feature levels. The index i indicates the feature stage, including the image pixels as stage zero, the five convolutional stages, and the transformer encoding layers; for each stage there is a different number of channels or tokens, indicated by the index j. The superscripts x and y indicate whether a statistic is computed on the reference or the distorted image, and f_{ij}^x and f_{ij}^y are the corresponding feature vectors at the level indicated by i and j; \mu, \sigma^2, and \sigma^{xy} denote the corresponding means, variances, and covariance. To keep the attention weights positive, we clip their values before gradient back-propagation.
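A minimal PyTorch sketch of this SSIM-like weighted feature comparison for a single feature stage, under the assumption that it follows DISTS-style mean and structure terms (the function name, stabilizing constants, and weight normalization are illustrative, not the authors' exact formulation):

```python
import torch

def feature_similarity(fx, fy, alpha, beta, c1=1e-6, c2=1e-6):
    """SSIM-like comparison of reference/distorted feature maps at one stage.
    fx, fy: (B, C, H, W) feature maps; alpha, beta: (C,) learnable weights."""
    mu_x, mu_y = fx.mean(dim=(2, 3)), fy.mean(dim=(2, 3))
    var_x = fx.var(dim=(2, 3), unbiased=False)
    var_y = fy.var(dim=(2, 3), unbiased=False)
    cov = ((fx - mu_x[..., None, None]) * (fy - mu_y[..., None, None])).mean(dim=(2, 3))
    l = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)   # mean term
    s = (2 * cov + c2) / (var_x + var_y + c2)               # structure term
    a, b = alpha.clamp(min=0), beta.clamp(min=0)            # clip: keep weights positive
    w = (a + b).sum().clamp(min=1e-8)
    return (a * l + b * s).sum(dim=1) / w                   # weighted score per image

fx = torch.randn(2, 4, 8, 8)
score_same = feature_similarity(fx, fx, torch.ones(4), torch.ones(4))
# identical features give a similarity of (approximately) 1
```

Note the clamp on the weights, mirroring the clipping to positive values described in the text.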

Our general goal is to utilize the transformer encoder to obtain better visual representations for the task of image quality assessment. In our overall framework, the visual encoder is shared between the reference and distorted images. In the next section, we take a step further and extend the visual encoding framework to the NR task; in other words, we add an NR branch beside the original FR branch.

3.4 Classification for NR quality assessment

For the NR branch, the quality prediction is computed by an MLP head once the visual features are collected. The prediction head consists of two linear layers with a ReLU activation in between, and it receives the learned quality embedding from the transformer encoder.

Figure 4: Structure Details for NR branch.

In the classifier, the first layer projects the features from the transformer hidden dimension into the MLP head dimension. An activation layer then introduces non-linearity. Finally, the second linear layer maps the hidden features into a distribution over five classes, as in the NIMA model [talebi2018nima]. The predicted probability distribution is then merged into a quality score:

    \hat{q} = \sum_{k=1}^{5} k \cdot p_k,

where p_k is the predicted probability of quality level k.
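The NIMA-style five-class head and the score merging can be sketched as follows; this is a hypothetical implementation (class and variable names are ours), with dimensions following the NR configuration reported in the implementation details:

```python
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    """Two-layer MLP over the quality-token embedding, predicting a
    distribution over five quality levels and its expected score."""
    def __init__(self, d=32, hidden=64, levels=5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, levels))
        self.register_buffer("level_values", torch.arange(1, levels + 1).float())

    def forward(self, q_token):                    # (B, D) quality-token embedding
        p = self.mlp(q_token).softmax(dim=-1)      # probabilities over 5 levels
        score = (p * self.level_values).sum(-1)    # merge distribution into a score
        return p, score

p, s = QualityHead()(torch.randn(3, 32))
```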


3.5 Model Training

In image quality learning, the commonly used loss functions are the MAE and MSE losses between the output of the IQA model and the ground-truth scores. The MSE loss is expressed as:

    L_{FR} = \frac{1}{N} \sum_{i=1}^{N} (\hat{q}_i - q_i)^2, \quad (9)

In this work, considering the difference between the FR and NR tasks, the score prediction for the FR branch is calculated by feature-level comparison, while a classifier is placed over the transformer layers to learn to classify quality levels. We use the MSE loss of equation (9) to guide the learning of the FR branch. The NR branch is guided by a classification loss:

    L_{NR} = -\sum_{k=1}^{5} y_k \log p_k, \quad (10)

where y_k is the ground-truth label over the five quality levels and p_k the predicted probability.
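A sketch of the joint objective in PyTorch, assuming a cross-entropy surrogate for the NR classification loss and an unweighted sum of the two terms (the function name, arguments, and the equal weighting are our assumptions):

```python
import torch
import torch.nn.functional as F

def joint_loss(fr_pred, fr_mos, nr_logits, nr_levels):
    """Joint training objective sketch: MSE on FR scores plus a
    classification loss on the NR five-level prediction."""
    loss_fr = F.mse_loss(fr_pred, fr_mos)           # FR branch: regression on MOS
    loss_nr = F.cross_entropy(nr_logits, nr_levels) # NR branch: quality-level labels
    return loss_fr + loss_nr

loss = joint_loss(torch.randn(4), torch.randn(4),
                  torch.randn(4, 5), torch.tensor([0, 1, 2, 3]))
```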

Overall, the proposed framework is compatible with training under both FR and NR settings. The two branches can be learned separately or trained jointly, which is discussed in detail in the next section.

4 Experiments

4.1 Dataset and Evaluation Metrics

Dataset. We evaluate the proposed IQA model on several publicly available IQA datasets, including both synthetic and authentic distortion types. For training the NR branch, we use the KONIQ-10K dataset [hosu2020koniq], randomly split into eighty percent for training and twenty percent for testing. For training the FR branch, we use the KADID-10K dataset [lin2019kadid]. The performance evaluation for the FR branch is conducted on three standard IQA datasets: LIVE [sheikh2006statistical], CSIQ [larson2010most], and TID2013 [ponomarenko2013color].

Metrics. For performance evaluation, we use both correlation indices (PLCC and SROCC) and the RMSE between the predicted scores and the MOS scores.

Baseline and SOTA models. We compare our proposed method with both conventional metrics and CNN-based models; the recent transformer-based model IQT is also included. For brevity, we refer to the learning-based baselines as LPIPS, DISTS, and DeepIQA, and our proposed transformer-based method is denoted TFIQA. The IQT model utilizes both a transformer encoder and decoder for FR image quality assessment, while LPIPS and DISTS are CNN-based FR IQA models.

4.2 Implementing Details

In this paper, we propose a framework that enhances CNN features with a sequential transformer encoder for the image quality assessment task. This framework is compatible with both full-reference and no-reference configurations. For the full-reference mode, we denote the corresponding model as triq-fr. For this model, VGG16 is used as the CNN backbone feature extractor, and the hyper-parameters of the transformer encoder follow the configuration of TRIQ [you2021transformer]. We choose a shallow transformer encoder with 2 encoding layers. For the multi-head self-attention, we set the number of heads to 4, and the transformer encoder dimension to 256. The hidden dimensions of the MLPs in both the transformer encoder layers and the prediction head are set to 1024. For the MLP prediction head, we use a dropout rate of 0.1. The initial learning rate is set to 2e-4, and the Adam optimizer is used for learning the model parameters.

In addition, we also consider the no-reference scenario for the quality assessment task. Since in this mode only the distorted image is available for predicting the quality score, the feature comparison method adopted in the full-reference mode is not applicable. Compared to directly regressing a score from image features, we find it more intuitive to learn a classifier for the prediction. Following the work of NIMA [talebi2018nima], we utilize a five-class predictor and use the probability values to calculate the final quality score.

For the no-reference mode, we denote our model as triq-nr. In this mode, a data augmentation strategy is used: we train on image patches instead of whole images so that the model has enough training data to converge. To reduce the risk of overfitting, a smaller configuration is utilized. Again, we set the number of transformer encoding layers to 2. The model dimension of the transformer encoder is set to 32, and the number of heads for multi-head self-attention is set to 8. The hidden dimensions of the MLP layers in both the prediction head and the encoding layers are set to 64, and the number of quality levels in the prediction head is set to 5. A dropout rate of 0.1 is used in the MLP layers. The initial learning rate is set to 0.5e-4, and the SGD optimizer is configured with a momentum of 0.9.

4.3 Experiments Results and Analysis

Comparisons with the state of the art on FR performance. In this section, we evaluate the full-reference performance of the proposed method against both conventional metrics and learning-based models. Note that our model is trained on the KADID-10K dataset, and some of the results are borrowed from cheon2021perceptual and ding2020image.

From the results, we can see that our model outperforms conventional full-reference models such as PSNR and CNN-based models such as DeepIQA and PieAPP, which validates the effectiveness of the proposed method in the full-reference scenario. The performance of our model is better than LPIPS and DISTS on nearly every metric and evaluation dataset. We attribute this mainly to the differences in the feature learning process: LPIPS and DISTS only have CNN layers to extract image features for quality prediction, so non-local information is ignored. With the proposed model, we extend the CNN layers with transformer encoders, which employ multi-head self-attention to model the contextual information in the images. Also, following the practice of DISTS, we apply attention weights to the deep features in different channels. Different from DISTS, which only attends to CNN features, we attend to both CNN features and transformer features. For the transformer features, we also observe that it is better to apply weights in the form of token embeddings than to reshape the token embeddings into two-dimensional feature maps like deep convolutional feature maps.

Compared with the recently proposed transformer-based IQT model [cheon2021perceptual], the performance of our method is also comparable. Note that the IQT model utilizes both a transformer encoder and decoder, while our method employs only encoder layers. The way the image features are used is also different: we use a statistical comparison on deep features to compute the quality score, while IQT uses regression to predict the score. Note that we utilize a Siamese structure for image encoding, meaning both the reference and distorted images are processed with the same visual encoder. Based on this setting, it is feasible to unify the full-reference and no-reference prediction modes in the same model framework.

                                  LIVE             CSIQ             TID2013
Method                            PLCC    SROCC    PLCC    SROCC    PLCC    SROCC
PSNR                              0.865   0.873    0.819   0.810    0.677   0.687
SSIM [wang2004image]              0.937   0.948    0.852   0.865    0.777   0.727
MS-SSIM [wang2003multiscale]      0.940   0.951    0.889   0.906    0.830   0.786
VSI [zhang2014vsi]                0.948   0.952    0.928   0.942    0.900   0.897
MAD [larson2010most]              0.968   0.967    0.950   0.947    0.827   0.781
VIF [sheikh2006image]             0.960   0.964    0.913   0.911    0.771   0.677
FSIM [zhang2011fsim]              0.961   0.965    0.919   0.931    0.877   0.851
NLPD [laparra2016perceptual]      0.932   0.937    0.923   0.932    0.839   0.800
GMSD [xue2013gradient]            0.957   0.960    0.945   0.950    0.855   0.804
DeepIQA [bosse2017deep]           0.940   0.947    0.901   0.909    0.834   0.831
PieAPP [prashnani2018pieapp]      0.908   0.919    0.877   0.892    0.859   0.876
LPIPS [zhang2018unreasonable]     0.934   0.932    0.896   0.876    0.749   0.670
DISTS [ding2020image]             0.954   0.954    0.928   0.929    0.855   0.830
IQT [cheon2021perceptual]         *       0.970    *       0.943    *       0.899
triq-fr (ours)                    0.947   0.958    0.941   0.938    0.858   0.832
Table 1: Performance evaluation on the three standard IQA datasets in terms of PLCC and SROCC. triq-fr is our proposed full-reference model; '*' denotes a value not reported.

Learning with the NR branch. Most deep models formulate no-reference image quality assessment as a regression problem. However, compared to the large-scale datasets of conventional deep learning tasks like image classification, relatively few annotations are available for image quality assessment. In this scenario, we apply a patch-based training strategy for NR quality assessment following su2020blindly; our model configured in NR mode is also trained with this patch strategy. In addition, we adopt a classifier and formulate the score prediction as a classification problem following the practice of talebi2018nima. The results are shown in Table 2. As we can see, our model achieves better performance than the CNN-based model DeepIQA. However, compared to the MetaIQA model [zhu2020metaiqa], there is still a performance gap, which we attribute to the gains MetaIQA obtains from its multi-task training.

Method                      PLCC    SROCC   RMSE
DeepIQA [bosse2017deep]     0.761   0.739   *
MetaIQA [zhu2020metaiqa]    0.887   0.850   *
triq-nr                     0.808   0.769   0.349
triq-uni                    0.853   0.836   0.291
Table 2: No-reference performance evaluation on KONIQ-10K in terms of PLCC, SROCC, and RMSE.

Improving NR performance with the FR branch. We have shown that the proposed method can be applied in both FR and NR modes, and we observe from the previous results that the NR mode has more room for improvement than the FR mode. Since our framework is compatible with both modes, in this section we investigate whether the NR branch can be improved by a joint training scheme together with the FR branch. Compared to the original triq-nr model, we add an FR branch that accepts reference images as input; the new model is denoted triq-uni. Note that the visual encoder is shared between the two branches in a Siamese structure. For training triq-uni, we forward the two branches with different datasets: the NR branch with the KONIQ-10K dataset and the FR branch with the KADID-10K dataset. The performances are shown in Table 2. We can see that, compared to the previous triq-nr model, triq-uni achieves better performance on every evaluation metric, which validates the effectiveness of the joint training scheme over the proposed unified framework.

5 Conclusion

In this paper, we present a unified transformer-based framework that is compatible with both NR and FR image quality assessment tasks. Specifically, we extend the conventional CNN feature extractor with a transformer encoder. For the FR configuration, we utilize channel attention to assign different weights to deep CNN features from different channels and to transformer features from different tokens. The final quality scores are predicted by a structural statistical comparison of the attention-weighted deep features. Comparisons with state-of-the-art IQA models show that our proposed FR model achieves outstanding performance across metrics.

In addition, we also explore the NR configuration of the proposed method. Several training strategies are considered, including patch-based training, data augmentation, and an FR-aided training scheme, to obtain better prediction accuracy. Experimental results show that our proposed NR method achieves better performance than the conventional CNN-based model, and that joint training with the FR branch yields good performance gains. Future work may address better generalization across distortion types and better feature learning schemes for the image quality assessment task.


This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China, and from the City University of Hong Kong.