
A study of deep perceptual metrics for image quality assessment

by   Rémi Kazmierczak, et al.

Several metrics exist to quantify the similarity between images, but they are ineffective when it comes to measuring the similarity of highly distorted images. In this work, we empirically investigate perceptual metrics based on deep neural networks for tackling the Image Quality Assessment (IQA) task. We study deep perceptual metrics according to different hyperparameters, such as the network's architecture or training procedure. Finally, we propose our multi-resolution perceptual metric (MR-Perceptual), which aggregates perceptual information at different resolutions and outperforms standard perceptual metrics on IQA tasks with varying image deformations. Our code is available at





1 Introduction

Image Quality Assessment (IQA) plays an essential role in image-based applications [1, 2] where the acquisition systems or algorithms can introduce image quality variations. Although IQA is a well-known problem [3], it is difficult to define a metric directly linked to human perception. Indeed, for humans, IQA is intuitive and effortless [4]. Still, it remains a subjective measurement that is insufficient for validating algorithms or acquisition systems.

Several perceptual metrics have been investigated [2, 1] for IQA, such as the Euclidean distance or SSIM [5]. Yet human perception of image similarity relies on psychological vision mechanisms that are to a large extent unknown and therefore hard to implement. Existing metrics, on the other hand, rely only on estimating global or local variations between images.

Deep Learning-based perceptual metrics were first proposed in [6] and [7] for the style transfer problem. They were followed by several applications: assessing the quality of super-resolution algorithms [6], semantic segmentation [8], and the quality of Generative Adversarial Network (GAN) [9] outputs. A perceptual metric is typically a distance between features extracted from Deep Neural Networks (DNNs) after a forward pass of the input images. While some perceptual metrics like the Fréchet Inception Distance (FID) [10] are widely used to evaluate the quality of images generated by GANs, they are limited to comparing the estimated distributions of two sets of images.

In this paper, we investigate IQA using Deep Learning-based perceptual metrics to compute the similarity between two images. Unlike previous studies [11, 12] that learn an image quality metric using a DNN, we evaluate different DNNs and their associated hyperparameters (loss function, normalisation, resolution of input images, and feature extraction strategy), with the main goal of identifying a deep perceptual distance that serves as a general-purpose metric closer to human perception.

Our contribution is threefold: first, we empirically investigate different DNN perceptual metrics related to the network architecture. Next, we perform an ablation study highlighting the relationship between the training procedure of the DNN parameters and the performance of deep perceptual metrics. Finally, we propose a perceptual metric that achieves state-of-the-art results in unsupervised IQA by studying various hyperparameters impacting the computation of perceptual losses.

2 Multi Resolution Perceptual metric

2.1 Notations and formalism

We denote by $\mathcal{D} = \{x_i\}_{i=1}^{N}$ a dataset composed of $N$ images. Let $f_\theta(x)$ denote the output of the DNN $f_\theta$ with trainable parameters $\theta$ applied to image $x$.

DNNs can be decomposed into blocks applied sequentially. For example, we can decompose AlexNet [13], which comprises five convolutional layers, into $L = 5$ blocks, where each block is a convolutional layer. Let us denote by $F(x) = \{F^{l}(x)\}_{l=1}^{L}$ the set of feature maps output by the $L$ blocks for image $x$. For $l \in \{1, \dots, L\}$, the feature map $F^{l}(x) \in \mathbb{R}^{H_l \times W_l \times C_l}$ is a three-dimensional tensor, where $H_l$ and $W_l$ represent the height and width of the feature map and $C_l$ is the number of channels. We denote by $F^{l}_{h,w,c}(x)$ a particular value of the feature map.

2.2 Perceptual metric

The process of computing a perceptual metric can be divided into three stages: the deep feature extraction strategy from an image given a DNN architecture, the normalization strategy of the feature space, followed by the dissimilarity measure to compare the features. We now present these different stages.
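The three stages can be read as a composition of three functions. The sketch below is only illustrative: the names `perceptual_metric`, `toy_extract`, `l2_normalize` and `mse` are hypothetical, and the toy extractor stands in for a real DNN (here the "features" are simply the pixel values).

```python
import math

def perceptual_metric(image_a, image_b, extract, normalize, dissimilarity):
    """Compose the three stages: extraction, normalization, comparison."""
    feats_a = normalize(extract(image_a))
    feats_b = normalize(extract(image_b))
    return dissimilarity(feats_a, feats_b)

# Toy stand-ins (assumptions, not a trained network): an "image" is a flat
# list of pixel values and the "features" are the pixels themselves.
def toy_extract(image):
    return list(image)

def l2_normalize(features):
    norm = math.sqrt(sum(v * v for v in features)) or 1.0  # avoid dividing by 0
    return [v / norm for v in features]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

reference = [0.2, 0.4, 0.6, 0.8]
distorted = [0.2, 0.5, 0.5, 0.8]
score_same = perceptual_metric(reference, reference, toy_extract, l2_normalize, mse)
score_diff = perceptual_metric(reference, distorted, toy_extract, l2_normalize, mse)
print(score_same, score_diff)
```

Swapping any one of the three callables changes the metric without touching the other two stages, which is how the variants compared in this study can be organized.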

2.2.1 Deep Feature Extraction

The deep feature extraction is the initial step that allows representing the data into a new feature space. Contrary to handcrafted features like GLCM [14], we can use a trained DNN to extract features at different levels of the network.

In some cases, DNN parameters are obtained from general-purpose networks pretrained on ImageNet [15]; in others, DNNs are fine-tuned on a dataset tailored to the evaluation of perceptual metrics in order to achieve better performance [11].

In previous works, features were extracted from the image at its original dimension ($H \times W$). Here, we also explore the features obtained after upscaling the image by a factor of two with bilinear interpolation ($2H \times 2W$). Let us denote by $\phi(x)$ the latent representation of the image $x$.
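To make the upscaling step concrete, here is a minimal pure-Python sketch of bilinear interpolation by a factor of two on a single-channel image. The function name `upscale2x_bilinear` is hypothetical, and the half-pixel-centre sampling convention with edge clamping is an assumption: the text does not say which convention was used.

```python
def upscale2x_bilinear(image):
    """Upscale a 2-D grayscale image (list of rows) by a factor of two."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(2 * height):
        # Map the centre of each output pixel back into input coordinates,
        # clamping at the borders.
        src_r = min(max((r + 0.5) / 2 - 0.5, 0.0), height - 1)
        r0 = int(src_r)
        r1 = min(r0 + 1, height - 1)
        wr = src_r - r0
        row = []
        for c in range(2 * width):
            src_c = min(max((c + 0.5) / 2 - 0.5, 0.0), width - 1)
            c0 = int(src_c)
            c1 = min(c0 + 1, width - 1)
            wc = src_c - c0
            # Interpolate along the columns, then along the rows.
            top = (1 - wc) * image[r0][c0] + wc * image[r0][c1]
            bottom = (1 - wc) * image[r1][c0] + wc * image[r1][c1]
            row.append((1 - wr) * top + wr * bottom)
        out.append(row)
    return out

small = [[0.0, 1.0],
         [1.0, 0.0]]
large = upscale2x_bilinear(small)
print(len(large), len(large[0]))  # 4 4
```

In practice a deep learning framework's resize operation would be used on multi-channel tensors; the loop above only spells out what "bilinear" means for one channel.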

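Given features extracted at different levels of a DNN, a straightforward strategy, termed linear features, takes all the feature maps, which contain perceptual and contextual information at different resolutions, and concatenates them. Writing $F^{l}(x)$ for the feature map output by block $l$ (of height $H_l$, width $W_l$ and with $C_l$ channels), it is defined as follows:

$$\phi_{\mathrm{lin}}(x) = \left[ F^{1}(x), \dots, F^{L}(x) \right]$$

An alternative strategy consists in combining features using the Gram matrix, as proposed in [7], which yields new features termed quadratic features. The Gram matrix $G^{l}(x)$ of layer $l$ is a square matrix of size $C_l \times C_l$. Let $(i, j) \in \{1, \dots, C_l\}^2$. The coefficient of the Gram matrix at position $(i, j)$, defined for a feature map $F^{l}(x)$, is given by:

$$G^{l}_{i,j}(x) = \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} F^{l}_{h,w,i}(x) \, F^{l}_{h,w,j}(x)$$

We can now define the quadratic features of an image $x$ as:

$$\phi_{\mathrm{quad}}(x) = \left[ G^{1}(x), \dots, G^{L}(x) \right]$$
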
On the one hand, the linear features are directly linked to the content (layout) of an image and to the first moments of the feature maps. On the other hand, quadratic features are linked to the style of an image [7] and capture stationary information related to the second moment, i.e. the covariance.
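The difference between the two feature types can be seen on a toy feature map. The sketch below assumes feature maps stored as nested lists of shape height x width x channels and an unnormalized Gram matrix in the spirit of [7]; the function names are hypothetical and any scaling factor used by the authors is not reproduced here.

```python
def linear_features(feature_maps):
    """Concatenate every value of every feature map into one flat list."""
    return [v for fmap in feature_maps for row in fmap for pixel in row for v in pixel]

def gram_matrix(fmap):
    """C x C Gram matrix of one H x W x C feature map (nested lists)."""
    channels = len(fmap[0][0])
    gram = [[0.0] * channels for _ in range(channels)]
    for row in fmap:
        for pixel in row:
            # Accumulate channel-by-channel products over all positions.
            for i in range(channels):
                for j in range(channels):
                    gram[i][j] += pixel[i] * pixel[j]
    return gram

def quadratic_features(feature_maps):
    """Concatenate the flattened Gram matrices of all feature maps."""
    return [v for fmap in feature_maps for gram_row in gram_matrix(fmap) for v in gram_row]

# Toy feature map: 1 x 2 spatial positions, 2 channels.
toy_block = [[[1.0, 2.0], [3.0, 4.0]]]
swapped = [[[3.0, 4.0], [1.0, 2.0]]]   # same values, positions exchanged

print(linear_features([toy_block]))    # [1.0, 2.0, 3.0, 4.0]
print(gram_matrix(toy_block))          # [[10.0, 14.0], [14.0, 20.0]]
print(gram_matrix(swapped))            # [[10.0, 14.0], [14.0, 20.0]]
```

Exchanging the two spatial positions changes the linear features but leaves the Gram matrix untouched, which illustrates why linear features follow the layout while quadratic features capture position-independent statistics.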

2.2.2 Features Normalization

Because values vary in magnitude between feature maps, it is essential to normalize them so that all layers contribute with comparable importance. This work compares two normalization strategies. Current solutions consider an $\ell_2$ normalization, which divides each value by the $\ell_2$ norm of the feature map. Yet it is also possible to normalize with the $\ell_1$ norm or with a sigmoid function. We propose to normalize using a sigmoid function, which bounds all values of the feature map within $(0, 1)$. Another possible operation is to apply the ReLU function before normalizing.
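The normalization options discussed above can be sketched on a flattened feature map as follows. The function names are hypothetical, and applying each norm over the whole flattened map (rather than, say, per channel) is an assumption made for brevity.

```python
import math

def l2_normalize(values):
    """Divide each value by the Euclidean norm of the feature map."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)

def l1_normalize(values):
    """Divide each value by the sum of absolute values of the feature map."""
    norm = sum(abs(v) for v in values)
    return [v / norm for v in values] if norm > 0 else list(values)

def _sigmoid(v):
    # Numerically stable form for large negative inputs.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def sigmoid_normalize(values):
    """Squash every value into the open interval (0, 1)."""
    return [_sigmoid(v) for v in values]

def relu(values):
    """Zero out negative activations before normalizing."""
    return [max(v, 0.0) for v in values]

feature_map = [-2.0, 0.0, 1.0, 3.0]
print(l2_normalize(feature_map))
print(l1_normalize(relu(feature_map)))   # [0.0, 0.0, 0.25, 0.75]
print(sigmoid_normalize(feature_map))
```

Unlike the two norm-based options, the sigmoid acts on each value independently, so the result for one activation does not depend on the rest of the feature map.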

2.2.3 Dissimilarity measure

In order to quantify the difference between the latent representations of two images, we need to define a dissimilarity measure $d$ in the feature space, not necessarily limited to distance metrics. We expect that $d$, associated with the extracted features, is linked to the human perceptual dissimilarity.

Typically, the perceptual loss uses the $\ell_2$ norm (MSE) between features. In addition, we propose to use other dissimilarities such as the $\ell_1$ norm (MAE) and the binary cross-entropy (CE).
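A minimal sketch of the three dissimilarities on two flattened, already-normalized feature vectors is given below. The function names are hypothetical. For the cross-entropy, treating the first argument as the target and clipping the second to avoid `log(0)` are assumptions; the inputs are expected to lie in (0, 1), for instance after sigmoid normalization.

```python
import math

def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mae(a, b):
    """Mean absolute error between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def binary_cross_entropy(target, pred, eps=1e-7):
    """Mean binary cross-entropy; both inputs are expected in (0, 1)."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(target)

a = [0.2, 0.7]
b = [0.3, 0.5]
print(mse(a, b), mae(a, b))
print(binary_cross_entropy(a, a), binary_cross_entropy(a, b))
```

The cross-entropy of a vector with itself is strictly positive and the measure is not symmetric in its arguments, which is one way in which a dissimilarity measure can fall outside the class of distance metrics.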

2.3 MR-Perceptual loss

Classically, the perceptual loss [11] is composed of VGG [16] linear features, followed by an $\ell_2$ normalization, and the dissimilarity measure is the MSE. In the rest of the paper we will refer to this loss as the classical perceptual loss.

Based on these three main stages, we propose to change this classical perceptual loss [11] by first proposing a multi-scale and multi-statistic feature space. Our feature space is multi-scale because, instead of extracting the features at just one resolution, we extract the descriptor at two resolutions ($H \times W$ and $2H \times 2W$). Our descriptor is also multi-statistic since we concatenate quadratic and linear features for the standard resolution. We use the sigmoid function as the normalization function and then use a binary cross-entropy as the dissimilarity measure. The full process is illustrated in Fig. 1.

Figure 1: MR-Perceptual loss pipeline: from two images of size $H \times W$, we create 6 outputs (3 per image). This extracts more information than the classical perceptual loss.
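The pipeline of Fig. 1 can be sketched as follows. This is not the authors' implementation: all function names are hypothetical, `toy_blocks` replaces a trained network, `toy_upscale2x` uses pixel repetition where the text specifies bilinear interpolation, and summing the three cross-entropies is an assumption about how the outputs are aggregated.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))

def flatten(fmap):
    """Linear features of one H x W x C feature map (nested lists)."""
    return [v for row in fmap for pixel in row for v in pixel]

def gram(fmap):
    """Quadratic features: flattened C x C Gram matrix of one feature map."""
    channels = len(fmap[0][0])
    return [sum(p[i] * p[j] for row in fmap for p in row)
            for i in range(channels) for j in range(channels)]

def bce(target, pred, eps=1e-7):
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(target)

def mr_outputs(image, extract_blocks, upscale2x):
    """Three sigmoid-normalized outputs per image, as described for Fig. 1."""
    blocks = extract_blocks(image)                # standard resolution
    blocks_up = extract_blocks(upscale2x(image))  # upscaled resolution
    linear = [v for f in blocks for v in flatten(f)]
    quadratic = [v for f in blocks for v in gram(f)]
    linear_up = [v for f in blocks_up for v in flatten(f)]
    return [[sigmoid(v) for v in out] for out in (linear, quadratic, linear_up)]

def mr_perceptual(image_a, image_b, extract_blocks, upscale2x):
    outs_a = mr_outputs(image_a, extract_blocks, upscale2x)
    outs_b = mr_outputs(image_b, extract_blocks, upscale2x)
    # Assumption: the three per-output cross-entropies are simply summed.
    return sum(bce(a, b) for a, b in zip(outs_a, outs_b))

# Toy stand-ins so that the sketch runs without a trained network.
def toy_blocks(image):
    """Hypothetical 'network': block 1 is the image, block 2 its element-wise square."""
    squared = [[[v * v for v in pixel] for pixel in row] for row in image]
    return [image, squared]

def toy_upscale2x(image):
    """Pixel repetition; a real pipeline would use bilinear interpolation."""
    return [[pixel for pixel in row for _ in range(2)] for row in image for _ in range(2)]

image_ref = [[[0.1, 0.9], [0.4, 0.6]], [[0.8, 0.2], [0.5, 0.5]]]
image_dist = [[[0.2, 0.8], [0.4, 0.7]], [[0.6, 0.2], [0.5, 0.4]]]
d_same = mr_perceptual(image_ref, image_ref, toy_blocks, toy_upscale2x)
d_diff = mr_perceptual(image_ref, image_dist, toy_blocks, toy_upscale2x)
print(d_same, d_diff)
```

Because the cross-entropy of a vector with itself is not zero, `d_same` is positive; what matters for ranking two distorted images against a reference is only which of the two values is smaller.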

3 Experiments

In this section, we first introduce the dataset used (Sec. 3.1), then compare different feature spaces from different architectures (Sec. 3.2). Next, we discuss the importance of how the representation is learned (Sec. 3.3), and finally we perform an ablation study with our technique (Sec. 3.4).

3.1 Dataset

To evaluate the performance of perceptual metrics, we use the Two-Alternative Forced-Choice (2AFC) dataset [11]. The test sets include 36.3k triplets, each composed of one reference image and two distorted images, associated with a score in $[0, 1]$ defining the ground-truth perceptual dissimilarity. The score reflects the proportion of votes, cast by a human panel, for the chosen image in each triplet. Specifically, the ground truth is 0 if all the testers have chosen the first image and 1 if all the testers have chosen the second image.
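The scoring rule is not spelled out in the text. The sketch below assumes the rule commonly used with this kind of data: a metric earns, on each triplet, the fraction of human votes that agree with its own preference, and the "chosen" image is taken to be the one judged closer to the reference. The function names and the toy triplets are hypothetical.

```python
def two_afc_score(d0, d1, human_vote):
    """Agreement of a metric with human votes on one triplet.

    d0, d1: metric values between the reference and the first / second image.
    human_vote: fraction of testers who chose the second image (0 to 1).
    """
    if d0 < d1:          # the metric prefers the first image
        return 1.0 - human_vote
    if d1 < d0:          # the metric prefers the second image
        return human_vote
    return 0.5           # tie

def dataset_score(triplets):
    """Average agreement over all triplets, as a percentage."""
    return 100.0 * sum(two_afc_score(*t) for t in triplets) / len(triplets)

# Toy triplets (d0, d1, human_vote), not taken from the dataset.
triplets = [(0.10, 0.30, 0.0), (0.40, 0.20, 0.8), (0.25, 0.35, 0.6)]
score = dataset_score(triplets)
print(round(score, 2))
```

Under this rule a metric cannot reach 100 whenever the human votes themselves are split, which helps in reading the scores reported in the tables below.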

The dataset is organized into six groups according to the transformations applied to the distorted images:

  • Trad uses photometric and geometric transformations.
  • CNN uses transformations coming from DNNs, like denoising autoencoders.
  • SuperRes uses super-resolution algorithms on images coming from the NTIRE 2017 challenge [17].
  • Deblur uses images extracted from video clips [18], with video deblurring algorithms.
  • Color uses the output of image translation algorithms for image colorization applied on ImageNet [19].
  • FrameInterp uses different frame interpolation algorithms applied on the Davis Middlebury dataset [20].

3.2 Link between the IQA and the architectures

To evaluate the deep features from different networks, we extracted linear features from AlexNet [13], SqueezeNet [21], VGG [16], ResNet [22] and ViT [23]. All these networks are pretrained on ImageNet [19].

These architectures are organized in convolutional blocks followed by dense layers. Similarly to [11], we use the output of the five convolutional blocks to extract features, except for SqueezeNet, for which we use seven blocks.

Table 2 presents the results of a handcrafted technique (SSIM) and of classical perceptual losses on different DNN architectures. The first result is that perceptual losses clearly outperform SSIM (an average of 62.61 for SSIM against 63.54 to 68.95 for the perceptual losses), which supports the hypothesis of a better representation of human similarity perception. Moreover, the AlexNet feature space seems to perform better than those of the other architectures. Our interpretation is twofold: first, ImageNet accuracy is not necessarily linked to the quality of the feature space since the tasks are different; second, the deeper an architecture is, the worse it performs. This might be linked to the propagation of the perceptual information throughout all the layers, such that all layers are initially trained for the ImageNet task.

| Normalization | L2 | sigmoid | L2 | L2 | sigmoid | sigmoid | L2 | ReLU + L1 | sigmoid |
|---|---|---|---|---|---|---|---|---|---|
| Feature | Linear | Linear | Linear | Linear | Linear | Linear | Quadratic | Linear + Quadratic | Linear + Quadratic |
| Resolution | | | | | | | | | |
| Trad | 70.56 | 71.3 | 72.93 | 72.81 | 71.74 | 72.31 | 73.02 | 72.66 | 73.78 |
| CNN | 83.17 | 83.4 | 83.07 | 83.26 | 83.03 | 82.9 | 82.85 | 83.8 | 83.76 |
| SuperRes | 71.65 | 71.42 | 71.56 | 71.7 | 71.55 | 71.18 | 71.59 | 71.72 | 71.73 |
| Deblur | 60.68 | 61.27 | 60.64 | 60.8 | 60.67 | 60.48 | 60.99 | 61.48 | 61.37 |
| Color | 65.01 | 64.76 | 63.59 | 64.81 | 64.78 | 63.06 | 64.73 | 64.93 | 64.81 |
| FrameInterp | 62.65 | 63.55 | 62.02 | 62.49 | 62.56 | 62 | 62.08 | 63.82 | 63.46 |
| AVERAGE | 68.95 | 69.28 | 68.97 | 69.31 | 69.06 | 68.66 | 69.21 | 69.74 | 69.82 |

Table 1: Ablation study on AlexNet [13] pretrained on ImageNet with a supervised strategy.

| Datasets | SSIM | Alexnet | VGG | SqueezeNet | Resnet18 | Resnet50 | Resnet101 | VIT |
|---|---|---|---|---|---|---|---|---|
| Trad | 62.73 | 70.56 | 70.05 | 73.3 | 69.66 | 70.73 | 70.71 | 57.68 |
| CNN | 77.6 | 83.17 | 81.28 | 82.64 | 81.59 | 81.43 | 80.88 | 80.38 |
| SuperRes | 63.13 | 71.65 | 69.02 | 70.15 | 69.69 | 68.99 | 68.65 | 64.94 |
| Deblur | 54.23 | 60.68 | 59.05 | 60.13 | 59.8 | 58.9 | 58.9 | 58.93 |
| Color | 60.89 | 65.01 | 60.19 | 63.57 | 60.49 | 60.12 | 59.46 | 63.23 |
| FrameInterp | 57.11 | 62.65 | 62.11 | 61.98 | 62.54 | 61.33 | 61.93 | 56.09 |
| AVERAGE | 62.61 | 68.95 | 66.95 | 68.63 | 67.30 | 66.92 | 66.76 | 63.54 |
| ImageNet Top1 acc | NA | 63.30 | 74.50 | 57.50 | 73.19 | 77.15 | 80.9 | 77.91 |

Table 2: Comparative results of linear features from different DNN architectures. The first rows give the 2AFC score. The last row shows the Top-1 accuracy on ImageNet [19].
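The claim that ImageNet accuracy is not necessarily linked to the quality of the feature space can be checked directly on the last two rows of Table 2. The sketch below computes a Pearson correlation between the Top-1 accuracy and the average 2AFC score of the seven networks; the helper name `pearson` is hypothetical, and with only seven points the result should be read as indicative at most.

```python
import math

# Values copied from the AVERAGE and ImageNet Top1 rows of Table 2.
top1 = {"Alexnet": 63.30, "VGG": 74.50, "SqueezeNet": 57.50, "Resnet18": 73.19,
        "Resnet50": 77.15, "Resnet101": 80.9, "VIT": 77.91}
two_afc = {"Alexnet": 68.95, "VGG": 66.95, "SqueezeNet": 68.63, "Resnet18": 67.30,
           "Resnet50": 66.92, "Resnet101": 66.76, "VIT": 63.54}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

names = list(top1)
r = pearson([top1[n] for n in names], [two_afc[n] for n in names])
print(round(r, 2))
```

On these seven networks the coefficient is negative, so higher classification accuracy does not go together with a higher 2AFC score in this sample.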

3.3 Link between the IQA and the training procedure

In Section 3.2, the first experiments focused on the impact of the architecture on a classical perceptual metric. We now focus on the strategy used to train a DNN to obtain a good representation for perceptual comparison. For this purpose, we consider a ResNet50 architecture trained on ImageNet with the following training strategies: supervised training, DeepCluster [24], Dino [25], MoCo v2 [26], OBoW [27], SimCLR [28], SwAV [29], and finally a random initialization of the parameters. In Table 3, we compare the performance of ResNet50 with the different pretrained parameters, and we observe that supervised training obtains the best average score (66.92) and the best score on three of the six distortion groups (Trad, CNN and SuperRes); the randomly initialized network is best on Deblur and Color, and SwAV on FrameInterp. This suggests that a supervised procedure helps inject into the network information that is useful for the perceptual task.

| Dataset | Random | Supervised | Deepcluster | Dino | MoCo | Obow | SimCLR | SwAV |
|---|---|---|---|---|---|---|---|---|
| Trad | 58.54 | 70.73 | 68.73 | 69.78 | 67.91 | 69.74 | 68.97 | 68.77 |
| CNN | 80.07 | 81.43 | 80.21 | 80.03 | 78.3 | 79.04 | 79.12 | 79.74 |
| SuperRes | 65.97 | 68.99 | 66.7 | 66.25 | 67.17 | 65.88 | 67.43 | 66.48 |
| Deblur | 59.32 | 58.9 | 58.26 | 58.13 | 58.45 | 57.83 | 57.92 | 58.09 |
| Color | 63.03 | 60.12 | 55.92 | 56.12 | 56.17 | 55.81 | 56.37 | 56.26 |
| FrameInterp | 56.99 | 61.33 | 61.94 | 62.27 | 62.45 | 61.54 | 61.66 | 62.48 |
| AVERAGE | 63.99 | 66.92 | 65.29 | 65.43 | 65.08 | 64.97 | 65.25 | 65.3 |

Table 3: Comparative results showing the impact of supervised training on 2AFC [11] with the ResNet50 [22] architecture. We run a linear feature extraction with different pretraining conditions.

3.4 Importance of the different components for IQA

Based on the previous results in Sections 3.2 and 3.3, AlexNet trained on ImageNet in a supervised manner is the best choice to quantify perceptual dissimilarity. Table 4 reports the performance according to the block from which features are extracted: features from blocks 4 and 5 are the best for the Trad set, whereas features from blocks 2 and 3 perform better on the CNN, SuperRes, Deblur and Color distortions.

This suggests that some layers might focus on particular details in the distorted images, which provides clues for being invariant to some distortions.

| Dataset | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 | All |
|---|---|---|---|---|---|---|
| Trad | 59.15 | 69.35 | 71.97 | 72.92 | 73.29 | 70.56 |
| CNN | 81.65 | 82.88 | 82.94 | 82.84 | 82.03 | 83.17 |
| SuperRes | 65.94 | 71.6 | 71.63 | 71.19 | 70.6 | 71.65 |
| Deblur | 59.08 | 60.87 | 60.67 | 60.43 | 60.1 | 60.68 |
| Color | 62.82 | 64.73 | 64.42 | 63.88 | 63.9 | 65.01 |
| FrameInterp | 57.23 | 61.95 | 62.69 | 62.67 | 62.71 | 62.65 |
| AVERAGE | 64.31 | 68.56 | 69.05 | 68.99 | 68.77 | 68.95 |

Table 4: Comparative results showing the impact of the chosen layer on 2AFC [11] with the AlexNet [13] architecture. The bolded results show the best results among blocks.

Table 1 shows an ablation study evaluating the hyperparameters relevant to designing a novel perceptual metric, as detailed in Section 2.2. We consider the feature extraction strategy, the type of dissimilarity measure in the feature space, the normalization strategy, and finally the resolution. Multi-resolution seems to be the key to improving performance. In addition, multi-statistic features also seem to improve performance for certain distortions.

As shown in Table 5, the classic perceptual metric setup is outperformed on all the distortions; this remains true for all the networks, including Watching [30], Split-brain [31], Puzzle [32] and BiGAN [33].

| Dataset | Ours | Watching | Split-Brain | Puzzle | BiGAN |
|---|---|---|---|---|---|
| Trad | 73.8 | 66.5 | 69.5 | 71.5 | 69.8 |
| CNN | 83.8 | 80.7 | 81.4 | 82.0 | 83.0 |
| SuperRes | 71.7 | 69.6 | 69.6 | 70.2 | 70.7 |
| Deblur | 61.4 | 60.6 | 59.3 | 60.2 | 60.5 |
| Color | 64.9 | 64.4 | 64.3 | 62.8 | 63.7 |
| FrameInterp | 63.5 | 61.6 | 61.1 | 61.8 | 62.5 |
| AVERAGE | 69.8 | 67.2 | 67.5 | 68.1 | 68.4 |

Table 5: Comparison of our AlexNet with the MR-Perceptual loss against the setup presented in [11].

4 Conclusions

We empirically investigated general-purpose deep perceptual metrics w.r.t. different experimental settings on an IQA task. First, we show that it is unnecessary to use a deeper DNN with a complex architecture; a simple AlexNet is sufficient for perceptual metrics. Despite the convincing results of self-supervised methods, we show that a supervised strategy remains the best choice. Finally, we confirm that combining features at different resolutions is relevant, as it forces the DNN to be more robust against various types of distortion. Future work would involve combining this new perceptual metric with image-to-image translation DNNs to improve the quality of generated images.


  • [1] Guangtao Zhai and Xiongkuo Min, “Perceptual image quality assessment: a survey,” Science China Information Sciences, vol. 63, no. 11, pp. 211301, 2020.
  • [2] Vipin Kamble and KM Bhurchandi, “No-reference image quality assessment algorithms: A survey,” Optik, vol. 126, no. 11-12, pp. 1090–1097, 2015.
  • [3] Zhou Wang, “Applications of objective image quality assessment methods [applications corner],” IEEE signal processing magazine, vol. 28, no. 6, pp. 137–142, 2011.
  • [4] Oleg S Pianykh, Ksenia Pospelova, and Nick H Kamboj, “Modeling human perception of image quality,” Journal of digital imaging, vol. 31, no. 6, pp. 768–775, 2018.
  • [5] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [6] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV. Springer, 2016, pp. 694–711.
  • [7] LA Gatys, AS Ecker, and M Bethge, “A neural algorithm of artistic style,” in 16th Annual Meeting of the Vision Sciences Society (VSS 2016). Scholar One, Inc., 2016, p. 326.
  • [8] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek, “Semantic segmentation using adversarial networks,” in NIPS Workshop on Adversarial Training, 2016.
  • [9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
  • [10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [11] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
  • [12] Hossein Talebi and Peyman Milanfar, “Learned perceptual image enhancement,” in 2018 IEEE international conference on computational photography (ICCP). IEEE, 2018, pp. 1–13.
  • [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. 2012, vol. 25, Curran Associates, Inc.
  • [14] Robert M Haralick, Karthikeyan Shanmugam, and Its'Hak Dinstein, “Textural features for image classification,” IEEE Transactions on systems, man, and cybernetics, no. 6, pp. 610–621, 1973.
  • [15] Marcel Simon, Erik Rodner, and Joachim Denzler, “Imagenet pre-trained models with batch normalization,” arXiv preprint arXiv:1612.01452, 2016.
  • [16] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [17] Eirikur Agustsson and Radu Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPR workshops, 2017, pp. 126–135.
  • [18] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang, “Deep video deblurring for hand-held cameras,” in CVPR, 2017, pp. 1279–1288.
  • [19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [20] Daniel Scharstein and Richard Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1, pp. 7–42, 2002.
  • [21] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” 2016.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
  • [24] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, “Deep clustering for unsupervised learning of visual features,” 2019.
  • [25] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, “Emerging properties in self-supervised vision transformers,” 2021.
  • [26] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He, “Improved baselines with momentum contrastive learning,” 2020.
  • [27] Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez, “Obow: Online bag-of-visual-words generation for self-supervised learning,” in 2021 CVPR. IEEE, 2021, pp. 6826–6836.
  • [28] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” 2020.
  • [29] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” 2021.
  • [30] Deepak Pathak, Ross B. Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan, “Learning features by watching objects move,” CoRR, vol. abs/1612.06370, 2016.
  • [31] Richard Zhang, Phillip Isola, and Alexei A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” 2017.
  • [32] Mehdi Noroozi and Paolo Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” 2017.
  • [33] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell, “Adversarial feature learning,” CoRR, vol. abs/1605.09782, 2016.