Accurate and reliable prediction of visual quality is crucial for a wide range of applications in image processing, computer graphics and multimedia communication, including compression, enhancement and restoration [chandler2013seven]. However, while humans judge visual quality seamlessly, the computational prediction of human quality ratings, known as mean opinion scores (MOS), remains challenging. Computational approaches to quality estimation are typically categorized by the availability of the reference image into full-reference (FR), reduced-reference (RR) and no-reference (NR) models. The reference availability of a quality model determines its scope of application: in a compression scenario, for example, the reference signal is generally available [Bosse2019b], while in super-resolution or enhancement tasks it might not even exist [chandler2013seven].
The typical design of quality models consists of several stages, including feature extraction, feature pooling, and a regression that maps the (pooled) features to a scalar quality estimate. Their technological development broadly follows the development in other fields of computer vision, leading from approaches centered around feature engineering via feature learning-based methods to today’s end-to-end learned models. Classical models for perceptual quality prediction are based on sophisticated, handcrafted features that either imitate specific characteristics of the human visual system or capture the statistics of natural images and their perceptually relevant deviations. One of the most prominent and seminal examples of scene statistic-based FR quality estimation is SSIM [wang2004image], which inspired many other models [Wang2003, zhang2011fsim]. Feature engineering approaches led to remarkable success, especially in the domain of FR quality estimation. However, it is often speculated that the underlying domain knowledge is too limited or incomplete to capture the full complexity of human visual (quality) perception. This led to quality models leveraging methods of feature learning that allowed advances in RR and NR quality estimation. Some models borrow ideas from both feature engineering and feature learning, as quality features correspond to learned parameters of the (assumed) probability function of certain engineered features [Mittal2012, Saad2012]. Models qualifying for an unmistakable definition of feature learning employ dictionary learning [Ye2012a, Zhang2015] or other learning techniques [SPCA, Zhang2018, Bhardwaj2020]. Here, features are learned in an unsupervised [Ye2012a, Zhang2015, SPCA] or self-supervised [Bhardwaj2020] fashion or based on a task different from quality estimation [Zhang2018, chetouani2020image]. With the success of deep learning in other fields of computer vision, researchers started to devise end-to-end optimized neural network-based approaches to FR, RR and NR quality estimation [Bosse2018, ding2020image, Prashnani2018, Bosse2018distsens]. Although the end-to-end design blurs the lines between feature extraction, pooling and regression, early to middle layers, foremost convolutional layers, can be regarded as implementing feature extractors, while later layers, mostly fully connected layers, can be considered to implement the regression.
Regardless of whether features are handcrafted from domain knowledge or learned by the model, a common underlying assumption appears to be that plausible features for image quality prediction relate to properties of human visual perception or natural scene statistics. In this paper, we reevaluate this hypothesis. To our surprise, we find that even feature extractors constructed from pure noise suffice to optimize a linear regression that achieves high correlation with human quality ratings, similar to a regression from learned features. In subsequent analyses we relate this result to the lottery ticket hypothesis [Ramanujan_2020_CVPR] and the double descent effect [belkin2019reconciling] that have recently been described in the machine learning literature. The computational framework for our study is introduced in Sec. 2 and details of the experimental setup are described in Sec. 3. Results and subsequent analyses are reported in Sec. 4 and our conclusions are derived in Sec. 5.
2 Computational framework
The computational framework for our evaluation is inspired by CORNIA [Ye2012a], one of the earliest visual quality models based on feature learning. CORNIA offers an attractive basis for our experiments as on the one hand it achieves high correlations with human quality ratings but on the other hand follows an elegant yet comparatively simple architecture. In particular, feature extractors and regression model are decoupled and optimized successively, allowing us to inject and evaluate different feature extractors irrespective of the rest of the model.
For a given image $I$, CORNIA starts by extracting a set of descriptors $X = \{x_1, \dots, x_N\}$ with each descriptor $x_j \in \mathbb{R}^{hw}$ corresponding to a randomly sampled local image patch of $h \times w$ pixels. Each descriptor is standardized and the set of descriptors is whitened by zero component analysis (ZCA). Preprocessed descriptors from a set of training images are clustered into $K$ centroids via k-means. All centroids are normalized to unit length and stored in a matrix $D \in \mathbb{R}^{K \times hw}$ which represents a visual codebook. The codebook is used to extract features from a given image. Specifically, each local descriptor $x_j$ of an image is encoded as $D x_j$. In addition, CORNIA applies a non-linear soft-encoding function to further process the extracted features. Let $s_{ij} = d_i^\top x_j$ denote the dot product between the $i$-th centroid $d_i$ and the $j$-th descriptor $x_j$. The $j$-th soft-encoded descriptor is then given by

$$c_j = [\max(s_{1j}, 0), \dots, \max(s_{Kj}, 0), \max(-s_{1j}, 0), \dots, \max(-s_{Kj}, 0)]^\top.$$

The soft-encoding step results in a matrix $C = [c_1, \dots, c_N] \in \mathbb{R}^{2K \times N}$ which is reduced to a feature vector $f \in \mathbb{R}^{2K}$ in the final step of feature extraction, with $f_i = \max_j C_{ij}$, i.e., $f$ corresponds to the row-wise maxima of $C$. Finally, feature vectors are mapped onto perceptual quality scores via a support vector regression. Whereas the codebook is optimized unsupervisedly, optimizing the regression requires quality annotations.
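The encoding and pooling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `cornia_features`, the row-wise array layout and the toy data are assumptions, and the preprocessing (standardization, ZCA whitening) is taken as already done.

```python
import numpy as np

def cornia_features(descriptors, codebook):
    """Soft-encode local descriptors against a codebook and max-pool them.

    descriptors: array of shape (N, d), one preprocessed patch per row.
    codebook:    array of shape (K, d), one unit-length code per row.
    Returns a feature vector of length 2K.
    """
    # Dot product between every code and every descriptor, shape (K, N).
    s = codebook @ descriptors.T
    # Soft-encoding: keep the positive and the negative part of each response.
    encoded = np.concatenate([np.maximum(s, 0.0), np.maximum(-s, 0.0)], axis=0)
    # Max pooling over descriptors, i.e. the row-wise maxima, shape (2K,).
    return encoded.max(axis=1)

# Toy example (hypothetical data): 200 descriptors of 7x7 = 49 pixels, 16 codes.
rng = np.random.default_rng(0)
toy_descriptors = rng.standard_normal((200, 49))
toy_codebook = rng.standard_normal((16, 49))
toy_codebook /= np.linalg.norm(toy_codebook, axis=1, keepdims=True)

features = cornia_features(toy_descriptors, toy_codebook)
print(features.shape)  # (32,)
```

Because each code contributes one pooled positive and one pooled negative response, a codebook with 10000 codes yields 20000 features per image.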
The computational architecture of CORNIA can also be interpreted as a shallow neural network with a single hidden layer (codebook), a non-linear activation function (soft-encoding function) and a fully-connected output layer (SVR). As such it directly relates to more modern neural network-based visual quality models which are usually much deeper, contain many more parameters and are optimized end-to-end.
2.1 Random codebooks
As a baseline to a learned codebook of image prototypes, we construct a codebook by randomly sampling patches from natural images.
Patches are individually standardized but no whitening is applied to the codebook. This feature model hence represents (some) natural image statistics but contains no learned parameters.
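A codebook of this kind could be assembled roughly as follows. The sketch assumes a grayscale image given as a 2-D NumPy array; the helper names `sample_patches` and `standardize` and the small constant `eps` that guards against flat patches are illustrative choices not taken from the text.

```python
import numpy as np

def sample_patches(image, n_patches, size=7, rng=None):
    """Sample n_patches random size x size patches and flatten them to rows."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    rows = rng.integers(0, height - size + 1, n_patches)
    cols = rng.integers(0, width - size + 1, n_patches)
    patches = [image[r:r + size, c:c + size].ravel() for r, c in zip(rows, cols)]
    return np.stack(patches).astype(np.float64)

def standardize(patches, eps=1e-8):
    """Standardize each patch individually to zero mean and unit variance."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)  # eps is an assumption for flat patches

# Hypothetical stand-in for a natural grayscale image.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
patch_codebook = standardize(sample_patches(image, n_patches=100, rng=rng))
print(patch_codebook.shape)  # (100, 49)
```

No whitening step appears here, in line with the description above.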
As a second baseline, we construct three codebooks from i.i.d. samples obtained respectively from a Normal, Laplace or Uniform distribution. Random codebooks represent neither knowledge of the human visual system nor of natural image statistics.
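As an illustration, noise codebooks of this kind can be drawn with the frozen-free distribution objects in `scipy.stats`. The function name `noise_codebook` is hypothetical, and the sketch assumes the default location and scale of each distribution; the text does not state which parameters were used or whether the codes were centered or normalized afterwards.

```python
import numpy as np
from scipy import stats

def noise_codebook(distribution, n_codes=10000, dim=49, seed=0):
    """Draw an i.i.d. noise codebook of shape (n_codes, dim)."""
    samplers = {
        "normal": stats.norm,      # default: mean 0, standard deviation 1
        "laplace": stats.laplace,  # default: location 0, scale 1
        "uniform": stats.uniform,  # default: uniform on [0, 1)
    }
    return samplers[distribution].rvs(size=(n_codes, dim), random_state=seed)

codebooks = {name: noise_codebook(name) for name in ("normal", "laplace", "uniform")}
for name, codes in codebooks.items():
    print(name, codes.shape)
```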
3 Experimental setup
To evaluate the performance of different codebook models, we follow the experimental setup of [Ye2012a].
A test image is converted to grayscale and represented by 10000 randomly sampled 7x7 patches.
Features are then extracted with one of the codebook models.
In the case of CORNIA, we use the original codebook, which contains 10000 codes that were optimized on 7x7 patches extracted from distorted images of the CSIQ database [larson2010most].
For our first baseline model, we use randomly sampled 7x7 patches from the reference images of CSIQ to construct a codebook of 10000 patches.
For the second baseline model, random noise codebooks are generated with the respective sampling functions from scipy. All codebooks have the same number of parameters ($10000 \times 49 = 490000$).
For the support vector regression, we use the NuSVR model as provided in scikit-learn with a linear kernel and otherwise default parameters ($C = 1.0$, $\nu = 0.5$). The SVR is optimized on the LIVE database [sheikh2006statistical], which we split by reference image into a training set comprising the distorted images associated with 23 (approximately 80%) reference images and a test set containing the distorted images of the remaining 6 reference images. Before training the SVR, features of the training set are individually scaled to the range [-1, 1]. The corresponding scaling parameters are also applied to the test data. To obtain reliable results, we repeat all experiments on 10 random data splits. We also evaluate the trained models on the TID2013 [ponomarenko2015image] and CSIQ [larson2010most] databases. TID2013 and LIVE share some of their reference images, but we ensure that the TID2013 test data contains only images that were not part of the training set. Code for all experiments is available at https://github.com/fraunhoferhhi/CuriouslyEffectiveIQE.
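The training protocol can be sketched with scikit-learn as follows. The data are synthetic stand-ins (29 hypothetical reference images with 5 distorted versions each), and the use of `GroupShuffleSplit` and `MinMaxScaler` is an assumption about how the split by reference image and the scaling might be realized; `C=1.0` and `nu=0.5` are the scikit-learn defaults for `NuSVR`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVR

# Synthetic stand-in data: features, quality scores and reference-image ids.
rng = np.random.default_rng(0)
n_refs, n_distorted, n_features = 29, 5, 40
groups = np.repeat(np.arange(n_refs), n_distorted)
X = rng.standard_normal((n_refs * n_distorted, n_features))
mos = X[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(len(X))

# 10 random splits; an integer test_size is the number of held-out groups.
splitter = GroupShuffleSplit(n_splits=10, test_size=6, random_state=0)
split_sizes = []
for train_idx, test_idx in splitter.split(X, mos, groups):
    # Fit the per-feature scaling on the training set only, then reuse it.
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X[train_idx])
    model = NuSVR(kernel="linear", C=1.0, nu=0.5)
    model.fit(scaler.transform(X[train_idx]), mos[train_idx])
    predictions = model.predict(scaler.transform(X[test_idx]))
    # Record how many reference images ended up on each side of the split.
    split_sizes.append((len(set(groups[train_idx])), len(set(groups[test_idx]))))

print(split_sizes[0])  # (23, 6)
```

Splitting by group keeps all distorted versions of one reference image on the same side of the split, so no image content leaks from training to test data.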
4 Results
Average test set performances on the LIVE database are presented in Table 1 in terms of Pearson and Spearman rank-order correlations with MOS values. Models are trained on the full LIVE database, but we evaluate them on both the full database and distortion-type-specific subsets to assess the generalization across the available distortion types. We observe that the performance differs only marginally between all models, both for the full dataset and for the distortion-specific subsets. Remarkably, even codebooks of image patches or noise achieve high correlations, comparable to those achieved with a learned codebook. This result is surprising as it questions the importance of learning or capturing perceptually relevant features. In the following, we further analyze this finding.
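Both correlation measures are available in `scipy.stats`. The sketch below uses made-up predictions and scores; the helper name `correlation_scores` is hypothetical, and it assumes that the correlations are computed directly on the predictions without a prior non-linear mapping, which the text does not specify.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_scores(predictions, mos):
    """Return the Pearson and Spearman rank-order correlation with MOS."""
    plcc = pearsonr(predictions, mos)[0]
    srocc = spearmanr(predictions, mos)[0]
    return plcc, srocc

# Made-up example: predictions that are monotone but non-linear in MOS.
mos_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = mos_values ** 3
plcc, srocc = correlation_scores(predicted, mos_values)
print(plcc, srocc)
```

In this example the rank-order correlation is perfect while the Pearson correlation is below one, which shows why the two measures are usually reported together.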
4.1 Cross database evaluation
In our first analysis we assess whether our findings generalize beyond the LIVE dataset and evaluate the trained models on TID2013 and CSIQ. Results in Table 2 show that correlations on full databases drop dramatically for both TID2013 and CSIQ. This can be explained by the fact that CSIQ and especially TID2013 contain additional, much more diverse distortion types that were not part of the training data. We also note performance declines for evaluations on individual distortion types that are part of the training set. However, except for additive white Gaussian noise (awgn) distortions, correlations on distortion types that were part of the training set are still fairly strong for all models. In particular, codebooks from patches or noise perform overall on par with learned codebooks, demonstrating that our finding is not limited to the LIVE database. Interestingly, even though CORNIA and the codebook from patches were assembled from CSIQ, this does not lead to better generalization performance on the CSIQ database.
4.2 How well do the best codes perform?
For a linear SVR whose input features are all scaled to the same range, the importance of individual features can be assessed by the absolute value of the learned SVR coefficients. As depicted in Fig. 2, there is a relatively small and approximately equal number of highly important features in all models.
Analyzing the SVR coefficients allows us to select the subset of most important features per model. We vary the number of selected features to collect subsets of different sizes and train an SVR for each of these using the same hyperparameters as in Sec. 3. We repeat this experiment 10 times on the same random train/test splits of the LIVE database as before. The average Pearson correlation is depicted in Fig. 4 as a function of the number of selected features. For all models the performance follows qualitatively the same curve: initially the correlation increases with the number of features, then descends again, before it finally increases and saturates at a global maximum. This behavior strongly resembles the double descent effect that has been described for various classes of models, including linear regression [belkin2019reconciling, mei2019generalization]. In particular, it has been shown that the test set performance of many models as a function of the number of parameters reaches a minimum at the interpolation point, the point at which the number of learned parameters matches the number of training samples (in our case, considering only SVR parameters). In addition, these models often reach their peak performance in highly overparameterized regimes. Both of these characteristics are evident in our results for all models.
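The selection procedure can be sketched as follows: fit a linear SVR on all features, rank the features by the absolute value of their coefficients, and retrain on the top-ranked subset. The data are synthetic, with only the first three features carrying signal, and the helper name `top_k_features` is hypothetical.

```python
import numpy as np
from sklearn.svm import NuSVR

def top_k_features(model, k):
    """Indices of the k features with the largest absolute SVR coefficients."""
    importance = np.abs(model.coef_).ravel()
    return np.argsort(importance)[::-1][:k]

# Synthetic features already scaled to [-1, 1]; only features 0-2 are informative.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.01 * rng.standard_normal(200)

# Rank features with a model trained on all of them, then retrain on a subset.
full_model = NuSVR(kernel="linear", C=1.0, nu=0.5).fit(X, y)
selected = top_k_features(full_model, k=3)
subset_model = NuSVR(kernel="linear", C=1.0, nu=0.5).fit(X[:, selected], y)
print(sorted(selected.tolist()))
```

Repeating the last two lines for a range of values of `k` and recording the test correlation for each gives a curve of the kind discussed above. The coefficient ranking is only meaningful because all features share the same scale.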
For small numbers of features, CORNIA consistently outperforms random codebooks by a considerable margin. However, the overall best performance is obtained with large numbers of features, with differences between models becoming negligible for 2000 or more features. Complementarily, we can also train models on the 18000 least important features. In this case, the largest deviation from the respective peak Pearson correlations of any model is less than 0.02, demonstrating that even the less important features can lead to accurate predictions.
4.3 Spectral analysis
Although the most important codes of CORNIA are more effective than random codes, the overall similarity in performance is still remarkable. Did we, by chance, generate feature extractors that capture perceptually relevant characteristics? To assess this hypothesis, we perform a spectral analysis of those codes of a model that generate the most important features according to their regression coefficients. We select the codes corresponding to the 1500 most important features per model and divide them into five mutually exclusive subsets. The first subset contains the codes corresponding to the 300 most important features; the last subset contains the codes of the 1201st to 1500th most important features. Power spectra, radially averaged in the frequency domain and averaged across all codes of a subset, are presented in Fig. 11; for a better comparison we normalized spectra per model. For the learned codebook of CORNIA, the power spectrum reveals a distinctive curve on all subsets that resembles the band-pass characteristic of the human contrast sensitivity function [Campbell1968]. The codebook constructed from random patches exhibits, on all subsets, the spectral shape well known for natural images [Tolhurst1992]. The shapes of the power spectra of random noise codebooks, on the other hand, vary across the subsets. The spectra for the most important codes show a peak for low (non-dc) frequencies that coincides approximately with the peak in the learned codebook and the codebook of patches. As the codes become less important, the peak levels off and the shape approaches the expected flat white noise spectrum. We conclude that some of the random noise feature extractors indeed capture features similar to those of the learned codebook, which partially explains their surprisingly good performance. Interestingly, by considering Fig. 4 and Fig. 11 we observe that the similarity between learned codes and random codes decreases in exactly the regime of overparameterization in which all models reach the highest correlations with MOS.
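A radially averaged power spectrum of a set of 7x7 codes can be computed along the following lines. This is a sketch under assumptions: radii are binned to the nearest integer distance from the dc component, the function name `radial_power_spectrum` is hypothetical, and the per-model normalization mentioned above is left out.

```python
import numpy as np

def radial_power_spectrum(codes, size=7):
    """Radially averaged power spectrum, averaged over all codes.

    codes: array of shape (n_codes, size * size).
    Returns one value per integer radius, starting with the dc component.
    """
    patches = codes.reshape(-1, size, size)
    # 2-D FFT per code, with the dc component shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(patches), axes=(-2, -1))
    mean_power = (np.abs(spectrum) ** 2).mean(axis=0)
    # Integer distance of every frequency bin from the dc bin.
    center = size // 2
    yy, xx = np.indices((size, size))
    radius = np.rint(np.hypot(yy - center, xx - center)).astype(int)
    # Average the power over all bins that share the same radius.
    sums = np.bincount(radius.ravel(), weights=mean_power.ravel())
    counts = np.bincount(radius.ravel())
    return sums / counts

# White Gaussian noise codes are expected to give an approximately flat spectrum.
rng = np.random.default_rng(0)
noise_codes = rng.standard_normal((5000, 49))
noise_spectrum = radial_power_spectrum(noise_codes)
print(noise_spectrum.shape)  # (5,)
```

Applying the function to subsets of codes ordered by importance would give one curve per subset, which is the kind of comparison described in this section.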
5 Conclusion
In this paper we reevaluated the importance of perceptually relevant features for visual image quality prediction. To our surprise, we found that even random feature extractors can lead to high correlations between quality predictions and human quality ratings for a linear regression model. This effect generalizes across distortion types and datasets. In our analyses we have identified two underlying principles for this result. Firstly, by analyzing the power spectra of visual codebooks we found that the most important feature extractors capture similar signal aspects in all models. At first sight it may seem surprising that random noise feature extractors capture perceptually relevant aspects; however, we conjecture that this is a lottery effect [Ramanujan_2020_CVPR] due to the large number of feature extractors (10000) relative to the feature dimension (49). Secondly, we showed that the performance of all models depends critically on the number of features, with peak performances being achieved in the highly overparameterized regime. Interestingly, in the overparameterized regime, the similarity between learned and random feature extractors decreases markedly. Taken together, we conclude that visual quality models benefit from feature extractors that capture perceptually relevant aspects, yet having sufficiently many feature extractors can not only compensate for but even outperform a smaller set of individually better feature extractors.
The present study is limited to a single computational model. However, as described in Sec. 2, this model directly relates to more sophisticated neural networks that constitute the current state of the art in the field. In future work we intend to assess the generalization of our results to state-of-the-art neural networks as well as the generalization from the no-reference to the full-reference setting. In addition, we aim to further analyze the capabilities of random feature extractors and are interested in whether these can also be used to optimize image processing systems or merely to monitor visual quality.