1 Predicting discrimination thresholds
Suppose we have a model for human visual representation, defined by conditional density $p(\vec{r}|\vec{x})$, where $\vec{x}$ is an $N$-dimensional vector containing the image pixels, and $\vec{r}$ is an $M$-dimensional random vector representing responses internal to the visual system (e.g., firing rates of a population of neurons). If the image is modified by the addition of a distortion vector, $\vec{x} + \alpha\hat{u}$, where $\hat{u}$ is a unit vector, and scalar $\alpha$ controls the amplitude of distortion, the model can be used to predict the threshold at which the distorted image can be reliably distinguished from the original image. Specifically, one can express a lower bound on the discrimination threshold in direction $\hat{u}$ for any observer or model that bases its judgments on $\vec{r}$ (Seriès et al. (2009)):
$$T(\hat{u}; \vec{x}) \geq \beta \sqrt{\hat{u}^T J^{-1}[\vec{x}]\, \hat{u}} \tag{1}$$
where $\beta$ is a scale factor that depends on the noise amplitude of the internal representation (as well as experimental conditions, when measuring discrimination thresholds of human observers), and $J[\vec{x}]$ is the Fisher information matrix (FIM; Fisher (1925)), a second-order expansion of the log likelihood:
$$J[\vec{x}] = \mathbb{E}_{\vec{r}|\vec{x}}\left[ \left( \frac{\partial}{\partial \vec{x}} \log p(\vec{r}|\vec{x}) \right) \left( \frac{\partial}{\partial \vec{x}} \log p(\vec{r}|\vec{x}) \right)^T \right] \tag{2}$$
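Here, we restrict ourselves to models that can be expressed as a deterministic (and differentiable) mapping from the input pixels to mean output response vector, $f(\vec{x})$, with additive white Gaussian noise in the response space. The log likelihood in this case reduces to a quadratic form:
$$\log p(\vec{r}|\vec{x}) = -\frac{1}{2} \left( \vec{r} - f(\vec{x}) \right)^T \left( \vec{r} - f(\vec{x}) \right) + \text{const.}$$
Substituting this into Eq. (2) gives:
$$J[\vec{x}] = \frac{\partial f}{\partial \vec{x}}^T \frac{\partial f}{\partial \vec{x}}$$
Thus, for these models, the Fisher information matrix induces a locally adaptive Euclidean metric on the space of images, as specified by the Jacobian matrix, $\partial f / \partial \vec{x}$.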
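To make the Gaussian-noise case concrete, the following minimal sketch builds the FIM of a small model as the product of its transposed Jacobian with its Jacobian, and evaluates the threshold bound (scale factor times the square root of the quadratic form in the inverse FIM) along two pixel directions. The two-pixel, three-response `model`, the finite-difference `jacobian`, and the closed-form 2x2 inverse are illustrative assumptions chosen to keep the example self-contained; they are not the procedure used for the networks studied here.

```python
import math

def model(x):
    """Hypothetical toy response function f(x): two pixels -> three responses."""
    x0, x1 = x
    return [math.tanh(x0 + 0.5 * x1), x0 * x1, 0.1 * x1]

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian df/dx as an M x N nested list."""
    m, n = len(f(x)), len(x)
    cols = []
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        cols.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    # cols[j][i] holds d f_i / d x_j; transpose so that rows index responses.
    return [[cols[j][i] for j in range(n)] for i in range(m)]

def fisher_information(f, x):
    """FIM = (df/dx)^T (df/dx), assuming unit-variance additive Gaussian noise."""
    jac = jacobian(f, x)
    n = len(x)
    return [[sum(row[a] * row[b] for row in jac) for b in range(n)]
            for a in range(n)]

def threshold_bound(fim, u, beta=1.0):
    """beta * sqrt(u^T FIM^{-1} u) for a 2x2 FIM and a unit vector u."""
    (a, b), (c, d) = fim
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    quad = sum(u[i] * inv[i][k] * u[k] for i in range(2) for k in range(2))
    return beta * math.sqrt(quad)

x = [0.3, -0.2]                      # hypothetical two-pixel "image"
fim = fisher_information(model, x)
print(fim)
print(threshold_bound(fim, [1.0, 0.0]), threshold_bound(fim, [0.0, 1.0]))
```

For image-sized inputs the FIM would not be formed or inverted explicitly as it is here; the explicit inverse is only practical because the toy input has two dimensions.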
1.1 Extremal eigen-distortions
The FIM is generally too large to be stored in memory or inverted. Even if we could store and invert it, the high dimensionality of input (pixel) space renders the set of possible distortions too large to test experimentally. We resolve both of these issues by restricting our consideration to the most- and least-noticeable distortion directions, corresponding to the eigenvectors of $J[\vec{x}]$ with largest and smallest eigenvalues, respectively. First, note that if a distortion direction $\hat{e}$ is an eigenvector of $J[\vec{x}]$ with associated eigenvalue $\lambda$, then it is also an eigenvector of $J^{-1}[\vec{x}]$ (with eigenvalue $1/\lambda$), since the FIM is symmetric and positive semi-definite. In this case, Eq. (1) becomes
$$T(\hat{e}; \vec{x}) \geq \beta / \sqrt{\lambda}$$
If human discrimination thresholds attain this bound, or are a constant multiple above it, then the ratio of discrimination thresholds along two different eigenvectors is the square root of the inverse ratio of their associated eigenvalues. In this case, the strongest prediction arising from a given model is the ratio of the extremal (maximal and minimal) eigenvalues of its FIM, which can be compared to the ratio of human discrimination thresholds for distortions in the directions of the corresponding extremal eigenvectors (Fig. 1).
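As a worked illustration, write $\lambda_m$ and $\lambda_l$ for the maximal and minimal eigenvalues of the FIM, $\hat{e}_m$ and $\hat{e}_l$ for the corresponding eigenvectors, and $\beta$ for the scale factor of the bound. The eigenvalues $\lambda_m = 10^{2}$ and $\lambda_l = 10^{-2}$ below are hypothetical, chosen only to make the arithmetic easy to follow, and the calculation assumes that thresholds attain the bound (or sit a constant multiple above it):

```latex
\frac{T(\hat{e}_{l}; \vec{x})}{T(\hat{e}_{m}; \vec{x})}
  = \frac{\beta / \sqrt{\lambda_{l}}}{\beta / \sqrt{\lambda_{m}}}
  = \sqrt{\frac{\lambda_{m}}{\lambda_{l}}}
  = \sqrt{\frac{10^{2}}{10^{-2}}}
  = 10^{2},
\qquad
\log \frac{T(\hat{e}_{l}; \vec{x})}{T(\hat{e}_{m}; \vec{x})} = \log 100 \approx 4.61 .
```

The scale factor cancels in the ratio, which is why the predicted threshold ratio can be compared with human data without knowing the noise amplitude.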
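Although the FIM cannot be stored, it is straightforward to compute its product with an input vector (i.e., an image). Using this operation, we can solve for the extremal eigenvectors using the well-known power iteration method (von Mises and Pollaczek-Geiringer (1929)). Specifically, to obtain the maximal eigenvalue of the FIM of a given function $f$ and its associated eigenvector ($\lambda_m$ and $\hat{e}_m$, respectively), we start with a vector consisting of white noise, $\hat{e}_m^{(0)}$, and then iteratively apply the FIM, renormalizing the resulting vector, until convergence:
$$\lambda_m^{(k+1)} = \left\| J[\vec{x}]\, \hat{e}_m^{(k)} \right\|; \qquad \hat{e}_m^{(k+1)} = \frac{J[\vec{x}]\, \hat{e}_m^{(k)}}{\lambda_m^{(k+1)}}$$
To obtain the minimal eigenvector, $\hat{e}_l$, we perform a second iteration using the FIM with the maximal eigenvalue subtracted from the diagonal:
$$\lambda_l^{(k+1)} = \left\| \left( J[\vec{x}] - \lambda_m I \right) \hat{e}_l^{(k)} \right\|; \qquad \hat{e}_l^{(k+1)} = \frac{\left( J[\vec{x}] - \lambda_m I \right) \hat{e}_l^{(k)}}{\lambda_l^{(k+1)}}$$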
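The two-stage power iteration described above can be sketched as follows. The code only ever calls a function that multiplies a vector by the FIM, so the matrix itself is never formed. The explicit 3x3 Jacobian `jac`, the helper names, and the stopping rule are hypothetical choices for illustration; for a real model the FIM-vector product would come from whatever mechanism computes Jacobian products for that model. Note that the second iteration converges to the magnitude of the shifted eigenvalue, so the sketch recovers the minimal eigenvalue as the maximal eigenvalue minus that magnitude.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def make_fim_vec(jac):
    """Return v -> J^T (J v) for an explicit Jacobian, never forming the FIM."""
    def fim_vec(v):
        jv = [sum(r * a for r, a in zip(row, v)) for row in jac]    # J v
        return [sum(row[j] * jv[i] for i, row in enumerate(jac))    # J^T (J v)
                for j in range(len(v))]
    return fim_vec

def power_iteration(apply_op, n, iters=5000, tol=1e-12, seed=0):
    """Magnitude of the dominant eigenvalue, and its unit eigenvector,
    for a symmetric linear operator given only as a function."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]    # white-noise start
    scale = norm(e)
    e = [a / scale for a in e]
    lam = 0.0
    for _ in range(iters):
        w = apply_op(e)
        new_lam = norm(w)
        if new_lam == 0.0:             # operator annihilates e
            return 0.0, e
        e = [a / new_lam for a in w]   # renormalize
        converged = abs(new_lam - lam) <= tol * new_lam
        lam = new_lam
        if converged:
            break
    return lam, e

def extremal_eigen_distortions(fim_vec, n):
    """Most- and least-noticeable directions as (eigenvalue, eigenvector) pairs."""
    lam_max, e_max = power_iteration(fim_vec, n, seed=0)

    def shifted(v):                    # (FIM - lam_max * I) v
        return [a - lam_max * b for a, b in zip(fim_vec(v), v)]

    gap, e_min = power_iteration(shifted, n, seed=1)
    return (lam_max, e_max), (lam_max - gap, e_min)

# Hypothetical Jacobian whose FIM has eigenvalues 4, 1 and 0.01.
jac = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.1]]
(lam_max, e_max), (lam_min, e_min) = extremal_eigen_distortions(make_fim_vec(jac), 3)
print(lam_max, lam_min)
```

The eigenvectors are determined only up to sign, which does not matter here because a distortion and its negation span the same direction.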
1.2 Measuring human discrimination thresholds
For each model under consideration, we synthesized extremal eigen-distortions for 6 images from the Kodak image set (downloaded from http://www.cipr.rpi.edu/resource/stills/kodak.html). We then estimated human thresholds for detecting these distortions using a two-alternative forced-choice task. On each trial, subjects were shown (for one second each, with a half-second blank screen between images, and in randomized order) a photographic image (18 degrees across), $\vec{x}$, and the same image distorted using one of the extremal eigenvectors, $\vec{x} + \alpha\hat{e}$, and then asked to indicate which image appeared more distorted. This procedure was repeated for 120 trials for each distortion vector, $\hat{e}$, over a range of $\alpha$ values, with ordering chosen by a standard psychophysical staircase procedure. The proportion of correct responses, as a function of $\alpha$, was fit with a cumulative Gaussian function, and the subject’s detection threshold, $T_s(\hat{e}; \vec{x})$, was estimated as the value of $\alpha$ for which the subject could distinguish the distorted image 75% of the time. We computed the natural logarithm of the ratio of these detection thresholds for the minimal and maximal eigenvectors, and averaged this over images (indexed by $i$) and subjects (indexed by $s$):
$$D(f) = \frac{1}{S} \frac{1}{I} \sum_{s=1}^{S} \sum_{i=1}^{I} \log \frac{T_s(\hat{e}_{li}; \vec{x}_i)}{T_s(\hat{e}_{mi}; \vec{x}_i)}$$
where $T_s$ indicates the threshold measured for human subject $s$. $D(f)$ provides a measure of a model’s ability to predict human performance with respect to distortion detection: the ratio of thresholds for model-generated extremal distortions will be larger for models that are more similar to the human subjects (Fig. 1).
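The averaging step can be sketched in a few lines. Here `log_threshold_ratio` takes, for each subject and image, the pair of thresholds measured for the least- and most-noticeable eigen-distortions and returns the mean natural-log ratio. The helper `threshold_from_fit` reads a 75% point off a fitted curve under the assumption that the psychometric function is a plain cumulative Gaussian with mean `mu` and standard deviation `sigma`; the exact parameterization of the fit is not specified in the text, so that part is only one plausible reading. All threshold values are invented for illustration.

```python
import math
from statistics import NormalDist

def threshold_from_fit(mu, sigma, criterion=0.75):
    """Amplitude at which a fitted cumulative Gaussian reaches the criterion.

    Assumes the psychometric function is NormalDist(mu, sigma).cdf(alpha).
    """
    return NormalDist(mu, sigma).inv_cdf(criterion)

def log_threshold_ratio(thresholds):
    """Mean over subjects and images of log(T_least / T_most).

    thresholds[s][i] is a pair: (threshold for the minimal eigenvector,
    threshold for the maximal eigenvector) for subject s and image i.
    """
    logs = [math.log(t_least / t_most)
            for per_subject in thresholds
            for t_least, t_most in per_subject]
    return sum(logs) / len(logs)

# Hypothetical thresholds for 2 subjects x 2 images (arbitrary amplitude units).
thresholds = [
    [(8.0, 0.5), (6.0, 0.4)],
    [(7.5, 0.6), (9.0, 0.3)],
]
print(log_threshold_ratio(thresholds))
print(threshold_from_fit(mu=2.0, sigma=0.5))
```

Averaging the logarithm of the ratio, rather than the ratio itself, keeps a single subject or image with an unusually large ratio from dominating the summary.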
2 Probing representational sensitivity of VGG16 layers
We begin by examining discrimination predictions derived from the deep convolutional network known as VGG16, which has been previously studied in the context of perceptual sensitivity. Specifically, Johnson et al. (2016) trained a neural network to generate super-resolution images using the representation of an intermediate layer of VGG16 as a perceptual loss function, and showed that the images this network produced looked significantly better than images generated with simpler loss functions (e.g., pixel-domain mean squared error). Hénaff and Simoncelli (2016) used VGG16 as an image metric to synthesize minimal length paths (geodesics) between images modified by simple global transformations (rotation, dilation, etc.). The authors found that a modified version of the network produced geodesics that captured these global transformations well (as measured perceptually), especially in deeper layers. Implicit in both of these studies, and others like them (e.g., Dosovitskiy and Brox (2016)), is the idea that a deep neural network trained to recognize objects may exhibit additional human perceptual characteristics.
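[Figure 3: Eigen-distortions derived from VGG16 layers Front, Layer 3, and Layer 5, shown in isolation and superimposed on the image from which they were derived.]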
Here, we compare VGG16’s sensitivity to distortions directly to human perceptual sensitivity to the same distortions. We transformed luminance-valued images and distortion vectors to proper inputs for VGG16 following the preprocessing steps described in the original paper, and verified that our implementation replicated the published object recognition results. For human perceptual measurements, all images were transformed to produce the same luminance values on our calibrated display as those assumed by the model.
We computed eigen-distortions of VGG16 at 6 different layers: the rectified convolutional layer immediately prior to the first max-pooling operation (Front), as well as each subsequent layer following a pooling operation (Layer2–Layer6). A subset of these are shown, both in isolation and superimposed on the image from which they were derived, in Fig. 3. Note that the detectability of these distortions in isolation is not necessarily indicative of their detectability when superimposed on the underlying image, as measured in our experiments. We compared all of these predictions to a baseline model (MSE), where the image transformation, $f(\vec{x})$, is replaced by the identity matrix. For this model, every distortion direction is equally discriminable, and distortions are generated as samples of Gaussian white noise.
Average human detection thresholds measured across 10 subjects and 6 base images are summarized in Fig. 2, and indicate that the early layers of VGG16 (in particular, Front and Layer3) are better predictors of human sensitivity than the deeper layers (Layer4, Layer5, Layer6). Specifically, the most noticeable eigen-distortions from representations within VGG16 become more discriminable with depth, but so, generally, do the least-noticeable eigen-distortions. This discrepancy could arise from overlearned invariances, or invariances induced by network architecture (e.g., Layer6, the first stage in the network where the number of output coefficients falls below the number of input pixels, is an under-complete representation). Notably, including the "L2 pooling" modification suggested in Hénaff and Simoncelli (2016) did not significantly alter the visibility of eigen-distortions synthesized from VGG16 (images and data not shown).
3 Probing representational similarity of IQA-optimized models
The results above suggest that training a neural network to recognize objects imparts some ability to predict human sensitivity to distortions. However, we find that deeper layers of the network produce worse predictions than shallower layers. This could be a result of the mismatched training objective function (object recognition) or the particular architecture of the network. Since we clearly cannot probe the entire space of networks that achieve good results on object recognition, we aim instead to probe a more general form of the latter question. Specifically, we train multiple models of differing architecture to predict human image quality ratings, and test their ability to generalize by measuring human sensitivity to their eigen-distortions.
We constructed a generic 4-layer convolutional neural network (CNN, 436908 parameters; Fig. 4). Within this network, each layer applies a bank of convolution filters to the outputs of the previous layer (or, for the first layer, the input image). The convolution responses are subsampled by a factor of 2 along each spatial dimension (the number of filters at each layer is increased by the same factor to maintain a complete representation at each stage). Following each convolution, we employ batch normalization, in which all responses are divided by the standard deviation taken over all spatial positions and all layers, and over a batch of input images (Ioffe and Szegedy (2015)). Finally, outputs are rectified with a softplus nonlinearity, $\log(1 + \exp(z))$, applied to each normalized response $z$. After training, the batch normalization factors are fixed to the global mean and variance across the entire training set.
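The last two steps of each layer (division by a fixed standard deviation, then softplus rectification) can be sketched as below. The softplus is written in a form that avoids overflow for large inputs. The response values, the function names, and the use of the sample's own standard deviation as a stand-in for the fixed training-set statistic are assumptions made for illustration; the convolution and subsampling steps are omitted.

```python
import math
from statistics import pstdev

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def normalize_and_rectify(responses, std):
    """Divide responses by a fixed standard deviation, then apply softplus."""
    return [softplus(r / std) for r in responses]

responses = [-2.0, -0.5, 0.0, 0.5, 2.0]   # hypothetical filter outputs
std = pstdev(responses)                   # stand-in for the fixed statistic
print(normalize_and_rectify(responses, std))
```

Unlike a hard rectifier, softplus is differentiable everywhere, which matters here because the FIM is defined through the Jacobian of the model.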
We compare our generic CNN to a model reflecting the structure and computations of the Lateral Geniculate Nucleus (LGN), the visual relay center of the thalamus. Previous results indicate that such models can successfully mimic human judgments of image quality (Laparra et al. (2017)). The full model (On-Off) is constructed from a cascade of linear filtering and nonlinear computational modules (local gain control and rectification). The first stage decomposes the image into two separate channels. Within each channel, the image is filtered by a difference-of-Gaussians (DoG) filter (2 parameters, controlling spatial size of the Gaussians; DoG filters in On and Off channels are assumed to be of opposite sign). Following this linear stage, the outputs are normalized by two sequential stages of gain control, a known property of LGN neurons (Mante et al. (2008)). Filter outputs are first normalized by a local measure of luminance (2 parameters, controlling filter size and amplitude), and subsequently by a local measure of contrast (2 parameters, again controlling size and amplitude). Finally, the outputs of each channel are rectified by a softplus nonlinearity, for a total of 12 model parameters. In order to evaluate the necessity of each structural element of this model, we also test three reduced sub-models, each trained on the same data (Fig. 5).
Finally, we compare both of these models to a version of VGG16 targeted at image quality assessment (VGG-IQA). This model computes the weighted mean squared error over all rectified convolutional layers of the VGG16 network (13 weight parameters in total), with weights trained on the same perceptual data as the other models.
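A weighted mean squared error over layers can be sketched as follows, assuming each layer's responses have already been computed and flattened into a list. The three toy layers, their values, and the weights are hypothetical; the model described above uses one weight for each of the 13 rectified convolutional layers of VGG16.

```python
def layer_mse(a, b):
    """Mean squared difference between two equally sized response lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def weighted_layer_distance(feats_ref, feats_dist, weights):
    """Sum over layers of weight * MSE between reference and distorted responses."""
    return sum(w * layer_mse(a, b)
               for a, b, w in zip(feats_ref, feats_dist, weights))

# Hypothetical flattened responses for three layers of differing size.
feats_ref = [[0.0, 0.0], [1.0, 1.0, 1.0], [0.5]]
feats_dist = [[1.0, 1.0], [1.0, 1.0, 4.0], [0.5]]
weights = [2.0, 0.5, 1.0]              # non-negative, one per layer
print(weighted_layer_distance(feats_ref, feats_dist, weights))
```

Because the distance is linear in the weights, fitting them to human ratings reduces to a least-squares problem with a non-negativity constraint.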
3.1 Optimizing models for IQA
We trained all of the models on the TID-2008 database, which contains a large set of original and distorted images, along with corresponding human ratings of perceived distortion (Ponomarenko et al., 2009). Perceptual distortion distance for each model was calculated as the Euclidean distance between the model’s representations of the original image, $\vec{x}$, and the distorted image, $\vec{x}'$:
$$D_\phi = \left\| f_\phi(\vec{x}) - f_\phi(\vec{x}') \right\|_2$$
For each model, we optimized its parameters, $\phi$, so as to maximize the correlation between the model-predicted perceptual distance, $D_\phi$, and the human mean opinion scores (MOS) reported in the TID-2008 database:
$$\phi^* = \arg\max_\phi \, \mathrm{corr}\left( D_\phi, \mathrm{MOS} \right)$$
Optimization of VGG-IQA weights was performed using non-negative least squares. Optimization of all other models was performed using regularized stochastic gradient ascent with the Adam algorithm (Kingma and Ba (2015)).
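The training objective can be sketched as a function that scores a candidate model by the Pearson correlation between its predicted distances and the human ratings. The toy model `f`, the image pairs, and the rating values below are invented for illustration, and the sketch only evaluates the objective; it does not perform the non-negative least squares or Adam-based optimization described above.

```python
import math

def perceptual_distance(f, x_ref, x_dist):
    """Euclidean distance between model responses to two images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f(x_ref), f(x_dist))))

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sv = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (su * sv)

def objective(f, pairs, ratings):
    """Correlation between model distances and human ratings over a dataset."""
    distances = [perceptual_distance(f, x, xd) for x, xd in pairs]
    return pearson(distances, ratings)

f = lambda x: [2.0 * v for v in x]       # hypothetical one-layer "model"
pairs = [([0.0, 0.0], [0.1, 0.0]),       # (original, distorted) toy images
         ([0.0, 0.0], [0.3, 0.4]),
         ([1.0, 1.0], [1.0, 2.0])]
ratings = [0.5, 2.0, 4.5]                # hypothetical human scores
print(objective(f, pairs, ratings))
```

Correlation is invariant to rescaling of the distances, so this objective constrains the shape of the model's distance function but not its overall gain.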
3.2 Comparing perceptual predictions of generic and structured models
After training, we evaluated each model’s predictive performance using traditional cross-validation methods on a held-out test set of the TID-2008 database. By this measure, all three models (CNN, On-Off, and VGG-IQA) performed well, as assessed by Pearson correlation.
Stepping beyond the TID-2008 database, and using the more stringent eigen-distortion test, yielded a very different outcome (Figs. 6, 7 and 8). The average detection thresholds measured across 19 human subjects and 6 base images indicate that all of our models surpassed the baseline model in at least one of their predictions. However, the eigen-distortions derived from the generic CNN and VGG-IQA were significantly less predictive of human sensitivity than those derived from the On-Off model (Fig. 6) and, surprisingly, even somewhat less predictive than early layers of VGG16 (see Fig. 8). Thus, the eigen-distortion test reveals generalization failures in the CNN and VGG16 architectures that are not exposed by traditional methods of cross-validation. On the other hand, the models with architectures that mimic biology (On-Off, LGG, LG) are constrained in a way that enables better generalization.
We compared these results to the performance of each of our reduced LGN models (LN, LG, and LGG; Fig. 5), to determine the necessity of each structural element of the full On-Off model. As expected, the models incorporating more LGN functional elements performed better on a traditional cross-validation test, with the most complex of the reduced models (LGG) performing at the same level as On-Off and the CNN. Likewise, models with more LGN functional elements produced eigen-distortions with increasing predictive accuracy (Figs. 6 and 8). It is worth noting that the three LGN models that incorporate some form of local gain control perform significantly better than the CNN and VGG-IQA models, and better than all layers of VGG16, including the early layers (see Fig. 8).
4 Discussion
We have presented a new methodology for synthesizing most- and least-noticeable distortions from perceptual models, applied this methodology to a set of different models, and tested the resulting predictions by measuring their detectability by human subjects. We show that this methodology provides a powerful form of “Turing test”: perceptual measurements on this limited set of model-optimized examples reveal failures that are not apparent in measurements on a large set of hand-curated examples.
We are not the first to introduce a method of this kind. Wang and Simoncelli (2008) introduced Maximum Differentiation (MAD) competition, which creates images optimized for one metric while holding constant a competing metric’s rating. Our method relies on a Fisher approximation to generate extremal perturbations, and uses the ratio of their empirically measured discrimination thresholds as an absolute measure of alignment to human sensitivity (as opposed to relative pairwise comparisons of model performance). Our method can easily be generalized to incorporate more physiologically realistic noise assumptions, such as Poisson noise, and could potentially be extended to include noise at each stage of a hierarchical model.
We have used this method to analyze the ability of VGG16, a deep convolutional neural network trained to recognize objects, to account for human perceptual sensitivity. First, we find that the early layers of the network are moderately successful in this regard. Second, these layers (Front, Layer 3) surpassed the predictive power of a generic shallow CNN explicitly trained to predict human perceptual sensitivity, but underperformed models of the LGN trained on the same objective. And third, perceptual sensitivity predictions synthesized from a layer of VGG16 decline in accuracy for deeper layers.
We also showed that a highly structured model of the LGN generates predictions that substantially surpass the predictive power of any individual layer of VGG16, as well as a version of VGG16 trained to fit human sensitivity data (VGG-IQA), or a generic 4-layer CNN trained on the same data. These failures of both the shallow and deep neural networks were not seen in traditional cross-validation tests on the human sensitivity data, but were revealed by measuring human sensitivity to model-synthesized eigen-distortions. Finally, we confirmed that known functional properties of the early visual system (On and Off pathways) and ubiquitous neural computations (local gain control, Carandini and Heeger (2012)) have a direct impact on perceptual sensitivity, a finding that is buttressed by several other published results (Malo et al. (2006); Lyu and Simoncelli (2008); Laparra et al. (2010, 2017); Ballé et al. (2017)).
Most importantly, we demonstrate the utility of prior knowledge in constraining the choice of models. Although the biologically structured models used components similar to generic CNNs, they had far fewer layers and their parameterization was highly restricted, thus allowing a far more limited family of transformations. Despite this, they outperformed the generic CNN and VGG models. These structural choices were informed by knowledge of primate visual physiology, and training on human perceptual data was used to determine parameters of the model that are either unknown or underconstrained by current experimental knowledge. Our results imply that this imposed structure serves as a powerful regularizer, enabling these models to generalize much better than generic unstructured networks.
Acknowledgments
The authors would like to thank the members of the LCV and VNL groups at NYU, especially Olivier Henaff and Najib Majaj, for helpful feedback and comments on the manuscript. Additionally, we thank Rebecca Walton and Lydia Cassard for their tireless efforts in collecting the perceptual data presented here. This work was funded in part by the Howard Hughes Medical Institute, the NEI Visual Neuroscience Training Program and the Samuel J. and Joan B. Williamson Fellowship.
References
- Ballé et al. (2017) J. Ballé, V. Laparra, and E.P. Simoncelli. End-to-end optimized image compression. ICLR 2017, pages 1–27, March 2017.
- Carandini and Heeger (2012) Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 2012.
- Dodge and Karam (2017) Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. arXiv.org, 2017.
- Dosovitskiy and Brox (2016) Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. NIPS 2016: Neural Information Processing Systems, 2016.
- Fisher (1925) R.A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:700–725, 1925.
- Goodfellow et al. (2014) I.J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR 2014, December 2014.
- Hénaff and Simoncelli (2016) Olivier J Hénaff and Eero P Simoncelli. Geodesics of learned representations. ICLR 2016, November 2016.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICLR 2015, February 2015.
- Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Fei Fei Li. Perceptual losses for real-time style transfer and super-resolution. ECCV: The European Conference on Computer Vision, 2016.
- Khaligh-Razavi and Kriegeskorte (2014) Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLOS Computational Biology, 10(11):e1003915, November 2014.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. ADAM: A Method for Stochastic Optimization. ICLR 2015, pages 1–15, January 2015.
- Laparra et al. (2017) V. Laparra, A. Berardino, J. Ballé, and E.P. Simoncelli. Perceptually optimized image rendering. Journal of the Optical Society of America A, 34(9):1511–1525, September 2017.
- Laparra et al. (2010) Valero Laparra, Jordi Muñoz-Marí, and Jesús Malo. Divisive normalization image quality metric revisited. Journal of the Optical Society of America A, 27, 2010.
- Lyu and Simoncelli (2008) Siwei Lyu and Eero P. Simoncelli. Nonlinear image representation using divisive normalization. Proc. Computer Vision and Pattern Recognition, 2008.
- Malo et al. (2006) J. Malo, I Epifanio, R. Navarro, and E.P. Simoncelli. Nonlinear image representation for efficient perceptual coding. IEEE Transactions on Image Processing, 15, 2006.
- Mante et al. (2008) Valerio Mante, Vincent Bonin, and Matteo Carandini. Functional mechanisms shaping lateral geniculate responses to artificial and natural stimuli. Neuron, 58(4):625–638, May 2008.
- Nguyen and Clune (2015) A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Ponomarenko et al. (2009) N Ponomarenko, V Lukin, and A Zelensky. TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern …, 2009.
- Portilla and Simoncelli (2000) Javier Portilla and Eero P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int’l Journal of Computer Vision, 40(1):49–71, Dec 2000.
- Seriès et al. (2009) Peggy Seriès, Alan A. Stocker, and Eero P. Simoncelli. Is the homunculus "aware" of sensory adaptation? Neural Computation, 2009.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015, September 2015.
- Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv.org, December 2013.
- von Mises and Pollaczek-Geiringer (1929) Richard von Mises and H. Pollaczek-Geiringer. Praktische verfahren der gleichungsauflösung. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, 9:152–164, 1929.
- Wang and Simoncelli (2008) Zhou Wang and Eero P. Simoncelli. Maximum differentiation (mad) competition: A methodology for comparing computational models of perceptual qualities. Journal of Vision, 2008.
- Yamins et al. (2014) D. L. K. Yamins, H. Hong, C. Cadieu, E.A. Solomon, D. Seibert, and J.J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, June 2014.