Examining Representational Similarity in ConvNets and the Primate Visual Cortex

We compare several ConvNets of varying depth and regularization against multi-unit macaque IT cortex recordings, and assess how these design choices affect representational similarity with the primate visual cortex. We find that ConvNet features move closer to cortical IT representations as depth and validation performance increase.




1 Introduction

With the success of convolutional neural networks in object image classification [1, 2, 3, 4], it is of interest to examine how similar the representations learnt by these networks are to the visual representations learnt by the human brain [5, 6, 7]. In Cadieu et al.'s pioneering work on representational similarity [8], the pre-final-layer activations of deep neural network models are compared with multi-unit IT cortex responses, confirming a significant correspondence between the two representations for the task of core object recognition. We aim to shed more light on the role of engineered aspects of these deep networks, examining the effect of variation in regularization, network depth and model size on both representational similarity to multi-unit IT responses and classification accuracy. This can help us understand whether sparsity and increased network depth create representations closer to those employed by the primate visual cortex. It can also test the hypothesis that the mammalian visual cortex is the best known object detector, by checking whether increased proximity to cortical representations implies an increase in recognition performance.

2 Methodology

For each ConvNet, we supply input images drawn from the output classes, perform a feed-forward pass to obtain the layer-wise activations, and extract the activation at the penultimate fully-connected layer. For a network architecture $a$ and image set $X$, we denote the set of penultimate activations by $\Phi_a(X)$. For a particular target class $c$, the corresponding set of activations is $\Phi_a(X_c)$. For any set of activations, we define the average activation $\bar{\phi}_a(c)$ as the element-wise mean over $\Phi_a(X_c)$.
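A minimal sketch of this bookkeeping, assuming activations are already extracted as a matrix (variable names here are illustrative; real extraction would use framework-specific hooks):

```python
import numpy as np

def average_activations(phi, labels):
    """phi: (n_images, n_units) penultimate-layer activations;
    labels: (n_images,) integer class label for each image.
    Returns a dict mapping class -> element-wise mean activation."""
    return {c: phi[labels == c].mean(axis=0) for c in np.unique(labels)}

# Toy example: 6 "images", 4 activation units, 2 classes.
rng = np.random.default_rng(0)
phi = rng.normal(size=(6, 4))
labels = np.array([0, 0, 0, 1, 1, 1])
means = average_activations(phi, labels)
```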
    Given these representations, we can compute the representational dissimilarity matrix (RDM) as ([8])

$$\mathrm{RDM}(i, j) = 1 - \frac{\mathrm{Cov}(\bar{\phi}(i), \bar{\phi}(j))}{\sqrt{\mathrm{Var}(\bar{\phi}(i)) \cdot \mathrm{Var}(\bar{\phi}(j))}}$$

where $\mathrm{Cov}$ and $\mathrm{Var}$ denote covariance and variance respectively, and $\bar{\phi}(i)$ is the average activation for class $i$. We compute this matrix for the cortical responses as well, treating the multi-unit responses as the set of activations. To measure representational similarity, we use "similarity to the IT dissimilarity matrix" as the default metric [8], defined as the Spearman rank correlation between the upper-triangular, non-diagonal elements of the ConvNet and IT matrices. In essence, this metric is expected to capture the nature of variation across classes for each representation [8].
    To create the image set for activations, we randomly sample images from the total set in a manner identical to [8]. We also add noise to the ConvNet activations to account for the measurement noise present in the IT cortical responses, following the experimentally matched noise model of [8].
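The RDM and the similarity metric above can be sketched as follows (a correlation-based RDM over class-average activations, assuming a `scipy` dependency for the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(class_means):
    """class_means: (n_classes, n_units) average activation per class.
    RDM(i, j) = 1 - Pearson correlation between the mean activation
    vectors of classes i and j (Cov / sqrt(Var * Var))."""
    return 1.0 - np.corrcoef(class_means)

def similarity_to_it(rdm_model, rdm_it):
    """Spearman rank correlation between the upper-triangular,
    non-diagonal elements of the two dissimilarity matrices."""
    iu = np.triu_indices_from(rdm_model, k=1)
    rho, _ = spearmanr(rdm_model[iu], rdm_it[iu])
    return rho
```

Note that `np.corrcoef` treats each row as one variable, so passing the (classes x units) matrix directly yields the class-by-class correlation matrix.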


Figure 1: Samples from the dataset introduced by Cadieu et al. [8] (reproduced from [8]).

Cadieu et al. [8] introduce a dataset of multi-unit V4 and IT cortex responses recorded from two male rhesus monkeys while they were presented with synthetically generated object image samples. These recordings were taken from the 128 most visually driven neural measurement sites, determined via a separate pilot dataset. We employ these recordings as the inferior temporal (IT) representations (features) for our comparative experiments; in total, the dataset contains 1,960 images across 7 categories (see Figure 1).


We evaluate the best-performing ConvNet architectures on ImageNet LSVRC, employing their publicly available pre-trained weights. We experiment on both the original architectures and architectures retrained with different regularization schemes: L1, L2 (default) and DeCov [9]. DeCov tries to learn independent, generalizable filters by minimizing the covariances of filter activations, with an additional DeCov loss at each hidden layer given by ([9])


$$\mathcal{L}_{\mathrm{DeCov}} = \frac{1}{2}\left( \|C\|_F^2 - \|\mathrm{diag}(C)\|_2^2 \right)$$

where the $\mathrm{diag}(\cdot)$ operator extracts the main diagonal of a matrix into a vector, and $C$ is the matrix of covariances between all pairs of activations at the hidden layer,

$$C_{i,j} = \frac{1}{N} \sum_{n=1}^{N} \left(h^n_i - \mu_i\right)\left(h^n_j - \mu_j\right),$$

where $n$ indexes the $N$ samples in the training batch, $h^n_i$ is the activation of unit $i$ for sample $n$, and $\mu_i$ is the sample mean of activation $i$ over the batch. We add this loss to each layer of the DeCov network [9] and tune its weight hyperparameter via cross-validation. For L1, we replace weight decay (L2) with the L1 norm and tune its coefficient via cross-validation. For L2, we maintain the original weight-decay formulation and weights. All networks are trained with dropout.
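The DeCov penalty can be written compactly over a batch of activations; a minimal NumPy sketch of the loss as defined in [9]:

```python
import numpy as np

def decov_loss(h):
    """h: (batch, units) activations at one hidden layer.
    Returns 0.5 * (||C||_F^2 - ||diag(C)||_2^2), where C is the
    batch covariance matrix of the activations: penalizes all
    off-diagonal (cross-unit) covariances."""
    hc = h - h.mean(axis=0, keepdims=True)   # center over the batch
    C = hc.T @ hc / h.shape[0]               # (units, units) covariance
    return 0.5 * (np.sum(C ** 2) - np.sum(np.diag(C) ** 2))
```

Since the diagonal terms are subtracted back out, only correlated pairs of units contribute: two perfectly correlated units incur a positive penalty, while decorrelated units incur none.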

3 Results and Conclusions

Figure 2: Variation of representational similarity with model regularization (left) and validation performance (right). We see that representational similarity consistently increases with validation performance across architectures.

While Cadieu et al. [8] validated the plausibility of deep neural net representations being similar to cortical representations, our work examines a wider range of variation in the characteristics of the deep learning models used. Our results show that increasing network depth produces a marginal increase in representational similarity, consistent with the accompanying increase in validation accuracy. Another interesting result is that decorrelating representations yields higher representational similarity, even though validation performance is essentially unchanged.

| Network | Layers | Similarity to IT | Accuracy on [8] | ImageNet Top-5 Error Rate |
| --- | --- | --- | --- | --- |
| *Original Architectures* | | | | |
| AlexNet [1] | 8 | 0.507 | 0.523 | 18.1% |
| ZFNet [10] | 8 | 0.531 | 0.568 | 16.0% |
| VGGNet-16 [2] | 16 | 0.557 | 0.580 | 7.5% |
| VGGNet-19 [2] | 19 | 0.559 | 0.582 | 7.5% |
| GoogLeNet [3] | 22 | 0.551 | 0.575 | 7.89% |
| ResNet-50 [4] | 50 | 0.564 | 0.601 | 5.25% |
| ResNet-101 [4] | 101 | 0.567 | 0.603 | 4.60% |
| ResNet-152 [4] | 152 | 0.568 | 0.612 | 4.49% |
| *Modified Architectures* | | | | |
| AlexNet-L1 [1] | 8 | 0.378 | 0.413 | 27.2% |
| ZFNet-L1 [10] | 8 | 0.383 | 0.431 | 23.9% |
| VGGNet-16-L1 [2] | 16 | 0.405 | 0.434 | 13.5% |
| VGGNet-19-L1 [2] | 19 | 0.411 | 0.437 | 12.9% |
| AlexNet-DeCov [9] | 8 | 0.539 | 0.521 | 20.0% |
| ZFNet-DeCov [9] | 8 | 0.541 | 0.528 | 18.8% |
| VGGNet-16-DeCov [9] | 16 | 0.562 | 0.556 | 11.6% |
| VGGNet-19-DeCov [9] | 19 | 0.563 | 0.561 | 11.2% |
Table 1: A comparison of the representational similarity of ConvNets to the IT cortex, as per [8]. ImageNet accuracies are reported directly from the original papers. Accuracy on the dataset of [8] is obtained using a linear SVM trained on pre-final activations.
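The accuracy-on-[8] column can be reproduced with a linear SVM over penultimate-layer activations; a hedged sketch using scikit-learn's `LinearSVC` on stand-in data (the real experiment uses the extracted ConvNet features and the 7-category labels of [8]):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for pre-final activations of two well-separated classes.
X = np.vstack([rng.normal(-5, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
y = np.repeat([0, 1], 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LinearSVC(dual=False).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```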

Our experiments offer a preliminary understanding of the effect of network depth and model complexity control on the similarity between deep neural net and cortical representations. Such an approach can help provide pointers to the learning architectures and mechanisms employed by the brain. Substantial further work remains: comparing lower-layer activations of the networks with early visual regions (V1, V2), visualizing the effect of context on representational similarity, and understanding the impact of dropout and structural risk minimization are extensions we have initiated. The ultimate aim is to create models similar to humans in both representation and performance on complex vision tasks, as a means of better understanding, or 'reverse engineering', the human visual system itself.