1 Introduction
With the success of convolutional neural networks in object image classification
[1, 2, 3, 4], it is of interest to examine how similar the representations learnt by these networks are to the visual representations learnt by the human brain [5, 6, 7]. In Cadieu et al.'s pioneering work on understanding representational similarity [8], a comparison of the pre-final layer activations of deep neural net models with multiunit IT cortex responses is presented, which confirms a significant correspondence between the two representations for the task of core object recognition. We aim to shed more light on the role of engineered aspects of these deep networks, examining the effect of variation in regularization, network depth and model size on the representational similarity to multiunit IT responses and classification accuracy. This can help us understand whether sparsity and increased network depth create representations closer to those employed by the primate visual cortex, and also test the hypothesis that the mammalian visual cortex is the best known object detector, by checking whether increased proximity to cortical representations implies increased recognition performance.
2 Methodology
For each ConvNet, we supply input images from the output classes and perform a feedforward pass to obtain the layerwise activations, and extract the activation at the penultimate fully-connected layer. For a network architecture N, we denote the set of activations for the image set I by A_N(I). For a particular target class c, the set of activations is given by A_N(I_c). For any set of activations, we define the average activation as the element-wise mean over the set.
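The per-class averaging step can be sketched as follows (a minimal sketch, not the authors' code, assuming the penultimate-layer activations have already been extracted into a NumPy array with one row per image):

```python
import numpy as np

def class_average_activations(activations, labels):
    """Average penultimate-layer activation per target class.

    activations: (n_images, n_units) array of penultimate-layer activations
    labels:      (n_images,) array of integer class indices
    Returns a (n_classes, n_units) array of class-mean activations,
    one row per class, in sorted class order.
    """
    classes = np.unique(labels)
    return np.stack([activations[labels == c].mean(axis=0) for c in classes])
```

The same routine applies unchanged to the IT multiunit responses, with neural sites in place of hidden units.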
Given these representations, we can compute the representational dissimilarity matrix as ([8])

RDM(i, j) = 1 − cov(ā_i, ā_j) / √(var(ā_i) · var(ā_j))

where cov and var denote covariance and variance respectively, and ā_i is the average activation for class i. We compute this matrix for the cortical responses as well, treating the IT multiunit responses as the set of activations. To measure representational similarity, we use the "similarity to IT dissimilarity matrix" metric as the default [8], defined as the Spearman's rank correlation between the upper-triangular, non-diagonal elements of the two matrices RDM_model and RDM_IT. In essence, this metric is expected to capture the nature of variation across classes for each representation [8]. To create the image set for activations, we randomly sample images from the total set in a manner identical to [8]. We also add noise to the ConvNet activations in order to account for the measurement noise present in the IT cortical responses, following the experiment-noise-matched model in [8].
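The dissimilarity matrix and the similarity metric described above can be sketched in a few lines of NumPy/SciPy (our sketch, not the reference implementation of [8]; note that `np.corrcoef` computes exactly the cov/√(var·var) ratio in the formula):

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(features):
    """Representational dissimilarity matrix: 1 minus the Pearson
    correlation between the activation vectors of each pair of classes.
    features: (n_classes, n_units) array of class-average activations."""
    return 1.0 - np.corrcoef(features)

def similarity_to_it(rdm_model, rdm_it):
    """Spearman rank correlation between the upper-triangular,
    non-diagonal entries of the two dissimilarity matrices."""
    iu = np.triu_indices_from(rdm_model, k=1)
    rho, _ = spearmanr(rdm_model[iu], rdm_it[iu])
    return rho
```

By construction the diagonal of the RDM is zero (every representation is perfectly correlated with itself), which is why only the off-diagonal upper triangle enters the Spearman correlation.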
Dataset
Cadieu et al. [8] introduce a dataset of multiunit V4 and IT cortex responses recorded across two male rhesus monkeys while they were presented with synthetically generated object image samples. These recordings were taken from the 128 most visually driven neural measurement sites, determined via a separate pilot dataset. We employ these recordings as the inferior temporal (IT) representations (features) for our comparative experiments, and we have a total of 1,960 images for 7 categories (see Figure 1).
Evaluation
We evaluate the best-performing ConvNet architectures on ImageNet LSVRC and employ their publicly available pretrained weights. We experiment on both the original architectures and architectures retrained using different regularization schemes: L1, L2 (default) and DeCov [9]. DeCov tries to learn independent, generalizable filters by minimizing the covariances of filter activations, with the additional DeCov loss at each hidden layer given by ([9])

L_DeCov = (1/2) (‖C‖_F² − ‖diag(C)‖₂²)

The diag(·) operator extracts the main diagonal of a matrix into a vector. The matrix C contains the covariances of all pairs of activations h at a hidden layer,

C_{i,j} = (1/N) Σ_n (h_i^n − μ_i)(h_j^n − μ_j),

where n ranges over the N samples of the training batch and μ_i is the sample mean of activation h_i over the batch. We add this loss to each hidden layer of the DeCov network [9] and tune its weight hyperparameter via cross-validation. For L1, we replace weight decay (L2) with the L1 norm, and tune its weight via cross-validation. We maintain the original weight-decay formulation and weights for L2. All networks are trained with dropout.
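A minimal NumPy sketch of the DeCov penalty for one batch of hidden activations (ours, not the reference implementation of [9]) makes the two terms concrete: the Frobenius norm penalizes all covariances, and subtracting the diagonal term exempts each unit's own variance.

```python
import numpy as np

def decov_loss(h):
    """DeCov penalty 0.5 * (||C||_F^2 - ||diag(C)||_2^2) for a batch
    of hidden activations h with shape (batch_size, n_units), where C
    is the covariance matrix of the activations over the batch."""
    centered = h - h.mean(axis=0, keepdims=True)
    C = centered.T @ centered / h.shape[0]
    return 0.5 * (np.sum(C ** 2) - np.sum(np.diag(C) ** 2))
```

The loss is zero when all off-diagonal covariances vanish, i.e. when the hidden units are pairwise decorrelated over the batch; correlated units incur a positive penalty.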
3 Results and Conclusions
While Cadieu et al. [8] validated the plausibility of deep neural net representations being similar to cortical representations, our work examines a wider range of variation in the characteristics of the deep learning models used. Our results show that increasing network depth yields a marginal increase in representational similarity, consistent with an increase in validation accuracy. Another interesting result is that decorrelating representations yields higher representational similarity, even though validation performance is similar.
Our experiments offer a preliminary understanding of the effect of network depth and model complexity control on the similarity between deep neural net and cortical representations. Such an approach can help provide pointers to the learning architectures and mechanisms employed by the brain. Substantial further work remains: comparison of lower-layer activations of the network with earlier visual areas (V1, V2), visualization of the effect of context on representational similarity, and understanding the impact of dropout and structural risk minimization are some extensions we have initiated. The ultimate aim is to create models similar to humans in both representation and performance on complex vision tasks, as a means of better understanding, or 'reverse engineering', the human visual system itself.
References
 [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
 [2] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

 [3] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1–9
 [4] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
 [5] Yamins, D.L., Hong, H., Cadieu, C.F., Solomon, E.A., Seibert, D., DiCarlo, J.J.: Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences 111(23) (2014) 8619–8624
 [6] Güçlü, U., van Gerven, M.: Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of neuroscience: the official journal of the Society for Neuroscience 35(27) (2015) 10005–10014
 [7] Cichy, R.M., Khosla, A., Pantazis, D., Torralba, A., Oliva, A.: Comparison of deep neural networks to spatiotemporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports 6 (2016)
 [8] Cadieu, C.F., Hong, H., Yamins, D.L., Pinto, N., Ardila, D., Solomon, E.A., Majaj, N.J., DiCarlo, J.J.: Deep neural networks rival the representation of primate it cortex for core visual object recognition. PLoS Comput Biol 10(12) (2014) e1003963
 [9] Cogswell, M., Ahmed, F., Girshick, R., Zitnick, L., Batra, D.: Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068 (2015)
 [10] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, Springer (2014) 818–833