Artificial intelligence is advancing rapidly and now surpasses human performance in various fields, from the famous AlphaZero, a generic model conquering the world of intellectual board games such as Go, chess and shogi (Silver et al., 2018), to a 3D-CNN that swiftly reports abnormalities in medical images (Titano et al., 2018). Despite these impressive achievements, our understanding of the underlying mechanisms in deep networks, the fundamental technology behind the scenes, lags far behind.
Deciphering the basis of intelligence requires adequate explanations to, at least, two principal questions:
What operations lead to intelligent behaviour?
How can an intelligent behaviour be accomplished?
It is worth acknowledging that a resolution to the latter does not necessarily shed light on the former. For instance, conventional wisdom (LeCun et al., 2015) states that given a large enough amount of labelled data, deep neuronal networks (DNN) can accurately learn a task (the how question). At the same time, we do not possess a clear understanding of why DNNs work so excellently (Zeiler & Fergus, 2014) (the what question). This is one of the biggest challenges the artificial intelligence community is facing at present.
A more thorough comprehension of artificial networks is of great importance from several perspectives:
Performance improvement. Many elements and their constituent hyper-parameters, such as network architecture, optimisation algorithm, data augmentation, weights initialisation, etc., greatly influence the outcome of a training procedure. Currently, the performance boost is achieved through cumbersome parameter tuning (Glorot & Bengio, 2010). A better understanding of what operations are executed to represent a feature inside a network (Bengio et al., 2013) would facilitate this procedure.
Critical applications. DNNs are becoming a ubiquitous tool in various systems. However, they suffer from the butterfly effect: tiny variations in input cause dramatic changes in output, e.g., (Stallkamp et al., 2012; Szegedy et al., 2013; Moosavi-Dezfooli et al., 2016; Tian et al., 2018). This, in turn, implies vulnerability to malicious attacks (Papernot et al., 2017). Therefore, trusting DNNs with more critical applications, such as monitoring nuclear plants or automatically diagnosing patients, is subject to a more profound insight into their operations.
Human perception. Recent developments in machine learning have created an amazing opportunity to gain more awareness about the neuronal network inside our brain (Marblestone et al., 2016). DNNs are actively studied as models of our perception, e.g., vision (Eckstein et al., 2017; Flachot & Gegenfurtner, 2018), hearing (Kell et al., 2018), and touch (Zhao et al., 2018), to name a few. Most of these works operate at a coarse level, that of a population of cells. Deciphering kernels’ operations could elucidate finer details of our own brain at the level of individual neurons.
1.1 Related works
This article attempts to discuss the what question in the context of visual information processing. Previous works on this topic could be broadly summarised into three groups:
Visualisation. A large body of literature is devoted to visualising internal units of DNNs, e.g., (Simonyan et al., 2013; Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2015). Although these techniques are genuinely useful for giving an idea of what kernels respond to, they are of a qualitative nature and should be complemented with quantitative techniques.
Transfer learning. Another set of papers investigates the transferability of knowledge across networks, data sets and tasks, e.g., (Agrawal et al., 2014; Yosinski et al., 2014). Although they empirically demonstrate the crucial hierarchical characteristics of layers, i.e., the transition from generic to specific in deeper layers, no account of feature invariance has been provided thus far.
Cognitive. Further works attempt to interpret the intrinsic behaviour of kernels by analysing their activation patterns, e.g., (Bau et al., 2017; Akbarinia & Gegenfurtner, 2018; Zhang et al., 2018). These techniques successfully exhibit the existence of selectivity among kernels, similar to biological neurons (Quiroga et al., 2005). However, the causality of these kernels for a specific function remains to be demonstrated.
1.2 Research hypothesis
Our approach consists of generating two groups of networks. First, we trained multiple instances of an identical architecture under distinct conditions, ultimately aiming to classify objects in natural images. Second, we fine-tuned each model with various types of distortions and transformations, typically experienced by an intelligent agent. We computed the classification accuracy of all networks for eight types of image manipulation, such as contrast reduction or adding noise. This measure is considered the generalised cognitive capacity of a network. Next, we calculated a statistical metric of intrinsic similarity between the kernels’ weights. Our research hypothesis states that those instances whose overall accuracy is alike would tend to be of a more similar nature and vice versa. Networks whose weights are nearly identical would show a comparable level of performance.
It could be objected that in the hyper-dimensional space of neural networks, such direct correspondences between kernels are meaningless. Nevertheless, a large portion of our knowledge about biological intelligence in neuroscience has arguably been acquired through such neuron-to-neuron comparisons. For instance, a recent study based on single-neuron comparison has provided important insights into the differences between human and monkey brains in terms of the trade-off between “robustness” and “efficiency” (Pryluk et al., 2019). Furthermore, given that kernels’ activation maps are often compared across networks to an interpretable concept, e.g., (Bau et al., 2017; Zhang & Zhu, 2018), the underlying parameters of those kernels are expected to be comparable as well.
In total we trained 329 instances of an identical architecture, ResNet20 (He et al., 2016), to classify objects in the CIFAR data set (Krizhevsky & Hinton, 2009). We analysed every single pair of these networks, i.e., 53,956 comparisons (329 × 328 / 2). We also trained 20 instances of ResNet50 on the ImageNet data set (Krizhevsky et al., 2012), resulting in a comparison of 190 pairs of networks (20 × 19 / 2).
The results of our experiments refute the research hypothesis. On the one hand, we encountered numerous pairs of networks with a great resemblance in their performance, although their constituent kernels show no similarity whatsoever. On the other hand, the behaviour of many other pairs of networks is very different, despite the fact that their kernels are almost identical.
We further investigated whether a direct copy of weights from one network to another could convey the associated cognitive capacities as well. The results of our experiment suggest that such a naïve form of transfer learning is effective for those instances with high intrinsic similarity. This, in turn, strengthens the second part of our hypothesis: as networks’ kernels become more similar, so do their behaviours.
2.1 Data set
We conducted our experiments on two benchmark data sets of object classification in static images:
CIFAR (Krizhevsky & Hinton, 2009) consists of 60 thousand colour images of size 32 × 32, belonging to 10 classes of objects. The training and validation sets are divided by a ratio of 5 to 1, respectively.
ImageNet (Krizhevsky et al., 2012) is a collection of 1,000 object categories. The training and validation sets contain 1.3 million and 50 thousand images, respectively (i.e., 1300 and 50 per category).
2.2 Network visual intelligence
We defined the cognitive visual capacity of a network as its overall accuracy under eight types of image manipulation that are common and therefore of great importance for machine intelligence:
Contrast reduction. We modulated contrast through this equation: , where is the input image, are pixel coordinates, denotes colour channel, and is the contrast level.
Illuminant variation. We simulated this by altering the ratio of colour channels: , where represents the luminance of a scene. For instance, results in a completely blue image.
Image blurring. We blurred an image through its convolution with a Gaussian function: , where is Gaussian standard deviation and denotes the convolution operator.
Gamma correction. We adjusted image gamma by a simple power-law expression: , where compresses the gamma and results in gamma expansion.
Salt & Pepper noise. This impulse noise is defined as: , where is a probability density function (PDF).
Speckle noise. This multiplicative noise is defined as: , where is PDF of Gaussian distribution (a.k.a. normal distribution) with variance and zero mean (.
Poisson noise. Unlike all others that are independent of the image, Poisson noise is generated from pixel values, defined as: , where is PDF of Poisson distribution defined as.
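The manipulations listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiments: it assumes images are floating-point arrays of shape (height, width, 3) with values in [0, 1], and every function name, parameter name and exact functional form below (for example, reducing contrast by scaling deviations from mid-grey, or scaling pixel values to 255 counts before drawing Poisson samples) is a hypothetical choice made for the example.

```python
import numpy as np

def reduce_contrast(img, level):
    # Assumed form: scale deviations from mid-grey; level=1 keeps the image.
    return 0.5 + level * (img - 0.5)

def vary_illuminant(img, ratios):
    # Assumed form: one multiplicative gain per colour channel.
    return np.clip(img * np.asarray(ratios, dtype=float), 0.0, 1.0)

def gaussian_blur(img, sigma):
    # Separable Gaussian convolution with edge padding (rows, then columns).
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()

    def smooth(v):
        padded = np.pad(v, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    out = np.apply_along_axis(smooth, 0, img)
    return np.apply_along_axis(smooth, 1, out)

def adjust_gamma(img, gamma):
    # Power-law expression applied to every pixel.
    return np.power(img, gamma)

def salt_and_pepper(img, amount, rng):
    # A fraction `amount` of pixels is set to black or white with equal chance.
    out = img.copy()
    u = rng.random(img.shape[:2])
    out[u < amount / 2] = 0.0
    out[u > 1 - amount / 2] = 1.0
    return out

def speckle(img, variance, rng):
    # Multiplicative zero-mean Gaussian noise.
    noise = rng.normal(0.0, np.sqrt(variance), img.shape)
    return np.clip(img + img * noise, 0.0, 1.0)

def poisson(img, rng, scale=255.0):
    # Noise drawn from the pixel values themselves (assumed 8-bit count scale).
    return np.clip(rng.poisson(img * scale) / scale, 0.0, 1.0)
```

A call such as `vary_illuminant(img, (0, 0, 1))` zeroes the red and green channels, which is one way to obtain a completely blue image under the assumed per-channel parameterisation.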
It is worth emphasising that the human visual system exhibits a great amount of robustness to all these variations (Logothetis & Sheinberg, 1996), and the same is therefore expected from machine vision. Moreover, from a practical point of view, these conditions occur under typical circumstances, e.g., contrast and luminance alteration in natural scenes (Frazor & Geisler, 2006), or the presence of noise in images originating from various parts of the camera pipeline (Boncelet, 2009).
In the experiments reported in this article, we computed the classification accuracy of each network following these parameters (source code and experimental materials are available at https://goo.gl/eNpaUW):
Contrast levels .
Illuminant ratios .
Gamma adjustment with .
Image blurring with .
Percentage of independent noise .
In this article, we restricted all our analysis to the family of ResNet architectures (He et al., 2016). For the CIFAR data set we utilised ResNet20, which consists of 274,442 parameters, and for ImageNet a deeper version of it, namely ResNet50, which contains 25,636,712 parameters.
We trained 104 instances of ResNet20 on CIFAR-10. Common configurations for all these networks include: 200 epochs on a single GPU, the Adam optimiser (Kingma & Ba, 2014), and the following standard augmentation procedures: random horizontal flipping, zooming (within a 10% scale), and shifting (within a 10% range). (We do not use the term augmentation in the sense of increasing the number of exposures; it merely refers to a manipulated version of the original image. Therefore, each network is essentially exposed to the same number of training images at each epoch.) Parameters that were varied across different training procedures include: batch size, weights initialisation, learning rate, decay scheduler, and data augmentation through one or more of the image manipulations explained above.
We fine-tuned each of those 104 networks for 10 more epochs with the same set of configurations, except for the choice of image augmentation, for instance, some only with contrast reduction, others with additive noise, and a few with all or none. This procedure resulted in 225 fine-tuned networks. Therefore, overall we gathered 329 instances of ResNet20.
We followed the same paradigm for ResNet50 on ImageNet, although naturally with fewer instances due to its demanding computational resources. First, we trained 10 instances from scratch for 30 epochs on a single GPU with a batch size of 32, randomly cropping images to 224 × 224 pixels. Note that during the evaluation this randomness was excluded: each image was resized to its smaller edge and the central square of side 224 was cropped.
Next, we picked one of the networks and fine-tuned it for 5 epochs on each of the eight image manipulations specified above individually. We further fine-tuned the same network on all image augmentations twice, for 5 and 10 epochs. This procedure resulted in 10 fine-tuned versions. Therefore, overall we gathered 20 instances of ResNet50.
2.4 Network intrinsic similarity
There is no established technique to compute a measure of similarity between two neural networks. In this work, we defined the measure of similarity between a pair of networks (of an identical architecture) as the average Pearson correlation coefficient of their constituent kernels at every corresponding layer. Kernels are not positioned in an identical order within the same layer across different instances of a network. To account for this, we first aligned all kernels across the same layer of two networks in a one-to-one fashion according to their matching highest correlation coefficients.
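The measure described above can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions rather than the exact procedure used in the experiments: each layer is assumed to be an array of shape (height, width, input channels, output kernels), the one-to-one alignment is approximated by a greedy search that repeatedly pairs the two still-unmatched kernels with the highest correlation, and the function names are hypothetical. The sketch also ignores the fact that reordering the kernels of one layer changes the input-channel order of the next.

```python
import numpy as np

def kernel_correlations(w_a, w_b):
    """Pearson correlation between every kernel of w_a and every kernel of w_b."""
    a = w_a.reshape(-1, w_a.shape[-1]).T.astype(float)  # (kernels, weights)
    b = w_b.reshape(-1, w_b.shape[-1]).T.astype(float)
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def greedy_alignment(corr):
    """Pair kernels one-to-one, always taking the highest remaining correlation."""
    corr = corr.copy()
    matched = []
    for _ in range(corr.shape[0]):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        matched.append(corr[i, j])
        corr[i, :] = -np.inf  # kernel i of network A is now taken
        corr[:, j] = -np.inf  # kernel j of network B is now taken
    return np.array(matched)

def layer_similarity(w_a, w_b):
    """Average correlation of the aligned kernels of one layer."""
    return float(greedy_alignment(kernel_correlations(w_a, w_b)).mean())

def network_similarity(layers_a, layers_b):
    """Average of the layer similarities over all corresponding layers."""
    return float(np.mean([layer_similarity(a, b)
                          for a, b in zip(layers_a, layers_b)]))
```

Under this sketch, two networks whose layers contain the same kernels in a different order obtain a similarity of 1, which is the behaviour the alignment step is meant to provide. An optimal assignment (for example, the Hungarian algorithm) could replace the greedy search; which variant was used is not stated in the text.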
3 Experiment and results
We computed a measure of performance (referred to as visual intelligence) for all instances of ResNet20 on the images of the CIFAR-10 validation set. The results for a representative subset of these networks (20 out of 329) are shown in Figure 1. Each bar represents a network and consists of eight segments, corresponding to the eight image manipulations defined in Section 2.2. The classification accuracy of all these networks on the original images of the CIFAR-10 validation set without distortions is in the range of 0.88 to 0.92. With the distortions, their average performance has a larger range, between 0.59 and 0.85.
Evidently, various combinations could yield a similar level of visual intelligence in absolute terms. However, this does not imply that the performance of networks is identical. For instance, N12 and N13 both score 0.84 in the measure of visual intelligence, but the former is more invariant to changes of contrast and the latter to changes of the illuminant. Therefore, we defined the visual intelligence compatibility of a pair of networks as the Pearson correlation coefficient between their corresponding classification accuracies over all conducted experiments (refer to Figure 6 in supplementary materials for a pairwise comparison of the same twenty networks).
Likewise, we computed the pairwise intrinsic similarity of networks (as defined in Section 2.4) for all 53,956 pairs of ResNet20 (refer to Figure 7 in supplementary materials). Next, in order to investigate our hypothesis, i.e., whether there is an agreement between the intrinsic similarity of networks and their visual intelligence compatibility, we calculated the difference between these two measures. This comparison for the same set of twenty networks is reported in Figure 2.
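The comparison between the two measures can be sketched as follows. The example assumes that each network is summarised by a vector holding its classification accuracy in every evaluated condition, and it takes the difference as intrinsic similarity minus compatibility; the sign convention and the function names are assumptions made for illustration, as the text only states that a difference was calculated.

```python
import numpy as np

def compatibility(acc_a, acc_b):
    """Pearson correlation between two networks' per-condition accuracies."""
    return float(np.corrcoef(acc_a, acc_b)[0, 1])

def disagreement(similarity, compat):
    """Assumed convention: near 0 means the two measures agree, while a
    magnitude near 1 flags a pair whose weights and behaviour disagree."""
    return similarity - compat

# Hypothetical accuracies of three networks over four conditions.
acc_a = np.array([0.90, 0.80, 0.60, 0.50])
acc_b = acc_a - 0.10   # same profile, uniformly lower accuracy
acc_c = acc_a[::-1]    # opposite profile, same average accuracy

print(compatibility(acc_a, acc_b))
print(compatibility(acc_a, acc_c))
```

Note that `acc_a` and `acc_c` have the same mean accuracy yet are not compatible, which is the distinction between absolute visual intelligence and compatibility drawn above.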
If the intrinsic similarity of a pair of networks is on a par with their visual intelligence compatibility, the corresponding cell in Figure 2 would be near to 0 (green cells). For instance, N16 and N17 are almost identical (over 99% intrinsic similarity) and so is their performance across all manipulations (over 99% visual intelligence compatibility). At the same time, N01 and N04 are only 20% similar in both these measures. This agreement is rather expected.
If a pair of networks is intrinsically identical yet their visual intelligence is very different, the corresponding cell in Figure 2 would be close to 1 (yellow cells). For instance, N14 and N16 are intrinsically the same (99%), yet there is no compatibility in their visual intelligence (0%), i.e., N16 obtains a better absolute performance, although N14 performs better on noisy images. Weight perturbations have been reported to influence the performance of DNNs (Cheney et al., 2017), although not to the extent observed in our experiment. This is the puzzling case on one side of the spectrum.
On the other side of the spectrum, if there is no intrinsic similarity between a pair of networks, yet their visual intelligence is almost identical, the corresponding cell in Figure 2 would be nearly 1 (red cells). For instance, N01 and N10 are only 16% intrinsically similar (expected from the distinct nature of their training procedures); nevertheless, their visual intelligence is 99% compatible. It is equally puzzling that two distinct systems reach an indistinguishable performance across all eight manipulations. To emphasise, we do not refer merely to their absolute classification accuracy, but rather to a one-to-one correspondence in every detail of all distortions, many of which involve random variables.
We conducted a similar procedure for all instances of ResNet50 on images of the ImageNet validation set. The visual intelligence of a subset of these networks is illustrated in Figure 4. Networks N1 and N2 were trained from scratch with one difference: during the training procedure, N2 was exposed to those eight image manipulations and N1 was not. All others are fine-tuned versions of N1, either with one distinct image augmentation (e.g., N1-Contrast with images of a random contrast) or all eight of them (i.e., N1-Alls). The classification accuracy of all obtained networks on original images of the ImageNet validation set with no distortion applied is in the range of 0.69 to 0.73. However, their average visual intelligence spans a larger spectrum, between 0.53 and 0.65.
It is worth highlighting that the original ResNet50 (He et al., 2016) scores very similarly to N1 across all distortions. N1-Alls and N2 match their absolute performance on original images while outperforming them in all types of image manipulation. Although this is not a focus of our investigation, these results suggest that architectures similar to the original ResNet50 have the capacity to learn many more features without increasing their number of parameters (Akbarinia & Gegenfurtner, 2019), even to the level of surpassing human performance on the same task of object classification under distorted images (Geirhos et al., 2018).
Correspondingly, we computed the visual intelligence compatibility and network intrinsic similarity for all 190 pairs of ResNet50 (refer to Figures 9 and 10 in supplementary materials for a pairwise comparison of the same set of twelve networks). The difference between these two measures is reported in Figure 3. We can observe multiple instances of expected cases (green cells). For example, N1 and N1-Gamma are identical intrinsically (99%) and so is their visual intelligence compatibility (99%). At the same time, the comparison of many pairs results in the same puzzling phenomenon. On the one hand, many of the N1-offspring that are of the same nature show no compatibility in their visual intelligence (yellow cells). On the other hand, N2 and N1-Alls, which are intrinsically very different, reach a comparable level of visual intelligence (red cells).
Comparison of deep neuronal networks (DNN) is often limited to their absolute performance, whether it is classification accuracy, computational time, or memory allocation, among others. Naturally, an infinite set of networks (both inter- and intra-architecture) could exhibit the same level of performance, irrespective of its type. The results of our experiments on image classification are another showcase of this fact. Several instances of ResNet reached the same capacity of visual intelligence defined in this article (see Figures 1 and 4). Extending this form of performance comparison to constancy across multiple tasks, instead of its absolute value, does not alter the nature of this infinite set, even if it reduces it. For instance, our analysis demonstrated that various networks perform indistinguishably from each other under eight types of image distortion (refer to Figures 6 and 9 in supplementary materials).
As discussed in Section 1, it is of great importance to understand the underlying mechanism of DNNs. Arguably, one fundamental approach is to faithfully measure intrinsic similarity among these artificial networks (again, both inter- and intra-architecture). To the best of our knowledge, this remains an open question. The simple measure of cosine similarity has been reported to explain the intrinsic comparison in transfer learning in a very small dimensional space (Schneider et al., 2018). Here, we considered convolutional kernels as the smallest building block of a DNN and utilised the Pearson correlation coefficient at this small dimension to capture the intrinsic similarity of a pair of networks (refer to Figures 7 and 10 in supplementary materials).
We hypothesised that there should be a close correspondence between these two measures: the innate structure of a network and its cognitive ability (i.e., in our experiments, object detection in images under multiple conditions). This is not an unrealistic hypothesis. An analogy to the parameters of a network is biological information, i.e., DNA and its four-base sequence. In this context, we share more cognitive abilities with other species of a similar nature, e.g., apes. Humans share 96% of their genome with them (Sequencing et al., 2005), many parts of our brain correspond to theirs, e.g., the visual cortex (Orban et al., 2004), and so does our ability in object detection (Rajalingham et al., 2015). Therefore, we essentially expected a reasonable match between Figures 6 and 7, and between Figures 9 and 10, in supplementary materials.
However, the results of our experiments refute this hypothesis. We observed many cases in which the intrinsic similarity of a pair of networks does not match their classification performance (see Figures 2 and 3). Those puzzling cases are not isolated; rather, they are quite common. This could be due to several objections either to the choice of our method or to the hypothesis itself.
4.1 Research question
The first objection would contend that an optimisation procedure can give rise to an infinite set of solutions for the same problem; therefore, there are no grounds to believe that a correspondence between the intrinsic similarity of networks and their performance should exist. We believe this claim is true to a certain extent. Two systems do not have to be identical to behave very similarly, as is evident in the animal kingdom, although there is a high probability that such systems would possess many characteristics in common. For instance, numerous aspects of the organisation of the brain, a formidably complicated structural network, are well understood to be preserved both across and within subjects (Bullmore & Sporns, 2009), and the same holds for the hierarchy of a DNN (LeCun et al., 2015). Therefore, the research question should hold true, if not in its absolute form, then to a large extent.
4.2 Measure of intrinsic similarity
The second objection would be that the Pearson correlation coefficient does not capture the complexity of a DNN. Certainly, correlation alone cannot account for the complexity of a network, in which structure and function are twisted together. Therefore, a comprehensive measure of intrinsic similarity between a pair of networks requires a topological approach (Börner et al., 2007). Nevertheless, there are reasons to believe that this simple measure should at least account for a part of the similarity at the level of kernels.
The notion that kernels of the first layer resemble Gabor filters or colour blobs (Yosinski et al., 2014) implies that individual kernels can be considered the smallest building blocks of a complicated structural network, each tuned to a specific stimulus, and consequently there should be some similarity between them across networks. Furthermore, it has been reported that the overall architecture of a DNN is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex (LeCun et al., 2015). This, in turn, supports a layer-to-layer comparison of a pair of networks, given that their constituent kernels are faithfully aligned. Indeed, it has been reported (Bau et al., 2017) that training conditions do not affect the dissection of AlexNet (Krizhevsky et al., 2012) into interpretable concepts. Given that this finding is grounded on kernels’ activation maps, it can be argued that the underlying kernels’ parameters should perform very similar operations.
We computed the correlation coefficient for a pair of parent-child networks (see Figure 5). This indicates in which layers the child network has deviated more from its parent (e.g., here the convolutional layer #20, i.e., res3c_branch2c in ResNet50). Next, we directly transferred (literally copying) the weights of res3c_branch2c from N1-All (child) to N1 (parent). The new network reaches a visual intelligence score of 0.55, i.e., 2% more than the original N1. This is a significant improvement, considering two facts: (i) res3c_branch2c contains only 66,048 parameters, which is a tiny fraction of the entire model (about 0.26%), and (ii) none of its input and output connections are changed. Correspondingly, transferring the weights of all convolutional layers from N1-All to N1 produces a network with the same level of performance as N1-All. Again, this is very significant, considering that the fully connected layer, which accounts for 8% of the network’s parameters, remains intact.
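The direct weight transfer and the parameter fractions quoted above can be illustrated with a small sketch. It assumes, purely for illustration, that a network is represented as a dictionary mapping layer names to NumPy weight arrays; the helper name `transfer_layers` and the toy shapes are hypothetical. The parameter counts are those given in the text, plus the standard size of the ResNet50 fully connected layer (2048 × 1000 weights and 1000 biases).

```python
import numpy as np

def transfer_layers(parent, child, layer_names):
    """Return a copy of `parent` in which the named layers come from `child`."""
    hybrid = {name: w.copy() for name, w in parent.items()}
    for name in layer_names:
        hybrid[name] = child[name].copy()
    return hybrid

# Toy stand-ins for N1 (parent) and N1-All (child); shapes are illustrative.
rng = np.random.default_rng(0)
shapes = {"res3c_branch2c": (1, 1, 128, 512), "fc1000": (2048, 10)}
parent = {name: rng.normal(size=s) for name, s in shapes.items()}
child = {name: rng.normal(size=s) for name, s in shapes.items()}

hybrid = transfer_layers(parent, child, ["res3c_branch2c"])

# Share of the model held by the transferred layer and by the classifier.
total_params = 25_636_712
layer_fraction = 66_048 / total_params            # roughly 0.26%
fc_fraction = (2048 * 1000 + 1000) / total_params  # roughly 8%
print(f"{layer_fraction:.2%}, {fc_fraction:.2%}")
```

Only the named layer is replaced; every other layer, including the classifier, keeps the parent's weights, mirroring the condition that the surrounding connections are left unchanged.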
4.3 Visual tasks
The third objection would state that the set of evaluations conducted to define the visual intelligence of networks is not well grounded. Perhaps a wider range of tasks is necessary to truly capture differences among networks. Naturally, the larger the information space, the greater the likelihood of differentiating networks. However, we believe the image manipulations defined in our experiments include challenging realistic scenarios that are necessary for robust object detection. Furthermore, the evaluation phase included conditions unseen during training, which have been shown to greatly trouble the generalisation of DNNs (Geirhos et al., 2018). Therefore, the visual intelligence compatibility computed from each pair of networks cannot be a coincidence and captures the similarity of cognitive ability to a reasonable level.
Recently, it has been demonstrated that an intertwined feature representation emerges for a network trained on multiple high-level cognitive tasks (Yang et al., 2019). Our approach differs in two aspects: (i) the variations in our tasks are of a low-level nature and (ii) we analysed kernels’ weights instead of their responses. Nevertheless, this compositionality of task representations is present in our experiments. The convolutional layer res3c_branch2c, discussed above, exhibits the same characteristic as in Figure 5 for all other offspring of N1. This means the very same layer encodes information about multiple low-level visual features, such as the contrast of the input image or the presence of noise. Furthermore, the direct transfer of this layer’s weights not only boosted the absolute visual intelligence by 2%, but also led to an improvement in every single task, reinforcing the idea that various low-level features are integrated into that layer.
4.4 Future works
Most works in the literature seek feature representation from the activity of kernels exposed to high-level concepts, e.g., (Bau et al., 2017; Yang et al., 2019). Although their findings are of great importance for the interpretation of DNNs, their general assumption is based on a latent variable perspective according to the hierarchical representation of data across layers. Recently, a novel perspective has been proposed to approach kernels’ operations with Riemannian geometry, in which consecutive layers successively warp the coordinate representation of training data (Hauser & Ray, 2017). In the conducted comparisons between N1 and its fine-tuned offspring, we observed that variation in low-level features (e.g., contrast, illuminant, and noise) highly influenced a middle layer (res3c_branch2c in ResNet50). Perhaps subtle changes in preceding layers have led to a complex representation of the data manifold at this layer. Therefore, we propose a geometrical matching of kernels (with their raw weights) to measure the intrinsic similarity of two networks.
Throughout this article, we argued for a correspondence between the intrinsic similarity of networks and their performance. The results of our experiments fall short of demonstrating this hypothesis. This mainly raises awareness of the need for a more adequate comparison of kernels’ weights and their connections between a pair of networks. Future works on this line of investigation would allow us to better quantify similarities and differences among various deep neuronal networks, which, in turn, would be a step forward towards a more profound understanding of their intelligence and hopefully our own.
This project was funded by the Deutsche Forschungsgemeinschaft SFB/TRR 135.
- Agrawal et al. (2014) Agrawal, P., Girshick, R., and Malik, J. Analyzing the performance of multilayer neural networks for object recognition. In European conference on computer vision, pp. 329–344. Springer, 2014.
- Akbarinia & Gegenfurtner (2018) Akbarinia, A. and Gegenfurtner, K. R. How is contrast encoded in deep neural networks? arXiv preprint arXiv:1809.01438, 2018.
- Akbarinia & Gegenfurtner (2019) Akbarinia, A. and Gegenfurtner, K. R. Manifestation of image contrast in deep networks. arXiv preprint arXiv:1902.04378, 2019.
- Bau et al. (2017) Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Boncelet (2009) Boncelet, C. Image noise models. In The Essential Guide to Image Processing, pp. 143–167. Elsevier, 2009.
- Börner et al. (2007) Börner, K., Sanyal, S., and Vespignani, A. Network science. Annual review of information science and technology, 41(1):537–607, 2007.
- Bullmore & Sporns (2009) Bullmore, E. and Sporns, O. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186, 2009.
- Cheney et al. (2017) Cheney, N., Schrimpf, M., and Kreiman, G. On the robustness of convolutional neural networks to internal architecture and weight perturbations. arXiv preprint arXiv:1703.08245, 2017.
- Eckstein et al. (2017) Eckstein, M. P., Koehler, K., Welbourne, L. E., and Akbas, E. Humans, but not deep neural networks, often miss giant targets in scenes. Current Biology, 27(18):2827–2832, 2017.
- Flachot & Gegenfurtner (2018) Flachot, A. and Gegenfurtner, K. R. Processing of chromatic information in a deep convolutional neural network. JOSA A, 35(4):B334–B346, 2018.
- Frazor & Geisler (2006) Frazor, R. A. and Geisler, W. S. Local luminance and contrast in natural images. Vision research, 46(10):1585–1598, 2006.
- Geirhos et al. (2018) Geirhos, R., Temme, C. R. M., Rauber, J., Schütt, H. H., Bethge, M., and Wichmann, F. A. Generalisation in humans and deep neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7549–7561. Curran Associates, Inc., 2018.
- Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
- Hauser & Ray (2017) Hauser, M. and Ray, A. Principles of riemannian geometry in neural networks. In Advances in Neural Information Processing Systems, pp. 2807–2816, 2017.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Kell et al. (2018) Kell, A. J., Yamins, D. L., Shook, E. N., Norman-Haignere, S. V., and McDermott, J. H. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630–644, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
- Logothetis & Sheinberg (1996) Logothetis, N. K. and Sheinberg, D. L. Visual object recognition. Annual review of neuroscience, 19(1):577–621, 1996.
- Mahendran & Vedaldi (2015) Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196, 2015.
- Marblestone et al. (2016) Marblestone, A. H., Wayne, G., and Kording, K. P. Toward an integration of deep learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
- Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
- Orban et al. (2004) Orban, G. A., Van Essen, D., and Vanduffel, W. Comparative mapping of higher visual areas in monkeys and humans. Trends in cognitive sciences, 8(7):315–324, 2004.
- Papernot et al. (2017) Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
- Pryluk et al. (2019) Pryluk, R., Kfir, Y., Gelbard-Sagiv, H., Fried, I., and Paz, R. A tradeoff in the neural code across regions and species. Cell, 2019.
- Quiroga et al. (2005) Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. Invariant visual representation by single neurons in the human brain. Nature, 435(7045):1102, 2005.
- Rajalingham et al. (2015) Rajalingham, R., Schmidt, K., and DiCarlo, J. J. Comparison of object recognition behavior in human and monkey. Journal of Neuroscience, 35(35):12127–12136, 2015.
- Schneider et al. (2018) Schneider, S., Ecker, A. S., Macke, J. H., and Bethge, M. Multi-task generalization and adaptation between noisy digit datasets: An empirical study. Neural Information Processing Systems (NIPS) Workshop on Continual Learning, 2018.
- Sequencing et al. (2005) Sequencing, T. C., Waterson, R. H., Lander, E. S., Wilson, R. K., Consortium, A., et al. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437(7055):69, 2005.
- Silver et al. (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
- Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Stallkamp et al. (2012) Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tian et al. (2018) Tian, Y., Pei, K., Jana, S., and Ray, B. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pp. 303–314. ACM, 2018.
- Titano et al. (2018) Titano, J. J., Badgeley, M., Schefflein, J., Pain, M., Su, A., Cai, M., Swinburne, N., Zech, J., Kim, J., Bederson, J., et al. Automated deep-neural-network surveillance of cranial images for acute neurologic events. Nat Med, 24(9):1337–1341, 2018.
- Yang et al. (2019) Yang, G. R., Joglekar, M. R., Song, H. F., Newsome, W. T., and Wang, X.-J. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 1(1):1546–1726, 2019.
- Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in neural information processing systems, pp. 3320–3328, 2014.
- Zeiler & Fergus (2014) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
- Zhang et al. (2018) Zhang, Q., Wu, Y. N., and Zhu, S.-C. Interpretable convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8827–8836, 2018.
- Zhang & Zhu (2018) Zhang, Q.-s. and Zhu, S.-C. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
- Zhao et al. (2018) Zhao, C. W., Daley, M. J., and Pruszynski, J. A. Neural network models of the tactile system develop first-order units with spatially complex receptive fields. PloS one, 13(6):e0199196, 2018.
Appendix A CIFAR-10
Figure 6 illustrates the visual intelligence compatibility of a subset of the studied ResNet20 networks.
Figure 7 illustrates the intrinsic similarity of a subset of the studied ResNet20 networks.
Figure 8 shows the difference between these two measures.
Appendix B ImageNet
Figure 9 illustrates the visual intelligence compatibility of a subset of the studied ResNet50 networks.
Figure 10 illustrates the intrinsic similarity of a subset of the studied ResNet50 networks.
Figure 11 shows the difference between these two measures.