Hierarchical nucleation in deep neural networks

by Diego Doimo, et al.

Deep convolutional networks (DCNs) learn meaningful representations where data that share the same abstract characteristics are positioned closer and closer. Understanding these representations and how they are generated is of unquestioned practical and theoretical interest. In this work we study the evolution of the probability density of the ImageNet dataset across the hidden layers in some state-of-the-art DCNs. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant for classification. In subsequent layers density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. Density peaks corresponding to single categories appear only close to the output and via a very sharp transition which resembles the nucleation process of a heterogeneous liquid. This process leaves a footprint in the probability density of the output layer where the topography of the peaks allows reconstructing the semantic relationships of the categories.




1 Introduction

Deep convolutional networks (DCNs) have become fundamental tools of modern science and technology. They provide a powerful approach to supervised classification, allowing the automatic extraction of meaningful features from data. The capability of DCNs to discover representations without human input has attracted the interest of the machine learning community. In the intermediate layers of a DCN, the data are represented with a set of features (the activations) embedded in a manifold whose tangent directions capture the relevant factors of variation of the inputs Bengio et al. [2013], Goldt et al. [2020]. Accordingly, understanding these data representations requires both studying the geometrical properties of the underlying manifolds and characterising the data distributions on them.

In the present paper, we analyse how the probability density of the data on the supporting manifold changes across the layers of a DCN. We consider in particular DCNs trained for classifying ImageNet; as we will see, the complexity and heterogeneity of this dataset critically affects the results of our analysis.

Comparisons between representations based on generalizations of multivariate correlation have already been performed with the methods in Raghu et al. [2017] (SVCCA), Morcos et al. [2018] (PWCCA) and, more extensively, in Kornblith et al. [2019] (CKA). Representational similarity analysis (RSA) Kriegeskorte et al. [2008], introduced originally in neuroscience, has been used to investigate artificial representations as well: in each layer a matrix of pairwise distances between the activation vectors of the data points (the representation dissimilarity matrix, RDM) tells which data are similar or dissimilar in that layer. The introduction of RDMs allowed performing multiple comparisons, including those between artificial and biological networks Khaligh-Razavi and Kriegeskorte [2014], Yamins et al. [2014], Cadieu et al. [2014].

More recently, the question of whether DCNs learn a hierarchy of classes was addressed by exploiting class confusion patterns Bilal et al. [2017]. Other studies investigated more specifically the geometrical and structural properties of the representations. In Ansuini et al. [2019] a common trend in the intrinsic dimension was found across several architectures; in Salakhutdinov and Hinton [2007] the soft-neighbor loss was used as a tool to reveal structural changes in the organization of neighbors across layers, also during training Frosst et al. [2019].

Figure 1: Evolution of the data representations in ResNet152. Projections of the representations of the input (left), conv4 (middle) and output (right) layers in ResNet152 for six ImageNet classes. Contours schematically portray the density isolines on the data manifold. The dark red circles surround the five nearest neighbors of a point $i$; $\chi_i^{l,gt}$ represents the fraction of these points that are in the same class as $i$.

Here we take a complementary perspective. One can view a DCN as an engine capable of iteratively shaping a probability density. The input data can be seen as instances harvested from a given probability distribution. This distribution is then modified again and again by applying, at each layer, a non-linear transformation to the data coordinates. The result of this sequence of transformations is well understood: in the output layer of a trained network, data belonging to different categories form well separated clusters, which can be viewed as distinct peaks of a probability density (see Fig. 1). But where in the network do these peaks appear? Do they develop slowly and gradually or all of a sudden? Is this change model-specific or is it shared across architectures? And what is the probability flux between a layer and the next? In the input layer the data points are mixed: data with different ground truth labels are close to each other. In the output layer, the neighborhood of each data point is ordered, namely it contains mostly data points belonging to the same category. Where in the network does the transition from disordered to ordered neighborhoods take place? Is it simultaneous with the formation of the probability peaks?

The pivotal role of depth in determining the accuracy of a neural network suggests that the transformation of the probability density should be slow and gradual in order to be effective. The analyses reported in Raghu et al. [2017], Morcos et al. [2018], Kornblith et al. [2019] are consistent with this scenario. However, we will see that, especially in a DCN trained for a complex classification task, the evolution of the probability density is far from smooth, with spikes in the probability flux and sudden changes in the modality.

We analyse the probability landscape in the intermediate layers of a DCN with a technique that allows estimating the probability density and characterizing its features even if it is defined as a function of hundreds of thousands of coordinates, provided that the data are embedded in a relatively low-dimensional manifold d’Errico et al. [2018]. The advantage of this approach with respect to standard dimensional reduction techniques is that the embedding manifold does not necessarily have to be a hyperplane, but can be arbitrarily curved, twisted and topologically complex. To analyse the probability flux between the layers we use an extension of the neighboring hit Paulovich et al. [2008]. The main results of this analysis, also sketched in Fig. 1, can be summarized as follows:

  • Representations in DCNs trained for complex classification tasks do not evolve smoothly, but through nucleation-like events, in which the neighborhood of a data point changes rather suddenly (Sec. 3.1);

  • In the first layers of the network any structure which is initially present in the probability density of the input is washed out, reaching a state with a single probability peak where the neighborhoods mainly contain simple images characterized by elementary geometrical shapes (Sec. 3.2);

  • In the successive layers a structure in the landscape starts to emerge, with probability peaks appearing in an order that mirrors the semantic hierarchy of the dataset: neighborhoods are first populated by images that share the same high-level attributes (Sec. 3.3);

  • In the output layer the probability landscape is formed by density peaks containing data points with the same ground truth classification; interestingly, these peaks are organized in complex "mountain chains" resembling the semantic kinship of the categories (Sec. 3.3).

In short, we find that the disorder-order transition induced by a trained DCN can be characterized without any reference to the ground truth categories as a sequence of changes in the modality of the probability density of the representations. These changes are achieved by reshuffling the neighbors of the data points again and again, in a process which resembles the diffusion in a heterogeneous liquid, followed by the nucleation of an ordered phase.

2 Methods

The DCNs we consider in this work are classifiers that map a data point $x_i$, for example an image, to its categorical target $y_i$, typically encoded with a one-hot vector of dimension equal to the number of classes. Feedforward networks achieve the task via a function composition that transforms the input sequentially, $x_i^{l} = f^{l}(x_i^{l-1})$. We call the vector $x_i^{l}$ containing the values of the activations of the $l$-th layer for data point $x_i$ the representation of $x_i$ at layer $l$. The sequence of representations of these data points in a trained network can be seen as a "trajectory" in a very high-dimensional space. The relative positions of the inputs change from an initial state where the neighborhood of each point contains members of different classes to a final state where images of the same class have been mapped close together to the same target point. We study this process with two approaches, one aimed at describing the probability flux across the layers (Sec. 2.1), the other aimed at characterizing the features of the probability density in each layer (Sec. 2.2).

2.1 The neighborhood overlap

Let $N_i^{l}$ be the set of $k$ points nearest to $x_i$ in Euclidean distance at a given layer $l$, and let $A^{l}$ be an adjacency matrix with entries $A^{l}_{ij} = 1$ if $x_j \in N_i^{l}$ and $0$ otherwise. Through $A^{l}$ we define an index of similarity between two layers $l$ and $m$ as:

$$\chi^{l,m} = \frac{1}{N}\sum_{i}\frac{1}{k}\sum_{j} A^{l}_{ij} A^{m}_{ij} \qquad (1)$$

where $N$ is the number of data points. The similarity just introduced has a very intuitive interpretation: it is the average fraction of common neighbors in the two layers considered. For this reason, we will refer to $\chi^{l,m}$ as the neighborhood overlap between layers $l$ and $m$.

In the same framework we also compare the similarity of a layer with the ground truth categorical classification, defining the "ground truth" adjacency matrix $A^{gt}_{ij} = 1$ if $y_i = y_j$ and $0$ otherwise. In this case $\chi^{l,gt}$ is the average fraction of neighbors of a given point in $l$ that are in the same class as the central point (see Fig. 1). We set $k$ to one tenth of the number of images per class, but we verified that our findings are robust with respect to the choice of $k$ over a wide range of values (see Sec. A.2). When calculated using the ground truth adjacency matrix as a reference, $\chi$ reduces to the neighboring hit Paulovich et al. [2008]. A measure of overlap quantitatively similar to $\chi^{l,m}$ can be obtained by using the method in Kornblith et al. [2019] with a Gaussian kernel of very small width (see Sec. A.4).
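As an illustration of the definitions in this section, the following is a minimal NumPy sketch of the neighborhood overlap between two layers and of the overlap with the ground truth labels. It is not the code released with the paper: the function names (`knn_indices`, `neighborhood_overlap`, `ground_truth_overlap`) are hypothetical, and the brute-force distance matrix is an assumption that is only practical for a few thousand points.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors (Euclidean) of every row of x,
    excluding the point itself. Brute force: builds the full N x N matrix."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_overlap(x_l, x_m, k):
    """Average fraction of the k nearest neighbors shared by two representations
    of the same N data points (rows must be in the same order)."""
    nn_l = knn_indices(x_l, k)
    nn_m = knn_indices(x_m, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_l, nn_m)]
    return float(np.mean(shared)) / k

def ground_truth_overlap(x_l, labels, k):
    """Average fraction of the k nearest neighbors that carry the same
    label as the central point."""
    labels = np.asarray(labels)
    nn_l = knn_indices(x_l, k)
    same = labels[nn_l] == labels[:, None]
    return float(same.mean())
```

Here `x_l` and `x_m` would be arrays of shape `(N, features)` holding the flattened activations of two layers. For large datasets an approximate or tree-based nearest-neighbor search would probably be preferable to the full distance matrix.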

2.2 Estimating the probability density

We analyse the structure of the probability density of data representations following the approach in d’Errico et al. [2018], which allows finding the peaks of the data probability distribution and the location and height of the saddle points between them. This in turn provides information on the relative hierarchical arrangement of the probability peaks.

The methodology works as follows. Using a kNN estimator, the local volume density $\rho_i$ around each point $i$ is estimated. The maxima of $\rho_i$ (namely the probability peaks) are then found. Data point $i$ is a maximum if the following two properties hold: (I) $\rho_i > \rho_j$ for all the points $j$ belonging to the neighborhood of $i$; (II) $i$ does not belong to the neighborhood of any other point of higher density d’Errico et al. [2018]. A different integer label is assigned to each of the maxima, and the data points that are not maxima are iteratively linked to one of these labels, by assigning to each point the same label as its nearest neighbor of higher density. The set of points with the same label corresponds to a probability peak.

The saddle points between probability peaks are then found. A point $i$ belonging to a peak $\alpha$ is assumed to belong to the border with a different peak $\beta$ if there exists a point $j$ of $\beta$ whose distance from $i$ is smaller than its distance from any other point belonging to $\alpha$. The saddle point between $\alpha$ and $\beta$ is the point of maximum density in the border between them.

Finally, the statistical reliability of the peaks is assessed as follows. Let $\rho_\alpha$ be the maximum density of peak $\alpha$, and $\rho_{\alpha\beta}$ the density of the saddle point between $\alpha$ and $\beta$. If the difference $\log\rho_\alpha - \log\rho_{\alpha\beta}$ falls below a threshold set by the parameter $Z$, peak $\alpha$ is merged with peak $\beta$, since the value of its density peak is considered indistinguishable from the saddle point at a statistical confidence fixed by $Z$ d’Errico et al. [2018]. The process is repeated until all the peaks satisfy this criterion, and are therefore statistically robust with a confidence $Z$.
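The first steps of this procedure (density estimation, search for the maxima, and assignment of the remaining points) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of d’Errico et al. [2018]: the function name `density_peaks` is hypothetical, the density is a plain kNN estimate with a fixed `k` and a user-supplied intrinsic dimension `dim`, and the saddle-point search and the merging of unreliable peaks are omitted.

```python
import numpy as np

def density_peaks(x, k, dim):
    """Return (log_rho, labels): a kNN log-density per point (up to an additive
    constant) and the integer label of the density peak each point belongs to."""
    n = len(x)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]       # k nearest neighbors of each point
    r_k = d[np.arange(n), nn[:, -1]]        # distance to the k-th neighbor
    log_rho = -dim * np.log(r_k)            # rho ~ k / r_k^dim, constants dropped

    # (I) a maximum is denser than every point in its own neighborhood
    is_max = np.all(log_rho[:, None] > log_rho[nn], axis=1)
    # (II) and it does not lie in the neighborhood of any denser point
    for j in range(n):
        is_max[nn[j][log_rho[nn[j]] < log_rho[j]]] = False

    labels = np.full(n, -1)
    labels[is_max] = np.arange(is_max.sum())
    order = np.argsort(-log_rho)            # from densest to sparsest
    for rank, i in enumerate(order):
        if labels[i] >= 0:
            continue
        denser = order[:rank]               # points already labeled
        if len(denser) == 0:                # can only happen with density ties
            labels[i] = labels.max() + 1
            continue
        # copy the label of the nearest neighbor of higher density
        labels[i] = labels[denser[np.argmin(d[i, denser])]]
    return log_rho, labels
```

Without the final merging step, statistical fluctuations of the kNN estimate can produce spurious small peaks, so the number of labels returned by this sketch is an upper bound on the number of robust peaks.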

2.3 The dataset and the network architecture

We perform our analysis on the ILSVRC2012 dataset, a subset of 1000 mutually exclusive classes of ImageNet which can be considered leaves of a hierarchical structure with 860 internal nodes. The highest level of the hierarchy contains seven classes, but most of the ILSVRC2012 images belong to only two of these (artifacts and animals) and are split almost evenly between them. Unless otherwise stated, the analysis in this work is performed on a subset of 300 randomly chosen categories, including 300 images for each category, for a total of 90,000 images.

We extracted the activations of the training set of ILSVRC2012 from a selection of PyTorch pre-trained ResNet He et al. [2016] and VGG Simonyan and Zisserman [2014] networks. We measure our quantities on the output of each ResNet block and on the pooling and final fully connected layers of VGGs (checkpoints). These are the layers where all the architectures downsample the channels (see Sec. A.1) and the learned representations become more abstract and invariant to details of the input irrelevant for the classification task Bengio et al. [2013]. This allows making a direct comparison between VGG and ResNet architectures of different depths.


We include the source code of our experiments, with the instructions required to run it on a selection of layers, in the supplementary material.

3 Results

It is well known that neural networks modify the representations of the data from an initial state where all the data are randomly mixed, to a final state where they are orderly clustered according to their ground truth labels Frosst et al. [2019]. But where in the network does this order arise? In the output layer the nearest neighbors of, say, the image of a cat are very likely other images of cats. But in which layer, and in which manner do cat-like images come together? We describe the ordering process by analyzing the change in the probability distribution across the layers.

3.1 The evolution of the neighbor composition in a DCN

We first characterize the probability flux by computing the neighborhood overlap $\chi^{l,out}$: the fraction of $k$-neighbors of a data point which are the same in layer $l$ and in the output layer (Eq. 1). Figure 2-a shows the behaviour of $\chi^{l,out}$ as a function of $l$ for the checkpoint layers of the ResNet152 described in Sec. 2.3 (orange line). The neighborhood overlap remains close to zero up to $l=142$. In the next 9 layers it starts growing significantly, reaching a value of 0.35 in layer 151 and 0.73 in layer 152, the last before the output. In the same figure, we also plot the neighborhood overlap of each layer with the ground truth classification, $\chi^{l,gt}$ (blue line). After layer 142, $\chi^{l,gt}$ changes even more abruptly than $\chi^{l,out}$, increasing from 0.10 to 0.72.

Figure 2: Overlap profiles in ResNet152 and in different architectures. (a:) Overlap between the checkpoint layers and the ground truth (blue) and with the output (orange). The green profile shows the overlap between nearby layers, $\chi^{l,l+1}$, with dots in correspondence to the checkpoints. (b:) Probability distribution of the per-point overlap $\chi_i^{l,gt}$ for four layers. (c:) Profiles of $\chi^{l,gt}$ for six architectures of different depths. The values of $\chi^{l,gt}$ measured on the checkpoints are displayed uniformly on the $x$-axis.

We can obtain more insight into this transition by looking at the probability distribution of the per-point overlap $\chi_i^{l,gt}$ across the dataset in four different layers (see Fig. 2-b). In the input and output layers the probability distribution is unimodal. In layer 142 (conv4), before the onset of the transition, the distribution is still strongly dominated by disordered neighbors, but an ordered tail starts to emerge. In layer 151 (conv5), immediately after the transition, the distribution indicates the coexistence of data points whose neighborhood is still disordered or only partially ordered (low $\chi_i^{l,gt}$) and data points whose neighborhood is already ordered ($\chi_i^{l,gt}$ close to one).

These results show that ordering, when measured by the consistency of the neighborhood of data points with respect to their class labels, changes abruptly, in a manner which qualitatively resembles the phase transition of a "nucleation" process. The green profile of Fig. 2-a reinforces this evidence, showing the overlap between two nearby layers, $\chi^{l,l+1}$. This quantity is a measure of the probability flux between any two consecutive layers. In the first layer the neighborhoods are almost completely reshuffled, as indicated by a low value of $\chi^{l,l+1}$. Afterwards, up to layer 142, $\chi^{l,l+1}$ stays high, indicating that the neighborhoods change their compositions smoothly, as in a slow diffusion process. In the two central blocks, from layer 10 to layer 142, it takes 20-30 layers to change half of the neighbors of each data point, i.e. to decrease the overlap to 0.5 (see Sec. A.3). At layer 142 instead, the first ordered nuclei appear and $\chi^{l,l+1}$ drops to 0.55 in just one layer. A significant reshuffling of the neighborhoods also takes place at layer 151, where $\chi^{l,l+1}$ drops again to 0.61. We will see in Sec. 3.3 that in this layer the structure of the probability density changes significantly, and the probability peaks corresponding to the "correct" categories appear. The effective "attractive force" acting between data with the same ground truth label overcomes the entropic-like component coming from the intrinsic complexity of the images, and clusters of akin images emerge almost all at the same time (i.e., at the same layer), giving rise to a sharp transition.

Is the sudden change we observed specific to this architecture or is it a common feature of deep networks trained for similar tasks? To answer this question we repeated the same experiment on architectures of different sizes of the ResNet and the VGG families. We observed a common trend in all the architectures, as depicted in Fig. 2-c, where we plot $\chi^{l,gt}$ for the checkpoints described in Sec. 2.3. In all the cases, $\chi^{l,gt}$ remains close to zero for many layers, and then sharply increases in only a few layers towards the end of the network. The value of $\chi^{l,gt}$ in the output layer is different in different architectures, consistent with the fact that their classification accuracy is different.

3.2 The data landscape before the onset of ordering

It has been argued that the first layers of deep networks serve the important task of getting rid of unimportant structures present in the dataset Shwartz-Ziv and Tishby [2017], Achille and Soatto [2018], Ansuini et al. [2019], LeCun et al. [2015]. This phenomenon is illustrated in the upper panel of Fig. 3, which shows that any overlap with the input layer is lost roughly after the conv3 landmark (layer 34).

Figure 3: Image entropy in ResNet152. (Top:) Overlap with the input (blue) and output (orange) layers. (Bottom:) Average image entropy within the first 30 neighbors; the error bars are shorter than the marker size; the most frequent images found in conv3 are shown on the right.

We found that in intermediate layers the DCNs analysed in this work construct high-dimensional hyperspherical arrangements of points with very few "simple" images at the center. This is related to the high intrinsic dimension (ID) of these layers Ansuini et al. [2019]. When the ID is very high, few data points act as "hubs" Radovanović et al. [2010], namely they fall in a large fraction of the other points' neighborhoods while the others fall in just a few. Moving from the input to conv3 the same images appear in a growing number of neighborhoods. In conv3 the top ten most frequent images are found in almost half of the 90,000 neighborhoods, with a high of 75,663 for the most frequent of all.
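The hub statistic used in this paragraph (in how many neighborhoods a given image appears) can be computed with a few lines of NumPy. The sketch below is illustrative only; the function name `neighbor_occurrences` is hypothetical and the brute-force distance matrix is an assumption suited to small samples.

```python
import numpy as np

def neighbor_occurrences(x, k):
    """For every point, count in how many k-neighborhoods of the
    other points it appears (its k-occurrence)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors of each point
    return np.bincount(nn.ravel(), minlength=len(x))
```

The counts always sum to $N k$; in a representation dominated by hubs, a handful of points would collect a large share of that total while most points would be counted only a few times.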

Hub images are particularly "simple", looking in most cases like elementary patterns (dots, blobs, etc.) lying on almost uniform backgrounds (see Fig. 3, bottom right). To quantify this perceptual judgment we computed the Shannon entropy of an image, $H = -\frac{1}{c}\sum_{\mathrm{channels}}\sum_{v} p_v \log p_v$, where $p_v$ is the normalized frequency of pixels of value $v$ and $c$ is the number of channels of RGB images. The average entropy of the neighbors of an image $i$ in a layer $l$ is given by $H_i^{l} = \frac{1}{k}\sum_{j \in N_i^{l}} H_j$, where $N_i^{l}$ is the set of the $k$ nearest neighbors of $i$ in layer $l$, and averaging across all images we obtain a measure of the neighborhood entropy of a layer, $H^{l} = \frac{1}{N}\sum_i H_i^{l}$. A low value of $H^{l}$ means that, in that layer, the neighborhoods contain many low-entropy images. In the bottom panel of Fig. 3 we show how $H^{l}$ changes: in intermediate layers, and most prominently in conv3, where the intrinsic dimension is at its peak (see Sec. A.5), the representation is organized around low-entropy hubs whose centers are low-$H$ images (blue line, and left stack of hub images). As a reference, we also report the entropy computed after shuffling the neighbor assignments (grey dashed line).
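A possible implementation of the two entropy measures described above is sketched here. Several details are assumptions rather than facts taken from the paper: images are taken to be `uint8` arrays of shape `(height, width, channels)`, the logarithm is in base 2, the entropy is computed per channel and then averaged, and the function names are hypothetical.

```python
import numpy as np

def image_entropy(img):
    """Shannon entropy (in bits) of the pixel-value histogram of an image,
    averaged over channels. img: uint8 array of shape (H, W, C)."""
    entropies = []
    for c in range(img.shape[-1]):
        counts = np.bincount(img[..., c].ravel(), minlength=256)
        p = counts[counts > 0] / counts.sum()   # normalized frequencies p_v
        entropies.append(-np.sum(p * np.log2(p)))
    return float(np.mean(entropies))

def neighborhood_entropy(entropies, nn):
    """Layer-level neighborhood entropy: mean over all points of the mean
    entropy of their neighbors. nn has shape (N, k) and holds neighbor indices."""
    return float(np.asarray(entropies)[nn].mean())
```

A uniformly colored image has zero entropy under this definition, while an image whose pixels are split evenly between two values has an entropy of one bit, which matches the intuition that hub images with plain backgrounds score low.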

3.3 The evolution of the probability density across the hidden layers

We have seen that at layer 34 (conv3) all images are arranged in a high-dimensional hypersphere and that at layer 151 (conv5) the neighborhoods are already organized consistently with the classification. Clearly, the most important transformations of the representation happen in between these layers. To shed some light on the evolution of the representations in these intermediate layers we use a tool which allows characterizing multidimensional probability distributions, finding their probability peaks and localizing all the saddle points separating these peaks (see Sec. 2.2). We will see that the nucleation-like transition described in Sec. 3.1 is a complex process, in which the network separates the data into a gradually increasing number of density peaks laid out in a hierarchical fashion that closely mirrors the hypernym-hyponym relations of the ILSVRC2012 dataset.

Figure 4: Structure and composition of density peaks of representations. (a:) A schematic view of the peaks in 6 layers. Color tones refer to the relative presence of animals and artifacts in each peak: dark red = mostly animals, dark blue = mostly artifacts. (b:) ARI profiles for the animal/artifact partition and the 300 low-level classes (blue and orange). (c:) The dendrogram portrays the hierarchical connections between the density peaks of the animal branch. On the $y$-axis the value of the density peaks is plotted in logarithmic scale. Two insets above show the detailed composition of specific branches (light blue and light green).

Figure 4-a shows a two-dimensional visualization of the number and organization of the probability peaks of the representations in some of the layers. In the input layer ($l=0$), the data are split into two major peaks, which roughly divide the training set into light and dark images. This structure is not useful for classification, and is wiped out within the first 34 layers of the network. In conv3 the probability density becomes unimodal, consistent with the analysis of the previous section. In the subsequent layers the network creates structure that is useful for classification, and in layer 97 a bimodal distribution appears. The other peaks shown in the figure are very small and retain only a few hundred data points each. The same density peaks persist until layer 142, where most of the images still reside in the two biggest ones. Finally, after layer 142 the two large peaks break down into smaller ones representing individual classes.

To assess the population of the density peaks in terms of ground-truth categories we use the Adjusted Rand Index (ARI) [Rand, 1971, Milligan and Cooper, 1986]. Roughly speaking, the ARI is zero if the density peaks do not correspond to the reference partition of the dataset, and is one if they match it. In Fig. 4-b we plot the ARI with respect to the high-level animal/artifact categories (orange line), and with respect to the low-level classes we sampled (blue line). From layer 97 to layer 142 artifacts and animals predominantly populate one of the two major peaks, increasing the high-level ARI to 0.22, while the correlation with the low-level classes remains absent. The following breakup of the peaks leads to a drop of the high-level ARI to 0 and to a concomitant sharp rise of the low-level ARI from 0 to 0.55, consistent with the nucleation mechanism detected by $\chi^{l,gt}$ and described in Sec. 3.1. Moreover, some classes are separated before others (Fig. 4-a, layer 145), consistent with the bimodality in the distribution of $\chi_i^{l,gt}$ observed in Fig. 2-b. Interestingly, many of the density peaks in the layers between 142 and 153 (i.e., during the nucleation transition) closely resemble the hierarchical structure of the concepts in ILSVRC2012. For instance, in layer 148 one can find peaks corresponding to insects and birds, but also to ships and buildings (see Sec. A.6).
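For reference, the Adjusted Rand Index between two labelings of the same points (for instance, density-peak labels against class labels) can be computed from pair counts as in the following standard-library sketch. The function name `adjusted_rand_index` is our own choice; the formula is the usual pair-counting form of the ARI, and the handling of the degenerate case is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same points."""
    n = len(labels_a)
    # pairs of points placed together in both partitions
    pairs_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # pairs of points placed together in each partition separately
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:                   # degenerate, e.g. trivial partitions
        return 1.0
    return (pairs_ab - expected) / (max_index - expected)
```

The index does not depend on the integer values used as labels, only on which points are grouped together, so peak identifiers and class identifiers can be compared directly.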

In the last layer ($l = 153$) the different peaks correspond to the different classes, but the structure of the probability density is much richer than a simple collection of disjoint peaks. Indeed, the hierarchical process that shaped the density landscape across the layers leaves a footprint on the organization of the peaks. For instance, the division into macro-classes of animals and artifacts formed in layer 97 is still present in the last layer, as indicated by the fact that red and blue clusters are found primarily on the left and on the right of the corresponding low-dimensional embedding (Fig. 4-a). But much more structure is present. In Fig. 4-c we visualize the probability landscape of the animal classes as a dendrogram, in which each leaf corresponds to a peak, and the leaves are merged sequentially, following the WPGMA algorithm Sokal [1958], according to the height of the saddle point of the probability density between them. In this manner the secondary probability peaks belonging to the same large-scale structure form a branch of the dendrogram. The height of a leaf in Fig. 4-c is proportional to the logarithm of the density of the peak. The morphological similarities of animals with similar genetic material make it possible for the dendrogram in Fig. 4 to reproduce the taxonomy of a phylogenetic tree to an astonishing degree. At the root of the dendrogram, we can notice a first distinction between mammals on the left and other animals on the right. At the following hierarchical level we can find a more specific separation of animal types: dogs, reptiles, birds, insects and so on can be easily identified. Finally, within each group, say dogs (Fig. 4), similar breeds are linked by tighter bonds, that is, by saddle points of higher density.
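The construction of the dendrogram from the saddle-point heights can be illustrated with a small WPGMA routine. This is a sketch under stated assumptions, not the procedure used to draw Fig. 4-c: the input is taken to be a symmetric matrix of saddle log-densities between peaks, the dissimilarity between two peaks is defined here as the largest matrix entry minus their saddle log-density (so that higher saddles merge first), and the function name `wpgma_merges` is hypothetical.

```python
import numpy as np

def wpgma_merges(saddle_log_density):
    """Merge peaks bottom-up with WPGMA. Returns a list of
    (members_of_cluster_1, members_of_cluster_2, dissimilarity) tuples."""
    s = np.array(saddle_log_density, dtype=float)
    dist = s.max() - s                    # higher saddle -> smaller dissimilarity
    np.fill_diagonal(dist, np.inf)
    clusters = {i: (i,) for i in range(len(s))}
    active = list(range(len(s)))
    merges = []
    while len(active) > 1:
        sub = dist[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((clusters[i], clusters[j], float(dist[i, j])))
        # WPGMA update: the distance from the merged cluster to any other
        # cluster is the plain average of the two previous distances
        new = 0.5 * (dist[i] + dist[j])
        dist[i, :] = new
        dist[:, i] = new
        dist[i, i] = np.inf
        clusters[i] = clusters[i] + clusters[j]
        active.remove(j)
    return merges
```

For larger problems the same linkage is available in SciPy as `scipy.cluster.hierarchy.linkage` with `method="weighted"`, applied to the condensed form of the dissimilarity matrix.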

In the supplementary material we include the values of the probability density and the integer identifier of the density peak to which each image belongs for the relevant representations analysed in this section. We also provide the topography of the probability density, namely the height of all the peaks and of all the saddle points between them.

4 Discussion and Conclusion

This paper presents an explicit characterization of the evolution of the probability density of the data landscape across the layers of a DCN. We showed that this probability density undergoes a sequence of transformations which leads to the emergence of a rugged and complex probability landscape. Rather surprisingly, we found that the development of these structures is not gradual, as one would expect in a deep network with more than one hundred layers. Instead, the greatest changes to the neighborhood composition and the emergence of the probability peaks are localized in a few layers. This picture seems qualitatively different from the one emerging from SVCCA Raghu et al. [2017], projection weighted CCA Morcos et al. [2018] and linear CKA Kornblith et al. [2019], which have revealed smoother changes between nearby representations. A first reason for this difference lies in the kind of correlation captured by these similarity indices. The ordering mechanism starting with the separation between animals and artifacts is functional to, and correlated with, a successful fine-grained classification of the categories. In essence, CCA-based methods capture the correlation between the final categories (the peaks) and the "intermediate level" concepts (the "mountain chains") required to construct them, which are recognized already in the middle layers of the network. The overlap defined in Eq. 1 measures instead a correlation that grows only when the neighbors become consistent with those of the output ($\chi^{l,out}$) or with their labels ($\chi^{l,gt}$). In Sec. A.4 of the appendix we show that $\chi$ is similar to the correlation obtained by Gaussian CKA Kornblith et al. [2019] using a very small kernel width.

A second possible reason for the discrepancy between the results reported in this work and those reported in the literature is the complexity of the datasets analysed. Indeed, most previous studies have focused on datasets like MNIST and CIFAR-10. These datasets lack the semantic stratification of ImageNet and hence show a much smoother evolution of the probability landscape, because the number of steps needed to disentangle the hierarchy of features of the categories is smaller (see Sec. A.5). We are unaware of attempts that directly targeted the similarity of DCN representations in connection with the hierarchical structure of a complex dataset like ImageNet. In Bilal et al. [2017], confusion matrices have been used to visually analyse the correlations between classes, showing results in agreement with our conclusions. However, the algorithm we use here (Sec. 2.2) relies on density estimation and is able to reconstruct a probability landscape that faithfully follows the hierarchical structure of categories (Sec. 3.3) in an unsupervised manner, with no need to consider the ground truth labels or to estimate the confusion matrix; indeed, our approach works also in the limiting case of 100% test accuracy.

We believe that the detailed picture of the evolution of the probability density provided in this work can trigger a more rational design of the architecture and of the learning protocols of DCNs. One can imagine defining training losses targeting the development of probability peaks according to a pre-defined semantic classification. This can be enforced in the intermediate layers of a network, rather than only in the output layer, enhancing the separation between macro categories that arises spontaneously for this dataset. One can even imagine using the topography of the density peaks developed by a deep neural network as a hierarchical classifier, going well beyond the sharp classification in categories. An appropriate understanding of the nucleation mechanism could also be beneficial to transfer learning, since it gives a simple rational criterion to judge the generality (i.e. transferability) of the features of a representation Yosinski et al. [2014]. Turning to synergies between artificial neural networks and neuroscience Barrett et al. [2019], we foresee many potential uses of this methodology, among which 1) assessing the local similarity of cortical representations in different brain areas, 2) localizing the cortical areas that code more explicitly for a certain set of features and 3) monitoring how neighbor relationships change during learning.

Broader impact

This work does not present any foreseeable societal consequence.


  • Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Goldt et al. [2020] Sebastian Goldt, Marc Mezard, Florent Krzakala, and Lenka Zdeborová. Modelling the influence of data structure on learning in neural networks: the hidden manifold model. 2020.
  • Raghu et al. [2017] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems 30, pages 6076–6085. 2017.
  • Morcos et al. [2018] Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, pages 5727–5736, 2018.
  • Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519–3529, 2019.
  • Kriegeskorte et al. [2008] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 2:4, 2008.
  • Khaligh-Razavi and Kriegeskorte [2014] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11), 2014.
  • Yamins et al. [2014] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
  • Cadieu et al. [2014] Charles F Cadieu, Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J Majaj, and James J DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comput Biol, 10(12):e1003963, 2014.
  • Bilal et al. [2017] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE transactions on visualization and computer graphics, 24(1):152–162, 2017.
  • Ansuini et al. [2019] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems 32, pages 6111–6122. 2019.
  • Salakhutdinov and Hinton [2007] Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419, 2007.
  • Frosst et al. [2019] Nicholas Frosst, Nicolas Papernot, and Geoffrey Hinton. Analyzing and improving representations with the soft nearest neighbor loss. arXiv preprint arXiv:1902.01889, 2019.
  • d’Errico et al. [2018] Maria d’Errico, Elena Facco, Alessandro Laio, and Alex Rodriguez. Automatic topography of high-dimensional data sets by non-parametric Density Peak clustering. arXiv e-prints, 2018.
  • Paulovich et al. [2008] F. V. Paulovich, L. G. Nonato, R. Minghim, and H. Levkowitz. Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping. IEEE Transactions on Visualization and Computer Graphics, 14(3):564–575, 2008.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Achille and Soatto [2018] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Radovanović et al. [2010] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(Sep):2487–2531, 2010.
  • Rand [1971] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
  • Milligan and Cooper [1986] Glenn W. Milligan and Martha C. Cooper. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4):441–458, 1986.
  • Sokal [1958] Robert R Sokal. A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull., 38:1409–1438, 1958.
  • Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
  • Barrett et al. [2019] David GT Barrett, Ari S Morcos, and Jakob H Macke. Analyzing biological and artificial neural networks: challenges with opportunities for synergy? Current opinion in neurobiology, 55:55–64, 2019.
  • LeCun et al. [2010] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. 2010.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix A Appendix

a.1 Details on the architectures considered

We briefly describe here the structure of the VGG Simonyan and Zisserman [2014] and ResNet He et al. [2016] architectures analysed in our work. The first part of the architectures is composed of convolutional and pooling layers. A convolutional layer maps a stack of features (or channels) of size into another through a filter. In ResNets and VGGs the size of the filter is mostly , their width is always equal to . The result of a convolution is then passed through a ReLU and produces one output channel. Different filters produce different output channels. When the size of the output channel is halved the number of filters is doubled. In VGGs channels are downsampled by pooling layers, in ResNets mainly by doubling the filter stride. Finally, VGGs end with three fully connected layers, ResNets with only one. In our study we used the convolutional layers that downsample the channels, together with the fully connected layers, as checkpoints.

a.2 Scaling of the overlaps with the neighborhood size and the dataset size

Figure A.1-a shows the overlap with the ground truth (top) and with the output activations (bottom) in ResNet152, for the same subset of 90,000 examples from ILSVRC2012 analysed in Sec. 2.3. In our experiments we empirically set the neighborhood size to one tenth of the number of images per class. Figure A.1-a shows that the trend of both overlaps is rather robust over a wide range of neighborhood sizes. Only when the neighborhood size is very large is the transition in the last layers of the network not detected very clearly.

In Figure A.1-b we plot how the overlap with the ground truth (top) and the overlap with the output layer (bottom) vary with the dataset size. As the number of examples increases we keep the ratio between the number of classes and the number of images per class constant. This shows that the results are also robust with respect to the dataset size.

Figure A.1: (a:) Profiles of the overlap with the ground truth labels (top) and with the output layer (bottom) as a function of the neighbor size. (b:) Profiles of the overlap with the ground truth labels (top) and with the output layer (bottom) as a function of the total number of images.
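The two quantities profiled in Figure A.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the experiments: it assumes Euclidean distances, takes the overlap with the ground truth to be the average fraction of a point's nearest neighbors that share its label, and takes the overlap between two layers to be the average fraction of nearest neighbors they have in common. The function names (`knn_indices`, `overlap_with_labels`, `overlap_between`) are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors (Euclidean) of each row of x, self excluded."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def overlap_with_labels(x, labels, k):
    """Average fraction of the k neighbors of a point that carry the same label."""
    nn = knn_indices(x, k)
    return float(np.mean(labels[nn] == labels[:, None]))

def overlap_between(x_a, x_b, k):
    """Average fraction of the k neighbors shared by two representations of the same points."""
    nn_a = knn_indices(x_a, k)
    nn_b = knn_indices(x_b, k)
    shared = [len(np.intersect1d(a, b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k
```

Here `x` is an array of shape (number of images, number of activations) for one layer and `labels` is an integer array of class labels. The full distance matrix is held in memory, so for tens of thousands of images a tree-based or batched neighbor search would be needed instead.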

a.3 Overlap with the checkpoint layers

Figure A.2 shows the overlap of the representations with respect to the representation at three layers, 25, 88 and 148, belonging to three distinct ResNet152 blocks. On average, changing half of the neighbors takes more layers in conv3 and in conv4 than in conv5, where the nucleation takes place and the same change occurs in just one layer. Indeed, the rate at which neighbors are reshuffled grows dramatically when the ordered clusters appear (see Sec. 2.1). The neighborhood composition also changes significantly between two blocks, when the channels are downsampled.

Figure A.2: Overlap with layers 25, 88, 148 in ResNet152. Different background colors indicate different ResNet blocks

a.4 Centered kernel alignment vs overlap

Centered kernel alignment (CKA) is the normalized squared Hilbert-Schmidt norm of the cross covariance operator between representations Kornblith et al. [2019]. Like the neighborhood overlap, it is invariant under orthogonal transformations and isotropic scaling but not under an arbitrary invertible linear transformation. Invariance to arbitrary invertible linear transformations has been argued to be undesirable for a similarity index between representations Kornblith et al. [2019]. Gaussian CKA probes the local similarity between representations and can be seen as a kernel smoothing of the neighborhood overlap presented in Sec. 2.1. In Figure A.3-a we show the Gaussian CKA (orange) and the overlap (green) with the output layer, setting the kernel bandwidth to 0.2 times the average distance to the first nearest neighbor. Linear CKA is equivalent to a CCA between representations in which the canonical variates are weighted by the corresponding eigenvalues Kornblith et al. [2019]. Linear CKA steadily increases already in the early layers of the network (see Fig. A.3-a, blue profile).

Figure A.3-b shows how the Gaussian CKA with the output is affected by different choices of the kernel bandwidth. The smaller the bandwidth, the sharper the transition measured by the index.

Figure A.3: (a:) Linear CKA (blue), overlap (green) and Gaussian CKA (orange) with the output layer in ResNet152 for a subset of 5000 ILSVRC2012 images. We kept 50 classes and 100 images per class and set the kernel bandwidth to 0.2 times the average distance to the first nearest neighbor. (b:) Gaussian CKA with the output layer as a function of the kernel bandwidth; the blue, orange, green, red and pink profiles correspond to different bandwidth values.
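For concreteness, linear and Gaussian CKA can be sketched as follows. This is a minimal NumPy version based on the Gram-matrix formulation of Kornblith et al. [2019], not the code used for Figure A.3. The way the bandwidth enters the kernel, exp(-d²/(2σ²)) with σ equal to `scale` times the mean first-neighbor distance, is an assumption, and all function names are hypothetical.

```python
import numpy as np

def center_gram(gram):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h

def cka_from_grams(gram_x, gram_y):
    """CKA = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L)), using centered Gram matrices."""
    kc, lc = center_gram(gram_x), center_gram(gram_y)
    return float(np.sum(kc * lc) / np.sqrt(np.sum(kc * kc) * np.sum(lc * lc)))

def linear_cka(x, y):
    """CKA with a linear kernel; rows of x and y are the same examples."""
    return cka_from_grams(x @ x.T, y @ y.T)

def gaussian_gram(x, scale=0.2):
    """Gaussian Gram matrix with bandwidth = scale * mean first-neighbor distance (assumed convention)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    np.fill_diagonal(d2, 0.0)
    off_diag = d2.copy()
    np.fill_diagonal(off_diag, np.inf)            # ignore self-distances
    sigma = scale * np.mean(np.sqrt(off_diag.min(axis=1)))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gaussian_cka(x, y, scale=0.2):
    """CKA with a Gaussian kernel whose bandwidth is set separately in each representation."""
    return cka_from_grams(gaussian_gram(x, scale), gaussian_gram(y, scale))
```

Because the bandwidth is tied to the nearest-neighbor distance of each representation, this version of Gaussian CKA is unchanged by isotropic rescaling of either input, in line with the invariances discussed above.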

a.5 Overlap and intrinsic dimension profiles in different datasets

Figure A.4: Overlap with the ground truth labels (Top) and intrinsic dimension profiles (Bottom) in ResNet152 for different datasets: MNIST (red), CIFAR10 (orange), modMNIST (green), ImageNet (blue).

In this section we compare the overlap with the ground truth labels and the intrinsic dimension (ID) profiles of different datasets of increasing complexity in ResNet152.

The top panel of Figure A.4 shows the overlap with the ground truth labels for a ResNet152 architecture trained on MNIST LeCun et al. [2010], modMNIST, CIFAR-10 Krizhevsky et al. [2009] and ImageNet Deng et al. [2009]. To generate the modMNIST dataset we resized the MNIST digits by a factor ranging from 0.2 to 0.4 and moved them to a random location of the image. We finally scaled up the size of the images to 224x224 pixels. We trained the network on MNIST and modMNIST for 10 and 20 epochs respectively using the Adam optimizer Kingma and Ba [2014] with default parameters (lr = 0.001, betas = (0.9, 0.999)); we trained on CIFAR10 for 120 epochs with stochastic gradient descent with momentum (lr = 0.1, momentum = 0.9), decreasing the learning rate by a factor 10 after 60 epochs; we used the PyTorch pre-trained ResNet152 model for ImageNet.
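The modMNIST construction described above can be sketched as follows. This is only an illustration under stated assumptions: the shrunken digit is pasted on a black canvas of the original 28x28 size, the scaling factor is drawn uniformly between 0.2 and 0.4, and nearest-neighbor interpolation is used for both resizing steps. None of these details are specified in the text, and the helper names are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Resize a 2D array with nearest-neighbor sampling."""
    rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
    return img[rows[:, None], cols[None, :]]

def make_mod_digit(digit, rng, canvas=28, out_size=224, min_scale=0.2, max_scale=0.4):
    """Shrink a digit, paste it at a random position on a blank canvas, then upscale."""
    scale = rng.uniform(min_scale, max_scale)        # assumed uniform in [0.2, 0.4)
    side = max(1, int(round(canvas * scale)))
    small = resize_nearest(digit, side, side)
    top = rng.integers(0, canvas - side + 1)         # random location inside the canvas
    left = rng.integers(0, canvas - side + 1)
    out = np.zeros((canvas, canvas), dtype=digit.dtype)
    out[top:top + side, left:left + side] = small
    return resize_nearest(out, out_size, out_size)   # final 224x224 image
```

Calling `make_mod_digit(digit, np.random.default_rng(0))` on a 28x28 array returns a 224x224 array in which the digit occupies between roughly a fifth and two fifths of the image side.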

MNIST can be directly classified with high accuracy with a k-NN estimator. Consistently, the overlap with the ground truth labels is already high in the input layer and reaches one from conv3 onwards. In modMNIST and CIFAR-10 there are only 10 categories, therefore the initial values of the overlap are larger and the lag phase is shorter than that of ImageNet. While the overlap behaves qualitatively similarly in modMNIST, CIFAR-10 and ILSVRC2012, its transition seems to be sharp only for the ILSVRC2012 dataset, and is therefore likely related to the complexity of the prediction task.

The bottom panel shows the intrinsic dimension (ID) for the same datasets across the checkpoint layers of ResNet152. The higher the complexity of the dataset, the more factors of variation are encoded in the embedding manifold and the higher the ID. For complex datasets like ImageNet the ID has the hunchback shape reported in Ansuini et al. [2019], while for MNIST and modMNIST it is almost constant and takes much smaller values, uncorrelated with the overlap with the ground truth labels. This supports the hypothesis that the transition observed in the value of the neighborhood overlap is not necessarily related to a sharp change of the intrinsic dimension of the representation.
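This section does not state which ID estimator was used. As an illustration only, the sketch below assumes a TwoNN-style estimator, which infers the ID from the ratio between the distances to the second and to the first nearest neighbor of each point. It uses the maximum-likelihood form of the estimate rather than a linear fit, and the function name `two_nn_id` is hypothetical.

```python
import numpy as np

def two_nn_id(x):
    """Maximum-likelihood TwoNN estimate of the intrinsic dimension of the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    d2 = np.maximum(d2, 0.0)
    np.fill_diagonal(d2, np.inf)                   # exclude self-distances
    nearest = np.sqrt(np.sort(d2, axis=1)[:, :2])  # first and second neighbor distances
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                                  # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model mu is Pareto distributed with exponent equal to the ID
    return float(np.sum(keep) / np.sum(np.log(mu)))
```

Applied to the activations of each checkpoint layer in turn, such an estimator yields one ID value per layer, which is the kind of profile plotted in the bottom panel of Figure A.4.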

a.6 Details of density peaks appearing between layer 142 and 148

In Figure A.5 we report a visualisation of the density peaks appearing during the "nucleation transition" of ResNet152. In particular, the image shows the size and approximate composition of the peaks present in layers 142, 145, and 148. As discussed in Section 3.3, in layer 142 the data density is dominated by two large peaks composed of images of animals and artifacts respectively. This structure is visible in panel (a), in which one can easily identify the two large peaks. In the subsequent layers, the animal and artifact peaks break down into small peaks containing images of the same class. The process happens in a hierarchical fashion: peaks corresponding to multiple classes sharing many semantic similarities appear first, and subsequently break down into smaller peaks corresponding to the single classes. This phenomenon can be observed in panels (b) and (c). For instance, in layer 145 (panel (b)) one can clearly identify peaks corresponding to certain kinds of arachnids (wolf spider, harvestman, tick, …), insects (black and gold spider, leaf beetle, barn spiders), 4-wheel means of transportation (beach wagon, convertible, minivan), dogs (Samoyed, keeshond, chow), and so on. In layer 148 (panel (c)) this process continues and one finds many more peaks, corresponding either to single classes (e.g., iPod, piggy bank and beer bottle) or to groups of similar classes. At the end of the nucleation process, in layer 152 (not shown here), one finds approximately one peak corresponding to each class label.

Figure A.5: Composition of density peaks in layers 142 (a), 145 (b) and 148 (c). The x-axis indicates the size of the peak, the y-axis reports the categories represented with more than 150 points in the peak. Consecutive dots ("…") indicate that more than three categories are well represented in the peak. The peaks are ordered from the smallest to the largest from top to bottom.
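The summary shown in Figure A.5 can be assembled from a peak assignment and the category labels along the following lines. This is a sketch, not the code behind the figure: the threshold of 150 points, the limit of three listed categories and the ordering by peak size follow the caption, while the function name `peak_composition` and its input format are hypothetical.

```python
from collections import Counter

def peak_composition(peak_ids, categories, min_points=150, max_listed=3):
    """Per peak: (peak id, size, well-represented categories), sorted from smallest to largest."""
    members = {}
    for peak, category in zip(peak_ids, categories):
        members.setdefault(peak, []).append(category)
    summary = []
    for peak, cats in members.items():
        counts = Counter(cats)
        # categories represented with more than min_points points in the peak
        well = [c for c, n in counts.most_common() if n > min_points]
        listed = well[:max_listed]
        if len(well) > max_listed:
            listed.append("…")  # more than three categories are well represented
        summary.append((peak, len(cats), listed))
    summary.sort(key=lambda row: row[1])
    return summary
```

Here `peak_ids[i]` is the density peak assigned to image `i` by the clustering step and `categories[i]` is its class name.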