Intrinsic dimension of data representations in deep neural networks

05/29/2019 ∙ by Alessio Ansuini, et al. ∙ SISSA ∙ Technische Universität München

Deep neural networks progressively transform their inputs across multiple processing layers. What are the geometrical properties of the representations learned by these networks? Here we study the intrinsic dimensionality (ID) of data-representations, i.e. the minimal number of parameters needed to describe a representation. We find that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer. Across layers, the ID first increases and then progressively decreases in the final layers. Remarkably, the ID of the last hidden layer predicts classification accuracy on the test set. These results can neither be found by linear dimensionality estimates (e.g., with principal component analysis), nor in representations that had been artificially linearized. They are neither found in untrained networks, nor in networks that are trained on randomized labels. This suggests that neural networks that can generalize are those that transform the data into low-dimensional, but not necessarily flat manifolds.


1 Introduction

Deep neural networks (DNNs), including convolutional neural networks (CNNs) for image data, are among the most powerful tools for supervised data classification. In DNNs, inputs are sequentially processed across multiple layers, each performing a nonlinear transformation from a high-dimensional vector to another high-dimensional vector. Despite the empirical success and widespread use of DNNs, we still have an incomplete understanding of why and when they work so well – in particular, it is not yet clear why they are able to generalize well to unseen data, notwithstanding their massive overparametrization [1]. While progress has been made recently (e.g. [2, 3]), guidelines for selecting architectures and training procedures are still largely based on heuristics and domain knowledge.

A fundamental geometric property of a data representation in a neural network is its intrinsic dimension (ID), i.e. the minimal number of coordinates which are necessary to describe its points without significant information loss. It is widely appreciated that deep neural networks are over-parametrized, and that there is substantial redundancy amongst the weights and activations of deep nets – e.g., several studies in network compression have shown that many weights in deep neural networks can be pruned without significant loss in classification performance [4, 5]. In ref. [6], linear estimates of the ID in DNNs were computed theoretically and numerically in simplified models. In [7], estimates of the ID were related to robustness properties of deep networks to adversarial attacks, showing that a low local intrinsic dimension correlates positively with robustness. In [8], the local ID of object manifolds was estimated with a linear approach applied to several locations on the tangent space, and was found to decrease along the last hidden layers of AlexNet [9]. Linear [10, 11] and nonlinear dimensionality reduction techniques have been used extensively to visualize computations in deep networks [12].

However, there has not been a direct and systematic characterization of how the intrinsic dimension of data manifolds varies across the layers of CNNs. We here leverage TwoNN [13], a recently developed estimator for the ID that exploits the fact that nearest-neighbour statistics depend on the ID [14] (see Fig. 1 for an illustration). TwoNN can be applied even if the manifold containing the data is curved, topologically complex, and sampled non-uniformly. The procedure is not only accurate, but also computationally efficient: in a few seconds on a desktop PC, it provides an estimate of the ID of a data set of thousands of points, each with as many coordinates as the units of a layer (for example the activations in an intermediate layer of a CNN), thus making it possible to map out the ID across multiple layers and networks. Using this estimator, we investigated the variation of the ID along the layers of a wide range of deep neural networks trained for image recognition. Specifically, we addressed the following questions:


  • How does the ID change along the layers of CNNs? Do CNNs compress representations into low-dimensional manifolds, or conversely seek to expand the dimensionality?

  • How different is the ID from the ‘linear’ dimensionality of a network, i.e., the dimensionality of the linear subspace containing the data-manifold? A substantial mismatch would indicate that the underlying manifolds are curved rather than flat.

  • How is the ID of a network related to the generalization performance? Can we find empirical signatures of generalization performance in the geometrical structure of the representations?

Figure 1: The TwoNN estimator derives an estimate of intrinsic dimensionality from the statistics of nearest-neighbour distances.

Our analyses show that data representations in CNNs are embedded in manifolds of low dimensionality, typically several orders of magnitude lower than the dimensionality of the embedding space (the number of units in a layer). In addition, we found that the variation of the ID along the hidden layers of CNNs follows a similar trend across different architectures – the early layers expand the dimensionality of the representations, followed by a monotonic decrease that brings the ID to low values in the final layers.

Moreover, we observed that, in networks trained to classify images, the ID of the training set in the last hidden layer is an accurate predictor of the network’s classification accuracy on the test set – i.e., the lower the ID in this layer, the better the network’s ability to correctly classify the image categories in a test set. Conversely, the ID before the output remains high for a network trained on non-predictable data (i.e., permuted labels), on which the network is forced to memorize rather than generalize.

These geometrical properties of representations in trained neural networks were empirically conserved across multiple architectures, and might point to an operating principle of deep neural networks.

2 Estimating the intrinsic dimension of data representations

Inferring the intrinsic dimension of high-dimensional and sparsely sampled data representations is a challenging statistical problem. To estimate the ID of data-representations in deep networks, we leverage a recently developed ID-estimator (‘TwoNN’) that is based on computing the ratio between the distances to the second and first nearest neighbors (NN) of each data point [13] (see Fig. 1). This allows overcoming the problems related to the curvature of the embedding manifold and to the local variations in the density of the data points, under the weak assumption that the density is constant on the scale of the distance between each point and its second nearest neighbor.

Formally, let $x_1, \dots, x_N$ be points uniformly sampled on a manifold with intrinsic dimension $d$, where $N$ is the total number of points. Let $r_i^{(1)}$ and $r_i^{(2)}$ be the distances of the first and second neighbor of $x_i$, respectively. Then the ratio $\mu_i = r_i^{(2)} / r_i^{(1)}$ follows a Pareto distribution with parameter $d$ on $[1, +\infty)$, namely $P(\mu_i \mid d) = d\,\mu_i^{-(d+1)}$. Taking advantage of this observation, we can formulate the likelihood of the vector $\mu = (\mu_1, \dots, \mu_N)$ as

$$P(\mu \mid d) = d^{N} \prod_{i=1}^{N} \mu_i^{-(d+1)} \qquad (1)$$

At this point, $d$ can be easily computed, for instance by maximizing the likelihood, or, following [13], by employing the empirical cumulative distribution of the $\mu$ values to reduce the ID estimation task to a linear regression problem. The ID estimated by this approach is asymptotically correct even for samples harvested from highly non-uniform probability distributions. For a finite number of data points, the estimated values remain very close to the ground truth ID when this is smaller than 20. For larger IDs and finite sample size, the approach moderately underestimates the correct value, especially if the density of data is non-uniform. Therefore, the values reported in the following figures, when larger than 20, should be considered as lower bounds.
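To make the procedure concrete, here is a minimal sketch of a TwoNN estimate, assuming scikit-learn for the nearest-neighbour search. The function name, the choice of discarding the tail of the μ distribution, and the regression-through-the-origin fit follow the description above, but the authors' reference implementation may differ in its details.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, fraction=0.9):
    """Minimal TwoNN intrinsic-dimension estimate (sketch).

    X: (N, D) array, one row per data point (e.g., flattened layer activations);
       assumes no duplicate points.
    fraction: fraction of points kept in the regression, to limit the influence
       of the largest mu values (an illustrative choice).
    """
    N = X.shape[0]
    # distances to the two nearest neighbours of each point
    # (k=3 because the closest "neighbour" returned is the point itself)
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dists[:, 2] / dists[:, 1])      # ratio r2 / r1 per point
    F = np.arange(1, N + 1) / N                  # empirical cumulate of mu
    keep = int(fraction * N)
    x = np.log(mu[:keep])
    y = -np.log(1.0 - F[:keep])                  # Pareto model: y = d * x
    return float(np.dot(x, y) / np.dot(x, x))    # slope of a line through the origin
```

Given a matrix of activations extracted from a layer, `twonn_id(activations)` then returns the estimated ID of that representation.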

For real-world data, the intrinsic dimension always depends on the scale of distances on which the analysis is performed. This implies that the reliability of the dimensionality estimate needs to be assessed by measuring the intrinsic dimension at different scales and by checking whether it is, at least approximately, scale invariant [13]. In our analyses, this test was performed by systematically decimating the dataset, thus gradually reducing its size. The ID was then estimated on the reduced samples, in which the average distance between data points had become progressively larger. This allowed estimating the dependence of the ID on the scale. As explained in [13], if the ID is well-defined, its estimated value will only depend weakly on the number of data points $N$.
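The decimation test described above can be sketched as follows, reusing the hypothetical `twonn_id` function from the previous snippet: the ID is re-estimated on random subsamples of decreasing size, and the estimate is considered reliable if it varies only weakly with the sample size.

```python
import numpy as np

def id_vs_sample_size(X, fractions=(1.0, 0.5, 0.25, 0.125), n_rep=5, seed=0):
    """Estimate the ID on random subsamples of decreasing size (decimation check)."""
    rng = np.random.default_rng(seed)
    profile = {}
    for f in fractions:
        n = int(f * X.shape[0])
        ids = [twonn_id(X[rng.choice(X.shape[0], size=n, replace=False)])
               for _ in range(n_rep)]
        profile[n] = (np.mean(ids), np.std(ids))   # mean and spread of the ID at this scale
    return profile
```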

For computational efficiency, we analyzed the representations of a subset of layers. We extracted representations at pooling layers after a convolution or a block of consecutive convolutions, and at fully connected layers. In the experiments with ResNets, we extracted the representations after each ResNet block [15] and the average pooling before the output. See A.1 for details.
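As an illustration of this extraction step, the following sketch uses PyTorch forward hooks to record the representations at the pooling and fully connected layers of a torchvision VGG-16. The checkpoint names and the placeholder input batch are ours; the actual experiments used the datasets and checkpoints described in A.1.

```python
import torch
import torchvision.models as models

model = models.vgg16(pretrained=True).eval()
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # flatten each representation into an (N, D) matrix for the ID analysis
        activations[name] = output.flatten(start_dim=1).detach().cpu().numpy()
    return hook

for i, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.MaxPool2d):
        layer.register_forward_hook(make_hook(f"pool_{i}"))
for i, layer in enumerate(model.classifier):
    if isinstance(layer, torch.nn.Linear):
        layer.register_forward_hook(make_hook(f"fc_{i}"))

with torch.no_grad():
    model(torch.randn(16, 3, 224, 224))   # placeholder batch; use real images in practice
# 'activations' now maps each checkpoint to an (N, D) matrix, e.g. twonn_id(activations["fc_0"])
```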

3 Results

3.1 The intrinsic dimension exhibits a characteristic shape across several networks

Our first goal was to empirically characterize the ID of data representations in different layers of deep neural networks. Given a layer of a DNN, an individual data point (e.g., an image) is mapped onto the set of activations of all the units of the layer, which defines a point in a space whose dimension equals the number of units. We refer to this number as the embedding dimension (ED) of the representation in that layer. A set of input samples (e.g., images) generates, in each layer, a set of ED-dimensional points. We estimated the dimension of the manifold containing these points using TwoNN.

We first investigated the variation of the ID across the layers of a VGG-16 network [16], pre-trained on ImageNet [9], and fine-tuned and evaluated on a synthetic data-set of 1440 images [17]. The dataset consisted of 40 3D objects, each rendered in 36 different views (we left out 6 images for each object as a test set) – it thus spanned a spectrum of different appearances, but of a small number of underlying geometrical objects. When estimating the ID of data representations on this network (referred to as ‘VGG-16-R’), we found that the ID first increased in the first pooling layer, before monotonically decreasing across the following layers, reaching very low values in the final hidden layers (Fig. 2A). For instance, in the fourth pooling layer (pool4) of VGG-16-R, the ID was orders of magnitude smaller than the ED (the number of units in the layer). One potential concern is whether the number of stimuli is sufficient for the ID estimate to be robust. To investigate this, we repeated the analysis on subsamples randomly chosen on the data manifold, finding that the estimated IDs were indeed stable across a wide range of sample sizes (Fig. 2B). We note that, for the early/intermediate layers, the reported values of the ID are likely a lower bound to the real ID (see discussion in [13]).

Figure 2: Modulation of the ID across hidden layers of deep convolutional networks. A) ID across the layers of VGG-16-R; error bars are the standard deviation of the ID (see A.1). Numbers in the plot indicate the embedding dimensionality of each layer. B) Subsampling analysis on the VGG-16-R experiment, reported for the same layers as in the inset in A (see A.1 for details).

Are the ‘hunchback’ shape of the ID variation across the layers (i.e., the initial steep increase followed by a gradual monotonic decrease) and the overall low values of the ID specific to this particular network architecture and dataset? To investigate this question, we repeated these analyses on several standard architectures (AlexNet, VGG and ResNet) pre-trained on ImageNet [18]. Specifically, we computed the average ID of the object manifolds corresponding to the 7 most populated ImageNet categories, using 500 images per category (see section A.1). We found both the hunchback shape and the low IDs to be preserved across all networks (Fig. 3A): the ID initially grew, then reached a peak or a plateau and, finally, progressively decreased towards its final value. The ID in the output layer was the smallest, often assuming a value of the order of ten.

Is the relative (rather than the absolute) depth of a layer indicative of the ID? To investigate this, we plotted the ID against relative depth (defined as the absolute depth of the layer divided by the total number of layers, not counting batch normalization layers [10]) for the 14 models belonging to the three classes of networks (Fig. 3B). Remarkably, the ID profiles approximately collapsed onto a common hunchback shape (with the exception of AlexNet and of a small network trained on MNIST in a separate analysis; see section 3.4 for details), despite considerable variations in the architecture, number of layers, and optimization algorithms. For networks belonging to the VGG and ResNet families, the rising portions of the ID profiles substantially overlapped, with the ID reaching similar large peak values (between 100 and 120) in the relative depth range 0.2–0.4. The dependence on relative depth is consistent with the results of [10], where it was observed that similarity between layers depended on relative depth.

Notably, in all networks the ID eventually converged to small values in the last hidden layer. These results suggest that state-of-the-art deep neural networks – after an initial increase in ID – perform a progressive dimensionality reduction of the input feature vectors. One could speculate that this progressive, gradual reduction of the dimensionality of data manifolds is a feature of deep neural networks which allows them to generalize well. In the following, we investigate this idea further by showing that the ID of the last hidden layer predicts generalization performance, and that these properties are found neither in networks with random weights nor in networks trained on non-predictable data.

Figure 3: ID of object manifolds across networks. A) IDs of data representations for 4 networks: each point is the average of the IDs of 7 object manifolds. The error bars are the standard deviations of the ID across the single-object estimates (see A.1). B) The ID as a function of relative depth in 14 deep convolutional networks spanning different sizes, architectures and training techniques. Despite the wide diversity of these models, the ID profile follows a typical hunchback shape (error bars not shown).

3.2 The intrinsic dimension of the last hidden layer of deep networks predicts classification performance

Although the hunchback shape was preserved across networks, the IDs in the last hidden layers were not exactly the same for all the networks. To better resolve such differences, we computed the ID in the last hidden layer of each network using a much larger pool of images from the training set (approximately 2,000), sampled from all ImageNet categories (see section A.1). This revealed a spread of ID values, ranging from the lowest (for ResNet152) to the highest (for AlexNet, Fig. 4). These differences may appear small compared to the much larger size of the embedding space in the last hidden layer, where the ED was orders of magnitude larger than the ID. However, the ID in the last hidden layer on the training set was indeed a strong predictor of the performance of the network on the test set, as measured by the top-5 score (Fig. 4, Pearson correlation).

A tight correlation was found not only across the full set of networks, but also within each class of architectures, when such a comparison was possible – i.e., within the class of VGG networks with and without batch normalization and within the ResNets (see inset in Fig. 4 for the latter).

Overall, this analysis suggests that the ID in the last hidden layer can be used as a proxy for the generalization ability of a network. Importantly, this proxy can be measured without estimating the performance on an external validation set.

Figure 4: ID of the last hidden layer predicts performance. The ID of data representations (training set) predicts the top-5 score performance on the test set. Inset: detail for the ResNet class.

3.3 Data representations lie on curved manifolds

The strength of the TwoNN method lies in its ability to infer the ID of data representations, even if they lie on curved manifolds. This raises the question of whether our observations (low IDs, hunchback shapes, correlation with test-error) reflect the fact that data points live on low-dimensional, yet highly curved manifolds, or, simply, in low-dimensional, but largely flat (linear) subspaces.

To test this, we performed linear dimensionality reduction (principal component analysis, PCA) on the normalized covariance matrix (i.e., the matrix of correlation coefficients – using the raw covariance gave qualitatively similar results) for each layer and network. We did not find a clear gap in the eigenvalue spectrum (Fig. 5A). This result is qualitatively consistent with those obtained for stimulus representations in the primary visual cortex [19].

The absence of a gap in the spectrum, with the magnitude of the eigenvalues smoothly decreasing as a function of their rank, is, by itself, an indication that the data manifolds are not linear. Nevertheless, we defined an ‘ad-hoc’ estimate of dimensionality by computing the number of principal components that must be included to describe a fixed, large fraction of the variance in the data. In what follows, we call this number PC-ID. We found the PC-ID to be about one or two orders of magnitude larger than the value of the ID computed with TwoNN; for example, this was the case in the last hidden layer of VGG-16 (compare the solid red and solid black lines in Fig. 5C).
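A sketch of how such a PC-ID can be computed from a layer's representation is given below; the 90% variance threshold used here is an illustrative default, not necessarily the exact threshold used for the figures.

```python
import numpy as np

def pc_id(X, variance_fraction=0.90):
    """Number of principal components of the correlation matrix needed to
    account for `variance_fraction` of the variance of X (shape (N, D))."""
    Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # standardize coordinates
    # squared singular values of the standardized data are proportional to the
    # eigenvalues of the correlation matrix, so the variance ratios are unchanged
    eigvals = np.linalg.svd(Xz, compute_uv=False) ** 2
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, variance_fraction) + 1)
```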

The discrepancy between the ID estimated with TwoNN and with PCA points to the existence of strong non-linearities in the correlations between the data, which are not captured by the covariance matrix. To verify that this was indeed the case (and, e.g., not a consequence of estimation bias), we used TwoNN to compare the ID of the last hidden layer of VGG-16 with the ID of a synthetic Gaussian dataset with the same second-order moments. The ID of the original dataset was low and stable as a function of the size of the data sample used to estimate it (Fig. 5B, black curve; similar subsampling analysis as previously shown in Fig. 2A-B). In contrast, the ID of the synthetic dataset was two orders of magnitude larger, and grew with the sample size (Fig. 5B, red curve), as expected in the case of an ill-defined estimator [13].
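This Gaussian control can be sketched as follows: a surrogate dataset is drawn with the same mean and covariance as the original representation, so that any difference in the estimated ID must come from structure beyond second-order correlations (the function name and the use of NumPy are our choices).

```python
import numpy as np

def gaussian_surrogate(X, seed=0):
    """Synthetic dataset matching the mean and covariance (second-order moments)
    of the representation X, with all higher-order structure destroyed."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=X.shape[0])

# Comparison along the lines of Fig. 5B (using the hypothetical twonn_id sketch):
# twonn_id(X) stays low and stable under subsampling, whereas
# twonn_id(gaussian_surrogate(X)) is much larger and grows with the sample size.
```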

We also computed the PC-ID of the object manifolds across the layers of VGG-16 with randomly initialized weights, and we found that its profile was qualitatively the same as in trained networks (compare the solid and dashed red curves in Fig. 5C). By contrast, when the same comparison was performed on the ID (as computed using TwoNN), the trends obtained with random weights (dashed black curve) and after training the network (solid black curve) were very different. While the latter showed the hunchback profile (same as in Fig. 3), the former was remarkably flat. This behaviour can be explained by observing that the ID of the input is very low (see section 3.4 for a discussion of this point). For random weights, each layer effectively performs an orthogonal transformation, thus preserving this low ID across layers.

Thus, the increase followed by a decrease of the ID (as a function of the network’s depth, Figs 2A, 3A-B) is a genuine result of training and does not merely reflect the initial expansion of the ED (Fig. 5C, blue curve) from the input to the first hidden layers.

Figure 5: Evidence that data representations lie on curved manifolds. A) Variance spectra of the last hidden layer do not show a clear gap. B) ID in the last hidden layer of VGG-16 (black), compared with the ID of a synthetic Gaussian dataset with the same size and second-order correlation structure (red). C) The ID and the PC-ID along the layers of VGG-16 for a trained network and for an untrained, randomly initialized network. The ED, rescaled to reach its maximum at 400, is shown in blue.

3.4 The initial increase in intrinsic dimension can arise from irrelevant features

Figure 6: A) The addition of a luminance gradient across the images of the MNIST dataset results in a stretching of the image manifold along a straight line in the input space of the pixel representation. B) Change of the ID along all the layers of the MNIST network, as obtained in three different experiments: 1) with the original MNIST dataset (black curve), 2) with the luminance-perturbed MNIST dataset (blue curve), and 3) with the shuffled-label MNIST dataset, in which the labels of the MNIST images were randomly shuffled (red curve).

We generally found the ID to increase in the initial layers. However, this was not observed for a small network trained on the MNIST dataset (Fig. 6B, black curve) and was also less pronounced for AlexNet (Fig. 3A, red curve). A mechanism underlying the initial ID rise could be the fact that the input is dominated by features that are irrelevant for predicting the output, but highly correlated with each other.

To validate this hypothesis, we generated a modified MNIST dataset (referred to below as the luminance-perturbed MNIST) by adding a luminance perturbation that was constant for all pixels within an image, but random across the various images (Fig. 6A). Given an image with pixels x_j (j = 1, ..., p, where p is the number of pixels), we added to every pixel the same random offset, given by a positive scale parameter multiplied by an i.i.d. uniformly distributed random variable drawn once per image. This perturbation has the effect of stretching the dataset along a specific direction in the input space (the all-ones vector), thus reducing the ID of the data manifold in the input layer; indeed, the ID of the input representation dropped well below its original value. The network trained on the luminance-perturbed MNIST was still able to generalize. However, the variation of the ID (blue curve in Fig. 6B) now showed a hunchback shape reminiscent of that already observed in Figs 2A and 3A-B for large architectures.
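A minimal sketch of such a luminance perturbation, assuming images flattened to pixel vectors; the scale parameter and the uniform range are illustrative values rather than the exact settings used in the experiments.

```python
import numpy as np

def add_luminance_shift(images, alpha=1.0, seed=0):
    """Add the same random offset to every pixel of each image.

    images: array of shape (N, p) with one flattened image per row.
    alpha: positive scale of the perturbation (illustrative value).
    """
    rng = np.random.default_rng(seed)
    xi = rng.uniform(0.0, 1.0, size=(images.shape[0], 1))   # one draw per image
    # adding alpha * xi to all pixels stretches the dataset along the all-ones
    # direction of the input space, lowering the ID of the input representation
    return images + alpha * xi
```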

This suggests that the growth of the ID in the first hidden layers of a deep network is determined by the presence in the input data of low-level features that carry no information about the correct labeling – for instance, in the case of images, gradients of luminance or contrast. One can speculate that, in a trained deep network, the first layers prune the irrelevant features, formatting the representation for the more advanced processing carried out by the last layers [20]. The initial increase of the dimensionality of the data manifold could be the signature of such pruning. This notion is consistent with recent evidence gathered in the field of visual neuroscience, where the pruning of low-level confounding features, such as luminance, has been demonstrated along the progression of visual cortical areas that, in the rat brain, are thought to support shape processing and object recognition [21].

3.5 No characteristic ID-profile for a network trained on random labels

In untrained networks the ID profile is largely flat (Fig. 5C). Are there other circumstances in which the ID profile deviates from the typical hunchback shape of Figs 2A and 3A-B, with IDs that do not decrease progressively towards the output? It turns out that this is the case when generalization is impossible by construction. We randomly shuffled the labels of MNIST (we refer to the shuffled data as the shuffled-label MNIST). It is well known [1] that deep networks can perfectly fit the training set on randomly labelled data, while necessarily achieving chance-level performance on the test set. We trained the same network as in section 3.4 on the shuffled-label MNIST, achieving a training error of zero. However, we found that the network had an ID profile which did not decrease monotonically (red curve in Fig. 6B) – in contrast to the same network trained on the original dataset. Instead, the ID grew considerably in the second half of the network, almost saturating the upper bound set by the ED in the output layer.

This suggests that the reduction of the dimensionality of the data manifolds corresponds to the process of learning on a generalizable data set. In addition, it indicates that a network trained on inconsistent data can be recognized without estimating its performance on a test set, but by simply looking at whether the ID increases substantially across its final layers.

4 Conclusions and Discussion

Convolutional neural networks, as well as their biological counterparts, such as the visual system of primates [22] and other species [21, 23], transform the input images across a progression of processing stages. Theories in the field of visual neuroscience postulate that such re-formatting gradually untangles and flattens the manifolds produced by the different images within the representational space defined by the activity of all the neurons (or units) in a layer [22, 24, 25]. This suggests that the dimensionality of the object manifolds may progressively decrease along the layers of a deep network, and that such a decrease may be at the root of the high classification accuracy achieved by deep networks. Our study is the first to systematically investigate how this happens in large, state-of-the-art CNNs used for image classification.

Figure 7: A) Input layer: the intrinsic dimensionality of the data can assume low values due to the presence of irrelevant features uncorrelated with the ground truth. B) The first hidden layers pre-process the data, raising its intrinsic dimension. C, D) The representation is squeezed onto manifolds of progressively lower intrinsic dimension; these manifolds are typically not hyperplanes. In the last hidden layer (D), the ID shows a remarkable correlation with the performance of trained networks. E) Output layer.

We find (see Fig. 7 for a visual summary) that the ID in the initial layer is low, as a consequence of the irrelevant correlations in natural images [26]. Early layers of DNNs appear to get rid of these correlations, thus increasing the ID of the object manifolds along the initial layers of a deep network (Figs 2A and 3A,B). Such an initial dimensionality expansion is also thought to be performed in the visual system, and is consistent with a recent characterization of the dimensionality of representations in the primary visual cortex [19]. In the neural network trained on the preprocessed, standardized MNIST dataset, the initial growth only emerged after manipulating the images by introducing luminance gradients.

After this initial expansion, the representation is squeezed into manifolds of progressively lower ID (Figs 2, 3A,B; graphically illustrated in Fig. 7C,D). This phenomenon has already been observed by [27] and [6] on simplified datasets and architectures, and by [8] in the final, fully connected layers of AlexNet.

We here demonstrate that this progressive reduction of the dimension of data manifolds is a general behavior and a key signature of every CNN we tested – both small toy models (Fig. 6B) and large state-of-the-art networks (Fig. 3A,B). We identified an empirical link between the ID of the final layers and classification performance (Fig. 4), suggesting that the ability of a network to compress representations is a predictor of its ability to generalize. We find that ID values are lower than those identified using PCA or on ‘linearized’ data, which is an indication that the data lie on curved manifolds. In addition, ID measures from PCA did not qualitatively distinguish between trained and randomly initialized networks (Fig. 5C). This conclusion is at odds with the unfolding of data manifolds reported by [28] across the layers of a small network tested with simple datasets. It also suggests a slight twist on theories about transformations in the visual system [22, 24] – it indicates that flattening of data manifolds may not be a general computational goal that deep networks strive to achieve: progressive reduction of the ID, rather than gradual flattening, seems to be the key to achieving linearly separable representations.

Achieving a theoretical understanding of how deep neural networks can successfully solve difficult classification tasks has proven to be very challenging. Here, we took an empirical approach to characterize the geometrical structure of their representations. Our results are broadly consistent with recent theoretical studies linking the classification capacity of data manifolds by perceptrons to their geometrical properties [29, 30]. Our findings also resonate with the compression of the information about the input data during the final phase of training of deep networks [31], which is progressively larger as a function of the layer’s depth, thus displaying a trend that is reminiscent of the one observed for the ID in our study. More generally, we hope that data-driven, empirical approaches to studying deep neural networks will provide intuitions and constraints, which will ultimately inspire and enable the development of theoretical explanations of their computational capabilities.

Acknowledgments

We warmly thank Eis Annavini for providing the custom dataset described in A.1.1; Artur Speiser and Jan-Matthis Lückmann for comments on the manuscript.

We warmly thank Elena Facco for her numerous methodological clarifications on intrinsic dimension estimations and her crucial help in the early phases of this work.

We also thank Fabio Anselmi, Luca Bortolussi, SueYeon Chung, Jim DiCarlo, Florent Krzakala, Matteo Marsili, Naftali Tishby, Lenka Zdeborova and Riccardo Zecchina for useful discussions and suggestions.

This work was supported by a European Research Council (ERC) Consolidator Grant, project number 616803-LEARN2SEE (D.Z.). J.H.M. is funded by the German Research Foundation (DFG) through SFB 1233 (276693517), SFB 1089 and SPP 2041, the German Federal Ministry of Education and Research (BMBF, project ‘ADMIMEM’, FKZ 01IS18052 A-D), and the Human Frontier Science Program (RGY0076/2018).

References

Appendix A Appendix

A.1 Details of numerical experiments

All our experiments were performed in PyTorch [32] (version 1.0) on a Linux workstation with 64GB of RAM and an NVIDIA GeForce GTX 1080 Ti graphics card. The code to run all the experiments is available at this link. The data is also available at this link.

A.1.1 Datasets

Custom dataset

A dataset of 1440 images developed for a neurophysiological study [17]. The dataset consisted of 40 three-dimensional (3D), computer graphics models of both natural and man-made objects, each rendered in 36 different views, obtained by combining in-plane and in-depth rotations of the 3D models with horizontal translations and size variations. As a result, the image set encompassed a spectrum of object identities, poses and low-level features (e.g., luminance, contrast, position, size, aspect ratio, etc.), but without reaching the size, complexity and variety of shapes and identity-preserving transformations that are typical of naturalistic image sets, such as ImageNet.

A.1.2 Architectures

We describe the architectures used in order of appearance in the main text.

VGG-16-R

We removed the last hidden layer of a VGG-16 network [16] pre-trained on ImageNet [9] and substituted it with a new, randomly initialized layer in order to fine-tune the network on the custom dataset described in A.1.1. More specifically, we used 30 images for each category as training set and tested on the remaining 6 images for each category. We called this network VGG-16-R (where R stands for restricted, with reference to this small dataset). For the fine-tuning, we used SGD with momentum, with separate learning rates for the fifth pooling layer (pool5) and for the classifier stack (i.e., the sequence of fully connected layers after the last pooling layer and before the output). The other layers were kept frozen. The network was fine-tuned for 15 epochs and then evaluated on the test set.

Standard architectures pre-trained on ImageNet

We instantiated fourteen pre-trained networks that are representative of the state-of-the-art models used in visual object recognition and image understanding: AlexNet [9], eight models belonging to the VGG class (11, 13, 16 and 19, with and without batch normalization) [16], and five models belonging to the ResNet class (18, 34, 50, 101, 152) [15]. All these models are available for download in PyTorch [32] at torchvision/models.html.

Small convolutional network for the experiments on the MNIST dataset

We trained a small convolutional network on the MNIST dataset [33]. The sequence of layers is: a convolutional layer with 1 input channel, 32 output channels and a kernel size of 3; a max pooling layer of kernel size 2; a convolutional layer with 32 input channels, 64 output channels and a kernel size of 3; a max pooling layer of kernel size 2; a fully connected layer with 1600 inputs and 128 outputs; a fully connected layer with 128 input and 10 output units; a softmax. We used a ReLU non-linearity after each convolutional, pooling and fully connected layer. The stride is always set to 1. The network was trained with a small learning rate and zero momentum on the original dataset, and with momentum on the luminance-perturbed and shuffled-label variants of MNIST; the number of training epochs differed across the three experiments.
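A sketch of this architecture in PyTorch, under the assumptions stated above (stride 1, no padding): with 28x28 inputs the spatial maps shrink as 26 -> 13 -> 11 -> 5, so the first fully connected layer receives 64 * 5 * 5 = 1600 inputs. The class name is ours and training details are omitted.

```python
import torch
import torch.nn as nn

class SmallMNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),    # 28 -> 26
            nn.MaxPool2d(kernel_size=2), nn.ReLU(),        # 26 -> 13
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),   # 13 -> 11
            nn.MaxPool2d(kernel_size=2), nn.ReLU(),        # 11 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 5 * 5, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(start_dim=1)
        return torch.softmax(self.classifier(x), dim=1)
```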

Checkpoints

In each experiment we defined architecture-specific checkpoints from which to extract and analyze the representations. The only exception was the network used for MNIST, for which we extracted the representations and performed the analysis on all the layers, in the experiments described in sections 3.4 and 3.5. As a general rule, we always extracted representations at pooling layers after a convolution or a block of consecutive convolutions, and at fully connected layers. In the experiments with ResNets, we extracted the representations after each ResNet block [15] and at the average pooling before the output. Depending on the computational demands of our experiments, we extracted and analyzed data samples of different sizes; we describe this in the following section A.1.3.

A.1.3 Estimating intrinsic dimension

Experiments with the custom dataset

In this experiment (Fig. 2A,B), we fine-tuned the last layers of a VGG-16 network pre-trained on ImageNet, using 1200 of the 1440 images of the dataset in [17] (30 images for each category in the training set, the remaining 6 images for each category as test set). The whole dataset was used to estimate the ID of the representations across the layers of the network. The values of the ID reported in our analysis are the averages resulting from randomly sampling, 20 times, a fixed fraction of the activations at each checkpoint layer. The error bars are the standard deviations across these estimates. In the decimation analysis (Fig. 2B) we proceeded as described in [13]. After a random shuffling, we split the dataset in a k-fold way, with k ranging from 20 to 1. The k-fold splits yielded, at each layer, k ID estimates, each computed on roughly 1/k of the data. The ID estimates were then averaged and their standard deviation was computed.

Experiments with ImageNet

In the experiments with the pre-trained state-of-the-art networks, we performed two kinds of analysis. In the first one (Fig. 3A-B), we randomly sampled 500 images from each of the 7 most populated ImageNet categories: “koalas”, “shih-tzu”, “rhodesian”, “yorkshire”, “vizsla”, “setter”, “butterfly”. These 7 sets were kept fixed in all the subsequent analyses. We then estimated the ID of the resulting object manifolds across the layers of the networks, independently for each category. For each category set, we randomly subsampled 90% of its representations (450 data points) at each checkpoint layer, 5 times, and computed their IDs. We then averaged these 5 values, obtaining a category-specific estimate of the ID at each layer. We finally averaged the IDs obtained for the 7 categories and computed their standard deviations. In the second analysis (Fig. 4A-B), we randomly drew 5 samples of approximately 2,000 images each from the ImageNet training set. We computed their representations in the last hidden layer; then, within each sample, we randomly subsampled 90% of the data (1,800 data points) 20 times and computed their IDs. We then pooled together all these 100 ID estimates and computed their average and standard deviation: these are, respectively, our final ID estimate and its error. Notice that, in this case, the ID estimates refer to random mixtures of all possible object categories of ImageNet.

Experiments with MNIST

In these experiments (Fig. 6B, black line), we randomly sampled a set of 2000 images from the test set; this set was kept fixed in the following. We extracted the activations at each layer of the trained network described in A.1.2. For the input layer, the representations are the original images. For each layer, we randomly subsampled 90% of the data (1800 data points) 50 times and computed their IDs. We then averaged these ID values and computed their standard deviation: these are, respectively, our final ID estimate and its error.