A Neural Scaling Law from the Dimension of the Data Manifold

04/22/2020
by   Utkarsh Sharma, et al.
Johns Hopkins University

When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law L ∝ N^-α in the number of network parameters N. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension d. This simple theory predicts that the scaling exponents α≈ 4/d for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of d and α by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.

1 Introduction

Figure 1: This figure shows the relationship between the measured intrinsic dimension (ID) of the data manifold and 4/α, where α is the model-size scaling exponent. We include data from fully-connected teacher/student experiments, simple CNNs, and GPT-type radford2018improving ; radford2019language language models (represented as a lower bound due to large uncertainties with large IDs).

Neural Network based Machine Learning has made enormous progress in a wide variety of domains. Scale has been a key ingredient in this success: large amounts of computation, large datasets, and large models with millions or billions of parameters.

Not only is scale beneficial to performance, but the benefits from scale can be predicted precisely. Recent works 1712.00409 ; hestness2019beyond ; rosenfeld2019constructive ; kaplan2020scaling studying a variety of data modalities and model architectures all find the same scaling relation in the underfitting regime. In particular, the dependence of the loss L on the number of model parameters N has the following properties, and each suggests a corresponding question:

  • As the number of model parameters N is increased, the cross-entropy loss of well-trained and well-tuned models scales with N as a power-law

    L(N) ∝ N^-α    (1.1)

    with observed values such as α ≈ 0.076 for language modeling kaplan2020scaling , and much larger exponents observed for image classification rosenfeld2019constructive . Why do we encounter this simple functional form, and what determines the value of the exponent α?

  • Scaling holds very accurately across a wide range of N, sometimes spanning many orders of magnitude 1712.00409 ; hestness2019beyond ; kaplan2020scaling . Why does scaling persist over a large range of model sizes, and what determines the model size N_max where it eventually breaks down?

  • Empirically, the scaling exponent α may not depend greatly on model architecture. For example, LSTMs and Transformers scale similarly over a large range of N kaplan2020scaling , with losses differing only by an overall, N-independent factor. Why would scaling exponents be roughly independent of model architecture?

We will argue that a simple conjectural theory can address these questions while making a number of testable predictions.

1.1 Main Ideas

The key idea is that neural models map the data to a manifold with intrinsic dimension d, and then use added capacity to carve up this manifold into ever smaller sub-regions. If the underlying data varies continuously on the manifold, then the size of these sub-regions (rather than their number) determines the model’s loss. To shrink the size of the sub-regions by a factor of 2 requires increasing the parameter count by a factor of 2^d, and so the inverse of the scaling exponent 1/α will be proportional to the intrinsic dimension d of the data manifold. We develop these ideas in detail in section 2.
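Schematically, the counting argument developed in section 2 can be written in a few lines (here s denotes the linear size of a sub-region of the data manifold):

```latex
% Counting argument: resolution s, manifold dimension d, parameter count N.
\begin{align*}
  N &\sim (1/s)^{d}   &&\text{sub-regions needed to resolve the manifold at scale } s,\\
  L &\sim s^{4}       &&\text{loss of a piecewise-linear fit to a smooth function},\\
  \Rightarrow\; L &\sim N^{-4/d} &&\text{so that } \alpha = 4/d .
\end{align*}
```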

Figure 2: This figure estimates the behavior of N_max, the maximum network size where we find power-law scaling, as a function of the intrinsic dimension in student/teacher experiments. We determine N_max as the model size where the loss reaches an arbitrarily chosen small value, as a stand-in for the entropy of real data. We discuss this procedure in section 3.1.

The scaling exponent α can be measured by training a succession of models of varying size. We measure the intrinsic dimension within the final layer111It was shown in ansuini2019intrinsic that the final hidden layer activations have the smallest intrinsic dimension in image classifiers. Our findings are largely consistent with this. activations of trained networks, using the distances among nearest neighbor activation vectors levina2005maximum ; TwoNN .

We test the theory in a student/teacher framework, which makes it possible to scan over a large range of d and α and to test more idiosyncratic features of the theory (see figure 4). We also perform tests using CNNs for image classification, and by measuring the intrinsic dimension of GPT-type models radford2018improving ; radford2019language , where the scaling exponents have already been documented kaplan2020scaling .

1.2 Contributions: Predictions and Results

In what follows we list the concrete predictions made by our theory, and their status based on our results222Code for our experiments will be available at: https://github.com/U-Sharma/NeuralScaleID and information in the literature. Throughout we use L to denote the loss, N to denote the number of parameters in a neural network (often referred to informally as ‘model size’), α as the power-law scaling exponent, and d as the intrinsic dimension of the data manifold.

Figure 3: We show how ID measurements vary among different student network sizes trained from the same teacher (left), and for CNNs on CIFAR10 (right). We display the test loss for reference. The ID does not depend significantly on N, though it increases slightly among the various model sizes tested as N increases.
  1. Prediction: In the range of N where the loss scales as L(N) ∝ N^-α, we predict 1/α ∝ d, where d is the intrinsic dimension of the data manifold for the dataset and task in question. If the network is composed of ReLU non-linearities and the loss is mean squared error or cross-entropy (or KL divergence), we predict

    α ≥ 4/d    (1.2)

    with equality expected in the generic case.

    Results: See figure 1 for the summary combining all datasets. We find a variety of evidence supporting this prediction, and the factor of ‘4’ fits quite well. We show in figure 8 that this factor can be modified if we use other loss functions. For language modeling with GPT radford2018improving ; radford2019language , we know α ≈ 0.076 while we measure a much larger intrinsic dimension (figure 10), in accord with the inequality, but quite far from equality.

  2. Prediction: The maximum network size N_max where we obtain power-law scaling grows with d via log(N_max) ∝ d. Larger d should correspond with much larger N_max.

    Results: We have confirmed the approximate relation log(N_max) ∝ d (see figure 2) with teacher/student experiments by identifying N_max as the model size where L(N) reaches a fixed value.

  3. Prediction: The exponent α will not depend significantly on model architecture except through the intrinsic dimension d. Since larger α and smaller d lead to improved performance with scale, the best architectures will tend to have the smallest d.

    Results: In ansuini2019intrinsic it was discovered empirically that better performing image classifiers have smaller d, and kaplan2020scaling showed that LSTMs and Transformers have very similar exponents. We leave the measurement of both α and d across distinct architectures to future work.

  4. Prediction: Models of different sizes N for which the loss scales as a power-law in N all map the data to a manifold with the same intrinsic dimension d.

    Results: We verify this for teacher/student experiments in figure 3 and for CIFAR10 in figure 9. This prediction holds to about 10% for these models.

  5. Prediction: If the data manifold is a product M = M_1 × M_2 × ⋯ and the loss decomposes as a sum over the factors, then we should replace the dimension of M with the maximum dimension of the M_i when estimating α, as the network can behave as an ensemble, modeling each M_i independently (see the right of figure 4).

    Results: We confirm this prediction in section 3.2.1, see figure 7.

2 A Simple Theory for Scaling in the Underfitting Regime

In this section we explain our theory, beginning with a toy model in section 2.1. Then in section 2.2 we argue333one might say conjecture; for a more sophisticated perspective in a simpler context see bickel2007local that the toy model can be applied to realistic neural networks with only a few small modifications. In section 2.3 we explain how we measure the dimension of the data manifold, a necessary step in validating the theory.

2.1 A Toy Model

Consider one of the simplest scenarios for multidimensional regression. We are given a Lipschitz function f: [0,1]^d → R, and we would like to approximate it as a piecewise constant function c(x), by cutting [0,1]^d into smaller hypercubes. If these hypercubes have a side length s, then we will have

N = (1/s)^d    (2.1)

cubes, and so our approximation will depend on the N constant values that c(x) takes within each hypercube. If the loss is mean-squared error (MSE), then it will be bounded by

L ≲ λ² s²    (2.2)

where λ is the Lipschitz bound |f(x) − f(y)| ≤ λ|x − y|, and we have ignored overall numerical factors. Translating the s-dependence into N, this means that L ∝ N^(-2/d) up to a constant factor.

If the model is piecewise linear instead of piecewise constant and f is smooth with bounded derivatives, then the deviation |f(x) − c(x)| ∝ s², and so the loss will scale444A straightforward generalization suggests that if c(x) is composed of piecewise degree-n polynomials, and we use a loss |f − c|^k, then

L ∝ N^(-(n+1)k/d)    (2.3)

in the infinite data limit. But if n is large then the approximation within each hypercube will utilize many parameters. We test the k-dependence of this prediction in figure 8. as L ∝ s⁴. We would predict

L ∝ N^(-4/d)    (2.4)

This will be important later, since networks with ReLU activations produce piecewise linear functions.
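As a concrete illustration of these exponents, here is a small numerical check in one dimension (d = 1); the target function sin(2πx) and the grid sizes below are arbitrary choices, and the expected slopes on a log-log plot are −2/d = −2 for the piecewise-constant fit and −4/d = −4 for the piecewise-linear fit.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)          # a smooth target on [0, 1]
x_test = np.linspace(0, 1, 200001)

for N in [8, 16, 32, 64, 128]:
    edges = np.linspace(0, 1, N + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Piecewise constant: use the value of f at the center of each bin.
    bins = np.clip(np.searchsorted(edges, x_test, side="right") - 1, 0, N - 1)
    mse_const = np.mean((f(x_test) - f(centers)[bins]) ** 2)

    # Piecewise linear: interpolate f between the bin edges.
    mse_lin = np.mean((f(x_test) - np.interp(x_test, edges, f(edges))) ** 2)

    print(N, mse_const, mse_lin)
# mse_const falls like N^-2 and mse_lin like N^-4, matching (2.2) and (2.4) with d = 1.
```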

Finally, consider the case where the outputs encode a smooth probability distribution over discrete possibilities, and we replace the MSE loss with the KL divergence. If the model provides a piecewise linear approximation to the logits, then we again find that L ∝ s⁴. So the KL and MSE losses will scale with the same exponent in N at a given value of d. We demonstrate this in appendix A.5; it is a simple consequence of the fact that the expansion of the KL divergence in the deviation of the predicted distribution begins at second order. Note that if we use a cross-entropy instead of the KL divergence, the loss will scale in the same way towards a fixed constant value, the entropy of the true distribution.

Figure 4: Left: This shows the setup of a teacher network, emphasizing how we can control the data manifold dimension via the number of input features. Right: When the data manifold is a product M = M_1 × M_2 and the teacher decomposes as a sum over the factors, then student networks can learn it by combining sub-networks and behaving, in effect, like an ensemble. Then we predict α = 4/max_i(d_i), i.e. the exponent is set by the maximum dimension among the components.

2.2 A Conjectural Theory for Neural Networks

Neural Networks perform well on data with thousands or even millions of dimensions. It is widely believed that this is possible because neural networks map the data into a much lower-dimensional ‘data manifold’, preserving and focusing on the features that are relevant for the task.

We emphasize that the data manifold is a feature of both the dataset and the task or loss function that has been optimized. Classifiers need only attend to features relevant for classification. Similarly, in the case of autoregressive models the data manifold would consist only of the features necessary to predict the next token in a sequence. So the data manifold for such a model (as we are defining it) may have many fewer dimensions than the space of full sequences, such as complete images or text samples. Properties of the data manifold may also depend on the model that is learning it, such as its architecture and activation functions.

We can explain the observed scaling relations for NNs by applying our toy theory while replacing the ambient dimension of the dataset with the intrinsic dimension of the data manifold. If we perform regression with a neural network with ReLU activations and a mean-squared error or KL divergence loss, the analysis of section 2.1 implies555Depending on the network architecture and parameter values, the network could represent a piecewise linear function with many more than N piecewise components montufar2014number . However, these components cannot be independently configured to optimize the loss. Since there are only N independent degrees of freedom available, we expect N, rather than the number of linear regions, to determine the effective capacity.

α ≈ 4/d    (2.5)

In the case where the function depends in a generic way on d independent variables, we will confirm this prediction empirically in section 3.1 (see figure 1). We also explore some special data manifolds and other loss functions in section 3.2.

This theory also largely explains why the scaling relation holds over such a large range of N. To double the resolution with which the model differentiates different points on the data manifold, we need 2^d times more parameters. It’s reasonable to expect that model performance improves smoothly when we change the resolution by an order-one factor. But this seemingly natural assumption implies that when d is large, we will see smooth scaling with N over many orders of magnitude. We would predict that the range of N over which smooth scaling holds satisfies log(N_max/N_min) ∝ d. This also strongly suggests log(N_max) ∝ d, where N_max is the largest network size exhibiting power-law scaling, as we do not expect N_min, the beginning of the power-law region, to increase with d. We discuss some reasons why power-law scaling may cease in section 2.2.2.

Finally, the theory suggests an interpretation for the fact that different NN architectures tend to have similar scaling exponents when applied to the same dataset. It would appear that a given dataset and task are associated with a data manifold of fixed dimension, and improvements in architecture do not greatly alter its properties. Network architectures that can achieve smaller d on the same dataset can be scaled up to achieve larger gains, and so we would expect smaller d to correlate with better performance.

The interpretation of d as the dimension of the data manifold has a close connection with the notion of fractal dimensions. Typically fractal dimensions measure how the number of components needed to approximate a fractal scales as the components shrink. But we can reinterpret this definition by asking how many components are needed to obtain a certain quality of approximation to the underlying fractal. When we use the loss itself to measure the quality of the approximation, then the inverse exponent 1/α is proportional to the corresponding fractal dimension.

Before moving on, let us discuss a few subtleties.

2.2.1 A Bound, Not an Equality

The classic analysis we reviewed in section 2.1 provides an upper bound on the loss for function approximation (regression in the infinite data limit) using piecewise constant or piecewise linear approximators. This bound becomes an estimate when the function being approximated is a generic Lipschitz function in d dimensions. However, if the function has a simple, non-generic structure then the loss may decrease much more quickly with increasing model size. So we should expect that

α ≥ 4/d    (2.6)

In special cases where the true underlying function or distribution is non-generically simple, we may find that this inequality is far from saturation.

As a concrete example, consider a data manifold M = M_1 × M_2 × ⋯ × M_n with loss L = L_1 + L_2 + ⋯ + L_n, as suggested on the right of figure 4. In this case a fully connected neural network may learn666If the total loss does not decompose as a sum, it is less clear that the network can learn an effective decomposition, but it may still be possible. this decomposition, computing each L_i using a separate path through the network, and only combining these paths in the last layer. This would result in a scaling exponent determined by the maximum of the dimensions d_i of the manifolds M_i. We test α for product data manifolds in section 3.2.1 and verify these predictions.

We may end up finding α > 4/d for other reasons. We will attempt to measure d among neural activations, but there may not be any single layer where the model compresses all of the data onto the data manifold. For example, one might imagine a scenario where different components of the manifold are processed or compressed in different layers of the network. And networks with non-ReLU activations (e.g. Transformers and ResNets) may mix and superimpose different data manifolds upon each other, obscuring the manifold structure and causing the measured dimension to exceed the true dimension.

2.2.2 Why Does Power-Law Scaling Break Down?

If the dataset size is finite, then power-law scaling with model size will cease when we begin to overfit the data. Overfitting dominates performance on many real-world datasets, obscuring potentially clean scalings with N. We encounter it with CIFAR10 in figure 9 and on other datasets in appendix A.4.

Even in the infinite data limit, if the data contains any entropy or noise then the power-law scaling must eventually end with the loss reaching a final plateau. Scaling could also end for other, more interesting reasons. For example, perhaps beyond a certain point the loss can only improve by exploring a higher dimensional data manifold. This is possible if the data manifold has a pancake-like structure, with a small width that can only be dissected by models with very large capacity. We will explore the simplest possibility, where the data has entropy, with mock teacher/student experiments; see figure 2 for the result.

Figure 5: This figure shows L(N) along with power-law fits for teacher/student experiments. The students learn from a randomly initialized 2-layer teacher with varying numbers of features and use a cross-entropy loss. The students have 2, 3, or 4 layers, but in many cases the 2-layer students perform best and determine the model-size scaling. The measured 4/α increases linearly with the number of features, as shown in figure 6.

2.3 Measuring the Intrinsic Dimension of the Data Manifold

In section 2.2 we extended the toy model in order to make a variety of predictions relating the scaling of the loss with model size to d, the intrinsic dimension (ID) of the data manifold. In some of our experiments, we will control the data manifold by constructing generic functions of a chosen number of inputs and then measuring α. But the theory would be tautological for real-world data if we could not independently measure the data manifold’s ID.

We will define d by measuring the ID of neural activations as the network processes data from the distribution on which it was trained. There is an extensive literature on intrinsic dimension estimation (for a review see camastra2016intrinsic ). In most cases we use the simple two-nearest neighbors (TwoNN) method TwoNN , though we also compare to the MLE estimation method levina2005maximum on which TwoNN was based.

To summarize the method, let r_k be the distance from a given datapoint to its k-th nearest neighbor, and define μ = r_2/r_1. Then the cumulative distribution C(μ) takes the form

C(μ) = 1 − μ^(-d)    (2.7)

and so we can measure the intrinsic dimension d by using the relation

d = −log(1 − C(μ)) / log(μ)    (2.8)

Practically speaking, we evaluate μ for every point on the manifold, and then apply linear regression of −log(1 − C(μ)) against log(μ) to measure the slope, which gives d. We measure d using various numbers of neighbors and verify that the different choices give consistent results. We also verify that the MLE method levina2005maximum agrees with the TwoNN method. Fortunately, nearest neighbors can be efficiently identified sklearn_api .
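As an illustration, here is a minimal sketch of the TwoNN estimator as just described; it assumes numpy and scikit-learn are available, and the function name `twonn_id` and the fraction of discarded ratios are our own choices rather than the authors’.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, discard_fraction=0.1):
    """Estimate intrinsic dimension from activations X of shape (n_points, n_features)
    using the two-nearest-neighbor ratio mu = r2 / r1."""
    # Find each point's two nearest neighbors (excluding the point itself).
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    dists, _ = nn.kneighbors(X)          # dists[:, 0] is 0 (the point itself)
    mu = dists[:, 2] / dists[:, 1]       # ratio of 2nd to 1st neighbor distance

    # Empirical cumulative distribution of mu; drop the largest ratios,
    # which are noisy and come from sparse regions of the manifold.
    mu = np.sort(mu)
    n = len(mu)
    keep = int(n * (1 - discard_fraction))
    mu, C = mu[:keep], (np.arange(1, n + 1) / n)[:keep]

    # Equation (2.8): -log(1 - C(mu)) = d * log(mu); fit the slope through the origin.
    x, y = np.log(mu), -np.log(1 - C)
    return float(np.sum(x * y) / np.sum(x * x))

# Example: points on a 5-dimensional manifold embedded in 100 dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5000, 5))
X = Z @ rng.normal(size=(5, 100))
print(twonn_id(X))   # should come out close to 5
```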

The TwoNN method (the case k = 2) has already been applied to neural networks ansuini2019intrinsic . There it was found that the dimension is smallest when measured using the activations of the final hidden layer of the network (immediately before the logits or output, so sometimes we refer to this as ‘prefinal’). We will use these activations to measure the ID and compare to 4/α. For the GPT-type models (and for some others as a test in appendix C) we show ID measurements for every layer.

For convenience we provide a self-contained derivation of these ID measurement algorithms and a minor extension (to k nearest neighbors) in appendix B. We also provide several tests of the methods in appendix C, using both synthetic and neural activation data. We find that the methods are fairly accurate for smaller dimensions, while for larger dimensions they are less reliable, and typically (but not always) underestimate the true dimension. Statistical errors from these methods are often fairly small (particularly from TwoNN), but we expect there may be larger systematic errors, as discussed in the appendices.

3 Experiments and Results

Figure 6: These figures show the correlation between the inverse scaling exponent 4/α and both the measured intrinsic dimension and the number of input features (dimensions) in the teacher network. Both notions of dimension are linearly correlated with 4/α, and the intrinsic dimension scales almost exactly as 4/α, as predicted in section 2.2.

In this section we discuss results from teacher/student experiments and various extensions, and also some tests using image classification and language modeling. We relegate a variety of technical details and a few minor observations to appendix A. We discuss potential errors in the ID measurement, along with several examples, in appendix C.

3.1 Teacher/Student with Random Teachers

We generate functions of a chosen number of input features using a randomly initialized, fully connected ‘teacher’ neural network with a 20-dimensional input space. To restrict to fewer features we simply zero out all other inputs to this single teacher. We refer to the number of non-zero inputs as the number of features, and distinguish it from the intrinsic dimension, which we measure using the activations of trained student networks.

For each number of features, we train fully connected student networks of various widths and depths to imitate the outputs of the teacher. We work in the online setting, generating random inputs from a uniform distribution so the dataset size is effectively infinite. Details of the network topologies, training procedure, fits, errors, and ID measurements are documented in appendix A.2.
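For concreteness, here is a minimal sketch of this kind of experiment in PyTorch; the widths, feature count, learning rate, batch size, and step count are illustrative placeholders rather than the settings used for the paper’s results (those are listed in appendix A.2).

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Fully connected ReLU network, e.g. sizes = [20, 256, 256, 1].
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])   # no ReLU after the final layer

torch.manual_seed(0)
n_features = 6                                # active inputs out of 20
teacher = mlp([20, 400, 400, 1])
for p in teacher.parameters():
    p.requires_grad_(False)

def sample_batch(batch_size):
    x = torch.rand(batch_size, 20)            # uniform random inputs
    x[:, n_features:] = 0.0                   # zero out the unused features
    return x

losses = {}
for width in [8, 16, 32, 64, 128]:            # scan over student sizes
    student = mlp([20, width, width, 1])
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for step in range(20000):
        x = sample_batch(200)
        loss = ((student(x) - teacher(x)) ** 2).mean()   # MSE against the teacher
        opt.zero_grad(); loss.backward(); opt.step()
    n_params = sum(p.numel() for p in student.parameters())
    losses[n_params] = loss.item()            # record the final training loss
    print(width, n_params, loss.item())
# Fitting log(loss) vs log(n_params) over the power-law region then gives alpha.
```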

After training the students, we evaluate the loss L(N) for each number of features. Then we fit

L(N) = c N^(-α)    (3.1)

to measure α for each number of features. The results of this process (with cross-entropy loss) are shown in figure 5.

Next we measure the intrinsic dimension from the activations of the final hidden layer of each trained student. We use a large sample of activation vectors for each ID measurement. In all cases we find that using more nearest neighbors, as discussed in section 2.3, does not change the result significantly. In figure 3 we show the measured ID of the final layer of student networks of various sizes N, along with a plot of the loss L(N). We see that the ID is approximately constant for these networks, though it does slowly grow by a modest percentage from the smallest to the largest student network.

We plot the relationship between 4/α and either the number of features or the measured ID. The results, along with linear fits, are shown in figure 6. For both the cross-entropy and MSE loss functions we find 4/α ≈ ID. The inverse exponent 4/α is also linearly related to the number of input features, though in that case the proportionality constant differs from one.

Figure 7: This figure shows results for α and the measured ID for product data manifolds built from teachers with different numbers of features (left, middle, and right). We see that in all cases 4/α is set by the maximum dimension among the product factor manifolds. The total measured IDs are approximately equal to the sum of the dimensions of the product factors, as expected.

In section 2.2.2 we argued that scaling should end at an N_max that grows exponentially with d. We would like to test this prediction with teacher/student experiments, but in this case the data has no entropy. So instead we will introduce an artificial threshold for the loss, as a fictitious stand-in for the entropy of real data. Then we simply ask at what N the loss reaches this fixed, arbitrary value.

We chose a small loss value as an arbitrary threshold in figure 2. Note that for the teacher networks with fewer features we used the power-law fit for L(N) to estimate N_max, as it was smaller than any network tested. This means we had to extrapolate L(N), so these results are not purely empirical. We also compare N_max and d by defining N_max as the end of the purely empirical power-law scaling region for 2-layer students (which ends due to a failure of optimization or numerical precision issues); these results are relegated to figure 12 in the appendix.

The ID is typically a bit smaller than the number of input features. This may arise from a combination of two factors: the ID measurement may be underestimating the data manifold dimension, and randomly initialized networks may not provide sufficiently generic or non-linear functions of their inputs. We explore the second hypothesis in appendix A.3, where we show that by vetting the teacher networks we can improve agreement between ID and the number of input features. Figure 18 provides some idea of the potential errors in the ID measurements. Since the inputs themselves are drawn from a uniform distribution it is plausible that the ID is somewhat of an underestimate due to boundary effects.

3.2 Product Data Manifolds and Other Loss Functions

3.2.1 Product Data Manifolds

If the data manifold takes the form M = M_1 × M_2 × ⋯ × M_n, with the underlying function on M decomposing as a sum f(x) = f_1(x_1) + ⋯ + f_n(x_n), then we expect that a neural network should be capable of separately modeling each f_i within separate blocks of activations, and then combining them in the final layer to compute the full f. This means that although the ID of M will be measured as the sum of the dimensions of the factors, we should expect

α = 4 / max_i(d_i)    (3.2)

as we discussed briefly in section 2.2.1, and demonstrate diagrammatically on the right of figure 4.

Figure 8: This figure shows the relationship between α and the power k when we use the generalized loss |f − f̂|^k. As expected from section 2.1, we find α ≈ 2k/d. This is a student/teacher experiment with a fixed teacher.

To test this prediction we use a vetted teacher network with 3 real inputs and another vetted teacher taking 6 real inputs. Individually, these had IDs close to their input counts and their exponents satisfied 4/α ≈ ID in each case. These teachers each produce a pair of logits. We then constructed new teacher functions whose logits are sums of the individual teachers’ logits,

z_i(x_1, x_2) = f_i(x_1) + g_i(x_2)    (3.3)

and trained students to imitate these teachers using the cross-entropy loss. We then measured the resulting ID and α for these three product-manifold teachers. For the products involving a repeated factor we used two or three different teachers to make sure the network could not take advantage of the exact repetition of a single teacher.
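A minimal sketch of this product-teacher construction, reusing the imports and the `mlp` helper from the teacher/student sketch in section 3.1; the widths are again placeholders.

```python
# Two independent teachers over disjoint input blocks; each outputs 2 logits.
teacher_a = mlp([3, 400, 400, 2])    # 3-feature factor manifold
teacher_b = mlp([6, 400, 400, 2])    # 6-feature factor manifold

def product_teacher_logits(x):
    # x has shape (batch, 9); the product-manifold teacher's logits are the
    # sum of the factor teachers' logits, as in equation (3.3).
    return teacher_a(x[:, :3]) + teacher_b(x[:, 3:])

def cross_entropy_to_teacher(student_logits, teacher_logits):
    # Students train on the teacher's exact output distribution
    # (no sampling of discrete labels), i.e. a soft cross-entropy objective.
    p = torch.softmax(teacher_logits, dim=-1)
    logq = torch.log_softmax(student_logits, dim=-1)
    return -(p * logq).sum(dim=-1).mean()
```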

As shown in figure 7, the results confirm our predictions. This provides a concrete example where we may find α > 4/d (with d the full manifold dimension) for reasons that the theory precisely anticipates. More importantly, it provides a very detailed test of our theoretical picture relating scaling exponents to properties of the data manifold.

3.2.2 Other Loss Functions

The factor of ‘4’ in the relation α = 4/d is derived from the behavior of the loss function and the expectation that networks with ReLU activations form piecewise linear functions. If we use a loss function such as |f − f̂|^k for regression, from the argument of section 2.1 we would expect

α = 2k/d    (3.4)

where the MSE case corresponds to k = 2. We verify this in figure 8 using a fixed teacher whose intrinsic dimension we measured in the usual student/teacher context.

Figure 9: The left figure shows the test and training loss for various sizes of CNN trained on CIFAR10, while the right figure shows the error rate (1 − accuracy). All results are evaluated at the early stopping step, where the test loss is minimized. We report test loss results in figure 1, but note that the exponents for accuracy are very close to those for loss.

3.3 Image Classification with Simple CNNs

Our goal with these experiments was to study a simple, all-ReLU architecture that could scale down to a small enough size to avoid overfitting CIFAR10 Krizhevsky09learningmultiple . So we used a version of the default tutorial CNN in tensorflow tensorflow2015-whitepaper , which we modified only by scaling the number of channels (i.e. the width). Figure 9 shows the scaling of the test loss with the number of parameters N. Our only regularization was early stopping. The results match the predicted relation between α and the measured ID quite well.
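A sketch of this kind of width-scaled tutorial CNN, written with tf.keras; the base channel counts and the width multiplier n below are illustrative, and the exact layer shapes used in the paper are summarized in table 2 of appendix A.4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def scaled_cnn(n):
    """Tutorial-style CIFAR10 CNN whose channel counts scale with the width n."""
    return tf.keras.Sequential([
        layers.Conv2D(n, (3, 3), activation="relu", input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(2 * n, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(2 * n, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(2 * n, activation="relu"),
        layers.Dense(10),                      # logits for the 10 classes
    ])

for n in [4, 8, 16, 32]:                       # scan over model sizes
    model = scaled_cnn(n)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    print(n, model.count_params())
    # model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=...)
```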

In an ideal test of the theory, we would measure α fully in the underfitting regime, with no distinction between train and test performance. But there is a train/test gap even for the smallest network sizes, so it’s unclear how to model the error in the measurement. In addition to the test loss, we also measured the scaling of the training loss for these models, recording it at the early-stopping step, and found that it also scales similarly. Furthermore, note that on the right of figure 9 we record the error rate (1 − accuracy), and find that it scales very similarly to the loss.

We performed a very similar analysis on the MNIST lecun-mnisthandwrittendigit-2010 , fashion MNIST fmnist , and SVHN netzer2011reading datasets using slightly smaller networks (see section A.4). We plot the results in figure 15, which we have relegated to the appendix, as the power-law trends on these datasets are less clear than on CIFAR10.

Power-law exponents and IDs for CIFAR10 have been measured elsewhere using more powerful architectures, finding both a larger value of α (for the error rate) rosenfeld2019constructive and a smaller ID ansuini2019intrinsic . We cannot make a clean comparison, but given that we find that the exponents for error-rate and loss scaling seem to be similar, these results appear to be consistent with our predictions.

3.4 Language Modeling with GPT-type Models

Figure 10: These figures show the ID estimates for the attention and fully-connected outputs of a 117M parameter GPT-type model. The left figure shows results from the nearest neighbor method, with 2, 3, and 4 neighbors, while the right plot shows results from the MLE method. The results roughly agree for the first layer, but the MLE method gives smaller IDs for later layers, and is likely an under-estimate.

The GPT-type language models display power-law scaling of the loss over at least five orders of magnitude in N, with exponent α ≈ 0.076 kaplan2020scaling . This value of α is much smaller than those observed for many other datasets rosenfeld2019constructive , meaning that it allows us to probe a rather different regime, where we predict the quite large value d ≳ 4/α ≈ 53.

We generated activation vectors from the ‘small’ 117M parameter GPT-2 model using test data drawn from the same distribution as the training data radford2018improving ; radford2019language , and measured the IDs. Decoder-only liu2018generating Transformers OriginalTransformer have a residual structure with blocks including an attention mechanism and a fully-connected component. For each layer of blocks, one can measure the ID from the output of the attention mechanism, the fully-connected layer, or from the output of the residual re-combination.

The activations that contribute to the Transformer’s outputs at any given token-position depend on all activations from earlier in the sequence, except for the case of the final layer (before multiplying by the unembedding matrix). Thus it is only the final-layer activations that can be said to capture the data manifold associated with the model’s prediction for a single token. The mean loss over tokens has scaling exponent α ≈ 0.076, and from figure 21 of kaplan2020scaling we see that α is roughly constant for tokens that occur late in any text sequence. So we use the activations from the last token in each sequence to measure the ID, though the ID does not vary significantly across token positions (see figure 11).
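As an illustration of the extraction step, here is a rough sketch using the Hugging Face transformers library (assuming a recent version), which wraps the released 117M model. This only pulls the final-layer hidden state of the last token; separating the attention and fully-connected outputs per block, as in figure 10, would require hooks into the individual sub-modules.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # the 117M 'small' model
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def last_token_activation(text):
    # Returns the final-layer hidden state of the last token in the sequence.
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**ids)
    # out.hidden_states is a tuple: the embeddings plus one entry per transformer block.
    return out.hidden_states[-1][0, -1, :]               # shape (768,)

# Stack one such vector per text sequence, then feed the resulting matrix to an
# ID estimator such as the twonn_id sketch from section 2.3.
```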

In figure 10 we plot the measured ID for the attention output, the fully connected output, and the combined output of the residual blocks for all layers. For these measurements we used 10,000 activation vectors, each from the last token in a different text sequence (for more details see appendix C.2). We see that unlike the case of image classifiers ansuini2019intrinsic , the ID is roughly constant across layers, with the exception of the first layer, where it is significantly smaller. If instead we measure the ID from the 1024 tokens in a single contiguous passage of text, we find a much smaller ID (figure 11). This strongly suggests that the data manifold has a scale-dependent structure, and may not be well-characterized by a single intrinsic dimension.

It is tempting to observe that the intrinsic dimension of activations from the first attention layer is noticeably smaller, and matches reasonably well with 4/α for these models. One might argue that this bounds the total data manifold dimensionality entering the model through its input tokens. But as discussed above, this reasoning seems untrustworthy as an estimate of the data manifold dimensionality relevant for next-token predictions. So we take a conservative attitude and do not use early-layer IDs as an estimate of the relevant ID for scaling.

We conclude that since the measured ID is much larger than 4/α, we have α > 4/d, which accords with our expectations (see section 2.2.1). Given the very small value of α in language modeling, it is satisfying to observe that the corresponding ID is very large. But it would have been more exciting to discover α ≈ 4/d for language modeling. We do not know if the discrepancy is due to added complexities from the structure of the Transformer, special structure on the data manifold itself, a scrambling of data manifolds due to the residual structure and attention mechanism, or some other oversimplification in our theory.

Figure 11: ID estimates from a single 1024-token text sequence (left) and the final layer ID as measured using tokens with fixed positions within distinct sequences (right). The data manifold associated with a single sequence has a much, much smaller dimension than the full manifold.

4 Related Work

The theory of scaling we have advocated applies basic, ‘textbook’ wasserman2006all ideas from regression and density estimation. Our work was also partly inspired by similar scaling relations in random forest models; with some added assumptions, it is possible to prove them biau2012analysis . As one passes from classical techniques, to random forests, and then to neural networks, the models become increasingly powerful but less and less amenable to a direct analysis. Nevertheless, we argue that similar principles apply and underlie their scaling behavior. A similar overall perspective has been discussed by Bickel and collaborators bickel2007local .

There is a large literature on dimensionality estimation; for a nice overview see camastra2016intrinsic . We have primarily used the two nearest neighbor method TwoNN , which was based on the MLE method levina2005maximum for distances among points in a local neighborhood. In neural image classifiers, the intrinsic dimension of the data manifold was studied ansuini2019intrinsic using the TwoNN method. They demonstrated that the ID is much smaller than the dimension estimated via linear methods such as PCA, among other interesting results. Other authors have established a connection between ID and noisy labels ma2018dimensionalitydriven , and demonstrated that neural models can effectively identify a low-dimensional manifold in a larger ambient space basri2016efficient . It would be interesting to understand the relationship between the data manifold and neural circuits olah2020zoom , and how the manifold changes when non-robust features are eliminated notbugsfeatures . Recent work spigler2019asymptotic relates data dimensionality and dataset size scaling exponents for kernel methods. The intrinsic dimension of the neural network parameter space has also been discussed li2018measuring .

Neural scaling laws have been studied in a number of papers. Perhaps the first work on the subject was 1712.00409 . The more recent work rosenfeld2019constructive studies scaling with model size and dataset size, both independently and simultaneously. Language models were studied in kaplan2020scaling , where scaling relations with model size, dataset size, training compute, and training steps were identified. EfficientNet DBLP:journals/corr/abs-1905-11946 displays near power-law scaling with model size, though these models are not in the underfitting regime.

5 Discussion

We have proposed a theory connecting the model-size scaling exponent α with the intrinsic dimension d of the data manifold. Many other neural scaling laws have been identified 1712.00409 ; rosenfeld2019constructive ; kaplan2020scaling , including scalings with dataset size and compute budget, and fairly accurate power-law fits to learning curves. We have focused on scaling with model size in the infinite data limit because we expect it to be the simplest and most theoretically tractable scaling relation. Scaling with dataset size may involve issues of regularization, requiring a balance between bias and variance, while understanding the scaling with compute would require that we contend with optimization.

Nevertheless, neural scaling exponents with dataset size are often very similar777Though in almost all cases rosenfeld2019constructive ; kaplan2020scaling dataset exponents are slightly larger. This runs somewhat counter to classical expectations wasserman2006all , where the number of parameters determines a tradeoff between bias and variance, and dataset size exponents are smaller than the bias-scaling exponents that depend on model size. to model-size exponents. One might argue that dataset size scaling can be understood as a consequence of interpolation between points on the data manifold, and so should have a similar relationship to the data manifold dimension. Recent works have made this case spigler2019asymptotic . Compute scaling exponents kaplan2020scaling are also not far from model-size exponents, but combine optimization and model scaling. It seems most natural to interpret them by modeling learning curves, but perhaps optimization can be re-interpreted as the identification and dissection of the data manifold. Something like this will be necessary in order to explain the fact that larger models are much more sample efficient kaplan2020scaling than small models. This may be the most impactful direction for future work.

It will be interesting to test this theory with a wider variety of models and datasets. Generative modeling may be the ideal setting, since the abundance of unlabeled text, image, and video data provides many opportunities to train large models on nearly unlimited datasets. In this context, it may be interesting to explore what the theory suggests for finetuning pre-trained generative models on downstream tasks. We would expect that these tasks benefit from the pre-established existence of the data manifold; perhaps finetuning can be understood as a process of zooming in and refining performance in a small region of this manifold. It would also be interesting to understand how scaling relations for the loss compare to those for quantities that are not directly optimized, such as prediction accuracies. In the case of CIFAR10 we saw that accuracy and loss exhibit similar exponents. Finally, it’s worth thinking about the extent to which larger models perform better in reinforcement learning cobbe2019leveraging . Due to the non-stationary distribution in RL it may be difficult to understand model-size scaling quantitatively, and it’s less clear how to apply our theory in that context. A theory of sample efficiency scaling would be more likely to be relevant to RL.

Acknowledgments

We thank Yasaman Bahri, Ethan Dyer, Tom Henighan, Danny Hernandez, Jaehoon Lee, and Sam McCandlish for interesting discussions and feedback. We especially thank Ethan for sharing his notes on linear models and Yasaman for emphasizing that our theory of model size scaling might be re-purposed as a theory of dataset size scaling. JK has been supported in part by NSF grant PHY-1454083. This work was also supported in part by Open Philanthropy.

Appendix A Technical Details and Minor Results

A.1 Fitting

To extract the scaling exponent α we need to fit power-laws to the empirical L(N) for trained models with N parameters. For this purpose we simply fit straight lines to log L vs log N, assuming that the error in log L was independent of N (i.e. we assumed Gaussian errors in log L). We fit from the smallest value of N tested until the power-law behavior breaks down. This point is quite clear visually in most cases, as seen in figures 5, 13, and 9. For the case where we had networks with both different widths and different depths (figure 5) we only used the networks that performed among the best at each model size (i.e. we used points on the ‘convex hull’ in the log L vs log N plane).

However, to avoid bias we determined the last point to include in the fit in the following way. We fit a circle (parameterized by its center and radius) to the first k points in the log L vs log N plane (starting from the smallest N), and evaluated the radius of the best-fit circle for each k. We then chose the value of k that achieved the maximal radius, as this is the ‘most linear’ set of points. Finally, we fit a straight line to this collection of points to determine α.
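A sketch of this endpoint-selection heuristic, assuming numpy; the algebraic (Kåsa-style) circle fit used here is one simple choice, not necessarily the exact fitting routine used for the paper.

```python
import numpy as np

def best_fit_circle_radius(x, y):
    # Algebraic least-squares circle fit: x^2 + y^2 + a*x + b*y + c = 0.
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x**2 + y**2)
    (a_, b_, c_), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r2 = a_**2 / 4 + b_**2 / 4 - c_
    return np.sqrt(max(r2, 0.0))          # guard against numerical round-off

def fit_alpha(Ns, losses, min_points=3):
    """Fit log L vs log N, keeping the first k points that look 'most linear'
    (largest best-fit circle radius), and return the power-law exponent alpha."""
    x, y = np.log(np.asarray(Ns, float)), np.log(np.asarray(losses, float))
    radii = [best_fit_circle_radius(x[:k], y[:k]) for k in range(min_points, len(x) + 1)]
    k_best = int(np.argmax(radii)) + min_points
    slope, _ = np.polyfit(x[:k_best], y[:k_best], 1)
    return -slope, k_best                 # alpha = -d(log L)/d(log N)

# Example with synthetic data that bends away from the power law at large N:
Ns = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
losses = 5.0 * Ns**-0.4 + 1e-3
print(fit_alpha(Ns, losses))
```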

Note that this provides an alternative way to determine N_max, the largest network in the power-law scaling region. This was the input for figure 12, where we show N_max as a function of the intrinsic dimension for teacher/student experiments.

The power-law scaling breaks down in CIFAR10 and other small image datasets due to overfitting. We do not have a complete understanding of why it breaks down for the teacher/student experiments, but it seems to be due to a failure of optimization, perhaps related to numerical precision. We note that the power-law behavior persists to larger model size and smaller loss with the deeper networks in figure 5.

Figure 12: This figure shows the maximum number of parameters at which we observe power-law scaling of L(N), as a function of the intrinsic dimension, for teacher/student experiments. This is determined as described in appendix A.1. The left plot uses cross-entropy loss, while the right uses MSE loss. This plot should be viewed as a more empirical (but less well understood) alternative to figure 2.

A.2 Teacher/Student Experiments

A.2.1 Network Architectures

Our teacher networks had two hidden layers (i.e. an input layer, two hidden layers, and a final output layer whose dimension depends on the loss): one set for experiments with cross-entropy loss (figures 5, 7 and 8), one for MSE loss (figure 13), and one for cross-entropy loss with vetted teachers (figure 14). The teachers are randomly initialized, with biases set to zero, and weights picked from a Gaussian distribution with mean zero and standard deviation scaling as the inverse square root of the input size of the layer. We experimented with including random non-zero biases, but did not find that they significantly alter the behavior of teachers.

For experiments with mean-squared error loss, the teacher and student networks each outputted a single real value. For experiments using a cross-entropy loss, networks output two logits, and we computed the cross entropy directly from these teacher outputs (i.e. we did not sample discrete values from the teacher, but used its exact output distribution). For cross-entropy experiments we used students with 2, 3, and 4 hidden layers, and let the best performing models define the L(N) fits, while for MSE loss we simply used students with 2 hidden layers.

We ran several trials each for cross-entropy and MSE losses, and in each case selected the ones with the lowest losses. Intrinsic dimension calculations were done using the same networks. For vetted teacher experiments, we took multiple trials and computed the mean of the loss excluding the worst performing students.

A.2.2 Optimization and LR Schedule

We use the ADAM optimizer kingma2014adam with default settings except for the learning rate. In order to optimize effectively, we scanned over a grid of learning rates, and experimented with cosine, linear, and step-function learning rate schedules. We ended up using step function schedules for teacher/student experiments, and a constant learning rate for CIFAR10 and other image datasets, as these performed roughly as well or better than other choices. We did not find it necessary to vary the overall learning rate among different network sizes, but the schedules themselves were important for optimization. Our learning rate schedules for the various teacher/student experiments in the paper (labeled by associated figures) are summarized in table 1.

Experiment (T/S)           | Student architecture             | Training steps | Batch size | Learning rate (ADAM)
(random) figures 6, 7, 8   | MSE: [20,n,n,1]; CE: [20,n,n,2]  | 0-200k         | 200        | 0.01
                           |                                  | 200-220k       | 1000       | 0.01
                           |                                  | 220-240k       | 4000       | 0.001
(vetted) figure 14         | [9,n,n,2]                        | 0-100k         | 200        | 0.01
                           |                                  | 100-150k       | 200        | 0.001
                           |                                  | 150-170k       | 200        | 0.0001
Table 1: Architectures and training schedules for Teacher/Student experiments in the paper, referenced by the figures in which the results are described.
Figure 13: This figure shows L(N) with an MSE loss for students (all with 2 hidden layers) learning from a randomly initialized teacher with varying numbers of features. Figure 5 shows the results for cross-entropy loss.

A.3 Vetting Teachers to Increase Intrinsic Dimension

In figure 6, the ID is typically smaller than the number of features, especially when the latter is large. One might worry that this indicates ID measurements are inaccurate. In fact, we believe that this occurs partly because randomly initialized teacher networks do not typically produce fully generic functions of their inputs.

We can partially remedy this problem by generating a large number of teachers and vetting them, keeping only those that produce the most complicated and non-linear functions of their inputs. The result is pictured in figure 14, where we repeat the experiment of section 3.1 with vetted teachers. We see that sufficiently vetted teachers have ID nearly equal to their feature count, and that the predicted relationship between 4/α and ID continues to hold.

Presumably many vetting procedures could be successfully applied to filter the teacher networks. To increase the complexity and non-linearity of teachers so that ID would better match the number of input features, we followed this ad-hoc approach (a code sketch follows the list):

  1. For a given teacher, we took a random slice along each input coordinate axis (i.e. the values of the other coordinates are chosen uniformly at random). We performed linear regression on this slice and computed the R² score (the coefficient of determination), and took the mean of the scores across coordinate axes. A low score implies more non-linearity.

  2. We repeated this procedure several times and computed the mean score of all the trials. This is the score for the teacher.

  3. We iterated over many randomly generated teachers and selected the one with the minimum score.
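A sketch of this vetting score, assuming numpy and that the teacher is any callable mapping a batch of inputs to outputs; the number of slice points and trials are placeholders.

```python
import numpy as np

def vetting_score(teacher, n_inputs, n_slice_points=200, n_trials=10, rng=None):
    """Mean R^2 of 1-d linear fits along random axis-aligned slices.
    Lower scores indicate a more non-linear (more 'generic') teacher."""
    rng = rng or np.random.default_rng()
    scores = []
    for _ in range(n_trials):
        for axis in range(n_inputs):
            base = rng.uniform(size=n_inputs)             # random slice location
            x = np.tile(base, (n_slice_points, 1))
            x[:, axis] = np.linspace(0.0, 1.0, n_slice_points)   # vary one coordinate
            y = np.asarray(teacher(x)).ravel()            # teacher outputs on the slice
            slope, intercept = np.polyfit(x[:, axis], y, 1)
            resid = y - (slope * x[:, axis] + intercept)
            scores.append(1.0 - resid.var() / y.var())    # coefficient of determination
    return float(np.mean(scores))

# Vetting: generate many random teachers and keep the one with the lowest score.
```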

Figure 14: This figure shows the number of features and the ID vs 4/α for vetted teachers. ID is still smaller than the number of input features, but vetting partially closes the gap. Compare the slope for number of features vs 4/α here to the left of figure 6: vetting brings it closer to one. Slopes for ID vs 4/α are very similar with or without vetting.

A.4 CNNs on CIFAR10, MNIST, FMNIST, and SVHN

Figure 15: This shows train and test loss on MNIST, Fashion MNIST, and test loss on SVHN, along with the exponents and ID measurement.

For CIFAR10 we used the architecture from the tensorflow CNN tutorial tensorflow2015-whitepaper , and modified the channel width. The architecture is recorded in table 2.

The networks were trained for a fixed number of epochs with the ADAM optimizer with default hyperparameters. We train several iterations of each network and average the loss (on a log scale) over the iterations. Note that we record the test and training loss at the early stopping point where the test loss reaches its minimum value. These are the results in figure 9.

For MNIST lecun-mnisthandwrittendigit-2010 , fashion MNIST fmnist , and SVHN netzer2011reading , we use a slightly smaller network (3 instead of 4 hidden layers) with architecture shown in table 3. We used a smaller network in the hopes of identifying a power-law scaling region without significant overfitting.

For MNIST and fashion MNIST, we ran each network for several trials and took the mean loss (on a log scale). The networks were trained for a fixed number of epochs with the ADAM optimizer with default hyperparameters. As with CIFAR10, we take the minimum test loss during training (i.e. early stopping), and also report the training loss at this point.

For SVHN, the networks were trained with both the training and ‘extra’ datasets used for training (roughly 604k images in total), and the test dataset (about 26k images) used for testing. We used default hyperparameters.

Layer Output shape
Conv2D
MaxPooling2D
Conv2D
MaxPooling2D
Conv2D
Dense
Output
Table 2: Architecture of the CNN network used for CIFAR10. We chose the channel width in a range small enough to minimize overfitting. All convolutions used unit stride, and the images have 3 color channels; the total parameter count is set by the chosen channel width.
Layer sequence (both networks): Conv2D, MaxPooling2D, Conv2D, MaxPooling2D, Dense, Output
Table 3: Architecture of the CNN networks used for MNIST and fashion MNIST, and for SVHN. The two networks share the same layer sequence. All convolutions used unit stride.

A.5 Scaling of KL Divergence with Piecewise Linear Logits

We assume the logits are linear in a small region that we take to surround the origin, and that the underlying probability distribution over discrete choices is smooth. The loss in this region is

L = ∫ d^d x Σ_i p_i(x) log( p_i(x) / q_i(x) )    (A.1)

where q_i(x) are the probabilities assigned by the model’s logits. If we write q_i = p_i (1 + ε_i) then, as is well known,

Σ_i p_i log( p_i / q_i ) ≈ (1/2) Σ_i p_i ε_i²    (A.2)

to leading order in ε. After optimization the linear logits will determine an ε_i(x) that is quadratic in x, and so the loss per unit volume will scale as s⁴, as claimed.

Appendix B Review of Intrinsic Dimension Estimation Methods

In this section we review the two nearest neighbor method ansuini2019intrinsic and explain that it can be extended to k nearest neighbors. Then we note that the same analysis derives the maximum likelihood method levina2005maximum .

B.1 The Two Nearest Neighbor Method

Assume that points are drawn from a distribution with density ρ(x) with support on a d-dimensional manifold in a potentially much higher dimensional ambient space. We will see that ρ drops out of our results, assuming that it is constant across the first few nearest neighbors, so we will drop its explicit x-dependence in what follows.

The probability of finding k points from the dataset in a region with d-dimensional volume V is Poisson:

P(k) = (ρV)^k e^(-ρV) / k!    (B.1)

To see this, note that in an infinitesimal volume δV we have P(1) = ρ δV and P(0) = 1 − ρ δV, with all higher P(k) negligible. Thus the generating function for P(k) in a finite volume V can be found by taking the product of binomial distributions over all of the infinitesimal δV in V, giving

Z(t) = Π_δV (1 − ρ δV + t ρ δV) = e^(ρV(t − 1))    (B.2)

The coefficients of t^k are the P(k) above.

With this result in hand, we can consider the distribution of nearest-neighbor distances. Consider some point in the dataset. The probability for its nearest neighbor to lie at a distance in [r, r + dr] is given by the product of the probability that there are no points inside radius r times the probability of finding a point in the shell [r, r + dr], which is

P(r) dr = ρ ω_d d r^(d−1) e^(−ρ ω_d r^d) dr    (B.3)

where ω_d is the volume of a unit d-ball. This result easily generalizes to the case where there are several radii r_i corresponding to the first few nearest neighbors. For example for two nearest neighbors we find

P(r_1, r_2) dr_1 dr_2 = (ρ ω_d d)² r_1^(d−1) r_2^(d−1) e^(−ρ ω_d r_2^d) dr_1 dr_2    (B.4)

since we are demanding that there are points on the two infinitesimal shells at radii r_1 and r_2, and no points otherwise.

Now we can compute the distribution over nearest neighbor distances, and their ratios. The TwoNN method ansuini2019intrinsic is based on the distribution of the ratio μ = r_2/r_1, which we can compute by integrating over r_1 and r_2 while fixing their ratio:

P(μ) = ∫ dr_1 dr_2 P(r_1, r_2) δ(μ − r_2/r_1) = d μ^(−(d+1))    (B.5)

This means that the cumulative distribution for μ is

C(μ) = 1 − μ^(−d)    (B.6)

We can therefore identify the dimension d by measuring the slope of a linear fit of −log(1 − C(μ)) vs log(μ). That’s the TwoNN method, as seen in figure 16.

Figure 16: This figure shows the relationship in equation B.16, which we use to determine the ID using the nearest neighbor method. We display examples using teacher/student data, CIFAR10, and GPT.

B.2 Extension to k Neighbors and MLE

The beauty of the TwoNN method ansuini2019intrinsic is that it uses very short-distance information, and so it’s plausible that the density ρ can be well-approximated as a constant. A downside of this method is that it primarily measures the dimension on short scales. This can be mitigated by applying the method while sampling different numbers of points from the data distribution, but it’s also easy to validate the TwoNN method by simply using more neighbors.

Let’s see what happens with three neighbors, and then we will generalize. We can compute the joint distribution of the first three nearest-neighbor distances, and use it for validation. We have

P(r_1, r_2, r_3) = (ρ ω_d d)³ r_1^(d−1) r_2^(d−1) r_3^(d−1) e^(−ρ ω_d r_3^d)    (B.7)

Intuitively, a large r_3 becomes unlikely because it implies that there are few points inside a large radius, but with r_3 fixed, larger values of r_1 and r_2 are more probable due to the larger volume at large radius.

We find a nice simplification when we study μ_3 = r_3/r_1 and its cumulative distribution after marginalizing over r_2. The probability distribution is

P(μ_3) = 2d (μ_3^d − 1) μ_3^(−(2d+1))    (B.8)

The cumulative distribution is then

C(μ_3) = (1 − μ_3^(−d))²    (B.9)

Thus we also find a simple method for identifying d based on μ_3 alone, namely

d = −log(1 − √(C(μ_3))) / log(μ_3)    (B.10)

This directly generalizes the TwoNN; in practice we measure d via a linear fit of the numerator as a function of the denominator in this expression.

Generalizing to k neighbors, the joint probability distribution for the distances r_1 < r_2 < ⋯ < r_k is

P(r_1, …, r_k) = (ρ ω_d d)^k (Π_{i=1}^{k} r_i^(d−1)) e^(−ρ ω_d r_k^d)    (B.11)

This can be used directly for maximum likelihood estimation levina2005maximum . If we maximize the likelihood with respect to d we find

d_MLE = (k − 1) / ( Σ_{i=1}^{k−1} log(r_k / r_i) )    (B.12)

In fact, this MLE estimator is biased; the unbiased estimator is levina2005maximum

d = (k − 2) / ( Σ_{i=1}^{k−1} log(r_k / r_i) )    (B.13)

In practice, we can compute the RHS for all points in the manifold (after fixing some value for the number of neighbors k) and compute the mean. We display a histogram of the MLE estimates over many points in the data manifold for two examples in figure 17. The variance provides some measure of the errors. Alternatively, we could directly measure the distances and evaluate the likelihood as a function of d. The variance of this estimator was studied in levina2005maximum . They also found numerically that it can be useful to tune the value of k, as very small k overestimates ID while large k underestimates ID.
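A minimal sketch of this per-point MLE estimate (equation B.13), again assuming numpy and scikit-learn; the function name and default k are our own choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id(X, k=20):
    """Unbiased Levina-Bickel MLE of intrinsic dimension, averaged over points.
    X has shape (n_points, n_features); k is the number of neighbors used."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]                              # drop the zero self-distance
    # Per-point estimate: (k - 2) / sum_{i<k} log(r_k / r_i), as in (B.13).
    logs = np.log(r[:, -1][:, None] / r[:, :-1])  # log(r_k / r_i) for i = 1..k-1
    d_hat = (k - 2) / logs.sum(axis=1)
    return float(d_hat.mean()), d_hat             # mean ID and the per-point values

# The spread of d_hat across points gives histograms like those in figure 17.
```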

Figure 17: These figures show a histogram of the results for the ID from MLE among all of the points used for measurement. On the left we have a teacher with 10 features, in the middle we have the CNN trained on CIFAR10, while on the right we have the GPT model’s prefinal attention output for the last token in the text sequence. Smaller numbers of neighbors typically give larger IDs.

We can use these results to extend the TwoNN method in a simple way to general k. Marginalizing over all but μ_k = r_k/r_1, we find that

P(μ_k) = d (k − 1) μ_k^(−(d+1)) (1 − μ_k^(−d))^(k−2)    (B.14)

which leads to the cumulative distribution

C(μ_k) = (1 − μ_k^(−d))^(k−1)    (B.15)

and the formula

d = −log(1 − C(μ_k)^(1/(k−1))) / log(μ_k)    (B.16)

for the k-th nearest neighbor. This can be used as a cross-check for TwoNN. For examples of the relationship between the numerator and denominator with various k, and the relevant fits, see figure 16. Just as with MLE, we find empirically that larger k leads to smaller estimates of ID (see figure 21).

Appendix C Examples and Tests of Intrinsic Dimension Estimation

The MLE and TwoNN methods have been tested and demonstrated by their authors levina2005maximum ; ansuini2019intrinsic . We conduct a few tests with synthetic data. Then we provide some other examples of the ID measurement process, including errors, using our student/teacher, CIFAR10, and language data.

C.1 Tests on Synthetic Data

Figure 18: Here we show the measured ID as a function of the number of points in the dataset used for the measurement, for both the TwoNN (top) and MLE (bottom) methods. The left plots show a uniform distribution in the hypercube [0,1]^d, while the plots on the right show a d-torus embedded in 2d dimensions.

As a baseline test, we evaluate the TwoNN and MLE methods on synthetic datasets with a range of dimensions, with results in figure 18. We display synthetic data on the hypercube [0,1]^d as well as a d-torus embedded in 2d dimensions (in the simplest way, by embedding each circle factor in 2 Euclidean dimensions).

We notice that 1) results are more accurate for smaller d, with quite reliable results for the TwoNN method at small d, 2) at large d all methods tend to underestimate the true ID, but 3) it’s certainly possible to both under- and over-estimate the true ID, and measurements are not necessarily even monotonic with the number of points used for the measurement. We also see that for the torus the ID estimates are reasonably accurate even at larger dimensions, though there’s certainly no guarantee that this will hold for unknown data manifolds.

As other authors have noted camastra2016intrinsic , the ID is under-estimated on the hypercube, likely because cubes have sharp boundaries and corners which reduce the number of nearby neighbors. Similarly, we believe that the ID is often over-estimated for the torus because (due to the curvature of the circles in the embedding space) points are often closer together than they would be in flat Euclidean space. We have also seen, as shown in levina2005maximum , that for small k the MLE method typically overestimates ID. The NN method seems a bit less sensitive to the number of neighbors as compared to MLE.

C.2 Tests on Neural Network Activations

Figure 19: Variation of intrinsic dimension (ID) with the number of activation vectors used, for a single student network (left), for the last layer of a CNN trained on CIFAR10 (middle), and for the last layer and last token of GPT (right). The student was trained on a teacher with a fixed number of features.
Figure 20: Variation of intrinsic dimension (ID) across network sizes for a single teacher. The left and right figures correspond to teachers with different numbers of input features. Each point on either figure is one student. All students on each figure are trained on the same teacher, but the teachers for the left and right figures are different.
Figure 21: Variation of intrinsic dimension (ID) with the number of neighbors used in the algorithm. The figure on the left shows a student trained on a teacher with 10 features, while the one on the right shows a different student/teacher pair.

In all cases we measure ID from fully trained networks, and we always use students (not teachers) in that context. There are a large variety of potential statistical and systematic errors associated with these measurements:

  • Variation among IDs measured from students of the same size and trained with the same teacher network (or dataset), but with different initialization (see figure 20).

  • Variation of ID measurements among random groups of points sampled from the same data manifold

  • Dependence of ID on the number of points used (and so the overall density) from the data manifold. More points sample shorter distance scales on the manifold. See figure 19.

  • Dependence of ID on how many nearest neighbor points are used, either for NN (see figure 21) or MLE type estimation.

  • Variation of ID among points in different locations on the data manifold (we show a histogram of results from MLE in figure 17)

  • Dataset-specific distinctions, e.g. from the same or different classes in an image classifier, or from the same or different text sequences in a language model (discussed in section 3.4)

  • Dependence of ID measurements on the layer studied (see figures 10 and 19)

We provide some brief information about many of these sources of variation in the referenced plots. In most cases we find that the variation of the ID is small as long as it is measured with sufficiently many vectors. It would be interesting to obtain a more precise theoretical and experimental characterization of these methods in the future.

But as evidenced by the synthetic examples in figure 18, this does not lead us to believe that the IDs are fully trustworthy, especially when they are measured to be large. Though the apparent statistical errors in ID measurements may seem small, there may be systematic errors that are more difficult to observe.

It’s conceivable that deficiencies in ID measurement actually work to the advantage of the theory relating α and d. For example, the ID tends to be underestimated when the data manifold has a boundary (or simply less support in some region), but this may also correlate with regions of the manifold where there really is less data, and these regions do not need to be modeled as precisely to achieve a good test loss. But we leave a more thorough investigation of such subtleties to future work.

Figure 22: These figures are histograms of the GPT MLE estimates using the last token of the prefinal layer. Counts are the number of points in the data manifold that produce a given maximum-likelihood ID. These are computed using all available text sequences, i.e. test + validation (10k points).

References