# How many degrees of freedom do we need to train deep networks: a loss landscape perspective

A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We explain this phenomenon by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from 0 to 1 as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of precise properties of the high dimensional geometry of the loss landscape. In particular, we show via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total number of parameters, thereby implying, by our theory, that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, this threshold training dimension provides a strong null model for assessing the efficacy of more sophisticated ways to reduce training degrees of freedom, including lottery tickets as well as a more optimal method we introduce: lottery subspaces.


## 1 Introduction

How many parameters are needed to train a neural network to a specified accuracy? Recent work on two fronts indicates that the answer for a given architecture and dataset pair is often much smaller than the total number of parameters used in modern large-scale neural networks. The first is the successful identification of lottery tickets, or sparse trainable subnetworks, through iterative training and pruning cycles frankle2018Lottery. Such methods utilize information from training to identify lower-dimensional parameter spaces which can optimize to a similar accuracy as the full model. The second is the observation that constrained training within a random, low-dimensional affine subspace is often successful at reaching a high desired train and test accuracy on a variety of tasks, provided that the training dimension of the subspace is above an empirically-observed threshold training dimension li2018intrinsicdimension. These results, however, leave open the question of why low-dimensional training is so successful and whether we can give a theoretical explanation for the existence of a threshold training dimension for a given model and task.

In this work, we provide such an explanation in terms of the high-dimensional geometry of the loss landscape, the initialization, and the desired loss. In particular, we leverage a powerful tool from high-dimensional probability theory, namely Gordon’s escape theorem, to show that this threshold training dimension is equal to the dimension of the full parameter space minus the Gaussian width of the desired loss sublevel set projected onto the unit sphere around initialization. This theory can then be applied in several ways to enhance our understanding of the loss landscape of neural networks. For a quadratic well or second-order approximation around a local minimum, we derive this threshold training dimension analytically in terms of the Hessian spectrum and the distance of the initialization from the minimum. For general models, this relationship can be used in reverse to measure important high dimensional properties of loss landscape geometry. For example, by performing a tomographic exploration of the loss landscape, i.e. training within random subspaces of varying training dimension, we uncover a phase transition in the success probability of hitting a given loss sub-level set. The threshold-training dimension is then the phase boundary in this transition, and our theory explains the dependence of the phase boundary on the desired loss sub-level set and the initialization, in terms of the Gaussian width of the loss sub-level set projected onto a sphere surrounding the initialization.

Motivated by lottery tickets, we furthermore consider training not only within random subspaces, but also within optimized subspaces constructed using information from training in the full space. Lottery tickets can be viewed as constructing an optimized, axis-aligned subspace of parameters, i.e. one where each subspace dimension corresponds to a single parameter. What would constitute an analogous optimized choice for general subspaces? We propose two new methods: burn-in subspaces, which optimize the offset of the subspace by taking a few steps along a training trajectory, and lottery subspaces, determined by the span of gradients along a full training trajectory (Fig. 1). Burn-in subspaces in particular can be viewed as lowering the threshold training dimension by moving closer to the desired loss sublevel set. For all three methods, we empirically explore the threshold training dimension across a range of datasets and architectures.

Related Work: An important motivation of our work is the observation that training within a random, low-dimensional affine subspace starting from a random initialization can suffice to reach high training and test accuracies on a variety of tasks, provided the training dimension exceeds a threshold that was called the intrinsic dimension li2018intrinsicdimension (and which we call the threshold training dimension). However, li2018intrinsicdimension provided no theoretical explanation for this threshold and did not explore the dependence of this threshold on the quality of the initialization. Our primary goal is to provide a theoretical explanation for the existence of this threshold in terms of the geometry of the loss landscape and the quality of the initialization. Indeed, understanding the geometry of high-dimensional error landscapes has been a subject of intense interest in deep learning (see e.g. Dauphin2014-lk; goodfellow2014qualitatively; fort2019largescale; ghorbani2019investigation; sagun2016eigenvalues; sagun2017empirical; yao2018hessianbased; fort2019goldilocks; papyan2020traces; gur2018gradient; fort2019emergent; papyan2019measurements; fort2020deep, or see Bahri2020-mi for a review). But to our knowledge, the Gaussian width of sub-level sets projected onto a sphere surrounding initialization, a key quantity that determines the threshold training dimension, has not been extensively explored in deep learning.

Another motivation for our work is assessing the efficacy of more sophisticated network pruning methods. Here we can use training within random affine subspaces with different initializations, for which we can obtain theoretical insights, as a null baseline. In this work we focus on lottery tickets, or pruned networks obtained at initialization using information about the magnitude of weights at the end of training frankle2018Lottery; frankle2019stabilizing. Further work revealed the advantages obtained by pruning networks not at initialization frankle2018Lottery; Lee2018-vt; Wang2020-jt; Tanaka2020-no but slightly later in training frankle2020pruning, highlighting the importance of early stages of training jastrzebski2020breakeven; lewkowycz2020large. We will find empirically, as well as explain theoretically, that even when training within random subspaces, one can obtain higher accuracies for a given training dimension if one starts from a slightly pre-trained, or burned-in, initialization as opposed to a random initialization. Thus the null baseline for achievable accuracy is more stringent when one prunes later in training versus at initialization.

## 2 An empirically observed phase transition in the success of training

We begin with the empirical observation of a phase transition in the probability of hitting a loss sub-level set when training within a random subspace of a given training dimension, starting from some initialization. Before presenting this phase transition, we first define loss sublevel sets and two different methods for training within a random subspace that differ only in the quality of the initialization. In the next section we develop theory for the nature of this phase transition.

#### Loss sublevel sets.

Let $f(\mathbf{x}; \mathbf{w})$ be a neural network with weights $\mathbf{w} \in \mathbb{R}^D$ and inputs $\mathbf{x}$. For a given training set $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^N$, the empirical loss landscape is given by $\mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \ell\big(f(\mathbf{x}_n; \mathbf{w}), \mathbf{y}_n\big)$, where $\ell$ is a per-example loss. Though our theory is general, we focus on classification for our experiments, where $\mathbf{y}_n$ is a one-hot encoding of class labels, $f(\mathbf{x}_n; \mathbf{w})$ is a vector of class probabilities, and $\ell$ is the cross-entropy loss. In general, the loss sublevel set $S(\epsilon)$ at a desired value of loss $\epsilon$ is the set of all points for which the loss is less than or equal to $\epsilon$:

$$S(\epsilon) := \{\mathbf{w} \in \mathbb{R}^D : \mathcal{L}(\mathbf{w}) \le \epsilon\}. \tag{2.1}$$

#### Random affine subspace.

Consider a $d$-dimensional random affine hyperplane contained in $D$-dimensional weight space, parameterized by $\boldsymbol{\theta} \in \mathbb{R}^d$:

$$\mathbf{w}(\boldsymbol{\theta}) = \mathbf{A}\boldsymbol{\theta} + \mathbf{w}_0.$$

Here $\mathbf{A} \in \mathbb{R}^{D \times d}$ is a random Gaussian matrix with columns normalized to $1$ and $\mathbf{w}_0 \in \mathbb{R}^D$ is a random Gaussian vector. To train within this subspace, we initialize $\boldsymbol{\theta} = \mathbf{0}$, which corresponds to randomly initializing the network at $\mathbf{w}_0$, and we minimize $\mathcal{L}\big(\mathbf{w}(\boldsymbol{\theta})\big)$ with respect to $\boldsymbol{\theta}$.

#### Burn-in affine subspace.

Alternatively, we can initialize the network with parameters $\mathbf{w}_0$ and train the network in the full space for some number of iterations $t$, arriving at the parameters $\mathbf{w}_t$. We can then construct the random burn-in subspace

$$\mathbf{w}(\boldsymbol{\theta}) = \mathbf{A}\boldsymbol{\theta} + \mathbf{w}_t, \tag{2.2}$$

with $\mathbf{A}$ chosen randomly as before, and then subsequently train within this subspace by minimizing $\mathcal{L}\big(\mathbf{w}(\boldsymbol{\theta})\big)$ with respect to $\boldsymbol{\theta}$. The random affine subspace is identical to the burn-in affine subspace but with $t = 0$. Exploring the properties of training within burn-in as opposed to random affine subspaces enables us to explore the impact of the quality of the initialization, after burning in some information from the training data, on the success of subsequent restricted training.
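As a concrete illustration, the subspace construction shared by both methods can be sketched in a few lines of NumPy (a minimal sketch with names of our choosing; the paper's actual experiments were implemented in JAX):

```python
import numpy as np

def make_subspace(D, d, w_init, rng):
    """Sample a random D x d Gaussian matrix with unit-norm columns.

    Training is then restricted to w(theta) = A @ theta + w_init,
    starting from theta = 0 (i.e. from w_init itself).
    """
    A = rng.standard_normal((D, d))
    A /= np.linalg.norm(A, axis=0, keepdims=True)  # normalize each column to 1
    return lambda theta: A @ theta + w_init

# Random affine subspace: w_init is a fresh random initialization (t = 0).
# Burn-in affine subspace: w_init = w_t, the weights after t full-space steps.
rng = np.random.default_rng(0)
D, d = 1000, 10
w0 = rng.standard_normal(D)
w_of_theta = make_subspace(D, d, w0, rng)
assert np.allclose(w_of_theta(np.zeros(d)), w0)  # theta = 0 recovers w_init
```

The only difference between the two methods is the offset: a fresh random initialization for random affine subspaces versus the partially trained weights w_t for burn-in subspaces.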

#### Success probability in hitting a sub-level set.

In either training method, achieving $\mathcal{L}\big(\mathbf{w}(\boldsymbol{\theta})\big) \le \epsilon$ implies that the intersection between our random or burn-in affine subspace and the loss sub-level set $S(\epsilon')$ is non-empty for all $\epsilon' \ge \epsilon$. As both the subspace and the initialization $\mathbf{w}_0$ leading to $\mathbf{w}_t$ are random, we are interested in the success probability that a burn-in (or random when $t = 0$) subspace of training dimension $d$ actually intersects a loss sub-level set $S(\epsilon)$:

$$P_s(d, \epsilon, t) \equiv \mathbb{P}\big[\, S(\epsilon) \cap \{\mathbf{w}_t + \mathrm{span}(\mathbf{A})\} \neq \varnothing \,\big]. \tag{2.3}$$

Here, $\mathrm{span}(\mathbf{A})$ denotes the column space of $\mathbf{A}$. Note in practice we cannot guarantee that we obtain the minimal loss in the subspace, so we use the best value achieved by Adam kingma2014adam as an approximation. Thus the probability of achieving a given loss sublevel set via training constitutes an approximate lower bound on the probability in eq. 2.3 that the subspace actually intersects the loss sublevel set.

#### Threshold training dimension as a phase transition boundary.

We will find that for any fixed $t$, the success probability $P_s(d, \epsilon, t)$ in the $\epsilon$ by $d$ plane undergoes a sharp phase transition. In particular, for a desired (not too low) loss $\epsilon$, it transitions sharply from $0$ to $1$ as the training dimension $d$ increases. To capture this transition we define:

###### Definition 2.1.

[Threshold training dimension] The threshold training dimension $d^*(\epsilon, t, \delta)$ is the minimal value of $d$ such that $P_s(d, \epsilon, t) \ge 1 - \delta$ for some small $\delta > 0$.

For any chosen criterion $\delta$ (and fixed $t$), we will see that the curve $d^*(\epsilon, t, \delta)$ forms a phase boundary in the $\epsilon$ by $d$ plane separating two phases of high and low success probability.

This definition also gives an operational procedure to approximately measure the threshold training dimension: run either the random or burn-in affine subspace method repeatedly over a range of training dimensions $d$, and record the lowest loss value $\epsilon$ found in the $(\epsilon, d)$ plane when optimizing via Adam. We can then construct the empirical probability across runs of hitting a given sublevel set $S(\epsilon)$, and the threshold training dimension is the lowest value of $d$ for which this probability crosses $1 - \delta$ (for our chosen small criterion $\delta$).
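This operational procedure reduces to a simple scan over empirical success probabilities; a minimal sketch (the helper name and toy success counts below are ours, not from the paper):

```python
import numpy as np

def threshold_training_dimension(dims, successes, runs, delta=0.1):
    """Smallest d whose empirical success probability is >= 1 - delta.

    dims: sorted candidate training dimensions
    successes[i]: number of runs at dims[i] that hit the sublevel set S(eps)
    runs: number of runs per dimension
    Returns None if no dimension crosses the threshold.
    """
    for d, s in zip(dims, successes):
        if s / runs >= 1.0 - delta:
            return d
    return None

# Toy success counts over 10 runs, sharpening from 0 to 1 as d grows:
dims = [1, 2, 4, 8, 16, 32]
successes = [0, 0, 1, 4, 9, 10]
assert threshold_training_dimension(dims, successes, runs=10) == 16
```

In practice one such scan is performed per sublevel set S(ε), tracing out the full phase boundary d*(ε, t, δ).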

### 2.1 An empirical demonstration of a training phase transition

In this section, we carry out this operational procedure, comparing random and burn-in affine subspaces across a range of datasets and architectures. We examined three architectures: 1) Conv-2, a simple 2-layer CNN with 16 and 32 channels, ReLU activations, and maxpool after each convolution, followed by a fully connected layer; 2) Conv-3, a 3-layer CNN with 32, 64, and 64 channels but an otherwise identical setup to Conv-2; and 3) ResNet20v1, as described in he2016deep, with on-the-fly batch normalization ioffe2015batch. We perform experiments on 5 datasets: MNIST lecun2010mnist, Fashion MNIST xiao2017fashion, CIFAR-10 and CIFAR-100 krizhevsky2014cifar, and SVHN. Baselines and experiments were run for the same number of epochs for each model and dataset combination; further details on architectures, hyperparameters, and training procedures are provided in the appendix. The code for the experiments was implemented in JAX jax2018github.

Figure 2 shows results on the training loss for 4 datasets for both random and burn-in affine subspaces with a Conv-2. We obtain similar results for the two other architectures (see Appendix). Figure 2 exhibits several broad and important trends. First, for each training method within a random subspace, there is indeed a sharp phase transition in the success probability in the $\epsilon$ (or equivalently accuracy) by $d$ plane, from $0$ (white regions) to $1$ (black regions). Second, the threshold training dimension $d^*(\epsilon, t, \delta)$ does indeed track the tight phase boundary separating these two regimes. Third, broadly for each method, to achieve a lower loss, or equivalently higher accuracy, the threshold training dimension is higher; thus one needs more training dimensions to achieve better performance. Fourth, when comparing the threshold training dimension across all methods on the same dataset (final column of Figure 2), we see that at high accuracy (low loss $\epsilon$), increasing the amount of burn-in lowers the threshold training dimension. To see this, pick a high accuracy for each dataset, and follow the horizontal line of constant accuracy from left to right to find the threshold training dimension for each method. The method with the lowest threshold training dimension is burn-in with the largest burn-in time $t$; methods with progressively less burn-in have higher threshold training dimensions, with random affine subspaces ($t = 0$) having the highest. Thus the main trend is: for some range of desired accuracies, burning more information into the initialization by training on the training data reduces the number of subsequent training dimensions required to achieve the desired accuracy.

Figure 3 shows the threshold training dimension for each accuracy level for all three models on MNIST, Fashion MNIST and CIFAR-10, not only for training accuracy, but also for test accuracy. The broad trends discussed above hold robustly for both train and test accuracy for all 3 models.

## 3 A theory of the phase transition in training success

Here we aim to develop a mathematical theory that accounts for the major trends observed empirically above, namely: (1) there exists a phase transition in the success probability $P_s(d, \epsilon, t)$, yielding a phase boundary given by a threshold training dimension $d^*(\epsilon, t, \delta)$; (2) at fixed $t$ and $\delta$, this threshold increases as the desired loss $\epsilon$ decreases (or desired accuracy increases), indicating that more dimensions are required to perform better; (3) at fixed $\epsilon$ and $\delta$, this threshold decreases as the burn-in time $t$ increases, indicating that fewer training dimensions are required to achieve a given performance starting from a better burned-in initialization. Our theory will build upon several aspects of high-dimensional geometry which we first review. In particular, we discuss, in turn, the notion of the Gaussian width of a set, then Gordon's escape theorem, and then introduce a notion of local angular dimension of a set about a point. Our final result, stated informally, will be that the threshold training dimension plus the local angular dimension of a desired loss sub-level set about the initialization must equal the total number of parameters $D$. As we will see, this succinct statement conceptually explains the major trends observed empirically. First we start with the definition of Gaussian width:

###### Definition 3.1 (Gaussian Width).

The Gaussian width of a subset $S \subset \mathbb{R}^D$ is given by (see Figure 4):

$$w(S) = \frac{1}{2}\, \mathbb{E} \sup_{\mathbf{x}, \mathbf{y} \in S} \langle \mathbf{g}, \mathbf{x} - \mathbf{y} \rangle, \qquad \mathbf{g} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{D \times D}).$$

As a simple example, let $S$ be a solid ball of radius $r$ and dimension $d \ll D$ embedded in $\mathbb{R}^D$. Then its Gaussian width for large $D$ is well approximated by $w(S) = r\sqrt{d}$.
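This approximation is easy to check by Monte Carlo: for a ball spanning the first d coordinates, the supremum in the width definition is attained at antipodal points of the ball along the projection of g onto those coordinates, so the width reduces to r times the expected norm of a d-dimensional standard Gaussian, which concentrates near sqrt(d). A sketch (the function name is ours):

```python
import numpy as np

def gaussian_width_ball(r, d, D, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Gaussian width of a radius-r d-ball in R^D.

    For fixed g, sup over x, y in the ball of <g, x - y> is 2 * r * ||g_d||,
    where g_d is g restricted to the ball's d coordinates; the width is half
    the expectation of that quantity, i.e. r * E||g_d||.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, D))
    g_d = g[:, :d]  # embed the ball in the first d coordinates
    return np.mean(r * np.linalg.norm(g_d, axis=1))

r, d, D = 2.0, 100, 1000
est = gaussian_width_ball(r, d, D)
assert abs(est - r * np.sqrt(d)) / (r * np.sqrt(d)) < 0.05  # w(S) ≈ r√d
```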

#### Gordon’s escape theorem.

The Gaussian width of a set $S$, at least when $S$ is contained in a unit sphere around the origin, in turn characterizes the probability that a random subspace intersects that set, through Gordon's escape theorem gordon1988milman:

###### Theorem 3.1.

[Escape Theorem] Let $S$ be a closed subset of the unit sphere in $\mathbb{R}^D$. If $k > w(S)^2$, then a $(D-k)$-dimensional subspace $Y$ drawn uniformly from the Grassmannian satisfies gordon1988milman:

$$\mathbb{P}(Y \cap S = \varnothing) \ge 1 - 2.5\, \exp\!\left[ -\big(k/\sqrt{k+1} - w(S)\big)^2 / 18 \right].$$

A clear explanation of the proof can be found in mixon_2014. The bound says that when $k \gg w(S)^2$, the probability of no intersection quickly approaches $1$. Matching lower bounds, which state that the intersection occurs with high probability when $k \le w(S)^2$, have been proven for spherically convex sets amelunxen2014living. Thus this threshold is sharp, up to the subtlety that one is only guaranteed to hit the spherical convex hull of the set (defined on the sphere) with high probability.

When expressed in terms of the subspace dimension $d = D - k$, rather than its co-dimension $k$, these results indicate that a $d$-dimensional subspace will intersect a closed subset $S$ of the unit sphere around the origin with high probability if and only if $d + w(S)^2 \ge D$, with a sharp transition at the threshold $d^* = D - w(S)^2$. This is a generalization of the result that two random subspaces in $\mathbb{R}^D$ of dimension $d$ and $d'$ intersect with high probability if and only if $d + d' \ge D$. Thus we can think of $w(S)^2$ as playing a role analogous to dimension for sets on the centered unit sphere.
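The dimension-counting statement for random subspaces can be checked directly: generically, dim(U ∩ V) = max(0, d + d' − D), which follows from dim(U ∩ V) = d + d' − dim(U + V), with dim(U + V) computable as the rank of the stacked basis matrices. A small NumPy sketch (the helper name is ours):

```python
import numpy as np

def intersection_dim(d1, d2, D, seed=0):
    """Generic dimension of the intersection of random d1- and d2-dim
    subspaces of R^D.

    Uses dim(U ∩ V) = d1 + d2 - dim(U + V), where dim(U + V) is the rank
    of the concatenated basis matrix [U V].
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((D, d1))  # columns span a random d1-dim subspace
    V = rng.standard_normal((D, d2))  # columns span a random d2-dim subspace
    return d1 + d2 - np.linalg.matrix_rank(np.hstack([U, V]))

D = 50
assert intersection_dim(20, 20, D) == 0   # 20 + 20 < 50: generically miss
assert intersection_dim(30, 30, D) == 10  # 30 + 30 - 50 = 10
```

The escape theorem replaces one of the subspace dimensions with the squared Gaussian width w(S)² of a general set on the sphere.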

### 3.1 Intersections of random subspaces with general subsets

To explain the training phase transition, we must now adapt Gordon's escape theorem to a general loss sublevel set $S(\epsilon)$ in $\mathbb{R}^D$, and we must take into account that the initialization $\mathbf{w}_t$ is not at the origin in weight space. To do so, we first define the projection of a set $S$ onto a unit sphere centered at $\mathbf{w}_t$:

$$\mathrm{proj}_{\mathbf{w}_t}(S) \equiv \big\{ (\mathbf{x} - \mathbf{w}_t)/\|\mathbf{x} - \mathbf{w}_t\|_2 : \mathbf{x} \in S \big\}. \tag{3.1}$$

Then we note that any affine subspace of the form in eq. 2.2 centered at $\mathbf{w}_t$ intersects $S$ if and only if it intersects $\mathrm{proj}_{\mathbf{w}_t}(S)$. Thus we can apply Gordon's escape theorem to $\mathrm{proj}_{\mathbf{w}_t}(S)$ to compute the probability of the training subspace in eq. 2.2 intersecting a sublevel set $S$. Since the squared Gaussian width of a set in a unit sphere plays a role analogous to dimension, we define:

###### Definition 3.2 (Local angular dimension).

The local angular dimension of a general set $S \subset \mathbb{R}^D$ about a point $\mathbf{w}_t \in \mathbb{R}^D$ is defined as

$$d_{\mathrm{local}}(S, \mathbf{w}_t) \equiv w^2\big(\mathrm{proj}_{\mathbf{w}_t}(S)\big). \tag{3.2}$$

An escape theorem for general sets and affine subspaces now depends on the initialization $\mathbf{w}_t$ as well, and follows from the above considerations combined with Gordon's original escape theorem:

###### Theorem 3.2.

[Main Theorem] Let $S$ be a closed subset of $\mathbb{R}^D$. If $k > w\big(\mathrm{proj}_{\mathbf{w}_0}(S)\big)^2$, then a $(D-k)$-dimensional affine subspace $Y$ drawn uniformly from the Grassmannian and centered at $\mathbf{w}_0$ satisfies:

$$\mathbb{P}(Y \cap S = \varnothing) \ge 1 - 2.5\, \exp\!\left[ -\big(k/\sqrt{k+1} - w(\mathrm{proj}_{\mathbf{w}_0}(S))\big)^2 / 18 \right].$$

To summarise this result in the context of our application: given an arbitrary loss sub-level set $S(\epsilon)$, a training subspace of training dimension $d$ starting from an initialization $\mathbf{w}_t$ will hit the (convex hull of the) loss sublevel set with high probability when $d + d_{\mathrm{local}}(S(\epsilon), \mathbf{w}_t) > D$, and will miss it (i.e. have empty intersection) with high probability when $d + d_{\mathrm{local}}(S(\epsilon), \mathbf{w}_t) < D$. This analysis thus establishes the existence of a phase transition in the success probability $P_s(d, \epsilon, t)$ in eq. 2.3, and moreover establishes the threshold training dimension of definition 2.1 for small values of $\delta$:

$$d^*(S(\epsilon), \mathbf{w}_t) = D - d_{\mathrm{local}}(S(\epsilon), \mathbf{w}_t). \tag{3.3}$$

Our theory provides several important insights into the nature of the threshold training dimension. First, small threshold training dimensions can only arise if the local angular dimension of the loss sublevel set $S(\epsilon)$ about the initialization is close to the ambient dimension $D$. Second, as $\epsilon$ increases, $S(\epsilon)$ becomes larger, with a larger $d_{\mathrm{local}}(S(\epsilon), \mathbf{w}_t)$, and consequently a smaller threshold training dimension. Similarly, if $\mathbf{w}_t$ is closer to $S(\epsilon)$, then $d_{\mathrm{local}}(S(\epsilon), \mathbf{w}_t)$ will be larger, and the threshold training dimension will also be lower (see fig. 4). This observation accounts for the observed decrease in threshold training dimension with increased burn-in time $t$. Presumably, burning information into the initialization for a longer time brings the initialization closer to the sublevel set $S(\epsilon)$, making it easier to hit with a random subspace of lower dimension. This effect is akin to staring out into the night sky in a single random direction and asking with what probability we will see the moon: this probability increases the closer we are to the moon.

To illustrate our theory, we work out the paradigmatic example of a quadratic loss function $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top \mathbf{H} \mathbf{w}$, where $\mathbf{H} \in \mathbb{R}^{D \times D}$ is a symmetric, positive definite Hessian matrix. A sublevel set $S(\epsilon)$ of the quadratic well is an ellipsoidal body with principal axes along the eigenvectors $\hat{\mathbf{e}}_i$ of $\mathbf{H}$. The radius $r_i$ along principal axis $\hat{\mathbf{e}}_i$ obeys $\frac{1}{2}\lambda_i r_i^2 = \epsilon$, where $\lambda_i$ is the corresponding eigenvalue. Thus $r_i = \sqrt{2\epsilon/\lambda_i}$, and so a large (small) Hessian eigenvalue leads to a narrow (wide) radius along the corresponding principal axis of the ellipsoid. The overall squared Gaussian width of the sublevel set obeys $w^2(S(\epsilon)) \asymp \sum_i r_i^2 = \sum_i 2\epsilon/\lambda_i$, where $\asymp$ denotes bounded above and below by this expression times positive constants vershynin2018high.

We next consider training within a random subspace of dimension $d$ starting from some initialization $\mathbf{w}_0$. To compute the probability that the subspace hits the sublevel set $S(\epsilon)$, as illustrated in Fig. 4, we must project this ellipsoidal sublevel set onto the surface of the unit sphere centered at $\mathbf{w}_0$. The Gaussian width of this projection will depend on the distance $R \equiv \|\mathbf{w}_0\|$ from the initialization to the global minimum at $\mathbf{w} = \mathbf{0}$ (i.e. it should increase with decreasing $R$). We can develop a crude approximation to this width as follows. Assuming $R \gg r_i$, the direction $\hat{\mathbf{e}}_i$ will be approximately orthogonal to $\mathbf{w}_0$, so that $\langle \hat{\mathbf{e}}_i, \mathbf{w}_0 \rangle \approx 0$. The distance between the tip of the ellipsoid at radius $r_i$ along principal axis $\hat{\mathbf{e}}_i$ and the initialization $\mathbf{w}_0$ is therefore $\sqrt{R^2 + r_i^2}$. The ellipsoid's radius $r_i$ then gets scaled down to approximately $r_i / \sqrt{R^2 + r_i^2}$ when projected onto the surface of the unit sphere. A subtlety in this derivation is that the point actually projected onto the sphere is the point where a line through the center of the sphere lies tangent to the ellipsoid, rather than the point of fullest extent. As a result, $r_i / \sqrt{R^2 + r_i^2}$ provides a lower bound to the projected extent on the circle. This is formalized in the appendix, along with an explanation of why this bound becomes looser with decreasing $R$. Taken together, a lower bound on the local angular dimension of $S(\epsilon)$ about $\mathbf{w}_0$ is:

$$d_{\mathrm{local}}(\epsilon, R) = w^2\big(\mathrm{proj}_{\mathbf{w}_0}(S(\epsilon))\big) \gtrsim \sum_i \frac{r_i^2}{R^2 + r_i^2}, \tag{3.4}$$

where again $r_i = \sqrt{2\epsilon/\lambda_i}$. In Fig. 5, we plot the corresponding upper bound on the threshold training dimension, i.e. $d^* = D - d_{\mathrm{local}}(\epsilon, R)$, alongside simulated results for two different Hessian spectra.
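Given a Hessian spectrum, this upper bound on the threshold training dimension is straightforward to evaluate numerically; a sketch (the function name is ours) illustrating the two main trends, namely that the bound rises as the desired loss ε falls and as the distance R to the minimum grows:

```python
import numpy as np

def threshold_dim_upper_bound(eigs, eps, R):
    """Upper bound d* <= D - sum_i r_i^2 / (R^2 + r_i^2) for a quadratic well.

    eigs: positive Hessian eigenvalues lambda_i; the ellipsoid radii obey
          r_i = sqrt(2 * eps / lambda_i)
    R:    distance from the initialization to the global minimum
    """
    r2 = 2.0 * eps / np.asarray(eigs)       # squared ellipsoid radii r_i^2
    d_local_lb = np.sum(r2 / (R**2 + r2))   # lower bound on d_local (eq. 3.4)
    return len(eigs) - d_local_lb

# Toy flat spectrum: lowering the desired loss eps, or starting farther away
# (larger R), shrinks d_local and hence raises the threshold dimension.
eigs = np.ones(1000)
assert threshold_dim_upper_bound(eigs, eps=0.5, R=1.0) < threshold_dim_upper_bound(eigs, eps=0.1, R=1.0)
assert threshold_dim_upper_bound(eigs, eps=0.5, R=1.0) < threshold_dim_upper_bound(eigs, eps=0.5, R=10.0)
```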

## 4 Comparison to lottery tickets and lottery subspaces

Training within random subspaces is not meant to be a practical method; rather, it is a scientific tool to explore the properties of the loss landscape. However, random affine subspaces can serve as a null model for other more sophisticated methods, like lottery tickets, which prune parameters at initialization according to their magnitude at the end of training. Indeed, any such pruning method should ideally outperform random affine subspaces in terms of higher reachable accuracies for a given training dimension. Burn-in affine subspaces use training information in a minimal way, specifically to set the offset of the subspace; such a method can thus serve as a null model for any method that prunes networks after initialization. For scientific purposes, it is also interesting to consider optimized, rather than random, training subspaces that reach a given accuracy with far fewer dimensions, to serve as potential performance targets for future methods. To this end we introduce one optimized subspace, namely the lottery subspace, and we compare random affine subspaces, burn-in affine subspaces, lottery tickets, and lottery subspaces to explore how high an accuracy we can achieve with a given low training dimension.

#### Lottery subspaces.

Analogous to lottery tickets, we first train the network in the full space starting from an initialization $\mathbf{w}_0$. We then form the matrix $\mathbf{U}_d \in \mathbb{R}^{D \times d}$ whose columns are the top $d$ principal components of the entire training trajectory (see Appendix for details). We then train again from the same initialization within the lottery subspace

$$\mathbf{w}(\boldsymbol{\theta}) = \mathbf{U}_d \boldsymbol{\theta} + \mathbf{w}_0. \tag{4.1}$$

Since the subspace is optimized to match the top $d$ dimensions of the training trajectory, we expect lottery subspaces to achieve much higher accuracies for a given training dimension than random, or potentially even burn-in, affine subspaces. This expectation is indeed borne out in Fig. 3 (purple lines above all other lines). Intriguingly, very few lottery subspace training dimensions (the precise number depending on the dataset and architecture) are required to attain full accuracy, and adding subdominant principal components of the training trajectory can actually decrease performance (the purple curves are non-monotonic in training dimension $d$). In this sense, training within different subspaces sets expectations for what accuracies might be reasonable for practical network pruning methods to achieve as a function of training dimension, with random affine subspaces (blue curves in Fig. 3) constituting a null lower bound and lottery subspaces constituting a strong potential upper bound (purple curves in Fig. 3).
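The lottery subspace construction amounts to an SVD of the directions traveled during full-space training; a minimal NumPy sketch, assuming whole weight iterates are stored (the paper's exact trajectory bookkeeping is described in its Appendix, and the names here are ours):

```python
import numpy as np

def lottery_subspace(trajectory, w0, d):
    """Top-d principal directions of a full-space training trajectory.

    trajectory: (T, D) array of weight iterates from a full training run
    Returns U_d (D x d, columns ordered by descending singular value) and
    the map theta -> U_d @ theta + w0 for retraining within the subspace.
    """
    deltas = trajectory - w0                  # directions traveled from init
    # Rows of Vt are the principal directions of the trajectory in R^D.
    _, _, Vt = np.linalg.svd(deltas, full_matrices=False)
    U_d = Vt[:d].T                            # (D, d)
    return U_d, lambda theta: U_d @ theta + w0

rng = np.random.default_rng(0)
D, T, d = 200, 50, 5
w0 = rng.standard_normal(D)
traj = w0 + rng.standard_normal((T, D))       # stand-in for real iterates
U_d, w_of_theta = lottery_subspace(traj, w0, d)
assert U_d.shape == (D, d)
assert np.allclose(U_d.T @ U_d, np.eye(d))    # columns are orthonormal
assert np.allclose(w_of_theta(np.zeros(d)), w0)
```

Increasing d appends directions in order of decreasing singular value, which is why the subspaces at different d are nested.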

#### Comparison of various subspace training methods to lottery tickets.

Figure 6 presents empirical results comparing random affine subspaces, burn-in affine subspaces, lottery subspaces, and lottery tickets plotted against model compression ratio (defined as the number of parameters in the full model divided by the number of parameters, or training dimension, in the restricted model). The lottery tickets were constructed by training for 2 epochs, performing magnitude pruning of weights and biases, rewinding to initialization, and then training for the same number of epochs as the other methods. Note that lottery tickets are created by pruning the full model (increasing compression ratio), in contrast to all other methods, which are built up from a single dimension (decreasing compression ratio).

We observe that lottery subspaces significantly outperform random subspaces and lottery tickets at low training dimensions (high compression ratios). The experiments show a drop in accuracy as the dimension increases further, but recall that dimensions were added in order of decreasing singular value. The lottery subspace therefore still contains the best optima achieved at lower dimension; the drop is simply a failure of the optimization algorithm to reach those optima in the higher-dimensional space. We explore the spectra of these subspaces in more detail in the Appendix.

The comparison to lottery tickets at low compression ratios is limited by the fact that projecting to higher-dimensional subspaces is computationally expensive, which capped the highest training dimension we could use. In the regions where the experiments overlap, the lottery tickets do not outperform random affine subspaces, indicating that they gain no advantage from the training information they utilize. A notable exception is Conv-2 on CIFAR-10, on which lottery tickets do outperform random affine subspaces. Finally, we note that lottery tickets do not perform well at high compression ratios due to the phenomenon of layer collapse, where an entire layer gets pruned.

## 5 Conclusion

In this paper, we gained fundamental theoretical insight into when and why training within a random subspace of small training dimension $d$ can achieve a given low loss $\epsilon$: this can occur only when the local angular dimension of the loss sublevel set $S(\epsilon)$ about the initialization is high, i.e. close to the ambient dimension $D$. We furthermore proposed two ways to optimize the selection of a subspace using information from training: burn-in affine subspaces, which move the initialization closer to the sublevel set, and lottery subspaces, constructed from the principal components of a full training run. Our theory also explains geometrically why longer burn-in lowers the number of degrees of freedom required to train. Overall, these theoretical insights and methods provide a much-needed and powerful high-dimensional geometric framework with which to think about and assess the efficacy of a wide range of network pruning methods at or beyond initialization, including lottery tickets.

## Acknowledgements

B.W.L. was supported by the Department of Energy Computational Science Graduate Fellowship program (DE-FG02-97ER25308). S.G. thanks the Simons Foundation, NTT Research and an NSF Career award for funding while at Stanford.

## Appendix A Experiment supplement

The core code for the experiments is available on Github. The three top-level scripts are burn_in_subspace.py, lottery_subspace.py, and lottery_ticket.py. Random affine experiments were run by setting the parameter init_iters to 0 in the burn-in subspace code. The primary automatic differentiation framework used for the experiments was JAX jax2018github. The code was developed and tested using JAX v0.1.74, JAXlib v0.1.52, and Flax v0.2.0 and run on an internal cluster using NVIDIA TITAN Xp GPUs.

Figures 7 and 8 show the corresponding empirical probability plots for the two other models considered in this paper: Conv-3 and ResNet20. These plots are constructed in the same manner as fig. 2, except that a larger value of $\delta$ was used since fewer runs were conducted ($\delta$ was always chosen such that all but one of the runs had to successfully hit a training accuracy super-level set). The data in these plots comes from the same runs as figs. 3 and 6.

### a.1 Spectra of lottery subspaces

In our experiments, we formed lottery subspaces by storing the directions traveled during a full training trajectory and then finding the singular value decomposition of this matrix. As we increased the subspace dimension, directions were added in order of descending singular values.
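As a concrete illustration, this construction can be sketched in a few lines of NumPy (a simplified stand-in for the paper's JAX implementation; the function name and storage format here are ours):

```python
import numpy as np

def lottery_subspace_basis(trajectory_directions, subspace_dim):
    """Orthonormal basis spanning the top principal directions of a training run.

    trajectory_directions: (num_steps, num_params) array whose rows are the
    parameter-space directions traveled during full training (hypothetical format).
    """
    # SVD of the stacked directions; the right singular vectors are the
    # principal components of the trajectory.
    _, s, vt = np.linalg.svd(trajectory_directions, full_matrices=False)
    # Directions are added in order of descending singular value.
    return vt[:subspace_dim], s[:subspace_dim]
```

The returned singular values are what is plotted in the spectra below.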

Figure 9 and the left panel of fig. 10 show the associated spectra for the results presented in figs. 6 and 3. The spectra are aligned with the train and test accuracy plots such that the value directly below a point on the curve corresponds to the singular value of the last dimension added to the subspace. There were 10 runs for Conv-2, 5 for Conv-3, and 3 for ResNet20. Only the first 5 out of 10 runs are displayed for the experiments with Conv-2. No significant deviations were observed in the remaining runs.

From these plots, we observe that the spectra for a given dataset are generally consistent across architectures. In addition, the decrease in accuracy after a certain dimension (particularly for CIFAR-10) corresponds to the singular values of the added dimensions falling off towards 0.

The right panel of fig. 10 shows a tangential observation: lottery subspaces for CIFAR-10 display a sharp transition in accuracy at a dimension equal to the number of classes. This provides additional evidence for the conjecture explored by gur2018gradient, fort2019emergent, and papyan2020traces that the sharpest directions of the Hessian and the most prominent logit gradients are each associated with a class. Very little learning happens in these directions, but during optimization the parameters bounce up and down along them, so that these directions are prominent in the SVD of the gradients. This predicts exactly the behavior observed.

### a.2 Accuracy of Burn-in Initialization

Figure 11 shows a subset of the random affine and burn-in affine subspace experiments with a value plotted at dimension 0 to indicate the accuracy of the random or burn-in initialization. This gives context for which sublevel set the burn-in methods start from, enabling us to evaluate whether they are indeed reducing the threshold training dimension of sublevel sets with higher accuracy. In most cases, as we increase the dimension, the burn-in experiments increase in accuracy above their initialization and at a faster pace than the random affine subspaces. A notable exception is Conv-3 on MNIST, for which the burn-in methods appear to provide no advantage.

### a.3 Hyperparameters

Optimization restricted to an affine subspace was done using Adam kingma2014adam. We explored two values for the learning rate; one worked substantially better for this restricted optimization and was used in all experiments. The baseline runs used the better result of the two learning rates. ResNet20v1 was run with on-the-fly batch normalization ioffe2015batch, meaning we simply use the mean and variance of the current batch rather than maintaining a running average.

Table 1 shows the number of epochs used for each dataset and architecture combination across all experiments. Three epochs were used by default, and the count was increased if the full model was not close to convergence.
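To make the restricted-optimization setup concrete, the following is a minimal NumPy sketch (our own toy example, not the paper's JAX code): plain gradient descent on a quadratic loss, restricted to a random $d$-dimensional affine subspace through the initialization. Only the $d$ subspace coordinates $\theta$ are trained; gradients are pulled back through the fixed basis by the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 50, 10                      # ambient and training-subspace dimensions

# Toy quadratic loss L(x) = 0.5 * x^T H x with a random PSD Hessian.
A = rng.normal(size=(D, D))
H = A @ A.T / D

x0 = rng.normal(size=D)                          # random initialization
M = np.linalg.qr(rng.normal(size=(D, d)))[0]     # orthonormal basis, D x d

theta = np.zeros(d)                # optimize only the d subspace coordinates
lr = 0.1
for _ in range(500):
    x = x0 + M @ theta
    grad_x = H @ x                 # dL/dx
    theta -= lr * (M.T @ grad_x)   # chain rule: dL/dtheta = M^T dL/dx

x = x0 + M @ theta
loss0 = 0.5 * x0 @ H @ x0          # loss at initialization
loss = 0.5 * x @ H @ x             # loss after restricted training
```

In the actual experiments, Adam replaces the plain gradient step, but the projection of gradients into the subspace is the same.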

## Appendix B Theory supplement

In this section, we provide additional details for our study of the threshold training dimension of the sublevel sets of quadratic wells. We also derive the threshold training dimension of affine subspaces to provide further intuition.

### b.1 Proof: Gaussian width of sublevel sets of the quadratic well

In our derivation of eq. 3.4, we employ the result that the Gaussian width squared of quadratic-well sublevel sets is bounded above and below by $2\epsilon\operatorname{Tr}(\mathbf{H}^{-1})$ times positive constants. This follows from well-established bounds on the Gaussian width of an ellipsoid, which we now prove.

In our proof, we will use an equivalent expression for the Gaussian width of a set $S$:

$$w(S) := \frac{1}{2}\,\mathbb{E}\sup_{\mathbf{x},\mathbf{y}\in S}\langle \mathbf{g}, \mathbf{x}-\mathbf{y}\rangle = \mathbb{E}\sup_{\mathbf{x}\in S}\langle \mathbf{g}, \mathbf{x}\rangle, \qquad \mathbf{g}\sim\mathcal{N}(\mathbf{0}, \mathbf{I}_{D\times D}).$$
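This definition can be sanity-checked numerically on a set where the supremum has a closed form. For the unit Euclidean ball, $\sup_{\|\mathbf{x}\|\le 1}\langle \mathbf{g},\mathbf{x}\rangle = \|\mathbf{g}\|$, so $w \approx \mathbb{E}\|\mathbf{g}\| \approx \sqrt{D}$ (a quick Monte Carlo sketch in NumPy, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100
g = rng.normal(size=(20000, D))          # 20000 Gaussian samples in R^D
# For the unit ball, sup over ||x|| <= 1 of <g, x> equals ||g||.
w_ball = np.linalg.norm(g, axis=1).mean()
# E||g|| is very close to sqrt(D) for large D.
```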
###### Lemma B.1 (Gaussian width of ellipsoid).

Let $E$ be an ellipsoid in $\mathbb{R}^D$ defined by a vector $\mathbf{r}$ with strictly positive entries as:

$$E := \Big\{\mathbf{x}\in\mathbb{R}^D \;\Big|\; \sum_{j=1}^D \frac{x_j^2}{r_j^2} \le 1\Big\}$$

Then $w(E)^2$, the Gaussian width squared of the ellipsoid, satisfies the following bounds:

$$\frac{2}{\pi}\sum_{j=1}^D r_j^2 \;\le\; w(E)^2 \;\le\; \sum_{j=1}^D r_j^2$$
###### Proof.

Let $\mathbf{g}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Then we upper-bound $w(E)$ by the following steps:

$$\begin{aligned}
w(E) &= \mathbb{E}_{\mathbf{g}}\Big[\sup_{\mathbf{x}\in E}\sum_{i=1}^D g_i x_i\Big] \\
&= \mathbb{E}_{\mathbf{g}}\Big[\sup_{\mathbf{x}\in E}\sum_{i=1}^D \frac{x_i}{r_i}\, g_i r_i\Big] && \tfrac{r_i}{r_i}=1 \\
&\le \mathbb{E}_{\mathbf{g}}\Bigg[\sup_{\mathbf{x}\in E}\Big(\sum_{i=1}^D \frac{x_i^2}{r_i^2}\Big)^{1/2}\Big(\sum_{i=1}^D g_i^2 r_i^2\Big)^{1/2}\Bigg] && \text{Cauchy--Schwarz inequality} \\
&\le \mathbb{E}_{\mathbf{g}}\Bigg[\Big(\sum_{i=1}^D g_i^2 r_i^2\Big)^{1/2}\Bigg] && \text{definition of } E \\
&\le \sqrt{\mathbb{E}_{\mathbf{g}}\Big[\sum_{i=1}^D g_i^2 r_i^2\Big]} && \text{Jensen's inequality} \\
&= \Big(\sum_{i=1}^D r_i^2\Big)^{1/2} && \mathbb{E}[g_i^2]=1
\end{aligned}$$

giving the upper bound in the lemma. For the lower bound, we will begin with a general lower bound for Gaussian widths using two facts. The first is that if $\epsilon_i$ are i.i.d. Rademacher random variables and $g_i$ are i.i.d. standard Gaussians, then $\epsilon_i |g_i| \sim \mathcal{N}(0,1)$. Second, we have:

$$\mathbb{E}[|g_i|] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} |y|\, e^{-y^2/2}\, dy = \frac{2}{\sqrt{2\pi}}\int_0^{\infty} y\, e^{-y^2/2}\, dy = \sqrt{\frac{2}{\pi}}$$
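Both facts are easy to verify by simulation (a NumPy sketch, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1_000_000)                  # i.i.d. standard Gaussians
eps = rng.choice([-1.0, 1.0], size=g.shape)     # i.i.d. Rademacher signs

mean_abs = np.abs(g).mean()    # approaches E|g| = sqrt(2/pi) ~ 0.7979
flipped = eps * np.abs(g)      # eps_i * |g_i| is again standard normal
```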

Then for the Gaussian width of a general set $S$:

$$\begin{aligned}
w(S) &= \mathbb{E}\Big[\sup_{\mathbf{x}\in S}\sum_{i=1}^D g_i x_i\Big] \\
&= \mathbb{E}_{\boldsymbol{\epsilon}}\Big[\mathbb{E}_{\mathbf{g}}\Big[\sup_{\mathbf{x}\in S}\sum_{i=1}^D \epsilon_i |g_i|\, x_i \,\Big|\, \epsilon_{1:D}\Big]\Big] && \text{using } \epsilon_i|g_i|\sim\mathcal{N}(0,1) \\
&\ge \mathbb{E}_{\boldsymbol{\epsilon}}\Big[\sup_{\mathbf{x}\in S}\sum_{i=1}^D \epsilon_i\, \mathbb{E}[|g_i|]\, x_i\Big] && \text{Jensen's inequality} \\
&= \sqrt{\frac{2}{\pi}}\,\mathbb{E}\Big[\sup_{\mathbf{x}\in S}\sum_{i=1}^D \epsilon_i x_i\Big]
\end{aligned}$$

All that remains for our lower bound is to show that $\mathbb{E}\big[\sup_{\mathbf{x}\in E}\sum_{i=1}^D \epsilon_i x_i\big] = \big(\sum_{i=1}^D r_i^2\big)^{1/2}$ for the ellipsoid $E$. We begin by showing that the right-hand side is an upper bound:

$$\begin{aligned}
\mathbb{E}\Big[\sup_{\mathbf{x}\in E}\sum_{i=1}^D \epsilon_i x_i\Big] &= \sup_{\mathbf{x}\in E}\sum_{i=1}^D |x_i| && E \text{ is symmetric} \\
&= \sup_{\mathbf{x}\in E}\sum_{i=1}^D \Big|\frac{x_i}{r_i}\, r_i\Big| && \tfrac{r_i}{r_i}=1 \\
&\le \sup_{\mathbf{x}\in E}\Big(\sum_{i=1}^D \frac{x_i^2}{r_i^2}\Big)^{1/2}\Big(\sum_{i=1}^D r_i^2\Big)^{1/2} && \text{Cauchy--Schwarz inequality} \\
&= \Big(\sum_{i=1}^D r_i^2\Big)^{1/2} && \text{definition of } E
\end{aligned}$$

In the first line, we mean that $E$ is symmetric about the origin, so that we can take $\epsilon_i x_i = |x_i|$ for all $i$ without loss of generality. Finally, consider $\mathbf{x}$ such that $x_i = r_i^2/\big(\sum_{j=1}^D r_j^2\big)^{1/2}$. For this choice we have $\sum_{i=1}^D x_i^2/r_i^2 = 1$, so $\mathbf{x}\in E$, and:

$$\sum_{i=1}^D |x_i| = \sum_{i=1}^D \frac{r_i^2}{\big(\sum_{j=1}^D r_j^2\big)^{1/2}} = \Big(\sum_{i=1}^D r_i^2\Big)^{1/2}$$

showing that equality is attained in the bound. Putting these steps together yields the overall desired lower bound:

$$w(E) \ge \sqrt{\frac{2}{\pi}}\cdot\mathbb{E}\Big[\sup_{\mathbf{x}\in E}\sum_{i=1}^D \epsilon_i x_i\Big] = \sqrt{\frac{2}{\pi}}\cdot\Big(\sum_{i=1}^D r_i^2\Big)^{1/2}$$
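The two bounds can also be verified numerically: for an ellipsoid, the supremum of $\langle\mathbf{g},\mathbf{x}\rangle$ has the closed form $\big(\sum_i g_i^2 r_i^2\big)^{1/2}$, so the width is easy to estimate by Monte Carlo (a NumPy sketch, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50
r = rng.uniform(0.5, 2.0, size=D)     # ellipsoid radii r_i

g = rng.normal(size=(20000, D))
# sup over the ellipsoid of <g, x> has the closed form ||(g_i * r_i)_i||.
w = np.linalg.norm(g * r, axis=1).mean()

lower = np.sqrt(2 / np.pi) * np.sqrt((r**2).sum())   # lemma's lower bound on w(E)
upper = np.sqrt((r**2).sum())                        # lemma's upper bound on w(E)
```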

With this bound in hand, we can immediately obtain the following corollary for a quadratic well defined by Hessian $\mathbf{H}$ with eigenvalues $\lambda_j$. The Gaussian width is invariant under translation, so we can shift the well to the origin. Then note that the sublevel set $S(\epsilon) = \{\mathbf{x} \mid \tfrac{1}{2}\mathbf{x}^\top\mathbf{H}\mathbf{x} \le \epsilon\}$ is an ellipsoid with $r_j^2 = 2\epsilon/\lambda_j$, and thus $\sum_{j=1}^D r_j^2 = 2\epsilon\operatorname{Tr}(\mathbf{H}^{-1})$.

###### Corollary B.1 (Gaussian width of quadratic sublevel sets).

Consider a quadratic well defined by Hessian $\mathbf{H}$. Then the Gaussian width squared of the associated sublevel sets obeys the following bound:

$$\frac{2}{\pi}\cdot 2\epsilon\operatorname{Tr}(\mathbf{H}^{-1}) \;\le\; w^2(S(\epsilon)) \;\le\; 2\epsilon\operatorname{Tr}(\mathbf{H}^{-1})$$
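The identity behind the corollary, $\sum_j r_j^2 = 2\epsilon\operatorname{Tr}(\mathbf{H}^{-1})$, can be checked directly for a random positive-definite Hessian (a NumPy sketch, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
D, eps = 30, 0.1
A = rng.normal(size=(D, D))
H = A @ A.T + np.eye(D)            # positive-definite Hessian
lam = np.linalg.eigvalsh(H)        # eigenvalues lambda_j > 0

# Sublevel set {x : 0.5 x^T H x <= eps} is an ellipsoid with r_j^2 = 2 eps / lambda_j.
r2 = 2 * eps / lam
sum_r2 = r2.sum()
trace_form = 2 * eps * np.trace(np.linalg.inv(H))
```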

### b.2 Details on threshold training dimension upper bound

In section 3.2, we consider the projection of ellipsoidal sublevel sets onto the surface of a unit sphere centered at the initialization. The Gaussian width of this projection will depend on the distance $R$ from the initialization to the global minimum (i.e. it should increase with decreasing $R$). We used a crude approximation to this width as follows. Assuming $R$ is large relative to the radii $r_i$, the principal axes of the ellipsoid will be approximately orthogonal to the direction from the initialization to the minimum, so the distance between the tip of the ellipsoid at radius $r_i$ along principal axis $i$ and the initialization is approximately $\sqrt{R^2 + r_i^2}$. The ellipse's radius then gets scaled down to approximately $r_i/\sqrt{R^2 + r_i^2}$ when projected onto the surface of the unit sphere.

We now explain why this projected size is always a lower bound by illustrating the setup in two dimensions in fig. 12. As shown, the linear extent of the projection always results from a line that is tangent to the ellipse. For an ellipse and a line in a two-dimensional space (we set the origin at the center of the unit circle), a line tangent to the ellipse must satisfy a tangency condition, which determines the linear extent of the projection on the unit circle. In the regime where the distance dominates the radii, this is exactly Eq. 3.4. The tangency will always make the linear projections larger, and therefore Eq. 3.4 will be a lower bound on the projected Gaussian width. Furthermore, this bound becomes looser with decreasing distance. We then obtain a corresponding upper bound on the threshold training dimension.

### b.3 Threshold training dimension of affine subspaces

In Section 3.2, we considered the threshold training dimension of the sublevel sets of a quadratic well and showed that it depends on the distance from the initialization to the set, formalized in eq. 3.4. As a point of contrast, we include a derivation of the threshold training dimension of a random affine subspace in ambient dimension $D$ and demonstrate that this dimension does not depend on the distance to the subspace. Intuitively, this is because any dimension in the subspace is of infinite or zero extent, unlike the quadratic sublevel sets, whose dimensions have finite extent.

Let us consider a $D$-dimensional space in which we have a randomly chosen $d$-dimensional affine subspace $A$ defined by a vector offset $\mathbf{x}_0$ and a set of orthonormal basis vectors that we encapsulate into a matrix $\mathbf{M}$. Let us consider another random $n$-dimensional affine subspace $B$. Our task is to find a point $\mathbf{x}^* \in A$ that has the minimum distance to the subspace $B$, i.e.:

$$\mathbf{x}^* = \operatorname*{argmin}_{\mathbf{x}\in A}\,\Big\|\mathbf{x} - \operatorname*{argmin}_{\mathbf{x}'\in B}\|\mathbf{x}-\mathbf{x}'\|_2\Big\|_2$$

In words, we are looking for a point in the $d$-dimensional subspace $A$ that is as close as possible to its closest point in the $n$-dimensional subspace $B$. Furthermore, points within the subspace $A$ can be parametrized by a $d$-dimensional vector $\boldsymbol{\theta}$ as $\mathbf{x}(\boldsymbol{\theta}) = \boldsymbol{\theta}\mathbf{M} + \mathbf{x}_0$; for all choices of $\boldsymbol{\theta}$, the associated vector $\mathbf{x}(\boldsymbol{\theta})$ is in the subspace $A$.

Without loss of generality, let us consider the case where the basis vectors of the subspace $B$ are aligned with $n$ dimensions of the coordinate system (we can rotate our coordinate system such that this is true). Call the remaining $s = D - n$ axes the short directions of the subspace $B$. The distance from a point $\mathbf{x}$ to the subspace $B$ now depends only on its coordinates $x_1,\dots,x_s$ along these short directions. Under our assumption of the alignment of subspace $B$ we then have:

$$l^2(\mathbf{x}, B) := \min_{\mathbf{x}'\in B}\|\mathbf{x}-\mathbf{x}'\|_2^2 = \sum_{i=1}^s x_i^2$$

The only coordinates influencing the distance are the first $s$ values, and thus let us consider a restriction of the original space to only those coordinates without loss of generality. Now $\mathbf{x}_0$ and the rows of $\mathbf{M}$ have $s$ components each, and the distance between a point within the subspace $A$ parameterized by the vector $\boldsymbol{\theta}$ and the subspace $B$ is given by $l^2(\mathbf{x}(\boldsymbol{\theta}), B) = \|\boldsymbol{\theta}\mathbf{M} + \mathbf{x}_0\|_2^2$.

The distance attains its minimum for

$$\partial_{\boldsymbol{\theta}}\, l^2(\mathbf{x}(\boldsymbol{\theta}), B) = 2\,(\boldsymbol{\theta}\mathbf{M} + \mathbf{x}_0)\,\mathbf{M}^\top = \mathbf{0}$$

yielding the optimality condition $\boldsymbol{\theta}^*\mathbf{M}\mathbf{M}^\top = -\mathbf{x}_0\mathbf{M}^\top$. There are 3 cases based on the relationship between $d$ and $s$.

1. The overdetermined case, $d > s$. Here the optimal $\boldsymbol{\theta}^*$ belongs to a $(d-s)$-dimensional family of solutions that attain distance $0$ to the plane $B$. In this case the affine subspaces $A$ and $B$ intersect and share a $(d-s)$-dimensional intersection.

2. The unique solution case, $d = s$. Here the solution is a unique $\boldsymbol{\theta}^* = -\mathbf{x}_0\mathbf{M}^{-1}$. After plugging this back into the distance equation, we obtain:

$$l^2(\mathbf{x}(\boldsymbol{\theta}^*), B) = \|-\mathbf{x}_0\mathbf{M}^{-1}\mathbf{M} + \mathbf{x}_0\|^2 = \|-\mathbf{x}_0 + \mathbf{x}_0\|^2 = 0.$$

The matrix $\mathbf{M}$ is square in this case and cancels with its inverse $\mathbf{M}^{-1}$.

3. The underdetermined case, $d < s$. Here there is generically no intersection between the subspaces. The inverse of $\mathbf{M}$ is replaced by the Moore–Penrose pseudoinverse $\mathbf{M}^+$, and the closest distance is:

$$l^2(\mathbf{x}(\boldsymbol{\theta}^*), B) = \|-\mathbf{x}_0\mathbf{M}^+\mathbf{M} + \mathbf{x}_0\|^2$$

Before our restriction from $D$ to $s$ dimensions, the matrix $\mathbf{M}$ consisted of $d$ mutually orthogonal $D$-dimensional vectors of unit length each. We will consider these vectors to be component-wise random, each component with variance $1/D$ to satisfy this condition on average. After restricting our space to $s$ dimensions, $\mathbf{M}$'s vectors are reduced to $s$ components each, keeping their variance $1/D$. They are still mutually orthogonal in expectation, but their lengths are reduced to $\sqrt{s/D}$. The transpose of the pseudoinverse $\mathbf{M}^+$ consists of vectors in the same directions, with their lengths scaled up to $\sqrt{D/s}$. That means that, in expectation, $\mathbf{M}^+\mathbf{M}$ is a diagonal matrix with $d$ diagonal components set to $1$ and the remaining $s-d$ set to $0$, while the identity matrix contains ones on all $s$ diagonal entries. The projection $-\mathbf{x}_0\mathbf{M}^+\mathbf{M} + \mathbf{x}_0$ therefore retains, in expectation, a fraction $(s-d)/s$ of the squared norm of the restricted $\mathbf{x}_0$, whose own expected squared norm is proportional to $s/D$. The expected distance between the $d$-dimensional subspace $A$ and the $n$-dimensional subspace $B$ is therefore:

$$\mathbb{E}[d(A,B)] \;\propto\; \begin{cases} \sqrt{\dfrac{D-n-d}{D}} & n+d < D \\[6pt] 0 & n+d \ge D \end{cases}$$
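The intersection condition can be confirmed by direct simulation: the minimum distance between two random affine subspaces is a least-squares residual, which is generically positive when $n + d < D$ and zero when $n + d \ge D$ (a NumPy sketch; the function name is ours):

```python
import numpy as np

def subspace_distance(D, d, n, rng):
    """Minimum distance between random d- and n-dimensional affine subspaces of R^D."""
    U = np.linalg.qr(rng.normal(size=(D, d)))[0]   # orthonormal basis of A
    V = np.linalg.qr(rng.normal(size=(D, n)))[0]   # orthonormal basis of B
    xa, xb = rng.normal(size=D), rng.normal(size=D)
    # min over (alpha, beta) of ||xa + U alpha - xb - V beta|| is a
    # least-squares residual for the stacked system [U, -V].
    C = np.hstack([U, -V])
    coeffs = np.linalg.lstsq(C, xb - xa, rcond=None)[0]
    return np.linalg.norm(C @ coeffs - (xb - xa))

rng = np.random.default_rng(0)
d_far = subspace_distance(100, 20, 30, rng)   # n + d = 50 < 100: no intersection
d_hit = subspace_distance(100, 60, 50, rng)   # n + d = 110 >= 100: generic intersection
```

In the first call $n + d < D$ and the distance is positive; in the second $n + d \ge D$ and it vanishes up to numerical precision.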

To summarize, for a space of dimension $D$, two affine subspaces generically intersect provided that their dimensions $n$ and $d$ add up to at least the ambient (full) dimension of the space. The exact condition for intersection is $n + d \ge D$, and the threshold training dimension for hitting an $n$-dimensional affine subspace is therefore $D - n$. This result provides two main points of contrast to the quadratic well:

• Even extended directions are not infinite for the quadratic well. While in the case of the affine subspaces even a slight non-coplanarity of the target affine subspace and the random training subspace will eventually lead to an intersection, this is not the case for the sublevel sets of the quadratic well. Even its small eigenvalues, i.e. shallow directions, still have a finite extent for all finite $\epsilon$.

• Distance independence of the threshold training dimension. Because the quadratic well's dimensions have finite extent, the distance independence of the threshold training dimension for affine subspaces does not carry over to the case of quadratic wells. In the main text, this dependence on distance is calculated by projecting the set onto the unit sphere around the initialization, enabling us to apply Gordon's escape theorem.