PCA of high dimensional random walks with comparison to neural network training

06/22/2018 ∙ by Joseph M. Antognini, et al. ∙ Google

One technique to visualize the training of neural networks is to perform PCA on the parameters over the course of training and to project to the subspace spanned by the first few PCA components. In this paper we compare this technique to the PCA of a high dimensional random walk. We compute the eigenvalues and eigenvectors of the covariance of the trajectory and prove that in the long trajectory and high dimensional limit most of the variance is in the first few PCA components, and that the projection of the trajectory onto any subspace spanned by PCA components is a Lissajous curve. We generalize these results to a random walk with momentum and to Ornstein-Uhlenbeck processes (i.e., random walks in a quadratic potential) and show that in high dimensions the walk is not mean reverting, but will instead be trapped at a fixed distance from the minimum. We finally compare the distribution of PCA variances and the PCA projected training trajectories of a linear model trained on CIFAR-10 and ResNet-50-v2 trained on Imagenet and find that the distribution of PCA variances resembles a random walk with drift.

1 Introduction

Deep neural networks (NNs) are extremely high dimensional objects. A popular deep NN for image recognition tasks like ResNet-50 (He et al., 2016) has 25 million parameters, and it is common for language models to have more than one billion parameters (Jozefowicz et al., 2016). This overparameterization may be responsible for NNs' impressive generalization performance (Novak et al., 2018). At the same time, the high dimensional nature of NNs makes them very difficult to reason about.

Over the decades of NN research, the common lore about the geometry of the loss landscape of NNs has changed dramatically. In the early days of NN research it was believed that NNs were difficult to train because they tended to get stuck in suboptimal local minima. Dauphin et al. (2014) and Choromanska et al. (2015) argued that this is unlikely to be a problem for most loss landscapes because local minima will tend not to be much worse than global minima. There are, however, many other plausible properties of the geometry of NN loss landscapes that could pose obstacles to NN optimization. These include: saddle points, vast plateaus where the gradient is very small, cliffs where the loss suddenly increases or decreases, winding canyons, and local maxima that must be navigated around.

Ideally we would like to be able to somehow visualize the loss landscapes of NNs, but this is a difficult, perhaps even futile, task because it involves embedding this extremely high dimensional space into very few dimensions, typically one or two. Goodfellow et al. (2014) introduced a visualization technique that consists of plotting the loss along a straight line from the initial point to the final point of training (the “royal road”). The authors found that the loss often decreased monotonically along this path. They further considered the loss in the space formed by the residuals between the NN’s trajectory and this royal road. Note that while this is a two-dimensional manifold, it is not a linear subspace. Lorch (2016) proposed another visualization technique in which principal component analysis (PCA) is performed on the NN trajectory and the trajectory is projected into the subspace spanned by the lowest PCA components. This technique was further explored by Li et al. (2018), who noted that most of the variance is in the first two PCA components.

In this paper we consider the theory behind this visualization technique. We show that PCA projections of random walks in flat space qualitatively have many of the same properties as projections of NN training trajectories. We then generalize these results to a random walk with momentum and a random walk in a quadratic potential, also known as an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930). This process is more similar to NN optimization since it consists of a deterministic component (the true gradient) plus a stochastic component. In fact, recent work has suggested that stochastic gradient descent (SGD) approximates a random walk in a quadratic potential (Ahn et al., 2012; Mandt et al., 2016; Smith & Le, 2017). Finally, we perform experiments on linear models and large NNs to show how closely they match this simplified model.

The approach we take to study the properties of the PCA of high dimensional random walks in flat space follows that of Moore & Ahmed (2017), but we correct several errors in their argument, notably in the values of the matrix $\mathbf{S}^T\mathbf{S}$ and the trace of $(\mathbf{S}^T\mathbf{S})^{-1}$ in Eq. 10. We also fill in some critical omissions, particularly the connection between banded Toeplitz matrices and circulant matrices. We extend their contribution by proving that the trajectories of high dimensional random walks in PCA subspaces are Lissajous curves and by generalizing to random walks with momentum and Ornstein-Uhlenbeck processes.

2 PCA of random walks in flat space

2.1 Preliminaries

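Let us consider a random walk in $d$-dimensional space consisting of $n$ steps where every step is equal to the previous step plus a sample from an arbitrary probability distribution, $\mathcal{P}$, with zero mean and a finite covariance matrix. (Footnote 1: The case of a constant non-zero mean corresponds to a random walk with a constant drift term. This is not an especially interesting extension from the perspective of PCA because in the limit of a large number of steps the first PCA component will simply pick out the direction of the drift (i.e., the mean), and the remaining PCA components will behave as a random walk without a drift term.) For simplicity we shall assume that the covariance matrix has been normalized so that its trace is 1. This process can be written in the form

$$\mathbf{x}_t = \mathbf{x}_{t-1} + \boldsymbol{\xi}_t, \qquad \boldsymbol{\xi}_t \sim \mathcal{P} \tag{1}$$

where $\mathbf{x}_t$ is a $d$-dimensional vector and $\mathbf{x}_0 = \mathbf{0}$. If we collect the $\mathbf{x}_t$s together in an $n \times d$ dimensional design matrix $\mathbf{X}$, we can then write this entire process in matrix form as

$$\mathbf{S}\mathbf{X} = \mathbf{R} \tag{2}$$

where the matrix $\mathbf{S}$ is an $n \times n$ matrix consisting of 1 along the diagonal and -1 along the subdiagonal,

$$\mathbf{S} \equiv \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix} \tag{3}$$

and the matrix $\mathbf{R}$ is an $n \times d$ matrix where every row is a sample from $\mathcal{P}$. Thus $\mathbf{X} = \mathbf{S}^{-1}\mathbf{R}$.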
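As a concrete illustration of the matrix form of the walk, the following minimal NumPy sketch builds $\mathbf{S}$ (1 on the diagonal, -1 on the subdiagonal) and a noise matrix with one sample per row, then checks that solving $\mathbf{S}\mathbf{X} = \mathbf{R}$ reproduces a cumulative sum of the noise. The sizes `n` and `d`, the Gaussian noise, and the seed are illustrative assumptions rather than settings from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4  # illustrative sizes (assumed): n steps, d dimensions

# S has 1 on the diagonal and -1 on the subdiagonal.
S = np.eye(n) - np.eye(n, k=-1)

# R holds one noise sample per row; the scale makes the noise covariance have trace 1.
R = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))

# X = S^{-1} R, obtained by solving S X = R rather than forming the inverse.
X = np.linalg.solve(S, R)

# S^{-1} itself is lower triangular with ones on and below the diagonal,
# which is why X is simply the running sum of the noise.
S_inv = np.linalg.inv(S)
print(np.allclose(X, np.cumsum(R, axis=0)))
```

Solving the linear system is used here only to mirror the matrix notation; in practice `np.cumsum` is the cheaper way to generate the trajectory.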

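To perform PCA, we need to compute the eigenvalues and eigenvectors of the covariance matrix $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$, where $\hat{\mathbf{X}}$ is the matrix $\mathbf{X}$ with the mean of every dimension across all steps subtracted. $\hat{\mathbf{X}}$ can be found by applying the centering matrix, $\mathbf{C}$:

$$\hat{\mathbf{X}} = \mathbf{C}\mathbf{X}, \qquad \mathbf{C} \equiv \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^T \tag{4}$$

We now note that the analysis is simplified considerably by instead finding the eigenvalues and eigenvectors of the matrix $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$. The non-zero eigenvalues of $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ are the same as those of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$. The eigenvectors are similarly related by $\mathbf{v}_k = \hat{\mathbf{X}}^T\mathbf{u}_k$, where $\mathbf{v}_k$ is a (non-normalized) eigenvector of $\hat{\mathbf{X}}^T\hat{\mathbf{X}}$, and $\mathbf{u}_k$ is the corresponding eigenvector of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$.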
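The duality between the large covariance matrix and the small step-by-step Gram matrix can be checked numerically. The sketch below is a minimal example under assumed sizes (`n = 8`, `d = 50`) and Gaussian noise; it verifies that the non-zero eigenvalues agree and that mapping an eigenvector of the small matrix through the transposed centered data gives an eigenvector of the large one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 50  # illustrative sizes (assumed)

# A short random walk: cumulative sum of zero-mean noise.
X = np.cumsum(rng.normal(scale=d**-0.5, size=(n, d)), axis=0)

C = np.eye(n) - np.ones((n, n)) / n  # centering matrix
Xc = C @ X                           # subtract the per-dimension mean

small = Xc @ Xc.T  # n x n Gram matrix
big = Xc.T @ Xc    # d x d covariance (unnormalized)

lam_small, U = np.linalg.eigh(small)  # eigenvalues in ascending order
lam_big = np.linalg.eigvalsh(big)

# Map the top eigenvector of the small matrix to one of the big matrix.
u = U[:, -1]
v = Xc.T @ u
residual = big @ v - lam_small[-1] * v
print(np.abs(residual).max())
```

Working with the $n \times n$ matrix is what makes the analysis (and, for long parameter vectors, the computation) tractable when $d$ is much larger than $n$.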

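We therefore would like to find the eigenvalues and eigenvectors of the matrix

$$\hat{\mathbf{X}}\hat{\mathbf{X}}^T = \mathbf{C}\mathbf{S}^{-1}\mathbf{R}\mathbf{R}^T\mathbf{S}^{-T}\mathbf{C} \tag{5}$$

where we note that $\mathbf{C}^T = \mathbf{C}$. Consider the middle term, $\mathbf{R}\mathbf{R}^T$. In the limit $d \gg n$ we will have $\mathbf{R}\mathbf{R}^T \to \mathbf{I}$ because the off diagonal terms will be $\mathbb{E}[\boldsymbol{\xi}_i \cdot \boldsymbol{\xi}_j] = 0$, whereas the diagonal terms will be $\mathbb{E}[\|\boldsymbol{\xi}\|^2] = \mathrm{Tr}(\boldsymbol{\Sigma}) = 1$, where $\boldsymbol{\Sigma}$ is the covariance of the noise distribution. (Recall that we have assumed that the covariance of the noise distribution is normalized; if the covariance is not normalized, this simply introduces an overall scale factor given by the trace of the covariance.) We therefore have the simplification

$$\hat{\mathbf{X}}\hat{\mathbf{X}}^T = \mathbf{C}\mathbf{S}^{-1}\mathbf{S}^{-T}\mathbf{C} \tag{6}$$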
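The step from Eq. 5 to Eq. 6 rests on the product of the noise matrix with its transpose approaching the identity when the dimension greatly exceeds the number of steps. A minimal sketch of that concentration, assuming isotropic Gaussian noise with trace-normalized covariance and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20  # number of steps (assumed, kept small for illustration)


def max_deviation(d):
    """Largest entrywise deviation of R R^T from the identity for dimension d."""
    R = rng.normal(scale=d**-0.5, size=(n, d))  # rows are noise samples
    return np.abs(R @ R.T - np.eye(n)).max()


dev_low = max_deviation(200)      # d only 10x larger than n
dev_high = max_deviation(20000)   # d 1000x larger than n
print(dev_low, dev_high)
```

The deviation shrinks roughly like the inverse square root of the dimension, which is why the theory is stated in the limit of dimension much larger than the number of steps.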

2.2 Asymptotic convergence to circulant matrices

Let us consider the new middle term, $\mathbf{S}^{-1}\mathbf{S}^{-T} = (\mathbf{S}^T\mathbf{S})^{-1}$. The matrix $\mathbf{S}$ is a banded Toeplitz matrix. Gray (2006) has shown that banded Toeplitz matrices asymptotically approach circulant matrices as the size of the matrix grows. In particular, Gray (2006) showed that banded Toeplitz matrices have the same inverses, eigenvalues, and eigenvectors as their corresponding circulant matrices in this asymptotic limit (see especially theorem 4.1 and subsequent material from Gray 2006). Thus in our case, if we consider the limit of a large number of steps, $\mathbf{S}$ asymptotically approaches a circulant matrix $\tilde{\mathbf{S}}$ that is equal to $\mathbf{S}$ in every entry except the top right, where there appears a $-1$ instead of a 0. (Footnote 2: We note in passing that $\tilde{\mathbf{S}}$ is the exact representation of a closed random walk.)

With the circulant limiting behavior of $\mathbf{S}$ in mind, the problem simplifies considerably. We note that $\mathbf{C}$ is also a circulant matrix, the product of two circulant matrices is circulant, the transpose of a circulant matrix is circulant, and the inverse of a circulant matrix is circulant. Thus the matrix $\mathbf{C}\mathbf{S}^{-1}\mathbf{S}^{-T}\mathbf{C}$ is asymptotically circulant as $n \to \infty$. Finding the eigenvectors is trivial because the eigenvectors of all circulant matrices are the Fourier modes. To find the eigenvalues we must explicitly consider the values of $\mathbf{S}^T\mathbf{S}$. The matrix $\mathbf{S}^T\mathbf{S}$ consists of a 2 along the diagonal, -1 along the subdiagonal and superdiagonal, and 0 elsewhere, with the exception of the bottom right corner where there appears a 1 instead of a 2.

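While this matrix is not a banded Toeplitz matrix, it is asymptotically equivalent to one because it differs from a banded Toeplitz matrix by a finite amount in a single location. We now note that multiplication by the centering matrix does not change either the eigenvectors or the eigenvalues of this matrix since all vectors with zero mean are eigenvectors of the centering matrix with eigenvalue 1, and all Fourier modes but the first have zero mean. Thus the eigenvalues of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$ can be determined by the inverse of the non-zero eigenvalues of $\mathbf{S}^T\mathbf{S}$, which is an asymptotically circulant matrix. The $k$th eigenvalue of a circulant matrix with entries $c_0, c_1, \ldots, c_{n-1}$ in the first row is

$$\lambda_{\mathrm{circ},k} = c_0 + c_1\omega_k + c_2\omega_k^2 + \ldots + c_{n-1}\omega_k^{n-1} \tag{7}$$

where $\omega_k$ is the $k$th root of unity. The imaginary parts of the roots of unity cancel out, leaving the $k$th eigenvalue of $\mathbf{S}^T\mathbf{S}$ to be

$$\lambda_{\mathbf{S}^T\mathbf{S},k} = 2\left[1 - \cos\left(\frac{\pi k}{n}\right)\right] \tag{8}$$

and the $k$th eigenvalue of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$ to be

$$\lambda_{\hat{\mathbf{X}}\hat{\mathbf{X}}^T,k} = \frac{1}{2}\left[1 - \cos\left(\frac{\pi k}{n}\right)\right]^{-1} \tag{9}$$

Here $k = 1, \ldots, n-1$ indexes the components in order of decreasing variance. The frequencies $\pi k/n$ are those of the cosine modes of the open walk; the strictly periodic circulant approximation has frequencies $2\pi k/n$ that occur in degenerate pairs, and the two spectra coincide in distribution as $n \to \infty$.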
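The eigenvalue prediction can be compared directly with a numerical diagonalization. The sketch below assumes the form $\lambda_k = \frac{1}{2}[1 - \cos(\pi k / n)]^{-1}$ for Eq. 9 and builds the centered matrix $\mathbf{C}\mathbf{S}^{-1}\mathbf{S}^{-T}\mathbf{C}$ explicitly for an illustrative $n = 50$; the variable names are hypothetical.

```python
import numpy as np

n = 50  # number of steps (assumed, illustrative)

S = np.eye(n) - np.eye(n, k=-1)        # 1 on the diagonal, -1 on the subdiagonal
C = np.eye(n) - np.ones((n, n)) / n    # centering matrix
S_inv = np.linalg.inv(S)

# Idealized Gram matrix of the centered walk (noise Gram matrix replaced by I).
G = C @ S_inv @ S_inv.T @ C

# Keep the n-1 non-zero eigenvalues, largest first (PCA ordering).
empirical = np.sort(np.linalg.eigvalsh(G))[::-1][: n - 1]

k = np.arange(1, n)
predicted = 0.5 / (1.0 - np.cos(np.pi * k / n))  # assumed form of Eq. 9

max_rel_err = np.max(np.abs(empirical - predicted) / predicted)
print(max_rel_err)
```

For this idealized matrix the agreement is at the level of floating point error, because the centered matrix is the pseudo-inverse of the Laplacian of a path graph, whose spectrum is known in closed form; with a finite-dimensional noise matrix the empirical eigenvalues fluctuate around these values.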

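The sum of the eigenvalues is given by the trace of $(\mathbf{S}^T\mathbf{S})^{-1} = \mathbf{S}^{-1}\mathbf{S}^{-T}$, and $\mathbf{S}^{-1}$ is given by a lower triangular matrix with ones everywhere on and below the diagonal. The trace of $\mathbf{S}^{-1}\mathbf{S}^{-T}$ is therefore given by

$$\mathrm{Tr}\left(\mathbf{S}^{-1}\mathbf{S}^{-T}\right) = \sum_{i=1}^{n} i = \frac{1}{2}n(n+1) \tag{10}$$

and so the explained variance ratio from the $k$th PCA component, $\rho_k$, in the limit $n \to \infty$ is

$$\rho_k \equiv \frac{\lambda_k}{\mathrm{Tr}\left(\mathbf{S}^{-1}\mathbf{S}^{-T}\right)} = \frac{\left[1 - \cos\left(\pi k / n\right)\right]^{-1}}{n(n+1)} \tag{11}$$

If we let $n \to \infty$ we can consider only the first term in a Taylor expansion of the cosine term, $1 - \cos(\pi k/n) \approx \pi^2 k^2 / 2n^2$, so that $\rho_k \propto k^{-2}$. Requiring that $\sum_{k=1}^{\infty}\rho_k = 1$ and using $\sum_{k=1}^{\infty} k^{-2} = \pi^2/6$, the explained variance ratio is

$$\rho_k = \frac{6}{\pi^2 k^2} \tag{12}$$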

We test Eq. 12 empirically in Fig. 5 in the supplementary material.

We pause here to marvel that the explained variance ratio of a random walk in the limit of infinite dimensions is highly skewed towards the first few PCA components. Roughly 60% of the variance is explained by the first component, 80% by the first two components, 95% by the first 12 components, and 99% by the first 66 components.
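These cumulative fractions follow from summing the explained variance ratios. A minimal sketch, assuming the form $\rho_k = 6 / (\pi^2 k^2)$ for Eq. 12 and truncating the infinite sum at an arbitrary large cutoff:

```python
import numpy as np

# Explained variance ratio of the k-th component (assumed form of Eq. 12).
k = np.arange(1, 100001, dtype=float)  # cutoff of 1e5 terms is an arbitrary choice
rho = 6.0 / (np.pi**2 * k**2)

# Cumulative fraction of variance captured by the first m components.
cumulative = np.cumsum(rho)
for m in (1, 2, 12, 66):
    print(m, cumulative[m - 1])
```

Because the ratios fall off as $k^{-2}$, the remaining variance after $m$ components decays only as about $6 / (\pi^2 m)$, so each additional percent of variance requires many more components than the last.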

2.3 Projection of the trajectory onto PCA components

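Let us now turn to the trajectory of the random walk when projected onto the PCA components. The trajectory projected onto the $k$th PCA component is

$$\mathbf{X}_{\mathrm{PCA},k} = \hat{\mathbf{X}}\hat{\mathbf{v}}_k \tag{13}$$

where $\hat{\mathbf{v}}_k$ is the normalized $\mathbf{v}_k$. We ignore the centering operation from here on because it changes neither the eigenvectors nor the eigenvalues. From above, we then have

$$\mathbf{X}_{\mathrm{PCA},k} = \frac{1}{\|\mathbf{v}_k\|}\mathbf{X}\mathbf{X}^T\mathbf{u}_k = \frac{\lambda_k}{\|\mathbf{v}_k\|}\mathbf{u}_k \tag{14}$$

By the symmetry of the eigenvalue equations $\mathbf{X}^T\mathbf{X}\mathbf{v} = \lambda\mathbf{v}$ and $\mathbf{X}\mathbf{X}^T\mathbf{u} = \lambda\mathbf{u}$, it can be shown that, for normalized $\mathbf{u}_k$,

$$\|\mathbf{v}_k\| = \|\mathbf{X}^T\mathbf{u}_k\| = \sqrt{\lambda_k} \tag{15}$$

Since $\mathbf{u}_k$ is simply the $k$th Fourier mode, we therefore have

$$\mathbf{X}_{\mathrm{PCA},k}(t) = \sqrt{\frac{2\lambda_k}{n}}\cos\left(\frac{\pi k t}{n}\right) \tag{16}$$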

This implies that the random walk trajectory projected into the subspace spanned by two PCA components will be a Lissajous curve. In Fig. 1 we plot the trajectories of a high dimensional random walk projected to various PCA components and compare to the corresponding Lissajous curves. We perform 1000 steps of a random walk in 10,000 dimensions and find an excellent correspondence between the empirical and analytic trajectories. We additionally show the projection onto the first few PCA components over time in Fig. 6 in the supplementary material.
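A smaller version of this experiment can be sketched in a few lines. The example below assumes the cosine form $\sqrt{2\lambda_k/n}\,\cos(\pi k t/n)$ for Eq. 16 with $\lambda_k = \frac{1}{2}[1-\cos(\pi k/n)]^{-1}$, uses isotropic Gaussian noise, and picks `n = 200` steps in `d = 2000` dimensions purely to keep the run cheap; PCA is done with a plain SVD rather than any particular library's PCA class.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 2000  # illustrative sizes (assumed); the paper uses 1000 steps in 10,000 dims

# Random walk with trace-normalized isotropic Gaussian noise.
X = np.cumsum(rng.normal(scale=d**-0.5, size=(n, d)), axis=0)
Xc = X - X.mean(axis=0)

# PCA via SVD: column k of U * s is the trajectory projected on component k+1.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = U * s

t = np.arange(1, n + 1)


def predicted(k):
    """Assumed analytic projection onto the k-th component (Eq. 16)."""
    lam = 0.5 / (1.0 - np.cos(np.pi * k / n))
    return np.sqrt(2.0 * lam / n) * np.cos(np.pi * k * t / n)


# The sign of each principal component is arbitrary, so compare |correlation|.
corr = [abs(np.corrcoef(proj[:, k - 1], predicted(k))[0, 1]) for k in (1, 2, 3)]
print(corr)
```

Plotting `proj[:, j]` against `proj[:, i]` for pairs of components gives the Lissajous figures; the correlation check here is only a compact stand-in for that visual comparison.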

Figure 1: The PCA projections of the trajectories of high dimensional random walks are Lissajous curves. Left tableau: Projections of a 10,000-dimensional random walk onto various PCA components. Right tableau: Corresponding Lissajous curves from Eq. 16.

While our experiments thus far have used an isotropic Gaussian distribution for ease of computation, we emphasize that these results are completely general for any probability distribution with zero mean and a finite covariance matrix with rank much larger than the number of steps. We include the PCA projections and eigenvalue distributions of random walks using non-isotropic multivariate Gaussian distributions in Figs. 7 and 8 in the supplementary material.

3 Generalizations

3.1 Random walk with momentum

It is a common practice to train neural networks using stochastic gradient descent with momentum. It is therefore interesting to examine the case of a random walk with momentum. In this case, the process is governed by the following set of updates:

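$$\mathbf{v}_t = \gamma\mathbf{v}_{t-1} + \boldsymbol{\xi}_t \tag{17}$$
$$\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{v}_t \tag{18}$$

It can be seen that this modifies Eq. 2 to instead read

$$\mathbf{S}\mathbf{X} = \mathbf{M}\mathbf{R} \tag{19}$$

where $\mathbf{M}$ is a lower triangular Toeplitz matrix with 1 on the diagonal and $\gamma^k$ on the $k$th subdiagonal. The analysis from Section 2 is unchanged, except that now instead of considering the matrix $\mathbf{S}^{-1}\mathbf{S}^{-T}$ we have the matrix $\mathbf{S}^{-1}\mathbf{M}\mathbf{M}^T\mathbf{S}^{-T}$. Although $\mathbf{M}$ is not a banded Toeplitz matrix, its terms decay exponentially to zero for terms very far from the main diagonal. It is therefore asymptotically circulant as well, and the eigenvectors remain Fourier modes. To find the eigenvalues consider the product $(\mathbf{S}^{-1}\mathbf{M}\mathbf{M}^T\mathbf{S}^{-T})^{-1} = \mathbf{S}^T\mathbf{M}^{-T}\mathbf{M}^{-1}\mathbf{S}$, noting that $\mathbf{M}^{-1}$ is a matrix with 1s along the main diagonal and $-\gamma$s along the subdiagonal. With some tedious calculation it can be seen that the matrix $\mathbf{S}^T\mathbf{M}^{-T}\mathbf{M}^{-1}\mathbf{S}$ is given by

$$\left[\mathbf{S}^T\mathbf{M}^{-T}\mathbf{M}^{-1}\mathbf{S}\right]_{ij} = \begin{cases} 2 + 2\gamma + 2\gamma^2, & i = j \\ -(1+\gamma)^2, & |i - j| = 1 \\ \gamma, & |i - j| = 2 \\ 0, & \text{otherwise} \end{cases} \tag{20}$$

with the exception of the bottom right corner, where the $(n, n)$ entry is 1, the $(n-1, n-1)$ entry is $2 + 2\gamma + \gamma^2$, and the $(n, n-1)$ and $(n-1, n)$ entries are $-(1+\gamma)$. As before, this matrix is asymptotically circulant, so the eigenvalues of its inverse are

$$\lambda_k = \left[2\left(1 + \gamma + \gamma^2\right) - 2(1+\gamma)^2\cos\left(\frac{\pi k}{n}\right) + 2\gamma\cos\left(\frac{2\pi k}{n}\right)\right]^{-1} \tag{21}$$

In the limit of $\gamma \to 0$, the distribution of eigenvalues is identical to that of a random walk in flat space; for finite $\gamma$, however, momentum has the effect of shifting the distribution towards the lower PCA components. We empirically test Eq. 21 in Fig. 9 in the supplementary material.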
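The matrix bookkeeping for the momentum case can be verified on a small example. The sketch below assumes the updates $\mathbf{v}_t = \gamma\mathbf{v}_{t-1} + \boldsymbol{\xi}_t$ and $\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{v}_t$ with zero initial velocity and position, and checks both the matrix form $\mathbf{S}\mathbf{X} = \mathbf{M}\mathbf{R}$ and the band structure of $\mathbf{S}^T\mathbf{M}^{-T}\mathbf{M}^{-1}\mathbf{S}$; the sizes and the value of `gamma` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, gamma = 12, 5, 0.7  # illustrative values (assumed)
R = rng.normal(size=(n, d))  # one noise sample per row

# Direct simulation of the momentum updates, starting from v_0 = x_0 = 0.
X = np.zeros((n, d))
v = np.zeros(d)
x = np.zeros(d)
for t in range(n):
    v = gamma * v + R[t]
    x = x + v
    X[t] = x

S = np.eye(n) - np.eye(n, k=-1)               # 1 on diagonal, -1 on subdiagonal
M_inv = np.eye(n) - gamma * np.eye(n, k=-1)   # 1 on diagonal, -gamma on subdiagonal
M = np.linalg.inv(M_inv)                      # gamma**k on the k-th subdiagonal

# Band structure of S^T M^{-T} M^{-1} S, written as A^T A with A = M^{-1} S.
A = M_inv @ S
G = A.T @ A

i = 4  # an interior row, away from the corner corrections
print(G[i, i], G[i, i + 1], G[i, i + 2])
print(np.allclose(S @ X, M @ R))
```

The interior entries depend only on `gamma`, which is what makes the matrix asymptotically Toeplitz; the corner entries differ because the walk has a finite end.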

3.2 Discrete Ornstein-Uhlenbeck processes

A useful generalization of the above analysis of random walks in flat space is to consider random walks in a quadratic potential, also known as an AR(1) process or a discrete Ornstein-Uhlenbeck process. For simplicity we will assume that the potential has its minimum at the origin. Now every step consists of a stochastic component and a deterministic component which points toward the origin and is proportional in magnitude to the distance from the origin. In this case the update equation can be written

$$\mathbf{x}_t = (1 - \alpha)\mathbf{x}_{t-1} + \boldsymbol{\xi}_t \tag{22}$$

where $\alpha$ measures the strength of the potential. In the limit $\alpha \to 0$ the potential disappears and we recover a random walk in flat space. In the limit $\alpha \to 1$ the potential becomes infinitely strong and we recover independent samples from a multivariate Gaussian distribution. For $1 < \alpha < 2$ the steps will oscillate across the origin. For $\alpha$ outside $[0, 2]$ the updates diverge exponentially.

3.2.1 Analysis of eigenvectors and eigenvalues

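This analysis proceeds similarly to the analysis in Section 2 except that instead of $\mathbf{S}$ we now have the matrix $\mathbf{S}_{\mathrm{OU}}$ which has 1s along the diagonal and $-(1-\alpha)$ along the subdiagonal. $\mathbf{S}_{\mathrm{OU}}$ remains a banded Toeplitz matrix and so the arguments from Sec. 2 that $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$ is asymptotically circulant hold. This implies that the eigenvectors of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$ are still Fourier modes. The eigenvalues will differ, however, because we now have that the components of $\mathbf{S}_{\mathrm{OU}}^T\mathbf{S}_{\mathrm{OU}}$ are given by

$$\left[\mathbf{S}_{\mathrm{OU}}^T\mathbf{S}_{\mathrm{OU}}\right]_{ij} = \begin{cases} 1 + (1-\alpha)^2, & i = j \\ -(1-\alpha), & |i - j| = 1 \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

apart from the bottom right corner entry, which equals 1. From Eq. 7 we have that the $k$th eigenvalue of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T$ is

$$\lambda_{\mathrm{OU},k} = \left[1 + (1-\alpha)^2 - 2(1-\alpha)\cos\left(\frac{\pi k}{n}\right)\right]^{-1} \tag{24}$$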

We show in Fig. 2 a comparison between the eigenvalue distribution predicted from Eq. 24 and the observed distribution from a 3000 step Ornstein-Uhlenbeck process in 30,000 dimensions for several values of $\alpha$. There is generally an extremely tight correspondence between the two. The exception is in the limit of $\alpha \to 1$, where there is a catch which we have hitherto neglected. While it is true that the mean eigenvalue of any eigenvector approaches the same constant, there is nevertheless going to be some distribution of eigenvalues for any finite walk. Because PCA sorts the eigenvalues, there will be a characteristic deviation from a flat distribution.
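The asymptotic eigenvalue formula can also be checked without sampling any noise. The sketch below assumes the form $\lambda_{\mathrm{OU},k} = [1 + (1-\alpha)^2 - 2(1-\alpha)\cos(\pi k/n)]^{-1}$ for Eq. 24 and compares it with the exact spectrum of $(\mathbf{S}_{\mathrm{OU}}^T\mathbf{S}_{\mathrm{OU}})^{-1}$. As a simplification, the centering matrix is omitted here, and `n` and `alpha` are illustrative choices.

```python
import numpy as np

n, alpha = 400, 0.5  # illustrative values (assumed)
phi = 1.0 - alpha

# S_OU: 1 on the diagonal, -(1 - alpha) on the subdiagonal.
S_ou = np.eye(n) - phi * np.eye(n, k=-1)
G = S_ou.T @ S_ou

# Eigenvalues of the inverse, sorted largest first (PCA ordering).
empirical = np.sort(1.0 / np.linalg.eigvalsh(G))[::-1]

k = np.arange(1, n + 1)
predicted = 1.0 / (1.0 + phi**2 - 2.0 * phi * np.cos(np.pi * k / n))  # assumed Eq. 24

max_rel_err = np.max(np.abs(empirical - predicted) / predicted)
print(max_rel_err)
```

The residual discrepancy comes from the single corner entry that breaks the Toeplitz structure, and it shrinks as `n` grows, in line with the asymptotic equivalence argument.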

Figure 2: Left panel: The variance of the PCA components for several choices of $\alpha$. The empirical distribution is shown with a solid line and the predicted distribution with a dotted line. The predicted distribution generally matches the observed distribution closely, but there is a systematic deviation for $\alpha$ near 1. This is due to the fact that when the mean distribution is flat, there will nevertheless be a distribution around this mean when these eigenvalues are sampled from real data. Because PCA sorts these eigenvalues, this will always lead to a deviation from the flat distribution. Right panel: Distance from the origin for discrete Ornstein-Uhlenbeck processes with several choices of $\alpha$ (solid lines) with the predicted asymptote from Eq. 25 (dotted lines).

3.2.2 Critical distance and mixing time

While we might be tempted to take the limit $n \to \infty$ as we did in the case of a random walk in flat space, doing so would obscure interesting dynamics early in the walk. (A random walk in flat space is self-similar so we lose no information by taking this limit. This is no longer the case in an Ornstein-Uhlenbeck process because the parameter $\alpha$ sets a characteristic scale in the system.) In fact there will be two distinct phases of a high dimensional Ornstein-Uhlenbeck process initialized at the origin. In the first phase the process will behave as a random walk in flat space: the distance from the origin will increase proportionally to $\sqrt{t}$ and the variance of the $k$th PCA component will be proportional to $k^{-2}$. However, once the distance from the origin reaches a critical value, the gradient toward the origin will become large enough to balance the tendency of the random walk to drift away from the origin. (Footnote 3: Assuming we start close to the origin. If we start sufficiently far from the origin the trajectory will exponentially decay to this critical value.) At this point the trajectory will wander indefinitely around a sphere centered at the origin with radius given by this critical distance. Thus, while an Ornstein-Uhlenbeck process is mean-reverting in low dimensions, in the limit of infinite dimensions it is no longer mean-reverting: an infinite dimensional Ornstein-Uhlenbeck process will never return to its mean. (Footnote 4: Specifically, since the limiting distribution is a $d$-dimensional Gaussian, the probability that the process will return to within $\epsilon$ of the origin is $P(d/2, \epsilon^2/2\sigma^2)$, where $\sigma^2$ is the per-dimension variance of the limiting distribution and $P$ is the regularized gamma function. For small $\epsilon$ this decays exponentially with $d$.)

This critical distance can be calculated by noting that each dimension is independent of every other and it is well known that the asymptotic distribution of an AR(1) process with Gaussian noise is Gaussian with a mean of zero and a standard deviation of $\sqrt{s^2 / [1 - (1-\alpha)^2]}$, where $s^2$ is the variance of the stochastic component of the process. In high dimensions the asymptotic distribution as $n \to \infty$ is simply a multidimensional isotropic Gaussian. Because we are assuming $d \gg 1$, the overwhelming majority of points sampled from this distribution will be in a narrow annulus at a distance

$$r_c = \sqrt{\frac{d\,s^2}{\alpha(2 - \alpha)}} \tag{25}$$

from the origin, which reduces to $r_c = 1/\sqrt{\alpha(2-\alpha)}$ when the noise covariance is normalized to unit trace ($d\,s^2 = 1$). Since the distance from the origin during the initial random walk phase grows as $\sqrt{d\,s^2\,t}$, the process will start to deviate from a random walk after roughly $1/[\alpha(2-\alpha)]$ steps. We show in the right panel of Fig. 2 the distance from the origin over time for 3000 steps of Ornstein-Uhlenbeck processes in 30,000 dimensions with several different choices of $\alpha$. We compare to the prediction of Eq. 25 and find a good match.
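The two phases are easy to see in a direct simulation. The following minimal sketch assumes the update $\mathbf{x}_t = (1-\alpha)\mathbf{x}_{t-1} + \boldsymbol{\xi}_t$ with isotropic Gaussian noise of unit-trace covariance, for which the critical distance is taken to be $1/\sqrt{\alpha(2-\alpha)}$; the dimension, step count, and `alpha` are illustrative and much smaller than in the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_steps, alpha = 2000, 1500, 0.01  # illustrative values (assumed)

x = np.zeros(d)  # start at the minimum of the potential
dist = np.empty(n_steps)
for t in range(n_steps):
    # Deterministic pull toward the origin plus trace-normalized noise.
    x = (1.0 - alpha) * x + rng.normal(scale=d**-0.5, size=d)
    dist[t] = np.linalg.norm(x)

# Assumed critical distance for unit-trace noise covariance.
r_c = 1.0 / np.sqrt(alpha * (2.0 - alpha))

late_mean = dist[-200:].mean()  # distance once the walk has saturated
print(dist[9], np.sqrt(10.0))   # early phase: roughly sqrt(t) growth
print(late_mean, r_c)           # late phase: pinned near the critical distance
```

Early on the distance tracks the square root of the step count, and once the walk has run for many multiples of $1/[\alpha(2-\alpha)]$ steps it stays within a thin shell around the predicted radius rather than returning to the origin.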

3.2.3 Iterate averages converge slowly

We finally note that if the location of the minimum is unknown, then iterate (or Polyak) averaging can be used to provide a better estimate. But the number of steps must be much greater than the crossing time of the sphere, roughly $1/[\alpha(2-\alpha)]$ steps (Sec. 3.2.2), before iterate averaging will improve the estimate. Only then will the location on the sphere be approximately orthogonal to its original location on the sphere, so that the variance of the estimate of the minimum decreases as more steps are averaged. We compute the mean of converged Ornstein-Uhlenbeck processes with various choices of $\alpha$ in Fig. 10 in the supplementary material.

3.2.4 Random walks in non-isotropic potential are dominated by low curvature directions

While our analysis has been focused on the special case of a quadratic potential with equal curvature in all dimensions, a more realistic quadratic potential will have a distribution of curvatures and the axes of the potential may not be aligned with the coordinate basis. Fortunately these complications do not change the overall picture much. For a general quadratic potential described by a positive semi-definite matrix $\mathbf{A}$, we can decompose $\mathbf{A}$ into its eigenvalues and eigenvectors. We then apply a coordinate transformation to align the parameter space with the eigenvectors of $\mathbf{A}$. At this point we have a distribution of curvatures, each one given by an eigenvalue of $\mathbf{A}$. However, because we are considering the limit of infinite dimensions, we can assume that there will be a large number of dimensions that fall in any bin $[\alpha_i, \alpha_i + \mathrm{d}\alpha]$. Each of these bins can be treated as an independent high-dimensional Ornstein-Uhlenbeck process with curvature $\alpha_i$. After $n$ steps, PCA will then be dominated by dimensions for which $\alpha_i$ is small enough that the walk along them has not yet reached its critical distance, i.e., $n \lesssim 1/[\alpha_i(2-\alpha_i)]$. Thus, even if relatively few dimensions have small curvature they will come to dominate the PCA projected trajectory after enough steps.
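The dominance of low-curvature directions can be illustrated with a two-bin toy potential. In this sketch, which is an illustrative construction rather than one of the paper's experiments, 10% of the dimensions are given a small curvature and 90% a large one (both values are hypothetical), and the share of the total trajectory variance carried by the low-curvature dimensions is measured.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_steps = 2000, 500  # illustrative sizes (assumed)

# Axis-aligned potential with two curvature bins (hypothetical values).
alphas = np.full(d, 0.5)
n_flat = d // 10
alphas[:n_flat] = 1e-3  # the few nearly flat directions

x = np.zeros(d)
traj = np.empty((n_steps, d))
for t in range(n_steps):
    # Per-dimension Ornstein-Uhlenbeck update with trace-normalized noise.
    x = (1.0 - alphas) * x + rng.normal(scale=d**-0.5, size=d)
    traj[t] = x

# Variance of the trajectory along each coordinate (what PCA has to explain).
var_per_dim = traj.var(axis=0)
flat_fraction = var_per_dim[:n_flat].sum() / var_per_dim.sum()
print(flat_fraction)
```

Even though only a tenth of the coordinates are nearly flat, they carry most of the variance, because the walk along them is still in its random-walk phase while the high-curvature coordinates saturated after a handful of steps.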

4 Comparison to linear models and neural networks

While random walks and Ornstein-Uhlenbeck processes are analytically tractable, there are several important differences between these simple processes and optimization of even linear models. In particular, the statistics of the noise will depend on the location in parameter space and so will change over the course of training. Furthermore, there may be finite data or finite trajectory length effects.

To get a sense for the effect of these differences we now compare the distribution of the variances in the PCA components between two models and a random walk. For our first model we train a linear model without biases on CIFAR-10 using a learning rate of […] for 10,000 steps. For our second model we train ResNet-50-v2 on Imagenet without batch normalization for 150,000 steps using SGD with momentum and linear learning rate decay. We collect the value of all parameters at every step for the first 1500 steps, the middle 1500 steps, and the last 1500 steps of training, along with collecting the parameters every 100 steps throughout the entirety of training. Further details of both models and the training procedures can be found in the supplementary material. While PCA is tractable on a linear model of CIFAR-10, ResNet-50-v2 has 25 million parameters and performing PCA directly on the parameters is infeasible, so we instead perform a random Gaussian projection into a subspace of 30,000 dimensions. We show in Fig. 3 the distribution of the PCA variances at the beginning, middle, and end of training for both models and compare to the distribution of variances from an infinite dimensional random walk. We show tableaux of the PCA projected trajectories from the middle of training for the linear model and ResNet-50-v2 in Fig. 4. Tableaux of the other training trajectories in various PCA subspaces are shown in the supplementary material.
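The effect of the random Gaussian projection on the PCA spectrum can be sketched on a synthetic trajectory. The example below is a minimal illustration, not the pipeline used for ResNet-50-v2: the trajectory is a plain random walk, and the sizes `D` and `m` are small stand-ins for the 25 million parameters and the 30,000-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(7)
n, D, m = 100, 5000, 1000  # steps, full dimension, projected dimension (assumed)

# Synthetic stand-in for a parameter trajectory: a high dimensional random walk.
X = np.cumsum(rng.normal(scale=D**-0.5, size=(n, D)), axis=0)

# Random Gaussian projection; the 1/sqrt(m) scale roughly preserves norms.
P = rng.normal(scale=m**-0.5, size=(D, m))


def explained_variance_ratio(traj):
    """Explained variance ratios of the centered trajectory via its singular values."""
    centered = traj - traj.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / np.sum(s**2)


rho_full = explained_variance_ratio(X)
rho_proj = explained_variance_ratio(X @ P)
print(rho_full[:3], rho_proj[:3])
```

As long as the projected dimension stays much larger than the number of steps, the projected walk is itself a high dimensional random walk, so the leading explained variance ratios are nearly unchanged.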

The distribution of eigenvalues of the linear model resembles an OU process, whereas the distribution of eigenvalues of ResNet-50-v2 resembles a random walk with a large drift term. The trajectories appear almost identical to those of random walks shown in Fig. 1, with the exception that there is more variance along the first PCA component than in the random walk case, particularly at the start and end points. This manifests itself in a small outward turn of the edges of the parabola in the PCA2 vs. PCA1 projection. This suggests that ResNet-50-v2 generally moves in a consistent direction over relatively long spans of training, similarly to an Ornstein-Uhlenbeck process initialized beyond the critical distance of Eq. 25.

Figure 3: Left panel: The distribution of PCA variances at various points in training for a linear model trained on CIFAR-10. At the beginning of training the model’s trajectory is more directed than a random walk, as exhibited by the steep distribution in the lower PCA components. By the middle of training this distribution has flattened (apart from the first PCA component) and more closely resembles that of an Ornstein-Uhlenbeck process. Right panel: The distribution of PCA variances of the parameters of ResNet-50-v2 at various points in training. The distribution of PCA variances generally matches that of a random walk with the exception of the first PCA component, which dominates the distribution, particularly at the end of training.
Figure 4: Left tableau: PCA projected trajectories from the middle of training a linear model on CIFAR-10. Training has largely converged at this point, producing an approximately Gaussian distribution in the higher PCA components. Right tableau: PCA projected trajectories from the middle of training ResNet-50-v2 on Imagenet. These trajectories strongly resemble those of a random walk. See Figs. 12 and 13 in the supplementary material for PCA projected trajectories at other phases of training.

5 Random walks with decaying step sizes

We finally note that the PCA projected trajectories of the linear model and ResNet-50-v2 over the entire course of training qualitatively resemble those of a high dimensional random walk with exponentially decaying step sizes. To show this we train a linear regression model $y = \mathbf{W} \cdot \mathbf{x}$, where $\mathbf{W}$ is a fixed, unknown vector of dimension 10,000. We sample $\mathbf{x}$ from a 10,000 dimensional isotropic Gaussian and calculate the loss

$$\mathcal{L} = \frac{1}{2}\left(y - \hat{y}\right)^2 \tag{26}$$

where $\hat{y}$ is the correct output. We show in Fig. 14 that the step size decays exponentially. We fit the decay rate to this data and then perform a random walk in 10,000 dimensions but decay the variance of the stochastic term by this rate. We compare in Fig. 15 of the supplementary material the PCA projected trajectories of the linear model trained on synthetic data to the decayed random walk. We note that these trajectories resemble the PCA trajectories over the entire course of training observed in Figs. 12 and 13 for the linear model trained on CIFAR-10 and ResNet-50-v2 trained on Imagenet.

6 Conclusions

We have derived the distribution of the variances of the PCA components of a random walk both with and without momentum in the limit of infinite dimensions, and proved that the PCA projections of the trajectory are Lissajous curves. We have argued that the PCA projected trajectory of a random walk in a general quadratic potential will be dominated by the dimensions with the smallest curvatures where they will appear similar to a random walk in flat space. Finally, we find that the PCA projections of the training trajectory of a layer in ResNet-50-v2 qualitatively resemble those of a high dimensional random walk despite the many differences between the optimization of a large NN and a high dimensional random walk.

Acknowledgments

The authors thank Matthew Hoffman, Martin Wattenberg, Jeffrey Pennington, Roy Frostig, and Niru Maheswaranathan for helpful discussions and comments on drafts of the manuscript.

References

7 Further empirical tests

7.1 High dimensional random walks

We test Eq. 12 by computing 1000 steps of a random walk in 10,000 dimensions and performing PCA on the trajectory. We show in Fig. 5 the empirical variance ratio for the various components compared to the prediction from Eq. 12 and find excellent agreement. The empirical variance ratio is slightly higher than the predicted variance ratio for the highest PCA components due to the fact that there are a finite number of dimensions in this experiment, so the contribution from all components greater than the number of steps taken must be redistributed among the other components, which leads to proportionally the largest increase in the largest PCA components.

Figure 5: The fraction of the total variance of the different PCA components for a high dimensional random walk. The solid line is calculated from performing PCA on a 10,000 dimensional random walk of 1000 steps. The dashed line is calculated from the analytic prediction of Eq. 12. There is excellent agreement up until the very largest PCA components where finite size effects start to become non-negligible.

We show in Fig. 6 the projection of the trajectory onto the first few PCA components. The projection onto the $k$th PCA component is a cosine of frequency $k/(2n)$ and amplitude given by Eq. 12.

Figure 6: The projection of the trajectory of a high-dimensional random walk onto the first five PCA components forms cosines of increasing frequency and decreasing amplitude. The predicted trajectories are shown with dotted lines, but the difference between the predicted and observed trajectories is generally smaller than the width of the lines. The random walk in this figure consists of 1000 steps in 10,000 dimensions.

7.2 Random walk with non-isotropic noise

To demonstrate that our results hold for non-isotropic noise distributions we perform a random walk where the noise is sampled from a multivariate Gaussian distribution with a random covariance matrix, $\boldsymbol{\Sigma}$. Because sampling from a multivariate Gaussian with an arbitrary covariance matrix is difficult in high dimensions, we restrict the random walk to 1000 dimensions, keeping the number of steps 1000 as before. To construct the covariance matrix, we sample a $d \times d$ dimensional random matrix, $\mathbf{R}$, where each element is a sample from a normal distribution and then set $\boldsymbol{\Sigma} = \frac{1}{d}\mathbf{R}\mathbf{R}^T$. Although $\boldsymbol{\Sigma}$ will be approximately equal to the identity matrix, the distribution of eigenvalues will follow a fairly wide Marchenko-Pastur distribution because $\mathbf{R}$ is square. We show the distribution of explained variance ratios with the prediction from Eq. 12 in Fig. 7. There is a tight correspondence between the two up until the largest PCA components where finite dimension effects start to dominate. We also show in Fig. 8 PCA projected trajectories of this random walk along with a random walk where the random variates are sampled from a 1000-dimensional isotropic distribution for comparison, to provide a sense for the amount of noise introduced by the relatively small number of dimensions. Although the small dimensionality introduces noise into the PCA projected trajectories, it is clear that the general shapes match the predicted Lissajous curves.
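The construction of the random covariance matrix can be sketched as follows. This is a minimal example under the assumption that the covariance is $\frac{1}{d}\mathbf{R}\mathbf{R}^T$ with standard normal entries in $\mathbf{R}$; the helper `sample_noise` is hypothetical and draws correlated noise by reusing $\mathbf{R}$ as a factor of the covariance, which may differ from how the samples were actually generated. The dimension is reduced to keep the run fast.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 500  # illustrative dimension (assumed); the text uses 1000

# Random covariance: Sigma = R R^T / d with standard normal entries in R.
R = rng.normal(size=(d, d))
Sigma = R @ R.T / d

# For square R the eigenvalues spread over roughly [0, 4] (Marchenko-Pastur).
eigs = np.linalg.eigvalsh(Sigma)


def sample_noise(num):
    """Draw `num` rows with covariance Sigma, using R / sqrt(d) as its factor."""
    z = rng.normal(size=(num, d))
    return z @ R.T / np.sqrt(d)


steps = sample_noise(1000)
walk = np.cumsum(steps, axis=0)  # non-isotropic random walk
print(eigs.min(), eigs.max(), np.trace(Sigma) / d)
```

The mean eigenvalue is close to 1, so the covariance is identity-like on average, yet individual directions differ in variance by orders of magnitude, which is what makes this a meaningful test of the non-isotropic case.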

Figure 7: The distribution of explained variance ratios from the PCA of a random walk with noise sampled from a multivariate Gaussian with a non-isotropic covariance matrix. Despite the different noise distribution, the distribution of explained variance ratios closely matches the prediction from Eq. 12.
Figure 8: Left tableau: The PCA projected trajectory of a random walk with noise sampled from an isotropic Gaussian distribution in 1000 dimensions. Right tableau: The PCA projected trajectory of a random walk with noise sampled from a multivariate Gaussian distribution with a random covariance matrix in 1000 dimensions. Although the smaller number of dimensions introduces noise into the trajectory, it is clear that the trajectories are still Lissajous curves even when the random variates are sampled from a more complicated distribution.

7.3 Random walk with momentum

We test Eq. 21 by computing 1000 steps of a random walk in 10,000 dimensions with various choices of the momentum parameter, $\gamma$. We show in Fig. 9 the observed distribution of PCA variances (not the explained variance ratio) along with the prediction from Eq. 21. There is an extremely tight correspondence between the two, except for the lowest PCA components for $\gamma$ close to 1. This is expected because the effective step size is set by $1/(1-\gamma)$, and because the walk has only $n = 1000$ steps, it does not have sufficient time to settle into its stationary distribution of eigenvalues when $1/(1-\gamma)$ becomes comparable to $n$.

Figure 9: The distribution of explained PCA variances for random walks with momentum where we vary the strength of the momentum parameter, $\gamma$. The distribution observed from a 1000 step random walk in 10,000 dimensions is shown in the solid lines. The prediction from Eq. 21 is shown in the dashed line.

7.4 Iterate averaging of an Ornstein-Uhlenbeck process

We show in Fig. 10 the mean of all steps of Ornstein-Uhlenbeck processes which have converged to a random walk on a sphere of radius . We show in the dashed line the predicted value of , the number of steps required to reach (i.e., the crossing time of the sphere). The position on the sphere will be close to its original location for so iterate averaging will not improve the estimate of the minimum. Only when will iterate averaging improve the estimate of the minimum since the correlation between new points with the original location will be negligible.

Figure 10: The mean of all steps of converged Ornstein-Uhlenbeck processes of various lengths. The mean remains approximately constant until the total angle from the initial position on the sphere grows to , which requires steps (dashed line).
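The qualitative behavior of iterate averaging can be reproduced with a discrete-time sketch. The recursion below (each step shrinks the position toward the minimum at the origin by a factor `1 - alpha` and adds unit Gaussian noise) is an assumed discretization, and `alpha` and the function name are hypothetical. For this recursion the stationary variance per dimension is 1 / (alpha * (2 - alpha)), which fixes the radius of the sphere.

```python
import numpy as np

def mean_norm_over_radius(n_steps, n_dims, alpha, rng):
    """Norm of the average of n_steps iterates, in units of the sphere radius."""
    # Stationary standard deviation per dimension for
    # x <- (1 - alpha) * x + noise with unit-variance noise.
    stationary_std = 1.0 / np.sqrt(alpha * (2.0 - alpha))
    radius = stationary_std * np.sqrt(n_dims)
    # Start from the stationary distribution, i.e. an already converged process.
    x = stationary_std * rng.standard_normal(n_dims)
    running_sum = np.zeros(n_dims)
    for _ in range(n_steps):
        running_sum += x
        x = (1.0 - alpha) * x + rng.standard_normal(n_dims)
    return np.linalg.norm(running_sum / n_steps) / radius

rng = np.random.default_rng(0)
n_dims, alpha = 2000, 0.01  # illustrative values only

# Far fewer steps than the correlation time (about 1 / alpha).
short_average = mean_norm_over_radius(5, n_dims, alpha, rng)
# Many correlation times.
long_average = mean_norm_over_radius(5000, n_dims, alpha, rng)

print("short run, |mean| / radius:", short_average)
print("long run,  |mean| / radius:", long_average)
```

For short runs the average stays essentially on the sphere, so it is no closer to the minimum than any single iterate; only once the run spans many correlation times does the norm of the average fall well below the radius.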

8 Details of models and training

8.1 Linear regression on CIFAR-10

We train linear regression on CIFAR-10 for 10,000 steps using SGD and a batch size of 128 and a learning rate of . The model achieves a validation accuracy of 29.1%.

8.2 ResNet-50-v2 on Imagenet

We train ResNet-50-v2 on Imagenet for 150,000 steps using SGD with momentum and a batch size of 1024. We do not use batch normalization since this could confound our analysis of the training trajectory. We instead add bias terms to every convolutional layer. We decay the learning rate linearly from an initial value of 0.0345769 to a final value a factor of 10 lower over the first 141,553 steps, after which we keep the learning rate constant. We set the momentum to 0.9842. The network achieves a validation accuracy of 71.46%.
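The learning rate schedule described above can be written as a short function. This is a sketch of the schedule as stated in the text, not the original training code; the function and parameter names are hypothetical.

```python
def learning_rate(step, initial_lr=0.0345769, decay_steps=141_553, final_factor=0.1):
    """Linear decay from initial_lr to initial_lr * final_factor, then constant."""
    final_lr = initial_lr * final_factor
    if step >= decay_steps:
        # After the decay period the learning rate is held constant.
        return final_lr
    fraction = step / decay_steps
    return initial_lr + fraction * (final_lr - initial_lr)

for step in (0, 70_000, 141_553, 150_000):
    print(step, learning_rate(step))
```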

9 Gallery of PCA projected trajectories

We present here tableaux of the PCA projections of various trajectories. We show in Fig. 11 four tableaux of the PCA projections of the trajectories of high-dimensional Ornstein-Uhlenbeck processes with different values of . For the trajectories are almost identical to those of a high-dimensional random walk, as they should be since the process was sampled for only 1000 steps. Once we have the trajectories start to visibly deviate from those of a high-dimensional random walk. For larger the deviations continue to grow until the trajectories become unrecognizable at because 1000 steps correspond to many crossing times on the high dimensional sphere on which the process takes place.

Figure 11: Tableaux of the PCA projections of the trajectories of high-dimensional Ornstein-Uhlenbeck processes with various values of . All processes were sampled for 1000 steps in 10,000 dimensions. Upper left tableau: . Upper right tableau: . Lower left tableau: . Lower right tableau: .

In Fig. 12 we present tableaux of the PCA projections of the linear model trained on CIFAR-10. The trajectory of the entire training process somewhat resembles a high-dimensional random walk, though because the model makes larger updates at earlier steps than at later ones there are long tails on the PCA projected trajectories. The model’s trajectory most closely resembles a high-dimensional random walk early in training, but towards the end the higher components become dominated by noise, implying that these components more closely resemble a converged Ornstein-Uhlenbeck process. This corresponds with the flattening of the distribution of eigenvalues in Fig. 3.

Figure 12: Tableaux of the trajectories of a linear model trained on CIFAR-10 in different PCA subspaces. Upper left tableau: PCA applied to every tenth step over all of training. Upper right tableau: PCA applied to the first 1000 steps of training. Lower left tableau: PCA applied to the middle 1000 steps of training. Lower right tableau: PCA applied to the last 1000 steps of training.

In Fig. 13 we present tableaux of the PCA projections of ResNet-50-v2 trained on Imagenet. Perhaps remarkably, these trajectories resemble a high-dimensional random walk much more closely than those of the linear model do. However, as in the case of the linear model, the resemblance deteriorates later in training.

Figure 13: Tableaux of the parameter trajectories of ResNet-50-v2 trained on Imagenet in different PCA subspaces. The parameters were first projected into a random Gaussian subspace with 30,000 dimensions before PCA was applied. Upper left tableau: PCA applied to every hundredth step over all of training. Upper right tableau: PCA applied to the first 1500 steps of training. Lower left tableau: PCA applied to the middle 1500 steps of training. Lower right tableau: PCA applied to the last 1500 steps of training.
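Projecting into a random Gaussian subspace before applying PCA can be sketched as below. The sizes are far smaller than in the actual experiment, the trajectory is a synthetic random walk rather than network parameters, and the function names and the 1/sqrt(n_projected) scaling are assumptions chosen so that distances are preserved on average. At the scale of a real network the projection matrix would be too large to hold in memory at once and would have to be generated in pieces.

```python
import numpy as np

def random_gaussian_projection(trajectory, n_projected, rng):
    """Project a (steps, params) trajectory into a random Gaussian subspace."""
    n_params = trajectory.shape[1]
    # Scaling by 1 / sqrt(n_projected) preserves squared distances in expectation.
    projection = rng.standard_normal((n_params, n_projected)) / np.sqrt(n_projected)
    return trajectory @ projection

def pca_project(trajectory, n_components):
    """Coordinates of each step along the leading PCA components."""
    centered = trajectory - trajectory.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
n_steps, n_params, n_projected = 100, 5000, 500  # illustrative sizes only

# Synthetic stand-in for a parameter trajectory.
trajectory = np.cumsum(rng.standard_normal((n_steps, n_params)), axis=0)
projected = random_gaussian_projection(trajectory, n_projected, rng)

coords_full = pca_project(trajectory, 3)
coords_projected = pca_project(projected, 3)

# Ratio of the start-to-end distance after and before projection.
distance_ratio = (np.linalg.norm(projected[-1] - projected[0])
                  / np.linalg.norm(trajectory[-1] - trajectory[0]))
print("distance ratio after projection:", distance_ratio)
```

Because the random projection approximately preserves distances between steps, the PCA of the projected trajectory should closely track the PCA of the full trajectory, up to the sign of each component.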
Figure 14: The change in the step size from training a linear model on synthetic Gaussian data. The step size decays exponentially with the best fit shown in the orange dashed line.
Figure 15: Left tableau: PCA projected trajectories of a linear regression model trained on synthetic Gaussian data. Right tableau: PCA projected trajectories of a 10,000 dimensional random walk where the variance of the stochastic component is decayed using the best fit found from the linear regression model trained on synthetic data. The trajectories in the two tableaux appear very similar.
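A random walk with a decaying stochastic component, of the kind compared against the linear regression model in Fig. 15, can be generated as in the sketch below. The exponential form follows the caption of Fig. 14, but the decay time here is a hypothetical value rather than the best fit, the sizes are illustrative, and the decay is applied to the standard deviation of the steps (the variance then decays exponentially at twice the rate).

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_dims = 500, 2000  # illustrative sizes only
decay_time = 100.0           # hypothetical value, not the best fit from Fig. 14

# Exponentially decaying scale applied to each step of the walk.
step_scale = np.exp(-np.arange(n_steps) / decay_time)
steps = step_scale[:, None] * rng.standard_normal((n_steps, n_dims))
trajectory = np.cumsum(steps, axis=0)

step_norms = np.linalg.norm(steps, axis=1)
print("first step norm:", step_norms[0])
print("last step norm: ", step_norms[-1])
```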