1 Introduction
Stochastic gradient descent (SGD) [RM51] and its variants with momentum [SMDH13, Nes83, Pol64] have become the standard optimization routine for neural networks due to their fast convergence and good generalization properties [WRS17, KS17, SMDH13]. Yet the behavior of SGD on high-dimensional neural network models still eludes full theoretical understanding, both in terms of its convergence and generalization properties. In this paper, we study why SGD is so efficient at converging to low loss values on most standard neural networks, and how neural net architecture design affects this.
Classical stochastic optimization theory predicts that the learning rate of SGD needs to decrease over time for convergence to be guaranteed to the minimizer of a convex function [SZ13, Ber11]. For strongly convex functions, for example, such results show that a decreasing learning rate schedule of $\mathcal{O}(1/t)$ is required to guarantee convergence to within $\epsilon$-accuracy of the minimizer in $\mathcal{O}(1/\epsilon)$ iterations, where $t$ denotes the iteration number. Typical stochastic optimization procedures experience a transient phase, where the optimizer makes progress towards a neighborhood of a minimizer, followed by a stationary phase, where the gradient noise starts to dominate the signal and the optimizer typically oscillates around the minimizer [DM92, Mur98, TA17, CT18]. With decaying learning rates of the form $\mathcal{O}(1/t)$ or $\mathcal{O}(1/\sqrt{t})$, the convergence of SGD in the transient phase can be very slow, typically leading to poor performance on standard neural network problems.
Neural networks operate in a regime where the number of parameters is much larger than the number of training data. In this regime, SGD seems to converge quickly with constant learning rates. So quickly, in fact, that neural net practitioners often use a constant learning rate for the majority of training, with exponentially decaying learning rate schedules towards the end, without seeing the method stall [KSH12, SZ14, HZRS16, ZK16]. With constant learning rates, theoretical guarantees show that SGD converges quickly to a neighborhood of the minimizer (i.e., fast convergence in the transient phase), but then reaches a noise floor beyond which it stops converging; this noise floor depends on the learning rate and the variance of the gradients at the minimizer [MB11, NWS14]. Some more recent results have shown that when models can overfit the data completely while being strongly convex, convergence without a noise floor is possible without decaying the learning rate [SR13, MBB17, BBM18, VBS18]. While these results do give insights into why constant learning rates followed by an exponential decay might work well in practice [CT18], they fail to fully explain the efficiency of SGD on neural nets, and how this relates to overparameterization.

The behavior of SGD is also highly affected by the neural network architecture. It is common knowledge among neural network practitioners that deeper networks train slower [BSF94, GB10]. This has led to several innovations over the years to get deeper networks to train more easily, such as residual connections
[HZRS16], careful initialization strategies [GB10, HZRS15, ZDM19], and various normalization schemes like batch normalization [IS15] and weight normalization [SK16]. Furthermore, there is ample evidence to indicate that wider networks are easier to train [ZK16, NH17, LXS19]. Several prior works have investigated the difficulties of training deep networks [GB10, BFL17], and the benefits of width [NH17, LXS19, DZPS18, AZLS18]. This work adds to the existing literature by identifying and analyzing a condition that affects the SGD dynamics on overparameterized neural networks.

Our contributions. The goal of this paper is to study why SGD is efficient for neural nets, and how neural net design affects SGD. Typical neural nets are overparameterized (i.e., the number of parameters exceeds the number of training points). We ask how this overparameterization, as well as the architecture of a neural net, affect the dynamics of SGD. We list the main contributions of this work below.

We identify a condition, called gradient confusion, that controls the convergence properties of SGD on overparameterized models (defined in Section 2). When confusion is high, stochastic gradients produced by different data samples may be negatively correlated, causing slow convergence. On the other hand, when confusion is low, we show that convergence is accelerated, and SGD can converge faster and to a lower noise floor than predicted by classical theory, thus indicating a regime where constant learning rates would work well in practice (Sections 2 and 3).

We then theoretically study the effect of overparameterization on the gradient confusion condition (Sections 4, 5 and 6). In Section 5
, we show that on a large class of random input instances, and for a large class of weights, gradient confusion increases as the depth increases, indicating the difficulty of training very deep networks. The results require minimal assumptions, hold for a large family of neural networks with nonlinear activations, and can be extended to a large class of loss functions. In particular, our results hold for fully connected and convolutional networks with the square-loss and logistic-loss functions, and also for commonly used nonlinear activations such as sigmoid, tanh and ReLU.

We show that the same qualitative results hold (i.e., gradient confusion increases as depth increases) when considering arbitrary data and random weights, i.e., at neural network initializations (Section 6). We further show evidence that wider networks tend to have lower gradient confusion for some standard initialization procedures (Section 6.2).

Using experiments on standard models and datasets, we validate our theoretical results, and show that wider networks have better convergence properties and lower gradient confusion, while deeper networks have slower convergence with higher gradient confusion. We further show that innovations like batch normalization and skip connections in residual networks help lower gradient confusion, thus indicating why standard neural networks that employ such techniques are so efficiently trained using SGD (Section 7 and Appendix A).
2 Preliminaries
Notations. Throughout this paper, vectors are represented in bold lowercase and matrices in bold uppercase. We use $\mathbf{A}(i,j)$ to indicate the $(i,j)$-th cell in matrix $\mathbf{A}$ and $\mathbf{A}(i,:)$ for the $i$-th row of matrix $\mathbf{A}$. $\|\mathbf{A}\|_2$ denotes the operator norm of $\mathbf{A}$. $[N]$ denotes the set $\{1, 2, \dots, N\}$.

SGD basics. Given $N$ training points (specified by the corresponding loss functions $\{f_i\}_{i \in [N]}$), we use SGD to solve empirical risk minimization problems of the form,
$$\min_{\mathbf{w}} F(\mathbf{w}) := \frac{1}{N} \sum_{i=1}^{N} f_i(\mathbf{w}), \qquad (1)$$
using the following iterative update rule for $T$ rounds:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha\, \nabla f_{\tilde{t}}(\mathbf{w}_t). \qquad (2)$$
Here $\alpha$ is the learning rate and $f_{\tilde{t}}$ is a function chosen uniformly at random from $\{f_1, f_2, \dots, f_N\}$ at iteration $t \in [T]$. In this paper, we consider constant learning rates $\alpha$. We use $\mathbf{w}^\star$ to denote the optimal solution, i.e., $\mathbf{w}^\star = \arg\min_{\mathbf{w}} F(\mathbf{w})$.
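As a minimal concrete sketch of the update rule in eq. (2) with a constant learning rate, consider the following toy least-squares setup (the problem, dimensions, and step size are our own illustrative choices, not an experiment from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f_i(w) = 0.5 * (<x_i, w> - y_i)^2
N, d = 20, 50          # overparameterized: d > N
X = rng.standard_normal((N, d)) / np.sqrt(d)
w_true = rng.standard_normal(d)
y = X @ w_true         # realizable labels

def sgd(w0, alpha, T):
    """Constant learning-rate SGD (eq. 2): w <- w - alpha * grad f_i(w)."""
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(N)                 # sample a term f_i uniformly at random
        grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of the single sampled term
        w -= alpha * grad_i
    return w

w = sgd(np.zeros(d), alpha=0.5, T=5000)
loss = 0.5 * np.mean((X @ w - y) ** 2)
```

Because the model here can interpolate the data, a constant learning rate drives the loss to zero, which previews the overfitting/interpolation discussion later in this section.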
Gradient confusion. SGD works by iteratively selecting a random function $f_{\tilde{t}}$, and modifying the parameters to move in the direction of the negative gradient of this objective term, without considering the effect on the other terms.
It may happen that the selected gradient $\nabla f_{\tilde{t}}$ is negatively correlated with the gradient of another term $\nabla f_j$. When the gradients of different mini-batches are negatively correlated, the objective terms disagree on which direction the parameters should move, and we say that there is gradient confusion.^1

^1 This is related to gradient diversity [YPL17], but with important differences, which we describe in Section 8.
A set of objective functions $\{f_1, f_2, \dots, f_N\}$ has gradient confusion bound $\eta \ge 0$ if the pairwise inner products between gradients satisfy, for a fixed $\mathbf{w}$,
$$\langle \nabla f_i(\mathbf{w}), \nabla f_j(\mathbf{w}) \rangle \ge -\eta, \quad \forall i \ne j. \qquad (3)$$
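The smallest $\eta$ satisfying this bound can be read off directly from a set of per-example gradients; the helper below is a hypothetical utility we introduce purely for illustration:

```python
import numpy as np

def gradient_confusion(grads):
    """Smallest eta >= 0 with <g_i, g_j> >= -eta for all pairs i != j (eq. 3)."""
    G = np.asarray(grads, dtype=float)
    inner = G @ G.T                          # all pairwise inner products
    off_diag = inner[~np.eye(len(G), dtype=bool)]
    return max(0.0, -off_diag.min())         # only negative inner products cause confusion

# Two gradients in agreeing directions: zero confusion.
assert gradient_confusion([[1.0, 0.0], [0.0, 1.0]]) == 0.0
# Two opposing gradients: confusion equals the magnitude of the negative inner product.
assert gradient_confusion([[1.0, 0.0], [-2.0, 0.0]]) == 2.0
```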
SGD converges fast when the gradient confusion is low. To see why, consider the case of training a logistic regression model on a dataset with $N \le d$ pairwise orthogonal vectors. We have $f_i(\mathbf{w}) = \ell(y_i \langle \mathbf{x}_i, \mathbf{w} \rangle)$, where $\ell$ is the logistic loss, $\{\mathbf{x}_i\}_{i \in [N]}$ is a set of orthogonal training vectors, and $y_i$ is the label for the $i$-th training example. We then have $\nabla f_i(\mathbf{w}) = \xi_i(\mathbf{w})\, \mathbf{x}_i$, where $\xi_i(\mathbf{w}) := y_i\, \ell'(y_i \langle \mathbf{x}_i, \mathbf{w} \rangle)$. Because of our orthogonal data assumption, the gradient confusion is 0, since $\langle \nabla f_i(\mathbf{w}), \nabla f_j(\mathbf{w}) \rangle = \xi_i(\mathbf{w})\, \xi_j(\mathbf{w})\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle = 0$ for all $i \ne j$. Because of gradient orthogonality, an update in the gradient direction of one term has no effect on the loss value of the other terms. In this case, SGD decouples into (deterministic) gradient descent on each objective term separately, and we can expect to see the fast rates of convergence attained by deterministic gradient descent, rather than the slow rates of SGD.

Can we expect a problem to have low gradient confusion in practice? It is known that randomly chosen vectors in high dimensions are nearly orthogonal with high probability
[MS86, GS16, Ver18] (this statement is formalized by Lemma 4.1 below). For this reason, we would expect an average-case (i.e., random) problem to have nearly orthogonal gradients, provided that we don't train on too many training vectors (as the number of training vectors grows relative to the dimension, it becomes likely that we will see two training vectors with large negative correlation). In other words, we should expect a random optimization problem to have low gradient confusion when the number of parameters is "large" and the number of training data is "small", i.e., when the model is overparameterized. This is further evidenced by a simple toy example in Figure 1, where we show that a slightly overparameterized linear regression model can have a much faster convergence rate (without any noise floor), as well as a positive average gradient cosine similarity, compared to the underparameterized model.

The above arguments are rather informal, and ignore issues like nonconvexity and the effect of the structure of neural networks. Furthermore, it is unclear whether we can expect low levels of gradient confusion in practice, and what effect nonzero confusion has on convergence rates. Below, we present a rigorous argument that low confusion levels accelerate SGD and help achieve faster convergence and lower noise floors for nonconvex problems. Then, we turn to the issue of overparameterization, and study how it affects gradient confusion, and how this depends on the neural network architecture. Finally, we use computational experiments to show that gradient confusion is low for standard neural nets used in practice, and that this effect contributes to the superior optimization performance of SGD.
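The over- versus underparameterized comparison can be sketched as follows. The problem sizes, step size, and noise level below are our own illustrative choices, in the spirit of the Figure 1 toy example rather than a reproduction of it:

```python
import numpy as np

rng = np.random.default_rng(1)

def final_sgd_loss(N, d, alpha=0.2, T=20000):
    """Constant-step SGD on a random least-squares problem; returns final mean loss."""
    X = rng.standard_normal((N, d)) / np.sqrt(d)
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)  # noisy labels
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(N)                    # sample one training point
        w -= alpha * (X[i] @ w - y[i]) * X[i]  # SGD step on its loss term
    return np.mean((X @ w - y) ** 2)

over = final_sgd_loss(N=30, d=100)    # overparameterized: interpolates, no noise floor
under = final_sgd_loss(N=100, d=30)   # underparameterized: stalls at a noise floor
```

With more parameters than data the model can fit even the noisy labels exactly, so constant-step SGD converges without a noise floor; in the underparameterized case it stalls well above zero.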
3 SGD is efficient when gradient confusion is low
We now present a rigorous analysis of gradient confusion and its effect on SGD. Several prior papers have analyzed the convergence rates of constant learning rate SGD [NB01, MB11, NWS14, FB15, DFB17]. These results show that for strongly convex and Lipschitz smooth functions, SGD with a constant learning rate converges linearly to a neighborhood of the minimizer. The noise floor it converges to depends on the learning rate $\alpha$ and the variance of the gradients at the minimizer, i.e., $\mathbb{E}\|\nabla f_i(\mathbf{w}^\star)\|^2$. To guarantee convergence to $\epsilon$-accuracy in such a setting, the learning rate needs to be small, i.e., $\alpha = \mathcal{O}(\epsilon)$, and the method requires $\mathcal{O}((1/\epsilon) \log(1/\epsilon))$ iterations. Some more recent results show convergence of constant learning rate SGD without a noise floor and without small step sizes using an "overfitting" condition, i.e., where the model can completely overfit the data [SR13, MBB17, BBM18, VBS18]. The condition effectively translates to assuming $\nabla f_i(\mathbf{w}^\star) = 0$ for all $i \in [N]$, getting rid of the noise floor.
The gradient confusion bound is related to the overfitting condition. Note that if $\nabla f_i(\mathbf{w}^\star) = 0$ for all $i \in [N]$, then $\langle \nabla f_i(\mathbf{w}^\star), \nabla f_j(\mathbf{w}^\star) \rangle = 0$ for all $i \ne j$. This implies that the gradient confusion at the minimizer is small when the variance of the gradients at the minimizer is small. Further note that when the variance of the gradients at the minimizer is small, i.e., $\mathbb{E}\|\nabla f_i(\mathbf{w}^\star)\|^2 = \mathcal{O}(\epsilon)$, a direct application of the results in [MB11, NWS14] shows that constant learning rate SGD has fast convergence to $\mathcal{O}(\epsilon)$-accuracy in $\mathcal{O}(\log(1/\epsilon))$ iterations, without the learning rate needing to be vanishingly small.
Bounded gradient confusion does not, however, provide a bound on the variance of the gradients. Thus, it is instructive to derive convergence bounds of SGD explicitly in terms of the gradient confusion bound, to properly understand its effect. We begin by looking at the case where the objective satisfies the Polyak-Łojasiewicz (PL) inequality [Loj65], a condition related to, but weaker than, strong convexity, which has been used in recent work on stochastic optimization [KNS16, DYJG17]. Using the PL inequality, we provide tight bounds on the rate of convergence in terms of the optimality gap. Then we look at a broader class of smooth nonconvex functions, and analyze convergence to a stationary point.
We first make two standard assumptions about the objective function.

The individual functions $f_i$ are Lipschitz smooth: $\|\nabla f_i(\mathbf{w}) - \nabla f_i(\mathbf{w}')\| \le L \|\mathbf{w} - \mathbf{w}'\|$ for all $\mathbf{w}, \mathbf{w}'$.

The individual functions $f_i$ satisfy the PL inequality: $\tfrac{1}{2} \|\nabla f_i(\mathbf{w})\|^2 \ge \mu\, (f_i(\mathbf{w}) - f_i^\star)$,
where $f_i^\star = \min_{\mathbf{w}} f_i(\mathbf{w})$.
We now state the following convergence result of constant learning rate SGD in terms of the gradient confusion bound.
Theorem 3.1.
Proof.
See Appendix C.1 for the proof. ∎
This result shows that SGD converges linearly to a neighborhood of a minimizer, and the size of this neighborhood depends on the level of gradient confusion $\eta$. When there is no confusion ($\eta = 0$), the noise floor vanishes and SGD converges directly to a minimizer. Further, when the gradient confusion is small, i.e., $\eta = \mathcal{O}(\epsilon)$, SGD has fast convergence to $\mathcal{O}(\epsilon)$-accuracy in $\mathcal{O}(\log(1/\epsilon))$ iterations, without requiring the learning rate to be vanishingly small.
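A one-dimensional toy problem (our own construction, not from the paper) makes the dependence of the noise floor on gradient confusion visible: for two quadratic terms whose minimizers are separated by `separation`, the gradient confusion at the minimizer of their average is $\eta = (\mathrm{separation}/2)^2$, and constant-step SGD stalls at a correspondingly higher suboptimality:

```python
import numpy as np

rng = np.random.default_rng(2)

def sgd_suboptimality(separation, alpha=0.1, T=5000):
    """SGD on F = (f_1 + f_2)/2 with f_i(w) = 0.5*(w - c_i)^2, c_i = -/+ separation/2.
    At the minimizer w* = 0 the gradients satisfy <f_1', f_2'> = -(separation/2)^2,
    so larger separation means higher gradient confusion eta."""
    c1, c2 = -separation / 2, separation / 2
    w, gap = 0.5, 0.0
    for t in range(T):
        c = c1 if rng.random() < 0.5 else c2
        w -= alpha * (w - c)                 # SGD step on a randomly sampled term
        if t >= T - 1000:                    # time-average F(w) - F* near stationarity
            gap += 0.5 * w * w / 1000        # here F(w) - F* = 0.5 * w^2
    return gap

low_confusion, high_confusion = sgd_suboptimality(0.1), sgd_suboptimality(2.0)
```

The run with larger separation (higher confusion) oscillates around the minimizer at a visibly higher noise floor, matching the qualitative prediction of Theorem 3.1.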
Convergence on general smooth nonconvex functions.
We now show that low gradient confusion leads to fast convergence on more general smooth nonconvex functions.
Theorem 3.2.
Proof.
See Appendix C.1 for the proof. ∎
Theorems 3.1 and 3.2, similar to previous constant learning rate SGD convergence results, predict an initial transient phase of optimization with fast convergence to the neighborhood of a minimizer or a stationary point. This behavior is often observed when optimizing popular deep neural network models [DM92, SMDH13]; there is often an initial phase of fast convergence where a constant learning rate reaches a high level of accuracy on the model. This is typically followed by slow local convergence in the stationary phase where drops in the objective function are achieved by employing exponentially decreasing learning rate schedules [KSH12, SZ14, HZRS16, ZK16] (which from these theorems, would be equivalent to exponentially decreasing the noise floor that the algorithm converges to).
Note that the constants in Theorems 3.1 and 3.2 are slightly worse, and result in a slower convergence rate result (in terms of constants), than those shown in previous work [MB11, NWS14]. This is possibly a product of the analysis and the constants can probably be improved. See Appendix B for further discussion of this, where we explore a strengthened gradient confusion bound that guarantees faster local convergence. That being said, the main intention of these theorems is to show the direct effect that the gradient confusion bound has on the convergence rate and the noise floor that constant learning rate SGD converges to. This new perspective helps us more directly understand how overparameterization affects the bound on the gradient confusion and thus the convergence properties, which we explore in the following sections.
4 Lowrank Hessians lead to low gradient confusion
While in Section 3 we showed that SGD is more efficient when gradient confusion is low, this raises the question of whether commonly-used neural network models have low gradient confusion, which would help explain SGD's efficiency on them. There is some evidence that the Hessian at the minimizer is very low-rank for many standard overparameterized neural network models [SEG17, Coo18, CCS16, WZ17]. In this section, we show that the gradient confusion bound (eq. 3) is often low for a large class of parameter configurations for problems with random, low-rank Hessians.
The simplest case of this (which we already saw above) occurs for losses of the form $f_i(\mathbf{w}) = \ell(\langle \mathbf{x}_i, \mathbf{w} \rangle)$ for some function $\ell$, which includes logistic regression. In this case, the Hessian of each $f_i$ is rank-1. We have:
$$\langle \nabla f_i(\mathbf{w}), \nabla f_j(\mathbf{w}) \rangle = \ell'(\langle \mathbf{x}_i, \mathbf{w} \rangle)\, \ell'(\langle \mathbf{x}_j, \mathbf{w} \rangle)\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle,$$
where we denote the derivative of the function $\ell$ by $\ell'$. This inner product is expected to be small for all $\mathbf{w}$: the logistic loss satisfies $|\ell'(z)| \le 1$ for all $z$, and for fixed $i \ne j$ the quantity $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ is $\mathcal{O}(1/\sqrt{d})$ whenever $\mathbf{x}_i, \mathbf{x}_j$ are randomly sampled from a sphere.^2

^2 More generally, this is true whenever the data points are scaled isotropic random vectors ([Ver18], Remark 3.2.5).
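This factorization is easy to check numerically. In the sketch below the softplus loss plays the role of $\ell$ (an assumption for illustration); its derivative, the sigmoid, is bounded by 1 in magnitude, so the gradient inner product is controlled by $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1000

def grad(x, w):
    """Gradient of f(w) = log(1 + exp(<x, w>)) (softplus loss): l'(<x, w>) * x,
    where l' is the sigmoid, which satisfies |l'| <= 1."""
    z = x @ w
    return (1.0 / (1.0 + np.exp(-z))) * x

x_i = rng.standard_normal(d); x_i /= np.linalg.norm(x_i)   # random unit vectors
x_j = rng.standard_normal(d); x_j /= np.linalg.norm(x_j)
w = rng.standard_normal(d)

inner = grad(x_i, w) @ grad(x_j, w)
# The inner product factors through <x_i, x_j>, which is O(1/sqrt(d)) here.
```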
Specifically, we have the following lemma, which is often attributed to [MS86] (see Appendix C.2 for a short proof).
Lemma 4.1 (Near orthogonality of random vectors).
For vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ drawn uniformly from a unit sphere in $d$ dimensions, and for any $\nu > 0$, we have $|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| \le \nu$ for all $i \ne j$, with probability at least $1 - N^2 e^{-d\nu^2/2}$.
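A quick Monte Carlo check of this near-orthogonality (the dimensions and sample counts are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def max_coherence(N, d):
    """Largest |<x_i, x_j>| over pairs of N random unit vectors in R^d."""
    X = rng.standard_normal((N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # project onto the unit sphere
    G = np.abs(X @ X.T)
    np.fill_diagonal(G, 0.0)                        # ignore self inner products
    return G.max()

# Higher dimension -> pairwise inner products concentrate near zero.
low_dim, high_dim = max_coherence(50, 10), max_coherence(50, 10000)
```

Even the worst pair among 50 vectors is nearly orthogonal in $d = 10000$, while in $d = 10$ large correlations are common.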
For general classes of functions, suppose, for clarity in presentation, that each $f_i$ has a minimizer at the origin (the same argument can be easily extended to the more general case). Suppose also that there is a Lipschitz constant $L_H$ for the Hessian of each function that satisfies $\|\nabla^2 f_i(\mathbf{w}) - \nabla^2 f_i(\mathbf{w}')\| \le L_H \|\mathbf{w} - \mathbf{w}'\|$. Then
$\nabla f_i(\mathbf{w}) = \mathbf{H}_i \mathbf{w} + \boldsymbol{\delta}_i(\mathbf{w})$,
where $\boldsymbol{\delta}_i(\mathbf{w})$ is an error term bounded as: $\|\boldsymbol{\delta}_i(\mathbf{w})\| \le \tfrac{1}{2} L_H \|\mathbf{w}\|^2$,
and we use the shorthand $\mathbf{H}_i$ to denote the Hessian $\nabla^2 f_i(\mathbf{0})$.
In this case, the inner product between two gradients is bounded as:
$$\langle \nabla f_i(\mathbf{w}), \nabla f_j(\mathbf{w}) \rangle \le \langle \mathbf{H}_i \mathbf{w}, \mathbf{H}_j \mathbf{w} \rangle + \tfrac{1}{2} L_H \|\mathbf{w}\|^3 \left( \|\mathbf{H}_i\| + \|\mathbf{H}_j\| \right) \le \|\mathbf{w}\|^2 \|\mathbf{H}_i \mathbf{H}_j\|_2 + \tfrac{1}{2} L_H \|\mathbf{w}\|^3 \left( \|\mathbf{H}_i\| + \|\mathbf{H}_j\| \right).$$
If the Hessians are sufficiently random and low-rank (e.g., of the form $\mathbf{H}_i = \mathbf{a}_i \mathbf{a}_i^\top$, where the vectors $\mathbf{a}_i$ are randomly sampled from a unit sphere), then one would expect the terms in this expression to be small for all $\mathbf{w}$ within a neighborhood of the minimizer.
This indicates that for many standard neural network models, the gradient confusion might be low for a large class of weights near the minimizer. In the next two sections, we explore more formally and in more detail how overparameterization and the neural network architecture affect the probability with which the gradient confusion bound holds.
5 Effect of overparameterization on gradient confusion
To draw a more rigorous connection between overparameterization and neural network structure, we analyze gradient confusion for generic (i.e., random) model problems using methods from high-dimensional probability. We rigorously analyze the case where training data is randomly sampled from a unit sphere, and identify specific cases where gradient confusion (defined in Section 2) is low with high probability. Our results require minimal additional assumptions, and hold for a large family of neural networks with nonlinear activations and a large class of loss functions. In particular, our results hold for fully connected and convolutional networks, with the square-loss and logistic-loss functions, and commonly used nonlinear activations such as sigmoid, tanh and ReLU.
We consider synthetic training data of the form $\{(\mathbf{x}_i, y_i)\}_{i \in [N]}$, with $y_i = h(\mathbf{x}_i)$ for some labeling function $h$, and with data points $\mathbf{x}_i$ drawn uniformly from the surface of a $d$-dimensional unit sphere. The concept $h$ being learned is bounded and Lipschitz: $|h(\mathbf{x})| \le 1$ for all $\mathbf{x}$, and $h$ does not change too quickly with its input. Note that this automatically holds for every model considered in this paper where the concept is realizable (i.e., where the model can express the concept function using its parameters); more generally, this assumes a Lipschitz condition on the labels. In this paper, we consider two loss functions, namely, the square-loss for regression and the logistic-loss for classification. The square-loss function is defined as $f_i(\mathbf{w}) = \tfrac{1}{2}(g_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2$, and the logistic-loss function is defined as $f_i(\mathbf{w}) = \log(1 + \exp(-y_i\, g_{\mathbf{w}}(\mathbf{x}_i)))$. Here, $g_{\mathbf{w}}$ denotes the parameterized function (either a linear function or a neural network) that we fit to the training data, and $f_i(\mathbf{w})$ denotes the loss of hypothesis $g_{\mathbf{w}}$ on data point $(\mathbf{x}_i, y_i)$.
Using tools from highdimensional probability, we analyze below how the gradient confusion changes for a range of overparameterized models (including neural networks) with randomized training data. For clarity in presentation, we begin by analyzing the simple class of linear models.
5.1 A simple case: linear models
We begin by examining gradient confusion in the case of fitting a simple linear model to data. In this case, the function $g_{\mathbf{w}}$ can be written as follows.
$$g_{\mathbf{w}}(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle. \qquad (4)$$
As stated earlier, we consider two loss functions, namely the square loss for regression and the logistic loss for classification. Both these functions have the following useful properties.
Lemma 5.1.
Consider the set of loss functions $\{f_i\}_{i \in [N]}$, where every $f_i$ is either the square-loss or the logistic-loss function. Consider a linear model with weights $\mathbf{w}$ of bounded norm. Consider the gradient of each function $f_i$. Note that we can write $\nabla f_i(\mathbf{w}) = \xi_i(\mathbf{w})\, \mathbf{x}_i$, where we define $\xi_i(\mathbf{w})$ to be the derivative of the loss with respect to the model output $g_{\mathbf{w}}(\mathbf{x}_i)$. Then we have the following properties.

When $i \ne j$, we have $\langle \nabla f_i(\mathbf{w}), \nabla f_j(\mathbf{w}) \rangle = \xi_i(\mathbf{w})\, \xi_j(\mathbf{w})\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle$.

There exists a constant $c_\xi > 0$ such that $|\xi_i(\mathbf{w})| \le c_\xi$, $\forall i \in [N]$.

When $g_{\mathbf{w}}$ is a linear model, as defined in eq. 4, we have that $\xi_i$ is Lipschitz in $\mathbf{w}$: $|\xi_i(\mathbf{w}) - \xi_i(\mathbf{w}')| \le L_\xi \|\mathbf{w} - \mathbf{w}'\|$ for some constant $L_\xi > 0$.
Proof.
See Appendix C.3 for the proof. ∎
The following theorem shows that, for randomly sampled data, the gradient confusion bound (eq. 3)
holds with high probability.
Theorem 5.1 (Concentration for linear models).
Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function. Let $\mathbf{w}$ be an arbitrary weight vector with bounded norm, and let $\eta > 0$ be a given constant. The gradient confusion bound (eq. 3) holds at $\mathbf{w}$ with probability at least $1 - 2N^2 \exp(-c\, d\, \eta^2)$, for some absolute constant $c > 0$. The same constant works for both the square-loss and the logistic-loss functions.
Proof.
See Appendix C.4 for the proof.∎
This theorem can be interpreted as follows. As long as the dimension $d$ of the input (and thus the number of parameters in the problem) is large enough, the gradient confusion for a given weight vector $\mathbf{w}$ is low with high probability. Thus, gradient confusion is lower as the linear model becomes more overparameterized, an effect that shows up in the simulation in Figure 1.
Note that the results in Section 3 showing fast convergence of SGD under low gradient confusion (Theorems 3.1 and 3.2) assume that the gradient confusion bound holds at every point along the path of SGD. On the other hand, Theorem 5.1 above shows that gradient confusion is low with high probability for overparameterized models at a fixed weight vector $\mathbf{w}$. Thus, to ensure that the above result is relevant for the convergence of SGD on overparameterized linear models, we now make the concentration bound in Theorem 5.1 uniform over all weights inside a ball of radius $r$.
Theorem 5.2 (Uniform concentration).
Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function. Select a point $\mathbf{w}_0$ of bounded norm, and consider a ball $\mathcal{B}$ centered at $\mathbf{w}_0$ of radius $r$. If the data are sampled uniformly from a unit sphere, then the gradient confusion bound (eq. 3) holds uniformly at all points $\mathbf{w} \in \mathcal{B}$ with high probability (the explicit bound is given in Appendix C.5).
Proof.
See Appendix C.5 for the proof.∎
Thus, as long as the radius $r$ is not too large, we see that the gradient confusion bound holds uniformly with high probability for all points within the ball $\mathcal{B}$.
5.2 Extension to general neural networks
We now extend the previous results to feedforward neural networks with nonlinear activations. Formally, let $\mathbf{W}_1 \in \mathbb{R}^{p \times d}$, $\mathbf{W}_i \in \mathbb{R}^{p \times p}$ for $i = 2, \dots, \beta - 1$, and $\mathbf{W}_\beta \in \mathbb{R}^{1 \times p}$ be the given weight matrices. Let $\mathbf{W}$ denote the tuple $(\mathbf{W}_1, \dots, \mathbf{W}_\beta)$. Define $p$ to be the width and $\beta$ to be the depth of the neural network. Then, the model is defined as
$$g_{\mathbf{W}}(\mathbf{x}) = \mathbf{W}_\beta\, \sigma(\mathbf{W}_{\beta-1}\, \sigma(\cdots \sigma(\mathbf{W}_1 \mathbf{x}))), \qquad (5)$$
where $\sigma$ denotes the nonlinear activation function, applied pointwise to its arguments. We assume that the nonlinear activation is given by a function $\sigma: \mathbb{R} \to \mathbb{R}$
with the following properties.
(P1) Boundedness: $|\sigma(x)| \le 1$ for every coordinate $x$ of the input vector.

(P2) Bounded differentials: Let $\sigma'$ and $\sigma''$ denote the first and second subdifferentials, respectively. Then, $|\sigma'(x)| \le 1$ and $|\sigma''(x)| \le 1$ for all $x$.
When the arguments satisfy $|x| \le 1$, as in our random data model, most classical activation functions such as sigmoid, tanh, softmax and ReLU satisfy these requirements. Additionally, we will make the following assumption on the weight matrices.
Assumption 5.1 (Small Weights).
We will assume that the operator norms of the weight matrices are bounded above by 1. In other words, for every layer $i$ we have $\|\mathbf{W}_i\|_2 \le 1$.
Is the small-weights assumption reasonable?
Without the small-weights assumption, i.e., without requiring $\|\mathbf{W}_i\|_2 \le 1$ for every layer $i$, the signal propagated forward or the gradients could potentially blow up in magnitude, making the network untrainable. Proving non-vacuous bounds in the case of such blow-ups in the magnitude of the signal or the gradient is not possible in general, and thus we restrict ourselves to this class of weights.
Note that the small-weights assumption is not just a theoretical concern, but is also usually mandated in practice. Neural networks are often trained with weight decay regularizers of the form $\lambda \|\mathbf{w}\|^2$, which force the weights to remain small during optimization. The operator norm of convolutional layers has also recently been used as an effective regularizer for image classification tasks [SGL18].
We can also show formally that Assumption 5.1 holds with high probability for random initializations that are frequently used in practice [Ben12]. Consider a random weight matrix $\mathbf{W} \in \mathbb{R}^{p \times p}$ where every entry is an independent sample from $\mathcal{N}(0, \varsigma^2)$. Then we can show that the operator norm of $\mathbf{W}$ is upper-bounded by 1 with high probability, provided the standard deviation $\varsigma$ scales inversely with $\sqrt{p}$, up to a small enough constant. In particular, this follows directly from Theorem 2.3.8 and Proposition 2.3.10 in [Tao12] with appropriate scaling: with probability at least $1 - e^{-\Omega(p)}$ we have $\|\mathbf{W}\|_2 \le 1$.
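A numerical sanity check of this scaling; the constant $1/4$ in the standard deviation below is our own illustrative choice of a "small enough" constant:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 500

# Entries N(0, sigma^2) with sigma = 1/(4*sqrt(p)): the expected operator norm of a
# p x p Gaussian matrix is about 2*sigma*sqrt(p) = 0.5, so ||W||_2 stays below 1
# with overwhelming probability.
sigma = 1.0 / (4.0 * np.sqrt(p))
W = sigma * rng.standard_normal((p, p))
op_norm = np.linalg.norm(W, 2)   # largest singular value
```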
This argument can also be extended to convolutional neural nets. Consider a 2D input $\mathbf{x} \in \mathbb{R}^{k \times k}$ and a filter $\mathbf{u} \in \mathbb{R}^{m \times m}$. The convolution operation is then $\mathbf{u} * \mathbf{x} = \mathbf{U} \tilde{\mathbf{x}}$, where $\mathbf{U}$ is a Toeplitz matrix (assuming a stride of 1 and no padding) and $\tilde{\mathbf{x}}$ is the flattened input vector [Gra06]. Consider a randomly initialized filter where every entry is an i.i.d. Gaussian sample whose standard deviation scales inversely with the filter size. It then holds that, with high probability, $\|\mathbf{U}\|_2 \le 1$. This again follows from Theorem 2.3.8 and Proposition 2.3.10 in [Tao12] and the fact that each column in $\mathbf{U}$ has at most $m^2$ nonzero entries, each of which is an independent Gaussian random variable.
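For brevity, the sketch below builds the matrix form of a 1D convolution rather than the 2D case discussed above; the same Toeplitz structure appears, and we check it against `np.convolve`:

```python
import numpy as np

def conv_matrix(u, n):
    """Matrix U with U @ x == np.convolve(x, u, mode='valid') for x of length n
    (stride 1, no padding): each row is the flipped filter at a different shift."""
    m = len(u)
    U = np.zeros((n - m + 1, n))
    for r in range(n - m + 1):
        U[r, r:r + m] = u[::-1]          # np.convolve flips the filter
    return U

u = np.array([1.0, -2.0, 0.5])
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
assert np.allclose(conv_matrix(u, len(x)) @ x, np.convolve(x, u, mode='valid'))
```

Each column of this matrix has at most `len(u)` nonzero entries, mirroring the sparsity argument used above for the operator norm bound.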
While, in general, there is no reason to believe that such a small-weights assumption would continue to hold during optimization without explicit regularizers like weight decay, some recent work has shown evidence that for overparameterized neural nets, the weights do not move too far away from the random initialization point during training [NLB18, DR17, NK19, ZCZG18, AZLS18, DZPS18, OS18]. It is worth noting, though, that all these results have been shown under some restrictive assumptions, such as very wide networks.
As in the case of Lemma 5.1 for linear models, we show that the considered loss functions have the following property for neural nets, provided the weights satisfy the small-weights Assumption 5.1.
Lemma 5.2.
Proof.
See Appendix C.3 for the proof. ∎
We now prove concentration bounds for the gradient confusion on neural networks.
Theorem 5.3 (Concentration bounds for arbitrary depth neural networks).
Consider the problem of fitting neural networks (eq. 5) to data using either the square-loss or the logistic-loss function. Let $\eta > 0$ be a given constant. Additionally, assume that the weights satisfy Assumption 5.1 and the nonlinearities in each layer satisfy properties (P1) and (P2). Then, with high probability, the gradient confusion bound in eq. (3) holds; the explicit probability bound, which decays as the depth increases, is given in Appendix C.6.
Further, select a point $\mathbf{W}_0$ satisfying Assumption 5.1, and consider a ball $\mathcal{B}$ centered at $\mathbf{W}_0$ of radius $r$. If the data are sampled uniformly from a unit sphere, then the gradient confusion bound in eq. (3) holds uniformly at all points $\mathbf{W} \in \mathcal{B}$ with high probability (the explicit bound is given in Appendix C.6).
For both the square-loss and the logistic-loss functions, the relevant constants follow from Lemma 5.2.
Proof.
See Appendix C.6 for the proof.∎
Thus we see that, for a given dimension $d$ and number of samples $N$, when the depth decreases, the probability that the gradient confusion bound in eq. (3) holds increases, and vice versa. This helps explain why training very deep models is hard and typically slow with SGD, even with small weights, as observed previously [BSF94, GB10]. Note that this is also related to the shattered gradients phenomenon [BFL17] that was observed to arise with depth (see Section 8 for more discussion). This naturally raises the question of why modern deep neural networks are so efficiently trained using SGD. While careful initialization strategies prevent vanishing or exploding gradients and have made deeper networks trainable, as we show in Section 6, these strategies still suffer from high gradient confusion for very deep networks (see [ZDM19] for some recent progress in this direction). Thus, in Section 7, we empirically explore how popular techniques typically used in very deep networks, like residual connections [HZRS16] and batch normalization [IS15], affect gradient confusion. We find that these techniques drastically lower gradient confusion, making very deep networks significantly easier to train using SGD.
Note that all results in this section automatically hold for convolutional neural nets, since a convolutional operation on $\mathbf{x}$ can be represented as a matrix multiplication $\mathbf{U}\tilde{\mathbf{x}}$ for an appropriate Toeplitz matrix $\mathbf{U}$.
6 Gradient confusion at initialization
In this section, we explore the implications for initialization of the weights in neural networks, and discuss the effect of width on gradient confusion.
6.1 Gradient confusion at standard weight initializations
In Section 5, we assumed a random data model and analyzed the effect of overparameterization on gradient confusion for a large class of weights. In this section, we instead assume arbitrary bounded data and random weights (more specifically, we consider typical weight initializations used when training neural networks) and show that the same effect of overparameterization holds. As before, we first present the result for linear models, before moving on to neural networks.
Linear models. Consider a dataset $\{(\mathbf{x}_i, y_i)\}_{i \in [N]}$ with $y_i = h(\mathbf{x}_i)$ for some labeling function $h$. Suppose the data points $\mathbf{x}_i$ are deterministic and arbitrary vectors restricted to lie in a $d$-dimensional unit sphere ($\|\mathbf{x}_i\| \le 1$ for all $i \in [N]$).
Consider the weight vector $\mathbf{w} \in \mathbb{R}^d$, with every coordinate sampled independently from the distribution $\mathcal{N}(0, \varsigma^2)$. We now prove the following theorem, which shows that gradient confusion is low at a random initial point with high probability.
Theorem 6.1 (Linear models with randomly chosen weights).
Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function. Let each $\mathbf{x}_i$ be a fixed data vector restricted to lie within a unit sphere. Consider a randomly generated weight vector $\mathbf{w}$ such that for every coordinate $i \in [d]$ we have an i.i.d. sample $w_i \sim \mathcal{N}(0, \varsigma^2)$. Let $\eta > 0$ be a given constant. Then we have the following.

When the variance $\varsigma^2$ is large (e.g., constant with respect to the dimension $d$), the gradient confusion bound in eq. (3) fails with a constant probability. In particular, the failure probability is always bounded away from $0$ even in high dimensions.

When the variance is small, i.e., $\varsigma^2 = \mathcal{O}(1/d)$, we have that with high probability the gradient confusion bound in eq. (3) holds. In particular, as the dimension $d$ increases, this probability converges to $1$.
Proof.
See Appendix C.7 for the proof.∎
The above theorem shows that if the weights have high variance, then the gradient confusion condition fails to hold with constant probability. However, when the weights have lower variance, the gradient confusion condition does hold with high probability, justifying the current practice of initializing with small weights [GB10, HZRS15].
Neural networks. We can extend the arguments above to the case of general neural networks as defined in Section 5.2. We show that when weights are chosen uniformly at random from an interval that scales as $1/\sqrt{p}$, where $p$ is the layer width, the gradient confusion bound in eq. (3) holds with high probability, and as before, this probability decreases with increasing depth of the neural network. This initialization strategy, and similar variants, are used almost universally for neural networks [GB10, HZRS15]. Consider the weight matrices $\{\mathbf{W}_i\}_i$, where every entry is an i.i.d. sample from the uniform distribution in the range $[-c/\sqrt{p}, c/\sqrt{p}]$ for some constant $c > 0$. This implies that the operator norm of every $\mathbf{W}_i$ is bounded by a constant with high probability. Thus, as long as $c$ is small enough, we have $\|\mathbf{W}_i\|_2 \le 1$ for every $i$ with high probability. Using a proof strategy similar to Theorem 5.3, we get the following concentration bound on the gradient confusion.

Theorem 6.2 (Neural nets with randomly chosen weights).
Let $\{\mathbf{W}_i\}_i$ be weight matrices such that every entry in each of these matrices is an i.i.d. sample chosen uniformly from an interval scaling as $1/\sqrt{p}$, where $p$ is the maximum layer width. Then, for any dataset with $\|\mathbf{x}_i\| \le 1$ for every $i \in [N]$, we have that the gradient confusion bound in eq. (3) holds with high probability; the explicit bound, which decreases with increasing depth, is given in Appendix C.7.
Proof.
See Appendix C.7 for the proof.∎
Thus, with both random data and random weights, we see how increasing depth increases gradient confusion, making very deep models harder to train. This helps explain a phenomenon observed in practice for many years.
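This depth effect can be probed numerically. The following is a minimal sketch, not the paper's exact construction: we take a deep linear network with a scalar output, weights drawn uniformly from an interval scaled by the inverse square root of the width, a squared loss, and we report the minimum pairwise cosine similarity between per-sample gradients as a worst-case gradient confusion proxy. All function names and architectural choices here are our own illustrative assumptions.

```python
import numpy as np

def per_sample_gradients(Ws, v, X, y):
    """Flattened squared-loss gradients of a deep linear net f(x) = v^T W_L ... W_1 x,
    computed per data sample via manual backpropagation."""
    grads = []
    for x, t in zip(X, y):
        hs = [x]                              # forward pass: h_k = W_k h_{k-1}
        for W in Ws:
            hs.append(W @ hs[-1])
        r = v @ hs[-1] - t                    # residual of the squared loss
        g = v.copy()                          # backward pass through the chain
        dWs = []
        for W, h in zip(reversed(Ws), reversed(hs[:-1])):
            dWs.append(r * np.outer(g, h))    # dL/dW_k = r * g_k h_{k-1}^T
            g = W.T @ g
        dv = r * hs[-1]
        grads.append(np.concatenate([dv] + [d.ravel() for d in reversed(dWs)]))
    return grads

def min_pairwise_cosine(depth, width=16, n_data=6, seed=0):
    """Worst-case pairwise gradient cosine similarity at a random uniform init."""
    rng = np.random.default_rng(seed)
    b = 1.0 / np.sqrt(width)                  # U(-1/sqrt(width), 1/sqrt(width)) init
    Ws = [rng.uniform(-b, b, (width, width)) for _ in range(depth)]
    v = rng.uniform(-b, b, width)
    X = rng.standard_normal((n_data, width))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # data on the unit sphere
    y = rng.standard_normal(n_data)
    grads = per_sample_gradients(Ws, v, X, y)
    G = np.stack([g / (np.linalg.norm(g) + 1e-12) for g in grads])
    S = G @ G.T                               # pairwise cosine similarities
    return float(S[np.triu_indices(n_data, k=1)].min())
```

Running `min_pairwise_cosine` over a range of depths (averaged over seeds) gives a quick empirical check of how the worst-case cosine similarity behaves as the network gets deeper.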
6.2 The effect of network layer width
Consider a weight matrix containing i.i.d. entries. Then for any vector ,
Similarly, for any vector , we see that . Thus, a layer with weights can grow or shrink an input vector (which can be the outputs of the previous layer) on either the forward or the backward propagation at initialization time, depending on the variance of the entries in .
The operator norm of the weight matrices being close to 1 is important for the trainability of neural networks, as it ensures that the signal passes through the network without exploding or shrinking across layers [GB10]. As we mention in Section 5, typical weight initialization techniques ensure that the operator norm is bounded by 1 with high probability. This indicates that we would expect the effect of width on convergence and gradient confusion to be much less pronounced than the effect of depth for typical neural net designs. This is also why, once we assume for each weight matrix in our results in Section 5, the dependence of gradient confusion on width goes away in general. A simple example illustrates this: consider the case where each weight matrix in the neural network has exactly one nonzero element, which is set to 1. The operator norm of each such weight matrix is 1, but the forward- or backward-propagated signals do not depend on the width.
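The scaling behavior described above can be checked numerically. A minimal sketch, assuming i.i.d. Gaussian entries (the function name is ours): since E||Wx||² = n·σ²·||x||² for an n-by-n matrix W with i.i.d. N(0, σ²) entries, choosing σ = 1/√n keeps the signal norm roughly preserved, while larger or smaller σ grows or shrinks it.

```python
import numpy as np

def norm_ratio(n, sigma, trials=200, seed=0):
    """Average ||W x|| / ||x|| over random n-by-n matrices W with
    i.i.d. N(0, sigma^2) entries and random inputs x."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        W = rng.normal(0.0, sigma, (n, n))
        x = rng.standard_normal(n)
        ratios.append(np.linalg.norm(W @ x) / np.linalg.norm(x))
    return float(np.mean(ratios))
```

For n = 100, `norm_ratio(100, 0.1)` (i.e., σ = 1/√n) stays close to 1, while doubling or halving σ roughly doubles or halves the ratio, illustrating how entry variance controls signal growth or decay per layer.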
That being said, it is interesting to consider how, under specific weight initialization strategies, the layer width affects gradient confusion. To this end, consider a simple two-layer linear network with , where , and . Further, assume that all elements of and are random and distributed as , while all elements of are distributed as . Under this initialization, and on adapting Proposition A.1 for least-squares loss functions in [CWZ18], we have that is , for any pair of data . Further, using the same proof technique, we have a bound on the variance: . From these results, we see that the gradient inner product concentrates around 0 with increasing width (this can be made formal using a straightforward application of Chebyshev's inequality). In other words, our results show that gradient confusion decreases with increasing width. See Figure 2 for an additional simulation on a deeper neural network where the weights are initialized using the Glorot initializer [GB10].
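The Chebyshev step mentioned above can be spelled out. Writing $Z_\ell = \langle \nabla f_i(w), \nabla f_j(w) \rangle$ for the gradient inner product between two samples at layer width $\ell$ (our notation; the mean and variance rates are those stated above), we have

```latex
\Pr\big(|Z_\ell| \ge t\big)
  \;\le\; \Pr\big(|Z_\ell - \mathbb{E}[Z_\ell]| \ge t - |\mathbb{E}[Z_\ell]|\big)
  \;\le\; \frac{\operatorname{Var}(Z_\ell)}{\big(t - |\mathbb{E}[Z_\ell]|\big)^2}
  \qquad \text{for } t > |\mathbb{E}[Z_\ell]|.
```

Since both $\mathbb{E}[Z_\ell]$ and $\operatorname{Var}(Z_\ell)$ vanish as $\ell \to \infty$, the right-hand side goes to $0$ for every fixed $t > 0$, so the gradient inner product concentrates at $0$ with increasing width.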
Note that these results hold for the specific initialization strategy mentioned. Nonetheless, some other recent work also provides evidence on the benefits of width, by showing that the learning dynamics of gradient descent simplify significantly for sufficiently wide networks under certain parameterizations [JGH18, LXS19]. In the next section (as well as in Appendix A.2), we show empirical evidence that, given a sufficiently deep network, increasing the layer width often helps in lowering gradient confusion and speeding up convergence close to the minimizer for a range of neural network models.
7 Experimental results
In this section, we present experimental results showing the effect of neural network architecture on convergence rates of SGD and gradient confusion. It is worth noting that Theorems 3.1 and 3.2 indicate that gradient confusion primarily affects the stationary phase of the optimizer and the final "noise floor" of constant stepsize SGD. Thus, we expect the effect of gradient confusion to be most prominent near the end of training, particularly when the gradient noise begins to dominate and the convergence curve has flattened out near this floor.
Experimental setup.
We perform experiments on wide residual networks (WRN) [ZK16] for an image classification task on CIFAR10. Appendix A contains more experiments on fully connected linear and nonlinear neural networks, where we show that the same qualitative results as presented in this section hold for a wide variety of neural net architectures with different activation functions and weight initializations.
For the rest of this section, we use WRN-d-k to denote a WRN with depth d and width factor k (the width factor is the number of filters relative to the original ResNet model; e.g., a factor of 1 corresponds to the original ResNet, and a factor of 2 means the network is twice as wide). See Section E in the Appendix for more details on the WRN architecture. We turn off dropout for all our experiments. Our first round of experimental networks have no skip connections or batch normalization [IS15], so as to stay as close as possible to the assumptions of our theorems. Later on, we study the effects that skip connections and batch normalization have on the convergence rate and gradient confusion. We add biases to the convolutional layers when not using batch normalization. We use SGD as the optimizer without any momentum, and a weight decay parameter of . Following [ZK16], we train all experiments for 200 epochs with minibatches of size 128, and reduce the initial learning rate by a factor of 10 at epochs 80 and 160. While a decreasing learning rate schedule is not required to see the effect of the neural net architecture on SGD and gradient confusion, it significantly speeds up training, and thus we follow the same schedule as in [ZK16]. See Appendix A for more experiments with constant learning rates. We use the MSRA initializer [HZRS15] for the weights, as is standard for this model, and the same preprocessing steps for the CIFAR10 images as described in [ZK16]. We tune the initial learning rate for each model over a logarithmically spaced grid (, , , , , ) and select the run that achieves the lowest training loss value. Our grid search was such that the optimal learning rate never occurred at one of the extreme values tested.
To measure gradient confusion, at the end of every training epoch, we sample 100 minibatches, each of size 128 (the same as the training batch size). We compute gradients on each of these minibatches, and then calculate all pairwise cosine similarities. To measure the worst-case gradient confusion, we take the lowest gradient cosine similarity among all pairs. We also plot a histogram of the pairwise gradient cosine similarities of the 100 minibatches sampled at the end of training (after 200 epochs), to see how concentrated the distribution is.
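The measurement just described can be sketched as follows; the helper below is our own illustration, not code from the paper, and takes already-computed minibatch gradient vectors as input.

```python
import numpy as np

def confusion_stats(grads):
    """Pairwise cosine similarities among minibatch gradient vectors.

    Returns (min_sim, sims): the minimum over all distinct pairs is the
    worst-case gradient confusion proxy; the full array `sims` can be
    histogrammed to inspect how the distribution concentrates around 0."""
    G = np.stack([g / np.linalg.norm(g) for g in grads])  # unit-normalize
    S = G @ G.T                                           # cosine matrix
    sims = S[np.triu_indices(len(grads), k=1)]            # distinct pairs only
    return float(sims.min()), sims
```

With 100 minibatch gradients, this evaluates 100·99/2 = 4950 pairs, matching the histogram and minimum reported in the experiments.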
Effect of depth.
To test our theoretical results, and in particular Theorem 5.3, we consider a WRN with no batch normalization and no skip connections. This makes the network behave like a typical deep convolutional neural network. We keep the width fixed and change the depth over the networks WRN-16-2, WRN-40-2 and WRN-100-2. From Figure 3, we see that the experiments back our theoretical results: smaller depth leads to faster convergence (and a better noise floor), as well as a better bound on the gradient confusion. We also notice that with smaller depth, the histogram of pairwise gradient cosine similarities concentrates more sharply around 0 (indicating more orthogonal gradients), which makes the network easier to train. See Appendix A.1 for further experiments, where the same qualitative results hold.

Effect of width.
Using the same experimental setup as above, we now test the effect of increasing width while keeping the depth fixed, by considering the networks WRN-28-1, WRN-28-2 and WRN-28-10. From Figure 4, we see that increased width helps in faster convergence, lower gradient confusion, and more orthogonal gradients. These results indicate that as long as gradients don't contradict each other, orthogonal gradients lead to faster convergence. Note that the smallest network considered here is still highly overparameterized (approximately 0.37 million parameters), and can reach performance levels close to the WRN-40-2 model considered above. See Appendix A.2 for further experiments, where again the same qualitative results hold.
Effect of batch normalization and skip connections.
To help understand why most standard neural nets are so efficiently trained using SGD, we test the effect that batch normalization and skip connections have on a deep, thin model: WRN-40-2. Figure 5 shows results where we start with a network with no batch normalization and no skip connections, and then progressively add them to the network. To make the comparison fair, so that all networks start from the same initialization point, for the network without batch normalization we use a reparameterization equivalent to batch normalization computed on the first sampled minibatch of data (see [MM15, SK16, KDDD15] for similar data-dependent initializations).
In Figure 5, we see that adding batch normalization makes a big difference in the convergence speed as well as in lowering gradient confusion. Adding skip connections on top of this further accelerates training. Notice that batch normalization leads to a very sharp concentration of the gradient cosine similarity, and makes the gradients behave like random vectors drawn from a sphere, making optimization more stable; an observation consistent with previous works [BFL17, STIM18].
8 Additional discussion of some related work
The convergence of SGD on overparameterized models has received a lot of attention. The authors of [ACH18] study the behavior of SGD on overparameterized problems, and show that SGD on overparameterized linear neural nets is similar to applying a certain preconditioner while optimizing. This can sometimes lead to acceleration when overparameterizing by increasing the depth of linear neural networks. In this paper, we show that this property does not hold in general (as mentioned briefly in [ACH18]), and that convergence typically slows down because of gradient confusion when training very deep networks.
The behavior of SGD on overparameterized problems was also studied in [MBB17, BBM18, VBS18, SR13]. These works use an overfitting condition observed for some overparameterized neural nets (particularly for convnets), where the minimizer returned by the optimizer simultaneously minimizes the loss on each individual data sample, to show fast convergence of SGD. In contrast, we aim to establish a more direct relationship between width, depth, problem dimensionality, and the error floor of SGD convergence.
Other works have studied the impact of structured gradients on SGD. [BFL17] study the effects of shattered gradients at initialization for ReLU networks, which is when (nonstochastic) gradients at different (but close) locations in parameter space become negatively correlated. The authors show how gradients get increasingly shattered with depth in ReLU networks.
[Han18] shows that the variance of gradients in fully connected networks with ReLU activations is exponential in the sum of the reciprocals of the hidden layer widths at initialization. Further, [HR18] show that this same sum of reciprocals determines the variance of the sizes of the activations at each layer at initialization. When this sum of reciprocals is too large, early training dynamics are very slow, suggesting both the difficulty of starting training on deeper networks and the benefits of increased width.
In [YPL17], the authors define a property called gradient diversity. This quantity is related to gradient confusion, but with important differences. Gradient diversity also measures the degree to which individual gradients at different data samples differ from each other. This measure gets larger as the individual gradients become orthogonal to each other, and increases further as the gradients start pointing in opposite directions. In a large batch, higher gradient diversity is desirable, and it leads to improved convergence rates in distributed settings, as shown in [YPL17]. On the other hand, the gradient confusion between two individual gradients is zero unless the inner product between them is negative. This is useful for studying convergence rates of small minibatch SGD, since possible gradient updates do not conflict unless they are negatively correlated with each other. The definition of gradient diversity in [YPL17] also has important implications when its behavior is studied in overparameterized settings. [CWZ18] extend the work of [YPL17], proving for 2-layer neural nets (and multi-layer linear neural nets) that gradient diversity increases with increased width and decreased depth. This does not, however, distinguish between the cases where gradients become more orthogonal vs. more negatively correlated. In this paper, we show that these can have very different effects on the convergence of SGD in overparameterized settings, and we view the two works as complementary.
There has recently also been interest in analyzing conditions under which SGD converges to global minimizers of overparameterized linear and nonlinear neural networks. [ACGH18] shows SGD converges linearly to global minimizers for linear neural nets under certain conditions. [DZPS18, AZLS18, ZCZG18, BGMSS17] also show convergence to global minimizers of SGD for nonlinear neural nets. While all these results require the network to be sufficiently wide, they represent an important step in the direction of better understanding optimization on neural nets. This paper complements these recent results by studying how one property, low gradient confusion, contributes to SGD’s success on overparameterized neural nets.
9 Conclusions, limitations and future work
In this paper, we investigate how overparameterization affects the dynamics of SGD on neural networks. We introduce a concept called gradient confusion, and show that when gradient confusion is low, SGD has better convergence properties than predicted by classical theory. Further, using both theoretical and empirical results, we show that overparameterization by increasing the number of parameters of linear models or by increasing the width of neural network layers leads to lower gradient confusion, making the models easier to train. In contrast, overparameterization by increasing the depth of neural networks results in higher gradient confusion, making deeper models harder to train. We further show evidence of how techniques like batch normalization and skip connections in residual networks help in tackling this problem.
Note that many previous results have shown that deeper models are better at modeling higher-complexity function classes than wider models, and thus depth is essential for the success of neural networks [ES16, Tel16, RPK17]. Our results therefore indicate that, given a sufficiently deep network, increasing the network width is important for the trainability of the model and leads to faster convergence rates. This is further supported by other recent research [Han18, HR18] showing that the width should increase linearly with depth in a neural network to help dynamics at the beginning of training (i.e., at initialization). Our results also suggest the importance of further investigation into good initialization schemes for neural networks that make training very deep models possible. See [ZDM19] for some recent advances in this direction.
We consider the main limitation of this work to be the use of the random dataset that consists of i.i.d. samples of isoperimetric vectors to derive the concentration bounds in Section 5. While this simple data model helps make the analysis tractable and gives insights into the effect of overparameterization, it is not clear to what extent real-world datasets follow such a model (i.e., i.i.d. data and an isoperimetric property in the feature space). Nonetheless, these are standard assumptions in statistical learning theory, since they allow for mathematical analysis. To make our observations on this model robust, we show that the same qualitative results hold (i) with arbitrary data and random weights in Section 6 (i.e., at initialization), and (ii) in the experimental setting (Section 7 and Appendix A). For future work, it would be interesting to extend this analysis to richer random data models. It would also be interesting to gain a better understanding of how layer width impacts long-term training dynamics, since the analysis here focuses on behavior near initialization.

An active area of research currently is to better understand how overparameterization and neural net architecture promote generalization [NLB18, AZLL18, BHMM18, NMB18]. An interesting topic for future work is whether there is a connection between gradient confusion, sharp and flat minimizers, and generalization for SGD. See [FNN19] for some recent work in this direction.
Acknowledgements
The authors thank Brendan O’Donoghue, James Martens, Sudha Rao, and Sam Smith for helpful discussions and for reviewing earlier versions of this manuscript.
References
 [ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
 [ACH18] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
 [AZLL18] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
 [AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018.
 [BBM18] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of sgd in non-convex overparametrized learning. arXiv preprint arXiv:1811.02564, 2018.
 [Ben12] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.

 [Ber11] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(138):3, 2011.
 [BFL17] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017.
 [BGMSS17] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
 [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
 [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 [BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [CCS16] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 [Coo18] Y Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018.

 [CT18] Jerry Chee and Panos Toulis. Convergence diagnostics for stochastic gradient descent with constant learning rate. In International Conference on Artificial Intelligence and Statistics, pages 1476–1485, 2018.
 [CWZ18] Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, and Paraschos Koutris. The effect of network width on the performance of large-batch training. arXiv preprint arXiv:1806.03791, 2018.
 [DFB17] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
 [DM92] Christian Darken and John Moody. Towards faster stochastic gradient search. In Advances in neural information processing systems, pages 1009–1016, 1992.
 [DR17] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
 [DYJG17] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pages 1504–1513, 2017.
 [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
 [ES16] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
 [FB15] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a stepsize. In Conference on Learning Theory, pages 658–695, 2015.
 [FNN19] Stanislav Fort, Paweł Krzysztof Nowak, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.
 [GB10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [Gra06] Robert M Gray. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155–239, 2006.
 [GS16] Tom Goldstein and Christoph Studer. Phasemax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.
 [Han18] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems, pages 582–591, 2018.
 [HR18] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Advances in Neural Information Processing Systems, pages 571–581, 2018.

 [HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 [HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8580–8589, 2018.
 [KDDD15] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
 [KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
 [KS17] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
 [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [LBOM12] Yann A LeCun, Léon Bottou, Genevieve B Orr, and KlausRobert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
 [Loj65] Stanislaw Lojasiewicz. Ensembles semi-analytiques. Lectures Notes IHES (Bures-sur-Yvette), 1965.
 [LXS19] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

 [MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
 [MBB17] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern overparametrized learning. arXiv preprint arXiv:1712.06559, 2017.
 [MM15] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
 [MS86] Vitali D Milman and Gideon Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. SpringerVerlag, Berlin, Heidelberg, 1986.
 [Mur98] Noboru Murata. A statistical study of online learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, pages 63–92, 1998.
 [NB01] Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pages 223–264. Springer, 2001.
 [Nes83] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). 1983.
 [NH17] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2603–2612. JMLR.org, 2017.
 [NK19] Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
 [NLB18] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of overparametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 [NMB18] Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon LacosteJulien, and Ioannis Mitliagkas. A modern take on the biasvariance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.
 [NWS14] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
 [OS18] Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
 [Pol64] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [RPK17] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2847–2854. JMLR.org, 2017.
 [SEG17] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of overparametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
 [SGL18] Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. arXiv preprint arXiv:1805.10408, 2018.
 [SK16] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

 [SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
 [SR13] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
 [STIM18] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.
 [SZ13] Ohad Shamir and Tong Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
 [SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

 [TA17] Panos Toulis, Edoardo M Airoldi, et al. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
 [Tao12] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
 [Tel16] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
 [VBS18] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for overparameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.

 [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
 [WRS17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4151–4161, 2017.
 [WZ17] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
 [YPL17] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. arXiv preprint arXiv:1706.05699, 2017.
 [ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes overparameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
 [ZDM19] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.
 [ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix A Additional experimental results
To further test the main claims in the paper, we perform additional experiments on an image classification problem on the MNIST dataset using fully connected neural networks. We iterate over neural networks of varying depth and width, and consider both the identity activation function (i.e., linear neural networks) and the tanh activation function. We also consider two different weight initializations that are popularly used and appropriate for these activation functions:

The Glorot normal initializer [GB10], with weights initialized by sampling from the distribution , where fan-in denotes the number of input units in the weight matrix, and fan-out denotes the number of output units in the weight matrix.

The LeCun normal initializer [LBOM12] with weights initialized by sampling from the distribution .
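For concreteness, the two initializers can be sketched using their standard closed forms (Glorot normal uses standard deviation sqrt(2 / (fan-in + fan-out)), LeCun normal uses sqrt(1 / fan-in)); the function names below are ours, and `rng` is a NumPy random generator.

```python
import numpy as np

def glorot_normal(rng, fan_in, fan_out):
    """Glorot/Xavier normal [GB10]: std = sqrt(2 / (fan_in + fan_out))."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), (fan_out, fan_in))

def lecun_normal(rng, fan_in, fan_out):
    """LeCun normal [LBOM12]: std = sqrt(1 / fan_in)."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), (fan_out, fan_in))

rng = np.random.default_rng(0)
W1 = glorot_normal(rng, 784, 256)   # e.g., first MNIST layer: 784 inputs, width 256
```

Both choices keep the per-layer signal scale roughly constant at initialization, which is exactly the operator-norm consideration discussed in Section 6.2.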
We consider the simplified case where all hidden layers have the same width . Thus, the first weight matrix , where for the sized images of MNIST; all intermediate weight matrices ; and the final layer for the 10 image classes in MNIST. We also add biases to each layer, which we initialize to 0. We use softmax cross entropy as the loss function.
We choose this relatively simple model as it gives us the ability to iterate over a large number of combinations of network architectures of varying width and depth, and different activation functions and weight initializations. Linear neural networks are an efficient way to directly understand the effect of changing depth and width without increasing model complexity over linear regression. Thus, we consider both linear and nonlinear neural nets in our experiments.
We use SGD with constant learning rates for training, with a minibatch size of 128, and train each model for 40000 iterations (more than 100 epochs). The constant learning rate was tuned over a logarithmically-spaced grid. We ran each experiment 10 times, and picked the learning rate that achieved the lowest average training loss at the end of training. The grid was chosen wide enough that the optimal learning rate never occurred at one of its extreme values.
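This tuning protocol can be sketched as follows. The data, model, iteration counts, and grid values here are illustrative stand-ins (a small noiseless least-squares problem rather than the MNIST networks); only the structure of the search mirrors the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20))
w_true = rng.normal(size=20)
y = X @ w_true  # synthetic regression targets

def train_sgd(lr, n_iters=500, batch=128):
    # SGD with a constant learning rate; returns the final training loss
    w = np.zeros(20)
    for _ in range(n_iters):
        idx = rng.integers(0, len(X), size=batch)  # sample a minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# tune over a logarithmically spaced grid, averaging over repeated runs,
# and keep the learning rate with the lowest average final loss
grid = [1e-3, 1e-2, 1e-1]
avg_loss = {lr: np.mean([train_sgd(lr) for _ in range(3)]) for lr in grid}
best_lr = min(avg_loss, key=avg_loss.get)
```

In the paper's protocol, each grid point would instead be averaged over 10 runs of 40000 iterations on the full model.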
To measure gradient confusion at the end of training, we sample 1000 pairs of minibatches, each of size 128 (the same as the training batch size). We compute the gradient on each minibatch, and then the cosine similarity between the gradients of each pair. To measure the worst-case gradient confusion, we take the lowest gradient cosine similarity among all pairs; we also compute the average pairwise gradient cosine similarity over the 1000 pairs. We explore the effect of changing depth and changing width for the different activation functions and weight initializations. We plot the final training loss achieved by each model, as well as the minimum and average gradient cosine similarities over the 1000 pairs of gradients at the end of training. For each point, we plot both the mean and the standard deviation over the 10 independent runs.
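The measurement itself reduces to cosine similarities between independently sampled minibatch gradients. A sketch on a toy least-squares model (the data and helper names are hypothetical; in the paper the gradients come from the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=2000)
w = np.zeros(20)  # the trained parameters would go here

def minibatch_grad(w, batch=128):
    # gradient of the least-squares loss on a random minibatch
    idx = rng.integers(0, len(X), size=batch)
    return X[idx].T @ (X[idx] @ w - y[idx]) / batch

def cosine(g1, g2):
    return g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

# sample 1000 pairs of minibatch gradients and record cosine similarities
sims = np.array([cosine(minibatch_grad(w), minibatch_grad(w))
                 for _ in range(1000)])
min_sim = sims.min()   # worst-case gradient confusion
avg_sim = sims.mean()  # average pairwise similarity
```

Low (negative) values of `min_sim` indicate high gradient confusion; `avg_sim` tracks how concentrated the pairwise similarities are.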
a.1 The effect of depth
We first explore the effect of the depth of the neural network for these image classification models. To do this, we fix the width of the network and vary its depth on a logarithmic scale. Figure 6 shows results on linear neural networks for the two weight initializations considered (Glorot normal and LeCun normal). Figure 7 shows results on neural networks with tanh activation functions, for the same two weight initializations.
Similar to the experimental results in Section 7, and matching our theoretical results in Section 5, we see a consistent trend of gradient confusion increasing with depth. This makes deeper networks harder to train, as evidenced by an increase in the final training loss. At the largest depth considered, the increased gradient confusion effectively makes the network untrainable when using tanh nonlinearities. In Section 7, we further showed that increased depth results in lower concentration of the gradient cosine similarities (higher variance), and gradients become increasingly non-orthogonal with increased depth. We see the same effect in these experiments in the plots showing the average gradient cosine similarity.
a.2 The effect of width
We now explore the effect of width by varying the width of the neural network while keeping the depth fixed. We deliberately choose a very deep model, which is essentially untrainable at small widths (with standard initialization techniques) and thus better illustrates the effect of increasing width; the width is again varied on a logarithmic scale. Crucially, note that even the smallest network considered here has more than 50000 parameters (i.e., more than the number of training samples), and the widest network has almost three times as many parameters as the best-performing network in the previous section. Figures 8 and 9 show results on linear neural nets and neural nets with tanh activations for both the Glorot normal and LeCun normal initializations.
As in the experimental results of Section 7, we see a consistent trend of gradient confusion decreasing with increasing width. Thus, wider networks are easier to train and reach lower final training loss. We further see that when the width is too small, gradient confusion becomes drastically high and the network becomes completely untrainable. As in Section 7, we also observe from the average gradient cosine similarity plots that increasing width improves the concentration of the gradient cosine similarities (lower variance), effectively making the gradients more orthogonal and the network easier to train.
Appendix B Conditions for faster convergence
In Theorem 3.1, we show convergence to a neighborhood of a minimizer for problems satisfying the PL inequality and Lipschitz smoothness, and we give the learning rate that optimizes the rate of convergence in the transient phase, together with the corresponding geometric decay rate.
Faster convergence can be guaranteed if we strengthen the definition of confusion by bounding the correlation between $\nabla f_i(w)$ and $\nabla f_j(w')$ for all parameter vectors $w$ and $w'$, rather than at a single point. Compared to Theorem 3.1, convergence is then guaranteed with a larger learning rate that is independent of the training set size, and with faster geometric decay.
Theorem B.1.