# The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

The goal of this paper is to study why stochastic gradient descent (SGD) is efficient for neural networks, and how neural net design affects SGD. In particular, we investigate how overparameterization -- an increase in the number of parameters beyond the number of training data -- affects the dynamics of SGD. We introduce a simple concept called gradient confusion. When confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, we show that SGD has better convergence properties than predicted by classical theory. Using theoretical and experimental results, we study how overparameterization affects gradient confusion, and thus the convergence of SGD, on linear models and neural networks. We show that increasing the number of parameters of linear models or increasing the width of neural networks leads to lower gradient confusion, and thus faster and easier model training. We also show how overparameterization by increasing the depth of neural networks results in higher gradient confusion, making deeper models harder to train. Finally, we observe empirically that techniques like batch normalization and skip connections reduce gradient confusion, which helps reduce the training burden of deep networks.


### 1 Introduction

Stochastic gradient descent (SGD) [RM51] and its variants with momentum [SMDH13, Nes83, Pol64] have become the standard optimization routine for neural networks due to their fast convergence and good generalization properties [WRS17, KS17, SMDH13]. Yet the behavior of SGD on high-dimensional neural network models still eludes full theoretical understanding, both in terms of its convergence and generalization properties. In this paper, we study why SGD is so efficient at converging to low loss values on most standard neural networks, and how neural net architecture design affects this.

Classical stochastic optimization theory predicts that the learning rate of SGD needs to decrease over time for convergence to be guaranteed to the minimizer of a convex function [SZ13, Ber11]. For strongly convex functions, for example, such results show that a decreasing learning rate schedule of $\alpha_k = \mathcal{O}(1/k)$ is required to guarantee convergence to within $\epsilon$-accuracy of the minimizer in $\mathcal{O}(1/\epsilon)$ iterations, where $k$ denotes the iteration number. Typical stochastic optimization procedures experience a transient phase, where the optimizer makes progress towards a neighborhood of a minimizer, followed by a stationary phase, where the gradient noise starts to dominate the signal and the optimizer typically oscillates around the minimizer [DM92, Mur98, TA17, CT18]. With decaying learning rates of the form $\mathcal{O}(1/k)$ or $\mathcal{O}(1/\sqrt{k})$, the convergence of SGD in the transient phase can be very slow, typically leading to poor performance on standard neural network problems.

Neural networks operate in a regime where the number of parameters is much larger than the number of training data. In this regime, SGD seems to converge quickly with constant learning rates. So quickly, in fact, that neural net practitioners often use a constant learning rate for the majority of training, with exponentially decaying learning rate schedules towards the end, without seeing the method stall [KSH12, SZ14, HZRS16, ZK16]. With constant learning rates, theoretical guarantees show that SGD converges quickly to a neighborhood of the minimizer (i.e., fast convergence in the transient phase), but then reaches a noise floor beyond which it stops converging; this noise floor depends on the learning rate and the variance of the gradients at the minimizer [MB11, NWS14]. Some more recent results have shown that when models can over-fit the data completely while being strongly convex, convergence without a noise floor is possible without decaying the learning rate [SR13, MBB17, BBM18, VBS18]. While these results do give insights into why constant learning rates followed by an exponential decay might work well in practice [CT18], they fail to fully explain the efficiency of SGD on neural nets, and how it relates to overparameterization.

The behavior of SGD is also highly affected by the neural network architecture. It is common knowledge among neural network practitioners that deeper networks train slower [BSF94, GB10]. This has led to several innovations over the years to get deeper networks to train more easily, such as residual connections [HZRS16], careful initialization strategies [GB10, HZRS15, ZDM19], and various normalization schemes like batch normalization [IS15] and weight normalization [SK16]. Furthermore, there is ample evidence to indicate that wider networks are easier to train [ZK16, NH17, LXS19]. Several prior works have investigated the difficulties of training deep networks [GB10, BFL17], and the benefits of width [NH17, LXS19, DZPS18, AZLS18]. This work adds to the existing literature by identifying and analyzing a condition that affects the SGD dynamics on overparameterized neural networks.

Our contributions. The goal of this paper is to study why SGD is efficient for neural nets, and how neural net design affects SGD. Typical neural nets are overparameterized (i.e., the number of parameters exceeds the number of training points). We ask how this overparameterization, as well as the architecture of a neural net, affect the dynamics of SGD. We list the main contributions of this work below.

• We identify a condition, called gradient confusion, that controls the convergence properties of SGD on overparameterized models (defined in Section 2). When confusion is high, stochastic gradients produced by different data samples may be negatively correlated, causing slow convergence. On the other hand, when confusion is low, we show that convergence is accelerated, and SGD can converge faster and to a lower noise floor than predicted by classical theory, thus indicating a regime where constant learning rates would work well in practice (Sections 2 and 3).

• We then theoretically study the effect of overparameterization on the gradient confusion condition (Sections 4, 5 and 6). In Section 5, we show that on a large class of random input instances, and for a large class of weights, gradient confusion increases as the depth increases, indicating the difficulty in training very deep networks. The results require minimal assumptions, hold for a large family of neural networks with non-linear activations, and can be extended to a large class of loss functions. In particular, our results hold for fully connected and convolutional networks with the square-loss and logistic-loss functions, and also for commonly used non-linear activations such as sigmoid, tanh and ReLU.

• We show that the same qualitative results hold (i.e., gradient confusion increases as depth increases) when considering arbitrary data and random weights, i.e., at neural network initializations (Section 6). We further show evidence that wider networks tend to have lower gradient confusion for some standard initialization procedures (Section 6.2).

• Using experiments on standard models and datasets, we validate our theoretical results, and show that wider networks have better convergence properties and lower gradient confusion, while deeper networks have slower convergence with higher gradient confusion. We further show that innovations like batch normalization and skip connections in residual networks help lower gradient confusion, thus indicating why standard neural networks that employ such techniques are so efficiently trained using SGD (Section 7 and Appendix A).

### 2 Preliminaries

Notations. Throughout this paper, vectors are represented in bold lower-case and matrices in bold upper-case. We use $\vec{A}(i,j)$ to indicate the $(i,j)$-th cell of matrix $\vec{A}$ and $\vec{A}(i,:)$ for the $i$-th row of $\vec{A}$. $\|\vec{A}\|$ denotes the operator norm of $\vec{A}$. $[N]$ denotes the set $\{1, 2, \ldots, N\}$ and $[N]_0$ denotes the set $\{0, 1, \ldots, N\}$.

SGD basics. Given $N$ training points (specified by the corresponding loss functions $f_1, \ldots, f_N$), we use SGD to solve empirical risk minimization problems of the form

$$\min_{\vec{w}\in\mathbb{R}^d} F(\vec{w}) := \min_{\vec{w}\in\mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} f_i(\vec{w}), \tag{1}$$

using the following iterative update rule for $T$ rounds:

$$\vec{w}_{k+1} = \vec{w}_k - \alpha_k \nabla \tilde{f}_k(\vec{w}_k). \tag{2}$$

Here $\alpha_k$ is the learning rate and $\tilde{f}_k$ is a function chosen uniformly at random from $\{f_1, \ldots, f_N\}$ at iteration $k$. In this paper, we consider constant learning rates $\alpha_k = \alpha$. We use $\vec{w}^\star$ to denote the optimal solution, i.e., $\vec{w}^\star = \arg\min_{\vec{w}\in\mathbb{R}^d} F(\vec{w})$.

Gradient confusion. SGD works by iteratively selecting a random function $\tilde{f}_k$, and modifying the parameters to move in the direction of the negative gradient of this objective term, without considering the effect on the other terms. It may happen that the selected gradient is negatively correlated with the gradient of another term $f_j$. When the gradients of different mini-batches are negatively correlated, the objective terms disagree on which direction the parameters should move, and we say that there is gradient confusion (this is related to gradient diversity [YPL17], but with important differences, which we describe in Section 8).

A set of objective functions $\{f_1, \ldots, f_N\}$ has gradient confusion bound $\eta \ge 0$ if the pairwise inner products between gradients satisfy, for a fixed $\vec{w}$,

$$\langle \nabla f_i(\vec{w}), \nabla f_j(\vec{w}) \rangle \ge -\eta, \quad \forall\, i \ne j \in [N]. \tag{3}$$

SGD converges fast when the gradient confusion is low. To see why, consider the case of training a logistic regression model on a dataset with $N$ orthogonal vectors. We have $f_i(\vec{w}) = \ell(y_i \cdot \vec{x}_i^\top \vec{w})$, where $\ell$ is the logistic loss, $\{\vec{x}_i\}_{i=1}^N$ is a set of orthogonal training vectors, and $y_i \in \{-1, +1\}$ is the label for the $i$-th training example. We then have $\nabla f_i(\vec{w}) = y_i\, \ell'(y_i \cdot \vec{x}_i^\top \vec{w})\, \vec{x}_i$, where $\ell'$ denotes the derivative of $\ell$. Because of our orthogonal data assumption, the gradient confusion is 0, since $\langle \nabla f_i(\vec{w}), \nabla f_j(\vec{w}) \rangle \propto \vec{x}_i^\top \vec{x}_j = 0$ for all $i \ne j$. Because of gradient orthogonality, an update in the direction of $\nabla f_i$ has no effect on the loss value of $f_j$ for $j \ne i$. In this case, SGD decouples into a (deterministic) gradient descent on each objective term separately, and we can expect to see the fast rates of convergence attained by deterministic gradient descent, rather than the slow rates of SGD.
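To make the orthogonal-data example concrete, here is a minimal numpy sketch (the dimensions, seed, and labels are arbitrary illustrative choices): it builds orthonormal training vectors, forms the per-example logistic-loss gradients, and checks that all pairwise gradient inner products vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
d = N = 8                                        # d >= N allows N mutually orthogonal inputs
X = np.linalg.qr(rng.normal(size=(d, N)))[0].T   # rows x_i are orthonormal vectors
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d)

def grad_fi(i, w):
    """Gradient of f_i(w) = log(1 + exp(-y_i * x_i^T w)): a scalar multiple of x_i."""
    z = y[i] * (X[i] @ w)
    return (-y[i] / (1.0 + np.exp(z))) * X[i]

# Each gradient is a multiple of its own x_i, so orthogonal data means zero confusion.
confusion = max(abs(grad_fi(i, w) @ grad_fi(j, w))
                for i in range(N) for j in range(N) if i != j)
print(confusion)  # ~0, up to floating-point error
```

Since every gradient lies along its own data vector, the pairwise inner products inherit the orthogonality of the data exactly.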

Can we expect a problem to have low gradient confusion in practice? It is known that randomly chosen vectors in high dimensions are nearly orthogonal with high probability [MS86, GS16, Ver18] (this statement is formalized by Lemma 4.1 below). For this reason, we would expect an average-case (i.e., random) problem to have nearly orthogonal gradients, provided that we don't train on too many training vectors (as the number of training vectors grows relative to the dimension, it becomes likely that we will see two training vectors with large negative correlation). In other words, we should expect a random optimization problem to have low gradient confusion when the number of parameters is "large" and the number of training data is "small" – i.e., when the model is overparameterized. This is further evidenced by a simple toy example in Figure 1, where we show that a slightly overparameterized linear regression model can have much faster convergence (without any noise floor), as well as a positive average gradient cosine similarity, compared to an underparameterized model.

The above arguments are rather informal, and ignore issues like non-convexity and the effect of the structure of neural networks. Furthermore, it is unclear whether we can expect low levels of gradient confusion in practice, and what effect non-zero confusion has on convergence rates. Below, we present a rigorous argument that low confusion levels accelerate SGD and help achieve faster convergence and lower noise floors for non-convex problems. Then, we turn to the issue of over-parameterization, and study how it affects gradient confusion, and how this depends on neural network architecture. Finally, we use computational experiments to show that gradient confusion is low for standard neural nets used in practice, and that this effect contributes to the superior optimization performance of SGD.

### 3 SGD is efficient when gradient confusion is low

We now present a rigorous analysis of gradient confusion and its effect on SGD. Several prior papers have analyzed the convergence rates of constant learning rate SGD [NB01, MB11, NWS14, FB15, DFB17]. These results show that for strongly convex and Lipschitz smooth functions, SGD with a constant learning rate converges linearly to a neighborhood of the minimizer. The noise floor it converges to depends on the learning rate $\alpha$ and the variance of the gradients at the minimizer, i.e., $\mathbb{E}\|\nabla f_i(\vec{w}^\star)\|^2$. To guarantee convergence to $\epsilon$-accuracy in such a setting, the learning rate needs to be small, i.e., $\alpha = \mathcal{O}(\epsilon)$, and the method requires $\mathcal{O}(1/\epsilon)$ iterations. Some more recent results show convergence of constant learning rate SGD without a noise floor and without small step sizes using an "overfitting" condition, i.e., where the model can completely overfit the data [SR13, MBB17, BBM18, VBS18]. The condition effectively translates to assuming $\nabla f_i(\vec{w}^\star) = \vec{0}$ for all $i \in [N]$, getting rid of the noise floor.

The gradient confusion bound is related to the overfitting condition. Note that if $\nabla f_i(\vec{w}^\star) = \vec{0}$ for all $i \in [N]$, then $\langle \nabla f_i(\vec{w}^\star), \nabla f_j(\vec{w}^\star) \rangle = 0$ for all $i \ne j$. This implies that the gradient confusion at the minimizer is small when the variance of the gradients at the minimizer is small. Further note that when the variance of the gradients at the minimizer is small, i.e., $\mathbb{E}\|\nabla f_i(\vec{w}^\star)\|^2 = \mathcal{O}(\epsilon)$, a direct application of the results in [MB11, NWS14] shows that constant learning rate SGD has fast convergence to $\mathcal{O}(\epsilon)$-accuracy in $\mathcal{O}(\log(1/\epsilon))$ iterations, without the learning rate needing to be vanishingly small.

Bounded gradient confusion does not, however, provide a bound on the variance of the gradients. Thus, it is instructive to derive convergence bounds of SGD explicitly in terms of the gradient confusion bound, to properly understand its effect. We begin by looking at the case where the objective satisfies the Polyak-Lojasiewicz (PL) inequality [Loj65], a condition related to, but weaker than, strong convexity, and used in recent work in stochastic optimization [KNS16, DYJG17]. Using the PL inequality, we provide tight bounds on the rate of convergence in terms of the optimality gap. Then we look at a broader class of smooth non-convex functions, and analyze convergence to a stationary point.

We first make two standard assumptions about the objective function.

• (A1) The individual functions $f_i$ are $L$-Lipschitz smooth:

$$f_i(\vec{w}') \le f_i(\vec{w}) + \nabla f_i(\vec{w})^\top (\vec{w}' - \vec{w}) + \frac{L}{2}\|\vec{w}' - \vec{w}\|^2, \quad \forall\, i \in [N].$$

• (A2) The individual functions $f_i$ satisfy the PL inequality:

$$\frac{1}{2}\|\nabla f_i(\vec{w})\|^2 \ge \mu \left(f_i(\vec{w}) - f_i^\star\right), \quad \forall\, i \in [N],$$

where $f_i^\star = \min_{\vec{w}\in\mathbb{R}^d} f_i(\vec{w})$.

We now state the following convergence result of constant learning rate SGD in terms of the gradient confusion bound.

###### Theorem 3.1.

If the objective function satisfies A1 and A2, and has gradient confusion bound $\eta$ (eq. 3), SGD with updates of the form (2) converges linearly to a neighborhood of the minima of problem (1) as:

$$\mathbb{E}[F(\vec{w}_k) - F^\star] \le \rho^k \left(F(\vec{w}_0) - F^\star\right) + \frac{\alpha\eta}{1-\rho},$$

where $F^\star = \min_{\vec{w}} F(\vec{w})$, $\alpha$ is the constant learning rate, and $\rho < 1$ is a contraction factor depending on $\alpha$, $\mu$ and $L$.

###### Proof.

See Appendix C.1 for the proof. ∎

This result shows that SGD converges linearly to a neighborhood of a minimizer, and the size of this neighborhood depends on the level of gradient confusion. When there is no confusion, $\eta = 0$ and SGD converges directly to a minimizer. Further, when the gradient confusion is small, i.e., $\eta = \mathcal{O}(\epsilon)$, SGD has fast convergence to $\mathcal{O}(\epsilon)$-accuracy in $\mathcal{O}(\log(1/\epsilon))$ iterations, without requiring the learning rate to be vanishingly small.
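As a toy illustration of this noise-floor behavior (our own sketch, not one of the paper's experiments), consider constant-step SGD on a one-dimensional sum of quadratics with conflicting gradients at the minimizer. The optimality gap at stationarity shrinks with the learning rate $\alpha$, consistent with a noise floor of the form $\alpha\eta/(1-\rho)$:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(size=10)     # f_i(w) = 0.5 * (w - b_i)^2, so F has minimizer w* = mean(b)
w_star = b.mean()

def noise_floor(alpha, iters=20000):
    """Average optimality gap F(w) - F(w*) over the last 1000 iterates of constant-step SGD."""
    w, tail = 5.0, []
    for k in range(iters):
        i = rng.integers(len(b))
        w -= alpha * (w - b[i])                   # stochastic gradient step on a random f_i
        if k >= iters - 1000:
            tail.append(0.5 * (w - w_star) ** 2)  # equals F(w) - F(w*) for this problem
    return float(np.mean(tail))

big, small = noise_floor(0.5), noise_floor(0.05)
print(big > small)  # True: a smaller learning rate yields a lower noise floor
```

Halving the learning rate repeatedly (as practitioners do late in training) correspondingly lowers the floor, matching the discussion after Theorems 3.1 and 3.2.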

Convergence on general smooth non-convex functions. We now show that low gradient confusion leads to fast convergence on more general smooth non-convex functions.

###### Theorem 3.2.

If the objective satisfies A1 and the gradient confusion bound $\eta$ (eq. 3), then SGD converges to a neighborhood of a stationary point as:

$$\min_{k=1,\ldots,T} \mathbb{E}\|\nabla F(\vec{w}_k)\|^2 \le \frac{\rho\left(F(\vec{w}_1) - F^\star\right)}{T} + \alpha\rho\eta,$$

for a constant learning rate $\alpha \le 1/L$ and a constant $\rho$ depending on $\alpha$ and $L$.

###### Proof.

See Appendix C.1 for the proof. ∎

Theorems 3.1 and 3.2, similar to previous constant learning rate SGD convergence results, predict an initial transient phase of optimization with fast convergence to the neighborhood of a minimizer or a stationary point. This behavior is often observed when optimizing popular deep neural network models [DM92, SMDH13]; there is often an initial phase of fast convergence where a constant learning rate reaches a high level of accuracy on the model. This is typically followed by slow local convergence in the stationary phase where drops in the objective function are achieved by employing exponentially decreasing learning rate schedules [KSH12, SZ14, HZRS16, ZK16] (which from these theorems, would be equivalent to exponentially decreasing the noise floor that the algorithm converges to).

Note that the constants in Theorems 3.1 and 3.2 are slightly worse than those shown in previous work [MB11, NWS14], resulting in a slower convergence rate (in terms of constants). This is possibly an artifact of the analysis, and the constants can probably be improved. See Appendix B for further discussion, where we explore a strengthened gradient confusion bound that guarantees faster local convergence. That being said, the main intention of these theorems is to show the direct effect that the gradient confusion bound has on the convergence rate and on the noise floor that constant learning rate SGD converges to. This new perspective helps us more directly understand how overparameterization affects the gradient confusion bound, and thus the convergence properties of SGD, which we explore in the following sections.

### 4 Low gradient confusion and low-rank Hessians

While in Section 3 we showed that SGD is more efficient when gradient confusion is low, this raises the question of whether commonly-used neural network models have low gradient confusion, which would help explain SGD's efficiency on them. There is some evidence that the Hessian at the minimizer is very low rank for many standard overparameterized neural network models [SEG17, Coo18, CCS16, WZ17]. In this section, we show that the gradient confusion bound (eq. 3) is often low for a large class of parameter configurations for problems with random, low-rank Hessians.

The simplest case of this (which we already saw above) occurs for losses of the form $f_i(\vec{w}) = \ell(y_i \cdot \vec{x}_i^\top \vec{w})$, for some function $\ell$, which includes logistic regression. In this case, the Hessian of each $f_i$ is rank-1. We have:

$$|\langle \nabla f_i(\vec{w}), \nabla f_j(\vec{w}) \rangle| = |\vec{x}_i^\top \vec{x}_j| \cdot |\ell'(y_i \vec{x}_i^\top \vec{w})| \cdot |\ell'(y_j \vec{x}_j^\top \vec{w})|,$$

where we denote the derivative of the function $\ell$ by $\ell'$. This inner product is expected to be small for all $\vec{w}$: the logistic loss satisfies $|\ell'(z)| \le 1$ for all $z$, and for fixed $N$ the quantity $|\vec{x}_i^\top \vec{x}_j|$ is $\mathcal{O}(1/\sqrt{d})$ whenever $\vec{x}_i, \vec{x}_j$ are randomly sampled from a unit sphere. (More generally, this is true whenever $\vec{x} = \vec{z}/\sqrt{d}$, where $\vec{z}$ is an isotropic random vector; see [Ver18], Remark 3.2.5.) Specifically, we have the following lemma, which is often attributed to [MS86] (see Appendix C.2 for a short proof).

###### Lemma 4.1 (Near orthogonality of random vectors).

For $N$ vectors $\{\vec{x}_i\}_{i=1}^N$ drawn uniformly from a unit sphere in $d$ dimensions, and for any $\nu > 0$,

$$\Pr\Big[\max_{i \ne j} |\vec{x}_i^\top \vec{x}_j| > \nu\Big] \le N^2 \sqrt{\frac{\pi}{8}}\, \exp\left(-\frac{(d-1)\,\nu^2}{2}\right).$$
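A quick Monte Carlo check of this near-orthogonality (a sketch; the sample sizes and trial count are arbitrary choices) shows the maximum pairwise coherence of random spherical points shrinking as the dimension $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pairwise_coherence(N, d, trials=200):
    """Empirical mean of max_{i != j} |x_i^T x_j| for N uniform points on the unit sphere."""
    vals = []
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalizing Gaussians gives uniform sphere samples
        G = np.abs(X @ X.T)
        np.fill_diagonal(G, 0.0)                       # ignore the trivial i = j entries
        vals.append(G.max())
    return float(np.mean(vals))

# The maximum coherence shrinks roughly like sqrt(log(N)/d) as the dimension grows.
print([round(max_pairwise_coherence(20, d), 3) for d in (10, 100, 1000)])
```

This matches the lemma: the failure probability for a fixed threshold $\nu$ decays exponentially in $d$.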

For general classes of functions, suppose, for clarity in presentation, that each $f_i$ has a minimizer at the origin (the same argument can be easily extended to the more general case). Suppose also that there is a Lipschitz constant $L_H$ for the Hessian of each function that satisfies $\|\vec{H}_i(\vec{w}) - \vec{H}_i(\vec{0})\| \le L_H \|\vec{w}\|$. Then $\nabla f_i(\vec{w}) = \vec{H}_i \vec{w} + \vec{e}_i(\vec{w})$, where $\vec{e}_i(\vec{w})$ is an error term bounded as $\|\vec{e}_i(\vec{w})\| \le \frac{1}{2} L_H \|\vec{w}\|^2$, and we use the shorthand $\vec{H}_i$ to denote $\vec{H}_i(\vec{0})$. In this case, the inner product between two gradients is bounded as:

$$|\langle \nabla f_i(\vec{w}), \nabla f_j(\vec{w}) \rangle| \le |\langle \vec{H}_i \vec{w}, \vec{H}_j \vec{w} \rangle| + \frac{1}{2} L_H \|\vec{w}\|^3 \left(\|\vec{H}_i\| + \|\vec{H}_j\|\right) \le \|\vec{w}\|^2 \|\vec{H}_i \vec{H}_j\|_2 + \frac{1}{2} L_H \|\vec{w}\|^3 \left(\|\vec{H}_i\| + \|\vec{H}_j\|\right).$$

If the Hessians are sufficiently random and low-rank (e.g., of the form $\vec{H}_i = \vec{a}_i \vec{a}_i^\top$, where the $\vec{a}_i$ are randomly sampled from a unit sphere), then one would expect the terms in this expression to be small for all $\vec{w}$ within a neighborhood of the minimizer.
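The rank-1 case can be checked directly: if $\vec{H}_i = \vec{a}_i \vec{a}_i^\top$ for random unit vectors $\vec{a}_i$, then $\|\vec{H}_i \vec{H}_j\| = |\vec{a}_i^\top \vec{a}_j|$, which is small in high dimensions. A small numpy sketch (the dimension is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
a = rng.normal(size=(2, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)        # two random unit vectors a_1, a_2

H1, H2 = np.outer(a[0], a[0]), np.outer(a[1], a[1])  # random rank-1 "Hessians"

# H1 @ H2 = a_1 (a_1 . a_2) a_2^T, so its operator norm is exactly |a_1 . a_2|,
# which is O(1/sqrt(d)) for random directions.
op_norm = np.linalg.norm(H1 @ H2, 2)
print(op_norm, abs(a[0] @ a[1]))  # the two values coincide
```

So the leading term $\|\vec{w}\|^2 \|\vec{H}_i \vec{H}_j\|_2$ in the bound above is indeed small near the minimizer when the Hessians point in random directions.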

This indicates that for many standard neural network models, the gradient confusion might be low for a large class of weights near the minimizer. In the next two sections, we explore more formally, and in more detail, how overparameterization and the neural network architecture affect the probability with which the gradient confusion bound holds.

### 5 Effect of overparameterization on gradient confusion

To draw a more rigorous connection between overparameterization and neural network structure, we analyze gradient confusion for generic (i.e., random) model problems using methods from high-dimensional probability. We rigorously analyze the case where training data is randomly sampled from a unit sphere, and identify specific cases where gradient confusion (Definition 2) is low with high-probability. Our results require minimal additional assumptions, and hold for a large family of neural networks with non-linear activations and a large class of loss-functions. In particular, our results hold for fully connected and convolutional networks, with the square-loss and logistic-loss functions, and commonly used non-linear activations such as sigmoid, tanh and ReLU.

We consider synthetic training data of the form $\{(\vec{x}_i, y_i)\}_{i=1}^N$, with $y_i = h(\vec{x}_i)$ for some labeling function $h$, and with data points $\vec{x}_i$ drawn uniformly from the surface of a $d$-dimensional unit sphere. The concept $h$ being learned satisfies $|h(\vec{x})| \le 1$ and $|h(\vec{x}) - h(\vec{x}')| \le \mathcal{O}(\|\vec{x} - \vec{x}'\|)$ for all $\vec{x}, \vec{x}'$ on the unit sphere. Note that this automatically holds for every model considered in this paper where the concept is realizable (i.e., where the model can express the concept function using its parameters), and more generally, this assumes a Lipschitz condition on the labels (i.e., the concept function doesn't change too quickly with the input). In this paper, we consider two loss functions, namely, the square loss for regression and the logistic loss for classification. The square-loss function is defined as $f(g_{\vec{w}}; \vec{x}_i, y_i) = \frac{1}{2}\left(y_i - g_{\vec{w}}(\vec{x}_i)\right)^2$ and the logistic-loss function is defined as $f(g_{\vec{w}}; \vec{x}_i, y_i) = \log\left(1 + \exp(-y_i \cdot g_{\vec{w}}(\vec{x}_i))\right)$. Here, $g_{\vec{w}}$ denotes the parameterized function (either a linear function or a neural network) that we fit to the training data, and $f(g_{\vec{w}}; \vec{x}_i, y_i)$ denotes the loss of hypothesis $g_{\vec{w}}$ on data point $(\vec{x}_i, y_i)$.

Using tools from high-dimensional probability, we analyze below how the gradient confusion changes for a range of overparameterized models (including neural networks) with randomized training data. For clarity in presentation, we begin by analyzing the simple class of linear models.

#### 5.1 A simple case: linear models

We begin by examining gradient confusion in the case of fitting a simple linear model to data. In this case, the function $g_{\vec{w}}$ can be written as follows:

$$g_{\vec{w}}(\vec{x}) = \vec{w}^\top \vec{x}. \tag{4}$$

As stated earlier, we consider two loss functions, namely the square loss for regression and the logistic loss for classification. Both these functions have the following useful properties.

###### Lemma 5.1.

Consider the set of loss functions $\{f_i\}_{i \in [N]}$, where all $f_i$ are either the square-loss function or the logistic-loss function. Consider a linear model with weights satisfying $\|\vec{w}\| \le 1$. Consider the gradient of each function $f_i$. Note that we can write $\nabla f_i(\vec{w}) = \xi_i(\vec{w}) \cdot \vec{x}_i$, where we define the scalar $\xi_i(\vec{w}) := \partial f(g_{\vec{w}}; \vec{x}_i, y_i) / \partial g_{\vec{w}}(\vec{x}_i)$. Then we have the following properties.

1. When $\|\vec{w}\| \le 1$, we have $|\xi_i(\vec{w})| \le \zeta_0$, $\forall i \in [N]$.

2. There exists a constant $\zeta_0 > 0$ such that $\|\nabla f_i(\vec{w}) - \nabla f_i(\vec{w}')\| \le \zeta_0^2 \|\vec{w} - \vec{w}'\|$, $\forall i \in [N]$, $\forall \|\vec{w}\|, \|\vec{w}'\| \le 1$.

When $g_{\vec{w}}$ is a linear model, as defined in eq. 4, we have that $\zeta_0$ is an absolute constant, independent of $d$ and $N$.

###### Proof.

See Appendix C.3 for the proof. ∎

With the constant $\zeta_0$ defined in Lemma 5.1, the following theorem shows that, for randomly sampled data, the gradient confusion bound holds with high probability.

###### Theorem 5.1 (Concentration for linear models).

Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function. Let $\vec{w}$ be an arbitrary weight vector such that $\|\vec{w}\| \le 1$, and let $\eta > 0$ be a given constant. The gradient confusion bound (eq. 3) holds at $\vec{w}$ with probability at least $1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{\zeta_0^4}\right)$, for some absolute constant $c > 0$. For both the square-loss and the logistic-loss functions, the value of $\zeta_0$ is an absolute constant.

###### Proof.

See Appendix C.4 for the proof.∎

This theorem can be interpreted as follows. As long as the dimension of the input (and thus the number of parameters in the problem) is large enough, the gradient confusion for a given weight vector is low, with high-probability. Thus, gradient confusion is lower as the linear model becomes more overparameterized, an effect that shows up in the simulation in Figure 1.
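This dimension dependence is easy to observe numerically. The following sketch (a toy setup of our own, not the paper's Figure 1 experiment) measures the worst pairwise gradient inner product for linear least squares on random spherical data, at a generic small-norm weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_gradient_inner_product(N, d, trials=100):
    """Worst pairwise <grad f_i, grad f_j> for linear least squares on random spherical data."""
    worst = []
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # data on the unit sphere
        y = rng.normal(size=N)
        w = rng.normal(size=d) / np.sqrt(d)            # generic weight vector with ||w|| ~ 1
        G = (X @ w - y)[:, None] * X                   # row i is grad f_i(w) = (w.x_i - y_i) x_i
        P = G @ G.T
        np.fill_diagonal(P, np.inf)                    # exclude the i = j entries from the min
        worst.append(P.min())
    return float(np.mean(worst))

# More parameters (larger d): inner products concentrate near 0, so the confusion eta is lower.
print([round(min_gradient_inner_product(20, d), 4) for d in (5, 50, 500)])
```

The most negative inner product moves toward zero as $d$ grows with $N$ fixed, which is exactly the overparameterization effect Theorem 5.1 quantifies.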

Note that the results in Section 3 showing fast convergence of SGD under low gradient confusion (Theorems 3.1 and 3.2) assume that the gradient confusion bound holds at every point along the path of SGD. On the other hand, Theorem 5.1 above shows that gradient confusion is low with high probability for overparameterized models at a fixed weight $\vec{w}$. Thus, to ensure that the above result is relevant for the convergence of SGD on overparameterized linear models, we now make the concentration bound in Theorem 5.1 uniform over all weights inside a ball of radius $r$.

###### Theorem 5.2 (Uniform concentration).

Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function. Select a point $\vec{w}_0$ such that $\|\vec{w}_0\| \le 1$, and consider a ball $\mathcal{B}$ centered at $\vec{w}_0$ of radius $r$. If the data are sampled uniformly from a unit sphere, then the gradient confusion bound (eq. 3) holds uniformly at all points $\vec{w} \in \mathcal{B}$ with probability at least

$$1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{64\, \zeta_0^4}\right) \ \ \text{if } r \le \frac{\eta}{4\zeta_0^2}, \qquad 1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{64\, \zeta_0^4} + \frac{8\, d\, \zeta_0^2\, r}{\eta}\right) \ \ \text{otherwise.}$$
###### Proof.

See Appendix C.5 for the proof.∎

Thus, as long as the radius $r$ is not too large, the gradient confusion bound holds uniformly with high probability at all points within the ball $\mathcal{B}$.

#### 5.2 Extension to general neural networks

We now extend the previous results to feed-forward neural networks with non-linear activations. Formally, let $\vec{W}_0 \in \mathbb{R}^{p \times d}$ and $\vec{W}_i \in \mathbb{R}^{p \times p}$ for $i \in \{1, \ldots, \beta\}$ be the given weight matrices. Let $\vec{W}$ denote the tuple $(\vec{W}_0, \vec{W}_1, \ldots, \vec{W}_\beta)$. Define $p$ to be the width and $\beta$ to be the depth of the neural network. Then, the model is defined as

$$g_{\vec{W}}(\vec{x}) := \sigma\big(\vec{W}_\beta\, \sigma(\vec{W}_{\beta-1} \cdots \sigma(\vec{W}_1\, \sigma(\vec{W}_0\, \vec{x})) \cdots)\big), \tag{5}$$

where $\sigma$ denotes the non-linear activation function, applied point-wise to its arguments. We assume that the non-linear activation is given by a function $\sigma: \mathbb{R} \to \mathbb{R}$ with the following properties.

• (P1) Boundedness: $\|\sigma(\vec{x})\|_\infty \le 1$ for any vector $\vec{x}$ with $\|\vec{x}\|_\infty \le 1$.

• (P2) Bounded differentials: Let $\sigma'$ and $\sigma''$ denote the first and second sub-differentials, respectively. Then, $|\sigma'(x)| \le 1$ and $|\sigma''(x)| \le 1$ for all $x$.

When $\|\vec{x}\| \le 1$, as in our random data model, most classical activation functions such as sigmoid, tanh, softmax and ReLU satisfy these requirements. Additionally, we will make the following assumption on the weight matrices.

###### Assumption 5.1 (Small Weights).

We will assume that the operator norms of the weight matrices are bounded above by 1. In other words, for every $i \in \{0, 1, \ldots, \beta\}$, we have $\|\vec{W}_i\| \le 1$.
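For reference, the model of eq. (5) under this small-weights assumption can be sketched in a few lines of numpy (the sizes, the tanh activation, and the explicit rescaling to unit operator norm are our illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, beta = 16, 32, 4   # input dimension, width, depth (hypothetical small sizes)

def scale_to_unit_norm(W):
    """Rescale a matrix so its operator norm is at most 1 (the small-weights assumption)."""
    return W / max(1.0, np.linalg.norm(W, 2))

W = [scale_to_unit_norm(rng.normal(size=(p, d)))]                        # W_0
W += [scale_to_unit_norm(rng.normal(size=(p, p))) for _ in range(beta)]  # W_1 .. W_beta

def g(W, x, sigma=np.tanh):
    """Feed-forward model of eq. (5): g_W(x) = sigma(W_beta ... sigma(W_0 x) ...)."""
    h = x
    for Wi in W:
        h = sigma(Wi @ h)
    return h

x = rng.normal(size=d)
x /= np.linalg.norm(x)   # data on the unit sphere
out = g(W, x)
print(np.linalg.norm(out) <= 1.0)  # True: small weights + a 1-Lipschitz sigma keep the signal bounded
```

Since $|\tanh(z)| \le |z|$ and each $\|\vec{W}_i\| \le 1$, the norm of the signal can never grow through the layers, which is exactly why the assumption rules out blow-ups.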

##### Is the small-weights assumption reasonable?

Without the small-weights assumption, i.e., without requiring $\|\vec{W}_i\| \le 1$ for every layer $i$, the signal propagated forward or the gradients could potentially blow up in magnitude, making the network untrainable. Proving non-vacuous bounds in the case of such blow-ups in the magnitude of the signal or the gradient is not possible in general, and thus we restrict our attention to this class of weights.

Note that the small-weights assumption is not just a theoretical concern; it is also usually encouraged in practice. Neural networks are often trained with weight decay regularizers of the form $\lambda \sum_i \|\vec{W}_i\|_F^2$, which force the weights to remain small during optimization. The operator norm of convolutional layers has also recently been used as an effective regularizer for image classification tasks [SGL18].

We can also show formally that Assumption 5.1 holds with high probability for random initializations that are frequently used in practice [Ben12]. Consider a random weight matrix $\vec{W} \in \mathbb{R}^{p \times p}$ where every entry is an independent sample from a zero-mean Gaussian whose standard deviation is sufficiently small relative to $1/\sqrt{p}$. Then we can show that the operator norm of $\vec{W}$ is upper-bounded by 1 with high probability. In particular, with probability at least $1 - \mathcal{O}(e^{-cp})$ for an absolute constant $c > 0$, we have $\|\vec{W}\| \le 1$; the proof follows directly from Theorem 2.3.8 and Proposition 2.3.10 in [Tao12] with appropriate scaling.
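This scaling is easy to check numerically. In the sketch below, the $1/(4\sqrt{p})$ standard deviation is our illustrative choice of a "sufficiently small" scale: the operator norm of a $p \times p$ matrix with i.i.d. $\mathcal{N}(0, s^2)$ entries concentrates near $2s\sqrt{p}$, so this choice puts it near $1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 256

# Sample several random matrices and record their operator (spectral) norms.
norms = [np.linalg.norm(rng.normal(scale=1 / (4 * np.sqrt(p)), size=(p, p)), 2)
         for _ in range(20)]
print(round(min(norms), 2), round(max(norms), 2))  # all samples stay well below 1
```

With this scale, every sampled matrix satisfies the small-weights assumption comfortably; a larger entry variance (e.g., $\mathcal{O}(1/\sqrt{p})$ with a big constant) would push the norm past 1.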

This argument can also be extended to convolutional neural nets. Consider a 2D input $\vec{X} \in \mathbb{R}^{m \times m}$ and a filter $\vec{K} \in \mathbb{R}^{k \times k}$. The convolution operation can be written as $\vec{K} * \vec{X} = \vec{T}\vec{x}$, where $\vec{T}$ is a Toeplitz matrix (assuming a stride of 1 and no padding) and $\vec{x}$ is the flattened input vector [Gra06]. Consider a randomly initialized filter where every entry is an independent zero-mean Gaussian sample with suitably small standard deviation. It then holds that, with high probability, $\|\vec{T}\| \le 1$. This again follows from Theorem 2.3.8 and Proposition 2.3.10 in [Tao12], together with the fact that each column of $\vec{T}$ has at most $k^2$ non-zero entries, each of which is an independent Gaussian random variable.
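The Toeplitz-matrix view of convolution can be verified directly. The sketch below builds the matrix $\vec{T}$ for a small 2D correlation (the machine-learning convention for "convolution", with stride 1 and no padding; the sizes are arbitrary) and checks it against a sliding-window computation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 3            # input is m x m, filter is k x k
n_out = m - k + 1      # output size with stride 1 and no padding

K = rng.normal(size=(k, k))
X = rng.normal(size=(m, m))

# Build the (n_out^2) x (m^2) Toeplitz-structured matrix T with T @ vec(X) = vec(K * X).
T = np.zeros((n_out * n_out, m * m))
for r in range(n_out):
    for c in range(n_out):
        row = np.zeros((m, m))
        row[r:r + k, c:c + k] = K          # place the filter at this output position
        T[r * n_out + c] = row.ravel()

# Direct sliding-window computation for comparison.
direct = np.array([[(K * X[r:r + k, c:c + k]).sum() for c in range(n_out)]
                   for r in range(n_out)])
print(np.allclose(T @ X.ravel(), direct.ravel()))  # True: convolution is a matrix multiply
```

Note also that each column of `T` has at most $k^2$ non-zero entries, which is the sparsity fact used in the operator-norm argument above.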

While, in general, there is no reason to believe that such a small-weights assumption would continue to hold during optimization without explicit regularizers like weight decay, some recent work has shown evidence that, for overparameterized neural nets, the weights do not move too far from the random initialization during training [NLB18, DR17, NK19, ZCZG18, AZLS18, DZPS18, OS18]. It is worth noting, though, that all these results have been shown under some restrictive assumptions, such as very wide networks.

As in the case of Lemma 5.1 for linear models, we show that the considered loss-functions have the following property for neural nets, provided the weights satisfy the small-weights Assumption 5.1.

###### Lemma 5.2.

Consider the premise of Lemma 5.1, with $g_{\vec{W}}$ now denoting a feed-forward neural network as defined in eq. (5), whose weights satisfy Assumption 5.1. Then the loss functions satisfy properties (1) and (2) of Lemma 5.1, where the constant now grows with depth, taking the value $\zeta_0(\beta + 2)$. Here, $\beta$ is the depth of the neural network.

###### Proof.

See Appendix C.3 for the proof. ∎

We now prove concentration bounds for the gradient confusion on neural networks.

###### Theorem 5.3 (Concentration bounds for arbitrary depth neural networks).

Consider the problem of fitting neural networks (eq. 5) to data using either the square-loss or the logistic-loss function. Let $\eta > 0$ be a given constant. Additionally, assume that the weights satisfy Assumption 5.1 and that the non-linearities in each layer satisfy properties (P1) and (P2). For some fixed constant $c > 0$, with probability at least $1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{\zeta_0^4 (\beta+2)^4}\right)$, the gradient confusion bound in eq. (3) holds.

Further, select a point $\vec{W}_0$ satisfying Assumption 5.1, and consider a ball $\mathcal{B}$ centered at $\vec{W}_0$ of radius $r$. If the data are sampled uniformly from a unit sphere, then the gradient confusion bound in eq. (3) holds uniformly at all points $\vec{W} \in \mathcal{B}$ with probability at least

$$1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{64\, \zeta_0^4 (\beta+2)^4}\right) \ \ \text{if } r \le \frac{\eta}{4\zeta_0^2}, \qquad 1 - N^2 \exp\left(-\frac{c\, d\, \eta^2}{64\, \zeta_0^4 (\beta+2)^4} + \frac{8\, d\, \zeta_0^2\, r}{\eta}\right) \ \ \text{otherwise.}$$

For both the square-loss and the logistic-loss functions, the value of the constant is as given by Lemma 5.2.

###### Proof.

See Appendix C.6 for the proof.∎

Thus we see that, for a given dimension $d$ and number of samples $N$, when the depth $\beta$ decreases, the probability that the gradient confusion bound in eq. (3) holds increases, and vice versa. This helps explain why training very deep models is hard and typically slow with SGD, even with small weights, as observed previously [BSF94, GB10]. Note that this is also related to the shattered gradients phenomenon [BFL17] that was observed to arise with depth (see Section 8 for more discussion). This naturally raises the question of why modern deep neural networks are so efficiently trained using SGD. While careful initialization strategies prevent vanishing or exploding gradients and thus make deeper networks trainable, as we show in Section 6, these strategies still suffer from high gradient confusion for very deep networks (see [ZDM19] for some recent progress in this direction). Thus, in Section 7, we empirically explore how popular techniques typically used in very deep networks, like residual connections [HZRS16] and batch normalization [IS15], affect gradient confusion. We find that these techniques drastically lower gradient confusion, making very deep networks significantly easier to train using SGD.

Note that all results in this section automatically hold for convolutional neural nets, since a convolutional operation on an input $x$ can be represented as a matrix multiplication $Tx$ for an appropriate Toeplitz matrix $T$.
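A minimal numpy sketch of this equivalence for a 1-D "valid" correlation; the kernel and input values here are arbitrary illustrative choices:

```python
import numpy as np

def conv_toeplitz(kernel, n):
    """(n - k + 1) x n matrix whose rows slide the kernel across the input."""
    k = len(kernel)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        T[i, i:i + k] = kernel
    return T

x = np.arange(6.0)               # arbitrary input signal
w = np.array([1.0, -2.0, 1.0])   # arbitrary 1-D kernel

T = conv_toeplitz(w, len(x))
direct = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
assert np.allclose(T @ x, direct)   # matrix multiply == sliding correlation
```

The same construction extends to 2-D convolutions via doubly block-Toeplitz matrices, which is why results stated for matrix multiplications carry over.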

### 6 Gradient confusion at initialization

In this section, we explore the implications for initialization of the weights in neural networks, and discuss the effect of width on gradient confusion.

#### 6.1 Gradient confusion at standard weight initializations

In Section 5, we assumed a random data model and analyzed the effect of overparameterization on gradient confusion for a large class of weights. In this section, we instead assume arbitrary bounded data and random weights (more specifically, we consider typical weight initializations used when training neural networks) and show that the same effect of overparameterization holds. As before, we first present the result for linear models, before moving on to neural networks.

Linear models. Consider a dataset generated by some labeling function, where the data points are deterministic and arbitrary vectors restricted to lie in a $d$-dimensional unit sphere. Consider a weight vector with every coordinate sampled independently from a zero-mean Gaussian distribution. We now prove the following theorem, which shows that gradient confusion is low at a random initial point with high probability.

###### Theorem 6.1 (Linear models with randomly chosen weights).

Consider the problem of fitting linear models (eq. 4) to data using either the square-loss or the logistic-loss function, where each data vector is restricted to lie within a unit sphere. Consider a randomly generated weight vector whose coordinates are i.i.d. samples from a zero-mean Gaussian distribution. Let $\eta > 0$ be a given constant. Then we have the following.

1. When the variance of the weight distribution is large, the gradient confusion bound in eq. (3) fails to hold with a constant probability. In particular, the failure probability is bounded away from 0 even in high dimensions.

2. When the variance of the weight distribution is small, the gradient confusion bound in eq. (3) holds with high probability. In particular, as the dimension $d$ increases, this probability converges to 1.

###### Proof.

See Appendix C.7 for the proof.∎

The above theorem shows that if the weights have high variance, then the gradient confusion bound does not hold with high probability. However, when the weights have lower variance, the gradient confusion bound does hold with high probability, justifying the current practice of initializing with small weights [GB10, HZRS15].
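The contrast between the two regimes is easy to see numerically. The sketch below compares the minimum pairwise inner product of per-sample square-loss gradients for a linear model under a low-variance and a high-variance random weight vector; the data lie on the unit sphere, and the linear teacher and the two variance levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 1000, 50

# Data on the unit sphere, labels from a hypothetical linear teacher.
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star

def min_grad_inner(w):
    """Smallest pairwise inner product of per-sample square-loss gradients.
    For the square loss, grad_i = (<w, x_i> - y_i) * x_i."""
    r = X @ w - y
    G = np.outer(r, r) * (X @ X.T)
    return np.min(G[~np.eye(N, dtype=bool)])

w_small = rng.normal(size=d) / np.sqrt(d)  # low-variance initialization
w_large = rng.normal(size=d) * 10.0        # high-variance initialization

# Low variance keeps gradients nearly orthogonal (confusion near 0);
# high variance produces much more negative inner products.
print(min_grad_inner(w_small), min_grad_inner(w_large))
```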

Neural networks. We can extend the arguments above to the case of general neural networks as defined in Section 5.2. We show that when the entries of the weight matrices are chosen uniformly at random from an interval that shrinks with the layer width, the gradient confusion bound in eq. (3) holds with high probability, and as before, this probability decreases with increasing depth of the neural network. This initialization strategy, and similar variants, are used almost universally for neural networks [GB10, HZRS15]. Consider weight matrices where every entry is an i.i.d. sample from a uniform distribution whose range scales inversely with the square root of the layer width. This scaling ensures that the operator norm of each weight matrix is bounded by 1 with high probability, so that Assumption 5.1 is satisfied. Using a proof strategy similar to Theorem 5.3, we get the following concentration bound on the gradient confusion.

###### Theorem 6.2 (Neural nets with randomly chosen weights).

Let the weight matrices be such that every entry in each matrix is an i.i.d. sample chosen uniformly from an interval that scales inversely with the square root of the maximum layer width. Then, for any dataset with all data points lying in the unit sphere, we have that the gradient confusion bound in eq. (3) holds with probability at least

$$
1 - N^2 \exp\!\left(-\frac{c\, d\, (\eta - 4)^2}{64\, \zeta_0^4 (\beta+2)^4}\right).
$$
###### Proof.

See Appendix C.7 for the proof.∎

Thus, with both random data and random weights, we see how increasing depth increases gradient confusion, making very deep models harder to train. This helps explain a phenomenon observed in practice for many years.

#### 6.2 The effect of network layer width

Consider a weight matrix $\vec{W} \in \mathbb{R}^{p \times q}$ containing i.i.d. $\mathcal{N}(0, \nu^2)$ entries. Then for any vector $\vec{x} \in \mathbb{R}^q$,

$$
\mathbb{E}_{\vec{W}}\big[\|\vec{W}\vec{x}\|^2\big] = \vec{x}^\top\, \mathbb{E}_{\vec{W}}\big[\vec{W}^\top \vec{W}\big]\, \vec{x} = \nu^2 p\, \|\vec{x}\|^2.
$$

Similarly, for any vector $\vec{y} \in \mathbb{R}^p$, we see that $\mathbb{E}_{\vec{W}}\big[\|\vec{W}^\top \vec{y}\|^2\big] = \nu^2 q\, \|\vec{y}\|^2$. Thus, a layer with weights $\vec{W}$ can grow or shrink an input vector (which can be the output of the previous layer) on either the forward or the backward propagation at initialization time, depending on the variance of the entries in $\vec{W}$.
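The forward-pass identity is straightforward to verify by simulation; the dimensions and $\nu$ below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, nu = 200, 50, 0.3          # hypothetical output dim, input dim, entry std
x = rng.normal(size=q)

# Monte Carlo estimate of E ||W x||^2 over W with i.i.d. N(0, nu^2) entries.
est = np.mean([np.linalg.norm(rng.normal(scale=nu, size=(p, q)) @ x) ** 2
               for _ in range(2000)])
theory = nu ** 2 * p * np.linalg.norm(x) ** 2
print(est / theory)              # ratio should be close to 1
```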

The operator norm of the weight matrices being close to 1 is important for the trainability of neural networks, as it ensures that the signal is passed through the network without exploding or shrinking across layers [GB10]. As we mention in Section 5, typical weight initialization techniques ensure that the operator norm is bounded by 1 with high probability. This indicates that we would expect the effect of width on convergence and gradient confusion to be much less pronounced than the effect of depth for typical neural net designs. This is also why, on assuming a bounded operator norm for each weight matrix in our results in Section 5, the dependence of gradient confusion on width goes away in general. A simple example that illustrates this is the case where each weight matrix in the neural network has exactly one non-zero element, which is set to 1. The operator norm of each such weight matrix is 1, but the forward- or backward-propagated signals would not depend on the width.

That being said, it is interesting to consider how, under specific weight initialization strategies, the layer width affects the gradient confusion. To this end, consider a simple two-layer linear network whose layers are initialized randomly with width-dependent variances, as in [CWZ18]. Under this initialization, and on adapting Proposition A.1 for least-squares loss functions in [CWZ18], the expected gradient inner product for any pair of data points vanishes, and using the same proof technique, we obtain a bound on its variance that decays with the layer width. From these results, we see that the gradient inner product concentrates towards 0 with increasing width (this can be written more formally using a straightforward application of Chebyshev's inequality). In other words, our results show that gradient confusion decreases with increasing width. See Figure 2 for an additional simulation on a deeper neural network where the weights are initialized using the Glorot initializer [GB10].
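This concentration can be illustrated with a toy two-layer linear network $f(x) = a^\top W x$ at random initialization. The scalings used below ($W$ and $a$ with $\mathcal{N}(0, 1/\text{width})$ entries, labels set to zero) are simplifying assumptions for the sketch, not the exact parameterization of [CWZ18]:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 100

def grad_inner(width, trials=500):
    """Mean |<grad_1, grad_2>| over random inits, for one fixed pair of
    unit-norm data points, square loss, gradients over both layers."""
    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)
    vals = []
    for _ in range(trials):
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, d))
        a = rng.normal(scale=1.0 / np.sqrt(width), size=width)
        r1, r2 = a @ W @ x1, a @ W @ x2        # residuals (labels taken as 0)
        # <grad_1, grad_2> = r1*r2*(||a||^2 <x1,x2> + <W x1, W x2>)
        ip = r1 * r2 * ((a @ a) * (x1 @ x2) + (W @ x1) @ (W @ x2))
        vals.append(abs(ip))
    return float(np.mean(vals))

# Inner products shrink markedly as the width grows.
print(grad_inner(10), grad_inner(1000))
```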

Note that these results hold for the specific initialization strategy mentioned. Nonetheless, some other recent work also provides evidence on the benefits of width, by showing that the learning dynamics of gradient descent simplify significantly for sufficiently wide networks under certain parameterizations [JGH18, LXS19]. In the next section (as well as in Appendix A.2), we show empirical evidence that, given a sufficiently deep network, increasing the layer width often helps in lowering gradient confusion and speeding up convergence close to the minimizer for a range of neural network models.

### 7 Experimental results

In this section, we present experimental results showing the effect of neural network architecture on convergence rates of SGD and gradient confusion. It is worth noting that Theorems 3.1 and 3.2 indicate that gradient confusion primarily affects the stationary phase of the optimizer and the final "noise floor" of constant stepsize SGD. Thus, we expect the effect of gradient confusion to be most prominent near the end of training, particularly when the gradient noise begins to dominate and the convergence curve has flattened out near this floor.

##### Experimental setup.

We perform experiments on wide residual networks (WRN) [ZK16] for an image classification task on CIFAR-10. Appendix A contains more experiments on fully-connected linear and non-linear neural networks, where we show that the same qualitative results as presented in this section hold for a wide variety of neural net architectures with different activation functions and weight initializations.

For the rest of this section, we use WRN-$d$-$k$ to denote a WRN with depth $d$ and width factor $k$. (The width factor is the number of filters relative to the original ResNet model: a factor of 1 corresponds to the original ResNet, and 2 means the network is twice as wide.) See Section E in the Appendix for more details on the WRN architecture. We turn off dropout for all our experiments. Our first round of experimental networks have no skip connections or batch normalization [IS15], so as to stay as close as possible to the assumptions of our theorems. Later on, we study the effects that skip connections and batch normalization have on the convergence rate and gradient confusion. We added biases to the convolutional layers when not using batch normalization. We use SGD as the optimizer without any momentum, together with weight decay. Following [ZK16], we train all experiments for 200 epochs with minibatches of size 128, and reduce the initial learning rate by a factor of 10 at epochs 80 and 160. While a decreasing learning rate schedule is not required to see the effect of the neural net architecture on SGD and gradient confusion, it significantly speeds up training, and thus we follow the same schedule as in [ZK16]. See Appendix A for more experiments with constant learning rates. We use the MSRA initializer [HZRS15] for the weights, as is standard for this model, and use the same preprocessing steps for the CIFAR-10 images as described in [ZK16]. We tune the initial learning rate for each model over a logarithmically-spaced grid and select the run that achieves the lowest training loss value. Our grid search was such that the optimal learning rate never occurred at one of the extreme values tested.

To measure gradient confusion, at the end of every training epoch, we sample 100 mini-batches, each of size 128 (the same size as the training batch size). We calculate gradients on each of these mini-batches, and then calculate pairwise cosine similarities. To measure the worst-case gradient confusion, we compute the lowest gradient cosine similarity among all pairs. We also plot a histogram of the pairwise gradient cosine similarities of the 100 minibatches sampled at the end of training (after 200 epochs), to see the concentration of the distribution.
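The measurement protocol above can be sketched as follows; the toy least-squares minibatch gradients here stand in for the network gradients (any per-minibatch gradient estimator can be plugged in):

```python
import numpy as np
from itertools import combinations

def pairwise_grad_cosines(grads):
    """Cosine similarity for every pair of minibatch gradient vectors."""
    sims = []
    for g1, g2 in combinations(grads, 2):
        sims.append(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))
    return np.array(sims)

rng = np.random.default_rng(3)
d, n_batches, bs = 50, 100, 128
w = rng.normal(size=d) * 0.01
grads = []
for _ in range(n_batches):
    X = rng.normal(size=(bs, d))
    y = rng.normal(size=bs)
    grads.append(X.T @ (X @ w - y) / bs)   # minibatch least-squares gradient

sims = pairwise_grad_cosines(grads)
print(sims.min())                          # worst-case confusion proxy
```

A histogram of `sims` then shows how sharply the pairwise similarities concentrate, as in the figures discussed below.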

##### Effect of depth.

To test our theoretical results, and in particular Theorem 5.3, we consider a WRN with no batch normalization and no skip connections. This makes the network behave like a typical deep convolutional neural network. We keep the width fixed and change the depth over the networks WRN-16-2, WRN-40-2 and WRN-100-2. From Figure 3, we see that the experiments back our theoretical results: smaller depth leads to faster convergence (and a better noise floor), as well as a better bound on the gradient confusion. We also notice that with smaller depth, the histogram of pairwise gradient cosine similarities concentrates more sharply around 0 (indicating more orthogonal gradients), which makes the network easier to train. See Appendix A.1 for further experiments, where the same qualitative results hold.

##### Effect of width.

Using the same experimental setup as above, we now test the effect of increasing width in this network, while keeping the depth fixed, by considering the networks: WRN-28-1, WRN-28-2, WRN-28-10. From Figure 4, we now see that width helps in faster convergence, lower gradient confusion, as well as more orthogonal gradients. These results indicate that as long as gradients don’t contradict each other, orthogonal gradients lead to faster convergence. Note that the smallest network considered here is still highly overparameterized (approximately 0.37 million parameters), and can reach performance levels close to the WRN-40-2 model considered above. See Appendix A.2 for further experiments, where again we see the same qualitative results hold.

##### Effect of batch normalization and skip connections.

To help understand why most standard neural nets are so efficiently trained using SGD, we test the effect that batch normalization and skip connections have on a deep thin model: WRN-40-2. Figure 5 shows results where we start with a network with no batch normalization and no skip connections, and then progressively add them to the network. To make the comparison fair such that all networks start from the same initialization point, for the network without batch normalization, we use a reparameterization equivalent to batch normalization using the first sampled minibatch of data (see [MM15, SK16, KDDD15] for similar papers using data-dependent initializations).

In Figure 5, we see that adding batch normalization makes a big difference in the convergence speed as well as in lowering gradient confusion. Adding skip connections on top of this further accelerates training. Notice that batch normalization leads to a very sharp concentration of the gradient cosine similarity, and makes the gradients behave like random vectors drawn from a sphere, making optimization more stable; an observation consistent with previous works [BFL17, STIM18].

### 8 Additional discussion of some related work

The convergence of SGD on over-parameterized models has received a lot of attention. The authors of [ACH18] study the behavior of SGD on over-parameterized problems, and show that SGD on over-parameterized linear neural nets is similar to applying a certain preconditioner while optimizing. This can sometimes lead to acceleration when overparameterizing by increasing the depth of linear neural networks. In this paper, we show that this property does not hold in general (as mentioned briefly in [ACH18]), and that convergence typically slows down because of gradient confusion when training very deep networks.

The behavior of SGD on over-parameterized problems was also studied in [MBB17, BBM18, VBS18, SR13], using an overfitting condition observed for some overparameterized neural nets (particularly convnets), in which the minimizer returned by the optimizer simultaneously minimizes the loss on each individual data sample; these works use this condition to show fast convergence of SGD. In contrast, we aim to establish a more direct relationship between width, depth, problem dimensionality, and the error floor of SGD convergence.

Other works have studied the impact of structured gradients on SGD. [BFL17] study the effects of shattered gradients at initialization for ReLU networks, which is when (non-stochastic) gradients at different (but close) locations in parameter space become negatively correlated. The authors show how gradients get increasingly shattered with depth in ReLU networks.

[Han18] show that the variance of gradients in fully connected networks with ReLU activations is exponential in the sum of the reciprocals of the hidden layer widths at initialization. Further, [HR18] show that this sum of the reciprocals of the hidden layer widths determines the variance of the sizes of the activations at each layer during initialization. When this sum of reciprocals is too large, early training dynamics are very slow, suggesting the difficulties of starting training on deeper networks, as well as the benefits of increased width.

There has recently also been interest in analyzing conditions under which SGD converges to global minimizers of overparameterized linear and non-linear neural networks. [ACGH18] shows SGD converges linearly to global minimizers for linear neural nets under certain conditions. [DZPS18, AZLS18, ZCZG18, BGMSS17] also show convergence to global minimizers of SGD for non-linear neural nets. While all these results require the network to be sufficiently wide, they represent an important step in the direction of better understanding optimization on neural nets. This paper complements these recent results by studying how one property, low gradient confusion, contributes to SGD’s success on overparameterized neural nets.

### 9 Conclusions, limitations and future work

In this paper, we investigate how overparameterization affects the dynamics of SGD on neural networks. We introduce a concept called gradient confusion, and show that when gradient confusion is low, SGD has better convergence properties than predicted by classical theory. Further, using both theoretical and empirical results, we show that overparameterization by increasing the number of parameters of linear models or by increasing the width of neural network layers leads to lower gradient confusion, making the models easier to train. In contrast, overparameterization by increasing the depth of neural networks results in higher gradient confusion, making deeper models harder to train. We further show evidence of how techniques like batch normalization and skip connections in residual networks help in tackling this problem.

Note that many previous results have shown how deeper models are better at modeling higher complexity function classes than wider models, and thus depth is essential for the success of neural networks [ES16, Tel16, RPK17]. Thus, our results indicate that, given a sufficiently deep network, increasing the network width is important for the trainability of the model, and will lead to faster convergence rates. This is further supported by other recent research [Han18, HR18] that show that the width should increase linearly with depth in a neural network to help dynamics at the beginning of training (i.e., at initialization). Our results also suggest the importance of further investigation into good initialization schemes for neural networks that make training very deep models possible. See [ZDM19] for some recent advances in this direction.

We consider the main limitation of this work to be the use of a random data model consisting of i.i.d. samples of isoperimetric vectors to derive the concentration bounds in Section 5. While this simple data model makes the analysis tractable and gives insights into the effect of overparameterization, it is not clear to what extent real-world datasets follow such a model (i.e., i.i.d. data and the isoperimetric property in the feature space). Nonetheless, these are standard assumptions in statistical learning theory, since they allow for mathematical analysis. To make our observations on this model robust, we show that the same qualitative results hold (i) with arbitrary data and random weights in Section 6 (i.e., at initialization), and (ii) in the experimental setting (Section 7 and Appendix A). For future work, it would be interesting to extend this analysis to richer random data models. It would also be interesting to gain a better understanding of how layer width impacts long-term training dynamics, since the analysis here focuses on behavior near initialization.

An active area of research currently is to better understand how overparameterization and neural net architecture promote generalization [NLB18, AZLL18, BHMM18, NMB18]. An interesting topic for future work is whether there is a connection between gradient confusion, sharp and flat minimizers, and generalization for SGD. See [FNN19] for some recent work in this direction.

### Acknowledgements

The authors thank Brendan O’Donoghue, James Martens, Sudha Rao, and Sam Smith for helpful discussions and for reviewing earlier versions of this manuscript.

### Appendix A Additional experimental results

To further test the main claims in the paper, we perform additional experiments on an image classification problem on the MNIST dataset using fully connected neural networks. We iterate over neural networks of varying depth and width, and consider both the identity activation function (i.e., linear neural networks) and the tanh activation function. We also consider two different weight initializations that are popularly used and appropriate for these activation functions:

• The Glorot normal initializer [GB10], with weights initialized by sampling from the distribution $\mathcal{N}\big(0,\ 2/(\text{fan-in} + \text{fan-out})\big)$, where fan-in denotes the number of input units in the weight matrix, and fan-out denotes the number of output units in the weight matrix.

• The LeCun normal initializer [LBOM12], with weights initialized by sampling from the distribution $\mathcal{N}\big(0,\ 1/\text{fan-in}\big)$.

We consider the simplified case where all hidden layers have the same width $\ell$. Thus, the first weight matrix lies in $\mathbb{R}^{\ell \times 784}$, since the $28 \times 28$-sized images of MNIST have 784 pixels; all intermediate weight matrices lie in $\mathbb{R}^{\ell \times \ell}$; and the final layer lies in $\mathbb{R}^{10 \times \ell}$ for the 10 image classes in MNIST. We also add biases to each layer, which we initialize to 0. We use softmax cross entropy as the loss function.
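The architecture and the two initializers above can be sketched as follows; treating `depth` as the number of weight matrices is an assumption about the paper's convention:

```python
import numpy as np

def init_mlp(depth, width, in_dim=784, out_dim=10, scheme="glorot", seed=0):
    """Initialize `depth` weight matrices: in_dim -> width -> ... -> out_dim."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [width] * (depth - 1) + [out_dim]
    params = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        if scheme == "glorot":            # N(0, 2 / (fan-in + fan-out))
            std = np.sqrt(2.0 / (fan_in + fan_out))
        else:                             # LeCun: N(0, 1 / fan-in)
            std = np.sqrt(1.0 / fan_in)
        params.append((rng.normal(scale=std, size=(fan_out, fan_in)),
                       np.zeros(fan_out)))  # biases start at 0
    return params

def forward(params, x, act=np.tanh):
    """Forward pass; use act=lambda z: z for a linear neural network."""
    for W, b in params[:-1]:
        x = act(W @ x + b)
    W, b = params[-1]
    return W @ x + b                      # logits for softmax cross entropy

layers = init_mlp(depth=5, width=100)
logits = forward(layers, np.zeros(784))
```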

We choose this relatively simple model as it gives us the ability to iterate over a large number of combinations of network architectures of varying width and depth, and different activation functions and weight initializations. Linear neural networks are an efficient way to directly understand the effect of changing depth and width without increasing model complexity over linear regression. Thus, we consider both linear and non-linear neural nets in our experiments.

We use SGD with constant learning rates for training, with a mini-batch size of 128, and train each model for 40000 iterations (more than 100 epochs). The constant learning rate was tuned over a logarithmically-spaced grid. We ran each experiment 10 times, and picked the learning rate that achieved the lowest training loss value on average at the end of training. Our grid search was such that the optimal learning rate never occurred at one of the extreme values tested.

To measure gradient confusion at the end of training, we sample 1000 pairs of mini-batches, each of size 128 (the same size as the training batch size). We calculate gradients on each of these pairs of mini-batches, and then calculate the cosine similarity between them. To measure the worst-case gradient confusion, we compute the lowest gradient cosine similarity among all pairs. We also calculate the average pairwise gradient cosine similarity over the 1000 pairs. We explore the effect of changing depth and changing width for the different activation functions and weight initializations. We plot the final training loss achieved for each model, as well as the minimum and average gradient cosine similarities calculated over the 1000 pairs of gradients at the end of training. For each point, we plot both the mean and the standard deviation over the 10 independent runs.

#### a.1 The effect of depth

We first explore the effect of the depth of the neural network for these image classification models. To do this, we consider a fixed width and vary the depth of the neural network on the log scale. Figure 6 shows results on linear neural networks for the two weight initializations considered (Glorot normal and LeCun normal). Figure 7 shows results on neural networks with tanh activation functions, for the same two weight initializations.

Similar to the experimental results in Section 7, and matching our theoretical results in Section 5, we see the consistent trend of gradient confusion increasing with increasing depth. This makes the networks harder to train, as evidenced by an increase in the final training loss value. At the largest depth considered, the increased gradient confusion effectively makes the network untrainable when using tanh non-linearities. In Section 7, we further showed that increased depth results in lower concentration of the gradient cosine similarities (higher variance), and that gradients get increasingly non-orthogonal with increased depth. We see the same effect in these experiments in the plots showing the average gradient cosine similarity.

#### a.2 The effect of width

We now explore the effect of width by varying the width of the neural network while keeping the depth fixed at a large value. We choose a very deep model, which is essentially untrainable for small widths (with standard initialization techniques), as this helps better illustrate the effects of increasing width. We vary the width of the network, again on the log scale. Crucially, note that the smallest network considered here still has more than 50000 parameters (i.e., more than the number of training samples), and the widest network has almost three times the number of parameters of the high-performing network considered in the previous section. Figures 8 and 9 show results on linear neural nets and neural nets with tanh activations for both the Glorot normal and LeCun normal initializations.

As in the experimental results of Section 7, we see the consistent trend of gradient confusion decreasing with increasing width. Thus, wider networks become easier to train and reach a better final training loss value. We further see that when the width is too small, the gradient confusion becomes drastically high and the network becomes completely untrainable. As in Section 7, we also notice from the average gradient cosine similarity plots that increasing width helps in better concentration of the gradient cosine similarities (lower variance), effectively making the gradients more orthogonal and the network easier to train.

### Appendix B Conditions for faster convergence

In Theorem 3.1, we show convergence to the neighborhood of a minimizer for problems satisfying the PL inequality and Lipschitz-smoothness. The theorem also specifies the learning rate for which the rate of convergence in the transient phase is optimal, together with the corresponding decay rate.

Faster convergence can be guaranteed if we strengthen the definition of confusion by examining the correlation between $\nabla f_i(\vec{w})$ and $\nabla f_j(\vec{w}')$ for all $i, j$ and all pairs of points $\vec{w}, \vec{w}'$. Compared to Theorem 3.1, convergence is then guaranteed with a larger learning rate that is independent of the training set size, and with a faster geometric decay.

###### Theorem B.1.

If the objective function satisfies A1 and A2, and satisfies the strengthened gradient confusion bound
$$
\big\langle \nabla f_i(\vec{w}),\, \nabla f_j(\vec{w}') \big\rangle \ge -\eta, \qquad \forall\, i, j,\ \forall\, \vec{w}, \vec{w}',
$$
then SGD converges with

$$
F(\vec{w}_k) - F^{\mathrm{opt}} \le \rho^k \big(F(\vec{w}_0) - F^{\mathrm{opt}}\big) + \frac{\alpha \eta}{1 - \rho},
$$

for an appropriate choice of the learning rate $\alpha$ and geometric decay rate $\rho < 1$.
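The geometric recursion underlying this bound can be checked numerically; the constants below are arbitrary illustrative values, not ones derived from the theorem:

```python
# Unrolling the per-step bound e_{k+1} <= rho * e_k + alpha * eta gives
# e_k <= rho^k * e_0 + alpha * eta / (1 - rho): geometric decay to a floor.
rho, alpha, eta, e0, K = 0.95, 0.1, 0.2, 10.0, 400   # illustrative constants

e = e0
for _ in range(K):
    e = rho * e + alpha * eta

floor = alpha * eta / (1 - rho)          # the SGD "noise floor" term
assert e <= rho ** K * e0 + floor + 1e-12
print(e, floor)                          # the iterate settles onto the floor
```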

###### Proof.

We start by noting that, for any pair $i, j$,

$$
\begin{aligned}
f_i\big(\vec{w} - \alpha \nabla f_j(\vec{w})\big)
&= f_i(\vec{w}) + \int_{t=0}^{\alpha} \frac{\partial}{\partial t} f_i\big(\vec{w} - t \nabla f_j(\vec{w})\big)\, dt \\
&= f_i(\vec{w}) - \int_{t=0}^{\alpha} \nabla f_j(\vec{w})^\top \nabla f_i\big(\vec{w} - t \nabla f_j(\vec{w})\big)\, dt \\
&\le f_i(\vec{w}) + \int_{t=0}^{\alpha} \eta\, dt \\
&= f_i(\vec{w}) + \alpha \eta.
\end{aligned}
$$

We then have

$$
N F(\vec{w}_{k+1}) = \tilde{f}_k\big(\vec{w}_k - \alpha \nabla \tilde{f}_k(\vec{w}_k)\big) + \sum_{f_i \ne \tilde{f}_k} f_i\big(\vec{w}_k - \alpha \nabla \tilde{f}_k(\vec{w}_k)\big)
$$