# On regularization of gradient descent, layer imbalance and flat minima

We analyze the training dynamics of deep linear networks using a new metric, layer imbalance, which characterizes the flatness of a solution. We demonstrate that different regularization methods, such as weight decay and noise data augmentation, behave in a similar way: training has two distinct phases, (1) optimization and (2) regularization. First, during the optimization phase, the loss function monotonically decreases, and the trajectory moves toward the minima manifold. Then, during the regularization phase, the layer imbalance decreases, and the trajectory moves along the minima manifold toward a flat area. Finally, we extend the analysis to stochastic gradient descent and show that SGD works similarly to noise regularization.


## 1 Introduction

In this paper, we analyze regularization methods used for training deep neural networks. To understand how regularization methods such as weight decay and noise data augmentation work, we study gradient descent (GD) dynamics for deep linear networks (DLNs). We study deep networks with scalar layers to exclude factors related to over-parameterization and to focus on factors specific to deep models. Our analysis is based on the concept of flat minima Hochreiter and Schmidhuber (1994b). We call a region in weight space flat if each solution from that region has a similar small loss. We show that minima flatness is related to a new metric, layer imbalance, which measures the difference between the norms of the network layers. Next, we analyze the layer imbalance dynamics of gradient descent for DLNs using a trajectory-based approach Saxe et al. (2013).

With these tools, we prove the following results:

1. Standard regularization methods such as weight decay and noise data augmentation decrease the layer imbalance during training and drive the trajectory toward flat minima.

2. GD training with regularization has two distinct phases: (1) ‘optimization’ and (2) ‘regularization’. During the optimization phase, the loss monotonically decreases, and the trajectory moves toward the minima manifold. During the regularization phase, the layer imbalance decreases, and the trajectory moves along the minima manifold toward a flat area.

3. Stochastic Gradient Descent (SGD) works similarly to implicit noise regularization.

## 2 Linear neural networks

We begin with linear regression with mean squared error on scalar samples $(x_i, y_i),\ i=1,\dots,N$:

$$E(w,b)=\frac{1}{N}\sum_i\bigl(w\cdot x_i+b-y_i\bigr)^2\ \to\ \min \qquad (1)$$

Let’s center and normalize the training dataset in the following way:

$$\sum_i x_i=0;\quad \frac{1}{N}\sum_i x_i^2=1;\qquad \sum_i y_i=0;\quad \frac{1}{N}\sum_i x_i y_i=1. \qquad (2)$$

The solution of this normalized linear regression is $w^{*}=1,\ b^{*}=0$.

Next, let’s replace the weight $w$ with a deep linear network with $d$ scalar layers $w_1,\dots,w_d$:

$$y=w_1\cdots w_d\cdot x+b \qquad (3)$$

Denote $W:=w_1\cdots w_d$. The loss function for the new problem is:

$$E(\mathbf{w},b)=\frac{1}{N}\sum_i\bigl(W\cdot x_i+b-y_i\bigr)^2\ \to\ \min$$

Now the loss is a non-linear (and non-convex) function with respect to the weights $w_i$. For the normalized dataset (2), network training is equivalent to the following problem:

$$L(\mathbf{w})=\bigl(w_1\cdots w_d-1\bigr)^2\ \to\ \min \qquad (4)$$

Linear networks of depth 2 were studied by Baldi and Hornik (1989), who showed that all minima of the problem (4) are global and that all other critical points are saddles.
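The scalar-layer setup above is easy to reproduce numerically. The sketch below (function names are illustrative, not from the paper) implements the loss (4) and its gradient for $d=3$ scalar layers and checks that plain gradient descent reaches the minima manifold $w_1\cdots w_d=1$:

```python
# Numerical sketch of problem (4): a deep linear network with d scalar
# layers, trained so that the product W = w1*...*wd fits the target 1.
import math

def dln_loss(w):
    """L(w) = (w1*...*wd - 1)^2, Eq. (4)."""
    W = math.prod(w)
    return (W - 1.0) ** 2

def dln_grad(w):
    """dL/dw_i = 2*(W - 1)*(W / w_i)."""
    W = math.prod(w)
    return [2.0 * (W - 1.0) * (W / wi) for wi in w]

def gradient_descent(w, lr=0.05, steps=500):
    for _ in range(steps):
        g = dln_grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w = gradient_descent([0.5, 0.9, 1.2])      # d = 3 scalar layers
assert dln_loss(w) < 1e-8                  # converged to the minima manifold
assert abs(math.prod(w) - 1.0) < 1e-4      # i.e. w1*w2*w3 is close to 1
```

Note that GD only drives the product $W$ to 1; which point of the minima manifold is reached depends on the initialization.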

### 2.1 Flat minima

Following Hochreiter and Schmidhuber (1994b), we are interested in flat minima: “a region in weight space with the property that each weight from that region has similar small error". In contrast, sharp minima are regions where the loss can increase rapidly. Let’s compute the loss gradient:

$$\frac{\partial L}{\partial w_i}=2(w_1\cdots w_d-1)\,(w_1\cdots w_{i-1}\,w_{i+1}\cdots w_d)=2(W-1)(W/w_i) \qquad (5)$$

Here we denote $W/w_i:=w_1\cdots w_{i-1}\,w_{i+1}\cdots w_d$ for brevity. The minima of the loss are located on the hyperbola $w_1\cdots w_d=1$ (Fig. 1). Our interest in flat minima is related to training robustness. Training in a flat area is more stable than in a sharp area: the gradient $\partial L/\partial w_i$ vanishes if $w_i$ is very large, and explodes if $w_i$ is very small.

Hochreiter and Schmidhuber (1994a) suggested that flat minima have a smaller generalization error than sharp minima. Keskar et al. (2016) observed that large-batch training tends to converge toward sharp minima, with a significant number of large positive eigenvalues of the Hessian. They suggested that sharp minima generalize worse than flat minima, which have smaller eigenvalues. In contrast, Dinh et al. (2017) argued that minima flatness cannot directly explain generalization: since both flat and sharp minima represent the same function, they perform equally on a validation set.

The question of how minima flatness is related to good generalization is beyond the scope of this paper.

### 2.2 Layer imbalance

In this section we define a new metric related to the flatness of the minimizer – layer imbalance.

Dinh et al. (2017) showed that minima flatness is characterized by the largest eigenvalue of the loss Hessian $H$:

$$H(\mathbf{w})=2\begin{bmatrix}\dfrac{W^2}{w_1^2} & \dfrac{(2W-1)W}{w_1w_2} & \dots & \dfrac{(2W-1)W}{w_1w_d}\\ \dfrac{(2W-1)W}{w_2w_1} & \dfrac{W^2}{w_2^2} & \dots & \dfrac{(2W-1)W}{w_2w_d}\\ \dots & \dots & \dots & \dots\\ \dfrac{(2W-1)W}{w_dw_1} & \dfrac{(2W-1)W}{w_dw_2} & \dots & \dfrac{W^2}{w_d^2}\end{bmatrix}$$

At a minimum, $W=1$ and $H=2vv^{\top}$ with $v=(1/w_1,\dots,1/w_d)$, so the Hessian has a single non-zero eigenvalue $2\sum_i 1/w_i^2$. Minima close to the axes are sharp. Minima close to the origin are flat. Note that flat minima are balanced: $w_i^2=w_j^2$ for all layers.

In the spirit of Arora et al. (2019); Neyshabur et al. (2015), let’s define the layer imbalance for a deep linear network:

$$D(\mathbf{w}):=\max_{i,j}\bigl|\,\|w_i\|^2-\|w_j\|^2\,\bigr| \qquad (6)$$

For scalar layers, $\|w_i\|^2=w_i^2$.

Minima with low layer imbalance are flat, and minima with high layer imbalance are sharp.
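The connection between layer imbalance and sharpness can be checked numerically. The sketch below (helper names are illustrative) compares two minima on the same hyperbola $w_1w_2=1$: the balanced one has both a smaller imbalance $D$ and a smaller curvature of the loss:

```python
# Layer imbalance D(w) from Eq. (6), plus a finite-difference check that a
# minimum with larger imbalance has larger curvature, i.e. is sharper.
import math

def layer_imbalance(w):
    """D(w) = max_{i,j} | w_i^2 - w_j^2 |, Eq. (6) for scalar layers."""
    sq = [wi * wi for wi in w]
    return max(sq) - min(sq)

def loss(w):
    return (math.prod(w) - 1.0) ** 2

def max_curvature(w, eps=1e-4):
    """Largest diagonal second derivative of the loss at w (a cheap
    stand-in for the top Hessian eigenvalue at a minimum)."""
    curv = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        curv.append((loss(wp) - 2 * loss(w) + loss(wm)) / eps ** 2)
    return max(curv)

flat  = [1.0, 1.0]        # balanced minimum on the hyperbola w1*w2 = 1
sharp = [10.0, 0.1]       # unbalanced minimum on the same hyperbola

assert layer_imbalance(flat) < layer_imbalance(sharp)
assert max_curvature(flat) < max_curvature(sharp)
```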

## 3 Implicit regularization for gradient descent

In this section, we explore the training dynamics for continuous and discrete gradient descent.

### 3.1 Gradient descent: convergence analysis

We start with an analysis of the training dynamics of continuous GD. Taking the continuous-time limit of gradient descent, we obtain the following differential equations Saxe et al. (2013):

$$\frac{dw_i}{dt}=-\lambda\frac{\partial L}{\partial w_i}=-2\lambda(W-1)(W/w_i) \qquad (7)$$

For continuous GD, the loss function monotonically decreases:

$$\frac{dL}{dt}=\sum_i\Bigl(\frac{\partial L}{\partial w_i}\cdot\frac{dw_i}{dt}\Bigr)=-4\lambda(W-1)^2W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)=-4\lambda W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)\cdot L(t)\ \le\ 0$$

The trajectory of continuous GD is a hyperbola: $w_i^2-w_j^2=\text{const}$ (see Fig. 1(a)) (Saxe et al., 2013). The layer imbalance remains constant during training. So if training starts close to the origin, the final point will also have a small layer imbalance, and the minimum will be flat.

Let’s turn from continuous to discrete gradient descent (we omit $t$ on the right-hand side for brevity, so $w_i$ means $w_i(t)$):

$$w_i(t+1)=w_i-2\lambda\frac{\partial L}{\partial w_i}=w_i-2\lambda(W-1)(W/w_i) \qquad (8)$$

We would like to find conditions which guarantee that the loss monotonically decreases. For any fixed learning rate $\lambda$, one can find a point $\mathbf{w}$ such that the loss increases after a GD step. For example, for a network with 2 layers, one GD step gives $W(t+1)-1=(W-1)\bigl(1-2\lambda(w_1^2+w_2^2)+4\lambda^2(W-1)W\bigr)$; for any fixed $\lambda$, one can choose $w_1^2+w_2^2$ large enough to make the second factor larger than $1$ in absolute value, and therefore the loss will increase: $L(t+1)>L(t)$. But we can define an adaptive learning rate which guarantees that the loss decreases.
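This failure of a fixed learning rate, and the adaptive fix, can be checked numerically. In the sketch below, the adaptive step is scaled by $1/\bigl(W^2\sum_i 1/w_i^2\bigr)$; the constant $3/16$ and the starting point are illustrative:

```python
# With a fixed learning rate, a single GD step can increase the loss at a
# strongly imbalanced point; a step scaled by 1/(W^2 * sum 1/w_i^2) does not.
import math

def loss(w):
    return (math.prod(w) - 1.0) ** 2

def gd_step(w, lr):
    W = math.prod(w)
    return [wi - lr * 2.0 * (W - 1.0) * (W / wi) for wi in w]

w = [8.0, 0.1]                    # W = 0.8: imbalanced point near the minimum
fixed = gd_step(w, 0.05)          # fixed step overshoots along w2
assert loss(fixed) > loss(w)      # the loss increased

W = math.prod(w)
adaptive_lr = 3.0 / (16.0 * W * W * (1 / 8.0 ** 2 + 1 / 0.1 ** 2))
assert loss(gd_step(w, adaptive_lr)) < loss(w)
```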

###### Theorem 3.1.

Consider discrete GD (Eq. 8). Assume that $W(0)\ge 1/2$. If we define an adaptive learning rate $\lambda(t)=\dfrac{3}{16\,W^2\bigl(\sum_i 1/w_i^2\bigr)}$, then the loss monotonically converges to $0$ with a linear rate.

###### Proof.

Let’s estimate the change of $W-1$ after one gradient descent step:

$$W(t+1)=\prod_i\bigl(w_i-2\lambda(W-1)W/w_i\bigr)=W\cdot\prod_i\bigl(1-2\lambda(W-1)W/w_i^2\bigr)=W\cdot(1-S)$$

so that

$$W(t+1)-1=(W-1)\Bigl(1-\frac{W}{W-1}\,S\Bigr)$$

Here $S$ is the series obtained by expanding the product:

$$S=2\lambda(W-1)W\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)-4\lambda^2(W-1)^2W^2\Bigl(\sum_{i\neq j}\frac{1}{w_i^2w_j^2}\Bigr)+8\lambda^3(W-1)^3W^3\Bigl(\sum_{i\neq j\neq k}\frac{1}{w_i^2w_j^2w_k^2}\Bigr)-\dots$$

The ratio of consecutive terms of this series is bounded by

$$q\le 2\lambda\,|(W-1)W|\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)=\frac{3}{8}\cdot\frac{|W-1|}{W}\le\frac{3}{8}$$

where the last step uses $W\ge 1/2$, which implies $|W-1|\le W$. Consider the factor $F=1-\frac{W}{W-1}S$. To prove that $0<F<1$, we consider two cases.

CASE 1: $1/2\le W<1$. In this case, all terms of the series have the same sign, and $S$ can be written as:

$$S=-\Bigl(2\lambda(1-W)W\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)+4\lambda^2(1-W)^2W^2\Bigl(\sum_{i\neq j}\frac{1}{w_i^2w_j^2}\Bigr)+8\lambda^3(1-W)^3W^3\Bigl(\sum_{i\neq j\neq k}\frac{1}{w_i^2w_j^2w_k^2}\Bigr)+\dots\Bigr)$$

so, by the geometric-series bound, $2\lambda(1-W)W\bigl(\sum_i 1/w_i^2\bigr)\le|S|\le\dfrac{2\lambda(1-W)W\bigl(\sum_i 1/w_i^2\bigr)}{1-q}$. Since $\frac{W}{W-1}S=\frac{W|S|}{1-W}$ and $2\lambda W^2\bigl(\sum_i 1/w_i^2\bigr)=3/8$ by the choice of $\lambda(t)$:

On the one hand, $\frac{W}{W-1}S\ge\frac{3}{8}$. On the other hand, $\frac{W}{W-1}S\le\frac{3/8}{1-3/8}=\frac{3}{5}$. So $F\in\bigl[\frac{2}{5},\frac{5}{8}\bigr]$.

CASE 2: $W\ge 1$. The terms of the series now alternate in sign, so

$$\Bigl(1-\frac{q}{1-q}\Bigr)\cdot 2\lambda(W-1)W\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)\le S\le\frac{2\lambda(W-1)W\bigl(\sum_i 1/w_i^2\bigr)}{1-q}$$

On the one hand, $\frac{W}{W-1}S\ge\frac{3}{8}\Bigl(1-\frac{3/8}{5/8}\Bigr)=\frac{3}{20}$. On the other hand, $\frac{W}{W-1}S\le\frac{3}{5}$. So $F\in\bigl[\frac{2}{5},\frac{17}{20}\bigr]$.

To conclude, in both cases $0<F\le\frac{17}{20}<1$. Therefore $|W(t+1)-1|\le\frac{17}{20}\,|W(t)-1|$, the sign of $W-1$ is preserved, and $W(t)$ stays $\ge 1/2$. The loss monotonically converges to $0$ with the linear rate $L(t+1)\le\bigl(\frac{17}{20}\bigr)^2L(t)$. ∎

### 3.2 Gradient descent: implicit regularization

###### Theorem 3.2.

Consider discrete GD (Eq. 8). Assume that $W(0)\ge 1/2$. If we define an adaptive learning rate $\lambda(t)=\dfrac{3}{16\,W^2\bigl(\sum_i 1/w_i^2\bigr)}$, then the layer imbalance monotonically decreases.

###### Proof.

Let’s compute the layer imbalance for layers $i$ and $j$ after one GD step:

$$D_{ij}(t+1)=w_i(t+1)^2-w_j(t+1)^2=\bigl(w_i-2\lambda(W-1)W/w_i\bigr)^2-\bigl(w_j-2\lambda(W-1)W/w_j\bigr)^2$$
$$=(w_i^2-w_j^2)\cdot\bigl(1-4\lambda^2(W-1)^2W^2/(w_iw_j)^2\bigr)=D_{ij}(t)\cdot k$$

On the one hand, the factor $k=1-4\lambda^2(W-1)^2W^2/(w_iw_j)^2\le 1$.

On the other hand:

$$k\ge 1-\lambda^2(W-1)^2W^2\Bigl(\frac{1}{w_i^2}+\frac{1}{w_j^2}\Bigr)^2\ge 1-\lambda^2\Bigl(\sum_l\frac{1}{w_l^2}\Bigr)^2(W-1)^2W^2\ge 1-\frac{9}{256}=\frac{247}{256}$$

So $\frac{247}{256}\le k\le 1$ and $|D_{ij}(t+1)|\le|D_{ij}(t)|$. This guarantees that the layer imbalance decreases. ∎

Note. We proved only that the layer imbalance decreases, not that it converges to $0$. The layer imbalance may stay large if the loss converges to $0$ too fast, so that the factor $k\to 1$. To drive the layer imbalance to $0$, we should keep the loss within a certain range during training: for example, we could increase the learning rate if the loss becomes too small, and decrease it if the loss becomes too large.
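Both statements above (the imbalance never increases, but need not reach $0$ once the loss vanishes) are easy to observe numerically. The sketch below uses a small constant learning rate instead of the adaptive schedule of Theorem 3.2; names and constants are illustrative:

```python
# Implicit regularization of discrete GD (Sec. 3.2): each step multiplies the
# pairwise imbalance w_i^2 - w_j^2 by k = 1 - 4*lr^2*(W-1)^2*W^2/(w_i*w_j)^2,
# so the imbalance can only shrink -- and it stops shrinking once the loss is 0.
import math

def gd_step(w, lr):
    W = math.prod(w)
    return [wi - lr * 2.0 * (W - 1.0) * (W / wi) for wi in w]

def imbalance(w):
    sq = [wi * wi for wi in w]
    return max(sq) - min(sq)

w, lr = [3.0, 0.1], 0.05
history = [imbalance(w)]
for _ in range(200):
    w = gd_step(w, lr)
    history.append(imbalance(w))

assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))  # monotone
assert history[-1] < history[0]            # strictly smaller at the end
assert (math.prod(w) - 1.0) ** 2 < 1e-6    # while the loss has converged
```

Note that the final imbalance is still far from $0$: once the loss vanishes, the factor $k$ is numerically $1$ and the imbalance freezes, exactly as the note above describes.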

## 4 Explicit regularization

In this section, we prove that regularization methods, such as weight decay, noise data augmentation, and continuous dropout, decrease the layer imbalance.

### 4.1 Training with weight decay

As before, we consider gradient descent for a linear network with $d$ scalar layers. Let’s add a weight decay (WD) term to the loss: $\bar L=L+\mu\sum_i w_i^2$.

The continuous GD with weight decay is described by the following differential equations:

$$\frac{dw_i}{dt}=-\lambda\frac{\partial\bar L}{\partial w_i}=-2\lambda\Bigl((W-1)(W/w_i)+\mu\,w_i\Bigr) \qquad (9)$$

Accordingly, the loss dynamics for continuous GD with weight decay is:

$$\frac{dL}{dt}=\sum_i\frac{\partial L}{\partial w_i}\cdot\frac{dw_i}{dt}=-4\lambda\Bigl((W-1)^2W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)+\mu\,d\,(W-1)W\Bigr)$$
$$=-4\lambda\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)W^2(W-1)\Bigl(W-\Bigl(1-\frac{\mu\,d}{W\bigl(\sum_i 1/w_i^2\bigr)}\Bigr)\Bigr)$$

The loss decreases when $(W-1)\Bigl(W-1+\dfrac{\mu\,d}{W\bigl(\sum_i 1/w_i^2\bigr)}\Bigr)>0$, i.e. outside the weight-decay band $1-\dfrac{\mu\,d}{W\bigl(\sum_i 1/w_i^2\bigr)}\le W\le 1$. The width of this band is controlled by the weight decay coefficient $\mu$.

We can divide GD training with weight decay into two phases: (1) optimization and (2) regularization. During the first phase, the loss decreases until the trajectory gets inside the WD band. During the second phase, the loss can oscillate, but the trajectory stays inside the WD band (Fig. 1(b)) and moves toward the flat minima area. The layer imbalance monotonically decreases:

$$\frac{d(w_i^2-w_j^2)}{dt}=-4\lambda\cdot\Bigl(\bigl((W-1)W+\mu w_i^2\bigr)-\bigl((W-1)W+\mu w_j^2\bigr)\Bigr)=-4\lambda\,\mu\,(w_i^2-w_j^2)$$
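A short simulation illustrates the two phases under weight decay; the update below is the discrete version of Eq. (9), and the constants are illustrative:

```python
# GD with weight decay (Sec. 4.1): the update adds a shrinkage term mu*w_i,
# and the pairwise imbalance now decays at a rate ~ exp(-4*lr*mu*t) rather
# than merely never increasing.
import math

def gd_wd_step(w, lr, mu):
    W = math.prod(w)
    return [wi - 2.0 * lr * ((W - 1.0) * (W / wi) + mu * wi) for wi in w]

def imbalance(w):
    sq = [wi * wi for wi in w]
    return max(sq) - min(sq)

w, lr, mu = [2.0, 0.5], 0.01, 0.1
d0 = imbalance(w)
for _ in range(2000):
    w = gd_wd_step(w, lr, mu)

assert imbalance(w) < 0.05 * d0            # driven toward the balanced, flat area
assert abs(math.prod(w) - (1 - mu)) < 1e-3 # settles inside the weight-decay band
```

For two layers, the balanced stationary point satisfies $w^2=1-\mu$, so the final product $W=1-\mu$ sits at the inner edge of the WD band.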

### 4.2 Training with noise augmentation

Bishop (1995) showed that for shallow networks, training with noise is equivalent to Tikhonov regularization. We extend this result to DLNs.

Let’s augment the training data with multiplicative noise: $\tilde x=(1+\eta)x$, where the noise $\eta$ has $0$-mean and is bounded: $|\eta|\le\delta<1$. The DLN with noise augmentation can be written in the following form:

$$\tilde y=w_1\cdots w_d\cdot(1+\eta)x \qquad (10)$$

This model also describes continuous dropout (Srivastava et al., 2014), when layer outputs are multiplied by noise: $\tilde y=w_d(1+\eta_d)\cdots w_1(1+\eta_1)\cdot x$. The model can also be used for continuous drop-connect (Kingma et al., 2015; Wan et al., 2013), when the noise is applied to the weights: $w_l\to w_l+\eta_l$.

The GD with noise augmentation is described by the following stochastic differential equations:

$$\frac{dw_i}{dt}=-\lambda\frac{\partial\tilde L}{\partial w_i}=-2\lambda\,(1+\eta)\bigl(W(1+\eta)-1\bigr)(W/w_i)$$

Let’s consider the loss dynamics:

$$\frac{dL}{dt}=\sum_i\Bigl(\frac{\partial L}{\partial w_i}\cdot\frac{dw_i}{dt}\Bigr)=-4\lambda(1+\eta)W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)(W-1)\bigl(W(1+\eta)-1\bigr)$$
$$=-4\lambda(1+\eta)^2W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)\cdot(W-1)\Bigl(W-\frac{1}{1+\eta}\Bigr)$$

The loss decreases while the factor $(W-1)\bigl(W-\frac{1}{1+\eta}\bigr)>0$, i.e. outside of the noise band $\frac{1}{1+\delta}\le W\le\frac{1}{1-\delta}$. The training trajectory is the hyperbola $w_i^2-w_j^2=\text{const}$. When the trajectory gets inside the noise band, it oscillates around the minima manifold, but the layer imbalance remains constant for continuous GD.

Consider now discrete GD with noise augmentation:

$$w_i(t+1)=w_i-2\lambda(1+\eta)\bigl(W(1+\eta)-1\bigr)(W/w_i) \qquad (11)$$

For discrete GD, noise augmentation works similarly to weight decay. Training has two phases: (1) optimization and (2) regularization (Fig. 1(c)). During the optimization phase, the loss decreases until the trajectory hits the noise band. Next, the trajectory oscillates inside the noise band, and the layer imbalance decreases. The noise variance $\sigma^2$ defines the band width, similarly to the weight decay coefficient $\mu$.
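The regularization phase can be isolated numerically by starting on the minima manifold, so that only the noise-driven decrease of the imbalance is visible. Below is a sketch of the discrete update (11) with uniform noise; constants are illustrative:

```python
# Noise data augmentation (Sec. 4.2): multiplicative input noise makes the
# discrete-GD trajectory oscillate inside the noise band while the layer
# imbalance shrinks.  We start ON the minima manifold (W = 1), so there is
# no optimization phase, only regularization.
import math
import random

def noisy_gd_step(w, lr, delta, rng):
    eta = rng.uniform(-delta, delta)       # zero-mean noise, |eta| <= delta
    W = math.prod(w)
    g = 2.0 * lr * (1.0 + eta) * (W * (1.0 + eta) - 1.0)
    return [wi - g * (W / wi) for wi in w]

def imbalance(w):
    sq = [wi * wi for wi in w]
    return max(sq) - min(sq)

rng = random.Random(0)
w, lr, delta = [2.0, 0.5], 0.05, 0.3
d0 = imbalance(w)                          # 3.75 at the start
for _ in range(5000):
    w = noisy_gd_step(w, lr, delta, rng)

assert imbalance(w) < 0.5 * d0             # imbalance decreased
assert abs(math.prod(w) - 1.0) < 2 * delta # still near the noise band
```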

###### Theorem 4.1.

Consider discrete GD with noise augmentation (Eq. 11). Assume that the noise $\eta$ has $0$-mean and is bounded: $|\eta|\le\delta\le 1/2$. If we define the adaptive learning rate $\lambda(t)=\bigl(\frac{2}{3}\bigr)^{5}\big/\bigl(\sum_i 1/w_i^2\bigr)$, then the layer imbalance monotonically decreases inside the noise band $\frac{1}{1+\delta}\le W\le\frac{1}{1-\delta}$.

###### Proof.

Let’s estimate the layer imbalance after one GD step:

$$w_i(t+1)^2-w_j(t+1)^2=\bigl(w_i-2\lambda(1+\eta)(W(1+\eta)-1)W/w_i\bigr)^2-\bigl(w_j-2\lambda(1+\eta)(W(1+\eta)-1)W/w_j\bigr)^2$$
$$=(w_i^2-w_j^2)\cdot\bigl(1-4\lambda^2(1+\eta)^2(W(1+\eta)-1)^2W^2/(w_iw_j)^2\bigr)$$

On the one hand, the factor $k=1-4\lambda^2(1+\eta)^2(W(1+\eta)-1)^2W^2/(w_iw_j)^2\le 1$.

On the other hand:

$$k=1-4\lambda^2(1+\eta)^4\Bigl(W-\frac{1}{1+\eta}\Bigr)^2W^2/(w_iw_j)^2\ge 1-\lambda^2(1+\eta)^4\Bigl(W-\frac{1}{1+\eta}\Bigr)^2W^2\Bigl(\frac{1}{w_i^2}+\frac{1}{w_j^2}\Bigr)^2$$
$$\ge 1-\lambda^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)^2(1+\delta)^4\Bigl(\delta+\frac{\delta}{1-\delta}\Bigr)^2\Bigl(\frac{1}{1-\delta}\Bigr)^2\ge 1-\lambda^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)^2\Bigl(\frac{3}{2}\Bigr)^{10}$$

Taking $\lambda(t)=\bigl(\frac{2}{3}\bigr)^{5}\big/\bigl(\sum_i 1/w_i^2\bigr)$ makes $0\le k\le 1$, which proves that the layer imbalance decreases. ∎

Note. We can prove that the layer imbalance converges to $0$ if we also assume that all layers are uniformly bounded: $|w_i|\le C$. This implies that the adaptive learning rate is bounded from below, $\lambda(t)\ge\lambda_0>0$ for all $t$, and we can estimate the expectation of the factor $k$:

$$E(k)=1-E\Bigl[4\lambda^2(1+\eta)^4\Bigl(W-\frac{1}{1+\eta}\Bigr)^2W^2/(w_iw_j)^2\Bigr]\le 1-4\lambda^2\,\frac{W^2}{(w_iw_j)^2}\,(1+\sigma^2)\,\sigma^2\le 1-4\lambda^2\,\frac{1}{4C^4}\,(1+\sigma^2)\,\sigma^2\le 1-\frac{\lambda^2\sigma^2}{C^4}$$

This proves that the layer imbalance converges to $0$ with rate $\bigl(1-\lambda_0^2\sigma^2/C^4\bigr)^t$.

## 5 SGD noise as implicit regularization

In this section, we show that SGD works as implicit noise regularization and that the layer imbalance converges to $0$. As before, we train a DLN with the loss $E$ on a normalized dataset with $N$ samples $(x_n,y_n)$:

$$\sum_n x_n=0;\quad \frac{1}{N}\sum_n x_n^2=1;\qquad \sum_n y_n=0;\quad \frac{1}{N}\sum_n x_ny_n=1.$$

A stochastic gradient for a batch $B$ with $|B|$ samples is:

$$\frac{\partial L_B}{\partial w_i}=\frac{1}{|B|}\sum_{n\in B}2\bigl(Wx_n^2-x_ny_n\bigr)W/w_i=2\Bigl(W\Bigl(\frac{1}{|B|}\sum_{n\in B}x_n^2\Bigr)-\Bigl(\frac{1}{|B|}\sum_{n\in B}x_ny_n\Bigr)\Bigr)W/w_i$$

If the batch size $|B|\ll N$, then the batch statistics fluctuate around their means: $\frac{1}{|B|}\sum_{n\in B}x_n^2=1+\eta_1$ and $\frac{1}{|B|}\sum_{n\in B}x_ny_n=1+\eta_2$. So we can write the stochastic gradient in the following form:

$$\frac{\partial L_B}{\partial w_i}=2\bigl(W(1+\eta_1)-(1+\eta_2)\bigr)W/w_i=2\bigl(W-1+(W\eta_1-\eta_2)\bigr)W/w_i$$

The term $\eta_1$ works as noise data augmentation, and the term $\eta_2$ works as label noise. Both $\eta_1$ and $\eta_2$ have $0$-mean. When the loss is small, we can combine both components into one SGD noise term $\eta=W\eta_1-\eta_2$, which also has $0$-mean. We assume that the SGD noise variance depends on the batch size as $\sigma^2\sim 1/|B|$. The trajectory of continuous SGD is described by the following stochastic differential equations:

$$\frac{dw_i}{dt}=-\lambda\frac{\partial L_B}{\partial w_i}=-2\lambda(W-1+\eta)\,W/w_i$$

$$\frac{dL}{dt}=-4\lambda W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)\cdot(W-1)(W-1+\eta)$$

For continuous SGD, the loss decreases everywhere except inside the SGD noise band $|W-1|\le|\eta|$. The band width depends on the batch size: the smaller the batch, the wider the band. SGD training consists of two phases. First, the loss decreases until the trajectory hits the SGD noise band. Then the trajectory oscillates inside the band. The layer imbalance remains constant for continuous SGD.

Similarly to noise augmentation, the layer imbalance decreases for discrete SGD:

$$w_i(t+1)=w_i-2\lambda(W-1+\eta)\,W/w_i \qquad (12)$$
###### Theorem 5.1.

Consider discrete SGD (Eq. 12). Assume that the trajectory is inside the noise band, $|W-1|\le\delta$, and that the SGD noise is bounded: $|\eta|\le\delta$. If we define the adaptive learning rate $\lambda(t)=\dfrac{1}{2\delta(1+\delta)\bigl(\sum_i 1/w_i^2\bigr)}$, then the layer imbalance monotonically decreases.

###### Proof.

Let’s estimate the layer imbalance after one SGD step:

$$w_i(t+1)^2-w_j(t+1)^2=\bigl(w_i-2\lambda(W-1+\eta)W/w_i\bigr)^2-\bigl(w_j-2\lambda(W-1+\eta)W/w_j\bigr)^2$$
$$=(w_i^2-w_j^2)\cdot\bigl(1-4\lambda^2(W-1+\eta)^2W^2/(w_iw_j)^2\bigr)$$

On the one hand, the factor $k=1-4\lambda^2(W-1+\eta)^2W^2/(w_iw_j)^2\le 1$. On the other hand:

$$k\ge 1-\lambda^2(W-1+\eta)^2W^2\Bigl(\frac{1}{w_i^2}+\frac{1}{w_j^2}\Bigr)^2\ge 1-2\lambda^2W^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)^2\bigl((W-1)^2+\eta^2\bigr)\ge 1-4\lambda^2\Bigl(\sum_i\frac{1}{w_i^2}\Bigr)^2\delta^2(1+\delta)^2$$

Setting $\lambda(t)=\dfrac{1}{2\delta(1+\delta)\bigl(\sum_i 1/w_i^2\bigr)}$ makes $0\le k\le 1$, which completes the proof. ∎

The layer imbalance converges to $0$ at a rate proportional to the variance of the SGD noise. Keskar et al. (2016) observed that SGD training with a large batch leads to sharp solutions, which generalize worse than solutions obtained with a smaller batch. This observation agrees with Theorem 5.1: the layer imbalance decreases at a rate proportional to $\sigma^2$, and when the batch size increases, $|B|\to N$, the variance of SGD noise decreases as $1/|B|$. One can compensate for smaller SGD noise with additional regularization: data augmentation, weight decay, dropout, etc.
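The contrast between large-batch and small-batch SGD can be reproduced on a toy normalized dataset. In the sketch below (the dataset and constants are illustrative), full-batch GD starts at an exact minimum and leaves the imbalance frozen, while batch-size-1 SGD keeps shrinking it:

```python
# SGD as implicit noise regularization (Sec. 5): single-sample SGD keeps
# injecting gradient noise at the minimum and so keeps shrinking the layer
# imbalance, while full-batch GD does not.
import math
import random

# Dataset satisfying the normalization (2): sum x = 0, mean x^2 = 1,
# sum y = 0, mean x*y = 1 (here y = x + z with z orthogonal to x).
a, b, c = 1.2, math.sqrt(2.0 - 1.2 ** 2), 0.5
xs = [a, -a, b, -b]
ys = [a + c, -a + c, b - c, -b - c]

def sgd_step(w, batch, lr):
    W = math.prod(w)
    sx2 = sum(x * x for x, _ in batch) / len(batch)
    sxy = sum(x * y for x, y in batch) / len(batch)
    g = 2.0 * lr * (W * sx2 - sxy)
    return [wi - g * (W / wi) for wi in w]

def imbalance(w):
    sq = [wi * wi for wi in w]
    return max(sq) - min(sq)

def train(batch_size, steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = [2.0, 0.5]                          # W = 1: already a full-batch minimum
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        w = sgd_step(w, batch, lr)
    return imbalance(w)

full = train(batch_size=4)                  # full batch: zero gradient, D frozen
small = train(batch_size=1)                 # batch of 1: noisy gradient, D shrinks
assert abs(full - imbalance([2.0, 0.5])) < 1e-9
assert small < 0.5 * full
```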

## 6 Discussion

In this work, we explore the dynamics of gradient descent training for deep linear networks. Using the layer imbalance metric, we analyze how regularization methods such as $L_2$-regularization, noise data augmentation, and dropout affect training dynamics. We show that for all these methods, training has two distinct phases: optimization and regularization. During the optimization phase, the training trajectory moves from an initial point toward the minima manifold, and the loss monotonically decreases. During the regularization phase, the trajectory moves along the minima manifold toward flat minima, and the layer imbalance monotonically decreases. We prove analytically that noise augmentation and continuous dropout work similarly to $L_2$-regularization. Finally, we show that SGD behaves in the same way as gradient descent with noise regularization.

This work provides an analysis of regularization for scalar linear networks. We leave the question of how regularization works for over-parameterized nonlinear networks for future research. The work also gives a few interesting insights into training dynamics, which can lead to new algorithms for large batch training, new learning rate policies, etc.

#### Acknowledgments

We would like to thank Vitaly Lavrukhin, Nadav Cohen and Daniel Soudry for the valuable feedback.

## References

• S. Arora, N. Golowich, N. Cohen, and W. Hu (2019) A convergence analysis of gradient descent for deep linear neural networks. In ICLR. Cited by: §2.2.
• P. Baldi and K. Hornik (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2 (1), pp. 53–58. Cited by: §2.
• C. M. Bishop (1995) Training with noise is equivalent to Tikhonov regularization. Neural Computation 7, pp. 108–116. Cited by: §4.2.
• L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. In ICML. Cited by: §2.1, §2.2.
• S. Hochreiter and J. Schmidhuber (1994a) Flat minima search for discovering simple nets. Technical Report FKI-200-94, Fakultät für Informatik, Technische Universität München. Cited by: §2.1.
• S. Hochreiter and J. Schmidhuber (1994b) Simplifying neural nets by discovering flat minima. In NIPS. Cited by: §1, §2.1.
• N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. In ICLR. Cited by: §2.1, §5.
• D. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In NIPS. Cited by: §4.2.
• B. Neyshabur, R. Salakhutdinov, and N. Srebro (2015) Path-SGD: path-normalized optimization in deep neural networks. In NIPS, pp. 2422–2430. Cited by: §2.2.
• A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR. Cited by: §1, §3.1.
• N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. Cited by: §4.2.
• L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus (2013) Regularization of neural networks using DropConnect. In ICML. Cited by: §4.2.