 # Gradient descent aligns the layers of deep linear networks

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized $i$-th weight matrix asymptotically equals its rank-1 approximation $u_i v_i^\top$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^\top u_i| \to 1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network (the product of its weight matrices) converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.

## 1 Introduction

Efforts to explain the effectiveness of gradient descent in deep learning have uncovered an exciting possibility: it not only finds solutions with low error, but also biases the search for low complexity solutions which generalize well (Zhang et al., 2017; Bartlett et al., 2017; Soudry et al., 2017; Gunasekar et al., 2018).

This paper analyzes the implicit regularization of gradient descent and gradient flow on deep linear networks and linearly separable data. For strictly decreasing losses, the optimum is off at infinity, and we establish various alignment phenomena:

• For each weight matrix $W_i$, the corresponding normalized weight matrix $W_i/\|W_i\|_F$ asymptotically equals its rank-$1$ approximation $u_i v_i^\top$, where the Frobenius norm $\|W_i\|_F$ satisfies $\|W_i\|_F \to \infty$. In other words, $\|W_i\|_2/\|W_i\|_F \to 1$, and asymptotically only the rank-$1$ approximation of each weight matrix contributes to the final predictor, a form of implicit regularization.

• Adjacent rank-$1$ weight matrix approximations are aligned: $|v_{i+1}^\top u_i| \to 1$.

• For the logistic loss, the first right singular vector $v_1$ of $W_1$ is aligned with the data, meaning $v_1$ converges (up to sign) to the unique maximum margin predictor $\bar u$ defined by the data. Moreover, the linear predictor induced by the network, $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, is also aligned with the data, meaning $w_{\mathrm{prod}}/\|w_{\mathrm{prod}}\| \to \bar u$.

Simultaneously, this work proves that the risk is globally optimized: it asymptotes to 0. Alignment and risk convergence are proved simultaneously; the phenomena are coupled within the proofs.

The paper is organized as follows. This introduction continues with related work in Section 1.1 and with notation, setting, and assumptions in Section 1.2. The analysis of gradient flow is in Section 2, and gradient descent is analyzed in Section 3. The paper closes with future directions in Section 4; a particular highlight is a preliminary experiment on CIFAR-10 which establishes empirically that a form of the alignment phenomenon occurs on the standard nonlinear network AlexNet.

### 1.1 Related work

On the implicit regularization of gradient descent, Soudry et al. (2017) show that for linear predictors and linearly separable data, the gradient descent iterates converge to the same direction as the maximum margin solution. Ji and Telgarsky (2018) further characterize such an implicit bias for general nonseparable data. Gunasekar et al. (2018) consider gradient descent on fully connected linear networks and linear convolutional networks. In particular, for the exponential loss, assuming the risk is minimized to $0$ and the gradients converge in direction, they show that the whole network converges in direction to the maximum margin solution. These two assumptions are on the gradient descent process itself, and the second one in particular might be hard to interpret and justify. Compared with Gunasekar et al. (2018), this paper proves that the risk converges to $0$ and the weight matrices align; moreover, the proof here establishes the properties simultaneously, rather than assuming one and deriving the other. Lastly, for ReLU networks, Du et al. (2018) show that gradient flow does not change the difference between squared Frobenius norms of any two layers.

For a smooth (nonconvex) function, Lee et al. (2016) show that any strict saddle can be avoided almost surely with small step sizes. If there are only countably many saddle points and they are all strict, and if gradient descent iterates converge, then this implies (almost surely) they converge to a local minimum. In the present work, since there is no finite local minimum, gradient descent will go to infinity and never converge, and thus these results of Lee et al. (2016) do not show that the risk converges to $0$.

There has been a rich literature on linear networks. Saxe et al. (2013) analyze the learning dynamics of deep linear networks, showing that they exhibit some learning patterns similar to nonlinear networks, such as a long plateau followed by a rapid risk drop. Arora et al. (2018) show that depth can help accelerate optimization. On the landscape properties of deep linear networks, Lu and Kawaguchi (2017); Laurent and von Brecht (2017) show that under various structural assumptions, all local optima are global. Zhou and Liang (2018) give a necessary and sufficient characterization of critical points for deep linear networks.

### 1.2 Notation, setting, and assumptions

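Consider a data set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$, $\|x_i\| \le 1$, and $y_i \in \{-1, +1\}$. The data set is assumed to be linearly separable, i.e., there exists a unit vector $u$ which correctly classifies every data point: for any $1 \le i \le n$, $y_i \langle u, x_i \rangle > 0$. Furthermore, let $\gamma := \max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle > 0$ denote the maximum margin, and $\bar u := \arg\max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle$ denote the maximum margin solution (the solution to the hard-margin SVM).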

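A linear network of depth $L$ is parameterized by weight matrices $W_L, \ldots, W_1$, where $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $d_0 = d$, and $d_L = 1$. Let $W = (W_L, \ldots, W_1)$ denote all parameters of the network. The (empirical) risk induced by the network is given by

$$R(W) = R(W_L, \ldots, W_1) = \frac{1}{n} \sum_{i=1}^n \ell\left(y_i W_L \cdots W_1 x_i\right) = \frac{1}{n} \sum_{i=1}^n \ell\left(\langle w_{\mathrm{prod}}, z_i \rangle\right),$$

where $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, and $z_i := y_i x_i$.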
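As a quick sanity check of the two equivalent forms of the risk, the following minimal sketch evaluates it once through the layers and once through the induced linear predictor. The layer widths, the toy data, and the helper name `end_to_end` are assumptions made for illustration only; the logistic loss is used as one admissible choice of loss.

```python
import numpy as np

def logistic_loss(x):
    # ln(1 + e^{-x}), evaluated stably
    return np.logaddexp(0.0, -x)

rng = np.random.default_rng(0)
n, d = 8, 5
dims = [d, 4, 3, 1]                      # hypothetical widths d_0, ..., d_L

# Toy separable data: labels are the signs given by a hidden unit vector.
u_true = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i|| <= 1
y = np.sign(X @ u_true)
Z = y[:, None] * X                       # z_i = y_i x_i

# Ws[k - 1] stores W_k with shape (d_k, d_{k-1}).
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]

def end_to_end(Ws):
    """Return the 1 x d matrix W_L ... W_1."""
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

w_prod = end_to_end(Ws).ravel()          # (W_L ... W_1)^T as a vector in R^d
risk_from_layers = np.mean(logistic_loss(y * (end_to_end(Ws) @ X.T).ravel()))
risk_from_w_prod = np.mean(logistic_loss(Z @ w_prod))
print(risk_from_layers, risk_from_w_prod)
```

The two printed values agree up to floating-point rounding, which is all the identity claims.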

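The loss $\ell$ is assumed to be continuously differentiable, unbounded, and strictly decreasing to $0$. Examples include the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$. Formally, the loss assumption is: $\ell' < 0$ is continuous, $\lim_{x \to -\infty} \ell(x) = \infty$, and $\lim_{x \to \infty} \ell(x) = 0$.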

This paper considers gradient flow and gradient descent, where gradient flow can be interpreted as gradient descent with infinitesimal step sizes. Gradient flow starts from some $W(0)$ at $t = 0$, and proceeds as

$$\frac{\mathrm{d} W(t)}{\mathrm{d} t} = -\nabla R\left(W(t)\right).$$

By contrast, gradient descent is a discrete-time process given by

$$W(t+1) = W(t) - \eta_t \nabla R\left(W(t)\right),$$

where $\eta_t$ is the step size at time $t$.

We assume that the initialization of the network is not a critical point and induces a risk no larger than the risk of the trivial linear predictor $0$. Formally, the initialization assumption is: $W(0)$ satisfies $\nabla R(W(0)) \ne 0$ and $R(W(0)) \le R(0) = \ell(0)$. It is natural to require that the initialization is not a critical point, since otherwise gradient flow/descent will never make progress. The requirement $R(W(0)) \le R(0)$ can be easily satisfied, for example, by making $W_1(0) = 0$ and $W_L(0) \cdots W_2(0) \ne 0$. On the other hand, if $R(W(0)) > R(0)$, gradient flow/descent may never minimize the risk to $0$. Proofs of those claims are given in Appendix A.

## 2 Results for gradient flow

In this section, we consider gradient flow. Although impractical when compared with gradient descent, gradient flow can simplify the analysis and highlight proof ideas. For convenience, we usually write $W$, $W_k$, and $w_{\mathrm{prod}}$, but they all change with (the continuous time) $t$. Only proof sketches are given here; detailed proofs are deferred to Appendix B.

### 2.1 Risk convergence

One key property of gradient flow is that it never increases the risk:

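$$\frac{\mathrm{d} R(W)}{\mathrm{d} t} = \left\langle \nabla R(W), \frac{\mathrm{d} W}{\mathrm{d} t} \right\rangle = -\|\nabla R(W)\|^2 = -\sum_{k=1}^{L} \left\| \frac{\partial R}{\partial W_k} \right\|_F^2 \le 0. \tag{1}$$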

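We now state the main result: under the loss and initialization assumptions of Section 1.2, gradient flow minimizes the risk, all $\|W_k\|_F$ go to infinity, and the alignment phenomenon occurs. Formally, under these two assumptions, the gradient flow iterates satisfy the following properties:

• $\lim_{t \to \infty} R(W) = 0$.

• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k\|_F = \infty$.

• For any $1 \le k \le L$, letting $(u_k, v_k)$ denote the first left and right singular vectors of $W_k$,

$$\lim_{t \to \infty} \left\| \frac{W_k}{\|W_k\|_F} - u_k v_k^\top \right\|_F = 0.$$

Moreover, for any $1 \le k < L$, $\lim_{t \to \infty} |\langle v_{k+1}, u_k \rangle| = 1$. As a result,

$$\lim_{t \to \infty} \left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^{L} \|W_k\|_F}, v_1 \right\rangle \right| = 1,$$

and thus $\lim_{t \to \infty} \|w_{\mathrm{prod}}\| = \infty$.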

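The theorem above is proved using two lemmas, which may be of independent interest. To show the ideas, let us first introduce a little more notation. Recall that $R(W)$ denotes the empirical risk induced by the deep linear network $W$. Abusing the notation a little, for any linear predictor $w \in \mathbb{R}^d$, we also use $R(w)$ to denote the risk induced by $w$. With this notation, $R(w_{\mathrm{prod}}) = R(W)$, while

$$\nabla R(w_{\mathrm{prod}}) = \frac{1}{n} \sum_{i=1}^{n} \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) z_i = \frac{1}{n} \sum_{i=1}^{n} \ell'\left(W_L \cdots W_1 z_i\right) z_i$$

is in $\mathbb{R}^d$ and different from $\nabla R(W)$, which has $\sum_{k=1}^{L} d_k d_{k-1}$ entries, as given below:

$$\frac{\partial R}{\partial W_k} = W_{k+1}^\top \cdots W_L^\top \, \nabla R(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_{k-1}^\top.$$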
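The per-layer gradient formula can be checked numerically. The sketch below implements it for the logistic loss and compares it with central finite differences; the widths, the random data, and the helper names (`chain`, `layer_grads`, `numeric_grad`) are hypothetical choices for illustration, not part of the paper.

```python
import numpy as np

def loss(x):
    return np.logaddexp(0.0, -x)              # logistic loss ln(1 + e^{-x})

def loss_grad(x):
    return -np.exp(-np.logaddexp(0.0, x))     # its derivative -1 / (1 + e^{x})

def chain(mats, size):
    """mats[-1] @ ... @ mats[0], or the identity of the given size if empty."""
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def risk(Ws, Z):
    return np.mean(loss(Z @ chain(Ws, Z.shape[1]).ravel()))

def layer_grads(Ws, Z):
    d = Z.shape[1]
    w_prod = chain(Ws, d).ravel()
    g = (loss_grad(Z @ w_prod)[:, None] * Z).mean(axis=0)   # gradient of R at w_prod
    grads = []
    for k, W in enumerate(Ws):
        below = chain(Ws[:k], d)               # product of the layers below this one
        above = chain(Ws[k + 1:], W.shape[0])  # product of the layers above this one
        grads.append(above.T @ g[None, :] @ below.T)
    return grads

def numeric_grad(Ws, Z, k, eps=1e-6):
    """Central finite differences with respect to the entries of Ws[k]."""
    G = np.zeros_like(Ws[k])
    for idx in np.ndindex(*Ws[k].shape):
        plus = [W.copy() for W in Ws]
        minus = [W.copy() for W in Ws]
        plus[k][idx] += eps
        minus[k][idx] -= eps
        G[idx] = (risk(plus, Z) - risk(minus, Z)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
dims = [5, 4, 3, 1]                            # hypothetical widths
Z = rng.normal(size=(6, dims[0])) / 3.0
Ws = [0.5 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]
grads = layer_grads(Ws, Z)
max_err = max(np.abs(G - numeric_grad(Ws, Z, k)).max() for k, G in enumerate(grads))
print(max_err)
```

Each returned gradient has the same shape as its layer, and the discrepancy with finite differences should be at the level of numerical noise.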

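Furthermore, for any $R > 0$, let

$$B(R) := \left\{ W : \max_{1 \le k \le L} \|W_k\|_F \le R \right\}.$$

The first lemma shows that for any $R > 0$, the time spent by gradient flow in $B(R)$ is finite. Formally, under the loss and initialization assumptions of Section 1.2, for any $R > 0$, there exists a constant $\epsilon(R) > 0$, such that for any $t \ge 1$ with $W(t) \in B(R)$, $\|\partial R / \partial W_1\|_F \ge \epsilon(R)$. As a result, gradient flow spends a finite amount of time in $B(R)$ for any $R > 0$, and $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.

Here is a proof sketch. If all $\|W_k\|_F$ are bounded, then $\|\nabla R(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant; therefore, if $\|\partial R / \partial W_1\|_F$ can be arbitrarily small, then $\|W_L \cdots W_2\|$ and $\|w_{\mathrm{prod}}\|$ can also be arbitrarily small, and thus $R(W)$ can be arbitrarily close to $R(0)$. This cannot happen after $t = 1$, since it would contradict the initialization assumption of Section 1.2 and eq. 1.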

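To proceed, we need the following properties of linear networks from prior work (Arora et al., 2018; Du et al., 2018). For any time $t \ge 0$ and any $1 \le k < L$,

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{2}$$

To see this, just notice that

$$W_{k+1}^\top \frac{\partial R}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \, \nabla R(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_k^\top = \frac{\partial R}{\partial W_k} W_k^\top,$$

so that $W_{k+1}^\top W_{k+1}$ and $W_k W_k^\top$ have the same time derivative under gradient flow.

Taking the trace on both sides of eq. 2, we have

$$\|W_{k+1}(t)\|_F^2 - \|W_{k+1}(0)\|_F^2 = \|W_k(t)\|_F^2 - \|W_k(0)\|_F^2. \tag{3}$$

In other words, the difference between the squares of Frobenius norms of any two layers remains a constant. Together with the first lemma above, it implies that all $\|W_k\|_F$ are unbounded.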
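The conservation law can be illustrated numerically. The sketch below first checks the gradient identity behind eq. 2 at a random point, and then runs gradient descent with a small constant step size as a stand-in for gradient flow (an assumption of this sketch, so the conserved quantities drift slightly, at second order in the step size). Data, widths, step size, and helper names are hypothetical.

```python
import numpy as np

def loss_grad(x):
    return -np.exp(-np.logaddexp(0.0, x))     # derivative of ln(1 + e^{-x})

def chain(mats, size):
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def layer_grads(Ws, Z):
    d = Z.shape[1]
    g = (loss_grad(Z @ chain(Ws, d).ravel())[:, None] * Z).mean(axis=0)
    return [chain(Ws[k + 1:], W.shape[0]).T @ g[None, :] @ chain(Ws[:k], d).T
            for k, W in enumerate(Ws)]

def gram_gap(Ws, k):
    # W_{k+1}^T W_{k+1} - W_k W_k^T for adjacent layers (0-based index k)
    return Ws[k + 1].T @ Ws[k + 1] - Ws[k] @ Ws[k].T

rng = np.random.default_rng(2)
n, dims = 10, [5, 4, 3, 1]
X = rng.normal(size=(n, dims[0]))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
Z = np.sign(X @ np.ones(dims[0]))[:, None] * X     # separable by construction
Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]

# The identity W_{k+1}^T dR/dW_{k+1} = dR/dW_k W_k^T holds exactly.
G = layer_grads(Ws, Z)
identity_ok = all(np.allclose(Ws[k + 1].T @ G[k + 1], G[k] @ Ws[k].T)
                  for k in range(len(Ws) - 1))

gap0 = [gram_gap(Ws, k) for k in range(len(Ws) - 1)]
sq0 = [np.sum(W ** 2) for W in Ws]
eta = 1e-3                                         # small step, mimicking the flow
for _ in range(5000):
    Ws = [W - eta * Gk for W, Gk in zip(Ws, layer_grads(Ws, Z))]

drift = max(np.linalg.norm(gram_gap(Ws, k) - gap0[k], 2) for k in range(len(Ws) - 1))
growth = [np.sum(W ** 2) - s for W, s in zip(Ws, sq0)]
print(identity_ok, drift, growth)
```

In this sketch all layers gain roughly the same amount of squared Frobenius norm (`growth`), while `drift` stays small; shrinking `eta` shrinks the drift further.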

However, even if $\|W_k\|_F$ are large, it does not necessarily follow that $\|w_{\mathrm{prod}}\|$ is also large. The lemma stated next shows that this is indeed true: for gradient flow, as $\|W_k\|_F$ become larger, adjacent layers also get more aligned to each other, which ensures that their product also has a large norm.

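For $1 \le k \le L$, let $\sigma_k$, $u_k$, and $v_k$ denote the first singular value (the $2$-norm), the first left singular vector, and the first right singular vector of $W_k$, respectively. Furthermore, define

$$D := \left( \max_{1 \le k \le L} \|W_k(0)\|_F^2 \right) - \|W_L(0)\|_F^2 + \sum_{k=1}^{L-1} \left\| W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0) \right\|_2,$$

which depends only on the initialization. If for all $1 \le k < L$ it holds that $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then $D = 0$. The alignment lemma states that the gradient flow iterates satisfy the following properties:

• For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$.

• For any $1 \le k < L$, $\langle v_{k+1}, u_k \rangle^2 \ge 1 - \dfrac{D + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2}{\|W_{k+1}\|_2^2}$.

• Suppose $\max_{1 \le k \le L} \|W_k\|_F \to \infty$, then $\left| \left\langle w_{\mathrm{prod}} / \prod_{k=1}^{L} \|W_k\|_F, \, v_1 \right\rangle \right| \to 1$.

The proof is based on eq. 2 and eq. 3. If $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then eq. 2 gives that $W_k$ and $W_{k+1}$ have the same singular values, and $W_{k+1}$'s right singular vectors and $W_k$'s left singular vectors are the same. If this is true for any two adjacent layers, then since $W_L$ is a row vector, all layers have rank $1$. With general initialization, we have similar results when $\|W_k\|_F$ is large enough so that the initialization is negligible. Careful calculations give the exact results in the lemma.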
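To make the quantities in the alignment lemma concrete, the sketch below computes $D$ from a random initialization, runs small-step gradient descent as a stand-in for gradient flow, and compares the final iterate with the first two bounds. Because the lemma is stated for gradient flow, a small tolerance `tol` is allowed for discretization error; that tolerance, the data, the widths, and the helper names are all assumptions of this sketch.

```python
import numpy as np

def loss_grad(x):
    return -np.exp(-np.logaddexp(0.0, x))     # derivative of ln(1 + e^{-x})

def chain(mats, size):
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def layer_grads(Ws, Z):
    d = Z.shape[1]
    g = (loss_grad(Z @ chain(Ws, d).ravel())[:, None] * Z).mean(axis=0)
    return [chain(Ws[k + 1:], W.shape[0]).T @ g[None, :] @ chain(Ws[:k], d).T
            for k, W in enumerate(Ws)]

def top_triple(W):
    U, S, Vt = np.linalg.svd(W)
    return S[0], U[:, 0], Vt[0]               # sigma, first left, first right

rng = np.random.default_rng(3)
n, dims = 10, [5, 4, 3, 1]
X = rng.normal(size=(n, dims[0]))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
Z = np.sign(X @ np.ones(dims[0]))[:, None] * X     # separable by construction
Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]
L = len(Ws)

# D depends only on the initialization.
W0 = [W.copy() for W in Ws]
fro0 = [np.sum(W ** 2) for W in W0]
spec0 = [np.linalg.norm(W, 2) ** 2 for W in W0]
D = max(fro0) - fro0[-1] + sum(
    np.linalg.norm(W0[k] @ W0[k].T - W0[k + 1].T @ W0[k + 1], 2)
    for k in range(L - 1))

eta = 1e-3                                         # small step, mimicking the flow
for _ in range(10000):
    Ws = [W - eta * G for W, G in zip(Ws, layer_grads(Ws, Z))]

trip = [top_triple(W) for W in Ws]
rank_gap = [np.sum(W ** 2) - t[0] ** 2 for W, t in zip(Ws, trip)]   # ||W||_F^2 - ||W||_2^2
align_sq = [(trip[k + 1][2] @ trip[k][1]) ** 2 for k in range(L - 1)]
lower = [1 - (D + spec0[k + 1] + spec0[k]) / trip[k + 1][0] ** 2
         for k in range(L - 1)]
tol = 0.05                                         # slack for discretization error
print(D, rank_gap, align_sq, lower)
```

The lower bound on the alignment is vacuous (negative) while the layer norms are comparable to the initialization, and only becomes informative once $\|W_{k+1}\|_2$ has grown.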

An interesting point is that the implicit regularization result (the alignment lemma) helps establish the risk convergence theorem. Specifically, by the alignment lemma, if all layers have large norms, $\|W_L \cdots W_2\|$ will be large. If the risk is not minimized to $0$, $\|\nabla R(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant, and thus $\|\partial R / \partial W_1\|_F$ will be large. Invoking eq. 1, the first lemma, and eq. 3 gives a contradiction. Since the risk has no finite optimum, $\|W_k\|_F \to \infty$.

### 2.2 Convergence to the maximum margin solution

Here we focus on the exponential loss $\ell_{\exp}$ and the logistic loss $\ell_{\log}$. In addition to risk convergence, these two losses also enable gradient descent to find the maximum margin solution.

To get such a strong convergence, we need one more assumption on the data set. Recall that $\gamma$ denotes the maximum margin, and $\bar u$ denotes the unique maximum margin predictor which attains this margin $\gamma$. Those data points $z_i$ for which $\langle \bar u, z_i \rangle = \gamma$ are called support vectors. The assumption is that the support vectors span the whole space $\mathbb{R}^d$. This assumption appears in prior work (Soudry et al., 2017), and can be satisfied in many cases: for example, it is almost surely true if the number of support vectors is larger than or equal to $d$ and the data set is sampled from some density w.r.t. the Lebesgue measure. It can also be relaxed to the situation that the support vectors and the whole data set span the same space; in this case the gradient $\nabla R(w_{\mathrm{prod}})$ will never leave this space, and we can always restrict our attention to this space.

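With this assumption, we can state the main theorem of this section. Under the initialization assumption of Section 1.2 and the support vector assumption above, for almost all data and for the losses $\ell_{\exp}$ and $\ell_{\log}$, $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$, where $v_1$ is the first right singular vector of $W_1$. As a result, $\lim_{t \to \infty} w_{\mathrm{prod}} / \|w_{\mathrm{prod}}\| = \bar u$.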

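This theorem relies on two structural lemmas. The first one is based on a similar almost-all argument due to Soudry et al. (2017, Lemma 8). Let $S$ denote the set of indices of support vectors. Under the support vector assumption, if the data set is sampled from some density w.r.t. the Lebesgue measure, then with probability $1$,

$$\alpha := \min_{|\xi| = 1, \, \xi \perp \bar u} \; \max_{i \in S} \langle \xi, z_i \rangle > 0.$$

For the second lemma, let $\bar u^\perp$ denote the orthogonal complement of $\mathrm{span}(\bar u)$, and let $\Pi_\perp$ denote the projection onto $\bar u^\perp$. Under the support vector assumption, for almost all data, for the losses $\ell_{\exp}$ and $\ell_{\log}$, and for any $w \in \mathbb{R}^d$, if $\langle w, \bar u \rangle \ge 0$ and $\|\Pi_\perp w\|$ is larger than a loss-dependent threshold (one value for $\ell_{\exp}$, another for $\ell_{\log}$), then $\langle \Pi_\perp w, \nabla R(w) \rangle \ge 0$.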

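With these two lemmas in hand, we can prove the theorem. Let $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\bar u^\perp$. Notice that

$$\Pi_\perp w_{\mathrm{prod}} = \left( W_L \cdots W_2 \, (\Pi_\perp W_1) \right)^\top \quad \text{and} \quad \frac{\mathrm{d} \|\Pi_\perp W_1\|_F^2}{\mathrm{d} t} = -2 \left\langle \Pi_\perp w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \right\rangle,$$

where the factor $2$ comes from differentiating the squared norm.

If $\|\Pi_\perp W_1\|_F$ is large compared with $\|W_1\|_F$, since layers become aligned, $\|\Pi_\perp w_{\mathrm{prod}}\|$ will also be large, and then the second lemma implies that $\|\Pi_\perp W_1\|_F$ will not increase. At the same time, $\|W_1\|_F \to \infty$, and thus for large enough $t$, $\|\Pi_\perp W_1\|_F$ must be very small compared with $\|W_1\|_F$. Many details need to be handled to make this intuition exact; the proof is given in Appendix B.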
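The projection identity used in this argument is purely algebraic and easy to verify. In the minimal sketch below, `u_bar` is an arbitrary unit vector standing in for the maximum margin direction (an assumption for illustration; no SVM is solved), and the three-layer shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
dims = [5, 4, 3, 1]                                # hypothetical widths
W1, W2, W3 = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]

# Arbitrary unit vector standing in for the maximum margin direction.
u_bar = rng.normal(size=dims[0])
u_bar /= np.linalg.norm(u_bar)
P_perp = np.eye(dims[0]) - np.outer(u_bar, u_bar)  # projection onto u_bar's complement

w_prod = (W3 @ W2 @ W1).ravel()
proj_W1 = W1 @ P_perp                              # projects every row of W_1

lhs = P_perp @ w_prod                              # projection of w_prod
rhs = (W3 @ W2 @ proj_W1).ravel()                  # (W_L ... W_2 (proj W_1))^T
upper = np.linalg.norm(W3 @ W2, 2) * np.linalg.norm(proj_W1, "fro")
print(np.linalg.norm(lhs), upper)
```

The second printed number bounds the first, which is how a small $\|\Pi_\perp W_1\|_F$ controls $\|\Pi_\perp w_{\mathrm{prod}}\|$ once the norm of the upper layers is accounted for.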

## 3 Results for gradient descent

One key property of gradient flow used in the previous proofs is that it never increases the risk, which is not necessarily true for gradient descent. However, for smooth losses (i.e., with Lipschitz continuous derivatives), we can design some decaying step sizes, with which gradient descent never increases the risk, and basically the same results hold as in the gradient flow case. Deferred proofs are given in Appendix C.

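We make the following additional assumption on the loss, which is satisfied by the logistic loss $\ell_{\log}$: $\ell'$ is $\beta$-Lipschitz (i.e., $\ell$ is $\beta$-smooth), and $|\ell'| \le G$ (i.e., $\ell$ is $G$-Lipschitz). Under this smoothness assumption, the risk is also a smooth function of $W$, if all layers are bounded. More precisely, suppose $\ell$ is $\beta$-smooth. Then for any $R > 0$, the risk is a $\beta(R)$-smooth function on the set $B(R) = \{ W : \max_{1 \le k \le L} \|W_k\|_F \le R \}$, where $\beta(R)$ is a constant depending on $R$.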

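Smoothness ensures that for any $W, V \in B(R)$, $R(V) - R(W) \le \langle \nabla R(W), V - W \rangle + \frac{\beta(R)}{2} \|V - W\|^2$ (see Bubeck et al., 2015, Lemma 3.4). In particular, if we choose some $R$ and set a constant step size $\eta_t = 1/\beta(R)$, then as long as $W(t+1)$ and $W(t)$ are both in $B(R)$,

$$\begin{aligned} R(W(t+1)) - R(W(t)) &\le \left\langle \nabla R(W(t)), -\eta_t \nabla R(W(t)) \right\rangle + \frac{\beta(R) \eta_t^2}{2} \|\nabla R(W(t))\|^2 \\ &= -\frac{1}{2\beta(R)} \|\nabla R(W(t))\|^2 = -\frac{\eta_t}{2} \|\nabla R(W(t))\|^2. \end{aligned} \tag{4}$$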

In other words, the risk does not increase at this step. However, similar to gradient flow, the gradient descent iterate will eventually escape $B(R)$, which may increase the risk. Formally, under the smoothness assumption above and the loss and initialization assumptions of Section 1.2, suppose gradient descent is run with a constant step size $1/\beta(R)$. Then there exists a time $t$ when $W(t) \notin B(R)$, in other words, $\max_{1 \le k \le L} \|W_k(t)\|_F > R$.

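Fortunately, this issue can be handled by adaptively increasing $R$ and correspondingly decreasing the step sizes, formalized in the following step size assumption. The step size $\eta_t$ is tied to a radius $R_t$ through the smoothness constant $\beta(R_t)$: the radius $R_t$ is chosen so that the iterate $W(t)$ lies inside the corresponding bounded set, and $R_t$ is increased whenever the next iterate $W(t+1)$ would leave it. This step size assumption can be satisfied by a line search, which ensures that the gradient descent update is not too aggressive and the boundary is increased properly.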

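With the additional smoothness and step size assumptions, exactly the same theorems can be proved for gradient descent. We restate them briefly here. Under the smoothness and step size assumptions of this section and the loss and initialization assumptions of Section 1.2, gradient descent satisfies

• $\lim_{t \to \infty} R(W(t)) = 0$.

• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k(t)\|_F = \infty$.

• $\lim_{t \to \infty} \left| \left\langle w_{\mathrm{prod}}(t) / \prod_{k=1}^{L} \|W_k(t)\|_F, \, v_1(t) \right\rangle \right| = 1$, where $v_1(t)$ is the first right singular vector of $W_1(t)$.

Under the step size assumption of this section, the support vector assumption of Section 2.2, and the initialization assumption of Section 1.2, for the logistic loss $\ell_{\log}$ and almost all data, $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$, and $\lim_{t \to \infty} w_{\mathrm{prod}} / \|w_{\mathrm{prod}}\| = \bar u$.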

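Proofs of the two gradient descent theorems above are given in Appendix C, and are basically the same as the gradient flow proofs. The key difference is that an error of order $\sum_t \eta_t^2 \|\nabla R(W(t))\|^2$ will be introduced in many parts of the proof. However, it is bounded in light of eq. 4:

$$\sum_{t=0}^{\infty} \eta_t^2 \|\nabla R(W(t))\|^2 \le \sum_{t=0}^{\infty} \eta_t \|\nabla R(W(t))\|^2 \le 2 R(W(0)).$$

Here the first inequality uses $\eta_t \le 1$, and the second follows by summing eq. 4 over $t$ and using that the risk is nonnegative. Since all weight matrices go to infinity, such a bounded error does not matter asymptotically, and thus the proofs still go through.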
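The bounded-error argument can be illustrated with a line search. The sketch below is not the paper's step size schedule: it uses a generic backtracking rule (an assumption of this sketch) that caps the step at $1$ and accepts a step only when the decrease of eq. 4 holds, which is enough to make the risk nonincreasing and to reproduce the summation bound by telescoping. Data, widths, and helper names are hypothetical.

```python
import numpy as np

def loss(x):
    return np.logaddexp(0.0, -x)              # logistic loss ln(1 + e^{-x})

def loss_grad(x):
    return -np.exp(-np.logaddexp(0.0, x))     # its derivative

def chain(mats, size):
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def risk(Ws, Z):
    return np.mean(loss(Z @ chain(Ws, Z.shape[1]).ravel()))

def layer_grads(Ws, Z):
    d = Z.shape[1]
    g = (loss_grad(Z @ chain(Ws, d).ravel())[:, None] * Z).mean(axis=0)
    return [chain(Ws[k + 1:], W.shape[0]).T @ g[None, :] @ chain(Ws[:k], d).T
            for k, W in enumerate(Ws)]

rng = np.random.default_rng(5)
n, dims = 10, [5, 4, 3, 1]
X = rng.normal(size=(n, dims[0]))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
Z = np.sign(X @ np.ones(dims[0]))[:, None] * X     # separable by construction
Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]

r0 = risk(Ws, Z)
risks, etas, sq_norms = [r0], [], []
eta = 1.0
for t in range(300):
    grads = layer_grads(Ws, Z)
    sq = sum(np.sum(G ** 2) for G in grads)        # ||grad R(W(t))||^2
    eta = min(1.0, 2.0 * eta)                      # try a larger step, capped at 1
    while True:
        cand = [W - eta * G for W, G in zip(Ws, grads)]
        r_cand = risk(cand, Z)
        if r_cand <= risks[-1] - 0.5 * eta * sq:   # the decrease required by eq. 4
            break
        eta *= 0.5                                 # otherwise backtrack
    Ws = cand
    risks.append(r_cand)
    etas.append(eta)
    sq_norms.append(sq)

weighted = sum(e * s for e, s in zip(etas, sq_norms))
weighted_sq = sum(e * e * s for e, s in zip(etas, sq_norms))
print(risks[0], risks[-1], weighted_sq, weighted, 2 * r0)
```

Any rule that enforces the same sufficient decrease would give the same two inequalities; the doubling and halving factors are arbitrary.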

## 4 Summary and future directions

This paper rigorously proves that, for deep linear networks on linearly separable data, gradient flow and gradient descent minimize the risk to $0$, align adjacent weight matrices, and align the first right singular vector of the first layer to the maximum margin solution determined by the data. There are many potential future directions; a few are as follows.

#### Convergence rate.

This paper only proves asymptotic convergence with no convergence rate. A convergence rate would allow the algorithm to be compared to other methods which also globally optimize this objective, would also suggest ways to improve step sizes and initialization, and ideally even exhibit a sensitivity to the network architecture and suggest how it could be improved.

#### Nonseparable data and nonlinear networks.

Real-world data is generally not linearly separable, but nonlinear deep networks can reliably decrease the risk to 0, even with random labels (Zhang et al., 2017). This seems to suggest that a nonlinear notion of separability is at play; is there some way to adapt the present analysis?

The present analysis is crucially tied to the alignment of weight matrices: alignment and risk are analyzed simultaneously. Motivated by this, consider a preliminary experiment, presented in Figure 2, where stochastic gradient descent was used to minimize the risk of a standard AlexNet on CIFAR-10 (Krizhevsky et al., 2012; Krizhevsky and Hinton, 2009).

Even though there are ReLUs, max-pooling layers, and convolutional layers, the alignment phenomenon is occurring in a reduced form on the dense layers (the last three layers of the network). Specifically, despite the sizes of these weight matrices, the key alignment ratios $\|W_i\|_2 / \|W_i\|_F$ are much larger than their respective lower bounds. Two initializations were tried: default PyTorch initialization, and a Gaussian initialization forcing all initial Frobenius norms to a common small value, which is suggested by the norm preservation property in the analysis and removes noise in the weights.
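For reference, the alignment ratio of a single matrix can be computed as below. For any nonzero $m \times n$ matrix the ratio $\|W\|_2 / \|W\|_F$ lies between $1/\sqrt{\min(m, n)}$ and $1$, with $1$ attained exactly by rank-1 matrices; this general fact is presumably the kind of lower bound meant above. The shape and the function name `alignment_ratio` are hypothetical and unrelated to the AlexNet layers.

```python
import numpy as np

def alignment_ratio(W):
    """Spectral norm divided by Frobenius norm, computed from singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / np.linalg.norm(s)

rng = np.random.default_rng(6)
m, n = 30, 20                                      # hypothetical shape
W_random = rng.normal(size=(m, n))
W_rank1 = np.outer(rng.normal(size=m), rng.normal(size=n))

lower_bound = 1.0 / np.sqrt(min(m, n))
ratio_random = alignment_ratio(W_random)
ratio_rank1 = alignment_ratio(W_rank1)
print(lower_bound, ratio_random, ratio_rank1)
```

A ratio close to $1$ therefore indicates that a weight matrix is dominated by its rank-1 approximation, which is the reduced form of alignment discussed here.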

### Acknowledgements

The authors are grateful for support from the NSF under grant IIS-1750051. This grant allowed them to focus on research, and when combined with a gracious NVIDIA GPU grant, led to the creation of their beloved GPU machine DutchCrunch.

## Appendix A Regarding Section 1.2

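Suppose $W_1(0) = 0$ while $W_L(0) \cdots W_2(0) \ne 0$. First of all, $w_{\mathrm{prod}}(0) = 0$ and thus $R(W(0)) = R(0)$. Moreover,

$$\left\langle \nabla R\left(w_{\mathrm{prod}}(0)\right), \bar u \right\rangle = \frac{1}{n} \sum_{i=1}^{n} \ell'(0) \langle z_i, \bar u \rangle \le \ell'(0) \gamma < 0,$$

which implies $\nabla R(w_{\mathrm{prod}}(0)) \ne 0$ and $\partial R / \partial W_1 \ne 0$.

On the other hand, if $R(W(0)) > R(0)$, gradient flow/descent may never minimize the risk to $0$. For example, suppose the network has two layers, and both the input and output have dimension $1$; the network just computes the dot product of two vectors $w_1$ and $w_2$. Consider minimizing $R(w_1, w_2) = \ell(\langle w_1, w_2 \rangle)$. If $w_1(0) = -w_2(0) \ne 0$, then $R(w_1(0), w_2(0)) > R(0)$. It is easy to verify that for any $t$, $w_1(t) = -w_2(t)$, and $R(w_1(t), w_2(t)) \ge R(0)$.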
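The two-layer counterexample is small enough to simulate directly. The sketch below runs gradient descent on $\ell(\langle w_1, w_2 \rangle)$ with the logistic loss from an antisymmetric start; the hidden width $3$, the starting vector, and the step size are arbitrary choices for illustration.

```python
import numpy as np

def loss(x):
    return np.logaddexp(0.0, -x)              # logistic loss ln(1 + e^{-x})

def loss_grad(x):
    return -np.exp(-np.logaddexp(0.0, x))     # its derivative -1 / (1 + e^{x})

w1 = np.array([1.0, -0.5, 2.0])               # arbitrary nonzero start
w2 = -w1.copy()                               # antisymmetric initialization
start_norm = np.linalg.norm(w1)

eta = 0.1
risks = []
for _ in range(200):
    s = w1 @ w2                               # network output on the single point z = 1
    risks.append(loss(s))
    g = loss_grad(s)
    # Simultaneous update: dR/dw1 = l'(s) w2 and dR/dw2 = l'(s) w1.
    w1, w2 = w1 - eta * g * w2, w2 - eta * g * w1

print(risks[0], risks[-1], np.log(2.0), np.linalg.norm(w1))
```

Both vectors shrink toward $0$ while staying antisymmetric, so the risk decreases toward $\ell(0) = \ln 2$ from above and never goes below it.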

## Appendix B Omitted proofs from Section 2

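###### Proof of the first lemma in Section 2.1 (finite time in $B(R)$).

Fix an arbitrary $R > 0$. If the claim is not true, then for any $\epsilon > 0$, there exists some $t \ge 1$ such that $\|W_k\|_F \le R$ for all $k$ while $\|\partial R / \partial W_1\|_F \le \epsilon$, which means

$$\left\| \frac{\partial R}{\partial W_1} \right\|_F^2 = \left\| W_2^\top \cdots W_L^\top \nabla R(w_{\mathrm{prod}})^\top \right\|_F^2 = \|W_L \cdots W_2\|^2 \, \|\nabla R(w_{\mathrm{prod}})\|^2 \le \epsilon^2.$$

Since $\ell' < 0$, $\langle z_i, \bar u \rangle \ge \gamma$, and $|\langle w_{\mathrm{prod}}, z_i \rangle| \le \|w_{\mathrm{prod}}\| \le R^L$, we have

$$\left\langle \nabla R(w_{\mathrm{prod}}), \bar u \right\rangle = \frac{1}{n} \sum_{i=1}^{n} \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \langle z_i, \bar u \rangle \le \frac{1}{n} \sum_{i=1}^{n} \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \gamma \le -M\gamma,$$

where $-M := \max_{-R^L \le x \le R^L} \ell'(x)$. Since $\ell'$ is continuous and the domain is bounded, the maximum is attained and negative, and thus $M > 0$. Therefore $\|\nabla R(w_{\mathrm{prod}})\| \ge M\gamma$, and thus $\|W_L \cdots W_2\| \le \epsilon / (M\gamma)$. Since $\|W_1\|_F \le R$, we further have $\|w_{\mathrm{prod}}\| \le \|W_L \cdots W_2\| \, \|W_1\|_F \le \epsilon R / (M\gamma)$. In other words, after $t = 1$, $\|w_{\mathrm{prod}}\|$ may be arbitrarily small, which implies $R(W)$ can be arbitrarily close to $R(0)$.

On the other hand, by the initialization assumption of Section 1.2, $\mathrm{d} R(W) / \mathrm{d} t < 0$ at $t = 0$. This implies that $R(W(1)) < R(W(0)) \le R(0)$, and for any $t \ge 1$, $R(W(t)) \le R(W(1)) < R(0)$, which is a contradiction.

Since the risk is always positive, integrating eq. 1 gives

$$\begin{aligned} R(W(0)) &\ge \int_{t=0}^{\infty} \sum_{k=1}^{L} \left\| \frac{\partial R}{\partial W_k} \right\|_F^2 \mathrm{d} t \ge \int_{t=0}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 \mathrm{d} t \\ &\ge \int_{t=0}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] \mathrm{d} t \\ &\ge \int_{t=1}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] \mathrm{d} t \\ &\ge \epsilon(R)^2 \int_{t=1}^{\infty} \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] \mathrm{d} t, \end{aligned}$$

which implies gradient flow only spends a finite amount of time in $B(R)$. This directly implies that $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded. ∎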

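###### Proof of the alignment lemma in Section 2.1.

The first claim is true for $k = L$ since $W_L$ is a row vector. For any $1 \le k < L$, recall the following relation (Arora et al., 2018; Du et al., 2018):

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{5}$$

Let $A_{k,k+1} := W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0)$. By eq. 5 and the definition of singular vectors and singular values, we have

$$\begin{aligned} \sigma_k^2 &\ge v_{k+1}^\top W_k W_k^\top v_{k+1} \\ &= v_{k+1}^\top W_{k+1}^\top W_{k+1} v_{k+1} + v_{k+1}^\top A_{k,k+1} v_{k+1} \\ &= \sigma_{k+1}^2 + v_{k+1}^\top A_{k,k+1} v_{k+1} \\ &\ge \sigma_{k+1}^2 - \|A_{k,k+1}\|_2. \end{aligned} \tag{6}$$

Moreover, by taking the trace on both sides of eq. 5, we have

$$\begin{aligned} \|W_k\|_F^2 = \mathrm{tr}\left(W_k W_k^\top\right) &= \mathrm{tr}\left(W_{k+1}^\top W_{k+1}\right) + \mathrm{tr}\left(W_k(0) W_k^\top(0)\right) - \mathrm{tr}\left(W_{k+1}^\top(0) W_{k+1}(0)\right) \\ &= \|W_{k+1}\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2. \end{aligned} \tag{7}$$

Subtracting eq. 6 from eq. 7 and summing over the layers $k, \ldots, L-1$ (using $\|W_L\|_F = \|W_L\|_2$, since $W_L$ is a row vector), we get

$$\|W_k\|_F^2 - \|W_k\|_2^2 \le \|W_k(0)\|_F^2 - \|W_L(0)\|_F^2 + \sum_{k'=k}^{L-1} \|A_{k',k'+1}\|_2 \le D. \tag{8}$$

Next we prove that singular vectors get aligned. Consider $1 \le k < L$. On one hand, similarly to eq. 6,

$$\begin{aligned} u_k^\top W_{k+1}^\top W_{k+1} u_k &= u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k + u_k^\top W_{k+1}^\top(0) W_{k+1}(0) u_k \\ &\ge u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k \\ &\ge \sigma_k^2 - \|W_k(0)\|_2^2. \end{aligned} \tag{9}$$

On the other hand, it follows from the definition of singular vectors and eq. 8 that

$$\begin{aligned} u_k^\top W_{k+1}^\top W_{k+1} u_k &= \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + u_k^\top \left( W_{k+1}^\top W_{k+1} - v_{k+1} \sigma_{k+1}^2 v_{k+1}^\top \right) u_k \\ &\le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + \|W_{k+1}\|_F^2 - \|W_{k+1}\|_2^2 \\ &\le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D. \end{aligned} \tag{10}$$

Combining eq. 9 and eq. 10, we get

$$\sigma_k^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + \|W_k(0)\|_2^2. \tag{11}$$

Similar to eq. 9, we can get

$$\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2.$$

Therefore

$$\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}. \tag{12}$$

Combining eq. 11 and eq. 12, we finally get

$$\langle u_k, v_{k+1} \rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}.$$

Regarding the last claim, first recall that since the difference between the squares of Frobenius norms of any two layers is a constant, $\max_{1 \le k \le L} \|W_k\|_F \to \infty$ implies $\|W_k\|_F \to \infty$ for any $k$. We further have the following.

• Since $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$, for any $k$, $\|W_k\|_2 \to \infty$ and $\left\| W_k / \|W_k\|_F - u_k v_k^\top \right\|_F \to 0$.

• Since $\langle u_k, v_{k+1} \rangle^2 \ge 1 - \left( D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2 \right) / \sigma_{k+1}^2$, $|\langle u_k, v_{k+1} \rangle| \to 1$.

As a result, with the products taken in the order $W_L \cdots W_1$,

$$\left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^{L} \|W_k\|_F}, v_1 \right\rangle \right| = \left| \left\langle \prod_{k=1}^{L} \frac{W_k}{\|W_k\|_F}, v_1 \right\rangle \right| \to \left| \left\langle \prod_{k=1}^{L} u_k v_k^\top, v_1 \right\rangle \right| \to 1.$$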

###### Proof of the main theorem in Section 2.1.

Suppose for some , for any . Then there exists some such that , and thus . On the other hand, since , , and thus . Let