# Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

The behavior of the gradient descent (GD) algorithm is analyzed for a deep neural network model with skip-connections. It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast. Generalization error estimates along the GD path are also established. As a consequence, it is shown that when the target function is in the reproducing kernel Hilbert space (RKHS) with a kernel defined by the initialization, there exist generalizable early-stopping solutions along the GD path. In addition, it is also shown that the GD path is uniformly close to the functions given by the related random feature model. Consequently, in this "implicit regularization" setting, the deep neural network model deteriorates to a random feature model. Our results hold for neural networks of any width larger than the input dimension.


## 1 Introduction

This paper is concerned with the following questions on the gradient descent (GD) algorithm for deep neural network models:

1. Under what conditions can the algorithm find a global minimum of the empirical risk?

2. Under what conditions can the algorithm find models that generalize, without using any explicit regularization?

These questions are addressed for a specific deep neural network model with skip-connections. For the first question, it is shown that with proper initialization, the gradient descent algorithm converges to a global minimum exponentially fast, as long as the network is deep enough. For the second question, it is shown that if in addition the target function belongs to a certain reproducing kernel Hilbert space (RKHS) with kernel defined by the initialization, then the gradient descent algorithm does find models that can generalize. This result is obtained as a consequence of the estimates on the generalization error along the GD path. However, it is also shown that the GD path is uniformly close to functions generated by the GD path for the related random feature model. Therefore in this particular setting, as far as "implicit regularization" is concerned, this deep neural network model is no better than the random feature model.

In recent years there has been a great deal of interest in the two questions raised above [10, 13, 12, 4, 2, 1, 32, 7, 17, 19, 29, 26, 9, 28, 5, 3, 11, 31, 21]. An important recent advance is the realization that over-parametrization can simplify the analysis of GD dynamics in two ways. The first is that in the over-parametrized regime, the parameters do not have to change much in order to make a significant change to the function that they represent [10, 19]. This gives rise to the possibility that only a local analysis in a neighborhood of the initialization is needed in order to analyze the GD algorithm. The second is that over-parametrization can improve the non-degeneracy of the associated Gram matrix [30], thereby ensuring exponential convergence of the GD algorithm [13].

Using these ideas, [2, 32, 12, 13] proved that (stochastic) gradient descent converges to a global minimum of the empirical risk at an exponential rate.

[17] showed that in the infinite-width limit, the GD dynamics for deep fully connected neural networks with Xavier initialization can be characterized by a fixed neural tangent kernel. [10, 19] considered the online learning setting and proved that stochastic gradient descent can drive the population error below any prescribed accuracy, given sufficiently many samples. [7] proved that GD can find generalizable solutions when the target function comes from a certain RKHS. These results all share one thing in common: they require the network width to grow polynomially with the network depth and the training set size; in fact, [10, 7] required even wider networks. In other words, these results are concerned with very wide networks. In contrast, in this paper we will focus on deep networks of fixed width, assumed only to be larger than the input dimension d.

### 1.1 The motivation

Our work is motivated strongly by the results of the companion paper [16], in which similar questions were addressed for the two-layer neural network model. It was proved in [16] that in the so-called "implicit regularization" setting, the GD dynamics for the two-layer neural network model is closely approximated by the GD dynamics for a random feature model with the features defined by the initialization. For over-parametrized models, this statement is valid uniformly for all time. In the general case, this statement is valid at least for finite time intervals during which early stopping leads to generalizable models for target functions in the relevant reproducing kernel Hilbert space (RKHS). The numerical results reported in [16] nicely corroborated these theoretical findings.

To understand what happens for deep neural network models, we first turn to the ResNet model:

 h(l+1)=h(l)+U(l)σ(V(l)h(l)),l=0,1,⋯,L−1 (1.1)
 h(0)=(xT,1)T∈Rd+1,fL(x,θ)=wTh(L)

where U(l) ∈ R(d+1)×m and V(l) ∈ Rm×(d+1) are the weight matrices, and θ denotes all the parameters to be trained.

A main observation exploited in [16] is the time scale separation between the GD dynamics for the coefficients inside and outside the activation function, i.e. the V(l)'s and the U(l)'s. In a typical practical setting, one would initialize the U(l)'s to be small and the V(l)'s to be O(1). This results in a slower dynamics for the V(l)'s, compared with the dynamics for the U(l)'s, due to the presence of an extra factor of U(l) in the dynamical equation for V(l). In the case of two-layer networks, this separation of time scales resulted in the fact that the parameters inside the activation function were effectively frozen during the time period of interest. Therefore the GD path stays close to the GD path for the random feature model with the features given by the initialization.

To see whether similar things happen for the ResNet model, we consider the following “compositional random feature model” in which (1.1) is replaced by

 h(l+1)=h(l)+U(l)σ(V(l)(0)h(l)),l=0,1,⋯,L−1 (1.2)

Note that in (1.2) the V(l)'s are fixed at their initial values V(l)(0); the only parameters to be updated by the GD dynamics are the U(l)'s.
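The two forward maps being compared can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's experimental code: the dimensions, the choice of tanh as the activation, and all variable names are our own assumptions.

```python
import numpy as np

def resnet_forward(x, Us, Vs, w):
    """ResNet model (1.1): h^{l+1} = h^l + U^l sigma(V^l h^l), sigma = tanh here."""
    h = np.concatenate([x, [1.0]])        # h^0 = (x^T, 1)^T in R^{d+1}
    for U, V in zip(Us, Vs):
        h = h + U @ np.tanh(V @ h)        # one residual block
    return w @ h                          # f_L(x; theta) = w^T h^L

def compositional_rf_forward(x, Us, Vs0, w):
    """Compositional random feature model (1.2): same architecture, but the
    V^l's stay frozen at their initial values V^l(0); only the U^l's train."""
    return resnet_forward(x, Us, Vs0, w)

# tiny illustrative instance
d, m, L = 2, 4, 3
x = np.array([0.3, -0.5])
w = np.ones(d + 1)
Us = [np.zeros((d + 1, m)) for _ in range(L)]   # U^l = 0: network acts as identity
Vs = [np.full((m, d + 1), 0.1) for _ in range(L)]
out = resnet_forward(x, Us, Vs, w)              # = w^T h^0 since all U^l = 0
```

With all U(l) = 0 both models reduce to the linear readout of h(0), which makes the role of the U(l)'s as the "outer" trainable coefficients explicit.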

##### Numerical experiments

Here we provide numerical evidence for the above intuition by considering a very simple target function. We initialize (1.1) and (1.2) in the same way. Since we are interested in the effect of depth, we vary the depth L of the networks. Please refer to Appendix A for more details.

Figure 1 compares the GD dynamics for the ResNet and the related "compositional random feature model". We see a clear indication that (1) the GD algorithm converges to a global minimum of the empirical risk for the deep residual network, and (2) for deep neural networks, the GD dynamics for the two models stay close to each other.

Figure 2 shows the testing error for the optimal (convergent) solution shown in Figure 1 as the depth of the ResNet changes. We see that the testing error seems to settle at a finite value as the network depth is increased. As a comparison, we also show the testing error for the minimizers of the regularized model proposed in [14] (see (A.2)). One can see that for this particular target function, the testing error for the minimizers of the regularized model is consistently very small as one varies the depth of the network.

These results are similar to the ones shown in [16] for two-layer neural networks. They suggest that for ResNets, the GD algorithm is able to find a global minimum of the empirical risk, but in terms of the generalization property, the resulting model may be no better than the compositional random feature model.

On the theoretical side, we have not yet succeeded in dealing directly with the ResNet model. Therefore in this paper we will deal instead with a modified model which shares a lot of common features with the ResNet model but simplifies the task of analyzing error propagation between the layers. We believe that the insight we have gained in this analysis is helpful for understanding general deep network models.

### 1.2 Our contribution

In this paper, we analyze the gradient descent algorithm for a particular class of deep neural networks with skip-connections. We consider the least squares loss and assume that the nonlinear activation function is Lipschitz continuous (e.g. Tanh, ReLU).

• We prove that if the depth L is sufficiently large, then gradient descent converges to a global minimum with zero training error at an exponential rate. This result is proved assuming only that the network width is larger than the input dimension. As noted above, the previous optimization results [12, 2, 32] require the width to grow polynomially with the training set size and the depth.

• We provide a general estimate for the generalization error along the GD path, assuming that the target function is in an RKHS with the kernel defined by the initialization. As a consequence, we show that the population risk is bounded from above by O(γ2(f∗)/√n) if certain early stopping rules are used. In contrast, the generalization result in [7] requires a much wider network.

• We prove that the GD path is uniformly close to the functions given by the related random feature model (see Theorem 6.6). Consequently the generalization property of the resulting model is no better than that of the random feature model. This allows us to conclude that in this “implicit regularization” setting, the deep neural network model deteriorates to a random feature model. In contrast, it has been established in [15, 14] that for suitable explicitly regularized models, optimal generalization error estimates (e.g. rates comparable to the Monte Carlo rate) can be proved for a much larger class of target functions.

These results are very much analogous to the ones proved in [16] for two-layer neural networks.

One main technical ingredient of this work is the use of a combination of the identity mapping and skip-connections to stabilize the forward and backward propagation in the neural network. This enables us to consider deep neural networks of fixed width. The second main ingredient is the exploitation of a possible time scale separation between the GD dynamics for the parameters inside and outside the activation function: the parameters inside the activation function are effectively frozen during the GD dynamics, compared with the parameters outside the activation function.

## 2 Preliminaries

Throughout this paper, we use ∥⋅∥ and ∥⋅∥F to denote the ℓ2 and Frobenius norms, respectively. For a matrix A, we use Ai,:, A:,j and Ai,j to denote its i-th row, j-th column and (i,j)-th entry, respectively. We let Sd−1 = {x ∈ Rd : ∥x∥ = 1} and use π0 to indicate the uniform distribution over Sd−1. We use a ≲ b as a shorthand notation for a ≤ Cb, where C is some absolute constant; a ≳ b is similarly defined.

### 2.1 Problem setup

We consider the regression problem with training data set {(xi,yi)}ni=1, where the xi's are i.i.d. samples drawn from a fixed (but unknown) distribution ρ. For simplicity, we assume ∥x∥ ≤ 1 and |y| ≤ 1. We use f(x;Θ) to denote the model with parameter Θ. We are interested in minimizing the empirical risk, defined by

 ^Rn(Θ)=12nn∑i=1(f(xi;Θ)−yi)2. (2.1)

We let ei = f(xi;Θ) − yi and e = (e1,⋯,en)T; then ^Rn(Θ) = ∥e∥2/(2n).

For the generalization problem, we need to specify how the yi's are obtained. Let f∗ be our target function; then yi = f∗(xi). We will assume that there are no measurement noises. This makes the argument more transparent but does not change things qualitatively: essentially the same argument applies to the case with measurement noise.

Our goal is to estimate the population risk, defined by

 R(Θ)=12Eρ[(f(x;Θ)−f∗(x))2]

#### Deep neural networks with skip-connections

We will consider a special deep neural network model with multiple skip-connections, defined by

 h(1) =(x,0)T∈Rd+1 (2.2) h(l+1) =(h(1)1:d, h(l)d+1)T+U(l)σ(V(l)h(l)),l=1,⋯,L−1 f(x;Θ) =wTh(L).

Here U(l) ∈ R(d+1)×m and V(l) ∈ Rm×(d+1). Note that L and m are the depth and width of the network, respectively. σ is a scalar nonlinear activation function, which is assumed to be 1-Lipschitz continuous with σ(0) = 0. For any vector v, σ(v) is defined entry-wise. For simplicity, we fix w to be (0,⋯,0,1)T. Thus the parameters that need to be estimated are Θ = {(U(l),V(l))}l. We also define g(l) = σ(V(l)h(l)), the output of the l-th nonlinear hidden layer.

This network model has the following feature: the first d entries of h(l+1) are directly connected to the input layer by a long-distance skip-connection; only the last entry is connected to the previous layer. As will be seen later, the long-distance skip-connections help to stabilize the deep network. We further let:

 U(l)=(C(l)(a(l))T),V(l)=(B(l)r(l)),h(l)=(z(l)y(l)),

where C(l) ∈ Rd×m, a(l) ∈ Rm, B(l) ∈ Rm×d, r(l) ∈ Rm, z(l) ∈ Rd and y(l) ∈ R. With these notations, we can re-write the model as

 z(1) =x,y(1)=0 (2.3) z(l+1) =z(1)+C(l)σ(B(l)z(l)+r(l)y(l)) y(l+1) =y(l)+(a(l))Tσ(B(l)z(l)+r(l)y(l)),l=1,⋯,L−1. f(x;Θ) =y(L).
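The recursion (2.3) can be sketched directly in NumPy. This is a minimal illustration under our own choices (tanh as the 1-Lipschitz activation, small dimensions, our own variable names), not the paper's code:

```python
import numpy as np

def skip_net_forward(x, params, sigma=np.tanh):
    """Forward pass of model (2.3): z^l carries the input via the long-distance
    skip-connection, while the scalar y^l accumulates each layer's output."""
    z, y = x.copy(), 0.0                  # z^1 = x, y^1 = 0
    for B, r, C, a in params:             # layer l = 1, ..., L-1
        g = sigma(B @ z + r * y)          # g^l = sigma(B^l z^l + r^l y^l) in R^m
        z = x + C @ g                     # z^{l+1} = z^1 + C^l g^l
        y = y + a @ g                     # y^{l+1} = y^l + (a^l)^T g^l
    return y                              # f(x; Theta) = y^L

d, m, L = 3, 4, 5
x = np.array([0.2, -0.1, 0.4])
# with C^l = 0, a^l = 0, r^l = 0 (as in the initialization (2.5) below),
# z^l stays equal to x, y^l stays 0, and the output is exactly 0:
params0 = [(np.full((m, d), 0.1), np.zeros(m), np.zeros((d, m)), np.zeros(m))
           for _ in range(L - 1)]
f0 = skip_net_forward(x, params0)
```

Note how the first d coordinates are reset to the input at every layer; this is the stabilizing identity-plus-skip structure exploited in Section 4.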

We will analyze the behavior of the gradient descent algorithm, defined by

 Θt+1=Θt−η∇^Rn(Θt),

where η > 0 is the learning rate. For simplicity, in most cases, we will focus on its continuous version:

 dΘtdt=−∇^Rn(Θt). (2.4)
##### Initialization

We will focus on a special class of initialization:

 C(l)0=0,a(l)0=0,√mrow(B(l)0)∼π0,r(l)0=0, (2.5)

where the third item means that each row of B(l)0 is independently drawn from the uniform distribution over the sphere of radius 1/√m. Thus for this initialization, f(x;Θ0) = 0.

Note that all the results in this paper also hold for slightly larger initializations of C(l)0 and a(l)0. But for simplicity, we will focus on the initialization (2.5).
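The initialization (2.5) can be sampled as follows; normalizing a Gaussian vector gives a uniform point on the sphere, which is then scaled to radius 1/√m. The function name and the tuple layout (B, r, C, a) are our own conventions:

```python
import numpy as np

def init_params(d, m, L, rng):
    """Initialization (2.5): C^l_0 = 0, a^l_0 = 0, r^l_0 = 0, and each row of
    B^l_0 drawn uniformly from the sphere of radius 1/sqrt(m)."""
    params = []
    for _ in range(L - 1):
        B = rng.standard_normal((m, d))
        # normalize each row to unit length, then scale to radius 1/sqrt(m)
        B /= np.linalg.norm(B, axis=1, keepdims=True) * np.sqrt(m)
        params.append((B, np.zeros(m), np.zeros((d, m)), np.zeros(m)))
    return params

params = init_params(d=3, m=5, L=4, rng=np.random.default_rng(0))
B0, r0, C0, a0 = params[0]
```

Since a(l)0 = 0 at every layer, y(l) never moves during the forward pass, so the network output at Θ0 is identically zero, as noted above.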

### 2.2 Assumption on the input data

For the given activation function σ, we can define a symmetric positive definite (SPD) function

 k0(x,x′)=Eω∼π0[σ(ωTx)σ(ωTx′)]. (2.6)

Denote by Hk0 the RKHS induced by k0. For the given training set, the (empirical) kernel matrix K ∈ Rn×n is defined as

 Ki,j=1nk0(xi,xj),i,j=1,⋯,n

We make the following assumption on the training data set.

###### Assumption 2.1.

For the given training data, we assume that the kernel matrix K is positive definite, i.e.

 λndef=λmin(K)>0. (2.7)
###### Remark 1.

Note that λn in general depends on the data set. If we assume that the xi's are independently drawn from π0, it was shown in [6] that with high probability λn is bounded from below in terms of λn(Tk0), the n-th eigenvalue of the Hilbert–Schmidt integral operator Tk0 defined by

 Tk0f(x)=∫Sd−1k0(x,x′)f(x′)dπ0(x′).

Using this result, [30] provided lower bounds for λn based on some geometric discrepancy analysis.

## 3 The main results

Let Θt be the solution of the GD dynamics (2.4) at time t, with the initialization defined in (2.5). We first show that with high probability, the landscape of ^Rn near the initialization has a coercive property which guarantees exponential convergence towards a global minimum.

###### Lemma 3.1.

Assume that there are constants b, c2, c3 > 0 such that 2√(c3J(Θ0))/c2 < b and, for any Θ with ∥Θ−Θ0∥ ≤ b,

 c2J(Θ)≤∥∇J(Θ)∥2≤c3J(Θ). (3.1)

Then for any t ≥ 0, we have

 J(Θt)≤e−c2tJ(Θ0).
###### Proof.

Let t0 = inf{t : ∥Θt−Θ0∥ > b}. Then for t ≤ t0, the condition (3.1) is satisfied. Thus we have

 dJ/dt=−∥∇J∥2≤−c2J.

Consequently, we have, for t ≤ t0,

 J(Θt)≤e−c2tJ(Θ0).

It remains to show that actually t0 = +∞. If t0 < +∞, then we have

 ∥Θt0−Θ0∥ ≤∫t00∥∇J(Θt)∥dt≤√c3J(Θ0)∫∞0e−c2t/2dt≤2√c3J(Θ0)/c2 (i)< b,

where (i) is due to the assumption that 2√(c3J(Θ0))/c2 < b. This contradicts the definition of t0. ∎
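The mechanism of Lemma 3.1 can be checked numerically on a toy objective of our own choosing: for J(θ) = ½∥θ∥² we have ∥∇J∥² = 2J, so the lower bound in (3.1) holds with c2 = 2 and the lemma predicts J(Θt) ≤ e^{−2t} J(Θ0) along the gradient flow (2.4):

```python
import numpy as np

# Toy objective: J(theta) = 0.5 * ||theta||^2, grad J = theta, ||grad J||^2 = 2 J,
# so condition (3.1) holds globally with c2 = c3 = 2.
theta = np.array([3.0, -4.0])
J0 = 0.5 * theta @ theta

eta, steps = 1e-3, 1000                   # Euler discretization of (2.4) up to t = 1
for _ in range(steps):
    theta -= eta * theta                  # dTheta/dt = -grad J(Theta) = -Theta
J1 = 0.5 * theta @ theta

print(J1 <= np.exp(-2.0) * J0)            # True: decay at rate c2 = 2
```

The same PL-type inequality c2·J ≤ ∥∇J∥² is what Theorem 3.2 establishes for ^Rn in a neighborhood of the initialization, with c2 proportional to Lλn.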

Our main result for optimization is as follows.

###### Theorem 3.2 (Optimization).

For any δ ∈ (0, 1), assume that the depth L is sufficiently large (depending on n, λn and δ). Then with probability at least 1 − δ over the initialization Θ0, we have for any t ≥ 0,

 ^Rn(Θt)≤e−Lλnt2^Rn(Θ0). (3.2)

In contrast to other recent results for multi-layer neural networks [13, 2, 32], we do not require the network width to increase with the size of the data set or the depth of the network.

As is the case for two-layer neural networks [16], the fact that the GD dynamics stays in a neighborhood of the initialization suggests that it resembles the situation of a random feature model. Consequently, the generalization error can be controlled if we assume that the target function is in the appropriate RKHS.

###### Assumption 3.3.

Assume that f∗ ∈ Hk0, i.e.

 f∗(x) =Eπ0[a∗(ω)σ(ωTx)] ∥f∗∥2Hk0 =Eπ0[|a∗(ω)|2]<+∞.

In addition, we also assume that a∗ is bounded: supω |a∗(ω)| < ∞.

In the following, we will denote γ(f∗) = supω |a∗(ω)|. Obviously, γ(f∗) ≥ ∥f∗∥Hk0.

###### Theorem 3.4 (Generalization).

Assume that the target function satisfies Assumption 3.3. For any δ ∈ (0, 1), assume that the depth L is sufficiently large. Then with probability at least 1 − δ over the random initialization, the following holds for any t ≥ 0,

 R(Θt)≲1L3/2λ2n+tλ2n+γ2(f∗)Lt+(1+√Lγ(f∗)t√n)2c3(δ)γ2(f∗)√n.

where c3(δ) is a constant depending only on δ.

In addition, by choosing the stopping time appropriately, we obtain the following result:

###### Corollary 3.5 (Early-stopping).

Assume that the conditions of Theorem 3.4 hold, and let the stopping time T be chosen to balance the terms in the bound of Theorem 3.4. Then we have

 R(ΘT)≲c3(δ)γ2(f∗)√n.

## 4 Landscape around the initialization

###### Definition 1.

For any c > 0, we define a neighborhood of the initialization Θ0 by

 Ic(Θ0)def={Θ:maxl∈[L]{∥a(l)−a(l)0∥,∥r(l)−r(l)0∥,∥B(l)−B(l)0∥F,∥C(l)−C(l)0∥F}≤c/L}. (4.1)

Let εc = c/L. We will assume that c ≲ 1. In the following, we first prove that both the forward and backward propagation are stable regardless of the depth. We then show that the norm of the gradient can be bounded from above and below by the loss function, similar to the condition required in Lemma 3.1. This implies that there are no issues with vanishing or exploding gradients.

### 4.1 Forward stability

At Θ = Θ0, it is easy to check that

 y(l)(x;Θ0)=0,z(l)(x;Θ0)=x,g(l)(x;Θ0)=σ(B(l)0x).

For simplicity, when it is clear from the context, we will omit the dependence on x and Θ in the notations.

###### Proposition 4.1.

If Θ ∈ Ic(Θ0), we have for any x with ∥x∥ ≤ 1 and any l ∈ [L] that

 |y(l)(x;Θ)−y(l)(x;Θ0)| ≤4c (4.2) ∥z(l)(x;Θ)−z(l)(x;Θ0)∥ ≤4c/L ∥g(l)(x;Θ)−g(l)(x;Θ0)∥ ≤6c/L.
###### Remark 2.

We see that all the variables are close to their initial values except y(l), which is used to accumulate the prediction from each layer.

###### Proof.

Let o(l) = ∥z(l)(x;Θ)−z(l)(x;Θ0)∥ = ∥z(l)(x;Θ)−x∥. Then by (2.3), we have

 o(l+1) ≤εc((1+εc)(1+o(l))+εc|y(l)|) |y(l+1)| ≤|y(l)|+εc((1+εc)(1+o(l))+εc|y(l)|),

with o(1) = |y(1)| = 0. Adding the two inequalities gives us:

 o(l+1)+|y(l+1)| ≤2εc(1+εc)o(l)+(1+2ε2c)|y(l)|+2εc(1+εc)

Since εc ≤ 1, the above inequality can be simplified as

 o(l+1)+|y(l+1)| ≤(1+2εc²)(o(l)+|y(l)|)+2.25εc.

Iterating and using o(1)+|y(1)| = 0, we get

 o(l)+|y(l)| ≤2.25εcl∑l′=0(1+2εc²)l′≤2.25Lεce2Lεc²≤4c.

Thus we obtain that |y(l)| ≤ 4c for any l ∈ [L]. Plugging this back into the recursive formula for o(l), we get

 o(l+1)≤1.25εco(l)+2.25εc

This gives us

 o(l)≤4c/L∀l∈[L].

Now the deviation of g(l) can be estimated by

 ∥g(l)(x;Θ)−g(l)(x;Θ0)∥ =∥σ(B(l)z(l)(x)+r(l)y(l)(x))−σ(B(l)0x)∥ ≤∥B(l)z(l)(x)+r(l)y(l)(x)−B(l)0x∥

By inserting the previous estimates, we obtain

 ∥g(l)(x;Θ)−g(l)(x;Θ0)∥≤6c/L. ∎

### 4.2 Backward stability

For convenience, we define the gradients with respect to the neurons by

 α(l)(x;Θ)=∇y(l)f(x;Θ)β(l)(x;Θ)=∇z(l)f(x;Θ)γ(l)(x;Θ)=∇g(l)f(x;Θ).

For simplicity, we will omit the explicit reference to x and Θ in these notations when it is clear from the context. Note that f(x;Θ) = y(L), and it is easy to derive the following back-propagation formula using the chain rule,

 γ(l) =a(l)α(l+1)+(C(l))Tβ(l+1) (4.3) β(l) =(B(l))Tγ(l) α(l) =α(l+1)+(r(l))Tγ(l).

At the top layer, we have that for any x and Θ:

 α(L)=1,β(L)=0.

At the initialization Θ0, we have for any l ∈ [L]:

 α(l)(x;Θ0)=1,β(l)(x;Θ0)=0,γ(l)(x;Θ0)=0.
###### Proposition 4.2.

If Θ ∈ Ic(Θ0), we have for any x and l ∈ [L]:

 |α(l)(x;Θ)−1|≤5c²/L,∥β(l)(x;Θ)∥≤4c/L,∥γ(l)(x;Θ)∥≤3c/L (4.4)
###### Proof.

According to (4.3), together with the bounds ∥a(l)∥ ≤ εc, ∥r(l)∥ ≤ εc, ∥C(l)∥F ≤ εc and ∥B(l)∥ ≤ 1+εc that hold on Ic(Θ0), we have

 ∥β(l)∥ ≤εc(1+εc)∥β(l+1)∥+εc(1+εc)α(l+1) (4.5) |α(l)| ≤ε2c∥β(l+1)∥+(1+ε2c)α(l+1). (4.6)

 εc∥β(l)∥+α(l)≤ε2c(2+εc)∥β(l+1)∥+(1+2.25ε2c)α(l+1)≤(1+2.25ε2c)(∥β(l+1)∥+α(l+1)).

Therefore, we have

 α(l)≤εc∥β(l)∥+α(l)≤(1+2.25εc²)L(εc∥β(L)∥+α(L))≤(1+2.25εc²)L≤1+5c²/L.

Inserting the above estimates back to (4.5) gives us

 ∥β(l)∥≤1.25εc∥β(l+1)∥+2.5εc,

from which we obtain that ∥β(l)∥ ≤ 4εc. Using (4.3) again, we get

 ∥γ(l)∥=∥a(l)α(l+1)+(C(l))Tβ(l+1)∥≤3εc.

For the lower bound, using (4.3), we get

 α(l) =α(l+1)+(r(l))Tγ(l)≥α(l+1)−3εc² ≥α(L)−3Lεc²≥1−3c²/L. ∎

We are now ready to bound the gradients. First note that we have

 ∇a(l)f(x) =α(l)(x)g(l)(x) ∇B(l)f(x) =γ(l)(x)(z(l)(x))T ∇C(l)f(x) =β(l)(x)(g(l)(x))T ∇r(l)f(x) =γ(l)(x)y(l)(x),

where we have omitted the dependence on Θ. Using the stability results, we can bound the gradients by the empirical loss.

###### Lemma 4.3 (Upper bound).

If Θ ∈ Ic(Θ0), then for any l ∈ [L] we have

 max{∥∇a(l)^Rn∥2,∥∇r(l)^Rn∥2} ≤(1+50c2L)^Rn (4.7) max{∥∇B(l)^Rn∥2,∥∇C(l)^Rn∥2} ≤20c2L2^Rn
###### Proof.

Using Propositions 4.1 and 4.2, we have

 ∥∇a(l)^Rn∥2 =∥1nn∑i=0e(xi,yi)α(l)(xi)g(l)(xi)∥2 ≤^Rn(Θ)1nn∑i=1∥α(l)(xi)g(l)(xi)∥2 ≤^Rn(Θ)nn∑i=0(1+5cL)2(1+6c2L)2 ≤(1+50c2L)^Rn(Θ)

Analogously, we have

 (A). ∥∇B(l)^Rn∥2F=∥1nn∑i=0e(xi,yi)γ(l)(xi)(z(l)(xi))T∥2F ≤^Rn(Θ)1nn∑i=1(3cL)2(1+4cL)2≲15c2L2^Rn(Θ); (B). ∥∇C(l)^Rn∥2F=∥1nn∑i=0eiβ(l)(xi)g(l)k(xi)∥F ≤^Rn(Θ)1nn∑i=0(4cL)2(1+6c2L)2≤20c2L2^Rn(Θ); (C). ∥∇r(l)^Rn∥2=∥1nn∑i=0eiy(l)(xi)γ(l)(xi)∥2 ≤^Rn(Θ)1nn∑i=0(4c)2(3cL)2≤12^Rn(Θ).

We now turn to the lower bound. The technique used is similar to the case of two-layer neural networks [13]. Define the Gram matrix H(Θ) ∈ Rn×n with

 Hi,j(Θ)=1nLL∑l=1⟨∇a(l)f(xi),∇a(l)f(xj)⟩. (4.8)

At the initialization, we have

 Hi,j(Θ0)=1nLL∑l=1⟨σ(B(l)0xi),σ(B(l)0xj)⟩.

This matrix can be viewed as an empirical approximation of the kernel matrix K defined in Section 2.2, since each row of B(l)0 is independently drawn from the uniform distribution over the sphere of radius 1/√m. Using standard concentration inequalities, we can prove that with high probability, the smallest eigenvalue of the Gram matrix is bounded from below by the smallest eigenvalue of the kernel matrix. This is stated in the following lemma, whose proof is deferred to Appendix B.

###### Lemma 4.4.

For any δ ∈ (0, 1), assume that L is sufficiently large (depending on n, λn and δ). Then with probability at least 1 − δ over the random initialization:

 λmin(H(Θ0))≥3λn4. (4.9)

Moreover, we can show that for any Θ ∈ Ic(Θ0), the Gram matrix is still strictly positive definite as long as L is large enough.

###### Lemma 4.5.

For any δ ∈ (0, 1), assume that L is sufficiently large. With probability 1 − δ over the random initialization, we have for any Θ ∈ Ic(Θ0),

 λmin(H(Θ))≥λn2. (4.10)
###### Proof.
 Hi,j(Θ)−Hi,j(