This paper is concerned with the following questions on the gradient descent (GD) algorithm for deep neural network models:
Under what conditions can the algorithm find a global minimum of the empirical risk?
Under what conditions can the algorithm find models that generalize, without using any explicit regularization?
These questions are addressed for a specific deep neural network model with skip connections. For the first question, it is shown that with proper initialization, the gradient descent algorithm converges to a global minimum exponentially fast, as long as the network is deep enough. For the second question, it is shown that if in addition the target function belongs to a certain reproducing kernel Hilbert space (RKHS) with kernel defined by the initialization, then the gradient descent algorithm does find models that can generalize. This result is obtained as a consequence of the estimates on the generalization error along the GD path. However, it is also shown that the GD path is uniformly close to functions generated by the GD path for the related random feature model. Therefore in this particular setting, as far as “implicit regularization” is concerned, this deep neural network model is no better than the random feature model.
In recent years there has been a great deal of interest in the two questions raised above [10, 13, 12, 4, 2, 1, 32, 7, 17, 19, 29, 26, 9, 28, 5, 3, 11, 31, 21]. An important recent advance is the realization that over-parametrization can simplify the analysis of GD dynamics in two ways. The first is that in the over-parametrized regime, the parameters do not have to change much in order to make a change to the function that they represent [10, 19]. This gives rise to the possibility that only a local analysis in the neighborhood of the initialization is necessary in order to analyze the GD algorithm. The second is that over-parametrization can improve the non-degeneracy of the associated Gram matrix, thereby ensuring exponential convergence of the GD algorithm.
proved that (stochastic) gradient descent converges to a global minimum of the empirical risk at an exponential rate. showed that in the infinite-width limit, the GD dynamics for deep fully connected neural networks with Xavier initialization can be characterized by a fixed neural tangent kernel. [10, 19] considered the online learning setting and proved that stochastic gradient descent can achieve a population error of using samples. proved that GD can find generalizable solutions when the target function comes from a certain RKHS. These results share one thing in common: they all require that the network width satisfies , where denote the network depth and training set size, respectively. In fact, [10, 7] required that . In other words, these results are concerned with very wide networks. In contrast, in this paper, we will focus on deep networks with fixed width (assumed to be larger than where is the input dimension).
1.1 The motivation
that in the so-called “implicit regularization” setting, the GD dynamics for the two-layer neural network model is closely approximated by the GD dynamics for a random feature model with the features defined by the initialization. For over-parametrized models, this statement is valid uniformly for all time. In the general case, this statement is valid at least for finite time intervals during which early stopping leads to generalizable models for target functions in the relevant reproducing kernel Hilbert space (RKHS). The numerical results reported in nicely corroborated these theoretical findings.
To understand what happens for deep neural network models, we first turn to the ResNet model:
where , and denote all the parameters to be trained.
A main observation exploited in is the time scale separation between the GD dynamics for the coefficients inside and outside the activation function, i.e. the ’s and the ’s. In a typical practical setting, one would initialize the ’s to be and the ’s to be . This results in slower dynamics for the ’s, compared with the dynamics for the ’s, due to the presence of an extra factor of in the dynamical equation for . In the case of two-layer networks, this separation of time scales meant that the parameters inside the activation function were effectively frozen during the time period of interest. Therefore the GD path stays close to the GD path for the random feature model with the features given by the initialization.
To see whether similar things happen for the ResNet model, we consider the following “compositional random feature model” in which (1.1) is replaced by
Note that in (1.2) the ’s are fixed at their initial values; the only parameters to be updated by the GD dynamics are the ’s.
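The idea of freezing the parameters inside the activation function can be illustrated on a much simpler model than (1.2). The following is a minimal sketch, not the compositional model itself: it uses a single-layer random feature model with ReLU features, and the names `inner`, `train_outer`, the target `x[0] ** 2`, the `1/(2n)` scaling of the risk, and the step-size choice are all assumptions made for illustration. Because the inner weights never move, the features are computed once and the empirical risk is a quadratic function of the outer coefficients.

```python
import math
import random

def relu(z):
    return max(z, 0.0)

def unit_vector(d, rng):
    # Uniform direction on the unit sphere (assumed sampling scheme).
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def features(x, inner):
    # Random features relu(b_j . x); the inner weights b_j stay frozen.
    return [relu(sum(bk * xk for bk, xk in zip(b, x))) for b in inner]

def predict(a, phi):
    return sum(aj * pj for aj, pj in zip(a, phi))

def empirical_risk(a, feats, ys):
    # Least-squares risk with an assumed 1/(2n) normalization.
    return sum((predict(a, phi) - y) ** 2 for phi, y in zip(feats, ys)) / (2 * len(ys))

def train_outer(xs, ys, inner, steps=200):
    feats = [features(x, inner) for x in xs]  # computed once, never updated
    n, m = len(ys), len(inner)
    # The trace of the Hessian (1/n) Phi^T Phi bounds its largest eigenvalue,
    # so lr = 1 / trace gives a non-increasing risk along the GD path.
    trace = sum(p * p for phi in feats for p in phi) / n
    lr = 1.0 / trace
    a = [0.0] * m  # outer coefficients start at zero
    risks = [empirical_risk(a, feats, ys)]
    for _ in range(steps):
        resid = [predict(a, phi) - y for phi, y in zip(feats, ys)]
        grad = [sum(r * phi[j] for r, phi in zip(resid, feats)) / n for j in range(m)]
        a = [aj - lr * g for aj, g in zip(a, grad)]
        risks.append(empirical_risk(a, feats, ys))
    return a, risks

rng = random.Random(0)
d, m, n = 3, 20, 10
inner = [unit_vector(d, rng) for _ in range(m)]
xs = [unit_vector(d, rng) for _ in range(n)]
ys = [x[0] ** 2 for x in xs]  # hypothetical target function
a, risks = train_outer(xs, ys, inner)
print(f"risk: {risks[0]:.4f} -> {risks[-1]:.6f}")
```

In this simplified setting GD only ever solves a linear least-squares problem in the outer coefficients, which is what makes the random feature model a convenient reference for the full dynamics.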
Here we provide numerical evidence for the above intuition by considering a very simple target function: where . We initialize (1.1) and (1.2) by . Since we are interested in the effect of depth, we choose . Please refer to Appendix A for more details.
Figure 1 displays the comparison of the GD dynamics for ResNet and the related “compositional random feature model”. We see a clear indication that (1) the GD algorithm converges to a global minimum of the empirical risk for deep residual networks, and (2) for deep neural networks, the GD dynamics for the two models stay close to each other.
Figure 2 shows the testing error for the optimal (convergent) solution shown in Figure 1 as the depth of the ResNet changes. We see that the testing error seems to be settling down on a finite value as the network depth is increased. As a comparison, we also show the testing error for the optimizers of the regularized model proposed in  (see (A.2)). One can see that for this particular target function, the testing error for the minimizers of the regularized model is consistently very small as one varies the depth of the network.
These results are similar to the ones shown in  for two-layer neural networks. They suggest that for ResNets, the GD algorithm is able to find a global minimum of the empirical risk, but in terms of the generalization property, the resulting model may not be better than the compositional random feature model.
On the theoretical side, we have not yet succeeded in dealing directly with the ResNet model. Therefore in this paper we will deal instead with a modified model which shares many common features with the ResNet model but simplifies the task of analyzing error propagation between the layers. We believe that the insight we have gained in this analysis is helpful for understanding general deep network models.
1.2 Our contribution
In this paper, we analyze the gradient descent algorithm for a particular class of deep neural networks with skip-connections. We consider the least square loss and assume that the nonlinear activation function is Lipschitz continuous (e.g. Tanh, ReLU).
We prove that if the depth satisfies , then gradient descent converges to a global minimum with zero training error at an exponential rate. This result is proved by only assuming that the network width is larger than . As noted above, the previous optimization results [12, 2, 32] require that the width satisfies .
We provide a general estimate for the generalization error along the GD path, assuming that the target function is in an RKHS with the kernel defined by the initialization. As a consequence, we show that the population risk is bounded from above by if certain early stopping rules are used. In contrast, the generalization result in  requires that .
We prove that the GD path is uniformly close to the functions given by the related random feature model (see Theorem 6.6). Consequently the generalization property of the resulting model is no better than that of the random feature model. This allows us to conclude that in this “implicit regularization” setting, the deep neural network model deteriorates to a random feature model. In contrast, it has been established in [15, 14] that for suitable explicitly regularized models, optimal generalization error estimates (e.g. rates comparable to the Monte Carlo rate) can be proved for a much larger class of target functions.
These results are very much analogous to the ones proved in  for two-layer neural networks.
One main technical ingredient in this work is to use a combination of the identity mapping and skip-connections to stabilize the forward and backward propagation in the neural network. This enables us to consider deep neural networks with fixed width. The second main ingredient is the exploitation of a possible time scale separation between the GD dynamics for the parameters inside and outside the activation function: the parameters inside the activation function are effectively frozen during the GD dynamics compared with the parameters outside the activation function.
Throughout this paper, we let , and use and to denote the and Frobenius norms, respectively. For a matrix , we use to denote its -th row, -th column and -th entry, respectively. We let and use to indicate the uniform distribution over . We use as a shorthand notation for , where is some absolute constant. is similarly defined.
2.1 Problem setup
We consider the regression problem with training data set given by , where are i.i.d. samples drawn from a fixed (but unknown) distribution . For simplicity, we assume and . We use to denote the model with parameter . We are interested in minimizing the empirical risk, defined by
We let and , then .
For the generalization problem, we need to specify how the ’s are obtained. Let be our target function. Then we have . We will assume that there is no measurement noise. This makes the argument more transparent but does not change things qualitatively: essentially the same argument applies to the case with measurement noise.
Our goal is to estimate the population risk, defined by
Deep neural networks with skip-connections
We will consider a special deep neural network model with multiple skip-connections, defined by
Here . Note that and are the depth and width of the network respectively. is a scalar nonlinear activation function, which is assumed to be 1-Lipschitz continuous and . For any vector we define . For simplicity, we fix to be . Thus the parameters that need to be estimated are: . We also define , the output of the -th nonlinear hidden layer.
This network model has the following feature: the first entries of are directly connected to the input layer by a long-distance skip-connection, and only the last entry is connected to the previous layer. As will be seen later, the long-distance skip-connections help to stabilize the deep network. We further let:
where , and . With these notations, we can re-write the model as
We will analyze the behavior of the gradient descent algorithm, defined by
where is the learning rate. For simplicity, in most cases, we will focus on its continuous version:
We will focus on a special class of initialization:
where the third item means that each row of is independently drawn from the uniform distribution over . Thus for this initialization, .
Note that all the results in this paper also hold for slightly larger initializations, e.g. and . But for simplicity, we will focus on the initialization (2.5).
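The relation between the discrete gradient descent iteration and its continuous version can be illustrated on a one-dimensional quadratic. This is a minimal sketch under assumptions made here: the loss `0.5 * lam * theta**2`, the names `gd_iterates`, `gradient_flow` and `gap`, and the chosen constants are illustrative and do not come from the network model above. For this loss the continuous dynamics has the closed form `theta0 * exp(-lam * t)`, so the discretization error at a fixed time can be measured directly.

```python
import math

def gd_iterates(theta0, lam, lr, steps):
    # Discrete GD on R(theta) = 0.5 * lam * theta**2, whose gradient is lam * theta.
    theta = theta0
    out = [theta]
    for _ in range(steps):
        theta = theta - lr * lam * theta
        out.append(theta)
    return out

def gradient_flow(theta0, lam, t):
    # Exact solution of d(theta)/dt = -lam * theta.
    return theta0 * math.exp(-lam * t)

def gap(lr, lam=2.0, theta0=1.0, horizon=1.0):
    # Distance between GD after horizon / lr steps and the flow at time horizon.
    steps = round(horizon / lr)
    return abs(gd_iterates(theta0, lam, lr, steps)[-1] - gradient_flow(theta0, lam, horizon))

for lr in (0.1, 0.01, 0.001):
    print(f"lr={lr}: gap at t=1 is {gap(lr):.6f}")
```

The gap shrinks as the learning rate decreases, which is the sense in which the continuous dynamics serves as a proxy for the discrete algorithm when the learning rate is small.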
2.2 Assumption on the input data
For the given activation function , we can define a symmetric positive definite (SPD) function
Denote by the RKHS induced by . For the given training set, the (empirical) kernel matrix is defined as
We make the following assumption on the training data set.
For the given training data , we assume that is positive definite, i.e.
Note that , and in general depends on the data set. If we assume that are independently drawn from , it was shown in  that with high probability where is the -th eigenvalue of the Hilbert-Schmidt integral operator defined by
Using this result,  provided lower bounds for based on some geometric discrepancy.
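To make the objects in this subsection concrete, here is a minimal numerical sketch. It rests on assumptions that should be checked against the definitions above: the activation is taken to be ReLU, the kernel is taken to be the expectation of `relu(w . x) * relu(w . x')` with `w` uniform on the unit sphere, and the kernel matrix is taken to carry a `1/n` factor. Under these assumptions the kernel has a closed form, namely the degree-one arc-cosine kernel divided by the dimension, which the sketch compares against a Monte Carlo estimate. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit_vector(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [v / norm for v in g]

def kernel_mc(x, xp, ws):
    # Monte Carlo estimate of E_w[relu(w.x) relu(w.x')] over the samples ws.
    return sum(max(dot(w, x), 0.0) * max(dot(w, xp), 0.0) for w in ws) / len(ws)

def kernel_exact(x, xp):
    # Closed form for ReLU with w uniform on the unit sphere in R^d (assumed setting).
    d = len(x)
    nx, nxp = math.sqrt(dot(x, x)), math.sqrt(dot(xp, xp))
    c = max(-1.0, min(1.0, dot(x, xp) / (nx * nxp)))
    theta = math.acos(c)
    return nx * nxp * (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / (2 * math.pi * d)

def min_eig_2x2(a, b, c):
    # Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]].
    return (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b * b)

rng = random.Random(0)
d = 3
ws = [unit_vector(d, rng) for _ in range(20000)]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two hypothetical training inputs
n = len(xs)

# Kernel matrix with an assumed 1/n scaling.
K = [[kernel_exact(xi, xj) / n for xj in xs] for xi in xs]
lam_min = min_eig_2x2(K[0][0], K[0][1], K[1][1])

print("Monte Carlo vs exact:", kernel_mc(xs[0], xs[1], ws), kernel_exact(xs[0], xs[1]))
print("smallest eigenvalue of K:", lam_min)
```

For two distinct, non-collinear inputs the smallest eigenvalue is strictly positive, and it can never exceed the smallest diagonal entry of the matrix.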
3 The main results
Let be the solution of the GD dynamics (2.4) at time with the initialization defined in (2.5). We first show that with high probability, the landscape of near the initialization has some coercive property which guarantees the exponential convergence towards a global minimum.
Assume that there are constants such that and for
Then for any , we have
Let . Then for , the condition (3.1) is satisfied. Thus we have
Consequently, we have,
It remains to show that actually . If , then we have
where is due to the assumption that . This contradicts the definition of . ∎
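The structure of this argument can be summarized as follows. This is only a sketch under an assumed form of the hypothesis: we suppose that condition (3.1) is a two-sided gradient bound $c_1 \hat{\mathcal{R}}_n \le \|\nabla \hat{\mathcal{R}}_n\|^2 \le c_2 \hat{\mathcal{R}}_n$ on the ball $\|\Theta - \Theta_0\| \le c_3$. The symbols $c_1, c_2, c_3$, $\hat{\mathcal{R}}_n$ and $\Theta_t$ are notation assumed here for illustration. Along the gradient flow, the lower bound gives exponential decay of the risk, and the upper bound controls how far the parameters can travel.

```latex
% Step 1: decay of the risk while the path stays in the ball
\frac{d}{dt}\hat{\mathcal{R}}_n(\Theta_t)
  = -\|\nabla \hat{\mathcal{R}}_n(\Theta_t)\|^2
  \le -c_1 \hat{\mathcal{R}}_n(\Theta_t)
\quad\Longrightarrow\quad
\hat{\mathcal{R}}_n(\Theta_t) \le e^{-c_1 t}\,\hat{\mathcal{R}}_n(\Theta_0).

% Step 2: the distance travelled is bounded uniformly in time
\|\Theta_t - \Theta_0\|
  \le \int_0^t \|\nabla \hat{\mathcal{R}}_n(\Theta_s)\|\,ds
  \le \sqrt{c_2}\int_0^t \sqrt{\hat{\mathcal{R}}_n(\Theta_s)}\,ds
  \le \sqrt{c_2\,\hat{\mathcal{R}}_n(\Theta_0)}\int_0^t e^{-c_1 s/2}\,ds
  \le \frac{2\sqrt{c_2\,\hat{\mathcal{R}}_n(\Theta_0)}}{c_1}.
```

If the right-hand side of the second estimate is smaller than $c_3$, the path cannot reach the boundary of the ball, so the decay estimate in the first step holds for all time.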
Our main result for optimization is as follows.
Theorem 3.2 (Optimization).
For any , assume that . With probability at least over the initialization , we have that for any ,
As is the case for two-layer neural networks , the fact that the GD dynamics stays in a neighborhood of the initialization suggests that it resembles the situation of a random feature model. Consequently, the generalization error can be controlled if we assume that the target function is in the appropriate RKHS.
Assume that , i.e.
In addition, we also assume that .
In the following, we will denote . Obviously, .
Theorem 3.4 (Generalization).
Assume that the target function satisfies Assumption 3.3. For any , assume that . Then with probability at least over the random initialization, the following holds for any ,
In addition, by choosing the stopping time appropriately, we obtain the following result:
Corollary 3.5 (Early-stopping).
Assume that . Let , then we have
4 Landscape around the initialization
For any , we define a neighborhood around the initialization by
Let . We will assume that . In the following, we first prove that both the forward and backward propagation are stable regardless of the depth. We then show that the norm of the gradient can be bounded from above and below by the loss function, similar to the condition required in Lemma 3.1. This implies that there are no issues with vanishing or exploding gradients.
4.1 Forward stability
At , it is easy to check that
For simplicity, when it is clear from the context, we will omit the dependence on and in the notations.
If , we have for any and that
We see that all the variables are close to their initial value except , which is used to accumulate the prediction from each layer.
Let . Then by (2.3), we have
with . Adding the two inequalities gives us:
Since , the above inequality can be simplified as
Thus we obtain that for any , . Plugging it back to the recursive formula for , we get
This gives us
Now the deviation of can be estimated by
By inserting the previous estimates, we obtain
4.2 Backward stability
For convenience, we define the gradients with respect to the neurons by
For simplicity, we will omit the explicit reference to and in these notations when it is clear from the context. Note that , and it is easy to derive the following back-propagation formula using the chain rule,
At the top layer, we have that for any and :
In addition, we have at
If , we have for any and
4.3 Bounding the gradients
We are now ready to bound the gradients. First note that we have
where we have omitted the dependence on . Using the stability results, we can bound the gradients by the empirical loss.
Lemma 4.3 (Upper bound).
If , then for any we have
We now turn to the lower bound. The technique used is similar to the case for two-layer neural networks . Define a Gram matrix with
At the initialization, we have
This matrix can be viewed as an empirical approximation of the kernel matrix defined in Section 2.2, since each row of is independently drawn from the uniform distribution over the sphere of radius . Using standard concentration inequalities, we can prove that with high probability, the smallest eigenvalue of the Gram matrix is bounded from below by the smallest eigenvalue of the kernel matrix. This is stated in the following lemma, whose proof is deferred to Appendix B.
For any , assume that . Then with probability at least over the random initialization:
Moreover, we can show that for any , the Gram matrix is still strictly positive definite as long as is large enough.
For any , assume that . With probability over the random initialization, we have for any ,