On the convergence of gradient descent for two layer neural networks

09/30/2019
by Lei Li, et al.
Shanghai Jiao Tong University

It has been shown that gradient descent can yield zero training loss in the over-parametrized regime (the width of the neural networks is much larger than the number of data points). In this work, combining ideas from some existing works, we investigate the gradient descent method for training two-layer neural networks to approximate target continuous functions. By making use of the generic chaining technique from probability theory, we show that gradient descent can yield an exponential convergence rate, while the width of the neural networks needed is independent of the size of the training data. The result also implies a strong approximation ability of two-layer neural networks without the curse of dimensionality.


1 Introduction

The universal approximation theorem tells us that a two-layer neural network can approximate a broad class of functions, provided that the width is sufficiently large, i.e., in the over-parametrized regime [1, 2]. Recently, it has been shown that gradient descent can find the optimal parameters for wide two-layer neural networks using optimal transport theory [3, 4, 5], and using direct estimation on particular neural network structures [6, 7, 8]. All these results are in the over-parametrized regime. We are interested in whether the loss function can converge to zero in the under-parametrized regime, i.e., when the number of data points is larger than the width of the two-layer neural network.

Following [4, 6], we consider the loss function given by the quadratic function

(1.1)   $L(\theta) = \frac{1}{2}\int_X |f(x;\theta) - f^*(x)|^2\,\mu(dx),$

where $X \subset \mathbb{R}^d$ is a compact set, $f^*: X \to \mathbb{R}$ is the target function, $d$ is the dimension, and $\mu$ is some probability measure on $X$. We aim to show that, under a suitable setup, the convergence rate is exponentially fast with high probability, and that the width of the neural networks can be taken independent of the size of the training data, by using a suitable approximation $f(x;\theta)$. Note that this does not contradict the result of [9], which shows that under-parametrized networks cannot approximate the function well. The reason is that the loss function here only measures the approximation at the given data points, instead of the approximation of the whole function.

Here, we mention the two main works that motivate this work. In [4], the following form of approximation was considered, inspired by the weak approximation of measures:

(1.2)   $f(x;\theta) = \frac{1}{m}\sum_{j=1}^{m}\phi(x;\theta_j),$

where $\phi$ represents any unit function. For example, it could be a Gaussian, a one-layer neural network, or a deep neural network. They were able to study the dynamics as an interacting particle system, so that the decay of the loss function can be understood as a gradient flow in Wasserstein-2 space using optimal transport theory. However, the decay can only be shown in the limiting regime where $m \to \infty$, and the decay rate is unknown. The mechanism is that the parameters converge to some minimizing regions.

In the work [6] by S. Du et al., two-layer neural networks were studied, i.e.,

$f(x; a, W) = \sum_{r=1}^{m} a_r\,\sigma(w_r \cdot x).$

Here, $\sigma$ is the so-called activation function. In their work, the following form of approximation is used:

(1.3)   $f(x; a, W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\,\sigma(w_r \cdot x).$
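For concreteness, here is a minimal numerical sketch of the parameterization (1.3), with the initialization we assume later; the scaling, the sigmoid choice, and all variable names are our reconstruction, not code from [6]:

```python
import numpy as np

def two_layer_nn(x, a, W):
    """Scaled two-layer network (1.3): f(x) = (1/sqrt(m)) * sum_r a_r * sigma(w_r . x)."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # a sigmoid, cf. Assumption 3
    m = a.shape[0]
    return (a @ sigma(W @ x)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 5, 1000
W = rng.standard_normal((m, d))          # w_r i.i.d. N(0, I_d), cf. Assumption 2
a = rng.choice([-1.0, 1.0], size=m)      # a_r = +/-1, so f(x; 0) = O(1) w.h.p.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                   # a point in a compact domain
print(two_layer_nn(x, a, W))
```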

Though this seems to be only a small change, the dynamics changes significantly compared with (1.2). They are actually able to show that the weights $w_r$ will not converge under the gradient descent method: they stay close to the initialization. They claim that the loss function converges to zero exponentially in the over-parametrized regime. They obtain this result by considering the dynamics of the predictions at the data points. In [6], the convergence is largely due to their Assumption 3.1 and Theorem 3.1. The analysis of their Gram matrix needs to exclude the case that two data vectors are nearly parallel, which is not desirable in the regime $m \ll N$, where $N$ is the number of training data. We note that if we rely on the dynamics of the $a_r$'s instead, the Gram matrix can be positive definite simply due to the positivity of $\sigma$, and this is the strategy we adopt here.

Below, we aim to combine and improve the ideas from these two works to show a convergence rate for gradient descent that is independent of the sample size $N$. We use the approximation from [6], i.e., (1.3), to study the dynamics of the training loss directly. We make use of the key observation from [6] that the weights stay close to their initial values, but instead rely on the dynamics of the $a_r$'s to obtain the positive definiteness of the Gram matrix. Another important technique is the generic chaining [10] from probability theory, which guarantees that the Gram matrix is close to some positive definite matrix uniformly in the data $x$. In section 2, we set up the mathematical formulation and state the main results. The proof is then carried out in section 3. Lastly, we give some discussion in section 4.

2 Mathematical setup and the main result

We focus on the following special setup and consider two-layer neural networks as in (1.3). The generalization to other cases will be addressed in subsequent works.

About the data set and the measure $\mu$, we assume the following:

Assumption 1.

The domain $X$ is a compact set such that $|x| \le 1$ for all $x \in X$. The measure $\mu$ is a probability measure supported on a countable subset of $X$.

Below, we will denote the support of $\mu$ by $S$:

(2.1)   $S := \operatorname{supp}(\mu) \subset X.$

One common example of $\mu$ is the empirical measure

(2.2)   $\mu = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i},$

where the $x_i$'s are the training data, sampled i.i.d. from some underlying distribution. Of course, in our discussion in this work, the $x_i$'s are fixed.

Moreover, we assume the initial conditions as follows.

Assumption 2.

For the weights, we assume the $a_r(0)$'s are i.i.d. with $a_r(0)$ uniform on $\{-1, 1\}$, and the $w_r(0)$'s are i.i.d. such that $w_r(0) \sim N(0, I_d)$.

The condition on the $a_r$'s is important because it guarantees that the approximation (1.3) is $O(1)$ with high probability at $t = 0$, which closes up the estimation later (Lemma 4). The distribution of the $w_r$'s is assumed to be Gaussian only for technical convenience. The argument can be extended to other distributions with finite second moments.

Moreover, we assume

Assumption 3.

The activation function $\sigma$ is smooth and satisfies $\sigma > 0$, $|\sigma| \le 1$, and $|\sigma'| + |\sigma''| \le C$. Moreover, $\sigma' > 0$, and thus $\sigma$ is strictly increasing.

The sigmoid functions satisfy this condition. This condition may be too strong; however, we impose it only for simplicity. The generalization to other activation functions should be doable.

Under the gradient descent (in continuous time) with loss function (1.1), the parameters $a_r$ satisfy

(2.3)   $\dot{a}_r = -\partial_{a_r} L = -\frac{1}{\sqrt{m}} \int_X \big(f(x;\theta) - f^*(x)\big)\,\sigma(w_r \cdot x)\,\mu(dx),$

where the dot means the time derivative. Similarly, the parameters $w_r$ satisfy

(2.4)   $\dot{w}_r = -\nabla_{w_r} L = -\frac{a_r}{\sqrt{m}} \int_X \big(f(x;\theta) - f^*(x)\big)\,\sigma'(w_r \cdot x)\,x\,\mu(dx).$

Instead of studying the predictions as in [6], we study the dynamics of the loss function directly, as motivated by [4]:

(2.5)   $\frac{dL}{dt} = -\sum_{i,j} \mu_i \mu_j\, e_i\,\big(H_{ij} + G_{ij}\big)\, e_j,$

where $\mu_i := \mu(\{x_i\})$, $e_i := f(x_i;\theta) - f^*(x_i)$, and $G_{ij} := \frac{1}{m}\sum_{r=1}^m a_r^2\,\sigma'(w_r\cdot x_i)\,\sigma'(w_r\cdot x_j)\,x_i\cdot x_j$.

The matrix with entries

(2.6)   $H_{ij} = \frac{1}{m}\sum_{r=1}^m \sigma(w_r\cdot x_i)\,\sigma(w_r\cdot x_j), \qquad x_i, x_j \in S,$

is called the Gram matrix in the terminology of [6]. We aim to use the positive definiteness of this matrix to show the exponential convergence. The second part, $G$, is positive semi-definite (in fact, in [6] it was argued that this part is also positive definite when the data points are not parallel), so we can easily control the loss as

(2.7)   $\frac{dL}{dt} \le -\sum_{i,j} \mu_i \mu_j\, e_i\, H_{ij}\, e_j.$
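The identity (2.5) is just the chain rule: along the gradient flow, $\frac{dL}{dt} = -\|\nabla L\|^2$, and expanding the gradient in $a$ and $W$ produces exactly the two quadratic forms. A short numerical check of this, under our reconstructed notation (all names and choices here are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

d, m, N = 5, 40, 100
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
mu = np.full(N, 1.0 / N)                 # empirical measure (2.2)

Z = X @ W.T                              # Z[i, r] = w_r . x_i
e = sigma(Z) @ a / np.sqrt(m) - np.sin(X[:, 0])   # residual e_i for some target
H = sigma(Z) @ sigma(Z).T / m            # Gram matrix (2.6) from the a-dynamics
G = ((dsigma(Z) * a) @ (dsigma(Z) * a).T) * (X @ X.T) / m   # from the w-dynamics

grad_a = sigma(Z).T @ (mu * e) / np.sqrt(m)              # dL/da_r
grad_W = ((dsigma(Z) * a).T * (mu * e)) @ X / np.sqrt(m) # dL/dw_r
lhs = np.sum(grad_a**2) + np.sum(grad_W**2)              # ||grad L||^2
rhs = (mu * e) @ (H + G) @ (mu * e)                      # quadratic form in (2.5)
print(np.isclose(lhs, rhs))              # True: dL/dt = -<e, (H+G) e>_mu
```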

One key observation from [6] is that the weights do not change much during the dynamics. The choice (1.3) ensures that $H_{ij}$ is an average of $m$ i.i.d. terms, so that one has a kind of law-of-large-numbers convergence here. Moreover, we note that the pairs $(w_r\cdot x,\, w_r\cdot y)$ are i.i.d. 2D Gaussian variables, regardless of the dimension of $x$ itself. Using the generic chaining approach [10], this allows us to obtain a convergence rate of the loss function independent of $N$ and $d$ when $m$ is large enough (with no exponential dependence on $d$).

Theorem 1.

Under Assumptions 1-3, for any $\delta$ small, when $m$ is large enough (independent of the size of the training data), with probability $1-\delta$ it holds that

(2.8)   $L(t) \le L(0)\,e^{-\lambda t},$

where $\lambda > 0$ and the required width $m$ are independent of $N$ and $d$.

We remark that $L(t) \to 0$ as $t \to \infty$ does not mean that $f(\cdot\,;\theta)$ approximates $f^*$ well. It only approximates the function values at the discrete points of $S$. However, since the convergence is independent of $N$, this result somehow says that any continuous function can be approximated at arbitrarily many data points by two-layer neural networks with high probability.
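An illustrative experiment in the spirit of Theorem 1 (a sketch, not the authors' code: the target, step size, and scalings are our choices) trains the network (1.3) on increasingly many data points while the width stays fixed; on typical runs the loss decays steadily, at a rate that does not degrade as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

def train(N, m=2000, d=5, lr=1.0, steps=2000):
    X = rng.standard_normal((N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.cos(X @ np.ones(d))               # target values at the data points
    W = rng.standard_normal((m, d))          # initialization as in Assumption 2
    a = rng.choice([-1.0, 1.0], size=m)
    for _ in range(steps):
        Z = X @ W.T
        e = sigma(Z) @ a / np.sqrt(m) - y    # residual at the data points
        ga = sigma(Z).T @ e / (N * np.sqrt(m))               # gradient in a
        gW = ((dsigma(Z) * a).T * e) @ X / (N * np.sqrt(m))  # gradient in W
        a -= lr * ga
        W -= lr * gW
    e = sigma(X @ W.T) @ a / np.sqrt(m) - y
    return 0.5 * np.mean(e**2)               # final training loss

for N in (50, 200, 800):                     # width m is fixed while N grows
    print(N, train(N))
```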

In fact, it seems that the following approximation ability of two-layer neural networks holds.

Corollary 1.

Suppose $X$, satisfying Assumption 1, is the union of closures of some domains (open connected sets), so that the Lebesgue measure is well defined on it. Then, two-layer neural networks satisfying Assumption 3, with width $m$ large enough, are able to approximate any function $f^* \in C(X)$ under the $L^p$ norm, $p \in [1, \infty)$.

Here, $L^p$ means the norm with respect to the Lebesgue measure. As commented in section 4, it seems that such good approximations are available "everywhere" in the parameter space.

3 The proof of the main result

Below, we use $C$ to indicate a constant independent of $m$, $N$, and $d$, whose concrete meaning can change from line to line.

Lemma 1.

The derivatives of the parameters satisfy

(3.1)   $|\dot{a}_r| \le \frac{C}{\sqrt{m}}\sqrt{L(t)}$

and

(3.2)   $|\dot{w}_r| \le \frac{C}{\sqrt{m}}\sqrt{L(t)},$

where $C$ is independent of $m$ and $N$.

The proof is straightforward and we omit it. Let us just remark that if we consider, for example, $\dot{a}_r$: using (2.3), by Hölder's inequality, it is straightforward to obtain the claimed bound.
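In detail, with the reconstructed flow (2.3), the one-line estimate reads (a sketch; constants are not optimized):

```latex
|\dot{a}_r|
= \frac{1}{\sqrt{m}}\left|\int_X \big(f(x;\theta)-f^*(x)\big)\,\sigma(w_r\cdot x)\,\mu(dx)\right|
\le \frac{\|\sigma\|_\infty}{\sqrt{m}}\left(\int_X |f-f^*|^2\,d\mu\right)^{1/2}
= \frac{\|\sigma\|_\infty}{\sqrt{m}}\,\sqrt{2L(t)},
```

by the Hölder (Cauchy-Schwarz) inequality, since $\mu$ is a probability measure.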

Using Lemma 1, we are able to establish:

(3.3)

where we have introduced

(3.4)

and

(3.5)

We now estimate these two terms separately.

Lemma 2.

The term satisfies

where

(3.6)
Proof.

First of all, one finds

Using (3.1), one has

Hence, due to for , one has

Consequently, one has

(3.7)

where we have used the fact that

The claim then follows. ∎

We are now in a position to take the large-$m$ limit in the second term. In particular, we need to estimate

$\sup_{x,y}\left|\frac{1}{m}\sum_{r=1}^m \sigma(w_r(0)\cdot x)\,\sigma(w_r(0)\cdot y) - \mathbb{E}\,\sigma(w\cdot x)\,\sigma(w\cdot y)\right|.$

For fixed $(x, y)$, it is clear that this converges to zero, and the rate is independent of $d$. However, obtaining a convergence rate uniform in $(x, y)$ is challenging. The usual technique is the union bound, which gives a quantity involving the number of data points, and this further requires the over-parametrized regime. One may also use the convergence of empirical measures in the so-called Wasserstein spaces [11], which yields a quantity independent of the training data, but suffers from the curse of dimensionality. In our problem here, we are essentially considering projections of high-dimensional Gaussians; hence, we expect the convergence rate not to suffer from the curse of dimensionality. We find the generic chaining approach in [10, 12] useful.
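A quick numerical illustration of this uniformity (a sketch under our notation; we use the simpler average $\frac{1}{m}\sum_r \sigma(w_r\cdot x)$, whose expectation over unit vectors $x$ is exactly $1/2$ by the symmetry of the sigmoid): the worst deviation over the data approaches the finite supremum over the whole domain instead of growing with $N$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

d, m = 10, 500
W = rng.standard_normal((m, d))          # the m sampled weights, fixed once

# For |x| = 1, w.x ~ N(0,1), so E sigma(w.x) = 1/2 (since sigma(z)+sigma(-z)=1).
for N in (10, 100, 1000, 10000):
    X = rng.standard_normal((N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    emp = sigma(X @ W.T).mean(axis=1)    # (1/m) sum_r sigma(w_r . x), one per x
    print(N, np.abs(emp - 0.5).max())    # saturates toward the sup over the sphere
```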

We introduce the following result, essentially derived from [12, Proposition 2.2] and the proof of [12, Theorem 1.1]. The proof is deferred to Appendix A and uses the generic chaining approach.

Proposition 1.

Consider

Suppose that for any given , ’s are independent, and for any , and are independent for . Moreover, ’s have the same distribution, satisfying the following.

  1. They are uniformly bounded, have mean zero, and finite variance.

  2. There is a constant such that

Then there is a parameter, depending only on this constant, such that for any small and any countable index set, when is large enough, with probability ,

(3.8)

where is independent of .

We now move to the estimate of the second term. We have

Lemma 3.

For any small , when , with probability , it holds that

(3.9)
Proof.

We only need to verify the conditions in Proposition 1. Here, the domain we work on is while is the countable subset we consider. The distance is defined by .

Clearly, we need to identify

Then,

Taking expectation, one finds . The claim follows.

Note that in this case, the dimension is , but clearly, . ∎

Note that the boundedness of $\sigma$ is also used here for the Bernstein inequality in Appendix A. Consequently, when (3.9) holds, one has

(3.10)

Now, we move on to the initial values of the approximation. If we identify

and use Proposition 1, we can find that when $m$ is large enough, with probability $1-\delta$, it holds that

(3.11)

Here, the constant depends on $d$. However, according to [8, Lemma 1], one has the following.

Lemma 4.

For any small , with probability , it holds that

(3.12)

The proof is done via the Rademacher complexity, and we refer the readers to [8].

Now, we estimate the Gram matrix.

Lemma 5.

There exists $\lambda > 0$, independent of the data, such that

(3.13)   $\inf_{x, y \in X}\ \mathbb{E}\,\sigma(w\cdot x)\,\sigma(w\cdot y) \ge \lambda, \qquad w \sim N(0, I_d).$

Proof.

In fact, $(w\cdot x,\, w\cdot y)$ is a 2D Gaussian with covariance matrix

$\begin{pmatrix} |x|^2 & x\cdot y \\ x\cdot y & |y|^2 \end{pmatrix}.$

Now, we should estimate $\mathbb{E}\,\sigma(w\cdot x)\,\sigma(w\cdot y)$. The expectation can be rewritten as

$\mathbb{E}\big[\sigma(a_1 Z_1)\,\sigma(b_1 Z_1 + b_2 Z_2)\big],$

where $Z_1$ and $Z_2$ are independent standard normal variables, and the coefficients are determined by the covariance matrix. Note that $\sigma$ is positive and increasing; hence, we only need the event $\{|Z_1| \le 1,\ |Z_2| \le 1\}$, on which $a_1 Z_1$ and $b_1 Z_1 + b_2 Z_2$ lie in $[-R, R]$ for a number $R$ chosen using $|x|, |y| \le 1$. Clearly $R$ can be made universal, independent of $x$ and $y$. The probability of this event hence has a lower bound independent of $x$ and $y$. Then, setting $\lambda = \sigma(-R)^2\,\mathbb{P}(|Z_1| \le 1,\ |Z_2| \le 1)$ will suffice. ∎
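A quick Monte Carlo sanity check of this lower bound (a sketch; the printed value is only an empirical estimate of $\lambda$ for the sigmoid): the kernel $\mathbb{E}\,\sigma(w\cdot x)\,\sigma(w\cdot y)$ stays bounded away from zero uniformly over the angle between $x$ and $y$, and, by the 2D-Gaussian observation, the answer does not depend on $d$:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

d, M = 20, 200_000
w = rng.standard_normal((M, d))          # w ~ N(0, I_d)

x = np.zeros(d); x[0] = 1.0
estimates = []
for t in np.linspace(0.0, np.pi, 9):     # sweep the angle between unit x and y
    y = np.zeros(d); y[0], y[1] = np.cos(t), np.sin(t)
    estimates.append(np.mean(sigma(w @ x) * sigma(w @ y)))

print(min(estimates))                    # about 0.2: bounded away from zero
```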

We are now able to prove the main result.

Proof of Theorem 1.

By the previous lemmas, for any $\delta$ small, when $m$ is large enough, we may pick an event with probability at most $\delta$ such that, outside this event, the estimates in the previous lemmas all hold.

Then, on

Define

We claim that if for some , then

We first of all note that $L(t)$ is nonincreasing. We now consider a fixed realization in this event. Then, all the quantities below are essentially discussed for such a realization, and we omit this dependence for convenience.

In fact, since , it is then clear that if , . Then, for , then

Then,

Hence, for all

Hence, if is chosen such that

also holds, then we must have . In other words, for some ,

will imply .

Otherwise, suppose . Then,

still holds. Then, by continuity, there exists a slightly larger time for which this still holds. This then contradicts the definition of the maximal time.

Now that , holds for all . Lastly, we note , the claim then follows. ∎

We now move to the corollary regarding approximation.

Proof of Corollary 1.

We fix $\epsilon > 0$. Clearly, one can use a Lipschitz function to approximate $f^*$ with small error under the $L^p$ norm. Hence, we may assume that $f^*$ is Lipschitz.

We first of all have the following estimates:

(3.14)

where

$\mu_N = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$

is the empirical measure for the data $\{x_i\}$.

By the standard result on the convergence of empirical measures [11], there exists such data with $N$ large enough such that

$W_p(\mu_N, \nu) \le \epsilon,$

where $W_p$ is the Wasserstein distance and $\nu$ is the normalized Lebesgue measure on $X$.

Then, we set $\mu = \mu_N$. Then, we take the event such that the claims in Theorem 1 hold for this $\mu$, and such that the initial values are controlled. Hence, we are able to find the parameters for $m$ large enough such that the training loss is small.

The important consequence of Theorem 1 is that the required width $m$ is independent of $N$ (and thus of the data).

Now, we fix such parameters and estimate the Lipschitz constant of $f(\cdot\,; a, W)$. The gradient of this function is

$\nabla_x f(x; a, W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma'(w_r \cdot x)\, w_r.$

Hence, we need to estimate $\max_r |a_r|$ and $\max_r |w_r|$.

By Lemma 1, the $a_r$'s stay within a bounded distance of their initial values. Similarly, so do the $w_r$'s. Hence, $\max_r |a_r|$ and $\max_r |w_r|$ are under control. In other words, the Lipschitz constant of $f$ is controlled by a polynomial of $m$ and the Lipschitz constant of $f^*$ (but independent of $N$), since $m$ is independent of $N$.

The whole error is thus given by the sum of the above contributions. As $\epsilon$ is arbitrary, the claim follows. ∎

4 Conclusion and discussion

Current works show the convergence of the loss function in the over-parametrized regime for neural networks. By considering the loss function directly, we obtain the decay of the loss function at a rate independent of the number of training data, where the approximation is given specifically by (1.3). The key observation, made by S. Du et al., is that the weights stay close to the initialization. In this sense, the loss function has global minimizers "everywhere" in the parameter space. As we have seen, the proof works for a large number of data points $N$, and as Corollary 1 indicates, the parameters that achieve the good approximation seem to be available "everywhere" in the parameter space, which we strongly believe leads to the good performance of stochastic gradient descent. If we instead use the approximation of the form (1.2), inspired by the mean-field limit, then one must wait for the weights to converge to some specific regions to attain a global minimum. This probably means that the approximation (1.3) has larger capacity, and it may be used to explain why SGD behaves well.

The loss function used here is quadratic. However, we argue that quadratic functions are general enough, since near the minima all functions look like quadratic functions.

There are many immediate follow-up directions; for example, one may explore whether the ideas and techniques here can be applied to improve the generalization error estimates in the literature, and to obtain a better explanation of why stochastic gradient descent works well. These should be challenging but exciting problems.

Acknowledgement

The work of L. Li was partially sponsored by NSFC 11901389 and 11971314, and Shanghai Sailing Program 19YF1421300.

Appendix A Proof of Proposition 1

The proof divides into several steps.


Step 1. Basic concentration inequalities.

Define

(A.1)

and define

(A.2)

Hence, using , one has

Note that

where the summands are still independent with mean zero. By the Bernstein inequality [13], one has, for any $t > 0$,

(A.3)

where .
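For reference, a standard form of the Bernstein inequality for independent, mean-zero variables with $|Y_r| \le b$ and $\operatorname{Var}(Y_r) \le \nu^2$ (the form implicitly used above; see [13]) is:

```latex
\mathbb{P}\left(\left|\frac{1}{m}\sum_{r=1}^{m} Y_r\right| \ge t\right)
\le 2\exp\left(-\frac{m t^2/2}{\nu^2 + b t/3}\right),
\qquad t > 0.
```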

Consequently,

Hence, one defines the metric (note that the maximum of two metrics is still a metric)

(A.4)

then one has

(A.5)

while for ,

(A.6)

Step 2. Generic chaining.

Now consider the domain equipped with the metric in (A.4). Define the number

(A.7)   $\gamma := \inf_{\{T_n\}}\ \sup_{t}\ \sum_{n \ge 0} 2^{n/2}\, d(t, T_n),$

where $T_n$ is a set containing at most $2^{2^n}$ points. The infimum is taken over the collections of such sets $\{T_n\}$. From $n$ to $n+1$, the number of points in $T_n$ is squared. Hence, the distance $d(t, T_n)$ decays doubly exponentially if the sequence is chosen suitably, and $\gamma$ is well-defined. For a given collection, for every $t$ and every $n$, there must be a nearest point $\pi_n(t) \in T_n$ such that $d(t, \pi_n(t)) = d(t, T_n)$, and we define

(A.8)

Let $|\cdot|$ be the usual Euclidean distance; we find that, on a suitable sequence of subsets, the metric in (A.4) is comparable to the Euclidean one. Hence, the number $\gamma$ is comparable to its Euclidean counterpart.

Below, we focus only on the Euclidean distance. The number of points needed to cover the unit ball at scale $\epsilon$ is like $(C/\epsilon)^d$, as shown in [10, sec. 2.2], where $d$ is the dimension.

By the definition of $\gamma$, we are able to choose the collection of sets such that

$\sum_{n \ge 0} 2^{n/2}\, d(t, T_n) \le 2\gamma,$

which holds for all $t$.


Step 3. Controlling the supremum.

Let be such that . Consider any . By step 1, for any fixed , one has