1 Introduction
The universal approximation theorem tells us that a two-layer neural networks can approximate a broad class of functions, provided that the width is sufficiently large, i.e. in the over-parametrized regime [1, 2]. Recently, it has been shown that gradient descent can find the optimal parameters for the wide two layer neural networks using optimal transport theory [3, 4, 5]
, and using direct estimation on particular neural network structure
[6, 7, 8]. All these results are in the over-parametrized regime. We are interested to see whether the loss function can converge to zero in the under-parametrized regime, i.e., the number of data points are larger than the width of the two layer neural networks.
Following [4, 6], we consider the loss function given by the quadratic function
(1.1) |
where is a compact set and , is the dimension, and is some probability measure on . We aim to show that under suitable setup, with high probability, the convergence rate is exponentially fast, and we desire the width of the neural networks is independent of the training data, by using suitable approximation . Note that this does not contradict with the result of [9], which shows that the under-parametrized networks cannot approximate the function well. The reason is that the loss function here only measures the approximation at the given data points, instead of approximation for the whole function.
Here, we mention the two main works that motivate this work. In [4], the following form of approximations were considered inspired by weak approximation of measures
(1.2) |
where represents any unit function. For example, it could be the Gaussian, the one-layer neural networks or deep neural networks. They were able study the dynamics as the interacting particle systems, so that the decaying of the loss function can be understood as gradient flows in Wasserstein-2 space using optimal transport theory. However, the decaying can only be shown in the limiting regime where and the decaying rate is unknown. The mechanism is that the parameters will converge to some minimization regions.
In the work [6] by S. Du et al., the two layer of neural networks were studied, i.e.,
Here,
is the so-called activation function. In their work, the following form of apprxomation is used:
(1.3) |
Though this seems to be a little change, the dynamics has been changed significantly compared with (1.2). They are actually able to show that the weights ’s will not converge under the gradient descent method: they are close to the initialization. They claim that the loss function converges to zero exponentially in the overparamterized regime. They obtain this result by considering the dynamics of the prediction . In [6], the convergence largely is due to their Assumption 3.1 and Theorem 3.1. The analysis for their
matrix needs to exclude the cases that there are two vectors being nearly parallel, which is not desired in the
regime, where is the number of training data. We note that if we rely on the dynamics of , the matrix can be positive definite simply due to the positivity of , and this is the strategy we adopt here.Below, we aim to combine and improve the ideas from these two works to show a convergence rate of the gradient descent independent of the sample size . We use the approximation from [6], i.e. (1.3), to study the dynamics of the training loss directly. We makes use the key observation from [6] that the weights are close to the initial values, but instead rely on the dynamics on ’s to obtain the positive definiteness of the Gram matrix. Another important technique is the generic chaining [10] from probability theory that kind of guarantees the the Gram matrix is close to some positive definite matrix uniformly data . In section 2, we set up the mathematical formulation and state the main results. The proof is then performed in section 3. Lastly, we make some discussion in section 4.
2 Mathematical setup and the main result
We focus on the following special setup, and consider two layer neural networks as in (1.3). The generalization to other cases will be done soon in subsequent works.
About the data set and measure for , we assume the following:
Assumption 1.
The domain is a compact set such that . The measure is a probability measure supported on a countable subset of .
Below, we will denote the support of by :
(2.1) |
One common example of is the empirical measure
(2.2) |
where ’s are the training data sampled i.i.d from some underlying distribution. Of course, in our discussion in this work, ’s are fixed.
Moreover, we assume the initial conditions as follows.
Assumption 2.
For the weight , we assume
and ’s are i.i.d such that .
The condition is important because it guarantees that the approximation (1.3) is with high probability at to close up the estimation later (Lemma 4). The distribution for
’s are assumed to be Gaussians only for technical convenience. The argument can be extended to other distributions with finite second moments.
Moreover, we assume
Assumption 3.
The activation function satisfies , and . Moreover, , and thus
The sigmoid functions should satisfy the condition. This condition may be too strong, however, we impose this only for simplicity. The generalization to other functions should be doable.
Under the gradient descent with loss function (1.1), the parameters ’s satisfy
(2.3) |
where means the time derivative. Similarly, the parameters ’s satisfy
(2.4) |
Instead of studying the prediction as in [6], we study the dynamics of the loss function directly as motivated by [4],
(2.5) |
The matrix
(2.6) |
is called the Gram matrix in the terminology from [6]. We aim to use the positive definiteness of this matrix to show the exponential convergence. The second part is positive semi-definite (in fact in [6], it was argued that this part is also positive definite when the data points are not parallel), so we can control easily as
(2.7) |
One key observation from [6] is that the weights do not change much during the dynamics. The choice (1.3) ensures the form
, so that one has kind of law of large number convergence here. Moreover, we note that
are i.i.d D Gaussian variables regardless the dimension of itself. Using the generic chaining approach [10], this allows us obtain the convergence rate of the loss function independent of and when is large enough ( without exponential dependence in ).Theorem 1.
We remark that as does not mean approximates well. It only approximates the function values on the discrete points . However, since the convergence is independent of , this result somehow says that any continuous function can be approximated by the two-layer neural networks with high probability.
In fact, it seems that the following approximation ability of two layer neural networks holds.
Corollary 1.
Here, means the norm with Lebesgue measure,. As commented in section 4, it seems that such good approximations are available ”everywhere” in the parameter space.
3 The proof of the main result
Below, we use to indicate a constant independent of , whose concrete meaning can change from line to line.
Lemma 1.
The derivatives of the parameters satisfy
(3.1) |
and
(3.2) |
where is independent of and .
The proof is straightforward and we omit. Let us just remark that if we consider, for example, . Using (2.3), by Hölder’s inequality, it is straightforward to obtain
We now estimate these two terms separately.
Lemma 2.
The term satisfies
where
(3.6) |
Proof.
First of all, one finds
Using (3.1), one has
Hence, due to for , one has
Consequently, one has
(3.7) |
where we have used the fact that
The claim then follows. ∎
We are now in a position to take in . In particular, we need to estimate
For fixed , it is clear that this converges to
and the rate is independent of . However, to obtain the uniform convergence rate for is challenging. The usual technique is the union bound, which gives a quantity involving , and this further requires the over-parametrized regime. One may also use the convergence of empirical measures in the so called Wasserstein spaces [11], which yields a quantity independent of the training data, but suffers from curse of dimensionality. In our problem here, we are essentially considering the projections of high dimensional Gaussians. Hence, we expect the convergence rate will not suffer from curse of dimensionality. We find that the generic chaining approach in [10, 12] useful.
We introduce the following result essentially derived from [12, Proposition 2.2] and the proof of [12, Theorem 1.1]. The proof is then deferred to Appendix A using generic chaining approach.
Proposition 1.
Consider
Suppose that for any given , ’s are independent, and for any , and are independent for . Moreover, ’s have the same distribution, satisfying the following.
-
They are uniformly bounded, have mean zero, and finite variance.
-
there is a constant such that
then there is a parameter only depending on with magnitude , such that for any small and that is countable, when , with probability ,
(3.8) |
where is independent of .
We now move to the estimates of . We have
Lemma 3.
For any small , when , with probability , it holds that
(3.9) |
Proof.
We only need to verify the conditions in Proposition 1. Here, the domain we work on is while is the countable subset we consider. The distance is defined by .
Clearly, we need to identity
Then,
Taking expectation, one finds . The claim follows.
Note that in this case, the dimension is , but clearly, . ∎
Note that the boundedness of is also used here for the Bernstein inequality in Appendix A. Consequently, when (3.9) holds, one has
(3.10) |
Now, we move onto the initial values of . If we identify
and use Proposition 1, we can find that when , with probability , it holds that
(3.11) |
Here, depends on . However, according to [8, Lemma 1], one has the following.
Lemma 4.
For any small , with probability , it holds that
(3.12) |
The proof is done via the Rademacher complexity, and we refer the readers to [8].
Now, we estimate the Gram matrix.
Lemma 5.
There exists independent of such that
(3.13) |
Proof.
In fact, , which is a 2D Gaussian with covariant matrix
Consider the interval where . Then, contains for some .
Now, we should estimate . The expectation can be rewritten as
where and are independent standard normal variables, and
Note that is bounded, hence, we only need and where is a number chosen such that and are in for all . Clearly can be made universal independent of . The probability hence has a lower bound independent of . Then, setting will suffice. ∎
We are now able to prove the main result.
Proof of Theorem 1.
By the previous lemmas, for any small, when , we may pick an event with probability , such that outside this, we have , , and also .
Then, on
Define
We claim that if for some , then
We first of all note that is nonincreasing. We now consider a fixed . Then, all the quantities below are essentially discussed for such , and we omit this dependence for convenience.
In fact, since , it is then clear that if , . Then, for , then
Then,
Hence, for all
Hence, if is chosen such that
also holds, then we must have . In other words, for some ,
will imply .
Otherwise, suppose . Then,
still holds. Then, by continuity, there exists such that on this still holds. This then contradicts with the definition of .
Now that , holds for all . Lastly, we note , the claim then follows. ∎
We now move to the corollary regarding approximation.
Proof of Corollary 1.
We fix . Clearly, one can use a function that is Lipschitz to approximate with small error under norm. Hence, we assume that is Lipschitz.
We first of all have the following estimates:
(3.14) |
where
is the empirical measure for data .
By the standard result of convergence empirical measures [11], there exists such with such that,
where is the Wasserstein distance.
Then, we set . Then, we take the event such that the claims in Theorem 1 hold for , and such that . Hence, we are able to find the parameters for large enough such that
The important consequence of Theorem 1 is that is independent of (and thus ).
Now, we fix such and estimate the Lipschitz constant of . The gradient of this function is
Hence, we need to estimate and .
By Lemma 3.1,
Similarly,
Hence,
Hence, and . In other words, the Lipschitz constant of is controlled by a polynomial of and the Lipschitz constant of (but independent of ), since is independent of .
The whole error is thus given by . As is arbitrary, the claim follows. ∎
4 Conclusion and discussion
Current works show the convergence of loss function in the overparametrized regime for neural networks. By considering the loss function directly, we obtain the decay of the loss function in a rate independent of the number of training data, where the approximation is given specifically by (1.3). The key observation was made by S. Du et al that the weights are close to the initialization. In this sense, the loss function has global minimizations “everywhere” in the parameter space. As we have seen, the proof works for large number of data , and as Corollary 1
indicates, the parameters that can achieve the good approximation seems to be available ”everywhere” in the parameter space, which we strongly believe leads to the good performance of stochastic gradient descent. If we instead use the approximation of the form (
1.2) inspired by mean field limit, then one must wait the weights to converge to some specific regions to gain global minimization. This probably means the approximation (1.3) has larger capacity and may be used to explain why SGD behaves well.The loss function used here is quadratic functions. However, we argue that quadratic functions are general enough since near the minima, all functions look like quadratic functions.
There are many instant subsequent works, for example, one may explore whether the ideas and techniques here can be applied to improve the the generalization error estimates in literature, and to get better explanation of why stochastic gradient descent works well. These should be challenging but exciting problems.
Acknowledgement
The work of L. Li was partially sponsored by NSFC 11901389 and 11971314, and Shanghai Sailing Program 19YF1421300.
Appendix A Proof of Proposition 1
The proof divides into several steps.
Step 1. Basic concentration inequalities.
Define
(A.1) |
and define
(A.2) |
Hence, using , one has
Note that
where ’s are still independent with mean zero. By Bernstein inequality [13], one has for any
(A.3) |
where .
Consequently,
Hence, one defines the metric (note that the maximum of two metrics is still a metric)
(A.4) |
then one has
(A.5) |
while for ,
(A.6) |
Step 2 Generic chaining
Now consider the domain equipped with the metric in (A.4). Define a number
(A.7) |
where is a set containing points. The is taken over the collections of such sets . From to , the number of points in is squared. Hence, the distance decays double exponentially if the sequence is chosen suitably. Hence, is well-defined. For a given , for every , there must be , such that
and we define
(A.8) |
Let be the usual Euclidean distance, and we find that on the suitable sequence of subsets, . Hence, the number is comparable to
Below, we focus only on . The number of for the unit ball is like as shown in [10, sec. 2.2], where is the dimension.
By the definition of , we are able to choose the collection of such that
which holds for all .
Step 3 Controlling .
Let be such that . Consider any . By step 1, for any fixed , one has
Comments
There are no comments yet.