1 Introduction
Despite the extensive empirical success of deep networks, their optimization and generalization properties are still not well understood. Recently, the neural tangent kernel (NTK) has provided the following insight into the problem. In the infinite-width limit, the NTK converges to a limiting kernel which stays constant during training; on the other hand, when the width is large enough, the function learned by gradient descent follows the NTK (Jacot et al., 2018). This motivates the study of overparameterized networks trained by gradient descent, using properties of the NTK. In fact, parameters related to the NTK, such as the minimum eigenvalue of the limiting kernel, appear to affect optimization and generalization (Arora et al., 2019). However, in addition to such NTK-dependent parameters, prior work also requires the width to depend polynomially on n, 1/δ, or 1/ε, where n denotes the size of the training set, δ denotes the failure probability, and ε denotes the target error. These large widths far exceed what is used empirically, constituting a significant gap between theory and practice.
Our contributions.
In this paper, we narrow this gap by showing that a two-layer ReLU network with only polylogarithmically many hidden units, trained by gradient descent, achieves arbitrarily small classification error on test data, meaning both optimization and generalization occur. Unlike prior work, the width is fully polylogarithmic in n, 1/δ, and 1/ε; the width additionally depends on the separation margin of the limiting kernel, a quantity which is guaranteed to be positive (assuming no input appears twice with conflicting labels), and which can distinguish between true labels and random labels. The paper organization, together with some details, is described below.
- Section 2 studies gradient descent on the training set. Using the geometry inherent in classification tasks, we prove that with any width that is at least polylogarithmic and any small enough constant step size, gradient descent achieves a small training error within the number of iterations stated in Section 2. As is common in the NTK literature (Chizat and Bach, 2019), we also show that the parameters hardly change, which will be essential to our generalization analysis.
- Section 3 gives a test error bound. Concretely, using the preceding gradient descent analysis and standard Rademacher tools, but exploiting how little the weights move, we show that with the sample size and iteration count stated in Section 3, gradient descent finds a solution with small test error. (As also discussed in Section 3, a smaller sample size suffices via a smoothness-based generalization bound, at the expense of large constant factors.)
- Section 4 considers stochastic gradient descent (SGD) with access to a standard stochastic online oracle. We prove that with width at least polylogarithmic and the sample complexity stated in Section 4, SGD achieves an arbitrarily small test error.
- Section 5 discusses the separability condition, which is in general a positive number and reflects the difficulty of the classification problem. Regarding random labels, we show that starting from a distribution with a good positive margin, but replacing the labels with random noise, the margin can degrade all the way down to a trivially small level, which (correctly) removes the possibility of generalization. In this way, our analysis can distinguish between true labels and random labels.
- Section 6 concludes with some open problems.
1.1 Related work
There has been a large literature studying gradient descent on overparameterized networks via the NTK. The most closely related work is (Nitanda and Suzuki, 2019), which shows that a two-layer network trained by gradient descent with the logistic loss can achieve a small test error, under the same assumption that the neural tangent model with respect to the first layer can separate the data distribution. However, they analyze smooth activations, while we handle the ReLU. Moreover, they require polynomially many hidden units, data samples, and steps, while our result only needs polylogarithmically many hidden units (with the sample and step requirements stated in our theorems).
Additionally on shallow networks, Du et al. (2018b) prove that on an overparameterized two-layer network, gradient descent can globally minimize the empirical risk with the squared loss. Their result requires polynomially many hidden units. Oymak and Soltanolkotabi (2019) further reduce the required overparameterization, but a polynomial dependency remains. Using the same amount of overparameterization as (Du et al., 2018b), Arora et al. (2019) further show that the two-layer network learned by gradient descent can achieve a small test error, assuming that on the data distribution the smallest eigenvalue of the limiting kernel is at least some positive constant. They also give a fine-grained characterization of the predictions made by gradient descent iterates; such a characterization makes use of a special property of the squared loss and cannot be applied to the logistic regression setting.
Li and Liang (2018) show that stochastic gradient descent (SGD) with the cross entropy loss can learn a two-layer network with small test error, using a number of hidden units that is at least the covering number of the unit sphere by balls whose radii are no larger than the smallest distance between two data points with different labels. In a high-dimensional space, this covering number could be very large. Allen-Zhu et al. (2018a) consider SGD on a two-layer network, and a variant of SGD on a three-layer network. The three-layer analysis further exhibits some properties not captured by the NTK. They assume a ground truth network with infinite-order smooth activations, and they require the width to depend polynomially on the problem parameters and on some constants related to the smoothness of the activations of the ground truth network.

On deep networks, a variety of works have established low training error (Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018; Zou and Gu, 2019). Cao and Gu (2019a) assume that the neural tangent model with respect to the second layer of a two-layer network can separate the data distribution, and prove that gradient descent on a deep network can achieve a small test error with polynomially many samples and hidden units. Cao and Gu (2019b) consider SGD with an online oracle and give a general result. Under the same assumption as in (Cao and Gu, 2019a), their result requires polynomially many hidden units and a polynomial sample complexity. By contrast, with the same online oracle, our result only needs polylogarithmically many hidden units and the sample complexity stated in Section 4.
1.2 Notation
The dataset is denoted by {(x_i, y_i)}_{i=1}^n, where x_i ∈ ℝ^d and y_i ∈ {−1, +1}. For simplicity, we assume that ‖x_i‖_2 = 1 for any 1 ≤ i ≤ n, which is standard in the NTK literature.
The two-layer network has weight matrices W ∈ ℝ^{m×d} and a ∈ ℝ^m. We use the following parameterization, which is also used in (Du et al., 2018b; Arora et al., 2019):
\[ f(x; W, a) := \frac{1}{\sqrt m}\sum_{s=1}^{m} a_s\,\sigma\bigl(\langle w_s, x\rangle\bigr), \]
with initialization
\[ w_s(0) \sim \mathcal N(0, I_d) \quad\text{and}\quad a_s \sim \mathrm{Uniform}\bigl(\{-1, +1\}\bigr), \quad\text{independently for } 1 \le s \le m. \]
Note that in this paper, w_s(t) denotes the s-th row of W(t), the weight matrix at step t. We fix a and only train W, as in (Li and Liang, 2018; Du et al., 2018b; Arora et al., 2019; Nitanda and Suzuki, 2019). We consider the ReLU activation σ(z) := max{0, z}, though our analysis can be extended easily to Lipschitz continuous, positively homogeneous activations such as leaky ReLU.
We use the logistic (binary cross entropy) loss ℓ(z) := ln(1 + exp(−z)) and gradient descent. For any W and any 1 ≤ i ≤ n, let f_i(W) := f(x_i; W, a). The empirical risk and its gradient are given by
\[ \widehat{\mathcal R}(W) := \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(y_i f_i(W)\bigr), \qquad \nabla\widehat{\mathcal R}(W) = \frac{1}{n}\sum_{i=1}^{n}\ell'\bigl(y_i f_i(W)\bigr)\,y_i\,\nabla f_i(W). \]
For any t ≥ 0, the gradient descent step is given by W(t+1) := W(t) − η ∇\widehat{\mathcal R}(W(t)), where η > 0 denotes the step size. Also define the empirical risk of the tangent model at step t,
\[ \widehat{\mathcal R}_t(W) := \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(y_i\,\langle\nabla f_i(W(t)), W\rangle\bigr). \]
Note that \widehat{\mathcal R}_t(W(t)) = \widehat{\mathcal R}(W(t)). This property holds due to homogeneity: for any w and any x,
\[ \sigma\bigl(\langle w, x\rangle\bigr) = \mathbb 1\bigl[\langle w, x\rangle > 0\bigr]\,\langle w, x\rangle, \]
and thus ⟨∇f_i(W(t)), W(t)⟩ = f_i(W(t)).
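The following is a minimal NumPy sketch of this setup (an illustration of the description above, not the paper's code): the 1/√m parameterization with fixed random signs a, the empirical logistic risk, one gradient descent step, and a numerical check of the homogeneity identity ⟨∇f_i(W), W⟩ = f_i(W). Helper names such as emp_risk and the constants are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, eta = 32, 5, 256, 1.0

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs, ||x_i|| = 1
y = rng.choice([-1.0, 1.0], size=n)

W = rng.normal(size=(m, d))                        # rows w_s(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)                # fixed random signs

def outputs(W):
    """f_i(W) = <a, relu(W x_i)> / sqrt(m) for every example i."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def emp_risk(W):
    """Empirical logistic risk (1/n) sum_i log(1 + exp(-y_i f_i(W)))."""
    return np.mean(np.log1p(np.exp(-y * outputs(W))))

def grad_risk(W):
    """Gradient of the empirical risk with respect to W."""
    lprime = -1.0 / (1.0 + np.exp(y * outputs(W)))  # l'(y_i f_i) for l(z) = log(1 + e^{-z})
    act = (X @ W.T > 0.0).astype(float)             # activation patterns, shape (n, m)
    # d f_i / d w_s = a_s 1[<w_s, x_i> > 0] x_i / sqrt(m)
    return ((lprime * y)[:, None] * act).T @ X * a[:, None] / (np.sqrt(m) * n)

W_next = W - eta * grad_risk(W)                    # one gradient descent step

# homogeneity check: sum_s <d f_i / d w_s, w_s> equals f_i(W)
pre = X @ W.T
per_unit = (pre > 0.0) * pre * a / np.sqrt(m)
assert np.allclose(per_unit.sum(axis=1), outputs(W))
print("empirical risk before/after one step:", emp_risk(W), emp_risk(W_next))
```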
2 Empirical risk minimization
In this section, we consider a fixed training set and empirical risk minimization. We first state our assumption on the separability of the neural tangent model, and then give our main result and a proof sketch.
Here is some additional notation. Let μ_N denote the Gaussian measure on ℝ^d, given by the standard Gaussian density with respect to the Lebesgue measure on ℝ^d. We consider the following Hilbert space of mappings from ℝ^d to ℝ^d:
\[ \mathcal H := \Bigl\{ v : \mathbb R^d \to \mathbb R^d \;\Big|\; \int \|v(z)\|_2^2 \, d\mu_N(z) < \infty \Bigr\}, \qquad \langle v, v'\rangle_{\mathcal H} := \int \bigl\langle v(z), v'(z)\bigr\rangle\, d\mu_N(z). \]
For each x_i, define φ(x_i) ∈ \mathcal H by
\[ \phi(x_i)(z) := x_i\,\mathbb 1\bigl[\langle z, x_i\rangle > 0\bigr]. \]
One can verify that ‖φ(x_i)‖_{\mathcal H} ≤ ‖x_i‖_2 = 1, and thus φ(x_i) is indeed in \mathcal H.

Our separability assumption is the following: there exist \bar v ∈ \mathcal H and γ > 0 such that ‖\bar v(z)‖_2 ≤ 1 for any z ∈ ℝ^d, and y_i ⟨\bar v, φ(x_i)⟩_{\mathcal H} ≥ γ for any 1 ≤ i ≤ n.
A more natural assumption is that the infinite-width limit of the NTK can separate the training set with a positive margin; this assumption actually implies the separability assumption above (cf. Section 5). Some further discussion of the separability assumption is also given in Section 5.

With the separability assumption in hand, we state our main empirical risk result. Under this assumption, given any risk target ε and any δ ∈ (0, 1), let the width m and the number of iterations be chosen as specified in the theorem statement. Then for any such width and any small enough constant step size, with probability 1 − δ over the random initialization, gradient descent achieves the target empirical risk within the stated number of iterations; moreover, every iterate along the way stays close to the initialization.

While the number of hidden units required by prior work always has a polynomial dependency on n, 1/δ, or 1/ε, our result only requires a polylogarithmic number of hidden units. In the rest of Section 2, we give a proof sketch of this result.
2.1 Properties at initialization
In this subsection, we give some nice properties of random initialization. The proofs are given in Appendix A.
Given an initialization (W(0), a), for any 1 ≤ s ≤ m, define
\[ \bar u_s := \frac{a_s\,\bar v\bigl(w_s(0)\bigr)}{\sqrt m}, \tag{1} \]
where \bar v is given by the separability assumption of Section 2. Collect the \bar u_s into a matrix \bar U ∈ ℝ^{m×d} with rows \bar u_s. It holds that ‖\bar u_s‖_2 ≤ 1/√m, and ‖\bar U‖_F ≤ 1. The first lemma of this subsection ensures that with high probability \bar U has a positive margin at initialization: under the separability assumption, given any δ ∈ (0, 1), if m is large enough, then with probability 1 − δ, it holds simultaneously for all 1 ≤ i ≤ n that y_i ⟨∇f_i(W(0)), \bar U⟩ is at least a positive constant fraction of γ.

For any 1 ≤ i ≤ n, any 1 ≤ s ≤ m, and any radius r > 0, define the event that the s-th hidden unit can change its activation on x_i within radius r, namely that |⟨w_s(0), x_i⟩| ≤ r. The second lemma of this subsection controls, for each i, the fraction of hidden units for which this event holds. It will help us show that \bar U keeps a good margin during the training process: under the condition of the first lemma, for any r > 0, with probability 1 − δ, it holds simultaneously for all 1 ≤ i ≤ n that this fraction is at most a constant multiple of r, plus a deviation term that vanishes as m grows.

Finally, the third lemma of this subsection controls the output of the network at initialization: given any δ ∈ (0, 1), if m is large enough, then with probability 1 − δ, it holds simultaneously for all 1 ≤ i ≤ n that |f(x_i; W(0), a)| is at most of order √(ln(n/δ)).
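As a small numerical illustration (ours, not from the paper) of the quantities these lemmas control, the sketch below samples the initialization from the parameterization described above and prints the largest |f(x_i; W(0), a)| and the largest fraction of units with |⟨w_s(0), x_i⟩| ≤ r over random unit-norm inputs; the lemmas bound these by roughly √(ln(n/δ)) and a constant multiple of r (plus vanishing deviations), respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, r = 100, 10, 4096, 0.05

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
W0 = rng.normal(size=(m, d))                     # w_s(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)              # fixed random signs

pre = X @ W0.T                                   # <w_s(0), x_i>, shape (n, m)
f0 = np.maximum(pre, 0.0) @ a / np.sqrt(m)       # network outputs at initialization
flip_frac = np.mean(np.abs(pre) <= r, axis=1)    # fraction of near-boundary units per x_i

print("max |f(x_i; W(0), a)| over i:", np.abs(f0).max())
print("max flip fraction over i    :", flip_frac.max(), "(radius r =", r, ")")
```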
2.2 Convergence analysis of gradient descent
We analyze gradient descent in this subsection. First, define
\[ \widehat{\mathcal Q}(W) := \frac{1}{n}\sum_{i=1}^{n}\bigl|\ell'\bigl(y_i f_i(W)\bigr)\bigr|. \]
We have the following observations.

- For any W and any 1 ≤ i ≤ n, ‖∇f_i(W)‖_F ≤ ‖x_i‖_2 = 1, and thus ‖ℓ'(y_i f_i(W)) y_i ∇f_i(W)‖_F ≤ |ℓ'(y_i f_i(W))|. Therefore by the triangle inequality, ‖∇\widehat{\mathcal R}(W)‖_F ≤ \widehat{\mathcal Q}(W).
- If ℓ is 1-Lipschitz continuous (as with the logistic loss), then \widehat{\mathcal Q}(W) ≤ 1.
- If |ℓ'| ≤ ℓ (as with the logistic loss; both logistic-loss properties are verified below), then \widehat{\mathcal Q}(W) ≤ \widehat{\mathcal R}(W).
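For completeness, the two logistic-loss properties used in the last two observations can be checked directly (a standard computation, written in our notation):
\[
\ell(z) = \ln\bigl(1 + e^{-z}\bigr), \qquad
\bigl|\ell'(z)\bigr| = \frac{1}{1+e^{z}} \le 1, \qquad
\bigl|\ell'(z)\bigr| = \frac{e^{-z}}{1+e^{-z}} \le \ln\bigl(1+e^{-z}\bigr) = \ell(z),
\]
where the last inequality uses ln(1 + u) ≥ u/(1 + u) for u ≥ 0, applied with u = e^{−z}.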
With the above observations, we give the following general result, which does not require the iterates to stay close to the initialization. It plays an important role in the proof of our main result, and could be useful when analyzing neural networks beyond the NTK setting. For any comparator \bar W and any step t, if the step size η is small enough, then ‖W(t+1) − \bar W‖_F^2 can be bounded in terms of ‖W(t) − \bar W‖_F^2 and the gap between \widehat{\mathcal R}(W(t)) and \widehat{\mathcal R}_t(\bar W). Consequently, if we use a constant step size for all steps, then summing over steps bounds the average empirical risk in terms of ‖W(0) − \bar W‖_F and the tangent risks of \bar W.
Proof.
We have
\[ \bigl\|W(t+1) - \bar W\bigr\|_F^2 = \bigl\|W(t) - \bar W\bigr\|_F^2 - 2\eta\,\bigl\langle \nabla\widehat{\mathcal R}(W(t)),\, W(t) - \bar W\bigr\rangle + \eta^2\,\bigl\|\nabla\widehat{\mathcal R}(W(t))\bigr\|_F^2. \tag{2} \]
The first order term of eq. (2) can be handled using the convexity of ℓ and the homogeneity of the ReLU:
\[ \bigl\langle \nabla\widehat{\mathcal R}(W(t)),\, W(t) - \bar W\bigr\rangle = \frac{1}{n}\sum_{i=1}^{n}\ell'\bigl(y_i f_i(W(t))\bigr)\Bigl(y_i f_i(W(t)) - y_i\bigl\langle\nabla f_i(W(t)), \bar W\bigr\rangle\Bigr) \ge \widehat{\mathcal R}(W(t)) - \widehat{\mathcal R}_t(\bar W). \tag{3} \]
The second order term of eq. (2) is controlled by the preceding observations; rearranging and summing over steps then gives both claims. ∎
Using the preceding lemma together with the initialization properties from Section 2.1, we can prove the main result of Section 2. Below is a proof sketch; the full proof is given in Appendix A.
- We first show that \bar U defined in eq. (1) gives a positive margin at step t, as long as the activation patterns do not change too much from the initialization.
- We then show that such a phase lasts for a long time with a mild overparameterization, by giving a strong control of the movement of the iterates via the lemma of Section 2.2. Prior work only shows a coarse upper bound on this movement, which then requires the number of hidden units to be polynomially large. By contrast, our finer control allows a polylogarithmic overparameterization.
- Next we use the lemma of Section 2.2 once again to get the empirical risk guarantee.
- We also give an upper bound on the total movement ‖W(t) − W(0)‖_F. This will give us a Rademacher complexity bound in Section 3.
3 Generalization
To get a generalization bound, we naturally extend the separability assumption of Section 2 to the following assumption, which is also made in (Nitanda and Suzuki, 2019) for smooth activations: there exist \bar v ∈ \mathcal H and γ > 0 such that ‖\bar v(z)‖_2 ≤ 1 for any z ∈ ℝ^d, and y ⟨\bar v, φ(x)⟩_{\mathcal H} ≥ γ for any (x, y) sampled from the data distribution (i.e., almost surely over (x, y)).
Here is our test error bound under this distributional assumption. Given any target error and any δ ∈ (0, 1), let the width m and the number of iterations be given as in Section 2. Then with the stated sample size, for any such width and any small enough constant step size, with probability 1 − δ over the random initialization and the data sampling, the returned iterate achieves the target test error, where the returned iterate denotes the step with the minimum empirical risk before the final step.
Below is a direct corollary. Under the distributional separability assumption, given any target error ε, using a small enough constant step size and choosing the width, the sample size, and the number of steps accordingly, it holds with high probability that the returned iterate has test error at most ε, where the returned iterate denotes the step with the minimum empirical risk among the steps taken.
To prove the test error bound, we consider the sigmoid-type mapping g(z) := 1/(1 + e^z), the empirical average of g over the training set, and the corresponding population average. First of all, since the zero-one error indicator satisfies 1[z ≤ 0] ≤ 2 g(z), it is enough to control the population average of g(y f(x; W)). Next, as the empirical average is controlled by Section 2 (note that g ≤ ℓ), it is enough to control the generalization error, i.e., the gap between the empirical and population averages of g. Moreover, since g takes values in [0, 1] and is 1-Lipschitz, it is enough to bound the Rademacher complexity of the function space explored by gradient descent. Invoking the bound on ‖W(t) − W(0)‖_F from Section 2 finishes the proof. The proof details are given in Appendix B.
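In summary, and assuming g(z) = 1/(1 + e^z) as above, the chain of reductions can be written as
\[
\Pr_{(x,y)}\bigl[y\,f(x; W) \le 0\bigr]
\;\le\; 2\,\mathbb E\bigl[g\bigl(y f(x; W)\bigr)\bigr]
\;=\; \frac{2}{n}\sum_{i=1}^{n} g\bigl(y_i f_i(W)\bigr)
\;+\; 2\,\Bigl(\mathbb E\bigl[g\bigl(y f(x; W)\bigr)\bigr] - \frac{1}{n}\sum_{i=1}^{n} g\bigl(y_i f_i(W)\bigr)\Bigr),
\]
where the first sum is at most 2\widehat{\mathcal R}(W) since g ≤ ℓ, and the remaining gap is bounded via the Rademacher complexity of the networks reachable by gradient descent.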
To get our test error bound, we use a Lipschitz-based Rademacher complexity bound. One can also use a smoothness-based Rademacher complexity bound (Srebro et al., 2010, Theorem 1) and get a smaller sample complexity. However, the bound becomes complicated and some large constants are introduced. It is an interesting open question to give a clean analysis based on smoothness.
4 Stochastic gradient descent
There are several different formulations of SGD. In this section, we consider SGD with an online oracle. We randomly sample W(0) and a as before, and fix a during training. At step t, a data example (x_t, y_t) is sampled from the data distribution. We still write f(x_t; W, a) for the network output on x_t, and perform the following update on the first layer:
\[ W(t+1) := W(t) - \eta\,\ell'\bigl(y_t\,f(x_t; W(t), a)\bigr)\,y_t\,\nabla_W f(x_t; W(t), a). \]
Note that here each step uses a freshly sampled example, so the number of steps equals the number of samples.
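Below is a minimal NumPy sketch of this online-oracle SGD (an illustration, not the paper's code): each step draws a fresh example and updates only the first layer W, with the signs a fixed. The data distribution sample_example (unit-norm inputs labeled by a halfspace) and all constants are our own stand-ins.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
m, d, eta, T = 512, 20, 1.0, 2000

W = rng.normal(size=(m, d))                      # w_s(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)              # fixed random signs

def sample_example():
    """Hypothetical online oracle: unit-norm x, label given by a halfspace."""
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    return x, np.sign(x[0] + 1e-12)

def f(x, W):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

for t in range(T):
    x, y = sample_example()                      # one fresh example per step
    lprime = -expit(-y * f(x, W))                # l'(z) = -1/(1+e^z) for the logistic loss
    grad = lprime * y * np.outer(a * (W @ x > 0), x) / np.sqrt(m)
    W -= eta * grad                              # update the first layer only

# rough test error estimate on fresh samples
errs = [y * f(x, W) <= 0 for x, y in (sample_example() for _ in range(1000))]
print("estimated test error:", np.mean(errs))
```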
Still under the distributional separability assumption of Section 3, we show the following result: given any target error, using a small enough constant step size and a polylogarithmic width, it holds with high probability that SGD achieves the target test error with the sample complexity stated in Section 4.
Below is a proof sketch; the details are given in Appendix C. For any step t, define the tangent loss at step t, namely the logistic loss applied to the linearization y_t ⟨∇_W f(x_t; W(t), a), W⟩ of the network at W(t). Due to homogeneity, this tangent loss and its gradient agree with the original loss and its gradient at W = W(t).
The first step is an extension of the lemma of Section 2.2 to the SGD setting; the proofs are similar. With a small enough constant step size, for any comparator \bar W and any step t, the distance ‖W(t+1) − \bar W‖_F is controlled in terms of ‖W(t) − \bar W‖_F and the losses at step t.
With this lemma, we can also extend the analysis of Section 2 to the SGD setting and get a bound on the cumulative sample losses, using a similar proof. To further get a bound on the cumulative population risk, the key observation is that the difference between the cumulative population risk and the cumulative sample losses is a martingale. Using a martingale Bernstein bound, we prove a lemma that, given any δ ∈ (0, 1), with probability 1 − δ bounds the cumulative population risk in terms of the cumulative sample losses and a term logarithmic in 1/δ; applying it finishes the proof of the SGD result.
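Concretely, the martingale structure mentioned above can be written as follows, with ℛ denoting the population risk and 𝓕_{t−1} the σ-field generated by the first t − 1 examples (this notation is ours):
\[
\mathbb E\Bigl[\ell\bigl(y_t\,f(x_t; W(t), a)\bigr)\,\Big|\,\mathcal F_{t-1}\Bigr] = \mathcal R\bigl(W(t)\bigr),
\qquad\text{so}\qquad
\sum_{s \le t}\Bigl(\mathcal R\bigl(W(s)\bigr) - \ell\bigl(y_s\,f(x_s; W(s), a)\bigr)\Bigr)
\]
is a martingale in t, since W(t) is determined by the first t − 1 examples while (x_t, y_t) is a fresh, independent sample.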
5 On separability
Given a training set {(x_i, y_i)}_{i=1}^n, the linear kernel is defined as K^{lin}_{i,j} := ⟨x_i, x_j⟩. The maximum margin achievable by a linear classifier is given by
\[ \gamma_{\mathrm{lin}} := \max_{\|u\|_2 \le 1}\ \min_{1\le i\le n}\ y_i\,\langle u, x_i\rangle \;=\; \min_{p \in \Delta_n}\ \sqrt{p^\top\bigl(K^{\mathrm{lin}} \odot yy^\top\bigr)\,p}, \tag{5} \]
where ⊙ denotes the element-wise product of K^{lin} and yy^⊤, and Δ_n denotes the probability simplex over the n examples. If the data is not linearly separable, γ_lin = 0.
In this paper we train the first layer of a two-layer network, and the kernel we consider is the NTK of the first layer:
\[ K^{\mathrm{ntk}}_{i,j} := \mathbb E_{w \sim \mathcal N(0, I_d)}\Bigl[\langle x_i, x_j\rangle\,\mathbb 1\bigl[\langle w, x_i\rangle > 0\bigr]\,\mathbb 1\bigl[\langle w, x_j\rangle > 0\bigr]\Bigr] = \bigl\langle \phi(x_i), \phi(x_j)\bigr\rangle_{\mathcal H}. \]
Similar to the definition of γ_lin, the margin given by K^{ntk} is defined as
\[ \gamma_{\mathrm{ntk}} := \min_{p \in \Delta_n}\ \sqrt{p^\top\bigl(K^{\mathrm{ntk}} \odot yy^\top\bigr)\,p}. \]
Regarding the relation between γ_ntk and the separability assumption of Section 2, we have the following result: if γ_ntk > 0, then there exists \bar v ∈ \mathcal H such that ‖\bar v(z)‖_2 ≤ 1 for any z ∈ ℝ^d, and y_i ⟨\bar v, φ(x_i)⟩_{\mathcal H} is bounded below by a positive quantity determined by γ_ntk, for any 1 ≤ i ≤ n. The proof is given in Appendix D, and uses Fenchel duality theory. The \bar v given by this result satisfies the separability assumption of Section 2 with a margin determined by γ_ntk, but there might exist some \bar v with a much better margin, since the bound given by this result might be very loose.
We can further ask how large γ_ntk could be. Oymak and Soltanolkotabi (2019, Corollary I.2) show that if any two distinct feature vectors x_i and x_j have unit norm and are separated by a positive distance, then the smallest eigenvalue λ_min of the NTK Gram matrix is bounded below in terms of that distance. For arbitrary labels y, since γ_ntk² ≥ λ_min/n, we have the worst-case lower bound γ_ntk ≥ √(λ_min/n). However, real-world labels could give a much better γ_ntk. For example, a tighter lower bound on γ_ntk is √(λ_min/k), where k denotes the number of support vectors, which might be much smaller than n.
On the other hand, given any training set which may have a large margin, if we replace the labels with random labels drawn uniformly from {−1, +1}, then with high probability the margin becomes Õ(1/√n). To see this, let p_u denote the uniform probability vector (1/n, …, 1/n). Note that
\[ \gamma_{\mathrm{ntk}} \le \sqrt{p_u^\top\bigl(K^{\mathrm{ntk}} \odot yy^\top\bigr)\,p_u} = \Bigl\|\frac{1}{n}\sum_{i=1}^{n} y_i\,\phi(x_i)\Bigr\|_{\mathcal H}. \]
Since ‖φ(x_i)‖_{\mathcal H} ≤ 1 for any i, by Hoeffding's inequality it holds with high probability that this norm is Õ(1/√n), and thus the margin is Õ(1/√n).
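As a small numerical illustration of this discussion (ours, not from the paper), the sketch below computes the first-layer ReLU NTK in closed form for unit-norm inputs, estimates the kernel margin through its dual form min_{p∈Δ_n} √(p^⊤(K ⊙ yy^⊤)p), and compares halfspace labels against random labels. The data, labels, and solver choice (SLSQP) are our own stand-ins; the closed-form entry ⟨x_i, x_j⟩(π − arccos⟨x_i, x_j⟩)/(2π) is the standard identity for this kernel.

```python
import numpy as np
from scipy.optimize import minimize

def ntk_first_layer(X):
    """Gram matrix of the first-layer ReLU NTK for unit-norm rows of X."""
    G = np.clip(X @ X.T, -1.0, 1.0)                  # inner products <x_i, x_j>
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)  # <x_i,x_j> Pr_w(w.x_i>0, w.x_j>0)

def kernel_margin(K, y):
    """gamma = min over the probability simplex of sqrt(p^T (K * yy^T) p)."""
    n = len(y)
    M = K * np.outer(y, y)
    res = minimize(lambda p: p @ M @ p, np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
                   method="SLSQP")
    return float(np.sqrt(max(res.fun, 0.0)))

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs
y_true = np.sign(X[:, 0] + 1e-12)                    # labels given by a halfspace
y_rand = rng.choice([-1.0, 1.0], size=n)             # random labels

K = ntk_first_layer(X)
print("margin with halfspace labels:", kernel_margin(K, y_true))
print("margin with random labels:   ", kernel_margin(K, y_rand))
```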
6 Open problems
In this paper, we analyze gradient descent on a two-layer network in the NTK regime, where the weights stay close to the initialization. It is an interesting open question if gradient descent learns something beyond the NTK, after the iterates move far enough from the initial weights. It is also interesting to extend our analysis to other architectures, such as multi-layer networks, convolutional networks, and residual networks. Finally, in this paper we only discuss binary classification; it is interesting to see if it is possible to get similar results for other tasks, such as regression.
References
- Allen-Zhu et al. (2018a) Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
- Allen-Zhu et al. (2018b) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
- Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Bartlett and Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, Nov 2002.
- Beygelzimer et al. (2011) Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.
- Borwein and Zhu (2005) Jonathan M. Borwein and Qiji J. Zhu. Techniques of Variational Analysis, volume 20 of CMS Books in Mathematics. Springer, 2005.
- Cao and Gu (2019a) Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019a.
- Cao and Gu (2019b) Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019b.
- Chizat and Bach (2019) Lenaic Chizat and Francis Bach. A Note on Lazy Training in Supervised Differentiable Programming. arXiv:1812.07956v2 [math.OC], 2019.
- Du et al. (2018a) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018a.
- Du et al. (2018b) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.
- Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Li and Liang (2018) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
- Liang (2016) Percy Liang. Stanford CS229T/STAT231: Statistical Learning Theory, Apr 2016. URL https://web.stanford.edu/class/cs229t/notes.pdf.
- Nitanda and Suzuki (2019) Atsushi Nitanda and Taiji Suzuki. Refined generalization analysis of gradient descent for over-parameterized two-layer neural networks with smooth activations on classification problems. arXiv preprint arXiv:1905.09870, 2019.
- Oymak and Soltanolkotabi (2019) Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
- Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- Srebro et al. (2010) Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in neural information processing systems, pages 2199–2207, 2010.
- Wainwright (2015) Martin J. Wainwright. UC Berkeley Statistics 210B, Lecture Notes: Basic tail and concentration bounds, Jan 2015. URL https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf.
- Zou and Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.
- Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Omitted proofs from Section 2
Proof of the first lemma of Section 2.1.
By the separability assumption, given any 1 ≤ i ≤ n,
\[ y_i\,\mathbb E_{w \sim \mathcal N(0, I_d)}\Bigl[\mathbb 1\bigl[\langle w, x_i\rangle > 0\bigr]\,\langle x_i, \bar v(w)\rangle\Bigr] = y_i\,\bigl\langle \phi(x_i), \bar v\bigr\rangle_{\mathcal H} \ge \gamma. \]
On the other hand,
\[ y_i\,\bigl\langle \nabla f_i(W(0)), \bar U\bigr\rangle = \frac{1}{m}\sum_{s=1}^{m} y_i\,\mathbb 1\bigl[\langle w_s(0), x_i\rangle > 0\bigr]\,\langle x_i, \bar v(w_s(0))\rangle \]
is the empirical mean of i.i.d. random variables supported on [−1, 1] with mean at least γ. Therefore by Hoeffding's inequality, with probability 1 − δ/n, it falls below γ by at most a deviation of order √(ln(n/δ)/m).
Applying a union bound finishes the proof. ∎
Proof of the second lemma of Section 2.1.
Given any fixed 1 ≤ i ≤ n and r > 0,
\[ \Pr\Bigl(\bigl|\langle w_s(0), x_i\rangle\bigr| \le r\Bigr) \le \frac{2r}{\sqrt{2\pi}}, \]
because ⟨w_s(0), x_i⟩ is a standard Gaussian random variable (recall ‖x_i‖_2 = 1) and the density of the standard Gaussian has maximum 1/√(2π). Since the fraction of hidden units with |⟨w_s(0), x_i⟩| ≤ r is the empirical mean of m i.i.d. Bernoulli random variables, by Hoeffding's inequality, with probability 1 − δ/n it exceeds its mean by at most a deviation of order √(ln(n/δ)/m).
Applying a union bound finishes the proof. ∎
To prove the third lemma of Section 2.1, we need the following technical result.
Consider the random vector θ := (h(g_1), …, h(g_m)), where h: ℝ → ℝ is ρ-Lipschitz and g_1, …, g_m are i.i.d. standard Gaussian r.v.'s. Then the r.v. ‖θ‖_2 is ρ-sub-Gaussian around its mean, and thus with probability 1 − δ,
\[ \|\theta\|_2 \le \mathbb E\bigl[\|\theta\|_2\bigr] + \rho\sqrt{2\ln(1/\delta)}. \]
Proof.
Given g ∈ ℝ^m, define
\[ F(g) := \bigl\|\bigl(h(g_1), \ldots, h(g_m)\bigr)\bigr\|_2, \]
where the vector inside the norm is obtained by applying h coordinate-wise to g. For any g, g′ ∈ ℝ^m, by the triangle inequality, we have
\[ \bigl|F(g) - F(g')\bigr| \le \bigl\|\bigl(h(g_1) - h(g_1'), \ldots, h(g_m) - h(g_m')\bigr)\bigr\|_2, \]
and by further using the ρ-Lipschitz continuity of h, we have
\[ \bigl|F(g) - F(g')\bigr| \le \rho\,\|g - g'\|_2. \]
As a result, F is a ρ-Lipschitz continuous function w.r.t. the ℓ_2 norm; hence F(g) is ρ-sub-Gaussian around its mean, and the bound follows by Gaussian concentration (Wainwright, 2015, Theorem 2.4). ∎
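For reference, the Gaussian concentration inequality for Lipschitz functions invoked here is the standard statement: if F: ℝ^m → ℝ is L-Lipschitz with respect to the ℓ_2 norm and g ∼ N(0, I_m), then for every t ≥ 0,
\[ \Pr\bigl(F(g) \ge \mathbb E[F(g)] + t\bigr) \le \exp\Bigl(-\frac{t^2}{2L^2}\Bigr). \]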
Proof of the third lemma of Section 2.1.
Given 1 ≤ i ≤ n, let θ_i := (σ(⟨w_1(0), x_i⟩), …, σ(⟨w_m(0), x_i⟩)). By the preceding technical result (applied with the 1-Lipschitz function σ), the r.v. ‖θ_i‖_2 is 1-sub-Gaussian around its mean, and with probability at least 1 − δ/(2n) over W(0),
\[ \|\theta_i\|_2 \le \mathbb E\bigl[\|\theta_i\|_2\bigr] + \sqrt{2\ln(2n/\delta)}. \]
On the other hand, by Jensen's inequality,
\[ \mathbb E\bigl[\|\theta_i\|_2\bigr] \le \sqrt{\mathbb E\bigl[\|\theta_i\|_2^2\bigr]} = \sqrt{m/2}. \]
As a result, with probability 1 − δ/(2n), it holds that ‖θ_i‖_2 ≤ √(m/2) + √(2 ln(2n/δ)). By a union bound, with probability 1 − δ/2 over W(0), this holds for all 1 ≤ i ≤ n.

For any W(0) such that the above event holds, and for any 1 ≤ i ≤ n, the r.v. f(x_i; W(0), a) = (1/√m) Σ_s a_s σ(⟨w_s(0), x_i⟩) is, over the random signs a, sub-Gaussian with variance proxy ‖θ_i‖_2²/m. By Hoeffding's inequality, with probability 1 − δ/(2n) over a, |f(x_i; W(0), a)| is at most of order √(ln(n/δ)).
By the union bound, with probability 1 − δ/2 over a, this holds for all 1 ≤ i ≤ n.
The probability that the above events all happen is at least 1 − δ, over W(0) and a. ∎
Proof of the main result of Section 2.
The condition on m ensures that the three lemmas of Section 2.1 hold with the stated choices of margin and failure probability.
For any 1 ≤ i ≤ n and any step t, let the following quantity denote the proportion of activation patterns on x_i that are different from step 0 to step t. Formally,