Modern deep neural networks are trained in a highly over-parameterized regime, with many more trainable parameters than training examples. It is well-known that these networks trained with simple first-order methods can fit any labels, even completely random ones (Zhang et al., 2017). Although training on properly labeled data usually leads to good generalization performance, the ability to over-fit the entire training dataset is undesirable for generalization when noisy labels are present. Since mislabeled data are ubiquitous in very large datasets (Krishna et al., 2016), principled methods that are robust to noisy labels are expected to improve generalization significantly.
In order to prevent over-fitting to mislabeled data, some form of regularization is necessary. A simple such example is early stopping, which has been observed to be effective for this purpose (Rolnick et al., 2017; Guan et al., 2018; Li et al., 2019). For instance, on MNIST, even when 90% of the labels are corrupted, training a neural net with early stopping can still give 90% test accuracy (Li et al., 2019). How to explain such a generalization phenomenon is an intriguing theoretical question.
In an effort to understand the optimization and generalization mysteries in deep learning, theoretical progress has been made recently by considering very wide deep neural nets (Du et al., 2019, 2018; Li and Liang, 2018; Allen-Zhu et al., 2018a, b; Zou et al., 2018; Arora et al., 2019b; Cao and Gu, 2019). When the width of every hidden layer is sufficiently large, it was shown that (stochastic) gradient descent with random initialization can almost always drive the training loss to zero; under further assumptions the trained net can be shown to have good generalization guarantees. In this line of work, width plays an important role: the parameters of a wide neural net stay close to their initialization during gradient descent training, and as a consequence, the neural net can be effectively approximated by its first-order Taylor expansion with respect to its parameters at initialization. This leads to tractable linear dynamics, and the final solution can be characterized by kernel regression using a particular kernel, which was named the neural tangent kernel (NTK) by Jacot et al. (2018). Such kernels were further explicitly studied by Lee et al. (2019); Yang (2019); Arora et al. (2019a).
The connection between over-parameterized neural nets and neural tangent kernels is theoretically appealing because a pure kernel method captures the power of a fully trained neural net. These kernels also exhibit reasonable empirical performance on CIFAR-10 (Arora et al., 2019a), thus suggesting that ultra-wide (or infinitely wide) neural nets are at least not irrelevant.
The aforementioned papers do not provide an explanation of the generalization behavior in the presence of mislabeled training data, because the training loss will always go to zero for any labels, i.e., over-fitting will happen. However, one may use the theoretical insights from the kernel viewpoint to design and analyze regularization methods for neural net training in the over-parameterized regime.
This paper makes progress towards a theoretical explanation of the generalization phenomenon in over-parameterized neural nets when noisy labels are present. In particular, inspired by the correspondence between wide neural nets and neural tangent kernels (NTKs), we propose two simple regularization methods for neural net training:
Regularization by distance to initialization. Denote by $\theta$ the network parameters and by $\theta(0)$ their random initialization. This method adds a regularizer $\frac{\lambda^2}{2}\|\theta - \theta(0)\|_2^2$ to the training objective.
Adding an auxiliary variable for each training example. Let $(x_i, \tilde{y}_i)$ be the $i$-th training example and $f(\theta, \cdot)$ represent the neural net. This method adds a trainable variable $b_i$ and tries to fit the $i$-th label using $f(\theta, x_i) + \lambda b_i$. At test time, only the neural net $f(\theta, \cdot)$ is used and the auxiliary variables are discarded.
We prove that for wide neural nets, both methods, when trained with gradient descent to convergence, correspond to kernel ridge regression using the NTK, which is usually regarded as an alternative to early stopping in the kernel literature.
Then we prove a generalization bound of the learned predictor on the true data distribution when the training data labels are corrupted. This generalization bound depends on the (unobserved) true labels, and is comparable to the bound one can get when there is no label noise, therefore indicating that the proposed regularization methods are robust to noisy labels.
The effectiveness of these two regularization methods is verified empirically – on MNIST and CIFAR, they achieve similar test accuracy to early stopping, with test error much lower than the noise level in the training dataset.
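The common endpoint of both methods (made precise in Section 4) is kernel ridge regression with the NTK. The denoising effect can be illustrated with a minimal sketch, in which an RBF kernel stands in for the NTK; all names and hyper-parameter choices here are ours, not the paper's:

```python
import numpy as np

def rbf_kernel(A, B, bw=0.2):
    # Pairwise RBF kernel matrix; a stand-in for the NTK in this illustration.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def kernel_ridge_predict(X_tr, y_tr, X_te, lam):
    # f(x) = ker(x, X)^T (K + lam^2 I)^{-1} y: the kernel ridge regression
    # predictor that both regularization methods converge to in the wide regime.
    K = rbf_kernel(X_tr, X_tr)
    alpha = np.linalg.solve(K + lam ** 2 * np.eye(len(K)), y_tr)
    return rbf_kernel(X_te, X_tr) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y_true = np.sign(X[:, 0])                  # clean +/-1 labels
flip = rng.random(200) < 0.2               # corrupt 20% of the labels
y_noisy = np.where(flip, -y_true, y_true)

pred = kernel_ridge_predict(X, y_noisy, X, lam=1.0)
acc_vs_true = float(np.mean(np.sign(pred) == y_true))
```

Despite training only on noisy labels, the ridge term keeps the fit smooth, so the sign of the prediction agrees with the unobserved clean labels on most points, whereas an interpolating fit would reproduce every flipped label.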
Additional related work.
Li et al. (2019) proved that gradient descent with early stopping is robust to label noise for an over-parameterized two-layer neural net. Under a clustering assumption on the data, they showed that gradient descent fits the correct labels before starting to over-fit wrong labels. Their result differs from ours in several aspects: they only considered two-layer nets while we allow arbitrarily deep nets; they required a clustering assumption on the data while our generalization bound is general and data-dependent; furthermore, they did not address the question of generalization, but only provided guarantees on the training data. Interestingly, Li et al. (2019) quantified over-fitting through the distance of the parameters to their initialization, $\|\theta - \theta(0)\|_2$. This exactly corresponds to our first regularization method, which we show can help generalization despite the presence of noisy labels. Distance to initialization was also believed to be related to generalization in deep learning (Neyshabur et al., 2019; Nagarajan and Kolter, 2019). These studies provide additional motivation for our regularization method.
Our methods are inspired by kernel ridge regression, which is one of the most common kernel methods and has been widely studied. It was shown to perform comparably to early-stopped gradient descent (Bauer et al., 2007; Gerfo et al., 2008; Raskutti et al., 2014; Wei et al., 2017). Accordingly, we indeed observe in our experiments that our regularization methods perform similarly to gradient descent with early stopping in neural net training.
A line of empirical work proposed various methods to deal with mislabeled examples, e.g. (Sukhbaatar et al., 2014; Liu and Tao, 2015; Veit et al., 2017; Northcutt et al., 2017; Jiang et al., 2017; Ren et al., 2018). Our work provides a theoretical guarantee on generalization by using principled and considerably simpler regularization methods.
In Section 2 we review how training wide neural nets is connected to the kernel method. In Section 3 we introduce the setting considered in this paper. In Section 4 we describe our regularization methods and show how they lead to kernel ridge regression. In Section 5 we prove a generalization guarantee for our methods. We provide experimental results in Section 6 and conclude in Section 7.
We use bold-faced letters for vectors and matrices. For a matrix $\mathbf{A}$, $[\mathbf{A}]_{i,j}$ is its $(i,j)$-th entry. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. $\langle \cdot, \cdot \rangle$ represents the standard inner product. Denote by $\lambda_{\min}(\mathbf{A})$ the minimum eigenvalue of a positive semidefinite matrix $\mathbf{A}$. $\mathbf{I}$ is the identity matrix of appropriate dimension. Let $[n] = \{1, 2, \ldots, n\}$. Let $\mathbb{1}\{E\}$ be the indicator of event $E$.
2 Recap of Neural Tangent Kernel
It has been proved in a series of recent papers (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a) that a sufficiently wide neural net trained with randomly initialized gradient descent stays close to its first-order Taylor expansion with respect to its parameters at initialization. This leads to essentially linear dynamics, and the final learned neural net is close to the kernel regression solution with respect to a particular kernel named the neural tangent kernel (NTK). This section briefly and informally recaps this theory.
Let $f(\theta, x)$ be an $L$-layer fully connected neural network with scalar output, where $x$ is the input and $\theta$ is the collection of all network parameters. Here $\mathbf{W}^{(l)}$ is the weight matrix in the $l$-th layer ($l \in [L]$) and is initialized using i.i.d. $\mathcal{N}(0, 1)$ entries, $\sigma(\cdot)$ is a coordinate-wise activation function, and $c_\sigma$ is an activation-dependent normalization constant. The hidden widths are allowed to go to infinity. Suppose that the net is trained by minimizing the squared loss over a training dataset $\{(x_i, y_i)\}_{i=1}^n$: $\ell(\theta) = \frac{1}{2}\sum_{i=1}^n \big(f(\theta, x_i) - y_i\big)^2$.
Let the random initial parameters be $\theta(0)$, and let the parameters $\theta(t)$ be updated according to gradient descent on $\ell(\theta)$. It is shown that if the network is sufficiently wide, the parameters will stay close to the initialization during training so that the following first-order approximation is accurate:
The above approximation is exact in the infinite width limit, but can also be shown when the width is finite but sufficiently large.
In the following, we assume the linear approximation (1) and proceed to derive the solution of gradient descent. Denote $z_i = \nabla_\theta f(\theta(0), x_i)$ for each $i \in [n]$ and let $z(x) = \nabla_\theta f(\theta(0), x)$ for any input $x$. Then using (1) we obtain the following approximation for the training objective $\ell(\theta)$:
Since we have reduced the problem to training a linear model with the squared loss, the gradient descent iterates can be written analytically: starting from $\theta(0)$ and updating using $\theta(t+1) = \theta(t) - \eta \nabla \ell(\theta(t))$, we can easily derive the solution at every iteration, where $\mathbf{Z} = (z_1, \ldots, z_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$ (assuming $\mathbf{Z}^\top \mathbf{Z}$ is invertible). Assuming a sufficiently small learning rate $\eta$, the final solution at $t = \infty$ is $\theta(\infty) = \theta(0) + \mathbf{Z}(\mathbf{Z}^\top \mathbf{Z})^{-1}\mathbf{y}$, and the corresponding predictor is
Suppose that the neural net and its initialization are defined so that the initial outputs are small, i.e., $f(\theta(0), x_i) \approx 0$ for all $i \in [n]$ and $f(\theta(0), x) \approx 0$ for any test input $x$.1 Then we have

1 We can ensure small or even zero output at initialization by either multiplying the output by a small factor (as done in (Arora et al., 2019a, b)), or using the following “difference trick”: define the network to be the (scaled) difference between two networks with the same architecture, i.e., $f(\theta, x) = \frac{1}{\sqrt{2}}\big(g(\theta_1, x) - g(\theta_2, x)\big)$ with $\theta = (\theta_1, \theta_2)$; then initialize $\theta_1$ and $\theta_2$ to be the same (and still random); this ensures $f(\theta, x) = 0$ at initialization, while keeping the NTK unchanged. See details in Appendix A.
Neural tangent kernel (NTK).
The solution in (2) is in fact the kernel regression solution with respect to the kernel induced by the (random) features $z(\cdot)$, which is defined as $\ker(x, x') = \langle z(x), z(x') \rangle$ for all $x, x'$. This kernel was named the neural tangent kernel (NTK) by Jacot et al. (2018). Although this kernel is random, it is shown that when the network is sufficiently wide, the random kernel converges to a deterministic limit in probability (Arora et al., 2019a). Note that using the NTK we can rewrite the predictor at the end of training, (2), as
where $\mathbf{X} = (x_1, \ldots, x_n)$ represents the training inputs, $\ker(x, \mathbf{X}) = \big(\ker(x, x_1), \ldots, \ker(x, x_n)\big)^\top$, and $\mathbf{K} \in \mathbb{R}^{n \times n}$ with $[\mathbf{K}]_{i,j} = \ker(x_i, x_j)$.
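The derivation above can be checked numerically in the linearized regime: gradient descent from zero on the linear model $\omega \mapsto \mathbf{Z}^\top \omega$ converges to a solution whose predictions agree with the kernel regression formula $\ker(x, \mathbf{X})^\top \mathbf{K}^{-1} \mathbf{y}$. A small numpy sketch, with random features standing in for the NTK features $z(\cdot)$ (the setup and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                       # n training examples, d parameters (d > n)
Z = rng.standard_normal((d, n))     # column i plays the role of z_i
y = rng.standard_normal(n)
K = Z.T @ Z                         # kernel matrix: K_ij = <z_i, z_j>

omega = np.zeros(d)                 # omega stands for theta - theta(0)
eta = 0.9 / np.linalg.eigvalsh(K).max()
for _ in range(20000):              # GD on (1/2) ||Z^T omega - y||^2
    omega -= eta * Z @ (Z.T @ omega - y)

z_test = rng.standard_normal(d)     # features of a hypothetical test input
f_gd = z_test @ omega               # trained linear model's prediction
k = Z.T @ z_test                    # ker(x, X) for the test input
f_ntk = k @ np.linalg.solve(K, y)   # kernel regression prediction
```

Because gradient descent from zero stays in the span of the feature vectors, its limit is the minimum-norm interpolant, and the two predictions coincide.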
3 Setting: Learning from Noisily Labeled Data
In this section we formally describe the model of learning from noisily labeled data considered throughout this paper. Suppose that there is an underlying data distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$ of interest, but we only have access to samples from a noisy version of $\mathcal{D}$. Formally, consider the following data generation process:
draw $(x, y^*)$ from $\mathcal{D}$; conditioned on $(x, y^*)$, let $\epsilon$ be drawn from a noise distribution $\mathcal{E}_{x, y^*}$ over $\mathbb{R}$ that may depend on $x$ and $y^*$; and let the observed label be $\tilde{y} = y^* + \epsilon$.
Let $\tilde{\mathcal{D}}$ be the joint distribution of $(x, \tilde{y})$ from the above process. Intuitively, $y^*$ represents the true label and $\tilde{y}$ is the noisy label.
Let $(x_1, \tilde{y}_1), \ldots, (x_n, \tilde{y}_n)$ be i.i.d. samples from $\tilde{\mathcal{D}}$; we only have access to these samples, i.e., we only observe inputs and their noisy labels, but do not observe true labels. The goal is to learn a function $f$ (in the form of a neural net) that can predict the true label well on the distribution $\mathcal{D}$, i.e., to find a function $f$ that makes the population loss
$L_{\mathcal{D}}(f) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}}\big[\ell(f(x), y^*)\big]$
as small as possible for a certain loss function $\ell$.
To set up notation for later sections, we denote $\mathbf{X} = (x_1, \ldots, x_n)$, $\mathbf{y}^* = (y_1^*, \ldots, y_n^*)^\top$, $\tilde{\mathbf{y}} = (\tilde{y}_1, \ldots, \tilde{y}_n)^\top$, and $\boldsymbol{\epsilon} = \tilde{\mathbf{y}} - \mathbf{y}^*$.
The goal of learning from noisy labels would be impossible without assuming some correlation between true and noisy labels. The assumption we make throughout this paper is that for any $(x, y^*)$, the noise distribution $\mathcal{E}_{x, y^*}$ has mean $0$ and is subgaussian with parameter $\sigma$. As we explain below, this already captures the interesting case of having corrupted labels in a classification task.
Example: binary classification with partially corrupted labels.
Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$, but in the noisy observations, each label is flipped with probability $p$ ($0 \le p < \frac{1}{2}$). Namely, for $(x, y^*) \sim \mathcal{D}$, the observed noisy label $\tilde{y}$ is equal to $y^*$ with probability $1 - p$, and is equal to $-y^*$ with probability $p$. The joint distribution of $(x, \tilde{y})$ in this case does not satisfy our assumption, because the mean of $\tilde{y} - y^*$ is non-zero (conditioned on $y^*$). Nevertheless, this issue can be fixed by considering $(1-2p)y^*$ as the true label instead.2 Then we can easily check that conditioned on $(x, y^*)$, the noise $\tilde{y} - (1-2p)y^*$ has mean $0$ and is subgaussian (since it is bounded). Therefore, this is a special case of having zero-mean subgaussian noise in the label.

2 For binary classification, only the sign of the label matters, so we can assume that the true labels are from $\{\pm(1-2p)\}$ instead of $\{\pm 1\}$, without changing the classification problem.
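The rescaling in this example is easy to check empirically: after treating $(1-2p)y^*$ as the true label, the residual noise $\tilde{y} - (1-2p)y^*$ is zero-mean. A quick sanity check (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2                                    # label-flip probability
n = 200_000
y_star = rng.choice([-1.0, 1.0], size=n)   # true +/-1 labels
flip = rng.random(n) < p
y_tilde = np.where(flip, -y_star, y_star)  # observed noisy labels

# Conditioned on y*, E[y_tilde] = (1 - 2p) y*, so the recentered noise
# eps = y_tilde - (1 - 2p) y* has mean 0 (and is bounded, hence subgaussian).
eps = y_tilde - (1 - 2 * p) * y_star
```

With a large sample, the empirical mean of `eps` is close to zero and the empirical correlation between noisy and true labels is close to $1-2p$, matching the calculation in the example.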
4 Regularization Methods
Given a noisily labeled training dataset $\{(x_i, \tilde{y}_i)\}_{i=1}^n$, let $f(\theta, \cdot)$ be a neural net to be trained. A direct, unregularized training method would minimize an objective function like $\frac{1}{2}\sum_{i=1}^n \big(f(\theta, x_i) - \tilde{y}_i\big)^2$. To prevent over-fitting, we suggest the following simple regularization methods that slightly modify this objective.
Method 1: Regularization using Distance to Initialization (RDI).
We let the initial parameters $\theta(0)$ be randomly generated, and minimize the following regularized objective:
$\ell^{RDI}(\theta) = \frac{1}{2}\sum_{i=1}^n \big(f(\theta, x_i) - \tilde{y}_i\big)^2 + \frac{\lambda^2}{2}\big\|\theta - \theta(0)\big\|_2^2.$
Method 2: adding an AUXiliary variable for each training example (AUX).
We add an auxiliary trainable parameter $b_i \in \mathbb{R}$ for each $i \in [n]$, and minimize the following modified objective:
$\ell^{AUX}(\theta, \mathbf{b}) = \frac{1}{2}\sum_{i=1}^n \big(f(\theta, x_i) + \lambda b_i - \tilde{y}_i\big)^2,$
where $\mathbf{b} = (b_1, \ldots, b_n)^\top$ is initialized to be $\mathbf{0}$.
4.1 Equivalence to Kernel Ridge Regression in Wide Neural Nets
Now we assume the setting in Section 2, where the neural net architecture is sufficiently wide so that the first-order approximation (1) is accurate during gradient descent. Recall that we have $z(x) = \nabla_\theta f(\theta(0), x)$, which induces the NTK $\ker(x, x') = \langle z(x), z(x') \rangle$. Also recall that we can assume near-zero initial output: $f(\theta(0), x) \approx 0$ (see Footnote 1). Therefore we have the approximation $f(\theta, x) \approx \langle z(x), \theta - \theta(0) \rangle$.
The following theorem shows that in either case, gradient descent leads to the same dynamics and converges to the kernel ridge regression solution using the NTK.
Fix a learning rate $\eta > 0$. Consider gradient descent on $\ell^{RDI}(\theta)$ with initialization $\theta^{RDI}(0) = \theta(0)$:
$\theta^{RDI}(t+1) = \theta^{RDI}(t) - \eta \nabla_\theta \ell^{RDI}\big(\theta^{RDI}(t)\big), \quad (7)$
and gradient descent on $\ell^{AUX}(\theta, \mathbf{b})$ with initialization $\theta^{AUX}(0) = \theta(0)$ and $\mathbf{b}(0) = \mathbf{0}$:
$\big(\theta^{AUX}(t+1), \mathbf{b}(t+1)\big) = \big(\theta^{AUX}(t), \mathbf{b}(t)\big) - \eta \nabla \ell^{AUX}\big(\theta^{AUX}(t), \mathbf{b}(t)\big). \quad (8)$
Then we must have $\theta^{RDI}(t) = \theta^{AUX}(t)$ for all $t$. Furthermore, if the learning rate satisfies $\eta < \frac{2}{\lambda_{\max}(\mathbf{K}) + \lambda^2}$, then $\theta^{AUX}(t)$ converges linearly to a limit $\theta^{AUX}(\infty)$ such that:
$f\big(\theta^{AUX}(\infty), x\big) = \ker(x, \mathbf{X})^\top \big(\mathbf{K} + \lambda^2 \mathbf{I}\big)^{-1} \tilde{\mathbf{y}}. \quad (9)$
The gradient of $\ell^{AUX}$ can be written as $\nabla_\theta \ell^{AUX}(\theta, \mathbf{b}) = \mathbf{Z}\mathbf{r}$ and $\nabla_{\mathbf{b}} \ell^{AUX}(\theta, \mathbf{b}) = \lambda \mathbf{r}$, where $\mathbf{r} = \big(f(\theta, x_i) + \lambda b_i - \tilde{y}_i\big)_{i=1}^n$ is the residual vector. Therefore we have $\nabla_\theta \ell^{AUX} = \frac{1}{\lambda}\mathbf{Z}\,\nabla_{\mathbf{b}} \ell^{AUX}$. Then, according to the gradient descent update rule (8), we know that $\theta^{AUX}(t) - \theta(0)$ and $\mathbf{b}(t)$ can always be related by $\theta^{AUX}(t) - \theta(0) = \frac{1}{\lambda}\mathbf{Z}\mathbf{b}(t)$. It follows that
$\theta^{AUX}(t+1) = \theta^{AUX}(t) - \eta \Big(\mathbf{Z}\big(\mathbf{Z}^\top(\theta^{AUX}(t) - \theta(0)) - \tilde{\mathbf{y}}\big) + \lambda^2 \big(\theta^{AUX}(t) - \theta(0)\big)\Big).$
On the other hand, from (7) we have
$\theta^{RDI}(t+1) = \theta^{RDI}(t) - \eta \Big(\mathbf{Z}\big(\mathbf{Z}^\top(\theta^{RDI}(t) - \theta(0)) - \tilde{\mathbf{y}}\big) + \lambda^2 \big(\theta^{RDI}(t) - \theta(0)\big)\Big).$
Comparing the above two equations, we find that $\theta^{AUX}(t)$ and $\theta^{RDI}(t)$ have the same update rule. Since $\theta^{AUX}(0) = \theta^{RDI}(0) = \theta(0)$, this proves $\theta^{RDI}(t) = \theta^{AUX}(t)$ for all $t$.
The second part of the theorem can be proved by noticing that, along the trajectory, $\mathbf{b}(t)$ follows gradient descent on a strongly convex quadratic function. Therefore gradient descent converges linearly to its unique optimum, which can be easily calculated. We defer the details to Appendix B. ∎
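The equivalence in Theorem 4.1 can be sanity-checked numerically in the linearized regime. Below, gradient descent is run on the two objectives for a linear model $f(\omega, x) = z(x)^\top \omega$ (a stand-in for the wide-net approximation; the dimensions and seed are arbitrary choices of ours). The two parameter trajectories coincide, and both converge to the kernel ridge regression solution $(\mathbf{K} + \lambda^2 \mathbf{I})^{-1}\tilde{\mathbf{y}}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 15, 40, 1.0
Z = rng.standard_normal((d, n))      # column i stands for z_i
y = rng.standard_normal(n)           # noisy labels
K = Z.T @ Z
eta = 1.0 / (np.linalg.eigvalsh(K).max() + lam ** 2)

w_rdi = np.zeros(d)                  # theta - theta(0) under RDI
w_aux, b = np.zeros(d), np.zeros(n)  # theta - theta(0) and aux vars under AUX
for _ in range(5000):
    # RDI objective: (1/2)||Z^T w - y||^2 + (lam^2/2)||w||^2
    w_rdi = w_rdi - eta * (Z @ (Z.T @ w_rdi - y) + lam ** 2 * w_rdi)
    # AUX objective: (1/2)||Z^T w + lam*b - y||^2 (joint GD step on w and b)
    r = Z.T @ w_aux + lam * b - y
    w_aux, b = w_aux - eta * Z @ r, b - eta * lam * r

# Kernel ridge predictions on the training inputs: K (K + lam^2 I)^{-1} y
ridge_pred = K @ np.linalg.solve(K + lam ** 2 * np.eye(n), y)
```

The residual identity $\theta^{AUX}(t) - \theta(0) = \frac{1}{\lambda}\mathbf{Z}\mathbf{b}(t)$ from the proof is what forces the two trajectories to agree step by step.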
If no regularization were used, the labels would be fitted perfectly and the learned function would be $f(x) = \ker(x, \mathbf{X})^\top \mathbf{K}^{-1} \tilde{\mathbf{y}}$ (c.f. (2)). Therefore the effect of regularization is to add $\lambda^2 \mathbf{I}$ to the kernel matrix, and (9) is known as the solution to kernel ridge regression in the kernel literature. In Section 5, we give a generalization bound for this solution on the true data distribution, which is comparable to the bound one can obtain when true labels are available.
5 Generalization Guarantee
In this section we analyze the population loss of the predictor $f$ defined in (9) on the true data distribution $\mathcal{D}$. Our main result is the following theorem:
Assume that the true labels are bounded, i.e., $|y_i^*| \le 1$ for all $i \in [n]$, and that the kernel matrix satisfies $\lambda_{\min}(\mathbf{K}) > 0$. Consider any loss function $\ell: \mathbb{R} \times \mathbb{R} \to [0, 1]$ that is $1$-Lipschitz in the first argument and satisfies $\ell(y, y) = 0$. Then, with probability at least $1 - \delta$ we have
In order for the second term in (10) to go to $0$ as $n \to \infty$, we need to choose $\lambda$ to grow with $n$, e.g., $\lambda = n^c$ for some small constant $c > 0$. Then, the only remaining term in (10) to worry about is the first one. Notice that it depends on the (unobserved) true labels $\mathbf{y}^*$, instead of the noisy labels $\tilde{\mathbf{y}}$. By a very similar proof, one can show that training on the true labels (without regularization) leads to a comparable population loss bound without the factor $\lambda$. In comparison, we can see that even when there is label noise, we only lose a factor of $\lambda$ in the population loss on the true distribution, and $\lambda$ can be chosen as any slowly growing function of $n$. If $\mathbf{y}^{*\top} \mathbf{K}^{-1} \mathbf{y}^*$ grows much slower than $n$, by choosing an appropriate $\lambda$, our result indicates that the underlying distribution is learnable in the presence of label noise. See Remark 5.2 for an example.
Arora et al. (2019b) proved that two-layer ReLU neural nets trained with gradient descent can learn a class of smooth functions on the unit sphere. Their proof is by showing that $\mathbf{y}^{*\top} \mathbf{K}^{-1} \mathbf{y}^*$ is bounded if $y^* = g(x)$ for certain functions $g$, where $\mathbf{K}$ is the NTK corresponding to two-layer ReLU nets. Combined with their result, Theorem 5.1 implies that the same class of functions can be learned by the same network even if the labels are noisy.
Proof sketch of Theorem 5.1.
The proof is given in Section C.1. It has three main steps: first, bound the training error of $f$ on the true labels $\mathbf{y}^*$; second, bound the RKHS (reproducing kernel Hilbert space) norm of $f$, which in fact equals the distance of $\theta$ to its initialization, $\|\theta - \theta(0)\|_2$; finally, the population loss bound is proved via Rademacher complexity. ∎
As shown in Section 3, binary classification with partially corrupted labels can be viewed as a special case of the noisy label model considered in this paper. Therefore, Theorem 5.1 has the following corollary on classification. Its proof is given in Section C.2.
Under the same assumptions as in Theorem 5.1, and additionally assuming the binary classification setting from Section 3 with label-flip probability $p$, with probability at least $1 - \delta$ we have
6 Experiments
In this section, we compare the performance of our regularization methods from Section 4 – regularization by distance to initialization (RDI) and regularization by adding auxiliary variables (AUX) – against vanilla gradient descent or stochastic gradient descent (GD/SGD), with or without early stopping. We perform binary classification with the squared loss in two settings: a two-layer fully-connected (FC) net on MNIST (“5” vs “8”) and a deep convolutional neural net (CNN) on CIFAR (“airplanes” vs “automobiles”). See the detailed description in Appendix E. The labels are corrupted with different levels of noise. We summarize our experimental findings as follows:
Both RDI and AUX can prevent over-parameterized neural networks from over-fitting the noise in the labels. The two methods have almost the same performance when trained with GD.
The weights of over-parametrized networks trained by GD/SGD remain close to the initialization, even with noisy labels. Moreover, as expected, RDI and AUX force the weights to stay closer to the initialization, and thus generalize even with noisy labels.
6.1 Performance of Regularization Methods
We first evaluate the performance of GD with AUX and RDI for binary classification with 1000 training samples from MNIST (see Figure 1). We observe that both methods GD+AUX and GD+RDI achieve much higher test accuracy than vanilla GD which over-fits the noisy training dataset, and they achieve similar test accuracy to GD with early stopping. According to Theorem 4.1, GD+AUX and GD+RDI should have similar performances, which is also empirically verified in Figure 1.
We then evaluate the performance of SGD on CIFAR with 20% of the labels flipped (see Figure 1(a)). In this setting, SGD+AUX outperforms SGD with early stopping by 1%. We also observe that there is a gap between the performance of SGD+AUX and that of SGD+RDI. An explanation is that SGD+AUX and SGD+RDI may have different trajectories, since Theorem 4.1 only applies to GD on sufficiently over-parameterized networks. Noting that the weights in SGD+RDI end up farther from initialization than the weights in SGD+AUX (Figure 1(b)), we conjecture that the gap comes from SGD+RDI suffering more perturbation along its trajectory due to the finite width. Indeed, Lemma D.1 provides some theoretical evidence for the benefits of AUX: it states that AUX enjoys a linear convergence rate under gradient flow even if the network is not over-parameterized. Additional figures for other noise levels can be found in Appendix F.
6.2 Distance of Weights to Their Initialization
A key property of the NTK regime is that the training loss is driven to zero with minimal movement of the weights from their initialization, which in our case is bounded in Lemma C.2. We indeed find that this distance does not increase as the width of the network changes, as long as the network remains over-parametrized – it depends only on the training data and labels, and is much smaller than the norm of the initial weights. However, the weights still tend to move more with noise than without noise3, which leads to over-fitting. As shown in Figure 1(b), explicit regularization (AUX and RDI) reduces the moving distance of the weights, thus giving better generalization (see Figure 1(a)).

3 See Figures 8 and 9 in Appendix F for the distance in the noiseless case.
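The same effect is visible even in the linearized model: with noisy labels, the interpolating (unregularized) solution moves the parameters strictly farther from initialization than the ridge-regularized solution does. A toy numpy check using closed forms (an illustration of the mechanism, not the paper's experiment; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 30, 200, 2.0
Z = rng.standard_normal((d, n)) / np.sqrt(d)   # stand-in NTK features
y_clean = np.sign(Z[0])                        # arbitrary "true" labels
y_noisy = y_clean + rng.standard_normal(n)     # add label noise
K = Z.T @ Z

# omega = theta - theta(0), so ||omega|| is the distance to initialization.
# Unregularized interpolation: omega = Z K^{-1} y.
w_interp = Z @ np.linalg.solve(K, y_noisy)
# RDI / kernel ridge: omega = Z (K + lam^2 I)^{-1} y.
w_rdi = Z @ np.linalg.solve(K + lam ** 2 * np.eye(n), y_noisy)

dist_interp = float(np.linalg.norm(w_interp))
dist_rdi = float(np.linalg.norm(w_rdi))
```

In the eigenbasis of $\mathbf{K}$ the two squared distances are $\sum_i c_i^2/\mu_i$ versus $\sum_i c_i^2 \mu_i/(\mu_i + \lambda^2)^2$, so the regularized distance is always strictly smaller, consistent with Figure 1(b).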
Table 1 summarizes the relationship between the distance to initialization and other hyper-parameters that we observe from various experiments.
[Table 1: how the distance to initialization varies with # samples, noise level, width, regularization strength, and learning rate.]
7 Conclusion
Towards understanding the generalization of deep neural networks in the presence of noisy labels, this paper presents two simple regularization methods and shows that they are theoretically and empirically effective. The theoretical insights behind these methods come from the correspondence between neural networks and kernels. We believe that a better understanding of such correspondence could help the design of other principled methods in practice.
This work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. The authors thank Sanjeev Arora for helpful discussions and suggestions.
- Allen-Zhu et al. (2018a) Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
- Allen-Zhu et al. (2018b) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
- Arora et al. (2019a) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019a.
- Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
- Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Bauer et al. (2007) Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
- Cao and Gu (2019) Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019.
- Chizat and Bach (2018) Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
- Du et al. (2018) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
- Du et al. (2019) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
- Gerfo et al. (2008) L Lo Gerfo, Lorenzo Rosasco, Francesca Odone, E De Vito, and Alessandro Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
- Guan et al. (2018) Melody Y Guan, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. Who said what: Modeling individual labelers improves classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hsu et al. (2012) Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
- Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
- Jiang et al. (2017) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.
- Krishna et al. (2016) Ranjay A Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A Shamma, Li Fei-Fei, and Michael S Bernstein. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 3167–3179. ACM, 2016.
- Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- Li et al. (2019) Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.
- Li and Liang (2018) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
- Liu and Tao (2015) Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
- Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.
- Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
- Neyshabur et al. (2019) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
- Northcutt et al. (2017) Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.
- Raskutti et al. (2014) Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.
- Ren et al. (2018) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
- Rolnick et al. (2017) David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
- Sukhbaatar et al. (2014) Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
- Veit et al. (2017) Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Wei et al. (2017) Yuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6065–6075, 2017.
- Yang (2019) Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Difference Trick
We give a quick analysis of the “difference trick” described in Footnote 1, i.e., let $f(\theta, x) = \frac{1}{\sqrt{2}}\big(g(\theta_1, x) - g(\theta_2, x)\big)$ where $\theta = (\theta_1, \theta_2)$, and initialize $\theta_1$ and $\theta_2$ to be the same (and still random). The following lemma implies that the NTKs for $f$ and $g$ are the same.
If $\theta_1 = \theta_2$, then (i) $f(\theta, x) = 0$ for all $x$, and (ii) $\big\langle \nabla_\theta f(\theta, x), \nabla_\theta f(\theta, x') \big\rangle = \big\langle \nabla_{\theta_1} g(\theta_1, x), \nabla_{\theta_1} g(\theta_1, x') \big\rangle$ for all $x, x'$.
(i) holds by definition. For (ii), we can calculate that
The above lemma allows us to ensure zero output at initialization while preserving the NTK. As a comparison, Chizat and Bach (2018) proposed the following “doubling trick”: neurons in the last layer are duplicated, with the new neurons having the same input weights and opposite output weights. This also gives zero output at initialization, but it destroys the NTK. To see why, note that with the “doubling trick”, the network outputs 0 at initialization no matter what the input to its second-to-last layer is. Thus the gradients with respect to all parameters that are not in the last two layers are 0.
In our experiments, we observe that the performance of the neural net improves with the “difference trick.” See Figure 3. This intuitively makes sense, since the initial network output is independent of the label (it only depends on the input) and thus can be viewed as noise. When the width of the neural net is infinite, the initial network output is actually a zero-mean Gaussian process, whose covariance matrix is equal to the part of the NTK contributed by the gradients of the parameters in its last layer. Therefore, learning an infinitely wide neural network with nonzero initial output is equivalent to doing kernel regression with additive correlated Gaussian noise on the training and testing labels.
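The “difference trick” is easy to verify in code: two copies of the same randomly initialized network, subtracted and scaled by $1/\sqrt{2}$, output exactly zero while each copy's output (and hence its parameter gradients) is nonzero. A minimal two-layer sketch (our own toy architecture, not the paper's experimental network):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 64                                 # input dim, hidden width

def g(params, x):
    # A small two-layer net with tanh activation.
    W, v = params
    return float(v @ np.tanh(W @ x) / np.sqrt(m))

# Two copies with IDENTICAL random initialization.
W0, v0 = rng.standard_normal((m, d)), rng.standard_normal(m)
params1 = (W0.copy(), v0.copy())
params2 = (W0.copy(), v0.copy())

def f(x):
    # Difference trick: f = (g1 - g2)/sqrt(2) is exactly 0 at initialization.
    # The gradients w.r.t. (params1, params2) are +/- grad g / sqrt(2),
    # so the induced tangent kernel equals that of g (Lemma above).
    return (g(params1, x) - g(params2, x)) / np.sqrt(2)

x = rng.standard_normal(d)
out0 = f(x)          # exactly zero at initialization
```

Because the two copies perform bit-identical computations at initialization, the subtraction is exactly zero, not merely small, unlike a generic random initialization.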
Appendix B Missing Proof in Section 4
Proof of Theorem 4.1 (second part).
Now we prove the second part of the theorem. Notice that, under the relation $\theta^{AUX}(t) - \theta(0) = \frac{1}{\lambda}\mathbf{Z}\mathbf{b}(t)$, the iterates $\mathbf{b}(t)$ follow gradient descent on a strongly convex quadratic function with Hessian $\mathbf{H}$, where $\mathbf{H} = \mathbf{K} + \lambda^2 \mathbf{I}$. From classical convex optimization theory, as long as $\eta < \frac{2}{\lambda_{\max}(\mathbf{H})}$, gradient descent converges linearly to the unique optimum of this function, which can be easily obtained:
Then we have
finishing the proof. ∎
Appendix C Missing Proofs in Section 5
C.1 Proof of Theorem 5.1
We first prove two lemmas.
With probability at least $1 - \delta$, we have
In this proof we condition on $\mathbf{X}$ and $\mathbf{y}^*$, and only consider the randomness in $\boldsymbol{\epsilon}$ given $\mathbf{X}$ and $\mathbf{y}^*$.
First of all, we can write
so we can write the training loss on true labels as
Next, since $\boldsymbol{\epsilon} = \tilde{\mathbf{y}} - \mathbf{y}^*$, conditioned on $\mathbf{X}$ and $\mathbf{y}^*$, has independent, zero-mean and subgaussian entries (with parameter $\sigma$), by the tail inequality of Hsu et al. [2012], for any symmetric matrix $\mathbf{A}$, with probability at least $1 - \delta$,
Let $\mathbf{A}$ be as above and let $\mu_1 \ge \cdots \ge \mu_n \ge 0$ be the eigenvalues of $\mathbf{K}$. We have
Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $\ker(\cdot, \cdot)$. Recall that the RKHS norm of a function $f \in \mathcal{H}$ is
With probability at least $1 - \delta$, we have
In this proof we are still conditioned on $\mathbf{X}$ and $\mathbf{y}^*$, and only consider the randomness in $\boldsymbol{\epsilon}$ given $\mathbf{X}$ and $\mathbf{y}^*$. Note that $f(x) = \ker(x, \mathbf{X})^\top \boldsymbol{\alpha}$ with $\boldsymbol{\alpha} = (\mathbf{K} + \lambda^2 \mathbf{I})^{-1} \tilde{\mathbf{y}}$, so $\|f\|_{\mathcal{H}}^2 = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$. Then we can bound