Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

01/24/2019 · by Sanjeev Arora, et al. · Princeton University, Carnegie Mellon University

Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.


1 Introduction

The well-known work of Zhang et al. (2017) highlighted intriguing experimental phenomena about deep net training – specifically, optimization and generalization – and asked whether theory could explain them. They showed that sufficiently powerful nets (with vastly more parameters than number of training samples) can attain zero training error, regardless of whether the data is properly labeled or randomly labeled. Obviously, training with randomly labeled data cannot generalize, whereas training with properly labeled data generalizes. See Figure 2 replicating some of these results.

Recent papers have begun to provide explanations, showing that gradient descent can allow an overparametrized multi-layer net to attain arbitrarily low training error on fairly generic datasets (Du et al., 2018a, c; Li & Liang, 2018; Allen-Zhu et al., 2018b; Zou et al., 2018), provided the amount of overparametrization is a high polynomial of the relevant parameters (i.e. vastly more than the overparametrization in Zhang et al. (2017)). Under further assumptions it can also be shown that the trained net generalizes (Allen-Zhu et al., 2018a). But some issues were not addressed in these papers, and the goal of the current paper is to address them.

First, the experiments in Zhang et al. (2017) show that though the nets attain zero training error on even random data, the convergence rate is much slower. See Figure 1.

Question 1.

Why do true labels give faster convergence rate than random labels for gradient descent?

The above papers do not answer this question, since their proof of convergence does not distinguish between good and random labels.

The next issue is about generalization: clearly, some property of properly labeled data controls generalization, but what? Classical measures used in generalization theory such as VC-dimension and Rademacher complexity are much too pessimistic. A line of research proposed norm-based (e.g., Bartlett et al. (2017a)) and compression-based bounds (Arora et al., 2018). But the sample complexity upper bounds obtained are still far too weak. Furthermore, they rely on some property of the trained net that is revealed/computed at the end of training; there is no property of the data alone that determines upfront whether the trained net will generalize. A recent paper (Allen-Zhu et al., 2018a) assumed that there exists an underlying (unknown) neural network that achieves low error on the data distribution, and that the amount of data available is quite a bit more than the minimum number of samples needed to learn this underlying neural net. Under this condition, the overparametrized net (which has way more parameters) can learn in a way that generalizes. However, it is hard to verify from data whether this assumption is satisfied, even after the larger net has finished training. (We discuss related works in more detail in Section 2.) Thus the assumption is in some sense unverifiable.

Question 2.

Is there an easily verifiable complexity measure that can differentiate true labels and random labels?

Without explicit regularization, to attack this problem, one must resort to algorithm-dependent generalization analysis. One such line of work established that first-order methods can automatically find minimum-norm/maximum-margin solutions that fit the data in the settings of logistic regression, deep linear networks, and symmetric matrix factorization (Soudry et al., 2018; Gunasekar et al., 2018a, b; Ji & Telgarsky, 2018; Li et al., 2018b). However, how to extend these results to non-linear neural networks remains unclear (Wei et al., 2018). Another line of algorithm-dependent analysis of generalization (Hardt et al., 2015; Mou et al., 2017; Chen et al., 2018) used the stability of specific optimization algorithms that satisfy certain generic properties like convexity, smoothness, etc. However, as the number of epochs becomes large, these generalization bounds become vacuous.

Our results.

We give a new analysis that provides answers to Questions 1 and 2 for overparameterized two-layer neural networks with ReLU activation trained by gradient descent (GD), when the number of neurons in the hidden layer is sufficiently large. In this setting, Du et al. (2018c) have proved that GD with random initialization can achieve zero training error for any non-degenerate data. We give a more refined analysis of the trajectory of GD, which enables us to provide answers to Questions 1 and 2. In particular:

  • In Section 4, using the trajectory of the network predictions on the training data during optimization, we accurately estimate the magnitude of training loss in each iteration. Our key finding is that the number of iterations needed to achieve a target accuracy depends on the projections of data labels on the eigenvectors of a certain Gram matrix to be defined in Equation (3). On MNIST and CIFAR datasets, we find that such projections are significantly different for true labels and random labels, and as a result we are able to answer Question 1.

  • In Section 5, we give a generalization bound for the solution found by GD, based on accurate estimates of how much the network parameters can move during optimization (in suitable norms). Our generalization bound depends on a data-dependent complexity measure (c.f. Equation (10)), and notably, is completely independent of the number of hidden units in the network. Again, we test this complexity measure on MNIST and CIFAR, and find that the complexity measures for true and random labels are significantly different, which thus answers Question 2.

    Notice that because zero training error is achieved by the solution found by GD, a generalization bound is an upper bound on the error on the data distribution (test error). We also remark that our generalization bound is valid for any data labels – it does not require the existence of a small ground-truth network as in (Allen-Zhu et al., 2018a). Moreover, our bound can be efficiently computed for any data labels.

  • In Section 6, we further study what kind of functions can be provably learned by two-layer ReLU networks trained by GD. Combining the optimization and generalization results, we uncover a broad class of learnable functions, including linear functions, two-layer neural networks with polynomial activation or cosine activation, etc. Our requirement on the smoothness of learnable functions is weaker than that in (Allen-Zhu et al., 2018a).

Finally, we note that the intriguing generalization phenomena in deep learning were observed in kernel methods as well (Belkin et al., 2018). The analysis in the current paper is also related to a kernel from the ReLU activation (c.f. Equation (3)).

2 Related Work

In this section we survey previous works on optimization and generalization aspects of neural networks.

Optimization.

Many papers tried to characterize the geometric landscape of objective functions (Safran & Shamir, 2017; Zhou & Liang, 2017; Freeman & Bruna, 2016; Hardt & Ma, 2016; Nguyen & Hein, 2017; Kawaguchi, 2016; Venturi et al., 2018; Soudry & Carmon, 2016; Du & Lee, 2018; Soltanolkotabi et al., 2018; Haeffele & Vidal, 2015). The hope is to leverage recent advances in first-order algorithms (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017), which showed that if the landscape satisfies (1) all local minima are global and (2) all saddle points are strict (i.e., there exists a direction of negative curvature), then first-order methods can escape all saddle points and find a global minimum. Unfortunately, these desired properties do not hold even for simple non-linear shallow neural networks (Yun et al., 2018) or 3-layer linear neural networks (Kawaguchi, 2016).

Another approach is to directly analyze the trajectory of the optimization method and to show convergence to a global minimum. A series of papers made strong assumptions on the input distribution as well as realizability of labels, and showed global convergence of (stochastic) gradient descent for some shallow neural networks (Tian, 2017; Soltanolkotabi, 2017; Brutzkus & Globerson, 2017; Du et al., 2017a, b; Li & Yuan, 2017). Some local convergence results have also been proved (Zhong et al., 2017; Zhang et al., 2018). However, these assumptions are not satisfied in practice.

For two-layer neural networks, a line of papers used mean field analysis to establish that for infinitely wide neural networks, the empirical distribution of the neural network parameters can be described as a Wasserstein gradient flow (Mei et al., 2018; Chizat & Bach, 2018a; Sirignano & Spiliopoulos, 2018; Rotskoff & Vanden-Eijnden, 2018; Wei et al., 2018). However, it is unclear whether this framework can explain the behavior of first-order methods on finite-size neural networks.

Recent breakthroughs were made in understanding the optimization of overparameterized neural networks through the trajectory-based approach. These works proved polynomial-time global convergence of (stochastic) gradient descent on non-linear neural networks for minimizing the empirical risk. Their proof techniques can be roughly classified into two categories.

Li & Liang (2018); Allen-Zhu et al. (2018b); Zou et al. (2018) analyzed the trajectory of parameters and showed that on the trajectory, the objective function satisfies certain gradient dominance property. On the other hand, Du et al. (2018a, c) analyzed the trajectory of network predictions on training samples and showed that it enjoys a strongly-convex-like property.

Generalization.

It is well known that the VC-dimension of neural networks is at least linear in the number of parameters (Bartlett et al., 2017b), and therefore classical VC theory cannot explain the generalization ability of modern neural networks with more parameters than training samples. Researchers have proposed norm-based generalization bounds (Bartlett & Mendelson, 2002; Bartlett et al., 2017a; Neyshabur et al., 2015, 2017, 2019; Konstantinos et al., 2017; Golowich et al., 2017; Li et al., 2018a) and compression-based bounds (Arora et al., 2018). Dziugaite & Roy (2017) and Zhou et al. (2019) used the PAC-Bayes approach to compute non-vacuous generalization bounds for MNIST and ImageNet, respectively. All these bounds are posterior in nature – they depend on certain properties of the trained neural networks. Therefore, one has to finish training a neural network to know whether it can generalize. Compared with these results, our generalization bound only depends on the training data and can be calculated without actually training the neural network.

Another line of work assumed the existence of a true model, and showed that the (regularized) empirical risk minimizer has good generalization with sample complexity that depends on the true model (Du et al., 2018b; Ma et al., 2018; Imaizumi & Fukumizu, 2018). These papers ignored the difficulty of optimization, while we are able to prove generalization of the solution found by gradient descent. Furthermore, our generic generalization bound does not assume the existence of any true model.

Our paper is closely related to (Allen-Zhu et al., 2018a) which showed that two-layer overparametrized neural networks trained by randomly initialized stochastic gradient descent can learn a class of infinite-order smooth functions. In contrast, our generalization bound depends on a data-dependent complexity measure that can be computed for any dataset, without assuming any ground-truth model. As a consequence of our generic bound, we also show that two-layer neural networks can learn a class of infinite-order smooth functions, with a less strict requirement for smoothness. Furthermore, our bound is completely independent of the number of hidden units, while there is a poly-logarithmic dependence in Allen-Zhu et al. (2018a). Allen-Zhu et al. (2018a) also studied the generalization performance of three-layer neural networks.

Lastly, our work is related to kernel methods, especially recent discoveries of the connection between deep learning and kernels (Jacot et al., 2018; Chizat & Bach, 2018b; Daniely, 2017). Our analysis utilized several properties of a related kernel from the ReLU activation (c.f. Equation (3)).

3 Preliminaries and Overview of Results

Notation.

We use bold-faced letters for vectors and matrices. For a matrix $\mathbf{A}$, let $\mathbf{A}_{ij}$ be its $(i,j)$-th entry. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. Denote by $\lambda_{\min}(\mathbf{A})$ the minimum eigenvalue of a symmetric matrix $\mathbf{A}$. Let $\mathrm{vec}(\mathbf{A})$ be the vectorization of a matrix $\mathbf{A}$ in column-first order. Let $\mathbf{I}$ be the identity matrix and $[n] = \{1, 2, \ldots, n\}$. Denote by $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ the Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Denote by $\sigma(\cdot)$ the ReLU function: $\sigma(z) = \max\{z, 0\}$. Denote by $\mathbb{I}\{E\}$ the indicator function for an event $E$.

3.1 Setting: Two-Layer Neural Network Trained by Randomly Initialized Gradient Descent

We consider a two-layer ReLU activated neural network with $m$ neurons in the hidden layer:
$$f_{\mathbf{W}, \mathbf{a}}(\mathbf{x}) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \, \sigma\!\left(\mathbf{w}_r^\top \mathbf{x}\right),$$
where $\mathbf{x} \in \mathbb{R}^d$ is the input, $\mathbf{w}_1, \ldots, \mathbf{w}_m \in \mathbb{R}^d$ are weight vectors in the first layer, and $a_1, \ldots, a_m \in \mathbb{R}$ are weights in the second layer. For convenience we denote $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_m) \in \mathbb{R}^{d \times m}$ and $\mathbf{a} = (a_1, \ldots, a_m)^\top \in \mathbb{R}^m$.

We are given $n$ input-label samples $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. from an underlying data distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$. We denote $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n) \in \mathbb{R}^{d \times n}$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. For simplicity, we assume that for $(\mathbf{x}, y)$ sampled from $\mathcal{D}$, we have $\|\mathbf{x}\|_2 = 1$ and $|y| \le 1$.

We train the neural network by randomly initialized gradient descent (GD) on the quadratic loss over the data $S$. In particular, we first initialize the parameters randomly:
$$\mathbf{w}_r(0) \sim \mathcal{N}(\mathbf{0}, \kappa^2 \mathbf{I}), \quad a_r \sim \mathrm{unif}\left(\{-1, +1\}\right), \quad \forall r \in [m], \qquad (1)$$
where $\kappa \in (0, 1]$ controls the magnitude of initialization, and all randomnesses are independent. We then fix the second layer $\mathbf{a}$ and optimize the first layer $\mathbf{W}$ through GD on the following objective function:
$$\Phi(\mathbf{W}) = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f_{\mathbf{W}, \mathbf{a}}(\mathbf{x}_i)\right)^2. \qquad (2)$$
The GD update rule can be written as
$$\mathbf{w}_r(k+1) = \mathbf{w}_r(k) - \eta \, \frac{\partial \Phi(\mathbf{W}(k))}{\partial \mathbf{w}_r}, \quad \forall r \in [m],$$
where $\eta > 0$ is the learning rate. (Since ReLU is not differentiable at $0$, we just define the "gradient" using the formula given in Equation (5) below, and this is indeed what is used in practice.)

3.2 The Gram Matrix from ReLU Kernel

Given data points $\mathbf{x}_1, \ldots, \mathbf{x}_n$, we define the following Gram matrix $\mathbf{H}^\infty \in \mathbb{R}^{n \times n}$:
$$\mathbf{H}^\infty_{ij} = \mathbb{E}_{\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\!\left[\mathbf{x}_i^\top \mathbf{x}_j \, \mathbb{I}\!\left\{\mathbf{w}^\top \mathbf{x}_i \ge 0, \, \mathbf{w}^\top \mathbf{x}_j \ge 0\right\}\right] = \frac{\mathbf{x}_i^\top \mathbf{x}_j \left(\pi - \arccos\!\left(\mathbf{x}_i^\top \mathbf{x}_j\right)\right)}{2\pi}, \quad i, j \in [n]. \qquad (3)$$
This matrix can be viewed as a Gram matrix from a kernel associated with the ReLU function, and has been studied in (Xie et al., 2017; Tsuchida et al., 2017; Du et al., 2018c).
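For readers who want to compute this kernel in practice, here is a minimal NumPy sketch (ours, not the authors' code) that evaluates the closed form in Equation (3) for unit-norm inputs; the toy data at the end is purely illustrative.

```python
import numpy as np

def relu_gram(X):
    """H^infty from Equation (3) for unit-norm inputs x_1, ..., x_n (rows of X).

    For ||x_i|| = ||x_j|| = 1, the expectation over w ~ N(0, I) evaluates to
    H_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2 * pi).
    """
    G = X @ X.T                        # inner products x_i^T x_j
    G = np.clip(G, -1.0, 1.0)          # guard against rounding just outside [-1, 1]
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

# Toy usage (illustrative sizes only): random unit-norm inputs.
rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # ||x_i||_2 = 1 as in Section 3.1
H = relu_gram(X)
print("lambda_0 =", np.linalg.eigvalsh(H).min())  # smallest eigenvalue of H^infty
```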

In our setting of training a two-layer ReLU network, Du et al. (2018c) showed that if $\mathbf{H}^\infty$ is positive definite, GD converges to $0$ training loss if $m$ is sufficiently large:

Theorem 3.1 ((Du et al., 2018c); they only considered the case $\kappa = 1$, but it is straightforward to generalize their result to general $\kappa$ at the price of an extra factor in $m$).

Assume $\lambda_0 := \lambda_{\min}(\mathbf{H}^\infty) > 0$. For $\delta \in (0, 1)$, if $m = \mathrm{poly}\!\left(n, \frac{1}{\lambda_0}, \frac{1}{\delta}, \frac{1}{\kappa}\right)$ is sufficiently large and $\eta = O\!\left(\frac{\lambda_0}{n^2}\right)$, then with probability at least $1 - \delta$ over the random initialization (1), we have:

  • $\Phi(\mathbf{W}(0)) = O(n/\delta)$;

  • $\Phi(\mathbf{W}(k+1)) \le \left(1 - \frac{\eta \lambda_0}{2}\right) \Phi(\mathbf{W}(k))$ for all $k \ge 0$.

Our results on optimization and generalization also crucially depend on this matrix $\mathbf{H}^\infty$.

3.3 Overview of Our Results

Now we give an informal description of our main results. It assumes that the initialization magnitude $\kappa$ is sufficiently small and the network width $m$ is sufficiently large (to be quantified later).

The following theorem gives a precise characterization of how the objective decreases to $0$. It says that this process is essentially determined by a power method for the matrix $\mathbf{I} - \eta \mathbf{H}^\infty$ applied on the label vector $\mathbf{y}$.

Theorem 3.2 (Informal version of Theorem 4.1).

With high probability we have:
$$\|\mathbf{y} - \mathbf{u}(k)\|_2 \approx \sqrt{\sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} \left(\mathbf{v}_i^\top \mathbf{y}\right)^2},$$
where $\mathbf{u}(k) \in \mathbb{R}^n$ is the vector of network predictions on the $n$ training inputs at iteration $k$, and $\{\lambda_i\}_{i=1}^n$ and $\{\mathbf{v}_i\}_{i=1}^n$ are the eigenvalues and corresponding orthonormal eigenvectors of $\mathbf{H}^\infty$.

As a consequence, we are able to distinguish the convergence rates for different label vectors $\mathbf{y}$, which are determined by the projections of $\mathbf{y}$ onto the eigenvectors of $\mathbf{H}^\infty$. This allows us to obtain an answer to Question 1. See Section 4 for details.

Our main result for generalization is the following:

Theorem 3.3 (Informal version of Theorem 5.1).

For any $1$-Lipschitz loss function, the generalization error of the two-layer ReLU network found by GD is at most
$$\sqrt{\frac{2\, \mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}{n}}. \qquad (4)$$

Notice that our generalization bound (4) can be computed from the data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, and is completely independent of the network width $m$. We observe that this bound can clearly distinguish true labels and random labels, thus providing an answer to Question 2. See Section 5 for details.

Finally, using Theorem 3.3, we prove that we can use our two-layer ReLU network trained by GD to learn a broad class of functions, including linear functions, two-layer neural networks with polynomial activation or cosine activation, etc. See Section 6 for details.

3.4 Additional Notation

We introduce some additional notation that will be used.

Define $u_i = f_{\mathbf{W}, \mathbf{a}}(\mathbf{x}_i)$, i.e., the network's prediction on the $i$-th input. We also use $\mathbf{u} = (u_1, \ldots, u_n)^\top \in \mathbb{R}^n$ to denote all $n$ predictions. Then we have $\Phi(\mathbf{W}) = \frac{1}{2} \|\mathbf{y} - \mathbf{u}\|_2^2$ and the gradient of $\Phi$ can be written as:
$$\frac{\partial \Phi(\mathbf{W})}{\partial \mathbf{w}_r} = \frac{1}{\sqrt{m}} \, a_r \sum_{i=1}^{n} (u_i - y_i) \, \mathbb{I}_{r,i} \, \mathbf{x}_i, \quad \forall r \in [m], \qquad (5)$$
where $\mathbb{I}_{r,i} = \mathbb{I}\!\left\{\mathbf{w}_r^\top \mathbf{x}_i \ge 0\right\}$.

We define two matrices $\mathbf{Z}$ and $\mathbf{H}$ which will play a key role in our analysis of the GD trajectory:
$$\mathbf{Z} = \frac{1}{\sqrt{m}} \begin{pmatrix} \mathbb{I}_{1,1} a_1 \mathbf{x}_1 & \cdots & \mathbb{I}_{1,n} a_1 \mathbf{x}_n \\ \vdots & \ddots & \vdots \\ \mathbb{I}_{m,1} a_m \mathbf{x}_1 & \cdots & \mathbb{I}_{m,n} a_m \mathbf{x}_n \end{pmatrix} \in \mathbb{R}^{md \times n}$$
and $\mathbf{H} = \mathbf{Z}^\top \mathbf{Z}$. Note that
$$\mathbf{H}_{ij} = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{m} \sum_{r=1}^{m} \mathbb{I}_{r,i} \, \mathbb{I}_{r,j}, \quad i, j \in [n].$$
With this notation we have a more compact form of the gradient (5):
$$\frac{\partial \Phi(\mathbf{W})}{\partial \mathrm{vec}(\mathbf{W})} = \mathbf{Z} (\mathbf{u} - \mathbf{y}).$$
Then the GD update rule is:
$$\mathrm{vec}(\mathbf{W}(k+1)) = \mathrm{vec}(\mathbf{W}(k)) - \eta \, \mathbf{Z}(k) \left(\mathbf{u}(k) - \mathbf{y}\right) \qquad (6)$$
for $k = 0, 1, 2, \ldots$. Throughout the paper, we use $k$ as the iteration number, and also use $k$ to index all variables that depend on $\mathbf{W}(k)$. For example, we have $u_i(k) = f_{\mathbf{W}(k), \mathbf{a}}(\mathbf{x}_i)$, $\mathbb{I}_{r,i}(k) = \mathbb{I}\{\mathbf{w}_r(k)^\top \mathbf{x}_i \ge 0\}$, $\mathbf{Z}(k)$, $\mathbf{H}(k)$, etc.
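As an illustration of Equations (1), (2), (5) and (6), the following NumPy sketch trains the two-layer ReLU network of Section 3.1 with full-batch GD, keeping the second layer fixed. This is our own minimal re-implementation, not the authors' code; the width, initialization scale, learning rate, and iteration count are placeholder values, not the settings used in the paper's experiments.

```python
import numpy as np

def train_two_layer_relu(X, y, m=1024, kappa=0.1, eta=0.1, iters=200, seed=0):
    """Minimal sketch of randomly initialized GD from Section 3.1.

    X: (n, d) with unit-norm rows, y: (n,) labels.
    Only the first layer W is trained; the second layer a is fixed at its random
    +/-1 initialization, and f(x) = (1/sqrt(m)) sum_r a_r relu(w_r^T x).
    Hyperparameters here are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = kappa * rng.standard_normal((d, m))      # w_r(0) ~ N(0, kappa^2 I)
    a = rng.choice([-1.0, 1.0], size=m)          # a_r ~ unif({-1, +1}), then fixed

    losses = []
    for _ in range(iters):
        pre = X @ W                              # pre-activations w_r^T x_i, shape (n, m)
        act = np.maximum(pre, 0.0)               # ReLU
        u = act @ a / np.sqrt(m)                 # predictions u_i
        losses.append(0.5 * np.sum((y - u) ** 2))
        # Gradient of Phi w.r.t. W, cf. Equation (5):
        # dPhi/dw_r = (1/sqrt(m)) a_r sum_i (u_i - y_i) 1{w_r^T x_i >= 0} x_i
        grad = X.T @ (((u - y)[:, None] * (pre > 0)) * a[None, :]) / np.sqrt(m)
        W -= eta * grad                          # GD update, cf. Equation (6)
    return W, a, losses
```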

4 Analysis of Convergence Rate

Figure 1: (a) Convergence Rate, MNIST. (b) Eigenvalues & Projections, MNIST. (c) Convergence Rate, CIFAR. (d) Eigenvalues & Projections, CIFAR. In Figures 1(a) and 1(c), we compare convergence rates of gradient descent between using true labels, random labels, and the worst-case labels (the normalized eigenvector of $\mathbf{H}^\infty$ corresponding to $\lambda_{\min}(\mathbf{H}^\infty)$). In Figures 1(b) and 1(d), we plot the eigenvalues of $\mathbf{H}^\infty$ as well as the projections of true, random, and worst-case labels on different eigenvectors of $\mathbf{H}^\infty$. The experiments use gradient descent on data from two classes of MNIST or CIFAR. The plots clearly demonstrate that true labels have much better alignment with top eigenvectors, thus enjoying faster convergence.

Although Theorem 3.1 already predicts linear convergence of GD to $0$ loss, it only provides an upper bound on the loss and does not distinguish between different types of labels. In particular, it cannot answer Question 1. In this section we give a fine-grained analysis of the convergence rate.

Recall that the loss function is $\Phi(\mathbf{W}) = \frac{1}{2} \|\mathbf{y} - \mathbf{u}\|_2^2$. Thus, it is equivalent to study how fast the sequence $\{\mathbf{u}(k)\}_{k=0}^{\infty}$ converges to $\mathbf{y}$. Key to our analysis is the observation that when the size of initialization $\kappa$ is small and the network width $m$ is large, the sequence $\{\mathbf{u}(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{\mathbf{u}}(k)\}_{k=0}^{\infty}$ which has a linear update rule:
$$\tilde{\mathbf{u}}(0) = \mathbf{0}, \qquad \tilde{\mathbf{u}}(k+1) = \tilde{\mathbf{u}}(k) - \eta \, \mathbf{H}^\infty \left(\tilde{\mathbf{u}}(k) - \mathbf{y}\right), \qquad (7)$$
where $\mathbf{H}^\infty$ is the Gram matrix defined in (3).

Write the eigen-decomposition $\mathbf{H}^\infty = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^\top$, where $\mathbf{v}_1, \ldots, \mathbf{v}_n \in \mathbb{R}^n$ are orthonormal eigenvectors of $\mathbf{H}^\infty$ and $\lambda_1, \ldots, \lambda_n$ are the corresponding eigenvalues. Our main theorem in this section is the following:

Theorem 4.1.

Suppose $\lambda_0 = \lambda_{\min}(\mathbf{H}^\infty) > 0$, $\kappa = O\!\left(\frac{\epsilon \delta}{\sqrt{n}}\right)$, $m \ge \kappa^{-2}\, \mathrm{poly}\!\left(n, \frac{1}{\lambda_0}, \frac{1}{\delta}, \frac{1}{\epsilon}\right)$ and $\eta = O\!\left(\frac{\lambda_0}{n^2}\right)$. Then with probability at least $1 - \delta$ over the random initialization, for all $k = 0, 1, 2, \ldots$ we have:
$$\left\|\mathbf{y} - \mathbf{u}(k)\right\|_2 = \sqrt{\sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} \left(\mathbf{v}_i^\top \mathbf{y}\right)^2} \,\pm\, \epsilon. \qquad (8)$$

The proof of Theorem 4.1 is given in Appendix C.

In fact, the dominating term $\sqrt{\sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} (\mathbf{v}_i^\top \mathbf{y})^2}$ is exactly equal to $\|\mathbf{y} - \tilde{\mathbf{u}}(k)\|_2$, which we prove in Section 4.1.

In light of (8), it suffices to understand how fast $\|\mathbf{y} - \tilde{\mathbf{u}}(k)\|_2$ converges to $0$ as $k$ grows. Notice that each term $(1 - \eta \lambda_i)^{2k} (\mathbf{v}_i^\top \mathbf{y})^2$ forms a geometric sequence in $k$ which starts at $(\mathbf{v}_i^\top \mathbf{y})^2$ and shrinks at ratio $(1 - \eta \lambda_i)^2$. In other words, we can think of decomposing the label vector $\mathbf{y}$ into its projections onto all eigenvectors $\mathbf{v}_i$ of $\mathbf{H}^\infty$: $\|\mathbf{y}\|_2^2 = \sum_{i=1}^{n} (\mathbf{v}_i^\top \mathbf{y})^2$, and the $i$-th portion shrinks exponentially at ratio $(1 - \eta \lambda_i)^2$. The larger $\lambda_i$ is, the faster this portion decreases to $0$, so in order to have faster convergence we would like the projections of $\mathbf{y}$ onto the top eigenvectors to be larger. Therefore we obtain the following intuitive rule to compare the convergence rates on two sets of labels in a qualitative manner (for fixed $\|\mathbf{y}\|_2$):

  • For a set of labels $\mathbf{y}$, if they align with the top eigenvectors, i.e., $(\mathbf{v}_i^\top \mathbf{y})^2$ is large for large $\lambda_i$, then gradient descent converges quickly.

  • For a set of labels $\mathbf{y}$, if the projections on the eigenvectors $\left\{(\mathbf{v}_i^\top \mathbf{y})^2\right\}_{i=1}^n$ are close to uniform, or if $\mathbf{y}$ aligns with eigenvectors corresponding to small eigenvalues, then gradient descent converges with a slow rate.

Answer to Question 1.

We now use this reasoning to answer Question 1. In Figure 1(b), we compute the eigenvalues of $\mathbf{H}^\infty$ (blue curve) for the MNIST dataset. The plot shows that the eigenvalues of $\mathbf{H}^\infty$ admit a fast decay. We further compute the projections of true labels (red) and random labels (cyan) onto the eigenvectors. We observe a significant difference between the projections of true labels and random labels: true labels align well with the top eigenvectors, whereas the projections of random labels are close to uniform. Furthermore, according to our theory, if a set of labels aligns with the eigenvector associated with the least eigenvalue, the convergence rate of gradient descent will be extremely slow. We construct such labels, and in Figure 1(a) we indeed observe slow convergence. We repeat the same experiments on CIFAR and have similar observations (Figures 1(c) and 1(d)). These empirical findings support our theory on the convergence rate of gradient descent. See Appendix A for implementation details.
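As a toy version of this diagnostic (not the paper's experimental pipeline), the sketch below computes the projections $(\mathbf{v}_i^\top \mathbf{y})^2$ on synthetic unit-norm inputs, once for labels that depend on the inputs and once for purely random labels; the data generation and the "top 10%" summary statistic are arbitrary illustrative choices.

```python
import numpy as np

def relu_gram(X):
    """H^infty from Equation (3) for unit-norm rows of X."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

rng = np.random.default_rng(0)
n, d = 200, 20                              # toy sizes, illustrative only
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
H = relu_gram(X)

lams, V = np.linalg.eigh(H)                 # ascending eigenvalues, orthonormal columns
order = np.argsort(lams)[::-1]              # sort from largest to smallest eigenvalue
V = V[:, order]

y_struct = np.sign(X @ rng.standard_normal(d))   # labels correlated with the inputs
y_random = rng.choice([-1.0, 1.0], size=n)       # random labels
for name, y in [("structured", y_struct), ("random", y_random)]:
    proj = (V.T @ y) ** 2                        # (v_i^T y)^2, cf. Figures 1(b), 1(d)
    frac = proj[: n // 10].sum() / proj.sum()
    print(name, "mass on top 10% eigenvectors:", round(frac, 3))
```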

4.1 Proof Sketch of Theorem 4.1

Now we prove $\|\mathbf{y} - \tilde{\mathbf{u}}(k)\|_2 = \sqrt{\sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} (\mathbf{v}_i^\top \mathbf{y})^2}$. The entire proof of Theorem 4.1 is given in Appendix C; it relies on the fact that the dynamics of $\mathbf{u}(k)$ is essentially a perturbed version of (7).

From (7) we have $\mathbf{y} - \tilde{\mathbf{u}}(k+1) = (\mathbf{I} - \eta \mathbf{H}^\infty)(\mathbf{y} - \tilde{\mathbf{u}}(k))$, which implies $\mathbf{y} - \tilde{\mathbf{u}}(k) = (\mathbf{I} - \eta \mathbf{H}^\infty)^k \mathbf{y}$. Note that $\mathbf{I} - \eta \mathbf{H}^\infty$ has eigen-decomposition $\mathbf{I} - \eta \mathbf{H}^\infty = \sum_{i=1}^{n} (1 - \eta \lambda_i) \mathbf{v}_i \mathbf{v}_i^\top$ and that $\mathbf{y}$ can be decomposed as $\mathbf{y} = \sum_{i=1}^{n} (\mathbf{v}_i^\top \mathbf{y}) \mathbf{v}_i$. Then we have $\mathbf{y} - \tilde{\mathbf{u}}(k) = \sum_{i=1}^{n} (1 - \eta \lambda_i)^k (\mathbf{v}_i^\top \mathbf{y}) \mathbf{v}_i$, which implies $\|\mathbf{y} - \tilde{\mathbf{u}}(k)\|_2^2 = \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} (\mathbf{v}_i^\top \mathbf{y})^2$.
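This calculation is easy to check numerically. The sketch below iterates the linear update rule (7) with a generic positive semi-definite matrix standing in for $\mathbf{H}^\infty$ (all sizes are arbitrary) and verifies that the residual norm matches the closed form at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T / n                          # stand-in PSD matrix playing the role of H^infty
y = rng.standard_normal(n)
eta = 0.5 / np.linalg.eigvalsh(H).max()  # step size small enough that 1 - eta*lambda_i > 0

lams, V = np.linalg.eigh(H)
proj2 = (V.T @ y) ** 2                   # (v_i^T y)^2

u = np.zeros(n)                          # u~(0) = 0
for k in range(1, 51):
    u = u - eta * H @ (u - y)            # iterate the linear update rule (7)
    closed_form = np.sqrt(np.sum((1 - eta * lams) ** (2 * k) * proj2))
    assert np.allclose(np.linalg.norm(y - u), closed_form)
print("closed form matches the iterates of (7)")
```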

5 Analysis of Generalization

In this section, we study the generalization ability of the two-layer neural network trained by GD.

First, in order for optimization to succeed, i.e., zero training loss is achieved, we need a non-degeneracy assumption on the data distribution, defined below:

Definition 5.1.

A distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$ is $(\lambda_0, \delta, n)$-non-degenerate if, for $n$ i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ from $\mathcal{D}$, with probability at least $1 - \delta$ we have $\lambda_{\min}(\mathbf{H}^\infty) \ge \lambda_0 > 0$.

Remark 5.1.

Note that as long as no two $\mathbf{x}_i$ and $\mathbf{x}_j$ are parallel to each other, we have $\lambda_{\min}(\mathbf{H}^\infty) > 0$ (see (Du et al., 2018c)). For most real-world distributions, no two training inputs are parallel.

Our main theorem is the following:

Theorem 5.1.

Fix an error parameter $\epsilon > 0$ and a failure probability $\delta \in (0, 1)$. Suppose our data $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ are i.i.d. samples from a $(\lambda_0, \delta/3, n)$-non-degenerate distribution $\mathcal{D}$, and suppose the initialization magnitude $\kappa$ and the learning rate $\eta$ are sufficiently small and the width $m$ is sufficiently large (polynomially in $n$, $1/\lambda_0$, $1/\delta$ and $1/\epsilon$). Consider any loss function $\ell: \mathbb{R} \times \mathbb{R} \to [0, 1]$ that is $1$-Lipschitz in the first argument such that $\ell(y, y) = 0$. Then with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer neural network $f_{\mathbf{W}(k), \mathbf{a}}$ trained by GD for $k \ge \Omega\!\left(\frac{1}{\eta \lambda_0} \log\frac{n}{\delta}\right)$ iterations has population loss $L_{\mathcal{D}}\!\left(f_{\mathbf{W}(k), \mathbf{a}}\right) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\!\left[\ell\!\left(f_{\mathbf{W}(k), \mathbf{a}}(\mathbf{x}), y\right)\right]$ bounded as:
$$L_{\mathcal{D}}\!\left(f_{\mathbf{W}(k), \mathbf{a}}\right) \le \sqrt{\frac{2\, \mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}{n}} + O\!\left(\sqrt{\frac{\log\frac{n}{\lambda_0 \delta}}{n}}\right) + \epsilon. \qquad (9)$$

The proof of Theorem 5.1 is given in Appendix D and we sketch the proof in Section 5.1.

Note that in Theorem 5.1 there are three sources of possible failure: (i) failure of satisfying $\lambda_{\min}(\mathbf{H}^\infty) \ge \lambda_0$, (ii) failure of the random initialization, and (iii) failure in the data sampling procedure (c.f. Theorem B.1). We ensure that each of these failure probabilities is at most $\delta/3$ so that the final failure probability is at most $\delta$.

As a corollary of Theorem 5.1, for binary classification problems (i.e., labels are $\pm 1$), Corollary 5.2 shows that (9) also bounds the population classification error of the learned classifier. See Appendix D for the proof.

Corollary 5.2.

Under the same assumptions as in Theorem 5.1 and additionally assuming $y \in \{\pm 1\}$ for $(\mathbf{x}, y) \sim \mathcal{D}$, with probability at least $1 - \delta$, the population classification error $\Pr_{(\mathbf{x}, y) \sim \mathcal{D}}\!\left[\mathrm{sign}\!\left(f_{\mathbf{W}(k), \mathbf{a}}(\mathbf{x})\right) \ne y\right]$ is bounded by the right-hand side of (9).

Now we discuss our generalization bound. The dominating term in (9) is:
$$\sqrt{\frac{2\, \mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}{n}}. \qquad (10)$$
This can be viewed as a complexity measure of the data that one can use to predict the test accuracy of the learned neural network. Our result has the following advantages: (i) our complexity measure (10) can be directly computed given the data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, without the need of training a neural network or assuming a ground-truth model; (ii) our bound is completely independent of the network width $m$; (iii) our theorem does not require early stopping of optimization as in Allen-Zhu et al. (2018a).

Evaluating our complexity measure (10).

To illustrate that the complexity measure in (10) effectively determines test error, in Figure 2 we compare this complexity measure versus the test error with true labels and random labels (and mixture of true and random labels). Random and true labels have significantly different complexity measures, and as the portion of random labels increases, our complexity measure also increases. See Appendix A for implementation details.

Figure 2: (a) MNIST Data. (b) CIFAR Data. Generalization error ($\ell_1$ loss and classification error) vs. our complexity measure when different portions of random labels are used. We apply GD on data from two classes of MNIST or CIFAR until convergence. Our complexity measure almost matches the trend of generalization error as the portion of random labels increases. Note that the $\ell_1$ loss is always an upper bound on the classification error.
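For concreteness, here is a minimal NumPy sketch of the complexity measure (10) (an illustrative re-implementation, not the paper's code). The small ridge added before inverting $\mathbf{H}^\infty$ is our own numerical-stability choice and is not part of the bound; the synthetic data and labels at the end are placeholders, on which one typically observes a smaller value for input-dependent labels than for random labels, mirroring the trend in Figure 2.

```python
import numpy as np

def complexity_measure(X, y, reg=1e-8):
    """sqrt(2 y^T (H^infty)^{-1} y / n), cf. Equation (10).

    X: (n, d) inputs with unit-norm rows, y: (n,) labels. The tiny ridge `reg`
    is only for numerical stability and is not part of the bound.
    """
    n = X.shape[0]
    G = np.clip(X @ X.T, -1.0, 1.0)
    H = G * (np.pi - np.arccos(G)) / (2.0 * np.pi)     # Equation (3)
    return float(np.sqrt(2.0 * y @ np.linalg.solve(H + reg * np.eye(n), y) / n))

# Illustrative comparison on synthetic unit-norm data (placeholder sizes):
rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y_struct = np.sign(X @ rng.standard_normal(d))   # labels that depend on the inputs
y_random = rng.choice([-1.0, 1.0], size=n)       # random labels
print(complexity_measure(X, y_struct), complexity_measure(X, y_random))
```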

5.1 Proof Sketch of Theorem 5.1

The main ingredients in the proof of Theorem 5.1 are Lemmas 5.3 and 5.4. We defer the proofs of these lemmas as well as the full proof of Theorem 5.1 to Appendix D.

Our proof is based on a careful characterization of the trajectory of the weight matrix $\mathbf{W}(k)$ during GD. In particular, we bound its distance to initialization as follows:

Lemma 5.3.

Suppose $\lambda_0 = \lambda_{\min}(\mathbf{H}^\infty) > 0$ and suppose $\kappa$, $\eta$ are sufficiently small and $m$ is sufficiently large. Then with probability at least $1 - \delta$ over the random initialization, we have for all $k \ge 0$:

  • $\|\mathbf{w}_r(k) - \mathbf{w}_r(0)\|_2 = O\!\left(\frac{n}{\sqrt{m}\, \lambda_0 \sqrt{\delta}}\right)$ for every $r \in [m]$, and

  • $\|\mathbf{W}(k) - \mathbf{W}(0)\|_F \le \sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} + o(1)$, where the $o(1)$ term vanishes as $\kappa \to 0$ and $m \to \infty$.

The bound on the movement of each $\mathbf{w}_r$ was proved in Du et al. (2018c). Our main contribution is the bound on $\|\mathbf{W}(k) - \mathbf{W}(0)\|_F$, which corresponds to the total movement of all neurons. The main idea is to couple the trajectory of $\{\mathbf{W}(k)\}_{k=0}^{\infty}$ with another, simpler trajectory $\{\widetilde{\mathbf{W}}(k)\}_{k=0}^{\infty}$ defined as
$$\widetilde{\mathbf{W}}(0) = \mathbf{W}(0), \qquad \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k+1)\right) = \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k)\right) - \eta\, \mathbf{Z}(0) \left(\mathbf{Z}(0)^\top \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k)\right) - \mathbf{y}\right), \qquad (11)$$
i.e., gradient descent with the activation pattern frozen at initialization. We prove in Section 5.2 that $\left\|\widetilde{\mathbf{W}}(\infty) - \widetilde{\mathbf{W}}(0)\right\|_F^2 = (\mathbf{y} - \mathbf{u}(0))^\top \mathbf{H}(0)^{-1} (\mathbf{y} - \mathbf{u}(0)) \approx \mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}$. (Note that $\mathbf{u}(0) \approx \mathbf{0}$ since $\kappa$ is small, and $\mathbf{H}(0) \approx \mathbf{H}^\infty$ from standard concentration; see Lemma C.3.) The actual proof of Lemma 5.3 is essentially a perturbed version of this argument.

Lemma 5.3 implies that the learned function from GD lies in a restricted class of neural nets whose weights are close to the initialization $\mathbf{W}(0)$. The following lemma bounds the Rademacher complexity of this function class:

Lemma 5.4.

For given $R > 0$ and $B > 0$, with probability at least $1 - \delta$ over the random initialization $(\mathbf{W}(0), \mathbf{a})$, the following function class
$$\mathcal{F}^{R, B}_{\mathbf{W}(0), \mathbf{a}} = \left\{ f_{\mathbf{W}, \mathbf{a}} : \|\mathbf{w}_r - \mathbf{w}_r(0)\|_2 \le R \ (\forall r \in [m]),\ \|\mathbf{W} - \mathbf{W}(0)\|_F \le B \right\}$$
has empirical Rademacher complexity bounded as
$$\mathcal{R}_S\!\left(\mathcal{F}^{R, B}_{\mathbf{W}(0), \mathbf{a}}\right) \le \frac{B}{\sqrt{2n}},$$
up to additional terms that are negligible for the values of $R$ guaranteed by Lemma 5.3 when $m$ is sufficiently large.

Finally, combining Lemmas 5.3 and 5.4, we are able to conclude that the neural network found by GD belongs to a function class with Rademacher complexity at most $\sqrt{\frac{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}{2n}}$ (plus negligible errors). This gives us the generalization bound in Theorem 5.1 using the theory of Rademacher complexity (Appendix B).

5.2 Analysis of the Auxiliary Sequence

Now we give a proof of $\left\|\widetilde{\mathbf{W}}(\infty) - \widetilde{\mathbf{W}}(0)\right\|_F^2 = (\mathbf{y} - \mathbf{u}(0))^\top \mathbf{H}(0)^{-1} (\mathbf{y} - \mathbf{u}(0))$ as an illustration for the proof of Lemma 5.3. Define $\hat{\mathbf{u}}(k) = \mathbf{Z}(0)^\top \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k)\right)$. Then from (11) we have $\hat{\mathbf{u}}(0) = \mathbf{u}(0)$ and $\hat{\mathbf{u}}(k+1) = \hat{\mathbf{u}}(k) - \eta \mathbf{H}(0)\left(\hat{\mathbf{u}}(k) - \mathbf{y}\right)$, yielding $\hat{\mathbf{u}}(k) - \mathbf{y} = -\left(\mathbf{I} - \eta \mathbf{H}(0)\right)^k \left(\mathbf{y} - \mathbf{u}(0)\right)$. Plugging this back into (11) we get $\mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k+1)\right) - \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(k)\right) = \eta \mathbf{Z}(0) \left(\mathbf{I} - \eta \mathbf{H}(0)\right)^k \left(\mathbf{y} - \mathbf{u}(0)\right)$. Then taking a sum over $k = 0, 1, 2, \ldots$ we have
$$\mathrm{vec}\!\left(\widetilde{\mathbf{W}}(\infty)\right) - \mathrm{vec}\!\left(\widetilde{\mathbf{W}}(0)\right) = \eta\, \mathbf{Z}(0) \sum_{k=0}^{\infty} \left(\mathbf{I} - \eta \mathbf{H}(0)\right)^k \left(\mathbf{y} - \mathbf{u}(0)\right) = \mathbf{Z}(0)\, \mathbf{H}(0)^{-1} \left(\mathbf{y} - \mathbf{u}(0)\right).$$

The desired result thus follows:
$$\left\|\widetilde{\mathbf{W}}(\infty) - \widetilde{\mathbf{W}}(0)\right\|_F^2 = \left(\mathbf{y} - \mathbf{u}(0)\right)^\top \mathbf{H}(0)^{-1} \mathbf{Z}(0)^\top \mathbf{Z}(0)\, \mathbf{H}(0)^{-1} \left(\mathbf{y} - \mathbf{u}(0)\right) = \left(\mathbf{y} - \mathbf{u}(0)\right)^\top \mathbf{H}(0)^{-1} \left(\mathbf{y} - \mathbf{u}(0)\right).$$
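For the auxiliary update rule as written in (11) above, this limit is easy to verify numerically. The sketch below uses a random matrix in place of $\mathbf{Z}(0)$ (toy dimensions, purely illustrative) and checks that the total movement matches the closed form derived in this subsection.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 200, 30                          # D plays the role of m*d; toy sizes only
Z0 = rng.standard_normal((D, n)) / np.sqrt(D)
H0 = Z0.T @ Z0                          # H(0) = Z(0)^T Z(0)
y = rng.standard_normal(n)
w0 = rng.standard_normal(D)             # vec(W~(0)) = vec(W(0))
u0 = Z0.T @ w0                          # initial predictions u(0)

eta = 0.5 / np.linalg.eigvalsh(H0).max()
w = w0.copy()
for _ in range(2000):                   # iterate (11) until the residual is negligible
    w = w - eta * Z0 @ (Z0.T @ w - y)

lhs = np.linalg.norm(w - w0)                           # || W~(infty) - W~(0) ||_F
rhs = np.sqrt((y - u0) @ np.linalg.solve(H0, y - u0))  # closed form from Section 5.2
print(lhs, rhs)                                        # the two values agree closely
```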

6 Provable Learning using Two-Layer ReLU Neural Networks

Theorem 5.1 shows that $\sqrt{\frac{2\, \mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}{n}}$ controls the generalization error. In this section, we study what kind of functions can be provably learned in this setting. We assume the labels satisfy $y_i = g(\mathbf{x}_i)$ for some underlying function $g: \mathbb{R}^d \to \mathbb{R}$. A simple observation is that if we can prove
$$\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le M_g$$
for some quantity $M_g$ that is independent of the number of samples $n$, then Theorem 5.1 implies that we can provably learn the function $g$ on the underlying data distribution using $O\!\left(M_g^2 / \epsilon^2\right)$ samples, where $\epsilon$ is the target generalization error. The following theorem shows that this is indeed the case for a broad class of functions.

Theorem 6.1.

Suppose we have
$$y_i = \alpha \left(\boldsymbol{\beta}^\top \mathbf{x}_i\right)^{p}, \quad \forall i \in [n],$$
where $p = 1$ or $p = 2l$ for some positive integer $l$, $\alpha \in \mathbb{R}$, and $\boldsymbol{\beta} \in \mathbb{R}^d$. Then we have
$$\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le 3 p\, |\alpha| \cdot \|\boldsymbol{\beta}\|_2^{p}.$$

The proof of Theorem 6.1 is given in Appendix E.

Notice that for two label vectors $\mathbf{y}^{(1)}$ and $\mathbf{y}^{(2)}$, we have
$$\sqrt{\left(\mathbf{y}^{(1)} + \mathbf{y}^{(2)}\right)^\top (\mathbf{H}^\infty)^{-1} \left(\mathbf{y}^{(1)} + \mathbf{y}^{(2)}\right)} \le \sqrt{\left(\mathbf{y}^{(1)}\right)^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}^{(1)}} + \sqrt{\left(\mathbf{y}^{(2)}\right)^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}^{(2)}},$$
since $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}$ is a norm of $\mathbf{y}$ when $\mathbf{H}^\infty$ is positive definite. This implies that the sum of learnable functions is also learnable. Therefore, the following is a direct corollary of Theorem 6.1:

Corollary 6.2.

Suppose we have
$$y_i = \sum_{j=1}^{K} \alpha_j \left(\boldsymbol{\beta}_j^\top \mathbf{x}_i\right)^{p_j}, \quad \forall i \in [n], \qquad (12)$$
where for each $j \in [K]$, $p_j = 1$ or $p_j = 2 l_j$ for some positive integer $l_j$, $\alpha_j \in \mathbb{R}$, and $\boldsymbol{\beta}_j \in \mathbb{R}^d$. Then we have
$$\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le 3 \sum_{j=1}^{K} p_j\, |\alpha_j| \cdot \|\boldsymbol{\beta}_j\|_2^{p_j}. \qquad (13)$$

Corollary 6.2 shows that an overparameterized two-layer ReLU network can learn any function of the form (12) for which (13) is bounded. One can view (12) as a two-layer neural network with polynomial activations, where the $\boldsymbol{\beta}_j$'s are the weights in the first layer and the $\alpha_j$'s are the weights in the second layer. Below we give some specific examples.

Example 6.1 (Linear functions).

For $y_i = \alpha\, \boldsymbol{\beta}^\top \mathbf{x}_i$, we have $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le 3\, |\alpha| \cdot \|\boldsymbol{\beta}\|_2$.
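To make this bound concrete, here is a small numerical sketch (synthetic unit-norm data with arbitrary sizes, not the paper's experiments): it evaluates $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}$ for labels generated by a linear function as in Example 6.1 and for unstructured random labels. The former stays essentially flat as $n$ grows, consistent with Theorem 6.1, while the latter keeps growing.

```python
import numpy as np

def smooth_quantity(X, y, reg=1e-8):
    """sqrt(y^T (H^infty)^{-1} y), the quantity bounded in Theorem 6.1 / Corollary 6.2."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    H = G * (np.pi - np.arccos(G)) / (2.0 * np.pi)
    return float(np.sqrt(y @ np.linalg.solve(H + reg * np.eye(len(y)), y)))

rng = np.random.default_rng(0)
d = 20
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                      # ||beta||_2 = 1

for n in (100, 200, 400):                         # growing sample size (toy sizes)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y_lin = X @ beta                              # y_i = beta^T x_i  (Example 6.1)
    y_rnd = rng.choice([-1.0, 1.0], size=n)       # unstructured random labels
    print(n, round(smooth_quantity(X, y_lin), 2), round(smooth_quantity(X, y_rnd), 2))
```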

Example 6.2 (Quadratic functions).

For $y_i = \mathbf{x}_i^\top \mathbf{A} \mathbf{x}_i$ where $\mathbf{A}$ is symmetric, we can write down the eigen-decomposition $\mathbf{A} = \sum_{j=1}^{d} a_j \boldsymbol{\beta}_j \boldsymbol{\beta}_j^\top$ with $\|\boldsymbol{\beta}_j\|_2 = 1$. Then we have $y_i = \sum_{j=1}^{d} a_j \left(\boldsymbol{\beta}_j^\top \mathbf{x}_i\right)^2$, so $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le 6 \sum_{j=1}^{d} |a_j| = 6\, \|\mathbf{A}\|_*$, where $\|\mathbf{A}\|_*$ is the trace-norm of $\mathbf{A}$. This is also the class of two-layer neural networks with quadratic activation.

Example 6.3 (Cosine activation).

Suppose $y_i = \cos\!\left(\boldsymbol{\beta}^\top \mathbf{x}_i\right)$ for some $\boldsymbol{\beta} \in \mathbb{R}^d$. Using the Taylor series $\cos(z) = \sum_{j=0}^{\infty} \frac{(-1)^j z^{2j}}{(2j)!}$, we can write $y_i$ as a combination of even powers of $\boldsymbol{\beta}^\top \mathbf{x}_i$ as in (12). Thus (13) gives a bound on $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}}$ that depends only on $\|\boldsymbol{\beta}\|_2$ (growing roughly like $\|\boldsymbol{\beta}\|_2\, e^{\|\boldsymbol{\beta}\|_2}$).

Finally, we note that our “smoothness” requirement (13) is weaker than that in (Allen-Zhu et al., 2018a), as illustrated in the following example.

Example 6.4 (A not-so-smooth function).

Suppose $y_i = \alpha \left(\boldsymbol{\beta}^\top \mathbf{x}_i\right)^{p}$ for a large even degree $p$, where $\alpha$ and $\boldsymbol{\beta}$ are scaled so that $|\alpha|\, \|\boldsymbol{\beta}\|_2^{p} = O(1)$. We have $\sqrt{\mathbf{y}^\top (\mathbf{H}^\infty)^{-1} \mathbf{y}} \le 3 p\, |\alpha|\, \|\boldsymbol{\beta}\|_2^{p} = O(p)$ by Theorem 6.1. Thus the required quantity grows only linearly in the degree $p$, so our result implies that this function is learnable by 2-layer ReLU nets.

However, for the above function, Allen-Zhu et al. (2018a)'s generalization theorem would require a smoothness quantity that grows exponentially in the degree $p$ (roughly like $C^{p}$, where $C$ is a large constant and the requirement also depends on the target generalization error) to be bounded. This is clearly not satisfied for large $p$.

7 Conclusion

This paper shows how to give a fine-grained analysis of the optimization trajectory and the generalization ability of overparameterized two-layer neural networks trained by gradient descent. We believe that our approach can also be useful in analyzing overparameterized deep neural networks and other machine learning models.

Acknowledgements

This work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. The authors would like to thank Yi Zhang for helpful discussions.

References

Appendix

Appendix A Experimental Setup

The architecture of our neural networks is as described in Section 3.1. During the training process, we fix the second layer and only optimize the first layer, following the setting in Section 3.1. We fix the number of neurons to be in all experiments. We train the neural network using (full-batch) gradient descent (GD), with a fixed learning rate . Our theory requires a small scaling factor during the initialization (cf. (1)). We fix in all experiments. We train the neural networks until the training loss converges.

We use two image datasets, the CIFAR dataset (Krizhevsky & Hinton, 2009) and the MNIST dataset (LeCun et al., 1998), in our experiments. We only use the first two classes of images in the CIFAR dataset and the MNIST dataset, with training images and validation images in total for each dataset. In both datasets, for each image $\mathbf{x}$, we set the corresponding label to be $y = +1$ if the image belongs to the first class, and $y = -1$ otherwise. For each image in the dataset, we normalize the image so that $\|\mathbf{x}\|_2 = 1$, following the setup in Section 3.1.
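A minimal sketch of this preprocessing is given below, assuming the `torchvision` package is available; the dataset root, the choice of MNIST classes 0 and 1, and the use of the training split are illustrative assumptions on our part, not necessarily the exact pipeline behind the paper's figures.

```python
import numpy as np
from torchvision import datasets

def two_class_mnist(root="./data", classes=(0, 1)):
    """Build (X, y): rows of X are flattened images normalized to unit l2 norm,
    y[i] = +1 for the first class and -1 for the second (cf. Section 3.1).
    The root path and class choice are illustrative placeholders."""
    ds = datasets.MNIST(root=root, train=True, download=True)
    images = ds.data.numpy().reshape(len(ds), -1).astype(np.float64)
    labels = ds.targets.numpy()
    mask = np.isin(labels, classes)
    X = images[mask]
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # ||x_i||_2 = 1
    y = np.where(labels[mask] == classes[0], 1.0, -1.0)     # +/-1 labels
    return X, y
```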

In the experiments reported in Figure 2, we choose a specific portion of the (both training and test) data uniformly at random, and change their labels to random labels.

Our neural networks are trained using the PyTorch package Paszke et al. (2017), using (possibly multiple) NVIDIA Tesla V100 GPUs.

Appendix B Background on Generalization and Rademacher Complexity

Consider a loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. For a function $f: \mathbb{R}^d \to \mathbb{R}$, the population loss over the data distribution $\mathcal{D}$ as well as the empirical loss over $n$ samples $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ from $\mathcal{D}$ are defined as:
$$L_{\mathcal{D}}(f) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\!\left[\ell\!\left(f(\mathbf{x}), y\right)\right], \qquad L_S(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(f(\mathbf{x}_i), y_i\right).$$
Generalization error refers to the gap $L_{\mathcal{D}}(f) - L_S(f)$ for the learned function $f$ given the sample $S$.

Recall the standard definition of Rademacher complexity:

Definition B.1.

Given $n$ samples $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the empirical Rademacher complexity of a function class $\mathcal{F}$ (mapping from $\mathbb{R}^d$ to $\mathbb{R}$) is defined as:
$$\mathcal{R}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(\mathbf{x}_i)\right],$$
where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)^\top$ contains i.i.d. random variables drawn from the Rademacher distribution $\mathrm{unif}(\{+1, -1\})$.
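As a quick illustration of this definition (not the function class analyzed in Lemma 5.4), the sketch below estimates the empirical Rademacher complexity of the simple class of linear functions with bounded weight norm by Monte Carlo sampling of $\boldsymbol{\sigma}$; for this class the inner supremum has the closed form noted in the code, and the number of trials is an arbitrary choice.

```python
import numpy as np

def rademacher_linear(X, B, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w^T x : ||w||_2 <= B} on the sample X (rows x_i).

    For this class the inner supremum has a closed form:
    sup_{||w|| <= B} (1/n) sum_i sigma_i w^T x_i = (B/n) * || sum_i sigma_i x_i ||_2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        vals.append(B / n * np.linalg.norm(sigma @ X))
    return float(np.mean(vals))
```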

Rademacher complexity directly gives an upper bound on generalization error (see e.g. (Mohri et al., 2012)):

Theorem B.1.

Suppose the loss function $\ell(\cdot, \cdot)$ is bounded in $[0, c]$ and is $\rho$-Lipschitz in the first argument. Then with probability at least $1 - \delta$ over a sample $S$ of size $n$:
$$\sup_{f \in \mathcal{F}} \left\{ L_{\mathcal{D}}(f) - L_S(f) \right\} \le 2 \rho\, \mathcal{R}_S(\mathcal{F}) + 3 c \sqrt{\frac{\log(2/\delta)}{2n}}.$$

Therefore, as long as we can bound the Rademacher complexity of a certain function class that contains our learned predictor, we can obtain a generalization bound.

Appendix C Proofs for Section 4

In this section we prove Theorem 4.1.

We first show some technical lemmas. Most of them are already proved in (Du et al., 2018c) and we give proofs for them for completeness.

First, we have the following lemma which gives an upper bound on how much each weight vector can move during optimization.

Lemma C.1.

Under the same setting as Theorem 3.1, i.e., $\lambda_0 = \lambda_{\min}(\mathbf{H}^\infty) > 0$, $m$ sufficiently large and $\eta = O\!\left(\frac{\lambda_0}{n^2}\right)$, with probability at least $1 - \delta$ over the random initialization we have, for all $r \in [m]$ and all $k \ge 0$:
$$\|\mathbf{w}_r(k) - \mathbf{w}_r(0)\|_2 = O\!\left(\frac{\sqrt{n}\, \|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{m}\, \lambda_0}\right).$$

Proof.

From Theorem 3.1 we know $\Phi(\mathbf{W}(k)) \le \left(1 - \frac{\eta \lambda_0}{2}\right)^k \Phi(\mathbf{W}(0))$ for all $k \ge 0$, which implies