In deep learning, over-parametrization refers to the widely-adopted technique of using more parameters than necessary (Krizhevsky et al., 2012; Livni et al., 2014). Both computationally and statistically, over-parametrization is crucial for learning neural nets. Controlled experiments demonstrate that over-parametrization eases optimization by smoothing the non-convex loss surface (Livni et al., 2014; Sagun et al., 2017). Statistically, increasing model size without any regularization still improves generalization even after the model interpolates the data perfectly (Neyshabur et al., 2017b). This is surprising given the conventional wisdom on the trade-off between model capacity and generalization.
In the absence of an explicit regularizer, algorithmic regularization is likely the key contributor to good generalization. Recent works have shown that gradient descent finds the minimum norm solution fitting the data for problems including logistic regression, linearized neural networks, and matrix factorization (Soudry et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a; Ji & Telgarsky, 2018). Many of these proofs require a delicate analysis of the algorithm's dynamics, and some are not fully rigorous due to assumptions on the iterates. To the best of our knowledge, it is an open question to prove analogous results for even two-layer relu networks. (For example, the technique of Li et al. (2018) on two-layer neural nets with quadratic activations still falls within the realm of linear algebraic tools, which apparently do not suffice for other activations.)
We propose a different route towards understanding generalization: making the regularization explicit. The motivations are: 1) with an explicit regularizer, we can analyze generalization without fully understanding optimization; 2) it is unknown whether gradient descent provides additional implicit regularization beyond what explicit regularization already offers; 3) with a sufficiently weak regularizer, we can prove stronger results that apply to multi-layer neural nets with relu activations. Additionally, explicit regularization is perhaps more relevant to practice, because regularization is typically used when training real networks.
Concretely, we add a norm-based regularizer to the cross-entropy loss of a multi-layer feedforward neural network with relu activations. We show that if the regularizer is sufficiently weak, the global minimizer of the regularized objective achieves the maximum normalized margin among all models with the same architecture (Theorem 2.1). Informally, for models that perfectly classify the data, the margin is the smallest difference across all datapoints between the classifier score for the true label and the next best score, and the normalized margin is this margin after rescaling the parameters to unit norm. We are interested in the normalized margin because its inverse bounds the generalization error (see recent work (Bartlett et al., 2017; Neyshabur et al., 2017a, 2018) and our Theorem 3.1). Our work explains why optimizing the training loss can lead to parameters with a large margin and thus better generalization error.
At first glance, it might seem counterintuitive that decreasing the regularizer is the right approach. At a high level, we show that the regularizer only serves as a tiebreaker to steer the model towards choosing the largest normalized margin. Our proofs are simple, oblivious to the optimization procedure, and apply to any norm-based regularizer. We also show that an exact global minimum is unnecessary: if we approximate the minimum loss within a constant, we obtain the max-margin within a constant (Theorem 2.2).
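To make the tiebreaking intuition concrete, here is a minimal numpy sketch (our own illustration, not the paper's experiment) of the linear-predictor special case due to Rosset et al. (2004a, b) that we build on: minimizing weakly ℓ2-regularized logistic loss approximately recovers the maximum ℓ2-normalized-margin separator.

```python
import numpy as np

# Four separable points; the max ell_2-margin separator is w ∝ (1, 0),
# with normalized margin min_i y_i <x_i, w/||w||> = 2.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def grad(w, lam):
    # gradient of  mean_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||_2^2
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))  # = sigmoid(-margin_i)
    return -(s * y) @ X / len(y) + 2.0 * lam * w

lam = 1e-4                   # weak regularizer
w = np.array([0.1, 0.5])     # deliberately misaligned initialization
for _ in range(40000):
    w -= 0.5 * grad(w, lam)

w_hat = w / np.linalg.norm(w)
normalized_margin = np.min(y * (X @ w_hat))
# For small lam the regularizer acts only as a tiebreaker, and the
# normalized margin approaches the maximum value 2.
```

Shrinking `lam` further moves `normalized_margin` even closer to the maximum, matching the λ → 0 limit of Theorem 2.1.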
We further study the margin of two-layer networks: let γ_m be the max normalized margin of a neural net with m hidden units (formally defined in Section 3.1). Let γ_∞ be the largest possible margin of an infinite two-layer network. We will show three properties of these margins:
In Theorem 3.2, we show that the optimal normalized margin γ_m of two-layer networks is non-decreasing as the width m of the architecture grows, so the generalization error bound only improves with a wider network. Thus, even if the dataset is already separable, it could still be useful to increase the width to achieve a larger margin and better generalization. More formally, let n be the number of training examples. We additionally attain the maximum possible margin after over-parameterizing beyond n neurons: γ_m = γ_∞ once there are more hidden units than examples.
We compare the neural net margin to the standard margin for the kernel SVM on the same features. We design a simple data distribution (Figure 1) where the neural net margin is large but the kernel margin is small. This translates to a gap between the generalization error bounds for the two approaches and demonstrates the power of neural nets compared to kernel methods. We experimentally confirm that a gap does indeed exist.
In the context of bullet 2, our work is closely related to that of Rosset et al. (2007) and Neyshabur et al. (2014), who show that optimizing the loss over the parameters of a two-layer relu network is equivalent to optimizing the loss of a "convex neural net" parametrized by a distribution over hidden units. We go one step further and connect the weakly regularized training loss to the SVM.
We will also adopt this view of infinite-size neural networks to study how over-parametrization helps optimization. Prior works (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2018) show that gradient descent on two-layer networks becomes Wasserstein gradient flow over parameter distributions in the limit of infinite neurons. For this setting, we prove that perturbed Wasserstein gradient flow finds a global optimizer in polynomial time.
Finally, we empirically validate several of the claims made in this paper. First, we train a two-layer network on a one-dimensional classification task that is simple to visualize. In one dimension, it is possible to brute-force approximate the maximum neural network margin, and we show that training with a progressively smaller regularizer results in convergence to this margin. Second, we compare the generalization performance of neural networks and kernel methods and confirm that neural networks do achieve better generalization, as our theory predicts.
1.1 Additional Related Work
Zhang et al. (2016) and Neyshabur et al. (2017b) show that neural network generalization defies conventional explanations and requires new ones. One proposed explanation is the inductive bias of the training algorithm. Recent papers (Hardt et al., 2015; Brutzkus et al., 2017; Chaudhari et al., 2016) study inductive bias through training time and sharpness of local minima. Neyshabur et al. (2015a) propose a new steepest descent algorithm in a geometry invariant to weight rescaling and show that this improves generalization. Morcos et al. (2018) relate generalization in deep nets to the number of "directions" in the neurons. Other papers (Gunasekar et al., 2017; Soudry et al., 2018; Nacson et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a) study implicit regularization towards a specific solution. Ma et al. (2017) show that implicit regularization can help gradient descent avoid overshooting optima. Rosset et al. (2004a, b) study logistic regression with weak regularization and show convergence to the max-margin solution. We adopt their techniques and extend their results.
Recent works have also derived tighter Rademacher complexity bounds for deep neural networks (Neyshabur et al., 2015b; Bartlett et al., 2017; Neyshabur et al., 2017a; Golowich et al., 2017) and new compression based generalization properties (Arora et al., 2018b). Dziugaite & Roy (2017) manage to compute non-vacuous generalization bounds from PAC-Bayes bounds. Neyshabur et al. (2018) investigate the Rademacher complexity of two-layer networks. Liang & Rakhlin (2018) and Belkin et al. (2018) study the generalization of kernel methods.
On the optimization side, Soudry & Carmon (2016) explain why over-parametrization can remove bad local minima. Safran & Shamir (2016) show that over-parametrization can improve the quality of the random initialization. Haeffele & Vidal (2015), Nguyen & Hein (2017), and Venturi et al. (2018) show that for sufficiently overparametrized networks, all local minima are global, but do not show how to find these minima via gradient descent. Du & Lee (2018) show that for two-layer networks with quadratic activations, all second-order stationary points are global minimizers. Arora et al. (2018a) interpret over-parametrization as a means of implicit acceleration during optimization. Mei et al. (2018), Chizat & Bach (2018), and Sirignano & Spiliopoulos (2018) take a distributional view of over-parametrized networks. Chizat & Bach (2018) show that Wasserstein gradient flow converges to global optimizers under structural assumptions. We extend this to a polynomial-time result.
Let ℝ denote the set of real numbers. We will use ‖·‖ to indicate a general norm, with ‖·‖₁, ‖·‖₂, ‖·‖∞ denoting the ℓ₁, ℓ₂, ℓ∞ norms on finite-dimensional vectors, respectively, and ‖·‖_F denoting the Frobenius norm on a matrix. In general, we use a bar on top of a symbol to denote a unit vector: when applicable, x̄ = x/‖x‖, where the norm will be clear from context. Let S^{d−1} be the unit sphere in d dimensions. Let L^p(S^{d−1}) be the space of functions on S^{d−1} for which the p-th power of the absolute value is Lebesgue integrable. For α ∈ L^p(S^{d−1}), we overload notation and write ‖α‖_p for the corresponding L^p norm. Additionally, for α ∈ L¹(S^{d−1}) and β ∈ L^∞(S^{d−1}), or for α, β ∈ L²(S^{d−1}), we can define the inner product ⟨α, β⟩ as the integral of αβ over S^{d−1}.
Throughout this paper, we reserve the symbol X to denote the collection of datapoints (as a matrix) and Y to denote the labels. We use d to denote the dimension of our data. We often use Θ to denote the parameters of a prediction function f, and f(x; Θ) to denote the prediction of f on datapoint x.
We will use the notation ≲ and ≳ to mean less than or greater than up to a universal constant, respectively. Unless stated otherwise, we use c as a placeholder for some universal constant in upper and lower bounds. We will use poly(·) to denote some universal constant-degree polynomial in the arguments.
2 Weak Regularizer Guarantees Max Margin Solutions
In this section, we will show that when we add a weak regularizer to cross-entropy loss with a positive-homogeneous prediction function, the normalized margin of the optimum converges to some max-margin solution. As a concrete example, feedforward relu networks are positive-homogeneous.
Let l be the number of labels, so the i-th example has label y_i ∈ {1, …, l}. We work with a family of prediction functions f(·; Θ) that are a-positive-homogeneous in their parameters for some a > 0: f(x; cΘ) = c^a f(x; Θ) for all c > 0. We additionally require that f is continuous in Θ. For some general norm ‖·‖, we study the λ-regularized cross-entropy loss L_λ, defined as

L_λ(Θ) ≜ (1/n) Σ_{i=1}^n −log [ exp(f(x_i; Θ)_{y_i}) / Σ_{j=1}^l exp(f(x_i; Θ)_j) ] + λ‖Θ‖^r   (2.1)

for some fixed r > 0.
Define the max normalized margin as

γ* ≜ max_{‖Θ‖ ≤ 1} min_i [ f(x_i; Θ)_{y_i} − max_{j ≠ y_i} f(x_i; Θ)_j ],   (2.2)

and let Θ* be a parameter achieving this maximum. We show that with a sufficiently small regularization level λ, the normalized margin approaches the maximum margin γ*. Our theorem and proof are inspired by the result of Rosset et al. (2004a, b), who analyze the special case when f is a linear predictor. In contrast, our result can be applied to non-linear f as long as f is homogeneous.
Theorem 2.1. Assume the training data is separable by a network with an optimal normalized margin γ* > 0. Then the normalized margin of the global optimum of the weakly-regularized objective (equation 2.1) converges to γ* as the strength λ of the regularizer goes to zero. Mathematically, let γ_λ denote the normalized margin of the optimizer of L_λ, with γ* defined in equation 2.2. Then γ_λ → γ* as λ → 0.
An intuitive explanation for our result is as follows: because of the homogeneity, for a parameter Θ with direction Θ̄ = Θ/‖Θ‖, the loss roughly satisfies the following (for small λ, and ignoring problem parameters such as n):

L_λ(Θ) ≈ exp(−‖Θ‖^a · margin(Θ̄)) + λ‖Θ‖^r.

Thus, the loss focuses on choosing parameters with larger margin, and the regularization term biases the loss to select parameters with a smaller norm. The full proof of the theorem is deferred to Section A.1.
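The tie-breaking mechanism can be sketched as a two-step minimization (our own informal derivation, with γ(Θ̄) denoting the margin of the normalized parameters and the loss replaced by its dominant term):

```latex
\min_{\Theta} L_\lambda(\Theta)
  \;\approx\; \min_{c \ge 0}\, \min_{\|\bar\Theta\| = 1}\,
    \exp\!\big(-c^{a}\,\gamma(\bar\Theta)\big) + \lambda c^{r}
  \;=\; \min_{c \ge 0}\, \exp\!\big(-c^{a}\,\gamma^{*}\big) + \lambda c^{r},
```

since for every fixed scale c the inner minimum over directions is attained by a maximum-margin direction. The regularizer then only determines the optimal scale c, which grows as λ → 0.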
We can also provide an analogue of Theorem 2.1 for the binary classification setting. For this setting, our prediction is now a single real output and we train using logistic loss. We provide formal definitions and results in Section A.2. Our theory for two-layer neural networks (see Section 3) builds on this setting.
2.1 Optimization Accuracy
Since L_λ is typically hard to optimize exactly for neural nets, it would be ideal to relax the condition that Θ exactly minimizes L_λ. Thus, we ask: how accurately do we need to optimize L_λ to obtain a margin that approximates γ* up to a constant? The following theorem shows that it suffices to find Θ achieving a constant-factor multiplicative approximation of the minimum loss, provided λ is some sufficiently small polynomial in the problem parameters. Though our theorem is stated for the general multi-class setting, our result applies for binary classification as well. We provide the proof in Section A.3.
Theorem 2.2. In the setting of Theorem 2.1, suppose that we choose λ polynomially small, with a sufficiently large exponent (that only depends on the homogeneity a). Let Θ denote a 2-approximate minimizer of L_λ (the exact approximation constant is not important, so we choose 2 for simplicity), so L_λ(Θ) ≤ 2 min_{Θ'} L_λ(Θ'). Denote the normalized margin of Θ by γ. Then γ approximates γ* up to a constant factor.
3 Margins of Over-parameterized Two-layer Homogeneous Neural Nets
In Section 2 we showed that a weakly-regularized logistic loss leads to the maximum normalized margin. In this section, we analyze the properties of the max-margin of neural nets more closely. We will contrast neural networks with kernel methods, for which margins have already been extensively studied. Towards a first-cut understanding, we focus on two-layer networks for binary classification.
First, in Section 3.1 we provide a bound stating that the generalization error is roughly linear in the inverse of the margin, establishing that a larger margin implies better generalization. In Section 3.2, we show that the maximum normalized margin is non-decreasing with the hidden layer size and stays constant as soon as there are more hidden units than data points. This suggests that increasing the size of the network improves the generalization of the solution.
Second, in Section 3.3, we draw an analogy to classical kernel methods by proving that the maximum ℓ2-normalized margin of an over-parameterized neural net is equal to half the maximum possible ℓ1-normalized margin of linear functionals on a lifted feature space. In other words, we establish an equivalence between neural networks and the 1-norm SVM (Zhu et al., 2004) on the lifted features. These features are constructed by applying the activation function on all possible hidden layer weights.
Third, continuing this analogy, we will compare the generalization power of a two-layer neural network to that of a kernel method on the lifted space. This kernel method corresponds to fixing random weights for the hidden layer and solving a 2-norm max-margin problem on the top layer weights. We demonstrate instances where two layer neural networks give better generalization error guarantees than the kernel method.
3.1 Setup and Margin-based Generalization Error
In the rest of the paper, we work with two-layer neural networks with a single output for binary classification. We use m to denote the number of hidden units, w_1, …, w_m ∈ ℝ^d for the weight vectors in the first layer, and a_1, …, a_m ∈ ℝ for the weights in the second layer. We let Θ denote the collection of all the parameters. We assume in this section that the activation φ is 1-homogeneous and 1-Lipschitz. The network thus computes a single score

f(x; Θ) = Σ_{j=1}^m a_j φ(w_j⊤x).
We consider ℓ2 regularization from here on. The regularized logistic loss of the architecture with m hidden units is therefore

L_{λ,m}(Θ) ≜ (1/n) Σ_{i=1}^n log(1 + exp(−y_i f(x_i; Θ))) + λ‖Θ‖₂²,   (3.1)
where ‖Θ‖₂ denotes the Euclidean norm of all the parameters in Θ. We note that f and the regularizer are both 2-homogeneous in Θ, so the results of Section 2 apply to L_{λ,m}. (Although Theorem 2.1 is written in the language of multi-class prediction where the classifier outputs l scores, the results translate to single-output binary classification. See Section A.2.)
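The 2-homogeneity claim is easy to check numerically. Below is a sketch with our own variable names (first-layer rows `W`, second-layer weights `a`): since relu is 1-homogeneous, scaling every parameter by c scales the score by c².

```python
import numpy as np

def score(x, W, a):
    # two-layer network: f(x; Theta) = sum_j a_j * relu(<w_j, x>)
    return a @ np.maximum(W @ x, 0.0)

def reg_logistic_loss(X, y, W, a, lam):
    # average logistic loss plus lam times the squared ell_2 norm
    # of all parameters, as in the regularized objective above
    f = np.array([score(x, W, a) for x in X])
    return np.mean(np.log1p(np.exp(-y * f))) + lam * (np.sum(W**2) + np.sum(a**2))

rng = np.random.default_rng(0)
W, a, x = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)

c = 3.0
# relu is 1-homogeneous, so f(x; c * Theta) = c^2 * f(x; Theta):
assert np.isclose(score(x, c * W, c * a), c**2 * score(x, W, a))
```

With this degree-2 homogeneity, the weak-regularization results of Section 2 apply with a = 2.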
Following our conventions from Section 2, we denote the optimizer of L_{λ,m} by Θ_{λ,m}, the normalized margin of Θ_{λ,m} by γ_{λ,m}, the max-margin solution by Θ*_m, and the max-margin by γ*_m. We emphasize the size m of the network in our notation. Since our classifier now predicts a single real value, we need to redefine the margin:

γ*_m ≜ max_{‖Θ‖₂ ≤ 1} min_i y_i f(x_i; Θ).

When the data is not separable by an m-unit neural net, γ*_m is zero by definition.
Recall that X denotes the matrix with all the datapoints as columns, and Y denotes the labels. We sample X and Y i.i.d. from the data-generating distribution P, which is supported on a set of bounded norm. We can define the population 0-1 loss and the training 0-1 loss of the network as

L(Θ) ≜ Pr_{(x,y)∼P}[sign(f(x; Θ)) ≠ y]  and  L̂(Θ) ≜ (1/n) Σ_{i=1}^n 1[sign(f(x_i; Θ)) ≠ y_i].
We will let B² denote the average squared norm of the datapoints and C an upper bound on the norm of a single datapoint. The following theorem shows that the generalization error depends on the parameters only through the inverse of the margin on the training data. We provide a proof in Section C.1.
Theorem 3.1. Suppose φ is 1-Lipschitz and 1-homogeneous. Then with probability at least 1 − δ over the draw of X and Y, for any Θ that separates the data with normalized margin γ > 0,

L(Θ) ≲ B/(γ√n) + ε,

where ε is a lower-order term. Note that ε is typically small, and thus the above bound mainly scales with B/(γ√n). As a corollary, with probability 1 − δ, the same bound holds for the margin obtained by weakly-regularized training as λ → 0. (The limiting margin as λ → 0 does not necessarily exist; the corollary should be interpreted as holding, up to an arbitrarily small additive term, for all sufficiently small λ.) Above we implicitly assume γ*_m > 0, since otherwise the right-hand side of the bound is vacuous.
One consequence of the above theorem and Theorem 2.2 is that if λ is polynomially small in the problem parameters, we only need to optimize L_{λ,m} up to a constant multiplicative factor to obtain parameters with generalization bounds roughly as good as those for the max-margin solution Θ*_m.
3.2 The max margin is non-decreasing in the hidden layer size
Now we show that the maximum normalized margin is nondecreasing with the hidden layer size and stays constant once we have more hidden units than examples.
Theorem 3.2. In the setting of Section 3.1, recall that γ*_m denotes the max normalized margin of a two-layer neural network with hidden layer size m. Then

γ*_m ≤ γ*_{m+1} for all m ≥ 1, and γ*_m stays constant for all m > n.
We note that γ*_m will be positive when φ is a sufficiently powerful activation such as relu or sigmoid and the datapoints are not repetitive, so the neural network can fit any labeling of the data. We prove Theorem 3.2 in Section B. Theorem 3.2 can explain why additional over-parametrization has been observed to improve generalization in two-layer networks (Neyshabur et al., 2017b). Our margin does not decrease with a larger network size, and therefore Theorem 3.1 gives a better generalization bound. We precisely characterize the value of the limiting margin in the following section.
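The non-decreasing part of Theorem 3.2 rests on a simple mechanism, which we can check numerically (a sketch of the embedding step only, not the paper's full proof): any m-unit network embeds into an (m+1)-unit network by appending a zero hidden unit, leaving both the score and the parameter norm unchanged, so the feasible set of the max-margin problem only grows with width.

```python
import numpy as np

def score(x, W, a):
    # f(x; Theta) = sum_j a_j * relu(<w_j, x>)
    return a @ np.maximum(W @ x, 0.0)

rng = np.random.default_rng(1)
m, d = 4, 3
W, a = rng.normal(size=(m, d)), rng.normal(size=m)
x = rng.normal(size=d)

# Pad with one zero hidden unit: same function, same ell_2 norm of Theta.
W_pad = np.vstack([W, np.zeros((1, d))])
a_pad = np.append(a, 0.0)

assert np.isclose(score(x, W_pad, a_pad), score(x, W, a))
assert np.isclose(np.sum(W_pad**2) + np.sum(a_pad**2),
                  np.sum(W**2) + np.sum(a**2))
# Hence every unit-norm m-unit network is also a unit-norm (m+1)-unit
# network, and the max normalized margin cannot decrease with width.
```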
3.3 The max margin of neural nets is equivalent to SVM in lifted space
We link infinite-size neural networks to the SVM over a lifted space, defined via a lifting function ϕ mapping a datapoint x to an infinite feature vector consisting of all possible hidden-unit activations:

ϕ(x) ≜ (φ(ū⊤x))_{ū ∈ S^{d−1}}.

We look at the margin of linear functionals α corresponding to this lifted space. The 1-norm SVM over the lifted features solves for the maximum margin:

max_{‖α‖₁ ≤ 1} min_i y_i ⟨α, ϕ(x_i)⟩,   (3.6)
where we rely on the inner product and 1-norm over functions on the sphere defined in Section 1.2. A priori, it is unclear how to optimize this, since the kernel trick does not work for the 1-norm. Here we will show that optimizing two-layer neural networks with weak regularization is equivalent to solving equation 3.6.
Rosset et al. (2007) and Neyshabur et al. (2014) show a similar equivalence, but between a lifted logistic regression problem and equation 3.1. In contrast, the above theorem, proved in Section B, shows the equivalence between equation 3.1 and the 1-norm SVM when the regularizer is small. (The factor of 2 arises from the relation that every unit-norm parameter corresponds to a functional in the lifted space with a correspondingly bounded 1-norm.)
3.4 Comparison to kernel methods
We compare the SVM margin, attainable by a finite neural network, to the margin attainable via kernel methods. Following the setup of Section 3.3, we define the kernel problem over the lifted features:

γ_kernel ≜ max_{‖α‖₂ ≤ 1} min_i y_i ⟨α, ϕ(x_i)⟩,   (3.8)

where the 2-norm constraint replaces the 1-norm constraint of equation 3.6. (We scale the problem appropriately to make the lemma statement below cleaner.) First, γ_kernel can be used to obtain a standard upper bound on the generalization error of the kernel SVM. Following the notation of Section 3.1, we will let L_kernel denote the 0-1 population classification error for the optimizer of equation 3.8.
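The 2-norm problem is tractable because the corresponding kernel has a closed form. With Gaussian rather than uniform random hidden weights (a standard variant differing from weights uniform on the sphere only by an overall scaling), the relu random-feature kernel is the degree-1 arc-cosine kernel of Cho & Saul; a quick Monte-Carlo check of that formula (our own illustration):

```python
import numpy as np

def arccos1_kernel(x, xp):
    # E_{u ~ N(0, I)}[relu(<u, x>) relu(<u, x'>)]
    #   = ||x|| ||x'|| (sin t + (pi - t) cos t) / (2 pi),  t = angle(x, x')
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    t = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
    return nx * nxp * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

rng = np.random.default_rng(0)
x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # orthogonal pair

U = rng.normal(size=(200_000, 2))                   # random relu features
mc = np.mean(np.maximum(U @ x, 0.0) * np.maximum(U @ xp, 0.0))

# Both should be close to 1 / (2 pi) for orthogonal unit vectors.
assert abs(mc - arccos1_kernel(x, xp)) < 0.01
```

In practice the kernel SVM in our comparisons is implemented exactly this way: fix random hidden weights, then solve a 2-norm max-margin problem over the top-layer weights.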
The bound above follows from standard techniques (Bartlett & Mendelson, 2002), and we provide a full proof in Section C.1. We construct a data distribution for which this lemma does not give a good bound for kernel methods, but Theorem 3.1 does imply good generalization for two-layer networks.
Theorem 3.5. There exists a data distribution D for which the SVM with relu features has a large margin, and hence with probability 1 − δ over the choice of n i.i.d. samples from D obtains a small generalization error bound (up to a typically lower-order term). Meanwhile, with high probability the kernel SVM has a much smaller margin, and therefore the generalization upper bound from Lemma 3.4 is larger by a substantial factor.
Proof sketch for Theorem 3.5.
We base D on the distribution of examples described below. Here e_i is the i-th standard basis vector, and we use x[j] to represent the j-th coordinate of x (since the subscript is reserved to index training examples).
Figure 1 shows samples from D when there are 3 dimensions. From the visualization, it is clear that there is no linear separator for D. As Lemma D.1 shows, a relu network with four neurons can fit this relatively complicated decision boundary. On the other hand, for kernel methods, we prove that the symmetries in D induce cancellation in feature space. The following lemmas, proved in Section D.1, formalize this cancellation and show that it results in a small margin for kernel methods.
Lemma 3.6 (Margin upper bound tool).
In the setting of Theorem 3.5, we have
Combining these lemmas gives us the desired bound on γ_kernel.
Gap in regression setting:
We are able to prove an even larger gap between neural networks and kernel methods in the regression setting, where we wish to interpolate continuous labels. Analogously to the classification setting, optimizing a regularized squared-error loss on neural networks is equivalent to solving a minimum 1-norm regression problem (see Theorem D.3). Furthermore, kernel methods correspond to a minimum 2-norm problem. We construct distributions where the 1-norm solution has a small generalization error bound, whereas the 2-norm solution has a generalization error bound that does not decrease with the sample size and is thus vacuous. In Section D.2, we define the 1-norm and 2-norm regression problems. In Theorem D.6 we formalize our construction.
4 Perturbed Wasserstein gradient flow finds global optimizers in polynomial time
In the prior section, we studied the limiting behavior of the generalization of a two-layer network as its width goes to infinity. In this section, we will now study the limiting behavior of the optimization algorithm, gradient descent. Prior work (Mei et al., 2018; Chizat & Bach, 2018) has shown that as the hidden layer size grows to infinity, gradient descent for a finite neural network approaches the Wasserstein gradient flow over distributions of hidden units (defined in equation 4.1). Chizat & Bach (2018) also prove that Wasserstein gradient flow converges to a global optimizer in this setting but do not specify a convergence rate.
We show that a perturbed version of Wasserstein gradient flow converges in polynomial time. The informal take-away of this section is that a perturbed version of gradient descent converges in polynomial time on infinite-size neural networks (for the right notion of infinite size).
Formally, we optimize the following functional over distributions ρ on ℝ^{d+1}:

L[ρ] ≜ R(⟨Φ, ρ⟩) + ⟨V, ρ⟩,

where Φ maps the parameters of a single unit to a vector of per-example features, V is a regularizer on the unit's parameters, R is a convex loss, and ⟨·, ρ⟩ denotes the expectation under ρ. In this work, we consider 2-homogeneous Φ and V. We will additionally require that V is nonnegative and is positive on the unit sphere. Finally, we need standard regularity assumptions on R, Φ, and V:
Assumption 4.1 (Regularity conditions on R, Φ, V).
Φ and V are differentiable as well as upper bounded and Lipschitz on the unit sphere. R is Lipschitz and its Hessian has bounded operator norm.
We provide more details on the specific parameters (for boundedness, Lipschitzness, etc.) in Section E.1. We note that relu networks satisfy every condition but differentiability of Φ. (The relu activation is non-differentiable at 0 and hence the gradient flow is not well-defined; Chizat & Bach (2018) acknowledge this same difficulty with relu.) We can fit a neural network under our framework as follows:
Example 4.2 (Logistic loss for neural networks).
We interpret ρ as a distribution over the parameters of single hidden units. Let θ denote the parameters of one unit, Φ(θ) collect its activations on the n training examples, and V(θ) be proportional to ‖θ‖₂². In this case, ⟨Φ, ρ⟩ is a distributional neural network that computes an output for each of the n training examples (like a standard neural network, it also computes a weighted sum over hidden units). We can compute the distributional version of the regularized logistic loss in equation 3.1 by setting R to be the average logistic loss over these n outputs and V to be the regularizer.
We will define the functional derivative L′[ρ] and the induced velocity field v[ρ] ≜ −∇L′[ρ]. Informally, L′[ρ] is the gradient of L with respect to ρ, and v is the induced velocity field on the hidden units. For the standard Wasserstein gradient flow dynamics, ρ_t evolves according to

d/dt ρ_t = −∇·(v[ρ_t] ρ_t),

where ∇· denotes the divergence of a vector field. For neural networks, these dynamics formally define continuous-time gradient descent when the hidden layer has infinite size (see Theorem 2.6 of Chizat & Bach (2018), for instance).
We propose the following modification of the Wasserstein gradient flow dynamics:

d/dt ρ_t = −∇·(v[ρ_t] ρ_t) + σ(U − ρ_t),   (4.2)

where U is the uniform distribution on the unit sphere and σ > 0 is a small mixing rate. In our perturbed dynamics, we add uniform noise over the sphere. For infinite-size neural networks, one can informally interpret this as re-initializing a very small fraction of the neurons at every step of gradient descent. We prove convergence to a global optimizer in time polynomial in the dimension d, the accuracy 1/ε, and the regularity parameters.
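A finite-particle caricature of the perturbed dynamics (our own discretization with made-up hyperparameters, not the paper's algorithm): N particles follow the velocity field of the distributional logistic loss from Example 4.2, and at every step a small random fraction is re-initialized uniformly on the sphere, mimicking the mixing with U.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D dataset for the distributional two-layer net of Example 4.2.
x = np.array([1.0, -1.0])
y = np.array([1.0, -1.0])
lam, lr, sigma, N = 1e-3, 0.2, 0.005, 200

def resample(k):
    # fresh particles uniform on the unit circle in (w, a) space
    ang = rng.uniform(0.0, 2 * np.pi, size=k)
    return np.cos(ang), np.sin(ang)

w, a = resample(N)   # particle parameters theta_j = (w_j, a_j)

def data_loss(w, a):
    f = (a @ np.maximum(np.outer(w, x), 0.0)) / N  # f(x_i) under rho
    return np.mean(np.log1p(np.exp(-y * f)))

init = data_loss(w, a)
for _ in range(2000):
    H = np.maximum(np.outer(w, x), 0.0)            # H[j, i] = relu(w_j x_i)
    f = (a @ H) / N
    g = -y / (1.0 + np.exp(y * f)) / len(y)        # d(loss)/d f_i
    # per-particle velocity = -gradient of the functional derivative
    grad_a = H @ g + 2 * lam * a
    grad_w = ((a[:, None] * (H > 0) * x[None, :]) @ g) + 2 * lam * w
    a -= lr * grad_a
    w -= lr * grad_w
    mask = rng.random(N) < sigma                   # sigma-mixing with U
    w[mask], a[mask] = resample(int(mask.sum()))
# The data loss drops well below its initial value of about log(2).
```

Despite the constant injection of fresh particles, the descent term dominates and the loss decreases; in the analysis, the injected mass is what lets the flow escape bad regions.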
Theorem 4.3 (Theorem E.4 with regularity parameters omitted).
Suppose that Φ and V are 2-homogeneous and the regularity conditions of Assumption 4.1 are satisfied. Also assume that from starting distribution ρ_0, a solution to the dynamics in equation 4.2 exists. Let ε > 0 be a desired error threshold, and choose the noise level σ and running time accordingly, where the regularity parameters for R, Φ, and V are hidden in the poly(·) notation. Then, perturbed Wasserstein gradient flow converges to an ε-approximate global minimum in time polynomial in d and 1/ε. A variant of this result holds when we additionally assume that Φ and V have Lipschitz gradients; in that setting, a polynomial-time convergence result also holds, which we state in Section E.3.
We first verify the normalized margin convergence on a two-layer network with one-dimensional input. A single hidden unit computes a_j φ(w_j x + b_j), where φ is the relu activation. We add ℓ2-regularization to all parameters and compare the resulting normalized margin to that of an approximate solution of the SVM problem with features φ(wx + b) over all unit-norm (w, b). Writing this infinite feature vector is intractable, so we solve an approximate version by choosing 1000 evenly spaced values of (w, b) on the unit circle. Our theory predicts that with decreasing regularization, the margin of the neural network converges to the SVM objective. In Figure 2, we plot this margin convergence and visualize the final networks and ground-truth labels. The network margin approaches the ideal one as λ → 0, and the visualization shows that the network and SVM functions are extremely similar.
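The discretized 1-norm SVM described above is a small linear program. Here is a sketch of how one might set it up (scipy-based, with our own toy labels and a coarser grid than the experiment's 1000 points):

```python
import numpy as np
from scipy.optimize import linprog

relu = lambda z: np.maximum(z, 0.0)

# 1-D dataset that no single threshold separates.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1.0, -1.0, -1.0, 1.0])

# Discretize the unit circle of hidden-unit parameters (w, b).
K = 64
theta = np.linspace(0.0, 2 * np.pi, K, endpoint=False)
Phi = relu(np.outer(x, np.cos(theta)) + np.sin(theta)[None, :])  # (n, K)

# 1-norm SVM:  max t  s.t.  y_i <alpha, Phi_i> >= t,  ||alpha||_1 <= 1.
# Split alpha = p - q with p, q >= 0; variables z = [p, q, t].
n = len(x)
c = np.zeros(2 * K + 1)
c[-1] = -1.0                                        # linprog minimizes, so max t
A_margin = np.hstack([-y[:, None] * Phi, y[:, None] * Phi, np.ones((n, 1))])
A_norm = np.concatenate([np.ones(2 * K), [0.0]])[None, :]
res = linprog(c, A_ub=np.vstack([A_margin, A_norm]),
              b_ub=np.concatenate([np.zeros(n), [1.0]]),
              bounds=[(0, None)] * (2 * K) + [(None, None)],
              method="highs")
margin = res.x[-1]   # positive: the relu features separate this data
```

Refining the grid gives a tighter approximation to the true lifted-SVM margin, which is the quantity the weakly-regularized network margin converges to.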
Next, we experiment on synthetic data in a higher-dimensional setting. For classification and regression, we compare the generalization error and predicted generalization upper bounds (we compute the leading term that is linear in the norm or inverse margin; from Theorem 3.1 and Lemmas 3.4, D.4, and D.5) of a trained neural network against a kernel SVM with relu features as we vary the number of samples n. For classification we plot 0-1 error, whereas for regression we plot squared error. Our ground truth comes from a random neural network with 6 hidden units. For classification, we used rejection sampling to obtain datapoints with unnormalized margin of at least 0.1 on the ground-truth network. We use a fixed input dimension. For all experiments, we train the network for 20000 steps with a small regularizer and average over 100 trials for each plot point.
The plots in Figure 3 show that two-layer networks clearly outperform kernel methods in test error as n grows. However, there seems to be looseness in the upper bounds for kernel methods: the kernel generalization bound appears to stay constant with n (as predicted by our theory for regression), but the kernel test error decreases. There is also some variance in the neural network generalization bound for classification. This occurred likely because we did not tune the learning rate and training time, so the optimization failed to find the best margin.
In Section F, we include additional experiments training modified WideResNet architectures on CIFAR10 and CIFAR100. Although ResNet is not homogeneous, we still report interesting increases in generalization performance from annealing the weight decay during training, versus staying at a fixed decay rate.
We have made the case that maximizing margin is one of the inductive biases of relu networks trained with cross-entropy loss. We show that we can obtain a maximum normalized margin by training with a weak regularizer. We also prove that a larger normalized margin implies better generalization for two-layer nets. Our work leaves open the question of how the normalized margin relates to generalization in much deeper neural networks; this is a fascinating theoretical and empirical question for future work. On the optimization side, we make progress towards understanding over-parametrized gradient descent by analyzing infinite-size neural networks. A natural direction for future work is to apply our theory to optimize the margin of finite-sized neural networks.
JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. We also thank Nati Srebro and Suriya Gunasekar for helpful discussions in the early stage of this work.
- Arora et al. (2018a) Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018a.
- Arora et al. (2018b) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018b.
- Ball (1997) Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1–58, 1997.
- Bartlett & Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249, 2017.
- Belkin et al. (2018) Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
- Brutzkus et al. (2017) Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
- Chaudhari et al. (2016) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
- Chizat & Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
- Du & Lee (2018) Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
- Du et al. (2017) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
- Dziugaite & Roy (2017) Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
- Golowich et al. (2017) Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
- Gunasekar et al. (2017) Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.
- Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
- Gunasekar et al. (2018b) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018b.
- Haeffele & Vidal (2015) Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt et al. (2015) Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
- Ji & Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
- Kakade et al. (2009) Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800, 2009.
- Koltchinskii et al. (2002) Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Li et al. (2018) Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2–47, 2018.
- Liang & Rakhlin (2018) Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint, August 2018.
- Livni et al. (2014) Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.
- Ma et al. (2017) Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017.
- Mei et al. (2018) Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
- Morcos et al. (2018) Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
- Nacson et al. (2018) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.
- Neyshabur et al. (2014) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
- Neyshabur et al. (2015a) Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015a.
- Neyshabur et al. (2015b) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401, 2015b.
- Neyshabur et al. (2017a) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
- Neyshabur et al. (2017b) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956, 2017b.
- Neyshabur et al. (2018) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
- Nguyen & Hein (2017) Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
- Rosset et al. (2004a) Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941–973, 2004a.
- Rosset et al. (2004b) Saharon Rosset, Ji Zhu, and Trevor J Hastie. Margin maximizing loss functions. In Advances in neural information processing systems, pp. 1237–1244, 2004b.
- Rosset et al. (2007) Saharon Rosset, Grzegorz Swirszcz, Nathan Srebro, and Ji Zhu. ℓ1 regularization in infinite dimensional feature spaces. In International Conference on Computational Learning Theory, pp. 544–558. Springer, 2007.
- Safran & Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.
- Sagun et al. (2017) Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
- Sirignano & Spiliopoulos (2018) Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
- Soudry & Carmon (2016) Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Soudry et al. (2018) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1q7n9gAb.
- Tibshirani (2013) Ryan J Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.
- van Laarhoven (2017) Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
- Venturi et al. (2018) Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhu et al. (2004) Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pp. 49–56, 2004.
Appendix A Missing Proofs in Section 2
We first show that the regularized loss L_λ does indeed have a global minimizer.

We will argue in the setting of Theorem 2.1, where ℓ is the multi-class cross-entropy loss, because the logistic loss case is analogous. We first note that L_λ is continuous in Θ, because f(Θ; x) is continuous in Θ and the term inside the logarithm is always positive. Next, define M := L_λ(0). Then we note that for ‖Θ‖ > (M/λ)^{1/r}, we must have L_λ(Θ) ≥ λ‖Θ‖^r > M. It follows that inf_Θ L_λ(Θ) = inf_{‖Θ‖ ≤ (M/λ)^{1/r}} L_λ(Θ). However, there must be a value Θ_λ which attains this infimum, because {Θ : ‖Θ‖ ≤ (M/λ)^{1/r}} is a compact set and L_λ is continuous. Thus, inf_Θ L_λ(Θ) is attained by some Θ_λ. ∎
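The radius bound in this compactness argument is easy to check numerically. Below is a minimal sketch; the linear predictor f(Θ; x) = Θᵀx, the two-point dataset, and the exponent r = 2 are illustrative assumptions of ours, not part of the setting above. It verifies that every point on the sphere of radius (L_λ(0)/λ)^{1/r} has loss strictly greater than L_λ(0), so the infimum is attained inside the ball.

```python
import numpy as np

# Toy separable data: two points in R^2 (an illustrative assumption,
# not the paper's setup).
X = np.array([[1.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
lam, r = 0.1, 2  # regularization strength and exponent (r = 2 assumed)

def loss(theta):
    """L_lam(theta) = lam * ||theta||^r + sum_i log(1 + exp(-y_i <theta, x_i>))."""
    margins = y * (X @ theta)
    return lam * np.linalg.norm(theta) ** r + np.sum(np.logaddexp(0.0, -margins))

M = loss(np.zeros(2))         # L_lam(0)
R = (M / lam) ** (1.0 / r)    # radius (M / lam)^{1/r} from the proof

# On the sphere of radius R, the regularizer alone contributes lam * R^r = M,
# and the (strictly positive) loss term pushes L_lam strictly above L_lam(0).
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
boundary = np.all([loss(R * np.array([np.cos(t), np.sin(t)])) > M for t in angles])
print(boundary)  # every boundary point exceeds L_lam(0)
```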
A.1 Missing Proofs for the Multi-class Setting
Towards proving Theorem 2.1, we first show that as we decrease λ, the norm of the solution Θ_λ grows.

Lemma A.2. In the setting of Theorem 2.1, as λ → 0, we have ‖Θ_λ‖ → ∞.
To prove Theorem 2.1, we rely on the exponential scaling of the cross-entropy loss: L_λ(Θ_λ) can be lower bounded roughly by exp(−γ_λ‖Θ_λ‖^a), but also has an upper bound that scales with exp(−γ^*‖Θ_λ‖^a), where a is the degree of homogeneity of f. By Lemma A.2, we can take ‖Θ_λ‖ large so the gap between γ_λ and γ^* vanishes. This proof technique is inspired by that of Rosset et al. (2004a).
Proof of Theorem 2.1.
For any Θ with ‖Θ‖ ≤ 1 and any scaling c > 0, the cross-entropy loss of cΘ on example i is

ℓ(f(cΘ; x_i), y_i) = log(1 + Σ_{j ≠ y_i} exp(−c^a ((f(Θ; x_i))_{y_i} − (f(Θ; x_i))_j)))    (A.1)

(by the homogeneity of f). Since every logit gap (f(Θ; x_i))_{y_i} − (f(Θ; x_i))_j is at least the normalized margin γ(Θ), we can upper bound equation A.1 uniformly over the n examples and obtain

Σ_{i=1}^n ℓ(f(cΘ; x_i), y_i) ≤ n log(1 + (k − 1) exp(−c^a γ(Θ))).    (A.2)

We can also keep only the single example and label attaining the margin in order to lower bound equation A.1 and obtain

Σ_{i=1}^n ℓ(f(cΘ; x_i), y_i) ≥ log(1 + exp(−c^a γ(Θ))).    (A.3)

Applying equation A.2 with Θ = Θ^* (a unit-norm maximizer of the normalized margin, so γ(Θ^*) = γ^*) and c = ‖Θ_λ‖, and noting that L_λ(Θ_λ) ≤ L_λ(cΘ^*) by the optimality of Θ_λ, we have

L_λ(Θ_λ) ≤ λ‖Θ_λ‖^r + n log(1 + (k − 1) exp(−‖Θ_λ‖^a γ^*)).

Next we lower bound L_λ(Θ_λ) by applying equation A.3 with Θ = Θ_λ/‖Θ_λ‖ and c = ‖Θ_λ‖:

L_λ(Θ_λ) ≥ λ‖Θ_λ‖^r + log(1 + exp(−‖Θ_λ‖^a γ_λ)).

Combining the two displays and cancelling the common term λ‖Θ_λ‖^r gives

log(1 + exp(−‖Θ_λ‖^a γ_λ)) ≤ n log(1 + (k − 1) exp(−‖Θ_λ‖^a γ^*)).

Recall that by Lemma A.2, as λ → 0, we have ‖Θ_λ‖ → ∞. Therefore exp(−‖Θ_λ‖^a γ^*) → 0, and since the left-hand side above is bounded by the right-hand side, exp(−‖Θ_λ‖^a γ_λ) → 0 as well. Thus, we can apply the Taylor expansion log(1 + z) = z(1 + o(1)) to the inequality above and obtain

exp(−‖Θ_λ‖^a γ_λ) ≤ n(k − 1) exp(−‖Θ_λ‖^a γ^*)(1 + o(1)).

We claim this implies that liminf_{λ→0} γ_λ ≥ γ^*. If not, we have γ_λ ≤ γ^* − ε for some ε > 0 along a subsequence, which implies that the inequality above is violated once ‖Θ_λ‖ is sufficiently large (‖Θ_λ‖^a ≥ 2 log(2n(k − 1))/ε would suffice). By Lemma A.2, ‖Θ_λ‖ → ∞ as λ → 0, and therefore we get a contradiction.

Finally, we have γ_λ ≤ γ^* by the definition of γ^*. Hence, lim_{λ→0} γ_λ exists and equals γ^*. ∎
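The conclusions of Theorem 2.1 and Lemma A.2 can be observed numerically. The sketch below is our own illustration under simplifying assumptions not fixed by this appendix: a linear predictor f(θ; x) = θᵀx (which is 1-homogeneous), the regularizer λ‖θ‖², and a tiny separable dataset whose maximum normalized margin is γ^* = 1, attained by θ = (1, 0). It minimizes L_λ by gradient descent for decreasing λ; the norm of the minimizer grows and its normalized margin rises toward γ^*.

```python
import numpy as np

# Tiny separable dataset (assumed for illustration). For these points the
# maximum normalized margin is gamma* = 1, attained by theta = (1, 0).
X = np.array([[1.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

def minimize(lam, steps=100_000, lr=0.5):
    """Gradient descent on L_lam(theta) = lam*||theta||^2 + sum_i log(1 + exp(-y_i <theta, x_i>))."""
    theta = np.zeros(2)
    for _ in range(steps):
        m = y * (X @ theta)              # per-example (unnormalized) margins
        p = 1.0 / (1.0 + np.exp(m))      # sigmoid(-m): per-example loss gradient factor
        theta -= lr * (2 * lam * theta - (p * y) @ X)
    return theta

margins, norms = [], []
for lam in [1e-1, 1e-3, 1e-5]:
    theta = minimize(lam)
    norms.append(np.linalg.norm(theta))
    margins.append(np.min(y * (X @ theta)) / np.linalg.norm(theta))

print(margins, norms)  # normalized margin rises toward gamma* = 1; norms grow
```

Shrinking λ further continues this trend, consistent with Lemma A.2 (‖Θ_λ‖ → ∞) and the γ_λ → γ^* limit above.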
Now we fill in the proof of Lemma A.2.
Proof of Lemma A.2.
For the sake of contradiction, we assume that there exists a constant C > 0 such that for any λ_0 > 0, there exists λ < λ_0 with ‖Θ_λ‖ ≤ C. We will determine the choice of λ later and pick it such that ‖Θ_λ‖ ≤ C. Then the logits (the prediction before the softmax) are bounded in absolute value by some constant (that depends on C), and therefore the loss function for every example is bounded from below by some constant b > 0 (depending on C but not λ).

Let Θ^* be a unit-norm maximizer of the normalized margin, and let c > 0 be large enough that n log(1 + (k − 1) exp(−c^a γ^*)) < b/2; such a c exists because γ^* > 0. Then we have that

L_λ(Θ_λ) ≤ L_λ(cΘ^*) ≤ λc^r + n log(1 + (k − 1) exp(−c^a γ^*)) < λc^r + b/2    (by the optimality of Θ_λ).

On the other hand, L_λ(Θ_λ) ≥ b, because the loss on every example is at least b. Taking a sufficiently small λ (so that λc^r < b/2), we obtain b < b, a contradiction, and complete the proof. ∎
A.2 Full Binary Classification Setting
For completeness, we state and prove our max-margin results for the setting where we fit binary labels y_i ∈ {−1, 1} (as opposed to indices in [k]), redefining f to assign a single real-valued score (as opposed to a score for each label). This lets us work with the simpler λ-regularized logistic loss:

L_λ(Θ) := λ‖Θ‖^r + Σ_{i=1}^n log(1 + exp(−y_i f(Θ; x_i))).

As before, let Θ_λ denote a global minimizer of L_λ, and define the normalized margin by γ_λ := min_i y_i f(Θ_λ/‖Θ_λ‖; x_i). Define the maximum possible normalized margin

γ^* := max_{‖Θ‖ ≤ 1} min_i y_i f(Θ; x_i).
Theorem A.3. Assume γ^* > 0 in the binary classification setting with logistic loss. Then as λ → 0, γ_λ → γ^*.
The proof follows via a simple reduction to the multi-class case.
Proof of Theorem a.3.
We prove this theorem via reduction to the multi-class case with k = 2. Construct a two-output network g = (g_1, g_2) with g_1(Θ; x) := f(Θ; x)/2 and g_2(Θ; x) := −f(Θ; x)/2; note that g inherits the homogeneity of f. Define new labels z_i := 1 if y_i = 1 and z_i := 2 if y_i = −1. Now note that g_{z_i}(Θ; x_i) − g_{3−z_i}(Θ; x_i) = y_i f(Θ; x_i), so the multi-class margin for g under the labels z is the same as the binary margin for f under the labels y. Furthermore, defining the multi-class objective

L̂_λ(Θ) := λ‖Θ‖^r + Σ_{i=1}^n log(1 + exp(g_{3−z_i}(Θ; x_i) − g_{z_i}(Θ; x_i))),

we get that L̂_λ(Θ) = L_λ(Θ), and in particular, L̂_λ and L_λ have the same set of minimizers. Therefore we can apply Theorem 2.1 for the multi-class setting and conclude that γ_λ → γ^* in the binary classification setting. ∎
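The identity behind this reduction, that the two-class cross-entropy with logits (f/2, −f/2) equals the logistic loss log(1 + exp(−y f)), can be sanity-checked numerically; the sketch below is our own illustration on random scores.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=100)             # arbitrary real-valued scores f(theta; x)
labels = rng.choice([-1.0, 1.0], size=100)

# Binary logistic loss: log(1 + exp(-y * f)), computed stably.
logistic = np.logaddexp(0.0, -labels * scores)

# Two-class cross-entropy with logits (f/2, -f/2); class index 0 <-> y = +1,
# class index 1 <-> y = -1.
logits = np.stack([scores / 2, -scores / 2], axis=1)
target = (labels < 0).astype(int)
log_norm = np.logaddexp(logits[:, 0], logits[:, 1])   # log(e^{z_1} + e^{z_2})
cross_entropy = log_norm - logits[np.arange(100), target]

print(np.max(np.abs(logistic - cross_entropy)))  # agrees up to floating-point error
```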