1 Introduction
Inductive bias introduced through the learning process plays a crucial role in training deep neural networks and in the generalization properties of the learned models (Neyshabur et al., 2015b, a; Zhang et al., 2017; Keskar et al., 2017; Neyshabur et al., 2017; Wilson et al., 2017; Hoffer et al., 2017). Deep neural networks used in practice are typically highly overparameterized, i.e., have far more trainable parameters than training examples. Thus, using these models, it is usually possible to fit the data perfectly and obtain zero training error (Zhang et al., 2017). However, simply minimizing the training loss does not guarantee good generalization to unseen data; many global minima of the training loss indeed have very high test error (Wu et al., 2017). The inductive bias introduced in our learning process affects which specific global minimizer is chosen as the predictor. Therefore, it is essential to understand the nature of this inductive bias to understand why overparameterized models, and particularly deep neural networks, exhibit good generalization abilities.
A common way to introduce an additional inductive bias in overparameterized models is via small amounts of regularization, or loose constraints. For example, Rosset et al. (2004b, a); Wei et al. (2018) show that, in overparameterized classification models, a vanishing amount of regularization, or a diverging norm constraint, can lead to max-margin solutions, which in turn enjoy strong generalization guarantees.
A second and more subtle source of inductive bias is via the optimization algorithm used to minimize the underdetermined training objective (Gunasekar et al., 2017; Soudry et al., 2018b). Common algorithms used in neural network training, such as stochastic gradient descent, iteratively refine the model parameters by making incremental local updates. For different algorithms, the local updates are specified by different geometries in the space of parameters. For example, gradient descent uses a Euclidean $\ell_2$ geometry, while coordinate descent updates are specified in the $\ell_1$ geometry. The minimizers to which such local-search-based optimization algorithms converge are indeed very special and are related to the geometry of the optimization algorithm (Gunasekar et al., 2018b) as well as the choice of model parameterization (Gunasekar et al., 2018a).
In this work we similarly investigate the connection between margin maximization and the limits of:

The “optimization path” of unconstrained, unregularized gradient descent.

The “constrained path”, where we optimize with a diverging (increasingly loose) constraint on the norm of the parameters.

The closely related “regularization path”, of solutions with decreasing penalties on the norm.
To better understand the questions we tackle in this paper, and our contribution toward understanding the inductive bias introduced in training, let us briefly survey prior work.
Equivalence of the regularization or constrained paths and margin maximization:
Rosset et al. (2004b, a); Wei et al. (2018) investigated the connection between the regularization and constrained paths and the max-margin solution. Rosset et al. (2004a, b) considered linear (hence homogeneous) models with monotone loss and explicit norm regularization or constraint, and proved convergence to the max-margin solution for certain loss functions (e.g., logistic loss) as the regularization vanishes or the norm constraint diverges.
Definition 1 (positive homogeneous function).
A function $f:\mathbb{R}^d\to\mathbb{R}$ is $\alpha$-positive homogeneous if $\forall\rho>0$ and $\forall\mathbf{w}\in\mathbb{R}^d$: $f(\rho\mathbf{w})=\rho^{\alpha}f(\mathbf{w})$.
Wei et al. (2018) extended the regularization path result to nonlinear but positive-homogeneous prediction functions (Definition 1), e.g., as obtained by a ReLU network with uniform depth.
These results are thus limited to only positive homogeneous predictors, and do not include deep networks with bias parameters, ensemble models with different depths, ResNets, or other models with skip connections. Here, we extend this connection beyond positive homogeneous predictors.
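To make the distinction concrete, the following is a minimal sketch (not taken from the paper) that checks positive homogeneity numerically for a tiny hand-built ReLU network. The network sizes, weights, input, and the helper names `forward` and `scale` are all hypothetical choices for illustration. Without bias, scaling every parameter by `rho` should scale the output by `rho ** depth`; with bias parameters this identity is expected to fail.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def forward(weights, biases, x):
    """Scalar output of a feed-forward ReLU network (no activation on the last layer)."""
    h = list(x)
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = [sum(w * a for w, a in zip(row, h)) + bi for row, bi in zip(W, b)]
        if i < len(weights) - 1:
            h = relu(h)
    return h[0]

def scale(weights, biases, rho):
    """Multiply every trainable parameter (weights and biases) by rho."""
    sw = [[[rho * w for w in row] for row in W] for W in weights]
    sb = [[rho * bi for bi in b] for b in biases]
    return sw, sb

# Hypothetical depth-2 network: 1 input -> 2 hidden units -> 1 output.
weights = [[[1.0], [-1.0]], [[1.0, 2.0]]]
zero_b = [[0.0, 0.0], [0.0]]   # bias-free network
bias_b = [[1.0, 1.0], [0.5]]   # same network with bias parameters
x, rho, depth = [3.0], 2.0, 2

f_plain = forward(weights, zero_b, x)
f_plain_scaled = forward(*scale(weights, zero_b, rho), x)
f_bias = forward(weights, bias_b, x)
f_bias_scaled = forward(*scale(weights, bias_b, rho), x)

print(f_plain_scaled, rho ** depth * f_plain)  # equal: homogeneous of degree 2
print(f_bias_scaled, rho ** depth * f_bias)    # differ: bias breaks homogeneity
```

The bias-free network behaves as a degree-2 positive homogeneous function of its parameters, while the version with bias mixes terms of different degrees, which is the kind of model the sum-of-homogeneous-functions setting is meant to cover.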
Furthermore, even for homogeneous or linear predictors, there might be multiple margin maximizing solutions. For linear models, Rosset et al. (2004b) alluded to a refined set of maximum margin classifiers that, in addition to maximizing the distance to the closest data point (max-margin), also maximize the distance to the second closest data point, and so on. We formulate such special maximum margin solutions as “lexicographic max-margin” classifiers, which we introduce in Section 4.2. We show that for general continuous homogeneous models, the constrained path with diverging norm constraint converges to these more refined “lexicographic max-margin” classifiers.
Equivalence of the optimization path and margin maximization:
Another line of work studied the connection between unconstrained, unregularized optimization with a specific algorithm (i.e., the limit of the “optimization path”), and the max-margin solution. For linear prediction with the logistic loss (or other exponential tail losses), we now know that gradient descent (Soudry et al., 2018b; Ji & Telgarsky, 2018) as well as SGD (Nacson et al., 2019b) converge in direction to the max-margin solution, while steepest descent with respect to an arbitrary norm converges to the max-margin solution w.r.t. the corresponding norm (Gunasekar et al., 2018b). All the above results are for linear prediction. Gunasekar et al. (2018a); Nacson et al. (2019a); Ji & Telgarsky (2019) obtained results establishing convergence to margin maximizing solutions also for certain uniform-depth linear networks (including fully connected networks and convolutional networks), which still implement a linear model. Separately, Xu et al. (2019) analyzed a single linear unit with ReLU activation, a limited nonlinear but still positive homogeneous model. Lastly, Soudry et al. (2018a) analyzed a nonlinear ReLU network where only a single weight layer is optimized.
Here, we extend this relationship to general, nonlinear and positive homogeneous predictors for which the loss can be minimized only at infinity. We establish a connection between the limit of unregularized unconstrained optimization and the maxmargin solution.
Problems with finite minimizers:
We note that the connection between the regularization path and the optimization path was previously considered in different settings, where a finite (global) minimum exists. In such settings the questions asked are different from the ones we consider here, and are not about the limit of the paths. E.g., Ali et al. (2018) showed for gradient flow a multiplicative relation between the risk for the gradient flow optimization path and the ridge-regression regularization path. Also, Suggala et al. (2018) showed that for gradient flow and a strongly convex and smooth loss function, gradient descent iterates on the unregularized loss function are pointwise close to solutions of a corresponding regularized problem.
Contributions
We examine overparameterized realizable problems (i.e., where it is possible to perfectly classify the training data), when training using monotone decreasing classification loss functions. For simplicity, we focus on the exponential loss. However, using similar techniques as in Soudry et al. (2018a) our results should extend to other exponential-tailed loss functions such as the logistic loss and its multi-class generalization. This is indeed the common setting for deep neural networks used in practice.
We show that in any model,

As long as the margin attainable by an (unregularized, unconstrained) model is unbounded, the margin of the constrained path converges to the max-margin. See Corollary 1.

If additional conditions hold, the constrained path also converges to the “margin path” in parameter space (the path of minimal norm solutions attaining increasingly large margins). See section 3.1.
We then demonstrate that

If the model is a sum of homogeneous functions of different orders (i.e., it is not homogeneous itself), then we can still characterize the asymptotic solution of both the constrained path and the margin path. See Theorem 3.2.

This solution implies that in an ensemble of homogeneous neural networks, the ensemble will aim to discard the most shallow network. This is in contrast to what we would expect from considerations of optimization difficulty (since deeper networks are typically harder to train (He et al., 2016)).

This also allows us to represent hard-margin SVM problems with unregularized bias using such models. This is in contrast to previous approaches which fail to do so, as pointed out recently (Nar et al., 2019).
Finally, for homogeneous models,

We find general conditions under which the optimization path converges to stationary points of the margin path or the constrained path. See section 4.1.

We show that the constrained path converges to a specific type of max-margin solution, which we term the “lexicographic max-margin” (footnote 1: the authors thank Rob Shapire for the suggestion of the nomenclature during initial discussions). See Theorem 4.
2 Preliminaries and Basic Results
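In this paper, we will study the following exponential-tailed loss function
$\mathcal{L}(\mathbf{w}) \triangleq \sum_{n=1}^{N} \exp\left(-f_n(\mathbf{w})\right)$ (1)
where $f_n:\mathbb{R}^d\to\mathbb{R}$ is a continuous function, and $N$ is the number of samples. Also, for any norm $\|\cdot\|$ in $\mathbb{R}^d$ we define $\mathbb{S}^{d-1}$ as the unit norm ball in $\mathbb{R}^d$.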
We will use in our results the following basic lemma
Lemma 1.
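Let $f$ and $g$ be two functions from $\mathbb{R}^d$ to $\mathbb{R}$, such that
$\phi(\rho) = \min_{\mathbf{w}:\, g(\mathbf{w})\le\rho} f(\mathbf{w})$ (2)
exists and is strictly monotonically decreasing in $\rho$, $\forall\rho\ge\rho_0$, for some $\rho_0$. Then, $\forall\rho\ge\rho_0$, the optimization problem in eq. 2 has the same set of solutions as
$\arg\min_{\mathbf{w}} g(\mathbf{w}) \quad \text{s.t.} \quad f(\mathbf{w})\le\phi(\rho)$ (3)
whose minimum is obtained at $g(\mathbf{w})=\rho$.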
Proof.
See Appendix A. ∎
2.1 The Optimization Path
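The optimization path in the Euclidean norm $\|\cdot\|_2$ is given by the direction of the iterates of the gradient descent algorithm with initialization $\mathbf{w}(0)$ and learning rates $\{\eta_t\}$,
$\bar{\mathbf{w}}(t) = \frac{\mathbf{w}(t)}{\|\mathbf{w}(t)\|_2}, \qquad \mathbf{w}(t+1) = \mathbf{w}(t) - \eta_t\nabla\mathcal{L}(\mathbf{w}(t))$ (4)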
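As an illustration of the optimization path, here is a minimal sketch that runs plain gradient descent on the exponential loss of a linear model $f_n(\mathbf{w})=\langle\mathbf{w},\mathbf{z}_n\rangle$ and tracks the normalized iterate. The three signed examples `Z`, the step size, and the number of steps are hypothetical. For this toy data the $\ell_2$ max-margin direction is $(1,0)$, with the first two points as support vectors, so the normalized iterate is expected to approach it.

```python
import math

# Hypothetical signed examples z_n = y_n * x_n for a linear model f_n(w) = <w, z_n>.
Z = [(1.0, 1.0), (1.0, -1.0), (3.0, 0.5)]

def grad(w):
    """Gradient of the exponential loss L(w) = sum_n exp(-<w, z_n>)."""
    g = [0.0, 0.0]
    for z in Z:
        c = math.exp(-(w[0] * z[0] + w[1] * z[1]))
        g[0] -= c * z[0]
        g[1] -= c * z[1]
    return g

def optimization_path(steps, lr=0.1):
    """Run gradient descent from w(0) = 0; return the last iterate and its direction."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    norm = math.hypot(w[0], w[1])
    return w, [w[0] / norm, w[1] / norm]

w_final, direction = optimization_path(5000)
print("norm of iterate:", math.hypot(*w_final))
print("direction:", direction)
```

The norm of the iterate keeps growing (the loss is minimized only at infinity), while the direction stabilizes, which is why the path is defined through the normalized iterate rather than the iterate itself.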
2.2 The Constrained Path
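The constrained path for the loss in eq. 1 is given by the minimizer of the loss at a given norm value $\rho>0$, i.e., over the unit norm ball $\mathbb{S}^{d-1}$ scaled by $\rho$,
$\Theta_c(\rho) = \arg\min_{\mathbf{w}\in\mathbb{S}^{d-1}} \mathcal{L}(\rho\mathbf{w})$ (5)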
The constrained path was previously considered for linear models (Rosset et al., 2004a). However, most previous works (e.g. Rosset et al. (2004b); Wei et al. (2018)) focused on the regularization path, which is the minimizer of the regularized loss. These two paths are closely linked, as we discuss in more detail in Appendix F.
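Denote the constrained minimum of the loss as follows:
$\mathcal{L}^*(\rho) \triangleq \min_{\mathbf{w}\in\mathbb{S}^{d-1}} \mathcal{L}(\rho\mathbf{w})$.
$\mathcal{L}^*(\rho)$ exists for any finite $\rho$ as the minimum of a continuous function on a compact set.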
By Lemma 1, the Assumption
Assumption 1.
There exists $\rho_0$ such that the constrained minimum of the loss $\mathcal{L}^*(\rho)$ is strictly monotonically decreasing to zero for any $\rho\ge\rho_0$.
enables an alternative form of the constrained path
In addition, in the next lemma we show that under this assumption the constrained path minimizers are obtained on the boundary of the unit norm ball $\mathbb{S}^{d-1}$.
Lemma 2.
Under assumption 1, for all $\rho>\rho_0$ and for all $\mathbf{w}\in\Theta_c(\rho)$, we have $\|\mathbf{w}\|=1$.
Proof.
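Let $\rho>\rho_0$. We assume, in contradiction, that there exists $\mathbf{w}\in\Theta_c(\rho)$ so that $\|\mathbf{w}\|=b<1$. This implies that $\mathcal{L}^*(b\rho)\le\mathcal{L}(\rho\mathbf{w})=\mathcal{L}^*(\rho)$, which contradicts our assumption that $\mathcal{L}^*(\rho)$ is strictly monotonically decreasing. ∎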
2.3 The Margin Path
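For prediction functions $f_n$ on data points indexed as $n=1,\dots,N$, we define the margin path as:
Margin path: $\Theta_m(\rho) = \arg\max_{\mathbf{w}\in\mathbb{S}^{d-1}} \min_{n} f_n(\rho\mathbf{w})$ (6)
For $\mathbf{w}\in\mathbb{S}^{d-1}$, we denote the margin at scaling $\rho>0$ as
$\gamma_\rho(\mathbf{w}) = \min_{n} f_n(\rho\mathbf{w})$
and the max-margin at scale of $\rho>0$ as
$\gamma^*_\rho = \max_{\mathbf{w}\in\mathbb{S}^{d-1}} \gamma_\rho(\mathbf{w})$.
Note that for all $\rho$, this maximum exists as the maximum of a continuous function on a compact set.
Again, we make a simplifying assumption
Assumption 2.
There exists $\rho_0$ such that $\gamma^*_\rho$ is strictly monotonically increasing to $\infty$ for any $\rho\ge\rho_0$.
Many common prediction functions satisfy this assumption, including the sum of positive-homogeneous prediction functions.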
3 Non-Homogeneous Models
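We first study the constrained path in non-homogeneous models, and relate it to the margin path. To do so, we need to first define the $\epsilon$-ball surrounding a set $\mathcal{A}$
$\mathcal{B}_\epsilon(\mathcal{A}) \triangleq \{\mathbf{w} : \exists\mathbf{w}'\in\mathcal{A} \text{ such that } \|\mathbf{w}-\mathbf{w}'\|<\epsilon\}$
and the notion of set convergence
Definition 2 (Set convergence).
A sequence of sets $\mathcal{A}_t$ converges to another sequence of sets $\mathcal{B}_t$ if $\forall\epsilon>0$ $\exists t_0$ such that $\forall t>t_0$: $\mathcal{A}_t\subset\mathcal{B}_\epsilon(\mathcal{B}_t)$.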
3.1 Margin of Constrained Path Converges to Maximum Margin
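For all $\rho$, the constrained path margin deviation from the max-margin is bounded, as we prove next.
Lemma 3.
For all $\rho>0$, and every $\mathbf{w}_\rho$ in $\Theta_c(\rho)$
$\gamma^*_\rho - \gamma_\rho(\mathbf{w}_\rho) \le \log N$ (8)
Proof.
Note that
$\exp(-\gamma_\rho(\mathbf{w})) \le \mathcal{L}(\rho\mathbf{w}) \le N\exp(-\gamma_\rho(\mathbf{w}))$ (9)
Since $\mathcal{L}(\rho\mathbf{w}_\rho)\le\mathcal{L}(\rho\mathbf{w})$ for every $\mathbf{w}\in\mathbb{S}^{d-1}$, we have, $\forall\mathbf{w}_\rho\in\Theta_c(\rho)$ and $\forall\mathbf{w}^*\in\Theta_m(\rho)$,
$\exp(-\gamma_\rho(\mathbf{w}_\rho)) \le \mathcal{L}(\rho\mathbf{w}_\rho) \le \mathcal{L}(\rho\mathbf{w}^*) \le N\exp(-\gamma^*_\rho)$,
and taking logarithms gives eq. 8. ∎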
Lemma 3 immediately implies that
Corollary 1.
If , then for all , and every in
The last corollary states that the margin of the constrained path converges to the maximum margin. However, this does not necessarily imply convergence in parameter space, i.e., this result does not guarantee that the constrained path $\Theta_c(\rho)$ converges to the margin path $\Theta_m(\rho)$. We analyze some positive and negative examples to demonstrate this claim.
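Before turning to the examples, the margin statement itself can be checked numerically. The sketch below assumes the bound of Lemma 3 is the $\log N$ gap that follows from sandwiching the exponential loss between $\exp(-\gamma)$ and $N\exp(-\gamma)$. It uses a hypothetical linear model on four signed examples `Z` and a grid on the Euclidean unit circle as a stand-in for the constraint set (for a linear model the minimizer lies on the boundary). The same sandwich argument applies to any finite feasible set, so the bound should hold on the grid as well.

```python
import math

# Hypothetical signed examples z_n = y_n * x_n; f_n(w) = <w, z_n>.
Z = [(1.0, 0.2), (0.4, 1.0), (2.0, -0.5), (0.8, 0.8)]
N = len(Z)

# Grid of unit vectors standing in for the boundary of the l2 unit ball.
GRID = [(math.cos(2 * math.pi * k / 720), math.sin(2 * math.pi * k / 720))
        for k in range(720)]

def margin(w, rho):
    """gamma_rho(w) = min_n f_n(rho * w)."""
    return min(rho * (w[0] * z[0] + w[1] * z[1]) for z in Z)

def loss(w, rho):
    """Exponential loss evaluated at rho * w."""
    return sum(math.exp(-rho * (w[0] * z[0] + w[1] * z[1])) for z in Z)

def margin_gap(rho):
    """Max-margin at scale rho minus the margin of the constrained-path minimizer."""
    w_c = min(GRID, key=lambda w: loss(w, rho))
    gamma_star = max(margin(w, rho) for w in GRID)
    return gamma_star - margin(w_c, rho)

gaps = {rho: margin_gap(rho) for rho in (1.0, 5.0, 25.0)}
for rho, gap in gaps.items():
    print(f"rho={rho:5.1f}  gap={gap:.4f}  log N={math.log(N):.4f}")
```

Because the gap stays bounded while the max-margin grows with $\rho$, the ratio between the two margins tends to one, which is the content of the corollary. The check says nothing about whether the minimizing parameters themselves converge.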
Example 1: homogeneous models
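It is straightforward to see that, for positive homogeneous prediction functions (Definition 1), the margin path in eq. 6 is the same set for any $\rho$, and is given by
$\Theta_m^* = \arg\max_{\mathbf{w}\in\mathbb{S}^{d-1}} \min_{n} f_n(\mathbf{w})$.
Additionally, as we show next, for such models Lemma 3 implies convergence in parameter space, i.e., $\Theta_c(\rho)$ converges to $\Theta_m^*$. To see this, notice that for $\alpha$-positive homogeneous functions $f_n$, $\gamma_\rho(\mathbf{w})=\rho^{\alpha}\gamma_1(\mathbf{w})$ and $\gamma^*_\rho=\rho^{\alpha}\gamma^*_1$, so that eq. 8 gives:
$\gamma^*_1 - \gamma_1(\mathbf{w}_\rho) \le \rho^{-\alpha}\log N$.
For $\rho\to\infty$ we must have
$\gamma_1(\mathbf{w}_\rho)\to\gamma^*_1$.
By continuity, the last equation implies that $\Theta_c(\rho)$ converges to $\Theta_m^*$. For full details see Appendix D.1.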
Connection to previous results: For linear models, Rosset et al. (2004a) connected the constrained path and maximum margin solution. In addition, for any norm, Rosset et al. (2004b) showed that the regularization path converges to the limit of the margin path. In a recent work, Wei et al. (2018) extended this result to homogeneous models with cross-entropy loss. Here, for homogeneous models and any norm, we show a connection between the constrained path and the margin path.
Extension: Later, in Theorem 4 we prove a more refined result: the constrained path converges to a specific subset of the margin path set (the lexicographic maxmargin set).
In contrast, in general models, eq. 8 does not necessarily imply convergence in the parameter space. We demonstrate this result in the next example.
Example 2: log predictor: We denote for some dataset , with features and label . We examine the prediction function for . We focus on the loss function tail behaviour and thus only care about the loss function behaviour in region. We assume that a separator which satisfy this constraint exists since we are focusing on realizable problems.
Since is strictly increasing and , we have
We denote and . Note that . Now consider such that for some and : . For this case, we still have,
but clearly, . Thus, Lemma 3 does not guarantee that as , or that converges to .
Analogies with regularization and optimization paths: This example demonstrates that for the prediction function for , the constrained path does not necessarily converge to the margin path. This is equivalent to setup A: linear prediction models with loss function . Rosset et al. (2004b) and Nacson et al. (2019a) state related results for setup A. Both works derived conditions on the loss function that ensure convergence to the margin path from the regularization/ optimization path respectively. Rosset et al. (2004b) showed that in setup A the regularization path does not necessarily converge to the margin path. (Nacson et al., 2019a) showed a similar result for the optimization path, i.e., that in setup A the optimization path does not necessarily converge to the margin path. Both results align with our results for the constrained path.
In contrast, according to the conditions of Rosset et al. (2004b); Nacson et al. (2019a), we know that if the prediction function is for some and , then the regularization path and optimization path do converge to the margin path. In the next example, we show that this is also true for the constrained path.
Example 3: log predictor: We examine the prediction function for and some . Since the log function is strictly increasing and , we have
For all :
For we must have , which implies, by continuity, that converges to . For details, see Appendix D.2.
3.2 Sum of Positively Homogeneous Functions
Remark: The results in this subsection are specific for the Euclidean or $\ell_2$ norm.
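Let $f_n$ be functions that are a finite sum of positively homogeneous functions, i.e., for some finite $K$:
$f_n(\mathbf{w}) = \sum_{k=1}^{K} f_n^{(k)}(\mathbf{w}_k)$ (10)
where $\mathbf{w}=[\mathbf{w}_1,\dots,\mathbf{w}_K]$ and $f_n^{(k)}$ are $\alpha_k$-positive homogeneous functions, where $0<\alpha_1<\alpha_2<\dots<\alpha_K$.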
First, we characterize the asymptotic form of the margin path in this setting.
Lemma 4.
Let be a sum of positively homogeneous functions as in eq. 10. Then, the set of solutions of
(11) 
can be written as
(12) 
where the term is vanishing as , and
where
(13) 
Proof.
We write the original optimization problem
Dividing by , using the positive homogeneity of , and changing the variables as , we obtain an equivalent optimization problem
(14) 
We denote the set of solutions of eq. 14 as . Taking the limit of of this optimization problem we find that any solution must minimize the first term in the sum , and only then the other terms. Therefore the asymptotic solution is of the form of eqs. 12 and 13. We prove this reasoning formally in Appendix B, i.e., we show that
∎
The following Lemma will be used to connect the constrained path to the characterization of the margin path.
Lemma 5.
Proof.
See Appendix C. ∎
Theorem 1.
Theorem 1 implications: An important implication of Theorem 1 is that an ensemble of neural networks will aim to discard the shallowest network in the ensemble. Consider the following setting: each function in the sum of eq. 10 represents a prediction function of some feedforward neural network with no bias, all with the same positive-homogeneous activation function
of some degree (e.g., ReLU activation is positivehomogeneous of degree ). Note that in this setup, each of the prediction functions is also a positivehomogeneous function. In particular, network with depth is positive homogeneous with degree where is the activation function degree. Since all the networks have the same activation function, deeper networks will have larger degree. We assume WLOG that . This implies that . In this setting, represents an ensemble of these networks. From Theorem 1, the solution of the constrained path will satisfywhere and is calculated using eq. 13. Examining equation 13, we observe that the network aims to minimize the norm. In particular, if the network ensemble can satisfy the constraints with , then the first equation obtained solutions will satisfy . Thus the ensemble will discard the shallowest network if it is ”unnecessary” to satisfy the constraint.
Furthermore, from eq. 14 we conjecture that after discarding the shallowest “unnecessary” network, the ensemble will next tend to discard the second shallowest “unnecessary” network. This will continue until there are no more “unnecessary” shallow networks. In other words, we conjecture that an ensemble of neural networks will aim to discard the shallowest “unnecessary” networks.
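The tendency to discard the lowest-order component can be seen in a toy problem that is a simplification, not the setting of Theorem 1. Assume a single data point and the hypothetical prediction function $f(w_1,w_2)=w_1+w_2^2$, a sum of a degree-1 and a degree-2 positive homogeneous term, and look for the minimum Euclidean norm parameters attaining margin $\gamma$. With the constraint active, $w_2^2=\gamma-w_1$, so the squared norm is $w_1^2-w_1+\gamma$ and the minimizer keeps $w_1$ bounded while $w_2$ grows. The sketch below finds this by a simple grid scan (the function name and step size are illustrative choices).

```python
import math

def min_norm_solution(gamma, step=1e-3):
    """Grid scan for argmin w1^2 + w2^2 s.t. w1 + w2^2 >= gamma (constraint active)."""
    best = None
    k_max = int(round(gamma / step))
    for k in range(k_max + 1):
        w1 = k * step
        w2 = math.sqrt(max(gamma - w1, 0.0))  # spend the remaining margin on w2
        sq_norm = w1 * w1 + w2 * w2
        if best is None or sq_norm < best[0]:
            best = (sq_norm, w1, w2)
    return best[1], best[2]

solutions = {}
shares = {}
for gamma in (1.0, 10.0, 100.0):
    w1, w2 = min_norm_solution(gamma)
    solutions[gamma] = (w1, w2)
    shares[gamma] = w1 / math.hypot(w1, w2)  # share of the degree-1 component
    print(f"gamma={gamma:6.1f}  w1={w1:.3f}  w2={w2:.3f}  share={shares[gamma]:.3f}")
```

As the required margin grows, the degree-1 parameter stays near a constant and its share of the normalized solution shrinks toward zero, mirroring on a small scale the claim that the shallowest component is discarded asymptotically.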
Additionally, using Theorem 1 we can now represent hardmargin SVM problems with unregularized bias
. Previous results only focused on linear prediction functions without bias. Trying to extend these results to SVM with bias by extending all the input vectors
with an additional component would fail since the obtained solution in the original space is the solution ofwhich is not the maxmargin (SVM) solution, as pointed out by (Nar et al., 2019). However, we can now achieve this goal using Theorem 1. For some dataset , we use the following prediction function where . From eqs. 12, 13 the asymptotic solution will satisfy .
4 Homogeneous Models
In the previous section we connected the constrained path to the margin path. We would like to refine this characterization and also understand the connection to the optimization path. In this section we are able to do so for prediction functions which are positive homogeneous functions (definition 1).
4.1 Optimization Path Converges to Stationary Points of the Margin Path and Constrained Path
Remark: The results in this subsection are specific for the Euclidean or $\ell_2$ norm, as opposed to many of the results in this paper which are stated for any norm.
In this section, we link the optimization path to the margin path and the constrained path. These results require the following smoothness assumption:
Assumption 3 (Smoothness).
We assume is a function.
Relating optimization path and margin path.
The limit of the margin path for homogeneous models is given by eq. 18. In this section we first relate the optimization path to this limit of margin path.
Note that for general homogeneous prediction functions , eq. 18 is a nonconvex optimization problem, and thus it is unlikely for an optimization algorithm such as gradient descent to find the global optimum. We can relax the set to that are firstorder stationary, i.e., critical points of 18. For , denote the set of support vectors of as
(19) 
Definition 3 (Firstorder Stationary Point).
The firstorder optimality conditions of 18 are:

,

There exists such that and for .
We denote by the set of firstorder stationary points.
Let be the iterates of gradient descent. Define and be the vector with entries . The following two assumptions assume that the limiting direction exist and the limiting direction of the losses exist. Such assumptions are natural in the context of maxmargin problems, since we want to argue that converges to a maxmargin direction, and also the losses converges to an indicator vector of the support vectors. The first step to argue this convergence is to ensure the limits exist.
Assumption 4 (Asymptotic Formulas).
Assume that , that is we converge to a global minimizer. Further assume that and exist. Equivalently,
(20)  
(21) 
with , , , , and .
Assumption 5 (Linear Independence Constraint Qualification).
Let be a unit vector. LICQ holds at if the vectors are linearly independent.
Remark 1.
Constraint qualifications allow the firstorder optimality conditions of Definition 3 to be a necessary condition for optimality. Without constraint qualifications, the global optimum need not satisfy the optimality conditions.
LICQ is the simplest among many constraint qualification conditions identified in the optimization literature (Nocedal & Wright, 2006).
For example, in linear SVM, LICQ is ensured if the set of support vectors is linearly independent. Consider and be the support vectors. Then , and so linear independence of the support vectors implies LICQ. For data sampled from an absolutely continuous distribution, the SVM solution will always have linearly independent support vectors (Soudry et al., 2018b, Lemma 12), but LICQ may fail when the data is degenerate.
Theorem 2.
Optimization path and constrained path.
Next, we study how the optimization path as converges to stationary points of the constrained path with .
The first-order optimality conditions of the constrained path require that the constraints hold, and the gradient of the Lagrangian of the constrained path
(22) 
is equal to zero. In other words,
On many paths the gradient of the Lagrangian goes to zero as . However, we have a faster vanishing rate for the specific optimization paths that follow Definition 4 below. Therefore, these paths better approximate true stationary points:
Definition 4 (Firstorder optimal for ).
A sequence is firstorder optimal for with if
To relate the limit points of gradient descent to the constrained path, we will focus on stationary points of the constrained path that minimize the loss.
Theorem 3.
4.2 Lexicographic MaxMargin
Recall that for positive homogeneous prediction functions, the margin path in eq. 11 is the same set for any and is given by
For non-convex functions $f_n$ or non-Euclidean norms $\|\cdot\|$, the maximizer above need not be unique. In this case, we define the following refined set of maximum margin solutions
Definition 5 (Lexicographic maximum margin set).
The lexicographic margin set denoted by is given by the following iterative definition of for :
In the above definition, denotes the set of maximum margin solutions, denotes the subset of with second smallest margin, and so on.
For an alternate representation of , we introduce the following notation: for , let denote the index corresponding to the smallest margin of as defined below by breaking ties in the arbitrarily:
(23) 
Using this notation, we can rewrite as
We also define the limit set of constrained path as follows:
Definition 6 (Limit set of constrained path).
The limit set of constrained path is defined as follows:
Theorem 4.
For positive homogeneous prediction functions the limit set of constrained path is contained in the lexicographic maximum margin set, i.e., .
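A small example may help to see what the lexicographic refinement selects. The sketch below reads the definition as: first maximize the smallest margin, then, among those maximizers, maximize the second smallest margin, and so on. It uses a hypothetical linear model with two signed examples and the $\ell_\infty$ unit ball, discretized on a coarse grid. Here the plain max-margin set is a whole segment, while the lexicographic criterion picks a single corner, and the constrained-path minimizer of the exponential loss at a moderate scale lands on the same corner.

```python
import math

# Hypothetical signed examples z_n = y_n * x_n; f_n(w) = <w, z_n>.
Z = [(1.0, 0.0), (1.0, 1.0)]

STEP = 0.25
TICKS = [-1.0 + STEP * k for k in range(9)]     # -1, -0.75, ..., 1
BOX = [(a, b) for a in TICKS for b in TICKS]    # grid on the l_inf unit ball

def sorted_margins(w):
    """Margins of all examples, sorted from smallest to largest."""
    return tuple(sorted(w[0] * z[0] + w[1] * z[1] for z in Z))

# Plain max-margin: maximize only the smallest margin.
best_margin = max(sorted_margins(w)[0] for w in BOX)
max_margin_set = [w for w in BOX if sorted_margins(w)[0] == best_margin]

# Lexicographic max-margin: compare sorted margin tuples lexicographically.
lex_max_margin = max(BOX, key=sorted_margins)

def loss(w, rho):
    return sum(math.exp(-rho * (w[0] * z[0] + w[1] * z[1])) for z in Z)

# Constrained-path minimizer on the same grid at scale rho = 5.
constrained = min(BOX, key=lambda w: loss(w, 5.0))

print("max-margin set:", max_margin_set)
print("lexicographic max-margin:", lex_max_margin)
print("constrained-path minimizer:", constrained)
```

Python's tuple comparison is lexicographic, so sorting the margins and taking the maximum of the resulting tuples implements the iterative refinement in one step.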
5 Summary
In this paper we characterized the connections between the constrained, margin and optimization paths. First, in Section 3, we examined general non-homogeneous models. We showed that the margin of the constrained path solution converges to the maximum margin. We further analyzed this result and demonstrated how it implies convergence in parameters, i.e., the constrained path converges to the margin path, for some models. Then, we examined functions that are a finite sum of positively homogeneous functions. These prediction functions can represent an ensemble of neural networks with positive homogeneous activation functions. For this model, we characterized the asymptotic constrained path and margin path solution. This implies a surprising result: ensembles of neural networks will aim to discard the most shallow network. In future work we aim to analyze sums of homogeneous functions with shared variables, such as ResNets.
Second, in Section 4 we focused on homogeneous models. For such models we linked the optimization path to the margin and constrained paths. In particular, we showed that the optimization path converges to stationary points of the constrained path and margin path. In future work, we aim to extend this to non-homogeneous models. In addition, we gave a more refined characterization of the constrained path limit. It will be interesting to find whether this characterization can be further refined to answer whether the weighting of the data points can have any effect on the selection of the asymptotic solution, as Byrd & Lipton (2018) observed empirically that it did not.
Acknowledgements
The authors are grateful to C. Zeno and N. Merlis for helpful comments on the manuscript. This research was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. SG and NS were partially supported by NSF awards IIS-1302662 and IIS-1764032.
References
 Ali et al. (2018) Ali, A., Kolter, J. Z., and Tibshirani, R. J. A continuous-time view of early stopping for least squares regression. arXiv preprint arXiv:1810.10082, 2018.
 Byrd & Lipton (2018) Byrd, J. and Lipton, Z. C. Weighted risk minimization & deep learning. arXiv preprint arXiv:1812.03372, 2018.
 Gunasekar et al. (2017) Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6152–6160, 2017.
 Gunasekar et al. (2018a) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. NIPS, 2018a.
 Gunasekar et al. (2018b) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. ICML, 2018b.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, 2017.
 Ji & Telgarsky (2018) Ji, Z. and Telgarsky, M. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
 Ji & Telgarsky (2019) Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019.
 Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, pp. 1–16, 2017.
 Nacson et al. (2019a) Nacson, M. S., Lee, J., Gunasekar, S., Savarese, P. H., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. AISTATS, 2019a.
 Nacson et al. (2019b) Nacson, M. S., Srebro, N., and Soudry, D. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. AISTATS, 2019b.
 Nar et al. (2019) Nar, K., Ocal, O., Sastry, S. S., and Ramchandran, K. Cross-entropy loss leads to poor margins, 2019.
 Neyshabur et al. (2015a) Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015a.
 Neyshabur et al. (2015b) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.
 Neyshabur et al. (2017) Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
 Nocedal & Wright (2006) Nocedal, J. and Wright, S. Numerical optimization. Springer Science, 35(6768), 2006.

 Rosset et al. (2004a) Rosset, S., Zhu, J., and Hastie, T. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 2004a.
 Rosset et al. (2004b) Rosset, S., Zhu, J., and Hastie, T. J. Margin maximizing loss functions. In Advances in neural information processing systems, pp. 1237–1244, 2004b.
 Soudry et al. (2018a) Soudry, D., Hoffer, E., Shpigel Nacson, M., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. JMLR, 2018a.
 Soudry et al. (2018b) Soudry, D., Hoffer, E., and Srebro, N. The implicit bias of gradient descent on separable data. ICLR, 2018b.
 Suggala et al. (2018) Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pp. 10631–10641, 2018.
 Wei et al. (2018) Wei, C., Lee, J. D., Liu, Q., and Ma, T. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369v1, pp. 1–34, 2018.
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
 Wu et al. (2017) Wu, L., Zhu, Z., and E, W. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes. arXiv, 2017.
 Xu et al. (2019) Xu, T., Zhou, Y., Ji, K., and Liang, Y. When will gradient methods converge to max-margin classifier under ReLU models?, 2019.
 Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A Proof of Lemma 1
Proof.
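Let $\mathbf{w}^*$ be a solution of the optimization problem in eq. 2, i.e., a minimizer of $f$ subject to $g(\mathbf{w})\le\rho$ with optimal value $\phi(\rho)$. Then, $g(\mathbf{w}^*)=\rho$, since otherwise we could have decreased $\rho$ without changing $\mathbf{w}^*$ or $\phi(\rho)$, and this is impossible, since $\phi(\rho)$ is strictly monotonically decreasing. Therefore, we cannot decrease $g(\mathbf{w})$ below $\rho$ without increasing $f(\mathbf{w})$ above $\phi(\rho)$. This implies that $\mathbf{w}^*$ is a solution of the optimization problem in eq. 3 with $g(\mathbf{w}^*)=\rho$. Next, all that is left is to show that eq. 3 has no additional solutions. Suppose by contradiction there were such solutions $\mathbf{w}'$. Since they are also minimizers of eq. 3, like $\mathbf{w}^*$, they have the same minimum value $g(\mathbf{w}')=\rho$. Since they are not solutions of eq. 2, we have $f(\mathbf{w}')>\phi(\rho)$. However, this means they are not feasible for eq. 3, and therefore cannot be solutions. ∎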
Appendix B Proof of Claim 1
Proof.
Recall we denoted the set of solutions of eq. 14 as , and recall from eq. 13. To simplify notations we omit the dependency on from the notation, i.e., we replace with . Suppose the claim was not correct. Then, there would have existed such that , such that Note that is feasible in both optimization problems (eq. 13 and 14), since both problems have the same constraints. Moreover, since it must be suboptimal in comparison to the solution of eq. 13. Therefore, such that for any , . Then we can write (from eq. 14)
(24) 
From Assumption 2 we know that such that a solution of the margin path exists. Therefore, , eq. 11 is feasible. We assume, WLOG, that . This implies that there exist a feasible finite solution to eq. 24 which does not depend on . Therefore, , , and the values of are respectively bounded below the values of , which are independent of . This implies that if we select large enough, we will have . This would contradict the assumption that and therefore minimizes eq. 24. This implies that , such that , we have , which entails the Theorem.
∎