# Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization lead to margin maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models. To this end we study the limit of loss minimization with a diverging norm constraint (the "constrained path"), relate it to the limit of a "margin path" and characterize the resulting solution. For non-homogeneous ensemble models, whose output is a sum of homogeneous sub-models, we show that this solution discards the shallowest sub-models if they are unnecessary. For homogeneous models, we show convergence to a "lexicographic max-margin solution", and provide conditions under which max-margin solutions are also attained as the limit of unconstrained gradient descent.

### 1 Introduction

Inductive bias introduced through the learning process plays a crucial role in training deep neural networks and in the generalization properties of the learned models (Neyshabur et al., 2015b, a; Zhang et al., 2017; Keskar et al., 2017; Neyshabur et al., 2017; Wilson et al., 2017; Hoffer et al., 2017). Deep neural networks used in practice are typically highly overparameterized, i.e., have far more trainable parameters than training examples. Thus, using these models, it is usually possible to fit the data perfectly and obtain zero training error (Zhang et al., 2017). However, simply minimizing the training loss does not guarantee good generalization to unseen data: many global minima of the training loss indeed have very high test error (Wu et al., 2017). The inductive bias introduced in our learning process affects which specific global minimizer is chosen as the predictor. Therefore, it is essential to understand the nature of this inductive bias to understand why overparameterized models, and particularly deep neural networks, exhibit good generalization abilities.

A common way to introduce an additional inductive bias in overparameterized models is via small amounts of regularization, or loose constraints. For example, Rosset et al. (2004b, a); Wei et al. (2018) show that, in overparameterized classification models, a vanishing amount of regularization, or a diverging norm constraint, can lead to max-margin solutions, which in turn enjoy strong generalization guarantees.

A second and more subtle source of inductive bias is via the optimization algorithm used to minimize the underdetermined training objective (Gunasekar et al., 2017; Soudry et al., 2018b). Common algorithms used in neural network training, such as stochastic gradient descent, iteratively refine the model parameters by making incremental local updates. For different algorithms, the local updates are specified by different geometries in the space of parameters. For example, gradient descent uses a Euclidean ($\ell_2$) geometry, while coordinate descent updates are specified in the $\ell_1$ geometry. The minimizers to which such local-search-based optimization algorithms converge are indeed very special and are related to the geometry of the optimization algorithm (Gunasekar et al., 2018b) as well as the choice of model parameterization (Gunasekar et al., 2018a).

In this work we similarly investigate the connection between margin maximization and the limits of

• The “optimization path” of unconstrained, unregularized gradient descent.

• The “constrained path”, where we optimize with a diverging (increasingly loose) constraint on the norm of the parameters.

• The closely related “regularization path”, of solutions with decreasing penalties on the norm.

To better understand the questions we tackle in this paper, and our contribution toward understanding the inductive bias introduced in training, let us briefly survey prior work.

##### Equivalence of the regularization or constrained paths and margin maximization:

Rosset et al. (2004b, a); Wei et al. (2018) investigated the connection between the regularization and constrained paths and the max-margin solution. Rosset et al. (2004a, b) considered linear (hence homogeneous) models with monotone loss and explicit norm regularization or constraint, and proved convergence to the max-margin solution for certain loss functions (e.g., logistic loss) as the regularization vanishes or the norm constraint diverges.

Wei et al. (2018) extended the regularization path result to non-linear but positive-homogeneous prediction functions,

###### Definition 1 (α-positive homogeneous function).

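A function $f:\mathbb{R}^d\to\mathbb{R}$ is $\alpha$-positive homogeneous if $f(\rho\theta)=\rho^{\alpha}f(\theta)$ for all $\rho>0$ and all $\theta\in\mathbb{R}^d$.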

e.g. as obtained by a ReLU network with uniform depth.

These results are thus limited to only positive homogeneous predictors, and do not include deep networks with bias parameters, ensemble models with different depths, ResNets, or other models with skip connections. Here, we extend this connection beyond positive homogeneous predictors.
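Definition 1 and the role of bias parameters can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-hidden-layer ReLU network with scalar input (the function `net` and the weights are illustrative and not taken from the paper). Without an output bias, scaling all parameters by $\rho$ scales the output by $\rho^2$; with an output bias the model becomes a sum of a degree-2 and a degree-1 term and is no longer homogeneous.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def net(w, v, c, x):
    """Hypothetical network: sum_j v_j * relu(w_j * x) + c, with output bias c."""
    return sum(vj * relu(wj * x) for wj, vj in zip(w, v)) + c


def homogeneity_gap(w, v, c, x, rho, alpha):
    """|f(rho * theta) - rho**alpha * f(theta)| when every parameter is scaled by rho."""
    scaled = net([rho * wj for wj in w], [rho * vj for vj in v], rho * c, x)
    return abs(scaled - rho ** alpha * net(w, v, c, x))


w = [0.5, -1.0, 2.0]
v = [1.0, 0.3, -0.7]
x, rho = 0.7, 3.0

# No bias: 2-positive homogeneous, so the gap is zero up to rounding.
gap_no_bias = homogeneity_gap(w, v, 0.0, x, rho, alpha=2)
# Output bias: the bias term scales as rho, not rho**2, so the gap is large.
gap_with_bias = homogeneity_gap(w, v, 0.5, x, rho, alpha=2)
print(gap_no_bias, gap_with_bias)
```

The biased model in this sketch is exactly a sum of homogeneous functions of different orders, which is the class studied in Section 3.2.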

Furthermore, even for homogeneous or linear predictors, there might be multiple margin maximizing solutions. For linear models, Rosset et al. (2004b) alluded to a refined set of maximum margin classifiers that, in addition to maximizing the distance to the closest data point (max-margin), also maximize the distance to the second closest data point, and so on. We formulate such special maximum margin solutions as “lexicographic max-margin” classifiers, which we introduce in Section 4.2. We show that for general continuous homogeneous models, the constrained path with diverging norm constraint converges to these more refined “lexicographic max-margin” classifiers.

##### Equivalence of the optimization path and margin maximization:

Another line of work studied the connection between unconstrained, unregularized optimization with a specific algorithm (i.e., the limit of the “optimization path”), and the max-margin solution. For linear prediction with the logistic loss (or other exponential tail losses), we now know that gradient descent (Soudry et al., 2018b; Ji & Telgarsky, 2018) as well as SGD (Nacson et al., 2019b) converges in direction to the max-margin solution, while steepest descent with respect to an arbitrary norm converges to the max-margin solution w.r.t. the corresponding norm (Gunasekar et al., 2018b). All the above results are for linear prediction. Gunasekar et al. (2018a); Nacson et al. (2019a); Ji & Telgarsky (2019) obtained results establishing convergence to margin maximizing solutions also for certain uniform-depth linear networks (including fully connected networks and convolutional networks), which still implement a linear model. Separately, Xu et al. (2019) analyzed a single linear unit with ReLU activation, a limited non-linear but still positive homogeneous model. Lastly, Soudry et al. (2018a) analyzed a non-linear ReLU network where only a single weight layer is optimized.

Here, we extend this relationship to general, non-linear and positive homogeneous predictors for which the loss can be minimized only at infinity. We establish a connection between the limit of unregularized unconstrained optimization and the max-margin solution.

##### Problems with finite minimizers:

We note that the connection between the regularization path and the optimization path was previously considered in different settings, where a finite (global) minimum exists. In such settings the questions asked are different from the ones we consider here, and are not about the limit of the paths. E.g., Ali et al. (2018) showed for gradient flow a multiplicative relation between the risk for the gradient flow optimization path and the ridge-regression regularization path. Also, Suggala et al. (2018) showed that, for gradient flow with a strongly convex and smooth loss function, gradient descent iterates on the unregularized loss function are pointwise close to solutions of a corresponding regularized problem.

#### Contributions

We examine overparameterized realizable problems (i.e., where it is possible to perfectly classify the training data), when training using monotone decreasing classification loss functions. For simplicity, we focus on the exponential loss. However, using similar techniques as in Soudry et al. (2018a) our results should extend to other exponential-tailed loss functions such as the logistic loss and its multi-class generalization. This is indeed the common setting for deep neural networks used in practice.

We show that in any model,

• As long as the margin attainable by an (unregularized, unconstrained) model is unbounded, the margin of the constrained path converges to the max-margin. See Corollary 1.

• If additional conditions hold, the constrained path also converges to the “margin path” in parameter space (the path of minimal norm solutions attaining increasingly large margins). See section 3.1.

We then demonstrate that

• If the model is a sum of homogeneous functions of different orders (i.e., it is not homogeneous itself), then we can still characterize the asymptotic solution of both the constrained path and the margin path. See Theorem 1 (Section 3.2).

• This solution implies that in an ensemble of homogeneous neural networks, the ensemble will aim to discard the most shallow network. This is in contrast to what we would expect from considerations of optimization difficulty (since deeper networks are typically harder to train (He et al., 2016)).

• This also allows us to represent hard-margin SVM problems with unregularized bias using such models. This is in contrast to previous approaches which fail to do so, as pointed out recently (Nar et al., 2019).

Finally, for homogeneous models,

• We find general conditions under which the optimization path converges to stationary points of the margin path or the constrained path. See section 4.1.

• We show that the constrained path converges to a specific type of max-margin solution, which we term the “lexicographic max-margin”. See Theorem 4. (Footnote: The authors thank Rob Shapire for the suggestion of the nomenclature during initial discussions.)

### 2 Preliminaries and Basic Results

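In this paper, we will study the following exponential tailed loss function

$$\mathcal{L}(\theta)\triangleq\sum_{n=1}^{N}\exp\left(-f_n(\theta)\right), \tag{1}$$

where $f_n:\mathbb{R}^d\to\mathbb{R}$ is a continuous function, and $N$ is the number of samples. Also, for any norm $\|\cdot\|$ in $\mathbb{R}^d$ we define $\mathbb{S}^{d-1}$ as the unit norm ball in $\mathbb{R}^d$.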

We will use in our results the following basic lemma

###### Lemma 1.

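Let $f$ and $g$ be two functions from $\mathbb{R}^d$ to $\mathbb{R}$, such that

$$\phi(\rho)=\min_{w\in\mathbb{R}^d}f(w)\quad\text{s.t.}\quad g(w)\le\rho \tag{2}$$

exists and is strictly monotonically decreasing in $\rho$, $\forall\rho\ge\rho_0$, for some $\rho_0$. Then, $\forall\rho\ge\rho_0$, the optimization problem in eq. 2 has the same set of solutions as

$$\operatorname*{argmin}_{w\in\mathbb{R}^d}g(w)\quad\text{s.t.}\quad f(w)\le\phi(\rho), \tag{3}$$

whose minimum is obtained at $g(w)=\rho$.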

###### Proof.

See Appendix A. ∎

#### 2.1 The Optimization Path

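The optimization path in the Euclidean norm $\|\cdot\|_2$ is given by the direction of the iterates of the gradient descent algorithm with initialization $\theta(0)$ and learning rates $\{\eta_t\}_{t=1}^{\infty}$,

$$\text{Optimization path:}\quad\bar{\theta}(t)=\frac{\theta(t)}{\|\theta(t)\|},\quad\text{where }\theta(t)=\theta(t-1)-\eta_t\nabla\mathcal{L}(\theta(t-1)). \tag{4}$$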
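As an illustration of eq. 4, the following minimal sketch runs gradient descent on the exponential loss of eq. 1 for a linear model $f_n(\theta)=\theta^\top z_n$ and reports the normalized iterate. The dataset, initialization, constant learning rate and step count are assumptions chosen for illustration only; for this symmetric toy dataset the Euclidean max-margin direction is $(1,0)$.

```python
import math


def optimization_path_direction(data, theta0, lr, steps):
    """Run gradient descent on L(theta) = sum_n exp(-theta . z_n); return direction and norm."""
    theta = list(theta0)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for z in data:
            weight = math.exp(-sum(t * zi for t, zi in zip(theta, z)))
            for i, zi in enumerate(z):
                grad[i] -= zi * weight  # d/dtheta_i of exp(-theta . z)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    norm = math.sqrt(sum(t * t for t in theta))
    return [t / norm for t in theta], norm


# Hypothetical toy data z_n = y_n * x_n.
data = [(1.0, 2.0), (1.0, -2.0)]
direction, norm = optimization_path_direction(data, (0.0, 0.5), lr=0.1, steps=2000)
print(direction, norm)
```

The norm of the iterate keeps growing because the loss is minimized only at infinity, which is why the path is defined through the normalized iterate rather than the iterate itself.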

#### 2.2 The Constrained Path

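The constrained path for the loss in eq. 1 is given by the minimizer of the loss at a given norm value $\rho>0$, i.e.,

$$\text{Constrained path:}\quad\Theta_c(\rho)\triangleq\operatorname*{argmin}_{\theta\in\mathbb{S}^{d-1}}\mathcal{L}(\rho\theta). \tag{5}$$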

The constrained path was previously considered for linear models (Rosset et al., 2004a). However, most previous works (e.g. Rosset et al. (2004b); Wei et al. (2018)) focused on the regularization path, which is the minimizer of the regularized loss. These two paths are closely linked, as we discuss in more detail in Appendix F.

Denote the constrained minimum of the loss as follows:

$$\mathcal{L}^*(\rho)\triangleq\min_{\theta\in\mathbb{S}^{d-1}}\mathcal{L}(\rho\theta).$$

$\mathcal{L}^*(\rho)$ exists for any finite $\rho$ as the minimum of a continuous function on a compact set.

By Lemma 1, the Assumption

###### Assumption 1.

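There exists $\rho_0$ such that $\mathcal{L}^*(\rho)$ is strictly monotonically decreasing to zero for any $\rho\ge\rho_0$.

enables an alternative form of the constrained path

$$\Theta_c(\rho)=\operatorname*{argmin}_{\theta\in\mathbb{R}^d}\|\theta\|^2\quad\text{s.t.}\quad\mathcal{L}(\rho\theta)\le\mathcal{L}^*(\rho).$$

In addition, in the next lemma we show that under this assumption the constrained path minimizers are obtained on the boundary of $\mathbb{S}^{d-1}$.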

###### Lemma 2.

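Under Assumption 1, for all $\rho>\rho_0$ and for all $\theta_c(\rho)\in\Theta_c(\rho)$, we have $\|\theta_c(\rho)\|=1$.

###### Proof.

Let $\rho>\rho_0$. We assume, in contradiction, that there exists $\theta_c(\rho)\in\Theta_c(\rho)$ so that $\|\theta_c(\rho)\|=b<1$. This implies that $\mathcal{L}^*(b\rho)\le\mathcal{L}(\rho\theta_c(\rho))=\mathcal{L}^*(\rho)$ with $b\rho<\rho$, which contradicts our assumption that $\mathcal{L}^*(\rho)$ is strictly monotonically decreasing. ∎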

#### 2.3 The Margin Path

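For prediction functions $f_n:\mathbb{R}^d\to\mathbb{R}$ on data points indexed as $n=1,\dots,N$, we define the margin path as:

$$\text{Margin path:}\quad\Theta_m(\rho)\triangleq\operatorname*{argmax}_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\rho\theta). \tag{6}$$

For $\theta\in\mathbb{S}^{d-1}$, we denote the margin at scaling $\rho>0$ as

$$\gamma(\rho,\theta)=\min_n f_n(\rho\theta),$$

and the max-margin at scale $\rho>0$ as

$$\gamma^*(\rho)=\max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\rho\theta).$$

Note that for all $\rho$, this maximum exists as the maximum of a continuous function on a compact set.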

Again, we make a simplifying assumption

###### Assumption 2.

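There exists $\rho_0$ such that $\gamma^*(\rho)$ is strictly monotonically increasing to $\infty$ for any $\rho\ge\rho_0$.

Many common prediction functions satisfy this assumption, including the sum of positive-homogeneous prediction functions.

Using Lemma 1 with Assumption 2, we have:

$$\begin{aligned}\Theta_m(\rho)&=\operatorname*{argmax}_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\rho\theta)\\&=\operatorname*{argmin}_{\theta\in\mathbb{R}^d}\|\theta\|^2\quad\text{s.t.}\quad\min_n f_n(\rho\theta)\ge\gamma^*(\rho).\end{aligned} \tag{7}$$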

### 3 Non-Homogeneous Models

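We first study the constrained path in non-homogeneous models, and relate it to the margin path. To do so, we need to first define the $\epsilon$-ball surrounding a set $\mathcal{A}$

$$\mathcal{B}_\epsilon(\mathcal{A})\triangleq\left\{\theta\in\mathbb{R}^d\;\middle|\;\exists\theta'\in\mathcal{A}:\|\theta-\theta'\|<\epsilon\right\},$$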

and the notion of set convergence

###### Definition 2 (Set convergence).

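A sequence of sets $\mathcal{A}_t\subset\mathbb{R}^d$ converges to another sequence of sets $\mathcal{C}_t\subset\mathbb{R}^d$ if $\forall\epsilon>0$ $\exists t_0$ such that $\forall t>t_0$: $\mathcal{A}_t\subset\mathcal{B}_\epsilon(\mathcal{C}_t)$.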

#### 3.1 Margin of Constrained Path Converges to Maximum Margin

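For all $\rho$, the constrained path margin deviation from the max-margin is bounded, as we prove next.

###### Lemma 3.

For all $\rho>0$, and every $\theta_c(\rho)$ in $\Theta_c(\rho)$

$$\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))\le\log N. \tag{8}$$

###### Proof.

Note that

$$e^{-\gamma(\rho,\theta)}\le\sum_{n=1}^{N}\exp\left(-f_n(\rho\theta)\right)\le Ne^{-\gamma(\rho,\theta)}. \tag{9}$$

Since $\theta_c(\rho)$ minimizes the loss at scale $\rho$, we have, for every $\theta_m(\rho)\in\Theta_m(\rho)$ and every $\theta_c(\rho)\in\Theta_c(\rho)$,

$$1\le\frac{\mathcal{L}(\rho\theta_m(\rho))}{\mathcal{L}(\rho\theta_c(\rho))}=\frac{\sum_{n=1}^{N}\exp\left(-f_n(\rho\theta_m(\rho))\right)}{\sum_{n=1}^{N}\exp\left(-f_n(\rho\theta_c(\rho))\right)}\le N\exp\left(-\left(\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))\right)\right)$$

$$\Rightarrow\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))\le\log N.\ ∎$$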
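The bound of Lemma 3 can be checked numerically. The sketch below is an illustration under stated assumptions: a linear model $f_n(\rho\theta)=\rho\,\theta^\top z_n$, a hypothetical three-point dataset, and a finite grid on the Euclidean unit circle standing in for the feasible set. The same two-sided bound on the loss applies to any common candidate set, so the inequality should hold on the grid as well.

```python
import math


def loss(rho, theta, data):
    """Exponential loss L(rho * theta) for the linear model f_n(theta) = theta . z_n."""
    return sum(math.exp(-rho * sum(t * z for t, z in zip(theta, zn))) for zn in data)


def margin(rho, theta, data):
    """gamma(rho, theta) = min_n f_n(rho * theta)."""
    return min(rho * sum(t * z for t, z in zip(theta, zn)) for zn in data)


def unit_circle(num_points):
    return [(math.cos(2 * math.pi * k / num_points), math.sin(2 * math.pi * k / num_points))
            for k in range(num_points)]


data = [(1.0, 2.0), (1.0, -2.0), (2.0, 0.5)]  # hypothetical z_n = y_n * x_n
candidates = unit_circle(720)

gaps = {}
for rho in (1.0, 5.0, 25.0):
    theta_c = min(candidates, key=lambda th: loss(rho, th, data))   # constrained path (on the grid)
    gamma_star = max(margin(rho, th, data) for th in candidates)    # max-margin (on the grid)
    gaps[rho] = gamma_star - margin(rho, theta_c, data)

print(gaps, math.log(len(data)))
```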

Lemma 3 immediately implies that

###### Corollary 1.

If $\lim_{\rho\to\infty}\gamma^*(\rho)=\infty$, then for every $\theta_c(\rho)$ in $\Theta_c(\rho)$

$$\lim_{\rho\to\infty}\frac{\gamma^*(\rho)}{\gamma(\rho,\theta_c(\rho))}=1.$$

The last corollary states that the margin of the constrained path converges to the maximum margin. However, this does not necessarily imply convergence in parameter space, i.e., this result does not guarantee that $\Theta_c(\rho)$ converges to $\Theta_m(\rho)$. We analyze some positive and negative examples to demonstrate this claim.

Example 1: homogeneous models

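It is straightforward to see that, for $\alpha$-positive homogeneous prediction functions (Definition 1), the margin path in eq. 6 is the same set for any $\rho$, and is given by

$$\Theta_m^*=\operatorname*{argmax}_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\theta).$$

Additionally, as we show next, for such models Lemma 3 implies convergence in parameter space, i.e., $\Theta_c(\rho)$ converges to $\Theta_m^*$. To see this, notice that for $\alpha$-positive homogeneous functions $f_n$ and any $\rho>0$:

$$\begin{aligned}\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))&=\max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\rho\theta)-\min_n f_n(\rho\theta_c(\rho))\\&=\rho^{\alpha}\left(\max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\theta)-\min_n f_n(\theta_c(\rho))\right)\le\log N.\end{aligned}$$

For $\rho\to\infty$ we must have

$$\left(\max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\theta)-\min_n f_n(\theta_c(\rho))\right)\to0.$$

By continuity, the last equation implies that $\Theta_c(\rho)$ converges to $\Theta_m^*$. For full details see Appendix D.1.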
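The step from the $\log N$ bound to convergence of the normalized margin can be written as a single chain. This is only a restatement of the argument of Example 1 under Definition 1, assuming $\alpha>0$:

```latex
\begin{aligned}
0 \;\le\; \max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\theta)-\min_n f_n(\theta_c(\rho))
  \;=\; \frac{\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))}{\rho^{\alpha}}
  \;\le\; \frac{\log N}{\rho^{\alpha}}
  \;\xrightarrow[\rho\to\infty]{}\; 0 .
\end{aligned}
```

The rate $\rho^{-\alpha}$ comes entirely from homogeneity; for models in which the margin grows more slowly with $\rho$, the same $\log N$ bound may be too weak to force convergence.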

Connection to previous results: For linear models, Rosset et al. (2004a) connected the constrained path and maximum margin solution. In addition, for any norm, Rosset et al. (2004b) showed that the regularization path converges to the limit of the margin path. In a recent work, Wei et al. (2018) extended this result to homogeneous models with cross-entropy loss. Here, for homogeneous models and any norm, we show a connection between the constrained path and the margin path.

Extension: Later, in Theorem 4 we prove a more refined result: the constrained path converges to a specific subset of the margin path set (the lexicographic max-margin set).

In contrast, in general models, eq. 8 does not necessarily imply convergence in the parameter space. We demonstrate this result in the next example.

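Example 2: log predictor: We denote $z_n=y_nx_n$ for some dataset $\{x_n,y_n\}_{n=1}^{N}$, with features $x_n\in\mathbb{R}^d$ and label $y_n\in\{-1,1\}$. We examine the prediction function $f_n(\rho\theta)=\log(\rho\theta^\top z_n)$ for $\theta^\top z_n>0$. We focus on the tail behaviour of the loss function and thus only care about its behaviour in the region $\theta^\top z_n>0$. We assume that a separator which satisfies this constraint exists, since we are focusing on realizable problems.

Since $\log$ is strictly increasing and $\rho>0$, we have

$$\gamma(\rho,\theta)=\min_n f_n(\rho\theta)=\log\left(\rho\min_n\theta^\top z_n\right).$$

We denote $\tilde{\gamma}(\theta)=\min_n\theta^\top z_n$ and $\tilde{\gamma}^*=\max_{\theta\in\mathbb{S}^{d-1}}\tilde{\gamma}(\theta)$. Note that $\gamma^*(\rho)=\log(\rho\tilde{\gamma}^*)$. Now consider $\theta_c(\rho)$ such that $\tilde{\gamma}(\theta_c(\rho))=c\,\tilde{\gamma}^*$ for some constant $c\in[1/N,1)$ and all sufficiently large $\rho$. For this case, we still have

$$\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))=\log(\rho\tilde{\gamma}^*)-\log(\rho\tilde{\gamma}(\theta_c(\rho)))=\log\left(\frac{\tilde{\gamma}^*}{\tilde{\gamma}(\theta_c(\rho))}\right)\le\log N,$$

but clearly $\tilde{\gamma}(\theta_c(\rho))\ne\tilde{\gamma}^*$. Thus, Lemma 3 does not guarantee that $\tilde{\gamma}(\theta_c(\rho))\to\tilde{\gamma}^*$ as $\rho\to\infty$, or that $\Theta_c(\rho)$ converges to $\Theta_m(\rho)$.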

Analogies with regularization and optimization paths: This example demonstrates that for the prediction function $f_n(\rho\theta)=\log(\rho\theta^\top z_n)$ for $\theta^\top z_n>0$, the constrained path does not necessarily converge to the margin path. This is equivalent to setup A: linear prediction models with loss function $\ell(u)=\exp(-\log u)=1/u$. Rosset et al. (2004b) and Nacson et al. (2019a) state related results for setup A. Both works derived conditions on the loss function that ensure convergence to the margin path from the regularization path and the optimization path, respectively. Rosset et al. (2004b) showed that in setup A the regularization path does not necessarily converge to the margin path. Nacson et al. (2019a) showed a similar result for the optimization path, i.e., that in setup A the optimization path does not necessarily converge to the margin path. Both results align with our results for the constrained path.

In contrast, according to the conditions of Rosset et al. (2004b); Nacson et al. (2019a), we know that if the prediction function is $f_n(\rho\theta)=\log^{1+\epsilon}(\rho\theta^\top z_n)$ for some $\epsilon>0$ and $\theta^\top z_n>0$, then the regularization path and optimization path do converge to the margin path. In the next example, we show that this is also true for the constrained path.

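Example 3: $\log^{1+\epsilon}$ predictor: We examine the prediction function $f_n(\rho\theta)=\log^{1+\epsilon}(\rho\theta^\top z_n)$ for $\theta^\top z_n>0$ and some $\epsilon>0$. Since the log function is strictly increasing and $\rho>0$, we have

$$\gamma(\rho,\theta)=\min_n f_n(\rho\theta)=\log^{1+\epsilon}\left(\rho\min_n\theta^\top z_n\right).$$

For all sufficiently large $\rho$:

$$\gamma^*(\rho)-\gamma(\rho,\theta_c(\rho))=(1+\epsilon)\log^{\epsilon}(\rho)\left(\log(\tilde{\gamma}^*)-\log(\tilde{\gamma}(\theta_c(\rho)))\right)+o\left(\log^{\epsilon}(\rho)\right)\le\log N.$$

For $\rho\to\infty$ we must have $\log(\tilde{\gamma}^*)-\log(\tilde{\gamma}(\theta_c(\rho)))\to0$, which implies, by continuity, that $\Theta_c(\rho)$ converges to $\Theta_m(\rho)$. For details, see Appendix D.2.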
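The contrast between Examples 2 and 3 can be made concrete by asking how small the ratio $\tilde{\gamma}(\theta_c(\rho))/\tilde{\gamma}^*$ is allowed to be under the $\log N$ bound of Lemma 3. The sketch below solves the bound for that ratio in closed form; the function name, the choice $N=10$ and the normalization $\tilde{\gamma}^*=1$ are illustrative assumptions. With $\epsilon=0$ (the log predictor) the admissible ratio stays at $1/N$ for every $\rho$, while with $\epsilon=1$ it approaches 1 as $\rho$ grows.

```python
import math


def smallest_allowed_ratio(rho, num_samples, eps, gamma_tilde_star=1.0):
    """Smallest gamma_tilde / gamma_tilde_star satisfying
    log^{1+eps}(rho * gamma_tilde_star) - log^{1+eps}(rho * gamma_tilde) <= log(N)."""
    top = math.log(rho * gamma_tilde_star) ** (1.0 + eps)
    lower = top - math.log(num_samples)
    if lower <= 0.0:
        raise ValueError("rho is too small for the bound to be informative")
    gamma_tilde = math.exp(lower ** (1.0 / (1.0 + eps))) / rho
    return gamma_tilde / gamma_tilde_star


rhos = [1e3, 1e6, 1e12]
ratios_log = [smallest_allowed_ratio(rho, 10, eps=0.0) for rho in rhos]
ratios_log_eps = [smallest_allowed_ratio(rho, 10, eps=1.0) for rho in rhos]
print(ratios_log)      # stays at 1/N
print(ratios_log_eps)  # increases toward 1
```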

#### 3.2 Sum of Positively Homogeneous Functions

Remark: The results in this subsection are specific to the Euclidean ($\ell_2$) norm.

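Let $f_n$ be functions that are a finite sum of positively homogeneous functions, i.e., for some finite $K$:

$$\forall n:\quad f_n(\rho\theta)=\sum_{k=1}^{K}f_n^{(k)}(\rho\theta_k), \tag{10}$$

where $\theta=[\theta_1,\dots,\theta_K]$ and $f_n^{(k)}$ are $\alpha_k$-positive homogeneous functions, where $0<\alpha_1<\alpha_2<\dots<\alpha_K$.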

First, we characterize the asymptotic form of the margin path in this setting.

###### Lemma 4.

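Let $f_n$ be a sum of positively homogeneous functions as in eq. 10. Then, the set of solutions of

$$\operatorname*{argmin}_{\theta\in\mathbb{R}^d}\|\theta\|^2\quad\text{s.t.}\quad\forall n:f_n(\rho\theta)\ge\gamma^*(\rho) \tag{11}$$

can be written as

$$\theta_k^*=\frac{1}{\rho}\left(w_k^*+o(1)\right)\left(\gamma^*(\rho)\right)^{\frac{1}{\alpha_k}} \tag{12}$$

where the $o(1)$ term is vanishing as $\rho\to\infty$, and

$$w^*=[w_1^*,\dots,w_K^*]\in\mathcal{W},$$

where

$$\mathcal{W}=\operatorname*{argmin}_{w\in\mathbb{R}^d}\|w_1\|^2\quad\text{s.t.}\quad\forall n:f_n(w)\ge1. \tag{13}$$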
###### Proof.

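We write the original optimization problem

$$\operatorname*{argmin}_{\theta\in\mathbb{R}^d}\sum_{k=1}^{K}\|\theta_k\|^2\quad\text{s.t.}\quad\forall n:\sum_{k=1}^{K}f_n^{(k)}(\rho\theta_k)\ge\gamma^*(\rho).$$

Dividing the constraint by $\gamma^*(\rho)$, using the positive homogeneity of $f_n^{(k)}$, and changing the variables as $\theta_k=\rho^{-1}\left(\gamma^*(\rho)\right)^{1/\alpha_k}w_k$, we obtain an equivalent optimization problem

$$\operatorname*{argmin}_{w\in\mathbb{R}^d}\sum_{k=1}^{K}\left(\gamma^*(\rho)\right)^{\frac{2}{\alpha_k}}\|w_k\|^2\quad\text{s.t.}\quad\forall n:f_n(w)\ge1. \tag{14}$$

Consider the set of solutions of eq. 14. Taking the limit $\rho\to\infty$ of this optimization problem, we find that any solution must minimize the first term in the sum, $\|w_1\|^2$, and only then the other terms. Therefore the asymptotic solution is of the form of eqs. 12 and 13. We prove this reasoning formally in Appendix B, i.e., we show that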

###### Claim 1.

The solution of eq. 14 is the same solution described in Lemma 4, i.e., eqs. 12 and 13.

The following Lemma will be used to connect the constrained path to the characterization of the margin path.

###### Lemma 5.

Let $f_n$ be a sum of positively homogeneous functions as in eq. 10. Any path $\theta(\rho)$ such that

 γ∗(ρ)−γ(ρ,θ(ρ))

is of the form described in eqs. 12 and 13.

###### Proof.

See Appendix C. ∎

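Combining Lemmas 3, 4 and 5 we obtain the following theorem.

###### Theorem 1.

Under Assumptions 1 and 2, any solution in $\Theta_c(\rho)$ converges to

$$\theta_k^*=\frac{1}{\rho}\left(w_k^*+o(1)\right)\left(\gamma^*(\rho)\right)^{\frac{1}{\alpha_k}} \tag{16}$$

where the $o(1)$ term is vanishing as $\rho\to\infty$, and

$$w^*=[w_1^*,\dots,w_K^*]\in\mathcal{W},$$

where

$$\mathcal{W}=\operatorname*{argmin}_{w\in\mathbb{R}^d}\|w_1\|^2\quad\text{s.t.}\quad\forall n:f_n(w)\ge1. \tag{17}$$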

Theorem 1 implications: An important implication of Theorem 1 is that an ensemble of neural networks will aim to discard the shallowest network in the ensemble. Consider the following setting: for each $k=1,\dots,K$, the function $f_n^{(k)}$ represents a prediction function of some feedforward neural network with no bias, all with the same positive-homogeneous activation function of some degree (e.g., ReLU activation is positive-homogeneous of degree 1). Note that in this setup, each of the prediction functions $f_n^{(k)}$ is also a positive-homogeneous function. In particular, network $k$ with depth $d_k$ is positive homogeneous with a degree $\alpha_k$ determined by $d_k$ and by the activation function degree. Since all the networks have the same activation function, deeper networks will have larger degree. We assume WLOG that $d_1<d_2<\dots<d_K$. This implies that $\alpha_1<\alpha_2<\dots<\alpha_K$. In this setting, $f_n$ represents an ensemble of these networks. From Theorem 1, the solution of the constrained path will satisfy

$$\begin{aligned}f_n(\rho\theta^*)&=\sum_{k=1}^{K}f_n^{(k)}(\rho\theta_k^*)\\&=\sum_{k=1}^{K}f_n^{(k)}\left(\left(w_k^*+o(1)\right)\left(\gamma^*(\rho)\right)^{\frac{1}{\alpha_k}}\right)\\&=\gamma^*(\rho)\sum_{k=1}^{K}f_n^{(k)}\left(w_k^*+o(1)\right),\end{aligned}$$

where $w^*\in\mathcal{W}$ is calculated using eq. 13. Examining eq. 13, we observe that the network aims to minimize the norm $\|w_1\|$. In particular, if the network ensemble can satisfy the constraints $\forall n:f_n(w)\ge1$ with $w_1=0$, then the obtained solutions will satisfy $w_1^*=0$. Thus the ensemble will discard the shallowest network if it is “unnecessary” to satisfy the constraint.

Furthermore, from eq. 14 we conjecture that after discarding the shallowest “unnecessary” network, the ensemble will tend to minimize $\|w_2\|$, i.e., to discard the second shallowest “unnecessary” network. This will continue until there are no more “unnecessary” shallow networks. In other words, we conjecture that an ensemble of neural networks will aim to discard the shallowest “unnecessary” networks.
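The mechanism behind this implication can be seen on a toy instance of eq. 14. The following sketch assumes a hypothetical model with a single sample, $K=2$, $f^{(1)}(w_1)=w_1$ (degree $\alpha_1=1$) and $f^{(2)}(w_2)=w_2^2$ (degree $\alpha_2=2$); it is an illustration, not the networks considered in the paper. It minimizes $\gamma^{2}w_1^2+\gamma\,w_2^2$ subject to $w_1+w_2^2\ge1$ by grid search, using the fact that the constraint is active and $w_1\in[0,1]$ at the optimum for this instance.

```python
def solve_rescaled_problem(gamma, grid=20000):
    """Grid search for eq. 14 on the toy model f(w) = w1 + w2**2.

    Objective: gamma**(2/1) * w1**2 + gamma**(2/2) * w2**2, with w2**2 = 1 - w1.
    Returns the minimizing (w1, w2**2).
    """
    best = None
    for i in range(grid + 1):
        w1 = i / grid
        w2_sq = 1.0 - w1  # active constraint w1 + w2**2 = 1
        objective = gamma ** 2 * w1 ** 2 + gamma * w2_sq
        if best is None or objective < best[0]:
            best = (objective, w1, w2_sq)
    return best[1], best[2]


shallow_weight = {gamma: solve_rescaled_problem(gamma)[0] for gamma in (1.0, 10.0, 100.0)}
print(shallow_weight)  # w1 shrinks roughly like 1 / (2 * gamma)
```

As the margin scale grows, the degree-1 component carries a vanishing share of the prediction, since the constraint can be met by the degree-2 component alone.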

Additionally, using Theorem 1 we can now represent hard-margin SVM problems with unregularized bias. Previous results only focused on linear prediction functions without bias. Trying to extend these results to SVM with bias by extending all the input vectors $x_n$ with an additional constant component would fail, since the obtained solution in the original space is the solution of

$$\operatorname*{argmin}_{w\in\mathbb{R}^d,\,b\in\mathbb{R}}\|w\|^2+b^2\quad\text{s.t.}\quad y_n\left(w^\top x_n+b\right)\ge1,$$

which is not the max-margin (SVM) solution, as pointed out by (Nar et al., 2019). However, we can now achieve this goal using Theorem 1. For some dataset , we use the following prediction function where . From eqs. 12, 13 the asymptotic solution will satisfy .

### 4 Homogeneous Models

In the previous section we connected the constrained path to the margin path. We would like to refine this characterization and also understand the connection to the optimization path. In this section we are able to do so for prediction functions $f_n$ which are $\alpha$-positive homogeneous functions (Definition 1).

In the homogeneous case, eq. 7 is equivalent, $\forall\rho>0$, to

$$\Theta_m^*=\operatorname*{argmin}_{\theta\in\mathbb{R}^d}\|\theta\|^2\quad\text{s.t.}\quad\min_n f_n(\theta)\ge\gamma^*(1) \tag{18}$$

since $f_n$ is homogeneous.
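The equivalence follows in one line from Definition 1. As a worked step, assuming $\rho>0$ and using $\gamma^*(\rho)=\max_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\rho\theta)=\rho^{\alpha}\gamma^*(1)$:

```latex
\begin{aligned}
\min_n f_n(\rho\theta)\ge\gamma^*(\rho)
\;\iff\; \rho^{\alpha}\min_n f_n(\theta)\ge\rho^{\alpha}\gamma^*(1)
\;\iff\; \min_n f_n(\theta)\ge\gamma^*(1).
\end{aligned}
```

The constraint set therefore does not depend on $\rho$, and neither does the set of minimum-norm solutions.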

#### 4.1 Optimization Path Converges to Stationary Points of the Margin Path and Constrained Path

Remark: The results in this subsection are specific to the Euclidean ($\ell_2$) norm, as opposed to many of the results in this paper, which are stated for any norm.

In this section, we link the optimization path to the margin path and the constrained path. These results require the following smoothness assumption:

###### Assumption 3 (Smoothness).

We assume is a function.

##### Relating optimization path and margin path.

The limit of the margin path for homogeneous models is given by eq. 18. In this section we first relate the optimization path to this limit of margin path.

Note that for general homogeneous prediction functions $f_n$, eq. 18 is a non-convex optimization problem, and thus it is unlikely for an optimization algorithm such as gradient descent to find the global optimum. We can relax the set $\Theta_m^*$ to the set of $\theta$ that are first-order stationary, i.e., critical points of eq. 18. For $\theta\in\mathbb{S}^{d-1}$, denote the set of support vectors of $\theta$ as

$$S_m(\theta)=\{n:f_n(\theta)=\gamma^*(1)\}. \tag{19}$$

###### Definition 3 (First-order Stationary Point).

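The first-order optimality conditions of eq. 18 are:

1. $\forall n$, $f_n(\theta)\ge\gamma^*(1)$,

2. There exists $\lambda\in\mathbb{R}_{+}^{N}$ such that $\theta=\sum_n\lambda_n\nabla f_n(\theta)$ and $\lambda_n=0$ for $n\notin S_m(\theta)$.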

We denote by the set of first-order stationary points.

Let $\theta(t)$ be the iterates of gradient descent. Define $\ell_n(t)=\exp(-f_n(\theta(t)))$ and let $\ell(t)$ be the vector with entries $\ell_n(t)$. The following two assumptions state that the limiting direction of the iterates exists and that the limiting direction of the losses exists. Such assumptions are natural in the context of max-margin problems, since we want to argue that $\theta(t)$ converges to a max-margin direction, and also that the losses converge to an indicator vector of the support vectors. The first step to argue this convergence is to ensure the limits exist.

###### Assumption 4 (Asymptotic Formulas).

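Assume that $\mathcal{L}(\theta(t))\to0$, that is, we converge to a global minimizer. Further assume that $\lim_{t\to\infty}\theta(t)/\|\theta(t)\|$ and $\lim_{t\to\infty}\ell(t)/\|\ell(t)\|$ exist. Equivalently,

$$\ell_n(t)=h(t)a_n+h(t)\epsilon_n(t), \tag{20}$$

$$\theta(t)=g(t)\bar{w}+g(t)\delta(t), \tag{21}$$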

with , , , , and .

###### Assumption 5 (Linear Independence Constraint Qualification).

Let $\theta$ be a unit vector. LICQ holds at $\theta$ if the vectors $\{\nabla f_n(\theta)\}_{n\in S_m(\theta)}$ are linearly independent.

###### Remark 1.

Constraint qualifications allow the first-order optimality conditions of Definition 3 to be a necessary condition for optimality. Without constraint qualifications, the global optimum need not satisfy the optimality conditions.

LICQ is the simplest among many constraint qualification conditions identified in the optimization literature (Nocedal & Wright, 2006).

For example, in linear SVM, LICQ is ensured if the set of support vectors is linearly independent. Consider $f_n(\theta)=y_nx_n^\top\theta$ and let $x_n$, $n\in S_m(\theta)$, be the support vectors. Then $\nabla f_n(\theta)=y_nx_n$, and so linear independence of the support vectors implies LICQ. For data sampled from an absolutely continuous distribution, the SVM solution will always have linearly independent support vectors (Soudry et al., 2018b, Lemma 12), but LICQ may fail when the data is degenerate.

###### Theorem 2.

Define . Under Assumptions 3, 4, and constraint qualification at (Assumption 5), is a first-order stationary point of 18.

The proof of Theorem 2 can be found in Appendix E.1.

##### Optimization path and constrained path.

Next, we study how the optimization path as $t\to\infty$ converges to stationary points of the constrained path with $\rho\to\infty$.

The first-order optimality conditions of the constrained path require that the constraints hold, and that the gradient of the Lagrangian of the constrained path

 ρ∇θL(ρθ)+λ(ρ)θ (22)

is equal to zero. In other words,

###### Remark 2.

Under Assumption 1, is first-order optimal for the problem if it satisfies:

,   .

On many paths the gradient of the Lagrangian goes to zero as $\rho\to\infty$. However, we have a faster vanishing rate for the specific optimization paths that follow Definition 4 below. Therefore, these paths better approximate true stationary points:

###### Definition 4 (First-order optimal for ρ→∞).

A sequence is first-order optimal for with if

,    .

To relate the limit points of gradient descent to the constrained path, we will focus on stationary points of the constrained path that minimize the loss.

###### Theorem 3.

Let be the limit direction of gradient descent. Under Assumptions 1, 3, 4, and constraint qualification at (Assumption 5), the sequence is a first-order optimal point for (Definition 4).

The proof of Theorem 3 can be found in Appendix E.2.

#### 4.2 Lexicographic Max-Margin

Recall that for positive homogeneous prediction functions, the margin path in eq. 6 is the same set for any $\rho$ and is given by

$$\Theta_m^*=\operatorname*{argmax}_{\theta\in\mathbb{S}^{d-1}}\min_n f_n(\theta).$$

For non-convex functions $f_n$ or non-Euclidean norms $\|\cdot\|$, the maximizer in the above set need not be unique. In this case, we define the following refined set of maximum margin solutions

###### Definition 5 (Lexicographic maximum margin set).

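The lexicographic margin set, denoted by $\Theta_{m,N}^*$, is given by the following iterative definition of $\Theta_{m,k}^*$ for $k=0,1,\dots,N$:

$$\begin{aligned}\Theta_{m,0}^*&=\mathbb{S}^{d-1},\\\Theta_{m,k}^*&=\operatorname*{argmax}_{\theta\in\Theta_{m,k-1}^*}\left(\min_{\{n_\ell\}_{\ell=1}^{k}}\max_{\ell\in[k]}f_{n_\ell}(\theta)\right)\subseteq\Theta_{m,k-1}^*.\end{aligned}$$

In the above definition, $\Theta_{m,1}^*$ denotes the set of maximum margin solutions, $\Theta_{m,2}^*$ denotes the subset of $\Theta_{m,1}^*$ that maximizes the second smallest margin, and so on.

For an alternate representation of $\Theta_{m,k}^*$, we introduce the following notation: for $\theta\in\mathbb{S}^{d-1}$, let $n_k^*(\theta)\in[N]$ denote the index corresponding to the $k$-th smallest margin of $\theta$, as defined below, breaking ties in the argmin arbitrarily:

$$n_1^*(\theta)=\operatorname*{argmin}_n f_n(\theta),\qquad n_k^*(\theta)=\operatorname*{argmin}_{n\notin\{n_\ell^*(\theta)\}_{\ell=1}^{k-1}}f_n(\theta)\quad\text{for }k\ge2. \tag{23}$$

Using this notation, we can rewrite $\Theta_{m,k+1}^*$ as

$$\Theta_{m,k+1}^*=\operatorname*{argmax}_{\theta\in\Theta_{m,k}^*}f_{n_{k+1}^*(\theta)}(\theta).$$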
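The iterative definition amounts to a lexicographic comparison of the margins sorted in increasing order. The sketch below illustrates this on a finite candidate set; the linear model $f_n(\theta)=\theta^\top z_n$, the two-point dataset, and the grid on the $\ell_\infty$ unit sphere are assumptions chosen so that the max-margin set contains several candidates, of which the lexicographic rule selects one.

```python
def sorted_margins(theta, data):
    """Margins f_n(theta) = theta . z_n sorted increasingly (smallest margin first)."""
    return tuple(sorted(sum(t * z for t, z in zip(theta, zn)) for zn in data))


def linf_sphere_grid(steps):
    """Grid on the boundary of the l_infinity unit ball in 2D."""
    points = set()
    for i in range(-steps, steps + 1):
        t = i / steps
        points.update({(1.0, t), (-1.0, t), (t, 1.0), (t, -1.0)})
    return sorted(points)


data = [(1.0, 0.0), (1.0, 1.0)]  # hypothetical z_n = y_n * x_n
candidates = linf_sphere_grid(4)

# Step k = 1: ordinary max-margin set (largest smallest margin).
best_margin = max(sorted_margins(th, data)[0] for th in candidates)
max_margin_set = [th for th in candidates if sorted_margins(th, data)[0] == best_margin]

# All steps at once: tuples compare lexicographically, matching the iterative definition.
lexicographic = max(candidates, key=lambda th: sorted_margins(th, data))
print(max_margin_set, lexicographic)
```

Here every candidate $(1,t)$ with $t\ge0$ attains the maximal smallest margin, and the tie is broken by the second smallest margin.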

We also define the limit set of constrained path as follows:

###### Definition 6 (Limit set of constrained path).

The limit set of constrained path is defined as follows:

###### Theorem 4.

For $\alpha$-positive homogeneous prediction functions, the limit set of the constrained path is contained in the lexicographic maximum margin set.

The proof of the above Theorem follows from adapting the arguments of (Rosset et al., 2004a) (Theorem in Appendix ) for general homogeneous models. We show the complete proof in Appendix E.3.

### 5 Summary

In this paper we characterized the connections between the constrained, margin and optimization paths. First, in Section 3, we examined general non-homogeneous models. We showed that the margin of the constrained path solution converges to the maximum margin. We further analyzed this result and demonstrated how it implies convergence in parameters, i.e., $\Theta_c(\rho)$ converges to $\Theta_m(\rho)$, for some models. Then, we examined functions that are a finite sum of positively homogeneous functions. These prediction functions can represent an ensemble of neural networks with positive homogeneous activation functions. For this model, we characterized the asymptotic constrained path and margin path solution. This implies a surprising result: ensembles of neural networks will aim to discard the shallowest network. In future work we aim to analyze sums of homogeneous functions with shared variables, such as ResNets.

Second, in Section 4 we focused on homogeneous models. For such models we linked the optimization path to the margin and constrained paths. In particular, we showed that the optimization path converges to stationary points of the constrained path and margin path. In future work, we aim to extend this to non-homogeneous models. In addition, we gave a more refined characterization of the constrained path limit. It will be interesting to find whether this characterization can be further refined to answer whether the weighting of the data points can have any effect on the selection of the asymptotic solution, as Byrd & Lipton (2018) observed empirically that it did not.

### Acknowledgements

The authors are grateful to C. Zeno and N. Merlis for helpful comments on the manuscript. This research was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. SG and NS were partially supported by NSF awards IIS-1302662 and IIS-1764032.

### References

• Ali et al. (2018) Ali, A., Kolter, J. Z., and Tibshirani, R. J. A continuous-time view of early stopping for least squares regression. arXiv preprint arXiv:1810.10082, 2018.
• Byrd & Lipton (2018) Byrd, J. and Lipton, Z. C. Weighted risk minimization & deep learning. arXiv preprint arXiv:1812.03372, 2018.
• Gunasekar et al. (2017) Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6152–6160, 2017.
• Gunasekar et al. (2018a) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. NIPS, 2018a.
• Gunasekar et al. (2018b) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. ICML, 2018b.
• He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
• Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, 2017.
• Ji & Telgarsky (2018) Ji, Z. and Telgarsky, M. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
• Ji & Telgarsky (2019) Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019.
• Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, pp. 1–16, 2017.
• Nacson et al. (2019a) Nacson, M. S., Lee, J., Gunasekar, S., Savarese, P. H., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. AISTATS, 2019a.
• Nacson et al. (2019b) Nacson, M. S., Srebro, N., and Soudry, D. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. AISTATS, 2019b.
• Nar et al. (2019) Nar, K., Ocal, O., Sastry, S. S., and Ramchandran, K. Cross-entropy loss leads to poor margins, 2019.
• Neyshabur et al. (2015a) Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015a.
• Neyshabur et al. (2015b) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015b.
• Neyshabur et al. (2017) Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
• Nocedal & Wright (2006) Nocedal, J. and Wright, S. Numerical optimization. Springer Science, 35(67-68), 2006.
• Rosset et al. (2004a) Rosset, S., Zhu, J., and Hastie, T. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 2004a.
• Rosset et al. (2004b) Rosset, S., Zhu, J., and Hastie, T. J. Margin maximizing loss functions. In Advances in neural information processing systems, pp. 1237–1244, 2004b.
• Soudry et al. (2018a) Soudry, D., Hoffer, E., Shpigel Nacson, M., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. JMLR, 2018a.
• Soudry et al. (2018b) Soudry, D., Hoffer, E., and Srebro, N. The implicit bias of gradient descent on separable data. ICLR, 2018b.
• Suggala et al. (2018) Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pp. 10631–10641, 2018.
• Wei et al. (2018) Wei, C., Lee, J. D., Liu, Q., and Ma, T. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369v1, pp. 1–34, 2018.
• Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
• Wu et al. (2017) Wu, L., Zhu, Z., and E, W. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes. arXiv, 2017.
• Xu et al. (2019) Xu, T., Zhou, Y., Ji, K., and Liang, Y. When will gradient methods converge to max-margin classifier under reLU models?, 2019.
• Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

### Appendix A Proof of Lemma 1

###### Proof.

Let be a solution of the optimization problem in eq. 2. Then, , since otherwise we could have decreased without changing or , and this is impossible, since is strictly monotonically decreasing. Therefore, we cannot decrease below without increasing above . This implies that is a solution of the optimization problem in eq. 3 with . Next, all that is left is to show that eq. 3 has no additional solutions. Suppose by contradiction that there were such solutions . Since they are also minimizers of eq. 3, like , they have the same minimum value . Since they are not solutions of eq. 2, we have . However, this means they are not feasible for eq. 3, and therefore cannot be solutions. ∎

### Appendix B Proof of Claim 1

###### Proof.

Recall that we denoted the set of solutions of eq. 14 as , and recall from eq. 13. To simplify notation we omit the dependency on , i.e., we replace with . Suppose the claim were not correct. Then, there would have existed such that , such that Note that is feasible in both optimization problems (eqs. 13 and 14), since both problems have the same constraints. Moreover, since it must be sub-optimal in comparison to the solution of eq. 13. Therefore, such that for any , . Then we can write (from eq. 14)

 (24)

From Assumption 2 we know that such that a solution of the margin path exists. Therefore, , eq. 11 is feasible. We assume, WLOG, that . This implies that there exists a feasible finite solution to eq. 24 which does not depend on . Therefore, , , and the values of are respectively bounded below the values of , which are independent of . This implies that if we select large enough, we will have . This would contradict the assumption that and therefore minimizes eq. 24. This implies that , such that , we have , which entails the claim. ∎

### Appendix C Proof of Lemma 5

###### Proof.

We assume by contradiction that yet does not have the form of eqs. 12 and 13. Without loss of generality we can write

$$\rho\theta(\rho) = v_k(\rho)\left[\gamma(\rho,\theta(\rho))\right]^{\frac{1}{\alpha_k}}. \tag{25}$$

If , for some , then we could have written, from eqs. 25 and 15,

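$$\rho\theta(\rho) = \left(w_k^* + o(1)\right)\left[\gamma(\rho,\theta(\rho))\right]^{\frac{1}{\alpha_k}} = \left(w_k^* + o(1)\right)\left[\gamma^*(\rho)\right]^{\frac{1}{\alpha_k}}$$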

which contradicts our assumption that does not have the form of eqs. 13 and 14.

Therefore , such that , : . The norm of the solution in eq. 25

$$\rho'^2 = \sum_{k=1}^{K} \left\| v_k(\rho') \right\|^2 \left[\gamma(\rho,\theta(\rho))\right]^{\frac{2}{\alpha_k}},$$

is equal to the norm of the solution with margin

$$\rho'^2 = \sum_{k=1}^{K} \left\| w_k^* + o(1) \right\|^2 \left[\gamma^*(\rho')\right]^{\frac{2}{\alpha_k}}.$$

Therefore, from eq. 15 we have

 K∑k=1∥∥w∗1+o(1)∥∥2[γ(