# On Margin Maximization in Linear and ReLU Networks

The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed. Finally, we answer a question posed in Lyu and Li [2019] by showing that for non-homogeneous networks, the normalized margin may strictly decrease over time.

## 1 Introduction

A central question in the theory of deep learning is how neural networks generalize even when trained without any explicit regularization, and when there are far more learnable parameters than training examples. In such optimization problems there are many solutions that label the training data correctly, and gradient descent seems to prefer solutions that generalize well (Zhang et al., 2016). Hence, it is believed that gradient descent induces an implicit bias (Neyshabur et al., 2014, 2017), and characterizing this bias has been a subject of extensive research in recent years.

A main focus in the theoretical study of implicit bias is on homogeneous neural networks. These are networks where scaling the parameters by any factor $\alpha > 0$ scales the predictions by $\alpha^L$ for some constant $L$. For example, fully-connected and convolutional ReLU networks without bias terms are homogeneous. Lyu and Li (2019) proved that in linear and ReLU homogeneous networks trained with the exponential or the logistic loss, if gradient flow converges to zero loss, then the direction to which the parameters of the network converge can be characterized as a first order stationary point (KKT point) of the maximum margin problem in the parameter space. (They also assumed directional convergence, but Ji and Telgarsky (2020) later showed that this assumption is not required.) The maximum margin problem is the problem of minimizing the $\ell_2$ norm of the parameters under the constraints that each training example is classified correctly with margin at least $1$. They also showed that this KKT point satisfies necessary conditions for optimality. However, the conditions are not known to be sufficient even for local optimality. This is analogous to showing that a method for some unconstrained optimization problem converges to a point with zero gradient, without proving that the point is either a global or a local minimum.

In this work we consider several architectures of homogeneous neural networks with linear and ReLU activations, and study whether the aforementioned KKT point is guaranteed to be a global optimum of the maximum margin problem, a local optimum, or neither. Perhaps surprisingly, our results imply that in many cases, such as depth-2 fully-connected ReLU networks and depth-2 diagonal linear networks, the KKT point may not even be a local optimum of the maximum margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.

We now describe our results in a bit more detail. We denote by $\mathcal{N}$ the class of neural networks without bias terms, where the weights in each layer might have an arbitrary sparsity pattern, and weights might be shared (see Section 2 for the formal definition). The class $\mathcal{N}$ contains, for example, convolutional networks. Moreover, we denote by $\mathcal{N}_{\text{no-share}}$ the subclass of $\mathcal{N}$ that contains only networks without shared weights, such as fully-connected networks and diagonal networks (cf. Gunasekar et al. (2018b); Yun et al. (2020)). We describe our main results below, and also summarize them in Tables 1 and 2.

Fully-connected networks:

• In linear fully-connected networks of any depth the KKT point is a global optimum. (We note that margin maximization for such networks in the predictor space is already known (Ji and Telgarsky, 2020). However, margin maximization in the predictor space does not necessarily imply margin maximization in the parameter space.)

• In fully-connected depth-2 ReLU networks the KKT point may not even be a local optimum. Moreover, this negative result holds with constant probability over the initialization, i.e., there is a training dataset such that gradient flow with random initialization converges with positive probability to the direction of a KKT point which is not a local optimum.

Depth-2 networks in $\mathcal{N}$:

• The positive result on fully-connected linear networks does not extend to networks with sparse weights: In linear diagonal networks the KKT point may not be a local optimum.

• In our proof for the above negative result, the KKT point contains a neuron whose weights vector is zero. However, in practice gradient descent often converges to networks that do not contain such zero neurons. We show that for linear networks in $\mathcal{N}_{\text{no-share}}$, if the KKT point has only non-zero weights vectors, then it is a global optimum. We also show that even for the simple case of depth-2 diagonal linear networks, the optimality of the KKT points can be unexpectedly subtle, in the context of margin maximization in the predictor space (see Remark 4.1).

• For ReLU networks in $\mathcal{N}_{\text{no-share}}$, in order to obtain a positive result we need a stronger assumption. We show that if the KKT point is such that for every input in the dataset the input to every hidden neuron in the network is non-zero, then it is guaranteed to be a local optimum (but not necessarily a global optimum).

• For linear or ReLU convolutional networks, even if the above assumptions hold, the KKT point may not be a local optimum.

Deep networks in $\mathcal{N}$:

• We show that the positive results on depth-2 linear and ReLU networks in $\mathcal{N}_{\text{no-share}}$ (under the assumptions described above) do not apply to deeper networks.

• We study a weaker notion of margin maximization: maximizing the margin for each layer separately. For linear networks of depth $m \geq 2$ in $\mathcal{N}$ (including networks with shared weights), we show that the KKT point is a global optimum of the per-layer maximum margin problem. For ReLU networks the KKT point may not even be a local optimum of this problem, but under the assumption on non-zero inputs to all neurons it is a local optimum.

In the paper, our focus is on understanding what can be guaranteed for the KKT convergence points specified in Lyu and Li (2019). Accordingly, in most of our negative results, the construction assumes some specific initialization of gradient flow, and we do not quantify how likely the resulting KKT points are to be reached under a random initialization. An exception is our negative result for depth-2 fully-connected ReLU networks (Theorem 3.2), which holds with constant probability under reasonable random initializations. Understanding whether this can be extended to the other settings we consider is an interesting problem for future research.

Finally, we consider non-homogeneous networks, for example, networks with skip connections or bias terms. Lyu and Li (2019) showed that a smoothed version of the normalized margin is monotonically increasing when training homogeneous networks. They observed empirically that the normalized margin is monotonically increasing also when training non-homogeneous networks, but did not provide a proof for this phenomenon and left it as an open problem. We give an example for a simple non-homogeneous network where the normalized margin (as well as the smoothed margin) is strictly decreasing (see Theorem 6.1).

Our paper is structured as follows: In Section 2 we provide necessary notations and definitions, and discuss the most relevant prior results. Additional related works are discussed in Appendix A. In Sections 3, 4 and 5 we state our results on fully-connected networks, depth-2 networks in $\mathcal{N}$, and deep networks in $\mathcal{N}$, respectively, and provide some proof ideas. In Section 6 we state our result on non-homogeneous networks. All formal proofs are deferred to Appendix C.

## 2 Preliminaries

##### Notations.

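We use bold-faced letters to denote vectors, e.g., $\mathbf{x} = (x_1, \ldots, x_d)$. For $\mathbf{x} \in \mathbb{R}^d$ we denote by $\|\mathbf{x}\|$ the Euclidean norm. We denote by $\mathbb{1}[\cdot]$ the indicator function, for example $\mathbb{1}[t \geq 0]$ equals $1$ if $t \geq 0$ and $0$ otherwise. For an integer $d \geq 1$ we denote $[d] = \{1, \ldots, d\}$.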

##### Neural networks.

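A fully-connected neural network $\Phi$ of depth $m \geq 2$ is parameterized by a collection $\theta = [W^{(l)}]_{l=1}^{m}$ of weight matrices, such that for every layer $l \in [m]$ we have $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$. Thus, $d_l$ denotes the number of neurons in the $l$-th layer (i.e., the width of the layer). We assume that $d_m = 1$ and denote by $d := d_0$ the input dimension. The neurons in layers $[m-1]$ are called hidden neurons. A fully-connected network computes a function $\Phi(\theta; \cdot): \mathbb{R}^d \to \mathbb{R}$ defined recursively as follows. For an input $\mathbf{x} \in \mathbb{R}^d$ we set $\mathbf{h}_0 = \mathbf{x}$, and define for every $l \in [m-1]$ the input to the $l$-th layer as $\mathbf{z}_l = W^{(l)} \mathbf{h}_{l-1}$, and the output of the $l$-th layer as $\mathbf{h}_l = \sigma(\mathbf{z}_l)$, where $\sigma: \mathbb{R} \to \mathbb{R}$ is an activation function that acts coordinate-wise on vectors. Then, we define $\Phi(\theta; \mathbf{x}) = W^{(m)} \mathbf{h}_{m-1}$. Thus, there is no activation function in the output neuron. When considering depth-2 fully-connected networks we often use a parameterization $\theta = [\mathbf{w}_1, \ldots, \mathbf{w}_k, \mathbf{v}]$ where $\mathbf{w}_1, \ldots, \mathbf{w}_k$ are the weights vectors of the $k$ hidden neurons (i.e., correspond to the rows of the first layer's weight matrix) and $\mathbf{v}$ are the weights of the second layer.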

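We also consider neural networks where some weights can be missing or shared. We define a class $\mathcal{N}$ of networks that may contain sparse and shared weights as follows. A network $\Phi$ in $\mathcal{N}$ is parameterized by $\theta = [\mathbf{u}^{(l)}]_{l=1}^{m}$ where $m$ is the depth of $\Phi$, and $\mathbf{u}^{(l)} \in \mathbb{R}^{p_l}$ are the parameters of the $l$-th layer. We denote by $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ the weight matrix of the $l$-th layer. The matrix $W^{(l)}$ is described by the vector $\mathbf{u}^{(l)}$, and a function $g^{(l)}: [d_l] \times [d_{l-1}] \to [p_l] \cup \{0\}$ as follows: $W^{(l)}_{ij} = 0$ if $g^{(l)}(i,j) = 0$, and $W^{(l)}_{ij} = u^{(l)}_k$ if $g^{(l)}(i,j) = k > 0$. Thus, the function $g^{(l)}$ represents the sparsity and weight-sharing pattern of the $l$-th layer, and the dimension $p_l$ of $\mathbf{u}^{(l)}$ is the number of free parameters in the layer. We denote by $d := d_0$ the input dimension of the network and assume that the output dimension $d_m$ is $1$. The function $\Phi(\theta; \cdot): \mathbb{R}^d \to \mathbb{R}$ computed by the neural network is defined recursively by the weight matrices as in the case of fully-connected networks. For example, convolutional neural networks are in $\mathcal{N}$. Note that the networks in $\mathcal{N}$ do not have bias terms and do not allow weight sharing between different layers. Moreover, we define a subclass $\mathcal{N}_{\text{no-share}}$ of $\mathcal{N}$ that contains networks without shared weights. Formally, a network is in $\mathcal{N}_{\text{no-share}}$ if for every layer $l$ and every $k \in [p_l]$ there is at most one pair $(i,j)$ such that $g^{(l)}(i,j) = k$. Thus, networks in $\mathcal{N}_{\text{no-share}}$ might have sparse weights, but do not allow shared weights. For example, diagonal networks (defined below) and fully-connected networks are in $\mathcal{N}_{\text{no-share}}$.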

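A diagonal neural network is a network in $\mathcal{N}_{\text{no-share}}$ such that the weight matrix of each layer is diagonal, except for the last layer. Thus, the network is parameterized by $\theta = [\mathbf{w}_1, \ldots, \mathbf{w}_m]$ where $\mathbf{w}_l \in \mathbb{R}^d$ for all $l \in [m]$, and it computes a function $\Phi(\theta; \cdot): \mathbb{R}^d \to \mathbb{R}$ defined recursively as follows. For an input $\mathbf{x} \in \mathbb{R}^d$ set $\mathbf{h}_0 = \mathbf{x}$. For $l \in [m-1]$, the output of the $l$-th layer is $\mathbf{h}_l = \sigma(\operatorname{diag}(\mathbf{w}_l)\, \mathbf{h}_{l-1})$. Then, we have $\Phi(\theta; \mathbf{x}) = \mathbf{w}_m^\top \mathbf{h}_{m-1}$.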
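The recursive definition of a diagonal network can be sketched in a few lines of NumPy. This is only an illustration: the function name `diagonal_network` and the representation of the parameters as a list of vectors are assumptions made here, not notation from the paper.

```python
import numpy as np

def relu(z):
    """ReLU activation, applied coordinate-wise."""
    return np.maximum(z, 0.0)

def diagonal_network(weights, x, activation=None):
    """Forward pass of a diagonal network without bias terms.

    weights: list of m vectors in R^d (hypothetical representation).
    activation: coordinate-wise activation; None means the linear activation.
    """
    act = activation if activation is not None else (lambda z: z)
    h = np.asarray(x, dtype=float)
    # Hidden layers: multiplying by a diagonal matrix is an entry-wise product.
    for w in weights[:-1]:
        h = act(np.asarray(w, dtype=float) * h)
    # Output layer: inner product with the last weight vector, no activation.
    return float(np.dot(weights[-1], h))

linear_out = diagonal_network([np.array([1.0, 2.0]), np.array([3.0, 4.0])], [1.0, 1.0])
relu_out = diagonal_network([np.array([-1.0, 2.0]), np.array([3.0, 4.0])], [1.0, 1.0], relu)
```

For depth 2 with the linear activation, the computed function reduces to the inner product of the last weight vector with the entry-wise product of the first weight vector and the input.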

In all the above definitions the parameters $\theta$ of the neural networks are given by a collection of matrices or vectors. We often view $\theta$ as the vector obtained by concatenating the matrices or vectors in the collection. Thus, $\|\theta\|$ denotes the $\ell_2$ norm of the vector $\theta$.

The ReLU activation function is defined by $\sigma(z) = \max\{0, z\}$, and the linear activation is $\sigma(z) = z$. In this work we focus on ReLU networks (i.e., networks where all neurons have the ReLU activation) and on linear networks (where all neurons have the linear activation). We say that a network $\Phi$ is homogeneous if there exists $L > 0$ such that for every $\alpha > 0$ and $\theta, \mathbf{x}$ we have $\Phi(\alpha\theta; \mathbf{x}) = \alpha^L \Phi(\theta; \mathbf{x})$. Note that in our definition of the class $\mathcal{N}$ we do not allow bias terms, and hence all linear and ReLU networks in $\mathcal{N}$ are homogeneous, with $L$ equal to the depth of the network. With the exception of Section 6 which studies non-homogeneous networks, all networks considered in this work are homogeneous.
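The homogeneity property can be checked numerically on a small bias-free ReLU network. The following is a minimal sketch under assumed layer sizes and a random seed chosen here for illustration; `fc_relu_network` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def fc_relu_network(mats, x):
    """Bias-free fully-connected ReLU network; mats is a list of weight matrices."""
    h = np.asarray(x, dtype=float)
    for W in mats[:-1]:
        h = np.maximum(W @ h, 0.0)      # hidden layers use ReLU
    return (mats[-1] @ h).item()         # output neuron has no activation

rng = np.random.default_rng(0)
# Depth-3 network with widths 3 -> 4 -> 5 -> 1 (arbitrary illustrative sizes).
mats = [rng.standard_normal((4, 3)),
        rng.standard_normal((5, 4)),
        rng.standard_normal((1, 5))]
x = rng.standard_normal(3)

alpha = 1.7
depth = len(mats)
scaled_output = fc_relu_network([alpha * W for W in mats], x)
expected_output = alpha ** depth * fc_relu_network(mats, x)
```

Scaling every weight matrix by a positive factor scales the output by that factor raised to the depth, because ReLU commutes with positive scaling. Adding bias terms would break this identity.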

##### Optimization problem and gradient flow.

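Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^d \times \{-1, 1\}$ be a binary classification training dataset. Let $\Phi(\theta; \cdot): \mathbb{R}^d \to \mathbb{R}$ be a neural network parameterized by $\theta$. For a loss function $\ell: \mathbb{R} \to \mathbb{R}$ the empirical loss of $\Phi(\theta; \cdot)$ on the dataset is

$$L(\theta) := \sum_{i=1}^{n} \ell\bigl(y_i \Phi(\theta; \mathbf{x}_i)\bigr). \tag{1}$$

We focus on the exponential loss $\ell(q) = e^{-q}$ and the logistic loss $\ell(q) = \log(1 + e^{-q})$.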

We consider gradient flow on the objective given in Eq. 1. This setting captures the behavior of gradient descent with an infinitesimally small step size. Let $\theta(t)$ be the trajectory of gradient flow. Starting from an initial point $\theta(0)$, the dynamics of $\theta(t)$ is given by the differential equation $\frac{d\theta(t)}{dt} = -\nabla L(\theta(t))$. Note that the ReLU function is not differentiable at $0$. Practical implementations of gradient methods define the derivative $\sigma'(0)$ to be some constant in $[0, 1]$. We note that the exact value of $\sigma'(0)$ has no effect on our results.
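Gradient flow itself is a continuous-time object, but its behavior can be approximated by gradient descent with a small step size (a forward-Euler discretization). The sketch below does this for the exponential loss with a plain linear predictor, chosen here only to keep the gradient explicit; the dataset, step size, and function names are illustrative assumptions.

```python
import numpy as np

def exp_loss(theta, X, y):
    """Empirical exponential loss: sum_i exp(-y_i * <theta, x_i>)."""
    return float(np.sum(np.exp(-y * (X @ theta))))

def exp_loss_grad(theta, X, y):
    """Gradient of exp_loss with respect to theta."""
    coeffs = -y * np.exp(-y * (X @ theta))   # one coefficient per example
    return X.T @ coeffs

def euler_gradient_flow(theta0, X, y, step=1e-2, num_steps=5000):
    """Forward-Euler approximation of d theta / dt = -grad L(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta -= step * exp_loss_grad(theta, X, y)
    return theta

# Toy separable dataset (an assumption for illustration).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
theta0 = np.zeros(2)
theta_T = euler_gradient_flow(theta0, X, y)
direction = theta_T / np.linalg.norm(theta_T)
```

On this toy dataset the loss decreases toward zero while the parameter norm keeps growing, so only the direction of the parameters stabilizes. This is why the convergence results below are stated in terms of directional convergence.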

##### Convergence to a KKT point of the maximum-margin problem.

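We say that a trajectory $\theta(t)$ converges in direction to $\tilde{\theta}$ if $\lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|} = \frac{\tilde{\theta}}{\|\tilde{\theta}\|}$. Throughout this work we use the following theorem: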

###### Theorem 2.1 (Paraphrased from Lyu and Li (2019); Ji and Telgarsky (2020)).

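Let $\Phi$ be a homogeneous linear or ReLU neural network. Consider minimizing either the exponential or the logistic loss over a binary classification dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < 1$, namely, $\Phi(\theta(t_0); \cdot)$ classifies every $\mathbf{x}_i$ correctly. Then, gradient flow converges in direction to a first order stationary point (KKT point) of the following maximum margin problem in parameter space:

$$\min_{\theta} \frac{1}{2} \|\theta\|^2 \quad \text{s.t.} \quad \forall i \in [n] \;\; y_i \Phi(\theta; \mathbf{x}_i) \geq 1. \tag{2}$$

Moreover, $L(\theta(t)) \to 0$ and $\|\theta(t)\| \to \infty$ as $t \to \infty$.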

In the case of ReLU networks, Problem 2 is non-smooth. Hence, the KKT conditions are defined using Clarke's subdifferential, which is a generalization of the differential for non-differentiable functions. See Appendix B for a formal definition. We note that Lyu and Li (2019) proved the above theorem under the assumption that $\theta(t)$ converges in direction, and Ji and Telgarsky (2020) showed that such a directional convergence occurs and hence this assumption is not required.
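For orientation, the KKT conditions of Problem 2 in the smooth case (e.g., linear networks) take the standard form below. This is a sketch of the textbook conditions, with multipliers $\lambda_i$ introduced here as notation; the formal definition used in the paper is the one in Appendix B.

```latex
% A point \tilde{\theta} is a KKT point of Problem 2 if there exist
% multipliers \lambda_1, ..., \lambda_n such that:
\begin{align*}
&\tilde{\theta} = \sum_{i=1}^{n} \lambda_i \, y_i \, \nabla_{\theta} \Phi(\tilde{\theta}; \mathbf{x}_i)
  && \text{(stationarity)} \\
&y_i \Phi(\tilde{\theta}; \mathbf{x}_i) \geq 1 \quad \forall i \in [n]
  && \text{(primal feasibility)} \\
&\lambda_i \geq 0 \quad \forall i \in [n]
  && \text{(dual feasibility)} \\
&\lambda_i \bigl( y_i \Phi(\tilde{\theta}; \mathbf{x}_i) - 1 \bigr) = 0 \quad \forall i \in [n]
  && \text{(complementary slackness)}
\end{align*}
```

For ReLU networks the gradient in the stationarity condition is replaced by an element of Clarke's subdifferential of $\Phi(\cdot\,; \mathbf{x}_i)$ at $\tilde{\theta}$.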

Lyu and Li (2019) also showed that the KKT conditions of Problem 2 are necessary for optimality. In convex optimization problems, necessary KKT conditions are also sufficient for global optimality. However, the constraints in Problem 2 are highly non-convex. Moreover, the standard method for proving that necessary KKT conditions are sufficient for local optimality is to show that the KKT point satisfies certain second order sufficient conditions (SOSC) (cf. Ruszczynski (2011)). However, even when $\Phi$ is a linear neural network it is not known when such conditions hold. Thus, the KKT conditions of Problem 2 are not known to be sufficient even for local optimality.

A linear network with weight matrices $W^{(1)}, \ldots, W^{(m)}$ computes a linear predictor $\mathbf{x} \mapsto \langle \boldsymbol{\beta}, \mathbf{x} \rangle$ where $\boldsymbol{\beta}^\top = W^{(m)} \cdots W^{(1)}$. Some prior works studied the implicit bias of linear networks in the predictor space, namely, characterizing the vector $\boldsymbol{\beta}$ from the aforementioned linear predictor. Gunasekar et al. (2018b) studied the implications of margin maximization in the parameter space on the implicit bias in the predictor space. They showed that minimizing $\|\theta\|$ (under the constraints in Problem 2) implies: (1) Minimizing $\|\boldsymbol{\beta}\|_2$ for fully-connected linear networks; (2) Minimizing $\|\boldsymbol{\beta}\|_{2/m}$ for diagonal linear networks of depth $m$; (3) Minimizing $\|\hat{\boldsymbol{\beta}}\|_{2/m}$ for linear convolutional networks of depth $m$ with full-dimensional convolutional filters, where $\hat{\boldsymbol{\beta}}$ are the Fourier coefficients of $\boldsymbol{\beta}$. However, these implications may not hold if gradient flow converges to a KKT point which is not a global optimum of Problem 2.

For some classes of linear networks, positive results were obtained directly in the predictor space, without assuming convergence to a global optimum of Problem 2 in the parameter space. Most notably, for fully-connected linear networks (of any depth), Ji and Telgarsky (2020) showed that under the assumptions of Theorem 2.1, gradient flow maximizes the margin in the predictor space. Note that margin maximization in the predictor space does not necessarily imply margin maximization in the parameter space. Moreover, some results on the implicit bias in the predictor space of linear convolutional networks with full-dimensional convolutional filters are given in Gunasekar et al. (2018b). However, the architecture and set of assumptions are different than what we focus on.

## 3 Fully-connected networks

First, we show that fully-connected linear networks of any depth converge in direction to a global optimum of Problem 2.

###### Theorem 3.1.

Let $m \geq 2$ and let $\Phi$ be a depth-$m$ fully-connected linear network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < 1$. Then, gradient flow converges in direction to a global optimum of Problem 2.

###### Proof idea (for the complete proof see Appendix C.2).

Building on results from Ji and Telgarsky (2020) and Du et al. (2018), we show that gradient flow converges in direction to a KKT point such that for every we have , where and are unit vectors (with ). Also, we have . Then, we show that every that satisfies these properties, and satisfies the constraints of Problem 2, is a global optimum. Intuitively, the most “efficient” way (in terms of minimizing the parameters) to achieve margin with a linear fully-connected network, is by using a network such that the direction of its corresponding linear predictor maximizes the margin, the layers are balanced (i.e., have equal norms), and the weight matrices of the layers are aligned. ∎

We now prove that the positive result in Theorem 3.1 does not apply to ReLU networks. We show that in depth-2 fully-connected ReLU networks gradient flow might converge in direction to a KKT point of Problem 2 which is not even a local optimum. Moreover, this occurs under conditions that hold with constant probability over reasonable random initializations.

###### Theorem 3.2.

Let be a depth- fully-connected ReLU network with input dimension and two hidden neurons. Namely, for and we have . Consider minimizing either the exponential or the logistic loss using gradient flow. Consider the dataset where , , and . Assume that the initialization is such that for every we have and . Thus, the first hidden neuron is active for both inputs, and the second hidden neuron is not active. Also, assume that . Then, gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum.

###### Proof idea (for the complete proof see Appendix C.3).

By analyzing the dynamics of gradient flow on the given dataset, we show that it converges to zero loss, and converges in direction to a KKT point such that , , , and . Note that and since remain constant during the training and . See Figure 1 for an illustration. Then, we show that for every there exists some such that , satisfies for every , and . Such is obtained from by slightly changing , , and . Thus, by using the second hidden neuron, which is not active in , we can obtain a solution with smaller norm. ∎

We note that the assumption on the initialization in the above theorem holds with constant probability for standard initialization schemes (e.g., Xavier initialization).

###### Remark 3.1 (Unbounded sub-optimality).

By choosing appropriate inputs in the setting of Theorem 3.2, it is not hard to show that the sub-optimality of the KKT point w.r.t. the global optimum can be arbitrarily large. Namely, for every large we can choose a dataset where the angle between and is sufficiently close to , such that , where is a KKT point to which gradient flow converges, and is a global optimum of Problem 2. Indeed, as illustrated in Figure 1, if one neuron is active on both inputs and the other neuron is not active on any input, then the active neuron needs to be very large in order to achieve margin , while if each neuron is active on a single input then we can achieve margin with much smaller parameters. We note that such unbounded sub-optimality can be obtained also in other negative results in this work (in Theorems 4.1, 4.3, 4.4 and 5.4).

###### Remark 3.2 (Robustness to small perturbations).

Theorem 3.2 holds even if we slightly perturb the inputs . Thus, it is not sensitive to small changes in the dataset. We note that such robustness to small perturbations can be shown also for the negative results in Theorems 4.1, 4.3, 5.1 and 5.4.

## 4 Depth-2 networks in N

In this section we study depth-2 linear and ReLU networks in $\mathcal{N}$. We first show that already for linear networks in $\mathcal{N}_{\text{no-share}}$ (more specifically, for diagonal networks) gradient flow may not converge even to a local optimum.

###### Theorem 4.1.

Let $\Phi$ be a depth-2 linear or ReLU diagonal neural network parameterized by $\theta = [\mathbf{w}_1, \mathbf{w}_2]$. Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset $\{(\mathbf{x}, y)\}$ of size $1$ and an initialization $\theta(0)$, such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum.

###### Proof idea (for the complete proof see Appendix C.4).

Let and . Let such that . Recalling that the diagonal network computes the function (where is the entry-wise product), we see that the second coordinate remains inactive during training. It is not hard to show that gradient flow converges in direction to the KKT point with . However, it is not a local optimum, since for every small the parameters with satisfy the constraints of Problem 2, and we have . ∎
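The phenomenon behind this proof idea can be illustrated numerically. The sketch below uses a hypothetical single-example dataset $\mathbf{x} = (1, 2)$, $y = 1$ and a hypothetical candidate point, both chosen here for illustration and not necessarily the ones used in the paper. It checks that the candidate satisfies the KKT stationarity condition of Problem 2 for a depth-2 linear diagonal network, and that arbitrarily close feasible points have strictly smaller norm. It does not simulate gradient flow.

```python
import numpy as np

# Hypothetical single-example dataset (an assumption for illustration).
x = np.array([1.0, 2.0])
y = 1.0

def phi(w1, w2, x):
    """Depth-2 linear diagonal network: <w2, w1 * x> (entry-wise product)."""
    return float(np.dot(w2, w1 * x))

def sq_norm(w1, w2):
    """Squared Euclidean norm of the concatenated parameters."""
    return float(np.dot(w1, w1) + np.dot(w2, w2))

# Candidate KKT point: the second coordinate is inactive in both layers.
w1 = np.array([1.0, 0.0])
w2 = np.array([1.0, 0.0])

def kkt_residual(w1, w2, lam=1.0):
    """Norm of theta - lam * y * grad_theta phi(theta; x)."""
    theta = np.concatenate([w1, w2])
    grad = np.concatenate([w2 * x, w1 * x])
    return float(np.linalg.norm(theta - lam * y * grad))

def perturbed(eps):
    """Nearby point that shifts a little weight onto the second coordinate."""
    w = np.array([np.sqrt(1.0 - eps), np.sqrt(0.75 * eps)])
    return w, w.copy()

for eps in (1e-1, 1e-2, 1e-3):
    a, b = perturbed(eps)
    print(eps, y * phi(a, b, x), sq_norm(a, b))
```

With these choices the candidate has margin exactly $1$ and squared norm $2$, while each perturbed point has margin $1 + \varepsilon/2$ and squared norm $2 - \varepsilon/2$. The perturbed points approach the candidate as $\varepsilon \to 0$, so the candidate cannot be a local optimum even though it satisfies stationarity with multiplier $1$.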

By Theorem 3.2 fully-connected ReLU networks may not converge to a local optimum, and by Theorem 4.1 linear (and ReLU) networks with sparse weights may not converge to a local optimum. In the proofs of both of these negative results, gradient flow converges in direction to a KKT point such that one of the weights vectors of the hidden neurons is zero. However, in practice gradient descent often converges to a network that does not contain such disconnected neurons. Hence, a natural question is whether the negative results hold also in networks that do not contain neurons whose weights vector is zero. In the following theorem we show that in linear networks such an assumption allows us to obtain a positive result. Namely, in depth-2 linear networks in $\mathcal{N}_{\text{no-share}}$, if gradient flow converges in direction to a KKT point of Problem 2 that satisfies this condition, then it is guaranteed to be a global optimum. However, we also show that in ReLU networks assuming that all neurons have non-zero weights is not sufficient.

###### Theorem 4.2.

We have:

1. Let $\Phi$ be a depth-2 linear neural network in $\mathcal{N}_{\text{no-share}}$ parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < 1$, and let $\tilde{\theta}$ be the KKT point of Problem 2 such that $\theta(t)$ converges to $\tilde{\theta}$ in direction (such $\tilde{\theta}$ exists by Theorem 2.1). Assume that in the network parameterized by $\tilde{\theta}$ all hidden neurons have non-zero incoming weights vectors. Then, $\tilde{\theta}$ is a global optimum of Problem 2.

2. Let be a fully-connected depth- ReLU network with input dimension and hidden neurons parameterized by . Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization , such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2, which is not a local optimum, and in the network parameterized by all hidden neurons have non-zero incoming weights.

###### Proof idea (for the complete proof see Appendix C.5).

We give here the proof idea for part (1). Let $k$ be the width of the network. For every $j \in [k]$ we denote by $\mathbf{w}_j$ the incoming weights vector to the $j$-th hidden neuron, and by $v_j$ the outgoing weight. Let $\mathbf{u}_j = v_j \mathbf{w}_j$. We consider an optimization problem over the variables $\mathbf{u}_1, \ldots, \mathbf{u}_k$ where the objective is to minimize $\sum_{j \in [k]} \|\mathbf{u}_j\|$ and the constraints correspond to the constraints of Problem 2. Let $\tilde{\theta}$ be the KKT point of Problem 2 to which gradient flow converges in direction. For every $j \in [k]$ we denote $\tilde{\mathbf{u}}_j = \tilde{v}_j \tilde{\mathbf{w}}_j$. We show that $\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_k$ satisfy the KKT conditions of the aforementioned problem. Since the objective there is convex and the constraints are affine, this point is a global optimum of that problem. Finally, we show that this implies global optimality of $\tilde{\theta}$. ∎

###### Remark 4.1 (Implications on margin maximization in the predictor space for diagonal linear networks).

Theorems 4.1 and 4.2 imply analogous results on diagonal linear networks also in the predictor space. As we discussed in Section 2, Gunasekar et al. (2018b) showed that in depth-2 diagonal linear networks, minimizing $\|\theta\|$ under the constraints in Problem 2 implies minimizing $\|\boldsymbol{\beta}\|_1$, where $\boldsymbol{\beta}$ is the corresponding linear predictor. Theorem 4.1 can be easily extended to the predictor space, namely, gradient flow on depth-2 linear diagonal networks might converge to a KKT point of Problem 2, such that the corresponding linear predictor is not a local optimum of the following problem:

$$\operatorname*{argmin}_{\boldsymbol{\beta}} \|\boldsymbol{\beta}\|_1 \quad \text{s.t.} \quad \forall i \in [n] \;\; y_i \langle \boldsymbol{\beta}, \mathbf{x}_i \rangle \geq 1. \tag{3}$$

Moreover, by combining part (1) of Theorem 4.2 with the result from Gunasekar et al. (2018b), we deduce that if gradient flow on a depth-2 diagonal linear network converges in direction to a KKT point of Problem 2 with non-zero weights vectors, then the corresponding linear predictor is a global optimum of Problem 3.
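The link between the parameter norm and the $\ell_1$ norm of the predictor in depth-2 diagonal linear networks follows from an elementary inequality. The short derivation below is a sketch of this standard argument, writing $\boldsymbol{\beta} = \mathbf{w}_1 \circ \mathbf{w}_2$ for the predictor computed by parameters $\theta = [\mathbf{w}_1, \mathbf{w}_2]$; the coordinate notation $w_{1,j}, w_{2,j}$ is introduced here for illustration.

```latex
\begin{align*}
\|\theta\|^2
  &= \sum_{j=1}^{d} \bigl( w_{1,j}^2 + w_{2,j}^2 \bigr)
   \;\geq\; \sum_{j=1}^{d} 2\,|w_{1,j}|\,|w_{2,j}|
   \;=\; 2 \sum_{j=1}^{d} |\beta_j|
   \;=\; 2\,\|\boldsymbol{\beta}\|_1 ,
\end{align*}
% with equality if and only if |w_{1,j}| = |w_{2,j}| for every coordinate j.
% Hence, among all parameters realizing a fixed predictor \beta,
% the smallest value of (1/2) * ||theta||^2 is exactly ||beta||_1.
```

Since the constraints of Problem 2 depend on $\theta$ only through $\boldsymbol{\beta}$, a global optimum of Problem 2 yields a predictor that solves Problem 3. The same conclusion need not hold at a KKT point that is not a global optimum.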

By part (2) of Theorem 4.2, assuming that gradient flow converges to a network without zero neurons is not sufficient for obtaining a positive result in the case of ReLU networks. Hence, we now consider a stronger assumption, namely, that the KKT point $\tilde{\theta}$ is such that for every $\mathbf{x}_i$ in the dataset the inputs to all hidden neurons in the computation $\Phi(\tilde{\theta}; \mathbf{x}_i)$ are non-zero. In the following theorem we show that in depth-2 ReLU networks, if the KKT point satisfies this condition then it is guaranteed to be a local optimum of Problem 2. However, even under this condition it is not necessarily a global optimum. The proof is given in Appendix C.6 and uses ideas from the previous proofs, with some required modifications.

###### Theorem 4.3.

Let $\Phi$ be a depth-2 ReLU network in $\mathcal{N}_{\text{no-share}}$ parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < 1$, and let $\tilde{\theta}$ be the KKT point of Problem 2 such that $\theta(t)$ converges to $\tilde{\theta}$ in direction (such $\tilde{\theta}$ exists by Theorem 2.1). Assume that for every $i \in [n]$ the inputs to all hidden neurons in the computation $\Phi(\tilde{\theta}; \mathbf{x}_i)$ are non-zero. Then, $\tilde{\theta}$ is a local optimum of Problem 2. However, it may not be a global optimum, even if the network is fully connected.

Note that in all the above theorems we do not allow shared weights. We now consider the case of depth-2 linear or ReLU networks in $\mathcal{N}$, where the first layer is convolutional with disjoint patches (and hence has shared weights), and show that gradient flow does not always converge in direction to a local optimum, even when the inputs to all hidden neurons are non-zero (and hence there are no zero weights vectors).

###### Theorem 4.4.

Let be a depth- linear or ReLU network in , parameterized by for , such that for we have where and . Thus, is a convolutional network with two disjoint patches. Consider minimizing either the exponential or the logistic loss using gradient flow. Then, there exists a dataset of size , and an initialization , such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum. Moreover, we have for .

###### Proof idea (for the complete proof see Appendix C.7).

Let and . Let such that and . Since and are symmetric w.r.t. , and does not break this symmetry, then keeps its direction throughout the training. Thus, we show that gradient flow converges in direction to a KKT point where and . Then, we show that it is not a local optimum, since for every small the parameters with and satisfy the constraints of Problem 2, and we have . ∎

## 5 Deep networks in N

In this section we study the more general case of depth-$m$ neural networks in $\mathcal{N}$, where $m \geq 2$. First, we show that for networks of depth at least $3$ in $\mathcal{N}_{\text{no-share}}$, gradient flow may not converge to a local optimum of Problem 2, for both linear and ReLU networks, and even when there are no zero weights vectors and the inputs to all hidden neurons are non-zero. More precisely, we prove this claim for diagonal networks.

###### Theorem 5.1.

Let . Let be a depth- linear or ReLU diagonal neural network parameterized by . Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset of size and an initialization , such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum. Moreover, all inputs to neurons in the computation are non-zero.

###### Proof idea (for the complete proof see Appendix C.8).

Let and . Consider the initialization where for every . We show that gradient flow converges in direction to a KKT point such that for all . Then, we consider the parameters such that for every we have , and show that if is sufficiently small, then satisfies the constraints in Problem 2 and we have . ∎

Note that in the case of linear networks, the above result is in contrast to networks with sparse weights of depth $2$ that converge to a global optimum by Theorem 4.2, and to fully-connected networks of any depth that converge to a global optimum by Theorem 3.1. In the case of ReLU networks, the above result is in contrast to the case of depth-2 networks studied in Theorem 4.3, where gradient flow is guaranteed to converge to a local optimum.

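In light of our negative results, we now consider a weaker notion of margin maximization, namely, maximizing the margin for each layer separately. Let $\Phi$ be a neural network of depth $m$ in $\mathcal{N}$, parameterized by $\theta = [\mathbf{u}^{(l)}]_{l=1}^{m}$. The maximum margin problem for a layer $l_0 \in [m]$ w.r.t. $\theta_0 = [\mathbf{u}_0^{(l)}]_{l=1}^{m}$ is the following:

$$\min_{\mathbf{u}^{(l_0)}} \frac{1}{2} \bigl\| \mathbf{u}^{(l_0)} \bigr\|^2 \quad \text{s.t.} \quad \forall i \in [n] \;\; y_i \Phi(\theta'; \mathbf{x}_i) \geq 1, \tag{4}$$

where $\theta' = [\mathbf{u}_0^{(1)}, \ldots, \mathbf{u}_0^{(l_0 - 1)}, \mathbf{u}^{(l_0)}, \mathbf{u}_0^{(l_0 + 1)}, \ldots, \mathbf{u}_0^{(m)}]$. We show a positive result for linear networks: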

###### Theorem 5.2.

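Let $m \geq 2$. Let $\Phi$ be any depth-$m$ linear neural network in $\mathcal{N}$, parameterized by $\theta = [\mathbf{u}^{(l)}]_{l=1}^{m}$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < 1$. Then, gradient flow converges in direction to a KKT point $\tilde{\theta} = [\tilde{\mathbf{u}}^{(l)}]_{l=1}^{m}$ of Problem 2, such that for every layer $l \in [m]$ the parameters vector $\tilde{\mathbf{u}}^{(l)}$ is a global optimum of Problem 4 w.r.t. $\tilde{\theta}$.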

The theorem follows easily by noticing that if $\Phi$ is a linear network, then the constraints in Problem 4 are affine, and its KKT conditions are implied by the KKT conditions of Problem 2. See Appendix C.9 for the formal proof. Note that by Theorems 4.1, 4.4 and 5.1, linear networks in $\mathcal{N}$ might converge in direction to a KKT point $\tilde{\theta}$, which is not a local optimum of Problem 2. However, Theorem 5.2 implies that each layer in $\tilde{\theta}$ is a global optimum of Problem 4. Hence, any improvement to $\tilde{\theta}$ requires changing at least two layers simultaneously.

While in linear networks gradient flow maximizes the margin for each layer separately, in the following theorem (which we prove in Appendix C.10) we show that this claim does not hold for ReLU networks: Already for fully-connected networks of depth 2, gradient flow may not converge in direction to a local optimum of Problem 4.

###### Theorem 5.3.

Let be a fully-connected depth- ReLU network with input dimension and hidden neurons parameterized by . Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2, such that the weights of the first layer are not a local optimum of Problem 4 w.r.t. .

Finally, we show that in ReLU networks in of any depth, if the KKT point to which gradient flow converges in direction is such that the inputs to hidden neurons are non-zero, then it must be a local optimum of Problem 4 (but not necessarily a global optimum). The proof follows the ideas from the proof of Theorem 5.2, with some required modifications, and is given in Appendix C.11.

###### Theorem 5.4.

Let . Let be any depth- ReLU network in parameterized by . Consider minimizing either the exponential or the logistic loss over a dataset using gradient flow, and assume that there exists time such that . Let be the KKT point of Problem 2 such that converges to in direction (such exists by Theorem 2.1). Let and assume that for every the inputs to all neurons in layers in the computation are non-zero. Then, the parameters vector is a local optimum of Problem 4 w.r.t. . However, it may not be a global optimum.

## 6 Non-homogeneous networks

We define the normalized margin as follows:

$$\bar{\gamma}(\theta) := \min_{i \in [n]} y_i \Phi\left(\frac{\theta}{\|\theta\|}; x_i\right)~.$$

If is homogeneous then maximizing the normalized margin is equivalent to solving Problem 2, i.e., minimizing under the constraints (cf. Lyu and Li (2019)). In this section we study the normalized margin in non-homogeneous networks.
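As a concrete illustration of the definition, the sketch below computes the normalized margin of a small bias-free two-layer ReLU network, which is homogeneous of degree 2 in its parameters. The network, the weights, and the two-point dataset are hypothetical and chosen only for the example.

```python
import numpy as np

def relu_net(W1, w2, x):
    """Bias-free two-layer ReLU network; homogeneous of degree 2 in (W1, w2)."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def normalized_margin(W1, w2, X, y):
    """min_i y_i * Phi(theta / ||theta||; x_i) for theta = (W1, w2)."""
    norm = np.sqrt(np.sum(W1 ** 2) + np.sum(w2 ** 2))
    outputs = np.array([relu_net(W1 / norm, w2 / norm, x) for x in X])
    return np.min(y * outputs)

# Hypothetical weights and dataset.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
w2 = np.array([1.0, -0.5])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])

gamma = normalized_margin(W1, w2, X, y)
gamma_scaled = normalized_margin(3.7 * W1, 3.7 * w2, X, y)  # same direction, larger norm
```

The two values coincide because the normalized margin depends only on the direction $\theta / \|\theta\|$. For a homogeneous network this is why maximizing it amounts to minimizing the norm subject to the margin constraints.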

Lyu and Li (2019) showed, under the assumptions of Theorem 2.1, that a smoothed version of the normalized margin is monotonically increasing when training homogeneous networks. More precisely, there is a function which is an -additive approximation of , such that is monotonically non-decreasing. This result holds only for homogeneous networks. Hence, it does not apply to networks with bias terms or skip connections. Lyu and Li (2019) observed empirically that the normalized margin is monotonically increasing also when training non-homogeneous networks. However, they did not provide a rigorous proof for this phenomenon and left it as an open problem. The experiments in Lyu and Li (2019) train convolutional neural networks (CNNs) with bias terms on MNIST.

In the following theorem we give an example of a simple non-homogeneous linear network for which the normalized margin is monotonically decreasing during training. This example implies that some additional assumptions are needed in order to rigorously prove the empirical phenomenon observed in non-homogeneous networks.

###### Theorem 6.1.

Let be a depth- linear network with input dimension , width and a skip connection. Namely, is parameterized by where , and we have . Consider the size- dataset , and assume that . Then, gradient flow w.r.t. either the exponential loss or the logistic loss converges to zero loss, converges in direction (i.e., exists), and the normalized margin is monotonically decreasing during the training, i.e., for all . Moreover, we have and .

We note that the proof readily extends in a few directions: It applies also for a depth- network without a skip connection but with a bias term in the output neuron. In addition, it also holds for ReLU networks. Finally, the theorem applies also for the smoothed version of the normalized margin considered in Lyu and Li (2019). The proof of the theorem is given in Appendix C.12. Intuitively, note that if and then , and if and then . Also, since the partial derivative of the loss w.r.t. depends on (respectively) and on , and the partial derivative w.r.t. depends only on , then grow faster than during the training. Hence, as increases, decreases and increase.
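The intuition above can be explored with a small simulation. The sketch below is not the construction from the proof: it assumes the parameterization $\Phi(\theta; x) = v w x + u x$ with $\theta = [w, v, u]$, the single example $(x, y) = (1, 1)$, the exponential loss, a hypothetical initialization $w = v = 1$, $u = 2$, and plain gradient descent with a small step size as a stand-in for gradient flow.

```python
import numpy as np

def normalized_margin(w, v, u):
    """Normalized margin on the single example (x, y) = (1, 1), assuming
    Phi(theta; x) = v * w * x + u * x (an assumed parameterization)."""
    norm = np.sqrt(w * w + v * v + u * u)
    return (v / norm) * (w / norm) + u / norm

def train(w, v, u, lr=0.01, steps=5000):
    """Gradient descent (a discretization of gradient flow) on exp(-Phi)."""
    margins = [normalized_margin(w, v, u)]
    for _ in range(steps):
        loss = np.exp(-(v * w + u))
        # Negative partial derivatives of the loss: loss * v for w,
        # loss * w for v, and just loss for the skip weight u.
        w, v, u = w + lr * loss * v, v + lr * loss * w, u + lr * loss
        margins.append(normalized_margin(w, v, u))
    return np.array(margins)

# Hypothetical initialization, chosen for illustration only.
margins = train(w=1.0, v=1.0, u=2.0)
```

Under these assumptions the product weights $w, v$ grow multiplicatively while the skip weight $u$ grows only additively, so the recorded normalized margin falls from its initial value toward $1/2$ without ever increasing.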

### Acknowledgements

This research is supported in part by European Research Council (ERC) grant 754705.

## Appendix A More related work

Soudry et al. [2018] showed that gradient descent on linearly-separable binary classification problems with exponentially-tailed losses (e.g., the exponential loss and the logistic loss) converges to the maximum $\ell_2$-margin direction. This analysis was extended to other loss functions, tighter convergence rates, non-separable data, and variants of gradient-based optimization algorithms [Nacson et al., 2019b, Ji and Telgarsky, 2018b, Ji et al., 2020, Gunasekar et al., 2018a, Shamir, 2020, Ji and Telgarsky, 2021].

As detailed in Section 2, Lyu and Li [2019] and Ji and Telgarsky [2020] showed that gradient flow on homogeneous neural networks with exponential-type losses converges in direction to a KKT point of the maximum margin problem in the parameter space. Similar results under stronger assumptions were previously obtained in Nacson et al. [2019a], Gunasekar et al. [2018b]. The implications of margin maximization in the parameter space on the implicit bias in the predictor space for linear neural networks were studied in Gunasekar et al. [2018b] (as detailed in Section 2) and also in Jagadeesan et al. [2021]. Margin maximization in the predictor space for fully-connected linear networks was shown by Ji and Telgarsky [2020] (as detailed in Section 2), and similar results under stronger assumptions were previously established in Gunasekar et al. [2018b] and in Ji and Telgarsky [2018a]. The implicit bias in the predictor space of diagonal and convolutional linear networks was studied in Gunasekar et al. [2018b], Moroshko et al. [2020], Yun et al. [2020]. The implicit bias in infinitely-wide two-layer homogeneous neural networks was studied in Chizat and Bach [2020].

Finally, the implicit bias of neural networks in regression tasks w.r.t. the square loss was also extensively studied in recent years (e.g., Gunasekar et al. [2018c], Razin and Cohen [2020], Arora et al. [2019], Belabbas [2020], Eftekhari and Zygalakis [2020], Li et al. [2018], Ma et al. [2018], Woodworth et al. [2020], Gidel et al. [2019], Li et al. [2020], Yun et al. [2020], Vardi and Shamir [2021], Azulay et al. [2021]). This setting, however, is less relevant for our work.

## Appendix B Preliminaries on the KKT condition

Below we review the definition of the KKT condition for non-smooth optimization problems (cf. Lyu and Li [2019], Dutta et al. [2013]).

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a locally Lipschitz function. The Clarke subdifferential [Clarke et al., 2008] at $x \in \mathbb{R}^d$ is the convex set

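$$\partial^\circ f(x) := \mathrm{conv}\left\{ \lim_{i \to \infty} \nabla f(x_i) \;\middle|\; \lim_{i \to \infty} x_i = x,\ f \text{ is differentiable at } x_i \right\}~.$$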

If $f$ is continuously differentiable at $x$ then $\partial^\circ f(x) = \{\nabla f(x)\}$.
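As a standard illustration, stated here for concreteness, consider the ReLU activation $\sigma(z) = \max\{0, z\}$. It is differentiable everywhere except at $z = 0$, where the gradients along sequences approaching from the left and from the right converge to $0$ and $1$ respectively, so taking the convex hull gives

$$\partial^\circ \sigma(z) = \begin{cases} \{0\} & z < 0~, \\ [0, 1] & z = 0~, \\ \{1\} & z > 0~. \end{cases}$$

This is the source of non-smoothness in ReLU networks, and the reason the KKT conditions are stated with Clarke subdifferentials rather than gradients.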

Consider the following optimization problem

$$\min f(x) \quad \text{s.t.} \quad \forall n \in [N] \;\; g_n(x) \leq 0~, \tag{5}$$

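where $f, g_1, \ldots, g_N : \mathbb{R}^d \to \mathbb{R}$ are locally Lipschitz functions. We say that $x \in \mathbb{R}^d$ is a feasible point of Problem 5 if $x$ satisfies $g_n(x) \leq 0$ for all $n \in [N]$. We say that a feasible point $x$ is a KKT point if there exist $\lambda_1, \ldots, \lambda_N \geq 0$ such that

1. $0 \in \partial^\circ f(x) + \sum_{n \in [N]} \lambda_n \partial^\circ g_n(x)$;

2. For all $n \in [N]$ we have $\lambda_n g_n(x) = 0$.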
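In the smooth case the two conditions reduce to stationarity of the Lagrangian and complementary slackness, and they can be verified directly. The following is a minimal sketch for the linear max-margin problem $\min \frac{1}{2}\|w\|^2$ subject to $1 - y_i \langle w, x_i \rangle \leq 0$; the function name `is_kkt_point`, the tolerance, and the two-point dataset are assumptions made for the example.

```python
import numpy as np

def is_kkt_point(w, lambdas, X, y, tol=1e-9):
    """Check the KKT conditions of  min 0.5*||w||^2  s.t.  1 - y_i <w, x_i> <= 0."""
    g = 1.0 - y * (X @ w)                       # constraint values g_i(w)
    feasible = np.all(g <= tol)
    nonneg = np.all(lambdas >= 0)
    # Stationarity: grad f(w) + sum_i lambda_i * grad g_i(w)
    #             = w - sum_i lambda_i * y_i * x_i = 0.
    stationarity = np.allclose(w - (lambdas * y) @ X, 0.0, atol=tol)
    # Complementary slackness: lambda_i * g_i(w) = 0 for every i.
    slackness = np.allclose(lambdas * g, 0.0, atol=tol)
    return bool(feasible and nonneg and stationarity and slackness)

# Hypothetical dataset: one positive and one negative point on the first axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

at_margin = is_kkt_point(np.array([1.0, 0.0]), np.array([0.5, 0.5]), X, y)
too_large = is_kkt_point(np.array([2.0, 0.0]), np.array([1.0, 1.0]), X, y)
```

The first candidate satisfies both conditions. The second is feasible and stationary for its multipliers, but its constraints are inactive while the multipliers are positive, so complementary slackness fails.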

## Appendix C Proofs

### C.1 Auxiliary lemmas

Throughout our proofs we use the following two lemmas from Du et al. [2018]:

###### Lemma C.1 (Du et al. [2018]).

Let , and let be a depth- fully-connected linear or ReLU network parameterized by . Suppose that for every we have . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have

$$\frac{d}{dt}\left(\left\|W_j\right\|_F^2 - \left\|W_{j+1}\right\|_F^2\right) = 0~.$$

Moreover, for every and we have

$$\frac{d}{dt}\left(\left\|W_j[i,:]\right\|^2 - \left\|W_{j+1}[:,i]\right\|^2\right) = 0~,$$

where is the vector of incoming weights to the -th neuron in the -th hidden layer (i.e., the -th row of ), and is the vector of outgoing weights from this neuron (i.e., the -th column of ).
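The conservation law can be observed numerically. The sketch below runs plain gradient descent with a small step size, as a discretization of gradient flow, on a two-layer linear network with the exponential loss, so both quantities are conserved only up to a small discretization error. The weights, dataset, step size, and helper names are hypothetical.

```python
import numpy as np

def gradients(W1, W2, X, y):
    """Gradients of the exponential loss sum_i exp(-y_i * W2 @ W1 @ x_i)."""
    margins = y * (X @ (W2 @ W1).ravel())          # y_i * Phi(theta; x_i)
    # Gradient w.r.t. the end-to-end predictor beta = W2 @ W1.
    g_beta = -(np.exp(-margins) * y) @ X           # shape (d,)
    return np.outer(W2.ravel(), g_beta), (W1 @ g_beta)[None, :]

def run(W1, W2, X, y, lr=1e-3, steps=500):
    """Plain gradient descent; returns the final weights."""
    for _ in range(steps):
        G1, G2 = gradients(W1, W2, X, y)
        W1, W2 = W1 - lr * G1, W2 - lr * G2
    return W1, W2

def layer_gap(W1, W2):
    """||W1||_F^2 - ||W2||_F^2."""
    return np.sum(W1 ** 2) - np.sum(W2 ** 2)

def neuron_gaps(W1, W2):
    """Per hidden neuron: ||incoming weights||^2 - ||outgoing weights||^2."""
    return np.sum(W1 ** 2, axis=1) - W2.ravel() ** 2

# Hypothetical initialization and dataset.
W1_0 = np.array([[0.5, -0.3], [0.2, 0.4], [-0.1, 0.6]])
W2_0 = np.array([[0.3, -0.2, 0.5]])
X = np.array([[1.0, 0.5], [-1.0, -0.2]])
y = np.array([1.0, -1.0])

W1_T, W2_T = run(W1_0, W2_0, X, y)
```

In this run the squared norm of the first layer grows noticeably, while `layer_gap` and `neuron_gaps` stay essentially at their initial values.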

###### Lemma C.2 (Du et al. [2018]).

Let , and let be a depth- linear or ReLU network in , parameterized by . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have

$$\frac{d}{dt}\left(\left\|u^{(j)}\right\|^2 - \left\|u^{(j+1)}\right\|^2\right) = 0~.$$

Note that Lemma C.2 considers a larger family of neural networks since it allows sparse and shared weights, but Lemma C.1 gives a stronger guarantee, since it implies balancedness between the incoming and outgoing weights of each hidden neuron separately. In our proofs we will also need to use a balancedness property for each hidden neuron separately in depth- networks with sparse weights. Since this property is not implied by the above lemmas from Du et al. [2018], we now prove it.

Before stating the lemma, let us introduce some required notation. Let be a depth- network in . We can always assume w.l.o.g. that the second layer is fully connected, namely, all hidden neurons are connected to the output neuron. Indeed, otherwise we can ignore the neurons that are not connected to the output neuron. For the network we use the parameterization , where is the number of hidden neurons. For every the vector is the weights vector of the -th hidden neuron, and we have where is the input dimension. For an input we denote by a sub-vector of , such that includes the coordinates of that are connected to the -th hidden neuron. Thus, given , the input to the -th hidden neuron is . The vector is the weights vector of the second layer. Overall, we have .
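The parameterization described above can be sketched in code as follows. The function name `sparse_depth2_net`, the representation of each neuron's connectivity as an index array, and the example values are assumptions made for illustration.

```python
import numpy as np

def sparse_depth2_net(ws, v, index_sets, x, relu=True):
    """Phi(theta; x) = sum_j v_j * sigma(w_j^T x^j), where x^j = x[index_sets[j]]."""
    total = 0.0
    for w_j, v_j, idx in zip(ws, v, index_sets):
        pre = w_j @ x[idx]                  # input to the j-th hidden neuron
        total += v_j * (max(pre, 0.0) if relu else pre)
    return total

# Hypothetical 3-input example: neuron 0 sees coordinates {0, 1}, neuron 1 sees {2}.
index_sets = [np.array([0, 1]), np.array([2])]
ws = [np.array([1.0, -2.0]), np.array([0.5])]
v = np.array([2.0, -0.5])

out_a = sparse_depth2_net(ws, v, index_sets, np.array([3.0, 1.0, 4.0]))
out_b = sparse_depth2_net(ws, v, index_sets, np.array([1.0, 3.0, 4.0]))
out_b_linear = sparse_depth2_net(ws, v, index_sets, np.array([1.0, 3.0, 4.0]), relu=False)
```

Each hidden neuron has its own weight vector whose length matches the number of input coordinates it is connected to, which is why the balancedness property in the lemma is stated per neuron.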

###### Lemma C.3.

Let be a depth- linear or ReLU network in , parameterized by . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have

$$\frac{d}{dt}\left(\left\|w_j\right\|^2 - v_j^2\right) = 0~.$$

###### Proof.

We have

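$$L(\theta) = \sum_{i \in [n]} \ell\left(y_i \Phi(\theta; x_i)\right) = \sum_{i \in [n]} \ell\left(y_i \sum_{l \in [k]} v_l \sigma(w_l^\top x_i^l)\right)~.$$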

Hence

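$$\begin{aligned}
\frac{d}{dt}\left(\left\|w_j\right\|^2\right) &= 2\left\langle w_j, \frac{dw_j}{dt} \right\rangle = -2\left\langle w_j, \nabla_{w_j} L(\theta) \right\rangle \\
&= -2 \sum_{i \in [n]} \ell'\left(y_i \sum_{l \in [k]} v_l \sigma(w_l^\top x_i^l)\right) \cdot y_i v_j \sigma'(w_j^\top x_i^j)\, w_j^\top x_i^j \\
&= -2 \sum_{i \in [n]} \ell'\left(y_i \sum_{l \in [k]} v_l \sigma(w_l^\top x_i^l)\right) \cdot y_i v_j \sigma(w_j^\top x_i^j)~.
\end{aligned}$$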

Moreover,

 ddt(v2j) =2vjdvjdt=−2vj∇vjL(θ) =−2vj∑i∈[n]ℓ′⎛⎝yi∑l∈[k]vlσ(w⊤lxli)⎞⎠⋅yiσ(w