1 Introduction
A central question in the theory of deep learning is how neural networks generalize even when trained without any explicit regularization, and when there are far more learnable parameters than training examples. In such optimization problems there are many solutions that label the training data correctly, and gradient descent seems to prefer solutions that generalize well (Zhang et al., 2016). Hence, it is believed that gradient descent induces an implicit bias (Neyshabur et al., 2014, 2017), and characterizing this bias has been a subject of extensive research in recent years.

A main focus in the theoretical study of implicit bias is on homogeneous neural networks. These are networks where scaling the parameters by any factor $c > 0$ scales the predictions by $c^L$ for some constant $L$. For example, fully-connected and convolutional ReLU networks without bias terms are homogeneous. Lyu and Li (2019) proved that in linear and ReLU homogeneous networks trained with the exponential or the logistic loss, if gradient flow converges to zero loss¹, then the direction to which the parameters of the network converge can be characterized as a first-order stationary point (KKT point) of the maximum-margin problem in the parameter space, namely, the problem of minimizing the norm of the parameters under the constraints that each training example is classified correctly with margin at least $1$. They also showed that this KKT point satisfies necessary conditions for optimality. However, these conditions are not known to be sufficient even for local optimality. This is analogous to showing that some unconstrained optimization problem converges to a point with zero gradient, without proving that it is either a global or a local minimum.

In this work we consider several architectures of homogeneous neural networks with linear and ReLU activations, and study whether the aforementioned KKT point is guaranteed to be a global optimum of the maximum-margin problem, a local optimum, or neither. Perhaps surprisingly, our results imply that in many cases, such as depth-$2$ fully-connected ReLU networks and depth-$2$ diagonal linear networks, the KKT point may not even be a local optimum of the maximum-margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.

¹ They also assumed directional convergence, but Ji and Telgarsky (2020) later showed that this assumption is not required.
We now describe our results in a bit more detail. We denote by $\mathcal{N}$ the class of neural networks without bias terms, where the weights in each layer might have an arbitrary sparsity pattern, and weights might be shared². The class $\mathcal{N}$ contains, for example, convolutional networks. Moreover, we denote by $\mathcal{N}'$ the subclass of $\mathcal{N}$ that contains only networks without shared weights, such as fully-connected networks and diagonal networks (cf. Gunasekar et al. (2018b); Yun et al. (2020)). We describe our main results below, and also summarize them in Tables 1 and 2.

² See Section 2 for the formal definition.
Fully-connected networks:

In linear fully-connected networks of any depth the KKT point is a global optimum³.

³ We note that margin maximization for such networks in the predictor space is already known (Ji and Telgarsky, 2020). However, margin maximization in the predictor space does not necessarily imply margin maximization in the parameter space.

In fully-connected depth-$2$ ReLU networks the KKT point may not even be a local optimum. Moreover, this negative result holds with constant probability over the initialization, i.e., there is a training dataset such that gradient flow with random initialization converges with positive probability to the direction of a KKT point which is not a local optimum.
Depth-$2$ networks in $\mathcal{N}'$:

The positive result on fully-connected linear networks does not extend to networks with sparse weights: in linear diagonal networks the KKT point may not be a local optimum.

In our proof of the above negative result, the KKT point contains a neuron whose weights vector is zero. However, in practice gradient descent often converges to networks that do not contain such zero neurons. We show that for linear networks in $\mathcal{N}'$, if the KKT point has only nonzero weights vectors, then it is a global optimum. We also show that even for the simple case of depth-$2$ diagonal linear networks, the optimality of the KKT points can be unexpectedly subtle in the context of margin maximization in the predictor space (see Remark 4.1).
For ReLU networks in $\mathcal{N}'$, in order to obtain a positive result we need a stronger assumption. We show that if the KKT point is such that for every input in the dataset the input to every hidden neuron in the network is nonzero, then it is guaranteed to be a local optimum (but not necessarily a global optimum).

For linear or ReLU convolutional networks, even if the above assumptions hold, the KKT point may not be a local optimum.
Deep networks in $\mathcal{N}'$:

We show that the positive results on depth-$2$ linear and ReLU networks in $\mathcal{N}'$ (under the assumptions described above) do not apply to deeper networks.

We study a weaker notion of margin maximization: maximizing the margin for each layer separately. For linear networks of any depth in $\mathcal{N}$ (including networks with shared weights), we show that the KKT point is a global optimum of the per-layer maximum-margin problem. For ReLU networks the KKT point may not even be a local optimum of this problem, but under the assumption of nonzero inputs to all neurons it is a local optimum.
In the paper, our focus is on understanding what can be guaranteed for the KKT convergence points specified in Lyu and Li (2019). Accordingly, in most of our negative results, the construction assumes some specific initialization of gradient flow, and we do not quantify how “likely” such points are to be reached under random initialization. An exception is our negative result for depth-$2$ fully-connected ReLU networks (Theorem 3.2), which holds with constant probability under reasonable random initializations. Understanding whether this can be extended to the other settings we consider is an interesting problem for future research.
Finally, we consider non-homogeneous networks, for example, networks with skip connections or bias terms. Lyu and Li (2019) showed that a smoothed version of the normalized margin is monotonically increasing when training homogeneous networks. They observed empirically that the normalized margin is monotonically increasing also when training non-homogeneous networks, but did not provide a proof for this phenomenon and left it as an open problem. We give an example of a simple non-homogeneous network where the normalized margin (as well as the smoothed margin) is strictly decreasing (see Theorem 6.1).
Our paper is structured as follows: In Section 2 we provide necessary notations and definitions, and discuss the most relevant prior results. Additional related works are discussed in Appendix A. In Sections 3, 4 and 5 we state our results on fully-connected networks, depth-$2$ networks in $\mathcal{N}'$, and deep networks in $\mathcal{N}'$, respectively, and provide some proof ideas. In Section 6 we state our result on non-homogeneous networks. All formal proofs are deferred to Appendix C.
Table 1 (depth-$2$ networks):

| Setting | Linear | ReLU |
| Fully-connected | Global (Thm. 3.1) | Not local (Thm. 3.2) |
| $\mathcal{N}'$ | Not local (Thm. 4.1) | Not local (Thm. 3.2) |
| $\mathcal{N}'$, assuming nonzero weights vectors | Global (Thm. 4.2) | Not local (Thm. 4.2) |
| $\mathcal{N}'$, assuming nonzero inputs to all neurons | Global (Thm. 4.2) | Local, not global (Thm. 4.3) |
| Convolutional, assuming nonzero inputs to all neurons | Not local (Thm. 4.4) | Not local (Thm. 4.4) |

Table 2 (deeper networks):

| Setting | Linear | ReLU |
| Fully-connected | Global (Thm. 3.1) | Not local (Thm. 3.2) |
| $\mathcal{N}'$, assuming nonzero inputs to all neurons | Not local (Thm. 5.1) | Not local (Thm. 5.1) |
| Max margin for each layer separately | Global (Thm. 5.2) | Not local (Thm. 5.3) |
| Max margin for each layer separately, assuming nonzero inputs to all neurons | Global (Thm. 5.2) | Local, not global (Thm. 5.4) |
2 Preliminaries
Notations.
We use bold-faced letters to denote vectors, e.g., $\mathbf{x}$. We denote by $\|\mathbf{x}\|$ the Euclidean norm of $\mathbf{x}$. We denote by $\mathbb{1}[\cdot]$ the indicator function; for example, $\mathbb{1}[z \ge 3]$ equals $1$ if $z \ge 3$ and $0$ otherwise. For an integer $d \ge 1$ we denote $[d] = \{1, \dots, d\}$.
Neural networks.
A fully-connected neural network of depth $m$ is parameterized by a collection $\theta = [W_1, \dots, W_m]$ of weight matrices, such that for every layer $i \in [m]$ we have $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$. Thus, $d_i$ denotes the number of neurons in the $i$-th layer (i.e., the width of the layer). We assume that $d_m = 1$ and denote by $d = d_0$ the input dimension. The neurons in layers $1, \dots, m-1$ are called hidden neurons. A fully-connected network computes a function $\Phi(\theta; \cdot) : \mathbb{R}^d \to \mathbb{R}$ defined recursively as follows. For an input $\mathbf{x}$ we set $\mathbf{h}_0 = \mathbf{x}$, and define for every $i \in [m-1]$ the input to the $i$-th layer as $\mathbf{z}_i = W_i \mathbf{h}_{i-1}$, and the output of the $i$-th layer as $\mathbf{h}_i = \sigma(\mathbf{z}_i)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is an activation function that acts coordinate-wise on vectors. Then, we define $\Phi(\theta; \mathbf{x}) = W_m \mathbf{h}_{m-1}$. Thus, there is no activation function in the output neuron. When considering depth-$2$ fully-connected networks we often use a parameterization where $\mathbf{w}_1, \dots, \mathbf{w}_k$ are the weights vectors of the hidden neurons (i.e., they correspond to the rows of the first layer's weight matrix) and $\mathbf{v}$ contains the weights of the second layer.

We also consider neural networks where some weights can be missing or shared. We define a class $\mathcal{N}$ of networks that may contain sparse and shared weights as follows. A network $\Phi$ in $\mathcal{N}$ is parameterized by $\theta = [\theta_1, \dots, \theta_m]$, where $m$ is the depth of $\Phi$, and $\theta_i \in \mathbb{R}^{k_i}$ are the parameters of the $i$-th layer. We denote by $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ the weight matrix of the $i$-th layer. The matrix $W_i$ is described by the vector $\theta_i$ and a function $g_i : [d_i] \times [d_{i-1}] \to \{0, 1, \dots, k_i\}$ as follows: $(W_i)_{r,s} = (\theta_i)_{g_i(r,s)}$ if $g_i(r,s) \neq 0$, and $(W_i)_{r,s} = 0$ if $g_i(r,s) = 0$. Thus, the function $g_i$ represents the sparsity and weight-sharing pattern of the $i$-th layer, and the dimension $k_i$ of $\theta_i$ is the number of free parameters in the layer. We denote by $d = d_0$ the input dimension of the network and assume that the output dimension is $d_m = 1$. The function $\Phi(\theta; \cdot) : \mathbb{R}^d \to \mathbb{R}$ computed by the neural network is defined recursively by the weight matrices $W_1, \dots, W_m$ as in the case of fully-connected networks. For example, convolutional neural networks are in $\mathcal{N}$. Note that the networks in $\mathcal{N}$ do not have bias terms and do not allow weight sharing between different layers. Moreover, we define a subclass $\mathcal{N}'$ of $\mathcal{N}$ that contains networks without shared weights. Formally, a network is in $\mathcal{N}'$ if for every layer $i$ and every $l \in [k_i]$ there is at most one pair $(r, s)$ such that $g_i(r, s) = l$. Thus, networks in $\mathcal{N}'$ might have sparse weights, but do not allow shared weights. For example, diagonal networks (defined below) and fully-connected networks are in $\mathcal{N}'$.

A diagonal neural network is a network in $\mathcal{N}'$ such that the weight matrix of each layer is diagonal, except for the last layer. Thus, the network is parameterized by $\theta = [\mathbf{w}_1, \dots, \mathbf{w}_m]$ where $\mathbf{w}_i \in \mathbb{R}^d$ for all $i \in [m]$, and it computes a function defined recursively as follows. For an input $\mathbf{x}$ set $\mathbf{h}_0 = \mathbf{x}$. For $i \in [m-1]$, the output of the $i$-th layer is $\mathbf{h}_i = \sigma(\mathbf{w}_i \circ \mathbf{h}_{i-1})$, where $\circ$ denotes the entrywise product. Then, we have $\Phi(\theta; \mathbf{x}) = \mathbf{w}_m^\top \mathbf{h}_{m-1}$.
In all the above definitions the parameters $\theta$ of the neural network are given by a collection of matrices or vectors. We often view $\theta$ as the vector obtained by concatenating the matrices or vectors in the collection. Thus, $\|\theta\|$ denotes the Euclidean norm of this vector.
The ReLU activation function is defined by $\sigma(z) = \max\{0, z\}$, and the linear activation is $\sigma(z) = z$. In this work we focus on ReLU networks (i.e., networks where all neurons have the ReLU activation) and on linear networks (where all neurons have the linear activation). We say that a network $\Phi$ is homogeneous if there exists $L > 0$ such that for every $c > 0$ and every $\theta, \mathbf{x}$ we have $\Phi(c \cdot \theta; \mathbf{x}) = c^L \Phi(\theta; \mathbf{x})$. Note that in our definition of the class $\mathcal{N}$ we do not allow bias terms, and hence all linear and ReLU networks in $\mathcal{N}$ are homogeneous with $L = m$, where $m$ is the depth of the network. With the exception of Section 6, which studies non-homogeneous networks, all networks considered in this work are homogeneous.
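The homogeneity property is easy to verify numerically. Below is a minimal sketch (with arbitrary random weights, not taken from the paper) checking that a depth-$2$ fully-connected ReLU network without bias terms is homogeneous of degree $L = 2$:

```python
import numpy as np

def relu_net(theta, x):
    """Depth-2 fully-connected ReLU network without bias terms:
    theta = (W, v) -> v^T relu(W x)."""
    W, v = theta
    return float(v @ np.maximum(W @ x, 0.0))

rng = np.random.default_rng(0)
W, v = rng.normal(size=(4, 3)), rng.normal(size=4)
x = rng.normal(size=3)

c, L = 2.5, 2  # depth-2 network, so homogeneous of degree L = 2
lhs = relu_net((c * W, c * v), x)   # scale all parameters by c > 0
rhs = c**L * relu_net((W, v), x)    # predictions scale by c^L
assert abs(lhs - rhs) < 1e-9
```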
Optimization problem and gradient flow.
Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^d \times \{-1, 1\}$ be a binary classification training dataset. Let $\Phi(\theta; \cdot) : \mathbb{R}^d \to \mathbb{R}$ be a neural network parameterized by $\theta$. For a loss function $\ell : \mathbb{R} \to \mathbb{R}$, the empirical loss of $\Phi(\theta; \cdot)$ on the dataset is

$$\mathcal{L}(\theta) = \sum_{i=1}^n \ell\big(y_i \Phi(\theta; \mathbf{x}_i)\big). \tag{1}$$

We focus on the exponential loss $\ell(q) = e^{-q}$ and the logistic loss $\ell(q) = \log(1 + e^{-q})$.
We consider gradient flow on the objective given in Eq. 1. This setting captures the behavior of gradient descent with an infinitesimally small step size. Let $\theta(t)$ be the trajectory of gradient flow. Starting from an initial point $\theta(0)$, the dynamics of $\theta(t)$ is given by the differential equation $\frac{d\theta(t)}{dt} = -\nabla \mathcal{L}(\theta(t))$. Note that the ReLU function is not differentiable at $0$. Practical implementations of gradient methods define the derivative $\sigma'(0)$ to be some constant in $[0, 1]$. We note that the exact value of $\sigma'(0)$ has no effect on our results.
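As an illustration of this training setup, the following sketch runs plain gradient descent with a small step size (a crude proxy for gradient flow) on the exponential loss of a linear predictor over a tiny linearly separable dataset; the dataset and the step size are hypothetical, chosen only for illustration:

```python
import numpy as np

def exp_loss(theta, X, y):
    # empirical loss under the exponential loss: sum_i exp(-y_i <theta, x_i>)
    margins = y * (X @ theta)
    return np.exp(-margins).sum()

def grad(theta, X, y):
    margins = y * (X @ theta)
    return -(np.exp(-margins) * y) @ X

# gradient descent with a small step size as a proxy for gradient flow
X = np.array([[1.0, 0.5], [0.2, 1.0]])   # hypothetical inputs
y = np.array([1.0, -1.0])                # binary labels
theta = np.zeros(2)
losses = [exp_loss(theta, X, y)]
for _ in range(2000):
    theta -= 0.01 * grad(theta, X, y)
    losses.append(exp_loss(theta, X, y))

assert losses[-1] < losses[0]          # the loss decreases toward zero
assert np.all(y * (X @ theta) > 0)     # every example is eventually classified correctly
```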
Convergence to a KKT point of the maximummargin problem.
We say that a trajectory $\theta(t)$ converges in direction to $\tilde{\theta}$ if $\lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|} = \frac{\tilde{\theta}}{\|\tilde{\theta}\|}$. Throughout this work we use the following theorem:
Theorem 2.1 (Paraphrased from Lyu and Li (2019); Ji and Telgarsky (2020)).
Let $\Phi$ be a homogeneous linear or ReLU neural network. Consider minimizing either the exponential or the logistic loss over a binary classification dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow. Assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$, namely, $\Phi(\theta(t_0); \cdot)$ classifies every $\mathbf{x}_i$ correctly. Then, gradient flow converges in direction to a first-order stationary point (KKT point) of the following maximum-margin problem in parameter space:

$$\min_{\theta} \ \frac{1}{2} \|\theta\|^2 \quad \text{s.t.} \quad y_i \Phi(\theta; \mathbf{x}_i) \ge 1 \ \ \forall i \in [n]. \tag{2}$$

Moreover, $\mathcal{L}(\theta(t)) \to 0$ and $\|\theta(t)\| \to \infty$ as $t \to \infty$.
In the case of ReLU networks, Problem 2 is non-smooth. Hence, the KKT conditions are defined using Clarke's subdifferential, which is a generalization of the differential to non-differentiable functions. See Appendix B for a formal definition. We note that Lyu and Li (2019) proved the above theorem under the assumption that $\theta(t)$ converges in direction, and Ji and Telgarsky (2020) showed that such directional convergence occurs, hence this assumption is not required.
Lyu and Li (2019) also showed that the KKT conditions of Problem 2 are necessary for optimality. In convex optimization problems, necessary KKT conditions are also sufficient for global optimality. However, the constraints in Problem 2 are highly non-convex. Moreover, the standard method for proving that necessary KKT conditions are sufficient for local optimality is to show that the KKT point satisfies certain second-order sufficient conditions (SOSC) (cf. Ruszczynski (2011)). However, even when $\Phi$ is a linear neural network it is not known when such conditions hold. Thus, the KKT conditions of Problem 2 are not known to be sufficient even for local optimality.
A linear network with weight matrices $W_1, \dots, W_m$ computes a linear predictor $\mathbf{x} \mapsto \mathbf{v}^\top \mathbf{x}$, where $\mathbf{v}^\top = W_m \cdots W_1$. Some prior works studied the implicit bias of linear networks in the predictor space, namely, characterizing the vector $\mathbf{v}$ of the aforementioned linear predictor. Gunasekar et al. (2018b) studied the implications of margin maximization in the parameter space on the implicit bias in the predictor space. They showed that minimizing $\|\theta\|$ (under the constraints in Problem 2) implies: (1) minimizing $\|\mathbf{v}\|_2$ for fully-connected linear networks; (2) minimizing $\|\mathbf{v}\|_{2/m}$ for diagonal linear networks of depth $m$; (3) minimizing $\|\hat{\mathbf{v}}\|_{2/m}$ for linear convolutional networks of depth $m$ with full-dimensional convolutional filters, where $\hat{\mathbf{v}}$ denotes the Fourier coefficients of $\mathbf{v}$. However, these implications may not hold if gradient flow converges to a KKT point which is not a global optimum of Problem 2.
For some classes of linear networks, positive results were obtained directly in the predictor space, without assuming convergence to a global optimum of Problem 2 in the parameter space. Most notably, for fully-connected linear networks (of any depth), Ji and Telgarsky (2020) showed that under the assumptions of Theorem 2.1, gradient flow maximizes the margin in the predictor space. Note that margin maximization in the predictor space does not necessarily imply margin maximization in the parameter space. Moreover, some results on the implicit bias in the predictor space of linear convolutional networks with full-dimensional convolutional filters are given in Gunasekar et al. (2018b). However, the architecture and set of assumptions differ from those we focus on.
3 Fully-connected networks
First, we show that fully-connected linear networks of any depth converge in direction to a global optimum of Problem 2.
Theorem 3.1.
Let $m \ge 2$ and let $\Phi$ be a depth-$m$ fully-connected linear network parameterized by $\theta = [W_1, \dots, W_m]$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow. Assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$. Then, gradient flow converges in direction to a global optimum of Problem 2.
Proof idea (for the complete proof see Appendix C.2).
Building on results from Ji and Telgarsky (2020) and Du et al. (2018), we show that gradient flow converges in direction to a KKT point $\tilde{\theta} = [W_1, \dots, W_m]$ such that for every $j \in [m]$ we have $W_j = \alpha \mathbf{u}_j \mathbf{v}_j^\top$, where $\mathbf{u}_j$ and $\mathbf{v}_j$ are unit vectors (with $\mathbf{v}_{j+1} = \mathbf{u}_j$), and all layers share the same norm $\alpha$. Then, we show that every $\theta$ that satisfies these properties, and satisfies the constraints of Problem 2, is a global optimum. Intuitively, the most “efficient” way (in terms of minimizing the norm of the parameters) to achieve margin $1$ with a linear fully-connected network is to use a network such that the direction of its corresponding linear predictor maximizes the margin, the layers are balanced (i.e., have equal norms), and the weight matrices of the layers are aligned. ∎
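The balancedness intuition can be checked numerically in the simplest case of a width-$1$ depth-$2$ linear network $\mathbf{x} \mapsto b\,\mathbf{a}^\top \mathbf{x}$, whose linear predictor is $\mathbf{v} = b\mathbf{a}$: among factorizations realizing the same predictor, the balanced, aligned one ($|b| = \|\mathbf{a}\|$) minimizes the squared parameter norm $b^2 + \|\mathbf{a}\|^2$, which then equals $2\|\mathbf{v}\|$. A sketch with an arbitrary target predictor:

```python
import numpy as np

v = np.array([3.0, 4.0])              # target linear predictor, ||v|| = 5
vhat = v / np.linalg.norm(v)

def param_norm_sq(b, a):
    # squared parameter norm of the factorization v = b * a
    return b**2 + a @ a

# unbalanced factorization: a = v, b = 1 (so b * a = v)
unbalanced = param_norm_sq(1.0, v)        # 1 + 25 = 26
# balanced, aligned factorization: ||a|| = b = sqrt(||v||)
s = np.sqrt(np.linalg.norm(v))
balanced = param_norm_sq(s, s * vhat)     # 5 + 5 = 2 * ||v||

assert np.allclose(balanced, 2 * np.linalg.norm(v))
assert balanced < unbalanced
```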
We now prove that the positive result in Theorem 3.1 does not apply to ReLU networks. We show that in depth-$2$ fully-connected ReLU networks gradient flow might converge in direction to a KKT point of Problem 2 which is not even a local optimum. Moreover, this occurs under conditions that hold with constant probability over reasonable random initializations.
Theorem 3.2.
Let $\Phi$ be a depth-$2$ fully-connected ReLU network with two hidden neurons. Consider minimizing either the exponential or the logistic loss using gradient flow, over a dataset of two examples $\mathbf{x}_1, \mathbf{x}_2$. Assume that the initialization is such that the first hidden neuron is active for both inputs, and the second hidden neuron is not active for either input. Then, gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum.
Proof idea (for the complete proof see Appendix C.3).
By analyzing the dynamics of gradient flow on the given dataset, we show that it converges to zero loss, and converges in direction to a KKT point in which the second hidden neuron remains inactive; its weights keep their initial direction, since they remain constant during training. See Figure 1 for an illustration. Then, we show that arbitrarily close to this KKT point there are points that satisfy the constraints of Problem 2 with a strictly smaller norm; such points are obtained by slightly changing the weights so that the second hidden neuron, which is not active at the KKT point, is used. Thus, the KKT point is not a local optimum. ∎
We note that the assumption on the initialization in the above theorem holds with constant probability for standard initialization schemes (e.g., Xavier initialization).
Remark 3.1 (Unbounded suboptimality).
By choosing appropriate inputs in the setting of Theorem 3.2, it is not hard to show that the suboptimality of the KKT point w.r.t. the global optimum can be arbitrarily large. Namely, for every arbitrarily large factor we can choose a dataset where the angle between $\mathbf{x}_1$ and $\mathbf{x}_2$ is sufficiently close to $\pi$, such that the norm of the KKT point to which gradient flow converges exceeds the norm of a global optimum of Problem 2 by at least that factor. Indeed, as illustrated in Figure 1, if one neuron is active on both inputs and the other neuron is not active on any input, then the active neuron needs to be very large in order to achieve margin $1$, while if each neuron is active on a single input then we can achieve margin $1$ with much smaller parameters. We note that such unbounded suboptimality can be obtained also in other negative results in this work (in Theorems 4.1, 4.3, 4.4 and 5.4).
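The geometric intuition of the remark can be illustrated numerically. In the hypothetical example below (the inputs are our own, not the dataset from Theorem 3.2), two inputs with a large angle between them both receive label $1$; a network whose single active neuron must serve both inputs needs much larger parameters to reach margin $1$ than a network with one specialized neuron per input:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def margins(theta, X):
    # depth-2 ReLU net with two hidden neurons: sum_j v_j * relu(w_j . x)
    (w1, v1), (w2, v2) = theta
    return np.array([v1 * relu(w1 @ x) + v2 * relu(w2 @ x) for x in X])

eps = 0.1
X = np.array([[eps, 1.0], [eps, -1.0]])   # nearly opposite inputs, both labeled +1

# one shared neuron (the second one dead): it must point along x1 + x2,
# which is nearly orthogonal to both inputs, so its weights must be huge
c = np.sqrt(1.0 / eps)
shared = ((np.array([c, 0.0]), c), (np.zeros(2), 0.0))

# two specialized neurons, one aligned with each input
s = (1.0 + eps**2) ** (-0.25)
special = ((s * X[0], s), (s * X[1], s))

def norm_sq(theta):
    return sum(w @ w + v**2 for w, v in theta)

assert np.all(margins(shared, X) >= 1 - 1e-9)    # both reach margin 1 ...
assert np.all(margins(special, X) >= 1 - 1e-9)
assert norm_sq(special) < norm_sq(shared)        # ... at very different cost
```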
4 Depth-2 networks in $\mathcal{N}'$
In this section we study depth-$2$ linear and ReLU networks in $\mathcal{N}'$. We first show that already for linear networks in $\mathcal{N}'$ (more specifically, for diagonal networks) gradient flow may not converge even to a local optimum.
Theorem 4.1.
Let $\Phi$ be a depth-$2$ linear or ReLU diagonal neural network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization $\theta(0)$, such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum.
Proof idea (for the complete proof see Appendix C.4).
Recalling that a depth-$2$ diagonal network computes the function $\mathbf{x} \mapsto \mathbf{w}_2^\top(\mathbf{w}_1 \circ \mathbf{x})$ (where $\circ$ is the entrywise product), we choose a dataset and an initialization under which the second coordinate remains inactive during training. It is not hard to show that gradient flow converges in direction to a KKT point whose weights in the second coordinate are zero. However, it is not a local optimum, since slightly perturbing the weights of the inactive coordinate yields parameters that satisfy the constraints of Problem 2 with a strictly smaller norm. ∎
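The flavor of this argument can be illustrated with a hypothetical one-example dataset (our own choice, not necessarily the one used in the formal proof): when the coordinate with the larger input value is stuck at zero, the margin constraint is met with a strictly larger parameter norm than necessary, so a feasible point with smaller norm exists:

```python
import numpy as np

x = np.array([1.0, 2.0])   # a single (hypothetical) training example, label +1

def f(u, v):
    # depth-2 diagonal linear network: f(x) = sum_j u_j * v_j * x_j
    return float((u * v) @ x)

# a point satisfying the margin constraint using only the first coordinate
u0, v0 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
assert f(u0, v0) >= 1.0

# using the (previously zero) second coordinate satisfies the same
# constraint with a strictly smaller parameter norm, since x_2 > x_1
s = np.sqrt(0.5)
u1, v1 = np.array([0.0, s]), np.array([0.0, s])
assert f(u1, v1) >= 1.0 - 1e-12

n0 = u0 @ u0 + v0 @ v0   # = 2
n1 = u1 @ u1 + v1 @ v1   # = 1
assert n1 < n0
```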
By Theorem 3.2, fully-connected ReLU networks may not converge to a local optimum, and by Theorem 4.1, linear (and ReLU) networks with sparse weights may not converge to a local optimum. In the proofs of both of these negative results, gradient flow converges in direction to a KKT point in which one of the weights vectors of the hidden neurons is zero. However, in practice gradient descent often converges to a network that does not contain such disconnected neurons. Hence, a natural question is whether the negative results hold also for networks that do not contain neurons whose weights vector is zero. In the following theorem we show that for linear networks such an assumption allows us to obtain a positive result. Namely, in depth-$2$ linear networks in $\mathcal{N}'$, if gradient flow converges in direction to a KKT point of Problem 2 that satisfies this condition, then the KKT point is guaranteed to be a global optimum. However, we also show that for ReLU networks, assuming that all neurons have nonzero weights is not sufficient.
Theorem 4.2.
We have:

Let $\Phi$ be a depth-$2$ linear neural network in $\mathcal{N}'$ parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow. Assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$, and let $\tilde{\theta}$ be the KKT point of Problem 2 such that $\theta(t)$ converges to $\tilde{\theta}$ in direction (such $\tilde{\theta}$ exists by Theorem 2.1). Assume that in the network parameterized by $\tilde{\theta}$ all hidden neurons have nonzero incoming weights vectors. Then, $\tilde{\theta}$ is a global optimum of Problem 2.

Let $\Phi$ be a fully-connected depth-$2$ ReLU network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization $\theta(0)$, such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum, and in the network parameterized by this KKT point all hidden neurons have nonzero incoming weights.
Proof idea (for the complete proof see Appendix C.5).
We give here the proof idea for part (1). Let $k$ be the width of the network. For every $j \in [k]$ we denote by $\mathbf{w}_j$ the incoming weights vector of the $j$-th hidden neuron, and by $v_j$ its outgoing weight. We consider an optimization problem over new variables obtained from the products of incoming and outgoing weights, where the objective is convex and the constraints, which correspond to the constraints of Problem 2, are affine. Let $\tilde{\theta}$ be the KKT point of Problem 2 to which gradient flow converges in direction. We show that the point induced by $\tilde{\theta}$ satisfies the KKT conditions of the aforementioned problem. Since the objective there is convex and the constraints are affine, it is a global optimum. Finally, we show that this implies global optimality of $\tilde{\theta}$. ∎
Remark 4.1 (Implications on margin maximization in the predictor space for diagonal linear networks).
Theorems 4.1 and 4.2 imply analogous results on diagonal linear networks also in the predictor space. As we discussed in Section 2, Gunasekar et al. (2018b) showed that in depth-$2$ diagonal linear networks, minimizing $\|\theta\|$ under the constraints in Problem 2 implies minimizing $\|\mathbf{v}\|_1$, where $\mathbf{v}$ is the corresponding linear predictor. Theorem 4.1 can be easily extended to the predictor space, namely, gradient flow on depth-$2$ linear diagonal networks might converge to a KKT point of Problem 2, such that the corresponding linear predictor is not a local optimum of the following problem:

$$\min_{\mathbf{v}} \ \|\mathbf{v}\|_1 \quad \text{s.t.} \quad y_i \mathbf{v}^\top \mathbf{x}_i \ge 1 \ \ \forall i \in [n]. \tag{3}$$

Moreover, by combining part (1) of Theorem 4.2 with the result from Gunasekar et al. (2018b), we deduce that if gradient flow on a depth-$2$ diagonal linear network converges in direction to a KKT point of Problem 2 with nonzero weights vectors, then the corresponding linear predictor is a global optimum of Problem 3.
By part (2) of Theorem 4.2, assuming that gradient flow converges to a network without zero neurons is not sufficient for obtaining a positive result in the case of ReLU networks. Hence, we now consider a stronger assumption, namely, that the KKT point is such that for every input in the dataset the inputs to all hidden neurons in the computation are nonzero. In the following theorem we show that in depth-$2$ ReLU networks, if the KKT point satisfies this condition, then it is guaranteed to be a local optimum of Problem 2. However, even under this condition it is not necessarily a global optimum. The proof is given in Appendix C.6 and uses ideas from the previous proofs, with some required modifications.
Theorem 4.3.
Let $\Phi$ be a depth-$2$ ReLU network in $\mathcal{N}'$ parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow. Assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$, and let $\tilde{\theta}$ be the KKT point of Problem 2 such that $\theta(t)$ converges to $\tilde{\theta}$ in direction (such $\tilde{\theta}$ exists by Theorem 2.1). Assume that for every $i \in [n]$ the inputs to all hidden neurons in the computation of $\Phi(\tilde{\theta}; \mathbf{x}_i)$ are nonzero. Then, $\tilde{\theta}$ is a local optimum of Problem 2. However, it may not be a global optimum, even if the network is fully connected.
Note that in all the above theorems we do not allow shared weights. We now consider depth-$2$ linear or ReLU networks in $\mathcal{N}$ whose first layer is convolutional with disjoint patches (and hence has shared weights), and show that gradient flow does not always converge in direction to a local optimum, even when the inputs to all hidden neurons are nonzero (and hence there are no zero weights vectors).
Theorem 4.4.
Let $\Phi$ be a depth-$2$ linear or ReLU network in $\mathcal{N}$ whose first layer applies a single shared filter to two disjoint patches of the input; thus, $\Phi$ is a convolutional network with two disjoint patches. Consider minimizing either the exponential or the logistic loss using gradient flow. Then, there exists a dataset and an initialization $\theta(0)$, such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum. Moreover, the inputs to all hidden neurons in the computation on the dataset are nonzero.
Proof idea (for the complete proof see Appendix C.7).
We choose a dataset of two examples that are symmetric with respect to each other, and an initialization that does not break this symmetry, so the shared filter keeps its direction throughout the training. Thus, we show that gradient flow converges in direction to a KKT point that inherits this symmetry. Then, we show that it is not a local optimum, since for every sufficiently small perturbation of the parameters in an appropriate direction, the perturbed parameters satisfy the constraints of Problem 2 with a strictly smaller norm. ∎
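To make the weight-sharing structure concrete, here is a minimal sketch (with arbitrary values) of a depth-$2$ network whose first layer applies a single shared filter to two disjoint patches of the input, as in the theorem:

```python
import numpy as np

def conv_net(w, v, x, act=lambda z: z):
    """Depth-2 network whose first layer applies one shared filter w to
    two disjoint patches of x (weight sharing across patches); v holds
    the second-layer weights. `act` is the activation (linear by default)."""
    k = len(w)
    h = np.array([act(w @ x[:k]), act(w @ x[k:2 * k])])
    return float(v @ h)

w = np.array([1.0, -2.0])            # shared first-layer filter
v = np.array([0.5, 1.5])             # second-layer weights
x = np.array([1.0, 1.0, 2.0, 0.0])   # input split into patches (1,1) and (2,0)

# both patch outputs use the same filter: w@(1,1) = -1 and w@(2,0) = 2
assert conv_net(w, v, x) == 0.5 * (-1.0) + 1.5 * 2.0
```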
5 Deep networks in $\mathcal{N}'$
In this section we study the more general case of depth-$m$ neural networks in $\mathcal{N}'$, where $m \ge 3$. First, we show that for networks of depth at least $3$ in $\mathcal{N}'$, gradient flow may not converge to a local optimum of Problem 2, for both linear and ReLU networks, even when there are no zero weights vectors and the inputs to all hidden neurons are nonzero. More precisely, we prove this claim for diagonal networks.
Theorem 5.1.
Let $m \ge 3$. Let $\Phi$ be a depth-$m$ linear or ReLU diagonal neural network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization $\theta(0)$, such that gradient flow converges to zero loss, and converges in direction to a KKT point of Problem 2 which is not a local optimum. Moreover, all inputs to neurons in the computation on the dataset are nonzero.
Proof idea (for the complete proof see Appendix C.8).
We choose a dataset and an initialization under which gradient flow converges in direction to a KKT point in which all coordinates of all layers are nonzero. Then, we consider a small perturbation of these parameters, and show that if the perturbation is sufficiently small, then the perturbed parameters satisfy the constraints in Problem 2 with a strictly smaller norm. ∎
Note that in the case of linear networks, the above result is in contrast to depth-$2$ networks with sparse weights, which converge to a global optimum by Theorem 4.2, and to fully-connected networks of any depth, which converge to a global optimum by Theorem 3.1. In the case of ReLU networks, the above result is in contrast to the case of depth-$2$ networks studied in Theorem 4.3, where gradient flow is guaranteed to converge to a local optimum.
In light of our negative results, we now consider a weaker notion of margin maximization, namely, maximizing the margin for each layer separately. Let $\Phi$ be a neural network of depth $m$ in $\mathcal{N}$, parameterized by $\theta = [\theta_1, \dots, \theta_m]$. The maximum-margin problem for a layer $i \in [m]$ w.r.t. $\theta$ is the following:

$$\min_{\theta_i'} \ \frac{1}{2} \|\theta_i'\|^2 \quad \text{s.t.} \quad y_j \Phi([\theta_1, \dots, \theta_{i-1}, \theta_i', \theta_{i+1}, \dots, \theta_m]; \mathbf{x}_j) \ge 1 \ \ \forall j \in [n], \tag{4}$$

where all layers other than layer $i$ are held fixed at their values in $\theta$. We show a positive result for linear networks:
Theorem 5.2.
Let $m \ge 2$. Let $\Phi$ be any depth-$m$ linear neural network in $\mathcal{N}$, parameterized by $\theta = [\theta_1, \dots, \theta_m]$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow. Assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$. Then, gradient flow converges in direction to a KKT point $\tilde{\theta} = [\tilde{\theta}_1, \dots, \tilde{\theta}_m]$ of Problem 2, such that for every layer $i \in [m]$ the parameters vector $\tilde{\theta}_i$ is a global optimum of Problem 4 w.r.t. $\tilde{\theta}$.
The theorem follows easily by noticing that if $\Phi$ is a linear network, then the constraints in Problem 4 are affine, and its KKT conditions are implied by the KKT conditions of Problem 2. See Appendix C.9 for the formal proof. Note that by Theorems 4.1, 4.4 and 5.1, linear networks in $\mathcal{N}$ might converge in direction to a KKT point of Problem 2 which is not a local optimum. However, Theorem 5.2 implies that each layer of such a KKT point is a global optimum of Problem 4. Hence, any improvement to the KKT point requires changing at least two layers simultaneously.
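The key observation, that the output of a linear network is linear in the parameters of a single layer when all other layers are held fixed (so the margin constraints of the per-layer problem are affine), can be checked numerically; the depth and dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W3 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
x = rng.normal(size=3)

def out(W2):
    # depth-3 linear network with layers W1 and W3 held fixed
    return float(W3 @ (W2 @ (W1 @ x)))

A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
a, b = 2.0, -0.7

# the output is linear in the free layer's parameters
assert np.isclose(out(a * A + b * B), a * out(A) + b * out(B))
```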
While for linear networks gradient flow maximizes the margin for each layer separately, in the following theorem (which we prove in Appendix C.10) we show that this claim does not hold for ReLU networks: already for fully-connected networks of depth $2$, gradient flow may not converge in direction to a local optimum of Problem 4.
Theorem 5.3.
Let $\Phi$ be a fully-connected depth-$2$ ReLU network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss using gradient flow. There exists a dataset and an initialization $\theta(0)$ such that gradient flow converges to zero loss, and converges in direction to a KKT point $\tilde{\theta}$ of Problem 2, such that the weights of the first layer are not a local optimum of Problem 4 w.r.t. $\tilde{\theta}$.
Finally, we show that for ReLU networks in $\mathcal{N}$ of any depth, if the KKT point to which gradient flow converges in direction is such that the inputs to all hidden neurons are nonzero, then it must be a local optimum of Problem 4 (but not necessarily a global optimum). The proof follows the ideas from the proof of Theorem 5.2, with some required modifications, and is given in Appendix C.11.
Theorem 5.4.
Let $m \ge 2$. Let $\Phi$ be any depth-$m$ ReLU network in $\mathcal{N}$ parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ using gradient flow, and assume that there exists time $t_0$ such that $y_i \Phi(\theta(t_0); \mathbf{x}_i) > 0$ for every $i \in [n]$. Let $\tilde{\theta}$ be the KKT point of Problem 2 such that $\theta(t)$ converges to $\tilde{\theta}$ in direction (such $\tilde{\theta}$ exists by Theorem 2.1). Let $i \in [m]$ and assume that for every $j \in [n]$ the inputs to all neurons in the computation of $\Phi(\tilde{\theta}; \mathbf{x}_j)$ are nonzero. Then, the parameters vector $\tilde{\theta}_i$ is a local optimum of Problem 4 w.r.t. $\tilde{\theta}$. However, it may not be a global optimum.
6 Non-homogeneous networks
We define the normalized margin as follows:

$$\gamma(\theta) = \min_{i \in [n]} y_i \, \Phi\Big(\frac{\theta}{\|\theta\|}; \mathbf{x}_i\Big).$$

If $\Phi$ is homogeneous then maximizing the normalized margin is equivalent to solving Problem 2, i.e., minimizing $\|\theta\|$ under the margin constraints (cf. Lyu and Li (2019)). In this section we study the normalized margin in non-homogeneous networks.
Lyu and Li (2019) showed, under the assumptions from Theorem 2.1, that a smoothed version of the normalized margin is monotonically increasing when training homogeneous networks. More precisely, there is a function $\tilde{\gamma}(\theta)$ which is an additive approximation of the normalized margin, such that $\tilde{\gamma}(\theta(t))$ is monotonically non-decreasing. This result holds only for homogeneous networks; hence, it does not apply to networks with bias terms or skip connections. Lyu and Li (2019) observed empirically that the normalized margin is monotonically increasing also when training non-homogeneous networks. However, they did not provide a rigorous proof for this phenomenon and left it as an open problem. The experiments in Lyu and Li (2019) are on training convolutional neural networks (CNNs) with bias terms on MNIST.
In the following theorem we give an example of a simple non-homogeneous linear network where the normalized margin is monotonically decreasing during training. This example implies that, in order to rigorously establish the empirical phenomenon observed in non-homogeneous networks, some additional assumptions must be made.
Theorem 6.1.
Let be a depth linear network with input dimension , width , and a skip connection. Namely, is parameterized by where , and we have . Consider the dataset of size , and assume that . Then, gradient flow w.r.t. either the exponential loss or the logistic loss converges to zero loss, converges in direction (i.e., exists), and the normalized margin is monotonically decreasing during training, i.e., for all . Moreover, we have and .
We note that the proof readily extends in a few directions: it also applies to a depth network without a skip connection but with a bias term in the output neuron; it also holds for ReLU networks; and the theorem also applies to the smoothed version of the normalized margin considered in Lyu and Li (2019). The proof of the theorem is given in Appendix C.12. Intuitively, note that if and then , and if and then . Also, since the partial derivative of the loss w.r.t. depends on (respectively) and on , while the partial derivative w.r.t. depends only on , then grow faster than during training. Hence, as increases, decreases and increase.
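To illustrate why a skip connection destroys homogeneity, consider a hypothetical minimal variant (an illustration of ours, not necessarily the exact network of Theorem 6.1): a width-1 depth-2 linear network $f_\theta(x) = v u x + s x$ with parameters $\theta = (u, v, s)$, where $s$ is the skip-connection weight. Scaling all parameters by $c > 0$ gives

```latex
f_{c\theta}(x) \;=\; (cv)(cu)\, x + (cs)\, x \;=\; c^{2}\, v u\, x + c\, s\, x ,
```

which is not of the form $c^{L} f_\theta(x)$ for any fixed $L$, so no single degree of homogeneity applies and the monotonicity result of Lyu and Li (2019) does not cover such a network. The product path scales quadratically while the skip path scales only linearly, matching the intuition above that some weights grow faster than others during training.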
Acknowledgements
This research is supported in part by European Research Council (ERC) grant 754705.
References
 Arora et al. [2019] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7413–7424, 2019.
 Azulay et al. [2021] S. Azulay, E. Moroshko, M. S. Nacson, B. Woodworth, N. Srebro, A. Globerson, and D. Soudry. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. arXiv preprint arXiv:2102.09769, 2021.
 Belabbas [2020] M. A. Belabbas. On implicit regularization: Morse functions and applications to matrix factorization. arXiv preprint arXiv:2001.04264, 2020.
 Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
 Clarke et al. [2008] F. H. Clarke, Y. S. Ledyaev, R. J. Stern, and P. R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
 Du et al. [2018] S. S. Du, W. Hu, and J. D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems, pages 384–395, 2018.
 Dutta et al. [2013] J. Dutta, K. Deb, R. Tulshyan, and R. Arora. Approximate kkt points and a proximity measure for termination. Journal of Global Optimization, 56(4):1463–1499, 2013.
 Eftekhari and Zygalakis [2020] A. Eftekhari and K. Zygalakis. Implicit regularization in matrix sensing: A geometric view leads to stronger results. arXiv preprint arXiv:2008.12091, 2020.
 Gidel et al. [2019] G. Gidel, F. Bach, and S. Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. In Advances in Neural Information Processing Systems, pages 3202–3211, 2019.
 Gunasekar et al. [2018a] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
 Gunasekar et al. [2018b] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.
 Gunasekar et al. [2018c] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2018c.
 Jagadeesan et al. [2021] M. Jagadeesan, I. Razenshteyn, and S. Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. arXiv preprint arXiv:2102.12238, 2021.
 Ji and Telgarsky [2018a] Z. Ji and M. Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018a.
 Ji and Telgarsky [2018b] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018b.
 Ji and Telgarsky [2020] Z. Ji and M. Telgarsky. Directional convergence and alignment in deep learning. arXiv preprint arXiv:2006.06657, 2020.
 Ji and Telgarsky [2021] Z. Ji and M. Telgarsky. Characterizing the implicit bias via a primaldual analysis. In Algorithmic Learning Theory, pages 772–804. PMLR, 2021.
 Ji et al. [2020] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, pages 2109–2136. PMLR, 2020.
 Li et al. [2018] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
 Li et al. [2020] Z. Li, Y. Luo, and K. Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.
 Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
 Ma et al. [2018] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3345–3354. PMLR, 2018.
 Moroshko et al. [2020] E. Moroshko, B. E. Woodworth, S. Gunasekar, J. D. Lee, N. Srebro, and D. Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in Neural Information Processing Systems, 33, 2020.
 Nacson et al. [2019a] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019a.
 Nacson et al. [2019b] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019b.
 Neyshabur et al. [2014] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
 Neyshabur et al. [2017] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
 Razin and Cohen [2020] N. Razin and N. Cohen. Implicit regularization in deep learning may not be explainable by norms. arXiv preprint arXiv:2005.06398, 2020.
 Ruszczynski [2011] A. Ruszczynski. Nonlinear optimization. Princeton University Press, 2011.
 Shamir [2020] O. Shamir. Gradient methods never overfit on separable data. arXiv preprint arXiv:2007.00028, 2020.
 Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
 Vardi and Shamir [2021] G. Vardi and O. Shamir. Implicit regularization in relu networks with the square loss. In Conference on Learning Theory, pages 4224–4258. PMLR, 2021.
 Woodworth et al. [2020] B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. arXiv preprint arXiv:2002.09277, 2020.
 Yun et al. [2020] C. Yun, S. Krishnan, and H. Mobahi. A unifying view on implicit bias in training linear neural networks. arXiv preprint arXiv:2010.02501, 2020.
 Zhang et al. [2016] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A More related work
Soudry et al. [2018] showed that gradient descent on linearly-separable binary classification problems with exponentially-tailed losses (e.g., the exponential loss and the logistic loss) converges to the maximum-margin direction. This analysis was extended to other loss functions, tighter convergence rates, non-separable data, and variants of gradient-based optimization algorithms [Nacson et al., 2019b, Ji and Telgarsky, 2018b, Ji et al., 2020, Gunasekar et al., 2018a, Shamir, 2020, Ji and Telgarsky, 2021].
As detailed in Section 2, Lyu and Li [2019] and Ji and Telgarsky [2020] showed that gradient flow on homogeneous neural networks with exponential-type losses converges in direction to a KKT point of the maximum-margin problem in the parameter space. Similar results under stronger assumptions were previously obtained in Nacson et al. [2019a], Gunasekar et al. [2018b]. The implications of margin maximization in the parameter space on the implicit bias in the predictor space for linear neural networks were studied in Gunasekar et al. [2018b] (as detailed in Section 2) and also in Jagadeesan et al. [2021]. Margin maximization in the predictor space for fully-connected linear networks was shown by Ji and Telgarsky [2020] (as detailed in Section 2), and similar results under stronger assumptions were previously established in Gunasekar et al. [2018b] and in Ji and Telgarsky [2018a]. The implicit bias in the predictor space of diagonal and convolutional linear networks was studied in Gunasekar et al. [2018b], Moroshko et al. [2020], Yun et al. [2020]. The implicit bias in infinitely-wide two-layer homogeneous neural networks was studied in Chizat and Bach [2020].
Finally, the implicit bias of neural networks in regression tasks w.r.t. the square loss was also extensively studied in recent years (e.g., Gunasekar et al. [2018c], Razin and Cohen [2020], Arora et al. [2019], Belabbas [2020], Eftekhari and Zygalakis [2020], Li et al. [2018], Ma et al. [2018], Woodworth et al. [2020], Gidel et al. [2019], Li et al. [2020], Yun et al. [2020], Vardi and Shamir [2021], Azulay et al. [2021]). This setting, however, is less relevant for our work.
Appendix B Preliminaries on the KKT condition
Below we review the definition of the KKT condition for non-smooth optimization problems (cf. Lyu and Li [2019], Dutta et al. [2013]).
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a locally Lipschitz function. The Clarke subdifferential [Clarke et al., 2008] at $x \in \mathbb{R}^d$ is the convex set
$\partial^{\circ} f(x) := \mathrm{conv}\left\{ \lim_{k \to \infty} \nabla f(x_k) \;:\; x_k \to x,\ f \text{ is differentiable at } x_k \right\}.$
If $f$ is continuously differentiable at $x$ then $\partial^{\circ} f(x) = \{\nabla f(x)\}$.
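As a standard illustration (a textbook example, not taken from this paper), the Clarke subdifferential of the ReLU function $\sigma(x) = \max\{x, 0\}$ on $\mathbb{R}$ is

```latex
\partial^{\circ} \sigma(x) \;=\;
\begin{cases}
\{0\} & \text{if } x < 0 ,\\
[0, 1] & \text{if } x = 0 ,\\
\{1\} & \text{if } x > 0 ,
\end{cases}
```

so at the kink $x = 0$ the subdifferential is the convex hull of the two limiting gradients $0$ and $1$.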
Consider the following optimization problem
(5) $\min_{x} f(x) \quad \text{s.t.} \quad g_n(x) \le 0 \;\; \forall n \in [N],$
where $f, g_1, \ldots, g_N : \mathbb{R}^d \to \mathbb{R}$ are locally Lipschitz functions. We say that $x$ is a feasible point of Problem 5 if $x$ satisfies $g_n(x) \le 0$ for all $n \in [N]$. We say that a feasible point $x$ is a KKT point if there exist $\lambda_1, \ldots, \lambda_N \ge 0$ such that:

1. $0 \in \partial^{\circ} f(x) + \sum_{n \in [N]} \lambda_n \, \partial^{\circ} g_n(x)$;

2. For all $n \in [N]$ we have $\lambda_n \, g_n(x) = 0$.
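As a toy illustration of these conditions (an example of ours, not from the paper), consider minimizing $f(x) = x^2$ over $\mathbb{R}$ subject to the single constraint $g(x) = 1 - x \le 0$:

```latex
% Stationarity: 0 \in \partial^{\circ} f(x) + \lambda \, \partial^{\circ} g(x)
%   reduces to 0 = 2x - \lambda, i.e., \lambda = 2x.
% Complementary slackness: \lambda \, g(x) = \lambda (1 - x) = 0.
% Feasibility (x \ge 1) together with \lambda \ge 0 forces
(x^{\ast}, \lambda^{\ast}) = (1, 2) .
```

Here the problem is convex, so the KKT point is automatically the global optimum; for the non-convex maximum-margin problem studied in this paper, being a KKT point need not imply even local optimality.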
Appendix C Proofs
C.1 Auxiliary lemmas
Throughout our proofs we use the following two lemmas from Du et al. [2018]:
Lemma C.1 (Du et al. [2018]).
Let , and let be a depth fullyconnected linear or ReLU network parameterized by . Suppose that for every we have . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have
Moreover, for every and we have
where is the vector of incoming weights to the th neuron in the th hidden layer (i.e., the th row of ), and is the vector of outgoing weights from this neuron (i.e., the th column of ).
Lemma C.2 (Du et al. [2018]).
Let , and let be a depth linear or ReLU network in , parameterized by . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have
Note that Lemma C.2 considers a larger family of neural networks since it allows sparse and shared weights, but Lemma C.1 gives a stronger guarantee, since it implies balancedness between the incoming and outgoing weights of each hidden neuron separately. In our proofs we will also need to use a balancedness property for each hidden neuron separately in depth networks with sparse weights. Since this property is not implied by the above lemmas from Du et al. [2018], we now prove it.
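The per-neuron balancedness property can also be checked numerically: under gradient flow on a two-layer ReLU network, the quantity $\lVert w_j \rVert^2 - v_j^2$ for each hidden neuron is conserved, and small-step gradient descent preserves it up to discretization error. A minimal sketch (the dimensions, data, and step size here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 4, 8                      # input dim, hidden width, #examples (illustrative)
W = 0.5 * rng.normal(size=(m, d))      # incoming weights of the m hidden neurons
v = 0.5 * rng.normal(size=m)           # outgoing weights to the output neuron
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def grads(W, v):
    """Gradients of the exponential loss sum_i exp(-y_i f(x_i)) for f(x) = v . relu(W x)."""
    z = X @ W.T                        # pre-activations, shape (n, m)
    h = np.maximum(z, 0.0)             # ReLU
    f = h @ v                          # network outputs, shape (n,)
    g_f = -y * np.exp(-y * f)          # dL/df per example
    g_v = h.T @ g_f                    # gradient w.r.t. second layer
    g_W = (g_f[:, None] * (z > 0) * v).T @ X   # gradient w.r.t. first layer
    return g_W, g_v

balance0 = np.sum(W**2, axis=1) - v**2  # per-neuron ||w_j||^2 - v_j^2 at init
lr = 1e-4
for _ in range(5000):                   # small steps approximate gradient flow
    g_W, g_v = grads(W, v)
    W -= lr * g_W
    v -= lr * g_v

balance = np.sum(W**2, axis=1) - v**2
drift = np.max(np.abs(balance - balance0))
print(drift)                            # near zero: balancedness is (approximately) conserved
```

The conservation is exact for gradient flow because $w_j \cdot \nabla_{w_j} L = v_j \, \partial L / \partial v_j$ for each hidden neuron (each neuron's computation is 1-homogeneous in its incoming weights); the small residual drift here is purely the discretization error of finite steps.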
Before stating the lemma, let us introduce some required notation. Let be a depth network in . We can always assume w.l.o.g. that the second layer is fully connected, namely, that all hidden neurons are connected to the output neuron; otherwise, we can simply ignore the neurons that are not connected to the output neuron. For the network we use the parameterization , where is the number of hidden neurons. For every , the vector is the weights vector of the th hidden neuron, and we have where is the input dimension. For an input we denote by the subvector of consisting of the coordinates of that are connected to the th hidden neuron. Thus, given , the input to the th hidden neuron is . The vector is the weights vector of the second layer. Overall, we have .
Lemma C.3.
Let be a depth linear or ReLU network in , parameterized by . Consider minimizing any differentiable loss function (e.g., the exponential or the logistic loss) over a dataset using gradient flow. Then, for every at all time we have
Proof.
We have
Hence
Moreover,