In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus the ability to develop principled improvements, has lagged behind. A satisfactory theoretical characterization of deep learning is emerging. It covers the following questions: 1) representation power of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques: why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized? We refer to the last question as the non-overfitting puzzle, around which several recent papers revolve (see among others Hardt2016 ; NeyshaburSrebro2017 ; Sapiro2017 ; 2017arXiv170608498B ; Musings2017 ). This paper addresses the third question.
2 Deep networks: definitions and properties
Definitions We define a deep network with layers with the usual coordinate-wise scalar activation functions as the set of functions , where the input is , the weights are given by the matrices , one per layer, with matching dimensions. We use the symbol as a shorthand for the set of matrices . For simplicity we consider here the case of binary classification in which takes scalar values, implying that the last layer matrix is . The labels are . The weights of hidden layer are collected in a matrix of size . There are no biases apart from the input layer where the bias is instantiated by one of the input dimensions being a constant. The activation function in this paper is the ReLU activation.
Positive one-homogeneity For ReLU activations the following positive one-homogeneity property holds . For the network this implies , where with the Frobenius norm (for convenience). This implies the following property of ReLU networks w.r.t. their Rademacher complexity:
is the class of neural networks described above and accordingly is the corresponding class of normalized neural networks. This invariance property of the function under transformations of that leave the product norm the same is typical of ReLU (and linear) networks. In the paper we will refer to the norm of meaning the product of the Frobenius norms of the weight matrices of . Thus . Note that
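The homogeneity property can be checked numerically. The sketch below is only an illustration under assumed notation that is not taken from the text: `weights` holds the layer matrices, `V` their Frobenius-normalized versions, `rhos` the per-layer norms and `rho` their product. The layer sizes and the random initialization are arbitrary choices.

```python
import math
import random


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def forward(weights, x):
    """Bias-free ReLU network with a linear scalar output layer."""
    h = x
    for W in weights[:-1]:
        h = [max(0.0, z) for z in matvec(W, h)]
    return matvec(weights[-1], h)[0]


def frobenius(M):
    return math.sqrt(sum(v * v for row in M for v in row))


rng = random.Random(0)
sizes = [3, 4, 4, 1]  # hypothetical architecture: 3 inputs, two hidden layers, scalar output
weights = [
    [[rng.gauss(0.0, 1.0) for _ in range(sizes[k])] for _ in range(sizes[k + 1])]
    for k in range(len(sizes) - 1)
]
x = [0.5, -1.0, 1.0]  # the constant last coordinate plays the role of the bias

# Normalize every layer to unit Frobenius norm and collect the norms.
rhos = [frobenius(W) for W in weights]
V = [[[v / r for v in row] for row in W] for W, r in zip(weights, rhos)]
rho = math.prod(rhos)

f_W = forward(weights, x)  # unnormalized network
f_V = forward(V, x)        # normalized network
print(f_W, rho * f_V)      # the two numbers should agree up to rounding
```

Because the output only rescales by the positive factor `rho`, the sign of the network output, and hence the classification, is unchanged by the normalization.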
Structural property The following structural property of the gradient of deep ReLU networks is sometimes useful (Lemma 2.1 of DBLP:journals/corr/abs-1711-01530 ):
for . Equation 3 can be rewritten as an inner product
is the vectorized representation of the weight matrices for each of the different layers (each matrix is a vector)
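A numerical sanity check of the structural property is sketched below, assuming it takes the Euler form for one-homogeneous functions: for each layer, the inner product between the layer weights and the gradient of the output with respect to them equals the output itself. The gradient is approximated with central finite differences; the network sizes, the seed and the step `h` are hypothetical choices.

```python
import random


def forward(weights, x):
    """Bias-free ReLU network with a linear scalar output layer."""
    h = x
    for W in weights[:-1]:
        h = [max(0.0, sum(m * v for m, v in zip(row, h))) for row in W]
    return sum(m * v for m, v in zip(weights[-1][0], h))


def euler_sum(weights, x, k, h=1e-6):
    """Sum over i, j of W_k[i][j] * d f / d W_k[i][j], by central differences."""
    W = weights[k]
    total = 0.0
    for i in range(len(W)):
        for j in range(len(W[i])):
            w_ij = W[i][j]
            W[i][j] = w_ij + h
            f_plus = forward(weights, x)
            W[i][j] = w_ij - h
            f_minus = forward(weights, x)
            W[i][j] = w_ij  # restore the original weight
            total += w_ij * (f_plus - f_minus) / (2.0 * h)
    return total


rng = random.Random(1)
sizes = [3, 4, 4, 1]
weights = [
    [[rng.gauss(0.0, 1.0) for _ in range(sizes[k])] for _ in range(sizes[k + 1])]
    for k in range(len(sizes) - 1)
]
x = [0.5, -1.0, 1.0]

f = forward(weights, x)
sums = [euler_sum(weights, x, k) for k in range(len(weights))]
print(f, sums)  # every entry of `sums` should match f
```

The identity follows from differentiating the output with respect to a positive scale factor applied to a single layer, which is why it holds layer by layer.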
Gradient flow and continuous approximation We will speak of the gradient flow of the empirical risk (or sometimes of the flow of if the context makes clear that one speaks of the gradient of ) referring to
where is the learning rate. In the following we will mix the continuous formulation with the discrete version whenever we feel this is appropriate for the specific statement. We are well aware that the two are not equivalent but we are happy to leave a careful analysis – especially of the discrete case – to better mathematicians.
Maximization by exponential With being the normalized network (weights at each layer are normalized by the Frobenius norm of the layer matrix) and being the product of the Frobenius norms, the exponential loss approximates for “large” a max operation, selecting among all the data points the ones with the smallest margin . Thus minimization of for large corresponds to margin maximization
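The max-like behavior of the exponential loss can be illustrated with a few numbers. In the sketch below, `margins` are hypothetical values of the normalized margin on four data points and `rho` is the product of norms; both names and values are assumptions for illustration. The quantity `-(1/rho) * log(sum_n exp(-rho * margin_n))` approaches the smallest margin as `rho` grows, and the relative weight of each point in the loss concentrates on the point with the smallest margin.

```python
import math


def soft_min_margin(margins, rho):
    """-(1/rho) * log(sum_n exp(-rho * m_n)), computed in a numerically stable way."""
    m_min = min(margins)
    s = sum(math.exp(-rho * (m - m_min)) for m in margins)
    return m_min - math.log(s) / rho


def loss_weights(margins, rho):
    """Relative contribution of each data point to the exponential loss."""
    m_min = min(margins)
    e = [math.exp(-rho * (m - m_min)) for m in margins]
    z = sum(e)
    return [v / z for v in e]


margins = [0.9, 0.35, 0.3, 1.2]  # hypothetical normalized margins
for rho in (1.0, 10.0, 100.0, 1000.0):
    print(rho, soft_min_margin(margins, rho), loss_weights(margins, rho))
```

For `N` data points the soft minimum lies between `min(margins) - log(N) / rho` and `min(margins)`, so decreasing the loss at large `rho` amounts to increasing the smallest margin.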
3 A semi-rigorous theory of the optimization landscape of Deep Nets: Bezout theorem and Boltzmann distribution
In theory_II ; theory_IIb we consider Deep Networks in which each ReLU nonlinearity is replaced by a univariate polynomial approximating it. Empirically the network behaves in a quantitatively identical way in our tests. We then consider such a network in the context of regression under a square loss function. As usual we assume that the network is over-parametrized, that is the number of weights is larger than the number of data points . The critical points of the gradient consist of
global minima corresponding to interpolating networks for which;
critical points which correspond to saddles and to local minima for which the loss is not zero but ,
Suppose that the polynomial network does not have any of the symmetries characteristic of the ReLU network, such as one-homogeneity. In the case of the global, interpolating minimizers, the function is a polynomial in the weights (and also a polynomial in the inputs ). The degree of each equation is determined by the degree of the univariate polynomial and by the number of layers . Since the system of polynomial equations, unless the equations are inconsistent, is generically underdetermined (as many equations as data points in a larger number of unknowns), Bezout's theorem suggests an infinite number of degenerate global minima, under the form of regions of zero empirical error (the set of all solutions is an algebraically closed set of dimension at least ). Notice that if an underdetermined system is chosen at random, the dimension of its zeros is equal to with probability one.
The critical points of the gradient that are not global minimizers are given by the set of equations . This is a set of polynomial equations in unknowns: . If this were a generic system of polynomial equations, we would expect a set of isolated critical points. A more careful analysis, as suggested by J. Bloom, follows the Preimage theorem and gives the result that while the degeneracy of the global zeros is , the degeneracy of the critical points is . Thus the global minima are more degenerate than the critical points if the overparametrization is at least .
(The Preimage Theorem Hintchin12). Let be a smooth map between manifolds, and let such that at each point , the derivative is surjective. Then is a smooth manifold of dimension .
In our case the proof goes as follows. Consider the maps , that is , where is a polynomial in the s with coefficients provided by the training example and . As an example assume that . Then for, say, , . Assume . Consider the set . We need to check that is surjective for any .
is a linear transformation from to given by . It is enough to find a single vector such that . For instance choose . Then . Therefore is surjective and the preimage of is a manifold of dimensionality .
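The dimension count given by the Preimage Theorem can be illustrated on a toy case. The model below is hypothetical and is not the example used in the text: a polynomial "network" `f(w; x) = w3 * (w1 * x + w2)**2` with three weights and two data points. The labels are generated from a reference weight vector `w0`, so that `w0` interpolates the data. If the Jacobian of the interpolation map has full row rank at `w0`, the zero-error set near `w0` is a manifold of dimension (number of weights) minus (number of data points).

```python
def f(w, x):
    """Hypothetical polynomial model with a quadratic activation."""
    w1, w2, w3 = w
    return w3 * (w1 * x + w2) ** 2


def jacobian(w, xs):
    """Rows: gradient of f(w; x_n) with respect to (w1, w2, w3)."""
    w1, w2, w3 = w
    rows = []
    for x in xs:
        u = w1 * x + w2
        rows.append([2.0 * w3 * u * x, 2.0 * w3 * u, u * u])
    return rows


def rank(rows, tol=1e-10):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in rows]
    n_rows, n_cols = len(A), len(A[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) < tol:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, n_rows):
            factor = A[i][c] / A[r][c]
            for j in range(c, n_cols):
                A[i][j] -= factor * A[r][j]
        r += 1
    return r


w0 = [1.0, 0.5, 2.0]          # reference weights (assumed)
xs = [1.0, -2.0]              # two training inputs (assumed)
ys = [f(w0, x) for x in xs]   # labels chosen so that w0 interpolates

J = jacobian(w0, xs)
jac_rank = rank(J)
zero_set_dim = len(w0) - jac_rank
print(jac_rank, zero_set_dim)
```

Here the derivative is surjective at `w0`, so the interpolating solutions near `w0` form a one-dimensional curve rather than an isolated point.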
The argument can be extended to the case in which there are degeneracies due to intrinsic symmetries of the network, for instance corresponding to invariances under a continuous group (discrete groups such as the permutation groups will not induce infinite degeneracy). Suppose that the effective dimensionality of the symmetries is . Assume for simplicity that the symmetries are the same for and . The constraints induced by symmetries (which could be represented as additional equations) reduce the number of effective parameters from to . Then, while all the critical points will be non-degenerate in this reduced number of effective parameters, the global minima will be degenerate on a set of dimension at least . If the latter is still larger than zero, the global minima will be degenerate on an algebraic variety of higher dimension than the local minima, that is on a much larger volume in parameter space.
Thus, we have
(informal statement): For appropriate overparametrization (see Bloom, Banburski, Poggio 2019), there is a large number of global zero-error minimizers which are degenerate; the other critical points (saddles and local minima) are generically (that is, with probability one) degenerate on a set of lower dimensionality.
The second part of our argument (in theory_IIb ) is that SGD concentrates on the most degenerate minima. The argument is based on the similarity between a Langevin equation and SGD and on the fact that the Boltzmann distribution is formally the asymptotic “solution” of the stochastic differential Langevin equation and also of SGDL, defined as SGD with added white noise (see for instance raginskyetal17 ). The Boltzmann distribution is
where is a normalization constant, is the loss and reflects the noise power. The equation implies that SGDL prefers degenerate minima relative to non-degenerate ones of the same depth. In addition, among two minimum basins of equal depth, the one with a larger volume is much more likely in high dimensions as shown by the simulations in theory_IIb . Taken together, these two facts suggest that SGD selects degenerate minimizers corresponding to larger isotropic flat regions of the loss. Then SGDL shows concentration, because of the high dimensionality, of its asymptotic distribution Equation 7.
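The preference for larger basins can be made concrete with a small calculation. The sketch assumes the standard Boltzmann form, with density proportional to `exp(-L / T)`, and models each basin as an isotropic quadratic `L(w) = ||w||^2 / (2 * sigma^2)` with the same minimum value. The widths `sigma`, the temperature `T` and the dimension are hypothetical. In one dimension the mass is obtained by numerical integration; in `dim` dimensions the ratio of masses is the one-dimensional ratio raised to the power `dim`.

```python
import math


def basin_mass_1d(sigma, T, half_width=20.0, step=1e-3):
    """Riemann-sum estimate of the integral of exp(-L(w)/T), L(w) = w^2 / (2 sigma^2)."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        w = -half_width + i * step
        total += math.exp(-(w * w) / (2.0 * sigma * sigma * T))
    return total * step


def log_mass_ratio(sigma_wide, sigma_narrow, dim):
    """log(mass of wide basin / mass of narrow basin) for isotropic quadratic basins."""
    return dim * math.log(sigma_wide / sigma_narrow)


T = 0.5  # noise level (assumed)
ratio_1d = basin_mass_1d(1.1, T) / basin_mass_1d(1.0, T)
print(ratio_1d)                                            # close to 1.1
print(log_mass_ratio(1.1, 1.0, 1000) / math.log(10.0))     # orders of magnitude in 1000 dims
```

A basin that is only ten percent wider in every direction carries about 1.1 times the mass in one dimension but more than forty orders of magnitude more mass in a thousand dimensions, which is the concentration effect invoked above.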
(informal statement): For appropriate overparametrization of the deep network, SGD selects with high probability the global minimizers of the empirical loss, which are highly degenerate.
4 Related work
There are many recent papers studying optimization and generalization in deep learning. For optimization we mention work based on the idea that noisy gradient descent DBLP:journals/corr/Jin0NKJ17 ; DBLP:journals/corr/GeHJY15 ; pmlr-v49-lee16 ; s.2018when can find a global minimum. More recently, several authors studied the dynamics of gradient descent for deep networks with assumptions about the input distribution or on how the labels are generated. They obtain global convergence for some shallow neural networks Tian:2017:AFP:3305890.3306033 ; s8409482 ; Li:2017:CAT:3294771.3294828 ; DBLP:conf/icml/BrutzkusG17 ; pmlr-v80-du18b ; DBLP:journals/corr/abs-1811-03804 . Some local convergence results have also been proved Zhong:2017:RGO:3305890.3306109 ; DBLP:journals/corr/abs-1711-03440 ; 2018arXiv180607808Z . The most interesting such approach is DBLP:journals/corr/abs-1811-03804 , which focuses on minimizing the training loss and proving that randomly initialized gradient descent can achieve zero training loss (see also NIPS2018_8038 ; du2018gradient ; DBLP:journals/corr/abs-1811-08888 ) as in section 3. In summary, there is by now an extensive literature on optimization that formalizes and refines to different special cases and to the discrete domain our results of Theory II and IIb (see section 3).
For generalization, which is the topic of this paper, existing work demonstrates that gradient descent works under the same situations as kernel methods and random feature methods NIPS2017_6836 ; DBLP:journals/corr/abs-1811-04918 ; Arora2019FineGrainedAO . Closest to our approach, which is focused on the role of batch and weight normalization, is the paper Wei2018OnTM . Its authors study generalization assuming a regularizer because they are, like us, interested in normalized margin. Unlike their assumption of an explicit regularization, we show here that commonly used techniques, such as batch normalization, in fact normalize margin without the need to add a regularizer or to use weight decay.
5 Preliminaries on Generalization
Classical generalization bounds for regression suggest that bounding the complexity of the minimizer provides a bound on generalization. Ideally, the optimization algorithm should select the smallest complexity minimizers among the solutions, that is, in the case of ReLU networks, the minimizers with minimum norm. An approach to achieve this goal is to add a vanishing regularization term to the loss function (the parameter goes to zero with iterations) that, under certain conditions, provides convergence to the minimum norm minimizer, independently of initial conditions. This approach goes back to the Halpern fixed point theorem halpern1967 ; it is also independently suggested by other techniques such as Lagrange multipliers, normalization and margin maximization theorems DBLP:conf/nips/RossetZH03 .
Well-known margin bounds for classification suggest a similar (see Appendix 10) approach: maximization of the margin of the normalized network (the weights at each layer are normalized by the Frobenius norm of the weight matrix of the layer). The margin is the value of over the support vectors (the data with smallest margin , assuming ).
In the case of nonlinear deep networks, the critical points of the gradient of an exponential-type loss include saddles, local minima (if they exist) and global minima of the loss function; the latter are generically degenerate theory_II . A similar approach to the linear case leads to minimum norm solutions, independently of initial conditions.
5.1 Regression: (local) minimum norm empirical minimizers
We recall that generalization bounds Bousquet2003 apply to with probability at least and have the typical form
where is the expected loss, is the empirical Rademacher average of the class of functions measuring its complexity; are constants that depend on properties of the Lipschitz constant of the loss function, and on the architecture of the network.
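For orientation, bounds of this kind are commonly written in the following textbook form. The symbols are assumptions chosen here for illustration: $L(f)$ is the expected loss, $\hat{L}(f)$ the empirical loss on $N$ samples, $\mathbb{R}_N(\mathbb{F})$ the empirical Rademacher average of the class $\mathbb{F}$, $\delta$ the confidence parameter, and $c_1, c_2$ the constants mentioned above.

```latex
\left| L(f) - \hat{L}(f) \right|
  \;\le\; c_1\, \mathbb{R}_N(\mathbb{F})
  \;+\; c_2 \sqrt{\frac{\ln(1/\delta)}{2N}}
\qquad \text{for all } f \in \mathbb{F},
\ \text{with probability at least } 1 - \delta .
```

In a bound of this shape, the only term that the training algorithm can influence through the weights is the complexity term, which for ReLU networks scales with the product of the layer norms.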
The bound together with the property Equation 1 implies that among the minimizers with zero square loss, the optimization algorithm should select the minimum norm solution. In any case, the algorithm should control the norm . Standard GD or SGD algorithms do not provide an explicit control of the norm. Empirically it seems that initialization with small weights helps – as in the linear case (see Figures and see section 7). We propose a slight modification of the standard gradient descent algorithms to provide a norm-minimizing GD update – NMGD in short – as
where is the learning rate and (this is one of several choices) is the vanishing regularization-like Halpern (see Appendix 12) term.
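The effect of a vanishing regularization-like term can be seen on the smallest underdetermined problem: one linear equation in two unknowns with square loss. The sketch below is an illustration only. The update is assumed to be `w <- w - lr * (grad + lam_t * w)`, and the schedule `lam_t = lam0 / sqrt(t + 1)` is a hypothetical choice, not necessarily the one intended in the text. Plain gradient descent is shown for comparison.

```python
def run(a, b, w0, steps=20000, lr=0.1, lam0=0.0):
    """Gradient descent on 0.5 * (<a, w> - b)^2 with a vanishing term lam_t * w."""
    w = list(w0)
    for t in range(steps):
        lam = lam0 / (t + 1) ** 0.5  # hypothetical vanishing schedule; zero if lam0 == 0
        resid = sum(ai * wi for ai, wi in zip(a, w)) - b
        w = [wi - lr * (resid * ai + lam * wi) for ai, wi in zip(a, w)]
    return w


a, b = [1.0, 2.0], 5.0   # single equation w1 + 2 * w2 = 5
min_norm = [1.0, 2.0]    # b * a / ||a||^2, the minimum norm interpolant

nmgd_a = run(a, b, [3.0, -1.0], lam0=1.0)
nmgd_b = run(a, b, [-4.0, 2.5], lam0=1.0)
plain = run(a, b, [3.0, -1.0], lam0=0.0)

print(nmgd_a, nmgd_b)  # both close to the minimum norm solution
print(plain)           # interpolates, but keeps the component of the initialization
```

With the vanishing term, both initializations end near the minimum norm interpolant, whereas plain gradient descent from `[3.0, -1.0]` converges to the interpolant `[3.8, 0.6]`, which has a larger norm.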
5.2 Classification: maximizing the margin of the normalized minimizer
A typical margin bound for classification Shawe-Taylor:2004:KMP:975545 is
where is the margin, is the expected classification error, is the empirical loss of a surrogate loss such as the logistic or the exponential. For a point , the margin is . Since , the margin bound is optimized by effectively maximizing on the “support vectors” – that is the s.t .
We show (see Appendix 10) that for separable data, maximizing the margin subject to a unit norm constraint is equivalent to minimizing the norm of subject to a constraint on the margin. A regularized loss with an appropriately vanishing regularization parameter is a closely related optimization technique. For this reason we will refer to the solutions in all these cases as minimum norm. This view treats interpolation (in the regression case) and classification (in the margin case) in a unified way.
6 Gradient descent with norm constraint
In this section we focus on the classification case with an exponential loss function. The generalization bounds in the previous section are satisfied by maximizing the margin subject to the product of the norms being equal to one:
In words: find the network weights that maximize the margin subject to a norm constraint. The latter ensures a bounded Rademacher complexity and together they minimize the term . In fact, existing generalization bounds such as Equation 6 in 2017arXiv171206541G, see also 2017arXiv170608498B are given in terms of products of upper bounds on the norm of each layer: the bounds require that each layer be bounded, rather than just that the product be bounded.
This constraint is implied by a unit constraint on the norm of each layer which defines an equivalence class of networks because of Eq. (1).
A direct approach is to minimize the exponential loss function , subject to , that is under a unit norm constraint for the weight matrix at each layer. Clearly these constraints imply the constraint on the product of weight matrices in (11). As we discuss later (see Appendices and 845952 ), there are several ways to implement the minimization in the tangent space of . The interesting observation is that they are closely related to gradient descent techniques widely used for training deep networks, such as weight normalization (WN) SalDied16 and batch normalization (BN) ioffe2015batch . In the following we describe one of the techniques, the Lagrange multiplier method, because it enforces the constraint from the generalization bounds in a transparent way.
6.1 Lagrange multiplier method
We define the loss
where the Lagrange multipliers are chosen to satisfy at convergence or when the algorithm is stopped (the constraint can also be enforced at each iteration, see later).
We perform gradient descent on with respect to . We obtain for
and for each layer
The sequence must satisfy .
Since the first term in the right hand side of Equation (14) goes to zero with and the Lagrange multipliers also go to zero, the normalized weight vectors converge at infinity with . On the other hand, grows to infinity. Interestingly, as shown in section 7, the norm square of each layer grows at the same rate.
Let us assume that starting at some time , is large enough that the following asymptotic expansion (as ) is a good approximation: , where is the multiplicity of the minimal .
The data points with the corresponding minimum value of the margin are the support vectors. They are a subset of cardinality of the datapoints, all with the same margin . In particular, the term becomes .
A rigorous proof of the argument above can be regarded as an extension of the main theorem in DBLP:conf/nips/RossetZH03 from the case of linear functions to the case of one-homogeneous functions. In fact, while updating the present version of this paper we noticed that 2018arXiv181005369W has theorems including such an extension.
If we impose the conditions at each , must satisfy
where we redefined as the quantity . Thus
goes to zero at infinity because does.
It is possible to add a regularization term to the equation for . The effect of regularization is to bound to a maximum size , controlled by a fixed regularization parameter : in this case the dynamics of converges to a (very large) set by a (very small) value of .
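The mechanics of descent under a unit norm constraint can be sketched on a linear classifier in two dimensions. Everything in this example is an assumption made for illustration: the four separable data points in `DATA`, the fixed scale `rho` (in the text it grows during training), the step size, and the choice to enforce the constraint at every iteration. The multiplier `lam` is taken to be the radial component of the gradient, so that the step stays in the tangent space of the unit circle, and the weight vector is renormalized after each step.

```python
import math

# Hypothetical separable data: (input, label)
DATA = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]


def margins(v):
    return [y * (v[0] * x[0] + v[1] * x[1]) for x, y in DATA]


def loss_gradient(v, rho):
    """Gradient with respect to v of sum_n exp(-rho * y_n * <v, x_n>)."""
    g = [0.0, 0.0]
    for (x, y), m in zip(DATA, margins(v)):
        coeff = -rho * y * math.exp(-rho * m)
        g[0] += coeff * x[0]
        g[1] += coeff * x[1]
    return g


def constrained_descent(v, rho=2.0, lr=0.5, steps=2000):
    for _ in range(steps):
        g = loss_gradient(v, rho)
        lam = g[0] * v[0] + g[1] * v[1]  # radial component of the gradient
        v = [vi - lr * (gi - lam * vi) for vi, gi in zip(v, g)]  # tangent-space step
        norm = math.hypot(v[0], v[1])
        v = [vi / norm for vi in v]      # re-impose the unit norm constraint
    return v


v0 = [1.0, 0.0]
v_final = constrained_descent(v0)
print(v_final, min(margins(v0)), min(margins(v_final)))
```

On this symmetric data set the iteration moves from the initial direction, with smallest margin 1, to the direction `(1, 1) / sqrt(2)`, whose smallest margin `3 / sqrt(2)` is the largest achievable under the unit norm constraint.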
6.2 Related techniques for norm control: weight normalization, batch normalization and natural gradient enforcing unit norm
A main observation of this paper is that the Lagrange multiplier technique is very similar in its goal and implementation to other gradient descent algorithms with unit norm constraint. A review of gradient-based algorithms with unit-norm constraints 845952 lists
The four techniques are equivalent for small values of 845952 . Stability issues for numerical implementations are also characterized in 845952 . Our main point here is that the four techniques are closely related and have the same goal: performing gradient descent with a unit norm constraint. It seems fair to say that in the case of GD (a single minibatch, including all data) the four techniques should behave in a similar way. In particular (see Appendices 21.2 and 21.1), batch normalization controls the norms , though it does not control the norm of each layer – as WN does. In this sense it implements a somewhat weaker version of the generalization bound.
This argument suggests that WN and BN implement an approximation of constrained natural gradient. Interestingly, there is a close relationship between the Fisher-Rao norm and the natural gradient DBLP:journals/corr/abs-1711-01530 . In particular, the natural gradient descent is the steepest descent direction induced by the Fisher-Rao geometry.
6.3 Margin maximizers
As we mentioned, in GD with unit norm constraint there will be convergence to for . There may be trajectory-dependent, multiple alternative selections of the support vectors (SVs) during the course of the iteration while grows: each set of SVs may correspond to a max margin, minimum norm solution without being the global minimum norm solution. Because of Bezout-type arguments theory_II we expect multiple maxima. They should generically be degenerate even under the normalization constraints – which enforce each of the sets of weights to be on a unit hypersphere. Importantly, the normalization algorithms ensure control of the norm and thus of the generalization bound even if they cannot ensure that the algorithm converges to the globally best minimum norm solution (this depends on initial conditions for instance). In summary
In the appendices we discuss the dynamics of gradient descent in the continuous framework for a variety of losses and in the presence of regularization or normalization. Typically, normalization is similar to a vanishing regularization term.
The Lagrange multiplier case is a simple example (see section 6.1). For the following equations (as many as the number of weights) have to be satisfied asymptotically
where and goes to zero at infinity at the same rate as (see the special case of Equation (15)). This suggests that weight matrices from to should be in relations of the type for linear multilayer nets; appropriately similar relations should hold for the rectifier nonlinearities. In other words, gradient descent under unit norm is biased towards balancing the weights of different layers since this is the solution with minimum norm.
The Hessian of w.r.t. tells us about the linearized dynamics around the asymptotic critical point of the gradient. The Hessian (see Appendix 18)
is in general degenerate corresponding to an asymptotically degenerate hyperbolic equilibrium (biased towards minimum norm solutions if the rate of decay of implements correctly a Halpern iteration). The number of degenerate directions of the gradient flow corresponds to the number of symmetries of the neural network as discussed in Appendix 19. In the deep linear case, these would correspond to the freedom of applying opposite general linear transformations to neighboring layers. In the case of ReLU networks the situation becomes data-dependent.
For classification with exponential-type losses the Lagrange multiplier technique, WN and BN are trying to achieve approximately the same result: maximize the margin while constraining the norm. An even higher-level, unifying view of several different optimization techniques, including the case of regression, is to regard them as instances of Halpern iterations. Appendix 12 describes the technique. The gradient flow corresponds to an operator which is non-expansive. The fixed points of the flow are degenerate. Minimization with a regularization term in the weights that vanishes at the appropriate rate (Halpern iterations) converges to the minimum norm minimizer associated to the local minimum. Halpern iterations are a form of regularization with a vanishing (which is the form of regularization used to define the pseudoinverse). From this perspective, the Lagrange multiplier term can be seen as a Halpern term which “attracts” the solution towards zero norm. This corresponds to a local minimum norm solution for the unnormalized network (imagine for instance in 2D that there is a surface of zero loss with a boundary as in Figure 1). The minimum norm solution in the classification case corresponds to a maximum margin solution for the normalized network. Globally optimal generalization is not guaranteed but generalization bounds such as Equation 10 are locally optimized. It should be emphasized however that it is not yet clear whether all the algorithms we mentioned implement the correct dependence of the Halpern term on the number of iterations. We will examine this issue in future work.
7 Generalization without unit norm constraints
Empirically it appears that GD and SGD converge to solutions that can generalize even without BN or WN or other techniques enforcing explicit unit norm constraints. Without explicit constraints, convergence is difficult for quite deep networks; generalization is usually not as good as with BN or WN but it still occurs. How is this possible?
The following result (see Appendix 14.3.1) seems to solve the puzzle: the unconstrained gradient descent dynamics, with and where , is equivalent to the dynamics yielded by weight normalization in terms of and , that is
(informal statement) The standard dynamical system is equivalent to the system in which the weights are normalized. Thus for both dynamics, normalization of the weights at the end of the iterations via
yields the normalized classifier.
This means that becomes optimal during standard GD without the need for explicit control of the unit norm of . Notice that the unconstrained dynamics of and defined by Equations 13 and 14 is consistent with the dynamics of the .
Another interesting property of the dynamics of which is shared with the dynamics of under unit norm constraint is suggested by recent work NIPS2018_7321 : the difference between the square of the Frobenius norms of the weights of various layers does not change during gradient descent. This implies that if the weight matrices are all small at initialization, the gradient flow corresponding to gradient descent maintains approximately equal Frobenius norms across different layers, which is part of the constraint we enforce in an explicit way with the Lagrange multiplier or the WN technique. The observation of NIPS2018_7321 is easy to see in our framework. Consider Equation (12) for , that is without norm constraint. Inspection of it shows that is independent of . It follows that
Thus if we consider two of the layers, the following property holds: with . If is small at initialization then the norm of the two layers will remain very similar under the gradient flow – a condition required by minimum norm solutions. A formal proof can be sketched as follows. Consider the gradient descent equations
The above dynamics induces the following dynamics on using the relation . Thus
because of lemma 3. It follows that
which implies that the rate of growth of is independent of . If we assume that initially, they will remain equal while growing throughout training.
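The conservation of the difference between squared layer norms can be observed in a small simulation. The sketch below uses a hypothetical two-layer ReLU network `f(x) = <w2, relu(W1 x)>`, three hand-picked data points and an exponential loss, all assumed for illustration. Plain gradient descent with a small step approximates the gradient flow, so the difference is expected to be conserved only up to a drift of second order in the step size.

```python
import math

X = [(1.0, 0.5), (-0.5, 1.0), (1.0, -1.0)]  # hypothetical inputs
Y = [1, -1, 1]                              # hypothetical labels


def sq_norm_matrix(M):
    return sum(v * v for row in M for v in row)


def sq_norm_vector(w):
    return sum(v * v for v in w)


def gradients(W1, w2):
    """Gradients of sum_n exp(-y_n f(x_n)) for f(x) = <w2, relu(W1 x)>."""
    g1 = [[0.0, 0.0], [0.0, 0.0]]
    g2 = [0.0, 0.0]
    for x, y in zip(X, Y):
        z = [W1[i][0] * x[0] + W1[i][1] * x[1] for i in range(2)]
        h = [max(0.0, zi) for zi in z]
        f = w2[0] * h[0] + w2[1] * h[1]
        dL_df = -y * math.exp(-y * f)
        for i in range(2):
            g2[i] += dL_df * h[i]
            if z[i] > 0.0:  # ReLU passes the gradient only for active units
                for j in range(2):
                    g1[i][j] += dL_df * w2[i] * x[j]
    return g1, g2


W1 = [[0.3, -0.2], [0.1, 0.4]]
w2 = [0.5, -0.3]
lr, steps = 1e-4, 5000

n1_start = sq_norm_matrix(W1)
d_start = n1_start - sq_norm_vector(w2)
for _ in range(steps):
    g1, g2 = gradients(W1, w2)
    W1 = [[W1[i][j] - lr * g1[i][j] for j in range(2)] for i in range(2)]
    w2 = [w2[i] - lr * g2[i] for i in range(2)]
n1_end = sq_norm_matrix(W1)
d_end = n1_end - sq_norm_vector(w2)

print(n1_start, n1_end)  # the squared norm of the first layer grows
print(d_start, d_end)    # the difference between the layers barely moves
```

Both layers grow during training while the gap between their squared norms stays essentially at its initial value, which is why a small, balanced initialization keeps the layers balanced.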
Then, the minimization problem with is equivalent to the Lagrange multiplier problem of Equation (12), though the overall norm of , that is the norm of its layers, is not explicitly controlled. In other words, the norms of the layers are balanced, thus avoiding the situation in which one layer may contribute to decreasing loss by improving but another may achieve the same result by simply increasing its norm. Following the discussion in section 5.2, generalization depends on bounding the ratio of Rademacher complexity to the margin . The balance of norms property allows us to cancel the dependence on for all layers.
This property of exponential-type losses is the non-linear equivalent of Srebro's result for linear networks and is similar to the well-known property of the linear case: GD starting from zero or from very small weights converges to the minimum norm. It is important to emphasize that in the multilayer, nonlinear case we expect several maximum margin solutions (unlike the linear case), depending on initial conditions and stochasticity of SGD.
Of course, other effects, in addition to the role of initialization and batch or weight normalization may be at work here, improving generalization. For instance, high dimensionality under certain conditions has been shown to lead to better generalization for certain interpolating kernels 2018arXiv180800387L ; 2018arXiv181211167R . Though this is still an open question, it seems likely that similar results may also be valid for deep networks.
Furthermore, commonly used weight decay with appropriate parameters can induce generalization. Typical implementations of data augmentation also eliminate the overparametrization problem: at each iteration of SGD only “new” data are used and depending on the number of iterations it is possible to obtain more training data than parameters. In any case, within this online framework, one expects convergence to the minimum of the expected risk (see Appendix 11) without the need to invoke generalization bounds.
For a generic loss function such as the square loss and linear networks there is convergence to the minimum norm solution by GD for zero-norm initial conditions.
For exponential type losses and linear networks in the case of classification the convergence is independent of initial conditions 2017arXiv171010345S . The reason is that what matters is .
The property also holds for the square loss.
For exponential type losses and one-homogeneous networks in the case of classification the situation is similar since . With zero-norm initial conditions the norms of the layers are approximately equal and . Notice that the degeneracy of the solutions of the gradient flow is reduced by the unit norm constraints on each of the layers. As we showed, the norm square of each layer grows, under unregularized gradient descent, at the same time-dependent rate. This means that the time derivative of the product of the norms squared does change with time. Thus a bounded product does not remain bounded even when divided by a common time-dependent scale, unless the norms of all layers are equal at initialization.
Our results imply that multilayer, nonlinear, deep networks under gradient descent with norm constraint converge to maximum margin solutions. This is similar to the situation for linear networks. The prototypical (linear) example for over-parametrized deep networks is convergence of gradient descent to weights that represent the pseudoinverse of the input matrix.
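The prototypical linear example can be reproduced in a few lines. The sketch below assumes a hypothetical underdetermined least-squares problem with two equations and three unknowns. Gradient descent started from zero stays in the row space of the data matrix and is compared with the closed-form minimum norm (pseudoinverse) solution `X^T (X X^T)^{-1} y`, written out by hand for a two-row matrix.

```python
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # hypothetical data matrix: 2 equations, 3 unknowns
y = [2.0, 3.0]


def gd_from_zero(X, y, lr=0.1, steps=500):
    """Gradient descent on 0.5 * ||X w - y||^2 starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        resid = [sum(xi * wi for xi, wi in zip(row, w)) - yi for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(len(w))]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def min_norm_solution(X, y):
    """Closed form X^T (X X^T)^{-1} y for a full-rank matrix with two rows."""
    a = sum(v * v for v in X[0])
    b = sum(u * v for u, v in zip(X[0], X[1]))
    d = sum(v * v for v in X[1])
    det = a * d - b * b
    alpha0 = (d * y[0] - b * y[1]) / det
    alpha1 = (-b * y[0] + a * y[1]) / det
    return [alpha0 * u + alpha1 * v for u, v in zip(X[0], X[1])]


w_gd = gd_from_zero(X, y)
w_pinv = min_norm_solution(X, y)
print(w_gd)
print(w_pinv)  # [1/3, 4/3, 5/3] for this data
```

Any other interpolating solution differs from this one by a vector in the null space of `X` and therefore has a larger norm; starting from a nonzero initialization would leave such a component in place.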
We have to distinguish between square loss regression and classification via an exponential-type loss. In the case of square loss regression, NMGD converges to the minimum norm solution independently of initial conditions – under the assumption that the global minimum is achieved.
Consider now the case of classification by minimization of exponential losses using the Lagrange normalization algorithm. The main result is that the dynamical system in the normalized weights converges to a solution that (locally) maximizes margin. We discuss the close relations between this algorithm and weight normalization algorithms, which are themselves related to batch normalization. All these algorithms are commonly used. The fact that the solution corresponds to a maximum margin solution under a fixed norm constraint also explains the puzzling behavior of Figure 3, in which batch normalization was used. The test classification error does not get worse when the number of parameters increases well beyond the number of training data because the dynamical system is constrained to maximize the margin under unit norm of , without necessarily minimizing the loss.
An additional implication of our results is that the effectiveness of batch normalization is based on more fundamental reasons than reducing covariate shifts (the properties described in 2018arXiv180511604S are fully consistent with our characterization in terms of a regularization-like effect). Controlling the norm of the weights is exactly what generalization bounds prescribe: GD with normalization (NMGD) is the correct way to do it. Normalization is closely related to Halpern iterations used to achieve a minimum norm solution.
The theoretical framework described in this paper leaves a number of important open problems. Does the empirical landscape have multiple global minima with different minimum norms (see Figure 1), as we suspect? Or is the landscape “nicer” for large overparametrization – as hinted in several recent papers (see for instance 2018arXiv180406561M and 2019arXiv190202880N )? Can one ensure convergence to the global empirical minimizer with global minimum norm? How? Are there conditions on the Lagrange multiplier term – and on corresponding parameters for weight and batch normalization – that ensure convergence to a maximum margin solution independently of initial conditions?
We thank Yuan Yao, Misha Belkin, Jason Lee and especially Sasha Rakhlin for illuminating discussions. Part of the funding is from Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
-  D. Soudry, E. Hoffer, and N. Srebro. The Implicit Bias of Gradient Descent on Separable Data. ArXiv e-prints, October 2017.
-  Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 384–395. Curran Associates, Inc., 2018.
-  Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231, 2016.
-  Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv:1706.08947, 2017.
-  Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. Robust large margin deep neural networks. arXiv:1605.08254, 2017.
-  P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. ArXiv e-prints, June 2017.
-  C. Zhang, Q. Liao, A. Rakhlin, K. Sridharan, B. Miranda, N.Golowich, and T. Poggio. Musings on deep learning: Optimization properties of SGD. CBMM Memo No. 067, 2017.
-  Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. CoRR, abs/1711.01530, 2017.
-  Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pages 1237–1244, 2003.
-  T. Poggio and Q. Liao. Theory II: Landscape of the empirical risk in deep learning. arXiv:1703.09833, CBMM Memo No. 066, 2017.
-  C. Zhang, Q. Liao, A. Rakhlin, K. Sridharan, B. Miranda, N.Golowich, and T. Poggio. Theory of deep learning IIb: Optimization properties of SGD. CBMM Memo 072, 2017.
-  M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: A nonasymptotic analysis. arXiv:1702.03849 [cs, math], 2017.
-  Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. CoRR, abs/1703.00887, 2017.
-  Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. CoRR, abs/1503.02101, 2015.
-  Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
-  Simon S. Du, Jason D. Lee, and Yuandong Tian. When is a convolutional filter easy to learn? In International Conference on Learning Representations, 2018.
-  Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 3404–3413. JMLR.org, 2017.
-  M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, Feb 2019.
-  Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 597–607, USA, 2017. Curran Associates Inc.
-  Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 605–614, 2017.
-  Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1339–1348, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. CoRR, abs/1811.03804, 2018.
-  Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 4140–4149. JMLR.org, 2017.
-  Kai Zhong, Zhao Song, and Inderjit S. Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. CoRR, abs/1711.03440, 2017.
-  X. Zhang, Y. Yu, L. Wang, and Q. Gu. Learning One-hidden-layer ReLU Networks via Gradient Descent. arXiv e-prints, June 2018.
-  Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8157–8166. Curran Associates, Inc., 2018.
-  Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
-  Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. CoRR, abs/1811.08888, 2018.
-  Amit Daniely. Sgd learns the conjugate kernel class of the network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2422–2430. Curran Associates, Inc., 2017.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018.
-  Sanjeev Arora, Simon S. Du, Wei Hu, Zhi yuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019.
-  Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. CoRR, abs/1810.05369, 2018.
-  Benjamin Halpern. Fixed points of nonexpanding maps. Bull. Amer. Math. Soc., 73(6):957–961, 11 1967.
-  O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. pages 169–207, 2003.
-  John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
-  S. C. Douglas, S. Amari, and S. Y. Kung. On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing, 48(6):1843–1847, June 2000.
-  Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 2016.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the Margin Theory of Feedforward Neural Networks. arXiv e-prints, page arXiv:1810.05369, Oct 2018.
-  Tengyuan Liang and Alexander Rakhlin. Just Interpolate: Kernel “Ridgeless” Regression Can Generalize. arXiv e-prints, page arXiv:1808.00387, Aug 2018.
-  Alexander Rakhlin and Xiyu Zhai. Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon. arXiv e-prints, page arXiv:1812.11167, Dec 2018.
-  Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How Does Batch Normalization Help Optimization? arXiv e-prints, page arXiv:1805.11604, May 2018.
-  Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A Mean Field View of the Landscape of Two-Layers Neural Networks. arXiv e-prints, page arXiv:1804.06561, Apr 2018.
-  Phan-Minh Nguyen. Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks. arXiv e-prints, page arXiv:1902.02880, Feb 2019.
-  Paulo Jorge S. G. Ferreira. The existence and uniqueness of the minimum norm solution to certain linear and nonlinear problems. Signal Processing, 55:137–139, 1996.
-  Huan Xu and Shie Mannor. Robustness and generalization. CoRR, abs/1005.2243, 2010.
-  Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240 [cs, math, stat], September 2015. arXiv: 1509.01240.
-  Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
-  T. Poggio, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar. Theory of deep learning III: explaining the non-overfitting puzzle. arXiv:1703.09833, CBMM Memo No. 073, 2017.
-  Levent Sagun, Léon Bottou, and Yann LeCun. Singularity of the hessian in deep learning. CoRR, abs/1611.07476, 2016.
-  Daniel Kunin, Jonathan M. Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. CoRR, abs/1901.08168, 2019.
-  Igor Gitman and Boris Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR, abs/1709.08145, 2017.
SGD easily finds global minima on CIFAR10, suggesting that under appropriate overparametrization it does not “see” any local minima in the empirical loss landscape.
Different initializations affect the final result (a large initialization typically induces a larger final norm and a larger test error). It is significant that there is a dependence on initial conditions (in contrast to the linear case).
Similar to initialization, perturbations of the weights increase the norm and the test error.
The training loss of normalized nets predicts well the test performance of the same networks.
In the computer simulations shown in this section, we turn off all the “tricks” used to improve performance such as data augmentation, weight decay, etc. However, we keep batch normalization. We reduce in some of the experiments the size of the network or the size of the training set. As a consequence, performance is not state-of-the-art, but optimal performance is not the goal here (in fact the networks we use achieve state-of-the-art performance using standard setups). The expected risk was measured as usual by an out-of-sample test set.
A basic explanation for the puzzles is similar to the linear case: when the minima are degenerate the minimum norm minimizers are the best for generalization. The linear case corresponds to quadratic loss for a linear network shown in Figure 4.
In this very simple case we test our theoretical analysis with the following experiment. After convergence of GD, we apply a small random perturbation with unit norm to the parameters , then run gradient descent until the training error is again zero; this sequence is repeated times. We make the following predictions for the square loss:
The training error will go back to zero after each sequence of GD.
Any small perturbation of the optimum will be corrected by the GD dynamics to push back the non-degenerate weight directions to the original values. Since the components of the weights in the degenerate directions are in the null space of the gradient, running GD after each perturbation will not change the weights in those directions. Overall, the weights will change in the experiment.
Repeated perturbations of the parameters at convergence, each followed by gradient descent until convergence, will not increase the training error but will change the parameters, increase norms of some of the parameters and increase the associated test error. The norm of the projections of the weights in the null space undergoes a random walk.
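The three predictions above can be checked numerically in the degenerate linear case. The following is a minimal sketch, not the setup used for the figures: the problem sizes, the Gaussian data, the step size `1/L` and the helper names `gd` and `perturbation_experiment` are all assumptions made for illustration.

```python
import numpy as np

def gd(w, X, y, steps=2000):
    """Plain GD on the square loss (1/2n)||Xw - y||^2 with step size 1/L."""
    n = X.shape[0]
    lr = n / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of the Hessian
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / n
    return w

def perturbation_experiment(n=5, d=20, rounds=10, seed=0):
    """Repeat: unit-norm perturbation, then GD back to zero training error."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))             # n < d, so the Hessian is degenerate
    y = X @ rng.standard_normal(d)              # realizable targets (assumption)
    X_pinv = np.linalg.pinv(X)
    P_null = np.eye(d) - X_pinv @ X             # projector onto the null space of X
    w_min = X_pinv @ y                          # minimum norm interpolating solution
    w = gd(np.zeros(d), X, y)
    log = []
    for _ in range(rounds):
        delta = rng.standard_normal(d)
        w_pert = w + delta / np.linalg.norm(delta)
        w = gd(w_pert, X, y)
        log.append({
            "train_error": float(np.mean((X @ w - y) ** 2)),
            "norm": float(np.linalg.norm(w)),
            "null_norm": float(np.linalg.norm(P_null @ w)),
            # how much GD moved the null-space component of the perturbed weights
            "null_drift": float(np.linalg.norm(P_null @ (w - w_pert))),
        })
    return log, float(np.linalg.norm(w_min))

if __name__ == "__main__":
    log, min_norm = perturbation_experiment()
    print("minimum norm:", min_norm)
    for r in log:
        print(r)
```

In this sketch the training error returns to zero after every round, GD leaves the null-space component of each perturbation untouched, and the weight norm never falls below the minimum norm, which matches the qualitative picture described in the list.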
The same predictions apply also to the cross entropy case with the caveat that the weights increase even without perturbations, though more slowly. Previous experiments by  showed changes in the parameters and in the expected risk, consistently with our predictions above, which are further supported by the numerical experiments of Figure 8. In the case of cross-entropy the almost zero error valleys of the empirical risk function are slightly sloped downwards towards infinity, becoming flat only asymptotically.
The numerical experiments show, as predicted, that the behavior under small perturbations around a global minimum of the empirical risk for a deep network is similar to that of linear degenerate regression (compare Figure 8 with Figure 5). For the loss, the minimum of the expected risk may or may not occur at a finite number of iterations. If it does, it corresponds to an equivalent optimal regularization parameter that is non-zero and non-vanishing (because of “noise”). Thus a specific “early stopping” would be better than no stopping. The corresponding classification error, however, may not show overfitting.
Figure 9 shows the behavior of the loss in CIFAR in the absence of perturbations. This should be compared with Figure 5 which shows the case of an overparametrized linear network under quadratic loss corresponding to the multidimensional equivalent of the degenerate situation of Figure 4. The nondegenerate, convex case is shown in Figure 7.
Figure 10 shows the testing error for an overparametrized linear network optimized under the square loss. This is a special case in which the minimum norm solution is theoretically guaranteed by zero initial conditions without NMGD.
10 Minimal norm and maximum margin
We discuss the connection between maximum margin and minimal norms problems in binary classification. To do so, we reprise some classic reasonings used to derive support vector machines. We show they directly extend beyond linearly parametrized functions as long as there is a one-homogeneity property, namely, for all,
Given a training set of data points , where labels are , the functional margin is
If there exists such that the functional margin is strictly positive, then the training set is separable. We assume in the following that this is indeed the case. The maximum (max) margin problem is
The latter constraint is needed to avoid trivial solutions in light of the one-homogeneity property. We next show that Problem (23) is equivalent to
To see this, we introduce a number of equivalent formulations. First, notice that functional margin (22) can be equivalently written as
Then, the max margin problem (23) can be written as
Next, we can incorporate the norm constraint noting that using one-homogeneity,
so that Problem (25) becomes
Finally, using again one-homogeneity, without loss of generality, we can set and obtain the equivalent problem
The result is then clear noting that
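For reference, the chain of equivalent formulations used in this argument can be written out as follows. This is a sketch of the classical support vector machine reasoning in assumed notation: $f(W;x)$ for the one-homogeneous function, $\|W\|$ for the norm in the constraint, $N$ for the number of training points and $\gamma$ for the margin level. These symbols, and the absence of equation numbers, are choices made here for illustration.

```latex
% Assumed notation: f(W; x) one-homogeneous in W, labels y_i in {-1, +1},
% gamma > 0 the margin level. Symbols are illustrative, not fixed by the text.
\begin{align*}
&\text{one-homogeneity:} && f(\alpha W; x) = \alpha f(W; x), \quad \alpha > 0,\\
&\text{functional margin:} && \min_{i=1,\dots,N} y_i f(W; x_i),\\
&\text{max margin:} && \max_{\|W\|=1}\ \min_{i} y_i f(W; x_i),\\
&\text{with a margin level:} && \max_{W,\ \gamma>0} \gamma
   \quad \text{s.t. } \|W\|=1,\ y_i f(W; x_i) \ge \gamma \ \ \forall i,\\
&\text{norm absorbed:} && \max_{W,\ \gamma>0} \gamma
   \quad \text{s.t. } y_i f(W; x_i) \ge \gamma \|W\| \ \ \forall i,\\
&\text{set } \gamma = 1/\|W\|: && \max_{W} \frac{1}{\|W\|}
   \quad \text{s.t. } y_i f(W; x_i) \ge 1 \ \ \forall i,\\
&\text{equivalently:} && \min_{W} \tfrac{1}{2}\|W\|^2
   \quad \text{s.t. } y_i f(W; x_i) \ge 1 \ \ \forall i.
\end{align*}
```

The step labelled "norm absorbed" uses $y_i f(W/\|W\|; x_i) \ge \gamma \iff y_i f(W; x_i) \ge \gamma \|W\|$, which holds by one-homogeneity; the last step uses the fact that maximizing $1/\|W\|$ and minimizing $\tfrac{1}{2}\|W\|^2$ have the same solutions.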
11 Data augmentation and generalization with “infinite” data sets
In the case of batch learning, generalization guarantees on an algorithm are conditions under which the empirical error on the training set converges to the expected error, ideally with bounds that depend on the size of the training set. The practical relevance of this guarantee is that the empirical error is then a measurable proxy for the unknown expected error, and its error can be bounded. In the case of “pure” online algorithms such as SGD – in which the samples are drawn i.i.d. from the unknown underlying distribution – there is no training set per se or, equivalently, the training set has infinite size. Under usual conditions on the loss function and the learning rate, SGD converges to the minimum of the expected risk. Thus, the proof of convergence towards the minimum of the expected risk bypasses the need for generalization guarantees. With data augmentation, most implementations – such as the PyTorch one – generate “new” examples at each iteration. This effectively extends the size of the finite training set for the purpose of guaranteeing convergence to the minimum of the expected risk. Thus existing proofs of the convergence of SGD provide the guarantee that it converges to the “true” expected risk when the size of the “augmented” training set increases with the number of iterations.
Notice that while there exists unique , does not need to be unique: the set of which provide global minima of the expected error is an equivalence class.
12 Halpern iterations: selecting minimum norm solution among degenerate minima
In this section we summarize a modification of gradient descent that we apply to the various problems of optimization under the square and exponential loss for one-layer and nonlinear, deep networks.
We are interested in the convergence of solutions of gradient descent dynamics and their stability properties. In addition to the standard dynamical system tools we also use closely related elementary properties of non-expansive operators. A reason is that they describe the step of numerical implementation of the continuous dynamical systems that we consider. More importantly, they provide iterative techniques that converge (in a convex set) to the minimum norm of the fixed points, even when the operators are not linear, independently of initial conditions.
 Let be a strictly convex normed space. The set of fixed points of a non-expansive mapping with a closed convex subset of is either empty or closed and convex. If it is not empty, it contains a unique element of smallest norm.
In our case . To fix ideas, consider gradient descent on the square loss. As discussed later and in several papers, the Hessian of the loss function of a deep network with ReLUs has eigenvalues that are bounded from above (see for instance  and ), because the network is Lipschitz continuous, and bounded from below by zero at the global minimum. Thus with an appropriate choice of the operator is non-expanding and its fixed points are not an empty set, see Appendix 20. If we assume that the minimum is global and that there are no local minima but only saddle points then the null vector is in . Then the element of minimum norm can be found by iterative procedures (such as Halpern’s method, see Theorem 1 in ) of the form
where the sequence satisfies conditions such as and .
Footnote 2: Notice that these iterative procedures are often part of the numerical implementation (see  and section 4.1) of a discretized method for solving a differential equation whose equilibrium points are the minimizers of a differentiable convex subset of a function . Note also that proximal minimization corresponds to backward Euler steps for numerical integration of a gradient flow. Proximal minimization can be seen as introducing quadratic regularization into a smooth minimization problem in order to improve convergence of some iterative method in such a way that the final result obtained is not affected by the regularization.
In particular, the following holds
 For any the iteration with converges to one of the fixed points of . The sequence with and converges to the fixed point of T with minimum norm.
The norm-minimizing GD update – NMGD in short – has the form
where is the learning rate and (this is one of several choices).
It is an interesting question whether convergence to the minimum norm is independent of initial conditions and of perturbations. This may depend among other factors on the rate at which the Halpern term decays.
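A small numerical sketch may help fix ideas about Halpern-type iterations. It uses degenerate linear least squares, where the gradient step `T(w) = w - lr * grad` is non-expansive for `lr = 1/L`, and the Halpern iteration with the zero vector as anchor, `w <- (1 - lam_t) * T(w)`, with `lam_t = 1/(t+2)` so that `lam_t` goes to zero while its sum diverges. The exact form of the update, the schedule and all names below are assumptions for illustration and need not coincide with the NMGD update of the text.

```python
import numpy as np

def gd_step(w, X, y, lr):
    """One gradient step T(w) on the square loss (1/2n)||Xw - y||^2."""
    return w - lr * X.T @ (X @ w - y) / X.shape[0]

def run(X, y, w0, steps=20000, halpern=True):
    """Plain GD, or Halpern iteration anchored at the zero vector."""
    lr = X.shape[0] / np.linalg.norm(X, 2) ** 2   # step 1/L makes T non-expansive
    w = w0.copy()
    for t in range(steps):
        w = gd_step(w, X, y, lr)
        if halpern:
            w = (1.0 - 1.0 / (t + 2)) * w          # shrink toward the anchor 0
    return w

def demo(seed=0):
    """Distance to the minimum norm solution for two initial conditions."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((5, 20))               # more parameters than data
    y = X @ rng.standard_normal(20)
    w_min = np.linalg.pinv(X) @ y                  # minimum norm fixed point of T
    inits = [("zero", np.zeros(20)), ("random", 5.0 * rng.standard_normal(20))]
    out = {}
    for name, w0 in inits:
        out[name] = (
            float(np.linalg.norm(run(X, y, w0, halpern=True) - w_min)),
            float(np.linalg.norm(run(X, y, w0, halpern=False) - w_min)),
        )
    return out  # name -> (distance with Halpern, distance with plain GD)

if __name__ == "__main__":
    print(demo())
```

In this convex toy problem the Halpern variant approaches the minimum norm solution from both initializations (slowly, at a rate tied to the decay of `lam_t`), while plain GD keeps the null-space component of a nonzero initialization. Whether the same independence from initial conditions holds in the nonconvex deep case is exactly the open question raised above.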
13 Network minimizers under square and exponential loss
We consider one-layer and multilayer networks under the square loss and the exponential loss. Here are the main observations and results.
One-layer networks The Hessian is in general degenerate. Regularization with arbitrarily small ensures independence from initial conditions for both the square and the exponential loss. In the absence of explicit regularization, GD converges to the minimum norm solution for zero initial conditions. With NMGD-type iterations GD converges to the minimum norm solution independently of initial conditions (this is similar to the result of  obtained with different assumptions and techniques). For the exponential loss, NMGD ensures convergence to the normalized solution that maximizes the margin (and that corresponds to the overall minimum norm solution), see Appendix 14.3. In the exponential loss case, weight normalization GD is degenerate since the data (support vectors) may not span the space of the weights.
Deep networks, square loss The Hessian is in general degenerate, even in the presence of regularization (with fixed ). NMGD-type iterations lead to convergence not only to the fixed points – as vanilla GD does – but to the (locally) minimum norm fixed point.
Deep networks, exponential loss The Hessian is in general degenerate, even in the presence of regularization. NMGD-type iterations lead to convergence to the minimum norm fixed point associated with the global minimum.
Implications of minimum norm for generalization in regression problems NMGD-based minimization ensures minimum norm solutions.
Implications of minimum norm for classification
For classification a typical margin bound is
which depends on the margin . is the expected classification error; is the empirical loss of a surrogate loss such as the logistic. For a point the margin is . Since , the margin bound is optimized by effectively maximizing on the “support vectors”. As shown in Appendix 10 maximizing margin under the unit norm constraint is equivalent to minimizing the norm under the separability constraint.
NMGD can be seen as a variation of regularization (that is weight decay) by requiring to decrease to zero. The theoretical reason for NMGD is that NMGD ensures minimum norm or equivalently maximum margin solutions.
Notice that one of the definitions of the pseudoinverse of a linear operator corresponds to NMGD: it is the regularized solution to a degenerate minimization problem in the square loss for .
The failure of regularization with a fixed to induce hyperbolic solutions in the multi-layer case was surprising to us. Technically this is due to contributions to non-diagonal parts of the Hessian from derivatives across layers and to the shift of the minimum.
14 One-layer networks
14.1 Square loss
For linear networks under square loss GD is a non-expansive operator. There are fixed points. The Hessian is degenerate. Regularization with arbitrarily small ensures independence of initial conditions. Even in the absence of explicit regularization GD converges to the minimum norm solution for zero initial conditions. Convergence to the minimum norm solution also holds with NMGD-type iterations, but now independently of initial conditions.
We consider linear networks with one layer and one scalar output that is because there is only one layer. Thus with .
where is a bounded real-valued variable. Assume further that there exists a -dimensional weight vector that fits all the training data, achieving zero loss on the training set, that is
Dynamics The dynamics is
The only components of the weights that change under the dynamics are in the vector space spanned by the examples ; components of the weights in the null space of the matrix of examples are invariant to the dynamics. Thus converges to the minimum norm solution if the dynamical system starts from zero weights.
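The invariance of the null-space components can be illustrated with a short numerical check. The sketch below assumes a mean square loss, Gaussian data with more parameters than examples, and a step size of `1/L`; the variable names are illustrative only.

```python
import numpy as np

def gd(w, X, y, steps=5000):
    """Plain GD on the square loss (1/2n)||Xw - y||^2 with step size 1/L."""
    n = X.shape[0]
    lr = n / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / n
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))          # 5 examples, 20 weights
y = X @ rng.standard_normal(20)           # targets that can be fit exactly
X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y                        # minimum norm solution (pseudoinverse)
P_null = np.eye(20) - X_pinv @ X          # projector onto the null space of X

w_from_zero = gd(np.zeros(20), X, y)      # zero initial conditions
w0 = rng.standard_normal(20)              # generic initial conditions
w_from_w0 = gd(w0, X, y)

print("distance to min norm, zero init:  ", np.linalg.norm(w_from_zero - w_min))
print("distance to min norm, random init:", np.linalg.norm(w_from_w0 - w_min))
```

Starting from zero, GD lands on the pseudoinverse solution; starting from `w0`, it lands on the pseudoinverse solution plus the null-space projection of `w0`, which GD never modifies.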
The Jacobian of – and Hessian of – for is
This linearization of the dynamics around for which yields
where the associated is convex, since the Jacobian is minus the sum of auto-covariance matrices and thus is negative semidefinite. It is negative definite if the examples span the whole space but it is degenerate with some zero eigenvalues if .
Regularization If a regularization term is added to the loss the fixed point shifts. The equation
The Hessian at is with
which is always negative definite for any arbitrarily small fixed . Thus the dynamics of the perturbations around the equilibrium is given by
and is hyperbolic. Explicit regularization ensures the existence of a hyperbolic equilibrium for any at a finite . In the limit of the equilibrium converges to a minimum norm solution.
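The effect of explicit regularization can be checked numerically for the linear case. The sketch below assumes the regularized loss `(1/2n)||Xw - y||^2 + (lam/2)||w||^2`, whose Hessian is `X^T X / n + lam * I`; this is the Hessian of the loss, so positive definiteness here corresponds to a negative definite Jacobian of the gradient flow. Data, sizes and names are illustrative assumptions.

```python
import numpy as np

def ridge_equilibrium(X, y, lam):
    """Unique equilibrium of gradient flow on the regularized square loss."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)          # Hessian of the regularized loss
    w = np.linalg.solve(H, X.T @ y / n)        # stationarity: H w = X^T y / n
    return w, np.linalg.eigvalsh(H)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 20))               # degenerate: 5 examples, 20 weights
y = X @ rng.standard_normal(20)
w_min = np.linalg.pinv(X) @ y                  # minimum norm solution

eig_unreg = np.linalg.eigvalsh(X.T @ X / 5)    # has (numerically) zero eigenvalues
lams = (1e-1, 1e-3, 1e-6)
distances = []
min_eigs = []
for lam in lams:
    w_lam, eigs = ridge_equilibrium(X, y, lam)
    distances.append(float(np.linalg.norm(w_lam - w_min)))
    min_eigs.append(float(eigs.min()))

print("smallest unregularized eigenvalue:", eig_unreg.min())
print("smallest regularized eigenvalues: ", min_eigs)
print("distance to min norm solution:    ", distances)
```

For every fixed `lam > 0` all eigenvalues are bounded away from zero, so the equilibrium is hyperbolic, and as `lam` shrinks the equilibrium moves toward the minimum norm solution.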
NMGD The gradient flow corresponds to with . The gradient is non-expansive (see Appendix 20). There are fixed points ( satisfying ) that are degenerate. Minimization using the NMGD method converges to the minimum norm minimizer.
14.2 Exponential loss
Linear networks under exponential loss and GD show growing Frobenius norm. On a compact domain () the exponential loss is -smooth and corresponds to a non-expansive operator . Regularization with arbitrarily small ensures convergence to a fixed point independent of initial conditions. GD with normalization and NMGD-type iterations converge to the minimum norm, maximum margin solution for separable data with degenerate Hessian.
Consider now the exponential loss. Even for a linear network the dynamical system associated with the exponential loss is nonlinear. While  gives a rather complete characterization of the dynamics, here we describe a different approach.
The exponential loss for a linear network is
is a binary variable taking the value or . Assume further that the -dimensional weight vector separates correctly all the training data, achieving zero classification error on the training set, that is . In some cases below (it will be clear from context) we incorporate into .
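The growth of the norm under the exponential loss can be seen in a tiny separable example. The sketch below assumes the loss `mean(exp(-y_i * w.x_i))` with labels in `{-1, +1}`, a hand-picked four-point dataset, zero initialization and a small constant step size; all of these, and the function name `exp_loss_gd`, are illustrative assumptions.

```python
import numpy as np

def exp_loss_gd(X, y, lr=0.01, steps=20000, record_every=2000):
    """Plain GD on mean(exp(-y_i * <w, x_i>)) for a linear classifier."""
    z = y[:, None] * X                      # label-signed examples y_i * x_i
    w = np.zeros(X.shape[1])
    norms, margins = [], []
    for t in range(1, steps + 1):
        m = z @ w                           # per-example margins y_i * <w, x_i>
        grad = -(np.exp(-m)[:, None] * z).mean(axis=0)
        w = w - lr * grad
        if t % record_every == 0:
            norms.append(float(np.linalg.norm(w)))
            # margin of the normalized weights w / ||w||
            margins.append(float((z @ w).min() / np.linalg.norm(w)))
    return w, norms, margins

# Linearly separable toy data (assumption): labels in {-1, +1}.
X = np.array([[2.0, 0.5], [0.5, 1.5], [-3.0, -3.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

if __name__ == "__main__":
    w, norms, margins = exp_loss_gd(X, y)
    print("norm of w over time:      ", norms)
    print("normalized margin in time:", margins)
```

In this example the norm of the weights keeps increasing while all points stay correctly classified, so only the direction `w / ||w||` and its margin are informative, which is why the analysis is carried out on normalized weights.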
Dynamics The dynamics is