Theory III: Dynamics and Generalization in Deep Networks

03/12/2019, by Andrzej Banburski et al.

We review recent observations on the dynamical systems induced by gradient descent methods used for training deep networks and summarize properties of the solutions they converge to. Recent results illuminate the absence of overfitting in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we discuss the case of nonlinear DNNs near zero minima of the empirical loss, under exponential-type and square losses, for several variations of the basic gradient descent algorithm, including a new NMGD (norm minimizing gradient descent) version that converges to the minimum norm fixed points of the gradient descent iteration. Our main results are: 1) gradient descent algorithms with weight normalization constraint achieve generalization; 2) the fundamental reason for the effectiveness of existing weight normalization and batch normalization techniques is that they are approximate implementations of maximizing the margin under unit norm constraint; 3) without unit norm constraints some level of generalization can still be obtained for not-too-deep networks because the balance of the weights across different layers, if present at initialization, is maintained by the gradient flow. In light of these theoretical results, we discuss experimental evidence around the apparent absence of overfitting, that is, the observation that the expected classification error does not get worse when increasing the number of parameters. Our explanation focuses on the implicit normalization enforced by algorithms such as batch normalization. In particular, the control of the norm of the weights is related to Halpern iterations for minimum norm solutions.


1 Introduction

In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus our ability to develop principled improvements, has lagged behind. A satisfactory theoretical characterization of deep learning is emerging. It covers the following questions: 1) the representation power of deep networks; 2) the optimization of the empirical risk; 3) the generalization properties of gradient descent techniques — why does the expected error not degrade, despite the absence of explicit regularization, when the networks are overparametrized? We refer to the latter as the non-overfitting puzzle, around which several recent papers revolve (see among others Hardt2016 ; NeyshaburSrebro2017 ; Sapiro2017 ; 2017arXiv170608498B ; Musings2017 ). This paper addresses the third question.

2 Deep networks: definitions and properties

Definitions We define a deep network with $K$ layers and the usual coordinate-wise scalar activation functions $\sigma(z): \mathbb{R} \to \mathbb{R}$ as the set of functions $f(W;x) = W_K\,\sigma(W_{K-1}\cdots\sigma(W_1 x))$, where the input is $x \in \mathbb{R}^d$ and the weights are given by the matrices $W_k$, one per layer, with matching dimensions. We use the symbol $W$ as a shorthand for the set of matrices $W_1,\dots,W_K$. For simplicity we consider here the case of binary classification, in which $f$ takes scalar values, implying that the last layer matrix $W_K$ is a row vector. The labels are $y_n \in \{-1,+1\}$. The weights of hidden layer $k$ are collected in a matrix $W_k$ of size $h_k \times h_{k-1}$. There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function in this paper is the ReLU activation.

Positive one-homogeneity For ReLU activations the following positive one-homogeneity property holds: $\sigma(\lambda z) = \lambda\,\sigma(z)$ for $\lambda \ge 0$. For the network this implies $f(W;x) = \rho\,\tilde f(V;x)$, where $W_k = \rho_k V_k$ with $\|V_k\| = 1$, $\rho_k = \|W_k\|$ and $\rho = \prod_{k=1}^{K}\rho_k$ (Frobenius norms, for convenience). This implies the following property of ReLU networks w.r.t. their Rademacher complexity:

$\mathbb{R}_N(\mathcal{F}) = \rho\, \mathbb{R}_N(\tilde{\mathcal{F}}), \qquad (1)$

where $\mathcal{F}$ is the class of neural networks described above and $\tilde{\mathcal{F}}$ is the corresponding class of normalized neural networks $\tilde f(V;x)$. This invariance of the function under transformations of $W$ that leave the product norm $\rho$ unchanged is typical of ReLU (and linear) networks. In the paper we will refer to the norm of $f$ meaning the product of the Frobenius norms of the weight matrices of $f$. Thus $\|f\| = \rho = \prod_{k=1}^{K}\|W_k\|$. Note that

$f(W;x) = \rho\,\tilde f(V;x) = \Big(\prod_{k=1}^{K}\|W_k\|\Big)\,\tilde f(V;x). \qquad (2)$
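As a quick numerical illustration of this invariance, here is a minimal sketch with a toy ReLU network (the architecture and all names are illustrative, not from the paper): rescaling one layer by $\lambda$ and another by $1/\lambda$ leaves the function unchanged, and the output factorizes as $\rho$ times the output of the normalized network.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def net(Ws, x):
    # f(W; x) = W_K relu(W_{K-1} ... relu(W_1 x)); last layer linear, scalar output
    h = x
    for Wk in Ws[:-1]:
        h = relu(Wk @ h)
    return float(Ws[-1] @ h)

# toy 3-layer ReLU network with scalar output
Ws = [rng.standard_normal((8, 5)), rng.standard_normal((8, 8)), rng.standard_normal((1, 8))]
x = rng.standard_normal(5)

# rescale two layers so that the product of Frobenius norms is unchanged
lam = 3.7
Ws_rescaled = [lam * Ws[0], (1.0 / lam) * Ws[1], Ws[2]]
print(np.isclose(net(Ws, x), net(Ws_rescaled, x)))   # True: f depends only on the product norm

# f(W; x) = rho * f(V; x) with V_k = W_k / ||W_k||_F and rho = prod_k ||W_k||_F
rhos = [np.linalg.norm(Wk) for Wk in Ws]
Vs = [Wk / rk for Wk, rk in zip(Ws, rhos)]
print(np.isclose(net(Ws, x), np.prod(rhos) * net(Vs, x)))  # True
```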

Structural property The following structural property of the gradient of deep ReLU networks is sometimes useful (Lemma 2.1 of DBLP:journals/corr/abs-1711-01530 ):

$\sum_{i,j} \frac{\partial f(W;x)}{\partial (W_k)_{i,j}}\,(W_k)_{i,j} = f(W;x) \qquad (3)$

for $k = 1,\dots,K$. Equation 3 can be rewritten as an inner product

$\big\langle \nabla_{w_k} f(W;x),\, w_k \big\rangle = f(W;x), \qquad (4)$

where $w_k$ is the vectorized representation of the weight matrix $W_k$ for each of the $K$ layers (each matrix is flattened into a vector).

Gradient flow and continuous approximation We will speak of the gradient flow of the empirical risk $L(W)$ (or sometimes of the flow of $f$ if the context makes clear that one speaks of the gradient of $L$) referring to

$\dot{W}_k = -\eta\, \nabla_{W_k} L(W), \qquad (5)$

where $\eta$ is the learning rate. In the following we will mix the continuous formulation with the discrete version whenever we feel this is appropriate for the specific statement. We are well aware that the two are not equivalent, but we are happy to leave a careful analysis – especially of the discrete case – to better mathematicians.

Maximization by exponential With $\tilde f(V;x)$ being the normalized network (weights at each layer are normalized by the Frobenius norm of the layer matrix) and $\rho$ being the product of the Frobenius norms, the exponential loss $L(W) = \sum_{n=1}^{N} e^{-\rho\, y_n \tilde f(V;x_n)}$ approximates for “large” $\rho$ a max operation, selecting among all the data points the ones with the smallest margin $y_n \tilde f(V;x_n)$. Thus minimization of $L$ for large $\rho$ corresponds to margin maximization:

$\lim_{\rho \to \infty}\; \arg\min_{V}\; \sum_{n=1}^{N} e^{-\rho\, y_n \tilde f(V;x_n)} \;=\; \arg\max_{V}\; \min_{n} y_n \tilde f(V;x_n). \qquad (6)$

A more formal argument may be developed extending theorems of DBLP:conf/nips/RossetZH03 to the nonlinear case as described in NIPS2018_7321 .
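A small numerical sketch of this “softmax-to-max” effect (the margin values are illustrative, not from the paper's experiments): as $\rho$ grows, the relative contribution of each data point to the exponential loss concentrates on the smallest-margin points.

```python
import numpy as np

# Margins y_n * f(V; x_n) of a hypothetical normalized network on five data points.
margins = np.array([0.9, 0.5, 0.2, 0.21, 1.3])

for rho in [1.0, 10.0, 100.0]:
    losses = np.exp(-rho * margins)
    weights = losses / losses.sum()   # relative contribution of each point to the loss
    print(rho, np.round(weights, 3))
# As rho grows, the loss (and its gradient) is dominated by the smallest-margin
# points -- the "support vectors" -- so minimizing it drives margin maximization.
```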

3 A semi-rigorous theory of the optimization landscape of Deep Nets: Bezout theorem and Boltzmann distribution

In theory_II ; theory_IIb we consider deep networks in which each ReLU nonlinearity is replaced by a univariate polynomial approximating it. Empirically the network behaves in a quantitatively identical way in our tests. We then consider such a network in the context of regression under a square loss function. As usual we assume that the network is over-parametrized, that is the number of weights $D$ is larger than the number of data points $N$. The critical points of the gradient consist of

  • global minima corresponding to interpolating networks, for which $f(W;x_n) = y_n$ for $n = 1,\dots,N$ and thus $L(W) = 0$;

  • critical points which correspond to saddles and to local minima, for which the loss is not zero but $\nabla_W L(W) = 0$.

Suppose that the polynomial network does not have any of the symmetries characteristic of the ReLU network, such as one-homogeneity. In the case of the global, interpolating minimizers, the function is a polynomial in the weights $W$ (and also a polynomial in the inputs $x$). The degree of each equation is determined by the degree of the univariate polynomial and by the number of layers $K$. Since the system of polynomial equations, unless the equations are inconsistent, is generically underdetermined – $N$ equations in a larger number $D$ of unknowns – Bezout theorem suggests an infinite number of degenerate global minima, in the form of regions of zero empirical error (the set of all solutions is an algebraically closed set of dimension at least $D - N$). Notice that if an underdetermined system is chosen at random, the dimension of its zeros is equal to $D - N$ with probability one.

The critical points of the gradient that are not global minimizers are given by the set of equations $\nabla_W L(W) = 0$ with $L(W) > 0$. This is a set of $D$ polynomial equations in the $D$ unknowns $W$. If this were a generic system of polynomial equations, we would expect a set of isolated critical points. A more careful analysis, as suggested by J. Bloom, follows the preimage theorem and gives the result that, while the degeneracy of the global zeros is $D - N$, the degeneracy of the other critical points is generically lower. Thus the global minima are more degenerate than the other critical points for sufficiently large overparametrization.

(The Preimage Theorem Hintchin12). Let $F: X \to Y$ be a smooth map between manifolds, and let $y \in Y$ be such that at each point $x \in F^{-1}(y)$, the derivative $dF_x$ is surjective. Then $F^{-1}(y)$ is a smooth manifold of dimension $\dim X - \dim Y$.

In our case the proof goes as follows. Consider the map $F: \mathbb{R}^D \to \mathbb{R}^N$ defined by $F(W) = \big(f(W;x_1) - y_1, \dots, f(W;x_N) - y_N\big)$, where each component is a polynomial in the weights $W$ with coefficients provided by the corresponding training example. Consider the set $F^{-1}(0)$ of interpolating solutions. We need to check that the derivative $dF_W$ is surjective for any $W \in F^{-1}(0)$. $dF_W$ is a linear transformation from $\mathbb{R}^D$ to $\mathbb{R}^N$ given by the Jacobian of $F$ at $W$; in the simplest case $N = 1$ it is enough to find a single vector $v$ such that $dF_W(v) \neq 0$, which holds generically because not all partial derivatives of a non-constant polynomial vanish identically on $F^{-1}(0)$. Therefore $dF_W$ is surjective and the preimage of $0$ is a manifold of dimensionality $D - N$.

The argument can be extended to the case in which there are degeneracies due to intrinsic symmetries of the network, for instance corresponding to invariances under a continuous group (discrete groups such as the permutation groups will not induce infinite degeneracy). Suppose that the effective dimensionality of the symmetries is $s$. Assume for simplicity that the symmetries are the same for the global minima and for the other critical points. The constraints induced by symmetries – which could be represented as additional equations – reduce the number of effective parameters from $D$ to $D - s$. Then, while all the critical points will be non-degenerate in this reduced number of effective parameters, the global minima will be degenerate on a set of dimension at least $D - s - N$. If the latter is still larger than zero, the global minima will be degenerate on an algebraic variety of higher dimension than the local minima, that is on a much larger volume in parameter space.

Thus, we have

(informal statement): For appropriate overparametrization (see Bloom, Banburski, Poggio 2019), there are a large number of global zero-error minimizers which are degenerate; the other critical points – saddles and local minima – are generically (that is, with probability one) degenerate on a set of lower dimensionality.

The second part of our argument (in theory_IIb ) is that SGD concentrates on the most degenerate minima. The argument is based on the similarity between a Langevin equation and SGD and on the fact that the Boltzmann distribution is formally the asymptotic “solution” of the stochastic differential Langevin equation and also of SGDL, defined as SGD with added white noise (see for instance raginskyetal17 ). The Boltzmann distribution is

$p(W) = \frac{1}{Z}\, e^{-\frac{L(W)}{T}}, \qquad (7)$

where $Z$ is a normalization constant, $L(W)$ is the loss and $T$ reflects the noise power. The equation implies that SGDL prefers degenerate minima relative to non-degenerate ones of the same depth. In addition, among two minimum basins of equal depth, the one with a larger volume is much more likely in high dimensions, as shown by the simulations in theory_IIb . Taken together, these two facts suggest that SGD selects degenerate minimizers corresponding to larger isotropic flat regions of the loss. Then SGDL shows concentration – because of the high dimensionality – of its asymptotic distribution, Equation 7.
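The following toy sketch (our own illustration, not the paper's simulations) implements SGDL as gradient descent plus white noise on a loss with one degenerate (flat) direction; the non-degenerate coordinate concentrates as the Boltzmann factor prescribes, while the degenerate one keeps diffusing over the flat valley.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgdl_step(w, grad_fn, lr=0.01, temperature=1e-3):
    # One SGDL step: a gradient step plus white noise whose scale is set by the
    # temperature, so that the stationary density is proportional to exp(-L(w)/temperature).
    noise = rng.standard_normal(w.shape) * np.sqrt(2.0 * lr * temperature)
    return w - lr * grad_fn(w) + noise

# Toy loss with a degenerate (flat) direction: L(w) = w[0]**2, independent of w[1].
grad = lambda w: np.array([2.0 * w[0], 0.0])

w = np.array([2.0, 2.0])
samples = []
for t in range(20000):
    w = sgdl_step(w, grad)
    if t > 5000:
        samples.append(w.copy())
samples = np.array(samples)
# The curved coordinate concentrates near its minimum; the flat (degenerate)
# coordinate keeps diffusing over the zero-loss valley.
print(samples[:, 0].std(), samples[:, 1].std())
```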

Together theory_II and theory_IIb suggest the following

(informal statement): For appropriate overparametrization of the deep network, SGD selects with high probability the global minimizers of the empirical loss, which are highly degenerate.

4 Related work

There are many recent papers studying optimization and generalization in deep learning. For optimization we mention work based on the idea that noisy gradient descent DBLP:journals/corr/Jin0NKJ17 ; DBLP:journals/corr/GeHJY15 ; pmlr-v49-lee16 ; s.2018when can find a global minimum. More recently, several authors have studied the dynamics of gradient descent for deep networks with assumptions about the input distribution or about how the labels are generated. They obtain global convergence for some shallow neural networks Tian:2017:AFP:3305890.3306033 ; s8409482 ; Li:2017:CAT:3294771.3294828 ; DBLP:conf/icml/BrutzkusG17 ; pmlr-v80-du18b ; DBLP:journals/corr/abs-1811-03804 . Some local convergence results have also been proved Zhong:2017:RGO:3305890.3306109 ; DBLP:journals/corr/abs-1711-03440 ; 2018arXiv180607808Z . The most interesting such approach is DBLP:journals/corr/abs-1811-03804 , which focuses on minimizing the training loss and proving that randomly initialized gradient descent can achieve zero training loss (see also NIPS2018_8038 ; du2018gradient ; DBLP:journals/corr/abs-1811-08888 ), as in section 3. In summary, there is by now an extensive literature on optimization that formalizes and refines, for different special cases and for the discrete domain, our results of Theory II and IIb (see section 3).

For generalization, which is the topic of this paper, existing work demonstrates that gradient descent works under the same situations as kernel methods and random feature methods NIPS2017_6836 ; DBLP:journals/corr/abs-1811-04918 ; Arora2019FineGrainedAO . Closest to our approach – which is focused on the role of batch and weight normalization – is the paper Wei2018OnTM . Its authors study generalization assuming a regularizer because they are – like us – interested in the normalized margin. Unlike their assumption of an explicit regularization, we show here that commonly used techniques, such as batch normalization, in fact normalize the margin without the need to add a regularizer or to use weight decay.

5 Preliminaries on Generalization

Classical generalization bounds for regression suggest that bounding the complexity of the minimizer provides a bound on generalization. Ideally, the optimization algorithm should select the smallest-complexity minimizers among the solutions – that is, in the case of ReLU networks, the minimizers with minimum norm. An approach to achieve this goal is to add a vanishing regularization term to the loss function (the regularization parameter $\lambda$ goes to zero with the iterations) that, under certain conditions, provides convergence to the minimum norm minimizer, independently of initial conditions. This approach goes back to Halpern's fixed point theorem halpern1967 ; it is also independently suggested by other techniques such as Lagrange multipliers, normalization and margin maximization theorems DBLP:conf/nips/RossetZH03 .

Well-known margin bounds for classification suggest a similar approach (see Appendix 10): maximization of the margin of the normalized network (the weights at each layer are normalized by the Frobenius norm of the weight matrix of the layer). The margin is the value of $y_n \tilde f(V;x_n)$ over the support vectors, that is the data points with the smallest margin, assuming the data are separable by $f$.

In the case of nonlinear deep networks, the critical points of the gradient of an exponential-type loss include saddles, local minima (if they exist) and global minima of the loss function; the latter are generically degenerate theory_II . An approach similar to the linear case leads to minimum norm solutions, independently of initial conditions.

5.1 Regression: (local) minimum norm empirical minimizers

We recall that generalization bounds Bousquet2003 apply to $f \in \mathcal{F}$ with probability at least $1 - \delta$ and have the typical form

$L(f) \le \hat L(f) + c_1\, \mathbb{R}_N(\mathcal{F}) + c_2\, \sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (8)$

where $L(f)$ is the expected loss, $\hat L(f)$ is the empirical loss, and $\mathbb{R}_N(\mathcal{F})$ is the empirical Rademacher average of the class of functions $\mathcal{F}$, measuring its complexity; $c_1, c_2$ are constants that depend on properties of the Lipschitz constant of the loss function and on the architecture of the network.

The bound, together with the property in Equation 1, implies that among the minimizers with zero square loss the optimization algorithm should select the minimum norm solution. In any case, the algorithm should control the norm $\rho$. Standard GD or SGD algorithms do not provide an explicit control of the norm. Empirically it seems that initialization with small weights helps – as in the linear case (see the figures in section 7). We propose a slight modification of the standard gradient descent algorithms to provide a norm-minimizing GD update – NMGD in short – as

$W_{t+1} = (1 - \lambda_t)\big(W_t - \eta\, \nabla_W L(W_t)\big), \qquad (9)$

where $\eta$ is the learning rate and $\lambda_t = \frac{1}{t}$ (this is one of several choices) is the vanishing, regularization-like Halpern term (see Appendix 12).
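A minimal sketch of the NMGD update as reconstructed above (the $\lambda_t \sim 1/t$ schedule is one admissible choice; the toy problem and parameters are our own assumptions): on a degenerate least-squares problem, NMGD started from a generic point still approaches the minimum norm solution.

```python
import numpy as np

def nmgd(grad_fn, w0, lr=0.1, n_iters=5000):
    """Norm-minimizing gradient descent (a sketch of the NMGD idea):
    a plain GD step followed by a Halpern-style shrinkage toward the origin
    with a vanishing coefficient lambda_t ~ 1/t."""
    w = np.array(w0, dtype=float)
    for t in range(1, n_iters + 1):
        lam = 1.0 / (t + 1)                  # one of several admissible schedules
        w = (1.0 - lam) * (w - lr * grad_fn(w))
    return w

# Degenerate least-squares problem: one equation, two unknowns (w[0] + w[1] = 1).
# Plain GD from a nonzero start keeps its component in the null space;
# NMGD shrinks it away and approaches the minimum norm solution (0.5, 0.5).
A, y = np.array([[1.0, 1.0]]), np.array([1.0])
grad = lambda w: A.T @ (A @ w - y)
print(nmgd(grad, w0=[3.0, -2.0]))            # approximately [0.5, 0.5]
```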

5.2 Classification: maximizing the margin of the normalized minimizer

A typical margin bound for classification Shawe-Taylor:2004:KMP:975545 is (with probability at least $1-\delta$)

$L_0(\tilde f) \le \hat L_S(\tilde f) + c_1\,\frac{\mathbb{R}_N(\tilde{\mathcal{F}})}{\gamma} + c_2\,\sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (10)$

where $\gamma$ is the margin, $L_0(\tilde f)$ is the expected classification error, and $\hat L_S(\tilde f)$ is the empirical loss of a surrogate loss such as the logistic or the exponential. For a point $x_n$, the margin is $y_n \tilde f(V;x_n)$. Since the complexity term is fixed for the normalized class, the margin bound is optimized by effectively maximizing $\gamma$ on the “support vectors” – that is, the $x_n$ such that $y_n \tilde f(V;x_n) = \min_m y_m \tilde f(V;x_m)$.

We show (see Appendix 10) that for separable data, maximizing the margin subject to a unit norm constraint is equivalent to minimizing the norm of the weights subject to a constraint on the margin. A regularized loss with an appropriately vanishing regularization parameter is a closely related optimization technique. For this reason we will refer to the solutions in all these cases as minimum norm. This view treats interpolation (in the regression case) and classification (in the margin case) in a unified way.

6 Gradient descent with norm constraint

In this section we focus on the classification case with an exponential loss function. The generalization bounds in the previous section are satisfied by maximizing the margin subject to the product of the layer norms being equal to one:

$\max_{W}\ \min_{n} y_n f(W;x_n) \quad \text{subject to} \quad \prod_{k=1}^{K}\|W_k\| = 1. \qquad (11)$

In words: find the network weights that maximize the margin subject to a norm constraint. The latter ensures a bounded Rademacher complexity, and together they minimize the term $\frac{\mathbb{R}_N(\mathcal{F})}{\gamma}$. In fact, existing generalization bounds, such as Equation 6 in 2017arXiv171206541G (see also 2017arXiv170608498B ), are given in terms of products of upper bounds on the norm of each layer: the bounds require that each layer is bounded, rather than just that the product is bounded.

This constraint is implied by a unit constraint on the norm of each layer which defines an equivalence class of networks because of Eq. (1).

A direct approach is to minimize the exponential loss function $L(W) = \sum_n e^{-y_n f(W;x_n)}$ subject to $\|V_k\| = 1$ for $k = 1,\dots,K$, that is under a unit norm constraint for the normalized weight matrix $V_k$ at each layer (with $W_k = \rho_k V_k$). Clearly these constraints imply the constraint on the product of weight matrices in (11). As we discuss later (see Appendices and 845952 ), there are several ways to implement the minimization in the tangent space of the unit spheres $\|V_k\| = 1$. The interesting observation is that they are closely related to gradient descent techniques widely used for training deep networks, such as weight normalization (WN) SalDied16 and batch normalization (BN) ioffe2015batch . In the following we describe one of the techniques, the Lagrange multiplier method, because it enforces the constraint from the generalization bounds in a transparent way.

6.1 Lagrange multiplier method

We define the loss

$\mathcal{L}(\rho_k, V_k) = \sum_{n=1}^{N} e^{-\rho\, y_n \tilde f(V;x_n)} + \sum_{k=1}^{K} \lambda_k\big(\|V_k\|^2 - 1\big), \qquad \rho = \prod_{k=1}^{K}\rho_k, \qquad (12)$

where the Lagrange multipliers $\lambda_k$ are chosen so that the constraints $\|V_k\| = 1$ are satisfied at convergence or when the algorithm is stopped (the constraints can also be enforced at each iteration, see later).

We perform gradient descent on $\mathcal{L}$ with respect to $\rho_k$ and $V_k$. We obtain for $\rho_k$

$\dot\rho_k = -\frac{\partial \mathcal{L}}{\partial \rho_k} = \frac{\rho}{\rho_k}\sum_{n=1}^{N} y_n\, \tilde f(V;x_n)\, e^{-\rho\, y_n \tilde f(V;x_n)} \qquad (13)$

and for the weights $V_k$ of each layer

$\dot V_k = -\frac{\partial \mathcal{L}}{\partial V_k} = \rho \sum_{n=1}^{N} y_n\, e^{-\rho\, y_n \tilde f(V;x_n)}\, \frac{\partial \tilde f(V;x_n)}{\partial V_k} - 2\lambda_k V_k. \qquad (14)$

The sequence of multipliers $\lambda_k(t)$ must be chosen so that $\|V_k\| = 1$ holds at convergence.

Since the first term in the right hand side of Equation (14) goes to zero as $\rho \to \infty$ (for separable data the exponentials decay faster than $\rho$ grows) and the Lagrange multipliers also go to zero, the normalized weight vectors converge at infinity, with $\dot V_k \to 0$. On the other hand, $\rho$ grows to infinity. Interestingly, as shown in section 7, the norm square of each layer grows at the same rate.

Let us assume that, starting at some time $t_0$, $\rho$ is large enough that the following asymptotic expansion (as $\rho \to \infty$) is a good approximation: $\sum_{n=1}^{N} e^{-\rho\, y_n \tilde f(V;x_n)} \approx C\, e^{-\rho\gamma}$, where $\gamma = \min_n y_n \tilde f(V;x_n)$ and $C$ is the multiplicity of the minimal margin $\gamma$.

The data points with the corresponding minimum value of the margin are the support vectors. They are a subset of cardinality $C$ of the $N$ datapoints, all with the same margin $\gamma$. In particular, the term $\sum_{n} y_n \tilde f(V;x_n)\, e^{-\rho\, y_n \tilde f(V;x_n)}$ becomes $C\,\gamma\, e^{-\rho\gamma}$.

A rigorous proof of the argument above can be regarded as an extension of the main theorem in DBLP:conf/nips/RossetZH03 from the case of linear functions to the case of one-homogeneous functions. In fact, while updating the present version of this paper we noticed that 2018arXiv181005369W has theorems including such an extension.
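The sketch below illustrates the constrained dynamics in the simplest one-homogeneous case, a linear model $f(w;x) = w \cdot x$ with $w = \rho v$, $\|v\| = 1$ (the toy data and parameters are our own assumptions, not the paper's): the $v$-gradient is projected onto the tangent space of the unit sphere — the explicit counterpart of choosing $\lambda_k$ so that the constraint holds at each step — while $\rho$ follows its unconstrained gradient and grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data (shifted so that a positive margin exists).
X = rng.standard_normal((50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
X = X + 0.6 * y[:, None]

rho, v = 1.0, np.array([1.0, 0.0])      # w = rho * v, ||v|| = 1
lr = 0.05
for t in range(4000):
    margins = y * (X @ v)               # margins of the normalized model
    c = np.exp(-rho * margins) / len(y) # per-sample exponential loss weights
    grad_v = -rho * (c * y) @ X         # d(loss)/dv
    grad_rho = -(c * margins).sum()     # d(loss)/drho
    grad_v = grad_v - (grad_v @ v) * v  # keep only the component tangent to the unit sphere
    v = v - lr * grad_v
    v = v / np.linalg.norm(v)           # re-project (numerical safeguard)
    rho = rho - lr * grad_rho

print(rho, (y * (X @ v)).min())  # rho grows; the minimum normalized margin has increased
```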

Remarks

  1. If we impose the conditions $\|V_k\| = 1$ at each $t$, $\lambda_k$ must satisfy $2\lambda_k = \rho \sum_n y_n \tilde f(V;x_n)\, e^{-\rho\, y_n \tilde f(V;x_n)}$, where we used the structural property $\langle \nabla_{V_k}\tilde f(V;x), V_k\rangle = \tilde f(V;x)$. Thus, asymptotically,

    $\lambda_k \approx \frac{1}{2}\,\rho\, C\, \gamma\, e^{-\rho\gamma} \qquad (15)$

    goes to zero at infinity because $e^{-\rho\gamma}$ does.

  2. It is possible to add a regularization term to the equation for $\rho$. The effect of regularization is to bound $\rho$ to a maximum size $\rho_{\max}$, controlled by a fixed regularization parameter $\lambda_\rho$: in this case the dynamics of $\rho$ converges to a (very large) $\rho_{\max}$ set by a (very small) value of $\lambda_\rho$.

6.2 Related techniques for norm control: weight normalization, batch normalization and natural gradient enforcing unit norm

A main observation of this paper is that the Lagrange multiplier technique is very similar in its goal and implementation to other gradient descent algorithms with unit norm constraint. A review of gradient-based algorithms with unit-norm constraints 845952 lists

  1. the Lagrange multiplier method of our section 6.1,

  2. the coefficient normalization method that is related to batch normalization, see Appendix 21.2

  3. the tangent gradient method that corresponds to weight normalization (Appendix 21.1) and finally

  4. the true gradient method using natural gradient.

The four techniques are equivalent for small values of the learning rate 845952 . Stability issues for numerical implementations are also characterized in 845952 . Our main point here is that the four techniques are closely related and have the same goal: performing gradient descent with a unit norm constraint. It seems fair to say that in the case of GD (a single minibatch, including all data) the four techniques should behave in a similar way. In particular (see Appendices 21.2 and 21.1), batch normalization controls the product norm $\rho$, though it does not control the norm of each layer – as WN does. In this sense it implements a somewhat weaker version of the generalization bound.

This argument suggests that WN and BN implement an approximation of constrained natural gradient. Interestingly, there is a close relationship between the Fisher-Rao norm and the natural gradient DBLP:journals/corr/abs-1711-01530 . In particular, the natural gradient descent is the steepest descent direction induced by the Fisher-Rao geometry.
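A one-line check of the geometric point (a sketch; the loss and dimensions are arbitrary): with the weight normalization reparametrization $w = g\, v/\|v\|$, the gradient with respect to $v$ is automatically orthogonal to $v$, i.e. it is a tangent-gradient step of the kind listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Weight normalization reparametrizes w = g * v / ||v||.
v = rng.standard_normal(10)
g = 2.0
a = rng.standard_normal(10)                 # toy linear loss L(w) = a . w

# Chain rule through w(v, g): dL/dv = (g/||v||) (I - v v^T / ||v||^2) a
P = np.eye(10) - np.outer(v, v) / (v @ v)
grad_v = (g / np.linalg.norm(v)) * (P @ a)

print(np.isclose(grad_v @ v, 0.0))          # True: the v-gradient is tangent to the sphere,
                                            # so ||v|| is unchanged to first order
```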

6.3 Margin maximizers

As we mentioned, in GD with unit norm constraint the normalized weights $V_k$ converge as $t \to \infty$ while $\rho$ grows. There may be trajectory-dependent, multiple alternative selections of the support vectors (SVs) during the course of the iteration while $\rho$ grows: each set of SVs may correspond to a max margin, minimum norm solution without being the global minimum norm solution. Because of Bezout-type arguments theory_II we expect multiple maxima. They should generically be degenerate even under the normalization constraints – which enforce each of the sets of weights to be on a unit hypersphere. Importantly, the normalization algorithms ensure control of the norm and thus of the generalization bound, even if they cannot ensure that the algorithm converges to the globally best minimum norm solution (this depends, for instance, on initial conditions). In summary

(informal statement)

The GD equations 13 and 14 converge to maximum margin solutions with unit norm.

6.4 Dynamics

In the appendices we discuss the dynamics of gradient descent in the continuous framework for a variety of losses and in the presence of regularization or normalization. Typically, normalization is similar to a vanishing regularization term.

The Lagrange multiplier case is a simple example (see section 6.1). For $t \to \infty$ the following equations – as many as the number of weights – have to be satisfied asymptotically:

$\rho \sum_{n=1}^{N} y_n\, e^{-\rho\, y_n \tilde f(V;x_n)}\, \frac{\partial \tilde f(V;x_n)}{\partial V_k} = 2\lambda_k V_k, \qquad (16)$

where $\lambda_k$ goes to zero at infinity at the same rate as the loss terms (see the special case of Equation (15)). This suggests that the weight matrices of layers $1$ to $K$ should be in relations of the type $W_{k+1}^{\top} W_{k+1} = W_k W_k^{\top}$ for linear multilayer nets; appropriately similar relations should hold for the rectifier nonlinearities. In other words, gradient descent under unit norm is biased towards balancing the weights of different layers, since this is the solution with minimum norm.

The Hessian of $\mathcal{L}$ w.r.t. $V$ tells us about the linearized dynamics around the asymptotic critical point of the gradient. The Hessian (see Appendix 18)

(17)

is in general degenerate, corresponding to an asymptotically degenerate hyperbolic equilibrium (biased towards minimum norm solutions if the rate of decay of $\lambda_t$ implements a Halpern iteration correctly). The number of degenerate directions of the gradient flow corresponds to the number of symmetries of the neural network, as discussed in Appendix 19. In the deep linear case, these would correspond to the freedom of applying opposite general linear transformations to neighboring layers. In the case of ReLU networks the situation becomes data-dependent.

Remark

For classification with exponential-type losses the Lagrange multiplier technique, WN and BN are trying to achieve approximately the same result – maximize the margin while constraining the norm. An even higher-level, unifying view of several different optimization techniques, including the case of regression, is to regard them as instances of Halpern iterations. Appendix 12 describes the technique. The gradient flow corresponds to an operator which is non-expansive. The fixed points of the flow are degenerate. Minimization with a regularization term in the weights that vanishes at the appropriate rate (Halpern iterations) converges to the minimum norm minimizer associated to the local minimum. Halpern iterations are a form of regularization with a vanishing regularization parameter (which is the form of regularization used to define the pseudoinverse). From this perspective, the Lagrange multiplier term can be seen as a Halpern term which “attracts” the solution towards zero norm. This corresponds to a local minimum norm solution for the unnormalized network (imagine for instance in 2D that there is a surface of zero loss with a boundary as in Figure 1). The minimum norm solution in the classification case corresponds to a maximum margin solution for the normalized network. Globally optimal generalization is not guaranteed, but generalization bounds such as Equation 10 are locally optimized. It should be emphasized, however, that it is not yet clear whether all the algorithms we mentioned implement the correct dependence of the Halpern term on the number of iterations. We will examine this issue in future work.

Figure 1: Landscape of the empirical loss with unnormalized weights. Suppose the empirical loss at the water level in the figure is $\epsilon$. Then there are various global minima, each with the same loss and different minimum norms. Because of the universality of deep networks from the point of view of function approximation, it seems likely that similar landscapes may be realizable (consider an approximator with the components of $W$ as parameters). It is however an open question whether overparametrization may typically induce “nicer” landscapes, without the many “gulfs” in the figure.

7 Generalization without unit norm constraints

Empirically it appears that GD and SGD converge to solutions that can generalize even without BN or WN or other techniques enforcing explicit unit norm constraints. Without explicit constraints, convergence is difficult for quite deep networks; generalization is usually not as good as with BN or WN but it still occurs. How is this possible?

The following result (see Appendix 14.3.1) seems to solve the puzzle: the unconstrained gradient descent dynamics – with $W_k = \rho_k V_k$ and $\|V_k\| = 1$, where $\rho = \prod_k \rho_k$ – is equivalent to the dynamics yielded by weight normalization in terms of $\rho_k$ and $V_k$. That is,

(informal statement) The standard dynamical system $\dot W_k = -\nabla_{W_k} L(W)$ is equivalent to the system in which the weights are normalized. Thus for both dynamics, normalization of the weights at the end of the iterations via $V_k = \frac{W_k}{\|W_k\|}$ yields the normalized classifier $\tilde f(V;x)$.

This means that $\tilde f$ becomes optimal during standard GD without the need for explicit control of the unit norm of $V_k$. Notice that the unconstrained dynamics of $\rho_k$ and $V_k$ defined by Equations 13 and 14 is consistent with the dynamics of the $W_k$.

Another interesting property of the dynamics of the $W_k$, which is shared with the dynamics of the $V_k$ under unit norm constraint, is suggested by recent work NIPS2018_7321 : the difference between the squares of the Frobenius norms of the weights of different layers does not change during gradient descent. This implies that if the weight matrices are all small at initialization, the gradient flow corresponding to gradient descent maintains approximately equal Frobenius norms across different layers, which is part of the constraint we enforce in an explicit way with the Lagrange multiplier or the WN technique. The observation of NIPS2018_7321 is easy to see in our framework. Consider Equation (12) for $\lambda_k = 0$, that is without norm constraint. Inspection of it shows that $\langle \nabla_{W_k} L(W), W_k\rangle$ is independent of $k$. It follows that

$\frac{d}{dt}\Big(\|W_k\|^2 - \|W_{k'}\|^2\Big) = 0 \quad \text{for all } k, k'. \qquad (18)$

Thus if we consider two of the layers, the following property holds: $\|W_k(t)\|^2 - \|W_{k'}(t)\|^2 = c_{k,k'}$ with $c_{k,k'} = \|W_k(0)\|^2 - \|W_{k'}(0)\|^2$. If $c_{k,k'}$ is small at initialization then the norms of the two layers will remain very similar under the gradient flow – a condition required by minimum norm solutions. A formal proof can be sketched as follows. Consider the gradient descent equations

$\dot W_k = -\frac{\partial L(W)}{\partial W_k} = \sum_{n=1}^{N} y_n\, e^{-y_n f(W;x_n)}\, \frac{\partial f(W;x_n)}{\partial W_k}. \qquad (19)$

The above dynamics induces the following dynamics on $\|W_k\|^2$, using the relation $\frac{d\|W_k\|^2}{dt} = 2\,\langle \dot W_k, W_k\rangle$. Thus

$\frac{d\|W_k\|^2}{dt} = 2\sum_{n=1}^{N} y_n\, e^{-y_n f(W;x_n)}\, \Big\langle \frac{\partial f(W;x_n)}{\partial W_k},\, W_k \Big\rangle = 2\sum_{n=1}^{N} y_n\, f(W;x_n)\, e^{-y_n f(W;x_n)} \qquad (20)$

because of the structural property (Equation 3). It follows that

$\frac{d\|W_k\|^2}{dt} = \frac{d\|W_{k'}\|^2}{dt} \quad \text{for all } k, k', \qquad (21)$

which implies that the rate of growth of $\|W_k\|^2$ is independent of $k$. If we assume that the layer norms are equal initially, they will remain equal while growing throughout training.
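A small numerical check of Equations (18)–(21) on a toy two-layer ReLU network (our own sketch, not the paper's code): under plain discrete GD on the exponential loss, the difference of squared Frobenius norms between the two layers is conserved up to discretization error.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# Tiny two-layer ReLU net f(W; x) = W2 @ relu(W1 @ x), exponential loss, plain GD.
W1 = 0.5 * rng.standard_normal((6, 4))
W2 = 0.5 * rng.standard_normal((1, 6))
X = rng.standard_normal((20, 4))
y = np.sign(X[:, 0])

lr = 0.01
diff0 = np.linalg.norm(W1) ** 2 - np.linalg.norm(W2) ** 2
for _ in range(2000):
    g1, g2 = np.zeros_like(W1), np.zeros_like(W2)
    for xn, yn in zip(X, y):
        h = relu(W1 @ xn)
        f = float(W2 @ h)
        c = -yn * np.exp(-yn * f) / len(X)                               # dL/df
        g2 += c * h[None, :]                                             # dL/dW2
        g1 += c * (W2.flatten() * (h > 0)).reshape(-1, 1) @ xn[None, :]  # dL/dW1
    W1 -= lr * g1
    W2 -= lr * g2

diff = np.linalg.norm(W1) ** 2 - np.linalg.norm(W2) ** 2
print(diff0, diff)   # nearly equal: the gap is conserved (exactly so for the continuous flow)
```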

Then, the minimization problem with $\lambda_k = 0$ is equivalent to the Lagrange multiplier problem of Equation (12), though the overall norm of $W$, that is the norm of its layers, is not explicitly controlled. In other words, the norms of the layers are balanced, thus avoiding the situation in which one layer may contribute to decreasing the loss by improving $\tilde f$ while another may achieve the same result by simply increasing its norm. Following the discussion in section 5.2, generalization depends on bounding the ratio of Rademacher complexity to the margin, $\frac{\mathbb{R}_N(\mathcal{F})}{\gamma}$. The balance-of-norms property allows us to cancel the dependence on the layer norms $\rho_k$ for all layers.

This property of exponential-type losses is the non-linear equivalent of Srebro's result for linear networks and is similar to the well-known property of the linear case: GD starting from zero or from very small weights converges to the minimum norm solution. It is important to emphasize that in the multilayer, nonlinear case we expect several maximum margin solutions (unlike the linear case), depending on initial conditions and the stochasticity of SGD.

Of course, other effects, in addition to the role of initialization and batch or weight normalization may be at work here, improving generalization. For instance, high dimensionality under certain conditions has been shown to lead to better generalization for certain interpolating kernels 2018arXiv180800387L ; 2018arXiv181211167R . Though this is still an open question, it seems likely that similar results may also be valid for deep networks.

Furthermore, commonly used weight decay with appropriate parameters can induce generalization. Typical implementations of data augmentation also eliminate the overparametrization problem: at each iteration of SGD only “new” data are used and depending on the number of iterations it is possible to obtain more training data than parameters. In any case, within this online framework, one expects convergence to the minimum of the expected risk (see Appendix 11) without the need to invoke generalization bounds.

Remarks

  • For a generic loss function such as the square loss and linear networks there is convergence to the minimum norm solution by GD for zero-norm initial conditions.

  • For exponential-type losses and linear networks in the case of classification, convergence is independent of initial conditions 2017arXiv171010345S . The reason is that what matters is the normalized solution $\frac{w}{\|w\|}$, and since $\|w\|$ grows without bound the contribution of the initial conditions becomes negligible.

  • The balance property of Equation (18) also holds for the square loss.

  • For exponential-type losses and one-homogeneous networks in the case of classification, the situation is similar, since what matters is the normalized network $\tilde f(V;x)$. With zero-norm initial conditions the norms of the layers are approximately equal and remain so during training. Notice that the degeneracy of the solutions of the gradient flow is reduced by the unit norm constraints on each of the layers. As we showed, the norm square of each layer grows – under unregularized gradient descent – at the same time-dependent rate. This means that the time derivative of the product of the squared norms does change with time. Thus a bounded product does not remain bounded even when divided by a common time-dependent scale, unless the norms of all layers are equal at initialization.

8 Discussion

Our results imply that multilayer, nonlinear, deep networks under gradient descent with norm constraint converge to maximum margin solutions. This is similar to the situation for linear networks. The prototypical (linear) example for over-parametrized deep networks is convergence of gradient descent to weights that represent the pseudoinverse of the input matrix.

We have to distinguish between square loss regression and classification via an exponential-type loss. In the case of square loss regression, NMGD converges to the minimum norm solution independently of initial conditions – under the assumption that the global minimum is achieved.

Consider now the case of classification by minimization of exponential losses using the Lagrange normalization algorithm. The main result is that the dynamical system in the normalized weights converges to a solution that (locally) maximizes the margin. We discuss the close relations between this algorithm and weight normalization algorithms, which are themselves related to batch normalization. All these algorithms are commonly used. The fact that the solution corresponds to a maximum margin solution under a fixed norm constraint also explains the puzzling behavior of Figure 3, in which batch normalization was used. The test classification error does not get worse when the number of parameters increases well beyond the number of training data, because the dynamical system is constrained to maximize the margin under unit norm of the weights, without necessarily minimizing the loss.

An additional implication of our results is that the effectiveness of batch normalization is based on more fundamental reasons than reducing covariate shifts (the properties described in 2018arXiv180511604S are fully consistent with our characterization in terms of a regularization-like effect). Controlling the norm of the weights is exactly what generalization bounds prescribe: GD with normalization (NMGD) is the correct way to do it. Normalization is closely related to Halpern iterations used to achieve a minimum norm solution.

The theoretical framework described in this paper leaves a number of important open problems. Does the empirical landscape have multiple global minima with different minimum norms (see Figure 1), as we suspect? Or is the landscape “nicer” for large overparametrization – as hinted in several recent papers (see for instance 2018arXiv180406561M and 2019arXiv190202880N )? Can one ensure convergence to the global empirical minimizer with global minimum norm? How? Are there conditions on the Lagrange multiplier term – and on corresponding parameters for weight and batch normalization – that ensure convergence to a maximum margin solution independently of initial conditions?

Acknowledgments

We thank Yuan Yao, Misha Belkin, Jason Lee and especially Sasha Rakhlin for illuminating discussions. Part of the funding is from Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

References

9 Experiments

Summary:

  • SGD easily finds global minima in CIFAR10, suggesting that under appropriate overparametrization it does not “see” any local minima in the empirical loss landscape.

  • Different initializations affect the final result (large initializations typically induce a larger final norm and a larger test error). It is significant that there is a dependence on initial conditions (differently from the linear case of [1]).

  • Similar to initialization, perturbations of the weights increase norm and increase test error.

  • The training loss of normalized nets predicts well the test performance of the same networks.

In the computer simulations shown in this section, we turn off all the “tricks” used to improve performance such as data augmentation, weight decay, etc. However, we keep batch normalization. We reduce in some of the experiments the size of the network or the size of the training set. As a consequence, performance is not state-of-the-art, but optimal performance is not the goal here (in fact the networks we use achieve state-of-the-art performance using standard setups). The expected risk was measured as usual by an out-of-sample test set.

The puzzles we want to explain are in Figures 2 and 3.

Figure 2: Generalization for Different number of Training Examples. (a) Generalization error in CIFAR and (b) generalization error in CIFAR with random labels. The DNN was trained by minimizing the cross-entropy loss and it is a 5-layer convolutional network (i.e., no pooling) with 16 channels per hidden layer. ReLU are used as the non-linearities between layers. The resulting architecture has approximately parameters. SGD with batch normalization was used with batch size for epochs for each point. Neither data augmentation nor regularization were used.
Figure 3:

Expected error in CIFAR-10 as a function of number of neurons.

The DNN is the same as in Figure 2. (a) Dependence of the expected error as the number of parameters increases. (b) Dependence of the cross-entropy risk as the number of parameters increases. There is some “overfitting” in the expected risk, though the peculiarities of the exponential loss function exaggerate it. The expected classification error does not increase here when increasing the number of parameters, because the product of the norms of the network is close to the minimum norm (here because of initialization).

A basic explanation for the puzzles is similar to the linear case: when the minima are degenerate the minimum norm minimizers are the best for generalization. The linear case corresponds to quadratic loss for a linear network shown in Figure 4.

Figure 4: A quadratic loss function in two parameters $w_1$ and $w_2$. The minimum has a degenerate Hessian with a zero eigenvalue. In the proposition described in the text, it represents the “generic” situation in a small neighborhood of zero minimizers with many zero eigenvalues – and a few positive eigenvalues – of the Hessian of a nonlinear multilayer network. In multilayer networks the loss function is likely to be a fractal-like surface with many degenerate global minima, each similar to a multidimensional version of the degenerate minimum shown here. For the cross-entropy loss, the degenerate valleys are sloped towards infinity.

Figure 5: Training and testing with the square loss for a linear network in the feature space (i.e. ) with a degenerate Hessian of the type of Figure 4. The feature matrix is a polynomial with degree 30. The target function is a sine function with frequency on the interval . The number of training point are while the number of test points are . The training was done with full gradient descent with step size for iterations. The weights were not perturbed in this experiment. The norm of the weights is shown on the right. Note that training was repeated 30 times and what is reported in the figure is the average train and test error as well as average norm of the weights over the 30 repetitions. There is overfitting in the test error.

In this very simple case we test our theoretical analysis with the following experiment. After convergence of GD, we apply a small random perturbation with unit norm to the parameters , then run gradient descent until the training error is again zero; this sequence is repeated times. We make the following predictions for the square loss:

  • The training error will go back to zero after each sequence of GD.

  • Any small perturbation of the optimum will be corrected by the GD dynamics to push back the non-degenerate weight directions to the original values. Since the components of the weights in the degenerate directions are in the null space of the gradient, running GD after each perturbation will not change the weights in those directions. Overall, the weights will change in the experiment.

  • Repeated perturbations of the parameters at convergence, each followed by gradient descent until convergence, will not increase the training error but will change the parameters, increase norms of some of the parameters and increase the associated test error. The norm of the projections of the weights in the null space undergoes a random walk.

Figure 6: Training and testing with the square loss for a linear network in the feature space (i.e. ) with a degenerate Hessian of the type of Figure 4. The target function is a sine function with frequency on the interval . The number of training points is while the number of test points is . For the first pair of plots the feature matrix is a polynomial with degree 39; points were sampled according to the Chebyshev nodes scheme to speed up training to reach zero train error. Training was done with full gradient descent with step size for iterations. Weights were perturbed every iterations and gradient descent was allowed to converge to zero training error (up to machine precision) after each perturbation. The weights were perturbed by the addition of Gaussian noise with mean and standard deviation . The perturbation was stopped halfway, at iteration . The norm of the weights is shown in the second plot. Note that training was repeated 29 times; the figure reports the average train and test error as well as the average norm of the weights over the repetitions. For the second pair of plots the feature matrix is a polynomial with degree 30. Training was done with full gradient descent with step size for iterations. The norm of the weights is shown in the fourth plot. Note that training was repeated 30 times; the figure reports the average train and test error as well as the average norm of the weights over the repetitions. The weights were not perturbed in this experiment.
Figure 7: The graph on the left shows training and testing loss for a linear network in the feature space (i.e. ) in the nondegenerate, quadratic convex case. The feature matrix is a polynomial with degree 4. The target function is a sine function with frequency on the interval . The number of training points is while the number of test points is . The training was done with full gradient descent with step size for iterations. The inset zooms in on the plot, showing the absence of overfitting. In the plot on the right, weights were perturbed every iterations and then gradient descent was allowed to converge to zero training error after each perturbation. The weights were perturbed by adding Gaussian noise with mean and standard deviation . The plot on the left had no perturbation. The norm of the weights is shown on the right. Note that training was repeated 30 times and what is reported in the figure is the average train and test error as well as the average norm of the weights over the 30 repetitions.

The same predictions also apply to the cross-entropy case, with the caveat that the weights increase even without perturbations, though more slowly. Previous experiments by [10] showed changes in the parameters and in the expected risk, consistent with our predictions above, which are further supported by the numerical experiments of Figure 8. In the case of cross-entropy the almost-zero-error valleys of the empirical risk function are slightly sloped downwards towards infinity, becoming flat only asymptotically.

Figure 8: We train a 5-layer convolutional neural network on CIFAR-10 with Gradient Descent (GD) on the cross-entropy loss, with and without perturbations. The main results are shown in the 3 subfigures in the bottom row. Initially, the network was trained with GD as normal. After it reaches 0 training classification error (after roughly 1800 epochs of GD), a perturbation is applied to the weights of every layer of the network. This perturbation is Gaussian noise with standard deviation being a fraction of that of the weights of the corresponding layer. From this point, random Gaussian noise with such standard deviations is added to every layer after every 100 training epochs. The empirical risk goes back to the original level after the perturbation, but the expected risk grows increasingly higher. As expected, the $L_2$ norm of the weights increases after each perturbation step. After 7500 epochs the perturbation is stopped. The left column shows the classification error. The middle column shows the cross-entropy risk on CIFAR during perturbations. The right column is the corresponding $L_2$ norm of the weights. The 3 subfigures in the top row show a control experiment where no perturbation is performed at all throughout training. The network has 4 convolutional layers (filter size , stride 2) and a fully-connected layer. The numbers of feature maps (i.e., channels) in the hidden layers are 16, 32, 64 and 128, respectively. Neither data augmentation nor regularization is performed.

The numerical experiments show, as predicted, that the behavior under small perturbations around a global minimum of the empirical risk for deep networks is similar to that of linear degenerate regression (compare Figure 8 with Figure 5). For the cross-entropy loss, the minimum of the expected risk may or may not occur at a finite number of iterations. If it does, it corresponds to an equivalent optimal (because of the “noise”) non-zero and non-vanishing regularization parameter. Thus a specific “early stopping” would be better than no stopping. The corresponding classification error, however, may not show overfitting.

Figure 9 shows the behavior of the loss in CIFAR in the absence of perturbations. This should be compared with Figure 5, which shows the case of an overparametrized linear network under quadratic loss corresponding to the multidimensional equivalent of the degenerate situation of Figure 4. The nondegenerate, convex case is shown in Figure 7.

Figure 9: Same as Figure 4 but without perturbations of weights. Notice that there is some overfitting in terms of the testing loss. Classification however is robust to this overfitting (see text).

Figure 10 shows the testing error for an overparametrized linear network optimized under the square loss. This is a special case in which the minimum norm solution is theoretically guaranteed by zero initial conditions, without NMGD.

Figure 10: Training and testing with the square loss for a linear network in the feature space (i.e. ) with a degenerate Hessian of the type of Figure 4. The feature matrix is a polynomial with increasing degree, from 1 to 300. The square loss is plotted vs the number of monomials, that is the number of parameters. The target function is a sine function with frequency on the interval . The numbers of training and test points were held fixed. The solution to the over-parametrized system was the minimum norm solution. More points were sampled at the edges of the interval (i.e. using Chebyshev nodes) to avoid exaggerated numerical errors. The figure shows how eventually the minimum norm solution overfits.
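The following sketch reproduces the flavor of the Figure 10 experiment with illustrative parameters (the target frequency, sample sizes and degrees are our choices, not the paper's): for each polynomial degree we take the minimum norm least-squares solution — the pseudoinverse solution that GD reaches from zero initialization — and report its norm and test error.

```python
import numpy as np

rng = np.random.default_rng(3)

def poly_features(x, degree):
    # Monomial feature map 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

x_train = np.sort(rng.uniform(-1, 1, 15)); y_train = np.sin(4 * np.pi * x_train)
x_test = np.linspace(-1, 1, 200);          y_test = np.sin(4 * np.pi * x_test)

for degree in [5, 15, 60, 150]:
    Phi = poly_features(x_train, degree)
    w = np.linalg.pinv(Phi) @ y_train            # minimum norm interpolant when overparametrized
    test_err = np.mean((poly_features(x_test, degree) @ w - y_test) ** 2)
    print(degree, round(np.linalg.norm(w), 2), round(test_err, 3))
```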

10 Minimal norm and maximum margin

We discuss the connection between maximum margin and minimal norm problems in binary classification. To do so, we reprise some classic reasoning used to derive support vector machines. We show it directly extends beyond linearly parametrized functions as long as there is a one-homogeneity property, namely $f(\alpha w; x) = \alpha\, f(w; x)$ for all $\alpha \ge 0$.

Given a training set of data points $(x_n, y_n)$, $n = 1,\dots,N$, where the labels are $y_n \in \{-1,+1\}$, the functional margin is

$\gamma(w) = \min_{n=1,\dots,N} y_n f(w; x_n). \qquad (22)$

If there exists $w$ such that the functional margin is strictly positive, then the training set is separable. We assume in the following that this is indeed the case. The maximum (max) margin problem is

$\max_{w}\; \min_{n} y_n f(w; x_n) \quad \text{subject to} \quad \|w\| = 1. \qquad (23)$

The latter constraint is needed to avoid trivial solutions in light of the one-homogeneity property. We next show that Problem (23) is equivalent to

$\min_{w}\; \|w\| \quad \text{subject to} \quad y_n f(w; x_n) \ge 1, \; n = 1,\dots,N. \qquad (24)$

To see this, we introduce a number of equivalent formulations. First, notice that the functional margin (22) can be equivalently written as

$\gamma(w) = \max\{\gamma \ge 0 \;:\; y_n f(w; x_n) \ge \gamma, \; n = 1,\dots,N\}.$

Then, the max margin problem (23) can be written as

$\max_{w,\, \gamma \ge 0}\; \gamma \quad \text{subject to} \quad y_n f(w; x_n) \ge \gamma, \; n = 1,\dots,N, \quad \|w\| = 1. \qquad (25)$

Next, we can incorporate the norm constraint noting that, using one-homogeneity,

$y_n f(w; x_n) = \|w\|\, y_n f\Big(\frac{w}{\|w\|}; x_n\Big),$

so that Problem (25) becomes

$\max_{w,\, \gamma \ge 0}\; \gamma \quad \text{subject to} \quad y_n f(w; x_n) \ge \gamma\, \|w\|, \; n = 1,\dots,N. \qquad (26)$

Finally, using again one-homogeneity, without loss of generality we can set $\gamma\,\|w\| = 1$, that is $\gamma = \frac{1}{\|w\|}$, and obtain the equivalent problem

$\max_{w}\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_n f(w; x_n) \ge 1, \; n = 1,\dots,N. \qquad (27)$

The result is then clear noting that maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|$.

11 Data augmentation and generalization with “infinite” data sets

In the case of batch learning, generalization guarantees on an algorithm are conditions under which the empirical error on the training set converges to the expected error, ideally with bounds that depend on the size $N$ of the training set. The practical relevance of this guarantee is that the empirical error is then a measurable proxy for the unknown expected error, and the gap between the two can be bounded. In the case of “pure” online algorithms such as SGD – in which the samples are drawn i.i.d. from the unknown underlying distribution – there is no training set per se, or equivalently the training set has infinite size. Under usual conditions on the loss function and the learning rate, SGD converges to the minimum of the expected risk. Thus, the proof of convergence towards the minimum of the expected risk bypasses the need for generalization guarantees. With data augmentation, most of the implementations – such as the PyTorch one – generate “new” examples at each iteration. This effectively extends the size of the finite training set for the purpose of guaranteeing convergence to the minimum of the expected risk. Thus existing proofs of the convergence of SGD provide the guarantee that it converges to the “true” expected risk when the size of the “augmented” training set increases with the number of iterations.

Notice that while the minimum value of the expected risk is unique, its minimizer does not need to be: the set of weights $W$ which provide global minima of the expected error is an equivalence class.

12 Halpern iterations: selecting minimum norm solution among degenerate minima

In this section we summarize a modification of gradient descent that we apply to the various problems of optimization under the square and exponential loss for one-layer and nonlinear, deep networks.

We are interested in the convergence of solutions of gradient descent dynamics and their stability properties. In addition to the standard dynamical system tools we also use closely related elementary properties of non-expansive operators. A reason is that they describe the step of numerical implementation of the continuous dynamical systems that we consider. More importantly, they provide iterative techniques that converge (in a convex set) to the minimum norm of the fixed points, even when the operators are not linear, independently of initial conditions.

Let us define an operator $T$ in a normed space $X$ with norm $\|\cdot\|$ as non-expansive if $\|Tx - Ty\| \le \|x - y\|$ for all $x, y$. Then the following result is classical ([45, 33]):

[45] Let $X$ be a strictly convex normed space. The set $F$ of fixed points of a non-expansive mapping $T: C \to C$, with $C$ a closed convex subset of $X$, is either empty or closed and convex. If it is not empty, it contains a unique element of smallest norm.

In our case $X = \mathbb{R}^D$. To fix ideas, consider gradient descent on the square loss. As discussed later and in several papers, the Hessian of the loss function $L(W)$ of a deep network with ReLUs has eigenvalues bounded from above (see for instance [46] and [47]), because the network is Lipschitz continuous, and bounded from below by zero at the global minimum. Thus, with an appropriate choice of the learning rate $\eta$, the gradient descent operator $T(W) = W - \eta\,\nabla_W L(W)$ is non-expanding and its set of fixed points is not empty, see Appendix 20. If we assume that the minimum is global and that there are no local minima but only saddle points, then the null vector is in $C$. Then the element of minimum norm can be found by iterative procedures (such as Halpern's method, see Theorem 1 in [33]) of the form

$x_{t+1} = \lambda_t x_0 + (1 - \lambda_t)\, T(x_t), \qquad (28)$

where the sequence $\lambda_t \in (0,1)$ satisfies conditions such as $\lim_{t\to\infty}\lambda_t = 0$ and $\sum_t \lambda_t = \infty$. (Notice that these iterative procedures are often part of the numerical implementation (see [48] and section 4.1) of discretized methods for solving a differential equation whose equilibrium points are the minimizers of a differentiable convex function. Note also that proximal minimization corresponds to backward Euler steps for numerical integration of a gradient flow. Proximal minimization can be seen as introducing quadratic regularization into a smooth minimization problem in order to improve convergence of some iterative method in such a way that the final result obtained is not affected by the regularization.)

In particular, the following holds

[33] For any $x_0 \in C$ the iteration $x_{t+1} = T(x_t)$ converges to one of the fixed points of $T$. The sequence $x_{t+1} = \lambda_t x_0 + (1-\lambda_t) T(x_t)$ with $x_0 = 0$, $\lambda_t \to 0$ and $\sum_t \lambda_t = \infty$ converges to the fixed point of $T$ with minimum norm.

The norm-minimizing GD update – NMGD in short – has the form

$W_{t+1} = (1 - \lambda_t)\big(W_t - \eta\, \nabla_W L(W_t)\big), \qquad (29)$

where $\eta$ is the learning rate and $\lambda_t = \frac{1}{t}$ (this is one of several choices).

It is an interesting question whether convergence to the minimum norm is independent of initial conditions and of perturbations. This may depend among other factors on the rate at which the Halpern term decays.
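A minimal sketch of a Halpern iteration on an operator chosen for illustration (not from the paper): $T$ projects onto an affine line, so it is non-expansive and its fixed points form a degenerate set; the plain iteration stops at an initialization-dependent fixed point, while the Halpern sequence with anchor $0$ converges to the minimum norm fixed point.

```python
import numpy as np

# T projects onto the affine line {x : x[0] + x[1] = 2}: non-expansive, with a whole
# line of fixed points; the minimum norm fixed point is (1, 1).
a, b = np.array([1.0, 1.0]), 2.0
T = lambda x: x - ((a @ x - b) / (a @ a)) * a

x_plain = np.array([5.0, -1.0])
x_halpern = np.array([5.0, -1.0])
for t in range(1, 20000):
    lam = 1.0 / (t + 1)                     # lambda_t -> 0, sum of lambda_t = infinity
    x_plain = T(x_plain)                    # stops at an initialization-dependent fixed point
    x_halpern = (1 - lam) * T(x_halpern)    # Halpern step with anchor x0 = 0
print(x_plain, x_halpern)                   # e.g. [4., -2.] versus approximately [1., 1.]
```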

13 Network minimizers under square and exponential loss

We consider one-layer and multilayer networks under the square loss and the exponential loss. Here are the main observations and results

  1. One-layer networks The Hessian is in general degenerate. Regularization with arbitrarily small $\lambda$ ensures independence from initial conditions for both the square and the exponential loss. In the absence of explicit regularization, GD converges to the minimum norm solution for zero initial conditions. With NMGD-type iterations GD converges to the minimum norm independently of initial conditions (this is similar to the result of [1], obtained with different assumptions and techniques). For the exponential loss, NMGD ensures convergence to the normalized solution that maximizes the margin (and that corresponds to the overall minimum norm solution), see Appendix 14.3. In the exponential loss case, weight normalization GD is degenerate since the data (support vectors) may not span the space of the weights.

  2. Deep networks, square loss The Hessian is in general degenerate, even in the presence of regularization (with fixed $\lambda$). NMGD-type iterations lead to convergence not only to the fixed points – as vanilla GD does – but to the (locally) minimum norm fixed point.

  3. Deep networks, exponential loss The Hessian is in general degenerate, even in the presence of regularization. NMGD-type iterations lead to convergence to the minimum norm fixed point associated with the global minimum.

  4. Implications of minimum norm for generalization in regression problems NMGD-based minimization ensures minimum norm solutions.

  5. Implications of minimum norm for classification

    For classification a typical margin bound is

    $L_0(\tilde f) \le \hat L_S(\tilde f) + c_1\,\frac{\mathbb{R}_N(\tilde{\mathcal{F}})}{\gamma} + c_2\,\sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (30)$

    which depends on the margin $\gamma$. $L_0(\tilde f)$ is the expected classification error; $\hat L_S(\tilde f)$ is the empirical loss of a surrogate loss such as the logistic. For a point $x_n$ the margin is $y_n \tilde f(V;x_n)$. Since the complexity term is fixed for the normalized class, the margin bound is optimized by effectively maximizing $\gamma$ on the “support vectors”. As shown in Appendix 10, maximizing the margin under the unit norm constraint is equivalent to minimizing the norm under the separability constraint.

Remarks

  • NMGD can be seen as a variation of regularization (that is, weight decay) obtained by requiring the regularization parameter $\lambda_t$ to decrease to zero. The theoretical reason for NMGD is that it ensures minimum norm or, equivalently, maximum margin solutions.

  • Notice that one of the definitions of the pseudoinverse of a linear operator corresponds to NMGD: it is the regularized solution to a degenerate minimization problem in the square loss in the limit $\lambda \to 0$.

  • The failure of regularization with a fixed $\lambda$ to induce hyperbolic solutions in the multi-layer case was surprising to us. Technically this is due to contributions to the non-diagonal parts of the Hessian from derivatives across layers and to the shift of the minimum.

14 One-layer networks

14.1 Square loss

For linear networks under the square loss, GD is a non-expansive operator. There are fixed points. The Hessian is degenerate. Regularization with arbitrarily small $\lambda$ ensures independence of initial conditions. Even in the absence of explicit regularization, GD converges to the minimum norm solution for zero initial conditions. Convergence to the minimum norm also holds with NMGD-type iterations, but now independently of initial conditions.

We consider linear networks with one layer and one scalar output, that is $f(w;x) = w^\top x$, because there is only one layer. Thus $w = \rho v$ with $\|v\| = 1$.

Consider

$L(w) = \sum_{n=1}^{N}\big(y_n - w^\top x_n\big)^2, \qquad (31)$

where $y_n$ is a bounded real-valued variable. Assume further that there exists a $d$-dimensional weight vector $w^*$ that fits all the training data, achieving zero loss on the training set, that is $y_n = (w^*)^\top x_n$ for $n = 1,\dots,N$.

  1. Dynamics The dynamics is

    $\dot w = -\nabla_w L(w) = 2\sum_{n=1}^{N}\big(y_n - w^\top x_n\big)\, x_n, \qquad (32)$

    with $w(0) = w_0$.

    The only components of the weights that change under the dynamics are in the vector space spanned by the examples $x_n$; components of the weights in the null space of the matrix of examples are invariant to the dynamics. Thus $w$ converges to the minimum norm solution if the dynamical system starts from zero weights (see also the numerical sketch after this list).

  2. The Jacobian of the flow – and Hessian of $-L$ – with respect to $w$ is

    $J = -2\sum_{n=1}^{N} x_n x_n^\top. \qquad (33)$

    The linearization of the dynamics around a solution $w^*$ for which $y_n = (w^*)^\top x_n$ yields

    $\dot{\delta w} = -2\sum_{n=1}^{N} x_n x_n^\top\, \delta w, \qquad (34)$

    where the associated loss is convex, since the Jacobian is minus the sum of auto-covariance matrices and thus is semi-negative definite. It is negative definite if the examples span the whole space, but it is degenerate, with some zero eigenvalues, if $N < d$ [49].

  3. Regularization If a regularization term $\lambda\|w\|^2$ is added to the loss, the fixed point shifts. The equation

    $\dot w = 2\sum_{n=1}^{N}\big(y_n - w^\top x_n\big)\, x_n - 2\lambda w = 0 \qquad (35)$

    gives for the equilibrium $w_\lambda$

    $w_\lambda = \Big(\sum_{n=1}^{N} x_n x_n^\top + \lambda I\Big)^{-1} \sum_{n=1}^{N} y_n x_n. \qquad (36)$

    The Hessian of the flow at $w_\lambda$ is $J_\lambda$ with

    $J_\lambda = -2\Big(\sum_{n=1}^{N} x_n x_n^\top + \lambda I\Big), \qquad (37)$

    which is always negative definite for any arbitrarily small fixed $\lambda > 0$. Thus the dynamics of the perturbations $\delta w$ around the equilibrium is given by

    $\dot{\delta w} = -2\Big(\sum_{n=1}^{N} x_n x_n^\top + \lambda I\Big)\,\delta w \qquad (38)$

    and is hyperbolic. Explicit regularization ensures the existence of a hyperbolic equilibrium for any $\lambda > 0$ at a finite $w_\lambda$. In the limit $\lambda \to 0$ the equilibrium converges to the minimum norm solution.

  4. NMGD The gradient flow corresponds to iterating the map $T(w) = w - \eta\,\nabla_w L(w)$ with an appropriate $\eta$. The map is non-expansive (see Appendix 20). There are fixed points ($w$ satisfying $\nabla_w L(w) = 0$) that are degenerate. Minimization using the NMGD method converges to the minimum norm minimizer.
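The numerical sketch referenced in item 1 above (sizes and learning rate are illustrative): for an underdetermined linear regression, GD on the square loss started from zero weights converges to the pseudoinverse (minimum norm) solution, with no NMGD term needed.

```python
import numpy as np

rng = np.random.default_rng(4)

# Underdetermined linear regression: N = 5 equations, d = 20 unknowns.
X, d = rng.standard_normal((5, 20)), 20
y = rng.standard_normal(5)

w = np.zeros(d)                          # zero initialization
for _ in range(20000):
    w -= 0.01 * (2 * X.T @ (X @ w - y))  # gradient of the square loss
print(np.allclose(w, np.linalg.pinv(X) @ y, atol=1e-4))  # True: GD from zero reaches the minimum norm solution
```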

14.2 Exponential loss

Linear networks under the exponential loss and GD show a growing Frobenius norm. On a compact domain ($\|w\| \le R$) the exponential loss is smooth with Lipschitz gradient and corresponds to a non-expansive operator $T$. Regularization with arbitrarily small $\lambda$ ensures convergence to a fixed point independent of initial conditions. GD with normalization and NMGD-type iterations converge to the minimum norm, maximum margin solution for separable data with degenerate Hessian.

Consider now the exponential loss. Even for a linear network the dynamical system associated with the exponential loss is nonlinear. While [1] gives a rather complete characterization of the dynamics, here we describe a different approach.

The exponential loss for a linear network is

$L(w) = \sum_{n=1}^{N} e^{-y_n w^\top x_n}, \qquad (39)$

where $y_n$ is a binary variable taking the value $+1$ or $-1$. Assume further that the $d$-dimensional weight vector $w$ separates correctly all the training data, achieving zero classification error on the training set, that is $y_n w^\top x_n > 0$ for all $n$. In some cases below (it will be clear from context) we incorporate $y_n$ into $x_n$.

  1. Dynamics The dynamics is