In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus the ability to develop principled improvements, has lagged behind. A satisfactory theoretical characterization of deep learning is now emerging. It covers the following areas: 1) approximation properties of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques: why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized?
1.1 When Can Deep Networks Avoid the Curse of Dimensionality?
We start with the first set of questions, summarizing results in (HierarchicalKernels2015, Hierarchical2015, poggio2015December) and (Mhaskaretal2016, MhaskarPoggio2016). The main result is that deep networks have the theoretical guarantee, which shallow networks do not have, that they can avoid the curse of dimensionality for an important class of problems, corresponding to compositional functions, that is, functions of functions. An especially interesting subset of such compositional functions are the hierarchically local compositional functions, where all the constituent functions are local in the sense of bounded small dimensionality. The deep networks that can approximate them without the curse of dimensionality are of the deep convolutional type, though, importantly, weight sharing is not necessary.
Implications of the theorems likely to be relevant in practice are:
a) Deep convolutional architectures have the theoretical guarantee that they can be much better than one-layer architectures such as kernel machines for certain classes of problems; b) the problems for which certain deep networks are guaranteed to avoid the curse of dimensionality (for a nice review see (Donoho00high-dimensionaldata)) correspond to input-output mappings that are compositional with local constituent functions; c) the key aspect of convolutional networks that can give them an exponential advantage is not weight sharing but locality at each level of the hierarchy.
1.2 Related Work
Several papers in the ’80s focused on the approximation power and learning properties of one-hidden-layer networks (called shallow networks here). Very little appeared on multilayer networks (but see (mhaskar1993approx, mhaskar1993neural, chui1994neural, chui1996, Pinkus1999)). By now, several papers (poggio03mathematics, MontufarBengio2014, DBLP:journals/corr/abs-1304-7045) have appeared. (Anselmi2014, anselmi2015theoretical, poggioetal2015, LiaoPoggio2016, Mhaskaretal2016) derive new upper bounds for the approximation by deep networks of certain important classes of functions which avoid the curse of dimensionality. The upper bound for the approximation by shallow networks of general functions was well known to be exponential. It seems natural to assume that, since there is no general way for shallow networks to exploit a compositional prior, lower bounds for the approximation by shallow networks of compositional functions should also be exponential. In fact, examples of specific functions that cannot be represented efficiently by shallow networks have been given, for instance in (Telgarsky2015, SafranShamir2016, Theory_I). An interesting review of approximation of univariate functions by deep networks has recently appeared (2019arXiv190502199D).
1.3 Degree of approximation
The general paradigm is as follows. We are interested in determining how complex a network ought to be to theoretically guarantee approximation of an unknown target function $f$ up to a given accuracy $\epsilon > 0$. To measure the accuracy, we need a norm $\|\cdot\|$ on some normed linear space $\mathbb{X}$. As we will see, the norm used in the results of this paper is the sup norm, in keeping with the standard choice in approximation theory. As it turns out, the results of this section require the sup norm in order to be independent from the unknown distribution of the input data.
Let $V_N$ be the set of all networks of a given kind with $N$ units (which we take to be our measure of the complexity of the approximant network). The degree of approximation is defined by $\mathrm{dist}(f, V_N) = \inf_{P \in V_N} \|f - P\|$. For example, if $\mathrm{dist}(f, V_N) = \mathcal{O}(N^{-\gamma})$ for some $\gamma > 0$, then a network with complexity $N = \mathcal{O}(\epsilon^{-1/\gamma})$ will be sufficient to guarantee an approximation with accuracy at least $\epsilon$. The only a priori information on the class of target functions $f$ is codified by the statement that $f \in W$ for some subspace $W \subseteq \mathbb{X}$. This subspace is a smoothness and compositional class, characterized by the parameters $m$ and $d$ ($d = 2$ in the example of Figure 1; it is the size of the kernel in a convolutional network).
1.4 Shallow and deep networks
This section characterizes conditions under which deep networks are “better” than shallow networks in approximating functions. Thus we compare shallow (one-hidden-layer) networks with deep networks as shown in Figure 1. Both types of networks use the same small set of operations: dot products, linear combinations, a fixed nonlinear function of one variable, possibly convolution and pooling. Each node in the networks corresponds to a node in the graph of the function to be approximated, as shown in the figure. A unit is a neuron which computes $(\langle x, w \rangle + b)_+$, where $w$ is the vector of weights on the vector input $x$. Both $w$ and the real number $b$ are parameters tuned by learning. We assume here that each node in the networks computes the linear combination of $r$ such units, $\sum_{i=1}^{r} c_i (\langle x, t_i \rangle + b_i)_+$. Notice that in our main example of a network corresponding to a function with a binary tree graph, the resulting architecture is an idealized version of deep convolutional neural networks described in the literature. In particular, it has only one output at the top, unlike most of the deep architectures with many channels and many top-level outputs. Correspondingly, each node computes a single value instead of multiple channels, using the combination of several units. However, our results hold also for these more complex networks (see (Theory_I)).
The sequence of results is as follows.
Both shallow (a) and deep (b) networks are universal, that is, they can approximate arbitrarily well any continuous function of $n$ variables on a compact domain. The result for shallow networks is classical.
We consider a special class of functions of $n$ variables on a compact domain that are hierarchical compositions of local functions, such as
$$f(x_1, \ldots, x_8) = h_3\big(h_{21}(h_{11}(x_1, x_2), h_{12}(x_3, x_4)),\; h_{22}(h_{13}(x_5, x_6), h_{14}(x_7, x_8))\big).$$
The structure of the function in Figure 1 b) is represented by a graph of the binary tree type, reflecting dimensionality $d = 2$ for the constituent functions $h$. In general, $d$ is arbitrary but fixed and independent of the dimensionality $n$ of the compositional function $f$. (Theory_I) formalizes the more general compositional case using directed acyclic graphs.
The approximation of functions with a compositional structure can be achieved with the same degree of accuracy by deep and shallow networks, but the number of parameters is much smaller for the deep networks than for the shallow network with equivalent approximation accuracy.
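The hierarchical composition of local functions described above can be made concrete with a small sketch. It assumes, purely for illustration, that every node of the binary tree applies the same two-argument constituent function; the helper name `compose_binary_tree` is hypothetical and not taken from the cited work.

```python
from typing import Callable, Sequence


def compose_binary_tree(h: Callable[[float, float], float],
                        inputs: Sequence[float]) -> float:
    """Evaluate a binary-tree compositional function.

    Every node applies the same two-argument constituent function `h`.
    This is an illustrative simplification: in general each node may
    have its own constituent function.
    """
    values = list(inputs)
    if len(values) < 2 or len(values) & (len(values) - 1):
        raise ValueError("the number of inputs must be a power of 2, at least 2")
    while len(values) > 1:
        # Combine neighbouring pairs: one level of the hierarchy.
        values = [h(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]


# Eight inputs give three levels and 4 + 2 + 1 = 7 constituent evaluations.
result = compose_binary_tree(lambda a, b: a * b + 1.0, [1.0] * 8)
```

Each constituent function only ever sees two arguments, whatever the total number of inputs, which is the locality property the text refers to.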
We approximate functions with networks in which the activation nonlinearity is a smoothed version of the so-called ReLU, originally called ramp by Breiman and given by $\sigma(x) = x_+ = \max(0, x)$. The architecture of the deep networks reflects the function graph, with each node being a ridge function comprising one or more neurons.
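Let $I^n = [-1, 1]^n$ and let $\mathbb{X} = C(I^n)$ be the space of all continuous functions on $I^n$, with $\|f\| = \max_{x \in I^n} |f(x)|$. Let $\mathcal{S}_{N,n}$ denote the class of all shallow networks with $N$ units of the form
$$x \mapsto \sum_{k=1}^{N} a_k \sigma(\langle w_k, x \rangle + b_k),$$
where $w_k \in \mathbb{R}^n$ and $b_k, a_k \in \mathbb{R}$. The number of trainable parameters here is $(n+2)N$. Let $m \geq 1$ be an integer, and $W_m^n$ be the set of all functions of $n$ variables with continuous partial derivatives of orders up to $m < \infty$ such that $\|f\| + \sum_{1 \leq |k|_1 \leq m} \|D^k f\| \leq 1$, where $D^k$ denotes the partial derivative indicated by the multi-integer $k \geq 1$, and $|k|_1$ is the sum of the components of $k$.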
For the hierarchical binary tree network, the analogous spaces are defined by considering the compact set $W_m^{n,2}$ to be the class of all compositional functions $f$ of $n$ variables with a binary tree architecture and constituent functions $h$ in $W_m^2$. We define the corresponding class of deep networks $\mathcal{D}_{N,2}$ to be the set of all deep networks with a binary tree architecture, where each of the constituent nodes is in $\mathcal{S}_{M,2}$, where $N = |V| M$, $V$ being the set of non-leaf vertices of the tree. We note that in the case when $n$ is an integer power of $2$, the total number of parameters involved in a deep network in $\mathcal{D}_{N,2}$ is $4N$.
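The two parameter counts quoted above can be checked by direct bookkeeping. The sketch below assumes a complete binary tree over a power-of-two number of inputs, so that there are one fewer non-leaf vertices than inputs; the function names are illustrative only.

```python
def shallow_param_count(n_inputs: int, n_units: int) -> int:
    """Parameters of a shallow network: each unit has n_inputs weights,
    one bias and one output coefficient."""
    return (n_inputs + 2) * n_units


def binary_tree_param_count(n_inputs: int, units_per_node: int) -> int:
    """Parameters of a binary-tree deep network whose nodes are shallow
    networks with two inputs and `units_per_node` units each."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of 2, at least 2")
    non_leaf_vertices = n_inputs - 1  # complete binary tree
    return non_leaf_vertices * shallow_param_count(2, units_per_node)


# With 8 inputs and 10 units per node there are 7 nodes, so the total
# number of units is 70 and the parameter count is 4 * 70.
deep_total = binary_tree_param_count(8, 10)
```

Because every node has only two inputs, the per-unit cost is a constant four parameters, independent of the overall input dimension.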
The first theorem is about shallow networks.
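Theorem 1. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be infinitely differentiable, and not a polynomial. For $f \in W_m^n$ the complexity of shallow networks that provide accuracy at least $\epsilon$ is $N = \mathcal{O}(\epsilon^{-n/m})$.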
The estimate of Theorem 1 is the best possible if the only a priori information we are allowed to assume is that the target function belongs to $W_m^n$. The exponential dependence on the dimension $n$ of the number $\epsilon^{-n/m}$ of parameters needed to obtain an accuracy $\mathcal{O}(\epsilon)$ is known as the curse of dimensionality. Note that the constants involved in $\mathcal{O}$ in the theorems will depend upon the norms of the derivatives of $f$ as well as $\sigma$.
Our second and main theorem is about deep networks with smooth activations (preliminary versions appeared in (poggio2015December, Hierarchical2015, Mhaskaretal2016)). We formulate it in the binary tree case for simplicity, but it extends immediately to functions that are compositions of constituent functions of a fixed number of variables $d$ (in convolutional networks $d$ corresponds to the size of the kernel).
Theorem 2. For $f \in W_m^{n,2}$ consider a deep network with the same compositional architecture and with an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable, and not a polynomial. The complexity of the network to provide approximation with accuracy at least $\epsilon$ is $N = \mathcal{O}((n-1)\epsilon^{-2/m})$.
The proof is in (Theory_I). The assumptions on $\sigma$ in the theorems are not satisfied by the ReLU function $x \mapsto x_+$, but they are satisfied by smoothing the function in an arbitrarily small interval around the origin. The result of the theorem can be extended to the non-smooth ReLU (Theory_I).
In summary, when the only a priori assumption on the target function is about the number of derivatives, then to guarantee an accuracy of $\epsilon$, we need a shallow network with $\mathcal{O}(\epsilon^{-n/m})$ trainable parameters. If we assume a hierarchical structure on the target function as in Theorem 2, then the corresponding deep network yields a guaranteed accuracy of $\epsilon$ with $\mathcal{O}(\epsilon^{-2/m})$ trainable parameters. Note that Theorem 2 applies to all $f$ with a compositional architecture given by a graph which corresponds to, or is a subgraph of, the graph associated with the deep network, in this case the graph corresponding to $W_m^{n,d}$.
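To get a feeling for the gap between the two estimates, the following sketch evaluates the standard scalings, an accuracy raised to minus the input dimension over the smoothness for shallow networks and to minus two over the smoothness (times the number of nodes) for binary-tree deep networks. It assumes all hidden constants equal to one, which the theorems do not claim, so the numbers indicate orders of magnitude only.

```python
import math


def shallow_units(eps: float, n: int, m: int) -> float:
    """Shallow-network scaling eps^(-n/m), constants set to 1 (assumption)."""
    return eps ** (-n / m)


def deep_units(eps: float, n: int, m: int) -> float:
    """Binary-tree deep-network scaling (n - 1) * eps^(-2/m),
    constants set to 1 (assumption)."""
    return (n - 1) * eps ** (-2 / m)


# Accuracy 0.1, smoothness m = 2, growing input dimension n.
for n in (8, 32, 128):
    print(n, f"{shallow_units(0.1, n, 2):.3g}", f"{deep_units(0.1, n, 2):.3g}")
```

The shallow count grows exponentially in the input dimension, while the deep count grows only linearly, through the number of nodes.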
2 The Optimization Landscape of Deep Nets with Smooth Activation Function
The main question in the optimization of deep networks is to characterize the landscape of the empirical loss in terms of its global minima and the local critical points of the gradient.
2.1 Related work
There are many recent papers studying optimization in deep learning. For optimization we mention work based on the idea that noisy gradient descent (DBLP:journals/corr/Jin0NKJ17, DBLP:journals/corr/GeHJY15, pmlr-v49-lee16, s.2018when) can find a global minimum. More recently, several authors studied the dynamics of gradient descent for deep networks with assumptions about the input distribution or on how the labels are generated. They obtain global convergence for some shallow neural networks (Tian:2017:AFP:3305890.3306033, s8409482, Li:2017:CAT:3294771.3294828, DBLP:conf/icml/BrutzkusG17, pmlr-v80-du18b, DBLP:journals/corr/abs-1811-03804). Some local convergence results have also been proved (Zhong:2017:RGO:3305890.3306109, DBLP:journals/corr/abs-1711-03440, 2018arXiv180607808Z). The most interesting such approach is (DBLP:journals/corr/abs-1811-03804), which focuses on minimizing the training loss and proving that randomly initialized gradient descent can achieve zero training loss (see also (NIPS2018_8038, du2018gradient, DBLP:journals/corr/abs-1811-08888)). In summary, there is by now an extensive literature on optimization that formalizes and refines, for different special cases and for the discrete domain, our results of (theory_II, theory_IIb).
2.2 Degeneracy of global and local minima under the exponential loss
The first part of the argument of this section relies on the obvious fact (see (theory_III)) that, for ReLU networks under the hypothesis of an exponential-type loss function, there are no local minima that separate the data: the only critical points of the gradient that separate the data are the global minima.
Notice that the global minima are at $\rho \to \infty$, where $\rho$ denotes the norm of the weights, when the exponential is zero. As a consequence, the Hessian is identically zero with all eigenvalues being zero. On the other hand, any point of the loss at a finite $\rho$ has nonzero Hessian: for instance in the linear case the Hessian is proportional to $\sum_{n} x_n x_n^T$. The local minima which are not global minima must misclassify. How degenerate are they?
Simple arguments (theory_III) suggest that the critical points which are not global minima cannot be completely degenerate. We thus have the following
Under the exponential loss, global minima are completely degenerate with all eigenvalues of the Hessian ($W$ of them, with $W$ being the number of parameters in the network) being zero. The other critical points of the gradient are less degenerate, with at least one, and typically $N$, nonzero eigenvalues.
For the general case of non-exponential loss and smooth nonlinearities instead of the ReLU, the following conjecture has been proposed (theory_III):
Conjecture: For appropriate overparametrization, there are a large number of global zero-error minimizers which are degenerate; the other critical points, saddles and local minima, are generically (that is, with probability one) degenerate on a set of much lower dimensionality.
2.3 SGD and Boltzmann Equation
The second part of our argument (in (theory_IIb)) is that SGD concentrates in probability on the most degenerate minima. The argument is based on the similarity between a Langevin equation and SGD and on the fact that the Boltzmann distribution is exactly the asymptotic “solution” of the stochastic differential Langevin equation and also of SGDL, defined as SGD with added white noise (see for instance (raginskyetal17)). The Boltzmann distribution is
$$p(f) = \frac{1}{Z} e^{-\frac{L}{T}}, \qquad (3)$$
where $Z$ is a normalization constant, $L(f)$ is the loss and $T$ reflects the noise power. The equation implies that SGDL prefers degenerate minima relative to non-degenerate ones of the same depth. In addition, among two minimum basins of equal depth, the one with a larger volume is much more likely in high dimensions, as shown by the simulations in (theory_IIb). Taken together, these two facts suggest that SGD selects degenerate minimizers corresponding to larger isotropic flat regions of the loss. SGDL then shows concentration of its asymptotic distribution, Equation 3, because of the high dimensionality.
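The volume effect can be illustrated with a closed-form toy calculation. The sketch below assumes two isotropic quadratic basins of equal depth and different widths (a modelling choice made here, not taken from the cited simulations) and compares the Boltzmann probability mass each one carries, using the Gaussian integral in place of a numerical simulation.

```python
import math


def log_basin_mass(depth: float, width: float, dim: int, temperature: float) -> float:
    """Log of the integral of exp(-L/T) over R^dim for the quadratic basin
    L(x) = depth + |x|^2 / (2 * width^2), via the Gaussian integral."""
    return (-depth / temperature
            + 0.5 * dim * math.log(2.0 * math.pi * temperature * width ** 2))


def log_mass_ratio(width_a: float, width_b: float, dim: int,
                   temperature: float = 1.0, depth: float = 0.0) -> float:
    """Log ratio of the Boltzmann masses of two basins of equal depth."""
    return (log_basin_mass(depth, width_a, dim, temperature)
            - log_basin_mass(depth, width_b, dim, temperature))


# A basin twice as wide carries 2^dim times the mass.
for dim in (1, 10, 1000):
    print(dim, log_mass_ratio(2.0, 1.0, dim))
```

In this toy model the preference for the wider basin grows exponentially with the dimension, which is one way to read the concentration statement above.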
Together (theory_II) and (theory_III) suggest the following
Conjecture: For appropriate overparametrization of the deep network, SGD selects with high probability the global minimizers of the empirical loss, which are highly degenerate.
3 Generalization
Recent results by (2017arXiv171010345S) illuminate the apparent absence of “overfitting” (see Figure 4) in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions and without explicit regularization. Here we discuss the case of nonlinear multilayer DNNs under exponential-type losses, for several variations of the basic gradient descent algorithm. The main results are:
classical uniform convergence bounds for generalization suggest a form of complexity control on the dynamics of the weight directions: minimize a surrogate loss subject to a unit norm constraint;
gradient descent on the exponential loss with an explicit unit norm constraint is equivalent to a well-known gradient descent algorithm, weight normalization, which is closely related to batch normalization;
unconstrained gradient descent on the exponential loss yields a dynamics with the same critical points as weight normalization: the dynamics implicitly respect a unit $L_2$ norm constraint on the directions of the weights.
We observe that several of these results directly apply to kernel machines for the exponential loss under the separability/interpolation assumption, because kernel machines are one-homogeneous.
3.1 Related work
A number of papers have studied gradient descent for deep networks (NIPS2017_6836, DBLP:journals/corr/abs-1811-04918, Arora2019FineGrainedAO). Close to the approach summarized here (details are in (theory_III)) is the paper (Wei2018OnTM). Its authors study generalization assuming a regularizer because they are, like us, interested in normalized margin. Unlike their assumption of an explicit regularization, we show here that commonly used techniques, such as weight and batch normalization, in fact minimize the surrogate loss margin while controlling the complexity of the classifier without the need to add a regularizer or to use weight decay. Surprisingly, we will show that even standard gradient descent on the weights implicitly controls the complexity through an “implicit” unit norm constraint. Two very recent papers ((2019arXiv190507325S) and (DBLP:journals/corr/abs-1906-05890)) develop an elegant but complicated approach based on margin maximization, which leads to some of the same results of this section (and many more). The important question of which conditions are necessary for gradient descent to converge to the maximum of the margin is studied by (2019arXiv190507325S) and (DBLP:journals/corr/abs-1906-05890). Our approach does not need the notion of maximum margin, but our theorem 3 establishes a connection with it and thus with the results of (2019arXiv190507325S) and (DBLP:journals/corr/abs-1906-05890). Our main goal here (and in (theory_III)) is to achieve a simple understanding of where the complexity control underlying generalization is hiding in the training of deep networks.
3.2 Deep networks: definitions and properties
We define a deep network with layers with the usual coordinate-wise scalar activation functions as the set of functions , where the input is , the weights are given by the matrices , one per layer, with matching dimensions. We sometimes use the symbol as a shorthand for the set of matrices . For simplicity we consider here the case of binary classification in which takes scalar values, implying that the last layer matrix is . The labels are . The weights of hidden layer are collected in a matrix of size . There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function in this section is the ReLU activation.
For ReLU activations the following important positive one-homogeneity property holds: $\sigma(z) = \frac{\partial \sigma(z)}{\partial z} z$. A consequence of one-homogeneity is a structural lemma (Lemma 2.1 of (DBLP:journals/corr/abs-1711-01530)), $\sum_{i,j} W_k^{i,j} \frac{\partial f(x)}{\partial W_k^{i,j}} = f(x)$, where $W_k$ is here the vectorized representation of the weight matrices for each of the different layers (each matrix is a vector).
For the network, homogeneity implies , where with the matrix norm . Another property of the Rademacher complexity of ReLU networks that follows from homogeneity is where , is the class of neural networks described above.
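Both consequences of homogeneity can be checked numerically on a tiny bias-free ReLU network. The sketch below is illustrative only: the two-layer architecture, the specific weights and the linear output layer are assumptions made here. It verifies that scaling one layer by a positive constant scales the output by the same constant, and that the weighted sum of partial derivatives over one layer returns the output (estimated as the derivative of the output with respect to a scale factor on that layer, by central differences).

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def network(weights, x):
    """Bias-free ReLU network; the last layer is linear (assumption of this sketch)."""
    h = list(x)
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)[0]


def scale_layer(weights, k, c):
    """Copy of `weights` with layer k multiplied by the scalar c."""
    return [[[c * w for w in row] for row in W] if i == k else W
            for i, W in enumerate(weights)]


def euler_sum(weights, k, x, h=1e-3):
    """Central-difference estimate of sum_ij W_k[i][j] * df/dW_k[i][j],
    computed as d/dc f(..., c * W_k, ...) at c = 1."""
    up = network(scale_layer(weights, k, 1.0 + h), x)
    down = network(scale_layer(weights, k, 1.0 - h), x)
    return (up - down) / (2.0 * h)


W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[1.0, -1.0]]
weights = [W1, W2]
x = [1.0, 1.0]
f = network(weights, x)
```

For this example `f` evaluates to -1.5, scaling either layer by 3 gives -4.5, and `euler_sum` returns a value close to `f` for both layers.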
We define ; is the associated class of normalized neural networks (we call with the understanding that ). Note that and that the definitions of , and all depend on the choice of the norm used in normalization.
In the case of training data that can be separated by the networks . We will sometimes write as a shorthand for .
3.3 Uniform convergence bounds: minimizing a surrogate loss under norm constraint
Classical generalization bounds for regression (Bousquet2003) suggest that minimizing the empirical loss of a loss function such as the cross-entropy subject to constrained complexity of the minimizer is a way to attain generalization, that is, an expected loss which is close to the empirical loss:
The following generalization bounds apply to with probability at least :
where is the expected loss, is the empirical loss, is the empirical Rademacher average of the class of functions , measuring its complexity; are constants that depend on properties of the Lipschitz constant of the loss function, and on the architecture of the network.
Thus minimizing a surrogate function such as the cross-entropy (which becomes the logistic loss in the binary classification case) under a constraint on the Rademacher complexity will minimize an upper bound on the expected classification error, because such surrogate functions are upper bounds on the 0-1 function. We can choose a class of functions with normalized weights and write and . One can choose any fixed as a (Ivanov) regularization-type tradeoff.
In summary, the problem of generalization may be approached by minimizing the exponential loss (more generally an exponential-type loss, such as the logistic and the cross-entropy) under a unit norm constraint on the weight matrices, since we are interested in the directions of the weights:
where we write using the homogeneity of the network. As it will become clear later, gradient descent techniques on the exponential loss automatically increase to infinity. We will typically consider the sequence of minimizations over for a sequence of increasing . The key quantity for us is and the associated weights ; is in a certain sense an auxiliary variable, a constraint that is progressively relaxed.
In the following we explore the implications for deep networks of this classical approach to generalization.
3.3.1 Remark: minimization of an exponential-type loss implies margin maximization
Though not critical for our approach to the question of generalization in deep networks it is interesting that constrained minimization of the exponential loss implies margin maximization. This property relates our approach to the results of several recent papers (2017arXiv171010345S, 2019arXiv190507325S, DBLP:journals/corr/abs-1906-05890). Notice that our theorem 3 as in (DBLP:conf/nips/RossetZH03) is a sufficient condition for margin maximization. Necessity is not true for general loss functions.
To state the margin property more formally, we adapt to our setting a different result due to (DBLP:conf/nips/RossetZH03) (they consider for a linear network a vanishing regularization term whereas we have for nonlinear networks a set of unit norm constraints). First we recall the definition of the empirical loss with an exponential loss function . We define the margin of , that is .
Then our margin maximization theorem (proved in (theory_III)) takes the form
Consider the set of corresponding to
where the norm is a chosen norm and is the empirical exponential loss. For each layer consider a sequence of increasing . Then the associated sequence of defined by Equation 6, converges for to the maximum margin of , that is to .
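The margin maximization property can be observed on a toy linear problem. The sketch below assumes a two-dimensional linear classifier with a unit-norm direction at a given angle and a scale factor, and two separable points already multiplied by their labels; all names and the data are illustrative. A grid search over the first quadrant (where the best direction lies for this data) finds the direction minimizing the exponential loss at each scale.

```python
import math

# Label-multiplied points y_n * x_n of a separable toy problem (assumption).
Z = [(2.0, 0.0), (0.0, 1.0)]


def direction(theta):
    return (math.cos(theta), math.sin(theta))


def exp_loss(theta, rho):
    """Empirical exponential loss of the classifier rho * <v(theta), x>."""
    v = direction(theta)
    return sum(math.exp(-rho * (v[0] * z[0] + v[1] * z[1])) for z in Z)


def margin(theta):
    """Smallest label-multiplied output of the unit-norm classifier."""
    v = direction(theta)
    return min(v[0] * z[0] + v[1] * z[1] for z in Z)


def argmin_angle(rho, steps=20000):
    """Grid search for the unit-norm direction minimizing the loss at scale rho."""
    grid = (0.5 * math.pi * i / steps for i in range(steps + 1))
    return min(grid, key=lambda t: exp_loss(t, rho))


# For this data the two margins 2*cos(theta) and sin(theta) are equal at
# tan(theta) = 2, which is the maximum margin direction.
THETA_MAX_MARGIN = math.atan(2.0)

for rho in (1.0, 5.0, 50.0, 200.0):
    theta = argmin_angle(rho)
    print(rho, abs(theta - THETA_MAX_MARGIN), margin(theta))
```

As the scale increases, the constrained minimizer of the exponential loss approaches the maximum margin direction, in line with the theorem.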
3.4 Minimization under unit norm constraint: weight normalization
The approach is then to minimize the loss function , with , subject to , that is, under a unit norm constraint for the weight matrix at each layer (if then is the Frobenius norm), since are the directions of the weights which are the relevant quantity for classification. The minimization is understood as a sequence of minimizations for a sequence of increasing . Clearly these constraints imply the constraint on the norm of the product of weight matrices for any norm (because any induced operator norm is a sub-multiplicative matrix norm). The standard choice for a loss function is an exponential-type loss such as the cross-entropy, which for binary classification becomes the logistic function. We study here the exponential because it is simpler and retains all the basic properties.
There are several gradient descent techniques that, given the unconstrained optimization problem, transform it into a constrained gradient descent problem. To provide the background, let us formulate the standard unconstrained gradient descent problem for the exponential loss as it is used in practical training of deep networks:
where is the weight matrix of layer . Notice that, since the structural property implies that at a critical point we have , the only critical points of this dynamics that separate the data (i.e. ) are global minima at infinity. Of course for separable data, while the loss decreases asymptotically to zero, the norm of the weights increases to infinity, as we will see later. Equations 7 define a dynamical system in terms of the gradient of the exponential loss .
The set of gradient-based algorithms enforcing a unit-norm constraint (845952) comprises several techniques that are equivalent for small values of the step size. They are all good approximations of the true gradient method. One of them is the Lagrange multiplier method; another is the tangent gradient method based on the following theorem:
(845952) Let denote a vector norm that is differentiable with respect to the elements of and let be any vector function with finite norm. Then, calling , the equation
with , describes the flow of a vector that satisfies for all .
In particular, a form for is , the gradient update in a gradient descent algorithm. We call the tangent gradient transformation of . In the case of we replace in Equation 8 with because . This gives and
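For the Euclidean norm, the tangent gradient idea reduces to projecting the update onto the tangent space of the sphere at the current point. The sketch below assumes the standard orthogonal projector, the identity minus the outer product of the current vector with itself divided by its squared norm, and a constant update vector; both are illustrative choices, and the flow is integrated with plain forward Euler steps.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def tangent_gradient(u, g):
    """Apply S = I - u u^T / ||u||^2 to g: the component of g tangent to
    the sphere through u."""
    coeff = dot(u, g) / dot(u, u)
    return [gi - coeff * ui for ui, gi in zip(u, g)]


def flow(u0, g, dt=1e-3, steps=1000):
    """Forward-Euler integration of du/dt = S g for a constant vector g."""
    u = list(u0)
    for _ in range(steps):
        step = tangent_gradient(u, g)
        u = [ui + dt * si for ui, si in zip(u, step)]
    return u


u0 = [0.6, 0.8]           # unit norm starting point
g = [1.0, 2.0]            # constant update direction (assumption)
projected = tangent_gradient(u0, g)
u_end = flow(u0, g)
norm_end = math.sqrt(dot(u_end, u_end))
```

The projected update is orthogonal to the current vector, so the continuous flow keeps the norm fixed; with finite Euler steps the norm drifts only at second order in the step size.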
Consider now the empirical loss written in terms of and instead of , using the change of variables defined by but without imposing a unit norm constraint on . The flows in can be computed as and , with given by Equations 7.
We now enforce the unit norm constraint on by using the tangent gradient transform on the flow. This yields
Notice that the dynamics above follows from the classical approach of controlling the Rademacher complexity of during optimization (suggested by bounds such as Equation 4). The approach and the resulting dynamics for the directions of the weights may seem different from the standard unconstrained approach in training deep networks. It turns out, however, that the dynamics described by Equations 9 is the same as the dynamics of weight normalization.
The technique of Weight normalization (SalDied16) was originally proposed as a small improvement on standard gradient descent “to reduce covariate shifts”. It was defined for each layer in terms of , as
3.5 Generalization with hidden complexity control
Empirically it appears that GD and SGD converge to solutions that can generalize even without batch or weight normalization. Convergence may be difficult for quite deep networks and generalization may not be as good as with batch normalization but it still occurs. How is this possible?
We study the dynamical system under the reparametrization with . We consider for each weight matrix the corresponding “vectorized” representation in terms of vectors . We use the following definitions and properties (for a vector ):
Define ; thus with . Also define .
The following relations are easy to check:
The gradient descent dynamic system used in training deep networks for the exponential loss is given by Equation 7. Following the chain rule for the time derivatives, the dynamics for is exactly (see (theory_III)) equivalent to the following dynamics for and :
The key point here is that the dynamics of includes a unit norm constraint: using the tangent gradient transform will not change the equation because .
As separate remarks, notice that if for , separates all the data, , that is diverges to with . In the 1-layer network case the dynamics yields asymptotically. For deeper networks, this is different. (theory_III) shows (for one support vector) that the product of weights at each layer diverges faster than logarithmically, but each individual layer diverges slower than in the 1-layer case. The norm of each layer grows at the same rate , independent of . The dynamics has stationary or critical points given by
where . We examine later the linear one-layer case in which case the stationary points of the gradient are given by and of course coincide with the solutions obtained with Lagrange multipliers. In the general case the critical points correspond for to degenerate zero “asymptotic minima” of the loss.
To understand whether there exists an implicit complexity control in standard gradient descent of the weight directions, we check whether there exists an $L_p$ norm for which unconstrained normalization is equivalent to constrained normalization.
From Theorem 4 we expect the constrained case to be given by the action of the following projector onto the tangent space:
The constrained Gradient Descent is then
These two dynamical systems are clearly different for generic reflecting the presence or absence of a regularization-like constraint on the dynamics of .
As we have seen however, for the 1-layer dynamical system obtained by minimizing in and with under the constraint , is the weight normalization dynamics
which is quite similar to the standard gradient equations
The two dynamical systems differ only by a factor in the equations. However, the critical points of the gradient for the flow, that is the point for which , are the same in both cases since for any and thus is equivalent to . Hence, gradient descent with unit -norm constraint is equivalent to the standard, unconstrained gradient descent but only when . Thus
The standard dynamical system used in deep learning, defined by , implicitly respects a unit norm constraint on with . Thus, under an exponential loss, if the dynamics converges, the represent the minimizer under the unit norm constraint.
Thus standard GD implicitly enforces the norm constraint on , consistently with Srebro’s results on implicit bias of GD. Other minimization techniques such as coordinate descent may be biased towards different norm constraints.
3.6 Linear networks and rates of convergence
The linear () networks case (2017arXiv171010345S) is an interesting example of our analysis in terms of and dynamics. We start with unconstrained gradient descent, that is with the dynamical system
If gradient descent in converges to at finite time, satisfies , where with positive coefficients and are the support vectors (see (theory_III)). A solution then exists (, the pseudoinverse of , since is a vector, is given by ). On the other hand, the operator in associated with equation 19 is non-expanding, because . Thus there is a fixed point which is independent of initial conditions (Ferreira1996) and unique (in the linear case)
The rates of convergence of the solutions and , derived in a different way in (2017arXiv171010345S), may be read out from the equations for and . It is easy to check that a general solution for is of the form . A similar estimate for the exponential term gives . Assume for simplicity a single support vector . We claim that a solution for the error , since converges to , behaves as . In fact we write and plug it in the equation for in 20. We obtain (assuming normalized input )
which has the form . Assuming of the form we obtain . Thus the error indeed converges as .
A similar analysis for the weight normalization equations 17 considers the same dynamical system with a change in the equation for , which becomes
This equation differs by a factor from equation 20. As a consequence equation 21 is of the form , with a general solution of the form . In summary, GD with weight normalization converges faster to the same equilibrium than standard gradient descent: the rate for is vs .
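The slow, logarithmic growth of the weight under unconstrained gradient descent on the exponential loss can be checked in the simplest possible setting. The sketch below assumes a single data point equal to one with label one and a scalar weight, so that the gradient flow has time derivative equal to the exponential of minus the weight and, starting from zero, the closed-form solution is the logarithm of one plus the elapsed time. It illustrates only the unconstrained case, not the weight normalization rate.

```python
import math


def gd_exponential_loss(steps: int, lr: float = 0.01, w0: float = 0.0) -> float:
    """Gradient descent on L(w) = exp(-w), i.e. one point x = 1 with label y = 1."""
    w = w0
    for _ in range(steps):
        w += lr * math.exp(-w)  # minus the gradient of L is exp(-w)
    return w


steps, lr = 100_000, 0.01
w_final = gd_exponential_loss(steps, lr)
# Solution of dw/dt = exp(-w) with w(0) = 0, evaluated at t = steps * lr.
w_closed_form = math.log(steps * lr + 1.0)
print(w_final, w_closed_form)
```

After a total time of 1000 the weight has only reached about 6.9: the loss keeps decreasing, but the norm diverges very slowly.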
Our goal was to find . We have seen that various forms of gradient descent enforce different paths in increasing that empirically have different effects on convergence rate. It will be an interesting theoretical and practical challenge to find the optimal way, in terms of generalization and convergence rate, to grow .
Our analysis of simplified batch normalization (theory_III) suggests that several of the same considerations that we used for weight normalization should apply (in the linear one layer case BN is identical to WN). However, BN differs from WN in the multilayer case in several ways, in addition to weight normalization: it has for instance separate normalization for each unit, that is for each row of the weight matrix at each layer.
A main difference between shallow and deep networks is in terms of approximation power or, in equivalent words, of the ability to learn good representations from data based on the compositional structure of certain tasks. Unlike shallow networks, deep local networks – in particular convolutional networks – can avoid the curse of dimensionality in approximating the class of hierarchically local compositional functions. This means that for such class of functions deep local networks represent an appropriate hypothesis class that allows good approximation with a minimum number of parameters. It is not clear, of course, why many problems encountered in practice should match the class of compositional functions. Though we and others have argued that the explanation may be in either the physics or the neuroscience of the brain, these arguments are not rigorous. Our conjecture at present is that compositionality is imposed by the wiring of our cortex and, critically, is reflected in language. Thus compositionality of some of the most common visual tasks may simply reflect the way our brain works.
Optimization turns out to be surprisingly easy to perform for overparametrized deep networks because SGD will converge with high probability to global minima that are typically much more degenerate for the exponential loss than other local critical points.
More surprisingly, gradient descent yields generalization in classification performance, despite overparametrization and even in the absence of explicit norm control or regularization, because standard gradient descent in the weights enforces an implicit unit ($L_2$) norm constraint on the directions of the weights in the case of exponential-type losses.
In summary, it is tempting to conclude that the practical success of deep learning has its roots in the almost magical synergy of unexpected and elegant theoretical properties of several aspects of the technique: the deep convolutional network architecture itself, its overparametrization, the use of stochastic gradient descent, the exponential loss, the homogeneity of the ReLU units and of the resulting networks.
Of course many problems remain open on the way to developing a full theory and, especially, in translating it to new architectures. More detailed results are needed in approximation theory, especially for densely connected networks. Our framework for optimization is missing at present a full classification of local minima and their dependence on overparametrization for general loss functions. The analysis of generalization should include an analysis of convergence of the weights for multilayer networks (see (2019arXiv190507325S) and (DBLP:journals/corr/abs-1906-05890)). A full theory would also require an analysis of the trade-off between approximation and estimation error, relaxing the separability assumption.
We are grateful to Sasha Rakhlin and Nate Srebro for useful suggestions about the structural lemma and about separating critical points. Part of the funding is from the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.