1 Introduction
Modern machine learning applications involve datasets of increasing dimensionality, complexity and size, which in turn motivate the use of high-dimensional, nonlinear models, as illustrated by many deep learning algorithms across computer vision, speech and natural language understanding. The prevalent strategy for learning is to rely on Stochastic Gradient Descent (SGD) methods, which typically operate on non-convex objectives. In this context, an outstanding goal is to provide a theoretical framework that explains under what conditions – relating input data distribution, choice of architecture and choice of optimization scheme – this setup will be successful.
More precisely, let $\{\Phi(\cdot;\theta) : \theta \in \mathbb{R}^P\}$ denote a model class parametrized by $\theta \in \mathbb{R}^P$, which in the case of Neural Networks (NNs) contains the aggregated weights across all layers. In a supervised learning setting, this model is deployed on some data $(X, Y)$, a random variable taking values in $\mathbb{R}^n \times \mathbb{R}^m$, to predict targets $Y$ given features $X$, and its risk for a given $\theta$ is

$$L(\theta) = \mathbb{E}\,\ell(\Phi(X;\theta), Y), \tag{1}$$

where $\ell$ is a convex loss, such as the square loss or the logistic loss. In the following we refer to (1) as the risk, the energy or the loss interchangeably. The aim is to find $\theta^* \in \arg\min_\theta L(\theta)$, and this is attempted in practice by running the SGD iteration on the parameter:

$$\theta_{k+1} = \theta_k - \eta_k \nabla_\theta\, \ell(\Phi(X_k;\theta_k), Y_k),$$

where the pairs $(X_k, Y_k)$ are drawn i.i.d. from the data distribution. Under some technical conditions, the expected gradient $\mathbb{E}\,\nabla L(\theta_k)$ is known to converge to $0$ [7].
Understanding the nature of such stationary points – and therefore the landscape of the loss function – is a task of fundamental importance for understanding the performance of SGD.
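As a toy illustration of the iteration above (with our own choice of architecture, step size and synthetic data, none of which come from the paper), the following sketch runs minibatch SGD on the empirical square loss of a small two-layer tanh network:

```python
import numpy as np

# Toy illustration of the SGD iteration (our own setup: a small two-layer
# tanh network, square loss, synthetic realizable data).
rng = np.random.default_rng(0)
n, p, N = 3, 8, 200
X = rng.normal(size=(N, n))
Y = np.tanh(X @ rng.normal(size=(n, 1)))      # realizable target

W = 0.5 * rng.normal(size=(p, n))             # first layer, shape (p, n)
u = 0.5 * rng.normal(size=(p, 1))             # second layer, shape (p, 1)

def loss(W, u):
    return float(np.mean((np.tanh(X @ W.T) @ u - Y) ** 2))

def grads(W, u, xb, yb):
    h = np.tanh(xb @ W.T)                     # hidden activations, (B, p)
    r = h @ u - yb                            # residuals, (B, 1)
    gu = 2 * h.T @ r / len(xb)
    gW = 2 * ((r @ u.T) * (1 - h ** 2)).T @ xb / len(xb)
    return gW, gu

loss_start = loss(W, u)
for _ in range(2000):
    idx = rng.integers(0, N, size=32)         # i.i.d. minibatch
    gW, gu = grads(W, u, X[idx], Y[idx])
    W -= 0.1 * gW
    u -= 0.1 * gu
loss_end = loss(W, u)
```

Whether such a run ends near a global minimum is precisely the landscape question studied in this paper.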
Whereas there is a growing literature analyzing the behavior of SGD on non-convex objectives [43, 28, 25, 44], we focus here on properties of the optimization problem above that are algorithm independent. Many authors have attempted to characterize the landscape of the loss function (1) by studying its critical points. Global optimality results have been obtained for NN architectures with linear activations [26, 29, 46], quadratic activations [41, 18] and some more general nonlinear activations, under appropriate regularity assumptions [42, 34, 20]. Other insights have been obtained by leveraging tools from the complexity analysis of spin glasses [11] and from random matrix theory [35]. Other analyses studied the goodness of the initialization of the parameter values [15, 37, 19] or topological properties of the loss (1), such as the connectivity of sublevel sets [17, 21]. A common factor shared by the works cited above (and by common practice) is that over-parametrisation of the model class (i.e. increasing the number of parameters beyond what is strictly necessary) often leads to improved performance, despite the potential increase in generalization error.

Each model class defines a functional space, whose complexity a priori increases with the dimensionality of the parameter space. Whereas several authors have studied these models through the lens of approximation theory [13, 12, 5] by focusing on specific aspects of the parametrisation (such as the depth of the network), in this work we explore another hypothesis, namely that over-parametrisation remedies the complexity of that functional space, often leading to loss functions without poor local minima.
Our approach is inspired by Freeman and Bruna [21], and is related to recent work that also explores convexifications of neural networks [1, 47, 4]. Our analysis focuses mostly on the class of two-layer neural networks, with a hidden layer of size $p$, and covers both empirical and population risk landscapes. A given activation function $\sigma$ determines a functional space. In essence, our work identifies notions of intrinsic dimension of this functional space, and establishes the following facts:

If the hidden layer size $p$ is at least equal to the upper intrinsic dimension, then the resulting loss landscape is free of poor local minima, independently of the data distribution;

If $p$ is smaller than the lower intrinsic dimension, then there exist data distributions yielding arbitrarily poor local minima.
We articulate the notion of poor local minima via what we call spurious valleys, defined as connected components of the sublevel sets that do not contain a global minimum. Upper and lower intrinsic dimensions define only two scenarios: either (i) both are finite, enabling the positive results; or (ii) both are infinite, implying the negative results. Moreover, case (i) only occurs for polynomial activation functions or when the data distribution is discrete, corresponding to generic empirical risk minimization. The negative results cover many classes of activation functions with infinite intrinsic dimension. In particular, they generalize previously known negative results (such as for leaky ReLUs) [46, 38] to a far wider class of activations. While in general the upper and lower intrinsic dimensions may not match, we show that in some cases (linear and quadratic networks) the gap between the positive and negative results can be closed by improving on the former.
The negative results are worst-case in nature, and leave open the question of how complex a 'typical' energy landscape corresponding to a generic data distribution is. We answer this question by showing that, even if spurious valleys may appear in general, in practice they are easily avoided from random initializations, up to a low energy threshold which approaches the global minimum at a rate inversely proportional to the hidden layer size, up to log factors. This fact is shown for networks with homogeneous activations and generic data distributions, and it is based on properties of random kernel quadrature rules [2].
Many other types of analysis of gradient-based optimization algorithms for NNs have been considered in the literature. For example, Ge et al. [24] proved convergence of GD on a modified loss; Shamir [40] compared optimization properties of residual networks with respect to linear models; Dauphin et al. [16] argued that the issues arising in the optimization of NN architectures are due to the presence of saddle points in the loss function rather than spurious local minima. Optimization landscapes have also been studied in contexts other than NN training, such as low-rank problems [22], matrix completion [23], problems arising in semidefinite programming [9, 3] and implicit generative modeling [6].
The rest of the paper is structured as follows. Section 2 formally introduces the notion of spurious valleys and explains why this is a relevant concept from the optimization point of view. It also defines the intrinsic dimensions of a network (Section 2.1). In Section 3 we state our main positive results (Theorem 7) and discuss two settings where they bear fruit: polynomial activation functions and empirical risk minimization. For the case of linear and quadratic activations, we improve on our general result: for the linear case, we prove that Theorem 7 holds without any assumptions on the distribution of the data or on the size/rank of any variables (which extends previous results on the optimization of linear NNs [29, 46]), and for the quadratic case we recover results in line with the current literature [41, 18]. Section 4 is dedicated to constructions of worst-case scenarios for activations with infinite lower intrinsic dimension. We then show, in Section 5, that even if spurious valleys may exist, they tend to be confined to regimes of low risk. Some discussion is reported in Section 6.
1.1 Notation
We introduce the notation used throughout the rest of the paper. For any integer $n$ we denote $[n] = \{1, \dots, n\}$ and, if $m < n$, $[m:n] = \{m, \dots, n\}$. We denote scalar valued variables as lowercase non-bold; vector valued variables as lowercase bold; matrix and tensor valued variables as uppercase bold. Given a vector $\mathbf{v}$, we denote its components as $v_i$; given a matrix $\mathbf{W}$, we denote its rows as $\mathbf{w}_i$; given a tensor $\mathbf{T}$, we denote its components as $T_{i_1 \cdots i_k}$. Given some vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$, the tensor product $\mathbf{v}_1 \otimes \cdots \otimes \mathbf{v}_k$ denotes the $k$-dimensional tensor whose components are given by $(\mathbf{v}_1)_{i_1} \cdots (\mathbf{v}_k)_{i_k}$; given a vector $\mathbf{v}$, we denote $\mathbf{v}^{\otimes k} = \mathbf{v} \otimes \cdots \otimes \mathbf{v}$ ($k$ times). $\mathbf{I}_n$ denotes the identity matrix and $\mathbf{e}_1, \dots, \mathbf{e}_n$ the standard basis in $\mathbb{R}^n$. For any random variables (r.v.'s) $X$ and $Y$ with values in $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively, we denote $\Sigma_X = \mathbb{E}[XX^T]$ and $\Sigma_{XY} = \mathbb{E}[YX^T]$. For every integer $n$, we denote by $GL_n$, $O_n$ and $SO_n$, respectively, the general linear group, the orthogonal group and the special orthogonal group of real $n \times n$ matrices. We denote by $\mathrm{Sym}^k(\mathbb{R}^n)$ the space of order-$k$ symmetric tensors on $\mathbb{R}^n$. For any $T \in \mathrm{Sym}^k(\mathbb{R}^n)$, we define the symmetric rank [14] as the minimal $r$ such that $T = \sum_{i=1}^{r} \lambda_i \mathbf{v}_i^{\otimes k}$; we denote by $q_{n,k}$ the maximal symmetric rank over $\mathrm{Sym}^k(\mathbb{R}^n)$. Finally, $\mathbb{S}^{n-1}$ denotes the $(n-1)$-dimensional sphere $\{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| = 1\}$.

2 Problem setting
Let $X, Y$ be two r.v.'s taking values in $\mathbb{R}^n$ and $\mathbb{R}^m$, representing the input and output data, respectively. We consider loss functions of the form

$$L(\theta) = \mathbb{E}\,\ell(\Phi(X;\theta), Y), \tag{2}$$

where $\ell(\cdot, y)$ is convex for every $y$ (e.g. the square loss). For every $\theta$, the function $\Phi(\cdot;\theta)$ models the dependence of the output on the input as $Y \approx \Phi(X;\theta)$. We focus on two-layer NN functions $\Phi$, i.e. of the form

$$\Phi(x;\theta) = U\,\sigma(Wx), \qquad \theta = (W, U), \tag{3}$$

where $W \in \mathbb{R}^{p \times n}$ and $U \in \mathbb{R}^{m \times p}$. Here $p$ represents the width of the hidden layer and $\sigma : \mathbb{R} \to \mathbb{R}$ is a continuous element-wise activation function.
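The architecture above can be transcribed directly; the sketch below (shapes, names and the choice of tanh are ours) evaluates such a network:

```python
import numpy as np

# Direct transcription of the two-layer architecture:
#   Phi(x; W, U) = U @ sigma(W @ x),
# with W of shape (p, n), U of shape (m, p), sigma element-wise.
def phi(x, W, U, sigma=np.tanh):
    return U @ sigma(W @ x)

rng = np.random.default_rng(1)
n, p, m = 4, 10, 2
W = rng.normal(size=(p, n))
U = rng.normal(size=(m, p))
x = rng.normal(size=n)
out = phi(x, W, U)                 # an m-dimensional prediction
```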
The loss function $L$ is, in general, a non-convex object; it may present spurious (i.e. non-global) local minima. In this work, we characterize $L$ by determining the absence or presence of spurious valleys, as defined below.
Definition 1.
For all $c \in \mathbb{R}$ we define the sublevel set of $L$ as $\Omega_L(c) = \{\theta : L(\theta) \le c\}$. We define a spurious valley as a connected component of a sublevel set $\Omega_L(c)$ which does not contain a global minimum of the loss $L$.
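A minimal numerical picture of this definition, on a toy one-dimensional "loss" of our own choosing (not one arising from a network): at a level between the value of a local minimum and the barrier separating it from the global minimum, the sublevel set splits into two connected components, and the one without the global minimum is a spurious valley.

```python
import numpy as np

# Toy 1-d function (our example): global minimum near x = -0.78, a higher
# local minimum near x = 0.62.  At level c between the local minimum's value
# and the separating barrier, {f <= c} has two connected components; the
# right one contains no global minimum, hence it is a spurious valley.
def f(x):
    return x**4 - x**2 + 0.3 * x

xs = np.linspace(-2.0, 2.0, 4001)
c = -0.03
inside = f(xs) <= c
# a component starts wherever `inside` switches from False to True
n_components = int(inside[0]) + int(np.sum(inside[1:] & ~inside[:-1]))
x_global = xs[np.argmin(f(xs))]    # location of the global minimum
```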
Since, in practice, the loss (2) is minimized with a gradient-descent-based algorithm, the absence of spurious valleys is a desirable property if we wish the algorithm to converge to an optimal parameter. It is easy to see that not having spurious valleys is equivalent to the following property:

P.1 Given any initial parameter $\theta_0$, there exists a continuous path $t \in [0,1] \mapsto \theta_t$ such that:

(a) $\theta_{t=0} = \theta_0$;

(b) $\theta_{t=1} \in \arg\min_\theta L(\theta)$;

(c) the function $t \mapsto L(\theta_t)$ is non-increasing.
As pointed out in Freeman and Bruna [21], this implies that $L$ has no strict spurious (i.e. non-global) local minima. The absence of generic (i.e. non-strict) spurious local minima is guaranteed if the path is such that the function $t \mapsto L(\theta_t)$ is strictly decreasing. For many activation functions used in practice (such as the ReLU $\sigma(t) = \max\{0, t\}$), the parameter $\theta$ determining the function $\Phi(\cdot;\theta)$ is determined only up to the action of a symmetry group (e.g., in the case of the ReLU, $\sigma$ is a positively homogeneous function). This prevents strict minima: for any value of the parameter $\theta$ there exists an (often large) manifold through $\theta$ along which the loss function is constant. Absence of spurious valleys for the loss (2) implies that it is always possible to move from any point in the parameter space to a global minimum, without increasing the loss.

2.1 Intrinsic dimension of a network
A central theme of this work is that the absence of spurious valleys is related to the complexity of the functional space defined by the network architecture. We therefore define two measures of such complexity, which we will use to show, respectively, positive and negative results in this regard.
To simplify the discussion, we introduce some notation which we will use throughout the rest of the paper. Let $\sigma$ be a continuous activation function. For every $w \in \mathbb{R}^n$ we denote by $\sigma_w$ the function $\sigma_w(x) = \sigma(w \cdot x)$. We refer to each $\sigma_w$ as a filter function. If $X$ is a r.v. taking values in $\mathbb{R}^n$, we denote by $L^2(X)$ the space of square integrable functions on $\mathbb{R}^n$ w.r.t. the probability measure induced by the r.v. $X$. We then define the two following functional spaces: the space of (one-dimensional output) functions modeled by the network architecture, and the space of ($n$-dimensional) input data distributions for which the filter functions have finite second moment. We finally define $V_{\sigma,X} \subseteq L^2(X)$ as the linear space spanned by the functions $\sigma_w$ for $w \in \mathbb{R}^n$.
Definition 2.
Let $\sigma$ be a continuous activation function and $X$ a r.v. We define $\dim^*(\sigma, X) = \dim V_{\sigma,X}$, where $\dim V$ denotes the dimension of a linear subspace $V \subseteq L^2(X)$, as the upper intrinsic dimension of the pair $(\sigma, X)$. We define the level-$n$ upper intrinsic dimension of $\sigma$ as $\dim_n^*(\sigma) = \sup_X \dim^*(\sigma, X)$, where the supremum is over r.v.'s $X$ with values in $\mathbb{R}^n$.
The upper intrinsic dimension defined above is therefore the dimension of the functional space spanned by the filter functions or, equivalently, of the image of the map $w \mapsto \sigma_w$. Notice that $\dim^*(\sigma, X) \le \dim_n^*(\sigma)$. In particular, if the distribution of $X$ is discrete, i.e. concentrated on a finite number of points $x_1, \dots, x_N$, then $\dim^*(\sigma, X) \le N$. Otherwise, if the distribution is not discrete, then $\dim^*(\sigma, X)$ may be infinite.
The level-$n$ upper intrinsic dimension is the dimension of the functional linear space $V_{\sigma,X}$ when no constraint is imposed by the data distribution. We note that if $X$ is a r.v. with almost surely (a.s.) positive density w.r.t. the Lebesgue measure on $\mathbb{R}^n$, then $\dim^*(\sigma, X) = \dim_n^*(\sigma)$.
The following lemma characterizes all the cases in which the upper intrinsic dimension is finite.
Lemma 3.
Let $\sigma$ be a continuous activation function and $X$ a r.v. with non-discrete distribution such that the filter functions $\sigma_w$ are square integrable. If $\sigma$ is a polynomial of degree $k$, then
$$\dim^*(\sigma, X) \le \dim_n^*(\sigma) \le \sum_{j=0}^{k} \binom{n+j-1}{j} < \infty.$$
Otherwise (i.e. if $\sigma$ is not a polynomial) it holds $\dim^*(\sigma, X) = \infty$.
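The dichotomy of Lemma 3 can be probed numerically: sampling many filters and data points, the rank of the matrix of filter values stabilizes at a finite value for a polynomial activation but keeps growing for a non-polynomial one. The sketch below (dimensions, seed and tolerance are our choices) contrasts $\sigma(t) = t^2$ with the ReLU:

```python
import numpy as np

# Probing Lemma 3: estimate the dimension of span{x -> sigma(w . x)} by the
# rank of the matrix [sigma(w_i . x_j)] over random filters and points.
rng = np.random.default_rng(3)
n, n_filters, n_points = 3, 60, 200
W = rng.normal(size=(n_filters, n))
X = rng.normal(size=(n_points, n))
G = W @ X.T                                   # inner products w_i . x_j

rank_quad = int(np.linalg.matrix_rank(G ** 2, tol=1e-8))          # sigma(t) = t^2
rank_relu = int(np.linalg.matrix_rank(np.maximum(G, 0.0), tol=1e-8))
sym_dim = n * (n + 1) // 2                    # dim Sym^2(R^n) = 6 for n = 3
```

For the quadratic activation the rank saturates at the dimension of symmetric matrices, while for the ReLU it grows with the number of filters and points.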
We then define the lower intrinsic dimension, which corresponds to the concept of 'how many hidden neurons are needed to represent a generic function of $V_{\sigma,X}$'.

Definition 4.
Let $\sigma$ be a continuous activation function and $X$ a r.v. We define $\dim_*(\sigma, X)$, the lower intrinsic dimension of the pair $(\sigma, X)$, as the minimal width $p$ such that every function of $V_{\sigma,X}$ can be represented by a network of the form (3) with $p$ hidden neurons; here equalities and inclusions of subsets of $L^2(X)$ are intended up to closure. We define the level-$n$ lower intrinsic dimension of $\sigma$ as $\dim_{*,n}(\sigma) = \sup_X \dim_*(\sigma, X)$.
If $\dim_*(\sigma, X)$ is finite, then it corresponds to the minimum number of hidden neurons needed to represent any function of $V_{\sigma,X}$ with the NN architecture (3). Clearly, this implies that $\dim_*(\sigma, X) \le \dim^*(\sigma, X)$ for every continuous activation function $\sigma$ and any r.v. $X$. As with the upper intrinsic dimension, we note that if $X$ is a r.v. with a.s. positive density w.r.t. the Lebesgue measure on $\mathbb{R}^n$, then $\dim_*(\sigma, X) = \dim_{*,n}(\sigma)$.
In the case of homogeneous polynomial activations $\sigma(t) = t^k$ with $k$ a positive integer, the level-$n$ lower intrinsic dimension of $\sigma$ coincides with the notion of (maximal) symmetric tensor rank.
Lemma 5.
Let $\sigma(t) = t^k$, with $k$ a positive integer. Then $\dim_{*,n}(\sigma)$ equals the maximal symmetric rank of tensors in $\mathrm{Sym}^k(\mathbb{R}^n)$.
Finally, the next lemma implies that for most non-polynomial activation functions of practical interest, the lower intrinsic dimension is infinite.
Lemma 6.
Let $\sigma$ be a continuous activation function satisfying mild growth and regularity conditions. Then $\dim_{*,n}(\sigma) = \infty$ if and only if $\sigma$ is not a polynomial.
3 Finite network dimension and absence of spurious valleys
In this section we provide our positive results. Essentially they state that if the width of the network matches the dimension of the functional space spanned by its filter functions, then no spurious valleys exist. We first provide the main result (Theorem 7) in a general form, which allows a straightforward derivation of two cases of interest: empirical risk minimization (Corollary 8) and polynomial activations (Corollary 9).
Theorem 7.
For any continuous activation function $\sigma$ and r.v. $X$ with finite upper intrinsic dimension $N = \dim^*(\sigma, X)$, the loss function (2) for two-layer NNs (3) admits no spurious valleys in the over-parametrized regime $p \ge N$.
The above result can be rephrased as follows: if the network is wide enough that each of its output units can realize any element of the linear space $V_{\sigma,X}$ spanned by its filter functions, then the associated optimization problem always admits a descent path to an optimal solution, for any initialization of the parameters.
Applying the observations in Section 2.1 describing the cases of finite intrinsic dimension, we immediately get the following corollaries.
Corollary 8 (ERM).
Consider data points $(x_i, y_i)$, $i = 1, \dots, N$. For two-layer NNs (3), where $\sigma$ is any continuous activation function, the empirical loss function
$$\hat L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\Phi(x_i; \theta), y_i)$$
admits no spurious valleys in the over-parametrized regime $p \ge N$.
This result is in line with previous works that considered the landscape of empirical risk minimization for half-rectified deep networks [42, 45, 31, 34]. However, its proof illustrates the danger of studying empirical risk minimization landscapes in over-parametrized regimes, since it bypasses all the geometric and algebraic properties needed in the population risk setting – which may be more relevant to understand the generalization properties of the model.
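A quick sanity check of the intuition behind the ERM corollary (our own sketch, not the paper's proof): with $p = N$ random ReLU units, the $N \times p$ feature matrix is generically nonsingular, so convex optimization of the second layer alone already interpolates the data.

```python
import numpy as np

# With width p = N, the feature matrix sigma(x_i . w_j) is generically full
# rank, so least squares over the second layer fits all N points exactly.
rng = np.random.default_rng(8)
n, N = 4, 30
X = rng.normal(size=(N, n))
y = rng.normal(size=N)                        # arbitrary targets

W = rng.normal(size=(N, n))                   # width p = N, random filters
H = np.maximum(X @ W.T, 0.0)                  # N x N ReLU feature matrix

u, *_ = np.linalg.lstsq(H, y, rcond=None)
train_err = float(np.mean((H @ u - y) ** 2))
full_rank = bool(np.linalg.matrix_rank(H) == N)
```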
Corollary 9 (Polynomial activations).
For two-layer NNs with polynomial activation function $\sigma$, the loss function admits no spurious valleys in the over-parametrized regime $p \ge \dim_n^*(\sigma)$.
Under the hypotheses of Corollary 9 with $\sigma(t) = t^k$, a generic function of $V_{\sigma,X}$ can also be represented, for some $T \in \mathrm{Sym}^k(\mathbb{R}^n)$, in the generalized linear form $x \mapsto \langle T, x^{\otimes k} \rangle$. The two parametrizations differ in their dimensions: the width required by Corollary 9 makes the network parameter count much larger than the number of free parameters of the tensor $T$. One would therefore like Corollary 9 to hold also (at least) for widths of the order of the lower intrinsic dimension. In the next section we address this problem for the linear activation $\sigma(t) = t$ and the quadratic activation $\sigma(t) = t^2$.
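The generalized linear representation for the quadratic case can be checked directly; in the sketch below (names and sizes are ours) the width-$p$ quadratic network coincides with the linear form $\langle M, xx^T \rangle$:

```python
import numpy as np

# The quadratic network as a generalized linear model:
#   sum_j u_j (w_j . x)^2  ==  <M, x x^T>  with  M = W^T diag(u) W.
rng = np.random.default_rng(4)
n, p = 4, 9
W = rng.normal(size=(p, n))
u = rng.normal(size=p)

M = W.T @ np.diag(u) @ W                      # symmetric n x n matrix
x = rng.normal(size=n)
net_value = float(((W @ x) ** 2) @ u)         # network output
linear_value = float(x @ M @ x)               # generalized linear form
```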
3.1 Improved overparametrization bounds for homogeneous polynomial activations
The over-parametrization bounds obtained in Corollary 9 are quite undesirable in practical applications. We show that they can indeed be improved in the case of linear and quadratic networks.
3.1.1 Linear networks case
Linear networks have been considered as a first order approximation of feedforward multilayer networks [29]. It was shown in several works [29, 21, 46] that, for linear networks of any depth $L$,
$$\Phi(x; \theta) = W_L \cdots W_1 x, \qquad \theta = (W_1, \dots, W_L), \tag{4}$$
the loss function has no spurious local minima, provided the hidden layer widths are at least $\min\{n, m\}$. This corresponds exactly to the over-parametrization regime in Corollary 9, for the case of two-layer networks. The following theorem improves on Corollary 9 for the case of multilayer linear networks, showing that no over-parametrization is required in this case to avoid spurious valleys, for square loss functions.
Theorem 10 (Linear networks).
For linear NNs (4) of any depth $L$, any layer widths $p_1, \dots, p_{L-1} \ge 1$, and any input–output dimensions $n, m$, the square loss function admits no spurious valleys.
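The mechanism behind this result is that depth adds no expressive power to linear networks: the composition collapses to a single matrix whose rank is bounded by the smallest width. A short sketch (widths are our choice):

```python
import numpy as np

# Depth adds no expressivity to linear networks: the composition collapses
# to a single matrix, with rank at most the smallest width.
rng = np.random.default_rng(5)
widths = [4, 3, 5, 2]                         # n = 4, hidden 3 and 5, m = 2
Ws = [rng.normal(size=(widths[i + 1], widths[i]))
      for i in range(len(widths) - 1)]

def deep_linear(x):
    for Wl in Ws:
        x = Wl @ x
    return x

A = np.linalg.multi_dot(Ws[::-1])             # W_3 @ W_2 @ W_1
x = rng.normal(size=widths[0])
collapsed = bool(np.allclose(deep_linear(x), A @ x))
rank_A = int(np.linalg.matrix_rank(A))
min_width = min(widths)
```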
3.1.2 Quadratic networks case
Quadratic activations $\sigma(t) = t^2$ have been considered in the literature [31, 18, 41] as second order approximations of general nonlinear activations. In particular, for two-layer networks with one-dimensional output and square loss functions evaluated on $N$ samples, it was shown in Du and Lee [18] that, for sufficiently wide networks, the loss has no spurious local minima. Corollary 9 requires an over-parametrization of order $n^2$ for the case of quadratic activations. In the following theorem we show that $p \ge 2n$ is sufficient for the statement to hold, in the case of square loss functions and one-dimensional output ($m = 1$).
Theorem 11 (Quadratic networks).
For two-layer NNs with quadratic activation function $\sigma(t) = t^2$ and one-dimensional output ($m = 1$), the square loss function admits no spurious valleys in the over-parametrized regime $p \ge 2n$.
This result is in line with the one from Soltanolkotabi et al. [41], where the authors proved absence of spurious local minima for sufficiently wide networks, but with fixed second layer weights. The proof (reported in Section A) consists in constructing a path satisfying P.1, and improves upon the proof of Theorem 7 by leveraging the special linearized structure of the network for quadratic activations. For every parameter $\theta = (W, u)$, we can write
$$\Phi(x; \theta) = \sum_{j=1}^{p} u_j (w_j \cdot x)^2 = \langle M, x x^T \rangle, \qquad M = W^T \mathrm{diag}(u) W \in \mathrm{Sym}^2(\mathbb{R}^n).$$
We notice that $\Phi(\cdot; \theta)$ can also be represented by a NN with $n$ hidden neurons; indeed, if $M = V \Lambda V^T$ is the eigendecomposition of $M$, then $\Phi(x; \theta) = \sum_{k=1}^{n} \lambda_k (v_k \cdot x)^2$. Therefore $p = n$ is sufficient to describe any element of the functional space. The factor $2$ in the statement is due to some technicalities in the proof, but a more involved proof should be able to extend the result to the regime $p \ge n$. The extension of this mechanism to higher order tensors (appearing as a result of multiple layers or higher-order polynomial activations) using tensor decompositions also seems possible and is left for future work.
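The compression step described above can be reproduced numerically: an over-parametrized quadratic network is re-realized exactly with $n$ hidden neurons via the eigendecomposition of $M$ (a sketch with our own dimensions):

```python
import numpy as np

# However wide, a quadratic network computes x -> x^T M x with
# M = W^T diag(u) W; the eigendecomposition M = V diag(lam) V^T
# re-realizes the same function with only n hidden neurons.
rng = np.random.default_rng(6)
n, p = 4, 50                                  # heavily over-parametrized
W = rng.normal(size=(p, n))
u = rng.normal(size=p)

M = W.T @ np.diag(u) @ W
lam, V = np.linalg.eigh(M)                    # columns of V are eigenvectors

X = rng.normal(size=(20, n))
wide = ((X @ W.T) ** 2) @ u                   # p-neuron network
narrow = ((X @ V) ** 2) @ lam                 # n-neuron network
same_function = bool(np.allclose(wide, narrow))
```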
3.1.3 Lower to upper intrinsic dimension gap
As observed in Lemma 5, $\dim_{*,n}(\sigma) = 1$ for $\sigma(t) = t$ and $\dim_{*,n}(\sigma) = n$ for $\sigma(t) = t^2$, for all integers $n$. Therefore, Theorem 10 and Theorem 11 say that, for $\sigma(t) = t$, $\sigma(t) = t^2$ and $m = 1$, the square loss function admits no spurious valleys in the over-parametrized regime $p \ge 2 \dim_{*,n}(\sigma)$. We conjecture that this holds for any (sufficiently regular) activation function with finite lower intrinsic dimension.
4 Infinite intrinsic dimension and presence of spurious valleys
This section is devoted to the construction of worst-case scenarios for under-parametrized networks. The main result (Theorem 12) essentially states that, for networks with width smaller than the lower intrinsic dimension defined above, spurious valleys can be created by choosing adversarial data distributions. We then show how this implies negative results for under-parametrized polynomial architectures and for a large variety of architectures used in practice.
Theorem 12.
Consider the square loss function for two-layer NNs with continuous non-negative activation function $\sigma$. If $p < \dim_{*,n}(\sigma)$, then there exists a r.v. $(X, Y)$ such that the square loss function admits spurious valleys. In particular, for any given $\delta > 0$, the r.v. can be chosen in such a way that there are two disjoint open subsets $\Omega_1, \Omega_2$ of the parameter space, with $\Omega_1$ containing a global minimum, such that
$$\inf_{\theta \in \Omega_2} L(\theta) \ge \inf_{\theta \in \Omega_1} L(\theta) + \delta, \tag{5}$$
and any path $t \mapsto \theta_t$ such that $\theta_{t=0} \in \Omega_2$ and $\theta_{t=1}$ is a global minimum verifies
$$\max_{t \in [0,1]} L(\theta_t) \ge L(\theta_{t=0}) + \delta. \tag{6}$$
Equation (5) in Theorem 12 says that any local descent algorithm, if initialized with a parameter value belonging to the spurious valley, will at best be able to produce a final parameter value which is at least $\delta$ far from optimality. Equation (6) implies that there exists an open subset of the spurious valley such that any path starting at a parameter belonging to this subset must 'up-climb' by at least $\delta$ in the loss value. In the following we refer to this property, as stated in Theorem 12, by saying that the loss function has arbitrarily bad spurious valleys. Note that this result ensures that spurious valleys have positive Lebesgue measure (given by that of the open subset $\Omega_2$), so there is a positive probability that gradient descent methods initialized with a measure that is absolutely continuous with respect to the Lebesgue measure will get stuck at a bad local minimum.
Applying the observations describing the values of the lower intrinsic dimension for different activation functions, we get the following corollaries.
Corollary 13 (Homogeneous even degree polynomial activations).
Assume that $\sigma(t) = t^k$ with $k$ an even positive integer. For two-layer NNs (3), if the hidden-layer width $p$ is smaller than the maximal symmetric rank of tensors in $\mathrm{Sym}^k(\mathbb{R}^n)$, then there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys.
This follows from Theorem 12 and Lemma 5, since $t^k \ge 0$ for even $k$. For the well known case $k = 2$ (symmetric matrices), the maximal symmetric rank equals $n$; therefore Corollary 13 implies that the bound provided in Theorem 11 is almost (up to a factor $2$) tight. This is also in line with recent works which explored quadratic architectures [41, 18].
Corollary 14 (Spurious valleys exist in generic architectures).
For two-layer NNs (3) with any hidden-layer width $p \ge 1$ and continuous non-negative non-polynomial activation function $\sigma$, there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys. This setting includes the following activation functions:

The ReLU activation function $\sigma(t) = \max\{0, t\}$ and some relaxations of it, such as the softplus activation functions $\sigma_\beta(t) = \beta^{-1} \log(1 + e^{\beta t})$, with $\beta > 0$;

The sigmoid activation function $\sigma(t) = (1 + e^{-t})^{-1}$ and the function $\sigma(t) = \frac{1}{2}(1 + \mathrm{erf}(t))$, which represents an approximation to the sigmoid function.
This follows from Theorem 12 by observing that $\dim_{*,n}(\sigma) = \infty$ if $\sigma$ is one of the above activation functions (Lemma 6). Corollary 14 generalizes some recent negative results [38, 46] for practical activations. We remark that while in these works the authors proved the existence of spurious local minima, we prove that, in fact, arbitrarily bad spurious valleys can exist, which is a stronger negative characterization.
The results of this section can be interpreted as worst-case scenarios for the problem of optimizing (2). We showed that, even for simple two-layer neural network architectures with nonlinear activation functions used in practice (such as the ReLU), global optimality results cannot hold, unless we make some assumptions on the data distributions.
5 Typical Spurious Valleys and LowEnergy Barriers
In the previous section it was shown that, whenever the number of hidden units is below the lower intrinsic dimension, one can exhibit worst-case data distributions that yield a landscape with arbitrarily bad spurious valleys.

In this section, we study the energy landscape under generic data distributions in the case of homogeneous activations, and show that, although spurious valleys may appear, they tend to do so below a certain energy level, controlled by the decay of the spectral decomposition of the kernel defined by the activation function and by the amount of over-parametrisation $p$. This phenomenon is consistent with the empirical success of gradient-descent algorithms in conditions where $p$ is below the intrinsic dimension.
We consider oracle square loss functions of the form
$$L(\theta) = \mathbb{E}\,|Y - \Phi(X;\theta)|^2 \tag{7}$$
for one-dimensional output two-layer NNs $\Phi(x;\theta) = \sum_{j=1}^{p} u_j\, \sigma(w_j \cdot x)$, with $\theta = (W, u)$, $\sigma$ a positively homogeneous function, and square integrable r.v.'s $X, Y$. Notice that we can write $Y = f^*(X) + \epsilon$ for some measurable $f^*$ such that $\mathbb{E}[\epsilon \,|\, X] = 0$. In particular this implies that
$$L(\theta) = \mathbb{E}\,|f^*(X) - \Phi(X;\theta)|^2 + \mathbb{E}[\epsilon^2].$$
As $p \to \infty$, the optimization problem (7) becomes convex, a fact that is exploited in several recent works [32, 36, 10]. As observed by Bach [2], the effect of having only a finite number of hidden neurons can be recast as obtaining a quadrature rule for the reproducing kernel associated to the activation function. The following theorem is a direct application of Proposition 1 from [2], and relates the quadrature error to the ability to avoid large loss barriers with high probability.
Theorem 15.
Let $\tau$ be the uniform distribution over the unit sphere $\mathbb{S}^{n-1}$ and consider an initial parameter $\theta_0 = (W, u)$ with rows $w_j \sim \tau$ sampled i.i.d. Then the following hold:

(i) There exists a path $t \mapsto \theta_t$ such that $\theta_{t=0} = \theta_0$, the function $t \mapsto L(\theta_t)$ is non-increasing, and
$$L(\theta_{t=1}) \le \inf_{\theta} L(\theta) + O\!\left(\sqrt{\frac{\log(p/\delta)}{p}}\right)$$
with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.

(ii) If $f^*$ is sufficiently regular (more precisely, if it can be written as $f^*(x) = \int \sigma(w \cdot x)\, \alpha(w)\, d\tau(w)$ for some bounded function $\alpha$), there exists a path $t \mapsto \theta_t$ such that $\theta_{t=0} = \theta_0$, the function $t \mapsto L(\theta_t)$ is non-increasing, and
$$L(\theta_{t=1}) \le \inf_{\theta} L(\theta) + O\!\left(\frac{\log(p/\delta)}{p}\right)$$
with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.
The above result implies that convex optimization over the second layer is sufficient to reach a model whose error relative to the best possible one is inversely proportional to the hidden-layer size (up to logarithmic factors). Nevertheless, in practice, this approach will generally perform worse than standard gradient-descent training, which may require less over-parametrization to give satisfying results (see for example the numerical experiments in [10], Section 4). This shows the importance of understanding gradient descent dynamics on the first layer, amenable to analysis in the $p \to \infty$ limit using mean-field techniques [36, 32] as well as optimal transport in Wasserstein metrics [10]. Precisely quantifying how much is gained by optimizing jointly over both layers in the non-asymptotic case remains an important open question [10] left for future work.
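The convex-second-layer strategy can be sketched as follows (the target, feature counts and use of plain least squares are our choices; this is an illustration, not the paper's construction): freezing random first-layer weights on the sphere and minimizing the empirical square loss over the second layer is a least-squares problem, and nesting the features makes the training error non-increasing in $p$.

```python
import numpy as np

# Frozen random first-layer weights on the sphere; convex (least squares)
# optimization over the second layer.  Using the first p columns of a fixed
# feature matrix makes the training error non-increasing in p.
rng = np.random.default_rng(7)
n, N, p_max = 5, 300, 200
X = rng.normal(size=(N, n))
y = (np.maximum(X @ rng.normal(size=n), 0.0)
     + 0.5 * np.maximum(X @ rng.normal(size=n), 0.0))

W = rng.normal(size=(p_max, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # rows uniform on the sphere
H = np.maximum(X @ W.T, 0.0)                   # ReLU random features

def train_err(p):
    u, *_ = np.linalg.lstsq(H[:, :p], y, rcond=None)
    return float(np.mean((H[:, :p] @ u - y) ** 2))

errs = [train_err(p) for p in (10, 50, 200)]
```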
6 Future directions
We considered the problem of characterizing the loss surface of neural networks from the perspective of optimization, with the goal of deriving weak certificates that enable – or prevent – the existence of descent paths towards global minima.
The topological properties studied in this paper, however, do not yet capture fundamental aspects that are necessary to explain the empirical success of deep learning methods. We identify a number of different directions that deserve further attention.
The positive results presented above rely on being able to reduce the network to the case where (convex) optimization over the second layer is sufficient to reach optimal weight values. A better understanding of the first layer dynamics still needs to be carried out. Moreover, in such positive results we only proved the non-existence of (high) energy barriers. While this is an interesting property from the optimization point of view, it is not sufficient to guarantee convergence of local descent algorithms. Another informative property of the loss function that should be addressed in future work is the existence of local descent directions at non-optimal points: for every non-optimal $\theta$ and any neighborhood $U$ of $\theta$, there exists $\theta' \in U$ such that $L(\theta') < L(\theta)$. More generally, our present work is not informative on the performance of gradient descent in the regimes with no spurious valleys.
Another very important point to be addressed in the future is how to extend the above results to architectures of more practical interest. Depth and the specific linear structure of Convolutional Neural Networks, critical to explain the excellent empirical performance of deep learning in computer vision, text or speech, need to be exploited, as well as specific design choices such as residual connections and normalization strategies – as done recently in [40] and [39] respectively. This also requires making specific assumptions on the data distribution, and is left for future work.

Acknowledgements
We would like to thank Gérard Ben Arous and Léon Bottou for fruitful discussions, and Jean Ponce for valuable comments and corrections of the original version of this manuscript. The first author would also like to thank Jumageldi Charyyev for fruitful discussions on the proofs of several propositions and Andrea Ottolini for valuable comments on a previous version of this manuscript.
References

 [1] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
 [2] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
 [3] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Conference on Learning Theory, pages 361–382, 2016.
 [4] Alberto Bietti and Julien Mairal. Group invariance and stability to deformations of deep convolutional representations. arXiv preprint arXiv:1706.03078, 2017.
 [5] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714, 2017.
 [6] Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. arXiv preprint arXiv:1712.07822, 2017.
 [7] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
 [8] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
 [9] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
 [10] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
 [11] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
 [12] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
 [13] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963, 2016.
 [14] Pierre Comon, Gene Golub, Lek-Heng Lim, and Bernard Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30(3):1254–1279, 2008.
 [15] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
 [16] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
 [17] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.
 [18] Simon S Du and Jason D Lee. On the power of overparametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
 [19] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
 [20] Soheil Feizi, Hamid Javadi, Jesse Zhang, and David Tse. Porcupine neural networks: (almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.
 [21] Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. In ICLR, 2017.
 [22] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.
 [23] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 [24] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
 [25] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
 [26] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
 [27] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 [28] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
 [29] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
 [30] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
 [31] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
 [32] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
 [33] Marco Mondelli and Andrea Montanari. On the connection between learning two-layers neural networks and tensor decomposition. arXiv preprint arXiv:1802.07301, 2018.
 [34] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
 [35] Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, pages 2798–2806, 2017.
 [36] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
 [37] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
 [38] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.
 [39] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.
 [40] Ohad Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
 [41] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of overparameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
 [42] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 [43] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
 [44] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
 [45] Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
 [46] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. A critical view of optimality in deep learning. arXiv preprint arXiv:1802.03487, 2018.
 [47] Yuchen Zhang, Percy Liang, and Martin J Wainwright. Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000, 2016.
Appendix A Proofs of Section 3
A.1 Proof of Theorem 7
We note that, under the assumptions of Theorem 7, the same optimal NN functions could also be obtained using a generalized linear model, in which the predictor is linear in its parameters, $f(x; \theta) = \langle \theta, \varphi(x) \rangle$, for some parameter-independent feature map $\varphi$. The main difference between the two models is that the former requires the choice of a nonlinear activation function $\sigma$, while the latter implies the choice of a kernel function. This is the content of the following lemma.
Lemma 16.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous function and $X$ an $\mathbb{R}^n$-valued random variable. Assume that the linear space
$$ V = \operatorname{span}\{\, \sigma(\langle w, X \rangle) : w \in \mathbb{R}^n \,\} $$
is finite dimensional, of dimension $q$. Then there exist a random vector $\varphi(X)$ with values in $\mathbb{R}^q$ and a map $\psi : \mathbb{R}^n \to \mathbb{R}^q$ such that
$$ \sigma(\langle w, X \rangle) = \langle \psi(w), \varphi(X) \rangle \tag{8} $$
for all $w \in \mathbb{R}^n$. Moreover, the function $\psi$ is continuous.
Proof.
For sake of simplicity, in the following we write $\sigma_w$ for the function $x \mapsto \sigma(\langle w, x \rangle)$ and $V$ for $\operatorname{span}\{\sigma_w : w \in \mathbb{R}^n\}$. Let $e_1, \dots, e_q$ be a basis of $V$. If $f = \sum_i a_i e_i$ and $g = \sum_i b_i e_i$, then we can define a scalar product on $V$ as $\langle f, g \rangle_V = \sum_i a_i b_i$.
If we define the map $\psi$ as
$$ \psi(w) = (a_1(w), \dots, a_q(w)), \qquad \text{where } \sigma_w = \sum_i a_i(w)\, e_i, $$
and set $\varphi(X) = (e_1(X), \dots, e_q(X))$, then property (8) follows directly by the definition of the function $\psi$. Moreover, we can choose $w_1, \dots, w_q$ such that $\sigma_{w_1}, \dots, \sigma_{w_q}$ is a basis of $V$. Now we need to show that the map $w \mapsto \psi(w)$ is continuous. Since the basis functions are linearly independent, there exist points $x_1, \dots, x_q$ such that the matrix $A$ with entries $A_{ji} = e_i(x_j)$ is invertible. Let $b(w)$ be the vector $(\sigma(\langle w, x_j \rangle))_j$. Then $\psi(w) = A^{-1} b(w)$, which is continuous in $w$ by continuity of $\sigma$. This shows that the map $\psi$ is continuous. ∎
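To make the factorization (8) concrete, here is a small numerical sketch (not from the original proof) for the quadratic activation $\sigma(t) = t^2$, where one can explicitly take $\psi(w) = \mathrm{vec}(w w^\top)$ and $\varphi(x) = \mathrm{vec}(x x^\top)$, so that the space $V$ is finite dimensional:

```python
import numpy as np

# For sigma(t) = t^2 we have (w.x)^2 = <w w^T, x x^T>_F
#                                    = <vec(w w^T), vec(x x^T)>,
# so psi(w) = vec(w w^T) and phi(x) = vec(x x^T) realize eq. (8).

def psi(w):
    return np.outer(w, w).ravel()

def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
n = 3
for _ in range(5):
    w, x = rng.standard_normal(n), rng.standard_normal(n)
    lhs = np.dot(w, x) ** 2        # sigma(<w, x>)
    rhs = np.dot(psi(w), phi(x))   # <psi(w), phi(x)>
    assert np.isclose(lhs, rhs)
print("factorization (8) verified")
```

Here $V$ sits inside the space of quadratic forms in $x$, so $q \le n(n+1)/2$; the redundant coordinates of $\mathrm{vec}(\cdot)$ do not affect the identity.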
The nontrivial fact captured by Theorem 7 is the following: when the capacity of the network is large enough to match that of a generalized linear model, but still finite, the problem of optimizing the loss function (2), which is in general highly nonconvex, nonetheless satisfies a favorable property in view of the local descent algorithms which are used in practice to solve it.
Proof of Theorem 7.
Thanks to Lemma 16, there exist two continuous maps $\psi, \varphi : \mathbb{R}^n \to \mathbb{R}^q$, with $q = \dim V$, such that $\sigma(\langle w, X \rangle) = \langle \psi(w), \varphi(X) \rangle$ for every $w \in \mathbb{R}^n$. Therefore, every two-layers NN can be written as $\Phi(x; \theta) = U \sigma(W x) = U \beta(W) \varphi(x)$, where, if $W \in \mathbb{R}^{p \times n}$ has rows $w_1, \dots, w_p$, then $\beta(W) \in \mathbb{R}^{p \times q}$ is the matrix with rows $\psi(w_1), \dots, \psi(w_p)$ (that is, $\psi$ is applied row-wise).
The proof of the Theorem consists in exploiting the above linearized representation of $\Phi$ to show that property 1 holds (recall that this is equivalent to saying that the loss function has no spurious valleys). Given an initial parameter $\theta_0 = (W_0, U_0)$, we want to construct a continuous path $\theta : [0, 1] \to \Theta$ such that the function $t \mapsto L(\theta(t))$ is non-increasing and such that $\theta(0) = \theta_0$, $\theta(1) = \theta^\ast$, where $\theta^\ast \in \arg\min_\theta L(\theta)$. The construction of such a path can be articulated in two main steps:
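Displayed for clarity, the linearization underlying the whole proof reads (with shapes $W \in \mathbb{R}^{p \times n}$, $U \in \mathbb{R}^{m \times p}$ assumed for concreteness):
$$ \Phi(x; \theta) \;=\; U\, \sigma(W x) \;=\; U\, \beta(W)\, \varphi(x), \qquad \beta(W) = \begin{pmatrix} \psi(w_1)^\top \\ \vdots \\ \psi(w_p)^\top \end{pmatrix} \in \mathbb{R}^{p \times q}, $$
so the network is a linear predictor in the fixed features $\varphi(x)$, with coefficient matrix $U \beta(W)$.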
Step 1.
The first part of the path consists in showing that we can assume w.l.o.g. that $\operatorname{rank}(\beta(W_0)) = \min(p, q)$. Let $w_1, \dots, w_p$ be the rows of $W_0$; suppose that $k \doteq \operatorname{rank}(\beta(W_0)) < \min(p, q)$ (otherwise there is nothing to show) and, up to reordering, that $\psi(w_1), \dots, \psi(w_k)$ are linearly independent. Denote $u_1, \dots, u_p$ the columns of $U_0$. For $i > k$, we can write
$$ \psi(w_i) = \sum_{j \le k} a_{ij}\, \psi(w_j). \tag{9} $$
If we define $U^1$ such that (denoting $u^1_j$ the $j$-th column of $U^1$)
$$ u^1_j = u_j + \sum_{i > k} a_{ij}\, u_i \ \text{ for } j \le k, \qquad u^1_i = 0 \ \text{ for } i > k, $$
then $U^1 \beta(W_0) = U_0 \beta(W_0)$. The path $t \mapsto (W_0, (1 - t) U_0 + t U^1)$ leaves the network unchanged, i.e. $\Phi(\cdot\,; \theta(t)) = \Phi(\cdot\,; \theta_0)$ for $t \in [0, 1]$. At this point, we can select vectors $\tilde{w}_i$, $i > k$, such that the matrix $\beta(W^1)$ with rows $\psi(w_j)$ for $j \le k$ and $\psi(\tilde{w}_i)$ for $i > k$ verifies $\operatorname{rank}(\beta(W^1)) = \min(p, q)$. Notice that the existence of such vectors $\tilde{w}_i$, $i > k$, is guaranteed by the definition of $q$. The path $t \mapsto (W(t), U^1)$, where the rows of $W(t)$ are $w_j$ for $j \le k$ and $(1 - t) w_i + t \tilde{w}_i$ for $i > k$, again leaves the network unchanged, since the corresponding columns of $U^1$ vanish. The new parameter value $\theta_1 = (W^1, U^1)$ satisfies $\operatorname{rank}(\beta(W^1)) = \min(p, q)$.
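As a numerical sanity check of Step 1 (a hypothetical sketch, with the quadratic activation and small shapes $n = 2$, $p = 5$ chosen only for illustration), one can verify that both reparametrizations leave the network output unchanged:

```python
import numpy as np

# Quadratic activation: sigma(t) = t^2, psi(w) = vec(w w^T).
# The psi(w) span the q = n(n+1)/2 dimensional space of symmetric
# matrices (embedded in R^{n^2}), so k = q generic rows suffice.

rng = np.random.default_rng(1)
n, p, m = 2, 5, 1          # inputs, hidden units, outputs
q = n * (n + 1) // 2       # dim of span{psi(w)}
k = q                      # generic w_1..w_k give independent psi's

def psi(w):
    return np.outer(w, w).ravel()

def forward(W, U, x):      # Phi(x; theta) = U sigma(W x)
    return U @ (W @ x) ** 2

W = rng.standard_normal((p, n))
U = rng.standard_normal((m, p))
Psi = np.stack([psi(w) for w in W])   # the matrix beta(W), p x n^2

# Coefficients a_{ij} of eq. (9): psi(w_i) = sum_j a_{ij} psi(w_j), i > k.
A, *_ = np.linalg.lstsq(Psi[:k].T, Psi[k:].T, rcond=None)  # k x (p - k)

# U1: absorb the redundant units into the first k columns, zero the rest.
U1 = np.zeros_like(U)
U1[:, :k] = U[:, :k] + U[:, k:] @ A.T

x = rng.standard_normal(n)
out0 = forward(W, U, x)
# Output constant along the path t -> (W_0, (1 - t) U_0 + t U^1):
for t in [0.0, 0.5, 1.0]:
    Ut = (1 - t) * U + t * U1
    assert np.allclose(forward(W, Ut, x), out0)
# Rows i > k can now be moved freely, since their columns in U1 are zero:
W1 = W.copy()
W1[k:] = rng.standard_normal((p - k, n))
assert np.allclose(forward(W1, U1, x), out0)
print("both paths leave the network unchanged")
```

The first check also covers intermediate $t$ because the output is linear in $U$ and both endpoints coincide; the second holds because the zeroed columns of $U^1$ kill the contribution of the redefined rows.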
Step 2.
By Step 1, we can assume that $\operatorname{rank}(\beta(W_0)) = \min(p, q)$. Since the network has the form $\Phi(x; \theta) = U \beta(W) \varphi(x)$ and since the function $Z \in \mathbb{R}^{m \times q} \mapsto \mathbb{E}\, \ell(Z \varphi(X), Y)$ is convex, there exists $Z^\ast \in \mathbb{R}^{m \times q}$ such that