Modern machine learning applications involve datasets of increasing dimensionality, complexity and size, which in turn motivate the use of high-dimensional, non-linear models, as illustrated in many deep learning algorithms across computer vision, speech and natural language understanding. The prevalent strategy for learning is to rely on Stochastic Gradient Descent (SGD) methods, that typically operate on non-convex objectives. In this context, an outstanding goal is to provide a theoretical framework that explains under what conditions – relating input data distribution, choice of architecture and choice of optimization scheme – this setup will be successful.
More precisely, let $\Phi = \{\Phi(\cdot\,;\theta);\ \theta \in \mathbb{R}^d\}$ denote a model class parametrized by $\theta \in \mathbb{R}^d$, which in the case of Neural Networks (NNs) contains the aggregated weights across all layers. In a supervised learning setting, this model is deployed on some data, a random variable $(X, Y)$ taking values in $\mathbb{R}^n \times \mathbb{R}^m$, to predict targets $Y$ given features $X$, and its risk for a given $\theta$ is
$$L(\theta) = \mathbb{E}\,\ell(\Phi(X;\theta), Y), \qquad (1)$$
where $\ell$ is a convex loss, such as a square loss or a logistic regression loss. In the following we refer to (1) as the risk, the energy or the loss interchangeably. The aim is to find
$$\theta^* \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d} L(\theta),$$
and this is attempted in practice by running the SGD iteration on the parameters
$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta\, \ell(\Phi(x_t;\theta_t), y_t),$$
where the $(x_t, y_t)$ are drawn i.i.d. from the distribution of $(X, Y)$. Under some technical conditions, the expected gradient is known to converge to zero, i.e. $\mathbb{E}\,\|\nabla L(\theta_t)\| \to 0$ as $t \to \infty$. Understanding the nature of such stationary points, and therefore the landscape of the loss function, is a task of fundamental importance for understanding the performance of SGD.
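The SGD iteration above can be sketched concretely. The following is a minimal illustration, not the paper's setup: the two-layer architecture, the ReLU choice, the linear teacher distribution and all dimensions are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of the SGD iteration described above, for a two-layer
# network Phi(x; W, U) = U @ rho(W @ x) with square loss.
rng = np.random.default_rng(0)
n, p, m = 5, 16, 1                        # input dim, hidden width, output dim
W = rng.standard_normal((p, n))           # first-layer weights
U = np.zeros((m, p))                      # second-layer weights

def phi(x, W, U):
    return U @ np.maximum(W @ x, 0.0)     # ReLU network output

def sgd_step(x, y, W, U, lr=1e-2):
    h = np.maximum(W @ x, 0.0)            # hidden activations
    r = phi(x, W, U) - y                  # residual of the square loss
    gU = np.outer(r, h)                   # gradient of |r|^2 / 2 w.r.t. U
    gW = np.outer((U.T @ r) * (h > 0), x) # gradient w.r.t. W (chain rule)
    return W - lr * gW, U - lr * gU

teacher = rng.standard_normal(n)          # hypothetical data model
def sample():                             # i.i.d. draws of (x_t, y_t)
    x = rng.standard_normal(n)
    return x, np.array([teacher @ x])

def avg_loss(W, U, k=500):                # Monte Carlo estimate of the risk
    return float(np.mean([np.sum((phi(x, W, U) - y) ** 2)
                          for x, y in (sample() for _ in range(k))]))

loss_before = avg_loss(W, U)
for _ in range(3000):
    x, y = sample()
    W, U = sgd_step(x, y, W, U)
loss_after = avg_loss(W, U)
```

On this toy realizable problem the iteration drives the risk down; whether such descent reaches a *global* minimum in general is precisely the landscape question studied below.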
Whereas there is a growing literature analyzing the behavior of SGD on non-convex objectives [43, 28, 25, 44], we focus here on properties of the optimization problem above that are algorithm independent. Many authors have attempted to characterize the landscape of the loss function (1) by studying its critical points. Global optimality results have been obtained for NN architectures with linear activations [26, 29, 46], quadratic activations [41, 18] and some more general non-linear activations, under appropriate regularity assumptions [42, 34, 20]. Other insights have been obtained by leveraging tools from the complexity analysis of spin glasses and from random matrix theory. Further analyses studied the quality of the initialization of the parameter values [15, 37, 19] or other topological properties of the loss (1), such as the connectivity of sub-level sets [17, 21]. A factor common to the works cited above (and to common practice) is that over-parametrisation of the model class (i.e. increasing the number of parameters $d$) often leads to improved performance, despite the potential increase in generalization error.
Each model defines a functional space $\{\Phi(\cdot\,;\theta);\ \theta \in \mathbb{R}^d\}$, whose complexity a priori increases with the dimensionality $d$ of the parameter space. Whereas several authors have studied these models through the lens of approximation theory [13, 12, 5] by focusing on specific aspects of the parameterisation (such as the depth of the network), in this work we explore another hypothesis, namely that over-parameterization tames the complexity of that functional space, often leading to loss functions without poor local minima.
Our approach is inspired by Freeman and Bruna, and is related to recent work that also explores convexifications of neural networks [1, 47, 4]. Our analysis focuses mostly on the class of two-layer neural networks with a hidden layer of size $p$, and covers both empirical and population risk landscapes. A given activation function $\rho$ determines a functional space of representable networks. In essence, our work identifies notions of intrinsic dimension of this functional space, and establishes the following facts:
If the hidden layer size is at least equal to the upper intrinsic dimension, then the resulting loss landscape is free of poor local minima, independently of the data distribution;
If the hidden layer size $p$ is smaller than the lower intrinsic dimension, then there exist data distributions yielding arbitrarily poor local minima.
We articulate the notion of poor local minima via what we call spurious valleys, defined as connected components of the sub-level sets that do not contain a global minimum. The upper and lower intrinsic dimensions define only two scenarios: either (i) they are both finite, enabling the positive results, or (ii) they are both infinite, implying the negative results. Moreover, case (i) only occurs for polynomial activation functions or when the data distribution is discrete, corresponding to generic empirical risk minimization. The negative result covers many classes of activation functions with infinite intrinsic dimension. In particular, it generalizes previously known negative results (such as for leaky ReLUs) [46, 38] to a far wider class of activations. While in general the upper and lower intrinsic dimensions may not match, we show that in some cases (linear and quadratic networks) the gap between the positive and negative results can be closed by improving on the former.
The negative results are worst-case in nature, and leave open the question of how complex a 'typical' energy landscape corresponding to a generic data distribution is. We answer this question by showing that, even if spurious valleys may appear in general, in practice they are easily avoided from random initializations, up to a low energy threshold which approaches the global minimum at a rate inversely proportional to the hidden layer size, up to log factors. This fact is shown for networks with homogeneous activations and generic data distributions, and is based on properties of random kernel quadrature rules.
Many other types of analysis of gradient-based optimization algorithms for NNs have been considered in the literature. For example, Ge et al. proved convergence of GD on a modified loss; Shamir compared the optimization properties of residual networks with those of linear models; in Dauphin et al. it is argued that the issues arising in the optimization of NN architectures are due to the presence of saddle points in the loss function rather than spurious local minima. Optimization landscapes have also been studied in contexts other than NN training, such as low-rank problems, matrix completion, problems arising in semidefinite programming [9, 3] and implicit generative modeling.
The rest of the paper is structured as follows. Section 2 formally introduces the notion of spurious valleys and explains why this is a relevant concept from the optimization point of view. It also defines the intrinsic dimensions of a network (Section 2.1). In Section 3 we state our main positive results (Theorem 7) and discuss two settings where they bear fruit: polynomial activation functions and empirical risk minimization. For the case of linear and quadratic activations, we improve on our general result: for the linear case, we prove that Theorem 7 holds without any assumptions on the distribution of the data or on the size/rank of any variables (which extends previous results on the optimization of linear NNs [29, 46]); for the quadratic case, we recover results in line with the current literature [41, 18]. Section 4 is dedicated to constructions of worst-case scenarios for activations with infinite lower intrinsic dimension. We then show, in Section 5, that even if spurious valleys may exist, they tend to be confined to regimes of low risk. Some discussion is reported in Section 6.
We introduce notation used throughout the rest of the paper. For any integer $n$ we denote $[n] = \{1, \dots, n\}$ and, if $v \in \mathbb{R}^n$, we denote its components as $v^i$, $i \in [n]$; given a matrix $W \in \mathbb{R}^{p \times n}$, we denote its rows as $w_i \in \mathbb{R}^n$, $i \in [p]$; given a tensor $T$, we denote its components as $T^{i_1 \cdots i_k}$. Given some vectors $v_1, \dots, v_k \in \mathbb{R}^n$, the tensor product $v_1 \otimes \cdots \otimes v_k$ denotes the $k$-th order $n$-dimensional tensor whose components are given by $(v_1 \otimes \cdots \otimes v_k)^{i_1 \cdots i_k} = v_1^{i_1} \cdots v_k^{i_k}$; given a vector $v \in \mathbb{R}^n$, we denote $v^{\otimes k} = v \otimes \cdots \otimes v$ ($k$ times).
$I_n$ denotes the identity matrix and $e_1, \dots, e_n$ the standard basis in $\mathbb{R}^n$. For any random variables (r.v.'s) $X$ and $Y$ with values in $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively, we denote the second-moment matrices $\Sigma_X = \mathbb{E}[XX^\top]$ and $\Sigma_{YX} = \mathbb{E}[YX^\top]$. For every integer $n$, we denote by $\mathrm{GL}_n$, $\mathrm{O}_n$ and $\mathrm{SO}_n$, respectively, the general linear group, the orthogonal group and the special orthogonal group of real $n \times n$ matrices. We denote by $\mathrm{Sym}_k(\mathbb{R}^n)$ the space of order-$k$ symmetric tensors on $\mathbb{R}^n$. For any $T \in \mathrm{Sym}_k(\mathbb{R}^n)$, we define its symmetric rank as $\mathrm{rank}_S(T) = \min\{ r : T = \sum_{j=1}^{r} \lambda_j v_j^{\otimes k},\ \lambda_j \in \mathbb{R},\ v_j \in \mathbb{R}^n \}$. We define $\mathrm{rank}_S(n, k) = \max\{ \mathrm{rank}_S(T) : T \in \mathrm{Sym}_k(\mathbb{R}^n) \}$. Finally, $S^{n-1}$ denotes the $(n-1)$-dimensional sphere $\{ x \in \mathbb{R}^n : \|x\| = 1 \}$.
2 Problem setting
Let $X, Y$ be two r.v.'s, taking values in $\mathbb{R}^n$ and $\mathbb{R}^m$ and representing the input and output data, respectively. We consider oracle loss functions of the form
$$L(\theta) = \mathbb{E}\,\ell(\Phi(X;\theta), Y), \qquad (2)$$
where $\ell$ is convex. For every $\theta$, the function $\Phi(\cdot\,;\theta)$ models the dependence of the output on the input as $y = \Phi(x;\theta)$. We focus on two-layer NN functions, i.e. of the form
$$\Phi(x;\theta) = U \rho(W x), \qquad \theta = (W, U), \qquad (3)$$
where $W \in \mathbb{R}^{p \times n}$ and $U \in \mathbb{R}^{m \times p}$. Here $p$ represents the width of the hidden layer and $\rho$ is a continuous element-wise activation function.
The loss function $L$ is in general non-convex; it may present spurious (i.e. non-global) local minima. In this work, we characterize $L$ by determining the absence or presence of spurious valleys, as defined below.
For all $c \in \mathbb{R}$ we define the sub-level set of $L$ as $\Omega_L(c) = \{\theta : L(\theta) \le c\}$. We define a spurious valley as a connected component of a sub-level set $\Omega_L(c)$ which does not contain a global minimum of the loss $L$.
Since, in practice, the loss (2) is minimized with a gradient-descent-based algorithm, the absence of spurious valleys is a desirable property if we wish the algorithm to converge to an optimal parameter. It is easy to see that not having spurious valleys is equivalent to the following property:
Given any initial parameter $\tilde{\theta}$, there exists a continuous path $\theta : t \in [0, 1] \mapsto \theta_t$ such that:
(a) $\theta_0 = \tilde{\theta}$;
(b) $\theta_1 \in \operatorname*{arg\,min}_{\theta} L(\theta)$;
(c) the function $t \mapsto L(\theta_t)$ is non-increasing.
As pointed out in Freeman and Bruna, this implies that $L$ has no strict spurious (i.e. non-global) local minima. The absence of generic (i.e. non-strict) spurious local minima is guaranteed if the path is such that the function $t \mapsto L(\theta_t)$
is strictly decreasing. For many activation functions used in practice (such as the ReLU), the parameter determining the function $\Phi(\cdot\,;\theta)$ is only identified up to the action of a symmetry group (e.g., in the case of the ReLU, $\rho$ is a positively homogeneous function). This prevents strict minima: for any value of the parameter $\theta$ there exists an (often large) manifold through $\theta$ along which the loss function is constant. Absence of spurious valleys for the loss (2) implies that it is always possible to move from any point in the parameter space to a global minimum, without increasing the loss.
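The rescaling symmetry just mentioned is easy to verify numerically. The sketch below (all shapes and values are illustrative assumptions) checks that, for the positively homogeneous ReLU, scaling one hidden unit's incoming weights by $a > 0$ and its outgoing weight by $1/a$ leaves the network function, and hence the loss, unchanged:

```python
import numpy as np

# Verify the ReLU rescaling symmetry: relu(a*z) = a*relu(z) for a > 0,
# so (w_j, u_j) -> (a*w_j, u_j/a) preserves the network function.
rng = np.random.default_rng(1)
n, p = 4, 8
W = rng.standard_normal((p, n))
u = rng.standard_normal(p)

def net(x, W, u):
    return float(u @ np.maximum(W @ x, 0.0))

x = rng.standard_normal(n)
a = 3.7                       # any positive rescaling factor
W2, u2 = W.copy(), u.copy()
W2[0] *= a                    # rescale unit 0's input weights...
u2[0] /= a                    # ...and compensate on its output weight
same = abs(net(x, W, u) - net(x, W2, u2)) < 1e-12
```

Varying $a$ over $(0, \infty)$ for each unit traces out exactly the kind of constant-loss manifold through $\theta$ described above.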
2.1 Intrinsic dimension of a network
The main insight of this work is that the absence of spurious valleys is related to the complexity of the functional space defined by the network architecture. We therefore define two measures of this complexity, which we use to show, respectively, positive and negative results in this regard.
To simplify the discussion, we introduce some notation used throughout the rest of the paper. Let $\rho$ be a continuous activation function. For every $w \in \mathbb{R}^n$ we denote by $\phi_w$ the function $\phi_w(x) = \rho(w^\top x)$. We refer to each $\phi_w$ as a filter function. If $X$ is a r.v. taking values in $\mathbb{R}^n$, we denote by $L^2_X$ the space of square integrable functions w.r.t. the probability measure induced by the r.v. $X$. We then define two functional spaces: the space of (one-dimensional-output) functions modeled by the network architecture, and the space of ($n$-dimensional) input data distributions for which the filter functions have finite second moment. We finally define
$$\mathcal{V}_{\rho, X} = \mathrm{span}\{ \phi_w : w \in \mathbb{R}^n \} \subseteq L^2_X$$
as the linear space spanned by the functions $\phi_w$ for $w \in \mathbb{R}^n$.
Let $\rho$ be a continuous activation function and $X$ a r.v. We define¹
$$\dim^*(\rho, X) = \dim(\mathcal{V}_{\rho, X})$$
as the upper intrinsic dimension of the pair $(\rho, X)$. We define the level-$n$ upper intrinsic dimension of $\rho$ as $\dim^*_n(\rho) = \sup_X \dim^*(\rho, X)$.
¹For any linear subspace $V \subseteq L^2_X$, $\dim(V)$ denotes the dimension of $V$ as a subspace of $L^2_X$.
The upper intrinsic dimension defined above is therefore the dimension of the functional space spanned by the filter functions or, equivalently, of the linear span of the image of the map $w \mapsto \phi_w \in L^2_X$. Notice that $\dim^*(\rho, X) \le \dim^*_n(\rho)$. In particular, if the distribution of $X$ is discrete, i.e. concentrated on a finite number of points $x_1, \dots, x_N$, then $\dim^*(\rho, X) \le N$. Otherwise, if the distribution is not discrete, then $\dim^*(\rho, X)$ may be infinite.
The level-$n$ upper intrinsic dimension is the dimension of the functional linear space $\mathcal{V}_{\rho, X}$ in the richest case. We note that if $X$ is a r.v. with almost surely (a.s.) positive density w.r.t. the Lebesgue measure on $\mathbb{R}^n$, then $\dim^*(\rho, X) = \dim^*_n(\rho)$.
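The dimension of the span of the filter functions can be probed numerically: evaluate many random filters on many sample points and take the rank of the resulting matrix. The sketch below (activation, input dimension and sample sizes are illustrative assumptions) uses $\rho(z) = z^2$ on $\mathbb{R}^2$, where the filters $(w^\top x)^2$ span the three-dimensional space of quadratic monomials $\{x_1^2, x_1 x_2, x_2^2\}$, no matter how many filters are drawn:

```python
import numpy as np

# Estimate dim(span{phi_w}) as the numerical rank of the matrix
# F[i, j] = rho(w_j . x_i) over many points x_i and filters w_j.
rng = np.random.default_rng(2)
n, n_filters, n_points = 2, 50, 200
X = rng.standard_normal((n_points, n))    # sample points x_i
Wf = rng.standard_normal((n_filters, n))  # random filter directions w_j
F = (X @ Wf.T) ** 2                       # quadratic activation rho(z) = z^2
rank = np.linalg.matrix_rank(F)           # numerical dimension of the span
```

For a non-polynomial activation the same experiment produces a rank that keeps growing with the number of points and filters, consistent with an infinite upper intrinsic dimension.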
The following lemma covers all the cases in which the upper intrinsic dimension is finite.
Let $\rho$ be a continuous activation function and $X$ a r.v. whose distribution is not discrete. If $\rho$ is a polynomial of degree $k$, then $\dim^*(\rho, X)$ is finite, bounded by the dimension of the space of polynomials of degree at most $k$ in $n$ variables. Otherwise (i.e. if $\rho$ is not a polynomial) it holds $\dim^*(\rho, X) = \infty$.
We then define the lower intrinsic dimension, which captures 'how many hidden neurons are needed to represent a generic function of the space spanned by the filter functions'.
Let $\rho$ be a continuous activation function and $X$ a r.v. We define²
$$\dim_*(\rho, X) = \min\{ p : \text{every function in } \mathrm{span}\{\phi_w\} \text{ is realized, in } L^2_X, \text{ by a width-}p\text{ network} \}$$
as the lower intrinsic dimension of the pair $(\rho, X)$. We define the level-$n$ lower intrinsic dimension of $\rho$ as $\dim_{*, n}(\rho) = \sup_X \dim_*(\rho, X)$.
²For any subsets $A, B \subseteq L^2_X$, we say that $A \subseteq B$ if the inclusion holds between $A$ and $B$ as subsets of $L^2_X$ (and similarly for other inclusions or equalities).
If $\dim_*(\rho, X)$ is finite, then it corresponds to the minimum number of hidden neurons needed to represent any function of the span of the filter functions with the NN architecture (3). Clearly, this implies that
$$\dim_*(\rho, X) \le \dim^*(\rho, X)$$
for every continuous activation function $\rho$ and any r.v. $X$. As with the upper intrinsic dimension, we note that if $X$ is a r.v. with a.s. positive density w.r.t. the Lebesgue measure, then $\dim_*(\rho, X) = \dim_{*, n}(\rho)$.
In the case of homogeneous polynomial activations $\rho(z) = z^k$ with $k$ a positive integer, the level-$n$ lower intrinsic dimension of $\rho$ coincides with the notion of (maximal) symmetric tensor rank.
Let $\rho(z) = z^k$, with $k$ a positive integer. Then $\dim_{*, n}(\rho)$ equals the maximal symmetric rank of an order-$k$ symmetric tensor on $\mathbb{R}^n$.
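The correspondence with symmetric tensors can be made explicit. Using the tensor notation introduced earlier, a width-$p$ network with activation $z^k$ computes (a standard identity, sketched here for reference):

```latex
\Phi(x;\theta)
  = \sum_{j=1}^{p} u_j \,\rho(w_j^\top x)
  = \sum_{j=1}^{p} u_j \,(w_j^\top x)^{k}
  = \Big\langle \sum_{j=1}^{p} u_j\, w_j^{\otimes k},\; x^{\otimes k}\Big\rangle
  = \langle T_\theta,\, x^{\otimes k}\rangle,
\qquad
T_\theta := \sum_{j=1}^{p} u_j\, w_j^{\otimes k}.
```

Hence representing a function $f(x) = \langle T, x^{\otimes k}\rangle$ exactly requires a width $p$ at least equal to the symmetric rank of $T$, and taking the worst case over $T$ yields the maximal symmetric rank.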
Finally, the next lemma implies that for most non-polynomial activation functions of practical interest, the lower intrinsic dimension is infinite.
Let $\rho$ be a continuous activation function satisfying mild technical conditions. Then $\dim_{*, n}(\rho) = \infty$ if and only if $\rho$ is not a polynomial.
3 Finite network dimension and absence of spurious valleys
In this section we provide our positive results. Essentially, they state that if the width of the network matches the dimension of the functional space spanned by its filter functions, then no spurious valleys exist. We first provide the main result (Theorem 7) in a general form, which allows a straightforward derivation of two cases of interest: empirical risk minimization (Corollary 8) and polynomial activations (Corollary 9).
Theorem 7. For any continuous activation function $\rho$ and r.v. $X$ with finite upper intrinsic dimension $\dim^*(\rho, X)$, the loss function (2) for two-layer NNs (3) admits no spurious valleys in the over-parametrized regime $p \ge \dim^*(\rho, X)$.
The above result can be re-phrased as follows: if the network is such that any of its output units can be chosen from the whole linear space spanned by its filter functions, then the associated optimization problem always admits a descent path to an optimal solution, for any initialization of the parameters.
Applying the observations in Section 2.1 describing the cases of finite intrinsic dimension, we immediately get the following corollaries.
Corollary 8 (ERM).
Consider data points $(x_i, y_i)$, $i = 1, \dots, N$. For two-layer NNs (3), where $\rho$ is any continuous activation function, the empirical loss function
$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\Phi(x_i; \theta), y_i)$$
admits no spurious valleys in the over-parametrized regime $p \ge N$.
This result is in line with previous works that considered the landscape of empirical risk minimization for half-rectified deep networks [42, 45, 31, 34]. However, its proof illustrates the danger of studying empirical risk minimization landscapes in over-parametrized regimes, since it bypasses all the geometric and algebraic properties needed in the population risk setting, which may be more relevant for understanding the generalization properties of the model.
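The mechanism behind the $p \ge N$ regime can be illustrated directly. In the sketch below (activation, data and sizes are illustrative assumptions), with $N$ data points and $p = N$ hidden units with generic first-layer weights, the $N \times p$ feature matrix is generically invertible, so a second layer interpolating arbitrary targets exists and is found by solving a linear system, i.e. a convex problem:

```python
import numpy as np

# With p = N generic hidden units, the feature matrix H = rho(X W^T) is
# (generically) full rank, so the second layer can interpolate any targets.
rng = np.random.default_rng(3)
N, n = 10, 3
X = rng.standard_normal((N, n))          # data points x_i
y = rng.standard_normal(N)               # arbitrary targets y_i
W = rng.standard_normal((N, n))          # p = N random hidden units
H = np.tanh(X @ W.T)                     # feature matrix, shape (N, N)
u = np.linalg.solve(H, y)                # second-layer weights
residual = np.linalg.norm(H @ u - y)     # interpolation up to fp error
```

This is exactly the "bypass" alluded to above: once the second layer alone can reach a global minimizer, the non-convexity in the first layer becomes immaterial for the landscape question.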
Corollary 9 (Polynomial activations).
For two-layer NNs (3) with a polynomial activation function $\rho$, the loss function (2) admits no spurious valleys in the over-parametrized regime $p \ge \dim^*_n(\rho)$.
Under the hypothesis of Corollary 9 with $\rho(z) = z^k$, a generic function of the associated functional space can also be represented, for some symmetric tensor $T \in \mathrm{Sym}_k(\mathbb{R}^n)$, in the generalized linear form
$$f(x) = \langle T, x^{\otimes k} \rangle.$$
The parameters $\theta = (W, U)$ and $T$ differ in their dimensions: the tensor parametrization is linear but high-dimensional, while the network parametrization can be much more compact. One would therefore like Corollary 9 to hold also (at least) for $p \ge \dim_{*, n}(\rho)$. In the next section we address this problem for the linear activation $\rho(z) = z$ and the quadratic activation $\rho(z) = z^2$.
3.1 Improved over-parametrization bounds for homogeneous polynomial activations
The over-parametrization bounds obtained in Corollary 9 are quite undesirable for practical applications. We show that they can indeed be improved in the case of linear and quadratic networks.
3.1.1 Linear networks case
Consider deep linear NNs, i.e. networks of the form
$$\Phi(x; \theta) = W_{K+1} W_K \cdots W_1 x, \qquad (4)$$
with $W_k \in \mathbb{R}^{p_k \times p_{k-1}}$, $p_0 = n$, $p_{K+1} = m$. For two-layer linear networks, the loss function is known to have no spurious local minima in a suitable over-parametrized regime, corresponding exactly to the one in Corollary 9. The following theorem improves on Corollary 9 for the case of multi-layer linear networks, showing that no over-parametrization is required to avoid spurious valleys, for square loss functions.
Theorem 10 (Linear networks).
For linear NNs (4) of any depth, any layer widths $p_1, \dots, p_K \ge 1$, and any input-output dimensions $n, m$, the square loss function admits no spurious valleys.
3.1.2 Quadratic networks case
Quadratic activations $\rho(z) = z^2$ have been considered in the literature [31, 18, 41] as second order approximations of general non-linear activations. In particular, for two-layer networks with one-dimensional output and square loss functions evaluated on a finite number of samples, it was shown in Du and Lee that the loss has no spurious local minima in an appropriate over-parametrized regime. Corollary 9 requires an over-parametrization bound that is quadratic in $n$ for quadratic activations. In the following theorem we show that $p \ge 2n$ is sufficient for the statement to hold, in the case of square loss functions and one-dimensional output ($m = 1$).
Theorem 11 (Quadratic networks).
For two-layer NNs (3) with quadratic activation function $\rho(z) = z^2$ and one-dimensional output ($m = 1$), the square loss function admits no spurious valleys in the over-parametrized regime $p \ge 2n$.
This result is in line with the one from Soltanolkotabi et al., where the authors proved absence of spurious local minima for fixed second-layer weights. The proof (reported in Section A) consists in constructing a path satisfying the property of Section 2, and improves upon the proof of Theorem 7 by leveraging the special linearized structure of the network for the quadratic activation. For every parameter $\theta = (W, u)$, we can write
$$\Phi(x; \theta) = \sum_{j=1}^{p} u_j (w_j^\top x)^2 = \langle A_\theta, x x^\top \rangle, \qquad A_\theta = \sum_{j=1}^{p} u_j\, w_j w_j^\top.$$
We notice that $\Phi(\cdot\,;\theta)$ can also be represented by a NN with $n$ hidden units; indeed, if $A_\theta = V D V^\top$ is an eigendecomposition of the symmetric matrix $A_\theta$, then $\Phi(x; \theta) = \sum_{i=1}^{n} d_i (v_i^\top x)^2$. Therefore $p = n$ is sufficient to describe any element of the functional space. The factor $2$ in the statement is due to some technicalities in the proof, but a more involved proof should be able to extend the result to the regime $p \ge n$. The extension of such a mechanism to higher order tensors (appearing as a result of multiple layers or higher-order polynomial activations) using tensor decompositions also seems possible and is left for future work.
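The eigendecomposition argument can be checked numerically. In the sketch below (sizes are illustrative assumptions), a width-$p$ quadratic network is collapsed to the matrix $A = \sum_j u_j w_j w_j^\top$ and rebuilt as a width-$n$ network from the eigenpairs of $A$, reproducing the same function:

```python
import numpy as np

# A width-p quadratic network equals x -> x^T A x with A = sum_j u_j w_j w_j^T;
# an eigendecomposition of A rebuilds it with at most n hidden units.
rng = np.random.default_rng(4)
n, p = 4, 20                              # p much larger than n
W = rng.standard_normal((p, n))
u = rng.standard_normal(p)

def quad_net(x, W, u):
    return float(u @ (W @ x) ** 2)        # sum_j u_j (w_j . x)^2

A = (W.T * u) @ W                         # A = sum_j u_j w_j w_j^T
evals, evecs = np.linalg.eigh(A)          # symmetric eigendecomposition
W_small, u_small = evecs.T, evals         # width-n network (v_i, d_i)

x = rng.standard_normal(n)
gap = abs(quad_net(x, W, u) - quad_net(x, W_small, u_small))
```

The two networks agree up to floating-point error, which is the "special linearized structure" exploited in the proof.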
3.1.3 Lower to upper intrinsic dimension gap
As observed in Lemma 5, the lower intrinsic dimension is finite for the linear and quadratic activations. Therefore, Theorem 10 and Theorem 11 say that, for $\rho(z) = z$ and $\rho(z) = z^2$, the square loss function admits no spurious valleys in an over-parametrized regime of the order of the lower intrinsic dimension. We conjecture that this holds for any (sufficiently regular) activation function with finite lower intrinsic dimension.
4 Infinite intrinsic dimension and presence of spurious valleys
This section is devoted to the construction of worst-case scenarios for non-over-parametrized networks. The main result (Theorem 12) essentially states that, for networks with width smaller than the lower intrinsic dimension defined above, spurious valleys can be created by choosing adversarial data distributions. We then show how this implies negative results for under-parametrized polynomial architectures and for a large variety of architectures used in practice.
Theorem 12. Consider the square loss function for two-layer NNs (3) with a non-negative activation function $\rho$. If $p < \dim_{*, n}(\rho)$, then there exists a r.v. $(X, Y)$ such that the square loss function admits spurious valleys. In particular, for any given $\delta > 0$, the r.v. can be chosen in such a way that there are two disjoint open subsets $\Omega_A, \Omega_B$ of the parameter space such that every parameter in $\Omega_A$ lies in a spurious valley whose loss values exceed the global minimum by at least $\delta$ (5),
and any path $\theta_t$ such that $\theta_0 \in \Omega_B$ and $\theta_1$ is a global minimum verifies
$$\max_{t \in [0,1]} L(\theta_t) - L(\theta_0) \ge \delta. \qquad (6)$$
Equation (5) in Theorem 12 says that any local descent algorithm, if initialized with a parameter value belonging to such a spurious valley, will at best produce a final parameter value which is at least $\delta$ away from optimality in loss value. Equation (6) implies that there exists an open subset of the spurious valley such that any path starting at a parameter in this subset must 'up-climb' by at least $\delta$ in loss value. In the following we refer to this property, as stated in Theorem 12, by saying that the loss function has arbitrarily bad spurious valleys. Note that this result ensures that spurious valleys have positive Lebesgue measure, so there is a positive probability that gradient descent methods initialized with a measure that is absolutely continuous with respect to the Lebesgue measure will get stuck in a bad local minimum.
Applying the observations describing the values of the lower intrinsic dimension for different activation functions, we get the following corollaries.
Corollary 13 (Homogeneous even degree polynomial activations).
Assume that $\rho(z) = z^k$ with $k$ a positive even integer. For two-layer NNs (3), if the hidden-layer width satisfies $p < \dim_{*, n}(\rho)$,
then there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys.
This follows by Theorem 12 and Corollary 5, since $\rho(z) = z^k$ is non-negative for even $k$. For the well-known case $k = 2$ (symmetric matrices) the maximal symmetric rank equals $n$; therefore Corollary 13 implies that the bound provided in Corollary 9 is almost (up to a constant factor) tight. This is in line with recent works which explored quadratic architectures [41, 18].
Corollary 14 (Spurious valleys exist in generic architectures).
For two-layer NNs (3) with any finite hidden-layer width $p$ and a continuous, non-negative, non-polynomial activation function $\rho$, there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys. This setting includes the following activation functions:
The ReLU activation function $\rho(z) = \max(0, z)$ and some relaxations of it, such as the softplus activation functions $\rho_\beta(z) = \beta^{-1} \log(1 + e^{\beta z})$, with $\beta > 0$;
The sigmoid activation function $\rho(z) = (1 + e^{-z})^{-1}$ and the function $\rho(z) = \frac{1}{2}(1 + \mathrm{erf}(z))$, which represents an approximation to the sigmoid function.
This follows by Theorem 12 by observing that $\dim_{*, n}(\rho) = \infty$ if $\rho$ is one of the above activation functions. Corollary 14 generalizes some recent negative results [38, 46] for practical activations. We remark that while in these works the authors proved existence of spurious local minima, we prove that, in fact, arbitrarily bad spurious valleys can exist, which is a stronger negative characterization.
The results of this section can be interpreted as worst-case scenarios for the problem of optimizing (2). We showed that, even for simple two-layer neural network architectures with non-linear activation functions used in practice (such as the ReLU), global optimality results cannot hold unless we make some assumptions on the data distributions.
5 Typical Spurious Valleys and Low-Energy Barriers
In the previous section it was shown that whenever the number of hidden units is below the lower intrinsic dimension, one can construct worst-case data distributions that yield a landscape with arbitrarily bad spurious valleys.
In this section, we study the energy landscape under generic data distributions in the case of homogeneous activations, and show that, although spurious valleys may appear, they tend to do so below a certain energy level, controlled by the decay of the spectral decomposition of the kernel defined by the activation function and by the amount of over-parametrization $p$. This phenomenon is consistent with the empirical success of gradient-descent algorithms in conditions where $p$ is indeed below the intrinsic dimension.
We consider oracle square loss functions of the form
$$L(\theta) = \mathbb{E}\, | Y - \Phi(X; \theta) |^2 \qquad (7)$$
for one-dimensional-output two-layer NNs (3), with $m = 1$, $\rho$ a positively homogeneous function, and square integrable r.v.'s $X, Y$. Notice that we can write $Y = f(X) + Z$ for some measurable $f$ such that $\mathbb{E}[Z \mid X] = 0$. In particular this implies that
$$L(\theta) = \mathbb{E}\, | f(X) - \Phi(X; \theta) |^2 + \mathbb{E}\, |Z|^2.$$
As $p \to \infty$, the optimization problem (7) becomes convex, a fact that is exploited in several recent works [32, 36, 10]. As observed by Bach, the effect of having only a finite number of hidden neurons can be recast as obtaining a quadrature rule for the reproducing kernel associated to the activation function. The following theorem is a direct application of Proposition 1 of that work, and relates the quadrature error to the ability to avoid large loss barriers with high probability.
Let $\tau$ be the uniform distribution over the unit sphere $S^{n-1}$, and consider an initial parameter $\theta_0 = (W_0, U_0)$ with the rows of $W_0$ sampled i.i.d. from $\tau$. Then the following hold:
There exists a path $t \mapsto \theta_t$ starting at $\theta_0$ such that the function $t \mapsto L(\theta_t)$ is non-increasing and the final loss $L(\theta_1)$ exceeds the global infimum by a term inversely proportional to $p$, up to logarithmic factors, with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.
If $f$ is sufficiently regular³, there exists a path $t \mapsto \theta_t$ starting at $\theta_0$ such that the function $t \mapsto L(\theta_t)$ is non-increasing and the final loss satisfies an improved bound of the same form, with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.
³More precisely, if the function $f$ can be written as an integral of filter functions against a sufficiently regular density w.r.t. $\tau$.
The above result implies that convex optimization over the second layer is sufficient to reach a model whose error relative to the best possible one is inversely proportional to the hidden-layer size (up to logarithmic factors). Nevertheless, in practice, this approach will generally perform worse than standard gradient-descent training, which may require less over-parametrization to give satisfying results (see for example the numerical experiments in Section 4 of the cited work). This shows the importance of understanding gradient descent dynamics on the first layer, which is amenable to analysis in the $p \to \infty$ limit using mean-field techniques [36, 32] as well as optimal transport in Wasserstein metrics. Precisely quantifying how much is gained by optimizing the two layers jointly in the non-asymptotic case remains an important open question left for future work.
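The "freeze the first layer, convexly optimize the second" strategy described above can be sketched as a random-features experiment. All choices below (the ReLU, the single-unit target, the sizes) are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Freeze first-layer weights sampled uniformly on the sphere, then solve
# the (convex) least-squares problem over the second layer. The achievable
# loss for a fixed target shrinks as the width p grows.
rng = np.random.default_rng(5)
n, N = 3, 500
X = rng.standard_normal((N, n))
y = np.maximum(X @ np.ones(n), 0.0)      # target realizable by one ReLU unit

def fitted_loss(p):
    W = rng.standard_normal((p, n))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # rows uniform on sphere
    H = np.maximum(X @ W.T, 0.0)                   # random ReLU features
    u, *_ = np.linalg.lstsq(H, y, rcond=None)      # convex second-layer fit
    return float(np.mean((H @ u - y) ** 2))

losses = [fitted_loss(p) for p in (2, 20, 200)]    # decreasing with width
```

The decay of `losses` with $p$ mirrors the quadrature-rate bound: the frozen first layer acts as a random quadrature of the kernel, and only the residual approximation error remains after the convex step.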
6 Future directions
We considered the problem of characterizing the loss surface of neural networks from the perspective of optimization, with the goal of deriving weak certificates that enable, or prevent, the existence of descent paths towards global minima.
The topological properties studied in this paper, however, do not yet capture fundamental aspects that are necessary to explain the empirical success of deep learning methods. We identify a number of different directions that deserve further attention.
The positive results presented above rely on being able to reduce the network to the case when (convex) optimization over the second layer is sufficient to reach optimal weight values. A better understanding of the first-layer dynamics is still needed. Moreover, in such positive results we only proved the non-existence of (high) energy barriers. While this is an interesting property from the optimization point of view, it is not sufficient to guarantee convergence of local descent algorithms. Another informative property of the loss function that should be addressed in future works is the existence of local descent directions at non-optimal points: for every non-optimal $\theta$ and any neighborhood $U$ of $\theta$, there exists $\theta' \in U$ such that $L(\theta') < L(\theta)$. More generally, our present work is not informative on the performance of gradient descent in the regimes with no spurious valleys.
Another very important point to be addressed in the future is how to extend the above results to architectures of more practical interest. Depth and the specific linear structure of Convolutional Neural Networks, critical to explain the excellent empirical performance of deep learning in computer vision, text or speech, need to be exploited, as do specific design choices such as residual connections and normalization strategies, as done in recent work. This also requires making specific assumptions on the data distribution, and is left for future work.
We would like to thank Gérard Ben Arous and Léon Bottou for fruitful discussions, and Jean Ponce for valuable comments and corrections of the original version of this manuscript. The first author would also like to thank Jumageldi Charyyev for fruitful discussions on the proofs of several propositions and Andrea Ottolini for valuable comments on a previous version of this manuscript.
-  Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
-  Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
-  Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Conference on Learning Theory, pages 361–382, 2016.
-  Alberto Bietti and Julien Mairal. Group invariance and stability to deformations of deep convolutional representations. arXiv preprint arXiv:1706.03078, 2017.
-  Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714, 2017.
-  Leon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. arXiv preprint arXiv:1712.07822, 2017.
-  Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
-  Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
-  Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
-  Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.
-  Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
-  Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
-  Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963, 2016.
-  Pierre Comon, Gene Golub, Lek-Heng Lim, and Bernard Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30(3):1254–1279, 2008.
-  Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
-  Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
-  Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.
-  Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
-  Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
-  Soheil Feizi, Hamid Javadi, Jesse Zhang, and David Tse. Porcupine neural networks:(almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.
-  Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. ICLR 2017, 2017.
-  Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.
-  Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
-  Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
-  Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
-  Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
-  Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
-  Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
-  Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
-  Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
-  Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
-  Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
-  Marco Mondelli and Andrea Montanari. On the connection between learning two-layers neural networks and tensor decomposition. arXiv preprint arXiv:1802.07301, 2018.
-  Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
-  Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, pages 2798–2806, 2017.
-  Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
-  Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
-  Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.
-  Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.
-  Ohad Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
-  Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
-  Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
-  Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
-  Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
-  Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
-  Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. A critical view of optimality in deep learning. arXiv preprint arXiv:1802.03487, 2018.
-  Yuchen Zhang, Percy Liang, and Martin J Wainwright. Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000, 2016.
Appendix A Proofs of Section 3
A.1 Proof of Theorem 7
We note that, under the assumptions of Theorem 7, the same optimal NN functions could also be obtained using a generalized linear model, where the representation function has the linear form $x \mapsto \langle \beta, \phi(x) \rangle$, for some parameter-independent function $\phi$. The main difference between the two models is that the former requires the choice of a non-linear activation function $\sigma$, while the latter implies the choice of a kernel function. This is the content of the following lemma.
Lemma 16. Let $\sigma \colon \mathbb{R} \to \mathbb{R}$ be a continuous function and $X$ an $\mathbb{R}^n$-valued r.v. Assume that the linear space
$$\mathcal{V} = \operatorname{span}\{\, \sigma(\langle w, X \rangle) : w \in \mathbb{R}^n \,\}$$
is finite dimensional. Then there exists a scalar product $\langle \cdot, \cdot \rangle_{\mathcal{V}}$ on $\mathcal{V}$ and a map $\beta \colon \mathbb{R}^n \to \mathcal{V}$ such that
$$\beta(w) = \sigma(\langle w, X \rangle) \tag{8}$$
for all $w \in \mathbb{R}^n$. Moreover, the function $\beta$ is continuous.
For sake of simplicity, in the following we write $\mathcal{V}$ for $\operatorname{span}\{\sigma(\langle w, X \rangle) : w \in \mathbb{R}^n\}$ and $q$ for $\dim \mathcal{V}$. Let $\{v_1, \dots, v_q\}$ be a basis of $\mathcal{V}$. If $f = \sum_{i=1}^q a_i v_i$ and $g = \sum_{i=1}^q b_i v_i$, then we can define a scalar product on $\mathcal{V}$ as
$$\langle f, g \rangle_{\mathcal{V}} = \sum_{i=1}^q a_i b_i.$$
If we define the map $\beta$ as
$$\beta(w) = \sigma(\langle w, X \rangle),$$
then property (8) follows directly by the definition of the function $\beta$. Moreover, we can choose $w_1, \dots, w_q \in \mathbb{R}^n$ such that $\{\beta(w_1), \dots, \beta(w_q)\}$ is a basis of $\mathcal{V}$. Now we need to show that, for $w \in \mathbb{R}^n$, the map $\beta$ is continuous. Let $A$ be the matrix $(\langle \beta(w_i), \beta(w_j) \rangle_{\mathcal{V}})_{i,j=1}^q$ and $b(w)$ be the vector $(\langle \beta(w), \beta(w_i) \rangle_{\mathcal{V}})_{i=1}^q$. Then the coordinate vector of $\beta(w)$ in the basis $\{\beta(w_1), \dots, \beta(w_q)\}$ equals $A^{-1} b(w)$, which is continuous in $w$. This shows that the map $\beta$ is continuous. ∎
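As a concrete illustration of the lemma (an example of ours, not from the text), take the quadratic activation $\sigma(t) = t^2$. Then $\sigma(\langle w, x \rangle) = \langle ww^\top, xx^\top \rangle_F$, so $\mathcal{V}$ has dimension at most $n^2$, and the map $w \mapsto \operatorname{vec}(ww^\top)$ plays the role of $\beta$ in explicit coordinates. A minimal numerical sketch (the helper name `psi` is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def psi(v):
    # Explicit coordinates of beta for sigma(t) = t^2:
    # (w.x)^2 = trace(w w^T x x^T) = vec(w w^T) . vec(x x^T).
    return np.outer(v, v).ravel()

for _ in range(100):
    w, x = rng.standard_normal(n), rng.standard_normal(n)
    lhs = np.dot(w, x) ** 2              # sigma(<w, x>)
    rhs = np.dot(psi(w), psi(x))         # inner product of feature vectors
    assert np.isclose(lhs, rhs)
```

Here the same map serves on both sides; in general the lemma only guarantees some continuous $\beta$ into a finite-dimensional space.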
The non-trivial fact captured by Theorem 7 is the following: when the capacity of the network is large enough to match a generalized linear model, but still finite, then the problem of optimizing the loss function (2), which is in general highly non-convex, satisfies an interesting optimization property in view of the local descent algorithms used in practice to solve it.
Proof of Theorem 7.
Thanks to Lemma 16, there exist two continuous maps $\psi \colon \mathbb{R}^n \to \mathbb{R}^q$ and $\phi \colon \mathbb{R}^n \to \mathbb{R}^q$, with $q = \dim \mathcal{V}$, such that $\sigma(\langle w, x \rangle) = \langle \psi(w), \phi(x) \rangle$ for every $w, x \in \mathbb{R}^n$. Therefore, every two-layers NN can be written as $\Phi(x; \theta) = U \psi(W) \phi(x)$, where, if $W \in \mathbb{R}^{p \times n}$, then $\psi(W) \in \mathbb{R}^{p \times q}$ (that is, $\psi$ is applied row-wise).
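The linearized representation can be checked numerically in the quadratic special case $\sigma(t) = t^2$ (our illustration; `feat` stands in for $\psi = \phi$, with $q = n^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 3, 5, 2          # input dim, hidden units, output dim

def feat(v):
    # psi = phi : R^n -> R^{n^2} with sigma(<w, x>) = <psi(w), phi(x)>
    return np.outer(v, v).ravel()

W = rng.standard_normal((p, n))   # hidden-layer weights, rows w_i
U = rng.standard_normal((m, p))   # output weights, columns u_i
x = rng.standard_normal(n)

# Original network: Phi(x; theta) = U sigma(W x), sigma applied entrywise
net = U @ (W @ x) ** 2
# Linearized form: Phi(x; theta) = U psi(W) phi(x), psi applied row-wise
psi_W = np.stack([feat(w) for w in W])   # p x q matrix psi(W)
lin = U @ psi_W @ feat(x)

assert np.allclose(net, lin)
```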
The proof of the Theorem consists in exploiting the above linearized representation of $\Phi$ to show that property 1 holds (recall that this is equivalent to saying that the loss function has no spurious valleys). Given an initial parameter $\theta_0 = (W_0, U_0)$, we want to construct a continuous path $t \mapsto \theta(t)$, such that the function $t \mapsto L(\theta(t))$ is non-increasing and such that $\theta(0) = \theta_0$, $L(\theta(1)) = L^\ast$, where $L^\ast = \inf_\theta L(\theta)$. The construction of such a path can be articulated in two main steps:
The first part of the path consists in showing that we can assume w.l.o.g. that $\operatorname{rank} \psi(W_0) = q$. Let $w_1, \dots, w_p$ be the rows of $W_0$; suppose that $r \doteq \operatorname{rank} \psi(W_0) < q$ (otherwise there is nothing to show) and that, up to reordering, $\psi(w_1), \dots, \psi(w_r)$ are linearly independent. Denote $U = U_0$, and $u_1, \dots, u_p$ the columns of $U$. For $i > r$, we can write
$$\psi(w_i) = \sum_{j=1}^r c_{ij}\, \psi(w_j)$$
for some coefficients $c_{ij}$. If we define $U^1 \in \mathbb{R}^{m \times p}$ such that (denoting $u^1_j$ the $j$-th column of $U^1$)
$$u^1_j = u_j + \sum_{i > r} c_{ij}\, u_i \ \text{ for } j \le r, \qquad u^1_j = 0 \ \text{ for } j > r,$$
then $\Phi(\cdot\,; W_0, U^1) = \Phi(\cdot\,; W_0, U)$. Since $\Phi$ is linear in $U$, the path $\theta(t) = (W_0, (1-t)U + tU^1)$ leaves the network unchanged, i.e. $\Phi(\cdot\,; \theta(t)) = \Phi(\cdot\,; \theta_0)$ for $t \in [0, 1]$. At this point, we can select $\tilde w_{r+1}, \dots, \tilde w_p$ such that the matrix $W^1$ with rows $w_i$ for $i \le r$ and $\tilde w_i$ for $i > r$ verifies $\operatorname{rank} \psi(W^1) = q$. Notice that the existence of such vectors $\tilde w_i$, $i = r+1, \dots, p$, is guaranteed by the definition of $q = \dim \mathcal{V}$. Since $u^1_i = 0$ for $i > r$, the path $\theta(t) = ((2-t)W_0 + (t-1)W^1, U^1)$ leaves the network unchanged, i.e. $\Phi(\cdot\,; \theta(t)) = \Phi(\cdot\,; \theta(1))$ for $t \in [1, 2]$. The new parameter value $\theta_1 = (W^1, U^1)$ satisfies $\operatorname{rank} \psi(W^1) = q$.
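The weight-transfer argument of step 1 can be sanity-checked numerically. Again with the quadratic activation as our hedged stand-in (all helper names are ours): if a row of $W$ is redundant, moving its output weight onto the independent rows and then zeroing it leaves the network function unchanged, after which that row can be replaced freely.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2

w1, w2 = rng.standard_normal(n), rng.standard_normal(n)
w3 = 2.0 * w1                    # redundant row: psi(w3) = 4 * psi(w1)
W = np.stack([w1, w2, w3])
U = rng.standard_normal((m, 3))  # output weights, columns u_1, u_2, u_3

def net(U, W, x):
    # Two-layer network with sigma(t) = t^2, equal to U psi(W) phi(x)
    return U @ (W @ x) ** 2

# Transfer u_3 onto u_1 with coefficient c_31 = 4, then zero it out.
U1 = U.copy()
U1[:, 0] += 4.0 * U[:, 2]
U1[:, 2] = 0.0

# With u_3 = 0, the third row of W can be changed arbitrarily.
W1 = W.copy()
W1[2] = rng.standard_normal(n)

x = rng.standard_normal(n)
assert np.allclose(net(U, W, x), net(U1, W, x))    # first leg of the path
assert np.allclose(net(U1, W, x), net(U1, W1, x))  # second leg of the path
```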
By step 1, we can assume that $\operatorname{rank} \psi(W_0) = q$. Since the network has the form $\Phi(x; \theta) = U \psi(W) \phi(x)$ and since the loss $\ell$ is convex, there exists a parameter value such that