Neural Networks with Finite Intrinsic Dimension have no Spurious Valleys

02/18/2018 ∙ by Luca Venturi, et al. ∙ NYU

Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent spur in research attempting to characterize properties of their loss surface that may be responsible for such success. In particular, several authors have noted that over-parametrization appears to act as a remedy against non-convexity. In this paper, we address this phenomenon by studying key topological properties of the loss, such as the presence or absence of "spurious valleys", defined as connected components of sub-level sets that do not include a global minimum. Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, our main contribution is to prove that as soon as the hidden layer size matches the intrinsic dimension of the reproducing space, defined as the linear functional space generated by the activations, no spurious valleys exist, thus allowing the existence of descent directions. Our setup includes smooth activations such as polynomials, both in the empirical and population risk, and generic activations in the empirical risk case.


1 Introduction

Modern machine learning applications involve datasets of increasing dimensionality, complexity and size, which in turn motivate the use of high-dimensional, non-linear models, as illustrated in many deep learning algorithms across computer vision, speech and natural language understanding. The prevalent strategy for learning is to rely on Stochastic Gradient Descent (SGD) methods, that typically operate on non-convex objectives. In this context, an outstanding goal is to provide a theoretical framework that explains under what conditions – relating input data distribution, choice of architecture and choice of optimization scheme – this setup will be successful.

More precisely, let $\Phi = \{\Phi(\cdot\,;\theta) : \theta \in \Theta\}$ denote a model class parametrized by $\theta \in \Theta$, which in the case of Neural Networks (NNs) contains the aggregated weights across all layers. In a supervised learning setting, this model is deployed on some data $P$, a random variable taking values in $\mathbb{R}^n \times \mathbb{R}^m$, to predict targets $Y$ given features $X$, and its risk for a given $\theta$ is

$$L(\theta) = \mathbb{E}_{(X,Y) \sim P}\,\ell\big(\Phi(X;\theta), Y\big), \qquad (1)$$

where $\ell$ is a convex loss, such as a square loss or a logistic regression loss. In the following we refer to (1) as the risk, the energy or the loss interchangeably. The aim is to find $\theta^* \in \arg\min_{\theta \in \Theta} L(\theta)$, and this is attempted in practice by running the SGD iteration on the parameter,

$$\theta_{t+1} = \theta_t - \eta_t\, \nabla_\theta\, \ell\big(\Phi(x_t;\theta_t), y_t\big),$$

where the samples $(x_t, y_t)$ are drawn i.i.d. from $P$. Under some technical conditions, the expected gradient is known to converge to $0$ along this iteration [7]. Understanding the nature of such stationary points - and therefore the landscape of the loss function - is a task of fundamental importance to understand the performance of SGD.
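To fix ideas, here is a minimal sketch (ours, not from the paper; the toy data distribution, architecture sizes and step size are arbitrary) of the SGD iteration above applied to the square loss of a two-layer network, with gradients computed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 1, 8                      # input dim, output dim, hidden width
sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

W = 0.5 * rng.standard_normal((p, n))  # first-layer weights
U = 0.5 * rng.standard_normal((m, p))  # second-layer weights
lr = 0.05

def sample():
    """One draw (x, y) from a toy data distribution P (hypothetical)."""
    x = rng.standard_normal(n)
    y = np.array([np.sin(x[0]) + 0.1 * rng.standard_normal()])
    return x, y

for t in range(5000):
    x, y = sample()
    z = W @ x                           # pre-activations
    h = sigma(z)                        # hidden activations
    r = U @ h - y                       # residual of the square loss
    grad_U = 2.0 * np.outer(r, h)       # d/dU |U sigma(Wx) - y|^2
    grad_W = 2.0 * np.outer((U.T @ r) * dsigma(z), x)
    U -= lr * grad_U
    W -= lr * grad_W

x, y = sample()
print((U @ sigma(W @ x))[0], y[0])      # prediction vs. noisy target
```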

Whereas there is a growing literature analyzing the behavior of SGD on non-convex objectives [43, 28, 25, 44], we focus here on properties of the optimization problem above that are algorithm independent. Many authors have attempted to characterize the landscape of the loss function (1) by studying its critical points. Global optimality results have been obtained for NN architectures with linear activations [26, 29, 46], quadratic activations [41, 18] and some more general non-linear activations, under appropriate regularity assumptions [42, 34, 20]. Some other insights have been obtained by leveraging tools for the complexity analysis of spin glasses [11] and random matrix theory [35]. Other analyses studied the goodness of the initialization of the parameter values [15, 37, 19] or other topological properties of the loss (1), such as connectivity of sub-level sets [17, 21]. A common factor shared by the works cited above (and by common practice) is that over-parametrisation of the model class (i.e. increasing the number of parameters) often leads to improved performance, despite the potential increase in generalization error.

Each model class defines a functional space $\mathcal{F} = \{\Phi(\cdot\,;\theta) : \theta \in \Theta\}$, whose complexity a priori increases with the dimensionality of the parameter space $\Theta$. Whereas several authors have studied these models from the lens of approximation theory [13, 12, 5] by focusing on specific aspects of the parameterisation (such as the depth of the network), in this work we explore another hypothesis, namely that over-parameterization remedies the complexity of that functional space, often leading to loss functions without poor local minima.

Our approach is inspired by Freeman and Bruna [21], and is related to recent work that also explores convexifications of neural networks [1, 47, 4]. Our analysis focuses mostly on the class of two-layer neural networks, with a hidden layer of size $p$, and covers both empirical and population risk landscapes. A given activation function determines a functional space of networks. In essence, our work identifies notions of intrinsic dimension of this functional space, and establishes the following facts:

  1. If the hidden layer size $p$ is at least equal to the upper intrinsic dimension, then the resulting loss landscape is free of poor local minima, independently of the data distribution;

  2. If $p$ is smaller than the lower intrinsic dimension, then there exist data distributions yielding arbitrarily poor local minima.

We articulate the notion of poor local minima via what we call spurious valleys, defined as connected components of the sub-level sets that do not contain a global minimum. Upper and lower intrinsic dimensions define only two scenarios: either (i) they are both finite, enabling the positive results; or (ii) they are both infinite, implying the negative results. Moreover, case (i) only occurs for polynomial activation functions or when the data distribution is discrete, corresponding to generic empirical risk minimization. The negative results cover many classes of activation functions with infinite intrinsic dimension. In particular, they generalize previously known negative results (such as those for leaky ReLUs) [46, 38] to a far wider class of activations. While in general the upper and lower intrinsic dimensions may not match, we show that in some cases (linear and quadratic networks) the gap between the positive and negative results can be closed by improving on the former.

The negative results are worst-case in nature, and leave open the question of how complex is a ‘typical’ energy landscape corresponding to a generic data distribution. We answer this question by showing that, even if spurious valleys may appear in general, they are in practice easily avoided from random initializations, up to a low energy threshold, which approaches the global minimum at a rate inversely proportional to the hidden layer size up to log factors. This fact is shown for networks with homogeneous activations and generic data distributions and it is based on properties of random kernel quadrature rules [2].

Many other types of analysis of the convergence of gradient-based optimization algorithms for NNs have been considered in the literature. For example, Ge et al. [24] proved convergence of GD on a modified loss; Shamir [40] compared optimization properties of residual networks with respect to linear models; in Dauphin et al. [16] it is argued that the issues arising in the optimization of NN architectures are due to the presence of saddle points in the loss function rather than spurious local minima. Optimization landscapes have also been studied in contexts other than NN training, such as low-rank problems [22], matrix completion [23], problems arising in semidefinite programming [9, 3] and implicit generative modeling [6].

The rest of the paper is structured as follows. Section 2 formally introduces the notion of spurious valleys and explains why this is a relevant concept from the optimization point of view. It also defines the intrinsic dimensions of a network (Section 2.1). In Section 3 we state our main positive results (Theorem 7) and we discuss two settings where they bear fruit: polynomial activation functions and empirical risk minimization. For the case of linear and quadratic activations, we improve on our general result: for the linear case, we prove that Theorem 7 holds without any assumptions on the distribution of the data or on the size/rank of any variables (which extends previous results on the optimization of linear NNs [29, 46]), and for the quadratic case we recover results in line with the current literature [41, 18]. Section 4 is dedicated to constructions of worst-case scenarios for activations with infinite lower intrinsic dimension. We then show, in Section 5, that, even if spurious valleys may exist, they tend to be confined to regimes of low risk. Some discussion is reported in Section 6.

1.1 Notation

We introduce notation used throughout the rest of the paper. For any integer $n$ we denote $[n] = \{1, \dots, n\}$ and, if $m \le n$, $[m:n] = \{m, \dots, n\}$. We denote scalar valued variables as lowercase non-bold; vector valued variables as lowercase bold; matrix and tensor valued variables as uppercase bold. Given a vector $\mathbf{v} \in \mathbb{R}^n$, we denote its components as $v_i$; given a matrix $\mathbf{W}$, we denote its rows as $\mathbf{w}_i$; given a tensor $\mathbf{T}$, we denote its components as $T_{i_1 \cdots i_k}$. Given some vectors $\mathbf{v}_1, \dots, \mathbf{v}_k \in \mathbb{R}^n$, the tensor product $\mathbf{v}_1 \otimes \cdots \otimes \mathbf{v}_k$ denotes the $k$-dimensional tensor whose components are given by $(\mathbf{v}_1 \otimes \cdots \otimes \mathbf{v}_k)_{i_1 \cdots i_k} = (v_1)_{i_1} \cdots (v_k)_{i_k}$; given a vector $\mathbf{v}$, we denote $\mathbf{v}^{\otimes k} = \mathbf{v} \otimes \cdots \otimes \mathbf{v}$ ($k$ times). $\mathbf{I}_n$ denotes the $n \times n$ identity matrix and $\mathbf{e}_1, \dots, \mathbf{e}_n$ the standard basis in $\mathbb{R}^n$. For any random variables (r.v.'s) $X$ and $Y$ with values in $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively, we denote $\Sigma_X = \mathbb{E}[XX^T]$ and $\Sigma_{XY} = \mathbb{E}[XY^T]$. For every integer $n$, we denote by $\mathrm{GL}(n)$, $\mathrm{O}(n)$ and $\mathrm{SO}(n)$, respectively, the general linear group, the orthogonal group and the special orthogonal group of $n \times n$ real matrices. We denote by $\mathrm{Sym}^k(\mathbb{R}^n)$ the space of order-$k$ symmetric tensors on $\mathbb{R}^n$. For any $\mathbf{T} \in \mathrm{Sym}^k(\mathbb{R}^n)$, we define the symmetric rank [14] as $\mathrm{rank}_S(\mathbf{T}) = \min\{ r : \mathbf{T} = \sum_{i=1}^r \lambda_i\, \mathbf{v}_i^{\otimes k} \}$. We define $\mathrm{rank}_S(n, k) = \max\{ \mathrm{rank}_S(\mathbf{T}) : \mathbf{T} \in \mathrm{Sym}^k(\mathbb{R}^n) \}$. Finally, $\mathbb{S}^{n-1}$ denotes the $(n-1)$-dimensional sphere $\{ \mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| = 1 \}$.
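For readers less familiar with this tensor notation, a small numerical illustration (ours): the tensor product $\mathbf{v}_1 \otimes \cdots \otimes \mathbf{v}_k$ is a single `einsum`, and for order-$2$ symmetric tensors the symmetric rank defined above reduces to the ordinary matrix rank.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])

# v (x) v (x) v : a 3 x 3 x 3 symmetric tensor with entries v_i * v_j * v_k.
T = np.einsum('i,j,k->ijk', v, v, v)
print(T.shape, T[0, 1, 2])             # (3, 3, 3) and 1 * 2 * 3 = 6.0

# For k = 2, Sym^2(R^n) is the space of symmetric matrices, and the symmetric
# rank (minimal number of terms lambda_i v_i (x) v_i) is the matrix rank.
v1, v2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, -1.0])
S = np.outer(v1, v1) + np.outer(v2, v2)
print(np.linalg.matrix_rank(S))        # 2
```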

2 Problem setting

Let be two r.v.’s. These r.v.’s take values in and and represent the input and output data, respectively. We consider oracle square loss functions of the form

(2)

where is convex. For every , the function models the dependence of the output on the input as . We focus on two-layers NN functions , i.e. of the form

(3)

where . Here represents the width of the hidden layer and is a continuous element-wise activation function.

The loss function $L(\theta)$ is (in general) a non-convex object; it may present spurious (i.e. non-global) local minima. In this work, we characterize $L$ by determining the absence or presence of spurious valleys, as defined below.

Definition 1.

For all $c \in \mathbb{R}$ we define the sub-level set of $L$ as $\Omega_L(c) = \{\theta \in \Theta : L(\theta) \le c\}$. We define a spurious valley as a connected component of a sub-level set $\Omega_L(c)$ which does not contain a global minimum of the loss $L$.

Since, in practice, the loss (2) is minimized with gradient-descent-based algorithms, the absence of spurious valleys is a desirable property if we wish the algorithm to converge to an optimal parameter. It is easy to see that not having spurious valleys is equivalent to the following property:

  P.1. Given any initial parameter $\theta_0 \in \Theta$, there exists a continuous path $t \in [0,1] \mapsto \theta(t) \in \Theta$ such that:

    1. $\theta(0) = \theta_0$ and $\theta(1) \in \arg\min_{\theta \in \Theta} L(\theta)$;

    2. The function $t \mapsto L(\theta(t))$ is non-increasing.

As pointed out in Freeman and Bruna [21], this implies that $L$ has no strict spurious (i.e. non-global) local minima. The absence of generic (i.e. non-strict) spurious local minima is guaranteed if the path is such that the function $t \mapsto L(\theta(t))$ is strictly decreasing. For many activation functions used in practice (such as the ReLU $\sigma(z) = \max\{z, 0\}$), the parameter determining the function $\Phi(\cdot\,;\theta)$ is determined only up to the action of a symmetry group (e.g., in the case of the ReLU, $\sigma$ is a positively homogeneous function). This prevents strict minima: for any value of the parameter $\theta$ there exists a (often large) manifold containing $\theta$ along which the loss function is constant. Absence of spurious valleys for the loss (2) implies that it is always possible to move from any point in the parameter space to a global minimum, without increasing the loss.

2.1 Intrinsic dimension of a network

The main result of this work exploits the fact that the absence of spurious valleys is related to the complexity of the functional space defined by the network architecture. We therefore define two measures of such complexity, which we will use to show, respectively, positive and negative results in this regard.

To simplify the discussion, we introduce some notation which we will use throughout the rest of the paper. Let $\sigma$ be a continuous activation function. For every $\mathbf{w} \in \mathbb{R}^n$ we denote by $\sigma_{\mathbf{w}}$ the function $\sigma_{\mathbf{w}}(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$. We refer to each $\sigma_{\mathbf{w}}$ as a filter function. If $X$ is a r.v. taking values in $\mathbb{R}^n$, we denote by $L^2(X)$ the space of square integrable functions on $\mathbb{R}^n$ w.r.t. the probability measure induced by the r.v. $X$. We then define the two following spaces:

$$\mathcal{F}_\sigma = \{ \Phi(\cdot\,;\theta) : \theta \in \Theta \}, \qquad \mathcal{D}_\sigma = \{ X : \sigma_{\mathbf{w}} \in L^2(X) \ \text{for all } \mathbf{w} \in \mathbb{R}^n \};$$

$\mathcal{F}_\sigma$ represents the space of (one-dimensional output) functions modeled by the network architecture and $\mathcal{D}_\sigma$ the space of ($n$-dimensional) input data distributions for which the filter functions have finite second moment. We finally define

$$\mathcal{V}_{\sigma, X} = \mathrm{span}\{ \sigma_{\mathbf{w}} : \mathbf{w} \in \mathbb{R}^n \} \subseteq L^2(X)$$

as the linear space spanned by the functions $\sigma_{\mathbf{w}}$ for $\mathbf{w} \in \mathbb{R}^n$.

Definition 2.

Let $\sigma$ be a continuous activation function and $X \in \mathcal{D}_\sigma$ a r.v. We define¹

$$\dim^*(\sigma, X) = \dim \mathcal{V}_{\sigma, X}$$

as the upper intrinsic dimension of the pair $(\sigma, X)$. We define the level-$n$ upper intrinsic dimension of $\sigma$ as $\dim^*_n(\sigma) = \sup_{X \in \mathcal{D}_\sigma} \dim^*(\sigma, X)$.

¹For any linear subspace $\mathcal{V} \subseteq L^2(X)$, $\dim \mathcal{V}$ denotes the dimension of $\mathcal{V}$ as a subspace of $L^2(X)$.

The upper intrinsic dimension defined above is therefore the dimension of the functional space spanned by the filter functions or, equivalently, of the image of the map $\mathbf{w} \mapsto \sigma_{\mathbf{w}}$. Notice that $\dim^*(\sigma, X) \le \dim^*_n(\sigma)$. In particular, if the distribution of $X$ is discrete, i.e. it is concentrated on a finite number of points $\mathbf{x}_1, \dots, \mathbf{x}_N$, then $\dim^*(\sigma, X) \le N$. Otherwise, if the distribution is not discrete, then $\dim^*(\sigma, X)$ may be infinite.

The level-$n$ upper intrinsic dimension can equivalently be seen as the dimension of the linear space spanned by the filter functions viewed as continuous functions on $\mathbb{R}^n$. We note that if $X$ is a r.v. with almost surely (a.s.) positive density w.r.t. the Lebesgue measure on $\mathbb{R}^n$, then $\dim^*(\sigma, X) = \dim^*_n(\sigma)$.

The following lemma exhausts all the cases in which the upper intrinsic dimension is finite.

Lemma 3.

Let $\sigma$ be a continuous activation function and $X \in \mathcal{D}_\sigma$ such that $\dim^*(\sigma, X) = \dim^*_n(\sigma)$. If $\sigma$ is a polynomial, $\sigma(z) = \sum_{j \in J} a_j z^j$ with $a_j \neq 0$ for $j \in J$, then

$$\dim^*(\sigma, X) = \sum_{j \in J} \binom{n+j-1}{j} < \infty.$$

Otherwise (i.e. if $\sigma$ is not a polynomial) it holds $\dim^*(\sigma, X) = \infty$.
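The finite-dimensional case can be probed numerically. The sketch below (ours; the quadratic activation and the choice $n = 3$ are arbitrary) evaluates many random filter functions on samples of $X$ and estimates $\dim^*(\sigma, X)$ as the rank of the resulting matrix; for $\sigma(z) = z^2$ the rank saturates at $\binom{n+1}{2} = 6$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                   # input dimension
sigma = lambda z: z ** 2                # quadratic activation

X = rng.standard_normal((2000, n))      # samples of the input r.v. X
W = rng.standard_normal((200, n))       # many random filter weights w

# Column j is the filter function sigma_w evaluated on the samples, w = W[j].
features = sigma(X @ W.T)               # shape (2000, 200)

# The numerical rank of this matrix estimates dim*(sigma, X).
print(np.linalg.matrix_rank(features, tol=1e-8))   # 6 = binom(n + 1, 2)
```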

We then define the lower intrinsic dimension, which corresponds to the concept of 'how many hidden neurons are needed to represent a generic function of $\mathcal{V}_{\sigma, X}$'.

Definition 4.

Let $\sigma$ be a continuous activation function and $X \in \mathcal{D}_\sigma$ a r.v. We define²

$$\dim_*(\sigma, X) = \min\{ p \ge 1 : \mathcal{V}_{\sigma, X} \subseteq \mathcal{F}_\sigma^p \},$$

where $\mathcal{F}_\sigma^p$ denotes the set of functions of the form (3) with hidden-layer width $p$, as the lower dimension of the pair $(\sigma, X)$. We define the level-$n$ lower dimension of $\sigma$ as $\dim_{*,n}(\sigma) = \sup_{X \in \mathcal{D}_\sigma} \dim_*(\sigma, X)$.

²For any subsets $A, B \subseteq L^2(X)$, we say that $A \subseteq B$ if the inclusion holds as subsets of $L^2(X)$ (and similarly for other inclusions or equalities).

If $\dim_*(\sigma, X)$ is finite, then it corresponds to the minimum number of hidden neurons which are needed to represent any function of $\mathcal{V}_{\sigma, X}$ with the NN architecture (3). Clearly, this implies that $\dim_*(\sigma, X) \le \dim^*(\sigma, X)$ for every continuous activation function $\sigma$ and any $X \in \mathcal{D}_\sigma$. As with the upper intrinsic dimension, we note that if $X$ is a r.v. with a.s. positive density w.r.t. the Lebesgue measure on $\mathbb{R}^n$, then $\dim_*(\sigma, X) = \dim_{*,n}(\sigma)$.

In the case of homogeneous polynomial activations $\sigma(z) = z^k$ with $k$ a positive integer, the level lower dimension of $\sigma$ coincides with the notion of (maximal) symmetric tensor rank.

Lemma 5.

Let $\sigma(z) = z^k$, with $k$ a positive integer. Then $\dim_{*,n}(\sigma) = \mathrm{rank}_S(n, k)$.

Finally, the next lemma implies that for most non-polynomial activation functions of practical interest, the lower intrinsic dimension is infinite.

Lemma 6.

Let $\sigma$ be a continuous activation function. Then $\dim_{*,n}(\sigma) = \infty$ if and only if $\sigma$ is not a polynomial.

3 Finite network dimension and absence of spurious valleys

In this section we provide our positive results. Essentially, they state that if the width of the network matches the dimension of the functional space spanned by its filter functions, then no spurious valleys exist. We first provide the main result (Theorem 7) in a general form, which allows a straightforward derivation of two cases of interest: empirical risk minimization (Corollary 8) and polynomial activations (Corollary 9).

Theorem 7.

For any continuous activation function $\sigma$ and r.v. $X \in \mathcal{D}_\sigma$ with finite upper intrinsic dimension $q = \dim^*(\sigma, X)$, the loss function $L(\theta)$ for two-layer NNs (3) admits no spurious valleys in the over-parametrized regime $p \ge q$.

The above result can be re-phrased as follows: if the network is such that any of its output units can be chosen from the whole linear space spanned by its filter functions, then the associated optimization problem always admits a descent path to an optimal solution, for any initialization of the parameters.

Applying the observations in Section 2.1 describing the cases of finite intrinsic dimension, we immediately get the following corollaries.

Corollary 8 (Erm).

Consider data points $(\mathbf{x}_i, \mathbf{y}_i)$, $i \in [N]$. For two-layer NNs (3), where $\sigma$ is any continuous activation function, the empirical loss function

$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell\big(\Phi(\mathbf{x}_i;\theta), \mathbf{y}_i\big)$$

admits no spurious valleys in the over-parametrized regime $p \ge N$.

This result is in line with previous works that considered the landscape of empirical risk minimization for half-rectified deep networks [42, 45, 31, 34]. However, its proof illustrates the danger of studying empirical risk minimization landscapes in over-parametrized regimes, since it bypasses all the geometric and algebraic properties needed in the population risk setting - which may be more relevant to understand the generalization properties of the model.
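As an informal illustration of why $p \ge N$ trivializes the empirical landscape (our sketch; data and dimensions are arbitrary), fixing random first-layer weights and fitting the second layer by ordinary least squares is already a convex problem, and with $p \ge N$ the hidden feature matrix generically has full row rank, so the empirical square loss is driven to (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 50, 10, 60                    # p >= N data points
relu = lambda z: np.maximum(z, 0.0)

X = rng.standard_normal((N, n))
y = rng.standard_normal(N)              # arbitrary targets

W = rng.standard_normal((p, n))         # random first-layer weights, kept fixed
H = relu(X @ W.T)                       # hidden features, shape (N, p)

# Convex least-squares problem in the second-layer weights only.
u, *_ = np.linalg.lstsq(H, y, rcond=None)

print(np.linalg.norm(H @ u - y))        # ~0 when H has full row rank (generic for p >= N)
```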

Corollary 9 (Polynomial activations).

For two-layer NNs (3) with polynomial activation function $\sigma$, the loss function admits no spurious valleys in the over-parametrized regime $p \ge \dim^*_n(\sigma)$, where $\dim^*_n(\sigma) < \infty$ is given by Lemma 3.

Under the hypotheses of Corollary 9 with $\sigma(z) = z^k$, a generic function of $\mathcal{V}_{\sigma, X}$, $f(\mathbf{x}) = \sum_{i=1}^p u_i (\mathbf{w}_i^T \mathbf{x})^k$, can also be represented, for some symmetric tensor $\mathbf{A} \in \mathrm{Sym}^k(\mathbb{R}^n)$, in the generalized linear form

$$f(\mathbf{x}) = \langle \mathbf{A}, \mathbf{x}^{\otimes k} \rangle,$$

with $\mathbf{A} = \sum_{i=1}^p u_i\, \mathbf{w}_i^{\otimes k}$. The parameters $\theta = (\mathbf{W}, \mathbf{u})$ and $\mathbf{A}$ differ in their dimensions: the former has $p(n+1)$ components, while the latter has $\binom{n+k-1}{k}$ free components. One would therefore like Corollary 9 to hold also (at least) for $p \ge \dim_{*,n}(\sigma)$. In the next section we address this problem for the linear activation $\sigma(z) = z$ and the quadratic activation $\sigma(z) = z^2$.

3.1 Improved over-parametrization bounds for homogeneous polynomial activations

The over-parametrization bounds obtained in Corollary 9 are quite undesirable in practical applications. We show that they can indeed be improved for the case of linear and quadratic networks.

3.1.1 Linear networks case

Linear networks have been considered as a first order approximation of feed-forward multi-layer networks [29]. It was shown in several works [29, 21, 46] that, for linear networks of any depth,

$$\Phi(\mathbf{x};\theta) = \mathbf{W}_L \cdots \mathbf{W}_1 \mathbf{x}, \qquad (4)$$

with $\theta = (\mathbf{W}_1, \dots, \mathbf{W}_L)$, the loss function has no spurious local minima, provided the hidden layers are wide enough. This corresponds exactly to the over-parametrization regime in Corollary 9 for the case of two-layer networks. The following theorem improves on Corollary 9 for the case of multi-layer linear networks, showing that no over-parametrization is required in this case to avoid spurious valleys, for square loss functions.

Theorem 10 (Linear networks).

For linear NNs (4) of any depth $L \ge 2$, any layer widths $p_1, \dots, p_{L-1} \ge 1$, and any input-output dimensions $n, m$, the square loss function admits no spurious valleys.

3.1.2 Quadratic networks case

Quadratic activations have been considered in the literature [31, 18, 41] as a second order approximation of general non-linear activations. In particular, for two-layer networks with one-dimensional output and square loss functions evaluated on samples, it was shown in Du and Lee [18] that, under mild over-parametrization, the loss has no spurious local minima. Corollary 9 requires an over-parametrization bound of order $n^2$ for the case of quadratic activations. In the following theorem we show that $p \ge 2n$ is sufficient for the statement to hold, in the case of square loss functions and one-dimensional output ($m = 1$).

Theorem 11 (Quadratic networks).

For two-layer NNs (3) with quadratic activation function $\sigma(z) = z^2$ and one-dimensional output ($m = 1$), the square loss function admits no spurious valleys in the over-parametrized regime $p \ge 2n$.

This result is in line with the one from Soltanolkotabi et al. [41], where the authors proved absence of spurious local minima in a similar over-parametrized regime, but for fixed second-layer weights. The proof (reported in Section A) consists in constructing a path satisfying P.1 and improves upon the proof of Theorem 7 by leveraging the special linearized structure of the network for the quadratic activation. For every parameter $\theta = (\mathbf{W}, \mathbf{u})$, we can write

$$\Phi(\mathbf{x};\theta) = \sum_{i=1}^p u_i\, (\mathbf{w}_i^T \mathbf{x})^2 = \mathbf{x}^T \mathbf{A}(\theta)\, \mathbf{x}, \qquad \mathbf{A}(\theta) = \sum_{i=1}^p u_i\, \mathbf{w}_i \mathbf{w}_i^T.$$

We notice that $\Phi(\cdot\,;\theta)$ can also be represented by a NN with $n$ hidden units; indeed, if $\mathbf{A}(\theta) = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$ is the eigendecomposition of the symmetric matrix $\mathbf{A}(\theta)$, then $\Phi(\mathbf{x};\theta) = \sum_{j=1}^n \lambda_j (\mathbf{v}_j^T \mathbf{x})^2$. Therefore $p = n$ is sufficient to describe any element in $\mathcal{F}_\sigma$. The factor $2$ in the statement is due to some technicalities in the proof, but a more involved proof should be able to extend the result to the regime $p \ge n$. The extension of such a mechanism to higher order tensors (appearing as a result of multiple layers or higher-order polynomial activations) using tensor decompositions also seems possible and is left for future work.
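A quick numerical illustration of this linearized structure (our sketch; the dimensions are arbitrary): a width-$p$ quadratic network collapses to the symmetric matrix $\mathbf{A}(\theta) = \sum_i u_i \mathbf{w}_i \mathbf{w}_i^T$, and its eigendecomposition rebuilds an equivalent network with only $n$ hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 12                            # input dimension, over-parametrized width

W = rng.standard_normal((p, n))         # first-layer weights (rows w_i)
u = rng.standard_normal(p)              # second-layer weights

phi = lambda x, W, u: u @ (W @ x) ** 2  # quadratic two-layer network

# Linearized representation: phi(x) = x^T A x with A = sum_i u_i w_i w_i^T.
A = (W.T * u) @ W

# The eigendecomposition of A gives an equivalent network with n hidden units.
lam, V = np.linalg.eigh(A)
W_small, u_small = V.T, lam             # new filters are the eigenvectors

x = rng.standard_normal(n)
print(phi(x, W, u), x @ A @ x, phi(x, W_small, u_small))   # all three agree
```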

3.1.3 Lower to upper intrinsic dimension gap

As observed in Lemma 5, $\dim_{*,n}(\sigma) = \mathrm{rank}_S(n, 1) = 1$ for $\sigma(z) = z$ and $\dim_{*,n}(\sigma) = \mathrm{rank}_S(n, 2) = n$ for $\sigma(z) = z^2$, for all integers $n$. Therefore, Theorem 10 and Theorem 11 say that, for these two activations and $m = 1$, the square loss function admits no spurious valleys in an over-parametrized regime governed by the lower intrinsic dimension $\dim_{*,n}(\sigma)$ rather than by the upper intrinsic dimension. We conjecture that this holds for any (sufficiently regular) activation function with finite lower intrinsic dimension.

4 Infinite intrinsic dimension and presence of spurious valleys

This section is devoted to the construction of worst-case scenarios for non-over-parametrized networks. The main result (Theorem 12) essentially states that, for networks with width smaller than the lower intrinsic dimension defined above, spurious valleys can be created by choosing adversarial data distributions. We then show how this implies negative results for under-parametrized polynomial architectures and for a large variety of architectures used in practice.

Theorem 12.

Consider the square loss function for two-layer NNs (3) with continuous non-negative activation function $\sigma$. If $p < \dim_{*,n}(\sigma)$, then there exists a r.v. $(X, Y)$ such that the square loss function admits spurious valleys. In particular, for any given $\delta > 0$, the r.v. $(X, Y)$ can be chosen in such a way that there are two disjoint open subsets $\Omega_0, \Omega_1 \subseteq \Theta$ such that

$$\inf_{\theta \in \Omega_0 \cup \Omega_1} L(\theta) \;\ge\; \min_{\theta \in \Theta} L(\theta) + \delta, \qquad (5)$$

and any path $t \mapsto \theta(t)$ such that $\theta(0) \in \Omega_1$ and $\theta(1)$ is a global minimum verifies

$$\max_{t \in [0,1]} L(\theta(t)) \;\ge\; L(\theta(0)) + \delta. \qquad (6)$$

Equation (5) in Theorem 12 says that any local descent algorithm, if initialized with a parameter value belonging to a spurious valley, will at best only be able to produce a final parameter value which is at least $\delta$ far from optimality. Equation (6) implies that there exists an open subset of the spurious valleys such that any path starting from a parameter belonging to such a subset must 'up-climb' at least $\delta$ in the loss value. In the following we refer to this property, as stated in Theorem 12, by saying that the loss function has arbitrarily bad spurious valleys. Note that this result ensures that spurious valleys have positive Lebesgue measure (they contain the open subsets $\Omega_0$ and $\Omega_1$), so there is a positive probability that gradient descent methods initialized with a measure that is absolutely continuous with respect to the Lebesgue measure will get stuck in a bad local minimum.

Applying the observations describing the values of the lower intrinsic dimension for different activation functions, we get the following corollaries.

Corollary 13 (Homogeneous even degree polynomial activations).

Assume that $\sigma(z) = z^k$ with $k$ an even positive integer. For two-layer NNs (3) with one-dimensional output ($m = 1$), if the hidden-layer width satisfies

$$p < \mathrm{rank}_S(n, k),$$

then there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys.

This follows from Theorem 12 and Lemma 5, since $\dim_{*,n}(\sigma) = \mathrm{rank}_S(n, k)$. For the well known case $k = 2$ (symmetric matrices) it holds $\mathrm{rank}_S(n, 2) = n$; together with Theorem 11, this shows that for quadratic networks the positive and negative bounds match up to a factor $2$. Still, this is in line with recent works which explored quadratic architectures [41, 18].

Corollary 14 (Spurious valleys exist in generic architectures).

For two-layer NNs (3) with one-dimensional output ($m = 1$), any hidden-layer width $p$ and continuous non-negative non-polynomial activation function $\sigma$, there exists a r.v. $(X, Y)$ such that the square loss function has arbitrarily bad spurious valleys. This setting includes the following activation functions:

  • The ReLU activation function $\sigma(z) = \max\{z, 0\}$ and some relaxations of it, such as the softplus activation functions $\sigma_\beta(z) = \beta^{-1} \log(1 + e^{\beta z})$, with $\beta > 0$;

  • The sigmoid activation function $\sigma(z) = (1 + e^{-z})^{-1}$ and erf-based functions, which represent smooth approximations of the sigmoid.

This follows from Theorem 12 by observing that $\dim_{*,n}(\sigma) = \infty$ if $\sigma$ is one of the above activation functions. Corollary 14 generalizes some recent negative results [38, 46] for practical activations. We remark that while in these works the authors proved the existence of spurious local minima, we prove that, in fact, arbitrarily bad spurious valleys can exist, which is a stronger negative characterization.

The results of this section can be interpreted as worst-case scenarios for the problem of optimizing (2). We showed that, even for simple two-layer neural network architectures with non-linear activation functions used in practice (such as the ReLU), global optimality results cannot hold, unless we make some assumptions on the data distributions.

5 Typical Spurious Valleys and Low-Energy Barriers

In the previous section it was shown that whenever the number of hidden units is below the lower intrinsic dimension, one can construct worst-case data distributions that yield a landscape with arbitrarily bad spurious valleys.

In this section, we study the energy landscape under generic data distributions in the case of homogeneous activations, and show that, although spurious valleys may appear, they tend to do so below a certain energy level, controlled by the decay of the spectral decomposition of the kernel defined by the activation function and by the amount of over-parametrisation $p$. This phenomenon is consistent with the empirical success of gradient-descent algorithms in conditions where $p$ is indeed below the intrinsic dimension.

We consider oracle square loss functions of the form

$$L(\theta) = \mathbb{E}\,\big| Y - \Phi(X;\theta) \big|^2, \qquad (7)$$

for one-dimensional output two-layer NNs $\Phi(\mathbf{x};\theta) = \sum_{i=1}^p u_i\, \sigma(\mathbf{w}_i^T \mathbf{x})$, with $\theta = (\mathbf{W}, \mathbf{u})$, $\sigma$ a positively homogeneous activation function, and square integrable r.v.'s $X, Y$ ($\mathbb{E}\|X\|^2, \mathbb{E}|Y|^2 < \infty$). Notice that we can write $Y = f^*(X) + \epsilon$ for some measurable $f^*$ such that $\mathbb{E}[\epsilon \,|\, X] = 0$. In particular this implies that

$$L(\theta) = \mathbb{E}\,\big| f^*(X) - \Phi(X;\theta) \big|^2 + \mathbb{E}\,|\epsilon|^2.$$

As $p \to \infty$, the optimization problem (7) becomes convex, a fact that is exploited in several recent works [32, 36, 10]. As observed by Bach in [2], the effect of having only a finite number of hidden neurons can be recast as obtaining a quadrature rule for the reproducing kernel associated to the activation function. The following theorem is a direct application of Proposition 1 from [2], and relates the quadrature error with the ability to avoid large loss barriers with high probability.

Theorem 15.

Let $\tau$ be the uniform distribution over the unit sphere $\mathbb{S}^{n-1}$ and consider an initial parameter $\theta_0 = (\mathbf{W}_0, \mathbf{u}_0)$ with the rows of $\mathbf{W}_0$ sampled i.i.d. from $\tau$. Then the following hold:

  1. There exists a path $t \mapsto \theta(t)$ such that $\theta(0) = \theta_0$, the function $t \mapsto L(\theta(t))$ is non-increasing, and the final loss $L(\theta(1))$ is bounded by an energy threshold controlled by the kernel quadrature error of the $p$ random features, with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.

  2. If $f^*$ is sufficiently regular³, there exists a path $t \mapsto \theta(t)$ such that $\theta(0) = \theta_0$, the function $t \mapsto L(\theta(t))$ is non-increasing, and $L(\theta(1))$ exceeds the global minimum of $L$ by a quantity that is inversely proportional to $p$ up to logarithmic factors, with probability greater than or equal to $1 - \delta$, for every $\delta > 0$.

³More precisely, if the function $f^*$ can be written as $f^*(\mathbf{x}) = \int_{\mathbb{S}^{n-1}} g(\mathbf{w})\, \sigma(\mathbf{w}^T \mathbf{x})\, d\tau(\mathbf{w})$ for some $g \in L^2(\tau)$.

The above result implies that convex optimization over the second layer is sufficient to reach a model whose error relative to the best possible one is inversely proportional to the hidden-layer size (up to logarithmic factors). Nevertheless, in practice, this approach will generally perform worse than standard gradient-descent training, which may require less over-parametrization to give satisfying results (see for example the numerical experiments in [10], Section 4). This shows the importance of understanding gradient descent dynamics on the first layer, which are amenable to analysis in the infinite-width limit using mean-field techniques [36, 32] as well as optimal transport in Wasserstein metrics [10]. Precisely quantifying how much is gained by optimizing jointly over both layers in the non-asymptotic case remains an important open question [10] left for future work.
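The following sketch (ours; the teacher network, widths and regularization constant are arbitrary) mimics the mechanism behind Theorem 15: first-layer weights are drawn uniformly on the sphere and only the convex second-layer problem is solved, and the resulting test error typically decreases as the width $p$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
relu = lambda z: np.maximum(z, 0.0)

def sphere(k):
    """k i.i.d. samples from the uniform distribution tau on the unit sphere."""
    w = rng.standard_normal((k, n))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

# A fixed wide "teacher" network plays the role of the target function f*.
W_star, u_star = sphere(200), rng.standard_normal(200) / 200
f_star = lambda X: relu(X @ W_star.T) @ u_star

X_tr, X_te = rng.standard_normal((2000, n)), rng.standard_normal((2000, n))
y_tr, y_te = f_star(X_tr), f_star(X_te)

for p in [10, 50, 200, 1000]:
    W = sphere(p)                                   # random first layer, kept fixed
    H_tr, H_te = relu(X_tr @ W.T), relu(X_te @ W.T)
    # Ridge-regularized convex fit of the second layer only.
    u = np.linalg.solve(H_tr.T @ H_tr + 1e-6 * np.eye(p), H_tr.T @ y_tr)
    print(p, np.mean((H_te @ u - y_te) ** 2))       # error shrinks as p grows
```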

6 Future directions

We considered the problem of characterizing the loss surface of neural networks from the perspective of optimization, with the goal of deriving weak certificates that enable - or prevent - the existence of descent paths towards global minima.

The topological properties studied in this paper, however, do not yet capture fundamental aspects that are necessary to explain the empirical success of deep learning methods. We identify a number of different directions that deserve further attention.

The positive results presented above rely on being able to reduce the network to the case when (convex) optimization over the second layer is sufficient to reach optimal weight values. A better understanding of the first layer dynamics still needs to be carried out. Moreover, in such positive results we only proved the non-existence of (high) energy barriers. While this is an interesting property from the optimization point of view, it is also not sufficient to guarantee convergence of local descent algorithms. Another informative property of the loss function that should be addressed in future works is the existence of local descent directions at non-optimal points: for every non-optimal $\theta$ and any neighborhood $U$ of $\theta$, there exists $\theta' \in U$ such that $L(\theta') < L(\theta)$. More generally, our present work is not informative on the performance of gradient descent in regimes with no spurious valleys.

The other very important point to be addressed in the future is how to extend the above results to architectures of more practical interest. Depth and the specific linear structure of Convolutional Neural Networks, critical to explain the excellent empirical performance of deep learning in computer vision, text or speech, need to be exploited, as well as specific design choices such as residual connections and normalization strategies – as done recently in [40] and [39], respectively. This also requires making specific assumptions on the data distribution, and is left for future work.

Acknowledgements

We would like to thank Gérard Ben Arous and Léon Bottou for fruitful discussions, and Jean Ponce for valuable comments and corrections of the original version of this manuscript. The first author would also like to thank Jumageldi Charyyev for fruitful discussions on the proofs of several propositions and Andrea Ottolini for valuable comments on a previous version of this manuscript.

References

Appendix A Proofs of Section 3

A.1 Proof of Theorem 7

We note that, under the assumptions of Theorem 7, the same optimal NN functions could also be obtained using a generalized linear model, where the representation function has the linear form $\Phi(\mathbf{x};\mathbf{A}) = \mathbf{A}\,\mathbf{v}(\mathbf{x})$, for some parameter-independent feature map $\mathbf{v}$. The main difference between the two models is that the former requires the choice of a non-linear activation function $\sigma$, while the latter implies the choice of a kernel function. This is the content of the following lemma.

Lemma 16.

Let $\sigma$ be a continuous function and $X$ a r.v. Assume that the linear space

$$\mathcal{V}_{\sigma, X} = \mathrm{span}\{ \sigma_{\mathbf{w}} : \mathbf{w} \in \mathbb{R}^n \}$$

is finite dimensional, with $\dim \mathcal{V}_{\sigma, X} = q$. Then there exists a scalar product $\langle \cdot, \cdot \rangle$ on $\mathcal{V}_{\sigma, X}$ and a map $\psi : \mathbb{R}^n \to \mathcal{V}_{\sigma, X}$ such that

$$\sigma_{\mathbf{w}} = \psi(\mathbf{w}) \qquad (8)$$

for all $\mathbf{w} \in \mathbb{R}^n$. Moreover, the function $\psi$ is continuous.

Proof.

For sake of simplicity, in the following we write $\mathcal{V}$ for $\mathcal{V}_{\sigma, X}$ and $q$ for its dimension. Let $\{v_1, \dots, v_q\}$ be a basis of $\mathcal{V}$. If $f = \sum_i a_i v_i$ and $g = \sum_i b_i v_i$, then we can define a scalar product on $\mathcal{V}$ as

$$\langle f, g \rangle = \sum_{i=1}^q a_i b_i.$$

If we define the map $\psi : \mathbb{R}^n \to \mathcal{V}$ as

$$\psi(\mathbf{w}) = \sigma_{\mathbf{w}},$$

then property (8) follows directly by the definition of the function $\psi$. Moreover, we can choose points $\mathbf{x}_1, \dots, \mathbf{x}_q$ such that the evaluation vectors $(v_i(\mathbf{x}_1), \dots, v_i(\mathbf{x}_q))$, $i \in [q]$, form a basis of $\mathbb{R}^q$. Now we need to show that, for this choice, the map $\mathbf{w} \mapsto \psi(\mathbf{w})$ is continuous. Let $\mathbf{A}$ be the matrix $\mathbf{A}_{ji} = v_i(\mathbf{x}_j)$ and $\mathbf{b}(\mathbf{w})$ be the vector $\mathbf{b}(\mathbf{w})_j = \sigma(\mathbf{w}^T \mathbf{x}_j)$. Then the coordinate vector of $\psi(\mathbf{w})$ in the basis $\{v_i\}$ is $\mathbf{A}^{-1} \mathbf{b}(\mathbf{w})$, which is continuous in $\mathbf{w}$. This shows that the map $\psi$ is continuous. ∎
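As an illustration of this construction (our sketch; the quadratic activation and the choice $n = 3$, $q = 6$ are arbitrary), the coordinates of any filter $\sigma_{\mathbf{w}}$ in a basis of $q$ filter functions can be obtained by solving the $q \times q$ linear system of evaluations, exactly as in the continuity argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 3, 6                             # for sigma(z) = z^2, dim V = binom(n+1, 2) = 6
sigma = lambda z: z ** 2

W_basis = rng.standard_normal((q, n))   # q filters; generically a basis of V
X_pts = rng.standard_normal((q, n))     # q evaluation points x_1, ..., x_q

# A[j, i] = value of the i-th basis filter at the j-th point.
A = sigma(X_pts @ W_basis.T)

def coords(w):
    """Coordinates of sigma_w in the filter basis, via A^{-1} b(w)."""
    b = sigma(X_pts @ w)
    return np.linalg.solve(A, b)

# The expansion reproduces sigma_w at fresh test points.
w = rng.standard_normal(n)
c = coords(w)
X_test = rng.standard_normal((5, n))
print(np.allclose(sigma(X_test @ w), sigma(X_test @ W_basis.T) @ c))   # True
```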

The non-trivial fact captured by Theorem 7 is the following: when the capacity of the network is large enough to match a generalized linear model, but still finite, the problem of optimizing the loss function (2), which is in general a highly non-convex object, satisfies an interesting optimization property in view of the local descent algorithms which are used in practice to solve it.

Proof of Theorem 7.

Thanks to Lemma 16, there exist two continuous maps $\psi : \mathbb{R}^n \to \mathbb{R}^q$ and $\mathbf{v} : \mathbb{R}^n \to \mathbb{R}^q$, with $q = \dim^*(\sigma, X)$, such that $\sigma(\mathbf{w}^T \mathbf{x}) = \psi(\mathbf{w})^T \mathbf{v}(\mathbf{x})$ for every $\mathbf{w}, \mathbf{x} \in \mathbb{R}^n$. Therefore, every two-layer NN (3) can be written as $\Phi(\mathbf{x};\theta) = \mathbf{U}\,\Psi(\mathbf{W})\,\mathbf{v}(\mathbf{x})$, where, if $\mathbf{W}$ has rows $\mathbf{w}_1, \dots, \mathbf{w}_p$, then $\Psi(\mathbf{W})$ is the $p \times q$ matrix with rows $\psi(\mathbf{w}_1), \dots, \psi(\mathbf{w}_p)$ (that is, $\psi$ is applied row-wise).

The proof of the Theorem consists in exploiting the above linearized representation of $\Phi$ to show that property P.1 holds (recall that this is equivalent to saying that the loss function has no spurious valleys). Given an initial parameter $\theta(0) = (\mathbf{W}(0), \mathbf{U}(0))$, we want to construct a continuous path $t \mapsto \theta(t) = (\mathbf{W}(t), \mathbf{U}(t))$ such that the function $t \mapsto L(\theta(t))$ is non-increasing and such that $\theta(1) = \theta^*$, where $\theta^* \in \arg\min_\theta L(\theta)$. The construction of such a path can be articulated in two main steps:

Step 1..

The first part of the path consists in showing that we can assume w.l.o.g. that $\mathrm{rank}\,\Psi(\mathbf{W}(0)) = q$. Let $\mathbf{w}_1, \dots, \mathbf{w}_p$ be the rows of $\mathbf{W}(0)$; suppose that $r = \mathrm{rank}\,\Psi(\mathbf{W}(0)) < q$ (otherwise there is nothing to show) and, up to re-indexing the hidden units, that $\psi(\mathbf{w}_1), \dots, \psi(\mathbf{w}_r)$ are linearly independent. Denote $\mathbf{u}_1, \dots, \mathbf{u}_p$ the columns of $\mathbf{U}(0)$. For $i \in [r+1:p]$, we can write

$$\psi(\mathbf{w}_i) = \sum_{j=1}^r a_{ij}\, \psi(\mathbf{w}_j). \qquad (9)$$

If we define $\tilde{\mathbf{U}}$ such that (denoting $\tilde{\mathbf{u}}_i$ the $i$-th column of $\tilde{\mathbf{U}}$)

$$\tilde{\mathbf{u}}_j = \mathbf{u}_j + \sum_{i=r+1}^p a_{ij}\, \mathbf{u}_i \ \text{ for } j \in [r], \qquad \tilde{\mathbf{u}}_i = \mathbf{0} \ \text{ for } i \in [r+1:p],$$

then $\tilde{\mathbf{U}}\,\Psi(\mathbf{W}(0)) = \mathbf{U}(0)\,\Psi(\mathbf{W}(0))$. The path $t \mapsto \big(\mathbf{W}(0),\, (1-t)\,\mathbf{U}(0) + t\,\tilde{\mathbf{U}}\big)$ leaves the network unchanged, i.e. $\Phi(\cdot\,;\theta(t)) = \Phi(\cdot\,;\theta(0))$ for $t \in [0,1]$. At this point, we can select vectors $\tilde{\mathbf{w}}_i$, $i \in [r+1:p]$, such that the matrix $\tilde{\mathbf{W}}$ with rows $\mathbf{w}_j$ for $j \in [r]$ and $\tilde{\mathbf{w}}_i$ for $i \in [r+1:p]$ verifies $\mathrm{rank}\,\Psi(\tilde{\mathbf{W}}) = q$. Notice that the existence of such vectors $\tilde{\mathbf{w}}_i$, $i \in [r+1:p]$, is guaranteed by the definition of $q = \dim^*(\sigma, X)$, since $p \ge q$. The path $t \mapsto \big((1-t)\,\mathbf{W}(0) + t\,\tilde{\mathbf{W}},\, \tilde{\mathbf{U}}\big)$ leaves the network unchanged, i.e. $\Phi(\cdot\,;\theta(t)) = \Phi(\cdot\,;\theta(0))$ for $t \in [0,1]$, since the corresponding columns of $\tilde{\mathbf{U}}$ are zero. The new parameter value $(\tilde{\mathbf{W}}, \tilde{\mathbf{U}})$ satisfies $\mathrm{rank}\,\Psi(\tilde{\mathbf{W}}) = q$. ∎

Step 2..

By Step 1, we can assume that $\mathrm{rank}\,\Psi(\mathbf{W}) = q$. Since the network has the form $\Phi(\mathbf{x};\theta) = \mathbf{U}\,\Psi(\mathbf{W})\,\mathbf{v}(\mathbf{x})$ and since the function $\mathbf{A} \mapsto \mathbb{E}\,\ell(\mathbf{A}\,\mathbf{v}(X), Y)$ is convex, there exists $\mathbf{A}^* \in \mathbb{R}^{m \times q}$ such that