A Critical View of Global Optimality in Deep Learning

02/10/2018, by Chulhee Yun et al., MIT

We investigate the loss surface of deep linear and nonlinear neural networks. We show that for deep linear networks with differentiable losses, critical points after the multilinear parameterization inherit the structure of critical points of the underlying loss with linear parameterization. As corollaries we obtain "local minima are global" results that subsume most previous results, while showing how to distinguish global minima from saddle points. For nonlinear neural networks, we prove two theorems showing that even for networks with one hidden layer, there can be spurious local minima. Indeed, for piecewise linear nonnegative homogeneous activations (e.g., ReLU), we prove that for almost all practical datasets there exist infinitely many local minima that are not global. We conclude by constructing a counterexample involving other activation functions (e.g., sigmoid, tanh, arctan, etc.), for which there exists a local minimum strictly inferior to the global minimum.


1 Introduction

Neural network training reduces to solving nonconvex empirical risk minimization problems, a task that is in general intractable. But success stories of deep learning suggest that local minima of the empirical risk could be close to global minima.

Choromanska et al. [4] use spherical spin-glass models from statistical physics to argue that the large size of neural networks may cause local minima to be close to global minima. However, due to the complexities introduced by nonlinearity, a rigorous understanding of optimality in deep neural networks remains elusive.

Initial steps towards understanding optimality have focused on deep linear networks, an area that has seen substantial recent progress. In deep linear networks there is no nonlinear activation; the output is a linear function of the input and a multilinear function of the weight matrices. Baldi and Hornik [1] prove that certain shallow linear networks have no spurious local minima, and Kawaguchi [10] extends this result to deep linear networks with squared error loss, showing that they have only global minima and saddle points. Several other works on linear networks have also appeared [15, 8, 27, 28, 14, 13].

The theory of nonlinear neural networks (which is the actual setting of interest), however, is still in its infancy. There have been attempts to extend the “local minima are global” property from linear to nonlinear networks, but recent results suggest that this property does not usually hold [28]. Although not unexpected, rigorously proving such results turns out to be non-trivial, forcing several authors (e.g., [18, 6, 24]) to make somewhat unrealistic assumptions (realizability and Gaussianity) on data.

In contrast, we prove existence of spurious local minima under the least restrictive (to our knowledge) assumptions. Since seemingly subtle changes to assumptions can greatly influence the analysis as well as the applicability of known results, let us first summarize what is known; this will also help provide a better intuitive perspective on our results (as the technical details are somewhat involved).

1.1 What is known so far?

There is a large and rapidly expanding literature on the optimization of neural networks. Some works focus on the loss surface [1, 26, 10, 21, 20, 25, 16, 17, 18, 14, 27, 28, 24, 19], while others study the convergence of gradient-based methods for optimizing this loss [22, 3, 6]. Our focus is on the loss surface itself, independent of any algorithmic concerns; this is reflected in the works summarized below.

For ReLU networks, the works [21, 28] provide counterexample datasets that lead to spurious local minima, dashing hopes of a “local implies global” property. However, these works do not provide statements about generic datasets, and one can argue that their setups are limited to isolated pathological examples. In comparison, our Theorem 2.1 shows existence of spurious local minima for almost all datasets, a much more general result. Zhou and Liang [28] also give a characterization of the critical points of shallow ReLU networks, but with more than one hidden node the characterization is limited to certain regions of the parameter space.

There are also results that study the population risk of shallow ReLU networks under the restrictive assumption that the input data is i.i.d. Gaussian [18, 24, 6]. Moreover, these works also assume realizability, i.e., that the output data is generated from a neural network with the same architecture as the model one trains, with unknown true parameters. These assumptions make it possible to compute the population risk in closed form, and ensure that zero loss is attainable at global minima. The authors of [18, 24] study the population risk function of the form , where the true parameters ’s are orthogonal unit vectors. Through extensive experiments and computer-assisted local minimality checks, Safran and Shamir [18] show existence of spurious local minima for ; however, this result is empirical and does not come with a constructive proof. Wu et al. [24] show that with , there are no spurious local minima on the manifold . Du et al. [6] study the population risk of a one-hidden-layer CNN; they show that there can be a spurious local minimum, but that gradient descent converges to the global minimum with probability at least 1/4.

Our paper focuses on empirical risk instead of population risk, and assumes neither Gaussianity nor realizability. Our assumption on the dataset is that it is not linearly fittable (that is, no linear model maps the input data matrix exactly to the output matrix), which is vastly more general and realistic than assuming that the input data is Gaussian or that the output is generated from an unknown neural network. Our results also show that [24] fails to extend to empirical risk and non-unit parameter vectors (see the discussion after Theorem 3.1).

Laurent and von Brecht [14] study one-hidden-layer networks with hinge loss for classification. Under linear separability, they prove that Leaky-ReLU networks have no bad local minima, while ReLU networks do. Our focus is on regression, and we only make mild assumptions on the data.

For deep linear networks, the most relevant result to ours is [13]. When all hidden layers are wider than the input or output layers, Laurent and von Brecht [13] prove that any local minimum of a deep linear network under a differentiable convex loss is global. (Although their result overlaps with a subset of Theorem 4.1, our theorem was obtained independently.) They prove this by relating the linear and multilinear parametrizations. Our result in Theorem 4.1 is strictly more general than theirs, and presents a comprehensive characterization.

A different body of literature [26, 20, 25, 16, 17] considers sufficient conditions for global optimality in nonlinear networks. These results make certain architectural assumptions (and some technical restrictions) that may not usually apply to realistic networks. There are also other works on global optimality conditions for specially designed architectures [9, 7].

1.2 Contributions and Summary of Results

We summarize our key contributions more precisely below. Our work encompasses results for both nonlinear and linear neural networks. First, we study whether the “local minima are global” property holds for nonlinear networks. Unfortunately, our results here are negative. Specifically, we prove

  • For piecewise linear and nonnegative homogeneous activation functions (e.g., ReLU), we prove in Theorem 2.1 that if linear models cannot perfectly fit the data, one can construct infinitely many local minima that are not global. In practice, most datasets are not linearly fittable, hence this result gives a constructive proof of spurious local minima for generic datasets. In contrast, several existing results either provide only one counterexample [21, 28], or make restrictive assumptions of realizability [18, 6] or linear separability [14]. This result is presented in Section 2.

  • In Theorem 3.1 we tackle more general nonlinear activation functions, and provide a simple architecture (with squared loss) and a realizable dataset for which there exists a local minimum strictly inferior to the global minimum. Our analysis applies to a wide range of activations, including sigmoid, tanh, arctan, ELU [5], SELU [11], and ReLU. Considering that realizability of the data simplifies the analysis and ensures zero loss at global optima, a counterexample that is realizable and yet has a spurious local minimum is surprising, suggesting that the situation is likely worse for non-realizable data. See Section 3 for details.

We complement our negative results by presenting the following positive result on linear networks:

  • Assume that the hidden layers are as wide as either the input or the output, and that the empirical risk equals , where is a differentiable loss function and is the weight matrix for layer . Theorem 4.1 shows that if is a critical point of , then its type of stationarity (local min/max, or saddle) is closely related to the behavior of evaluated at the product . If we additionally assume that any critical point of is a global minimum, Corollary 4.1 shows that the empirical risk only has global minima and saddles, and provides a simple condition to distinguish between them. To the best of our knowledge, this is the most general result on deep linear networks and it subsumes several previous results, e.g., [10, 27, 28, 13]. This result is in Section 4.

Notation.

For an integer , denotes the set of integers from to (inclusive). For a vector , we use to denote its -th component, while denotes a vector comprised of the first components of . Let () be the all ones (zeros) column vector or matrix with size .

2 “ReLU-like” networks: bad local minima exist for most data

We study below whether nonlinear neural networks provably have spurious local minima. We show in §2 and §3 that even for extremely simple nonlinear networks, one encounters spurious local minima. We first consider ReLU and ReLU-like networks. Here, we prove that as long as linear models cannot perfectly fit the data, there exists a local minimum strictly inferior to the global one. Using nonnegative homogeneity, we can scale the parameters to get infinitely many local minima.

Consider a training dataset that consists of data points. The inputs and the outputs are of dimension and , respectively. We aggregate these items, and write as the data matrix and as the label matrix. Consider the 1-hidden-layer neural network , where is a nonlinear activation function, , , , and . We analyze the empirical risk with squared loss

Next, define a class of piecewise linear nonnegative homogeneous functions

    h_{s,t}(x) = s x  if x >= 0,    h_{s,t}(x) = t x  if x < 0,    (1)

where s > 0 and s ≠ t. Note that ReLU (s = 1, t = 0) and Leaky-ReLU (s = 1, small t > 0) are members of this class.
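To make this setup concrete, the following minimal NumPy sketch implements the one-hidden-layer network and its squared-error empirical risk with an activation from the class (1). The variable names (X, Y, W1, b1, W2, b2) and the 0.5 * sum-of-squares convention are our own shorthand for the quantities defined above, whose exact symbols are not reproduced here; it is an illustration, not the authors' code.

    import numpy as np

    def h_st(z, s=1.0, t=0.1):
        # Activation from (1): slope s for nonnegative inputs, slope t for negative ones.
        # s=1, t=0 recovers ReLU; s=1 with a small t>0 recovers Leaky-ReLU.
        return np.where(z >= 0, s * z, t * z)

    def risk(W1, b1, W2, b2, X, Y, s=1.0, t=0.1):
        # Squared-error empirical risk of the one-hidden-layer network.
        # X: (d_x, m) data matrix, Y: (d_y, m) label matrix (one column per example).
        H = h_st(W1 @ X + b1[:, None], s, t)     # hidden-layer outputs
        Y_hat = W2 @ H + b2[:, None]             # network outputs
        return 0.5 * np.sum((Y_hat - Y) ** 2)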

2.1 Main results and discussion

We use the shorthand . The main result of this section, Theorem 2.1, considers the case where linear models cannot fit , i.e., for any matrix . With a ReLU-like activation (1) and a few mild assumptions, Theorem 2.1 shows that there exist spurious local minima.

Suppose that the following conditions hold:

  1. Output dimension is , and linear models cannot perfectly fit .

  2. All the data points ’s are distinct.

  3. The activation function is .

  4. The hidden layer has at least width 2: .

Then, there is a spurious local minimum whose risk equals that of the linear least squares model. Moreover, due to the nonnegative homogeneity of , there are infinitely many such local minima.

Since most real-world datasets cannot be perfectly fit by linear models, Theorem 2.1 shows that when we use the activation , the empirical risk has bad local minima for almost all datasets that one may encounter in practice. Although it is not very surprising that neural networks have spurious local minima, proving this rigorously is non-trivial. We provide a constructive and deterministic proof that holds for very general datasets, in contrast to the experimental results of [18]. We emphasize that Theorem 2.1 holds even for the “slightest” nonlinearities, e.g., when and where is small. This suggests that the “local minima are global” property is limited to the trivial setting of linear neural networks.

Existing results on the squared error loss either provide a single counterexample [21, 28], or assume realizability and Gaussian input [18, 6]. Realizability means that the output is generated by a network with unknown parameters. In real datasets, the input is not Gaussian and the output is not generated by a neural network; in contrast, our result holds for most realistic situations, and hence delivers useful insight.

There are several results proving sufficient conditions for global optimality of nonlinear neural networks [20, 25, 16]. But they rely on assumptions that the network width scales with the number of data points. For instance, applying Theorem 3.4 of [16] to our network proves that if has linearly independent columns and other assumptions hold, then any critical point with is a global minimum. However, linearly independent columns already imply , so even linear models can fit any ; i.e., there is less merit in using a complex model to fit . Theorem 2.1 does not make any structural assumption other than , and addresses the case where it is impossible to fit with linear models, which is much more realistic.
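The key assumption of Theorem 2.1, that no linear model fits the data perfectly, is easy to test numerically. The sketch below, on a hypothetical one-dimensional toy dataset of our own choosing, checks whether any affine map can reproduce the labels exactly by solving a least squares problem and inspecting the residual.

    import numpy as np

    # Hypothetical toy data: y = |x| cannot be written as a*x + b exactly.
    X = np.array([[-1.0, 0.0, 1.0]])    # (d_x, m) data matrix
    Y = np.array([[ 1.0, 0.0, 1.0]])    # (d_y, m) label matrix

    # Best affine fit Y ~ A X + b via least squares on inputs augmented with a row of ones.
    X_aug = np.vstack([X, np.ones((1, X.shape[1]))])
    coeffs, *_ = np.linalg.lstsq(X_aug.T, Y.T, rcond=None)
    residual = np.sum((coeffs.T @ X_aug - Y) ** 2)

    # A strictly positive residual means no linear (affine) model fits the data perfectly,
    # which is the situation Theorem 2.1 addresses.
    print("least squares residual:", residual)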

It is worth comparing our result with [14], who use hinge loss based classification and assume linear separability to prove “no spurious local minima” for Leaky-ReLU networks. Their result does not contradict our theorem because the losses are different and we do not assume linear separability.

One might wonder if our theorem holds even with . Venturi et al. [23] showed that one-hidden-layer neural networks with do not have spurious valleys; however, their result shows nonexistence of strict spurious local minima, whereas due to we only have non-strict local minima. Based on [2], one might claim that with a wide enough hidden layer and random and , one can fit any ; however, this is ruled out by our assumption that linear models cannot fit . Note that there is a non-trivial region in the parameter space where (entry-wise). In this region, the output of the network is still a linear combination of the rows of , so it cannot fit ; in fact, it can only do as well as linear models.

2.2 Analysis of Theorem 2.1

The proof of the theorem is split into two steps. First, we prove that there exist local minima whose risk value is the same as that of the linear least squares solution, and that there are infinitely many such minima. Second, we construct a tuple of parameters that has strictly smaller empirical risk than .

Step 1: A local minimum as good as the linear solution. The main idea here is to exploit the weights from the linear least squares solution, and to tune the parameters so that all inputs to the hidden nodes become positive. Doing so makes the hidden nodes “locally linear,” so that the constructed parameters that produce the linear least squares estimates at the output become locally optimal.

Recall that , and define a linear least squares loss that is minimized at , so that . Since , the solution is a row vector. For all , let be the output of the linear least squares model, and similarly .

Let , a negative constant making for all . Define parameters

where is an arbitrary fixed positive constant, gives the first components of , and the last component. Since , for any , (component-wise), given our choice of . Thus, all hidden node inputs are positive. Moreover, , so that the loss .
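Since the exact parameter formulas are not reproduced in this extraction, the NumPy sketch below illustrates the idea with a hypothetical variant of the construction on toy data: copy the linear least squares predictor into both hidden units, shift the biases so every hidden-unit input is strictly positive, and rescale the output layer by 1/s so the network reproduces the least squares predictions exactly.

    import numpy as np

    s, t = 1.0, 0.1                                   # slopes of the activation in (1)
    X = np.array([[-1.0, 0.0, 1.0]])                  # hypothetical (d_x, m) data matrix
    Y = np.array([[ 1.0, 0.0, 1.0]])                  # hypothetical (1, m) label matrix

    def h_st(z):
        return np.where(z >= 0, s * z, t * z)

    def risk(W1, b1, W2, b2):
        return 0.5 * np.sum((W2 @ h_st(W1 @ X + b1[:, None]) + b2[:, None] - Y) ** 2)

    # Linear least squares fit y ~ w.x + c.
    X_aug = np.vstack([X, np.ones((1, X.shape[1]))])
    sol, *_ = np.linalg.lstsq(X_aug.T, Y.T, rcond=None)
    w, c = sol[:-1, 0], sol[-1, 0]
    lls_risk = 0.5 * np.sum((w @ X + c - Y) ** 2)

    # Shift so every hidden-unit input is strictly positive ("locally linear" regime).
    beta = 1.0 - np.min(w @ X + c)
    W1 = np.vstack([w, w])                            # both hidden units compute w.x
    b1 = np.array([c + beta, c + beta])
    W2 = np.array([[1.0 / (2 * s), 1.0 / (2 * s)]])   # undo the slope s and average the two units
    b2 = np.array([-beta])                            # undo the shift

    assert np.all(W1 @ X + b1[:, None] > 0)           # all hidden node inputs are positive
    print(risk(W1, b1, W2, b2), lls_risk)             # the two risks coincide

On this toy instance the constructed point attains exactly the least squares risk, matching the claim above.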

So far, we have checked that has the same empirical risk as a linear least squares solution. It now remains to show that this point is indeed a local minimum of . To that end, we consider the perturbed parameters , and check that their risk is never smaller. A useful point is that since is a minimum of , we have

(2)

so and . For small enough perturbations, still holds for all . So, we can observe that

(3)

where and are and ; they are aggregated perturbation terms. We used (2) to obtain the last equality of (3). Thus, the risk does not decrease under small perturbations, proving that is indeed a local minimum of . Since this is true for arbitrary , there are infinitely many such local minima. We can also construct similar local minima by permuting hidden nodes, etc.
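As a numerical counterpart to the perturbation argument above, one can sample small random perturbations around the constructed point and confirm that the risk never drops below the least squares level. The sketch below does this on the same hypothetical toy instance as before; it is a sanity check, not a substitute for the proof.

    import numpy as np

    rng = np.random.default_rng(0)
    s, t = 1.0, 0.1
    X = np.array([[-1.0, 0.0, 1.0]])                  # hypothetical data, as in the previous sketch
    Y = np.array([[ 1.0, 0.0, 1.0]])

    def risk(W1, b1, W2, b2):
        Z = W1 @ X + b1[:, None]
        return 0.5 * np.sum((W2 @ (np.where(Z >= 0, s, t) * Z) + b2[:, None] - Y) ** 2)

    # Candidate local minimum built as in the previous sketch (least squares fit is w = 0, c = 2/3).
    W1 = np.zeros((2, 1)); b1 = np.array([1.0, 1.0])
    W2 = np.full((1, 2), 1.0 / (2 * s)); b2 = np.array([-1.0 / 3.0])
    base = risk(W1, b1, W2, b2)

    # At a local minimum, small perturbations should never decrease the risk.
    worst = min(
        risk(W1 + 1e-3 * rng.standard_normal(W1.shape),
             b1 + 1e-3 * rng.standard_normal(b1.shape),
             W2 + 1e-3 * rng.standard_normal(W2.shape),
             b2 + 1e-3 * rng.standard_normal(b2.shape))
        for _ in range(10000))
    print(base, worst, worst >= base - 1e-12)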

Step 2: A point strictly better than the local minimum.

The proof of this step is more involved. In the previous step, we “pushed” all the inputs to the hidden nodes to the positive side, and took advantage of the “local linearity” of the hidden nodes near . But to construct parameters that have strictly smaller risk than (and thus prove that is a spurious local minimum), we make the signs of the inputs to the hidden nodes differ across data points.

To this end, we sort the indices of data points in increasing order of ; i.e., . Define the set . The remaining construction is divided into two cases: and , whose main ideas are essentially the same. We present the proof for , and defer the other case to Appendix A2 as it is rarer and its proof, while instructive for its perturbation argument, is technically more involved.

Case 1: . Pick any . We can observe that , because of (2). Define , so that for all and for all . Then, let be a constant satisfying , whose value will be specified later. Since is small enough, . Now select parameters

Recall again that . For , and , so

Similarly, for , and results in . Here, we push the outputs of the network by from , and the direction of the “push” varies depending on whether or .

The empirical risk for this choice of parameters is

Since and , we can choose , and choose small so that , proving that is a spurious local minimum.
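To make the conclusion tangible, here is a hypothetical toy instance of our own (not the construction from the proof) in which a two-hidden-unit ReLU network fits the data exactly, so its risk is strictly below the least squares level attained by the Step 1 local minimum.

    import numpy as np

    X = np.array([[-1.0, 0.0, 1.0]])                  # y = |x| is not linearly fittable
    Y = np.array([[ 1.0, 0.0, 1.0]])

    def relu_risk(W1, b1, W2, b2):
        H = np.maximum(W1 @ X + b1[:, None], 0.0)     # ReLU, i.e., (1) with s = 1, t = 0
        return 0.5 * np.sum((W2 @ H + b2[:, None] - Y) ** 2)

    # Risk of the best linear model (the level attained by the Step 1 local minimum).
    X_aug = np.vstack([X, np.ones((1, 3))])
    sol, *_ = np.linalg.lstsq(X_aug.T, Y.T, rcond=None)
    lls_risk = 0.5 * np.sum((sol.T @ X_aug - Y) ** 2)

    # A strictly better point: relu(x) + relu(-x) = |x| fits Y exactly.
    W1 = np.array([[1.0], [-1.0]]); b1 = np.zeros(2)
    W2 = np.array([[1.0, 1.0]]);    b2 = np.zeros(1)
    print(relu_risk(W1, b1, W2, b2), "<", lls_risk)   # 0.0 < 1/3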

3 Counterexample: bad local minima for many activations

The proof of Theorem 2.1 crucially exploits the piecewise linearity of the activation functions. Thus, one may wonder whether the spurious local minima seen there are an artifact of that specific nonlinearity. We show below that this is not the case: we provide a counterexample network and dataset for which a wide range of nonlinear activations yield a local minimum strictly inferior to the global minimum, which attains exactly zero empirical risk. Such activations include popular choices such as sigmoid, tanh, arctan, ELU, SELU, and ReLU.

We consider again the squared error empirical risk of a one-hidden-layer nonlinear neural network:

where we fix and . Also, let be the -th derivative of , whenever it exists at . For short, let and denote the first and second derivatives.
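As in Section 2, this risk is straightforward to write down with the activation as a plug-in argument; the following minimal NumPy sketch uses our own variable names, since the exact symbols are not reproduced here.

    import numpy as np

    def risk(W1, b1, W2, b2, X, Y, act=np.tanh):
        # Squared-error empirical risk of a one-hidden-layer network with a
        # smooth activation `act` (e.g., np.tanh, np.arctan, or a sigmoid).
        Y_hat = W2 @ act(W1 @ X + b1[:, None]) + b2[:, None]
        return 0.5 * np.sum((Y_hat - Y) ** 2)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))      # another activation covered by Theorem 3.1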

3.1 Main results and discussion

Let the loss and network be as defined above. Consider the dataset

For this network and dataset the following results hold:

  1. If there exist real numbers such that

    1. , and

    2. ,

    then there is a tuple at which equals .

  2. If there exist real numbers such that the following conditions hold:

    1. ,

    2. is infinitely differentiable at and ,

    3. there exists a constant such that and .

    4. ,

    5. ,

    then there exists a tuple such that the output of the network is the same as the linear least squares model, the risk , and is a local minimum of .

Theorem 3.1 shows that for this architecture and dataset, activations that satisfy (C3.1.1)–(C3.1.7) introduce at least one spurious local minimum. Notice that the empirical risk is zero at the global minimum. This means that the data and can actually be “generated” by the network, which satisfies the realizability assumption that others use [18, 6, 24]. Notice that our counterexample is “easy to fit,” and yet, there exists a local minimum that is not global. This leads us to conjecture that with harder datasets, the problems with spurious local minima could be worse. The proof of Theorem 3.1 can be found in Appendix A3.

Discussion. Note that conditions (C3.1.1)–(C3.1.7) only require the existence of certain real numbers rather than global properties of the activation , hence they are not as restrictive as they may look. Conditions (C3.1.1)–(C3.1.2) come from a choice of tuple that perfectly fits the data. Condition (C3.1.3) is necessary for constructing with the same output as the linear least squares model, and conditions (C3.1.4)–(C3.1.7) are needed for showing local minimality of via Taylor expansions. The class of functions satisfying conditions (C3.1.1)–(C3.1.7) is quite large, and includes the nonlinear activation functions used in practice. The next corollary highlights this observation (for a proof with explicit choices of the involved real numbers, please see Appendix A5): for the counterexample in Theorem 3.1, the set of activation functions satisfying conditions (C3.1.1)–(C3.1.7) includes sigmoid, tanh, arctan, ELU, and SELU.

Admittedly, Theorem 3.1 and Corollary 3.1 give one counterexample instead of stating a claim about generic datasets. Nevertheless, this example shows that for many practical nonlinear activations, the desirable “local minimum is global” property cannot hold even for realizable datasets, suggesting that the situation could be worse for non-realizable ones.

Remark: “ReLU-like” activation functions. Recall the piecewise linear nonnegative homogeneous activation functions of (1). They do not satisfy condition (C3.1.7), so Theorem 3.1 cannot be applied directly. Moreover, if (i.e., ReLU), conditions (C3.1.1)–(C3.1.2) are violated as well. However, the statements of Theorem 3.1 still hold even for , as shown in Appendix A6. Recalling again and , this means that even with the “slightest” nonlinearity in the activation function, the network has a global minimum with risk zero while there exists a bad local minimum that performs only as well as a linear least squares model. In other words, the “local minima are global” property is rather brittle and can only hold for linear neural networks. Another point to note is that in Appendix A6, the bias parameters are all zero, for both and . For models without bias parameters, is still a spurious local minimum, thus showing that [24] fails to extend to empirical risks and non-unit weight vectors.

4 Global optimality in linear networks

In this section we present our results on deep linear neural networks. Assuming that the hidden layers are at least as wide as either the input or output, we show that critical points of the loss with a multilinear parameterization inherit the type of critical points of the loss with a linear parameterization. As a corollary, we show that for differentiable losses whose critical points are globally optimal, deep linear networks have only global minima or saddle points. Furthermore, we provide an efficiently checkable condition for global minimality.

Suppose the network has hidden layers having widths . To ease notation, we set and . The weights between adjacent layers are kept in matrices (), and the output of the network is given by the product of weight matrices with the data matrix: . Let be the tuple of all weight matrices, and denote the product for , and the identity for . We consider the empirical risk , which, for linear networks assumes the form

(4)

where is a suitable differentiable loss. For example, when , . Lastly, we write .

Remark: bias terms. We omit the bias terms here. This choice is for simplicity; models with bias can be handled by the usual trick of augmenting data and weight matrices.
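For concreteness, the sketch below evaluates the risk (4) of a deep linear network with the squared-error choice of loss by multiplying the weight matrices; the function and variable names are our own.

    import numpy as np

    def product(weights):
        # Product of the layer matrices W_L ... W_1 (weights[0] is the first layer).
        R = np.eye(weights[0].shape[1])
        for W in weights:
            R = W @ R
        return R

    def linear_net_risk(weights, X, Y):
        # Empirical risk of a deep linear network, i.e., (4) with the
        # squared-error choice 0.5 * ||. - Y||_F^2 for the loss (one possible choice).
        return 0.5 * np.linalg.norm(product(weights) @ X - Y, "fro") ** 2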

4.1 Main results and discussion

We are now ready to state our first main theorem, whose proof is deferred to Appendix A7. Suppose that for all , , and that the loss is given by (4), where is differentiable on . For any critical point of the loss , the following claims hold:

  1. If , then is a saddle of .

  2. If , then

    1. is a local min (max) of if is a local min (max) of ; moreover,

    2. is a global min (max) of if and only if is a global min (max) of .

  3. If there exists such that has full row rank and has full column rank, then , so 2(a) and 2(b) hold. Also,

    1. is a local min (max) of if is a local min (max) of .

Let us paraphrase Theorem 4.1 in words. In particular, it states that if the hidden layers are “wide enough” so that the product can attain full rank and if the loss assumes the form (4) for a differentiable loss , then the type (optimal or saddle point) of a critical point of is governed by the behavior of at the product .

Note that for any critical point of the loss , either or . Parts 1 and 2 handle these two cases. Also observe that the condition in Part 3 implies , so Part 3 is a refinement of Part 2. A notable fact is that a sufficient condition for Part 3 is having full rank. For example, if , full-rank implies , whereby the condition in Part 3 holds with .

If is not critical for , then must be a saddle point of . If is a local min/max of , is also a local min/max of . Notice, however, that Part 2(a) does not address the case of saddle points; when is a saddle point of , the tuple can behave arbitrarily. However, with the condition in Part 3, statements 2(a) and 3(a) hold at the same time, so that is a local min/max of if and only if is a local min/max of . Observe that the same “if and only if” statement holds for saddle points due to their definition; in summary, the types (min/max/saddle) of the critical points and match exactly.

Although Theorem 4.1 itself is of interest, the following corollary highlights its key implication for deep linear networks. In addition to the assumptions in Theorem 4.1, assume that any critical point of is a global min (max). For any critical point of , if , then is a saddle of , while if , then is a global min (max) of .

Proof.

If , then is a saddle point by Theorem 4.1.1. If , then is a global min (max) of by assumption. By Theorem 4.1.2(b), must be a global min (max) of . ∎

Corollary 4.1 shows that for any differentiable loss function whose critical points are global minima, the loss has only global minima and saddle points, therefore satisfying the “local minima are global” property. In other words, for such an , the multilinear re-parametrization introduced by deep linear networks does not introduce any spurious local minima/maxima; it only introduces saddle points. Importantly, Corollary 4.1 also provides a checkable condition that distinguishes global minima from saddle points. Since is nonconvex, it is remarkable that such a simple necessary and sufficient condition for global optimality is available.
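A numerical sketch of this check is given below. Since the precise formula is not reproduced in this extraction, we assume, as the surrounding discussion indicates, that the test is whether the gradient of the underlying loss, evaluated at the product of the weight matrices, vanishes: a nonzero gradient marks a saddle, and a zero gradient marks a global minimum under the assumption of Corollary 4.1. The finite-difference gradient and the names are ours.

    import numpy as np

    def grad_at_product(ell_bar, weights, eps=1e-6):
        # Finite-difference gradient of the loss ell_bar at R = W_L ... W_1.
        R = weights[0]
        for W in weights[1:]:
            R = W @ R
        G = np.zeros_like(R)
        for idx in np.ndindex(R.shape):
            E = np.zeros_like(R); E[idx] = eps
            G[idx] = (ell_bar(R + E) - ell_bar(R - E)) / (2 * eps)
        return G

    def classify_critical_point(ell_bar, weights, tol=1e-5):
        # Assumed test from Corollary 4.1: nonzero gradient => saddle;
        # zero gradient => global minimum (when every critical point of ell_bar is global).
        g = np.linalg.norm(grad_at_product(ell_bar, weights))
        return "saddle" if g > tol else "global minimum"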

Our result generalizes previous works on linear networks such as [10, 27, 28], because it provides conditions for global optimality for a broader range of loss functions without assumptions on datasets. Laurent and von Brecht [13] proved that if is a local minimum of , then is a critical point of ; observe that this result is implied by Theorem 4.1.1. Our result, which was proved in parallel and independently, is therefore strictly more general. With the additional assumption that critical points of are global minima, Laurent and von Brecht [13] showed that the “local minima are global” property holds for deep linear networks; our Corollary 4.1 gives a simple and efficient test condition as well as proving that there are only global minima and saddle points, which is clearly stronger.

5 Discussion and Future Work

We investigated the loss surface of deep linear and nonlinear neural networks. We proved two theorems showing the existence of spurious local minima in nonlinear networks, which apply to almost all datasets (Theorem 2.1) and a wide class of activations (Theorem 3.1). We concluded with Theorem 4.1, a general result on the behavior of critical points under multilinear parametrization, which unifies existing results on linear networks. Given that spurious local minima are common in neural networks, a valuable future research direction is to investigate how far local minima are from global minima in general, and how the size of the network affects this gap. Another direction is to add regularizers and study how they affect the loss surface. Additionally, one could try to prove algorithmic results in a similar flavor to [6]. We hope that our paper will be a stepping stone for such future research.

References

Appendix A1 Notation

We first list notation used throughout the appendix. For integers , denotes the set of integers between them. We write , if . For a vector , we use to denote its -th component, while denotes a vector comprised of the first components of . Let (or ) be the all ones (zeros) column vector in . For a subspace , we denote by its orthogonal complement.

For a matrix , is the -th entry and its -th column. Let and denote the largest and smallest singular values of , respectively; , , , and denote respectively the row space, column space, rank, and Frobenius norm of matrix . Let and be the null space and the left-null space of , respectively. When is a square matrix, let be the trace of . For matrices and of the same size, denotes the usual trace inner product of and ; equivalently, . Let be the all zeros matrix in .

Appendix A2 Proof of Theorem 2.1, Step 2, Case 2

Case 2. .

We start with a lemma discussing what implies. If , the following statements hold:

  1. There are some ’s that are duplicate; i.e. for some , .

  2. If is non-duplicate, meaning that , holds.

  3. If is duplicate, holds.

  4. There exists at least one duplicate such that, for that , there exist at least two different ’s that satisfy and .

Proof.

We prove this by showing that if any of these statements is not true, then we have or a contradiction.

  1. If all the ’s are distinct and , by definition of , for all . This violates our assumption that linear models cannot perfectly fit .

  2. If we have for a non-duplicate , at least one of the following statements must hold: or , meaning that or .

  3. Suppose is duplicate and . Let and . Then at least one of the following statements must hold: or . If , we can also see that , so . Similarly, if , then .

  4. Since holds for any duplicate , if holds for one then there must be at least two of them that satisfy . If this does not hold for any duplicate , then together with Part 2 this means that holds for all . This violates our assumption that linear models cannot perfectly fit .

From Lemma A2.4, we saw that there is a duplicate value of such that some of the data points satisfy and . The proof strategy in this case is essentially the same; the difference is that we choose one such duplicate , and then choose a vector to “perturb” the linear least squares solution in order to break the tie between the ’s that satisfy and .

We start by defining the minimum among such duplicate values of ’s, and a set of indices that satisfy .

Then, we define a subset of :

By Lemma A2.4, cardinality of