
Non-attracting Regions of Local Minima in Deep and Wide Neural Networks

Understanding the loss surface of neural networks is essential for the design of models with predictable performance and their success in applications. Experimental results suggest that sufficiently deep and wide neural networks are not negatively impacted by suboptimal local minima. Despite recent progress, the reason for this outcome is not fully understood. Could deep networks have very few, if any, suboptimal local optima? Or could all of them be equally good? We provide a construction to show that suboptimal local minima (i.e. non-global ones), even though degenerate, exist for fully connected neural networks with sigmoid activation functions. The local minima obtained by our proposed construction belong to a connected set of local solutions that can be escaped from via a non-increasing path on the loss curve. For extremely wide neural networks with two hidden layers, we prove that every suboptimal local minimum belongs to such a connected set. This provides a partial explanation for the successful application of deep neural networks. In addition, we also characterize under what conditions the same construction leads to saddle points instead of local minima for deep neural networks.



I Introduction

At the heart of most optimization problems lies the search for the global minimum of a loss function. The common approach to finding a solution is to initialize at random in parameter space and subsequently follow directions of decreasing loss based on local methods. This approach lacks a global progress criterion, which leads to descent into one of the nearest local minima. Since the loss function of deep neural networks is non-convex, the common approach of using gradient descent variants is vulnerable precisely to that problem.

Authors pursuing the early approaches to local descent by back-propagating gradients [1] experimentally noticed that suboptimal local minima appeared surprisingly harmless. More recently, for deep neural networks, the earlier observations were further supported by the experiments of, e.g., [2]. Several authors aimed to provide theoretical insight into this behavior. Broadly, two views may be distinguished. Some, aiming at explanation, rely on simplifying modeling assumptions. Others investigate neural networks under realistic assumptions, but often focus on failure cases only. Recently, Nguyen and Hein [3] provided partial explanations for deep and extremely wide neural networks for a class of activation functions including the commonly used sigmoid. Extreme width is characterized by a “wide” layer that has more neurons than input patterns to learn. For almost every instantiation of parameter values (i.e., for all but a null set of parameter values), it is shown that if the loss function has a local minimum there, then this local minimum must be a global one. This suggests that for deep and wide neural networks, possibly every local minimum is global. The question of what happens on the null set of parameter values, for which the result does not hold, remains unanswered.

Similar observations for neural networks with one hidden layer were made earlier by Gori and Tesi [4] and Poston et al. [5]. Poston et al. [5] show for a neural network with one hidden layer and sigmoid activation function that, if the hidden layer has more nodes than training patterns, then the error function (squared sum of prediction losses over the samples) has no suboptimal “local minimum” and “each point is arbitrarily close to a point from which a strictly decreasing path starts, so such a point cannot be separated from a so called “good” point by a barrier of any positive height” [5]. Sprinkhuizen-Kuyper and Boers [6] criticized that the definition of a local minimum used in the proof of [5] was rather strict and unconventional. In particular, the results do not imply that no suboptimal local minima, defined in the usual way, exist. As a consequence, the notions of attracting and non-attracting regions of local minima were introduced, and the authors proved that non-attracting regions exist by providing an example for the extended XOR problem. The existence of these regions implies that a gradient-based approach descending the loss surface using local information may still not converge to the global minimum. The main objective of this work is to revisit the problem of such non-attracting regions and show that they also exist in deep and wide networks. In particular, a gradient-based approach may get stuck in a suboptimal local minimum. Most importantly, the performance of deep and wide neural networks cannot be explained by the analysis of the loss curve alone, without taking proper initialization or the stochasticity of SGD into account.

Our observations are not fundamentally negative. First, the local minima we find are rather degenerate. With proper initialization, a local descent technique is unlikely to get stuck in one of the degenerate, suboptimal local minima. (That a proper initialization largely improves training performance is well known; see, e.g., [7].) Secondly, the minima reside on a non-attracting region of local minima (see Definition 1). Due to its exploration properties, stochastic gradient descent will eventually be able to escape from such a region (see [8]). We conjecture that in sufficiently wide and deep networks, except for a null set of parameter values as starting points, there is always a monotonically decreasing path down to the global minimum. This was shown in [5] for neural networks with one hidden layer, sigmoid activation function and square loss, and we generalize this result to neural networks with two hidden layers. (More precisely, our result holds for all neural networks with square loss and a class of activation functions including the sigmoid, where the wide layer is the last or second-to-last hidden layer.) This implies that in such networks every local minimum belongs to a non-attracting region of local minima.

Our proof of the existence of suboptimal local minima even in extremely wide and deep networks is based on a construction of local minima in neural networks given by Fukumizu and Amari [9]. By relying on careful computation we are able to characterize when this construction is applicable to deep neural networks. Interestingly, in deeper layers, the construction rarely seems to lead to local minima, but more often to saddle points. The argument that saddle points rather than suboptimal local minima are the main problem in deep networks has been raised before (see [10]), but a theoretical justification [11] uses strong assumptions that do not exactly hold in neural networks. Here, we provide the first analytical argument, under realistic assumptions on the neural network structure, describing when certain critical points of the training loss lead to saddle points in deeper networks.

II Related work

We discuss related work on suboptimal minima of the loss surface. In addition, we refer the reader to the overview article [12] for a discussion on the non-convexity in neural network training.

It is known that learning the parameters of neural networks is, in general, a hard problem. Blum and Rivest [13] prove NP-completeness for a specific neural network. It has also been shown that local minima and other critical points exist in the loss function of neural network training (see, e.g., [14, 9, 15, 16, 6, 17]). The understanding of these critical points has led to significant improvements in neural network training. This includes weight initialization techniques (e.g., [7]), improved backpropagation algorithms to avoid saturation effects in neurons [18], entirely new activation functions, and the use of second order information [19, 20].

That suboptimal local minima must become rather degenerate when the neural network becomes sufficiently large was observed for networks with one hidden layer in [4] and [5]. Recently, Nguyen and Hein [3] generalized this result to deeper networks containing an extremely wide hidden layer. Our contribution can be considered a continuation of this work.

To explain the persuasive performance of deep neural networks, Dauphin et al. [10] experimentally show that the behavior of critical points of the neural network’s loss function is similar to theoretical properties of critical points found for Gaussian fields on high-dimensional spaces [21]. Choromanska et al. [11] supply a theoretical connection, but they also require strong (arguably unrealistic) assumptions on the network structure. The results imply that (under their assumptions on the deep network) the loss at a local minimum must be close to the loss of the global minimum with high probability. In this line of research, [22] experimentally show a similarity between spin glass models and the loss curve of neural networks.

Why deep networks perform better than shallow ones is also investigated in [23] by considering a class of compositional functions. A number of papers partially answer the same question for ReLU and LeakyReLU networks, where the space becomes combinatorial in terms of a positive activation, compared to a stalled (or weak) signal. Soudry and Hoffer [24] probabilistically compare the volume of regions (for a specific measure) containing bad local and global minima in the limit, as the number of data points goes to infinity. For networks with one hidden layer and ReLU activation function, Freeman and Bruna [25] quantify the amount of hill-climbing necessary to go from one point in the parameter space to another and find that, with increasing overparameterization, all level sets become connected. Swirszcz et al. [26] construct datasets that allow one to find suboptimal local minima in overparameterized networks. Instead of analyzing local minima, Xie et al. [27] consider regions where the derivative of the loss is small for two-layer ReLU networks. Soudry and Carmon [28] consider leaky ReLU activation functions to find, similarly to the result of Nguyen and Hein [3], that for almost every combination of activation patterns in two consecutive mildly wide layers, a local minimum has global optimality.

To gain better insight into theoretical aspects, some papers consider linear networks, where the activation function is the identity. The classic result by Baldi and Hornik [29] shows that linear two-layer neural networks have a unique global minimum and all other critical points are saddle points. Kawaguchi [30], Lu and Kawaguchi [31] and Yun et al. [32] discuss generalizations of [29] to deep linear networks.

The existence of non-increasing paths on the loss curve down to the global minimum is studied by Poston et al. [5] for extremely wide two-layer neural networks with sigmoid activation functions. For ReLU networks, Safran and Shamir [33] show that, if one starts at a sufficiently high initialization loss, then there is a strictly decreasing path with varying weights into the global minimum. Haeffele and Vidal [34] consider a specific class of ReLU networks with regularization, give a sufficient condition that a local minimum is globally optimal, and show that a non-increasing path down to the global minimum exists.

Finally, worth mentioning is the study of Liao and Poggio [35], who use polynomial approximations to argue, by relying on Bézout’s theorem, that the loss function should have many local minima with zero empirical loss. Also relevant is the observation by Brady et al. [36] that, if the global minimum is not of zero loss, then a perfect predictor may have a larger loss in training than one producing worse classification results.

III Main results

III-A Problem definition

We consider regression networks with fully connected layers, where each layer applies a linear transformation given by a weight matrix and bias terms, followed by a nonlinear activation function. We notationally suppress the dependence of the neural network function on its parameters. We assume the activation function to belong to the class of strictly monotonically increasing, analytic, bounded functions on the real line with image in a bounded interval. As a prominent example, the sigmoid activation function lies in this class.

We assume no activation function at the output layer.

The neural network is assumed to be a regression network mapping into the real domain, i.e., the output layer consists of a single neuron. We train on a finite dataset consisting of input patterns and desired target values, and we aim to minimize the squared loss summed over the samples. Further, all network parameters are collected into a single vector of parameters.

The dependence of the neural network function on its parameters translates into a dependence of the loss function on the parameters. Due to the assumptions on the activation function, the loss is twice continuously differentiable. The goal of training a neural network consists of minimizing the loss over the parameter space. The infimum of the neural network’s loss is a unique value (most often zero in our examples), and any set of weights attaining it is called a global minimum. Due to its non-convexity, the loss function of a neural network is in general known to potentially suffer from local minima (a precise definition of a local minimum is given below). We will study the existence of suboptimal local minima, in the sense that a local minimum is suboptimal if its loss is strictly larger than the infimum.
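As a concrete reference for this setup, the network and its squared loss can be sketched in a few lines of NumPy. The function and variable names below are ours, not the paper's notation; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    # Strictly increasing, analytic, bounded activation, as assumed above.
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Fully connected regression network: each hidden layer applies an
    affine map followed by the sigmoid; the output layer has no activation."""
    h = x
    for W, b in params[:-1]:
        h = sigmoid(W @ h + b)
    W_out, b_out = params[-1]
    return (W_out @ h + b_out).item()  # single real-valued output

def squared_loss(params, X, Y):
    """Squared loss summed over the finite training set."""
    return sum((forward(params, x) - y) ** 2 for x, y in zip(X, Y))
```

Here `params` is a list of (weight matrix, bias vector) pairs, one per layer; the loss is a twice continuously differentiable function of these parameters.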

We refer to deep neural networks as models with more than one hidden layer. Further, we refer to wide neural networks as the type of model considered in [4, 5, 3], with a hidden layer containing at least as many neurons as there are input patterns.

Disclaimer: Naturally, training for zero global loss is not desirable in practice, nor is the use of fully connected wide and deep neural networks necessarily. The results of this paper are of theoretical importance. To understand the complex learning behavior of deep neural networks in practice, it is necessary to first understand the networks with the most fundamental structure. In this regard, while our results are not directly applicable to neural networks used in practice, they do offer explanations for their learning behavior.

III-B A special kind of local minimum

The standard definition of a local minimum, which is also used here, is a point possessing a neighborhood in which no point attains a strictly smaller loss. Since local minima do not need to be isolated, two types of connected regions of local minima may be distinguished. Note that our definition slightly differs from the one by [6].

Definition 1.

[6] Consider a differentiable loss function and let R be a maximal connected subset of parameter values such that every point of R is a local minimum with one common loss value.

  • R is called an attracting region of local minima if R has a neighborhood such that every continuous path which is non-increasing in loss and starts inside this neighborhood remains inside it.

  • R is called a non-attracting region of local minima if every neighborhood of R contains a point from which a continuous path exists that is non-increasing in loss and ends in a point with loss strictly smaller than the value of R.

Despite its non-attracting nature, a non-attracting region of local minima may be harmful to a gradient descent approach. A path of greatest descent can end in a local minimum on R. However, no point on R needs to have a neighborhood of attraction in the sense that following the path of greatest descent from a point in a neighborhood of R will lead back to R. (The path can lead to a different local minimum on R close by or reach points with values strictly smaller than that of R.)

In the example of such a region for the 2-3-1 XOR network provided in [6], a local minimum (of higher loss than the global loss) resides at points in parameter space with some coordinates at infinity. In particular, a gradient descent approach may lead to diverging parameters in that case. However, a different non-increasing path down to the global minimum always exists. It can be shown that local minima at infinity also exist for wide and deep neural networks. (The proof can be found in Appendix -A.)

Theorem 1 (cf. [6] Section III).

Consider the squared loss of a fully connected regression neural network with sigmoid activation functions, having at least one hidden layer and each hidden layer containing at least two neurons. Then, for almost every finite dataset, the loss function possesses a local minimum at infinity. The local minimum is suboptimal whenever the dataset and the neural network are such that a constant function is not an optimal solution.

Fig. 1: A function illustrating a non-attracting region of local minima. (This example does not exactly appear in the neural networks considered in this paper, but is of a similar nature.)

A different type of non-attracting region of local minima (without infinite parameter values) is considered for neural networks with one hidden layer by Fukumizu and Amari [9] and Wei et al. [8] under the name of singularities. This type of region is characterized by singularities in the weight space (a subset of the null set not covered by the results of Nguyen and Hein [3]) leading to a loss value strictly larger than the global loss. The dynamics around such regions are investigated by Wei et al. [8]. Again, a full batch gradient descent approach can get stuck in a local minimum in this type of region. A rough illustration of the nature of these non-attracting regions of local minima is depicted in Fig. 1.
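The flavor of such a region can be reproduced with a two-dimensional toy function. The example below is our own construction, not the function plotted in Fig. 1: f(x, y) = x^4 + x^2(1 - y^2). Every point (0, y) with |y| < 1 is a local minimum with value 0, so the segment {x = 0, |y| <= 1} is a connected region of local minima; the value 0 is suboptimal since f dips below zero for |y| > 1; and every neighborhood of the region contains the start of a path that is non-increasing in f and ends at negative values.

```python
import numpy as np

def f(x, y):
    # Toy loss with a non-attracting region of suboptimal local minima
    # along the segment {x = 0, |y| < 1}, all with value f = 0.
    return x**4 + x**2 * (1.0 - y**2)

def escape_path(eps=0.01, y0=0.5, y1=3.0, n=200):
    """A path starting arbitrarily close to the region, at (eps, y0), that
    is non-increasing in f and ends at a point with f < 0: first slide y
    outward at fixed x = eps, then descend in x at fixed y = y1."""
    leg1 = [(eps, y) for y in np.linspace(y0, y1, n)]
    x_star = np.sqrt((y1**2 - 1.0) / 2.0)  # minimizer of f(., y1)
    leg2 = [(x, y1) for x in np.linspace(eps, x_star, n)]
    return leg1 + leg2
```

The point (0, 0.5) itself is a genuine local minimum (f is non-negative in a small ball around it), so plain gradient descent started there stays put, yet arbitrarily close starting points admit the escape path.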

Non-attracting regions of local minima do not only exist in small two-layer neural networks.

Theorem 2.

There exist deep and wide fully-connected neural networks with sigmoid activation function such that the squared loss function of a finite dataset has a non-attracting region of local minima (at finite parameter values).

The construction of such local minima is discussed in Section V with a complete proof in Appendix -B.

Corollary 1.

Any attempt to show, for fully connected deep and wide neural networks, that a gradient descent technique will always lead to a global minimum will fail if it is based only on a description of the loss curve and does not take into consideration properties of the learning procedure (such as the stochasticity of stochastic gradient descent), properties of a suitable initialization technique, or assumptions on the dataset.

On the positive side, we point out that a stochastic method such as stochastic gradient descent has a good chance of escaping a non-attracting region of local minima due to noise. With infinite time at hand and sufficient exploration, the region can be escaped from with high probability (see [8] for a more detailed discussion). In Section V-A we will further characterize when the method used to construct examples of non-attracting regions of local minima is applicable. This characterization limits us to the construction of extremely degenerate examples. We give an intuitive argument why assuring the necessary assumptions for the construction becomes more difficult for wider and deeper networks, and why it is natural to expect a lower suboptimal loss (i.e., less “bad” suboptimal minima) the less degenerate the constructed minima are and the more parameters a neural network possesses.

III-C Non-increasing path to a global minimum

By definition, every neighborhood of a non-attracting region of local minima contains points from which a non-increasing path to a value less than the value of the region exists. (By definition, all points belonging to a non-attracting region have the same value; in fact, they are all local minima.) The question therefore arises whether from almost everywhere in parameter space there is such a non-increasing path all the way down to a global minimum. If the last hidden layer is the wide layer having more neurons than input patterns (for example, consider a wide two-layer neural network), then this holds true by the results of [3] (and [4, 5]). We show the same conclusion to hold for wide neural networks whose second-to-last hidden layer is the wide one. In particular, this implies that for wide neural networks with two hidden layers, starting from almost everywhere in parameter space, there is a non-increasing path down to a global minimum.

Theorem 3.

Consider a fully connected regression neural network with activation function in the considered class, equipped with the squared loss function for a finite dataset. Assume that the second-to-last hidden layer contains more neurons than the number of input patterns. Then, for each set of parameters, there are arbitrarily close sets of parameters from which a path that is non-increasing in loss leads to a global minimum.

Corollary 2.

Consider a wide, fully connected regression neural network with two hidden layers and activation function in the considered class, trained to minimize the squared loss over a finite dataset. Then all suboptimal local minima are contained in a non-attracting region of local minima.

The rest of the paper contains the arguments leading to the given results.

IV Notational choices

We fix additional notation beyond the problem definition from Section III-A. For a given input, we denote the vector of values at all neurons of a layer before activation as the neuron value pattern and after activation as the activation pattern. In general, column vectors are written via their coefficients and matrices via their entries, indexed by position. Using that the network function can be considered a composition of functions from consecutive layers, we also use notation for the function from any given layer to the output. For convenience of the reader, a tabular summary of all notation is provided in the Appendix.

V Construction of local minima

We recall the construction of so-called hierarchical suboptimal local minima given in [9] and extend it to deep networks. For the hierarchical construction of critical points, we add one additional neuron to a hidden layer. (Negative indices are unused for neurons, which allows us to give the added neuron such an index.) Once we have fixed the layer, we denote the parameters of the incoming linear transformation as incoming weights, each describing the contribution of a neuron in the previous layer to a neuron in the fixed layer, and the parameters of the outgoing linear transformation as outgoing weights, each describing the contribution of a neuron in the fixed layer to a neuron in the next layer. For the weights of the output layer (into a single neuron), we use a separate vector notation.

Fig. 2: An image of a two-layer neural network defined by weights in the image of the embedding. Numbers in nodes denote the index of a neuron in the form (layer, neuron index).

We recall the function used in [9] to construct local minima in a hierarchical way. This function describes the mapping from the parameters of the original network to the parameters after adding a neuron. It is determined by the incoming weights of the new neuron, its outgoing weights, and a change of the outgoing weights of one chosen neuron in the smaller network. Sorting the network parameters in a convenient way, the embedding of the smaller network into the larger one is defined, for any value of a scalar mixing parameter, by a function mapping parameters of the smaller network to parameters of the larger network. All remaining network parameters, i.e., all weights not involving the chosen neuron and all parameters from the linear transformations of the other layers, if existent, are left unchanged. A visualization of the embedding is shown in Fig. 2.

Important fact: the functions computed by the smaller and the larger network at corresponding parameters agree for every input. More generally, even all neuron value patterns and activation patterns of the subsequent layers coincide for all inputs.
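A minimal sketch of this embedding for a network with one hidden layer (function names and the parameter name `lam` are ours; the symbol used in [9] is suppressed in this excerpt): the added neuron copies the incoming weights and bias of a chosen neuron j, and j's outgoing weight is split between the two copies. The "important fact" then becomes a one-line numerical check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def net(W, b, v, c, x):
    # One-hidden-layer regression network.
    return v @ sigmoid(W @ x + b) + c

def embed(W, b, v, j, lam):
    """Add one hidden neuron duplicating neuron j: same incoming weights
    and bias, with j's outgoing weight split as (1 - lam) / lam between
    the original and the copy. The network function is unchanged for
    every value of lam."""
    W2 = np.vstack([W, W[j:j+1]])
    b2 = np.append(b, b[j])
    v2 = np.append(v, lam * v[j])
    v2[j] = (1.0 - lam) * v[j]
    return W2, b2, v2
```

Varying `lam` moves along a line in the larger parameter space on which the network function, and hence the loss, is constant; this is exactly the degenerate direction exploited later.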

V-A Characterization of hierarchical local minima

Using the embedding to map a smaller deep neural network into a second one with one additional neuron, it has been shown that critical points get mapped to critical points.

Theorem 4 (Nitta [15]).

Consider two neural networks as in Section III-A, which differ by one neuron in one layer of the larger network. If parameter choices determine a critical point for the squared loss over a finite dataset in the smaller network, then, for each value of the mixing parameter, the embedded parameters determine a critical point in the larger network.

As a consequence, whenever the embedding of a local minimum into a larger network does not lead to a local minimum, it leads to a saddle point instead. (There are no local maxima in the networks we consider, since the loss function is convex with respect to the parameters of the last layer.) For neural networks with one hidden layer, it has been characterized when such a critical point is a local minimum.

Theorem 5 (Fukumizu, Amari [9]).

Consider two neural networks as in Section III-A with only one hidden layer, which differ by one neuron in the hidden layer of the larger network. Assume that parameters determine a local minimum for the squared loss over a finite dataset in the smaller neural network.

Then the embedded parameters determine a local minimum in the larger network if the associated matrix (the one to which the matrix in (1) reduces, built from the individual input dimensions of the input patterns) is positive definite and the mixing parameter lies strictly between 0 and 1, or if it is negative definite and the mixing parameter is less than 0 or greater than 1.

We extend the previous theorem to a characterization in the case of deep networks. We note that a similar computation was performed in [19] for neural networks with two hidden layers.

Theorem 6.

Consider two (possibly deep) neural networks as in Section III-A, which differ by one neuron in one layer of the larger network. Assume that the parameter choices determine a local minimum for the squared loss over a finite dataset in the smaller network. If the matrix defined in (1) is either

  • positive definite and the mixing parameter lies strictly between 0 and 1, or

  • negative definite and the mixing parameter is less than 0 or greater than 1,

then the embedded parameters determine a non-attracting region of local minima in the larger network if and only if the quantity defined in (2) vanishes for all training patterns.

Remark 1.

In the case of a neural network with only one hidden layer as considered in Theorem 5, the function mapping the subsequent layer to the output is the identity, and the matrix in (1) reduces to the matrix in Theorem 5. The additional vanishing condition does hold for shallow neural networks with one hidden layer, as we show below. This proves Theorem 6 to be consistent with Theorem 5.

The theorem follows from a careful computation of the Hessian of the cost function, characterizing when it is positive (or negative) semidefinite, and from checking that the loss function does not change along directions that correspond to an eigenvector of the Hessian with eigenvalue 0. We state the outcome of the computation in Lemma 1 and refer the reader interested in a full proof of Theorem 6 to Appendix -B.

Lemma 1.

Consider two (possibly deep) neural networks as in Section III-A, which differ by one neuron in one layer of the larger network. Fix a value of the mixing parameter. Assume that the parameter choices determine a critical point in the smaller network.

Let the loss functions of the larger and the smaller network be given, with the parameters of the larger network obtained from those of the smaller one via the embedding. With respect to a suitable basis of the parameter space of the larger network, the Hessian of the larger network’s loss (i.e., the second derivative with respect to the new network parameters) at the embedded critical point has a block structure determined by the matrix in (1) and the quantity in (2).

V-B Shallow networks with a single hidden layer

For the construction of suboptimal local minima in wide two-layer networks, we begin by following the experiments of [9], which demonstrate the existence of suboptimal local minima in (non-wide) two-layer neural networks.

Consider a neural network of size 1–2–1. We use the corresponding network function to construct a dataset by randomly choosing input points and letting the targets be the corresponding network outputs. By construction, we know that a neural network of size 1–2–1 can perfectly fit the dataset with zero error.

Consider now a smaller network of size 1–1–1, having too little expressive power for a global fit of all data points. We find parameters where the loss function of this neural network is in a local minimum with non-zero loss. For this small example, the required positive definiteness of the matrix from (1) reduces to checking a real number for positivity, which we assume to hold true. We can now apply the embedding and Theorem 5 to find parameters for a neural network of size 1–2–1 that determine a suboptimal local minimum. This example may serve as the base case for a proof by induction to show the following result.
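The base case can be reproduced numerically. The sketch below is our own (seeds, sizes, iteration count and learning rate are arbitrary choices): it builds the dataset from a random 1–2–1 teacher and runs plain full-batch gradient descent on a 1–1–1 student, which with its single hidden neuron generically converges to a point of nonzero loss.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Teacher of size 1-2-1 generates the data, so zero loss is attainable
# for a 1-2-1 network by construction.
Wt, bt = rng.normal(size=2), rng.normal(size=2)
vt, ct = rng.normal(size=2), rng.normal()
X = rng.uniform(-2.0, 2.0, size=20)
Y = vt @ sigmoid(np.outer(Wt, X) + bt[:, None]) + ct

# Student of size 1-1-1 with parameters theta = (w, b, v, c).
def loss(theta):
    w, b, v, c = theta
    return np.sum((v * sigmoid(w * X + b) + c - Y) ** 2)

def grad(theta, h=1e-6):
    # Central finite differences keep the sketch short.
    g = np.zeros(4)
    for i in range(4):
        e = np.zeros(4); e[i] = h
        g[i] = (loss(theta + e) - loss(theta - e)) / (2.0 * h)
    return g

theta = rng.normal(size=4)
initial = loss(theta)
for _ in range(3000):      # plain full-batch gradient descent
    theta -= 1e-3 * grad(theta)
```

At the resulting point one would then check the scalar condition corresponding to (1) before applying the embedding and Theorem 5.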

Theorem 7.

There is a wide neural network with one hidden layer and arbitrarily many neurons in the hidden layer that has a non-attracting region of suboptimal local minima.

Having already established the existence of parameters for a (small) neural network leading to a suboptimal local minimum, it suffices to note that iteratively adding neurons using Theorem 5 is possible. At each step, we add a neuron to the network by an application of the embedding with the same mixing parameter. The corresponding matrix from (1) is positive semidefinite. (We use here that the relevant quantities never change during this construction.) By Theorem 5 we always find a suboptimal minimum with nonzero loss for the enlarged network. Note, however, that a continuous change of the mixing parameter to a value outside the admissible interval does not change the network function, but leads to a saddle point. Hence, we have found a non-attracting region of suboptimal minima.

Remark 2.

Since we started the construction from a network of size 1–1–1, our constructed example is extremely degenerate: the suboptimal local minima of the wide network have identical incoming weight vectors for each hidden neuron. Obviously, the suboptimality of this parameter setting is easily discovered, and with proper initialization, the chance of landing in this local minimum is vanishingly small.

However, one may also start the construction from a more complex network with several hidden neurons. In this case, when adding a few more neurons using the embedding, it is much harder to detect the suboptimality of the parameters from visual inspection.

Fig. 3: A non-attracting region of local minima in a deep neural network. (a) A local minimum at t=0. Top: evolution of the loss along random directions. Bottom: the minimum over all sampled directions. (b) Path along a degenerate direction to a saddle point. (c) Saddle point with the same loss value: error evolution along a direction of descent.

V-C Deep neural networks

According to Theorem 6, besides positive definiteness of the matrix in (1) for some value of the mixing parameter, in deep networks there is a second condition for the construction of hierarchical local minima using the embedding: the quantity in (2) must vanish. We now consider conditions that make this quantity zero.

Proposition 1.

Suppose we have a hierarchically constructed critical point of the squared loss of a neural network, obtained by adding a neuron into a layer by an application of the embedding to a chosen neuron. Suppose further that the outgoing weights of the chosen neuron are nonzero, and suppose the quantity is defined as in (2). Then this quantity vanishes if one of the following holds.

  • []

  • The layer is the last hidden layer. (This condition includes the case of the hidden layer in a two-layer network.)

  • for all

  • For each and each , with ,

    (This condition holds in the case of the weight-infinity attractors in the proof of Theorem 1 for the second-to-last layer. It also holds at a global minimum.)

The proof is contained in Appendix C.

V-D Experiment for deep networks

To construct a local minimum in a deep and wide neural network, we start by considering a three-layer network of size 2–2–4–1, i.e. we have two input dimensions, one output dimension and hidden layers of two and four neurons. We use its network function to create a dataset of 50 samples , hence we know that a network of size 2–2–4–1 can attain zero loss.

We initialize a new neural network of size 2–2–2–1 and train it until convergence, before using the construction to add neurons. When adding neurons to the last hidden layer using , Proposition 1 assures that for all . We check for positive definiteness of the matrix , and only continue when this property holds. Having thus assured the necessary condition of Theorem 6, we can add a few neurons to the last hidden layer (by induction, as in the two-layer case), which results in a local minimum of a network of size 2–2–M–1. By construction, the local minimum of non-zero loss that we attain is suboptimal whenever . For , the network is wide.
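The matrix whose positive definiteness must be checked depends on quantities defined earlier in the paper; the numerical test itself, however, is generic and can be sketched as follows (the function name and tolerance are our own choices):

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Test symmetric positive definiteness via the smallest eigenvalue."""
    A = 0.5 * (A + A.T)                    # symmetrize against numerical noise
    return bool(np.linalg.eigvalsh(A).min() > tol)

# Sanity checks on matrices with known definiteness: a Gram matrix plus a
# small ridge is positive definite; a negated identity is not.
rng = np.random.default_rng(1)
V = rng.normal(size=(5, 3))
assert is_positive_definite(V @ V.T + 1e-3 * np.eye(5))
assert not is_positive_definite(-np.eye(5))
```

The tolerance guards against declaring a numerically semidefinite matrix (smallest eigenvalue near zero) positive definite, which matters here since the constructed minima are degenerate.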

Experimentally, we show not only that we indeed end up with a suboptimal minimum, but also that it belongs to a non-attracting region of local minima. In Fig. 3 we show results after adding eleven neurons to the last hidden layer. On the left side, we plot the loss in the neighborhood of the constructed local minimum in parameter space. The top image shows loss curves along randomly generated directions; the bottom displays the minimal loss over all these directions. On the top right, we show the change of loss along one of the degenerate directions, which allows reaching a saddle point. At such a saddle point, Lemma 1 provides a direction of descent. The image on the bottom right shows that this direction indeed allows a reduction in loss. Being able to reach a saddle point from a local minimum via a path of non-increasing loss shows that we have indeed found a non-attracting region of local minima.
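The neighborhood probe used for the left panels of Fig. 3 can be sketched as follows; `loss`, `theta0`, and all parameters are illustrative placeholders, not the paper's actual network:

```python
import numpy as np

def min_loss_over_directions(loss, theta0, radius=0.5, n_dirs=200,
                             n_steps=41, seed=0):
    """Evaluate `loss` along random unit directions through theta0 and
    return, for each step size t, the minimum over all sampled directions."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(-radius, radius, n_steps)
    best = np.full(n_steps, np.inf)
    for _ in range(n_dirs):
        d = rng.normal(size=theta0.shape)
        d /= np.linalg.norm(d)
        vals = np.array([loss(theta0 + t * d) for t in ts])
        best = np.minimum(best, vals)
    return ts, best

# Toy check on a convex quadratic: the minimum over all directions is
# attained at t = 0 and grows away from it, as in Fig. 3(a), bottom.
ts, best = min_loss_over_directions(lambda th: float(th @ th), np.zeros(4))
assert best.min() < 1e-12
assert best[0] > 0.0
```

If `theta0` were a strict local minimum of `loss`, the bottom curve would show a strictly positive loss increase in every sampled direction; for the degenerate minima constructed here, some directions remain flat.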

V-E A discussion of limitations and of the loss of non-attracting regions of suboptimal minima

We fix a neuron in layer and aim to use to find a local minimum in the larger network. We then need to check whether a matrix is positive definite, which depends on the dataset. Under strong independence assumptions (the signs of different eigenvalues of are independent), one may argue, similarly to [10], that the probability of finding positive definite (all eigenvalues positive) decreases exponentially in the number of neurons of the previous layer . At the same time, the number of neurons in layer available for the construction only increases linearly with the number of neurons in layer .
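The independence heuristic can be made concrete with a small simulation: if eigenvalue signs were independent fair coin flips, a random symmetric n-by-n matrix would be positive definite with probability 2^-n. Comparing this against the empirical fraction for Gaussian symmetric matrices (an ensemble chosen here purely for illustration, not the paper's data-dependent matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(n, trials=5000):
    """Empirical fraction of random symmetric Gaussian n x n matrices
    whose eigenvalues are all positive."""
    hits = 0
    for _ in range(trials):
        A = rng.normal(size=(n, n))
        S = 0.5 * (A + A.T)
        if np.linalg.eigvalsh(S).min() > 0:
            hits += 1
    return hits / trials

for n in (2, 3, 4):
    # Eigenvalues of a random symmetric matrix repel each other, so the
    # empirical fraction typically falls below the independence value 2**-n.
    print(n, frac_positive_definite(n), 2.0 ** -n)
```

This matches the qualitative observation below that the counts of positive definite matrices in the experiment fall short of the independence-based estimate.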

Experimentally, we use a four-layer neural network of size 2–8–12–8–1 to construct a (random) dataset containing 500 labeled samples. We train a network of size 2–4–6–4–1 on this dataset until convergence using SciPy's BFGS implementation. For each layer , we check for each neuron whether it can be used for enlargement of the network via the map for some , i.e., we check whether the corresponding matrix is positive definite. We repeat this experiment 1000 times. For the first layer, we find that in 547 of 4000 test cases the matrix is positive definite. For the second layer we only find positive definite in 33 of 6000 cases, and for the last hidden layer there are only 6 instances out of 4000 where the matrix is positive definite. Since the matrix is of size for the first/second/last hidden layer respectively, the number of positive definite matrices is smaller than what would be expected under the strong independence assumptions discussed above.

In addition, in deeper layers, further away from the output layer, the condition appears dataset dependent, and it seems unlikely to us that . Simulations seem to support this belief. However, the condition is difficult to check numerically. Firstly, it is hard to find the exact position of minima, and we only compute numerical approximations of . Secondly, the terms are small for sufficiently large networks, so numerical errors play a role. Due to these two facts, it is barely possible to check the condition of exact equality to zero. In Fig. 4 we show the distribution of the maximal entries of the matrix for neurons in the first, second and third layer of the network of size 2–4–6–4–1 trained as above. Note that for the third layer we know from theory that at a critical point , but due to numerical errors much larger values arise.

Further, a region of local minima as above requires linearly dependent activation pattern vectors. In this way, linear dimensions for subsequent layers are lost, reducing the ability to approximate the target function. Intuitively, in a deep and wide neural network there are many possible directions of descent. Losing some of them still leaves the network with enough freedom to closely approximate the target function. As a result, these suboptimal minima have a loss close to that of the global minimum.

In conclusion, finding suboptimal local minima of high loss by the construction using becomes hard when networks become deep and wide.

VI Proving the existence of a non-increasing path to the global minimum

Fig. 4: Distribution of maximal entries of the matrices for the first, second and last hidden layer of a network of size 2–4–6–4–1 trained on 1000 random datasets.

In the previous section we showed the existence of non-attracting regions of local minima. This type of local minimum does not rule out the possibility of non-increasing paths to the global minimum from almost everywhere in parameter space. In this section, we sketch the proof of Theorem 3 in the form of several lemmas, where, up to the basic assumptions on the neural network structure as in Section III-A (with activation function in ), the assumption of each lemma is given by the conclusion of the previous one. A full proof can be found in Appendix D.

We consider vectors that we call activation vectors, which are different from the activation pattern vectors from above. The activation vector at a neuron in layer is denoted by and defined by the values at the given neuron for the different samples :

In other words, while for the activation pattern vectors we fix and and let run over its possible values, for the activation vectors we fix and and let run over the samples in the dataset.
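Concretely, the activation vectors are the columns of a layer's activation matrix when one row is stored per sample. A sketch for a small sigmoid network (all names and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def activation_matrices(X, weights, biases):
    """Forward pass returning, per hidden layer, an (N x width) matrix whose
    j-th column is the activation vector of neuron j: one entry per sample."""
    mats = []
    H = X
    for W, b in zip(weights, biases):
        H = sigmoid(H @ W + b)
        mats.append(H)
    return mats

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                       # N = 50 samples
Ws = [rng.normal(size=(2, 4)), rng.normal(size=(4, 3))]
bs = [rng.normal(size=4), rng.normal(size=3)]
acts = activation_matrices(X, Ws, bs)

# Activation vector of neuron 0 in the second hidden layer: one value per sample.
a = acts[1][:, 0]
assert a.shape == (50,)
```

In this notation, Lemma 2 below asserts that the columns of the wide layer's activation matrix can be made to span all of R^N.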

The first step of the proof is to use the freedom given by to have the activation vectors of the wide layer span the whole space .

Lemma 2.

[3, Corollary 4.5] For each choice of parameters and all there is such that and for the activation vectors of the wide layer for we have

Lemma 3.

Assume that in the wide layer we have that the activation vectors satisfy

Then for any continuous paths in layer with there is a continuous path of parameters with and such that

Lemma 4.

For all continuous paths in , i.e. the N-fold copy of the image of , there is a continuous path in such that for all .

The activation vectors of the last hidden layer span a linear subspace of . The optimal parameters of the output layer compute the best approximation of onto . Lemma 3 and Lemma 4 together imply that we can achieve any desired continuous change of the spanning vectors of , and hence of the linear subspace , by a suitable change of the parameters .
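The best approximation computed by the optimal output layer is an ordinary least-squares projection. A sketch with an illustrative matrix H whose columns play the role of the activation vectors of the last hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 8                       # N samples, M last-hidden-layer neurons
H = rng.normal(size=(N, M))        # columns: activation vectors spanning S
y = rng.normal(size=N)             # target vector

# Optimal output weights: least-squares projection of y onto span(H).
w, *_ = np.linalg.lstsq(H, y, rcond=None)
y_proj = H @ w

# The residual of the projection is orthogonal to the subspace S.
assert np.allclose(H.T @ (y - y_proj), 0.0, atol=1e-8)
```

As the activation vectors move continuously, the subspace S and hence the projection move continuously as well, which is what the combination of Lemma 3 and Lemma 4 exploits.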

As it turns out, there is a natural path of parameters that strictly monotonically decreases the loss to the global minimum whenever we may assume that not all non-zero coefficients of have the same sign. If this is not the case, we first follow a different path through parameter space to eventually assure different signs of the coefficients of . Interestingly, this path leaves the loss constant. In other words, from certain points in parameter space it is necessary to follow a path of constant loss until we reach a point from which we can further decrease the loss, just as in the case of the non-attracting regions of local minima.

Lemma 5.

For , let be a set of vectors in and their linear span. If has a representation where all are positive (or all negative), then there are continuous paths of vectors in such that the following properties hold.

  • .

  • for all , so that there are continuous paths such that .

  • There are such that and .

We apply Lemma 5 to the activation vectors , giving continuous paths and . Then the output of the neural network remains constant along this path, and hence so does the loss. The desired change of the activation vectors can be performed by a suitable change of parameters according to Lemma 3 and Lemma 4. The simultaneous change of and defines the first part of our desired path in parameter space, which keeps constant. The final part of the desired path is given by the following lemma.

Lemma 6.

Assume a neural network structure as above with activation vectors of the wide hidden layer spanning . If the weights of the output layer satisfy that there is both a positive and a negative weight, then there is a continuous path from the current weights of decreasing loss down to the global minimum at .


Fix , the prediction for the current weights. The main idea is to change the activation vectors of the last hidden layer according to

With fixed, at the output this results in a change of , which reduces the loss to zero. The required change of the activation vectors can be implemented by an application of Lemma 3 and Lemma 4, but only if the image of each lies in the image of the activation function. Hence, the latter must first be arranged.
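Ignoring for a moment the range constraint just mentioned, the loss-reducing part of the path can be sketched numerically: with output weights w fixed, moving the activation matrix along a rank-one direction adds t times the residual to the output, so the squared loss decreases monotonically and vanishes at t = 1. All names (H0, w, y, d) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 8
H0 = rng.normal(size=(N, M))        # activation vectors of last hidden layer
w = rng.normal(size=M)              # fixed output weights
y = rng.normal(size=N)              # targets

r = y - H0 @ w                      # current residual at the output
d = w / (w @ w)                     # any direction with w @ d == 1 works

def loss(t):
    """Squared loss along the path H(t) = H0 + t * outer(r, d): the output
    becomes H0 @ w + t * r, so the loss equals (1 - t)^2 * ||r||^2."""
    Ht = H0 + t * np.outer(r, d)
    return float(np.sum((y - Ht @ w) ** 2))

ts = np.linspace(0.0, 1.0, 11)
vals = [loss(t) for t in ts]
assert all(a >= b - 1e-9 for a, b in zip(vals, vals[1:]))  # non-increasing
assert vals[-1] < 1e-12                                     # zero loss at t = 1
```

The sketch glosses over exactly the obstruction the lemma addresses: the entries of H(t) must stay inside the image of the activation function, which is why the sign condition on the output weights (and the preparatory constant-loss path) is needed.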

In the case that , it suffices to first decrease the norm of while simultaneously increasing the norm of the outgoing weight so that the output remains constant. If, however, lies on the boundary of the interval (as, for example, for a sigmoid activation function), then the assumption of non-zero weights of different signs becomes necessary. We let

We further define to be the vector with coordinate for equal to and 0 otherwise, and we let analogously