The application of deep learning (LeCun et al., 2015) has in recent years led to a dramatic boost in performance in many areas such as computer vision, speech recognition, and natural language processing. Despite this huge empirical success, the theoretical understanding of deep learning is still limited. In this paper we address the non-convex optimization problem of training a feedforward neural network. This problem turns out to be very difficult, as there can be exponentially many distinct local minima (Auer et al., 1996; Safran & Shamir, 2016). It has been shown that training a network consisting of a single neuron is already NP-hard for a variety of activation functions (Sima, 2002).
In practice, local search techniques like stochastic gradient descent or its variants are used to train deep neural networks. Surprisingly, it has been observed (Dauphin et al., 2014; Goodfellow et al., 2015) that in the training of state-of-the-art feedforward neural networks, whether sparsely connected like convolutional neural networks (LeCun et al., 1990; Krizhevsky et al., 2012) or fully connected, one does not encounter problems with suboptimal local minima. However, as the authors of (Goodfellow et al., 2015) themselves admit, the reason for this might be a connection between the fact that these networks achieve good performance and the fact that they are easy to train.
On the theoretical side there have been several interesting recent developments, see e.g. (Brutzkus & Globerson, 2017; Lee et al., 2016; Poggio & Liao, 2017; Rister & Rubin, 2017; Soudry & Hoffer, 2017; Zhou & Feng, 2017). For some classes of networks one can show that they can be trained to global optimality efficiently. However, it turns out that these approaches are either not practical (Janzamin et al., 2016; Haeffele & Vidal, 2015; Soltanolkotabi, 2017), as they require e.g. knowledge about the data-generating measure, or they modify the neural network structure and objective (Gautier et al., 2016). One class of networks which is simpler to analyze is that of deep linear networks, for which it has been shown that every local minimum is a global minimum (Baldi & Hornik, 1988; Kawaguchi, 2016). While this is a highly non-trivial result, as the optimization problem is non-convex, deep linear networks are not interesting in practice, since one effectively just learns a linear function. In order to characterize the loss surface of general networks, an interesting approach has been taken by (Choromanska et al., 2015a). By randomizing the nonlinear part of a feedforward network with ReLU activation function and making some additional simplifying assumptions, they can relate it to a certain spin glass model which can be analyzed. In this model the objective value at local minima is close to the global optimum, and the number of bad local minima decreases quickly with the distance to the global optimum. This is a very interesting result, but it is based on a number of unrealistic assumptions (Choromanska et al., 2015b). It has recently been shown (Kawaguchi, 2016) that if some of these assumptions are dropped, one basically recovers the result of the linear case, but the model is still unrealistic.
In this paper we analyze the case of overspecified neural networks, that is, the network is larger than what is required to achieve minimum training error. Under overspecification, (Safran & Shamir, 2016) have recently analyzed under which conditions it is possible to generate an initialization from which the global optimum can in principle be reached with descent methods. However, they can only deal with one-hidden-layer networks and have to make strong assumptions on the data, such as linear independence or cluster structure. In this paper, overspecification means that there exists a very wide layer, where the number of hidden units is larger than the number of training points. For this case, we can show that a large class of local minima is globally optimal. In fact, we will argue that almost every critical point is globally optimal. Our results generalize previous work of (Yu & Chen, 1995), who analyzed a similar setting for one-hidden-layer networks, to networks of arbitrary depth. Moreover, they extend results of (Gori & Tesi, 1992; Frasconi et al., 1997), who showed that for certain deep feedforward neural networks almost all local minima are globally optimal whenever the training data is linearly independent. While our assumption on the number of hidden units is clearly quite strong, several recent neural network architectures contain a hidden layer which is quite wide relative to the number of training points: e.g., (Lin et al., 2016) use 50,000 training samples and a network with one hidden layer of 10,000 hidden units, and (Ba & Caruana, 2014) have 1.1 million training samples and a layer with 400,000 hidden units. We refer to (Ciresan et al., 2010; Neyshabur et al., 2015; Vincent et al., 2010; Caruana et al., 2001) for other examples where the number of hidden units of one layer is on the order of the number of training samples.
We conjecture that for these kinds of wide networks it still holds that almost all local minima are globally optimal. The reason is that one can expect linear separability of the training data in the wide layer. We provide supporting evidence for this conjecture by showing that basically every critical point for which the training data is linearly separable in the wide layer is globally optimal. Moreover, we want to emphasize that all of our results hold for neural networks as used in practice; there are no simplifying assumptions as in previous work.
2 Feedforward Neural Networks and Backpropagation
We are mainly concerned with multi-class problems, but our results also apply to multivariate regression problems. Let $N$ be the number of training samples and denote by $X \in \mathbb{R}^{N \times d}$ resp. $Y \in \mathbb{R}^{N \times m}$ the input resp. output matrix for the training data $(x_i, y_i)_{i=1}^N$, where $d$ is the input dimension and $m$ the number of classes. We consider fully connected feedforward networks with $L$ layers, indexed from $0, 1, \ldots, L$, which correspond to the input layer, first hidden layer, etc., and output layer. The network structure is determined by the weight matrices $W_k \in \mathbb{R}^{n_{k-1} \times n_k}$, $k \in [L]$, where $n_k$ is the number of hidden units of layer $k$ (for consistency, we set $n_0 = d$ and $n_L = m$), and the bias vectors $b_k \in \mathbb{R}^{n_k}$. We denote the space of all possible parameters of the network by $\mathcal{P}$. In this paper, $[n]$ denotes the set of integers $\{1, \ldots, n\}$ and $[a, b]$ the set of integers from $a$ to $b$. The activation function $\sigma$ is assumed at least to be continuously differentiable, that is $\sigma \in C^1(\mathbb{R})$, and is applied componentwise. Let $f_k : \mathbb{R}^d \to \mathbb{R}^{n_k}$ be the mappings from the input space to the feature space at layer $k$, which are defined as
$$f_k(x) = \sigma\big(W_k^T f_{k-1}(x) + b_k\big) \quad \text{for every } k \in [L], \qquad f_0(x) = x.$$
In the following, let $F_k \in \mathbb{R}^{N \times n_k}$ and $G_k \in \mathbb{R}^{N \times n_k}$ be the matrices that store the feature vectors of layer $k$ after and before applying the activation function. One can easily check that
$$G_k = F_{k-1} W_k + \mathbf{1}_N b_k^T, \qquad F_k = \sigma(G_k), \qquad F_0 = X.$$
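The layer-wise recursion above can be sketched in a few lines of numpy; this is our own illustration, and the variable names, shapes, and the choice of $\tanh$ as activation are not part of the paper:

```python
import numpy as np

def forward(X, weights, biases, sigma=np.tanh):
    """Compute the feature matrices F_k = sigma(G_k) with
    G_k = F_{k-1} W_k + 1_N b_k^T for a fully connected network."""
    F = X                      # F_0 = X, shape (N, d)
    features = [F]
    for W, b in zip(weights, biases):
        G = F @ W + b          # pre-activation, shape (N, n_k); b broadcasts as 1_N b^T
        F = sigma(G)           # componentwise activation
        features.append(F)
    return features

# toy example: N = 4 samples, d = 3 inputs, hidden widths 5 and 2
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Ws = [rng.standard_normal((3, 5)), rng.standard_normal((5, 2))]
bs = [np.zeros(5), np.zeros(2)]
feats = forward(X, Ws, bs)     # [F_0, F_1, F_2]
```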
In this paper we analyze the behavior of the loss of the network without any form of regularization, that is, the final objective of the network is defined as
$$\Phi(W_1, b_1, \ldots, W_L, b_L) = \sum_{i=1}^{N} \sum_{j=1}^{m} l\big( (f_L(x_i))_j - (y_i)_j \big), \qquad (1)$$
where $l : \mathbb{R} \to \mathbb{R}$ is assumed to be a continuously differentiable loss function, that is $l \in C^1(\mathbb{R})$. The prototype loss which we consider in this paper is the squared loss, $l(a) = a^2$, which is one of the standard loss functions in the neural network literature. We assume throughout this paper that the minimum of (1) is attained.
The idea of backpropagation is at the core of our theoretical analysis. Lemma 2.1 below collects well-known relations for feedforward neural networks, which are used throughout the paper. The derivative of the loss w.r.t. the pre-activation value of unit $j$ at layer $k$, evaluated at a single training sample $x_i$, is denoted as $(\Delta_k)_{ij} = \frac{\partial \Phi}{\partial (g_k(x_i))_j}$. We arrange these derivatives for all training samples into a single matrix $\Delta_k \in \mathbb{R}^{N \times n_k}$. In the following we use the Hadamard product $\circ$, which for $A, B \in \mathbb{R}^{N \times m}$ is defined as $A \circ B \in \mathbb{R}^{N \times m}$ with $(A \circ B)_{ij} = A_{ij} B_{ij}$.
Lemma 2.1. Let $k \in [L]$. Then it holds:
1. $\Delta_L = l'(F_L - Y) \circ \sigma'(G_L)$,
2. $\Delta_k = \sigma'(G_k) \circ \big( \Delta_{k+1} W_{k+1}^T \big)$ for every $k \in [L-1]$,
3. $\nabla_{W_k} \Phi = F_{k-1}^T \Delta_k$ for every $k \in [L]$,
4. $\nabla_{b_k} \Phi = \Delta_k^T \mathbf{1}_N$ for every $k \in [L]$.
These identities follow by applying the chain rule to the layer-wise recursion $G_k = F_{k-1} W_k + \mathbf{1}_N b_k^T$, $F_k = \sigma(G_k)$.
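These backpropagation identities can be checked against numerical differentiation. The following sketch is ours, not part of the paper: it uses the squared loss and $\tanh$, and the matrix-form identities in the comments follow our reconstruction of the notation:

```python
import numpy as np

def forward(X, Ws, bs, sigma=np.tanh):
    """Return the lists (F_0..F_L) and (G_1..G_L) of the layer recursion."""
    Fs, Gs, F = [X], [], X
    for W, b in zip(Ws, bs):
        G = F @ W + b
        F = sigma(G)
        Gs.append(G)
        Fs.append(F)
    return Fs, Gs

def loss(X, Y, Ws, bs):
    return np.sum((forward(X, Ws, bs)[0][-1] - Y) ** 2)   # l(a) = a^2, entrywise

def backprop(X, Y, Ws, bs):
    """Gradients dPhi/dW_k via Lemma 2.1:
    Delta_L = l'(F_L - Y) o sigma'(G_L),
    Delta_k = sigma'(G_k) o (Delta_{k+1} W_{k+1}^T),
    dPhi/dW_k = F_{k-1}^T Delta_k."""
    Fs, Gs = forward(X, Ws, bs)
    dsig = lambda G: 1.0 - np.tanh(G) ** 2
    Delta = 2.0 * (Fs[-1] - Y) * dsig(Gs[-1])
    dWs = [None] * len(Ws)
    for k in range(len(Ws) - 1, -1, -1):
        dWs[k] = Fs[k].T @ Delta
        if k > 0:
            Delta = dsig(Gs[k - 1]) * (Delta @ Ws[k].T)
    return dWs

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((5, 3)), rng.standard_normal((5, 2))
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
bs = [np.zeros(4), np.zeros(2)]
dWs = backprop(X, Y, Ws, bs)

# central finite-difference check of one entry of the gradient w.r.t. W_1
eps = 1e-6
Wp = [Ws[0].copy(), Ws[1]]; Wm = [Ws[0].copy(), Ws[1]]
Wp[0][0, 0] += eps; Wm[0][0, 0] -= eps
fd = (loss(X, Y, Wp, bs) - loss(X, Y, Wm, bs)) / (2 * eps)
```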
Note that Lemma 2.1 does not apply to non-differentiable activation functions like the ReLU function, $\sigma(x) = \max\{0, x\}$. However, it is known that one can approximate this activation function arbitrarily well by a smooth function: e.g., $\sigma_\alpha(x) = \frac{1}{\alpha} \log\big(1 + e^{\alpha x}\big)$ (a.k.a. softplus) satisfies $|\sigma_\alpha(x) - \max\{0, x\}| \le \frac{\log 2}{\alpha}$ for any $x \in \mathbb{R}$ and $\alpha > 0$.
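This uniform approximation bound is easy to verify numerically; the grid and the values of $\alpha$ below are arbitrary choices of ours:

```python
import numpy as np

def softplus(x, alpha):
    # numerically stable (1/alpha) * log(1 + exp(alpha * x))
    return np.logaddexp(0.0, alpha * x) / alpha

x = np.linspace(-5.0, 5.0, 1001)      # grid containing x = 0
relu = np.maximum(0.0, x)
# uniform gap sup_x |softplus_alpha(x) - relu(x)| = log(2)/alpha, attained at x = 0
gap = {a: np.max(np.abs(softplus(x, a) - relu)) for a in (1.0, 10.0, 100.0)}
```

The gap shrinks like $1/\alpha$, so the ReLU network is recovered in the limit $\alpha \to \infty$.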
3 Main Result
We first discuss some prior work and then present our main result together with an extensive discussion. For improved readability, we postpone the proof of the main result to the next section, which contains several intermediate results of independent interest.
3.1 Previous Work
Our work can be seen as a generalization of the work of (Gori & Tesi, 1992; Yu & Chen, 1995). While (Yu & Chen, 1995) have shown for a one-hidden-layer network that if $n_1 \ge N$, then every local minimum is a global minimum, the work of (Gori & Tesi, 1992) also considered multi-layer networks. For the convenience of the reader, we first restate Theorem 1 of (Gori & Tesi, 1992) using our previously introduced notation. The critical points of a continuously differentiable function are the points where the gradient vanishes, that is $\nabla \Phi = 0$. Note that this is a necessary condition for a local minimum.
While this result already covers general multi-layer networks, the condition "$X^T \Delta_1 = 0$ implies $\Delta_1 = 0$" is the main caveat. It is already noted in (Gori & Tesi, 1992) that "it is quite hard to understand its practical meaning", as it requires prior knowledge of $\Delta_1$ at every critical point. Note that this is almost impossible, as $\Delta_1$ depends on all the weights of the network. For a particular case, when the training samples (with biases added) are linearly independent, i.e. $\operatorname{rank}\big([X, \mathbf{1}_N]\big) = N$, the condition holds automatically. This case is discussed in the following Theorem 3.4, where we consider a more general class of loss and activation functions.
3.2 First Main Result and Discussion
A function $f$ is real analytic if the corresponding multivariate Taylor series converges to $f$ on an open subset of its domain (Krantz & Parks, 2002). All results in this section are proven under the following assumptions on the loss/activation function and training data.
1. There are no identical training samples, i.e. $x_i \neq x_j$ for all $i \neq j$.
2. The activation function $\sigma$ is analytic on $\mathbb{R}$, strictly monotonically increasing, and
 (a) $\sigma$ is bounded, or
 (b) there are positive constants $\rho_1, \rho_2, \rho_3, \rho_4$ such that $|\sigma'(x)| \le \rho_1 e^{\rho_2 x}$ for $x \le 0$ and $|\sigma'(x) - 1| \le \rho_3 e^{-\rho_4 x}$ for $x \ge 0$.
3. The loss $l$ is continuously differentiable, and if $l'(a) = 0$, then $a$ is a global minimum of $l$.
These conditions are not always necessary to prove some of the intermediate results presented below, but we decided to provide the proofs under the above strong assumptions for better readability. For instance, all of our results also hold for strictly monotonically decreasing activation functions. Note that the above conditions are not restrictive, as many standard activation functions satisfy them.
The sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$, the hyperbolic tangent $\tanh(x)$, and the softplus function $\sigma_\alpha(x) = \frac{1}{\alpha} \log(1 + e^{\alpha x})$ for $\alpha > 0$ satisfy Assumption 3.2.
Proof: Note that $\tanh(x) = 2\sigma(2x) - 1$. Moreover, it is well known that $x \mapsto \frac{1}{x}$ is real-analytic on $(0, \infty)$, and the exponential function is analytic with values in $(0, \infty)$. As the composition of real-analytic functions is real-analytic (see Prop 1.4.2 in (Krantz & Parks, 2002)), we get that $\sigma$ and $\tanh$ are real-analytic. Similarly, since the logarithm is real-analytic on $(0, \infty)$ and its composition with the exponential function is real-analytic, we get that $\sigma_\alpha$ is a real-analytic function.
Finally, we note that $\sigma$, $\tanh$ and $\sigma_\alpha$ are strictly monotonically increasing. Since $\sigma$ and $\tanh$ are bounded, they both satisfy Assumption 3.2. For $\sigma_\alpha$, we note that $\sigma_\alpha'(x) = \frac{e^{\alpha x}}{1 + e^{\alpha x}}$, and thus it holds for every $x \le 0$ that
$$|\sigma_\alpha'(x)| = \frac{e^{\alpha x}}{1 + e^{\alpha x}} \le e^{\alpha x},$$
and with $1 - \sigma_\alpha'(x) = \frac{1}{1 + e^{\alpha x}}$ it holds for every $x \ge 0$ that
$$|\sigma_\alpha'(x) - 1| = \frac{1}{1 + e^{\alpha x}} \le e^{-\alpha x},$$
which implies that $\sigma_\alpha$ satisfies Assumption 3.2.
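The derivative bounds for the softplus function claimed above (in our reconstruction of the garbled formulas) can be checked numerically; $\alpha = 2$ and the grids are arbitrary choices:

```python
import numpy as np

sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))
alpha = 2.0
dsoftplus = lambda x: sigmoid(alpha * x)   # sigma'_alpha(x) = e^{ax} / (1 + e^{ax})

x_neg = np.linspace(-30.0, 0.0, 301)
x_pos = np.linspace(0.0, 30.0, 301)
# bounds: sigma'_alpha(x) <= e^{alpha x} for x <= 0,
# and 1 - sigma'_alpha(x) <= e^{-alpha x} for x >= 0
ok_neg = bool(np.all(dsoftplus(x_neg) <= np.exp(alpha * x_neg) + 1e-15))
ok_pos = bool(np.all(1.0 - dsoftplus(x_pos) <= np.exp(-alpha * x_pos) + 1e-15))
# strict monotonicity: the softplus derivative is positive everywhere
ok_mono = bool(np.all(dsoftplus(np.linspace(-30, 30, 601)) > 0))
```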
The conditions on $l$ are satisfied by any twice continuously differentiable convex loss function. A typical example is the squared loss $l(a) = a^2$, or the Pseudo-Huber loss (Hartley & Zisserman, 2004) given as
$$l(a) = 2b^2\Big(\sqrt{1 + \tfrac{a^2}{b^2}} - 1\Big),$$
which approximates $a^2$ for small $|a|$ and is linear with slope $2b$ for large $|a|$. But non-convex loss functions also satisfy this requirement, for instance:
Blake-Zisserman: $l(a) = -\log\big(e^{-a^2} + \epsilon\big)$ for $\epsilon > 0$. For small $|a|$, this curve approximates $a^2$, whereas for large $|a|$ the asymptotic value is $-\log \epsilon$. This function computes the negative log-likelihood of a Gaussian mixture model.
Cauchy: $l(a) = b^2 \log\big(1 + \tfrac{a^2}{b^2}\big)$ for $b \neq 0$. This curve approximates $a^2$ for small $|a|$, and the value of $b$ determines for what range of $a$ this approximation is close.
We refer to (Hartley & Zisserman, 2004) (p.617-p.619) for more examples and discussion on robust loss functions.
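For illustration, the three robust losses can be evaluated numerically. The formulas below follow our reconstruction of the garbled definitions, and the scale parameters are arbitrary:

```python
import numpy as np

# b and eps are scale parameters (our own choices for illustration)
pseudo_huber    = lambda a, b: 2 * b**2 * (np.sqrt(1 + (a / b) ** 2) - 1)
blake_zisserman = lambda a, eps: -np.log(np.exp(-a**2) + eps)
cauchy          = lambda a, b: b**2 * np.log1p((a / b) ** 2)

a = 1e-3
small_ph = pseudo_huber(a, 1.0)            # ~ a^2 near the origin
small_cy = cauchy(a, 1.0)                  # ~ a^2 near the origin
saturation = blake_zisserman(100.0, 0.1)   # -> -log(eps) for large |a|
```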
As a motivation for our main result, we first analyze the case when the training samples are linearly independent, which requires $N \le d + 1$. It can be seen as a generalization of Corollary 1 in (Gori & Tesi, 1992).
Proof: The proof is based on induction. At a critical point it holds $\nabla_{W_1} \Phi = X^T \Delta_1 = 0$ and $\nabla_{b_1} \Phi = \Delta_1^T \mathbf{1}_N = 0$, and hence
$$[X, \mathbf{1}_N]^T \Delta_1 = 0.$$
By assumption, the data matrix $[X, \mathbf{1}_N]^T$ has full column rank, and thus $\Delta_1 = 0$. Using induction, let us assume that $\Delta_k = 0$ for some $k \in [L-1]$; then by Lemma 2.1 we have
$$\sigma'(G_k) \circ \big(\Delta_{k+1} W_{k+1}^T\big) = 0.$$
As $\sigma'$ is by assumption strictly positive, this is equivalent to $\Delta_{k+1} W_{k+1}^T = 0$. As $W_{k+1}$ by assumption has full column rank, it follows that $\Delta_{k+1} = 0$. Finally, we get $\Delta_L = 0$. With Lemma 2.1 we thus get
$$0 = \Delta_L = l'(F_L - Y) \circ \sigma'(G_L),$$
which implies, with the same argument as above, $l'(F_L - Y) = 0$. From our Assumption 3.2, it holds that if $l'(a) = 0$, then $a$ is a global minimum of $l$. Thus each individual entry of $F_L - Y$ must represent a global minimum of $l$. This combined with (1) implies that the critical point must be a global minimum of $\Phi$.
Theorem 3.4 implies that the weight matrices of potential saddle points or suboptimal local minima need to have low rank in at least one layer. Note, however, that the set of low-rank matrices in $\mathbb{R}^{n_{k-1} \times n_k}$ has measure zero. At the moment we cannot prove that suboptimal low-rank local minima do not exist. However, their existence seems implausible, as every neighborhood of such a point contains full-rank matrices, which increase the expressiveness of the network. Thus it should be possible to use this degree of freedom to further reduce the loss, which contradicts the definition of a local minimum. We therefore conjecture that all local minima are indeed globally optimal.
The main restriction in the assumptions of Theorem 3.4 is the linear independence of the training samples, as it requires $N \le d + 1$, which is very restrictive in practice. In this section we prove a similar guarantee in our main Theorem 3.8 by implicitly transporting this condition to some higher layer. A similar guarantee has been proven by (Yu & Chen, 1995) for a single-hidden-layer network, whereas we consider general multi-layer networks. The main ingredient of the proof of our main result is the observation in the following lemma.
Proof: By Lemma 2.1 it holds that
$$\nabla_{W_{k+1}} \Phi = F_k^T \Delta_{k+1} = 0.$$
By our assumption, it holds that $\operatorname{rank}(F_k) = N$, which implies $\Delta_{k+1} = 0$. Since the weight matrices of the layers above $k+1$ have full column rank, we can apply a similar induction argument as in the proof of Theorem 3.4 to arrive at $l'(F_L - Y) = 0$, and thus a global minimum.
The first condition of Lemma 3.5 can be seen as a generalization of the requirement of linearly independent training inputs in Theorem 3.4 to linear independence of the feature vectors at a hidden layer. Lemma 3.5 suggests that if we want to make statements about the global optimality of critical points, it is sufficient to know when and for which critical points these conditions are fulfilled. The third condition is trivially satisfied at a critical point, and the requirement of full column rank of the weight matrices is similar to Theorem 3.4. However, the first condition may not be fulfilled, since $F_k$ depends not only on the weights but also on the architecture. The main difficulty of the proof of our following main theorem is to show that this first condition holds under the rather simple requirement $n_k \ge N$ for a subset of all critical points.
But before we state the theorem, we have to discuss a particular notion of non-degenerate critical point.
Definition 3.6 (Block Hessian)
Let $f : \Omega \to \mathbb{R}$ be a twice continuously differentiable function defined on some open domain $\Omega \subseteq \mathbb{R}^n$. The Hessian of $f$ w.r.t. a subset of variables $S \subseteq [n]$ is denoted by $\nabla_S^2 f$. When $S = [n]$, we write $\nabla^2 f$ to denote the full Hessian matrix.
We use this to introduce a slightly more general notion of non-degenerate critical point.
Definition 3.7 (Non-degenerate critical point)
Let $f : \Omega \to \mathbb{R}$ be a twice continuously differentiable function defined on some open domain $\Omega \subseteq \mathbb{R}^n$, and let $x^* \in \Omega$ be a critical point, i.e. $\nabla f(x^*) = 0$. Then $x^*$
- is non-degenerate for a subset of variables $S \subseteq [n]$ if $\nabla_S^2 f(x^*)$ is non-singular;
- is non-degenerate if $\nabla^2 f(x^*)$ is non-singular.
Note that a non-degenerate critical point might be degenerate for a subset of variables, and, vice versa, non-degeneracy on a subset of variables does not necessarily imply non-degeneracy on the whole set. For instance, consider $f(x, y) = 2xy$ and $g(x, y) = x^2 + y^4$ at the origin. Clearly, $\nabla f(0,0) = 0$ and $\nabla^2 f(0,0)$ is non-singular, but $\nabla^2_{\{x\}} f(0,0) = 0$ is singular; conversely, $\nabla^2 g(0,0)$ is singular, but $\nabla^2_{\{x\}} g(0,0) = 2$ is non-singular. The concept of non-degeneracy on a subset of variables is crucial for the following statement of our main result.
First of all, we note that the full column rank condition on the weight matrices in Theorems 3.4 and 3.8 implicitly requires that $n_{k+1} \ge n_{k+2} \ge \ldots \ge n_L$. This means the network needs to have a pyramidal structure from layer $k+1$ to the output layer. It is interesting to note that most modern neural network architectures have a pyramidal structure from some layer on, typically the first hidden layer. Thus this is not a restrictive requirement. Indeed, one can even argue that Theorem 3.8 gives an implicit justification, as it hints at the fact that such networks are easy to train if one layer is sufficiently wide.
Note that Theorem 3.8 does not require fully non-degenerate critical points; non-degeneracy is only needed for some subset of variables that includes layer $k+1$. As a consequence of Theorem 3.8, we directly get a stronger result for non-degenerate local minima.
Proof: The Hessian at a non-degenerate local minimum is positive definite, and every principal submatrix of a positive definite matrix is again positive definite, in particular the one corresponding to the subset of variables $S$. Application of Theorem 3.8 then yields the result.
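The linear-algebra fact used in this proof (principal submatrices of a positive definite matrix are positive definite) is easy to verify numerically; the matrix $H$ below is an arbitrary stand-in of our own, not a network Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
H = M @ M.T + 6.0 * np.eye(6)   # a generic positive definite "Hessian"

is_pd = lambda A: bool(np.all(np.linalg.eigvalsh(A) > 0))

S = [1, 3, 4]                    # an arbitrary subset of variables
H_S = H[np.ix_(S, S)]            # principal submatrix = "Hessian w.r.t. S"
```

Indeed, for any $v \neq 0$ supported on $S$, $v_S^T H_S v_S = v^T H v > 0$, which is exactly what the numerical check confirms.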
Let us discuss the implications of these results. First, note that Theorem 3.8 is slightly weaker than Theorem 3.4, as it additionally requires non-degeneracy w.r.t. a set of variables including layer $k+1$. Moreover, similar to Theorem 3.4, it does not exclude the possibility of suboptimal local minima of low rank in the layers above layer $k+1$. On the other hand, it also makes very strong statements. In fact, if $n_k \ge N$ for some $k$, then even degenerate saddle points/local maxima are excluded, as long as they are non-degenerate with respect to some subset of parameters of the upper layers that includes layer $k+1$ and the rank condition holds. Thus, given that the weight matrices of the upper layers have full column rank, there is not much room left for degenerate saddle points/local maxima. Moreover, for a one-hidden-layer network with $n_1 \ge N$, every critical point that is non-degenerate with respect to the output layer parameters is a global minimum, as the full rank condition is not active for one-hidden-layer networks.
Concerning the non-degeneracy condition of our main Theorem 3.8, one might ask how likely it is to encounter degenerate critical points of a smooth function. This is answered by an application of Sard's/Morse theorem in (Milnor, 1965).
Theorem 3.10 (A. Morse, p.11)
Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable. Then for almost all $w \in \mathbb{R}^n$ with respect to the Lebesgue measure, the function $f_w$ defined as $f_w(x) = f(x) - \langle w, x \rangle$ has only non-degenerate critical points.
Note that the theorem would still hold if one were to draw $w$ uniformly at random from the ball $B(0, \epsilon)$ for any $\epsilon > 0$. Thus almost every arbitrarily small linear perturbation of a function has only non-degenerate critical points. This result indicates that exactly degenerate points should be rare. Note, however, that in practice the Hessian at critical points can be close to singular (at least up to numerical precision), which might affect the training of neural networks negatively (Sagun et al., 2016).
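A one-dimensional instance of this phenomenon (our own illustration, not from the paper): $f(x) = x^3$ has a degenerate critical point, but any linear perturbation $f_w(x) = x^3 - wx$ with $w > 0$ has only non-degenerate critical points:

```python
import numpy as np

# f(x) = x^3 has a single critical point at x = 0, which is degenerate (f''(0) = 0).
# For the perturbation f_w(x) = x^3 - w x with w > 0, the critical points are
# x = +/- sqrt(w / 3), where f_w''(x) = 6 x != 0, hence both are non-degenerate.
w = 0.12
crit = np.array([-np.sqrt(w / 3.0), np.sqrt(w / 3.0)])
grad = 3.0 * crit**2 - w   # f_w'(x) = 3 x^2 - w, vanishes at both points
hess = 6.0 * crit          # f_w''(x), nonzero at both points
```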
As we argued for Theorem 3.4, our main Theorem 3.8 does not exclude the possibility of suboptimal degenerate local minima or suboptimal local minima of low rank. However, we conjecture that the second case cannot happen, as every neighborhood of such a local minimum contains full-rank matrices, which increase the expressiveness of the network; this additional flexibility can be used to reduce the loss, which contradicts the definition of a local minimum.
As mentioned in the introduction, the condition $n_k \ge N$ looks very strong at first sight. However, in practice networks are often used in which one hidden layer is rather wide, that is, $n_k$ is on the order of $N$ (typically it is the first hidden layer of the network). As the condition of Theorem 3.8 is sufficient and not necessary, one can expect for continuity reasons that the loss surface of networks for which the condition holds approximately is still rather well behaved, in the sense that most local minima are globally optimal and the suboptimal ones are not far away from the globally optimal ones.
4 Proof of Main Result
For better readability, we first prove our main Theorem 3.8 for the special case where $S$ is the whole set of upper layers, i.e. $S = \{(W_j, b_j) \mid j \ge k+1\}$, and then show how to extend the proof to the general case where $S$ is a subset of the upper-layer parameters. Our proof strategy is as follows. We first show that the outputs of each layer are real analytic functions of the network parameters. Then we prove that there exists a set of parameters such that $\operatorname{rank}(F_k) = N$. Using properties of real analytic functions, we conclude that the set of parameters where $\operatorname{rank}(F_k) < N$ has measure zero. Then, with the non-degeneracy condition, we can apply the implicit function theorem to conclude that even if $\operatorname{rank}(F_k) < N$ at a critical point, in any neighborhood of it there still exists a point where the conditions of Lemma 3.5 hold and the loss is minimal. By continuity of the loss, this implies that the loss must also be minimal at the critical point.
We introduce some notation frequently used in the proofs. Let $B(x, r)$ denote the open ball in $\mathbb{R}^n$ of radius $r$ around $x$.
If the Assumptions 3.2 hold, then the outputs $(F_k)_{k \in [L]}$ of each layer are real analytic functions of the network parameters on the whole parameter space.
Proof: Any linear function is real analytic, and the set of real analytic functions is closed under addition, multiplication and composition; see e.g. Prop. 2.2.2 and Prop. 2.2.8 in (Krantz & Parks, 2002). As we assume that the activation function is real analytic, we get that all the output functions of the neural network are real analytic functions of the parameters, as compositions of real analytic functions.
The concept of real analytic functions is important in our proofs as these functions can never be “constant” in a set of the parameter space which has positive measure unless they are constant everywhere. This is captured by the following lemma.
In the next lemma we show that there exist network parameters such that $\operatorname{rank}(F_k) = N$ holds whenever $n_k \ge N$. Note that this is only possible due to the use of non-linear activation functions. For deep linear networks, it is not possible for $F_k$ to achieve maximum rank if the layers below it are not sufficiently wide. To see this, consider $F_k = F_{k-1} W_k + \mathbf{1}_N b_k^T$ for a linear network; then $\operatorname{rank}(F_k) \le \min\{\operatorname{rank}(F_{k-1}), n_k\} + 1$, since the addition of a rank-one term does not increase the rank of a matrix by more than one. By using induction, one gets $\operatorname{rank}(F_k) \le \min\{d, n_1, \ldots, n_k\} + k$ for every $k \in [L]$.
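This rank obstruction for linear networks can be checked directly in numpy; the data, widths, and the narrow first layer below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 20, 5
X = rng.standard_normal((N, d))

# linear network F_k = F_{k-1} W_k + 1_N b_k^T with a narrow first layer:
# each bias adds at most rank one, so rank(F_k) <= min(d, n_1, ..., n_k) + k,
# and no later wide layer can recover full rank N
widths = [4, 30, 30]
F = X
for n in widths:
    W = rng.standard_normal((F.shape[1], n))
    b = rng.standard_normal(n)
    F = F @ W + np.outer(np.ones(N), b)

rank = np.linalg.matrix_rank(F)
bound = min([d] + widths) + len(widths)   # here 4 + 3 = 7 < N = 20
```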
The existence of network parameters where $\operatorname{rank}(F_k) = N$, together with the previous lemma, will then be used to show that the set of network parameters where $\operatorname{rank}(F_k) < N$ has measure zero.
If the Assumptions 3.2 hold and $n_k \ge N$ for some $k \in [L]$, then there exists at least one set of parameters such that $\operatorname{rank}(F_k) = N$.
Proof: We first show by induction that there always exists a set of parameters such that $F_l$ has distinct rows for every $l \in [k-1]$. Indeed, we have $F_1 = \sigma\big(XW_1 + \mathbf{1}_N b_1^T\big)$. As $\sigma$ is strictly monotonic and thus bijective on its domain, $F_1$ has distinct rows if and only if the rows of $XW_1 + \mathbf{1}_N b_1^T$ are distinct. Let us denote the first column of $W_1$ by $w$; then the existence of $w$ for which
$$\langle w, x_i - x_j \rangle \neq 0 \quad \text{for all } i \neq j \qquad (2)$$
would imply the result. Note that by assumption $x_i \neq x_j$ for all $i \neq j$. Then for each pair $i \neq j$ the set
$$\{ w \mid \langle w, x_i - x_j \rangle = 0 \}$$
is a hyperplane, which has measure zero, and thus the set where condition (2) fails corresponds to a finite union of hyperplanes, which again has measure zero. Thus there always exists a vector $w$ such that condition (2) is satisfied, and hence there exists $W_1$ such that the rows of $F_1$ are distinct. Now, assume that $F_l$ has distinct rows for some $l \in [k-2]$; then by the same argument as above we need to construct $W_{l+1}$ such that the rows of $F_l W_{l+1} + \mathbf{1}_N b_{l+1}^T$ are distinct. By construction the rows of $F_l$ are distinct, and thus with the same argument as above we can choose the first column of $W_{l+1}$ such that this condition holds. As a result, there exists a set of parameters so that $F_{k-1}$ has distinct rows.

Now, given that $F_{k-1}$ has distinct rows, we show how to construct $W_k$ in such a way that $F_k$ has full row rank. Since $n_k \ge N$, it is sufficient to make the first $N$ columns of $F_k$ become linearly independent. In particular, let $F_k = [A, B]$, where $A \in \mathbb{R}^{N \times N}$ and $B \in \mathbb{R}^{N \times (n_k - N)}$ are the matrices containing the outputs of the first $N$ hidden units and the last $n_k - N$ hidden units of layer $k$, respectively. As mentioned above, we just need to show that there exists $W_k$ so that $A$ is non-singular, because then it follows immediately that $\operatorname{rank}(F_k) = N$. Pick any $w$ satisfying
$$\langle w, f_{k-1}(x_1) \rangle < \langle w, f_{k-1}(x_2) \rangle < \ldots < \langle w, f_{k-1}(x_N) \rangle,$$
potentially after reordering the training samples, w.l.o.g. By the discussion above such a vector always exists, since the complementary set is contained in a finite union of hyperplanes, which has measure zero.
We first prove the result for the case where $\sigma$ is bounded. Since $\sigma$ is bounded and strictly monotonically increasing, there exist two finite values $\underline{\sigma} < \overline{\sigma}$ such that $\lim_{x \to -\infty} \sigma(x) = \underline{\sigma}$ and $\lim_{x \to \infty} \sigma(x) = \overline{\sigma}$.
Moreover, since $\sigma$ is strictly monotonically increasing, it holds for every $x$ that $\underline{\sigma} < \sigma(x) < \overline{\sigma}$. Pick some $\alpha > 0$ and scale the weight vectors of the first $N$ hidden units of layer $k$ by $\alpha$. Note that the matrix $A$ changes as we vary $\alpha$; thus, we consider a family of matrices $A(\alpha)$. Let $\tilde{A}(\alpha)$ be the modified matrix in which one subtracts from every row the last row of $A(\alpha)$, in particular
$$\tilde{A}(\alpha)_{ij} = A(\alpha)_{ij} - A(\alpha)_{Nj},$$
then it holds