Despite the rapidly growing list of successful applications of neural networks trained with back-propagation in various fields, from computer vision to speech recognition [7], our theoretical understanding of these elaborate systems is developing at a more modest pace.
One of the major difficulties in the design of neural networks lies in the fact that, to obtain networks with greater expressive power, one needs to cascade more and more layers to make them “deeper”, in the hope of extracting more “abstract” features from the (numerous) training data. Nonetheless, from an optimization viewpoint, this “deeper” structure gives rise to non-convex loss functions and makes the optimization dynamics seemingly intractable. In general, finding a global minimum of a generic non-convex function is an NP-complete problem, which is the case for neural networks, as shown in  on a very simple network.
the author proved that, under an appropriate rank condition on the (cascading) matrix product, all critical points of a deep linear neural network are either global minima or saddle points whose Hessian admits eigenvalues of different signs, meaning that deep linear networks are somehow “close” to those examples mentioned at the beginning of this paragraph.
Nonetheless, most existing results are built on the common belief that the simple gradient descent algorithm converges to critical points, i.e., points with vanishing gradient, which may not always be the case. Consider the following example: let
Clearly, is non negative and its critical points are all global minima corresponding to the plane . It is also not difficult to see that trajectories starting at with in and nonzero are not bounded. In particular, the set of divergent initializations has positive (Lebesgue) measure in .
It is thus of interest to see whether, in the special case of neural networks (linear or nonlinear), a similar phenomenon happens, namely whether there exists a set of initializations of nonzero measure leading to unbounded gradient descent trajectories. Answering this question would close the loophole left in previous analyses, where convergence to critical points is widely assumed.
In this article we present a geometric framework to study the global convergence properties of simple gradient algorithms for neural networks. We elaborate on the model from [21, 12] and analyze the dynamics of the underlying gradient system. Based on the characterization of appropriate quantities that are conserved along the gradient descent and that are induced by the cascading structure of the network, we establish in Section 2 the global convergence to critical points via Lojasiewicz’s theorem in the case of linear networks; similar results are established in Section 5 for nonlinear networks. As a consequence, the study of the convergence properties of the underlying gradient system can be further reduced to the “local” study of each critical point, in particular of the associated basin of attraction. In Section 3 we provide a more detailed characterization of the critical points, and in Section 4 we discuss the associated basins of attraction and the Hessians of saddle points. In particular, we prove that in the single-hidden-layer case , the Hessian of every critical point admits a negative eigenvalue, while this is not true in the case of for a large number of critical points. Note also that we handle first-order variations of the gradient descent system with computations on the Hessian of the loss function, which provides a much more flexible approach than that developed in . The ultimate goal consists in actually evaluating the “size” of the union of basins of attraction over all the critical points. At this point it is tempting to state the following conjecture: the union of basins of attraction over all the critical points has zero Lebesgue measure. Even in the case of we are not yet able to prove (or disprove) the conjecture.
Notations: We denote the time derivative, the transpose operator. denotes the kernel of a linear map, i.e., . and the rank and trace operator, respectively, for a (square) matrix.
Acknowledgements: The authors are indebted to Jérôme Bolte for his help in finding the example provided in Eq. (1).
2 System Model and Main Result
2.1 Problem Setup
We start with a linear neural network with hidden layers as illustrated in Figure 1. To begin with, the network structure as well as associated notations are presented as follows.
Let the pair denote the training data and associated targets, with and , where denotes the number of instances in the training set and the dimensions of data and targets, respectively. We denote the weight matrix that connects to for and set , as in Figure 1. The output of network for the training set is therefore given by
We further denote the -tuple of for simplicity and work on the mean square error given by the following Frobenius norm,
under the following assumptions:
Assumption 1 (Dimension Condition).
Assumption 2 (Full Rank Data and Targets).
The matrices and are of full (row) rank, i.e., of rank and , respectively, according to Assumption 1.
Assumptions 1 and 2 on the dimension and rank of the training data are realistic and easy to satisfy in practice, as discussed in previous works [3, 12]. (Assumption 1 is imposed here for convenience, and our results can be extended to handle more elaborate dimension settings. Similarly, when the training data is rank deficient, the learning problem can be reduced to a lower-dimensional one by removing the non-informative data in such a way that Assumption 2 holds.)
, with the singular value decomposition (SVD) onwe obtain
for diagonal and positive definite so that
with the -tuples of for . By similarly expanding the effective loss of the network according to (the singular value of) the associated reduced target with , for diagonal and positive definite so that
where we denote , and for . Therefore the state space of is equal to . (The network (weight) parameters evolve through time and are considered to be state variables of the dynamical system, while the pair is fixed and thus referred to as the “parameters” of the given system.)
With the above notations, we demand in addition the following assumption on the target .
Assumption 3 (Distinct Singular Values).
The target has distinct singular values.
The objective of this article is to study the gradient descent  dynamics (GDD) defined as
Definition 1 (GDD).
The Gradient Descent Dynamics of is the dynamical system defined on by
where denotes the gradient of the loss function with respect to . A point is a critical point of if and only if and we denote the set of critical points.
To facilitate further discussion, we drop the bars on ’s and sometimes the argument and introduce the following notations.
For , we consider the weight matrix and the corresponding variation of the same size. For simplicity, we denote and the -tuples of and , respectively. For two indices , we use to denote the product if and the appropriate identity if so that the whole product writes and consequently
For , and , we use and if to denote the following products
For instance, we have, for ,
We can use the above notations to derive the first-order variation of the loss function and hence the GDD equations. To this end, set
where stands for polynomial terms of order equal or larger than two in the ’s. We thus obtain, for ,
We first remark the following interesting (and crucial to what follows) property of the gradient system (6), inspired by  which essentially considered the case where all dimensions are equal to one.
Lemma 1 (Invariant in GDD).
Consider any trajectory of the gradient system given by (6). Then, for , the value of remains constant for , i.e.,
for . As a consequence, there exist constant real numbers , , such that, along a trajectory of the gradient system given by (6), one has for ,
Moreover, there exists a positive constant only depending on and on the dimensions involved in the problem such that, for every trajectory of the gradient system given by (6), there exist two polynomials and of degree with nonnegative coefficients (depending on the initialization) such that, for every ,
With the above notations together with (6), one gets
Equation (9) is established by induction on . In the sequel, the various constants (generically denoted by ) are positive and only dependent on the ’s and the ’s, thus independent of . The case is immediate. We assume that it holds for and treat the case . One has
Using (8), we replace the product by in the above expression and obtain that
First note that
where is symmetric and nonnegative definite. By using the fact that , one deduces that there exists a nonnegative constant such that
Using the induction hypothesis on , we deduce that
where are polynomials of degree . Again with (8), we replace the term by . By developing the square inside the larger product, we obtain as principal term
with lower order terms that can be upper and lower bounded, thanks to the induction hypothesis, by and , respectively, for some polynomials of degree with nonnegative coefficients. We then similarly proceed by replacing the term by
and so on, so as to end up with the following estimate
for some polynomials of degree with nonnegative coefficients. Recall that, for positive integers, there exists a positive constant only depending on such that for every nonnegative symmetric matrix , one has
which concludes the proof. ∎
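The conservation law of Lemma 1 can be checked numerically on a toy instance. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes a network with a single hidden layer and loss 0.5 * ||Y - W2 W1 X||_F^2, it assumes that the conserved quantity takes the form W2^T W2 - W1 W1^T (the form this invariant commonly takes for linear networks), and it approximates the continuous-time gradient system by explicit Euler steps of size h. All function names and numerical values are hypothetical. Under this discretization the first-order terms cancel and the invariant only drifts by a term of order h^2 per step, so it is approximately (not exactly) conserved.

```python
# Toy check of the conserved quantity along (discretized) gradient descent.
# Assumption: invariant has the form W2^T W2 - W1 W1^T for one hidden layer.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def step(W, G, h):
    # One explicit Euler step W <- W - h * G.
    return [[w - h * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]

def residual(W1, W2, X, Y):
    return sub(Y, matmul(W2, matmul(W1, X)))

def loss(W1, W2, X, Y):
    return 0.5 * sum(e * e for row in residual(W1, W2, X, Y) for e in row)

def gradients(W1, W2, X, Y):
    E = residual(W1, W2, X, Y)
    EXt = matmul(E, transpose(X))
    G1 = [[-g for g in row] for row in matmul(transpose(W2), EXt)]  # dL/dW1
    G2 = [[-g for g in row] for row in matmul(EXt, transpose(W1))]  # dL/dW2
    return G1, G2

def invariant(W1, W2):
    return sub(matmul(transpose(W2), W2), matmul(W1, transpose(W1)))

# Hypothetical data: 3 samples, input dimension 2, hidden width 2, output 1.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Y = [[1.0, 2.0, 3.0]]
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4, -0.6]]

h, steps = 1e-3, 3000
C0 = invariant(W1, W2)
loss_start = loss(W1, W2, X, Y)
for _ in range(steps):
    G1, G2 = gradients(W1, W2, X, Y)
    W1, W2 = step(W1, G1, h), step(W2, G2, h)
loss_end = loss(W1, W2, X, Y)
drift = max(abs(c) for row in sub(invariant(W1, W2), C0) for c in row)
print(loss_start, loss_end, drift)
```

In this sketch the loss decreases substantially while the entries of the assumed invariant move only by an amount that shrinks with the step size h, which is consistent with exact conservation in the continuous-time limit.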
Lemma 1 provides a key structural property of the GDD that is instrumental in ensuring the boundedness of the gradient descent trajectories. As a matter of fact, variations of Lemma 1 have been used to partition the state space into invariant manifolds so as to explicitly characterize the basins of attraction of the GDD in rank-one matrix approximation problems . Also, similar arguments hold in more elaborate cases, e.g., for the popular softmax-cross-entropy loss with one-hot vector target, with and without weight decay (regularization) ; the conservation of norms in (8) also holds true in nonlinear neural networks with ReLU and Leaky ReLU nonlinearities, as pointed out in . We will see in Section 5 that a similar phenomenon also occurs in the case of sigmoid activation.
2.2 General Strategy and Main Result
In this article we establish the global convergence to critical points of all gradient descent trajectories. While one expects the gradient descent algorithm to converge to critical points, this may not always be the case. Two possible (undesirable) situations are 1) a trajectory is unbounded or 2) it oscillates “around” several critical points without convergence, i.e., along an ω-limit set made of a continuum of critical points (see  for notions on ω-limit sets). The property of an iterative algorithm (like gradient descent) to converge to a critical point for any initialization is referred to as “global convergence” . However, it is important to stress that, contrary to what the name might suggest, it does not imply convergence to a global (or good) minimum for all initializations.
To answer the convergence question, we resort to Lojasiewicz’s theorem for the convergence of a gradient descent flow of the type of (6) with real analytic right-hand side , as formally recalled below.
Theorem 1 (Lojasiewicz’s theorem, ).
Since the fundamental (strict) gradient descent direction (as in Definition 1) in Lojasiewicz’s theorem can in fact be relaxed to a (more general) angle condition (see for example Theorem 2.2 in ), the line of argument developed in the core of the article may be similarly followed to prove the global convergence of more advanced optimizers (e.g., SGD, SGD-Momentum , ADAM , etc.), for which the direction of descent is not strictly the opposite of the gradient direction. This constitutes an important direction of future exploration.
Since the loss function is a polynomial of degree in the components of , Lojasiewicz’s theorem ensures that if a given trajectory of the gradient descent flow is bounded (i.e., it remains in a compact set for every ) it must converge to a critical point with a guaranteed rate of convergence. In particular, the aforementioned phenomenon of “oscillation” cannot occur and we are left to ensure the absence of unbounded trajectories. Lemma 1 is the core argument to show that all trajectories of the GDD are indeed bounded, leading to the first result of this article as follows.
Proposition 1 (Global Convergence of GDD to Critical Points).
With Lojasiewicz’s theorem, we are left to prove that each trajectory of (6) remains in a compact set. Taking into account (8), it is enough to prove that is bounded. To this end, denoting and considering its time derivative, one gets, after computations similar to those performed in the proof of Lemma 1 that
for some polynomial of degree . With (9), the above inequality becomes
As an important byproduct of Lemma 1 we have the following remark on the exponential convergence of GDD in linear networks.
Remark 2 (Exponential Convergence in Linear Networks).
Under Notations 1 we have
where we repeatedly use the fact that for symmetric and positive semidefinite we have and denote the minimum eigenvalue of a symmetric and positive semidefinite matrix . Therefore, if there exists at least such that
with the -th eigenvalue of arranged in algebraically nondecreasing order so that . Then since , the matrix is of rank maximum and thus admits at least zero eigenvalues so that
for . Moreover, since for we also have,
we obtain at once that for ,
so that by taking in (10) we arrive at
which concludes the proof. ∎
Remark 2 entails that, for a linear network, although in general only a polynomial convergence rate can be established , it is possible to initialize the gradient descent algorithm wisely so as to achieve exponential convergence.
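To illustrate the idea behind Remark 2 in the simplest possible setting, consider (as an assumption made purely for illustration) a scalar network with one hidden layer, loss L = 0.5 * (s - w2 * w1)^2 and conserved quantity c = w2^2 - w1^2. Along the gradient flow one has dL/dt = -2 (w1^2 + w2^2) L, and since w1^2 + w2^2 >= |c|, any initialization with c != 0 gives L(t) <= L(0) exp(-2 |c| t). The sketch below compares a discretized trajectory (explicit Euler, hypothetical step size and initial weights) with this bound; it is not the general statement of the remark.

```python
import math

def simulate(w1, w2, s=1.0, h=1e-3, steps=2000):
    """Explicit Euler steps on L(w1, w2) = 0.5 * (s - w2 * w1) ** 2."""
    for _ in range(steps):
        e = s - w2 * w1                            # current residual
        w1, w2 = w1 + h * w2 * e, w2 + h * w1 * e  # w <- w - h * grad L
    return w1, w2

s, h, steps = 1.0, 1e-3, 2000
w1_0, w2_0 = 0.1, 1.5                  # unbalanced initialization, c != 0
c = w2_0 ** 2 - w1_0 ** 2              # conserved along the continuous flow
loss_0 = 0.5 * (s - w2_0 * w1_0) ** 2

w1_T, w2_T = simulate(w1_0, w2_0, s, h, steps)
loss_T = 0.5 * (s - w2_T * w1_T) ** 2

# Exponential bound L(0) * exp(-2 |c| T) with elapsed time T = h * steps.
bound = loss_0 * math.exp(-2.0 * abs(c) * h * steps)
print(loss_T, bound)
```

The larger the gap |c| fixed at initialization, the faster the guaranteed decay in this toy model, which is the sense in which a well-chosen initialization yields an exponential rate.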
To continue the analysis of the gradient system, one should evaluate the “size” of the basin of attraction of each critical point and then evaluate the “size” of the (set) union of these basins of attraction. For instance, assume that for each critical point, the associated Hessian admits a negative eigenvalue. Then the basin of attraction of each critical point has zero (Lebesgue) measure, but this does not ensure that the union of these basins of attraction also has zero measure.
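A toy example may help to visualize a basin of attraction of zero measure. The sketch below assumes, for illustration only, the scalar loss L = 0.5 * (s - w2 * w1)^2 with s > 0. Its Hessian at the origin is [[0, -s], [-s, 0]], with eigenvalues -s and s, so the origin is a saddle point, and its basin of attraction under the gradient flow is the line w1 = -w2. The code uses explicit Euler steps with hypothetical step size and initial points: a trajectory started on that line approaches the saddle, whereas a tiny perturbation off the line ends up at a global minimum.

```python
import math

def gd(w1, w2, s=1.0, h=1e-2, steps=3000):
    """Explicit Euler steps on L(w1, w2) = 0.5 * (s - w2 * w1) ** 2."""
    for _ in range(steps):
        e = s - w2 * w1
        w1, w2 = w1 + h * w2 * e, w2 + h * w1 * e
    return w1, w2

def loss(w1, w2, s=1.0):
    return 0.5 * (s - w2 * w1) ** 2

def hessian_eigenvalues(w1, w2, s=1.0):
    # Hessian of L is [[w2^2, 2*w1*w2 - s], [2*w1*w2 - s, w1^2]];
    # closed-form eigenvalues of a symmetric 2x2 matrix.
    a, b, d = w2 ** 2, 2.0 * w1 * w2 - s, w1 ** 2
    mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

eig_min, eig_max = hessian_eigenvalues(0.0, 0.0)  # one negative, one positive

on_line = gd(0.5, -0.5)           # starts on the line w1 = -w2
off_line = gd(0.5, -0.5 + 1e-6)   # tiny perturbation off that line

print(eig_min, eig_max)
print(on_line, loss(*on_line))    # stays near the saddle at the origin
print(off_line, loss(*off_line))  # escapes towards a global minimum
```

Here a single saddle has a one-dimensional basin in a two-dimensional state space. The difficulty raised in the text is that, for networks with many critical points, measure-zero basins for each point do not by themselves settle the measure of their union.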
3 Characterization of Critical Points
3.1 Critical Points Condition
Decomposing with , the effective loss writes
where we recall and so that the product . To fully characterize the critical points of the loss in (11) as well as their basins of attraction, we shall expand the variations of up to second order, as described in the following proposition.
Proposition 2 (Variation of in Deep Networks).
, set . We then have the following expansion for ,