A Geometric Approach of Gradient Descent Algorithms in Neural Networks

In this article we present a geometric framework to analyze the convergence of gradient descent trajectories in the context of neural networks. For linear networks with an arbitrary number of hidden layers, we characterize appropriate quantities that are conserved along the gradient descent system (GDS). We use them to prove boundedness of every trajectory of the GDS, which implies convergence to a critical point. We further focus on the local behavior in the neighborhood of each critical point and study the associated basins of attraction so as to measure the "possibility" of converging to saddle points and local minima.


1 Introduction

Despite the rapidly growing list of successful applications of neural networks trained with back-propagation in various fields, from computer vision [14] to speech recognition [17, 7], our theoretical understanding of these elaborate systems is developing at a more modest pace.

One of the major difficulties in the design of neural networks lies in the fact that, to obtain networks with greater expressive power, one needs to cascade more and more layers to make them “deeper”, in the hope of extracting more “abstract” features from the (numerous) training data. Nonetheless, from an optimization viewpoint, this “deeper” structure gives rise to non-convex loss functions and makes the optimization dynamics seemingly intractable. In general, finding a global minimum of a generic non-convex function is an NP-complete problem [19], which is the case for neural networks as shown in [4] on a very simple network.

Yet, many non-convex problems such as phase retrieval, independent component analysis and orthogonal tensor decomposition obey two important properties [22]: 1) all local minima are also global; and 2) around any saddle point the objective function has a negative directional curvature (thereby implying the possibility to continue descending; such saddles are referred to as “strict” saddles [15]), which together allow for the possibility of finding global minima with a simple gradient descent algorithm. In this regard, the loss surfaces of deep neural networks are receiving unprecedented research interest: in the pioneering work of Baldi and Hornik [3], the landscape of mean square losses was studied in the case of linear auto-encoders (i.e., the same dimension for input data and output targets) of depth one; more recently, in the work of Saxe et al. [21], the dynamics of the corresponding gradient descent system was first studied in deep linear neural networks, by assuming the input data empirical correlation matrix to be the identity. Then in [12] the author proved that, under an appropriate rank condition on the (cascading) matrix product, all critical points of a deep linear neural network are either global minima or saddle points whose Hessian admits eigenvalues of both signs, meaning that deep linear networks are somehow “close” to the examples mentioned at the beginning of this paragraph.

Nonetheless, most existing results are built on the common belief that the simple gradient descent algorithm converges to critical points with vanishing gradients, which may not always be the case. Consider the following example: let

 (1)

Clearly, the function is nonnegative and its critical points are all global minima, corresponding to a plane. It is also not difficult to see that certain trajectories are not bounded. In particular, the set of divergent initializations has positive (Lebesgue) measure.

It is thus of interest to see whether, in the special case of neural networks (linear or nonlinear), a similar phenomenon happens, i.e., whether there exists a set of non-zero measure leading to unbounded gradient descent trajectories, so as to tighten the loopholes left in previous analyses where convergence to critical points is widely assumed.

In this article we present a geometric framework to pursue the global convergence property of simple gradient algorithms for neural networks. We elaborate on the model from [21, 12] and evaluate the dynamics of the underlying gradient system. Based on the characterization of appropriate quantities that are conserved along the gradient descent flow, induced by the cascading structure of the network, we show in Section 2 the global convergence to critical points, via Lojasiewicz’s theorem, in the case of linear networks; similar results are established in Section 5 for nonlinear networks. As a consequence, the study of the convergence properties of the underlying gradient system can be further reduced to the “local” study of each critical point, in particular its basin of attraction. In Section 3 we provide a more detailed characterization of the critical points, and in Section 4 we discuss the associated basins of attraction and the Hessians of saddle points. In particular, we prove that in the single-hidden-layer case the Hessian of every critical point admits a negative eigenvalue, while this is not true for a large number of critical points in the deeper case. Note also that we handle first-order variations of the gradient descent system through computations on the Hessian of the loss function, which provides a much more flexible approach than the one developed in [12]. The ultimate goal consists in actually evaluating the “size” of the union of the basins of attraction over all critical points. At this point it is tempting to state the following conjecture: the union of the basins of attraction over all critical points has zero Lebesgue measure. Even in the simplest case we are not yet able to prove (or disprove) the conjecture.

Notations: We denote by $\dot{x}$ the time derivative and by $(\cdot)^T$ the transpose operator. $\ker(A)$ denotes the kernel of a linear map $A$, i.e., $\ker(A) = \{x : Ax = 0\}$. $\mathrm{rank}(\cdot)$ and $\mathrm{tr}(\cdot)$ denote the rank and the trace operator, respectively, for a (square) matrix.

Acknowledgements: The authors are indebted to Jérôme Bolte for the help in finding the example provided in Eq. (1).

2 System Model and Main Result

2.1 Problem Setup

We start with a linear neural network with $H$ hidden layers as illustrated in Figure 1. To begin with, the network structure as well as the associated notations are presented as follows.

Let the pair $(X, Y)$ denote the training data and the associated targets, with $X \in \mathbb{R}^{d_x \times m}$ and $Y \in \mathbb{R}^{d_y \times m}$, where $m$ denotes the number of instances in the training set and $d_x$, $d_y$ the dimensions of data and targets, respectively. We denote by $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$ the weight matrix that connects layer $j-1$ to layer $j$ for $1 \le j \le H+1$, with $d_0 = d_x$ and $d_{H+1} = d_y$, as in Figure 1. The output of the network for the training set is therefore given by

  $\hat{Y} = W_{H+1} \cdots W_1 X.$

We further denote by $W$ the $(H+1)$-tuple of the $W_j$'s for simplicity and work on the mean square error given by the following Frobenius norm,

  $L(W) = \frac{1}{2}\|Y - \hat{Y}\|_F^2 = \frac{1}{2}\|Y - W_{H+1} \cdots W_1 X\|_F^2,$ (2)

under the following assumptions:

.

Assumption 2 (Full Rank Data and Targets).

The matrices $X$ and $Y$ are of full (row) rank, i.e., of rank $d_x$ and $d_y$, respectively, according to Assumption 1.

Assumptions 1 and 2 on the dimension and rank of the training data are realistic and practically easy to satisfy, as discussed in previous works [3, 12]. Assumption 1 is demanded here for convenience and our results can be extended to handle more elaborate dimension settings. Similarly, when the training data are rank deficient, the learning problem can be reduced to a lower-dimensional one by removing the non-informative data in such a way that Assumption 2 holds.
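To fix ideas, the forward map and the mean square error (2) can be sketched numerically. This is an illustrative sketch only: the function names and the concrete dimensions below are our own choices, not quantities from the paper.

```python
import numpy as np

def network_output(weights, X):
    """Output W_{H+1} ... W_1 X of a linear network, weights = [W_1, ..., W_{H+1}]."""
    out = X
    for W in weights:          # apply W_1 first, W_{H+1} last
        out = W @ out
    return out

def mse_loss(weights, X, Y):
    """Mean square error L(W) = (1/2) ||Y - W_{H+1} ... W_1 X||_F^2, as in (2)."""
    return 0.5 * np.linalg.norm(Y - network_output(weights, X), "fro") ** 2
```

The loss is a polynomial in the entries of the $W_j$'s, which is the structural fact exploited throughout the article.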

Under Assumptions 1 and 2, with the singular value decomposition (SVD) of $X$,

  $X = U_X \Sigma_X V_X^T, \quad V_X = [V_X^1 \; V_X^2], \quad \Sigma_X = [S_X \; 0], \quad V_X^1 \in \mathbb{R}^{m \times d_x},$ (3)

for $S_X$ diagonal and positive definite, we obtain

  $L(W) = \frac{1}{2}\|Y V_X^1 - W_{H+1} \cdots W_1 U_X S_X\|_F^2 + \frac{1}{2}\|Y V_X^2\|_F^2,$

where the second term is constant and the first term is the effective loss of the network. By similarly expanding the effective loss according to the SVD of the associated reduced target, with $\Sigma_Y$ the corresponding matrix of singular values, we obtain

  $L(W) = \frac{1}{2}\|\Sigma_Y - \bar{W}_{H+1} \bar{W}_H \cdots \bar{W}_2 \bar{W}_1\|_F^2,$ (4)

up to an additive constant, where the $\bar{W}_j$'s are obtained from the $W_j$'s by absorbing the factors of the two SVDs into the first and last weight matrices. The network (weight) parameters evolve through time and are considered to be the state variables of the dynamical system, while the pair $(X, Y)$ is fixed and thus referred to as the “parameters” of the given system.

With the above notations, we demand in addition the following assumption on the target $Y$.

Assumption 3 (Distinct Singular Values).

The target $Y$ has distinct singular values.

Similar to Assumption 2, Assumption 3 is a classical assumption that is demanded in previous works [3, 12] and actually holds for an open and dense subset of target matrices.

Definition 1 (GDD).

The Gradient Descent Dynamics (GDD) of $L$ is the dynamical system defined by

  $\frac{dW}{dt} = -\nabla_W L(W),$

where $\nabla_W L$ denotes the gradient of the loss function with respect to $W$. A point $W$ is a critical point of $L$ if and only if $\nabla_W L(W) = 0$, and we denote by $\mathcal{C}$ the set of critical points.

To facilitate further discussion, we drop the bars on the $W_j$'s, sometimes omit the time argument, and introduce the following notations.

Notations 1.

For $1 \le j \le H+1$, we consider the weight matrix $W_j$ and the corresponding variation $w_j$ of the same size. For simplicity, we denote by $W$ and $w$ the $(H+1)$-tuples of the $W_j$'s and $w_j$'s, respectively. For two indices $j \le k$, we use $(\Pi W)^k_j$ to denote the product $W_k \cdots W_j$, and the appropriate identity if $j > k$, so that the whole product writes $(\Pi W)^{H+1}_1$ and consequently

  $L(W) = \frac{1}{2}\|\Sigma_Y - (\Pi W)^{H+1}_1\|_F^2.$

For $r \ge 1$ and indices $1 \le j_1 < \cdots < j_r \le H+1$, we use $P^0(W)$ and $P^r_{j_1,\ldots,j_r}(W,w)$ to denote the following products

  $P^0(W) = (\Pi W)^{H+1}_1, \qquad P^r_{j_1,\ldots,j_r}(W,w) = (\Pi W)^{H+1}_{j_r+1} w_{j_r} (\Pi W)^{j_r-1}_{j_{r-1}+1} w_{j_{r-1}} \cdots (\Pi W)^{j_2-1}_{j_1+1} w_{j_1} (\Pi W)^{j_1-1}_1.$

For instance, we have, for $1 \le j < k \le H+1$,

  $P^1_j(W,w) = (\Pi W)^{H+1}_{j+1} w_j (\Pi W)^{j-1}_1, \qquad P^2_{j,k}(W,w) = (\Pi W)^{H+1}_{k+1} w_k (\Pi W)^{k-1}_{j+1} w_j (\Pi W)^{j-1}_1.$

We can use the above notations to derive the first-order variation of the loss function and hence the GDD equations. To this end, set

  $M = \Sigma_Y - (\Pi W)^{H+1}_1,$ (5)

so that

  $L(W+w) = L(W) - \sum_{j=1}^{H+1} \mathrm{tr}\big(P^1_j(W,w) M^T\big) + O(\|w\|^2),$

where $O(\|w\|^2)$ stands for polynomial terms of order two or larger in the $w_j$'s. We thus obtain, for $1 \le j \le H+1$,

  $\frac{dW_j}{dt} = \big[(\Pi W)^{H+1}_{j+1}\big]^T M \big[(\Pi W)^{j-1}_1\big]^T.$ (6)
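As a sanity check on (6), the closed-form right-hand side can be compared against a finite-difference gradient of the reduced loss. The sketch below is illustrative; the concrete dimensions and the helper names are our own choices.

```python
import numpy as np

dims = [3, 4, 4, 3]                       # d_0, d_1, d_2, d_3 (hypothetical sizes)

def Pi(Ws, lo, hi):
    """Product W_hi ... W_lo (1-indexed); identity of size d_{lo-1} if lo > hi."""
    out = np.eye(dims[lo - 1])
    for k in range(lo, hi + 1):
        out = Ws[k - 1] @ out
    return out

def loss(Ws, Sigma_Y):
    """Reduced loss L(W) = (1/2) ||Sigma_Y - (Pi W)^{H+1}_1||_F^2."""
    return 0.5 * np.linalg.norm(Sigma_Y - Pi(Ws, 1, len(Ws)), "fro") ** 2

def flow_rhs(Ws, Sigma_Y, j):
    """Right-hand side of (6): [(Pi W)^{H+1}_{j+1}]^T M [(Pi W)^{j-1}_1]^T."""
    M = Sigma_Y - Pi(Ws, 1, len(Ws))
    return Pi(Ws, j + 1, len(Ws)).T @ M @ Pi(Ws, 1, j - 1).T

def finite_diff_grad(Ws, Sigma_Y, j, eps=1e-6):
    """Central-difference gradient of the loss with respect to W_j."""
    G = np.zeros_like(Ws[j - 1])
    for idx in np.ndindex(*G.shape):
        Wp = [W.copy() for W in Ws]; Wp[j - 1][idx] += eps
        Wm = [W.copy() for W in Ws]; Wm[j - 1][idx] -= eps
        G[idx] = (loss(Wp, Sigma_Y) - loss(Wm, Sigma_Y)) / (2 * eps)
    return G
```

Since the loss is quadratic in each $W_j$ taken separately, the central difference agrees with the analytic expression up to roundoff, and `flow_rhs` equals minus the gradient, as the descent dynamics requires.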

We first remark the following interesting (and crucial to what follows) property of the gradient system (6), inspired by [21], which essentially considered the case where all dimensions are equal to one.

Lemma 1 (Invariant in GDD).

Consider any trajectory of the gradient system given by (6). Then, for $1 \le j \le H$, the value of $W_{j+1}^T W_{j+1} - W_j W_j^T$ remains constant along the trajectory, i.e.,

  $W_{j+1}^T W_{j+1} - W_j W_j^T = \big(W_{j+1}^T W_{j+1} - W_j W_j^T\big)\big|_{t=0} =: C_j.$ (7)

As a consequence, there exist real constants $c_j$ such that, along a trajectory of the gradient system given by (6), one has for $1 \le j \le H+1$,

  $\|W_j\|_F^2 = \|W_{H+1}\|_F^2 + c_j.$ (8)

Moreover, there exists a positive constant $C_0$ only depending on $H$ and on the dimensions involved in the problem such that, for every trajectory of the gradient system given by (6), there exist two polynomials $P$ and $Q$ of degree at most $H$ with nonnegative coefficients (depending on the initialization) such that, for every $t \ge 0$,

  $C_0 \|W_{H+1}\|_F^{2(H+1)} - P(\|W_{H+1}\|_F^2) \le \mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) \le \|W_{H+1}\|_F^{2(H+1)} + Q(\|W_{H+1}\|_F^2).$ (9)
Proof.

With the above notations together with (6), one gets

  $\frac{d(W_j W_j^T)}{dt} = \big[(\Pi W)^{H+1}_{j+1}\big]^T M \big[(\Pi W)^{j-1}_1\big]^T W_j^T + W_j (\Pi W)^{j-1}_1 M^T (\Pi W)^{H+1}_{j+1}$
  $\quad = \big[(\Pi W)^{H+1}_{j+1}\big]^T M \big[(\Pi W)^{j}_1\big]^T + (\Pi W)^{j}_1 M^T (\Pi W)^{H+1}_{j+1}$
  $\quad = W_{j+1}^T \big[(\Pi W)^{H+1}_{j+2}\big]^T M \big[(\Pi W)^{j}_1\big]^T + (\Pi W)^{j}_1 M^T (\Pi W)^{H+1}_{j+2} W_{j+1}$
  $\quad = \frac{d(W_{j+1}^T W_{j+1})}{dt},$

hence the conclusion of (7). To deduce (8), it remains to take the trace of (7) and chain the resulting identities from index $j$ up to $H$.

Equation (9) is established by induction on $H$. In the sequel, the various constants (generically denoted by $K$) are positive and only depend on the $C_j$'s and on the dimensions, thus are independent of $t$. The base case is immediate. We assume that the claim holds for $H-1$ and treat the case $H$. One has

  $\mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) = \mathrm{tr}\big(W_1^T W_2^T \cdots W_{H+1}^T W_{H+1} \cdots W_2 W_1\big) = \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_2 W_1 W_1^T W_2^T \cdots W_H^T\big).$

Using (7), we replace the product $W_1 W_1^T$ by $W_2^T W_2 - C_1$ in the above expression and obtain that

  $\mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) = \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_2 W_2^T)^2 \cdots W_H^T\big) - \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_2 C_1 W_2^T \cdots W_H^T\big).$

First note that

  $\mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_2 C_1 W_2^T \cdots W_H^T\big) = \mathrm{tr}(A C_1),$

where $A$ is symmetric and nonnegative definite. By using the fact that $|\mathrm{tr}(A C_1)| \le \|C_1\| \, \mathrm{tr}(A)$, one deduces that there exists a nonnegative constant $K$ such that

  $-K \, \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_2 W_2^T \cdots W_H^T\big) + \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_2 W_2^T)^2 \cdots W_H^T\big)$
  $\quad \le \mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big)$
  $\quad \le \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_2 W_2^T)^2 \cdots W_H^T\big) + K \, \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_2 W_2^T \cdots W_H^T\big).$

Using the induction hypothesis, we deduce that

  $-P_0(\|W_{H+1}\|_F^2) + \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_2 W_2^T)^2 \cdots W_H^T\big) \le \mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) \le \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_2 W_2^T)^2 \cdots W_H^T\big) + Q_0(\|W_{H+1}\|_F^2),$

where $P_0, Q_0$ are polynomials of degree at most $H$. Again with (7), we replace the term $W_2 W_2^T$ by $W_3^T W_3 - C_2$. By developing the square inside the larger product, we obtain as principal term

  $\mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots W_3 (W_3^T W_3)^2 W_3^T \cdots W_H^T\big) = \mathrm{tr}\big(W_{H+1}^T W_{H+1} \cdots (W_3 W_3^T)^3 \cdots W_H^T\big),$

with lower-order terms that can be upper and lower bounded, thanks to the induction hypothesis, by polynomials of degree at most $H$ in $\|W_{H+1}\|_F^2$ with nonnegative coefficients. We then proceed similarly by replacing the term $W_3 W_3^T$ by $W_4^T W_4 - C_3$, and so on, so as to end up with the following estimate

  $\mathrm{tr}\big((W_{H+1} W_{H+1}^T)^{H+1}\big) - P(\|W_{H+1}\|_F^2) \le \mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) \le \mathrm{tr}\big((W_{H+1} W_{H+1}^T)^{H+1}\big) + Q(\|W_{H+1}\|_F^2),$

for some polynomials $P, Q$ of degree at most $H$ with nonnegative coefficients. Recall that, for a positive integer $l$, there exists a positive constant $C_0$ only depending on $l$ and on the matrix size such that, for every nonnegative definite symmetric matrix $S$, one has

  $C_0 (\mathrm{tr}(S))^l \le \mathrm{tr}(S^l) \le (\mathrm{tr}(S))^l,$

which concludes the proof. ∎
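The final trace inequality can be checked numerically. For an $n \times n$ positive semi-definite $S$, the power-mean inequality gives $(\mathrm{tr}\,S)^l / n^{l-1} \le \mathrm{tr}(S^l) \le (\mathrm{tr}\,S)^l$, so that $C_0 = n^{1-l}$ works; the sketch below (our own illustration, with arbitrary random matrices) verifies both bounds.

```python
import numpy as np

def trace_power_bounds(S, l):
    """Return (lower bound, tr(S^l), upper bound) for PSD S, with C_0 = n^(1-l)."""
    n = S.shape[0]
    t = np.trace(S)
    tl = np.trace(np.linalg.matrix_power(S, l))
    return t ** l / n ** (l - 1), tl, t ** l
```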

Lemma 1 provides a key structural property of the GDD that is instrumental in ensuring the boundedness of the gradient descent trajectories. As a matter of fact, variations of Lemma 1 have been used to partition the state space into invariant manifolds so as to explicitly characterize the basins of attraction of the GDD in rank-one matrix approximation problems [18]. Also, similar arguments hold in more elaborate cases, e.g., for the popular softmax cross-entropy loss with one-hot targets, with and without weight decay ($\ell_2$ regularization) [2]; the conservation of norms in (8) also holds true in nonlinear neural networks with ReLU and Leaky ReLU nonlinearities, as pointed out in [9]. We will see in Section 5 that a similar phenomenon also happens in the case of the sigmoid activation.
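The conservation law (7) can be illustrated numerically on a small single-hidden-layer network: along an explicit-Euler discretization of (6), $W_2^T W_2 - W_1 W_1^T$ drifts only by the discretization error. All sizes, scales and step counts below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 3, 4, 3                       # hypothetical layer widths (H = 1)
Sigma_Y = np.diag([2.0, 1.0, 0.5])
W1 = 0.3 * rng.standard_normal((d1, d0))
W2 = 0.3 * rng.standard_normal((d2, d1))

def invariant(W1, W2):
    """C_1 = W_2^T W_2 - W_1 W_1^T, constant along the continuous-time flow (7)."""
    return W2.T @ W2 - W1 @ W1.T

def loss(W1, W2):
    return 0.5 * np.linalg.norm(Sigma_Y - W2 @ W1, "fro") ** 2

C_init = invariant(W1, W2)
loss_init = loss(W1, W2)
dt = 5e-4
for _ in range(1000):                      # explicit-Euler steps of the flow (6)
    M = Sigma_Y - W2 @ W1
    W1, W2 = W1 + dt * (W2.T @ M), W2 + dt * (M @ W1.T)
drift = np.linalg.norm(invariant(W1, W2) - C_init)
loss_final = loss(W1, W2)
```

The first-order (in `dt`) change of the invariant cancels exactly, which is the content of Lemma 1; only the second-order Euler error remains.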

2.2 General Strategy and Main Result

In this article we establish the global convergence to critical points of all gradient descent trajectories. While one expects the gradient descent algorithm to converge to critical points, this may not always be the case. Two possible (undesirable) situations are: 1) a trajectory is unbounded, or 2) it oscillates “around” several critical points without converging, i.e., along an $\omega$-limit set made of a continuum of critical points (see [23] for notions on $\omega$-limit sets). The property of an iterative algorithm (like gradient descent) to converge to a critical point for any initialization is referred to as “global convergence” [24]. However, it is very important to stress that it does not imply (contrary to what the name might suggest) convergence to a global (or good) minimum for all initializations.

To answer the convergence question, we resort to Lojasiewicz’s theorem for the convergence of a gradient descent flow of the type of (6) with real analytic right-hand side [16], as formally recalled below.

Theorem 1 (Lojasiewicz’s theorem, [16]).

Let $L$ be a real analytic function and let $W(t)$ be a solution trajectory of the gradient system given by Definition 1. Further assume that $W(t)$ remains bounded. Then $W(t)$ converges to a critical point of $L$ as $t \to \infty$. The rate of convergence is determined by the associated Lojasiewicz exponent [8].

Remark 1.

Since the fundamental (strict) gradient descent direction (as in Definition 1) in Lojasiewicz’s theorem can in fact be relaxed to a (more general) angle condition (see for example Theorem 2.2 in [1]), the line of argument developed in the core of the article may be similarly followed to prove the global convergence of more advanced optimizers (e.g., SGD, SGD-Momentum [20], ADAM [13], etc.), for which the direction of descent is not strictly the opposite of the gradient direction. This constitutes an important direction of future exploration.

Since the loss function is a polynomial of degree $2(H+1)$ in the components of $W$, Lojasiewicz’s theorem ensures that if a given trajectory of the gradient descent flow is bounded (i.e., it remains in a compact set for every $t \ge 0$), it must converge to a critical point with a guaranteed rate of convergence. In particular, the aforementioned phenomenon of “oscillation” cannot occur and we are left to ensure the absence of unbounded trajectories. Lemma 1 is the core argument to show that all trajectories of the GDD are indeed bounded, leading to the first result of this article.

Proposition 1 (Global Convergence of GDD to Critical Points).

Let $(X, Y)$ be a data-target pair satisfying Assumptions 1 and 2. Then, every trajectory of the corresponding gradient flow described by Definition 1 converges to a critical point as $t \to \infty$, at a rate at least $O(t^{-\alpha})$ for some fixed $\alpha > 0$ only depending on the dimensions of the problem.

Proof.

By Lojasiewicz’s theorem, we are left to prove that each trajectory of (6) remains in a compact set. Taking into account (8), it is enough to prove that $\|W_{H+1}\|_F^2$ is bounded. To this end, denoting $g = \|W_{H+1}\|_F^2$ and considering its time derivative, one gets, after computations similar to those performed in the proof of Lemma 1, that

  $\frac{dg}{dt} \le -2 \, \mathrm{tr}\big(\big[(\Pi W)^{H+1}_1\big]^T (\Pi W)^{H+1}_1\big) + P(g),$

for some polynomial $P$ of degree at most $H$. With (9), the above inequality becomes

  $\frac{dg}{dt} \le -2 C_0 g^{H+1} + P(g),$

which implies that $g$ remains bounded and concludes the proof of Proposition 1. The guaranteed rate of convergence can be obtained from estimates associated with polynomial gradient systems [8]. ∎
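The boundedness mechanism can be visualized on the scalar differential inequality itself: once $g$ is large, the $-2C_0 g^{H+1}$ term dominates any degree-$H$ polynomial $P(g)$, so $g$ cannot escape to infinity. The constants, the polynomial and the integrator below are arbitrary illustrative choices, not quantities from the paper.

```python
# Illustrative only: C0, H and P are arbitrary, not quantities from the paper.
C0, H = 0.5, 2

def P(g):
    """An example polynomial of degree H with nonnegative coefficients."""
    return 3.0 * g ** H + 1.0

def euler_upper_bound(g0, dt=1e-5, T=2.0):
    """Integrate dg/dt = -2*C0*g^(H+1) + P(g), the worst case of the inequality."""
    g = g0
    for _ in range(int(T / dt)):
        g = g + dt * (-2.0 * C0 * g ** (H + 1) + P(g))
    return g
```

Whatever the initial value, the solution is trapped below the largest root of $-2C_0 g^{H+1} + P(g) = 0$, which is the scalar analogue of the compactness argument above.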

As an important byproduct of Lemma 1, we have the following remark on the exponential convergence of the GDD in linear networks.

Remark 2 (Exponential Convergence in Linear Networks).

Let Assumptions 1 and 2 hold, and assume that the initialization is such that, for $1 \le j \le H$, the invariant $C_j$ defined in (7) has at least $d_{j+1}$ positive eigenvalues. Then, every trajectory of the GDD converges to a global minimum, with the loss decaying at least at the rate $e^{-2\lambda t}$, where $\lambda = \prod_{l=1}^{H} \lambda_{d_l - d_{l+1} + 1}(C_l)$ and $\lambda_i(\cdot)$ denotes the $i$-th smallest eigenvalue.

Proof.

Under Notations 1 we have

  $\frac{dM}{dt} = -\sum_{j=1}^{H+1} (\Pi W)^{H+1}_{j+1} \frac{dW_j}{dt} (\Pi W)^{j-1}_1 = -\sum_{j=1}^{H+1} (\Pi W)^{H+1}_{j+1} \big[(\Pi W)^{H+1}_{j+1}\big]^T M \big[(\Pi W)^{j-1}_1\big]^T (\Pi W)^{j-1}_1,$

so that

  $\frac{d \, \mathrm{tr}(M^T M)}{dt} = -2 \sum_{j=1}^{H+1} \mathrm{tr}\Big(M^T (\Pi W)^{H+1}_{j+1} \big[(\Pi W)^{H+1}_{j+1}\big]^T M \big[(\Pi W)^{j-1}_1\big]^T (\Pi W)^{j-1}_1\Big)$
  $\quad \le -2 \sum_{j=1}^{H+1} \prod_{k=1}^{j-1} \lambda_{\min}(W_k^T W_k) \, \mathrm{tr}\Big(M^T (\Pi W)^{H+1}_{j+1} \big[(\Pi W)^{H+1}_{j+1}\big]^T M\Big)$
  $\quad \le -2 \sum_{j=1}^{H+1} \prod_{k=1}^{j-1} \lambda_{\min}(W_k^T W_k) \prod_{l=j+1}^{H+1} \lambda_{\min}(W_l W_l^T) \, \mathrm{tr}(M^T M),$

where we repeatedly use the fact that, for $A$, $B$ symmetric and positive semi-definite, $\mathrm{tr}(AB) \ge \lambda_{\min}(B) \, \mathrm{tr}(A)$, and denote by $\lambda_{\min}(\cdot)$ the minimum eigenvalue of a symmetric positive semi-definite matrix. Therefore, if there exists at least one $1 \le j \le H+1$ such that

  $\prod_{l=j+1}^{H+1} \lambda_{\min}(W_l W_l^T) \prod_{k=1}^{j-1} \lambda_{\min}(W_k^T W_k) > 0$ (10)

along the trajectory, then we obtain $\frac{d}{dt} \mathrm{tr}(M^T M) \le -2\lambda \, \mathrm{tr}(M^T M)$ for some $\lambda > 0$, and thus the conclusion. From Lemma 1 and Weyl’s inequality (e.g., [10, Corollary 4.3.12]), we have, for every admissible index $i$, that

  $\lambda_i(W_{j+1}^T W_{j+1}) \ge \lambda_i(C_j) + \lambda_{\min}(W_j W_j^T) \ge \lambda_i(C_j),$

with $\lambda_i(\cdot)$ the $i$-th eigenvalue, arranged in algebraically nondecreasing order. Then, since $W_{j+1} \in \mathbb{R}^{d_{j+1} \times d_j}$, the matrix $W_{j+1}^T W_{j+1}$ is of rank at most $d_{j+1}$ and thus admits at least $d_j - d_{j+1}$ zero eigenvalues, so that

  $\lambda_i(W_{j+1}^T W_{j+1}) = 0, \qquad \lambda_i(C_j) \le 0,$

for $1 \le i \le d_j - d_{j+1}$. Moreover, since the nonzero eigenvalues of $W_{j+1}^T W_{j+1}$ and $W_{j+1} W_{j+1}^T$ coincide, we also have

  $\lambda_{i + d_j - d_{j+1}}(W_{j+1}^T W_{j+1}) = \lambda_i(W_{j+1} W_{j+1}^T),$

so we obtain at once that, for $1 \le i \le d_{j+1}$,

  $\lambda_i(W_{j+1} W_{j+1}^T) = \lambda_{i + d_j - d_{j+1}}(W_{j+1}^T W_{j+1}) \ge \lambda_{i + d_j - d_{j+1}}(C_j),$

so that by taking $j = 1$ in (10) we obtain

  $\prod_{l=1}^{H} \lambda_{\min}(W_{l+1} W_{l+1}^T) \ge \prod_{l=1}^{H} \lambda_{d_l - d_{l+1} + 1}(C_l) > 0,$

which concludes the proof. ∎

Remark 2 entails that, for a linear network, although in general only a polynomial convergence rate can be established [8], it is possible to wisely initialize the gradient descent algorithm so as to achieve exponential convergence.
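Remark 2 can be illustrated numerically: initializing with $W_1 = 0$ makes $C_1 = W_2^T W_2$ positive semi-definite with (generically) $d_2$ positive eigenvalues, and the discretized flow then drives the loss to numerical zero. This is a sketch under our own choices of sizes, scales and step counts.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 3, 4, 3
Sigma_Y = np.diag([2.0, 1.0, 0.5])
W1 = np.zeros((d1, d0))                    # W_1 = 0  =>  C_1 = W_2^T W_2 is PSD
W2 = rng.standard_normal((d2, d1))

C1 = W2.T @ W2 - W1 @ W1.T
eigs = np.sort(np.linalg.eigvalsh(C1))     # ascending; lambda_{d1-d2+1} is eigs[1]

dt = 1e-2
for _ in range(50000):                     # explicit-Euler steps of the flow (6)
    M = Sigma_Y - W2 @ W1
    W1, W2 = W1 + dt * (W2.T @ M), W2 + dt * (M @ W1.T)
final_loss = 0.5 * np.linalg.norm(Sigma_Y - W2 @ W1, "fro") ** 2
```

With this initialization, the eigenvalue condition of Remark 2 holds at $t = 0$ and, by Lemma 1, along the whole trajectory, which is what produces the fast decay observed here.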

To continue the analysis of the gradient system, one should evaluate the “size” of the basin of attraction of each critical point and then the “size” of the union of these basins of attraction. For instance, assume that for each critical point the associated Hessian admits a negative eigenvalue. Then the basin of attraction of each critical point has zero (Lebesgue) measure, but this does not ensure that the union of these basins of attraction also has zero measure.

3 Characterization of Critical Points

3.1 Critical Points Condition

Decomposing $W_1 = [W_{1,1} \; W_{1,2}]$ with $W_{1,1}$ the first $d_y$ columns of $W_1$, and recalling $\Sigma_Y = [S_Y \; 0]$, the effective loss writes

  $L(W) = \frac{1}{2}\|S_Y - (\Pi W)^{H+1}_2 W_{1,1}\|_F^2 + \frac{1}{2}\|(\Pi W)^{H+1}_2 W_{1,2}\|_F^2,$ (11)

so that the product $(\Pi W)^{H+1}_1 = [(\Pi W)^{H+1}_2 W_{1,1} \; (\Pi W)^{H+1}_2 W_{1,2}]$. To fully characterize the critical points of the loss in (11) as well as their basins of attraction, we shall expand the first two orders of variation of $L$, as described in the following proposition.

Proposition 2 (Variation of L in Deep Networks).

For $2 \le j \le H$, set $Q^1_j(W,w) = (\Pi W)^H_{j+1} w_j (\Pi W)^{j-1}_2$ and, for $j \ne k$ in the same range, let $Q^2_{j,k}(W,w)$ denote the analogous product $(\Pi W)^H_2$ with $w_j$ and $w_k$ inserted at positions $j$ and $k$. We then have the following expansion,

  $L(W+w) = L(W) + \Delta_W(w) + H_W(w) + O(\|w\|^3),$ (12)

with $M$ given by (5) and

  $\Delta_W(w) = -\sum_{j=2}^{H} \mathrm{tr}\big(W_{H+1} Q^1_j(W,w) W_{1,1} M^T\big) - \mathrm{tr}\big(w_{H+1} (\Pi W)^H_2 W_{1,1} M^T\big)$
  $\quad + \sum_{j=2}^{H} \mathrm{tr}\big(W_{H+1} Q^1_j(W,w) W_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big) + \mathrm{tr}\big(w_{H+1} (\Pi W)^H_2 W_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big)$
  $\quad - \mathrm{tr}\big((\Pi W)^{H+1}_2 w_{1,1} M^T\big) + \mathrm{tr}\big((\Pi W)^{H+1}_2 w_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big),$

  $H_W(w) = -\sum_{2 \le j \ne k \le H} \mathrm{tr}\big(W_{H+1} Q^2_{j,k}(W,w) W_{1,1} M^T\big) - \sum_{j=2}^{H} \mathrm{tr}\big(w_{H+1} Q^1_j(W,w) W_{1,1} M^T\big)$
  $\quad - \sum_{l=2}^{H} \mathrm{tr}\big(W_{H+1} Q^1_l(W,w) w_{1,1} M^T\big) - \mathrm{tr}\big(w_{H+1} (\Pi W)^H_2 w_{1,1} M^T\big)$
  $\quad + \sum_{2 \le j \ne k \le H} \mathrm{tr}\big(W_{H+1} Q^2_{j,k}(W,w) W_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big) + \sum_{j=2}^{H} \mathrm{tr}\big(w_{H+1} Q^1_j(W,w) W_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big)$
  $\quad + \sum_{j=2}^{H} \mathrm{tr}\big(W_{H+1} Q^1_j(W,w) w_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big) + \mathrm{tr}\big(w_{H+1} (\Pi W)^H_2 w_{1,2} W_{1,2}^T [(\Pi W)^{H+1}_2]^T\big)$
  $\quad + \frac{1}{2}\Big\|\sum_{j=2}^{H} W_{H+1} Q^1_j(W,w) W_{1,1} + w_{H+1} (\Pi W)^H_2 W_{1,1} + (\Pi W)^{H+1}_2 w_{1,1}\Big\|_F^2$
  $\quad + \frac{1}{2}\Big\|W_{H+1} \sum_{j=2}^{H} Q^1_j(W,w) W_{1,2} + w_{H+1} (\Pi W)^H_2 W_{1,2} + (\Pi W)^{H+1}_2 w_{1,2}\Big\|_F^2.$