Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

In this note, we study the dynamics of gradient descent on objective functions of the form f(∏_i=1^k w_i) (with respect to scalar parameters w_1,...,w_k), which arise in the context of training depth-k linear neural networks. We prove that for standard random initializations, and under mild assumptions on f, the number of iterations required for convergence scales exponentially with the depth k. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where k is large.


1 Introduction

One of the biggest open problems in theoretical machine learning is to explain why deep artificial neural networks can be efficiently trained in practice, using simple gradient-based methods. Such training requires optimizing complex, highly non-convex objective functions, which seem intractable from a worst-case viewpoint. Over the past few years, much research has been devoted to this question, but it remains largely unanswered.

Trying to understand simpler versions of this question, significant attention has been devoted to linear neural networks, which are predictors of the form x ↦ (∏_{i=1}^k W_i)x, with W_1, …, W_k being a set of parameter matrices and k being the depth parameter (e.g. Saxe et al. (2013); Kawaguchi (2016); Hardt and Ma (2016); Lu and Kawaguchi (2017); Bartlett et al. (2018); Laurent and Brecht (2018)). The optimization problem associated with training such networks can be formulated as

 min_{W_1,…,W_k} F(W_1, …, W_k) := f( ∏_{i=1}^k W_i ) (1)

for some matrix-valued function f. Although much simpler than general feedforward neural networks (which involve additional non-linear functions), it is widely believed that Eq. (1) captures important aspects of neural network optimization problems. Moreover, Eq. (1) has a simple algebraic structure, which makes it more amenable to analysis. In particular, it is known that when f is convex and differentiable, Eq. (1) has no local minima except global ones (see Laurent and Brecht (2018) and references therein). In other words, if an optimization algorithm converges to some local minimum, then it must converge to a global minimum.

Importantly, this no-local-minima result does not imply that gradient-based methods indeed solve Eq. (1) efficiently: Even when they converge to local minima (which is not always guaranteed, say in case the parameters diverge), the number of required iterations might be arbitrarily large. To study this question, Bartlett et al. (2018) recently considered the special case where f(W) = ‖W − Φ‖²_F (with ‖⋅‖_F being the Frobenius norm) for square matrices and a target matrix Φ, using gradient descent starting from W_j = I for all j. Specifically, the authors prove a polynomial-time convergence guarantee when Φ is positive semidefinite. On the other hand, when Φ is symmetric and with negative eigenvalues, it is shown that gradient descent with this initialization will never converge.

Although these results provide important insights, they crucially assume that each W_j is initialized exactly at the identity I. Since in practice parameters are initialized randomly, it is natural to ask whether such results hold with random initialization. Indeed, even though gradient descent might fail to converge with a specific initialization, it could be that even a tiny random perturbation is sufficient for polynomial-time convergence. To take a particularly simple special case, consider the objective F(w_1, w_2) = (w_1 w_2 + 1)² for (w_1, w_2) ∈ ℝ². It is an easy exercise to show that gradient descent starting from any point with w_1 = w_2 (and sufficiently small step sizes) will fail to converge to an optimal solution (essentially, at any iteration w_1 and w_2 will remain equal, hence their product w_1 w_2 = w_1² will remain nonnegative, and F will remain at least 1, whereas the infimum of F is 0). On the other hand, polynomial-time convergence holds with random initialization (see Du et al. (2018)).

Unfortunately, analyzing the dynamics of gradient descent on objectives such as Eq. (1) appears to be quite challenging. Thus, in this note, we consider a more tractable special case of Eq. (1), where the matrices are scalars:

 min_{w ∈ ℝ^k} F(w) := f( ∏_{i=1}^k w_i ) . (2)

We show that under mild conditions on the function f, and with standard initializations (including Xavier initialization and any reasonable initialization close to 1), gradient descent will require exp(Ω(k)) iterations to converge. We complement this by showing that exp(Õ(k)) iterations suffice for convergence to an ε-optimal point in these cases. The take-home message is that even if we focus on linear neural networks, natural objective functions, and random initializations, the associated optimization problems can be intractable for gradient descent to solve when the depth k is large. As we discuss in Sec. 4, this does not mean that gradient-based methods cannot learn deep linear networks in general. However, the results do imply that one would need to make additional assumptions or algorithmic modifications to circumvent these negative results.

Finally, we note that our results provide a possibly interesting contrast to the recent work of Arora et al. (2018), which suggests that increasing depth can sometimes accelerate the optimization process. Here we show that at least in some cases, the opposite occurs: Adding depth can quickly turn a trivial optimization problem into an intractable one for gradient descent.

2 Preliminaries

Notation.

We use bold-faced letters to denote vectors. Given a vector w, w_i refers to its i-th coordinate. ‖⋅‖, ‖⋅‖_1 and ‖⋅‖_∞ refer to the Euclidean norm, the 1-norm and the infinity norm respectively. We let ∏_i and ∑_i be shorthand for ∏_{i=1}^k and ∑_{i=1}^k respectively. Also, we define a product over an empty set as being equal to 1. Since our main focus is to study the dependence on the network depth k, we use the standard O(⋅), Ω(⋅) notation to hide constants independent of k, and Õ(⋅), Ω̃(⋅) to hide constants and factors logarithmic in k.

Gradient Descent. We consider the standard gradient descent method for unconstrained optimization of functions in Euclidean space, which given an initialization point w(1), performs repeated iterations of the form w(t+1) = w(t) − η∇F(w(t)) for t = 1, 2, … (where ∇F is the gradient, and η > 0 is a step size parameter). For objectives as in Eq. (2), we have ∂F(w)/∂w_j = f′(∏_i w_i) ∏_{i≠j} w_i, and gradient descent takes the form

 ∀j,  w_j(t+1) = w_j(t) − η f′( ∏_i w_i(t) ) ∏_{i≠j} w_i(t) .
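As an illustration (ours, not part of the paper), the update rule above can be sketched in a few lines of NumPy; the choice f(p) = (p+1)², i.e. f′(p) = 2(p+1), is just an example loss:

```python
import numpy as np

def gradient_descent(w, f_prime, eta, steps):
    """Gradient descent on F(w) = f(prod_i w_i) for scalar parameters w_1..w_k."""
    w = np.array(w, dtype=float)
    for _ in range(steps):
        p = np.prod(w)
        # dF/dw_j = f'(prod_i w_i) * prod_{i != j} w_i  (leave-one-out products)
        loo = np.array([np.prod(np.delete(w, j)) for j in range(len(w))])
        w = w - eta * f_prime(p) * loo
    return w

# One step with f(p) = (p+1)^2, i.e. f'(p) = 2(p+1):
w1 = gradient_descent([2.0, 3.0, 4.0], lambda p: 2.0 * (p + 1.0), 1e-3, 1)
```

Computing each ∏_{i≠j} w_i explicitly (rather than as (∏_i w_i)/w_j) keeps the sketch valid even when some coordinate is exactly zero.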

Random Initialization. One of the most common initialization methods for neural networks is Xavier initialization (Glorot and Bengio, 2010), which in the setting of Eq. (1) corresponds to choosing each entry of each matrix W_j independently from a zero-mean distribution (usually uniform or Gaussian) whose variance scales inversely with the layer width. This ensures that the variance of the network outputs (with respect to the initialization) is constant irrespective of the network size. Motivated by residual networks, Hardt and Ma (2016) and Bartlett et al. (2018) consider initializing each W_j independently at the identity I, possibly with some random perturbation. In this paper we denote such an initialization scheme as a near-identity initialization. Since we focus here on the scalar case as in Eq. (2), Xavier initialization corresponds to choosing each w_i independently from a zero-mean, unit-variance distribution, and near-identity initialization corresponds to choosing each w_i close to 1.
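To see why depth interacts badly with the first scheme, here is a small simulation (ours, not from the paper): under a zero-mean, unit-variance ("Xavier-style") initialization, the median of log|∏_i w_i| decreases roughly linearly with k, i.e. the product itself vanishes exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_log_abs_product(k, trials=2000):
    """Median of log|prod_i w_i| over random draws of w_1..w_k."""
    w = rng.standard_normal((trials, k))  # Xavier-style: zero mean, unit variance
    return np.median(np.sum(np.log(np.abs(w)), axis=1))

# The median of log|product| decreases roughly linearly with depth k:
vals = [median_log_abs_product(k) for k in (5, 20, 80)]
```

A near-identity initialization, by contrast, keeps the product of order 1 by construction.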

3 Exponential Convergence Time for Gradient Descent

For our negative results, we impose the following mild conditions on the function f in Eq. (2):

Assumption 1.

f is differentiable, Lipschitz continuous on any bounded interval, and strictly monotonically increasing on [0, ∞). Moreover, inf_{x∈ℝ} f(x) < inf_{x≥0} f(x).

Here, we assume that f is fixed, and our goal is to study the convergence time of gradient descent on Eq. (2) as a function of the depth k. Some simple examples satisfying Assumption 1 in the context of machine learning include f(p) = (p+1)² and f(p) = log(1 + exp(p)) (e.g., the squared loss and the logistic loss with respect to the input/output pair (1, −1), respectively). We note that this non-symmetry with respect to positive/negative values is completely arbitrary, and one can prove similar results if their roles are reversed.

3.1 Xavier Initialization

We begin with the case of Xavier initialization, where we initialize all coordinates of w(1) in Eq. (2) independently from a zero-mean, unit-variance distribution. We will consider any distribution which satisfies the following:

Assumption 2.

w_1(1), …, w_k(1) are drawn i.i.d. from a zero-mean, unit-variance distribution such that

1. Pr( |w_1(1)| ≤ ε ) ≤ c_1 ε for all ε ≥ 0, and

2. E[ |w_1(1)| ] ≤ 1 − c_2,

where c_1, c_2 > 0 are absolute constants independent of k.

The first part of the assumption is satisfied for any distribution with bounded density. As to the second part, the following lemma shows that it is satisfied for uniform and Gaussian distributions (with an explicit constant c_2), and in fact for any non-trivial distribution (with a distribution-dependent c_2):

Lemma 1.

• If w is drawn from a zero-mean, unit-variance Gaussian distribution, then E[|w|] = √(2/π) ≈ 0.8.

• If w is drawn from a zero-mean, unit-variance uniform distribution, then E[|w|] = √3/2 ≈ 0.87.

• If w is drawn from any zero-mean, unit-variance distribution not supported on a single value, then E[|w|] < 1.

Proof.

The first two parts follow from standard results on Gaussian and uniform distributions. As to the third part, by Jensen's inequality and the fact that x ↦ √x is strictly concave, E[|w|] = E[√(w²)] < √(E[w²]) = 1. ∎
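These closed-form values (√(2/π) for the Gaussian, √3/2 for the uniform distribution on [−√3, √3]) are easy to confirm numerically; the following Monte Carlo check is our illustration, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

gauss = rng.standard_normal(n)                      # zero mean, unit variance
unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)  # zero mean, unit variance

m_gauss = np.abs(gauss).mean()  # close to sqrt(2/pi) ~ 0.798
m_unif = np.abs(unif).mean()    # close to sqrt(3)/2  ~ 0.866
```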

With such an initialization, we now show that gradient descent is overwhelmingly likely to take at least exponential time to converge:

Theorem 1.

The following holds for some positive constants c_1, c_2, c_3, c_4 independent of k: Under Assumptions 1 and 2, if gradient descent is run with any step size η ≤ exp(c_1 k), then with probability at least 1 − exp(−c_2 k) over the initialization, the number of iterations required to reach suboptimality less than c_3 is at least exp(c_4 k).

In the above, the constants c_1, …, c_4 depend only on the absolute constants in Assumptions 1 and 2. The proof (as all other major proofs in this paper) appears in the appendix.

The intuition behind the theorem is quite simple: Under our assumptions, it is easy to show that the product of any k − 1 coordinates of w(1) is overwhelmingly likely to be exponentially small in k. Since the derivative of our objective w.r.t. any w_j has the form f′(∏_i w_i) ∏_{i≠j} w_i, it follows that the gradient at the initialization point is exponentially small in k. Moreover, we show that the gradient is exponentially small at any point within a bounded distance from the initialization (which is the main technical challenge of the proof, since the gradient is by no means Lipschitz). As a result, gradient descent will only make exponentially small steps. Assuming we start from a point bounded away from a global minimum, it follows that the number of required iterations must be exponentially large in k.

We note that the observation that Xavier initialization leads to highly skewed values in deep enough networks is not new (see Saxe et al. (2013); Pennington et al. (2017)), and has motivated alternative initializations such as orthogonal initialization (it is interesting to note that in our setting, orthogonal initialization amounts to choosing each w_i in {−1, +1}, which can easily cause non-convergence, e.g. for f(p) = (p+1)² when ∏_i w_i(1) = 1 and the step sizes are small enough). Our main contribution is to rigorously analyze how this affects the optimization process for our setting.

3.2 Near-Identity Initialization

We now turn to consider initializations where each w_i is initialized close to 1. Here, it will be convenient to make deterministic rather than stochastic assumptions on the initialization point (which are satisfied with high probability for reasonable distributions):

Assumption 3.

For some absolute constants c_1, c_2 > 0 independent of k, gradient descent is initialized at a point w(1) which satisfies max_i |w_i(1) − 1| ≤ c_1/k and ∏_i w_i(1) ≤ c_2.

To justify this assumption, note that if the w_i(1) are chosen i.i.d. with values not on the order of 1 ± O(1/k), then their product is likely to explode or vanish exponentially with k.

Theorem 2.

The following holds for some positive constants c_1, c_2 independent of k: Under Assumptions 1 and 3, if gradient descent is run with any positive step size η, then the number of iterations required to reach suboptimality less than c_1 is at least exp(c_2 k).

As before, the constants c_1, c_2 depend only on the absolute constants in the assumptions.

The formal proof appears in the appendix. To help explain its intuition, we provide in Figure 1 the actual evolution of w(t) for a typical run of gradient descent, when f(p) = (p+1)² and we initialize all coordinates reasonably close to 1. Recall that for this f, the gradient descent updates take the form

 ∀j,  w_j(t+1) = w_j(t) − η ( ∏_i w_i(t) + 1 ) ∏_{i≠j} w_i(t) ,

where we absorb the constant factor arising from f′ into the step size η. Thus, initially, all parameters decrease with t, as is to be expected. However, as their values fall to around or below 1, their product decreases rapidly towards 0. Since the gradient with respect to each w_j scales with ∏_{i≠j} w_i(t), the magnitude of the gradients becomes very small, and the algorithm makes only slow progress. Eventually, one of the parameters becomes negative, at which point all the other parameters start increasing, and the algorithm converges. However, the length of the slow middle phase can be shown to be exponential in the depth / number of parameters k.
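The three-phase behavior described above is easy to reproduce numerically. The following simulation (ours, for illustration, with f(p) = (p+1)² and a fixed, slightly asymmetric near-identity initialization) counts the iterations until the first coordinate turns negative, i.e. until the end of the slow middle phase:

```python
import numpy as np

def steps_until_sign_flip(k, eta=0.05, max_iter=200_000):
    """GD on F(w) = (prod_i w_i + 1)^2 from a near-identity initialization;
    returns the iteration at which some coordinate first becomes negative."""
    w = 1.0 + 0.1 * np.linspace(-1.0, 1.0, k)  # deterministic, asymmetric init near 1
    for t in range(max_iter):
        if w.min() < 0:
            return t
        p = np.prod(w)
        loo = np.array([np.prod(np.delete(w, j)) for j in range(k)])
        w = w - eta * 2.0 * (p + 1.0) * loo    # f'(p) = 2(p + 1)
    return max_iter
```

On this instance the count grows quickly with the depth: the value for k = 6 is much larger than the value for k = 2, mirroring the lower bound of Theorem 2.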

3.3 A Positive Result

Having established that the number of required iterations can be exp(Ω(k)), we now show that this is nearly tight. Specifically, we prove that gradient descent indeed converges in the settings studied so far, with a number of iterations scaling as exp(Õ(k)) (which can be interpreted as a constant for any constant depth k). For simplicity, we prove this in the case where f(p) = (p+1)², but the technique can be easily generalized to other convex f under mild conditions. We note that the case of a nonnegative target value and each w_i initialized to 1 is covered by the results in Bartlett et al. (2018). However, here we show a convergence result for a negative target value, and even if the w_i are not all initialized at 1.

We will use the following assumptions on our objective and parameters:

Assumption 4.

The following hold for some absolute positive constants c_1, c_2, c_3, c_4 independent of k:

• The initialization w(1) satisfies the following: c_1 ≤ w_i(1) ≤ c_2 for all i, and ∏_i w_i(1) ≤ c_3.

• There is a coordinate j such that w_j(1)² ≤ w_{j′}(1)² − c_4 for all j′ ≠ j.

The assumptions on the initialization ensure that it is consistent with the conditions of our negative results, for both Xavier and near-identity initializations (the other cases can be studied using similar techniques).

Theorem 3.

Consider the objective F(w) = (∏_i w_i + 1)². Under Assumption 4, for any step size η ≤ 1/c for some large enough constant c, and for any ε > 0, the number of gradient descent iterations required to reach F(w(t)) ≤ ε is at most exp(Õ(k)) ⋅ log(1/ε).

4 Discussion

In this work, we showed that for one-dimensional deep linear neural networks, gradient descent can easily require exponentially many iterations (in the depth of the network) to converge. It is important to emphasize, though, that this does not imply that gradient-based methods fail to learn linear networks in general: First of all, our results are specific to the case where the parameter matrix of each layer is one-dimensional, and do not necessarily extend to higher dimensions. A possibly interesting exception is when f(W) = ‖W − Φ‖²_F, and both the target matrix Φ and the initialization matrices are diagonal. In that case, it is easy to show that the matrices produced by gradient descent remain diagonal, and the objective can be rewritten as a sum of independent one-dimensional problems, to which our results would apply. However, this reasoning fails for non-diagonal initializations and target matrices Φ. Based on some numerical experiments, we believe that even in the non-diagonal case, gradient descent can sometimes require exponential time to converge, but this phenomenon is not particularly common, and it is quite likely that it can be avoided under reasonable assumptions. Finally, we focused on standard gradient descent, and it is quite possible that our results can be circumvented using other gradient-based algorithms (for example, by adding random noise to the gradient updates or using adaptive step sizes).

Despite these reservations, we believe our results point to a potential obstacle in understanding the convergence of gradient-based methods for linear networks: At the very least, one would have to rule out one-dimensional layers, or consider algorithms other than plain gradient descent, in order to establish polynomial-time convergence guarantees for deep linear networks.

Appendix A Proofs

A.1 Proof of Thm. 1

The proof is based on the following two lemmas:

Lemma 2.

Suppose w_1, …, w_k are drawn i.i.d. from a distribution such that E[|w_1|] ≤ a for some a < 1. Then

 Pr( max_j | ∏_{i≠j} w_i | ≥ k a^{(k−1)/2} ) ≤ a^{(k−1)/2} .
Proof.

For any fixed j, by Markov's inequality and the i.i.d. assumption,

 Pr( | ∏_{i≠j} w_i | ≥ k a^{(k−1)/2} ) ≤ E[ | ∏_{i≠j} w_i | ] / ( k a^{(k−1)/2} ) = ( E[|w_1|] )^{k−1} / ( k a^{(k−1)/2} ) ≤ a^{k−1} / ( k a^{(k−1)/2} ) = a^{(k−1)/2} / k .

Taking a union bound over all j ∈ {1, …, k}, the result follows. ∎
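As a quick numerical sanity check of Lemma 2 (ours, not from the paper), one can estimate the left-hand side by simulation for a standard Gaussian, for which E[|w_1|] = √(2/π) < 1:

```python
import numpy as np

rng = np.random.default_rng(0)
k, trials = 20, 5000
a = np.sqrt(2.0 / np.pi)  # E|w| for a standard Gaussian

w = rng.standard_normal((trials, k))
logs = np.log(np.abs(w))
# max_j |prod_{i != j} w_i| = exp(sum_i log|w_i| - min_i log|w_i|)
loo_max = np.exp(logs.sum(axis=1) - logs.min(axis=1))
bound = a ** ((k - 1) / 2)
freq = (loo_max >= k * bound).mean()  # empirical probability; Lemma 2 bounds it by `bound`
```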

Lemma 3.

Let 0 < α < β and δ > 0 be fixed. Let w ∈ ℝ^k be such that max_j |∏_{i≠j} w_i| ≤ α and min_i |w_i| ≥ δ. Then for any v such that ‖v − w‖ ≤ (δ/√(k−1)) log(β/α), it holds that max_j |∏_{i≠j} v_i| ≤ β, as well as |∏_i v_i| ≤ β ‖v‖_∞.

Proof.

We claim that it is enough to prove the following:

 ∀ w, v ∈ ℝ^k s.t.  max_j | ∏_{i≠j} w_i | ≤ α ,  min_i |w_i| ≥ δ ,  max_j | ∏_{i≠j} v_i | > β ,  it holds that  ‖v − w‖ > (δ/√(k−1)) log(β/α) . (3)

Indeed, this would imply that for any w satisfying the conditions above, and any v s.t. ‖v − w‖ ≤ (δ/√(k−1)) log(β/α), we must have max_j |∏_{i≠j} v_i| ≤ β, and therefore also |∏_i v_i| = |v_j| ⋅ |∏_{i≠j} v_i| ≤ β ‖v‖_∞ (for any j), as required.

To prove Eq. (3), we first state and prove the following auxiliary result:

 ∀ w, v ∈ ℝ^{k−1} s.t.  v_i ≥ w_i for all i ,  ∏_i w_i ≤ α ,  min_i w_i ≥ δ ,  ∏_i v_i > β ,  it holds that  ‖v − w‖ > (δ/√(k−1)) log(β/α) . (4)

This statement holds by the following calculation:

 ‖v − w‖ ≥ (1/√(k−1)) ‖v − w‖_1 = (1/√(k−1)) ∑_i (v_i − w_i) ≥(∗) (1/√(k−1)) ∑_i w_i ( log(v_i) − log(w_i) ) ≥ (δ/√(k−1)) ∑_i ( log(v_i) − log(w_i) ) = (δ/√(k−1)) log( ∏_i v_i / ∏_i w_i ) > (δ/√(k−1)) log(β/α) ,

where (∗) is due to the fact that x ↦ log(x) is (1/w_i)-Lipschitz on [w_i, ∞), and the assumption that v_i ≥ w_i.

It remains to explain how Eq. (4) implies Eq. (3). Indeed, let w, v be any two vectors in ℝ^k which satisfy the conditions of Eq. (3). Now, suppose we transform them into vectors w̃, ṽ ∈ ℝ^{k−1} by the following procedure:

• Change the sign of every w_i and v_i to be positive.

• For any i such that v_i < w_i, change v_i to equal w_i.

• Drop a coordinate j which maximizes |∏_{i≠j} v_i|.

It is easy to verify that the resulting vectors w̃, ṽ satisfy the conditions of Eq. (4), and that ‖ṽ − w̃‖ ≤ ‖v − w‖. Therefore, by Eq. (4), ‖v − w‖ ≥ ‖ṽ − w̃‖ > (δ/√(k−1)) log(β/α) as required. ∎

With these two lemmas in hand, we turn to prove the theorem. By Lemma 2 and Assumption 2, we have

 Pr( max_j | ∏_{i≠j} w_i(1) | ≥ exp(−2Ck) ) ≤ exp(−C′k)

for some fixed constants C, C′ > 0 and any large enough k. Moreover, again by Assumption 2, it holds for any ε ≥ 0 that Pr(|w_i(1)| ≤ ε) = O(ε), so by a union bound,

 Pr( min_i |w_i(1)| < exp(−Ck) ) ≤ O( k exp(−Ck) ) .

Finally, by Assumption 2, Markov's inequality and a union bound,

 Pr( ‖w(1)‖_∞ ≥ exp(Ck) ) ≤ k exp(−Ck) .

Combining the last three displayed equations with a union bound, and applying Lemma 3 (with α = exp(−2Ck), β = 2 exp(−2Ck), and δ = exp(−Ck)), we get the following: With probability at least 1 − exp(−Ω(k)) over the choice of w(1),

• max_j |∏_{i≠j} w_i(1)| ≤ exp(−2Ck), min_i |w_i(1)| ≥ exp(−Ck), and ‖w(1)‖_∞ ≤ exp(Ck).

• For any v at a distance at most (exp(−Ck) log(2))/√(k−1) from w(1), we have

 ‖v‖_∞ ≤ ‖w(1)‖_∞ + (exp(−Ck) log(2))/√(k−1) ≤ O( exp(Ck) ) ,

 | ∏_i v_i | ≤ β ‖v‖_∞ = 2 exp(−2Ck) ⋅ O( exp(Ck) ) = O( exp(−Ck) ) ,

and

 ‖∇F(v)‖ ≤ sup_{p : |p| ≤ β‖v‖_∞} |f′(p)| ⋅ √k β ≤ sup_{p : |p| ≤ O(exp(−Ck))} |f′(p)| ⋅ 2√k exp(−2Ck) = O( √k exp(−2Ck) ) .

This has two implications:

1. Since the gradient descent updates are of the form w(t+1) = w(t) − η∇F(w(t)), and we can assume η ≤ exp(Ck/2) by the theorem's conditions, the number of iterations required to get to a distance larger than (exp(−Ck) log(2))/√(k−1) from w(1) is at least

 ( exp(−Ck) log(2)/√(k−1) ) / ( exp(Ck/2) ⋅ O(√k exp(−2Ck)) ) = Ω( exp(Ck/2)/k ) ,

which is at least exp(Ω(k)) iterations.

2. As long as we are at a distance smaller than the above, |∏_i v_i| ≤ O(exp(−Ck)). In particular, this product is arbitrarily small for large enough k, so by Assumption 1 and the definition of F, we have that F(v) − inf_w F(w) is lower bounded by a constant independent of k.

Overall, we get that with probability at least 1 − exp(−Ω(k)), we initialize at some region in which all points are at least Ω(1)-suboptimal, and at least exp(Ω(k)) iterations are required to escape it. This immediately implies our theorem.

A.2 Proof of Thm. 2

We begin with the following auxiliary lemma, and then turn to analyze the dynamics of gradient descent in our setting.

Lemma 4.

For any positive scalars w_1, …, w_k and any α ∈ [0, min_i w_i], it holds that

 ∏_i (w_i − α) ≤ ( (∏_i w_i)^{1/k} − α )^k .
Proof.

Taking the k-th root and switching sides, the inequality in the lemma is equivalent to proving

 ( ∏_i (w_i − α) )^{1/k} + α ≤ ( ∏_i w_i )^{1/k} .

Letting a_i = w_i − α and b_i = α for all i, the above is equivalent to proving that

 ( ∏_i a_i )^{1/k} + ( ∏_i b_i )^{1/k} ≤ ( ∏_i (a_i + b_i) )^{1/k} ,

namely that the sum of the geometric means of two nonnegative sequences a_1, …, a_k and b_1, …, b_k is at most the geometric mean of their coordinate-wise sum a_1 + b_1, …, a_k + b_k. This follows from the superadditivity of the geometric mean (see Steele (2004, Exercise 2.11)). ∎
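A quick randomized test of the inequality in Lemma 4 (our illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    k = int(rng.integers(2, 8))
    w = rng.uniform(0.5, 3.0, k)            # positive scalars
    alpha = float(rng.uniform(0.0, w.min()))  # 0 <= alpha <= min_i w_i
    lhs = np.prod(w - alpha)
    rhs = (np.prod(w) ** (1.0 / k) - alpha) ** k
    assert lhs <= rhs * (1 + 1e-9) + 1e-9   # small slack for floating point
```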

Lemma 5.

If max_j |w_j(t)| ≤ C and min_j |w_j(t)| ≥ c for some positive constants c, C, then for any j, j′,

 |w_j(t+1)² − w_{j′}(t+1)²| ≤ |w_j(t)² − w_{j′}(t)²| + C″ η² ( ∏_i w_i(t) )² ,

where C″ is some constant dependent only on c, C and the function f.

Proof.

By definition,

 w_j(t+1)² − w_{j′}(t+1)² = ( w_j(t) − η f′(∏_i w_i(t)) ∏_{i≠j} w_i(t) )² − ( w_{j′}(t) − η f′(∏_i w_i(t)) ∏_{i≠j′} w_i(t) )² = w_j(t)² − w_{j′}(t)² + η² f′(∏_i w_i(t))² ( (∏_{i≠j} w_i(t))² − (∏_{i≠j′} w_i(t))² ) = w_j(t)² − w_{j′}(t)² + η² (∏_i w_i(t))² ⋅ f′(∏_i w_i(t))² ( 1/w_j(t)² − 1/w_{j′}(t)² ) ,

where the second equality holds since the cross terms cancel: w_j(t) ∏_{i≠j} w_i(t) = w_{j′}(t) ∏_{i≠j′} w_i(t) = ∏_i w_i(t).

By assumption, 1/w_j(t)² and 1/w_{j′}(t)² are both at most 1/c², and since the w_j(t) are bounded and f is Lipschitz on bounded intervals, |f′(∏_i w_i(t))| is bounded by a constant. Therefore, the displayed equation above implies that

 |w_j(t+1)² − w_{j′}(t+1)²| ≤ |w_j(t)² − w_{j′}(t)²| + C″ η² ( ∏_i w_i(t) )²

for some constant C″ dependent only on c, C and f, as required. ∎
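The key step in this proof, namely that the cross terms cancel and leave only an O(η²) drift in w_j² − w_{j′}², can be checked numerically (our illustration, with f′(p) = 2(p+1) as an example derivative):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01
w = rng.uniform(0.5, 2.0, 6)
p = np.prod(w)
fp = 2.0 * (p + 1.0)            # example f'(p) for f(p) = (p+1)^2
w_new = w - eta * fp * p / w    # one GD step (here prod_{i!=j} w_i = p / w_j)

j, jp = 0, 1
lhs = w_new[j] ** 2 - w_new[jp] ** 2
rhs = (w[j] ** 2 - w[jp] ** 2
       + eta ** 2 * p ** 2 * fp ** 2 * (1.0 / w[j] ** 2 - 1.0 / w[jp] ** 2))
```

The two quantities agree up to floating-point error, confirming the displayed identity.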

Lemma 6.

Suppose that at some iteration t, for some positive constants c, C independent of k, it holds that c ≤ w_j(t) ≤ C for all j, and that ∏_i w_i(t) ≤ β for some β > 0. Then there is some s = O( 1/(η β^{1−1/k} k) ) such that, if w_j(r) > 0 for all j and all r ∈ {t, …, t+s}, then

• Each w_j(r) as well as ∏_i w_i(r) monotonically decreases in r.

• For all r ∈ {t, …, t+s} and all j, w_j(r+1) ≥ w_j(r) − O(1)⋅ηβ.

• ∏_i w_i(t+s) ≤ β/2.

In the above, O(⋅) hides constants dependent only on c, C and the function f.

Proof.

If ∏_i w_i(t) ≤ β/2, we can pick s = 0, and the lemma trivially holds. Otherwise, let r̄ be the smallest index larger than t such that ∏_i w_i(r̄) ≤ β/2 (if no such index exists, take r̄ = ∞, although the arguments below imply that r̄ must be finite). Since we assume the w_i(r) are all positive, and f is monotonically increasing on [0, ∞),

 w_j(r+1) = w_j(r) − η f′( ∏_i w_i(r) ) ∏_{i≠j} w_i(r) ≤ w_j(r) ,

so each w_j(r) monotonically decreases in r. Moreover, these are all positive numbers by assumption, so ∏_i w_i(r) monotonically decreases in r as well. This shows the first part of the lemma.

As to the second part, the displayed equation above, the fact that w_j(r) and ∏_i w_i(r) decrease in r, and our assumptions on f imply that for any r < r̄,

 w_j(r+1) = w_j(r) − η (1/w_j(r)) f′( ∏_i w_i(r) ) ∏_i w_i(r) = w_j(r) − Θ(1)⋅ηβ ,

where Θ(1) hides constants dependent only on c, C and f. As to the third part of the lemma, fix some s with t + s < r̄, and repeatedly apply the displayed equation above for r = t, …, t+s−1, to get that w_i(t+s) ≤ w_i(t) − Θ(1)⋅ηβs for all i (while ∏_i w_i(t+s) is still larger than β/2, by the definition of r̄). In that case,

 ∏_i w_i(t+s) ≤ ∏_i ( w_i(t) − Θ(1)⋅ηβs ) ≤(∗) ( β^{1/k} − Θ(1)⋅ηβs )^k = β ( 1 − Θ(1)⋅η β^{1−1/k} s )^k ≤ β exp( −Θ(1)⋅η β^{1−1/k} s k ) ,

where (∗) follows from Lemma 4 and the fact that ∏_i w_i(t) ≤ β. The right-hand side in turn is at most β/2 for any s ≥ s₀, where s₀ = O( 1/(η β^{1−1/k} k) ). In particular, if r̄ > t + s₀, then by choosing s such that s₀ ≤ s and t + s < r̄, we get that ∏_i w_i(t+s) ≤ β/2 even though t + s < r̄, which contradicts the definition of r̄. Hence r̄ ≤ t + s₀ as stated in the lemma. ∎

Combining Lemma 5 and Lemma 6, we have the following:

Lemma 7.

For any index T, if w_j(t) > 0 for all t ≤ T and all j, then for all such t,

• Each w_j(t) as well as ∏_i w_i(t) monotonically decreases in t.

• For all j, j′,  |w_j(t)² − w_{j′}(t)²| ≤ |w_j(1)² − w_{j′}(1)²| + Õ( η² + η/k ) .

In the above, Õ(⋅) hides factors logarithmic in k and constants dependent only on the constants in Assumptions 1 and 3.

Proof.

The first two parts of the lemma follow from Lemma 6 and the fact that, by Assumption 3, the coordinates of w(1) are positive and bounded, with a bounded product. As to the last part, define t_1 < t_2 < … < t_s (where t_1 = 1) as the first indices such that ∏_i w_i(t_r) ≤ exp(−(r−1)) for all r (where s is taken to be as large as possible subject to t_s ≤ T). By Lemma 6, we have the following:

• For all r < s,  t_{r+1} − t_r ≤ O( 1 + exp(r(1−1/k))/(ηk) ).

• Each w_j(t) as well as ∏_i w_i(t) monotonically decreases in t.

• For all r < s and any t ∈ {t_r, …, t_{r+1} − 1}, we have exp(−r) < ∏_i w_i(t) ≤ exp(−(r−1)).

Combining this with Lemma 5, it follows that for any r < s, and any j, j′,

 |w_j(t_{r+1})² − w_{j′}(t_{r+1})²| ≤ |w_j(t_r)² − w_{j′}(t_r)²| + O(1)⋅η² exp(−2r) ⋅ ( 1 + exp(r(1−1/k))/(ηk) ) ≤ |w_j(t_r)² − w_{j′}(t_r)²| + O(1)⋅( η² exp(−2r) + η exp(−r)/k ) ,

as well as

 |w_j(T)² − w_{j′}(T)²| ≤ |w_j(t_s)² − w_{j′}(t_s)²| + O(1)⋅( η² exp(−2s) + η exp(−s)/k ) .

Repeatedly applying the last two displayed equations, and using Assumption 3, we get that

 |wj(T)2−wj′(T)2| ≤ |wj(1)2−wj′(1)