One of the biggest open problems in theoretical machine learning is to explain why deep artificial neural networks can be efficiently trained in practice, using simple gradient-based methods. Such training requires optimizing complex, highly non-convex objective functions, which seem intractable from a worst-case viewpoint. Over the past few years, much research has been devoted to this question, but it remains largely unanswered.
To understand simpler versions of this question, significant attention has been devoted to linear neural networks, which are predictors of the form $\mathbf{x} \mapsto W_k W_{k-1} \cdots W_1 \mathbf{x}$, with $W_1, \ldots, W_k$ being a set of parameter matrices and $k$ being the depth parameter (e.g., Saxe et al. (2013); Kawaguchi (2016); Hardt and Ma (2016); Lu and Kawaguchi (2017); Bartlett et al. (2018); Laurent and Brecht (2018)). The optimization problem associated with training such networks can be formulated as
\[ \min_{W_1, \ldots, W_k} F(W_1, \ldots, W_k) := f(W_k W_{k-1} \cdots W_1) \tag{1} \]
for some matrix-valued function $f$. Although much simpler than general feedforward neural networks (which involve additional non-linear functions), it is widely believed that Eq. (1) captures important aspects of neural network optimization problems. Moreover, Eq. (1) has a simple algebraic structure, which makes it more amenable to analysis. In particular, it is known that when $f$ is convex and differentiable, Eq. (1) has no local minima except global ones (see Laurent and Brecht (2018) and references therein). In other words, if an optimization algorithm converges to some local minimum, then it must converge to a global minimum.
Importantly, this no-local-minima result does not imply that gradient-based methods indeed solve Eq. (1) efficiently: even when they converge to local minima (which is not always guaranteed, say if the parameters diverge), the number of required iterations might be arbitrarily large. To study this question, Bartlett et al. (2018) recently considered the special case where $f(W) = \frac{1}{2}\|W - \Phi\|_F^2$ ($\|\cdot\|_F$ being the Frobenius norm) for square matrices $W, \Phi$, using gradient descent starting from $W_i = I$ for all $i$. Specifically, the authors prove a polynomial-time convergence guarantee when $\Phi$ is positive semidefinite. On the other hand, when $\Phi$ is symmetric with negative eigenvalues, it is shown that gradient descent with this initialization will never converge.
Although these results provide important insights, they crucially assume that each $W_i$ is initialized exactly at the identity $I$. Since in practice parameters are initialized randomly, it is natural to ask whether such results hold with random initialization. Indeed, even though gradient descent might fail to converge with a specific initialization, it could be that even a tiny random perturbation is sufficient for polynomial-time convergence. To take a particularly simple special case, consider the objective $F(w_1, w_2) = (w_1 w_2 + 1)^2$ for $w_1, w_2 \in \mathbb{R}$. It is an easy exercise to show that gradient descent starting from any $w_1 = w_2$ (and sufficiently small step sizes) will fail to converge to an optimal solution (essentially, at any iteration $w_1$ and $w_2$ will remain equal and nonnegative, hence $w_1 w_2 \geq 0$ and $F(w_1, w_2) \geq 1$). On the other hand, polynomial-time convergence holds with random initialization (see Du et al. (2018)).
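The contrast between an exactly symmetric start and a slightly perturbed one can be checked numerically. The following is only an illustrative sketch: it assumes the two-parameter objective $F(w_1, w_2) = (w_1 w_2 + 1)^2$, and the step size, iteration budget and perturbation size are arbitrary choices made here, not values taken from the analysis.

```python
def run_gd(w1, w2, eta=0.01, steps=5000):
    """Plain gradient descent on the assumed objective F(w1, w2) = (w1*w2 + 1)**2."""
    for _ in range(steps):
        r = 2.0 * (w1 * w2 + 1.0)  # derivative of (p + 1)^2 at p = w1 * w2
        # Simultaneous update of both coordinates (chain rule).
        w1, w2 = w1 - eta * r * w2, w2 - eta * r * w1
    return (w1 * w2 + 1.0) ** 2

# Symmetric start: the two coordinates stay equal, so the product stays nonnegative.
loss_symmetric = run_gd(1.0, 1.0)
# Hypothetical tiny perturbation of one coordinate breaks the symmetry.
loss_perturbed = run_gd(1.0, 1.0 + 1e-3)
print(loss_symmetric, loss_perturbed)
```

In this sketch the symmetric run stalls at objective value 1 (the saddle at the origin), while the perturbed run escapes and drives the objective to essentially zero.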
Unfortunately, analyzing the dynamics of gradient descent on objectives such as Eq. (1) appears to be quite challenging. Thus, in this note, we consider a more tractable special case of Eq. (1), where the matrices are scalars:
\[ \min_{\mathbf{w} \in \mathbb{R}^k} F(\mathbf{w}) := f\Big(\prod_{i=1}^{k} w_i\Big). \tag{2} \]
We show that under mild conditions on the function , and with standard initializations (including Xavier initialization and any reasonable initialization close to ), gradient descent will require iterations to converge. We complement this by showing that iterations suffice for convergence to an -optimal point in these cases. The take-home message is that even if we focus on linear neural networks, natural objective functions, and random initializations, the associated optimization problems can be intractable for gradient descent to solve when the depth is large. As we discuss in Sec. 4, this does not mean that gradient-based methods cannot learn deep linear networks in general. However, the results do imply that one would need to make additional assumptions or algorithmic modifications to circumvent these negative results.
Finally, we note that our results provide a possibly interesting contrast to the recent work of Arora et al. (2018), which suggests that increasing depth can sometimes accelerate the optimization process. Here we show that at least in some cases, the opposite occurs: Adding depth can quickly turn a trivial optimization problem into an intractable one for gradient descent.
We use bold-faced letters to denote vectors. Given a vector, refers to its -th coordinate. , and refer to the Euclidean norm, the -norm and the infinity norm respectively. We let and be a shorthand for . Also, we define a product over an empty set as being equal to . Since our main focus is to study the dependence on the network depth , we use the standard notation to hide constants independent of , and to hide constants and factors logarithmic in .
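Gradient Descent. We consider the standard gradient descent method for unconstrained optimization of functions $F$ in Euclidean space, which given an initialization point $\mathbf{w}(1)$, performs repeated iterations of the form $\mathbf{w}(t+1) = \mathbf{w}(t) - \eta \nabla F(\mathbf{w}(t))$ for $t = 1, 2, \ldots$ (where $\nabla F(\cdot)$ is the gradient, and $\eta > 0$ is a step size parameter). For objectives as in Eq. (2), we have $\frac{\partial}{\partial w_i} F(\mathbf{w}) = f'\big(\prod_{j} w_j\big) \prod_{j \neq i} w_j$, and gradient descent takes the form
\[ w_i(t+1) = w_i(t) - \eta\, f'\Big(\prod_{j} w_j(t)\Big) \prod_{j \neq i} w_j(t) \quad \text{for all } i. \]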
Random Initialization. One of the most common initialization methods for neural networks is Xavier initialization (Glorot and Bengio, 2010), which in the setting of Eq. (1) corresponds to choosing each entry of each $d \times d$ matrix $W_i$ independently from a zero-mean distribution with variance $1/d$ (usually uniform or Gaussian). This ensures that the variance of the network outputs (with respect to the initialization) is constant irrespective of the network size. Motivated by residual networks, Hardt and Ma (2016) and Bartlett et al. (2018) consider initializing each $W_i$ independently at $I$, possibly with some random perturbation. In this paper we refer to such an initialization scheme as a near-identity initialization. Since we focus here on the case $d = 1$ as in Eq. (2), Xavier initialization corresponds to choosing each $w_i$ independently from a zero-mean, unit-variance distribution, and near-identity initialization corresponds to choosing each $w_i$ close to $1$.
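The setup described in this section (gradient descent on a function of the product of scalar parameters, started from either kind of initialization) can be sketched in a few lines. The function names, the perturbation `scale`, and the choice $f(x) = (x+1)^2$ in the usage lines are assumptions made for illustration only.

```python
import random

def grad_F(w, f_prime):
    """Gradient of F(w) = f(prod_i w_i): dF/dw_i = f'(prod_j w_j) * prod_{j != i} w_j."""
    k = len(w)
    prefix = [1.0] * (k + 1)          # prefix[i] = w[0] * ... * w[i-1]
    for i in range(k):
        prefix[i + 1] = prefix[i] * w[i]
    suffix = [1.0] * (k + 1)          # suffix[i] = w[i] * ... * w[k-1]
    for i in range(k - 1, -1, -1):
        suffix[i] = suffix[i + 1] * w[i]
    fp = f_prime(prefix[k])           # f' evaluated at the full product
    # An empty product equals 1, so k = 1 is handled without special cases.
    return [fp * prefix[i] * suffix[i + 1] for i in range(k)]

def gd_step(w, f_prime, eta):
    """One plain gradient descent step with step size eta."""
    return [wi - eta * gi for wi, gi in zip(w, grad_F(w, f_prime))]

def xavier_init(k, rng):
    """Scalar Xavier initialization: i.i.d. zero-mean, unit-variance (Gaussian here)."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]

def near_identity_init(k, rng, scale=0.01):
    """Each coordinate close to 1; the uniform perturbation is a hypothetical choice."""
    return [1.0 + scale * rng.uniform(-1.0, 1.0) for _ in range(k)]

rng = random.Random(0)
f_prime = lambda p: 2.0 * (p + 1.0)   # assumed example: f(x) = (x + 1)^2
w = near_identity_init(5, rng)
for _ in range(10):
    w = gd_step(w, f_prime, eta=0.01)
```

Prefix and suffix products are used so that the leave-one-out products never divide by a coordinate, which matters once some coordinate is close to zero.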
3 Exponential Convergence Time for Gradient Descent
For our negative results, we impose the following mild conditions on the function in Eq. (2):
is differentiable, Lipschitz continuous and strictly monotonically increasing on any interval where . Moreover, .
Here, we assume that $f$ is fixed, and our goal is to study the convergence time of gradient descent on Eq. (2) as a function of the depth $k$. Some simple examples satisfying Assumption 1 in the context of machine learning include $f(x) = (x+1)^2$ and $f(x) = \log(1 + \exp(x))$ (e.g., squared loss and logistic loss with respect to the input/output pair $(1, -1)$, respectively). We note that this non-symmetry with respect to positive/negative values is completely arbitrary, and one can prove similar results if their roles are reversed.
3.1 Xavier Initialization
We begin with the case of Xavier initialization, where we initialize all coordinates of in Eq. (2) independently from a zero-mean, unit variance distribution. We will consider any distribution which satisfies the following:
are drawn i.i.d. from a zero-mean, unit variance distribution such that
where are absolute constants independent of .
The first part of the assumption is satisfied for any distribution with bounded density. As to the second part, the following lemma shows that it is satisfied for uniform and Gaussian distributions (with an explicit), and in fact for any non-trivial distribution (with a distribution-dependent ):
If is drawn from a zero-mean, unit-variance Gaussian, then .
is drawn from a zero-mean, unit-variance uniform distribution, then.
If is drawn from any zero-mean, unit variance distribution not supported on a single value, then .
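The first two parts follow from standard results on Gaussian and uniform distributions. As to the third part, by Jensen's inequality and the fact that $\log(\cdot)$ is a strictly concave function, $\mathbb{E}[\log|w|] = \frac{1}{2}\mathbb{E}[\log(w^2)] < \frac{1}{2}\log(\mathbb{E}[w^2]) = 0$. ∎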
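The negativity of the expected log-magnitude for the two standard choices can be sanity-checked by simulation. The sketch below compares Monte Carlo estimates against the closed forms $-(\gamma + \log 2)/2$ for a standard Gaussian and $\log\sqrt{3} - 1$ for the unit-variance uniform distribution on $[-\sqrt{3}, \sqrt{3}]$. These closed forms are standard facts supplied here for illustration and are not quoted from the lemma; the sample size and seed are arbitrary.

```python
import math
import random

def mean_log_abs(sampler, n=200_000):
    """Monte Carlo estimate of E[log|w|] for draws produced by sampler()."""
    total = 0.0
    for _ in range(n):
        # Clamp away from zero so that log is always defined.
        total += math.log(max(abs(sampler()), 1e-300))
    return total / n

rng = random.Random(0)
gauss_est = mean_log_abs(lambda: rng.gauss(0.0, 1.0))

a = math.sqrt(3.0)  # uniform on [-a, a] has variance a^2 / 3 = 1
unif_est = mean_log_abs(lambda: rng.uniform(-a, a))

EULER_GAMMA = 0.5772156649015329
gauss_exact = -(EULER_GAMMA + math.log(2.0)) / 2.0  # roughly -0.635
unif_exact = math.log(a) - 1.0                      # roughly -0.451

print(gauss_est, gauss_exact)
print(unif_est, unif_exact)
```

Both estimates come out strictly negative, which is the property the argument relies on.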
With such an initialization, we now show that gradient descent is overwhelmingly likely to take at least exponential time to converge:
In the above, hides dependencies on the absolute constants in the theorem statement and the assumptions. The proof (as all other major proofs in this paper) appears in the appendix.
The intuition behind the theorem is quite simple: under our assumptions, it is easy to show that the product of any $k-1$ coordinates from $\mathbf{w}$ is overwhelmingly likely to be exponentially small in $k$. Since the derivative of our objective w.r.t. any $w_i$ has the form $f'\big(\prod_j w_j\big) \prod_{j \neq i} w_j$, it follows that the gradient is exponentially small in $k$. Moreover, we show that the gradient is exponentially small at any point within a bounded distance from the initialization (which is the main technical challenge of the proof, since the gradient is by no means Lipschitz). As a result, gradient descent will only make exponentially small steps. Assuming we start from a point bounded away from a global minimum, it follows that the number of required iterations must be exponentially large in $k$.
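The vanishing-product effect behind this intuition is easy to observe empirically. The sketch below draws i.i.d. standard Gaussian coordinates and measures how often the product of $k-1$ of them falls below $\exp(-ck)$. The rate `c = 0.25`, the trial count and the seed are hypothetical choices for illustration, not constants from the theorem.

```python
import math
import random

def log_abs_product(k, rng):
    """log|w_1 * ... * w_{k-1}| for i.i.d. standard Gaussians, summed in log space."""
    return sum(
        math.log(max(abs(rng.gauss(0.0, 1.0)), 1e-300)) for _ in range(k - 1)
    )

def fraction_tiny(k, trials=2000, rate=0.25, seed=0):
    """Fraction of draws whose product magnitude is below exp(-rate * k)."""
    rng = random.Random(seed)
    hits = sum(log_abs_product(k, rng) < -rate * k for _ in range(trials))
    return hits / trials

for k in (10, 100, 400):
    print(k, fraction_tiny(k))
```

Working in log space avoids floating-point underflow, since the product itself quickly drops below the smallest representable double as the depth grows.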
We note that the observation that Xavier initialization leads to highly skewed values in deep enough networks is not new (see Saxe et al. (2013); Pennington et al. (2017)), and has motivated alternative initializations such as orthogonal initialization. (Footnote 2: It is interesting to note that in our setting, orthogonal initialization amounts to choosing each $w_i$ in $\{-1, +1\}$, which can easily cause non-convergence, e.g. for when and small enough step sizes.) Our main contribution is to rigorously analyze how this affects the optimization process for our setting.
3.2 Near-Identity Initialization
We now turn to consider initializations where each is initialized close to . Here, it will be convenient to make deterministic rather than stochastic assumptions on the initialization point (which are satisfied with high probability for reasonable distributions):
For some absolute constants independent of , gradient descent is initialized at a point which satisfies and .
To justify this assumption, note that if are chosen i.i.d. and not in the range of for some , then their product is likely to explode or vanish with .
As before, hides dependencies on the absolute constants in the theorem statement, as well as those in the assumptions.
The formal proof appears in the appendix. To help explain its intuition, we provide in Figure 1 the actual evolution of for a typical run of gradient descent, when and we initialize all coordinates reasonably close to . Recall that for any , the gradient descent updates take the form
where . Thus, initially, all parameters decrease with $t$, as is to be expected. However, as their values fall to around or below , their product decreases rapidly to $0$. Since the gradient of each $w_i$ scales as $\prod_{j \neq i} w_j$, the magnitude of the gradients becomes very small, and the algorithm makes only slow progress. Eventually, one of the parameters becomes negative, in which case all other parameters start increasing, and the algorithm converges. However, the length of the slow middle phase can be shown to be exponential in the depth / number of parameters $k$.
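The three phases described above (fast initial decrease, a long plateau, then convergence once one parameter turns negative) can be reproduced in a small simulation. This is a sketch under stated assumptions and does not reproduce the settings of Figure 1: it takes $f(x) = (x+1)^2$, starts one coordinate at 0.9 and the rest at 1.0 so that a single coordinate crosses zero first, and uses an arbitrary step size and tolerance.

```python
def iterations_to_converge(k, eta=0.01, tol=1e-3, max_iters=500_000):
    """Count gradient descent steps until F(w) = (prod_i w_i + 1)^2 <= tol."""
    w = [0.9] + [1.0] * (k - 1)   # assumed near-identity start (hypothetical values)
    for t in range(max_iters):
        p = 1.0
        for wi in w:
            p *= wi
        if (p + 1.0) ** 2 <= tol:
            return t
        fp = 2.0 * (p + 1.0)      # f'(p) for the assumed f(x) = (x + 1)^2
        grad = []
        for i in range(k):
            others = 1.0          # product of all coordinates except the i-th
            for j in range(k):
                if j != i:
                    others *= w[j]
            grad.append(fp * others)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return max_iters              # budget exhausted without reaching tol

counts = {k: iterations_to_converge(k) for k in (3, 5, 7, 9)}
for k, n in counts.items():
    print(k, n)
```

In this sketch the iteration count grows rapidly as the depth increases, with most of the steps spent in the plateau where the leave-one-out products are small.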
3.3 A Positive Result
Having established that the number of iterations is at least , we now show that this is nearly tight. Specifically, we prove that gradient descent indeed converges in the settings studied so far, with a number of iterations scaling as (this can be interpreted as a constant for any constant ). For simplicity, we prove this in the case where , but the technique can be easily generalized to other convex under mild conditions. We note that the case of and each initialized to is covered by the results in Bartlett et al. (2018). However, here we show a convergence result for other values of , and even if are not all initialized at .
We will use the following assumptions on our objective and parameters:
The following hold for some absolute positive constants independent of :
The initialization satisfies the following:
The assumptions and ensure that the objective satisfies the conditions of our negative results, for both Xavier and near-identity initializations (the other cases can be studied using similar techniques).
Consider the objective . Under Assumption 4, for any step size for some large enough constant , and for any , the number of gradient descent iterations required for is at most .
In this work, we showed that for one-dimensional deep linear neural networks, gradient descent can easily require exponentially many iterations (in the depth of the network) to converge. It is important to emphasize, though, that this does not imply that gradient-based methods fail to learn linear networks in general. First of all, our results are specific to the case where the parameter matrix of each layer is one-dimensional, and do not necessarily extend to higher dimensions. A possibly interesting exception is when $f(W) = \frac{1}{2}\|W - \Phi\|_F^2$, and both $\Phi$ and the initialization are diagonal matrices. In that case, it is easy to show that the matrices produced by gradient descent remain diagonal, and the objective can be rewritten as a sum of independent one-dimensional problems for which our results would apply. However, this reasoning fails for non-diagonal initializations and target matrices $\Phi$. Based on some numerical experiments, we believe that even in the non-diagonal case, gradient descent can sometimes require exponential time to converge, but this phenomenon is not particularly common, and it is quite likely that it can be avoided under reasonable assumptions. Finally, we focused on standard gradient descent, and it is quite possible that our results can be circumvented using other gradient-based algorithms (for example, by adding random noise to the gradient updates or using adaptive step sizes).
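The diagonal reduction mentioned above can be written out explicitly. This is a sketch assuming the objective is the squared Frobenius distance to a target matrix; the symbols $\Phi$, $d$, $w_{i,j}$ and $\phi_j$ are notation introduced here for illustration.

```latex
% Assumed notation: W_i = diag(w_{i,1}, ..., w_{i,d}) and Phi = diag(phi_1, ..., phi_d).
\begin{align*}
\Big\| \prod_{i=1}^{k} W_i - \Phi \Big\|_F^2
  &= \Big\| \mathrm{diag}\Big( \prod_{i=1}^{k} w_{i,1} - \phi_1, \;\ldots,\;
       \prod_{i=1}^{k} w_{i,d} - \phi_d \Big) \Big\|_F^2 \\
  &= \sum_{j=1}^{d} \Big( \prod_{i=1}^{k} w_{i,j} - \phi_j \Big)^2 .
\end{align*}
```

Each summand depends only on the $j$-th diagonal entries of the layers, so the $d$ scalar problems decouple, and each has the form of Eq. (2) with a squared loss.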
Despite these reservations, we believe our results point to a potential obstacle in understanding the convergence of gradient-based methods for linear networks: At the very least, one would have to rule out one-dimensional layers, or consider algorithms other than plain gradient descent, in order to establish polynomial-time convergence guarantees for deep linear networks.
- Arora et al.  Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
- Bartlett et al.  Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning, pages 520–529, 2018.
- Du et al.  Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.
- Glorot and Bengio  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
- Hardt and Ma  Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- Karimi et al.  Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
- Kawaguchi  Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
- Laurent and Brecht  Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pages 2908–2913, 2018.
- Lu and Kawaguchi  Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.
- Pennington et al.  Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pages 4785–4795, 2017.
- Polyak  Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
- Saxe et al.  Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Steele  J Michael Steele. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
Appendix A Proofs
A.1 Proof of Thm. 1
The proof is based on the following two lemmas:
Suppose are drawn i.i.d. from a distribution such that for some . Then
For any fixed , by Markov’s inequality and the i.i.d. assumption,
Taking a union bound over all , the result follows. ∎
Let be fixed. Let such that and . Then for any such that , it holds that as well as .
We claim that it is enough to prove the following:
Indeed, this would imply that for any satisfying the conditions above, and any s.t. , we must have , and therefore , as well as by definition of , as required.
To prove Eq. (3), we first state and prove the following auxiliary result:
This statement holds by the following calculation:
where is due to the fact that is -Lipschitz in , and the assumption that .
Change the sign of every and to be positive
For any such that , change to equal .
Drop a coordinate which maximizes .
for some fixed constants and any large enough . Moreover, again by Assumption 2, it holds for any that , so by a union bound,
Finally, by Assumption 2, Markov’s inequality and a union bound,
Combining the last three displayed equations with a union bound, and applying Lemma 3 (with , , and ), we get the following: With probability at least over the choice of ,
For any at a distance at most from , we have
This has two implications:
Since the gradient descent updates are of the form , and we can assume by the theorem’s conditions, the number of iterations required to get to a distance larger than from is at least
which is at least iterations.
As long as we are at a distance smaller than the above, . In particular, for large enough , so by Assumption 1 and definition of , we have that is lower bounded by a constant independent of .
Overall, we get that with probability at least , we initialize at some region in which all points are at least suboptimal, and at least iterations are required to escape it. This immediately implies our theorem.
A.2 Proof of Thm. 2
We begin with the following auxiliary lemma, and then turn to analyze the dynamics of gradient descent in our setting.
For any positive scalars such that ,
Taking the -th root and switching sides, the inequality in the lemma is equivalent to proving
Letting , and for all , the above is equivalent to proving that
namely that the sum of the geometric means of two positive sequences $(a_i)$ and $(b_i)$ is at most the geometric mean of their sum $(a_i + b_i)$. This follows from the superadditivity of the geometric mean (see Steele [2004, Exercise 2.11]). ∎
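The superadditivity property used in this step can be spot-checked numerically. The sketch below tests the inequality $\mathrm{GM}(a) + \mathrm{GM}(b) \le \mathrm{GM}(a + b)$ on random positive sequences; the sequence lengths, value ranges and seed are arbitrary choices, and a numerical check of this kind is an illustration rather than a proof.

```python
import math
import random

def geometric_mean(xs):
    """Geometric mean of a sequence of positive numbers, computed in log space."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def superadditivity_gap(a, b):
    """GM(a + b) - (GM(a) + GM(b)); nonnegative for positive sequences."""
    summed = [x + y for x, y in zip(a, b)]
    return geometric_mean(summed) - geometric_mean(a) - geometric_mean(b)

rng = random.Random(0)
gaps = []
for _ in range(1000):
    n = rng.randint(1, 8)
    a = [rng.uniform(0.1, 5.0) for _ in range(n)]
    b = [rng.uniform(0.1, 5.0) for _ in range(n)]
    gaps.append(superadditivity_gap(a, b))

print(min(gaps))
```

The gap is zero (up to rounding) exactly when the two sequences are proportional, for instance `a = [1, 2, 3]` and `b = [2, 4, 6]`.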
If and for some positive constants , then for any ,
where is some constant dependent only on and the function .
By assumption, and . Therefore, by our assumptions on , the displayed equation above implies that
for some constant dependent on and as required. ∎
Suppose that at some iteration , for some constant independent of , it holds that and for some . Then after at most iterations, if for all , then
Each as well as monotonically decrease in
For all ,
In the above, hides constants dependent only on and the function .
If , we can pick , and the lemma trivially holds. Otherwise, let be the smallest (positive) index such that (if no such index exists, take , although the arguments below imply that must be finite). Since we assume for all are positive, and is monotonically increasing,
so monotonically decreases in . Moreover, these are all positive numbers by assumption, so monotonically decreases in as well. This shows the first part of the lemma.
As to the second part, the displayed equation above, the fact that and decrease in , and our assumptions on imply that for any ,
where hides constants dependent only on and . As to the third part of the lemma, fix some , and repeatedly apply the displayed equation above for , to get that (which is still by the lemma assumptions). In that case,
where follows from Lemma 4 and the fact that . The right hand side in turn is at most for any for some constant . In particular, if , then by choosing s.t. , we get that even though , which contradicts the definition of . Hence as stated in the lemma. ∎
The first two parts of the lemma follow from Lemma 6 and the fact that by Assumption 3, . As to the last part, define (where ) as the first indices such that for all , (where is taken to be as large as possible). By Lemma 6, we have the following:
For all , .
For all and any , we have .
Combining this with Lemma 5, it follows that for any , and any ,
as well as