Recent success in machine learning involves training deep neural networks using randomly initialized first-order methods, which requires optimizing highly non-convex functions. Compared with nonlinear deep neural networks, deep linear networks are arguably more amenable to theoretical analysis. It is widely believed that deep linear networks already capture important aspects of optimization in deep learning (Saxe et al., 2014). Therefore, theoreticians have studied this problem in recent years. However, a strong global convergence guarantee is still missing.
A series of recent papers analyzed the landscape of the deep linear network optimization problem (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al., 2017; Zhou & Liang, 2018; Laurent & Brecht, 2018). However, these results do not imply convergence of gradient-based methods to the global minimum. Recently, Bartlett et al. (2018) and Arora et al. (2018a) directly analyzed the trajectory generated by gradient descent, and showed that gradient descent converges to a global minimum under further assumptions on both the data and the global minimum. These results require specially designed initialization schemes, and do not apply to commonly used random initializations. In Section 2.1 we describe the above results in more detail.
A recent work by Shamir (2018) showed an exponential lower bound for randomly initialized gradient descent on narrow linear neural networks. More precisely, he showed that for an $L$-layer linear neural network in which the input, output and all hidden dimensions are equal to $1$, gradient descent with Xavier initialization (Glorot & Bengio, 2010) requires at least $\exp(\Omega(L))$ iterations to converge. This result demonstrates the intrinsic difficulty of optimizing deep networks: even in the basic setting, the convergence time for randomly initialized gradient descent can be exponential in depth. Nevertheless, this lower bound only holds for narrow neural networks. It is possible that making the hidden layers wider (which is usually the case in practice) can eliminate such exponential dependence on depth. This gives rise to the following questions:
Can randomly initialized gradient descent optimize wide deep linear networks in polynomial time? If so, what is a sufficient width in hidden layers?
We answer the first question positively and give a concrete quantitative result for the second question. We prove that as long as the width of hidden layers is at least $\tilde{\Omega}(L)$, where $L$ is the depth (we omit dependence on other parameters here; see Theorem 4.1 for the precise requirements), gradient descent with Xavier initialization converges with high probability to the global minimum of the $\ell_2$ loss at a linear rate, with no assumptions on the input data or the target. To our knowledge, this is the first polynomial time global convergence guarantee for randomly initialized gradient descent for deep linear networks. Furthermore, our convergence rate is tight in the sense that it matches the convergence rate of applying gradient descent to the convex ($1$-layer) linear regression problem.
Our polynomial upper bound for the wide linear neural network and the exponential lower bound for the narrow linear neural network together demonstrate that width provably matters in guaranteeing the efficiency of randomly initialized first-order methods for optimizing deep linear nets. (There are other techniques to help optimization in deep models, such as skip-connections (He et al., 2016). Analyses of those approaches are beyond the scope of this paper.)
Our proof technique is related to prior work which utilized a time-varying Gram matrix (or preconditioner) along the trajectory of gradient descent. We adopt the same idea of using such a Gram matrix. In the setting of wide linear neural networks, we carefully upper and lower bound the eigenvalues of this Gram matrix throughout the optimization process, which together with some perturbation analysis implies linear convergence. In order to establish this at initialization, we need to analyze spectral properties of products of Gaussian random matrices, and then show that these properties hold throughout the trajectory of gradient descent.
2 Related Work
2.1 Optimization for Deep Linear Neural Networks
Ge et al. (2015) and Jin et al. (2017) showed that if an objective function satisfies (1) all local minima are global, and (2) all saddle points are strict (i.e., there exists a direction of negative curvature), then randomly perturbed gradient descent can escape all saddle points and find a global minimum. Motivated by this, a series of papers (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al., 2017; Zhou & Liang, 2018; Laurent & Brecht, 2018) studied these landscape properties for optimizing deep linear networks. While it was established that all local minima are global, unfortunately the strict saddle property is not satisfied even for $3$-layer linear neural networks. Therefore, using landscape properties alone is not sufficient for proving global convergence.
Instead of using the indirect landscape-based approach, an alternative is to directly analyze the trajectory generated by a concrete optimization algorithm like gradient descent. The current paper also belongs to this category.
Saxe et al. (2014) gave a thorough empirical study on deep linear networks, showing that they exhibit some learning patterns similar to nonlinear networks. Ji & Telgarsky (2019) studied the dynamics of gradient descent to optimize a deep linear neural network for classification problems, and showed that the risk converges to $0$ and the solution found is a max-margin solution. Arora et al. (2018b) observed that adding more layers can accelerate optimization for certain loss functions. Du et al. (2018a) showed that using gradient descent, layers are automatically balanced.
None of the above results show concrete convergence rates of gradient descent. The most related papers are Bartlett et al. (2018) and Arora et al. (2018a). Here we give a detailed description of their results.
Bartlett et al. (2018) showed that if one uses identity initialization, the input data is whitened, and the target matrix is either close to identity or positive definite, then gradient descent converges to the target matrix at a linear rate. Their result highly depends on the identity initialization scheme and has strong requirements on the input data and the target. Arora et al. (2018a) showed that if the initialization is balanced and the initial loss is smaller than the loss of any low-rank solution by a margin, then gradient descent converges to a global minimum at a linear rate. However, their initialization scheme requires a special SVD step which is not used in practice, and the initial loss condition happens with exponentially small probability when the input and output dimensions are large. Our result improves upon these two papers by (i) allowing fully random initialization, and (ii) removing all assumptions on the input data and the target.
2.2 Optimization for Other Neural Networks
Many papers tried to identify the two desired geometric landscape properties of objective functions for non-linear neural networks (Freeman & Bruna, 2016; Nguyen & Hein, 2017; Venturi et al., 2018; Soudry & Carmon, 2016; Du & Lee, 2018; Soltanolkotabi et al., 2018; Haeffele & Vidal, 2017). Unfortunately, these properties do not hold even for simple non-linear shallow neural networks (Yun et al., 2019; Safran & Shamir, 2018).
A series of recent papers used trajectory-based methods to analyze gradient descent for shallow neural networks under strong data assumptions (Tian, 2017; Soltanolkotabi, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Zhong et al., 2017; Zhang et al., 2018; Du et al., 2018c, d). These results are restricted to shallow neural networks, and the assumptions are not satisfied in practice.
For deep ReLU neural networks, Allen-Zhu et al. (2018) and Zou et al. (2018) showed that if the width of hidden layers is , then gradient descent converges to zero loss. ( is the number of training samples.) Du et al. (2018b) considered non-linear smooth activation functions like soft-plus, and showed that if the width of hidden layers is , then gradient descent converges to zero loss. (They also showed that if one uses skip-connections (He et al., 2016), then the width only depends polynomially on the depth. We only focus on fully-connected neural networks in this paper.) All these results need additional assumptions on data, which also show up in the required width. Compared with them, we have a much better bound on the required width ( vs. or ), although this is not a fair comparison because linear networks are simpler than non-linear ones. But given that we obtain a near linear dependence on depth, our result may shed light on the limit of required width in optimizing non-linear neural networks.
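3 Preliminaries
3.1 Notation
We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. For a symmetric matrix $A$, let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ be its maximum and minimum eigenvalues, and let $\lambda_i(A)$ be its $i$-th largest eigenvalue. Similarly, for a general matrix $B$, let $\sigma_{\max}(B)$ and $\sigma_{\min}(B)$ be its maximum and minimum singular values, and let $\sigma_i(B)$ be its $i$-th largest singular value.
Let $I$ be the identity matrix and $[n] = \{1, 2, \ldots, n\}$. Denote by $\mathcal{N}(0,1)$ the standard Gaussian distribution, and by $\chi^2_k$ the $\chi^2$ distribution with $k$ degrees of freedom. Let $\mathbb{S}^{d-1}$ be the unit sphere in $\mathbb{R}^d$.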
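Let $\mathrm{vec}(A)$ be the vectorization of a matrix $A$ in column-first order. The Kronecker product between two matrices $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$ is defined as
$$A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,n_1} B \\ \vdots & \ddots & \vdots \\ a_{m_1,1} B & \cdots & a_{m_1,n_1} B \end{pmatrix} \in \mathbb{R}^{m_1 m_2 \times n_1 n_2},$$
where $a_{i,j}$ is the element in the $(i,j)$-th entry of $A$.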
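As a quick numerical illustration of these two definitions, the following minimal NumPy sketch builds a Kronecker product and checks the standard identity $\mathrm{vec}(ACB) = (B^\top \otimes A)\,\mathrm{vec}(C)$ for column-first vectorization. The helper name `vec` and the matrix sizes are our own illustrative choices.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-first order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 4))

# Block (i, j) of the Kronecker product equals A[i, j] * B.
K = np.kron(A, B)
print(K.shape)

# Identity linking vec and the Kronecker product:
# vec(A C B) = (B^T kron A) vec(C).
lhs = vec(A @ C @ B)
rhs = np.kron(B.T, A) @ vec(C)
print(np.allclose(lhs, rhs))
```

The identity only holds with column-first vectorization, which is why `order="F"` is used instead of NumPy's default row-first flattening.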
We use $C$ to represent a sufficiently large universal constant throughout the paper. The specific value of $C$ can be different from line to line.
3.2 Problem Setup
We are given training samples . Let be the input data matrix and be the label matrix.
Consider the problem of training a depth- linear neural network with hidden layer width by minimizing the loss over data:
where and are weight matrices to be learned. Here is a scaling factor corresponding to Xavier initialization (Glorot & Bengio, 2010), for which we provide a justification in Section 3.3. (We adopt this scaling factor so that we can initialize all weights from $\mathcal{N}(0,1)$.)
We consider the vanilla gradient descent (GD) algorithm for objective (1) with random initialization:
We initialize all the entries of independently from . Let be the weight matrices at initialization.
Then we update the weights using GD: for and ,
where is the learning rate.
For notational convenience, we denote for every . We also define (of appropriate dimension) for completeness.
We use the time index for all variables that depend on , e.g., , , etc.
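The setup can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a definitive implementation: we assume the objective has the form $\frac{1}{2}\|\alpha W_L \cdots W_1 X - Y\|_F^2$ with scaling factor $\alpha = 1/\sqrt{m^{L-1} d_{\mathrm{out}}}$, all weights drawn i.i.d. from $\mathcal{N}(0,1)$, and a conservative step size of our own choosing. The symbols `alpha`, `d_in`, `d_out` and the function names are illustrative.

```python
import numpy as np

def init_weights(d_in, d_out, m, L, rng):
    """Draw every entry i.i.d. from N(0, 1): W_1 is m x d_in, W_L is d_out x m."""
    dims = [d_in] + [m] * (L - 1) + [d_out]
    return [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]

def product(mats, size):
    """Return mats[-1] @ ... @ mats[0], or the identity of the given size if empty."""
    out = np.eye(size)
    for W in mats:
        out = W @ out
    return out

def loss_and_grads(Ws, X, Y, alpha):
    """Loss 0.5 * ||alpha W_L ... W_1 X - Y||_F^2 and its gradient for each W_i."""
    R = alpha * product(Ws, X.shape[0]) @ X - Y      # residual of the prediction
    grads = []
    for i in range(len(Ws)):
        below = product(Ws[:i], X.shape[0]) @ X      # layers before W_i, applied to X
        above = product(Ws[i + 1:], Ws[i].shape[0])  # layers after W_i
        grads.append(alpha * above.T @ R @ below.T)
    return 0.5 * np.sum(R ** 2), grads

def train(X, Y, m=64, L=3, steps=500, seed=0):
    """Vanilla gradient descent from Gaussian initialization; returns the loss curve."""
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[0], Y.shape[0]
    alpha = 1.0 / np.sqrt(m ** (L - 1) * d_out)          # assumed scaling factor
    eta = d_out / (4.0 * L * np.linalg.norm(X, 2) ** 2)  # assumed conservative step size
    Ws = init_weights(d_in, d_out, m, L, rng)
    losses = []
    for _ in range(steps):
        value, grads = loss_and_grads(Ws, X, Y, alpha)
        losses.append(value)
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
    return losses

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))   # 4 samples in dimension 10
Y = rng.standard_normal((2, 4))    # 2-dimensional labels
losses = train(X, Y)
print(losses[0], losses[-1])
```

All layers are updated simultaneously from gradients evaluated at the current iterate, which is what distinguishes vanilla gradient descent from a layer-by-layer scheme.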
3.3 On the Scaling Factor
The scaling factor ensures that the network at initialization preserves the size of every input in expectation.
For any input $x$, the expected squared Euclidean norm of the network output at initialization equals $\|x\|^2$.
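This norm-preservation property can be checked by simulation. The sketch below is a Monte Carlo estimate under the assumption that the scaling factor is $\alpha = 1/\sqrt{m^{L-1} d_{\mathrm{out}}}$ and that all weights are i.i.d. $\mathcal{N}(0,1)$; the function name and the chosen sizes are illustrative.

```python
import numpy as np

def output_sq_norm_ratio(x, m, L, d_out, rng):
    """||alpha W_L ... W_1 x||^2 / ||x||^2 for one fresh N(0, 1) draw of the weights."""
    dims = [x.shape[0]] + [m] * (L - 1) + [d_out]
    h = x
    for i in range(L):
        h = rng.standard_normal((dims[i + 1], dims[i])) @ h
    alpha = 1.0 / np.sqrt(m ** (L - 1) * d_out)   # assumed scaling factor
    return float(np.sum((alpha * h) ** 2) / np.sum(x ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
ratios = [output_sq_norm_ratio(x, m=32, L=4, d_out=3, rng=rng) for _ in range(4000)]
mean_ratio = float(np.mean(ratios))
print(mean_ratio)  # close to 1 up to Monte Carlo error
```

Individual draws fluctuate noticeably at this small width and output dimension; only the average over many initializations is expected to be close to 1.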
4 Main Result
In this section we present our main result. First note that when (which we will assume), the deep linear network we study has the same representation power as a linear map (). Hence, the optimal value for our objective function (1) is equal to the optimal value of the following linear regression problem:
Let be a minimizer of with minimum spectral norm. (Our theorem holds for any minimizer of . Since our bound improves when is smaller, we simply define to be a minimum-spectral-norm minimizer.) Let , and define which is the condition number of .
Our main theorem is the following:
for some and a sufficiently large universal constant and we set . Then with probability at least over the random initialization, we have
Theorem 4.1 establishes that if the width of each layer is sufficiently large, randomly initialized gradient descent can reach a global minimum at a linear convergence rate. Notably, our result is fully polynomial in the sense that we only require polynomially large width and the convergence time is also polynomial. To our knowledge, this is the first polynomial time convergence guarantee for randomly initialized gradient descent on deep linear networks.
Ignoring logarithmic factors and assuming , our requirement on width (4) is . It remains open whether this dependence is tight for randomly initialized gradient descent to find a global minimum in polynomial time.
In terms of convergence rate, if we set the learning rate to be , then the predicted ratio of decrease in each iteration is , so the number of iterations needed to reach loss is . This matches the convergence rate of gradient descent on the linear regression (convex!) problem (3).
Furthermore, notice that our requirement on the learning rate is . When , this also exactly recovers the convergence result for applying gradient descent to the linear regression problem (3). The reason why is in the denominator will be clear in the proof. At a high level, we show that optimizing a deep linear network is similar to a linear regression problem with the covariance matrix being , which thus requires scaling down the learning rate by a factor of .
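For reference, the convex baseline mentioned above can be simulated directly. The sketch below runs gradient descent on $\frac{1}{2}\|WX - Y\|_F^2$ with step size $1/\lambda_{\max}(X^\top X)$ and checks the standard contraction bound $\mathrm{loss}_t \le (1 - 1/\kappa)^{2t}\,\mathrm{loss}_0$, where $\kappa$ is the condition number of $X^\top X$. It assumes $X$ has full column rank, so the optimal loss is zero; the function name and sizes are illustrative.

```python
import numpy as np

def gd_linear_regression(X, Y, steps):
    """Run GD on 0.5 * ||W X - Y||_F^2 from W = 0 with step 1 / lambda_max(X^T X)."""
    eigs = np.linalg.eigvalsh(X.T @ X)     # eigenvalues in ascending order
    eta = 1.0 / eigs[-1]
    W = np.zeros((Y.shape[0], X.shape[0]))
    losses = []
    for _ in range(steps):
        R = W @ X - Y                       # residual of the current prediction
        losses.append(0.5 * np.sum(R ** 2))
        W = W - eta * R @ X.T               # gradient of the loss is R X^T
    return np.array(losses), eigs[-1] / eigs[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))   # full column rank with probability 1 (assumed setting)
Y = rng.standard_normal((2, 4))
losses, kappa = gd_linear_regression(X, Y, steps=200)

# The residual contracts by at most (1 - 1/kappa) per step, so the loss by its square.
bound = losses[0] * (1.0 - 1.0 / kappa) ** (2 * np.arange(200))
print(bound[-1], losses[-1])
```

The bound follows because each step multiplies the residual on the right by $I - \eta X^\top X$, whose eigenvalues lie in $[0, 1 - 1/\kappa]$ for this step size.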
5 Proof Overview
In this section we give an overview for the proof of Theorem 4.1.
First, we note that a simple reduction implies that we can make the following assumption without loss of generality:
(Without loss of generality) , , , and .
Now we proceed to sketch the proof of Theorem 4.1. The key idea is to examine the dynamics of the network prediction on data during optimization, namely:
With this notation, the network prediction at iteration is , and the loss value at iteration is . Hence how evolves is directly related to how loss decreases.
The gradient of our objective function (1) is
Then using the update rule (2) we write
where contains all high-order terms (those with or higher). Multiplying this equation by on the right we get
Vectorizing the above equation and using the property of Kronecker product: , we obtain
Notice that is always positive semi-definite (PSD) because it is the sum of terms, each of which is the Kronecker product between two PSD matrices.
Suppose we are able to set . Then (8) would imply
Therefore, if we have a lower bound on for all , we will have linear convergence as desired. We will indeed prove the following bounds on and for all , which will essentially complete the proof:
We use the following approach to bound and :
Here we have used the property that for symmetric matrices and , every eigenvalue of is the product of an eigenvalue of and an eigenvalue of . Therefore, it suffices to obtain upper and lower bounds on the singular values of and . In Section 6, we establish these bounds for initialization (). Then we finish the proof of Theorem 4.1 in Section 7.
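The eigenvalue property of Kronecker products used here is easy to verify numerically. The following sketch uses two randomly generated PSD matrices (our own test inputs) and checks that the spectrum of their Kronecker product consists exactly of the pairwise products of their eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 3))
H = rng.standard_normal((4, 4))
A = G @ G.T          # symmetric PSD, 3 x 3
B = H @ H.T          # symmetric PSD, 4 x 4

eig_A = np.linalg.eigvalsh(A)                 # ascending order
eig_B = np.linalg.eigvalsh(B)
eig_kron = np.linalg.eigvalsh(np.kron(A, B))

# Every eigenvalue of A kron B is a product of an eigenvalue of A and one of B.
products = np.sort(np.outer(eig_A, eig_B).ravel())
print(np.allclose(eig_kron, products))

# For PSD factors the extreme eigenvalues therefore factor as well.
print(np.isclose(eig_kron[-1], eig_A[-1] * eig_B[-1]))
print(np.isclose(eig_kron[0], eig_A[0] * eig_B[0]))
```

Since all pairwise products of nonnegative numbers are nonnegative, this also confirms that the Kronecker product of two PSD matrices is PSD.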
6 Properties at Initialization
In this section we establish some properties of the weight matrices generated by random initialization.
The following lemma shows that when multiplying a fixed vector by a series of Gaussian matrices with large width, the resulting vector’s norm is concentrated.
Suppose , and consider independent random matrices with i.i.d. entries. Then for any , with probability at least we have
See Appendix B. ∎
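The concentration phenomenon behind this lemma can be observed in a small simulation. The sketch below assumes the natural normalization, dividing $\|A_q \cdots A_1 v\|^2$ by $n_1 \cdots n_q \|v\|^2$ where $n_i$ is the number of rows of $A_i$, and compares the spread of this ratio for narrow and wide matrices. The function name, widths, and trial counts are illustrative choices, not the lemma's constants.

```python
import numpy as np

def norm_ratio(widths, d, rng):
    """||A_q ... A_1 v||^2 / (n_1 ... n_q ||v||^2) for A_i with i.i.d. N(0, 1) entries."""
    v = rng.standard_normal(d)
    h, fan_in, scale = v, d, 1.0
    for n_i in widths:
        h = rng.standard_normal((n_i, fan_in)) @ h   # multiply by a fresh Gaussian matrix
        fan_in, scale = n_i, scale * n_i
    return float(np.sum(h ** 2) / (scale * np.sum(v ** 2)))

rng = np.random.default_rng(0)
narrow = np.array([norm_ratio([8, 8, 8], d=8, rng=rng) for _ in range(50)])
wide = np.array([norm_ratio([512, 512, 512], d=8, rng=rng) for _ in range(50)])
print(narrow.std(), wide.std())   # the wide ratios cluster much more tightly around 1
```

Each multiplication contributes an independent chi-squared factor, so the relative fluctuation of the product shrinks as the widths grow, which is the effect the lemma quantifies.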
The next three propositions show the key properties of products of weight matrices at initialization.
For any , with probability at least we have
Let . Since and , we know and . Also, from Lemma 6.1 we know that for any fixed , with probability at least we have .
The rest of the proof is by a standard -net argument. Let . Take an -net for with . By a union bound, with probability at least , for all simultaneously we have . Suppose this happens for every . Next, for any , there exists such that . Then we have
Taking the supremum over , we obtain
For the lower bound, we have
Taking the infimum over we get .
The success probability is at least since . ∎
For any , with probability at least we have
For any , with probability at least we have
Let . From Lemma 6.1 we know that for any fixed , with probability at least we have .
Take a small constant and partition the index set into where for each . For each , taking a -net for all the unit vectors supported in , i.e., a -net for the set , we know that
with probability at least . Then taking a union bound over all , we know that (11) holds for all simultaneously with probability at least . Conditioned on this, for any , we can partition its coordinates and write it as the sum where and for each . Then we have
This means . The success probability is at least since . ∎
To close this section, we bound the loss value at initialization, which proves the first part of Theorem 4.1.
With probability at least , we have .
See Appendix D. ∎
7 Proof of the Main Theorem
Here we define which is the upper bound on from Proposition 6.5.
From our requirement on (4), we know
Now we establish our convergence result conditioned on all properties in (12). Specifically, we use induction on to simultaneously prove the following three properties , and for all :
Notice that if we prove for all , we will finish the proof of Theorem 4.1.
The proof of Theorem 4.1 is finished after the above three claims are proved.
7.1 Proof of Claim 7.1
Denote . From we know for all .
From the gradient expression (5), for all and all we can bound:
where we have used .
Then we can bound for all :
This proves .
7.2 Proof of Claim 7.2
Let and denote . Then using we will show the following:
Combining them with (12), we will finish the proof of .
First we prove (17). For