Width Provably Matters in Optimization for Deep Linear Neural Networks

We prove that for an L-layer fully-connected linear neural network, if the width of every hidden layer is Ω̃(L · r · d_out · κ³), where r and κ are the rank and the condition number of the input data, and d_out is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an ϵ-suboptimal solution is O(κ log(1/ϵ)). Our polynomial upper bound on the total running time for wide deep linear networks and the exp(Ω(L)) lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.


1 Introduction

Recent success in machine learning involves training deep neural networks with randomly initialized first-order methods, which requires optimizing highly non-convex functions. Compared with non-linear deep neural networks, deep linear networks are arguably more amenable to theoretical analysis, and it is widely believed that they already capture important aspects of optimization in deep learning (Saxe et al., 2014). Theoreticians have therefore studied this problem extensively in recent years, yet a strong global convergence guarantee is still missing.

A series of recent papers analyzed the landscape of the deep linear network optimization problem (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al., 2017; Zhou & Liang, 2018; Laurent & Brecht, 2018). However, these results do not imply convergence of gradient-based methods to the global minimum. Recently, Bartlett et al. (2018); Arora et al. (2018a) directly analyzed the trajectory generated by gradient descent and showed that it converges to a global minimum under further assumptions on both the data and the global minimum. These results require specially designed initialization schemes and do not apply to commonly used random initializations. In Section 2.1 we describe the above results in more detail.

A recent work by Shamir (2018) showed an exponential lower bound for randomly initialized gradient descent on narrow linear neural networks. More precisely, he showed that for an L-layer linear neural network in which the input, output and all hidden dimensions are equal to 1, gradient descent with Xavier initialization (Glorot & Bengio, 2010) requires at least exp(Ω(L)) iterations to converge. This result demonstrates the intrinsic difficulty of optimizing deep networks: even in this basic setting, the convergence time of randomly initialized gradient descent can be exponential in the depth. Nevertheless, this lower bound only holds for narrow neural networks. It is possible that making the hidden layers wider (which is usually the case in practice) can eliminate such exponential dependence on depth. This gives rise to the following questions:

Can randomly initialized gradient descent optimize wide deep linear networks in polynomial time? If so, what is a sufficient width in hidden layers?

Our Contribution:

We answer the first question positively and give a concrete quantitative answer to the second. We prove that as long as the width of the hidden layers is at least Ω̃(L) (we omit the dependence on other parameters here; see Theorem 4.1 for the precise requirement), gradient descent with Xavier initialization converges with high probability to the global minimum of the ℓ2 loss at a linear rate, with no assumptions on the data. To our knowledge, this is the first polynomial-time global convergence guarantee for randomly initialized gradient descent on deep linear networks. Furthermore, our convergence rate is tight in the sense that it matches the convergence rate of applying gradient descent to the convex (1-layer) linear regression problem.

Compared with previous work (Bartlett et al., 2018; Arora et al., 2018a) that gave convergence rate guarantees for linear neural networks, our result has several advantages:

  • Our result applies to the widely used Xavier random initialization, while Bartlett et al. (2018) used identity initialization, and Arora et al. (2018a) assumed that initialization is “balanced” and somewhat close to the global minimum.

  • Our result does not have any assumption on the input data, while Bartlett et al. (2018); Arora et al. (2018a) both required whitened data.

  • Our result does not have any assumption on the global minimum, while Bartlett et al. (2018) assumed it to be either close to identity or positive definite, and Arora et al. (2018a) required it to have full rank.

Our polynomial upper bound for wide deep linear networks and the exponential lower bound for narrow deep linear networks together demonstrate that width provably matters in guaranteeing the efficiency of randomly initialized first-order methods for optimizing deep linear nets. (There are other techniques that help optimization in deep models, such as skip-connections (He et al., 2016); analyses of those approaches are beyond the scope of this paper.)

Our Technique:

Our proof technique is related to the recent works of Arora et al. (2018a, b); Du et al. (2018b), which utilized a time-varying Gram matrix (or preconditioner) along the trajectory of gradient descent. We adopt the same idea of using such a Gram matrix. In the setting of wide linear neural networks, we carefully upper and lower bound the eigenvalues of this Gram matrix throughout the optimization process, which together with some perturbation analysis implies linear convergence. To establish these bounds, we analyze spectral properties of products of Gaussian random matrices at initialization, and then show that the relevant properties are preserved throughout the trajectory of gradient descent.

2 Related Work

2.1 Optimization for Deep Linear Neural Networks

Landscape Analysis:

Ge et al. (2015); Jin et al. (2017) showed that if an objective function satisfies that (1) all local minima are global, and (2) all saddle points are strict (i.e., there exists a direction of negative curvature), then randomly perturbed gradient descent can escape all saddle points and find a global minimum. Motivated by this, a series of papers (Kawaguchi, 2016; Hardt & Ma, 2016; Lu & Kawaguchi, 2017; Yun et al., 2017; Zhou & Liang, 2018; Laurent & Brecht, 2018) studied these landscape properties for optimizing deep linear networks. While it was established that all local minima are global, unfortunately the strict saddle property fails to hold already for 3-layer linear neural networks. Therefore, landscape properties alone are not sufficient for proving global convergence.

Trajectory Analysis:

Instead of using the indirect landscape-based approach, an alternative is to directly analyze the trajectory generated by a concrete optimization algorithm like gradient descent. The current paper also belongs to this category.

Saxe et al. (2014) gave a thorough empirical study of deep linear networks, showing that they exhibit learning patterns similar to those of non-linear networks. Ji & Telgarsky (2019) studied the dynamics of gradient descent for optimizing a deep linear neural network on classification problems, and showed that the risk converges to 0 and that the solution found is a max-margin solution. Arora et al. (2018b) observed that adding more layers can accelerate optimization for certain loss functions. Du et al. (2018a) showed that when using gradient descent, the layers are automatically balanced.

None of the above results, however, gives a concrete convergence rate for gradient descent. The most closely related papers are Bartlett et al. (2018) and Arora et al. (2018a); here we describe their results in more detail.

Bartlett et al. (2018) showed that if one uses identity initialization, the input data is whitened, and the target matrix is either close to the identity or positive definite, then gradient descent converges to the target matrix at a linear rate. Their result depends heavily on the identity initialization scheme and has strong requirements on the input data and the target. Arora et al. (2018a) showed that if the initialization is balanced and the initial loss is smaller than the loss of any low-rank solution by a margin, then gradient descent converges to a global minimum at a linear rate. However, their initialization scheme requires a special SVD step which is not used in practice, and the initial loss condition holds with exponentially small probability when the input and output dimensions are large. Our result improves upon these two papers by (i) allowing fully random initialization, and (ii) removing all assumptions on the input data and the target.

2.2 Optimization for Other Neural Networks

Many papers tried to identify the two desired geometric landscape properties of objective functions for non-linear neural networks (Freeman & Bruna, 2016; Nguyen & Hein, 2017; Venturi et al., 2018; Soudry & Carmon, 2016; Du & Lee, 2018; Soltanolkotabi et al., 2018; Haeffele & Vidal, 2017). Unfortunately, these properties do not hold even for simple non-linear shallow neural networks (Yun et al., 2019; Safran & Shamir, 2018).

A series of recent papers used trajectory-based methods to analyze gradient descent for shallow neural networks under strong data assumptions  (Tian, 2017; Soltanolkotabi, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Zhong et al., 2017; Zhang et al., 2018; Du et al., 2018c, d). These results are restricted to shallow neural networks, and the assumptions are not satisfied in practice.

Recent breakthroughs were made in the optimization of extremely over-parametrized non-linear neural networks (Du et al., 2019, 2018b; Li & Liang, 2018; Allen-Zhu et al., 2018; Zou et al., 2018). For deep ReLU neural networks, Allen-Zhu et al. (2018); Zou et al. (2018) showed that if the width of the hidden layers is polynomial in the number of training samples n and in the depth L, then gradient descent converges to zero training loss. Du et al. (2018b) considered non-linear smooth activation functions like soft-plus, and showed that if the width of the hidden layers is exponential in L (and polynomial in n), then gradient descent converges to zero training loss. (They also showed that if one uses skip-connections (He et al., 2016), then the width only needs to depend polynomially on L; we focus on fully-connected neural networks in this paper.) All these results need additional assumptions on the data, which also show up in the required width. Compared with them, we have a much better bound on the required width (near-linear in L versus polynomial or exponential in L), although this is not a fair comparison because linear networks are simpler than non-linear ones. Still, given that we obtain a near-linear dependence on depth, our result may shed light on the limit of the required width for optimizing non-linear neural networks.

3 Preliminaries

3.1 Notation

We use ‖·‖ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and use ‖·‖_F to denote the Frobenius norm of a matrix. For a symmetric matrix A, let λ_max(A) and λ_min(A) be its maximum and minimum eigenvalues, and let λ_i(A) be its i-th largest eigenvalue. Similarly, for a general matrix A, let σ_max(A) and σ_min(A) be its maximum and minimum singular values, and let σ_i(A) be its i-th largest singular value.

Let I be the identity matrix and [n] = {1, 2, …, n}. Denote by N(0, 1) the standard Gaussian distribution, and by χ²_k the χ² distribution with k degrees of freedom. Let S^{d−1} be the unit sphere in R^d.

Let vec(A) be the vectorization of a matrix A in column-first order. The Kronecker product between two matrices A ∈ R^{m₁×n₁} and B ∈ R^{m₂×n₂} is defined as A ⊗ B = [a_{ij} B]_{i ∈ [m₁], j ∈ [n₁]} ∈ R^{m₁m₂ × n₁n₂}, where a_{ij} is the element in the (i, j)-th entry of A.

We use C to represent a sufficiently large universal constant throughout the paper. The specific value of C can be different from line to line.

3.2 Problem Setup

We are given n training samples {(x_i, y_i)}_{i=1}^n ⊂ R^{d_in} × R^{d_out}. Let X = [x_1, x_2, …, x_n] ∈ R^{d_in × n} be the input data matrix and Y = [y_1, y_2, …, y_n] ∈ R^{d_out × n} be the label matrix.

Consider the problem of training a depth-L linear neural network with hidden layer width m by minimizing the ℓ2 loss over the data:

    min_{W_1, …, W_L}  ℓ(W_1, …, W_L) = (1/2) ‖ (1/√(m^{L−1} d_out)) · W_L W_{L−1} ⋯ W_1 X − Y ‖_F²,    (1)

where W_1 ∈ R^{m × d_in}, W_2, …, W_{L−1} ∈ R^{m × m} and W_L ∈ R^{d_out × m} are the weight matrices to be learned. Here 1/√(m^{L−1} d_out) is a scaling factor corresponding to Xavier initialization (Glorot & Bengio, 2010) (we adopt this scaling so that we can initialize all weights from N(0, 1)), for which we provide a justification in Section 3.3.
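To make the setup concrete, here is a minimal NumPy sketch of the prediction map and the loss in (1). It is an illustration rather than code from the paper: the function names and the convention of storing inputs as columns of X are our own choices; only the scaling factor 1/√(m^{L−1} d_out) comes from the objective above.

```python
import numpy as np

def predict(Ws, X):
    """Deep linear network output: (1 / sqrt(m^(L-1) * d_out)) * W_L ... W_1 X.

    Ws = [W_1, ..., W_L] with W_1 of shape (m, d_in),
    W_2, ..., W_{L-1} of shape (m, m), and W_L of shape (d_out, m).
    """
    L, m, d_out = len(Ws), Ws[0].shape[0], Ws[-1].shape[0]
    out = X
    for W in Ws:                        # apply W_1 first, W_L last
        out = W @ out
    return out / np.sqrt(m ** (L - 1) * d_out)

def loss(Ws, X, Y):
    """The l2 loss in (1): 0.5 * ||prediction - Y||_F^2."""
    R = predict(Ws, X) - Y
    return 0.5 * np.sum(R ** 2)
```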

We consider the vanilla gradient descent (GD) algorithm for objective (1) with random initialization:

  • We initialize all the entries of W_1, …, W_L independently from N(0, 1). Let W_1(0), …, W_L(0) be the weight matrices at initialization.

  • Then we update the weights using GD: for i ∈ [L] and t = 0, 1, 2, …,

    W_i(t + 1) = W_i(t) − η · ∂ℓ(W_1(t), …, W_L(t)) / ∂W_i,    (2)

    where η > 0 is the learning rate.

For notational convenience, we denote W_{j:i} = W_j W_{j−1} ⋯ W_i for every 1 ≤ i ≤ j ≤ L. We also define W_{i−1:i} = I (of appropriate dimension) for completeness.

We use the time index t for all variables that depend on t, e.g., W_i(t), W_{j:i}(t), etc.
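Continuing the sketch above, the following is a hedged implementation of the training procedure: entries initialized i.i.d. from N(0, 1) and the update rule (2), with the analytic gradient of (1) (it coincides with equation (5) in Section 5). The step size, problem sizes, and helper names are arbitrary illustrations, not the quantities required by Theorem 4.1.

```python
import numpy as np

def gradients(Ws, X, Y):
    """Analytic gradients of the loss (1) with respect to each W_i (cf. equation (5))."""
    L, d_out, m = len(Ws), Y.shape[0], Ws[0].shape[0]
    rho = 1.0 / np.sqrt(m ** (L - 1) * d_out)           # scaling factor in (1)
    prefix = [X]                                        # prefix[k] = W_k ... W_1 X
    for W in Ws:
        prefix.append(W @ prefix[-1])
    R = rho * prefix[L] - Y                             # residual: prediction minus Y
    suffix = [None] * (L + 2)                           # suffix[k] = W_L ... W_k, suffix[L+1] = I
    suffix[L + 1] = np.eye(d_out)
    for k in range(L, 0, -1):
        suffix[k] = suffix[k + 1] @ Ws[k - 1]
    return [rho * suffix[i + 1].T @ R @ prefix[i - 1].T for i in range(1, L + 1)]

def train(L=3, m=64, d_in=5, d_out=3, n=20, eta=1e-3, steps=500, seed=0):
    """Vanilla gradient descent from the N(0, 1) random initialization, following (2)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d_in, n))
    Y = rng.standard_normal((d_out, n))
    dims = [d_in] + [m] * (L - 1) + [d_out]
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]
    for _ in range(steps):
        grads = gradients(Ws, X, Y)
        Ws = [W - eta * G for W, G in zip(Ws, grads)]   # update rule (2)
    return Ws
```

With the loss function from the previous sketch, one can check that the objective decreases over the course of training for a sufficiently small step size.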

3.3 On the Scaling Factor

The scaling factor 1/√(m^{L−1} d_out) ensures that the network at initialization preserves the size of every input in expectation.

Claim 3.1.

For any x ∈ R^{d_in}, we have E[ ‖ (1/√(m^{L−1} d_out)) · W_L W_{L−1} ⋯ W_1 x ‖² ] = ‖x‖², where the expectation is over the N(0, 1) initialization of W_1, …, W_L.

Proof.

First, it is easy to see that for a random matrix W ∈ R^{d′ × d} with i.i.d. N(0, 1) entries and any fixed vector v ∈ R^d, the distribution of ‖Wv‖² is ‖v‖² · χ²_{d′}. We rewrite ‖W_L W_{L−1} ⋯ W_1 x‖² as

    ‖W_L W_{L−1} ⋯ W_1 x‖² = ‖x‖² · ∏_{i=1}^{L} z_i,   where z_i = ‖W_{i:1} x‖² / ‖W_{i−1:1} x‖²  (and W_{0:1} x = x).

Then we have z_i | W_1, …, W_{i−1} ∼ χ²_m for i < L and z_L | W_1, …, W_{L−1} ∼ χ²_{d_out}, and these conditional distributions do not depend on the conditioning. Therefore, z_1, …, z_L are independent χ² random variables with E[z_i] = m (i < L) and E[z_L] = d_out. It follows that

    E[ ‖ (1/√(m^{L−1} d_out)) · W_L W_{L−1} ⋯ W_1 x ‖² ] = (1 / (m^{L−1} d_out)) · ‖x‖² · ∏_{i=1}^{L} E[z_i] = ‖x‖². ∎
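As a quick numerical sanity check of Claim 3.1, the snippet below estimates E‖(1/√(m^{L−1} d_out)) W_L ⋯ W_1 x‖² by Monte Carlo and compares it with ‖x‖²; the dimensions and the number of trials are arbitrary.

```python
import numpy as np

def scaled_output_sq_norm(x, L, m, d_out, rng):
    """One sample of ||(1/sqrt(m^(L-1) d_out)) W_L ... W_1 x||^2 with N(0,1) weights."""
    dims = [x.shape[0]] + [m] * (L - 1) + [d_out]
    v = x
    for i in range(L):
        v = rng.standard_normal((dims[i + 1], dims[i])) @ v
    return np.sum(v ** 2) / (m ** (L - 1) * d_out)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
L, m, d_out, trials = 4, 100, 3, 2000
est = np.mean([scaled_output_sq_norm(x, L, m, d_out, rng) for _ in range(trials)])
print(est, np.sum(x ** 2))      # the two numbers should be close
```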

4 Main Result

In this section we present our main result. First note that when m ≥ min{d_in, d_out} (which we will assume), the deep linear network we study has the same representation power as a linear map x ↦ Wx with W ∈ R^{d_out × d_in}. Hence, the optimal value of our objective function (1) is equal to the optimal value of the following linear regression problem:

    min_{W ∈ R^{d_out × d_in}}  (1/2) ‖ W X − Y ‖_F².    (3)

Let W* be a minimizer of (3) with minimum spectral norm. (Our theorem holds for any minimizer of (3); since our bound improves when ‖W*‖ is smaller, we simply define W* to be a minimum-spectral-norm minimizer.) Let r = rank(X), and define κ = λ_max(X X^⊤) / λ_r(X X^⊤), which is the condition number of the input data X.
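For concreteness, r and κ can be read off from the singular values of the data matrix. A small sketch, assuming (as above) that κ is the ratio of the largest to the r-th largest eigenvalue of X Xᵀ; the rank tolerance and the test matrix are arbitrary.

```python
import numpy as np

def rank_and_condition(X, tol=1e-10):
    """r = rank(X) and kappa = sigma_max(X)^2 / sigma_r(X)^2."""
    s = np.linalg.svd(X, compute_uv=False)      # singular values, in descending order
    r = int(np.sum(s > tol * s[0]))
    kappa = (s[0] / s[r - 1]) ** 2
    return r, kappa

X = np.random.default_rng(1).standard_normal((5, 50))
print(rank_and_condition(X))
```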

Our main theorem is the following:

Theorem 4.1.

Suppose

    m ≥ Ω̃( L · r · d_out · κ³ )    (4)

for some δ ∈ (0, 1) and a sufficiently large universal constant C (absorbed, together with logarithmic factors and the dependence on δ, into the Ω̃(·) notation), and suppose we set the learning rate η accordingly. Then with probability at least 1 − δ over the random initialization, gradient descent converges to a global minimum of (1) at a linear rate.

Theorem 4.1 establishes that if the width of each layer is sufficiently large, randomly initialized gradient descent can reach a global minimum at a linear convergence rate. Notably, our result is fully polynomial in the sense that we only require polynomially large width and the convergence time is also polynomial. To our knowledge, this is the first polynomial time convergence guarantee for randomly initialized gradient descent on deep linear networks.

Ignoring logarithmic factors and treating r, d_out and κ as constants, our requirement (4) on the width is m = Ω̃(L). It remains open whether this near-linear dependence on the depth is tight for randomly initialized gradient descent to find a global minimum in polynomial time.

In terms of convergence rate, if we set the learning rate to its largest allowed value, then the predicted ratio of decrease in each iteration is 1 − Θ(1/κ), so the number of iterations needed to reach an ϵ-suboptimal loss is O(κ log(1/ϵ)). This matches the convergence rate of gradient descent on the linear regression (convex!) problem (3).
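The iteration count follows from a one-line calculation: if each iteration shrinks the suboptimality by a factor of 1 − c/κ for some constant c > 0 (the per-iteration ratio discussed above), then after T iterations

```latex
\[
\Bigl(1 - \tfrac{c}{\kappa}\Bigr)^{T} \le e^{-cT/\kappa} \le \epsilon
\quad \text{whenever} \quad
T \ge \frac{\kappa}{c}\,\log\frac{1}{\epsilon},
\]
```

which is the O(κ log(1/ϵ)) iteration count stated in the abstract.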

Furthermore, notice that our requirement on the learning rate scales as η = O(1/L) (up to data-dependent factors). When L = 1, this exactly recovers the convergence result for applying gradient descent to the linear regression problem (3). The reason why L appears in the denominator will become clear in the proof. At a high level, we show that optimizing a deep linear network is similar to solving a linear regression problem whose covariance matrix is scaled up by a factor of L, which thus requires scaling down the learning rate by a factor of L.

5 Proof Overview

In this section we give an overview for the proof of Theorem 4.1.

First, we note that a simple reduction implies that we can make the following assumption without loss of generality:

Assumption 5.1.

(Without loss of generality) X has full row rank r = d_in, λ_max(X X^⊤) = 1, λ_r(X X^⊤) = 1/κ, and the optimal value of (3) is 0 (i.e., Y = W* X).

See Appendix A for justification. Therefore we will work under Assumption 5.1 from now on.

Now we proceed to sketch the proof of Theorem 4.1. The key idea is to examine the dynamics of the network prediction on the data during optimization, namely

    U(t) = (1/√(m^{L−1} d_out)) · W_{L:1}(t) X ∈ R^{d_out × n}.

With this notation, the network prediction at iteration t is U(t), and the loss value at iteration t is ℓ(W_1(t), …, W_L(t)) = (1/2) ‖U(t) − Y‖_F². Hence how U(t) evolves is directly related to how the loss decreases.

The gradient of our objective function (1) is

    ∂ℓ(W_1, …, W_L)/∂W_i = (1/√(m^{L−1} d_out)) · W_{L:i+1}^⊤ ( (1/√(m^{L−1} d_out)) W_{L:1} X − Y ) (W_{i−1:1} X)^⊤,    i ∈ [L].    (5)

Then using the update rule (2) we write

    W_{L:1}(t+1) = W_{L:1}(t) − η Σ_{i=1}^{L} W_{L:i+1}(t) · (∂ℓ(W_1(t), …, W_L(t))/∂W_i) · W_{i−1:1}(t) + E(t),

where E(t) contains all high-order terms (those with η² or higher). Multiplying this equation by (1/√(m^{L−1} d_out)) X on the right we get

    U(t+1) = U(t) − (η / (m^{L−1} d_out)) Σ_{i=1}^{L} W_{L:i+1}(t) W_{L:i+1}(t)^⊤ (U(t) − Y) (W_{i−1:1}(t) X)^⊤ W_{i−1:1}(t) X + (1/√(m^{L−1} d_out)) E(t) X.

Vectorizing the above equation and using the property of the Kronecker product, vec(ACB) = (B^⊤ ⊗ A) vec(C), we obtain

    vec(U(t+1) − Y) = (I − η P(t)) vec(U(t) − Y) + vec(Ẽ(t)),    (6)

where Ẽ(t) = (1/√(m^{L−1} d_out)) E(t) X and

    P(t) = (1 / (m^{L−1} d_out)) Σ_{i=1}^{L} [ (W_{i−1:1}(t) X)^⊤ W_{i−1:1}(t) X ] ⊗ [ W_{L:i+1}(t) W_{L:i+1}(t)^⊤ ].    (7)

Notice that P(t) is always positive semi-definite (PSD) because it is the sum of L terms, each of which is the Kronecker product of two PSD matrices.
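The linearized dynamics (6)-(7) can be checked numerically: after one gradient step with a small η, vec(U(t+1) − Y) should agree with (I − η P(t)) vec(U(t) − Y) up to an O(η²) remainder. The sketch below builds P(t) with the Kronecker-product form in (7); the dimensions, seed, and step size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
L, m, d_in, d_out, n, eta = 3, 30, 4, 2, 6, 1e-4
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, n))
dims = [d_in] + [m] * (L - 1) + [d_out]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]
rho = 1.0 / np.sqrt(m ** (L - 1) * d_out)            # scaling factor in (1)

def partials(Ws):
    """prefix[k] = W_k ... W_1 X (prefix[0] = X); suffix[k] = W_L ... W_k (suffix[L+1] = I)."""
    prefix = [X]
    for W in Ws:
        prefix.append(W @ prefix[-1])
    suffix = [None] * (L + 2)
    suffix[L + 1] = np.eye(d_out)
    for k in range(L, 0, -1):
        suffix[k] = suffix[k + 1] @ Ws[k - 1]
    return prefix, suffix

prefix, suffix = partials(Ws)
U = rho * prefix[L]
grads = [rho * suffix[i + 1].T @ (U - Y) @ prefix[i - 1].T for i in range(1, L + 1)]

# Gram matrix P(t) as in (7): a sum of L Kronecker products of PSD matrices.
P = rho ** 2 * sum(np.kron(prefix[i - 1].T @ prefix[i - 1],
                           suffix[i + 1] @ suffix[i + 1].T) for i in range(1, L + 1))

Ws_new = [W - eta * G for W, G in zip(Ws, grads)]    # one GD step, update rule (2)
U_new = rho * partials(Ws_new)[0][L]

lhs = (U_new - Y).flatten(order="F")                 # vec(.) in column-first order
rhs = (np.eye(d_out * n) - eta * P) @ (U - Y).flatten(order="F")
step = np.linalg.norm(eta * P @ (U - Y).flatten(order="F"))
# the O(eta^2) remainder should be far below the size of the first-order change
print(np.linalg.norm(lhs - rhs), step)
```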

Now we assume that the high-order term Ẽ(t) in (6) is very small (which we will rigorously prove) and ignore it for now. Then (6) implies

    vec(U(t+1) − Y) ≈ (I − η P(t)) vec(U(t) − Y).    (8)

Suppose we are able to set η ≤ 1/λ_max(P(t)). Then (8) would imply

    ‖U(t+1) − Y‖_F ≤ (1 − η λ_min(P(t))) · ‖U(t) − Y‖_F.

Therefore, if we have a lower bound on λ_min(P(t)) for all t, we will have linear convergence as desired. We will indeed prove the following bounds on λ_min(P(t)) and λ_max(P(t)) for all t, which will essentially complete the proof:

(9)

We use the following approach to bound λ_min(P(t)) and λ_max(P(t)): since P(t) is a sum of L Kronecker products of PSD matrices, its extreme eigenvalues can be bounded via the extreme singular values of the factors W_{L:i+1}(t) and W_{i−1:1}(t) X:

(10)

Here we have used the property that for symmetric matrices A and B, every eigenvalue of A ⊗ B is the product of an eigenvalue of A and an eigenvalue of B. Therefore, it suffices to obtain upper and lower bounds on the singular values of W_{L:i+1}(t) and W_{i−1:1}(t) X. In Section 6, we establish these bounds at initialization (t = 0). Then we finish the proof of Theorem 4.1 in Section 7.
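The Kronecker-product eigenvalue fact invoked here is easy to verify numerically; a small sketch with arbitrary PSD matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = A @ A.T          # symmetric PSD
B = rng.standard_normal((3, 3)); B = B @ B.T          # symmetric PSD
kron_eigs = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
pair_products = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
print(np.allclose(kron_eigs, pair_products))          # True: spec(A x B) = {lambda_i(A) * lambda_j(B)}
```

In particular, for PSD factors this gives λ_min(A ⊗ B) = λ_min(A)·λ_min(B) and λ_max(A ⊗ B) = λ_max(A)·λ_max(B), which is the kind of bound applied to each of the L terms of P(t).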

6 Properties at Initialization

In this section we establish some properties of the weight matrices generated by random initialization.

The following lemma shows that when multiplying a fixed vector by a series of Gaussian matrices with large width, the resulting vector’s norm is concentrated.

Lemma 6.1.

Suppose , and consider independent random matrices with i.i.d. entries. Then for any , with probability at least we have

Proof.

See Appendix B. ∎
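To see the concentration phenomenon behind Lemma 6.1 empirically, the hedged sketch below multiplies a fixed unit vector by a chain of i.i.d. N(0, 1) matrices and looks at the spread of the normalized squared norm across trials; the widths, depth, and trial counts are arbitrary, and the precise quantitative statement is the one in the lemma and Appendix B.

```python
import numpy as np

def normalized_sq_norm(v, widths, rng):
    """||A_k ... A_1 v||^2 divided by its expectation, for i.i.d. N(0,1) matrices A_i."""
    out = v
    for w in widths:
        out = rng.standard_normal((w, out.shape[0])) @ out
    return np.sum(out ** 2) / (np.prod(widths) * np.sum(v ** 2))

rng = np.random.default_rng(0)
v = np.zeros(10); v[0] = 1.0                           # a fixed unit vector
for m in [10, 50, 250]:
    samples = [normalized_sq_norm(v, [m] * 4, rng) for _ in range(300)]
    print(m, round(np.mean(samples), 3), round(np.std(samples), 3))
# the mean stays near 1, and the spread shrinks as the width m grows
```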

The next three propositions show the key properties of products of weight matrices at initialization.

Proposition 6.2.

For any , with probability at least we have

Proof.

Let . Since and , we know and . Also, from Lemma 6.1 we know that for any fixed , with probability at least we have .

The rest of the proof is by a standard -net argument. Let . Take an -net for with . By a union bound, with probability at least , for all simultaneously we have . Suppose this happens for every . Next, for any , there exists such that . Then we have

Taking the supremum over , we obtain

For the lower bound, we have

Taking infimum over we get .

The success probability is at least since . ∎

Proposition 6.3.

For any , with probability at least we have

Proof.

The proof is similar to the proof of Proposition 6.2 and is deferred to Appendix C. ∎

Proposition 6.4.

For any , with probability at least we have

Proof.

Let . From Lemma 6.1 we know that for any fixed , with probability at least we have .

Take a small constant and partition the index set into where for each . For each , taking a -net for all the unit vectors supported in , i.e., a -net for the set , we know that

(11)

with probability at least . Then taking a union bound over all , we know that (11) holds for all simultaneously with probability at least . Conditioned on this, for any , we can partition its coordinates and write it as the sum where and for each . Then we have

This means . The success probability is at least since . ∎

To close this section, we bound the loss value at initialization, which proves the first part of Theorem 4.1.

Proposition 6.5.

With probability at least , we have .

Proof.

See Appendix D. ∎

7 Proof of the Main Theorem

In this section we prove Theorem 4.1 based on ingredients from Sections 5 and 6.

From Propositions 6.2, 6.3, 6.4 and 6.5, we know that with probability at least , the following conditions of initialization are satisfied simultaneously:

(12)

Here we define which is the upper bound on from Proposition 6.5.

From our requirement (4) on the width, we know

(13)

Now we establish our convergence result conditioned on all properties in (12). Specifically, we use induction on to simultaneously prove the following three properties , and for all :

  • :

  • :

  • :

Notice that if we prove for all , we will finish the proof of Theorem 4.1.

The initial conditions and follow directly from (12), and is trivially true. In order to establish , and for all , in Sections 7.1-7.3 we will prove respectively the following claims for all :

Claim 7.1.

.

Claim 7.2.

.

Claim 7.3.

.

The proof of Theorem 4.1 is finished after the above three claims are proved.

7.1 Proof of Claim 7.1

Denote . From we know for all .

From the gradient expression (5), for all and all we can bound:

(14)

where we have used .

Then we can bound for all :

This proves .

7.2 Proof of Claim 7.2

Let and denote . Then using we will show the following:

(15)
(16)
(17)

Combining them with (12), we will finish the proof of .

First we prove (17). For