# Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence for natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee the efficient convergence from random initializations: (1) the Jacobian matrix (of network's output for all training cases with respect to the parameters) is full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.

## 1 Introduction

Because training large neural networks is costly, there has been much interest in using second-order optimization to speed up training (Becker and LeCun, 1989; Martens, 2010; Martens and Grosse, 2015), and in particular natural gradient descent (Amari, 1998, 1997). Recently, scalable approximations to natural gradient descent have shown practical success in a variety of tasks and architectures (Martens and Grosse, 2015; Grosse and Martens, 2016; Wu et al., 2017; Zhang et al., 2018a; Martens et al., 2018). Natural gradient descent has an appealing interpretation as optimizing over a Riemannian manifold using an intrinsic distance metric; this implies the updates are invariant to transformations such as whitening (Ollivier, 2015; Luk and Grosse, 2018). It is also closely connected to Gauss-Newton optimization, suggesting it should achieve fast convergence in certain settings (Pascanu and Bengio, 2013; Martens, 2014; Botev et al., 2017).

Does this intuition translate into faster convergence? Amari (1998) provided arguments in the affirmative, as long as the cost function is well approximated by a convex quadratic. However, it remains unknown whether natural gradient descent can optimize neural networks faster than gradient descent, which is a major gap in our understanding. The problem is that the optimization of neural networks is both nonconvex and non-smooth, making it difficult to prove nontrivial convergence bounds. In general, finding a global minimum of a general non-convex function is NP-hard, and neural network training in particular is NP-complete (Blum and Rivest, 1992).

However, in the past two years, researchers have finally gained substantial traction in understanding the dynamics of gradient-based optimization of neural networks. Theoretically, it has been shown that gradient descent starting from a random initialization is able to find a global minimum if the network is wide enough (Li and Liang, 2018; Du et al., 2018b, a; Zou et al., 2018; Allen-Zhu et al., 2018; Oymak and Soltanolkotabi, 2019). The key technique of those works is to show that neural networks become well-behaved if they are largely overparameterized in the sense that the number of hidden units is polynomially large in the size of the training data. However, most of these works have focused on standard gradient descent, leaving open the question of whether similar statements can be made about other optimizers.

Most convergence analysis of natural gradient descent has focused on simple convex quadratic objectives (e.g. (Martens, 2014)). Very recently, the convergence properties of NGD were studied in the context of linear networks (Bernacchia et al., 2018), where linear activation was used. While the linearity assumption simplifies the analysis of training dynamics (Saxe et al., 2013), linear networks are severely limited in terms of their expressivity, and it’s not clear which conclusions will generalize from linear to nonlinear networks.

In this work, we analyze natural gradient descent for nonlinear networks. We give two simple and generic conditions on the Jacobian matrix which guarantee efficient convergence to a global minimum. We then apply this analysis to a particular distribution over two-layer ReLU networks which has recently been used to analyze the convergence of gradient descent (Li and Liang, 2018; Du et al., 2018a; Oymak and Soltanolkotabi, 2019). We show that for sufficiently high network width, NGD will converge to the global minimum. We give bounds on the convergence rate of two-layer ReLU networks that are much better than the analogous bounds that have been proven for gradient descent (Du et al., 2018b; Wu et al., 2019; Oymak and Soltanolkotabi, 2019), while allowing for much higher learning rates. Moreover, in the limit of infinite width, and assuming a squared error loss, we show that NGD converges in just one iteration.

The main contributions of our work are summarized as follows:

• We provide the first global convergence result for natural gradient descent in training overparameterized neural networks (two-layer ReLU networks) where the number of hidden units is polynomially larger than the number of training samples. We show that natural gradient descent gives an improvement in convergence rate given the same learning rate as gradient descent, where is a Gram matrix depends on the data.

• Second, we show that natural gradient enables us to use a much larger step size, resulting in an even faster convergence rate. Specifically, the maximal step size of natural gradient descent is for (polynomially) wide networks while the best result for step size of gradient descent (Wu et al., 2019) is , where is the number of training examples.

• We show that K-FAC (Martens and Grosse, 2015), an approximate natural gradient descent method, also converges to global minima with linear rate, although this result requires a higher level of overparameterization compared to GD and exact NGD.

• We analyze the generalization properties of NGD, showing that the improved convergence rates don’t come at the expense of worse generalization.

## 2 Related Works

Recently, there have been many works studying the optimization problem in deep learning, i.e., why in practice many neural network architectures reliably converge to global minima (zero training error). One popular way to attack this problem is to analyze the underlying loss surface (Hardt and Ma, 2016; Kawaguchi, 2016; Kawaguchi and Bengio, 2018; Nguyen and Hein, 2017; Soudry and Carmon, 2016). The main argument of those works is that there are no bad local minima. It has been proven that gradient descent can find global minima (Ge et al., 2015; Lee et al., 2016) if the loss surface satisfies: (1) all local minima are global and (2) all saddle points are strict in the sense that there exists at least one negative curvature direction. Unfortunately, most of those works rely on unrealistic assumptions (e.g., linear activations (Hardt and Ma, 2016; Kawaguchi, 2016)) and cannot generalize to practical neural networks. Moreover, Yun et al. (2018) shows that small nonlinearity in shallow networks can create bad local minima.

Another way to understand the optimization of neural networks is to analyze optimization trajectories. Our work also falls within this category. However, most existing analyses focus on the case of gradient descent. Bartlett et al.; Arora et al. (2019a) studied the optimization trajectory of deep linear networks and showed that gradient descent can find global minima under some assumptions. Previously, the dynamics of linear networks have also been studied by Saxe et al. (2013); Advani and Saxe (2017). For nonlinear neural networks, a series of papers (Tian, 2017; Brutzkus and Globerson, 2017; Du et al., 2017; Li and Yuan, 2017; Zhang et al., 2018b) studied a specific class of shallow two-layer neural networks together with strong assumptions on input distribution as well as realizability of labels, proving global convergence of gradient descent. Very recently, several works have proved global convergence of gradient descent (Li and Liang, 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Zou et al., 2018) or adaptive gradient methods (Wu et al., 2019) on overparameterized neural networks. More specifically, Li and Liang (2018); Allen-Zhu et al. (2018); Zou et al. (2018) analyzed the dynamics of weights and showed that the gradient cannot be small if the objective value is large. On the other hand, Du et al. (2018b, a); Wu et al. (2019) studied the dynamics of the outputs of neural networks, where the convergence properties are captured by a Gram matrix. Our work is very similar to Du et al. (2018b); Wu et al. (2019). We note that these papers all require the step size to be sufficiently small to guarantee the global convergence, leading to slow convergence.

To our knowledge, there is only one paper (Bernacchia et al., 2018) studying the global convergence of natural gradient for neural networks. However, Bernacchia et al. (2018) only studied deep linear networks with infinitesimal step size and squared error loss functions. In this sense, our work is the first one proving global convergence of natural gradient descent on nonlinear networks.

There have been many attempts to understand the generalization properties of neural networks since Zhang et al. (2016)’s seminal paper. Researchers have proposed norm-based generalization bounds (Neyshabur et al., 2015, 2017; Bartlett and Mendelson, 2002; Bartlett et al., 2017; Golowich et al., 2017), compression bounds (Arora et al., 2018) and PAC-Bayes bounds (Dziugaite and Roy, 2017, 2018; Zou et al., 2018). Recently, overparameterization of neural networks together with good initialization has been believed to be one key factor of good generalization. Neyshabur et al. (2019) empirically showed that wide neural networks stay close to the initialization, thus leading to good generalization. Theoretically, researchers did prove that overparameterization as well as linear convergence jointly restrict the weights to be close to the initialization (Du et al., 2018b, a; Allen-Zhu et al., 2018; Zou et al., 2018). The most closely related paper is Arora et al. (2019b), which shows that the optimization and generalization phenomenon can be explained by a Gram matrix. The main difference is that our analysis is based on natural gradient descent, which converges faster and provably generalizes as well as gradient descent.

Lastly, our work is also related to kernel methods, especially the neural tangent kernel (Jacot et al., 2018; Lee et al., 2019), since the Gram matrix used in this work is exactly the neural tangent kernel. However, the original NTK analysis applied only to infinitely wide networks.

## 3 Convergence Analysis of Natural Gradient Descent

We begin our convergence analysis of natural gradient descent, under appropriate conditions, for the neural network optimization problem. Formally, we consider a generic neural network $f(\theta, x)$ with a single output and squared error loss $\ell(u, y) = \frac{1}{2}(u - y)^2$ for simplicity, where $\theta$ denotes all parameters of the network (i.e. weights and biases). (It is easy to extend the analysis to multi-output networks and other loss functions; here we focus on the single-output, quadratic case just for notational simplicity.) Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, we want to minimize the following loss function:

$$ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(\theta, x_i), y_i) = \frac{1}{2n} \sum_{i=1}^{n} (f(\theta, x_i) - y_i)^2. \tag{1} $$

One main focus of this paper is to analyze the following procedure:

$$ \theta(k+1) = \theta(k) - \eta F(\theta(k))^{-1} \frac{\partial L(\theta(k))}{\partial \theta(k)}, \tag{2} $$

where $\eta$ is the step size, and $F$ is the Fisher information matrix associated with the network’s predictive distribution over $y$ (which is implied by its loss function and is a unit-variance Gaussian centered at $f(\theta, x_i)$ for the squared error loss) and the dataset’s distribution over $x$.

As shown by Martens (2014), the Fisher matrix is equivalent to the generalized Gauss-Newton matrix, defined as

$$ F(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_i^\top H_\ell J_i, $$

if the predictive distribution is in the exponential family, such as a categorical distribution (for classification) or a Gaussian distribution (for regression).

Here $J_i$ is the Jacobian matrix of $u_i = f(\theta, x_i)$ with respect to the parameters $\theta$, and $H_\ell$ is the Hessian of the loss $\ell(u, y)$ with respect to the network prediction $u$ (which is the identity in our setting). Therefore, with the squared error loss, the Fisher matrix can be compactly written as $F = \frac{1}{n} J^\top J$ (which coincides with the classical Gauss-Newton matrix), where $J = [J_1^\top, \dots, J_n^\top]^\top$ is the Jacobian matrix for the whole dataset. In practice, when the number of parameters is larger than the number of samples, the Fisher information matrix is necessarily singular. In that case, we can take the generalized inverse (Bernacchia et al., 2018) $F^\dagger = n J^\top G^{-1} G^{-1} J$ with $G = J J^\top$, which gives the following update rule:

$$ \theta(k+1) = \theta(k) - \eta J^\top (J J^\top)^{-1} (u - y), \tag{3} $$

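where $u = [u_1, \dots, u_n]^\top$ with $u_i = f(\theta, x_i)$ collects the network predictions and $y = [y_1, \dots, y_n]^\top$ collects the targets.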
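As a numerical sanity check on update (3), the following NumPy sketch compares the pseudo-inverse form of the natural gradient direction with the Gram-matrix form $J^\top (J J^\top)^{-1} (u - y)$ on a random full-row-rank Jacobian. The dimensions, the random seed and the helper name `ngd_direction` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def ngd_direction(J, residual):
    """Return J^T (J J^T)^{-1} residual, the direction used in update (3)."""
    G = J @ J.T                          # n x n Gram matrix
    return J.T @ np.linalg.solve(G, residual)

rng = np.random.default_rng(0)
n, p = 5, 20                             # assumed sizes: n samples, p > n parameters
J = rng.standard_normal((n, p))          # random Jacobian, full row rank almost surely
residual = rng.standard_normal(n)        # stands in for u - y

# Fisher matrix and parameter gradient for the squared error loss.
F = J.T @ J / n
grad = J.T @ residual / n

# Moore-Penrose pseudo-inverse; rcond discards the numerically zero directions.
direction_pinv = np.linalg.pinv(F, rcond=1e-10) @ grad
direction_gram = ngd_direction(J, residual)

# For a model that is linear in its parameters, a step of size 1
# removes the output-space residual entirely.
new_residual = residual - J @ direction_gram
```

The two directions agree up to rounding, which is why the update can be carried out with an $n \times n$ solve rather than a pseudo-inverse of the full parameter-sized Fisher matrix.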

We now introduce two conditions on the network that suffice for proving the global convergence of NGD to a minimizer which achieves zero training loss (and is therefore a global minimizer). To motivate these two conditions we make the following observations. First, the global minimizer is characterized by the condition that the gradient in the output space is zero for each case (i.e. $u_i - y_i = 0$ for all $i$). Meanwhile, local minima are characterized by the condition that the gradient with respect to the parameters is zero. Thus, one way to avoid finding local minima that aren’t global is to ensure that the parameter gradient is zero if and only if the output space gradient (for each case) is zero. It’s not hard to see that this property holds as long as $G = J J^\top$ remains non-singular throughout optimization (or equivalently that $J$ always has full row rank). The following two conditions ensure that this happens, by first requiring that this property hold at initialization time, and second that $J$ changes slowly enough that it remains true in a big enough neighborhood around $\theta(0)$.

###### Condition 1 (Full row rank of Jacobian matrix).

The Jacobian matrix $J(0)$ at the initialization is full row rank, or equivalently, the Gram matrix $G(0) = J(0) J(0)^\top$ is positive definite.

###### Remark 1.

Condition 1 implies that the number of parameters is at least $n$, which means the Fisher information matrix is singular and we have to use the generalized inverse, except in the case where the number of parameters is exactly $n$.

###### Condition 2 (Stable Jacobian).

There exists such that for all parameters that satisfy , we have

$$ \|J(\theta) - J(0)\|_2 \le \frac{C}{3} \sqrt{\lambda_{\min}(G(0))}. \tag{4} $$

This condition shares the same spirit with the Lipschitz smoothness assumption in classical optimization theory. It implies (with small $C$) that the network is close to a linearized network (Lee et al., 2019) around the initialization and therefore the natural gradient descent update is close to the gradient descent update in the output space. Along with Condition 1, we have the following theorem.

###### Theorem 1 (Natural gradient descent).

Let Condition 1 and 2 hold. Suppose we optimize with NGD using a step size . Then for we have

$$ \|u(k) - y\|_2^2 \le (1 - \eta)^k \|u(0) - y\|_2^2. \tag{5} $$

Note that $\|u(k) - y\|_2^2$ is the squared error loss up to a constant factor. Due to space constraints we only give a short sketch of the proof here. The full proof is given in Appendix B.

Proof Sketch. Our proof relies on the following insights. First, if the Jacobian matrix is full row rank, this guarantees linear convergence for infinitesimal step size. The linear convergence property restricts the parameters to be close to the initialization, which implies the Jacobian matrix is always full row rank throughout the training, and therefore natural gradient descent with infinitesimal step size converges to global minima. Furthermore, given the network is close to a linearized network (since the Jacobian matrix is stable with respect to small perturbations around the initialization), we are able to extend the proof to discrete time with a large step size.

In summary, we prove that NGD exhibits linear convergence to the global minimizer of the neural network training problem, under Conditions 1 and 2. We believe our arguments in this section are general (i.e., architecture-agnostic), and can serve as a recipe for proving global convergence of natural gradient descent in other settings.

### 3.1 Other Loss Functions

We note that our analysis can be easily extended to a more general class of loss functions. Here, we take the class of strongly convex functions with Lipschitz-continuous gradients as an example. Note that strong convexity is a very mild assumption since we can always add regularization to make a convex loss strongly convex. Therefore, this function class includes regularized cross-entropy loss (which is typically used in classification) and squared error (for regression). For this type of loss, we need a stronger version of Condition 2.

###### Condition 3 (Stable Jacobian).

There exists such that for all parameters that satisfy where

$$ \|J(\theta) - J(0)\|_2 \le \frac{C}{3} \sqrt{\lambda_{\min}(G(0))}. \tag{6} $$

###### Theorem 2.

Under Condition 1 and 3, but with -strongly convex loss function with -Lipschitz gradient (), and we set the step size , then we have for

 (7)

The key step of proving Theorem 2 is to show if is large enough, then natural gradient descent is approximately gradient descent in the output space. Thus the results can be easily derived according to standard bounds for convex optimization. Due to the page limit, we defer the proof to the Appendix C.

###### Remark 2.

In Theorem 2, the convergence rate depends on the condition number of the loss function, a dependence that can be removed if we take the curvature information of the loss function into account. In other words, we expect that the bound has no dependency on this condition number if we use the Fisher matrix rather than the classical Gauss-Newton matrix (assuming Euclidean metric in the output space (Luk and Grosse, 2018)) in Theorem 2.

## 4 Optimizing Overparameterized Neural Networks

In Section 3, we analyzed the convergence properties of natural gradient descent, under the abstract Conditions 1 and 2. In this section, we make our analysis concrete by applying it to a specific type of overparameterized network (with a certain random initialization). We show that Conditions 1 and 2 hold with high probability. We therefore establish that NGD exhibits linear convergence to a global minimizer for such networks.

### 4.1 Notation

We let . We use $\otimes$ and $\odot$ to denote the Kronecker and Hadamard products, respectively. And we use and to denote row-wise and column-wise Khatri-Rao products, respectively. For a matrix $A$, we use $A_{ij}$ to denote its $(i,j)$-th entry. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ to denote the largest and smallest eigenvalues of a square matrix, and $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ to denote the largest and smallest singular values of a (possibly non-square) matrix. For a positive definite matrix $A$, we use to denote its condition number, i.e., $\lambda_{\max}(A) / \lambda_{\min}(A)$. We also use to denote the standard inner product between two vectors. Given an event $E$, we use $\mathbb{I}\{E\}$ to denote the indicator function for $E$.

### 4.2 Problem Setup

Formally, we consider a neural network of the following form:

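$$ f(w, a, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \phi(w_r^\top x), \tag{8} $$

where $x$ is the input, $w = [w_1^\top, \dots, w_m^\top]^\top$ is the weight matrix (formed into a vector) of the first layer, $a_r$ is the output weight of hidden unit $r$ and $\phi(\cdot)$ is the ReLU activation function (acting entry-wise for vector arguments). For $r = 1, \dots, m$, we initialize the weights of the first layer $w_r \sim \mathcal{N}(0, \nu^2 I)$ and output weight .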

Following Du et al. (2018b); Wu et al. (2019), we make the following assumption on the data.

###### Assumption 1.

For all i, and . For any , .

This very mild condition simply requires the inputs and outputs have standardized norms, and that different input vectors are distinguishable from each other. Datasets that do not satisfy this condition can be made to do so via simple pre-processing.

Following Du et al. (2018b); Oymak and Soltanolkotabi (2019); Wu et al. (2019), we only optimize the weights of the first layer, i.e., $\theta = w$. (We fix the second layer just for simplicity. Based on the same analysis, one can also prove global convergence for jointly training both layers.) Therefore, natural gradient descent can be simplified to

$$ w(k+1) = w(k) - \eta J^\top (J J^\top)^{-1} (u - y). \tag{9} $$

Though this is only a shallow fully connected neural network, the objective is still non-smooth and non-convex (Du et al., 2018b) due to the use of ReLU activation function. We further note that this two-layer network model has been useful in understanding the optimization and generalization of deep neural networks (Xie et al., 2016; Li and Liang, 2018; Du et al., 2018b; Arora et al., 2019b; Wu et al., 2019), and some results have been extended to multi-layer networks (Du et al., 2018a).
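To make the setup concrete, here is a minimal NumPy sketch of update (9) applied to network (8), training only the first layer. The toy sizes, the step size of 0.5, the targets, and in particular the choice of fixed random-sign output weights are assumptions made for illustration; they are not taken from the analysis above.

```python
import numpy as np

def predict(W, a, X):
    """Network (8): f(w, a, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x)."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def jacobian(W, a, X):
    """Jacobian of the n outputs w.r.t. the flattened first-layer weights."""
    n, d = X.shape
    m = W.shape[0]
    active = (X @ W.T > 0).astype(float)           # n x m ReLU derivatives
    J = (active * a)[:, :, None] * X[:, None, :]   # n x m x d
    return J.reshape(n, m * d) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m, eta = 5, 10, 4000, 0.5                    # assumed toy sizes and step size
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
y = rng.uniform(-1.0, 1.0, size=n)                 # assumed bounded targets
W = rng.standard_normal((m, d))                    # Gaussian first-layer weights
a = rng.choice([-1.0, 1.0], size=m)                # assumption: fixed random signs

losses = []
for _ in range(30):
    u = predict(W, a, X)
    losses.append(np.sum((u - y) ** 2))
    J = jacobian(W, a, X)
    step = J.T @ np.linalg.solve(J @ J.T, u - y)   # J^T (J J^T)^{-1} (u - y)
    W -= eta * step.reshape(m, d)
losses.append(np.sum((predict(W, a, X) - y) ** 2))
```

In this wide toy network the recorded squared errors shrink geometrically, in line with the behaviour the theory predicts; narrower networks or larger $n$ may behave less cleanly because the activation pattern changes more between steps.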

Following Du et al. (2018b); Wu et al. (2019), we define the limiting Gram matrix as follows:

###### Definition 1 (Limiting Gram Matrix).

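The limiting Gram matrix $G^\infty \in \mathbb{R}^{n \times n}$ is defined as follows. For the $(i,j)$-th entry, we have

$$ G^\infty_{ij} = \mathbb{E}_{w \sim \mathcal{N}(0, \nu^2 I)} \left[ x_i^\top x_j \, \mathbb{I}\{ w^\top x_i \ge 0, \, w^\top x_j \ge 0 \} \right] = x_i^\top x_j \, \frac{\pi - \arccos(x_i^\top x_j)}{2\pi}. \tag{10} $$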

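This matrix coincides with the neural tangent kernel (Jacot et al., 2018) for the ReLU activation function. As shown by Du et al. (2018b), this matrix is positive definite and we define its smallest eigenvalue $\lambda_0 \triangleq \lambda_{\min}(G^\infty)$. In the same way, we can define its finite version $G$ with $(i,j)$-entry $G_{ij} = \frac{1}{m} \sum_{r=1}^{m} x_i^\top x_j \, \mathbb{I}\{ w_r^\top x_i \ge 0, \, w_r^\top x_j \ge 0 \}$.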
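The closed form in (10) can be checked against a finite-width average. The sketch below, with assumed toy sizes and hypothetical helper names `limiting_gram` and `finite_gram`, computes both for random unit-norm inputs and reads off the smallest eigenvalue of the limiting matrix.

```python
import numpy as np

def limiting_gram(X):
    """Closed form (10), valid when the rows of X have unit norm."""
    rho = np.clip(X @ X.T, -1.0, 1.0)      # guard arccos against rounding
    return rho * (np.pi - np.arccos(rho)) / (2.0 * np.pi)

def finite_gram(X, W):
    """Average over hidden units of x_i^T x_j * 1{both pre-activations >= 0}."""
    active = (X @ W.T >= 0).astype(float)  # n x m activation pattern
    return (X @ X.T) * (active @ active.T) / W.shape[0]

rng = np.random.default_rng(0)
n, d, m = 5, 8, 50000                      # assumed toy sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))            # the indicator does not depend on the scale nu

G_inf = limiting_gram(X)
G_m = finite_gram(X, W)
lambda_0 = np.linalg.eigvalsh(G_inf).min() # smallest eigenvalue of the limiting matrix
max_gap = np.abs(G_m - G_inf).max()        # entrywise Monte Carlo error
```

The diagonal of the limiting matrix equals $1/2$ for unit-norm inputs, and the entrywise gap between the two matrices decays at the usual Monte Carlo rate in $m$.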

### 4.3 Exact Natural Gradient Descent

In this subsection, we present our result for this setting. The main difficulty is to show that Conditions 1 and 2 hold. Here we state our main result.

###### Theorem 3 (Natural Gradient Descent for overparameterized Networks).

Under Assumption 1, if we i.i.d initialize , for , we set the number of hidden nodes , and the step size , then with probability at least over the random initialization we have for

$$ \|u(k) - y\|_2^2 \le (1 - \eta)^k \|u(0) - y\|_2^2. \tag{11} $$

Even though the objective is non-convex and non-smooth, natural gradient descent with a constant step size enjoys a linear convergence rate. For large enough , we show that the learning rate can be chosen up to , so NGD can provably converge within steps. Compared to analogous bounds for gradient descent (Du et al., 2018a; Oymak and Soltanolkotabi, 2019; Wu et al., 2019), we improve the maximum allowable learning rate from to and also get rid of the dependency on . Overall, NGD (Theorem 3) gives an improvement over gradient descent.

Our strategy to prove this result is to show that, for the given choice of random initialization, Conditions 1 and 2 hold with high probability. To prove that Condition 1 holds, we use matrix concentration inequalities. For Condition 2, we show that , which implies the Jacobian is stable for wide networks. For the detailed proof, we refer the reader to Appendix D.1.

### 4.4 Approximate Natural Gradient Descent with K-FAC

Exact natural gradient descent is quite expensive in terms of computation or memory. In training deep neural networks, K-FAC (Martens and Grosse, 2015) has been a powerful optimizer for leveraging curvature information while retaining tractable computation. The K-FAC update rule for the two-layer ReLU network is given by

$$ w(k+1) = w(k) - \eta \underbrace{\left[ (X^\top X)^{-1} \otimes (S(k)^\top S(k))^{-1} \right]}_{F^{-1}_{\text{K-FAC}}} J(k)^\top (u(k) - y), \tag{12} $$

where $X$ denotes the matrix formed from the $n$ input vectors (i.e. $X = [x_1, \dots, x_n]^\top$), and $S(k)$ is the matrix of pre-activation derivatives. Under the same argument as the Gram matrix $G^\infty$, we get that is strictly positive definite with smallest eigenvalue (see Appendix D.3 for detailed proof).

We show that for sufficiently wide networks, K-FAC does converge linearly to a global minimizer. We further show, with a particular transformation on the input data, K-FAC does match the optimization performance of exact natural gradient for two-layer ReLU networks. Here we state the main result.

###### Theorem 4 (K-FAC).

Under the same assumptions as in Theorem 3, plus the additional assumption that , if we set the number of hidden units and step size , then with probability at least over the random initialization, we have for

$$ \|u(k) - y\|_2^2 \le \left( 1 - \frac{\eta}{\lambda_{\max}(X^\top X)} \right)^k \|u(0) - y\|_2^2. \tag{13} $$

The key step in proving Theorem 4 is to show

$$ u(k+1) - u(k) \approx \left[ \left( X (X^\top X)^{-1} X^\top \right) \odot I \right] (y - u(k)). \tag{14} $$

###### Remark 3.

The convergence rate of K-FAC is captured by the condition number of the matrix $X^\top X$. Compared to gradient descent (Du et al., 2018b; Oymak and Soltanolkotabi, 2019), whose convergence is governed by the Gram matrix $G^\infty$, $X^\top X$ typically has a smaller condition number, indicating faster convergence.

###### Remark 4.

The dependence of the convergence rate on the condition number of $X^\top X$ in Theorem 4 may seem paradoxical, as K-FAC is invariant to invertible linear transformations of the data (including those that would change this condition number). But we note that said transformations would also make the norms of the input vectors non-uniform, thus violating Assumption 1 in a way that isn’t repairable. Interestingly, there exists an invertible linear transformation which, if applied to the input vectors and followed by normalization, produces vectors that simultaneously satisfy Assumption 1 and the condition (thus improving the bound in Theorem 4 substantially). See Appendix A for details. Notably, K-FAC is not invariant to such pre-processing, as the normalization step is a nonlinear operation.

To quantify the degree of overparameterization (which is a function of the network width $m$) required to achieve global convergence under our analysis, we must estimate . To this end, we observe that , and then apply the following lemma:

###### Lemma 1.

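[Schur (1911)] For two positive definite matrices $A$ and $B$, we have

$$ \lambda_{\max}(A \odot B) \le \max_i A_{ii} \, \lambda_{\max}(B), \qquad \lambda_{\min}(A \odot B) \ge \min_i A_{ii} \, \lambda_{\min}(B). \tag{15} $$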

The diagonal entries of $X X^\top$ are all $1$ since the inputs are normalized. Therefore, we have , so K-FAC requires a slightly higher degree of overparameterization than exact NGD under our analysis.
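Lemma 1 is easy to probe numerically. The following sketch draws two random positive definite matrices (the construction in the hypothetical helper `random_pd` and the matrix size are assumptions chosen for illustration) and evaluates both sides of (15).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pd(n):
    """Random symmetric positive definite matrix (illustrative construction)."""
    M = rng.standard_normal((n, n + 2))
    return M @ M.T + 0.1 * np.eye(n)

A, B = random_pd(6), random_pd(6)

hadamard_eigs = np.linalg.eigvalsh(A * B)   # A * B is the entrywise (Hadamard) product
eigs_B = np.linalg.eigvalsh(B)

upper = A.diagonal().max() * eigs_B.max()   # right-hand side of the first bound in (15)
lower = A.diagonal().min() * eigs_B.min()   # right-hand side of the second bound in (15)
```

When one factor has a constant diagonal, as with normalized inputs, the bounds reduce to that constant times the extreme eigenvalues of the other factor.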

### 4.5 Bounding λ0

As pointed out by Allen-Zhu et al. (2018), it is unclear if is small or even polynomial. Here, we bound $\lambda_0$ using matrix concentration inequalities and harmonic analysis. To leverage harmonic analysis, we have to assume the data are drawn i.i.d. from the unit sphere. (This assumption is not too stringent since the inputs are already normalized. Moreover, we can relax the assumption of unit sphere input to separable input, which is used in Li and Liang (2018); Allen-Zhu et al. (2018); Zou et al. (2018). See Oymak and Soltanolkotabi (2019) (Theorem I.1) for more details.)

###### Theorem 5.

Under this assumption on the training data, with probability ,

 λ0≜λmin(G∞)≥nβ/2,whereβ∈(0,0.5) (16)

Basically, Theorem 5 says that the Gram matrix $G^\infty$ is likely to have a large smallest eigenvalue if the training data are uniformly distributed. Intuitively, we would expect the smallest eigenvalue to be very small if all $x_i$ are similar to each other. Therefore, some notion of diversity of the training inputs is needed. We conjecture that the smallest eigenvalue would still be large if the data are -separable (i.e., for any pair ), an assumption adopted by Li and Liang (2018); Allen-Zhu et al. (2018); Zou et al. (2018).

## 5 Generalization analysis

It is often speculated that NGD or other preconditioned gradient descent methods (e.g., Adam) perform worse than gradient descent in terms of generalization (Wilson et al., 2017). In this section, we show that NGD can provably generalize as well as GD, at least for two-layer ReLU networks.

Consider a loss function . The expected risk over the data distribution and the empirical risk over a training set are defined as

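$$ L_{\mathcal{D}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \ell(f(x), y) \right] \quad \text{and} \quad L_S(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i). \tag{17} $$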

It has been shown (Neyshabur et al., 2019) that the Rademacher complexity (Bartlett and Mendelson, 2002) for two-layer ReLU networks depends on . By a standard generalization bound of Rademacher complexity, we have the following bound (see Appendix E.1 for proof):

###### Theorem 6.

Given a target error parameter and failure probability . Suppose and . For any 1-Lipschitz loss function, with probability at least over random initialization and training samples, the two-layer neural network trained by NGD for iterations has expected loss bounded as:

$$ L_{\mathcal{D}}(f(w, a)) \le \sqrt{\frac{2 \, y^\top (G^\infty)^{-1} y}{n}} + 3 \sqrt{\frac{\log(6/\delta)}{2n}} + \epsilon, \tag{18} $$

which has the same form as for gradient descent in Arora et al. (2019b).

## 6 Conclusion

We’ve analyzed for the first time the rate of convergence to a global optimum for (both exact and approximate) natural gradient descent on nonlinear neural networks. Particularly, we identified two conditions which guarantee the global convergence, i.e., the Jacobian matrix with respect to the parameters is full row rank and stable for perturbations around the initialization. Based on these insights, we improved the convergence rate of gradient descent by a factor of on two-layer ReLU networks by using natural gradient descent. Beyond that, we also showed that the improved convergence rates don’t come at the expense of worse generalization.

## Acknowledgements

We thank Jeffrey Z. HaoChen, Shengyang Sun and Mufan Li for helpful discussion.

## Appendix A The Forster Transform

In a breakthrough paper in the area of communication complexity, Forster [2002] used the existence of a certain kind of dataset transformation as the key technical tool in the proof of his main result. The Theorem which establishes the existence of this transformation is paraphrased below.

###### Theorem 7 (Forster [2002], Theorem 4.1).

Suppose is a matrix such that all subsets of size at most of its rows are linearly independent. Then there exists an invertible matrix such that if we post-multiply by (i.e. apply to each row), and then normalize each row by its 2-norm, the resulting matrix satisfies .

###### Remark 5.

Note that the technical condition about linear independence can easily be made to hold for an arbitrary by adding an infinitesimal random perturbation, assuming it doesn’t hold to begin with.

This result basically says that for any set of vectors, there is a linear transformation of said vectors which makes their normalized versions (given by the rows of ) satisfy . So by combining this linear transformation with normalization we produce a set of vectors that simultaneously satisfy Assumption 1, while also satisfying .

Forster’s proof of Theorem 7 can be interpreted as defining a transformation function on (initialized at ), and showing that it has a fixed point with the required properties. One can derive an algorithm from this by repeatedly applying the transformation to , which consists of "whitening" followed by normalization, until is sufficiently close to . The matrix is then simply the product of the "whitening" transformation matrices, up to a scalar constant. While no explicit finite-time convergence guarantees are given for this algorithm by Forster [2002], we have implemented it and verified that it does indeed converge at a reasonable rate. The algorithm is outlined below.
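A minimal NumPy sketch of the whiten-then-normalize iteration described above follows. The names `X`, `Z`, `A` and `forster_transform`, the stopping rule, and the iteration cap are all assumptions introduced here. The sketch also assumes the target property is that $Z^\top Z$ is a multiple of the identity, which for $n$ unit-norm rows in $d$ dimensions must be $(n/d) I$ since the trace equals $n$.

```python
import numpy as np

def normalize_rows(Z):
    """Scale every row of Z to unit 2-norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def forster_transform(X, num_iters=200, tol=1e-10):
    """Alternate whitening and row normalization; return (A, Z)."""
    n, d = X.shape
    Z = normalize_rows(X)
    A = np.eye(d)                                # running product of whitening matrices
    for _ in range(num_iters):
        M = Z.T @ Z * (d / n)                    # equals I at the fixed point
        if np.linalg.norm(M - np.eye(d)) < tol:
            break
        evals, evecs = np.linalg.eigh(M)
        W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening matrix M^{-1/2}
        Z = normalize_rows(Z @ W)
        A = A @ W
    return A, Z

rng = np.random.default_rng(0)
n, d = 40, 4                                     # assumed toy sizes
X = rng.standard_normal((n, d))                  # generic inputs
A, Z = forster_transform(X)
```

Because row scaling commutes with right-multiplication, normalizing the rows of `X @ A` reproduces `Z`, so `A` plays the role of the single invertible transformation applied before normalization.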

## Appendix B Proof of Theorem 1

We prove the result in two steps: we first provide a convergence analysis for natural gradient flow, i.e., natural gradient descent with infinitesimal step size, and then take into account the error introduced by discretization and show global convergence for natural gradient descent.

To guarantee global convergence for natural gradient flow, we only need to show that the Gram matrix is positive definite throughout the training. Intuitively, to find a global minimum, the network must satisfy the following condition: the gradient with respect to the parameters is zero only if the gradient in the output space is zero. For this, it suffices to show that the Gram matrix is positive definite, or equivalently, that the Jacobian matrix has full row rank.

By Condition 1 and Condition 2, we immediately obtain the following lemma that if the parameters stay close to the initialization, then the Gram matrix is positive definite throughout the training.

###### Lemma 2.

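If $\|\boldsymbol{\theta} - \boldsymbol{\theta}(0)\|_2 \leq \frac{3\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(0))}}$, then we have $\lambda_{\min}(\mathbf{G}) \geq \frac{4}{9}\lambda_{\min}(\mathbf{G}(0))$.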

###### Proof of Lemma 2.

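Based on the inequality $\sigma_{\min}(\mathbf{A} + \mathbf{B}) \geq \sigma_{\min}(\mathbf{A}) - \sigma_{\max}(\mathbf{B})$, where $\sigma$ denotes a singular value, we have

$$
\sigma_{\min}(\mathbf{J}) \geq \sigma_{\min}(\mathbf{J}(0)) - \|\mathbf{J} - \mathbf{J}(0)\|_2. \tag{19}
$$

By Condition 2, we have $\|\mathbf{J} - \mathbf{J}(0)\|_2 \leq \frac{C}{3}\sqrt{\lambda_{\min}(\mathbf{G}(0))}$, thus we get $\sigma_{\min}(\mathbf{J}) \geq \frac{2}{3}\sqrt{\lambda_{\min}(\mathbf{G}(0))}$, which completes the proof. ∎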
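The singular-value perturbation inequality used in this proof can be checked numerically. The sketch below uses random matrices as stand-ins for $\mathbf{J}(0)$ and the perturbation $\mathbf{J} - \mathbf{J}(0)$; the sizes, the noise scale and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 40                        # n outputs, m parameters (m > n: overparameterized)
J0 = rng.normal(size=(n, m))        # stand-in for J(0); full row rank almost surely
E = 0.1 * rng.normal(size=(n, m))   # stand-in for the perturbation J - J(0)
J = J0 + E

def sigma_min(M):
    """Smallest singular value of M."""
    return np.linalg.svd(M, compute_uv=False).min()

lhs = sigma_min(J)
rhs = sigma_min(J0) - np.linalg.norm(E, ord=2)   # right-hand side of eqn. (19)

# For the Gram matrix G = J J^T, lambda_min(G) equals sigma_min(J)^2.
lam_min_G = np.linalg.eigvalsh(J @ J.T).min()
```

The last line makes explicit why a lower bound on $\sigma_{\min}(\mathbf{J})$ translates directly into a lower bound on the smallest eigenvalue of the Gram matrix.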

With the assumption that the Gram matrix remains positive definite throughout the training, we are now ready to prove global convergence for natural gradient flow. Recall the dynamics of natural gradient flow in weight space,

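$$
\frac{d}{dt}\boldsymbol{\theta}(t) = \frac{1}{n}\mathbf{F}(t)^{\dagger}\mathbf{J}(t)^{\top}(\mathbf{y} - \mathbf{u}(t)) \tag{20}
$$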

Accordingly, we can calculate the dynamics of the network predictions.

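$$
\begin{aligned}
\frac{d}{dt}\mathbf{u}(t) &= \frac{1}{n}\mathbf{J}(t)\mathbf{F}(t)^{\dagger}\mathbf{J}(t)^{\top}(\mathbf{y} - \mathbf{u}(t)) \\
&= \mathbf{J}(t)\mathbf{J}(t)^{\top}\mathbf{G}(t)^{-1}\mathbf{G}(t)^{-1}\mathbf{J}(t)\mathbf{J}(t)^{\top}(\mathbf{y} - \mathbf{u}(t))
\end{aligned} \tag{21}
$$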

Since the Gram matrix is positive definite, its inverse does exist. Therefore, we have

$$
\frac{d}{dt}\mathbf{u}(t) = \mathbf{y} - \mathbf{u}(t) \tag{22}
$$

By the chain rule, we get the dynamics of the loss in the following form:

$$
\frac{d}{dt}\|\mathbf{y} - \mathbf{u}(t)\|_2^2 = -2(\mathbf{y} - \mathbf{u}(t))^{\top}(\mathbf{y} - \mathbf{u}(t)) \tag{23}
$$

By integrating eqn. (23), we find that $\|\mathbf{y} - \mathbf{u}(t)\|_2^2 = \exp(-2t)\,\|\mathbf{y} - \mathbf{u}(0)\|_2^2$.
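As a sanity check on eqn. (22), consider the special case of a linear model $\mathbf{u} = \mathbf{J}\boldsymbol{\theta}$ with a constant, full-row-rank Jacobian. This is an assumption made only for this illustration and is not the setting of Theorem 1. In that case a discrete natural gradient step built from the Gram matrix $\mathbf{G} = \mathbf{J}\mathbf{J}^\top$ shrinks the residual by exactly the factor $1 - \eta$ per iteration, which is the discrete counterpart of the exponential decay above. All names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 4, 20, 0.5
J = rng.normal(size=(n, m))    # constant Jacobian of the linear model (assumption)
y = rng.normal(size=n)         # targets
theta = rng.normal(size=m)     # initial parameters

def ngd_step(theta, eta):
    """One step theta <- theta - eta * J^T G^{-1} (u - y), with G = J J^T."""
    u = J @ theta
    G = J @ J.T
    return theta - eta * J.T @ np.linalg.solve(G, u - y)

residuals = [np.linalg.norm(y - J @ theta)]
for _ in range(10):
    theta = ngd_step(theta, eta)
    residuals.append(np.linalg.norm(y - J @ theta))

# Ratio of consecutive residual norms; each entry equals 1 - eta up to rounding.
ratios = np.array(residuals[1:]) / np.array(residuals[:-1])
```

For a nonlinear network the Jacobian changes between iterations, which is precisely the discretization error that the remainder of the proof controls.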

That completes the continuous time analysis, under the assumption that the parameters stay close to the initialization. The discrete case follows similarly, except that we need to account for the discretization error. Analogously to eqn. (21), we calculate the difference of predictions between two consecutive iterations.

$$
\begin{aligned}
\mathbf{u}(k+1) - \mathbf{u}(k) &= \mathbf{u}\big(\boldsymbol{\theta}(k) - \eta\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}(\mathbf{u}(k) - \mathbf{y})\big) - \mathbf{u}(\boldsymbol{\theta}(k)) \\
&= -\int_{s=0}^{1}\left\langle \frac{\partial \mathbf{u}(\boldsymbol{\theta}(s))}{\partial \boldsymbol{\theta}^{\top}},\ \eta\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}(\mathbf{u}(k) - \mathbf{y})\right\rangle ds \\
&= \underbrace{-\int_{s=0}^{1}\left\langle \frac{\partial \mathbf{u}(\boldsymbol{\theta}(k))}{\partial \boldsymbol{\theta}^{\top}},\ \eta\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}(\mathbf{u}(k) - \mathbf{y})\right\rangle ds}_{\eta(\mathbf{y} - \mathbf{u}(k))} \\
&\quad + \underbrace{\int_{s=0}^{1}\left\langle \frac{\partial \mathbf{u}(\boldsymbol{\theta}(k))}{\partial \boldsymbol{\theta}^{\top}} - \frac{\partial \mathbf{u}(\boldsymbol{\theta}(s))}{\partial \boldsymbol{\theta}^{\top}},\ \eta\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}(\mathbf{u}(k) - \mathbf{y})\right\rangle ds}_{\text{①}}
\end{aligned} \tag{24}
$$

where we have defined $\boldsymbol{\theta}(s) = s\,\boldsymbol{\theta}(k+1) + (1-s)\,\boldsymbol{\theta}(k)$ for $s \in [0, 1]$.

Next we bound the norm of the second term (①) in the RHS of eqn. (24). Using Condition 2 and Lemma 2 we have that

$$
\begin{aligned}
\|\text{①}\|_2 &\leq \eta\left\|\int_{s=0}^{1}\mathbf{J}(\boldsymbol{\theta}(s)) - \mathbf{J}(\boldsymbol{\theta}(k))\,ds\right\|_2 \left\|\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}(\mathbf{u}(k) - \mathbf{y})\right\|_2 \\
&\leq \eta\,\frac{2C}{3}\sqrt{\lambda_{\min}(\mathbf{G}(0))}\,\frac{1}{\sqrt{\lambda_{\min}(\mathbf{G}(k))}}\,\|\mathbf{u}(k) - \mathbf{y}\|_2 \\
&\leq \eta\,\frac{2C}{3}\sqrt{\lambda_{\min}(\mathbf{G}(0))}\,\frac{3}{2\sqrt{\lambda_{\min}(\mathbf{G}(0))}}\,\|\mathbf{u}(k) - \mathbf{y}\|_2 \\
&= \eta C\,\|\mathbf{u}(k) - \mathbf{y}\|_2.
\end{aligned} \tag{25}
$$

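In the second inequality of eqn. (25), we used the fact (based on Condition 2) that

$$
\begin{aligned}
\left\|\int_{s=0}^{1}\mathbf{J}(\boldsymbol{\theta}(s)) - \mathbf{J}(\boldsymbol{\theta}(k))\,ds\right\|_2 &\leq \|\mathbf{J}(\boldsymbol{\theta}(k)) - \mathbf{J}(\boldsymbol{\theta}(0))\|_2 + \|\mathbf{J}(\boldsymbol{\theta}(k+1)) - \mathbf{J}(\boldsymbol{\theta}(0))\|_2 \\
&\leq \frac{2C}{3}\sqrt{\lambda_{\min}(\mathbf{G}(0))}
\end{aligned} \tag{26}
$$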

Lastly, we have

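$$
\begin{aligned}
\|\mathbf{y} - \mathbf{u}(k+1)\|_2^2 &= \|\mathbf{y} - \mathbf{u}(k) - (\mathbf{u}(k+1) - \mathbf{u}(k))\|_2^2 \\
&= \|\mathbf{y} - \mathbf{u}(k)\|_2^2 - 2(\mathbf{y} - \mathbf{u}(k))^{\top}(\mathbf{u}(k+1) - \mathbf{u}(k)) + \|\mathbf{u}(k+1) - \mathbf{u}(k)\|_2^2 \\
&\leq \left(1 - 2\eta + 2\eta C + \eta^2(1 + C)^2\right)\|\mathbf{y} - \mathbf{u}(k)\|_2^2 \\
&\leq (1 - \eta)\,\|\mathbf{y} - \mathbf{u}(k)\|_2^2.
\end{aligned} \tag{27}
$$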

In the last inequality of eqn. (27), we use the assumption that $\eta \leq \frac{1 - 2C}{(1 + C)^2}$.
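For completeness, the step-size condition behind the last inequality can be worked out as follows. This is a sketch: the symbol $\mathbf{r}(k)$ for the second term of eqn. (24) is introduced here only for illustration, and $\eta > 0$ is assumed. Writing $\mathbf{u}(k+1) - \mathbf{u}(k) = \eta(\mathbf{y} - \mathbf{u}(k)) + \mathbf{r}(k)$ with $\|\mathbf{r}(k)\|_2 \leq \eta C\|\mathbf{y} - \mathbf{u}(k)\|_2$ from eqn. (25), the cross term and the squared term satisfy

$$
\begin{aligned}
-2(\mathbf{y} - \mathbf{u}(k))^{\top}(\mathbf{u}(k+1) - \mathbf{u}(k)) &\leq (-2\eta + 2\eta C)\,\|\mathbf{y} - \mathbf{u}(k)\|_2^2, \\
\|\mathbf{u}(k+1) - \mathbf{u}(k)\|_2^2 &\leq \eta^2(1 + C)^2\,\|\mathbf{y} - \mathbf{u}(k)\|_2^2,
\end{aligned}
$$

and the resulting contraction factor is at most $1 - \eta$ exactly when

$$
1 - 2\eta + 2\eta C + \eta^2(1 + C)^2 \leq 1 - \eta
\iff \eta(1 + C)^2 \leq 1 - 2C
\iff \eta \leq \frac{1 - 2C}{(1 + C)^2}.
$$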

So far, we have assumed the parameters fall within a certain radius around the initialization. We now justify this assumption.

###### Lemma 3.

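If Conditions 1 and 2 hold, then as long as $\lambda_{\min}(\mathbf{G}(s)) \geq \frac{4}{9}\lambda_{\min}(\mathbf{G}(0))$ for all $0 \leq s \leq k$, we have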

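$$
\|\boldsymbol{\theta}(k+1) - \boldsymbol{\theta}(0)\|_2 \leq \frac{3\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(0))}}. \tag{28}
$$

###### Proof of Lemma 3.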

We use the norm of each update to bound the distance of the parameters to the initialization.

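$$
\begin{aligned}
\|\boldsymbol{\theta}(k+1) - \boldsymbol{\theta}(0)\|_2 &\leq \eta\sum_{s=0}^{k}\left\|\mathbf{J}(s)^{\top}\mathbf{G}(s)^{-1}(\mathbf{y} - \mathbf{u}(s))\right\|_2 \\
&\leq \eta\sum_{s=0}^{k}\frac{\|\mathbf{y} - \mathbf{u}(s)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(s))}} \\
&\leq \eta\sum_{s=0}^{k}(1 - \eta)^{s/2}\,\frac{\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\frac{4}{9}\lambda_{\min}(\mathbf{G}(0))}} \\
&\leq \frac{3\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(0))}}.
\end{aligned} \tag{29}
$$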

This completes the proof. ∎
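The last step of eqn. (29) rests on a geometric-series bound, spelled out here as a sketch assuming $0 < \eta \leq 1$:

$$
\eta\sum_{s=0}^{k}(1 - \eta)^{s/2} \leq \frac{\eta}{1 - \sqrt{1 - \eta}} = 1 + \sqrt{1 - \eta} \leq 2,
$$

so that

$$
\eta\sum_{s=0}^{k}(1 - \eta)^{s/2}\,\frac{\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\frac{4}{9}\lambda_{\min}(\mathbf{G}(0))}}
\leq 2 \cdot \frac{3}{2} \cdot \frac{\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(0))}}
= \frac{3\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\lambda_{\min}(\mathbf{G}(0))}}.
$$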

At first glance, the proofs of Lemmas 2 and 3 seem to be circular. Here, we prove that their assumptions continue to be jointly satisfied.

###### Lemma 4.

Assuming Conditions 1 and 2, we have (1) and (2) throughout the training.

###### Proof.

We prove the lemma by contradiction. Suppose the conclusion does not hold for all iterations. Let’s say (1) holds at iteration but not iteration . Then we know, there must exist such that from , otherwise we can show that (1) holds at iteration as well by Lemma 3. However, by Lemma 2, we know that since (1) holds for , contradiction. ∎

Notably, Lemma 4 shows that and throughout the training if Conditions 1 and 2 hold. This completes the proof of our main result.

## Appendix C Proof of Theorem 2

Here, we prove Theorem 2 by induction. Our inductive hypothesis is the following condition.

###### Condition 4.

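At the $k$-th iteration, we have $\|\mathbf{y} - \mathbf{u}(k)\|_2^2 \leq \left(1 - \frac{2\eta\mu L}{\mu + L}\right)^{k}\|\mathbf{y} - \mathbf{u}(0)\|_2^2$.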

We first use the norm of the gradient to bound the distance of the weights. Here we slightly abuse notation, .

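$$
\begin{aligned}
\|\boldsymbol{\theta}(k+1) - \boldsymbol{\theta}(0)\|_2 &\leq \eta\sum_{s=0}^{k}\left\|\mathbf{J}(s)^{\top}\mathbf{G}(s)^{-1}\nabla_{\mathbf{u}}\mathcal{L}(\mathbf{u}(s))\right\|_2 \\
&\leq \eta L\sum_{s=0}^{k}\left\|\mathbf{J}(s)^{\top}\mathbf{G}(s)^{-1}\right\|_2\,\|\mathbf{y} - \mathbf{u}(s)\|_2 \\
&\leq \eta L\sum_{s=0}^{\infty}\left(1 - \frac{2\eta\mu L}{\mu + L}\right)^{s/2}\frac{\|\mathbf{y} - \mathbf{u}(0)\|_2}{\sqrt{\frac{4}{9}\lambda_{\min}(\mathbf{G}(0))}} \\
&\leq \frac{3(1 + \kappa)\,\|\mathbf{y} - \mathbf{u}(0)\|_2}{2\sqrt{\lambda_{\min}(\mathbf{G}(0))}}
\end{aligned} \tag{30}
$$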

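where $\kappa = L/\mu$. The second inequality is based on the $L$-Lipschitz gradient assumption (footnote 4: that the gradient of the per-example loss is $L$-Lipschitz implies that the gradient of $\mathcal{L}$ is also $L$-Lipschitz) and the fact that $\nabla_{\mathbf{u}}\mathcal{L}(\mathbf{y}) = \mathbf{0}$. Also, we have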

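$$
\mathbf{u}(k+1) - \mathbf{u}(k) = -\eta\nabla_{\mathbf{u}}\mathcal{L}(\mathbf{u}(k)) - \eta\mathbf{P}(k)\nabla_{\mathbf{u}}\mathcal{L}(\mathbf{u}(k)) \tag{31}
$$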

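In analogy to eqn. (24), $\mathbf{P}(k) = \int_{s=0}^{1}\big(\mathbf{J}(\boldsymbol{\theta}(s)) - \mathbf{J}(\boldsymbol{\theta}(k))\big)\,ds\;\mathbf{J}(k)^{\top}\mathbf{G}(k)^{-1}$. Next, we introduce a well-known lemma for a $\mu$-strongly convex loss with $L$-Lipschitz gradient.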

###### Lemma 5 (Co-coercivity for μ-strongly convex loss).

If the loss function is $\mu$-strongly convex with $L$-Lipschitz gradient, then for any $\mathbf{u}$, $\mathbf{y}$, the following inequality holds.

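$$
(\nabla\mathcal{L}(\mathbf{u}) - \nabla\mathcal{L}(\mathbf{y}))^{\top}(\mathbf{u} - \mathbf{y}) \geq \frac{\mu L}{\mu + L}\|\mathbf{u} - \mathbf{y}\|_2^2 + \frac{1}{\mu + L}\|\nabla\mathcal{L}(\mathbf{u}) - \nabla\mathcal{L}(\mathbf{y})\|_2^2 \tag{32}
$$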
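The co-coercivity inequality can be checked numerically on a simple example. The sketch below assumes a quadratic loss whose Hessian has its spectrum in $[\mu, L]$; this is a choice made purely for illustration, and all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, L = 6, 0.5, 4.0

# Symmetric Hessian with eigenvalues spread over [mu, L] (assumed quadratic loss).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = Q @ np.diag(np.linspace(mu, L, n)) @ Q.T
y = rng.normal(size=n)

def grad(u):
    """Gradient of the quadratic loss 0.5 * (u - y)^T H (u - y)."""
    return H @ (u - y)

def cocoercivity_gap(u, v):
    """Left-hand side minus right-hand side of eqn. (32) for the pair (u, v)."""
    g = grad(u) - grad(v)
    lhs = g @ (u - v)
    rhs = mu * L / (mu + L) * ((u - v) @ (u - v)) + (g @ g) / (mu + L)
    return lhs - rhs

gaps = [cocoercivity_gap(rng.normal(size=n), rng.normal(size=n)) for _ in range(1000)]
```

For this quadratic case the gap is nonnegative because every Hessian eigenvalue $h \in [\mu, L]$ satisfies $(h - \mu)(L - h) \geq 0$, which rearranges to $h \geq \frac{\mu L + h^2}{\mu + L}$.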

Now, we are ready to bound $\|\mathbf{u}(k+1) - \mathbf{y}\|_2^2$:

 ∥u(k+1)−y∥22 =∥u(k)−η(I+P(k))∇uL(u(k))−y∥22 (33) ≤∥u(k)−y∥22−2η∇uL(u(k))⊤(u(k)−y) +2η∥P∥2∥∇uL(u(k))∥2∥u(k)−y∥2+η2(1+∥P(k)∥2)2∥∇uL(u(k))∥22 ≤∥u(k)−y∥22−2ημLμ+L∥u(k)−y∥22−2ημ+L∥∇uL(u(k))∥22 +2η∥P∥2∥∇uL(u(k))∥2∥