Stochastic gradient descent (SGD) remains the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in methods significantly more complicated than SGD. This paper proposes a new method to estimate a preconditioner such that the amplitudes of perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to the Newton method for deterministic optimization. Unlike preconditioners based on secant equation fitting, as used in deterministic quasi-Newton methods, which assume a positive definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and non-convex optimization with exact or noisy gradients. When a stochastic gradient is used, it naturally damps the gradient noise to stabilize SGD. Efficient preconditioner estimation methods are developed, and with reasonable simplifications they are applicable to large-scale problems. Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as the training of a deep neural network or of a recurrent neural network requiring extremely long-term memories.


## I Introduction

Stochastic gradient descent (SGD) has a long history in signal processing and machine learning [2, 3, 1, 4, 5, 20, 21]. In adaptive signal processing, an exact gradient might be unavailable in a time-varying setting, and typically it is replaced with the instantaneous gradient, a stochastic gradient with mini-batch size 1 [2, 1]. In machine learning, e.g., the training of neural networks, deterministic gradient descent is either expensive when the training data are large, or unnecessary when the training samples are redundant [5]. SGD remains a popular choice due to its simplicity and proven efficiency in solving large-scale problems. However, SGD may converge slowly and is difficult to tune, especially for large-scale problems. When the Hessian matrix is available and small, second order optimization methods like the Newton method might be the best choice, but in practice this is seldom the case. For example, calculation of the Hessian of a fairly standard feedforward neural network can be much more complicated than its gradient evaluation [6]. In deterministic optimization, quasi-Newton methods and (nonlinear) conjugate gradient methods are among the most popular choices, and they converge fast once the solution is located in a basin of attraction. However, these methods require a line search step, which can be problematic when the cost function cannot be efficiently evaluated, a typical scenario where SGD is used. Still, they have been applied with success to machine learning problems like neural network training, and are available in standard toolboxes, e.g., the Matlab neural network toolbox [7]. The highly specialized Hessian-free neural network training methods in [8] represent the state of the art in this direction. There are attempts to adapt deterministic quasi-Newton methods to stochastic optimization [10, 9, 11]. However, due to the existence of gradient noise and the infeasibility of line search in stochastic optimization, the resultant methods either impose strong restrictions, such as convexity, on the target problems, or are significantly more complicated than SGD. On the other hand, numerous specialized SGD methods have been developed for different applications. In blind source separation and independent component analysis, the relative (natural) gradient is proposed to replace the regular gradient in SGD [20, 21]. However, in general, natural gradient descent using metrics like Fisher information can be as problematic as the Newton method for large-scale problems. In neural network training, a number of specialized methods have been developed to improve the convergence of SGD; to name a few: the classic momentum method and Nesterov's accelerated gradient, the RMSProp method and its variations, various step size control strategies, pre-training, clever initialization, and a few recent methods providing element-wise learning rates [12, 13, 14, 15, 16, 17]. Clearly, we need a stochastic optimization method that is as simple and widely applicable as SGD, converges as fast as a second order method, and, last but not least, is user friendly, requiring little tuning effort.

In this paper, we carefully examine SGD in the general non-convex optimization setting. We show that it is possible to design a preconditioner that works well for both convex and non-convex problems, and that such a preconditioner can be estimated exclusively from the noisy gradient information. However, when the problem is non-convex, the preconditioner cannot be estimated in the conventional way of Hessian estimation used in quasi-Newton methods. For general non-convex optimization, a new method is required to estimate the preconditioner, which is not necessarily the inverse of the Hessian, but is related to it. This new preconditioner has several desirable properties. It reduces the eigenvalue spread of preconditioned SGD to speed up convergence, scales the stochastic gradient in a way comparable to the Newton method so that step size selection is trivial, and, lastly, it has a built-in gradient noise suppression mechanism to stabilize preconditioned SGD when the gradient is heavily noisy. Practical implementation methods are developed, and applications to a number of interesting problems demonstrate the usefulness of our methods.

## II Background

### II-A SGD

Although this paper focuses on stochastic optimization, deterministic gradient descent can be viewed as a special case of SGD without gradient noise, and the theories and methods developed in this paper apply to it as well. Here, we consider the minimization of the cost function

 f(θ) = E[ℓ(θ, z)], (1)

where θ is the parameter vector to be optimized, z is a random vector, ℓ is a loss function, and E takes expectation over z. For example, in the problem of classification using a neural network, θ represents the vector containing all the tunable weights of the network, z is the pair of feature vector x and class label y, and ℓ typically is a differentiable loss such as the mean squared error, the cross entropy loss, or the (multi-class) hinge loss. Gradient descent can be used to learn θ as

 θ[new] = θ[old] − μg(θ[old]), (2)

where μ is a positive step size, and

 g(θ) = E[∂ℓ(θ, z)/∂θ] (3)

is the gradient of f with respect to θ. Evaluating the expectation in (3) may be undesirable or impossible. In SGD, this expectation is approximated with a sample average, leading to the following update rule for θ,

 ^g(θ) = (1/n) ∑_{i=1}^{n} ∂ℓ(θ, z_i)/∂θ, (4)
 θ[new] = θ[old] − μ^g(θ[old]), (5)

where the hat suggests that the variable under it is estimated, n is the mini-batch size, and z_i denotes the ith sample, typically drawn randomly from the training data. Now ^g(θ) is a random vector, and it is useful to rewrite it as

 ^g(θ) = g(θ) + ϵ′ (6)

to clearly show its deterministic and random parts, where the random vector ϵ′ models the approximation error due to replacing the expectation with a sample average. Although popular due to its simplicity, SGD may converge slowly, and the selection of μ is nontrivial, as revealed in the following subsection.
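As a concrete illustration of update rule (5), the sketch below runs mini-batch SGD on a hypothetical least-squares loss; the data sizes, mini-batch size, and step size are ad hoc choices for this toy example, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: least-squares loss, so the mini-batch gradient (4)
# is an average of per-sample gradients.
A = rng.standard_normal((100, 5))
theta_true = rng.standard_normal(5)
y = A @ theta_true + 0.01 * rng.standard_normal(100)

def stochastic_grad(theta, idx):
    # hat{g}(theta) = (1/n) sum_i d l(theta, z_i)/d theta over mini-batch idx
    X, t = A[idx], y[idx]
    return X.T @ (X @ theta - t) / len(idx)

theta = np.zeros(5)
mu = 0.05                                 # step size; selecting it is the hard part
for _ in range(2000):
    idx = rng.choice(100, size=10, replace=False)     # random mini-batch, n = 10
    theta = theta - mu * stochastic_grad(theta, idx)  # update rule (5)
```

Even on this well-conditioned toy problem, the step size μ had to be chosen by trial and error, which is exactly the tuning burden discussed above.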

### II-B Second Order Approximation

We consider the following second order approximation of f(θ) around a point θ_0,

 f(θ) ≈ f(θ_0) + g_0^T(θ − θ_0) + (1/2)(θ − θ_0)^T H_0 (θ − θ_0), (7)

where the superscript T denotes transpose, g_0 is the gradient of f at θ_0, and

 H_0 = ∂²f(θ)/(∂θ^T ∂θ) |_{θ=θ_0}

is the Hessian matrix at θ_0. Note that H_0 is symmetric by its definition. With this approximation, the gradients of f with respect to θ around θ_0 can be evaluated as

 g(θ) ≈ g_0 + H_0(θ − θ_0), (8)
 ^g(θ) = g_0 + H_0(θ − θ_0) + ϵ, (9)

where ϵ contains the errors introduced in both (6) and (8). Using (9), around θ_0, the learning rule (5) turns into the following linear system,

 θ[new] = (I − μH_0)θ[old] − μ(g_0 − H_0θ_0 + ϵ), (10)

where I is a conformable identity matrix. The behaviors of such a linear system largely depend on the selection of μ and on the distribution of the eigenvalues of H_0. In practice, this linear system may have a large dimension and be ill-conditioned. Furthermore, little is known about H_0. The selection of μ is largely based on trial and error, and still, the convergence may be slow. However, it is possible to precondition such a linear system to accelerate its convergence remarkably, as shown in the next section.

## III Preconditioned SGD

A preconditioned SGD is defined by

 θ[new] = θ[old] − μP^g(θ[old]), (11)

where P is a conformable matrix called the preconditioner. The SGD in (5) is the special case of (11) with P = I, and we shall call it the plain SGD. Noting that our target is to minimize the cost function f(θ), P must be positive definite, and symmetric as well by convention, such that −P^g(θ) always points in a descent direction. In convex optimization, the inverse of the Hessian is a popular preconditioner. However, in general, the Hessian is not easy to obtain, is not always positive definite, and, for SGD, such an inverse-of-Hessian preconditioner may significantly amplify the gradient noise, especially when the Hessian is ill-conditioned. In this section, we detail the behaviors of preconditioned SGD in a general setting where the problem can be non-convex and P is not necessarily related to the Hessian.
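A minimal sketch of the update (11), with arbitrary example matrices, shows why positive definiteness of P matters: the directional derivative of the cost along −P^g is −^g^T P ^g < 0 whenever ^g ≠ 0, so the preconditioned step is always a descent step:

```python
import numpy as np

rng = np.random.default_rng(1)

B = rng.standard_normal((6, 6))
P = B @ B.T + 6 * np.eye(6)        # symmetric positive definite preconditioner
g = rng.standard_normal(6)         # a (stochastic) gradient at theta_old

# Directional derivative of f along -P g is -g^T P g < 0 for any g != 0,
# so the preconditioned step always points in a descent direction.
descent_rate = -g @ P @ g

theta_old = rng.standard_normal(6)
mu = 0.01
theta_new = theta_old - mu * P @ g  # preconditioned update (11); P = I is plain SGD
```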

### Iii-a Convergence of Preconditioned SGD

We consider the same second order approximation given in (7). With preconditioning, the learning rule in (10) becomes

 θ[new] = (I − μPH_0)θ[old] − μP(g_0 − H_0θ_0 + ϵ). (12)

To proceed, let us first prove the following statement.

Proposition 1: All the eigenvalues of PH_0 are real, and, for nonzero eigenvalues, their signs are the same as those of the eigenvalues of H_0.

Proof: Let P^{0.5} denote the principal square root of P. Since P is positive definite, P^{0.5} is symmetric and positive definite as well. First, we show that PH_0 and P^{0.5}H_0P^{0.5} have the same eigenvalues. Supposing v is an eigenvector of P^{0.5}H_0P^{0.5} associated with eigenvalue λ, i.e., P^{0.5}H_0P^{0.5}v = λv, then we have

 PH_0(P^{0.5}v) = P^{0.5}P^{0.5}H_0P^{0.5}v = λP^{0.5}v.

As P^{0.5} is positive definite, P^{0.5}v ≠ 0. Thus P^{0.5}v is an eigenvector of PH_0 associated with eigenvalue λ. Similarly, if v is an eigenvector of PH_0 associated with eigenvalue λ, then by rewriting PH_0v = λv as

 (P^{0.5}H_0P^{0.5})(P^{−0.5}v) = λP^{−0.5}v,

we see that λ is an eigenvalue of P^{0.5}H_0P^{0.5} as well. Hence PH_0 and P^{0.5}H_0P^{0.5} have identical eigenvalues, and they are real as P^{0.5}H_0P^{0.5} is symmetric. Second, the matrices H_0 and P^{0.5}H_0P^{0.5} are congruent, and thus their nonzero eigenvalues have the same signs, which implies that the nonzero eigenvalues of PH_0 and H_0 have the same signs as well.

Basically, Proposition 1 states that preconditioning does not change the local geometric structure around θ_0 in the sense that a local minimum, maximum, or saddle point before preconditioning remains a local minimum, maximum, or saddle point after preconditioning, as shown in the numerical example given in Fig. 1.
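Proposition 1 is easy to check numerically. The sketch below builds a random positive definite P and a random indefinite symmetric H_0 (sizes and seed are arbitrary), and confirms that the eigenvalues of PH_0 are real with the same signs as those of H_0:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

B = rng.standard_normal((n, n))
P = B @ B.T + n * np.eye(n)          # random symmetric positive definite P
C = rng.standard_normal((n, n))
H0 = (C + C.T) / 2                   # random symmetric, generally indefinite H0

eig_H = np.sort(np.linalg.eigvalsh(H0))
eig_PH = np.sort(np.linalg.eigvals(P @ H0).real)  # real up to round-off
```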

By introducing the eigenvalue decomposition

 PH_0 = VDV^{−1},

we can rewrite (12) as

 ϑ[new] = (I − μD)ϑ[old] − μV^{−1}P(g_0 − H_0θ_0 + ϵ), (13)

where the diagonal matrix D and the nonsingular matrix V contain the eigenvalues and eigenvectors of PH_0 respectively, and ϑ = V^{−1}θ defines a new parameter vector in the transformed coordinates. Since D is diagonal, (13) suggests that each dimension of ϑ evolves independently. This greatly simplifies the study of preconditioned SGD. Without loss of generality, we consider the ith dimension of ϑ,

 ϑ[new]_i = (1 − μd_i)ϑ[old]_i − μh_i − μv_i, (14)

where ϑ_i and h_i are the ith elements of ϑ and V^{−1}P(g_0 − H_0θ_0) respectively, d_i is the ith diagonal element of D, and the random variable v_i is the ith element of the random vector V^{−1}Pϵ. We consider the following three cases.

1) d_i > 0: By choosing μ < 1/d_i, we have |1 − μd_i| < 1. Then repeatedly applying (14) will let the expectation of ϑ_i converge to −h_i/d_i. When the variance of v_i is bounded, the variance of ϑ_i is bounded as well.

2) d_i < 0: For any step size μ > 0, we have |1 − μd_i| > 1. Iteration (14) always pushes ϑ_i away from −h_i/d_i.

3) d_i = 0: This should be a transient state, since ϑ_i always drifts away from such a singular point due to gradient noise.

A similar picture holds for the convergence of θ as well. When H_0 is positive definite, the diagonal elements of D are positive according to Proposition 1. Then by choosing μ < 1/λ_max(PH_0), repeated applications of iteration (13) let the expectation of θ converge to θ_0 − H_0^{−1}g_0, a local minimum of f around θ_0, where λ_max(·) denotes the maximum eigenvalue. When H_0 is negative definite, θ_0 is located around a local maximum of f, and iteration (13) pushes the expectation of θ away from it. When H_0 has both positive and negative eigenvalues, by choosing μ < 1/max_i|d_i|, the part of ϑ associated with positive eigenvalues is attracted to its stationary point, while the part associated with negative eigenvalues is repelled away from it. Such a saddle point is unstable due to gradient noise.

### III-B Three Desirable Properties of a Preconditioner

We expect a good preconditioner to have the following three desirable properties.

#### III-B1 Small eigenvalue spread

In order to achieve approximately uniform convergence rates on all the coordinates of ϑ, all the eigenvalues of PH_0 should have similar amplitudes. We use the standard deviation of the logarithms of the absolute eigenvalues of a matrix to measure its eigenvalue spread. The eigenvalue spread gain of a preconditioner P is defined as the ratio of the eigenvalue spread of H_0 to the eigenvalue spread of PH_0. A larger eigenvalue spread gain is preferred, and, as a baseline, plain SGD has an eigenvalue spread gain of 1.
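The eigenvalue spread measure can be sketched as follows; the ill-conditioned H_0 is a made-up example. With the Newton-style choice P = H_0^{−1}, the spread of PH_0 collapses to round-off level, i.e., the eigenvalue spread gain is very large:

```python
import numpy as np

def eig_spread(M):
    # standard deviation of the logarithms of the absolute eigenvalues of M
    return np.std(np.log(np.abs(np.linalg.eigvals(M))))

rng = np.random.default_rng(3)
Qm, _ = np.linalg.qr(rng.standard_normal((5, 5)))
H0 = Qm @ np.diag([1e-3, 1e-1, 1.0, 1e1, 1e3]) @ Qm.T  # ill-conditioned Hessian

P = np.linalg.inv(H0)             # Newton-style preconditioner
spread_plain = eig_spread(H0)     # eigenvalue spread seen by plain SGD (P = I)
spread_pre = eig_spread(P @ H0)   # spread after preconditioning, near zero
# eigenvalue spread gain = spread_plain / spread_pre >> 1
```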

#### III-B2 Normalized eigenvalue amplitudes

We hope that all the absolute eigenvalues of PH_0 are close to 1 to facilitate the step size selection in preconditioned SGD. Note that in deterministic optimization, the step size can be determined by line search, and thus the scales of the eigenvalues are of less interest. However, in stochastic optimization, without the help of line search and knowledge of the Hessian, step size selection can be tricky. We use the mean absolute eigenvalue of PH_0 to measure the scaling effect of P. In a well preconditioned SGD, a normalized step size, i.e., 0 < μ ≤ 1, should work well for the whole learning process, eliminating any manual step size tweaking effort.

#### III-B3 Large stochastic gradient noise suppression gain

Unlike in deterministic optimization, preconditioning for SGD comes at a price: amplification of gradient noise. For the plain SGD in (5), the gradient noise energy is E[(ϵ′)^Tϵ′]; for the preconditioned SGD in (11), this noise energy is E[(ϵ′)^TP²ϵ′]. We use the preconditioned gradient noise energy of the Newton method, which is E[(ϵ′)^TH_0^{−2}ϵ′], as the reference, and define the noise suppression gain of a preconditioner P as

 E[(ϵ′)^T H_0^{−2} ϵ′] / E[(ϵ′)^T P² ϵ′] = tr{H_0^{−2} E[ϵ′(ϵ′)^T]} / tr{P² E[ϵ′(ϵ′)^T]},

where tr{·} takes the trace of a matrix. When the gradient noise is approximately isotropic, a good approximation of the noise suppression gain is tr(H_0^{−2})/tr(P²).
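Under the isotropic-noise approximation, the gain reduces to tr(H_0^{−2})/tr(P²), which the toy computation below evaluates for two choices of P (the Hessian here is a made-up example). Note that plain SGD, P = I, has a gain far above 1 on an ill-conditioned problem, i.e., it suppresses noise much better than the Newton preconditioner does:

```python
import numpy as np

def noise_gain(P, H0):
    # tr(H0^{-2}) / tr(P^2): noise suppression gain of P under isotropic
    # gradient noise, E[eps' eps'^T] = sigma^2 I
    H0inv = np.linalg.inv(H0)
    return np.trace(H0inv @ H0inv) / np.trace(P @ P)

H0 = np.diag([1e-2, 1.0, 1e2])                   # made-up ill-conditioned Hessian
newton_gain = noise_gain(np.linalg.inv(H0), H0)  # reference gain: exactly 1
plain_gain = noise_gain(np.eye(3), H0)           # plain SGD, P = I
```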

Unfortunately, due to the existence of gradient noise, generally we cannot find a single preconditioner that simultaneously satisfies all our expectations. When the gradient noise vanishes, for a nonsingular H_0 we can indeed find at least one ideal preconditioner such that all the eigenvalues of PH_0 have unitary amplitude, as stated in the following proposition.

Proposition 2: For any nonsingular symmetric H_0, there exists at least one preconditioner P such that all the absolute eigenvalues of PH_0 are unitary, and such a P is unique when H_0 is positive or negative definite.

Proof: First, we show that such a preconditioner exists. Assuming the eigenvalue decomposition of H_0 is H_0 = UΛU^T, we can construct a desired preconditioner as P = U|Λ|^{−1}U^T, where U is an orthogonal matrix, Λ is a diagonal matrix, and |·| takes the element-wise absolute value.

Second, we show that such a preconditioner is unique when H_0 is positive or negative definite. When H_0 is positive definite, according to Proposition 1, PH_0 has positive eigenvalues. If all these eigenvalues are 1, then PH_0 = I, i.e., P = H_0^{−1}. For negative definite H_0, similarly we can show that P = −H_0^{−1}.

However, such a preconditioner is not necessarily unique for an indefinite Hessian. For example, for the Hessian matrix

 H_0 = [ 1  0
         0 −1 ],

any preconditioner having the form

 P = (1/(α² − β²)) [ α²+β²  2αβ
                     2αβ    α²+β² ],  |α| ≠ |β|,

makes PH_0 have unitary absolute eigenvalues.
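The 2×2 example can be verified directly; the values of α and β below are arbitrary, subject only to |α| ≠ |β|:

```python
import numpy as np

alpha, beta = 2.0, 1.0                 # arbitrary values with |alpha| != |beta|
H0 = np.array([[1.0, 0.0],
               [0.0, -1.0]])
P = np.array([[alpha**2 + beta**2, 2 * alpha * beta],
              [2 * alpha * beta, alpha**2 + beta**2]]) / (alpha**2 - beta**2)

abs_eigs = np.abs(np.linalg.eigvals(P @ H0))   # both have amplitude 1
```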

## IV Preconditioner Estimation Criteria

In practice, the gradient may be easy to evaluate, but the Hessian matrix is not. Thus we focus on preconditioner estimation methods that use only the noisy stochastic gradient information. We first discuss two criteria based on secant equation fitting. Although they are not ideal for non-convex optimization, it is still beneficial to study them in detail, as they are intimately related to deterministic and stochastic quasi-Newton methods. We then propose a new preconditioner estimation criterion suitable for both convex and non-convex stochastic optimization, and show how it overcomes the fundamental limitations of secant equation fitting based solutions.

### IV-A Criteria Based on Secant Equation Fitting

Let δθ be a small perturbation of θ around θ_0. From (9), we know that

 ^g(θ + δθ) − ^g(θ) = H_0δθ + ε, (15)

where ε is a random vector accounting for the errors introduced by the stochastic approximation of the gradients and the second order approximation of the cost function. It is proposed to use the same randomly sampled training data to calculate ^g(θ + δθ) and ^g(θ), for two reasons. First, this practice reduces stochastic noise, and makes sure that ^g(θ + δθ) − ^g(θ) → 0 when δθ → 0. Second, it avoids reloading or regenerating training data. To simplify the notation, we rewrite (15) as

 δ^g = H_0δθ + ε, (16)

where δ^g denotes a random perturbation of the stochastic gradient. We call (16) the stochastic secant equation. In quasi-Newton methods, the secant equation is used to derive diverse forms of estimators for the Hessian or its inverse, and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula is among the most widely used ones. In deterministic optimization, BFGS is used along with line search to ensure that the updated Hessian estimate is always positive definite. However, in stochastic optimization, due to the existence of gradient noise and the infeasibility of line search, attempts in this direction have achieved only limited success. One common assumption used to justify (16) for preconditioner estimation is that the true Hessians around θ_0 are positive definite, i.e., the optimization problem is convex or the solution is already in a basin of attraction. This can be a serious limitation in practice, although it greatly simplifies the design of stochastic quasi-Newton methods. Nevertheless, secant equation based Hessian estimators are still widely adopted in both deterministic and stochastic optimization.

#### IV-A1 Criterion 1

With sufficiently many independent pairs (δθ, δ^g) around θ_0, we are able to estimate H_0 by fitting the secant equation (16). It is natural to assume that the error ε is Gaussian distributed, and thus a reasonable criterion for preconditioner estimation is

 c_1(P) = E[‖δ^g − P^{−1}δθ‖²], (17)

where ‖·‖ denotes the Euclidean length of a vector, the associated pairs (δθ, δ^g) are regarded as random vectors, and E takes expectation over them. The preconditioner determined by this criterion is called preconditioner 1. Using the identity

 dP^{−1} = −P^{−1} dP P^{−1}, (18)

we can show that the derivative of c_1(P) with respect to P is

 ∂c_1(P)/∂P = P^{−1}(e_1δθ^T + δθe_1^T)P^{−1}, (19)

where d denotes differentiation and e_1 = δ^g − P^{−1}δθ. Noting that P is symmetric, the gradient of c_1 with respect to P is symmetric as well; equivalently, it is the symmetric part of the gradient calculated without considering the symmetry constraint. By letting the gradient be zero, we can find the optimal P by solving the equation

 R_θP^{−1} + P^{−1}R_θ − (R_θg + R_gθ) = 0, (20)

where R_θ = E[δθ δθ^T], R_θg = E[δθ δ^g^T], and R_gθ = E[δ^g δθ^T].

Equation (20) is a continuous Lyapunov equation, well known in control theory. Using the identity

 vec(ABC) = (C^T ⊗ A) vec(B), (21)

we can rewrite (20) as

 (I ⊗ R_θ + R_θ ⊗ I) vec(P^{−1}) = vec(R_θg + R_gθ) (22)

to solve for P^{−1}, where vec(·) forms a vector by stacking the columns of a matrix, ⊗ denotes the matrix Kronecker product, and I is a conformable identity matrix.

However, the solution given by (22) is not intuitive. We can solve for P^{−1} directly under the following mild assumptions:

• A1) ε has zero mean, i.e., E[ε] = 0.

• A2) δθ and ε are uncorrelated, i.e., E[δθ ε^T] = 0.

• A3) The covariance matrix R_θ is positive definite.

With the above assumptions, we have

 R_gθ = H_0R_θ,  R_θg = R_θH_0. (23)

Using (23), (20) becomes

 R_θ(P^{−1} − H_0) + (P^{−1} − H_0)R_θ = 0, (24)

or equivalently

 (I ⊗ R_θ + R_θ ⊗ I) vec(P^{−1} − H_0) = 0 (25)

by using (21). Since R_θ is positive definite, I ⊗ R_θ + R_θ ⊗ I is positive definite as well. Hence vec(P^{−1} − H_0) = 0 by (25), i.e., P^{−1} = H_0. Thus, as in deterministic optimization, with enough independent pairs (δθ, δ^g), criterion 1 leads to an asymptotically unbiased estimate of H_0^{−1}. Such an unbiasedness property is preferred in deterministic optimization; however, it might be undesirable in stochastic optimization, as such a preconditioner may significantly amplify the gradient noise. Furthermore, when preconditioner 1 is estimated using finitely many pairs (δθ, δ^g), P cannot be guaranteed to be positive definite even if H_0 is positive definite.

#### IV-A2 Criterion 2

By rewriting (16) as H_0^{−1}(δ^g − ε) = δθ, we may introduce another criterion for secant equation fitting,

 c_2(P) = E[‖Pδ^g − δθ‖²]. (26)

Naturally, the preconditioner determined by this criterion is called preconditioner 2. The derivative of c_2(P) with respect to P is

 ∂c_2(P)/∂P = δ^g e_2^T + e_2 δ^g^T, (27)

where e_2 = Pδ^g − δθ. By letting the derivative of c_2 with respect to P be zero, we find that the optimal P satisfies the equation

 PR_g + R_gP − R_gθ − R_θg = 0, (28)

where R_g = E[δ^g δ^g^T].

Again, (28) is a continuous Lyapunov equation, and it can be solved numerically. Unfortunately, there does not exist a simple closed-form relationship between the optimal P and H_0. Still, analytical solutions in simplified scenarios can cast crucial insight into the properties of this criterion. Under the assumption

A4) R_θ = σ_θ²I,  R_ε = E[εε^T] = σ_ε²I, (29)

the optimal P can be shown to be

 P = ∑_{i=1}^{m} [λ_i/(λ_i² + σ_ε²/σ_θ²)] u_iu_i^T, (30)

where H_0 = ∑_{i=1}^{m} λ_iu_iu_i^T is the eigenvalue decomposition of H_0, with u_i being its ith eigenvector associated with eigenvalue λ_i. As with criterion 1, criterion 2 leads to an unbiased estimate of H_0^{−1} when σ_ε = 0. But unlike criterion 1, the optimal P here underestimates the inverse of the Hessian when σ_ε > 0. Such a built-in annealing mechanism is actually desired in stochastic optimization. Again, when preconditioner 2 is estimated using finitely many pairs (δθ, δ^g), it can be indefinite even if H_0 is positive definite.
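The closed-form solution (30) is easy to tabulate. The sketch below uses a made-up diagonal Hessian; with no gradient noise it returns exactly H_0^{−1}, while with noise every eigenvalue of P is shrunk below the corresponding eigenvalue of H_0^{−1}:

```python
import numpy as np

H0 = np.diag([0.5, 1.0, 4.0])          # made-up positive definite Hessian
lam, U = np.linalg.eigh(H0)

def precond2(noise_ratio2):
    # Eq. (30): P = sum_i lambda_i/(lambda_i^2 + sigma_eps^2/sigma_theta^2) u_i u_i^T
    return U @ np.diag(lam / (lam**2 + noise_ratio2)) @ U.T

P_clean = precond2(0.0)    # sigma_eps = 0: exactly inv(H0), unbiased
P_noisy = precond2(0.25)   # sigma_eps > 0: underestimates inv(H0)
```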

### IV-B A New Criterion for Preconditioner Estimation

As examined in Section III, the essential utility of a preconditioner is to ensure that ϑ enjoys approximately uniform convergence rates across all directions. The amplitudes, but not the signs, of the eigenvalues of the Hessian are to be normalized. There is no need to explicitly estimate the Hessian. We propose the following new criterion for preconditioner estimation,

 c_3(P) = E[δ^g^T P δ^g + δθ^T P^{−1} δθ], (31)

and call it and the resultant preconditioner criterion 3 and preconditioner 3, respectively. The rationale behind criterion 3 is that when c_3(P) is minimized, we should have

 ∂c_3(κP)/∂κ |_{κ=1} = 0,

which suggests that its two terms, E[δ^g^T P δ^g] and E[δθ^T P^{−1} δθ], are equal, and thus the amplitudes of the perturbations of the preconditioned stochastic gradient match those of the parameter perturbations, as in the Newton method where the inverse of the Hessian is the preconditioner. Furthermore, c_3(P) is invariant to changes of the signs of δθ and δ^g, i.e., the pairs (δθ, δ^g) and (−δθ, −δ^g) result in the same cost. We present the following proposition to rigorously justify the use of criterion 3. To begin with, we first prove a lemma.

Lemma 1: If A is symmetric, A² = D, and D is a diagonal matrix with distinct nonnegative diagonal elements, then A = D_sign D^{0.5}, where D_sign is an arbitrary diagonal matrix with diagonal elements being either 1 or −1.

Proof: Let the eigenvalue decomposition of A be A = VΛV^T. Then A² = D suggests VΛ²V^T = D. Noting that the eigenvalue decomposition of D is unique when its diagonal elements are distinct, the columns of V must be signed standard basis vectors, and Λ² must coincide with D up to the ordering of its diagonal elements. Hence A = VΛV^T is itself diagonal with A² = D, i.e., A = D_sign D^{0.5}.

Proposition 3: For positive definite covariance matrices R_θ = E[δθ δθ^T] and R_g = E[δ^g δ^g^T], criterion c_3 determines an optimal positive definite preconditioner P scaling δ^g as PR_gP = R_θ. The optimal P is unique when R_θ^{0.5}R_gR_θ^{0.5} has distinct eigenvalues.

Proof: The derivative of c_3 with respect to P is

 ∂c_3(P)/∂P = E[δ^g δ^g^T] − P^{−1}E[δθ δθ^T]P^{−1}. (32)

By letting the gradient be zero, we obtain the following equation for the optimal P,

 PR_gP − R_θ = 0, (33)

which has the form of a continuous time algebraic Riccati equation known in control theory, but lacking the linear term in P. To solve for P, we rewrite (33) as

 (R_θ^{−0.5} P R_θ^{−0.5}) R_θ^{0.5} R_g R_θ^{0.5} (R_θ^{−0.5} P R_θ^{−0.5}) = I, (34)

where R_θ^{0.5} denotes the principal square root of the positive definite matrix R_θ. By introducing the eigenvalue decomposition

 R_θ^{0.5} R_g R_θ^{0.5} = UDU^T, (35)

we can rewrite (34) as

 (U^T R_θ^{−0.5} P R_θ^{−0.5} U) D (U^T R_θ^{−0.5} P R_θ^{−0.5} U) = I, (36)

or equivalently,

 D = (U^T R_θ^{−0.5} P R_θ^{−0.5} U)^{−2}. (37)

When R_θ^{0.5}R_gR_θ^{0.5} does not have repeated eigenvalues, the diagonal elements of D are distinct, and, applying Lemma 1 to the symmetric matrix U^T R_θ^{−0.5} P R_θ^{−0.5} U, whose square is D^{−1}, the solution must have the form

 P = R_θ^{0.5} U (D_sign D^{−0.5}) U^T R_θ^{0.5} (38)

by Lemma 1. For positive definite P, we can only choose D_sign = I, and thus the optimal P is unique. When R_θ^{0.5}R_gR_θ^{0.5} has repeated eigenvalues, (38) still gives a valid solution, but not necessarily the only one. The assumed nonsingularity of R_θ and R_g makes the involved matrix inversions feasible.
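The construction (34)–(38) can be replayed numerically: build P from the eigenvalue decomposition (35) with D_sign = I, and check that it is positive definite and satisfies the Riccati equation (33). The covariance matrices below are random examples:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4
A = rng.standard_normal((m, m))
R_theta = A @ A.T + m * np.eye(m)       # random positive definite covariances
B = rng.standard_normal((m, m))
R_g = B @ B.T + m * np.eye(m)

w, V = np.linalg.eigh(R_theta)
R_half = V @ np.diag(np.sqrt(w)) @ V.T  # principal square root R_theta^{0.5}

d, U = np.linalg.eigh(R_half @ R_g @ R_half)        # eigen decomposition (35)
P = R_half @ U @ np.diag(d ** -0.5) @ U.T @ R_half  # eq. (38) with D_sign = I
```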

Unlike the two secant equation fitting based preconditioners, preconditioner 3 is guaranteed to be positive definite as long as the estimated covariance matrices R_θ and R_g are positive definite. To gain more insight into the properties of preconditioner 3, we consider closed-form solutions in simplified scenarios that can explicitly link the optimal P to H_0. When R_θ and R_ε have the simple forms shown in (29), the closed-form solution for P is

 P = ∑_{i=1}^{m} [1/√(λ_i² + σ_ε²/σ_θ²)] u_iu_i^T, (39)

where H_0 = ∑_{i=1}^{m} λ_iu_iu_i^T is the eigenvalue decomposition of H_0. The eigenvalues of PH_0 are λ_i/√(λ_i² + σ_ε²/σ_θ²). When σ_ε/σ_θ ≫ |λ_i|, we have

 λ_i/√(λ_i² + σ_ε²/σ_θ²) → σ_θλ_i/σ_ε.

Thus, for a heavily noisy gradient, the optimal preconditioner cannot improve the eigenvalue spread, but it damps the gradient noise. Such a built-in step size adjusting mechanism is highly desired in stochastic optimization. When σ_ε = 0, P reduces to the ideal preconditioner constructed in the proof of Proposition 2, and all the eigenvalues of PH_0 have unitary amplitude. These properties make preconditioner 3 an ideal choice for preconditioned SGD.
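The limiting behavior of the eigenvalues of PH_0 under (39) can be sketched directly; the Hessian eigenvalues and noise levels below are made-up values for a non-convex case:

```python
import numpy as np

lam = np.array([-3.0, 0.5, 2.0])   # made-up Hessian eigenvalues (non-convex)

def precond3_eigs(noise_ratio):
    # eigenvalues of P H0 under (39):
    # lambda_i / sqrt(lambda_i^2 + (sigma_eps/sigma_theta)^2)
    return lam / np.sqrt(lam**2 + noise_ratio**2)

clean = precond3_eigs(0.0)     # no gradient noise: all amplitudes exactly 1
noisy = precond3_eigs(100.0)   # heavy noise: ~ sigma_theta lam / sigma_eps, damped
```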

### IV-C Relationship to Newton Method

In a Newton method for deterministic optimization, we have δg = H_0δθ, where δg is an exact gradient perturbation. Let us rewrite this secant equation in the matrix form H_0^{−1}E[δg δg^T]H_0^{−1} = E[δθ δθ^T], and compare it with the relation PE[δ^g δ^g^T]P = E[δθ δθ^T] from Proposition 3. Now it is clear that preconditioner 3 scales the stochastic gradient in a way comparable to the Newton method in deterministic optimization.

The other two preconditioners either over- or under-compensate the stochastic gradient, tending to cause divergence or to slow down convergence, as shown in our experimental results. To simplify the analysis, we consider the closed-form solutions of the first two preconditioners. For preconditioner 1, we have P = H_0^{−1} under assumptions A1, A2 and A3. Then with (16), we have

 PE[δ^g δ^g^T]P − E[δθ δθ^T] = H_0^{−1}E[εε^T]H_0^{−1} ⪰ 0,

which suggests that preconditioner 1 over-compensates the stochastic gradient, where ⪰ 0 means that the matrix on the left side is positive semidefinite. For preconditioner 2, using the closed-form solution in (30), we have

 PE[δ^g δ^g^T]P − E[δθ δθ^T] = −∑_{i=1}^{m} [σ_ε²/(λ_i² + σ_ε²/σ_θ²)] u_iu_i^T ⪯ 0,

which suggests that preconditioner 2 under-compensates the stochastic gradient, where ⪯ 0 means that the matrix on the left side is negative semidefinite.

## V Preconditioner Estimation Methods

It is possible to design different preconditioner estimation methods based on the criteria proposed in Section IV. To minimize the overhead of preconditioned SGD, we only consider the simplest preconditioner estimation methods: SGD algorithms for learning P that minimize the criteria in Section IV with mini-batch size 1. We call the SGD for θ the primary SGD, to distinguish it from the SGD for learning P. In each iteration of preconditioned SGD, we first evaluate the gradient twice to obtain ^g(θ) and ^g(θ + δθ). Then the pair (δθ, δ^g) is used to update the preconditioner estimate once. Lastly, (11) is used to update θ to complete one iteration of preconditioned SGD.

### V-A Dense Preconditioner

We focus on the algorithm design for criterion 3. Algorithms for the other two criteria are similar, and will be given without detailed derivation. In this subsection, we do not assume that the preconditioner has any sparse structure except for being symmetric, i.e., it is a dense matrix. Elements of the Hessian, and thus of the preconditioner, can have a large dynamic range. Additive learning rules like regular gradient descent may converge slowly when the preconditioner is poorly initialized. We find that multiplicative updates, e.g., relative (natural) gradient descent, can perform better due to their equivariant property [21, 20]. However, to use relative gradient descent, we need to find a Lie group representation for P. Positive definite matrices do not form a Lie group under matrix multiplication. Let us consider the Cholesky factorization

 P = Q^TQ, (40)

where Q is an upper triangular matrix with positive diagonal elements. It is straightforward to show that all upper triangular matrices with positive diagonal elements form a Lie group under matrix multiplication. Thus, in order to use relative gradient descent, we shall learn the matrix Q. One desirable by-product of the Cholesky factorization is that the resultant triangular systems can be efficiently solved by forward or backward substitution.
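The group properties used here are easy to check numerically: products and inverses of upper triangular matrices with positive diagonals stay in the set, and P = Q^TQ is symmetric positive definite. The sketch below uses random examples:

```python
import numpy as np

rng = np.random.default_rng(6)

def random_ut(n):
    # random upper triangular matrix with positive diagonal elements
    Q = np.triu(rng.standard_normal((n, n)))
    np.fill_diagonal(Q, np.abs(np.diag(Q)) + 0.1)
    return Q

Q1, Q2 = random_ut(4), random_ut(4)
prod = Q1 @ Q2               # closure: upper triangular with positive diagonal
inv1 = np.linalg.inv(Q1)     # inverses stay in the group as well
P = Q1.T @ Q1                # factorization (40): P is symmetric positive definite
```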

Following the third criterion, the instantaneous cost to be minimized is

 ^c_3(P) = δ^g^T P δ^g + δθ^T P^{−1} δθ, (41)

which is an approximation of (31) with mini-batch size 1. We consider a small perturbation of Q given by δQ = EQ, where E is an infinitesimal upper triangular matrix such that Q + δQ still belongs to the same Lie group. The relative gradient is defined by

 ∇E = ∂^c_3(Q + EQ)/∂E |_{E=0},

where, with a slight abuse of notation, ^c_3 is rewritten as a function of Q. Using (18), we can show that

 ∇E = 2 triu(Qδ^g δ^g^T Q^T − Q^{−T}δθ δθ^T Q^{−1}), (42)

where the operator triu(·) takes the upper triangular part of a matrix. Then Q can be updated as

 Q[new] = Q[old] − μ_Q ∇E Q[old], (43)

where μ_Q is a small enough step size such that the diagonal elements of Q[new] remain positive. To simplify the step size selection, the normalized step size

 μ_Q = μ_Q0 / max|∇E| (44)

can be used, where 0 < μ_Q0 ≤ 1, and max|∇E| denotes the maximum element-wise absolute value of ∇E. Another normalized step size,

 μ_Q = μ_Q0 / max|diag(∇E)|,

can be useful, and it also ensures that Q[new] belongs to the same Lie group when μ_Q0 < 1, where max|diag(∇E)| denotes the maximum absolute value of the diagonal elements of ∇E. Our experience suggests that the two step size normalization strategies are comparable in stochastic optimization. However, the one given in (44) seems preferable in deterministic optimization, as it leads to more stable convergence.

We summarize the complete preconditioned SGD as below.

One complete iteration of preconditioned SGD with preconditioner 3

Inputs are θ[old] and Q[old]; outputs are θ[new] and Q[new].

• Sample δθ, and calculate ĝ(θ[old]) and δĝ.

• Calculate Q δĝ, and Q^{-T} δθ via solving the triangular system Q^T x = δθ.

• Update the preconditioner by

    Q[new] = Q[old] − (μ_{Q,0} / max|∇_E|) ∇_E Q[old],

where 0 < μ_{Q,0} < 1, and ∇_E = 2 triu(Q δĝ δĝ^T Q^T − Q^{-T} δθ δθ^T Q^{-1}).

• Update θ by

    θ[new] = θ[old] − μ_{θ,0} (Q[new])^T Q[new] ĝ(θ[old]),

where μ_{θ,0} > 0 is the step size for the parameter update.

Here, elements of δθ can be sampled from a Gaussian distribution with a small variance. Then the only tunable parameters are the two normalized step sizes, μ_{Q,0} and μ_{θ,0}.
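The complete iteration above can be sketched in NumPy as follows. This is an illustrative toy implementation, not the paper's Matlab code: the step sizes, the perturbation scale `delta`, and the quadratic test cost are assumptions made for the demo.

```python
import numpy as np

def psgd_step(theta, Q, grad_fn, rng, mu_theta=0.05, mu_q=0.05, delta=1e-4):
    # One iteration of preconditioned SGD with preconditioner 3 (sketch).
    # theta: parameter vector; Q: upper triangular factor of P = Q^T Q.
    g = grad_fn(theta)                           # (stochastic) gradient at theta
    dtheta = delta * rng.normal(size=theta.size) # sample a small perturbation
    dg = grad_fn(theta + dtheta) - g             # resulting gradient perturbation
    a = Q @ dg
    b = np.linalg.solve(Q.T, dtheta)             # Q^{-T} dtheta via triangular solve
    grad_E = 2.0 * np.triu(np.outer(a, a) - np.outer(b, b))
    Q = Q - (mu_q / np.max(np.abs(grad_E))) * (grad_E @ Q)   # normalized step (44)
    theta = theta - mu_theta * (Q.T @ (Q @ g))   # preconditioned parameter update
    return theta, Q

# demo on a toy quadratic cost 0.5 * theta^T H theta with exact gradients
H = np.diag([1.0, 4.0, 9.0])
theta, Q = np.ones(3), np.eye(3)
rng = np.random.default_rng(1)
for _ in range(300):
    theta, Q = psgd_step(theta, Q, grad_fn=lambda th: H @ th, rng=rng)
assert 0.5 * theta @ H @ theta < 1e-3
```

On this quadratic, the learned P approaches a scaled inverse Hessian, so the preconditioned update converges quickly in all eigen-directions despite the spread of curvatures.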

Algorithm design for the other two preconditioners is similar, and we give their relative gradients for updating Q without derivation. For criterion 1, the relative gradient is

    ∇_E = 2 triu(Q^{-T} δθ e_1^T Q^{-1} + Q^{-T} e_1 δθ^T Q^{-1}),

where e_1 = δĝ − P^{-1} δθ is the fitting residual of criterion 1. For criterion 2, the relative gradient is

    ∇_E = 2 triu(Q δĝ e_2^T Q^T + Q e_2 δĝ^T Q^T),

where e_2 = P δĝ − δθ is the fitting residual of criterion 2. It is worth mentioning that by writing P = Q^T Q when using criteria 1 and 2, we are forcing the preconditioners to be positive definite, although the optimal solution minimizing criterion 1 or 2 is not necessarily positive definite. For criterion 3, by writing P = Q^T Q, we are selecting the positive definite solution among many possible ones.

### V-B Preconditioner with Sparse Structures

In practice, θ can have millions of free parameters to learn, and thus a dense P might have trillions of elements to estimate and store. Clearly, a dense representation of P is no longer feasible. For large-scale problems, we need to assume that P has a certain sparse structure so that it can be manipulated. One extreme example is to treat P as a diagonal matrix. However, such a simplification is too coarse, and generally does not lead to significant performance gains over the plain SGD. Application specific knowledge may play the key role in determining a proper form for P.

One example is to assume that Q has the structure

    Q = [ Q_11  Q_12
           0    Q_22 ],

where Q_11 is an upper triangular matrix, Q_12 is a dense matrix, Q_22 is a diagonal matrix, and all the diagonal elements of Q_11 and Q_22 are positive. It is straightforward to show that such limited-memory triangular matrices form a Lie group, and thus relative gradient descent applies here. The Cholesky factor of an ill-conditioned matrix can be well approximated in this form. However, the dimension of Q_11 should be properly selected to achieve a good trade-off between representation complexity and accuracy.
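The closure of this limited-memory family under multiplication can be checked directly; the NumPy sketch below (illustrative, with arbitrary block sizes) verifies that the product of two such matrices keeps the same sparsity pattern and a positive diagonal:

```python
import numpy as np

def lm_upper(rng, r, n):
    # limited-memory form: Q = [[Q11, Q12], [0, Q22]], with Q11 (r x r)
    # upper triangular, Q12 dense, Q22 diagonal, all diagonal elements positive
    Q = np.zeros((n, n))
    Q[:r, :r] = np.triu(rng.normal(size=(r, r)))
    Q[:r, r:] = rng.normal(size=(r, n - r))
    Q[np.diag_indices(n)] = np.exp(rng.normal(size=n))  # positive diagonal
    return Q

rng = np.random.default_rng(2)
r, n = 2, 6
A, B = lm_upper(rng, r, n), lm_upper(rng, r, n)
C = A @ B

# closure: the product keeps the same structure and a positive diagonal
assert np.allclose(C, np.triu(C))                       # upper triangular
assert np.allclose(C[r:, r:], np.diag(np.diag(C)[r:]))  # diagonal lower block
assert np.all(np.diag(C) > 0)
```

The block structure makes the check transparent: the product's lower-right block is the product of two diagonal matrices, and its lower-left block is identically zero.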

In many situations, the parameters to be optimized naturally form a multi-dimensional array, reflecting certain built-in structures of the training data and the parameter space. Hence we may approximate the preconditioner as the Kronecker product of a series of small matrices. For example, the preconditioner for an m × n × k parameter array may be approximated as the Kronecker product of three small matrices with sizes m × m, n × n, and k × k. Such a bold simplification often works well in practice. Interestingly, similar ideas have been exploited in [19, 18], and achieved certain successes.

To explain the above idea clearly, let us consider a concrete example where the parameters to be optimized naturally form a two dimensional array, i.e., a matrix, denoted by Θ. Its stochastic gradient, Ĝ(Θ), is a matrix with the same dimensions. The preconditioned SGD for updating Θ is

    vec(Θ[new]) = vec(Θ[old]) − μ P vec[Ĝ(Θ[old])]. (45)

To simplify the preconditioner estimation, we may assume that P = P_2 ⊗ P_1, and thus can rewrite (45) as

    Θ[new] = Θ[old] − μ P_1 Ĝ(Θ[old]) P_2 (46)

using (21), where P_1 and P_2 are two positive definite matrices with conformable dimensions. Similarly, by adopting the Cholesky factorizations

    P_1 = Q_1^T Q_1,  P_2 = Q_2^T Q_2,

we can use relative gradients to update Q_1 and Q_2 as Q_1[new] = Q_1[old] − μ_1 ∇_{E_1} Q_1[old] and Q_2[new] = Q_2[old] − μ_2 ∇_{E_2} Q_2[old]. For criterion 3, the two relative gradients are given by

    A = Q_1 δĜ Q_2^T,
    B = Q_2^{-T} δΘ^T Q_1^{-1},
    ∇_{E_1} = 2 triu(A A^T − B^T B),
    ∇_{E_2} = 2 triu(A^T A − B B^T),

where B can be calculated by solving triangular systems to avoid explicit matrix inversion. The relative gradients for preconditioner updating with criteria 1 and 2 have similar forms, and are not listed here.
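The equivalence of (45) and (46) under the assumption P = P_2 ⊗ P_1 follows from the identity vec(A X B) = (B^T ⊗ A) vec(X) with symmetric P_2. A quick NumPy check (dimensions and data are arbitrary; `order="F"` gives the column-major vec used in this identity):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
X1, X2 = rng.normal(size=(m, m)), rng.normal(size=(n, n))
P1 = X1 @ X1.T + m * np.eye(m)   # positive definite factors of P = P2 kron P1
P2 = X2 @ X2.T + n * np.eye(n)
G = rng.normal(size=(m, n))

# (45): apply P = P2 kron P1 to vec(G), using column-major vec
lhs = (np.kron(P2, P1) @ G.reshape(-1, order="F")).reshape((m, n), order="F")
# (46): equivalent matrix form that never builds the mn x mn Kronecker product
rhs = P1 @ G @ P2
assert np.allclose(lhs, rhs)
```

The matrix form on the right costs O(m²n + mn²) operations instead of O(m²n²), which is the whole point of the Kronecker-product approximation.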

## Vi Experimental Results

Five sets of experimental results are reported in this section. For the preconditioner estimation, we always use mini-batch size 1, initialize Q to an identity matrix, use the normalized step size in (44), and sample δθ from a Gaussian distribution N(0, eps), where eps is the machine accuracy in double precision. For the primary SGD, a fixed mini-batch size is used in all examples except for the blind equalization one. The step size for the preconditioned SGD is selected from a small range, and in many cases, this range works for the plain SGD as well. Supplementary materials, including a Matlab code package reproducing all the experimental results here, are available at https://sites.google.com/site/lixilinx/home/psgd.

### Vi-a Criteria and Preconditioners Comparisons

This experiment compares the three preconditioner estimators developed in Section V. The true Hessian is a symmetric random constant matrix whose elements are drawn from a normal distribution. We vary three factors for the performance study: positive definite Hessian versus indefinite Hessian; noise free gradient versus heavily noisy gradient; and large scale Hessian versus tiny scale Hessian. In total, this gives eight different testing scenarios. Samples of δθ and δĝ are generated by model (16), and the task is to estimate a preconditioner with the three desirable properties listed in Section III.B using SGD.

### Vi-B Blind Equalization (Deconvolution)

Let us consider a blind equalization (deconvolution) problem using the constant modulus criterion [2]. Blind equalization is an important communication signal processing problem, and SGD is a default method for it. This problem is weakly non-convex in the sense that, for an equalizer with enough taps, the constant modulus criterion has no local minimum [22]. In our settings, the source signal is uniformly distributed, the communication channel is simulated by a linear filter, and the equalizer, an adaptive finite impulse response (FIR) filter, is initialized by setting its center tap to 1 and all other taps to zero. Step sizes of the compared algorithms are manually adjusted such that their steady state intersymbol interference (ISI) performance indices are comparable, where the ISI is measured on the combined channel–equalizer response.

Fig. 3 summarizes the results. A preconditioner constructed in the same way as in the proof of Proposition 2, using an adaptively estimated Hessian matrix, achieves the fastest convergence; however, such a solution may be too complicated in practice. Preconditioner 3 performs reasonably well considering its simplicity. Preconditioner 2 completely fails to improve the convergence with a small step size, while a larger step size leads to divergence. Preconditioner 1 does accelerate the convergence due to the weakly non-convex nature of the cost function. The plain SGD converges the slowest, and its steady state ISI is bumpy and higher than that of the preconditioned methods.

### Vi-C A Toy Binary Classification Problem

In this problem, the input features x_1 and x_2 are independently and uniformly distributed, and the class label is

    y = mod(round(10 x_1 − 10 x_2), 2),

where round(·) rounds a real number to its nearest integer, and mod(·, 2) denotes the modulus after division by 2. Fig. 4 shows the resulting zebra stripe like pattern to be learned. It is a challenging classification problem due to its sharp, interleaved, and curved class boundaries. A two layer feedforward neural network is trained as the classifier. As a common practice, the input features are normalized before being fed to the network, and the network coefficients of a certain layer are initialized as random numbers with variance proportional to the inverse of the number of nodes connected to this layer. Cross entropy loss is the training cost. A dense preconditioner is estimated and used in preconditioned SGD.

Fig. 5 presents one set of typical learning curves with eight different settings. Using the plain SGD as a baseline, we find that preconditioner 2 fails to speed up the convergence, while both preconditioner 1 and preconditioner 3 significantly accelerate it, with preconditioner 3 performing far better than preconditioner 1. Note that there is no trivial remedy to improve the convergence of SGD: a larger step size makes SGD diverge. We have tried RMSProp, and found that it does accelerate the convergence during the initial iterations, but it eventually converges to solutions inferior to those of SGD.

### Vi-D Recurrent Neural Network Training

Let us consider a more challenging problem: learning extremely long term dependencies with a recurrent neural network. The addition problem initially proposed in [23] is considered. The input is a sequence with two rows. The first row contains random numbers independently and uniformly distributed. Elements in the second row are zeros, except that two of them are marked with 1. The desired output of the whole sequence is the sum of the two random numbers from the first row that are marked with 1 in the second row. More details can be found in [23] and our supplementary materials.
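A generator for one training sequence of the addition problem can be sketched as below (NumPy; the uniform range [0, 1] for the first row is an assumption for illustration, as is the sequence length):

```python
import numpy as np

def addition_example(rng, T):
    # one sequence of the addition problem: row 0 holds T random numbers,
    # row 1 is an indicator marking two positions with 1;
    # the target is the sum of the two marked numbers
    x = np.zeros((2, T))
    x[0] = rng.uniform(0.0, 1.0, size=T)
    i, j = rng.choice(T, size=2, replace=False)
    x[1, i] = x[1, j] = 1.0
    target = x[0, i] + x[0, j]
    return x, target

rng = np.random.default_rng(5)
x, target = addition_example(rng, 30)
assert x.shape == (2, 30) and x[1].sum() == 2.0
assert abs(target - x[0][x[1] == 1.0].sum()) < 1e-12
```

Since the two marked positions can be far apart, the network must carry information across many time steps, which is exactly the long term dependency being tested.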

In our settings, a standard (vanilla) recurrent neural network is trained to predict the output by minimizing the mean squared error. The feedforward coefficients are initialized as Gaussian random numbers with mean zero and small variance, and the feedback matrix is initialized as a random orthogonal matrix to encourage long term memories. Backpropagation through time [4] is used for gradient calculation. A dense preconditioner would be expensive for this network, and thus a sparse one is used. Coefficients in the first layer naturally form a matrix, and coefficients in the second layer form a vector. Thus we approximate the preconditioner for the gradient of the parameters in the first layer as the Kronecker product of two small matrices, and the preconditioner for the gradient of the parameters in the second layer is a small dense matrix. The preconditioner for the gradient of the whole parameter vector is the direct sum of the preconditioner of the first layer and that of the second layer.

Fig. 6 summarizes the results of a typical run. Only the preconditioned SGD using preconditioner 3 converges. A recurrent neural network can be regarded as a feedforward one with extremely large depth by unfolding its connections in time. It is known that the issues of vanishing and exploding gradients arise in deep neural networks, and SGD can hardly handle them [24]. Preconditioned SGD seems to perform quite well on such challenging problems. More testing results on the eight pathological recurrent neural network training problems proposed in [23] are reported in the supplementary materials and [25]; they suggest that preconditioned SGD performs no worse than Hessian-free optimization, although our method has a significantly lower complexity and involves less parameter tweaking.

### Vi-E MNIST Handwritten Recognition

In the last experiment, we consider the well known MNIST handwritten digit recognition problem [26]. The training data are images of handwritten digits