On the Performance of Preconditioned Stochastic Gradient Descent

by Xi-Lin Li, et al. (March 26, 2018)

This paper studies the performance of preconditioned stochastic gradient descent (PSGD), which can be regarded as an enhanced stochastic Newton method with the ability to handle gradient noise and non-convexity at the same time. We have improved the implementation of PSGD, revealed its relationships to equilibrated stochastic gradient descent (ESGD) and batch normalization, and provided a software package (https://github.com/lixilinx/psgd_tf) implemented in Tensorflow to compare variations of PSGD and stochastic gradient descent (SGD) on a wide range of benchmark problems with commonly used neural network models, e.g., convolutional and recurrent neural networks. Comparison results clearly demonstrate the advantages of PSGD in terms of convergence speed and generalization performance.


I Introduction

Stochastic gradient descent (SGD) and its variations, e.g., SGD with either classic or Nesterov momentum, RMSProp, Adam, adaptive learning rates, etc., are popular in diverse stochastic optimization problems, e.g., machine learning and adaptive signal processing [1, 2, 3, 4, 5, 6]. These first order methods are simple and numerically stable, but often suffer from slow convergence and inefficiency in optimizing non-convex models. Off-the-shelf second order methods from convex optimization, e.g., the quasi-Newton method, the conjugate gradient method, and the truncated Newton method, i.e., Hessian-free optimization, are attracting more attention [7, 8, 9, 10], and find many successful applications in stochastic optimization. However, most second order methods require large mini-batch sizes, and have high complexities for large-scale problems. At the same time, the search for new optimization theories and learning rules remains active, and methods like natural gradient descent, relative gradient descent, equilibrated SGD (ESGD), and feature normalization [11, 12, 13, 14, 15] provide great insight into the properties of parameter spaces and cost function surfaces in stochastic optimization.

This paper proposes a family of online second order stochastic optimization methods based on the theory of preconditioned SGD (PSGD) [16]. Unlike most second order methods, PSGD explicitly considers the gradient noise in stochastic optimization, and works well with non-convex problems. It adaptively estimates a preconditioner from noisy Hessian vector products with natural or relative gradient descent, and preconditions the stochastic gradient to accelerate convergence. We closely study the performance of five forms of preconditioners, i.e., dense, diagonal, sparse LU decomposition, Kronecker product, and scaling-and-normalization preconditioners. ESGD and feature normalization [14, 15] are shown to be closely related to PSGD with specific forms of preconditioners. We consider two different ways to evaluate the Hessian vector product, an important measurement that helps PSGD adaptively extract the curvature information of cost surfaces. We also recommend three methods to regularize the preconditioner estimation problem when the Hessian becomes ill-conditioned, or when numerical errors dominate the Hessian vector product evaluations due to floating point arithmetic. We further provide a software package implemented in Tensorflow (https://www.tensorflow.org/) for comparing variations of SGD and PSGD with different preconditioners on a wide range of benchmark problems, which include synthetic and real world data, and involve the most commonly used neural network architectures, e.g., recurrent and convolutional networks. Experimental results suggest that Kronecker product preconditioners, including the scaling-and-normalization one, are particularly suitable for training neural networks, since affine maps are extensively used there for feature transformations.

II Background

II-A Notations

Let us consider the minimization of the cost function

    f(θ) = E_z[ ℓ(θ, z) ]    (1)

where E_z takes expectation over random variable z, ℓ is a loss function, and θ is the model parameter vector to be optimized. For example, in a classification problem, ℓ could be the cross entropy loss, z is a pair of input feature vector and class label, vector θ consists of all the trainable parameters in the considered classification model, and E_z takes average over all samples from the training data set. By assuming a second order differentiable model and loss, we could approximate ℓ(θ, z) as a quadratic function of θ within a trust region around θ, i.e.,

    ℓ(θ, z) = c_z + a_z^T θ + 0.5 θ^T B_z θ    (2)

where c_z is the sum of approximation errors and constant terms independent of θ, B_z is a symmetric matrix, and the subscript z in c_z, a_z and B_z reminds us that these three terms depend on z. Now, we may rewrite (1) as

    f(θ) = c + a^T θ + 0.5 θ^T B θ    (3)

where c = E_z[c_z], a = E_z[a_z], and B = E_z[B_z]. We do not impose any assumption, e.g., positive definiteness, on B except for being symmetric. Thus the quadratic surface in the trust region could be non-convex. To simplify our notations, we no longer consider the higher order approximation errors included in c_z, and simply assume that ℓ(θ, z) is a quadratic function of θ in the considered trust region.

II-B A Brief Review of PSGD

PSGD uses the preconditioned stochastic gradient to update θ as

    θ ← θ − μ P ĝ    (4)

where μ is a normalized step size, ĝ is an estimate of the gradient ∂f/∂θ obtained by replacing the expectation with a sample average, and P is a positive definite preconditioner adaptively updated along with θ. Within the considered trust region, let us write the stochastic gradient, ĝ, explicitly as

    ĝ = â + B̂ θ    (5)

where â and B̂ are estimates of a and B, respectively. Let δθ be a random perturbation of θ, and be small enough such that θ + δθ still resides in the same trust region. Then, (5) suggests the following resultant perturbation of the stochastic gradient,

    δĝ = B̂ δθ = B δθ + ε    (6)

where ε accounts for the error due to replacing B with B̂. Note that by definition, ε is a random vector dependent on δθ. PSGD pursues the preconditioner P via minimizing the criterion

    c(P) = E_δθ[ δĝ^T P δĝ + δθ^T P^{-1} δθ ]    (7)

where E_δθ takes expectation over δθ. We typically introduce the factorization P = Q^T Q, and update Q instead of P directly, as Q can be efficiently learned with natural or relative gradient descent on the Lie group of nonsingular matrices. Detailed learning rules for Q with different forms can be found in Appendix A. For our preconditioner estimation problem, relative and natural gradients are equivalent, and further details on these two gradients can be found in [11, 12].
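To make criterion (7) concrete, the following NumPy sketch estimates it from sampled (δθ, δĝ) pairs by a simple Monte Carlo average; the function name and interface are illustrative assumptions, not part of the psgd_tf package.

```python
import numpy as np

def psgd_criterion(P, delta_thetas, delta_gs):
    # Monte Carlo estimate of criterion (7):
    # E[dg^T P dg + dtheta^T P^{-1} dtheta], averaged over sampled pairs.
    P_inv = np.linalg.inv(P)
    vals = [dg @ P @ dg + dt @ P_inv @ dt
            for dt, dg in zip(delta_thetas, delta_gs)]
    return float(np.mean(vals))
```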

Under mild conditions, criterion (7) determines a unique positive definite P [16]. The resultant preconditioner is perfect in the sense that it preconditions the stochastic gradient such that

    E_δθ[ P δĝ δĝ^T P ] = E_δθ[ δθ δθ^T ]    (8)

which is comparable to the relationship

    B^{-1} δg δg^T B^{-1} = δθ δθ^T    (9)

where δg = B δθ is the perturbation of the noiseless gradient, and we assume that B is invertible such that δg = B δθ can be rewritten as (9). Thus, PSGD can be viewed as an enhanced Newton method that can handle gradient noise and non-convexity at the same time.

Note that in the presence of gradient noise, the optimal P and P^{-1} given by (8) are not unbiased estimates of B^{-1} and B, respectively. Actually, even if B is positive definite and available, B^{-1} may not always be a good preconditioner since it could significantly amplify the gradient noise along the directions of the eigenvectors of B associated with small eigenvalues, and might lead to divergence. More specifically, [16] shows that

    P ⪯ B^{-1}    (10)

where P ⪯ B^{-1} means that B^{-1} − P is nonnegative definite.

III Implementations of PSGD

III-A Hessian Vector Product Calculation

III-A1 Approximate Solution

The original PSGD method relies on (6) to calculate the Hessian vector product, B̂ δθ, as the difference ĝ(θ + δθ) − ĝ(θ). This numerical differentiation method is simple, and only involves gradient calculations. However, it requires δθ to be small enough such that θ and θ + δθ reside in the same trust region. In practice, numerical error might be an issue when handling small numbers with floating point arithmetic. This concern becomes more grave with the emergence of half precision math in neural network training. Still, the approximate solution has empirically proved to work well, especially with double precision floating point arithmetic, and does not involve any second order derivative calculation.
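A minimal NumPy sketch of this finite-difference evaluation is shown below; grad_fn is a placeholder for any routine returning the stochastic gradient on the current mini-batch, and the perturbation scale tied to single precision machine epsilon is an illustrative choice rather than a value prescribed by the paper.

```python
import numpy as np

def approx_hvp(grad_fn, theta, eps=np.sqrt(np.finfo(np.float32).eps)):
    # Approximate Hessian vector product by numerical differentiation:
    # delta_g = g(theta + delta_theta) - g(theta) ~ B(theta) @ delta_theta.
    delta_theta = eps * np.random.randn(*theta.shape)
    delta_g = grad_fn(theta + delta_theta) - grad_fn(theta)
    return delta_theta, delta_g
```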

III-A2 Exact Solution

An alternative way to calculate the Hessian vector product is via

    B̂ δθ = ∂(δθ^T ĝ)/∂θ    (11)

Here, we no longer require δθ to be small. The above trick has been known for a long time [17]. However, hand coded second order derivatives are error prone even for moderately complicated models. Nowadays, this choice becomes attractive due to the wide availability of automatic differentiation software, e.g., Tensorflow, Pytorch (http://pytorch.org/), etc. Note that second and higher order derivatives may not be fully supported in every package, e.g., the latest Tensorflow, version 1.6, does not support second order derivatives for its while loop. The exact solution is typically computationally more expensive than the approximate one, although both may have the same order of computational complexity.
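Since the idea is framework agnostic, here is a hedged PyTorch sketch of (11) using double backpropagation; the names and the toy loss are illustrative assumptions and are not taken from the psgd_tf package.

```python
import torch

def exact_hvp(loss_fn, theta):
    # Exact Hessian vector product via (11): B*dtheta = d(dtheta^T g)/dtheta,
    # with dtheta treated as a constant during the second differentiation.
    delta_theta = torch.randn_like(theta)
    g = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)[0]
    delta_g = torch.autograd.grad(torch.sum(g * delta_theta), theta)[0]
    return delta_theta, delta_g

# Usage sketch: theta must be a leaf tensor with requires_grad=True.
theta = torch.randn(5, requires_grad=True)
loss_fn = lambda t: 0.5 * torch.sum(t ** 4)   # a toy twice-differentiable loss
dtheta, dg = exact_hvp(loss_fn, theta)
```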

III-B Different Forms of Preconditioner

III-B1 Dense Preconditioner

We call P a dense preconditioner if it does not have any sparse structure except for being symmetric. A dense preconditioner is practical only for small-scale problems with up to thousands of trainable parameters, since it requires O(L²) parameters to represent, where L is the length of θ. We are mainly interested in sparse, or limited-memory, preconditioners whose representations only require O(L) or fewer parameters.

III-B2 Diagonal Preconditioner

The diagonal preconditioner is probably one of the simplest. From (7), we readily find the optimal solution as

    p = sqrt( E_δθ[δθ ⊙ δθ] ⊘ E_δθ[δĝ ⊙ δĝ] )    (12)

where P = diag(p), and ⊙ and ⊘ denote element wise multiplication and division, respectively. For δθ drawn from a standard multivariate normal distribution, E_δθ[δθ ⊙ δθ] reduces to a vector with unit entries, and (12) gives the equilibration preconditioner in the equilibrated SGD (ESGD) proposed in [13]. Thus, ESGD is PSGD with a diagonal preconditioner. The Jacobi preconditioner is not optimal by criterion (7), and indeed is observed to show inferior performance in [13].
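The closed-form solution (12) translates directly into a few lines of NumPy; the small damping constant below is an added safeguard against division by zero and is not part of the formula.

```python
import numpy as np

def diagonal_preconditioner(delta_thetas, delta_gs, damping=1e-30):
    # Closed-form diagonal preconditioner from (12):
    # p = sqrt(E[dtheta*dtheta] / E[dg*dg]), element wise.
    num = np.mean(np.square(delta_thetas), axis=0)
    den = np.mean(np.square(delta_gs), axis=0) + damping
    return np.sqrt(num / den)
```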

III-B3 Sparse LU Decomposition Preconditioner

A sparse LU (SPLU) decomposition preconditioner is given by P = Q^T Q with Q = LU, where L and U are lower and upper triangular matrices, respectively. To make the SPLU preconditioner applicable to large-scale problems, except for the diagonals, only the first r columns of L and the first r rows of U can have nonzero entries, where r is the order of the SPLU preconditioner.

III-B4 Kronecker Product Preconditioner

A Kronecker product preconditioner is given by P = Q^T Q with Q = Q₂ ⊗ Q₁, where ⊗ denotes Kronecker product. Kronecker product preconditioners have been previously exploited in [16, 18, 19]. We find that they are particularly suitable for preconditioning gradients of tensor parameters. For example, for a tensor Θ with shape (K, J, I), we flatten Θ into a column vector θ of length KJI, and use the Kronecker product preconditioner

    P = P₃ ⊗ P₂ ⊗ P₁    (13)

to obtain the preconditioned gradient P ĝ, where P₁, P₂, and P₃ are three positive definite matrices with shapes (I, I), (J, J), and (K, K), respectively. Such a preconditioner only requires about I² + J² + K² parameters for its representation, while a dense one requires about (KJI)² parameters. For neural network learning, Θ is likely to be a matrix parameter, and thus a preconditioner of the form P = P₂ ⊗ P₁ should be used.
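For a matrix parameter, the preconditioned gradient can be formed without ever building the Kronecker product, using the identity (P₂ ⊗ P₁) vec(G) = vec(P₁ G P₂^T) for a column-major vec; the sketch below assumes symmetric P₁ and P₂ and is only an illustration.

```python
import numpy as np

def kron_precondition(P1, P2, G):
    # Apply P = P2 kron P1 to the flattened gradient of an m x n matrix
    # parameter without forming the (mn x mn) Kronecker product:
    # (P2 kron P1) vec(G) = vec(P1 @ G @ P2.T); P2 is symmetric, so P2.T == P2.
    return P1 @ G @ P2
```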

III-B5 SCaling-And-Normalization (SCAN) Preconditioner

The SCAN preconditioner is a specific Kronecker product preconditioner specially designed for neural network training, where Q = Q₂ ⊗ Q₁, Q₁ is a diagonal matrix, and only entries on the diagonal and in the last column of Q₂ can have nonzero values. As explained in Section IV-B, PSGD with a SCAN preconditioner is equivalent to SGD with normalized input features and scaled output features. It is not difficult to verify that, for such sparse Q₁ and Q₂ with positive diagonal entries, matrices with decomposition Q = Q₂ ⊗ Q₁ form a Lie group. Hence, natural or relative gradient descent applies to SCAN preconditioner estimation as well.

It is not possible to enumerate all feasible forms of preconditioners. Except for a few cases, we cannot find a closed form solution for the optimal P with a desired form even when given enough pairs of (δθ, δĝ). Hence, it is important to design preconditioners with proper forms such that efficient learning, e.g., natural or relative gradient descent, is available. Table I summarizes the ones we have discussed, and their degrees of freedom. These simple preconditioners can be used as building blocks for forming larger preconditioners via direct sum and/or Kronecker product operations. All preconditioners obtained in this way can be efficiently learned with natural or relative gradient descent, and the technical details can be found in Appendix A.

Preconditioner        Number of parameters
Dense                 mn(mn + 1)/2
SPLU with order r     about 2(r + 1)mn
Kronecker product     m(m + 1)/2 + n(n + 1)/2
Diagonal              mn
SCAN                  m + 2n − 1
TABLE I: Number of parameters in Q for preconditioning the gradient of a matrix parameter with shape (m, n)

III-C Regularization of PSGD

In practice, a second order method might suffer from vanishing second order derivatives and an ill-conditioned Hessian. The vanishing gradient problem is well known in training deep neural networks. Similarly, under certain numerical conditions, the Hessian could be too small to be accurately calculated with floating point arithmetic even if the model and loss are differentiable everywhere. Non-differentiable models and losses could also lead to vanishing or ill-conditioned Hessians. For example, the second order derivative of the commonly used rectified linear unit (ReLU) is zero almost everywhere, and undefined at the origin. We observe that a vanishing or ill-conditioned Hessian could cause numerical difficulties for PSGD, although PSGD does not use the Hessian directly. Here, we propose three remedies to regularize the estimation of the preconditioner in PSGD.

III-C1 Traditional Damping

This method damps the Hessian by adding a diagonal loading term, λI, to it, where λ is a small positive number and I is the identity matrix. In practice, all we need to do is to replace δĝ with δĝ + λ δθ when estimating the preconditioner.

III-C2 Non-Convexity Compatible Damping

The traditional damping works well only for convex problems. For non-convex optimization, it could make things worse if the Hessian has eigenvalues close to −λ. In the proposed non-convexity compatible damping, all we need to do is to replace δĝ with δĝ + λ δθ′ when estimating the preconditioner, where δθ′ is a random vector having the same probability density distribution as that of δθ, and independent of δθ.

III-C3 Preconditioned Gradient Clipping

Gradient clipping is a commonly used trick to stabilize the training of recurrent neural networks. We find it useful to stabilize PSGD as well. PSGD with preconditioned gradient clipping is given by

    θ ← θ − μ min(1, Ω/‖P ĝ‖) P ĝ    (14)

where ‖·‖ takes the norm of a vector, and Ω is a threshold.

Note that the gradient noise already plays a role similar to the non-convexity compatible damping. Thus, we may not need the traditional or the non-convexity compatible damping when the batch size is not too large. Preconditioned gradient clipping is a simple and useful trick for solving problems with badly conditioned Hessians.
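The following NumPy sketch illustrates the three remedies under stated assumptions; the damping constant, the clipping threshold, and the way the independent perturbation δθ′ is drawn are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def damp_pair(delta_theta, delta_g, lam=1e-6, nonconvex=True):
    # Damp the (delta_theta, delta_g) pair before estimating the preconditioner.
    # Traditional damping replaces delta_g with delta_g + lam * delta_theta;
    # the non-convexity compatible variant instead adds lam times an independent
    # perturbation drawn from the same distribution as delta_theta.
    if nonconvex:
        delta_theta_prime = np.random.randn(*delta_theta.shape) * np.std(delta_theta)
        return delta_theta, delta_g + lam * delta_theta_prime
    return delta_theta, delta_g + lam * delta_theta

def clipped_psgd_step(theta, precond_grad, mu=0.01, clip_threshold=1.0):
    # Preconditioned gradient clipping as in (14): rescale P*g when its norm
    # exceeds the threshold, then take the usual PSGD step.
    scale = min(1.0, clip_threshold / (np.linalg.norm(precond_grad) + 1e-30))
    return theta - mu * scale * precond_grad
```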

IV Applications to Neural Network Learning

IV-A Affine Transformations in Neural Networks

Element wise nonlinearities and the affine transformation

    y = W x    (15)

are the two main building blocks of most neural networks, where W is a matrix parameter, and both x and y are feature vectors optionally augmented with 1. Since most neural networks use parameterless nonlinearities, all the trainable parameters are just a list of affine transformation matrices. By assigning a Kronecker product preconditioner to each affine transformation matrix, we are using the direct sum of a list of Kronecker product preconditioners as the preconditioner for the whole model parameter vector. Our experience suggests that this approach provides a good trade-off between computational complexity and performance gain.

It is not difficult to spot the affine transformations in most commonly used neural networks, e.g., the feed forward neural network, vanilla recurrent neural network (RNN), gated recurrent unit (GRU) [21], long short-term memory (LSTM) [22], convolutional neural network (CNN) [2], etc. For example, in a two dimensional CNN, the input features of one patch may form a three dimensional tensor with shape (H, W, C_in), and the filter coefficients could form a four dimensional tensor with shape (H, W, C_in, C_out), where H is the height of the image patch, W is the width of the image patch, C_in is the number of input channels, and C_out is the number of output channels. To rewrite the convolution as an affine transformation, we just need to reshape the filter tensor into a matrix with size (C_out, H·W·C_in), and flatten the input image patch into a column vector with length H·W·C_in.
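A minimal NumPy sketch of this reshaping for a single output location is given below; the tensor layout (H, W, C_in, C_out) matches the description above but is otherwise an assumption.

```python
import numpy as np

def conv_patch_as_affine(filters, patch):
    # filters: (H, W, C_in, C_out) kernel tensor; patch: (H, W, C_in) input patch.
    # The C_out outputs at one location equal an affine map applied to the
    # flattened patch, with the kernel reshaped into a (C_out, H*W*C_in) matrix.
    H, W, C_in, C_out = filters.shape
    weight_matrix = filters.reshape(H * W * C_in, C_out).T
    x = patch.reshape(H * W * C_in)
    return weight_matrix @ x
```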

IV-B On the Role of Kronecker Product Preconditioner

By using the Kronecker product preconditioner P = P₂ ⊗ P₁, the learning rule for W can be written as

    W ← W − μ P₁ (∂f̂/∂W) P₂    (16)

where P₁ and P₂ are two positive definite matrices with proper dimensions. With factorizations P₁ = Q₁^T Q₁ and P₂ = Q₂^T Q₂, we can rewrite (16) as

    W ← W − μ Q₁^T Q₁ (∂f̂/∂W) Q₂^T Q₂    (17)

Let us introduce the matrix W′ = Q₁^{-T} W Q₂^{-1}, and noticing that

    ∂f̂/∂W′ = Q₁ (∂f̂/∂W) Q₂^T    (18)

we can rewrite (17) simply as

    W′ ← W′ − μ ∂f̂/∂W′    (19)

Correspondingly, the affine transformation in (15) is rewritten as y′ = W′ x′, where x′ = Q₂ x and y′ = Q₁^{-T} y. Hence, the PSGD in (16) is equivalent to the SGD in (19) with transformed feature vectors Q₂ x and Q₁^{-T} y.
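The equivalence between (16) and (19) is easy to check numerically; the sketch below uses random triangular factors and a random gradient for a small matrix parameter, purely as an illustration.

```python
import numpy as np

# Numerical check of the equivalence between (16) and (19).
rng = np.random.default_rng(0)
m, n = 3, 4
Q1 = np.triu(rng.standard_normal((m, m))) + 2 * np.eye(m)
Q2 = np.triu(rng.standard_normal((n, n))) + 2 * np.eye(n)
W = rng.standard_normal((m, n)); G = rng.standard_normal((m, n)); mu = 0.1

# PSGD step (16) on W, then mapped to W' = Q1^{-T} W Q2^{-1}
W_new = W - mu * (Q1.T @ Q1) @ G @ (Q2.T @ Q2)
lhs = np.linalg.inv(Q1.T) @ W_new @ np.linalg.inv(Q2)
# SGD step (19) on W' with the gradient (18): dW' = Q1 G Q2^T
rhs = np.linalg.inv(Q1.T) @ W @ np.linalg.inv(Q2) - mu * Q1 @ G @ Q2.T
print(np.allclose(lhs, rhs))   # True
```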

We know that feature whitening and normalization could accelerate convergence, and batch normalization and self-normalizing neural networks are such examples [14, 15]. Actually, feature normalization can be viewed as PSGD with a specific SCAN preconditioner with the constraint Q₁ = I and a properly chosen Q₂. This fact is best explained by considering the following example with two input features,

    Q₂ x = [ 1/σ, −m/σ ; 0, 1 ] [ x₁ ; 1 ] = [ (x₁ − m)/σ ; 1 ]    (20)

where m and σ are the mean and standard deviation of x₁, respectively. However, we should be aware that explicit input feature normalization is only empirically shown to accelerate convergence, and has little meaning in certain scenarios, e.g., RNN learning where features may not have any stationary distribution. Furthermore, feature normalization cannot normalize the input and output features simultaneously. PSGD provides a more general and principled approach to finding the optimal preconditioner, and applies to a much broader range of applications. A SCAN preconditioner does not necessarily “normalize” the input features in the sense of mean removal and variance normalization.

V Experimental Results

V-A Tensorflow Implementation

Tensorflow is one of the most popular machine learning frameworks with automatic differentiation support. We have defined a set of benchmark problems with both synthetic and real world data, and implemented SGD, RMSProp, ESGD, and PSGD with five forms of preconditioners. One trick worth pointing out is that in our implementations, we use the preconditioner from the last iteration to precondition the current gradient. In this way, preconditioning the gradient and updating the preconditioner can be processed in parallel. The original method in [16] updates the preconditioner and the model parameters sequentially. The sequential update may marginally speed up convergence, but one iteration could take longer wall time.

To make our comparison results easy to analyze, we try to keep the settings simple and straightforward, and do not consider commonly used neural network training tricks like momentum, time varying step sizes, dropout, batch normalization, etc. Moreover, tricks like dropout and batch normalization cannot be directly applied to RNN training. Preconditioners of PSGD are always initialized with the identity matrix, and updated with a constant normalized step size and a fixed mini-batch size. The order of the SPLU preconditioner is kept the same across experiments. We independently sample the entries of δθ from zero mean normal distributions; when (6) is used to approximate the Hessian vector product, the variance is tied to the single precision machine epsilon so that θ + δθ stays within the trust region, while no such small-variance constraint is needed when (11) is used. We only report the results of PSGD with the exact Hessian vector product here since the versions with the approximate one typically give similar results. The training loss is smoothed to keep our plots legible. Any further experimental details and results not reported here, e.g., step sizes, preconditioned gradient clipping thresholds, mini-batch sizes, neural network initial guesses, training and testing sample sizes, training loss smoothing factor, etc., can be found in our package at https://github.com/lixilinx/psgd_tf.

V-B Selected Experimental Results

V-B1 Experiment 1

We consider the addition problem first proposed in [22]. A vanilla RNN is trained to predict the mean of two marked real numbers randomly located in a long sequence. Further details on the addition problem can be found in [22] and our implementations. Here, we deliberately use a small mini-batch size. Fig. 1 summarizes the results. PSGD with any preconditioner outperforms the first order methods. It is clear that PSGD is able to damp the gradient noise and accelerate convergence simultaneously.


Fig. 1: Convergence curves on the addition problem with a standard RNN and a small mini-batch size. Both the training and testing losses are mean square errors (MSE).

V-B2 Experiment 2

Experiment 1 shows that SGD-like methods have great difficulties in optimizing certain models. Hence, many thoughtfully designed neural network architectures have been proposed to facilitate learning, e.g., LSTM [22], GRU [21], residual networks [20], etc. As revealed by its name, LSTM is designed for learning tasks requiring long term memories. Still, we find that with first order methods, LSTM completely fails to solve the delayed-XOR benchmark problem proposed in [22]. Fig. 2 shows the convergence curves of the seven tested methods. Only PSGD with the dense, SPLU, Kronecker product and SCAN preconditioners can successfully solve this problem. The bumpy convergence curves suggest that the Hessians at the local minima are ill-conditioned. We would like to point out that a vanilla RNN succeeds in solving this problem when trained with PSGD, and likewise fails when trained with first order methods. More details are given in our package. Hence, selecting the right training method is at least as important as choosing a proper model.


Fig. 2: Convergence curves on the delayed-XOR problem with LSTM. Training and testing losses are the logistic loss and classification error rate, respectively.

V-B3 Experiment 3

This experiment considers the well known MNIST (http://yann.lecun.com/exdb/mnist/) handwritten digit recognition task using a CNN. We do not augment the training data with affine or elastic distortions. However, we do randomly shift the original training images by a few pixels both horizontally and vertically, and nest the shifted images into larger images. A CNN consisting of four convolutional layers, two average pooling layers and one fully connected layer is adopted. The traditional nonlinearity, tanh, is used. Fig. 3 shows the convergence curves. PSGD with the dense preconditioner is not considered due to its excessively high complexity. We find that PSGD with any preconditioner outperforms the first order methods on the training set. PSGD with the Kronecker product preconditioner also outperforms the first order methods on the test set, achieving the lowest average testing classification error rates among the compared methods. Such a performance is impressive as we do not use any complicated data augmentation.


Fig. 3: Convergence curves on the MNIST handwritten digits recognition task using CNN. Training and testing losses are cross entropy and classification error rate, respectively. The testing loss is smoothed to make results from different methods more distinguishable.

V-B4 Experiment 4

We consider the CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) image classification problem using a CNN with four convolutional layers, two fully connected layers, and two max pooling layers. All layers use the leaky ReLU nonlinearity. Apparently, the model is not second order differentiable everywhere. Nevertheless, PSGD does not use the Hessian directly, and we find that it works well with such models. The experimental settings are similar to those of the CIFAR10 classification examples in our package, and here we mainly describe the differences. In this example, we only update the preconditioner on a subset of the iterations, such that the average update rate of the preconditioner is about once per five iterations. In this way, the average wall time per iteration of PSGD is closely comparable to that of SGD. The Keras ImageDataGenerator is used to augment the training data, with the same nonzero setting for width_shift_range, height_shift_range and zoom_range, and default settings for the others. The batch size and step sizes (one for RMSProp and another for the other methods) are given in our package. Fig. 4 shows the comparison results. The SPLU preconditioner seems not to be a good choice here as it has too many parameters, but uses no prior information about the neural network parameter space. PSGD with the Kronecker product preconditioners, including the SCAN one, performs the best, achieving the lowest average test classification error rates among the compared methods. Although the diagonal preconditioner has more parameters than the SCAN one, its performance is only close to that of the first order methods.


Fig. 4: Convergence curves on the CIFAR10 image classification task using CNN. Training and testing losses are cross entropy and classification error rate, respectively. The testing loss is smoothed to make results from different methods more distinguishable.

V-B5 Experiment 5

Here, we consider an image autoencoder consisting of three convolution layers for encoding and three deconvolution layers for decoding. Training and testing images are from the CIFAR10 database. Fig. 5 summarizes the results. SGD performs the worst. ESGD and PSGD with SCAN preconditioner give almost identical convergence curves, and outperform RMSProp. PSGD with Kronecker product preconditioner clearly outperforms all the other methods.


Fig. 5: Convergence curves for the image autoencoder trained on CIFAR10 database. Both the training and testing losses are MSEs.

V-C Complexities of PSGD

V-C1 Computational Complexities

We consider computational complexity per iteration. Compared with SGD, PSGD comes with three major extra complexities:

  • C1: evaluation of the Hessian vector product;

  • C2: preconditioner updating;

  • C3: preconditioned gradient calculation.

C1 typically has the same complexity as the SGD gradient evaluation [17]. Depending on the neural network architecture, the complexity of C1 and SGD varies a lot. For the simplest feed forward neural network, SGD has complexity O(N m n), where (m, n) is the shape of the largest matrix parameter in the model and N is the mini-batch size. For a vanilla RNN, the complexity rises to O(N T m n), where T is the back propagation depth. More complicated models may have higher complexities. The complexities of C2 and C3 depend on the form of the preconditioner. For a Kronecker product preconditioner, C2 and C3 each amount to a few matrix-matrix products involving the m × m and n × n Kronecker factors. One simple way to reduce the complexities of C2 and C3 is to split big matrices into smaller ones, and let each smaller matrix keep its own Kronecker product preconditioner. Another practical way is to update the preconditioner less frequently, as shown in Experiment 4. Since the curvatures are likely to evolve more slowly than the gradients, PSGD with skipped preconditioner updates often converges as fast as a second order method, and at the same time, has an average wall time per iteration closely comparable to that of SGD.
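A hedged sketch of a PSGD loop with skipped preconditioner updates follows; grad_fn, hvp_fn, update_precond and precondition are placeholders for problem- and preconditioner-specific routines, not functions of the psgd_tf package, and the update interval is an illustrative value.

```python
import numpy as np

def train(theta, grad_fn, hvp_fn, update_precond, precondition,
          num_iters=1000, mu=0.01, update_every=5):
    state = None  # preconditioner state, e.g., Kronecker factors
    for t in range(num_iters):
        g = grad_fn(theta)
        if t % update_every == 0:
            # Skipped preconditioner update: refresh the curvature estimate
            # only once every `update_every` iterations.
            delta_theta = np.sqrt(np.finfo(np.float32).eps) * np.random.randn(*theta.shape)
            delta_g = hvp_fn(theta, delta_theta)
            state = update_precond(state, delta_theta, delta_g)
        theta = theta - mu * precondition(state, g)
    return theta
```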

V-C2 Wall Time Comparisons

On our machines and with the above benchmark problems, the wall time per iteration of PSGD without skipped preconditioner updates is typically only about twice that of SGD. This is not astonishing since many parts of PSGD may be processed in parallel. For example, updating the preconditioner and preconditioning the gradient can be executed in parallel, as the preconditioner from the last iteration is used to precondition the current gradient. Preconditioners for all the affine transformation matrices in the model can be updated in parallel as well, once the Hessian vector product is prepared.

Here, we list the median wall time per iteration of each method in Experiment 5 running on a GeForce GTX 1080 Ti graphics card: 0.007 s for SGD; 0.008 s for RMSProp; 0.014 s for ESGD; 0.015 s for PSGD with SCAN preconditioner; 0.017 s for PSGD with Kronecker product preconditioner; and 0.017 s for PSGD with sparse LU preconditioner.

V-D Working with Large Mini-Batch Sizes

Using large mini-batch sizes and step sizes could save training time, but might also bring new issues such as poor convergence and overfitting [23]. The gradients become more deterministic as the mini-batch size increases. Without preconditioning and with badly conditioned Hessians, the model parameters will be adapted only along a few directions of the eigenvectors of the Hessian associated with large eigenvalues. For many methods, this behavior leads to slow and poor convergence. PSGD seems to suffer less from such concerns since the preconditioner is optimized in a way that makes P B̂ have eigenvalues with unitary absolute values. Hence, the model parameters are updated in a balanced manner in all directions. We have tried large mini-batch sizes on our benchmark problems, and observe no meaningful performance loss.

V-E Applications to General Optimization Problems

The boundary between stochastic and general optimization problems blurs with the increase of mini-batch sizes. Thus, a well designed stochastic optimization method should also work properly on general mathematical optimization. Preliminary results show that PSGD does perform well on many mathematical optimization benchmark problems. A Rosenbrock function (https://en.wikipedia.org/wiki/Rosenbrock_function) minimization demo is included in our package, and PSGD finds the global minimum within a small number of iterations. Methods like SGD, RMSProp, Adam, batch normalization, etc., are apparently not prepared for these problems. Further discussion in this direction strays away from our focus.

VI Conclusions

We have proposed a family of preconditioned stochastic gradient descent (PSGD) methods with approximate and exact Hessian vector products, and with dense, sparse LU decomposition, diagonal, Kronecker product, and SCaling-And-Normalization (SCAN) preconditioners. The approximate Hessian vector product via numerical differentiation is a valid alternative to the more costly exact solution, especially when automatic second order differentiation is unavailable or the exact solution is expensive to obtain. We have shown that equilibrated SGD and feature normalization are closely related to PSGD with specific forms of preconditioners. We have compared PSGD with variations of SGD in several performance indices, e.g., training loss, testing loss, and wall time per iteration, on benchmark problems with different levels of difficulties. These first order methods fail completely on tough benchmark problems with synthetic data, and show inferior performance on problems with real world data. The Kronecker product preconditioners, including the SCAN one, are particularly suitable for training neural networks since affine maps are extensively used there for feature transformations. A PSGD software package implemented in Tensorflow is available at https://github.com/lixilinx/psgd_tf.

Appendix A: Learning Preconditioners on the Lie Group

VI-A Lie Group and Natural/Relative Gradient

We consider the Lie group of invertible matrices as it is closely related to our preconditioner estimation problem. Assume Q is an invertible matrix. All invertible matrices with the same dimension as Q form a Lie group. We consider f(Q), a mapping from this Lie group to the real numbers.

In natural gradient descent, the metric tensor depends on Q. One example is to define the distance between Q and Q + dQ as

    dist²(Q, Q + dQ) = tr(dQ Q^{-1} Q^{-T} dQ^T)

where tr denotes trace. Intuitively, the parameter space around Q becomes more curved when Q is approaching a singular matrix. With this metric, the natural gradient is given by

    (∂f/∂Q) Q^T Q

In relative gradient descent, we assume dQ = E Q for a small matrix E, and it can be shown that the resulting increment of Q takes the same form as the one given by the above natural gradient. Hence, the natural and relative gradients have the same form.

It is also possible to choose other metrics in the natural gradient. For example, the natural gradient with metric tr(Q^{-1} dQ dQ^T Q^{-T}) and the relative gradient with form dQ = Q E lead to another form of gradient,

    Q Q^T (∂f/∂Q)

For a specific problem, we should choose the natural or relative gradient with a form that can simplify the gradient descent learning rule. For a given mapping, different metrics generally lead to natural gradients with rather different complexities, and the form giving the simplest learning rule is preferred. For preconditioner learning, we are most interested in groups that have sparse representations.

VI-B Learning on the Group of Triangular Matrices

It is straightforward to show that upper or lower triangular matrices with positive diagonals form a Lie group. Natural and relative gradients have the same form on this group as well. For a learning rule of the form

    Q ← Q − μ ∇ Q

where ∇ denotes the relative gradient, it is convenient to choose the following normalized step size,

    μ = μ₀ / ‖∇‖

where 0 < μ₀ < 1, and ‖∇‖ is a matrix norm of ∇. In our implementations, we simply use the max norm, i.e., the largest absolute value of the entries of ∇. It is clear that the new Q still resides on the same Lie group with this normalized step size.
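A two-line NumPy sketch of this normalized update is given below; the default step size is an arbitrary illustrative value.

```python
import numpy as np

def update_Q(Q, grad, mu0=0.01):
    # One relative gradient step on the group of triangular matrices. The step
    # size is normalized by the max norm of the gradient so that I - mu*grad,
    # and hence the new Q, keeps positive diagonal entries and stays invertible.
    mu = mu0 / (np.max(np.abs(grad)) + 1e-30)
    return Q - mu * grad @ Q
```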

VI-C Natural/Relative Gradients for Q with Different Forms

VI-C1 Arbitrary Invertible Q

We let P = Q^T Q. Then the relative gradient of criterion (7) with respect to Q can be shown to be

    ∇ = E_δθ[ (Q δĝ)(Q δĝ)^T − (Q^{-T} δθ)(Q^{-T} δθ)^T ]

In practice, it is convenient to constrain Q to be a triangular matrix such that Q^{-T} δθ can be efficiently calculated by back substitution. For triangular Q with positive diagonals, P = Q^T Q is known as the Cholesky factorization of P.

VI-C2 Q with Factorization Q = Q₂ ⊗ Q₁

We let Q = Q₂ ⊗ Q₁, and define A = Q₁ δĜ Q₂^T and B = Q₁^{-T} δΘ Q₂^{-1}, where δĜ and δΘ are δĝ and δθ rewritten in matrix form. Then, the relative gradients can be shown to be

    ∇₁ = E_δθ[ A A^T − B B^T ],    ∇₂ = E_δθ[ A^T A − B^T B ]

Again, by constraining Q₁ and Q₂ to be triangular matrices, B can be efficiently calculated with back substitution.
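Under the triangular constraint, these gradients can be computed with two triangular solves; the NumPy/SciPy sketch below assumes upper triangular Q₁ and Q₂ and keeps only the upper triangular parts of the gradients so that the updates stay on the group. It is an illustration, not the exact code of the psgd_tf package.

```python
import numpy as np
from scipy.linalg import solve_triangular

def kron_relative_grads(Q1, Q2, dTheta, dG):
    # Relative gradients for Q = Q2 kron Q1 with upper triangular factors.
    A = Q1 @ dG @ Q2.T                                        # Q1 dG Q2^T
    Y = solve_triangular(Q1, dTheta, trans='T', lower=False)  # Q1^{-T} dTheta
    B = solve_triangular(Q2, Y.T, trans='T', lower=False).T   # Y Q2^{-1}
    grad1 = np.triu(A @ A.T - B @ B.T)   # restrict to the triangular group
    grad2 = np.triu(A.T @ A - B.T @ B)
    return grad1, grad2
```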

VI-C3 Q with Direct Sum Decomposition

For Q = Q₁ ⊕ Q₂, we can update Q₁ and Q₂ separately as they are orthogonal, where ⊕ denotes direct sum.

VI-C4 Q with LU Decomposition

We assume Q has the LU decomposition Q = LU, where

    L = [ L₁, 0 ; L₂, L₃ ],    U = [ U₁, U₂ ; 0, U₃ ]

are lower and upper triangular matrices, respectively, partitioned such that L₁ and U₁ are r × r blocks. When L₃ and U₃ are diagonal matrices, this LU decomposition is sparse.

The relative gradients of criterion (7) with respect to L and U can again be derived in closed form; the detailed expressions can be found in the implementations in our package. For sparse L and U, with the relationships

    L^{-1} = [ L₁^{-1}, 0 ; −L₃^{-1} L₂ L₁^{-1}, L₃^{-1} ],    U^{-1} = [ U₁^{-1}, −U₁^{-1} U₂ U₃^{-1} ; 0, U₃^{-1} ]

it is trivial to invert L and U as long as L₁ and U₁ are small triangular matrices.

References

  • [1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1985.
  • [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [3] I. Sutskever, J. Martens, G. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in 30th Int. Conf. Machine Learning, Atlanta, 2013, pp. 1139–1147.
  • [4] G. Hinton, Neural Networks for Machine Learning. Retrieved from http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
  • [5] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in 3rd Int. Conf. Learning Representations, San Diego, 2015.
  • [6] T. Schaul, S. Zhang, and Y. LeCun, “No more pesky learning rates,” arXiv:1206.1106, 2013.
  • [7] J. Martens and I. Sutskever, “Training deep and recurrent neural networks with Hessian-free optimization,” In Neural Networks: Tricks of the Trade, 2nd ed., vol. 7700, G. Montavon, G. B. Orr, and K.-R. Müller, Ed. Berlin Heidelberg: Springer, 2012, pp. 479–535.
  • [8] N. N. Schraudolph, J. Yu, and S. Günter, “A stochastic quasi-Newton method for online convex optimization,” J. Mach. Learn. Res., vol. 2, pp. 436–443, Jan. 2007.
  • [9] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic quasi-Newton method for large-scale optimization,” SIAM J. Optimiz., vol. 26, no. 2, pp. 1008–1031, Jan. 2014.
  • [10] B. Antoine, B. Leon, and G. Patrick, “SGD-QN: careful quasi-Newton stochastic gradient descent,” J. Mach. Learn. Res., vol. 10, pp. 1737–1754, Jul. 2009.
  • [11] J.-F. Cardoso and B. Laheld, “Equivariant adaptive source separation,” IEEE Trans. Signal Process., vol. 44, no. 12, pp. 3017–3030, Dec. 1996.
  • [12] S. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251–276, Feb. 1998.
  • [13] Y. N. Dauphin, H. Vries, and Y. Bengio, “Equilibrated adaptive learning rates for non-convex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 1504–1512.
  • [14] S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” https://arxiv.org/abs/1502.03167, 2015.
  • [15] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems, 2017.
  • [16] X.-L. Li, “Preconditioned stochastic gradient descent,” IEEE Trans. Neural Networks and Learning Systems, vol. 29, no. 5, May, 2018.
  • [17] B. A. Pearlmutter, “Fast exact multiplication by the Hessian,” Neural Computation, vol. 6, pp. 147–160, 1994.
  • [18] J. Martens and R. B. Grosse, “Optimizing neural networks with Kronecker-factored approximate curvature,” in Proc. 32nd Int. Conf. Machine Learning, 2015, pp. 2408–2417.
  • [19] D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of DNNs with natural gradient and parameter averaging,” in Proc. Int. Conf. Learning Representations, 2015.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” https://arxiv.org/abs/1512.03385, 2015.
  • [21] K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “On the properties of neural machine translation: encoder-decoder approaches,” in Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
  • [22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no.8, pp. 1735–1780, 1997.
  • [23] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: generalization gap and sharp minima,” in 5th Int. Conf. Learning Representations, Toulon, France, 2017.