This paper studies the performance of preconditioned stochastic gradient descent (PSGD), which can be regarded as an enhanced stochastic Newton method with the ability to handle gradient noise and non-convexity at the same time. We have improved the implementation of PSGD, unveiled its relationship to equilibrated stochastic gradient descent (ESGD) and batch normalization, and provided a software package (https://github.com/lixilinx/psgd_tf) implemented in Tensorflow to compare variations of PSGD and stochastic gradient descent (SGD) on a wide range of benchmark problems with commonly used neural network models, e.g., convolutional and recurrent neural networks. The comparison results demonstrate the advantages of PSGD in terms of convergence speed and generalization performance.
Stochastic gradient descent (SGD) and its variations, e.g., SGD with either classic or Nesterov momentum, RMSProp, Adam, adaptive learning rates, etc., are popular in diverse stochastic optimization problems, e.g., machine learning and adaptive signal processing [1, 2, 3, 4, 5, 6]. These first order methods are simple and numerically stable, but often suffer from slow convergence and inefficiency in optimizing non-convex models. Off-the-shelf second order methods from convex optimization, e.g., the quasi-Newton method, the conjugate gradient method, and the truncated Newton method, i.e., Hessian-free optimization, are attracting more attention [7, 8, 9, 10], and have found many successful applications in stochastic optimization. Most second order methods require large mini-batch sizes, and have high complexity for large-scale problems. At the same time, the search for new optimization theories and learning rules remains active, and methods like natural gradient descent, relative gradient descent, equilibrated SGD (ESGD), feature normalization [11, 12, 13, 15, 14], etc., provide us with great insight into the properties of parameter spaces and cost function surfaces in stochastic optimization.
This paper proposes a family of online second order stochastic optimization methods based on the theory of preconditioned SGD (PSGD) [16]. Unlike most second order methods, PSGD explicitly considers the gradient noise in stochastic optimization, and works well with non-convex problems. It adaptively estimates a preconditioner from noisy Hessian vector products with natural or relative gradient descent, and preconditions the stochastic gradient to accelerate convergence. We closely study the performance of five forms of preconditioners, i.e., dense, diagonal, sparse LU decomposition, Kronecker product, and scaling-and-normalization preconditioners. ESGD and feature normalization [14, 15] are shown to be closely related to PSGD with specific forms of preconditioners. We consider two different ways to evaluate the Hessian vector product, an important measurement that helps PSGD to adaptively extract the curvature information of cost surfaces. We also recommend three methods to regularize the preconditioner estimation problem when the Hessian becomes ill-conditioned, or when numerical errors dominate the Hessian vector product evaluations due to floating point arithmetic. We further provide a software package implemented in Tensorflow (https://www.tensorflow.org/) for comparing variations of SGD and PSGD with different preconditioners on a wide range of benchmark problems, which include synthetic and real world data, and involve the most commonly used neural network architectures, e.g., recurrent and convolutional networks. Experimental results suggest that Kronecker product preconditioners, including the scaling-and-normalization one, are particularly suitable for training neural networks since affine maps are extensively used there for feature transformations.
Let us consider the minimization of cost function
$$f(\theta) = E_z[\ell(\theta, z)] \tag{1}$$
where $E_z$ takes expectation over random variable $z$, $\ell(\theta, z)$ is a loss function, and $\theta$ is the model parameter vector to be optimized. For example, in a classification problem, $\ell$ could be the cross entropy loss, $z$ is a pair of input feature vector and class label, vector $\theta$ consists of all the trainable parameters in the considered classification model, and $E_z$ takes average over all samples from the training data set. By assuming second order differentiable model and loss, we could approximate $\ell(\theta, z)$ as a quadratic function of $\theta$ within a trust region around $\theta$, i.e.,
$$\ell(\theta, z) = b_z^T \theta + \frac{1}{2}\theta^T H_z \theta + a_z \tag{2}$$
where $a_z$ is the sum of approximation errors and constant terms independent of $\theta$, $H_z$ is a symmetric matrix, and subscript $z$ in $b_z$, $H_z$ and $a_z$ reminds us that these three terms depend on $z$. Now, we may rewrite (1) as
$$f(\theta) = b^T \theta + \frac{1}{2}\theta^T H \theta + a \tag{3}$$
where $b = E_z[b_z]$, $H = E_z[H_z]$, and $a = E_z[a_z]$. We do not impose any assumption, e.g., positive definiteness, on $H$ except for being symmetric. Thus the quadratic surface in the trust region could be non-convex. To simplify our notations, we no longer consider the higher order approximation errors included in $a$, and simply assume that $f(\theta)$ is a quadratic function of $\theta$ in the considered trust region.
PSGD uses preconditioned stochastic gradient to update $\theta$ as
$$\theta \leftarrow \theta - \mu P \hat{g}(\theta) \tag{4}$$
where $\mu > 0$ is a normalized step size, $\hat{g}(\theta)$ is an estimate of the gradient $g(\theta) = \partial f(\theta)/\partial\theta$ obtained by replacing expectation with sample average, and $P$ is a positive definite preconditioner adaptively updated along with $\theta$. Within the considered trust region, let us write the stochastic gradient, $\hat{g}(\theta)$, explicitly as
$$\hat{g}(\theta) = \hat{b} + \hat{H}\theta \tag{5}$$
where $\hat{b}$ and $\hat{H}$ are estimates of $b$ and $H$, respectively. Let $\delta\theta$ be a random perturbation of $\theta$, and be small enough such that $\theta + \delta\theta$ still resides in the same trust region. Then, (5) suggests the following resultant perturbation of stochastic gradient,
$$\delta\hat{g} = H\delta\theta + \varepsilon \tag{6}$$
where $\varepsilon$ accounts for the error due to replacing $\hat{H}$ with $H$. Note that by definition, $\delta\hat{g}$ is a random vector dependent on $\delta\theta$. PSGD pursues the preconditioner $P$ via minimizing criterion
$$c(P) = E_{\delta\theta}\left[\delta\hat{g}^T P\, \delta\hat{g} + \delta\theta^T P^{-1} \delta\theta\right] \tag{7}$$
where $E_{\delta\theta}$ takes expectation over $\delta\theta$. We typically introduce factorization $P = Q^T Q$, and update $Q$ instead of $P$ directly as $Q$ can be efficiently learned with natural or relative gradient descent on the Lie group of nonsingular matrices. Detailed learning rules for $Q$ with different forms can be found in Appendix A. For our preconditioner estimation problem, relative and natural gradients are equivalent, and further details on these two gradients can be found in [11, 12].
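Under mild conditions, criterion (7) determines a unique positive definite $P$ [16]. The resultant preconditioner is perfect in the sense that it preconditions the stochastic gradient such that
$$P\, E_{\delta\theta}\left[\delta\hat{g}\,\delta\hat{g}^T\right] P = E_{\delta\theta}\left[\delta\theta\,\delta\theta^T\right] \tag{8}$$
which is comparable to relationship
$$H^{-1}\delta g\,\delta g^T H^{-1} = \delta\theta\,\delta\theta^T \tag{9}$$
where $\delta g$ is the perturbation of noiseless gradient, and we assume that $H$ is invertible such that $\delta g = H\delta\theta$ can be rewritten as (9). Thus, PSGD can be viewed as an enhanced Newton method that can handle gradient noise and non-convexity at the same time.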
Note that in the presence of gradient noise, the optimal and given by (8
) are not unbiased estimates of
and , respectively. Actually, even if is positive definite and available,may not always be a good preconditioner since it could significantly amplify the gradient noise along the directions of the eigenvectors of
associated with small eigenvalues, and might lead to divergence. More specifically,
[16] shows that(10) |
where means that is nonnegative definite.
The original PSGD method relies on (6) to calculate the Hessian vector product, $\hat{H}\delta\theta$. This numerical differentiation method is simple, and only involves gradient calculations. However, it requires $\delta\theta$ to be small enough such that $\theta$ and $\theta + \delta\theta$ reside in the same trust region. In practice, numerical error might be an issue when handling small numbers with floating point arithmetic. This concern becomes more serious with the emergence of half precision arithmetic in neural network training. Still, the approximate solution is empirically shown to work well, especially with double precision floating point arithmetic, and does not involve any second order derivative calculation.
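The numerical differentiation described above can be illustrated with a minimal sketch. The toy cost function, its hand-derived Hessian, and all function names below are assumptions made for illustration and are not taken from the package. The sketch differences two gradient evaluations and compares the result with the exact Hessian vector product.

```python
# Sketch: approximate a Hessian vector product by differencing two gradients.
# Hypothetical toy cost: f(theta) = 0.25 * sum(theta_i^4) + 0.5 * (sum(theta))^2.

def grad(theta):
    """Gradient of the toy cost: theta_i^3 + sum(theta)."""
    s = sum(theta)
    return [t ** 3 + s for t in theta]

def hvp_exact(theta, dtheta):
    """Hand-derived H @ dtheta, with H = diag(3 * theta^2) + ones * ones^T."""
    s = sum(dtheta)
    return [3.0 * t * t * d + s for t, d in zip(theta, dtheta)]

def hvp_finite_difference(grad_fn, theta, dtheta):
    """Gradient perturbation grad(theta + dtheta) - grad(theta)."""
    shifted = [t + d for t, d in zip(theta, dtheta)]
    return [g1 - g0 for g1, g0 in zip(grad_fn(shifted), grad_fn(theta))]

theta = [1.0, -0.5, 2.0]
dtheta = [1e-8, -2e-8, 3e-8]  # small: roughly sqrt(double precision epsilon)
approx = hvp_finite_difference(grad, theta, dtheta)
exact = hvp_exact(theta, dtheta)
max_abs_error = max(abs(a - e) for a, e in zip(approx, exact))
```

With double precision the two results agree closely here. With single or half precision the perturbation would have to be larger, which is the trade-off the text points out.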
An alternative way to calculate the Hessian vector product is via
$$\hat{H}\delta\theta = \frac{\partial\left(\delta\theta^T \hat{g}(\theta)\right)}{\partial\theta} \tag{11}$$
Here, we no longer require $\delta\theta$ to be small enough. The above trick has been known for a long time [17]. However, hand coded second order derivatives are error prone even for moderately complicated models. Nowadays, this choice becomes attractive due to the wide availability of automatic differentiation software, e.g., Tensorflow, Pytorch (http://pytorch.org/), etc. Note that second and higher order derivatives may not be fully supported in certain software, e.g., the latest Tensorflow, version 1.6, does not support second order derivatives for its while loop. The exact solution is typically computationally more expensive than the approximate one, although both may have the same order of computational complexity.
We call $P$ a dense preconditioner if it does not have any sparse structure except for being symmetric. A dense preconditioner is practical only for small-scale problems with up to thousands of trainable parameters, since it requires $O(L^2)$ parameters to represent it, where $L$ is the length of $\theta$. We are mainly interested in sparse, or limited memory, preconditioners, whose representations only require $O(L)$ or fewer parameters.
The diagonal preconditioner is probably one of the simplest. From (7), we readily find the optimal solution as
$$P = \mathrm{diag}\left(\sqrt{E_{\delta\theta}[\delta\theta \odot \delta\theta] \oslash E_{\delta\theta}[\delta\hat{g} \odot \delta\hat{g}]}\right) \tag{12}$$
where $\odot$ and $\oslash$ denote element wise multiplication and division, respectively. For $\delta\theta$ drawn from standard multivariate normal distribution, $E_{\delta\theta}[\delta\theta \odot \delta\theta]$ reduces to a vector with unit entries, and (12) gives the equilibration preconditioner in the equilibrated SGD (ESGD) proposed in [13]. Thus, ESGD is PSGD with a diagonal preconditioner. The Jacobi preconditioner is not optimal by criterion (7), and indeed is observed to show inferior performance in [13].
A sparse LU (SPLU) decomposition preconditioner is given by $P = Q^T Q$, where $Q = LU$, and $L$ and $U$ are lower and upper triangular matrices, respectively. To make the SPLU preconditioner applicable to large-scale problems, except for the diagonals, only the first $r$ columns of $L$ and the first $r$ rows of $U$ can have nonzero entries, where $r$ is the order of the SPLU preconditioner.
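The closed form diagonal solution can be sketched in a few lines. This is only an illustration: the diagonal Hessian `h`, the sample count, and the function name are hypothetical, and the expectations are replaced with sample averages. For a noiseless diagonal Hessian the estimate reduces to the inverse absolute curvature, including along the negative curvature direction.

```python
import random

def diagonal_preconditioner(dthetas, dgs):
    """Element wise sqrt(sum(dtheta^2) / sum(dg^2)) over paired samples."""
    n = len(dthetas[0])
    num = [sum(d[i] ** 2 for d in dthetas) for i in range(n)]
    den = [sum(g[i] ** 2 for g in dgs) for i in range(n)]
    return [(a / b) ** 0.5 for a, b in zip(num, den)]

random.seed(0)
h = [4.0, -0.5, 0.01]  # hypothetical diagonal Hessian with one negative entry
dthetas = [[random.gauss(0.0, 1.0) for _ in h] for _ in range(100)]
# Noiseless gradient perturbations, dg = H @ dtheta for diagonal H.
dgs = [[hi * di for hi, di in zip(h, d)] for d in dthetas]
p = diagonal_preconditioner(dthetas, dgs)
```

Each entry of `p` matches `1 / abs(h[i])` up to rounding, so the preconditioned curvature has unit magnitude in every coordinate.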
A Kronecker product preconditioner is given by $P = Q^T Q$, where $Q = Q_2 \otimes Q_1$, and $\otimes$ denotes Kronecker product. Kronecker product preconditioners have been previously exploited in [16, 18, 19]. We find that they are particularly suitable for preconditioning gradients of tensor parameters. For example, for a tensor $\Theta$ with shape $I \times J \times K$, we flatten $\Theta$ into a column vector $\theta$ with length $IJK$, and use Kronecker product preconditioner
$$P = P_3 \otimes P_2 \otimes P_1 \tag{13}$$
to have preconditioned gradient $P\hat{g}(\theta)$, where $P_1$, $P_2$, and $P_3$ are three positive definite matrices with shapes $I \times I$, $J \times J$, and $K \times K$, respectively. Such a preconditioner only requires $O(I^2 + J^2 + K^2)$ parameters for its representation, while a dense one requires $O(I^2 J^2 K^2)$ parameters. For neural network learning, $\Theta$ is likely to be a matrix parameter, and thus a preconditioner of form $P = P_2 \otimes P_1$ should be used.
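For the matrix case, the saving comes from the standard identity $(P_2 \otimes P_1)\,\mathrm{vec}(G) = \mathrm{vec}(P_1 G P_2)$ for symmetric $P_2$ and column-major $\mathrm{vec}$. The sketch below checks it numerically. The helper functions and the example matrices are hypothetical, and the column-major flattening is an assumption about the ordering convention.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker product of two matrices stored as nested lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def vec(m):
    """Stack the columns of a matrix into one flat list (column-major)."""
    return [m[i][j] for j in range(len(m[0])) for i in range(len(m))]

P1 = [[2.0, 1.0], [1.0, 3.0]]                              # 2 x 2, symmetric
P2 = [[1.0, 0.5, 0.0], [0.5, 2.0, 0.2], [0.0, 0.2, 1.0]]   # 3 x 3, symmetric
G = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                     # 2 x 3 gradient

# Dense route: build the 6 x 6 preconditioner and apply it to vec(G).
dense = [sum(r * x for r, x in zip(row, vec(G))) for row in kron(P2, P1)]
# Structured route: two small matrix products, no 6 x 6 matrix needed.
structured = vec(matmul(matmul(P1, G), P2))
max_diff = max(abs(a - b) for a, b in zip(dense, structured))
```

The structured route never forms the full preconditioner, which is what keeps storage and computation manageable for large matrices.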
The SCAN preconditioner is a specific Kronecker product preconditioner specially designed for neural network training, where $P = Q^T Q$, $Q = Q_2 \otimes Q_1$, $Q_1$ is a diagonal matrix, and only entries of the diagonal and last column of $Q_2$ can have nonzero values. As explained in Section IV.B, PSGD with a SCAN preconditioner is equivalent to SGD with normalized input features and scaled output features. It is not difficult to verify that for such sparse $Q_1$ and $Q_2$ with positive diagonal entries, matrices with decomposition $Q_2 \otimes Q_1$ form a Lie group. Hence, natural or relative gradient descent applies to SCAN preconditioner estimation as well.
It is not possible to enumerate all feasible forms of preconditioners. Except for a few cases, we cannot find a closed form solution for the optimal $P$ with a desired form when given enough pairs of $(\delta\theta, \delta\hat{g})$. Hence, it is important to design preconditioners with proper forms such that efficient learning, e.g., natural or relative gradient descent, is available. Table I summarizes the ones we have discussed, and their degrees of freedom. These simple preconditioners can be used as building blocks for forming larger preconditioners via direct sum and/or Kronecker product operations. All preconditioners obtained in this way can be efficiently learned with natural or relative gradient descent, and such technical details can be found in Appendix A.
Preconditioner | Number of parameters |
---|---|
Dense | |
SPLU with order | |
Kronecker product | |
Diagonal | |
SCAN |
In practice, a second order method might suffer from vanishing second order derivatives and an ill-conditioned Hessian. The vanishing gradient problem is well known in training deep neural networks. Similarly, under certain numerical conditions, the Hessian could be too small to be accurately calculated with floating point arithmetic even if the model and loss are differentiable everywhere. A non-differentiable model or loss could also lead to a vanishing or ill-conditioned Hessian. For example, the second order derivative of the commonly used rectified linear unit (ReLU) is zero almost everywhere, and undefined at the origin. We observe that a vanishing or ill-conditioned Hessian could cause numerical difficulties to PSGD, although it does not use the Hessian directly. Here, we propose three remedies to regularize the estimation of the preconditioner in PSGD.
Traditional damping regularizes the Hessian by adding a diagonal loading term, $\lambda I$, to it, where $\lambda > 0$ is a small number, and $I$ is the identity matrix. In practice, all we need to do is to replace $\delta\hat{g}$ with $\delta\hat{g} + \lambda\delta\theta$ when estimating the preconditioner.
The traditional damping works well only for convex problems. For non-convex optimization, it could make things worse if the Hessian has eigenvalues close to $-\lambda$. In the proposed non-convexity compatible damping, all we need to do is to replace $\delta\hat{g}$ with $\delta\hat{g} + \lambda\eta$ when estimating the preconditioner, where $\eta$ is a random vector having the same probability density distribution as that of $\delta\theta$, and independent of $\delta\theta$.
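The difference between the two damping schemes can be sketched as follows. This is a hedged illustration: it assumes the damped perturbations take the forms $\delta\hat{g} + \lambda\delta\theta$ and $\delta\hat{g} + \lambda\eta$ with standard normal $\delta\theta$ and $\eta$, and the diagonal Hessian and function names are hypothetical. When an eigenvalue equals $-\lambda$, traditional damping cancels the gradient perturbation in that direction, while the independent vector does not.

```python
import random

def traditional_damping(dg, dtheta, lam):
    """Equivalent to using (H + lam * I) @ dtheta as the gradient perturbation."""
    return [g + lam * d for g, d in zip(dg, dtheta)]

def nonconvex_compatible_damping(dg, lam, rng):
    """Add lam times an independent draw with the same distribution as dtheta."""
    eta = [rng.gauss(0.0, 1.0) for _ in dg]
    return [g + lam * e for g, e in zip(dg, eta)]

rng = random.Random(0)
lam = 0.1
h = [1.0, -0.1]  # hypothetical diagonal Hessian; second eigenvalue equals -lam
dtheta = [rng.gauss(0.0, 1.0) for _ in h]
dg = [hi * di for hi, di in zip(h, dtheta)]

damped_traditional = traditional_damping(dg, dtheta, lam)
damped_nonconvex = nonconvex_compatible_damping(dg, lam, rng)
```

Here `damped_traditional[1]` is exactly zero, so the preconditioner estimate along that direction would blow up. The second scheme keeps the perturbation nonzero regardless of the sign of the curvature.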
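Gradient clipping is a commonly used trick to stabilize the training of recurrent neural networks. We find it useful to stabilize PSGD as well. PSGD with preconditioned gradient clipping is given by
$$\theta \leftarrow \theta - \mu\,\frac{\Omega}{\max\left(\Omega, \|P\hat{g}(\theta)\|\right)}\,P\hat{g}(\theta) \tag{14}$$
where $\|\cdot\|$ takes the norm of a vector, and $\Omega > 0$ is a threshold.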
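A minimal sketch of preconditioned gradient clipping follows, assuming the Euclidean norm and a hypothetical function name. The preconditioned gradient is rescaled only when its norm exceeds the threshold, so small steps pass through unchanged.

```python
import math

def clipped_update(theta, pg, mu, omega):
    """One parameter step with the preconditioned gradient pg clipped to norm omega."""
    norm = math.sqrt(sum(v * v for v in pg))
    scale = omega / max(omega, norm)  # equals 1 whenever norm <= omega
    return [t - mu * scale * v for t, v in zip(theta, pg)]

# A large preconditioned gradient (norm 5) is shrunk to norm 1.
large_step = clipped_update([0.0, 0.0], [3.0, 4.0], mu=1.0, omega=1.0)
# A small one (norm 0.5) is applied as is.
small_step = clipped_update([0.0, 0.0], [0.3, 0.4], mu=1.0, omega=1.0)
```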
Note that the gradient noises already play a role similar to the non-convexity compatible damping. Thus, we may have no need to use the traditional or the non-convexity compatible damping when the batch size is not too large. Preconditioned gradient clipping is a simple and useful trick for solving problems with badly conditioned Hessians.
Element wise nonlinearity and affine transformation,
$$y = \Theta x \tag{15}$$
are the two main building blocks of most neural networks, where $\Theta$ is a matrix parameter, and both $x$ and $y$ are feature vectors optionally augmented with $1$. Since most neural networks use parameterless nonlinearities, all the trainable parameters are just a list of affine transformation matrices. By assigning a Kronecker product preconditioner to each affine transformation matrix, we are using the direct sum of a list of Kronecker product preconditioners as the preconditioner for the whole model parameter vector. Our experience suggests that this approach provides a good trade-off between computational complexity and performance gain.
It is not difficult to spot the affine transformations in most commonly used neural networks, e.g., feed forward neural network, vanilla recurrent neural network (RNN), gated recurrent unit (GRU) [21], long short-term memory (LSTM) [22], convolutional neural network (CNN) [2], etc. For example, in a two dimensional CNN, the input features may form a three dimensional tensor with shape $h \times w \times c_{\rm in}$, and the filter coefficients could form a four dimensional tensor with shape $h \times w \times c_{\rm in} \times c_{\rm out}$, where $h$ is the height of image patch, $w$ is the width of image patch, $c_{\rm in}$ is the number of input channels, and $c_{\rm out}$ is the number of output channels. To rewrite the convolution as an affine transformation, we just need to reshape the filter tensor into a matrix with size $c_{\rm out} \times h w c_{\rm in}$, and flatten the input image patch into a column vector with length $h w c_{\rm in}$.
By using Kronecker product preconditioner $P = P_2 \otimes P_1$, the learning rule for $\Theta$ can be written as
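$$\Theta \leftarrow \Theta - \mu P_1 \frac{\partial\hat{f}}{\partial\Theta} P_2 \tag{16}$$
where $\partial\hat{f}/\partial\Theta$ is the stochastic gradient with respect to $\Theta$, and $P_1$ and $P_2$ are two positive definite matrices with proper dimensions. With factorizations $P_1 = Q_1^T Q_1$ and $P_2 = Q_2^T Q_2$, we can rewrite (16) as
$$Q_1^{-T}\Theta Q_2^{-1} \leftarrow Q_1^{-T}\Theta Q_2^{-1} - \mu Q_1 \frac{\partial\hat{f}}{\partial\Theta} Q_2^T \tag{17}$$
Let us introduce matrix $\Theta' = Q_1^{-T}\Theta Q_2^{-1}$, and noticing that
$$\frac{\partial\hat{f}}{\partial\Theta'} = Q_1 \frac{\partial\hat{f}}{\partial\Theta} Q_2^T \tag{18}$$
we can rewrite (17) simply as
$$\Theta' \leftarrow \Theta' - \mu \frac{\partial\hat{f}}{\partial\Theta'} \tag{19}$$
Correspondingly, the affine transformation in (15) is rewritten as $y' = \Theta' x'$, where $y' = Q_1^{-T} y$ and $x' = Q_2 x$. Hence, the PSGD in (16) is equivalent to the SGD in (19) with transformed feature vectors $x'$ and $y'$.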
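The equivalence between the preconditioned update and plain SGD in transformed coordinates can be checked numerically. The sketch below uses hypothetical triangular factors, a hypothetical gradient, and made-up helper names. It starts from the transformed parameter so that no matrix inverse is needed.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def step(m, g, mu):
    """Plain gradient step m - mu * g, element wise."""
    return [[x - mu * y for x, y in zip(mr, gr)] for mr, gr in zip(m, g)]

mu = 0.1
Q1 = [[1.5, 0.3], [0.0, 0.8]]                               # output-side factor
Q2 = [[1.0, 0.2, -0.1], [0.0, 2.0, 0.4], [0.0, 0.0, 0.5]]   # input-side factor
theta_t = [[0.2, -1.0, 0.5], [1.0, 0.3, -0.7]]  # transformed parameter
G = [[0.5, -0.2, 0.1], [0.3, 0.9, -0.4]]        # gradient w.r.t. original parameter

# Original parameter: Theta = Q1^T Theta' Q2.
theta = matmul(matmul(transpose(Q1), theta_t), Q2)

# Route 1: preconditioned step on Theta with P1 = Q1^T Q1 and P2 = Q2^T Q2.
P1 = matmul(transpose(Q1), Q1)
P2 = matmul(transpose(Q2), Q2)
theta_psgd = step(theta, matmul(matmul(P1, G), P2), mu)

# Route 2: plain SGD step on Theta' with gradient Q1 G Q2^T, then map back.
G_t = matmul(matmul(Q1, G), transpose(Q2))
theta_t_sgd = step(theta_t, G_t, mu)
theta_from_sgd = matmul(matmul(transpose(Q1), theta_t_sgd), Q2)

max_diff = max(abs(a - b) for ra, rb in zip(theta_psgd, theta_from_sgd)
               for a, b in zip(ra, rb))
```

Both routes land on the same parameter matrix up to rounding, which is the sense in which the preconditioner acts as a change of coordinates on the input and output features.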
We know that feature whitening and normalization could accelerate convergence, and batch normalization and self-normalizing neural networks are such examples [14, 15]. Actually, feature normalization can be viewed as PSGD with a specific SCAN preconditioner with constraint $Q_1 = I$ and a proper $Q_2$. This fact is best explained by considering the following example with two input features,
$$\begin{bmatrix} (x_1 - m_1)/\sigma_1 \\ (x_2 - m_2)/\sigma_2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1/\sigma_1 & 0 & -m_1/\sigma_1 \\ 0 & 1/\sigma_2 & -m_2/\sigma_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} \tag{20}$$
where $m_i$ and $\sigma_i$ are the mean and standard deviation of $x_i$, respectively. However, we should be aware that explicit input feature normalization is only empirically shown to accelerate convergence, and has little meaning in certain scenarios, e.g., RNN learning where features may not have any stationary distribution. Furthermore, feature normalization cannot normalize the input and output features simultaneously. PSGD provides a more general and principled approach to find the optimal preconditioner, and applies to a much broader range of applications. A SCAN preconditioner does not necessarily “normalize” the input features in the sense of mean removal and variance normalization.
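The two-feature example can be reproduced with a short sketch. The feature values, means, standard deviations, and the function name are hypothetical. The matrix built here has nonzero entries only on its diagonal and in its last column, which is the sparsity pattern allowed for the input-side factor of a SCAN preconditioner.

```python
def scan_input_matrix(means, stds):
    """Matrix with nonzeros only on the diagonal and in the last column."""
    n = len(means)
    q = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, (m, s) in enumerate(zip(means, stds)):
        q[i][i] = 1.0 / s   # variance normalization
        q[i][n] = -m / s    # mean removal, applied through the appended 1
    q[n][n] = 1.0
    return q

x = [3.0, -1.0]
means = [1.0, 2.0]
stds = [2.0, 0.5]

augmented = x + [1.0]  # feature vector augmented with 1
q2 = scan_input_matrix(means, stds)
transformed = [sum(a * b for a, b in zip(row, augmented)) for row in q2]
normalized = [(xi - m) / s for xi, m, s in zip(x, means, stds)] + [1.0]
```

Multiplying the augmented feature vector by this matrix gives the same result as explicit mean removal and scaling.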
Tensorflow is one of the most popular machine learning frameworks with automatic differentiation support. We have defined a set of benchmark problems with both synthetic and real world data, and implemented SGD, RMSProp, ESGD, and PSGD with five forms of preconditioners. One trick worth pointing out is that in our implementations, we use the preconditioner from the last iteration to precondition the current gradient. In this way, preconditioning the gradient and updating the preconditioner can be processed in parallel. The original method in [16] updates the preconditioner and model parameters sequentially. It may marginally speed up the convergence, but one iteration could take longer wall time.
To make our comparison results easy to analyze, we try to keep settings simple and straightforward, and do not consider commonly used neural network training tricks like momentum, time varying step size, drop out, batch normalization, etc.. Moreover, tricks like drop out and batch normalization cannot be directly applied to RNN training. Preconditioners of PSGD always are initialized with identity matrix, and updated with a constant normalized step size, , and mini-batch size . We always set for the SPLU preconditioner. We independently sample each entry of from normal distribution with mean and variance when (6) is used to approximate the Hessian vector product, and mean variance when (11) is used, where is the single precision machine epsilon. We only report the results of PSGD with exact Hessian vector product here since the versions with approximated one typically give similar results. The training loss is smoothed to keep our plots legible. Any further experimental details and results not reported here, e.g., step sizes, preconditioned gradient clipping thresholds, mini-batch sizes, neural network initial guesses, training and testing sample sizes, training loss smoothing factor, etc., can be found in our package at https://github.com/lixilinx/psgd_tf.
We consider the addition problem first proposed in [22]. A vanilla RNN is trained to predict the mean of two marked real numbers randomly located in a long sequence. Further details on the addition problem can be found in [22] and our implementations. Here, we deliberately use mini-batch size . Fig. 1 summarizes the results. PSGD with any preconditioner outperforms the first order methods. It is clear that PSGD is able to damp the gradient noise and accelerate convergence simultaneously.
Experiment 1 shows that SGD like methods have great difficulties in optimizing certain models. Hence, many thoughtfully designed neural network architectures are proposed to facilitate learning, e.g., LSTM [22], GRU [21], residual network [20], etc. As revealed by its name, LSTM is designed for learning tasks requiring long term memories. Still, we find that with first order methods, LSTM completely fails to solve the delayed-XOR benchmark problem proposed in [22]. Fig. 2 shows the convergence curves of seven tested methods. Only PSGD with dense, SPLU, Kronecker product and SCAN preconditioners can successfully solve this problem. Bumpy convergence curves suggest that the Hessians at local minima are ill-conditioned. We would like to point out that a vanilla RNN also succeeds in solving this problem when trained with PSGD, and fails when trained with first order methods. More details are given in our package. Hence, selecting the right training method is at least as important as choosing a proper model.
This experiment considers the well known MNIST (http://yann.lecun.com/exdb/mnist/) handwritten digits recognition task using CNN. We do not augment the training data with affine or elastic distorted images. However, we do randomly shift the original training image by pixels both horizontally and vertically, and nest the shifted one into a larger, , image. A CNN consisting of four convolutional, two average pooling and one fully connected layers is adopted. The traditional nonlinearity, tanh, is used. Fig. 3 shows the convergence curves. PSGD with dense preconditioner is not considered due to its excessively high complexity. We find that PSGD with any preconditioner outperforms the first order methods on the training set. PSGD with the Kronecker product preconditioner also outperforms the first order methods on the test set, achieving average testing classification error rates about . Such a performance is impressive as we do not use any complicated data augmentation.
We consider the CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) image classification problem using a CNN with four convolutional layers, two fully connected layers, and two max pooling layers. All layers use the leaky ReLU,
. Apparently, the model is not second order differentiable everywhere. Nevertheless, PSGD does not use the Hessian, and we find that it works well with such non-differentiable models. Experimental settings are similar to the CIFAR10 classification examples in our package, and here we mainly describe the differences. In this example, we update the preconditioner at the th iteration only if . The total number of iterations is. Thus, the average update rate of preconditioner is about once per five iterations. In this way, the average wall time per iteration of PSGD is closely comparable to that of SGD. The Keras ImageDataGenerator is used to augment the training data with setting
for the width_shift_range, height_shift_range and zoom_range, and default settings for others. Batch size is , step size is for RMSProp and for other methods. Fig. 4 shows the comparison results. The SPLU preconditioner does not seem to be a good choice here as it has too many parameters, but uses no prior information of the neural network parameter space. PSGD with Kronecker product preconditioners, including the SCAN one, performs the best, achieving average test classification error rates slightly lower than . Although the diagonal preconditioner has more parameters than the SCAN one, its performance is just close to that of first order methods.
Here, we consider an image autoencoder consisting of three convolution layers for encoding and three deconvolution layers for decoding. Training and testing images are from the CIFAR10 database. Fig. 5 summarizes the results. SGD performs the worst. ESGD and PSGD with SCAN preconditioner give almost identical convergence curves, and outperform RMSProp. PSGD with Kronecker product preconditioner clearly outperforms all the other methods.
We consider computational complexity per iteration. Compared with SGD, PSGD comes with three major extra complexities:
C1: evaluation of the Hessian vector product;
C2: preconditioner updating;
C3: preconditioned gradient calculation.
C1 typically has the same complexity as SGD [17]. Depending on the neural network architectures, complexity of C1 and SGD varies a lot. For the simplest feed forward neural network, SGD has complexity , where is the shape of the largest matrix in the model, and is the mini-batch size. For a vanilla RNN, the complexity rises to , where is the back propagation depth. More complicated models may have higher complexities. Complexities of C2 and C3 depend on the form of preconditioner. For a Kronecker product preconditioner, C2 has complexity , and C3 has complexity . One simple way to reduce the complexities of C2 and C3 is to split those big matrices into smaller ones, and let each smaller matrix keep its own Kronecker product preconditioner. Another practical way is to update the preconditioner less frequently as shown in Experiment 4. Since the curvatures are likely to evolve slower than the gradients, PSGD with skipped preconditioner update often converges as fast as a second order method, and at the same time, has an average wall time per iteration closely comparable to that of SGD.
On our machines and with the above benchmark problems, the wall time per iteration of PSGD without skipped preconditioner update typically just doubles that of SGD. This is not astonishing since many parts of PSGD may be processed in parallel. For example, updating preconditioner and preconditioning gradient can be executed in parallel as the preconditioner from last iteration is used to precondition the current gradient. Preconditioners for all the affine transformation matrices in the model can be updated in parallel as well once is prepared.
Here, we list the median wall time per iteration of each method in Experiment 5 running on a GeForce GTX 1080 Ti graphics card: 0.007 s for SGD; 0.008 s for RMSProp; 0.014 s for ESGD; 0.015 s for PSGD with SCAN preconditioner; 0.017 s for PSGD with Kronecker product preconditioner; and 0.017 s for PSGD with sparse LU preconditioner.
Using large mini-batch sizes and step sizes could save training time, but also might bring new issues such as poor convergence and overfitting [23]. The gradients become more deterministic with the increase of mini-batch sizes. Without preconditioning and with badly conditioned Hessians, the model parameters will be adapted only along a few directions of the eigenvectors of the Hessian associated with large eigenvalues. For many methods, this behavior leads to slow and poor convergence. PSGD seems to suffer less from such concerns since the preconditioner is optimized in a way to make $PH$ have eigenvalues with unit absolute values. Hence, the model parameters are updated in a balanced manner in all directions. We have tried mini-batch size on our benchmark problems, and observe no meaningful performance loss.
The boundary between stochastic and general optimization problems blurs with the increase of mini-batch sizes. Thus, a well designed stochastic optimization method should also work properly on general mathematical optimization. Preliminary results show that PSGD does perform well on many mathematical optimization benchmark problems. A Rosenbrock function (https://en.wikipedia.org/wiki/Rosenbrock_function) minimization demo is included in our package, and PSGD finds the global minimum with about iterations. Methods like SGD, RMSProp, Adam, batch normalization, etc., are apparently not prepared for these problems. Further discussion in this direction strays away from our focus.
We have proposed a family of preconditioned stochastic gradient descent (PSGD) methods with approximate and exact Hessian vector products, and with dense, sparse LU decomposition, diagonal, Kronecker product, and SCaling-And-Normalization (SCAN) preconditioners. The approximate Hessian vector product via numerical differentiation is a valid alternative to the more costly exact solution, especially when automatic second order differentiation is unavailable or the exact solution is expensive to obtain. We have shown that equilibrated SGD and feature normalization are closely related to PSGD with specific forms of preconditioners. We have compared PSGD with variations of SGD in several performance indices, e.g., training loss, testing loss, and wall time per iteration, on benchmark problems with different levels of difficulty. These first order methods fail completely on tough benchmark problems with synthetic data, and show inferior performance on problems with real world data. The Kronecker product preconditioners, including the SCAN one, are particularly suitable for training neural networks since affine maps are extensively used there for feature transformations. A PSGD software package implemented in Tensorflow is available at https://github.com/lixilinx/psgd_tf.
We consider the Lie group of invertible matrices as it is closely related to our preconditioner estimation problem. Assume $Q$ is an invertible matrix. All such invertible matrices with the same dimension as $Q$ form a Lie group. We consider $f(Q)$, a mapping from this Lie group to $\mathbb{R}$.
In natural gradient descent, the metric tensor depends on $Q$. One example is to define the distance between $Q$ and $Q + dQ$ as
where denotes trace. Intuitively, the parameter space around becomes more curved when is approaching a singular matrix. With this metric, the natural gradient is given by
In relative gradient descent, we assume , and it can be shown that
Hence, the natural and relative gradients have the same form.
It is also possible to choose other metrics in the natural gradient. For example, natural gradient with metric and relative gradient with form lead to another form of gradient,
For a specific problem, we should choose the natural or relative gradient with a form that can simplify the gradient descent learning rule. For example, for mapping , both the above considered metrics lead to natural gradient , while metric leads to natural gradient . Clearly, the previous form is preferred due to its simplicity. For preconditioner learning, we are most interested in groups that have sparse representations.
It is straightforward to show that upper or lower triangular matrices with positive diagonals form a Lie group. Natural and relative gradients may have the same form on this group as well. For learning rule
$$Q \leftarrow Q - \mu \nabla_Q Q$$
where $\nabla_Q$ denotes the relative gradient, it is convenient to choose the following normalized step size,
$$\mu = \frac{\mu_0}{\|\nabla_Q\|}$$
where $0 < \mu_0 < 1$, and $\|\nabla_Q\|$ is a matrix norm of $\nabla_Q$. In our implementations, we simply use the max norm. It is clear that the new $Q$ still resides on the same Lie group with this normalized step size.
We let $P = Q^T Q$. Then the relative gradient of criterion (7) with respect to $Q$ can be shown to be
$$\nabla_Q = 2 E_{\delta\theta}\left[Q\,\delta\hat{g}\,\delta\hat{g}^T Q^T - Q^{-T}\delta\theta\,\delta\theta^T Q^{-1}\right]$$
In practice, it is convenient to constrain $Q$ to be a triangular matrix such that $Q^{-T}\delta\theta$ can be efficiently calculated by back substitution. For triangular $Q$ with positive diagonals, $Q^T Q$ is known to be the Cholesky factorization of $P$.
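The dense preconditioner update can be sketched as below. This is a minimal illustration under stated assumptions, not the package implementation: it uses an upper triangular factor, a single sample per update in place of the expectation, the max norm for step normalization (which absorbs the constant factor 2), and a hypothetical 2 x 2 positive definite Hessian without gradient noise. All function names are made up.

```python
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def solve_transposed(q, v):
    """Solve Q^T b = v for upper triangular Q by substitution."""
    b = [0.0] * len(v)
    for i in range(len(v)):
        b[i] = (v[i] - sum(q[k][i] * b[k] for k in range(i))) / q[i][i]
    return b

def update_dense_q(q, dtheta, dg, mu0=0.02):
    """One relative gradient step, Q <- Q - mu * grad @ Q, with normalized mu."""
    n = len(q)
    a = matvec(q, dg)                 # Q dg
    b = solve_transposed(q, dtheta)   # Q^{-T} dtheta
    # Upper triangular part of a a^T - b b^T keeps Q upper triangular.
    grad = [[a[i] * a[j] - b[i] * b[j] if j >= i else 0.0 for j in range(n)]
            for i in range(n)]
    max_abs = max(abs(g) for row in grad for g in row)
    mu = mu0 / (max_abs + 1e-300)     # max norm normalization, guarded against 0
    gq = [[sum(grad[i][k] * q[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[q[i][j] - mu * gq[i][j] for j in range(n)] for i in range(n)]

def residual(q, h):
    """Frobenius norm of P H - I, where P = Q^T Q."""
    n = len(q)
    p = [[sum(q[k][i] * q[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    ph = [[sum(p[i][k] * h[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return sum((ph[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n)) ** 0.5

random.seed(0)
H = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical positive definite Hessian
Q = [[1.0, 0.0], [0.0, 1.0]]   # preconditioner initialized with identity
residual_before = residual(Q, H)
for _ in range(3000):
    dtheta = [random.gauss(0.0, 1.0) for _ in range(2)]
    dg = matvec(H, dtheta)     # noiseless gradient perturbation
    Q = update_dense_q(Q, dtheta, dg)
residual_after = residual(Q, H)
```

Because every step rescales the gradient to a fixed relative size, the factor keeps fluctuating slightly around the optimum instead of settling exactly, so the residual shrinks substantially but does not reach zero.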
We let
Then, the relative gradients can be shown to be
where is the rewritten in matrix form. Again, by constraining and to be triangular matrices, can be efficiently calculated with back substitution.
For , we can update and separately as they are orthogonal, where denotes direct sum.
We assume has LU decomposition , where
are lower and upper triangular matrices, respectively. When and are diagonal matrices, this LU decomposition is sparse.
Let and . Then, the relative gradients of criterion (7) with respect to and can be shown to be
For sparse and , we let and . Then, with relationships
it is trivial to inverse and as long as and are small triangular matrices.
I. Sutskever, J. Martens, G. Dahl, and G. E. Hinton, “On the importance of momentum and initialization in deep learning,” in 30th Int. Conf. Machine Learning, Atlanta, 2013, pp. 1139–1147.
K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “On the properties of neural machine translation: encoder-decoder approaches,” in Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.