Pytorch implementation of preconditioned stochastic gradient descent
We study two types of preconditioners and preconditioned stochastic gradient descent (SGD) methods in a unified framework. We call the first one the Newton type due to its close relationship to the Newton method, and the second one the Fisher type as its preconditioner is closely related to the inverse of the Fisher information matrix. Both preconditioners can be derived from one framework, and efficiently learned on any matrix Lie group designated by the user using natural or relative gradient descent. Many existing preconditioners and methods are special cases of either the Newton type or the Fisher type. Experimental results on relatively large scale machine learning problems are reported for performance study.
This paper investigates the use of preconditioners for accelerating gradient descent, especially in large scale machine learning problems. Stochastic gradient descent (SGD) and its variations, e.g., momentum (Rumelhart et al., 1986; Nesterov, 1983), Adagrad and RMSProp (John et al., 2011), Adam (Kingma & Ba, 2015), etc., are popular choices due to their simplicity and wide applicability. These simple methods do not use well normalized step sizes, may converge slowly, and might involve more control parameters requiring fine tuning. Convex optimization is a well studied field (Boyd & Vandenberghe, 2004). Many off-the-shelf methods there, e.g., (nonlinear) conjugate gradient descent, quasi-Newton methods, Hessian-free optimizations, etc., can be applied to small and middle scale machine learning problems without much modification. However, these convex optimization methods may have difficulty in handling gradient noise and scaling up to problems with hundreds of millions of free parameters. For a large family of machine learning problems, natural gradient with the Fisher information metric is equivalent to a preconditioned gradient using the inverse of the Fisher information matrix as the preconditioner (Amari, 1998). Natural gradient and its variations, e.g., Kronecker-factored approximate curvature (KFAC) (Martens & Grosse, 2015) and the one in (Povey et al., 2015), all use such preconditioners. Other less popular choices are the equilibrated preconditioner (Dauphin et al., 2015) and the one proposed in (Li, 2018). Momentum or the heavy-ball method provides another independent way to accelerate convergence (Nesterov, 1983; Rumelhart et al., 1986). Furthermore, momentum and preconditioner can be combined to further accelerate convergence as shown in Adam (Kingma & Ba, 2015).
This paper groups the above mentioned preconditioners and preconditioned SGD methods into two classes, the Newton type and the Fisher type. The Newton type is closely related to the Newton method, and is suitable for general purpose optimization. The Fisher type preconditioner relates to the inverse of the Fisher information matrix, and is limited to a large subclass of stochastic optimization problems where the Fisher information metric can be well defined. Both preconditioners can be derived from one framework, and learned on any matrix Lie group designated by the user with almost the same natural or relative gradient descent methods.
We consider the minimization of cost function
$$f(\theta) = E_z[\ell(\theta, z)], \tag{1}$$
where $E_z$ takes expectation over random variable $z$, $\ell$ is a loss function, and $\theta$ is the model parameter vector to be optimized. For example, in a classification problem, $\ell$ could be the cross entropy loss, $z$ is a pair of input feature vector and class label, vector $\theta$ consists of all the trainable parameters in the classification model, and $E_z$ takes average over all samples from the training data set. By assuming second order differentiable model and loss, we could approximate $\ell(\theta, z)$ as a quadratic function of $\theta$ within a trust region around $\theta$, i.e., $\ell(\theta, z) = b_z^T\theta + 0.5\,\theta^T H_z\theta + a_z$, where $a_z$ is the sum of approximation error and constant term independent of $\theta$, $H_z$ is a symmetric matrix, and subscript $z$ in $b_z$, $H_z$ and $a_z$ reminds us that these three terms depend on $z$. Clearly, these three terms depend on $\theta$ as well. We do not explicitly show this dependence to simplify our notations since we just consider parameter updates in the same trust region. Now, we may rewrite (1) as
$$f(\theta) = b^T\theta + 0.5\,\theta^T H\theta + a, \tag{2}$$
where $b = E_z[b_z]$, $H = E_z[H_z]$, and $a = E_z[a_z]$. We do not impose any assumption, e.g., positive definiteness, on $H$ except for being symmetric. Thus, the quadratic surface in the trust region could be non-convex. To simplify our notations, we no longer consider the higher order approximation error included in $a$, and simply assume that $f(\theta)$ is a quadratic function of $\theta$ in the trust region.
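Let us consider a certain iteration. Preconditioned SGD updates $\theta$ as
$$\theta \leftarrow \theta - \mu P\,\frac{\partial \hat f(\theta)}{\partial\theta}, \tag{3}$$
where $\mu > 0$ is the step size, $\hat f(\theta)$ is an estimate of $f(\theta)$ obtained by replacing expectation with sample average, and positive definite matrix $P$ could be a fixed or adaptive preconditioner. By letting $\theta' = P^{-0.5}\theta$, we can rewrite (3) as
$$\theta' \leftarrow \theta' - \mu\,\frac{\partial \hat f(P^{0.5}\theta')}{\partial\theta'}, \tag{4}$$
where $P^{0.5}$ denotes the principal square root of $P$. Hence, (4) suggests that preconditioned SGD is equivalent to SGD in a transformed parameter domain. Within the considered trust region, let us write the stochastic gradient, $\hat g(\theta) = \partial\hat f(\theta)/\partial\theta$, explicitly as
$$\hat g(\theta) = \hat H\theta + \hat b, \tag{5}$$
where $\hat H$ and $\hat b$ are estimates of $H$ and $b$, respectively. Combining (5) with (3) gives the linear system
$$\theta \leftarrow (I - \mu P\hat H)\theta - \mu P\hat b \tag{6}$$
for updating $\theta$ within the assumed trust region, where $I$ is the identity matrix. A properly determined $P$ could significantly accelerate convergence of the locally linear system in (6).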
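To make the effect of the preconditioner on the linear system in (6) concrete, here is a minimal sketch in plain Python. It assumes a noise-free, ill-conditioned quadratic with a hypothetical Hessian `H = diag(100, 1)` and compares an identity preconditioner with the ideal choice `P = H^{-1}`; the helper names are illustrative only and are not taken from the paper's code.

```python
def matvec(M, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def preconditioned_step(theta, H, b, P, mu):
    """One update theta <- theta - mu * P * (H theta + b) on a quadratic."""
    g = [hi + bi for hi, bi in zip(matvec(H, theta), b)]
    pg = matvec(P, g)
    return [t - mu * d for t, d in zip(theta, pg)]


# Hypothetical ill-conditioned quadratic: f = 0.5 * theta^T H theta.
H = [[100.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
H_inv = [[0.01, 0.0], [0.0, 1.0]]

theta_plain = [1.0, 1.0]
theta_pre = [1.0, 1.0]
for _ in range(50):
    # Plain gradient descent: the step size is capped by the largest eigenvalue.
    theta_plain = preconditioned_step(theta_plain, H, b, identity, 0.01)
    # Preconditioned descent: a normalized step size works for every direction.
    theta_pre = preconditioned_step(theta_pre, H, b, H_inv, 0.5)

print(theta_plain, theta_pre)
```

With the identity preconditioner the slow direction contracts by only 0.99 per step, while with `P = H^{-1}` every direction contracts by the same factor `1 - mu`.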
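We review a few facts shown in (Li, 2018) before introducing our main contributions. Let $\delta\theta$ be a random perturbation of $\theta$, and be small enough such that $\theta + \delta\theta$ still resides in the same trust region. Then, (5) suggests the following resultant perturbation of stochastic gradient,
$$\delta\hat g = \hat H\delta\theta = H\delta\theta + \varepsilon, \tag{7}$$
where $\varepsilon$ accounts for the error due to replacing $\hat H$ with $H$. Note that by definition, $\delta\hat g$ is a random vector dependent on both $z$ and $\delta\theta$. The preconditioner in (Li, 2018) is pursued by minimizing criterion
$$c(P) = E_{\delta\theta, z}\left[\delta\hat g^T P\,\delta\hat g + \delta\theta^T P^{-1}\delta\theta\right], \tag{8}$$
where subscript $\delta\theta$ in $E_{\delta\theta, z}$ denotes taking expectation over $\delta\theta$. Under mild conditions, criterion (8) determines a unique positive definite $P$, which is optimal in the sense that it preconditions the stochastic gradient such that
$$P\,E_{\delta\theta, z}\left[\delta\hat g\,\delta\hat g^T\right]P = E_{\delta\theta, z}\left[\delta\theta\,\delta\theta^T\right], \tag{9}$$
which is comparable to relationship $H^{-1}\delta g\,\delta g^T H^{-1} = \delta\theta\,\delta\theta^T$, where $\delta g = H\delta\theta$ is the perturbation of noiseless gradient, and we assume that $H$ is invertible, but not necessarily positive definite. Clearly, this preconditioner is comparable to $H^{-1}$. Preconditioned SGD with this preconditioner inherits the scale-invariance property of the Newton method, regardless of the amount of gradient noise.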
Note that in the presence of gradient noise, the optimal $P$ and $P^{-1}$ given by (9) are not unbiased estimates of $H^{-1}$ and $H$, respectively. Actually, even if $H$ is positive definite and available, $H^{-1}$ may not always be a good preconditioner since it could significantly amplify the gradient noise along the directions of the eigenvectors of $H$
associated with small eigenvalues, and lead to divergence. More specifically, it is shown in(Li, 2018) that , where means that is nonnegative definite.
Preconditioner estimation criterion (8) requires $\delta\theta$ to be small enough such that $\theta$ and $\theta + \delta\theta$ reside in the same trust region. In practice, numerical error might be an issue when handling small numbers with floating point arithmetic. This concern becomes more serious with the popularity of single and even half precision math in large scale neural network training. Luckily, (7) relates $\delta\hat g$ to a Hessian-vector product, which can be efficiently evaluated with automatic differentiation software tools. Let $v$ be a random vector with the same dimension as $\theta$. Then, (5) suggests the following method for Hessian-vector product evaluation,
$$\hat H v = \frac{\partial\left(v^T\hat g(\theta)\right)}{\partial\theta}. \tag{10}$$
Now, replacing $(\delta\theta, \delta\hat g)$ in (8) with $(v, \hat H v)$ leads to our following new preconditioner estimation criterion,
$$c_n(P) = E_{v, z}\left[v^T\hat H P\hat H v + v^T P^{-1} v\right], \tag{11}$$
where the subscript $v$ in $E_{v, z}$ suggests taking expectation over $v$. We no longer have the need to assume $v$ to be an arbitrarily small vector. It is important to note that this criterion only requires the Hessian-vector product. The Hessian itself is not of interest. We call (11) the Newton type preconditioner estimation criterion as the resultant preconditioned SGD method is closely related to the Newton method.
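The Hessian-vector product idea can be checked numerically: differentiating the scalar $v^T\hat g(\theta)$ with respect to $\theta$ yields $\hat H v$ without ever forming $\hat H$. The sketch below is only an illustration under stated assumptions. It uses the Rosenbrock function as a stand-in loss, and central differences as a stand-in for automatic differentiation (which is what one would use in practice); all function names are hypothetical.

```python
def grad(theta):
    """Analytic gradient of f(x, y) = 100 (y - x^2)^2 + (1 - x)^2."""
    x, y = theta
    return [-400.0 * x * (y - x * x) - 2.0 * (1.0 - x), 200.0 * (y - x * x)]


def hessian(theta):
    """Analytic Hessian, used here only to verify the result."""
    x, y = theta
    return [[1200.0 * x * x - 400.0 * y + 2.0, -400.0 * x],
            [-400.0 * x, 200.0]]


def hvp_from_gradient(theta, v, eps=1e-5):
    """Approximate d(v^T g(theta)) / d theta by central differences."""
    def phi(t):
        return sum(vi * gi for vi, gi in zip(v, grad(t)))

    out = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        out.append((phi(plus) - phi(minus)) / (2.0 * eps))
    return out


theta = [-1.0, 1.0]
v = [0.3, -0.7]
approx = hvp_from_gradient(theta, v)
exact = [sum(h * vi for h, vi in zip(row, v)) for row in hessian(theta)]
print(approx, exact)
```

Note that `v` does not need to be small here, in contrast to the perturbation $\delta\theta$ required by criterion (8).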
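We consider parameter estimation problems where the Fisher information matrix, $F$, is well defined. Replacing $(\delta\theta, \delta\hat g)$ in (8) with $(v, \hat g + \lambda v)$ leads to criterion
$$c_f(P) = E_{v, z}\left[(\hat g + \lambda v)^T P(\hat g + \lambda v) + v^T P^{-1} v\right], \tag{12}$$
where $\hat g$ is a shorthand for stochastic gradient $\hat g(\theta)$, and $\lambda \ge 0$ is a damping factor. Clearly, $v$ is independent of $\hat g$. Let us further assume that $v$ is drawn from the standard multivariate normal distribution, i.e., $E[v] = 0$ and $E[vv^T] = I$. Then, we could simplify $c_f(P)$ as
$$c_f(P) = E_z\left[\hat g^T P\hat g\right] + \lambda^2\,\mathrm{tr}(P) + \mathrm{tr}(P^{-1}). \tag{13}$$
By letting the derivative of $c_f(P)$ with respect to $P$ be zero, the optimal positive definite solution for $P$ is readily shown to be
$$P = \left(E_z\left[\hat g\hat g^T\right] + \lambda^2 I\right)^{-0.5}. \tag{14}$$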
When is a gradient estimation obtained by taking average over independent samples, is related to the Fisher information matrix by
We call this preconditioner the Fisher type one due to its close relationship to the Fisher information matrix. One can easily modify this preconditioner to obtain an unbiased estimation of . Let be an exponential moving average of . Then, after replacing the in (14) with and setting , will be an unbiased estimation of . Generally, it might be acceptable to keep the bias term, , in (15) for two reasons: it is nonnegative definite and regularizes the inversion in (14
); it vanishes when the parameters approach a stationary point. Actually, the Fisher information matrix could be singular for many commonly used models, e.g., finite mixture models, neural networks, and hidden Markov models. We might not be able to invert $F$ for these singular statistical models without using regularization or damping. A Fisher type preconditioner with $\lambda > 0$ loses the scale-invariance property of a Newton type preconditioner. Both $P$ and $P^2$ can be useful preconditioners when the step size $\mu$ and damping factor $\lambda$ are set properly.
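For readers who want the intermediate steps, the following is a sketch of how criterion (12) simplifies to (13) and then yields the closed-form solution (14). It assumes that (12) has the form $c_f(P) = E_{v,z}[(\hat g + \lambda v)^T P(\hat g + \lambda v) + v^T P^{-1}v]$, that $v$ is independent of $\hat g$ with $E[v] = 0$ and $E[vv^T] = I$, and that $P$ is symmetric positive definite; the symbols are our reading of the text, not a quotation of the original equations.

```latex
\begin{aligned}
c_f(P) &= E_z\!\left[\hat g^T P \hat g\right]
        + 2\lambda\, E_v[v]^T P\, E_z[\hat g]
        + \lambda^2 E_v\!\left[v^T P v\right]
        + E_v\!\left[v^T P^{-1} v\right] \\
       &= E_z\!\left[\hat g^T P \hat g\right]
        + \lambda^2\, \mathrm{tr}(P) + \mathrm{tr}(P^{-1}), \\
\frac{\partial c_f(P)}{\partial P}
       &= E_z\!\left[\hat g \hat g^T\right] + \lambda^2 I - P^{-2} = 0
\;\;\Longrightarrow\;\;
P = \left(E_z\!\left[\hat g \hat g^T\right] + \lambda^2 I\right)^{-1/2}.
\end{aligned}
```

The cross term vanishes because $E_v[v] = 0$, and the two trace terms follow from $E_v[v^T A v] = \mathrm{tr}(A\,E_v[vv^T]) = \mathrm{tr}(A)$.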
Following the ideas in (Li, 2018), we can show that (11) determines a unique positive definite preconditioner if and only if $E_z[\hat H^2]$ is positive definite and has distinct eigenvalues. Other minimum solutions of criterion (11) are either indefinite or negative definite, and are not of interest for our purpose. The proof itself has limited novelty. We omit it here. Instead, let us consider the simplest case, where $\theta$ is a scalar parameter, to gain some intuitive understanding of criterion (11). For a scalar parameter, it is trivial to show that the optimal solutions minimizing (11) are
$$p = \pm\frac{1}{\sqrt{h^2 + E_z\left[(\hat h - h)^2\right]}}, \tag{16}$$
where the matrix quantities are replaced with their plain lower case scalar counterparts, and we have used the fact that $v$ and $\hat h$ are independent. For gradient descent, we choose the positive solution, although the negative one gives the global minimum of (11). With the positive preconditioner, the eigenvalue of $PH$ in the locally linear system in (6) is
$$ph = \frac{h}{\sqrt{h^2 + E_z\left[(\hat h - h)^2\right]}}. \tag{17}$$
Now, it is clear that this optimal preconditioner damps the gradient noise when $E_z[(\hat h - h)^2]$ is large, and preconditions the locally linear system in (6) such that its eigenvalue has unitary amplitude when the gradient noise vanishes. Convergence is ensured when a normalized step size, i.e., $0 < \mu < 1$, is used. For $\theta$ with higher dimensions, eigenvalues of the locally linear system in (6) are normalized into range $[-1, 1]$ as well, in a way similar to (17).
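The scalar case can be explored numerically. The sketch below assumes the scalar form of criterion (11) is $c(p) = p\,E_z[\hat h^2] + 1/p$ with $E_z[\hat h^2] = h^2 + \sigma^2$, where $\sigma^2$ (named `noise_var` here, a hypothetical parameter) is the variance of the Hessian estimate. It prints the positive minimizer and the resulting product $ph$ for increasing noise levels.

```python
import math


def optimal_scalar_preconditioner(h, noise_var):
    """Positive minimizer of c(p) = p * (h^2 + noise_var) + 1 / p."""
    return 1.0 / math.sqrt(h * h + noise_var)


def criterion(p, h, noise_var):
    """Scalar preconditioner estimation criterion under the stated assumption."""
    return p * (h * h + noise_var) + 1.0 / p


h = 2.0
for noise_var in (0.0, 4.0, 400.0):
    p = optimal_scalar_preconditioner(h, noise_var)
    # p * h is 1 without noise and shrinks toward 0 as the noise grows.
    print(noise_var, p, p * h)
```

Without noise the product $ph$ equals one, so a step size in $(0, 1)$ contracts the error; with heavy noise the preconditioner shrinks and the update is damped.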
Let us take the Newton type preconditioner as an example to derive its learning rule. Learning rule for the Fisher type preconditioner is the same except for replacing the Hessian-vector product with stochastic gradient. Here, Lie group always refers to the matrix Lie group.
It is inconvenient to optimize $P$ directly as it must be a symmetric and positive definite matrix. Instead, we represent the preconditioner as $P = Q^TQ$, and learn $Q$. Now, $Q$ must be a nonsingular matrix as both $c_n(P)$ and $c_f(P)$ diverge when $P$ is singular. Invertible matrices with the same dimension form a Lie group. In practice, we are more interested in Lie groups with sparse representations. Examples of such groups are given in the next section. Let us consider a proper small perturbation of $Q$, $\delta Q$, such that $Q + \delta Q$ still lives on the same Lie group. The distance between $Q$ and $Q + \delta Q$ can be naturally defined as $\mathrm{dist}(Q, Q + \delta Q) = \sqrt{\mathrm{tr}(\delta Q\,Q^{-1}Q^{-T}\delta Q^T)}$ (Amari, 1998). Intuitively, this distance is larger for the same amount of perturbation when $Q$ is closer to singular. With the above tensor metric definition, the natural gradient for learning $Q$ has form
$$\tilde\nabla_Q = \frac{\partial c}{\partial Q}\,Q^TQ. \tag{18}$$
For example, when $Q$ lives on the group of invertible upper triangular matrices, $\tilde\nabla_Q$ is given by
$$\tilde\nabla_Q = 2\,\mathrm{triu}\!\left(E_{v, z}\left[Q\hat h\hat h^TQ^T - Q^{-T}vv^TQ^{-1}\right]\right)Q, \tag{19}$$
where $\hat h = \hat H v$, and $\mathrm{triu}$ takes the upper triangular part of a matrix. Another way to derive (18) is to let $\delta Q = \mathcal{E}Q$, and consider the derivative with respect to $\mathcal{E}$, where $\mathcal{E}$ is a proper small matrix such that $Q + \mathcal{E}Q$ still lives on the same Lie group. Gradient derived in this way is known as relative gradient (Cardoso & Laheld, 1996). For our preconditioner learning problem, relative gradient and natural gradient have the same form. Now, $Q$ can be updated using natural or relative gradient descent as
$$Q \leftarrow Q - \mu\tilde\nabla_Q. \tag{20}$$
In practice, it is convenient to use the following learning rule with normalized step size,
$$Q \leftarrow Q - \mu_0\,\frac{\hat\nabla_Q}{\|\hat\nabla_Q\|}\,Q, \tag{21}$$
where $\hat\nabla_Q = \tilde\nabla_Q Q^{-1}$, and $\|\cdot\|$ takes the norm of a matrix. One simple choice for matrix norm is the maximum absolute value of a matrix.
Note that natural gradient can take different forms. One should not confuse the natural gradient on the Lie group derived from a tensor metric with the natural gradient for parameter estimation derived from a Fisher information metric.
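One iteration of the Newton type preconditioned SGD consists of the following steps.
1. Evaluate stochastic gradient $\hat g(\theta)$.
2. Draw $v$ from $\mathcal{N}(0, I)$, and evaluate Hessian-vector product $\hat h = \partial(v^T\hat g(\theta))/\partial\theta$.
3. Update parameters with $\theta \leftarrow \theta - \mu_1 Q^TQ\,\hat g(\theta)$.
4. Update preconditioners with $Q \leftarrow Q - \mu_2\,\hat\nabla_Q Q/\|\hat\nabla_Q\|$.
The two step sizes, $\mu_1$ and $\mu_2$, are normalized. They should take values in range $[0, 1]$. The specific form of $\hat\nabla_Q$ depends on the Lie group to be considered. For example, for upper triangular $Q$, we have $\hat\nabla_Q = \mathrm{triu}(Q\hat h\hat h^TQ^T - Q^{-T}vv^TQ^{-1})$, where $Q^{-T}v$ can be efficiently calculated with back substitution.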
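The steps above can be sketched for a small dense problem as follows. This is a minimal illustration in plain Python, not the author's implementation: it assumes an upper triangular `Q`, the max-absolute-value matrix norm, and a relative gradient of the form `triu(a a^T - u u^T)` with `a = Q h` and `u = Q^{-T} v`. The function names and the toy quadratic are hypothetical.

```python
def matvec(M, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def solve_upper_transposed(Q, v):
    """Solve Q^T u = v for upper triangular Q (Q^T is lower triangular)."""
    n = len(v)
    u = [0.0] * n
    for i in range(n):
        s = v[i] - sum(Q[k][i] * u[k] for k in range(i))
        u[i] = s / Q[i][i]
    return u


def newton_psgd_step(theta, Q, g, h, v, mu1, mu2):
    """One parameter update followed by one preconditioner update."""
    n = len(theta)
    # Parameter update: theta <- theta - mu1 * Q^T Q g.
    Qg = matvec(Q, g)
    Pg = [sum(Q[k][i] * Qg[k] for k in range(n)) for i in range(n)]
    new_theta = [t - mu1 * d for t, d in zip(theta, Pg)]
    # Preconditioner update on the group of upper triangular matrices.
    a = matvec(Q, h)
    u = solve_upper_transposed(Q, v)
    grad = [[a[i] * a[j] - u[i] * u[j] if j >= i else 0.0 for j in range(n)]
            for i in range(n)]
    norm = max(abs(x) for row in grad for x in row)
    if norm > 0.0:
        step = [[sum(grad[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
        Q = [[Q[i][j] - mu2 * step[i][j] / norm for j in range(n)]
             for i in range(n)]
    return new_theta, Q


# Toy quadratic with H = diag(100, 1): gradient g = H theta, product h = H v.
theta = [1.0, 1.0]
Q = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]  # fixed here for reproducibility; normally drawn from N(0, I)
g = [100.0 * theta[0], theta[1]]
h = [100.0 * v[0], v[1]]
theta, Q = newton_psgd_step(theta, Q, g, h, v, mu1=0.01, mu2=0.1)
print(theta, Q)
```

Because the product of two upper triangular matrices is upper triangular, `Q` stays on the group after each update, and no explicit matrix inverse is formed.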
We only need to replace $\hat h$ in the Newton type preconditioned SGD with $\hat g + \lambda v$ to obtain the Fisher type one. Its one iteration consists of the following steps.
1. Evaluate stochastic gradient $\hat g(\theta)$.
2. Update parameters with $\theta \leftarrow \theta - \mu_1 Q^TQ\,\hat g(\theta)$.
3. Update preconditioners with $Q \leftarrow Q - \mu_2\,\hat\nabla_Q Q/\|\hat\nabla_Q\|$.
Only the step size for preconditioner updating is normalized. There is no simple way to jointly determine the proper ranges for step size $\mu_1$ and damping factor $\lambda$. Again, $\hat\nabla_Q$ may take different forms on different Lie groups. For upper triangular $Q$, we have $\hat\nabla_Q = \mathrm{triu}(Q\tilde g\tilde g^TQ^T - Q^{-T}vv^TQ^{-1})$, where $\tilde g = \hat g + \lambda v$. Here, it is important to note that the natural or relative gradient for $c_f(P)$ with the form given in (13) involves explicit matrix inversion. However, matrix inversion can be avoided by using the $c_f(P)$ in (12), which includes $v$ as an auxiliary variable. It is highly recommended to avoid explicit matrix inversion for large $Q$.
There are many ways to modify the above preconditioned SGD methods. Since curvature typically evolves more slowly than gradients, one can update the preconditioner less frequently to reduce the average wall time per iteration. Combining preconditioner and momentum may further accelerate convergence. For recurrent neural network learning, we may need to clip the norm of preconditioned gradients to avoid excessively large parameter updates. Most importantly, we can choose different Lie groups for learning our preconditioners to achieve a good trade-off between performance and complexity.
In practice, we seldom consider the Lie group consisting of dense invertible matrices for preconditioner estimation when the problem size is large. Lie groups with sparse structures are of more interest. To begin with, let us recall a few facts about Lie groups. If $\mathcal{A}$ and $\mathcal{B}$ are two Lie groups, then $\mathcal{A}^T$, $\mathcal{A}\otimes\mathcal{B}$, and $\mathcal{A}\oplus\mathcal{B}$ all are Lie groups, where $\otimes$ and $\oplus$ denote Kronecker product and direct sum, respectively. Furthermore, for any matrix $C$ with compatible dimensions, block matrix
$$\begin{bmatrix} A & C \\ 0 & B \end{bmatrix}, \quad A\in\mathcal{A},\; B\in\mathcal{B}, \tag{22}$$
still forms a Lie group. We do not show proofs of the above statements here as they are no more than a few lines of algebraic operations. These simple rules can be used to design many useful Lie groups for preconditioner learning. We already know that invertible upper triangular matrices form a Lie group. Here, we list a few useful ones with sparse representations.
Diagonal matrices with the same dimension and positive diagonal entries form a Lie group with reducible representation. Preconditioners learned on this group are called diagonal preconditioners.
For matrix parameter $\Theta$, we can flatten $\Theta$ into a vector, and precondition its gradient using a Kronecker product preconditioner with $Q$ having form $Q = Q_2\otimes Q_1$. Clearly, $Q$ is a Lie group as long as $Q_1$ and $Q_2$ are two Lie groups. Let us check its role in learning the following affine transformation
$$y = \Theta x, \tag{23}$$
where $x$ is the input feature vector augmented with $1$, and $y$ is the output feature vector. After reverting the flattened $\Theta$ back to its matrix form, the preconditioned SGD learning rule for $\Theta$ is
$$\Theta \leftarrow \Theta - \mu\,Q_2^TQ_2\,\frac{\partial\hat f}{\partial\Theta}\,Q_1^TQ_1. \tag{24}$$
By letting $\Theta' = Q_2^{-T}\Theta Q_1^{-1}$, (24) can be rewritten as
$$\Theta' \leftarrow \Theta' - \mu\,\frac{\partial\hat f}{\partial\Theta'}. \tag{25}$$
Correspondingly, the affine transformation in (23) is rewritten as $y' = \Theta'x'$, where $x' = Q_1x$ and $y' = Q_2^{-T}y$ are the transformed input and output feature vectors, respectively. Hence, the preconditioned SGD in (24) is equivalent to the SGD in (25) with transformed feature vectors $x'$ and $y'$. We know that feature whitening and normalization could significantly accelerate convergence. A Kronecker product preconditioner plays a similar role in learning the affine transformation in (23).
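The reason a Kronecker product preconditioner is cheap to apply is that multiplying the flattened gradient by a Kronecker product equals a two-sided matrix multiplication on the unflattened gradient. The sketch below checks this identity with small integer matrices. It assumes row-major flattening, for which $(A\otimes B)\,\mathrm{vec}(G) = \mathrm{vec}(AGB^T)$; the factor names `Q1` (input side) and `Q2` (output side) and their values are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def kron(A, B):
    """Kronecker product A (x) B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def flatten(G):
    """Row-major flattening of a matrix."""
    return [x for row in G for x in row]


Q2 = [[2, 0], [0, 3]]                      # output side, 2 x 2
Q1 = [[1, 0, -1], [0, 2, -3], [0, 0, 1]]   # input side, 3 x 3
G = [[1, 2, 3], [4, 5, 6]]                 # gradient of a 2 x 3 matrix parameter

P2 = matmul(transpose(Q2), Q2)
P1 = matmul(transpose(Q1), Q1)

# Dense route: build the 6 x 6 preconditioner and apply it to the flat gradient.
P = kron(P2, P1)
flat = [sum(p * g for p, g in zip(row, flatten(G))) for row in P]

# Structured route: two small matrix products, no 6 x 6 matrix needed.
structured = matmul(matmul(P2, G), P1)

print(flat == flatten(structured))
```

For a matrix parameter with m rows and n columns, the structured route stores about m² + n² preconditioner entries instead of (mn)².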
The scaling-and-normalization preconditioner is a special Kronecker product preconditioner obtained by constraining $Q_2$ to be a diagonal matrix, and $Q_1$ to be a sparse matrix where only its diagonal and last column can have nonzero values. Note that $Q_1$ with nonzero diagonal entries forms a Lie group. Hence, $Q = Q_2\otimes Q_1$ is a Lie group as well. We call it a scaling-and-normalization preconditioner as it resembles a preconditioner that scales the output features and normalizes the input features. Let us check the transformed features $y'$ and $x'$. It is clear that $y'$ is an element-wise scaled version of $y$ as $Q_2$ is a diagonal matrix. To make $x'$ a “normalized” feature vector, $x$ needs to be an input feature vector augmented with $1$. Let us check a simple example to verify this point. We consider an input vector with two features, and write down its normalized features explicitly as below,
$$\begin{bmatrix} x_1' \\ x_2' \\ 1 \end{bmatrix} = \begin{bmatrix} 1/\sigma_1 & 0 & -m_1/\sigma_1 \\ 0 & 1/\sigma_2 & -m_2/\sigma_2 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix}, \tag{26}$$
where $m_i$ and $\sigma_i$ are the mean and standard deviation of $x_i$, respectively. It is straightforward to show that the feature normalization operation in (26) forms a Lie group with four freedoms. For the scaling-and-normalization preconditioner, we have no need to force the last diagonal entry of $Q_1$ to be $1$. Hence, the group of feature normalization operation is a subgroup of $Q_1$.
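The structure of the feature normalization operation in (26) can be illustrated with a short sketch. It assumes the normalization matrix has entries `1/std` on the diagonal, `-mean/std` in the last column, and `1` in the bottom-right corner, acting on a feature vector augmented with `1`. It also checks that the product of two such matrices keeps the same sparsity pattern, which is the closure property a group needs. All names and numbers are illustrative.

```python
def normalization_matrix(means, stds):
    """Matrix with nonzeros only on its diagonal and in its last column."""
    n = len(means)
    Q = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, (m, s) in enumerate(zip(means, stds)):
        Q[i][i] = 1.0 / s
        Q[i][n] = -m / s
    Q[n][n] = 1.0
    return Q


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def has_normalization_pattern(Q):
    """True if every entry off the diagonal and off the last column is zero."""
    n = len(Q) - 1
    return all(Q[i][j] == 0.0
               for i in range(n + 1) for j in range(n) if i != j)


Q_a = normalization_matrix([1.0, 4.0], [2.0, 3.0])
Q_b = normalization_matrix([-0.5, 2.0], [0.5, 4.0])

x = [3.0, 10.0, 1.0]      # two features augmented with 1
x_norm = matvec(Q_a, x)   # approximately [(3 - 1) / 2, (10 - 4) / 3, 1]

product = matmul(Q_a, Q_b)
print(x_norm, has_normalization_pattern(product))
```

With two features there are four free parameters (two means and two standard deviations), matching the four freedoms mentioned in the text.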
The scaling-and-whitening preconditioner is another special Kronecker product preconditioner obtained by constraining $Q_2$ to be a diagonal matrix, and $Q_1$ to be an upper triangular matrix with positive diagonal entries. We call it a scaling-and-whitening preconditioner since it resembles a preconditioner that scales the output features and whitens the input features. Again, the input feature vector $x$ must be augmented with $1$ such that the whitening operation forms a Lie group represented by upper triangular matrices with $1$ being its last diagonal entry. This is a subgroup of $Q_1$ as we have no need to fix $Q_1$’s last diagonal entry to $1$.
It is not possible to enumerate all kinds of Lie groups suitable for preconditioner learning. For example, a Kronecker product preconditioner with form $Q = Q_3\otimes Q_2\otimes Q_1$ could be suitable for preconditioning gradients of a third order tensor. The normalization and whitening groups are just two special cases of the groups with the form shown in (22), and there are numerous other choices with sparsity between that of these two. Regardless of the detailed form of $Q$, all such preconditioners share the same form of learning rule shown in (21). Without much tuning effort, they all can be efficiently learned using natural or relative gradient descent with normalized step size.
Adagrad, RMSProp and Adam all use Fisher type preconditioners living on the group of diagonal matrices with positive diagonal entries. This is a simple group. The optimal solution for $c_f(P)$ has closed form $P = \mathrm{diag}\left(1 \oslash \sqrt{E_z[\hat g\odot\hat g] + \lambda^2}\right)$, where $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively. In practice, a simple exponential moving average is used to replace the expectation when using this preconditioner.
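A diagonal Fisher type preconditioner of this kind can be sketched in a few lines. The class below is an illustrative assumption, not code from the paper or from any library: it tracks an exponential moving average of the squared gradient and divides the gradient element-wise by the square root of that average plus the squared damping factor, in the spirit of RMSProp without bias correction.

```python
import math


class DiagonalFisherPreconditioner:
    """Element-wise preconditioner p_i = 1 / sqrt(E[g_i^2] + damping^2)."""

    def __init__(self, dim, beta=0.99, damping=1e-8):
        self.beta = beta            # moving average factor (hypothetical default)
        self.damping = damping      # plays the role of the damping factor
        self.second_moment = [0.0] * dim

    def update(self, g):
        """Refresh the moving average of the element-wise squared gradient."""
        self.second_moment = [self.beta * s + (1.0 - self.beta) * gi * gi
                              for s, gi in zip(self.second_moment, g)]

    def precondition(self, g):
        """Return the preconditioned gradient P g."""
        return [gi / math.sqrt(s + self.damping ** 2)
                for gi, s in zip(g, self.second_moment)]


pre = DiagonalFisherPreconditioner(dim=2, beta=0.0, damping=0.0)
grad = [3.0, -4.0]
pre.update(grad)
print(pre.precondition(grad))
```

With `beta=0.0` and no damping the preconditioned gradient reduces to the sign of each gradient entry, which shows how the preconditioner equalizes the scale across coordinates.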
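For a diagonal preconditioner, the optimal solution minimizing $c_n(P)$ has closed form $P = \mathrm{diag}\left(\sqrt{E_v[v\odot v] \oslash E_{v, z}[\hat h\odot\hat h]}\right)$, where $\hat h = \hat Hv$. For $v\sim\mathcal{N}(0, I)$, $E_v[v\odot v]$ reduces to a vector with unit entries. Then, this optimal solution gives the equilibration preconditioner in (Dauphin et al., 2015).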
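As a short worked derivation of the diagonal case, assume the Newton type criterion has the form $c_n(P) = E_{v,z}[\hat h^T P\hat h + v^T P^{-1}v]$ with $\hat h$ the Hessian-vector product, and restrict $P$ to $\mathrm{diag}(p_1,\dots,p_n)$ with $p_i > 0$. The criterion then decouples across coordinates, which is a sketch of why a closed-form solution exists:

```latex
\begin{aligned}
c_n(P) &= \sum_{i=1}^{n}\left(p_i\,E_{v,z}\!\left[\hat h_i^2\right]
          + \frac{E_v\!\left[v_i^2\right]}{p_i}\right), \\
\frac{\partial c_n}{\partial p_i}
       &= E_{v,z}\!\left[\hat h_i^2\right] - \frac{E_v\!\left[v_i^2\right]}{p_i^2} = 0
\;\;\Longrightarrow\;\;
p_i = \sqrt{\frac{E_v\!\left[v_i^2\right]}{E_{v,z}\!\left[\hat h_i^2\right]}} .
\end{aligned}
```

When $E_v[v_i^2] = 1$, each $p_i$ is the reciprocal root-mean-square of the corresponding entry of the Hessian-vector product.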
The preconditioners considered in (Povey et al., 2015) and (Martens & Grosse, 2015) are closely related to the Fisher type Kronecker product preconditioners. Since all these Kronecker product preconditioners approximate the same inverse of the Fisher information matrix, we can hardly tell which one is theoretically better. In practice, we learn this type of preconditioner by minimizing criterion $c_f(P)$ using natural or relative gradient descent. Compared with other methods, one distinct advantage of our method is that explicit matrix inversion is avoided by introducing auxiliary vector $v$. Another advantage is that our method is derived from a unified framework. There is no need to invent different preconditioner learning rules when we switch the Lie group representations.
Batch normalization can be viewed as preconditioned SGD using a specific scaling-and-normalization preconditioner with constraint $Q_2 = I$ and $Q_1$ from the feature normalization Lie group. However, we should be aware that explicit input feature normalization is only empirically shown to accelerate convergence, and has little meaning in certain scenarios, e.g., recurrent neural network learning where features may not have any stationary first and second order statistics. Both the Newton and Fisher type preconditioned SGD methods provide a more general and principled approach to find the optimal preconditioner, and apply to a broader range of applications. Generally, a scaling-and-normalization preconditioner does not necessarily “normalize” the input features in the sense of mean removal and variance normalization.
Let us consider the minimization of Rosenbrock function, , starting from initial guess
. This is a well known benchmark problem for mathematical optimization. The compared methods use fixed step size. For each method, the best step size is selected from sequence. For gradient descent, the best step size is . For momentum method, the moving average factor is , and the best step size is
. For Nesterov momentum, the best step size is. For preconditioned SGD, is initialized to and lives on the group of triangular matrices. For the Fisher type method, we set , and step sizes and for preconditioner and parameter updates, respectively. For the Newton type method, we set step sizes and for preconditioner and parameter updates, respectively. Figure 1 summarizes the results. The Newton type method performs the best, converging to the optimal solution using about iterations. The Fisher type method does not fit into this problem, and performs poorly as expected. Mathematical optimization is not our focus. Still, this example shows that the Newton type preconditioned SGD works well for mathematical optimization.
We consider the ImageNet ILSVRC2012 database for the image classification task. The well known AlexNet is considered. We follow the descriptions in (Alex et al., 2012)
as closely as possible to set up our experiment. One main difference is that we do not augment the training data. Another big difference is that we use a modified local response normalization (LRN). The LRN function from TensorFlow implementation is not second order differentiable. We have to approximate the local energy used for LRN with a properly scaled global energy to facilitate Hessian-vector product evaluation. Note that convolution can be rewritten as correlation between the flattened input image patches and filter coefficients. In this way, we find that there are eight matrices to be optimized in the AlexNet, and their shapes are:, and . We have tried diagonal and scaling-and-normalization preconditioners for each matrix. Denser preconditioners, e.g., the Kronecker product one, require hundreds of millions parameters for representations, and are too expensive to run on our platform. Each compared method is trained with epochs, mini-batch size , step size for the first epochs, and for the last epochs. We have compared several methods with multiple settings, and only report the ones with reasonably good results here. For Adam, the initial step size is set to . For batch normalization, initial step size is , and its moving average factors for momentum and statistics used for feature normalization are and , respectively. The momentum method uses initial step size , and moving average factor for momentum. Preconditioned SGD performs better with the scaling-and-normalization preconditioner. Its is initialized to , and updated with normalized step size . For the Fisher type preconditioner, we set and initial step size . For the Newton type preconditioner, its initial step size is . Figure 2 summarizes the results. Training loss for batch normalization is only for reference purpose as normalization alters the regularization term in AlexNet. We see that the scaling-and-normalization preconditioner does accelerate convergence, although it is super sparse. 
The Newton type preconditioned SGD performs the best, and achieves top-1 validation accuracy about when using only one crop for testing.
We consider the word level language modeling problem with reference implementation available from https://github.com/pytorch/examples. The Wikitext-2 database with tokens is considered. The task is to predict the next token from history observations. Our tested network consists of six layers, i.e., encoding layer, LSTM layer, dropout layer, LSTM layer, dropout layer, and decoding layer. For each LSTM layer, we put all its coefficients into a single matrix by defining output and augmented input feature vectors as in , where is a discrete time index, is the input, is the hidden state, and is the cell state. The encoding layer’s weight matrix is the transpose of that of the decoding layer. Thus, we get three matrices to be optimized in total. With hidden layer size , shapes of these three matrices are , , and , respectively. For all methods, the step size is reduced to one fourth of the current value whenever the current perplexity on the validation set is larger than the best one ever found. For SGD, the initial step size is , and the gradient is clipped with threshold . The momentum method diverges quickly even if we reduce the step size to . For Adam, we choose initial step size and damping factor . We have tried diagonal, scaling-and-normalization and scaling-and-whitening preconditioners for each matrix. The encoding (decoding) matrix is too large to consider a KFAC-like preconditioner. The diagonal preconditioner performs the worst, and the other two have comparable performance. For both types of preconditioned SGD, the clipping threshold for preconditioned gradient is , the initial step size is , and is initialized to . We set for the Fisher type preconditioned SGD. With dropout rate , the best three test perplexities are by SGD, by Newton type preconditioned SGD, and by Fisher type preconditioned SGD. With dropout rate , the best three test perplexities are by Newton type preconditioned SGD, by SGD, and by Fisher type preconditioned SGD.
Although SGD performs well on the test set, both types of preconditioned SGD have significantly lower training losses than SGD with either dropout rate. Figure 3 summarizes the results when the dropout rate is . Methods involving momentum perform poorly here. Again, both preconditioners accelerate convergence significantly despite their high sparsity.
Compared with SGD, the Fisher type preconditioned SGD adds limited computational complexity when sparse preconditioners are adopted. The Newton type preconditioned SGD requires a Hessian-vector product, which typically has complexity comparable to that of gradient evaluation. Thus, using SGD as the baseline, the Newton type preconditioned SGD approximately doubles the computational complexity per iteration, while the Fisher type preconditioned SGD has similar complexity. Wall time per iteration of preconditioned SGD highly depends on the implementation. Ideally, the preconditioners and parameters could be updated in a parallel and asynchronous way such that SGD and preconditioned SGD have comparable wall time per iteration.
We have put our TensorFlow and PyTorch implementations on https://github.com/lixilinx. More experimental results comparing different preconditioners on diverse benchmark problems can be found there. For the ImageNet experiment, all compared methods are implemented in TensorFlow, and require two days and a few hours to finish epochs on a GeForce GTX 1080 Ti GPU. The word level language modeling experiment is implemented in PyTorch. We have rewritten the word embedding function to enable second order derivatives. For this task, SGD and the Fisher type preconditioned SGD have similar wall time per iteration, while the Newton type method requires about more wall time per iteration than SGD when running on the same GPU.
Two types of preconditioners and preconditioned SGD methods are studied. The one requiring a Hessian-vector product for preconditioner estimation is suitable for general purpose optimization. We call it the Newton type preconditioned SGD due to its close relationship to the Newton method. The other one only requires the gradient for preconditioner estimation. We call it the Fisher type preconditioned SGD as its preconditioner is closely related to the inverse of the Fisher information matrix. Both preconditioners can be efficiently learned using natural or relative gradient descent on any matrix Lie group designated by the user. The Fisher type preconditioned SGD has lower computational complexity, but may require more tuning effort in selecting its step size and damping factor. The Newton type preconditioned SGD has higher computational complexity, but is more user friendly due to its use of normalized step size and built-in gradient noise damping ability. Both preconditioners, even with very sparse representations, are shown to considerably accelerate convergence on relatively large scale problems.
ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.