Covariance matrices, Hessian matrices and, more generally speaking, symmetric matrices, play a major role in machine learning. In density models containing covariance matrices (e.g.
mixtures of Gaussians), estimation and inference involves computing the inverse of such matrices and computing products of such matrices with vectors. The size of these matrices grow quadratically with the dimension of the space, and some of the computations grow cubically. This renders direct evaluations impractical in large dimension, making approximations necessary. Depending on the problem, different approaches have been proposed.
In Gaussian mixture models (GMM), multiple high-dimensional Gaussian functions must be evaluated. In very large models with high dimension and many mixture components, the computational cost can be prohibitive. Approximations are often used to reduce the computational complexity, including diagonal approximations, low-rank approximations, and shared covariance matrices between multiple mixture components.
In Bayesian inference with Gaussian models, and in variational inference using the Laplace approximation, one must compute high-dimensional Gaussian integrals to marginalize over the latent variables. Estimating and manipulating the covariance matrices and their inverse can quickly become expensive computationally.
In machine learning and statistical model estimation, objective functions must be optimized with respect to high-dimensional parameters. Exploiting the second-order properties of the objective function to speed up learning (or to regularize the estimation) is always a challenge when the dimension is large, particularly when the objective function is non quadratic, or even non convex (as is the case with deep learning models).
In optimization, quasi-Newton methods such as BFGS have been proposed to keep track the inverse Hessian as the optimization proceeds. While BFGS has quadratic complexity per iteration, Limited-Storage BFGS (LBFGS) uses a factorized form of the inverse Hessian approximation that reduces the complexity to linear times an adjustable factor [Noc80]. Quasi-Newton methods have been experimented with extensively to speed up optimization in machine learning, including LBFGS [BL89, LBOM98, SYG07, BBG09, NCL11], although the inherently batch nature of LBFGS has limited its applicability to large-scale learning [DCM12]. Some authors have addressed the issue of approximating the Hessian for ML systems with complex structures, such as deep learning architectures. This involves back-propagating curvatures through the computational graph of the function [BL89, LBOM98, CE11, MSS12], which is particularly easy when computing diagonal approximations or Hessians of recurrent neural nets. Others have attempted to identify a few dominant eigen-directions in which the curvatures are larger than the others, which can be seen as a low-rank Hessian approximation [LSP93]
2 Linearithmic Symmetric Matrix Approximation
A symmetric matrix of size can be factorized (diagonalized) as where is an orthogonal (rotation) matrix, and is a diagonal matrix. In general, multiplying a vector by requires operations, and the full set of orthogonal matrices has free parameters. The main idea of this paper is very simple: parameterize the rotation matrix as a product of elementary rotations within 2D planes spanned by pairs of coordinates (sometimes called Givens rotations). This makes the compuational complexity of the product of by a vector to linearithmic instead of quadratic. It also makes the computation of the inverse trivial
. The second idea is to compute the best such approximation of a matrix by least-square optimization with stochastic gradient descent.
The question is how good of an approximation to real-world covariance and Hessian matrices can we get by restricting ourselves to these linearithmic rotations instead of the full set of rotations.
The main intuition that the approximation may be valid comes from a conjecture in random matrix theory according to which the distribution of random matrices obtained randomly drawingsuccessive Givens rotations is dense in the space of rotation matrices [Ben13]
. Proven methods for generating orthogonal matrices with a uniform distribution require
Givens rotation, with an arrangement based on the butterfly operators in the Fast Fourier Transform[Gen98].
Indeed, to guarantee that pairs of coordinates are shuffled in a systematic way, we chose to arrange the Givens rotations in the same way as the butterfly operators in the Fast Fourier Transform. To simplify the discussion, we assume from now on that is a power of , unless specified otherwise. We decompose
where is a sparse rotation formed by independent pairwise rotations, between pairs of coordinates
where , and . For instance, with , we obtain the following matrices :
where and .
There are pairwise rotations, grouped into matrices, such that each matrice represents independent rotations. It is the minimal number of matrices that keeps the pairwise rotations structure, and this particular arrangement has the property that there can be an interaction between each pair of coordinates of the input.
The final decomposition of the symmetric matrix is therefore a sequence of sparse matrices, namely
Since all the matrices are rotations, we have and thus the decomposition of is the obtained by simply replacing the elements of by their inverse.
It is important to notice that the problem becomes easier if
has eigenspaces of high dimension. If there are groups of eivenvalues roughly equal, the rotations between vectors belonging to the corresponding eigenspaces become irrelevant. This is similar to low rank approximation, where all small eigenvalues are grouped into the same eigenspace, but it can work for eigenspaces with non-zero eigenvalues.
3 Approximating Hessian matrices
Let be the rotation parameters of the matrices and the diagonal elements of . Let us name the whole set of parameters, so
When we need the explicit parametrization, we denote the rotation matrix parametrized by and the diagonal matrix . Finally, let be the set of all matrices .
In order to find the best set of parameters , we first notice that the sequence of matrices in the decomposition (2) can be seen as a machine learning problem. Although the problem is unfortunately not convex, it can still be minimized by using standard gradient descent techniques, such that SGD or minibatch gradient descent.
Let be a set of -dimensional vectors. We train our model to predict the output when given the input . We use a least square loss, so the function we minimize is
However, the parameterization through the angles makes the problem complicated and hard to optimize. Therefore, during one gradient step, we relax the parametrization and allow each matrix to have
independent parameters, and we reproject the matrix on the set of the rotation matrices after the parameter update. The problem becomes learning a multilayer linear neural network, with sparse connections and shared weights.
Formally, we use an extended set of parameters, , which has one parameter per non-zero element of each matrix . The new set of matrices parametrized by , which we denote , contains matrices which are not rotation matrices. However, because of the Givens structure inside the matrices, it is trivial to project a matrix from on the set , by simply projecting every Givens on the set of the rotations. The projection of a Givens is
We summarize the learning procedure in Algorithm 1, using SGD. Computing the gradient of
is performed by a gradient backpropagation. Note that we don’t need the matrix, but only a way to compute the products .
The hyperparameter, which doesn’t have to be constant, is the learning rate. There is no reason that the same learning rate should be applied to the part of in and to the part in . Therefore, can be a diagonal matrix instead of a scalar. We have observed that the learning procedure converges faster when the part in has a smaller learning rate.
There are several ways to obtain the input vectors , depending on the intended use of the approximated matrix. If we only need to obtain a fast and compact representation, we can draw random vectors. We found that vectors uniformly sampled on the unit sphere or in a hypercube centered on zero perform well. In certain cases, we may need to evaluate products for in a certain region of . In this case, using vectors in this specific area can provide better results, since more of the expressivity of our decomposition will be used on this specific region. The set of vectors can also appear naturally, as we can see in the next part when we approximate the Hessian of another optimization problem.
In order to confirm our hypothesis, we ran our algorithm on synthetic and natural Hessian matrices. Synthetic matrices are obtained by randomly drawing rotations and eigenvalues, while the natural matrix we tried is the Hessian of the MNIST dataset.
To evaluate the quality of the approximation, we measure the average angle between the and for a set of random vectors .
The synthetic matrices are generated as where is a random rotation matrix, uniformly drawn on the set of rotation matrices, and is a diagonal matrix, with where is sampled from a zero-mean and Gaussian distribution, except a small number of randomly selected that are drawn from a Gaussian of mean
The results are displayed on Figure 1. The average angle goes down to about degrees in a space of dimension , with the number of large eigenvalues .
For sanity check purposes, we also try directly learning a random rotation matrix using a single
matrix. The loss function is then
The results are also shown on Figure 1. It doesn’t perform as well
as learning a Hessian, which was expected since we don’t have eigenspaces of
high dimension, but it still reaches degrees in dimension .
We performed another sanity check : if the matrix is actually the form of a matrix , then we learn it correctly (loss function reaching zero), and we also learn with loss function going quickly to zero.
It appears that the number of larger eigenvalues has an influence on the quality of the approximation : the average error angle after convergence is lower when is close to or to the size of the space (about degrees in the latter case). On the other hand, when the eigenvalues get split into two groups of about the same size, the average error angle can is about degrees. The average angle as a function of is shown on Figure 2.
We tested our algorithm on the covariance of the MNIST dataset. It reached an angle of degrees. Figure 3 shows real and approximated covariance matrices.
4 Hessian matrices of loss functions
In this section, we denote (or ) the approximated Hessian.
4.1 Batch gradient descent
Although learning an approximation of a fixed Hessian matrix has uses, our method can be used in an online fashion to track slowly-varying Hessian matrices over the course of an optimization procedure.
In numerous cases, we have an optimization problem involving too many variables to explicitly compute the Hessian, preventing us from using direct second order methods. Our Hessian representation can be used to store and evaluate the Hessian matrix. Thanks to the learning procedure involved, if the Hessian of the loss function changes, our estimation will be able to adapt.
Let us consider an optimization problem where we aim to minimize a loss function where is the set of parameters. Therefore, we are looking for
Or, in case of a non convex problem, we are looking for a that gives
a “low” value for (“low” being problem dependent).
In addition, let us suppose we are in a case where it is possible to compute the gradient of at every point . It is the case, for instance, for neural networks, using the backpropagation algorithm, or whenever we have an efficient way to compute the gradient.
It is well known that even if the problem is quadratic, and therefore the Hessian matrix doesn’t depend on the point , a gradient descent algorithm can perform poorly when has eigenvalues with different magnitude. This situation becomes even worse when the loss function is not quadratic in the parameters and the Hessian is not constant (we use notation for the Hessian matrix at point ). However, if is known, we can replace the gradient step by . This modifies the descent direction to take the curvature into account and allows quadratic convergence if the function is locally quadratic. Since is too large to compute or store, we can rely on approximations, such as in LBFGS or Truncated Conjugate Gradient.
Our method provides another way to track an approximation of . By definition, we have
In the context of minimization, let us suppose we have taken a step from point and arrived at the current point . Then we know that
We can use this relation as a training sample for the matrix .
Thus, we start with an approximated matrix set to identity, and we begin a regular gradient descent. At each step, we obtain a training point as defined in Equation 10, and we perform one step in learning (following the gradient of the loss function ). Then we perform a second step in the minimization of , folloing vector , and so on. Note that our representation of allows us to compute the inverse easily. The procedure is summarized in Algorithm 2.
A number of implementation details must be considered :
It appears that in natural Hessians, many of the eigenvalues are small or zero, especially when we approach a local minimum. Therefore, the inverse of can diverge. We solve this issue by imposing a minimal value on the elements of when computing their inverse.
Similarly, eigenvalues can be negative. It is a problem in the minimization since it makes second order methods diverge if not taken into account. There are several ways to deal with such eigenvalues. We chose to set negative elements of to at the same time as the normalization of .
The norm of (as defined in Algorithm 2) can vary from one step to the other. However, since the Hessian approximation is linear, scaling and by the same factor is equivalent to scaling the learning rate by the same factor. To have a better control on the learning rate, we rescale both and by a factor .
Although it performs poorly, due to the small eigenvalues, it is also possible to directly learn the inverse Hessian matrix using the relation
4.2 Stochastic and minibatch gradient descent
For most problems, stochastic gradient descent (SGD) performs better than batch descent. Here, we will consider the intermediate case, minibatch descent, where each step uses only a part of the dataset. It covers both plain SGD and batch methods by setting the size of the minibatch to or to the size of the dataset.
In minibatch setting, the loss function becomes , where is a subset of the dataset, changing at each iteration. We have if is the whole dataset. Because of this difference, the previous analysis doesn’t hold anymore, since the Hessian of is not going to be the same as the Hessian of .
However, we will see that if we use the Hessian learning in this setting, it still perform something very similar to minibatch gradient, on the Hessian approximation. When learning the approximation, instead of training pairs , we now obtain pairs , where and . Let be the whole dataset and be minibatches (so ). Since the gradient operator is linear, if , then
This simply states that if we sum gradients, taken at the same point, from a set of minibatches such that the dataset is their disjoint union, then we obtain the batch gradient. We have scaled the batch gradient by the size of the dataset for clarity purposes in the following part.
We will show that the same property holds for the loss function . The loss function for the Hessian approximation corresponding to minibatch is . Its gradient is
So we get
Another issue with minibatch descent lies in the difficulty to reuse the gradient
computed at step to compute at step . Indeed,
. In the
batch method, we had already computed at the previous
step, and we could reuse it. With the minibatch approach, the previous step
used , which is not computed on the same minibatch.
The easiest way to overcome this issue is to recompute at step , but it implies doubling the number of gradients computed. Although we don’t have a fully satisfying answer to this problem, a possibility to reduce the number of gradients computed would be to use the same minibatch a few times consecutively before changing to .
The approximated Hessians are sparse matrices with coefficients each. They can be efficiently implemented using standard sparse linear algebra, such as spblas. The total number of operations for evaluating an approximate Hessian is . If we neglect the fact that sparse matrices involve more operations, one evaluation coses MulAdd operations. Computing the gradient of has the same cost, thanks to the backpropagation algorithm. Accumulating the gradients cost another . So tracking the Hessian matrix using our method adds about operations at each iteration and, when using minibatches, possibly another gradient evaluation.
In terms of memory, there are parameters to store, and another is required to store the gradient of during learning.
5 Applications and future work
The linearithmic structure makes the Hessian factorization scalable to larger models. In particular, large inference problems invloving Gaussian functions are good candidates for our method. Evaluating a product is even faster, since it can be rewritten , and therefore be performed in operations. This is particularly adapted to inference, where it is possible to spend some time to learn the approximate the matrix once and for all. Gaussian mixture models and Bayesian inference with Gaussian models could profit from this approximation.
Although some work has still to be done, speeding up optimization by tracking the approximated Hessian could help optimizing function. In particular, deep neural network could profit from this method, since their loss function tend to be non isotropic, so they could profit from fast approximated second order methods.
- [BBG09] Antoine Bordes, Léon Bottou, and Patrick Gallinari. Sgd-qn: Careful quasi-newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, July 2009.
- [Ben13] Gérard Ben Arous. personal communication, 2013.
- [BL89] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 29–37, San Mateo, 1989. Morgan Kaufman.
- [CE11] Olivier Chapelle and Dumitru Erhan. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- [DCM12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc Le, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Ng. Large scale distributed deep networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240. 2012.
- [Gen98] A. Genz. Methods for generating random orthogonal matrices. In H. Niederreiter and J. Spanier, editors, Monte Carlo and Quasi-Monte Carlo Methods, page 199–213. Springer, 1998.
- [GLC11] Guangzhi Cao, Leonardo R. Bachega, and Charles A. Bouman. The sparse matrix transform for covariance estimation and analysis of high dimensional signals. IEEE Trans. on Image Processing, 20(3):625–640, 2011.
- [LBOM98] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and Muller K., editors, Neural Networks: Tricks of the trade. Springer, 1998.
Y. LeCun, P. Simard, and B. Pearlmutter.
Automatic learning rate maximization by on-line estimation of the hessian’s eigenvectors.In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems (NIPS 1992), volume 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
- [MSS12] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the hessian by back-propagating curvature. In Proc of ICML, 2012. arXiv:1206.6464.
- [NCL11] Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Andrew Ng, and Quoc V Le. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, 2011.
- [Noc80] Jorge Nocedal. Updating quasi-newton matrices with limited storage. Mathematics of computation, 35(151):773–782, 1980.
Nicol Schraudolph, Jin Yu, and Simon Günter.
A stochastic quasi-newton method for online convex optimization.
In Proc. 11th Intl. Conf. on Artificial Intelligence and
, pages 433–440. Society for Artificial Intelligence and Statistics, 2007.