## 1 Introduction

Contemporary machine learning heavily features empirical risk minimization on loss functions of the form

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \ell(\theta; x_i), \tag{1}$$

where $\mathcal{D}$ is a data-set and $\theta \in \mathbb{R}^N$ are the parameters of the optimization problem (e.g. the weights of a neural network). Data sub-sampling in such problems gives rise to stochastic optimization problems: if gradients are computed over a randomly sampled batch $\mathcal{B} \subset \mathcal{D}$ of data-points, they give rise to a stochastic gradient

$$\hat{\nabla}\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla \ell(\theta; x_i). \tag{2}$$

If $|\mathcal{B}|$ is large and sampled iid., then $\hat{\nabla}\mathcal{L}(\theta)$ is an unbiased estimator and, by the multivariate Central Limit Theorem, approximately Gaussian distributed (assuming $|\mathcal{B}| \ll |\mathcal{D}|$) around the full-data gradient [e.g. balles2017coupling; vaart_1998, §2.18],

$$\hat{\nabla}\mathcal{L}(\theta) \sim \mathcal{N}\!\left(\nabla\mathcal{L}(\theta),\, \frac{\Sigma(\theta)}{|\mathcal{B}|}\right), \tag{3}$$

where $\Sigma(\theta)$ is the empirical covariance over $\mathcal{D}$. For problems of large scale in both data and parameter-space, stochastic optimization using stochastic gradient descent robbins1951stochastic and its by now many variants (e.g., momentum polyak1964some, Adam kingma2014adam) is the standard. For (non-stochastic) gradient descent, it is a classic result that convergence depends on the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ of the objective’s Hessian $B = \nabla^2 \mathcal{L}(\theta)$. For example, Thm. 3.4 in Nocedal & Wright NoceWrig06 states that the iterates $\theta_t$ of gradient descent with optimal local step sizes on a twice-differentiable objective converge to a local optimum $\theta_*$ such that, for sufficiently large $t$,

$$\mathcal{L}(\theta_{t+1}) - \mathcal{L}(\theta_*) \leq \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2 \left[\mathcal{L}(\theta_t) - \mathcal{L}(\theta_*)\right], \tag{4}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigen-values of the Hessian $B(\theta_*)$, respectively. In noise-free optimization, it is thus common practice to try and reduce the condition number of the Hessian by a linear re-scaling of the input space using a *pre-conditioner* $P$.
For ill-conditioned problems ($\kappa \gg 1$), effective pre-conditioning can drastically improve convergence. Choosing a good pre-conditioner is a problem-dependent art. A good pre-conditioner should decrease the condition number of the problem and be cheap to apply, either through sparseness or low-rank structure.
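To make the effect of the condition number concrete, here is a small illustrative numpy sketch (ours, not from the paper): gradient descent on an ill-conditioned quadratic, with and without an oracle pre-conditioner $P = B^{-1}$, which reduces the transformed condition number to 1.

```python
import numpy as np

def gd_residual(hessian, steps=50):
    """Run gradient descent with the optimal fixed step 2/(l_max + l_min)
    on the quadratic 0.5 x^T H x and report the final distance to the optimum."""
    lam = np.linalg.eigvalsh(hessian)          # eigenvalues, ascending
    step = 2.0 / (lam[-1] + lam[0])
    x = np.ones(hessian.shape[0])
    for _ in range(steps):
        x = x - step * (hessian @ x)           # gradient of the quadratic is H x
    return np.linalg.norm(x)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
H = Q @ np.diag(np.linspace(1.0, 1000.0, 10)) @ Q.T   # condition number 1000

# An 'oracle' pre-conditioner P = H^{-1}: the transformed Hessian is the
# identity, so gradient descent converges essentially in a single step.
H_pre = np.linalg.inv(H) @ H

ill = gd_residual(H)
pre = gd_residual(H_pre)
```

The contraction factor per step for the ill-conditioned problem is $(\kappa-1)/(\kappa+1) \approx 0.998$, so progress stalls; the pre-conditioned run contracts to numerical noise immediately.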
Sometimes pre-conditioners can be constructed as a good ‘a-priori’ guess of the inverse Hessian $B^{-1}$. If no such good guess is available, then in the deterministic/non-stochastic setting, iterative linear-algebra methods can be used to build a low-rank pre-conditioner. For example, if $U \in \mathbb{R}^{N \times L}$ spans the leading eigen-vectors of the Hessian, the corresponding eigen-values can be re-scaled to change the spectrum, and by extension the condition number. Iterative methods such as those based on the Lanczos process [golub2012matrix, §10.1] try to consecutively expand a helpful subspace by choosing $B$-conjugate vectors, which can fail in the presence of inexact computations. Due to these intricate instabilities [trefethen1997numerical, p. 282], such algorithms tend not to work with the level of stochasticity encountered in practical machine learning applications. Below, we propose a framework for the efficient construction of pre-conditioners in settings where noise-corrupted Hessian-vector products are available. Our algorithm consists of three main components: we first build a probabilistic Gaussian inference model for matrices from noisy matrix-vector products (Section 2.1) by extending existing work on matrix-variate Gaussian inference. Then we construct an active algorithm that selects informative vectors aiming to explore the Hessian’s dominant eigen-directions (Section 2.2); the structure of this algorithm is inspired by that of the classic Arnoldi and Lanczos iterations designed for noise-free problems. Finally, we provide some “plumbing” to empirically estimate hyper-parameters and efficiently construct a low-rank pre-conditioner (Section 2.3), and extend the algorithm to the case of high-dimensional models (Section 2.4). We evaluate the algorithm on some simple experiments to empirically study its properties as a way to construct pre-conditioners, and test it on both a low-dimensional^1 and a high-dimensional deep learning problem. While we use pre-conditioning as the principal setting, the main contribution of our framework is the ability to construct matrix-valued estimates in the presence of noise, and to do so at complexity *linear* in the width and height of the matrix. It is thus applicable to problems of large scale, also in domains other than pre-conditioning and optimization in general. In contrast to a simple averaging of random observations, our algorithm actively chooses projections in an effort to improve the estimate.

^1 Code repository.

### 1.1 Related Work

Our approach is related to optimization methods that try to emulate the behaviour of Newton’s method without incurring its cubic per-step cost; that is, iterative optimization updates that try to find a search direction $s$ as an approximate solution to the linear problem

$$B(\theta_t)\, s = -\nabla \mathcal{L}(\theta_t). \tag{5}$$

This includes quasi-Newton methods like BFGS and its siblings dennis1977quasi, and Hessian-free optimization pearlmutter1994fast,Martens2010. These methods try to keep track of the Hessian during the optimization. Pre-conditioning is a simpler approach that separates the estimation of the (inverse) Hessian from the on-line phase of optimization and moves it to an initialization phase. Our algorithm could in principle be run in every single step of the optimizer, as in Hessian-free optimization. However, this would multiply the incurred cost, which is why we here only study its use for pre-conditioning. There are stochastic variants of quasi-Newton methods and other algorithms originally constructed for noise-free optimization [e.g. schraudolph2007stochastic; byrd2016stochastic]. These are generally based on collecting independent random (not actively designed) samples of quasi-Newton updates. Estimates can also be obtained by regularization [e.g. 2018arXiv180204310W], or by partly updating the stochastic gradient through reusing elements of a batch [e.g. 2018arXiv180205374B]. The conceptual difference between these methods and ours is that we actively try to design informative observations by explicitly taking evaluation uncertainty into account. Our inference scheme is an extension of Gaussian models for inference on matrix elements, which started with early work by DAWID01041981 and was recently extended in the context of probabilistic numerical methods hennig2013quasi,hennig2015,2018arXiv180204310W. Our primary addition to these works is the algebra required for dealing with structured observation noise.

## 2 Theory

Our goal in this work is to construct an active inference algorithm for low-rank pre-conditioning matrices that can deal with data in the form of Hessian-vector multiplications corrupted by significant (not just infinitesimal) Gaussian noise. To this end, we will adopt a probabilistic viewpoint with a Gaussian observation likelihood, and design an active evaluation policy that aims to efficiently collect informative, non-redundant Hessian-vector products. The algorithm will be designed so that it produces a Gaussian posterior measure over the Hessian of the objective function, such that the posterior mean is a low-rank matrix.

### 2.1 Matrix Inference

Bayesian inference on matrices can be realised efficiently in a Gaussian framework by re-arranging the elements of a matrix $B \in \mathbb{R}^{N \times N}$ into a vector $\vec{B} \in \mathbb{R}^{N^2}$, then performing standard Gaussian inference on this vector [e.g. DAWID01041981]. Although Hessian matrices are square and symmetric, the following derivations apply equally well to rectangular matrices. There are specializations for symmetric matrices hennig2015, but since they significantly complicate the derivations below, we use this weaker model. Assume that we have access to observations $Y = [y_1, \dots, y_M] \in \mathbb{R}^{N \times M}$ of matrix-vector products $y_i = B s_i$ along the search directions $S = [s_1, \dots, s_M]$. In the vectorized notation, this amounts to a linear projection of $\vec{B}$ through the Kronecker product matrix $(S^\top \otimes I)$:

$$\vec{Y} = (S^\top \otimes I)\, \vec{B}.$$

If the observations are exact (noise-free), the likelihood function is a Dirac distribution,

$$p(Y \mid B, S) = \delta\!\left(\vec{Y} - (S^\top \otimes I)\, \vec{B}\right). \tag{6}$$

For conjugate inference, we assign a Gaussian prior over $B$, with a prior mean matrix $B_0$ and a covariance consisting of a Kronecker product of two symmetric positive-definite matrices $V$ and $W$:

$$p(B) = \mathcal{N}\!\left(\vec{B};\, \vec{B_0},\, V \otimes W\right). \tag{7}$$

This is an equivalent re-formulation of the *matrix-variate Gaussian* distribution DAWID01041981. For simplicity, and since we are inferring a Hessian matrix (which is square and symmetric), we set $V = W$. This combination of prior (7) and likelihood (6) has previously been discussed [detailed derivation, e.g., in hennig2015]. It gives rise to a Gaussian posterior distribution, whose mean matrix is given by

$$B_M = B_0 + W Z S^\top W, \tag{8}$$

where $Z$ is found as the solution to the linear system

$$\left[(S^\top W S) \otimes W\right] \vec{Z} = \overrightarrow{Y - B_0 S}, \tag{9}$$

using the Kronecker product’s property that $(A \otimes C)^{-1} = A^{-1} \otimes C^{-1}$ (note that the $M \times M$ matrix $S^\top W S$ can be inverted in $\mathcal{O}(M^3)$).
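As a sanity check of this algebra, Eqs. (8)–(9) can be written in a few lines of numpy; the resulting estimate reproduces all observed projections exactly, $B_M S = Y$. The function and variable names below are ours, not the paper's code.

```python
import numpy as np

def posterior_mean(B0, W, S, Y):
    """Noise-free Gaussian matrix inference, Eqs. (8)-(9):
    B_M = B0 + W Z S^T W,  with  W Z (S^T W S) = Y - B0 S."""
    G = S.T @ W @ S                              # M x M Gram matrix, cheap to invert
    Z = np.linalg.solve(W, (Y - B0 @ S) @ np.linalg.inv(G))
    return B0 + W @ Z @ S.T @ W

rng = np.random.default_rng(1)
N, M = 8, 3
A = rng.standard_normal((N, N))
A = A + A.T                                      # a symmetric 'Hessian' ground truth
S = rng.standard_normal((N, M))                  # search directions
Y = A @ S                                        # exact matrix-vector products

B_M = posterior_mean(np.zeros((N, N)), np.eye(N), S, Y)
```

Note that the only inversions are of the small $M \times M$ Gram matrix and of $W$ (here the identity), so the cost is linear in $N$ for structured $W$.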

#### 2.1.1 Adding Noise

While noise-free observations of Gaussian-distributed matrices have been studied before, observation noise beyond the fully iid. case [cf. 2018arXiv180204310W] is a challenging complication. A lightweight inference scheme for noisy observations is one of the core contributions of this paper. In empirical risk minimization problems, mini-batching replaces a large or infinite sum (the population risk, Eq. (1)) by a small sum of individual loss functions (Eq. (2)). Analogous to Eq. (3), the Hessian $\hat{B}$ of the batch risk is thus corrupted relative to the true Hessian $B$ by Gaussian noise:

$$p(\hat{B}) = \mathcal{N}\!\left(\vec{\hat{B}};\, \vec{B},\, \frac{\Lambda(\theta)}{|\mathcal{B}|}\right). \tag{10}$$

If we now compute matrix-vector products $y_i = \hat{B}_i s_i$ of this batch Hessian with vectors $s_i$, even under the simplifying assumption of Kronecker structure in the covariance, the observation likelihood becomes

$$p(Y \mid B, S) = \mathcal{N}\!\left(\vec{Y};\, (S^\top \otimes I)\,\vec{B},\, D_{\mathrm{diag}} \otimes \Sigma\right). \tag{11}$$

The subscript in Eq. (11) highlights the diagonal structure of the $M \times M$ factor $D$, which arises from the independent batches we use to calculate the Hessian-vector products. Relative to Eq. (9), this changes the Gram matrix to be inverted from $(S^\top W S) \otimes W$ to $(S^\top W S) \otimes W + D \otimes \Sigma$. To get the posterior of $B$ in the noisy setting after $M$ observations, instead of Eq. (9), we now have to solve the linear problem

$$\left[(S^\top W S) \otimes W + D \otimes \Sigma\right] \vec{Z} = \overrightarrow{Y - B_0 S}. \tag{12}$$

This is a more challenging computation, since the sum of Kronecker products does not generally have an analytic inverse. However, Eq. (12) is a so-called matrix pencil problem, which can be efficiently addressed using a generalized eigen-decomposition [e.g. golub2012matrix, §7.7]. That is, for a pencil $(A, C)$, by a matrix $U$ and diagonal matrix $\Omega$ such that

$$A U = C U \Omega, \qquad U^\top A U = \Omega, \qquad U^\top C U = I. \tag{13}$$

Eigen-decompositions distribute over a Kronecker product loan00kronec, and this property is inherited by the generalized eigen-decomposition, which offers a convenient way to rewrite one matrix in terms of the other. Eq. (12) can thus be solved with two generalized eigen-decompositions. The left and right parts of the Kronecker products of Eq. (12) are written with separate generalized eigen-decompositions as

$$(S^\top W S)\, U_1 = D\, U_1 \Omega_1, \qquad W U_2 = \Sigma\, U_2 \Omega_2,$$

where $U_1, \Omega_1$ contain the generalized eigen-vectors and eigen-values from the left Kronecker terms, and $U_2, \Omega_2$ are analogous for the right Kronecker terms. This yields

$$(S^\top W S) \otimes W + D \otimes \Sigma = (U_1 \otimes U_2)^{-\top} \left(\Omega_1 \otimes \Omega_2 + I\right) (U_1 \otimes U_2)^{-1}.$$

In the first step, the matrices of each Kronecker term are expressed in terms of the generalized eigen-vectors by means of Eq. (13); in the remaining step, the conjugacy property in Eq. (13) is used to simplify the expression to the inversion of a diagonal matrix and the Kronecker product of the generalized eigen-vectors. The solution now becomes

$$Z = U_2 \left[\left(U_2^\top (Y - B_0 S)\, U_1\right) \circ \Gamma\right] U_1^\top, \qquad \Gamma_{ij} = \frac{1}{(\Omega_2)_{ii} (\Omega_1)_{jj} + 1}, \tag{14}$$

where $\circ$ refers to the Hadamard product of the two matrices. Using this form, we can represent the posterior mean estimate for $B$ with Eq. (8), where $Z$ is replaced with the solution from Eq. (14).

This matrix can generally not be written in a simpler analytic form. If the prior mean $B_0$ is chosen as a simple matrix (e.g. a scaled identity), then $B_M$ admits fast multiplication and inversion (using the matrix inversion lemma) due to its low-rank outer-product structure.
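The pencil solution can be verified numerically against a dense solve of Eq. (12). The sketch below (our notation, not the paper's code) uses `scipy.linalg.eigh`, whose generalized mode normalizes the eigenvectors so that $U^\top C U = I$, exactly the convention of Eq. (13).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
N, M = 6, 3

def spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

G = spd(M)                                    # stands in for S^T W S
D = np.diag(rng.uniform(0.5, 2.0, M))         # diagonal noise factor
W, Sig = spd(N), spd(N)                       # prior and noise covariances
dY = rng.standard_normal((N, M))              # residual Y - B0 S

# Generalized eigen-decompositions of the two pencils, Eq. (13):
# G U1 = D U1 Om1 with U1^T D U1 = I, and analogously for (W, Sig).
w1, U1 = eigh(G, D)
w2, U2 = eigh(W, Sig)

# Eq. (14): Z = U2 [ (U2^T dY U1) o Gamma ] U1^T, Gamma_ij = 1/(w2_i w1_j + 1)
Gamma = 1.0 / (np.outer(w2, w1) + 1.0)
Z = U2 @ ((U2.T @ dY @ U1) * Gamma) @ U1.T

# Dense reference: build the NM x NM system of Eq. (12) explicitly
# (column-stacked vec convention, so vec(X) = X.T.reshape(-1)).
K = np.kron(G, W) + np.kron(D, Sig)
```

The dense matrix `K` exists here only for verification; the whole point of Eq. (14) is that it is never formed in the actual algorithm.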

### 2.2 Active Inference

The preceding section constructed an inference algorithm that turns noisy projections of a matrix into a low-rank estimate for the latent matrix. The second ingredient of our proposed algorithm, outlined in this section, is an active policy that chooses non-redundant projection directions to efficiently improve the posterior mean. Algorithm 1 summarizes the procedure as pseudo-code.
The structure of this algorithm is motivated by, albeit not exactly matched to, that of stationary iterative linear solvers and eigen-solvers such as GMRES saad1986gmres and conjugate gradients hestenes1952methods (and the corresponding eigen-value iterations, the Arnoldi process and the Lanczos process; cf. [golub2012matrix, §10] and [trefethen1997numerical, §VI]). These algorithms can be interpreted as optimization methods that iteratively expand a low-rank approximation to the Hessian of a quadratic problem (e.g. in the case of conjugate gradients / Lanczos), then solve the quadratic problem within the span of this approximation. In our algorithm, the exact low-rank approximation is replaced by the posterior mean *estimate* arising from the Bayesian inference routine described in Section 2.1. This leads the algorithm to suppress search directions that are co-linear with those collected in previous iterations, focussing instead on the efficient collection of new information.
Readers familiar with linear solvers like conjugate gradients will recognise the structural similarity of Algorithm 1 to such solvers, with two differing aspects. Each iteration constructs a projection direction $s_i$, collects one matrix-vector multiplication $y_i = \hat{B} s_i$, and rescales it by a step size (here set to 1 and omitted). A linear solver would update the solution and residual using $s_i$ and $y_i$, but we let the algorithm stay at $\theta$ and sample new search directions and projections. The core difference to a solver is in line 7: where the classic solvers would perform a Gram-Schmidt step, we instead explicitly perform Gaussian inference on the Hessian $B$. In the noise-free limit, the proposed method would choose the same search directions as projection methods, a superclass of iterative solvers containing algorithms such as GMRES and CG hennig2015.
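In the noise-free limit, the loop can be sketched as follows (illustrative numpy, not the paper's implementation): each iteration applies the inverse of the current estimate to a fresh stand-in gradient, collects one matrix-vector product, and re-runs the inference of Section 2.1. With exact observations, $N$ linearly independent directions recover the matrix exactly.

```python
import numpy as np

def infer(B0, W, S, Y):
    """Noise-free posterior mean of Section 2.1 (Eqs. (8)-(9))."""
    G = S.T @ W @ S
    Z = np.linalg.solve(W, (Y - B0 @ S) @ np.linalg.inv(G))
    return B0 + W @ Z @ S.T @ W

rng = np.random.default_rng(3)
N = 6
A = rng.standard_normal((N, N))
A = A @ A.T + np.eye(N)                      # SPD ground-truth 'Hessian'

B0 = 1.0 * np.eye(N)                         # scalar prior mean, alpha = 1
W = np.eye(N)
B_est = B0
S, Y = [], []
for t in range(N):
    g = rng.standard_normal(N)               # stand-in for a stochastic gradient
    s = np.linalg.pinv(B_est) @ g            # direction from the current estimate
    S.append(s)
    Y.append(A @ s)                          # one (here exact) matrix-vector product
    B_est = infer(B0, W, np.stack(S, 1), np.stack(Y, 1))   # Gaussian inference step
```

In the noisy setting, the only change is that `A @ s` is replaced by a batch Hessian-vector product and `infer` by the pencil solve of Section 2.1.1.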

### 2.3 Algorithmic Details

Like all numerical methods, our algorithm requires the tuning of some internal parameters and some implementational engineering. We set all free parameters in an empirical fashion, via cheap pre-computations, and use standard linear-algebra tools for the internals.

#### 2.3.1 Estimating Parameters

The Gaussian prior (7) and likelihood (11) (also used as parameters of Alg. 1) have several parameters. For simplicity and to limit computational cost, we set all of these to scaled identity matrices: prior mean $B_0 = \alpha I$, variance $W = \beta I$, and noise covariance $\Sigma = \sigma^2 I$. We set these parameters empirically: before the method tries to estimate the Hessian, it gathers gradients $\hat{g}_k$ and Hessian-gradient products $\hat{B}_k \hat{g}_k$ locally, from a small number $K$ of initial batches. Then we set the parameters of the prior over $B$ as the empirical estimates

$$\alpha = \frac{1}{K} \sum_{k=1}^{K} \frac{\hat{g}_k^\top \hat{B}_k \hat{g}_k}{\hat{g}_k^\top \hat{g}_k}, \qquad \beta^2 = \frac{1}{K} \sum_{k=1}^{K} \left(\frac{\hat{g}_k^\top \hat{B}_k \hat{g}_k}{\hat{g}_k^\top \hat{g}_k} - \alpha\right)^2. \tag{15}$$

The noise variance $\sigma^2$ for the likelihood is set by an empirical estimate of the variance of the Hessian-vector products across batches. Since batch elements can vary quite extremely, however, we use a more robust choice, setting it to the median of the variance estimates. The mean gradient from the initial sampling is used for the first iteration of the Hessian-inference scheme (line 2). A new search direction $s_t$, along which a Hessian-vector product is to be computed, is obtained by applying the inverse of the current estimated Hessian to a stochastic gradient:

$$s_t = \hat{B}^{-1}\, \hat{\nabla}\mathcal{L}(\theta). \tag{16}$$

The estimated Hessian is updated by first solving Eq. (12) and using the result in Eq. (8) to get the posterior mean: it is a sum of the prior mean and a low-rank outer product of two matrices of shape $N \times M$, with $N$ the number of parameters and $M$ the number of observations/iterations. A diagonal prior mean offers efficient inversion of the estimated Hessian by the matrix inversion lemma, and subsequent multiplication in Eq. (16). In the experiments below, the algorithm is used to construct a pre-conditioner for stochastic optimizers. In this application, it runs at the beginning of an optimization process, collecting several batches “in-place”, each time computing a noisy Hessian-vector product. To find the dominant eigen-directions, especially if noise is significant, the solver usually requires a number $M$ of iterations, producing a posterior estimate of rank $M$. To reduce the computational cost and memory requirements in the subsequent actual optimization run, we reduce the rank of this approximation down to $L < M$ using a standard fast singular value decomposition, obtaining the singular values $\lambda_1 \geq \dots \geq \lambda_L$ and the left singular vectors $u_1, \dots, u_L$, from which we estimate the Hessian. This is equivalent to taking the symmetric part of a polar decomposition, which yields the closest symmetric approximation to a matrix in Frobenius norm HIGHAM1988103. For an $N$-dimensional problem and a pre-conditioner of rank $L$, this method requires the storage of $\mathcal{O}(NL)$ numbers and has a computational overhead of $\mathcal{O}(NL)$ additional FLOPS compared to a standard sgd update. Such overhead is negligible for the batch-sizes typical in machine learning.

*Figure: Progress of pre-conditioned and standard sgd on the MNIST logistic regression problem (details in text). The top plot shows progress on the raw optimization objective (train loss); the bottom plot shows generalization in terms of test accuracy. Two lines are shown for sgd. The solid one uses a step-size optimized for good performance, which directly induces a step-size for the pre-conditioner. For comparison, we also show progress of sgd when directly using this induced step size: it makes sgd unstable.*

#### 2.3.2 Pre-conditioning

A pre-conditioner is constructed to rescale the stochastic gradients in the direction of the singular vectors:

$$P^{-1} = I + \sum_{i=1}^{L} \left(\frac{1}{\lambda_i} - 1\right) u_i u_i^\top. \tag{17}$$

By rescaling the gradients with $P^{-1}$, the linear system in Eq. (5) is transformed into $P^{-1} B s = -P^{-1} \nabla\mathcal{L}(\theta)$. The goal, as outlined above, is to reduce the condition number of the transformed Hessian $P^{-1} B$. If the estimated vectors $u_i$ and values $\lambda_i$ are indeed real eigen-pairs of the Hessian, this approach rescales these directions to have length 1, which would be ideal. When instead an approximation of the subspace is used, with poor approximations of the eigen-values, it is possible to scale a direction too much, so that the eigen-vectors corresponding to the largest eigen-values become the new smallest eigen-values. In our experiments this rarely happened, because the Hessian contained many eigen-values of similar magnitude. The improved condition number allows usage of a larger step-size: we scale up by the empirical improvement of the condition number, i.e. the fixed step-size $\eta$ of sgd is multiplied with $c = \lambda_1 / \lambda_L$, the ratio between the largest and the smallest estimated eigen-value. Using the current notation, we can write the pre-conditioned sgd update as

$$\theta_{t+1} = \theta_t - \eta\, c\, P^{-1}\, \hat{\nabla}\mathcal{L}(\theta_t).$$
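Assuming a pre-conditioner of the form $P^{-1} = I + \sum_i (1/\lambda_i - 1)\, u_i u_i^\top$ as in Eq. (17), the rank-reduction and pre-conditioning steps can be sketched as follows (hypothetical numpy, not the published code):

```python
import numpy as np

rng = np.random.default_rng(4)
N, L = 20, 3

# Ground-truth Hessian with L dominant eigen-directions (rest of spectrum = 1)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
lam = np.concatenate([np.array([100.0, 50.0, 25.0]), np.ones(N - 3)])
B = Q @ np.diag(lam) @ Q.T

# Rank-L truncation of a (here noiseless) estimate via an SVD of the
# low-rank part, i.e. with the scalar prior mean subtracted.
U, s, _ = np.linalg.svd(B - np.eye(N))
U, s = U[:, :L], s[:L] + 1.0                 # keep L directions, add mean back

# Eq. (17): rescale the kept directions to unit curvature
P_inv = np.eye(N) + U @ np.diag(1.0 / s - 1.0) @ U.T

kappa_before = np.linalg.cond(B)
kappa_after = np.linalg.cond(P_inv @ B)
```

With exact eigen-pairs, the three dominant directions are mapped to curvature 1, matching the rest of the spectrum, so the transformed condition number collapses to roughly 1.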

### 2.4 High-Dimensional Modification

Deep learning has become the benchmark for stochastic optimizers, and it imposes many new constraints. Due to the large number of parameters, even if we had access to the fraction of eigen-directions which make the Hessian ill-conditioned, using them would likely be inefficient, because each vector has the same size as the network, and the approach would only work if a few of the eigen-values of the Hessian had a clear separation in magnitude. This changes the functionality we can expect from a pre-conditioner: keeping the parameter updates relevant with respect to the current condition number, rather than finding all directions with problematic curvature. Some simplifications are required in order to adapt the algorithm for deep learning. The most important change is that we approximate the Hessian as a layer-wise block matrix, effectively treating each layer of a deep net as an independent task. Hessian-vector products are calculated using automatic differentiation pearlmutter1994fast. Good estimates of the eigen-values proved difficult to get because of large deviations, likely due to the noisy Hessian-vector products. To alleviate this problem, we changed the multiplicative update of the step-size presented in Section 2.3.2 to redefining the step-length: each time the algorithm is used to build a pre-conditioner, a new step-length is set from the scalar prior mean $\alpha$ in Eq. (15). The last modification concerns how the empirical parameters of the prior (Section 2.3.1) are set; in this setting, a smaller step-size gave better results, and the prior variance was estimated accordingly. All the parameters of the prior and likelihood ($\alpha$, $\beta$ and $\sigma^2$) are shared among the layers; no clear improvement was visible when treating the parameters separately for each layer.

## 3 Results

We analyze and compare the performance of the proposed algorithm on a simple test problem along with two standard applications in machine learning.

### 3.1 Regression

Figure 1 shows results from a conceptual test setup designed to showcase the algorithm’s potential: an ill-conditioned linear problem of a scale chosen such that the analytical solution can still be found for comparison. We used linear parametric least-squares regression on the SARCOS data-set VijayakumarS00 as a test setup. The data-set contains observations of a uni-variate^2 response function in a 21-dimensional space. We used polynomial feature functions $\phi(x)$ combined with a linear mapping, manually designed to make the problem ill-conditioned. The model with a quadratic loss function then yields a quadratic optimization problem,

$$\mathcal{L}(w) = \frac{1}{2} \|\Phi w - y\|^2 + \frac{\lambda}{2} \|w\|^2, \tag{18}$$

where $\Phi$ is the map from weights to data. The exact solution of this problem is given by the regularized least-squares estimate $w_* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$. For this problem size, this solution can be computed easily, providing an oracle baseline for our experiments. But if the number of features were higher, then exact solutions would not be tractable. One could instead compute, as in deep learning, batch gradients from Eq. (18), and also produce a noisy approximation of the Hessian as the low-rank (plus regularizer) matrix

$$\hat{B} = \frac{1}{|\mathcal{B}|} \Phi_\mathcal{B}^\top \Phi_\mathcal{B} + \lambda I, \tag{19}$$

where $\mathcal{B}$ is a batch. Clearly, multiplying an arbitrary vector with this low-rank matrix has cost linear in the batch-size and feature dimension, thus providing the functionality required for our noisy solver. Figure 1 compares the progress of vanilla sgd with that of pre-conditioned sgd when the pre-conditioner is constructed with our algorithm. In each case, the construction of the pre-conditioner was performed in 16 iterations of the inference algorithm. Even in the largest case, this amounts to only a minuscule fraction of the data read, and thus of the overall runtime.

^2 The data-set contains 7 such univariate target variables. Following convention, we used the first one.
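The noisy Hessian-vector products of Eq. (19) can be generated without ever forming the Hessian; a minimal sketch (names and sizes are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n_data, n_feat, lam = 1000, 15, 0.1
Phi = rng.standard_normal((n_data, n_feat))      # stand-in feature matrix

def batch_hvp(v, batch_size=32):
    """Noisy Hessian-vector product (1/|B|) Phi_B^T (Phi_B v) + lam * v.
    Cost is O(batch_size * n_feat): the full Hessian is never formed."""
    idx = rng.choice(n_data, size=batch_size, replace=False)
    Phi_B = Phi[idx]
    return Phi_B.T @ (Phi_B @ v) / batch_size + lam * v

v = rng.standard_normal(n_feat)
full = Phi.T @ (Phi @ v) / n_data + lam * v      # exact Hessian-vector product
noisy = np.mean([batch_hvp(v) for _ in range(200)], axis=0)
```

Each call to `batch_hvp` is an unbiased but noisy observation of the true product; averaging many of them recovers it, which is precisely the structure the inference scheme exploits more efficiently.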
An alternative approach would be to compute the inverse of $\hat{B}$ separately for each batch, then average over the resulting batch-least-squares estimates. In our toy setup, this can again be done directly in the feature space. In an application with a larger feature space, this would still be feasible using the matrix-inversion lemma on Eq. (19), instead inverting a dense matrix of size $|\mathcal{B}| \times |\mathcal{B}|$. The figure also shows the progression of this stochastic estimate (labelled *avg-inv*, always using the largest batch-size, since smaller batch-sizes did not work at all). It performs much worse unless the batch-size is increased, which highlights the advantage of the active selection of projection directions for identifying appropriate eigen-vectors. A third option is to use an iterative solver with noise-corrupted observations to approach the optimum. In Figure 1, a barely visible line labelled *CG* shows the method of conjugate gradients with a batch-size of 256. This method required a large batch-size to show reasonable convergence on the training objective, but would still perform poorly on the test set.

### 3.2 Logistic Regression

Figure 2 shows an analogous experiment on a more realistic and *non-linear* problem: classic linear logistic regression on the digits 3 and 5 from the MNIST data-set (i.e. using linear features). Here we used the model proposed in [Rasmussen2005, §3.4], which defines a convex, non-linear regularized empirical risk minimization problem that again allows the construction of stochastic gradients and associated noisy Hessian-vector products. Analogous to Figure 1, Figure 2 shows the progress of sgd and pre-conditioned sgd. As before, this problem is actually just small enough to compute an exact solution by Newton optimization (gray baseline in the plot). And as before, computation of the pre-conditioner takes up a small fraction of the optimization runtime.
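Noisy Hessian-vector products for logistic regression take a similar form, via the standard logistic-loss Hessian $X^\top R X$ with $R = \operatorname{diag}(\pi_i (1 - \pi_i))$; a minimal sketch in our own notation:

```python
import numpy as np

rng = np.random.default_rng(6)
n_data, n_feat = 500, 10
X = rng.standard_normal((n_data, n_feat))        # stand-in design matrix
w = rng.standard_normal(n_feat)                  # current weights

def logistic_hvp(v, idx):
    """Hessian-vector product of the logistic loss on the batch idx:
    (1/|B|) X_B^T diag(pi (1-pi)) X_B v, computed without forming the Hessian."""
    Xb = X[idx]
    pi = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted class probabilities
    return Xb.T @ ((pi * (1.0 - pi)) * (Xb @ v)) / len(idx)

v = rng.standard_normal(n_feat)
exact = logistic_hvp(v, np.arange(n_data))                 # full-data product
noisy = logistic_hvp(v, rng.choice(n_data, 64, False))     # one mini-batch
```

The diagonal weights $\pi_i(1-\pi_i)$ make the batch Hessian positive semi-definite, so the quadratic form $v^\top \hat{B} v$ is always non-negative.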

Training loss for sgd (left) and pre-conditioned sgd (right) on the CIFAR-10 data-set for different learning rates and a batch-size of 32 over 250 epochs. Both graphs share the y-axis and colors to facilitate comparison between the optimizers. The solid lines represent the means of several individual runs, plotted as translucent lines.

### 3.3 Deep Learning

For a high-dimensional test bed, we used a deep net consisting of convolutional and fully-connected layers (see Appendix A for details) to classify the CIFAR-10 data-set krizhevsky2009learning. The proposed algorithm was implemented in PyTorch paszke2017automatic, using the modifications listed in Section 2.4. To stabilize the algorithm, a predetermined fixed learning-rate was used for the first epoch of pre-conditioned sgd. Figures 3 and 4 compare the convergence of the proposed algorithm against sgd for training loss and test loss, respectively, on CIFAR-10. In both figures, we see that pre-conditioned sgd has similar performance to a well-tuned sgd regardless of the initial learning rate. To keep the cost of computation and storage low, we used a rank-2 approximation of the Hessian that was recalculated at the beginning of every epoch. The cost of building the rank-2 approximation was 2–5% of the total computational cost per epoch.

The improved convergence of pre-conditioned sgd over normal sgd is mainly attributed to the adaptive step-size, which seems to capture the general scale of the curvature to efficiently make progress. This approach offers a more principled way to update the step-length than popular empirical approaches such as exponentially decaying learning-rates or division upon plateauing. By studying the scale of the found learning rate (see Fig. 5), we see that regardless of the initial value, all runs follow the same trajectory, although spanning values across four orders of magnitude.

## 4 Conclusion

We have proposed an active probabilistic inference algorithm to efficiently construct pre-conditioners in stochastic optimization problems. It consists of three conceptual ingredients: First, a matrix-valued Gaussian inference scheme that can deal with structured Gaussian noise in observed matrix-vector products. Second, an active evaluation scheme aiming to collect informative, non-redundant projections. Third, additional statistical and linear algebra machinery to empirically estimate hyper-parameters and arrive at a low-rank pre-conditioner. The resulting algorithm was shown to significantly improve the behaviour of sgd in imperfectly conditioned problems, even in the case of severe observation noise typical for contemporary machine learning problems. It scales from low- to medium- and high-dimensional problems, where its behaviour qualitatively adapts from full and stable pre-conditioning to low-rank pre-conditioning and, eventually, scalar adaptation of the learning rate of sgd-type optimization.

#### Acknowledgements

Filip de Roos acknowledges support by the International Max Planck Research School for Intelligent Systems. Philipp Hennig gratefully acknowledges financial support by the European Research Council through ERC StG Action 757275 / PANAMA.

**Supplementary Material**

*Active Probabilistic Inference on Matrices for Pre-Conditioning in Stochastic Optimization*

## Appendix A Deep Learning

All deep learning experiments used a neural network consisting of 3 convolutional layers with 64, 96 and 128 output channels, followed by 3 fully connected layers of size 512, 256 and 10, with a cross-entropy loss function on the output and regularization. All layers used the ReLU nonlinearity, and the convolutional layers had additional max-pooling.
