A Multilevel Approach to Variance Reduction in the Stochastic Estimation of the Trace of a Matrix

The trace of a matrix function $f(A)$, most notably of the matrix inverse, can be estimated stochastically using samples $\langle x, f(A)x\rangle$ if the components of the random vectors $x$ obey an appropriate probability distribution. However, such Monte-Carlo sampling suffers from the fact that the number of samples needed grows quadratically with the inverse of the desired accuracy, making high-precision estimation very costly. In this paper we suggest and investigate a multilevel Monte-Carlo approach which uses a multigrid hierarchy to stochastically estimate the trace. This results in a substantial reduction of the variance, so that higher precision can be obtained at much less effort. We illustrate this for the trace of the inverse using three different classes of matrices.

1 Introduction

We consider the situation where one is interested in the trace $\mathrm{tr}(f(A))$ of a matrix function $f(A)$. Here, $f(A)$ is the matrix obtained from $A \in \mathbb{C}^{n \times n}$ and the function $f$ in the usual operator theoretic sense; see [Higham2008], e.g. Our focus is on the inverse, i.e. $f(A) = A^{-1}$. Computing the trace is an important task arising in many applications. The trace of the inverse is required, for example, in the study of fractals [Sapoval1991], in generalized cross-validation and its applications [GolubMatt1995, GolubHeathWahba1979]. In network analysis, the Estrada index, a total centrality measure for networks, is defined as the trace of the exponential of the adjacency matrix of a graph [EstradaHigham2010, Ginosar2008], and an analogous measure is given by the trace of the resolvent [Estrada2011, Section 8.1]. For Hermitian positive definite matrices $A$, one can compute the log-determinant as the trace of the logarithm of $A$, since $\log\det(A) = \mathrm{tr}(\log(A))$. The log-determinant is needed in machine learning and related fields [Rasmussen2005, Rue2005]. Further applications are discussed in [meyer2021hutch++, SaadUbaru2017, SaadUbaruJie2017]. A particular application which we want to prepare with this work arises in lattice quantum chromodynamics (QCD), a computational approach in theoretical physics to simulate the interaction of the quarks as constituents of matter. Here, the trace of the inverse of the discretized Dirac operator yields the disconnected fermion loop contribution to an observable; see [SextonWeingarten1994]. As simulation methods become more and more precise, these contributions become increasingly important.

It is usually infeasible to compute the diagonal entries directly as $f(A)_{ii} = e_i^H f(A) e_i$, with $e_i$ the $i$th canonical unit vector, and then obtain the trace by summation. For example, for the inverse this would mean that we have to solve $n$ linear systems, which is prohibitive for large values of $n$.

One large class of methods which aim at circumventing this cost barrier are deterministic approximation techniques. Probing methods, for example, approximate

\mathrm{tr}(f(A)) \approx \sum_{m=1}^{M} v_m^H f(A) v_m,    (1)

where the vectors $v_m$ are carefully chosen sums of canonical unit vectors and $M$ is not too large. Various approaches have been suggested and explored in order to keep $M$ small while at the same time achieving good accuracy in (1). This includes approaches based on graph colorings, see [BekasKokiopoulouSaad2007, Endress:2014qpa, TangSaad2012], e.g., and the hierarchical probing techniques from [Stathopoulos2013, LaeuchliStathopoulos2020]. In order for probing to yield good results, the matrix $f(A)$ should exhibit a decay of the moduli of its entries when we move away from the diagonal, since the sizes of the entries farther away from the diagonal determine the accuracy of the approximation. Recent theoretical results in this direction were given in [FrSchiSchw2020]. Lanczos techniques represent another deterministic approximation approach and are investigated in [Bentbib2021, JieSaad2018, Lin2016], e.g. Without giving details, let us just mention that in order to improve their accuracy, deterministic approximation techniques can be combined with the stochastic techniques to be presented in the sequel; see [SaadUbaruJie2017], e.g.

In this paper we deal with the other big class of methods, which aim at breaking the cost barrier using stochastic estimation. In principle, these methods work for any matrix and, for example, do not require a decay away from the diagonal. Our goal is to develop a multilevel Monte-Carlo method to estimate $\mathrm{tr}(f(A))$ stochastically. Our approach can be regarded as a variance reduction technique applied to the classical stochastic “Hutchinson” estimator [Hutchinson90]

\mathrm{tr}(f(A)) \approx \frac{1}{N} \sum_{n=1}^{N} x_n^H f(A) x_n,    (2)

where the components of the random vectors $x_n$ obey an appropriate probability distribution. The variance of the estimator (2) decreases only like $1/N$, which makes the method too costly when higher precisions are to be achieved. The multilevel approach aims at curing this by working with representations of $A$ at different levels. On the higher numbered levels, evaluating the samples becomes increasingly cheap, while on the lower levels, which are more costly to evaluate, the variance is small.

This paper is organized as follows: In Section 2 we recall the general framework of multilevel Monte-Carlo estimators. In Section 3 we discuss Hutchinson’s method for stochastically estimating the trace before turning to our new multilevel approach in Section 4. This section also contains a comparison to known approaches based on deflation as a motivation for why the new multilevel method should provide additional efficiency. Finally, several numerical results are presented in Section 5.

2 Multilevel Monte-Carlo

We discuss the basics of the multilevel Monte-Carlo approach as a variance reduction technique. We place ourselves in a general setting, thereby closely following [Giles2015].

Assume that we are given a probability space with sample space $\Omega$, sigma-algebra $\mathcal{A}$ and probability measure $P$. For a given random variable $Y$, the standard Monte-Carlo approach estimates its expected value $E[Y]$ as the arithmetic mean

\hat{Y} = \frac{1}{N} \sum_{n=1}^{N} Y(\omega_n),    (3)

where the $\omega_n$ are independent events coming from $\Omega$. The variance of this estimator is $V[Y]/N$, so the root mean square deviation has order $1/\sqrt{N}$. This indicates that the number of events has to increase quadratically with the inverse of the required accuracy, which is why, typically, higher accuracies require very high computational effort in this type of Monte-Carlo estimation.

The idea of multilevel Monte-Carlo is to split the random variable $Y$ as a sum

Y = \sum_{\ell=1}^{L} Y_\ell,    (4)

where the random variables $Y_\ell$ are regarded as contributions “at level $\ell$” to $Y$. This gives

E[Y] = \sum_{\ell=1}^{L} E[Y_\ell],

and an unbiased estimator for $E[Y]$ is obtained as

\hat{Y} = \sum_{\ell=1}^{L} \frac{1}{N_\ell} \sum_{n=1}^{N_\ell} Y_\ell\big(\omega_n^{(\ell)}\big),

where the $\omega_n^{(\ell)}$ denote the independent events on each level. The variance of this estimator is

V[\hat{Y}] = \sum_{\ell=1}^{L} \frac{V[Y_\ell]}{N_\ell}.

The idea is that we are able to find a multilevel decomposition of the form (4) in which the cost $C_\ell$ to evaluate $Y_\ell$ is low when the variance $V[Y_\ell]$ is high and vice versa. As is explained in [Giles2015], the solution to the minimization problem which minimizes the total cost $\sum_{\ell=1}^{L} N_\ell C_\ell$ subject to achieving a given target variance,

\min \sum_{\ell=1}^{L} N_\ell C_\ell \quad \text{s.t.} \quad \sum_{\ell=1}^{L} V[Y_\ell]/N_\ell = \epsilon^2,

gives $N_\ell = \mu \sqrt{V[Y_\ell]/C_\ell}$. Here, the Lagrangian multiplier $\mu$ satisfies $\mu = \epsilon^{-2} \sum_{\ell=1}^{L} \sqrt{V[Y_\ell]\, C_\ell}$, and the corresponding minimal total cost is

\epsilon^{-2} \Big( \sum_{\ell=1}^{L} \sqrt{V[Y_\ell]\, C_\ell} \Big)^2.
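To make the allocation rule concrete, the following sketch computes the optimal sample counts from given per-level variances and costs. It is an illustration under our own naming (optimal_samples, V, C, eps); the paper does not provide code.

```python
import numpy as np

def optimal_samples(V, C, eps):
    """Optimal MLMC sample counts N_l for per-level variances V[l] and
    costs C[l], targeting total variance eps**2 (Lagrange multiplier rule)."""
    V, C = np.asarray(V, dtype=float), np.asarray(C, dtype=float)
    mu = np.sum(np.sqrt(V * C)) / eps**2            # Lagrange multiplier
    N = np.ceil(mu * np.sqrt(V / C)).astype(int)    # N_l = mu * sqrt(V_l / C_l)
    return N, np.sum(N * C)                         # counts and total cost

# Example: expensive low-variance fine level, cheap high-variance coarse level
N, cost = optimal_samples(V=[1e-2, 1.0], C=[100.0, 1.0], eps=0.05)
```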

The typical situation is that the contributions on level $\ell$ are given as differences of approximations $Z_\ell$ to $Y$ on the various levels, i.e. we have

Y_\ell = Z_\ell - Z_{\ell+1}, \quad \ell = 1, \ldots, L, \qquad \text{where } Z_1 = Y \text{ and } Z_{L+1} = 0.    (5)

If we assume that the cost to evaluate $Z_\ell$ decreases rapidly with the level $\ell$, the cost $C_\ell$ for evaluating the differences $Y_\ell = Z_\ell - Z_{\ell+1}$ is well approximated by the cost of evaluating $Z_\ell$. The ratio of the total cost encountered when reducing the variance to a given value between multilevel Monte-Carlo (with optimal choice of the $N_\ell$) and standard Monte-Carlo (3) is then given by

\frac{\big( \sum_{\ell=1}^{L} \sqrt{V[Y_\ell]\, C_\ell} \big)^2}{V[Y]\, C_1}.

This is the basic quantitative relation indicating how the costs $C_\ell$ to evaluate the $Z_\ell$ and the variances $V[Y_\ell]$ of the differences have to relate in order for the multilevel approach to be more efficient than standard Monte-Carlo estimation of $E[Y]$.

3 Stochastic estimation of the trace of a matrix

We now assume that we are given, in an indirect manner, a matrix $B \in \mathbb{C}^{n \times n}$ for which we want to compute the trace

\mathrm{tr}(B) = \sum_{i=1}^{n} b_{ii}.

Our basic assumption is that the entries of $B$ are neither available directly nor can they all be obtained at decent computational cost. This is typically the case when $B$ arises as a function of a large (and sparse) matrix $A$, the most common case being the matrix inverse $B = A^{-1}$.

In a seminal paper [Hutchinson90], Hutchinson suggested using a stochastic estimator to approximate $\mathrm{tr}(B)$. The following theorem summarizes his result together with the generalizations on the admissible probability spaces; see [DongLiu1993, Wilcox99], e.g.

Theorem 1

Let $P$ be a probability measure on a sample space $\Omega$ and assume that the components $x_i$ of the vector $x = (x_1, \ldots, x_n)^T \in \mathbb{C}^n$ are random variables depending on $\omega \in \Omega$ satisfying

E[\bar{x}_i x_j] = \delta_{ij} \quad \text{for all } i, j.    (6)

Then

E[x^H B x] = \mathrm{tr}(B).

In particular, if the probability space is such that each component $x_i$ is independent from $x_j$ for $j \neq i$, then

V[x^H B x] = \sum_{i \neq j} \big( |b_{ij}|^2 + b_{ij} \bar{b}_{ji}\, E[\bar{x}_i^2]\, E[x_j^2] \big) + \sum_{i} |b_{ii}|^2 \big( E[|x_i|^4] - 1 \big).

Proof

The proof is simple, but we repeat it here because the literature often treats only the real and not the general complex case. We have

E[x^H B x] = E\Big[ \sum_{i,j} \bar{x}_i b_{ij} x_j \Big] = \sum_{i,j} b_{ij}\, E[\bar{x}_i x_j] = \sum_i b_{ii} = \mathrm{tr}(B),

where the last equality follows from (6). Similarly,

V[x^H B x] = E\big[ |x^H B x|^2 \big] - |\mathrm{tr}(B)|^2 = \sum_i |b_{ii}|^2 \big( E[|x_i|^4] - 1 \big) + \sum_{i \neq j} \sum_{k \neq l} b_{ij} \bar{b}_{kl}\, E[\bar{x}_i x_j x_k \bar{x}_l].    (7)

Since the components are assumed to be independent, we have $E[\bar{x}_i x_j x_k \bar{x}_l] = 0$ except when $i = j = k = l$ (which does not occur in (7)) or $i = k, j = l$ or $i = l, j = k$. This gives

\sum_{i \neq j} \sum_{k \neq l} b_{ij} \bar{b}_{kl}\, E[\bar{x}_i x_j x_k \bar{x}_l] = \sum_{i \neq j} |b_{ij}|^2\, E[|x_i|^2]\, E[|x_j|^2] + \sum_{i \neq j} b_{ij} \bar{b}_{ji}\, E[\bar{x}_i^2 x_j^2],

and in the first sum $E[|x_i|^2]\, E[|x_j|^2] = 1$ by assumption, whereas in the second sum we have $E[\bar{x}_i^2 x_j^2] = E[\bar{x}_i^2]\, E[x_j^2]$ by independence.

Note that as a definition for the variance of a complex random variable $Z$ we used $V[Z] = E[|Z - E[Z]|^2]$ rather than $E[(Z - E[Z])^2]$ to keep it real and non-negative.

Standard choices for the probability spaces are to take $\Omega \subseteq \mathbb{C}^n$ with identically and independently distributed (i.i.d.) components $x_i$ distributed as

x_i = \pm 1, \text{ each with probability } 1/2 \text{ (Rademacher vectors)},    (8)
x_i = (\pm 1 \pm i)/\sqrt{2}, \text{ each with probability } 1/4 \text{ ($\mathbb{Z}_4$ noise)},    (9)
x_i = e^{i\theta}, \ \theta \text{ uniformly distributed in } [0, 2\pi) \text{ (U(1) noise)},    (10)
x_i \sim N(0,1) \text{ (real Gaussian vectors)}.    (11)
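Assuming these four standard choices, sampling them is straightforward; the following sketch (our own helper, for illustration) draws one random vector for each case.

```python
import numpy as np

def sample(n, kind, rng):
    """One random vector x in C^n with i.i.d. components, E[x_i] = 0, E[|x_i|^2] = 1."""
    if kind == "rademacher":   # (8): x_i = +-1, probability 1/2 each
        return rng.choice([-1.0, 1.0], size=n)
    if kind == "z4":           # (9): x_i = (+-1 +- i)/sqrt(2), probability 1/4 each
        return (rng.choice([-1, 1], size=n) + 1j * rng.choice([-1, 1], size=n)) / np.sqrt(2)
    if kind == "u1":           # (10): x_i = e^{i theta}, theta uniform in [0, 2pi)
        return np.exp(2j * np.pi * rng.random(n))
    if kind == "gaussian":     # (11): x_i ~ N(0, 1)
        return rng.standard_normal(n)
    raise ValueError(kind)
```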
Corollary 1

If the components $x_i$ are i.i.d. with distribution (8) or (11), then

V[x^H B x] = \|\mathrm{offdiag}(B)\|_F^2 + \sum_{i \neq j} b_{ij} \bar{b}_{ji} + \big( E[x_i^4] - 1 \big) \sum_i |b_{ii}|^2,

where $\|\cdot\|_F$ denotes the Frobenius norm and $\mathrm{offdiag}(B)$ the offdiagonal part of a matrix; the last term vanishes for (8). If the components are i.i.d. with distribution (9) or (10), then

V[x^H B x] = \|\mathrm{offdiag}(B)\|_F^2.

Proof

For the distributions (8) and (11), the components have only real values and $E[x_i^2] = 1$. Therefore

V[x^H B x] = \sum_{i \neq j} \big( |b_{ij}|^2 + b_{ij} \bar{b}_{ji} \big) + \big( E[x_i^4] - 1 \big) \sum_i |b_{ii}|^2,

with $E[x_i^4] = 1$ for (8) and $E[x_i^4] = 3$ for (11). For the distributions (9) and (10) we have $E[x_i^2] = 0$ and $E[|x_i|^4] = 1$, and thus

V[x^H B x] = \sum_{i \neq j} |b_{ij}|^2 = \|\mathrm{offdiag}(B)\|_F^2.

In a practical situation where we approximate $\mathrm{tr}(B)$ by averaging over $N$ samples, we can compute the root mean square deviation along with the averages and rely on the law of large numbers to assess the probability that the computed mean lies within the $1\sigma$, $2\sigma$ or $3\sigma$ interval. Several results on Hutchinson’s method have been formulated which go beyond these asymptotic aspects by giving tail or concentration bounds; see [AvronToledo2011, CortinovisKressner2020, roosta2015improved], e.g. For the sake of illustration we here report a summary of these results as given in [meyer2021hutch++]. In our numerical examples, we will simply work with the root mean square deviation to assess accuracy.
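As an illustration, here is a minimal sketch of Hutchinson's estimator for tr(A^{-1}) with exactly this root-mean-square-deviation based stopping; the Rademacher noise and the sparse direct solver are our illustrative choices, not prescriptions of the paper.

```python
import numpy as np
from scipy.sparse.linalg import spsolve

def hutchinson_trace_inv(A, tol, rng, min_samples=5):
    """Estimate tr(A^{-1}) via the estimator (13); stop once the root mean
    square deviation of the sample mean drops below the absolute tolerance tol."""
    samples = []
    while True:
        x = rng.choice([-1.0, 1.0], size=A.shape[0])  # Rademacher noise, cf. (8)
        samples.append(x @ spsolve(A.tocsc(), x))     # one linear solve per sample
        if len(samples) >= min_samples:
            rms = np.std(samples, ddof=1) / np.sqrt(len(samples))
            if rms <= tol:
                return np.mean(samples), rms, len(samples)
```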

Theorem 2

Let the distribution for the i.i.d. components of the random vectors $x_n$ be sub-Gaussian, and let $\epsilon, \delta \in (0, 1)$. Then for $N = O(\log(1/\delta)/\epsilon^2)$ we have that the probability for

\Big| \frac{1}{N} \sum_{n=1}^{N} x_n^H B x_n - \mathrm{tr}(B) \Big| \leq \epsilon \|B\|_F    (12)

is at least $1 - \delta$.

Note that if $B$ is symmetric positive semidefinite with $\lambda_1, \ldots, \lambda_n$ denoting its (non-negative) eigenvalues, then

\|B\|_F = \Big( \sum_{i=1}^{n} \lambda_i^2 \Big)^{1/2} \leq \sum_{i=1}^{n} \lambda_i = \mathrm{tr}(B),

implying that (12) yields a (probabilistic) relative error bound for the trace. Also note that the real distributions (8) and (11) are sub-Gaussian; see [meyer2021hutch++].

4 Multilevel Monte-Carlo for the trace of the inverse

We now turn to the situation where we want to estimate $\mathrm{tr}(A^{-1})$ for a large and sparse matrix $A$. Direct application of Theorem 1 shows that an unbiased estimator for $\mathrm{tr}(A^{-1})$ is given by

\frac{1}{N} \sum_{n=1}^{N} x_n^H A^{-1} x_n,    (13)

where the vectors $x_n$ are independent random variables satisfying (6), and that its variance is $\frac{1}{N}\, V[x^H A^{-1} x]$ with $V[x^H A^{-1} x]$ as given in Corollary 1 for $B = A^{-1}$,

depending on whether the components of $x_n$ satisfy (8), (11) or (9), (10), respectively.

Each time we add a sample to (13) we have to solve a linear system with matrix $A$ and right hand side $x_n$, and the cost for solving these linear systems determines the cost for each stochastic estimate. For a large class of matrices, multigrid methods represent particularly efficient linear solvers. We assume that this is the case for our matrix $A$ and now describe how to derive a multilevel Monte-Carlo method for the approximation of $\mathrm{tr}(A^{-1})$ which uses the multigrid hierarchy not only for the linear solver, but also to obtain a good representation (5) as required for a multilevel Monte-Carlo approach.

4.1 Derivation of a multilevel Monte-Carlo method

Multigrid methods rely on the interplay between a smoothing iteration and a coarse grid correction which are applied alternatingly. In the geometric interpretation, where we view components of vectors as representing a continuous function on a discrete grid, the smoother has the property to make the error of the current iterate smooth, i.e. varying slowly from one grid point to the next. Such error can be represented accurately by a coarser grid, and the coarse grid correction solves for this coarse error on the coarse grid using a coarse grid representation of the matrix. The solution is then interpolated back to the original “fine” grid and applied as a correction to the iterate. The principle can be applied recursively using a sequence of coarser grids with corresponding operators, the solves on the coarsest grid being obtained by direct factorization.
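To fix ideas, a bare-bones two-grid cycle of this kind might look as follows; the weighted Jacobi smoother, the parameter values and the exact coarse solve are illustrative assumptions, not the solver used in the paper.

```python
import numpy as np
from scipy.sparse.linalg import spsolve

def two_grid(A, Ac, P, R, b, x, nu=2, omega=0.8):
    """One two-grid cycle for A x = b with coarse matrix Ac = R A P."""
    Dinv = 1.0 / A.diagonal()
    for _ in range(nu):                    # pre-smoothing (weighted Jacobi)
        x = x + omega * Dinv * (b - A @ x)
    r = R @ (b - A @ x)                    # restrict the residual
    x = x + P @ spsolve(Ac.tocsc(), r)     # coarse grid correction, interpolated back
    for _ in range(nu):                    # post-smoothing
        x = x + omega * Dinv * (b - A @ x)
    return x
```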

To obtain a multilevel Monte-Carlo decomposition we discard the smoother and only consider the coarse grid operators and the intergrid transfer operators, which we now describe algebraically. The coarse grid operators are given by a sequence of matrices

A_\ell \in \mathbb{C}^{n_\ell \times n_\ell}, \quad \ell = 1, \ldots, L, \qquad A_1 = A, \; n_1 = n,

representing the original matrix on the different levels $\ell$; the prolongation and restriction operators

P_\ell \in \mathbb{C}^{n_\ell \times n_{\ell+1}}, \quad R_\ell \in \mathbb{C}^{n_{\ell+1} \times n_\ell}, \quad \ell = 1, \ldots, L-1,

transfer data between the levels. Typically, when $A$ is Hermitian, one takes $R_\ell = P_\ell^H$, and for given $P_\ell$ and $R_\ell$ the coarse system matrices are often constructed using the Petrov-Galerkin approach

A_{\ell+1} = R_\ell A_\ell P_\ell, \quad \ell = 1, \ldots, L-1.

Using the accumulated prolongation and restriction operators

\hat{P}_\ell = P_1 \cdots P_{\ell-1} \in \mathbb{C}^{n \times n_\ell}, \qquad \hat{R}_\ell = R_{\ell-1} \cdots R_1 \in \mathbb{C}^{n_\ell \times n},

where we put $\hat{P}_1 = \hat{R}_1 = I$ by convention, we regard $\hat{P}_\ell A_\ell^{-1} \hat{R}_\ell$ as the approximation to $A^{-1}$ at level $\ell$. We thus obtain a multilevel decomposition for the trace as

\mathrm{tr}(A^{-1}) = \sum_{\ell=1}^{L-1} \mathrm{tr}\big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell - \hat{P}_{\ell+1} A_{\ell+1}^{-1} \hat{R}_{\ell+1} \big) + \mathrm{tr}\big( \hat{P}_L A_L^{-1} \hat{R}_L \big).    (14)

This gives the level contributions

Y_\ell = x^H \big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell - \hat{P}_{\ell+1} A_{\ell+1}^{-1} \hat{R}_{\ell+1} \big) x, \quad \ell = 1, \ldots, L-1, \qquad Y_L = x^H \hat{P}_L A_L^{-1} \hat{R}_L x,

with the components of $x \in \mathbb{C}^n$ being i.i.d. stochastic variables satisfying (6). The unbiased multilevel Monte-Carlo estimator is then

\mathrm{tr}(A^{-1}) \approx \sum_{\ell=1}^{L-1} \frac{1}{N_\ell} \sum_{n=1}^{N_\ell} \big( x_n^{(\ell)} \big)^H \big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell - \hat{P}_{\ell+1} A_{\ell+1}^{-1} \hat{R}_{\ell+1} \big) x_n^{(\ell)} + \frac{1}{N_L} \sum_{n=1}^{N_L} \big( x_n^{(L)} \big)^H \hat{P}_L A_L^{-1} \hat{R}_L x_n^{(L)},

where the vectors $x_n^{(\ell)}$ are stochastically independent samples of random vectors satisfying (6).
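A compact sketch of this estimator for a given hierarchy is shown below. For simplicity it uses exact sparse solves on every level, real Rademacher noise and prescribed sample counts N[l], and it computes the last term of (14) directly (see Remark 2 below); all names are ours.

```python
import numpy as np
from scipy.sparse.linalg import spsolve

def mlmc_trace_inv(As, Phat, Rhat, N, rng):
    """Multilevel Monte-Carlo estimate of tr(A^{-1}) following (14).
    As[l]   : level matrices A_1, ..., A_L (sparse)
    Phat[l] : accumulated prolongations  \\hat P_l  (Phat[0] = identity)
    Rhat[l] : accumulated restrictions   \\hat R_l  (Rhat[0] = identity)
    N[l]    : number of samples for level difference l."""
    L, n = len(As), As[0].shape[0]
    est = 0.0
    for l in range(L - 1):                    # difference terms in (14)
        acc = 0.0
        for _ in range(N[l]):
            x = rng.choice([-1.0, 1.0], size=n)
            fine = x @ (Phat[l] @ spsolve(As[l].tocsc(), Rhat[l] @ x))
            coarse = x @ (Phat[l + 1] @ spsolve(As[l + 1].tocsc(), Rhat[l + 1] @ x))
            acc += fine - coarse
        est += acc / N[l]
    RP = (Rhat[L - 1] @ Phat[L - 1]).toarray()  # last term of (14), computed directly
    est += np.trace(np.linalg.solve(As[L - 1].toarray(), RP))
    return est
```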

The following remarks collect some important observations about this stochastic estimator.

Remark 1

Computationally, the estimator requires solving systems of the form $A_\ell y = \hat{R}_\ell x$ for $\ell = 1, \ldots, L$. Since the matrices $A_\ell$ arise from the multigrid hierarchy, we directly have a multigrid method available for these systems by restricting the method for $A$ to the levels $\ell, \ldots, L$.

Remark 2

Since for any two matrices $C \in \mathbb{C}^{m \times k}$ and $D \in \mathbb{C}^{k \times m}$ the trace of their product does not depend on the order,

\mathrm{tr}(CD) = \mathrm{tr}(DC),    (15)

we have

\mathrm{tr}\big( \hat{P}_L A_L^{-1} \hat{R}_L \big) = \mathrm{tr}\big( A_L^{-1} \hat{R}_L \hat{P}_L \big).

So, instead of estimating the contribution $\mathrm{tr}(\hat{P}_L A_L^{-1} \hat{R}_L)$ in (14) stochastically, we can also compute it directly by inverting the matrix $A_L$ and computing the product $A_L^{-1} (\hat{R}_L \hat{P}_L)$. Note that $\hat{R}_L$ and $\hat{P}_L$ are usually sparse with a maximum of $s$, say, non-zero entries per row. The arithmetic work is thus of order $O(s\, n_L\, n)$ for the product $\hat{R}_L \hat{P}_L$ plus $O(n_L^3)$ for the inversion of $A_L$ and the product $A_L^{-1} (\hat{R}_L \hat{P}_L)$. Since the variance of $x^H \hat{P}_L A_L^{-1} \hat{R}_L x$ is presumably large, this direct computation can be much more efficient than a stochastic estimation, even when we aim at only quite low precision in the stochastic estimate.
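In code, this direct evaluation takes only a few lines (same illustrative assumptions as in the sketches above):

```python
import numpy as np

def coarsest_trace(AL, Rhat_L, Phat_L):
    """tr(P_hat_L A_L^{-1} R_hat_L) = tr(A_L^{-1} (R_hat_L P_hat_L)), cf. (15)."""
    RP = (Rhat_L @ Phat_L).toarray()      # sparse product, then dense n_L x n_L
    return np.trace(np.linalg.solve(AL.toarray(), RP))
```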

Remark 3

There are situations where $\hat{R}_\ell \hat{P}_\ell = I$, for example in aggregation based multigrid methods, where the columns of $P_\ell$ are orthonormal and $R_\ell = P_\ell^H$; see [Braess95, Brezinaetal2005]. Then

\mathrm{tr}\big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell \big) = \mathrm{tr}\big( A_\ell^{-1} \big)

and

\mathrm{tr}\big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell - \hat{P}_{\ell+1} A_{\ell+1}^{-1} \hat{R}_{\ell+1} \big) = \mathrm{tr}\big( A_\ell^{-1} - P_\ell A_{\ell+1}^{-1} R_\ell \big).

This means that instead of the multilevel decomposition (14) we can use

\mathrm{tr}(A^{-1}) = \sum_{\ell=1}^{L-1} \mathrm{tr}\big( A_\ell^{-1} - P_\ell A_{\ell+1}^{-1} R_\ell \big) + \mathrm{tr}\big( A_L^{-1} \big),

in which the stochastic estimation on level $\ell$ now involves random vectors from $\mathbb{C}^{n_\ell}$ instead of $\mathbb{C}^n$.

4.2 Discussion of the multilevel Monte-Carlo method

A profound analysis of the proposed multilevel Monte-Carlo method must take the approximation properties of the representation of the matrix at the various levels into account. This is highly problem dependent, so that in this paper we only provide a discussion of heuristics on why the proposed approach has the potential to yield efficient multilevel Monte-Carlo schemes.

To simplify the discussion to follow, let us assume that the variance of the estimator at level $\ell$ is given by the square of the Frobenius norm of the off-diagonal part. This is the case, for example, if the components are i.i.d. with distribution (9) or (10); see Corollary 1. This Frobenius norm can be related to the singular values of the matrix involved. Recall that the singular value decomposition of a non-singular matrix $A \in \mathbb{C}^{n \times n}$ is

A = \sum_{i=1}^{n} \sigma_i u_i v_i^H    (16)

with left singular vectors $u_i$, right singular vectors $v_i$ and positive singular values $\sigma_1 \leq \sigma_2 \leq \cdots \leq \sigma_n$, which we ordered by increasing value for convenience here. In the following we base all our discussion on singular values and vectors. It is therefore worthwhile to mention that in the case of a Hermitian matrix $A$ this discussion simplifies in the sense that then the singular values are the moduli of the eigenvalues, and left and right singular vectors are identical and coincide with the eigenvectors.

Lemma 1

Let $A \in \mathbb{C}^{n \times n}$ have singular values $\sigma_1, \ldots, \sigma_n$. Then

\|\mathrm{offdiag}(A)\|_F^2 = \sum_{i=1}^{n} \sigma_i^2 - \sum_{i=1}^{n} |a_{ii}|^2.    (17)

Proof

The equality $\|A\|_F^2 = \sum_{i=1}^{n} \sigma_i^2$ is a basic fact from linear algebra; see [GvL], e.g. The formula (17) uses this and corrects for the vanishing diagonal part in $\mathrm{offdiag}(A)$.
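The identity (17) is easy to check numerically; a small self-contained illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
sigma = np.linalg.svd(A, compute_uv=False)
lhs = np.linalg.norm(A - np.diag(np.diag(A)), "fro")**2   # ||offdiag(A)||_F^2
rhs = np.sum(sigma**2) - np.sum(np.abs(np.diag(A))**2)
assert np.isclose(lhs, rhs)
```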

For the trace of the inverse we have

\|\mathrm{offdiag}(A^{-1})\|_F^2 = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} - \sum_{i=1}^{n} |(A^{-1})_{ii}|^2,    (18)

since the reciprocals $1/\sigma_i$ of the singular values of $A$ are the singular values of $A^{-1}$. Therefore, in a simplified manner (disregarding the second term in (18)), it appears that the small singular values of $A$ are those which contribute most to the variance of the Hutchinson estimator (13) for $\mathrm{tr}(A^{-1})$. In high performance computing practice, deflation has thus become a common tool to reduce the variance; see [DeGrand:2004qw, Endress:2014qpa, Gambhir_2017, Giusti:2004yp], e.g.: One precomputes the $k$, say, smallest singular values of $A$ in the singular value decomposition (16) together with their left singular vectors $u_1, \ldots, u_k$. With the orthogonal projector

\Pi = I - U U^H, \qquad U = [u_1 | \cdots | u_k],    (19)

we now have $\mathrm{tr}(A^{-1}) = \mathrm{tr}(\Pi A^{-1}) + \mathrm{tr}((I - \Pi) A^{-1})$ with

A^{-1} = \Pi A^{-1} + (I - \Pi) A^{-1}.    (20)

This shows that in $\Pi A^{-1}$ we have deflated the small singular values of $A$, so that we can expect a reduction of the variance when estimating the trace of this part stochastically. The trace of the second part is equal to the sum $\sum_{i=1}^{k} u_i^H A^{-1} u_i$ (see (15)), and $A^{-1} u_i = \sigma_i^{-1} v_i$. So the second part can be computed directly from the singular triplets $(\sigma_i, u_i, v_i)$ computed for the deflation. If $A$ is Hermitian, the deflation approach simplifies and amounts to precomputing the $k$ smallest eigenpairs. We refer to the results in [Gambhir_2017] for a more in-depth analysis and discussion about the heuristics just presented.
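A sketch of this exact deflation for the Hermitian case, where eigenpairs take the role of the singular triplets; the eigensolver call and the fixed sample count are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import eigsh, spsolve

def deflated_hutchinson(A, k, N, rng):
    """Estimate tr(A^{-1}) for Hermitian A, deflating the k smallest eigenpairs.
    Stochastic part: tr(Pi A^{-1}); direct part: sum_i 1/lambda_i."""
    lam, U = eigsh(A, k=k, which="SM")          # k smallest eigenpairs
    direct = np.sum(1.0 / lam)                  # tr((I - Pi) A^{-1})
    acc = 0.0
    for _ in range(N):
        x = rng.choice([-1.0, 1.0], size=A.shape[0])
        y = spsolve(A.tocsc(), x)               # y = A^{-1} x
        acc += x @ (y - U @ (U.T @ y))          # x^H Pi A^{-1} x with Pi = I - U U^H
    return direct + acc / N
```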

The deflation approach is still quite costly, since one has to precompute the singular values and vectors, and if the size of the matrix increases, it is likely that we have to increase $k$ to maintain the same reduction in the variance. Approximate deflation has thus been put forward as an alternative, see [Balietal2015, Romero_2020], where one can use larger values for $k$ while at the same time allowing that the contribution of the small singular values to the variance is eliminated only approximately. One then replaces $\Pi$ by a more general (oblique) projector of the form

\Pi = I - U (V^H U)^{-1} V^H,

where now $U, V \in \mathbb{C}^{n \times k}$ can be regarded as containing approximate left and right singular vectors, respectively, as their columns. Actually, it is sufficient that their ranges are spanned by such approximations to left and right singular vectors, since the construction of $\Pi$ is invariant under transformations $U \to UX$, $V \to VY$ with non-singular matrices $X, Y \in \mathbb{C}^{k \times k}$. In the decomposition $\mathrm{tr}(A^{-1}) = \mathrm{tr}(\Pi A^{-1}) + \mathrm{tr}((I - \Pi) A^{-1})$ we now have, again using (15),

\mathrm{tr}\big( (I - \Pi) A^{-1} \big) = \mathrm{tr}\big( (V^H U)^{-1} V^H A^{-1} U \big).

If $k$ is relatively small, the second trace can be computed directly as in the exact deflation approach. If we take larger values for $k$, we can estimate it stochastically. The inexact deflation approach then becomes a two-level Monte-Carlo method.

If we look at our multilevel Monte-Carlo decomposition (14) with just two levels, then it differs from inexact deflation in that the value for $k$ is now very large, namely the dimension $n_2$ of the grid at level 2, which usually is a fixed fraction of $n$. The matrix $U$ spanning the approximate left singular vectors is replaced by the prolongation operator $P_1$, and $V^H$ corresponds to the restriction operator $R_1$. The multigrid construction principle should ensure that the range of $P_1$ contains good approximations to left singular vectors belonging to small singular values, and similarly for $R_1^H$ with respect to right singular vectors. This is why the variance reduction can be expected to be efficient. We thus have a large value of $k$, proportional to $n$, which targets a high reduction of the variance of the first term. The second term involves the second level matrix representation, which is still of large size, and its trace estimator will, in addition, still have large variance. This is the reason why we extend the approach to involve many levels, ideally until a level where we can compute the trace directly, so that we do not suffer from a potentially high variance of a stochastic estimator.

To conclude this discussion, we note that several other techniques for variance reduction have been suggested which can also be regarded as two-level Monte-Carlo techniques. For example, [liu2014polynomial, baral2019] take a decomposition $A^{-1} = p(A) + (A^{-1} - p(A))$ with an appropriately chosen polynomial $p$ and then estimate $\mathrm{tr}(A^{-1} - p(A))$ stochastically. The “truncated solver” method of [Alexandrou_2014] follows a related idea by subtracting an approximation to the inverse. A similar decomposition with $p(A)$ being a truncated Chebyshev series approximation was considered in [Hanetal2017, Han2015, SaadUbaru2017], for example, in which case $\mathrm{tr}(A^{-1} - p(A))$ is actually neglected. The work then resides in the stochastic estimation of $\mathrm{tr}(p(A))$, thus avoiding the solution of linear systems.

Finally, we refer to [meyer2021hutch++] for a recent further variance reduction technique for Hutchinson’s method, enhancing it by using vectors of the form $(I - QQ^H)x$ with random vectors $x$, where the columns of $Q$ span a low-rank approximation to $B$ obtained from a randomized sketch.

5 Numerical results

We consider three classes of matrices: The standard discrete 2d Laplacian, the 2d gauge Laplacian, and the Schwinger model. These three classes represent an increasingly complex hierarchy of problems which will eventually lead to our final, though yet unreached target, the Wilson-Dirac matrix arising in lattice QCD. The improvements of the multilevel approach compared to “plain” Hutchinson (13) are tremendous and typically reach two orders of magnitude or more. This is why we compare against deflated Hutchinson, where we deflate the $k$ smallest eigenpairs of the matrix $A$. With $U$ holding the respective eigenvectors in its columns, we use the projector $\Pi = I - UU^H$ as in (19), resulting in the decomposition (20). Therein we estimate $\mathrm{tr}(\Pi A^{-1})$ with the Hutchinson estimator, whereas $\mathrm{tr}((I - \Pi)A^{-1})$ is obtained directly from the deflated eigenpairs. We always performed a rough scan to determine a number $k$ of deflated eigenpairs which is close to time-optimal. The deflated Hutchinson approach usually gains at least one order of magnitude in time and arithmetic cost over plain Hutchinson.

All our computations were done on a single thread of an Intel Xeon Processor E5-2699 v4, with a MATLAB R2021a implementation of our numerical experiments for the 2d Laplacian, and in Python for our tests with the gauge Laplacian and the Schwinger model. By default we aimed at a prescribed relative accuracy $\epsilon$. This is done as follows: We first perform five stochastic estimates, take their mean and subtract their root mean square deviation, giving the value $t$. In the deflated Hutchinson method we now perform stochastic estimates as long as their root mean square deviation exceeds the target value $\rho = \epsilon t$. For the multilevel Monte-Carlo method we have to prescribe a value $\rho_\ell$ for the root mean square deviations for the stochastic estimation of each of the traces

\mathrm{tr}\big( \hat{P}_\ell A_\ell^{-1} \hat{R}_\ell - \hat{P}_{\ell+1} A_{\ell+1}^{-1} \hat{R}_{\ell+1} \big), \quad \ell = 1, \ldots, L-1,    (21)

from (14), while we always compute the last term in (14) non-stochastically as $\mathrm{tr}(A_L^{-1} \hat{R}_L \hat{P}_L)$, inverting $A_L$ explicitly. The requirement is to have

\sum_{\ell=1}^{L-1} \rho_\ell^2 = \rho^2,

so the obvious approach is to put $\rho_\ell = \rho/\sqrt{L-1}$ for all $\ell$, and this is what we do in our first two examples. It might be advantageous, though, to allow for a larger value of $\rho_\ell$ on those level differences where the cost is high, and we do so in Example 3.

For each stochastic estimate for (21) we have to solve linear systems with the matrices $A_\ell$ and $A_{\ell+1}$. This is done using a multigrid method based on the same prolongations $P_\ell$, restrictions $R_\ell$ and coarse grid operators $A_\ell$ that we use to obtain our multilevel decomposition (14). However, when multigrid is used as a solver, we use the full hierarchy going down to coarse grids of very small sizes, whereas in the multilevel decomposition (14) used in multilevel Monte-Carlo we might stop at an earlier level.

For all experiments we report mainly two quantities. The first is the number of stochastic estimates that are performed at each level difference (21) for multilevel Monte-Carlo, together with the number of stochastic estimates in deflated Hutchinson (which always require linear solves at the finest level). These numbers may be interpreted as illustrating how multilevel Monte-Carlo moves the higher variances to the coarser level differences. As a second quantity, we report the approximate arithmetic cost for both methods, deflated Hutchinson and multilevel Monte-Carlo, which we obtain using the following cost model: For every matrix-vector product of the form $Cv$ we assume a cost of $\mathrm{nnz}(C)$, the number of nonzeros in $C$. In this manner, one unit in the cost model roughly corresponds to a multiplication plus an addition. This applies to the computation of residuals, of prolongations and restrictions and the coarsest grid solve in the multigrid solver, as well as to the “global” restrictions and prolongations used in each stochastic estimate in multilevel Monte-Carlo. For the latter method, we also count the work for the direct computation of the trace at the coarsest level, which involves the inversion of the coarsest grid matrix and additional matrix-matrix products. This cost model neglects only vector-vector and scalar operations and is thus considered sufficiently accurate for our purposes.

Example 1

The discrete 2d Laplacian is the $N^2 \times N^2$ matrix

A = I \otimes T + T \otimes I, \qquad T = \mathrm{tridiag}(-1, 2, -1) \in \mathbb{R}^{N \times N},

which results from the finite difference approximation of the Laplace operator on an equidistant grid with $N \times N$ interior points in the unit square with Dirichlet boundary conditions. Note that the eigenvalues of $A$ are explicitly known,

\lambda_{j,k} = 4 - 2\cos\tfrac{j\pi}{N+1} - 2\cos\tfrac{k\pi}{N+1}, \qquad j, k = 1, \ldots, N,

so the trace of the inverse is, in principle, directly available as the sum of the inverses of the eigenvalues.
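For instance, with the matrix as given above (unscaled five-point stencil), the reference value tr(A^{-1}) is obtained from the eigenvalues in a few lines; this helper is ours, for illustration.

```python
import numpy as np

def trace_inv_laplace2d(N):
    """tr(A^{-1}) for the N^2 x N^2 discrete 2d Laplacian via its known eigenvalues."""
    j = np.arange(1, N + 1)
    lam1d = 2.0 - 2.0 * np.cos(np.pi * j / (N + 1))  # eigenvalues of tridiag(-1,2,-1)
    lam2d = lam1d[:, None] + lam1d[None, :]          # lambda_{j,k} = lam_j + lam_k
    return np.sum(1.0 / lam2d)
```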

2d Laplace

$N$ (grid size)   $k$ (deflated eigenpairs)   $L$ (levels)
63                92                          3
127               44                          4
255               64                          5
511               76                          6

Table 1: Parameters and quantities for Example 1

For the multigrid hierarchy we choose $P_\ell$ to be the standard bilinear interpolation from a grid of size $N_{\ell+1} \times N_{\ell+1}$ to one of size $N_\ell \times N_\ell$; see [Trottenberg2000]. Here, the number of grid points in one dimension on level $\ell+1$ is recursively given as $N_{\ell+1} = (N_\ell - 1)/2$, with $N_1 = N$.
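A compact way to set up such a hierarchy (linear interpolation in 1d, bilinear via a Kronecker product in 2d, Galerkin coarse operators with R = P^T) is sketched below; it follows the standard construction and is not the paper's code.

```python
import scipy.sparse as sp

def prolongation_1d(N_fine):
    """Linear interpolation from N_coarse = (N_fine - 1)//2 points to N_fine points."""
    Nc = (N_fine - 1) // 2
    P = sp.lil_matrix((N_fine, Nc))
    for j in range(Nc):
        i = 2 * j + 1              # coarse point j sits at fine point i
        P[i - 1, j] = 0.5
        P[i, j] = 1.0
        P[i + 1, j] = 0.5
    return P.tocsr()

def hierarchy(A, N, L):
    """Galerkin hierarchy A_{l+1} = P_l^T A_l P_l with bilinear interpolation."""
    As, Ps = [A], []
    for _ in range(L - 1):
        P = sp.kron(prolongation_1d(N), prolongation_1d(N)).tocsr()  # bilinear in 2d
        As.append((P.T @ As[-1] @ P).tocsr())
        Ps.append(P)
        N = (N - 1) // 2
    return As, Ps
```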