The behaviors of scientific phenomena and complex engineered systems are often characterized via matrices. The entries of these matrices may reflect interactions between variables, edges between vertices in a graph, or the local geometry of a complicated surface. When the systems under consideration become large, these matrices also become large and it rapidly becomes uninformative — and potentially infeasible — to investigate the individual entries. Instead, we frame our questions about these matrices in terms of aggregate properties that are insensitive to irrelevant details.
To this end, the eigenvalues and eigenvectors of these matrices often provide a succinct way to identify insightful global structure of complex problems. The eigenvalues of the Hessian matrix in an optimization problem help us understand whether the problem is locally convex or has saddle points. The influential PageRank algorithm(Brin and Page, 1998) examines the principal eigenvector of a graph for the most “important” vertices. The sum of the eigenvalues of an adjacency matrix is closely related to the number of triangles in a graph, and insights from the eigenvalues of the Laplacian matrix have given rise to the field of spectral graph theory (see, e.g., Chung (1997)). In dynamical systems, numerical algorithms, and control, eigenstructure helps us analyze local stability and typical behaviors (Luenberger, 1979; Bertsekas, 2016).
The problem of estimating the eigenvalue distribution of large matrices has a particularly long history in fields such as condensed matter physics and quantum chemistry, where one often wishes to estimate the density of states (DOS). The stochastic methods we develop in this paper build upon these decades of work, which we briefly review here. One of the core components of our procedure is the use of a stochastic estimator for the trace of a matrix by constructing a quadratic form with a random vector. This procedure is generally attributed to Hutchinson (1989), but it was coincidentally introduced in Skilling (1989). This latter work by Skilling sketches out an approach to eigenvalue estimation based on generalized traces and Chebyshev polynomials that strongly influences the present paper. Although Lanczos and Chebyshev polynomial expansions had been used for many years in physics and quantum chemistry (see, e.g., Haydock et al. (1972), Tal-Ezer and Kosloff (1984)), the combination of polynomial expansion and randomized trace estimation as in Skilling (1989) led to the kernel polynomial method for estimation the density of states in Silver and Röder (1994) and Silver et al. (1996), in which a smoothing kernel is convolved with the spectral density in order to reduce the Gibbs phenomenon. See Weiße et al. (2006) for a review of the kernel polynomial method and a broader historical perspective on these estimation techniques in physics and quantum chemistry. Outside of the kernel polynomial method specifically, these methods have also found use in quantum chemistry for stochastic Kohn-Sham density functional theory (Baer et al., 2013).
More broadly, these estimators have been used to perform estimation of the density of states, e.g., the spectral density of Hamiltonian systems. For example, Drabold and Sankey (1993)
used moment-constrained random vectors within the trace estimator to form estimates of the DOS based on the maximum entropy principle.Jeffrey and Smith (1997) took a related approach and used the trace estimator to compute first and second moments with which a Gaussian can be fit. Zhu et al. (1994) and Parker et al. (1996) used a truncated Chebyshev approximation to the delta function to estimate the spectral density of a Hamiltonian.
Motivated in part by these physics applications, a broader literature has developed around randomized trace estimators applied to polynomial expansions. For example, Bai et al. (1996), Bai and Golub (1996), and Golub and Meurant (1994) developed a set of fundamentally similar techniques using a quadrature-based view on the estimation of generalized traces, proving upper and lower bounds on the estimates. These techniques are closely related to those presented here, although we develop unbiased estimates of the matrix function in addition to the unbiased trace estimator, and address the case where the matrix-vector product is itself noisy. We also view the problem as one of estimating the entire spectral density, rather than the efficient estimation of a single scalar. Avron and Toledo (2011) also studied theoretical properties of the trace estimator, and developed convergence bounds, later extended by Roosta-Khorasani and Ascher (2015).
Lin et al. (2016) provides an excellent survey of different techniques for estimating the spectral density of a real symmetric matrix using matrix-vector products, including the Chebyshev expansions and kernel smoothing built upon in the present paper. Lin (2017) proposes variations of these techniques that have better convergence characteristics when the target matrix is approximately low rank. Xi et al. (2017) performs a theoretical and empirical investigation of similar ideas for the generalized eigenvalue problem. Closely related to the general DOS problem is the estimation of the number of eigenvalues in a particular interval, as such estimates can be used to construct histograms, and Di Napoli et al. (2016) proposes a class of KPM-like techniques for achieving this.
One function of the eigenspectrum that is of particular interest is the determinant; the log determinant can be viewed as a generalized trace under the matrix logarithm, as described in the proceeding section. To our knowledge, randomized polynomial-based estimates of the log determinant were first introduced by Skilling (1989), but were later rediscovered by Bai et al. (1996) and Barry and Pace (1999) and then again by Han et al. (2015) and Boutsidis et al. (2015), albeit with more theoretical investigation. An alternative randomized scheme for the determinant was proposed in Saibaba et al. (2016). Finally, there have also been recent proposals that incorporate additional structural assumptions in forming such estimates (Peng and Wang, 2015; Fitzsimons et al., 2017b, a).
Similarly, matrix inversion can be viewed as a transformation of the eigenspectrum and can be approximated via polynomials. The Neumann series is one classic example—essentially the Taylor series of the matrix inverse—when the spectral radius is less than one. This approximation has been used for polynomial preconditioning, originally proposed in Rutishauser (1959), although see also Johnson et al. (1983) and Saad (1985). More recently, these series have been proposed as an efficient way to solve linear systems with the Hessian in support of second-order stochastic optimization (Agarwal et al., 2016), using a very similar procedure to that described in Section 3 but with Neumann series.
Although this history represents a rich literature on randomized estimation of the eigenspectra of implicit matrices, this paper makes several distinct contributions. First, we introduce a randomized variation of the Chebyshev polynomial expansion that leads to provably unbiased estimates when combined with the stochastic trace estimator. Second, we show that such estimators can be generalized to the case where the matrix-vector product is itself randomized, which is of critical interest to stochastic optimization methods for large scale problems in statistics and machine learning. Unbiasedness is preserved in this case using generalizations of the Poisson estimator (Wagner, 1987; Fearnhead et al., 2010; Papaspiliopoulos, 2011). Third, we introduce a new order-independent smoothing kernel for Chebyshev polynomials, with appealing properties. Finally, we apply this ensemble of techniques at large scale to matrices with known spectra, in order to validate the methods.
2 Problem Formulation
The matrix of interest is assumed to be real, square, symmetric, and and of dimension , i.e., where . Many of the results in this paper generalize to complex and asymmetric matrices, but the real symmetric case is of primary interest for Hessians and undirected graphs. The primary challenge to be addressed is that
can be very large. For example, the modern stochastic optimization problems arising from the fitting of neural networks may have tens of billions of variables(Coates et al., 2013) and the graph of web pages has on the order of a trillion vertices (Alpert and Hajaj, 2008). When is this large, it cannot be explicitly represented as a matrix, but can only be indirectly accessed via matrix-vector products of the form , where we can choose . An additional challenge arises in the case of large scale model fitting, where the loss surface may be a sum of billions of terms, and so even cannot be feasibly computed. Instead, an unbiased but randomized estimator is formed from a “mini-batch” of data, as in stochastic optimization (Robbins and Monro, 1951).
We will assume that all of the eigenvalues of are within the interval , something achievable by dividing by its operator norm plus a small constant. With this assumption, the abstraction we choose for investigating is the generalized trace , where is an operator function that generalizes a scalar function to square matrices. We take to have a Chebyshev expansion :
where the functions are the Chebyshev polynomials of the first kind. These are defined via the recursion
Chebyshev approximations are appealing as they are optimal under the norm. Other polynomial expansions are optimal under different function norms, but this generalization is not considered here. We can generalize the Chebyshev polynomials to square matrices using the recursion
Let be diagonalizable into with orthonormal and diagonal . Then the matrix Chebyshev polynomial applies the scalar Chebyshev polynomial to each of the eigenvalues of .
The proposition is true by inspection for and . We proceed inductively using the two-term recursion. Assume that the proposition is true for and , then
Applying the Chebyshev recursion:
When has Chebyshev coefficients , we thus define as
This matrix function can now be seen to be simply applying to the eigenvalues of . Returning to the generalized trace, we see that it corresponds to first applying to each eigenvalue and then summing them:
where the are the eigenvalues of . By computing the traces of the Chebyshev polynomials, we can then can choose different to ask different questions about the spectra of . In the proceeding sections, we will describe the construction of estimators such that .
The generalized trace is a surprisingly flexible tool for interrogating the spectrum. In particular, it enables us to look at the eigenvalues via the normalized spectral measure, which is a sum of Dirac measures:
A generalized trace can thus be seen as an expectation under this spectral measure:
Through the spectral measure, one can compute empirical summaries of the matrix of interest, such as How many eigenvalues are within of zero? or How many negative eigenvalues are there?. Such questions can be viewed as particular instances of functions where they are, e.g., indicator-like functions.
In this paper, we will take this general view and seek to construct a general purpose spectral density estimate from the Chebyshev traces of using a smoothing kernel . If and , then we can define the smoothed spectral density
In the generalized trace framework, we can ask pointwise questions about by constructing a -specific . We are seeking to estimate a one-dimensional density, so a dense collection of pointwise estimates is useful for visualization, diagnosis, or to ask various post hoc questions such as those above. As in other kinds of density estimation, smoothing is sensible as it leads to practical numerics and we typically do not care about high sensitivity to small perturbations of the eigenvalues. Smoothing also minimizes the effects of the Gibbs phenomenon when using Chebyshev polynomials. Various smoothing kernels have been proposed for the case where the polynomials are truncated (See Weiße et al. (2006) for a discussion.), but in this work we will not have a fixed-order truncation and so we will derive a kernel that is 1) appropriate for density estimation, 2) reduces the Gibbs phenomenon, 3) maps naturally to , and 4) does not depend on the order of the approximation.
3 Randomized Matrix Power Series Estimation
In this section we discuss the construction of unbiased Monte Carlo estimates of functions applied to symmetric matrices. Two forms of Monte Carlo estimator will be used: 1) multiplications of independent random matrix variates in order to get unbiased estimates of matrix powers, and 2) Monte Carlo estimation of infinite polynomial series. These are framed in terms of randomized Chebyshev polynomials, slightly generalizing the scalar-valued power series importance sampling estimator described in Papaspiliopoulos (2011, Section 4.6).
The core observation in the randomized estimation of infinite series is that sums such as Equations (2.1) and (2.10) can be turned into expectations by introducing a proposal distribution that has support on the non-negative natural numbers . The estimator is then a weighted random power or polynomial of random order. We are here making the further assumption that we only have randomized estimates of the matrix and so these matrix powers must be constructed with care: the power of an expectation is not the expectation of the power. Instead, we use a product of independent estimates of the matrix, with the th independent estimate denoted , to construct an unbiased estimate of the matrix power. For example, in the case of fitting a large statistical or machine learning regression model, each Hessian estimate might arise from small and independent subsets of data. This relieves us of the burden of computing a Hessian that is a sum over millions or billions of data. The following lemma addresses such an unbiased construction for Chebyshev polynomials.
Let be a sequence of independent, unbiased estimators of a symmetric real matrix . From this sequence, construct a new sequence as follows:
Then is an unbiased estimate of the matrix-valued Chebyshev polynomial of the first kind .
Unbiasedness for is clear by inspection. For , we proceed inductively by assuming that and are unbiased. is independent of and therefore
Although this randomized estimator is unbiased, its variance explodes as a function of. The intuitive reason for this is that deterministic Chebyshev polynomials have a somewhat magical property: even though the leading coefficients increase exponentially with , the terms cancel out in such a way that the sum is still within the interval . However, in the case here the polynomial is random. The noise overwhelms the careful interaction between terms and results in an explosion of variance. However, it is possible to find an upper bound for this variance and identify situations where the Chebyshev coefficients decay fast enough to achieve finite variance. For a matrix , we denote its spectral norm by and its Frobenius norm by .
Let be a sequence of independent, unbiased estimators of a symmetric real matrix and let be defined as in (3.1). Suppose that
Then, for all ,
Let for . For each define
and note that for all and that is positive semidefinite for all . We can write the defining recurrence for as
This gives the following recurrence for :
since is independent of . Taking the trace of both sides gives
where we have use the inequality for positive semidefinite matrices and , together with the triangle inequality, and the definition of . We then see that for all ,
it follows that for all . ∎
Fortunately, in the situation of interest here, the coefficients weighting each of the Chebyshev terms fall off even more rapidly than the variance increases. In particular, the von Mises smoothing introduced in Section 3.1 has subexponential tails that counteract the explosion of the noisy Chebyshev polynomials. Thus with independent estimators to form stable unbiased estimates of the weighted Chebyshev polynomials, we can now form an unbiased estimate of .
Let be an analytic operator function with Chebyshev polynomial coefficients . Construct the sequence as in Lemma 3.1 above. Let be a probability distribution on the natural numbers (including zero)
be a probability distribution on the natural numbers (including zero), where is the probability mass function and if . Construct an estimator by first drawing a random nonnegative integer from and then computing . Then .
This simply combines importance sampling with the Chebyshev estimator from Lemma 3.1:
The variance of importance sampling estimates can be poor. In the following, we derive the proposal distribution that minimizes the variance of the estimator under the Frobenius norm, i.e., it minimizes the expected squared Frobenius norm of the difference between the estimator and its mean .
Suppose the sequence is such that
is finite. Then the proposal distribution where
The Frobenius norm can be written as a trace, so that
The second term is independent of , so it suffices to minimize the first term. Under this term can be expanded as
To show that this value is minimal, we consider an alternative distribution and apply Jensen’s inequality on the convex function to write
Thus we can see that offers expected squared Frobenius norm less than or equal to that expected under all other possible proposals . ∎
We also note that an importance sampler can be constructed with a hazard rate formulation, resulting in a “randomized truncation” that is still unbiased, using an estimator with as before:
Although this is appealing because it makes use of the intermediate steps of the recursive computation of , we have not been able to identify hazard rates that generally result in variance reduction. It nevertheless seems likely that such hazard rates exist.
3.1 Von Mises Smoothing
As discussed in Section 2, it is useful to frame the problem in terms of estimating the spectral measure . This is closely related to the kernel polynomial method (see, e.g., Weiße et al. (2006)) in which truncated Chebyshev approximations are smoothed with order-dependent Fejér, Jackson, or Lorentz kernels to minimize the Gibbs phenomenon in the approximation. Here we are not making fixed-order truncations, but smoothing is still desirable. We introduce the von Mises kernel to address our situation. It has support on the interval , integrates to one, and is approximately Gaussian for large , but with a convenient closed-from Chebyshev expansion:
where is the modified Bessel function of the first kind. This kernel arises from constructing a von Mises distribution on the unit circle and then projecting it down onto the interval , along with the appropriate Jacobian adjustment. The parameter is inversely related to the squared width of the kernel. Figure 1 shows several densities from different parameters, on the unit circle and projected.
Of particular interest is that the von Mises kernel has a closed-form Chebyshev polynomial expansion for any . Chebyshev polynomials can be constructed in three different ways. First, on the interval they can be constructed via the recursion in Section 2 or as . They can also be computed on the unit circle in the complex plane as where . Finally, they can be constructed via the phase alone, as on the unit circle, i.e., , so that . In the and cases, the polynomials are projected onto the interval . See Trefethen (2013) for a comprehensive treatment of these representations.
We focus only on the phase view. The von Mises distribution is a probability distribution defined on the unit circle, with density given by
Its variance is given by . For a fixed and , the coefficients can be computed via
The first coefficient for Chebyshev polynomials does not have the factor of two and so . Returning to the interval we have
For our purposes here, and so the ratio of modified Bessel functions is well approximated as:
This approximation is computationally useful and also makes it clear that the rapidly converge to zero as a function of .
3.2 Bounding the Variance of the Importance Samples
When computing an importance sampled estimate of the von Mises kernel, a natural choice of proposal distribution is to have it be proportional to the absolute value of the coefficients in Equations (3.17) and (3.18). Here the proposal distribution will be denoted as rather than to distinguish it from the constant :
where the normalizing constant is given by
Note that the factors are absorbed into . Observing that
we can see that the normalization constant has upper bound
In the theorem below we use this bound on and the bound on the Chebyshev estimators from Lemma 3.2 to provide an upper bound on the expected squared Frobenius norm of the overall importance sampler, as a function of and .
The expected squared Frobenius norm of the randomized Chebyshev estimator has upper bound
where for all .
We first use the law of total expectation and incorporate the bound from Lemma 3.2:
The and proposal are chosen to resolve to a simple form:
In particular, we get a straightforward bound on this term for by discarding the cosine:
Substituting this back into the overall bound and using the standard result ,
We can now insert the bound for to get the overall result:
This bound does not depend on , but we can see that the variance effectively grows exponentially with . This can be seen directly as a bias/variance tradeoff: larger corresponds to less smoothing and lower bias with a large variance penalty.
4 Skilling-Hutchinson Randomized Estimation of Generalized Traces
Having constructed an unbiased estimator of analytic operator functions, we now focus on estimation of the trace of such matrices. The classic randomized estimator described in Hutchinson (1989) and Skilling (1989), is the following, which we will call the Skilling-Hutchinson trace estimator:
Let be a square matrix and be a random vector such that . Then .
|trace of a scalar|
|invariance to cyclic permutation|
|linearity of trace|
|linearity of expectation|
The requirement for the random variableis fairly weak and is satisfied by, for example, standard normal and Rademacher variates. The properties of such estimators have been studied in Avron and Toledo (2011). Finally, the following theorem makes it clear that this trace estimator can be combined with the randomized Chebyshev expansion to construct an overall unbiased estimate of the generalized trace.
Following the simple argument from the proof of Proposition 4.1, we have
Since is independent of , then . ∎
The resulting basic algorithm that combines randomized Chebyshev polynomials, von Mises smoothing, and the Skilling-Hutchinson trace estimator is shown in Algorithm 1. For simplicity, the pseudocode shows a single sample for both the polynomial and the quadratic form.
4.1 Variance Reduction Through a Control Variate
The Skilling-Hutchinson trace estimator’s variance can be reduced via the construction of a control variate. Let and form the Skilling-Hutchinson trace estimator as . A natural choice of control variate is to compute the quadratic form using a second matrix which has a known trace. This is similar to a preconditioner in which one introduces a second related matrix that is somehow easier to deal with than the original. In this case, the new estimator is
for an appropriately chosen . Note that the expectation is preserved, i.e., . To compute the variance-minimizing , we require the following result from Avron and Toledo (2011):
The variance of the Skilling-Hutchinson trace estimator applied to a real symmetric is