1 Introduction
Symmetric positive definite matrices arise in many areas in a variety of guises: covariances, kernels, graph Laplacians, or otherwise. A basic computation with such matrices is evaluation of the bilinear form $u^\top f(A)v$, where $f$ is a matrix function and $u$, $v$ are given vectors. If $f(A) = A^{-1}$, we speak of computing a bilinear inverse form (BIF) $u^\top A^{-1}v$. For example, $u^\top A^{-1}u$ with $u = e_i$ (the $i$th canonical vector) is the $i$th diagonal entry of the inverse.
In this paper, we are interested in efficiently computing BIFs, primarily due to their importance in several machine learning contexts, e.g., evaluation of Gaussian density at a point, the Woodbury matrix inversion lemma, implementation of MCMC samplers for Determinantal Point Processes (Dpp), computation of graph centrality measures, and greedy submodular maximization (see Section 2).
When $A$ is large, it is preferable to compute $u^\top A^{-1}u$ iteratively rather than to first compute $A^{-1}$ (using Cholesky) at a cost of $O(n^3)$ operations. One could think of using conjugate gradients to solve $Ax = u$ approximately, and then obtain $u^\top A^{-1}u \approx u^\top x$. But several applications require precise bounds on numerical estimates of $u^\top A^{-1}u$ (e.g., in MCMC-based Dpp samplers such bounds help decide whether to accept or reject a transition in each iteration; see Section 5.1), which necessitates a more finessed approach.

Gauss quadrature is one such approach. Originally proposed in [23] for approximating integrals, Gauss and Gauss-type quadrature (i.e., Gauss–Lobatto [43] and Gauss–Radau [51] quadrature) have since found application to bilinear forms, including computation of $u^\top f(A)v$ [4]. Bai et al. [4] also show that Gauss and (right) Gauss–Radau quadrature yield lower bounds, while Gauss–Lobatto and (left) Gauss–Radau yield upper bounds on the BIF $u^\top A^{-1}u$.
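To make the CG route above concrete, here is a minimal sketch (function and variable names are ours) that estimates $u^\top A^{-1}u$ by approximately solving $Ax = u$ with plain conjugate gradients; it yields a point estimate, not the certified bounds that the applications below require:

```python
import numpy as np

def cg_bif_estimate(A, u, iters=50, tol=1e-10):
    """Estimate u^T A^{-1} u by running conjugate gradients on A x = u.

    Plain CG (Hestenes-Stiefel recurrence); returns u^T x after at
    most `iters` steps or once the residual norm drops below `tol`.
    """
    x = np.zeros_like(u)
    r = u - A @ x          # initial residual (= u since x = 0)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u @ x

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # symmetric positive definite test matrix
u = rng.standard_normal(50)
est = cg_bif_estimate(A, u)
exact = u @ np.linalg.solve(A, u)
```

For a well-conditioned matrix like the one above, CG converges to the exact BIF to near machine precision; what it does not provide is a two-sided interval around the target value.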
However, despite its long history and voluminous existing work (see e.g., [29]), our understanding of Gauss-type quadrature for matrix problems is far from complete. For instance, it is not known whether the bounds on BIFs improve with more quadrature iterations; nor is it known how the bounds obtained from Gauss, Gauss–Radau and Gauss–Lobatto quadrature compare with each other. We do not even know how fast the iterates of Gauss–Radau or Gauss–Lobatto quadrature converge.
Contributions. We address all the aforementioned problems and make the following main contributions:

We show that the lower and upper bounds generated by Gauss-type quadrature monotonically approach the target value (Theorems 4 and 6; Corollary 7). Furthermore, we show that for the same number of iterations, Gauss–Radau quadrature yields bounds superior to those given by Gauss or Gauss–Lobatto, but somewhat surprisingly all three share the same convergence rate.

We demonstrate implications of our results for two tasks: (i) scalable Markov chain sampling from a
Dpp; and (ii) running a greedy algorithm for submodular optimization. In these applications, quadrature accelerates computations, and the bounds aid early stopping.
Indeed, on large-scale sparse problems our methods lead to speedups of several orders of magnitude.
Related Work.
There exist a number of methods for efficiently approximating matrix bilinear forms. Brezinski [12] and Brezinski et al. [13] use extrapolation of matrix moments and interpolation to estimate the 2-norm error of linear systems and the trace of the matrix inverse.
Fika et al. [20] extend the extrapolation method to BIFs and show that the derived one-term and two-term approximations coincide with Gauss quadrature, hence providing lower bounds. Further generalizations address bilinear forms for a Hermitian matrix [19]. In addition, other methods exist for estimating the trace of a matrix function [3, 13, 18] or diagonal elements of the matrix inverse [5, 60]. Many of these methods may be applied to computing BIFs. But they do not provide intervals bounding the target value, just approximations. Thus, a black-box use of these methods may change the execution of an algorithm whose decisions (e.g., whether to transition in a Markov chain) rely on the BIF value being within a specific interval. Such changes can break the correctness of the algorithm.
2 Motivating Applications
BIFs are important to numerous problems. We recount below several notable examples: in all cases, efficient computation of bounds on BIFs is key to making the algorithms practical.
Determinantal Point Processes. A determinantal point process (Dpp) is a distribution over subsets of a ground set $\mathcal{Y} = \{1, \ldots, N\}$. In its L-ensemble form, a Dpp uses a positive semidefinite kernel matrix $L$, and to a set $S \subseteq \mathcal{Y}$ assigns probability
$$P(S) \propto \det(L_S),$$
where $L_S$ is the submatrix of $L$ indexed by entries in $S$. If we restrict to $|S| = k$, we obtain a $k$-Dpp. Dpps are widely used in machine learning, see e.g., the survey [36]. Exact sampling from a ($k$-)Dpp requires an eigendecomposition of $L$ [33], which is prohibitive. For large $N$, Metropolis–Hastings (MH) or Gibbs sampling are preferred and state of the art. Therein the core task is to compute transition probabilities (expressions involving BIFs) which are compared with a random scalar threshold.
For MH [7, 1], the transition probabilities from a current subset (state) $S$ to $S \cup \{i\}$ for $i \notin S$, and to $S \setminus \{j\}$ for $j \in S$, are ratios of determinants that reduce to BIFs. In a $k$-Dpp, the moves are swaps, with transition probabilities for replacing $j \in S$ by $i \notin S$ (so that $|S| = k$ is preserved). We illustrate this application in greater detail in Section 5.1.
Dpps are also useful for (repulsive) priors in Bayesian models [53, 37]. Inference for such latent variable models uses Gibbs sampling, which again involves BIFs.
Submodular optimization, Sensing. Algorithms for maximizing submodular functions can equally benefit from efficient BIF bounds. Given a positive definite matrix $L$, the set function $F(S) = \log\det(L_S)$ is submodular: for all $S \subseteq T$ and $i \notin T$, it holds that $F(S \cup \{i\}) - F(S) \ge F(T \cup \{i\}) - F(T)$.
Finding the set that maximizes $F$ is a key task for MAP inference with Dpps [25], matrix approximations by column selection [11, 59], and sensing [35]. For the latter, we model spatial phenomena (temperature, pollution) via Gaussian Processes and select locations to maximize the joint entropy of the observed variables, or the mutual information between observed and unobserved variables.
Greedy algorithms for maximizing monotone [49] or non-monotone [14] submodular functions rely on marginal gains of the form
$$F(S \cup \{i\}) - F(S) = \log\bigl(L_{ii} - L_{i,S} L_S^{-1} L_{S,i}\bigr)$$
for $i \notin S$, whose argument is a BIF. The algorithms compare those gains to a random threshold, or find an item with the largest gain. In both cases, efficient BIF bounds offer speedups. They can be combined with lazy [47] and stochastic greedy algorithms [48].
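As a quick sanity check of the diminishing-returns property of $F(S) = \log\det(L_S)$, the following sketch (variable names are ours) compares marginal gains on a random positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((30, 30))
L = M @ M.T + np.eye(30)          # positive definite kernel matrix

def F(S):
    """log det of the principal submatrix of L indexed by S; F(empty) = 0."""
    if not S:
        return 0.0
    idx = sorted(S)
    return np.linalg.slogdet(L[np.ix_(idx, idx)])[1]

S = {0, 1, 2}
T = S | {3, 4, 5}                 # S is a subset of T
i = 7                             # an element outside T
gain_S = F(S | {i}) - F(S)        # marginal gain of i w.r.t. the smaller set
gain_T = F(T | {i}) - F(T)        # marginal gain of i w.r.t. the larger set
```

Submodularity guarantees `gain_S >= gain_T` for any such nested pair, which is exactly the inequality stated above.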
Network Analysis, Centrality. When analyzing relationships and information flows between connected entities in a network, such as people, organizations, computers, or smart devices [54, 40, 2, 17, 16, 9], an important question is how to measure the popularity, centrality, or importance of a node.
Several existing popularity measures can be expressed as the solution to a large-scale linear system involving the adjacency matrix; examples include PageRank [50] and Bonacich centrality [10]. When computing local estimates, i.e., only a few entries of the solution, we obtain exactly the task of computing BIFs [61, 39]. Moreover, we may only need local estimates to an accuracy sufficient for determining which entry is larger, a setting where our quadrature-based bounds on BIFs are useful.
Scientific Computing. In computational physics, BIFs are used for estimating selected entries of the inverse of a large sparse matrix. More generally, BIFs can help in estimating the trace of the inverse, a computational substep in lattice Quantum Chromodynamics [15, 22], some signal processing tasks [31], and Gaussian Process (GP) regression [52], e.g., for estimating variances. In numerical linear algebra, BIFs are used in rational approximations [57], evaluation of Green's functions [21], and selective inversion of sparse matrices [41, 42, 39]. A notable use is the design of preconditioners [8] and uncertainty quantification [6].

Benefiting from fast iterative bounds. Many of the above examples use BIFs to rank values, to identify the largest value, or to compare them to a scalar or to each other. In such cases, we first compute fast, crude lower and upper bounds on a BIF, and refine them iteratively, just as far as needed to settle the comparison. Figure 1 in Section 4.4 illustrates the evolution of these bounds, and Section 5 explains the details.
3 Background on Gauss Quadrature
For convenience, we begin by recalling key aspects of Gauss quadrature, as applied to computing $u^\top f(A)u$ for an $n \times n$ symmetric positive definite matrix $A$ with simple eigenvalues, arbitrary vectors $u$, $v$, and a matrix function $f$.^{1} For a more detailed account of the relevant background on Gauss-type quadratures, please refer to Appendix A, or [29].

^{1} The summary in this section is derived from various sources: [24, 4, 29]. Experts can skim this section to collect our notation before moving on to Section 4, which contains our new results.
It suffices to consider $u^\top f(A)u$ thanks to the identity
$$u^\top f(A)v = \tfrac{1}{4}\bigl[(u+v)^\top f(A)(u+v) - (u-v)^\top f(A)(u-v)\bigr].$$
Let $A = Q \Lambda Q^\top$ be the eigendecomposition of $A$, where $Q$ is orthonormal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_1 \le \cdots \le \lambda_n$. Letting $\tilde{u} = Q^\top u$, we then have
$$u^\top f(A)u = \tilde{u}^\top f(\Lambda)\tilde{u} = \sum\nolimits_{i=1}^{n} f(\lambda_i)\,\tilde{u}_i^2.$$
Toward computing $u^\top f(A)u$, a key conceptual step is to write the above sum as the Riemann–Stieltjes integral
$$u^\top f(A)u = \int_a^b f(\lambda)\, d\mu(\lambda), \qquad (3.1)$$
where $a \le \lambda_1$, $b \ge \lambda_n$, and $\mu$ is the piecewise constant measure defined by
$$\mu(\lambda) = \begin{cases} 0, & \lambda < \lambda_1, \\ \sum_{j=1}^{i} \tilde{u}_j^2, & \lambda_i \le \lambda < \lambda_{i+1}, \\ \sum_{j=1}^{n} \tilde{u}_j^2, & \lambda \ge \lambda_n. \end{cases}$$
Our task now reduces to approximating the integral (3.1), for which we invoke the powerful idea of Gauss-type quadratures [23, 51, 43, 24]. We rewrite the integral (3.1) as
$$\int_a^b f(\lambda)\, d\mu(\lambda) = \sum\nolimits_{i=1}^{N} \omega_i f(\theta_i) + \sum\nolimits_{k=1}^{M} \nu_k f(\tau_k) + R_N[f], \qquad (3.2)$$
where the two sums form the $N$th degree approximation and $R_N[f]$ denotes the remainder term. In representation (3.2) the weights $\{\omega_i\}$, $\{\nu_k\}$ and quadrature nodes $\{\theta_i\}$ are unknown, while the nodes $\{\tau_k\}$ are prescribed and lie outside the interval of integration. Different choices of these parameters yield different quadrature rules: $M = 0$ gives Gauss quadrature [23]; $M = 1$ with $\tau_1 = a$ ($\tau_1 = b$) gives left (right) Gauss–Radau quadrature [51]; $M = 2$ with $\tau_1 = a$ and $\tau_2 = b$ yields Gauss–Lobatto quadrature [43]; while for general $M$ we obtain Gauss–Christoffel quadrature [24].
The weights $\omega_i$, $\nu_k$ and nodes $\theta_i$ are chosen such that the approximation is exact whenever $f$ is a polynomial of degree less than $2N + M$. For Gauss quadrature, we can recursively build the tridiagonal Jacobi matrix
$$J_N = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{N-1} \\ & & \beta_{N-1} & \alpha_N \end{pmatrix} \qquad (3.3)$$
and obtain from its spectrum the desired weights and nodes. Theorem 1 makes this more precise.
Theorem 1.
[62, 30] The eigenvalues of $J_N$ form the nodes $\theta_i$ of Gauss quadrature; the weights $\omega_i$ are given by the squares of the first components of the corresponding normalized eigenvectors of $J_N$.

If $J_N$ has the eigendecomposition $J_N = P \Theta P^\top$, then for Gauss quadrature Thm. 1 yields
$$\sum\nolimits_{i=1}^{N} \omega_i f(\theta_i) = e_1^\top P f(\Theta) P^\top e_1 = e_1^\top f(J_N) e_1. \qquad (3.4)$$
Given $A$ and $u$, our task is thus to compute the Jacobi matrix $J_N$ and evaluate (3.4). For BIFs, we have $f(A) = A^{-1}$, so (3.4) becomes $e_1^\top J_N^{-1} e_1 = (J_N^{-1})_{1,1}$, which can be computed recursively using the Lanczos algorithm [38]. For Gauss–Radau and Gauss–Lobatto quadrature we can compute modified versions of the Jacobi matrix (for left Gauss–Radau, right Gauss–Radau, and Gauss–Lobatto, respectively) based on $J_N$. The corresponding nodes and weights, and thus the approximations of Gauss–Radau and Gauss–Lobatto quadrature, are then obtained from these modified Jacobi matrices, similarly to Gauss quadrature. Aggregating all these computations yields an algorithm that iteratively obtains bounds on $u^\top A^{-1}u$. The combined procedure, Gauss Quadrature Lanczos (GQL) [28], is summarily presented as Algorithm 1. The complete algorithm may be found in Appendix A.
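The Gauss-quadrature/Lanczos connection can be sketched compactly: run $N$ Lanczos steps on $(A, u)$, assemble the Jacobi matrix $J_N$ from the recurrence coefficients, and take $\|u\|^2 (J_N^{-1})_{1,1}$ as the lower bound $g_N$. The sketch below (names ours) uses full reorthogonalization for simplicity rather than the recursive update of Algorithm 1:

```python
import numpy as np

def gauss_lower_bound(A, u, N):
    """Gauss-quadrature lower bound g_N on u^T A^{-1} u from N Lanczos steps.

    Builds the tridiagonal Jacobi matrix J_N of the Lanczos recurrence
    and returns ||u||^2 * (J_N^{-1})_{1,1}.
    """
    n = len(u)
    Q = np.zeros((n, N))
    alpha = np.zeros(N)
    beta = np.zeros(N - 1)
    Q[:, 0] = u / np.linalg.norm(u)
    for k in range(N):
        w = A @ Q[:, k]
        alpha[k] = Q[:, k] @ w
        # full reorthogonalization against all previous Lanczos vectors
        w -= Q[:, :k + 1] @ (Q[:, :k + 1].T @ w)
        if k < N - 1:
            beta[k] = np.linalg.norm(w)
            Q[:, k + 1] = w / beta[k]
    J = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    e1 = np.zeros(N)
    e1[0] = 1.0
    return (u @ u) * np.linalg.solve(J, e1)[0]   # ||u||^2 (J_N^{-1})_{11}

rng = np.random.default_rng(2)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)   # well-conditioned SPD matrix
u = rng.standard_normal(40)
exact = u @ np.linalg.solve(A, u)
bounds = [gauss_lower_bound(A, u, N) for N in range(2, 8)]
```

Each `bounds[i]` stays below the exact BIF, and the sequence increases with $N$, in line with the monotonicity results of Section 4.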
It turns out that the bounds given by Gauss quadrature have a close relation to the approximation error of conjugate gradients (CG) applied to a suitable linear system (Theorem 2). Since we know the convergence rate of CG, we can obtain from it the following estimate on the relative error of Gauss quadrature.
Theorem 3 (Relative error Gauss quadrature).
The $N$th iterate $g_N$ of Gauss quadrature satisfies the relative error bound
$$\frac{u^\top A^{-1}u - g_N}{u^\top A^{-1}u} = O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2N}\right), \qquad (3.5)$$
where $\kappa = \kappa(A)$ is the condition number of $A$.
In other words, Thm. 3 shows that the iterates of Gauss quadrature have a linear (geometric) convergence rate.
4 Main Theoretical Results
In this section we summarize our main theoretical results; detailed proofs may be found in Appendix B. The key questions that we answer are: (i) do the bounds on $u^\top A^{-1}u$ generated by GQL improve monotonically with each iteration; (ii) how tight are these bounds; and (iii) how fast do Gauss–Radau and Gauss–Lobatto iterations converge? Our answers not only fill gaps in the literature on quadrature, but also provide a theoretical base for speeding up algorithms in several applications (see Sections 2 and 5).
4.1 Lower Bounds
Our first result shows that both Gauss and right Gauss–Radau quadrature give iteratively better lower bounds on $u^\top A^{-1}u$. Moreover, with the same number of iterations, right Gauss–Radau yields tighter bounds.
Theorem 4.
Let $g_N$ and $r_N$ denote the $N$th iterates of Gauss and right Gauss–Radau quadrature, respectively. Then, $r_N$ yields better bounds than $g_N$ but worse bounds than $g_{N+1}$; more precisely,
$$g_N \le r_N \le g_{N+1} \le u^\top A^{-1}u.$$
Combining Theorem 4 with the convergence rate of relative error for Gauss quadrature (Thm. 3) we obtain the following convergence rate estimate for right GaussRadau.
Theorem 5 (Relative error right Gauss–Radau).
For each iteration $N$, the right Gauss–Radau iterate $r_N$ satisfies
$$\frac{u^\top A^{-1}u - r_N}{u^\top A^{-1}u} = O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2N}\right).$$
4.2 Upper Bounds
Our second result compares Gauss–Lobatto with left Gauss–Radau quadrature.
Theorem 6.
Let $\ell_N$ and $b_N$ denote the $N$th iterates of left Gauss–Radau and Gauss–Lobatto quadrature, respectively. Then, $\ell_N$ gives better upper bounds than $b_N$ but worse than $b_{N+1}$; more precisely,
$$u^\top A^{-1}u \le b_{N+1} \le \ell_N \le b_N.$$
This shows that the bounds given by both Gauss–Lobatto and left Gauss–Radau become tighter with each iteration. For the same number of iterations, left Gauss–Radau provides a tighter bound than Gauss–Lobatto.
Combining the above two theorems, we obtain the following corollary for all four Gausstype quadratures.
Corollary 7 (Monotonicity).
With increasing $N$, the Gauss and right Gauss–Radau iterates $g_N$, $r_N$ give increasingly better lower bounds, and the left Gauss–Radau and Gauss–Lobatto iterates $\ell_N$, $b_N$ give increasingly better upper bounds; that is,
$$g_N \le r_N \le g_{N+1} \le \cdots \le u^\top A^{-1}u \le \cdots \le \ell_{N+1} \le b_{N+1} \le \ell_N \le b_N.$$
4.3 Convergence rates
Our next two results state linear convergence rates for left Gauss–Radau and Gauss–Lobatto quadrature applied to computing the BIF $u^\top A^{-1}u$.
Theorem 8 (Relative error left Gauss–Radau).
For each $N$, the left Gauss–Radau iterate $\ell_N$ satisfies
$$\frac{\ell_N - u^\top A^{-1}u}{u^\top A^{-1}u} = O\!\left(\left(\frac{\sqrt{\hat{\kappa}} - 1}{\sqrt{\hat{\kappa}} + 1}\right)^{2N}\right),$$
where $\hat{\kappa} = \lambda_{\max}(A)/a$ and $a \le \lambda_{\min}(A)$ is our estimate of the smallest eigenvalue.
Theorem 8 shows that the error again decreases linearly, and that it also depends on the accuracy of $a$, our estimate of the smallest eigenvalue, which determines the range of integration. Using the relations between left Gauss–Radau and Gauss–Lobatto, we readily obtain the following corollary.
Corollary 9 (Relative error Gauss–Lobatto).
For each $N$, the Gauss–Lobatto iterate $b_N$ satisfies
$$\frac{b_N - u^\top A^{-1}u}{u^\top A^{-1}u} = O\!\left(\left(\frac{\sqrt{\hat{\kappa}} - 1}{\sqrt{\hat{\kappa}} + 1}\right)^{2N}\right),$$
where $\hat{\kappa} = \lambda_{\max}(A)/a$ as in Theorem 8.
Remarks. All aforementioned results assume that $A$ is strictly positive definite with simple eigenvalues. In Appendix C, we show similar results for the more general case where $A$ is only required to be symmetric and $u$ lies in the space spanned by eigenvectors of $A$ corresponding to distinct positive eigenvalues.
4.4 Empirical Evidence
Next, we empirically verify the theoretical results shown above. We generate a random symmetric matrix with density , where each entry is either zero or standard normal, and shift its diagonal entries to make its smallest eigenvalue , thus making the matrix positive definite. We set and . We randomly sample
from a standard normal distribution. Figure
1 illustrates how the lower and upper bounds given by the four quadrature rules evolve with the number of iterations. Figures 1(b) and (c) show the sensitivity of the rules (except Gauss quadrature) to the estimates of the extremal eigenvalues. Specifically, we use and .
The plots in Figure 1 agree with the theoretical results. First, all quadrature rules are seen to yield iteratively tighter bounds. The bounds obtained by Gauss–Radau quadrature are superior to those given by Gauss and Gauss–Lobatto quadrature (also numerically verified). Notably, the bounds given by all quadrature rules converge very fast: within 25 iterations they yield reasonably tight bounds.
It is valuable to see how the bounds are affected if we do not have good approximations to the extremal eigenvalues. Since Gauss quadrature does not depend on these approximations, its bounds remain the same in (a), (b), (c). Left Gauss–Radau depends on the quality of the smallest-eigenvalue estimate and, with a poor approximation, takes more iterations to converge (Figure 1(b)). Right Gauss–Radau depends on the quality of the largest-eigenvalue estimate; with a poor approximation, its bounds become worse (Figure 1(c)). However, its bounds are never worse than those obtained by Gauss quadrature.
Finally, Gauss–Lobatto depends on both estimates, so its bounds become worse whenever we lack a good approximation to either extremal eigenvalue. Nevertheless, its quality is lower-bounded by left Gauss–Radau, as stated in Thm. 6.
5 Algorithmic Results and Applications
Our theoretical results show that GaussRadau quadrature provides good lower and upper bounds to BIFs. More importantly, these bounds get iteratively tighter at a linear rate, finally becoming exact (see Appendix B). However, in many applications motivating our work (see Section 2), we do not need exact values of BIFs; bounds that are tight enough suffice for the algorithms to proceed. As a result, all these applications benefit from our theoretical results that provide iteratively tighter bounds. This idea translates into a retrospective framework for accelerating methods whose progress relies on knowing an interval containing the BIF. Whenever the algorithm takes a step (transition) that depends on a BIF (e.g., as in the next section, a state transition in a sampler if the BIF exceeds a certain threshold), we compute rough bounds on its value. If the bounds suffice to take the critical decision (e.g., decide the comparison), then we stop the quadrature. If they do not suffice, we take one or more additional iterations of quadrature to tighten the bound. Algorithm 2 makes this idea explicit.
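The decision step of this retrospective framework can be sketched generically: given a routine that tightens an interval $[\ell, u]$ around the BIF each time it is called, we iterate only until the comparison with the threshold is decided. In the sketch below (all names ours), the interval oracle is a toy stand-in for the quadrature iterations:

```python
def decide(t, tighten, max_iter=100):
    """Return True iff the hidden value exceeds threshold t, calling the
    bound oracle `tighten` only while [lo, hi] still straddles t."""
    for _ in range(max_iter):
        lo, hi = tighten()
        if lo > t:
            return True     # safe accept: value is certainly above t
        if hi <= t:
            return False    # safe reject: value is certainly below t
    raise RuntimeError("bounds did not separate from the threshold")

def make_oracle(value=0.7, width=1.0):
    """Toy oracle: intervals shrinking geometrically around `value`."""
    state = {"w": width}
    def tighten():
        state["w"] *= 0.5
        return value - state["w"], value + state["w"]
    return tighten

accept = decide(0.6, make_oracle())   # hidden value 0.7 exceeds 0.6
reject = decide(0.9, make_oracle())   # hidden value 0.7 is below 0.9
```

Because the decision is taken only once the interval clears the threshold, it always agrees with the decision one would take knowing the exact value, which is what preserves correctness of the accelerated algorithms.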
We illustrate our framework by accelerating: (i) Markov chain sampling for ($k$-)Dpps; and (ii) maximization of a (specific) non-monotone submodular function.
5.1 Retrospective Markov Chain ($k$-)Dpp
First, we use our framework to accelerate iterative samplers for Determinantal Point Processes. Specifically, we discuss MH sampling [34]; the variant for Gibbs sampling follows analogously.
The key insight is that all state transitions of the Markov chain rely on a comparison between a scalar and a quantity involving the bilinear inverse form. Given the current set $S$, assume we propose to add element $i \notin S$. The probability of transitioning to state $S \cup \{i\}$ is governed by the determinant ratio $\det(L_{S \cup \{i\}})/\det(L_S) = L_{ii} - L_{i,S} L_S^{-1} L_{S,i}$, a BIF. To decide whether to accept this transition, we sample $t \sim \mathrm{Unif}[0,1]$; if $t$ is below the transition probability, we accept the transition, otherwise we remain at $S$. Hence, we need to compute the BIF just accurately enough to decide this comparison. To do so, we can use the aforementioned lower and upper bounds.
Let $\ell_j$ and $u_j$ be lower and upper bounds for this BIF in the $j$th iteration of Gauss quadrature, and let $t'$ be the value the BIF is compared against. If $\ell_j \ge t'$, we can safely accept the transition; if $u_j < t'$, we can safely reject it. Only if $\ell_j < t' \le u_j$ can we not make a decision yet, and we therefore retrospectively perform one more iteration of Gauss quadrature to obtain tighter bounds $\ell_{j+1}$ and $u_{j+1}$. We continue until the bounds are sharp enough to safely decide whether to make the transition. Note that in each iteration we make the same decision as we would with the exact value of the BIF, and hence the resulting algorithm (Alg. 3) is an exact Markov chain for the Dpp. In each iteration, it calls Alg. 4, which uses stepwise lazy Gauss quadrature to decide the comparison while stopping as early as possible.
If we condition the Dpp on observing a set of a fixed cardinality $k$, we obtain a $k$-Dpp. The MH sampler for this process is similar, but a state transition corresponds to swapping two elements (adding and removing at the same time). Assume the current set is $S$. If we propose to delete $j \in S$ and add $i \notin S$, then, with $T = S \setminus \{j\}$, the corresponding transition probability is
$$p = \min\Bigl\{1,\; \frac{L_{ii} - L_{i,T} L_T^{-1} L_{T,i}}{L_{jj} - L_{j,T} L_T^{-1} L_{T,j}}\Bigr\}. \qquad (5.1)$$
Again, we sample $t \sim \mathrm{Unif}[0,1]$, but now we must compute two quantities, and hence two sets of lower and upper bounds: $\ell^{(i)}_m$, $u^{(i)}_m$ for the numerator BIF in the $m$th Gauss quadrature iteration, and $\ell^{(j)}_m$, $u^{(j)}_m$ for the denominator BIF. Then if $t \cdot u^{(j)}_m \le \ell^{(i)}_m$, we can safely accept the transition; if $t \cdot \ell^{(j)}_m > u^{(i)}_m$, we can safely reject it; otherwise, we tighten the bounds via additional Gauss–Radau iterations.
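This two-interval comparison can be sketched as a small helper (names and the illustrative interval values are ours): the acceptance test "t times the denominator is at most the numerator" is decided from the current bounds, or deferred when they are still too loose.

```python
def swap_decision(t, num_bounds, den_bounds):
    """Decide t * den <= num from interval bounds, or defer.

    num_bounds = (l_num, u_num) and den_bounds = (l_den, u_den) are
    lower/upper bounds on two nonnegative BIFs (SPD matrix). Returns
    True (safe accept), False (safe reject), or None (tighten and retry).
    """
    l_num, u_num = num_bounds
    l_den, u_den = den_bounds
    if t * u_den <= l_num:
        return True    # even the largest possible denominator passes
    if t * l_den > u_num:
        return False   # even the smallest possible denominator fails
    return None        # undecided: bounds must be tightened

# Illustrative calls with made-up intervals:
accept = swap_decision(0.5, (1.0, 1.1), (1.8, 2.0))    # 0.5 * 2.0 <= 1.0
reject = swap_decision(0.9, (1.0, 1.1), (1.8, 2.0))    # 0.9 * 1.8 > 1.1
undecided = swap_decision(0.55, (1.0, 1.1), (1.8, 2.0))
```

In the sampler, a `None` result triggers one more quadrature iteration on whichever BIF is chosen by the refinement rule discussed next.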
Refinements. We could perform one iteration for both BIFs, but it may be that one set of bounds is already sufficiently tight while the other is still loose. A straightforward idea would be to judge the tightness of the lower and upper bounds by their gap $u_m - \ell_m$, and decide accordingly which quadrature to iterate further. But the bounds for the numerator and the denominator are not symmetric and contribute differently to the transition decision. In essence, we need to judge the relation between $t$ and the ratio of the two BIFs or, equivalently, the relation between $t$ times the denominator and the numerator. Since the threshold side is scaled by $t$, the two gaps carry different weight. Assuming that in practice the impact is larger when the (appropriately scaled) gap is larger, we tighten the bounds for the denominator if its scaled gap exceeds that of the numerator, and otherwise tighten the bounds for the numerator. Details of the final algorithm with this refinement are shown in Appendix D.
5.2 Retrospective Double Greedy Algorithm
As indicated in Section 2, a number of applications, including sensing and information maximization with Gaussian Processes, rely on maximizing a submodular function of the form $F(S) = \log\det(L_S)$. In general, this function may be non-monotone. In this case, an algorithm of choice is the double greedy algorithm of Buchbinder et al. [14].
The double greedy algorithm starts with two sets $X_0 = \emptyset$ and $Y_0 = \mathcal{Y}$ and serially iterates through all elements to construct a near-optimal subset. At iteration $i$, it includes element $i$ into $X_{i-1}$ with a probability determined by the two marginal gains
$$\alpha_i = F(X_{i-1} \cup \{i\}) - F(X_{i-1}), \qquad \beta_i = F(Y_{i-1} \setminus \{i\}) - F(Y_{i-1}),$$
and otherwise excludes $i$ from $Y_{i-1}$. For the log-det function, these gains are logarithms of Schur complements, i.e., of BIFs:
$$\alpha_i = \log\bigl(L_{ii} - L_{i,X} L_X^{-1} L_{X,i}\bigr), \qquad \beta_i = -\log\bigl(L_{ii} - L_{i,T} L_T^{-1} L_{T,i}\bigr),$$
where $X = X_{i-1}$ and $T = Y_{i-1} \setminus \{i\}$. In other words, at iteration $i$ the algorithm uniformly samples $t \in [0,1]$, and then checks whether
$$t \le \frac{[\alpha_i]_+}{[\alpha_i]_+ + [\beta_i]_+},$$
and if true, adds $i$ to $X_{i-1}$, otherwise removes it from $Y_{i-1}$.
This essential decision, whether to retain or discard an element, again involves bounding BIFs, for which we can take advantage of our framework and profit from the typical sparsity of the data. Concretely, we retrospectively compute lower and upper bounds on the two marginal-gain BIFs. If the bounds certify the inclusion condition, we safely add the element to $X$; if they certify its negation, we safely remove it from $Y$; otherwise we compute tighter bounds by further iterating the quadrature.
As before, the bounds for the two marginal gains may not contribute equally to the decision. We can again apply the refinement mentioned in Section 5.1: whichever set of bounds has the larger (appropriately weighted) gap is tightened first. The resulting algorithm is shown in Appendix E.
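For concreteness, here is a sketch of the double greedy pass on $F(S) = \log\det(L_S)$ using exact marginal gains (all names ours); our retrospective framework would replace these exact log-det evaluations with lazily tightened quadrature bounds:

```python
import numpy as np

def logdet(L, S):
    """log det of the principal submatrix indexed by S; logdet(empty) = 0."""
    if not S:
        return 0.0
    idx = sorted(S)
    return np.linalg.slogdet(L[np.ix_(idx, idx)])[1]

def double_greedy_logdet(L, rng):
    """Buchbinder et al.-style randomized double greedy on F(S) = log det L_S."""
    n = L.shape[0]
    X, Y = set(), set(range(n))
    for i in range(n):
        a = logdet(L, X | {i}) - logdet(L, X)   # gain of adding i to X
        b = logdet(L, Y - {i}) - logdet(L, Y)   # gain of removing i from Y
        a_plus, b_plus = max(a, 0.0), max(b, 0.0)
        # convention: if both truncated gains vanish, include the element
        p = 1.0 if a_plus + b_plus == 0.0 else a_plus / (a_plus + b_plus)
        if rng.random() < p:
            X.add(i)
        else:
            Y.discard(i)
    return X   # X == Y once all elements are processed

rng = np.random.default_rng(3)
M = rng.standard_normal((12, 12))
L = M @ M.T + 1e-3 * np.eye(12)
S = double_greedy_logdet(L, rng)
```

In the accelerated variant, each of the two `logdet` differences becomes a pair of interval bounds, and the inclusion test is decided retrospectively exactly as in Section 5.1.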
Data  Dimension  nnz  Density(%) 

Abalone  4,177  144,553  0.83 
Wine  4,898  2,659,910  11.09 
GR  5,242  34,209  0.12 
HEP  9,877  61,821  0.0634 
Epinions  75,879  518,231  0.009 
Slashdot  82,168  959,454  0.014 
Table 1: Dataset statistics. For all datasets we add $10^{-3}$ times the identity matrix to ensure positive definiteness.
Abalone  Wine  GR  HEP  Epinions  Slashdot  
Dpp  9.6E3  1x  8.5E2  1x  9.3E3  1x  6.5E2  1x  1.46  1x  5.85  1x 
5.4E4  17.8x  5.9E3  14.4x  4.3E4  21.6x  5.9E4  110.2x  3.7E3  394.6x  7.1E3  823.9x  
Dpp  1.4E2  1x  0.15  1x  1.7E2  1x  0.13  1x  2.40  1x  11.83  1x 
7.3E4  19.2x  1.1E2  13.6x  7.3E4  23.3x  9.2E4  141.3x  4.9E3  489.8x  1E2  1183x  
Dg  1025.6  1x  1951.3  1x  965.8  1x  6269.4  1x  
17.3  59.3x  423.2  4.6x  10  9.7x  25.3  247.8x  418  712.9 
5.3 Empirical Evidence
We perform experiments on both synthetic and real-world datasets to test the impact of our retrospective quadrature framework in applications. We focus on ($k$-)Dpp sampling and the double greedy algorithm for the log-det objective.
5.3.1 Synthetic Datasets
We generate small sparse matrices using methods similar to those in Section 4.4. For ($k$-)Dpp sampling we generate matrices while for double greedy we use . We vary the density of the matrices from to . The running times and speedups are shown in Figure 2.
The results suggest that our framework greatly accelerates both Dpp sampling and submodular maximization. The speedups are particularly pronounced for sparse matrices. As the matrices become very sparse, the original algorithms profit from sparsity too, and the difference shrinks a little. Overall, we see that our framework has the potential to lead to substantial speedups for algorithms involving bilinear inverse forms.
5.3.2 Real Datasets
We further test our framework on real-world datasets of varying sizes. We selected 6 datasets: four of small/medium size and two large ones. The four small/medium-sized datasets are used in [26]. The first two, Abalone and Wine (available at http://archive.ics.uci.edu/ml/), are popular regression datasets, for which we construct sparse kernel matrices with an RBF kernel. We set the bandwidth parameter for Abalone as and that for Wine as , and the cutoff parameter as for both datasets, as in [26]. The other two small/medium-sized datasets are GR (arXiv General Relativity collaboration graph) and HEP (arXiv High Energy Physics collaboration graph), where the kernel matrices are Laplacian matrices. The two large datasets are Epinions (who-trusts-whom network of Epinions) and Slashdot (Slashdot social network from Feb. 2009), available at https://snap.stanford.edu/data/, with large Laplacian matrices. Dataset statistics are shown in Tab. 1.
The running times in Tab. 2 suggest that the iterative bounds from quadrature significantly accelerate ($k$-)Dpp sampling and double greedy on real data. Our algorithms lead to speedups of up to a thousand times.
On the large sparse matrices, the “standard” double greedy algorithm did not finish within 24 hours, due to the expensive matrix operations involved. With our framework, the algorithm needs only 15 minutes.
To our knowledge, these are the first results running Dpp sampling and double greedy for information gain on datasets of this scale.
5.4 Numerical details
Instability.
As seen in Alg. 1, the quadrature algorithm is built upon Lanczos iterations. Although in theory Lanczos iterations construct a set of orthogonal Lanczos vectors, in practice the constructed vectors usually lose orthogonality after some iterations due to rounding errors. One way to deal with this problem is to reorthogonalize the vectors, either completely at each iteration or selectively [Parlett & Scott, 1979]. Alternatively, an equivalent Lanczos iteration proposed by Paige [1971] uses a different expression to improve local orthogonality. Further discussion of the numerical stability of the method lies beyond the scope of this paper.
Preconditioning.
For Gauss quadrature on $u^\top A^{-1}u$, the convergence rate of the bounds depends on the condition number of $A$. We can use preconditioning techniques to get a well-conditioned matrix and proceed with that. Concretely, observe that for nonsingular $P$,
$$u^\top A^{-1}u = (P^{-1}u)^\top \bigl(P^{-1} A P^{-\top}\bigr)^{-1} (P^{-1}u).$$
Thus, if $P^{-1} A P^{-\top}$ is well-conditioned, we can run Gauss quadrature on it with the vector $P^{-1}u$.
There exist various ways to obtain good preconditioners for an SPD matrix. A simple choice is to use . There also exist methods for efficiently constructing a sparse approximate inverse [Benzi et al., 1996]. If the matrix happens to be SDD, we can use techniques introduced in [Cheng et al., 2014] to construct an approximate sparse inverse in near-linear time.
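A quick numerical check of the preconditioning identity above, using a simple diagonal (Jacobi-style) preconditioner $P = \mathrm{diag}(A)^{1/2}$ as an illustrative choice of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((30, 30))
A = M @ M.T + np.diag(rng.uniform(1, 100, 30))   # SPD with uneven diagonal
u = rng.standard_normal(30)

d = np.sqrt(np.diag(A))
Pinv = np.diag(1.0 / d)          # P^{-1} for P = diag(A)^{1/2}
A_pre = Pinv @ A @ Pinv.T        # P^{-1} A P^{-T}
u_pre = Pinv @ u                 # P^{-1} u

bif = u @ np.linalg.solve(A, u)
bif_pre = u_pre @ np.linalg.solve(A_pre, u_pre)

# The preconditioned matrix typically has a much smaller condition number,
# which translates into faster convergence of the quadrature bounds.
cond_before = np.linalg.cond(A)
cond_after = np.linalg.cond(A_pre)
```

The two BIF values agree exactly (up to round-off), so quadrature can be run on the better-conditioned pair without changing the target quantity.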
6 Conclusion
In this paper we present a general and powerful computational framework for algorithms that rely on computing bilinear inverse forms. The framework uses Gauss quadrature methods to lazily and iteratively tighten bounds, and is supported by our new theoretical results. We analyze properties of the various types of Gauss quadrature for approximating bilinear inverse forms and show that all bounds become monotonically tighter with the number of iterations; that the bounds given by Gauss–Radau are superior to those obtained from the other Gauss-type quadratures; and that both lower and upper bounds enjoy a linear convergence rate. We empirically verify the efficiency of our framework and obtain speedups of up to a thousand times for two popular examples: maximizing information gain and sampling from determinantal point processes.
Acknowledgements
This research was partially supported by NSF CAREER award 1553284 and a Google Research Award.
References
 Anari et al. [2016] Anari, Nima, Gharan, Shayan Oveis, and Rezaei, Alireza. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, 2016.
 Atzori et al. [2010] Atzori, Luigi, Iera, Antonio, and Morabito, Giacomo. The Internet of Things: A survey. Computer networks, 54(15):2787–2805, 2010.
 Bai & Golub [1996] Bai, Zhaojun and Golub, Gene H. Bounds for the trace of the inverse and the determinant of symmetric positive definite matrices. Annals of Numerical Mathematics, pp. 29–38, 1996.
 Bai et al. [1996] Bai, Zhaojun, Fahey, Mark, and Golub, Gene H. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, pp. 71–89, 1996.
 Bekas et al. [2007] Bekas, Constantine, Kokiopoulou, Effrosyni, and Saad, Yousef. An estimator for the diagonal of a matrix. Applied numerical mathematics, pp. 1214–1229, 2007.
 Bekas et al. [2009] Bekas, Constantine, Curioni, Alessandro, and Fedulova, Irina. Low cost high performance uncertainty quantification. In Proceedings of the 2nd Workshop on High Performance Computational Finance, 2009.
 Belabbas & Wolfe [2009] Belabbas, MohamedAli and Wolfe, Patrick J. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, pp. 369–374, 2009.
 Benzi & Golub [1999] Benzi, Michele and Golub, Gene H. Bounds for the entries of matrix functions with applications to preconditioning. BIT Numerical Mathematics, pp. 417–438, 1999.
 Benzi & Klymko [2013] Benzi, Michele and Klymko, Christine. Total communicability as a centrality measure. J. Complex Networks, pp. 124–149, 2013.
 Bonacich [1987] Bonacich, Phillip. Power and centrality: A family of measures. American Journal of Sociology, pp. 1170–1182, 1987.
 Boutsidis et al. [2009] Boutsidis, Christos, Mahoney, Michael W., and Drineas, Petros. An improved approximation algorithm for the column subset selection problem. In SODA, pp. 968–977, 2009.
 Brezinski [1999] Brezinski, Claude. Error estimates for the solution of linear systems. SIAM Journal on Scientific Computing, pp. 764–781, 1999.
 Brezinski et al. [2012] Brezinski, Claude, Fika, Paraskevi, and Mitrouli, Marilena. Estimations of the trace of powers of positive selfadjoint operators by extrapolation of the moments. Electronic Transactions on Numerical Analysis, pp. 144–155, 2012.
 Buchbinder et al. [2012] Buchbinder, Niv, Feldman, Moran, Naor, Joseph, and Schwartz, Roy. A tight linear time (1/2)approximation for unconstrained submodular maximization. In FOCS, 2012.
 Dong & Liu [1994] Dong, ShaoJing and Liu, KehFei. Stochastic estimation with noise. Physics Letters B, pp. 130–136, 1994.
 Estrada & Higham [2010] Estrada, Ernesto and Higham, Desmond J. Network properties revealed through matrix functions. SIAM Review, pp. 696–714, 2010.
 Fenu et al. [2013] Fenu, Caterina, Martin, David R., Reichel, Lothar, and Rodriguez, Giuseppe. Network analysis via partial spectral factorization and Gauss quadrature. SIAM Journal on Scientific Computing, pp. A2046–A2068, 2013.
 Fika & Koukouvinos [2015] Fika, Paraskevi and Koukouvinos, Christos. Stochastic estimates for the trace of functions of matrices via Hadamard matrices. Communications in StatisticsSimulation and Computation, 2015.
 Fika & Mitrouli [2015] Fika, Paraskevi and Mitrouli, Marilena. Estimation of the bilinear form for Hermitian matrices. Linear Algebra and its Applications, 2015.
 Fika et al. [2014] Fika, Paraskevi, Mitrouli, Marilena, and Roupa, Paraskevi. Estimates for the bilinear form with applications to linear algebra problems. Electronic Transactions on Numerical Analysis, pp. 70–89, 2014.
 Freericks [2006] Freericks, James K. Transport in multilayered nanostructures. The Dynamical MeanField Theory Approach, Imperial College, London, 2006.
 Frommer et al. [2012] Frommer, Andreas, Lippert, Thomas, Medeke, Björn, and Schilling, Klaus. Numerical Challenges in Lattice Quantum Chromodynamics: Joint Interdisciplinary Workshop of John Von Neumann Institute for Computing, Jülich, and Institute of Applied Computer Science, Wuppertal University, August 1999, volume 15. Springer Science & Business Media, 2012.
 Gauss [1815] Gauss, Carl F. Methodus nova integralium valores per approximationem inveniendi. apvd Henricvm Dieterich, 1815.
 Gautschi [1981] Gautschi, Walter. A survey of GaussChristoffel quadrature formulae. In EB Christoffel, pp. 72–147. Springer, 1981.
Gillenwater et al. [2012] Gillenwater, Jennifer, Kulesza, Alex, and Taskar, Ben. Near-optimal MAP inference for determinantal point processes. In NIPS, 2012.
 Gittens & Mahoney [2013] Gittens, Alex and Mahoney, Michael W. Revisiting the Nyström method for improved largescale machine learning. ICML, 2013.
 Golub [1973] Golub, Gene H. Some modified matrix eigenvalue problems. SIAM Review, pp. 318–334, 1973.
 Golub & Meurant [1997] Golub, Gene H. and Meurant, Gérard. Matrices, moments and quadrature II; how to compute the norm of the error in iterative methods. BIT Numerical Mathematics, pp. 687–705, 1997.
 Golub & Meurant [2009] Golub, Gene H. and Meurant, Gérard. Matrices, moments and quadrature with applications. Princeton University Press, 2009.
 Golub & Welsch [1969] Golub, Gene H. and Welsch, John H. Calculation of Gauss quadrature rules. Mathematics of computation, pp. 221–230, 1969.
 Golub et al. [2008] Golub, Gene H., Stoll, Martin, and Wathen, Andy. Approximation of the scattering amplitude and linear systems. Elec. Tran. on Numerical Analysis, pp. 178–203, 2008.
 Hestenes & Stiefel [1952] Hestenes, Magnus R. and Stiefel, Eduard. Methods of conjugate gradients for solving linear systems. J. Research of the National Bureau of Standards, pp. 409–436, 1952.
 Hough et al. [2006] Hough, J. Ben, Krishnapur, Manjunath, Peres, Yuval, and Virág, Bálint. Determinantal processes and independence. Probability Surveys, 2006.
 Kang [2013] Kang, Byungkon. Fast determinantal point process sampling with application to clustering. In NIPS, pp. 2319–2327, 2013.
Krause et al. [2008] Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, pp. 235–284, 2008.
 Kulesza & Taskar [2012] Kulesza, Alex and Taskar, Ben. Determinantal point processes for machine learning. arXiv:1207.6083, 2012.
 Kwok & Adams [2012] Kwok, James T. and Adams, Ryan P. Priors for diversity in generative latent variable models. In NIPS, pp. 2996–3004, 2012.
 Lanczos [1950] Lanczos, Cornelius. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
 Lee et al. [2014] Lee, Christina E., Ozdaglar, Asuman E., and Shah, Devavrat. Solving systems of linear equations: Locally and asynchronously. arXiv, abs/1411.2647, 2014.
 Leskovec et al. [2008] Leskovec, Jure, Lang, Kevin J., Dasgupta, Anirban, and Mahoney, Michael W. Statistical properties of community structure in large social and information networks. In WWW, pp. 695–704, 2008.
 Lin et al. [2011a] Lin, Lin, Yang, Chao, Lu, Jianfeng, and Ying, Lexing. A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2D electronic structure calculations. SIAM Journal on Scientific Computing, pp. 1329–1351, 2011a.
Lin et al. [2011b] Lin, Lin, Yang, Chao, Meza, Juan C., Lu, Jianfeng, Ying, Lexing, and E, Weinan. SelInv: an algorithm for selected inversion of a sparse symmetric matrix. ACM Transactions on Mathematical Software, 2011b.
Lobatto [1852] Lobatto, Rehuel. Lessen over de differentiaal- en integraalrekening: Dl. 2 Integraalrekening, volume 1. Van Cleef, 1852.
 Meurant [1997] Meurant, Gérard. The computation of bounds for the norm of the error in the conjugate gradient algorithm. Numerical Algorithms, pp. 77–87, 1997.
 Meurant [1999] Meurant, Gérard. Numerical experiments in computing bounds for the norm of the error in the preconditioned conjugate gradient algorithm. Numerical Algorithms, pp. 353–365, 1999.
 Meurant [2006] Meurant, Gérard. The Lanczos and conjugate gradient algorithms: from theory to finite precision computations, volume 19. SIAM, 2006.
 Minoux [1978] Minoux, Michel. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pp. 234–243. Springer, 1978.
 Mirzasoleiman et al. [2015] Mirzasoleiman, Baharan, Badanidiyuru, Ashwinkumar, Karbasi, Amin, Vondrák, Jan, and Krause, Andreas. Lazier than lazy greedy. In AAAI, 2015.
Nemhauser et al. [1978] Nemhauser, George L., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, pp. 265–294, 1978.
 Page et al. [1999] Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. The PageRank citation ranking: bringing order to the web. Stanford InfoLab, 1999.
 Radau [1880] Radau, Rodolphe. Étude sur les formules d’approximation qui servent à calculer la valeur numérique d’une intégrale définie. J. de Mathématiques Pures et Appliquées, pp. 283–336, 1880.
 Rasmussen & Williams [2006] Rasmussen, Carl E. and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
 Rocková & George [2015] Rocková, Veronika and George, Edward I. Determinantal priors for variable selection, 2015.
 Scott [2012] Scott, John. Social network analysis. Sage, 2012.
 Sherman & Morrison [1950] Sherman, Jack and Morrison, Winifred J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, pp. 124–127, 1950.
 Shewchuk [1994] Shewchuk, Jonathan R. An introduction to the conjugate gradient method without the agonizing pain, 1994.
Sidje & Saad [2011] Sidje, Roger B. and Saad, Yousef. Rational approximation to the Fermi-Dirac function with applications in density functional theory. Numerical Algorithms, pp. 455–479, 2011.
 Stoer & Bulirsch [2013] Stoer, Josef and Bulirsch, Roland. Introduction to numerical analysis, volume 12. Springer Science & Business Media, 2013.
 Sviridenko et al. [2015] Sviridenko, Maxim, Vondrák, Jan, and Ward, Justin. Optimal approximation for submodular and supermodular optimization with bounded curvature. In SODA, 2015.
 Tang & Saad [2012] Tang, Jok M. and Saad, Yousef. A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, pp. 485–501, 2012.
 Wasow [1952] Wasow, Wolfgang R. A note on the inversion of matrices by random walks. Mathematical Tables and Other Aids to Computation, pp. 78–81, 1952.
 Wilf [1962] Wilf, Herbert S. Mathematics for the Physical Sciences. Wiley, New York, 1962.
Appendix A Further Background on Gauss Quadrature
We present below a more detailed summary of material on Gauss quadrature to make the paper self-contained.
A.1 Selecting weights and nodes
We have described that the Riemann-Stieltjes integral can be expressed as
$$\int_a^b f(\lambda)\, d\mu(\lambda) \;=\; \sum_{j=1}^N \omega_j f(\theta_j) + \sum_{k=1}^M \nu_k f(\tau_k) + R[f],$$
where the quadrature sum with $N$ free nodes $\theta_j$ and $M$ prescribed nodes $\tau_k$ denotes the approximation, and $R[f]$ denotes a remainder term. The weights $\{\omega_j\}_{j=1}^N$, $\{\nu_k\}_{k=1}^M$ and nodes $\{\theta_j\}_{j=1}^N$ are chosen such that for all polynomials of degree less than $2N+M$, denoted $f \in \mathcal{P}_{2N+M-1}$, we have exact interpolation, i.e., $R[f] = 0$. One way to compute weights and nodes is to set $f(\lambda) = \lambda^r$ for $r = 0, 1, \ldots, 2N+M-1$ and then solve the resulting nonlinear system of moment equations exactly. But there is an easier way to obtain weights and nodes, namely by using polynomials orthogonal with respect to the measure $\mu$. Specifically, we construct a sequence of orthogonal polynomials $p_0(\lambda), p_1(\lambda), \ldots$ such that $p_k(\lambda)$ is a polynomial in $\lambda$ of degree exactly $k$, and $p_i$, $p_j$ ($i \neq j$) are orthogonal, i.e., they satisfy
$$\int_a^b p_i(\lambda)\, p_j(\lambda)\, d\mu(\lambda) = \delta_{ij}.$$
The roots of $p_N$ are distinct, real, and lie in the interval $(a, b)$, and they form the nodes for Gauss quadrature (see, e.g., [29, Ch. 6]).
Consider the two monic polynomials whose roots serve as quadrature nodes:
$$\pi_N(\lambda) = \prod_{j=1}^N (\lambda - \theta_j), \qquad \rho_M(\lambda) = \prod_{k=1}^M (\lambda - \tau_k),$$
where $\rho_0(\lambda) \equiv 1$ for consistency. We further denote $\sigma_M(\lambda) = \pm \rho_M(\lambda)$, where the sign is taken to ensure $\sigma_M(\lambda) \geq 0$ on $(a, b)$. Then, for $1 \leq j \leq N$, we calculate the quadrature weights as
$$\omega_j = \int_a^b \frac{\pi_N(\lambda)\, \sigma_M(\lambda)}{(\lambda - \theta_j)\, \pi_N'(\theta_j)\, \sigma_M(\theta_j)}\, d\mu(\lambda),$$
where $\pi_N'$ denotes the derivative of $\pi_N$ with respect to $\lambda$. When $M = 0$ the quadrature degenerates to Gauss quadrature and we have
$$\omega_j = \int_a^b \frac{\pi_N(\lambda)}{(\lambda - \theta_j)\, \pi_N'(\theta_j)}\, d\mu(\lambda).$$
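To make the node and weight computation concrete, the following pure-Python sketch works through a small discrete measure (the atoms, masses, and variable names are illustrative choices, not from the paper): it builds the first orthogonal polynomials by the Stieltjes procedure, takes the roots of the degree-2 polynomial as the Gauss nodes, recovers the weights from the first two moments, and checks exactness for all polynomials of degree less than 2N = 4.

```python
import math

# Discrete measure: atoms at points xs with masses ws (illustrative stand-in
# for a spectral measure; both lists are arbitrary choices).
xs = [1.0, 2.0, 4.0, 7.0]
ws = [0.3, 0.2, 0.4, 0.1]

def inner(f, g):
    """<f, g> = sum_i ws[i] * f(xs[i]) * g(xs[i])  (integral w.r.t. the measure)."""
    return sum(w * f(x) * g(x) for w, x in zip(ws, xs))

# Stieltjes procedure: monic orthogonal polynomials via the three-term recurrence.
p0 = lambda x: 1.0
a1 = inner(lambda x: x, p0) / inner(p0, p0)
p1 = lambda x: (x - a1) * p0(x)
a2 = inner(lambda x: x * p1(x) ** 2, p0) / inner(p1, p1)
b1sq = inner(p1, p1) / inner(p0, p0)
p2 = lambda x: (x - a2) * p1(x) - b1sq * p0(x)

# Nodes of the 2-point Gauss rule: roots of the monic quadratic
# p2(x) = x^2 - (a1 + a2) x + (a1 * a2 - b1sq).
s, p = a1 + a2, a1 * a2 - b1sq
disc = math.sqrt(s * s - 4 * p)
t1, t2 = (s - disc) / 2, (s + disc) / 2

# Weights from the first two moment equations (a 2x2 Vandermonde system).
m = [sum(w * x ** r for w, x in zip(ws, xs)) for r in range(4)]
w2 = (m[1] - t1 * m[0]) / (t2 - t1)
w1 = m[0] - w2

# The 2-point rule integrates all polynomials of degree < 2N = 4 exactly.
for r in range(4):
    quad = w1 * t1 ** r + w2 * t2 ** r
    assert abs(quad - m[r]) < 1e-10, (r, quad, m[r])
```

Note that the computed nodes land strictly inside the support of the measure, as the theory of orthogonal polynomials predicts.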
Although we have specified how to select nodes and weights for quadrature, these ideas cannot be applied directly to our problem because the measure $\mu$ is unknown. Indeed, calculating the measure explicitly would require knowing the entire spectrum of $A$, which is as good as explicitly computing $f(A)$, hence untenable for us. The next section shows how to circumvent the difficulties due to the unknown measure $\mu$.
A.2 Gauss Quadrature Lanczos (GQL)
The key idea to circumvent our lack of knowledge of $\mu$ is to recursively construct polynomials called Lanczos polynomials. The construction ensures their orthogonality with respect to $\mu$. Concretely, we construct Lanczos polynomials via the following three-term recurrence:
$$\beta_k p_k(\lambda) = (\lambda - \alpha_k)\, p_{k-1}(\lambda) - \beta_{k-1}\, p_{k-2}(\lambda), \qquad k = 1, 2, \ldots, \qquad \text{(A.1)}$$
with $p_{-1}(\lambda) \equiv 0$ and $p_0(\lambda) \equiv 1$, while ensuring $\int_a^b p_i(\lambda)\, p_j(\lambda)\, d\mu(\lambda) = \delta_{ij}$. We can express (A.1) in matrix form by writing
$$\lambda P_k(\lambda) = T_k P_k(\lambda) + \beta_k\, p_k(\lambda)\, e_k,$$
where $P_k(\lambda) = [p_0(\lambda), p_1(\lambda), \ldots, p_{k-1}(\lambda)]^\top$, $e_k$ is the $k$th canonical unit vector, and $T_k$ is the tridiagonal matrix
$$T_k = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{k-1} \\ & & \beta_{k-1} & \alpha_k \end{pmatrix}. \qquad \text{(A.2)}$$
This matrix is known as the Jacobi matrix, and it is closely related to Gauss quadrature. The following well-known theorem makes this relation precise.
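As a concrete illustration, the recurrence coefficients forming the Jacobi matrix can be generated by running the Lanczos iteration on a matrix and a starting vector. The pure-Python sketch below is a minimal version of this idea; the 4x4 SPD matrix and the starting vector are illustrative choices, not from the paper.

```python
import math

# A 4x4 symmetric positive definite matrix and a starting vector (illustrative).
A = [[4.0, 1.0, 0.0, 0.5],
     [1.0, 3.0, 0.5, 0.0],
     [0.0, 0.5, 5.0, 1.0],
     [0.5, 0.0, 1.0, 2.0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def lanczos(A, u, k):
    """Run k Lanczos steps; return the diagonal (alphas) and off-diagonal
    (betas) coefficients; T_k uses alphas[:k] and betas[:k-1]."""
    n = len(u)
    nrm = math.sqrt(dot(u, u))
    q = [x / nrm for x in u]              # q_1 = u / ||u||
    q_prev = [0.0] * n
    alphas, betas = [], []
    beta = 0.0
    for _ in range(k):
        w = matvec(A, q)
        alpha = dot(w, q)
        alphas.append(alpha)
        w = [w[i] - alpha * q[i] - beta * q_prev[i] for i in range(n)]
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:                  # invariant subspace reached
            break
        betas.append(beta)
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas

alphas, betas = lanczos(A, [1.0, 0.0, 0.0, 0.0], 4)
```

With the starting vector e_1, the first coefficient equals A_11, and more generally e_1^T T_k^r e_1 reproduces the moments u^T A^r u for small r, which is why T_k can drive the quadrature.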
Theorem 10 ([62, 30]).
The eigenvalues of $T_k$ form the nodes of Gauss-type quadratures. The weights are given by the squares of the first elements of the normalized eigenvectors of $T_k$.
Thus, if $T_k$ has the eigendecomposition $T_k = Q \Lambda Q^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$, then for Gauss quadrature Thm. 10 yields
$$\sum_{j=1}^k \omega_j f(\theta_j) = \sum_{j=1}^k q_{1j}^2\, f(\lambda_j) = e_1^\top Q f(\Lambda) Q^\top e_1 = \big(f(T_k)\big)_{11}. \qquad \text{(A.3)}$$
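As a quick sanity check, the identity in (A.3) can be verified directly for $f(\lambda) = 1/\lambda$ on a 2x2 Jacobi matrix, where eigenvalues, eigenvectors, and the inverse are all available in closed form; the entries below are illustrative choices.

```python
import math

# T = [[a, b], [b, c]] is a 2x2 SPD Jacobi matrix (illustrative entries).
a, b, c = 3.0, 1.0, 2.0

# Eigenvalues of the symmetric 2x2 matrix via the quadratic formula.
mean, rad = (a + c) / 2.0, math.sqrt(((a - c) / 2.0) ** 2 + b * b)
lam = [mean - rad, mean + rad]

# Squared first components of the normalized eigenvectors v = (b, lambda - a).
w = [b * b / (b * b + (l - a) ** 2) for l in lam]

quad = sum(wj / lj for wj, lj in zip(w, lam))   # quadrature side with f = 1/x
direct = c / (a * c - b * b)                    # (T^{-1})_{11} by cofactors
assert abs(quad - direct) < 1e-12               # both sides equal 0.4 here
```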
Specialization.
We now specialize to our main focus, $f(A) = A^{-1}$, for which we prove more precise results. In this case, (A.3) becomes $\big(T_k^{-1}\big)_{11}$. The task now is to compute $\big(T_k^{-1}\big)_{11}$ and, given $T_k$, to obtain the Jacobi matrix $T_{k+1}$.
Fortunately, we can efficiently calculate $T_{k+1}$ iteratively using the Lanczos algorithm [38]. Suppose we have an estimate $T_k$; in iteration $k+1$ of Lanczos, we compute the tridiagonal coefficients $\alpha_{k+1}$ and $\beta_k$ and add them to this estimate to form $T_{k+1}$. As to $\big(T_{k+1}^{-1}\big)_{11}$, assuming we have already computed $\big(T_k^{-1}\big)_{11}$, letting
$$T_{k+1} = \begin{pmatrix} T_k & \beta_k e_k \\ \beta_k e_k^\top & \alpha_{k+1} \end{pmatrix}$$
and invoking the Sherman-Morrison identity [55] we obtain the recursion
$$\big(T_{k+1}^{-1}\big)_{11} = \big(T_k^{-1}\big)_{11} + \frac{\beta_k^2\, \big(T_k^{-1}\big)_{1k}^2}{\alpha_{k+1} - \beta_k^2\, \big(T_k^{-1}\big)_{kk}}, \qquad \text{(A.4)}$$
where $\big(T_k^{-1}\big)_{1k}$ and $\big(T_k^{-1}\big)_{kk}$ can be recursively computed using a Cholesky-like factorization of $T_k$ [29, p. 31].
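One way to realize the recursion behind (A.4) in code is to carry, alongside the (1,1) entry of the inverse, the two corner entries of the inverse of the current Jacobi matrix, which the bordered (block) inverse formula updates in constant work per Lanczos step. The pure-Python sketch below takes this bookkeeping route rather than the Cholesky-like factorization of [29]; the tridiagonal coefficients are illustrative values chosen to keep every leading matrix positive definite.

```python
# Diagonal and off-diagonal of a growing Jacobi matrix (illustrative values).
alphas = [4.0, 3.0, 5.0, 2.0]
betas = [1.0, 0.8, 0.6]

b11 = g = t = 1.0 / alphas[0]     # k = 1: T_1 = [alpha_1]
history = [b11]
for alpha, beta in zip(alphas[1:], betas):
    s = alpha - beta * beta * t   # Schur complement of T_k in T_{k+1}
    b11 += (beta * g) ** 2 / s    # the (A.4)-style update of (T^{-1})_{11}
    g = -beta * g / s             # new corner entry (T^{-1})_{1,k+1}
    t = 1.0 / s                   # new corner entry (T^{-1})_{k+1,k+1}
    history.append(b11)

# Cross-check against a direct tridiagonal solve of T x = e_1 (Thomas algorithm).
def inv11(diag, off):
    n = len(diag)
    d = [diag[0]]
    rhs = [1.0] + [0.0] * (n - 1)
    for i in range(1, n):
        m = off[i - 1] / d[-1]
        d.append(diag[i] - m * off[i - 1])
        rhs[i] -= m * rhs[i - 1]
    x = rhs[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x = (rhs[i] - off[i] * x) / d[i]
    return x

assert abs(history[-1] - inv11(alphas, betas)) < 1e-12
```

The iterates in `history` are strictly increasing, consistent with the Gauss rule yielding lower bounds on the bilinear inverse form.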
For Gauss-Radau quadrature, we need to modify $T_{k+1}$ so that it has a prescribed eigenvalue. More precisely, we extend $T_k$ for left Gauss-Radau (