Gauss quadrature for matrix inverse forms with applications

12/07/2015 · Chengtao Li, et al. · MIT

We present a framework for accelerating a spectrum of machine learning algorithms that require computation of bilinear inverse forms $u^\top A^{-1} u$, where $A$ is a positive definite matrix and $u$ a given vector. Our framework is built on Gauss-type quadrature and easily scales to large, sparse matrices. Further, it allows retrospective computation of lower and upper bounds on $u^\top A^{-1} u$, which in turn accelerates several algorithms. We prove that these bounds tighten iteratively and converge at a linear (geometric) rate. To our knowledge, ours is the first work to demonstrate these key properties of Gauss-type quadrature, which is a classical and deeply studied topic. We illustrate empirical consequences of our results by using quadrature to accelerate machine learning tasks involving determinantal point processes and submodular optimization, and observe tremendous speedups in several instances.


1 Introduction

Symmetric positive definite matrices arise in many areas in a variety of guises: covariances, kernels, graph Laplacians, or otherwise. A basic computation with such matrices is evaluation of the bilinear form $u^\top f(A)\, v$, where $f$ is a matrix function and $u$, $v$ are given vectors. If $f(A) = A^{-1}$, we speak of computing a bilinear inverse form (BIF) $u^\top A^{-1} v$. For example, with $u = v = e_i$ (the $i$-th canonical vector), $e_i^\top A^{-1} e_i$ is the $i$-th diagonal entry of the inverse.

In this paper, we are interested in efficiently computing BIFs, primarily due to their importance in several machine learning contexts, e.g., evaluation of Gaussian density at a point, the Woodbury matrix inversion lemma, implementation of MCMC samplers for Determinantal Point Processes (Dpp), computation of graph centrality measures, and greedy submodular maximization (see Section 2).

When $A$ is large, it is preferable to compute $u^\top A^{-1} u$ iteratively rather than to first compute $A^{-1}$ (using Cholesky) at a cost of $O(n^3)$ operations. One could think of using conjugate gradients to solve $Ax = u$ approximately, and then obtain $u^\top x \approx u^\top A^{-1} u$. But several applications require precise bounds on numerical estimates of $u^\top A^{-1} u$ (e.g., in MCMC-based Dpp samplers such bounds help decide whether to accept or reject a transition in each iteration; see Section 5.1), which necessitates a more finessed approach.

Gauss quadrature is one such approach. Originally proposed in [23] for approximating integrals, Gauss quadrature and its Gauss-type variants (i.e., Gauss-Lobatto [43] and Gauss-Radau [51] quadrature) have since found application to bilinear forms, including the computation of $u^\top f(A)\, v$ [4]. Bai et al. [4] also show that Gauss and (right) Gauss-Radau quadrature yield lower bounds, while Gauss-Lobatto and (left) Gauss-Radau yield upper bounds on the BIF $u^\top A^{-1} u$.

However, despite its long history and voluminous existing work (see e.g., [29]), our understanding of Gauss-type quadrature for matrix problems is far from complete. For instance, it is not known whether the bounds on BIFs improve with more quadrature iterations; nor is it known how the bounds obtained from Gauss, Gauss-Radau and Gauss-Lobatto quadrature compare with each other. We do not even know how fast the iterates of Gauss-Radau or Gauss-Lobatto quadrature converge.

Contributions. We address all the aforementioned problems and make the following main contributions:

  • We show that the lower and upper bounds generated by Gauss-type quadrature monotonically approach the target value (Theorems 4 and 6; Corr. 7). Furthermore, we show that for the same number of iterations, Gauss-Radau quadrature yields bounds superior to those given by Gauss or Gauss-Lobatto, but somewhat surprisingly all three share the same convergence rate.

  • We prove linear convergence rates for Gauss-Radau and Gauss-Lobatto explicitly (Theorems 5 and 8; Corr. 9).

  • We demonstrate implications of our results for two tasks: (i) scalable Markov chain sampling from a Dpp; and (ii) running a greedy algorithm for submodular optimization. In these applications, quadrature accelerates computations, and the bounds aid early stopping.

Indeed, on large-scale sparse problems our methods yield speedups of several orders of magnitude.

Related Work.

There exist a number of methods for efficiently approximating matrix bilinear forms. Brezinski [12] and Brezinski et al. [13] use extrapolation of matrix moments and interpolation to estimate the 2-norm error of linear systems and the trace of the matrix inverse.

Fika et al. [20] extend the extrapolation method to BIFs and show that the derived one-term and two-term approximations coincide with Gauss quadrature, hence providing lower bounds. Further generalizations address bilinear forms $u^* f(A)\, v$ for a Hermitian matrix $A$ [19]. In addition, other methods exist for estimating the trace of a matrix function [3, 13, 18] or diagonal elements of the matrix inverse [5, 60].

Many of these methods may be applied to computing BIFs. But they provide only approximations, not intervals bounding the target value. Thus, a black-box use of these methods may change the execution of an algorithm whose decisions (e.g., whether to make a transition in a Markov chain) rely on the BIF value lying within a specific interval. Such changes can break the correctness of the algorithm.

Our framework, in contrast, yields iteratively tighter lower and upper bounds (Section 4), so the algorithm is guaranteed to make correct decisions (Section 5).

2 Motivating Applications

BIFs are important to numerous problems. We recount below several notable examples: in all cases, efficient computation of bounds on BIFs is key to making the algorithms practical.

Determinantal Point Processes. A determinantal point process (Dpp) is a distribution over subsets of a ground set $[N] = \{1, \dots, N\}$. In its L-ensemble form, a Dpp uses a positive semidefinite kernel $L \in \mathbb{R}^{N \times N}$, and to a set $S \subseteq [N]$ assigns probability

$P(S) \;\propto\; \det(L_S),$

where $L_S$ is the submatrix of $L$ indexed by the entries in $S$. If we restrict to sets with $|S| = k$, we obtain a $k$-Dpp. Dpps are widely used in machine learning; see, e.g., the survey [36].

Exact sampling from a ($k$-)Dpp requires an eigendecomposition of $L$ [33], which is prohibitive. For large $N$, Metropolis-Hastings (MH) or Gibbs sampling is preferred and state-of-the-art. Therein the core task is to compute transition probabilities (expressions involving BIFs), which are compared with a random scalar threshold.

For MH [7, 1], the transition probability from a current subset (state) $S$ to $S \cup \{i\}$ is $\min\{1,\, \det(L_{S \cup \{i\}})/\det(L_S)\}$ for $i \notin S$; and to $S \setminus \{i\}$ it is $\min\{1,\, \det(L_{S \setminus \{i\}})/\det(L_S)\}$ for $i \in S$. In a $k$-Dpp, the moves are swaps, with transition probability $\min\{1,\, \det(L_{S \cup \{j\} \setminus \{i\}})/\det(L_S)\}$ for replacing $i \in S$ by $j \notin S$ (so that $|S| = k$ is preserved). We illustrate this application in greater detail in Section 5.1.

Dpps are also useful for (repulsive) priors in Bayesian models [53, 37]. Inference for such latent variable models uses Gibbs sampling, which again involves BIFs.

Submodular optimization, Sensing. Algorithms for maximizing submodular functions can equally benefit from efficient BIF bounds. Given a positive definite matrix $L$, the set function $F(S) = \log\det(L_S)$ is submodular: for all $S \subseteq T \subseteq [N]$ and $i \notin T$, it holds that $F(S \cup \{i\}) - F(S) \ge F(T \cup \{i\}) - F(T)$.

Finding the set that maximizes $F$ is a key task for MAP inference with Dpps [25], matrix approximations by column selection [11, 59], and sensing [35]. For the latter, we model spatial phenomena (temperature, pollution) via Gaussian Processes and select locations to maximize the joint entropy of the observed variables, or the mutual information between observed and unobserved variables.

Greedy algorithms for maximizing monotone [49] or non-monotone [14] submodular functions rely on marginal gains of the form

$F(S \cup \{i\}) - F(S) \;=\; \log\left(L_{ii} - L_{i,S}\, L_S^{-1} L_{S,i}\right)$

for $S \subseteq [N]$ and $i \notin S$. The algorithms compare those gains to a random threshold, or find an item with the largest gain. In both cases, efficient BIF bounds offer speedups. They can be combined with lazy [47] and stochastic greedy [48] algorithms.

Network Analysis, Centrality. When analyzing relationships and information flows between connected entities in a network, such as people, organizations, computers, or smart hardware [54, 40, 2, 17, 16, 9], an important question is how to measure the popularity, centrality, or importance of a node.

Several existing popularity measures can be expressed as the solution to a large-scale linear system. For example, PageRank [50] is the solution to a linear system involving $I - \alpha P$, with $P$ a normalized adjacency matrix, and Bonacich centrality [10] is the solution to a system involving $I - \beta A$, where $A$ is the adjacency matrix. When computing local estimates, i.e., only a few entries of the solution, we obtain exactly the task of computing BIFs [61, 39]. Moreover, we may only need local estimates to an accuracy sufficient for determining which entry is larger, a setting where our quadrature-based bounds on BIFs are useful.

Scientific Computing. In computational physics BIFs are used for estimating selected entries of the inverse of a large sparse matrix. More generally, BIFs can help in estimating the trace of the inverse, a computational substep in lattice Quantum Chromodynamics [15, 22], some signal processing tasks [31], and in Gaussian Process (GP) regression [52], e.g., for estimating variances. In numerical linear algebra, BIFs are used in rational approximations [57], evaluation of Green's function [21], and selective inversion of sparse matrices [41, 42, 39]. A notable use is the design of preconditioners [8] and uncertainty quantification [6].

Benefiting from fast iterative bounds. Many of the above examples use BIFs to rank values, to identify the largest value, or to compare values to a scalar or to each other. In such cases, we first compute fast, crude lower and upper bounds on a BIF, and refine them iteratively, just as far as needed to determine the comparison. Figure 1 in Section 4.4 illustrates the evolution of these bounds, and Section 5 explains the details.

3 Background on Gauss Quadrature

For convenience, we begin by recalling key aspects of Gauss quadrature, as applied to computing $u^\top f(A)\, v$ for an $n \times n$ symmetric positive definite matrix $A$ that has simple eigenvalues, arbitrary vectors $u$ and $v$, and a matrix function $f$. (The summary in this section is derived from various sources: [24, 4, 29]. Experts can skim this section to collect our notation before moving on to Section 4, which contains our new results.) For a more detailed account of the relevant background on Gauss-type quadratures please refer to Appendix A, or [29].

It suffices to consider $u^\top f(A)\, u$ thanks to the identity

$u^\top f(A)\, v \;=\; \tfrac{1}{4}\left[(u+v)^\top f(A)\,(u+v) \;-\; (u-v)^\top f(A)\,(u-v)\right].$

Let $A = Q \Lambda Q^\top$ be the eigendecomposition of $A$, where $Q$ is orthonormal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with $\lambda_1 \le \dots \le \lambda_n$. Letting $\tilde{u} = Q^\top u$, we then have

$u^\top f(A)\, u \;=\; \tilde{u}^\top f(\Lambda)\, \tilde{u} \;=\; \sum\nolimits_{i=1}^{n} f(\lambda_i)\, \tilde{u}_i^2.$

Toward computing this sum, a key conceptual step is to write it as the Riemann-Stieltjes integral

$u^\top f(A)\, u \;=\; \int_a^b f(\lambda)\, d\mu(\lambda),$   (3.1)

where $a \le \lambda_1$, $b \ge \lambda_n$, and $\mu$ is the piecewise constant measure defined by $\mu(\lambda) = 0$ for $\lambda < \lambda_1$, $\mu(\lambda) = \sum_{j=1}^{i} \tilde{u}_j^2$ for $\lambda_i \le \lambda < \lambda_{i+1}$, and $\mu(\lambda) = \sum_{j=1}^{n} \tilde{u}_j^2$ for $\lambda_n \le \lambda$.

Our task now reduces to approximating the integral (3.1), for which we invoke the powerful idea of Gauss-type quadratures [23, 51, 43, 24]. We rewrite the integral (3.1) as

$\int_a^b f(\lambda)\, d\mu(\lambda) \;=\; \sum\nolimits_{i=1}^{t} w_i\, f(\theta_i) \;+\; \sum\nolimits_{j=1}^{M} v_j\, f(\tau_j) \;+\; R_t[f],$   (3.2)

where the two sums denote the $t$-th degree approximation and $R_t[f]$ denotes the remainder term. In representation (3.2) the weights $w_i$, $v_j$ and quadrature nodes $\theta_i$ are unknown, while the values $\tau_j$ are prescribed and lie outside the open integration interval $(a, b)$.

Different choices of these parameters yield different quadrature rules: $M = 0$ gives Gauss quadrature [23]; $M = 1$ with $\tau_1 = a$ ($\tau_1 = b$) gives left (right) Gauss-Radau quadrature [51]; $M = 2$ with $\tau_1 = a$ and $\tau_2 = b$ yields Gauss-Lobatto quadrature [43]; while for general $M$ we obtain Gauss-Christoffel quadrature [24].

The weights $w_i$, $v_j$ and nodes $\theta_i$ are chosen such that if $f$ is a polynomial of degree less than $2t + M$, then the quadrature is exact, i.e., $R_t[f] = 0$. For Gauss quadrature, we can recursively build the Jacobi matrix

$J_t = \begin{pmatrix} \omega_1 & \beta_1 & & \\ \beta_1 & \omega_2 & \ddots & \\ & \ddots & \ddots & \beta_{t-1} \\ & & \beta_{t-1} & \omega_t \end{pmatrix},$   (3.3)

and obtain from its spectrum the desired weights and nodes. Theorem 1 makes this more precise.

Theorem 1.

[62, 30] The eigenvalues of $J_t$ form the nodes $\theta_i$ of Gauss quadrature; the weights $w_i$ are given by the squares of the first components of the normalized eigenvectors of $J_t$.

If $J_t$ has the eigendecomposition $J_t = P \Theta P^\top$, then for Gauss quadrature Thm. 1 yields

$\sum\nolimits_{i=1}^{t} w_i\, f(\theta_i) \;=\; \|u\|^2\, e_1^\top P f(\Theta) P^\top e_1 \;=\; \|u\|^2\, \big(f(J_t)\big)_{1,1}.$   (3.4)

Given $A$ and $u$, our task is to compute the Jacobi matrix $J_t$ and the resulting estimate. For BIFs, we have $f(x) = 1/x$, so (3.4) becomes $g_t = \|u\|^2 (J_t^{-1})_{1,1}$, which can be computed recursively using the Lanczos algorithm [38]. For Gauss-Radau and Gauss-Lobatto quadrature we can compute modified versions of the Jacobi matrix (one each for left Gauss-Radau, right Gauss-Radau, and Gauss-Lobatto) based on $J_t$. The corresponding nodes and weights, and thus the Gauss-Radau and Gauss-Lobatto approximations, are then obtained from these modified Jacobi matrices, similarly to Gauss quadrature. Aggregating all these computations yields an algorithm that iteratively obtains bounds on $u^\top A^{-1} u$. The combined procedure, Gauss Quadrature Lanczos (GQL) [28], is summarized as Algorithm 1. The complete algorithm may be found in Appendix A.

Theorem 2.

[44] Let $g_t$, $r_t^a$, $r_t^b$, and $l_t$ be the $t$-th iterates of Gauss, left Gauss-Radau, right Gauss-Radau, and Gauss-Lobatto quadrature, respectively, as computed by Alg. 1. Then, $g_t$ and $r_t^b$ provide lower bounds on $u^\top A^{-1} u$, while $r_t^a$ and $l_t$ provide upper bounds.

  Input: Matrix $A$, vector $u$; lower and upper bounds $a$ and $b$ on the spectrum of $A$
  Output: $g_t$, $r_t^b$, $r_t^a$, $l_t$: Gauss, right Gauss-Radau, left Gauss-Radau, and Gauss-Lobatto quadrature estimates for each iteration $t$
  Initialize: $q_1 = u / \|u\|$, $q_0 = 0$, $\beta_0 = 0$
  for $t = 1$ to $T$ do
     Update the Jacobi matrix $J_t$ using a Lanczos iteration
     Solve for the modified Jacobi matrices of left Gauss-Radau, right Gauss-Radau, and Gauss-Lobatto quadrature.
     Compute $g_t$, $r_t^b$, $r_t^a$, and $l_t$ with the Sherman-Morrison formula.
  end for
Algorithm 1 Gauss Quadrature Lanczos (GQL)

It turns out that the bounds given by Gauss quadrature have a close relation to the approximation error of conjugate gradient (CG) applied to a suitable problem. Since we know the convergence rate of CG, we can obtain from it the following estimate on the relative error of Gauss quadrature.

Theorem 3 (Relative error Gauss quadrature).

The $t$-th iterate $g_t$ of Gauss quadrature satisfies the relative error bound

$\dfrac{u^\top A^{-1} u \,-\, g_t}{u^\top A^{-1} u} \;\le\; 4\left(\dfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t},$   (3.5)

where $\kappa$ is the condition number of $A$.

In other words, Thm. 3 shows that the iterates of Gauss quadrature have a linear (geometric) convergence rate.

4 Main Theoretical Results

In this section we summarize our main theoretical results. As before, detailed proofs may be found in Appendix B. The key questions that we answer are: (i) do the bounds on $u^\top A^{-1} u$ generated by GQL improve monotonically with each iteration; (ii) how tight are these bounds; and (iii) how fast do the Gauss-Radau and Gauss-Lobatto iterates converge? Our answers not only fill gaps in the literature on quadrature, but also provide a theoretical basis for speeding up algorithms in several applications (see Sections 2 and 5).

4.1 Lower Bounds

Our first result shows that both Gauss and right Gauss-Radau quadrature give iteratively better lower bounds on $u^\top A^{-1} u$. Moreover, with the same number of iterations, right Gauss-Radau yields tighter bounds.

Theorem 4.

Let $t \ge 1$. Then, $r_t^b$ yields better bounds than $g_t$ but worse bounds than $g_{t+1}$; more precisely,

$g_t \;\le\; r_t^b \;\le\; g_{t+1} \;\le\; u^\top A^{-1} u.$

Combining Theorem 4 with the convergence rate of relative error for Gauss quadrature (Thm. 3) we obtain the following convergence rate estimate for right Gauss-Radau.

Theorem 5 (Relative error right Gauss-Radau).

For each iteration $t \ge 1$, the right Gauss-Radau iterate $r_t^b$ satisfies the same relative error bound as in (3.5), i.e.,

$\dfrac{u^\top A^{-1} u \,-\, r_t^b}{u^\top A^{-1} u} \;\le\; 4\left(\dfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t}.$

4.2 Upper Bounds

Our second result compares Gauss-Lobatto with left Gauss-Radau quadrature.

Theorem 6.

Let $t \ge 1$. Then, $r_t^a$ gives better upper bounds than $l_t$ but worse than $l_{t+1}$; more precisely,

$u^\top A^{-1} u \;\le\; l_{t+1} \;\le\; r_t^a \;\le\; l_t.$

This shows that bounds given by both Gauss-Lobatto and left Gauss-Radau become tighter with each iteration. For the same number of iterations, left Gauss-Radau provides a tighter bound than Gauss-Lobatto.

Combining the above two theorems, we obtain the following corollary for all four Gauss-type quadratures.

Corollary 7 (Monotonicity).

With increasing $t$, $g_t$ and $r_t^b$ give increasingly better lower bounds and $r_t^a$ and $l_t$ give increasingly better upper bounds, that is,

$g_t \le g_{t+1} \le u^\top A^{-1} u, \qquad r_t^b \le r_{t+1}^b \le u^\top A^{-1} u, \qquad u^\top A^{-1} u \le r_{t+1}^a \le r_t^a, \qquad u^\top A^{-1} u \le l_{t+1} \le l_t.$

4.3 Convergence rates

Our next two results state linear convergence rates for left Gauss-Radau quadrature and Gauss-Lobatto quadrature applied to computing the BIF $u^\top A^{-1} u$.

Theorem 8 (Relative error left Gauss-Radau).

For each $t \ge 1$, the left Gauss-Radau iterate $r_t^a$ satisfies a relative error bound that decreases linearly (geometrically) in $t$, with a rate constant that depends on $a$, our estimate of the smallest eigenvalue of $A$.

Theorem 8 shows that the error again decreases linearly, and that it also depends on the accuracy of $a$, our estimate of the smallest eigenvalue, which determines the lower end of the range of integration. Using the relations between left Gauss-Radau and Gauss-Lobatto, we readily obtain the following corollary.

Corollary 9 (Relative error Gauss-Lobatto).

For each $t \ge 1$, the Gauss-Lobatto iterate $l_t$ satisfies a relative error bound of the same linear (geometric) form as in Theorem 8.

Remarks. All aforementioned results assumed that $A$ is strictly positive definite with simple eigenvalues. In Appendix C, we show similar results for the more general case where $A$ is only required to be symmetric, and $u$ lies in the space spanned by eigenvectors of $A$ corresponding to distinct positive eigenvalues.

4.4 Empirical Evidence

Next, we empirically verify the theoretical results shown above. We generate a random sparse symmetric matrix in which each entry is either zero or standard normal, and shift its diagonal entries to make the smallest eigenvalue positive, so that the resulting matrix $A$ is positive definite. We set $a$ and $b$ to estimates of the smallest and largest eigenvalues of $A$, and randomly sample $u$ from a standard normal distribution. Figure 1 illustrates how the lower and upper bounds given by the four quadrature rules evolve with the number of iterations.

Figure 1 (b) and (c) show the sensitivity of the rules (except Gauss quadrature) to the estimates of the extremal eigenvalues. Specifically, in (b) we use a loose estimate $a$ of the smallest eigenvalue, and in (c) a loose estimate $b$ of the largest eigenvalue.

Figure 1: Lower and upper bounds computed by Gauss-type quadrature at each iteration; panels (a)-(c) differ in the estimates $a$ and $b$ of the extremal eigenvalues.

The plots in Figure 1 agree with the theoretical results. First, all quadrature rules are seen to yield iteratively tighter bounds. The bounds obtained by the Gauss-Radau quadrature are superior to those given by Gauss and Gauss-Lobatto quadrature (also numerically verified). Notably, the bounds given by all quadrature rules converge very fast – within 25 iterations they yield reasonably tight bounds.

It is valuable to see how the bounds are affected if we do not have good approximations to the extremal eigenvalues. Since Gauss quadrature does not depend on the approximations $a$ and $b$, its bounds remain the same in (a), (b), and (c). Left Gauss-Radau depends on the quality of $a$, and with a poor approximation it takes more iterations to converge (Figure 1(b)). Right Gauss-Radau depends on the quality of $b$; thus, if we use a loose approximation, its bounds become worse (Figure 1(c)). However, its bounds are never worse than those obtained by Gauss quadrature.

Finally, Gauss-Lobatto depends on both $a$ and $b$, so its bounds become worse whenever we lack good approximations to either $a$ or $b$. Nevertheless, its quality is lower-bounded by left Gauss-Radau, as stated in Thm. 6.

5 Algorithmic Results and Applications

Our theoretical results show that Gauss-Radau quadrature provides good lower and upper bounds to BIFs. More importantly, these bounds get iteratively tighter at a linear rate, finally becoming exact (see Appendix B). However, in many applications motivating our work (see Section 2), we do not need exact values of BIFs; bounds that are tight enough suffice for the algorithms to proceed. As a result, all these applications benefit from our theoretical results that provide iteratively tighter bounds. This idea translates into a retrospective framework for accelerating methods whose progress relies on knowing an interval containing the BIF. Whenever the algorithm takes a step (transition) that depends on a BIF (e.g., as in the next section, a state transition in a sampler if the BIF exceeds a certain threshold), we compute rough bounds on its value. If the bounds suffice to take the critical decision (e.g., decide the comparison), then we stop the quadrature. If they do not suffice, we take one or more additional iterations of quadrature to tighten the bound. Algorithm 2 makes this idea explicit.

  Input: Algorithm with transitions that depend on BIFs
  while algorithm not yet done do
     while there is no transition request for the value of a BIF do
         proceed with the original algorithm
     end while
     if there is a transition request for the value of a BIF then
         while the bounds on the BIF are not tight enough to decide the transition do
            Retrospectively run one more iteration of left and/or right Gauss-Radau to obtain tighter bounds.
         end while
         Make the correct transition based on the bounds
     end if
  end while
Algorithm 2 Efficient Retrospective Framework

We illustrate our framework by accelerating: (i) Markov chain sampling for ($k$-)Dpps; and (ii) maximization of a (specific) nonmonotone submodular function.

5.1 Retrospective Markov Chain ($k$-)Dpp

First, we use our framework to accelerate iterative samplers for Determinantal Point Processes. Specifically, we discuss MH sampling [34]; the variant for Gibbs sampling follows analogously.

The key insight is that all state transitions of the Markov chain rely on a comparison between a scalar and a quantity involving a bilinear inverse form. Given the current set $S$, assume we propose to add element $i \notin S$ to $S$. The probability of transitioning to state $S \cup \{i\}$ is $p = \min\{1,\, \det(L_{S \cup \{i\}})/\det(L_S)\} = \min\{1,\, L_{ii} - L_{i,S} L_S^{-1} L_{S,i}\}$. To decide whether to accept this transition, we sample $\theta \sim \mathrm{Unif}[0,1]$; if $\theta \le p$ then we accept the transition, otherwise we remain at $S$. Hence, we need to compute the BIF $L_{i,S} L_S^{-1} L_{S,i}$ just accurately enough to decide whether $\theta \le p$. To do so, we can use the aforementioned lower and upper bounds on $L_{i,S} L_S^{-1} L_{S,i}$.

Let $q_t^-$ and $q_t^+$ be lower and upper bounds for this BIF in the $t$-th iteration of Gauss-type quadrature. If $\theta \le L_{ii} - q_t^+$, then we can safely accept the transition; if $\theta > L_{ii} - q_t^-$, then we can safely reject the transition. Only if $L_{ii} - q_t^+ < \theta \le L_{ii} - q_t^-$ can we not yet make a decision, and we therefore retrospectively perform one more iteration of quadrature to obtain tighter bounds $q_{t+1}^-$ and $q_{t+1}^+$. We continue until the bounds are sharp enough to safely decide whether to make the transition. Note that in each iteration we make the same decision as we would with the exact value of the BIF, and hence the resulting algorithm (Alg. 3) is an exact Markov chain for the Dpp. In each step, it calls Alg. 4, which uses step-wise lazy Gauss quadrature to decide the comparison, while stopping as early as possible.

  Input: Dpp kernel $L$; ground set $[N]$
  Output: $S$ sampled from the exact Dpp
  Randomly initialize $S \subseteq [N]$
  while chain not mixed do
     Pick $i \in [N]$ and $\theta \in [0,1]$ uniformly at random
     if $i \notin S$ then
         Let $A = L_S$ and $u = L_{S,i}$
         Compute bounds $a$, $b$ on the spectrum of $A$
         if DppJudge($L_{ii} - \theta$, $u$, $A$, $a$, $b$) then
            $S \leftarrow S \cup \{i\}$
         end if
     else
         Let $A = L_{S \setminus \{i\}}$ and $u = L_{S \setminus \{i\},\, i}$
         Compute bounds $a$, $b$ on the spectrum of $A$
         if not DppJudge($L_{ii} - 1/\theta$, $u$, $A$, $a$, $b$) then
            $S \leftarrow S \setminus \{i\}$
         end if
     end if
  end while
Algorithm 3 Gauss-Dpp($L$)
  Input: target value $\tau$; vector $u$, matrix $A$; lower and upper bounds $a$ and $b$ on the spectrum of $A$
  Output: Return true if $u^\top A^{-1} u \le \tau$, false otherwise
  while true do
     Run one more Gauss-Radau iteration to obtain the lower bound $r_t^b$ and the upper bound $r_t^a$ on $u^\top A^{-1} u$.
     if $r_t^a \le \tau$ then
         return true
     else if $r_t^b > \tau$ then
         return false
     end if
     $t \leftarrow t + 1$
  end while
Algorithm 4 DppJudge($\tau$, $u$, $A$, $a$, $b$)

If we condition the Dpp on observing a set of fixed cardinality $k$, we obtain a $k$-Dpp. The MH sampler for this process is similar, but a state transition corresponds to swapping two elements (adding one and removing one at the same time). Assume the current set is $S$ with $|S| = k$. If we propose to delete $i \in S$ and add $j \notin S$, then the corresponding transition probability is

$p \;=\; \min\left\{1,\ \frac{\det(L_{S \cup \{j\} \setminus \{i\}})}{\det(L_S)}\right\} \;=\; \min\left\{1,\ \frac{L_{jj} - L_{j,S'} L_{S'}^{-1} L_{S',j}}{L_{ii} - L_{i,S'} L_{S'}^{-1} L_{S',i}}\right\}, \qquad S' = S \setminus \{i\}.$   (5.1)

Again, we sample $\theta \sim \mathrm{Unif}[0,1]$, but now we must compute two BIFs, and hence two sets of lower and upper bounds: $q_t^{j,-}$, $q_t^{j,+}$ for $L_{j,S'} L_{S'}^{-1} L_{S',j}$ in the $t$-th Gauss quadrature iteration, and $q_s^{i,-}$, $q_s^{i,+}$ for $L_{i,S'} L_{S'}^{-1} L_{S',i}$ in the $s$-th Gauss quadrature iteration. Then, if $\theta\,(L_{ii} - q_s^{i,-}) \le L_{jj} - q_t^{j,+}$ we can safely accept the transition; if $\theta\,(L_{ii} - q_s^{i,+}) > L_{jj} - q_t^{j,-}$ we can safely reject the transition; otherwise, we tighten the bounds via additional Gauss-Radau iterations.

Refinements. We could perform one additional iteration for both BIFs, but it may be that one set of bounds is already sufficiently tight, while the other is loose. A straightforward idea would be to judge the tightness of the lower and upper bounds by their difference (gap), and decide accordingly which quadrature to iterate further.

But the bounds for the two BIFs are not symmetric and contribute differently to the transition decision. In essence, we need to judge the relation between $\theta$ and the ratio of Schur complements in (5.1), or, equivalently, the relation between $\theta\,(L_{ii} - L_{i,S'} L_{S'}^{-1} L_{S',i})$ and $L_{jj} - L_{j,S'} L_{S'}^{-1} L_{S',j}$. Since the scalar $\theta$ is "easy", the essential part is how tightly each side of this comparison is known. Assuming that in practice the impact on the comparison is larger when the (suitably scaled) gap is larger, we tighten the bounds for whichever of the two BIFs currently has the larger scaled gap. Details of the final algorithm with this refinement are shown in Appendix D.

5.2 Retrospective Double Greedy Algorithm

As indicated in Section 2, a number of applications, including sensing and information maximization with Gaussian Processes, rely on maximizing a submodular function given as $F(S) = \log\det(L_S)$. In general, this function may be non-monotone. In this case, an algorithm of choice is the double greedy algorithm of Buchbinder et al. [14].

The double greedy algorithm starts with two sets, $Y_0 = \emptyset$ and $Z_0 = [N]$, and serially iterates through all elements to construct a near-optimal subset. At iteration $i$, it includes element $i$ in $Y_{i-1}$ with probability $p_i$, and with probability $1 - p_i$ it excludes $i$ from $Z_{i-1}$. The decisive value $p_i$ is determined by the marginal gains $\Delta^+_i = F(Y_{i-1} \cup \{i\}) - F(Y_{i-1})$ and $\Delta^-_i = F(Z_{i-1} \setminus \{i\}) - F(Z_{i-1})$:

$p_i \;=\; \frac{[\Delta^+_i]_+}{[\Delta^+_i]_+ + [\Delta^-_i]_+}, \qquad [x]_+ := \max\{x, 0\}.$

For the log-det function, we obtain

$\Delta^+_i \;=\; \log\left(L_{ii} - L_{i,Y}\, L_Y^{-1} L_{Y,i}\right), \qquad \Delta^-_i \;=\; -\log\left(L_{ii} - L_{i,Z'}\, L_{Z'}^{-1} L_{Z',i}\right),$

where $Y = Y_{i-1}$ and $Z' = Z_{i-1} \setminus \{i\}$. In other words, at iteration $i$ the algorithm uniformly samples $\theta \sim \mathrm{Unif}[0,1]$, and then checks whether $\theta \le p_i$;

and if true, adds $i$ to $Y_{i-1}$, otherwise removes it from $Z_{i-1}$.

This essential decision, whether to retain or discard an element, again involves bounding BIFs, for which we can take advantage of our framework and profit from the typical sparsity of the data. Concretely, we retrospectively compute lower and upper bounds on these BIFs, i.e., lower and upper bounds on $L_{i,Y} L_Y^{-1} L_{Y,i}$ and on $L_{i,Z'} L_{Z'}^{-1} L_{Z',i}$. If the resulting interval for $p_i$ guarantees $\theta \le p_i$, we safely add $i$ to $Y_{i-1}$; if it guarantees $\theta > p_i$, we safely remove $i$ from $Z_{i-1}$; otherwise we compute a set of tighter bounds by further iterating the quadrature.

As before, the bounds for the two BIFs may not contribute equally to the decision. We can again apply the refinement mentioned in Section 5.1: we tighten the bounds for whichever BIF currently has the larger (suitably scaled) gap. The resulting algorithm is shown in Appendix E.

Figure 2: Running times (top) and corresponding speedup (bottom) on synthetic data. ($k$-)Dpp sampling is initialized with random subsets of a fixed size, and the corresponding running times are averaged over 1,000 iterations of the chain. All results are averaged over 3 runs.
Data Dimension nnz Density(%)
Abalone 4,177 144,553 0.83
Wine 4,898 2,659,910 11.09
GR 5,242 34,209 0.12
HEP 9,877 61,821 0.0634
Epinions 75,879 518,231 0.009
Slashdot 82,168 959,454 0.014
Table 1: Data. For all datasets we add 1E-3 times the identity matrix to ensure positive definiteness.

Abalone Wine GR HEP Epinions Slashdot
Dpp 9.6E-3 1x 8.5E-2 1x 9.3E-3 1x 6.5E-2 1x 1.46 1x 5.85 1x
5.4E-4 17.8x 5.9E-3 14.4x 4.3E-4 21.6x 5.9E-4 110.2x 3.7E-3 394.6x 7.1E-3 823.9x
k-Dpp 1.4E-2 1x 0.15 1x 1.7E-2 1x 0.13 1x 2.40 1x 11.83 1x
7.3E-4 19.2x 1.1E-2 13.6x 7.3E-4 23.3x 9.2E-4 141.3x 4.9E-3 489.8x 1E-2 1183x
Dg 1025.6 1x 1951.3 1x 965.8 1x 6269.4 1x
17.3 59.3x 423.2 4.6x 10 9.7x 25.3 247.8x 418 712.9
Table 2: Running time and speedup for ($k$-)Dpp sampling and double greedy. For each dataset (two columns), the first column shows the running time (in seconds) and the second column the speedup. For each algorithm (two rows), the first row shows results for the original algorithm and the second row results for the algorithm using our framework. For Epinions and Slashdot, missing entries indicate that the experiments did not finish within 24 hours.

5.3 Empirical Evidence

We perform experiments on both synthetic and real-world datasets to test the impact of our retrospective quadrature framework in applications. We focus on ($k$-)Dpp sampling and the double greedy algorithm for the log-det objective.

5.3.1 Synthetic Datasets

We generate small sparse matrices using methods similar to those in Section 4.4, with different sizes for the ($k$-)Dpp experiments and for double greedy, and we vary the density of the matrices. The running time and speedup are shown in Figure 2.

The results suggest that our framework greatly accelerates both Dpp sampling and submodular maximization. The speedups are particularly pronounced for sparse matrices. As the matrices become very sparse, the original algorithms profit from sparsity too, and the difference shrinks a little. Overall, we see that our framework has the potential to lead to substantial speedups for algorithms involving bilinear inverse forms.

5.3.2 Real Datasets

We further test our framework on real-world datasets of varying sizes. We selected 6 datasets; four of them are of small/medium size and two are large. The four small/medium-sized datasets are used in [26]. The first two, Abalone and Wine (available at http://archive.ics.uci.edu/ml/), are popular regression datasets, for which we construct sparse kernel matrices with an RBF kernel, setting the bandwidth and cut-off parameters as in [26]. The other two small/medium-sized datasets are GR (arXiv General Relativity collaboration graph) and HEP (arXiv High Energy Physics collaboration graph), where the kernel matrices are Laplacian matrices. The final two large datasets are Epinions (who-trusts-whom network of Epinions) and Slashdot (Slashdot social network from Feb. 2009), available at https://snap.stanford.edu/data/, with large Laplacian matrices. Dataset statistics are shown in Tab. 1.

The running times in Tab. 2 suggest that the iterative bounds from quadrature significantly accelerate ($k$-)Dpp sampling and double greedy on real data. Our algorithms lead to speedups of up to a thousand times.

On the large sparse matrices, the “standard” double greedy algorithm did not finish within 24 hours, due to the expensive matrix operations involved. With our framework, the algorithm needs only 15 minutes.

To our knowledge, these are the first results running Dpp sampling and double greedy for information gain on datasets of this size.

5.4 Numerical details

Instability.

As seen in Alg. 1, the quadrature algorithm is built upon Lanczos iterations. Although in theory the Lanczos iterations construct a set of orthogonal Lanczos vectors, in practice the constructed vectors usually lose orthogonality after some iterations due to rounding errors. One way to deal with this problem is to reorthogonalize the vectors, either completely at each iteration or selectively (Parlett & Scott, 1979). Alternatively, an equivalent Lanczos iteration proposed by Paige (1971) uses a different expression to improve local orthogonality. Further discussion of the numerical stability of the method lies beyond the scope of this paper.

Preconditioning.

For Gauss quadrature on $u^\top A^{-1} u$, the convergence rate of the bounds depends on the condition number of $A$. We can use preconditioning techniques to obtain a well-conditioned matrix and proceed with that. Concretely, observe that for a non-singular matrix $C$,

$u^\top A^{-1} u \;=\; (C^{-1} u)^\top \left(C^{-1} A C^{-\top}\right)^{-1} (C^{-1} u).$

Thus, if $C^{-1} A C^{-\top}$ is well-conditioned, we can use it together with the vector $C^{-1} u$ in Gauss quadrature.

There exist various ways to obtain good preconditioners for an SPD matrix; a simple choice is a diagonal preconditioner. There also exist methods for efficiently constructing a sparse approximate inverse (Benzi et al., 1996). If $A$ happens to be an SDD matrix, we can use techniques introduced by Cheng et al. (2014) to construct an approximate sparse inverse in near-linear time.

6 Conclusion

In this paper we present a general and powerful computational framework for algorithms that rely on computations of bilinear inverse forms. The framework uses Gauss quadrature methods to lazily and iteratively tighten bounds, and is supported by our new theoretical results. We analyze properties of the various types of Gauss quadrature for approximating bilinear inverse forms and show that all bounds become monotonically tighter with the number of iterations; that the bounds given by Gauss-Radau are superior to those obtained from the other Gauss-type quadratures; and that both lower and upper bounds enjoy a linear convergence rate. We empirically verify the efficiency of our framework and obtain speedups of up to a thousand times for two popular examples: maximizing information gain and sampling from determinantal point processes.

Acknowledgements

This research was partially supported by NSF CAREER award 1553284 and a Google Research Award.

References

  • Anari et al. [2016] Anari, Nima, Gharan, Shayan Oveis, and Rezaei, Alireza. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, 2016.
  • Atzori et al. [2010] Atzori, Luigi, Iera, Antonio, and Morabito, Giacomo. The Internet of Things: A survey. Computer networks, 54(15):2787–2805, 2010.
  • Bai & Golub [1996] Bai, Zhaojun and Golub, Gene H. Bounds for the trace of the inverse and the determinant of symmetric positive definite matrices. Annals of Numerical Mathematics, pp. 29–38, 1996.
  • Bai et al. [1996] Bai, Zhaojun, Fahey, Gark, and Golub, Gene H. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, pp. 71–89, 1996.
  • Bekas et al. [2007] Bekas, Constantine, Kokiopoulou, Effrosyni, and Saad, Yousef. An estimator for the diagonal of a matrix. Applied numerical mathematics, pp. 1214–1229, 2007.
  • Bekas et al. [2009] Bekas, Constantine, Curioni, Alessandro, and Fedulova, Irina. Low cost high performance uncertainty quantification. In Proceedings of the 2nd Workshop on High Performance Computational Finance, 2009.
  • Belabbas & Wolfe [2009] Belabbas, Mohamed-Ali and Wolfe, Patrick J. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, pp. 369–374, 2009.
  • Benzi & Golub [1999] Benzi, Michele and Golub, Gene H. Bounds for the entries of matrix functions with applications to preconditioning. BIT Numerical Mathematics, pp. 417–438, 1999.
  • Benzi & Klymko [2013] Benzi, Michele and Klymko, Christine. Total communicability as a centrality measure. J. Complex Networks, pp. 124–149, 2013.
  • Bonacich [1987] Bonacich, Phillip. Power and centrality: A family of measures. American Journal of Sociology, pp. 1170–1182, 1987.
  • Boutsidis et al. [2009] Boutsidis, Christos, Mahoney, Michael W., and Drineas, Petros. An improved approximation algorithm for the column subset selection problem. In SODA, pp. 968–977, 2009.
  • Brezinski [1999] Brezinski, Claude. Error estimates for the solution of linear systems. SIAM Journal on Scientific Computing, pp. 764–781, 1999.
  • Brezinski et al. [2012] Brezinski, Claude, Fika, Paraskevi, and Mitrouli, Marilena. Estimations of the trace of powers of positive self-adjoint operators by extrapolation of the moments. Electronic Transactions on Numerical Analysis, pp. 144–155, 2012.
  • Buchbinder et al. [2012] Buchbinder, Niv, Feldman, Moran, Naor, Joseph, and Schwartz, Roy. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In FOCS, 2012.
  • Dong & Liu [1994] Dong, Shao-Jing and Liu, Keh-Fei. Stochastic estimation with noise. Physics Letters B, pp. 130–136, 1994.
  • Estrada & Higham [2010] Estrada, Ernesto and Higham, Desmond J. Network properties revealed through matrix functions. SIAM Review, pp. 696–714, 2010.
  • Fenu et al. [2013] Fenu, Caterina, Martin, David R., Reichel, Lothar, and Rodriguez, Giuseppe. Network analysis via partial spectral factorization and Gauss quadrature. SIAM Journal on Scientific Computing, pp. A2046–A2068, 2013.
  • Fika & Koukouvinos [2015] Fika, Paraskevi and Koukouvinos, Christos. Stochastic estimates for the trace of functions of matrices via Hadamard matrices. Communications in Statistics-Simulation and Computation, 2015.
  • Fika & Mitrouli [2015] Fika, Paraskevi and Mitrouli, Marilena. Estimation of the bilinear form for Hermitian matrices. Linear Algebra and its Applications, 2015.
  • Fika et al. [2014] Fika, Paraskevi, Mitrouli, Marilena, and Roupa, Paraskevi. Estimates for the bilinear form with applications to linear algebra problems. Electronic Transactions on Numerical Analysis, pp. 70–89, 2014.
  • Freericks [2006] Freericks, James K. Transport in multilayered nanostructures. The Dynamical Mean-Field Theory Approach, Imperial College, London, 2006.
  • Frommer et al. [2012] Frommer, Andreas, Lippert, Thomas, Medeke, Björn, and Schilling, Klaus. Numerical Challenges in Lattice Quantum Chromodynamics: Joint Interdisciplinary Workshop of John Von Neumann Institute for Computing, Jülich, and Institute of Applied Computer Science, Wuppertal University, August 1999, volume 15. Springer Science & Business Media, 2012.
  • Gauss [1815] Gauss, Carl F. Methodus nova integralium valores per approximationem inveniendi. apvd Henricvm Dieterich, 1815.
  • Gautschi [1981] Gautschi, Walter. A survey of Gauss-Christoffel quadrature formulae. In EB Christoffel, pp. 72–147. Springer, 1981.
  • Gillenwater et al. [2012] Gillenwater, Jennifer, Kulesza, Alex, and Taskar, Ben. Near-optimal MAP inference for determinantal point processes. In NIPS, 2012.
  • Gittens & Mahoney [2013] Gittens, Alex and Mahoney, Michael W. Revisiting the Nyström method for improved large-scale machine learning. ICML, 2013.
  • Golub [1973] Golub, Gene H. Some modified matrix eigenvalue problems. SIAM Review, pp. 318–334, 1973.
  • Golub & Meurant [1997] Golub, Gene H. and Meurant, Gérard. Matrices, moments and quadrature II; how to compute the norm of the error in iterative methods. BIT Numerical Mathematics, pp. 687–705, 1997.
  • Golub & Meurant [2009] Golub, Gene H. and Meurant, Gérard. Matrices, moments and quadrature with applications. Princeton University Press, 2009.
  • Golub & Welsch [1969] Golub, Gene H. and Welsch, John H. Calculation of Gauss quadrature rules. Mathematics of computation, pp. 221–230, 1969.
  • Golub et al. [2008] Golub, Gene H., Stoll, Martin, and Wathen, Andy. Approximation of the scattering amplitude and linear systems. Elec. Tran. on Numerical Analysis, pp. 178–203, 2008.
  • Hestenes & Stiefel [1952] Hestenes, Magnus R. and Stiefel, Eduard. Methods of conjugate gradients for solving linear systems. J. Research of the National Bureau of Standards, pp. 409–436, 1952.
  • Hough et al. [2006] Hough, J. Ben, Krishnapur, Manjunath, Peres, Yuval, and Virág, Bálint. Determinantal processes and independence. Probability Surveys, 2006.
  • Kang [2013] Kang, Byungkon. Fast determinantal point process sampling with application to clustering. In NIPS, pp. 2319–2327, 2013.
  • Krause et al. [2008] Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, pp. 235–284, 2008.
  • Kulesza & Taskar [2012] Kulesza, Alex and Taskar, Ben. Determinantal point processes for machine learning. arXiv:1207.6083, 2012.
  • Kwok & Adams [2012] Kwok, James T. and Adams, Ryan P. Priors for diversity in generative latent variable models. In NIPS, pp. 2996–3004, 2012.
  • Lanczos [1950] Lanczos, Cornelius. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
  • Lee et al. [2014] Lee, Christina E., Ozdaglar, Asuman E., and Shah, Devavrat. Solving systems of linear equations: Locally and asynchronously. arXiv, abs/1411.2647, 2014.
  • Leskovec et al. [2008] Leskovec, Jure, Lang, Kevin J., Dasgupta, Anirban, and Mahoney, Michael W. Statistical properties of community structure in large social and information networks. In WWW, pp. 695–704, 2008.
  • Lin et al. [2011a] Lin, Lin, Yang, Chao, Lu, Jianfeng, and Ying, Lexing. A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2D electronic structure calculations. SIAM Journal on Scientific Computing, pp. 1329–1351, 2011a.
  • Lin et al. [2011b] Lin, Lin, Yang, Chao, Meza, Juan C., Lu, Jianfeng, Ying, Lexing, and E, Weinan. Selinv–an algorithm for selected inversion of a sparse symmetric matrix. ACM Transactions on Mathematical Software, 2011b.
  • Lobatto [1852] Lobatto, Rehuel. Lessen over de differentiaal-en integraal-rekening: Dl. 2 Integraal-rekening, volume 1. Van Cleef, 1852.
  • Meurant [1997] Meurant, Gérard. The computation of bounds for the norm of the error in the conjugate gradient algorithm. Numerical Algorithms, pp. 77–87, 1997.
  • Meurant [1999] Meurant, Gérard. Numerical experiments in computing bounds for the norm of the error in the preconditioned conjugate gradient algorithm. Numerical Algorithms, pp. 353–365, 1999.
  • Meurant [2006] Meurant, Gérard. The Lanczos and conjugate gradient algorithms: from theory to finite precision computations, volume 19. SIAM, 2006.
  • Minoux [1978] Minoux, Michel. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pp. 234–243. Springer, 1978.
  • Mirzasoleiman et al. [2015] Mirzasoleiman, Baharan, Badanidiyuru, Ashwinkumar, Karbasi, Amin, Vondrák, Jan, and Krause, Andreas. Lazier than lazy greedy. In AAAI, 2015.
  • Nemhauser et al. [1978] Nemhauser, George L.., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, pp. 265–294, 1978.
  • Page et al. [1999] Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. The PageRank citation ranking: bringing order to the web. Stanford InfoLab, 1999.
  • Radau [1880] Radau, Rodolphe. Étude sur les formules d’approximation qui servent à calculer la valeur numérique d’une intégrale définie. J. de Mathématiques Pures et Appliquées, pp. 283–336, 1880.
  • Rasmussen & Williams [2006] Rasmussen, Carl E. and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
  • Rocková & George [2015] Rocková, Veronika and George, Edward I. Determinantal priors for variable selection, 2015.
  • Scott [2012] Scott, John. Social network analysis. Sage, 2012.
  • Sherman & Morrison [1950] Sherman, Jack and Morrison, Winifred J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, pp. 124–127, 1950.
  • Shewchuk [1994] Shewchuk, Jonathan R. An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • Sidje & Saad [2011] Sidje, Roger B. and Saad, Yousef. Rational approximation to the fermi–dirac function with applications in density functional theory. Numerical Algorithms, pp. 455–479, 2011.
  • Stoer & Bulirsch [2013] Stoer, Josef and Bulirsch, Roland. Introduction to numerical analysis, volume 12. Springer Science & Business Media, 2013.
  • Sviridenko et al. [2015] Sviridenko, Maxim, Vondrák, Jan, and Ward, Justin. Optimal approximation for submodular and supermodular optimization with bounded curvature. In SODA, 2015.
  • Tang & Saad [2012] Tang, Jok M. and Saad, Yousef. A probing method for computing the diagonal of a matrix inverse. Numerical Linear Algebra with Applications, pp. 485–501, 2012.
  • Wasow [1952] Wasow, Wolfgang R. A note on the inversion of matrices by random walks. Mathematical Tables and Other Aids to Computation, pp. 78–81, 1952.
  • Wilf [1962] Wilf, Herbert S. Mathematics for the Physical Sciences. Wiley, New York, 1962.

Appendix A Further Background on Gauss Quadrature

We present below a more detailed summary of material on Gauss quadrature to make the paper self-contained.

A.1 Selecting weights and nodes

We have described that the Riemann-Stieltjes integral (3.1) can be expressed as in (3.2), where the quadrature sum denotes the $t$-th degree approximation and $R_t[f]$ denotes a remainder term. The weights $w_i$, $v_j$ and nodes $\theta_i$ are chosen such that for all polynomials $f$ of degree less than $2t + M$ we have exact interpolation, i.e., $R_t[f] = 0$. One way to compute weights and nodes is to set $f(\lambda) = \lambda^k$ for $k = 0, 1, 2, \dots$ and then solve the resulting nonlinear system exactly. But there is an easier way to obtain weights and nodes, namely by using polynomials orthogonal with respect to the measure $\mu$. Specifically, we construct a sequence of orthogonal polynomials $p_0(\lambda), p_1(\lambda), \dots$ such that $p_i$ is a polynomial in $\lambda$ of degree exactly $i$, and $p_i$, $p_j$ are orthogonal, i.e., they satisfy

$\int_a^b p_i(\lambda)\, p_j(\lambda)\, d\mu(\lambda) \;=\; \delta_{ij}.$

The roots of $p_t$ are distinct, real, lie in the interval $[a, b]$, and form the nodes of Gauss quadrature (see, e.g., [29, Ch. 6]).

Consider the two monic polynomials whose roots serve as quadrature nodes:

$\pi_t(\lambda) \;=\; \prod\nolimits_{i=1}^{t} (\lambda - \theta_i), \qquad \rho_M(\lambda) \;=\; \prod\nolimits_{j=1}^{M} (\lambda - \tau_j),$

where $\rho_0 \equiv 1$ for consistency. We further denote $\sigma(\lambda) = \pm \rho_M(\lambda)$, where the sign is taken to ensure $\sigma(\lambda) \ge 0$ on $[a, b]$. Then, for $i = 1, \dots, t$, we calculate the quadrature weights as

$w_i \;=\; \frac{1}{\sigma(\theta_i)\, \pi_t'(\theta_i)} \int_a^b \frac{\sigma(\lambda)\, \pi_t(\lambda)}{\lambda - \theta_i}\, d\mu(\lambda),$

where $\pi_t'$ denotes the derivative of $\pi_t$ with respect to $\lambda$. When $M = 0$ the quadrature degenerates to Gauss quadrature and we have

$w_i \;=\; \frac{1}{\pi_t'(\theta_i)} \int_a^b \frac{\pi_t(\lambda)}{\lambda - \theta_i}\, d\mu(\lambda).$

Although we have specified how to select nodes and weights for quadrature, these ideas cannot be applied directly to our problem because the measure $\mu$ is unknown. Indeed, calculating the measure explicitly would require knowing the entire spectrum of $A$, which is as good as explicitly computing $u^\top A^{-1} u$, and hence untenable for us. The next section shows how to circumvent the difficulties due to the unknown $\mu$.

A.2 Gauss Quadrature Lanczos (GQL)

The key idea for circumventing our lack of knowledge of $\mu$ is to recursively construct polynomials, called Lanczos polynomials, whose construction ensures orthogonality with respect to $\mu$. Concretely, we construct Lanczos polynomials via the following three-term recurrence:

$\beta_i\, p_i(\lambda) \;=\; (\lambda - \omega_i)\, p_{i-1}(\lambda) \;-\; \beta_{i-1}\, p_{i-2}(\lambda), \qquad p_{-1}(\lambda) \equiv 0, \quad p_0(\lambda) \equiv 1,$   (A.1)

while ensuring $\int_a^b p_i(\lambda)\, p_j(\lambda)\, d\mu(\lambda) = \delta_{ij}$. We can express (A.1) in matrix form by writing

$\lambda\, P_t(\lambda) \;=\; J_t\, P_t(\lambda) \;+\; \beta_t\, p_t(\lambda)\, e_t,$

where $P_t(\lambda) = [p_0(\lambda), \dots, p_{t-1}(\lambda)]^\top$, $e_t$ is the $t$-th canonical unit vector, and $J_t$ is the tridiagonal matrix

$J_t = \begin{pmatrix} \omega_1 & \beta_1 & & \\ \beta_1 & \omega_2 & \ddots & \\ & \ddots & \ddots & \beta_{t-1} \\ & & \beta_{t-1} & \omega_t \end{pmatrix}.$   (A.2)

This matrix is known as the Jacobi matrix, and is closely related to Gauss quadrature. The following well-known theorem makes this relation precise.

Theorem 10 ([62, 30]).

The eigenvalues of $J_t$ form the nodes of Gauss-type quadratures. The weights are given by the squares of the first elements of the normalized eigenvectors of $J_t$.

Thus, if $J_t$ has the eigendecomposition $J_t = P \Theta P^\top$, then for Gauss quadrature Thm. 10 yields

$\sum\nolimits_{i=1}^{t} w_i\, f(\theta_i) \;=\; \|u\|^2\, e_1^\top P f(\Theta) P^\top e_1 \;=\; \|u\|^2\, \big(f(J_t)\big)_{1,1}.$   (A.3)
Specialization.

We now specialize to our main focus, $f(A) = A^{-1}$, for which we prove more precise results. In this case, (A.3) becomes $g_t = \|u\|^2 (J_t^{-1})_{1,1}$. The task now is to compute this quantity and, given $A$ and $u$, to obtain the Jacobi matrix $J_t$.

Fortunately, we can efficiently calculate $J_t$ iteratively using the Lanczos algorithm [38]. Suppose we have the estimate $J_{t-1}$; in iteration $t$ of Lanczos, we compute the tridiagonal coefficients $\omega_t$ and $\beta_{t-1}$ and append them to this estimate to form $J_t$. As to $g_t = \|u\|^2 (J_t^{-1})_{1,1}$, assuming we have already computed $g_{t-1}$, invoking the Sherman-Morrison identity [55] we obtain a recursion

(A.4)

that updates $g_{t-1}$ to $g_t$, where the required auxiliary quantities can themselves be computed recursively using a Cholesky-like factorization of $J_t$ [29, p. 31].

For Gauss-Radau quadrature, we need to modify so that it has a prescribed eigenvalue. More precisely, we extend to for left Gauss-Radau (