1 Introduction
A preconditioner is helpful for solving a linear system via iterative methods if it reduces the number of iterations enough to offset the cost of constructing the preconditioner plus the additional cost of taking the iterations (typically one extra application of the preconditioner per iteration). Ignoring the latter component of this tradeoff, for the conjugate gradient method this corresponds to the preconditioned system having a small condition number [trefethen1997numerical, deift2019conjugate]; a precise objective is unclear in the indefinite or unsymmetric case with general Krylov methods, beyond the general desire that the spectrum of the preconditioned matrix be clustered [benzi1999orderings]. For instance, even if we have a small number of candidate preconditioners ready to use and assume that each adds the same amount of time to compute an iteration, it is unclear how one would go about estimating which preconditioner would reduce the iteration count the most without actually solving a system with each preconditioner or doing a comparable amount of work. This preconditioner selection task is the focus of the present work.
1.1 Contributions
The core contribution of this work is the realization that randomized sketching methods make completely practical the computation of a helpful forecaster of preconditioner quality previously thought to be infeasible to compute.
In addition to this primary method for computing preconditioner stability, we provide a number of other results:

We prove that no practical deterministic algorithm, in a meaningful sense, could possibly be used to estimate the preconditioner stability.

We confirm the conjecture of [avron2011randomized] regarding the true asymptotic sample complexity of the Gaussian trace estimator using a substantially more direct proof than the general result provided in [wimmer2014optimal], at the same time confirming the tightness of our stability estimation algorithm's convergence bound.

We provide a randomized algorithm which can provably select a preconditioner of approximately minimal stability among candidate preconditioners using computational resources equivalent to computing on the order of steps of the conjugate gradients algorithm.

We give a theoretical speedup to the initial preconditioner selection method which largely decouples the runtime dependence between the number of preconditioners and the desired accuracy, provided the input preconditioner stabilities satisfy an anti-concentration assumption.

Using our initial preconditioner selection algorithm, we create the first (to the best of our knowledge) method for preconditioning in kernel regression which never requires more iterations than using no preconditioner in standard tests.
It is important to point out that while our experiments consider positive definite systems and preconditioners, our algorithms and runtime bounds hold as-is for arbitrary matrices and preconditioners.
1.2 Prior Art
Current approaches to forecasting preconditioner quality are not robust, in the sense that in many realistic scenarios they can fail to select a highly performant preconditioner in favor of a poorly performing one. This lack of robustness, even in the positive definite case, results in part from the fact that the research community does not have robust methods for estimating condition numbers of large sparse matrices, which makes proxies for preconditioner quality necessary. For instance, Avron et al. [avron2019spectral] recently produced a condition number estimator in this setting which appears to perform admirably in many situations but does not always converge and at this point does not have rigorous theoretical backing.
The simplest forecasting criterion is that a preconditioner ought to be an ‘accurate’ approximation of the matrix, in the sense that the Frobenius norm of their difference is small. This preconditioner accuracy criterion can be efficiently computed so long as one has access to the preconditioning matrix in memory as well as the standard matrix-vector product access to the system matrix. Moreover, for symmetric matrices this accuracy criterion is a useful proxy for the number of conjugate gradient iterations necessary to solve the preconditioned system. This point was theoretically confirmed by Axelsson and Eijkhout [axelsson1990vectorizable], who showed that the preconditioned condition number can be bounded in terms of the accuracy. The accuracy criterion for forecasting preconditioner quality was heavily tested on an empirical level in [duff1989effect]. Even in this setting, though, there exist accurate real-world preconditioners that give a poor iteration count because the associated stability quantity is very large [benzi1999orderings]. Unfortunately, to detect this issue one presumably needs to compute the entire residual matrix in memory solely from matrix-vector product access. The natural method for this, computing each of its columns from products with the standard basis vectors, takes the same leading-order floating point cost as solving the system via conjugate gradients, at least when matrix-vector multiply access to the preconditioner is comparable in cost to multiplying by the matrix itself. Perhaps as a result of this intuition, computing this quantity, or rather its close cousin the Frobenius-norm stability, was deemed impractical [benzi1999orderings, benzi2002preconditioning]. In practice, Chow and Saad [chow1997experimental] suggested using a cheap lower bound on the stability computed from the vector of all ones. This has proved helpful for detecting preconditioner instability in some situations, and especially for indefinite systems [benzi1999orderings].
The quantity known as ‘preconditioner stability’ is empirically the most reliable indicator of preconditioner performance from this class of forecasts. This is especially true among many non-symmetric problems or problems which are far from diagonally dominant [benzi1999orderings]. Unfortunately, prior work has suggested that computing preconditioner stability is ‘impractical’ [benzi2002preconditioning], for effectively the same reason that the accuracy-related norm above was considered impractical to compute.
In spite of this, Tyrtyshnikov’s influential paper [tyrtyshnikov1992optimal] gave a fast algorithm to solve the harder problem, also considered in this paper, of determining the minimal-stability preconditioner over the special class of circulant matrices. This algorithm relied heavily on properties of these matrices.
Our results in this paper will suggest that even for generic matrices and preconditioners, one can get an arbitrarily accurate estimate of most of these quantities previously thought to be unusable for preconditioner selection. That said, it is crucial to note that the prior impracticality assessment is completely valid in a certain regard: as we also show, no deterministic algorithm could possibly even approximate the preconditioner stability, for example. Thus any algorithm must fail at some of these tasks with some nonzero probability.
1.3 Overview
Section 2 motivates the need for a randomized algorithm for stability estimation with theory, responds with a sketching-based solution, and uses it to create and analyze a method for preconditioner selection. This theory is empirically confirmed in Section 3, where we apply the primary preconditioner selection algorithm from Section 2 to solving generic real-world systems (Section 3.1) and to creating more robust preconditioned solvers for kernel regression (Section 3.2). After this we conclude in Section 4 and discuss open problems that could provide an avenue for future work.
1.3.1 Notation
Boldface letters denote vectors while their uppercase analogues denote matrices. Such matrices and vectors can be either random or deterministic. The underlying scalar field is either the real numbers or the complex numbers unless otherwise specified. The adjoint of a matrix, the inner product of two vectors, the expectation of a random variable, and the probability of an event are written with standard notation. The norms used are the Frobenius and Euclidean norms, and the condition number is written in the usual way. Normal and circularly-symmetric complex normal random variables appear throughout our analysis.

2 Algorithms
This section contains all of our core theoretical results and algorithms. In Section 2.1 we show that the only algorithms which can possibly estimate preconditioner stability efficiently must be randomized. The follow-up question of whether randomization can indeed work is answered in the affirmative in Section 2.2, where we show that a slight adaptation of a well-known sketching-based algorithm for computing Schatten norms perfectly fits our realistic access model to the preconditioner and matrix. Once we have a good estimator of preconditioner stability, a natural method for selecting the candidate preconditioner with minimal stability criterion presents itself in Section 2.3. It turns out that our algorithm can be trivially parallelized, and a testament to this fact is given in Section 2.3.1. In Section 2.4 we take advantage of highly informative results from the literature on trace estimation to provide useful approximation guarantees and runtime bounds for the previously presented algorithms. Section 2.4.1 returns the focus to Theorem 2 of Section 2.4, showing that its leading constant is tight and providing a more concise resolution of a conjecture from [avron2011randomized] than the more generalized result from [wimmer2014optimal]. In that section, we also note that, by the result [wimmer2014optimal, Thm. 3] from the trace estimation literature, no randomized algorithm for estimating preconditioner stability which has access to matrix-vector products could do better asymptotically than Algorithm LABEL:alg:stab. Finally, this conversation is wrapped up in Section 2.4.2, where our bounds from Section 2.4 are utilized to provide a theoretical improvement to Algorithm LABEL:alg:pick.
2.1 Randomization is Necessary to Compute Preconditioner Stability
This paper provides a simple randomized algorithm which can accurately estimate the preconditioner stability in time faster than running a constant number of iterations of preconditioned conjugate gradients with the given matrix and preconditioner. Through incorporating randomness, however, we must accept that the algorithm fails with some probability. This failure probability can be made arbitrarily small, but it would still be advantageous (for example, in mission-critical applications) to have a deterministic algorithm for the same task with a comparable approximation guarantee. The purpose of this section is to crush that latter hope, and the following theorem does just that.
Theorem 1
Fix an accuracy level. Suppose we have a deterministic algorithm which takes as input an arbitrary positive semidefinite matrix and positive definite preconditioner, and returns a relative-error estimate of the preconditioner stability after sequentially querying and observing a number of matrix-vector multiplies depending only on the accuracy and the dimension. Then the number of queries must be at least the underlying dimension.
Proof
Fix for the remainder of the proof. Suppose to the contrary that suffices to compute . Let be the query vectors used by the algorithm in the case that always returns . Write for the orthogonal projection onto . Both of the positive semidefinite matrices and will return uniformly over , and thus since the algorithm is deterministic the estimated stabilities are equal. But since was an orthogonal projection onto a subspace of dimension strictly less than , and hence
by our approximation guarantee. This contradiction ensures that we must take at least as many queries as the dimension. Since the inputs are invertible, matrix-vector multiply access to the queried matrix is equivalent via a bijection to matrix-vector multiply access to the matrix and preconditioner themselves, hence our strengthened statement of the result.
Of course, querying once along each dimension suffices to achieve no error at all, and so the above lower bound is tight:

(1)

where the queries range over any orthonormal basis. Also, note that the condition that the inputs be positive semidefinite gives a stronger result than if they were allowed to be arbitrary matrices.
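The exact computation above, using one query per dimension, can be sketched concretely. The function below is our illustrative implementation (names are ours), assuming black-box matvec routines; probing with each standard basis vector recovers one column of the residual matrix, and the squared column norms sum to the squared Frobenius norm.

```python
import numpy as np

def exact_stability(apply_A, apply_Minv, n):
    """Exact stability ||I - M^{-1}A||_F from n matrix-vector products.

    Each standard basis vector e_j yields the j-th column of I - M^{-1}A,
    so accumulating squared column norms gives the exact squared
    Frobenius norm -- the n-query upper bound discussed above.
    """
    total = 0.0
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        col = e - apply_Minv(apply_A(e))  # column j of I - M^{-1}A
        total += np.dot(col, col)
    return np.sqrt(total)
```

As the surrounding discussion notes, these n matvecs cost as much as running conjugate gradients to completion, which is exactly why this route is impractical.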
In order to put Theorem 1 into better context, recall that the dominant cost of an iteration of preconditioned conjugate gradients [golub2012matrix, Alg. 11.5.1] is (a) applying the matrix to a vector, and (b) applying the preconditioner to a vector. Thus Theorem 1 says roughly that in the time it takes to even approximate the stability deterministically, one can solve the system exactly (in exact arithmetic) by running the conjugate gradients algorithm for a full set of iterations. Since our whole goal of computing the stability is to forecast how well the preconditioner would do for solving the system, this means that any deterministic algorithm is, in general, impractical for this task.
2.2 Computing Preconditioner Stability via Randomization
algocf[t]
Now we will demonstrate that, unlike the deterministic case, randomization makes it entirely practical to compute preconditioner stability. To see why this is intuitive, let be a standard (real or complex) Gaussian vector. Then
(2)  
(3)  
(4)  
(5) 
by the linearity of expectation, the cyclic property of the trace, and the fact that . Thus, if are independent standard normal vectors for all , the Monte Carlo squared stability estimate
(6) 
almost surely as
by the strong law of large numbers. We can rewrite the above estimator as
where the sketch is a matrix with independent and identically distributed standard normal elements. This stability estimation algorithm is given as Algorithm LABEL:alg:stab. In Algorithm LABEL:alg:stab, we adjust whether the sketch is real or complex depending on the scalar field of the input, for analysis reasons that will become apparent in Section 2.4. It is important to note that the mathematical foundations of the above algorithm are not novel. It is equivalent in exact arithmetic to applying the trace estimators in [roosta2015improved] to the relevant matrix and then taking the square root. It is also a simplification of the Schatten norm estimator in [woodruff2014sketching, Thm. 69] (relayed from [li2014sketching]). The reason we include Algorithm LABEL:alg:stab is not because of its mathematical novelty but because of its observational novelty: sketching algorithms using the matrix-vector multiply access model are a perfect fit for interrogating the matrix and preconditioner in the context of iterative methods for solving linear systems, since matrix-vector products with both are precisely the operations those methods already require.
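The Monte Carlo estimator just described can be written in a few lines. The sketch below is our illustrative version (names are ours), assuming matvec access to the matrix and the preconditioner inverse; it averages squared residual norms over Gaussian probes and takes a square root, exactly the strong-law argument above.

```python
import numpy as np

def estimate_stability(apply_A, apply_Minv, n, N, rng=None):
    """Monte Carlo sketch of the stability ||I - M^{-1}A||_F.

    Averages ||(I - M^{-1}A) g||_2^2 over N independent standard normal
    probes g; since E||B g||^2 = ||B||_F^2, the square root of the
    average is a consistent estimator. Only matvec access is needed.
    """
    rng = np.random.default_rng(rng)
    acc = 0.0
    for _ in range(N):
        g = rng.standard_normal(n)
        r = g - apply_Minv(apply_A(g))  # (I - M^{-1}A) g
        acc += np.dot(r, r)
    return np.sqrt(acc / N)
```

Each probe costs one application of the matrix and one of the preconditioner, i.e. roughly one step of preconditioned conjugate gradients.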
Of course, the presentation thus far does not help us choose how large the number of samples should be. To that end, Theorem 2 in Section 2.4 presents a tight upper bound on the number of samples necessary to achieve a given error bound with high probability, although in practice a small constant number of samples appears sufficient in many situations of interest; see Section 3.
2.3 Randomized Algorithm for Selecting the ‘Best’ Preconditioner
Once we have a practical way to compute preconditioner stability, a trivial algorithm for picking the best preconditioner among candidates becomes natural. Namely, we can compute a stability estimate for each candidate and then just return the preconditioner whose estimate is minimal. This is presented as Algorithm LABEL:alg:pick. As we mentioned in the previous section, theoretical advice on how to pick the number of samples will be given in Section 2.4. An improvement to this algorithm in the case there is a clear winner, relying on those analytical bounds, is included in Section 2.4.2.
algocf[t]
We note that the sketching matrix can be reused when computing the stability estimates in Algorithm LABEL:alg:stab. This is done in all our computational experiments in Section 3, and reduces the number of normal variates one needs to simulate by a factor equal to the number of candidate preconditioners. Reuse does not affect our theoretical upper bound presented as Theorem 3.
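The selection procedure with a shared sketch can be sketched as follows. This is our illustrative code (names are ours), not the paper's reference implementation: one Gaussian sketch is drawn, the matrix is applied to it once, and each candidate preconditioner is scored against the same probes before taking the argmin.

```python
import numpy as np

def pick_preconditioner(apply_A, minv_list, n, N, rng=None):
    """Return the candidate index with the smallest estimated stability.

    A single n-by-N Gaussian sketch Z is reused across all candidates;
    as noted above, the union-bound analysis needs no independence
    between the per-candidate estimates, so reuse is valid.
    """
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, N))
    AZ = np.column_stack([apply_A(Z[:, j]) for j in range(N)])  # A Z, once
    best, best_est = None, np.inf
    for i, apply_Minv in enumerate(minv_list):
        R = Z - np.column_stack([apply_Minv(AZ[:, j]) for j in range(N)])
        est = np.linalg.norm(R, 'fro') / np.sqrt(N)
        if est < best_est:
            best, best_est = i, est
    return best, best_est
```

Note that applying the matrix to the sketch happens only once, so adding more candidates costs only additional preconditioner solves.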
2.3.1 Parallelization
A convenient aspect of sketching-based algorithms like Algorithm LABEL:alg:pick is that they can be parallelized extremely easily. For instance, suppose we are trying to pick the minimal-stability preconditioner among several candidates, and have one processor per candidate, each with access to the matrix and its assigned preconditioner. Then we can compute each stability estimate in parallel on its own processor. Ignoring communication costs (which are a genuine concern in practice), this would bring the runtime of the algorithm down to that of computing a fixed number of steps of preconditioned conjugate gradients with the most computation-intensive (in terms of matrix-vector multiply access) preconditioner.
Taken to the extreme, one could similarly parallelize Algorithm LABEL:alg:pick with one processor per candidate-and-probe pair, assuming each had access to the matrix and its candidate preconditioner. Each such processor would compute and return the squared residual norm for one independently sampled standard normal probe vector. The per-candidate sample averages could then be formed in parallel, at which point a final processor could select and return the candidate whose estimate is minimal. Ignoring communication costs again, this algorithm takes fewer floating point operations than running one iteration of preconditioned conjugate gradients with the most computation-intensive preconditioner, plus the lower-order flops used to combine the partial results into the estimates.
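A minimal sketch of the per-candidate parallelization follows, using a thread pool for illustration (real deployments would use distributed workers, and the function name is ours). Each task scores one candidate against the same shared Gaussian sketch, and the driver takes the argmin; here `apply_A` and each `apply_Minv` are assumed to act columnwise on a matrix of probes.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def pick_parallel(apply_A, minv_list, n, N, rng=None):
    """Embarrassingly parallel scoring: one task per candidate.

    Every worker reuses the same sketch Z and the precomputed product
    A Z, so the per-candidate work is just preconditioner solves.
    Communication costs are ignored, as in the discussion above.
    """
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, N))
    AZ = apply_A(Z)  # assumed to apply A to each column of Z

    def score(apply_Minv):
        R = Z - apply_Minv(AZ)  # residual sketch (I - M^{-1}A) Z
        return np.linalg.norm(R, 'fro') / np.sqrt(N)

    with ThreadPoolExecutor() as pool:
        ests = list(pool.map(score, minv_list))
    k = int(np.argmin(ests))
    return k, ests[k]
```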
2.4 Approximation Guarantees and Runtime Bounds
This section details runtime bounds and approximation guarantees for Algorithms LABEL:alg:stab and LABEL:alg:pick, as well as using those bounds to derive a theoretical improvement to Algorithm LABEL:alg:pick.
To start, we will give the following theorem, which says that to estimate preconditioner stability up to a multiplicative factor with failure probability at most a prescribed level, the number of samples in Algorithm LABEL:alg:stab may be taken to depend only on the accuracy and the failure probability. This is (in contrast to the deterministic case) completely independent of the underlying dimension. Theorem 2 sharpens an analogous bound for trace estimators given in [roosta2015improved, Thm. 3], largely following the proof structure of Theorem 1 from that work. We improve the leading constant and allow for complex matrices. This theorem is included as a result in its own right primarily because we can show that its leading constant is tight; see Section 2.4.1.
Theorem 2
Let and be arbitrary matrices in where is invertible. If and are positive and less than one, taking ensures that in exact arithmetic the estimate of Algorithm LABEL:alg:stab satisfies
with probability at least . In particular, if , then the simpler condition ensures this same approximation guarantee.
Proof
Unitarily diagonalize as for some unitary and nonnegative diagonal matrix . By the rotation invariance of the Gaussian and the fact that is orthogonal when , is equal in distribution to in both the real and complex cases. This implies is equal in distribution to and we can focus on the latter, simpler quantity. By Markov’s inequality, for any ,
(7)  
(8)  
(9) 
Jensen’s inequality reduces
(10) 
so long as . Note that the last inequality covers both real and complex , and results from the moment generating function being bounded above uniformly by the real case for this range of . Taking in Equation 9 gives
(11) 
via the scalar inequality for all . The same argument for the lower tail, instead taking , gives
(12) 
as well. A union bound provides the desired result.
Using Theorem 2 we are able to prove an approximation guarantee for Algorithm LABEL:alg:pick via a union bound. In particular, to achieve a multiplicative approximation to the best of the candidate preconditioners with high probability, we can take a number of samples that is again independent of the underlying dimension. The resulting dependence on the number of candidates is quite weak, especially since in realistic applications we would only expect to have at most, say, fifty candidate preconditioners.
Theorem 3
Let be an arbitrary matrix, and be invertible candidate preconditioners for . If and are positive and less than one, taking ensures that the preconditioner returned by Algorithm LABEL:alg:pick satisfies
with probability at least . In particular, if the simpler condition ensures
with probability at least .
Proof
Start by fixing any . If we take , Theorem 2 ensures that
(13) 
except with probability at most the per-candidate failure level. In particular, if we unfix the candidate index, the probability that at least one of the estimates fails to satisfy Equation 13 is controlled by a union bound. (Note that we did not need independence of the estimates here; this is why reusing the sketching matrix is valid.) Thus, with probability at least the target level, all estimates satisfy Equation 13 simultaneously.
Write for the candidate preconditioner returned by Algorithm LABEL:alg:pick, and write for a candidate preconditioner which satisfies
(14) 
Then since the estimate of the stability of was at most that of by minimality, the simultaneous bounds of Equation 13 give
except with probability at most . Rearranging the inequality gives the desired result after substituting Equation 14.
The final simplified bound results from the scalar inequality when and simple algebraic manipulation.
2.4.1 The Constant in Theorem 2 is Tight
Most of the theory presented in this paper relies on Theorem 2 to create more sophisticated bounds. Since Algorithm LABEL:alg:stab is at its core a repurposing of a trace estimator using only matrix-vector products, the work [wimmer2014optimal] applies and ensures that no randomized, adaptive algorithm for estimating the stability could possibly use asymptotically fewer matrix-vector multiplies, so long as the algorithm only has access to products with query vectors. In this sense, Algorithm LABEL:alg:stab is optimal.
The theoretically-inclined practitioner, however, also cares about knowing the optimality of our analysis in Theorem 2. The following theorem says that our analysis in Theorem 2 is asymptotically tight, even up to the leading effective constant for small accuracy parameters. The proof makes use of the Lambert W function [hoorfar2007approximation].
Theorem 4
Fix some . For any underlying dimension , there exists a positive semidefinite matrix , positive definite matrix , and some so that for any , taking guarantees the stability estimate returned by Algorithm LABEL:alg:stab fails to satisfy the equation
with probability at least .
Proof
If is a standard normal random variable then
(15) 
by [gordon1941values] for all . Setting the right hand side of Inequality 15 to and solving gives
(16) 
whenever , which is satisfied when .
Now let and , where is the first standard basis vector. We can observe that
(17) 
if
is a standard Gaussian vector. In particular, the standard deviation of
is . Thus since is a sample average of independent copies of these random variables, fixing ensures

(18)  
(19)  
(20)  
(21) 
by the Central Limit Theorem
[vershynin2018high, Thm. 1.3.2] and Equation 16 as . This implies the existence of an so that ensures the relation

(22) 
fails with probability at least under our relation defining . The simpler condition on given in the statement of this result follows from the bound for all from [hoorfar2007approximation, Thm. 2].
We point out that the above proof gives another confirmation of the conjecture in [avron2011randomized] regarding the true asymptotics of the Gaussian trace estimator. The work in [wimmer2014optimal] confirmed this conjecture to be true, but did so as a corollary of a general result regarding lower bounds for trace estimation algorithms. Our result above is much more direct.
2.4.2 An Improvement in the Presence of a Clear Winner
Algorithm LABEL:alg:pick is simple to implement and works well for selecting preconditioners of minimal stability, as we shall see in Section 3. Nevertheless, if we are selecting between preconditioners where some are clearly worse than the optimal preconditioner in terms of stability, our method seems excessive. Intuitively, we should be able to tell that terrible preconditioners will not be optimal with very rudimentary information. Algorithm LABEL:alg:improvement presents such a revision to Algorithm LABEL:alg:pick, iteratively refining the stability estimates we have and filtering out any preconditioners as soon as we can be confident they will not be optimal. Note that the algorithm crucially relies on the bounds from Section 2.4.
algocf[t]
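To make the filter-and-refine idea concrete, the following is a sketch of its control flow under assumptions of ours: `score(i, N)` is a hypothetical routine returning candidate i's stability estimate from an N-column sketch, and the halving tolerance schedule is illustrative rather than the paper's exact constants.

```python
def pick_with_filtering(score, m, eps, T):
    """Successive filtering: double the sketch size each round and drop
    any candidate whose estimate exceeds the current best by more than
    the shrinking margin, stopping once one candidate remains or the
    margin reaches the target accuracy eps.
    """
    alive = list(range(m))
    N = 1
    for t in range(T):
        margin = max(eps, 2.0 ** (-t))  # hypothetical tolerance schedule
        ests = {i: score(i, N) for i in alive}
        best = min(ests.values())
        # keep only candidates still plausibly near-optimal
        alive = [i for i in alive if ests[i] <= (1 + margin) * best]
        if len(alive) == 1 or margin <= eps:
            break
        N *= 2  # refine the survivors with a larger sketch
    return min(alive, key=lambda i: score(i, N))
```

Clearly poor candidates are eliminated in the cheap early rounds, so the expensive high-accuracy sketches are spent only on the near-optimal survivors, which is the source of the decoupling claimed by Theorem 5.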
We can prove that Algorithm LABEL:alg:improvement is actually an improvement over Algorithm LABEL:alg:pick by making an anti-concentration assumption about the input stabilities.
Theorem 5
Let be an arbitrary matrix, be invertible candidate preconditioners for , , and . Denoting , we will write
for the (shifted) cumulative distribution function of the input relative stabilities. If
uniformly over for some positive constant , then Algorithm LABEL:alg:improvement returns a preconditioner satisfying

with probability at least using strictly fewer floating point operations than running iterations of the preconditioned conjugate gradients algorithm with the most expensive preconditioner, as measured by the number of floating point operations required to apply it to input vectors.
Proof
The same Bonferroni-correction argument from the proof of Theorem 3 ensures that
(23) 
simultaneously for all over the course of the algorithm, except with probability at most . The rest of the proof will only rely on property 23, so everything we say will hold with this same probability.
If is the set in step of the algorithm and is in before the filtering at the end of step , Equation 23 implies
Thus, since initially, we know by induction that throughout the process of the entire algorithm. Now consider in the final step of Algorithm LABEL:alg:improvement. Since ,
(24) 
Rearranging and realizing that at gives our desired approximation guarantee.
Now we will exhibit the runtime bound by bounding at each step of Algorithm LABEL:alg:improvement. We claim that for all . To see this, note that if a candidate preconditioner is retained after filtering in any step of the algorithm,
(25) 
Thus the number of elements in just after step in the algorithm is at most since for . Our runtime bound follows from the sum
(26) 
where is the set during iteration of the algorithm. This gives the number of matrixvector multiplies of the form used by the algorithm after the first step. To see the final form, add on the multiplies done during the first iteration and plug in .
The anti-concentration condition in Theorem 5 intuitively asserts that the stabilities of the preconditioners do not cluster around the minimal stability. This is satisfied, for example, if at most some fixed number of the candidate preconditioners have stability within a multiplicative factor of the optimal stability. The resulting constant gives an asymptotic runtime bound for Algorithm LABEL:alg:improvement which decouples the linear dependence on the number of preconditioners from the polynomial dependence on the accuracy. Such an improvement is serious when the number of candidates is moderately large; while this example is contrived, many other distributions on input data satisfy the assumptions of Theorem 5 with the same constant.
Of course, one would hope that Algorithm LABEL:alg:improvement does not perform poorly when the input data assumptions made in Theorem 5 are not satisfied. For example, this would happen when all preconditioners have extremely similar performance, to the point that even our target accuracy cannot distinguish their stabilities. Luckily, a constant always works in Theorem 5, so Algorithm LABEL:alg:improvement never suffers more than a constant multiplicative increase over Algorithm LABEL:alg:pick in the number of floating point operations needed to select a preconditioner.
3 Experiments
The present section will jointly test the hypothesis that preconditioner stability is a good forecast for preconditioner quality along with the performance of our algorithms. This is done by evaluating how well Algorithm LABEL:alg:pick can select a candidate preconditioner which minimizes the number of conjugate gradients iterations required to achieve some fixed approximation quality in various situations. In Section 3.1, we detail one experiment of this type for generic realworld systems from the SuiteSparse Matrix Collection [davis2011university]. Section 3.2 details our other numerical experiment, where we use Algorithm LABEL:alg:pick to contribute improvements to the growing literature on preconditioned solvers for kernel regression problems.
3.1 Experiments with Sparse Systems
First we attempt a generic experiment on a collection of real-world sparse linear systems and simple preconditioners, testing how the preconditioner chosen a priori by Algorithm LABEL:alg:pick compares to the minimal-iterations preconditioner. For the target system, we fix a right-hand side sampled once for the entire experiment. The positive definite matrices are taken from the SuiteSparse/University of Florida Sparse Matrix Collection [davis2011university]. We include all matrices from the Boeing and GHS_psdef groups which have between 100,000 and 2,250,000 nonzero entries and are strictly positive definite.
We provide nine candidate preconditioners for Algorithm LABEL:alg:pick to select between, using block diagonal preconditioners in order to avoid existence issues of other common preconditioning methods [benzi2002preconditioning]:

The first candidate preconditioner is the trivial identity preconditioner, which is equivalent to using no preconditioner at all.

The preconditioner denotes a block-diagonal pinching/truncation of the matrix with a given block size.

The preconditioner is the same block-diagonal pinching, but performed after a Reverse Cuthill-McKee ordering of the matrix [cuthill1969reducing].
To ensure uniqueness and clarity, blocking is performed by taking the matrix and constructing a block diagonal matrix from its contiguous diagonal blocks of the given block size, with the final block possibly being smaller. Since the original matrix is positive definite, the resulting preconditioners are also positive definite [bhatia2009positive, Ex. 2.2.1.(viii)].
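The block-diagonal pinching just described can be sketched as below. This is our illustrative construction (names are ours), not the experiment's exact code; it assumes SciPy's `reverse_cuthill_mckee` for the optional reordering and Cholesky-factors each diagonal block, which is valid because principal blocks of an SPD matrix are SPD.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def block_pinch_solver(A, block, rcm=False):
    """Block-diagonal 'pinching' preconditioner: contiguous diagonal
    blocks of A of the given block size (trailing block may be smaller),
    optionally taken after a Reverse Cuthill-McKee reordering.
    Returns a function applying M^{-1} to a vector.
    """
    S = sp.csr_matrix(A)
    n = S.shape[0]
    perm = np.asarray(reverse_cuthill_mckee(S)) if rcm else np.arange(n)
    Ap = S.toarray()[np.ix_(perm, perm)]  # dense for illustration only
    # Cholesky-factor each diagonal block of the (permuted) matrix.
    chol = [np.linalg.cholesky(Ap[s:min(s + block, n), s:min(s + block, n)])
            for s in range(0, n, block)]
    inv_perm = np.argsort(perm)

    def apply_Minv(x):
        y = x[perm]
        out = np.empty_like(y)
        for s, L in zip(range(0, n, block), chol):
            e = s + L.shape[0]
            out[s:e] = np.linalg.solve(L.T, np.linalg.solve(L, y[s:e]))
        return out[inv_perm]
    return apply_Minv
```

The returned closure is exactly the matvec-access form of the preconditioner that Algorithms LABEL:alg:stab and LABEL:alg:pick interrogate.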
In Table 1 we present the number of iterations the preconditioned conjugate gradients algorithm took for each test matrix and preconditioner pair. The algorithm was run until the approximate solution satisfied the convergence tolerance. The number of iterations was limited to 50,000. Entries in Table 1 achieving this iteration limit are overwritten with ‘—’. The conjugate gradients algorithm applied to the matrices bcsstk36, bcsstk38, msc23052, and vanbody did not converge with any candidate preconditioner, so they are omitted from Table 1.
Matrix  Conjugate Gradients Iterations With Various Preconditioners  

apache1  3,538  3,513  3,286  3,283  3,270  3,265  3,269  3,710  3,693 
crystm01  122  54  39  34  30  27  27  24  21 
crystm02  138  54  38  35  34  29  30  24  24 
crystm03  143  54  38  34  33  30  29  25  24 
cvxbqp1  16,424  11,337  11,338  11,332  11,331  11,330  11,328  10,148  10,353 
gridgena  3,658  3,542  2,659  2,572  2,504  2,504  2,479  2,892  2,863 
jnlbrng1  139  131  126  126  125  125  125  130  130 
minsurfo  94  88  64  63  63  62  62  88  88 
msc10848  —  5,659  3,791  3,028  2,793  2,656  2,628  2,192  2,092 
obstclae  66  65  49  48  47  47  47  65  65 
oilpan  48,291  28,065  12,804  8,167  5,476  4,992  4,127  4,757  4,433 
torsion1  66  65  49  48  47  47  47  65  65 
wathen100  327  45  44  44  44  44  44  42  42 
wathen120  378  45  45  45  44  44  44  43  43 
Table 1: Conjugate gradients iteration counts for each test matrix, a constant right-hand side sampled standard normally, and the various candidate preconditioners.

Even though larger block sizes ought to create better approximations of the original matrix, there are situations when smaller block sizes result in fewer conjugate gradients iterations. Similarly, there are some situations when the original ordering of the data is preferable to the Reverse Cuthill-McKee ordering, and vice-versa. As a result, it is unclear a priori which preconditioner one should choose to solve the linear system, and this is why someone might wish to use Algorithm LABEL:alg:pick to automate that choice.
We test this use of Algorithm LABEL:alg:pick under two parameter settings. Algorithm LABEL:alg:pick is run for 1,000 independent trials for each matrix-preconditioner pairing. After the fact, we compare the number of preconditioned conjugate gradients iterations required when using the recommendation of Algorithm LABEL:alg:pick to the minimal number of iterations possible had we known in advance how many iterations each preconditioner would need.
3.1.1 Results
The results of our generic real-world-use experiment are presented in Table 2. Every cell is an approximation ratio, i.e. the number of iterations an algorithm for selecting preconditioners took divided by the minimal number of iterations possible using our set of candidate preconditioners. As such, an entry of 1.00 is optimal and represents the minimal-number-of-iterations preconditioner being correctly selected. The column ‘Worst-Case’ reports the approximation ratio if one deterministically selected the maximal-number-of-iterations preconditioner in each setting. The column ‘Random’ reports the expected approximation ratio if one were to select a candidate preconditioner from Table 1 uniformly at random. The columns corresponding to Algorithm LABEL:alg:pick give statistics of the empirical distribution of approximation ratios seen over the 1,000 independent trial runs of the method.
Matrix  Worst-Case  Random  Algorithm LABEL:alg:pick Approximation Ratio
                            First setting      Second setting
                            Min  Mean  Max     Min  Mean  Max
apache1  1.14  1.05  1.00  1.00  1.00  1.00  1.00  1.00 
crystm01  5.81  2.00  1.00  1.00  1.00  1.00  1.00  1.00 
crystm02  5.75  1.88  1.00  1.00  1.00  1.00  1.00  1.00 
crystm03  5.96  1.90  1.00  1.00  1.00  1.00  1.00  1.00 
cvxbqp1  1.62  1.15  1.12  1.12  1.12  1.12  1.12  1.12 
gridgena  1.48  1.15  1.00  1.00  1.01  1.00  1.00  1.00 
jnlbrng1  1.11  1.03  1.04  1.04  1.04  1.04  1.04  1.04 
minsurfo  1.52  1.20  1.00  1.00  1.00  1.00  1.00  1.00 
msc10848  23.90  3.97  1.00  1.00  1.00  1.00  1.00  1.00 
obstclae  1.40  1.18  1.00  1.00  1.00  1.00  1.00  1.00 
oilpan  11.70  3.26  1.00  1.09  1.15  1.07  1.08  1.15 
torsion1  1.40  1.18  1.00  1.00  1.00  1.00  1.00  1.00 
wathen100  7.79  1.79  1.00  1.00  1.00  1.00  1.00  1.00 
wathen120  8.79  1.89  1.00  1.00  1.00  1.00  1.00  1.00 
For 10 of the 14 test matrices reported, the first parameter setting always picks the optimal preconditioner for the problem across every one of the 1,000 trials. Under the stricter second setting, this happens for 11 of the 14 test matrices. Moreover, even under the looser accuracy setting, the returned preconditioner never needs more than 15% more iterations than the optimal choice.
One might wonder if taking the accuracy parameter to be even larger would result in approximation ratios concentrating more uniformly at the ideal 1.00 mark. This will not happen in general, and this is where the good-proxy hypothesis is put to the test. For the oilpan matrix, tightening the parameter from the first setting to the second raises the best-seen approximation ratio given by Algorithm LABEL:alg:pick from 1.00 to 1.07. Tightening this parameter causes the preconditioner returned by Algorithm LABEL:alg:pick to concentrate further around the true minimal-stability preconditioner (see Theorem 3), and so this implies that the preconditioner stability criterion itself is not perfect and will not in general forecast the exact preconditioner resulting in the minimal number of conjugate gradient iterations.
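To make the reported quantities concrete, the following Python sketch shows how the 'Worst-Case', 'Random', and per-trial approximation ratios are computed from per-preconditioner iteration counts. The iteration counts here are illustrative placeholders, not values from Table 1.

```python
import numpy as np

# Hypothetical iteration counts for four candidate preconditioners on one
# matrix; these numbers are illustrative, not taken from the experiments.
iters = np.array([45, 52, 261, 90])
best = iters.min()

worst_case = iters.max() / best    # deterministically picking the worst candidate
random_pick = iters.mean() / best  # expected ratio of a uniformly random pick

def approx_ratio(selected):
    """Iterations of the selected preconditioner over the best achievable."""
    return iters[selected] / best

# A perfect selection attains the optimal ratio of 1.00.
print(worst_case, random_pick, approx_ratio(0))
```

Running a selection algorithm 1,000 times and collecting `approx_ratio` of each returned index yields exactly the Min/Mean/Max statistics tabulated above.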
3.2 Experiments with Kernel Regression Preconditioners
This section will show that Algorithm LABEL:alg:pick can turn two simple preconditioners for the standard kernel regression problem into a robust, state-of-the-art preconditioning method. As a corollary of this investigation, we exhibit how Algorithm LABEL:alg:pick performs well in situations when the 'minimal accuracy' criterion for selecting preconditioners fails, something left unanswered in the previous experiment.
3.2.1 A Quick Review
Kernel regression is a common statistical technique for nonlinear regression. In this setting, we have a dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of mappings from Euclidean space $\mathbb{R}^d$ to the real line $\mathbb{R}$. We wish to find coefficients $\alpha \in \mathbb{R}^n$ so that the functional mapping
$f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$  (27)
faithfully represents the empirical mapping in the sense that $f(x_i) \approx y_i$. In general, $k$ is just required to be a positive definite kernel, but in our experiment, we will only use the squared exponential kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / (2\ell^2))$, parametrized by the lengthscale $\ell$, which controls the derivative of the model $f$. The coefficients $\alpha$ are found by solving the system
$(K + \sigma^2 I)\alpha = y$  (28)
where the positive definite Gram matrix has entries $K_{ij} = k(x_i, x_j)$, the output vector is $y = (y_1, \ldots, y_n)$, and the noise standard deviation $\sigma$ is used for regularization so that the model fits well on out-of-sample data. In almost all kernel regression problems, the kernel takes non-zero values on nearly every pair of points, and hence $K$ and $K + \sigma^2 I$ are dense. See [williams2006gaussian] for more background on this model and the associated inference procedure.
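As a concrete illustration of (27) and (28), the following Python sketch forms the squared exponential Gram matrix on synthetic data and solves the regularized system with conjugate gradients. The data, dimensions, and parameter values are illustrative choices, not those used in the experiments.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))   # inputs x_1, ..., x_n in R^d
y = rng.normal(size=n)        # observed outputs
ell, sigma = 1.5, 0.5         # lengthscale and noise standard deviation

# Squared exponential Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 ell^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * ell**2))

# Solve (K + sigma^2 I) alpha = y, i.e. equation (28), by conjugate gradients.
A = K + sigma**2 * np.eye(n)
alpha, info = cg(A, y)

# The fitted model (27) evaluated at a new point x*.
x_star = rng.normal(size=d)
k_star = np.exp(-((X - x_star) ** 2).sum(axis=-1) / (2 * ell**2))
f_star = k_star @ alpha
```

Note that `A` is dense here, as is typical for kernel regression; each conjugate gradient iteration therefore costs a dense matrix-vector product.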
3.2.2 Related Work
This experiment will test a preconditioning procedure for solving the linear system (28) via conjugate gradients. There has been a recent interest in this general iterative framework for kernel regression, which largely focuses on creating quality preconditioners for the Gram matrices [avron2017faster, cutajar2016preconditioning, halko2011finding, rudi2017falkon, rudi2018fast]. This is necessary since for reasonable parameter settings even the regularized matrix $K + \sigma^2 I$ is often ill-conditioned.
Cutajar et al. [cutajar2016preconditioning] performed some initial legwork in this area, proposing eight candidate preconditioners. These preconditioners include a block-diagonal approximation of $K$, adding a larger regularizer and solving recursively, a Nyström approximation of the Gram matrix using a subset of the data points as inducing points chosen uniformly at random, a coupling of the Nyström approximation with a block-diagonal approximation, and replacing $K$ with an optimal low-rank factorization which can be computed via a randomized SVD [halko2011finding] or the Lanczos method [golub2012matrix, Sec. 10.1]. Both [cutajar2016preconditioning] and the work [avron2017faster] of Avron et al. use the Fourier features method of Rahimi and Recht [rahimi2008random] to create a preconditioner which replaces $K$ with a sketched version. The latter paper [avron2017faster] also proposes using the TensorSketch method of [pagh2013compressed] for creating a sketched preconditioner when using the polynomial kernel, though the necessary sketching dimension in their upper bounds is exponential in the polynomial degree. The work of Rudi et al. [rudi2017falkon] also uses the Nyström-based preconditioner like [cutajar2016preconditioning], combining it with other computational tricks. This Nyström-based approach was refined with approximate leverage score sampling in more recent work by Rudi et al. [rudi2018fast].
An empirical issue within many of the above methods is illustrated perfectly in Figure 1 of [cutajar2016preconditioning]. For every preconditioner presented therein, there exist parameter settings for which using no preconditioner results in fewer iterations than using the preconditioner when solving for $\alpha$ via conjugate gradients. As such, it is unclear how one would choose a preconditioner which makes these solvers work well in practice.
3.2.3 Two Simple Geometrically Driven Preconditioners
Here we detail the two candidate preconditioners which we will use in our experiments. They both utilize a geometrically-motivated reordering of the data to empirically achieve superior performance to the preconditioners of [cutajar2016preconditioning] for many settings of the kernel lengthscale and noise parameters.
The first preconditioner is a simple block-diagonal pinching of a reordering of the data. The kernel regression model under the squared exponential kernel effectively asserts that points nearby in $\mathbb{R}^d$ ought to have similar outputs $f(x)$. If the input data is highly clustered in $\mathbb{R}^d$, our model then ought to largely ignore points from different clusters when considering a point in some cluster. The first preconditioning algorithm turns this 'ought to' statement directly into an approximation of the Gram matrix $K$. We first cluster the data in $\mathbb{R}^d$ via the $k$-means or $k$-means++ [arthur2007k] algorithm, constructing a permutation matrix $P$ that places points in the same cluster next to each other in the Gram matrix. At this point, we precondition the reordered system by creating a block-diagonal pinching of the reordered matrix $PKP^T$ where each block corresponds to the points within a cluster. The resulting preconditioner is that pinching plus the true noise term $\sigma^2 I$.
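A minimal sketch of this first preconditioner follows, assuming synthetic clustered inputs, SciPy's `kmeans2` for the clustering step, and explicit inverses of the diagonal blocks. This is a hypothetical small-scale illustration, not the implementation used in the experiments.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
# Synthetic inputs with three well-separated clusters in R^2.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])
n = X.shape[0]
ell, sigma = 1.0, 0.1

# Squared exponential Gram matrix and the regularized system matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / (2 * ell**2))
A = K + sigma**2 * np.eye(n)
y = rng.normal(size=n)

# Cluster, then permute so same-cluster points are adjacent in the matrix.
_, labels = kmeans2(X, 3, minit='++', seed=1)
perm = np.argsort(labels, kind='stable')
A_p = A[np.ix_(perm, perm)]
y_p = y[perm]

# Block-diagonal pinching: invert each within-cluster block (plus noise term).
inv_blocks, start = [], 0
for c in np.unique(labels):
    size = int((labels == c).sum())
    inv_blocks.append(np.linalg.inv(A_p[start:start + size, start:start + size]))
    start += size

def apply_minv(r):
    """Apply the block-diagonal preconditioner's inverse to a vector."""
    out, s = np.empty_like(r), 0
    for B in inv_blocks:
        m = B.shape[0]
        out[s:s + m] = B @ r[s:s + m]
        s += m
    return out

M = LinearOperator((n, n), matvec=apply_minv)

# Compare iteration counts with and without the preconditioner.
counts = {"plain": 0, "precond": 0}
x_plain, info_plain = cg(A, y, callback=lambda xk: counts.__setitem__("plain", counts["plain"] + 1))
x_prec, info_prec = cg(A_p, y_p, M=M, callback=lambda xk: counts.__setitem__("precond", counts["precond"] + 1))
```

Because the clusters are well separated, the off-block kernel entries are tiny, the pinching is nearly the exact inverse, and the preconditioned solve should converge in far fewer iterations than the plain one.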
The second preconditioner is a slightly more complex version of the first. After computing the permuted matrix, we compute a truncated low-rank approximation of it of the form $Q \Lambda Q^T$, where $\Lambda$ is diagonal and $Q$ has orthonormal columns. At this point we compute the same block-diagonal pinching of the error in this approximation. The resulting preconditioner is then