A preconditioner is helpful for solving a linear system via iterative methods if it reduces the number of iterations enough to offset the cost of constructing the preconditioner plus the additional cost of taking the iterations (typically an extra computation of the form per iteration.) Ignoring the latter component of this trade-off, this corresponds to having a small condition number for the conjugate gradient method [trefethen1997numerical, deift2019conjugate]; a precise objective is unclear in the indefinite or unsymmetric case with general Krylov methods besides the general desire that the spectrum of be clustered [benzi1999orderings]. For instance, even if we have a small number of candidate preconditioners for our problem ready to use and assume that they add the same amount of time to compute each iteration, it is unclear how one would go about estimating which preconditioner would reduce the iteration count the most without actually solving a system with each preconditioner or doing a comparable amount of work. This preconditioner selection task is the focus of the present work.
The core contribution of this work is the realization that randomized sketching methods make completely practical the computation of a helpful forecaster of preconditioner quality previously thought to be infeasible to compute.
In addition to this primary method for computing preconditioner stability, we provide a number of other results:
We prove that no practical deterministic algorithm, in a meaningful sense, could possibly be used to estimate the preconditioner stability .
We confirm the conjecture of [avron2011randomized] regarding the true asymptotic sample complexity of the Gaussian trace estimator using a substantially more direct proof than the general result provided in [wimmer2014optimal], at the same time confirming the tightness of our stability estimation algorithm convergence bound.
We provide a randomized algorithm which can provably select a preconditioner of approximately minimal stability among candidate preconditioners using computational resources equivalent to computing on the order of steps of the conjugate gradients algorithm.
We give a theoretical speedup to the initial preconditioner selection method which largely decouples the runtime dependence between the number of preconditioners and the desired accuracy in the situation that the input preconditioner stabilities satisfy an anti-concentration assumption.
Using our initial preconditioner selection algorithm, we create the first (to the best of our knowledge) method for preconditioning in kernel regression which is reported to never give a worse number of iterations than using no preconditioner in standard tests.
It is important to point out that while our experiments consider positive definite systems and preconditioners, our algorithms and runtime bounds hold as-is for arbitrary matrices and preconditioners .
1.2 Prior Art
Current approaches to forecasting preconditioner quality are not robust in the sense that they can fail to select a highly performant preconditioner in favor of a poorly performing one in many realistic scenarios. This lack of robustness, even in the case of positive definite , results in part from the fact that the research community does not have robust methods for estimating condition numbers of large sparse matrices, which makes proxies for preconditioner quality necessary. For instance, Avron et al. [avron2019spectral] recently produced a condition number estimator in this setting which appears to perform admirably in many situations but does not always converge and at this point does not have rigorous theoretical backing.
The simplest forecasting criterion is that a preconditioner ought to be an ‘accurate’ approximation, in the sense that the Frobenius norm distance is small. This preconditioner accuracy criterion can be efficiently computed so long as one has access to the preconditioning matrix
in memory as well as the standard matrix-vector product access to. Moreover, for symmetric -matrices this accuracy criterion is a useful proxy for the number of conjugate gradient iterations necessary to solve the preconditioned system in . This point was theoretically confirmed by Axelsson and Eijkhout [axelsson1990vectorizable], who showed that the condition number can be bounded in terms of . The accuracy criterion for forecasting preconditioner quality was heavily tested on an empirical level in [duff1989effect].
Even in this setting, though, there exist accurate real-world preconditioners that give a poor iteration count because is very large [benzi1999orderings]. Unfortunately, to detect this issue one presumably needs to compute the entire matrix in memory solely from matrix-vector product access to . The natural method for this, computing each of the columns from products with the standard basis vectors , takes the same leading-order floating point cost as solving the system via conjugate-gradients, at least when matrix-vector multiply access to is comparable to the matrix-vector multiply cost of multiplying by . Perhaps as a result of this intuition, computing , or rather its close cousin the -norm , was deemed impractical [benzi1999orderings, benzi2002preconditioning]. In practice, Chow and Saad [chow1997experimental] suggested using the lower bound for where is the vector of all ones. This has proved helpful for detecting preconditioner instability in some situations, and especially indefinite systems [benzi1999orderings].
The quantity known as ‘preconditioner stability,’ , is empirically the most reliable indicator of preconditioner performance from this class of forecasts. This is especially true among many non-symmetric problems or problems which are far from diagonally dominant [benzi1999orderings]. Unfortunately, prior work has suggested that computing preconditioner stability is ‘impractical’ [benzi2002preconditioning] for effectively the same reason as why was considered impractical to compute.
In spite of this, Tyrtyshnikov’s influential paper [tyrtyshnikov1992optimal] gave a fast algorithm to solve the harder problem considered in this paper of determining the minimal-stability preconditioner over the special class of circulant matrices. This algorithm relied heavily on properties of these matrices.
Our results in this paper will suggest that even for generic and
, one can get an arbitrarily accurate estimate of most of these quantities previously thought to be unusable for preconditioner selection. That said, it is crucial to note that the prior impracticality assessment is completely valid in a certain regard: no deterministic algorithms, as we also show, could possibly even approximate the preconditioner stability, for example. Thus an algorithm must fail at some of these tasks with some nonzero probability.
Section 2 motivates the need for a randomized algorithm for stability estimation with theory, responds with a sketching-based solution, and uses it to create and analyze a method for preconditioner selection. This theory is empirically confirmed in Section 3 where we apply the primary preconditioner selection algorithm from Section 2 to solving generic real-world systems (Section 3.1) and creating more robust preconditioned solvers for kernel regression (Section 3.2.) After this we conclude in Section 4 and discuss open problems that could provide an avenue for future work.
Boldface letters like denote vectors while their upper-case analogues such as denote matrices. Such matrices and vectors can be either random or deterministic. The underlying scalar field is either the real numbers or the complex numbers unless otherwise specified. The adjoint of a matrix is denoted by . The inner product of two vectors and is denoted as
. The expectation of a random variable will be denoted with, and the probability of an event is written . The norms and represent the Frobenius and Euclidean norms, respectively. The condition number is the condition number. The notation represents . and are normal and circularly-symmetric complex normal random variables, respectively.
This section contains all of our core theoretical results and algorithms. In Section 2.1 we show that the only algorithms which can possibly estimate preconditioner stability must be randomized. The follow-up question of whether randomization can indeed work is answered in the affirmative in Section 2.2, where we show that a slight adaptation of a well-known sketching-based algorithm for computing Schatten norms perfectly fits our realistic access model to our preconditioner and matrix . Once we have a good estimator of preconditioner stability, a natural method for selecting the candidate preconditioner with minimal stability criterion presents itself in Section 2.3. It turns our that our algorithm can be trivially parallelized, and a testament to this fact is given in Section 2.3.1. In Section 2.4 we take advantage of highly informative results from the literature on trace estimation to provide useful approximation guarantees and runtime bounds for the previously presented algorithms. Section 2.4.1 returns the focus to Theorem 2 of Section 2.4, showing that the leading constant is tight and providing a more concise resolution of a conjecture from [avron2011randomized] than the more generalized result from [wimmer2014optimal]. In that section, we also comment that no randomized algorithm for estimating preconditioner stability which has access to matrix-vector products of the form could possibly do better asymptotically than Algorithm LABEL:alg:stab by relying on this result [wimmer2014optimal, Thm. 3] from the trace estimation literature. Finally, this conversation is wrapped up in Section 2.4.2, where our bounds from Section 2.4 are utilized to provide a theoretical improvement to Algorithm LABEL:alg:pick.
2.1 Randomization is Necessary to Compute Preconditioner Stability
This paper provides a simple randomized algorithm which can accurately estimate the preconditioner stability in time faster than running a constant number of iterations of preconditioned conjugate gradients with the matrix and preconditioner . Through incorporating randomness, however, we must accept that the algorithm fails with some probability. This failure probability can be made arbitrarily small, but it would still be advantageous (for example, in mission-critical applications) to provide a deterministic algorithm for the same task with a comparable approximation guarantee. The purpose of this section is to crush that latter hope, and the following theorem does just that.
Fix some . Suppose we have a deterministic algorithm which takes as input an arbitrary positive semi-definite matrix and positive definite matrix , and returns an estimate satisfying
after sequentially querying and observing matrix vector multiplies of the form
for where is a constant depending only on and . Then .
Fix for the remainder of the proof. Suppose to the contrary that suffices to compute . Let be the query vectors used by the algorithm in the case that always returns . Write for the orthogonal projection onto . Both of the positive semi-definite matrices and will return uniformly over , and thus since the algorithm is deterministic the estimated stabilities are equal. But since was an orthogonal projection onto a subspace of dimension strictly less than , and hence
by our approximation guarantee. This contradiction ensures that we must take . Since , matrix-vector multiply access to is equivalent via a bijection to matrix-vector multiply access to and , hence our strengthened statement of the result.
Of course, using queries suffices to achieve no error at all, and so the above lower bound is tight:
where is any orthonormal basis for . Also, note that the condition that and be positive semi-definite gives a stronger result than if they were allowed to be arbitrary matrices.
In order to put Theorem 1 into better context, recall that the dominant cost of an iteration of preconditioned conjugate gradients [golub2012matrix, Alg. 11.5.1] is (a) computing for a vector , and (b) computing for a vector . Thus Theorem 1 says roughly that in the time it takes to even approximate deterministically, one can solve a system exactly (in exact arithmetic) by running the conjugate gradients algorithm for iterations. Since our whole goal of computing is to forecast how well would do as a preconditioner for solving the system , this means that any deterministic algorithm for this task is, in general, impractical for this task.
2.2 Computing Preconditioner Stability via Randomization
Now we will demonstrate that, unlike the deterministic case, randomization makes it entirely practical to compute preconditioner stability. To see why this is intuitive, let be a standard (real or complex) Gaussian vector. Then
by the linearity of expectation, the cyclic property of the trace, and the fact that . Thus, if are independent standard normal vectors for all , the Monte-Carlo squared stability estimate
almost surely as
by the strong law of large numbers. We can rewrite the above estimator aswhere is a matrix with independent and identically distributed elements . This stability estimation algorithm for is given as Algorithm LABEL:alg:stab. In Algorithm LABEL:alg:stab, we adjust whether is real or complex depending on for analysis reasons that will be apparent later in Section 2.4.
It is important to note that the mathematical foundations of the above algorithm are not novel. It is equivalent in exact arithmetic to applying the trace estimators in [roosta2015improved] to the matrix and then taking the square root. It is also a simplification of the Schatten- norm estimator in [woodruff2014sketching, Thm. 69] (relayed from [li2014sketching]) applied to . The reason we include Algorithm LABEL:alg:stab is not because of its mathematical novelty but because of its observational novelty: sketching algorithms using the matrix-vector multiply access model are a perfect fit for interrogating the matrices and in the context of iterative methods for solving linear systems, since this kind of access to and are precisely what make that kind of access practical.
Of course, the presentation thus far does not help us choose how large the accuracy parameter should be. To that end, Theorem 2 in Section 2.4 presents a tight upper bound on the necessary to achieve a given error bound with high probability, although in practice just setting appears sufficient in many situations of interest – see Section 3.
2.3 Randomized Algorithm for Selecting the ‘Best’ Preconditioner
Once we have a practical way to compute preconditioner stability, a trivial algorithm for picking the preconditioner among candidates becomes natural. Namely, we can compute estimates for and then just return the preconditioner for which is minimized. This is presented as Algorithm LABEL:alg:pick. As we mentioned in the previous section, theoretical advice on how to pick will be given in Section 2.4. An improvement to this algorithm in the case there is a clear winner, relying on those analytical bounds, is included in Section 2.4.2.
We note that the sketching matrix can be re-used when computing the stability estimates in Algorithm LABEL:alg:stab. This is done in all our computational experiments in Section 3, and reduces the number of normal variates one needs to simulate from to . Reuse does not affect our theoretical upper bound presented as Theorem 3.
A convenient aspect of sketching based algorithms like Algorithm LABEL:alg:pick is that they can be parallelized extremely easily. For instance, suppose we are trying to pick the minimal stability preconditioner among candidates , and have processors , , which have access to and , respectively. Then we can compute each stability estimate in parallel at processor . Ignoring communication costs (which are a genuine concern in practice,) this would bring the runtime of the algorithm down to computing steps of preconditioned conjugate gradients with and the most computation-intensive (in terms of matrix-vector multiply access to ) preconditioner .
Taken to the extreme, one could similarly parallelize Algorithm LABEL:alg:pick over processors , and , assuming each had access to and the candidate preconditioner . Each processor would need to compute and return where is an independently sampled standard normal vector. Then in parallel for all processor could compute , at which point we could use processor to compute such that is minimal and return the corresponding . Ignoring communication costs again, this algorithm take fewer floating point operations than running one iteration of preconditioned conjugate gradients with and the most computation-intensive preconditioner , plus flops that were used to turn the into the estimate .
2.4 Approximation Guarantees and Runtime Bounds
This section details runtime bounds and approximation guarantees for Algorithms LABEL:alg:stab and LABEL:alg:pick, as well as using those bounds to derive a theoretical improvement to Algorithm LABEL:alg:pick.
To start, we will give the following theorem, which says that to estimate preconditioner stability up to a multiplicative factor with failure probability at most , one may take in Algorithm LABEL:alg:stab. This is (in contrast to the deterministic case) completely independent of the underlying dimension. Theorem 2 result sharpens an analogous bound for trace estimators given in [roosta2015improved, Thm. 3], largely following the proof structure of Theorem 1 from that work. We bring the leading constant from a to as and allow for complex matrices. This Theorem is included as a result in its own right primarily since we can even show that the leading constant is tight; see Section 2.4.1.
Let and be arbitrary matrices in where is invertible. If and are positive and less than one, taking ensures that in exact arithmetic the estimate of Algorithm LABEL:alg:stab satisfies
with probability at least . In particular, if , then the simpler condition ensures this same approximation guarantee.
Unitarily diagonalize as for some unitary and non-negative diagonal matrix . By the rotation invariance of the Gaussian and the fact that is orthogonal when , is equal in distribution to in both the real and complex cases. This implies is equal in distribution to and we can focus on the latter, simpler quantity. By Markov’s inequality, for any ,
Jensen’s inequality reduces
so long as . Note that the last inequality covers both real and complex , and results from the moment generating function being bounded above uniformly by the real case for this range of . Taking in Equation 9 gives
via the scalar inequality for all . The same argument for the lower tail, instead taking , gives
as well. A union bound provides the desired result.
Using Theorem 2 we are able to prove an approximation guarantee for Algorithm LABEL:alg:pick via a union bound. In particular, to achieve an -multiplicative approximation to the best of candidate preconditioners with probability at least we can take , again independent of the underlying dimension. This dependence on is quite weak, especially since in realistic applications we would only expect to have at most, say, fifty candidate preconditioners.
Let be an arbitrary matrix, and be invertible candidate preconditioners for . If and are positive and less than one, taking ensures that the preconditioner returned by Algorithm LABEL:alg:pick satisfies
with probability at least . In particular, if the simpler condition ensures
with probability at least .
Start by fixing any . If we take , Theorem 2 ensures that
except with probability at most . In particular, if we unfix the probability at least one of the does not satisfy Equation 13 is at most by a union bound. (Note that we did not need independence of the estimates here; this is why reusing the sketching matrix is valid.) Thus with probability at least all estimates satisfy Equation 13 simultaneously.
Write for the candidate preconditioner returned by Algorithm LABEL:alg:pick, and write for a candidate preconditioner which satisfies
Then since the estimate of the stability of was at most that of by minimality, the simultaneous bounds of Equation 13 give
except with probability at most . Rearranging the inequality gives the desired result after substituting Equation 14.
The final simplified bound results from the scalar inequality when and simple algebraic manipulation.
2.4.1 The Constant in Theorem 2 is Tight
Most of the theory presented in this paper relies on Theorem 2 to create more sophisticated bounds. Since Algorithm LABEL:alg:stab is at its core a repurposing of a trace estimator using only matrix vector products, the work [wimmer2014optimal] applies and ensures that no randomized, adaptive algorithm for estimating the stability could possibly use asymptotically fewer matrix-vector multiplies so long as the algorithm only has access to for query vectors . In this sense, Algorithm LABEL:alg:stab is optimal.
The theoretically-inclined practitioner, however, also cares about knowing the optimality of our analysis in Theorem 2. The following Theorem says that our analysis in Theorem 2 is asymptotically tight even up to the leading effective constant which tends to for small . In the proof, as is the Lambert- function [hoorfar2007approximation].
Fix some . For any underlying dimension , there exists a positive semi-definite matrix , positive definite matrix , and some so that for any , taking guarantees the stability estimate returned by Algorithm LABEL:alg:stab fails to satisfy the equation
with probability at least .
If is a standard normal random variable then
by [gordon1941values] for all . Setting the right hand side of Inequality 15 to and solving gives
whenever , which is satisfied when .
Now let and , where is the first standard basis vector. We can observe that
is a standard Gaussian vector. In particular, the standard deviation ofis . Thus since is a sample average of independent copies of these random variables, fixing ensures
by the Central Limit Theorem[vershynin2018high, Thm. 1.3.2] and Equation 16 as . This implies the existence of an so that ensures the relation
fails with probability at least under our relation defining . The simpler condition on given in the statement of this result follows from the bound for all from [hoorfar2007approximation, Thm. 2].
We point out that the above proof gives another confirmation of the conjecture in [avron2011randomized] regarding the true asymptotics of the Gaussian trace estimator. The work in [wimmer2014optimal] confirmed this conjecture to be true, but did so as a corollary of a general result regarding lower bounds for trace estimation algorithms. Our result above is much more direct.
2.4.2 An Improvement in the Presence of a Clear Winner
Algorithm LABEL:alg:pick is simple to implement and works well for selecting preconditioners of minimal stability, as we shall see in Section 3. Nevertheless, if we are selecting between preconditioners where some are clearly worse than the optimal preconditioner in terms of stability, our method seems excessive. Intuitively, we should be able to tell that terrible preconditioners will not be optimal with very rudimentary information. Algorithm LABEL:alg:improvement presents such a revision to Algorithm LABEL:alg:pick, iteratively refining the stability estimates we have and filtering out any preconditioners as soon as we can be confident they will not be optimal. Note that the algorithm crucially relies on the bounds from Section 2.4.
We can prove that Algorithm LABEL:alg:improvement is actually an improvement over Algorithm LABEL:alg:pick by making an anti-concentration assumption about the input stabilities.
Let be an arbitrary matrix, be invertible candidate preconditioners for , , and . Denoting , we will write
for the (shifted) cumulative distribution function of the input relative stabilities. If
for the (shifted) cumulative distribution function of the input relative stabilities. Ifuniformly over for some positive constant , then Algorithm LABEL:alg:improvement returns a preconditioner satisfying
with probability at least using strictly fewer floating point operations than running iterations of the preconditioned conjugate gradients algorithm in with the most expensive preconditioner in terms of the number of floating point operations required to compute for input vectors .
The same Bonferroni-correction argument from the proof of Theorem 3 ensures that
simultaneously for all over the course of the algorithm, except with probability at most . The rest of the proof will only rely on property 23, so everything we say will hold with this same probability.
If is the set in step of the algorithm and is in before the filtering at the end of step , Equation 23 implies
Thus, since initially, we know by induction that throughout the process of the entire algorithm. Now consider in the final step of Algorithm LABEL:alg:improvement. Since ,
Rearranging and realizing that at gives our desired approximation guarantee.
Now we will exhibit the runtime bound by bounding at each step of Algorithm LABEL:alg:improvement. We claim that for all . To see this, note that if a candidate preconditioner is retained after filtering in any step of the algorithm,
Thus the number of elements in just after step in the algorithm is at most since for . Our runtime bound follows from the sum
where is the set during iteration of the algorithm. This gives the number of matrix-vector multiplies of the form used by the algorithm after the first step. To see the final form, add on the multiplies done during the first iteration and plug in .
The anti-concentration condition in Theorem 5 intuitively asserts that the stabilities of the preconditioners do not cluster around the minimal stability. This is satisfied, for example, if at most some number of the candidate preconditioners have stability within a multiplicative factor of the optimal stability. The resulting constant gives an asymptotic runtime bound for Algorithm LABEL:alg:improvement of , decoupling the linear dependence in with the polynomial accuracy dependence on . Such an improvement is serious when is moderately large; while this example is contrived many other distributions on input data satisfy the assumptions of Theorem 5 with the same constant .
Of course, one would hope that Algorithm LABEL:alg:improvement does not perform poorly when the input data assumptions made in Theorem 5 are not satisfied. For example, this would happen when all preconditioners have extremely similar performance, to the point that even our target accuracy cannot distinguish their stabilities. Luckily, a constant always works in Theorem 5, so Algorithm LABEL:alg:improvement never suffers more than a multiplicative increase over Algorithm LABEL:alg:pick in number of floating point operations needed to select a preconditioner.
The present section will jointly test the hypothesis that preconditioner stability is a good forecast for preconditioner quality along with the performance of our algorithms. This is done by evaluating how well Algorithm LABEL:alg:pick can select a candidate preconditioner which minimizes the number of conjugate gradients iterations required to achieve some fixed approximation quality in various situations. In Section 3.1, we detail one experiment of this type for generic real-world systems from the SuiteSparse Matrix Collection [davis2011university]. Section 3.2 details our other numerical experiment, where we use Algorithm LABEL:alg:pick to contribute improvements to the growing literature on preconditioned solvers for kernel regression problems.
3.1 Experiments with Sparse Systems
First we attempt a generic experiment on a collection of real-world sparse linear systems and simple preconditioners, testing how well the preconditioner chosen a-priori by Algorithm LABEL:alg:pick compares to the minimal-iterations preconditioner. For the target system , we fix a sampled for the entire experiment. The positive definite matrices are taken from the SuiteSparse/University of Florida Sparse Matrix Collection [davis2011university]. We include all matrices from the
GHS_psdef groups which have between 100,000 and 2,250,000 non-zero entries and are strictly positive definite.
We provide nine candidate preconditioners for Algorithm LABEL:alg:pick to select between, using block diagonal preconditioners in order to avoid existence issues of other common preconditioning methods [benzi2002preconditioning]:
The first candidate preconditioner is the trivial preconditioner , which is equivalent to using no preconditioner at all.
The preconditioner denotes a block-diagonal pinching/truncation of the matrix with block size .
The preconditioner is the same block-diagonal pinching, but performed after a Reverse Cuthill-McKee ordering of the matrix [cuthill1969reducing].
To ensure uniqueness and clarity, blocking is performed by taking the matrix and constructing a block diagonal matrix with blocks of the form for . Since is positive definite, the resulting preconditioners are also positive definite [bhatia2009positive, Ex. 2.2.1.(viii)].
In Table 1 we present the number of iterations the preconditioned conjugate gradients algorithm took for each test matrix and preconditioner pair. The algorithm was run until the approximate solution satisfied . The number of iterations was limited to 50,000. Entries in Table 1 achieving this iteration limit are overwritten with ‘—’. The conjugate gradients algorithm applied to the matrices
vanbody did not converge with any candidate preconditioner, so they are omitted in Table 1.
|Matrix||Conjugate Gradients Iterations With Various Preconditioners|
, a constant sampled standard normally distributed, and various candidate preconditioners.
Even though larger block sizes ought to create better approximations of the original matrix, there are situations when smaller block sizes result in fewer conjugate gradients iterations. Similarly, there are some situations when the original ordering of the data is preferable over the Reverse Cuthill-McKee ordering, and vice-versa. As a result, it is unclear a-priori which preconditioner one should choose to solve the linear system, and this is why someone might wish to use Algorithm LABEL:alg:pick to automate that choice.
We test this use of Algorithm LABEL:alg:pick under two parameter settings and . Algorithm LABEL:alg:pick is run for 1,000 independent trials for each matrix-preconditioner- pairing. After the fact, we compare the number of iterations of preconditioned conjugate gradients would be necessary when using the recommendation of Algorithm LABEL:alg:pick relative to the minimal number of iterations possible if we knew in advance how many iterations each preconditioner would use.
The results of our generic real-world-use experiment are presented in Table 2. Every cell is an approximation ratio, i.e. the number of iterations an algorithm for selecting preconditioners took divided by the minimal number of iterations possible using our set of candidate preconditioners. As such, an entry of 1.00 is optimal and represents the minimal-number-of-iterations preconditioner being correctly selected. The column ‘Worst-Case’ reports the approximation ratio if one deterministically selected the maximal-number-of-iterations preconditioner in each setting. The column ‘Random’ reports the expected approximation ratio if one were to select a candidate preconditioner from Table 1 uniformly at random. The columns corresponding to Algorithm LABEL:alg:pick gives statistics of the empirical distribution of approximation ratios seen over the 1,000 independent trial runs of the method.
|Matrix||Worst-Case||Random||Algorithm LABEL:alg:pick Approximation Ratio|
For 10 of the 14 test matrices reported, setting always picks the optimal preconditioner for the problem across every one of the 1,000 trials. If we take , this happens for 11 of the 14 test matrices. Moreover, even when the accuracy parameter , the returned preconditioner never needs more than 15% iterations over than the optimal choice.
One might wonder if taking to be even larger would result in approximation ratios concentrating more uniformly at the ideal 1.00 mark. This will not happen in general, and is where the good-proxy hypothesis is put to the test. For the
oilpan matrix, increasing from to raises the best-seen approximation ratio given by Algorithm LABEL:alg:pick from 1.00 to 1.07. Increasing causes the preconditioner returned by Algorithm LABEL:alg:pick to concentrate further around the true minimal-stability preconditioner (see Theorem 3,) and so this implies that the preconditioner stability criterion itself is not perfect and will not in general forecast the exact preconditioner resulting in the minimal number of conjugate gradients iterations.
3.2 Experiments with Kernel Regression Preconditioners
This section will show that Algorithm LABEL:alg:pick can turn two simple preconditioners for the standard kernel regression problem into a robust, state-of-the-art preconditioning method. As a corollary of this investigation, we exhibit how Algorithm LABEL:alg:pick performs well in situations when the ‘minimal accuracy’ criterion for selecting preconditioners fails, something left unanswered in the previous experiment.
3.2.1 A Quick Review
Kernel regression is a common statistical technique for nonlinear regression. In this setting, we have a dataset consisting of mappings from Euclidean space to the real line . We wish to find coefficients so that the functional mapping
faithfully represents the empirical mapping in the sense that . In general, is just required to be a positive definite kernel, but in our experiment, we will only use the squared exponential kernel , parametrized by the length-scale which controls the derivative of the model . The coefficients are found by solving the system
where the positive definite Gram matrix , the output vector , and the noise standarad deviation is used for regularization so that the model fits well on out-of-sample data. In almost all kernel regression problems, and hence are dense. See [williams2006gaussian] for more background on this model and associated inference procedure.
3.2.2 Related Work
This experiment will test a preconditioning procedure for solving the linear system via conjugate gradients. There has been a recent interest in this general iterative framework for kernel regression, which largely focuses on creating quality preconditioners for the Gram matrices [avron2017faster, cutajar2016preconditioning, halko2011finding, rudi2017falkon, rudi2018fast]. This is necessary since for reasonable parameter settings even is often ill-conditioned.
Cutajar et al. [cutajar2016preconditioning] performed some initial leg-work in this area, proposing eight candidate preconditioners. These preconditioners include a block-diagonal approximation of , adding a larger regularizer and solving recursively, a Nyström approximation of the Gram matrix using data points as inducing points chosen uniformly at random, a coupling of the Nyström approximation with a block-diagonal approximation, or replacing with an optimal low-rank factorization which can be computed via a randomized SVD [halko2011finding] or the Lanczos method [golub2012matrix, Sec. 10.1]. Both [cutajar2016preconditioning] and the work [avron2017faster] of Avron et al. use the Fourier features method of Rahimi and Recht [rahimi2008random] to create a preconditioner which replaces with a sketched version . The latter paper [avron2017faster] also proposes using the TensorSketch method of [pagh2013compressed] for creating a sketched preconditioner when using the polynomial kernel , though the necessary sketching dimension in their upper bounds is exponential in . The work of Rudi et al. [rudi2017falkon] also uses the Nyström-based preconditioner like [cutajar2016preconditioning], combining it with other computational tricks. This Nyström-based approach was refined with approximate leverage score sampling in more recent work by Rudi et al. [rudi2018fast].
An empirical issue within many of the above methods is illustrated perfectly in Figure 1 of [cutajar2016preconditioning]. For every preconditioner presented therein, there exist parameter settings for which using no preconditioner results in fewer iterations than using the preconditioner when solving for via conjugate gradients. As such, it is unclear how one would choose a preconditioner in practice which makes these solvers work well in practice.
3.2.3 Two Simple Geometrically Driven Preconditioners
Here we detail the two candidate preconditioners which we will use in our experiments. They both utilize a geometrically-motivated reordering of the data to empirically achieve superior performance to the preconditioners of [cutajar2016preconditioning] for many settings of the kernel parameters and .
The first preconditioner is a simple block diagonal pinching of a reordering of the data. The kernel regression model under the squared exponential kernel effectively asserts that points nearby in ought to have similar outputs . If the input data is highly clustered in , our model then ought to largely ignore points from different clusters when considering a point in some cluster. The first preconditioning algorithm turns this ‘ought to’ statement directly into an approximation of the Gram matrix . We first cluster the data in via the
k-means++ [arthur2007k] algorithm with clusters, constructing a permutation matrix that places points in the same cluster next to each other on the Gram matrix. At this point, we precondition the re-ordered system by creating a block-diagonal pinching of the re-ordered matrix where each block corresponds to the points within a cluster. The resulting preconditioner is that pinching plus the true noise term .
The second preconditioner is a slightly more complex version of the first. After computing the permuted matrix , we compute a truncated rank- approximation of where is diagonal and has orthonormal columns. At this point we compute the same block diagonal pinching of the error in approximation