1 Motivation
In their elegant 1997 textbook on numerical linear algebra [trefethen1997numerical], Trefethen and Bau write,
“In ending this book with the subject of preconditioners, we find ourselves at the philosophical center of the scientific computing of the future… Nothing will be more central to computational science in the next century than the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly. For Krylov subspace matrix iterations, this is preconditioning… we can only guess where this idea will take us.”
The next century has since arrived, and one of the most fruitful developments in matrix computations has been the emergence of new algorithms that use randomness in an essential way. This paper explores a topic at the nexus of preconditioning and randomized numerical linear algebra. We will show how to use a randomized matrix approximation algorithm to construct a preconditioner for an important class of linear systems that arises throughout data analysis and scientific computing.
1.1 The Preconditioner
Consider the regularized linear system
(1) $(A + \mu I)\,x = b$, where $A \in \mathbb{S}_n^+(\mathbb{R})$ is psd and $\mu > 0$.
Here and elsewhere, psd abbreviates the term “positive semidefinite.” This type of linear system emerges whenever we solve a regularized least-squares problem. We will design a class of preconditioners for the problem Eq. 1.
Throughout this paper, we assume that we can access the matrix $A$ through matrix–vector products $x \mapsto Ax$, commonly known as matvecs. The algorithms that we develop will economize on the number of matvecs, and they may not be appropriate in settings where matvecs are very expensive or there are cheaper ways to interact with the matrix. For a rank parameter $\ell$, the randomized Nyström approximation of $A$ takes the form
(2) $\hat{A}_{\mathrm{nys}} = (A\Omega)\,(\Omega^\top A \Omega)^{\dagger}\,(A\Omega)^\top$, where $\Omega \in \mathbb{R}^{n \times \ell}$ is a standard normal random test matrix.
This matrix provides the best psd approximation of $A$ whose range coincides with the range of $A\Omega$. The randomness in the construction ensures that $\hat{A}_{\mathrm{nys}}$ is a good approximation to the original matrix $A$ with high probability [MartTroppSurvey, Sec. 14]. We can form the Nyström approximation using $\ell$ matvecs with $A$, plus some extra arithmetic. See Algorithm 1 for the implementation details.
Given the eigenvalue decomposition $\hat{A}_{\mathrm{nys}} = U \hat{\Lambda} U^\top$ of the randomized Nyström approximation, we construct the Nyström preconditioner:
(3) $P = \dfrac{1}{\hat{\lambda}_\ell + \mu}\, U (\hat{\Lambda} + \mu I) U^\top + (I - U U^\top)$.
In a slight abuse of terminology, we refer to $\ell$ as the rank of the Nyström preconditioner. The key point is that we can solve the linear system $Py = c$ very efficiently, and the action of $P^{-1}$ dramatically reduces the condition number of the regularized matrix $A_\mu = A + \mu I$.
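To see why Eq. 3 is useful, the following illustrative Python snippet applies $P^{-1}$ in $O(\ell n)$ time and compares condition numbers before and after preconditioning; for simplicity it uses exact top-$\ell$ eigenpairs of a synthetic matrix as a stand-in for the Nyström factors (the helper name `apply_pinv` is ours, not the paper's):

```python
import numpy as np

def apply_pinv(U, lam, mu, x):
    """Apply the inverse Nystrom preconditioner in O(ell * n) time:
    P^{-1} x = (lam_ell + mu) * U (Lam + mu I)^{-1} U^T x + (x - U U^T x),
    where lam holds the ell Nystrom eigenvalues in decreasing order."""
    Ux = U.T @ x
    return (lam[-1] + mu) * (U @ (Ux / (lam + mu))) + (x - U @ Ux)

# Demo: exact top-ell eigenpairs stand in for the Nystrom factors.
n, ell, mu = 100, 20, 1e-6
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = 0.5 ** np.arange(n)                 # rapidly decaying spectrum
A = (Q * eigs) @ Q.T
U, lam = Q[:, :ell], eigs[:ell]

A_mu = A + mu * np.eye(n)
M = np.column_stack([apply_pinv(U, lam, mu, col) for col in A_mu.T])
w = np.linalg.eigvals(M).real              # spectrum of P^{-1} (A + mu I)
kappa_pre = w.max() / w.min()
kappa_orig = (eigs[0] + mu) / (eigs[-1] + mu)
```

On this synthetic spectrum the original condition number is enormous while the preconditioned one is a small constant, which is exactly the phenomenon the theory below quantifies.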
We propose to use Eq. 3 in conjunction with the preconditioned conjugate gradient (PCG) algorithm. Each iteration of PCG involves a single matvec with $A$, and a single linear solve with $P$. When the preconditioned matrix has a modest condition number, the algorithm converges to a solution of Eq. 1 very quickly. See Algorithm 3 for pseudocode for Nyström PCG.
The randomized Nyström preconditioner Eq. 3 was suggested by P.G. Martinsson in the survey [MartTroppSurvey, Sec. 17], but it has not been implemented or analyzed.
1.2 Guarantees
This paper contains the first comprehensive study of the preconditioner Eq. 3, including theoretical analysis and testing on prototypical problems from data analysis and machine learning. One of the main contributions is a rigorous method for choosing the rank $\ell$ to guarantee good performance, along with an adaptive rank selection procedure that performs well in practice. A key quantity in our analysis is the effective dimension of the regularized matrix $A_\mu$. That is,
(4) $d_{\mathrm{eff}}(\mu) = \operatorname{tr}\!\left(A (A + \mu I)^{-1}\right) = \sum_{j=1}^{n} \dfrac{\lambda_j(A)}{\lambda_j(A) + \mu}.$
The effective dimension measures the degrees of freedom of the problem after regularization. It may be viewed as a (smoothed) count of the eigenvalues larger than $\mu$. Many real-world matrices exhibit strong spectral decay, so the effective dimension is typically much smaller than the nominal dimension $n$. As we will discuss, the effective dimension also plays a role in a number of machine learning papers [el2014fast, avron2017faster, bach2013sharp, chowdhury2018randomized, lacotte2020effective] that consider randomized algorithms for solving regularized linear systems.
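A small numerical illustration of Eq. 4 on a hypothetical geometrically decaying spectrum (Python/NumPy; the helper name is ours):

```python
import numpy as np

def effective_dimension(eigs, mu):
    """d_eff(mu) = tr(A (A + mu I)^{-1}) = sum_j lam_j / (lam_j + mu)."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(eigs / (eigs + mu)))

# A geometrically decaying spectrum: d_eff behaves like a smoothed count of
# the eigenvalues exceeding mu, and is far below the nominal dimension n.
n, mu = 10_000, 1e-3
eigs = 0.9 ** np.arange(n)
d_eff = effective_dimension(eigs, mu)
hard_count = int(np.sum(eigs > mu))   # the corresponding "hard" count
```

Here `d_eff` tracks `hard_count` closely, while both are orders of magnitude smaller than `n`.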
Our theory tells us the randomized Nyström preconditioner is successful when its rank is proportional to the effective dimension.
Theorem 1 (Randomized Nyström Preconditioner)
Let $A$ be psd, and fix $\mu > 0$. Construct the randomized Nyström preconditioner $P$ in Eq. 3 with rank $\ell = 2\lceil 1.5\, d_{\mathrm{eff}}(\mu) \rceil + 1$. Then
(5) $\mathbb{E}\left[\kappa_2\!\left(P^{-1/2} (A + \mu I)\, P^{-1/2}\right)\right] < 28.$
Simple probability bounds follow from Eq. 5 via Markov’s inequality. For example, $\mathbb{P}\left\{\kappa_2\!\left(P^{-1/2} (A + \mu I)\, P^{-1/2}\right) < 56\right\} \ge 1/2$.
The main consequence of Theorem 1 is a convergence theorem for PCG with the randomized Nyström preconditioner.
Corollary 1 (Nyström PCG: Convergence)
Construct the preconditioner as in Theorem 1, and condition on the event . Solve the regularized linear system Eq. 1 using Nyström PCG, starting with an initial iterate . After iterations, the relative error satisfies
The error norm is defined as . In particular, iterations suffice to achieve relative error .
Although Theorem 1 gives an interpretable bound for the rank of the preconditioner, we cannot instantiate it without knowledge of the effective dimension. To address this shortcoming, we have designed adaptive methods for selecting the rank in practice (Section 5.4).
Finally, as part of our investigation, we will also develop a detailed understanding of Nyström sketch-and-solve, a popular algorithm in the machine learning literature [el2014fast, bach2013sharp]. Our analysis highlights the deficiencies of Nyström sketch-and-solve relative to Nyström PCG.
1.3 Example: Ridge Regression
As a concrete example, we consider the regularized least-squares problem, also known as ridge regression. This problem takes the form
(6) $\min_{x \in \mathbb{R}^{d}}\ \tfrac{1}{2} \|G x - b\|_2^2 + \tfrac{\mu}{2} \|x\|_2^2,$
where $G \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^{n}$, and $\mu > 0$. By calculus, the solution to Eq. 6 also satisfies the regularized system of linear equations
(7) $(G^\top G + \mu I)\, x = G^\top b.$
A direct method to solve Eq. 7 requires $O(n d^2)$ flops, which is prohibitive when $n$ and $d$ are both large. Instead, when $n$ and $d$ are large, iterative algorithms, such as the conjugate gradient method (CG), become the tools of choice. Unfortunately, the ridge regression linear system Eq. 7 is often very ill-conditioned, and CG converges very slowly.
Nyström PCG can dramatically accelerate the solution of Eq. 7. As an example, consider the shuttle-rf dataset (Section 6.2). The matrix $G$ has dimension $43{,}300 \times 10{,}000$, while the preconditioner is based on a Nyström approximation of modest rank. Figure 1 shows the progress of the residual as a function of the iteration count. Nyström PCG converges to machine precision in 13 iterations, while CG stalls.
1.4 Roadmap
Section 2 contains an overview of the Nyström approximation and its key properties. Section 3
studies the role of the Nyström approximation in estimating the inverse of the regularized matrix. We analyze the Nyström sketchandsolve method in
Section 4, and we give a rigorous performance bound for this algorithm. Section 5 presents a full treatment of Nyström PCG, including theoretical results and guidance on numerical implementation. Computational experiments in Section 6 demonstrate the power of Nyström PCG for three different applications involving real data sets.
1.5 Notation
We write $\mathbb{S}_n(\mathbb{R})$ for the linear space of real symmetric $n \times n$ matrices, while $\mathbb{S}_n^+(\mathbb{R})$ denotes the convex cone of real psd matrices. The symbol $\preceq$ denotes the Loewner order on $\mathbb{S}_n(\mathbb{R})$. That is, $A \preceq B$ if and only if the eigenvalues of $B - A$ are all nonnegative. The function $\operatorname{tr}(\cdot)$ returns the trace of a square matrix. The map $\lambda_j(A)$ returns the $j$th largest eigenvalue of $A$; we may omit the matrix if it is clear. As usual, $\kappa_2$ denotes the condition number. We write $\|\cdot\|$ for the spectral norm of a matrix. For a psd matrix $M$, we write $\|u\|_M = (u^\top M u)^{1/2}$ for the $M$-norm. Given $A \in \mathbb{S}_n(\mathbb{R})$ and $1 \le k \le n$, the symbol $\lfloor A \rfloor_k$ refers to any best rank-$k$ approximation to $A$ relative to the spectral norm. For $A \in \mathbb{S}_n^+(\mathbb{R})$ and $\mu > 0$, the regularized matrix is abbreviated $A_\mu = A + \mu I$. For $A \in \mathbb{S}_n^+(\mathbb{R})$ and $\mu > 0$, the effective dimension of $A$ is defined as $d_{\mathrm{eff}}(\mu) = \operatorname{tr}\!\left(A (A + \mu I)^{-1}\right)$. For $1 \le k \le n$, the tail stable rank of $A$ is defined as $\mathrm{sr}_k(A) = \lambda_k^{-1} \sum_{j \ge k} \lambda_j$. For $A$, we denote the time taken to compute a matvec with $A$ by $T_{\mathrm{mv}}$.
2 The Nyström approximation
Let us begin with a review of the Nyström approximation and the randomized Nyström approximation.
2.1 Definition and basic properties
The Nyström approximation is a natural way to construct a low-rank psd approximation of a psd matrix $A \in \mathbb{S}_n^+(\mathbb{R})$. Let $\Omega \in \mathbb{R}^{n \times \ell}$ be an arbitrary test matrix. The Nyström approximation of $A$ with respect to the range of $\Omega$ is defined by
(8) $A\langle \Omega \rangle = (A\Omega)\,(\Omega^\top A \Omega)^{\dagger}\,(A\Omega)^\top.$
The Nyström approximation is the best psd approximation of $A$ whose range coincides with the range of $A\Omega$. It has a deep relationship with the Schur complement and with Cholesky factorization [MartTroppSurvey, Sec. 14].
The Nyström approximation enjoys several elementary properties that we record in the following lemma.
Lemma 1
Let $\hat{A} = A\langle \Omega \rangle$ be a Nyström approximation of the psd matrix $A$. Then:
1. The approximation $\hat{A}$ is psd and has rank at most $\ell$.
2. The approximation depends only on $\operatorname{range}(\Omega)$.
3. In the Loewner order, $\hat{A} \preceq A$.
4. In particular, the eigenvalues satisfy $\hat{\lambda}_j \le \lambda_j$ for each $j = 1, \dots, n$.
2.2 Randomized Nyström approximation
How should we choose the test matrix so that the Nyström approximation provides a good lowrank model for ? Surprisingly, we can obtain a good approximation simply by drawing the test matrix at random. See [tropp2017fixed] for theoretical justification of this claim.
Let us outline the construction of the randomized Nyström approximation. Draw a standard normal test matrix $\Omega \in \mathbb{R}^{n \times \ell}$, and compute the sketch $Y = A\Omega$. The Nyström approximation Eq. 8 is constructed directly from the test matrix $\Omega$ and the sketch $Y$:
(9) $\hat{A}_{\mathrm{nys}} = Y (\Omega^\top Y)^{\dagger} Y^\top.$
The formula Eq. 9 is not numerically sound. We refer the reader to Algorithm 1 for a stable and efficient implementation of the randomized Nyström approximation [li2017algorithm, tropp2017fixed]. Conveniently, Algorithm 1 returns the truncated eigendecomposition $\hat{A}_{\mathrm{nys}} = U \hat{\Lambda} U^\top$, where $U \in \mathbb{R}^{n \times \ell}$ is an orthonormal matrix whose columns are eigenvectors and $\hat{\Lambda} = \operatorname{diag}(\hat{\lambda}_1, \dots, \hat{\lambda}_\ell)$ is a diagonal matrix listing the eigenvalues, which we often abbreviate as $\hat{\lambda}_j = \lambda_j(\hat{A}_{\mathrm{nys}})$.
The randomized Nyström approximation described in this section has a key difference from the Nyström approximations that have traditionally been used in the machine learning literature [el2014fast, bach2013sharp, derezinski2020improved, gittens2011spectral, williams2001using]. In machine learning settings, the Nyström approximation is usually constructed from a sketch that samples random columns from the matrix (i.e., the random test matrix has 1-sparse columns). In contrast, Algorithm 1 computes a sketch via random projection (i.e., the test matrix is standard normal). In most applications, we have strong reasons (Section 2.2.3) for preferring random projections to column sampling.
2.2.1 Cost of randomized Nyström approximation
Throughout the paper, we write $T_{\mathrm{mv}}$ for the time required to compute a matrix–vector product (matvec) with $A$. Forming the sketch $Y = A\Omega$ requires $\ell$ matvecs, which costs $\ell\, T_{\mathrm{mv}}$. The other steps in the algorithm have arithmetic cost $O(\ell^2 n)$. Hence, the total computational cost of Algorithm 1 is $O(\ell\, T_{\mathrm{mv}} + \ell^2 n)$ operations. The storage cost is $O(\ell n)$ floating-point numbers.
For Algorithm 1, the worst-case performance occurs when $A$ is dense and unstructured. In this case, forming $Y$ costs $O(\ell n^2)$ operations. However, if we have access to the columns of $A$, then we may reduce the cost of forming $Y$ to $O(n^2 \log \ell)$ by using a structured test matrix $\Omega$, such as a scrambled subsampled randomized Fourier transform (SSRFT) map or a sparse map [MartTroppSurvey, tropp2017fixed].
2.2.2 A priori guarantees for the randomized Nyström approximation
In this section, we present an a priori error bound for the randomized Nyström approximation. The result improves over previous analyses [gittens2011spectral, gittens2013revisiting, tropp2017fixed] by sharpening the error terms. This refinement is critical for the analysis of the preconditioner.
Proposition 1 (Randomized Nyström approximation: Error)
Consider a psd matrix $A$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Choose a rank parameter $\ell \ge 4$, and draw a standard normal test matrix $\Omega \in \mathbb{R}^{n \times \ell}$. Then the rank-$\ell$ Nyström approximation computed by Algorithm 1 satisfies
(10) $\mathbb{E}\,\bigl\| A - \hat{A}_{\mathrm{nys}} \bigr\| \le \min_{2 \le p \le \ell - 2} \left[ \left(1 + \dfrac{2(\ell - p)}{p - 1}\right) \lambda_{\ell - p + 1} + \dfrac{2 e^2 \ell}{p^2 - 1} \sum_{j \ge \ell - p + 1} \lambda_j \right].$
The proof of Proposition 1 may be found in Section B.1.
Proposition 1 shows that, in expectation, the randomized Nyström approximation provides a good rank-$\ell$ approximation to $A$. For each choice of the parameter $p$, the first term in the bound is comparable with the spectral-norm error $\lambda_{k+1}$ in an optimal rank-$k$ approximation with $k = \ell - p$. The second term in the bound is comparable with the trace-norm error in the optimal rank-$k$ approximation.
Proposition 1 is better understood via the following simplification.
Corollary 2 (Randomized Nyström approximation)
Instate the assumptions of Proposition 1. For $k \ge 2$ and $\ell = 2k - 1$, we have the bound
$\mathbb{E}\,\bigl\| A - \hat{A}_{\mathrm{nys}} \bigr\| \le \left( 3 + \dfrac{4 e^2}{k}\, \mathrm{sr}_k(A) \right) \lambda_k.$
The stable rank, $\mathrm{sr}_k(A) = \lambda_k^{-1} \sum_{j \ge k} \lambda_j$, reflects decay in the tail eigenvalues.
Corollary 2 shows that the Nyström approximation error is on the order of $\lambda_k$ when the rank parameter $\ell \approx 2k$. The constant depends on the stable rank $\mathrm{sr}_k(A)$, which is small when the tail eigenvalues decay quickly starting at index $k$. This bound is critical for establishing our main results (Theorems 2 and 3).
2.2.3 Random projection versus column sampling
Most papers in the machine learning literature [el2014fast, bach2013sharp] construct Nyström approximations by sampling columns at random from an adaptive distribution. In contrast, for most applications, we advocate using an oblivious random projection of the matrix to construct a Nyström approximation.
Random projection has several advantages over column sampling. First, column sampling may not be practical when we only have black-box matvec access to the matrix, while random projections are natural in this setting. Second, it can be very expensive to obtain adaptive distributions for column sampling. Indeed, computing approximate ridge leverage scores costs just as much as solving the ridge regression problem directly using random projections [drineas2012fast, Theorem 2]. Third, even with a good sampling distribution, column sampling produces higher variance results than random projection, so it is far less reliable.
On the other hand, we have found that there are a few applications where it is more effective to compute a randomized Nyström preconditioner using column sampling in lieu of random projections. In particular, this seems to be the case for kernel ridge regression (Section 6.5). Indeed, the entries of the kernel matrix are given by an explicit formula, so we can extract full columns with ease. Sampling a column may cost only $O(n)$ operations, whereas a single matvec generally costs $O(n^2)$. Furthermore, kernel matrices usually exhibit fast spectral decay, which limits the performance loss that results from using column sampling in lieu of random projection.
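As an illustration of the column-sampling alternative, the following sketch builds a Nyström approximation of an RBF kernel matrix from $\ell$ uniformly sampled columns without ever forming the full matrix (Python/NumPy; the helper names, the uniform sampling distribution, and the bandwidth parameter are our illustrative choices, not the paper's):

```python
import numpy as np

def rbf_kernel_columns(X, idx, bandwidth=1.0):
    """Columns K[:, idx] of the RBF kernel matrix in O(n * |idx| * d) work,
    without ever forming the full n x n kernel matrix."""
    sq = np.sum((X[:, None, :] - X[None, idx, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def column_nystrom(X, ell, bandwidth=1.0, rng=None):
    """Nystrom approximation K ~= C @ Wpinv @ C.T from ell sampled columns."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.choice(n, size=ell, replace=False)
    C = rbf_kernel_columns(X, idx, bandwidth)   # n x ell sampled columns
    W = C[idx, :]                               # ell x ell core block
    return C, np.linalg.pinv(W, hermitian=True)
```

For a psd kernel matrix, the resulting approximation satisfies $\hat{K} \preceq K$ (Lemma 1, Item 3, with a column-selector test matrix).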
3 Approximating the regularized inverse
Let us return to the regularized linear system Eq. 1. The solution to the problem has the form $x_\star = (A + \mu I)^{-1} b$. Given a good approximation $\hat{A}$ to the matrix $A$, it is natural to ask whether $\hat{x} = (\hat{A} + \mu I)^{-1} b$ is a good approximation to the desired solution $x_\star$.
There are many reasons why we might prefer to use $\hat{A}$ in place of $A$. In particular, we may be able to solve linear systems in the matrix $\hat{A} + \mu I$ more efficiently. On the other hand, the utility of this approach depends on how well the inverse $(\hat{A} + \mu I)^{-1}$ approximates the desired inverse $(A + \mu I)^{-1}$. The next result addresses this question for a wide class of approximations that includes the Nyström approximation.
Proposition 2 (Regularized inverses)
Consider psd matrices $A, \hat{A} \in \mathbb{S}_n^+(\mathbb{R})$, and assume that the difference $E = A - \hat{A}$ is psd. Fix $\mu > 0$. Then
(11) $\bigl\| (\hat{A} + \mu I)^{-1} - (A + \mu I)^{-1} \bigr\| \le \dfrac{1}{\mu} \cdot \dfrac{\|E\|}{\|E\| + \mu}.$
Furthermore, the bound (11) is attained for some pairs of matrices $(A, \hat{A})$, so it cannot be improved in general.
The proof of Proposition 2 may be found in Section B.1.1. It is based on [bhatia2013matrix, Lemma X.1.4].
Proposition 2 has an appealing interpretation. When $\|E\|$ is small in comparison to the regularization parameter $\mu$, the approximate inverse $(\hat{A} + \mu I)^{-1}$ can serve in place of the inverse $(A + \mu I)^{-1}$. Note that $\|(A + \mu I)^{-1}\| \le \mu^{-1}$, so we can view Eq. 11 as a relative error bound.
4 Nyström sketch-and-solve
The simplest mechanism for using the Nyström approximation is an algorithm called Nyström sketch-and-solve. This section introduces the method, its implementation, and its history. We also provide a general theoretical analysis that sheds light on its performance. In spite of its popularity, the Nyström sketch-and-solve method is rarely worth serious consideration.
4.1 Overview
Given a rank-$\ell$ Nyström approximation $\hat{A}_{\mathrm{nys}}$ of the psd matrix $A$, it is tempting to replace the regularized linear system $(A + \mu I)\,x = b$ with the proxy $(\hat{A}_{\mathrm{nys}} + \mu I)\,\hat{x} = b$. Indeed, we can solve the proxy linear system in $O(\ell n)$ time using the Sherman–Morrison–Woodbury formula [golubvanloan2013, Eqn. (2.1.4)]:
Lemma 2 (Approximate regularized inversion)
Consider any rank-$\ell$ matrix $\hat{A}$ with eigenvalue decomposition $\hat{A} = U \hat{\Lambda} U^\top$. Then
(12) $(\hat{A} + \mu I)^{-1} = U (\hat{\Lambda} + \mu I)^{-1} U^\top + \mu^{-1} (I - U U^\top).$
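In code, Eq. 12 reduces the proxy solve to $O(\ell n)$ work. A minimal Python sketch (the helper name is ours; `U` and `lam` denote the eigenfactors of the rank-$\ell$ approximation):

```python
import numpy as np

def sketch_and_solve(U, lam, mu, b):
    """Solve (A_nys + mu I) x = b in O(ell * n) time via Eq. 12:
    (A_nys + mu I)^{-1} = U (Lam + mu I)^{-1} U^T + mu^{-1} (I - U U^T)."""
    Ub = U.T @ b
    return U @ (Ub / (lam + mu)) + (b - U @ Ub) / mu
```

The two terms act separately on the range of $U$ and on its orthogonal complement, which is why no dense factorization is needed.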
We refer to the approach in this paragraph as the Nyström sketch-and-solve algorithm because it is modeled on the sketch-and-solve paradigm that originated in [sarlos2006improved].
See Algorithm 2 for a summary of the Nyström sketch-and-solve method. The algorithm produces an approximate solution to the regularized linear system Eq. 1 in time $O(\ell\, T_{\mathrm{mv}} + \ell^2 n)$. The arithmetic cost is much lower than that of a direct method, which costs $O(n^3)$. It can also be faster than running CG for a long time at a cost of $T_{\mathrm{mv}}$ per iteration. The method looks attractive if we only consider the runtime, and yet…
Nyström sketch-and-solve only has one parameter, the rank $\ell$ of the Nyström approximation, which controls the quality of the approximate solution $\hat{x}$. When $\ell \ll n$, the method has an appealing computational profile. As $\ell$ increases, the approximation quality increases but the computational burden becomes heavy. We will show that, alas, an accurate solution to the linear system actually requires a rank far larger than the effective dimension $d_{\mathrm{eff}}(\mu)$, at which point the computational benefits of Nyström sketch-and-solve evaporate completely.
In summary, Nyström sketch-and-solve is almost never the right algorithm to use. We will see that Nyström PCG generally produces much more accurate solutions with a similar computational cost.
4.2 Guarantees and deficiencies
Using Proposition 2 together with the a priori guarantee in Proposition 1, we quickly obtain a performance guarantee for Algorithm 2.
Theorem 2
Fix $\delta \in (0, 1)$, and set $\ell = 2\lceil 1.5\, d_{\mathrm{eff}}(\delta\mu) \rceil + 1$. For a psd matrix $A$, construct a randomized Nyström approximation using Algorithm 1. Then the approximation error for the inverse satisfies
(13) $\mathbb{E}\,\bigl\| (\hat{A}_{\mathrm{nys}} + \mu I)^{-1} - (A + \mu I)^{-1} \bigr\| \le \dfrac{\delta}{\mu}.$
Define $x_\star = (A + \mu I)^{-1} b$, and select $\hat{x} = (\hat{A}_{\mathrm{nys}} + \mu I)^{-1} b$. Then the approximate solution $\hat{x}$ computed by Algorithm 2 satisfies
(14) $\mathbb{E}\, \dfrac{\|\hat{x} - x_\star\|}{\|x_\star\|} \le \delta \cdot \dfrac{\lambda_1 + \mu}{\mu}.$
The proof of Theorem 2 may be found in Section A.1.
Theorem 2 tells us how accurately we can hope to solve linear systems using Nyström sketch-and-solve (Algorithm 2). To obtain relative error $\epsilon$, we should choose the rank $\ell$ on the order of $d_{\mathrm{eff}}(\delta\mu)$ with $\delta \approx \epsilon \mu / (\lambda_1 + \mu)$. When $\delta$ is small, we anticipate that $d_{\mathrm{eff}}(\delta\mu) \gg d_{\mathrm{eff}}(\mu)$. In this case, Nyström sketch-and-solve has no computational value. Our analysis is sharp in its essential respects, so this pessimistic assessment is irremediable.
4.3 History
Nyström sketch-and-solve has a long history in the machine learning literature. It was introduced in [williams2001using] to speed up kernel-based learning, and it plays a role in many subsequent papers on kernel methods. In this context, the Nyström approximation is typically obtained using column sampling [el2014fast, bach2013sharp, williams2001using], which has its limitations (Section 2.2.3). More recently, Nyström sketch-and-solve has been applied to speed up approximate cross-validation [stephenson2020LowRank].
The analysis of Nyström sketch-and-solve presented above differs from previous analyses. Prior works [el2014fast, bach2013sharp] focus on the kernel setting, and they use properties of column sampling schemes to derive learning guarantees. In contrast, we bound the relative error for a Nyström approximation based on a random projection. Our overall approach extends to column sampling if we replace Proposition 1 by an appropriate analog, such as Gittens’s results [gittens2011spectral].
5 Nyström Preconditioned Conjugate Gradients
We now present our main algorithm, Nyström PCG. This algorithm produces high-accuracy solutions to a regularized linear system by using the Nyström approximation as a preconditioner. We provide a rigorous estimate for the condition number of the preconditioned system, and we prove that Nyström PCG leads to fast convergence for regularized linear systems. In contrast, we have shown that Nyström sketch-and-solve cannot be expected to yield accurate solutions.
5.1 The preconditioner
In this section, we introduce the optimal low-rank preconditioner, and we argue that the randomized Nyström preconditioner provides an approximation that is easy to compute.
5.1.1 Motivation
As a warm-up, suppose we knew the eigenvalue decomposition $\lfloor A \rfloor_\ell = V \Lambda_\ell V^\top$ of the best rank-$\ell$ approximation of the matrix. How should we use this information to construct a good preconditioner for the regularized linear system Eq. 1?
Consider the family of symmetric psd matrices that act as the identity on the orthogonal complement of $\operatorname{range}(V)$. Within this class, we claim that the following matrix is the optimal preconditioner:
(15) $P_\star = \dfrac{1}{\lambda_{\ell+1} + \mu}\, V (\Lambda_\ell + \mu I) V^\top + (I - V V^\top).$
The optimal preconditioner requires $O(\ell n)$ storage, and we can solve linear systems in $P_\star$ in $O(\ell n)$ time. Whereas the regularized matrix $A_\mu$ has condition number $\kappa_2(A_\mu) = (\lambda_1 + \mu)/(\lambda_n + \mu)$, the preconditioner yields
(16) $\kappa_2\!\left( P_\star^{-1/2} A_\mu P_\star^{-1/2} \right) = \dfrac{\lambda_{\ell+1} + \mu}{\lambda_n + \mu}.$
This is the minimum possible condition number attainable by a preconditioner from the class that we have delineated. It represents a significant improvement when $\lambda_{\ell+1} \ll \lambda_1$. The proofs of these claims are straightforward; for details, see Section B.1.2.
5.1.2 Randomized Nyström preconditioner
It is expensive to compute the best rank-$\ell$ approximation $\lfloor A \rfloor_\ell$ accurately. In contrast, we can compute the rank-$\ell$ randomized Nyström approximation $\hat{A}_{\mathrm{nys}}$ efficiently (Algorithm 1). Furthermore, we have seen that $\hat{A}_{\mathrm{nys}}$ approximates $A$ nearly as well as the optimal rank-$\ell$ approximation (Corollary 2). These facts lead us to study the randomized Nyström preconditioner, proposed in [MartTroppSurvey, Sec. 17] without a complete justification.
Consider the eigenvalue decomposition $\hat{A}_{\mathrm{nys}} = U \hat{\Lambda} U^\top$, and write $\hat{\lambda}_\ell$ for its $\ell$th eigenvalue. The randomized Nyström preconditioner and its inverse take the form
(17) $P = \dfrac{1}{\hat{\lambda}_\ell + \mu}\, U (\hat{\Lambda} + \mu I) U^\top + (I - U U^\top), \qquad P^{-1} = (\hat{\lambda}_\ell + \mu)\, U (\hat{\Lambda} + \mu I)^{-1} U^\top + (I - U U^\top).$
Like the optimal preconditioner $P_\star$, the randomized Nyström preconditioner (17) is cheap to apply and to store. We may hope that it damps the condition number of the preconditioned system nearly as well as the optimal preconditioner $P_\star$. We will support this intuition with a rigorous bound (Proposition 3).
5.2 Nyström PCG
We can naturally use the randomized Nyström preconditioner within the framework of PCG. We call this approach Nyström PCG, and we present a basic implementation in Algorithm 3.
More precisely, Algorithm 3 uses left-preconditioned CG. This algorithm implicitly works with the unsymmetric matrix $P^{-1} (A + \mu I)$, rather than the symmetric matrix $P^{-1/2} (A + \mu I) P^{-1/2}$. The two methods yield identical sequences of iterates [saad2003iterative], but the former is more efficient. For ease of analysis, our theoretical results are presented in terms of the symmetrically preconditioned matrix.
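The left-preconditioned iteration can be sketched as follows (a textbook PCG loop combined with the Nyström preconditioner of Eq. 17; an illustrative Python stand-in for Algorithm 3, not the paper's MATLAB implementation; any factors `U`, `lam` with $\hat{A}_{\mathrm{nys}} = U \operatorname{diag}(\texttt{lam}) U^\top$ can be passed in):

```python
import numpy as np

def nystrom_pcg(matvec, b, U, lam, mu, tol=1e-10, max_iter=500):
    """Left-preconditioned CG for (A + mu I) x = b with the Nystrom
    preconditioner of Eq. 17 (illustrative sketch of Algorithm 3)."""
    def pinv(v):                            # P^{-1} v in O(ell * n) time
        Uv = U.T @ v
        return (lam[-1] + mu) * (U @ (Uv / (lam + mu))) + (v - U @ Uv)

    x = np.zeros_like(b)
    r = b - (matvec(x) + mu * x)            # residual of the regularized system
    z = pinv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ap = matvec(p) + mu * p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = pinv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Each iteration costs one matvec with $A$ plus $O(\ell n)$ work for the preconditioner solve, matching the complexity discussion below.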
5.2.1 Complexity of Nyström PCG
Nyström PCG has two steps. First, we construct the randomized Nyström approximation, and then we solve the regularized linear system using PCG. We have already discussed the cost of constructing the Nyström approximation (Section 2.2.1). The PCG stage costs $O(T_{\mathrm{mv}} + \ell n)$ operations per iteration, and it uses a total of $O(\ell n)$ additional storage.
For the regularized linear system Eq. 1, Theorem 3 and Corollary 3 demonstrate that it is enough to choose the rank $\ell = O(d_{\mathrm{eff}}(\mu))$. In this case, the overall runtime of Nyström PCG is
$O\!\left( d_{\mathrm{eff}}(\mu)\, T_{\mathrm{mv}} + d_{\mathrm{eff}}(\mu)^2\, n + \left( T_{\mathrm{mv}} + d_{\mathrm{eff}}(\mu)\, n \right) \log(1/\epsilon) \right).$
When the effective dimension is modest, Nyström PCG is very efficient.
In contrast, Section 4.2 shows that the running time for Nyström sketch-and-solve has the same form, with $d_{\mathrm{eff}}(\delta\mu)$ in place of $d_{\mathrm{eff}}(\mu)$. This is a massive difference. Nyström PCG can produce solutions whose residual norm is close to machine precision; this type of successful computation is impossible with Nyström sketch-and-solve.
5.2.2 Block Nyström PCG
We can also use the Nyström preconditioner with the block CG algorithm [o1980block] to solve regularized linear systems with multiple right-hand sides. For this approach, we also use an orthogonalization preprocessing proposed in [feng1995block] that ensures numerical stability without further orthogonalization steps during the iteration.
5.3 Analysis of Nyström PCG
We now turn to the analysis of the randomized Nyström preconditioner $P$. Theorem 3 provides a bound for the rank $\ell$ at which the Nyström preconditioner reduces the condition number of $A_\mu$ to a constant. In this case, we deduce that Nyström PCG converges rapidly (Corollary 3).
Theorem 3 (Nyström preconditioning)
Suppose we construct the Nyström preconditioner $P$ in Eq. 17 using Algorithm 1 with rank $\ell = 2\lceil 1.5\, d_{\mathrm{eff}}(\mu) \rceil + 1$. Using $P$ to precondition the regularized matrix $A_\mu$ results in the condition number bound
$\mathbb{E}\left[ \kappa_2\!\left( P^{-1/2} A_\mu P^{-1/2} \right) \right] < 28.$
The proof of Theorem 3 may be found in Section 5.3.3.
Theorem 3 has several appealing features. Many other authors have noticed that the effective dimension controls sample size requirements for particular applications such as discriminant analysis [chowdhury2018randomized], ridge regression [lacotte2020effective], and kernel ridge regression [el2014fast, bach2013sharp]. In contrast, our result holds for any regularized linear system.
Our argument makes the role of the effective dimension conceptually simpler, and it leads to explicit, practical parameter recommendations. Indeed, the effective dimension is essentially the same as the rank that makes the approximation error proportional to . In previous arguments, such as those in [el2014fast, bach2013sharp, chowdhury2018randomized], the effective dimension arises because the authors reduce the analysis to approximate matrix multiplication [cohen2015optimal], which produces inscrutable constant factors.
Theorem 3 ensures that Nyström PCG converges quickly.
Corollary 3 (Nyström PCG: Convergence)
Define $P$ as in Theorem 3, and condition on the event $\{\kappa_2(P^{-1/2} A_\mu P^{-1/2}) \le 56\}$. Let $x_\star$ denote the solution of the regularized linear system Eq. 1. If we initialize Algorithm 3 with initial iterate $x_0 = 0$, then the relative error in the iterate $x_t$ satisfies
$\dfrac{\|x_t - x_\star\|_{A_\mu}}{\|x_\star\|_{A_\mu}} \le 2 \left( \dfrac{\sqrt{56} - 1}{\sqrt{56} + 1} \right)^{t} < 2 \cdot (0.77)^{t}.$
In particular, after $t = \lceil 3.8 \log(2/\epsilon) \rceil$ iterations, we have relative error $\epsilon$.
The proof of Corollary 3 is an immediate consequence of the standard convergence result for CG [trefethen1997numerical, Theorem 38.5, p. 299]. See Section A.2.
5.3.1 Analyzing the condition number
The first step in the proof of Theorem 3 is a deterministic bound on how the preconditioner (17) reduces the condition number of the regularized matrix $A_\mu$. Let us emphasize that this bound is valid for any rank-$\ell$ Nyström approximation, regardless of the choice of test matrix.
Proposition 3 (Nyström preconditioner: Deterministic bound)
Let $\hat{A} = A\langle \Omega \rangle$ be any rank-$\ell$ Nyström approximation, with $\ell$th largest eigenvalue $\hat{\lambda}_\ell$, and let $E = A - \hat{A}$ be the approximation error. Construct the Nyström preconditioner $P$ as in (17). Then the condition number of the preconditioned matrix satisfies
(18) $\kappa_2\!\left( P^{-1/2} A_\mu P^{-1/2} \right) \le \dfrac{\hat{\lambda}_\ell + \mu + \|E\|}{\mu}.$
For the proof of Proposition 3 see Section A.1.1.
To interpret the result, recall the expression Eq. 16 for the condition number induced by the optimal preconditioner. Proposition 3 shows that the Nyström preconditioner may reduce the condition number almost as well as the optimal preconditioner.
In particular, when $\hat{\lambda}_\ell + \|E\| = O(\mu)$, the condition number of the preconditioned system is bounded by a constant, independent of the spectrum of $A$. In this case, Nyström PCG is guaranteed to converge quickly.
5.3.2 The effective dimension and rank selection
How should we choose the rank $\ell$ to guarantee that $\hat{\lambda}_\ell + \|E\| = O(\mu)$? Corollary 2 shows how the error in the rank-$\ell$ randomized Nyström approximation depends on the spectrum of $A$ through the eigenvalues and the tail stable rank. In this section, we present a lemma which demonstrates that the effective dimension controls both quantities. As a consequence of this bound, we will be able to choose the rank $\ell$ proportional to the effective dimension $d_{\mathrm{eff}}(\mu)$.
Recall from Eq. 4 that the effective dimension of the matrix $A$ is defined as
(19) $d_{\mathrm{eff}}(\mu) = \operatorname{tr}\!\left( A (A + \mu I)^{-1} \right) = \sum_{j=1}^{n} \dfrac{\lambda_j(A)}{\lambda_j(A) + \mu}.$
As previously mentioned, $d_{\mathrm{eff}}(\mu)$ may be viewed as a smoothed count of the eigenvalues larger than $\mu$. Thus, one may expect that $\lambda_j \lesssim \mu$ once $j \gtrsim d_{\mathrm{eff}}(\mu)$. This intuition is correct, and it forms the content of Lemma 3.
Lemma 3 (Effective dimension)
Let $A \in \mathbb{S}_n^+(\mathbb{R})$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$. Let $\mu > 0$ be a regularization parameter, and define the effective dimension as in Eq. 19. The following statements hold.
1. Fix $\gamma > 0$. If $j \ge (1 + \gamma^{-1})\, d_{\mathrm{eff}}(\mu)$, then $\lambda_j \le \gamma \mu$.
2. If $k \ge 1.5\, d_{\mathrm{eff}}(\mu)$, then $\sum_{j \ge k} \lambda_j \le 3 \mu\, d_{\mathrm{eff}}(\mu)$.
The proof of Lemma 3 may be found in Section A.1.2.
Lemma 3, Item 1 captures the intuitive fact that there are no more than $(1 + \gamma^{-1})\, d_{\mathrm{eff}}(\mu)$ eigenvalues larger than $\gamma \mu$. Similarly, Item 2 states that the effective dimension controls the sum of all the eigenvalues whose index exceeds a small multiple of the effective dimension. It is instructive to think about the meaning of these results when $d_{\mathrm{eff}}(\mu)$ is small.
5.3.3 Proof of Theorem 3
We are now prepared to prove Theorem 3. The key ingredients in the proof are Proposition 1, Proposition 3, and Lemma 3.
Proof (Proof of Theorem 3)
Fix the rank $\ell = 2\lceil 1.5\, d_{\mathrm{eff}}(\mu) \rceil + 1$, and write $\ell = 2k - 1$ with $k = \lceil 1.5\, d_{\mathrm{eff}}(\mu) \rceil + 1$. Construct the rank-$\ell$ randomized Nyström approximation $\hat{A}_{\mathrm{nys}}$ with eigenvalues $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_\ell$. Write $E = A - \hat{A}_{\mathrm{nys}}$ for the approximation error. Form the preconditioner $P$ via Eq. 17. We must bound the expected condition number of the preconditioned matrix $P^{-1/2} A_\mu P^{-1/2}$.
First, we apply Proposition 3 to obtain a deterministic bound that is valid for any rank-$\ell$ Nyström preconditioner:
$\kappa_2\!\left( P^{-1/2} A_\mu P^{-1/2} \right) \le \dfrac{\hat{\lambda}_\ell + \mu + \|E\|}{\mu} \le 2 + \dfrac{\|E\|}{\mu}.$
The second inequality holds because $\hat{\lambda}_\ell \le \lambda_\ell \le \mu$. This is a consequence of Lemma 1, Item 4 and Lemma 3, Item 1 with $\gamma = 1$. We rely on the fact that $\ell \ge 2\, d_{\mathrm{eff}}(\mu)$.
Take the expectation, and invoke Corollary 2 with $\ell = 2k - 1$ to obtain
$\mathbb{E}\left[ \kappa_2\!\left( P^{-1/2} A_\mu P^{-1/2} \right) \right] \le 2 + \dfrac{3 \lambda_k}{\mu} + \dfrac{4 e^2}{k \mu} \sum_{j \ge k} \lambda_j.$
By definition, $\mathrm{sr}_k(A)\, \lambda_k = \sum_{j \ge k} \lambda_j$. To complete the bound, apply Lemma 3 twice. We use Item 1 to bound $\lambda_k$ by a multiple of $\mu$, and Item 2 to bound the tail sum $\sum_{j \ge k} \lambda_j$ by a multiple of $\mu\, d_{\mathrm{eff}}(\mu)$; both apply because $k \ge 1.5\, d_{\mathrm{eff}}(\mu)$. Combining these estimates shows that the expected condition number is bounded by an absolute constant, which is the desired result.
5.4 Practical parameter selection
In practice, we may not know the regularization parameter $\mu$ in advance, and we rarely know the effective dimension $d_{\mathrm{eff}}(\mu)$. As a consequence, we cannot enact the theoretical recommendation for the rank of the Nyström preconditioner: $\ell = 2\lceil 1.5\, d_{\mathrm{eff}}(\mu) \rceil + 1$. Instead, we need an adaptive method for choosing the rank $\ell$. Below, we outline three strategies.
5.4.1 Strategy 1: Choose $\ell$ as large as the user can afford
The first strategy is to choose the rank $\ell$ as large as the user can afford. This approach is coarse, and it does not yield any guarantees on the cost of the Nyström PCG method.
Nevertheless, once we have constructed a rank-$\ell$ Nyström approximation $\hat{A}_{\mathrm{nys}}$, we can obtain an a posteriori upper bound for the iteration count of Nyström PCG. Recall that Proposition 3 controls the condition number of the preconditioned system:
(20) $\kappa_2\!\left( P^{-1/2} A_\mu P^{-1/2} \right) \le \dfrac{\hat{\lambda}_\ell + \mu + \|E\|}{\mu}.$
We can estimate the error $\|E\| = \|A - \hat{A}_{\mathrm{nys}}\|$ inexpensively with the randomized power method [kuczynski1992estimating]; see Algorithm 4 in Appendix E. Using the standard convergence theory for PCG, we obtain an upper bound for the iteration count. This gives us advance warning about how long it may take to solve the regularized linear system.
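A hedged Python sketch of the randomized power estimate (our naming, in the spirit of Algorithm 4): each step costs one matvec with $A$, since $Ev = Av - U(\hat{\Lambda}(U^\top v))$ never forms $E$ explicitly.

```python
import numpy as np

def estimate_error_norm(matvec, U, lam, n, iters=50, rng=None):
    """Randomized power iteration for ||E|| = ||A - A_nys||.
    Since E is symmetric psd, ||E v|| -> lambda_max(E) for a random start."""
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    est = 0.0
    for _ in range(iters):
        w = matvec(v) - U @ (lam * (U.T @ v))   # E v, applied implicitly
        est = np.linalg.norm(w)
        if est == 0.0:
            break
        v = w / est
    return est
```

Plugging the estimate into (20) gives the advance warning about the PCG iteration count described above.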
5.4.2 Strategy 2: Adaptive rank selection by a posteriori error estimation
The second strategy uses the a posteriori condition number estimate adaptively in a procedure that repeatedly doubles the rank as required. We compute a randomized Nyström approximation with initial rank $\ell_0$, and we estimate $\|E\|$ using randomized powering. If $\|E\|$ is smaller than a prescribed tolerance, we accept the approximation. If the tolerance is not met, then we double the rank, update the approximation, and estimate $\|E\|$ again. The process repeats until the estimate for $\|E\|$ falls below the tolerance or the rank breaches a threshold $\ell_{\max}$ for the maximum rank. In the case of regularized linear systems, we proceed until $\|E\| \le \tau \mu$ for a modest constant $\tau$. Based on numerical experience, we recommend a modest choice of $\tau$. For full algorithmic details of adaptive rank selection by estimating $\|E\|$, see Algorithm 5 in Appendix E.
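In outline, Strategy 2 might look like the following dense, illustrative Python sketch. It recomputes the approximation from scratch at each doubling and computes $\|E\|$ exactly, where Algorithm 5 would reuse the sketch and estimate $\|E\|$ by randomized powering; the default `tau = 10.0` is an arbitrary illustration, not the paper's recommendation.

```python
import numpy as np

def adaptive_nystrom(A, mu, tau=10.0, ell0=16, ell_max=1024, rng=None):
    """Double the rank until ||A - A_nys|| <= tau * mu (dense sketch of
    Strategy 2). Returns the approximation, the final rank, and the error."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    ell = ell0
    while True:
        Omega = np.linalg.qr(rng.standard_normal((n, ell)))[0]
        Y = A @ Omega
        # Raw Nystrom formula (Eq. 8); Algorithm 1's shifted variant is
        # preferred in floating point, but this suffices for illustration.
        Anys = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T
        err = np.linalg.norm(A - Anys, 2)   # Algorithm 5 would estimate this
        if err <= tau * mu or ell >= min(ell_max, n):
            return Anys, ell, err
        ell = min(2 * ell, n)
```

The loop terminates either with a certified tolerance $\|E\| \le \tau\mu$ or at the rank cap.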
5.4.3 Strategy 3: Adaptive rank selection by monitoring $\hat{\lambda}_\ell / \mu$
The third strategy is almost identical to the second, except we monitor the ratio $\hat{\lambda}_\ell / \mu$ instead of $\|E\| / \mu$. The strategy is rooted in the empirical observation that Nyström PCG converges quickly when $\hat{\lambda}_\ell \lesssim \mu$. Thus, one doubles the approximation rank until $\hat{\lambda}_\ell / \mu$ falls below the tolerance or the rank reaches the threshold $\ell_{\max}$.
6 Applications and experiments
In this section, we study the performance of Nyström PCG on real-world data from three different applications: ridge regression, kernel ridge regression, and approximate cross-validation. The experiments demonstrate the effectiveness of the preconditioner and our strategies for choosing the rank, compared to other algorithms in the literature.
6.1 Preliminaries
We implemented all experiments in MATLAB R2019a and ran them on a server with 128 Intel Xeon E7-4850 v4 2.10GHz CPU cores and 1056 GB of memory. Every numerical experiment in this section was repeated twenty times; tables report the mean over the twenty runs, and the standard deviation (in parentheses) when it is nonzero. We highlight the best-performing method in each table in bold.
We select hyperparameters of competing methods by grid search to optimize performance. This procedure tends to be very charitable to the competitors, and it may not be representative of their realworld performance. Indeed, grid search is computationally expensive, and it cannot be used as part of a practical implementation. A detailed overview of the experimental setup for each application may be found in the appropriate section of
Appendix C, and additional numerical results appear in Appendix D.
6.2 Ridge regression
In this section, we solve the ridge regression problem (7) described in Section 1.3 on standard machine-learning datasets (Table 1) from OpenML [vanschoren2014openml] and LIBSVM [chang2011libsvm]. We compare Nyström PCG to standard CG, the preconditioning method of Rokhlin and Tygert (R&T) [rokhlin2008fast], and the Adaptive Iterative Hessian Sketch (AdaIHS) [lacotte2020effective].
6.2.1 Experimental overview
Dataset  n  d
CIFAR-10  50,000  3,072
Guillermo  20,000  4,297
smallNORB-rf  24,300  10,000
shuttle-rf  43,300  10,000
We perform two sets of experiments: computing regularization paths on CIFAR-10 and Guillermo, and random features regression [rahimi2007random, rahimi2008uniform] on shuttle and smallNORB with specified values of . The last two datasets have comparable to ; in this setting R&T provides no advantage over a direct method, so we omit the comparison. We use the Euclidean norm of the residual as our stopping criterion, declaring convergence when . For both sets of experiments, we use Nyström PCG with adaptive rank selection, Algorithm 5 in Appendix E. For complete details of the experimental methodology, see Section C.1.
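To make the solver concrete, here is a hypothetical NumPy sketch of Nyström PCG for the ridge problem. It assumes a rank-r Nyström eigendecomposition of the Gram matrix is available (orthonormal `U`, eigenvalues `S` in descending order, as produced by Algorithm 1); the preconditioner is applied through its closed-form inverse, and the iteration stops on the relative residual as described above. Names and tolerances are illustrative.

```python
import numpy as np

def nystrom_pcg(matvec, rhs, U, S, mu, tol=1e-10, maxiter=500):
    """PCG for (A^T A + mu I) x = rhs, preconditioned by a rank-r
    Nystrom approximation U diag(S) U^T of A^T A.

    `matvec` applies v -> A^T (A v) + mu v; U has orthonormal columns
    and S is in descending order."""
    lam_r = S[-1]                         # smallest retained eigenvalue
    def prec(v):
        # P^{-1} = (lam_r + mu) U (S + mu)^{-1} U^T + (I - U U^T)
        Uv = U.T @ v
        return U @ (((lam_r + mu) / (S + mu)) * Uv) + (v - U @ Uv)
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    z = prec(r)
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(rhs)
    for k in range(1, maxiter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * bnorm:   # relative-residual stop
            break
        z = prec(r)
        rz_next = r @ z
        p = z + (rz_next / rz) * p
        rz = rz_next
    return x, k
```

A direct factorization of the regularized Gram matrix gives a reference solution for checking the iterate on small problems.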
The regularization path experiments solve Eq. 7 over a regularization path where . We begin by solving the problem for the largest (initialized at zero), and solve for progressively smaller with warm starting. For each value of , every method is allowed at most 500 iterations to reach the desired tolerance.
6.2.2 Computing the regularization path
Fig. 2 shows the evolution of along the regularization path. CIFAR-10 and Guillermo are both small, so we compute the exact effective dimension as a point of reference. We see that we reach the rank cap of for CIFAR-10 and for Guillermo when is small enough. For CIFAR-10, Nyström PCG chooses a rank much smaller than the effective dimension for small values of . Nevertheless, the method still performs well (Fig. 3).
Fig. 3 shows the effectiveness of each method for computing the entire regularization path. Nyström PCG is the fastest method along most of the regularization path. For CIFAR-10, Nyström PCG is faster than R&T until the very end of the regularization path, when . The R&T preconditioner is expensive to form, so it is not beneficial unless . Thus, provided the regularization parameter is not extremely small, we expect Nyström PCG to perform better on ridge regression problems.
AdaIHS is rather slow. It increases the sketch size parameter several times along the regularization path. Each time, AdaIHS must form a new sketch of the matrix, approximate the Hessian, and compute a Cholesky factorization.
6.2.3 Random features regression
Tables 2 and 3 display the results for Nyström PCG and AdaIHS on shuttle-rf and smallNORB-rf. Table 2 shows that Nyström PCG performs best on both datasets under every metric. The most striking feature is the difference between the chosen sketch sizes: AdaIHS chooses the maximum allowed sketch size, , which offers little benefit over a direct method, while Nyström PCG uses a sketch size . Table 3 contains estimates of and of the condition number of the preconditioned system, which explain the fast convergence.
Dataset  Method  Final rank/sample size  Number of iterations  Total runtime (s)
shuttle-rf  AdaIHS
  Nyström PCG
smallNORB-rf  AdaIHS
  Nyström PCG
Dataset  Estimate of  Estimated condition number of preconditioned system
shuttle-rf
smallNORB-rf
6.3 Approximate cross-validation
In this subsection, we use our preconditioner to compute approximate leave-one-out cross-validation (ALOOCV), which requires solving a large linear system with multiple right-hand sides.
6.3.1 Background
Cross-validation is an important machine-learning technique for assessing and selecting models and hyperparameters. Generally, it requires refitting a model on many subsets of the data, so it can take quite a long time. The worst culprit is leave-one-out cross-validation (LOOCV), which requires running an expensive training algorithm once per datapoint. Recent work has developed approximate leave-one-out cross-validation (ALOOCV), a faster alternative that replaces model retraining with a linear system solve [giordano2019swiss, rad2020scalable, wilson2020approximate]. In particular, these techniques yield accurate and computationally tractable approximations to LOOCV.
To present the approach, we consider the infinitesimal jackknife (IJ) approximation to LOOCV [giordano2019swiss, stephenson2020approximate]. The IJ approximation computes
(21)
where is the Hessian of the loss function at the solution , for each datapoint . The main computational challenge is computing the inverse-Hessian vector product . When is very large, we can also subsample the data and average Eq. 21 over the subsample to estimate ALOOCV. Since ALOOCV solves the same problem with several right-hand sides, blocked PCG methods (here, block Nyström PCG) are the tool of choice: they efficiently solve for multiple right-hand sides at once. To demonstrate the idea, we perform numerical experiments on ALOOCV for logistic regression. The datasets we use are all from LIBSVM
[chang2011libsvm]; see Table 4.

Dataset  n  d  Regularization  Initial rank
Gisette  6,000  5,000  1, 1e-4  850
real-sim  72,308  20,958  1e-4, 1e-8
rcv1.binary  20,242  47,236  1e-4, 1e-8
SVHN  73,257  3,072  1, 1e-4  850
6.3.2 Experimental overview
We perform two sets of experiments in this section. The first set uses Gisette and SVHN to test the efficacy of Nyström sketch-and-solve. These datasets are small enough that we can factor the regularized Hessian using a direct method. We also compare to block CG and to block PCG with the computed Nyström approximation as a preconditioner. To assess the error due to an inexact solve for datapoint , let denote the solution obtained by the direct method. For any putative solution , we compute the relative error . We average the relative error over 100 datapoints .
The second set of experiments uses the larger datasets real-sim and rcv1.binary and small values of , the most challenging setting for ALOOCV. We restrict our comparison to block Nyström PCG versus block CG, as Nyström sketch-and-solve is too inaccurate in this regime. We employ Algorithm 5 to construct the preconditioner for block Nyström PCG.
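The block iteration used here can be sketched as follows. This is a hypothetical dense NumPy version of preconditioned block CG in the style of O'Leary [o1980block], not the implementation used in the experiments; passing the identity for `prec` recovers plain block CG.

```python
import numpy as np

def block_pcg(matvec, B, prec, tol=1e-8, maxiter=200):
    """Solve H X = B for s right-hand sides at once (H symmetric pd).

    `matvec` and `prec` act on n-by-s blocks.  The s x s alpha/beta
    systems couple the right-hand sides, which is the source of the
    accelerated convergence discussed in the text."""
    X = np.zeros_like(B)
    R = B - matvec(X)
    Z = prec(R)
    P = Z.copy()
    ZR = Z.T @ R
    bnorm = np.linalg.norm(B)
    for k in range(1, maxiter + 1):
        Q = matvec(P)
        alpha = np.linalg.solve(P.T @ Q, ZR)   # s x s Galerkin system
        X += P @ alpha
        R -= Q @ alpha
        if np.linalg.norm(R) <= tol * bnorm:
            break
        Z = prec(R)
        ZR_next = Z.T @ R
        beta = np.linalg.solve(ZR, ZR_next)
        P = Z + P @ beta
        ZR = ZR_next
    return X, k
```

For s = 1 this reduces to ordinary PCG. A production version would guard against rank deficiency in the search block, which this sketch omits.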
6.3.3 Nyström sketch-and-solve
As predicted, Nyström sketch-and-solve works poorly (Table 5). When , the approximate solutions are modestly accurate, and the accuracy degrades as decreases to . The experimental results agree with the theoretical analysis in Section 4, which indicates that sketch-and-solve degrades as decreases. In contrast, block CG and block Nyström PCG both provide high-quality solutions for every datapoint at both values of the regularization parameter.
Dataset  Nyström sketch-and-solve  Block CG  Block Nyström PCG
Gisette
Gisette
SVHN
SVHN
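For reference, the sketch-and-solve step itself is essentially free once the Nyström factors are available, because the regularized approximation inverts in closed form. A minimal sketch (with hypothetical names) is below; the speed is the appeal, and the accuracy loss is what Table 5 measures.

```python
import numpy as np

def nystrom_sketch_and_solve(U, S, mu, b):
    """Solve (A_hat + mu I) x = b exactly, where A_hat = U diag(S) U^T
    is a rank-r Nystrom approximation with orthonormal U.

    Uses the identity
      (A_hat + mu I)^{-1} = U (S + mu)^{-1} U^T + (I - U U^T) / mu,
    so the cost is O(n r) per right-hand side."""
    Ub = U.T @ b
    return U @ (Ub / (S + mu)) + (b - U @ Ub) / mu
```

The error relative to the true system is governed entirely by how well the Nyström approximation captures the matrix, which is why the approach degrades for small regularization.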
6.4 Large-scale ALOOCV experiments
Table 6 summarizes the results for block Nyström PCG and block CG on the larger datasets. When , block Nyström PCG offers little or no benefit over block CG, because the data matrices are very sparse (see Table 4) and the rcv1 problem is well-conditioned (see Table 11).
For , block Nyström PCG reduces the number of iterations substantially, but the speedup is negligible. The data matrix is sparse, which reduces the benefit of the Nyström method. Block CG also benefits from the presence of multiple right-hand sides, just as block Nyström PCG does. Indeed, O'Leary proved that the convergence of block CG depends on the ratio , where is the number of right-hand sides [o1980block]. Consequently, multiple right-hand sides effectively precondition block CG and accelerate convergence. We expect bigger gains over block CG when the data matrix is dense.
Dataset  Method  Number of iterations  Runtime (s)
rcv1  Block CG
  Block Nyström PCG
rcv1  Block CG