Randomized Nyström Preconditioning

This paper introduces the Nyström PCG algorithm for solving a symmetric positive-definite linear system. The algorithm applies the randomized Nyström method to form a low-rank approximation of the matrix, which yields an efficient preconditioner for the conjugate gradient algorithm. Theoretical analysis shows that the preconditioned system has a constant condition number as soon as the rank of the approximation is comparable with the number of effective degrees of freedom in the matrix. The paper also develops adaptive methods that achieve similar performance without knowledge of the effective dimension. Numerical tests show that Nyström PCG can rapidly solve large linear systems that arise in data analysis problems, and it surpasses several competing methods from the literature.


1 Motivation

In their elegant 1997 textbook on numerical linear algebra [trefethen1997numerical], Trefethen and Bau write,

“In ending this book with the subject of preconditioners, we find ourselves at the philosophical center of the scientific computing of the future… Nothing will be more central to computational science in the next century than the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly. For Krylov subspace matrix iterations, this is preconditioning… we can only guess where this idea will take us.”

The next century has since arrived, and one of the most fruitful developments in matrix computations has been the emergence of new algorithms that use randomness in an essential way. This paper explores a topic at the nexus of preconditioning and randomized numerical linear algebra. We will show how to use a randomized matrix approximation algorithm to construct a preconditioner for an important class of linear systems that arises throughout data analysis and scientific computing.

1.1 The Preconditioner

Consider the regularized linear system

$$ (A+\mu I)\,x = b \qquad \text{where } A\in\mathbb{R}^{n\times n} \text{ is symmetric psd and } \mu\ge 0. \tag{1} $$

Here and elsewhere, psd abbreviates the term “positive semidefinite.” This type of linear system emerges whenever we solve a regularized least-squares problem. We will design a class of preconditioners for the problem Eq. 1.

Throughout this paper, we assume that we can access the matrix A through matrix–vector products x ↦ Ax, commonly known as matvecs. The algorithms that we develop will economize on the number of matvecs, and they may not be appropriate in settings where matvecs are very expensive or there are cheaper ways to interact with the matrix.

For a rank parameter ℓ, the randomized Nyström approximation of A takes the form

$$ \hat{A}_{\mathrm{nys}} = (A\Omega)\,(\Omega^{T} A \Omega)^{\dagger} (A\Omega)^{T} \qquad \text{where } \Omega\in\mathbb{R}^{n\times \ell} \text{ is standard normal.} \tag{2} $$

This matrix provides the best psd approximation of A whose range coincides with the range of the sketch AΩ. The randomness in the construction ensures that Â_nys is a good approximation to the original matrix A with high probability [MartTroppSurvey, Sec. 14].

We can form the Nyström approximation using ℓ matvecs with A, plus some extra arithmetic. See Algorithm 1 for the implementation details.
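In the spirit of Algorithm 1, the construction can be sketched in a few lines of NumPy/SciPy. This is a schematic, shift-stabilized implementation; the function name and the size of the stabilizing shift are our choices, not prescriptions from the paper:

```python
import numpy as np
from scipy.linalg import solve_triangular

def randomized_nystrom(A, ell, seed=None):
    """Rank-ell randomized Nystrom approximation of a symmetric psd matrix A.

    Returns (U, lam) such that A_nys = U @ np.diag(lam) @ U.T, as in Eq. (2).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega, _ = np.linalg.qr(rng.standard_normal((n, ell)))  # orthonormalized Gaussian test matrix
    Y = A @ Omega                                           # sketch: ell matvecs with A
    nu = np.finfo(Y.dtype).eps * np.linalg.norm(Y)          # small shift for numerical stability
    Y_nu = Y + nu * Omega
    C = np.linalg.cholesky(Omega.T @ Y_nu)                  # core matrix is positive definite after the shift
    B = solve_triangular(C, Y_nu.T, lower=True).T           # B @ B.T = Y_nu (Omega^T Y_nu)^{-1} Y_nu^T
    U, S, _ = np.linalg.svd(B, full_matrices=False)
    lam = np.maximum(S**2 - nu, 0.0)                        # remove the shift from the eigenvalues
    return U, lam
```

The returned factors form the truncated eigendecomposition used throughout the paper; the small shift ν guards the Cholesky factorization against roundoff.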

Given the eigenvalue decomposition Â_nys = UΛ̂Uᵀ of the randomized Nyström approximation, we construct the Nyström preconditioner:

$$ P = \frac{1}{\hat{\lambda}_{\ell}+\mu}\, U(\hat{\Lambda}+\mu I)U^{T} + (I - UU^{T}). \tag{3} $$

In a slight abuse of terminology, we refer to ℓ as the rank of the Nyström preconditioner. The key point is that we can solve linear systems in P very efficiently, and the action of P⁻¹ dramatically reduces the condition number of the regularized matrix A_μ = A + μI.

We propose to use Eq. 3 in conjunction with the preconditioned conjugate gradient (PCG) algorithm. Each iteration of PCG involves a single matvec with A_μ and a single linear solve with P. When the preconditioned matrix has a modest condition number, the algorithm converges to a solution of Eq. 1 very quickly. See Algorithm 3 for pseudocode for Nyström PCG.
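The solve with P is cheap because the inverse is available in closed form, P⁻¹ = (λ̂_ℓ + μ) U(Λ̂ + μI)⁻¹Uᵀ + (I − UUᵀ) (given later as Eq. 17), and applying it costs only O(ℓn). A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def apply_inv_preconditioner(U, lam, mu, x):
    """Apply P^{-1} for the Nystrom preconditioner of Eq. (3) in O(ell * n) time.

    U: (n, ell) orthonormal eigenvectors of the Nystrom approximation.
    lam: (ell,) its eigenvalues in decreasing order.  mu: regularization.
    """
    Utx = U.T @ x                                      # project onto range(U)
    head = (lam[-1] + mu) * (U @ (Utx / (lam + mu)))   # rescale within the range
    tail = x - U @ Utx                                 # act as the identity on the complement
    return head + tail
```

Each PCG iteration then combines one matvec with A_μ and one call to this routine.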

The randomized Nyström preconditioner Eq. 3 was suggested by P.-G. Martinsson in the survey [MartTroppSurvey, Sec. 17], but it has not been implemented or analyzed.

1.2 Guarantees

This paper contains the first comprehensive study of the preconditioner Eq. 3, including theoretical analysis and testing on prototypical problems from data analysis and machine learning. One of the main contributions is a rigorous method for choosing the rank ℓ to guarantee good performance, along with an adaptive rank selection procedure that performs well in practice.

A key quantity in our analysis is the effective dimension of the regularized matrix A_μ = A + μI. That is,

$$ d_{\mathrm{eff}}(\mu) = \operatorname{tr}\bigl(A(A+\mu I)^{-1}\bigr) = \sum_{j=1}^{n} \frac{\lambda_j(A)}{\lambda_j(A)+\mu}. \tag{4} $$

The effective dimension measures the degrees of freedom of the problem after regularization. It may be viewed as a (smoothed) count of the eigenvalues larger than μ. Many real-world matrices exhibit strong spectral decay, so the effective dimension is typically much smaller than the nominal dimension n. As we will discuss, the effective dimension also plays a role in a number of machine learning papers [el2014fast, avron2017faster, bach2013sharp, chowdhury2018randomized, lacotte2020effective] that consider randomized algorithms for solving regularized linear systems.
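The effective dimension is trivial to compute from a spectrum, which makes the "smoothed count" intuition easy to check. A small self-contained example on a synthetic, geometrically decaying spectrum of our own choosing:

```python
import numpy as np

def effective_dimension(eigvals, mu):
    """d_eff(mu) = tr(A (A + mu I)^{-1}) = sum_j lam_j / (lam_j + mu), as in Eq. (4)."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + mu)))

# A geometrically decaying spectrum: only ~44 eigenvalues exceed mu = 0.01,
# and d_eff smooths that count rather than thresholding it.
spectrum = 0.9 ** np.arange(1000)
d = effective_dimension(spectrum, mu=0.01)
```

Here d comes out at roughly 44, close to the number of eigenvalues exceeding μ and far below the nominal dimension n = 1000.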

Our theory tells us the randomized Nyström preconditioner is successful when its rank is proportional to the effective dimension.

Theorem 1 (Randomized Nyström Preconditioner)

Let A be a psd matrix, and write A_μ = A + μI, where the regularization parameter μ > 0. Define the effective dimension d_eff(μ) as in Eq. 4. Construct the randomized preconditioner P from Eqs. 2 and 3 with rank parameter ℓ = 2⌈1.5 d_eff(μ)⌉ + 1. Then the condition number of the preconditioned system satisfies

$$ \mathbb{E}\bigl[\kappa_2(P^{-1/2} A_\mu P^{-1/2})\bigr] < 28. \tag{5} $$

Theorem 1 is a restatement of Theorem 3.

Simple probability bounds follow from Eq. 5 via Markov’s inequality. For example,

$$ \mathbb{P}\bigl\{\kappa_2(P^{-1/2} A_\mu P^{-1/2}) \le 56\bigr\} > 1/2. $$

The main consequence of Theorem 1 is a convergence theorem for PCG with the randomized Nyström preconditioner.

Corollary 1 (Nyström PCG: Convergence)

Construct the preconditioner P as in Theorem 1, and condition on the event {κ₂(P^{-1/2} A_μ P^{-1/2}) ≤ 56}. Solve the regularized linear system Eq. 1 using Nyström PCG, starting with the initial iterate x₀ = 0. After t iterations, the relative error satisfies

$$ \delta_t := \frac{\lVert x_t - x_\star \rVert_{A_\mu}}{\lVert x_\star \rVert_{A_\mu}} < 2\cdot(0.77)^{t} \qquad \text{where } A_\mu x_\star = b. $$

The error norm is defined as ∥u∥²_{A_μ} = uᵀA_μu. In particular, t = ⌈3.9 log(2/ε)⌉ iterations suffice to achieve relative error ε.

Although Theorem 1 gives an interpretable bound for the rank of the preconditioner, we cannot instantiate it without knowledge of the effective dimension. To address this shortcoming, we have designed adaptive methods for selecting the rank in practice (Section 5.4).

Finally, as part of our investigation, we will also develop a detailed understanding of Nyström sketch-and-solve, a popular algorithm in the machine learning literature [el2014fast, bach2013sharp]. Our analysis highlights the deficiencies of Nyström sketch-and-solve relative to Nyström PCG.

1.3 Example: Ridge Regression

As a concrete example, we consider the regularized least-squares problem, also known as ridge regression. This problem takes the form

$$ \underset{x\in\mathbb{R}^{d}}{\text{minimize}}\;\; \frac{1}{2n}\,\lVert Gx-b\rVert^{2} + \frac{\mu}{2}\,\lVert x\rVert^{2}, \tag{6} $$

where G ∈ ℝ^{n×d}, b ∈ ℝⁿ, and μ > 0. By calculus, the solution to Eq. 6 also satisfies the regularized system of linear equations

$$ \Bigl(\frac{1}{n}\, G^{T}G + \mu I\Bigr)x = \frac{1}{n}\,G^{T}b. \tag{7} $$

A direct method to solve Eq. 7 requires O(nd²) flops, which is prohibitive when n and d are both large. In that regime, iterative algorithms, such as the conjugate gradient method (CG), become the tools of choice. Unfortunately, the ridge regression linear system Eq. 7 is often very ill-conditioned, and CG converges very slowly.

Nyström PCG can dramatically accelerate the solution of Eq. 7. As an example, consider the shuttle-rf dataset (Section 6.2), where the preconditioner is based on a low-rank Nyström approximation. Figure 1 shows the progress of the residual as a function of the iteration count. Nyström PCG converges to machine precision in 13 iterations, while CG stalls.

1.4 Roadmap

Section 2 contains an overview of the Nyström approximation and its key properties. Section 3 studies the role of the Nyström approximation in estimating the inverse of the regularized matrix. We analyze the Nyström sketch-and-solve method in Section 4, and we give a rigorous performance bound for this algorithm. Section 5 presents a full treatment of Nyström PCG, including theoretical results and guidance on numerical implementation. Computational experiments in Section 6 demonstrate the power of Nyström PCG for three different applications involving real data sets.

1.5 Notation

We write 𝕊_n(ℝ) for the linear space of n × n real symmetric matrices, while 𝕊_n⁺(ℝ) denotes the convex cone of real psd matrices. The symbol ⪯ denotes the Loewner order on 𝕊_n(ℝ). That is, A ⪯ B if and only if the eigenvalues of B − A are all nonnegative. The function tr(·) returns the trace of a square matrix. The map λ_j(A) returns the jth largest eigenvalue of A; we may omit the matrix if it is clear from context. As usual, κ₂ denotes the ℓ₂ condition number. We write ∥M∥ for the spectral norm of a matrix M. For a psd matrix A, we write ∥u∥_A = (uᵀAu)^{1/2} for the A-norm. Given A ∈ 𝕊_n⁺(ℝ) and ℓ < n, the symbol ⌊A⌋_ℓ refers to any best rank-ℓ approximation to A relative to the spectral norm. For μ > 0, the regularized matrix is abbreviated A_μ = A + μI, and the effective dimension of A is defined as d_eff(μ) = tr(A(A + μI)⁻¹). For 1 ≤ p ≤ n, the p-stable rank of A is defined as sr_p(A) = λ_p⁻¹ ∑_{j≥p} λ_j(A). Finally, we denote the time taken to compute a matvec with A by T_mv.

2 The Nyström approximation

Let us begin with a review of the Nyström approximation and the randomized Nyström approximation.

2.1 Definition and basic properties

The Nyström approximation is a natural way to construct a low-rank psd approximation of a psd matrix A ∈ 𝕊_n⁺(ℝ). Let Ω ∈ ℝ^{n×ℓ} be an arbitrary test matrix. The Nyström approximation of A with respect to the range of Ω is defined by

$$ A\langle\Omega\rangle = (A\Omega)\,(\Omega^{T}A\Omega)^{\dagger}(A\Omega)^{T}. \tag{8} $$

The Nyström approximation is the best psd approximation of A whose range coincides with the range of AΩ. It has a deep relationship with the Schur complement and with Cholesky factorization [MartTroppSurvey, Sec. 14].

The Nyström approximation enjoys several elementary properties that we record in the following lemma.

Lemma 1

Let A⟨Ω⟩ be a Nyström approximation of the psd matrix A. Then

1. The approximation is psd and has rank at most ℓ.

2. The approximation depends only on range(Ω).

3. In the Loewner order, A⟨Ω⟩ ⪯ A.

4. In particular, the eigenvalues satisfy λ_j(A⟨Ω⟩) ≤ λ_j(A) for each 1 ≤ j ≤ n.

The proof of Lemma 1, Item 3 is not completely obvious. It is a consequence of the fact that we may express A⟨Ω⟩ = A^{1/2} Π A^{1/2}, where Π is an orthogonal projector.
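The eigenvalue-domination property (Items 3 and 4) is easy to verify numerically. The sketch below forms a Nyström approximation directly from the defining formula on a small, well-conditioned matrix of our own construction; the direct formula is fine for a sanity check, although Section 2.2 warns against it in serious computations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell = 50, 8
G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)                           # a well-conditioned psd matrix
Omega = rng.standard_normal((n, ell))             # arbitrary (here Gaussian) test matrix
Y = A @ Omega
A_nys = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T     # Nystrom approximation, Eq. (8)
A_nys = (A_nys + A_nys.T) / 2                     # symmetrize against roundoff

eig_A = np.linalg.eigvalsh(A)[::-1]               # eigenvalues in decreasing order
eig_nys = np.linalg.eigvalsh(A_nys)[::-1]
```

One finds that A_nys is psd with rank at most ℓ and that λ_j(A_nys) ≤ λ_j(A) for every j, consistent with the Loewner-order bound A⟨Ω⟩ ⪯ A.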

2.2 Randomized Nyström approximation

How should we choose the test matrix Ω so that the Nyström approximation provides a good low-rank model for A? Surprisingly, we can obtain a good approximation simply by drawing the test matrix at random. See [tropp2017fixed] for theoretical justification of this claim.

Let us outline the construction of the randomized Nyström approximation. Draw a standard normal test matrix Ω ∈ ℝ^{n×ℓ}, and compute the sketch Y = AΩ. The Nyström approximation Eq. 8 is constructed directly from the test matrix Ω and the sketch Y:

$$ \hat{A}_{\mathrm{nys}} = A\langle\Omega\rangle = Y(\Omega^{T}Y)^{\dagger}Y^{T}. \tag{9} $$

The formula Eq. 9 is not numerically sound. We refer the reader to Algorithm 1 for a stable and efficient implementation of the randomized Nyström approximation [li2017algorithm, tropp2017fixed]. Conveniently, Algorithm 1 returns the truncated eigendecomposition Â_nys = UΛ̂Uᵀ, where U ∈ ℝ^{n×ℓ} is an orthonormal matrix whose columns are eigenvectors and Λ̂ ∈ ℝ^{ℓ×ℓ} is a diagonal matrix listing the eigenvalues, which we often abbreviate as λ̂₁ ≥ ⋯ ≥ λ̂_ℓ.

The randomized Nyström approximation described in this section has a key difference from the Nyström approximations that have traditionally been used in the machine learning literature [el2014fast, bach2013sharp, derezinski2020improved, gittens2011spectral, williams2001using]. In machine learning settings, the Nyström approximation is usually constructed from a sketch that samples random columns from the matrix (i.e., the random test matrix has 1-sparse columns). In contrast, Algorithm 1 computes a sketch via random projection (i.e., the test matrix is standard normal). In most applications, we have strong reasons (Section 2.2.3) for preferring random projections to column sampling.

2.2.1 Cost of randomized Nyström approximation

Throughout the paper, we write T_mv for the time required to compute a matrix–vector product (matvec) with A. Forming the sketch Y = AΩ requires ℓ matvecs, which costs ℓ·T_mv. The other steps in the algorithm have arithmetic cost O(ℓ²n). Hence, the total computational cost of Algorithm 1 is O(ℓ·T_mv + ℓ²n) operations. The storage cost is O(ℓn) floating-point numbers.

For Algorithm 1, the worst-case performance occurs when A is dense and unstructured. In this case, forming Y = AΩ costs O(n²ℓ) operations. However, if we have access to the columns of A, then we may reduce the cost of forming Y to O(n² log ℓ) by using a structured test matrix Ω, such as a scrambled subsampled randomized Fourier transform (SSRFT) map or a sparse map [MartTroppSurvey, tropp2017fixed].

2.2.2 A priori guarantees for the randomized Nyström approximation

In this section, we present an a priori error bound for the randomized Nyström approximation. The result improves over previous analyses [gittens2011spectral, gittens2013revisiting, tropp2017fixed] by sharpening the error terms. This refinement is critical for the analysis of the preconditioner.

Proposition 1 (Randomized Nyström approximation: Error)

Consider a psd matrix A with eigenvalues λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_n. Choose a rank parameter ℓ ≥ 4, and draw a standard normal test matrix Ω ∈ ℝ^{n×ℓ}. Then the rank-ℓ Nyström approximation Â_nys computed by Algorithm 1 satisfies

$$ \mathbb{E}\,\lVert A-\hat{A}_{\mathrm{nys}}\rVert \;\le\; \min_{2\le p\le \ell-2}\Biggl[\Bigl(1+\frac{2(\ell-p)}{p-1}\Bigr)\lambda_{\ell-p+1} + \frac{2e^{2}\ell}{p^{2}-1}\sum_{j>\ell-p}\lambda_j\Biggr]. \tag{10} $$

The proof of Proposition 1 may be found in Section B.1.

Proposition 1 shows that, in expectation, the randomized Nyström approximation provides a good rank-ℓ approximation to A. For each p, the first term in the bound is comparable with the spectral-norm error in an optimal rank-(ℓ−p) approximation, λ_{ℓ−p+1}. The second term in the bound is comparable with the trace-norm error in an optimal rank-(ℓ−p) approximation.

Proposition 1 is better understood via the following simplification.

Corollary 2 (Randomized Nyström approximation)

Instate the assumptions of Proposition 1. For p ≥ 2 and ℓ = 2p − 1, we have the bound

$$ \mathbb{E}\,\lVert A-\hat{A}_{\mathrm{nys}}\rVert \;\le\; \Bigl(3+\frac{4e^{2}}{p}\,\mathrm{sr}_p(A)\Bigr)\lambda_p. $$

The p-stable rank, sr_p(A) = λ_p⁻¹ ∑_{j≥p} λ_j(A), reflects decay in the tail eigenvalues.

Corollary 2 shows that the Nyström approximation error is on the order of λ_p when the rank parameter ℓ = 2p − 1. The constant depends on the p-stable rank sr_p(A), which is small when the tail eigenvalues decay quickly starting at λ_p. This bound is critical for establishing our main results (Theorems 2 and 3).

2.2.3 Random projection versus column sampling

Most papers in the machine learning literature [el2014fast, bach2013sharp] construct Nyström approximations by sampling columns at random from an adaptive distribution. In contrast, for most applications, we advocate using an oblivious random projection of the matrix to construct a Nyström approximation.

Random projection has several advantages over column sampling. First, column sampling may not be practical when we only have black-box matvec access to the matrix, while random projections are natural in this setting. Second, it can be very expensive to obtain adaptive distributions for column sampling. Indeed, computing approximate ridge leverage scores costs just as much as solving the ridge regression problem directly using random projections [drineas2012fast, Theorem 2]. Third, even with a good sampling distribution, column sampling produces higher-variance results than random projection, so it is far less reliable.

On the other hand, we have found that there are a few applications where it is more effective to compute a randomized Nyström preconditioner using column sampling in lieu of random projections. In particular, this seems to be the case for kernel ridge regression (Section 6.5). Indeed, the entries of the kernel matrix are given by an explicit formula, so we can extract full columns with ease. Sampling ℓ columns may cost only O(ℓn) operations, whereas a single matvec with the kernel matrix generally costs O(n²). Furthermore, kernel matrices usually exhibit fast spectral decay, which limits the performance loss that results from using column sampling in lieu of random projection.

3 Approximating the regularized inverse

Let us return to the regularized linear system Eq. 1. The solution to the problem has the form x⋆ = (A + μI)⁻¹b. Given a good approximation Â to the matrix A, it is natural to ask whether x̂ = (Â + μI)⁻¹b is a good approximation to the desired solution x⋆.

There are many reasons why we might prefer to use Â in place of A. In particular, we may be able to solve linear systems in the matrix Â + μI more efficiently. On the other hand, the utility of this approach depends on how well the inverse (Â + μI)⁻¹ approximates the desired inverse (A + μI)⁻¹. The next result addresses this question for a wide class of approximations that includes the Nyström approximation.

Proposition 2 (Regularized inverses)

Consider psd matrices A and Â, and assume that the difference E = A − Â is psd. Fix μ > 0. Then

$$ \bigl\lVert (\hat{A}+\mu I)^{-1} - (A+\mu I)^{-1} \bigr\rVert \;\le\; \frac{1}{\mu}\cdot\frac{\lVert E\rVert}{\lVert E\rVert+\mu}. \tag{11} $$

Furthermore, the bound Eq. 11 is attained, so it cannot be improved.

The proof of Proposition 2 may be found in Section B.1.1. It is based on [bhatia2013matrix, Lemma X.1.4].

Proposition 2 has an appealing interpretation. When ∥E∥ is small in comparison with the regularization parameter μ, the approximate inverse (Â + μI)⁻¹ can serve in place of the inverse (A + μI)⁻¹. Note that ∥(A + μI)⁻¹∥ ≤ μ⁻¹, so we can view Eq. 11 as a relative error bound.

4 Nyström sketch-and-solve

The simplest mechanism for using the Nyström approximation is an algorithm called Nyström sketch-and-solve. This section introduces the method, its implementation, and its history. We also provide a general theoretical analysis that sheds light on its performance. In spite of its popularity, the Nyström sketch-and-solve method is rarely worth serious consideration.

4.1 Overview

Given a rank-ℓ Nyström approximation Â_nys of the psd matrix A, it is tempting to replace the regularized linear system (A + μI)x = b with the proxy (Â_nys + μI)x̂ = b. Indeed, given the eigendecomposition of Â_nys, we can solve the proxy linear system in O(ℓn) time using the Sherman–Morrison–Woodbury formula [golubvanloan2013, Eqn. (2.1.4)]:

Lemma 2 (Approximate regularized inversion)

Consider any rank-ℓ matrix Â with eigenvalue decomposition Â = UΛ̂Uᵀ. Then

$$ (\hat{A}+\mu I)^{-1} = U(\hat{\Lambda}+\mu I)^{-1}U^{T} + \frac{1}{\mu}\,(I-UU^{T}). \tag{12} $$

We refer to the approach in this paragraph as the Nyström sketch-and-solve algorithm because it is modeled on the sketch-and-solve paradigm that originated in [sarlos2006improved].
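In code, the solve in Lemma 2 is only a few lines; given the truncated eigendecomposition, each right-hand side costs O(ℓn). A minimal sketch (the function name is ours):

```python
import numpy as np

def sketch_and_solve(U, lam, mu, b):
    """Solve (A_hat + mu I) x = b via Eq. (12), where A_hat = U diag(lam) U^T."""
    Utb = U.T @ b
    return U @ (Utb / (lam + mu)) + (b - U @ Utb) / mu
```

The two terms act separately on range(U) and its orthogonal complement, which is what makes the formula so cheap to apply.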

See Algorithm 2 for a summary of the Nyström sketch-and-solve method. The algorithm produces an approximate solution to the regularized linear system Eq. 1 in O(ℓ·T_mv + ℓ²n) time. This is much faster than a direct method, which costs O(n³), and it can also be faster than running CG for many iterations at a cost of T_mv per iteration. The method looks attractive if we only consider the runtime, and yet…

Nyström sketch-and-solve has only one parameter, the rank ℓ of the Nyström approximation, which controls the quality of the approximate solution x̂. When ℓ ≪ n, the method has an appealing computational profile. As ℓ increases, the approximation quality increases but the computational burden becomes heavy. We will show that, alas, an accurate solution to the linear system actually requires a rank ℓ that approaches n, at which point the computational benefits of Nyström sketch-and-solve evaporate completely.

In summary, Nyström sketch-and-solve is almost never the right algorithm to use. We will see that Nyström PCG generally produces much more accurate solutions with a similar computational cost.

4.2 Guarantees and deficiencies

Using Proposition 2 together with the a priori guarantee in Proposition 1, we quickly obtain a performance guarantee for Algorithm 2.

Theorem 2

Fix p ≥ 2, and set ℓ = 2p − 1. For a psd matrix A, construct a randomized Nyström approximation Â_nys using Algorithm 1. Then the approximation error for the inverse satisfies

$$ \mathbb{E}\,\bigl\lVert (A+\mu I)^{-1} - (\hat{A}_{\mathrm{nys}}+\mu I)^{-1} \bigr\rVert \;\le\; \frac{\bigl(3+\tfrac{4e^{2}}{p}\,\mathrm{sr}_p(A)\bigr)\lambda_p}{\mu\,(\lambda_p+\mu)}. \tag{13} $$

Define x⋆ = (A + μI)⁻¹b. Then the approximate solution x̂ = (Â_nys + μI)⁻¹b computed by Algorithm 2 satisfies

$$ \frac{\mathbb{E}\,\lVert \hat{x} - x_\star \rVert}{\lVert x_\star \rVert} \;\le\; (\lambda_1+\mu)\cdot\frac{\bigl(3+\tfrac{4e^{2}}{p}\,\mathrm{sr}_p(A)\bigr)\lambda_p}{\mu\,(\lambda_p+\mu)}. \tag{14} $$

The proof of Theorem 2 may be found in Section A.1.

Theorem 2 tells us how accurately we can hope to solve linear systems using Nyström sketch-and-solve (Algorithm 2). To obtain relative error ε, we must make λ_p/μ on the order of ε, which forces the rank ℓ to grow as ε decreases. When ε is small, we anticipate that the required rank approaches n. In this case, Nyström sketch-and-solve has no computational value. Our analysis is sharp in its essential respects, so this pessimistic assessment is irremediable.

4.3 History

Nyström sketch-and-solve has a long history in the machine learning literature. It was introduced in [williams2001using] to speed up kernel-based learning, and it plays a role in many subsequent papers on kernel methods. In this context, the Nyström approximation is typically obtained using column sampling [el2014fast, bach2013sharp, williams2001using], which has its limitations (Section 2.2.3). More recently, Nyström sketch-and-solve has been applied to speed up approximate cross-validation [stephenson2020LowRank].

The analysis of Nyström sketch-and-solve presented above differs from previous analysis. Prior works [el2014fast, bach2013sharp] focus on the kernel setting, and they use properties of column sampling schemes to derive learning guarantees. In contrast, we bound the relative error for a Nyström approximation based on a random projection. Our overall approach extends to column sampling if we replace Proposition 1 by an appropriate analog, such as Gittens’s results [gittens2011spectral].

5 Nyström Preconditioned Conjugate Gradients

We now present our main algorithm, Nyström PCG. This algorithm produces high accuracy solutions to a regularized linear system by using the Nyström approximation as a preconditioner. We provide a rigorous estimate for the condition number of the preconditioned system, and we prove that Nyström PCG leads to fast convergence for regularized linear systems. In contrast, we have shown that Nyström sketch-and-solve cannot be expected to yield accurate solutions.

5.1 The preconditioner

In this section, we introduce the optimal low-rank preconditioner, and we argue that the randomized Nyström preconditioner provides an approximation that is easy to compute.

5.1.1 Motivation

As a warmup, suppose we knew the eigenvalue decomposition ⌊A⌋_ℓ = V_ℓΛ_ℓV_ℓᵀ of the best rank-ℓ approximation of the matrix A. How should we use this information to construct a good preconditioner for the regularized linear system Eq. 1?

Consider the family of symmetric psd matrices that act as the identity on the orthogonal complement of range(V_ℓ). Within this class, we claim that the following matrix is the optimal preconditioner:

$$ P_\star = \frac{1}{\lambda_{\ell+1}+\mu}\, V_\ell(\Lambda_\ell+\mu I)V_\ell^{T} + (I - V_\ell V_\ell^{T}). \tag{15} $$

The optimal preconditioner requires O(ℓn) storage, and we can solve linear systems in P⋆ in O(ℓn) time. Whereas the regularized matrix A_μ has condition number (λ₁ + μ)/(λ_n + μ), the preconditioner P⋆ yields

$$ \kappa_2\bigl(P_\star^{-1/2} A_\mu P_\star^{-1/2}\bigr) = \frac{\lambda_{\ell+1}+\mu}{\lambda_n+\mu}. \tag{16} $$

This is the minimum possible condition number attainable by a preconditioner from the class that we have delineated. It represents a significant improvement when λ_{ℓ+1} ≪ λ₁. The proofs of these claims are straightforward; for details, see Section B.1.2.
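These claims are easy to check numerically. The snippet below builds a psd matrix with known eigenvectors, forms P⋆ from Eq. 15, and compares the observed condition number with Eq. 16; the test matrix and decay profile are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, mu = 50, 5, 1e-3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = 2.0 ** -np.arange(n)                     # fast spectral decay, lam_1 = 1
A = (Q * lam) @ Q.T
V = Q[:, :ell]                                 # top-ell eigenvectors

# optimal preconditioner, Eq. (15)
P = (V * (lam[:ell] + mu)) @ V.T / (lam[ell] + mu) + np.eye(n) - V @ V.T

A_mu = A + mu * np.eye(n)
w, W = np.linalg.eigh(P)
P_inv_half = (W / np.sqrt(w)) @ W.T            # P^{-1/2}
kappa = np.linalg.cond(P_inv_half @ A_mu @ P_inv_half)
predicted = (lam[ell] + mu) / (lam[-1] + mu)   # Eq. (16)
```

For this spectrum, κ₂(A_μ) ≈ (λ₁ + μ)/(λ_n + μ) ≈ 1000, whereas the preconditioned condition number drops to (λ_{ℓ+1} + μ)/(λ_n + μ) ≈ 32, matching the prediction.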

5.1.2 Randomized Nyström preconditioner

It is expensive to compute the best rank-ℓ approximation ⌊A⌋_ℓ accurately. In contrast, we can compute the rank-ℓ randomized Nyström approximation Â_nys efficiently (Algorithm 1). Furthermore, we have seen that Â_nys approximates A nearly as well as the optimal rank-ℓ approximation (Corollary 2). These facts lead us to study the randomized Nyström preconditioner, proposed in [MartTroppSurvey, Sec. 17] without a complete justification.

Consider the eigenvalue decomposition Â_nys = UΛ̂Uᵀ, and write λ̂_ℓ for its ℓth eigenvalue. The randomized Nyström preconditioner and its inverse take the form

$$ P = \frac{1}{\hat{\lambda}_{\ell}+\mu}\, U(\hat{\Lambda}+\mu I)U^{T} + (I - UU^{T}); \qquad P^{-1} = (\hat{\lambda}_{\ell}+\mu)\, U(\hat{\Lambda}+\mu I)^{-1}U^{T} + (I - UU^{T}). \tag{17} $$

Like the optimal preconditioner P⋆, the randomized Nyström preconditioner Eq. 17 is cheap to apply and to store. We may hope that it damps the condition number of the preconditioned system nearly as well as the optimal preconditioner P⋆ does. We will support this intuition with a rigorous bound (Proposition 3).

5.2 Nyström PCG

We can obviously use the randomized Nyström preconditioner within the framework of PCG. We call this approach Nyström PCG, and we present a basic implementation in Algorithm 3.

More precisely, Algorithm 3 uses left-preconditioned CG. This algorithm implicitly works with the unsymmetric matrix P⁻¹A_μ rather than the symmetric matrix P^{-1/2}A_μP^{-1/2}. The two methods yield identical sequences of iterates [saad2003iterative], but the former is more efficient. For ease of analysis, our theoretical results are presented in terms of the symmetrically preconditioned matrix.

5.2.1 Complexity of Nyström PCG

Nyström PCG has two steps. First, we construct the randomized Nyström approximation, and then we solve the regularized linear system using PCG. We have already discussed the cost of constructing the Nyström approximation (Section 2.2.1). The PCG stage costs O(ℓn + T_mv) operations per iteration, and it uses a total of O(n) additional storage.

For the regularized linear system Eq. 1, Theorem 3 and Corollary 3 demonstrate that it is enough to choose the rank ℓ = O(d_eff(μ)). In this case, the overall runtime of Nyström PCG is

$$ O\bigl(d_{\mathrm{eff}}(\mu)^{2}\, n + T_{\mathrm{mv}}\,\bigl(d_{\mathrm{eff}}(\mu)+\log(1/\epsilon)\bigr)\bigr) \quad \text{operations.} $$

When the effective dimension is modest, Nyström PCG is very efficient.
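Putting the two stages together, Nyström PCG can be sketched with SciPy's CG by supplying P⁻¹ as the preconditioner operator. This is a schematic implementation under the same assumptions as our earlier sketches; the function name and parameter choices are ours:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def nystrom_pcg(A, b, mu, ell, seed=None):
    """Solve (A + mu I) x = b with CG preconditioned by the Nystrom preconditioner."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # --- Stage 1: shift-stabilized randomized Nystrom approximation ---
    Omega, _ = np.linalg.qr(rng.standard_normal((n, ell)))
    Y = A @ Omega
    nu = np.finfo(Y.dtype).eps * np.linalg.norm(Y)
    Y_nu = Y + nu * Omega
    C = np.linalg.cholesky(Omega.T @ Y_nu)
    B = solve_triangular(C, Y_nu.T, lower=True).T
    U, S, _ = np.linalg.svd(B, full_matrices=False)
    lam = np.maximum(S**2 - nu, 0.0)

    # --- Stage 2: PCG, with P^{-1} from Eq. (17) as the preconditioner ---
    def p_inv(x):
        Utx = U.T @ x
        return (lam[-1] + mu) * (U @ (Utx / (lam + mu))) + (x - U @ Utx)

    A_mu = LinearOperator((n, n), matvec=lambda x: A @ x + mu * x)
    M = LinearOperator((n, n), matvec=p_inv)
    return cg(A_mu, b, M=M, maxiter=100)
```

Here `cg` returns a pair `(x, info)` with `info == 0` on convergence; with ℓ on the order of the effective dimension, convergence typically takes only a modest number of iterations.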

In contrast, Section 4.2 shows that the running time for Nyström sketch-and-solve has the same form, but with the much larger rank needed for accuracy ε in place of d_eff(μ). This is a massive difference. Nyström PCG can produce solutions whose residual norm is close to machine precision; this type of successful computation is impossible with Nyström sketch-and-solve.

5.2.2 Block Nyström PCG

We can also use the Nyström preconditioner with the block CG algorithm [o1980block] to solve regularized linear systems with multiple right-hand sides. For this approach, we also use an orthogonalization pre-processing proposed in [feng1995block] that ensures numerical stability without further orthogonalization steps during the iteration.

5.3 Analysis of Nyström PCG

We now turn to the analysis of the randomized Nyström preconditioner Eq. 17. Theorem 3 provides a bound on the rank ℓ of the Nyström preconditioner that reduces the condition number of the preconditioned system to a constant. In this case, we deduce that Nyström PCG converges rapidly (Corollary 3).

Theorem 3 (Nyström preconditioning)

Suppose we construct the Nyström preconditioner P in Eq. 17 using Algorithm 1 with rank ℓ = 2⌈1.5 d_eff(μ)⌉ + 1. Using P to precondition the regularized matrix A_μ results in the condition number bound

$$ \mathbb{E}\bigl[\kappa_2(P^{-1/2} A_\mu P^{-1/2})\bigr] < 28. $$

The proof of Theorem 3 may be found in Section 5.3.3.

Theorem 3 has several appealing features. Many other authors have noticed that the effective dimension controls sample size requirements for particular applications such as discriminant analysis [chowdhury2018randomized], ridge regression [lacotte2020effective], and kernel ridge regression [el2014fast, bach2013sharp]. In contrast, our result holds for any regularized linear system.

Our argument makes the role of the effective dimension conceptually simpler, and it leads to explicit, practical parameter recommendations. Indeed, the effective dimension is essentially the rank that makes the Nyström approximation error E∥A − Â_nys∥ proportional to the regularization μ. In previous arguments, such as those in [el2014fast, bach2013sharp, chowdhury2018randomized], the effective dimension arises because the authors reduce the analysis to approximate matrix multiplication [cohen2015optimal], which produces inscrutable constant factors.

Theorem 3 ensures that Nyström PCG converges quickly.

Corollary 3 (Nyström PCG: Convergence)

Construct the preconditioner P as in Theorem 3, and condition on the event {κ₂(P^{-1/2}A_μP^{-1/2}) ≤ 56}. If we initialize Algorithm 3 with the initial iterate x₀ = 0, then the relative error in the iterate x_t satisfies

$$ \delta_t = \frac{\lVert x_t - x_\star \rVert_{A_\mu}}{\lVert x_\star \rVert_{A_\mu}} < 2\cdot(0.77)^{t} \qquad \text{where } A_\mu x_\star = b. $$

In particular, after t = ⌈3.9 log(2/ε)⌉ iterations, we have relative error δ_t ≤ ε.

Corollary 3 is an immediate consequence of the standard convergence result for CG [trefethen1997numerical, Theorem 38.5, p. 299]; see Section A.2.

5.3.1 Analyzing the condition number

The first step in the proof of Theorem 3 is a deterministic bound on how the preconditioner Eq. 17 reduces the condition number of the regularized matrix A_μ. Let us emphasize that this bound is valid for any rank-ℓ Nyström approximation, regardless of the choice of test matrix.

Proposition 3 (Nyström preconditioner: Deterministic bound)

Let Â = A⟨Ω⟩ be any rank-ℓ Nyström approximation with ℓth largest eigenvalue λ̂_ℓ, and let E = A − Â be the approximation error. Construct the Nyström preconditioner P as in Eq. 17. Then the condition number of the preconditioned matrix satisfies

$$ \max\Bigl\{\frac{\hat{\lambda}_{\ell}+\mu}{\lambda_n+\mu},\,1\Bigr\} \;\le\; \kappa_2\bigl(P^{-1/2} A_\mu P^{-1/2}\bigr) \;\le\; \bigl(\hat{\lambda}_{\ell}+\mu+\lVert E\rVert\bigr)\cdot \min\Bigl\{\frac{1}{\mu},\; \frac{\hat{\lambda}_{\ell}+\lambda_n+2\mu}{(\hat{\lambda}_{\ell}+\mu)(\lambda_n+\mu)}\Bigr\}. \tag{18} $$

For the proof of Proposition 3 see Section A.1.1.

To interpret the result, recall the expression Eq. 16 for the condition number induced by the optimal preconditioner. Proposition 3 shows that the Nyström preconditioner may reduce the condition number almost as well as the optimal preconditioner.

In particular, when ∥E∥ = O(μ), the condition number of the preconditioned system is bounded by a constant, independent of the spectrum of A. In this case, Nyström PCG is guaranteed to converge quickly.

5.3.2 The effective dimension and rank selection

How should we choose the rank ℓ to guarantee that ∥E∥ = O(μ)? Corollary 2 shows how the error in the rank-ℓ randomized Nyström approximation depends on the spectrum of A through the eigenvalues λ_p and the tail stable rank sr_p(A). In this section, we present a lemma which demonstrates that the effective dimension controls both quantities. As a consequence of this bound, we will be able to choose the rank ℓ proportional to the effective dimension d_eff(μ).

Recall from Eq. 4 that the effective dimension of the matrix A at level μ is defined as

$$ d_{\mathrm{eff}}(\mu) = \operatorname{tr}\bigl(A(A+\mu I)^{-1}\bigr) = \sum_{j=1}^{n} \frac{\lambda_j(A)}{\lambda_j(A)+\mu}. \tag{19} $$

As previously mentioned, d_eff(μ) may be viewed as a smoothed count of the eigenvalues larger than μ. Thus, one may expect that λ_j(A) ≲ μ once the index j exceeds d_eff(μ). This intuition is correct, and it forms the content of Lemma 3.

Lemma 3 (Effective dimension)

Let A be a psd matrix with eigenvalues λ₁ ≥ ⋯ ≥ λ_n. Let μ > 0 be a regularization parameter, and define the effective dimension d_eff(μ) as in Eq. 19. The following statements hold.

1. Fix γ > 0. If j ≥ (1 + γ⁻¹) d_eff(μ), then λ_j(A) ≤ γμ.

2. If k ≥ d_eff(μ), then ∑_{j>k} λ_j(A) ≤ μ·d_eff(μ).

The proof of Lemma 3 may be found in Section A.1.2.

Lemma 3, Item 1 captures the intuitive fact that A has few eigenvalues that are much larger than μ: at most (1 + γ⁻¹)d_eff(μ) eigenvalues exceed γμ. Similarly, Item 2 states that the effective dimension controls the sum of all the eigenvalues whose index exceeds the effective dimension. It is instructive to think about the meaning of these results when d_eff(μ) is small.

5.3.3 Proof of Theorem 3

We are now prepared to prove Theorem 3. The key ingredients in the proof are Proposition 1, Proposition 3, and Lemma 3.

Proof (Proof of Theorem 3)

Fix the rank ℓ = 2⌈1.5 d_eff(μ)⌉ + 1. Construct the rank-ℓ randomized Nyström approximation Â_nys with eigenvalues λ̂_1 ≥ ⋯ ≥ λ̂_ℓ. Write E = A − Â_nys for the approximation error. Form the preconditioner P via Eq. 17. We must bound the expected condition number of the preconditioned matrix P^{-1/2} A_μ P^{-1/2}.

First, we apply Proposition 3 to obtain a deterministic bound that is valid for any rank- Nyström preconditioner:

 κ₂(P^{-1/2} A_μ P^{-1/2}) ≤ (λ̂_ℓ + μ + ‖E‖)/μ ≤ 2 + ‖E‖/μ.

The second inequality holds because λ̂_ℓ ≤ λ_ℓ ≤ μ. This is a consequence of Lemma 1, Item 4 and Lemma 3, Item 1 with γ = 1. We rely on the fact that ℓ ≥ 2 d_eff(μ).

Decompose ℓ = 2p − 1, where p = ⌈1.5 d_eff(μ)⌉ + 1. Take the expectation, and invoke Corollary 2 to obtain

 E[κ₂(P^{-1/2} A_μ P^{-1/2})] ≤ 2 + (3 + (4e²/p) sr_p(A)) (λ_p/μ).

By definition, sr_p(A) λ_p = Σ_{j≥p} λ_j. To complete the bound, apply Lemma 3 twice. We use Item 1 with γ = 2 and Item 2 with j = p to reach

 E[κ₂(P^{-1/2} A_μ P^{-1/2})] ≤ 2 + (3·2μ + 4e²·(2μ/3))/μ < 2 + 26 = 28,

which is the desired result.
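To make the constants concrete, here is a small Python/NumPy sketch (our own illustration; the test matrix, rank, and μ are arbitrary choices, and the bare-bones Nyström construction below omits the numerical stabilization preferred in practice) that builds the preconditioner of Eq. 17 and compares the preconditioned and unpreconditioned condition numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, ell = 500, 1e-3, 60

# Synthetic PSD test matrix with polynomial spectral decay (a toy problem,
# not one of the paper's datasets): A = V diag(j^-2) V^T.
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = 1.0 / np.arange(1, n + 1) ** 2
A = (V * lam) @ V.T

# Rank-ell randomized Nystrom approximation: A_nys = Y (Om^T Y)^+ Y^T, Y = A Om.
Om = rng.standard_normal((n, ell))
Y = A @ Om
A_nys = Y @ np.linalg.pinv(Om.T @ Y) @ Y.T
A_nys = (A_nys + A_nys.T) / 2            # symmetrize against rounding

# Preconditioner of Eq. 17 built from A_nys = U diag(lam_hat) U^T:
#   P = (lam_hat_ell + mu)^{-1} U (diag(lam_hat) + mu I) U^T + (I - U U^T).
w, Q = np.linalg.eigh(A_nys)
lam_hat, U = w[::-1][:ell], Q[:, ::-1][:, :ell]   # top-ell eigenpairs, descending
P = (U * ((lam_hat + mu) / (lam_hat[-1] + mu))) @ U.T + np.eye(n) - U @ U.T

# Condition number of P^{-1/2} A_mu P^{-1/2} versus that of A_mu itself.
wp, Qp = np.linalg.eigh(P)
P_ih = (Qp / np.sqrt(wp)) @ Qp.T                  # P^{-1/2}
A_mu = A + mu * np.eye(n)
ev = np.linalg.eigvalsh(P_ih @ A_mu @ P_ih)
kappa_pre, kappa_plain = ev[-1] / ev[0], np.linalg.cond(A_mu)
print(kappa_pre, kappa_plain)
```

On spectra with fast decay, the preconditioned condition number comes out as a small constant while κ₂(A_μ) scales like λ_1/μ, matching the flavor of the theorem.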

5.4 Practical parameter selection

In practice, we may not know the regularization parameter μ in advance, and we rarely know the effective dimension d_eff(μ). As a consequence, we cannot enact the theoretical recommendation for the rank of the Nyström preconditioner: ℓ = 2⌈1.5 d_eff(μ)⌉ + 1. Instead, we need an adaptive method for choosing the rank ℓ. Below, we outline three strategies.

5.4.1 Strategy 1: Choose ℓ as large as the user can afford

The first strategy is to choose the rank as large as the user can afford. This approach is coarse, and it does not yield any guarantees on the cost of the Nyström PCG method.

Nevertheless, once we have constructed a rank-ℓ Nyström approximation Â_nys, we can obtain an a posteriori upper bound on the iteration count of Nyström PCG. Recall that Proposition 3 controls the condition number of the preconditioned system:

 κ₂(P^{-1/2} A_μ P^{-1/2}) ≤ (λ̂_ℓ + μ + ‖E‖)/μ, where E = A − Â_nys. (20)

We can estimate the error ‖E‖ inexpensively with the randomized power method [kuczynski1992estimating]; see Algorithm 4 in Appendix E. Using the standard convergence theory for PCG, we obtain an upper bound on the iteration count. This gives us advance warning about how long it may take to solve the regularized linear system.
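A minimal sketch of this estimate, assuming only a matvec handle for E (the function below is our own illustration, not the paper's Algorithm 4):

```python
import numpy as np

# Randomized power method [kuczynski1992estimating]: approximates the
# spectral norm ||E|| from below using only matvecs with E.  In Nystrom
# PCG one applies A and the low-rank factors without forming E explicitly;
# the explicit toy matrix below is just for checking.
def estimate_spectral_norm(matvec, n, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    est = 0.0
    for _ in range(iters):
        w = matvec(v)
        est = np.linalg.norm(w)    # converges to ||E|| from below
        if est == 0.0:
            return 0.0
        v = w / est
    return est

# Toy check: E is PSD here, mimicking the Nystrom error A - A_nys >= 0.
rng = np.random.default_rng(1)
C = rng.standard_normal((100, 100))
E = C @ C.T / 100
est = estimate_spectral_norm(lambda x: E @ x, 100)
true = np.linalg.eigvalsh(E)[-1]
print(est, true)
```

Because the iterates always satisfy ‖Ev‖ ≤ ‖E‖, the estimate approaches the true norm from below; a few dozen iterations typically suffice for a norm estimate that is accurate to a few percent.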

5.4.2 Strategy 2: Adaptive rank selection by a posteriori error estimation

The second strategy uses the a posteriori condition number estimate adaptively, in a procedure that repeatedly doubles the rank as required. We compute a randomized Nyström approximation with an initial rank ℓ₀, and we estimate ‖E‖ using randomized powering. If the estimate is smaller than a prescribed tolerance, we accept the approximation. If the tolerance is not met, then we double the rank, update the approximation, and estimate ‖E‖ again. The process repeats until the estimate of ‖E‖ falls below the tolerance or the rank breaches a threshold ℓ_max. In the case of regularized linear systems, we proceed until ‖E‖ ≤ τμ for a modest constant τ, whose value we choose based on numerical experience. For full algorithmic details of adaptive rank selection by estimating ‖E‖, see Algorithm 5 in Appendix E.

5.4.3 Strategy 3: Adaptive rank selection by monitoring λ̂_ℓ/μ

The third strategy is almost identical to the second, except that we monitor the ratio λ̂_ℓ/μ instead of ‖E‖/μ. The strategy is rooted in the empirical observation that Nyström PCG converges quickly when λ̂_ℓ/μ is small. Thus, one doubles the approximation rank until λ̂_ℓ/μ falls below the tolerance or the rank reaches the threshold ℓ_max.

6 Applications and experiments

In this section, we study the performance of Nyström PCG on real-world data from three different applications: ridge regression, kernel ridge regression, and approximate cross-validation. The experiments demonstrate the effectiveness of the preconditioner and of our rank-selection strategies in comparison with other algorithms from the literature.

6.1 Preliminaries

We implemented all experiments in MATLAB R2019a and ran them on a server with 128 Intel Xeon E7-4850 v4 2.10GHz CPU cores and 1056 GB of memory. Every numerical experiment in this section was repeated twenty times; tables report the mean over the twenty runs and the standard deviation (in parentheses) when it is non-zero. We highlight the best-performing method in each table in bold.

We select hyperparameters of competing methods by grid search to optimize performance. This procedure tends to be very charitable to the competitors, and it may not be representative of their real-world performance. Indeed, grid search is computationally expensive, and it cannot be used as part of a practical implementation. A detailed overview of the experimental setup for each application may be found in the appropriate section of Appendix C; additional numerical results appear in Appendix D.

6.2 Ridge regression

In this section, we solve the ridge regression problem (7) described in Section 1.3 on some standard machine learning data sets (Table 1) from OpenML [vanschoren2014openml] and LIBSVM [chang2011libsvm]. We compare Nyström PCG to standard CG, the preconditioning method of Rokhlin and Tygert (R&T) [rokhlin2008fast], and the Adaptive Iterative Hessian Sketch (AdaIHS) [lacotte2020effective].

6.2.1 Experimental overview

We perform two sets of experiments: computing regularization paths on CIFAR-10 and Guillermo, and random features regression [rahimi2007random, rahimi2008uniform] on shuttle and smallNORB with specified values of μ. The last two datasets have a number of features comparable to the number of samples; in this setting, R&T provides no advantage over a direct method, so we omit the comparison. We use the Euclidean norm of the residual as our stopping criterion, declaring convergence once it falls below the tolerance. For both sets of experiments, we use Nyström PCG with adaptive rank selection, Algorithm 5 in Appendix E. For complete details of the experimental methodology, see Section C.1.

The regularization path experiments solve Eq. 7 over a decreasing sequence of regularization parameters μ. We begin by solving the problem for the largest μ (initialized at zero), and we solve for progressively smaller μ with warm starting. For each value of μ, every method is allowed at most 500 iterations to reach the desired tolerance.

6.2.2 Computing the regularization path

Fig. 2 shows the evolution of the adaptively selected rank along the regularization path. CIFAR-10 and Guillermo are both small, so we compute the exact effective dimension as a point of reference. We see that the rank cap is reached for both CIFAR-10 and Guillermo once μ is small enough. For CIFAR-10, Nyström PCG chooses a rank much smaller than the effective dimension for small values of μ. Nevertheless, the method still performs well (Fig. 3).

Fig. 3 shows the effectiveness of each method for computing the entire regularization path. Nyström PCG is the fastest of the methods along most of the regularization path. For CIFAR-10, Nyström PCG is faster than R&T until the very end of the regularization path, where μ is smallest. The R&T preconditioner is expensive to form, and this cost pays off only when the regularization parameter is very small. Thus, provided the regularization parameter is not extremely small, we expect Nyström PCG to perform better on ridge regression problems.

AdaIHS is rather slow. It increases the sketch size parameter several times along the regularization path. Each time, AdaIHS must form a new sketch of the matrix, approximate the Hessian, and compute a Cholesky factorization.

6.2.3 Random features regression

Tables 2 and 3 display the results for Nyström PCG and AdaIHS on shuttle-rf and smallNorb-rf. Table 2 shows that Nyström PCG performs best on both datasets on all metrics. The most striking feature is the difference between the chosen sketch sizes: AdaIHS chooses the maximum allowed sketch size, which offers little benefit over a direct method, while Nyström PCG uses a much smaller sketch size. Table 3 contains estimates of the condition number of the preconditioned system, which explain the fast convergence.

6.3 Approximate cross-validation

In this subsection we use our preconditioner to compute approximate leave-one-out cross-validation (ALOOCV), which requires solving a large linear system with multiple right-hand sides.

6.3.1 Background

Cross-validation is an important machine-learning technique for assessing and selecting models and hyperparameters. Generally, it requires re-fitting a model on many subsets of the data, so it can take quite a long time. The worst culprit is leave-one-out cross-validation (LOOCV), which requires running an expensive training algorithm n times. Recent work has developed approximate leave-one-out cross-validation (ALOOCV), a faster alternative that replaces model retraining with a linear system solve [giordano2019swiss, rad2020scalable, wilson2020approximate]. In particular, these techniques yield accurate and computationally tractable approximations to LOOCV.

To present the approach, we consider the infinitesimal jackknife (IJ) approximation to LOOCV [giordano2019swiss, stephenson2020approximate]. The IJ approximation computes

 θ̃^{∖j}_IJ = θ̂ + (1/n) H(θ̂)^{-1} ∇_θ l(θ̂, a_j), (21)

where H(θ̂) is the Hessian of the loss function at the solution θ̂ and a_j is the jth datapoint. The main computational challenge is computing the inverse Hessian vector product H(θ̂)^{-1} ∇_θ l(θ̂, a_j). When n is very large, we can also subsample the data and average Eq. 21 over the subsample to estimate ALOOCV. Since ALOOCV solves the same linear system with several right-hand sides, blocked PCG methods (here, block Nyström PCG) are the tool of choice for solving for multiple right-hand sides at once. To demonstrate the idea, we perform numerical experiments on ALOOCV for logistic regression. The datasets we use are all from LIBSVM [chang2011libsvm]; see Table 4.
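As a toy illustration of Eq. 21 (our own sketch; the random data, crude gradient-descent solver, and dimensions are invented, and a direct solve stands in for block PCG at this small scale), note that every datapoint's IJ estimate reuses the same Hessian, so all right-hand sides can be solved together:

```python
import numpy as np

# IJ approximation for L2-regularized logistic regression: one Hessian,
# n right-hand sides, solved in a single block.
rng = np.random.default_rng(0)
n, d, mu = 200, 20, 1e-2
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, n) * 2 - 1          # labels in {-1, +1}

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Crude gradient descent to obtain theta_hat (stand-in for a real solver).
theta = np.zeros(d)
for _ in range(2000):
    g = -(X.T @ (y * sigmoid(-y * (X @ theta)))) / n + mu * theta
    theta -= 0.5 * g

# Hessian at theta_hat, and per-point loss gradients as columns of G.
s = sigmoid(X @ theta)
H = (X.T * (s * (1 - s))) @ X / n + mu * np.eye(d)
G = -(X * (y * sigmoid(-y * (X @ theta)))[:, None]).T   # d x n

# One block solve yields every IJ-perturbed solution of Eq. 21 at once.
Theta_IJ = theta[:, None] + np.linalg.solve(H, G) / n
print(Theta_IJ.shape)
```

The point of the blocked formulation is visible in the last line: the factorization (or, at scale, the Nyström preconditioner) is amortized across all n right-hand sides.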

6.3.2 Experimental overview

We perform two sets of experiments in this section. The first set of experiments uses Gisette and SVHN to test the efficacy of Nyström sketch-and-solve. These datasets are small enough that we can factor the Hessian with a direct method, which provides a reference solution. We also compare to block CG and block PCG with the computed Nyström approximation as a preconditioner. To assess the error due to an inexact solve for datapoint a_j, we compare each putative solution of the linear system in Eq. 21 against the direct solution and record the relative error in the Euclidean norm. We average the relative error over 100 datapoints.

The second set of experiments uses the larger datasets real-sim and rcv1.binary with small values of μ, the most challenging setting for ALOOCV. We restrict our comparison to block Nyström PCG versus the block CG algorithm, as Nyström sketch-and-solve is too inaccurate in this regime. We employ Algorithm 5 to construct the preconditioner for block Nyström PCG.

6.3.3 Nyström sketch-and-solve

As predicted, Nyström sketch-and-solve works poorly (Table 5). For the larger value of the regularization parameter, the approximate solutions are modestly accurate, and the accuracy degrades further as μ decreases. The experimental results agree with the theoretical analysis presented in Section 4, which indicates that sketch-and-solve degrades as μ decreases. In contrast, block CG and block Nyström PCG both provide high-quality solutions for each datapoint at both values of the regularization parameter.

6.4 Large scale ALOOCV experiments

Table 6 summarizes results for block Nyström PCG and block CG on the larger datasets. For the larger value of μ, block Nyström PCG offers little or no benefit over block CG, because the data matrices are very sparse (see Table 4) and the rcv1 problem is well-conditioned (see Table 11).

For the smaller value of μ, block Nyström PCG reduces the number of iterations substantially, but the speedup is negligible. The data matrix is sparse, which reduces the benefit of the Nyström method. Block CG also benefits from the presence of multiple right-hand sides, just as block Nyström PCG does. Indeed, O'Leary proved that the convergence of block CG depends on the ratio λ_max/λ_s, where λ_s is the sth smallest eigenvalue and s is the number of right-hand sides [o1980block]. Consequently, multiple right-hand sides effectively precondition block CG and accelerate convergence. We expect bigger gains over block CG when the data matrix is dense.