    # Fast and Accurate Gaussian Kernel Ridge Regression Using Matrix Decompositions for Preconditioning

This paper presents a method for building a preconditioner for a kernel ridge regression problem, where the preconditioner is not only effective in its ability to reduce the condition number substantially, but also efficient in its application in terms of computational cost and memory consumption. The suggested approach is based on randomized matrix decomposition methods, combined with the fast multipole method to achieve an algorithm that can process large datasets in complexity linear to the number of data points. In addition, a detailed theoretical analysis is provided, including an upper bound to the condition number. Finally, for Gaussian kernels, the analysis shows that the required rank for a desired condition number can be determined directly from the dataset itself without performing any analysis on the kernel matrix.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Kernel methods are a way to embed the input space into a high dimensional feature space, which enables learning over a richer and more complex hypothesis class. One of the most elementary applications of kernel methods is Kernel Ridge Regression (KRR). The problem setup can be defined in the following manner - let the training data be defined as where is the input domain and is the output domain, let , and a regularization parameter

, with the response for a given input x estimated as:

 f(x)=n∑j=1k(xj,x)αj (1)

where is the solution of the equation

 (K+βI)α=b (2)
###### Remark 1.1.

is the kernel matrix defined by

KRR, while simple, is a powerful technique that achieves empirically good results. However, KRR comes with the drawback of having to compute the KRR estimator. The straightforward solution for the estimator requires inverting the matrix, which requires time, and is very restrictive for large datasets. Furthermore, holding the original kernel in memory requires storage, which by itself is prohibitive for large n. These drawbacks have driven research to focus on scaling up kernel based methods, mostly by approximation, yielding methods such as the Nyström method for low-rank kernel approximation, or the randomized construction of feature maps such as Random Fourier Features (RFF), originally proposed by Rahimi and Recht . Another approach for the kernel ridge regression problem is finding a preconditioner [34, 11]. It was recently suggested to use RFF as a preconditioner for the kernel matrix  and then to use conjugate gradient to solve the regression problem. In this paper, a different approach for the preconditioner is presented, based on randomized matrix factorizations [18, 22, 33, 7]

. The suggested preconditioner combines between the interpolative decomposition, Nyströmapproximation and fast Gauss transform to achieve an efficient preconditioner that can be applied in

operations, i.e. linear in the number of data points.

### 1.1 Related work

Building scalable and computationally efficient algorithms for kernel methods is an active research area. Some methods use low rank approximations to approximate the kernel directly, such as Nyström approximation or random Fourier features. This approach is very popular and has many variants where the main idea is to reduce the computational cost by low rank approximation while retaining reasonable model quality. Recent work on these approaches can be found in [20, 9, 1, 26]

. A different approach uses fast matrix vector multiplications, e.g Fast Gauss Transform or Fast Multipole methods to solve the regression using Krylov iterations

[32, 11, 34]. These methods reduce the computational cost of each Krylov iteration significantly, from to or even . However, when the kernel matrix is ill-conditioned, the number of iterations required might be huge. Another approach is to use a preconditioner to reduce the condition number of the kernel matrix and in so doing reduce the required number of Krylov iterations. These approaches are discussed in [17, 35, 8].

### 1.2 Contributions

The paper proposes the use of randomized matrix decompositions  (specifically, the interpolative decomposition ), for constructing a preconditioner for the kernel matrix. The preconditioner can be used to solve the kernel ridge regression problem (2) using a substantially smaller number of iterations while maintaining high accuracy. Moreover, the preconditioner is designed in its structure to be both efficient in its application to the kernel matrix and strict in memory consumption. The paper presents a theoretical analysis of the condition number of the preconditioned matrix. Furthermore, for a Gaussian kernel, the paper provides a theoretical lower bound for the low rank approximation required to reach the desired condition number, this bound can be calculated directly from the data.

## 2 Preliminaries

### 2.1 Notation

Throughout the paper, matrices are indicated by capital letters and vectors and scalars by small letters. The norm when applied to vectors, refers to the standard Euclidean norm, i.e. and to the spectral norm when applied to matrices, i.e. . The norm applied to a vector with respect to a positive semidefinite matrix is defined

. Singular values are indicated in descending order, by

and eigenvalues by

also in descending order when they are real. on a square matrix indicates that is positive semidefinite and means .

### 2.2 Interpolative and CUR decompositions

Interpolative decomposition (ID)  is a matrix factorization that can be applied to any matrix and defined as the following:

 Am×n≈Bm×kPk×n (3)

where is a subset of the -columns of and is an interpolation matrix with certain properties (such as small norm) that is used for reconstructing from . The columns of are computed by rank revealing factorization [15, 28, 24, 6, 19] in order to get an error bounded by a factor proportional to the the singular value of . In addition to the deterministic algorithm for computing the ID that is described in , there is a randomized version that starts by projecting

onto a low dimensional space using a random normally distributed matrix

[23, 18]. This property is used implicitly in Algorithm 1. The CUR decomposition [21, 10], is a pseudo-skeleton decomposition [12, 27] that factors a matrix into three matrices, where and are subsets of the columns and rows of respectively, and is defined as the inverse matrix of the overlap between and . Most of the matrices discussed in the paper are symmetric, so the Nyström approximation will be used as a special case of the CUR decomposition. The following Lemma gives an error bound for the CUR decomposition using columns and rows selected by the interpolative decomposition:

###### Lemma 2.1.

 Let , that satisfies and , where C,R are the k columns and k rows of A, respectively, and W, are the interpolation matrices from the interpolative decomposition. Suppose further that R is full rank, and that U is defined as . Then

 ||A−CUR||≤||E||+||~E|| (4)
###### Remark 2.2.

 Note that , furthermore for the deterministic ID, . Equivalently, for a similar bound exists for the randomized ID, the reader is referred to 

### 2.3 Fast Gauss Transform

Fast Gauss transform (FGT) is a variant of the fast multipole method (FMM) . FMM was originally formulated as an efficient method for complex physical calculations, efficiently performing matrix-vector products. FGT deals with the evaluation of the discrete Gauss transform:

 G(yj)=N∑i=1qie−∥yj−xi∥2/ϵ2 (5)

where represent the centers of the Gaussians, each of which has bandwidth , and are the weight coefficients for each Gaussian. Direct evaluation of the summation for a set of points, , is computationally costly in large scale scenarios, as it requires operations. FGT allows setting a desired degree of precision for the approximation of the Gaussian function, and reduces the complexity of the summation (5) to , accompanied by a constant factor, which grows exponentially with the dimension 

. To overcome this disadvantage, which makes plain FGT inapplicable to high-dimensional data, the Fast Improved Gauss (FIG) transform

 uses tree data structures and an adaptive space subdivision technique, leading to reduction of the constant factor to asymptotically polynomial order. It is worth noting, that the computational complexity in is dependent on the data as well as the selection of .

###### Definition 2.1.

Let represent the application of the FIG transform to compute equation (5), where and represent matrices of data points for which the kernel is generated, weighted by the vector . According to this notation, it holds that .

The algorithm discussed in the paper can be applied to any other kernels that can utilize fast matrix-vector product such as other FMM functions. Another approach for fast matrix-vector product for any kernel and based on nearest neighbors is described in .

## 3 Description of the algorithm

The proposed algorithm is based on conjugate gradient with a preconditioner designed not only to be effective in its ability to reduce the condition number, but also in its efficiency to be applied in terms of computational cost and memory consumption. In order to keep the time and storage complexity to a minimum , the proposed scheme uses the Fast Improved Gauss transform [37, 38] (“FIG transform”). If the kernel is Gaussian, then the FIG transform can be used for applying the kernel to any vector in , but the preconditioner has to be applied using the FIG transform as well, to keep the advantage of using a Gaussian kernel. In order to achieve this, the kernel is approximated using a Nyström decomposition, i.e. , where is a subset of columns of and is the inverse of a submatrix of . Adding the ridge regression parameter, and using the Woodbury matrix identity:

 (~K+βI)−1=(CUCT+βI)−1=β−1I−β−1C(βU−1+CTC)−1CT (6)

Since is a submatrix of , then the application of both and to a matrix, can be done using FIG transform in . is also a subset of (unlike itself, which involves matrix inversion) and can therefore be applied in the same way, in . The Woodbury matrix identity requires computing the inverse of , however this is a small matrix of size . Constructing the matrix

can be done by the application of the FIG transform to the identity matrix,

(of size ) in operations. Each application of Eq. 6 involves solving the following linear system

 (βU−1+CTC)x=b. (7)

It is better to store the Cholesky factor of the matrix in Eq. 7, but this matrix tends to be ill-conditioned in large scales. Instead, Eq. 7 is modified to:

 (βI+U12CTCU12)U−12x=U12b. (8)

where the matrix is generally more stable, and its Cholesky decomposition can be applied to solve Eq. 8 for and then restoring using .

 LTL=βI+U12CTCU12 (9)

where is the Cholesky factor of size . The algorithm can be viewed as two steps:

• Preparation stage (or “setup” stage), which selects anchor data points from the dataset and also performs some computations to be used later in the iterative stage, such as building the matrix of the Cholesky decomposition and the matrix .

• Iterative stage, which applies conjugate gradient using the preconditioner that was computed in the preparation stage.

Selection of the anchor points can be done in various ways, such as random sampling or farthest point sampling (FPS) . In this work, the suggested method is based on the interpolative decomposition (or equivalently, on rank revealing decompositions, [15, 28, 24], which has the following advantages:

• It is strongly related to the CUR decomposition , and therefore also to the Nyström approximation. The interpolative decomposition is also integral to the method, yielding theoretical spectral bounds that can give theoretical guarantees for the overall performance of the algorithm.

• It selects columns of the input matrix, which enables the usage of the FIG-transform later on in the algorithm

• It has a randomized version, which can be computed (using the FIG-transform) in linear time and the most demanding computational part can be parallelized.

###### Remark 3.1.

Algorithm 1 uses sampling technique based on interpolative decomposition. When using pivoted QR, the computational complexity is . In practice, the performance is very similar to RRQR.

###### Remark 3.2.

The computational complexity of Algorithm 2 is

###### Remark 3.3.

Applying the preconditioner (Algorithm 3) in each iteration takes operations.

###### Remark 3.4.

The computational complexity of Algorithm 4, when using pivoted QR in Algorithm 1 is , where is the number of iterations used in the conjugate gradient.

###### Remark 3.5.

Anchor points selection can be done differently, for example by random sampling or FPS. In this case, the theoretical bounds will not hold, but may still produce good results in practice and reduce the computational cost of Algorithm 1. For example, from using pivoted QR, to using random sampling.

## 4 Theoretical Analysis

###### Lemma 4.1.

Let be a symmetric matrix, and let be its Nyström approximation, then where p(k,n) is a function bounded by a low degree polynomial in and .

###### Proof.

By Lemma 2.1

 ||K−~K||≤||E||+||~E|| (10)

Note however that since K is symmetric, meaning

 ~K=CUR=CUCT (11)

hence . Therefore it follows immediately that

 ∥K−~K∥≤2||E|| (12)

We also know from by definition of RRQR, that the decomposition must satisfy

 ∥E∥≤σk+1(K)⋅p(k,n) (13)

We therefore combine (7) and (8), yielding

 ∥K−~K∥≤2σk+1(K)⋅p(k,n) (14)

###### Lemma 4.2.

Let be a positive semidefinite matrix, and let its Nyström approximation, then .

###### Proof.

The lemma infers directly from the Schur’s compliment of . ∎

###### Lemma 4.3 ( pp. 673).

For any two matrices the following holds:

 |σi(A)−σi(B)|≤∥A−B∥ (15)
###### Definition 4.1.

 The numerical rank of the Gaussian kernel up to precision is

 ρδ(GXϵ)≜#[j:σj(GXϵ)σ1(GXϵ)≥δ]
###### Theorem 4.4.

[2, 3] Let be a set bounded by a box , where are intervals in . Let be the length of the th interval, i.e. and let be the associated kernel matrix, i.e. , then

 ρδ(GXϵ)≤d∏i=1(⌊γqi⌋+1), (16)

where .

The following theorem gives an upper bound to the condition number:

###### Theorem 4.5.

Let be a rank Nyström approximation for a positive semidefinite matrix , such that , for a positive constant (that may depend on and ) and a ridge parameter , then

 cond[(~K+βI)−1(K+βI)]≤1+Mλk+1(K)β (17)
###### Proof.

Combining Lemma 4.2 and Lemma 4.3 with the fact that and are positive semidefinite, gives

 0≤K−~K≤∥K−~K∥I≤Mσk+1(K)=Mλk+1(K) (18)

Modifying Eq. 18 and adding gives

 ~K+βI≤K+βI≤~K+βI+Mλk+1(K)I (19)

Applying to both sides of the equation and using the fact that is symmetric positive semidefinite gives

 I≤(~K+βI)−12(K+βI)(~K+βI)−12≤I+Mλk+1(K)(~K+βI)−1 (20)

Clearly, , which yields

 I≤(~K+βI)−12(K+βI)(~K+βI)−12≤(1+Mλk+1(K)β)I. (21)

and finally,

 cond[(~K+βI)−1(K+βI)]=cond[(~K+βI)−12(K+βI)(~K+βI)−12]≤1+Mλk+1(K)β (22)

which completes the proof. ∎

###### Corollary 4.6.

Let be a Gaussian kernel matrix over dataset , i.e. . Let be a rank Nyström approximation, such that and let be a ridge parameter. Then, for a maximal condition number,

 k≥d∏i=1(⌊2qiπ√ϵ−1ln¯Mnβ(ξ−1)⌋+1) (23)

where are the lengths of the intervals of the bounding box of the dataset .

###### Proof.

From Theorem 4.5, the condition number depends on , i.e.

 cond[(~K+βI)−1(K+βI)]≤1+¯Mλk+1(K)β≤ξ (24)

where . Therefore, must satisfy . By defining and using Theorem 4.4,

 γ=2π√ϵ−1ln(δ−1)≤2π√ϵ−1ln¯Mnβ(ξ−1) (25)

and therefore,

 k≥d∏i=1(⌊2qiπ√ϵ−1ln¯Mnβ(ξ−1)⌋+1)

where are the lengths of the intervals of the bounding box of the dataset . ∎

###### Remark 4.7.

Corollary 4.6 enables to determine the required rank Nyström approximation and illustrates the tradeoff between the condition number and the low rank, which has implications over the memory consumption and computational load of the algorithm. Smaller condition number yields less iterations on one hand, but on the other hand requires processing larger matrix .

###### Remark 4.8.

As a simple example, for the deterministic interpolative decomposition, , which yields

###### Remark 4.9.

Eq. 25 grows very slowly in , which means that remains small even when grows fast. For example, suppose , , and , then for , and for , which does not affect the value of according to Corollary 4.6. On the down side, it grows fast in .

###### Corollary 4.10.

For a Nyström preconditioner built by choosing , where are columns of chosen by the interpolative decomposition, then

 cond[(~K+βI)−1(K+βI)]≤1+2λk+1(K)√k(n−k)+1β≤1+2λk+1(K)(n2+1)β (26)
###### Proof.

The proof follows immediately by combining Lemma 4.1 with Theorem 4.5. ∎

###### Remark 4.11.

A similar bound can be developed immediately for other matrix decompositions.

## 5 Numerical Results

In this section, empirical evaluation of the algorithm is presented and compared against the naive implementation (no preconditioning) and RFF preconditioning. The datasets used are described in the supplementary appendix. The number of data points is , rank , noise level and

. The most costly computation in the presented flow is the QR decomposition in Algorithm

1 which is , compared to where the number of Gaussian components, therefore in order to maintain similar computational complexity, the number of Gaussian components was set to , and therefore . Figure 1 shows the comparison between the distance , where is the exact solution, and is the solution at iteration .

###### Remark 5.1.

It is important to note that each iteration of the proposed algorithm is , while every iteration of RFF is for . For , each RFF iteration is asymptotically more costly in terms of run-time complexity by a factor of .

###### Remark 5.2.

When the spectral decay is fast, a small will be sufficient for satisfying results.

Figure 2 compares between the norm of the residual error, , where refers to the iteration number and is the solution at iteration of the CG algorithm. Also presented is the distance

It’s worth noting that the graph is slightly misleading, as the number of Gaussian RV’s taken for the RFF preconditioner is very high (in order to achieve the same computational complexity), however per iteration the proposed method is significantly faster, therefore in reality the convergence is still far faster for the proposed method.

### 5.1 Synthetic Electromagnetic Field Simulator

A 3-dimensional synthetic dataset generated by an electric field simulator was used to test the performance of the algorithm. The simulation was composed of the following parts:

• setting the support of the points to

• choosing points from a uniform distribution over the defined support at random, where:

• 5 points represent the positive point-charges

• 30,000 points represent arbitrary sampling locations of the electric field

Following the superposition rule, at each sampling point , each field component was summed as follows:

 ex[j]=5∑i=11r2ijsin(ϕ)cos(θ) (27)
 ey[j]=5∑i=11r2ijsin(ϕ)sin(θ) (28)
 ez[j]=5∑i=11r2ijcos(ϕ) (29)

where is the distance between charge and the point , and and are the spherical angles with respect to axes and respectively. Additionally, an electric potential was calculated at each point, to be used as the problem’s label as follows:

 Φ[j]=5∑i=11rij (30)

## 6 Conclusions

The paper shows that the use of randomized matrix decompositions (specifically RID with CUR) for constructing a preconditioner for the matrix kernel is very successful in solving the scalability drawback intrinsic to kernel matrix methods. The method is strict in memory consumption (no need to hold the matrix), maintains accuracy, allows efficient application of the kernel matrix and reduces its condition number. Our results show fast convergence of the CG algorithm while maintaining accuracy, outperforming similar SOTA methods. The paper also provides theoretical insights and bounds for the low-rank approximation needed to reach the desired condition number.

## References

•  Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 58(3):341–349, 2016.
•  Amit Bermanis, Amir Averbuch, and Ronald R Coifman. Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis, 34(1):15–29, 2013.
•  Amit Bermanis, Guy Wolf, and Amir Averbuch. Cover-based bounds on the numerical rank of Gaussian kernels. Applied and Computational Harmonic Analysis, 36(2):302–315, 2014.
•  Dennis S Bernstein. Matrix Mathematics: Theory, Facts, and Formulas. Princeton University Press, 2009.
•  Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer Science & Business Media, 2008.
•  S. Chandrasekaran and I. Ipsen. On rank-revealing QR factorisations.
•  Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54, 2017.
•  Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In

International Conference on Machine Learning

, pages 2529–2538, 2016.
•  Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
•  Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
•  Nando D Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast Krylov methods for N-body learning. In Advances in neural information processing systems, pages 251–258, 2006.
•  Sergei A Goreinov, Eugene E Tyrtyshnikov, and Nickolai L Zamarashkin. A theory of pseudoskeleton approximations. Linear algebra and its applications, 261(1-3):1–21, 1997.
•  Leslie Greengard and Vladimir Rokhlin. A fast algorithm for particle simulations. Journal of computational physics, 73(2):325–348, 1987.
•  Leslie Greengard and John Strain. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.
•  Ming Gu and Stanley C Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17(4):848–869, 1996.
•  P.G Martinsson V. Rokhlin H. Cheng, Z. Gimbutas. On the compression of low rank matrices. SIAM, 26.
•  Kenneth L. Clarkson Haim Avron and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM, 38:1116–1138.
•  Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
•  Y.P Hong and C.-T. Pan.

Rank-revealing QR factorizations and the singular value decomposition.

Math. Comp.
•  Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, et al. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.
•  Michael W Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
•  Michael W Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
•  Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30(1):47–68, 2011.
•  L Miranian and Ming Gu. Strong rank revealing lu factorizations. Linear algebra and its applications, 367:1–16, 2003.
•  Vlad I Morariu, Balaji V Srinivasan, Vikas C Raykar, Ramani Duraiswami, and Larry S Davis. Automatic online tuning for fast Gaussian summation. In Advances in neural information processing systems, pages 1113–1120, 2009.
•  Cameron Musco and Christopher Musco. Recursive sampling for the Nystrom method. In Advances in Neural Information Processing Systems, pages 3833–3845, 2017.
•  AI Osinsky and NL Zamarashkin. Pseudo-skeleton approximations with better accuracy estimates. Linear Algebra and its Applications, 537:221–249, 2018.
•  C-T Pan. On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications, 316(1-3):199–222, 2000.
•  Vladimir Rokhlin P.G Martinsson and Mark Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30.
•  Adam Coates Pieter Abbeel and Andrew Y. Ng. Autonomous helicopter aerobatics through apprenticeship learning. IJRR, 2010.
•  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, pages 1177–1184, 2008.
•  Vikas C Raykar and Ramani Duraiswami. Fast large scale Gaussian process regression using approximate matrix-vector products. In Learning workshop, 2007.
•  Gil Shabat, Yaniv Shmueli, Yariv Aizenbud, and Amir Averbuch. Randomized LU decomposition. Applied and Computational Harmonic Analysis, 44(2):246–272, 2018.
•  Yirong Shen, Matthias Seeger, and Andrew Y Ng. Fast Gaussian process regression using kd-trees. In Advances in neural information processing systems, pages 1225–1232, 2006.
•  Balaji Vasan Srinivasan, Qi Hu, Nail A Gumerov, Raghu Murtugudde, and Ramani Duraiswami. Preconditioned Krylov solvers for kernel regression. arXiv preprint arXiv:1408.1237, 2014.
•  Sergey Voronin and Per-Gunnar Martinsson. Efficient algorithms for CUR and interpolative matrix decompositions. Advances in Computational Mathematics, 43(3):495–516, 2017.
•  C Yang, R Duraiswami, NA Gumerov, and L Davis.

Improved fast Gauss transform and efficient kernel density estimation.

In

Proceedings Ninth IEEE International Conference on Computer Vision

.
•  Changjiang Yang, Ramani Duraiswami, and Larry S Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in neural information processing systems, pages 1561–1568, 2005.