The kernel method is a popular technique used to apply linear learning and classification algorithms to datasets with nonlinear structure. Given training input points $x_1, \ldots, x_n \in \mathbb{R}^d$, the idea is to replace the standard Euclidean dot product $\langle x_i, x_j \rangle$ with the kernel dot product $k(x_i, x_j)$, where $k$ is some positive semidefinite function. Popular kernel functions include, e.g., the Gaussian kernel with $k(x_i, x_j) = e^{-\|x_i - x_j\|^2/\sigma^2}$ for some bandwidth parameter $\sigma$, and the polynomial kernel of degree $q$ with $k(x_i, x_j) = (\langle x_i, x_j \rangle + c)^q$ for some parameter $c$.
Throughout this work, we focus on kernels where $k(x_i, x_j)$ is a function of the dot products $\langle x_i, x_j \rangle$, $\langle x_i, x_i \rangle$, and $\langle x_j, x_j \rangle$. Such functions encompass many kernels used in practice, including the Gaussian kernel, the Laplace kernel, the polynomial kernel, and the Matérn kernels.
Letting $\mathcal{F}$ be the reproducing kernel Hilbert space associated with $k$, we can write $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{F}}$, where $\phi : \mathbb{R}^d \to \mathcal{F}$ is a typically non-linear feature map. We let $\Phi$ denote the kernelized dataset, whose $i^{th}$ row is the kernelized datapoint $\phi(x_i)$.
There is no requirement that $\phi$ can be efficiently computed or stored – for example, in the case of the Gaussian kernel, $\mathcal{F}$ is an infinite dimensional space. Thus, kernel methods typically work with the kernel matrix $K \in \mathbb{R}^{n \times n}$ with $K_{i,j} = k(x_i, x_j)$, sometimes annotated with the kernel function that generates it when this is not clear from context. We can equivalently write $K = \Phi\Phi^T$. As long as all operations of an algorithm only access $\Phi$ via the dot products between its rows, they can thus be implemented using just $K$, without explicitly computing the feature map.
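As a small illustration (a toy NumPy sketch, not part of the development above), the Gaussian kernel matrix can be formed directly from dot products between the rows of the data matrix, without ever materializing the feature map:

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma=1.0):
    """Form K with K[i, j] = exp(-||x_i - x_j||^2 / sigma^2),
    where the rows of A are the input points x_i."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
K = gaussian_kernel_matrix(A)    # 5 x 5, symmetric PSD, unit diagonal
```

Note that the only data-dependent computation is the Gram matrix $AA^T$, matching the cost discussion below.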
Unfortunately, computing $K$ is expensive, and this is a bottleneck for scaling kernel methods to large datasets. For the kernels we consider, where $k$ depends on the dot products between the input points, we must at least compute the Gram matrix $AA^T$, where $A \in \mathbb{R}^{n \times d}$ is the matrix whose $i^{th}$ row is $x_i$, requiring $\Theta(n^2 d)$ time in general. Even if $A$ is sparse, this takes $\Theta(n \cdot \mathrm{nnz}(A))$ time, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. Storing $K$ then takes $\Theta(n^2)$ space, and processing it for downstream applications like kernel ridge regression and kernel SVM can be even more expensive.
1.1 Low-rank kernel approximation
For this reason, a vast body of work studies how to efficiently approximate $K$ via a low-rank surrogate $\tilde K$ [SS00, AMS01, WS01, FS02, RR07, ANW14, LSS13, BJ02, DM05, ZTK08, BW09, CKS11, WZ13, GM13]. If $\tilde K$ has rank $k$, it can be stored in factored form in $O(nk)$ space and operated on quickly – e.g., it can be inverted in just $O(nk^2)$ time to solve kernel ridge regression.
One possibility is to set $\tilde K = K_k$, where $K_k$ is $K$'s best rank-$k$ approximation – the projection of $K$ onto its top $k$ eigenvectors. $K_k$ minimizes, over all rank-$k$ $B$, the error $\|K - B\|_F$, where $\|\cdot\|_F$ is the Frobenius norm: $\|M\|_F^2 = \sum_{i,j} M_{i,j}^2$. It in fact minimizes error under any unitarily invariant norm, e.g., the popular spectral norm. Unfortunately, $K_k$ is prohibitively expensive to compute, requiring $\Theta(n^3)$ time in practice, or $n^{\omega}$ in theory using fast matrix multiplication, where $\omega < 2.373$ [LG14].
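For intuition, $K_k$ can be computed exactly from an eigendecomposition – precisely the $\Theta(n^3)$ computation one wants to avoid at scale. A minimal sketch:

```python
import numpy as np

def best_rank_k(K, k):
    """Best rank-k approximation K_k of a symmetric PSD K: project onto
    the span of the top-k eigenvectors (exact, but Theta(n^3) time)."""
    vals, vecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    V = vecs[:, -k:]                 # top-k eigenvectors
    return V @ (V.T @ K)             # for symmetric K this equals K_k

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
K = X @ X.T                          # a rank-2 (linear kernel) matrix
K2 = best_rank_k(K, 2)               # recovers K exactly in this toy case
```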
The idea of much prior work on low-rank kernel approximation is to find $\tilde K$ which is nearly as good as $K_k$, but can be computed much more quickly. Specifically, it is natural to ask for $\tilde K$ fulfilling the following relative error guarantee for some parameter $\epsilon \ge 0$:
$$\|K - \tilde K\|_F \le (1+\epsilon)\,\|K - K_k\|_F. \quad (1)$$
Other goals, such as nearly matching the spectral norm error $\|K - K_k\|_2$ or approximating $K$ entrywise, have also been considered [RR07, GM13]. Of particular interest to our results is the closely related goal of outputting an orthonormal basis $Z \in \mathbb{R}^{n \times k}$ satisfying, for any $\Phi$ with $\Phi\Phi^T = K$:
$$\|\Phi - ZZ^T\Phi\|_F^2 \le (1+\epsilon)\,\|\Phi - \Phi_k\|_F^2, \quad (2)$$
where $\Phi_k$ is the best rank-$k$ approximation to $\Phi$.
(2) can be viewed as a kernel PCA guarantee – it asks us to find a low-rank subspace, spanned by $Z$, such that the projection of our kernelized dataset $\Phi$ onto this subspace nearly optimally approximates the dataset. Given $Z$, we can approximate $K$ using $\tilde K = (ZZ^T\Phi)(ZZ^T\Phi)^T = ZZ^TKZZ^T$. Alternatively, letting $P$ be the projection onto the row span of $ZZ^T\Phi$, we can write $\tilde K = \Phi P\Phi^T$, which can be computed efficiently, for example, when $P$ is a projection onto a subset of the kernelized datapoints [MM16].
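To make guarantee (2) concrete, consider the linear kernel, where $\Phi = A$ and the guarantee can be verified explicitly. The toy sketch below takes $Z$ to be the top eigenvectors of $K$, which in this case is exactly optimal (i.e., (2) holds with $\epsilon = 0$):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 8))     # for the linear kernel, Phi = A
K = A @ A.T
k = 3

vals, vecs = np.linalg.eigh(K)       # ascending eigenvalues
Z = vecs[:, -k:]                     # top-k eigenvectors of K (orthonormal)

# Projection cost of the kernelized data onto span(Z) ...
proj_err = np.linalg.norm(A - Z @ (Z.T @ A)) ** 2
# ... versus the optimal rank-k cost ||Phi - Phi_k||_F^2.
opt_err = np.sum(np.linalg.svd(A, compute_uv=False)[k:] ** 2)
# Z spans the top left singular space of A, so the two costs coincide.
```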
1.2 Fast algorithms for relative-error kernel approximation
A naive approach to achieving (1) or (2) is to compute $K$ in full and then apply a standard low-rank approximation algorithm. However, recent work has shown that this is not required. Avron, Nguyen, and Woodruff [ANW14] demonstrate that for the polynomial kernel with degree $q$, $Z$ satisfying (2) can be computed in $O(\mathrm{nnz}(A)\,q) + n \cdot \mathrm{poly}(q, k, 1/\epsilon)$ time.
Musco and Musco [MM16] give a fast algorithm for any kernel, using recursive Nyström sampling, which computes $\tilde K$ (in factored form) satisfying $\|K - \tilde K\|_2 \le \lambda$, for input parameter $\lambda$. With the proper setting of $\lambda$, it can output $Z$ satisfying (2) (see Section C.3 of [MM16]). Computing $\tilde K$ requires evaluating $\tilde O(k/\epsilon)$ columns of the kernel matrix along with additional $\tilde O(n(k/\epsilon)^2)$ time for other computations. Assuming the kernel is a function of the dot products between the input points, the kernel evaluations require $\tilde O(\mathrm{nnz}(A)\,k/\epsilon)$ time. The results of [MM16] can also be used to compute $\tilde K$ satisfying (1) in $\tilde O\big(\mathrm{nnz}(A)\,\frac{k}{\epsilon} + n\big(\frac{k}{\epsilon}\big)^2\big)$ time (see Appendix A of [MW17]).
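The following is a simplified Nyström sketch with uniformly sampled landmarks (a toy stand-in: the algorithm of [MM16] instead selects landmarks by recursive ridge leverage score sampling, which is what yields its guarantees):

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom approximation K_tilde = C W^+ C^T from landmark set idx,
    where C = K[:, idx] and W = K[idx][:, idx]. Only the sampled
    columns of K are ever evaluated."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 2))
K = X @ X.T                          # rank-2, so a few landmarks suffice
K_tilde = nystrom(K, idx=[0, 1, 2])  # exact here, since K has rank 2
```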
Musco and Woodruff [MW17] show that for any kernel, and for any $k$ and $\epsilon$, it is possible to achieve (1) in $n \cdot \mathrm{poly}(k/\epsilon)$ time plus the time needed to compute an $\tilde O(nk/\epsilon^{2.5})$-entry submatrix of $K$. If $A$ has uniform row sparsity – i.e., $\mathrm{nnz}(a_i) \le \frac{c\,\mathrm{nnz}(A)}{n}$ for some constant $c$ and all rows $a_i$ – this step can be done in $\tilde O\big(\mathrm{nnz}(A)\,k/\epsilon^{2.5}\big)$ time. Alternatively, if $d \le k^{0.31}$, this can be done in $\tilde O\big(\mathrm{nnz}(A) + n k^{1+o(1)}\big)$ time using fast rectangular matrix multiplication [LG12, GU17] (assuming that there are no all zero data points, so $\mathrm{nnz}(A) \ge n$).
1.3 Our results
The algorithms of [MM16, MW17] make significant progress in efficiently solving (1) and (2) for general kernel matrices. They demonstrate that, surprisingly, a relative-error low-rank approximation can be computed significantly faster than the $\Omega(n^2)$ time required to write down all of $K$.
A natural question is whether these results can be improved. Even ignoring $\epsilon$ dependencies and typically lower order terms, both algorithms use $\Omega(\mathrm{nnz}(A)\,k)$ time. One might hope to improve this to input sparsity, or near input sparsity time, $\tilde O(\mathrm{nnz}(A))$, which is known to be achievable for computing a low-rank approximation of $A$ itself [CW13]. The work of Avron et al. [ANW14] affirms that this is possible for the kernel PCA guarantee of (2) for degree-$q$ polynomial kernels, for constant $q$. Can this result be extended to other popular kernels, or even more general classes?
1.3.1 Lower bounds
We show that achieving the guarantee of (1) significantly more efficiently than the work of [MM16, MW17] is likely very difficult. Specifically, we prove that for a wide class of kernels, the kernel low-rank approximation problem is as hard as multiplying the input $A \in \mathbb{R}^{n \times d}$ by an arbitrary matrix $C \in \mathbb{R}^{d \times k}$. We have the following result for some common kernels to which our techniques apply:
Theorem 1 (Hardness for low-rank kernel approximation).
Consider any polynomial kernel $k(x_i, x_j) = \langle x_i, x_j \rangle^q$, Gaussian kernel $k(x_i, x_j) = e^{-\|x_i - x_j\|^2/\sigma^2}$, or the linear kernel $k(x_i, x_j) = \langle x_i, x_j \rangle$. Assume there is an algorithm $\mathcal{A}$ which, given $A \in \mathbb{R}^{n \times d}$ with associated kernel matrix $K$ and a rank parameter $k$, returns $\tilde K$ with rank $k$ in $T(n, d, k)$ time satisfying:
$$\|K - \tilde K\|_F \le \Delta\,\|K - K_k\|_F$$
for some approximation factor $\Delta \ge 1$. Then there is an $O\big(T(n+k, d, 2k+3) + nk\big)$ time algorithm for multiplying arbitrary integer matrices $M_1 \in \mathbb{Z}^{n \times d}$, $M_2 \in \mathbb{Z}^{d \times k}$.
The above applies for any approximation factor $\Delta \ge 1$, even one growing with $n$. While we work in the real RAM model, ignoring bit complexity, as long as $M_1$ and $M_2$ have polynomially bounded entries, our reduction from multiplication to low-rank approximation is achieved using matrices that can be represented with just $O(\log n)$ bits per entry.
Theorem 1 shows that the runtime of $\tilde O(\mathrm{nnz}(A)\,k)$ for $\epsilon = \Theta(1)$ achieved by [MM16] for general kernels cannot be significantly improved without advancing the state-of-the-art in matrix multiplication. Currently no general algorithm is known for multiplying integer $M_1 \in \mathbb{Z}^{n \times d}$, $M_2 \in \mathbb{Z}^{d \times k}$ in $o(\mathrm{nnz}(M_1)\,k)$ time, except when $d \le k^{0.31}$ and $M_1$ is dense. In this case, $M_1M_2$ can be computed in $n \cdot k^{1+o(1)}$ time using fast rectangular matrix multiplication [LG12, GU17].
As discussed, when $A$ has uniform row sparsity or when $d \le k^{0.31}$, the runtime of [MW17] for guarantee (1), ignoring $\epsilon$ dependencies and typically lower order terms, is $\tilde O(\mathrm{nnz}(A)\,k)$, which is also nearly tight.
In recent work, Backurs et al. [BIS17] give lower bounds for a number of kernel learning problems, including kernel PCA for the Gaussian kernel. However, their strongest bound, of $n^{2-o(1)}$ time, requires very small error $\epsilon = 2^{-\omega(\log n)}$, whereas ours applies for any relative error $\Delta$.
1.3.2 Improved algorithm for radial basis function kernels
In contrast to the above negative result, we demonstrate that achieving the alternative kernel PCA guarantee of (2) is possible in input sparsity time for any shift and rotationally invariant kernel – e.g., any radial basis function kernel, where $k(x_i, x_j) = f(\|x_i - x_j\|)$ for some function $f$. This result significantly extends the progress of Avron et al. [ANW14] on the polynomial kernel.
Our algorithm is based on a fast implementation of the random Fourier features method [RR07], which uses the fact that the Fourier transform of any shift invariant kernel is a probability distribution after appropriate scaling (a consequence of Bochner's theorem). Sampling frequencies from this distribution gives an approximation to the kernel function $k$, and consequently to the kernel matrix $K$.
We employ a new analysis of this method [AKM17], which shows that sampling $s = \tilde O\big(\frac{n}{\epsilon^2\lambda}\big)$ random Fourier features suffices to give $\tilde\Phi$ satisfying the spectral approximation guarantee:
$$(1-\epsilon)\,(K + \lambda I) \preceq \tilde\Phi\tilde\Phi^* + \lambda I \preceq (1+\epsilon)\,(K + \lambda I).$$
Our algorithm samples $s = \tilde O\big(\frac{n}{\epsilon^2\lambda}\big)$ random Fourier features, which naively requires $O(\mathrm{nnz}(A)\,s)$ time. We show that this can be accelerated to $\tilde O(\mathrm{nnz}(A)) + \mathrm{poly}(s)$ time, using a recent result of Kapralov et al. on fast multiplication by random Gaussian matrices [KPW16]. Our technique is analogous to the 'Fastfood' approach to accelerating random Fourier features using fast Hadamard transforms [LSS13]. However, our runtime scales with $\mathrm{nnz}(A)$, which can be significantly smaller than the $O(ns\log d)$ runtime given by Fastfood when $A$ is sparse. Our main algorithmic result is:
Theorem 2 (Input sparsity time kernel PCA).
There is an algorithm that, given $A \in \mathbb{R}^{n \times d}$ along with a shift and rotation-invariant kernel function $k$ with $k(0) = 1$, outputs, with probability $99/100$, an orthonormal $Z \in \mathbb{R}^{n \times k}$ satisfying:
$$\|\Phi - ZZ^T\Phi\|_F^2 \le (1+\epsilon)\,\|\Phi - \Phi_k\|_F^2$$
for any $\Phi$ with $\Phi\Phi^T = K$ and any $\epsilon \in (0, 1)$.
Letting $\sigma_{k+1}$ denote the $(k+1)^{st}$ largest eigenvalue of $K$ and $\omega$ be the exponent of fast matrix multiplication, the algorithm runs in $\tilde O\big(\mathrm{nnz}(A) + \mathrm{poly}\big(\frac{nk}{\epsilon\,\sigma_{k+1}}\big)\big)$ time, where the polynomial depends on $\omega$.
We note that the runtime of our algorithm is $\tilde O(\mathrm{nnz}(A))$ whenever $n$, $k$, $1/\epsilon$, and $1/\sigma_{k+1}$ (the inverse of the $(k+1)^{st}$ largest eigenvalue of $K$) are not too large. Due to the relatively poor dependence on $n$, the algorithm is most relevant for very high dimensional datasets with $d \gg n$. Such datasets are found often, e.g., in genetics applications [HDC01, JDMP11]. While we have dependence on $1/\sigma_{k+1}$, in the natural setting we only compute a low-rank approximation up to an error threshold, ignoring very small eigenvalues of $K$, and so $\sigma_{k+1}$ will not be too small. We do note that if we apply Theorem 2 to the low-rank approximation instances given by our lower bound construction, $\sigma_{k+1}$ can be very small, even for matrices with bounded entries. Thus, removing this dependence is an important open question in understanding the complexity of low-rank kernel approximation.
We leave open the possibility of improving our algorithm, achieving $\tilde O(\mathrm{nnz}(A)) + n \cdot \mathrm{poly}(k/\epsilon)$ runtime, which would match the state-of-the-art for low-rank approximation of non-kernelized matrices [CW13]. Alternatively, it is possible that a lower bound can be shown, proving that the high dependence on $n$, or the $1/\sigma_{k+1}$ term, is required even for the kernel PCA guarantee of (2).
2 Lower bounds
Our lower bound proof argues that for a broad class of kernels, given input $A$, a low-rank approximation of the associated kernel matrix $K$ achieving (1) can be used to obtain a close approximation to the Gram matrix $AA^T$. We write $k(x_i, x_j)$ as a function of $\langle x_i, x_j \rangle$ (or of $\|x_i - x_j\|^2$ for distance kernels) and expand this function as a power series. We show that if the input points are appropriately rescaled, the contribution of the degree-1 term dominates, and hence our kernel matrix approximates $AA^T$, up to some easy to compute low-rank components.
We then show that such an approximation can be used to give a fast algorithm for multiplying any two integer matrices $M_1 \in \mathbb{Z}^{n \times d}$ and $M_2 \in \mathbb{Z}^{d \times k}$. The key idea is to set $A = \begin{bmatrix} M_1 \\ wM_2^T \end{bmatrix}$, where $w$ is a large weight. We then have:
$$AA^T = \begin{bmatrix} M_1M_1^T & wM_1M_2 \\ wM_2^TM_1^T & w^2M_2^TM_2 \end{bmatrix}.$$
Since $w$ is very large, the $M_1M_1^T$ block is relatively very small, and so $AA^T$ is nearly rank-$2k$ – it has a 'heavy' strip of elements in its last $k$ rows and columns. Thus, computing a relative-error rank-$2k$ approximation to $AA^T$ recovers all entries except those in the $M_1M_1^T$ block very accurately, and importantly, recovers the $wM_1M_2$ block and so the product $M_1M_2$.
2.1 Lower bound for low-rank approximation of $AA^T$.
We first illustrate our lower bound technique by showing hardness of direct approximation of the Gram matrix $AA^T$.
Theorem 3 (Hardness of low-rank approximation for $AA^T$).
Assume there is an algorithm $\mathcal{A}$ which, given any $A \in \mathbb{R}^{n \times d}$ and rank parameter $k'$, returns $\tilde K$ with rank $k'$ such that $\|AA^T - \tilde K\|_F \le \Delta\,\|AA^T - (AA^T)_{k'}\|_F$ in $T(n, d, k')$ time, for some approximation factor $\Delta$.
For any $M_1 \in \mathbb{Z}^{n \times d}$ and $M_2 \in \mathbb{Z}^{d \times k}$, each with integer entries in $\{-U, \ldots, U\}$, let $A = \begin{bmatrix} M_1 \\ wM_2^T \end{bmatrix}$, where $w = 3\Delta ndU^2$. It is possible to compute the product $M_1M_2$ in time $T(n+k, d, 2k) + O(nk)$.
We can write the matrix $AA^T$ as:
$$AA^T = \begin{bmatrix} M_1M_1^T & wM_1M_2 \\ wM_2^TM_1^T & w^2M_2^TM_2 \end{bmatrix}.$$
Let $Z$ be an orthogonal span for the columns of the matrix:
$$B = \begin{bmatrix} 0 & wM_1M_2 \\ wM_2^TM_1^T & w^2M_2^TM_2 \end{bmatrix},$$
which has rank at most $2k$, since its non-zero entries lie entirely in its last $k$ rows and columns. The projection $ZZ^TAA^T$ gives the best Frobenius norm approximation to $AA^T$ in the column span of $Z$. We can see that:
$$\|AA^T - (AA^T)_{2k}\|_F \le \|AA^T - ZZ^TAA^T\|_F \le \|M_1M_1^T\|_F \le ndU^2, \quad (3)$$
since each entry of $M_1$ is bounded in magnitude by $U$, and so each entry of $M_1M_1^T$ is bounded by $dU^2$.
Let $\tilde K$ be the matrix returned by running $\mathcal{A}$ on $A$ with rank $2k$. In order to achieve the approximation bound of $\Delta$ we must have, for all $i, j$:
$$\big|\tilde K_{i,j} - (AA^T)_{i,j}\big| \le \|\tilde K - AA^T\|_F \le \Delta\,\|AA^T - (AA^T)_{2k}\|_F \le \Delta ndU^2,$$
where the last inequality is from (3). This gives $|\tilde K_{i,j} - (AA^T)_{i,j}| \le w/3 < w/2$. Since $M_1$ and $M_2$ have integer entries, each entry in the $wM_1M_2$ submatrix of $AA^T$ is an integer multiple of $w$. Since $\tilde K$ approximates this entry to error $< w/2$, by simply rounding to the nearest multiple of $w$, we obtain the entry exactly. Thus, given $\tilde K$, we can exactly recover $M_1M_2$ in $O(nk)$ time by reading off the submatrix corresponding to $wM_1M_2$ in $\tilde K$. ∎
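The reduction can be checked numerically. Below, the exact best rank-$2k$ approximation of $AA^T$ stands in for the output of $\mathcal{A}$ (i.e., $\Delta = 1$), and rounding its off-diagonal block recovers an integer product; the dimensions are an illustrative toy instance:

```python
import numpy as np

rng = np.random.default_rng(2)
M1 = rng.integers(-3, 4, size=(6, 5))    # integer matrices to multiply
M2 = rng.integers(-3, 4, size=(5, 2))
n, k = 6, 2
w = 10 ** 6                              # heavy weight w

A = np.vstack([M1, w * M2.T])            # A = [M1; w M2^T]
G = A @ A.T                              # top right n x k block is w * M1 M2

vals, vecs = np.linalg.eigh(G)           # exact best rank-2k approximation
V = vecs[:, -2 * k:]
Gk = V @ (V.T @ G)

# Entrywise, |Gk - G| <= ||M1 M1^T||_F << w / 2, so rounding the top right
# block of Gk to the nearest multiple of w recovers M1 M2 exactly.
prod = np.rint(Gk[:n, n:] / w).astype(int)
```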
We also show that even just outputting an orthogonal matrix $Z$ such that $ZZ^TK$ is a relative-error low-rank approximation of $K$, but not computing this approximation itself, is enough to give fast multiplication of integer matrices $M_1$ and $M_2$.
2.2 Lower bound for dot product kernels
We now extend Theorem 3 to general dot product kernels – where $K_{i,j} = f(\langle x_i, x_j \rangle)$ for some function $f : \mathbb{R} \to \mathbb{R}$. This includes, for example, the polynomial kernel.
Theorem 4 (Hardness of low-rank approximation for dot product kernels).
Consider any kernel with $K_{i,j} = f(\langle x_i, x_j \rangle)$ for some function $f$ which can be expanded as $f(x) = \sum_{q=0}^{\infty} c_q x^q$ with $c_1 \neq 0$ and $|c_q| \le G^q\,|c_1|$ for all $q \ge 2$ and some $G \ge 1$.
Assume there is an algorithm $\mathcal{A}$ which, given $A \in \mathbb{R}^{n \times d}$ with kernel matrix $K$, returns $\tilde K$ with rank $2k+1$ satisfying $\|K - \tilde K\|_F \le \Delta\,\|K - K_{2k+1}\|_F$ in $T(n, d, k)$ time.
For any $M_1 \in \mathbb{Z}^{n \times d}$, $M_2 \in \mathbb{Z}^{d \times k}$ with integer entries bounded in magnitude by $U$, let $A = \alpha\begin{bmatrix} M_1 \\ wM_2^T \end{bmatrix}$ with $w = 6\Delta ndU^2$ and $\alpha$ set sufficiently small as a function of $\Delta$, $n$, $d$, $U$, $G$, and $w$. Then it is possible to compute $M_1M_2$ in time $T(n+k, d, k) + O(nk)$.
Using our power series expansion of $f$, we can write the kernel matrix for the points given by $M_1$ and $M_2$ as:
$$K = c_0\mathbf{1} + c_1\alpha^2\begin{bmatrix} M_1M_1^T & wM_1M_2 \\ wM_2^TM_1^T & w^2M_2^TM_2 \end{bmatrix} + \sum_{q\ge2} c_qE_q, \quad (4)$$
where $(E_q)_{i,j} = \langle a_i, a_j \rangle^q$ for the rows $a_i$ of $A$, and $\mathbf{1}$ denotes the all ones matrix of appropriate size. The key idea is to show that the contribution of the $q \ge 2$ terms is small, and so any relative-error rank-$(2k+1)$ approximation to $K$ must recover an approximation to $c_1\alpha^2wM_1M_2$, and thus the product $M_1M_2$, as in Theorem 3.
By our setting of $\alpha$, the fact that $|c_q| \le G^q\,|c_1|$, and our bound on the entries of $M_1$ and $M_2$, we have for all $i, j$: $G\,|\langle a_i, a_j \rangle| \le G\,\alpha^2w^2dU^2 \le 1/2$. Thus, for any $i, j$, using that $\sum_{q\ge2} (1/2)^{q-2} = 2$:
$$\Big|\sum_{q\ge2} c_q\,\langle a_i, a_j \rangle^q\Big| \le 2\,|c_1|\,G^2\,\langle a_i, a_j \rangle^2. \quad (5)$$
Let $B$ be the matrix $c_1\alpha^2\begin{bmatrix} M_1M_1^T & wM_1M_2 \\ wM_2^TM_1^T & w^2M_2^TM_2 \end{bmatrix}$, with its top left $n \times n$ block set to $0$. $B$ just has its last $k$ columns and rows non-zero, so has rank at most $2k$. Let $Z$ be an orthogonal span for the columns of $B$ along with the all ones vector of length $n+k$. Let $\tilde K$ be the result of running $\mathcal{A}$ on $A$ with rank $2k+1$. Then we have:
$$\|K - \tilde K\|_F \le \Delta\,\|K - K_{2k+1}\|_F \le \Delta\,\|K - ZZ^TK\|_F \le \Delta\Big(|c_1|\,\alpha^2\,\|M_1M_1^T\|_F + \Big\|\sum_{q\ge2}c_qE_q\Big\|_F\Big), \quad (2.2)$$
where $\alpha^2M_1M_1^T$ denotes the top left $n \times n$ submatrix of $AA^T$. By our bound on the entries of $M_1$ and (5):
$$|c_1|\,\alpha^2\,\|M_1M_1^T\|_F + \Big\|\sum_{q\ge2}c_qE_q\Big\|_F \le |c_1|\,\alpha^2\,ndU^2 + 2\,|c_1|\,G^2\,(n+k)\,\alpha^4w^4d^2U^4 \le 2\,|c_1|\,\alpha^2\,ndU^2,$$
where the last inequality follows from our setting of $\alpha$.
Plugging back into (2.2) and using $w = 6\Delta ndU^2$, this gives for any $i, j$:
$$\big|\tilde K_{i,j} - K_{i,j}\big| \le \|\tilde K - K\|_F \le 2\,\Delta\,|c_1|\,\alpha^2\,ndU^2 = \frac{|c_1|\,\alpha^2\,w}{3}.$$
Since $M_1$ and $M_2$ have integer entries, each entry of $c_1\alpha^2wM_1M_2$ is an integer multiple of $c_1\alpha^2w$. By the decomposition of (4) and the bound of (5), if we subtract $c_0$ from the corresponding entry of $K$ and round it to the nearest multiple of $c_1\alpha^2w$, we will recover the entry of $c_1\alpha^2wM_1M_2$. By the bound of (2.2), we can likewise round the corresponding entry of $\tilde K$. Computing all of these entries given $\tilde K$ takes time $O(nk)$, giving the theorem. ∎
Theorem 4 lets us lower bound the time to compute a low-rank kernel approximation for any kernel function expressible as a reasonable power expansion of $\langle x_i, x_j \rangle$. As a straightforward example, it gives the lower bound for the polynomial kernel of any degree $q$ stated in Theorem 1.
2.3 Lower bound for distance kernels
We finally extend Theorem 4 to handle kernels, like the Gaussian kernel, whose value depends on the squared distance $\|x_i - x_j\|^2$ rather than just the dot product $\langle x_i, x_j \rangle$. We prove:
Theorem 5 (Hardness of low-rank approximation for distance kernels).
Consider any kernel function with $K_{i,j} = f(\|x_i - x_j\|^2)$ for some function $f$ which can be expanded as $f(x) = \sum_{q=0}^{\infty} c_q x^q$ with $c_1 \neq 0$ and $|c_q| \le G^q\,|c_1|$ for all $q \ge 2$ and some $G \ge 1$.
Assume there is an algorithm $\mathcal{A}$ which, given input $A \in \mathbb{R}^{n \times d}$ with kernel matrix $K$, returns $\tilde K$ with rank $2k+3$ satisfying $\|K - \tilde K\|_F \le \Delta\,\|K - K_{2k+3}\|_F$ in $T(n, d, k)$ time.
For any $M_1 \in \mathbb{Z}^{n \times d}$, $M_2 \in \mathbb{Z}^{d \times k}$ with integer entries bounded in magnitude by $U$, let $A = \alpha\begin{bmatrix} M_1 \\ wM_2^T \end{bmatrix}$ with $w = 6\Delta ndU^2$ and $\alpha$ set sufficiently small as a function of $\Delta$, $n$, $d$, $U$, $G$, and $w$. It is possible to compute $M_1M_2$ in time $T(n+k, d, k) + O(nk)$.
The proof of Theorem 5 is similar to that of Theorem 4, and is relegated to Appendix C. The key idea is to write $K$ as a polynomial in the distance matrix $D$ with $D_{i,j} = \|a_i - a_j\|^2$. Since $\|a_i - a_j\|^2 = \|a_i\|^2 + \|a_j\|^2 - 2\langle a_i, a_j \rangle$, $D$ can be written as $-2AA^T$ plus a rank-2 component. By setting $\alpha$ sufficiently small, as in the proof of Theorem 4, we ensure that the higher powers of $D$ are negligible, and thus that our low-rank approximation must accurately recover the submatrix of $D$ corresponding to $wM_1M_2$. Theorem 5 gives Theorem 1 for the popular Gaussian kernel: $e^{-\|x_i - x_j\|^2/\sigma^2} = f(\|x_i - x_j\|^2)$ for $f(x) = e^{-x/\sigma^2}$, whose power series expansion satisfies the required conditions.
3 Input sparsity time kernel PCA for radial basis kernels
Theorem 1 gives little hope for achieving $o(\mathrm{nnz}(A)\,k)$ time for low-rank kernel approximation. However, the guarantee of (1) is not the only way of measuring the quality of $\tilde K$. Here we show that for shift and rotationally invariant kernels, including, e.g., radial basis function kernels, input sparsity time can be achieved for the kernel PCA goal of (2).
3.1 Basic algorithm
Our technique is based on the random Fourier features method [RR07]. Given any shift-invariant kernel $k(x_i, x_j) = k(x_i - x_j)$ with $k(0) = 1$ (we will assume this w.l.o.g. as the function can always be scaled), there is a probability density function $p(\eta)$ over vectors in $\mathbb{R}^d$ such that:
$$k(x_i - x_j) = \int_{\mathbb{R}^d} p(\eta)\,e^{-2\pi i\,\eta^T(x_i - x_j)}\,d\eta. \quad (8)$$
$p(\eta)$ is just the (inverse) Fourier transform of $k$, and is a density function by Bochner's theorem. Informally, if we let $\Psi$ denote the matrix whose columns are indexed by $\eta \in \mathbb{R}^d$, with $\Psi_{j,\eta} = e^{-2\pi i\,\eta^Tx_j}$, then (8) gives $K = \Psi D\Psi^*$, where $D$ is diagonal with $D_{\eta,\eta} = p(\eta)$, and $\Psi^*$ denotes the Hermitian transpose.
The idea of random Fourier features is to select frequencies $\eta_1, \ldots, \eta_s$ independently according to the density $p$ and set $\tilde\Phi_{j,\ell} = \frac{1}{\sqrt{s}}\,e^{-2\pi i\,\eta_\ell^Tx_j}$. $\tilde\Phi\tilde\Phi^*$ is then used to approximate $K$.
In recent work, Avron et al. [AKM17] give a new analysis of random Fourier features. Extending prior work on ridge leverage scores in the discrete setting [AM15, CMM17], they define the ridge leverage function for parameter $\lambda > 0$:
$$\tau_\lambda(\eta) = p(\eta)\,\psi(\eta)^*\,(K + \lambda I)^{-1}\,\psi(\eta),$$
where $\psi(\eta) \in \mathbb{C}^n$ is given by $\psi(\eta)_j = e^{-2\pi i\,\eta^Tx_j}$.
As part of their results, which seek $\tilde\Phi$ such that $\tilde\Phi\tilde\Phi^* + \lambda I$ spectrally approximates $K + \lambda I$, they prove the following:
Lemma 6 (Ridge leverage function bound).
For all $\eta \in \mathbb{R}^d$, $\tau_\lambda(\eta) \le \frac{n}{\lambda}\,p(\eta)$.
While simple, this bound is key to our algorithm. It was shown in [CMM17] that if the columns of a matrix are sampled with probabilities proportional to over-approximations of their ridge leverage scores (with appropriately set $\lambda$), the sample is a projection-cost preserving sketch for the original matrix. That is, it can be used as a surrogate in computing a low-rank approximation. The results of [CMM17] carry over to the continuous setting, giving, in conjunction with Lemma 6:
Lemma 7 (Projection-cost preserving sketch via random Fourier features).
Consider any $A \in \mathbb{R}^{n \times d}$ and shift-invariant kernel $k$ with $k(0) = 1$, with associated kernel matrix $K$ and kernel Fourier transform $p$. For any $\epsilon, \lambda > 0$, let $s \ge \frac{c\,n\log n}{\epsilon^2\lambda}$ for sufficiently large constant $c$, and let $\tilde\Phi \in \mathbb{C}^{n \times s}$ have entries $\tilde\Phi_{j,\ell} = \frac{1}{\sqrt{s}}\,e^{-2\pi i\,\eta_\ell^Tx_j}$, where $\eta_1, \ldots, \eta_s$ are sampled independently according to $p$. Then with probability $99/100$, for any orthonormal $Q \in \mathbb{R}^{n \times k}$ and any $\Phi$ with $\Phi\Phi^T = K$:
$$(1-\epsilon)\,\|\Phi - QQ^T\Phi\|_F^2 - \lambda k \;\le\; \|\tilde\Phi - QQ^T\tilde\Phi\|_F^2 \;\le\; (1+\epsilon)\,\|\Phi - QQ^T\Phi\|_F^2 + \lambda k.$$
3.2 Input sparsity time implementation
Corollary 8.
Given $\tilde\Phi$ sampled as in Lemma 7 with $\lambda \le \frac{\epsilon}{k}\,\|\Phi - \Phi_k\|_F^2$, there is an algorithm running in $O(ns^{\omega-1})$ time that computes orthonormal $Z \in \mathbb{R}^{n \times k}$ satisfying, with high probability, for any $\Phi$ with $\Phi\Phi^T = K$:
$$\|\Phi - ZZ^T\Phi\|_F^2 \le (1+\epsilon)\,\|\Phi - \Phi_k\|_F^2.$$
With Corollary 8 in place, the main bottleneck to our approach becomes computing $\tilde\Phi$.
3.2.1 Sampling Frequencies
To compute $\tilde\Phi$, we first sample $\eta_1, \ldots, \eta_s$ according to $p$. Here we use the rotational invariance of $k$. In this case, $p$ is also rotationally invariant [LSS13] and so, letting $\bar p$ be the distribution over norms of vectors sampled from $p$, we can sample $\eta_1, \ldots, \eta_s$ by first selecting random Gaussian vectors and then rescaling them to have norms distributed according to $\bar p$. That is, we can write $[\eta_1, \ldots, \eta_s]^T = DG$, where $G \in \mathbb{R}^{s \times d}$ is a random Gaussian matrix and $D$ is a diagonal rescaling matrix with $D_{\ell,\ell} = \frac{\nu_\ell}{\|g_\ell\|}$, where $g_\ell$ is the $\ell^{th}$ row of $G$ and $\nu_\ell$ is sampled from $\bar p$. We will assume that $\bar p$ can be sampled from in $O(1)$ time. This is true for many natural kernels – e.g., for the Gaussian kernel, $p$ is just a Gaussian density.
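A toy sketch of this sampling scheme (`norm_sampler` is a hypothetical helper standing in for sampling from $\bar p$):

```python
import numpy as np

def sample_rot_invariant(s, d, norm_sampler, seed=0):
    """Sample s frequencies from a rotationally invariant density over R^d:
    Gaussian vectors give uniformly random directions, which are then
    rescaled to norms drawn from the 1-d norm distribution norm_sampler."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((s, d))
    target = norm_sampler(s, rng)                     # norms ~ \bar{p}
    return G * (target / np.linalg.norm(G, axis=1))[:, None]

# Sanity check: for the Gaussian kernel the frequency density is itself
# Gaussian, so the norm distribution is chi with d degrees of freedom,
# and the rescaled samples should match a standard Gaussian.
d = 4
chi = lambda s, rng: np.sqrt(rng.chisquare(d, size=s))
W = sample_rot_invariant(100_000, d, chi)
```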
Due to our large sample size $s$, even writing down $G$ above requires $\Theta(sd)$ time. However, to form $\tilde\Phi$ we do not need $G$ itself: it suffices to compute $\eta_\ell^Tx_j$ for each column $\ell$ and datapoint $j$, since $\tilde\Phi_{j,\ell} = \frac{1}{\sqrt{s}}\,e^{-2\pi i\,\eta_\ell^Tx_j}$. This requires computing $AG^T$, which contains the appropriate dot products