Kernel methods proved to be an efficient technique in numerous real-world problems. The core idea of kernel methods is the kernel trick – compute an inner product in a high-dimensional (or even infinite-dimensional) feature space by means of a kernel function :
where is a non-linear feature map transporting elements of input space into a feature space . It is a common knowledge that kernel methods incur space and time complexity infeasible to be used with large-scale datasets directly. For example, kernel regression has training time, memory, prediction time complexity for data points in original -dimensional space .
One of the most successful techniques to handle this problem, known as Random Fourier Features (RFF) proposed by [Rahimi and Recht, 2008], introduces a low-dimensional randomized approximation to feature maps:
This is essentially carried out by using Monte-Carlo sampling to approximate scalar product in (1). A randomized -dimensional mapping applied to the original data input allows employing standard linear methods, i.e. reverting the kernel trick. In doing so one reduces the complexity to that of linear methods, e.g. -dimensional approximation admits training time, memory and prediction time.
It is well known that as , the inner product in (2) converges to the exact kernel . Recent research [Yang et al., 2014, Felix et al., 2016, Choromanski and Sindhwani, 2016] aims to improve the convergence of approximation so that a smaller can be used to obtain the same quality of approximation.
This paper considers kernels that allow the following integral representation
For example, the popular Gaussian kernel admits such representation with , where .
The class of kernels admitting the form in (3
) covers shift-invariant kernels (e.g. radial basis function (RBF) kernels) and Pointwise Nonlinear Gaussian (PNG) kernels. They are widely used in practice and have interesting connections with neural networks[Cho and Saul, 2009, Williams, 1997].
The main challenge for the construction of low-dimensional feature maps is the approximation of the expectation in (3) which is -dimensional integral with Gaussian weight. Unlike other research studies we refrain from using simple Monte Carlo estimate of the integral, instead, we propose to use specific quadrature rules. We now list our contributions:
We propose to use spherical-radial quadrature rules to improve kernel approximation accuracy. We show that these quadrature rules generalize the RFF-based techniques. We also provide an analytical estimate of the error for the used quadrature rules that implies better approximation quality.
We use structured orthogonal matrices (so-called butterfly matrices
) when designing quadrature rule that allow fast matrix by vector multiplications. As a result, we speed up the approximation of the kernel function and reduce memory requirements.
We carry out an extensive empirical study comparing our methods with the state-of-the-art ones on a set of different kernels in terms of both kernel approximation error and downstream tasks performance. The study supports our hypothesis on the exceeding accuracy of the method.
2 Quadrature Rules and Random Features
We start with rewriting the expectation in Equation (3) as integral of with respect to :
Integration can be performed by means of quadrature rules. The rules usually take a form of interpolating function that is easy to integrate. Given such a rule, one may sample points from the domain of integration and calculate the value of the rule at these points. Then, the sample average of the rule values would yield the approximation of the integral.
The connection between integral approximation and mapping is straightforward. In what follows we show a brief derivation of the quadrature rules that allow for an explicit mapping of the form: where the choice of the weights and the points is dictated by the quadrature.
We use the average of sampled quadrature rules developed by [Genz and Monahan, 1998]
to yield unbiased estimates of. A change of coordinates is the first step to facilitate stochastic spherical-radial rules. Now, let , with , so that for , leaving us with (to ease the notation we substitute with )
is now a double integral over the unit -sphere and over the radius. To account for both integration regions we apply a combination of spherical () and radial () rules known as spherical-radial () rules. To provide an intuition how the rules work, here we briefly state and discuss their form.
Stochastic radial rules of degree have the form of the weighted symmetric sums and approximate the infinite range integral . Note that when is set to the function of interest, corresponds to the inner integral in (4). To get an unbiased estimate for , points are sampled from specific distributions. The weights are derived so that the rule is exact for polynomials of degree and give unbiased estimate for other functions.
Stochastic spherical rules where
is a random orthogonal matrix, approximate an integral of a functionover the surface of unit -sphere , where are points on , i.e. . Remember that the outer integral in (4) has as its integration region. The weights are stochastic with distribution such that the rule is exact for polynomials of degree and gives unbiased estimate for other functions.
Stochastic spherical-radial rules of degree are given by the following expression222Please see Genz and Monahan  for detailed derivation of SR rules.
where the distributions of weights are such that if degrees of radial rules and spherical rules coincide, i.e. , then the rule is exact for polynomials of degree and gives unbiased estimate of the integral for other functions.
2.1 Spherical-radial rules of degree is RFF
If we take radial rule of degree and spherical rule of degree , we obtain the following rule where . It is easy to see that , and for shift invariant kernel , thus, the rule reduces to , where
Now, RFF [Rahimi and Recht, 2008]
makes approximation of the RBF kernel in exactly the same way: it generates random vector from Gaussian distribution and calculates the corresponding feature map.
Random Fourier Features for RBF kernel are SR rules of degree .
2.2 Spherical-radial rules of degree is ORF
Now, let’s take radial rule of degree 1 and spherical rule of degree 3. In this case we get the following spherical-radial rule where , is an
-th column of the identity matrix.
Let us compare rules with Orthogonal Random Features [Felix et al., 2016] for the RBF kernel. In the ORF approach, the weight matrix is generated, where is a diagonal matrix with the entries drawn independently from distribution and is a random orthogonal matrix. The approximation of the kernel is then given by , where is the -th row of the matrix . As the rows of are orthonormal, they can be represented as .
Orthogonal Random Features for RBF kernel are SR rules of degree .
2.3 Spherical-radial rules of degree
We go further and take both spherical and radial rules of degree 3, where we use original and reflected vertices of randomly rotated unit vertex regular -simplex as the points on the unit sphere
where is the number of sampled rules. Speaking in terms of the approximate feature maps, the new feature dimension in case of the quadrature based approximation equals as we sample rules and evaluate each of them at random points and zero point.
In this work we propose to modify the quadrature rule by generating for each , i.e. It doesn’t affect the quality of approximation while simplifies an analysis of the quadrature-based random features.
We finally arrive at the map where 333To get , you need to sample two times on average (see Appendix for details)., , is the -th row in the matrix , . To get features one simply stacks such matrices so that , where only and are generated randomly . For Gaussian kernel, . For the 0-order arc-cosine kernel, , where is the Heaviside function. For the 1-order arc-cosine kernel, .
2.4 Generating uniformly random orthogonal matrices
The SR rules require a random orthogonal matrix . If follows Haar distribution, the averaged samples of rules provide an unbiased estimate for (4). Essentially, Haar distribution means that all orthogonal matrices in the group are equiprobable, i.e. uniformly random. Methods for sampling such matrices vary in their complexity of generation and multiplication.
We test two algorithms for obtaining, where is a random Householder/Givens matrix and a diagonal matrix has entries such that . It implicates no fast matrix multiplication. We test both methods for random orthogonal matrix generation and, since their performance coincides, we leave this one out for cleaner figures in the Experiments section.
The other choice for are so-called butterfly matrices Genz . For
where is sine and cosine of some angle . For definition and discussion please see Appendix. The factors of are structured and allow fast matrix multiplication. The method using butterfly matrices is denoted by in the Experiments section.
3 Error bounds
Let be a diameter of the compact set and
be the probability density
corresponding to the kernel.
Let us suppose that
be the probability density corresponding to the kernel. Let us suppose that, for all , and for all , where . Then for Quadrature-based Features approximation of the kernel function and any it holds
where . Thus we can construct approximation with error no more than with probability at least as long as
The proof of this proposition closely follows [Sutherland and Schneider, 2015], details can be found in the Appendix.
Term depends on dimension , its maximum is , and , though it is lower for small . Let us compare this probability bound with the similar result for RFF in [Sutherland and Schneider, 2015]. Under the same conditions the required number of samples to achieve error no more than with probability at least for RFF is the following
For Quadrature-based Features for RBF kernel , therefore, we obtain
The asymptotics is the same, however, the constants are smaller for our approach. See Section 4 for empirical justification of the obtained result.
Proposition 3.2 ([Sutherland and Schneider, 2015]).
Given a training set , with and ,
let denote the result of kernel ridge regression using the positive semi-definite training kernel matrix
denote the result of kernel ridge regression using the positive semi-definite training kernel matrix, test kernel values and regularization parameter . Let be the same using a PSD approximation to the training kernel matrix and test kernel values . Further, assume that the training labels are centered, , and let . Also suppose . Then
Suppose that for all . Then and . By denoting we obtain Therefore,
So, for the quadrature rules we can guarantee with probability at least as long as
We extensively study the proposed method on several established benchmarking datasets: Powerplant, LETTER, USPS, MNIST, CIFAR100 [Krizhevsky and Hinton, 2009], LEUKEMIA [Golub et al., 1999]. In Section 4.2 we show kernel approximation error across different kernels and number of features. We also report the quality of SVM models with approximate kernels on the same data sets in Section 4.3.
We present a comparison of our method () with estimators based on a simple Monte Carlo, quasi-Monte Carlo [Yang et al., 2014] and Gaussian quadratures [Dao et al., 2017]. The Monte Carlo approach has a variety of ways to generate samples: unstructured Gaussian [Rahimi and Recht, 2008], structured Gaussian [Felix et al., 2016], random orthogonal matrices (ROM) [Choromanski et al., 2017].
Monte Carlo integration (G, Gort, ROM). The kernel is estimated as , where is a random weight matrix. For unstructured Gaussian based approximation , where . Structured Gaussian has , where , is obtained from RQ decomposition of , is a diagonal matrix with diagonal elements sampled from the distribution. In compliance with the previous work on ROM we use -Rademacher with three blocks: , where is a normalized Hadamard matrix and .
Quasi-Monte Carlo integration (QMC). Quasi-Monte Carlo integration boasts improved rate of convergence compared to of Monte Carlo, however, as empirical results illustrate its performance is poorer than that of orthogonal random features [Felix et al., 2016]. It has larger constant factor hidden under notation in computational complexity. For QMC the weight matrix is generated as a transformation of quasi-random sequences. We run our experiments with Halton sequences in compliance with the previous work.
Gaussian quadratures (GQ). We included subsampled dense grid method from [Dao et al., 2017] into our comparison as it is the only data-independent approach from the paper that is shown to work well. We reimplemented code for the paper to the best of our knowledge as it is not open sourced.
4.2 Kernel approximation
To measure kernel approximation quality we use relative error in Frobenius norm , where and denote exact kernel matrix and its approximation. In line with previous work we run experiments for the kernel approximation on a random subset of a dataset. Table 2 displays the settings for the experiments across the datasets.
Approximation was constructed for different number of samples , where is an original feature space dimensionality and
is the new one. For the Gaussian kernel we set hyperparameterto the default value of for all the approximants, while the arc-cosine kernels (see definition of arc-cosine kernel in the Appendix) have no hyperparameters.
We run experiments for each [kernel, dataset,
] tuple and plot 95% confidence interval around the mean value line. Figure1 shows the results for kernel approximation error on LETTER, MNIST, CIFAR100 and LEUKEMIA datasets.
QMC method almost always coincides with RFF except for arc-cosine 0 kernel. It particularly enjoys Powerplant dataset with , i.e. small number of features. Possible explanation for such behaviour can be due to the connection with QMC quadratures. The worst case error for QMC quadratures scales with , where is the dimensionality and is the number of sample points [Owen, 1998]. It is worth mentioning that for large it is also a problem to construct a proper QMC point set. Thus, in higher dimensions QMC may bring little practical advantage over MC. While recent randomized QMC techniques indeed in some cases have no dependence on , our approach is still computationally more efficient thanks to the structured matrices. GQ method as well matches the performance of RFF. We omit both QMC and GQ from experiments on datasets with large (CIFAR100, LEUKEMIA).
The empirical results in Figure 1 support our hypothesis about the advantages of quadratures applied to kernel approximation compared to SOTA methods. With an exception of a couple of cases: (Arc-cosine 0, Powerplant) and (Gaussian, USPS), our method displays clear exceeding performance.
4.3 Classification/regression with new features
We report accuracy and scores for the classification/regression tasks on some of the datasets (Figure 2
). We examine the performance with the same setting as in experiments for kernel approximation error, except now we map the whole dataset. We use Support Vector Machines to obtain predictions.
Kernel approximation error does not fully define the final prediction accuracy – the best performing kernel matrix approximant not necessarily yields the best accuracy or score. However, the empirical results illustrate that our method delivers comparable and often superior quality on the downstream tasks.
4.4 Walltime experiment
We measure time spent on explicit mapping of features by running each experiment 50 times and averaging the measurements. Indeed, Figure 3 demonstrates that the method scales as theoretically predicted with larger dimensions thanks to the structured nature of the mapping.
5 Related work
The most popular methods for scaling up kernel methods are based on a low-rank approximation of the kernel using either data-dependent or independent basis functions. The first one includes Nyström method [Drineas and Mahoney, 2005], greedy basis selection techniques [Smola and Schölkopf, 2000], incomplete Cholesky decomposition [Fine and Scheinberg, 2001].
The construction of basis functions in these techniques utilizes the given training set making them more attractive for some problems compared to Random Fourier Features approach. In general, data-dependent approaches perform better than data-independent approaches when there is a gap in the eigen-spectrum of the kernel matrix. The rigorous study of generalization performance of both approaches can be found in [Yang et al., 2012].
In data-independent techniques, the kernel function is approximated directly. Most of the methods (including the proposed approach) that follow this idea are based on Random Fourier Features [Rahimi and Recht, 2008]. They require so-called weight matrix that can be generated in a number of ways. [Le et al., 2013] form the weight matrix as a product of structured matrices. It enables fast computation of matrix-vector products and speeds up generation of random features.
Another work [Felix et al., 2016] orthogonalizes the features by means of orthogonal weight matrix. This leads to less correlated and more informative features increasing the quality of approximation. They support this result both analytically and empirically. The authors also introduce matrices with some special structure for fast computations. [Choromanski et al., 2017] propose a generalization of the ideas from [Le et al., 2013] and [Felix et al., 2016], delivering an analytical estimate for the mean squared error (MSE) of approximation.
All these works use simple Monte Carlo sampling. However, the convergence can be improved by changing Monte Carlo sampling to Quasi-Monte Carlo sampling. Following this idea [Yang et al., 2014] apply quasi-Monte Carlo to Random Fourier Features. In [Yu et al., 2015] the authors make attempt to improve quality of the approximation of Random Fourier Features by optimizing sequences conditioning on a given dataset.
Among the recent papers there are works that, similar to our approach, use the numerical integration methods to approximate kernels. While [Bach, 2017] carefully inspects the connection between random features and quadratures, they did not provide any practically useful explicit mappings for kernels. Leveraging the connection [Dao et al., 2017] propose several methods with Gaussian quadratures. Among them three schemes are data-independent and one is data-dependent. The authors do not compare them with the approaches for random feature generation other than random Fourier features. The data-dependent scheme optimizes the weights for the quadrature points to yield better performance.
We propose an approach for the random features methods for kernel approximation, revealing a new interpretation of RFF and ORF. The latter are special cases of the spherical-radial quadrature rules with degrees (1,1) and (1,3) respectively. We take this further and develop a more accurate technique for the random features preserving the time and space complexity of the random orthogonal embeddings.
Our experimental study confirms that for many kernels on the most datasets the proposed approach delivers the best kernel approximation. Additionally, the results showed that the quality of the downstream task (classification/regression) is also superior or comparable to the state-of-the-art baselines.
This work was supported by the Ministry of Science and Education of Russian Federation as a part of Mega Grant Research Project 14.756.31.0001.
- Anderson et al.  Theodore W Anderson, Ingram Olkin, and Les G Underhill. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625–629, 1987.
- Avron and Sindhwani  Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 58(3):341–349, 2016.
On the equivalence between kernel quadrature rules and random feature
Journal of Machine Learning Research, 18(21):1–38, 2017.
- Baker  John A Baker. Integration over spheres and the divergence theorem for balls. The American Mathematical Monthly, 104(1):36–47, 1997.
- Bergstra et al.  James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115–123, 2013.
- Bochner  Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108(1):378–410, 1933.
- Chen et al.  Xixian Chen, Haiqin Yang, Irwin King, and Michael R Lyu. Training-efficient feature map for shift-invariant kernels. In IJCAI, pages 3395–3401, 2015.
Cho and Saul 
Youngmin Cho and Lawrence K Saul.
Kernel methods for deep learning.In Advances in Neural Information Processing Systems, pages 342–350, 2009.
- Choromanski and Sindhwani  Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. arXiv preprint arXiv:1605.09049, 2016.
- Choromanski et al.  Krzysztof Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of random orthogonal embeddings. arXiv preprint arXiv:1703.00864, 2017.
- Dao et al.  Tri Dao, Christopher M De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, pages 6109–6119, 2017.
- Drineas and Mahoney  Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
Fang and Li 
Kai-Tai Fang and Run-Ze Li.
Some methods for generating both an NT-net and the uniform distribution on a Stiefel manifold and their applications.Computational Statistics & Data Analysis, 24(1):29–46, 1997.
- Felix et al.  X Yu Felix, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal Random Features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
- Fine and Scheinberg  Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.
- Forrester et al.  Alexander Forrester, Andy Keane, et al. Engineering design via surrogate modelling: a practical guide. John Wiley & Sons, 2008.
- Genz  Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and Quasi-Monte Carlo Methods, pages 199–213, 1998.
- Genz and Monahan  Alan Genz and John Monahan. Stochastic integration rules for infinite regions. SIAM journal on scientific computing, 19(2):426–439, 1998.
- Genz and Monahan  Alan Genz and John Monahan. A stochastic algorithm for high-dimensional integrals over unbounded regions with gaussian weight. Journal of Computational and Applied Mathematics, 112(1):71–81, 1999.
- Golub et al.  Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
- Haykin  Simon Haykin. Cognitive dynamic systems: perception-action cycle, radar and radio. Cambridge University Press, 2012.
- Huang et al.  Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on timit. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 205–209. IEEE, 2014.
- Krizhevsky and Hinton  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Le et al.  Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.
- Mezzadri  Francesco Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.
- Monahan and Genz  John Monahan and Alan Genz. Spherical-radial integration rules for bayesian computation. Journal of the American Statistical Association, 92(438):664–674, 1997.
- Owen  Art B Owen. Latin supercube sampling for very high-dimensional simulations. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):71–102, 1998.
- Rahimi and Recht  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
- Rudin  Walter Rudin. Fourier analysis on groups. Courier Dover Publications, 2017.
- Smola and Schölkopf  Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
- Stewart  G. W. Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409, 1980. ISSN 00361429. URL http://www.jstor.org/stable/2156882.
- Sutherland and Schneider  Dougal J Sutherland and Jeff Schneider. On the error of random fourier features. arXiv preprint arXiv:1506.02785, 2015.
- Williams  Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
- Yang et al.  Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of The 31st International Conference on Machine Learning (ICML-14), pages 485–493, 2014.
- Yang et al.  Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.
- Yu et al.  Felix X Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.
1 Proof of Proposition 3.1
1.1 Variance of the degree quadrature rule
Let us denote , , . Then it is easy to see that .
Let us denote , . Using the above definitions we obtain
Variance of the first term
Variance of the second term (using independence of and for )
Variance of the last term (using Cauchy-Schwarz inequality)
Now, let us upper bound term
and it concludes the proof.
1.2 Error probability
The proof strategy closely follows that of [Sutherland and Schneider, 2015]; we just use Chebyshev-Cantelli ineqaulity instead of Hoeffding’s and Bernstein inequalities and all the expectations are calculated according to our quadrature rules.
Let , is compact set in with diameter , so we can cover it with an -net using at most balls of radius . Let denote their centers, and be the Lipschitz constant of . If for all and , then for all .
1.2.1 Regularity Condition
Similarly to [Sutherland and Schneider, 2015] (regularity condition section in appendix) it can be proven that .
1.2.2 Lipschitz Constant
Since is differentiable, , where . Via Jensen’s inequality . Then using independence of and for
where . Then using Markov’s inequality we obtain
1.2.3 Anchor points
Let us upper bound the following probability
Let us rewrite the function
where . Let us suppose that . Then we can apply Hoeffding’s inequality
1.2.4 Optimizing over
Now the probability of takes the form
where , . Maximizing this probability over gives us the following bound