1 Introduction
Kernel methods have proved to be an efficient technique for numerous real-world problems. The core idea of kernel methods is the kernel trick: compute an inner product in a high-dimensional (or even infinite-dimensional) feature space by means of a kernel function $k(\mathbf{x}, \mathbf{y})$:
$$k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{F}}, \qquad (1)$$
where $\varphi: \mathcal{X} \to \mathcal{F}$ is a nonlinear feature map transporting elements of the input space $\mathcal{X}$ into a feature space $\mathcal{F}$. It is common knowledge that kernel methods incur space and time complexity that is infeasible for direct use with large-scale datasets. For example, kernel regression has $O(N^3 + N^2 d)$ training time, $O(N^2)$ memory and $O(Nd)$ prediction time complexity for $N$ data points in the original $d$-dimensional space $\mathcal{X}$.
One of the most successful techniques to handle this problem, Random Fourier Features (RFF), proposed by [Rahimi and Recht, 2008], introduces a low-dimensional randomized approximation to feature maps:
$$k(\mathbf{x}, \mathbf{y}) \approx \hat{\psi}(\mathbf{x})^\top \hat{\psi}(\mathbf{y}). \qquad (2)$$
This is essentially carried out by using Monte Carlo sampling to approximate the scalar product in (1). A randomized $D$-dimensional mapping $\hat{\psi}(\cdot)$ applied to the original data input allows employing standard linear methods, i.e. reverting the kernel trick. In doing so one reduces the complexity to that of linear methods; e.g., a $D$-dimensional approximation admits $O(ND^2)$ training time, $O(ND)$ memory and $O(D)$ prediction time.
It is well known that as $D \to \infty$, the inner product in (2) converges to the exact kernel $k(\mathbf{x}, \mathbf{y})$. Recent research [Yang et al., 2014, Felix et al., 2016, Choromanski and Sindhwani, 2016] aims to improve the convergence of the approximation so that a smaller $D$ can be used to obtain the same approximation quality.
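To make the construction above concrete, the following sketch (in Python with NumPy; the function name and defaults are ours, not from the paper) builds the classical RFF map for the Gaussian kernel and compares the resulting Gram matrix with the exact one:

```python
import numpy as np

def rff_features(X, D, sigma=1.0, seed=None):
    """Randomized Fourier feature map with D frequency samples for the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Sample frequencies from the kernel's spectral density N(0, I / sigma^2).
    W = rng.standard_normal((d, D)) / sigma
    Z = X @ W
    # Stacked cos/sin features: <psi(x), psi(y)> -> k(x, y) as D grows.
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Psi = rff_features(X, D=4096, seed=1)
K_hat = Psi @ Psi.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
rel_err = np.linalg.norm(K_hat - K) / np.linalg.norm(K)
```

Since the diagonal features satisfy $\cos^2 + \sin^2 = 1$, the approximation is exact on the diagonal, and the off-diagonal error shrinks at the usual Monte Carlo rate.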
This paper considers kernels that allow the following integral representation:
$$k(\mathbf{x}, \mathbf{y}) = \mathbb{E}_{p(\mathbf{w})}\left[ f_{\mathbf{x}\mathbf{y}}(\mathbf{w}) \right], \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \mathbf{I}_d). \qquad (3)$$
For example, the popular Gaussian kernel admits such a representation with $f_{\mathbf{x}\mathbf{y}}(\mathbf{w}) = \phi(\mathbf{w}^\top \mathbf{x})^\top \phi(\mathbf{w}^\top \mathbf{y})$, where $\phi(z) = [\cos(z), \sin(z)]^\top$.
The class of kernels admitting the form in (3) covers shift-invariant kernels (e.g. radial basis function (RBF) kernels) and Pointwise Nonlinear Gaussian (PNG) kernels. They are widely used in practice and have interesting connections with neural networks [Cho and Saul, 2009, Williams, 1997]. The main challenge in the construction of low-dimensional feature maps is the approximation of the expectation in (3), which is a $d$-dimensional integral with Gaussian weight. Unlike other research studies, we refrain from using a simple Monte Carlo estimate of the integral; instead, we propose to use specific quadrature rules. We now list our contributions:

We propose to use spherical-radial quadrature rules to improve kernel approximation accuracy. We show that these quadrature rules generalize the RFF-based techniques. We also provide an analytical error estimate for the quadrature rules used, which implies better approximation quality.

We use structured orthogonal matrices (so-called butterfly matrices) when designing the quadrature rule; they allow fast matrix-vector multiplication. As a result, we speed up the approximation of the kernel function and reduce memory requirements.

We carry out an extensive empirical study comparing our method with state-of-the-art ones on a set of different kernels, in terms of both kernel approximation error and downstream task performance. The study supports our hypothesis about the superior accuracy of the method.
2 Quadrature Rules and Random Features
We start by rewriting the expectation in Equation (3) as an integral of $f_{\mathbf{x}\mathbf{y}}$ with respect to the Gaussian density $p(\mathbf{w})$:
$$I(f_{\mathbf{x}\mathbf{y}}) = (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} e^{-\frac{\mathbf{w}^\top \mathbf{w}}{2}} f_{\mathbf{x}\mathbf{y}}(\mathbf{w}) \, d\mathbf{w}.$$
Integration can be performed by means of quadrature rules. Such rules usually take the form of an interpolating function that is easy to integrate. Given such a rule, one may sample points from the domain of integration and calculate the value of the rule at these points. The sample average of the rule values then yields an approximation of the integral.
The connection between integral approximation and the mapping $\hat{\psi}$ is straightforward. In what follows we show a brief derivation of quadrature rules that allow for an explicit mapping of the form $k(\mathbf{x}, \mathbf{y}) \approx \sum_j w_j f_{\mathbf{x}\mathbf{y}}(\mathbf{w}_j)$, where the choice of the weights $w_j$ and the points $\mathbf{w}_j$ is dictated by the quadrature rule.
We use the average of sampled quadrature rules developed by [Genz and Monahan, 1998] to yield unbiased estimates of $I(f_{\mathbf{x}\mathbf{y}})$. A change of coordinates is the first step to facilitate stochastic spherical-radial rules. Let $\mathbf{w} = r\mathbf{z}$, with $\mathbf{z}^\top \mathbf{z} = 1$, so that $\mathbf{w}^\top \mathbf{w} = r^2$ for $r \in [0, \infty)$, leaving us with (to ease the notation we substitute $f_{\mathbf{x}\mathbf{y}}$ with $f$)
$$I(f) = (2\pi)^{-\frac{d}{2}} \int_{\mathbb{U}_d} \int_0^{\infty} r^{d-1} e^{-\frac{r^2}{2}} f(r\mathbf{z}) \, dr \, d\mathbf{z}. \qquad (4)$$
$I(f)$ is now a double integral over the unit sphere $\mathbb{U}_d$ and over the radius. To account for both integration regions, we apply a combination of spherical ($S$) and radial ($R$) rules, known as spherical-radial ($SR$) rules. To provide an intuition for how these rules work, we briefly state and discuss their form.
Stochastic radial rules of degree $2l+1$ have the form of weighted symmetric sums $R(h) = \sum_{i=0}^{l} \hat{w}_i \frac{h(\rho_i) + h(-\rho_i)}{2}$ and approximate the infinite-range integral $T(h) = \int_0^\infty h(r) \, r^{d-1} e^{-\frac{r^2}{2}} dr$. Note that when $h$ is set to the function of interest $f$, $T(h)$ corresponds to the inner integral in (4). To get an unbiased estimate for $T(h)$, the points $\rho_i$ are sampled from specific distributions. The weights $\hat{w}_i$ are derived so that the rule is exact for polynomials of degree $2l+1$ and gives an unbiased estimate for other functions.
Stochastic spherical rules $S_{\mathbf{Q}}(s) = \sum_j \tilde{w}_j s(\mathbf{Q}\mathbf{z}_j)$, where $\mathbf{Q}$ is a random orthogonal matrix, approximate the integral of a function $s(\mathbf{z})$ over the surface of the unit sphere $\mathbb{U}_d$; here the $\mathbf{z}_j$ are points on $\mathbb{U}_d$, i.e. $\mathbf{z}_j^\top \mathbf{z}_j = 1$. Remember that the outer integral in (4) has $\mathbb{U}_d$ as its integration region. The weights $\tilde{w}_j$ are stochastic, with a distribution such that the rule is exact for polynomials of degree $p$ and gives an unbiased estimate for other functions. Stochastic spherical-radial rules $SR_{\mathbf{Q},\rho}$ of degree $(2l+1, p)$ are given by combining a radial rule with a spherical rule (see Genz and Monahan [1998] for a detailed derivation of SR rules),
where the distributions of the weights are such that if the degrees of the radial and spherical rules coincide, i.e. $p = 2l+1$, then the rule is exact for polynomials of degree $2l+1$ and gives an unbiased estimate of the integral for other functions.
2.1 Spherical-radial rules of degree (1, 1) are RFF
If we take the radial rule of degree 1 and the spherical rule of degree 1, we obtain the following rule: $SR_{\mathbf{Q},\rho}^{1,1}(f) = \frac{f(-\rho \mathbf{Q} \mathbf{e}_1) + f(\rho \mathbf{Q} \mathbf{e}_1)}{2}$, where $\rho \sim \chi(d)$. It is easy to see that $\mathbf{Q}\mathbf{e}_1 \sim \mathcal{U}(\mathbb{U}_d)$, so $\mathbf{w} = \rho \mathbf{Q} \mathbf{e}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$, and for a shift-invariant kernel $f(\mathbf{w}) = f(-\mathbf{w})$; thus, the rule reduces to $SR_{\mathbf{Q},\rho}^{1,1}(f) = f(\mathbf{w})$, where $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$.
Now, RFF [Rahimi and Recht, 2008] approximates the RBF kernel in exactly the same way: it generates a random vector from a Gaussian distribution and calculates the corresponding feature map.
Proposition 2.1.
Random Fourier Features for the RBF kernel are SR rules of degree $(1, 1)$.
2.2 Spherical-radial rules of degree (1, 3) are ORF
Now, let's take the radial rule of degree 1 and the spherical rule of degree 3. In this case we get the following spherical-radial rule: $SR_{\mathbf{Q},\rho}^{1,3}(f) = \frac{1}{2d} \sum_{j=1}^{d} \left[ f(-\rho \mathbf{Q} \mathbf{e}_j) + f(\rho \mathbf{Q} \mathbf{e}_j) \right]$, where $\rho \sim \chi(d)$ and $\mathbf{e}_j$ is the $j$-th column of the identity matrix.
Let us compare $SR^{1,3}$ rules with Orthogonal Random Features [Felix et al., 2016] for the RBF kernel. In the ORF approach, the weight matrix $\mathbf{W} = \mathbf{S}\mathbf{Q}$ is generated, where $\mathbf{S}$ is a diagonal matrix with the entries drawn independently from the $\chi(d)$ distribution and $\mathbf{Q}$ is a random orthogonal matrix. The approximation of the kernel is then given by the average of $\phi(\mathbf{w}_j^\top \mathbf{x})^\top \phi(\mathbf{w}_j^\top \mathbf{y})$ over the rows, where $\mathbf{w}_j$ is the $j$-th row of the matrix $\mathbf{W}$. As the rows of $\mathbf{Q}$ are orthonormal, they can be represented as $\mathbf{Q}^\top \mathbf{e}_j$.
Proposition 2.2.
Orthogonal Random Features for the RBF kernel are SR rules of degree $(1, 3)$.
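A minimal sketch of the ORF weight-block construction described above, for the case $\sigma = 1$ (the helper name is ours): a Haar-uniform orthogonal matrix whose row lengths are redrawn from the $\chi(d)$ distribution, so that each row is marginally Gaussian while the rows stay mutually orthogonal.

```python
import numpy as np

def orf_weight_block(d, seed=None):
    """One d x d block of the ORF weight matrix W = S Q: Q is Haar-uniform
    orthogonal, S is diagonal with chi(d) entries (the distribution of the
    norm of a d-dimensional standard Gaussian vector)."""
    rng = np.random.default_rng(seed)
    # Haar-uniform Q via QR of a Gaussian matrix, with column signs fixed
    # so the distribution is exactly uniform over the orthogonal group.
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q *= np.sign(np.diag(R))
    # Redraw row lengths from chi(d).
    s = np.sqrt(rng.chisquare(df=d, size=d))
    return s[:, None] * Q

W = orf_weight_block(8, seed=0)
```

The rows of `W` remain orthogonal by construction, which is what distinguishes ORF from unstructured Gaussian frequencies.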
2.3 Spherical-radial rules of degree (3, 3)
We go further and take both spherical and radial rules of degree 3, where we use the original and reflected vertices $\mathbf{v}_j$ of a randomly rotated unit regular simplex as the points on the unit sphere:
$$SR_{\mathbf{Q},\rho}^{3,3}(f) = \left(1 - \frac{d}{\rho^2}\right) f(\mathbf{0}) + \frac{d}{2(d+1)\rho^2} \sum_{j=1}^{d+1} \left[ f(-\rho \mathbf{Q} \mathbf{v}_j) + f(\rho \mathbf{Q} \mathbf{v}_j) \right], \qquad (5)$$
where $\rho \sim \chi(d+2)$. We apply (5) to the approximation of (4) by averaging the samples of $SR_{\mathbf{Q},\rho}^{3,3}$:
$$I(f) \approx \frac{1}{n} \sum_{i=1}^{n} SR_{\mathbf{Q}_i,\rho_i}^{3,3}(f), \qquad (6)$$
where $n$ is the number of sampled rules. In terms of approximate feature maps, the new feature dimension in the case of the quadrature-based approximation equals $D = 2n(d+1) + 1$, as we sample $n$ rules and evaluate each of them at $2(d+1)$ random points and the zero point.
In this work we propose to modify the quadrature rule by generating a separate $\rho_j \sim \chi(d+2)$ for each $\mathbf{v}_j$. This does not affect the quality of the approximation, while it simplifies the analysis of the quadrature-based random features.
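The spherical part of the rule can be illustrated in code. The hypothetical helper below (the name and the QR-based Haar sampling are our implementation choices) returns the $2(d+1)$ unit-norm points: the original and reflected vertices of a randomly rotated regular simplex.

```python
import numpy as np

def simplex_on_sphere(d, seed=None):
    """Return the 2(d+1) spherical points of the SR(3,3) rule: unit-norm
    vertices of a randomly rotated regular simplex in R^d, stacked with
    their reflections."""
    rng = np.random.default_rng(seed)
    # The d+1 centred basis vectors of R^{d+1} lie in the hyperplane
    # sum(x) = 0 and form a regular simplex.
    E = np.eye(d + 1) - 1.0 / (d + 1)
    # Orthonormal basis of that d-dimensional hyperplane (top-d right
    # singular vectors of E).
    B = np.linalg.svd(E)[2][:d]
    V = E @ B.T                                  # (d+1) x d vertex coordinates
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    # Haar-uniform rotation via sign-corrected QR of a Gaussian matrix.
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q *= np.sign(np.diag(R))
    V = V @ Q.T
    return np.vstack([V, -V])

P = simplex_on_sphere(5, seed=0)
```

The vertices of a regular simplex satisfy $\mathbf{v}_i^\top \mathbf{v}_j = -1/d$ for $i \neq j$, a property preserved by the random rotation.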
Explicit mapping
We finally arrive at the explicit map $\Psi(\mathbf{x})$ built from the values $\phi(\mathbf{w}_j^\top \mathbf{x})$ at the points $\mathbf{w}_j = \rho_j \mathbf{Q} \mathbf{v}_j$, where $\rho_j \sim \chi(d+2)$ (to get $\rho_j$, one needs to sample two times on average; see the Appendix for details) and $\mathbf{v}_j$ is the $j$-th original or reflected simplex vertex, i.e. the $j$-th row in the matrix $\mathbf{V}$. To get $D$ features one simply stacks $n$ such matrices, so that $D = 2n(d+1) + 1$, where only $\mathbf{Q}$ and $\boldsymbol{\rho}$ are generated randomly. For the Gaussian kernel, $\phi(z) = [\cos(z), \sin(z)]^\top$. For the 0-order arc-cosine kernel, $\phi(z) = \Theta(z)$, where $\Theta(\cdot)$ is the Heaviside step function. For the 1-order arc-cosine kernel, $\phi(z) = \max(0, z)$.
2.4 Generating uniformly random orthogonal matrices
The SR rules require a random orthogonal matrix $\mathbf{Q}$. If $\mathbf{Q}$ follows the Haar distribution, the averaged samples of the SR rules provide an unbiased estimate for (4). Essentially, the Haar distribution means that all orthogonal matrices in the group are equiprobable, i.e. uniformly random. Methods for sampling such matrices vary in their complexity of generation and multiplication.
We test two algorithms for obtaining $\mathbf{Q}$. The first uses the QR decomposition of a random Gaussian matrix to obtain a product of a sequence of reflectors/rotators, where each factor is a random Householder/Givens matrix and a diagonal matrix $\mathbf{D}$ with entries $d_{ii} = \pm 1$ completes the product. This construction admits no fast matrix multiplication. We tested both the Householder and the Givens variants of random orthogonal matrix generation and, since their performance coincides, we leave the latter out of the Experiments section for cleaner figures. The other choice for $\mathbf{Q}$ is the so-called butterfly matrix [Genz, 1998]: a product of Givens rotations arranged in an FFT-like pattern, whose entries are sines and cosines of random angles (for the definition and discussion please see the Appendix). The factors of the butterfly matrix are structured and allow fast matrix multiplication. The method using butterfly matrices is denoted by B in the Experiments section.
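The idea behind the fast multiplication can be sketched as an FFT-like cascade of Givens rotations. This is an illustration of the butterfly pairing pattern under our own angle parametrization, not Genz's exact construction:

```python
import numpy as np

def butterfly_matvec(x, thetas):
    """Apply a butterfly orthogonal transform to x in O(d log d) time
    without materialising the d x d matrix. thetas has shape
    (log2(d), d // 2): one rotation angle per pair per level; d must be
    a power of two."""
    y = np.array(x, dtype=float)
    d = y.size
    for level, th in enumerate(thetas):
        stride = 1 << level
        k = 0
        # Each level applies d/2 Givens rotations on disjoint index pairs
        # (an FFT-like pairing), so the whole transform stays orthogonal.
        for start in range(0, d, 2 * stride):
            for i in range(start, start + stride):
                j = i + stride
                c, s = np.cos(th[k]), np.sin(th[k])
                y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
                k += 1
    return y

rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 2.0 * np.pi, size=(3, 4))  # d = 8: log2(8) levels
x = rng.standard_normal(8)
y = butterfly_matvec(x, thetas)
```

Because every level is a product of disjoint plane rotations, the transform preserves norms exactly, while a matrix-vector product costs $O(d \log d)$ instead of $O(d^2)$.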
3 Error bounds
Proposition 3.1.
Let $l$ be the diameter of the compact set $\mathcal{X}$ and let $p(\mathbf{w})$ be the probability density corresponding to the kernel. Suppose that $|f_{\mathbf{x}\mathbf{y}}(\mathbf{w})| \le \kappa$ and $\|\nabla f_{\mathbf{x}\mathbf{y}}(\mathbf{w})\| \le \mu$ for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $\mathbf{w}$, where $\kappa, \mu > 0$. Then, for the quadrature-based features approximation of the kernel function and any $\varepsilon > 0$, the probability that the approximation error exceeds $\varepsilon$ decays exponentially in the number of samples $D$. Thus, we can construct an approximation with error no more than $\varepsilon$ with probability at least $1 - \delta$ as long as $D$ is sufficiently large.
The proof of this proposition closely follows [Sutherland and Schneider, 2015]; details can be found in the Appendix.
The leading term in the bound depends on the dimension $d$ and is lower for small $d$. Let us compare this probability bound with the analogous result for RFF in [Sutherland and Schneider, 2015]. Under the same conditions, the required number of samples for RFF to achieve error no more than $\varepsilon$ with probability at least $1 - \delta$ is the following:
For quadrature-based features for the RBF kernel we therefore obtain the corresponding sample-complexity bound. The asymptotics are the same; however, the constants are smaller for our approach. See Section 4 for an empirical justification of this result.
Proposition 3.2 ([Sutherland and Schneider, 2015]).
Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, let $\hat{f}$ denote the result of kernel ridge regression using the positive semidefinite training kernel matrix $\mathbf{K}$, test kernel values $\mathbf{k}$, and regularization parameter $\lambda$. Let $\tilde{f}$ be the same using a PSD approximation $\tilde{\mathbf{K}}$ to the training kernel matrix and test kernel values $\tilde{\mathbf{k}}$. Further, assume that the training labels are centered, $\sum_i y_i = 0$, and let $\sigma_y^2$ denote their variance. Suppose also that the approximation errors of the test kernel values and of the training kernel matrix are uniformly bounded. Then the pointwise difference $|\tilde{f}(\mathbf{x}) - \hat{f}(\mathbf{x})|$ is bounded in terms of these approximation errors. So, for the quadrature rules we can guarantee $|\tilde{f}(\mathbf{x}) - \hat{f}(\mathbf{x})| \le \varepsilon$ with probability at least $1 - \delta$ as long as the number of samples $D$ is sufficiently large.
4 Experiments
Method  Space  Time
ORF
QMC
ROM
Quadrature-based
Dataset  #samples  d  subset size  #runs

Powerplant  9568  4  550  500
LETTER  20000  16  550  500
USPS  9298  256  550  500
MNIST  70000  784  550  100
CIFAR100  60000  3072  50  50
LEUKEMIA  72  7129  10  10
We extensively study the proposed method on several established benchmark datasets: Powerplant, LETTER, USPS, MNIST, CIFAR100 [Krizhevsky and Hinton, 2009], and LEUKEMIA [Golub et al., 1999]. In Section 4.2 we report kernel approximation error across different kernels and numbers of features. We also report the quality of SVM models with approximate kernels on the same datasets in Section 4.3.
4.1 Methods
We present a comparison of our method (denoted B) with estimators based on simple Monte Carlo, quasi-Monte Carlo [Yang et al., 2014], and Gaussian quadratures [Dao et al., 2017]. The Monte Carlo approach has a variety of ways to generate samples: unstructured Gaussian [Rahimi and Recht, 2008], structured Gaussian [Felix et al., 2016], and random orthogonal matrices (ROM) [Choromanski et al., 2017].
Monte Carlo integration (G, Gort, ROM). The kernel is estimated as $\hat{k}(\mathbf{x}, \mathbf{y}) = \frac{1}{D} \phi(\mathbf{W}\mathbf{x})^\top \phi(\mathbf{W}\mathbf{y})$, where $\mathbf{W} \in \mathbb{R}^{D \times d}$ is a random weight matrix. For the unstructured Gaussian-based approximation (G), the entries of $\mathbf{W}$ are i.i.d. Gaussian. The structured Gaussian (Gort) has $\mathbf{W} = \mathbf{S}\mathbf{Q}$, where $\mathbf{Q}$ is obtained from the RQ decomposition of a random Gaussian matrix and $\mathbf{S}$ is a diagonal matrix with diagonal elements sampled from the $\chi(d)$ distribution. In compliance with the previous work on ROM, we use the Rademacher variant with three blocks: $\mathbf{W} = \frac{\sqrt{d}}{\sigma} \prod_{i=1}^{3} \mathbf{H}\mathbf{D}_i$, where $\mathbf{H}$ is a normalized Hadamard matrix and the $\mathbf{D}_i$ are diagonal matrices with i.i.d. Rademacher ($\pm 1$) entries on the diagonal.
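A sketch of the three-block Rademacher ROM weight matrix described above (the Sylvester Hadamard construction, helper name, and power-of-two restriction are implementation choices of ours):

```python
import numpy as np

def rom_weight_block(d, sigma=1.0, seed=None):
    """Three-block Rademacher ROM weights:
    W = (sqrt(d) / sigma) * H D1 H D2 H D3, where H is the normalised
    Hadamard matrix and each D_i has i.i.d. +-1 diagonal entries.
    Requires d to be a power of two."""
    rng = np.random.default_rng(seed)
    # Sylvester construction of the d x d Hadamard matrix, then normalise
    # so that H is orthogonal.
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    H /= np.sqrt(d)
    W = np.sqrt(d) / sigma * np.eye(d)
    for _ in range(3):
        D = np.diag(rng.choice([-1.0, 1.0], size=d))
        W = W @ H @ D
    return W

W = rom_weight_block(16, sigma=2.0, seed=0)
```

Since every factor is orthogonal, $\mathbf{W}\mathbf{W}^\top = \frac{d}{\sigma^2}\mathbf{I}$, and the Hadamard structure admits $O(d \log d)$ matrix-vector products (the dense products above are for clarity only).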
Quasi-Monte Carlo integration (QMC). Quasi-Monte Carlo integration boasts an improved rate of convergence compared to the $O(1/\sqrt{D})$ rate of Monte Carlo; however, as the empirical results illustrate, its performance is poorer than that of orthogonal random features [Felix et al., 2016]. It also has a larger constant factor hidden under the big-$O$ notation in its computational complexity. For QMC, the weight matrix $\mathbf{W}$ is generated as a transformation of quasi-random sequences. We run our experiments with Halton sequences, in compliance with previous work.
Gaussian quadratures (GQ). We included the subsampled dense grid method from [Dao et al., 2017] in our comparison, as it is the only data-independent approach from that paper which is shown to work well. We reimplemented it to the best of our ability, as the code is not open-sourced.
4.2 Kernel approximation
To measure kernel approximation quality we use the relative error in the Frobenius norm, $\|\mathbf{K} - \hat{\mathbf{K}}\|_F / \|\mathbf{K}\|_F$, where $\mathbf{K}$ and $\hat{\mathbf{K}}$ denote the exact kernel matrix and its approximation. In line with previous work, we run the kernel approximation experiments on a random subset of each dataset. Table 2 displays the settings for the experiments across the datasets.
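The error metric above is computed directly (a trivial helper; the name is ours):

```python
import numpy as np

def relative_error(K, K_hat):
    """Relative kernel approximation error ||K - K_hat||_F / ||K||_F."""
    return np.linalg.norm(K - K_hat, ord='fro') / np.linalg.norm(K, ord='fro')

# Tiny worked example on a 2 x 2 kernel matrix.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
K_hat = np.array([[1.0, 0.4], [0.4, 1.0]])
err = relative_error(K, K_hat)
```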
The approximation was constructed for a varying number of samples $D$, where $d$ is the original feature space dimensionality and $D$ the new one. For the Gaussian kernel we set the hyperparameter $\sigma$ to the same default value for all the approximants, while the arc-cosine kernels (see the definition of the arc-cosine kernel in the Appendix) have no hyperparameters. We run experiments for each [kernel, dataset, $D$] tuple and plot a 95% confidence interval around the mean value line. Figure 1 shows the results for the kernel approximation error on the LETTER, MNIST, CIFAR100 and LEUKEMIA datasets. The QMC method almost always coincides with RFF, except for the arc-cosine 0 kernel. It does particularly well on the Powerplant dataset with $d = 4$, i.e. a small number of features. A possible explanation for this behaviour lies in the connection with QMC quadratures: the worst-case error for QMC quadratures scales as $O\!\left(\frac{\log^d D}{D}\right)$, where $d$ is the dimensionality and $D$ is the number of sample points [Owen, 1998]. It is also worth mentioning that for large $d$ it is difficult to construct a proper QMC point set. Thus, in higher dimensions QMC may bring little practical advantage over MC. While recent randomized QMC techniques in some cases indeed have no dependence on $d$, our approach is still computationally more efficient thanks to the structured matrices. The GQ method likewise matches the performance of RFF. We omit both QMC and GQ from the experiments on datasets with large $d$ (CIFAR100, LEUKEMIA).
The empirical results in Figure 1 support our hypothesis about the advantages of quadratures applied to kernel approximation compared to state-of-the-art methods. With the exception of a couple of cases, (arc-cosine 0, Powerplant) and (Gaussian, USPS), our method clearly outperforms the alternatives.
4.3 Classification/regression with new features
We report accuracy and $R^2$ scores for the classification/regression tasks on some of the datasets (Figure 2). We examine the performance in the same setting as in the kernel approximation experiments, except that we now map the whole dataset. We use Support Vector Machines to obtain predictions.
Kernel approximation error does not fully determine the final prediction accuracy: the best-performing kernel matrix approximant does not necessarily yield the best accuracy or $R^2$ score. However, the empirical results illustrate that our method delivers comparable, and often superior, quality on the downstream tasks.
4.4 Walltime experiment
We measure the time spent on the explicit mapping of features by running each experiment 50 times and averaging the measurements. Indeed, Figure 3 demonstrates that, thanks to the structured nature of the mapping, the method scales with growing dimension as theoretically predicted.
5 Related work
The most popular methods for scaling up kernel methods are based on a low-rank approximation of the kernel using either data-dependent or data-independent basis functions. The first class includes the Nyström method [Drineas and Mahoney, 2005], greedy basis selection techniques [Smola and Schölkopf, 2000], and incomplete Cholesky decomposition [Fine and Scheinberg, 2001].
The construction of basis functions in these techniques utilizes the given training set, making them more attractive for some problems than the Random Fourier Features approach. In general, data-dependent approaches perform better than data-independent ones when there is a gap in the eigenspectrum of the kernel matrix. A rigorous study of the generalization performance of both approaches can be found in [Yang et al., 2012].
In data-independent techniques, the kernel function is approximated directly. Most of the methods (including the proposed approach) that follow this idea are based on Random Fourier Features [Rahimi and Recht, 2008]. They require a so-called weight matrix that can be generated in a number of ways. [Le et al., 2013] form the weight matrix as a product of structured matrices. This enables fast computation of matrix-vector products and speeds up the generation of random features.
Another work [Felix et al., 2016] orthogonalizes the features by means of an orthogonal weight matrix. This leads to less correlated and more informative features, increasing the quality of approximation. They support this result both analytically and empirically. The authors also introduce matrices with special structure for fast computations. [Choromanski et al., 2017] propose a generalization of the ideas from [Le et al., 2013] and [Felix et al., 2016], delivering an analytical estimate of the mean squared error (MSE) of the approximation.
All these works use simple Monte Carlo sampling. However, convergence can be improved by replacing Monte Carlo sampling with quasi-Monte Carlo sampling. Following this idea, [Yang et al., 2014] apply quasi-Monte Carlo to Random Fourier Features. In [Yu et al., 2015] the authors attempt to improve the approximation quality of Random Fourier Features by optimizing the sequences conditioned on a given dataset.
Among recent papers there are works that, similarly to our approach, use numerical integration methods to approximate kernels. While [Bach, 2017] carefully inspects the connection between random features and quadratures, it does not provide any practically useful explicit mappings for kernels. Leveraging this connection, [Dao et al., 2017] propose several methods based on Gaussian quadratures. Among them, three schemes are data-independent and one is data-dependent. The authors do not compare them with approaches for random feature generation other than Random Fourier Features. The data-dependent scheme optimizes the weights for the quadrature points to yield better performance.
6 Conclusion
We propose an approach to the random features method for kernel approximation that reveals a new interpretation of RFF and ORF: the latter are special cases of spherical-radial quadrature rules with degrees (1, 1) and (1, 3), respectively. We take this further and develop a more accurate technique for random features that preserves the time and space complexity of random orthogonal embeddings.
Our experimental study confirms that for many kernels on most datasets the proposed approach delivers the best kernel approximation. Additionally, the results show that the quality on the downstream task (classification/regression) is also superior or comparable to the state-of-the-art baselines.
Acknowledgments
This work was supported by the Ministry of Science and Education of Russian Federation as a part of Mega Grant Research Project 14.756.31.0001.
References
 Anderson et al. [1987] Theodore W Anderson, Ingram Olkin, and Les G Underhill. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625–629, 1987.
 Avron and Sindhwani [2016] Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 58(3):341–349, 2016.

 Bach [2017] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
 Baker [1997] John A Baker. Integration over spheres and the divergence theorem for balls. The American Mathematical Monthly, 104(1):36–47, 1997.
 Bergstra et al. [2013] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115–123, 2013.
 Bochner [1933] Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108(1):378–410, 1933.
 Chen et al. [2015] Xixian Chen, Haiqin Yang, Irwin King, and Michael R Lyu. Training-efficient feature map for shift-invariant kernels. In IJCAI, pages 3395–3401, 2015.

 Cho and Saul [2009] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
 Choromanski and Sindhwani [2016] Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. arXiv preprint arXiv:1605.09049, 2016.
 Choromanski et al. [2017] Krzysztof Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of random orthogonal embeddings. arXiv preprint arXiv:1703.00864, 2017.
 Dao et al. [2017] Tri Dao, Christopher M De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, pages 6109–6119, 2017.
 Drineas and Mahoney [2005] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernelbased learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.

 Fang and Li [1997] Kai-Tai Fang and Run-Ze Li. Some methods for generating both an NT-net and the uniform distribution on a Stiefel manifold and their applications. Computational Statistics & Data Analysis, 24(1):29–46, 1997.
 Felix et al. [2016] X Yu Felix, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal Random Features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
 Fine and Scheinberg [2001] Shai Fine and Katya Scheinberg. Efficient SVM training using lowrank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.
 Forrester et al. [2008] Alexander Forrester, Andy Keane, et al. Engineering design via surrogate modelling: a practical guide. John Wiley & Sons, 2008.
 Genz [1998] Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and QuasiMonte Carlo Methods, pages 199–213, 1998.
 Genz and Monahan [1998] Alan Genz and John Monahan. Stochastic integration rules for infinite regions. SIAM journal on scientific computing, 19(2):426–439, 1998.
 Genz and Monahan [1999] Alan Genz and John Monahan. A stochastic algorithm for high-dimensional integrals over unbounded regions with Gaussian weight. Journal of Computational and Applied Mathematics, 112(1):71–81, 1999.
 Golub et al. [1999] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
 Haykin [2012] Simon Haykin. Cognitive dynamic systems: perceptionaction cycle, radar and radio. Cambridge University Press, 2012.
 Huang et al. [2014] PoSen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on timit. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 205–209. IEEE, 2014.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Le et al. [2013] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in log-linear time. In Proceedings of the International Conference on Machine Learning, 2013.
 Mezzadri [2006] Francesco Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.
 Monahan and Genz [1997] John Monahan and Alan Genz. Spherical-radial integration rules for Bayesian computation. Journal of the American Statistical Association, 92(438):664–674, 1997.
 Owen [1998] Art B Owen. Latin supercube sampling for very highdimensional simulations. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):71–102, 1998.
 Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for largescale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
 Rudin [2017] Walter Rudin. Fourier analysis on groups. Courier Dover Publications, 2017.
 Smola and Schölkopf [2000] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
 Stewart [1980] G. W. Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409, 1980. ISSN 00361429. URL http://www.jstor.org/stable/2156882.
 Sutherland and Schneider [2015] Dougal J Sutherland and Jeff Schneider. On the error of random fourier features. arXiv preprint arXiv:1506.02785, 2015.
 Williams [1997] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
 Yang et al. [2014] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of The 31st International Conference on Machine Learning (ICML-14), pages 485–493, 2014.
 Yang et al. [2012] Tianbao Yang, YuFeng Li, Mehrdad Mahdavi, Rong Jin, and ZhiHua Zhou. Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.
 Yu et al. [2015] Felix X Yu, Sanjiv Kumar, Henry Rowley, and ShihFu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.
Appendix
1 Proof of Proposition 3.1
1.1 Variance of the degree (3, 3) quadrature rule
Let us denote , , . Then it is easy to see that .
Let us denote , . Using the above definitions we obtain
(7) 
Variance of the first term
(8) 
Variance of the second term (using independence of and for )
(9) 
Variance of the last term (using CauchySchwarz inequality)
(10) 
Now, let us upper bound term
Using this expression and plugging (1.1), (9), (1.1) into (7) we obtain
(11) 
which concludes the proof.
1.2 Error probability
The proof strategy closely follows that of [Sutherland and Schneider, 2015]; we simply use the Chebyshev-Cantelli inequality instead of Hoeffding's and Bernstein's inequalities, and all the expectations are calculated according to our quadrature rules.
Let $\mathcal{M}$ be a compact set in $\mathbb{R}^d$ with diameter $l$, so that we can cover it with an $\varepsilon$-net using at most $T$ balls of radius $r$. Let $\{\mathbf{m}_i\}_{i=1}^{T}$ denote their centers, and let $L$ be the Lipschitz constant of the approximation error. If the error is less than $\varepsilon/2$ at all the centers $\mathbf{m}_i$ and $L \le \varepsilon/(2r)$, then the error is less than $\varepsilon$ for all of $\mathcal{M}$.
1.2.1 Regularity Condition
Similarly to [Sutherland and Schneider, 2015] (see the regularity condition section in their appendix), the required regularity condition can be shown to hold.
1.2.2 Lipschitz Constant
Since $f$ is differentiable, the error function is Lipschitz continuous with some constant $L$. Via Jensen's inequality, the expectation of $L$ can be upper bounded. Then, using the independence of the quadrature points for different indices,
where . Then using Markov’s inequality we obtain
1.2.3 Anchor points
Let us upper bound the following probability
Let us rewrite the function
where the terms are independent and bounded. Under this assumption we can apply Hoeffding's inequality:
1.2.4 Optimizing over $r$
Now the probability of failure takes the form of a sum of two terms, corresponding to the anchor points and the Lipschitz constant. Optimizing this bound over $r$ gives us the final result.