Kernel methods offer a comprehensive collection of non-parametric techniques for modeling a broad range of problems in machine learning. However, standard kernel machines such as kernel SVM or kernel ridge regression suffer from slow training and prediction, which limits their use in real-world large-scale applications. Consider, for example, training linear regression over n labeled data points {(xᵢ, yᵢ)}, where xᵢ is a d-dimensional feature vector and yᵢ is a binary label. The time complexity of a standard least-squares fit is O(nd²) and the memory demand is O(nd). Its kernel-based counterpart requires solving a linear system with an n-by-n kernel matrix, which typically takes O(n³) time and O(n²) memory. Such complexity is far from desirable in big-data applications.
In recent years, significant efforts have been devoted to low-rank kernel approximation for enhancing the scalability of kernel methods. Among existing approaches, random features [Rahimi and Recht2007, Rahimi and Recht2009, Le et al.2013, Bach2015] have drawn considerable attention and yielded state-of-the-art results in classification accuracy on benchmark data sets [Huang et al.2014, Dai et al.2014, Li and Póczos2016]. Specifically, inspired by Bochner’s theorem [Rudin2011], random Fourier features have been studied for evaluating the expectation of shift-invariant kernels (i.e., k(x, y) = g(x − y) for some function g). Rahimi and Recht [2007] proposed to use Monte Carlo methods (MC) to estimate the expectation; Yang et al. [2014] leveraged the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce integration errors.
Both the MC and QMC methods rely on the (implicit) assumption that all the random features are equally important, and hence assign a uniform weight to the features in kernel estimation. Such a treatment, however, is arguably sub-optimal for minimizing the expected risk in kernel approximation. Avron et al. [2016b] presented a weighting strategy for minimizing a loose error bound which depends on the maximum norm of the data points. On the other hand, Bayesian Quadrature (BQ) is a powerful framework that solves the integral approximation by a weighted sum of integrand evaluations with non-uniform weights. BQ leverages a Gaussian Process (GP) to model the prior of the integrand function, and the resulting estimator is Bayes optimal [Huszár and Duvenaud2012]. Thus, BQ could be considered a natural choice for kernel approximation. Nonetheless, a well-known limitation is that the performance of Bayesian models relies heavily on extensive hyper-parameter tuning, e.g., on the condition of the covariance matrix in the Gaussian prior, which makes them ill-suited for large real-world applications.
To address the fundamental limitations of the aforementioned work, we propose a novel approach to data-driven feature weighting in the approximation of shift-invariant kernels, which is motivated by the Stein effect in the statistical literature (Section 4), and solve it using an efficient stochastic algorithm with a convex optimization objective. We also present a natural extension of BQ to kernel approximation (Section 3). The adapted BQ, together with the standard MC and QMC methods, serves as a representative baseline in our empirical evaluations on six benchmark data sets. The empirical results (Section 5) show that the proposed Stein-Effect Shrinkage (SES) estimator consistently outperforms the baseline methods in terms of lowering the kernel approximation errors, and is competitive with or better than the best performer among the baseline methods in most task-oriented evaluations (supervised regression and classification tasks).
Let us briefly outline the related background and key concepts in random-feature based kernel approximation.
Reproducing Kernel Hilbert Space
At the heart of kernel methods lies the idea that inner products in high-dimensional feature spaces can be computed in an implicit form via a kernel function k, defined on an input data domain such that k(x, y) = ⟨φ(x), φ(y)⟩,
where φ is a feature map that associates the kernel k with an embedding of the input space into a high-dimensional Hilbert space H. A key to kernel methods is that as long as kernel algorithms have access to k, one need not represent φ(x) explicitly. Most often, that means that although φ(x) can be high-dimensional or even infinite-dimensional, the inner products ⟨φ(x), φ(y)⟩ can be evaluated by k(x, y). This idea is known as the kernel trick.
The bottleneck of kernel methods lies in both the training and the prediction phase. The former requires computing and storing the n-by-n kernel matrix for n data points, while the latter needs to evaluate the decision function via the kernel trick, which sums over the nonzero support vectors. Unfortunately, Steinwart and Christmann [2008] show that the number of nonzero support vectors grows linearly in the size of the training data. This poses a great challenge for designing scalable algorithms for kernel methods in large-scale applications. One traditional way to address this difficulty is to make trade-offs between time and space [Kivinen et al.2004, Chang and Lin2011].
Random Features (RF)
Rahimi and Recht [2007] propose a low-dimensional feature map z(x) for the kernel function k, such that k(x, y) ≈ z(x)ᵀz(y),
under the assumption that k falls in the family of shift-invariant kernels. The starting point is a celebrated result that characterizes the class of positive definite functions:
Theorem 1 (Bochner’s theorem [Rudin2011]).
A continuous, real-valued, symmetric and shift-invariant function k on Rᵈ is a positive definite kernel if and only if there is a positive finite measure p(w) such that k(x − y) = ∫ exp(i wᵀ(x − y)) p(w) dw.
The most well-known kernel in the shift-invariant family is the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)); the associated density p(w) is again Gaussian, N(0, σ⁻²I).
Rahimi and Recht [2007] approximate the integral representation of the kernel (2) by the Monte Carlo method as follows,
k(x − y) ≈ (1/R) Σᵢ cos(wᵢᵀ(x − y)) = z(x)ᵀz(y),
where z(x) = (1/√R)[cos(w₁ᵀx), sin(w₁ᵀx), …, cos(w_Rᵀx), sin(w_Rᵀx)]ᵀ and w₁, …, w_R are i.i.d. samples from p(w).
After obtaining the randomized feature map z(·), training data can be transformed into Z = [z(x₁), …, z(x_n)]ᵀ. As long as the number of features R is sufficiently smaller than n, this leads to more scalable solutions, e.g., for regression we get back to O(nR²) training and O(R) prediction time, with O(nR) memory requirements. We can also apply random features to unsupervised learning problems, such as kernel clustering [Chitta et al.2012].
Quasi-Monte Carlo Technique
Instead of using plain MC approximation, Yang et al. [2014] propose to use the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce the integration error in approximations of the form (3). Due to space limitations, we restrict our discussion to the background necessary for understanding subsequent sections, and refer interested readers to the comprehensive reviews [Caflisch1998, Dick et al.2013] for a more detailed exposition.
The QMC method is generally applicable to integrals over a unit cube, so the procedure is to first generate a low-discrepancy sequence t₁, …, t_R ∈ [0, 1]ᵈ, and transform it into a sequence w₁, …, w_R in Rᵈ. To convert the integral representation of the kernel (2) into an integral over the unit cube, a simple change of variables suffices. For t ∈ [0, 1]ᵈ, define w = Φ⁻¹(t), where we assume that the density function p(w) in (2) factorizes coordinate-wise and Φ is the corresponding (coordinate-wise) cumulative distribution function (CDF). By setting wᵢ = Φ⁻¹(tᵢ), the integral (2) is equivalent to an integral over the unit cube.
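A minimal sketch of this change of variables for the Gaussian case, using a scrambled Halton sequence; the specific sequence, dimensions, and bandwidth are illustrative choices rather than the paper's exact setup.

```python
import numpy as np
from scipy.stats import norm, qmc

d, R, sigma = 5, 1024, 1.0

# Step 1: low-discrepancy sequence t_1, ..., t_R in the unit cube [0, 1]^d
halton = qmc.Halton(d=d, scramble=True, seed=0)
T = halton.random(R)

# Step 2: change of variables w = Phi^{-1}(t); since p(w) = N(0, sigma^-2 I)
# factorizes coordinate-wise, Phi^{-1} is the Gaussian quantile per coordinate
W = norm.ppf(T, loc=0.0, scale=1.0 / sigma)

# The transformed points feed the same cosine estimator as plain MC
rng = np.random.default_rng(1)
x, y = rng.normal(size=d), rng.normal(size=d)
k_qmc = np.mean(np.cos(W @ (x - y)))
k_exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
```

The only difference from the MC construction is how the frequencies W are generated; everything downstream is unchanged.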
3 Bayesian Quadrature for Random Features
Random features constructed by either MC [Rahimi and Recht2007] or QMC [Yang et al.2014] approaches employ equal weights on integrand functions as shown in (3). An important question is how to construct different weights on the integrand functions to have a better kernel approximation.
Given a fixed QMC sequence, Avron et al. [2016b] solve for the weights by optimizing a derived error bound based on the observed data. There are two potential weaknesses of this approach [Avron et al.2016b]. First, it does not fully utilize the data information, because the optimization objective depends only on the maximum norm of the data points instead of their distribution. Second, the error bound being optimized is potentially loose: for technical reasons, a relaxation is applied when deriving the bound, and it has not been studied whether the relaxation results in a proper upper bound.
On the other hand, selecting weights in such a way is closely connected to Bayesian Quadrature (BQ), originally proposed by Ghahramani and Rasmussen [2002] and theoretically guaranteed under Bayes assumptions. BQ is a Bayesian approach that puts prior knowledge on functions to solve the integral estimation problem. We first introduce the work of Ghahramani and Rasmussen [2002], and then discuss how to apply BQ to kernel approximation.
Standard BQ considers a single integration problem of the form I = ∫ f(w) p(w) dw.
To utilize the information of f, BQ puts a Gaussian Process (GP) prior on f with mean 0 and covariance function k̃, which induces the covariance matrix K with entries K_ij = k̃(wᵢ, wⱼ).
After conditioning on samples w₁, …, w_R from p(w), we obtain a closed-form GP posterior over f and use the posterior to get the optimal Bayes estimator (for the square loss) as Î = βᵀf,
where f = (f(w₁), …, f(w_R))ᵀ, β = K⁻¹z, and zᵢ = ∫ k̃(w, wᵢ) p(w) dw. Clearly, the resulting estimator is simply a new weighted combination of the empirical observations f(wᵢ) with weights β derived from the Bayes perspective.
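The BQ weights can be sketched numerically as follows. Here the GP bandwidth ell is a hypothetical, untuned choice, and z is estimated by plain Monte Carlo rather than the closed-form expression; the sketch only illustrates the β = K⁻¹z mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 2, 50
sigma = 1.0      # target Gaussian kernel bandwidth, so p(w) = N(0, sigma^-2 I)
ell = 0.5        # GP prior bandwidth (a hypothetical, untuned choice)

w = rng.normal(scale=1.0 / sigma, size=(R, d))      # frequencies sampled from p(w)

# GP prior covariance between frequencies: K_ij = exp(-||w_i - w_j||^2 / (2 ell^2))
sq = ((w[:, None, :] - w[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2.0 * ell**2))

# z_i = E_{w' ~ p}[ ktilde(w_i, w') ], estimated here by plain Monte Carlo
wp = rng.normal(scale=1.0 / sigma, size=(20000, d))
z = np.exp(-((w[:, None, :] - wp[None, :, :]) ** 2).sum(-1) / (2.0 * ell**2)).mean(axis=1)

# BQ weights beta = K^{-1} z (small jitter for numerical stability)
beta = np.linalg.solve(K + 1e-4 * np.eye(R), z)

# BQ estimate of k(x, y) = E_w[cos(w^T (x - y))] versus the exact Gaussian kernel
x, y = rng.normal(size=d), rng.normal(size=d)
f = np.cos(w @ (x - y))
bq_est = beta @ f
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
```

Note how the estimate depends on ell through both K and z, which is exactly the hyper-parameter sensitivity discussed next.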
In kernel approximation, we can treat the problem as a series of such single integration problems, one per data pair, and therefore apply BQ directly. Here we discuss the technical issues of doing so. First, we need to evaluate the vector z. For the Gaussian kernel, a closed-form expression can be derived. Here f is the vector of integrand values, determined by the data pair (x, y) and evaluated at the sampled frequencies.
Data-driven by Kernel Hyperparameters
The most important issue in applying BQ to kernel approximation is that it models the information of f by the kernel k̃. In practice, we usually use a Gaussian kernel for k̃, so the information of f is modeled by the kernel bandwidth. As suggested by Rasmussen [2006], one can tune this hyper-parameter either by maximizing the marginal likelihood or by cross-validation on the metric of interest. Note that the existing BQ approximates the integral representation of a kernel function value evaluated at a single pair of data (x, y). However, it is infeasible to tune an individual bandwidth for every pair of data. Therefore, we tune a global hyper-parameter on a subset of sampled data pairs.
4 Stein Effect Shrinkage (SES) Estimator
In practice, it is well known that the performance of Gaussian Process based models depends heavily on the hyper-parameters, e.g., the bandwidth of the kernel function k̃. Therefore, instead of taking the Bayesian route to the optimal Bayes estimator from BQ, we start from the risk minimization perspective and invoke the Stein effect to derive a shrinkage estimator for random features.
The standard Monte Carlo estimator is the unbiased empirical average, which uses equal weights 1/R that sum to 1. However, the Stein effect (or James–Stein effect) suggests that a family of biased estimators can achieve lower risk than the standard empirical average. Here we extend the analysis of the non-parametric Stein effect to kernel approximation with random features by considering the risk function R(k̂) = E[(k(x, y) − k̂(x, y))²].
Theorem 2 (Stein effect on kernel approximation). Let k̂(x, y) be the empirical mean estimator of k(x, y) built from the random features. For any such estimator, there exists 0 < α < 1 such that the shrunk estimator αk̂(x, y) has lower risk, R(αk̂) < R(k̂).
The proof closely follows Muandet et al. [2014], who study the Stein effect for a different problem under the non-parametric setting (the original Stein effect assumes a Gaussian distribution). By the bias-variance decomposition, since k̂ is unbiased, we have
R(αk̂) = α² Var(k̂) + (1 − α)² k(x, y)².
By elementary calculation, we have R(αk̂) < R(k̂) when
2k(x, y)²/(k(x, y)² + Var(k̂)) − 1 < α < 1,
and the optimal value occurs at
α* = k(x, y)²/(k(x, y)² + Var(k̂)).
Corollary (Shrinkage estimator). If Var(k̂) > 0, there is a shrinkage estimator αk̂ that has lower risk than the empirical mean estimator, obtained by setting α = α*.
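A quick numerical sanity check of the corollary: a toy simulation with a known mean and variance, so the oracle α* is computable. This is not a practical estimator, since α* depends on the unknown quantity being estimated; it only verifies that shrinkage lowers the risk.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, R, trials = 0.5, 20, 20000   # "true kernel value" mu and R features per trial

# Each trial produces an unbiased empirical-mean estimate of mu from R noisy draws
samples = rng.normal(loc=mu, scale=1.0, size=(trials, R))
k_hat = samples.mean(axis=1)

var_hat = 1.0 / R                          # Var(k_hat), known here by construction
alpha_star = mu**2 / (mu**2 + var_hat)     # oracle optimal shrinkage factor

risk_mean = np.mean((k_hat - mu) ** 2)                  # empirical risk of plain mean
risk_shrunk = np.mean((alpha_star * k_hat - mu) ** 2)   # risk after shrinkage
```

With few samples (small R) the variance term dominates and the gap between the two risks is largest, matching the small-R regime where SES helps most in the experiments.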
The standard shrinkage estimator derived from the Stein effect uses equal weights α/R to minimize the variance; however, non-uniform weights usually give better performance in practice [Huszár and Duvenaud2012]. Therefore, based on the analysis of Theorem 2, we propose the estimator obtained by minimizing the empirical risk under a norm constraint on the weights β that shrinks them toward zero. With a Lagrangian multiplier, we end up with the following risk minimization formulation, which is data-driven by directly taking the observed data into account,
min_β Σ_{i,j} (k(xᵢ, xⱼ) − βᵀ(z(xᵢ) ∘ z(xⱼ)))² + λ‖β‖²,    (6)
where ∘ denotes the Hadamard product, and λ is the regularization coefficient that shrinks the weights.
The objective function (6) can be further simplified as a least squares regression (LSR) formulation,
min_β ‖y − Zβ‖² + λ‖β‖²,    (7)
where Z and y stack the pair-wise features and kernel values: for all pairs (i, j), there exists a corresponding pair-mapping index m(i, j) satisfying Z_{m(i,j)} = z(xᵢ) ∘ z(xⱼ) and y_{m(i,j)} = k(xᵢ, xⱼ).
Since Z has n² rows, we solve a sketched LSR problem,
min_β ‖S(y − Zβ)‖² + λ‖β‖²,    (8)
where S is the randomized sketching matrix. Solving the sketched LSR problem (8) costs only the time of sketching plus a least-squares solve on the much smaller sketched system. From the optimization perspective, suppose β* is the optimal solution of the LSR problem (7), and β̃ is the solution of the sketched LSR problem (8). Wang et al. [2017] prove that, with a sufficiently large sketch size, the objective function value of (8) at β̃ is at most a (1 + ε) factor worse than the value of (7) at β*. In practice, we consider an efficient sampling-based sketching matrix: S is a diagonal matrix with a nonzero rescaling entry if row i was included in the sample, and 0 otherwise. That is, we form a sub-matrix of Z by including each row of Z in the sub-matrix independently with some probability pᵢ. Woodruff [2014] suggests setting pᵢ proportional to the norm of the i-th row of Z, to obtain a meaningful error bound.
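A sketch of the sampling-based idea on a synthetic least-squares problem. Here rows are drawn with replacement under row-norm probabilities (a common variant of the scheme described above), and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R, lam, s = 50000, 100, 1e-3, 2000   # N pair rows, R features, sketch size s

Z = rng.normal(size=(N, R))             # stand-in for the tall pair design matrix
beta_star = rng.normal(size=R)
y = Z @ beta_star + 0.1 * rng.normal(size=N)

# Row-norm sampling: pick row i with probability p_i proportional to ||z_i||^2,
# rescaling kept rows by 1/sqrt(s p_i) so that Zs^T Zs is unbiased for Z^T Z
row_norms = (Z**2).sum(axis=1)
p = row_norms / row_norms.sum()
idx = rng.choice(N, size=s, replace=True, p=p)
scale = 1.0 / np.sqrt(s * p[idx])
Zs, ys = Z[idx] * scale[:, None], y[idx] * scale

beta_full = np.linalg.solve(Z.T @ Z + lam * np.eye(R), Z.T @ y)
beta_sketch = np.linalg.solve(Zs.T @ Zs + lam * np.eye(R), Zs.T @ ys)

rel = np.linalg.norm(beta_sketch - beta_full) / np.linalg.norm(beta_full)
```

The sketched solve touches only s of the N rows, yet recovers the full ridge solution closely, which is the behavior examined empirically in Figure 3.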
5 Experiments
In this section, we empirically demonstrate the benefits of our proposed method on six data sets listed in Table 1. The features in all six data sets are scaled to zero mean and unit variance in a preprocessing step.
We consider two different tasks: i) the quality of kernel approximation and ii) the applicability in supervised learning. Both the relative kernel approximation errors and the regression/classification errors are reported on the test sets.
Methods for Comparison
MC: Monte Carlo approximation with uniform weights.
QMC: Quasi-Monte Carlo approximation with uniform weights. We adapt the implementation of Yang et al. [2014] from GitHub (https://github.com/chocjy/QMC-features).
BQ: QMC sequence with Bayesian Quadrature weights. We modify the source code of Huszár and Duvenaud [2012] (http://www.cs.toronto.edu/~duvenaud/).
A Gaussian RBF kernel is used throughout the experiments, with the kernel bandwidth tuned on a grid chosen in favor of the Monte Carlo methods. For the QMC method, we use the scrambling and shifting techniques recommended in the QMC literature (see Dick et al. [2013] for more details). In the BQ framework, we consider a GP prior whose covariance bandwidth is tuned on a grid over a subset of sampled data pairs. Likewise, the regularization coefficient λ in the proposed objective function (6) is tuned on a grid over the same subset of sampled data pairs. For the regression task, we solve ridge regression problems where the regularization coefficient is tuned by five-fold cross-validation. For the classification task, we solve L2-loss SVM where the regularization coefficient is tuned by five-fold cross-validation. All experiment code is written in MATLAB and run on an Intel(R) Xeon(R) CPU 2.40GHz Linux server with 32 cores and 190GB memory.
Quality of Kernel Approximation
In our setting, the most natural metric for comparison is the quality of approximation of the kernel matrix. We use the relative kernel approximation error to measure the quality, as shown in Figure 1.
SES outperforms the other methods on cpu, census, adult, mnist, and covtype, and is competitive with the best baseline methods on the years data set. Notice that SES is particularly strong when the number of random features is small. This phenomenon confirms our theoretical analysis: the fewer the random features, the larger the variance in general, and our shrinkage estimator with the Stein effect then enjoys better performance than the empirical mean estimators (MC and QMC), which use uniformly weighted features without shrinkage. Although BQ does use weighted random features, its performance is still not as good as SES because tuning the covariance matrix in the Gaussian Process is rather difficult. Moreover, it is even harder to tune a global bandwidth for all the integrand functions that approximate the kernel function.
For the regression task, we report the relative regression error; for the classification task, we report the classification error.
We investigate whether better kernel approximation leads to better predictions in the supervised learning tasks. The results are shown in Table 2.
We can see that on the data sets where our method performs very well in terms of kernel approximation (cpu, census, adult, and covtype), it also yields lower errors in most cases of the supervised learning tasks. Note that we only present partial experimental results at selected numbers of random features. We also observe that the generalization error of our proposed method may be higher than that of other state-of-the-art methods at some other numbers of random features not presented here. This situation is also observed in other works [Avron et al.2016b]: better kernel approximation does not always result in better supervised learning performance. Under certain conditions, fewer features (worse kernel approximation) can play a role similar to regularization and avoid overfitting [Rudi et al.2016]. Rudi et al. [2016] suggest addressing this issue by tuning the number of random features.
Distributions of Feature Weights
We are interested in the distributions of the random feature weights derived from BQ and SES. For different numbers of random features R, we plot the corresponding histograms and compare them with the equal weight 1/R used in MC and QMC, as shown in Figure 2. When the number of random features is small, the difference between BQ and SES is large. As the number of random features increases, the weights of each method gradually converge to a stable distribution and move increasingly closer to the uniform-weight spike. This empirical observation agrees with our theoretical analysis, namely with the role of the weight estimator in the bias-variance trade-off.
Behavior of Randomized Optimization Solver
In theory, our randomized solver only needs to sample on the order of R rows in our proposed objective function (8) to obtain a sufficiently accurate solution. We verify this assertion by varying the sampled data size in the sketching matrix and examining how it affects the relative kernel approximation errors. The results are shown in Figure 3. For the census data set, we observe that once the number of sampled rows exceeds the number of random features R, the kernel approximation of our proposed method outperforms the state-of-the-art QMC. Furthermore, the number of sampled rows required by our randomized solver remains of the same order even for the much larger covtype data set, with millions of data points. On the other hand, BQ does not behave consistently with respect to the sampled data size, which is used for tuning its hyper-parameter. This again illustrates that BQ is highly sensitive to the covariance matrix and is difficult to tune on only a small subset of sampled data. In practice, we find that we usually only need to sample a small constant multiple of the number of random features to obtain a relatively stable and good minimizer. This observation coincides with the complexity analysis of our proposed randomized solver in Section 4.
Uniform vs. Non-uniform Shrinkage Weights
We conducted experiments with SES using a uniform shrinkage weight versus non-uniform shrinkage weights on the cpu and census data sets. The results in Figure 4 confirm our conjecture that non-uniform shrinkage weights are more beneficial, although both enjoy the lower empirical risk afforded by the Stein effect. We observed similar behavior on the other data sets and omit those results.
6 Discussion and Conclusion
Nyström methods [Williams and Seeger2001, Drineas and Mahoney2005] are yet another line of research on kernel approximation. Specifically, a low-rank factorization of the kernel matrix is obtained via various sampling techniques [Wang and Zhang2013]. We refer interested readers to the comprehensive study [Yang et al.2012] for more detailed comparisons. Due to page limits, this paper focuses only on random Fourier features.
Finally, we note that Sinha and Duchi [2016] propose a weighting scheme for random features that matches the label correlation matrix, with the goal of optimizing the supervised objective (e.g., classification error). In contrast, our work aims to optimize the kernel approximation, which does not require label information to solve for the weights and can be applied to different tasks once the weights are solved.
To conclude, we propose a novel shrinkage estimator which enjoys the benefits of the Stein effect in lowering the empirical risk compared to empirical mean estimators. Our estimator induces non-uniform weights on the random features with desirable data-driven properties. We further provide an efficient randomized solver to optimize this estimation problem for real-world large-scale applications. Extensive experimental results demonstrate the effectiveness of our proposed method.
We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grants IIS-1546329 and IIS-1563887.
- [Avron et al.2013] Haim Avron, Vikas Sindhwani, and David Woodruff. Sketching structured matrices for faster nonlinear regression. In NIPS, 2013.
- [Avron et al.2016a] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. arXiv, 2016.
- [Avron et al.2016b] Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. JMLR, 2016.
- [Bach2015] Francis Bach. On the equivalence between quadrature rules and random features. arXiv, 2015.
- [Caflisch1998] Russel E Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 1998.
- [Chang and Lin2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
- [Chitta et al.2012] Radha Chitta, Rong Jin, and Anil K Jain. Efficient kernel clustering using random Fourier features. In ICDM, 2012.
- [Dai et al.2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, 2014.
- [Dick et al.2013] Josef Dick, Frances Y Kuo, and Ian H Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 2013.
- [Drineas and Mahoney2005] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 2005.
- [Ghahramani and Rasmussen2002] Zoubin Ghahramani and Carl E Rasmussen. Bayesian Monte Carlo. In NIPS, 2002.
- [Huang et al.2014] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP, 2014.
- [Huszár and Duvenaud2012] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In UAI, 2012.
- [Kivinen et al.2004] Jyrki Kivinen, Alexander J Smola, and Robert C Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2004.
- [Le et al.2013] Quoc Le, Tamás Sarlós, and Alexander Smola. Fastfood: computing Hilbert space expansions in loglinear time. In ICML, 2013.
- [Li and Póczos2016] Chun-Liang Li and Barnabás Póczos. Utilize old coordinates: Faster doubly stochastic gradients for kernel methods. In UAI, 2016.
- [Muandet et al.2014] K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and stein effect. In ICML, 2014.
- [Rahimi and Recht2007] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
- [Rahimi and Recht2009] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.
- [Rasmussen2006] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
- [Rudi et al.2016] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties of learning with random features. arXiv, 2016.
- [Rudin2011] Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011.
- [Sinha and Duchi2016] Aman Sinha and John C Duchi. Learning kernels with random features. In NIPS, 2016.
- [Steinwart and Christmann2008] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
- [Wang and Zhang2013] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. JMLR, 2013.
- [Wang et al.2017] Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. arXiv, 2017.
- [Williams and Seeger2001] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
- [Woodruff2014] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.
- [Yang et al.2012] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: a theoretical and empirical comparison. In NIPS, 2012.
- [Yang et al.2014] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In ICML, 2014.