1 Introduction
Kernel methods [Schölkopf and Smola2003] are one of the most important and powerful tools in statistical learning. However, kernel methods often suffer from scalability issues in large-scale problems due to high space and time complexities. For example, given $n$ observations in the original $d$-dimensional space $\mathcal{X}$, kernel ridge regression (KRR) requires $\mathcal{O}(n^3)$ training time and $\mathcal{O}(n^2)$ space to store the kernel matrix, which becomes intractable when $n$ is large.
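One of the most popular approaches for scaling up kernel methods is random Fourier features (RFF) [Rahimi and Recht2007], which approximates the original kernel by mapping input features into a new space spanned by a small number of Fourier bases. Specifically, suppose a given kernel $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfies 1) positive definiteness and 2) shift-invariance, i.e., $k(x, x') = k(x - x')$. By Bochner's theorem [Bochner2005], there exists a finite Borel measure $p(\omega)$ (the Fourier transform associated with $k$) such that
$$k(x - x') = \int_{\mathbb{R}^d} p(\omega) \exp\big(i\omega^\top (x - x')\big)\, d\omega. \qquad (1)$$
(Typically, the kernel is real-valued and thus the imaginary part in Eq. (1) can be discarded.) One can then use Monte Carlo sampling to approximate $k(x, x')$ by the low-dimensional kernel $\tilde{k}_p(x, x') = \varphi_p(x)^\top \varphi_p(x')$ with the explicit mapping
$$\varphi_p(x) := \frac{1}{\sqrt{s}} \big[\exp(-i\omega_1^\top x), \dots, \exp(-i\omega_s^\top x)\big]^\top, \qquad (2)$$
where $\{\omega_i\}_{i=1}^s$ are sampled from $p(\omega)$ independently of the training set, and the transpose is understood as the conjugate transpose for complex-valued features. For notational simplicity, here we write $z_p(\omega_i, x_j) := \frac{1}{\sqrt{s}} \exp(-i\omega_i^\top x_j)$ such that $\varphi_p(x) = [z_p(\omega_1, x), \dots, z_p(\omega_s, x)]^\top$. Note that we have $k(x, x') = \mathbb{E}_{\omega \sim p}[\varphi_p(x)^\top \varphi_p(x')] \approx \tilde{k}_p(x, x')$. Consequently, the original kernel matrix $K = [k(x_i, x_j)]_{n \times n}$ on the $n$ observations $X = \{x_i\}_{i=1}^n$ can be approximated by $\tilde{K}_p = Z_p Z_p^\top$, where $Z_p = [\varphi_p(x_1), \dots, \varphi_p(x_n)]^\top$ is the $n \times s$ feature matrix. With $s$ random features, this approximation applied to KRR only requires $\mathcal{O}(ns^2)$ time and $\mathcal{O}(ns)$ memory, hence achieving a substantial computational saving when $s \ll n$.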
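The Monte Carlo approximation described above can be illustrated with a minimal sketch. It assumes a Gaussian kernel (whose spectral measure is a Gaussian) and uses the real-valued cosine/sine form of the feature map instead of complex exponentials; the function names `gaussian_kernel` and `rff_map`, the synthetic data, and all parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def rff_map(X, W):
    """Real-valued RFF map [cos(XW), sin(XW)] / sqrt(s) for s frequencies in W (d x s)."""
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[1])

rng = np.random.default_rng(0)
n, d, s, sigma = 200, 5, 2000, 1.0
X = rng.normal(size=(n, d))                      # synthetic inputs (assumption)
# For the Gaussian kernel, the spectral measure p(omega) is N(0, sigma^{-2} I).
W = rng.normal(scale=1.0 / sigma, size=(d, s))
Z = rff_map(X, W)                                # n x 2s feature matrix
K = gaussian_kernel(X, X, sigma)                 # n x n exact kernel matrix
K_approx = Z @ Z.T                               # low-rank approximation of K
max_abs_err = np.abs(K - K_approx).max()         # worst entrywise deviation
```

With the cosine/sine form, each frequency contributes two real columns, so the feature matrix has $2s$ columns while the inner product still averages $\cos(\omega_i^\top (x - x'))$ over the $s$ sampled frequencies.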
Since RFF uses Monte Carlo estimates that are independent of the training set, a large number of random features is often required to achieve competitive approximation and generalization performance. To improve performance, recent works [Sun, Gilbert, and Tewari2018, Li et al.2019] consider using the ridge leverage function [Bach2017, Avron et al.2017] defined with respect to the training data. For a given random feature $\omega_i$, this function is defined as
$$l_\lambda(\omega_i) = p(\omega_i)\, z_{p,\omega_i}^\top(X) \big(K + n\lambda I\big)^{-1} z_{p,\omega_i}(X), \qquad (3)$$
where $\lambda$ is the regularization parameter in KRR and $z_{p,\omega_i}(X)$ is the $i$-th column of $Z_p$ given by $(z_{p,\omega_i}(X))_j := z_p(\omega_i, x_j)$. Observe that $q^*(\omega) := l_\lambda(\omega) / \int l_\lambda(\omega)\, d\omega$ can be viewed as a probability density function, and hence is referred to as the Empirical Ridge Leverage Score (ERLS) distribution [Avron et al.2017]. Therefore, one can sample the features $\{\omega_i\}_{i=1}^s$ according to $q^*(\omega)$, which is an importance weighted sampling strategy. Compared to standard Monte Carlo sampling for RFF, $q^*(\omega)$-based sampling requires fewer Fourier features and enjoys theoretical guarantees [Avron et al.2017, Li et al.2019].

However, computing the ridge leverage scores and the ERLS distribution may be intractable when $n$ is large, as we need to invert the kernel matrix in Eq. (3). An alternative in [Sun, Gilbert, and Tewari2018, Li et al.2019] is to use a subset of the data to approximate $K$, but this scheme still needs $\mathcal{O}(ns^2 + s^3)$ time to generate random features. To address these computational difficulties, we design a simple but effective leverage function to replace the original one. For a given $\omega$, our leverage function is defined as
$$L_\lambda(\omega) = p(\omega)\, z_{p,\omega}^\top(X) \Big( \frac{1}{n^2 \lambda} \big( y y^\top + n I \big) \Big) z_{p,\omega}(X), \qquad (4)$$
where the matrix $y y^\top$ is an ideal kernel that directly fits the training data with 100% accuracy in classification tasks, and thus can be used to guide kernel learning tasks as in kernel alignment [Cortes, Mohri, and Rostamizadeh2012]. Our surrogate function avoids the matrix inversion and thereby further accelerates kernel approximation. Note that we introduce the additional term $nI$ and the coefficient $1/(n^2\lambda)$ in Eq. (4) to ensure that $L_\lambda$ is a surrogate function that upper bounds $l_\lambda$ in Eq. (3), which is needed for the theoretical guarantees. This holds because $L_\lambda(\omega) \ge l_\lambda(\omega)$ (footnote 1: it follows from $(K + n\lambda I)^{-1} \preceq (n\lambda)^{-1} I \preceq (n^2\lambda)^{-1} (y y^\top + n I)$, where the notation $A \preceq B$ denotes that $B - A$ is positive semidefinite). In this way, our method with the surrogate function requires less computational time while achieving comparable generalization performance, as demonstrated by our theoretical results and experimental validations.
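The upper-bound property of the surrogate can be checked numerically on a small example. The sketch below assumes that the matrix inside Eq. (3) is $(K + n\lambda I)^{-1}$ and that the surrogate matrix inside Eq. (4) is $(yy^\top + nI)/(n^2\lambda)$; this reading of the two equations, the synthetic data, and the values of `n` and `lam` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 1e-2
X = rng.normal(size=(n, 3))                      # synthetic inputs (assumption)
y = rng.choice([-1.0, 1.0], size=n)              # binary labels
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dist / 2.0)                       # Gaussian kernel matrix (PSD)

# Matrix appearing in the exact ridge leverage function (needs an inverse).
exact = np.linalg.inv(K + n * lam * np.eye(n))
# Assumed surrogate matrix: no inverse, only a rank-one term plus a scaled identity.
surrogate = (np.outer(y, y) + n * np.eye(n)) / (n**2 * lam)

# If the surrogate dominates, surrogate - exact is positive semidefinite.
min_eig = np.linalg.eigvalsh(surrogate - exact).min()

# The same ordering then holds for any quadratic form z^T M z.
z = rng.normal(size=n)
gap = z @ surrogate @ z - z @ exact @ z
```

The ordering follows in two steps: $K \succeq 0$ gives $(K + n\lambda I)^{-1} \preceq (n\lambda)^{-1} I$, and adding the positive semidefinite term $yy^\top/(n^2\lambda)$ to $(n\lambda)^{-1} I$ can only increase it.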
Specifically, the main contributions of this paper are:

We propose a surrogate ridge leverage function based on kernel alignment and derive its associated fast surrogate ERLS distribution. This distribution is simple in form and has intuitive physical meanings. Our theoretical analysis provides a lower bound on the number of random features that guarantees no loss in the learning accuracy in KRR.

By sampling from the surrogate ERLS distribution, our data-dependent algorithm takes $\mathcal{O}(ns^2)$ time to generate random features, which is the same as RFF and less than the $\mathcal{O}(ns^2 + s^3)$ time in [Li et al.2019]. We further provide theoretical guarantees on the generalization performance of our algorithm equipped with the KRR estimator.

Experiments on various benchmark datasets demonstrate that our method performs better than standard random Fourier features based algorithms. Specifically, our algorithm achieves comparable (or even better) accuracy and uses less time when compared to [Li et al.2019].
The remainder of the paper is organized as follows. Section 2 briefly reviews the related work on random features for kernel approximation. Our surrogate leverage weighted sampling strategy for RFF is presented in Section 3, and related theoretical results are given in Section 4. In Section 5, we provide an experimental evaluation of our algorithm and compare it with other representative random features based methods on popular benchmarks. The paper is concluded in Section 6.
2 Related Works
Recent research on random Fourier features focuses on constructing the mapping
$$\varphi(x) := \big[a_1 \exp(-i\omega_1^\top x), \dots, a_s \exp(-i\omega_s^\top x)\big]^\top.$$
The key question is how to select the points $\omega_i$ and weights $a_i$ so as to uniformly approximate the integral in Eq. (1). In standard RFF, $\{\omega_i\}_{i=1}^s$ are randomly sampled from $p(\omega)$ and the weights are equal, i.e., $a_i \equiv 1/\sqrt{s}$. To reduce the approximation variance, Yu et al. [Yu et al.2016] propose the orthogonal random features (ORF) approach, which incorporates an orthogonality constraint when sampling $\{\omega_i\}_{i=1}^s$ from $p(\omega)$. Sampling theory [Niederreiter1992] suggests that the convergence of the Monte Carlo sampling used in RFF and ORF can be significantly improved by choosing a deterministic sequence instead of sampling randomly. Therefore, a possible middle-ground method is Quasi-Monte Carlo sampling [Avron et al.2016], which uses a low-discrepancy sequence rather than the fully random Monte Carlo samples. Other deterministic approaches based on numerical quadrature are considered in [Evans1993]. Bach [Bach2017] analyzes the relationship between random features and quadrature, which allows one to use deterministic numerical integration methods such as Gaussian quadrature [Dao, De Sa, and Ré2017], spherical-radial quadrature rules [Munkhoeva et al.2018], and sparse quadratures [Gauthier and Suykens2018] for kernel approximation.

The above methods are all data-independent, i.e., the selection of points and weights is independent of the training data. Another line of work considers data-dependent algorithms, which use the training data to guide the generation of random Fourier features by using, e.g., kernel alignment [Sinha and Duchi2016], feature compression [Agrawal et al.2019], or the ridge leverage function [Avron et al.2017, Sun, Gilbert, and Tewari2018, Li et al.2019, Fanuel, Schreurs, and Suykens2019]. Since our method builds on the leverage function $l_\lambda(\omega)$, we detail this approach here. From Eq. (3), the integral of $l_\lambda(\omega)$ is
$$\int_{\mathbb{R}^d} l_\lambda(\omega)\, d\omega = \mathrm{Tr}\big[K (K + n\lambda I)^{-1}\big] =: d_K^\lambda. \qquad (5)$$
The quantity $d_K^\lambda$ determines the number of independent parameters in a learning problem and hence is referred to as the number of effective degrees of freedom [Bach2013]. Li et al. [Li et al.2019] provide the sharpest bound on the required number of random features; in particular, with $\Omega(\sqrt{n} \log d_K^\lambda)$ features, no loss is incurred in learning accuracy of kernel ridge regression. Albeit elegant, sampling according to $q^*(\omega)$ is often intractable in practice. The alternative approach proposed in [Li et al.2019] takes $\mathcal{O}(ns^2 + s^3)$ time, which is larger than the $\mathcal{O}(ns^2)$ time in the standard RFF.
3 Surrogate Leverage Weighted RFF
3.1 Problem setting
Consider a standard supervised learning setting, where
is a compact metric space of features, and (in regression) or (in classification) is the label space. We assume that a sample set is drawn from a non-degenerate Borel probability measure on . The target function of is defined by for each , where is the conditional distribution of at . Given a kernel function and its associated reproducing kernel Hilbert space (RKHS) , the goal is to find a hypothesis in such that is a good estimate of the label for a new instance . By virtue of the representer theorem [Schölkopf and Smola2003], an empirical risk minimization problem can be formulated as
(6) 
where with and the convex loss quantifies the quality of the estimate w.r.t. the true . In this paper, we focus on learning with the squared loss, i.e., . Hence, the expected risk in KRR is defined as , with the corresponding empirical risk defined on the sample, i.e., . In standard learning theory, the optimal hypothesis is defined as , where is the conditional distribution of at . The regularization parameter in Eq. (6) should depend on the sample size; in particular, with . Following [Rudi, Camoriano, and Rosasco2017, Li et al.2019], we pick .
As shown in [Li et al.2019], when using random features, the empirical risk minimization problem (6) can be expressed as
(7) 
where
is the label vector and
is the random feature matrix, with as defined in Eq. (2) and sampled from a distribution . Eq. (7) is a linear ridge regression problem in the space spanned by the random features [Suykens et al.2002, Mall and Suykens2015], and the optimal hypothesis is given by , with
(8) 
Note that the distribution determines the feature mapping matrix and hence has a significant impact on the generalization performance. Our main goal in the sequel is to design a good , and to understand the relationship between and the expected risk. In particular, we would like to characterize the number of random features needed when sampling from in order to achieve a certain convergence rate of the risk.
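The linear ridge regression view of Eq. (7) can be sketched in a few lines. The sketch assumes the objective $\frac{1}{n}\|y - Z\beta\|_2^2 + \lambda\|\beta\|_2^2$ with closed-form solution $\beta = (Z^\top Z + n\lambda I)^{-1} Z^\top y$, and a regularization parameter of order $n^{-1/2}$; since the displayed equations are not reproduced here, both are assumptions, as are the function names and the synthetic regression data.

```python
import numpy as np

def fit_primal(Z, y, lam):
    """Ridge solution in feature space: solves an m x m system (m = number of features)."""
    n, m = Z.shape
    return np.linalg.solve(Z.T @ Z + n * lam * np.eye(m), Z.T @ y)

def fit_dual(Z, y, lam):
    """Equivalent KRR solution with the approximate kernel Z Z^T: solves an n x n system."""
    n = Z.shape[0]
    alpha = np.linalg.solve(Z @ Z.T + n * lam * np.eye(n), y)
    return Z.T @ alpha

rng = np.random.default_rng(0)
n, d, s = 300, 2, 50
X = rng.uniform(-1.0, 1.0, size=(n, d))              # synthetic inputs (assumption)
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=n)
W = rng.normal(size=(d, s))                          # frequencies for a Gaussian kernel
proj = X @ W
Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(s)

lam = n ** -0.5                                      # assumed order of the regularizer
beta_primal = fit_primal(Z, y, lam)
beta_dual = fit_dual(Z, y, lam)
train_mse = np.mean((Z @ beta_primal - y) ** 2)
```

Both routes give the same coefficients; the primal route is the one that benefits from a small number of features, because its linear system scales with the feature dimension rather than with $n$.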
3.2 Surrogate leverage weighted sampling
Let be a probability density function to be designed. Given the points sampled from , we define the mapping
(9) 
We again have , where . Accordingly, the kernel matrix can be approximated by , where . Denoting by the th column of , we have . Note that this scheme can be regarded as a form of importance sampling.
Our surrogate empirical ridge leverage score distribution is given by Eq. (4). The integral of is
(10) 
Combining Eq. (4) and Eq. (10), we can compute the surrogate empirical ridge leverage score distribution by
(11) 
The random features can then be sampled from the above . We refer to this sampling strategy as surrogate leverage weighted RFF. Compared to the standard and its associated ERLS distribution, the proposed and are simpler: it does not require inverting the kernel matrix and thus accelerates the generation of random features.
Since the distribution involves the kernel matrix that is defined on the entire training dataset, we need to approximate by random features, and then calculate/approximate . To be specific, we firstly sample with from the spectral measure and form the feature matrix . We have , and thus the distribution can be approximated by
(12) 
Hence, we can then sample from to generate the refined random features by importance sampling.
Note that the term in Eq. (4) and Eq. (12) is independent of the sample set . If we discard this term in our algorithm implementation, in Eq. (4) can be transformed as
(13) 
and further in Eq. (12) can be simplified to
(14) 
For each feature , its resampling probability is associated with the approximate empirical ridge leverage score in Eq. (13). To be specific, it can be represented as
(15) 
It has intuitive physical meanings. From Eq. (15), it measures the correlation between the label and the mapping output . Therefore, quantifies the contribution of , which defines the th dimension of the feature mapping , for fitting the training data. In this view, if is large, is more important than the other features, and will be given the priority in the above importance sampling scheme. Based on this, we resample features from to generate the refined random features. Our surrogate leverage weighted RFF algorithm applied to KRR is summarized in Algorithm 1.
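The resampling idea described above can be sketched as follows. This is an illustrative reading rather than the authors' Algorithm 1: it assumes a Gaussian kernel, scores each pre-sampled frequency by the squared correlation between the labels and its (complex) feature column, resamples with probability proportional to that score, and reweights each kept feature by the inverse square root of its resampling probability. The names `surrogate_leverage_rff`, `feature_map`, and `n_pre` are hypothetical.

```python
import numpy as np

def surrogate_leverage_rff(X, y, s, n_pre, sigma=1.0, rng=None):
    """Resample s frequencies from n_pre pre-sampled ones using label correlation."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_pre))   # draws from the spectral measure
    proj = X @ W
    # |y^T z_i|^2 for the complex column z_i = exp(-i X w_i), split into cos/sin parts.
    score = (y @ np.cos(proj)) ** 2 + (y @ np.sin(proj)) ** 2
    prob = score / score.sum()
    idx = rng.choice(n_pre, size=s, replace=True, p=prob)
    # Importance weight: uniform pre-sample probability (1 / n_pre) over resampling probability.
    scale = 1.0 / np.sqrt(s * n_pre * prob[idx])
    return W[:, idx], scale, prob

def feature_map(X, W, scale):
    """Importance-weighted real-valued features, one cos and one sin column per frequency."""
    proj = X @ W
    return np.hstack([np.cos(proj) * scale, np.sin(proj) * scale])

rng = np.random.default_rng(0)
n, d, s = 500, 4, 64
X = rng.normal(size=(n, d))                               # synthetic inputs (assumption)
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n))
W_s, scale, prob = surrogate_leverage_rff(X, y, s=s, n_pre=s, rng=1)
Z = feature_map(X, W_s, scale)
```

Computing the scores needs only matrix-vector products with the label vector, so no matrix inverse appears anywhere in the sampling step. Setting `n_pre` equal to `s` mirrors the remark in Section 4.2 that the same number of features is used to approximate the kernel matrix.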
Also note that if the following condition holds
then sampling from or does not have distinct differences. This condition is satisfied if does not dramatically fluctuate for each column. in which sampling from or may be used.
The method in [Li et al.2019] samples from , while ours samples from . In comparison, our surrogate ERLS distribution is much simpler as it avoids inverting the matrix . Hence, generating random features by Algorithm 1 takes time to do the sampling. It is the same as the standard RFF and less than the time needed by [Li et al.2019] which requires for the multiplication of and for inverting .
4 Theoretical Analysis
In this section, we analyze the generalization properties of kernel ridge regression when using random Fourier features sampled from our . Our analysis includes two parts. We first study how many features sampled from are needed to incur no loss of learning accuracy in KRR. We then characterize the convergence rate of the expected risk of KRR when combined with Algorithm 1. Our proofs follow the framework in [Li et al.2019] and in particular involve the same set of assumptions.
4.1 Expected risk for sampling from
The theorem below characterizes the relationship between the expected risk in KRR and the total number of random features used.
Theorem 1.
Given a shiftinvariant and positive definite kernel function
, denote the eigenvalues of the kernel matrix
as . Suppose that the regularization parameter satisfies , is bounded with , and are sampled independently from the surrogate empirical ridge leverage score distribution . If the unit ball of contains the optimal hypothesis and
then for , with probability , the excess risk of can be upper bounded as
(16) 
where is the excess risk for the standard kernel ridge regression estimator.
Theorem 1 shows that if the total number of random features sampled from satisfies , we incur no loss in the learning accuracy of kernel ridge regression. In particular, with the standard choice , the estimator attains the minimax rate of kernel ridge regression.
To illustrate the lower bound in Theorem 1 on the number of features, we consider three cases regarding the eigenvalue decay of : i) the exponential decay with , ii) the polynomial decay with , and iii) the slowest decay with (see [Bach2013] for details). In all three cases, direct calculation shows
Moreover, satisfies in the exponential decay case, in the polynomial decay case, and in the slowest case. Combining these bounds gives the number of random features sufficient for no loss in the learning accuracy of KRR; these results are reported in Tab. 1. It can be seen that sampling from [Li et al.2019] sometimes requires fewer random features than our method. This is reasonable, as the design of our surrogate ERLS distribution follows in a simple fashion and we directly relax to . It does not strictly follow the continuous generalization of the leverage scores used in the analysis of linear methods [Alaoui and Mahoney2015, Cohen, Musco, and Musco2017, Avron et al.2017]. With a more careful argument, this bound can be further improved and made tight, which we leave to future work. Nevertheless, our theoretical analysis provides a worst-case estimate for the lower bound of . In practice, our algorithm does not require such a large number of random features to achieve good prediction performance. Specifically, our experimental results in Section 5 demonstrate that when using the same , there is no distinct difference between [Li et al.2019] and our method in terms of prediction performance. Moreover, our approach takes less time to generate the refined random features, achieving a substantial computational saving when the total number of random features is relatively large.
Eigenvalue decay  [Li et al.2019]  Ours 
,  
,  
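To give a feel for how the eigenvalue decay affects the number of effective degrees of freedom, the sketch below evaluates $\sum_i \lambda_i / (\lambda_i + n\lambda)$, which equals $\mathrm{Tr}[K(K + n\lambda I)^{-1}]$ when $\lambda_i$ are the eigenvalues of $K$. The three spectra used here ($n e^{-ci}$, $n i^{-2t}$, and $n/i$), the constants $c = t = 1$, and the choice $\lambda = n^{-1/2}$ are assumptions made for illustration, since the exact decay rates are not reproduced in the text above.

```python
import numpy as np

def effective_dof(eigvals, n, lam):
    """Effective degrees of freedom: sum_i lambda_i / (lambda_i + n * lam)."""
    return float(np.sum(eigvals / (eigvals + n * lam)))

n = 1000
lam = n ** -0.5                               # assumed regularization parameter
i = np.arange(1, n + 1)

# Illustrative spectra for the three decay regimes (constants are assumptions).
spectra = {
    "exponential": n * np.exp(-1.0 * i),      # c = 1
    "polynomial": n * i ** -2.0,              # t = 1
    "slowest": n / i,
}
dof = {name: effective_dof(ev, n, lam) for name, ev in spectra.items()}
```

Faster decay concentrates the spectrum in a few directions, so fewer eigenvalues exceed the threshold $n\lambda$ and the effective dimension is smaller.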
To prove Theorem 1, we need the following lemma.
Lemma 1.
Under the same assumptions from Theorem 1, let with constants and given by
If the number of random features satisfies
(17) 
then for , with probability , we have
(18) 
Proof.
Following the proof of Lemma 4 in [Li et al.2019], by the matrix Bernstein concentration inequality [Tropp and others2015] and , we conclude the proof. ∎
Based on Lemma 1, as well as the previous results including Lemma 2, Lemma 5, Lemma 6, Theorem 5 in [Li et al.2019], we can immediately prove Theorem 1.
4.2 Expected risk for Algorithm 1
In the above analysis, our results are based on the random features sampled from . In Algorithm 1, are actually drawn from or . In this section, we present the convergence rates for the expected risk of Algorithm 1.
Theorem 2.
Proof.
According to Theorem 1 and Corollary 2 in [Li et al.2019], if the number of random features satisfies , then for any , with confidence , the excess risk of can be bounded by
(20) 
Let be the function in spanned by the approximated kernel that achieves the minimal risk, i.e., . Hence, we resample according to as defined in Eq. (11), in which the kernel matrix is indicated by the actual kernel spanned in . Denote our KRR estimator with the regularization parameter and the learning function , and according to Theorem 1, if the number of random features satisfies then for , with confidence , the excess risk of can be estimated by
(21) 
Since achieves the minimal risk over , we can conclude that . Combining Eq. (20) and Eq. (21), we obtain the final excess risk of . ∎
Theorem 2 provides an upper bound on the expected risk of the KRR estimator over random features generated by Algorithm 1. Note that, in our implementation, the number of random features used to approximate the kernel matrix is also set to for simplicity, as in the implementation of [Li et al.2019].
5 Experiments
In this section, we empirically evaluate the performance of our method equipped with KRR for classification tasks on several benchmark datasets. All experiments are implemented in MATLAB and carried out on a PC with an Intel i5-6500 CPU (3.20 GHz) and 16 GB RAM. The source code of our implementation can be found at http://www.lfhsgre.org.
5.1 Experimental settings
We choose the popular shift-invariant Gaussian/RBF kernel for experimental validation, i.e., $k(x, x') = \exp\big(-\|x - x'\|_2^2 / (2\sigma^2)\big)$. Following [Avron et al.2017], we use a fixed bandwidth in our experiments. This is without loss of generality since we can rescale the points and adjust the bounding interval. The regularization parameter $\lambda$ is tuned via 5-fold inner cross validation over a grid of .
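The tuning procedure can be sketched as a plain 5-fold cross validation over a grid of regularization values. The grid below is hypothetical (the grid used in the experiments is not reproduced here), the data are synthetic, and the helper names `ridge_fit` and `cv_select_lambda` are illustrative.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression on a feature matrix Z."""
    n, m = Z.shape
    return np.linalg.solve(Z.T @ Z + n * lam * np.eye(m), Z.T @ y)

def cv_select_lambda(Z, y, grid, n_folds=5, seed=0):
    """Return the grid value with the best mean validation accuracy, plus all means."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_acc = []
    for lam in grid:
        accs = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            beta = ridge_fit(Z[train], y[train], lam)
            accs.append(np.mean(np.sign(Z[val] @ beta) == y[val]))
        mean_acc.append(float(np.mean(accs)))
    return grid[int(np.argmax(mean_acc))], mean_acc

rng = np.random.default_rng(0)
n, d, s = 400, 3, 100
X = rng.normal(size=(n, d))                          # synthetic inputs (assumption)
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # binary labels in {-1, +1}
W = rng.normal(size=(d, s))                          # Gaussian kernel, unit bandwidth
Z = np.hstack([np.cos(X @ W), np.sin(X @ W)]) / np.sqrt(s)

grid = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]                 # hypothetical grid
best_lam, mean_acc = cv_select_lambda(Z, y, grid)
```

The random features are generated once and reused across folds here for brevity; regenerating them inside each fold would be the stricter protocol for a data-dependent sampler.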
Dataset  Feature dimension  #Training examples  #Test examples
EEG  14  7,490  7,490
codRNA  8  59,535  157,413
covtype  54  290,506  290,506
magic04  10  9,510  9,510
Datasets: We consider four classification datasets including EEG, codRNA, covtype and magic04; see Tab. 2 for an overview of these datasets. These datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ or the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). All datasets are normalized to before the experiments. We use the given training/test partitions on the codRNA dataset. For the other three datasets, we randomly pick half of the data for training and the rest for testing. All experiments are repeated 10 times and we report the average classification accuracy as well as the time cost for generating random features.
Compared methods: We compare the proposed surrogate leverage weighted sampling strategy with the following three random features based algorithms:

RFF [Rahimi and Recht2007]: The feature mapping is given by Eq. (2), in which the random features are sampled from .

QMC [Avron et al.2016]: The feature mapping is also given by Eq. (2), but the random features are constructed by a deterministic scheme, e.g., a low-discrepancy Halton sequence.

leverageRFF [Li et al.2019]: The data-dependent random features are sampled from . The kernel matrix in is approximated using random features pre-sampled from .
In our method, we consider sampling from in Algorithm 1 for simplicity.
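For the QMC baseline listed above, a deterministic frequency set can be sketched with a Halton sequence pushed through the inverse Gaussian CDF. This is a minimal standard-library illustration, assuming a Gaussian kernel with bandwidth `sigma`; it is not the implementation used in [Avron et al.2016], and the helper names and the short prime table (which limits the sketch to at most 10 input dimensions) are illustrative.

```python
from statistics import NormalDist

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # one base per input dimension

def radical_inverse(i, base):
    """Van der Corput radical inverse of the positive integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv

def halton_gaussian_frequencies(s, d, sigma=1.0):
    """Map s Halton points in (0, 1)^d to frequencies via the inverse normal CDF."""
    if d > len(PRIMES):
        raise ValueError("this sketch supports at most %d dimensions" % len(PRIMES))
    std_normal = NormalDist()
    return [
        [std_normal.inv_cdf(radical_inverse(i, PRIMES[j])) / sigma for j in range(d)]
        for i in range(1, s + 1)   # start at 1 so that no coordinate equals 0
    ]

W = halton_gaussian_frequencies(s=8, d=3)   # 8 frequencies in 3 dimensions
```

The resulting rows play the role of the sampled frequencies in Eq. (2); because the sequence is deterministic, repeated runs give identical features, which is consistent with the zero accuracy standard deviations reported for QMC on codRNA.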
Dataset  RFF  QMC  leverageRFF  Ours
Acc:% (time:sec.)  Acc:% (time:sec.)  Acc:% (time:sec.)  Acc:% (time:sec.)
EEG  68.45±0.89 (0.01±0.00)  67.83±0.73 (0.02±0.03)  68.78±0.97 (0.01±0.01)  68.62±0.89 (0.01±0.00)
71.44±1.22 (0.02±0.00)  70.89±0.72 (0.03±0.03)  72.59±1.51 (0.03±0.01)  72.72±1.35 (0.02±0.00)
74.70±0.94 (0.03±0.01)  74.66±0.42 (0.04±0.03)  79.06±0.73 (0.05±0.01)  79.72±0.58 (0.03±0.01)
76.96±0.96 (0.06±0.02)  77.01±0.40 (0.07±0.03)  83.95±0.58 (0.10±0.01)  84.97±0.50 (0.07±0.02)
78.54±0.63 (0.11±0.00)  78.71±0.29 (0.12±0.03)  86.29±0.50 (0.20±0.01)  87.23±0.41 (0.13±0.01)
78.96±0.44 (0.22±0.02)  78.83±0.59 (0.24±0.06)  88.05±0.31 (0.49±0.03)  89.38±0.32 (0.26±0.03)
79.97±0.62 (0.45±0.01)  79.71±0.40 (0.47±0.05)  89.12±0.36 (1.21±0.05)  90.36±0.31 (0.53±0.02)
79.79±0.49 (0.82±0.05)  79.51±0.47 (0.84±0.06)  90.01±0.27 (3.91±0.09)  91.02±0.32 (0.96±0.03)
codRNA  87.02±0.29 (0.06±0.01)  87.20±0.00 (0.07±0.03)  88.62±0.92 (0.09±0.02)  89.64±0.87 (0.07±0.01)
87.12±0.19 (0.12±0.01)  87.65±0.00 (0.16±0.02)  90.42±1.15 (0.17±0.01)  90.12±0.95 (0.13±0.01)
87.19±0.08 (0.24±0.01)  87.44±0.00 (0.25±0.02)  92.65±0.38 (0.35±0.02)  92.83±0.33 (0.27±0.01)
87.27±0.11 (0.47±0.02)  87.29±0.00 (0.49±0.02)  93.41±0.07 (0.69±0.02)  93.49±0.15 (0.53±0.02)
87.29±0.08 (0.91±0.02)  87.30±0.00 (0.94±0.04)  93.71±0.06 (1.39±0.05)  93.74±0.05 (0.99±0.02)
87.27±0.05 (1.80±0.02)  87.33±0.00 (1.77±0.01)  93.76±0.02 (2.82±0.08)  93.71±0.07 (1.95±0.03)
87.30±0.03 (3.48±0.15)  87.32±0.00 (3.54±0.10)  93.73±0.03 (6.54±0.53)  93.99±0.06 (4.05±0.08)
87.30±0.03 (6.79±0.39)  87.32±0.00 (6.62±0.08)  93.66±0.03 (13.3±0.23)  93.48±0.04 (7.78±0.09)
covtype^{1}  73.70±0.79 (1.90±0.03)  74.71±0.07 (1.88±0.11)  73.99±0.85 (2.96±0.09)  73.99±0.63 (2.00±0.05)
77.09±0.25 (3.31±0.21)  77.04±0.06 (3.37±0.29)  77.04±0.35 (5.25±0.09)  77.02±0.29 (3.44±0.10)
79.10±0.13 (6.27±0.35)  79.07±0.07 (6.12±0.17)  79.18±0.17 (10.2±0.15)  79.05±0.14 (6.58±0.19)
81.04±0.12 (12.3±0.71)  80.90±0.05 (12.1±0.45)  81.09±0.07 (21.1±0.72)  80.79±0.11 (13.2±0.34)
82.42±0.10 (24.5±1.02)  82.37±0.07 (24.3±1.56)  82.90±0.12 (46.5±2.20)  82.18±0.10 (28.6±1.58)
magic04  73.62±0.68 (0.01±0.00)  71.74±0.40 (0.02±0.04)  73.62±0.68 (0.01±0.01)  73.61±0.68 (0.01±0.00)
75.89±0.80 (0.01±0.01)  75.98±0.36 (0.03±0.03)  75.91±0.77 (0.03±0.01)  75.88±0.77 (0.02±0.00)
77.78±0.45 (0.03±0.01)  77.27±0.33 (0.04±0.03)  77.78±0.45 (0.05±0.01)  77.77±0.43 (0.03±0.00)
78.97±0.34 (0.05±0.00)  79.07±0.17 (0.07±0.03)  79.15±0.40 (0.09±0.01)  79.12±0.34 (0.06±0.01)
80.04±0.34 (0.10±0.01)  79.95±0.37 (0.11±0.03)  80.80±0.40 (0.19±0.01)  80.74±0.42 (0.11±0.01)
80.61±0.43 (0.19±0.01)  80.65±0.31 (0.21±0.04)  82.00±0.32 (0.41±0.03)  82.02±0.32 (0.22±0.01)
80.91±0.28 (0.38±0.03)  80.85±0.27 (0.41±0.05)  82.39±0.32 (0.93±0.05)  82.37±0.25 (0.44±0.03)
81.10±0.37 (0.73±0.03)  81.08±0.29 (0.76±0.04)  82.59±0.29 (2.61±0.15)  82.55±0.55 (0.87±0.02)

Due to the memory limit, we cannot conduct the experiment on the covtype dataset when .
” indicates that leverageRFF and our method are significantly better than the other two baseline methods via paired ttest.
5.2 Comparison results
High-level comparison:
We compare the empirical performance of the aforementioned random features mapping based algorithms. In Fig. 1, we consider the EEG dataset and plot the relative kernel matrix approximation error, the test classification accuracy and the time cost for generating random features versus different values of . Note that since we cannot compute the kernel matrix on the entire dataset, we randomly sample 1,000 datapoints to construct the feature matrix , and then calculate the relative approximation error, i.e., .
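The subsampled error measurement described above can be sketched as follows. The matrix norm is not reproduced in the text, so the spectral norm is assumed here; the Gaussian kernel with unit bandwidth, the synthetic data, and the helper names are also assumptions, and a subset of 300 points is used instead of 1,000 only to keep the sketch fast.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Exact Gaussian kernel matrix on the rows of X."""
    sq = (X**2).sum(axis=1)
    sq_dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def make_rff_map(d, s, sigma, rng):
    """Return a feature map x -> [cos(xW), sin(xW)] / sqrt(s) with fixed frequencies W."""
    W = rng.normal(scale=1.0 / sigma, size=(d, s))
    return lambda X: np.hstack([np.cos(X @ W), np.sin(X @ W)]) / np.sqrt(s)

def relative_kernel_error(X, feature_map, n_sub=1000, seed=0):
    """||K - Z Z^T||_2 / ||K||_2 on a random subset of at most n_sub points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)
    Xs = X[idx]
    K = gaussian_kernel(Xs)
    Z = feature_map(Xs)
    return np.linalg.norm(K - Z @ Z.T, 2) / np.linalg.norm(K, 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))                     # synthetic data (assumption)
err_small = relative_kernel_error(X, make_rff_map(5, 10, 1.0, rng), n_sub=300)
err_large = relative_kernel_error(X, make_rff_map(5, 2000, 1.0, rng), n_sub=300)
```

Using the same `seed` for both calls evaluates the two feature maps on the same subset, so the difference between `err_small` and `err_large` reflects only the number of random features.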
Fig. 1(a) shows the mean of the approximation errors across 10 trials (with one standard deviation denoted by error bars) under different random feature dimensionalities. We find that QMC achieves the best approximation performance when compared to RFF, leverageRFF, and our proposed method. Fig. 1(b) shows the test classification accuracy. We find that as the number of random features increases, leverageRFF and our method significantly outperform RFF and QMC.

From the above experimental results, we find that, admittedly, QMC achieves lower approximation error to some extent, but this does not translate into better classification performance when compared to leverageRFF and our method. The reason may be that the original kernel derived from the pointwise distance might not be suitable, and the approximated kernel is not optimal for classification/regression tasks, as discussed in [Avron et al.2017, Munkhoeva et al.2018, Zhang et al.2019]. As the ultimate goal of kernel approximation is to achieve better prediction performance, in the sequel we omit the approximation performance of these methods.
In terms of time cost for generating random features, Fig. 1(c) shows that leverageRFF is quite time-consuming when the total number of random features is large. In contrast, our algorithm achieves comparable computational efficiency with RFF and QMC. These results demonstrate the superiority of our surrogate weighted sampling strategy, which reduces the time cost.
Detailed comparisons:
Tab. 3 reports the detailed classification accuracy and time cost for generating random features of all algorithms on the four datasets. Observe that by using a data-dependent sampling strategy, leverageRFF and our method achieve better classification accuracy than RFF and QMC on the EEG and codRNA datasets when the dimensionality of random features increases. In particular, on the EEG dataset, when ranges from to , the test accuracy of leverageRFF and our method is better than RFF and QMC by around 1% to nearly 11%. On the codRNA dataset, the performance of RFF and QMC is worse than our method by over 6% when . On the covtype dataset, all four methods achieve similar classification accuracy. On the magic04 dataset, by contrast, our algorithm and leverageRFF achieve better final classification accuracy than RFF and QMC when more random features are used.
In terms of computational efficiency on these four datasets, albeit data-dependent, our method takes about the same time as the data-independent RFF and QMC for generating random features. Moreover, when compared to leverageRFF, our method achieves a substantial computational saving.
6 Conclusion
In this work, we have proposed an effective data-dependent sampling strategy for quickly generating random features for kernel approximation. Our method can significantly improve the generalization performance while achieving the same time complexity as the standard RFF. Our theoretical results and experimental validation have demonstrated the superiority of our method when compared to other representative random Fourier features based algorithms on several classification benchmark datasets.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No.61572315, 61876107, 61977046), in part by the National Key Research and Development Project (No. 2018AAA0100702), in part by National Science Foundation grants CCF1657420 and CCF1704828, in part by the European Research Council under the European Union’s Horizon 2020 research and innovation program / ERC Advanced Grant EDUALITY (787960). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information; Research Council KUL C14/18/068; Flemish Government FWO project GOA4917N; Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen programme. Jie Yang and Xiaolin Huang are corresponding authors.
References

[Agrawal et al.2019] Agrawal, R.; Campbell, T.; Huggins, J.; and Broderick, T. 2019. Data-dependent compression of random features for large-scale kernel approximation. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 1822–1831.
[Alaoui and Mahoney2015] Alaoui, A., and Mahoney, M. W. 2015. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of Advances in Neural Information Processing Systems, 775–783.
[Avron et al.2016] Avron, H.; Sindhwani, V.; Yang, J.; and Mahoney, M. W. 2016. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research 17(1):4096–4133.
[Avron et al.2017] Avron, H.; Kapralov, M.; Musco, C.; Musco, C.; Velingker, A.; and Zandieh, A. 2017. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 253–262.
[Bach2013] Bach, F. 2013. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of Conference on Learning Theory, 185–209.
[Bach2017] Bach, F. 2017. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research 18(1):714–751.
[Bochner2005] Bochner, S. 2005. Harmonic Analysis and the Theory of Probability. Courier Corporation.
[Cohen, Musco, and Musco2017] Cohen, M. B.; Musco, C.; and Musco, C. 2017. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1758–1777.
[Cortes, Mohri, and Rostamizadeh2012] Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2012. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 13(2):795–828.
[Dao, De Sa, and Ré2017] Dao, T.; De Sa, C. M.; and Ré, C. 2017. Gaussian quadrature for kernel features. In Proceedings of Advances in Neural Information Processing Systems, 6107–6117.
[Evans1993] Evans, G. 1993. Practical Numerical Integration. Wiley, New York.
[Fanuel, Schreurs, and Suykens2019] Fanuel, M.; Schreurs, J.; and Suykens, J.A.K. 2019. Nyström landmark sampling and regularized Christoffel functions. arXiv preprint arXiv:1905.12346.
[Gauthier and Suykens2018] Gauthier, B., and Suykens, J.A.K. 2018. Optimal quadrature-sparsification for integral operator approximation. SIAM Journal on Scientific Computing 40(5):A3636–A3674.
[Li et al.2019] Li, Z.; Ton, J.-F.; Oglic, D.; and Sejdinovic, D. 2019. Towards a unified analysis of random Fourier features. In Proceedings of the 36th International Conference on Machine Learning, 3905–3914.
[Mall and Suykens2015] Mall, R., and Suykens, J.A.K. 2015. Very sparse LSSVM reductions for large-scale data. IEEE Transactions on Neural Networks and Learning Systems 26(5):1086–1097.
[Munkhoeva et al.2018] Munkhoeva, M.; Kapushev, Y.; Burnaev, E.; and Oseledets, I. 2018. Quadrature-based features for kernel approximation. In Proceedings of Advances in Neural Information Processing Systems, 9165–9174.
[Niederreiter1992] Niederreiter, H. 1992. Random Number Generation and Quasi-Monte Carlo Methods, volume 63. SIAM.
[Rahimi and Recht2007] Rahimi, A., and Recht, B. 2007. Random features for large-scale kernel machines. In Proceedings of Advances in Neural Information Processing Systems, 1177–1184.
[Rudi, Camoriano, and Rosasco2017] Rudi, A.; Camoriano, R.; and Rosasco, L. 2017. Generalization properties of learning with random features. In Proceedings of Advances in Neural Information Processing Systems, 3215–3225.
[Schölkopf and Smola2003] Schölkopf, B., and Smola, A. J. 2003. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[Sinha and Duchi2016] Sinha, A., and Duchi, J. C. 2016. Learning kernels with random features. In Proceedings of Advances in Neural Information Processing Systems, 1298–1306.
[Sun, Gilbert, and Tewari2018] Sun, Y.; Gilbert, A.; and Tewari, A. 2018. But how does it work in theory? Linear SVM with random features. In Proceedings of Advances in Neural Information Processing Systems, 3383–3392.
[Suykens et al.2002] Suykens, J.A.K.; Van Gestel, T.; De Brabanter, J.; De Moor, B.; and Vandewalle, J. 2002. Least Squares Support Vector Machines. World Scientific.
[Tropp and others2015] Tropp, J. A., et al. 2015. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8(1-2):1–230.
[Yu et al.2016] Yu, F. X.; Suresh, A. T.; Choromanski, K.; Holtmann-Rice, D.; and Kumar, S. 2016. Orthogonal random features. In Proceedings of Advances in Neural Information Processing Systems, 1975–1983.
[Zhang et al.2019] Zhang, J.; May, A.; Dao, T.; and Ré, C. 2019. Low-precision random Fourier features for memory-constrained kernel approximation. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 1264–1274.