# But How Does It Work in Theory? Linear SVM with Random Features

We prove that, under low noise assumptions, the support vector machine with $N \ll m$ random features (RFSVM) can achieve a learning rate faster than $O(1/\sqrt{m})$ on a training set with $m$ samples when an optimized feature map is used. Our work extends the previous fast rate analysis of the random features method from least squares loss to 0-1 loss. We also show that the reweighted feature selection method, which approximates the optimized feature map, helps improve the performance of RFSVM in experiments on a synthetic data set.


## Code Repositories

### randfourier

This repository maintains the code testing the performance of random Fourier features method.


## 1 Introduction

Kernel methods such as kernel support vector machines (KSVMs) have been widely and successfully used in classification tasks ([Steinwart and Christmann(2008)]). The power of kernel methods comes from the fact that they implicitly map the data to a high dimensional, or even infinite dimensional, feature space, where points with different labels can be separated by a linear functional. It is, however, time-consuming to compute the kernel matrix and thus KSVMs do not scale well to extremely large datasets. To overcome this challenge, researchers have developed various ways to efficiently approximate the kernel matrix or the kernel function.

The random features method, proposed by [Rahimi and Recht(2008)], maps the data to a finite dimensional feature space as a random approximation to the feature space of RBF kernels. With explicit finite dimensional feature vectors available, the original KSVM is converted to a linear support vector machine (LSVM), that can be trained by faster algorithms ([Shalev-Shwartz et al.(2011)Shalev-Shwartz, Singer, Srebro, and Cotter, Hsieh et al.(2008)Hsieh, Chang, Lin, Keerthi, and Sundararajan]) and tested in constant time with respect to the number of training samples. For example, [Huang et al.(2014)Huang, Avron, Sainath, Sindhwani, and Ramabhadran] and [Dai et al.(2014)Dai, Xie, He, Liang, Raj, Balcan, and Song] applied RFSVM or its variant to datasets containing millions of data points and achieved performance comparable to deep neural nets.

Despite solid practical performance, there is a lack of clear theoretical guarantees for the learning rate of RFSVM. [Rahimi and Recht(2009)] obtained a risk gap of order $O(1/\sqrt{N})$ between the best RFSVM and KSVM classifiers, where $N$ is the number of features. Although the order of the error bound is correct for general cases, it is too pessimistic to justify or explain the actual computational benefits of the random features method in practice. Moreover, the model is formulated as a constrained optimization problem, which is rarely used in practice.

[Cortes et al.(2010)Cortes, Mohri, and Talwalkar] and [Sutherland and Schneider(2015)] considered the performance of RFSVM as a perturbed optimization problem, using the fact that the dual form of KSVM is a constrained quadratic optimization problem. Although the maximizer of a quadratic function depends continuously on the quadratic form, its dependence is weak and thus, both papers failed to obtain an informative bound for the excess risk of RFSVM in the classification problem. In particular, such an approach requires RFSVM and KSVM to be compared under the same hyper-parameters. This assumption is, in fact, problematic because the optimal configuration of hyper-parameters of RFSVM is not necessarily the same as those for the corresponding KSVM. In this sense, RFSVM is more like an independent learning model instead of just an approximation to KSVM.

In regression settings, the learning rate of the random features method was studied by [Rudi and Rosasco(2017)] under the assumption that the regression function is in the RKHS, namely the realizable case. They show that uniform feature sampling only requires $O(\sqrt{m}\log(m))$ features to achieve $O(1/\sqrt{m})$ risk of squared loss. They further show that data-dependent sampling can achieve a rate of $O(1/m^{\alpha})$, where $1/2 \le \alpha \le 1$, with even fewer features, when the regression function is sufficiently smooth and the spectrum of the kernel integral operator decays sufficiently fast. However, the method leading to these results depends on the closed form of the least squares solution, and thus we cannot easily extend these results to the non-smooth loss functions used in RFSVM.

[Bach(2017)] recently showed that for any given approximation accuracy, the number of random features required is given by the degrees of freedom of the kernel operator under such an accuracy level, when optimized features are available. This result is crucial for the sample complexity analysis of RFSVM, though not many details are provided on this topic in Bach’s work.

In this paper, we investigate the performance of RFSVM formulated as a regularized optimization problem on classification tasks. In contrast to the slow learning rate in previous results by [Rahimi and Recht(2009)] and [Bach(2017)], we show, for the first time, that RFSVM can achieve a fast learning rate with far fewer features than the number of samples when the optimized features (see Assumption 2) are available, and thus we justify the potential computational benefits of RFSVM on classification tasks. We mainly consider two learning scenarios: the realizable case, and the unrealizable case, where the Bayes classifier does not belong to the RKHS of the feature map. In particular, our contributions are threefold:

1. We prove that under Massart’s low noise condition, with an optimized feature map, RFSVM can achieve a learning rate of $\tilde{O}(m^{-\frac{c_2}{2+c_2}})$ with $\tilde{O}(m^{\frac{2}{2+c_2}})$ features when the Bayes classifier belongs to the RKHS of a kernel whose spectrum decays polynomially ($\lambda_i = O(i^{-c_2})$). Here $\tilde{O}(n)$ represents a quantity less than $C n \log^k(n)$ for some $k$. When the decay rate of the spectrum of the kernel operator is sub-exponential, the learning rate can be improved to $\tilde{O}(1/m)$ with only $\tilde{O}(\ln^d(m))$ features.

2. When the Bayes classifier satisfies the separation condition, that is, when the two classes of points are apart by a positive distance, we prove that the RFSVM using an optimized feature map corresponding to the Gaussian kernel can achieve a learning rate of $\tilde{O}(1/m)$ with $\tilde{O}(\ln^{2d}(m))$ features.

3. Our theoretical analysis suggests reweighting random features before training. We confirm its benefit in our experiments over synthetic data sets.

We begin in Section 2 with a brief introduction of RKHS, random features and the problem formulation, and set up the notation we use throughout the rest of the paper. In Section 3, we provide our main theoretical results, and in Section 4, we verify the performance of RFSVM in experiments. In particular, we show the improvement brought by the reweighted feature selection algorithm. The conclusion and some open questions are summarized at the end. The proofs of our main theorems follow from a combination of the sample complexity analysis scheme used by [Steinwart and Christmann(2008)] and the approximation error result of [Bach(2017)]. The fast rate is achieved due to the fact that the Rademacher complexity of the RKHS of $N$ random features with regularization parameter $\lambda$ is only $O(\sqrt{N \log(1/\lambda)})$, while $N$ and $1/\lambda$ need not be too large to control the approximation error when optimized features are available. Detailed proofs and more experimental results are provided in the Appendices for interested readers.

## 2 Preliminaries and notations

Throughout this paper, a labeled data point is a point $(x, y)$ in $X \times \{-1, 1\}$, where $X$ is a bounded subset of $\mathbb{R}^d$. $X \times \{-1, 1\}$ is equipped with a probability distribution $P$.

### 2.1 Kernels and Random Features

A positive definite kernel function $k(x, x')$ defined on $X \times X$ determines the unique corresponding reproducing kernel Hilbert space (RKHS), denoted by $(F_k, \|\cdot\|_k)$. A map $\phi$ from the data space $X$ to a Hilbert space $H$ such that $\langle \phi(x), \phi(x') \rangle_H = k(x, x')$ is called a feature map of $k$, and $H$ is called a feature space. For any $f \in F_k$, there exists an $h \in H$ such that $\langle h, \phi(x) \rangle_H = f(x)$, and the infimum of the norms of all such $h$'s is equal to $\|f\|_k$. On the other hand, given any feature map $\phi$ into $H$, a kernel function is defined by the equation above, and we call $F_k$ the RKHS corresponding to $\phi$, denoted by $F_\phi$.

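A common choice of feature space is the $L^2$ space of a probability space $(\Omega, \nu)$. An important observation is that for any probability density function $q(\omega)$ defined on $\Omega$, $\phi(\omega; x)/\sqrt{q(\omega)}$ with probability measure $q(\omega)\,\mathrm{d}\nu(\omega)$ defines the same kernel function as the feature map $\phi(\omega; x)$ under the distribution $\nu$. One can sample the image of $x$ under the feature map $\phi$, an $L^2$ function $\phi(\omega; x)$, at points $\{\omega_1, \ldots, \omega_N\}$ according to the probability distribution $\nu$ to approximately represent $x$. Then the vector $\frac{1}{\sqrt{N}}(\phi(\omega_1; x), \ldots, \phi(\omega_N; x))$ in $\mathbb{R}^N$ is called a random feature vector of $x$, denoted by $\phi_N(x)$. The corresponding kernel function determined by $\phi_N$ is denoted by $k_N$.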

A well-known construction of random features is the random Fourier features proposed by [Rahimi and Recht(2008)]. The feature map is defined as follows,

$$\phi: X \to L^2(\mathbb{R}^d, \nu) \oplus L^2(\mathbb{R}^d, \nu), \qquad x \mapsto (\cos(\omega \cdot x), \sin(\omega \cdot x)).$$

And the corresponding random feature vector is

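$$\phi_N(x) = \frac{1}{\sqrt{N}}\left(\cos(\omega_1 \cdot x), \cdots, \cos(\omega_N \cdot x), \sin(\omega_1 \cdot x), \cdots, \sin(\omega_N \cdot x)\right)^{\intercal},$$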

where the $\omega_i$'s are sampled according to $\nu$. Different choices of $\nu$ define different translation invariant kernels (see [Rahimi and Recht(2008)]). When $\nu$ is the normal distribution with mean $0$ and variance $\gamma^{-2}$, the kernel function defined by the feature map is the Gaussian kernel with bandwidth parameter $\gamma$,

$$k_\gamma(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\gamma^2}\right).$$

Equivalently, we may consider the feature map $\phi_\gamma(\omega; x) := \phi(\omega; x/\gamma)$ with $\nu$ being the standard normal distribution.
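The construction above can be checked numerically. The following is a minimal sketch, assuming NumPy is available: it draws frequencies from a normal distribution with standard deviation $1/\gamma$, builds the cosine and sine random feature vector, and compares the resulting inner products with the exact Gaussian kernel. The function names and the values of `d`, `N`, and `gamma` are illustrative choices, not taken from the paper or its repository.

```python
import numpy as np

def random_fourier_features(X, omegas):
    """Return phi_N(x) for every row x of X; output shape is (n, 2N)."""
    n_features = omegas.shape[0]
    projections = X @ omegas.T  # (n, N) matrix of inner products omega_i . x
    stacked = np.hstack([np.cos(projections), np.sin(projections)])
    return stacked / np.sqrt(n_features)

def gaussian_kernel(X, Y, gamma):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 gamma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * gamma ** 2))

rng = np.random.default_rng(0)
d, N, gamma = 2, 2000, 1.5  # illustrative values, not from the paper
X = rng.uniform(-1.0, 1.0, size=(50, d))
# Each coordinate of omega is normal with mean 0 and variance gamma^-2.
omegas = rng.normal(0.0, 1.0 / gamma, size=(N, d))

Z = random_fourier_features(X, omegas)
K_approx = Z @ Z.T                      # k_N(x, x') = <phi_N(x), phi_N(x')>
K_exact = gaussian_kernel(X, X, gamma)  # k_gamma(x, x')
max_err = np.abs(K_approx - K_exact).max()
print(f"max |k_N - k_gamma| over all pairs: {max_err:.4f}")
```

Because $\cos^2 + \sin^2 = 1$, the approximate kernel is exactly 1 on the diagonal; the off-diagonal error shrinks as the number of sampled frequencies grows.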

A more general and more abstract feature map can be constructed using an orthonormal set of $L^2(X, P_X)$. Given the orthonormal set $\{e_i\}$ consisting of bounded functions, and a nonnegative summable sequence $(\lambda_i)$, we can define a feature map

$$\phi(\omega; x) = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, e_i(x)\, e_i(\omega),$$

with feature space $L^2(X, P_X)$. The corresponding kernel is given by $k(x, x') = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(x')$. The feature map and the kernel function are well defined because of the boundedness assumption on $\{e_i\}$. A similar representation can be obtained for a continuous kernel function on a compact set by Mercer’s Theorem ([Lax(2002)]).

Every positive definite kernel function $k$ satisfying $\int k(x, x)\,\mathrm{d}P_X(x) < \infty$ defines an integral operator on $L^2(X, P_X)$ by

$$\Sigma: L^2(X, P_X) \to L^2(X, P_X), \qquad f \mapsto \int_X k(x, t) f(t)\,\mathrm{d}P_X(t).$$

$\Sigma$ is of trace class with trace norm $\int k(x, x)\,\mathrm{d}P_X(x)$. When the integral operator is determined by a feature map $\phi$, we denote it by $\Sigma_\phi$, and the $i$th eigenvalue in descending order by $\lambda_i(\Sigma_\phi)$. Note that the regularization parameter is also denoted by $\lambda$ but without a subscript. The decay rate of the spectrum of $\Sigma_\phi$ plays an important role in the analysis of the learning rate of the random features method.

### 2.2 Formulation of Support Vector Machine

Given $m$ samples $\{(x_i, y_i)\}_{i=1}^m$ generated i.i.d. by $P$ and a function $f: X \to \mathbb{R}$, usually called a hypothesis in the machine learning context, the empirical and expected risks with respect to the loss function $\ell$ are defined by

$$R^{\ell}_m(f) := \frac{1}{m}\sum_{i=1}^{m} \ell(y_i, f(x_i)), \qquad R^{\ell}_P(f) := \mathbb{E}_{(x,y)\sim P}\, \ell(y, f(x)),$$

respectively.

The 0-1 loss is commonly used to measure the performance of classifiers:

$$\ell^{0-1}(y, f(x)) = \begin{cases} 1 & \text{if } f(x)y \le 0; \\ 0 & \text{if } f(x)y > 0. \end{cases}$$

The function that minimizes the expected risk under 0-1 loss is called the Bayes classifier, defined by

$$f^*_P(x) := \operatorname{sgn}(\mathbb{E}[y \mid x]).$$

The goal of the classification task is to find a good hypothesis $f$ with small excess risk $R^{0-1}_P(f) - R^{0-1}_P(f^*_P)$. To find a good hypothesis based on the samples, one minimizes the empirical risk. However, using 0-1 loss, it is hard to find the global minimizer of the empirical risk because the loss function is discontinuous and non-convex. A popular surrogate loss function in practice is the hinge loss, $\ell^h(y, f(x)) = \max(0, 1 - y f(x))$, which guarantees that

$$R^h_P(f) - \inf_f R^h_P(f) \ge R^{0-1}_P(f) - R^{0-1}_P(f^*_P),$$

where $R^h$ means $R^{\ell^h}$ and $R^{0-1}$ means $R^{\ell^{0-1}}$. See [Steinwart and Christmann(2008)] for more details.
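To make the two losses concrete, here is a small sketch, assuming NumPy, that evaluates the empirical 0-1 and hinge risks of a hypothesis on a handful of hand-picked points. The data values are invented for illustration. The check at the end is only the elementary pointwise fact that the hinge loss dominates the 0-1 loss, which is weaker than the excess-risk inequality displayed above.

```python
import numpy as np

def zero_one_loss(y, fx):
    """1 where y * f(x) <= 0 (a mistake or a zero score), 0 otherwise."""
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    """max(0, 1 - y * f(x)), the convex surrogate used by SVMs."""
    return np.maximum(0.0, 1.0 - y * fx)

def empirical_risk(loss, y, fx):
    """Average of the loss over the sample."""
    return float(loss(y, fx).mean())

# Hand-picked labels and hypothesis values f(x_i), for illustration only.
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
fx = np.array([2.0, -0.5, -0.3, 0.0, 0.4])

risk_01 = empirical_risk(zero_one_loss, y, fx)
risk_hinge = empirical_risk(hinge_loss, y, fx)
print(risk_01, risk_hinge)

# Pointwise, the hinge loss is never smaller than the 0-1 loss.
dominates = bool(np.all(hinge_loss(y, fx) >= zero_one_loss(y, fx)))
```

Note that a point with $f(x) = 0$ counts as an error under the 0-1 loss as defined above, and that correctly classified points with margin below 1 still incur a positive hinge loss.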

A regularizer can be added to the optimization objective with a scalar multiplier $\lambda$ to avoid overfitting the random samples. Throughout this paper, we consider the most commonly used $\ell^2$ regularization. Therefore, the solution of the binary classification problem is given by minimizing the following objective

$$R_{m,\lambda}(f) = R^h_m(f) + \frac{\lambda}{2}\|f\|^2_F,$$

over a hypothesis class $(F, \|\cdot\|_F)$. When $F$ is the RKHS of some kernel function, the algorithm described above is called kernel support vector machine. Note that for technical convenience, we do not include the bias term in the formulation of the hypothesis, so that all these functions are from the RKHS instead of the product space of the RKHS and $\mathbb{R}$ (see Chapter 1 of [Steinwart and Christmann(2008)] for more explanation of such a convention). Note that $R_{m,\lambda}$ is strongly convex and thus the infimum will be attained by some function in $F$. We denote it by $f_{m,\lambda}$.

When random features and the corresponding RKHS are considered, we add $N$ to the subscripts of the notations defined above to indicate the number of random features: for example, $F_N$ for the RKHS and $f_{N,m,\lambda}$ for the solution of the optimization problem.
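With explicit random features, the regularized objective becomes a finite-dimensional problem in a weight vector $w$, with $f(x) = w \cdot \phi_N(x)$ and no bias term. The sketch below, assuming NumPy, minimizes it by full-batch subgradient descent with the Pegasos-style step size $1/(\lambda t)$. This solver is a stand-in chosen for brevity, not the one used in the paper's experiments, and the two-blob data set and all parameter values are invented for illustration.

```python
import numpy as np

def rfsvm_objective(w, Z, y, lam):
    """Empirical hinge risk plus (lam / 2) * ||w||^2, with f(x) = w . phi_N(x)."""
    hinge = np.maximum(0.0, 1.0 - y * (Z @ w))
    return float(hinge.mean() + 0.5 * lam * (w @ w))

def train_rfsvm(Z, y, lam, n_iters=500):
    """Full-batch subgradient descent; returns the best iterate seen."""
    m, dim = Z.shape
    w = np.zeros(dim)
    best_w, best_obj = w.copy(), rfsvm_objective(w, Z, y, lam)
    for t in range(1, n_iters + 1):
        active = y * (Z @ w) < 1.0  # points with nonzero hinge subgradient
        subgrad = lam * w - (y[active][:, None] * Z[active]).sum(axis=0) / m
        w = w - subgrad / (lam * t)
        obj = rfsvm_objective(w, Z, y, lam)
        if obj < best_obj:
            best_w, best_obj = w.copy(), obj
    return best_w, best_obj

# Toy data (an assumption for illustration): two Gaussian blobs at (+2, 0) and (-2, 0).
rng = np.random.default_rng(0)
m, N, gamma, lam = 400, 200, 1.0, 0.1
y = rng.choice([-1.0, 1.0], size=m)
X = rng.normal(0.0, 0.5, size=(m, 2)) + 2.0 * y[:, None] * np.array([1.0, 0.0])

# Random Fourier features for the Gaussian kernel with bandwidth gamma.
omegas = rng.normal(0.0, 1.0 / gamma, size=(N, 2))
proj = X @ omegas.T
Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(N)

w_hat, obj_hat = train_rfsvm(Z, y, lam)
train_acc = float(np.mean(np.sign(Z @ w_hat) == y))
print(f"objective: {obj_hat:.3f}, training accuracy: {train_acc:.3f}")
```

Training cost per iteration is linear in the number of samples and in the number of features, which is the computational advantage over forming a full kernel matrix.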

## 3 Main Results

In this section we state our main results on the fast learning rates of RFSVM in different scenarios.

First, we need the following assumption on the distribution of data, which is required for all the results in this paper.

###### Assumption 1.

There exists $V \ge 2$ such that

$$\left|\mathbb{E}_{(x,y)\sim P}[y \mid x]\right| \ge 2/V.$$

This assumption is called Massart’s low noise condition in many references (see for example [Koltchinskii et al.(2011)Koltchinskii, service), and d’Été de Probabilités de Saint-Flour]). When $V = 2$, all the data points have deterministic labels almost surely, which makes it easier to learn the true classifier from observations. In the proof, Massart’s low noise condition guarantees the variance condition ([Steinwart and Christmann(2008)])

$$\mathbb{E}\left[\left(\ell^h(f(x)) - \ell^h(f^*_P(x))\right)^2\right] \le V\left(R^h(f) - R^h(f^*_P)\right), \tag{1}$$

which is a common requirement for the fast rate results. Massart’s condition is an extreme case of a more general low noise condition, called Tsybakov’s condition. For the simplicity of the theorem, we only consider Massart’s condition in our work, but our main results can be generalized to Tsybakov’s condition.

The second assumption is about the quality of random features. It was first introduced in [Bach(2017)]’s approximation results.

###### Assumption 2.

A feature map $\phi$ is called optimized if there exists a small constant $\mu_0$ such that for any $\mu \le \mu_0$,

$$\sup_{\omega \in \Omega} \left\|(\Sigma + \mu I)^{-1/2} \phi(\omega; x)\right\|^2_{L^2(P)} \le \operatorname{tr}\left(\Sigma(\Sigma + \mu I)^{-1}\right) = \sum_{i=1}^{\infty} \frac{\lambda_i(\Sigma)}{\lambda_i(\Sigma) + \mu}.$$

For any given $\mu$, the quantity on the left hand side of the inequality is called the leverage score with respect to $\mu$, which is directly related to the number of features required to approximate a function in the RKHS of $\phi$. The quantity on the right hand side is called degrees of freedom by [Bach(2017)] and effective dimension by [Rudi and Rosasco(2017)], denoted by $d(\mu)$. Note that whatever the RKHS is, we can always construct an optimized feature map for it. In Appendix A we describe two examples of constructing an optimized feature map. When a feature map is optimized, it is easy to control its leverage score by the decay rate of the spectrum of $\Sigma$, as described below.

###### Definition 1.

We say that the spectrum of $\Sigma$ decays at a polynomial rate if there exist $c_1 > 0$ and $c_2 > 1$ such that

$$\lambda_i(\Sigma) \le c_1 i^{-c_2}.$$

We say that it decays sub-exponentially if there exist $c_3, c_4 > 0$ such that

$$\lambda_i(\Sigma) \le c_3 \exp(-c_4 i^{1/d}).$$

The decay rate of the spectrum of $\Sigma$ characterizes the capacity of the hypothesis space to search for the solution, which further determines the number of random features required in the learning process. Indeed, when the feature map is optimized, the number of features required to approximate a function in the RKHS with accuracy $O(\sqrt{\mu})$ is upper bounded by $O(d(\mu)\ln(d(\mu)))$. When the spectrum decays polynomially, the degrees of freedom $d(\mu)$ is $O(\mu^{-1/c_2})$, and when it decays sub-exponentially, $d(\mu)$ is $O(\ln^d(1/\mu))$ (see Lemma 6 in Appendix C for details). Examples of kernels with polynomial and sub-exponential spectrum decays can be found in [Bach(2017)]. Our proof of Lemma 8 also provides some useful discussion.
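The dependence of the degrees of freedom $\sum_i \lambda_i/(\lambda_i + \mu)$ on the spectrum can be illustrated numerically. The sketch below, assuming NumPy, uses two synthetic spectra chosen for illustration rather than computed from any kernel: $\lambda_i = i^{-2}$ (polynomial decay with $c_1 = 1$, $c_2 = 2$) and $\lambda_i = e^{-i}$ (sub-exponential decay with $c_3 = c_4 = 1$ and $d = 1$). For these choices the sum grows like $\mu^{-1/2}$ and like $\ln(1/\mu)$ respectively.

```python
import numpy as np

def degrees_of_freedom(eigenvalues, mu):
    """d(mu) = sum_i lambda_i / (lambda_i + mu)."""
    return float(np.sum(eigenvalues / (eigenvalues + mu)))

# Synthetic spectra (assumptions for illustration), truncated to finitely many terms.
idx = np.arange(1, 200001, dtype=float)
poly_spectrum = idx ** -2.0             # lambda_i = i^{-c2} with c2 = 2
subexp_spectrum = np.exp(-idx[:1000])   # lambda_i = exp(-i), i.e. d = 1

results = {}
for mu in (1e-2, 1e-4, 1e-6):
    d_poly = degrees_of_freedom(poly_spectrum, mu)
    d_subexp = degrees_of_freedom(subexp_spectrum, mu)
    results[mu] = (d_poly, d_subexp)
    print(f"mu={mu:.0e}  polynomial: {d_poly:9.2f} (mu^-1/2 = {mu ** -0.5:8.1f})"
          f"  sub-exponential: {d_subexp:6.2f} (ln(1/mu) = {np.log(1.0 / mu):5.2f})")
```

Shrinking $\mu$ by a factor of 100 multiplies the polynomial-decay value by roughly 10, but only adds about $\ln(100) \approx 4.6$ in the sub-exponential case, which is why far fewer features are needed there.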

With these preparations, we can state our first theorem now.

###### Theorem 1.

Assume that $P$ satisfies Assumption 1, and the feature map $\phi$ satisfies Assumption 2. If $f^*_P \in F_\phi$ with $\|f^*_P\|_{F_\phi} \le R$, then when the spectrum of $\Sigma_\phi$ decays polynomially, by choosing

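$$\lambda = m^{-\frac{c_2}{2+c_2}}, \qquad N = 10\, C_{c_1,c_2}\, m^{\frac{2}{2+c_2}} \left(\ln\left(32\, C_{c_1,c_2}\, m^{\frac{2}{2+c_2}}\right) + \ln(1/\delta)\right),$$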

we have

$$R^{0-1}_P(f_{N,m,\lambda}) - R^{0-1}_P(f^*_P) \le C_{c_1,c_2,V,R}\, m^{-\frac{c_2}{2+c_2}} \left(\ln(1/\delta) + \ln(m)\right),$$

with probability . When the spectrum of decays sub-exponentially, by choosing

$$\lambda = 1/m, \qquad N = 25\, C_{d,c_4} \ln^d(m) \left(\ln\left(80\, C_{d,c_4} \ln^d(m)\right) + \ln(1/\delta)\right),$$

we have

$$R^{0-1}_P(f_{N,m,\lambda}) - R^{0-1}_P(f^*_P) \le C_{c_3,c_4,d,R,V}\, \frac{1}{m} \left(\log^{d+2}(m) + \log(1/\delta)\right),$$

with probability when .

This theorem characterizes the learning rate of RFSVM in realizable cases; that is, when the Bayes classifier belongs to the RKHS of the feature map. For a polynomially decaying spectrum, when $c_2 > 2$, we get a learning rate faster than $1/\sqrt{m}$. [Rudi and Rosasco(2017)] obtained a similar fast learning rate for kernel ridge regression with random features (RFKRR), assuming polynomial decay of the spectrum of $\Sigma_\phi$ and the existence of a minimizer of the risk in $F_\phi$. Our theorem extends their result to classification problems and to sub-exponentially decaying spectra. However, we have to use the stronger assumption that $f^*_P \in F_\phi$ so that the low noise condition can be applied to derive the variance condition. For RFKRR, a rate faster than $1/\sqrt{m}$ will be achieved whenever $c_2 > 1$, and the number of features required is only the square root of our result. We think that this is mainly caused by the fact that their surrogate loss is squared. The result for the sub-exponentially decaying spectrum is not investigated for RFKRR, so we cannot make a comparison. We believe that this is the first result showing that RFSVM can achieve $\tilde{O}(1/m)$ with only $\tilde{O}(\ln^d(m))$ features. Note however that when $d$ is large, the sub-exponential case requires a large number of samples, possibly even larger than the polynomial case. This is clearly an artifact of our analysis, since we can always use the polynomial case to provide an upper bound. We therefore suspect that there is considerable room for improving our analysis of high dimensional data in the sub-exponential decay case. In particular, removing the exponential dependence on $d$ under reasonable assumptions is an interesting direction for future work.
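As a rough illustration of the polynomial-decay case of Theorem 1, consider $c_2 = 4$ and $m = 10^6$. The unspecified constants and the logarithmic factors are ignored here, which is an assumption made only to show the orders of magnitude:

$$\lambda = m^{-\frac{c_2}{2+c_2}} = (10^6)^{-2/3} = 10^{-4}, \qquad N \sim m^{\frac{2}{2+c_2}} = (10^6)^{1/3} = 10^{2}, \qquad m^{-\frac{c_2}{2+c_2}} = 10^{-4} < 10^{-3} = \frac{1}{\sqrt{m}}.$$

Under these simplifications, on the order of a hundred features would accompany an excess risk bound about ten times smaller than the $1/\sqrt{m}$ rate, and the gap widens as $c_2$ grows.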

To remove the realizability assumption, we provide our second theorem, on the learning rate of RFSVM in the unrealizable case. We focus on the random features corresponding to the Gaussian kernel as introduced in Section 2. When the Bayes classifier does not belong to the RKHS, we need an approximation theorem to estimate the gap of risks. The approximation property of the RKHS of the Gaussian kernel has been studied in [Steinwart and Christmann(2008)], where the margin noise exponent is defined to derive the risk gap. Here we introduce the simpler and stronger separation condition, which leads to a strong result.

The points in $X$ can be divided into two sets according to their labels as follows,

$$X_1 := \{x \in X \mid \mathbb{E}(y \mid x) > 0\}, \qquad X_{-1} := \{x \in X \mid \mathbb{E}(y \mid x) < 0\}.$$

The distance of a point $x \in X_i$ to the set $X_{-i}$ is denoted by $\Delta(x)$.

###### Assumption 3.

We say that the data distribution satisfies a separation condition if there exists $\tau > 0$ such that $P_X(\Delta(x) < \tau) = 0$.

Intuitively, Assumption 3 requires the two classes to be far apart from each other almost surely. This separation assumption is an extreme case when the margin noise exponent goes to infinity.

The separation condition characterizes a different aspect of data distribution from Massart’s low noise condition. Massart’s low noise condition guarantees that the random samples represent the distribution behind them accurately, while the separation condition guarantees the existence of a smooth, in the sense of small derivatives, function achieving the same risk with the Bayes classifier.

With both assumptions imposed on $P$, we can get a fast learning rate of $\ln^{2d+1}(m)/m$ with only $\ln^{2d}(m)$ random features, as stated in the following theorem.

###### Theorem 2.

Assume that $X$ is bounded by radius $\rho$. The data distribution has a density function upper bounded by a constant $B$, and satisfies Assumptions 1 and 3. Then by choosing

$$\lambda = 1/m, \qquad \gamma = \tau/\sqrt{\ln m}, \qquad N = C_{\tau,d,\rho} \ln^{2d}(m) \left(\ln\ln m + \ln(1/\delta)\right),$$

the RFSVM using an optimized feature map corresponding to the Gaussian kernel with bandwidth $\gamma$ achieves the learning rate

$$R^{0-1}_P(f_{N,m,\lambda}) - R^{0-1}_P(f^*_P) \le C_{\tau,V,d,\rho,B}\, \frac{\ln^{2d+1}(m) \left(\ln\ln(m) + \ln(1/\delta)\right)}{m},$$

with probability greater than for , where depends on .

To the best of our knowledge, this is the first theorem on the fast learning rate of random features method in the unrealizable case. It only assumes that the data distribution satisfies low noise and separation conditions, and shows that with an optimized feature distribution, the learning rate of can be achieved using only features. This justifies the benefit of using RFSVM in binary classification problems. The assumption of a bounded data set and a bounded distribution density function can be dropped if we assume that the probability density function is upper bounded by , which suffices to provide the sub-exponential decay of spectrum of . But we prefer the simpler form of the results under current conditions. We speculate that the conclusion of Theorem 2 can be generalized to all sub-Gaussian data.

The main drawback of our two theorems is the assumption of an optimized feature distribution, which is hard to obtain in practice. Developing a data-dependent feature selection method is therefore an important problem for future work on RFSVM. [Bach(2017)] proposed an algorithm to approximate the optimized feature map from any feature map. Adapted to our setup, the reweighted feature selection algorithm is described as follows.

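1. Select $M$ i.i.d. random vectors $\{\omega_i\}_{i=1}^{M}$ according to the distribution $\nu$.

2. Select $L$ data points $\{x_j\}_{j=1}^{L}$ uniformly from the training set.

3. Generate the matrix $\Phi$ with columns $\phi_M(x_j)/\sqrt{L}$.

4. Compute $\{r_i\}_{i=1}^{M}$, the diagonal of $\Phi\Phi^{\intercal}(\Phi\Phi^{\intercal} + \mu I)^{-1}$.

5. Resample $N$ features from $\{\omega_i\}_{i=1}^{M}$ according to the probability distribution $p_i = r_i / \sum_{j} r_j$.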

The theoretical guarantees of this algorithm have not been discussed in the literature. A result in this direction would be extremely useful for guiding practitioners. However, it is outside the scope of our work. Instead, we implement it in our experiments and empirically compare the performance of RFSVM using this reweighted feature selection method to the performance of RFSVM without this preprocessing step; see Section 4.
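One possible reading of the reweighted feature selection steps is sketched below, assuming NumPy. Several choices here are assumptions made for illustration and are not confirmed by the text: the candidate features use the single-cosine form $\sqrt{2}\cos(\omega \cdot x + b)$ with a uniform random offset $b$ (so that each candidate corresponds to one row of the matrix), the scores are taken as the diagonal of $\Phi\Phi^{\intercal}(\Phi\Phi^{\intercal} + \mu I)^{-1}$, and resampling is done with replacement. The function and parameter names are hypothetical and do not come from the paper's repository.

```python
import numpy as np

def cosine_features(X, omegas, offsets):
    """Rows are phi_M(x) with entries sqrt(2 / M) * cos(omega_i . x + b_i)."""
    n_candidates = omegas.shape[0]
    return np.sqrt(2.0 / n_candidates) * np.cos(X @ omegas.T + offsets)

def reweighted_feature_selection(X, n_features, n_candidates, n_subsample,
                                 mu, gamma, rng):
    d = X.shape[1]
    # Step 1: draw candidate frequencies (and offsets) from the base distribution.
    omegas = rng.normal(0.0, 1.0 / gamma, size=(n_candidates, d))
    offsets = rng.uniform(0.0, 2.0 * np.pi, size=n_candidates)
    # Step 2: subsample data points uniformly from the training set.
    rows = rng.choice(X.shape[0], size=n_subsample, replace=False)
    # Step 3: matrix whose columns are phi_M(x_j) / sqrt(L); shape (M, L).
    Phi = cosine_features(X[rows], omegas, offsets).T / np.sqrt(n_subsample)
    # Step 4: scores are the diagonal of G (G + mu I)^{-1} with G = Phi Phi^T.
    G = Phi @ Phi.T
    scores = np.diag(np.linalg.solve(G + mu * np.eye(n_candidates), G))
    scores = np.clip(scores, 0.0, None)  # guard against tiny negative round-off
    # Step 5: resample features with probability proportional to the scores.
    probs = scores / scores.sum()
    chosen = rng.choice(n_candidates, size=n_features, replace=True, p=probs)
    return omegas[chosen], offsets[chosen], scores, probs

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(500, 2))  # placeholder data
omegas_sel, offsets_sel, scores, probs = reweighted_feature_selection(
    X_train, n_features=20, n_candidates=200, n_subsample=100,
    mu=1e-3, gamma=1.0, rng=rng)
print(omegas_sel.shape, float(scores.sum()))
```

The procedure never looks at the labels, and its cost is dominated by the linear solve on a matrix whose side is the number of candidate features.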

For the realizable case, if we drop the assumption of optimized feature map, only weak results can be obtained for the learning rate and the number of features required (see E for more details). In particular, we can only show that random features are sufficient to guarantee the learning rate less than when samples are available. Though not helpful for justifying the computational benefit of random features method, this result matches the parallel result for RFKRR in [Rudi and Rosasco(2017)] and the approximation result in [Sriperumbudur and Szabo(2015)]. We conjecture that this upper bound is also optimal for RFSVM.

[Rudi and Rosasco(2017)] also compared the performance of RFKRR with the Nyström method, which is the other popular method to scale kernel ridge regression to large data sets. We do not find any theoretical guarantees on the fast learning rate of SVM with the Nyström method on classification problems in the literature, though there are several works on its approximation quality to the accurate model and its empirical performance (see [Yang et al.(2012)Yang, Li, Mahdavi, Jin, and Zhou, Zhang et al.(2012)Zhang, Lan, Wang, and Moerchen]). The tools used in this paper should also work for learning rate analysis of SVM using the Nyström method. We leave this analysis to the future.

## 4 Experimental Results

In this section we evaluate the performance of RFSVM with the reweighted feature selection algorithm (the source code is available at https://github.com/syitong/randfourier). The sample points shown in Figure 4 are generated from either the inner circle or the outer annulus uniformly with equal probability, where the radius of the inner circle is 0.9, and the radius of the outer annulus ranges from 1.1 to 2. The points from the inner circle are labeled by -1 with probability 0.9, while the points from the outer annulus are labeled by 1 with probability 0.9. In such a simple case, the unit circle describes the Bayes classifier.
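A data set of this shape can be generated as in the following sketch, assuming NumPy. Two details are assumptions of this sketch rather than statements from the text: "uniformly" is interpreted as uniform in area within each region, and the function name and default arguments are hypothetical. The final lines evaluate the classifier given by the unit circle, whose error should be close to the label noise level of 0.1.

```python
import numpy as np

def sample_disk_annulus(m, rng, r_inner=0.9, r_low=1.1, r_high=2.0, flip=0.1):
    """Draw m labeled points from an inner disk (label -1) or outer annulus (label +1)."""
    from_inner = rng.random(m) < 0.5  # each region is chosen with probability 1/2
    u = rng.random(m)
    # Radii chosen so that points are uniform in area within each region (assumption).
    r_disk = r_inner * np.sqrt(u)
    r_ring = np.sqrt(r_low ** 2 + u * (r_high ** 2 - r_low ** 2))
    radius = np.where(from_inner, r_disk, r_ring)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=m)
    X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    clean = np.where(from_inner, -1, 1)
    flipped = rng.random(m) < flip  # each label is flipped with probability 0.1
    y = np.where(flipped, -clean, clean)
    return X, y

rng = np.random.default_rng(0)
X, y = sample_disk_annulus(20000, rng)

# The unit circle as a classifier: predict -1 inside, +1 outside.
radii = np.linalg.norm(X, axis=1)
bayes_pred = np.where(radii < 1.0, -1, 1)
bayes_err = float(np.mean(bayes_pred != y))
print(f"error of the unit-circle classifier: {bayes_err:.3f}")
```

Subtracting this baseline error from a learned classifier's test error gives an empirical estimate of its excess risk.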

First, we compared the performance of RFSVM with that of KSVM on the training set with samples, over a large range of regularization parameter (). The bandwidth parameter is fixed to be an estimate of the average distance among the training samples. After training, models are tested on a large testing set (). For RFSVM, we considered the effect of the number of features by setting to be and , respectively. Moreover, both feature selection methods, simple random feature selection (labeled by ‘unif’ in the figures), which does not apply any preprocess on drawing features, and reweighted feature selection (labeled by ‘opt’ in the figures) are inspected. For the reweighted method, we set and

to compute the weight of each feature. Every RFSVM is run 10 times, and the average accuracy and standard deviation are presented.

The results of KSVM, RFSVMs with 1 and 20 features are shown in Figure 2 and Figure 2 respectively (see the results of other levels of features in Appendix F in the supplementary material). The performance of RFSVM is slightly worse than the KSVM, but improves as the number of features increases. It also performs better when the reweighted method is applied to generate features.

To further compare the performance of simple feature selection and reweighted feature selection methods, we plot the learning rate of RFSVM with features and the best s for each sample size . KSVM is not included here since it is too slow on training sets of size larger than in our experiment compared to RFSVM. The error rate in Figure 4 is the excess risk between learned classifiers and the Bayes classifier. We can see that the excess risk decays as increases, and the RFSVM using reweighted feature selection method outperforms the simple feature selection.

According to Theorem 2, the benefit brought by optimized feature map, that is, the fast learning rate, will show up when the sample size is greater than (see Appendix D). The number of random features required also depends on , the dimension of data. For data of small dimension and large sample size, as in our experiment, it is not a problem. However, in applications of image recognition, the dimension of the data is usually very large and it is hard for our theorem to explain the performance of RFSVM. On the other hand, if we do not pursue the fast learning rate, the analysis for general feature maps, not necessarily optimized, gives a learning rate of with random features, which does not depend on the dimension of data (see Appendix E). Actually, for high dimensional data, there is barely any improvement in the performance of RFSVM by using reweighted feature selection method (see Appendix F). It is important to understand the role of to fully understand the power of random features method.

## 5 Conclusion

Our study proves that the fast learning rate is possible for RFSVM in both realizable and unrealizable scenarios when the optimized feature map is available. In particular, the number of features required is far less than the sample size, which implies considerably faster training and testing using the random features method. Moreover, we show in the experiments that even though we can only approximate the optimized feature distribution using the reweighted feature selection method, it, indeed, has better performance than the simple random feature selection. Considering that such a reweighted method does not rely on the label distribution at all, it will be useful in learning scenarios where multiple classification problems share the same features but differ in the class labels. We believe that a theoretical guarantee of the performance of the reweighted feature selection method and properly understanding the dependence on the dimensionality of data are interesting directions for future work.

#### Acknowledgements

AT acknowledges the support of a Sloan Research Fellowship.

ACG acknowledges the support of a Simons Foundation Fellowship.

## References

• [Bach(2017)] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
• [Cortes et al.(2010)Cortes, Mohri, and Talwalkar] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. Journal of Machine Learning Research, 9:113–120, 2010. ISSN 1532-4435.
• [Cucker and Smale(2002)] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2002.
• [Dai et al.(2014)Dai, Xie, He, Liang, Raj, Balcan, and Song] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3041–3049. Curran Associates, Inc., 2014.
• [Eric et al.(2008)Eric, Bach, and Harchaoui] Moulines Eric, Francis R Bach, and Zaïd Harchaoui. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, pages 609–616, 2008.
• [Hsieh et al.(2008)Hsieh, Chang, Lin, Keerthi, and Sundararajan] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 408–415, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4.
• [Huang et al.(2014)Huang, Avron, Sainath, Sindhwani, and Ramabhadran] P. S. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 205–209, May 2014.
• [Koltchinskii et al.(2011)Koltchinskii, service), and d’Été de Probabilités de Saint-Flour] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d’Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics, volume 2033. Springer-Verlag, Berlin, Heidelberg, 2011. ISSN 0075-8434.
• [Lax(2002)] P.D. Lax. Functional analysis. Pure and applied mathematics. Wiley, 2002. ISBN 9780471556046.
• [Rahimi and Recht(2008)] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
• [Rahimi and Recht(2009)] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1313–1320. Curran Associates, Inc., 2009.
• [Rudi and Rosasco(2017)] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218–3228, 2017.
• [Scovel et al.(2010)Scovel, Hush, Steinwart, and Theiler] Clint Scovel, Don Hush, Ingo Steinwart, and James Theiler. Radial kernels and their reproducing kernel Hilbert spaces. Journal of Complexity, 26(6):641–660, 2010.
• [Shalev-Shwartz et al.(2011)Shalev-Shwartz, Singer, Srebro, and Cotter] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011. ISSN 1436-4646.
• [Sriperumbudur and Szabo(2015)] Bharath Sriperumbudur and Zoltan Szabo. Optimal rates for random Fourier features. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc., 2015.
• [Steinwart and Christmann(2008)] I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008. ISBN 9780387772424.
• [Sutherland and Schneider(2015)] Dougal J. Sutherland and Jeff G. Schneider. On the error of random Fourier features. CoRR, abs/1506.02785, 2015.
• [Widom(1963)] Harold Widom. Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109(2):278–295, 1963. ISSN 00029947.
• [Yang et al.(2012)Yang, Li, Mahdavi, Jin, and Zhou] Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 476–484. Curran Associates, Inc., 2012.
• [Zhang et al.(2012)Zhang, Lan, Wang, and Moerchen] Kai Zhang, Liang Lan, Zhuang Wang, and Fabian Moerchen. Scaling up kernel SVM on limited resources: A low-rank linearization approach. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1425–1434, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.

## Appendix A Examples of Optimized Feature Maps

Assume that a feature map $\phi$ satisfies that $\phi(x;\omega)$ is bounded for all $x$ and $\omega$. We can always convert it to an optimized feature map using the method proposed by [Bach(2017)]. We rephrase it using our notation as follows.

Define

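$$p(\omega)=\frac{\big\|(\Sigma+\mu I)^{-1/2}\phi(\cdot;\omega)\big\|^2_{L^2(\mathcal{X},P)}}{\int_\Omega\big\|(\Sigma+\mu I)^{-1/2}\phi(\cdot;\omega)\big\|^2_{L^2(\mathcal{X},P)}\,\mathrm{d}\nu(\omega)}. \tag{2}$$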

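Since $\phi$ is bounded, its $L^2(\mathcal{X},P)$ norm is finite. The function $p$ defined above is a probability density function with respect to $\nu$, and it defines a new probability measure $p\,\mathrm{d}\nu$. Then the new feature map is given by $\tilde\phi(x;\omega)=\phi(x;\omega)/\sqrt{p(\omega)}$ together with the measure $p\,\mathrm{d}\nu$. With $\tilde\phi$, we have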

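$$\begin{aligned}
\sup_{\omega\in\Omega}\big\|(\Sigma+\mu I)^{-1/2}\tilde\phi(\cdot;\omega)\big\|^2
&=\sup_{\omega\in\Omega}\frac{\big\|(\Sigma+\mu I)^{-1/2}\phi(\cdot;\omega)\big\|^2}{p(\omega)} && (3)\\
&=\int_\Omega\big\|(\Sigma+\mu I)^{-1/2}\phi(\cdot;\omega)\big\|^2_{L^2(\mathcal{X},P)}\,\mathrm{d}\nu(\omega) && (4)\\
&=\operatorname{tr}\big(\Sigma(\Sigma+\mu I)^{-1}\big). && (5)
\end{aligned}$$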
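The reweighting in Equations (2) to (5) can be checked numerically on a finite sample. The following is a minimal sketch, not the authors' implementation: it assumes a matrix `Phi` of feature evaluations $\phi(x_a;\omega_j)$ on $m$ samples and $M$ draws of $\omega$ from $\nu$, replaces $\Sigma$ by its empirical counterpart, and replaces the integral over $\nu$ by an average over the $M$ draws. The cosine features, the sample sizes, and the value of `mu` are illustrative choices.

```python
import numpy as np

def optimized_density(Phi, mu):
    """Empirical analogue of Eq. (2).

    Phi[a, j] = phi(x_a; omega_j) for m samples and M features drawn from nu.
    Returns the density p (relative to the uniform weights 1/M), the
    unnormalised scores, and the empirical tr(Sigma (Sigma + mu I)^{-1}).
    """
    m, M = Phi.shape
    # Empirical integral operator on L2(X, P_m): kernel matrix scaled by 1/m.
    Sigma = Phi @ Phi.T / (M * m)
    reg = Sigma + mu * np.eye(m)
    # scores[j] = ||(Sigma + mu I)^{-1/2} phi(.; omega_j)||^2 in the empirical L2 norm.
    scores = np.sum(Phi * np.linalg.solve(reg, Phi), axis=0) / m
    # Normalise by the Monte Carlo estimate of the integral over nu.
    p = scores / scores.mean()
    dof = np.trace(Sigma @ np.linalg.inv(reg))
    return p, scores, dof

# Illustrative data (assumption): random cosine features on Gaussian inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
W = rng.normal(size=(2, 200))
b = rng.uniform(0.0, 2.0 * np.pi, size=200)
Phi = np.sqrt(2.0) * np.cos(X @ W + b)

p, scores, dof = optimized_density(Phi, mu=1e-2)
# For the reweighted map phi / sqrt(p), every feature has the same score (Eq. 3 to 5).
reweighted_scores = scores / p
```

In this discretized setting `reweighted_scores` is constant across features and equals `dof`, which mirrors the chain of equalities (3) to (5).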

When the feature map is constructed mapping into as described in Section 2, it is optimized. Indeed, we can compute

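$$\begin{aligned}
\sup_{\omega\in\mathcal{X}}\big\|(\Sigma+\mu I)^{-1/2}\phi(\cdot;\omega)\big\|^2
&=\sup_{\omega\in\mathcal{X}}\Big\|\sum_{i=1}^{\infty}\frac{\sqrt{\lambda_i}}{\sqrt{\lambda_i+\mu}}\,e_i(\cdot)\Big\|^2 && (6)\\
&=\sum_{i=1}^{\infty}\frac{\lambda_i}{\lambda_i+\mu}. && (7)
\end{aligned}$$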

As an example for this type of feature map, we can consider $\{e_i\}$ to be the Walsh system, which is an orthonormal basis for $L^2([0,1])$. Any Bayes classifier that has finitely many discontinuities, all located at dyadic points (namely, points expressible with finitely many bits), is a finite linear combination of Walsh basis functions. This guarantees that the assumptions in Theorem 1 can be satisfied. Our first experiment also makes use of this construction.
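The claim about dyadic discontinuities can be illustrated on a grid. The sketch below is an illustration rather than the construction used in the experiments: it samples the first $2^k$ Walsh functions on the $2^k$ dyadic cells of $[0,1)$ using the Sylvester Hadamard matrix (Walsh functions in Hadamard ordering), checks orthonormality, and expands a hypothetical classifier with jumps at $1/4$ and $5/8$.

```python
import numpy as np

def walsh_matrix(k):
    """Rows are the first 2**k Walsh functions (Hadamard ordering),
    sampled on the 2**k dyadic cells of [0, 1)."""
    H = np.array([[1.0]])
    block = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(k):
        H = np.kron(block, H)  # Sylvester construction [[H, H], [H, -H]]
    return H

k = 3
n = 2 ** k
H = walsh_matrix(k)

# Orthonormality in L2([0, 1]): each cell has length 1/n.
gram = H @ H.T / n

# Hypothetical Bayes classifier: +1 on (1/4, 5/8), -1 elsewhere (dyadic jumps only).
cell_midpoints = (np.arange(n) + 0.5) / n
f = np.where((cell_midpoints > 0.25) & (cell_midpoints < 0.625), 1.0, -1.0)

coef = H @ f / n      # L2 inner products <f, e_i>
f_rec = H.T @ coef    # finite Walsh expansion with 2**k terms
```

Because the classifier is constant on each dyadic cell of length $2^{-k}$, the $2^k$ coefficients in `coef` reproduce it exactly.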

The construction above is inspired by the use of the spline kernel in [Rudi and Rosasco(2017)]. However, our situation is more complicated: the target function, the Bayes classifier, is discontinuous, while the functions in the RKHS generated by the spline kernel must be continuous ([Cucker and Smale(2002)]). Though we can construct Bayes classifiers using the Walsh basis, we have yet to understand the variety of possible Bayes classifiers in such a space.

## Appendix B Local Rademacher Complexity of RFSVM

Before the proofs, we briefly summarize the role of each lemma and theorem. Theorems 3 and 4 are two fundamental external results for our proof. Lemmas 7 and 8 refine results that appeared in previous works, so that we can apply them to our case. Lemmas 3, 4 and 5 are the key results to establish the fast rate for RFSVM, parallel to Steinwart’s work for KSVM. All other smaller and simpler lemmas included in the appendices are for the purposes of clarity and completeness. The proofs are not hard but quite technical.

First, both of our theorems are consequences of the following fundamental theorem.

###### Theorem 3.

(Theorem 7.20 in [Steinwart and Christmann(2008)]) For a RKHS , denote by . For , consider the following function classes

$$\mathcal{F}_r:=\{f\in\mathcal{F}\mid R^1_{P,\lambda}(f)-R^*\le r\}$$

and

$$\mathcal{H}_r:=\{\ell^1\circ f-\ell^1\circ f^*_P\mid f\in\mathcal{F}_r\}.$$

Assume that there exists such that for any ,

$$\mathbb{E}_P\big(\ell^1\circ f-\ell^1\circ f^*_P\big)^2\le V\big(R^1_P(f)-R^*\big).$$

If there is a function such that and for all , then, for any , with , and

$$r>\max\Big\{30\varphi_m(r),\ \frac{72V\ln(1/\delta)}{m},\ \frac{5B_0\ln(1/\delta)}{m},\ r^*\Big\},$$

we have

$$R^1_{P,\lambda}(f_{m,N,\lambda})-R^*\le 6\big(R^h_{P,\lambda}(f_0)-R^*\big)+3r$$

with probability greater than .

To establish the fast rate of RFSVM using the theorem above, we must understand the local Rademacher complexity of RFSVM: that is, find a formula for $\varphi_m(r)$. The quantities $f_0$ and $B_0$ are only related to the approximation error, and we leave the discussion of them to the next sections. The variance condition Equation 1 is satisfied under Assumption 1. With this variance condition, we can upper bound the Rademacher complexity of RFSVM in terms of the number of features and the regularization parameter. It is particularly important to have $1/\lambda$ inside the logarithm function.

First, we will need the summation version of Dudley’s inequality, which uses the entropy numbers defined below instead of covering numbers.

###### Definition 2.

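For a semi-normed space $(E,\|\cdot\|)$, we define its $n$-th (dyadic) entropy number by

$$e_n(E,\|\cdot\|):=\inf\Big\{\varepsilon>0:\ \exists\, s_1,\dots,s_{2^{n-1}}\in B_1\ \text{s.t.}\ B_1\subset\bigcup_{i=1}^{2^{n-1}}B(s_i,\varepsilon)\Big\},$$

where $B_1$ is the unit ball in $E$ and $B(s_i,\varepsilon)$ is the ball with center at $s_i$ and radius $\varepsilon$.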
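As a quick sanity check of this definition (an illustrative example that is not part of the original text), take $E=\mathbb{R}$ with the absolute value, so that the unit ball is $B_1=[-1,1]$. Covering $[-1,1]$ by $2^{n-1}$ intervals of radius $\varepsilon$ requires $2\cdot 2^{n-1}\varepsilon\ge 2$, and the centers $s_j=-1+(2j-1)\,2^{-(n-1)}$, $j=1,\dots,2^{n-1}$, attain this lower bound. Hence

$$e_n(\mathbb{R},|\cdot|)=2^{-(n-1)},$$

so in one dimension the entropy numbers halve each time $n$ increases by one.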

To take off the loss function from the hypothesis class, we have the following lemma. is the semi-norm defined by .

###### Proof.

Assume that is an -covering over with . By definition . Then is a covering over . For any and in ,

$$\big\|\ell^1\circ f-\ell^1\circ g\big\|_{L^2(D)}\le 1\cdot\|f-g\|_{L^2(D)},$$

because $\ell^1$ is $1$-Lipschitz. Hence the radius of the image of an -ball under is less than . Therefore is an -covering over with cardinality and . By taking the infimum over the radius of all such and , the statement is proved. ∎

Now we need to give an upper bound for the entropy number of with semi-norm using a volumetric estimate.

.

###### Proof.

Since consists of functions

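$$f(x)=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}w^c_i\cos\Big(\frac{\omega_i\cdot x}{\gamma}\Big)+w^s_i\sin\Big(\frac{\omega_i\cdot x}{\gamma}\Big),$$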

under the semi-norm it is isometric with the -dimensional subspace of spanned by the vectors

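$$\Big\{\Big[\cos\Big(\frac{\omega_i\cdot x_1}{\gamma}\Big),\dots,\cos\Big(\frac{\omega_i\cdot x_m}{\gamma}\Big)\Big]^{\intercal},\ \Big[\sin\Big(\frac{\omega_i\cdot x_1}{\gamma}\Big),\dots,\sin\Big(\frac{\omega_i\cdot x_m}{\gamma}\Big)\Big]^{\intercal}\Big\}_{i=1}^{N}$$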

for fixed samples. For each , we have which implies that By the property of RKHS, we get

$$|f(x)|\le\|f\|_{\mathcal{F}}\,\|k(x,\cdot)\|_{\mathcal{F}}\le\Big(\frac{2r}{\lambda}\Big)^{1/2}\cdot 1,$$

where we use the fact that $k(x,\cdot)$ represents the evaluation functional in the RKHS.

Denote the isomorphism from (modulo the equivalent class under the semi-norm) to by . Then we have

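$$I(\mathcal{F}_r)\subset B^m_\infty\Big(\Big(\frac{2r}{m\lambda}\Big)^{1/2}\Big)\cap U\subset B^m_2\Big(\Big(\frac{2r}{\lambda}\Big)^{1/2}\Big)\cap U.$$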

The intersection region can be identified as a ball of radius $(2r/\lambda)^{1/2}$ in $\mathbb{R}^{2N}$. Its entropy number by volumetric estimate is given by

$$e_i\Big(B^{2N}_2\Big(\Big(\frac{2r}{\lambda}\Big)^{1/2}\Big),\|\cdot\|_2\Big)\le 3\Big(\frac{2r}{\lambda}\Big)^{1/2}2^{-\frac{i}{2N}}.$$
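The pointwise bound $|f(x)|\le\|f\|_{\mathcal{F}}\|k(x,\cdot)\|_{\mathcal{F}}$ used in the proof above has a simple finite-dimensional counterpart that can be checked directly. The sketch below is illustrative only: it assumes Gaussian frequencies $\omega_i$, identifies the norm of $f$ with the Euclidean norm of its weight vector, and uses arbitrary values of `d`, `N` and `gamma`. Since $\cos^2+\sin^2=1$, the feature vector of any point has unit Euclidean norm, and the Cauchy-Schwarz inequality gives $|f(x)|\le\|w\|_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma = 3, 50, 1.5             # illustrative values, not taken from the paper
omega = rng.normal(size=(N, d))      # assumed Gaussian frequencies
w = rng.normal(size=2 * N)           # stacked weights (w^c, w^s)

def feature_vector(x):
    """(1/sqrt(N)) * [cos(omega_i . x / gamma), sin(omega_i . x / gamma)]_i."""
    z = omega @ x / gamma
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(N)

def f(x):
    """Random-feature function with weight vector w."""
    return w @ feature_vector(x)

X = rng.normal(size=(200, d))
values = np.array([f(x) for x in X])
feature_norms = np.array([np.linalg.norm(feature_vector(x)) for x in X])
w_norm = np.linalg.norm(w)
```

Every entry of `feature_norms` equals one up to rounding, and every entry of `values` is bounded in absolute value by `w_norm`.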

With the lemmas above, we can get an upper bound on the entropy number of . However, we should note that such an upper bound is not the best when is small. Because the ramp loss is bounded by , the radius of with respect to is bounded by , which is irrelevant with . This observation will give us finer control on the Rademacher complexity.

###### Lemma 3.

Assume that . Then

$$R_D(\mathcal{H}_r)\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\,\big(3\sqrt{2}\rho+18\sqrt{r}\big),$$

where $\rho=\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)}$.

###### Proof.

By Theorem 7.13 in [Steinwart and Christmann(2008)], we have

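$$R_D(\mathcal{H}_r)\le\sqrt{\frac{\ln 16}{m}}\Big(\sum_{i=1}^{\infty}2^{i/2}e_{2^i}\big(\mathcal{H}_r\cup\{0\},\|\cdot\|_{L^2(D)}\big)+\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)}\Big).$$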

It is easy to see that and . Since is a decreasing sequence with respect to , together with the lemma above, we know that

$$e_i(\mathcal{H}_r)\le\min\Big\{\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)},\ 3\Big(\frac{2r}{\lambda}\Big)^{1/2}2^{-\frac{i}{2N}}\Big\}.$$

Even though the second one decays exponentially, it may be much greater than the first term when is huge for small s. To achieve the balance between these two bounds, we use the first one for first terms in the sum and the second one for the tail. So

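$$R_D(\mathcal{H}_r)\le\sqrt{\frac{\ln 16}{m}}\Big(\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)}\sum_{i=0}^{T-1}2^{i/2}+3\Big(\frac{2r}{\lambda}\Big)^{1/2}\sum_{i=T}^{\infty}2^{i/2}\,2^{-\frac{2^i-1}{2N}}\Big).$$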

The first sum is . When is large enough, the second sum is upper bounded by the integral

$$\int_{T-1}^{\infty}2^{x/2}\,2^{-\frac{2^x-1}{2N}}\,\mathrm{d}x\le\frac{6N}{2^{T/2}}\cdot 2^{-\frac{2^T}{4N}}.$$

To make the form simpler, we bound by , and denote by . Taking to be

$$\log_2\Big(2N\log_2\Big(\frac{1}{\lambda}\Big)\Big),$$

we get the upper bound of the form

$$R_D(\mathcal{H}_r)\le\sqrt{\frac{\ln 16}{m}}\Big(3\rho\sqrt{2N\log_2\frac{1}{\lambda}}+18\sqrt{Nr\log_2(1/\lambda)}\Big).$$

When , , so we can further enlarge the upper bound to the form

$$R_D(\mathcal{H}_r)\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\,\big(3\sqrt{2}\rho+18\sqrt{r}\big).$$
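The splitting argument in the proof can be checked numerically. The sketch below is a sanity check under stated assumptions, not part of the proof: it evaluates the head and tail sums inside the parentheses with $T=\log_2(2N\log_2(1/\lambda))$ and compares the result with the closed form $3\rho\sqrt{2N\log_2(1/\lambda)}+18\sqrt{Nr\log_2(1/\lambda)}$. It assumes that $2N\log_2(1/\lambda)$ is a power of two so that $T$ is an integer, truncates the infinite tail after a fixed number of terms, and uses arbitrary values for `rho` and `r`.

```python
import math

def split_sum_bound(N, lam, rho, r, extra_terms=40):
    """Return (head + tail, closed_form) for the split at T = log2(2 N log2(1/lam))."""
    L = math.log2(1.0 / lam)
    T = round(math.log2(2 * N * L))  # assumes 2*N*L is a power of two
    # First T terms use the uniform bound rho on the entropy numbers.
    head = rho * sum(2.0 ** (i / 2) for i in range(T))
    # Remaining terms use the volumetric bound; terms decay doubly exponentially.
    tail_sum = 0.0
    for i in range(T, T + extra_terms):
        exponent = i / 2 - (2 ** i - 1) / (2 * N)
        if exponent > -1000:         # skip terms that would underflow to zero
            tail_sum += 2.0 ** exponent
    tail = 3 * math.sqrt(2 * r / lam) * tail_sum
    closed_form = 3 * rho * math.sqrt(2 * N * L) + 18 * math.sqrt(N * r * L)
    return head + tail, closed_form

# Illustrative parameters: 2 * 8 * log2(16) = 64, so T = 6.
lhs, rhs = split_sum_bound(N=8, lam=2 ** -4, rho=1.0, r=0.1)
lhs_small, rhs_small = split_sum_bound(N=1, lam=0.5, rho=1.0, r=1.0)
```

In both parameter settings the explicit sum stays below the closed form, which suggests the constants $3$ and $18$ leave some slack.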

The next lemma analyzes the expected Rademacher complexity for $\mathcal{H}_r$.

###### Lemma 4.

Assume and . Then

$$R_m(\mathcal{H}_r)\le C_1\sqrt{\frac{N(V+1)\log_2(1/\lambda)}{m}}\sqrt{r}+C_2\frac{N\log_2(1/\lambda)}{m}.$$

###### Proof.

With Lemma 3, we can directly compute the upper bound for $R_m(\mathcal{H}_r)$ by taking the expectation over $D\sim P^m$.

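$$\begin{aligned}
R_m(\mathcal{H}_r)&=\mathbb{E}_{D\sim P^m}R_D(\mathcal{H}_r)\\
&\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\Big(3\sqrt{2}\,\mathbb{E}\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)}+18\sqrt{r}\Big).
\end{aligned}$$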

By Jensen’s inequality and A.8.5 in [Steinwart and Christmann(2008)], we have

$$\mathbb{E}\sup_{h\in\mathcal{H}_r}\|h\|_{L^2(D)}\le\big(\sigma^2+8R_m(\mathcal{H}_r)\big)^{1/2},$$

where . When , we have

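$$\begin{aligned}
R_m(\mathcal{H}_r)&\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\big(9\sqrt{2}\sigma+18\sqrt{r}\big)\\
&\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\big(9\sqrt{2}\sqrt{Vr}+18\sqrt{r}\big)\\
&\le 36\sqrt{\frac{2(\ln 16)\,N(V+1)\log_2(1/\lambda)}{m}}\sqrt{r}.
\end{aligned}$$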

The second inequality is because and for .

When , we have

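$$\begin{aligned}
R_m(\mathcal{H}_r)&\le\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\Big(9\sqrt{2}\sqrt{R_m(\mathcal{H}_r)}+18\sqrt{r}\Big)\\
&\le 36\sqrt{\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}}\sqrt{r}+36^2\,\frac{(\ln 16)\,N\log_2(1/\lambda)}{m}.
\end{aligned}$$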

The last inequality can be obtained by dividing the formula into two cases, either or and then take the sum of the upper bounds of two cases.
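The two-case argument can be sanity-checked by solving the self-bounding inequality exactly. In the sketch below, `a` stands for $\sqrt{(\ln 16)N\log_2(1/\lambda)/m}$ and `x` for $R_m(\mathcal{H}_r)$; these names and the grid of test values are assumptions made for illustration. The largest $x$ satisfying $x\le a\,(9\sqrt{2}\sqrt{x}+18\sqrt{r})$ is the square of the positive root of a quadratic in $\sqrt{x}$, and it should never exceed $36a\sqrt{r}+36^2a^2$.

```python
import math

def largest_solution(a, r):
    """Largest x >= 0 with x <= a * (9*sqrt(2)*sqrt(x) + 18*sqrt(r))."""
    b = 9 * math.sqrt(2) * a
    c = 18 * a * math.sqrt(r)
    # With s = sqrt(x), equality reads s^2 - b*s - c = 0; take the positive root.
    s = (b + math.sqrt(b * b + 4 * c)) / 2
    return s * s

def two_case_bound(a, r):
    """Sum of the bounds from the two cases: 36*a*sqrt(r) + 36^2 * a^2."""
    return 36 * a * math.sqrt(r) + 36 ** 2 * a ** 2

# Illustrative grid of values for a and r.
grid = [(a, r) for a in (1e-3, 1e-2, 0.1, 1.0, 10.0)
        for r in (1e-4, 1e-2, 1.0, 100.0)]
gaps = [two_case_bound(a, r) - largest_solution(a, r) for a, r in grid]
```

Every entry of `gaps` is nonnegative on this grid, consistent with taking the sum of the two case-wise upper bounds.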

Combining all these inequalities, we finally obtain an upper bound

 Rm(Hr)≤C