# The Error Probability of Random Fourier Features is Dimensionality Independent

We show that the error probability of reconstructing kernel matrices from Random Fourier Features for any shift-invariant kernel function decays exponentially in the number of random features D, i.e., it is at most e−Ω(D). We also provide a matching information-theoretic, method-independent lower bound of e−O(D) for standard Gaussian distributions. Compared to prior work, we are the first to show that the error probability for Random Fourier Features is independent of the dimensionality of the data points as well as the size of their domain. As applications of our theory, we obtain dimension-independent bounds for kernel ridge regression and support vector machines.


## 1 Introduction

Kernel methods are widely applied in many machine learning algorithms, including the kernel perceptron, support vector machines, principal component analysis, and Gaussian processes. Kernels allow one to convert problems that evaluate explicit feature mappings into problems that evaluate kernel functions, i.e., inner products of feature mappings. Kernel methods are efficient since computing inner products of feature mappings is often computationally cheaper than computing the feature mappings directly. To fully leverage the power of kernel methods, an n×n matrix called the kernel matrix (Gram matrix) must be computed, where n is the number of data points; this does not scale when n is large. To cope with this problem, Rahimi and Recht [2007] proposed an algorithm called Random Fourier Features (RFF). RFF approximates the kernel evaluation by the average of Fourier features (cosines of linear projections). This approach is theoretically motivated by Bochner's theorem [Bochner, 1959], which states that any continuous, positive definite and shift-invariant function can be written as the Fourier transform of a nonnegative measure.

Though RFF is a successful method in practice, its theoretical performance is not yet fully understood. Along with the algorithm, Rahimi and Recht [2007] also analyzed the error probability of reconstructing kernel matrices, which is O((σPR/ϵ)2exp(−Dϵ2/(4(d+2)))) for any compact data domain, where R is the domain diameter, D is the number of Fourier features, d is the dimensionality of the data points, ϵ is the approximation error, and σ2P is the second moment of the Fourier transform of the kernel. Their approach is based on covering numbers. Following the work of [Rahimi and Recht, 2007], Sutherland and Schneider [2015] improved the constants in the previous covering-number upper bound and provided results of the same asymptotic order. Later, Sriperumbudur and Szabó [2015] proved an upper bound that still depends on d and R, by using a Rademacher complexity approach.

In this paper, we remove the dependence on the dimensionality d of the data points from the error probability, and show that the dependence on the domain diameter R is sub-linear. More specifically, we show an upper bound on the error probability of O((R2/3/(D1/3ϵ2/3))exp(−Dϵ2/12)), where D is the number of Fourier features. In addition, we also reason about the lower bound, by showing that the minimax bound is Ω(e−D/2) for any estimator based on Gaussian linear projections.

Previous analyses [Rahimi and Recht, 2007, Sutherland and Schneider, 2015, Sriperumbudur and Szabó, 2015] are agnostic to the distribution of the random variables. This is typical in learning theory, where one relates random variables to an arbitrary data distribution. In our problem, the random variables used for the linear projections are Gaussian, while the data lives in a ball of radius R. Thus, in our analysis, we exploit Gaussianity in order to obtain tighter bounds.

## 2 Preliminaries

In this section, we introduce some definitions, notations and preliminaries that will be used in the following sections.

### 2.1 Definitions and Notations

For any vector x∈Rd we denote its ℓ2-norm by ∥x∥2. We denote the compact ball centered at the origin and with radius R by:

 B(R)={Δ|Δ∈Rd,∥Δ∥2≤R}

The above is the domain of the data points considered throughout the paper. For a probability distribution P, we denote its corresponding probability density function by p, any random vector drawn from P by ω, and the expectation with respect to P by Eω∼P. For any set of D i.i.d. samples from P, we denote these samples by Ω={ω1,…,ωD}, their product distribution by PD, and the expectation with respect to PD by EΩ∼PD. Finally, we denote the multivariate Gaussian distribution with mean μ and covariance Σ by N(μ,Σ), and the d-dimensional standard Gaussian distribution by N(0d,Id).

### 2.2 Random Fourier Features

Let x,y∈B(R) be two data points, Δ=x−y, and let k be a nonnegative, continuous and shift-invariant kernel function, that is

 k(x,y)=k(x−y)

By Bochner's theorem [Bochner, 1959], the Fourier transform of k is (up to normalization) a probability density function. We denote such probability density function by p and its distribution by P. We have

 k(x,y) =∫Rdp(ω)eiω⊤(x−y)dω=Eω∼P[cos(ω⊤(x−y))]

since only the real part of the above integral is considered. Then RFF draws D sample points ω1,…,ωD from P, and approximates k(Δ) by

 s(x,y)=s(Δ):=(1/D)∑Di=1cos(ω⊤iΔ)=⟨z(x),z(y)⟩

where

 z(x)=(1/√D)(cos(ω⊤1x),sin(ω⊤1x),…,cos(ω⊤Dx),sin(ω⊤Dx)) (1)

The above result enables us to draw a set of D samples Ω from P and approximate k(x,y) for any x,y∈B(R), without computing the Gram matrix directly. For the performance analysis of RFF, Rahimi and Recht [2007] first provided a theoretical upper bound on the error probability for uniform convergence. Specifically, Claim 1 of [Rahimi and Recht, 2007] shows that the error probability behaves as

 PΩ∼PD[supx,y∈B(R)|s(x,y)−k(x,y)|≥ϵ]∈O((σPR/ϵ)2exp(−Dϵ2/(4(d+2))))

where σ2P=Eω∼P[∥ω∥22]. (Note that [Rahimi and Recht, 2007] defines the features as z(x)=√(2/D)(cos(ω⊤1x+b1),…,cos(ω⊤Dx+bD)), where each bi is a random variable with uniform distribution in [0,2π].)

The above bound depends on both d and R; for fixed ϵ, it decays only as exp(−Dϵ2/(4(d+2))), so the exponent degrades as the dimensionality d grows. Sutherland and Schneider [2015] improved the constants in the above bound with the same features used in [Rahimi and Recht, 2007] (their Proposition 1 uses the features in (1), but with D/2 frequencies, so that the total number of features is D). Their upper bound has the same asymptotic form as that of [Rahimi and Recht, 2007]. Later, Sriperumbudur and Szabó [2015] proved an upper bound by a Rademacher-complexity argument, which also depends on both d and R (see Appendix A for more details). Contrary to common belief, the latter bound is in fact looser than the bounds in [Rahimi and Recht, 2007, Sutherland and Schneider, 2015] (see Appendix A for a thorough discussion).
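Before turning to the analysis, the RFF construction in (1) can be sketched numerically. The following is our own minimal NumPy illustration (function names and sizes are arbitrary, not from the cited works) for the Gaussian kernel k(x,y)=exp(−∥x−y∥2/2), for which Bochner's theorem gives P=N(0d,Id):

```python
import numpy as np

def rff_features(X, omegas):
    """Random Fourier feature map of Eq. (1): for each frequency omega_i,
    emit cos(omega_i . x) and sin(omega_i . x), scaled by 1/sqrt(D)."""
    proj = X @ omegas.T                          # (n, D) linear projections
    D = omegas.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 5, 5000
omegas = rng.standard_normal((D, d))             # samples from P = N(0_d, I_d)

x, y = rng.standard_normal(d), rng.standard_normal(d)
Z = rff_features(np.vstack([x, y]), omegas)

approx = Z[0] @ Z[1]                             # <z(x), z(y)>
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)  # k(x, y)
```

With D on the order of a few thousand, the inner product ⟨z(x),z(y)⟩ typically agrees with k(x,y) to about two decimal places on such a toy example.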

## 3 Sufficient Number of Samples

In this section we prove that the upper bound on the error probability for uniform convergence in RFF is independent of the dimensionality d of the data points, and sub-linear with respect to the domain diameter R.

###### Theorem 1.

Let Ω={ω1,…,ωD} be a set of i.i.d. d-dimensional random vectors from N(0d,Id). Then:

 PΩ∼PD[supΔ∈B(R)|(1/D)∑Di=1cos(ω⊤iΔ)−e−∥Δ∥22/2|≥ϵ]≤(3R2/3/(D1/3ϵ2/3))exp(−Dϵ2/12)
###### Proof.

Note that k(Δ)=e−∥Δ∥22/2 and P=N(0d,Id) for the Gaussian kernel. Note that since for every i≠j, ωi is independent of ωj, then ω⊤iΔ is independent of ω⊤jΔ. Let αi=ω⊤iΔ/∥Δ∥2 for all i. Furthermore, α=(α1,…,αD) is a random vector from N(0D,ID). Note that for r=∥Δ∥2∈[0,R], we have (1/D)∑Di=1cos(ω⊤iΔ)=(1/D)∑Di=1cos(αir) and e−∥Δ∥22/2=e−r2/2. Given the above, we have:

 PΩ∼PD[supΔ∈B(R)|(1/D)∑Di=1cos(ω⊤iΔ)−e−∥Δ∥22/2|≥ϵ]=Pα∼N(0D,ID)[supr∈[0,R]|f(r)|≥ϵ] (2)

where f(r)=s(r)−k(r), s(r)=(1/D)∑Di=1cos(αir), and k(r)=e−r2/2. In order to proceed with a union bound for all r∈[0,R], we divide the set [0,R] into T subsets of length R/T, for t∈{1,…,T}, each of them with center ct=(2t−1)R/(2T). For a particular r and its closest center ct, if |f(ct)|≤ϵ/2 and if the Lipschitz constant L of f fulfills L≤ϵT/R, we have:

 |f(r)|=|f(ct)+f(r)−f(ct)|≤|f(ct)|+|f(r)−f(ct)|≤ϵ/2+L|r−ct|≤ϵ/2+RL/(2T)≤ϵ

Next, we proceed to bound the expected value of the squared Lipschitz constant of f. By linearity of expectation, we have Eα∼N(0D,ID)[∂s(r)]=∂k(r). Therefore:

 Eα∼N(0D,ID)[L2]=supr∈[0,R]Eα∼N(0D,ID)[(∂f(r))2]
 =supr∈[0,R]Eα∼N(0D,ID)[(∂s(r)−∂k(r))2]
 =supr∈[0,R]Eα∼N(0D,ID)[(∂s(r))2−(∂k(r))2]
 =supr∈[0,R]Eα∼N(0D,ID)[(∂s(r))2−(Eα∼N(0D,ID)[∂s(r)])2]
 =supr∈[0,R]Varα∼N(0D,ID)[∂s(r)]
 =supr∈[0,R]Varα∼N(0D,ID)[(1/D)∑Di=1αisin(αir)]
 =supr∈[0,R](1/D2)∑Di=1Varαi∼N(0,1)[αisin(αir)]
 ≤supr∈[0,R](1/D2)∑Di=1Eαi∼N(0,1)[α2isin2(αir)]
 ≤supr∈[0,R](1/D2)∑Di=1Eαi∼N(0,1)[α2i]
 =1/D (3)

Since Eα∼N(0D,ID)[cos(αict)]=e−c2t/2 and cos(αict)∈[−1,1], each f(ct) is a centered average of bounded i.i.d. terms. Therefore, from (3), and by Markov's and Hoeffding's inequalities and by the union bound, we have

 Pα∼N(0D,ID)[supr∈[0,R]|f(r)|≥ϵ]≤Pα∼N(0D,ID)[L2≥ϵ2T2/R2]+Pα∼N(0D,ID)[∃t∈{1,…,T},|f(ct)|≥ϵ/2]
 ≤R2/(T2Dϵ2)+2Texp(−Dϵ2/8) (4)

By optimizing the above with respect to T, we obtain T=(R2/(Dϵ2))1/3eDϵ2/24. Finally, by replacing the optimal T back in (4), and by (2), we prove our claim. ∎
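As a sanity check (ours, not part of the original analysis), the failure probability in Theorem 1 can be estimated by Monte Carlo: draw α∼N(0D,ID), approximate the supremum over r∈[0,R] on a fine grid, and count how often it exceeds ϵ. The grid resolution, trial count, and choices of R and ϵ below are arbitrary:

```python
import numpy as np

def sup_error(alpha, R, grid=2000):
    """Approximate sup_{r in [0, R]} |(1/D) sum_i cos(alpha_i r) - exp(-r^2/2)|
    on a uniform grid of r values."""
    r = np.linspace(0.0, R, grid)
    s = np.cos(np.outer(alpha, r)).mean(axis=0)  # empirical average s(r)
    k = np.exp(-r ** 2 / 2)                      # Gaussian kernel k(r)
    return np.abs(s - k).max()

rng = np.random.default_rng(1)
R, eps, trials = 4.0, 0.25, 200

def failure_rate(D):
    """Fraction of trials where the sup error is at least eps."""
    errs = [sup_error(rng.standard_normal(D), R) for _ in range(trials)]
    return float(np.mean(np.array(errs) >= eps))

# Theorem 1 predicts a decay that is exponential in D for fixed eps and R.
rates = [failure_rate(D) for D in (25, 100, 400)]
```

On this toy setup the empirical failure rate drops rapidly as D grows, consistent with the exp(−Dϵ2/12) factor in the theorem.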

## 4 Necessary Number of Samples

In contrast to the relative popularity of upper-bound analyses of the error probability for RFF, lower bounds have not been analyzed before. In this section we shed light on the minimax bound for any estimator based on Gaussian linear projections, by using Le Cam's Lemma [Yu, 1997, Wasserman, 2010]. In the first part we introduce Le Cam's Lemma for minimax bounds; then we use Le Cam's Lemma to show the minimax bound of the expected error. Finally, we generalize our result in the last subsection and show that the supremum of the error probability is also bounded below by Ω(e−D/2).

### 4.1 Le Cam’s Lemma

Recall that given an observation X drawn from some distribution P in a family of distributions P, a function θ(P) of the distribution, and any estimator ^θ of θ(P), the minimax risk is:

 inf^θsupP∈PEX∼P[d(^θ(X),θ(P))]

Minimax theory characterizes the lower bound of the estimation error among all estimators. Le Cam's Lemma is a method for providing a lower bound for the minimax risk. First we introduce Le Cam's Lemma from Theorem 36.8 of [Wasserman, 2010], which is a revised version of the results in [Yu, 1997]:

###### Lemma 1 (Le Cam [Yu, 1997, Wasserman, 2010]).

Let P be a set of distributions over a space X, where every distribution P∈P corresponds to a parameter θ(P) in the parameter space Θ. Let ^θ be any estimator, X be an observation drawn from P, and d be a metric on Θ. Then for any pair of distributions P1,P2∈P:

 inf^θ supP∈P EX∼P[d(^θ(X),θ(P))]≥(d(θ(P1),θ(P2))/4)∫X∈X min(p1(X),p2(X))dX

where p1,p2 are the probability density functions of P1,P2, respectively.

(Detailed proofs can be found in Appendix B.)

Moreover, d can be any semimetric, or any non-negative symmetric function satisfying a relaxed triangle inequality, which generalizes Le Cam's Lemma as follows:

###### Lemma 2.

Let P be a set of distributions over a space X, where every distribution P∈P corresponds to a parameter θ(P) in the parameter space Θ. Let ^θ be any estimator, X be an observation drawn from P, and d be a non-negative symmetric function on Θ satisfying the relaxed triangle inequality d(θ1,θ2)≤A(d(θ1,θ3)+d(θ3,θ2)) for some constant A≥1. For any pair of distributions P1,P2∈P with d(θ(P1),θ(P2))>0, we have:

 inf^θ supP∈P EX∼P[d(^θ(X),θ(P))]≥(d(θ(P1),θ(P2))/(4A))∫X∈X min(p1(X),p2(X))dX

where p1,p2 are the probability density functions of P1,P2, respectively.

Le Cam's Lemma allows for analyzing the minimax bound of estimation errors, which gives us some insight into the lower bound for any estimator. In order to simplify our analysis with Le Cam's Lemma, we also introduce the following result:

###### Lemma 3 (Lemma 2.6 in [Tsybakov, 2009]).

For any two distributions P1,P2 with support X, we have

 ∫X∈X min(p1(X),p2(X))dX≥(1/2)e−KL(P1∥P2)

where KL(P1∥P2) is the KL-divergence from P1 to P2, and p1,p2 are the probability density functions of P1,P2, respectively.
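Lemma 3 is easy to verify numerically. The following sketch (our own illustration, with arbitrary variances) checks it for two centered one-dimensional Gaussians, using the closed-form KL-divergence between N(0,σ1²) and N(0,σ2²):

```python
import numpy as np

# Two centered 1-D Gaussians N(0, s1^2) and N(0, s2^2).
s1, s2 = 1.0, 2.0
x = np.linspace(-30.0, 30.0, 400001)
dx = x[1] - x[0]
p1 = np.exp(-x**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
p2 = np.exp(-x**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)

# Left-hand side of Lemma 3: integral of the pointwise minimum density,
# approximated by a Riemann sum on the grid.
overlap = float(np.minimum(p1, p2).sum() * dx)

# Closed form: KL(N(0,s1^2) || N(0,s2^2)) = (1/2)(s1^2/s2^2 - log(s1^2/s2^2) - 1).
ratio = s1**2 / s2**2
kl = 0.5 * (ratio - np.log(ratio) - 1)
lower = 0.5 * np.exp(-kl)                 # right-hand side of Lemma 3
```

Here the overlap integral is roughly 0.68 while the lower bound is roughly 0.36, consistent with the lemma.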

### 4.2 Minimax Bound for the Expected Error

In what follows, we regard the kernel value k(Δ)=e−∥Δ∥22/2 as a parameter θ(PΔ) of the distribution PΔ of the Gaussian linear projections, and then show by Le Cam's Lemma that the minimax bound for such parameter estimation is Ω(e−D/2). Let αi=ω⊤iΔ be the i-th Gaussian linear projection of the data point difference Δ. The particular estimator used in practice, and studied in Section 3 for the sufficient number of samples, can be written as ^θ(α1,…,αD)=(1/D)∑Di=1cos(αi). For our analysis of the necessary number of samples, we consider any estimator of the form ^θ(ω⊤1Δ,…,ω⊤DΔ), thus setting impossibility results beyond the typically used estimator.

###### Theorem 2.

Let Ω={ω1,…,ωD} be a set of i.i.d. d-dimensional random vectors from N(0d,Id). Let ^θ be any estimator that uses Gaussian linear projections of the data point difference Δ, i.e., ^θ takes (ω⊤1Δ,…,ω⊤DΔ) as input. Let γ=−W(−e−2)≈0.1586, where W is the principal branch of Lambert's W function (equivalently, γ is the root of γ−logγ=2 in (0,1)). Define

 g(R) ={ e−γR2/2−e−R2/2 if R<√(2log(1/γ)/(1−γ)); γ^(γ/(1−γ))−γ^(1/(1−γ)) otherwise }

Then

 inf^θ supΔ∈B(R) EΩ∼PD[|^θ(ω⊤1Δ,…,ω⊤DΔ)−e−∥Δ∥22/2|]≥(g(R)/8)e−D/2
###### Proof.

Note that k(Δ)=e−∥Δ∥22/2 and ω⊤iΔ∼N(0,∥Δ∥22) for all i. Let αi=ω⊤iΔ; then α=(α1,…,αD) can be regarded as a set of D i.i.d. samples from N(0,∥Δ∥22), i.e., a random vector from PΔ:=N(0D,∥Δ∥22ID). We define the family of distributions P={PΔ|Δ∈B(R)}. We let the parameter of PΔ be θ(PΔ)=e−∥Δ∥22/2=Eα∼PΔ[(1/D)∑Di=1cos(αi)], where the last equality follows from Lemma 5 in Appendix B. By Le Cam's Lemma 1 and Lemma 3 we have:

 inf^θ supΔ∈B(R) EΩ∼PD[|^θ(ω⊤1Δ,…,ω⊤DΔ)−θ(PΔ)|]=inf^θ supPΔ∈P Eα∼PΔ[|^θ(α1,…,αD)−θ(PΔ)|]
 ≥(1/4)|θ(PΔ1)−θ(PΔ2)|∫RD min(pΔ1(α),pΔ2(α))dα
 ≥(1/8)|θ(PΔ1)−θ(PΔ2)|e−KL(PΔ1∥PΔ2)
 =(1/8)|e−∥Δ1∥22/2−e−∥Δ2∥22/2|e−KL(PΔ1∥PΔ2) (5)

where pΔ1,pΔ2 are the probability density functions of PΔ1,PΔ2, respectively. To complete the proof, we will first set KL(PΔ1∥PΔ2)=D/2 and then maximize the expression |e−∥Δ1∥22/2−e−∥Δ2∥22/2|. Note that PΔ1 and PΔ2 are two multivariate normal distributions, which implies that their KL-divergence is

 KL(PΔ1∥PΔ2)=(D/2)(∥Δ1∥22/∥Δ2∥22−log(∥Δ1∥22/∥Δ2∥22)−1)

By choosing Δ1,Δ2 such that ∥Δ1∥22=γ∥Δ2∥22, we can make ∥Δ1∥22/∥Δ2∥22−log(∥Δ1∥22/∥Δ2∥22)−1=1, since γ−logγ=2, and thus KL(PΔ1∥PΔ2)=D/2. Then from (5) we get

 inf^θ supΔ∈B(R) EΩ∼PD[|^θ(ω⊤1Δ,…,ω⊤DΔ)−θ(PΔ)|]≥(1/8)|e−∥Δ1∥22/2−e−∥Δ2∥22/2|e−D/2 (6)

for any Δ1,Δ2∈B(R) that satisfy the constraint ∥Δ1∥22=γ∥Δ2∥22. Under this constraint, the solution that maximizes |e−∥Δ1∥22/2−e−∥Δ2∥22/2| is ∥Δ2∥2=R and ∥Δ1∥22=γR2, if R<√(2log(1/γ)/(1−γ)). Similarly, the solution is ∥Δ2∥22=2log(1/γ)/(1−γ) and ∥Δ1∥22=2γlog(1/γ)/(1−γ), if R≥√(2log(1/γ)/(1−γ)). By rewriting (6) with this maximizing solution, we prove our claim. ∎
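The constant γ above can be computed without any special-function library: γ=−W(−e−2) is exactly the root of γ−logγ=2 in (0,1), which a plain bisection finds directly. The following sketch (ours) also checks that the choice ∥Δ1∥22=γ∥Δ2∥22 makes the KL-divergence equal to D/2:

```python
import numpy as np

def solve_gamma(tol=1e-12):
    """Root of f(g) = g - log(g) - 2 on (0, 1); f is decreasing there,
    so plain bisection applies. This root equals -W_0(-e^{-2})."""
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - np.log(mid) - 2 > 0:
            lo = mid                      # f still positive: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = solve_gamma()

# With ||Delta_1||^2 = gamma * ||Delta_2||^2, the Gaussian KL-divergence
# (D/2)(x - log x - 1), evaluated at x = gamma, collapses to exactly D/2.
D = 10
kl = (D / 2) * (gamma - np.log(gamma) - 1)
```

Numerically, γ ≈ 0.1586, and the computed KL equals D/2 up to bisection tolerance.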

### 4.3 Minimax Bound for the Error Probability

In this part we generalize the results of the previous subsection and show that the supremum of the error probability is Ω(e−D/2) for any estimator based on Gaussian linear projections. Here we introduce a generalization of Theorem 2, based on the generalized Le Cam's Lemma 2:

###### Theorem 3.

Let Ω={ω1,…,ωD} be a set of i.i.d. d-dimensional random vectors from N(0d,Id). Let ^θ be any estimator that uses Gaussian linear projections of the data point difference Δ, i.e., ^θ takes (ω⊤1Δ,…,ω⊤DΔ) as input. Then

 inf^θ supΔ∈B(R) PΩ∼PD[|^θ(ω⊤1Δ,…,ω⊤DΔ)−e−∥Δ∥22/2|>ϵ]≥((g(R)−ϵ)/8)e−D/2

provided that ϵ<g(R), where g is defined as in Theorem 2.

###### Proof.

The proof largely follows that of Theorem 2. We can regard every ω⊤iΔ as a random variable from N(0,∥Δ∥22) and regard α=(ω⊤1Δ,…,ω⊤DΔ) as a D-dimensional random vector from PΔ:=N(0D,∥Δ∥22ID). We define the family of distributions P={PΔ|Δ∈B(R)}. We let the parameter of PΔ be θ(PΔ)=e−∥Δ∥22/2=Eα∼PΔ[(1/D)∑Di=1cos(αi)], where the last equality follows from Lemma 5 in Appendix B. We also define a symmetric, nonnegative function

 dϵ(θ,θ′)=(1/2)max(0,|θ−θ′|−ϵ)

which satisfies a relaxed triangle inequality with A=3 whenever |θ−θ′| is sufficiently larger than ϵ. Furthermore, define

 Iϵ(θ,θ′)={1 if |θ−θ′|>ϵ; 0 otherwise}

which satisfies Iϵ(θ,θ′)≥dϵ(θ,θ′), since the kernel values θ,θ′ lie in [0,1]. From the above and by Le Cam's Lemma 2 and Lemma 3 we have:

 inf^θ supΔ∈B(R) PΩ∼PD[|^θ(ω⊤1Δ,…,ω⊤DΔ)−θ(PΔ)|>ϵ]=inf^θ supPΔ∈P Pα∼PΔ[|^θ(α1,…,αD)−θ(PΔ)|>ϵ]
 =inf^θ supPΔ∈P Eα∼PΔ[Iϵ(^θ(α1,…,αD),θ(PΔ))]
 ≥inf^θ supPΔ∈P Eα∼PΔ[dϵ(^θ(α1,…,αD),θ(PΔ))]
 ≥(1/12)dϵ(θ(PΔ1),θ(PΔ2))∫RD min(pΔ1(α),pΔ2(α))dα
 ≥(1/24)dϵ(θ(PΔ1),θ(PΔ2))e−KL(PΔ1∥PΔ2)

The proof continues as in Theorem 2, by first setting KL(PΔ1∥PΔ2)=D/2 and then maximizing the expression |e−∥Δ1∥22/2−e−∥Δ2∥22/2|. ∎

## 5 Applications

In this section, we provide examples of consequences of our theory. In particular, our theory allows for tighter results for the analysis of the expectation of the maximum error, and the sample complexity of kernel ridge regression and support vector machines.

### 5.1 Expectation of the Maximum Error

Proposition 3 of [Sutherland and Schneider, 2015] shows that when the kernel function k is L-Lipschitz, the expected maximum error of the approximation is bounded above by

 EΩ∼PD[supΔ∈B(R)|(1/D)∑Di=1cos(ω⊤iΔ)−k(Δ)|]∈O((√dR/√D)(EΩ∼PD[maxi=1,…,D∥ωi∥2]+L))

We improve the above upper bound to O(R2/3/√D), which is independent of d, by the following corollary:

###### Corollary 1.

Let Ω be as in Theorem 1. We have that:

 EΩ∼PD[supΔ∈B(R)|(1/D)∑Di=1cos(ω⊤iΔ)−k(Δ)|]≤31/6Γ(1/6)R2/3/(22/3√D)

where Γ is the Gamma function.
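The bound of Corollary 1 can be compared against a direct Monte Carlo estimate of the expected supremum error. The sketch below is our own illustration (R, D, the grid resolution, and the trial count are arbitrary choices):

```python
import numpy as np
from math import gamma as gamma_fn, sqrt

rng = np.random.default_rng(2)
R, D, trials = 2.0, 200, 300
r = np.linspace(0.0, R, 1000)
k = np.exp(-r**2 / 2)                       # Gaussian kernel k(r)

errs = []
for _ in range(trials):
    alpha = rng.standard_normal(D)          # alpha_i = omega_i . Delta / ||Delta||_2
    s = np.cos(np.outer(alpha, r)).mean(axis=0)
    errs.append(np.abs(s - k).max())        # sup over the grid of r values
mean_err = float(np.mean(errs))

# Corollary 1: E[sup error] <= 3^{1/6} Gamma(1/6) R^{2/3} / (2^{2/3} sqrt(D)).
bound = 3 ** (1 / 6) * gamma_fn(1 / 6) * R ** (2 / 3) / (2 ** (2 / 3) * sqrt(D))
```

On this configuration the empirical expected error sits comfortably below the analytic bound, as expected.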

### 5.2 Kernel Ridge Regression

Consider a training set of m samples {(x1,y1),…,(xm,ym)}, the kernel matrix K∈Rm×m, where Kij=k(xi,xj), and the vector kx=(k(x,x1),…,k(x,xm))⊤. As proved in Proposition 9 of [Sutherland and Schneider, 2015], the error probability for kernel ridge regression is bounded above by

 P[|^h(x)−h(x)|≥ϵ]≤PΩ∼PD[supΔ|(1/D)∑Di=1cos(ω⊤iΔ)−k(Δ)|≥λ2ϵ/((λ+1)m)] (7)

where

 h(x) =y⊤(K+λI)−1kx,^h(x)=y⊤(^K+λI)−1^kx

^K is the approximation of the kernel matrix K using RFF, ^kx is the approximation of kx from RFF, λ is the regularization parameter, and the labels y are assumed to have unit standard deviation. With the upper bound of the error probability in [Sutherland and Schneider, 2015], the authors proved, by applying (7), that |^h(x)−h(x)|≤ϵ with probability at least 1−δ if

 D ∈Ω(d((λ+1)m/(λ2ϵ))2[log(1/δ)+log(√dR(λ+1)m/(λ2ϵ))])=Ω((d/ϵ2)(log(1/δ)+log(√dR/ϵ)))

On the other hand, we reach the same result with a smaller number of features. Let W be Lambert's W function. From our result in Theorem 1 and (7), we obtain

 D ∈Ω(((λ+1)m/(λ2ϵ))2W(R2/δ3))=Ω((1/ϵ2)(log(1/δ)+logR)) (8)

Regarding the work of [Rudi and Rosasco, 2017], note that our result in (8) shows that D is independent of the number of samples m. In contrast, Theorem 1 in [Rudi and Rosasco, 2017] requires a number of random features that grows with m, namely D∈Ω(√m logm). Thus, our result is tighter in this respect.
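A small end-to-end sketch (ours; the data, sizes, and λ are arbitrary) of how the RFF approximation enters kernel ridge regression: the exact predictor h(x)=y⊤(K+λI)−1kx is compared with ^h(x), which replaces K and kx by their RFF approximations built from the features of (1):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, D, lam = 80, 3, 20000, 1.0

X = rng.standard_normal((m, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(m)

def gauss_kernel(A, B):
    """Exact Gaussian kernel k(a, b) = exp(-||a - b||^2 / 2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / 2)

omegas = rng.standard_normal((D, d))        # frequencies drawn from N(0_d, I_d)

def z(A):
    """RFF feature map of Eq. (1)."""
    proj = A @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

ZX = z(X)
K, K_hat = gauss_kernel(X, X), ZX @ ZX.T    # exact and approximate Gram matrices

x_test = rng.standard_normal((1, d))
k_x = gauss_kernel(X, x_test)[:, 0]
k_x_hat = (ZX @ z(x_test).T)[:, 0]

h = y @ np.linalg.solve(K + lam * np.eye(m), k_x)              # exact KRR
h_hat = y @ np.linalg.solve(K_hat + lam * np.eye(m), k_x_hat)  # RFF KRR
```

With D on the order of 10^4, the approximate predictor typically matches the exact one to a couple of decimal places on this toy problem.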

### 5.3 Support Vector Machines

Now we consider support vector machine (SVM) classifiers. Consider a training set of n samples {(x1,y1),…,(xn,yn)} with yi∈{−1,1}, the kernel embedding z(·), and the SVM classifier h(x)=w⊤z(x), where w is the parameter vector obtained from the standard regularized hinge-loss optimization problem with regularization weight C0. As proved in Section 3.2 of [Sutherland and Schneider, 2015], if the RFF approximation of the kernel is controlled by ϵ, that is

 |k(x,y)−s(x,y)|≤ϵ

then the approximation error of the SVM classifier is also controlled:

 |^h(x)−h(x)|≤√2C0(n+√n)1/4ϵ1/4+C0(n+√n)1/2ϵ1/2∈O(C0n1/2ϵ1/2)

where ^h is the approximation of h using RFF. The results in [Sutherland and Schneider, 2015] show that if

 ϵ ∈Θ(√(dW(DR2/δ)/D))=Θ(√((d/D)(logD+logR+log(1/δ))))

where W is Lambert's W function, then the above approximation guarantee holds with probability at least 1−δ. However, with our result in Theorem 1 we can control ϵ to be

 ϵ ∈Θ(√(W(R2/δ3)/D))=Θ(√((1/D)(logR+log(1/δ))))

under which the same guarantee also holds with probability at least 1−δ. (The dependence with respect to n can be customarily removed by using the weight C0/n. This corresponds to scaling the squared-norm regularization as a function of the number of data points n.)

## 6 Concluding Remarks

There are several ways of extending our work. For instance, note that [Rahimi and Recht, 2007, Sutherland and Schneider, 2015, Sriperumbudur and Szabó, 2015] focus on more general shift-invariant kernel functions. In this paper, all results are based on the assumption that the kernel function is Gaussian. Extensions to more general kernel functions can be of interest.

## References

• Bochner [1959] S. Bochner. Lectures on Fourier integrals. Princeton University Press, 1959.
• Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Neural Information Processing Systems, 20:1177–1184, 2007.
• Rudi and Rosasco [2017] A. Rudi and L. Rosasco. Generalization properties of learning with random features. Neural Information Processing Systems, 30:3215–3225, 2017.
• Sriperumbudur and Szabó [2015] B. K. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. Neural Information Processing Systems, 28:1144–1152, 2015.
• Sutherland and Schneider [2015] D. Sutherland and J. Schneider. On the error of random Fourier features. Uncertainty in Artificial Intelligence, pages 862–871, 2015.
• Tsybakov [2009] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009.
• Wasserman [2010] L. Wasserman. Minimax Theory, Lecture Notes on Statistical Machine Learning. 2010.
• Yu [1997] B. Yu. Assouad, Fano, and Le Cam. Springer-Verlag, 1997.

## Appendix A Regarding the Bound in [Sriperumbudur and Szabó, 2015]

Theorem 1 of Sriperumbudur and Szabó [2015] establishes the following upper bound:

 P[supx,y∈B(R)|k(x,y)−⟨z(x),z(y)⟩|≥(√(2048d log(2R+1))+√(2048d log(σP+1))+√(512d/log(2R+1))+√(2τ))/√D]≤e−τ (9)

If we rewrite the error threshold to be ϵ, i.e., ϵ=(A+√(2τ))/√D, where A denotes the sum of the first three terms in the threshold of (9), then solving for τ allows us to rewrite (9) as:

 P[supx,y∈B(R)|k(x,y)−⟨z(x),z(y)⟩|≥ϵ]≤exp(−(√Dϵ−A)2/2) (10)

where A=√(2048d log(2R+1))+√(2048d log(σP+1))+√(512d/log(2R+1)). Since A grows with both d and R, the above bound retains a dependence on the dimensionality and the domain diameter.