1 Introduction
Kernel methods [scholkopf02learning] have enjoyed tremendous success in solving several fundamental problems of machine learning, including classification, regression, feature extraction, dependency estimation, causal discovery, Bayesian inference and hypothesis testing. This success is due to their ability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows one to implicitly compute inner products between these high-dimensional feature maps $\phi$ via a kernel function $k$: $k(x,y) = \langle \phi(x), \phi(y) \rangle$. However, this flexibility and richness of kernels has a price: by resorting to implicit computations, these methods operate on the Gram matrix of the data, which raises serious computational challenges when dealing with large-scale data. To resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [Williams01, drineas05nystrom, alaoui15fast], explicit feature maps designed for additive kernels [vedaldi12efficient, maji13efficient], hashing [shi09hash, kulis12kernelized], and random Fourier features (RFF) [rahimi07random] constructed for shift-invariant kernels, the focus of the current paper.
RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map $\phi$ associated with the kernel, by appealing to Bochner's theorem [wendland05scattered] (any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure), [rahimi07random] proposed an explicit low-dimensional random Fourier feature map $\hat{\phi}$ obtained by empirically approximating the Fourier integral so that $k(x,y) \approx \langle \hat{\phi}(x), \hat{\phi}(y) \rangle$. The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, which makes it possible to handle large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [rahimi07random]. Another advantage of the RFF approach is that, unlike the low-rank matrix approximation approach [Williams01, drineas05nystrom], which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful when dealing with out-of-sample data and in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [oliva15fast], differential privacy preservation [chaudhuri11differentially] and causal discovery [lopezpaz15towards].
Despite the success of the RFF method, surprisingly, very little is known about its performance guarantees. To the best of our knowledge, the only papers in the machine learning literature providing certain theoretical insight into the accuracy of kernel approximation via RFF are [rahimi07random, sutherland15error] (Footnote 1: [sutherland15error] derived tighter constants compared to [rahimi07random] and also considered different RFF implementations): they show that $\sup_{x,y\in S}|\hat{k}(x,y)-k(x,y)| = O_p(|S|\sqrt{\log(m)/m})$ for any compact set $S\subset\mathbb{R}^d$, where $m$ is the number of random Fourier features and $|S|$ is the diameter of $S$. However, since the approximation proposed by the RFF method involves empirically approximating the Fourier integral, the RFF estimator can be thought of as an empirical characteristic function (ECF).
In the probability literature, the systematic study of ECFs was initiated by [feuerverger77empirical] and followed up by [csorgo83howlong, csorgo81multivariate, yukich87some]. While [feuerverger77empirical] shows the almost sure (a.s.) convergence of $\sup_{x,y\in S}|\hat{k}(x,y)-k(x,y)|$ to zero, [csorgo83howlong, Theorems 1 and 2] and [yukich87some, Theorems 6.2 and 6.3] show that the optimal rate is $m^{-1/2}$. In addition, [feuerverger77empirical] shows that almost sure convergence cannot be attained over the entire space (i.e., $\mathbb{R}^d$) if the characteristic function decays to zero at infinity. Due to this, [csorgo83howlong, yukich87some] study the convergence behavior of the same quantity when the diameter of $S$ grows with $m$ and show that almost sure convergence is guaranteed as long as the diameter of $S$ is $e^{o(m)}$. Unfortunately, all these results (to the best of our knowledge) are asymptotic in nature, and the only known finite-sample guarantee, by [rahimi07random, sutherland15error], is non-optimal. In this paper (see Section 3), we present a finite-sample probabilistic bound for $\sup_{x,y\in S}|\hat{k}(x,y)-k(x,y)|$ that holds for any $m$ and provides the optimal rate of $m^{-1/2}$ for any compact set $S$, along with guaranteeing almost sure convergence as long as the diameter of $S$ is $e^{o(m)}$. Since convergence in uniform norm might sometimes be too strong a requirement and may not be suitable to attain correct rates in the generalization bounds associated with learning algorithms involving RFF (Footnote 2: For example, in applications like kernel ridge regression based on RFF, it is more appropriate to consider the approximation guarantee in $L^2$ norm than in the uniform norm), we also study the behavior of $\hat{k}-k$ in $L^r$-norm ($1\le r<\infty$) and obtain an optimal rate of $m^{-1/2}$. The RFF approach to approximating a translation-invariant kernel can be seen as a special case of the problem of approximating a function in the barycenter of a family (say $\mathcal{G}$) of functions, which was considered in [rahimi08uniform]. However, the approximation guarantees in [rahimi08uniform, Theorem 3.2] do not directly apply to RFF, as the assumptions on $\mathcal{G}$ are not satisfied by the cosine function, which is the family of functions used to approximate the kernel in the RFF approach. While a careful modification of the proof of [rahimi08uniform, Theorem 3.2] could yield an $m^{-1/2}$ rate of approximation for any compact set $S$, this result would still be suboptimal because it provides a linear dependence on $|S|$, similar to the theorems in [rahimi07random, sutherland15error], in contrast to the optimal logarithmic dependence on $|S|$ that is guaranteed by our results.
Traditionally, kernel-based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [zhou08derivative, shi10hermite], nonlinear variable selection [rosasco10regularization, rosasco13nonparametric], (multi-task) gradient learning [ying12learning] and fitting of distributions in an infinite-dimensional exponential family [sriperumbudur14density]. Given the importance of these derivative-based kernel algorithms, similar to [rahimi07random], in Section 4 we propose a finite-dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above-mentioned derivative-based kernel algorithms.
We present a finite-sample bound that quantifies the quality of approximation in uniform and $L^r$-norms and show the rate of convergence to be $m^{-1/2}$ in both these cases.
A summary of our contributions is as follows. We

provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives;

prove uniform and $L^r$ convergence on fixed compact sets with optimal rate in terms of the RFF dimension ($m$);

give sufficient conditions for the growth rate of compact sets while preserving a.s. convergence uniformly and in $L^r$; specializing our result, we match the best attainable asymptotic growth rate.
Various notations and definitions that are used throughout the paper are provided in Section 2 along with a brief review of RFF approximation proposed by [rahimi07random]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.
2 Notations & preliminaries
In this section, we introduce notations that are used throughout the paper and then present preliminaries on kernel approximation
through random feature maps as introduced by [rahimi07random].
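Definitions & Notation: For a topological space $\mathcal{X}$, $C(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) denotes the space of all continuous (resp. bounded continuous) functions on $\mathcal{X}$. For $f\in C_b(\mathcal{X})$, $\|f\|_{\mathcal{X}} := \sup_{x\in\mathcal{X}}|f(x)|$ is the supremum norm of $f$. $M_b(\mathcal{X})$ and $M_+^1(\mathcal{X})$ are the sets of all finite Borel and probability measures on $\mathcal{X}$, respectively. For $\mu\in M_b(\mathcal{X})$, $L^r(\mathcal{X},\mu)$ denotes the Banach space of $r$-power ($r\ge 1$) $\mu$-integrable functions. For $\mathcal{X}\subseteq\mathbb{R}^d$, we will use $L^r(\mathcal{X})$ for $L^r(\mathcal{X},\mu)$ if $\mu$ is the Lebesgue measure on $\mathcal{X}$. For $f\in L^r(\mathcal{X},\mu)$, $\|f\|_{L^r(\mathcal{X},\mu)} := \left(\int_{\mathcal{X}}|f|^r\,d\mu\right)^{1/r}$ denotes the $L^r$-norm of $f$ for $1\le r<\infty$, and we write it as $\|\cdot\|_{L^r(\mathcal{X})}$ if $\mathcal{X}\subseteq\mathbb{R}^d$ and $\mu$ is the Lebesgue measure.
For any $f\in L^1(\mathcal{X},P)$ where $P\in M_+^1(\mathcal{X})$, we define $Pf := \int_{\mathcal{X}} f(x)\,dP(x)$ and $P_m f := \frac{1}{m}\sum_{i=1}^m f(X_i)$ where $(X_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} P$, $P_m := \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ is the empirical measure and $\delta_x$ is a Dirac measure supported on $x\in\mathcal{X}$. $\mathrm{supp}(P)$ denotes the support of $P$. $P^m := \otimes_{j=1}^m P$ denotes the $m$-fold product measure.
For $v := (v_1,\ldots,v_d)\in\mathbb{R}^d$, $\|v\|_2 := \sqrt{\sum_{i=1}^d v_i^2}$. The diameter of $A\subseteq\mathcal{Y}$, where $(\mathcal{Y},\rho)$ is a metric space, is defined as $|A|_\rho := \sup\{\rho(x,y) : x,y\in A\}$. If $\mathcal{Y}=\mathbb{R}^d$ with $\rho=\|\cdot\|_2$, we denote the diameter of $A$ as $|A|$; $|A|<\infty$ if $A$ is compact. The volume of $A\subseteq\mathbb{R}^d$ is defined as $\mathrm{vol}(A) := \int_A 1\,dx$. For $A\subseteq\mathbb{R}^d$, we define $A_\Delta := A - A = \{x-y : x,y\in A\}$. $\mathrm{conv}(A)$ is the convex hull of $A$.
For a function $g$ defined on an open set $B\subseteq\mathbb{R}^d\times\mathbb{R}^d$, $\partial^{p,q}g(x,y) := \frac{\partial^{|p|+|q|}g(x,y)}{\partial x_1^{p_1}\cdots\partial x_d^{p_d}\,\partial y_1^{q_1}\cdots\partial y_d^{q_d}}$, $(x,y)\in B$, where $p,q\in\mathbb{N}^d$ are multi-indices, $|p| := \sum_{j=1}^d p_j$ and $\mathbb{N} := \{0,1,2,\ldots\}$. Define $v^p := \prod_{j=1}^d v_j^{p_j}$. For positive sequences $(a_n)_{n\in\mathbb{N}}$, $(b_n)_{n\in\mathbb{N}}$, $a_n = o(b_n)$ if $\lim_{n\to\infty} a_n/b_n = 0$.
$X_n = O_p(r_n)$ (resp. $O_{a.s.}(r_n)$) denotes that $X_n/r_n$ is bounded in probability (resp. almost surely). $\Gamma(t) := \int_0^\infty x^{t-1}e^{-x}\,dx$ is the Gamma function, $\Gamma(1/2)=\sqrt{\pi}$ and $\Gamma(t+1)=t\,\Gamma(t)$.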
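Random feature maps: Let $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ be a bounded, continuous, positive definite, translation-invariant kernel, i.e., there exists a positive definite function $\psi:\mathbb{R}^d\to\mathbb{R}$ such that $k(x,y)=\psi(x-y)$, $x,y\in\mathbb{R}^d$, where $\psi\in C_b(\mathbb{R}^d)$. By Bochner's theorem [wendland05scattered, Theorem 6.6], $\psi$ can be represented as the Fourier transform of a finite nonnegative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,
$k(x,y) = \psi(x-y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\omega^T(x-y)}\,d\Lambda(\omega) \overset{(\star)}{=} \int_{\mathbb{R}^d} \cos\left(\omega^T(x-y)\right)d\Lambda(\omega),$ (1)
where $(\star)$ follows from the fact that $\psi$ is real-valued and symmetric. Since $\Lambda(\mathbb{R}^d)=\psi(0)$, $k(x,y)=\psi(0)\int e^{\sqrt{-1}\,\omega^T(x-y)}\,dP(\omega)$ where $P := \Lambda/\psi(0)\in M_+^1(\mathbb{R}^d)$. Therefore, w.l.o.g., we assume throughout the paper that $\psi(0)=1$ and so $\Lambda\in M_+^1(\mathbb{R}^d)$. Based on (1), [rahimi07random] proposed an approximation to $k$ by replacing $\Lambda$ with its empirical measure $\Lambda_m$, constructed from $(\omega_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$, so that the resultant approximation can be written as the Euclidean inner product of finite-dimensional random feature maps, i.e.,
$\hat{k}(x,y) = \frac{1}{m}\sum_{i=1}^m \cos\left(\omega_i^T(x-y)\right) \overset{(\ast)}{=} \langle \phi(x),\phi(y)\rangle_{\mathbb{R}^{2m}},$ (2)
where $\phi(x) := \frac{1}{\sqrt{m}}\left(\cos(\omega_1^Tx),\ldots,\cos(\omega_m^Tx),\sin(\omega_1^Tx),\ldots,\sin(\omega_m^Tx)\right)$ and $(\ast)$ holds based on the basic trigonometric identity $\cos(a-b)=\cos a\cos b+\sin a\sin b$. This elegant approximation to $k$ is particularly useful in speeding up kernel-based algorithms, as the finite-dimensional random feature map $\phi$ can be used to solve these algorithms in the primal, thereby offering better computational complexity (than by solving them in the dual) while at the same time not lacking in performance. Apart from these practical advantages, [rahimi07random, Claim 1] (and similarly, [sutherland15error, Prop. 1]) provides a theoretical guarantee that $\|\hat{k}-k\|_{S\times S}\to 0$ as $m\to\infty$ for any compact set $S\subset\mathbb{R}^d$. Formally, [rahimi07random, Claim 1] showed that (note that (3) is slightly different from, but more precise than, the statement of Claim 1 in [rahimi07random]) for any $\epsilon>0$,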
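The construction in (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes the Gaussian kernel $\exp(-\|x-y\|_2^2/2)$, whose spectral measure is the standard normal distribution, and the function names (`sample_frequencies`, `rff_features`, `gaussian_kernel`) are hypothetical.

```python
import numpy as np

def sample_frequencies(m, d, rng):
    """Draw m frequencies from N(0, I_d), the spectral measure of the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2) (assumed kernel)."""
    return rng.standard_normal((m, d))

def rff_features(X, W):
    """Map each row of X (n x d) to a 2m-dimensional random Fourier feature."""
    proj = X @ W.T                      # (n, m) array of omega_i^T x
    m = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def gaussian_kernel(X, Y):
    """Exact Gaussian kernel matrix, used only as a reference."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))    # 50 points in the compact set [-1, 1]^3
W = sample_frequencies(2000, 3, rng)    # m = 2000 random frequencies
Phi = rff_features(X, W)                # explicit feature map
K_hat = Phi @ Phi.T                     # inner products of features
K = gaussian_kernel(X, X)
max_err = np.max(np.abs(K_hat - K))     # error on the sampled pairs only
print(max_err)
```

Because $\cos^2+\sin^2=1$, the diagonal of `K_hat` equals one exactly, which matches the normalization $\psi(0)=1$ used in the text. The printed error covers only the sampled pairs, so it is a lower bound on the supremum over the whole compact set.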
(3) 
where and when . The condition implies that (and therefore ) is twice differentiable. From (3) it is clear that the probability has polynomial tails if (i.e., small ) and Gaussian tails if (i.e., large ) and can be equivalently written as
(4) 
where . For sufficiently large (i.e., ), it follows from (4) that
(5) 
While (5) shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence (i.e., $\hat{k}$ converges to $k$ uniformly over compact sets), the rate of convergence of $\sqrt{(\log m)/m}$ is not optimal. In addition, the order of dependence on $|S|$ is not optimal. A faster rate (in fact, an optimal rate) of convergence is desired, since better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using $\hat{k}$. The order of dependence on $|S|$ is also important, as it determines the number of RFF features (i.e., $m$) that are needed to achieve a given approximation accuracy. In fact, the order of dependence on $|S|$ controls the rate at which $|S|$ can be grown as a function of $m$ when $m\to\infty$ (see Remark 1(ii) for a detailed discussion about the significance of growing $|S|$). In the following section, we present an analogue of (4) (see Theorem 1) that provides optimal rates and has the correct dependence on $|S|$.
3 Main results: approximation of $k$
As discussed in Sections 1 and 2, while the random feature map approximation of $k$ introduced by [rahimi07random] has many practical advantages, it does not seem to be theoretically well understood. The existing theoretical results on the quality of approximation do not provide a complete picture owing to their non-optimality. In this section, we first present our main result (see Theorem 1), which improves upon (4) and provides a rate of $m^{-1/2}$ with logarithmic dependence on $|S|$. We then discuss the consequences of Theorem 1 along with its optimality in Remark 1. Next, in Corollary 2 and Theorem 3, we discuss the $L^r$-convergence ($1\le r<\infty$) of $\hat{k}$ to $k$ over compact subsets of $\mathbb{R}^d$.
Theorem 1.
Suppose where is positive definite and . Then for any and nonempty compact set ,
where .
Proof (sketch).
Note that , where , which means the object of interest is the suprema of an empirical process indexed by . Instead of bounding by using Hoeffding’s inequality on a cover of and then applying union bound as carried out in [rahimi07random, sutherland15error], we use the refined technique of applying concentration via McDiarmid’s inequality, followed by symmetrization and bound the Rademacher average by Dudley entropy bound. The result is obtained by carefully bounding the covering number of . The details are provided in Section B.1 of the supplementary material. ∎
Remark 1.
(i) Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence as $m\to\infty$, with the rate of a.s. convergence being $\sqrt{m^{-1}\log|S|}$ (almost sure convergence is guaranteed by the first Borel-Cantelli lemma). In comparison to (4), it is clear that Theorem 1 provides improved rates with better constants and logarithmic dependence on $|S|$ instead of a linear dependence. The logarithmic dependence on $|S|$ ensures that we need $m=O(\epsilon^{-2}\log|S|)$ random features instead of $m=O(\epsilon^{-2}|S|^2\log(|S|/\epsilon))$ random features, i.e., significantly fewer features to achieve the same approximation accuracy of $\epsilon$.
(ii) Growing diameter: While Theorem 1 provides almost sure convergence uniformly over compact sets, one might wonder whether it is possible to achieve uniform convergence over $\mathbb{R}^d$. [feuerverger77empirical, Section 2] showed that such a result is possible if $\Lambda$ is a discrete measure but not possible for $\Lambda$ that is absolutely continuous w.r.t. the Lebesgue measure (i.e., if $\Lambda$ has a density). Since uniform convergence of $\hat{k}$ to $k$ over $\mathbb{R}^d$ is not possible for many interesting $k$ (e.g., the Gaussian kernel), it is of interest to study the convergence on $S$ whose diameter grows with $m$. Therefore, as mentioned in Section 2, the order of dependence of rates on $|S|$ is critical. Suppose $|S_m|\to\infty$ as $m\to\infty$ (we write $S_m$ instead of $S$ to show the explicit dependence on $m$). Then Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence if $m^{-1}\log|S_m|\to 0$ as $m\to\infty$ (i.e., $|S_m|=e^{o(m)}$), in contrast to the result in (4), which requires $|S_m|=o(\sqrt{m/\log m})$. In other words, Theorem 1 ensures consistency even when $|S_m|$ grows exponentially in $m$, whereas (4) ensures consistency only if $|S_m|$ does not grow faster than $\sqrt{m/\log m}$.
(iii) Optimality: Note that is the characteristic function of since is the Fourier transform of (by Bochner’s theorem). Therefore, the object of interest ,
is the uniform norm of the difference between and the empirical characteristic function , when both are
restricted to a compact set . The question of the convergence behavior of is not new and has been studied in great
detail in the probability and statistics literature (e.g., see [feuerverger77empirical, yukich87some] for and [csorgo81multivariate, csorgo83howlong] for ) where the characteristic function is not just a realvalued symmetric function (like )
but is Hermitian. [yukich87some, Theorems 6.2 and 6.3] show that the optimal rate of convergence of is when , which matches
with our result in Theorem 1. Also Theorems 1 and 2 in [csorgo83howlong] show that
the logarithmic dependence on is optimal asymptotically. In particular, [csorgo83howlong, Theorem 1] matches with the growing diameter result in Remark 1(ii), while
[csorgo83howlong, Theorem 2] shows that if is absolutely continuous w.r.t. the Lebesgue measure and if ,
then there exists a positive such that . This means the rate
is not only the best possible in general for almost sure convergence, but if faster sequence is considered then even stochastic convergence cannot be retained for any characteristic function
vanishing at infinity along at least one path. While these previous results match that of Theorem 1 (and its consequences), we would like to highlight the fact that all these previous results are asymptotic in nature, whereas Theorem 1 provides a finite-sample probabilistic inequality that holds for any $m$. We are not aware of any such finite-sample result except for the ones in [rahimi07random, sutherland15error].
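The $m^{-1/2}$ behaviour discussed in Remark 1 can be probed numerically. The following is only a rough sketch under stated assumptions: the kernel is taken to be Gaussian, the supremum over $S_\Delta$ is replaced by a maximum over a finite random sample of difference vectors (so it underestimates the true supremum), and the helper `sup_error` is hypothetical.

```python
import numpy as np

def sup_error(m, Z, rng):
    """Largest |psi_hat(z) - psi(z)| over the rows z of Z for the Gaussian
    kernel psi(z) = exp(-||z||^2 / 2), using m random frequencies."""
    W = rng.standard_normal((m, Z.shape[1]))
    psi_hat = np.cos(Z @ W.T).mean(axis=1)          # empirical characteristic function
    psi = np.exp(-0.5 * np.sum(Z**2, axis=1))       # true characteristic function
    return np.max(np.abs(psi_hat - psi))

rng = np.random.default_rng(1)
d = 2
# Finite proxy for S_Delta = S - S with S = [-1, 1]^d (an assumption of this sketch).
Z = rng.uniform(-2, 2, size=(400, d))
ms = [100, 400, 1600, 6400]
# Average over 20 independent draws of the frequencies for each m.
errs = [np.mean([sup_error(m, Z, rng) for _ in range(20)]) for m in ms]
for m, e in zip(ms, errs):
    print(m, e, e * np.sqrt(m))                     # last column should stay roughly flat
```

If the error decays like $m^{-1/2}$, multiplying $m$ by 64 should shrink it by roughly a factor of 8, and the rescaled column `e * sqrt(m)` should be approximately constant. A simulation on a finite grid illustrates the rate but does not verify the theorem.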
Using Theorem 1, one can obtain a probabilistic inequality for the $L^r$-norm of $\hat{k}-k$ over any compact set $S\subset\mathbb{R}^d$, as given by the following result.
Corollary 2.
Proof.
Note that
The result follows by combining Theorem 1 and the fact that where and (which follows from [folland99real, Corollary 2.55]). ∎
Corollary 2 shows that and therefore if as , then consistency of in norm is achieved as long as as . This means, in comparison to the uniform norm in Theorem 1 where can grow exponential in (), cannot grow faster than () to achieve consistency in norm.
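The step from the uniform norm to the $L^r$-norm used in the proof of Corollary 2 can be written out as a short worked inequality. This is a sketch under the assumption that the $L^r$-norm is the Lebesgue $L^r$-norm over $S\times S$; the symbol $f$ (standing for $\hat{k}-k$) is introduced here only for illustration.

```latex
\| f \|_{L^r(S)}
  = \Big( \int_S \int_S |f(x,y)|^r \, dx \, dy \Big)^{1/r}
  \le \| f \|_{S \times S} \, \mathrm{vol}(S)^{2/r},
\qquad
\mathrm{vol}(S) \le \frac{\pi^{d/2} \, |S|^d}{2^d \, \Gamma(d/2 + 1)} .
```

The second inequality bounds the volume of $S$ by the volume of a Euclidean ball of diameter $|S|$. It is this volume factor that introduces the polynomial dependence on $|S|$ in the $L^r$ bound.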
Instead of using Theorem 1 to obtain a bound on (this bound may be weak as for any ), a better bound (for ) can be obtained by directly bounding , as shown in the following result.
Theorem 3.
Suppose where is positive definite. Then for any , and nonempty compact set ,
where is the Khintchine constant given by for and for .
Proof (sketch).
As in Theorem 1, we show that satisfies the bounded difference property, hence by the McDiarmid’s inequality, it concentrates around its expectation . By symmetrization, we then show that is upper bounded in terms of , where
are Rademacher random variables. By exploiting the fact that $L^r(S)$ is a Banach space of type $\min\{r,2\}$, the result follows. The details are provided in Section B.2 of the supplementary material. ∎
Remark 2.
Theorem 3 shows an improved dependence on without the extra factor given in Corollary 2 and therefore provides a better rate for when the diameter of grows, i.e., if as . However, for , Theorem 3 provides a slower rate than Corollary 2 and therefore it is appropriate to use the bound in Corollary 2. While one might wonder why we only considered the convergence of and not , it is important to note that the latter is not welldefined because even if .
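To make the comparison between the uniform norm and the $L^r$-norm concrete, the following sketch estimates both for one draw of the random features. It assumes a Gaussian kernel, takes $S=[-1,1]^d$, and approximates the integral over $S\times S$ by plain Monte Carlo, so the numbers are estimates and not exact norms. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, r = 2, 500, 2.0
side = 2.0                                   # S = [-1, 1]^d
vol_S = side ** d                            # Lebesgue volume of S
W = rng.standard_normal((m, d))              # frequencies for the Gaussian kernel

# Uniform sample of pairs (x, y) from S x S for Monte Carlo integration.
n_pairs = 5000
X = rng.uniform(-1, 1, size=(n_pairs, d))
Y = rng.uniform(-1, 1, size=(n_pairs, d))
Z = X - Y
k_hat = np.cos(Z @ W.T).mean(axis=1)         # RFF approximation at each pair
k_true = np.exp(-0.5 * np.sum(Z**2, axis=1)) # exact Gaussian kernel values
err = np.abs(k_hat - k_true)

# ||f||_{L^r(S x S)}^r is approximated by vol(S)^2 * mean(|f|^r).
lr_norm = (vol_S**2 * np.mean(err**r)) ** (1.0 / r)
sup_norm = err.max()                         # maximum over the sampled pairs
crude_bound = sup_norm * vol_S ** (2.0 / r)  # uniform norm times vol(S)^(2/r)
print(lr_norm, crude_bound)
```

On the same sample, `lr_norm` can never exceed `crude_bound`, and the gap between them gives a feel for how much can be lost by passing through the uniform norm.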
4 Approximation of kernel derivatives
In the previous section we focused on the approximation of the kernel function, where we presented uniform and $L^r$ convergence guarantees on compact sets for the random Fourier feature approximation and discussed how fast the diameter of these sets can grow while preserving uniform and $L^r$ convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and $L^r$ convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of the kernel through a finite-dimensional random feature map is also important, as it makes it possible to speed up several interesting machine learning tasks that involve the derivatives of the kernel [zhou08derivative, shi10hermite, rosasco10regularization, rosasco13nonparametric, ying12learning, sriperumbudur14density]; see for example the recent infinite-dimensional exponential family fitting technique [strathmann15gradient], which implements this idea.
To this end, we consider as in (1) and define (in other words , , , and ). For , assuming , it follows from the dominated convergence theorem that
so that can be approximated by replacing with , resulting in
(6) 
where and . Now the goal is to understand the behavior of and for , i.e., obtain analogues of Theorems 1 and 3.
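As an illustration of the idea of differentiating under the integral in (1), the following sketch approximates the gradient of the kernel with respect to its first argument. It covers only the first-order case, using $\partial_{x_j}\cos(\omega^T(x-y)) = -\omega_j\sin(\omega^T(x-y))$, and assumes a Gaussian kernel, for which the exact gradient is $-(x-y)\,e^{-\|x-y\|_2^2/2}$. The function names are hypothetical and this is not claimed to be the exact form of (6).

```python
import numpy as np

def rff_grad_x(x, y, W):
    """Monte Carlo estimate of the gradient of k(x, y) with respect to x:
    the average of -omega * sin(omega^T (x - y)) over the rows omega of W."""
    s = np.sin(W @ (x - y))                  # shape (m,)
    return -(W * s[:, None]).mean(axis=0)    # shape (d,)

def gaussian_grad_x(x, y):
    """Exact gradient in x of the Gaussian kernel exp(-||x - y||^2 / 2)."""
    diff = x - y
    return -diff * np.exp(-0.5 * diff @ diff)

rng = np.random.default_rng(2)
d, m = 3, 50000
W = rng.standard_normal((m, d))              # frequencies from N(0, I_d)
x = np.array([0.3, -0.5, 0.8])
y = np.array([-0.2, 0.1, 0.4])
approx = rff_grad_x(x, y, W)
exact = gaussian_grad_x(x, y)
grad_err = np.max(np.abs(approx - exact))
print(approx, exact, grad_err)
```

Each summand is multiplied by a coordinate of $\omega$, so the summands are not uniformly bounded when the spectral measure has unbounded support, as is the case for the Gaussian kernel.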
As in the proof sketch of Theorem 1, while the uniform approximation error of the kernel derivative can be analyzed as the supremum of an empirical process indexed by a suitable function class, some technical issues arise because this function class is not uniformly bounded. This means McDiarmid's or Talagrand's inequality cannot be applied to achieve concentration, and bounding the Rademacher average by the Dudley entropy bound may not be reasonable. While these issues can be tackled by resorting to more technical and refined methods, in this paper we generalize Theorem 1 to derivatives (see Theorem 4, which is proved in Section B.1 of the supplement) under the restrictive assumption that the support of the spectral measure is bounded (note that many popular kernels including the Gaussian do not satisfy this assumption). We also present another result (see Theorem 5) by generalizing the proof technique of [rahimi07random] to unbounded functions, where the boundedness assumption is relaxed but at the expense of a worse rate (compared to Theorem 4). (Footnote 3: We also correct some technical issues in the proof of [rahimi07random, Claim 1], where (i) a shift-invariant argument was applied to the non-shift-invariant kernel estimator , (ii) the convexity of was not imposed, leading to a possibly undefined Lipschitz constant (), and (iii) the randomness of was not taken into account, thus the upper bound on the expectation of the squared Lipschitz constant () does not hold.)
Theorem 4.
Let , , , and assume that . Suppose is bounded if and . Then for any and nonempty compact set ,
where
.
Remark 3.
(i) Note that Theorem 4 reduces to Theorem 1 if , in which case .
If or , then the boundedness of implies that and .
(ii) Growth of : By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that
if and
if (for ) as
. An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that is bounded and it can be shown that
for , if .
The following result relaxes the boundedness of
by imposing certain moment conditions on
but at the expense of a worse rate. The proof relies on applying the Bernstein inequality at the elements of a net (which exists by compactness) combined with a union bound, and extending the approximation error from the anchors by a probabilistic Lipschitz argument.
Theorem 5.
Let , be continuously differentiable, be continuous, be any nonempty compact set, and . Assume that . Suppose such that
(7) 
where . Define .^{4}^{4}4 is monotonically decreasing in , . Then
(8)  
Remark 4.
(i) The compactness of implies that of . Hence, by the continuity of , one gets .
(7) holds if and ().
If is bounded, then the boundedness of is guaranteed (see Section B.4 in the supplement).
(ii) In the special case when , our requirement boils down to the continuous differentiability of , , and (7).
(iii) Note that (8) is similar to (3) and therefore based on the discussion in Section 2, one has
. But the advantage with Theorem 5
over [rahimi07random, Claim 1] and [sutherland15error, Prop. 1] is that it can handle unbounded functions. In comparison to Theorem 4, we obtain worse rates and it
will be of interest to improve the rates of Theorem 5 while handling unbounded functions.
5 Discussion
In this paper, we presented the first detailed theoretical analysis of the approximation quality of random Fourier features (RFF), which were proposed by [rahimi07random] in the context of improving the computational complexity of kernel machines. While [rahimi07random, sutherland15error] provided a probabilistic bound on the uniform approximation (over compact subsets of $\mathbb{R}^d$) of a kernel by random features, the result is not optimal. We improved this result by providing a finite-sample bound with optimal rate of convergence and also analyzed the quality of approximation in $L^r$-norm ($1\le r<\infty$). We also proposed an RFF approximation for derivatives of a kernel and provided theoretical guarantees on the quality of approximation in uniform and $L^r$-norms over compact subsets of $\mathbb{R}^d$.
While all the results in this paper (and also in the literature) deal with the approximation quality of RFF over only compact subsets of $\mathbb{R}^d$, it is of interest to understand its behavior over the entire $\mathbb{R}^d$. However, as discussed in Remark 1(ii) and in the paragraph following Theorem 3, RFF cannot approximate the kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$. By truncating the Taylor series expansion of the exponential function, [cotter11explicit] proposed a non-random finite-dimensional representation to approximate the Gaussian kernel, which also enjoys the computational advantages of RFF. However, this representation also does not approximate the Gaussian kernel uniformly over $\mathbb{R}^d$. Therefore, the question remains whether it is possible to approximate a kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$ while still retaining the computational advantages associated with RFF.
Acknowledgments
Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.
References
Appendix A Definitions & notation
Let be a metric space, a measurable space and denotes the set of measurable functions. A family of maps is called a separable Carathéodory family w.r.t. if is separable and is continuous for all . Let , be a Rademacher sequence, i.e., s are i.i.d. and , and . The Rademacher average of is defined as ; we use the shorthand . is said to be an net of if for any there is an such that . The covering number of is defined as the size of the smallest net, i.e.,