1 Introduction
Kernel methods (shawetaylor04book), such as support vector machines (boser92; vapnik98), map data into a high-dimensional space in which a linear predictor can solve the learning problem at hand. The mapped space is never explicitly computed and the linear predictor is represented implicitly thanks to a kernel function. This is the powerful kernel trick: the kernel function computes the scalar product between two data points in this high-dimensional space. However, kernel methods notoriously suffer from two drawbacks. On the one hand, computing all the scalar products between the learning samples is costly: it requires $O(n^2)$ kernel evaluations for many kernel-based methods, where $n$ is the number of training data points. On the other hand, one has to select a kernel function adapted to the learning problem for the algorithm to succeed. The first of these drawbacks has motivated the development of approximation methods making kernel methods more scalable, such as the Nyström approximation (williams2001nystrom; drineas2005nystrom), which constructs a low-rank approximation of the Gram matrix (the matrix of all kernel values $k(\mathbf{x}_i, \mathbf{x}_j)$ computed on the learning samples) and is data dependent, or random Fourier features (RFF) (rahimi07), which approximate the kernel with random features based on the Fourier transform and are not data dependent (a comparison between the two approaches has been conducted by nystromVSrff). In this paper, we revisit the latter technique.
We start from the observation that a predictor based on kernel Fourier features can be interpreted as a weighted combination of those features according to a data-independent distribution defined by the Fourier transform. We introduce an original viewpoint, where this distribution is interpreted as a prior distribution over a space of weak hypotheses, each hypothesis being a simple trigonometric function obtained by the Fourier decomposition. This suggests that one can improve the approximation by adapting this distribution to the data points: we aim at learning a posterior distribution. By this means, our study proposes strategies to learn a representation of the data. While this representation is not as flexible and powerful as the ones that can be learned by deep neural networks (Goodfellow16book), we think that this strategy is worth studying to eventually address the second drawback of kernel methods, which currently heavily rely on the kernel choice. With this in mind, while the majority of work related to random Fourier features focuses on the study and improvement of the kernel approximation, we propose here a reinterpretation in the light of the PAC-Bayesian theory (mcallester99; catoni07). We derive generalization bounds that can be straightforwardly optimized by learning a pseudo-posterior thanks to a closed-form expression.
The rest of the paper is organized as follows. Section 2 recalls the RFF setting. Section 3 expresses the Fourier transform as a prior, leading (i) to a first PAC-Bayesian analysis and a landmarks-based algorithm in Section 4, and (ii) to another PAC-Bayesian analysis in Section 5, allowing us to justify the kernel alignment method of SinhaD16 and to propose a greedy kernel learning method. Finally, Section 6 provides experiments showing the usefulness of our work.
2 Random Fourier Features
Problem setting.
Consider a classification problem where we want to learn a predictor $h : \mathcal{X} \to \mathcal{Y}$, from a $d$-dimensional input space $\mathcal{X} \subseteq \mathbb{R}^d$ to a discrete output space $\mathcal{Y}$ (e.g., $\mathcal{Y} = \{-1, 1\}$). The learning algorithm is given a training set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ of $n$ samples drawn from $D^n$, where $D$ denotes the data-generating distribution over $\mathcal{X} \times \mathcal{Y}$. We consider a positive-semidefinite (PSD) kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Kernel machines learn predictors of the form

(1)   $h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^n \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) \Big),$

by optimizing the values of the vector $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_n)$.
Fourier Features.
When $n$ is large, running a kernel machine algorithm (like SVM or kernel ridge regression) is expensive in memory and running time. To circumvent this problem, rahimi07 introduced random Fourier features as a way to approximate the value of a shift-invariant kernel, i.e., a kernel relying only on the difference $\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}'$, so that we write $k(\mathbf{x}, \mathbf{x}')$ and $k(\boldsymbol{\delta})$ interchangeably. Let the distribution $p$ be the Fourier transform of the shift-invariant kernel $k$,

(2)   $p(\boldsymbol{\omega}) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} k(\boldsymbol{\delta})\, e^{-i\, \boldsymbol{\omega} \cdot \boldsymbol{\delta}}\, d\boldsymbol{\delta}.$

Now, by writing $k$ as the inverse of the Fourier transform $p$, and using trigonometric identities, we obtain:

(3)   $k(\boldsymbol{\delta}) = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) \cos(\boldsymbol{\omega} \cdot \boldsymbol{\delta})\, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p} \cos\big(\boldsymbol{\omega} \cdot (\mathbf{x} - \mathbf{x}')\big).$
rahimi07 suggest expressing the above expectation as a product of two features. One way to achieve this is to map every input example $\mathbf{x}$ into

(4)   $\phi_{\boldsymbol{\omega}}(\mathbf{x}) = \big( \cos(\boldsymbol{\omega} \cdot \mathbf{x}),\; \sin(\boldsymbol{\omega} \cdot \mathbf{x}) \big).$

The random variable $\phi_{\boldsymbol{\omega}}(\mathbf{x}) \cdot \phi_{\boldsymbol{\omega}}(\mathbf{x}')$, with $\boldsymbol{\omega}$ drawn from $p$, is an unbiased estimate of $k(\mathbf{x}, \mathbf{x}')$. Indeed, we recover from Equation (3) and Equation (4):

$\mathbb{E}_{\boldsymbol{\omega} \sim p}\, \phi_{\boldsymbol{\omega}}(\mathbf{x}) \cdot \phi_{\boldsymbol{\omega}}(\mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \cos\big(\boldsymbol{\omega} \cdot (\mathbf{x} - \mathbf{x}')\big) = k(\mathbf{x}, \mathbf{x}').$

To reduce the variance in the estimation of $k(\mathbf{x}, \mathbf{x}')$, the idea is to sample $K$ points $\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_K$ from $p$. Then, each training sample $\mathbf{x}$ is mapped to a new feature vector in $\mathbb{R}^{2K}$:

(5)   $\phi(\mathbf{x}) = \frac{1}{\sqrt{K}} \big( \phi_{\boldsymbol{\omega}_1}(\mathbf{x}), \dots, \phi_{\boldsymbol{\omega}_K}(\mathbf{x}) \big).$
Thus, when $K$ is “large enough”, we have $\phi(\mathbf{x}) \cdot \phi(\mathbf{x}') \approx k(\mathbf{x}, \mathbf{x}')$. This provides a decomposition of the PSD kernel that differs from the classical one (as discussed in bach17equivalence). By learning a linear predictor on the transformed training set through an algorithm like a linear SVM, we recover a predictor equivalent to the one learned by a kernelized algorithm. That is, we learn a weight vector $\mathbf{w} \in \mathbb{R}^{2K}$ and we predict the label of a sample $\mathbf{x}$ by computing

(6)   $h(\mathbf{x}) = \operatorname{sign}\big( \mathbf{w} \cdot \phi(\mathbf{x}) \big),$

in place of Equation (1).
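As a concrete illustration of Equations (3)–(5), the following sketch (assuming NumPy, and a unit-bandwidth RBF kernel whose Fourier transform is a standard Gaussian) draws random frequencies and checks that the feature map's inner product approximates the kernel value:

```python
import numpy as np

def rff_map(X, omegas):
    """Random Fourier feature map of Equation (5): each sample is sent to
    scaled [cos, sin] pairs so that inner products estimate the kernel."""
    proj = X @ omegas.T                          # (n, K) projections omega . x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
d, K = 3, 5000
# For k(x, x') = exp(-||x - x'||^2 / 2), the Fourier transform p is N(0, I).
omegas = rng.standard_normal((K, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)
approx = (rff_map(x[None, :], omegas) @ rff_map(y[None, :], omegas).T).item()
exact = float(np.exp(-0.5 * np.sum((x - y) ** 2)))
```

With $K = 5000$ frequencies, the Monte Carlo error is on the order of $1/\sqrt{K}$, so `approx` and `exact` agree to about two decimal places.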
3 The Fourier Transform as a Prior Distribution
As described in the previous section, the random Fourier features trick has been introduced to reduce the running time of kernel learning algorithms. Consequently, most of the subsequent works study and/or improve the properties of the kernel approximation (e.g., Yu16; Rudi17; bach17equivalence; Choromanski18), with some notable exceptions, such as the kernel learning algorithm of SinhaD16 that we discuss and relate to our approach in Section 5.3.
We aim at reinterpreting the Fourier transform, i.e., the distribution $p$ of Equation (2), as a prior distribution over the feature space. It can be seen as an alternative representation of the prior knowledge that is encoded in the choice of a specific kernel function, which we denote $k_p$ from now on. In accordance with Equation (3), each feature obtained from a vector $\boldsymbol{\omega} \in \mathbb{R}^d$ can be seen as a hypothesis

$h_{\boldsymbol{\omega}}(\mathbf{x}, \mathbf{x}') = \cos\big(\boldsymbol{\omega} \cdot (\mathbf{x} - \mathbf{x}')\big).$

From this point of view, the kernel $k_p$ is interpreted as a predictor performing a weighted aggregation of weak hypotheses. This alternative interpretation of the distribution $p$ as a prior over weak hypotheses naturally suggests to learn a posterior distribution $Q$ over the same hypotheses. That is, we seek a distribution $Q$ giving rise to a new kernel

$k_Q(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, h_{\boldsymbol{\omega}}(\mathbf{x}, \mathbf{x}').$
In order to assess the quality of the kernel $k_Q$, we define a loss function based on the consideration that its output should be high when two samples share the same label, and low otherwise. Hence, we evaluate the kernel on two samples $(\mathbf{x}, y)$ and $(\mathbf{x}', y')$ through the linear loss

(7)   $\ell\big(k_Q, (\mathbf{x}, y), (\mathbf{x}', y')\big) = \tfrac{1}{2}\Big(1 - y\, y'\, k_Q(\mathbf{x}, \mathbf{x}')\Big),$

where $\mathbf{x} - \mathbf{x}'$ plays the role of the pairwise distance and $y\, y' \in \{-1, 1\}$ is the pairwise similarity measure.
Furthermore, we define the kernel alignment generalization loss $\mathcal{L}_{D'}$ on a “pairwise” probability distribution $D'$, defined over $(\mathcal{X} \times \mathcal{Y})^2$, as

(8)   $\mathcal{L}_{D'}(k_Q) = \mathbb{E}_{((\mathbf{x}, y), (\mathbf{x}', y')) \sim D'}\; \ell\big(k_Q, (\mathbf{x}, y), (\mathbf{x}', y')\big).$

Note that any data-generating distribution $D$ over the input-output spaces automatically gives rise to a “pairwise” distribution $D'$. By a slight abuse of notation, we write $\mathcal{L}_D$ for the corresponding generalization loss, and the associated kernel alignment empirical loss is defined as

(9)   $\widehat{\mathcal{L}}_S(k_Q) = \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} \ell\big(k_Q, (\mathbf{x}_i, y_i), (\mathbf{x}_j, y_j)\big),$

where each pair of examples contributes its pairwise distance $\mathbf{x}_i - \mathbf{x}_j$ and similarity $y_i\, y_j$.
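The empirical alignment loss of Equation (9) averages the pairwise loss over all ordered pairs of distinct training points. A minimal sketch (assuming the linear loss $\tfrac{1}{2}(1 - y\,y'\,k)$ and a precomputed kernel matrix; the helper name is illustrative):

```python
import numpy as np

def empirical_alignment_loss(K_mat, y):
    """Average the linear pairwise loss (1 - y_i y_j K[i, j]) / 2 over
    all pairs i != j, as in Equation (9)."""
    Y = np.outer(y, y)                   # pairwise label similarity in {-1, 1}
    L = (1.0 - Y * K_mat) / 2.0          # in [0, 1] when K_mat is in [-1, 1]
    n = len(y)
    return (L.sum() - np.trace(L)) / (n * (n - 1))   # exclude i == j terms

y = np.array([1, 1, -1, -1])
perfect = np.outer(y, y).astype(float)   # a kernel perfectly aligned with labels
loss = empirical_alignment_loss(perfect, y)
```

A kernel matching the label similarity on every pair attains loss 0, while the anti-aligned kernel `-perfect` attains the maximal loss 1.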
Starting from this reinterpretation of the Fourier transform, we provide in the rest of the paper two PAC-Bayesian analyses. The first one (Section 4) is obtained by combining PAC-Bayesian bounds: instead of considering all the possible pairs of data points, we fix one point and we study the generalization ability over all the pairs involving it. The second analysis (Section 5) is based on the fact that the loss can be expressed as a second-order U-statistic.
4 PAC-Bayesian Analysis and Landmarks
Due to the linearity of the loss function $\ell$, we can rewrite the loss of $k_Q$ as the average loss of every hypothesis. Indeed, Equation (8) becomes

$\mathcal{L}_{D}(k_Q) = \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, \mathcal{L}_{D}(h_{\boldsymbol{\omega}}).$

The above expectation of losses turns out to be the quantity bounded by most PAC-Bayesian generalization theorems (this quantity is sometimes referred to as the Gibbs risk in the PAC-Bayesian literature), except that such results usually apply to the loss over samples instead of pairwise distances. Hence, we use PAC-Bayesian bounds to obtain generalization guarantees on $\mathcal{L}_{D}(k_Q)$ from its empirical estimate of Equation (9), which we can rewrite as

$\widehat{\mathcal{L}}_S(k_Q) = \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}}).$

However, the classical PAC-Bayesian theorems cannot be applied directly to bound $\mathcal{L}_{D}(k_Q)$, as the empirical loss would need to be computed from i.i.d. observations of $D'$. Instead, the empirical loss involves dependent samples, as it is computed from the $n(n-1)$ pairs formed by the elements of $S$.
4.1 First-Order Kullback-Leibler Bound
A straightforward approach to apply classical PAC-Bayesian results is to bound separately the loss associated with each training sample. That is, for each $(\mathbf{x}_i, y_i) \in S$, we define the loss restricted to the pairs involving $\mathbf{x}_i$:

$\mathcal{L}_D^{\mathbf{x}_i}(h_{\boldsymbol{\omega}}) = \mathbb{E}_{(\mathbf{x}', y') \sim D}\; \ell\big(h_{\boldsymbol{\omega}}, (\mathbf{x}_i, y_i), (\mathbf{x}', y')\big).$

Thus, the next theorem gives a generalization guarantee on $\mathcal{L}_D^{\mathbf{x}_i}$, relying namely on the empirical estimate $\widehat{\mathcal{L}}_S^{\mathbf{x}_i}$ (computed on the pairs formed with the other training samples) and the Kullback-Leibler divergence $\mathrm{KL}(Q \| P)$ between the prior $P$ and the learned posterior $Q$. Note that the statement of Theorem 1 appeared in alquier16, but can be recovered easily from lever13.

Theorem 1.
For $t > 0$, a sample $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}$, and a prior distribution $P$ over $\mathbb{R}^d$, with probability at least $1 - \delta$ over the choice of $S \sim D^n$, we have for all $Q$ on $\mathbb{R}^d$:

$\mathbb{E}_{\boldsymbol{\omega} \sim Q}\, \mathcal{L}_D^{\mathbf{x}}(h_{\boldsymbol{\omega}}) \;\leq\; \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, \widehat{\mathcal{L}}_S^{\mathbf{x}}(h_{\boldsymbol{\omega}}) + \frac{1}{t}\left( \mathrm{KL}(Q \| P) + \ln \frac{1}{\delta} \right) + \frac{t}{8n}.$

Proof.
From alquier16, combine Theorem 4.1 and Lemma 1. ∎
Because $\widehat{\mathcal{L}}_S(k_Q) = \frac{1}{n} \sum_{i=1}^n \widehat{\mathcal{L}}_S^{\mathbf{x}_i}(k_Q)$, we obtain the following corollary by the union bound, applying $n$ times Theorem 1 with $\delta/n$.
Corollary 2.
For $t > 0$ and a prior distribution $P$ over $\mathbb{R}^d$, with probability at least $1 - \delta$ over the choice of $S \sim D^n$, we have for all $Q$ on $\mathbb{R}^d$:

$\mathcal{L}_D(k_Q) \;\leq\; \widehat{\mathcal{L}}_S(k_Q) + \frac{1}{t}\left( \mathrm{KL}(Q \| P) + \ln \frac{n}{\delta} \right) + \frac{t}{8n}.$
Pseudo-Posterior for the Bounds.
Since the above theorem is valid for any distribution $Q$, one can compute the bound for any learned posterior distribution. Note that the bound promotes the minimization of a trade-off, parametrized by the constant $t$, between the empirical loss and the divergence between the prior $P$ and the posterior $Q$:

$\min_{Q}\; \Big[\, t\; \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}}) + \mathrm{KL}(Q \| P) \,\Big].$

It is well known that for fixed $P$, $S$ and $t$, the minimum bound value is obtained with the pseudo-Bayesian posterior $Q^*$, such that for all $\boldsymbol{\omega} \in \mathbb{R}^d$,

(10)   $Q^*(\boldsymbol{\omega}) = \frac{P(\boldsymbol{\omega})}{Z} \exp\left( -t\; \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}}) \right),$

where $Z$ is a normalization constant. This trade-off is the same one involved in some other PAC-Bayesian bounds for i.i.d. data (e.g., catoni07). As discussed in zhang06; grunwald2012; germain2016, there is a similarity between the minimization of such PAC-Bayes bounds and the Bayes update rule. Note also that Corollary 2's bound converges to the generalization loss at rate $O(1/\sqrt{n})$ for the parameter choice $t = \sqrt{n}$.
Due to the continuity of the feature space, the pseudo-posterior of Equation (10) is hard to compute. To estimate it, one may use Monte Carlo methods (e.g., dalalyan12) or variational Bayes methods (e.g., alquier16). In this work, we explore a simpler route: we work solely on a discrete probability space.
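On a discrete space of $K$ sampled frequencies with a uniform prior, the pseudo-posterior of Equation (10) has an explicit softmax form. A sketch (the loss values and the $\beta$ parameter are illustrative):

```python
import numpy as np

def pseudo_posterior(losses, beta):
    """Closed-form pseudo-posterior over K sampled frequencies: the
    (uniform) prior weight of each frequency is multiplied by
    exp(-beta * empirical loss), then renormalized."""
    logq = -beta * np.asarray(losses, dtype=float)
    logq -= logq.max()                   # stabilize the exponentials
    q = np.exp(logq)
    return q / q.sum()                   # divide by the normalization constant Z

q = pseudo_posterior([0.2, 0.5, 0.1, 0.9], beta=10.0)
```

The frequency with the smallest empirical loss (here the third one) receives the largest posterior weight, and the ordering of the weights is the reverse of the ordering of the losses.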
4.2 Landmarks-Based Learning
We now propose to leverage the fact that Theorem 1 bounds the kernel function over the distances to a single data point, instead of learning a kernel globally for all data points as in Corollary 2. We thus aim at learning a collection of kernels (which we can also interpret as similarity functions), one for each element of a subset of the training points. We call these training points landmarks. The aim of this approach is to learn a new representation of the input space that maps the data points to compact feature vectors, from which we can learn a simple predictor.
Concretely, along with the learning sample $S$ of $n$ examples from $D$, we consider a landmarks sample $T = \{\mathbf{t}_j\}_{j=1}^L$ of $L$ points, and a prior Fourier transform distribution $P$. For each landmark $\mathbf{t}_j$, we sample $K$ points from $P$, denoted $\Omega_j = \{\boldsymbol{\omega}_j^1, \dots, \boldsymbol{\omega}_j^K\}$. Then, we consider a uniform distribution $P_j$ on the discrete hypothesis set $\{h_{\boldsymbol{\omega}_j^k}\}_{k=1}^K$, such that $P_j(\boldsymbol{\omega}_j^k) = \frac{1}{K}$ for all $k$. We aim at learning a set of kernels $\{k_{Q_j}\}_{j=1}^L$, where each $Q_j$ is obtained from a distinct landmark $\mathbf{t}_j$ with a fixed parameter $\beta > 0$, by computing the pseudo-posterior distribution given by

(11)   $Q_j(\boldsymbol{\omega}_j^k) = \frac{1}{Z_j} \exp\left( -\beta \sqrt{n}\; \widehat{\mathcal{L}}_S^{\mathbf{t}_j}(h_{\boldsymbol{\omega}_j^k}) \right),$

for $k \in \{1, \dots, K\}$; $Z_j$ being the normalization constant. Note that Equation (11) gives the minimum of the bound of Theorem 1 with $t = \beta \sqrt{n}$. That is, it corresponds to the regime where the bound converges. Moreover, similarly to Corollary 2, generalization guarantees are obtained simultaneously for the $L$ computed distributions thanks to the union bound and Theorem 1. Thus, with probability at least $1 - \delta$, for all $j \in \{1, \dots, L\}$:

$\mathbb{E}_{\boldsymbol{\omega} \sim Q_j}\, \mathcal{L}_D^{\mathbf{t}_j}(h_{\boldsymbol{\omega}}) \;\leq\; \mathbb{E}_{\boldsymbol{\omega} \sim Q_j}\, \widehat{\mathcal{L}}_S^{\mathbf{t}_j}(h_{\boldsymbol{\omega}}) + \frac{1}{\beta \sqrt{n}}\left( \mathrm{KL}(Q_j \| P_j) + \ln \frac{L}{\delta} \right) + \frac{\beta}{8 \sqrt{n}},$

where $\mathrm{KL}(Q_j \| P_j) = \sum_{k=1}^K Q_j(\boldsymbol{\omega}_j^k) \ln\big( K\, Q_j(\boldsymbol{\omega}_j^k) \big)$.
Once all the pseudo-posteriors $Q_1, \dots, Q_L$ are computed thanks to Equation (11), our landmarks-based approach maps every sample $\mathbf{x}$ to the vector of learned similarity features

(12)   $\psi(\mathbf{x}) = \Big( \mathbb{E}_{\boldsymbol{\omega} \sim Q_1} h_{\boldsymbol{\omega}}(\mathbf{x}, \mathbf{t}_1), \;\dots,\; \mathbb{E}_{\boldsymbol{\omega} \sim Q_L} h_{\boldsymbol{\omega}}(\mathbf{x}, \mathbf{t}_L) \Big),$

and then learns a linear predictor on the transformed training set. Note that this mapping is not a kernel map anymore; it is somewhat similar to the mapping proposed by BalcanBS08ML; BalcanBS08COLT; zantedeschi2018multiview for a similarity function that is more general than a kernel, but fixed for each landmark.
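The landmarks-based mapping of Equation (12) can be sketched as follows, assuming each learned hypothesis is the trigonometric feature $\cos(\boldsymbol{\omega} \cdot (\mathbf{x} - \mathbf{t}))$ weighted by its pseudo-posterior (function and variable names are illustrative):

```python
import numpy as np

def landmark_map(X, landmarks, omegas_list, Q_list):
    """Map each sample to one learned similarity value per landmark: the
    Q_j-weighted average of cos(omega . (x - t_j)) over the K frequencies
    sampled for landmark t_j, as in Equation (12)."""
    cols = []
    for t, omegas, q in zip(landmarks, omegas_list, Q_list):
        proj = (X - t) @ omegas.T        # (n, K) projections of x - t_j
        cols.append(np.cos(proj) @ q)    # posterior-weighted aggregation
    return np.stack(cols, axis=1)        # (n, L) compact representation

rng = np.random.default_rng(0)
landmarks = [np.zeros(2), np.ones(2)]
omegas_list = [rng.standard_normal((50, 2)) for _ in landmarks]
Q_list = [np.full(50, 1.0 / 50) for _ in landmarks]   # uniform toy posteriors
feats = landmark_map(np.array([[0.0, 0.0]]), landmarks, omegas_list, Q_list)
```

At a landmark itself the projection vanishes and the similarity is exactly 1; elsewhere each feature is a weighted average of cosines and thus lies in $[-1, 1]$.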
5 Learning Kernels (Revisited)
In this section, we present PAC-Bayesian theorems that allow us to directly bound the kernel alignment generalization loss on a “pairwise” probability distribution $D'$, as defined by Equation (8), even if the empirical loss is computed on dependent samples. These bounds suggest a kernel alignment (or kernel learning) strategy similar to the one proposed by SinhaD16.
5.1 Second-Order Kullback-Leibler Bound
The following result is based on the fact that $\widehat{\mathcal{L}}_S(k_Q)$ is an unbiased second-order U-statistic estimator of $\mathcal{L}_{D'}(k_Q)$, allowing us to build on the PAC-Bayesian analysis for U-statistics of lever13. Indeed, the next theorem gives a generalization guarantee on $\mathcal{L}_{D'}(k_Q)$.
Theorem 3 (lever13).
For $\delta \in (0, 1]$ and a prior distribution $P$ over $\mathbb{R}^d$, with probability at least $1 - \delta$ over the choice of $S \sim D^n$, we have for all $Q$ on $\mathbb{R}^d$:

$\mathcal{L}_{D'}(k_Q) \;\leq\; \widehat{\mathcal{L}}_S(k_Q) + \sqrt{ \frac{ \mathrm{KL}(Q \| P) + \ln \frac{2 \sqrt{n}}{\delta} }{ 2 \lfloor n/2 \rfloor } }.$
Proof.
See Theorem 7 of lever13, using the fact that the loss function lies in $[0, 1]$. ∎
5.2 Second-Order Bounds for f-Divergences
In the following, we build on a recent result of alquier2018 to express a new family of PAC-Bayesian bounds for our dependent samples, where the $\mathrm{KL}(Q \| P)$ term is replaced by other divergences. Given a convex function $\varphi$ such that $\varphi(1) = 0$, the corresponding $\varphi$-divergence is given by

$D_{\varphi}(Q \| P) = \mathbb{E}_{\boldsymbol{\omega} \sim P}\; \varphi\!\left( \frac{Q(\boldsymbol{\omega})}{P(\boldsymbol{\omega})} \right).$

The following theorem applies to divergences such that $\varphi(x) = \varphi_q(x) = x^q - 1$, with $q > 1$.
Theorem 4.
For $q > 1$, $p = \frac{q}{q-1}$, and a prior distribution $P$ over $\mathbb{R}^d$, with probability at least $1 - \delta$ over the choice of $S \sim D^n$, we have for all $Q$ on $\mathbb{R}^d$:

$\mathcal{L}_{D'}(k_Q) \;\leq\; \widehat{\mathcal{L}}_S(k_Q) + \left( \frac{M_p}{\delta} \right)^{\!1/p} \Big( D_{\varphi_q}(Q \| P) + 1 \Big)^{\!1/q},$

where $M_p = \mathbb{E}_{\boldsymbol{\omega} \sim P}\, \mathbb{E}_{S \sim D^n} \big| \mathcal{L}_{D'}(h_{\boldsymbol{\omega}}) - \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}}) \big|^p$.
Proof.
Adapt Theorem 1 of alquier2018 to the U-statistic estimator $\widehat{\mathcal{L}}_S$, which is unbiased for $\mathcal{L}_{D'}$. ∎
As a particular case, with $q = p = 2$, we obtain from Theorem 4 a bound that relies on the chi-square divergence $\chi^2(Q \| P) = D_{\varphi_2}(Q \| P)$:

Corollary 5.
Given a prior distribution $P$ over $\mathbb{R}^d$, with probability at least $1 - \delta$ over the choice of $S \sim D^n$, we have for all $Q$ on $\mathbb{R}^d$:

$\mathcal{L}_{D'}(k_Q) \;\leq\; \widehat{\mathcal{L}}_S(k_Q) + \sqrt{ \frac{M_2}{\delta} \Big( \chi^2(Q \| P) + 1 \Big) },$

where $M_2 = \mathbb{E}_{\boldsymbol{\omega} \sim P}\, \mathbb{E}_{S \sim D^n} \big( \mathcal{L}_{D'}(h_{\boldsymbol{\omega}}) - \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}}) \big)^2$.
It is noteworthy that the above result resembles other PAC-Bayesian bounds based on the chi-square divergence in the i.i.d. setting, such as those of honorio14, graalaistats16 and alquier2018.
5.3 PAC-Bayesian Interpretation of Kernel Alignment Optimization
SinhaD16 propose a kernel learning algorithm that weights random kernel features. To do so, their algorithm solves a kernel alignment problem. As explained below, this method is consistent with the PAC-Bayesian theory exposed in the present work.
Kernel alignment algorithm.
Let us consider a Fourier transform distribution $P$, from which $K$ points are sampled, denoted $\Omega = \{\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_K\}$. Then, consider a uniform distribution $P_\Omega$ on the discrete hypothesis set $\{h_{\boldsymbol{\omega}_k}\}_{k=1}^K$, such that $P_\Omega(\boldsymbol{\omega}_k) = \frac{1}{K}$ for all $k$. Given a dataset $S$, and constant parameters $\epsilon > 0$ and $q > 1$, the optimization algorithm proposed by SinhaD16 solves the following problem:

(17)   $\max_{Q}\; \sum_{i=1}^n \sum_{j \neq i} y_i\, y_j\; \mathbb{E}_{\boldsymbol{\omega} \sim Q}\, h_{\boldsymbol{\omega}}(\mathbf{x}_i, \mathbf{x}_j),$

(18)   such that $D_{\varphi_q}(Q \| P_\Omega) \leq \epsilon.$

The iterative procedure proposed by SinhaD16 finds an approximately optimal solution to the above problem efficiently. The solution provides a learned kernel

$k_Q(\mathbf{x}, \mathbf{x}') = \sum_{k=1}^K Q(\boldsymbol{\omega}_k)\, h_{\boldsymbol{\omega}_k}(\mathbf{x}, \mathbf{x}').$

SinhaD16 propose to use the above alignment method to reduce the number of features needed compared to the classical RFF procedure (as described in Section 2). Albeit this method is a kernel learning one, empirical experiments show that, with a large number of random features, the classical RFF procedure achieves equally good prediction accuracy. However, one can draw (with replacement) features from $\Omega$ according to $Q$. For a relatively small number of drawn features, learning a linear predictor on the resulting random feature vector (such as the one presented by Equation (5)) achieves better results than the classical RFF method with the same number of random features.
PAC-Bayesian interpretation.
The optimization problem of Equations (17–18) deals with the same trade-off as the one promoted by Theorem 4. Indeed, maximizing Equation (17) amounts to minimizing the empirical alignment loss, while the constraint of Equation (18) controls the divergence $D_{\varphi_q}(Q \| P_\Omega)$, which is the same complexity measure as the one involved in Theorem 4. Furthermore, the empirical experiments performed by SinhaD16 focus on the chi-square divergence (the case $q = 2$), which corresponds to tackling the trade-off expressed by Corollary 5.
5.4 Greedy Kernel Learning
The method proposed by SinhaD16 can easily be adapted to minimize the bound of Theorem 3 instead of the bound of Theorem 4. We describe this kernel learning procedure below.
Given a Fourier transform prior distribution $P$, we sample $K$ points $\Omega = \{\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_K\}$ from $P$, and consider the uniform distribution $P_\Omega$ over $\Omega$. Given a dataset $S$ and a constant parameter $\beta > 0$, we compute the pseudo-posterior

(19)   $Q(\boldsymbol{\omega}_k) = \frac{1}{Z} \exp\left( -\beta \sqrt{n}\; \widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}_k}) \right),$

for $k \in \{1, \dots, K\}$, where $Z$ is the normalization constant. Then, we sample with replacement features from $\Omega$ according to the pseudo-posterior $Q$. The sampled features are used to map every $\mathbf{x}$ of the training set into a new vector according to Equation (5). The latter transformed dataset is then given as input to a linear learning procedure.
In summary, this learning method is strongly inspired by the one described in Section 5.3, but the posterior computation phase is faster, as we benefit from a closed-form expression (Equation (19) versus Equations (17–18)). Once $\widehat{\mathcal{L}}_S(h_{\boldsymbol{\omega}_k})$ is computed for all $k$, we can vary the parameter $\beta$ and get a new posterior in $O(K)$ steps.
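Putting the pieces together, the greedy procedure above can be sketched as follows. This is a hypothetical implementation: the linear pairwise loss and all names are illustrative, not the authors' code, and $\beta\sqrt{n}$ is used as the exponent scale as in the closed-form pseudo-posterior:

```python
import numpy as np

def greedy_kernel_features(X, y, omegas, beta, n_keep, rng):
    """Score each sampled frequency by its empirical alignment loss over
    all training pairs, form the closed-form pseudo-posterior of
    Equation (19), then resample n_keep frequencies from it."""
    n = len(X)
    Y = np.outer(y, y)                        # pairwise label similarity
    proj = X @ omegas.T                       # (n, K) projections
    losses = np.empty(omegas.shape[0])
    for k in range(omegas.shape[0]):
        # h_k(x_i, x_j) = cos(omega_k . (x_i - x_j))
        H = np.cos(proj[:, k, None] - proj[None, :, k])
        L = (1.0 - Y * H) / 2.0               # linear loss in [0, 1]
        losses[k] = (L.sum() - np.trace(L)) / (n * (n - 1))
    logq = -beta * np.sqrt(n) * losses        # closed-form pseudo-posterior
    q = np.exp(logq - logq.max())
    q /= q.sum()
    keep = rng.choice(omegas.shape[0], size=n_keep, replace=True, p=q)
    return omegas[keep], q

rng = np.random.default_rng(0)
X = np.array([[0.0], [0.0], [np.pi], [np.pi]])
y = np.array([1, 1, -1, -1])
omegas = np.array([[1.0], [0.0]])             # one useful, one useless frequency
kept, q = greedy_kernel_features(X, y, omegas, beta=5.0, n_keep=3, rng=rng)
```

In the toy usage above, the frequency $\omega = 1$ separates the two classes perfectly (zero alignment loss), so the pseudo-posterior concentrates on it and the resampling almost always returns copies of that frequency.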
6 Experiments
All experiments use a Gaussian (a.k.a. RBF) kernel of parameter $\gamma$:

$k(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \gamma^2} \right),$

for which the Fourier transform is given by

(20)   $p(\boldsymbol{\omega}) = \left( \frac{\gamma}{\sqrt{2\pi}} \right)^{\!d} \exp\left( -\frac{\gamma^2 \|\boldsymbol{\omega}\|^2}{2} \right).$

Apart from the toy experiment of Figure 1, the experiments on real data are conducted by splitting the available data into a training set, a validation set and a test set. The kernel parameter $\gamma$ is chosen among a set of candidate values by running an RBF SVM on the training set and keeping the parameter with the best accuracy score on the validation set. This defines the prior distribution given by Equation (20) for all our pseudo-Bayesian methods. Unless otherwise specified, all the other parameters are selected using the validation set. More details about the experimental procedure are given in the supplementary material.
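As a sanity check of Equation (20), averaging $\cos(\boldsymbol{\omega} \cdot \boldsymbol{\delta})$ over frequencies drawn from the Gaussian with covariance $\gamma^{-2} I$ recovers the RBF kernel value (a sketch assuming NumPy; the $\gamma$ and $\boldsymbol{\delta}$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, d, K = 2.0, 4, 200_000
# Frequencies drawn from the Fourier transform of the RBF kernel of
# parameter gamma, i.e. a centered Gaussian with covariance I / gamma^2.
omegas = rng.standard_normal((K, d)) / gamma
delta = np.array([0.5, -0.3, 0.1, 0.2])            # a difference x - x'
monte_carlo = np.cos(omegas @ delta).mean()        # E_p[cos(omega . delta)]
exact = np.exp(-delta @ delta / (2.0 * gamma ** 2))
```

With 200,000 samples the Monte Carlo average matches the closed-form kernel value to well under one percent.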
6.1 Landmarks-Based Learning
We present a study of the learning methodology detailed in Section 4.2.
Toy experiment.
To get some insight into the landmarks-based procedure, we generate a 2D dataset, illustrated by Figure 1. We randomly select five training points as landmarks, and compare the two procedures described below.

RBF-Landmarks: Learn a linear SVM on the empirical kernel map given by the five RBF kernels centered on the landmarks. That is, each $\mathbf{x}$ is mapped to $\big( k(\mathbf{x}, \mathbf{t}_1), \dots, k(\mathbf{x}, \mathbf{t}_5) \big)$.

PB-Landmarks: Generate random samples according to the Fourier transform of Equation (20). For every landmark, learn a similarity measure thanks to Equation (11) (with a fixed $\beta$), minimizing the PAC-Bayesian bound. We thus obtain five posterior distributions, and learn a linear SVM on the mapped training set obtained by Equation (12).
Table 1: Error rates of the SVM (full Gram matrix) and of the four landmarks-based approaches (RBF and three PB variants).

Dataset   SVM    RBF    PB     PB     PB
ads       3.05   10.98  4.88   5.12   5.00
adult     19.70  19.60  17.99  17.99  17.99
breast    4.90   6.99   3.50   3.50   2.80
farm      11.58  17.47  15.73  14.19  15.73
mnist17   0.34   0.74   0.42   0.32   0.32
mnist49   1.16   2.26   1.80   2.09   2.50
mnist56   0.55   0.97   1.06   1.55   1.03
Hence, the RBF-Landmarks method corresponds to the prior, from which the PB-Landmarks procedure learns one posterior per landmark. The rightmost plots of Figure 1 show that the PB-Landmarks setting successfully finds a representation from which the linear SVM can predict well.
Experiments on real data.
We conduct experiments similar to the one explained above on seven real binary classification datasets.
Figure 2 studies the behavior of the approaches according to the number of selected landmarks. We select a percentage of the training points as landmarks, and we compare the classification error of a linear SVM on the mapping obtained by the original RBF functions (as in the RBF-Landmarks method above) with the mapping obtained by learning the landmarks posterior distributions (PB-Landmarks method). We also compare the case where the landmarks are selected at random among the training data (curves postfixed “R”) to another scenario where we use the centroids obtained with a k-means clustering algorithm as landmarks (curves postfixed “C”). Note that the latter case is not rigorously backed by our PAC-Bayesian theorems, since the choice of landmarks is now dependent on the whole observed training set. The results show that the classification errors of both cases are similar, but the clustering strategy leads to a more stable behavior, probably because the landmarks are more representative of the original space. Moreover, the pseudo-Bayesian method improves the results on almost all datasets.
Table 1 compares the error rate of an SVM (trained with the full Gram matrix and a kernel parameter $\gamma$ properly selected on the validation set) with four landmarks-based approaches: (RBF) the landmarks are RBF kernels of parameter $\gamma$; (PB) the PB-Landmarks approach where both the number of features per landmark $K$ and the parameter $\beta$ are selected using the validation set; (PB) the PB-Landmarks approach where $K$ is fixed and $\beta$ is selected by validation; and (PB) the PB-Landmarks approach where $\beta$ is fixed and $K$ is selected by validation. For all landmarks-based approaches, we select the landmarks by clustering, and use a fraction of the training set size as the number of landmarks; we want to study the methods in the regime where they provide relatively compact representations. The results show that learning the posterior improves on RBF-Landmarks (except on “mnist56”) and that validating both the $K$ and $\beta$ parameters is not mandatory to obtain satisfactory results. The RBF SVM is generally the best (except on “breast” and “mnist17”), but requires a far less compact representation of the data, as it uses the full Gram matrix.
6.2 Greedy Kernel Learning
Figure 3 presents a study of the kernel learning method detailed in Section 5.4, inspired by the one of SinhaD16. We first generate random features according to Equation (4), and we learn a posterior using two strategies: (OK-RFF) the original optimized kernel of SinhaD16 given by Equations (17–18), where the parameter $\epsilon$ is selected on the validation set; and (PB-RFF) the pseudo-posterior given by Equation (19), where $\beta$ is selected on the validation set. For both obtained posteriors, we subsample an increasing number of features to create the mapping given by Equation (5), on which we learn a linear SVM. We also compare to (RFF) the standard random Fourier features as described in Section 2, with features randomly selected according to the prior $p$.
We see that our PB-RFF approach behaves similarly to OK-RFF, with a slight advantage for the latter. However, we recall that computing the posterior of the former method is faster. Both kernel learning methods achieve better accuracy than the classical RFF algorithm for a small number of random features, and a similar one for a large number of random features.
7 Conclusion and Perspectives
We elaborated an original viewpoint on the random Fourier features proposed by rahimi07 to approximate a kernel. By looking at the Fourier transform as a prior distribution over trigonometric functions, we presented two kinds of generalization theorems for random Fourier features, which bound a loss function assessing the quality of the kernel alignment. Based on classical first-order PAC-Bayesian results, we derived a landmarks-based strategy that learns a compact representation of the data. Then, we proposed two second-order generalization bounds. The first one is based on the U-statistic theorem of lever13. The second one is a new PAC-Bayesian theorem for f-divergences (replacing the usual KL divergence term). We showed that the latter bound provides a theoretical justification for the kernel alignment method of SinhaD16, and we also empirically evaluated a similar but simpler algorithm where the alignment distribution is obtained by the PAC-Bayesian pseudo-posterior closed-form expression.
We believe that considering the Fourier transform of a kernel as a (pseudo-)Bayesian prior can lead to other contributions than the ones explored here. Among them, it might open new perspectives on representation and metric learning. Another interesting perspective would be to extend our study to wavelet transforms (Mallatbook).
References
Appendix A Supplementary Material
A.1 Mathematical Results
Lemma 6.
For any data-generating distribution $D$:
Proof.
Given , we denote
The function above has the bounded differences property. That is, for each :
Thus, we apply the Efron-Stein inequality (following boucheron13, Corollary 3.2) to obtain
∎
A.2 Experiments
In Section 6 we use the following datasets:
- ads: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements
  The first four features, which have missing values, are removed.
- adult
- breast
- farm
- mnist: http://yann.lecun.com/exdb/mnist/
  Following SinhaD16, binary classification tasks are compiled with the following digit pairs: 1 vs. 7, 4 vs. 9, and 5 vs. 6.
We split the datasets into a training and a testing set with a 75/25 ratio, except for adult, which comes with a predefined training/test split. We then use 20% of the training set for validation. Table 2 presents an overview. The remaining parameters are selected on the validation set from predefined ranges of values.
Table 2: Dataset overview (training, validation and test set sizes, and number of features).

Dataset   train   valid   test    features
ads       1967    492     820     1554
adult     26048   6513    16281   108
breast    340     86      143     30
farm      2485    622     1036    54877
mnist17   9101    2276    3793    784
mnist49   8268    2068    3446    784
mnist56   7912    1979    3298    784