Kernel methods (shawe-taylor-04-book), such as support vector machines (boser-92; vapnik-98), map data into a high-dimensional space in which a linear predictor can solve the learning problem at hand. The mapped space is never explicitly computed: the linear predictor is represented implicitly thanks to a kernel function. This is the powerful kernel trick: the kernel function computes the scalar product between two data points in this high-dimensional space. However, kernel methods notoriously suffer from two drawbacks. On the one hand, computing all the scalar products between the learning samples is costly: O(n²) for many kernel-based methods, where n is the number of training data points. On the other hand, one has to select a kernel function adapted to the learning problem for the algorithm to succeed.
The first of these drawbacks has motivated the development of approximation methods making kernel methods more scalable, such as the Nyström approximation (williams2001nystrom; drineas2005nystrom), which constructs a low-rank approximation of the Gram matrix (the matrix of kernel values computed on all pairs of learning samples) and is data dependent, or random Fourier features (RFF) (rahimi-07), which approximate the kernel with random features based on the Fourier transform and are not data dependent (a comparison between the two approaches has been conducted by nystromVSrff). In this paper, we revisit the latter technique.
We start from the observation that a predictor based on kernel Fourier features can be interpreted as a weighted combination of those features according to a data-independent distribution defined by the Fourier transform. We introduce an original viewpoint, where this distribution is interpreted as a prior distribution over a space of weak hypotheses, each hypothesis being a simple trigonometric function obtained by the Fourier decomposition. This suggests that one can improve the approximation by adapting this distribution to the data points: we aim at learning a posterior distribution. By this means, our study proposes strategies to learn a representation of the data. While this representation is not as flexible and powerful as the ones that can be learned by deep neural networks (Goodfellow-16-book), we think it is worthwhile to study this strategy as a step toward solving the second drawback of kernel methods, which currently heavily rely on the kernel choice.
With this in mind, and while the majority of the work related to random Fourier features focuses on the study and improvement of the kernel approximation, we propose here a reinterpretation in the light of the PAC-Bayesian theory (mcallester-99; catoni-07). We derive generalization bounds that can be straightforwardly optimized by learning a pseudo-posterior through a closed-form expression.
The rest of the paper is organized as follows. Section 2 recalls the RFF setting. Section 3 expresses the Fourier transform as a prior, leading (i) to a first PAC-Bayesian analysis and a landmarks-based algorithm in Section 4, and (ii) to another PAC-Bayesian analysis in Section 5 that justifies the kernel alignment method of SinhaD16 and suggests a greedy kernel learning method. Section 6 then provides experiments demonstrating the usefulness of our work.
2 Random Fourier Features
Consider a classification problem where we want to learn a predictor h : X → Y, from a d-dimensional input space X ⊆ R^d to a discrete output space Y (e.g., Y = {−1, 1}). The learning algorithm is given a training set S = {(x_i, y_i)}_{i=1}^n of n samples drawn from D, the data-generating distribution over X × Y. We consider a positive-semidefinite (PSD) kernel k : X × X → R. Kernel machines learn predictors of the form h(x) = Σ_{i=1}^n α_i k(x_i, x) by optimizing the values of the weight vector α = (α_1, …, α_n).
When n is large, running a kernel machine algorithm (like SVM or kernel ridge regression) is expensive in memory and running time. To circumvent this problem, rahimi-07 introduced random Fourier features as a way to approximate the value of a shift-invariant kernel, i.e., a kernel whose value depends only on the difference between its arguments, which we write k(x, x') and k(x − x') interchangeably. Let the distribution p be the Fourier transform of the shift-invariant kernel k,
Now, by writing k as the inverse Fourier transform of p, and using trigonometric identities, we obtain:
rahimi-07 suggest expressing the above as a product of two features. One way to achieve this is to map every input example x into φ_ω(x) = (cos(ω · x), sin(ω · x)). The random variable φ_ω(x) · φ_ω(x'), with ω drawn from p, is an unbiased estimate of k(x, x'). Indeed, we recover from Equation (3) and Equation (4):
To reduce the variance of the estimation of k(x, x'), the idea is to sample K points ω_1, …, ω_K from p. Then, each training sample x is mapped to a new feature vector in R^{2K}:
Thus, when K is "large enough", we have
This provides a decomposition of the PSD kernel that differs from the classical one (as discussed in bach-17-equivalence). By learning a linear predictor on the transformed training set through an algorithm like a linear SVM, we recover a predictor equivalent to the one learned by a kernelized algorithm. That is, we learn a weight vector w and we predict the label of a sample x by computing w · φ(x) in place of Equation (1).
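The RFF procedure recalled above can be sketched in a few lines of NumPy; this is a minimal illustration under our notation (helper names are ours, not from rahimi-07), using the RBF kernel whose Fourier transform is a Gaussian:

```python
import numpy as np

def rff_map(X, omegas):
    """Map samples to random Fourier features [cos, sin] scaled by 1/sqrt(K)."""
    proj = X @ omegas.T                                  # (n, K) projections
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
sigma = 1.0
X = rng.normal(size=(5, 3))                              # 5 toy points in R^3
# For an RBF kernel of bandwidth sigma, the Fourier transform is N(0, I / sigma^2).
omegas = rng.normal(scale=1.0 / sigma, size=(2000, 3))
Z = rff_map(X, omegas)
gram_approx = Z @ Z.T                                    # approximates the Gram matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
gram_exact = np.exp(-sq_dists / (2 * sigma ** 2))
```

With K = 2000 sampled features, the entries of `gram_approx` match the exact RBF Gram matrix up to Monte Carlo error, and a linear predictor trained on `Z` emulates a kernelized one.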
3 The Fourier Transform as a Prior Distribution
As described in the previous section, the random Fourier features trick was introduced to reduce the running time of kernel learning algorithms. Consequently, most of the subsequent works study and/or improve the properties of the kernel approximation (e.g., Yu16; Rudi17; bach-17-equivalence; Choromanski18), with some notable exceptions, such as the kernel learning algorithm of SinhaD16 that we discuss and relate to our approach in Section 5.3.
We aim at reinterpreting the Fourier transform (i.e., the distribution p of Equation (2)) as a prior distribution over the feature space. It can be seen as an alternative representation of the prior knowledge encoded in the choice of a specific kernel function; we denote this prior P from now on. In accordance with Equation (3), each feature obtained from a vector ω can be seen as a hypothesis h_ω. From this point of view, the kernel k is interpreted as a predictor performing a P-weighted aggregation of weak hypotheses. This alternative interpretation of the distribution P as a prior over weak hypotheses naturally suggests learning a posterior distribution Q over the same hypotheses. That is, we seek a distribution Q giving rise to a new kernel k_Q.
In order to assess the quality of the kernel k_Q, we define a loss function based on the consideration that its output should be high when two samples share the same label, and low otherwise. Hence, we evaluate the kernel on two samples (x, y) and (x', y') through a linear loss, where the difference x − x' plays the role of a pairwise distance and the product yy' plays the role of a pairwise similarity measure:
Furthermore, we define the kernel alignment generalization loss of k_Q on a "pairwise" probability distribution Δ, defined over pairwise distances and similarities, as follows. Note that any data-generating distribution D over the input-output spaces automatically gives rise to a "pairwise" distribution Δ. By a slight abuse of notation, we write the corresponding generalization loss, and the associated kernel alignment empirical loss is defined as an average over the pairs of training examples, where each pair of examples (x_i, y_i) and (x_j, y_j) contributes its pairwise distance x_i − x_j and similarity y_i y_j.
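As an illustration, one plausible instantiation of this empirical alignment loss, linear in the kernel values and rescaled to [0, 1], can be sketched as follows (the exact form and scaling of the paper's loss may differ; the function name is ours):

```python
import numpy as np

def alignment_loss(K_mat, y):
    """Empirical kernel alignment loss over all pairs i != j:
    mean of (1 - y_i * y_j * k(x_i, x_j)) / 2, assuming |k| <= 1.
    Low when the kernel is high for same-label pairs and low otherwise."""
    Y = np.outer(y, y)                      # pairwise label similarity y_i * y_j
    off = ~np.eye(len(y), dtype=bool)       # exclude the i == j pairs
    return np.mean((1.0 - Y[off] * K_mat[off]) / 2.0)
```

A perfectly aligned kernel (one that equals y_i y_j on every pair) reaches loss 0, while a maximally misaligned one reaches loss 1.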
Starting from this reinterpretation of the Fourier transform, we provide in the rest of the paper two PAC-Bayesian analyses. The first one (Section 4) is obtained by combining PAC-Bayesian bounds: instead of considering all the possible pairs of data points, we fix one point and study the generalization ability over all the pairs involving it. The second analysis (Section 5) is based on the fact that the loss can be expressed as a second-order U-statistic.
4 PAC-Bayesian Analysis and Landmarks
Due to the linearity of the loss function, we can rewrite the loss of k_Q as the Q-average loss of the individual hypotheses. Indeed, Equation (8) becomes
The above Q-expectation of losses turns out to be the quantity bounded by most PAC-Bayesian generalization theorems (this quantity is sometimes referred to as the Gibbs risk in the PAC-Bayesian literature), except that such results usually apply to the loss over samples instead of pairwise distances. Hence, we use PAC-Bayesian bounds to obtain generalization guarantees on the loss from its empirical estimate of Equation (9), which we can rewrite as
However, the classical PAC-Bayesian theorems cannot be applied directly to bound this quantity, as the empirical loss would need to be computed from independent observations of Δ. Instead, the empirical loss involves dependent samples, as it is computed from the pairs formed by elements of the training set S.
4.1 First Order Kullback-Leibler Bound
A straightforward approach to apply classical PAC-Bayesian results is to bound separately the loss associated with each training sample. That is, for each sample (x_i, y_i), we define
Thus, the next theorem gives a generalization guarantee on this per-sample loss, relying on its empirical estimate and on the Kullback-Leibler divergence KL(Q‖P) between the prior P and the learned posterior Q. Note that the statement of Theorem 1 appeared in alquier-16, but can be recovered easily from lever-13.
For any t > 0, any i ∈ {1, …, n}, and any prior distribution P over R^d, with probability at least 1 − ε over the choice of the training set S, we have for all Q on R^d:
From alquier-16, combine Theorem 4.1 and Lemma 1. ∎
Because the generalization loss is the average of the per-sample losses, we obtain the following corollary by the union bound, applying Theorem 1 n times with confidence parameter ε/n.
For any t > 0 and any prior distribution P over R^d, with probability at least 1 − ε over the choice of the training set S, we have for all Q on R^d:
Pseudo-Posterior for KL-Bounds.
Since the above theorem is valid for any distribution Q, one can compute the bound for any learned posterior distribution. Note that the bound promotes the minimization of a trade-off, parametrized by the constant t, between the empirical loss and the KL-divergence between the prior P and the posterior Q:
It is well known that, for fixed t and S, the minimum bound value is obtained with the pseudo-Bayesian posterior Q* such that, for all ω, Q*(ω) ∝ P(ω) exp(−t L̂(h_ω)), where L̂(h_ω) denotes the empirical loss of hypothesis h_ω and the proportionality hides a normalization constant. This trade-off is the same one involved in some other PAC-Bayesian bounds (e.g., catoni-07). As discussed in zhang-06, grunwald-2012, and germain-2016, there is a similarity between the minimization of such PAC-Bayes bounds and the Bayes update rule. Note also that the bound of Corollary 2 converges to the generalization loss at rate O(1/√n) for the parameter choice t = √n.
Due to the continuity of the feature space, the pseudo-posterior of Equation (10) is hard to compute. To estimate it, one may use Monte Carlo methods (e.g., dalalyan12) or variational Bayes methods (e.g., alquier-16). In this work, we explore a simpler route: we work with a discrete probability space.
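On a discrete set of sampled features, the pseudo-posterior update takes a simple closed form: reweight the prior by the exponentiated negative empirical losses and renormalize. A minimal sketch (function name ours):

```python
import numpy as np

def pseudo_posterior(emp_losses, prior, t):
    """Closed-form pseudo-posterior: Q(omega_k) proportional to
    P(omega_k) * exp(-t * empirical_loss_k), computed in log-space."""
    log_q = np.log(prior) - t * emp_losses
    log_q -= log_q.max()                    # shift for numerical stability
    q = np.exp(log_q)
    return q / q.sum()
```

Setting t = 0 recovers the prior, while larger values of t concentrate the mass on the features with the lowest empirical loss.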
4.2 Landmarks-Based Learning
We now propose to leverage the fact that Theorem 1 bounds the kernel function for the distances to a single data point, instead of learning a kernel globally for all data points as in Corollary 2. We thus aim at learning a collection of kernels (which we can also interpret as similarity functions), one for each point in a subset of the training points. We call these training points landmarks. The aim of this approach is to learn a new representation of the input space that maps the data points into compact feature vectors, from which we can learn a simple predictor.
Concretely, along with the learning sample S of n examples, we consider a landmark sample of L points, and a prior Fourier transform distribution p. For each landmark x_l, we sample K points from p, denoted ω_1^l, …, ω_K^l. Then, we consider a uniform distribution on the corresponding discrete hypothesis set. We aim at learning a set of L kernels, where each kernel is obtained from a distinct landmark with a fixed parameter t, by computing the pseudo-posterior distribution Q_l given by
Q_l(ω_k^l) = (1/Z_l) exp(−t L̂_l(h_{ω_k^l})),
for k = 1, …, K, Z_l being the normalization constant. Note that Equation (11) gives the minimizer of the bound of Theorem 1; that is, it corresponds to the regime where the bound converges. Moreover, similarly to Corollary 2, generalization guarantees are obtained simultaneously for the L computed distributions thanks to the union bound and Theorem 1. Thus, with probability at least 1 − ε, for all landmarks:
Once all L pseudo-posteriors are computed through Equation (11), our landmarks-based approach maps samples to similarity features, and learns a linear predictor on the transformed training set. Note that this mapping is no longer a kernel map, and is somewhat similar to the mapping proposed by BalcanBS08ML; BalcanBS08COLT; zantedeschi2018multiview for a similarity function that is more general than a kernel but fixed for each landmark.
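A sketch of the resulting landmarks-based mapping, assuming each learned similarity is the Q_l-weighted aggregation of cosine hypotheses evaluated on the distance to the landmark (function and variable names are ours):

```python
import numpy as np

def landmark_features(X, landmarks, omegas_per_lm, q_per_lm):
    """Map each sample x to the vector of learned similarities
    (sum_k q_k * cos(omega_k . (x - x_l))) over all landmarks x_l."""
    cols = []
    for x_l, omegas, q in zip(landmarks, omegas_per_lm, q_per_lm):
        delta = X - x_l                      # differences to the landmark
        cols.append(np.cos(delta @ omegas.T) @ q)
    return np.stack(cols, axis=1)            # shape (n_samples, n_landmarks)
```

The output has one column per landmark, so the representation stays compact (its dimension is the number of landmarks), and any linear learner can be trained on it.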
5 Learning Kernel (Revisited)
In this section, we present PAC-Bayesian theorems that directly bound the kernel alignment generalization loss on a "pairwise" probability distribution Δ (as defined by Equation (8)), even though the empirical loss is computed on dependent samples. These bounds suggest a kernel alignment (or kernel learning) strategy similar to the one proposed by SinhaD16.
5.1 Second Order Kullback-Leibler Bound
The following result is based on the fact that the empirical loss is an unbiased second-order estimator of the generalization loss, allowing us to build on the PAC-Bayesian analysis for U-statistics of lever-13. Indeed, the next theorem gives a generalization guarantee on the kernel alignment loss.
Theorem 3 (lever-13).
For any t > 0 and any prior distribution P over R^d, with probability at least 1 − ε over the choice of the training set S, we have for all Q on R^d:
See Theorem 7 of lever-13, using the fact that the loss function is bounded. ∎
5.2 Second Order Bounds for f-Divergences
In the following, we build on a recent result of alquier-2018 to express a new family of PAC-Bayesian bounds for our dependent samples, where the KL(Q‖P) term is replaced by other f-divergences.
Given a convex function f such that f(1) = 0, the corresponding f-divergence is given by
The following theorem applies to f-divergences D_f(Q‖P) whenever the divergence is finite.
For any prior distribution P over R^d, with probability at least 1 − ε over the choice of the training set S, we have for all Q on R^d:
We start from the result of alquier-2018, and bound the remaining moment term as follows. Line (14) is obtained by Jensen's inequality, and the inequality of Line (16) is proven by Lemma 6 of the supplementary material; the latter is based on the Efron-Stein inequality (following boucheron-13).
As a particular case, with f(x) = (x − 1)², we obtain from Theorem 4 a bound that relies on the chi-square divergence:
Given a prior distribution P over R^d, with probability at least 1 − ε over the choice of the training set S, we have for all Q on R^d:
It is noteworthy that the above result resembles other PAC-Bayesian bounds based on the chi-square divergence in the i.i.d. setting, such as the ones of honorio-14, graal-aistats16, and alquier-2018.
5.3 PAC-Bayesian Interpretation of Kernel Alignment Optimization
SinhaD16 propose a kernel learning algorithm that weights random kernel features. To do so, their algorithm solves a kernel alignment problem. As explained below, this method is coherent with the PAC-Bayesian theory exposed in the present work.
Kernel alignment algorithm.
Let us consider a Fourier transform distribution p, from which K points are sampled, denoted ω_1, …, ω_K. Then, consider a uniform distribution P on the corresponding discrete hypothesis set. Given a dataset S and constant parameters, the optimization algorithm proposed by SinhaD16 solves the following problem.
The iterative procedure proposed by SinhaD16 finds an ε-suboptimal solution to the above problem in O(log(1/ε)) steps. The solution provides a learned kernel
SinhaD16 propose to use the above alignment method to reduce the number of features needed compared to the classical RFF procedure (as described in Section 2). Although this is a kernel learning method, empirical experiments show that, with a large number of random features, the classical RFF procedure achieves prediction accuracy that is just as good. However, one can draw (with replacement) a reduced set of features according to the learned distribution. For a relatively small number of drawn features, learning a linear predictor on the resulting random feature vector (such as the one presented by Equation (5)) achieves better results than the classical RFF method with the same number of random features.
The optimization problem of Equations (17–18) deals with the same trade-off as the one promoted by Theorem 4. Indeed, maximizing Equation (17) amounts to minimizing the empirical alignment loss, while the constraint of Equation (18) controls the f-divergence between the posterior and the prior, which is the same complexity measure involved in Theorem 4. Furthermore, the empirical experiments performed by SinhaD16 focus on the chi-square divergence (the case f(x) = (x − 1)²), which corresponds to tackling the trade-off expressed by Corollary 5.
5.4 Greedy Kernel Learning
Given a Fourier transform prior distribution p, we sample K points ω_1, …, ω_K and consider the corresponding discrete hypothesis set under a uniform prior. Given a dataset S and a constant parameter t, we compute the pseudo-posterior
Then, we sample with replacement a reduced number of features according to the pseudo-posterior. The sampled features are used to map every sample of the training set into a new vector according to Equation (5). The transformed dataset is then given as input to a linear learning procedure.
In summary, this learning method is strongly inspired by the one described in Section 5.3, but the posterior computation phase is faster, as we benefit from a closed-form expression (Equation (19) versus Equations (17–18)). Moreover, once the empirical loss of each feature is computed, we can vary the parameter t and get a new posterior in O(K) steps.
6 Experiments
All experiments use a Gaussian (i.e., RBF) kernel of variance σ², k(x, x') = exp(−‖x − x'‖² / (2σ²)), for which the Fourier transform is given by a Gaussian distribution of variance 1/σ² in each dimension:
Apart from the toy experiment of Figure 1, the experiments on real data are conducted by splitting the available data into a training set, a validation set, and a test set. The kernel parameter σ is chosen among a predefined range of values by running an RBF SVM on the training set and keeping the value with the best accuracy score on the validation set. This choice defines the prior distribution given by Equation (6) for all our pseudo-Bayesian methods. Unless otherwise specified, all other parameters are selected using the validation set. More details about the experimental procedure are given in the supplementary material.
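The relation between the RBF kernel and its Gaussian Fourier transform, which defines this prior, can be checked numerically; the sketch below (variable names are ours) estimates k(x − x') by Monte Carlo over frequencies sampled from the prior:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, d = 0.8, 2
delta = np.array([0.5, -0.3])                         # a pairwise difference x - x'
# Frequencies sampled from the Fourier transform of the RBF kernel: N(0, I / sigma^2)
omegas = rng.normal(scale=1.0 / sigma, size=(100_000, d))
mc_estimate = np.mean(np.cos(omegas @ delta))         # Monte Carlo estimate of k(delta)
exact = np.exp(-delta @ delta / (2 * sigma ** 2))
```

With 100,000 samples, the Monte Carlo average of cos(ω · δ) matches the exact kernel value up to a small sampling error.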
6.1 Landmarks-Based Learning
We present a study of the learning methodology detailed in Section 4.2.
To get some insight into the landmarks-based procedure, we generate a 2D dataset, illustrated by Figure 1. We randomly select five training points as landmarks, and compare the two procedures described below.
RBF-Landmarks: Learn a linear SVM on the empirical kernel map given by the five RBF kernels centered on the landmarks. That is, each sample x is mapped to its vector of RBF kernel values computed on the five landmarks.
PB-Landmarks: Generate K random samples according to the Fourier transform of Equation (6). For every landmark, learn a similarity measure thanks to Equation (11) (with a fixed parameter t), minimizing the PAC-Bayesian bound. We thus obtain five posterior distributions, and learn a linear SVM on the mapped training set obtained by Equation (12).
Hence, the RBF-Landmarks method corresponds to the prior, from which the PB-Landmarks procedure learns one posterior per landmark. The right-most plots of Figure 1 show that the PB-Landmarks setting successfully finds a representation from which the linear SVM can predict well.
Experiments on real data.
We conduct experiments similar to the one described above on seven real binary classification datasets.
Figure 2 studies the behavior of the approaches according to the number of selected landmarks. We select an increasing percentage of the training points as landmarks, and we compare the classification error of a linear SVM on the mapping obtained with the original RBF functions (as in the RBF-Landmarks method above) with the mapping obtained by learning the landmark posterior distributions (PB-Landmarks method). We also compare the case where the landmarks are selected at random among the training data (curves postfixed "-R") to the scenario where we use the centroids obtained with a k-means clustering algorithm as landmarks (curves postfixed "-C"). Note that the latter case is not rigorously backed by our PAC-Bayesian theorems, since the choice of landmarks is now dependent on the whole observed training set. The results show that the classification errors of both cases are similar, but the clustering strategy leads to a more stable behavior, probably because the landmarks are more representative of the original space. Moreover, the pseudo-Bayesian method improves the results on almost all datasets.
Table 1 compares the error rate of an SVM (trained with the full Gram matrix and a properly selected σ on the validation set) with four landmarks-based approaches: (RBF) the landmarks are RBF kernels of parameter σ; (PB) the PB-Landmarks approach where the number of features per landmark K and the parameter t are both selected using the validation set; (PB) the variant where t is fixed and K is selected by validation; and (PB) the variant where K is fixed and t is selected by validation. For all landmarks-based approaches, we select the landmarks by clustering and use a small fraction of the training set size as the number of landmarks, since we want to study the methods in the regime where they provide relatively compact representations. The results show that learning the posterior improves on the RBF-Landmarks approach (except on "mnist56"), and that validating both the K and t parameters is not mandatory to obtain satisfactory results. The RBF SVM is generally the best (except on "breast" and "mnist17"), but it requires a far less compact representation of the data, as it uses the full Gram matrix.
6.2 Greedy Kernel Learning
Figure 3 presents a study of the kernel learning method detailed in Section 5.4, inspired by the one of SinhaD16. We first generate K random features according to the prior given by Equation (4), and we learn a posterior using two strategies: (OKRFF) the original optimized kernel of SinhaD16 given by Equations (17–18), where the constraint parameter is selected on the validation set; and (PBRFF) the pseudo-posterior given by Equation (19), where t is selected on the validation set. For both obtained posteriors, we subsample an increasing number of features to create the mapping given by Equation (5), on which we learn a linear SVM. We also compare to (RFF) the standard random Fourier features described in Section 2, with features randomly selected according to the prior.
We see that our PBRFF approach behaves similarly to OKRFF, with a slight advantage for the latter. However, we recall that computing the posterior of the former method is faster. Both kernel learning methods achieve better accuracy than the classical RFF algorithm for a small number of random features, and similar accuracy for a large number of random features.
7 Conclusion and Perspectives
We elaborated an original viewpoint on the random Fourier features proposed by rahimi-07 to approximate a kernel. By looking at the Fourier transform as a prior distribution over trigonometric functions, we presented two kinds of generalization theorems for random Fourier features, bounding a loss function that assesses the quality of the kernel alignment. Based on classical first-order PAC-Bayesian results, we derived a landmarks-based strategy that learns a compact representation of the data. Then, we proposed two second-order generalization bounds. The first one is based on the U-statistic theorem of lever-13. The second one is a new PAC-Bayesian theorem for f-divergences (replacing the usual KL-divergence term). We showed that the latter bound provides a theoretical justification for the kernel alignment method of SinhaD16, and we empirically evaluated a similar but simpler algorithm where the alignment distribution is obtained by the PAC-Bayesian pseudo-posterior closed-form expression.
We believe that considering the Fourier transform of a kernel as a (pseudo-)Bayesian prior can lead to other contributions than the ones explored here. Among them, it might open new perspectives on representation and metric learning. Another interesting perspective would be to extend our study to wavelet transforms (Mallat-book).
Appendix A Supplementary Material
a.1 Mathematical Results
For any data-generating distribution :
Given , we denote
The function above has the bounded differences property. That is, for each coordinate:
Thus, we apply the Efron-Stein inequality (following boucheron-13, Corollary 3.2) to obtain
In Section 6 we use the following datasets:
The first 4 features, which have missing values, are removed.
As in SinhaD16, binary classification tasks are compiled with the following digit pairs: 1 vs. 7, 4 vs. 9, and 5 vs. 6.
We split the datasets into a training and a testing set with a 75/25 ratio, except for adult, which has an already computed training/test split. We then use 20% of the training set for validation. Table 2 presents an overview. We use the following parameter value ranges for selection on the validation set: