The Error Probability of Random Fourier Features is Dimensionality Independent

10/27/2017 ∙ by Jean Honorio, et al. ∙ Purdue University

We show that the error probability of reconstructing kernel matrices from Random Fourier Features for any shift-invariant kernel function is at most O(exp(-D)), where D is the number of random features. We also provide a matching information-theoretic, method-independent lower bound of Ω(exp(-D)) for standard Gaussian distributions. Compared to prior work, we are the first to show that the error probability for Random Fourier Features is independent of the dimensionality of the data points as well as the size of their domain. As applications of our theory, we obtain dimension-independent bounds for kernel ridge regression and support vector machines.


1 Introduction

Kernel methods are widely applied in many machine learning algorithms, including the kernel perceptron, support vector machines, principal component analysis, and Gaussian processes. Kernels allow one to convert problems that evaluate explicit feature mappings into problems that evaluate kernel functions, i.e., inner products of feature mappings. Kernel methods are efficient since computing inner products of feature mappings is often computationally cheaper than computing the feature mappings directly. To fully leverage the power of the kernel method, an n × n matrix called the kernel matrix (Gram matrix) must be computed, which does not scale when the number of data points n is large. To cope with this problem, Rahimi and Recht [2007] proposed an algorithm called Random Fourier Features (RFF). RFF approximates the kernel evaluation by an average of Fourier features (cosines of linear projections). This approach is theoretically motivated by Bochner's theorem [Bochner, 1959], which states that any continuous, positive definite and shift-invariant function can be written as the Fourier transform of a nonnegative measure.
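As an illustration of the mechanism (stated here in generic notation of our own choosing, with k the kernel, normalized so that k(0) = 1, and p the probability density given by Bochner's theorem), the representation behind RFF can be sketched as:

    \[
      k(x - y) \;=\; \int_{\mathbb{R}^d} e^{\,i\,\omega^\top (x - y)}\, p(\omega)\, d\omega
      \;=\; \mathbb{E}_{\omega \sim p}\!\left[\cos\!\big(\omega^\top (x - y)\big)\right]
      \;\approx\; \frac{1}{D} \sum_{j=1}^{D} \cos\!\big(\omega_j^\top (x - y)\big),
      \qquad \omega_1, \dots, \omega_D \overset{\text{i.i.d.}}{\sim} p.
    \]

The second equality keeps only the real part of the integral, which is justified because the kernel is real-valued, and the last step is the Monte Carlo approximation that RFF implements with D random features.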

Though RFF is a successful method in practice, its theoretical properties are not yet fully understood. Along with the algorithm, Rahimi and Recht [2007] also analyzed the error probability of reconstructing kernel matrices over any compact data domain; their bound depends on the domain diameter, the number of Fourier features, and the dimensionality of the data points. Their approach is based on covering numbers. Following the work of [Rahimi and Recht, 2007], Sutherland and Schneider [2015] improved the constants in the previous covering-number upper bound and provided results with the same asymptotic behavior. Later, Sriperumbudur and Szabó [2015] proved a different upper bound by using a Rademacher complexity approach.

In this paper, we remove the dependence on the dimensionality of the data points from the error probability, and show that the dependence on the domain diameter is sub-linear. That is, we show that the error probability depends only on the number of Fourier features D. More specifically, we show an upper bound on the error probability of O(exp(-D)). In addition, we also reason about the lower bound, by showing that the minimax bound is Ω(exp(-D)) for any estimator based on Gaussian linear projections.

Previous analyses [Rahimi and Recht, 2007, Sutherland and Schneider, 2015, Sriperumbudur and Szabó, 2015] are agnostic to the distribution of the random variables. This is typical in learning theory, where one relates random variables to an arbitrary data distribution. In our problem, the random variables are Gaussian (used for linear projections), while the data lives in a ball of bounded diameter. Thus, in our analysis, we exploit Gaussianity in order to obtain tighter bounds.

2 Preliminaries

In this section, we introduce some definitions, notations and preliminaries that will be used in the following sections.

2.1 Definitions and Notations

For any vector we denote its -norm by . We denote the compact ball centered at the origin and with radius by:

The above is the domain of the data points considered throughout the paper. For a probability distribution , denote its corresponding probability density function by , any random vector from by , and the expectation with respect to by . For any set of i.i.d. samples from , denote these samples by , its product distribution by , and the expectation with respect to by . Finally, denote any multivariate Gaussian distribution with mean and covariance by , and denote the -dimensional standard Gaussian distribution by .

2.2 Random Fourier Features

Let be two data points, , and let be a nonnegative, continuous and shift-invariant function, that is

By Bochner's theorem [Bochner, 1959], the Fourier transform of this function is a probability density function. We denote such a probability density function by and its distribution by . We have

since only the real part of the above integral is considered. Then RFF draws sample points from , and approximates by

where

(1)

The above result enables us to draw a set of samples from and approximate the kernel value for any pair of data points, without computing the Gram matrix directly. For the performance analysis of RFF, Rahimi and Recht [2007] first provided a theoretical upper bound on the error probability for uniform convergence. Specifically, Claim 1 of [Rahimi and Recht, 2007] shows that the error probability behaves as follows (footnote: [Rahimi and Recht, 2007] defines the features as √2 cos(ω⊤x + b), where b is a random variable with uniform distribution in [0, 2π]):

The above depends on and . Thus, the above bound is . Sutherland and Schneider [2015] improved the upper bound on the error probability with the same features used in [Rahimi and Recht, 2007]. Their upper bound is given in Proposition 1 of [Sutherland and Schneider, 2015] (footnote: [Sutherland and Schneider, 2015] uses the features in (1), but with a different number of features), and is

which has the same asymptotic upper bound as [Rahimi and Recht, 2007]. Later, Sriperumbudur and Szabó [2015] proved the following upper bound (see Appendix A for more details):

The above depends on and . Thus, the above bound is . Contrary to common belief, the above bound is weaker than the bounds in [Rahimi and Recht, 2007, Sutherland and Schneider, 2015]. (See Appendix A for a thorough discussion.)
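To make the setup concrete, the following is a minimal numerical sketch of RFF (ours, not taken from the paper): it assumes the Gaussian kernel exp(-||x - y||²/2), whose spectral distribution is the standard Gaussian, and uses the classical cosine features of [Rahimi and Recht, 2007]; all function and variable names are our own.

    import numpy as np

    def rff_features(X, D, rng):
        # Random Fourier features for the Gaussian kernel exp(-||x - y||^2 / 2),
        # whose spectral distribution is N(0, I_d).
        n, d = X.shape
        W = rng.standard_normal((d, D))              # omega_1, ..., omega_D ~ N(0, I_d)
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)    # random phases, as in Rahimi and Recht [2007]
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    rng = np.random.default_rng(0)
    d, D = 50, 2000
    x = rng.standard_normal(d)
    y = x + 0.1 * rng.standard_normal(d)             # a nearby point, so the kernel value is non-negligible

    exact = np.exp(-0.5 * np.sum((x - y) ** 2))      # true Gaussian kernel value
    Z = rff_features(np.vstack([x, y]), D, rng)
    approx = Z[0] @ Z[1]                             # inner product of the two feature vectors
    print(f"exact {exact:.4f}  approx {approx:.4f}  error {abs(exact - approx):.4f}")

Increasing D reduces the approximation error; the bounds discussed in this paper quantify how fast the worst-case error over a data ball vanishes as D grows.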

3 Sufficient Number of Samples

In this section we prove that the upper bound on the error probability for uniform convergence in RFF is independent of the dimensionality of the data points, and sub-linear with respect to the domain diameter.

Theorem 1.

Let be a set of i.i.d. -dimensional random vectors from . Then:

Proof.

Note and for all . Note that since for every , is independent of then is independent of . Let for all . Furthermore, is a random vector from . Note that for , we have and . Given the above, we have:

(2)

where , and . In order to proceed with a union bound for all , we divide the set into subsets for all , each of them with center . For a particular and , if and if the Lipschitz constant of fulfills , we have:

Next, we proceed to bound the expected value of the Lipschitz constant of . By linearity of expectation, we have . Therefore:

(3)

Since then . Therefore, from (3), since , and by Markov’s and Hoeffding’s inequalities and by the union bound, we have

(4)

By optimizing the above with respect to , we obtain . Finally, by replacing the optimal value back in (3) and by using (2), we prove our claim. ∎
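Theorem 1 can also be probed empirically. The following small simulation (ours, not part of the paper) estimates the maximum absolute approximation error over random pairs of points in a ball of fixed radius, for several dimensionalities d and feature counts D; under the theorem one would expect the error to shrink with D while remaining essentially insensitive to d. The Gaussian kernel, the sampling scheme, and all names are our own choices, and a maximum over sampled pairs only approximates the supremum over the ball.

    import numpy as np

    def make_rff(d, D, rng):
        # One shared draw of projections and phases for the Gaussian kernel exp(-||x - y||^2 / 2).
        W = rng.standard_normal((d, D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        return lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

    def max_error(d, D, n_pairs=1000, radius=2.0, seed=0):
        rng = np.random.default_rng(seed)
        z = make_rff(d, D, rng)
        # Sample points uniformly at random inside a ball of the given radius.
        X = rng.standard_normal((2 * n_pairs, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        X *= radius * rng.uniform(size=(2 * n_pairs, 1)) ** (1.0 / d)
        A, B = X[:n_pairs], X[n_pairs:]
        exact = np.exp(-0.5 * np.sum((A - B) ** 2, axis=1))
        approx = np.sum(z(A) * z(B), axis=1)
        return np.max(np.abs(approx - exact))

    for d in (5, 50, 500):
        for D in (100, 1000, 5000):
            print(f"d={d:4d}  D={D:5d}  max error over sampled pairs ~ {max_error(d, D):.3f}")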

4 Necessary Number of Samples

In contrast to the relatively well-studied upper bounds on the error probability of RFF, lower bounds have not been analyzed before. In this section we shed light on the minimax bound for any estimator based on Gaussian linear projections, by using Le Cam's Lemma [Yu, 1997, Wasserman, 2010]. In the first part we introduce Le Cam's Lemma for minimax bounds, then we use Le Cam's Lemma to show the minimax bound on the expected error. Finally, we generalize our result in the last subsection and show that the supremum of the error probability is also bounded below by Ω(exp(-D)).

4.1 Le Cam’s Lemma

Recall that given an observation drawn from some distribution in a family of distributions , a function of , and any estimator of , the minimax risk is:

Minimax theory characterizes lower bounds on the estimation error among all estimators. Le Cam's Lemma is a method for providing a lower bound on the minimax risk. First, we introduce Le Cam's Lemma from Theorem 36.8 of [Wasserman, 2010], which is a revised version of the results in [Yu, 1997]:

Lemma 1 (Le Cam [Yu, 1997, Wasserman, 2010]).

Let be a set of distributions over space , where every distribution corresponds to a parameter in the parameter space . Let be any estimator, be an observation drawn from , be a metric in . Then for any pair of distributions

where are the probability density functions of , respectively.

(A detailed proof can be found in Appendix B.)

Moreover, can be any semimetric or non-negative symmetric function satisfying the relaxed triangle inequality, which generalizes Le Cam’s Lemma as follows:

Lemma 2.

Let be a set of distributions over space , where every distribution corresponds to a parameter in the parameter space . Let be any estimator, be an observation drawn from , be a non-negative symmetric function in satisfying whenever and . For any pair of distributions satisfying we have:

where are the probability density functions of , respectively.

Le Cam's Lemma allows for analyzing minimax bounds on estimation errors, which gives us some insight into lower bounds for estimators. In order to simplify our analysis with Le Cam's Lemma, we also introduce the following result:

Lemma 3 (Lemma 2.6 in [Tsybakov, 2009]).

For any two distributions with support , we have

where is the KL-divergence from to , and are the probability density functions of respectively.
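For reference, the standard form of this inequality, as we recall it from Lemma 2.6 of [Tsybakov, 2009] and state here in our own notation (with p and q the densities of P and Q with respect to a common dominating measure μ), is:

    \[
      \int \min\{p(x),\, q(x)\}\, d\mu(x) \;\ge\; \frac{1}{2}\, \exp\!\big(-\mathrm{KL}(P \,\|\, Q)\big).
    \]

Combined with Le Cam's Lemma, an upper bound on the KL divergence between two candidate distributions therefore yields a lower bound on the minimax risk.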

4.2 Minimax Bound for the Expected Error

In what follows, we regard the kernel as a parameter of its Fourier transform , and then derive by Le Cam's Lemma a minimax lower bound for such parameter estimation. We introduce a function that fulfills . Furthermore, let be the -th Gaussian linear projection of the data point . The particular estimator used in practice, studied in Section 3 for the sufficient number of samples, can be defined as . For our analysis of the necessary number of samples, we consider any estimator of the form , thus establishing impossibility results beyond the typically used estimator.

Theorem 2.

Let be a set of i.i.d. -dimensional random vectors from . Let be any estimator that uses Gaussian linear projections of the data point , i.e., takes as input. Let where is Lambert’s W function. Define   and

Then

Proof.

Note and for all . Let then can be regarded as a set of i.i.d. samples from and is a random vector from . We define the family of distributions . We let the parameter of be , where the last equality follows from Lemma 5 in Appendix B. By Le Cam’s Lemma 1 and Lemma 3 we have:

(5)

where . To complete the proof, we will first set and then maximize the expression . Note that and

are two multivariate normal distributions, which implies that their KL-divergence is

By choosing such that , we can make and thus . Then from (4.2) we get

(6)

for any that satisfy the constraint . Under this constraint, the solution to maximize is and , if . Similarly, the solution is and , if . By rewriting (6) with this maximizing solution, we prove our claim. ∎
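For completeness, the closed-form KL divergence between two multivariate Gaussians, which is the standard identity invoked in the step above (written here in generic notation of our own choosing), is:

    \[
      \mathrm{KL}\big(\mathcal{N}(\mu_1, \Sigma_1)\,\|\,\mathcal{N}(\mu_2, \Sigma_2)\big)
      \;=\; \frac{1}{2}\left( \operatorname{tr}\!\big(\Sigma_2^{-1}\Sigma_1\big)
      + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)
      - d + \ln \frac{\det \Sigma_2}{\det \Sigma_1} \right),
    \]

where d is the dimension.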

4.3 Minimax Bound for the Error Probability

In this subsection we generalize the results of the previous subsection and show that the supremum of the error probability is Ω(exp(-D)) for any estimator based on Gaussian linear projections. Here we introduce a generalization of Theorem 2, based on the generalized Le Cam's Lemma 2:

Theorem 3.

Let be a set of i.i.d. -dimensional random vectors from . Let be any estimator that uses Gaussian linear projections of the data point , i.e., takes as input. Then

provided that , where is defined as in Theorem 2.

Proof.

The proof largely follows the proof of Theorem 2. We can regard every as a random variable from and regard as a -dimensional random vector from . We define the family of distributions . We let the parameter of be , where the last equality follows from Lemma 5 in Appendix B. We also define a symmetric, nonnegative function

which satisfies a relaxed triangle inequality whenever . Furthermore, define

which satisfies . From the above and by Le Cam’s Lemma 2 and Lemma 3 we have:

The proof continues as in the proof of Theorem 2, by first setting and then maximizing the expression . ∎

5 Applications

In this section, we provide examples of consequences of our theory. In particular, our theory allows for tighter results for the analysis of the expectation of the maximum error, and the sample complexity of kernel ridge regression and support vector machines.

5.1 Expectation of the Maximum Error

Proposition 3 of [Sutherland and Schneider, 2015] shows that when the kernel function is -Lipschitz, the expected maximum error of approximation is bounded above by

We improve the above upper bound, as shown in the following corollary:

Corollary 1.

Let be as in Theorem 1. We have that:

where is the Gamma function.

5.2 Kernel Ridge Regression

Consider a training set of samples , a kernel matrix , where , and the vector . As proved in Proposition 9 of [Sutherland and Schneider, 2015], the error probability for kernel ridge regression is bounded above by

(7)

where is the approximation of the kernel matrix using RFF, is the approximation of from RFF, is the regularization parameter, and is the standard deviation of the values . With the upper bound on the error probability in [Sutherland and Schneider, 2015], the authors proved, by applying , that with probability at least if

On the other hand, we reach the same result with a smaller number of features. Let be Lambert's W function. From our result in Theorem 1 and (7), we obtain

(8)

Regarding the work of [Rudi and Rosasco, 2017], note that our result in (8) shows that is independent of the number of samples . In contrast, Theorem 1 in [Rudi and Rosasco, 2017] requires a number of random features . Thus, our result is tighter.
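As an illustration of how RFF is typically deployed for kernel ridge regression (a generic sketch under our own choices of data, kernel, and regularization convention, not the specific construction analyzed in [Sutherland and Schneider, 2015]), one fits ridge regression on the random features, which approximates the kernel ridge regression predictor:

    import numpy as np

    def make_rff(d, D, rng):
        # Shared random projections and phases for the Gaussian kernel exp(-||x - y||^2 / 2).
        W = rng.standard_normal((d, D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        return lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

    rng = np.random.default_rng(0)
    n, d, D, lam = 500, 20, 300, 0.1

    # Synthetic regression data, for illustration only.
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

    z = make_rff(d, D, rng)
    Z = z(X)                                                     # n x D random feature matrix

    # Ridge regression in feature space: solve (Z^T Z + n * lam * I) w = Z^T y.
    w = np.linalg.solve(Z.T @ Z + n * lam * np.eye(D), Z.T @ y)

    X_test = rng.standard_normal((5, d))
    print(z(X_test) @ w)                                         # approximate KRR predictions

This replaces the n × n linear system of exact kernel ridge regression with a D × D system, which is the computational benefit that the bounds above make precise in terms of how large D must be.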

5.3 Support Vector Machines

Now we consider support vector machine (SVM) classifiers. Given a training set of samples with , the kernel embedding , the SVM classifier , where is the parameter, and the optimization problem

where is the regularization weight. As proved in Section 3.2 of [Sutherland and Schneider, 2015], if the RFF approximation of the kernel is controlled by , that is

then the approximation error of SVM is also controlled by

where and is the approximation of using RFF. The results in [Sutherland and Schneider, 2015] show that if

where is Lambert’s W function, then with probability at least . However, with our result in Theorem 1 we can control to be

under which we also have with probability at least . (The dependence with respect to can be removed, as is customary, by using the weight . This amounts to scaling the squared-norm regularization as a function of the number of data points .)
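Similarly, here is a generic sketch (ours) of the SVM pipeline discussed above: approximate the kernel with RFF and train a linear SVM on the random features, here with scikit-learn's LinearSVC; the synthetic data and hyperparameters are placeholders.

    import numpy as np
    from sklearn.svm import LinearSVC

    def make_rff(d, D, rng):
        # Shared random projections and phases for the Gaussian kernel exp(-||x - y||^2 / 2).
        W = rng.standard_normal((d, D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        return lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

    rng = np.random.default_rng(0)
    n, d, D = 400, 10, 500

    # Synthetic binary labels in {-1, +1}, for illustration only.
    X = rng.standard_normal((n, d))
    y = np.where(np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n) > 0, 1, -1)

    z = make_rff(d, D, rng)
    clf = LinearSVC(C=1.0)            # a linear SVM on random features approximates the kernel SVM
    clf.fit(z(X), y)
    print("training accuracy:", clf.score(z(X), y))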

6 Concluding Remarks

There are several ways of extending our work. For instance, note that [Rahimi and Recht, 2007, Sutherland and Schneider, 2015, Sriperumbudur and Szabó, 2015] focus on more general shift-invariant kernel functions. In this paper, all results are based on the assumption that the kernel function is Gaussian. Extensions to more general kernel functions would be of interest.

References

  • Bochner [1959] S. Bochner. Lectures on Fourier integrals. Princeton University Press, 1959.
  • Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Neural Information Processing Systems, 20:1177–1184, 2007.
  • Rudi and Rosasco [2017] A. Rudi and L. Rosasco. Generalization properties of learning with random features. Neural Information Processing Systems, 30:3215–3225, 2017.
  • Sriperumbudur and Szabó [2015] B. K. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. Neural Information Processing Systems, 28:1144–1152, 2015.
  • Sutherland and Schneider [2015] D. Sutherland and J. Schneider. On the error of random Fourier features. Uncertainty in Artificial Intelligence, pages 862–871, 2015.
  • Tsybakov [2009] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009.
  • Wasserman [2010] L. Wasserman. Minimax Theory, Lecture Notes on Statistical Machine Learning. 2010.
  • Yu [1997] B. Yu. Assouad, Fano, and Le Cam. Springer-Verlag, 1997.

Appendix A Regarding the Bound in [Sriperumbudur and Szabó, 2015]

Theorem 1 of Sriperumbudur and Szabó [2015] provides the following upper bound:

(9)

If we rewrite the error threshold to be , then by the fact that (which holds since ), we can rewrite (9) as:

(10)

where