Linearized GMM Kernels and Normalized Random Fourier Features

by Ping Li, et al.
Rutgers University

The method of "random Fourier features (RFF)" has become a popular tool for approximating the "radial basis function (RBF)" kernel. The variance of RFF, however, is actually large. Interestingly, the variance can be substantially reduced by a simple normalization step, as we theoretically demonstrate. We name the improved scheme the "normalized RFF (NRFF)". We also propose the "generalized min-max (GMM)" kernel as a measure of data similarity. GMM is positive definite, as there is an associated hashing method named "generalized consistent weighted sampling (GCWS)" which linearizes this nonlinear kernel. We provide an extensive empirical evaluation of the RBF kernel and the GMM kernel on more than 50 publicly available datasets. For a majority of the datasets, the (tuning-free) GMM kernel outperforms the best-tuned RBF kernel. We conduct extensive experiments for comparing the linearized RBF kernel using NRFF with the linearized GMM kernel using GCWS. We observe that, to reach a comparable classification accuracy, GCWS typically requires substantially fewer samples than NRFF, even on datasets where the original RBF kernel outperforms the original GMM kernel. The empirical success of GCWS (compared to NRFF) can also be explained from a theoretical perspective. Firstly, the relative variance (normalized by the squared expectation) of GCWS is substantially smaller than that of NRFF, except for the very high similarity region (where the variances of both methods are close to zero). Secondly, if we make a model assumption on the data, we can show analytically that GCWS exhibits much smaller variance than NRFF for estimating the same object (e.g., the RBF kernel), except for the very high similarity region.








1 Introduction

It is popular in machine learning practice to use linear algorithms such as logistic regression or linear SVM. It is known that one can often improve the performance of linear methods by using nonlinear algorithms such as kernel SVMs, if the computational/storage burden can be resolved. In this paper, we introduce an effective measure of data similarity termed the “generalized min-max (GMM)” kernel and the associated hashing method named “generalized consistent weighted sampling (GCWS)”, which efficiently converts this nonlinear kernel into a linear kernel. Moreover, we will also introduce what we call “normalized random Fourier features (NRFF)” and compare it with GCWS.

We start the introduction with the basic linear kernel. Consider two data vectors $u, v \in \mathbb{R}^D$. It is common to use the normalized linear kernel (i.e., the correlation):

$$\rho = \rho(u, v) = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\ \sqrt{\sum_{i=1}^{D} v_i^2}} \qquad (1)$$

This normalization step is in general a recommended practice. For example, when using the LIBLINEAR or LIBSVM packages [6], it is often suggested to first normalize the input data vectors to unit norm. In addition to packages such as LIBLINEAR which implement batch linear algorithms, methods based on stochastic gradient descent (SGD) have become increasingly important, especially for truly large-scale industrial applications.


In this paper, the proposed GMM kernel is defined on general data types which can have both negative and positive entries. The basic idea is to first transform the original data into nonnegative data and then compute the min-max kernel [20, 9, 12] on the transformed data.

1.1 Data Transformation

Consider the original data vector $u_i$, $i = 1$ to $D$. We define the following transformation, depending on whether an entry $u_i$ is positive or negative:

$$\tilde{u}_{2i-1} = \begin{cases} u_i & \text{if } u_i > 0 \\ 0 & \text{otherwise} \end{cases},\qquad \tilde{u}_{2i} = \begin{cases} -u_i & \text{if } u_i < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Footnote 1: This transformation can be generalized by considering a “center vector” $\mu_i$, $i = 1$ to $D$, such that $\tilde{u}_{2i-1} = u_i - \mu_i$ if $u_i > \mu_i$ (and 0 otherwise), and $\tilde{u}_{2i} = \mu_i - u_i$ if $u_i < \mu_i$ (and 0 otherwise). In this paper, we always use $\mu = 0$. Note that the same center vector should be used for all data vectors.

For example, when $D = 2$ and $u = [-5,\ 3]$, the transformed data vector becomes $\tilde{u} = [0,\ 5,\ 3,\ 0]$.

1.2 Generalized Min-Max (GMM) Kernel

Given two data vectors $u$ and $v$, we first transform them into $\tilde{u}$ and $\tilde{v}$ according to (2). Then the generalized min-max (GMM) similarity is defined as

$$\mathrm{GMM}(u, v) = \frac{\sum_{i=1}^{2D} \min\left(\tilde{u}_i, \tilde{v}_i\right)}{\sum_{i=1}^{2D} \max\left(\tilde{u}_i, \tilde{v}_i\right)} \qquad (3)$$
We will show in Section 4 that GMM is indeed an effective measure of data similarity through an extensive experimental study on kernel SVM classification.
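As a concrete sketch (our own NumPy illustration, not code from the paper), the transformation (2) and the GMM kernel (3) can be implemented as follows:

```python
import numpy as np

def transform(u):
    # Eq. (2): positive entries go to the odd slots, magnitudes of negative
    # entries go to the even slots, yielding a nonnegative vector in 2D dims.
    u = np.asarray(u, dtype=float)
    t = np.zeros(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)    # tilde{u}_{2i-1} = u_i if u_i > 0
    t[1::2] = np.maximum(-u, 0.0)   # tilde{u}_{2i}   = -u_i if u_i < 0
    return t

def gmm_kernel(u, v):
    # GMM(u, v) = sum_i min(tilde{u}_i, tilde{v}_i) / sum_i max(tilde{u}_i, tilde{v}_i)
    tu, tv = transform(u), transform(v)
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()
```

For instance, `transform([-5, 3])` yields `[0, 5, 3, 0]`, and `gmm_kernel(u, u)` equals 1 for any nonzero `u`.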

It is generally nontrivial to scale nonlinear kernels for large data [3]. In a sense, it is not practically meaningful to discuss nonlinear kernels without knowing how to compute them efficiently (e.g., via hashing). In this paper, we focus on the generalized consistent weighted sampling (GCWS).

1.3 Generalized Consistent Weighted Sampling (GCWS)

Algorithm 1 summarizes the “generalized consistent weighted sampling” (GCWS). Given two data vectors $u$ and $v$, we transform them into nonnegative vectors $\tilde{u}$ and $\tilde{v}$ as in (2). We then apply the original “consistent weighted sampling” (CWS) [20, 9] to generate random tuples:

$$\left(i^*_{\tilde{u},j},\ t^*_{\tilde{u},j}\right),\qquad j = 1, 2, \dots, k$$

where $i^* \in \{1, 2, \dots, 2D\}$ and $t^*$ is an unbounded integer. Following [20, 9], we have the basic probability result.

Theorem 1

$$\Pr\left\{ \left(i^*_{\tilde{u},j},\ t^*_{\tilde{u},j}\right) = \left(i^*_{\tilde{v},j},\ t^*_{\tilde{v},j}\right) \right\} = \mathrm{GMM}(u, v)$$

Input: Data vector $u = (u_i,\ i = 1\ \text{to}\ D)$

Transform: Generate vector $\tilde{u}$ in $2D$-dim by (2)

Output: Consistent uniform sample ($i^*$, $t^*$)

For $i$ from 1 to $2D$ (with $\tilde{u}_i > 0$)

  $r_i \sim \mathrm{Gamma}(2,1)$,  $c_i \sim \mathrm{Gamma}(2,1)$,  $\beta_i \sim \mathrm{Uniform}(0,1)$

  $t_i = \left\lfloor \log \tilde{u}_i / r_i + \beta_i \right\rfloor$,  $a_i = \log c_i - r_i\left(t_i + 1 - \beta_i\right)$

End For

$i^* = \arg\min_i a_i$,  $t^* = t_{i^*}$

Algorithm 1 Generalized Consistent Weighted Sampling (GCWS). Note that we slightly re-write the expression for $a_i$ compared to [9].
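A minimal NumPy sketch of this sampling step (our own illustration; the shared `seed` stands in for the common randomness $(r_i, c_i, \beta_i)$ that must be reused across all data vectors):

```python
import numpy as np

def transform(u):
    # Eq. (2): split each entry into its positive and negative parts.
    u = np.asarray(u, dtype=float)
    t = np.zeros(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)
    t[1::2] = np.maximum(-u, 0.0)
    return t

def gcws_sample(u, seed):
    # One consistent sample (i*, t*). Using the same seed for different
    # vectors reuses the same (r, c, beta), which is what makes the
    # collision probability equal GMM(u, v).
    tu = transform(u)
    rng = np.random.default_rng(seed)
    n = len(tu)
    r = rng.gamma(2.0, 1.0, n)
    c = rng.gamma(2.0, 1.0, n)
    beta = rng.uniform(0.0, 1.0, n)
    idx = np.flatnonzero(tu > 0)                 # coordinates with nonzero weight
    t = np.floor(np.log(tu[idx]) / r[idx] + beta[idx])
    a = np.log(c[idx]) - r[idx] * (t + 1.0 - beta[idx])
    j = np.argmin(a)
    return int(idx[j]), int(t[j])
```

Averaging the collision indicator over many seeds then estimates $\mathrm{GMM}(u, v)$.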

With $k$ samples, we can simply use the averaged indicator

$$\hat{g}_{gmm} = \frac{1}{k}\sum_{j=1}^{k} 1\left\{ \left(i^*_{\tilde{u},j},\ t^*_{\tilde{u},j}\right) = \left(i^*_{\tilde{v},j},\ t^*_{\tilde{v},j}\right) \right\}$$

to estimate $\mathrm{GMM}(u, v)$. By property of the binomial distribution, we know the expectation ($E$) and variance ($Var$) are

$$E\left(\hat{g}_{gmm}\right) = \mathrm{GMM},\qquad Var\left(\hat{g}_{gmm}\right) = \frac{\mathrm{GMM}\left(1 - \mathrm{GMM}\right)}{k}$$

The estimation variance, given $k$ samples, is thus $\mathrm{GMM}(1-\mathrm{GMM})/k$, which vanishes as GMM approaches 0 or 1, or as the sample size $k \to \infty$.

1.4 0-bit GCWS for Linearizing GMM Kernel SVM

The so-called “0-bit” GCWS idea is that, based on intensive empirical observations [12], one can safely ignore $t^*$ (which is unbounded) and simply use

$$\Pr\left\{ i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \right\} \approx \mathrm{GMM}(u, v)$$

For each data vector $u$, we obtain $k$ random samples $i^*_{\tilde{u},j}$, $j = 1$ to $k$. We store only the lowest $b$ bits of $i^*$, based on the idea of [18]. We need to view those integers as locations (of the nonzeros) instead of numerical values. For example, when $b = 2$, we should view each sample as a vector of length $2^b = 4$: if the lowest $b$ bits encode 1, we code the sample as $[0,\ 1,\ 0,\ 0]$; if they encode 3, we code it as $[0,\ 0,\ 0,\ 1]$. We can concatenate all $k$ such vectors into a binary vector of length $2^b \times k$, with exactly $k$ 1’s.

For linear methods, the computational cost is largely determined by the number of nonzeros in each data vector, i.e., the $k$ in our case. For the other parameter $b$, a relatively small value (e.g., $b = 8$, as used in [12]) typically suffices.
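The encoding step above can be sketched as follows (a hypothetical helper of our own; the lowest $b$ bits of each $i^*$ give the location of the single 1 within a block of length $2^b$):

```python
import numpy as np

def zero_bit_features(i_star_samples, b):
    # Concatenate k one-hot blocks of length 2^b; the position of the 1
    # inside block j is the lowest b bits of the j-th sample i*.
    width = 1 << b                       # 2^b
    k = len(i_star_samples)
    x = np.zeros(k * width, dtype=np.uint8)
    for j, i_star in enumerate(i_star_samples):
        x[j * width + (i_star & (width - 1))] = 1
    return x
```

The resulting binary vector has length $2^b \times k$ with exactly $k$ ones, so the inner product of two such vectors, divided by $k$, is the collision frequency that estimates GMM.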

The natural competitor of the GMM kernel is the RBF (radial basis function) kernel, and the competitor of the GCWS hashing method is the RFF (random Fourier feature) algorithm.

2 RBF Kernel and Normalized Random Fourier Features (NRFF)

The radial basis function (RBF) kernel is widely used in machine learning and beyond. In this study, for convenience (e.g., parameter tuning), we recommend the following version:

$$\mathrm{RBF}(u, v; \gamma) = e^{-\gamma(1-\rho)}$$

where $\rho$ is the correlation defined in (1) and $\gamma > 0$ is a crucial tuning parameter. Based on Bochner’s Theorem [24], it is known [22] that, if we sample $w_i \sim N(0, 1)$ i.i.d. and $\tau \sim \mathrm{uniform}(0, 2\pi)$, and let $x = \sqrt{\gamma}\sum_{i=1}^{D} u_i w_i$, $y = \sqrt{\gamma}\sum_{i=1}^{D} v_i w_i$, where $\|u\| = \|v\| = 1$, then we have

$$E\left(\sqrt{2}\cos(x+\tau)\ \sqrt{2}\cos(y+\tau)\right) = e^{-\gamma(1-\rho)}$$
This provides a nice mechanism for linearizing the RBF kernel, and the RFF method has become popular in machine learning, computer vision, and beyond, e.g., [21, 27, 1, 7, 5, 28, 8, 25, 4, 23].
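The construction can be sketched numerically (a toy check of our own, with arbitrary unit-norm example vectors): for shared $w$ and $\tau$, the empirical mean of the paired features approaches $e^{-\gamma(1-\rho)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, gamma = 3, 200_000, 1.0
W = rng.standard_normal((k, D))            # w_ij ~ N(0, 1), shared by all vectors
tau = rng.uniform(0.0, 2.0 * np.pi, k)     # tau_j ~ uniform(0, 2*pi)

def rff(u):
    # x_j = sqrt(2) * cos( sqrt(gamma) * <w_j, u> + tau_j )
    return np.sqrt(2.0) * np.cos(np.sqrt(gamma) * (W @ u) + tau)

u = np.array([0.6, 0.8, 0.0])              # unit norm
v = np.array([0.0, 0.6, 0.8])              # unit norm
rho = float(u @ v)                         # correlation, Eq. (1)
est = float(np.mean(rff(u) * rff(v)))      # approximates exp(-gamma * (1 - rho))
```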

Theorem 2

Given $\|u\| = \|v\| = 1$, $\rho = u^\top v$, $x = \sqrt{\gamma}\sum_i u_i w_i$, $y = \sqrt{\gamma}\sum_i v_i w_i$ with $w_i \sim N(0,1)$ i.i.d. and $\tau \sim \mathrm{uniform}(0, 2\pi)$, and $X = \sqrt{2}\cos(x+\tau)\,\sqrt{2}\cos(y+\tau)$, we have

$$E(X) = e^{-\gamma(1-\rho)} \qquad (12)$$

$$Var(X) = \frac{1}{2} + \frac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2 \geq \frac{1}{2} \qquad (13)$$

The proof for (13) can also be found in [26]. One can see that the variance of RFF can be large: it is always at least $1/2$, regardless of the similarity. Interestingly, the variance can be substantially reduced if we normalize the hashed data, a procedure which we call “normalized RFF (NRFF)”. The theoretical results are presented in Theorem 3.

Theorem 3

Consider $k$ iid samples $(x_j, y_j, \tau_j)$ where $x_j = \sqrt{\gamma}\sum_i u_i w_{ij}$, $y_j = \sqrt{\gamma}\sum_i v_i w_{ij}$, $w_{ij} \sim N(0,1)$, $\tau_j \sim \mathrm{uniform}(0, 2\pi)$, and $\|u\| = \|v\| = 1$. Let the normalized estimator be

$$\hat{g}_{nrff} = \frac{\sum_{j=1}^{k} \cos(x_j+\tau_j)\cos(y_j+\tau_j)}{\sqrt{\sum_{j=1}^{k} \cos^2(x_j+\tau_j)}\ \sqrt{\sum_{j=1}^{k} \cos^2(y_j+\tau_j)}}$$

As $k \to \infty$, the following asymptotic normality holds:

$$\sqrt{k}\left(\hat{g}_{nrff} - e^{-\gamma(1-\rho)}\right) \xrightarrow{D} N\left(0,\ V_{n,\rho}\right),\qquad V_{n,\rho} = \left(1 - e^{-2\gamma(1-\rho)}\right)^2\left(1 + \frac{1}{4}e^{-2\gamma(1-\rho)}\right) \qquad (15)$$

Obviously, $V_{n,\rho} \leq Var(X)$ (in particular, $V_{n,\rho} = 0$ at $\rho = 1$), i.e., the variance of the normalized RFF is (much) smaller than that of the original RFF. Figure 1 plots the ratio $Var(X)/V_{n,\rho}$ to visualize the improvement due to normalization, which is most significant when $\rho$ is close to 1.

Figure 1: The ratio $Var(X)/V_{n,\rho}$ from Theorem 3 for visualizing the improvement due to normalization.

Note that the theoretical results in Theorem 3 are asymptotic (i.e., for large $k$). With $k$ samples, the variance of the original RFF is exactly $Var(X)/k$; the variance of the normalized RFF (NRFF), however, is only characterized asymptotically as $V_{n,\rho}/k$. It is important to understand the behavior when $k$ is not large. For this purpose, Figure 2 presents the simulated mean square error (MSE) results for estimating the RBF kernel $e^{-\gamma(1-\rho)}$, confirming that (a) the improvement due to normalization can be substantial, and (b) the asymptotic variance formula (15) becomes accurate even for small $k$.

Figure 2: A simulation study to verify the asymptotic theoretical results in Theorem 3. With $k$ samples, we estimate the RBF kernel $e^{-\gamma(1-\rho)}$, using both the original RFF and the normalized RFF (NRFF). With enough repetitions at each $k$, we can compute the empirical mean square error: MSE = Bias$^2$ + Var. Each panel presents the MSEs (solid curves) for a particular choice of $(\gamma, \rho)$, along with the theoretical variances $Var(X)/k$ and $V_{n,\rho}/k$ (dashed curves). The variance of the original RFF (curves above, or red if color is available) can be substantially larger than the MSE of the normalized RFF (curves below, or blue). As $k$ grows, the normalized RFF provides a virtually unbiased estimate of the RBF kernel and its empirical MSE matches the theoretical asymptotic variance.
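A small Monte Carlo in the spirit of Figure 2 (our own sketch with arbitrary parameter choices, not the paper's exact setup) compares the empirical MSEs of the two estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k, gamma, reps = 3, 64, 1.0, 2000
u = np.array([0.6, 0.8, 0.0]); v = np.array([0.0, 0.6, 0.8])  # unit norm
rho = float(u @ v)
target = float(np.exp(-gamma * (1.0 - rho)))   # the RBF kernel being estimated

mse_rff = mse_nrff = 0.0
for _ in range(reps):
    W = rng.standard_normal((k, D))
    tau = rng.uniform(0.0, 2.0 * np.pi, k)
    x = np.cos(np.sqrt(gamma) * (W @ u) + tau)
    y = np.cos(np.sqrt(gamma) * (W @ v) + tau)
    rff_est = 2.0 * np.mean(x * y)                        # original RFF
    nrff_est = (x @ y) / np.sqrt((x @ x) * (y @ y))       # normalized RFF
    mse_rff += (rff_est - target) ** 2 / reps
    mse_nrff += (nrff_est - target) ** 2 / reps
# k * mse_rff should be near 1/2 + (1/2)(1 - exp(-2*gamma*(1-rho)))**2,
# and k * mse_nrff near the smaller asymptotic variance of Theorem 3.
```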

Next, we attempt to compare RFF with GCWS. While ultimately we can rely on classification accuracy as a metric for performance, here we compare their variances ($Var$) relative to their squared expectations ($E^2$), in terms of $Var/E^2$, as shown in Figure 3. For GCWS, we know

$$\frac{Var\left(\hat{g}_{gmm}\right)}{E^2\left(\hat{g}_{gmm}\right)} = \frac{1}{k}\,\frac{1 - \mathrm{GMM}}{\mathrm{GMM}}$$

For the original RFF, we have $Var/E^2 = \frac{1}{k}\left[\frac{1}{2} + \frac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2\right] e^{2\gamma(1-\rho)}$, etc.

Figure 3 shows that the relative variance of GCWS is substantially smaller than that of the original RFF and the normalized RFF (NRFF), especially when the similarity is not large. For the very high similarity region (i.e., $\rho \to 1$), the variances of both GCWS and NRFF approach zero.

Figure 3: Ratio of the variance over the squared expectation, denoted $Var/E^2$, for the convenience of comparing RFF/NRFF with GCWS. Smaller (lower) is better.

The results in Figure 3 provide one explanation for why, in the classification experiments reported later, GCWS typically needs substantially fewer samples than the normalized RFF in order to achieve similar classification accuracies. Note that for practical data, the similarities among most data pairs are usually small (i.e., small $\rho$), and hence it is not surprising that GCWS may perform substantially better. Also see Section 3 and Figure 4 for a comparison from the perspective of estimating the RBF kernel using GCWS based on a model assumption.

In a sense, this drawback of RFF is expected, due to the nature of random projections. For example, as shown in [16, 17], the linear estimator of the correlation $\rho$ based on $k$ Gaussian random projections has variance $(1+\rho^2)/k$ for unit-norm data, where $k$ is the number of projections. In order to make the variance small, one has to use many projections (i.e., large $k$).
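To illustrate (a quick simulation of our own, with hypothetical parameters): for unit-norm $u, v$ and $w_j \sim N(0, I_D)$, the linear estimator $\frac{1}{k}\sum_{j=1}^k (w_j^\top u)(w_j^\top v)$ of $\rho$ indeed has variance close to $(1+\rho^2)/k$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, k, reps = 3, 100, 10_000
u = np.array([0.6, 0.8, 0.0]); v = np.array([0.0, 0.6, 0.8])  # unit norm
rho = float(u @ v)

R = rng.standard_normal((reps, k, D))      # reps independent projection matrices
est = np.mean((R @ u) * (R @ v), axis=1)   # (1/k) * sum_j <w_j,u><w_j,v> per rep
emp_var = float(est.var())
# Theory (see, e.g., [16, 17]): Var = (1 + rho^2) / k for unit-norm data
```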

Proof of Theorem 2:  Recall $x = \sqrt{\gamma}\sum_i u_i w_i$, $y = \sqrt{\gamma}\sum_i v_i w_i$ with $w_i \sim N(0,1)$ i.i.d., $\tau \sim \mathrm{uniform}(0, 2\pi)$, and $X = 2\cos(x+\tau)\cos(y+\tau)$. The following three integrals will be useful in our proof:

$$\frac{1}{2\pi}\int_0^{2\pi} \cos(x+\tau)\cos(y+\tau)\, d\tau = \frac{1}{2}\cos(x-y)$$

$$\frac{1}{2\pi}\int_0^{2\pi} \cos^2(x+\tau)\cos^2(y+\tau)\, d\tau = \frac{1}{4} + \frac{1}{8}\cos\left(2(x-y)\right)$$

$$E\left[\cos\left(s(x-y)\right)\right] = e^{-s^2\gamma(1-\rho)}\quad \text{for any integer } s$$

The first two follow from $2\cos A\cos B = \cos(A-B) + \cos(A+B)$, since the terms involving $\tau$ integrate to zero; the third follows from the Gaussian characteristic function, because $x-y \sim N\left(0,\ 2\gamma(1-\rho)\right)$ when $\|u\| = \|v\| = 1$. Combining the first and third integrals,

$$E(X) = 2\,E\left[\frac{1}{2}\cos(x-y)\right] = e^{-\gamma(1-\rho)}$$

This completes the proof of the first moment. Next, using the second and third integrals, we are ready to compute the second moment

$$E\left(X^2\right) = 4\,E\left[\frac{1}{4} + \frac{1}{8}\cos\left(2(x-y)\right)\right] = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)}$$

and the variance

$$Var(X) = E\left(X^2\right) - E^2(X) = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - e^{-2\gamma(1-\rho)} = \frac{1}{2} + \frac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2$$

Finally, we examine the first moment without the $\tau$ random variable:

$$E\left[2\cos(x)\cos(y)\right] = E\left[\cos(x-y)\right] + E\left[\cos(x+y)\right] = e^{-\gamma(1-\rho)} + e^{-\gamma(1+\rho)}$$

since $x+y \sim N\left(0,\ 2\gamma(1+\rho)\right)$; that is, dropping $\tau$ introduces an extra (bias) term. This completes the proof of Theorem 2.

Proof of Theorem 3:  We will use some of the results from the proof of Theorem 2. Define

$$\bar{a} = \frac{1}{k}\sum_{j=1}^{k} 2\cos(x_j+\tau_j)\cos(y_j+\tau_j),\quad \bar{b} = \frac{1}{k}\sum_{j=1}^{k} 2\cos^2(x_j+\tau_j),\quad \bar{c} = \frac{1}{k}\sum_{j=1}^{k} 2\cos^2(y_j+\tau_j)$$

so that $\hat{g}_{nrff} = \bar{a}\big/\sqrt{\bar{b}\,\bar{c}}$. From Theorem 2, it is easy to see that, as $k \to \infty$, we have

$$\bar{a} \to m = e^{-\gamma(1-\rho)},\qquad \bar{b} \to 1,\qquad \bar{c} \to 1$$

We express the deviation as

$$\hat{g}_{nrff} - m = \left(\bar{a} - m\right) - \frac{m}{2}\left(\bar{b} - 1\right) - \frac{m}{2}\left(\bar{c} - 1\right) + \text{higher-order terms}$$

Note that if $|\bar{b} - 1| < 1$ and $|\bar{c} - 1| < 1$, then $(1+\delta)^{-1/2} = 1 - \delta/2 + O\left(\delta^2\right)$, and we can ignore the higher-order terms.

Therefore, to analyze the asymptotic variance, it suffices to study the following expectation

$$E\left[\left(\bar{a} - m\right) - \frac{m}{2}\left(\bar{b} - 1\right) - \frac{m}{2}\left(\bar{c} - 1\right)\right]^2$$

which can be obtained from the results in the proof of Theorem 2. In particular,

$$Var\left(\bar{a}\right) = \frac{1}{k}\left[\frac{1}{2} + \frac{1}{2}\left(1 - m^2\right)^2\right],\qquad Var\left(\bar{b}\right) = Var\left(\bar{c}\right) = \frac{1}{2k}$$

$$Cov\left(\bar{a}, \bar{b}\right) = Cov\left(\bar{a}, \bar{c}\right) = \frac{m}{2k},\qquad Cov\left(\bar{b}, \bar{c}\right) = \frac{m^4}{2k}$$

We can now compute

$$k\,E\left[\,\cdot\,\right]^2 = \left[\frac{1}{2} + \frac{1}{2}\left(1 - m^2\right)^2\right] + \frac{m^2}{4} - m^2 + \frac{m^6}{4} = \left(1 - m^2\right)^2\left(1 + \frac{m^2}{4}\right) = V_{n,\rho}$$

which is exactly the asymptotic variance in (15).