# Linearized GMM Kernels and Normalized Random Fourier Features

The method of "random Fourier features (RFF)" has become a popular tool for approximating the "radial basis function (RBF)" kernel. However, the variance of RFF is large. Interestingly, as we demonstrate theoretically, the variance can be substantially reduced by a simple normalization step. We name the improved scheme the "normalized RFF (NRFF)". We also propose the "generalized min-max (GMM)" kernel as a measure of data similarity. GMM is positive definite, as evidenced by an associated hashing method named "generalized consistent weighted sampling (GCWS)" which linearizes this nonlinear kernel. We provide an extensive empirical evaluation of the RBF kernel and the GMM kernel on more than 50 publicly available datasets. For a majority of the datasets, the (tuning-free) GMM kernel outperforms the best-tuned RBF kernel. We conduct extensive experiments for comparing the linearized RBF kernel using NRFF with the linearized GMM kernel using GCWS. We observe that, to reach a comparable classification accuracy, GCWS typically requires substantially fewer samples than NRFF, even on datasets where the original RBF kernel outperforms the original GMM kernel. The empirical success of GCWS (compared to NRFF) can also be explained from a theoretical perspective. Firstly, the relative variance (normalized by the squared expectation) of GCWS is substantially smaller than that of NRFF, except in the very high similarity region (where the variances of both methods are close to zero). Secondly, if we make a model assumption on the data, we can show analytically that GCWS exhibits much smaller variance than NRFF for estimating the same object (e.g., the RBF kernel), except in the very high similarity region.


## 1 Introduction

It is popular in machine learning practice to use linear algorithms such as logistic regression or linear SVM. It is known that one can often improve the performance of linear methods by using nonlinear algorithms such as kernel SVMs, if the computational/storage burden can be resolved. In this paper, we introduce an effective measure of data similarity termed the "generalized min-max (GMM)" kernel and the associated hashing method named "generalized consistent weighted sampling (GCWS)", which efficiently converts this nonlinear kernel into a linear kernel. We will also introduce what we call "normalized random Fourier features (NRFF)" and compare it with GCWS.

We start the introduction with the basic linear kernel. Consider two data vectors $u, v \in \mathbb{R}^D$. It is common to use the normalized linear kernel (i.e., the correlation):

$$\rho = \rho(u, v) = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} \qquad (1)$$

This normalization step is in general a recommended practice. For example, when using the LIBLINEAR or LIBSVM packages [6], it is often suggested to first normalize the input data vectors to unit norm. In addition to packages such as LIBLINEAR which implement batch linear algorithms, methods based on stochastic gradient descent (SGD) have become increasingly important, especially for truly large-scale industrial applications [2].

In this paper, the proposed GMM kernel is defined on general data types which can have both negative and positive entries. The basic idea is to first transform the original data into nonnegative data and then compute the min-max kernel [20, 9, 12] on the transformed data.

### 1.1 Data Transformation

Consider the original data vector $u_i$, $i = 1$ to $D$. We define the following transformation, depending on whether an entry is positive or negative:¹

¹ This transformation can be generalized by considering a "center vector" $\mu_i$, $i = 1$ to $D$, such that the two cases in (2) are split at $\mu_i$ rather than at zero ($\tilde{u}_{2i-1} = u_i - \mu_i$ if $u_i > \mu_i$; otherwise $\tilde{u}_{2i} = \mu_i - u_i$). In this paper, we always use $\mu_i = 0$. Note that the same center vector should be used for all data vectors.

$$\begin{cases}\tilde{u}_{2i-1} = u_i,\ \ \tilde{u}_{2i} = 0 & \text{if } u_i > 0\\ \tilde{u}_{2i-1} = 0,\ \ \tilde{u}_{2i} = -u_i & \text{if } u_i \le 0\end{cases} \qquad (2)$$

For example, when $D = 2$ and $u = (-5,\ 3)$, the transformed data vector becomes $\tilde{u} = (0,\ 5,\ 3,\ 0)$.
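As a concrete sketch, transformation (2) takes only a few lines of NumPy (the function name `transform` is our own choice):

```python
import numpy as np

def transform(u):
    """Map a general vector u (length D) to a nonnegative vector of
    length 2D per (2): positive entries fill the odd slots, and the
    magnitudes of nonpositive entries fill the even slots."""
    u = np.asarray(u, dtype=float)
    out = np.zeros(2 * len(u))
    out[0::2] = np.where(u > 0, u, 0.0)    # tilde{u}_{2i-1}
    out[1::2] = np.where(u <= 0, -u, 0.0)  # tilde{u}_{2i}
    return out
```

The transformed vector is always nonnegative, so a min-max-type kernel becomes applicable.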

### 1.2 Generalized Min-Max (GMM) Kernel

Given two data vectors $u$ and $v$, we first transform them into $\tilde{u}$ and $\tilde{v}$ according to (2). Then the generalized min-max (GMM) similarity is defined as

$$\text{GMM}(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i,\ \tilde{v}_i)}{\sum_{i=1}^{2D} \max(\tilde{u}_i,\ \tilde{v}_i)} \qquad (3)$$

We will show in Section 4 that GMM is indeed an effective measure of data similarity through an extensive experimental study on kernel SVM classification.
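To make definition (3) concrete, here is a minimal self-contained NumPy sketch (`gmm_kernel` is our own name):

```python
import numpy as np

def gmm_kernel(u, v):
    """GMM similarity (3): transform both vectors by (2), then take
    the ratio of the elementwise min-sum to the elementwise max-sum."""
    def transform(x):
        x = np.asarray(x, dtype=float)
        t = np.zeros(2 * len(x))
        t[0::2] = np.where(x > 0, x, 0.0)
        t[1::2] = np.where(x <= 0, -x, 0.0)
        return t
    tu, tv = transform(u), transform(v)
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()
```

For identical vectors the ratio is 1; for vectors whose transformed supports are disjoint it is 0.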

It is generally nontrivial to scale nonlinear kernels for large data [3]. In a sense, it is not practically meaningful to discuss nonlinear kernels without knowing how to compute them efficiently (e.g., via hashing). In this paper, we focus on the generalized consistent weighted sampling (GCWS).

### 1.3 Generalized Consistent Weighted Sampling (GCWS)

Algorithm 1 summarizes the "generalized consistent weighted sampling" (GCWS). Given two data vectors $u$ and $v$, we transform them into nonnegative vectors $\tilde{u}$ and $\tilde{v}$ as in (2). We then apply the original "consistent weighted sampling" (CWS) [20, 9] to generate random tuples:

$$\left(i^*_{\tilde{u},j},\ t^*_{\tilde{u},j}\right)\ \ \text{and}\ \ \left(i^*_{\tilde{v},j},\ t^*_{\tilde{v},j}\right),\qquad j = 1, 2, ..., k \qquad (4)$$

where $i^*_{\tilde{u},j} \in \{1, 2, ..., 2D\}$ and $t^*_{\tilde{u},j}$ is unbounded. Following [20, 9], we have the basic probability result.

###### Theorem 1
$$\Pr\left\{\left(i^*_{\tilde{u},j},\ t^*_{\tilde{u},j}\right) = \left(i^*_{\tilde{v},j},\ t^*_{\tilde{v},j}\right)\right\} = \text{GMM}(u, v) \qquad (5)$$

With $k$ samples, we can simply use the averaged indicator function to estimate $\text{GMM}(u, v)$. By properties of the binomial distribution, we know the expectation and variance are

$$E\left[1\{i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \text{ and } t^*_{\tilde{u},j} = t^*_{\tilde{v},j}\}\right] = \text{GMM}(u, v) \qquad (6)$$

$$\text{Var}\left[1\{i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \text{ and } t^*_{\tilde{u},j} = t^*_{\tilde{v},j}\}\right] = \left(1 - \text{GMM}(u, v)\right)\text{GMM}(u, v) \qquad (7)$$

The estimation variance, given $k$ samples, will be $\frac{1}{k}\text{GMM}\left(1 - \text{GMM}\right)$, which vanishes as GMM approaches 0 or 1, or as the sample size $k \to \infty$.
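The collision probability in Theorem 1 can be checked by simulation. The sketch below implements one draw of Ioffe-style consistent weighted sampling on the transformed vector (our reading of the standard recipe in [9], with Gamma(2,1) variables $r_i, c_i$ and uniform $\beta_i$; the name `gcws_sample` is ours). Tying the per-coordinate randomness to a shared seed makes the draw "consistent" across vectors:

```python
import numpy as np

def gcws_sample(u, seed):
    """One GCWS draw (i*, t*) for a general vector u: apply the
    sign-splitting transformation (2), then one consistent weighted
    sample on the nonnegative result. Using the same seed for two
    vectors makes the draws consistent, so the collision probability
    of (i*, t*) equals GMM(u, v) as in Theorem 1."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, dtype=float)
    s = np.zeros(2 * len(u))
    s[0::2] = np.where(u > 0, u, 0.0)
    s[1::2] = np.where(u <= 0, -u, 0.0)
    r = rng.gamma(2.0, 1.0, len(s))      # r_i ~ Gamma(2, 1)
    c = rng.gamma(2.0, 1.0, len(s))      # c_i ~ Gamma(2, 1)
    beta = rng.uniform(size=len(s))      # beta_i ~ Uniform(0, 1)
    with np.errstate(divide="ignore"):
        t = np.floor(np.log(s) / r + beta)     # t_i (-inf at zeros)
        ln_a = np.log(c) - r * (t - beta) - r  # ln a_i
    ln_a[s == 0] = np.inf                # zero-weight coords never win
    i_star = int(np.argmin(ln_a))
    return i_star, int(t[i_star])
```

Averaging the indicator of collisions over $k$ independent seeds estimates $\text{GMM}(u, v)$, with the binomial variance in (7).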

### 1.4 0-bit GCWS for Linearizing GMM Kernel SVM

The so-called "0-bit" GCWS idea is that, based on intensive empirical observations [12], one can safely ignore $t^*$ (which is unbounded) and simply use

$$\Pr\left\{i^*_{\tilde{u},j} = i^*_{\tilde{v},j}\right\} \approx \text{GMM}(u, v) \qquad (8)$$

For each data vector $u$, we obtain $k$ random samples $i^*_{\tilde{u},j}$, $j = 1$ to $k$. We store only the lowest $b$ bits of $i^*$, based on the idea of [18]. We need to view those integers as locations (of the nonzeros) instead of numerical values. For example, when $b = 2$, we should view the lowest $b$ bits of $i^*$ as a vector of length $2^b = 4$: if they encode the value 0, we code it as $[1, 0, 0, 0]$; if they encode the value 3, we code it as $[0, 0, 0, 1]$. We can concatenate all $k$ such vectors into a binary vector of length $2^b \times k$, with exactly $k$ 1's.

For linear methods, the computational cost is largely determined by the number of nonzeros in each data vector, i.e., the $k$ in our case. For the other parameter $b$, we recommend using a small value in practice.
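The coding step above can be sketched as follows (a minimal illustration of the one-hot expansion; `zero_bit_features` is our own name):

```python
import numpy as np

def zero_bit_features(hash_indices, b):
    """Expand k hashed indices i* into a binary vector of length
    2^b * k: within the j-th block of size 2^b, set a single 1 at
    the position given by the lowest b bits of the j-th index."""
    k = len(hash_indices)
    width = 1 << b                       # 2^b slots per hash
    out = np.zeros(k * width, dtype=np.int8)
    for j, i_star in enumerate(hash_indices):
        out[j * width + (int(i_star) & (width - 1))] = 1
    return out
```

A linear method on such concatenated vectors approximates the GMM kernel, since the inner product of two feature vectors counts hash collisions.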

The natural competitor of the GMM kernel is the RBF (radial basis function) kernel, and the competitor of the GCWS hashing method is the RFF (random Fourier feature) algorithm.

## 2 RBF Kernel and Normalized Random Fourier Features (NRFF)

The radial basis function (RBF) kernel is widely used in machine learning and beyond. In this study, for convenience (e.g., parameter tuning), we recommend the following version:

$$\text{RBF}(u, v; \gamma) = e^{-\gamma(1-\rho)} \qquad (9)$$

where $\rho$ is the correlation defined in (1) and $\gamma > 0$ is a crucial tuning parameter. Based on Bochner's Theorem [24], it is known [22] that, if we sample $w \sim \text{uniform}(0, 2\pi)$ and $x \sim N(0, 1)$, $y \sim N(0, 1)$ with $E(xy) = \rho$, i.i.d. across samples, then we have

$$E\left(\sqrt{2}\cos(\sqrt{\gamma}\,x + w)\,\sqrt{2}\cos(\sqrt{\gamma}\,y + w)\right) = e^{-\gamma(1-\rho)} \qquad (10)$$

This provides a nice mechanism for linearizing the RBF kernel, and the RFF method has become popular in machine learning, computer vision, and beyond, e.g.,

[21, 27, 1, 7, 5, 28, 8, 25, 4, 23].
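Identity (10) is easy to verify by Monte Carlo. The sketch below draws correlated standard normals $(x, y)$ with $E(xy) = \rho$ together with a uniform phase $w$, exactly as in the setup above (parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, rho, k = 1.0, 0.5, 200_000
# x, y marginally N(0,1) with E(xy) = rho; w ~ Uniform(0, 2*pi).
x = rng.standard_normal(k)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(k)
w = rng.uniform(0.0, 2 * np.pi, k)
X = np.sqrt(2) * np.cos(np.sqrt(gamma) * x + w)
Y = np.sqrt(2) * np.cos(np.sqrt(gamma) * y + w)
est = float((X * Y).mean())                 # sample version of (10)
target = float(np.exp(-gamma * (1 - rho)))  # e^{-gamma(1-rho)}
```

With 200,000 samples the empirical mean matches $e^{-\gamma(1-\rho)}$ to within a few thousandths.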

###### Theorem 2

Given $x \sim N(0, 1)$, $y \sim N(0, 1)$, $E(xy) = \rho$, and $w \sim \text{uniform}(0, 2\pi)$, we have

$$E\left[\sqrt{2}\cos(\sqrt{\gamma}\,x + w)\,\sqrt{2}\cos(\sqrt{\gamma}\,y + w)\right] = e^{-\gamma(1-\rho)} \qquad (11)$$

$$E\left[\cos(\sqrt{\gamma}\,x)\cos(\sqrt{\gamma}\,y)\right] = \tfrac{1}{2}e^{-\gamma(1-\rho)} + \tfrac{1}{2}e^{-\gamma(1+\rho)} \qquad (12)$$

$$\text{Var}\left[\sqrt{2}\cos(\sqrt{\gamma}\,x + w)\,\sqrt{2}\cos(\sqrt{\gamma}\,y + w)\right] = \tfrac{1}{2} + \tfrac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2 \qquad (13)$$

The proof for (13) can also be found in [26]. One can see that the variance of RFF can be large. Interestingly, the variance can be substantially reduced if we normalize the hashed data, a procedure which we call “normalized RFF (NRFF)”. The theoretical results are presented in Theorem 3.

###### Theorem 3

Consider $k$ i.i.d. samples $(x_j, y_j, w_j)$, $j = 1, 2, ..., k$, where $x_j \sim N(0, 1)$, $y_j \sim N(0, 1)$, $E(x_j y_j) = \rho$, and $w_j \sim \text{uniform}(0, 2\pi)$. Let $X_j = \sqrt{2}\cos(\sqrt{\gamma}\,x_j + w_j)$ and $Y_j = \sqrt{2}\cos(\sqrt{\gamma}\,y_j + w_j)$. As $k \to \infty$, the following asymptotic normality holds:

$$\sqrt{k}\left(\frac{\sum_{j=1}^{k} X_j Y_j}{\sqrt{\sum_{j=1}^{k} X_j^2}\sqrt{\sum_{j=1}^{k} Y_j^2}} - e^{-\gamma(1-\rho)}\right) \overset{D}{\Longrightarrow} N\left(0,\ V_{n,\rho,\gamma}\right) \qquad (14)$$

where

$$V_{n,\rho,\gamma} = V_{\rho,\gamma} - \tfrac{1}{4}e^{-2\gamma(1-\rho)}\left[3 - e^{-4\gamma(1-\rho)}\right] \qquad (15)$$

$$V_{\rho,\gamma} = \tfrac{1}{2} + \tfrac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2 \qquad (16)$$

Obviously, $V_{n,\rho,\gamma} \leq V_{\rho,\gamma}$ (in particular, $V_{n,\rho,\gamma} = 0$ at $\rho = 1$), i.e., the variance of the normalized RFF is (much) smaller than that of the original RFF. Figure 1 plots both variance factors to visualize the improvement due to normalization, which is most significant when $\rho$ is close to 1.

Note that the theoretical results in Theorem 3 are asymptotic (i.e., for large $k$). With $k$ samples, the variance of the original RFF is exactly $V_{\rho,\gamma}/k$; the variance of the normalized RFF (NRFF), however, is only asymptotically $V_{n,\rho,\gamma}/k$. It is therefore important to understand the behavior when $k$ is not large. For this purpose, Figure 2 presents the simulated mean square error (MSE) results for estimating the RBF kernel, confirming that a) the improvement due to normalization can be substantial, and b) the asymptotic variance formula (15) becomes accurate even for small $k$.
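A simulation in the spirit of Figure 2 is straightforward. The sketch below compares the empirical MSE of the plain RFF average against the normalized (NRFF) ratio estimator at a fixed sample size $k$ (the function name and parameter values are our own choices):

```python
import numpy as np

def simulate_mse(rho, gamma, k, trials, rng):
    """Empirical MSE of the plain RFF average and of the normalized
    (NRFF) ratio estimator in (14), both targeting e^{-gamma(1-rho)}."""
    target = np.exp(-gamma * (1 - rho))
    x = rng.standard_normal((trials, k))
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal((trials, k))
    w = rng.uniform(0.0, 2 * np.pi, (trials, k))
    X = np.sqrt(2) * np.cos(np.sqrt(gamma) * x + w)
    Y = np.sqrt(2) * np.cos(np.sqrt(gamma) * y + w)
    rff = (X * Y).mean(axis=1)
    nrff = (X * Y).sum(axis=1) / np.sqrt(
        (X**2).sum(axis=1) * (Y**2).sum(axis=1))
    return ((rff - target) ** 2).mean(), ((nrff - target) ** 2).mean()
```

At $\rho$ close to 1 the gap between the two MSEs is large, consistent with (15).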

Next, we attempt to compare RFF with GCWS. While ultimately we can rely on classification accuracy as a metric for performance, here we compare their variances (Var) relative to their squared expectations ($E^2$) in terms of $\text{Var}/E^2$, as shown in Figure 3. For GCWS, we know $\frac{\text{Var}}{E^2} = \frac{1}{k}\frac{1 - \text{GMM}}{\text{GMM}}$. For the original RFF, we have $\frac{\text{Var}}{E^2} = \frac{1}{k}V_{\rho,\gamma}\,e^{2\gamma(1-\rho)}$, etc.

Figure 3 shows that the relative variance of GCWS is substantially smaller than that of the original RFF and the normalized RFF (NRFF), especially when the similarity is not large. For the very high similarity region (i.e., $\rho \to 1$), the variances of both GCWS and NRFF approach zero.

The results from Figure 3 provide one explanation why later we will observe that, in the classification experiments, GCWS typically needs substantially fewer samples than the normalized RFF in order to achieve similar classification accuracies. Note that for practical data, the similarities among most data points are usually small (i.e., small $\rho$) and hence it is not surprising that GCWS may perform substantially better. Also see Section 3 and Figure 4 for a comparison from the perspective of estimating RBF using GCWS based on a model assumption.

In a sense, this drawback of RFF is expected, due to the nature of random projections. For example, as shown in [16, 17], the linear estimator of the correlation $\rho$ using random projections has variance $\frac{1+\rho^2}{k}$, where $k$ is the number of projections. In order to make the variance small, one has to use many projections (i.e., large $k$).

Proof of Theorem 2:  The following three integrals will be useful in our proof:

$$\int_{-\infty}^{\infty} \cos(cx)\, e^{-x^2/2}\, dx = \sqrt{2\pi}\, e^{-c^2/2}$$

$$\int_{-\infty}^{\infty} \cos(c_1 x)\cos(c_2 x)\, e^{-x^2/2}\, dx = \frac{1}{2}\int_{-\infty}^{\infty} \left[\cos((c_1 + c_2)x) + \cos((c_1 - c_2)x)\right] e^{-x^2/2}\, dx = \frac{\sqrt{2\pi}}{2}\left[e^{-(c_1 + c_2)^2/2} + e^{-(c_1 - c_2)^2/2}\right]$$

$$\int_{-\infty}^{\infty} \sin(c_1 x)\sin(c_2 x)\, e^{-x^2/2}\, dx = \frac{\sqrt{2\pi}}{2}\left[e^{-(c_1 - c_2)^2/2} - e^{-(c_1 + c_2)^2/2}\right]$$

Firstly, we consider integers $b_1, b_2 \geq 1$ and evaluate the following general integral:

$$\begin{aligned}
&E\left(\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right)\\
&= \frac{1}{2\pi}\int_0^{2\pi} E\left(\cos(c_1 x + b_1 t)\cos(c_2 y + b_2 t)\right) dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cos(c_1 x + b_1 t)\cos(c_2 y + b_2 t)\, \frac{1}{2\pi}\frac{1}{\sqrt{1-\rho^2}}\, e^{-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}}\, dx\, dy\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cos(c_1 x + b_1 t)\cos(c_2 y + b_2 t)\, \frac{1}{2\pi}\frac{1}{\sqrt{1-\rho^2}}\, e^{-\frac{x^2 + y^2 - 2\rho xy + \rho^2 x^2 - \rho^2 x^2}{2(1-\rho^2)}}\, dx\, dy\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty} \frac{1}{2\pi}\frac{1}{\sqrt{1-\rho^2}}\, e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\int_{-\infty}^{\infty}\cos(c_2 y + b_2 t)\, e^{-\frac{(y - \rho x)^2}{2(1-\rho^2)}}\, dy\, dx\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\int_{-\infty}^{\infty}\cos\left(c_2 y\sqrt{1-\rho^2} + c_2\rho x + b_2 t\right) e^{-y^2/2}\, dy\, dx\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dx\int_{-\infty}^{\infty}\cos\left(c_2 y\sqrt{1-\rho^2}\right) e^{-y^2/2}\, dy\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, \sqrt{2\pi}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\, dx\, dt\\
&= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\int_0^{2\pi}\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dx\, dt
\end{aligned}$$

Note that

$$\begin{aligned}
\int_0^{2\pi}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dt
&= \int_0^{2\pi}\cos(c_1 x)\cos(b_1 t)\cos(c_2\rho x)\cos(b_2 t)\, dt + \int_0^{2\pi}\sin(c_1 x)\sin(b_1 t)\sin(c_2\rho x)\sin(b_2 t)\, dt\\
&\quad - \int_0^{2\pi}\cos(c_1 x)\cos(b_1 t)\sin(c_2\rho x)\sin(b_2 t)\, dt - \int_0^{2\pi}\sin(c_1 x)\sin(b_1 t)\cos(c_2\rho x)\cos(b_2 t)\, dt
\end{aligned}$$

When $b_1 \neq b_2$, we have

$$\int_0^{2\pi}\cos(b_1 t)\cos(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi}\cos((b_1 - b_2)t) + \cos((b_1 + b_2)t)\, dt = 0$$

$$\int_0^{2\pi}\sin(b_1 t)\sin(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi}\cos((b_1 - b_2)t) - \cos((b_1 + b_2)t)\, dt = 0$$

If $b_1 = b_2 \geq 1$, then

$$\int_0^{2\pi}\cos(b_1 t)\cos(b_2 t)\, dt = \int_0^{2\pi}\sin(b_1 t)\sin(b_2 t)\, dt = \pi$$

In addition, for any $b_1, b_2$, we always have

$$\int_0^{2\pi}\sin(b_1 t)\cos(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi}\sin((b_1 - b_2)t) + \sin((b_1 + b_2)t)\, dt = 0$$

Thus, only when $b_1 = b_2$ do we have

$$\int_0^{2\pi}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dt = \pi\cos(c_1 x)\cos(c_2\rho x) + \pi\sin(c_1 x)\sin(c_2\rho x) = \pi\cos((c_1 - c_2\rho)x)$$

Otherwise, the integral is zero. Therefore, when $b_1 = b_2$, we have

$$\begin{aligned}
E\left(\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right)
&= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\int_0^{2\pi}\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dx\, dt\\
&= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\, \pi\cos((c_1 - c_2\rho)x)\, dx\\
&= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\, \pi\sqrt{2\pi}\, e^{-(c_1 - c_2\rho)^2/2}\\
&= \frac{1}{2}\, e^{-\frac{c_1^2 + c_2^2 - 2c_1 c_2\rho}{2}}\\
&= \frac{1}{2}\, e^{-c^2(1-\rho)}, \quad \text{when } c_1 = c_2 = c
\end{aligned}$$

This completes the proof of the first moment. Next, using the following fact (the $\sin(2cx)$ cross term vanishes by symmetry in $x$)

$$\begin{aligned}
E\cos(2cx + 2w) &= \frac{1}{2\pi}\int_0^{2\pi}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\cos(2cx + 2t)\, e^{-x^2/2}\, dx\, dt\\
&= \frac{1}{2\pi}\int_0^{2\pi}\cos(2t)\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\cos(2cx)\, e^{-x^2/2}\, dx\, dt\\
&= \frac{1}{2\pi}\, e^{-2c^2}\int_0^{2\pi}\cos(2t)\, dt = 0
\end{aligned}$$

we are ready to compute the second moment

$$\begin{aligned}
E\left[\cos(cx + w)\cos(cy + w)\right]^2
&= \frac{1}{4}E\left[\cos(2cx + 2w)\cos(2cy + 2w) + \cos(2cx + 2w) + \cos(2cy + 2w)\right] + \frac{1}{4}\\
&= \frac{1}{4}E\left[\cos(2cx + 2w)\cos(2cy + 2w)\right] + \frac{1}{4}\\
&= \frac{1}{8}\, e^{-4c^2(1-\rho)} + \frac{1}{4}
\end{aligned}$$

and the variance

$$\text{Var}\left[\cos(cx + w)\cos(cy + w)\right] = \frac{1}{8}\, e^{-4c^2(1-\rho)} + \frac{1}{4} - \frac{1}{4}\, e^{-2c^2(1-\rho)}$$

Finally, we prove the first moment without the $w$ term:

$$\begin{aligned}
E\left(\cos(cx)\cos(cy)\right)
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cos(cx)\cos(cy)\, \frac{1}{2\pi}\frac{1}{\sqrt{1-\rho^2}}\, e^{-\frac{x^2 + y^2 - 2\rho xy + \rho^2 x^2 - \rho^2 x^2}{2(1-\rho^2)}}\, dx\, dy\\
&= \int_{-\infty}^{\infty}\frac{1}{2\pi}\frac{1}{\sqrt{1-\rho^2}}\, e^{-\frac{x^2}{2}}\cos(cx)\int_{-\infty}^{\infty}\cos(cy)\, e^{-\frac{(y - \rho x)^2}{2(1-\rho^2)}}\, dy\, dx\\
&= \int_{-\infty}^{\infty}\frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(cx)\int_{-\infty}^{\infty}\cos\left(cy\sqrt{1-\rho^2} + c\rho x\right) e^{-y^2/2}\, dy\, dx\\
&= \int_{-\infty}^{\infty}\frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(cx)\cos(c\rho x)\, dx\int_{-\infty}^{\infty}\cos\left(cy\sqrt{1-\rho^2}\right) e^{-y^2/2}\, dy\\
&= \int_{-\infty}^{\infty}\frac{1}{2\pi}\, e^{-\frac{x^2}{2}}\cos(cx)\cos(c\rho x)\, \sqrt{2\pi}\, e^{-\frac{c^2(1-\rho^2)}{2}}\, dx\\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{c^2(1-\rho^2)}{2}}\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\cos(cx)\cos(c\rho x)\, dx\\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{c^2(1-\rho^2)}{2}}\, \frac{\sqrt{2\pi}}{2}\left[e^{-\frac{c^2(1-\rho)^2}{2}} + e^{-\frac{c^2(1+\rho)^2}{2}}\right]\\
&= \frac{1}{2}\, e^{-c^2(1-\rho)} + \frac{1}{2}\, e^{-c^2(1+\rho)}
\end{aligned}$$

This completes the proof of Theorem 2.
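As a numerical sanity check on Theorem 2 (the unnormalized variance derived above, equivalent to (13) after the $\sqrt{2}$ scaling), a quick Monte Carlo with our own parameter choices $c = 1.2$, $\rho = 0.3$:

```python
import numpy as np

rng = np.random.default_rng(2)
c, rho, n = 1.2, 0.3, 400_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
w = rng.uniform(0.0, 2 * np.pi, n)
prod = np.cos(c * x + w) * np.cos(c * y + w)
# Var[cos(cx+w)cos(cy+w)] = (1/8)e^{-4c^2(1-rho)} + 1/4 - (1/4)e^{-2c^2(1-rho)}
theory = (np.exp(-4 * c**2 * (1 - rho)) / 8 + 0.25
          - np.exp(-2 * c**2 * (1 - rho)) / 4)
empirical = float(prod.var())
```

The empirical variance agrees with the closed form to within sampling error.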

Proof of Theorem 3:   We will use some of the results from the proof of Theorem 2. Define

$$X_j = \sqrt{2}\cos(\sqrt{\gamma}\,x_j + w_j),\qquad Y_j = \sqrt{2}\cos(\sqrt{\gamma}\,y_j + w_j),\qquad Z_k = \frac{\sum_{j=1}^{k} X_j Y_j}{\sqrt{\sum_{j=1}^{k} X_j^2}\sqrt{\sum_{j=1}^{k} Y_j^2}}$$

From Theorem 2, it is easy to see that, as $k \to \infty$, we have

$$\frac{1}{k}\sum_{j=1}^{k} X_j^2 \rightarrow E(X_j^2) = e^{-\gamma(1-1)} = 1,\ \text{a.s.},\qquad \frac{1}{k}\sum_{j=1}^{k} Y_j^2 \rightarrow 1,\ \text{a.s.}$$

$$Z_k = \frac{\frac{1}{k}\sum_{j=1}^{k} X_j Y_j}{\sqrt{\frac{1}{k}\sum_{j=1}^{k} X_j^2}\sqrt{\frac{1}{k}\sum_{j=1}^{k} Y_j^2}} \rightarrow e^{-\gamma(1-\rho)} = Z_\infty,\ \text{a.s.}$$

We express the deviation as

$$\begin{aligned}
Z_k - Z_\infty
&= \frac{\frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty + Z_\infty}{\sqrt{\frac{1}{k}\sum_{j=1}^{k} X_j^2}\sqrt{\frac{1}{k}\sum_{j=1}^{k} Y_j^2}} - Z_\infty\\
&= \frac{\frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty}{\sqrt{\frac{1}{k}\sum_{j=1}^{k} X_j^2}\sqrt{\frac{1}{k}\sum_{j=1}^{k} Y_j^2}} + Z_\infty\,\frac{1 - \sqrt{\frac{1}{k}\sum_{j=1}^{k} X_j^2}\sqrt{\frac{1}{k}\sum_{j=1}^{k} Y_j^2}}{\sqrt{\frac{1}{k}\sum_{j=1}^{k} X_j^2}\sqrt{\frac{1}{k}\sum_{j=1}^{k} Y_j^2}}\\
&= \frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty + Z_\infty\,\frac{1 - \frac{1}{k}\sum_{j=1}^{k} X_j^2\,\frac{1}{k}\sum_{j=1}^{k} Y_j^2}{2} + O_P(1/k)\\
&= \frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty + Z_\infty\,\frac{1 - \frac{1}{k}\sum_{j=1}^{k} X_j^2}{2} + Z_\infty\,\frac{1 - \frac{1}{k}\sum_{j=1}^{k} Y_j^2}{2} + O_P(1/k)
\end{aligned}$$

Note that if $a \to 1$ and $b \to 1$, then

$$1 - ab = 1 - \left(1 - (1 - a)\right)\left(1 - (1 - b)\right) = (1 - a) + (1 - b) - (1 - a)(1 - b)$$

and we can ignore the higher-order term $(1 - a)(1 - b)$.

Therefore, to analyze the asymptotic variance, it suffices to study the following expectation

$$\begin{aligned}
E\left(XY - Z_\infty + Z_\infty\,\frac{1 - X^2}{2} + Z_\infty\,\frac{1 - Y^2}{2}\right)^2
&= E\left(XY - Z_\infty(X^2 + Y^2)/2\right)^2\\
&= E(X^2 Y^2) + Z_\infty^2\, E(X^4 + Y^4 + 2X^2 Y^2)/4 - Z_\infty E(X^3 Y) - Z_\infty E(XY^3)
\end{aligned}$$

which can be obtained from the results in the proof of Theorem 2. In particular, if $b_1 = b_2$, then

$$E\left(\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right) = \frac{1}{2}\, e^{-\frac{c_1^2 + c_2^2 - 2c_1 c_2\rho}{2}}$$

Otherwise, the expectation is zero. We can now compute

 E[cos(cx+w)3cos(c