## 1 Introduction

It is popular in machine learning practice to use linear algorithms such as logistic regression or linear SVM. It is known that one can often improve the performance of linear methods by using nonlinear algorithms such as kernel SVMs, if the computational/storage burden can be resolved. In this paper, we introduce an effective measure of data similarity termed the “generalized min-max (GMM)” kernel and the associated hashing method named “generalized consistent weighted sampling (GCWS)”, which efficiently converts this nonlinear kernel into a linear kernel. Moreover, we will also introduce what we call “normalized random Fourier features (NRFF)” and compare it with GCWS.

We start the introduction with the basic linear kernel. Consider two data vectors $u, v \in \mathbb{R}^D$. It is common to use the normalized linear kernel (i.e., the correlation):

$$\rho = \rho(u, v) = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} \tag{1}$$

This normalization step is in general a recommended practice. For example, when using the LIBLINEAR or LIBSVM packages [6], it is often suggested to first normalize the input data vectors to unit norm. In addition to packages such as LIBLINEAR which implement batch linear algorithms, methods based on stochastic gradient descent (SGD) have become increasingly important, especially for truly large-scale industrial applications [2].

In this paper, the proposed GMM kernel is defined on general data types which can have both negative and positive entries. The basic idea is to first transform the original data into nonnegative data and then compute the min-max kernel [20, 9, 12] on the transformed data.

### 1.1 Data Transformation

Consider the original data vector $u_i$, $i = 1$ to $D$. We define the following transformation, depending on whether an entry is positive or negative:¹

¹ This transformation can be generalized by considering a “center vector” $\mu_i$, $i = 1$ to $D$, such that the transformation is applied to the centered entries $u_i - \mu_i$.

$$\tilde{u}_{2i-1} = \begin{cases} u_i & \text{if } u_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \tilde{u}_{2i} = \begin{cases} 0 & \text{if } u_i > 0 \\ -u_i & \text{otherwise} \end{cases} \tag{2}$$

For example, when $D = 2$ and $u = [-5,\ 3]$, the transformed data vector becomes $\tilde{u} = [0,\ 5,\ 3,\ 0]$.
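As a quick sketch (assuming numpy; the helper name `transform` is ours, not the paper's notation), the transformation in (2) can be written as:

```python
import numpy as np

def transform(u):
    """Map a general vector of length D to a nonnegative vector of
    length 2D as in (2): positive entries keep their value in the odd
    slots; negative entries contribute their magnitude to the even slots."""
    u = np.asarray(u, dtype=float)
    t = np.zeros(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)   # tilde-u_{2i-1}: positive parts
    t[1::2] = np.maximum(-u, 0.0)  # tilde-u_{2i}: magnitudes of negative parts
    return t

print(transform([-5, 3]))  # [0. 5. 3. 0.]
```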

### 1.2 Generalized Min-Max (GMM) Kernel

Given two data vectors $u$ and $v$, we first transform them into $\tilde{u}$ and $\tilde{v}$ according to (2). Then the generalized min-max (GMM) similarity is defined as

$$GMM(u, v) = \frac{\sum_{i=1}^{2D} \min\left(\tilde{u}_i, \tilde{v}_i\right)}{\sum_{i=1}^{2D} \max\left(\tilde{u}_i, \tilde{v}_i\right)} \tag{3}$$

We will show in Section 4 that GMM is indeed an effective measure of data similarity through an extensive experimental study on kernel SVM classification.
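For concreteness, a direct (non-hashed) evaluation of (3) can be sketched as follows, assuming numpy (the function name `gmm_kernel` is ours):

```python
import numpy as np

def gmm_kernel(u, v):
    """GMM similarity (3): transform both vectors to nonnegative form
    as in (2), then divide the sum of coordinate-wise minima by the
    sum of coordinate-wise maxima."""
    def transform(x):
        x = np.asarray(x, dtype=float)
        t = np.zeros(2 * len(x))
        t[0::2] = np.maximum(x, 0.0)
        t[1::2] = np.maximum(-x, 0.0)
        return t
    tu, tv = transform(u), transform(v)
    return float(np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum())

print(gmm_kernel([-5, 3], [-5, 3]))  # 1.0
print(gmm_kernel([1, 0], [0, 1]))    # 0.0
```

Direct evaluation costs $O(D)$ per pair of vectors; the point of the hashing method below is to avoid computing all pairwise kernel values.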

It is generally nontrivial to scale nonlinear kernels for large data [3]. In a sense, it is not practically meaningful to discuss nonlinear kernels without knowing how to compute them efficiently (e.g., via hashing). In this paper, we focus on the generalized consistent weighted sampling (GCWS).

### 1.3 Generalized Consistent Weighted Sampling (GCWS)

Algorithm 1 summarizes the “generalized consistent weighted sampling” (GCWS). Given two data vectors $u$ and $v$, we transform them into nonnegative vectors $\tilde{u}$ and $\tilde{v}$ as in (2). We then apply the original “consistent weighted sampling” (CWS) [20, 9] to generate random tuples:

$$\left(i^*_{\tilde{u}},\ t^*_{\tilde{u}}\right) \quad \text{and} \quad \left(i^*_{\tilde{v}},\ t^*_{\tilde{v}}\right) \tag{4}$$

where $i^* \in \{1, 2, \dots, 2D\}$ and $t^*$ is unbounded. Following [20, 9], we have the basic probability result.
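Since Algorithm 1 itself is not reproduced in this excerpt, the following is a minimal sketch of one GCWS draw, assuming Ioffe's CWS sampling scheme [9] on top of the transformation (2); the function name `gcws_sample` and the use of a shared seed to realize the consistent randomness are our illustration:

```python
import numpy as np

def gcws_sample(u, seed):
    """One GCWS sample (i*, t*): transform u as in (2) into a
    nonnegative vector, then perform one consistent weighted
    sampling (CWS) draw on it. Using the same `seed` for every
    vector makes the underlying randomness shared, which is what
    yields the collision probability in (5)."""
    u = np.asarray(u, dtype=float)
    t = np.zeros(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)   # positive parts
    t[1::2] = np.maximum(-u, 0.0)  # magnitudes of negative parts

    rng = np.random.default_rng(seed)  # shared across all vectors
    r = rng.gamma(2.0, 1.0, size=len(t))
    c = rng.gamma(2.0, 1.0, size=len(t))
    beta = rng.uniform(0.0, 1.0, size=len(t))

    nz = t > 0                          # CWS only looks at nonzero weights
    tk = np.floor(np.log(t[nz]) / r[nz] + beta[nz])
    ln_y = r[nz] * (tk - beta[nz])
    ln_a = np.log(c[nz]) - ln_y - r[nz]
    j = np.argmin(ln_a)
    return int(np.flatnonzero(nz)[j]), int(tk[j])

# Collision frequency of (i*, t*) across k draws estimates GMM(u, v)
u, v = [-5, 3, 1], [-4, 2, 2]
k = 2000
hits = sum(gcws_sample(u, s) == gcws_sample(v, s) for s in range(k))
print(hits / k)  # close to GMM(u, v) = 0.7
```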

###### Theorem 1

$$\Pr\left( i^*_{\tilde{u}} = i^*_{\tilde{v}} \ \text{and}\ t^*_{\tilde{u}} = t^*_{\tilde{v}} \right) = GMM(u, v) \tag{5}$$

With $k$ samples, we can simply use the averaged indicator

$$\hat{GMM} = \frac{1}{k} \sum_{j=1}^{k} 1\left\{ i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \ \text{and}\ t^*_{\tilde{u},j} = t^*_{\tilde{v},j} \right\}$$

to estimate $GMM(u, v)$. By the property of the binomial distribution, we know the expectation ($E$) and variance ($Var$) are

$$E\left(\hat{GMM}\right) = GMM \tag{6}$$

$$Var\left(\hat{GMM}\right) = \frac{GMM\left(1 - GMM\right)}{k} \tag{7}$$

The estimation variance, given $k$ samples, will be $GMM(1-GMM)/k$, which vanishes as GMM approaches 0 or 1, or as the sample size $k \rightarrow \infty$.

### 1.4 0-bit GCWS for Linearizing GMM Kernel SVM

The so-called “0-bit” GCWS idea is that, based on intensive empirical observations [12], one can safely ignore $t^*$ (which is unbounded) and simply use

$$\Pr\left( i^*_{\tilde{u}} = i^*_{\tilde{v}} \right) \approx GMM(u, v) \tag{8}$$

For each data vector $u$, we obtain $k$ random samples $i^*_j$, $j = 1$ to $k$. We store only the lowest $b$ bits of $i^*$, based on the idea of [18]. We need to view those integers as locations (of the nonzeros) instead of numerical values. For example, when $b = 2$, we should view the stored value $z$ (the lowest $b$ bits of $i^*$) as a vector of length $2^b = 4$. If $z = 0$, then we code it as $[1,\ 0,\ 0,\ 0]$; if $z = 3$, we code it as $[0,\ 0,\ 0,\ 1]$. We can concatenate all $k$ such vectors into a binary vector of length $2^b \times k$, with exactly $k$ 1’s.

For linear methods, the computational cost is largely determined by the number of nonzeros in each data vector, i.e., the $k$ nonzeros in our case. For the other parameter $b$, a small value is recommended.
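As an illustration of this coding step (assuming numpy; the helper name `zero_bit_features` is ours):

```python
import numpy as np

def zero_bit_features(i_stars, b=2):
    """Expand k hashed values into a binary vector of length 2^b * k
    with exactly k ones: keep only the lowest b bits of each i*, and
    treat that value as the location of a 1 within its block of 2^b."""
    width = 1 << b
    out = np.zeros(len(i_stars) * width, dtype=np.int8)
    for j, z in enumerate(i_stars):
        out[j * width + (z & (width - 1))] = 1
    return out

print(zero_bit_features([0, 3], b=2))  # [1 0 0 0 0 0 0 1]
```

The resulting sparse binary vectors can be fed directly to a linear solver such as LIBLINEAR.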

The natural competitor of the GMM kernel is the RBF (radial basis function) kernel, and the competitor of the GCWS hashing method is the RFF (random Fourier feature) algorithm.

## 2 RBF Kernel and Normalized Random Fourier Features (NRFF)

The radial basis function (RBF) kernel is widely used in machine learning and beyond. In this study, for convenience (e.g., parameter tuning), we recommend the following version:

$$RBF(u, v; \gamma) = e^{-\gamma(1-\rho)} \tag{9}$$

where $\rho$ is the correlation defined in (1) and $\gamma > 0$ is a crucial tuning parameter. Based on Bochner’s Theorem [24], it is known [22] that, if we sample $w \sim uniform(0, 2\pi)$ and $r_i \sim N(0, 1)$ i.i.d., and let $x = \sum_{i=1}^{D} u_i r_i$, $y = \sum_{i=1}^{D} v_i r_i$, where $\|u\|_2 = \|v\|_2 = 1$, then we have

$$E\left( \sqrt{2}\cos\left(\sqrt{\gamma}\,x + w\right)\, \sqrt{2}\cos\left(\sqrt{\gamma}\,y + w\right) \right) = e^{-\gamma(1-\rho)} \tag{10}$$

This provides a nice mechanism for linearizing the RBF kernel, and the RFF method has become popular in machine learning, computer vision, and beyond, e.g., [21, 27, 1, 7, 5, 28, 8, 25, 4, 23].

###### Theorem 2

Given $\|u\|_2 = \|v\|_2 = 1$, $x = \sum_{i=1}^{D} u_i r_i$, $y = \sum_{i=1}^{D} v_i r_i$ with $r_i \sim N(0,1)$ i.i.d., and $w \sim uniform(0, 2\pi)$, we have

$$E\left(\sqrt{2}\cos\left(\sqrt{\gamma}\,x+w\right)\,\sqrt{2}\cos\left(\sqrt{\gamma}\,y+w\right)\right) = e^{-\gamma(1-\rho)} \tag{11}$$

$$E\left(\sqrt{2}\cos\left(\sqrt{\gamma}\,x+w\right)\,\sqrt{2}\cos\left(\sqrt{\gamma}\,y+w\right)\right)^2 = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} \tag{12}$$

$$Var\left(\sqrt{2}\cos\left(\sqrt{\gamma}\,x+w\right)\,\sqrt{2}\cos\left(\sqrt{\gamma}\,y+w\right)\right) = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - e^{-2\gamma(1-\rho)} \tag{13}$$

The proof for (13) can also be found in [26]. One can see that the variance of RFF can be large. Interestingly, the variance can be substantially reduced if we normalize the hashed data, a procedure which we call “normalized RFF (NRFF)”. The theoretical results are presented in Theorem 3.

###### Theorem 3

Consider $k$ iid samples $(x_j, y_j, w_j)$, $j = 1$ to $k$, where $x_j = \sum_{i=1}^{D} u_i r_{ij}$, $y_j = \sum_{i=1}^{D} v_i r_{ij}$, $r_{ij} \sim N(0,1)$ i.i.d., $w_j \sim uniform(0, 2\pi)$, and $\|u\|_2 = \|v\|_2 = 1$. Let $X_j = \sqrt{2}\cos\left(\sqrt{\gamma}\,x_j + w_j\right)$ and $Y_j = \sqrt{2}\cos\left(\sqrt{\gamma}\,y_j + w_j\right)$. As $k \rightarrow \infty$, the following asymptotic normality holds:

$$\sqrt{k}\left( \frac{\sum_{j=1}^{k} X_j Y_j}{\sqrt{\sum_{j=1}^{k} X_j^2}\,\sqrt{\sum_{j=1}^{k} Y_j^2}} - e^{-\gamma(1-\rho)} \right) \overset{D}{\Longrightarrow} N\left(0,\ V_n\right) \tag{14}$$

where

$$V_n = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - \frac{7}{4}e^{-2\gamma(1-\rho)} + \frac{1}{4}e^{-6\gamma(1-\rho)} \tag{15}$$

$$V - V_n = \frac{1}{4}e^{-2\gamma(1-\rho)}\left(3 - e^{-4\gamma(1-\rho)}\right) > 0, \quad \text{where } V = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - e^{-2\gamma(1-\rho)} \tag{16}$$

Obviously, $V_n < V$ (in particular, $V_n = 0$ at $\rho = 1$), i.e., the variance of the normalized RFF is (much) smaller than that of the original RFF. Figure 1 compares the two variance factors to visualize the improvement due to normalization, which is most significant when $\rho$ is close to 1.

Note that the theoretical results in Theorem 3 are asymptotic (i.e., for large $k$). With $k$ samples, the variance of the original RFF is exactly $V/k$; however, the variance of the normalized RFF (NRFF) is written as $V_n/k + O(1/k^2)$. It is important to understand the behavior when $k$ is not large. For this purpose, Figure 2 presents the simulated mean square error (MSE) results for estimating the RBF kernel $e^{-\gamma(1-\rho)}$, confirming that a): the improvement due to normalization can be substantial, and b): the asymptotic variance formula (15) becomes accurate even for fairly small $k$.
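To make the two estimators concrete, here is a small simulation sketch in numpy (function and variable names are ours); it estimates $e^{-\gamma(1-\rho)}$ with the plain RFF average and with the normalized (NRFF) version:

```python
import numpy as np

def rff_features(u, gamma, k, seed=0):
    """k random Fourier features for a unit-norm vector u, as in (10):
    the j-th feature is sqrt(2)*cos(sqrt(gamma)*<u, r_j> + w_j) with
    r_j ~ N(0, I) and w_j ~ Uniform(0, 2*pi), shared across vectors
    via the common seed."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(k, len(u)))
    w = rng.uniform(0.0, 2.0 * np.pi, size=k)
    return np.sqrt(2.0) * np.cos(np.sqrt(gamma) * (R @ u) + w)

rng = np.random.default_rng(1)
u = rng.normal(size=50); u /= np.linalg.norm(u)
v = rng.normal(size=50); v /= np.linalg.norm(v)
rho, gamma, k = float(u @ v), 1.0, 50_000

fu = rff_features(u, gamma, k)
fv = rff_features(v, gamma, k)
rff = float(np.mean(fu * fv))                                  # plain RFF estimate
nrff = float(fu @ fv / (np.linalg.norm(fu) * np.linalg.norm(fv)))  # normalized (NRFF)
print(rff, nrff, np.exp(-gamma * (1.0 - rho)))  # both approach the RBF value
```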

Next, we attempt to compare RFF with GCWS. While ultimately we can rely on classification accuracy as a metric of performance, here we compare their variances ($Var$) relative to their squared expectations ($E^2$), i.e., in terms of $Var/E^2$, as shown in Figure 3. For GCWS, we know $\frac{Var}{E^2} = \frac{1}{k}\frac{1-GMM}{GMM}$. For the original RFF, we have $\frac{Var}{E^2} = \frac{1}{k}\left(e^{2\gamma(1-\rho)} + \frac{1}{2}e^{-2\gamma(1-\rho)} - 1\right)$, etc.

Figure 3 shows that the relative variance of GCWS is substantially smaller than that of the original RFF and the normalized RFF (NRFF), especially when the similarity is not large. For the very high similarity region (i.e., $\rho \rightarrow 1$), the variances of both GCWS and NRFF approach zero.

The results from Figure 3 provide one explanation for why, later in the classification experiments, GCWS typically needs substantially fewer samples than the normalized RFF in order to achieve similar classification accuracies. Note that for practical data, the similarities among most data points are usually small (i.e., small $\rho$), and hence it is not surprising that GCWS may perform substantially better. Also see Section 3 and Figure 4 for a comparison from the perspective of estimating RBF using GCWS based on a model assumption.

In a sense, this drawback of RFF is expected, due to the nature of random projections. For example, as shown in [16, 17], the linear estimator of the correlation $\rho$ using random projections has variance $\frac{1+\rho^2}{k}$, where $k$ is the number of projections. In order to make the variance small, one will have to use many projections (i.e., large $k$).

Proof of Theorem 2: The following integrals will be useful in our proof:

$$\frac{1}{2\pi}\int_0^{2\pi} \cos(a+w)\cos(b+w)\, dw = \frac{1}{2}\cos(a-b), \qquad \frac{1}{2\pi}\int_0^{2\pi} \cos^2(a+w)\cos^2(b+w)\, dw = \frac{1}{4} + \frac{1}{8}\cos\left(2(a-b)\right).$$

More generally, for integers $m, n$, the integral $\frac{1}{2\pi}\int_0^{2\pi} \cos^m(a+w)\cos^n(b+w)\, dw$ is nonzero only when $m+n$ is even, because expanding the powers into cosines of multiple angles leaves matching frequencies only in that case. In addition, since $x$ and $y$ are jointly normal with unit variances and correlation $\rho$, we have $x - y \sim N(0,\ 2(1-\rho))$, and hence for any constant $s$,

$$E\cos\left(s(x-y)\right) = e^{-s^2(1-\rho)}.$$

For the first moment, taking the expectation over $w$ first and then over $(x, y)$, we obtain

$$E\left(\sqrt{2}\cos\left(\sqrt{\gamma}\,x+w\right)\sqrt{2}\cos\left(\sqrt{\gamma}\,y+w\right)\right) = E\cos\left(\sqrt{\gamma}(x-y)\right) = e^{-\gamma(1-\rho)}.$$

This completes the proof of the first moment. Next, using the integral for squared cosines, we are ready to compute the second moment

$$E\left(\sqrt{2}\cos\left(\sqrt{\gamma}\,x+w\right)\sqrt{2}\cos\left(\sqrt{\gamma}\,y+w\right)\right)^2 = 4\left(\frac{1}{4} + \frac{1}{8}E\cos\left(2\sqrt{\gamma}(x-y)\right)\right) = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)},$$

and the variance

$$Var = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - e^{-2\gamma(1-\rho)}.$$

This completes the proof of Theorem 2.

From Theorem 2, it is easy to see that, as $k \rightarrow \infty$, we have

$$A = \frac{1}{k}\sum_{j=1}^{k} X_j Y_j \rightarrow e^{-\gamma(1-\rho)} = m, \qquad B = \frac{1}{k}\sum_{j=1}^{k} X_j^2 \rightarrow 1, \qquad C = \frac{1}{k}\sum_{j=1}^{k} Y_j^2 \rightarrow 1.$$

We express the deviation as

$$\frac{A}{\sqrt{BC}} - m = (A - m) - \frac{m}{2}(B-1) - \frac{m}{2}(C-1) + \text{higher-order terms}.$$

Note that if $B - 1 = O_P\left(k^{-1/2}\right)$ and $C - 1 = O_P\left(k^{-1/2}\right)$, then the remainder is $O_P(1/k)$, and we can ignore the higher-order term.

Therefore, to analyze the asymptotic variance, it suffices to study the following expectation

$$E\left( (A-m) - \frac{m}{2}(B-1) - \frac{m}{2}(C-1) \right)^2,$$

which can be obtained from the results in the proof of Theorem 2. In particular, if the cosine exponents sum to an even number, then the corresponding moment is nonzero, e.g., $E\left(X_j^3 Y_j\right) = \frac{3}{2}m$ and $E\left(X_j^2 Y_j^2\right) = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)}$. Otherwise the moment is $0$. We can now compute

$$k\, Var\left(\frac{A}{\sqrt{BC}}\right) \rightarrow V_n = 1 + \frac{1}{2}e^{-4\gamma(1-\rho)} - \frac{7}{4}e^{-2\gamma(1-\rho)} + \frac{1}{4}e^{-6\gamma(1-\rho)}.$$
