The “generalized min-max (GMM)” kernel was introduced for large-scale search and machine learning, owing to its efficient linearization via either hashing [11, 3, 8] or the Nystrom method [10]. For defining the GMM kernel, the first step is a simple transformation on the original data. Consider, for example, an original data vector $u_i$, $i = 1$ to $D$. We define the following transformation, depending on whether an entry is positive or negative:

$$\tilde{u}_{2i-1} = u_i,\ \tilde{u}_{2i} = 0 \quad \text{if } u_i > 0; \qquad \tilde{u}_{2i-1} = 0,\ \tilde{u}_{2i} = -u_i \quad \text{if } u_i \leq 0 \quad (1)$$

For example, when $D = 2$ and $u = (-2, 3)$, the transformed data vector becomes $\tilde{u} = (0, 2, 3, 0)$. The GMM kernel is defined as follows:

$$GMM(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)}{\sum_{i=1}^{2D} \max(\tilde{u}_i, \tilde{v}_i)}$$
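As a concrete illustration (a minimal sketch, not the paper's implementation; function names are ours), the transformation and the GMM kernel can be computed as follows:

```python
import numpy as np

def transform(u):
    """Split each entry into a positive part and a negative part,
    doubling the dimensionality, as in the transformation (1)."""
    u = np.asarray(u, dtype=float)
    t = np.empty(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)   # positive parts
    t[1::2] = np.maximum(-u, 0.0)  # magnitudes of negative parts
    return t

def gmm_kernel(u, v):
    """GMM(u, v) = sum_i min(u~_i, v~_i) / sum_i max(u~_i, v~_i)."""
    tu, tv = transform(u), transform(v)
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()
```

For instance, `transform([-2, 3])` yields `(0, 2, 3, 0)`, matching the example above, and `gmm_kernel(u, u)` is always 1.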
Even though the GMM kernel has no tuning parameter, it performs surprisingly well for classification tasks, as empirically demonstrated in [9] (also see Table 1 and Table 2), when compared to the best-tuned radial basis function (RBF) kernel:

$$RBF(u, v; \gamma) = e^{-\gamma(1-\rho)}, \quad \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\sqrt{\sum_{i=1}^{D} v_i^2}}$$

where $\gamma > 0$ is a crucial tuning parameter.
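In this correlation-based form (a sketch under the assumption that $\rho$ is the cosine similarity, as reconstructed above), the RBF kernel is straightforward to compute:

```python
import numpy as np

def rbf_kernel(u, v, gamma):
    """RBF variant used here: exp(-gamma * (1 - rho)),
    where rho is the correlation (cosine similarity) of u and v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    rho = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.exp(-gamma * (1.0 - rho))
```

Note that $RBF(u, u; \gamma) = 1$ for any $\gamma$, since $\rho = 1$ in that case.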
Furthermore, the (nonlinear) GMM kernel can be efficiently linearized via hashing [11, 3, 8] (or the Nystrom method [10]). This means we can use the linearized GMM kernel for large-scale machine learning tasks at essentially the cost of linear learning.
Naturally, one would ask whether we can improve this (tuning-free) GMM kernel by introducing tuning parameters. For example, we can define the following “exponentiated-GMM” (eGMM) kernel:

$$eGMM(u, v; \gamma) = e^{-\gamma(1 - GMM(u, v))}, \quad \gamma > 0$$

and the “powered-GMM” (pGMM) kernel:

$$pGMM(u, v; p) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i^p, \tilde{v}_i^p)}{\sum_{i=1}^{2D} \max(\tilde{u}_i^p, \tilde{v}_i^p)}$$

Of course, we can also combine these two kernels:

$$epGMM(u, v; p, \gamma) = e^{-\gamma(1 - pGMM(u, v; p))}$$
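A minimal sketch of the three tunable kernels, re-using the transformation (1); the helper names are ours and the exponentiated form follows our reconstruction above:

```python
import numpy as np

def transform(u):
    """Transformation (1): split entries into positive/negative parts."""
    u = np.asarray(u, float)
    t = np.empty(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)
    t[1::2] = np.maximum(-u, 0.0)
    return t

def pgmm_kernel(u, v, p):
    """pGMM: apply the power p entrywise before the min/max ratio."""
    tu, tv = transform(u) ** p, transform(v) ** p
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()

def egmm_kernel(u, v, gamma):
    """eGMM: exponentiate the (tuning-free) GMM similarity."""
    return np.exp(-gamma * (1.0 - pgmm_kernel(u, v, p=1.0)))

def epgmm_kernel(u, v, p, gamma):
    """epGMM: combine both tuning parameters."""
    return np.exp(-gamma * (1.0 - pgmm_kernel(u, v, p)))
```

By construction, epGMM with $p = 1$ reduces to eGMM, and pGMM with $p = 1$ reduces to the original GMM kernel.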
In this study, we will provide an empirical study on kernel SVMs based on the above three tunable GMM kernels. Perhaps not surprisingly, the improvements can be substantial on some datasets. In particular, we will also compare them with deep nets and trees on 11 datasets [4, 7]. In their previous studies, [5, 6, 7] developed tree methods including “abc-mart”, “robust logitboost”, and “abc-robust-logitboost” and demonstrated their excellent performance on those 11 datasets (and other datasets), by establishing the second-order tree-split formula and new derivatives for the multi-class logistic loss. Compared to tree methods like “abc-robust-logitboost” (which are slow and require substantial model sizes), the proposed tunable GMM kernels produce largely comparable classification results.
2 An Experimental Study on Kernel SVMs
We essentially use similar datasets to those in [9]. Table 1 lists a large number of publicly available datasets from the UCI repository and Table 2 presents datasets from the LIBSVM website as well as the 11 datasets for testing deep learning methods and trees [4, 7]. In both tables, we report
the kernel SVM test classification results for the linear kernel, the best-tuned RBF kernel, the original (tuning-free) GMM kernel, the best-tuned eGMM kernel, and the pGMM kernel. For the epGMM kernel, the experimental results are reported in Section 3, e.g., Table 3.
In all the experiments, we adopt $l_2$-regularization (with a regularization parameter $C$) and report the test classification accuracies at the best $C$ values in Table 1 and Table 2. More detailed results for a wide range of $C$ values are reported in Figures 1, 2, and 3. To ensure repeatability, we use the LIBSVM pre-computed kernel functionality, at the significant cost of disk space. For the RBF kernel, we follow [9] by exhaustively experimenting with 58 different $\gamma$ values: 0.001, 0.01, 0.1:0.1:2, 2.5, 3:1:20, 25:5:50, 60:10:100, 120, 150, 200, 300, 500, 1000. Basically, Table 1 and Table 2 report the best RBF results among all $\gamma$ and $C$ values in our experiments. Here, 3:1:20 is the matlab notation, meaning that the values start at 3 and terminate at 20, with a spacing of 1.
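For concreteness, the 58-value $\gamma$ grid written in matlab notation above can be expanded programmatically (a small sketch; the helper name is ours):

```python
def matlab_range(start, step, stop):
    """Expand matlab-style start:step:stop notation (inclusive)."""
    vals, x = [], start
    while x <= stop + 1e-9:          # small tolerance for float accumulation
        vals.append(round(x, 10))
        x += step
    return vals

# The 58 gamma values searched for the RBF (and eGMM) kernels.
gammas = ([0.001, 0.01] + matlab_range(0.1, 0.1, 2.0) + [2.5]
          + matlab_range(3, 1, 20) + matlab_range(25, 5, 50)
          + matlab_range(60, 10, 100) + [120, 150, 200, 300, 500, 1000])
```

Counting the pieces (2 + 20 + 1 + 18 + 6 + 5 + 6) confirms the total of 58 values.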
For the eGMM kernel, we experiment with the same set of (58) $\gamma$ values as for the RBF kernel. For the pGMM kernel, however, because we have to materialize (store) a kernel matrix for each $p$, disk space becomes a serious concern. Therefore, for the pGMM kernel, we only search within a limited range of $p$ values. In other words, the performance of the pGMM kernel (and the epGMM kernel) could be further improved if we expanded the range or granularity of the $p$ search.
The classification results in Tables 1 and 2 and Figures 1, 2, and 3 confirm that the eGMM and pGMM kernels typically improve upon the original GMM kernel. On a good fraction of the datasets, the improvements can be very significant. In fact, Section 3 will show that the epGMM kernel brings further improvements. Nevertheless, the RBF kernel still exhibits the best performance on a very small number of datasets; this means there is still room for improvement in future studies.
Table 1: Dataset | # train | # test | # dim | linear | RBF | GMM | eGMM | pGMM
Table 2: Group | Dataset | # train | # test | # dim | linear | RBF | GMM | eGMM | pGMM
3 The epGMM Kernel, Comparisons with Deep Nets and Trees
Given two data vectors $u$ and $v$, the epGMM kernel is defined as

$$epGMM(u, v; p, \gamma) = e^{-\gamma(1 - pGMM(u, v; p))}, \quad \gamma > 0,\ p > 0$$

after applying the transformation in (1) to $u$ and $v$. When $p = 1$, this becomes the eGMM kernel.
In our experiments with the pGMM kernel, we searched for the best power parameter (i.e., the $p$ here) within a limited range. Note that since we have to store a kernel matrix for each $p$, the experiments are costly. For testing the epGMM kernel, we re-use those pre-computed kernels and experiment with the epGMM kernel using the same set of $\gamma$ values as for the RBF and eGMM kernels.
The experimental results are reported in Table 3 (the last column). We can see that the epGMM kernel indeed improves over the eGMM and pGMM kernels, as one would expect, and the improvements can be quite noticeable on some of the datasets.
Table 3: Group | Dataset | # train | # test | # dim | linear | RBF | GMM | eGMM | pGMM | epGMM
The 11 datasets in Group 1 of Table 3 were already used for testing deep learning algorithms and tree methods [4, 7]. It is perhaps surprising that the performance of the pGMM kernel (and the epGMM kernel) can be largely comparable to deep nets and boosted trees, as shown in Figure 4 and Table 4. These results are exciting because, at this point, we merely use kernel SVMs with single kernels. It is reasonable to expect that additional improvements might be achieved in future studies.
[5, 6, 7] developed tree methods including “abc-mart”, “robust logitboost”, and “abc-robust-logitboost” and demonstrated their excellent performance on those 11 datasets (and other datasets), by establishing the second-order tree-split formula and new derivatives for the multi-class logistic loss function. They always used a special histogram-based implementation named “adaptive binning”, and the “best-first” strategy for determining the region for the next split (thus, the trees were not balanced, as they did not directly control the depth).
Figure 4 reports the test classification error rates (lower is better) for six datasets: M-Noise1, M-Noise2, …, M-Noise6. In the left panel, we plot the results of the GMM kernel, the eGMM kernel, and the epGMM kernel, together with the results of two deep learning algorithms as reported in [4]. We can see that for most of these six datasets, the pGMM kernel and the epGMM kernel achieve the best accuracy. In the right panel of Figure 4, we compare epGMM with four boosted tree methods: mart, abc-mart, robust logitboost, and abc-robust-logitboost.
The “mart” tree algorithm [1] has been popular in industry practice, especially in search. At each boosting step, it uses the first derivative of the logistic loss function as the residual response to fit regression trees, achieving excellent robustness and fairly good accuracy. The earlier work on “logitboost” [2] was believed to exhibit numerical issues (which in part motivated the development of mart). It turns out that the numerical issue does not actually exist, after [7] derived the tree-split formula using both the first and second order derivatives of the logistic loss function. [7] showed that “robust logitboost” in general improves upon “mart”, as can be seen from Figure 4 (right panel).
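For reference, a second-order tree-split criterion of the kind mentioned above takes the following standard form (our reconstruction in generic notation, not copied from [7]): at a node whose examples are split into a left set $L$ and a right set $R$, the gain of the split is

$$Gain = \frac{\left(\sum_{i \in L} r_i\right)^2}{\sum_{i \in L} w_i} + \frac{\left(\sum_{i \in R} r_i\right)^2}{\sum_{i \in R} w_i} - \frac{\left(\sum_{i \in L \cup R} r_i\right)^2}{\sum_{i \in L \cup R} w_i}$$

where $r_i$ and $w_i$ denote the first and second derivatives of the loss at example $i$ (for the logistic loss, $r_i = y_i - p_i$ and $w_i = p_i(1 - p_i)$).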
[5, 6, 7] made an interesting (and perhaps brave) observation that the derivatives (as in textbooks) of the classical logistic loss function can be written in a different form for the multi-class case, by enforcing the “sum-to-zero” constraints. At each boosting step, they identify a “base class”, either by the “worst-class” criterion [5] or by the exhaustive search method as reported in [6, 7]. This “adaptive base class (abc)” strategy can be combined with either mart or robust logitboost, hence the names “abc-mart” and “abc-robust-logitboost”. The improvements due to the “abc” strategy can also be substantial. Again, as mentioned earlier, in all the tree implementations, [5, 6, 7] always used the adaptive-binning strategy for simplifying the implementation and speeding up training. They also followed the “best-first” criterion, whereas many tree implementations use balanced trees (which may cause “data imbalance” and reduce accuracy).
Table 4 reports the test error rates on five other datasets: M-Basic, M-Rotate, M-Image, M-Rand, and M-RotImg. In group 1 (as reported in [4]), the results show that (i) the kernel SVM with the RBF kernel outperforms the kernel SVM with the polynomial kernel; (ii) deep learning algorithms usually beat kernel SVMs and neural nets. Group 2 presents the same results as in Table 3 (in terms of error rates as opposed to accuracies). We can see that pGMM and epGMM outperform deep learning methods except on M-Rand. In group 3, overall the tree methods, especially abc-robust-logitboost, achieve very good accuracies. The results of pGMM and epGMM are largely comparable to the results of the tree methods.
The training of boosted trees is typically slow (especially in high-dimensional data) because a large number of trees are usually needed in order to achieve good accuracies. Consequently, the model sizes of tree methods are usually large. Therefore, it would be exciting to have methods which are simpler than trees and achieve comparable accuracies.
4 Hashing the pGMM Kernel
It is now well-understood that it is highly beneficial to be able to linearize nonlinear kernels so that learning algorithms can be easily scaled to massive data. The prior work [9] has already demonstrated the effectiveness of the generalized consistent weighted sampling (GCWS) [11, 3, 8] for hashing the GMM kernel. In this study, we modify GCWS for linearizing the pGMM kernel, as summarized in Algorithm 1.
From $k$ independent GCWS samples $(i^*_{u,j}, t^*_{u,j})$, $j = 1$ to $k$, we can estimate $pGMM(u, v; p)$ according to the following collision probability:

$$Pr\left\{(i^*_{u,j}, t^*_{u,j}) = (i^*_{v,j}, t^*_{v,j})\right\} = pGMM(u, v; p)$$

or, for implementation convenience, the approximate collision probability of the “0-bit CWS” scheme [8]:

$$Pr\left\{i^*_{u,j} = i^*_{v,j}\right\} \approx pGMM(u, v; p)$$
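Since Algorithm 1 is referenced above but not reproduced here, the following is a hedged sketch of drawing one GCWS sample, following Ioffe's consistent weighted sampling [3] applied to the transformed (and powered) data $\tilde{u}^p$; the variable names are ours, not the paper's:

```python
import numpy as np

def gcws_sample(x, seed):
    """One consistent weighted sampling draw (i*, t*) for a nonnegative
    vector x (here, the transformed and powered data). The same seed must
    be shared across all vectors so that the sampling is consistent."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Draw full-length randomness so vectors with different sparsity
    # patterns still share the same per-coordinate random variables.
    r = rng.gamma(2.0, 1.0, n)
    c = rng.gamma(2.0, 1.0, n)
    beta = rng.uniform(0.0, 1.0, n)
    nz = x > 0
    t = np.floor(np.log(x[nz]) / r[nz] + beta[nz])
    y = np.exp(r[nz] * (t - beta[nz]))
    a = c[nz] / (y * np.exp(r[nz]))
    j = np.argmin(a)
    return np.flatnonzero(nz)[j], int(t[j])
```

Repeating this for $k$ different seeds yields the $k$ samples; the fraction of collisions between two vectors estimates the min-max similarity of the powered, transformed data.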
For each vector $\tilde{u}$, we obtain $k$ random samples $i^*_j$, $j = 1$ to $k$. We store only the lowest $b$ bits of each $i^*_j$ and view those integers as locations (of the nonzeros). For example, when $b = 2$, we view $i^*_j$ as a binary vector of length $2^b = 4$. We concatenate all $k$ such vectors into a binary vector of length $2^b \times k$, which contains exactly $k$ 1's. We then feed the new data vectors to a linear classifier if the task is classification. The storage and computational cost is largely determined by the number of nonzeros in each data vector, i.e., the $k$ in our case. This scheme can of course also be used for many other tasks, including clustering, regression, and near neighbor search.
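The $b$-bit expansion step described above can be sketched as follows (a hypothetical helper, assuming the $k$ hashed values are already computed):

```python
import numpy as np

def hashed_features(samples, b):
    """Keep the lowest b bits of each of the k hashed values, one-hot
    encode each into 2^b slots, and concatenate: a binary vector of
    length 2^b * k containing exactly k ones."""
    k = len(samples)
    out = np.zeros(k * (1 << b), dtype=np.uint8)
    for j, h in enumerate(samples):
        out[j * (1 << b) + (h & ((1 << b) - 1))] = 1
    return out
```

The resulting sparse binary vectors can be fed directly to a linear classifier such as LIBLINEAR.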
Note that the performance of pGMM can be heavily impacted by the tuning parameter $p$ in the definition of the pGMM kernel. Figure 5 presents examples on M-Rotate and M-Image.
Figure 6 presents the experimental results on hashing for M-Rotate, using the best $p$ (among the range of $p$ values we searched) for this dataset. Figure 6 plots the results for two settings (left panels and right panels), over a range of sample sizes $k$. Recall $b$ is the number of bits for representing each hashed value in the “0-bit CWS” scheme [8]. The results demonstrate that: (i) hashing the best-tuned pGMM kernel produces better results than hashing the original GMM kernel; (ii) it is preferable to use a fairly large $b$ value, for example, $b = 8$, while using smaller $b$ values hurts the accuracy; (iii) with merely a small number of hashes $k$, the linearized pGMM kernel can significantly outperform the original linear kernel. Note that the original dimensionality is 784. This example illustrates the significant advantage of nonlinear kernels and hashing.
Figure 7 presents the experimental results on hashing for the M-Noise1 dataset (left panels) and the M-Noise3 dataset (right panels). Figure 8 presents the experimental results on hashing for the M-Image dataset (left panels) and the M-RotImg dataset (right panels). These results deliver similar information to the results in Figure 6, confirming the significant advantages of the pGMM kernel and hashing.
It is commonly believed that deep learning algorithms and tree methods can produce state-of-the-art results in many statistical machine learning tasks. In 2010, [7] reported a set of surprising experiments on the datasets used by the deep learning community [4], showing that tree methods can outperform deep nets on a majority (but not all) of those datasets, with substantial improvements on a good portion of them. [5, 6, 7] introduced several ideas, including the second-order tree-split formula and the new derivatives for the multi-class logistic loss function. Nevertheless, tree methods are slow and their model sizes are typically large.
In machine learning practice with massive data, it is desirable to develop algorithms which run almost as efficiently as linear methods (such as linear logistic regression or linear SVM) and achieve accuracies similar to nonlinear methods. In this study, the tunable linearized GMM kernels are promising tools for achieving those goals. Our extensive experiments on the same datasets used for testing tree methods and deep nets demonstrate that tunable GMM kernels and their linearized versions through hashing can achieve accuracies comparable to trees. In general, the state-of-the-art boosted tree method “abc-robust-logitboost” typically achieves better accuracies than the proposed tunable GMM kernels. Also, on some datasets, deep learning methods or the RBF kernel SVM outperform tunable GMM kernels. Therefore, there is still room for future improvements.
In this study, we focus on testing tunable GMM kernels and their linearized versions on classification tasks. It is clear that these techniques basically generate new data representations and hence can be applied to a wide variety of statistical learning tasks, including clustering and regression. Due to the discrete nature of the hashed values, the techniques can naturally also be used for building hash tables for fast near neighbor search.
The current version of this paper is mainly a technical note for supporting the recent work on “The Linearized GMM Kernels and Normalized Random Fourier Features” [9].
[1] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
[2] J. H. Friedman, T. J. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[3] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.
[4] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, Corvalis, Oregon, 2007.
[5] P. Li. Adaptive base class boost for multi-class classification. CoRR, abs/0811.1250, 2008.
[6] P. Li. Abc-boost: Adaptive base class boost for multi-class classification. In ICML, pages 625–632, Montreal, Canada, 2009.
[7] P. Li. Robust logitboost and adaptive base class (abc) logitboost. In UAI, 2010.
[8] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.
[9] P. Li. Linearized GMM kernels and normalized random Fourier features. Technical report, arXiv:1605.05721, 2016.
[10] P. Li. Nystrom method for approximating the GMM kernel. Technical report, arXiv:1605.05721, 2016.
[11] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.