Choosing the right kernel is a classic question that has riddled machine learning practitioners and theorists alike. Conventional wisdom instructs the user to select a kernel which captures the structure and geometric invariances in the data. Efforts to formulate this principle have inspired vibrant areas of study, going by names fromfeature selection to multiple kernel learning (MKL).
We present a new, principled approach for selecting a translation-invariant or rotation-invariant
kernel to maximize the SVM classification margin. We first describe a kernel-alignment subroutine, which finds a peak in the Fourier transform of anadversarially chosen data-dependent measure. Then, we define an iterative procedure that produces a sequence of feature maps, progressively improving the margin. The resulting algorithm is strikingly simple and scalable.
Intriguingly, our analysis interprets the main algorithm as no-regret learning dynamics in a zero-sum min-max game, whose value is the classification margin. Thus, we are able to quantify convergence guarantees towards the largest margin realizable by a kernel with the assumed invariance. Finally, we exhibit experiments on synthetic and benchmark datasets, demonstrating consistent improvements over related random features-based kernel methods.
1.1 Related Work
There is a vast literature on MKL, from which we use the key concept of kernel alignment (Cristianini et al., 2002). Otherwise, our work bears few similarities to traditional MKL; this and much related work (e.g. Cortes et al. (2012); Gönen & Alpaydın (2011); Lanckriet et al. (2004)) are concerned with selecting a kernel by combining a collection of base kernels, chosen beforehand. Our method allows for greater expressivity and even better generalization guarantees.
Instead, we take inspiration from the method of random features (Rahimi & Recht, 2007). In this pioneering work, originally motivated by scalability, feature maps are sampled according to the Fourier transform of a chosen kernel. The idea of optimizing a kernel in random feature space was studied by Sinha & Duchi (2016). In this work, which is most similar to ours, kernel alignment is optimized via importance sampling on a fixed, finitely supported proposal measure. However, the proposal can fail to contain informative features, especially in high dimension; indeed, they highlight efficiency, rather than showing performance improvements over RBF features.
Learning a kernel in the Fourier domain (without the primal feature maps) has also been considered previously: Oliva et al. (2016) and Yang et al. (2015) model the Fourier spectrum parametrically, which limits expressivity; the former also require complicated posterior inference procedures. Băzăvan et al. (2012) study learning a kernel in the Fourier domain jointly with regression parameters. They show experimentally that this locates informative frequencies in the data, without theoretical guarantees. Our visualizations suggest this approach can get stuck in poor local minima, even in 2 dimensions.
Crammer et al. (2003)
also use boosting to build a kernel sequentially; however, they only consider a basis of linear feature maps, and require costly generalized eigenvector computations. From a statistical view,Fukumizu et al. (2009)
bound the SVM margin in terms of maximum mean discrepancy, which is equivalent to (unweighted) kernel alignment. Notably, their bound can be loose if the number of support vectors is small; in such situations, our theory provides a tighter characterization. Moreover, our attention to the margin goes beyond the usual objective of kernel alignment.
1.2 Our Contributions
We present an algorithm that outputs a sequence of Fourier features, converging to the maximum realizable SVM classification margin on a labeled dataset. At each iteration, a pair of features is produced, which maximizes kernel alignment with a changing, adversarially chosen measure. As this measure changes slowly, the algorithm builds a diverse and informative feature representation.
Our main theorem can be seen as a case of von Neumann’s min-max theorem for a zero-sum concave-linear game; indeed, our method bears a deep connection to boosting (Freund & Schapire, 1996; Schapire, 1999). In particular, both the theory and empirical evidence suggest that the generalization error of our method decreases as the number of random features increases. In traditional MKL methods, generalization bounds worsen as base kernels are added.
Other methods in the framework of Fourier random features take the approach of approximating a kernel by sampling feature maps from a continuous distribution. In contrast, our method constructs a measure with small finite support, and realizes the kernel exactly by enumerating the associated finite-dimensional feature map; there is no randomness in the features.111We note a superficial resemblance to quasi-Monte Carlo methods (Avron et al., 2016); however, these are concerned with accelerating convergence rates rather than learning a kernel.
2.1 The Fourier Dual Measure of a Kernel
We focus on two natural families of kernels : translation-invariant kernels on , which depend only on , and rotation-invariant kernels on the hypersphere , which depend only on . These invariance assumptions subsume most widely-used classes of kernels; notably, the Gaussian (RBF) kernel satisfies both. For the former invariance, Bochner’s theorem provides a Fourier-analytic characterization:
Theorem 2.1 (e.g. Eq. (1), Sec. 1.4.3 in Rudin (2011)).
A translation-invariant continuous function is positive semidefinite if and only if is the Fourier transform of a symmetric non-negative measure (where the Fourier domain is ). That is,
for some , satisfying and for all .
A similar characterization is available for rotation-invariant kernels, where the Fourier basis functions are the spherical harmonics, a countably infinite family of complex polynomials which form an orthonormal basis for square-integrable functions . To unify notation, let be the set of valid index pairs , and let denote a certain involution on ; we supply details and references in Appendix B.1.
A rotation-invariant continuous function is positive semidefinite if and only if it has a symmetric non-negative expansion into spherical harmonics , i.e.
for some , with and for all valid index pairs .
In each of these cases, we call this Fourier transform the dual measure of . This measure decomposes into a non-negative combination of Fourier basis kernels. Furthermore, this decomposition gives us a feature map whose image realizes the kernel under the codomain’s inner product;222In fact, such a pair exists for any psd kernel ; see Dai et al. (2014) or Devinatz (1953). that is, for all ,
Respectively, these feature maps are and . Although they are complex-valued, symmetry of allows us to apply the transformation to yield real features, preserving the inner product. The analogous result holds for spherical harmonics.
2.2 Kernel Alignment
In a binary classification task with training samples , a widely-used quantity for measuring the quality of a kernel is its alignment (Cristianini et al., 2002; Cortes et al., 2012),333Definitions vary up to constants and normalization, depending on the use case. defined by
where is the vector of labels and is the Gram matrix. Here, we let and denote the (unnormalized) empirical measures of each class, where is the Dirac measure at . When are arbitrary measures on , this definition generalizes to
In terms of the dual measure , kernel alignment takes a useful alternate form, noted by Sriperumbudur et al. (2010). Let denote the signed measure . Then, when is translation-invariant, we have
Analogously, when is rotation-invariant, we have
It can also be verified that , which is of course the same for all . In each case, the alignment is linear in . We call the Fourier potential, which is the squared magnitude of the Fourier transform of the signed measure . This function is clearly bounded pointwise by .
3.1 Maximizing Kernel Alignment in the Fourier Domain
First, we consider the problem of finding a kernel (subject to either invariance) that maximizes alignment ; we optimize the dual measure . Aside from the non-negativity and symmetry constraints from Theorems 2.1 and 2.2, we constrain , as this quantity appears as a normalization constant in our generalization bounds (see Theorem 4.3). Maximizing in this constraint set, which we call
, takes the form of a linear program on an infinite-dimensional simplex. Noting that, is maximized by placing a Dirac mass at any pair of opposite modes .
At first, and will be the empirical distributions of the classes, specified in Section 2.2. However, as Algorithm 2 proceeds, it will reweight each data point in the measures by . Explicitly, the reweighted Fourier potential takes the form444Whenever is an index, we will denote the imaginary unit by .
Due to its non-convexity, maximizing , which can be interpreted as finding a global Fourier peak in the data, is theoretically challenging. However, we find that it is easy to find such peaks in our experiments, even in hundreds of dimensions. This arises from the empirical phenomenon that realistic data tend to be band-limited, a cornerstone hypothesis in data compression. An constraint (or equivalently, regularization; see Kakade et al. (2009)) can be explicitly enforced to promote band-limitedness; we find that this is not necessary in practice.
When a gradient is available (in the translation-invariant case), we use Langevin dynamics (Algorithm 1) to find the peaks of , which enjoys mild theoretical hitting-time guarantees (see Theorem 4.5). See Appendix A.3 for a discussion of the (discrete) rotation-invariant case.
It is useful in practice to use parallel initialization, running concurrent copies of the diffusion process and returning the best single encountered. This admits a very efficient GPU implementation: the multi-point evaluations and can be computed from an by matrix product and pointwise trigonometric functions. We find that Algorithm 1 typically finds a reasonable peak within steps.
3.2 Learning the Margin-Maximizing Kernel for SVM
Support vector machines (SVMs) are perhaps the most ubiquitous use case of kernels in practice. To this end, we propose a method that boosts Algorithm 1, building a kernel that maximizes the classification margin. Let be a kernel with dual , and . Write the dual -SVM objective,555Of course, our method applies to SVMs, and has even stronger theoretical guarantees; see Section 4.1. parameterizing the kernel by :
Thus, for a fixed , is equivalent to kernel alignment, and can be minimized by Algorithm 1. However, the support vector weights are of course not fixed; given a kernel , is chosen to maximize , giving the (reciprocal) SVM margin. In all, to find a kernel which maximizes the margin under an adversarial choice of , one must consider a two-player zero-sum game:
where is the same constraint set as in Section 3.1, and is the usual dual feasible set with box parameter .
In this view, we make some key observations. First, Algorithm 1 allows the min-player to play a pure-strategy best response to the max-player. Furthermore, a mixed strategy for the min-player is simply a translation- or rotation-invariant kernel, realized by the feature map corresponding to its support. Finally, since the objective is linear in and concave in , there exists a Nash equilibrium for this game, from which gives the margin-maximizing kernel.
We can use no-regret learning dynamics to approximate this equilibrium. Algorithm 2 runs Algorithm 1 for the min-player, and online gradient ascent (Zinkevich, 2003) for the max-player. Intuitively (and as is visualized in our synthetic experiments), this process slowly morphs the landscape of to emphasize the margin, causing Algorithm 1 to find progressively more informative features. At the end, we simply concatenate these features; contingent on the success of the kernel alignment steps, we have approximated the Nash equilibrium.
We provide a theoretical analysis in Section 4.1A. One important note is that the online gradient is computed very efficiently, where is the vector of the most recently appended features. Langevin dynamics, easily implemented on a GPU, comprise the primary time bottleneck.
4.1 Convergence of No-Regret Learning Dynamics
We first state the main theoretical result, which quantifies the convergence properties of Algorithm 2.
Theorem 4.1 (Main).
Let be the margin obtained by training an linear SVM with the same as in Algorithm 2, on the transformed samples . Then, is -competitive with , the maximally achievable margin by a kernel with the assumed invariance, with
This bound arises from the regret analysis of online gradient ascent (Zinkevich, 2003); our analysis is similar to the approach of Freund & Schapire (1996), where they present a boosting perspective. When using an -SVM, the final term can be improved to (Hazan et al., 2007). For a general overview of results in the field, refer to Hazan (2016).
4.2 Generalization Bounds
Finally, we state two (rather distinct) generalization guarantees. Both depend mildly on a bandwidth assumption and the norm of the data . First, we state a margin-dependent SVM generalization bound, due to Koltchinskii & Panchenko (2002). Notice the appearance of , justifying our choice of normalization constant for Algorithm 1. Intriguingly, the end-to-end generalization error of our method decreases with an increasing number of random features, since the margin bound is being refined during Algorithm 2.
Theorem 4.3 (Generalization via margin).
For any SVM decision function with a kernel constrained by trained on samples drawn i.i.d. from distribution , the generalization error is bounded by
The proof can be found in Appendix D. Note that this improves on the generic result for MKL, from Theorem 2 in (Cortes et al., 2010), which has a dependence on the number of base kernels . This improvement stems from the rank-one property of each component kernel.
Next, we address another concern entirely: the sample size required for to approximate the ideal Fourier potential , the squared magnitude of the Fourier transform of the signed measure arising from the true distribution. For the shift-invariant case:
Theorem 4.4 (Generalization of the potential).
Let be a set of i.i.d. training samples on , with
We have that for all , with probability at least
, with probability at least,
The suppresses factors polynomial in and .
The full statement and proof are standard, and deferred to Appendix E. In particular, this result allows for a mild guarantee of polynomial hitting-time on the locus of approximate local maxima of (as opposed to the empirical ). Adapting the main result from Zhang et al. (2017):
Theorem 4.5 (Langevin hitting time).
Of course, one should not expect polynomial hitting time on approximate global maxima; Roberts & Tweedie (1996) give asymptotic mixing guarantees.
In this section, we highlight the most important and illustrative parts of our experimental results. For further details, we provide an extended addendum to the experimental section in Appendix A. The code can be found at github.com/yz-ignescent/Not-So-Random-Features.
First, we exhibit two simple binary classification tasks, one in and the other on , to demonstrate the power of our kernel selection method. As depicted in Figure 1, we create datasets with sharp boundaries, which are difficult for the standard RBF and arccosine (Cho & Saul, 2009) kernel. In both cases, and .666 is chosen so large to measure the true generalization error. Used on a -SVM classifier, these baseline kernels saturate at 92.1% and 95.1% test accuracy, respectively; they are not expressive enough.
On the “windmill” task, Algorithm 2 chooses random features that progressively refine the decision boundary at the margin. By , it exhibits almost perfect classification (99.7% training, 99.3% test). Similarly, on the “checkerboard” task, Algorithm 2 (with some adaptations described in Appendix A.3) reaches almost perfect classification (99.7% training, 99.1% test) at , supported on only spherical harmonics as features.
We provide some illuminating visualizations. Figures 0(a) and 0(b) show the evolution of the dual weights, random features, and classifier. As the theory suggests, the objective evolves to assign higher weight to points near the margin, and successive features improve the classifier’s decisiveness in challenging regions (0(a), bottom). Figure 0(c) visualizes some features from the experiment.
Next, we evaluate our kernel on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets; each task consists of training and test examples; this is considered to be large-scale for kernel methods. Following the standard protocol from Yu et al. (2016), 512-dimensional HoG features (Dalal & Triggs, 2005)
are used for the CIFAR-10 tasks instead of raw images. Of course, our intent is not to show state-of-the-art results on these tasks, on which deep neural networks easily dominate. Instead, the aim is to demonstrate viability as a scalable, principled kernel method.
We compare our results to baseline random features-based kernel machines: the standard RBF random features (RBF-RF for short), and the method of Sinha & Duchi (2016) (LKRF),777Traditional MKL methods are not tested here, as they are noticeably ( times) slower. using the same -SVM throughout. As shown by Table 1, our method reliably outperforms these baselines, most significantly in the regime of few features. Intuitively, this lines up with the expectation that isotropic random sampling becomes exponentially unlikely to hit a good peak in high dimension; our method searches for these peaks.
Furthermore, our theory predicts that the margin keeps improving, regardless of the dimensionality of the feature maps. Indeed, as our classifier saturates on the training data, test accuracy continues increasing, without overfitting. This decoupling of generalization from model complexity is characteristic of boosting methods.
In practice, our method is robust with respect to hyperparameter settings. As well, to outperform both RBF-RF and LKRF with 5000 features, our method only needs features. Our GPU implementation reaches this point in seconds. See Appendix A.1 for more tuning guidelines.
We have presented an efficient kernel learning method that uses tools from Fourier analysis and online learning to optimize over two natural infinite families of kernels. With this method, we show meaningful improvements on benchmark tasks, compared to related random features-based methods. Many theoretical questions remain, such as accelerating the search for Fourier peaks (e.g. Hassanieh et al. (2012); Kapralov (2016)). These, in addition to applying our learned kernels to state-of-the-art methods (e.g. convolutional kernel networks (Mairal et al., 2014; Mairal, 2016)), prove to be exciting directions for future work.
We are grateful to Sanjeev Arora, Roi Livni, Pravesh Kothari, Holden Lee, Karan Singh, and Naman Agarwal for helpful discussions. This research was supported by the National Science Foundation (NSF), the Office of Naval Research (ONR), and the Simons Foundation. The first two authors are supported in part by Elad Hazan’s NSF grant IIS-1523815.
Agarwal et al. (2017)
Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma.
Finding approximate local minima faster than gradient descent.
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.
- Avron et al. (2016) Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1–38, 2016.
- Băzăvan et al. (2012) Eduard Gabriel Băzăvan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In Computer Vision–ECCV 2012, pp. 459–473. Springer, 2012.
Cho & Saul (2009)
Youngmin Cho and Lawrence K Saul.
Kernel methods for deep learning.In Advances in Neural Information Processing Systems, pp. 342–350, 2009.
- Cortes et al. (2010) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 247–254. Citeseer, 2010.
- Cortes et al. (2012) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795–828, 2012.
- Crammer et al. (2003) Koby Crammer, Joseph Keshet, and Yoram Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems, pp. 553–560, 2003.
- Cristianini et al. (2002) Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.
- Dai et al. (2014) Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pp. 3041–3049, 2014.
Dalal & Triggs (2005)
Navneet Dalal and Bill Triggs.
Histograms of oriented gradients for human detection.
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 886–893. IEEE, 2005.
- Daskalakis et al. (2015) Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2015.
- Devinatz (1953) A Devinatz. Integral representations of positive definite functions. Transactions of the American Mathematical Society, 74(1):56–77, 1953.
Freund & Schapire (1996)
Yoav Freund and Robert E Schapire.
Game theory, on-line prediction and boosting.
Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 325–332. ACM, 1996.
Fukumizu et al. (2009)
Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, Bernhard Schölkopf, and
Bharath K Sriperumbudur.
Kernel choice and classifiability for RKHS embeddings of probability distributions.In Advances in Neural Information Processing Systems, pp. 1750–1758, 2009.
- Gallier (2009) Jean Gallier. Notes on spherical harmonics and linear representations of lie groups. 2009.
- Gönen & Alpaydın (2011) Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
- Hassanieh et al. (2012) Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1183–1194. Society for Industrial and Applied Mathematics, 2012.
- Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Hazan et al. (2007) Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
Symmetric tensor spherical harmonics on the N-sphere and their application to the de Sitter group SO (N, 1).Journal of mathematical physics, 28(7):1553–1566, 1987.
- Kakade et al. (2009) Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.
- Kapralov (2016) Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 264–277. ACM, 2016.
- Koltchinskii & Panchenko (2002) Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pp. 1–50, 2002.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- Lanckriet et al. (2004) Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Mairal (2016) Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 1399–1407, 2016.
- Mairal et al. (2014) Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.
- McDiarmid (1989) Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
- Müller (2012) Claus Müller. Analysis of spherical symmetries in Euclidean spaces, volume 129. Springer Science & Business Media, 2012.
- Nesterov (2005) Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.
Oliva et al. (2016)
Junier Oliva, Avinava Dubey, Barnabas Poczos, Jeff Schneider, and Eric P Xing.
Bayesian nonparametric kernel-learning.
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.
- Rahimi & Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
- Roberts & Tweedie (1996) Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363, 1996.
- Rudin (2011) Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011.
- Schapire (1999) Robert E Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 1401–1406, 1999.
- Schoenberg (1942) IJ Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96–108, 1942.
- Sinha & Duchi (2016) Aman Sinha and John C Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.
- Sriperumbudur et al. (2010) Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
- Stein & Weiss (2016) Elias M Stein and Guido Weiss. Introduction to Fourier analysis on Euclidean spaces (PMS-32), volume 32. Princeton University Press, 2016.
- von Neumann (1949) John von Neumann. On rings of operators. reduction theory. Annals of Mathematics, pp. 401–485, 1949.
- Yang et al. (2015) Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte–learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 1098–1106, 2015.
- Yu et al. (2016) Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.
- Zhang et al. (2017) Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.
Appendix A Appendix for Experiments
Algorithm 2 gives a high-level outline of the essential components of our method. However, it conceals several hyperparameter choices and algorithmic heuristics, which are pertinent when applying our method in practice. We discuss a few more details in this section.
a.1 Hyperparameter Tuning Guide
Throughout all experiments presented, we use hinge-loss SVM classifiers with . Note that the convergence of Algorithm 2 depends quadratically on .
With regard to Langevin diffusion (Algorithm 1), we observe that the best samples arise from using high temperatures and Gaussian parallel initialization. For the latter, a rule-of-thumb is to initialize 500 parallel copies of Langevin dynamics, drawing the initial position from a centered isotropic Gaussian with
the variance of the optimal RBF random features. (In turn, a common rule-of-thumb for this bandwidth is the median of pairwise Euclidean distances between data points.)
The step size in Algorithm 1 is tuned based on the magnitude of the gradient on , which can be significantly smaller than the upper bound derived in Section E. As is standard practice in Langevin Monte Carlo methods, the temperature is chosen so that the pertubation is roughly at the same magnitude as the gradient step. Empirically, running Langevin dynamics for 100 steps suffices to locate a reasonably good peak. To further improve efficiency, one can modify Algorithm 1 to pick the top samples, a -fold speedup which does not degrade the features much.
The step size of online gradient ascent is set to balance between being conservative and promoting diverse samples; these steps should not saturate (thereby solving the dual SVM problem), in order to have the strongest regret bound. In our experiments, we find that the step size achieving the standard regret guarantee (scaling as ) tends to be a little too conservative.
On the other hand, it never hurts (and seems important in practice) to saturate the peak-finding routine (Algorithm 1), since this contributes an additive improvement to the margin bound. Noting that the objective is very smooth (the -th derivative scales as ), it may be beneficial to refine the samples using a few steps of gradient descent with a very small learning rate, or an accelerated algorithm for finding approximate local minima of smooth non-convex functions; see, e.g. Agarwal et al. (2017).
a.2 Projection onto the SVM Dual Feasible Set
A quick note on projection onto the feasible set
of the SVM dual convex program: it typically suffices in practice to use alternating projection. This feasible set is the intersection of a hyperplane and a hypercube; both of which admit a simple projection step. The alternation projection onto the intersection of two non-empty convex sets was originally proposed byvon Neumann (1949). The convergence rate can be shown to be linear. To obtain the dual variables in our experiments, we use such alternating projections. This results in a dual feasible solution up to hardware precision, and is a negligible component of the total running time (for which the parallel gradient computations are the bottleneck).
a.3 Sampling Spherical Harmonics
As we note in Section 3.1, it is unclear how to define gradient Langevin dynamics on in the inner-product case, since no topology is available on the indices
of the spherical harmonics. One option is to emulate Langevin dynamics, by constructing a discrete Markov chain which mixes to.
However, we find in our experiments that it suffices to compute by examining all values of with no more than some threshold . One should view this as approximating the kernel-target alignment objective function via Fourier truncation. This is highly parallelizable: it involves approximately degree- polynomial evaluations on the same sample data, which can be expressed using matrix multiplication. In our experiments, it sufficed to examine the first coefficients; we remark that it is unnatural in any real-world datasets to expect that only has large values outside the threshold .
Under Fourier truncation, the domain of becomes a finite-dimensional simplex. In the game-theoretic view of 4.1, an approximate Nash equilibrium becomes concretely achievable via Nesterov’s excessive gap technique (Nesterov, 2005; Daskalakis et al., 2015), given that the kernel player’s actions are restricted to a mixed strategy over a finite set of basis kernels.
Finally, we note a significant advantage to this setting, where we have a discrete set of Fourier coefficients: the same feature might be found multiple times. When a duplicate feature is found, it need not be concatenated to the representation; instead, the existing feature is scaled appropriately. This accounts for the drastically smaller support of features required to achieve near-perfect classification accuracy.
Appendix B More on Spherical Harmonics
In this section, we go into more detail about the spherical harmonics in dimensions. Although all of this material is standard in harmonic analysis, we provide this section for convenience, isolating only the relevant facts.
b.1 Spherical Harmonic Expansion of a Rotation-Invariant Kernel
First, we provide a proof sketch for Theorem 2.2. We rely on the following theorem, an analogue of Bochner’s theorem on the sphere, which characterizes rotation-invariant kernels:
Theorem B.1 (Schoenberg (1942)).
A continuous rotation-invariant function on is positive semi-definite if and only if its expansion into Gegenbauer polynomials has only non-negative coefficients, i.e.
The Gegenbauer polynomials are a generalization of the Legendre and Chebyshev polynomials; they form an orthogonal basis on under a certain weighting function (see Stein & Weiss (2016); Gallier (2009); Müller (2012) for details). Note that we adopt a different notation from Schoenberg (1942), which denotes our by .
Unlike Theorem 2.1, Theorem B.1 alone does not provide a clear path for constructing a feature map for inner-product kernel, because the inputs of the kernel are still coupled after the expansion into , which acts on the inner product instead of on and separately. Fortunately, (see, e.g., Proposition 1.18 in Gallier (2009)), the Gegenbauer polynomials evaluated on an inner product admits a decomposition into spherical harmonics :
b.2 Pairing the Spherical Harmonics
We now specify the involution on indices of spherical harmonics, which gives a pairing that takes the role of opposite Fourier coefficients in the case. In particular, the Fourier transform of a real-valued function satisfies , so that the dual measure can be constrained to be symmetric.
Now, consider the case, where the Fourier coefficients are on the set of valid indices of spherical harmonics. We would like a permutation on the indices so that , and whenever is the spherical harmonic expansion of a real kernel.
For a fixed dimension and degree , the spherical harmonics form an -dimensional orthogonal basis for a vector space of complex polynomials from to , namely the -homogeneous polynomials (with domain extended to
) which are harmonic (eigenfunctions of the Laplace-Beltrami operator). This basis is not unique; an arbitrary choice of basis may not contain pairs of conjugate functions.
Fortunately, such a basis exists constructively, by separating the Laplace-Beltrami operator over spherical coordinates (). Concretely, for a fixed and , the -dimensional spherical harmonic, indexed by , is defined as:
where the functions in the product come from a certain family of associated Legendre functions in . For a detailed treatment, we adopt the construction and conventions from Higuchi (1987).
In the scope of this paper, the relevant fact is that the desired involution is well-defined in all dimensions. Namely, consider the permutation that sends a spherical harmonic indexed by to . Then, for all .
The symmetry condition in Theorem 2.2 follows straightforwardly. By orthonormality, we know that every square-integrable function has a unique decomposition into spherical harmonics, with coefficients , so that . When is real-valued, we conclude that , as claimed.
Appendix C Proof of the Main Theorem
In this section, we prove the main theorem, which quantifies convergence of Algorithm 2 to the Nash equilibrium. We restate it here:
Alternatively with the assumption that at each timestep , , , satisfies
If Algorithm 2 is used on a -SVM, the regret bound can be improved to be .
We will make use of the regret bound of online gradient ascent (see, e.g., (Hazan, 2016)). Here we only prove the theorem in the case for -SVM with box constraint , under the assumption of -approximate optimality of Algorithm 1. Extending the proof to other cases is straightfoward.
Lemma C.1 (Regret bound for online gradient ascent).
Let be the diameter of the constraint set , and a Lipschitz constant for an arbitrary sequence of concave functions on . Online gradient ascent on , with step size schedule , guarantees the following for all :
Here, by the box constraint, and we have
Thus, our regret bound is ; that is, for all ,
Since at each timestep , , we have by assumption
and by concavity of in , we know that, for any fixed sequence ,
it follows that
To complete the proof, note that for a given , by linearity of in the dual measure; here, is the approximate Nash equilibrium found by the no-regret learning dynamics.
Appendix D Proof of Theorem 4.3
In this section, we compute the Rademacher complexity of the composition of the learned kernel and the classifier, proving Theorem 4.3.
For any SVM decision function with a kernel constrained by trained on samples drawn i.i.d. from distribution , the generalization error is bounded by
For a fixed sample , we compute the empirical Rademacher complexity of the composite hypothesis class. Below,
is a vector of i.i.d. Rademacher random variables.
Note that each term in the formula is the empirical Rademacher complexity of the composition of a 1-Lipschitz function (cosine or sine) with a linear function with norm bounded by . By the composition property of Rademacher complexity, we have the following: