1 Introduction
Choosing the right kernel is a classic question that has puzzled machine learning practitioners and theorists alike. Conventional wisdom instructs the user to select a kernel which captures the structure and geometric invariances in the data. Efforts to formulate this principle have inspired vibrant areas of study, going by names from feature selection to multiple kernel learning (MKL).

We present a new, principled approach for selecting a translation-invariant or rotation-invariant kernel to maximize the SVM classification margin. We first describe a kernel-alignment subroutine, which finds a peak in the Fourier transform of an adversarially chosen, data-dependent measure. Then, we define an iterative procedure that produces a sequence of feature maps, progressively improving the margin. The resulting algorithm is strikingly simple and scalable.

Intriguingly, our analysis interprets the main algorithm as no-regret learning dynamics in a zero-sum min-max game, whose value is the classification margin. Thus, we are able to quantify convergence guarantees towards the largest margin realizable by a kernel with the assumed invariance. Finally, we exhibit experiments on synthetic and benchmark datasets, demonstrating consistent improvements over related random-features-based kernel methods.
1.1 Related Work
There is a vast literature on MKL, from which we use the key concept of kernel alignment (Cristianini et al., 2002). Otherwise, our work bears few similarities to traditional MKL; this and much related work (e.g. Cortes et al. (2012); Gönen & Alpaydın (2011); Lanckriet et al. (2004)) are concerned with selecting a kernel by combining a collection of base kernels, chosen beforehand. Our method allows for greater expressivity and even better generalization guarantees.
Instead, we take inspiration from the method of random features (Rahimi & Recht, 2007). In this pioneering work, originally motivated by scalability, feature maps are sampled according to the Fourier transform of a chosen kernel. The idea of optimizing a kernel in random feature space was studied by Sinha & Duchi (2016). That work, the most similar to ours, optimizes kernel alignment via importance sampling on a fixed, finitely supported proposal measure. However, the proposal can fail to contain informative features, especially in high dimension; indeed, they highlight efficiency, rather than showing performance improvements over RBF features.
Learning a kernel in the Fourier domain (without the primal feature maps) has also been considered previously: Oliva et al. (2016) and Yang et al. (2015) model the Fourier spectrum parametrically, which limits expressivity; the former also require complicated posterior inference procedures. Băzăvan et al. (2012) study learning a kernel in the Fourier domain jointly with regression parameters. They show experimentally that this locates informative frequencies in the data, without theoretical guarantees. Our visualizations suggest this approach can get stuck in poor local minima, even in 2 dimensions.
Crammer et al. (2003)
also use boosting to build a kernel sequentially; however, they only consider a basis of linear feature maps, and require costly generalized eigenvector computations. From a statistical view,
Fukumizu et al. (2009) bound the SVM margin in terms of maximum mean discrepancy, which is equivalent to (unweighted) kernel alignment. Notably, their bound can be loose if the number of support vectors is small; in such situations, our theory provides a tighter characterization. Moreover, our attention to the margin goes beyond the usual objective of kernel alignment.
1.2 Our Contributions
We present an algorithm that outputs a sequence of Fourier features, converging to the maximum realizable SVM classification margin on a labeled dataset. At each iteration, a pair of features is produced, which maximizes kernel alignment with a changing, adversarially chosen measure. As this measure changes slowly, the algorithm builds a diverse and informative feature representation.
Our main theorem can be seen as a case of von Neumann's min-max theorem for a zero-sum concave-linear game; indeed, our method bears a deep connection to boosting (Freund & Schapire, 1996; Schapire, 1999). In particular, both the theory and empirical evidence suggest that the generalization error of our method decreases as the number of random features increases. In traditional MKL methods, generalization bounds worsen as base kernels are added.
Other methods in the framework of Fourier random features take the approach of approximating a kernel by sampling feature maps from a continuous distribution. In contrast, our method constructs a measure with small finite support, and realizes the kernel exactly by enumerating the associated finite-dimensional feature map; there is no randomness in the features. (We note a superficial resemblance to quasi-Monte Carlo methods (Avron et al., 2016); however, these are concerned with accelerating convergence rates rather than with learning a kernel.)
2 Preliminaries
2.1 The Fourier Dual Measure of a Kernel
We focus on two natural families of kernels: translation-invariant kernels on ℝ^d, which depend only on the difference x − y, and rotation-invariant kernels on the hypersphere S^{d−1}, which depend only on the inner product ⟨x, y⟩. These invariance assumptions subsume most widely used classes of kernels; notably, the Gaussian (RBF) kernel satisfies both. For the former invariance, Bochner's theorem provides a Fourier-analytic characterization:
Theorem 2.1 (e.g. Eq. (1), Sec. 1.4.3 in Rudin (2011)).
A translation-invariant continuous function k(x, y) = k(x − y) is positive semidefinite if and only if k is the Fourier transform of a symmetric non-negative measure (where the Fourier domain is ℝ^d). That is,

k(x − y) = ∫_{ℝ^d} e^{i⟨ω, x − y⟩} dμ(ω),    (1)

for some measure μ, satisfying μ ≥ 0 and μ(A) = μ(−A) for all measurable sets A.
A similar characterization is available for rotation-invariant kernels, where the Fourier basis functions are the spherical harmonics, a countably infinite family of complex polynomials which form an orthonormal basis for square-integrable functions on S^{d−1}. To unify notation, we index the harmonics Y_{ℓ,m} by the set of valid index pairs (ℓ, m), on which a certain involution plays the role of negation; we supply details and references in Appendix B.1.
Theorem 2.2.
A rotation-invariant continuous function k(x, y) = k(⟨x, y⟩) on S^{d−1} is positive semidefinite if and only if it has a symmetric non-negative expansion into spherical harmonics Y_{ℓ,m}, i.e.

k(x, y) = Σ_{(ℓ,m)} μ_{ℓ,m} Y_{ℓ,m}(x) Y*_{ℓ,m}(y),    (2)

for some coefficients μ_{ℓ,m} ≥ 0, symmetric under the involution, for all valid index pairs (ℓ, m).
In each of these cases, we call this Fourier transform μ the dual measure of k. This measure decomposes the kernel into a non-negative combination of Fourier basis kernels. Furthermore, this decomposition gives us a feature map φ whose image realizes the kernel under the codomain's inner product (in fact, such a pair exists for any psd kernel k; see Dai et al. (2014) or Devinatz (1953)); that is, for all x, y,

k(x, y) = ⟨φ(x), φ(y)⟩.    (3)

Respectively, these feature maps are x ↦ (e^{i⟨ω, x⟩})_ω and x ↦ (Y_{ℓ,m}(x))_{(ℓ,m)}, weighted according to μ. Although they are complex-valued, symmetry of μ allows us to apply the transformation e^{iθ} ↦ (cos θ, sin θ) to yield real features, preserving the inner product. The analogous result holds for spherical harmonics.
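As a concrete illustration of this duality, sampling frequencies from a Gaussian dual measure and applying the real cos/sin transformation yields the standard random Fourier features, whose inner products approximate the RBF kernel. The following sketch (pure Python; function names and constants are our own illustrative choices) checks this numerically.

```python
import math, random

def real_fourier_features(x, omegas):
    # map x to [cos<w,x>, sin<w,x>] over all sampled frequencies, scaled so
    # that inner products of feature maps estimate E[cos<w, x - y>]
    n = len(omegas)
    feats = []
    for w in omegas:
        t = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(math.cos(t) / math.sqrt(n))
        feats.append(math.sin(t) / math.sqrt(n))
    return feats

random.seed(0)
d, m, sigma = 3, 5000, 1.0
# dual measure of the RBF kernel: an isotropic Gaussian in the Fourier domain
omegas = [[random.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(m)]

x, y = [0.5, -0.2, 0.1], [0.1, 0.3, -0.4]
approx = sum(a * b for a, b in zip(real_fourier_features(x, omegas),
                                   real_fourier_features(y, omegas)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
# approx matches exact up to Monte Carlo error
```

The inner product of the two feature vectors agrees with the exact Gaussian kernel value up to Monte Carlo error of order 1/√m.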
2.2 Kernel Alignment
In a binary classification task with training samples {(x_i, y_i)}_{i=1}^n, a widely used quantity for measuring the quality of a kernel is its alignment (Cristianini et al., 2002; Cortes et al., 2012) (definitions vary up to constants and normalization, depending on the use case), defined by

A(k) = y^T K y,

where y ∈ {±1}^n is the vector of labels and K_{ij} = k(x_i, x_j) is the Gram matrix. Here, we let μ₊ and μ₋ denote the (unnormalized) empirical measures of each class, μ± = Σ_{i : y_i = ±1} δ_{x_i}, where δ_x is the Dirac measure at x. When μ₊, μ₋ are arbitrary measures, this definition generalizes to

A(k; μ₊, μ₋) = ∬ k(x, x′) d(μ₊ − μ₋)(x) d(μ₊ − μ₋)(x′).
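For intuition, the unnormalized alignment can be computed directly from the Gram matrix; the minimal sketch below (our own illustration, not the paper's code) shows that a Gram matrix agreeing with the label structure scores higher than an uninformative one.

```python
def kernel_alignment(K, y):
    # unnormalized kernel-target alignment  y^T K y,  labels y in {-1, +1}
    n = len(y)
    return sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))

y = [1, -1, 1]
K_ideal = [[yi * yj for yj in y] for yi in y]   # k(x_i, x_j) = y_i * y_j
K_trivial = [[1 if i == j else 0 for j in range(3)] for i in range(3)]

a_ideal = kernel_alignment(K_ideal, y)      # attains the maximum n^2 = 9
a_trivial = kernel_alignment(K_trivial, y)  # identity kernel scores only n = 3
```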
In terms of the dual measure μ, kernel alignment takes a useful alternate form, noted by Sriperumbudur et al. (2010). Let ρ denote the signed measure μ₊ − μ₋. Then, when k is translation-invariant, we have

A(k) = ∫_{ℝ^d} Φ(ω) dμ(ω).    (4)

Analogously, when k is rotation-invariant, we have

A(k) = Σ_{(ℓ,m)} Φ(ℓ, m) μ_{ℓ,m}.    (5)

In each case, the alignment is linear in μ. We call Φ the Fourier potential, which is the squared magnitude of the Fourier transform of the signed measure ρ; explicitly, Φ(ω) = |Σ_i y_i e^{i⟨ω, x_i⟩}|² in the translation-invariant case. This function is clearly bounded pointwise by ‖ρ‖₁² (that is, by n² for the empirical class measures).
3 Algorithms
3.1 Maximizing Kernel Alignment in the Fourier Domain
First, we consider the problem of finding a kernel (subject to either invariance) that maximizes alignment; we optimize over the dual measure μ. Aside from the non-negativity and symmetry constraints from Theorems 2.1 and 2.2, we constrain the total mass of μ, as this quantity appears as a normalization constant in our generalization bounds (see Theorem 4.3). Maximizing the alignment in this constraint set takes the form of a linear program on an infinite-dimensional simplex. Noting that the alignment is linear in μ, it is maximized by placing a Dirac mass at any pair of opposite modes ±ω* of the Fourier potential.

At first, μ₊ and μ₋ will be the empirical distributions of the classes, specified in Section 2.2. However, as Algorithm 2 proceeds, it will reweight each data point in the measures by its dual SVM weight α_i. Explicitly, the reweighted Fourier potential takes the form (the imaginary unit in the exponent is to be distinguished from the summation index i)

Φ_α(ω) = |Σ_i α_i y_i e^{i⟨ω, x_i⟩}|².
Due to its non-convexity, maximizing the reweighted Fourier potential, which can be interpreted as finding a global Fourier peak in the data, is theoretically challenging. However, we find that it is easy to find such peaks in our experiments, even in hundreds of dimensions. This arises from the empirical phenomenon that realistic data tend to be band-limited, a cornerstone hypothesis in data compression. A norm constraint (or, equivalently, regularization; see Kakade et al. (2009)) can be explicitly enforced to promote band-limitedness; we find that this is not necessary in practice.
When a gradient is available (in the translation-invariant case), we use Langevin dynamics (Algorithm 1) to find the peaks of the Fourier potential, which enjoys mild theoretical hitting-time guarantees (see Theorem 4.5). See Appendix A.3 for a discussion of the (discrete) rotation-invariant case.
It is useful in practice to use parallel initialization, running many concurrent copies of the diffusion process and returning the best single point encountered. This admits a very efficient GPU implementation: the multi-point evaluations of the potential and its gradient can be computed with a single matrix product and pointwise trigonometric functions. We find that Algorithm 1 typically finds a reasonable peak within a modest number of steps.
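To make the procedure concrete, the following self-contained sketch implements the reweighted Fourier potential, its gradient, and parallel Langevin ascent (a pure-Python stand-in for Algorithm 1; all names, step sizes, and the initialization scale are our illustrative choices, not the paper's exact settings).

```python
import math, random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def potential(w, X, c):
    # Phi(w) = |sum_i c_i exp(i <w, x_i>)|^2, with c_i = alpha_i * y_i
    re = sum(ci * math.cos(dot(w, x)) for ci, x in zip(c, X))
    im = sum(ci * math.sin(dot(w, x)) for ci, x in zip(c, X))
    return re * re + im * im

def grad(w, X, c):
    # gradient of Phi: sum_i 2 c_i (im * cos t_i - re * sin t_i) x_i
    re = sum(ci * math.cos(dot(w, x)) for ci, x in zip(c, X))
    im = sum(ci * math.sin(dot(w, x)) for ci, x in zip(c, X))
    g = [0.0] * len(w)
    for ci, x in zip(c, X):
        t = dot(w, x)
        coef = 2.0 * ci * (im * math.cos(t) - re * math.sin(t))
        for k in range(len(w)):
            g[k] += coef * x[k]
    return g

def langevin_peak(X, c, d, chains=50, steps=200, eta=0.01, beta=10.0, seed=0):
    # parallel Langevin ascent: w <- w + eta * grad + sqrt(2*eta/beta) * noise;
    # return the best single point encountered across all chains
    rng = random.Random(seed)
    best_w, best_val = None, -1.0
    for _ in range(chains):
        w = [rng.gauss(0.0, 2.0) for _ in range(d)]
        for _ in range(steps):
            g = grad(w, X, c)
            w = [wk + eta * gk + math.sqrt(2.0 * eta / beta) * rng.gauss(0.0, 1.0)
                 for wk, gk in zip(w, g)]
        val = potential(w, X, c)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val

# toy data on a line: the potential is a Dirichlet-kernel-like landscape
X = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
c = [1.0, 1.0, 1.0, 1.0]   # reweighted coefficients alpha_i * y_i (toy values)
w_star, val = langevin_peak(X, c, d=2)
```

On this toy instance the best value found approaches the pointwise bound (Σ_i |c_i|)² = 16 attained at a global peak of Φ.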
3.2 Learning the Margin-Maximizing Kernel for SVM
Support vector machines (SVMs) are perhaps the most ubiquitous use case of kernels in practice. To this end, we propose a method that boosts Algorithm 1, building a kernel that maximizes the classification margin. Let k be a kernel with dual measure μ, and let α denote the SVM dual variables. Write the dual SVM objective (our method also applies to other SVM variants, with even stronger theoretical guarantees; see Section 4.1), parameterizing the kernel by its dual measure:

F(α, μ) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k_μ(x_i, x_j).
Thus, for a fixed α, minimizing the dual objective over the kernel is equivalent to maximizing kernel alignment with the reweighted class measures, and can be done by Algorithm 1. However, the support vector weights are of course not fixed; given a kernel k, α is chosen to maximize the objective, giving the (reciprocal) SVM margin. In all, to find a kernel which maximizes the margin under an adversarial choice of α, one must consider a two-player zero-sum game:
min_μ max_α F(α, μ), writing F for the dual SVM objective above,    (6)
where μ ranges over the same constraint set as in Section 3.1, and α over the usual dual feasible set {α : 0 ≤ α_i ≤ C, ⟨α, y⟩ = 0} with box parameter C.
In this view, we make some key observations. First, Algorithm 1 allows the min-player to play a pure-strategy best response to the max-player. Furthermore, a mixed strategy for the min-player is simply a translation- or rotation-invariant kernel, realized by the feature map corresponding to its support. Finally, since the objective is linear in μ and concave in α, there exists a Nash equilibrium for this game, whose μ-component gives the margin-maximizing kernel.
We can use no-regret learning dynamics to approximate this equilibrium. Algorithm 2 runs Algorithm 1 for the min-player, and online gradient ascent (Zinkevich, 2003) for the max-player. Intuitively (and as is visualized in our synthetic experiments), this process slowly morphs the landscape of the Fourier potential to emphasize the margin, causing Algorithm 1 to find progressively more informative features. At the end, we simply concatenate these features; contingent on the success of the kernel alignment steps, we have approximated the Nash equilibrium.
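A minimal end-to-end sketch of these dynamics follows (our own simplified stand-in: the Langevin subroutine is replaced by a search over a fixed candidate pool, the dual step uses the alternating projections of Appendix A.2, and all names and constants are illustrative).

```python
import math, random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fourier_potential(w, X, c):
    # Phi(w) = |sum_i c_i exp(i <w, x_i>)|^2, with c_i = alpha_i * y_i
    re = sum(ci * math.cos(dot(w, x)) for ci, x in zip(c, X))
    im = sum(ci * math.sin(dot(w, x)) for ci, x in zip(c, X))
    return re * re + im * im

def project_dual(alpha, y, box, iters=50):
    # alternating projection onto the box [0, box]^n and the hyperplane <a, y> = 0
    a = list(alpha)
    yy = dot(y, y)
    for _ in range(iters):
        shift = dot(a, y) / yy
        a = [ai - shift * yi for ai, yi in zip(a, y)]
        a = [min(box, max(0.0, ai)) for ai in a]
    return a

def learn_kernel(X, y, rounds=20, box=1.0, step=0.3, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [box / 2.0] * n
    freqs = []
    # stand-in for Algorithm 1: search a fixed candidate pool for the peak
    pool = [[rng.gauss(0.0, 2.0) for _ in range(d)] for _ in range(500)]
    for _ in range(rounds):
        c = [ai * yi for ai, yi in zip(alpha, y)]
        w = max(pool, key=lambda v: fourier_potential(v, X, c))
        freqs.append(w)
        # features of the current finite-support kernel (cos/sin pairs)
        feats = [[f(dot(v, x)) for v in freqs for f in (math.cos, math.sin)]
                 for x in X]
        K = [[dot(fi, fj) / len(freqs) for fj in feats] for fi in feats]
        # one step of projected gradient ascent on the dual SVM objective
        g = [1.0 - y[i] * sum(alpha[j] * y[j] * K[i][j] for j in range(n))
             for i in range(n)]
        alpha = project_dual([a + step * gi for a, gi in zip(alpha, g)], y, box)
    return freqs, alpha
```

On a 1-D toy task labeled by the sign of cos(3x), the first selected frequency lands near the true fundamental frequency 3, illustrating how the adversarial reweighting then steers later rounds toward the margin.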
We provide a theoretical analysis in Section 4.1, and discuss heuristics, hyperparameters, and implementation details in depth in Appendix A. One important note is that the online gradient can be computed very efficiently from the most recently appended feature vectors. Langevin dynamics, easily implemented on a GPU, comprise the primary time bottleneck.

4 Theory
4.1 Convergence of No-Regret Learning Dynamics
We first state the main theoretical result, which quantifies the convergence properties of Algorithm 2.
Theorem 4.1 (Main).
We prove Theorem 4.1 in Appendix C. For convenience, we state the first version as an explicit margin bound (in terms of a competitive ratio):
Corollary 4.2.
Let γ be the margin obtained by training a linear SVM, with the same box parameter as in Algorithm 2, on the transformed samples. Then, γ is competitive with γ*, the maximum margin achievable by a kernel with the assumed invariance, with
This bound arises from the regret analysis of online gradient ascent (Zinkevich, 2003); our analysis is similar to the approach of Freund & Schapire (1996), where they present a boosting perspective. When using a strongly convex SVM variant, the final term can be improved (Hazan et al., 2007). For a general overview of results in the field, refer to Hazan (2016).
4.2 Generalization Bounds
Finally, we state two (rather distinct) generalization guarantees. Both depend mildly on a bandwidth assumption and the norm of the data. First, we state a margin-dependent SVM generalization bound, due to Koltchinskii & Panchenko (2002). Notice the appearance of the dual measure's total mass, justifying our choice of normalization constant for Algorithm 1. Intriguingly, the end-to-end generalization error of our method decreases with an increasing number of random features, since the margin bound is being refined during Algorithm 2.
Theorem 4.3 (Generalization via margin).
For any SVM decision function, with a kernel whose dual measure satisfies the mass constraint of Algorithm 1, trained on n samples drawn i.i.d. from a distribution D, the generalization error is bounded by
The proof can be found in Appendix D. Note that this improves on the generic result for MKL, from Theorem 2 in Cortes et al. (2010), which has a dependence on the number of base kernels. This improvement stems from the rank-one property of each component kernel.
Next, we address another concern entirely: the sample size required for the empirical Fourier potential to approximate the ideal Fourier potential, the squared magnitude of the Fourier transform of the signed measure arising from the true distribution. For the shift-invariant case:
Theorem 4.4 (Generalization of the potential).
Let S be a set of n i.i.d. training samples on ℝ^d. Then, with high probability, the empirical Fourier potential is uniformly close to the ideal one; the bound suppresses factors polynomial in the problem parameters.
The full statement and proof are standard, and deferred to Appendix E. In particular, this result allows for a mild guarantee of polynomial hitting time on the locus of approximate local maxima of the ideal potential (as opposed to the empirical one). Adapting the main result from Zhang et al. (2017):
Theorem 4.5 (Langevin hitting time).
Of course, one should not expect polynomial hitting time on approximate global maxima; Roberts & Tweedie (1996) give asymptotic mixing guarantees.
5 Experiments
In this section, we highlight the most important and illustrative parts of our experimental results. For further details, we provide an extended addendum to the experimental section in Appendix A. The code can be found at github.com/yzignescent/NotSoRandomFeatures.
First, we exhibit two simple binary classification tasks, one in the plane and the other on the sphere, to demonstrate the power of our kernel selection method. As depicted in Figure 1, we create datasets with sharp boundaries, which are difficult for the standard RBF and arc-cosine (Cho & Saul, 2009) kernels. In both cases, we use large training and test sets. (The test set is chosen so large in order to measure the true generalization error.) Used in an SVM classifier, these baseline kernels saturate at 92.1% and 95.1% test accuracy, respectively; they are not expressive enough.
On the “windmill” task, Algorithm 2 chooses random features that progressively refine the decision boundary at the margin. Within relatively few rounds, it exhibits almost perfect classification (99.7% training, 99.3% test). Similarly, on the “checkerboard” task, Algorithm 2 (with some adaptations described in Appendix A.3) reaches almost perfect classification (99.7% training, 99.1% test), supported on only a small number of spherical harmonics as features.
We provide some illuminating visualizations. Figures 1(a) and 1(b) show the evolution of the dual weights, random features, and classifier. As the theory suggests, the objective evolves to assign higher weight to points near the margin, and successive features improve the classifier's decisiveness in challenging regions (1(a), bottom). Figure 1(c) visualizes some features from the experiment.
Next, we evaluate our kernel on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets; each task consists of a large number of training and test examples, which is considered large-scale for kernel methods. Following the standard protocol from Yu et al. (2016), 512-dimensional HoG features (Dalal & Triggs, 2005) are used for the CIFAR-10 tasks instead of raw images. Of course, our intent is not to show state-of-the-art results on these tasks, on which deep neural networks easily dominate. Instead, the aim is to demonstrate viability as a scalable, principled kernel method.
We compare our results to baseline random-features-based kernel machines: standard RBF random features (RBF-RF for short), and the method of Sinha & Duchi (2016) (LKRF) (traditional MKL methods are not tested here, as they are noticeably slower), using the same SVM throughout. As shown by Table 1, our method reliably outperforms these baselines, most significantly in the regime of few features. Intuitively, this lines up with the expectation that isotropic random sampling becomes exponentially unlikely to hit a good peak in high dimension; our method searches for these peaks.
Furthermore, our theory predicts that the margin keeps improving, regardless of the dimensionality of the feature maps. Indeed, as our classifier saturates on the training data, test accuracy continues increasing, without overfitting. This decoupling of generalization from model complexity is characteristic of boosting methods.
In practice, our method is robust with respect to hyperparameter settings. Moreover, to outperform both RBF-RF and LKRF with 5000 features, our method only needs a small fraction as many features. Our GPU implementation reaches this point in a matter of seconds. See Appendix A.1 for more tuning guidelines.
6 Conclusion
We have presented an efficient kernel learning method that uses tools from Fourier analysis and online learning to optimize over two natural infinite families of kernels. With this method, we show meaningful improvements on benchmark tasks, compared to related random-features-based methods. Many theoretical questions remain, such as accelerating the search for Fourier peaks (e.g. Hassanieh et al. (2012); Kapralov (2016)). These, in addition to applying our learned kernels to state-of-the-art methods (e.g. convolutional kernel networks (Mairal et al., 2014; Mairal, 2016)), prove to be exciting directions for future work.
Acknowledgments
We are grateful to Sanjeev Arora, Roi Livni, Pravesh Kothari, Holden Lee, Karan Singh, and Naman Agarwal for helpful discussions. This research was supported by the National Science Foundation (NSF), the Office of Naval Research (ONR), and the Simons Foundation. The first two authors are supported in part by Elad Hazan’s NSF grant IIS1523815.
References

Agarwal et al. (2017) Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.
Avron et al. (2016) Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1–38, 2016.
 Băzăvan et al. (2012) Eduard Gabriel Băzăvan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In Computer Vision–ECCV 2012, pp. 459–473. Springer, 2012.

Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.
Cortes et al. (2010) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 247–254. Citeseer, 2010.
 Cortes et al. (2012) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795–828, 2012.
 Crammer et al. (2003) Koby Crammer, Joseph Keshet, and Yoram Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems, pp. 553–560, 2003.
 Cristianini et al. (2002) Nello Cristianini, John ShaweTaylor, Andre Elisseeff, and Jaz Kandola. On kerneltarget alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.
 Dai et al. (2014) Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, MariaFlorina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pp. 3041–3049, 2014.

Dalal & Triggs (2005) Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, volume 1, pp. 886–893. IEEE, 2005.
Daskalakis et al. (2015) Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2015.
 Devinatz (1953) A Devinatz. Integral representations of positive definite functions. Transactions of the American Mathematical Society, 74(1):56–77, 1953.

Freund & Schapire (1996) Yoav Freund and Robert E Schapire. Game theory, online prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 325–332. ACM, 1996.
Fukumizu et al. (2009) Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, Bernhard Schölkopf, and Bharath K Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pp. 1750–1758, 2009.
Gallier (2009) Jean Gallier. Notes on spherical harmonics and linear representations of Lie groups. 2009.
 Gönen & Alpaydın (2011) Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
 Hassanieh et al. (2012) Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1183–1194. Society for Industrial and Applied Mathematics, 2012.
 Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(34):157–325, 2016.
 Hazan et al. (2007) Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.

Higuchi (1987) Atsushi Higuchi. Symmetric tensor spherical harmonics on the N-sphere and their application to the de Sitter group SO(N, 1). Journal of Mathematical Physics, 28(7):1553–1566, 1987.
Kakade et al. (2009) Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.
 Kapralov (2016) Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 264–277. ACM, 2016.
 Koltchinskii & Panchenko (2002) Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pp. 1–50, 2002.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Lanckriet et al. (2004) Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mairal (2016) Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 1399–1407, 2016.
 Mairal et al. (2014) Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.
 McDiarmid (1989) Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
 Müller (2012) Claus Müller. Analysis of spherical symmetries in Euclidean spaces, volume 129. Springer Science & Business Media, 2012.
 Nesterov (2005) Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

Oliva et al. (2016) Junier Oliva, Avinava Dubey, Barnabas Poczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.
Rahimi & Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
 Roberts & Tweedie (1996) Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363, 1996.
 Rudin (2011) Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011.
 Schapire (1999) Robert E Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 1401–1406, 1999.
 Schoenberg (1942) IJ Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96–108, 1942.
 Sinha & Duchi (2016) Aman Sinha and John C Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.
 Sriperumbudur et al. (2010) Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
 Stein & Weiss (2016) Elias M Stein and Guido Weiss. Introduction to Fourier analysis on Euclidean spaces (PMS32), volume 32. Princeton University Press, 2016.
 von Neumann (1949) John von Neumann. On rings of operators. reduction theory. Annals of Mathematics, pp. 401–485, 1949.
 Yang et al. (2015) Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. À la carte – learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 1098–1106, 2015.
 Yu et al. (2016) Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N HoltmannRice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.
 Zhang et al. (2017) Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
 Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pp. 928–936, 2003.
Appendix A Appendix for Experiments
Algorithm 2 gives a high-level outline of the essential components of our method. However, it conceals several hyperparameter choices and algorithmic heuristics, which are pertinent when applying our method in practice. We discuss a few more details in this section.
A.1 Hyperparameter Tuning Guide
Throughout all experiments presented, we use hinge-loss SVM classifiers with a fixed box parameter. Note that the convergence of Algorithm 2 depends quadratically on this parameter.
With regard to Langevin diffusion (Algorithm 1), we observe that the best samples arise from using high temperatures and Gaussian parallel initialization. For the latter, a rule of thumb is to initialize 500 parallel copies of Langevin dynamics, drawing the initial position from a centered isotropic Gaussian whose variance matches that of the optimal RBF random features. (In turn, a common rule of thumb for this bandwidth is the median of pairwise Euclidean distances between data points.)
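These two rules of thumb can be written down directly (a sketch; the function names are ours, while the 500-copy default follows the text above):

```python
import math, random

def median_pairwise_distance(X):
    # median heuristic for the RBF bandwidth: the median Euclidean
    # distance over all pairs of data points
    dists = sorted(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
        for i in range(len(X)) for j in range(i + 1, len(X)))
    m = len(dists)
    return dists[m // 2] if m % 2 == 1 else 0.5 * (dists[m // 2 - 1] + dists[m // 2])

def parallel_init(d, sigma, copies=500, seed=0):
    # initial positions for the parallel Langevin chains: a centered isotropic
    # Gaussian in the Fourier domain, with scale the reciprocal bandwidth
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(copies)]
```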
The step size in Algorithm 1 is tuned based on the magnitude of the gradient of the Fourier potential, which can be significantly smaller than the upper bound derived in Appendix E. As is standard practice in Langevin Monte Carlo methods, the temperature is chosen so that the perturbation is roughly at the same magnitude as the gradient step. Empirically, running Langevin dynamics for 100 steps suffices to locate a reasonably good peak. To further improve efficiency, one can modify Algorithm 1 to pick the top several samples, a severalfold speedup which does not degrade the features much.
The step size of online gradient ascent is set to balance between being conservative and promoting diverse samples; these steps should not saturate (thereby solving the dual SVM problem), in order to have the strongest regret bound. In our experiments, we find that the step size achieving the standard regret guarantee tends to be a little too conservative.
On the other hand, it never hurts (and seems important in practice) to saturate the peak-finding routine (Algorithm 1), since this contributes an additive improvement to the margin bound. Noting that the objective is very smooth, it may be beneficial to refine the samples using a few steps of gradient descent with a very small learning rate, or an accelerated algorithm for finding approximate local minima of smooth non-convex functions; see, e.g., Agarwal et al. (2017).
A.2 Projection onto the SVM Dual Feasible Set
A quick note on projection onto the feasible set of the SVM dual convex program: it typically suffices in practice to use alternating projection. This feasible set is the intersection of a hyperplane and a hypercube, both of which admit a simple projection step. Alternating projection onto the intersection of two nonempty convex sets was originally proposed by von Neumann (1949), and its convergence rate can be shown to be linear. To obtain the dual variables in our experiments, we use such alternating projections. This results in a dual-feasible solution up to hardware precision, and is a negligible component of the total running time (for which the parallel gradient computations are the bottleneck).

A.3 Sampling Spherical Harmonics
As we note in Section 3.1, it is unclear how to define gradient Langevin dynamics in the inner-product case, since no topology is available on the indices of the spherical harmonics. One option is to emulate Langevin dynamics by constructing a discrete Markov chain which mixes to an analogous stationary distribution.

However, we find in our experiments that it suffices to compute the peak by examining all index pairs up to some degree threshold. One should view this as approximating the kernel-target alignment objective function via Fourier truncation. This is highly parallelizable: it involves evaluating bounded-degree polynomials on the same sample data, which can be expressed using matrix multiplication. In our experiments, it sufficed to examine the lowest-degree coefficients; we remark that for real-world datasets it would be unnatural for the Fourier potential to have large values only beyond such a threshold.
Under Fourier truncation, the domain of the dual measure becomes a finite-dimensional simplex. In the game-theoretic view of Section 4.1, an approximate Nash equilibrium becomes concretely achievable via Nesterov's excessive gap technique (Nesterov, 2005; Daskalakis et al., 2015), given that the kernel player's actions are restricted to mixed strategies over a finite set of basis kernels.
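To illustrate the finite-strategy regime, here is a minimal no-regret sketch on a random payoff matrix: the row (kernel) player runs multiplicative weights while the column player best-responds, and the averaged strategies form an approximate equilibrium (this uses plain Hedge rather than the excessive gap technique, and the payoff matrix is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 8))         # payoffs: row player maximizes, column player minimizes
T = 4000
eta = np.sqrt(np.log(6) / T)   # Hedge step size
w = np.ones(6) / 6
p_avg, q_avg = np.zeros(6), np.zeros(8)
for _ in range(T):
    j = int(np.argmin(w @ A))  # column player best-responds
    p_avg += w / T
    q_avg[j] += 1.0 / T
    w *= np.exp(eta * A[:, j])  # multiplicative-weights update
    w /= w.sum()
# exploitability of the averaged strategies bounds the distance to equilibrium
gap = (A @ q_avg).max() - (p_avg @ A).min()
assert 0 <= gap < 0.1
```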
Finally, we note a significant advantage of this setting, where we have a discrete set of Fourier coefficients: the same feature may be found multiple times. When a duplicate feature is found, it need not be concatenated to the representation; instead, the existing feature is rescaled appropriately. This accounts for the drastically smaller support of features required to achieve near-perfect classification accuracy.
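The bookkeeping is a one-liner: maintain a map from harmonic indices to weights and rescale on collision (the indices and weights below are illustrative):

```python
# accumulate duplicate features by rescaling instead of concatenating
feature_weights = {}
found = [((2, 1), 0.5), ((3, 0), 1.0), ((2, 1), 0.25)]  # (index, weight) pairs
for idx, wt in found:
    feature_weights[idx] = feature_weights.get(idx, 0.0) + wt
assert feature_weights == {(2, 1): 0.75, (3, 0): 1.0}
```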
Appendix B More on Spherical Harmonics
In this section, we go into more detail about spherical harmonics in $d$ dimensions. Although all of this material is standard in harmonic analysis, we provide this section for convenience, isolating only the relevant facts.
B.1 Spherical Harmonic Expansion of a Rotation-Invariant Kernel
First, we provide a proof sketch for Theorem 2.2. We rely on the following theorem, an analogue of Bochner’s theorem on the sphere, which characterizes rotationinvariant kernels:
Theorem B.1 (Schoenberg (1942)).
A continuous rotation-invariant function on the sphere $S^{d-1}$ is positive semidefinite if and only if its expansion into Gegenbauer polynomials has only nonnegative coefficients, i.e.

(7) $k(\langle x, y \rangle) = \sum_{\ell=0}^{\infty} a_\ell\, C_\ell^{(\lambda)}(\langle x, y \rangle), \qquad \lambda = \frac{d-2}{2},$

with $a_\ell \ge 0$ for all $\ell$.
The Gegenbauer polynomials $C_\ell^{(\lambda)}$ are a generalization of the Legendre and Chebyshev polynomials; they form an orthogonal basis of $L^2([-1,1])$ under the weight function $(1-t^2)^{\lambda - 1/2}$ (see Stein & Weiss (2016); Gallier (2009); Müller (2012) for details). Note that our notation differs from that of Schoenberg (1942).
Unlike Theorem 2.1, Theorem B.1 alone does not provide a clear path towards constructing a feature map for an inner-product kernel, because the inputs of the kernel remain coupled after the expansion: each Gegenbauer polynomial acts on the inner product $\langle x, y \rangle$ rather than on $x$ and $y$ separately. Fortunately (see, e.g., Proposition 1.18 in Gallier (2009)), a Gegenbauer polynomial evaluated on an inner product admits a decomposition into spherical harmonics $Y_{\ell,m}$:

(8) $C_\ell^{(\lambda)}(\langle x, y \rangle) = c_{d,\ell} \sum_{m} Y_{\ell,m}(x)\, \overline{Y_{\ell,m}(y)},$

where $c_{d,\ell} > 0$ is a normalization constant depending only on $d$ and $\ell$, and the sum runs over an orthonormal basis of degree-$\ell$ spherical harmonics.
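For intuition, in $d = 3$ this decomposition is the classical addition theorem $P_\ell(\langle x, y\rangle) = \frac{4\pi}{2\ell+1} \sum_m Y_{\ell,m}(x)\,\overline{Y_{\ell,m}(y)}$, which can be verified numerically:

```python
import numpy as np
from scipy.special import eval_legendre, sph_harm

# two points on S^2, given as (polar, azimuthal) angles
theta1, phi1 = 0.7, 1.1
theta2, phi2 = 2.1, 4.0
x = np.array([np.sin(theta1) * np.cos(phi1), np.sin(theta1) * np.sin(phi1), np.cos(theta1)])
y = np.array([np.sin(theta2) * np.cos(phi2), np.sin(theta2) * np.sin(phi2), np.cos(theta2)])

ell = 4
lhs = eval_legendre(ell, x @ y)  # Legendre polynomial of the inner product
rhs = 4 * np.pi / (2 * ell + 1) * sum(
    sph_harm(m, ell, phi1, theta1) * np.conj(sph_harm(m, ell, phi2, theta2))
    for m in range(-ell, ell + 1)
)
assert abs(lhs - rhs) < 1e-10    # the two sides agree up to rounding
```

Note that SciPy's `sph_harm(m, n, theta, phi)` takes the azimuthal angle first and the polar angle second.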
B.2 Pairing the Spherical Harmonics
We now specify the involution on indices of spherical harmonics, which gives a pairing that plays the role of opposite Fourier coefficients in the translation-invariant case. There, the Fourier transform of a real-valued function satisfies $\hat{f}(-\omega) = \overline{\hat{f}(\omega)}$, so that the dual measure can be constrained to be symmetric.
Now, consider the rotation-invariant case, where the Fourier coefficients live on the set of valid indices of spherical harmonics. We would like a permutation on the indices that sends each spherical harmonic to its complex conjugate, so that whenever a measure arises as the spherical harmonic expansion of a real kernel, its coefficients at paired indices are complex conjugates.
For a fixed dimension $d$ and degree $\ell$, the spherical harmonics of degree $\ell$ form a finite-dimensional orthogonal basis for a vector space of complex-valued polynomials on the sphere, namely the restrictions of the homogeneous harmonic polynomials (eigenfunctions of the Laplace–Beltrami operator). This basis is not unique, and an arbitrary choice of basis need not contain pairs of conjugate functions. Fortunately, such a basis exists constructively, obtained by separating the Laplace–Beltrami operator in spherical coordinates: for fixed dimension and degree, each spherical harmonic in the basis is a product of functions from a certain family of associated Legendre functions, times a complex exponential in the final (azimuthal) angle. For a detailed treatment, we adopt the construction and conventions from Higuchi (1987).
In the scope of this paper, the relevant fact is that the desired involution is well-defined in all dimensions. Namely, consider the permutation that negates the final (azimuthal) index of a spherical harmonic: since only the azimuthal factor is complex, this permutation sends each basis harmonic to its complex conjugate.
The symmetry condition in Theorem 2.2 follows straightforwardly. By orthonormality, every square-integrable function on the sphere has a unique decomposition into spherical harmonics. When the function is real-valued, we conclude that its coefficients at paired indices are complex conjugates of each other, as claimed.
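In $d = 3$, for example, the involution simply negates the azimuthal index: with the standard Condon–Shortley convention, $Y_{\ell,-m} = (-1)^m \overline{Y_{\ell,m}}$, which can be checked numerically:

```python
import numpy as np
from scipy.special import sph_harm

theta, phi = 0.9, 2.3  # polar and azimuthal coordinates on S^2
for ell in range(5):
    for m in range(ell + 1):
        # note scipy's (m, l, azimuth, polar) argument order
        lhs = sph_harm(-m, ell, phi, theta)
        rhs = (-1) ** m * np.conj(sph_harm(m, ell, phi, theta))
        assert abs(lhs - rhs) < 1e-12  # conjugate pairing of the basis
```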
Appendix C Proof of the Main Theorem
In this section, we prove the main theorem, which quantifies convergence of Algorithm 2 to the Nash equilibrium. We restate it here:
Theorem 4.1.
Assume that during each timestep, the call to Algorithm 1 returns an approximate global maximizer. Then, Algorithm 2 returns a dual measure which satisfies
Alternatively, under the weaker assumption that at each timestep Algorithm 1 returns only an approximately optimal point, the returned measure satisfies
If Algorithm 2 is used on an SVM, the regret bound can be improved.
Proof.
We will make use of the regret bound for online gradient ascent (see, e.g., Hazan (2016)). Here we only prove the theorem in the case of an SVM with box constraint, under the assumption of approximate optimality of Algorithm 1; extending the proof to the other cases is straightforward.
Lemma C.1 (Regret bound for online gradient ascent).
Let $D$ be the diameter of the constraint set $\mathcal{K}$, and let $G$ be a Lipschitz constant for an arbitrary sequence of concave functions $f_1, \ldots, f_T$ on $\mathcal{K}$. Online gradient ascent on this sequence, with step size schedule $\eta_t = \frac{D}{G\sqrt{t}}$, guarantees the following for all $T$:

$$\max_{x^\star \in \mathcal{K}} \sum_{t=1}^{T} f_t(x^\star) - \sum_{t=1}^{T} f_t(x_t) \le \frac{3}{2}\, G D \sqrt{T}.$$
Here, $D$ is bounded via the box constraint, and we have a corresponding bound on $G$.
Thus, our regret bound is $O(GD\sqrt{T})$; that is, for all $T$,
Since at each timestep the call to Algorithm 1 returns an approximate maximizer, we have by assumption
and, by concavity of the objective in the dual measure, for any fixed sequence it follows that
To complete the proof, note that the value at the averaged play follows by linearity of the objective in the dual measure; here, the averaged play is the approximate Nash equilibrium found by the no-regret learning dynamics.
∎
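The regret guarantee of Lemma C.1 is easy to check numerically. Below is a toy instance with linear reward functions on a unit box (an illustrative sketch; the constraint set and reward sequence are stand-ins, not the SVM dual objective):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 4, 2000
grads = rng.normal(size=(T, d))          # f_t(x) = <g_t, x>, concave in x
D = np.sqrt(d)                           # diameter of the unit box [0, 1]^d
G = np.linalg.norm(grads, axis=1).max()  # Lipschitz constant of the sequence

x = np.full(d, 0.5)
payoff = 0.0
for t in range(1, T + 1):
    g = grads[t - 1]
    payoff += g @ x  # play x_t, collect reward
    # gradient ascent step followed by projection onto the box
    x = np.clip(x + (D / (G * np.sqrt(t))) * g, 0.0, 1.0)

# best fixed point in hindsight: a corner of the box, chosen per coordinate
S = grads.sum(axis=0)
best = np.clip(np.sign(S), 0.0, 1.0) @ S
regret = best - payoff
assert regret <= 1.5 * G * D * np.sqrt(T)  # the bound of Lemma C.1
```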
Appendix D Proof of Theorem 4.3
In this section, we compute the Rademacher complexity of the composition of the learned kernel and the classifier, proving Theorem 4.3.
Theorem 4.3.
For any SVM decision function with a constrained kernel, trained on samples drawn i.i.d. from the underlying distribution, the generalization error is bounded by
Proof.
For a fixed sample, we compute the empirical Rademacher complexity of the composite hypothesis class. Below, $\sigma$ denotes a vector of i.i.d. Rademacher random variables.
Note that each term in the formula is the empirical Rademacher complexity of the composition of a 1-Lipschitz function (cosine or sine) with a linear function of bounded norm. By the composition property of Rademacher complexity (Talagrand's contraction lemma), the theorem follows.
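For the inner linear class, this quantity has a closed form: the empirical Rademacher complexity of $\{x \mapsto \langle w, x\rangle : \|w\| \le B\}$ equals $\frac{B}{m}\, \mathbb{E}_\sigma \|\sum_i \sigma_i x_i\|$, which is at most $\frac{B}{m}\sqrt{\sum_i \|x_i\|^2}$ by Jensen's inequality. A quick Monte Carlo check (with synthetic data; $B$ and the sample are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, B = 50, 5, 2.0
X = rng.normal(size=(m, d))                      # fixed sample S
sigma = rng.choice([-1.0, 1.0], size=(5000, m))  # i.i.d. Rademacher vectors

# (B/m) E_sigma || sum_i sigma_i x_i ||: the sup over the norm ball is
# attained at w = B * v / ||v|| with v = sum_i sigma_i x_i
estimate = B / m * np.linalg.norm(sigma @ X, axis=1).mean()
bound = B / m * np.sqrt((X ** 2).sum())          # Jensen / Cauchy-Schwarz bound
assert estimate <= bound
```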