offer a comprehensive suite of mathematically well-founded non-parametric modeling techniques for a wide range of problems in machine learning. These include nonlinear classification, regression, clustering, semi-supervised learning(Belkin et al., 2006), time-series analysis (Parzen, 1970), sequence modeling (Song et al., 2010), dynamical systems (Boots et al., 2013), hypothesis testing (Harchaoui et al., 2013), causal modeling (Zhang et al., 2011) and many more.
The central object of kernel methods is a kernel function defined on an input domain 111In fact, can be a rather general set. However, in this paper it is restricted to being a subset of .. The kernel is (non-uniquely) associated with an embedding of the input space into a high-dimensional Hilbert space (with inner product ) via a feature map, , such that
Standard regularized linear statistical models in then provide non-linear inference with respect to the original input representation. The algorithmic basis of such constructions are classical Representer Theorems (Wahba, 1990; Schölkopf and Smola, 2002) that guarantee finite-dimensional solutions of associated optimization problems, even if is infinite-dimensional.
However, there is a steep price of these elegant generalizations in terms of scalability. Consider, for example, least squares regression given data points and assume that
. The complexity of linear regression training using standard least squares solvers is, with memory requirements, and prediction speed on a test point. Its kernel-based nonlinear counterpart, however, requires solving a linear system involving the Gram matrix of the kernel function (defined by ). In general, this incurs complexity for training, memory requirements, and prediction time for a single test point – none of which are particularly appealing in “big data” settings. Similar conclusions apply to other algorithms such as Kernel PCA.
This is rather unfortunate, since non-parametric models, such as the ones produced by kernel methods, are particularly appealing in a big data settings as they can adapt to the full complexity of the underlying domain, as uncovered by increasing dataset sizes. It is well-known that imposing strong structural constraints upfront for the purpose of allowing an efficient solution (in the above example: a linear hypothesis space) often limits, both theoretically and empirically, the potential to deliver value on large amounts of data. Thus, as big data becomes pervasive across a number of application domains, it has become necessary to be able to develop highly scalable algorithms for kernel methods.
Recent years have seen intensive research on improving the scalability of kernel methods; we review some recent progress in the next section. In this paper, we revisit one of the most successful techniques, namely the randomized construction of a family of low-dimensional approximate feature maps proposed by Rahimi and Recht (2008). These randomized feature maps, , provide low-distortion approximations for (complex-valued) kernel functions :
where denotes the space of -dimensional complex numbers with the inner product, , with denoting the conjugate of the complex number (though Rahimi and Recht (2008) also define real-valued feature maps for real-valued kernels, our technical exposition is simplified by adopting the generality of complex-valued features). The mapping is now applied to each of the data points, to obtain a randomized feature representation of the data. We then apply a simple linear method to these random features. That is, if our data is we learn on where . As long as is sufficiently smaller than , this leads to more scalable solutions, e.g., for regression we get back to training and prediction time, with memory requirements. This technique is immensely successful, and has been used in recent years to obtain state-of-the-art accuracies for some classical datasets (Huang et al., 2014; Dai et al., 2014; Sindhwani and Avron, 2014; Lu et al., 2014).
The starting point of Rahimi and Recht (2008) is a celebrated result that characterizes the class of positive definite functions:
A function is a positive definite function if for any set of points, , the matrix defined by is positive semi-definite.
Theorem 2 (Bochner (1933)).
A complex-valued function is positive definite if and only if it is the Fourier
Transform of a finite non-negative Borel measure
is positive definite if and only if it is the Fourier Transform of a finite non-negative Borel measureon , i.e.,
Without loss of generality, we assume henceforth that.
A kernel function on is called shift-invariant if , for some positive definite function . Bochner’s theorem implies that a scaled shift-invariant kernel can therefore be put into one-to-one correspondence with a density such that,
For the most notable member of the shift-invariant family of kernels – the Gaussian kernel:
the associated density is again Gaussian .
The integral representation of the kernel (2) may be approximated as follows:
through the feature map,
The subscript denotes dependence of the feature map on the sequence .
The goal of this work is to improve the convergence behavior of this approximation, so that a smaller can be used to get the same quality of approximation to the kernel function. This is motivated by recent work that showed that in order to obtain state-of-the-art accuracy on some important datasets, a very large number of random features is needed (Huang et al., 2014; Sindhwani and Avron, 2014).
Our point of departure from the work of Rahimi and Recht (2008) is the simple observation that when are are drawn from the distribution defined by the density function , the approximation in (1) may be viewed as a standard Monte Carlo (MC) approximation to the integral representation of the kernel. Instead of using plain MC approximation, we propose to use the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce the integration error in approximations of the form (1). A self-contained overview of Quasi-Monte Carlo techniques for high-dimensional integration problems is provided in Section 2. In Section 3, we describe how QMC techniques apply to our setting.
We then proceed to apply an average case theoretical analysis of the integration error for any given sequence (Section 4). This bound motivates an optimization problem over the sequence whose minimizer provides adaptive QMC sequences fine tuned to our kernels (Section 5).
We use both for subscript and for denoting , relying on the context to distinguish between the two. We use to denote scalars. We use
to denote vectors, and useto denote the -th coordinate of vectors . Furthermore, in a sequence of vectors, we use to denote the -th element of the sequence and use to denote the -th coordinate of vector . Given , the Gram matrix is defined as where for . We denote the error function by , i.e., for ; see Weideman (1994) and Mori (1983) for more details.
In “MC sequence” we mean points drawn randomly either from the unit cube or certain distribution that will be clear from the text. For “QMC sequence” we mean a deterministic sequence designed to reduce the integration error. Typically, it will be a low-discrepancy sequence on the unit cube.
It is also useful to recall the definition of Reproducing Kernel Hilbert Space (RKHS).
Definition 3 (Reproducing Kernel Hilbert Space (Berlinet and Thomas-Agnan, 2004)).
A reproducing kernel Hilbert space (RKHS) is a Hilbert Space that possesses a reproducing kernel, i.e., a function for which the following hold for all and :
Equivalently, RKHSs are Hilbert spaces with bounded, continuous evaluation functionals. Informally, they are Hilbert spaces with the nice property that if two functions are close in the sense of the distance derived from the norm in (i.e., is small), then their values are also close for all ; in other words, the norm controls the pointwise behavior of functions in (Berlinet and Thomas-Agnan, 2004).
2.2 Related Work
In this section we discuss related work on scalable kernel methods. Relevant work on QMC methods is discussed in the next subsection.
Scalability has long been identified as a key challenge associated with deploying kernel methods in practice. One dominant line of work constructs low-rank approximations of the Gram matrix, either using data-oblivious randomized feature maps to approximate the kernel function, or using sampling techniques such as the classical Nyström method (Williams and Seeger, 2001). In its vanilla version, the latter approach - Nyström method - samples points from the dataset, computes the columns of the Gram matrix that corresponds to the sampled data points, and uses this partial computation of the Gram matrix to construct an approximation to the entire Gram matrix. More elaborate techniques exist, both randomized and deterministic; see Gittens and Mahoney (2013) for a thorough treatment.
More relevant to our work is the randomized feature mapping approach. Pioneered by the seminal paper of Rahimi and Recht (2008), the core idea is to construct, for a given kernel on a data domain , a transformation such that . Invoking Bochner’s theorem, a classical result in harmonic analysis, Rahimi and Recht show how to construct a randomized feature map for shift-invariant kernels, i.e., kernels that can be written for some positive definite function .
Subsequently, there has been considerable effort given to extending this technique to other classes of kernels. Li et al. (2010) use Bochner’s theorem to provide random features to the wider class of group-invariant kernels. Maji and Berg (2009) suggested random features for the intersection kernel . Vedaldi and Zisserman (2012) developed feature maps for -homogeneous kernels. Sreekanth et al. (2010) developed feature maps for generalized RBF kernels where is a positive definite function, and is a distance metric. Kar and Karnick (2012) suggested feature maps for dot-product kernels. The feature maps are based on the Maclaurin expansion, which is guaranteed to be non-negative due to a classical result of Schoenberg (1942). Pham and Pagh (2013) suggested feature maps for the polynomial kernels. Their construction leverages known techniques from sketching theory. It can also be shown that their feature map is an oblivious subspace embedding, and this observation provides stronger theoretical guarantees than point-wise error bounds prevalent in the feature map literature (Avron et al., 2014). By invoking a variant of Bochner’s theorem that replaces the Fourier transform with the Laplace transform, Yang et al. (2014) obtained randomized feature maps for semigroup kernels on histograms. We note that while the original feature maps suggested by Rahimi and Recht were randomized, some of the aforementioned maps are deterministic.
Our work is more in-line with recent efforts on scaling up the random features, so that learning and prediction can be done faster. Le et al. (2013) return to the original construction of Rahimi and Recht (2008), and devise a clever distribution of random samples that is structured so that the generation of random features can be done much faster. They showed that only a very limited concession in term of convergence rate is made. Hamid et al. (2014), working on the polynomial kernel, suggest first generating a very large amount of random features, and then applying them a low-distortion embedding based the Fast Johnson-Lindenstruass Transform, so the make the final size of the mapped vector rather small. In contrast, our work tries to design so that less features will be necessary to get the same quality of kernel approximation.
Several other scalable approaches for large-scale kernel methods have been suggested over the years, starting from approaches such as chunking and decomposition methods proposed in the early days of SVM optimization literature. Raykar and Duraiswami (2007) use an improved fast Gauss transform for large scale Gaussian Process regression. There are also approaches that are more specific to the objective function at hand, e.g., Keerthi et al. (2006) builds a kernel expansion greedily to optimize the SVM objective function. Another well known approach is the Core Vector Machines (Tsang et al., 2005) which draws on approximation algorithms from computational geometry to scale up a class of kernel methods that can be reformulated in terms of the minimum enclosing ball problem.
For a broader discussion of these methods, and others, see Bottou et al. (2007).
2.3 Quasi-Monte Carlo Techniques: an Overview
In this section we provide a self-contained overview of Quasi-Monte Carlo (QMC) techniques. For brevity, we restrict our discussion to background that is necessary for understanding subsequent sections. We refer the interested reader to the excellent reviews by Caflisch (1998) and Dick et al. (2013), and the recent book Leobacher and Pillichschammer (2014) for a much more detailed exposition.
Consider the task of computing an approximation of the following integral
One can observe that if
is a random vector uniformly distributed overthen . An empirical approximation to the expected value can be computed by drawing a random point set independently from , and computing:
This is the Monte Carlo (MC) method.
Define the integration error with respect to the point set as,
is drawn randomly, the Central Limit Theorem asserts that ifis large enough then where
is a standard normal random variable, and
is the square-root of the variance of; that is,
In other words, the root mean square error of the Monte Carlo method is,
Therefore, the Monte Carlo method converges at a rate of .
The aim of QMC methods is to improve the convergence rate by using a deterministic low-discrepancy sequence to construct , instead of randomly sampling points. The underlying intuition is illustrated in Figure 1, where we plot a set of 1000 two-dimensional random points (left graph), and a set of 1000 two-dimensional points from a quasi-random sequence (Halton sequence; right graph).
In the random sequence we see that there is an undesired clustering of points, and as a consequence empty spaces. Clusters add little to the approximation of the integral in those regions, while in the empty spaces the integrand is not sampled. This lack of uniformity is due to the fact that Monte Carlo samples are independent of each other. By carefully designing a sequence of correlated points to avoid such clustering effects, QMC attempts to avoid this phenomena, and thus provide faster convergence to the integral.
The theoretical apparatus for designing such sequences are inequalities of the form
in which is a measure of the variation or difficulty of integrating and is a sequence-dependent term that typically measures the discrepancy, or degree of deviation from uniformity, of the sequence . For example, the expected Monte Carlo integration error decouples into a variance term, and as in (5).
A prototypical inequality of this sort is the following remarkable and classical result:
Theorem 4 (Koksma-Hlawka inequality).
For any function with bounded variation, and sequence , the integration error is bounded above as follows,
where is the Hardy-Krause variation of (see Niederreiter (1992)), which is defined in terms of the following partial derivatives,
and is the star discrepancy defined by
where is the local discrepancy function
with with .
Given , the second term in
is an estimate of the volume of, which will be accurate if the points in are uniform enough. measures the maximum difference between the actual volume of the subregion and its estimate for all in .
An infinite sequence is defined to be a low-discrepancy sequence if, as a function of , . Several constructions are know to be low-discrepancy sequences. One notable example is the Halton sequences, which are defined as follows. Let be the first prime numbers. The Halton sequence of dimension is defined by
where for integers and we have
in which is given by the unique decomposition
It is outside the scope of this paper to describe all these constructions in detail. However we mention that in addition to the Halton sequences, other notable members are Sobol’ sequences, Faure sequences, Niederreiter sequences, and more (see Dick et al. (2013), Section ). We also mention that it is conjectured that the rate for star discrepancy decay is optimal.
The classical QMC theory, which is based on the Koksma-Hlawka inequality and low discrepancy sequences, thus achieves a convergence rate of . While this is asymptotically superior to for a fixed , it requires to be exponential in for the improvement to manifest. As such, in the past QMC methods were dismissed as unsuitable for very high-dimensional integration.
However, several authors noticed that QMC methods perform better than MC even for very high-dimensional integration (Sloan and Wozniakowski, 1998; Dick et al., 2013).222Also see: “On the unreasonable effectiveness of QMC”, I.H. Sloan https://mcqmc.mimuw.edu.pl/Presentations/sloan.pdf Contemporary QMC literature explains and expands on these empirical observations, by leveraging the structure of the space in which the integrand function lives, to derive more refined bounds and discrepancy measures, even when classical measures of variation such as (6) are unbounded. This literature has evolved along at least two directions: one, where worst-case analysis is provided under the assumption that the integrands live in a Reproducing Kernel Hilbert Space (RKHS) of sufficiently smooth and well-behaved functions (see Dick et al. (2013), Section ) and second, where the analysis is done in terms of average-case
error, under an assumed probability distribution over the integrands, instead of worst-case error(Wozniakowski, 1991; Traub and Wozniakowski, 1994). We refrain from more details, as these are essentially the paths that the analysis in Section 4 follows for our specific setting.
3 QMC Feature Maps: Our Algorithm
We assume that the density function in (2) can be written as , where is a univariate density function. The density functions associated to many shift-invariant kernels, e.g., Gaussian, Laplacian and Cauchy, admits such a form.
The QMC method is generally applicable to integrals over a unit cube. So typically integrals of the form (2) are handled by first generating a low discrepancy sequence , and transforming it into a sequence in , instead of drawing the elements of the sequence from as in the MC method.
To convert (2) to an integral over the unit cube, a simple change of variables suffices. For , define
is the cumulative distribution function (CDF) of, for . By setting , then (2) can be equivalently written as
4 Theoretical Analysis and Average Case Error Bounds
The proofs for assertions made in this section and the next can be found in the Appendix.
The goal of this section is to develop a framework for analyzing the approximation quality of the QMC feature maps described in the previous section (Algorithm 1). We need to develop such a framework since the classical Koksma-Hlawka inequality cannot be applied to our setting, as the following proposition shows:
For any , where is a univariate density function, let
For a fixed , consider , . The Hardy-Krause variation of is unbounded. That is, one of the integrals in the sum (6) is unbounded.
Our framework is based on a new discrepancy measure, box discrepancy, that characterizes integration error for the set of integrals defined with respect to the underlying data domain. Throughout this section we use the convention that , and the notation .
Given a probability density function and , we define the integration error of a function with respect to and the samples as,
We are interested in characterizing the behavior of on where
As is common in modern QMC analysis (Leobacher and Pillichschammer, 2014; Dick et al., 2013), our analysis is based on setting up a Reproducing Kernel Hilbert Space of “nice” functions that is related to integrands that we are interested in, and using properties of the RKHS to derive bounds on the integration error. In particular, the integration error of integrands in an RKHS can be bounded using the following proposition.
Proposition 6 (Integration Error in an RKHS).
Let be an RKHS with kernel . Assume that . Then, for all we have,
The RKHS we use is as follows. For a vector , let us define . Let
and consider the space of functions that admit an integral representation over of the form
This space is associated with bandlimited functions, i.e., functions with compactly-supported inverse Fourier transforms, which are of fundamental importance in the Shannon-Nyquist sampling theory. Under a natural choice of inner product, these spaces are called Paley-Wiener spaces and they constitute an RKHS.
Proposition 8 (Kernel of Paley-Wiener RKHS (Berlinet and Thomas-Agnan, 2004; Yao, 1967; Peloso, 2011)).
By , denote the space of functions which admit the representation in (10), with the inner product . is an RKHS with kernel function,
For notational convenience, in the above we define to be . Furthermore, .
If then , so . Since we wish to bound the integration error on functions in , it suffices to bound the integration error on . Unfortunately, while defines , the functions in it, being not square integrable, are not members of , so analyzing the integration error in do not directly apply to them. However, damped approximations of of the form are members of with . Hence, we expect the analysis of the integration error in to provide provide a discrepancy measure for integrating functions in .
For the discrepancy measure in Proposition 6 can be written explicitly.
Theorem 9 (Discrepancy in ).
Suppose that is a probability density function, and that we can write
where each is a univariate probability density
function as well. Let be the characteristic function associated
be the characteristic function associated with. Then,
This naturally leads to the definition of the box discrepancy, analogous to the star discrepancy described in Theorem 4.
Definition 10 (Box Discrepancy).
The box discrepancy of a sequence with respect to is defined as,
For notational convenience, we generally omit the from as long as it is clear from the context.
The worse-case integration error bound for Paley-Wiener spaces is stated in the following as a corollary of Proposition 6. As explained earlier, this result not yet apply to functions in because these functions are not part of . Nevertheless, we state it here for completeness.
Corollary 11 (Integration Error in ).
For we have
Our main result shows that the expected square error of an integrand drawn from a uniform distribution over is proportional to the square discrepancy measure . This result is in the spirit of similar average case analysis in the QMC literature (Wozniakowski, 1991; Traub and Wozniakowski, 1994).
Theorem 12 (Average Case Error).
Let denote the uniform distribution on . That is, denotes where and is randomly drawn from a uniform distribution on . We have,
We now give an explicit formula for for the case that
is the density function of the multivariate Gaussian distribution with zero mean and independent components. This is an important special case since this is the density function that is relevant for the Gaussian kernel.
Corollary 13 (Discrepancy for Gaussian Distribution).
Let be the -dimensional multivariate Gaussian density function with zero mean and covariance matrix equal to . We have,
Intuitively, the box discrepancy of the Gaussian kernel can be interpreted as follows. The function achieves its maximum at and minimizes at discrete values of decaying to 0 as goes to . Hence the first term in (12) tends to be minimized when the pairwise distance between are sufficiently separated. Due to the shape of cumulative distribution function of Gaussian distribution, the values of are driven to be close to the boundary of the unit cube. As for second term, the original expression is -. This term encourages the sequence to mimic samples from . Since concentrates its mass around , the also concentrates around to maximize the integral and therefore the values of are driven closer to the center of the unit cube. Sequences with low box discrepancy therefore optimize a tradeoff between these competing terms.
Two other shift-invariant kernel that have been mentioned in the machine learning literature is the Laplacian kernel (Rahimi and Recht, 2008) and Matern kernel (Le et al., 2013). The distribution associated with the Laplacian kernel can be written as a product , where
is density associated with the Cauchy distribution. The characteristic function is simple () so analytic formulas like (12) can be derived. The distribution associated with the Matern kernel, on the other hand, is the multivariate t-distribution, which cannot be written as a product , so the presented theory does not apply to it.
Discrepancy of Monte-Carlo Sequences.
We now derive an expression for the expected discrepancy of Monte-Carlo sequences, and show that it decays as . This is useful since via an averaging argument we are guaranteed that there exists sets for which the discrepancy behaves .
Suppose are chosen uniformly from . Let , for . Assume that . Then
Again, we can derive specific formulas for the Gaussian density. The following is straightforward from Corollary 14. We omit the proof.
Let be the -dimensional multivariate Gaussian density function with zero mean and covariance matrix equal to . Suppose are chosen uniformly from . Let , for . Then,
5 Learning Adaptive QMC Sequences
For simplicity, in this section we assume that is the density function of Gaussian distribution with zero mean. We also omit the subscript from . Similar analysis and equations can be derived for other density functions.
Error characterization via discrepancy measures like (12) is typically used in the QMC literature to prescribe sequences whose discrepancy behaves favorably. It is clear that for the box discrepancy, a meticulous design is needed for a high quality sequence and we leave this to future work. Instead, in this work, we use the fact that unlike the star discrepancy (4), the box discrepancy is a smooth function of the sequence with a closed-form formula. This allows us to both evaluate various candidate sequences, and select the one with the lowest discrepancy, as well as to adaptively learn a QMC sequence that is specialized for our problem setting via numerical optimization. The basis is the following proposition, which gives an expression for the gradient of .
Proposition 16 (Gradient of Box Discrepancy).
Define the following scalar functions and variables,
In the above ,we define to be . Then, the elements of the gradient vector of are given by,
We explore two possible approaches for finding sequences based on optimizing the box discrepancy, namely global optimization and greedy optimization. The latter is closely connected to herding algorithms (Welling, 2009).
Global Adaptive Sequences.
The task is posed in terms minimization of the box discrepancy function (12) over the space of sequences of vectors in :
The gradient can be plugged into any first order numerical solver for non-convex optimization. We use non-linear conjugate gradient in our experiments (Section 6.2).
The above learning mechanism can be extended in various directions. For example, QMC sequences for -point rank-one Lattice Rules (Dick et al., 2013) are integral fractions of a lattice defined by a single generating vector . This generating vector may be learnt via local minimization of the box discrepancy.
Greedy Adaptive Sequences.
Starting with , for , let . At step , we solve the following optimization problem,
The greedy adaptive procedure is closely related to the herding algorithm, recently presented by Welling (2009). Applying the herding algorithm to and , and using our notation, the points are generated using the following iteration
In the above, is a series of functions in . The literature is not specific on the initial value of , with both and suggested. Either way, it is always the case that where .
Chen et al. (2010) showed that under some additional assumptions, the herding algorithm, when applied to a RKHS , greedily minimizes , which, recall, is equal to . Thus, under certain assumptions, herding and (15) are equivalent. Chen et al. (2010) also showed that under certain restrictions on the RKHS, herding will reduce the discrepancy in a ratio of . However, it is unclear whether those restrictions hold for and . Indeed, Bach et al. (2012) recently shown that these restrictions never hold for infinite-dimensional RKHS, as long as the domain is compact. This result does not immediately apply to our case since is not compact.
Classically, Monte-Carlo and Quasi-Monte Carlo approximations of integrals are unweighted, or more precisely, have a uniform weights. However, it is quite plausible to weight the approximations, i.e. approximate using
where is a set of weights. This lead to the feature map
This construction requires for , although we note that (16) itself does not preclude negative weights. We do not require the weights to be normalized, that is it is possible that .
One can easily generalize the result of the previous section to derive the following discrepancy measure that takes into consideration the weights