# Learning from Distributions via Support Measure Machines

This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in a straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.

## 1 Introduction

Discriminative learning algorithms are typically trained from large collections of vectorial training examples. In many classical learning problems, however, it is arguably more appropriate to represent training data not as individual data points, but as probability distributions. There are, in fact, multiple reasons why probability distributions may be preferable.

Firstly, uncertain or missing data naturally arise in many applications. For example, gene expression data obtained from microarray experiments are known to be very noisy due to various sources of variability [1]. In order to reduce uncertainty, and to allow for estimates of confidence levels, experiments are often replicated. Unfortunately, the feasibility of replicating the microarray experiments is often inhibited by cost constraints, as well as the amount of available mRNA. To cope with experimental uncertainty given a limited amount of data, it is natural to represent each array as a probability distribution that has been designed to approximate the variability of gene expressions across slides.

Probability distributions may be equally appropriate given an abundance of training data. In data-rich disciplines such as neuroinformatics, climate informatics, and astronomy, a high throughput experiment can easily generate a huge amount of data, leading to significant computational challenges in both time and space. Instead of scaling up one’s learning algorithms, one can scale down one’s dataset by constructing a smaller collection of distributions which represents groups of similar samples. Besides computational efficiency, aggregate statistics can potentially incorporate higher-level information that represents the collective behavior of multiple data points.

Previous attempts have been made to learn from distributions by creating positive definite (p.d.) kernels on probability measures. In [2], the probability product kernel (PPK) was proposed as a generalized inner product between two input objects, which is in fact closely related to well-known kernels such as the Bhattacharyya kernel [3] and the exponential symmetrized Kullback-Leibler (KL) divergence [4]. In [5], an extension of a two-parameter family of Hilbertian metrics of Topsøe was used to define Hilbertian kernels on probability measures. In [6], the semi-group kernels were designed for objects with additive semi-group structure such as positive measures. Recently, [7] introduced nonextensive information theoretic kernels on probability measures based on new Jensen-Shannon-type divergences. Although these kernels have proven successful in many applications, they are designed specifically for certain properties of distributions and application domains. Moreover, there has been no attempt to connect them to the kernels on the corresponding input spaces.

The contributions of this paper can be summarized as follows. First, we prove the representer theorem for a regularization framework over the space of probability distributions, which is a generalization of regularization over the input space on which the distributions are defined (Section 2). Second, a family of positive definite kernels on distributions is introduced (Section 3). Based on such kernels, a learning algorithm on probability measures called support measure machine (SMM) is proposed. An SVM on the input space is provably a special case of the SMM. Third, the paper presents the relations between sample-based and distribution-based methods (Section 4). If the distributions depend only on the locations in the input space, the SMM particularly reduces to a more flexible SVM that places different kernels on each data point.

## 2 Regularization on probability distributions

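Given a non-empty set $\mathcal{X}$, let $\mathscr{P}$ denote the set of all probability measures $\mathbb{P}$ on a measurable space $(\mathcal{X}, \mathcal{A})$, where $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\mathcal{X}$. The goal of this work is to learn a function $h: \mathscr{P} \to \mathcal{Y}$ given a set of example pairs $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$, where $\mathbb{P}_i \in \mathscr{P}$ and $y_i \in \mathcal{Y}$. In other words, we consider a supervised setting in which input training examples are probability distributions. In this paper, we focus on the binary classification problem, i.e., $\mathcal{Y} = \{+1, -1\}$.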

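In order to learn from distributions, we employ a compact representation that not only preserves necessary information of individual distributions, but also permits efficient computations. That is, we adopt a Hilbert space embedding to represent the distribution as a mean function in an RKHS [8, 9]. Formally, let $\mathcal{H}$ denote an RKHS of functions $f: \mathcal{X} \to \mathbb{R}$, endowed with a reproducing kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The mean map from $\mathscr{P}$ into $\mathcal{H}$ is defined as

$$\mu: \mathscr{P} \to \mathcal{H}, \quad \mathbb{P} \longmapsto \int_{\mathcal{X}} k(x, \cdot)\, \mathrm{d}\mathbb{P}(x). \tag{1}$$

We assume that $k(x, \cdot)$ is bounded for any $x \in \mathcal{X}$. It can be shown that, if $k$ is characteristic, the map (1) is injective, i.e., all the information about the distribution is preserved [10]. For any $\mathbb{P}$, letting $\mu_{\mathbb{P}} = \mu(\mathbb{P})$, we have the reproducing property

$$\mathbb{E}_{\mathbb{P}}[f] = \langle \mu_{\mathbb{P}}, f \rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}. \tag{2}$$

That is, we can see the mean embedding $\mu_{\mathbb{P}}$ as a feature map associated with the kernel $K: \mathscr{P} \times \mathscr{P} \to \mathbb{R}$, defined as $K(\mathbb{P}, \mathbb{Q}) = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$. Since $\sup_x \|k(x, \cdot)\|_{\mathcal{H}} < \infty$, it also follows that $K(\mathbb{P}, \mathbb{Q}) = \iint \langle k(x, \cdot), k(z, \cdot) \rangle_{\mathcal{H}}\, \mathrm{d}\mathbb{P}(x)\, \mathrm{d}\mathbb{Q}(z) = \iint k(x, z)\, \mathrm{d}\mathbb{P}(x)\, \mathrm{d}\mathbb{Q}(z)$, where the second equality follows from the reproducing property of $\mathcal{H}$. It is immediate that $K$ is a p.d. kernel on $\mathscr{P}$.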

The following theorem shows that optimal solutions of a suitable class of regularization problems involving distributions can be expressed as a finite linear combination of mean embeddings.

###### Theorem 1.

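Given training examples $(\mathbb{P}_i, y_i) \in \mathscr{P} \times \mathbb{R}$, $i = 1, \dots, m$, a strictly monotonically increasing function $\Omega: [0, +\infty) \to \mathbb{R}$, and a loss function $\ell: (\mathscr{P} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{+\infty\}$, any $f \in \mathcal{H}$ minimizing the regularized risk functional

$$\ell\left(\mathbb{P}_1, y_1, \mathbb{E}_{\mathbb{P}_1}[f], \dots, \mathbb{P}_m, y_m, \mathbb{E}_{\mathbb{P}_m}[f]\right) + \Omega\left(\|f\|_{\mathcal{H}}\right) \tag{3}$$

admits a representation of the form $f = \sum_{i=1}^m \alpha_i \mu_{\mathbb{P}_i}$ for some $\alpha_i \in \mathbb{R}$, $i = 1, \dots, m$.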

Theorem 1 clearly indicates how each distribution contributes to the minimizer of (3). Roughly speaking, the coefficients $\alpha_i$ control the contribution of the distributions through the mean embeddings $\mu_{\mathbb{P}_i}$. Furthermore, if we restrict $\mathscr{P}$ to a class of Dirac measures $\delta_x$ on $\mathcal{X}$ and consider the training set $\{(\delta_{x_i}, y_i)\}_{i=1}^m$, the functional (3) reduces to the usual regularization functional [11] and the solution reduces to $f = \sum_{i=1}^m \alpha_i k(x_i, \cdot)$. Therefore, the standard representer theorem is recovered as a particular case (see also [12] for more general results on the representer theorem).
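As a worked instance of the Dirac reduction described above, the following lines spell out why the standard representer theorem is recovered. The notation $\delta_{x_i}$ for the Dirac measure at $x_i$ and the coefficient names $\alpha_i$ are assumptions of this sketch; the identities follow from the mean map (1) and the reproducing property (2).

```latex
\begin{align*}
\mu_{\delta_{x_i}} &= \int_{\mathcal{X}} k(x, \cdot)\, \mathrm{d}\delta_{x_i}(x) = k(x_i, \cdot), \\
\mathbb{E}_{\delta_{x_i}}[f] &= \langle k(x_i, \cdot), f \rangle_{\mathcal{H}} = f(x_i), \\
f &= \sum_{i=1}^{m} \alpha_i\, \mu_{\delta_{x_i}} = \sum_{i=1}^{m} \alpha_i\, k(x_i, \cdot).
\end{align*}
```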

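Note that, on the one hand, the minimization problem (3) is different from minimizing the functional $\mathbb{E}_{\mathbb{P}_1} \cdots \mathbb{E}_{\mathbb{P}_m}\, \ell(x_1, y_1, f(x_1), \dots, x_m, y_m, f(x_m)) + \Omega(\|f\|_{\mathcal{H}})$ for the special case of the additive loss $\ell$. Therefore, the solution of our regularization problem is different from what one would get in the limit by training on infinitely many points sampled from $\mathbb{P}_1, \dots, \mathbb{P}_m$. On the other hand, it is also different from minimizing the functional $\ell(M_1, y_1, f(M_1), \dots, M_m, y_m, f(M_m)) + \Omega(\|f\|_{\mathcal{H}})$ where $M_i = \mathbb{E}_{x \sim \mathbb{P}_i}[x]$. In a sense, our framework is something in between.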

## 3 Kernels on probability distributions

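As the map (1) is linear in $\mathbb{P}$, optimizing the functional (3) amounts to finding a function in $\mathcal{H}$ that approximates well functions from $\mathscr{P}$ to $\mathbb{R}$ in the function class $\mathcal{F} \triangleq \{\mathbb{P} \mapsto \int_{\mathcal{X}} g\, \mathrm{d}\mathbb{P} \mid \mathbb{P} \in \mathscr{P},\ g \in C(\mathcal{X})\}$, where $C(\mathcal{X})$ is a class of bounded continuous functions on $\mathcal{X}$. Since $\delta_x \in \mathscr{P}$ for any $x \in \mathcal{X}$, it follows that $C(\mathcal{X}) \subset \mathcal{F} \subset C(\mathscr{P})$, where $C(\mathscr{P})$ is a class of bounded continuous functions on $\mathscr{P}$ endowed with the topology of weak convergence and the associated Borel $\sigma$-algebra. The following lemma states the relation between the RKHS $\mathcal{H}$ induced by the kernel $k$ and the function class $\mathcal{F}$.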

###### Lemma 2.

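Assuming that $\mathcal{X}$ is compact, the RKHS $\mathcal{H}$ induced by a kernel $k$ is dense in $\mathcal{F}$ if $k$ is universal, i.e., for every function $F \in \mathcal{F}$ and every $\varepsilon > 0$ there exists a function $g \in \mathcal{H}$ with $\sup_{\mathbb{P} \in \mathscr{P}} \left|F(\mathbb{P}) - \int g\, \mathrm{d}\mathbb{P}\right| \le \varepsilon$.

###### Proof.

Assume that $k$ is universal. Then, for every function $f \in C(\mathcal{X})$ and every $\varepsilon > 0$ there exists a function $g \in \mathcal{H}$ induced by $k$ with $\sup_{x \in \mathcal{X}} |f(x) - g(x)| \le \varepsilon$ [13]. Hence, by linearity of $\mathcal{F}$, for every $F \in \mathcal{F}$ and every $\varepsilon > 0$ there exists a function $h \in \mathcal{H}$ such that $\sup_{\mathbb{P} \in \mathscr{P}} \left|F(\mathbb{P}) - \int h\, \mathrm{d}\mathbb{P}\right| \le \varepsilon$. ∎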

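Nonlinear kernels on $\mathscr{P}$ can be defined in an analogous way to nonlinear kernels on $\mathcal{X}$, by treating the mean embedding $\mu_{\mathbb{P}}$ of $\mathbb{P} \in \mathscr{P}$ as its feature representation. First, assume that the map (1) is injective and let $\langle \cdot, \cdot \rangle_{\mathscr{P}}$ be an inner product on $\mathscr{P}$. By linearity, we have $\langle \mathbb{P}, \mathbb{Q} \rangle_{\mathscr{P}} = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$ (cf. [8] for more details). Then, the nonlinear kernels on $\mathscr{P}$ can be defined as $K(\mathbb{P}, \mathbb{Q}) = \kappa(\mu_{\mathbb{P}}, \mu_{\mathbb{Q}}) = \langle \psi(\mu_{\mathbb{P}}), \psi(\mu_{\mathbb{Q}}) \rangle_{\mathcal{H}_\kappa}$, where $\kappa$ is a p.d. kernel. As a result, many standard nonlinear kernels on $\mathcal{X}$ can be used to define nonlinear kernels on $\mathscr{P}$ as long as the kernel evaluation depends entirely on the inner product $\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$, e.g., $K(\mathbb{P}, \mathbb{Q}) = (\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}} + c)^d$. Although requiring more computational effort, their practical use is simple and flexible. Specifically, the notion of p.d. kernels on distributions proposed in this work is so generic that standard kernel functions can be reused to derive kernels on distributions that are different from many other kernel functions proposed specifically for certain distributions.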

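It has been recently proved that the Gaussian RBF kernel given by $K(\mathbb{P}, \mathbb{Q}) = \exp\left(-\frac{\gamma}{2} \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}}^2\right)$, $\forall\, \mathbb{P}, \mathbb{Q} \in \mathscr{P}$, is universal w.r.t. $C(\mathscr{P})$ given that $\mathcal{X}$ is compact and the map $\mu$ is injective [14]. Despite its success in real-world applications, the theory of kernel-based classifiers beyond the input space $\mathcal{X} \subset \mathbb{R}^d$, as also mentioned by [14], is still incomplete. It is therefore of theoretical interest to consider more general classes of universal kernels on probability distributions.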
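The construction of nonlinear kernels on distributions can be sketched in a few lines, since such kernels only need the inner products of mean embeddings. The snippet below is a minimal illustration, assuming the linear kernel values $K(\mathbb{P}, \mathbb{P})$, $K(\mathbb{P}, \mathbb{Q})$, and $K(\mathbb{Q}, \mathbb{Q})$ have already been computed and assuming the Gaussian RBF form $\exp(-\frac{\gamma}{2}\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|^2_{\mathcal{H}})$; the function names `poly_level2` and `rbf_level2` are hypothetical.

```python
import math


def poly_level2(k_pq, c=1.0, degree=2):
    """Polynomial kernel on distributions: (<mu_P, mu_Q> + c) ** degree."""
    return (k_pq + c) ** degree


def rbf_level2(k_pp, k_pq, k_qq, gamma=1.0):
    """Gaussian RBF kernel on distributions (assumed form, see lead-in).

    Uses ||mu_P - mu_Q||^2 = K(P, P) - 2 K(P, Q) + K(Q, Q), so only
    linear kernel values between the two distributions are required.
    """
    # Clamp tiny negative values that can arise from rounding.
    sq_dist = max(k_pp - 2.0 * k_pq + k_qq, 0.0)
    return math.exp(-0.5 * gamma * sq_dist)
```

Because both functions take linear kernel values as input, they can be applied to closed-form or empirically estimated kernels alike.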

### 3.1 Support measure machines

This subsection extends SVMs to deal with probability distributions, leading to support measure machines (SMMs). In its general form, an SMM amounts to solving an SVM problem with the expected kernel $K(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{x \sim \mathbb{P}, z \sim \mathbb{Q}}[k(x, z)]$. This kernel can be computed in closed form for certain classes of distributions $\mathbb{P}, \mathbb{Q}$ and kernels $k$. Examples are given in Table 1.

Alternatively, one can approximate the kernel by the empirical estimate:

$$K_{\mathrm{emp}}(\hat{\mathbb{P}}_n, \hat{\mathbb{Q}}_m) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, z_j) \tag{4}$$

where $\hat{\mathbb{P}}_n$ and $\hat{\mathbb{Q}}_m$ are empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$ given random samples $\{x_i\}_{i=1}^n$ and $\{z_j\}_{j=1}^m$, respectively. A finite sample of size $m$ from a distribution $\mathbb{P}$ suffices (with high probability) to compute an approximation within an error of $O(m^{-1/2})$. Alternatively, if the sample set is sufficiently large, one may choose to approximate the true distribution by simpler probabilistic models, e.g., a mixture of Gaussians model, and choose a kernel $k$ whose expected value admits an analytic form. Storing only the parameters of probabilistic models may save some space compared to storing all data points.
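The empirical estimate (4) translates directly into code. The following is a minimal sketch, assuming each distribution is given as a list of sample points (tuples of floats) and assuming a Gaussian RBF embedding kernel of the form $\exp(-\gamma\|x - z\|^2)$; the names `gaussian_rbf`, `empirical_kernel`, and `gram_matrix` are hypothetical.

```python
import math


def gaussian_rbf(x, z, gamma=0.5):
    """Embedding kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def empirical_kernel(xs, zs, k=gaussian_rbf):
    """Empirical estimate (4): average of k over all pairs (x_i, z_j)."""
    total = sum(k(x, z) for x in xs for z in zs)
    return total / (len(xs) * len(zs))


def gram_matrix(sample_sets, k=gaussian_rbf):
    """Symmetric Gram matrix of empirical kernels between all sample sets."""
    m = len(sample_sets)
    gram = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            value = empirical_kernel(sample_sets[i], sample_sets[j], k)
            gram[i][j] = gram[j][i] = value
    return gram
```

The resulting Gram matrix could then be handed to any SVM solver that accepts a precomputed kernel. Each entry costs $n \cdot m$ kernel evaluations, which is the extra computational effort relative to a sample-based kernel.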

Note that the standard SVM feature map $\phi(x)$ is usually nonlinear in $x$, whereas $\mu_{\mathbb{P}}$ is linear in $\mathbb{P}$. Thus, for an SMM, the first level kernel $k$ is used to obtain a vectorial representation of the measures, and the second level kernel $K$ allows for a nonlinear algorithm on distributions. For clarity, we will refer to $k$ and $K$ as the embedding kernel and the level-2 kernel, respectively.

## 4 Theoretical analyses

This section presents key theoretical aspects of the proposed framework, which reveal important connection between kernel-based learning algorithms on the space of distributions and on the input space on which they are defined.

### 4.1 Risk deviation bound

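Given a training sample $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$ drawn i.i.d. from some unknown probability distribution $\mathbf{P}$ on $\mathscr{P} \times \mathcal{Y}$, a loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, and a function class $\Lambda$, the goal of statistical learning is to find the function $f \in \Lambda$ that minimizes the expected risk functional $R(f) = \int_{\mathscr{P} \times \mathcal{Y}} \int_{\mathcal{X}} \ell(y, f(x))\, \mathrm{d}\mathbb{P}(x)\, \mathrm{d}\mathbf{P}(\mathbb{P}, y)$. Since $\mathbf{P}$ is unknown, the empirical risk $R_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^m \int_{\mathcal{X}} \ell(y_i, f(x))\, \mathrm{d}\mathbb{P}_i(x)$ based on the training sample is considered instead. Furthermore, the risk functional can be simplified further by considering $\frac{1}{m \cdot n} \sum_{i=1}^m \sum_{x_{ij} \sim \mathbb{P}_i} \ell(y_i, f(x_{ij}))$ based on $n$ samples $x_{ij}$ drawn from each $\mathbb{P}_i$.

Our framework, on the other hand, alleviates the problem by minimizing the risk functional $R^{\mu}(f) = \int_{\mathscr{P} \times \mathcal{Y}} \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])\, \mathrm{d}\mathbf{P}(\mathbb{P}, y)$ for $f \in \mathcal{H}$ with corresponding empirical risk functional $R^{\mu}_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^m \ell(y_i, \mathbb{E}_{\mathbb{P}_i}[f(x)])$ (cf. the discussion at the end of Section 2). It is often easier to optimize $R^{\mu}_{\mathrm{emp}}(f)$ as the expectation can be computed exactly for certain choices of $\mathbb{P}_i$ and $\mathcal{H}$. Moreover, for universal $\mathcal{H}$, this simplification preserves all information of the distributions. Nevertheless, there is still a loss of information due to the loss function $\ell$.

Due to the i.i.d. assumption, the analysis of the difference between $R$ and $R^{\mu}$ can be simplified w.l.o.g. to the analysis of the difference between $\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))]$ and $\ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])$ for a particular distribution $\mathbb{P} \in \mathscr{P}$. The theorem below provides a bound on the difference between $\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))]$ and $\ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])$.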

###### Theorem 3.

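Given an arbitrary probability distribution $\mathbb{P}$ with variance $\sigma^2$, a Lipschitz continuous function $f: \mathbb{R}^d \to \mathbb{R}$ with constant $C_f$, an arbitrary loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that is Lipschitz continuous in the second argument with constant $C_\ell$, it follows that $\left|\mathbb{E}_{x \sim \mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{x \sim \mathbb{P}}[f(x)])\right| \le 2 C_\ell C_f \sigma$ for any $y \in \mathbb{R}$.

Theorem 3 indicates that if the random variable $x$ is concentrated around its mean and the functions $f$ and $\ell$ are well-behaved, i.e., Lipschitz continuous, then the loss deviation will be small. As a result, if this holds for any distribution $\mathbb{P}_i$ in the training set $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$, the true risk deviation is also expected to be small.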
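The bound of Theorem 3 can be checked numerically on a toy case. The sketch below is illustrative only and rests on several assumptions that are not in the original: a one-dimensional Gaussian $\mathbb{P}$, the function $f(x) = 2\sin(x)$ (Lipschitz with $C_f = 2$), the hinge loss with $y = 1$ (Lipschitz in its second argument with $C_\ell = 1$), and Monte Carlo estimates of both expectations. All names are hypothetical.

```python
import math
import random


def scaled_sine(x):
    """f(x) = 2 sin(x), Lipschitz continuous with constant C_f = 2."""
    return 2.0 * math.sin(x)


def hinge(y, t):
    """Hinge loss; for |y| = 1 it is 1-Lipschitz in the second argument."""
    return max(0.0, 1.0 - y * t)


def loss_deviation(f, loss, y, mean, sigma, n=20000, seed=0):
    """Monte Carlo estimate of |E[loss(y, f(x))] - loss(y, E[f(x)])|."""
    rng = random.Random(seed)
    fx = [f(rng.gauss(mean, sigma)) for _ in range(n)]
    expected_loss = sum(loss(y, v) for v in fx) / n
    loss_of_expectation = loss(y, sum(fx) / n)
    return abs(expected_loss - loss_of_expectation)


sigma = 0.3
deviation = loss_deviation(scaled_sine, hinge, y=1.0, mean=0.5, sigma=sigma)
bound = 2.0 * 1.0 * 2.0 * sigma  # 2 * C_loss * C_f * sigma
print(deviation <= bound)
```

In this setting the deviation is strictly positive because the hinge loss is convex and nonlinear over the range of $f(x)$, yet it stays well below the bound.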

### 4.2 Flexible support vector machines

It turns out that, for certain choices of distributions $\mathbb{P}$, the linear SMM trained using $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$ is equivalent to an SVM trained using some samples $\{(x_i, y_i)\}_{i=1}^m$ with an appropriate choice of kernel function.

###### Lemma 4.

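Let $k(x, z)$ be a bounded p.d. kernel on a measure space such that $\iint |k(x, z)|\, \mathrm{d}x\, \mathrm{d}z < \infty$, and $g(x, \tilde{x})$ be a square integrable function such that $\int g(x, \tilde{x})\, \mathrm{d}\tilde{x} < \infty$ for all $x$. Given a sample $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$ where each $\mathbb{P}_i$ is assumed to have a density given by $g(x_i, x)$, the linear SMM is equivalent to the SVM on the training sample $\{(x_i, y_i)\}_{i=1}^m$ with kernel $K_g(x, z) = \iint k(\tilde{x}, \tilde{z})\, g(x, \tilde{x})\, g(z, \tilde{z})\, \mathrm{d}\tilde{x}\, \mathrm{d}\tilde{z}$.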

Note that the important assumption for this equivalence is that the distributions differ only in their location in the parameter space. This need not be the case in all possible applications of SMMs.

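Furthermore, we have $K_g(x, z) = \left\langle \int k(\tilde{x}, \cdot)\, g(x, \tilde{x})\, \mathrm{d}\tilde{x},\ \int k(\tilde{z}, \cdot)\, g(z, \tilde{z})\, \mathrm{d}\tilde{z} \right\rangle_{\mathcal{H}}$. Thus, it is clear that the feature map of $x$ depends not only on the kernel $k$, but also on the density $g(x, \tilde{x})$. Consequently, by virtue of Lemma 4, the kernel $K_g$ allows the SVM to place different kernels at each data point. We call this algorithm a flexible SVM (Flex-SVM).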

Consider for example the linear SMM with Gaussian distributions $\mathcal{N}(m_1, \sigma_1^2 \cdot I), \dots, \mathcal{N}(m_m, \sigma_m^2 \cdot I)$ and Gaussian RBF kernel $k_{\sigma^2}$ with bandwidth parameter $\sigma$. The convolution theorem of Gaussian distributions implies that this SMM is equivalent to a flexible SVM that places a data-dependent kernel on training example $x_i$, i.e., a Gaussian RBF kernel with larger bandwidth.
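The Gaussian case can be made concrete with a closed-form expected kernel. The sketch below assumes the parameterization $k_{\sigma^2}(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$ and isotropic Gaussians $\mathcal{N}(m_1, s_1^2 I)$ and $\mathcal{N}(m_2, s_2^2 I)$; under these assumptions the expected kernel is again a (rescaled) Gaussian RBF kernel between the means with the enlarged squared bandwidth $\sigma^2 + s_1^2 + s_2^2$. The function names are hypothetical, and the Monte Carlo routine is included only as a sanity check.

```python
import math
import random


def expected_rbf_gaussians(m1, s1, m2, s2, sigma):
    """Closed-form E[k(x, z)] for x ~ N(m1, s1^2 I), z ~ N(m2, s2^2 I).

    Assumes k(x, z) = exp(-||x - z||^2 / (2 sigma^2)). The result is an RBF
    kernel between the means with squared bandwidth sigma^2 + s1^2 + s2^2,
    multiplied by a normalizing factor.
    """
    d = len(m1)
    total_var = sigma ** 2 + s1 ** 2 + s2 ** 2
    sq_dist = sum((a - b) ** 2 for a, b in zip(m1, m2))
    scale = (sigma ** 2 / total_var) ** (d / 2.0)
    return scale * math.exp(-sq_dist / (2.0 * total_var))


def monte_carlo_rbf_gaussians(m1, s1, m2, s2, sigma, n=20000, seed=0):
    """Monte Carlo estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        sq = sum((rng.gauss(a, s1) - rng.gauss(b, s2)) ** 2
                 for a, b in zip(m1, m2))
        acc += math.exp(-sq / (2.0 * sigma ** 2))
    return acc / n
```

Setting `s1 = s2 = 0` recovers the plain Gaussian RBF kernel between the two means, which mirrors the statement that an SVM on the input space is a special case of the SMM.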

## 5 Related works

The kernel $K(\mathbb{P}, \mathbb{Q}) = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$ is in fact a special case of the Hilbertian metric [5], with the associated kernel $K(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{\mathbb{P}, \mathbb{Q}}[k(x, \tilde{x})]$, and a generative mean map kernel (GMMK) proposed by [15]. In the GMMK, the kernel between two objects $x$ and $y$ is defined via $\hat{p}_x$ and $\hat{p}_y$, which are estimated probabilistic models of $x$ and $y$, respectively. That is, a probabilistic model is learned for each example and used as a surrogate to construct the kernel between those examples. The idea of surrogate kernels has also been adopted by the Probability Product Kernel (PPK) [2]. In this case, we have $K_\rho(p, p') = \int_{\mathcal{X}} p(x)^\rho\, p'(x)^\rho\, \mathrm{d}x$, which has been shown to be a special case of GMMK when $\rho = 1$ [15]. Consequently, GMMK, PPK with $\rho = 1$, and our linear kernels are equivalent when the embedding kernel is $k(x, x') = \delta(x - x')$. More recently, the empirical kernel (4) was employed in an unsupervised way for multi-task learning to generalize to a previously unseen task [16]. In contrast, we treat the probability distributions in a supervised way (cf. the regularized functional (3)) and the kernel is not restricted to only the empirical kernel.

The use of expected kernels in dealing with the uncertainty in the input data has a connection to robust SVMs. For instance, a generalized form of the SVM in [17] incorporates the probabilistic uncertainty into the maximization of the margin. This results in a second-order cone programming (SOCP) that generalizes the standard SVM. In SOCP, one needs to specify the parameter that reflects the probability of correctly classifying the th training example. The parameter is therefore closely related to the parameter , which specifies the variance of the distribution centered at the th example. [18] showed the equivalence between SVMs using expected kernels and SOCP when . When , the mean and covariance of missing kernel entries have to be estimated explicitly, making the SOCP more involved for nonlinear kernels. Although achieving comparable performance to the standard SVM with expected kernels, the SOCP requires a more computationally extensive SOCP solver, as opposed to simple quadratic programming (QP).

## 6 Experimental results

In the experiments, we primarily consider three different learning algorithms: i) SVM is considered as a baseline algorithm. ii) Augmented SVM (ASVM) is an SVM trained on augmented samples drawn according to the distributions $\{\mathbb{P}_i\}_{i=1}^m$. The same number of examples is drawn from each distribution. iii) SMM is a distribution-based method that can be applied directly to the distributions. We used the LIBSVM implementation.

### 6.1 Synthetic data

Firstly, we conducted a basic experiment that illustrates a fundamental difference between SVM, ASVM, and SMM. A binary classification problem of 7 Gaussian distributions with different means and covariances was considered. We trained the SVM using only the means of the distributions, ASVM with 30 virtual examples generated from each distribution, and SMM using distributions as training examples. A Gaussian RBF kernel with was used for all algorithms.

Figure 0(a) shows the resulting decision boundaries. Having been trained only on means of the distributions, the SVM classifier tends to overemphasize the regions with high densities and underrepresent the lower density regions. In contrast, the ASVM is more expensive and sensitive to outliers, especially when learning on heavy-tailed distributions. The SMM treats each distribution as a training example and implicitly incorporates properties of the distributions, i.e., means and covariances, into the classifier. Note that the SVM can be trained to achieve a similar result to the SMM by choosing an appropriate value for the kernel parameter (cf. Lemma 4). Nevertheless, this becomes more difficult if the training distributions are, for example, nonisotropic and have different covariance matrices.

Secondly, we evaluate the performance of the SMM for different combinations of embedding and level-2 kernels. Two classes of synthetic Gaussian distributions on were generated. The mean parameters of the positive and negative distributions are normally distributed with means and and identical covariance matrix , respectively. The covariance matrix for each distribution is generated according to two Wishart distributions with covariance matrices given by and with degrees of freedom. The training set consists of 500 distributions from the positive class and 500 distributions from the negative class. The test set consists of 200 distributions with the same class proportion.

The kernels used in the experiment include linear kernel (LIN), polynomial kernel of degree 2 (POLY2), polynomial kernel of degree 3 (POLY3), unnormalized Gaussian RBF kernel (RBF), and normalized Gaussian RBF kernel (NRBF). To fix parameter values of both kernel functions and SMM, 10-fold cross-validation (10-CV) is performed on a parameter grid, for SMM, bandwidth parameter for Gaussian RBF kernels, and degree parameter for polynomial kernels. The average accuracy and standard deviation for all kernel combinations over 30 repetitions are reported in Table 2. Moreover, we also investigate the sensitivity of kernel parameters for two kernel combinations: RBF-RBF and POLY-RBF. In this case, we consider the bandwidth parameter for Gaussian RBF kernels and degree parameter for polynomial kernels. Figure 0(b) depicts the accuracy values and average accuracies for considered kernel functions.

Table 2 indicates that both embedding and level-2 kernels are important for the performance of the classifier. The embedding kernels tend to have more impact on the predictive performance compared to the level-2 kernels. This conclusion also coincides with the results depicted in Figure 0(b).

### 6.2 Handwritten digit recognition

In this section, the proposed framework is applied to distributions over equivalence classes of images that are invariant to basic transformations, namely, scaling, translation, and rotation. We consider the handwritten digits obtained from the USPS dataset. For each image, the distribution over the equivalence class of the transformations is determined by a prior on parameters associated with such transformations. Scaling and translation are parametrized by the scale factors and displacements along the $x$ and $y$ axes, respectively. The rotation is parametrized by an angle . We adopt Gaussian distributions as prior distributions, including , , and . For each image, the virtual examples are obtained by sampling parameter values from the distribution and applying the transformation accordingly.

Experiments are categorized into simple and difficult binary classification tasks. The former consists of classifying digit 1 against digit 8 and digit 3 against digit 4. The latter considers classifying digit 3 against digit 8 and digit 6 against digit 9. The initial dataset for each task is constructed by randomly selecting 100 examples from each class. Then, for each example in the initial dataset, we generate 10, 20, and 30 virtual examples using the aforementioned transformations to construct virtual data sets consisting of 2,000, 4,000, and 6,000 examples, respectively. One third of examples in the initial dataset are used as a test set. The original examples are excluded from the virtual datasets. The virtual examples are normalized such that their feature values are in . Then, to reduce computational cost, principal component analysis (PCA) is performed to reduce the dimensionality to 16. We compare the SVM on the initial dataset, the ASVM on the virtual datasets, and the SMM. For SVM and ASVM, the Gaussian RBF kernel is used. For SMM, we employ the empirical kernel (4) with Gaussian RBF kernel as a base kernel. The parameters of the algorithms are fixed by 10-CV over parameters and .

The results depicted in Figure 4 clearly demonstrate the benefits of learning directly from the equivalence classes of digits under basic transformations. (While the reported results were obtained using virtual examples with Gaussian parameter distributions (Sec. 6.2), we got similar results using uniform distributions.) In most cases, the SMM outperforms both the SVM and the ASVM as the number of virtual examples increases. Moreover, Figure 4 shows the benefit of the SMM over the ASVM in terms of computational cost. (The evaluation was made on a 64-bit desktop computer with Intel Core 2 Duo CPU E8400 at 3.00GHz×2 and 4GB of memory.)

### 6.3 Natural scene categorization

This section illustrates the benefits of nonlinear kernels between distributions for learning natural scene categories in which the bag-of-words (BoW) representation is used to represent images in the dataset. Each image is represented as a collection of local patches, each being a codeword from a large vocabulary of codewords called a codebook. Standard BoW representations encode each image as a histogram that enumerates the occurrence probability of local patches detected in the image w.r.t. those in the codebook. On the other hand, our setting represents each image as a distribution over these codewords. Thus, images of different scenes tend to generate distinct sets of patches. Based on this representation, both the histogram and the local patches can be used in our framework.

We use the dataset presented in [19]. According to their results, most errors occur among the four indoor categories (830 images), namely, bedroom (174 images), living room (289 images), kitchen (151 images), and office (216 images). Therefore, we will focus on these four categories. For each category, we split the dataset randomly into two separate sets of images, 100 for training and the rest for testing.

A codebook is formed from the training images of all categories. Firstly, interesting keypoints in the image are randomly detected. Local patches are then generated accordingly. After patch detection, each patch is transformed into a 128-dim SIFT vector [20]. Given the collection of detected patches, K-means clustering is performed over all local patches. Codewords are then defined as the centers of the learned clusters. Then, each patch in an image is mapped to a codeword and the image can be represented by the histogram of the codewords. In addition, we also have an $M \times 128$ matrix of SIFT vectors, where $M$ is the number of codewords.

We compare the performance of a Probabilistic Latent Semantic Analysis (pLSA) with the standard BoW representation, SVM, linear SMM (LSMM), and nonlinear SMM (NLSMM). For SMM, we use the empirical embedding kernel with Gaussian RBF base kernel $k$: $K(h_i, h_j) = \sum_{r=1}^{M} \sum_{s=1}^{M} h_i(c_r)\, h_j(c_s)\, k(c_r, c_s)$, where $h_i$ is the histogram of the $i$th image and $c_r$ is the $r$th SIFT vector. A Gaussian RBF kernel is also used as the level-2 kernel for nonlinear SMM. For the SVM, we adopt a Gaussian RBF kernel with $\chi^2$-distance between the histograms [21], i.e., where . The parameters of the algorithms are fixed by 10-CV over parameters and . For NLSMM, we use the best of LSMM in the base kernel and perform 10-CV to choose parameter only for the level-2 kernel. To deal with multiple categories, we adopt the pairwise approach and voting scheme to categorize test images. The results in Figure 4 illustrate the benefit of the distribution-based framework. Understanding the context of a complex scene is challenging. Employing distribution-based methods provides an elegant way of utilizing higher-order statistics in natural images that could not be captured by traditional sample-based methods.

## 7 Conclusions

This paper proposes a method for kernel-based discriminative learning on probability distributions. The trick is to embed distributions into an RKHS, resulting in a simple and efficient learning algorithm on distributions. A family of linear and nonlinear kernels on distributions allows one to flexibly choose the kernel function that is suitable for the problems at hand. Our analyses provide insights into the relations between distribution-based methods and traditional sample-based methods, particularly the flexible SVM that allows the SVM to place different kernels on each training example. The experimental results illustrate the benefits of learning from a pool of distributions, compared to a pool of examples, both on synthetic and real-world data.

#### Acknowledgments

KM would like to thank Zoubin Ghahramani, Arthur Gretton, Christian Walder, and Philipp Hennig for a fruitful discussion. We also thank all three insightful reviewers for their invaluable comments.

## References

• [1] Y. H. Yang and T. Speed. Design issues for cDNA microarray experiments. Nat. Rev. Genet., 3(8):579–588, 2002.
• [2] T. Jebara, R. Kondor, A. Howard, K. Bennett, and N. Cesa-bianchi. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.
• [3] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math Soc., 1943.
• [4] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Proceedings of Advances in Neural Information Processing Systems. MIT Press, 2004.
• [5] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability. In Proceedings of The 12th International Conference on Artificial Intelligence and Statistics, pages 136–143, 2005.
• [6] M. Cuturi, K. Fukumizu, and J-P. Vert. Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198, 2005.
• [7] André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. Nonextensive information theoretic kernels on measures. Journal of Machine Learning Research, 10:935–975, 2009.
• [8] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, 2004.
• [9] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, 2007.
• [10] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
• [11] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT ’01/EuroCOLT ’01, pages 416–426. Springer-Verlag, 2001.
• [12] F. Dinuzzo and B. Schölkopf. The representer theorem for Hilbert spaces: a necessary and sufficient condition. In Advances in Neural Information Processing Systems 25, pages 189–196. 2012.
• [13] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
• [14] A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Proceedings of Advances in Neural Information Processing Systems, pages 406–414. 2010.
• [15] N. A. Mehta and A. G. Gray. Generative and latent mean map kernels. CoRR, abs/1005.0188, 2010.
• [16] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems 24, pages 2178–2186. 2011.
• [17] P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.
• [18] H.S. Anderson and M.R. Gupta. Expected kernel for missing features in support vector machines. In Statistical Signal Processing Workshop, pages 285–288, 2011.
• [19] L. Fei-Fei. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 524–531, 2005.
• [20] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, pages 1150–1157, Washington, DC, USA, 1999.
• [21] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In Proceedings of the International Conference on Computer Vision, pages 606–613, 2009.

## Appendix A Proof of Theorem 1

###### Theorem 1.

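Given training examples $(\mathbb{P}_i, y_i) \in \mathscr{P} \times \mathbb{R}$, $i = 1, \dots, m$, a strictly monotonically increasing function $\Omega: [0, +\infty) \to \mathbb{R}$, and a loss function $\ell: (\mathscr{P} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{+\infty\}$, any $f \in \mathcal{H}$ minimizing the regularized risk functional

$$\ell\left(\mathbb{P}_1, y_1, \mathbb{E}_{\mathbb{P}_1}[f], \dots, \mathbb{P}_m, y_m, \mathbb{E}_{\mathbb{P}_m}[f]\right) + \Omega\left(\|f\|_{\mathcal{H}}\right) \tag{5}$$

admits a representation of the form $f = \sum_{i=1}^m \alpha_i \mu_{\mathbb{P}_i}$ for some $\alpha_i \in \mathbb{R}$, $i = 1, \dots, m$.

###### Proof.

By virtue of Proposition 2 in [10], the linear functionals $\mathbb{E}_{\mathbb{P}}[\cdot]$ are bounded for all $\mathbb{P} \in \mathscr{P}$. Then, given $\mathbb{P}_1, \dots, \mathbb{P}_m$, any $f \in \mathcal{H}$ can be decomposed as

$$f = f_\mu + f_\perp$$

where $f_\mu$ lives in the span of the $\mu_{\mathbb{P}_i}$, i.e., $f_\mu = \sum_{i=1}^m \alpha_i \mu_{\mathbb{P}_i}$, and $f_\perp \in \mathcal{H}$ satisfies, for all $j$, $\langle f_\perp, \mu_{\mathbb{P}_j} \rangle = 0$. Hence, for all $j$, we have

$$\mathbb{E}_{\mathbb{P}_j}[f] = \mathbb{E}_{\mathbb{P}_j}[f_\mu + f_\perp] = \langle f_\mu + f_\perp, \mu_{\mathbb{P}_j} \rangle = \langle f_\mu, \mu_{\mathbb{P}_j} \rangle + \langle f_\perp, \mu_{\mathbb{P}_j} \rangle = \langle f_\mu, \mu_{\mathbb{P}_j} \rangle$$

which is independent of $f_\perp$. As a result, the loss functional $\ell$ in (5) does not depend on $f_\perp$. For the regularization functional $\Omega$, since $f_\perp$ is orthogonal to $\sum_{i=1}^m \alpha_i \mu_{\mathbb{P}_i}$ and $\Omega$ is strictly monotonically increasing, we have

$$\Omega(\|f\|) = \Omega(\|f_\mu + f_\perp\|) = \Omega\left(\sqrt{\|f_\mu\|^2 + \|f_\perp\|^2}\right) \ge \Omega(\|f_\mu\|)$$

with equality if and only if $f_\perp = 0$ and thus $f = f_\mu$. Consequently, any minimizer must take the form $f = \sum_{i=1}^m \alpha_i \mu_{\mathbb{P}_i}$. ∎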

## Appendix B Proof of Theorem 3

###### Theorem 3.

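Given an arbitrary probability distribution $\mathbb{P}$ with variance $\sigma^2$, a Lipschitz continuous function $f: \mathbb{R}^d \to \mathbb{R}$ with constant $C_f$, an arbitrary loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that is Lipschitz continuous in the second argument with constant $C_\ell$, it follows that

$$\left|\mathbb{E}_{x \sim \mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{x \sim \mathbb{P}}[f(x)])\right| \le 2 C_\ell C_f \sigma$$

for any $y \in \mathbb{R}$.

###### Proof.

Assume that $x$ is distributed according to $\mathbb{P}$. Let $m_X$ be the mean of $x$ in $\mathbb{R}^d$. Thus, we have

$$\begin{aligned}
\left|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])\right|
&\le \int \left|\ell(y, f(\tilde{x})) - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])\right| \mathrm{d}\mathbb{P}(\tilde{x}) \\
&\le C_\ell \int \left|f(\tilde{x}) - \mathbb{E}_{\mathbb{P}}[f(x)]\right| \mathrm{d}\mathbb{P}(\tilde{x}) \\
&\le \underbrace{C_\ell \int \left|f(\tilde{x}) - f(m_X)\right| \mathrm{d}\mathbb{P}(\tilde{x})}_{A} + \underbrace{C_\ell \left|f(m_X) - \mathbb{E}_{\mathbb{P}}[f(x)]\right|}_{B}.
\end{aligned}$$

##### Control of (A)

The first term is upper bounded by

$$C_\ell \int C_f \|\tilde{x} - m_X\|\, \mathrm{d}\mathbb{P}(\tilde{x}) \le C_\ell C_f \sigma, \tag{6}$$

where the last inequality is given by $\mathbb{E}\|\tilde{x} - m_X\| \le \sqrt{\mathbb{E}\|\tilde{x} - m_X\|^2} = \sigma$.

##### Control of (B)

Similarly, the second term is upper bounded by

$$C_\ell \left|\int \left(f(m_X) - f(\tilde{x})\right) \mathrm{d}\mathbb{P}(\tilde{x})\right| \le C_\ell \int C_f \|m_X - \tilde{x}\|\, \mathrm{d}\mathbb{P}(\tilde{x}) \le C_\ell C_f \sigma. \tag{7}$$

Combining (6) and (7) yields

$$\left|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])\right| \le 2 C_\ell C_f \sigma,$$

thus completing the proof. ∎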

## Appendix C Proof of Lemma 4

###### Lemma 4.

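Let $k(x, z)$ be a bounded p.d. kernel on a measure space such that $\iint |k(x, z)|\, \mathrm{d}x\, \mathrm{d}z < \infty$, and $g(x, \tilde{x})$ be a square integrable function such that $\int g(x, \tilde{x})\, \mathrm{d}\tilde{x} < \infty$ for all $x$. Given a sample $\{(\mathbb{P}_i, y_i)\}_{i=1}^m$ where each $\mathbb{P}_i$ is assumed to have a density given by $g(x_i, x)$, the linear SMM is equivalent to the SVM on the training sample $\{(x_i, y_i)\}_{i=1}^m$ with kernel $K_g(x, z) = \iint k(\tilde{x}, \tilde{z})\, g(x, \tilde{x})\, g(z, \tilde{z})\, \mathrm{d}\tilde{x}\, \mathrm{d}\tilde{z}$.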

###### Proof.

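For a training sample $\{(x_i, y_i)\}_{i=1}^m$, the SVM with kernel $K_g$ minimizes

$$\ell\left(\{x_i, y_i, f(x_i) + b\}_{i=1}^m\right) + \lambda \|f\|^2_{\mathcal{H}_{K_g}}.$$

By the representer theorem, $f(x) = \sum_{j=1}^m \alpha_j K_g(x, x_j)$ with some $\alpha_j \in \mathbb{R}$, hence this is equivalent to

$$\ell\left(\Big\{x_i, y_i, \sum_{j=1}^m \alpha_j K_g(x_i, x_j) + b\Big\}_{i=1}^m\right) + \lambda \sum_{i,j=1}^m \alpha_i \alpha_j K_g(x_i, x_j).$$

Next, consider the kernel mean of the probability measure $g(x_i, x)\, \mathrm{d}x$ given by $\mu_i = \int k(\cdot, \tilde{x})\, g(x_i, \tilde{x})\, \mathrm{d}\tilde{x}$ and note that $\langle \mu_i, \mu_j \rangle_{\mathcal{H}_k} = K_g(x_i, x_j)$ for any $i, j$. The linear SMM with loss $\ell$ and kernel $k$ minimizes

$$\ell\left(\{\mathbb{P}_i, y_i, \langle \mu_i, f \rangle_{\mathcal{H}_k} + b\}_{i=1}^m\right) + \lambda \|f\|^2_{\mathcal{H}_k}.$$

By Theorem 1, each minimizer $f$ admits a representation of the form

$$f = \sum_{j=1}^m \alpha_j \mu_j = \sum_{j=1}^m \alpha_j \int k(\cdot, \tilde{x})\, g(x_j, \tilde{x})\, \mathrm{d}\tilde{x}.$$

Thus, for this $f$ we have

$$\langle \mu_i, f \rangle_{\mathcal{H}_k} = \sum_{j=1}^m \alpha_j \iint k(\tilde{z}, \tilde{x})\, g(x_i, \tilde{x})\, g(x_j, \tilde{z})\, \mathrm{d}\tilde{x}\, \mathrm{d}\tilde{z} = \sum_{j=1}^m \alpha_j K_g(x_i, x_j)$$

and

$$\|f\|^2_{\mathcal{H}_k} = \sum_{i,j=1}^m \alpha_i \alpha_j \langle \mu_i, \mu_j \rangle = \sum_{i,j=1}^m \alpha_i \alpha_j K_g(x_i, x_j),$$

as above. This completes the proof. ∎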