 # Probabilistic Kernel Support Vector Machines

We propose a probabilistic enhancement of standard kernel Support Vector Machines for binary classification, in order to address the case when, along with given data sets, a description of uncertainty (e.g., error bounds) may be available on each datum. In the present paper, we specifically consider Gaussian distributions to model uncertainty. Thereby, our data consist of pairs (x_i,Σ_i), i∈{1,...,N}, along with an indicator y_i∈{-1,1} to declare membership in one of two categories for each pair. These pairs may be viewed to represent the mean and covariance, respectively, of random vectors ξ_i taking values in a suitable linear space (typically R^n). Thus, our setting may also be viewed as a modification of Support Vector Machines to classify distributions, albeit, at present, only Gaussian ones. We outline the formalism that allows computing suitable classifiers via a natural modification of the standard "kernel trick." The main contribution of this work is to point out a suitable kernel function for applying Support Vector techniques to the setting of uncertain data for which a detailed uncertainty description is also available (herein, "Gaussian points").


## 1. Introduction and motivation

The Support Vector methodology is a nonlinear generalization of Linear Classification, typically utilized for binary classification. In this methodology, a training set of data becomes available where each datum belongs to one of two categories while an indicator function identifies the category for each. Imagine points on a plane labeled with two different colors. Solely based on their location and proximity to others of particular colors, one needs to classify new points and assign them to one of the two categories. When the points of the training set in the two categories can be separated by a line, then this line can be chosen to separate two respective regions, one corresponding to each of the two categories. On the other hand, when the “training” points of the two categories cannot be separated by a line, then a higher-order curve needs to be chosen. The Support Vector methodology provides a disciplined and systematic way to do exactly that.

In the case where no simple boundary can be chosen to delineate regions for the points in the two categories, and a simple classification rule is still sought based on location, then a smooth boundary can be selected to define regions that contain the vast majority of points in the respective categories. Thus, simplicity of the classification rule (Occam’s razor) may be desirable, as the domains corresponding to the two categories may not be completely separate and a misclassification error in that case must be deemed acceptable. This again can be handled by a suitable relaxation of SVM.

The purpose of this paper is to raise, and in some ways address, the issue that, in most engineering applications, data is uncertain. Measurements are often provided along with error bars that quantify expected margins. In other applications, data may be more complex still. For instance, a datum may be an empirical probability distribution. Such distributions may be approximated by a Gaussian, a mixture of Gaussians, or other options. In such cases, a “datum” is now a point in a space more complex than $\mathbb{R}^n$. We seek to formulate a new paradigm, where the Support Vector methodology can be extended.

We postulate that measurements provide vectors together with covariance matrices that quantify our uncertainty in the value we recorded. For the purposes of binary classification, we record such pairs along with the information on the category of origin of each, which is given by an indicator value $y_i \in \{-1, 1\}$. A datum with a large variance naturally delineates a larger volume that should be associated with the corresponding category, or should at least have a proportionate impact on where the separation between the two is drawn. Prior state-of-the-art does not do that. Thus, in the present note, we propose a new paradigm of treating measurement uncertainty as part of the data.

Below, in Section 2, we highlight some background on Support Vector Machines. Section 3 gives our main result that identifies a kernel function for the case of data provided in the form of “Gaussian points.”

## 2. Background

Support vector machines are well-established techniques for (binary) classification and regression analysis. The main idea is to embed a given (training) data set into a high-dimensional space $\mathcal{H}$, of dimension much higher than the native dimension of the base space $\mathcal{X}$ and possibly infinite, so that, for binary classification, the two classes can be separated with a hyperplane. Effectively, this separating hyperplane projects down to the native base space $\mathcal{X}$, where the data points originally reside, as curves/surfaces/etc. that separate the two classes. The embedding into $\mathcal{H}$ is effected by a mapping

$$\phi : x \in \mathcal{X} \mapsto \phi(x) \in \mathcal{H},$$

where $\phi(x)$ is referred to as the feature vector. The space $\mathcal{H}$ has an inner product structure (Hilbert space) and, naturally, the construction of the separating hyperplane relies on the geometry of $\mathcal{H}$. However, and most importantly, the map $\phi$ does not need to be known explicitly and does not need to be applied to the data. All needed operations, inner products and projections that show up in the classifier and computations, can be effectively carried out in the base space $\mathcal{X}$ using the so-called “kernel trick.”

Indeed, to accomplish the task and construct the classification rule (via suitable curves/surfaces/etc.) it is sufficient to know the kernel

$$k(x,y) := \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}, \tag{1}$$

with $x, y \in \mathcal{X}$. It evaluates the inner product in the feature space as a function of the corresponding representatives in the native base space. Thus, the kernel is a bivariate function which is completely characterized by the property of being positive, in the sense that for all $N \in \mathbb{N}$, $x_1, \ldots, x_N \in \mathcal{X}$, and any corresponding set of values $\alpha_1, \ldots, \alpha_N \in \mathbb{R}$ (as we are interested in real-valued kernels)

$$\sum_{i,j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) \geq 0.$$

Necessity stems from the fact that the left-hand side is the inner product

$$\Big\langle \sum_{i=1}^{N} \alpha_i \phi(x_i), \sum_{j=1}^{N} \alpha_j \phi(x_j) \Big\rangle_{\mathcal{H}}.$$

Sufficiency, i.e., the existence of a feature map $\phi$ that realizes (1), is a celebrated theorem; see also [4, Theorem 7.5.2] and [1, page 30, Theorem 3.1 (Mercer)].
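The positivity condition can be illustrated numerically. The following is a minimal sketch, assuming the Gaussian RBF kernel as the example of a positive kernel; the helper names `rbf_kernel` and `gram_matrix`, the random sample points, and the seed are illustrative choices rather than anything prescribed by the text. It forms the Gram matrix for a handful of points and evaluates the quadratic form $\sum_{i,j}\alpha_i\alpha_j k(x_i,x_j)$ together with the smallest eigenvalue.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma**2)))


def gram_matrix(points, kernel):
    """Gram matrix K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K


rng = np.random.default_rng(0)       # illustrative sample data
points = rng.normal(size=(20, 3))    # N = 20 points in R^3
alpha = rng.normal(size=20)          # arbitrary real coefficients

K = gram_matrix(points, rbf_kernel)
# sum_{i,j} alpha_i alpha_j k(x_i, x_j), which positivity requires to be >= 0
quadratic_form = float(alpha @ K @ alpha)
# Equivalent check: a positive kernel yields a Gram matrix with no negative eigenvalues
min_eigenvalue = float(np.linalg.eigvalsh(K).min())
print(quadratic_form, min_eigenvalue)
```

A nonnegative quadratic form for one choice of coefficients is only a spot check; the eigenvalue test covers all coefficient vectors for the given points, up to floating-point error.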

Classification relies on constructing a classifier that is built on a linear functional, when viewed in $\mathcal{H}$. It is of the form

$$x \mapsto \operatorname{sign}\big(\langle w, \phi(x) \rangle_{\mathcal{H}} - b\big), \tag{2}$$

and its value in $\{-1, 1\}$ aims to differentiate between elements in the two categories. The coefficients $w \in \mathcal{H}$ and $b \in \mathbb{R}$ are chosen so that

$$\{ h \in \mathcal{H} \mid \langle w, h \rangle_{\mathcal{H}} - b = 0 \}$$

is a separating hyperplane of the two subsets

$$S_{\pm} = \{ \phi(x_i) \mid y_i = \pm 1 \}$$

of the complete (training) data set. Once again, $\phi$ does not need to be evaluated at any point in the construction; existence of such a map is enough, and it is guaranteed by the positivity of the kernel.

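The construction of the classifier requires selection of the parameters $w$ and $b$. These are chosen either (“hard margin”) to minimize

$$\langle w, w \rangle_{\mathcal{H}} \quad \text{subject to} \quad y_i \big( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b \big) \geq 1$$

for all available points $(x_i, y_i)$, $i \in \{1, \ldots, N\}$, in the “training set,” or (“soft margin”) to minimize

$$\frac{1}{N} \sum_{i=1}^{N} \max\big\{ 0,\, 1 - y_i \big( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b \big) \big\} + \lambda \langle w, w \rangle_{\mathcal{H}}. \tag{3}$$

The “hard margin” formulation coincides with the limit of the “soft” formulation as $\lambda \to 0$. The dual formulation of (3) is to maximize

$$\sum_{i=1}^{N} c_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j c_i c_j k(x_i, x_j),$$

subject to

$$\sum_{i=1}^{N} c_i y_i = 0, \quad \text{as well as} \quad 0 \leq c_i \leq (2N\lambda)^{-1}$$

for all $i$. The coefficients $c_i$ can now be obtained via quadratic programming, and $b$ can be found as

$$b = \sum_{i=1}^{N} c_i y_i k(x_i, x_j) - y_j$$

with $j$ corresponding to an index such that $0 < c_j < (2N\lambda)^{-1}$. The classification rule then becomes

$$x \mapsto \operatorname{sign}\Big( \sum_{i=1}^{N} c_i y_i k(x_i, x) - b \Big). \tag{4}$$

The above follows the standard development; see, e.g., the monographs in the reference list as well as the Wikipedia webpage on Support Vector Machines.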
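The dual problem and the resulting classification rule can be traced by hand on a toy problem. The sketch below is illustrative only and rests on stated assumptions: one training point per class, the Gaussian RBF kernel, and a value of $\lambda$ small enough that the upper bound on the $c_i$ is inactive. In that case the constraint $\sum_i c_i y_i = 0$ forces $c_1 = c_2 = c$, the dual objective reduces to $2c - c^2(1 - k(x_1, x_2))$, and its maximizer is $c = 1/(1 - k(x_1, x_2))$. The names `rbf_kernel`, `score`, and `classify` are hypothetical helpers.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma**2)))


def score(x, X_train, y_train, c, kernel):
    """Kernel expansion sum_i c_i y_i k(x_i, x)."""
    return sum(ci * yi * kernel(xi, x) for ci, yi, xi in zip(c, y_train, X_train))


def classify(x, X_train, y_train, c, b, kernel):
    """Classification rule: sign(sum_i c_i y_i k(x_i, x) - b)."""
    return 1 if score(x, X_train, y_train, c, kernel) - b >= 0 else -1


# Toy training set (assumed for illustration): one point per class
X_train = [np.array([-1.0]), np.array([1.0])]
y_train = [-1, 1]
N = len(X_train)
lam = 0.1                               # assumed regularization weight
upper = 1.0 / (2 * N * lam)             # box constraint 0 <= c_i <= (2 N lambda)^{-1}

# Closed-form maximizer of the two-point dual, valid while c_val < upper
k12 = rbf_kernel(X_train[0], X_train[1])
c_val = 1.0 / (1.0 - k12)
c = [c_val, c_val]

# Offset b from an index j with 0 < c_j < upper (here j = 1, zero-based)
j = 1
b = score(X_train[j], X_train, y_train, c, rbf_kernel) - y_train[j]

print(classify(np.array([0.5]), X_train, y_train, c, b, rbf_kernel))
print(classify(np.array([-0.5]), X_train, y_train, c, b, rbf_kernel))
```

For larger training sets the coefficients would come from a quadratic programming solver instead of a closed form; the evaluation of the rule is unchanged.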

## 3. Classification of Gaussian points

The problem we are addressing in the present note is the classification of uncertain data points into one of two categories, i.e., a binary classification as before. However, a salient feature of our setting is that data are only known with finite accuracy. Uncertainty is modeled in a probabilistic manner. For simplicity, in this paper, we consider only Gaussian points. These consist of pairs $(x_i, \Sigma_i)$ representing the mean and covariance of a normally distributed random vector $\xi_i$.

Alternatively, we may think of the data as points on a manifold of distributions (though, only Gaussian at present). These may represent approximations of empirical distributions that have been obtained at various times. An indicator $y_i \in \{-1, 1\}$ is provided as usual along with the information of the category that the current datum belongs to. If we regard $(x_i, \Sigma_i)$ as representing a distribution, we postulate that it arose from an experiment involving population $y_i$.

We follow the standard setting of kernel Support Vector Machines (kSVM) that was outlined in the background section, and we overlay a probabilistic component. This Probabilistic kernel Support Vector Machine (PkSVM) relies on a suitable modification of the kernel. To this end we consider the set of “data points”

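$$\Omega := \{ (x, \Sigma) \mid x \in \mathbb{R}^n,\ \Sigma \in S_{+,n} \}$$

with $S_{+,n}$ the cone of non-negative definite symmetric matrices in $\mathbb{R}^{n \times n}$. We also consider the family of normally distributed independent random vectors

$$\xi \sim N(x, \Sigma) \quad \text{for } (x, \Sigma) \in \Omega,$$

i.e., $\xi_i$, $\xi_j$ are independent when $i \neq j$, and $\xi_i \sim N(x_i, \Sigma_i)$. We utilize the popular exponential (RBF) kernel

$$k(x, y) = e^{-\frac{1}{2\sigma^2} \| x - y \|^2}, \tag{5}$$

where the parameter $\sigma$ affects the scale of the desired resolution. We readily compute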

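$$
\begin{aligned}
E\{k(\xi_i, \xi_j)\} &= E\big\{ e^{-\frac{1}{2\sigma^2} \| \xi_i - \xi_j \|^2} \big\} \\
&= (2\pi)^{-n/2} |\Sigma_i + \Sigma_j|^{-1/2} \int e^{-\frac{1}{2} \| z - (x_i - x_j) \|^2_{(\Sigma_i + \Sigma_j)^{-1}} - \frac{1}{2\sigma^2} \| z \|^2} \, dz \\
&= |I + \sigma^{-2}(\Sigma_i + \Sigma_j)|^{-1/2}\, e^{-\frac{1}{2} \| x_i - x_j \|^2_{(\sigma^2 I + (\Sigma_i + \Sigma_j))^{-1}}}
\end{aligned}
\tag{6}
$$

for $i \neq j$, where $z$ stands for the difference $\xi_i - \xi_j \sim N(x_i - x_j, \Sigma_i + \Sigma_j)$ and $\| v \|^2_M := v^\top M v$. When $i = j$ the two arguments coincide and $k(\xi_i, \xi_i) = 1$, whereas substituting the same values $(x_i, \Sigma_i)$ for both arguments in (6), i.e., pairing $\xi_i$ with an independent copy $\xi_i'$ of itself, gives

$$E\{k(\xi_i, \xi_i')\} = |I + \sigma^{-2}(2\Sigma_i)|^{-1/2} \leq 1. \tag{7}$$

We now state our main result.

###### Theorem 1.

The function

$$\kappa((x_i, \Sigma_i), (x_j, \Sigma_j)) := |I + \sigma^{-2}(\Sigma_i + \Sigma_j)|^{-1/2}\, e^{-\frac{1}{2} \| x_i - x_j \|^2_{(\sigma^2 I + (\Sigma_i + \Sigma_j))^{-1}}} \tag{8}$$

defines a positive kernel on $\Omega$.

###### Proof.

Consider $k$ in (5), and let $\phi$ be the map to the radial basis functions corresponding to this positive kernel, so that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. For any collection $(x_i, \Sigma_i) \in \Omega$ and any collection of $\alpha_i \in \mathbb{R}$, for $i \in \{1, \ldots, N\}$, let $\xi_i \sim N(x_i, \Sigma_i)$ and let $\xi_1', \ldots, \xi_N'$ be copies with the same distributions that are independent of $\xi_1, \ldots, \xi_N$. By (6) and (7), $\kappa((x_i, \Sigma_i), (x_j, \Sigma_j)) = E\{k(\xi_i, \xi_j')\}$ for all $i, j$, including $i = j$. Hence,

$$
\begin{aligned}
\sum_{i,j=1}^{N} \alpha_i \alpha_j \kappa((x_i, \Sigma_i), (x_j, \Sigma_j)) &= E\Big\{ \Big\langle \sum_{i=1}^{N} \alpha_i \phi(\xi_i), \sum_{j=1}^{N} \alpha_j \phi(\xi_j') \Big\rangle_{\mathcal{H}} \Big\} \\
&= \Big\| \sum_{i=1}^{N} \alpha_i E\{\phi(\xi_i)\} \Big\|^2_{\mathcal{H}} \geq 0.
\end{aligned}
$$

The second equality follows from the independence of the two families and from the fact that $\xi_i'$ has the same distribution as $\xi_i$. The claim in the theorem follows. ∎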
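The expectation $E\{k(\xi_i, \xi_j)\}$ that underlies the kernel can be checked numerically. The sketch below is illustrative only: the function names, the sample means and covariances, the sample size, and the seed are assumptions and not part of the paper. It evaluates the Gaussian integral in closed form for independent $\xi_i$ and $\xi_j$, with the determinant factor raised to the power $-1/2$ as that integral yields, compares the value with a Monte Carlo average of the RBF kernel, and inspects the eigenvalues of a Gram matrix built from the closed form.

```python
import numpy as np


def gaussian_point_kernel(x_i, S_i, x_j, S_j, sigma=1.0):
    """Closed form of E{k(xi_i, xi_j)} for independent xi_i ~ N(x_i, S_i),
    xi_j ~ N(x_j, S_j) and the RBF kernel with scale sigma."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    n = x_i.size
    M = np.asarray(S_i, dtype=float) + np.asarray(S_j, dtype=float)
    d = x_i - x_j
    det_factor = np.linalg.det(np.eye(n) + M / sigma**2) ** (-0.5)
    # Quadratic form d^T (sigma^2 I + S_i + S_j)^{-1} d without explicit inversion
    quad = d @ np.linalg.solve(sigma**2 * np.eye(n) + M, d)
    return float(det_factor * np.exp(-0.5 * quad))


def monte_carlo_rbf(x_i, S_i, x_j, S_j, sigma=1.0, samples=200_000, seed=0):
    """Monte Carlo average of k(xi_i, xi_j) over independent Gaussian draws."""
    rng = np.random.default_rng(seed)
    a = rng.multivariate_normal(x_i, S_i, size=samples)
    b = rng.multivariate_normal(x_j, S_j, size=samples)
    sq_dist = np.sum((a - b) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dist / (2.0 * sigma**2))))


# Two illustrative "Gaussian points" in R^2
x1, S1 = [0.0, 0.0], [[0.5, 0.1], [0.1, 0.3]]
x2, S2 = [1.0, -0.5], [[0.2, 0.0], [0.0, 0.4]]
closed_form = gaussian_point_kernel(x1, S1, x2, S2)
mc_estimate = monte_carlo_rbf(x1, S1, x2, S2)

# Gram matrix over a few random Gaussian points (random means, covariances A A^T)
rng = np.random.default_rng(1)
data = []
for _ in range(8):
    A = rng.normal(size=(2, 2))
    data.append((rng.normal(size=2), A @ A.T))
G = np.array([[gaussian_point_kernel(xa, Sa, xb, Sb) for (xb, Sb) in data]
              for (xa, Sa) in data])
min_eigenvalue = float(np.linalg.eigvalsh(G).min())
print(closed_form, mc_estimate, min_eigenvalue)
```

The Monte Carlo value should agree with the closed form up to sampling error, and the smallest eigenvalue of the Gram matrix should be nonnegative up to rounding.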

It can be seen that when applied to points in $\Omega$ having zero uncertainty, i.e., when the corresponding covariance matrices are identically $0$, the Probabilistic kernel Support Vector Machine model reduces to the standard one where points lie in $\mathbb{R}^n$. That is,

$$\kappa((x_i, 0), (x_j, 0)) = k(x_i, x_j).$$

Thereby, PkSVM is a natural extension of standard SVM, able to account for error and uncertainty that is available and encoded in the data.
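For intuition, it may help to write out the scalar case $n = 1$. The symbols $s_i^2$ and $s_j^2$ below, denoting the two variances, are introduced here only for illustration. For independent $\xi_i \sim N(x_i, s_i^2)$ and $\xi_j \sim N(x_j, s_j^2)$, the difference $\xi_i - \xi_j$ is Gaussian with mean $x_i - x_j$ and variance $s_i^2 + s_j^2$, and the Gaussian integral for the expected RBF kernel gives

$$
E\{k(\xi_i, \xi_j)\} = \Big( 1 + \frac{s_i^2 + s_j^2}{\sigma^2} \Big)^{-1/2} \exp\Big( -\frac{(x_i - x_j)^2}{2(\sigma^2 + s_i^2 + s_j^2)} \Big).
$$

Setting $s_i = s_j = 0$ recovers $e^{-\frac{1}{2\sigma^2}(x_i - x_j)^2}$, the RBF kernel on the means. Nonzero variances widen the effective bandwidth from $\sigma^2$ to $\sigma^2 + s_i^2 + s_j^2$ and lower the peak value, so that an uncertain datum influences a larger region, but less sharply.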

## 4. Conclusion

The formalism herein presents a new paradigm where data incorporate a quantification of their own uncertainty. We concentrated on binary classification and the case of “Gaussian points” to present a proof of concept, in the form of a suitable kernel in Theorem 1. Numerical experiments and further insight will be provided in a forthcoming, more detailed report. Moreover, the basic idea appears to be easily generalizable to more detailed and explicit descriptions of uncertainty, e.g., Gaussian Mixture models, with the caveat, however, of added complexity in the resulting formulae. This we expect will be the starting point of future investigations.

## Acknowledgements

This project was supported in part by the NSF under grants 1665031, 1807664, 1839441 and 1901599, the AFOSR under grant FA9550-17-1-0435, grants from the National Institutes of Health (R01-AG048769, R01-CA198121), MSK Cancer Center Support Grant/Core Grant (P30 CA008748), and a grant from the Breast Cancer Research Foundation (grant BCRF-17-193).

## References

• [1] B. Schölkopf, C. J. Burges, A. J. Smola et al., Advances in kernel methods: support vector learning. MIT press, 1999.
• [2] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE transactions on electronic computers, no. 3, pp. 326–334, 1965.
• [3] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American mathematical society, vol. 68, no. 3, pp. 337–404, 1950.
• [4] D. Alpay, An advanced complex analysis problem book: Topological vector spaces, functional analysis, and Hilbert spaces of analytic functions. Birkhäuser Basel, 2015.
• [5] B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.