1. Introduction and motivation
The Support Vector methodology is a nonlinear generalization of linear classification, typically used for binary classification. In this methodology, a training set of data is available in which each datum belongs to one of two categories, and an indicator function identifies the category of each. Imagine points on a plane labeled with two different colors. Based solely on their location and proximity to points of particular colors, one must classify new points and assign them to one of the two categories. When the points of the training set in the two categories can be separated by a line, this line can be chosen to delimit two regions, one corresponding to each of the two categories. On the other hand, when the "training" points of the two categories cannot be separated by a line, a higher-order curve must be chosen instead. The Support Vector methodology provides a disciplined and systematic way to do exactly that.
In the case where no simple boundary can be chosen to delineate regions for the points in the two categories, and a simple classification rule is still sought based on location, a smooth boundary can be selected to define regions that contain the vast majority of points of the respective categories. Simplicity of the classification rule (Occam's razor) may thus be desirable even when the domains corresponding to the two categories are not completely separate, in which case a certain misclassification error must be deemed acceptable. This, too, can be handled by a suitable relaxation of SVM.
The purpose of this paper is to raise, and in some ways address, the issue that in most engineering applications data are uncertain. Measurements are often provided along with error bars that quantify expected margins. In other applications, data may be yet more complex. For instance, a datum may be an empirical probability distribution. Such distributions may be approximated by a Gaussian, a mixture of Gaussians, or other models. In such cases, a "datum" is now a point in a space more complex than $\mathbb{R}^n$. We seek to formulate a new paradigm in which the Support Vector methodology can be extended accordingly. We postulate that measurements provide vectors $\hat x_i \in \mathbb{R}^n$ together with covariance matrices $\Sigma_i$ that quantify our uncertainty in the values we recorded. For the purposes of binary classification, we record such pairs $(\hat x_i, \Sigma_i)$ along with the information on the category of origin of each, which is given by an indicator value $y_i \in \{1, -1\}$. A datum with a large variance naturally delineates a larger volume that should be associated with the corresponding category, or at least should impact proportionately the drawing of the separating boundary between the two. Prior state-of-the-art does not account for this. Thus, in the present note, we propose a new paradigm that treats measurement uncertainty as part of the data.
Below, in Section 2, we highlight some background on Support Vector Machines. Section 3 gives our main result, which identifies a kernel function for the case of data provided in the form of "Gaussian points."
2. Background
Support vector machines are well-established techniques for (binary) classification and regression analysis [1]. The main idea is to imbed a given (training) data set into a high-dimensional space $\mathcal F$, of dimension much higher than the native dimension $n$ of the base space $\mathbb{R}^n$ and possibly infinite, so that for binary classification the two classes can be separated with a hyperplane [2]. Effectively, this separating hyperplane projects down to the native base space $\mathbb{R}^n$, where the data points originally reside, as curves/surfaces/etc. that separate the two classes. The imbedding into $\mathcal F$ is effected by a mapping
$$\phi:\ \mathbb{R}^n \to \mathcal F:\ x \mapsto \phi(x),$$
where $\phi(x)$ is referred to as the feature vector. The space $\mathcal F$ has an inner product structure (Hilbert space) and, naturally, the construction of the separating hyperplane relies on the geometry of $\mathcal F$. However, and most importantly, the map $\phi$ does not need to be known explicitly and does not need to be applied to the data. All needed operations, inner products and projections that show up in the classifier and computations, can be effectively carried out in the base space using the so-called "kernel trick."
Indeed, to accomplish the task and construct the classification rule (via suitable curves/surfaces/etc.), it is sufficient to know the kernel
$$k(x, x') := \langle \phi(x), \phi(x')\rangle_{\mathcal F}, \qquad (1)$$
with $x, x' \in \mathbb{R}^n$. It evaluates the inner product in the feature space as a function of the corresponding representatives in the native base space. Thus, the kernel is a bivariate function which is completely characterized by the property of being positive, in the sense that for all $m \in \mathbb{N}$, $\{x_i\}_{i=1}^m \subset \mathbb{R}^n$, and any corresponding set of values $\{c_i\}_{i=1}^m \subset \mathbb{R}$ (as we are interested in real-valued kernels),
$$\sum_{i=1}^m \sum_{j=1}^m c_i c_j\, k(x_i, x_j) \ \ge\ 0.$$
Necessity stems from the fact that the left-hand side is the inner product
$$\Big\langle \sum_{i=1}^m c_i\, \phi(x_i),\ \sum_{j=1}^m c_j\, \phi(x_j) \Big\rangle = \Big\| \sum_{i=1}^m c_i\, \phi(x_i) \Big\|^2 \ \ge\ 0.$$
Sufficiency, i.e., the existence of a feature map $\phi$ that realizes (1), is a celebrated theorem [3]; see also [4, Theorem 7.5.2] and [1, page 30, Theorem 3.1 (Mercer)].
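As a concrete numerical illustration of this positivity property, the following sketch builds the Gram matrix of the popular Gaussian RBF kernel on hypothetical random points and verifies that its eigenvalues are nonnegative up to roundoff; the point set and scale parameter are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 hypothetical points in R^3
eps = 1.0                      # kernel scale parameter (arbitrary)

# Gram matrix K[i, j] = k(x_i, x_j) for the Gaussian RBF kernel
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * eps))

# Positivity: c^T K c = ||sum_i c_i phi(x_i)||^2 >= 0 for every real c,
# equivalently all eigenvalues of K are nonnegative (up to roundoff)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```

Any other positive kernel would pass the same check; the RBF kernel is used here because it reappears in Section 3.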
Classification relies on constructing a classifier that is built on a linear functional, when viewed in $\mathcal F$. It is of the form
$$f(x) = \langle w, \phi(x)\rangle_{\mathcal F} + b, \qquad (2)$$
and the value $f(x)$ aims to differentiate between elements in the two categories. The coefficients $w \in \mathcal F$ and $b \in \mathbb{R}$ are chosen so that
$$\{z \in \mathcal F \mid \langle w, z\rangle_{\mathcal F} + b = 0\}$$
is a separating hyperplane of the two subsets
$$\{\phi(x_i) \mid y_i = 1\} \quad \text{and} \quad \{\phi(x_i) \mid y_i = -1\}$$
of the complete (training) data set. Once again, $\phi$ does not need to be evaluated at any point in the construction; the existence of such a map is enough, and it is guaranteed by the positivity of the kernel.
The construction of the classifier requires selection of the parameters $w$ and $b$. These are chosen either ("hard margin") to minimize
$$\tfrac12 \|w\|^2 \quad \text{subject to} \quad y_i\big(\langle w, \phi(x_i)\rangle_{\mathcal F} + b\big) \ge 1,$$
or ("soft margin") to minimize
$$\tfrac12 \|w\|^2 + C \sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, \phi(x_i)\rangle_{\mathcal F} + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \qquad (3)$$
for all available points $(x_i, y_i)$ in the "training set." The "hard margin" formulation coincides with the limit of the "soft" formulation as $C \to \infty$. The dual formulation of (3) becomes to maximize
$$\sum_{i=1}^m \alpha_i - \frac12 \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)$$
subject to
$$\sum_{i=1}^m \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C$$
for all $i$. The coefficients $\alpha_i$ can now be obtained via quadratic programming, and $b$ can be found as
$$b = y_j - \sum_{i=1}^m \alpha_i y_i\, k(x_i, x_j),$$
with $j$ corresponding to an index such that $0 < \alpha_j < C$. The classification rule then becomes
$$x \mapsto \operatorname{sign}\Big( \sum_{i=1}^m \alpha_i y_i\, k(x_i, x) + b \Big). \qquad (4)$$
The above follows standard development, see e.g., [5] as well as the Wikipedia webpage on Support Vector Machines.
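The standard pipeline just outlined — solving the dual program, recovering $b$, and applying rule (4) — can be sketched as follows. This is a minimal illustration on hypothetical toy data, using a general-purpose constrained solver (`scipy.optimize.minimize` with SLSQP) in place of a dedicated quadratic-programming routine.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated clusters in R^2
X = np.vstack([rng.normal(-1.5, 0.5, (20, 2)),
               rng.normal(1.5, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
m, C, eps = len(y), 10.0, 1.0

def k(a, b):
    # RBF kernel with scale parameter eps
    return np.exp(-np.sum((a - b) ** 2) / (2 * eps))

K = np.array([[k(xi, xj) for xj in X] for xi in X])
Q = np.outer(y, y) * K

# Dual of (3): maximize sum(alpha) - 0.5 alpha^T Q alpha,
# subject to sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(m),
               jac=lambda a: Q @ a - np.ones(m),
               bounds=[(0.0, C)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x

# Offset b from an index j with 0 < alpha_j < C
j = int(np.argmax(np.minimum(alpha, C - alpha)))
b = y[j] - (alpha * y) @ K[:, j]

def predict(x):
    # Classification rule (4)
    return np.sign((alpha * y) @ np.array([k(xi, x) for xi in X]) + b)
```

On data this cleanly separated, `predict` recovers the labels of both clusters; in practice, a specialized QP or SMO solver would replace the generic optimizer.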
3. Classification of Gaussian points
The problem we are addressing in the present note is the classification of uncertain data points into one of two categories, i.e., a binary classification as before. However, a salient feature of our setting is that data are only known with finite accuracy. Uncertainty is modeled in a probabilistic manner. For simplicity, in this paper, we consider only Gaussian points. These consist of pairs
$$(\hat x_i, \Sigma_i),$$
representing the mean and covariance of a normally distributed random vector
$$x_i \sim \mathcal N(\hat x_i, \Sigma_i).$$
Alternatively, we may think of the data as points on a manifold of distributions (though only Gaussian at present). These may represent approximations of empirical distributions that have been obtained at various times. An indicator value $y_i$ is provided as usual, along with the information of the category that the current datum belongs to. If we regard the datum as representing a distribution, we postulate that it arose from an experiment involving the population labeled by $y_i$.
We follow the standard setting of kernel Support Vector Machines (kSVM) that was outlined in the background section, and we overlay a probabilistic component. This Probabilistic kernel Support Vector Machine (PkSVM) relies on a suitable modification of the kernel. To this end, we consider the set of "data points"
$$\big\{(\hat x_i, \Sigma_i) \in \mathbb{R}^n \times \mathcal S_+ \mid i = 1, \dots, m\big\},$$
with $\mathcal S_+$ the cone of nonnegative definite symmetric matrices in $\mathbb{R}^{n \times n}$. We also consider the family of normally distributed independent random vectors
$$x_i \sim \mathcal N(\hat x_i, \Sigma_i), \qquad i = 1, \dots, m,$$
i.e., $x_i$ and $x_j$ are independent when $i \ne j$, and $\mathbb E[x_i] = \hat x_i$, $\mathbb E[(x_i - \hat x_i)(x_i - \hat x_i)^T] = \Sigma_i$. We utilize the popular exponential (RBF) kernel
$$k(x, x') = e^{-\frac{\|x - x'\|^2}{2\epsilon}}, \qquad (5)$$
where the parameter $\epsilon > 0$ affects the scale of the desired resolution. We readily compute
$$\mathbb E\big[k(x_i, x_j)\big] = \det\Big(I + \frac{\Sigma_i + \Sigma_j}{\epsilon}\Big)^{-1/2} \exp\Big(-\tfrac12 (\hat x_i - \hat x_j)^T \big(\epsilon I + \Sigma_i + \Sigma_j\big)^{-1} (\hat x_i - \hat x_j)\Big) \qquad (6)$$
for $i \ne j$. When $i = j$, i.e., for any $i$,
$$\mathbb E\big[k(x_i, x_i)\big] = 1 \ \ge\ \det\Big(I + \frac{2\Sigma_i}{\epsilon}\Big)^{-1/2}, \qquad (7)$$
where the inequality in (7) is obtained from (6) by substituting the same values $(\hat x_i, \Sigma_i)$ for both arguments. We now state our main result.
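Since (6) is the closed form of a Gaussian integral, it can be checked by Monte Carlo. The sketch below, with hypothetical means and covariances in $\mathbb{R}^2$, compares the sample average of the RBF kernel over independent Gaussian draws against the closed-form value.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 2.0

# Hypothetical Gaussian points (mean, covariance) in R^2
x1, S1 = np.array([0.5, -0.3]), np.array([[0.4, 0.1], [0.1, 0.3]])
x2, S2 = np.array([-0.2, 0.6]), np.array([[0.2, 0.0], [0.0, 0.5]])

# Closed-form expectation (6)
S = S1 + S2
d = x1 - x2
closed = (np.linalg.det(np.eye(2) + S / eps) ** -0.5
          * np.exp(-0.5 * d @ np.linalg.solve(eps * np.eye(2) + S, d)))

# Monte Carlo estimate of E[exp(-||x - x'||^2 / (2 eps))]
N = 200_000
u = rng.multivariate_normal(x1, S1, N)
v = rng.multivariate_normal(x2, S2, N)
mc = np.exp(-np.sum((u - v) ** 2, axis=1) / (2 * eps)).mean()

print(closed, mc)  # the two values should agree closely
```

The same check with `x2, S2` set equal to `x1, S1` (but with independent draws) reproduces the right-hand side of (7).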
Theorem 1.
The function
$$K\big((\hat x, \Sigma), (\hat x', \Sigma')\big) = \begin{cases} \det\Big(I + \dfrac{\Sigma + \Sigma'}{\epsilon}\Big)^{-1/2} \exp\Big(-\tfrac12 (\hat x - \hat x')^T \big(\epsilon I + \Sigma + \Sigma'\big)^{-1} (\hat x - \hat x')\Big), & (\hat x, \Sigma) \ne (\hat x', \Sigma'), \\[6pt] 1, & (\hat x, \Sigma) = (\hat x', \Sigma'), \end{cases} \qquad (8)$$
defines a positive kernel on $\mathbb{R}^n \times \mathcal S_+$.
Proof.
Consider the feature map $\phi$ that corresponds to the kernel $k$ in (5), i.e., such that $k(x, x') = \langle \phi(x), \phi(x')\rangle$ for $x, x' \in \mathbb{R}^n$. Then, for any collection $\{(\hat x_i, \Sigma_i)\}_{i=1}^m$ and any collection of $c_i \in \mathbb{R}$, for $i = 1, \dots, m$,
$$\begin{aligned}
\sum_{i,j=1}^m c_i c_j\, K\big((\hat x_i, \Sigma_i), (\hat x_j, \Sigma_j)\big) &= \sum_{i \ne j} c_i c_j\, \big\langle \mathbb E[\phi(x_i)], \mathbb E[\phi(x_j)]\big\rangle + \sum_{i=1}^m c_i^2 \\
&\ge \sum_{i \ne j} c_i c_j\, \big\langle \mathbb E[\phi(x_i)], \mathbb E[\phi(x_j)]\big\rangle + \sum_{i=1}^m c_i^2\, \big\|\mathbb E[\phi(x_i)]\big\|^2 \\
&= \Big\| \sum_{i=1}^m c_i\, \mathbb E[\phi(x_i)] \Big\|^2 \ \ge\ 0,
\end{aligned}$$
with $\phi$ the map to the radial basis functions corresponding to the positive kernel (5); the first equality uses the independence of the $x_i$'s, so that $\mathbb E[k(x_i, x_j)] = \langle \mathbb E[\phi(x_i)], \mathbb E[\phi(x_j)]\rangle$ for $i \ne j$. The inequality in line two follows from (7). The claim in the theorem follows. ∎

It can be seen that when applied to points in $\mathbb{R}^n$ having zero uncertainty, i.e., when the corresponding covariance matrices are identically $0$, the Probabilistic kernel Support Vector Machine model reduces to the standard one where points lie in $\mathbb{R}^n$. That is,
$$K\big((\hat x, 0), (\hat x', 0)\big) = k(\hat x, \hat x').$$
Thereby, PkSVM is a natural extension of standard SVM, able to account for error and uncertainty that is available and encoded in the data.
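As a sanity check of the reduction just described, the following sketch implements the probabilistic kernel as a hypothetical helper `pksvm_kernel` and confirms that it coincides with the standard RBF kernel (5) when the covariances vanish.

```python
import numpy as np

eps = 1.0  # kernel scale parameter (arbitrary choice)

def rbf(x, xp):
    # Standard RBF kernel (5)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * eps))

def pksvm_kernel(xh, S, xph, Sp):
    # Probabilistic kernel on Gaussian points (mean, covariance):
    # 1 on the diagonal, the expected RBF kernel value otherwise
    if np.array_equal(xh, xph) and np.array_equal(S, Sp):
        return 1.0
    n = len(xh)
    d = xh - xph
    M = eps * np.eye(n) + S + Sp
    return (np.linalg.det(np.eye(n) + (S + Sp) / eps) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.solve(M, d)))

# Zero covariance recovers the standard RBF kernel
x, xp = np.array([1.0, 2.0]), np.array([0.0, -1.0])
Z = np.zeros((2, 2))
print(pksvm_kernel(x, Z, xp, Z), rbf(x, xp))  # the two values agree
```

With nonzero covariances, `pksvm_kernel` discounts the Gram-matrix entries of the more uncertain data, which is precisely the intended effect.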
4. Conclusion
The formalism herein presents a new paradigm in which data incorporate a quantification of their own uncertainty. We concentrated on binary classification and the case of "Gaussian points" to present a proof of concept, in the form of a suitable kernel in Theorem 1. Numerical experiments and further insight will be provided in a forthcoming, more detailed report. Moreover, the basic idea appears to be readily generalizable to more detailed and explicit descriptions of uncertainty, e.g., Gaussian mixture models, with, however, the caveat of added complexity in the resulting formulae. This we expect will be the starting point of future investigations.
Acknowledgements
This project was supported in part by NSF under grants 1665031, 1807664, 1839441, and 1901599, the AFOSR under grant FA9550-17-1-0435, grants from the National Institutes of Health (R01-AG048769, R01-CA198121), MSK Cancer Center Support Grant/Core Grant (P30 CA008748), and a grant from the Breast Cancer Research Foundation (BCRF-17-193).
References
[1] B. Schölkopf, C. J. Burges, A. J. Smola et al., Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[2] T. M. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Transactions on Electronic Computers, no. 3, pp. 326–334, 1965.
[3] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337–404, 1950.
[4] D. Alpay, An Advanced Complex Analysis Problem Book: Topological Vector Spaces, Functional Analysis, and Hilbert Spaces of Analytic Functions. Birkhäuser Basel, 2015.
[5] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.