A witness function based construction of discriminative models using Hermite polynomials

01/10/2019
by H. N. Mhaskar, et al.
Duke University

In machine learning, we are given a dataset of the form {(x_j, y_j)}_{j=1}^M, drawn as i.i.d. samples from an unknown probability distribution μ, with μ^* denoting the marginal distribution of the x_j's. We propose that, rather than using a positive kernel such as the Gaussian to estimate these measures, one should use a non-positive kernel that preserves a large number of moments of these measures, which yields an optimal approximation. We use multivariate Hermite polynomials for this purpose, and prove optimal and local approximation results in a supremum norm in a probabilistic sense. Together with a permutation test developed with the same kernel, we prove that the kernel estimator serves as a `witness function' in classification problems: if the value of this estimator at a point x exceeds a certain threshold, then the point is reliably in a certain class. This approach can be used to modify pretrained algorithms, such as neural networks or nonlinear dimension reduction techniques, to identify in-class versus out-of-class regions for the purposes of generative models, classification uncertainty, or finding robust centroids. We demonstrate this on a number of real-world data sets, including MNIST, CIFAR10, the Science News documents, and the LaLonde data set.


1 Introduction

A central problem in machine learning is the following. We are given data of the form {(x_j, y_j)}_{j=1}^M, where each x_j is typically a point in a Euclidean space and y_j is a label associated with x_j. The data is assumed to be sampled independently from an unknown probability distribution μ. The problem is to learn either the generative model μ or a functional model for the unknown relationship between the x_j's and the y_j's. A problem germane to both of these interpretations is to learn the generative model for the x-component; i.e., the marginal distribution μ^* of the x_j's.

In the context of classification problems, the labels are limited to a finite set, which may be identified with , where is the number of classes. Corresponding to each of these classes, say the -th class, one can define a probability distribution giving the probability that the label is associated with . These measures can also be thought of as the discriminative models . The classification problem is then equivalent to the approximation of the function,

We have proposed in earlier papers [23, 22] a variant of this problem where one seeks to approximate the function

where is if the label associated with is , and otherwise. The functions are manifestly not continuous. Nevertheless, using diffusion geometry (see [4] for an early introduction), we can assume that the feature space is selected judiciously to be a data-defined manifold on which the supports of these functions are well separated; at least, the set of discontinuities of is not too large. The use of localized kernels developed in [22, 30] on the unknown manifold is then expected to yield a good approximation to the ’s and hence a good classification. A theory for this sort of approximation is well developed in many papers, e.g., [32, 22, 10, 28], illustrated with some toy examples in [22, 9], and in connection with diabetic blood sugar prediction in a clinical data set in [31]. A bottleneck in this approach is the construction of the right eigensystem to approximate with. Another drawback of this strategy is the lack of out-of-sample extension; i.e., the model depends upon the data set at hand, and the entire computation needs to be repeated whenever new data appear.

In this paper, we seek to approximate more directly the measures . Unlike the approach in [22], this needs to be done without knowing the value of even at training data points. These values are just not available. Further, we do not make any assumptions on the structure of the support of these measures. Rather than using a data dependent orthogonal system, we use the localized kernels developed in [29, 5], based on Hermite polynomials. This has several advantages.

  1. The kernels are well defined on the entire Euclidean space, and their theoretical properties are well investigated. Therefore, the use of these kernels obviates the problem of out-of-sample extension. Indeed, we use this kernel to generate labels for new, previously unseen points in the feature space.

  2. Although the use of these kernels does involve some tunable parameters, the constructions are more robust with respect to the choice of these parameters, compared with the role of the parameters in the diffusion geometry setting.

  3. In view of the Mehler identity, these kernels can be computed efficiently even in high dimensions using only the inner products of the data points (cf. Section 4.1).

  4. It is shown in [26, 6] that these kernels can be implemented as Gaussian networks with arbitrary accuracy.

Since the values of are not known, we resort to a straightforward kernel estimation. In classical probability estimation, one typically uses a positive kernel to ensure that the estimate is also a probability measure. From an approximation theory point of view, this is doomed to give a poor approximation. Instead, the use of our kernels, which are not positive, ensures that increasingly many moments of the approximation are the same as those of the original measure. This ensures an optimal approximation, albeit not by probability measures. Assuming absolute continuity of with respect to and that of with respect to the Lebesgue measure on the ambient space, we obtain local rates of approximation in terms of the number of samples in the training set and the smoothness of the Radon-Nikodym derivatives of these measures. We note that the assumption that is absolutely continuous with respect to the Lebesgue measure on the ambient space is restrictive, and further research is required to examine how to relax this condition. Nevertheless, this assumption seems reasonable in the present context in order to take into account noise in the data. In contrast to classical machine learning, our estimates are pointwise in a probabilistic sense rather than in the sense of an norm. A technical novelty in this analysis is a Videnskii-type inequality that estimates the derivative of a weighted polynomial on a cube in terms of the norm of the weighted polynomial on the same cube, the estimates being independent of the center of this cube.

In one-hot classification, we need to estimate as if there are two classes: labeled if the actual class label is and otherwise; i.e., we treat two measures at a time, and . In general, the question of detecting where two distributions deviate, and whether that deviation is significant, is of great interest both in machine learning and in various scientific disciplines. We refer to the difference between the estimators of these two measures as a witness function. The approach of building a kernel-based witness function has been developed by Gretton et al. [13]. In these works, the witness function that identifies where two samples deviate comes as a byproduct of determining the maximum mean discrepancy (MMD) between the two distributions, which yields a measure of global deviation between the distributions. The paper [3] describes how the power of the global test is affected by changing the definition of locality in the kernel. In the present work, we examine the local deviation of distributions as measured by the witness function, and provide theory and algorithms for determining whether the deviations are significant locally. This matters for a number of applications in which one needs to highlight why two distributions deviate and to determine whether specific bumps in the witness function are a product of structure or of noise. Each experiment in Section 5 relies on identifying these local deviations.

In Section 5.1, we demonstrate experimentally that introducing the Hermite kernel significantly increases the power of detecting local deviations compared to the Gaussian kernel traditionally used in MMD.

An important application of determining the local deviation between distributions is in accurately determining the confidence of predictions or generated points using neural networks. There have been many reported cases of decisions being made by neural networks for points that lie far away from the training data, and yet are assigned high confidence predictive value [14, 16]. Moreover, it has recently been shown that this is an inherent property of ReLU networks [15]. While there have been a number of attempts to alleviate the issue for predictive nets, there has not been nearly as much work on determining out-of-distribution regions for generative networks and Variational Autoencoders, where sampling from out-of-distribution regions leads to non-natural images or blurring between multiple classes of images. We discuss avoiding out-of-distribution and out-of-class sampling of Variational Autoencoders in Section 5.2. The use of a witness function for model comparison of GANs is studied in [40]. However, that paper only considers the raw size of the witness function without establishing a local baseline, and does not provide a theory of how the empirical witness function deviates from the true witness function. Most approaches to mitigating out-of-distribution high confidence prediction require changing the training objective for the network. We instead examine a pretrained VGG-16 network and determine a method for detecting outliers and likely misclassified points in Section 5.3.

Certain scientific applications also benefit from recognizing where two distributions deviate. Clustering documents based on co-occurrence of words has been a topic of significant consideration [38, 42]. However, most approaches based on k-means in an embedding space can be biased by outliers and by clusters with small separation. We examine the use of the witness function for determining in-class vs out-of-training-distribution points in a term-document embedding in Section 5.4 using the Hermite kernel, which exhibits better edges between close or overlapping clusters. Another application is propensity matching, in which the problem is to identify bias in which groups received or did not receive treatment in an observational trial. Propensity matching boils down to modeling the probability of being in one class versus the other, traditionally with a logistic regression, and using this probability for subsequent matching of treated and untreated patients [36]. The uses of propensity matching in the literature are too numerous to cite here, but we refer readers to the following article outlining both its importance and its drawbacks [17]. We instead consider a distance-based matching using the nonlinear Hermite kernel in Section 5.5, and demonstrate that viewing this problem as finding local deviations of the witness function allows for the benefits of both an unsupervised distance-based algorithm like the one proposed in [17] and a simple one-dimensional summary, similar to a propensity score, that describes the bias in treatment.

To summarize, we illustrate our theory using the following examples:

  1. A toy example of detecting the differences, with a low false discovery rate, between the support of a measure concentrated on a circle and one concentrated on an ellipse (Section 5.1),

  2. Discovering the boundaries between classes in the latent space of a variational autoencoder, and generating “prototypical” class examples in the MNIST data set (Section 5.2),

  3. Prospectively quantifying the uncertainty in classification of the VGG-16 network trained on CIFAR10 (Section 5.3),

  4. Determining robust cluster centroids in a document-term matrix of Science News documents (Section 5.4), and

  5. Discovering the propensity of treatment for people in the LaLonde job training data set (Section 5.5).

We develop the necessary background and notation for Hermite polynomials and associated kernels and other results from the theory of weighted polynomial approximation in Section 2. Our main theorems are discussed in Section 3, and proved in Section 6. The algorithms to implement these theorems are given in Section 4, and the results of their application in different experiments are given in Section 5.

2 Background and notation

2.1 Hermite polynomials

A good preliminary source of many identities regarding Hermite polynomials is the book [41] of Szegö or the Bateman manuscript [1].

The orthonormalized Hermite polynomial of degree is defined by Rodrigues’ formula:

(2.1)

Writing , one has the orthogonality relation for ,

(2.2)

In the multivariate case, we adopt the notation . The orthonormalized Hermite function is defined by

(2.3)

In general, when univariate notation is used in a multivariate context, it is to be understood in the tensor product sense as above. The notation will denote the Euclidean norm, with the subscript omitted when .
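For the reader's convenience, the standard orthonormalized Hermite polynomials and Hermite functions take the following form in our normalization, which may differ from the conventions adopted in (2.1)–(2.3) by constant factors; the symbols $h_k$, $\psi_k$, and $\psi_{\mathbf{k}}$ are used here only for this illustration:

$$h_k(x) \;=\; \frac{(-1)^k}{\pi^{1/4}\sqrt{2^k\,k!}}\; e^{x^2}\,\frac{d^k}{dx^k}\,e^{-x^2}, \qquad \int_{\mathbb{R}} h_j(x)\,h_k(x)\,e^{-x^2}\,dx \;=\; \delta_{j,k},$$

with the Hermite functions $\psi_k(x) = h_k(x)\,e^{-x^2/2}$ and, for a multi-index $\mathbf{k} = (k_1,\dots,k_d)$,

$$\psi_{\mathbf{k}}(\mathbf{x}) \;=\; \prod_{i=1}^{d} \psi_{k_i}(x_i).$$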

We define

(2.4)

Let be a function, if , if . We define

(2.5)
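To make the construction concrete, the following is a minimal univariate Python sketch, assuming that the kernel of (2.5) has the smoothed-projection form $\Phi_n(x,y) = \sum_{k \ge 0} H(\sqrt{k}/n)\,\psi_k(x)\,\psi_k(y)$ with a $C^\infty$ cutoff $H$ equal to 1 on $[0,1/2]$ and 0 on $[1,\infty)$. The exact definition and normalization used in the paper may differ, and all function names below are ours.

```python
import numpy as np

def hermite_functions(x, n_max):
    """Orthonormal Hermite functions psi_0, ..., psi_{n_max} at the points x,
    computed with the stable three-term recurrence."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros((n_max + 1, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2.0)
    if n_max >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for k in range(1, n_max):
        psi[k + 1] = (np.sqrt(2.0 / (k + 1)) * x * psi[k]
                      - np.sqrt(k / (k + 1)) * psi[k - 1])
    return psi

def smooth_cutoff(t):
    """C-infinity low-pass filter: equals 1 on [0, 1/2] and 0 on [1, inf)."""
    t = np.asarray(t, dtype=float)
    s = np.clip((t - 0.5) / 0.5, 0.0, 1.0)   # 0 for t <= 1/2, 1 for t >= 1
    f = lambda u: np.where(u > 0.0, np.exp(-1.0 / np.maximum(u, 1e-12)), 0.0)
    return f(1.0 - s) / (f(s) + f(1.0 - s))

def hermite_kernel_1d(x, y, n):
    """Assumed smoothed-projection kernel: sum_k H(sqrt(k)/n) psi_k(x) psi_k(y).
    Terms with k >= n**2 receive zero weight, so the sum is finite."""
    n_max = n * n
    psi_x = hermite_functions(x, n_max)       # shape (n_max + 1, len(x))
    psi_y = hermite_functions(y, n_max)       # shape (n_max + 1, len(y))
    w = smooth_cutoff(np.sqrt(np.arange(n_max + 1)) / n)
    return (w[:, None] * psi_x).T @ psi_y     # shape (len(x), len(y))
```

In the multivariate case, the Mehler identity discussed in Section 4.1 allows the analogous kernel to be evaluated from inner products of the data points alone.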

Constant convention:

In the sequel, will denote positive constants depending upon , , and other fixed quantities in the discussion, such as the norm. Their values may be different at different occurrences, even within a single formula.

The following lemma [5, Lemma 4.1] lists some important properties of (the notation in [5] is somewhat different; the kernel above is in the notation of [5]).

Lemma 2.1

Let be an integer.
(a) For , ,

(2.6)

In particular,

(2.7)

(b) For , ,

(2.8)

2.2 Weighted polynomials

For (not necessarily an integer), let . An element of has the form for a polynomial of total degree . The following proposition lists a few important properties of these spaces (cf. [24, 27, 29]).

Proposition 2.1

Let , , .
(a) (Infinite-finite range inequality) For any , there exists such that

(2.9)

(b) (MRS identity) We have

(2.10)

(c) (Bernstein inequality) There is a positive constant depending only on such that

(2.11)

In view of Proposition 2.1, we refer to the cube as the critical cube. When , it is often called the MRS (Mhaskar-Rakhmanov-Saff) interval. The following corollary is easy to prove using Proposition 2.1, parts (b) and (c).

Corollary 2.1

Let , be a finite set satisfying

(2.12)

Then for any ,

(2.13)

There exists a set as above with .

If , , we denote the ball of radius around by

(2.14)

2.3 Function approximation

In this section, we describe some results on approximation of functions. If , ,

(2.15)

The symbol denotes the set of all for which as . Thus, if , and . For , the smoothness class comprises for which

(2.16)

In [24, 25], we have given a characterization of the spaces in terms of constructive properties of , involving divided differences and bounds near . We define

(2.17)

The following proposition is routine to prove using Lemma 2.1(b) with :

Proposition 2.2

(a) If and , then .
(b) If , , then

(2.18)

(c) We have

(2.19)

Next, we describe local smoothness classes. If , the local smoothness class comprises functions with the following property: There exists a neighborhood of such that for every function supported on , . We note that the quantity is expected to depend upon . The following characterization of local smoothness classes can be obtained by imitating arguments in [29].

Theorem 2.1

Let , , , . The following statements are equivalent:
(a) .
(b) There exists such that

(c) There exists such that

If , , , and part (b) of Theorem 2.1 holds with for some , we define

(2.20)

where we note that this defines a norm since we have used the norm of on the entire as one of the summands above.

3 Recovery of measures

In the two sample problem, one has samples from distributions , , and associates the label with , with . The task of a witness function is to determine if in the neighborhood of a given point dominates or the other way round, or if they are equal. If both the distributions are absolutely continuous with respect to the Lebesgue measure on with smooth densities , respectively, then Theorem 2.1 suggests that , or its Monte-Carlo estimator using the samples should work as a witness function. If the Lebesgue measure were a probability distribution on , then it would be easy to put probabilistic bounds to make this more precise. Since this is not the case, we take the viewpoint that is a sample from a ground probability distribution with smooth density , and is another smooth function that takes approximately the value on , on . Then a candidate for the witness function is given by

With this reformulation, we no longer need to restrict ourselves to two classes: can be chosen to approximate any number of class values, or can even be just any smooth function. The following theorem makes these sentiments more precise in terms of the Monte-Carlo approximation to the last integral above.
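In our notation (the paper's precise definitions, including that of (3.5), may differ in normalization), such a witness function and its Monte-Carlo approximation take the form

$$F_n(f;x) \;=\; \int f(y)\,\Phi_n(x,y)\,d\mu^{*}(y), \qquad \widehat{F}_n(f;x) \;=\; \frac{1}{M}\sum_{j=1}^{M} f(x_j)\,\Phi_n(x, x_j),$$

where $x_1,\dots,x_M$ are the samples drawn from $\mu^{*}$ and $\Phi_n$ is the localized Hermite kernel; the symbols $F_n$, $\widehat{F}_n$, and $\Phi_n$ as written here are ours.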

Next, we discuss the robustness of this witness function. For this purpose, we assume noisy data of the form , with a joint probability distribution and with being the marginal distribution of with respect to . In place of , we consider a noisy variant , and denote

(3.1)

It is easy to verify using Fubini’s theorem that if is integrable with respect to then for any ,

(3.2)

A part of our theorem below uses the Lambert function defined by

(3.3)

It is known that

(3.4)
Theorem 3.1

Let be a probability distribution on for some sample space , be the marginal distribution of restricted to . We assume that is absolutely continuous with respect to the Lebesgue measure on and denote its density by . Let be a bounded function, and be defined by (3.1). Let , , , , and be chosen such that (cf. (2.20)). Let , be a set of random samples chosen i.i.d. from , and define

(3.5)

Then there exists such that for every and ,

(3.6)

In particular, writing , , and

(3.7)

we obtain for that

(3.8)
Remark 3.1

In the case when , (in particular, when on ), one may choose to be arbitrarily large, although the constants involved may depend upon the choice of .

Remark 3.2

We note that the critical cube can be partitioned into subcubes of radius . On each of these subcubes, say the subcube with center , the function is in a different smoothness class . Therefore, Theorem 3.1 implies an estimate for the entire critical cube with probability at least .

Remark 3.3

Taking , approximates the generative model . Thus, our construction can be viewed both as an estimator of the generative model and as a witness function for the discriminative model.

4 Algorithms

4.1 Implementation of the kernels

In this sub-section, we describe the construction of the kernels . We remark firstly that the univariate Hermite functions can be computed using the recurrence relations

(4.1)
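In our normalization (which may differ from (4.1) by constant factors), the standard recurrence for the orthonormalized Hermite functions reads

$$\psi_0(x) = \pi^{-1/4} e^{-x^2/2}, \qquad \psi_1(x) = \sqrt{2}\,\pi^{-1/4}\,x\,e^{-x^2/2},$$

$$\psi_{k+1}(x) = \sqrt{\frac{2}{k+1}}\;x\,\psi_k(x) \;-\; \sqrt{\frac{k}{k+1}}\;\psi_{k-1}(x), \qquad k \ge 1,$$

which is the recurrence used in the code sketch of Section 2.1.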

Next, we recall the Mehler identity, valid for , and :

(4.2)

It is clear that the projections depend only on , and ; in particular, that they are rotation independent. Denoting by the acute angle between and , we may thus write

(4.3)

Using the Mehler identity (4.2), we deduce that

Hence,

(4.4)

where

(4.5)

and

(4.6)
Remark 4.1

Completing the square shows that

(4.7)

With the choice , the Mehler identity (4.2) then shows that

It is now easy to calculate using orthogonality that

(4.8)

where

(4.9)

A careful discretization of (4.8) using a Gauss quadrature formula for Hermite weights, exact for polynomials of degree , leads to an interpretation of the kernels in terms of Gaussian networks with fixed variance and fixed centers, independent of the data. We refer to [26, 6] for the details.

There is no training required for these networks. However, the mathematical results are applicable only to this specific construction. Treating the theorem as a pure existence theorem and then trying to find a Gaussian network by traditional training will not work.

4.2 Algorithms

Our numerical experiments in Section 5 are designed to illustrate the use of defined in (3.5) as a discriminative model for two classes, arising with probabilities , with densities and respectively; i.e., with . The quantity representing the difference between the corresponding class labels is a noisy version of . Intuitively, if is larger than a positive threshold in a neighborhood of , then dominates at , and we may take the label for to be , and similarly if is smaller than a negative threshold. When is smaller than the threshold, there is uncertainty in the correct label for , whether because it belongs to the supports of both and , or because is small, making the point itself anomalous. In theory, Theorem 3.1 establishes this threshold as . However, it is not possible to estimate this threshold based only on the given data.

For this reason, we introduce the use of the permutation test [33]. The permutation test is a parameter-free, nonparametric method of testing hypotheses, in this case the null hypothesis that near . Theorem 3.1 shows that if the null hypothesis is true then is smaller than with high probability. In turn, it is easy to create a for which this is true. We randomly permute the labels of all points across all classes, and reassign these randomized labels . Since represents the class label, this ensures that we know is the same as , but for its expected value , . Informally, this means we are mixing the distributions of the classes so that when we redraw new collections of points, each collection is being drawn from the same distribution. The benefit of this test in our context is that we can sample this randomized distribution a large number of times, and use that distribution to estimate . We call this threshold , as it is the decision threshold associated with the local area around . If the two classes had equal sampling density around (i.e., if ), then, provided we estimated correctly, Theorem 3.1 tells us that the probability that exceeds is small. If, on the other hand, for some constant associated with and , then the hypothesis that near can be rejected.

This suggests two parametric choices in our algorithm, estimating and . Estimating comes down to estimating the threshold that exceeds only fraction of the time. Because each random permutation is independent, the probability of failure for any permutation to exceed over permutations is . One can choose a desired probability of failure and compute a number of permutations . The statistic in this case would be the maximum size of the local infinity norm across all permutations. If we wish to avoid taking the maximum of random variables, it is also possible to increase the size of and take the quantile of the random variables.
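As a hedged illustration of this choice (the symbols below are ours): if each random permutation independently exceeds the level-$a$ threshold with probability $a$, then the probability that none of $R$ permutations exceeds it is $(1-a)^R$, so a desired failure probability $\delta$ dictates

$$(1-a)^{R} \le \delta \quad\Longleftrightarrow\quad R \;\ge\; \frac{\log \delta}{\log(1-a)}, \qquad\text{e.g., } a = 0.05,\ \delta = 0.01 \ \Rightarrow\ R \ge 90.$$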

For now, we take to be a parameter of choice, with the fact that , rejection of the assumption scales continuously with , and much larger than becomes far too restrictive. Details of the entire algorithm can be found in Algorithm 1 for the two-class test, and in Algorithm 2 for the multi-class test. This algorithm returns the empirical witness function , as well as a decision as to whether is significantly nonzero (i.e., whether or ).

Algorithm 1 (determining the significance of the empirical witness function using label permutation).
Input: the two data sets, the points at which to inspect significance, and the level of confidence.
Output: an estimate of the empirical witness function at each inspection point, together with an indicator of whether it is significantly nonzero.

Algorithm 2 (determining the significance of the class identifier function using label permutation).
Input: the data sets for all classes, the points at which to inspect significance, and the level of confidence.
Output: an estimate of the class identifier function at each inspection point, an indicator of whether the region is dominated by one class, and the predicted class for each inspection point.
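The following is a minimal Python sketch of the two-class test (Algorithm 1); the multi-class test of Algorithm 2 proceeds analogously, one class against the rest. Here `kernel(x_eval, X)` is assumed to return the matrix of localized-kernel values between evaluation points and data points (e.g., the Hermite kernel sketched in Section 2.1), the local threshold is taken as a quantile of the permuted witness values as discussed above, and all names are ours rather than the paper's.

```python
import numpy as np

def empirical_witness(x_eval, X_pos, X_neg, kernel):
    """Monte-Carlo witness estimate: average kernel value against the class +1
    samples minus the average against the class -1 samples."""
    return (kernel(x_eval, X_pos).mean(axis=1)
            - kernel(x_eval, X_neg).mean(axis=1))

def witness_significance(x_eval, X1, X2, kernel, n_perm=200, level=0.05,
                         seed=None):
    """Two-class permutation test for the witness function (sketch of
    Algorithm 1).  Returns the witness values and a +1 / -1 / 0 decision per
    evaluation point; 0 means the local deviation is not significant."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([X1, X2], axis=0)
    n1 = len(X1)

    observed = empirical_witness(x_eval, X1, X2, kernel)

    # Null distribution: randomly permute the class labels and recompute.
    null_vals = np.empty((n_perm, len(x_eval)))
    for r in range(n_perm):
        perm = rng.permutation(len(X))
        null_vals[r] = empirical_witness(x_eval, X[perm[:n1]], X[perm[n1:]],
                                         kernel)

    # Local threshold: (1 - level) quantile of |witness| under permutations.
    tau = np.quantile(np.abs(null_vals), 1.0 - level, axis=0)
    decision = np.where(np.abs(observed) > tau, np.sign(observed), 0.0)
    return observed, decision
```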

For all the experimental sections, we will specify several parameters that are necessary to the algorithm. One parameter is the degree of the polynomials used. Note that the parameter in the kernel is given by . We also specify (the tunable parameter to set the level of confidence for determining significant regions), and the scaling factor on the data . The scaling factor rescales the data so that . One way to consider this scaling is in analogy with the bandwidth of a Gaussian kernel, where ; i.e., the variable is renormalized to have a variance . In the context of the witness function, no longer represents a variance parameter, but serves the same role as a free parameter.
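A hedged example of such a rescaling (the convention of normalizing the largest norm, and the name `scale`, are our illustrative choices; the paper's exact convention may differ):

```python
import numpy as np

def rescale_data(X, scale=1.0):
    """Rescale the data so that the largest Euclidean norm equals `scale`,
    playing a role analogous to the bandwidth of a Gaussian kernel."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    return scale * X / norms.max()
```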

5 Experiments

5.1 Toy Examples

Figure 1: (Top) Examples of the two data sets for varying values (0, 0.09, 0.18, 0.27, 0.36) of the difference between the principal axis lengths. (Middle) Points deemed significant by the Gaussian witness function for each value. (Bottom) Points deemed significant by the Hermite witness function for each value. In all figures, green indicates points at which neither distribution dominates.

We begin the set of experiments with a toy model to demonstrate the benefits of determining the significant regions for the witness function, as well as the benefits of using the Hermite kernel versus using the Gaussian kernel. We generate two data sets. The first is of the form , where is distributed uniformly on and are normal random variables with mean and standard deviation . The second data set is of the form , where is distributed uniformly on and are normal random variables with mean and standard deviation . See Figure 1 for visualizations of the data sets with various . We make random draws of points of the first form and points of the second form for various sizes of , and use Algorithm 1 to determine the regions of significant deviation. For this data, we set , , and .
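A hedged sketch of this data generation (the radius, noise standard deviation, sample sizes, and the name `eps` for the difference between principal axis lengths are illustrative placeholders):

```python
import numpy as np

def sample_circle(n, noise=0.05, seed=None):
    """Noisy samples from the unit circle (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    pts = np.column_stack([np.cos(theta), np.sin(theta)])
    return pts + rng.normal(0.0, noise, pts.shape)

def sample_ellipse(n, eps, noise=0.05, seed=None):
    """Noisy samples from an ellipse whose principal axes differ by eps."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    pts = np.column_stack([(1.0 + eps) * np.cos(theta), np.sin(theta)])
    return pts + rng.normal(0.0, noise, pts.shape)

# For example: X1 = sample_circle(1000); X2 = sample_ellipse(1000, eps=0.18);
# the pair (X1, X2) is then passed to the permutation test of Section 4.2.
```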

Figure 1 shows the significant areas for varying radii, where the witness function is measured at random points in a ring surrounding the two distributions. Not only does the Hermite kernel begin to detect significant regions earlier than the Gaussian kernel, but it is also the only kernel that detects the shorter radius of the ellipse. Also, observe the gap of significance around the points of interaction, where both distributions are locally similar.

5.2 Data Generation Through VAE

The next set of experiments revolves around estimating regions of space corresponding to given classes, and sampling new points from a given region. This problem has been of great interest in recent years with the growth of generative networks, namely various variants of generative adversarial networks (GANs) [12] and variational autoencoders (VAEs) [18]. Each has a low-dimensional latent space from which new points are sampled and mapped to the data space through a neural network. While GANs have been more popular in the literature in recent years, we focus on VAEs in this paper because it is possible to know the locations of the training points in the latent space. A good tutorial on VAEs can be found in [8].

Our first example is the well known MNIST data set [21]. This is a set of handwritten digits 0–9, each scanned as a 28 × 28 pixel image. There are 60,000 images in the training data set and 10,000 in the test data.

In order to select the “right” features for this data set, we construct a three-layer VAE with an encoder architecture and a decoder/generator architecture , and for clarity consider the latent space to be the 2D middle layer. In the 2D latent space, we construct a uniform grid on . Each of these grid points can be mapped to the image space via the decoder, but there is no guarantee that the reconstructed image appears “real” and no a priori knowledge of which digit will be generated. We display the resulting images in Figure 2, with each digit location corresponding to its location in the 2D latent space. However, we also have additional information about the latent space, namely the positions of each of the training points and their associated classes. In other words, we know for all training data . The embedding of the training data is displayed in Figure 2, as well as the resulting images for each point in the grid. As one can see, certain regions are dominated by a single digit, other regions contain mixtures of multiple digits, and still other regions are completely devoid of training points (meaning one should avoid sampling from those regions entirely).
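A minimal sketch of this grid construction, assuming a trained decoder `decode` mapping an (N, 2) array of latent codes to images; the grid limits and resolution below are illustrative, not the values used in the paper:

```python
import numpy as np

def decode_latent_grid(decode, lim=3.0, steps=20):
    """Decode a uniform grid in a 2-D VAE latent space.  There is no guarantee
    that the decoded images look "real" away from the training embeddings."""
    axis = np.linspace(-lim, lim, steps)
    grid = np.array([[zx, zy] for zy in axis for zx in axis])  # (steps**2, 2)
    images = decode(grid)
    return grid, images
```

The witness function of Section 4.2, evaluated at these grid points against the embedded training data of each class, then indicates which regions of the grid are dominated by a single digit and which regions should not be sampled at all.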