# Nonparametric Unsupervised Classification

Unsupervised classification methods learn a discriminative classifier from unlabeled data, which has proven an effective way to cluster the data and train a classifier at the same time. Various unsupervised classification methods obtain appealing results with classifiers learned in this manner. However, except for unsupervised SVM, existing methods do not consider the misclassification error of the unsupervised classifiers, so the performance of these classifiers is not fully evaluated. In this work, we study the misclassification error of two popular classifiers, the nearest neighbor classifier (NN) and the plug-in classifier, in the setting of unsupervised classification.

## 1 Introduction

Clustering methods partition the data into a set of self-similar clusters. Representative clustering methods include K-means [4], which minimizes the within-cluster dissimilarities, spectral clustering [5], which identifies clusters of more complex shapes lying on some low dimensional manifolds, and statistical modeling methods [6], which approximate the data by a mixture of parametric distributions.

On the other hand, viewing clusters as classes, recent works on unsupervised classification learn a classifier from unlabeled data, and they have established the connection between clustering and multi-class classification from a supervised learning perspective. [7] learns a max-margin two-class classifier in an unsupervised manner. This method is known as unsupervised SVM, and its theoretical properties are further analyzed in [8]. Also, [9] and [10] learn the kernelized Gaussian classifier and the kernel logistic regression classifier respectively. Both [10] and [11] adopt the entropy of the posterior distribution of the class label by the classifier to measure the quality of the learned classifier, and the parameters of such unsupervised classifiers can be computed by continuous optimization. More recent work presented in [12] learns an unsupervised classifier by maximizing the mutual information between cluster labels and the data, and the Squared-Loss Mutual Information is employed to produce a convex optimization problem. However, previous methods either do not consider the misclassification error of the learned unsupervised classifiers, one of the most important performance measures for classification, or only minimize the error of unsupervised SVM [7]. Therefore, the performance of the unsupervised classifier is not fully evaluated. Although Bengio et al. [13] analyze the out-of-sample error for unsupervised learning algorithms, their method focuses on lower-dimensional embedding of the data points and does not train a classifier from unlabeled data.

In contrast, we analyze the unsupervised nearest neighbor classifier (NN) and the plug-in classifier learned from unlabeled data by the training scheme for unsupervised classification introduced in [7], and derive bounds for their misclassification error. Although the generalization properties of the NN and the plug-in classifier have been extensively studied [14, 15], to the best of our knowledge most analysis focuses on the average generalization error. Unsupervised classification methods, such as unsupervised SVM [7], measure the quality of a specific data partition by its associated misclassification error, so we derive the data dependent misclassification error with respect to fixed training data. The resultant error bound comprises pairwise similarity between the data points, which also induces a similarity kernel over the data. We prove that the error of the plug-in classifier is asymptotically bounded by the (scaled) weighted volume of the cluster boundary [1], and the latter is designed to encourage the cluster boundary to avoid high density regions, following the Low Density Separation assumption [16].

Building a graph where the nodes represent the data and the edge weight is set by the induced similarity kernel, clustering by minimizing the error bound of the two unsupervised classifiers reduces to (normalized) graph-cut problems, which can be solved by the normalized graph Laplacian (or Normalized Cut [17]). The normalized graph Laplacian from the similarity kernel induced by the upper bound for the error (or the volume of the misclassified region) of the plug-in classifier renders a certain type of transition kernel for Diffusion maps [2, 3], and the Fokker-Planck operator (or the Laplace-Beltrami operator) is recovered by the infinitesimal generator of the corresponding Markov chains. It is interesting to observe that the volume of the misclassified region is independent of the marginal data distribution, which is consistent with the fact that the Laplace-Beltrami operator only captures the geometric information. This implies a close relationship between manifold learning and unsupervised classification.
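To make the graph-cut view concrete, the following is a minimal sketch that evaluates the normalized cut of every two-way partition of a tiny data set and keeps the smallest. It is only an illustration under stated assumptions: the edge weights use a plain Gaussian similarity (not the kernel induced by the error bounds), the search is brute force rather than the spectral relaxation of [17], and the names `similarity`, `ncut`, and `best_bipartition` are hypothetical.

```python
import itertools
import math

def similarity(a, b, h):
    """Gaussian similarity between two points (assumed edge weight)."""
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq / (2.0 * h * h))

def ncut(W, part_a):
    """Normalized cut value cut/vol(A) + cut/vol(B) for the partition (A, B)."""
    n = len(W)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(W[l][m] for l in a for m in b)
    vol_a = sum(W[l][m] for l in a for m in range(n))
    vol_b = sum(W[l][m] for l in b for m in range(n))
    return cut / vol_a + cut / vol_b

def best_bipartition(points, h):
    """Exhaustively search two-way partitions; feasible only for tiny n."""
    n = len(points)
    W = [[0.0 if l == m else similarity(points[l], points[m], h)
          for m in range(n)] for l in range(n)]
    best = None
    for size in range(1, n // 2 + 1):
        for part_a in itertools.combinations(range(n), size):
            value = ncut(W, part_a)
            if best is None or value < best[0]:
                best = (value, part_a)
    return best

points = [(0.0,), (0.2,), (0.4,), (3.0,), (3.2,), (3.4,)]
best_value, best_part = best_bipartition(points, h=0.5)
print(best_value, best_part)
```

The exhaustive search grows exponentially with the number of points, which is why practical methods relax the problem to an eigenvector computation on the normalized graph Laplacian.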

The rest of this paper is organized as follows. We first introduce the formulation of unsupervised classification in Section 2, then derive the error bounds for the unsupervised NN and plug-in classifiers and explain the connection to other related methods in Section 3. We conclude the paper in Section 4.

## 2 The Model

We first introduce the notation used in the formulation of unsupervised classification. Let $(X, Y)$ be a random couple with joint distribution , where $X$ is a vector of $d$ features and $Y$ is a label indicating the class to which $X$ belongs. We assume that is bounded by . The sample are independent copies of $(X, Y)$, and we only observe . Suppose is the induced marginal distribution of $X$.

### The Training Scheme for Unsupervised Classification

The training scheme introduced by the unsupervised SVM [7, 8] forms the basis for learning a classifier from unlabeled data in a principled way. With any hypothetical labeling , we can build the corresponding training data for a potential classifier, where is the data with label and is a partition of the data. In this way, the quality of a labeling , or equivalently a data partition, can be evaluated by the misclassification error of the classifier learned from the corresponding training data . (Two labelings are equivalent if they produce the same data partition , and we refer to as the data partition in the following text.) Existing methods [7, 8] perform clustering by searching for the data partition with minimum associated misclassification error. We adopt this training scheme for unsupervised classification, and analyze the misclassification error of the classifier learned from any fixed data partition (corresponding to a hypothetical labeling ).

It is worth mentioning that previous unsupervised classification methods [12, 10] circumvent the above combinatorial unsupervised training scheme by learning a probabilistic classifier from the whole data, so they cannot evaluate the classification performance of the learned classifier during the learning procedure. Rather than learning the unsupervised SVM [7, 8], we study the misclassification error of the unsupervised NN and plug-in classifiers, revealing the theoretical properties of these popular classifiers in the setting of unsupervised classification.

### The Misclassification Error

By the training scheme for unsupervised classification, the misclassification error (or the generalization error) of the classifier learned from the training data is:

$$R(F_{S_C}) \triangleq P\left(F_{S_C} \neq Y\right) \tag{1}$$

In order to evaluate the quality of the data partition , we should estimate with any fixed rather than the average generalization error . indicates either NN or plug-in classifier from , and is the classification function which returns the class label of a sample . We also let be the probability density function of , be the regression function of on , i.e. , and be the class-conditional density function and the prior for class (). Let be measurable functions, and there are further assumptions on :

(A1) is bounded, i.e. .

(A2) is Hölder- smooth: where is the Hölder constant and .

Because we will frequently estimate the underlying probability density function in the following text, we introduce the nonparametric kernel density estimator of as below:

$$\hat{f}_{n,h}(x) = \frac{1}{n} \sum_{l=1}^{n} K_h(x - X_l) \tag{2}$$

where

$$K_h(x) = K\!\left(\frac{x}{h}\right), \qquad K(x) \triangleq \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{\|x\|^2}{2}} \tag{3}$$

and is the isotropic Gaussian kernel with bandwidth . We introduce one assumption on the kernel bandwidth sequence :

(B) (1) , (2) , (4) , (5) for some .
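As an illustration of the estimator (2) with the Gaussian kernel (3), here is a minimal pure-Python sketch. The function names `gaussian_kernel` and `kde` are hypothetical, and the sketch makes one assumption that is not in (3) as printed: it divides by $h^d$ so that $K_h$ integrates to one.

```python
import math

def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel evaluated at u / h, following (3).

    Assumption: the extra factor 1 / h**d normalizes K_h to integrate to one;
    remove it to match (3) literally.
    """
    d = len(u)
    sq_norm = sum((c / h) ** 2 for c in u)
    return math.exp(-0.5 * sq_norm) / ((2.0 * math.pi) ** (d / 2.0) * h ** d)

def kde(x, samples, h):
    """Kernel density estimate (2): the average of K_h(x - X_l) over the sample."""
    total = 0.0
    for x_l in samples:
        diff = [a - b for a, b in zip(x, x_l)]
        total += gaussian_kernel(diff, h)
    return total / len(samples)

samples = [(0.0,), (0.1,), (-0.1,), (2.0,)]
near_cluster = kde((0.0,), samples, h=0.5)
near_outlier = kde((2.0,), samples, h=0.5)
print(near_cluster, near_outlier)
```

The estimate is larger near the three clustered points than near the isolated one, which is the behavior the later bounds rely on when they weight pairs by estimated density.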

[18] proves that the kernel density estimator (2) converges almost surely and uniformly to the underlying density:

###### Theorem 1

(Theorem 2.3 in Gine et al. [18], in a slightly changed form) Under the assumption (A1)-(A2), suppose the kernel bandwidth sequence satisfies assumption (B), then with probability

 (4)

where

Similarly, the kernel density estimator of the class-conditional density function is

$$\hat{f}^{(i)}_{n,h}(x) = \frac{1}{n\pi^{(i)}} \sum_{l=1}^{n} K_h(x - X_l)\, \mathbb{1}_{\{Y_l = i\}} \tag{5}$$

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. By an argument similar to that for the kernel density estimator (2), under the assumptions (B) and (C1)-(C2), we have the almost sure uniform convergence for , i.e. with probability for .

(C1) are bounded, i.e. .

(C2) is Hölder- smooth: where is the Hölder constant, , and .

Note that the assumption (C2) implies (A2).
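The class-conditional estimator (5) can be sketched in the same way. This is an illustrative sketch only: the name `class_conditional_kde` is hypothetical, the kernel is normalized by $h^d$ (an assumption), and when no prior is supplied the class prior is replaced by the empirical class fraction, which is also an assumption.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel at u / h, normalized by h**d (assumed normalization)."""
    d = len(u)
    sq_norm = sum((c / h) ** 2 for c in u)
    return math.exp(-0.5 * sq_norm) / ((2.0 * math.pi) ** (d / 2.0) * h ** d)

def class_conditional_kde(x, samples, labels, i, h, prior=None):
    """Estimator (5): kernel mass of class i divided by n times the class prior."""
    n = len(samples)
    if prior is None:
        # Assumption: use the empirical fraction of class i as the prior.
        prior = sum(1 for y in labels if y == i) / n
    total = 0.0
    for x_l, y_l in zip(samples, labels):
        if y_l == i:
            diff = [a - b for a, b in zip(x, x_l)]
            total += gaussian_kernel(diff, h)
    return total / (n * prior)

samples = [(0.0,), (0.2,), (3.0,), (3.1,), (2.9,)]
labels = [1, 1, 2, 2, 2]
x = (0.3,)
f1 = class_conditional_kde(x, samples, labels, 1, h=0.5)
f2 = class_conditional_kde(x, samples, labels, 2, h=0.5)
print(f1, f2)
```

With empirical priors, the prior-weighted sum of the class-conditional estimates reproduces the overall estimate (2) at every point, mirroring the mixture decomposition of the density.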

## 3 Main Results

We prove the error bound for the unsupervised NN and plug-in classifiers, and then show the connection to other related methods in this section.

### 3.1 Unsupervised Classification By Nearest Neighbor

Since the NN rule makes a hard decision for a given datum, we introduce the following soft NN cost function, which converges to the NN classification function and is similar to the one adopted by Neighbourhood Components Analysis [19]:

###### Definition 1

The soft NN cost function is defined as

$$\widehat{NN}_{S_C,h^*}(x, i) = \frac{\sum_{l=1}^{n} K_{h^*}(x - X_l)\, \mathbb{1}_{\{Y_l = i\}}}{\sum_{l=1}^{n} K_{h^*}(x - X_l)} \tag{6}$$

where represents the probability that the datum is assigned to class by the soft NN rule learned from .
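A minimal sketch of the soft NN cost function (6) follows. The helper names are hypothetical. Because (6) is a ratio of kernel sums, any constant factor of the kernel cancels, so the sketch subtracts the smallest squared distance before exponentiating; this is a numerical-stability choice of the sketch, not something stated in the text.

```python
import math

def kernel_weights(x, samples, h):
    """Unnormalized Gaussian weights, shifted so the largest weight is 1."""
    sq = [sum((a - b) ** 2 for a, b in zip(x, s)) for s in samples]
    base = min(sq)  # shifting by the minimum avoids underflow for small h
    return [math.exp(-(v - base) / (2.0 * h * h)) for v in sq]

def soft_nn(x, samples, labels, i, h_star):
    """Soft NN cost (6): share of kernel weight carried by points labeled i."""
    w = kernel_weights(x, samples, h_star)
    in_class = sum(w_l for w_l, y_l in zip(w, labels) if y_l == i)
    return in_class / sum(w)

samples = [(0.0,), (1.0,), (3.0,)]
labels = [1, 1, 2]
x = (2.2,)  # nearest training point is 3.0, which carries label 2
sharp = soft_nn(x, samples, labels, 2, h_star=0.1)
smooth = soft_nn(x, samples, labels, 2, h_star=10.0)
print(sharp, smooth)
```

With a small bandwidth the soft assignment concentrates on the label of the nearest neighbor, while a large bandwidth approaches the class proportions in the training data.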

Then we have the misclassification error of the soft NN:

###### Lemma 1

The misclassification error of the soft NN is given by

 (7)

Lemma 1 can be proved by the definition of misclassification error. Lemma 2 shows that, with a large probability, the error of the soft NN (7) is bounded. To facilitate our analysis, we introduce the cover of as below:

###### Definition 2

The -cover of the set is a sequence of sets such that and each is a box of length in , .

###### Lemma 2

Under the assumption (A1), (C1)-(C2), suppose the kernel bandwidth sequence satisfies assumption (B), then with probability greater than the misclassification error of the soft NN, i.e. , satisfies:

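$$
\begin{aligned}
&\frac{1}{n} \sum_{i \neq j} \sum_{l=1}^{n} \mathbb{1}_{\{Y_l = j\}} \int_{\mathcal{X}} \frac{\pi^{(i)} \hat{f}^{(i)}_{n,h_n}(x)\, K_{h^*}(x - X_l)}{\hat{\mathbb{E}}_Z\left[K_{h^*}(x - Z)\right] + O(h_n^{\gamma}) + \tilde{\varepsilon}}\, dx + O(h_n^{\gamma}) \le R\big(\widehat{NN}_{S_C,h^*}\big) \\
&\quad \le \frac{1}{n} \sum_{i \neq j} \sum_{l=1}^{n} \mathbb{1}_{\{Y_l = j\}} \int_{\mathcal{X}} \frac{\pi^{(i)} \hat{f}^{(i)}_{n,h_n}(x)\, K_{h^*}(x - X_l)}{\hat{\mathbb{E}}_Z\left[K_{h^*}(x - Z)\right] + O(h_n^{\gamma}) - \tilde{\varepsilon}}\, dx + O(h_n^{\gamma})
\end{aligned}
\tag{8}
$$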

where , is the size of the -cover of , , , , are small enough such that .

###### Proof

Let be the -cover of the set . Suppose points are chosen from and . For each , according to Hoeffding's inequality

$$\Pr\left[\left|T(\hat{x}_r) - \mathbb{E}_Z\left[K_{h^*}(\hat{x}_r - Z)\right]\right| > \varepsilon\right] < 2e^{-M_{h^*}} \tag{9}$$

where . By the union bound, the probability that the above event happens for is less than . It follows that with probability greater than ,

$$\left|T(\hat{x}_r) - \mathbb{E}_Z\left[K_{h^*}(\hat{x}_r - Z)\right]\right| \le \varepsilon \tag{10}$$

holds for any . For any , for some , so that

$$\left|T(x) - T(\hat{x}_r)\right| \le T_1(h^*)\, \|x - \hat{x}_r\| \le T_1(h^*) \sqrt{d}\, \tau \tag{11}$$

where . Moreover,

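$$\left|\mathbb{E}_Z\left[K_{h^*}(\hat{x}_r - Z)\right] - \mathbb{E}_Z\left[K_{h^*}(x - Z)\right]\right| \le c\left(\sqrt{d}\, \tau\right)^{\gamma} \tag{12}$$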

Combining (10), (11) and (12),

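$$\left|T(x) - \mathbb{E}_Z\left[K_{h^*}(x - Z)\right]\right| \le T_1(h^*) \sqrt{d}\, \tau + c\left(\sqrt{d}\, \tau\right)^{\gamma} + \varepsilon \tag{13}$$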

By Theorem 1, . Similarly, . Substituting and for and , and applying (13), (8) is verified. ∎

Denote the classification function of the NN by ; it can be verified that . In order to approach the misclassification error of the NN, we construct a sequence . Letting , we further have the asymptotic misclassification error of the soft NN:

###### Theorem 2

Let be a sequence such that and with . Under the assumption (A1), (C1)-(C2), when , then with probability ,

 limn→∞{R(^NNSC,h∗n)−1n2∑l
$$H_{lm} = K_{h_n}(x_l - x_m) \left( \frac{1}{\hat{f}_{n,h_n}(X_l)} + \frac{1}{\hat{f}_{n,h_n}(X_m)} \right) \tag{15}$$

where is the kernel density estimator defined by (2) with kernel bandwidth sequence under the assumption (B), is a class indicator function such that if and belong to different classes, and otherwise.

###### Proof

In Lemma 2, let , for and . Since , with . is small enough such that for sufficiently large .

Let in the RHS of (8), note that and , then with probability ,

 ¯¯¯¯¯¯¯¯limn→∞{R(^NNSC,h∗n)−1n∑i≠jn∑l=11I{Yl=j}π(i)^f(i)n,hn(Xl)Kh∗(x−Xl)^fn,hn(Xl)−λ0τ0dx}≤0 (16)

Substitute into (16),

 ¯¯¯¯¯¯¯¯limn→∞{R(^NNSC,h∗n)−1n2∑l

Since (17) holds for arbitrarily small , with probability . Similarly with probability , and (14) is verified. ∎

By Theorem 2, the misclassification error involves only pairwise terms, and can be interpreted as the similarity kernel over and induced by the misclassification error of the unsupervised NN.
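The pairwise quantities in Theorem 2 can be computed directly from the data and a hypothetical labeling. The sketch below builds the matrix of $H_{lm}$ from (15) and the class indicator that equals one when two points carry different labels. The function names are hypothetical, the kernel is normalized by $h^d$ (an assumption), and the sketch stops at the pairwise terms; how they enter the limit is given by (14).

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel at u / h, normalized by h**d (assumed normalization)."""
    d = len(u)
    sq_norm = sum((c / h) ** 2 for c in u)
    return math.exp(-0.5 * sq_norm) / ((2.0 * math.pi) ** (d / 2.0) * h ** d)

def diff(a, b):
    return [p - q for p, q in zip(a, b)]

def kde(x, samples, h):
    """Kernel density estimate (2)."""
    return sum(gaussian_kernel(diff(x, s), h) for s in samples) / len(samples)

def pairwise_terms(samples, labels, h):
    """Return (H, theta): H follows (15), theta marks pairs with different labels."""
    n = len(samples)
    dens = [kde(x, samples, h) for x in samples]
    H = [[gaussian_kernel(diff(samples[l], samples[m]), h)
          * (1.0 / dens[l] + 1.0 / dens[m])
          for m in range(n)] for l in range(n)]
    theta = [[1 if labels[l] != labels[m] else 0 for m in range(n)]
             for l in range(n)]
    return H, theta

samples = [(0.0,), (0.3,), (3.0,)]
labels = [1, 1, 2]
H, theta = pairwise_terms(samples, labels, h=0.5)
print(H[0][1], H[0][2])
```

Nearby points receive a large $H_{lm}$, so a labeling that separates them is penalized, while distant pairs contribute almost nothing.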

### 3.2 Unsupervised Classification By Plug-in Classifier

Next we will derive the misclassification error of the unsupervised plug-in classifier, that is, the classifier with the form

$$F^{PI}_n(X) = \arg\max_{1 \le i \le Q} \hat{\eta}^{(i)}(X) \tag{18}$$

where is a nonparametric estimator of the regression function , and we choose

$$\hat{\eta}^{(i)}_n(x) = \frac{\sum_{l=1}^{n} K_{h_n}(x - X_l)\, \mathbb{1}_{\{Y_l = i\}}}{n \hat{f}_n(x)} \tag{19}$$

Let be the Bayes classifier, which minimizes the misclassification error over all classifiers, and . Due to the almost sure uniform convergence of the kernel density estimator and under the assumptions (A1), (B), (C1)-(C2), converges almost surely and uniformly to , and converges to the Bayes classifier : . It follows that by the dominated convergence theorem. It is also known that the excess risk of , namely , converges to of the order under some complexity assumption on the class of the regression functions with smooth parameter that belongs to [20, 15]. Again, this result deals with the average generalization error and cannot be applied to derive the data dependent misclassification error of unsupervised classification with fixed training data in our setting.
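The plug-in rule (18) with the estimator (19) can be sketched as follows. The sketch assumes that the density estimate in the denominator of (19) is the estimator (2) with the same bandwidth, in which case the denominator equals the sum of all kernel weights and the estimate reduces to a ratio of kernel sums. The function names are hypothetical.

```python
import math

def kernel_weights(x, samples, h):
    """Gaussian weights up to a common factor, which cancels in the ratio (19)."""
    sq = [sum((a - b) ** 2 for a, b in zip(x, s)) for s in samples]
    base = min(sq)  # shift for numerical stability; does not change the ratio
    return [math.exp(-(v - base) / (2.0 * h * h)) for v in sq]

def eta_hat(x, samples, labels, h):
    """Estimated regression function (19) for every class present in labels."""
    w = kernel_weights(x, samples, h)
    total = sum(w)
    eta = {}
    for w_l, y_l in zip(w, labels):
        eta[y_l] = eta.get(y_l, 0.0) + w_l / total
    return eta

def plug_in_classify(x, samples, labels, h):
    """Plug-in rule (18): return the class with the largest estimate."""
    eta = eta_hat(x, samples, labels, h)
    return max(eta, key=eta.get)

samples = [(0.0,), (0.2,), (3.0,), (3.3,)]
labels = [1, 1, 2, 2]
left = plug_in_classify((0.1,), samples, labels, h=0.5)
right = plug_in_classify((3.1,), samples, labels, h=0.5)
print(left, right)
```

Under this assumption the estimates for the different classes sum to one at every point, so they can be read as estimated posterior probabilities.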

Similar to Lemma 1, it can be verified that

$$R(F^{PI}_n) = \sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_X\left[\eta^{(i)}(X)\, P\left[F^{PI}_n(X) = j\right]\right] \tag{20}$$

We then give the upper bound for the misclassification error of in Lemma 3.

###### Lemma 3

Under the assumption (A1), (B), (C1)-(C2), the asymptotic misclassification error of the plug-in classifier satisfies

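$$R(F^{PI}_n) \le R^{PI}_n + O(h_n^{\gamma}) \tag{21}$$

$$R^{PI}_n \triangleq 2 \sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_X\left[\hat{\eta}^{(i)}_n(X)\, \hat{\eta}^{(j)}_n(X)\right] \tag{22}$$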

where is the regression functions, and this bound is tight.

###### Proof

By Theorem 1, for

$$\left\|\hat{\eta}^{(i)}_n - \eta^{(i)}\right\|_{\infty} = O(h_n^{\gamma}) \tag{23}$$

According to (20), .

Suppose the decision regions of is , then on each , for any , and

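$$
\begin{aligned}
&\sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_X\left[\hat{\eta}^{(i)}_n(X)\, P\left[F^{PI}_n(X) = j\right]\right] \\
&\quad = \sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_{X \in R_j}\left[\hat{\eta}^{(i)}_n(X) \cdot \sum_{k=1}^{Q} \hat{\eta}^{(k)}_n(X)\right] \\
&\quad = \mathbb{E}_X\left[\left(\sum_{k=1}^{Q} \hat{\eta}^{(k)}_n(X)\right)^{2}\right] - \sum_{i=1}^{Q} \mathbb{E}_{X \in R_i}\left[\hat{\eta}^{(i)}_n(X) \cdot \sum_{k=1}^{Q} \hat{\eta}^{(k)}_n(X)\right] \\
&\quad \le \mathbb{E}_X\left[\left(\sum_{k=1}^{Q} \hat{\eta}^{(k)}_n(X)\right)^{2}\right] - \sum_{i=1}^{Q} \mathbb{E}_X\left[\left(\hat{\eta}^{(i)}_n(X)\right)^{2}\right] \\
&\quad = 2 \sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_X\left[\hat{\eta}^{(i)}_n(X)\, \hat{\eta}^{(j)}_n(X)\right]
\end{aligned}
\tag{24}
$$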

So that (21) is verified. Since the equality in (24) holds when for , the upper bound in (21) is tight. ∎
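The inequality step in (24) rests on a pointwise fact: on the decision region of class $i$ the estimate for class $i$ is the largest, so the largest estimate times the sum of all estimates is at least the sum of their squares. The sketch below checks this fact numerically on random nonnegative vectors that sum to one. It assumes the decision regions are the sets where the corresponding estimate is maximal; the name `slack` is hypothetical.

```python
import random

def slack(eta):
    """max(eta) * sum(eta) - sum of squares; nonnegative for nonnegative eta."""
    return max(eta) * sum(eta) - sum(e * e for e in eta)

rng = random.Random(0)  # fixed seed so the check is reproducible
slacks = []
for _ in range(1000):
    raw = [rng.random() for _ in range(4)]
    total = sum(raw)
    eta = [r / total for r in raw]  # a random point on the probability simplex
    slacks.append(slack(eta))

print(min(slacks) >= -1e-12)
```

The slack vanishes, for example, when all the mass sits on a single class, which is one case where the inequality holds with equality.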

Based on Lemma 3, we can bound the error of the plug-in classifier from above by . In order to estimate the error bound , we introduce the following generalized kernel density estimator:

###### Lemma 4

Suppose is a probability density function on that satisfies assumption (A1)-(A2). are drawn i.i.d. according to . Let be a Hölder- smooth continuous function defined on with Hölder constant , and is bounded, i.e. . Let . Define the generalized kernel density estimator of as

$$\hat{e}_{n,h}(x) \triangleq \frac{1}{n} \sum_{l=1}^{n} K_h(x - X_l)\, g(X_l) \tag{25}$$

When the kernel bandwidth sequence satisfies assumption (B), the estimator converges to almost surely and uniformly, i.e. with probability ,

$$\lim_{n \to \infty} h_n^{-\gamma} \left\|\hat{e}_{n,h_n}(x) - e(x)\right\|_{\infty} = C_{K,f,g} \tag{26}$$

where .
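The generalized estimator (25) differs from the plain kernel density estimator (2) only by the weight $g(X_l)$ attached to each sample. A minimal sketch, with hypothetical function names and a kernel normalized by $h^d$ (an assumption):

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel at u / h, normalized by h**d (assumed normalization)."""
    d = len(u)
    sq_norm = sum((c / h) ** 2 for c in u)
    return math.exp(-0.5 * sq_norm) / ((2.0 * math.pi) ** (d / 2.0) * h ** d)

def generalized_kde(x, samples, g, h):
    """Estimator (25): kernel average of the weights g(X_l) around x."""
    total = 0.0
    for x_l in samples:
        diff = [a - b for a, b in zip(x, x_l)]
        total += gaussian_kernel(diff, h) * g(x_l)
    return total / len(samples)

samples = [(0.0,), (0.5,), (1.0,), (4.0,)]
plain = generalized_kde((0.4,), samples, lambda s: 1.0, h=0.5)
doubled = generalized_kde((0.4,), samples, lambda s: 2.0, h=0.5)
print(plain, doubled)
```

Setting the weight function to the constant one recovers (2), and the estimator is linear in the weight function, which is what allows Lemma 4 to reuse the argument for Theorem 1.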

###### Proof

Define the class of functions on the measurable space :

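$$\mathcal{F} \triangleq \left\{ K\!\left(\frac{t - \cdot}{h}\right),\ t \in \mathbb{R}^d,\ h \neq 0 \right\}, \qquad \mathcal{F}_g \triangleq \left\{ K\!\left(\frac{t - \cdot}{h}\right) g(\cdot),\ t \in \mathbb{R}^d,\ h \neq 0 \right\}$$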

It is shown in [18, 21] that is a bounded VC class of measurable functions with respect to the envelope such that for any . Therefore, there exist positive numbers and such that for every probability measure on for which and any ,

$$N\left(\mathcal{F}, \|\cdot\|_{L_2(P)}, \tau \|F\|_{L_2(P)}\right) \le \left(\frac{A}{\tau}\right)^{v} \tag{27}$$

where is defined as the minimal number of open -balls of radius and centers in required to cover . For any and , . Let , and . Then .

We choose the envelope function for as and for any , then

$$N\left(\mathcal{F}_g, \|\cdot\|_{L_2(P)}, \tau \|F_g\|_{L_2(P)}\right) \le \left(\frac{A}{\tau}\right)^{v} \tag{28}$$

So that is also a bounded VC class. By an argument similar to the proof of Theorem 1, we obtain (26). ∎

###### Theorem 3

Under the assumption (A1), (B), (C1)-(C2), the error of the plug-in classifier satisfies

$$R(F^{PI}_n) \le \frac{2}{n^2} \sum_{l,m} \theta_{lm}\, G_{lm,h_n} + O(h_n^{\gamma}) \tag{29}$$

where is a class indicator function and

$$G_{lm,h_n} = G_{h_n}(X_l, X_m), \qquad G_h(x, y) = \frac{K_h(x - y)}{\hat{f}^{\frac{1}{2}}_{n,h}(x)\, \hat{f}^{\frac{1}{2}}_{n,h}(y)} \tag{30}$$

for any kernel bandwidth sequence that satisfies assumption (B).
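The leading term of the bound (29) can be evaluated for any hypothetical labeling, which is what makes it usable as a clustering objective. The sketch below computes it with the similarity (30) and compares two labelings of the same data. The function names are hypothetical, the kernel is normalized by $h^d$ (an assumption), and the $O(h_n^{\gamma})$ remainder is ignored.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel at u / h, normalized by h**d (assumed normalization)."""
    d = len(u)
    sq_norm = sum((c / h) ** 2 for c in u)
    return math.exp(-0.5 * sq_norm) / ((2.0 * math.pi) ** (d / 2.0) * h ** d)

def diff(a, b):
    return [p - q for p, q in zip(a, b)]

def kde(x, samples, h):
    """Kernel density estimate (2)."""
    return sum(gaussian_kernel(diff(x, s), h) for s in samples) / len(samples)

def error_bound_term(samples, labels, h):
    """Leading term of (29): (2 / n^2) * sum of G_lm over differently labeled pairs."""
    n = len(samples)
    dens = [kde(x, samples, h) for x in samples]
    total = 0.0
    for l in range(n):
        for m in range(n):
            if labels[l] != labels[m]:  # theta_lm = 1
                k = gaussian_kernel(diff(samples[l], samples[m]), h)
                total += k / math.sqrt(dens[l] * dens[m])  # G_lm from (30)
    return 2.0 * total / n ** 2

samples = [(0.0,), (0.2,), (0.4,), (3.0,), (3.2,), (3.4,)]
good = error_bound_term(samples, [1, 1, 1, 2, 2, 2], h=0.5)
bad = error_bound_term(samples, [1, 2, 1, 2, 1, 2], h=0.5)
print(good, bad)
```

The labeling that follows the two visible groups yields a much smaller value than the interleaved one, in line with the low density separation behavior described in the introduction.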

###### Proof

For any , by (23)

 (31)

Since

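$$\mathbb{E}_X\left[\eta^{(i)}(X)\, \eta^{(j)}(X)\right] = \int_{\mathcal{X}} \frac{\pi^{(i)} f^{(i)}(x)}{f^{\frac{1}{2}}(x)} \cdot \frac{\pi^{(j)} f^{(j)}(x)}{f^{\frac{1}{2}}(x)}\, dx,$$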

, , and are Hölder- smooth, and is also Hölder- smooth. For any sequence that satisfies assumption (B), we obtain the kernel estimator of using the generalized kernel density estimator (25):

$$\tilde{\eta}^{(i)}_n(x) = \frac{1}{n} \sum_{l=1}^{n} \frac{K_{\tilde{h}_n}(x - X_l)\, \mathbb{1}_{\{Y_l = i\}}}{f^{\frac{1}{2}}(X_l)} \tag{32}$$

By Lemma 4, for . (This can be verified by applying Lemma 4 to .) It follows that

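$$\mathbb{E}_X\left[\eta^{(i)}(X)\, \eta^{(j)}(X)\right] \le \mathbb{E}_X\left[\tilde{\eta}^{(i)}_n(X)\, \tilde{\eta}^{(j)}_n(X)\right] + O(h_n^{\gamma}) \tag{33}$$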

Also,

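$$\sum_{i,j=1,\dots,Q,\ i \neq j} \mathbb{E}_X\left[\tilde{\eta}^{(i)}_n(X)\, \tilde{\eta}^{(j)}_n(X)\right] = \frac{1}{n^2} \sum_{l,m} \frac{K_{\sqrt{2}\tilde{h}_n}(X_l - X_m)}{f^{\frac{1}{2}}(X_l)\, f^{\frac{1}{2}}(X_m)}\, \theta_{lm}$$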

and the last equality follows from the convolution of two Gaussian kernels. Let , then satisfies assumption (B). The kernel density estimator satisfies by Theorem 1. Therefore, with large enough,

 ∣∣ ∣∣∑i,j=1,...,Q,i≠jIEX[~η(i)n(X)~η(j)n(X)]−1n2∑l,mGlm,hnθlm∣∣ ∣∣ (34) =1n2∣∣ ∣ ∣∣∑l,m