K-medoids Clustering of Data Sequences with Composite Distributions

This paper studies clustering of data sequences using the k-medoids algorithm. All the data sequences are assumed to be generated from unknown continuous distributions, which form clusters with each cluster containing a composite set of closely located distributions (based on a certain distance metric between distributions). The maximum intra-cluster distance is assumed to be smaller than the minimum inter-cluster distance, and both values are assumed to be known. The goal is to group data sequences together if their underlying generative distributions (which are unknown) belong to the same cluster. K-medoids algorithms based on distribution distance metrics are proposed for both known and unknown numbers of distribution clusters. Upper bounds on the error probability and convergence results in the large sample regime are also provided. It is shown that the error probability decays exponentially fast as the number of samples in each data sequence goes to infinity. The error exponent has a simple form regardless of the distance metric applied when certain conditions are satisfied. In particular, the error exponent is characterized when either the Kolmogorov-Smirnov distance or the maximum mean discrepancy is used as the distance metric. Simulation results are provided to validate the analysis.


I Introduction

This paper aims to cluster sequences generated by unknown continuous distributions into classes so that each class contains all the sequences generated from the same (composite) distribution cluster. By a sequence, we mean a set of feature observations generated by an underlying probability distribution. Each distribution cluster contains a set of distributions that are close to each other, whereas different clusters are assumed to be far away from each other, based on a certain distance metric between distributions. To be more concrete, we assume that the maximum intra-cluster distance (or an upper bound on it) is smaller than the minimum inter-cluster distance (or a lower bound on it). This assumption is necessary for the clustering problem to be meaningful. The distance metrics used to characterize the difference between data sequences and the corresponding distributions are assumed to have certain properties, summarized in Assumption 1 in Section II.

Such unsupervised learning problems have been widely studied [1, 2]. They are typically solved by applying clustering methods, e.g., k-means clustering [3, 4, 5] and k-medoids clustering [6, 7, 8], where the data sequences are viewed as multivariate data with the Euclidean distance as the distance metric. These clustering algorithms usually require knowledge of the number of clusters, and they differ in how the initial centers are determined. One reasonable way is to choose a data sequence as a center if it has the largest minimum distance to all the existing centers [9, 10, 11]. Alternatively, all the initial centers can be chosen randomly [12]. When the number of clusters is unknown, there are typically two alternative approaches for clustering. One approach starts with a small number of clusters that underestimates the true number and proceeds to split the existing clusters until convergence [13, 11]. The authors in [13] assumed a maximum number of clusters, and the threshold for clustering depended on a pre-determined significance level of the two-sample Kolmogorov-Smirnov (KS) test, whereas the algorithm proposed in [11] did not assume a maximum number of clusters and its clustering threshold was a function of the intra-cluster and inter-cluster distances. Alternatively, one may start with an overestimate of the number of clusters, e.g., by treating every sequence as its own cluster, and proceed to merge clusters that are deemed close to each other [11]. The algorithms in [9, 12, 13] were all validated by simulation results, without an accompanying analysis of the error probability.

There are some key differences between the k-means algorithm and the k-medoids algorithm. The k-means algorithm minimizes a sum of squared Euclidean distances, whereas the k-medoids algorithm selects actual data sequences as centers and minimizes a sum of arbitrary distances, which makes it more robust to outliers and noise [14, 15]. Moreover, the k-means algorithm requires updating the distances between the data sequences and the corresponding centroids in every iteration, whereas the k-medoids algorithm only requires the pairwise distances between the data sequences, which can be computed beforehand. Thus, the k-medoids algorithm outperforms the k-means algorithm in terms of computational complexity as the number of sequences increases [16].

Most prior research focused on computational complexity analysis, whereas the error probability and the performance comparison of different clustering algorithms were typically studied through simulations, e.g., [7, 17, 8, 16]. This paper theoretically analyzes the error probability of the k-medoids algorithm, especially in the asymptotic regime. Furthermore, in contrast to previous studies, which frequently used vector norms, e.g., the Euclidean distance, as the distance metric, our study adopts distance metrics between distributions for clustering in order to capture the statistical models of the data sequences considered in this paper. This formulation based on a distributional distance metric is naturally suited to the proposed clustering problem, where each data point, i.e., each data sequence, represents a probability distribution and each cluster is a collection of closely related distributions, i.e., composite hypotheses.

Various distance metrics that take the distribution properties into account are reasonable choices, e.g., the KS distance [11, 10, 13, 12] and the maximum mean discrepancy (MMD) [18]. Our previous work [11, 10] has shown that KS distance based k-medoids algorithms are exponentially consistent for both known and unknown numbers of clusters. In this paper, we consider a much more general framework: instead of focusing on a specific distance metric such as the KS distance used in our prior work [11, 10], we allow any distance metric that satisfies Assumption 1 to be used for clustering. The rationale is the following: with any distance metric that captures the statistical model of the data sequences, one expects that for a large sample size, 1) sequences generated from the same distribution cluster are close to each other, and 2) sequences generated from different distribution clusters are far away from each other. Thus, if the minimum distance between different distribution clusters is greater than the maximum diameter of all the distribution clusters defined in Section II, then it becomes unlikely to make a clustering error as the sample size increases. In this paper, we develop k-medoids distribution clustering algorithms in which the distances between distributions are selected to capture the underlying statistical model of the data. We analyze the error probability of the proposed algorithms, which decays exponentially as the sample size increases under Assumption 1. Furthermore, beyond the KS distance, we also consider another distance metric, namely MMD, and show that both satisfy Assumption 1, so that the error probability of the proposed algorithms decays exponentially fast when either the KS distance or MMD is used as the distance metric.

We note that recent studies [19, 20, 21, 22, 23] of anomaly detection and classification problems also took the statistical model of the data sequences into account. Our focus here is on the clustering problem, which leads to an error performance analysis that is substantially different from that in [19, 20, 21, 22, 23].

The rest of the paper is organized as follows. In Section II, the system model of the clustering problem, the preliminaries of the KS distance and MMD, and notations are introduced. The clustering algorithm given the number of clusters and the corresponding upper bound on the error probability are provided in Section III, followed by the results of the clustering algorithms with an unknown number of clusters in Section IV. The simulation results for the KS and MMD based algorithms are provided in Section V.

II System Model and Preliminaries

II-A Clustering Problem

Suppose there are distribution clusters denoted by for , where is fixed. Define the intra-cluster distance of and the inter-cluster distance between and for respectively as

(1)

where is a distance metric between distributions, e.g., the KS distance or MMD defined later in (5) and (6) respectively. Then and represent the diameter of and the distance between and , respectively. Define

(2)

Table I summarizes the notations of the generalized form of distances defined in (1) and (2) which will be used in the following discussion.

general KS MMD
TABLE I: Notations

Suppose data sequences are generated from the distributions in , and hence a total of sequences are to be clustered. Without loss of generality, assume that each sequence consists of independent and identically distributed (i.i.d.) samples generated from for and . The -th observation in is denoted by , where and . Note that for any fixed , ’s are not necessarily distinct. Namely, for a fixed , some ’s can be generated from the same distribution. Although the following discussion assumes that all the data sequences have the same length, our analysis can be easily extended to the case with different sequence lengths by replacing with the minimum sequence length. In order to analyze the error probability of the clustering algorithm, we introduce an assumption on the concentration properties of the distance metrics:

Assumption 1.

For any distribution clusters and any sequences , and , where , the following inequalities hold:

(3a)
(3b)
(3c)
(3d)

where ’s are some constants independent of distributions, () is a function of and is the sample size. ∎

Here (3a) guarantees that the lower bound of inter-cluster distances is greater than the upper bound of intra-cluster distances. (3b) guarantees that the probability that the distance between two sequences generated from different distribution clusters is smaller than decays exponentially fast. (3c) guarantees that the probability that the distance between two sequences generated from the same distribution cluster is greater than decays exponentially fast. (3d) guarantees that given two sequences generated from the same cluster and a third sequence generated from another distribution cluster, the probability that the first sequence is actually closer to the third sequence decays exponentially fast.
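As an illustration only, with assumed symbols $d(\cdot,\cdot)$ for the distance metric, $n$ for the sample size, $d_U$ and $d_L$ for the maximum intra-cluster and minimum inter-cluster distances, and $C_i$, $f_i$, $t_i$ for the constants, functions, and thresholds mentioned above, bounds of the type described in (3a)-(3d) can be sketched as follows; this is not the paper's exact statement of (3).

```latex
\begin{align*}
  &\text{(3a$'$)} \quad d_L > d_U,\\
  &\text{(3b$'$)} \quad \Pr\{ d(x_i, x_j) \le t_1 \} \le C_1 e^{-n f_1},
      \qquad x_i,\, x_j \text{ from different clusters},\\
  &\text{(3c$'$)} \quad \Pr\{ d(x_i, x_j) \ge t_2 \} \le C_2 e^{-n f_2},
      \qquad x_i,\, x_j \text{ from the same cluster},\\
  &\text{(3d$'$)} \quad \Pr\{ d(x_i, x_k) \le d(x_i, x_j) \} \le C_3 e^{-n f_3},
      \qquad x_i,\, x_j \text{ from one cluster},\ x_k \text{ from another}.
\end{align*}
```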

A clustering output is said to be incorrect if and only if the sequences generated by different distribution clusters are assigned to the same cluster, or sequences generated by the same distribution cluster are assigned to more than one cluster. Denote by the error probability of a clustering algorithm. A clustering algorithm is said to be consistent if

where is the sample size. The algorithm is said to be exponentially consistent if

For the case where a clustering algorithm is exponentially consistent, we are also interested in characterizing the error exponent .
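In standard notation, with $P_e(n)$ an assumed symbol for the error probability at sample size $n$, these two notions are typically written as below, and the error exponent is the value of the limit inferior in the second display.

```latex
P_e(n) \;\xrightarrow[\,n \to \infty\,]{}\; 0 \quad \text{(consistency)},
\qquad
\liminf_{n \to \infty} \; -\frac{1}{n} \log P_e(n) \;>\; 0 \quad \text{(exponential consistency)}.
```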

II-B Preliminaries of KS Distance

Suppose is generated by the distribution , where . Then the empirical cumulative distribution function (c.d.f.) induced by is given by

(4)

where is the indicator function. Let the c.d.f. of evaluated at be . The KS distance is defined as

(5)

where and can be either c.d.f.'s of distributions or empirical c.d.f.'s induced by sequences.
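As a concrete illustration of the empirical c.d.f. in (4) and the KS distance in (5), the following minimal sketch estimates the two-sample KS distance directly from raw samples. It is our own illustrative implementation of the standard definitions; the function names and example parameters are assumptions, not the paper's code.

```python
import numpy as np

def empirical_cdf(samples, t):
    """Empirical c.d.f. of `samples` evaluated at the points `t` (cf. (4))."""
    samples = np.sort(np.asarray(samples))
    # Fraction of samples less than or equal to each evaluation point.
    return np.searchsorted(samples, t, side="right") / samples.size

def ks_distance(x, y):
    """Two-sample KS distance sup_t |F_x(t) - F_y(t)| (cf. (5)).

    Both empirical c.d.f.'s are step functions, so the supremum is attained
    on the pooled set of sample points.
    """
    grid = np.concatenate([np.asarray(x), np.asarray(y)])
    return float(np.max(np.abs(empirical_cdf(x, grid) - empirical_cdf(y, grid))))

# Example: two sequences drawn from nearby Gaussian distributions.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.1, 1.0, size=500)
print(ks_distance(x, y))
```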

Proposition 1.

The KS distance satisfies (3b) - (3d) if .

Proof.

See Lemmas .5, .3, and .7 in Appendix -A. ∎

II-C Preliminaries of MMD

Suppose includes a class of probability distributions, and suppose is the reproducing kernel Hilbert space (RKHS) associated with a kernel . Define a mapping from to such that each distribution is mapped into an element in as follows

where is referred to as the mean embedding of the distribution into the Hilbert space . Due to the reproducing property of , it is clear that for all .

It has been shown in [24, 25, 26, 27] that for many RKHSs, such as those associated with Gaussian and Laplace kernels, the mean embedding is injective, i.e., each is mapped to a unique element . In this way, many machine learning problems with unknown distributions can be solved by studying mean embeddings of probability distributions without actually estimating the distributions, e.g., [28, 29, 21, 22]. In order to distinguish between two distributions and , the authors in [30] introduced the following notion of MMD based on the mean embeddings and of and , respectively:

(6)

An unbiased estimator of based on and , which consist of and samples respectively, is given by

(7)

Assume that the kernel is bounded, i.e., , where is finite.
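For MMD, a common choice in this setting is the unbiased U-statistic estimator of the squared MMD. The sketch below implements that standard estimator with a Gaussian kernel for scalar observations; the kernel, bandwidth, and function names are our own assumptions, and the block is not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(u, v) = exp(-(u - v)^2 / (2 sigma^2)) evaluated on all pairs."""
    diff = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased U-statistic estimate of the squared MMD between samples x and y."""
    x, y = np.asarray(x), np.asarray(y)
    n, m = x.size, y.size
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # Drop the diagonal terms so the within-sample averages are unbiased.
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.5, 1.0, size=300)
print(mmd2_unbiased(x, y))
```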

Proposition 2.

The MMD distance satisfies (3b) - (3d) if .

Proof.

See Lemmas .6, .4 and .8 in Appendix -A. ∎

II-D Additional Notations

The following notations are used in the algorithms and the corresponding proofs. Let be the -th cluster obtained at the -th cluster update step and let , and be the centers of the -th cluster obtained by the center update step, merge step and split step of the -th iteration respectively for . Moreover, let be the -th cluster obtained at the initialization step and be the corresponding center. Let () be the number of centers before the -th cluster update step and the -th split step. Moreover, use to denote the number of centers obtained at the center initialization step. For simplicity, all the superscripts are omitted in the following discussion when there is no ambiguity.

To further simplify the notation in the algorithms and the proofs, let denote the data sequence set . However, the one-to-one mapping from onto is not fixed, i.e., given a fixed , can be any sequence in unless other constraints are imposed. Denote by if is generated from a distribution . Furthermore, define a set of integers

where and .

III Known Number of Clusters

In this section, we study the clustering algorithm when the number of clusters is known. The method proposed in [9] is used for center initialization, as described in Algorithm 1: the initial centers are chosen sequentially such that the center of the -th cluster is the sequence with the largest minimum distance to the previously chosen centers. The clustering algorithm itself is presented in Algorithm 2. Given the centers, each sequence is assigned to the cluster whose center it is closest to. Within each cluster, the sequence whose sum of distances to all the sequences in the cluster is smallest is then selected as the new center. The algorithm iterates until the clustering result converges.

1:Input: Data sequences , number of clusters .
2:Output: Partitions .
3:{Center initialization}
4:Arbitrarily choose one as .
5:for  do
6:     
7:end for
8:{Cluster initialization}
9:Set for .
10:for  do
11:     , where
12:end for
13:Return
Algorithm 1 Initialization with known
1:Input: Data sequences , number of clusters .
2:Output: Partition set .
3:Initialize by Algorithm 1.
4:while not converge do
5:     {Center update}
6:     for  do
7:         
8:     end for
9:     
10:     for  do
11:         if  and  then
12:               and .
13:         end if
14:     end for
15:end while
16:Return
Algorithm 2 Clustering with known
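As a rough, self-contained sketch of how Algorithms 1 and 2 fit together (farthest-first center initialization followed by alternating medoid updates and cluster assignments on a precomputed pairwise distance matrix), the code below is our own illustrative implementation rather than the paper's reference code; `pairwise_dist` is assumed to be an n-by-n matrix of sequence distances, e.g., built from the KS or MMD estimators sketched above.

```python
import numpy as np

def kmedoids_known_k(pairwise_dist, k, max_iter=100):
    """K-medoids clustering given a precomputed distance matrix (illustrative)."""
    # Farthest-first initialization (cf. Algorithm 1): the next center is the
    # sequence with the largest minimum distance to the centers chosen so far.
    centers = [0]
    while len(centers) < k:
        min_dist = pairwise_dist[:, centers].min(axis=1)
        centers.append(int(np.argmax(min_dist)))
    # Cluster initialization: assign each sequence to its closest initial center.
    labels = np.argmin(pairwise_dist[:, centers], axis=1)

    for _ in range(max_iter):
        # Center update: within each cluster, pick the sequence minimizing the
        # sum of distances to all members of that cluster.
        new_centers = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:          # keep the old center if a cluster empties
                new_centers.append(centers[c])
                continue
            within = pairwise_dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(within)]))
        # Cluster update: reassign each sequence to its closest (new) center.
        new_labels = np.argmin(pairwise_dist[:, new_centers], axis=1)
        if np.array_equal(new_labels, labels) and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers
```

In this sketch the iteration stops once neither the labels nor the medoids change, mirroring the convergence criterion described above.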

The following theorem provides the convergence guarantee for Algorithm 2 via an upper bound on the error probability.

Theorem III.1.

Algorithm 2 converges after a finite number of iterations. Moreover, under Assumption 1, the error probability of Algorithm 2 after iterations is upper bounded as follows

where , , and are as defined in Assumption 1.

Outline of the Proof.

The idea of proving the upper bound on the error probability is as follows. We first prove that the error probability at the initialization step decays exponentially. Note that the event that an error occurs during the first iterations is the union of the events that an error occurs at the -th step while all previous iterations are correct. Thus, if we prove that the error probability at the -th step, given correct updates in all previous iterations, decays exponentially, then so does the error probability of the algorithm by the union bound argument. See Appendix -B1 for details. ∎

Theorem III.1 shows that for any given , any distance metric satisfying Assumption 1 yields an exponentially consistent k-medoids clustering algorithm with the error exponent .

Corollary III.1.1.

Suppose the KS distance and MMD are used for Algorithms 1 and 2, then

Proof.

By Propositions 1 and 2, the upper bound on the error probability of Algorithm 2 applies to the KS distance and MMD. Thus, the corollary is obtained by substituting the values specified in Lemmas .3 - .8 in the upper bound. ∎

Corollary III.1.1 implies that Algorithm 2 is exponentially consistent under KS and MMD distance metrics with an error exponent no smaller than and , respectively.

IV Unknown Number of Clusters

In this section, we propose the merge- and split-based algorithms for estimating the number of clusters as well as grouping the sequences.

IV-A Merge Step

If a distance metric satisfies (3c) and two sequences generated by distributions within the same cluster are assigned as centers, then, with high probability, the distance between the two centers is small. This is the premise of the clustering algorithm based on merging centers that are close to each other.

The proposed approach is summarized in Algorithms 3 and 4. There are two major differences between Algorithms 3 and 4 and Algorithms 1 and 2. First, the center initialization step of Algorithm 3 keeps generating an increasing number of centers until every sequence is close to one of the existing centers. Second, an additional Merge Step in Algorithm 4 combines clusters whose centers are close to each other.

1:Input: Data sequences and threshold .
2:Output: Partitions .
3:
4:Arbitrarily choose one as and set .
5:while  do
6:     
7:     
8:end while
9:Clustering initialization specified in Algorithm 1.
10:Return
Algorithm 3 Merge-based initialization with unknown
1:Input: Data sequences and threshold .
2:Output: Partition set .
3:Initialize by Algorithm 3.
4:while not converge do
5:     Center update specified in Algorithm 2.
6:     
7:     for  and  do
8:         if  then
9:              if  then
10:                   and delete and .
11:              else
12:                   and delete and .
13:              end if
14:              .
15:         end if
16:     end for
17:     Cluster update specified in Algorithm 2.
18:end while
19:Return
Algorithm 4 Merge-based clustering with unknown
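To make the merge-based procedure concrete, the hedged sketch below keeps adding centers until every sequence is within a threshold of some center (cf. Algorithm 3) and then repeatedly merges centers closer than the threshold while updating clusters and medoids (cf. Algorithm 4). It is an illustration under assumed interfaces (`pairwise_dist`, `threshold`), not the paper's exact procedure.

```python
import numpy as np

def merge_based_clustering(pairwise_dist, threshold, max_iter=100):
    """Merge-based k-medoids with an unknown number of clusters (illustrative)."""
    # Initialization (cf. Algorithm 3): keep adding the farthest sequence as a
    # new center until every sequence lies within `threshold` of some center.
    centers = [0]
    while True:
        min_dist = pairwise_dist[:, centers].min(axis=1)
        if min_dist.max() <= threshold:
            break
        centers.append(int(np.argmax(min_dist)))

    for _ in range(max_iter):
        # Merge step (cf. Algorithm 4): keep only centers that are farther than
        # `threshold` from every previously kept center.
        kept = []
        for c in centers:
            if all(pairwise_dist[c, other] > threshold for other in kept):
                kept.append(c)
        centers = kept
        # Cluster update and center update as in the known-k sketch.
        labels = np.argmin(pairwise_dist[:, centers], axis=1)
        new_centers = []
        for idx in range(len(centers)):
            members = np.flatnonzero(labels == idx)
            if members.size == 0:
                new_centers.append(centers[idx])
                continue
            within = pairwise_dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(within)]))
        if new_centers == centers:
            return labels, centers
        centers = new_centers
    return np.argmin(pairwise_dist[:, centers], axis=1), centers
```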
Theorem IV.1.

Algorithm 4 converges after a finite number of iterations. Moreover, under Assumption 1, the error probability of Algorithm 4 after iterations is upper bounded as follows

where , , and are as defined in Assumption 1.

Proof.

The proof shares the same idea as that of Theorem III.1. See Appendix -B2 for details. ∎

Theorem IV.1 shows that the merge-based algorithm is exponentially consistent under any distance metric satisfying Assumption 1 with the error exponent .

Corollary IV.1.1.

Suppose the KS distance and MMD are used with and . Then the error probability of Algorithm 4 after iterations is upper bounded as follows

Proof.

By Propositions 1 and 2, the upper bound on the error probability of Algorithm 4 in IV.1 applies to the KS distance and MMD. Thus, the corollary is obtained by substituting the values specified in Lemmas .3 - .8 in the upper bound. ∎

Corollary IV.1.1 implies that Algorithm 4 is exponentially consistent under KS and MMD distance metrics with an error exponent no smaller than and , respectively.

IV-B Split Step

Suppose a cluster contains sequences generated by different distributions and the center is generated from . Then, if the distance metric satisfies (3b), the probability that a sequence generated from a distribution cluster other than is close to the center decays as the sample size increases. Therefore, it is reasonable to begin with one cluster and then split a cluster if it contains a sequence that has a large distance to the center. The corresponding algorithm is summarized in Algorithm 5.

1:Input: Data sequences and threshold .
2:Output: Partition set .
3:, and find by center update specified in Algorithm 2.
4:while not converge do
5:     {Split Step}
6:     if  then
7:         .
8:         
9:         
10:     end if
11:     Cluster update specified in Algorithm 2.
12:end while
13:Return
Algorithm 5 Split-based clustering with unknown
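Similarly, the sketch below mirrors the split logic of Algorithm 5: start from a single medoid and, whenever some sequence lies farther than a threshold from its assigned center, promote the farthest such sequence to a new center; otherwise refine the medoids until nothing changes. As before, `pairwise_dist` and `threshold` are assumed inputs and the code is illustrative only.

```python
import numpy as np

def split_based_clustering(pairwise_dist, threshold, max_iter=100):
    """Split-based k-medoids with an unknown number of clusters (illustrative)."""
    n = pairwise_dist.shape[0]
    # Start with a single cluster whose medoid minimizes the total distance.
    centers = [int(np.argmin(pairwise_dist.sum(axis=1)))]
    for _ in range(max_iter):
        labels = np.argmin(pairwise_dist[:, centers], axis=1)
        dist_to_center = pairwise_dist[np.arange(n), [centers[c] for c in labels]]
        # Split step: if some sequence is far from its assigned center, make the
        # farthest such sequence a new center and reassign in the next iteration.
        if dist_to_center.max() > threshold:
            centers.append(int(np.argmax(dist_to_center)))
            continue
        # Otherwise refine the medoids and stop at a fixed point.
        new_centers = []
        for idx in range(len(centers)):
            members = np.flatnonzero(labels == idx)
            if members.size == 0:
                new_centers.append(centers[idx])
                continue
            within = pairwise_dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(within)]))
        if new_centers == centers:
            return labels, centers
        centers = new_centers
    return np.argmin(pairwise_dist[:, centers], axis=1), centers
```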
Definition IV.1.1.

Suppose Algorithm 5 obtains clusters at the -th iteration, where and or . Then the correct clustering update result is that each cluster contains all the sequences generated from the distribution cluster that generates the center.

Theorem IV.2.

Algorithm 5 converges after a finite number of iterations. Moreover, under Assumption 1, the error probability of Algorithm 5 after iterations is upper bounded as follows

where , , and are as defined in Assumption 1.

Outline of the Proof.

An error occurs at the -th iteration if and only if the -th center is generated from one of the distribution clusters that generated the previous centers, or the clustering result is incorrect. Note that the error event over the first iterations is the union of the events that an error occurs at the -th iteration while the clustering results of all previous iterations are correct. Similar to the proof of Theorem III.1, the error probability is bounded using the union bound. See Appendix -B3 for more details. ∎

Theorem IV.2 shows that the split-based algorithm is exponentially consistent under any distance metric satisfying Assumption 1 with the error exponent .

Corollary IV.2.1.

Suppose the KS distance and MMD are used with and . Then the error probability of Algorithm 5 after iterations is upper bounded as follows

Proof.

By Propositions 1 and 2, the upper bound on the error probability of Algorithm 5 in Theorem IV.2 applies to the KS distance and MMD. Thus, the corollary is obtained by substituting the values specified in Lemmas .3 - .8 in the upper bound. ∎

Corollary IV.2.1 implies that Algorithm 5 is exponentially consistent under KS and MMD with an error exponent no smaller than and , respectively.

V Numerical Results

In this section, we provide some simulation results given and for . Moreover, . Gaussian distributions and Gamma distributions are used in the simulations. The probability density function (p.d.f.) of is defined as

where , and is the Gamma function, respectively. Specifically, the parameters of the distributions are , , and , where and . Note that when , all the sequences generated from the same distribution cluster are generated from a single distribution. The exponential kernel function is used in the simulations for the MMD distance, i.e.,

(8)
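To show how such a simulation could be wired up end to end, the sketch below draws sequences from two assumed Gaussian distribution clusters, builds the pairwise KS distance matrix, and runs the known-k sketch. It relies on `ks_distance` and `kmedoids_known_k` from the earlier blocks being in scope, and all parameter values are our own choices rather than the paper's configuration.

```python
import numpy as np

# Assumed toy setup: two distribution clusters, each containing two nearby
# Gaussian distributions; three sequences per distribution, n samples each.
rng = np.random.default_rng(1)
n = 200
cluster_means = {0: [0.0, 0.2], 1: [2.0, 2.2]}   # cluster id -> component means
sequences, truth = [], []
for cluster_id, mus in cluster_means.items():
    for mu in mus:
        for _ in range(3):
            sequences.append(rng.normal(mu, 1.0, size=n))
            truth.append(cluster_id)

# Pairwise KS distances between all sequences (ks_distance from the earlier sketch).
m = len(sequences)
pairwise_dist = np.zeros((m, m))
for i in range(m):
    for j in range(i + 1, m):
        d = ks_distance(sequences[i], sequences[j])
        pairwise_dist[i, j] = pairwise_dist[j, i] = d

# Cluster with the known-k sketch; label ids match the truth only up to permutation.
labels, centers = kmedoids_known_k(pairwise_dist, k=2)
print("estimated labels:", labels.tolist())
print("ground truth    :", truth)
```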

V-A Known Number of Clusters

Simulation results for a known number of clusters are shown in Fig. 1. One can observe from the figures that by using both the KS distance and MMD, is a linear function of the sample size, i.e., is exponentially consistent. Moreover, the logarithmic slope of with respect to , i.e., the quantity , increases as becomes smaller, which, in the current simulation setting, implies a larger .

Furthermore, a good distance metric for Algorithm 2 depends on the underlying distributions. This is because the underlying distributions have different distances under the KS distance and MMD, which results in different values of and .

Fig. 1: Performance of Algorithm 2. (a) Gaussian distributions; (b) Gamma distributions.

V-B Unknown Number of Clusters

With an unknown number of distribution clusters, the threshold specified in Corollaries IV.1.1 and IV.2.1 is used in the simulation. The performance of Algorithms 4 and 5 for the KS distance and MMD is shown in Figs. 2 and 3, respectively. For both the KS distance and MMD, 's are linear functions of the sample size when the sample size is large, and a larger implies a larger slope of . Furthermore, given the same value of , Algorithms 4 and 5 have similar performance under the KS distance, whereas Algorithm 5 outperforms Algorithm 4 under MMD given and . Intuitively, a smaller implies a larger in the current simulation setting, and thereby should result in better clustering performance for a given sample size. However, Fig. 2(b) indicates that Algorithms 4 and 5 with the KS distance perform better with than with when the sample size is small. This is likely due to the fact that the KS distance between two sequences is always lower bounded by . Thus, with small sample sizes, Algorithms 4 and 5 are likely to overestimate the number of clusters, which can be mitigated by increasing the threshold that controls the merging/splitting of cluster centers.

Fig. 2: Performance of Algorithms 4 and 5 for the KS distance. (a) Gaussian distributions; (b) Gamma distributions.
Fig. 3: Performance of Algorithms 4 and 5 for MMD. (a) Gaussian distributions; (b) Gamma distributions.

V-C Choice of

Note that in general , where . Theorems IV.1 and IV.2 only establish the exponential consistency of Algorithms 4 and 5, respectively. We now investigate the impact of different 's on the performance. One can observe from Fig. 4 that the choice of has a significant impact on the performance of Algorithms 4 and 5. The optimal depends on both the value of and the underlying distributions. Moreover, from Fig. 4(a)-(b), we can see that a smaller , which implies a larger , results in better performance for the KS distance, and the two algorithms always have similar performance. On the other hand, from Fig. 4(c)-(d), we can see that is a good choice for MMD if is of interest and (8) is used as the kernel function. Moreover, when is small, Algorithm 5 outperforms Algorithm 4 under MMD.

Fig. 4: Performance of Algorithms 4 and 5 given different . (a), (b) KS distance; (c), (d) MMD.

VI Conclusion

This paper studied the k-medoids algorithm for clustering data sequences generated from composite distributions. The convergence of the proposed algorithms and the upper bound on the error probability were analyzed for both known and unknown numbers of clusters. The error probability of the proposed algorithms was shown to decay exponentially for distance metrics that satisfy certain properties. In particular, the KS distance and MMD were shown to satisfy the required conditions, and hence the corresponding algorithms are exponentially consistent.

One possible generalization of the current work is to investigate the exponential consistency of other clustering algorithms under distributional distance metrics that satisfy properties similar to those in Assumption 1.

-A Technical Lemmas

The following technical lemmas are used to prove Corollaries III.1.1, IV.1.1 and IV.2.1. All the data sequences in Lemmas .3 - .8 are assumed to consist of i.i.d. samples.

Lemma .1.

[Dvoretzky-Kiefer-Wolfowitz Inequality[31]] Suppose consists of i.i.d. samples generated from . Then
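In standard notation, with $\hat{F}_n$ denoting the empirical c.d.f. of the $n$ i.i.d. samples and $F$ the underlying c.d.f. (symbols assumed here and possibly differing from the paper's), the two-sided Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant reads:

```latex
\Pr\Bigl\{ \sup_{t \in \mathbb{R}} \bigl| \hat{F}_n(t) - F(t) \bigr| > \epsilon \Bigr\}
  \;\le\; 2\, e^{-2 n \epsilon^2}, \qquad \epsilon > 0.
```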

Lemma .2.

[McDiarmid’s Inequality[32]]

Consider independent random variables

and a mapping . If for all , and for any , there exist for which

where

then for all probability measure and every ,

where denotes the expectation over the random variables .
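For reference, the standard bounded-differences form of McDiarmid's inequality, in notation assumed here, is:

```latex
% If changing the i-th argument of f alters its value by at most c_i, then for
% independent X_1, ..., X_n and every epsilon > 0:
\Pr\bigl\{ f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)] \ge \epsilon \bigr\}
  \;\le\; \exp\!\Bigl( -\frac{2 \epsilon^2}{\sum_{i=1}^{n} c_i^2} \Bigr).
```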

Lemmas .3 - .8 establish that the KS distance and MMD satisfy Assumption 1. Moreover, the lemmas provided in [10] are special cases of Lemmas .3, .5 and .7 with .

Lemma .3.

Suppose for , where and . Then for any ,

Proof.

Consider

where , and . The first inequality is due to the triangle inequality of the -norm and the property of the supremum, and the last inequality is due to Lemma .1. Therefore, by the continuity of the exponential function, we have

Lemma .3 implies that the KS distance satisfies (3c) for .

Lemma .4.

Suppose for , where and . Then for any ,

Proof.

Without loss of generality, for any , consider , where

Note that given , and ,

Since , by the triangle inequality of -norm, we have

where for any fixed . Moreover, notice that . Then

The last inequality is due to Lemma .2. ∎

Lemma .4 implies that MMD satisfies (3c) for .

Lemma .5.

Suppose two distribution clusters and satisfy (3a) under the KS distance. Assume that for , where . Then for any ,

Proof.

Similar to the proof of Lemma .3, we have

where , and . The last inequality is due to Lemma .1. Therefore, by the continuity of the exponential function, we have

Lemma .5 implies that the KS distance satisfies (3b) for .

Lemma .6.

Suppose two distribution clusters and satisfy (3a) under MMD. Assume that for , where . Then for any ,