This paper aims to cluster sequences generated by unknown continuous distributions into classes so that each class contains all the sequences generated from the same (composite) distribution cluster. By sequence, we mean a set of feature observations generated by an underlying probability distribution. Here each distribution cluster contains a set of distributions that are close to each other, whereas different clusters are assumed to be far away from each other based on a certain distance metric between distributions. To be more concrete, we assume that the maximum intra-cluster distance (or its upper bound) is smaller than the minimum inter-cluster distance (or its lower bound). This assumption is necessary for the clustering problem to be meaningful. The distance metrics used to characterize the difference of data sequences and corresponding distributions are assumed to have certain properties summarized in Assumption 1 in Section II.
The problem is typically solved by applying clustering methods, e.g., k-means clustering [3, 4, 5] and k-medoids clustering [6, 7, 8], where the data sequences are viewed as multivariate data with Euclidean distance as the distance metric. These clustering algorithms usually require knowledge of the number of clusters, and they differ in how the initial centers are determined. One reasonable way is to choose a data sequence as a center if it has the largest minimum distance to all the existing centers [9, 10, 11]. Alternatively, all the initial centers can be randomly chosen. With the number of clusters unknown, there are typically two alternative approaches for clustering. One starts with a small number of clusters, which is an underestimate of the true number, and proceeds to split the existing clusters until convergence [13, 11]. Among these split-based methods, one assumed a maximum number of clusters and used a clustering threshold that depended on a pre-determined significance level of the two-sample Kolmogorov-Smirnov (KS) test, whereas the other did not assume a maximum number of clusters and used a threshold that was a function of the intra-cluster and inter-cluster distances. Alternatively, one may start with an overestimated number of clusters, e.g., by treating every sequence as a cluster, and proceed to merge clusters that are deemed close to each other. The algorithms in [9, 12, 13] were all validated by simulation results without carrying out an analysis of the error probability.
There are some key differences between the k-means algorithm and the k-medoids algorithm. The k-means algorithm minimizes a sum of squared Euclidean distances. Meanwhile, the k-medoids algorithm assigns data sequences as centers and minimizes a sum of arbitrary distances, which makes it more robust to outliers and noise [14, 15]. Moreover, the k-means algorithm requires updating the distances between data sequences and the corresponding centroids in every iteration, whereas the k-medoids algorithm only requires the pairwise distances of the data sequences, which can be computed beforehand. Thus, the k-medoids algorithm outperforms the k-means algorithm in terms of computational complexity as the number of sequences increases.
Most prior research focused on computational complexity analysis, whereas the error probability and the performance comparison of different clustering algorithms were typically studied through simulations, e.g., [7, 17, 8, 16]. This paper attempts to theoretically analyze the error probability for the k-medoids algorithm, especially in the asymptotic regime. Furthermore, in contrast to previous studies, which frequently used vector norms as the distance metric, e.g., Euclidean distance, our study adopts distance metrics between distributions for clustering in order to capture the statistical models of data sequences considered in this paper. This formulation based on a distributional distance metric is uniquely suited to the proposed clustering problem, where each data point, i.e., each data sequence, represents a probability distribution and each cluster is a collection of closely related distributions, i.e., composite hypotheses.
Various distance metrics that take the distribution properties into account can be reasonable choices, e.g., KS distance [11, 10, 13, 12] and maximum mean discrepancy (MMD). Our previous work [11, 10] has shown that the KS distance based k-medoids algorithms are exponentially consistent for both known and unknown number of clusters. In this paper, we consider a much more general framework: instead of focusing on a specific distance metric such as the KS distance in our prior work [11, 10], we consider any distance metric that satisfies Assumption 1 for clustering. The rationale is the following: with any distance metric that captures the statistical model of data sequences, one is likely to observe that for a large sample size, 1) sequences generated from the same distribution cluster are close to each other, and 2) sequences generated from different distribution clusters are far away from each other. Thus, if the minimum distance between different distribution clusters is greater than the maximum diameter of all the distribution clusters defined in Section II, then it becomes unlikely to make a clustering error as the sample size increases. In this paper, we develop k-medoids distribution clustering algorithms where the distances between distributions are selected to capture the underlying statistical model of the data. We analyze the error probability of the proposed algorithms, which takes the form of exponential decay as the sample size increases under Assumption 1. Furthermore, beyond the KS distance, we also consider another distance metric, namely MMD, and show that they both satisfy Assumption 1, so that the error probability of the proposed algorithms decays exponentially fast if the KS distance or MMD is used as the distance metric.
Prior studies of anomaly detection problems and classification also took into account the statistical model of the data sequences. Our focus here is on the clustering problem, leading to error performance analysis that is substantially different from that in [19, 20, 21, 22, 23].
The rest of the paper is organized as follows. In Section II, the system model of the clustering problem, the preliminaries of the KS distance and MMD, and notations are introduced. The clustering algorithm given the number of clusters and the corresponding upper bound on the error probability are provided in Section III, followed by the results of the clustering algorithms with an unknown number of clusters in Section IV. The simulation results for the KS and MMD based algorithms are provided in Section V.
II System Model and Preliminaries
II-A Clustering Problem
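Suppose there are $K$ distribution clusters denoted by $\mathcal{P}_k$ for $k = 1, \ldots, K$, where $K$ is fixed. Define the intra-cluster distance of $\mathcal{P}_k$ and the inter-cluster distance between $\mathcal{P}_k$ and $\mathcal{P}_{k'}$ for $k \neq k'$ respectively as
$$d(\mathcal{P}_k) = \sup_{p, p' \in \mathcal{P}_k} d(p, p'),$$
$$d(\mathcal{P}_k, \mathcal{P}_{k'}) = \inf_{p \in \mathcal{P}_k,\, p' \in \mathcal{P}_{k'}} d(p, p'),$$
where $d(\cdot, \cdot)$ is a distance metric between distributions, e.g., the KS distance or MMD defined later in (5) and (6) respectively. Then $d(\mathcal{P}_k)$ and $d(\mathcal{P}_k, \mathcal{P}_{k'})$ represent the diameter of $\mathcal{P}_k$ and the distance between $\mathcal{P}_k$ and $\mathcal{P}_{k'}$, respectively. Define
$$d_L = \max_{k} d(\mathcal{P}_k), \qquad d_H = \min_{k \neq k'} d(\mathcal{P}_k, \mathcal{P}_{k'}).$$
Suppose $M_k$ data sequences are generated from the distributions in $\mathcal{P}_k$, and hence a total of $M = \sum_{k=1}^{K} M_k$ sequences are to be clustered. Without loss of generality, assume that each sequence $\mathbf{x}_{k,j}$ consists of $n$ independent and identically distributed (i.i.d.) samples generated from $p_{k,j} \in \mathcal{P}_k$ for $k = 1, \ldots, K$ and $j = 1, \ldots, M_k$. The $i$-th observation in $\mathbf{x}_{k,j}$ is denoted by $x_{k,j}[i]$, where $i = 1, \ldots, n$. Note that for any fixed $k$, the $p_{k,j}$'s are not necessarily distinct. Namely, for a fixed $k$, some $\mathbf{x}_{k,j}$'s can be generated from the same distribution. Although the following discussion assumes that all the data sequences have the same length, our analysis can be easily extended to the case with different sequence lengths by replacing $n$ with the minimum sequence length. In order to analyze the error probability of the clustering algorithm, we introduce an assumption relating to the concentration property of the distance metrics: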
For any distribution clusters and any sequences , and , where , the following inequalities hold:
where ’s are some constants independent of distributions, () is a function of and is the sample size. ∎
Here (3a) guarantees that the lower bound of inter-cluster distances is greater than the upper bound of intra-cluster distances. (3b) guarantees that the probability that the distance between two sequences generated from different distribution clusters is smaller than decays exponentially fast. (3c) guarantees that the probability that the distance between two sequences generated from the same distribution cluster is greater than decays exponentially fast. (3d) guarantees that given two sequences generated from the same cluster and a third sequence generated from another distribution cluster, the probability that the first sequence is actually closer to the third sequence decays exponentially fast.
A clustering output is said to be incorrect if and only if sequences generated by different distribution clusters are assigned to the same cluster, or sequences generated by the same distribution cluster are assigned to more than one cluster. Denote by $P_e$ the error probability of a clustering algorithm. A clustering algorithm is said to be consistent if
$$\lim_{n \to \infty} P_e = 0,$$
where $n$ is the sample size. The algorithm is said to be exponentially consistent if
$$\lim_{n \to \infty} -\frac{1}{n} \log P_e > 0.$$
For the case where a clustering algorithm is exponentially consistent, we are also interested in characterizing the error exponent $-\frac{1}{n} \log P_e$.
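As a brief worked illustration of how an exponential bound translates into an error exponent, suppose the error probability satisfies a bound of the form $P_e \le a e^{-bn}$. The symbols $P_e$ (error probability), $n$ (sample size), and the positive constants $a$ and $b$ are notation assumed here for the example only. Taking logarithms gives:

```latex
-\frac{1}{n}\log P_e
  \;\ge\; -\frac{1}{n}\log\!\left(a\,e^{-bn}\right)
  \;=\; b - \frac{\log a}{n}
  \;\xrightarrow[\,n \to \infty\,]{}\; b \;>\; 0 .
```

Under this assumed bound the algorithm is exponentially consistent, and $b$ is a lower bound on the achievable error exponent; the multiplicative constant $a$ does not affect the exponent.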
II-B Preliminaries of KS Distance
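Suppose $\mathbf{x}$ is generated by the distribution $p$, where $\mathbf{x} = (x[1], \ldots, x[n])$ with $x[i] \in \mathbb{R}$. Then the empirical cumulative distribution function (c.d.f.) induced by $\mathbf{x}$ is given by
$$\hat{F}_{\mathbf{x}}(a) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x[i] \le a\},$$
where $\mathbf{1}\{\cdot\}$ is the indicator function. Let the c.d.f. of $p$ evaluated at $a$ be $F_p(a)$. The KS distance is defined as
$$d_{KS}(F, G) = \sup_{a \in \mathbb{R}} |F(a) - G(a)|, \tag{5}$$
where $F$ and $G$ can be either c.d.f.'s of distributions or empirical c.d.f.'s induced by sequences.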
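The KS distance between two data sequences can be computed directly from their empirical c.d.f.'s. The following is a minimal sketch, assuming one-dimensional real-valued samples; the function name `ks_distance` is hypothetical and chosen for illustration only.

```python
import numpy as np


def ks_distance(x, y):
    """Sup-norm distance between the empirical c.d.f.s of samples x and y."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # Both empirical c.d.f.s are right-continuous step functions that only
    # change at observed samples, so the supremum is attained at a sample.
    grid = np.concatenate([x, y])
    # side="right" counts the samples that are <= each grid point.
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))
```

The two sequences do not need to have the same length, since each empirical c.d.f. is normalized by its own sample size.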
II-C Preliminaries of MMD
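Suppose $\mathcal{P}$ includes a class of probability distributions, and suppose $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) associated with a kernel $k(\cdot, \cdot)$. Define a mapping from $\mathcal{P}$ to $\mathcal{H}$ such that each distribution $p \in \mathcal{P}$ is mapped into an element in $\mathcal{H}$ as follows
$$\mu_p(\cdot) = \mathbb{E}_p[k(\cdot, x)] = \int k(\cdot, x)\, dp(x),$$
where $\mu_p(\cdot)$ is referred to as the mean embedding of the distribution $p$ into the Hilbert space $\mathcal{H}$. Due to the reproducing property of $\mathcal{H}$, it is clear that $\mathbb{E}_p[f] = \langle \mu_p, f \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.
The MMD between two distributions $p$ and $q$ is the distance between their mean embeddings,
$$\mathrm{MMD}(p, q) = \| \mu_p - \mu_q \|_{\mathcal{H}}. \tag{6}$$
An unbiased estimator of $\mathrm{MMD}^2(p, q)$ based on $\mathbf{x}$ and $\mathbf{y}$, which consist of $n$ and $m$ samples, respectively, is given by
$$\mathrm{MMD}_u^2[\mathbf{x}, \mathbf{y}] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x[i], x[j]) + \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i} k(y[i], y[j]) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x[i], y[j]).$$
Assume that the kernel is bounded, i.e., $0 \le k(x, y) \le G$ for all $x$ and $y$, where $G$ is finite.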
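The unbiased estimator of the squared MMD can be sketched as follows. This is an illustrative implementation under stated assumptions: the samples are one-dimensional, and a Gaussian kernel with a hypothetical `bandwidth` parameter stands in for the kernel, which need not match the kernel used in the simulations of Section V. The names `gaussian_kernel` and `mmd2_unbiased` are likewise assumptions.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gram matrix of a bounded Gaussian kernel (an assumed kernel choice)."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, kernel=gaussian_kernel):
    """Unbiased U-statistic estimate of the squared MMD between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    if n < 2 or m < 2:
        raise ValueError("each sequence needs at least two samples")
    kxx, kyy, kxy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Within-sequence terms exclude the diagonal (i == j) entries.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    # The cross term averages over all n * m pairs.
    return float(term_x + term_y - 2.0 * kxy.mean())
```

Because the estimator is unbiased rather than non-negative by construction, it can return small negative values when the two sequences come from the same or very similar distributions.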
II-D Additional Notations
The following notations are used in the algorithms and the corresponding proofs. Let be the -th cluster obtained at the -th cluster update step and let , and be the centers of the -th cluster obtained by the center update step, merge step and split step of the -th iteration respectively for . Moreover, let be the -th cluster obtained at the initialization step and be the corresponding center. Let () be the number of centers before the -th cluster update step and the -th split step. Moreover, use to denote the number of centers obtained at the center initialization step. For simplicity, all the superscripts are omitted in the following discussion when there is no ambiguity.
To further simplify the notation in the algorithms and the proofs, let denote the data sequence set . However, the one-to-one mapping from onto is not fixed, i.e., given a fixed , can be any sequence in unless other constraints are imposed. Denote by if is generated from a distribution . Furthermore, define a set of integers
where and .
III Known Number of Clusters
In this section, we study the clustering algorithm when the number of clusters is known. Center initialization follows a previously proposed method, as described in Algorithm 1. The initial centers are chosen sequentially such that the center of each new cluster is the sequence that has the largest minimum distance to the previous centers. The clustering algorithm itself is presented in Algorithm 2. Given the centers, each sequence is assigned to the cluster whose center is closest to it. Within a given cluster, the sequence with the smallest sum of distances to all the sequences in the cluster is then selected as the new center. The algorithm continues until the clustering result converges.
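The initialization and update rules described above can be sketched on a precomputed pairwise distance matrix. This is a minimal illustration rather than a verbatim rendering of Algorithms 1 and 2: the function names, the use of the first sequence as the first center, the tie-breaking of `argmin`/`argmax`, and the `max_iter` cap are all assumptions made for the example.

```python
import numpy as np


def init_centers(dist, num_clusters, first=0):
    """Pick centers sequentially by the largest minimum distance rule."""
    dist = np.asarray(dist, dtype=float)
    centers = [first]  # assumed choice of the first center
    while len(centers) < num_clusters:
        min_to_centers = dist[:, centers].min(axis=1)
        min_to_centers[centers] = -np.inf  # never re-pick an existing center
        centers.append(int(np.argmax(min_to_centers)))
    return centers


def k_medoids(dist, num_clusters, max_iter=100):
    """Alternate cluster assignment and medoid update until centers stop changing."""
    dist = np.asarray(dist, dtype=float)
    centers = init_centers(dist, num_clusters)
    for _ in range(max_iter):
        # Cluster update: assign each sequence to its closest center.
        labels = np.argmin(dist[:, centers], axis=1)
        # Center update: the member with the smallest sum of distances
        # to the other members of its cluster becomes the new center.
        new_centers = []
        for k in range(num_clusters):
            members = np.flatnonzero(labels == k)
            if members.size == 0:  # keep the old center for an empty cluster
                new_centers.append(centers[k])
                continue
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(within)]))
        if new_centers == centers:
            break
        centers = new_centers
    labels = np.argmin(dist[:, centers], axis=1)
    return labels, centers
```

Here `dist[i, j]` would hold the distance between the `i`-th and `j`-th sequences under the chosen metric, such as the KS distance or MMD. Since only pairwise distances are needed, the matrix can be computed once before the iterations start.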
The following theorem provides the convergence guarantee for Algorithm 2 via an upper bound on the error probability.
Outline of the Proof.
The idea of proving the upper bound on the error probability is as follows. We first prove that the error probability at the initialization step decays exponentially. Note that the event that an error occurs during the first iterations is the union of the event that an error occurs at the -th step and the previous iterations are correct for . Thus, if we prove that the error probability at the -th step given correct updates from the previous iterations decays exponentially, then so does the error probability of the algorithm by the union bound argument. See Appendix -B1 for details. ∎
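The union bound argument in the outline can be written out schematically. The notation is assumed for illustration only: $E_0$ denotes an error at the initialization step, $E_t$ an error at the $t$-th step, $T$ the number of iterations, and $a, b > 0$ are generic constants such that each conditional error probability is at most $a e^{-bn}$.

```latex
P\Big(\bigcup_{t=0}^{T} E_t\Big)
  = P(E_0) + \sum_{t=1}^{T} P\big(E_t \cap E_0^{c} \cap \cdots \cap E_{t-1}^{c}\big)
  \le P(E_0) + \sum_{t=1}^{T} P\big(E_t \,\big|\, E_0^{c}, \ldots, E_{t-1}^{c}\big)
  \le (T+1)\, a\, e^{-bn} .
```

The first equality decomposes the union into disjoint events according to the first step at which an error occurs. For a fixed number of iterations $T$, the factor $T+1$ does not change the exponential decay rate.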
IV Unknown Number of Clusters
In this section, we propose the merge- and split-based algorithms for estimating the number of clusters as well as grouping the sequences.
IV-A Merge Step
If a distance metric satisfies (3c) and two sequences generated by distributions within the same cluster are assigned as centers, then, with high probability, the distance between the two centers is small. This is the premise of the clustering algorithm based on merging centers that are close to each other.
The proposed approach is summarized in Algorithms 3 and 4. There are two major differences between Algorithms 3 and 4 and Algorithms 1 and 2. First, the center initialization step of Algorithm 3 keeps generating an increasing number of centers until all the sequences are close to one of the existing centers. Second, an additional Merge Step in Algorithm 4 helps to combine clusters if the corresponding centers have small distances between each other.
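The two modifications described above can be sketched as follows. This is a hedged illustration, not a transcription of Algorithms 3 and 4: the function names, the single scalar `threshold` (assumed non-negative, with a zero diagonal in `dist`), the use of the first sequence as the first center, and the rule of keeping the earlier of two close centers are all assumptions.

```python
import numpy as np


def init_centers_until_covered(dist, threshold, first=0):
    """Add centers until every sequence is within `threshold` of some center."""
    dist = np.asarray(dist, dtype=float)
    centers = [first]  # assumed choice of the first center
    while True:
        min_to_centers = dist[:, centers].min(axis=1)
        farthest = int(np.argmax(min_to_centers))
        if min_to_centers[farthest] <= threshold:
            return centers
        # The farthest sequence is not yet covered, so it becomes a center.
        centers.append(farthest)


def merge_close_centers(dist, centers, threshold):
    """Drop any center that lies within `threshold` of an already kept center."""
    dist = np.asarray(dist, dtype=float)
    kept = []
    for c in centers:
        if all(dist[c, k] > threshold for k in kept):
            kept.append(c)
    return kept
```

After a merge, the sequences would be reassigned to the remaining centers in the next cluster update step, so the number of clusters can shrink across iterations.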
Suppose the KS distance and MMD are used with and . Then the error probability of Algorithm 4 after iterations is upper bounded as follows
IV-B Split Step
Suppose a cluster contains sequences generated by different distribution clusters and its center is generated from one of them. Then, if the distance metric satisfies (3b), the probability that a sequence generated from any other distribution cluster is close to the center decays as the sample size increases. Therefore, it is reasonable to begin with one cluster and then split a cluster if there exists a sequence in the cluster that has a large distance to the center. The corresponding algorithm is summarized in Algorithm 5.
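The split idea can be sketched as a loop that starts from a single cluster and creates a new center whenever some sequence is too far from the center of its own cluster. This is a simplified illustration under stated assumptions, not Algorithm 5 itself: the function name, the scalar `threshold` (assumed non-negative, with a zero diagonal in `dist`), the choice of the first sequence as the initial center, and the omission of a separate center update step are all choices made for the example.

```python
import numpy as np


def split_based_clustering(dist, threshold, first=0):
    """Start with one cluster and split while some sequence is far from its center."""
    dist = np.asarray(dist, dtype=float)
    num_seq = dist.shape[0]
    centers = [first]  # assumed initial center
    # Each pass adds at most one new center, so num_seq passes suffice.
    for _ in range(num_seq):
        labels = np.argmin(dist[:, centers], axis=1)
        # Distance from each sequence to the center of its own cluster.
        to_center = dist[np.arange(num_seq), np.asarray(centers)[labels]]
        farthest = int(np.argmax(to_center))
        if to_center[farthest] <= threshold:
            break
        # Split: the farthest sequence becomes the center of a new cluster.
        centers.append(farthest)
    labels = np.argmin(dist[:, centers], axis=1)
    return labels, centers
```

The number of clusters is therefore estimated from below, in contrast to a merge-based approach that starts from a possibly larger set of centers.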
Suppose Algorithm 5 obtains clusters at the -th iteration, where and or . Then the correct clustering update result is that each cluster contains all the sequences generated from the distribution cluster that generates the center.
Outline of the Proof.
An error occurs at the -th iteration if and only if the -th center is generated from distribution clusters that generated the previous centers or the clustering result is incorrect. Note that the error event of the first iterations is the union of the events that an error occurs at the -th iteration while the clustering results in the previous iterations are correct for . Similar to the proof of Theorem III.1, the error probability is bounded by the union bound. See Appendix -B3 for more details. ∎
Suppose the KS distance and MMD are used with and . Then the error probability of Algorithm 5 after iterations is upper bounded as follows
V Numerical Results
In this section, we provide some simulation results given and for . Moreover,
are used in the simulations. The probability density function (p.d.f.) ofis defined as
where , and is the Gamma function, respectively. Specifically, the parameters of the distributions are , , and , where and . Note that when , all the sequences generated from the same distribution cluster are generated from a single distribution. The exponential kernel function is used in the simulations for the MMD distance, i.e.,
V-A Known Number of Clusters
Simulation results for a known number of clusters are shown in Fig. 1. One can observe from the figures that by using both the KS distance and MMD, is a linear function of the sample size, i.e., is exponentially consistent. Moreover, the logarithmic slope of with respect to , i.e., the quantity , increases as becomes smaller, which, in the current simulation setting, implies a larger .
Furthermore, a good distance metric for Algorithm 2 depends on the underlying distributions. This is because the underlying distributions have different distances under the KS distance and MMD, which results in different values of and .
V-B Unknown Number of Clusters
With an unknown number of distribution clusters, the thresholds specified in Corollaries IV.1.1 and IV.2.1 are used in the simulation. The performance of Algorithms 4 and 5 for the KS distance and MMD is shown in Figs. 2 and 3, respectively. Given the KS distance and MMD, ’s are linear functions of the sample size when the sample size is large and larger implies a larger slope of . Furthermore, given the same value of , Algorithms 4 and 5 have similar performance under the KS distance whereas Algorithm 5 outperforms Algorithm 4 under MMD given and . Intuitively, smaller implies larger in the current simulation setting, and thereby should result in better clustering performance for a given sample size. However, Fig. 2(b) indicates that Algorithms 4 and 5 with the KS distance perform better with than with when the sample size is small. This is likely due to the fact that the KS distance between the two sequences is always lower bounded by . Thus, with small sample sizes, Algorithms 4 and 5 are likely to overestimate the number of clusters. This can be mitigated by increasing the threshold that controls merging/splitting of cluster centers.
V-C Choice of
Note that in general , where . Theorems IV.1 and IV.2 only establish the exponential consistency of Algorithms 4 and 5, respectively. We now investigate the impact on performance given different ’s. One can observe from Fig. 4 that the choice of has a significant impact on the performance of Algorithms 4 and 5. The optimal depends on both the value of and the underlying distributions. Moreover, from the KS panels of Fig. 4, we can see that a smaller , which implies a larger , results in better performance for the KS distance and that the two algorithms always have similar performance. On the other hand, from the MMD panels of Fig. 4, we can see that is a good choice for MMD if is of interest and (8) is used as the kernel function. Moreover, when is small, Algorithm 5 outperforms Algorithm 4 under MMD.
This paper studied the k-medoids algorithm for clustering data sequences generated from composite distributions. The convergence of the proposed algorithms and the upper bound on the error probability were analyzed for both known and unknown number of clusters. The error probability of the proposed algorithms was shown to decay exponentially for distance metrics that satisfy certain properties. In particular, the KS distance and MMD were shown to satisfy the required condition, and hence the corresponding algorithms are exponentially consistent.
One possible generalization of the current work is to investigate the exponential consistency of other clustering algorithms given distributional distance metrics that satisfy properties similar to those in Assumption 1.
-A Technical Lemmas
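Lemma .1 [Dvoretzky-Kiefer-Wolfowitz Inequality]. Suppose $\mathbf{x}$ consists of $n$ i.i.d. samples generated from $p$. Then, for any $\epsilon > 0$,
$$P\left( \sup_{a \in \mathbb{R}} \left| \hat{F}_{\mathbf{x}}(a) - F_p(a) \right| > \epsilon \right) \le 2 \exp\left(-2 n \epsilon^2\right),$$
where $\hat{F}_{\mathbf{x}}$ is the empirical c.d.f. induced by $\mathbf{x}$ and $F_p$ is the c.d.f. of $p$.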
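The Dvoretzky-Kiefer-Wolfowitz inequality bounds the probability that the empirical c.d.f. of $n$ i.i.d. samples deviates from the true c.d.f. by more than $\epsilon$ in sup-norm by $2\exp(-2n\epsilon^2)$. A quick Monte Carlo sanity check can make the bound concrete. The sketch below is an illustration only: it assumes samples from the Uniform(0, 1) distribution, whose c.d.f. is $F(a) = a$ on $[0, 1]$, and the function names, trial count, and seed are hypothetical choices.

```python
import numpy as np


def dkw_bound(n, eps):
    """Right-hand side of the DKW inequality: 2 * exp(-2 * n * eps^2)."""
    return 2.0 * np.exp(-2.0 * n * eps ** 2)


def ks_to_uniform(x):
    """sup_a |F_hat(a) - a| for samples x assumed drawn from Uniform(0, 1)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # The supremum is attained just before or exactly at an order statistic.
    upper = np.arange(1, n + 1) / n - x
    lower = x - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))


def empirical_exceedance(n, eps, trials=2000, seed=0):
    """Fraction of simulated sequences whose deviation exceeds eps."""
    rng = np.random.default_rng(seed)
    hits = sum(ks_to_uniform(rng.random(n)) > eps for _ in range(trials))
    return hits / trials
```

Comparing `empirical_exceedance(n, eps)` with `dkw_bound(n, eps)` for a few values of `n` gives a rough picture of how quickly the deviation probability decays; the simulated frequency is itself random, so it should only be expected to respect the bound up to Monte Carlo error.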
Suppose for , where and . Then for any ,
where , and . The first inequality is due to the triangle inequality of the -norm and the property of the supremum, and the last inequality is due to Lemma .1. Therefore, by the continuity of the exponential function, we have
Suppose for , where and . Then for any ,
Without loss of generality, for any , consider , where
Note that given , and ,
Since , by the triangle inequality of -norm, we have
where for any fixed . Moreover, notice that . Then
The last inequality is due to Lemma .2. ∎
Suppose two distribution clusters and satisfy (3a) under the KS distance. Assume that for , where . Then for any ,
Suppose two distribution clusters and satisfy (3a) under MMD. Assume that for , where . Then for any ,