EGMM: an Evidential Version of the Gaussian Mixture Model for Clustering

10/03/2020 ∙ by Lianmeng Jiao, et al. ∙ Université de Technologie de Compiègne

The Gaussian mixture model (GMM) provides a convenient yet principled framework for clustering, with properties suitable for statistical inference. In this paper, we propose a new model-based clustering algorithm, called EGMM (evidential GMM), in the theoretical framework of belief functions to better characterize cluster-membership uncertainty. With a mass function representing the cluster membership of each object, the evidential Gaussian mixture distribution composed of the components over the powerset of the desired clusters is proposed to model the entire dataset. The parameters in EGMM are estimated by a specially designed Expectation-Maximization (EM) algorithm. A validity index allowing automatic determination of the proper number of clusters is also provided. The proposed EGMM is as convenient as the classical GMM, but can generate a more informative evidential partition for the considered dataset. Experiments with synthetic and real datasets demonstrate the good performance of the proposed method as compared with some other prototype-based and model-based clustering techniques.


1 Introduction

Clustering is one of the most fundamental tasks in data mining and machine learning. It aims to divide a set of objects into homogeneous groups by maximizing the similarity between objects in the same group and minimizing the similarity between objects in different groups Aggarwal and Reddy (2014). As an active research topic, new approaches are constantly being proposed, because the usage and interpretation of clustering depend on each particular application. Clustering is currently applied in a variety of fields, such as computer vision Yang et al. (2019), communications Tam et al. (2019), biology Wang et al. (2020), and commerce Song et al. (2020). Based on the properties of the clusters generated, clustering techniques can be classified into partitional clustering and hierarchical clustering Han et al. (2011). Partitional clustering conducts one-level partitioning on datasets, whereas hierarchical clustering conducts multi-level partitioning, in an agglomerative or divisive way.

Model-based clustering is a classical and powerful approach for partitional clustering. It attempts to optimize the fit between the observed data and some mathematical model using a probabilistic approach, under the assumption that the data are generated by a mixture of underlying probability distributions. Many mixture models can be adopted to represent the data, among which the Gaussian mixture model (GMM) is by far the most commonly used representation Melnykov and Maitra (2010). As a model-based clustering approach, the GMM provides a principled statistical answer to practical issues that arise in clustering, e.g., how many clusters there are. Besides, its statistical properties also make it suitable for inference Fraley and Raftery (2002). The GMM has shown promising results in many clustering applications, ranging from image registration Ma et al. (2017), topic modeling Costa and Ortale (2019), and traffic prediction Jia et al. (2019) to anomaly detection Li et al. (2020).

However, the GMM is limited to probabilistic (or fuzzy) partitions of datasets: it does not allow ambiguity or imprecision in the assignment of objects to clusters. Actually, in many applications, it is more reasonable to assign objects in overlapping regions to a set of clusters rather than to some single cluster. Recently, the notion of evidential partition Denœux and Masson (2004); Masson and Denœux (2008) was introduced based on the theory of belief functions Dempster (1967); Shafer (1976); Denœux (2016); Denœux et al. (2020). As a general extension of the probabilistic (or fuzzy), possibilistic, and rough partitions, it allows an object not only to belong to single clusters, but also to belong to any subset of the frame of discernment that describes the possible clusters Denœux and Kanjanatarakul (2016). Therefore, the evidential partition provides more refined partitioning results than the other ones, which makes it very appealing for solving complex data clustering problems. Up to now, different evidential clustering algorithms have been proposed to build an evidential partition for object datasets. Most of these algorithms fall into the category of prototype-based clustering, including evidential c-means (ECM) Masson and Denœux (2008) and its variants, such as constrained ECM (CECM) Antoine et al. (2012) and median ECM (MECM) Zhou et al. (2015). Besides, in Denœux et al. (2015), a decision-directed clustering procedure, called EK-NNclus, was developed based on the evidential K-nearest neighbor rule, and in Su and Denœux (2019), a belief-peaks evidential clustering (BPEC) algorithm was developed by fast search and find of density peaks. Although the above-mentioned algorithms can generate powerful evidential partitions, they are purely descriptive and unsuitable for statistical inference. A recent model-based evidential clustering approach, called bootGMM, was proposed in Denœux (2020) by bootstrapping Gaussian mixture models. This algorithm builds calibrated evidential partitions in an approximate way, but the high computational complexity of the bootstrapping and calibration procedures limits its application to large datasets.

In this paper, we propose a new model-based evidential clustering algorithm, called EGMM (evidential GMM), by extending the classical GMM directly in the belief function framework. Unlike the GMM, the EGMM associates a distribution not only with each single cluster, but also with sets of clusters. Specifically, with a mass function representing the cluster membership of each object, an evidential Gaussian mixture distribution composed of components over the powerset of the desired clusters is proposed to model the entire dataset. After that, the maximum likelihood solution of the EGMM is derived via a specially designed Expectation-Maximization (EM) algorithm. With the estimated parameters, the clustering is performed by calculating the n-tuple evidential membership (one mass function per object), which provides an evidential partition of the n considered objects. Besides, in order to determine the number of clusters automatically, an evidential Bayesian inference criterion (EBIC) is also presented as the validity index. The proposed EGMM is as convenient as the classical GMM: it has no open parameter and does not require the number of clusters to be fixed in advance. More importantly, the proposed EGMM generates an evidential partition, which is more informative than a probabilistic partition.

The rest of this paper is organized as follows. Section 2 recalls the necessary preliminaries about the theory of belief functions and the Gaussian mixture model from which the proposal is derived. Our proposed EGMM is then presented in Section 3. In Section 4, we conduct experiments to evaluate the performance of the proposal using both synthetic and real-world datasets. Finally, Section 5 concludes the paper.

2 Preliminaries

We first briefly introduce necessary concepts about belief function theory in Section 2.1. The Gaussian mixture model for clustering is then recalled in Section 2.2.

2.1 Basics of the Belief Function Theory

The theory of belief functions Dempster (1967); Shafer (1976), also known as Dempster-Shafer theory or evidence theory, is a generalization of probability theory. It offers a well-founded and workable framework to model a large variety of uncertain information. In belief function theory, a problem domain is represented by a finite set Ω called the frame of discernment. A mass function, expressing the belief committed to the elements of 2^Ω by a given source of evidence, is a mapping m: 2^Ω → [0, 1] such that

m(\emptyset) = 0, \qquad \sum_{A \subseteq \Omega} m(A) = 1.        (1)

Subsets A ⊆ Ω such that m(A) > 0 are called the focal sets of the mass function m. The mass function has several special cases, which represent different types of information. A mass function is said to be

  • Bayesian, if all of its focal sets are singletons. In this case, the mass function reduces to a precise probability distribution;

  • Certain, if the whole mass is allocated to a unique singleton. This corresponds to a situation of complete knowledge;

  • Vacuous, if the whole mass is allocated to Ω. This situation corresponds to complete ignorance.

Shafer Shafer (1976) also defined the belief and plausibility functions as follows:

Bel(A) = \sum_{B \subseteq A} m(B), \qquad Pl(A) = \sum_{B \cap A \neq \emptyset} m(B), \qquad \forall A \subseteq \Omega.        (2)

Bel(A) represents the exact support to A and its subsets, and Pl(A) represents the total possible support to A. The interval [Bel(A), Pl(A)] can be seen as the lower and upper bounds of support to A. The three functions m, Bel, and Pl are in one-to-one correspondence.

For decision-making support, Smets Smets (2005) proposed the pignistic probability BetP to approximate the unknown probability distribution on Ω as follows:

BetP(\omega) = \sum_{A \subseteq \Omega,\, \omega \in A} \frac{m(A)}{|A|}, \qquad \forall \omega \in \Omega,        (3)

where |A| is the cardinality of set A.
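As a quick illustration of Eqs. (1)-(3), the short Python sketch below computes Bel, Pl, and BetP for a mass function stored as a dictionary mapping focal sets (frozensets) to masses. This is our own illustration, not code from the paper, and it assumes a normalized mass function (m(∅) = 0).

```python
# Our own sketch: Bel, Pl and BetP for a mass function on Omega = {w1, w2, w3}.

def bel(m, A):
    """Bel(A), Eq. (2): total mass committed to subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Pl(A), Eq. (2): total mass of focal sets intersecting A."""
    return sum(v for B, v in m.items() if B & A)

def betp(m, omega):
    """BetP(omega), Eq. (3): pignistic probability of a single element."""
    return sum(v / len(B) for B, v in m.items() if omega in B)

m = {frozenset({"w1"}): 0.5, frozenset({"w2", "w3"}): 0.3,
     frozenset({"w1", "w2", "w3"}): 0.2}
A = frozenset({"w1"})
print(bel(m, A), pl(m, A))       # 0.5 and 0.7: lower/upper bounds of support to A
print([round(betp(m, w), 3) for w in ("w1", "w2", "w3")])  # sums to 1
```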

2.2 Gaussian Mixture Model for Clustering

Suppose we have a dataset X = {x_1, x_2, …, x_n} consisting of n observations of a d-dimensional random variable x. The random variable x is assumed to be distributed according to a mixture of c components (i.e., clusters), with each one represented by a parametric distribution. Then, the entire dataset can be modeled by the following mixture distribution:

p(x_i \mid \Theta) = \sum_{k=1}^{c} \pi_k\, p(x_i \mid \theta_k),        (4)

where θ_k is the set of parameters specifying the k-th component, and π_k is the probability that an observation belongs to the k-th component (0 ≤ π_k ≤ 1 and \sum_{k=1}^{c} \pi_k = 1).

The most commonly used mixture model is the Gaussian mixture model (GMM) Aggarwal and Reddy (2014); Melnykov and Maitra (2010); McLachlan et al. (2019), where each component is represented by a parametric Gaussian distribution:

p(x_i \mid \theta_k) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x_i-\mu_k)^{\mathrm{T}} \Sigma_k^{-1} (x_i-\mu_k)\Big),        (5)

where μ_k is a d-dimensional mean vector, Σ_k is a d × d covariance matrix, and |Σ_k| denotes the determinant of Σ_k.

The basic goal of clustering using the GMM is to estimate the unknown parameter set Θ = {π_1, …, π_c, μ_1, …, μ_c, Σ_1, …, Σ_c} from the set of observations X. This can be done using maximum likelihood estimation (MLE), with the log-likelihood function given by

\log L(\Theta \mid X) = \sum_{i=1}^{n} \log \sum_{k=1}^{c} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).        (6)

The above MLE problem is well solved by the Expectation-Maximization (EM) algorithm Dempster et al. (1977), with solutions given by

\pi_k = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ik},        (7)

\mu_k = \frac{\sum_{i=1}^{n} \gamma_{ik}\, x_i}{\sum_{i=1}^{n} \gamma_{ik}},        (8)

\Sigma_k = \frac{\sum_{i=1}^{n} \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\mathrm{T}}}{\sum_{i=1}^{n} \gamma_{ik}},        (9)

where γ_ik is the posterior probability that object x_i belongs to the k-th component, given the current parameter estimates:

\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{c} \pi_l\, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}.        (10)

With initialized parameters π_k, μ_k, and Σ_k, the posterior probabilities and the parameters are updated alternately until the change in the log-likelihood becomes smaller than some threshold. Finally, the clustering is performed by calculating the posterior probabilities with the estimated parameters using Eq. (10).
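To make Eqs. (6)-(10) concrete, the NumPy sketch below implements the standard GMM EM loop. It is a minimal illustration written for this text (the random-observation initialization and the small diagonal regularization are our own choices), not the implementation used in the experiments.

```python
# Minimal GMM EM sketch following Eqs. (6)-(10); illustrative only.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gmm_em(X, c, n_iter=100, tol=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(c, 1.0 / c)                       # mixing probabilities
    mu = X[rng.choice(n, c, replace=False)]        # initial means: random objects
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * c)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities gamma_ik, Eq. (10)
        dens = np.column_stack([pi[k] * mvn.pdf(X, mu[k], sigma[k]) for k in range(c)])
        ll = np.log(dens.sum(axis=1)).sum()        # log-likelihood, Eq. (6)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: Eqs. (7)-(9)
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        for k in range(c):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
        if abs(ll - prev_ll) < tol:                # stop when the change is small
            break
        prev_ll = ll
    return pi, mu, sigma, gamma
```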

3 EGMM: Evidential Gaussian Mixture Model for Clustering

Considering the advantages of belief function theory for representing uncertain information, we extend the classical GMM in the belief function framework and develop an evidential Gaussian mixture model (EGMM) for clustering. In Section 3.1, the evidential membership is first introduced to represent the cluster membership of each object. Based on this representation, Section 3.2 describes how the EGMM is derived in detail. Then, the parameters of the EGMM are estimated by a specially designed EM algorithm in Section 3.3. The whole algorithm is summarized and analyzed in Section 3.4. Finally, the determination of the number of clusters is further studied in Section 3.5.

3.1 Evidential Membership

Suppose the desired number of clusters is c (c ≥ 2). The purpose of EGMM clustering is to assign to the n objects in dataset X soft labels represented by an n-tuple evidential membership structure

M = (m_1, m_2, \ldots, m_n),        (11)

where m_i, i = 1, …, n, are mass functions defined on the frame of discernment Ω = {ω_1, ω_2, …, ω_c}.

The above evidential membership modeled by the mass function m_i provides a general representation of the cluster membership of object x_i:

  • When m_i is a Bayesian mass function, the evidential membership reduces to the probabilistic membership of the GMM defined in Eq. (10).

  • When m_i is a certain mass function, the evidential membership reduces to the crisp label employed in many hard clustering methods, such as k-means Jain (2010) and DPC Rodriguez and Laio (2014).

  • When m_i is a vacuous mass function, the class of object x_i is completely unknown; such an object can be seen as an outlier.

Example 1

Let us consider a set of four objects {x_1, x_2, x_3, x_4} with evidential membership regarding a set of three classes Ω = {ω_1, ω_2, ω_3}. Mass functions for each object are given in Table 1. They illustrate various situations: the case of object x_1 corresponds to a situation of probabilistic uncertainty (m_1 is Bayesian); the class of object x_2 is known with precision and certainty (m_2 is certain), whereas the class of object x_3 is completely unknown (m_3 is vacuous); finally, the mass function m_4 models the general situation where the class of object x_4 is both imprecise and uncertain.

Focal set      m_1   m_2   m_3   m_4
{ω_1}          0.2   0     0     0
{ω_2}          0.3   0     0     0.1
{ω_1, ω_2}     0     0     0     0
{ω_3}          0.5   1     0     0.2
{ω_1, ω_3}     0     0     0     0
{ω_2, ω_3}     0     0     0     0.4
Ω              0     0     1     0.3
Table 1: Example of the evidential membership

As illustrated in the above example, the evidential membership is a powerful model to represent the imprecise and uncertain information existing in datasets. In the following, we study how to derive a soft label represented by the evidential membership for each object in dataset X, given a desired number of clusters c.
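As a small illustration (our own encoding, using the values of Table 1 as reconstructed above), the evidential memberships can be stored as dictionaries over focal sets and tested for the three special cases listed above:

```python
# The four mass functions of Table 1 over Omega = {w1, w2, w3}, and their types.
OMEGA = frozenset({"w1", "w2", "w3"})

m1 = {frozenset({"w1"}): 0.2, frozenset({"w2"}): 0.3, frozenset({"w3"}): 0.5}
m2 = {frozenset({"w3"}): 1.0}
m3 = {OMEGA: 1.0}
m4 = {frozenset({"w2"}): 0.1, frozenset({"w3"}): 0.2,
      frozenset({"w2", "w3"}): 0.4, OMEGA: 0.3}

def is_bayesian(m):   # all focal sets are singletons -> probabilistic membership
    return all(len(A) == 1 for A in m)

def is_certain(m):    # whole mass on one singleton -> crisp label
    return len(m) == 1 and len(next(iter(m))) == 1

def is_vacuous(m):    # whole mass on Omega -> class completely unknown (outlier)
    return m == {OMEGA: 1.0}

for name, m in [("m1", m1), ("m2", m2), ("m3", m3), ("m4", m4)]:
    print(name, is_bayesian(m), is_certain(m), is_vacuous(m))
```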

3.2 From GMM to EGMM

In the GMM, each component ω_k in the desired cluster set Ω is represented by the following cluster-conditional probability density:

p(x_i \mid \omega_k; \theta_k) = \mathcal{N}(x_i \mid \mu_k, \Sigma_k),        (12)

where θ_k = {μ_k, Σ_k} is the set of parameters specifying the k-th component ω_k, k = 1, …, c. It means that any object in set X is drawn from one single cluster in Ω.

Unlike the probabilistic membership in the GMM, the evidential membership introduced in the EGMM enables an object to belong to any non-empty subset A_j of Ω, including not only the individual clusters but also the meta-clusters composed of several clusters. In order to model each evidential component A_j, j = 1, …, f, with f = 2^c − 1 the number of non-empty subsets of Ω, we construct the following evidential cluster-conditional probability density:

p(x_i \mid A_j; \theta_j) = \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma_j),        (13)

where θ_j = {\bar{μ}_j, Σ_j} is the set of parameters specifying the j-th evidential component A_j, j = 1, …, f.

Notice that, as different evidential components may be nested (e.g., A_j = {ω_1} and A_l = {ω_1, ω_2}), the cluster-conditional probability densities are no longer independent. To model this correlation, we propose to associate with each component A_j the mean vector \bar{μ}_j defined as the average of the mean vectors associated with the single clusters composing A_j:

\bar{\mu}_j = \frac{1}{|A_j|} \sum_{k=1}^{c} s_{kj}\, \mu_k,        (14)

where |A_j| denotes the cardinality of A_j, and s_kj is defined as

s_{kj} = \mathbb{1}(\omega_k \in A_j),        (15)

with \mathbb{1}(\cdot) being the indicator function.

As for the covariance matrices, the values for the different components could in principle be left free. Some researchers have also proposed different assumptions on the component covariance matrices in order to simplify the mixture model Banfield and Raftery (1993). In this paper, we adopt the following constant covariance matrix:

\Sigma_j = \Sigma, \qquad j = 1, \ldots, f,        (16)

where Σ is an unknown symmetric matrix. This assumption results in clusters that have the same geometry but need not be spherical.

In the EGMM, each object is assumed to be distributed according to a mixture of f components over the powerset of the desired cluster set, with each one defined as the evidential cluster-conditional probability density in Eq. (13). Formally, the evidential Gaussian mixture distribution can be formulated as

p(x_i \mid \Theta) = \sum_{j=1}^{f} \pi_j\, \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma_j),        (17)

where π_j is called the mixing probability, denoting the prior probability that the object was generated from the j-th component. Similar to the GMM, the mixing probabilities must satisfy 0 ≤ π_j ≤ 1 and \sum_{j=1}^{f} \pi_j = 1.

Remark 1

The above EGMM is a generalization of the classical GMM in the framework of belief functions. When the evidential membership reduces to the probabilistic membership, all the meta-cluster components are assigned zero prior probability, i.e., π_j = 0 for every A_j with |A_j| > 1. In this case, the evidential Gaussian mixture distribution in Eq. (17) reduces to p(x_i | Θ) = \sum_{k=1}^{c} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma), which is just the classical Gaussian mixture distribution (here with a common covariance matrix, due to Eq. (16)).
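The construction of Eqs. (13)-(17) can be sketched in a few lines of Python. The helper names and the toy values below are ours: the sketch enumerates the non-empty subsets of Ω, forms each component mean by averaging the singleton means as in Eq. (14), and evaluates the mixture density of Eq. (17) with a shared covariance as in Eq. (16).

```python
# Sketch of the evidential Gaussian mixture density, Eqs. (14)-(17); illustrative only.
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal as mvn

def nonempty_subsets(c, max_card=None):
    """Non-empty subsets of {0, ..., c-1} as tuples of cluster indices."""
    max_card = c if max_card is None else max_card
    return [A for r in range(1, max_card + 1) for A in combinations(range(c), r)]

def component_means(mu, subsets):
    """Eq. (14): the mean of component A_j is the average of its singleton means."""
    return np.array([mu[list(A)].mean(axis=0) for A in subsets])

def egmm_density(X, pi, mu, Sigma, subsets):
    """Eq. (17): p(x | Theta) = sum_j pi_j N(x | mu_bar_j, Sigma)."""
    mu_bar = component_means(mu, subsets)
    dens = np.column_stack([pi[j] * mvn.pdf(X, mu_bar[j], Sigma)
                            for j in range(len(subsets))])
    return dens.sum(axis=1), dens

# Toy usage with c = 2 clusters -> components {w1}, {w2}, {w1, w2}.
mu = np.array([[0.0, 0.0], [4.0, 0.0]])           # singleton means
subsets = nonempty_subsets(2)
pi = np.array([0.45, 0.45, 0.10])                 # mixing probabilities
Sigma = np.eye(2)                                 # common covariance, Eq. (16)
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
p, _ = egmm_density(X, pi, mu, Sigma, subsets)    # mixture density at each object
```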

In this formulation of the mixture model, we need to infer a set of parameters from the observations, including the mixing probabilities π_j and the parameters θ_j of the component distributions, j = 1, …, f. Considering the constraints on the mean vectors and covariance matrices indicated in Eqs. (14) and (16), the overall parameter set of the mixture model is Θ = {π_1, …, π_f, μ_1, …, μ_c, Σ}. If we assume that the objects in set X are drawn independently from the mixture distribution, then we obtain the observed-data log-likelihood of generating all the objects as

\log L(\Theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{f} \pi_j\, \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma).        (18)

In statistics, maximum likelihood estimation (MLE) is an important approach for parameter estimation. The maximum likelihood estimate of Θ is defined as

\hat{\Theta} = \arg\max_{\Theta} \log L(\Theta \mid X),        (19)

which is the best estimate in the sense that it maximizes the probability density of generating all the observations. Different from the usual solution of the GMM, the MLE of the EGMM is rather complicated, as additional constraints (see Eqs. (14) and (16)) are imposed on the estimated parameters. Next, we derive the maximum likelihood solution for the EGMM via a specially designed EM algorithm.

3.3 Maximum Likelihood Estimation via the EM algorithm

In order to use the EM algorithm to solve the MLE problem for the EGMM in Eq. (19), we artificially introduce a latent variable z_i to denote the component label of each object x_i, i = 1, …, n, in the form of an f-dimensional binary vector z_i = (z_{i1}, …, z_{if}), where

z_{ij} = \begin{cases} 1, & \text{if object } x_i \text{ is generated from the } j\text{-th component } A_j,\\ 0, & \text{otherwise.} \end{cases}        (20)

The latent variables z_i are independent and identically distributed (i.i.d.) according to a multinomial distribution of one draw from f components with mixing probabilities π_1, …, π_f. In conjunction with the observed data x_i, the complete data are considered to be (x_i, z_i), i = 1, …, n. Then, the corresponding complete-data log-likelihood can be formulated as

\log L_c(\Theta) = \sum_{i=1}^{n} \sum_{j=1}^{f} z_{ij} \big[\log \pi_j + \log \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma)\big].        (21)

The EM algorithm approaches the problem of maximizing the observed-data log-likelihood in Eq. (18) by proceeding iteratively with the above complete-data log-likelihood log L_c(Θ). Each iteration of the algorithm involves two steps, called the expectation step (E-step) and the maximization step (M-step). The derivation of the EM solution for the EGMM is detailed in the Appendix; only the main equations are given here without proof.

As the complete-data log-likelihood depends explicitly on the unobservable data z_i, the E-step is performed on the so-called Q-function, which is the conditional expectation of log L_c(Θ) given X, using the current fit Θ^{(t)} for Θ. More specifically, at the t-th iteration of the EM algorithm, the E-step computes

Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{f} \gamma_{ij}^{(t)} \big[\log \pi_j + \log \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma)\big],        (22)

where γ_{ij}^{(t)} is the evidential membership of the i-th object to the j-th component, or the responsibility of the hidden variable z_{ij}, given the current fit of the parameters. Using Bayes' theorem, we obtain

\gamma_{ij}^{(t)} = \frac{\pi_j^{(t)}\, \mathcal{N}(x_i \mid \bar{\mu}_j^{(t)}, \Sigma^{(t)})}{\sum_{l=1}^{f} \pi_l^{(t)}\, \mathcal{N}(x_i \mid \bar{\mu}_l^{(t)}, \Sigma^{(t)})}.        (23)

In the M-step, we need to maximize the Q-function to update the parameters:

\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta; \Theta^{(t)}).        (24)

Different from the observed-data log-likelihood in Eq. (18), the logarithm in the Q-function of Eq. (22) acts directly on the Gaussian distributions. By keeping the evidential memberships γ_{ij}^{(t)} fixed, we can maximize Q with respect to the involved parameters: the mixing probabilities π_j of the components, the mean vectors μ_k of the single clusters, and the common covariance matrix Σ. This leads to closed-form solutions for updating these parameters as follows.

  • The mixing probabilities of the components π_j, j = 1, …, f:

    \pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}^{(t)}.        (25)

  • The mean vectors of the single clusters μ_k, k = 1, …, c:

    M^{(t+1)} = H^{-1} B,        (26)

    where M^{(t+1)} is a matrix of size (c × d) composed of all the single-cluster mean vectors, i.e., M^{(t+1)} = (\mu_1^{(t+1)}, \ldots, \mu_c^{(t+1)})^{\mathrm{T}}, H is a matrix of size (c × c) defined by

    H_{kl} = \sum_{i=1}^{n} \sum_{j=1}^{f} \gamma_{ij}^{(t)} \frac{s_{kj}\, s_{lj}}{|A_j|^2},        (27)

    and B is a matrix of size (c × d) defined by

    B_{kq} = \sum_{i=1}^{n} \sum_{j=1}^{f} \gamma_{ij}^{(t)} \frac{s_{kj}}{|A_j|}\, x_{iq}.        (28)

  • The common covariance matrix Σ:

    \Sigma^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{f} \gamma_{ij}^{(t)} \big(x_i - \bar{\mu}_j^{(t+1)}\big)\big(x_i - \bar{\mu}_j^{(t+1)}\big)^{\mathrm{T}},        (29)

    where \bar{μ}_j^{(t+1)} is the updated mean vector of the j-th evidential component, computed using Eq. (14) from the updated single-cluster mean vectors in Eq. (26).

This algorithm is started by initializing the parameters π_j, μ_k, and Σ with initial guesses. Then the two updating steps (i.e., E-step and M-step) alternate until the change in the observed-data log-likelihood falls below some threshold ε. The convergence properties of the EM algorithm are discussed in detail in McLachlan and Krishnan (2007). It is proved that each EM cycle increases the observed-data log-likelihood, which is guaranteed to converge to a (local) maximum.
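To make the two steps concrete, the NumPy sketch below performs one EM iteration of the EGMM following Eqs. (23) and (25)-(29) as restated above. It is our own illustration (the matrix names S, H, and B and the function signature are ours), not the authors' reference implementation.

```python
# One EGMM EM iteration (E-step + M-step) under the constraints of Eqs. (14) and (16).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def em_iteration(X, pi, mu, Sigma, subsets):
    n, d = X.shape
    c, f = len(mu), len(subsets)
    card = np.array([len(A) for A in subsets], dtype=float)
    S = np.zeros((c, f))                         # S[k, j] = 1 if cluster k is in A_j
    for j, A in enumerate(subsets):
        S[list(A), j] = 1.0
    # E-step, Eq. (23): evidential responsibilities gamma_ij
    mu_bar = (S / card).T @ mu                   # component means, Eq. (14)
    dens = np.column_stack([pi[j] * mvn.pdf(X, mu_bar[j], Sigma) for j in range(f)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step, Eq. (25): mixing probabilities
    pi_new = gamma.sum(axis=0) / n
    # M-step, Eqs. (26)-(28): singleton means solve the linear system H M = B
    H = (S * gamma.sum(axis=0) / card**2) @ S.T  # (c x c)
    B = S @ ((gamma / card).T @ X)               # (c x d)
    mu_new = np.linalg.solve(H, B)
    # M-step, Eq. (29): common covariance from the updated component means
    mu_bar_new = (S / card).T @ mu_new
    Sigma_new = np.zeros((d, d))
    for j in range(f):
        diff = X - mu_bar_new[j]
        Sigma_new += (gamma[:, j, None] * diff).T @ diff
    return pi_new, mu_new, Sigma_new / n, gamma
```

Iterating this function until the observed-data log-likelihood of Eq. (18) stabilizes reproduces the loop of Algorithm 1 below.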

3.4 Summary and Analysis

With the parameters estimated via the above EM algorithm, the clustering is performed by calculating the evidential memberships γ_{ij}, i = 1, …, n, j = 1, …, f, using Eq. (23); these define the mass functions m_i(A_j) = γ_{ij}. The computed n-tuple evidential membership M = (m_1, …, m_n) provides an evidential partition of the considered objects. As indicated in Denœux and Kanjanatarakul (2016), the evidential partition provides a complex clustering structure, which can boil down to several alternative clustering structures, including the traditional hard partition, the probabilistic (or fuzzy) partition Bezdek (1981); D'Urso and Massari (2019), the possibilistic partition Krishnapuram and Keller (1993); Ferone and Maratea (2019), and the rough partition Peters (2014, 2015). We summarize the EGMM clustering procedure in Algorithm 1.

1: Input: X = {x_1, …, x_n}: the n objects in R^d;  c: number of clusters;  ε: termination threshold.
2: Initialize the mixing probabilities π_j, the mean vectors μ_k of the single clusters, and the common covariance matrix Σ;
3: t ← 0;
4: repeat
5:     Calculate the mean vectors \bar{μ}_j and the covariance matrices Σ_j of the evidential components using the current parameters μ_k and Σ, based on Eq. (14) and Eq. (16), respectively;
6:     Calculate the evidential memberships γ_{ij} using the current mixing probabilities π_j, the mean vectors \bar{μ}_j, and the covariance matrices Σ_j, based on Eq. (23);
7:     Update the mixing probabilities π_j and the mean vectors μ_k of the single clusters using the calculated evidential memberships, based on Eqs. (25)-(28);
8:     Update the common covariance matrix Σ using the calculated evidential memberships and the updated mean vectors, based on Eq. (29);
9:     Compute the observed-data log-likelihood log L(Θ^{(t+1)} | X) based on the updated parameters, using Eq. (18);
10:     t ← t + 1;
11: until |log L(Θ^{(t)} | X) − log L(Θ^{(t−1)} | X)| < ε;
12: return the n-tuple evidential membership M = (m_1, …, m_n), with each m_i(A_j) = γ_{ij}, j = 1, …, f.
Algorithm 1 EGMM clustering algorithm.

Generality Analysis: The proposed EGMM algorithm provides a general framework for clustering, which boils down to the classical GMM when all the evidential components are constrained to be singletons, i.e., when π_j = 0 for every A_j with |A_j| > 1. Compared with the GMM algorithm, the evidential one allocates, for each object, a mass of belief to any subset of possible clusters, which allows a deeper insight into the data.

Convergence Analysis: As indicated in Park and Ozeki (2009), the EM algorithm for mixture models can take many iterations to reach convergence, and may reach different local maxima when started from different initializations. In order to find a suitable initialization and speed up convergence for the proposed EGMM algorithm, it is recommended to run the k-means algorithm Jain (2010) and to use the means of the resulting clusters and the average covariance of the clusters to initialize μ_k and Σ, respectively. As for the mixing probabilities π_j, if no prior information is available, these values can be initialized equally as π_j = 1/f, j = 1, …, f.

Complexity Analysis: For each object, the proposed EGMM algorithm distributes a fraction of the unit mass to each non-empty element of 2^Ω. Consequently, the number of parameters to be estimated is exponential in the number of clusters c and linear in the number of objects n. Considering that, in most cases, the objects assigned to elements of high cardinality are of less interpretability, in practice we can reduce the complexity by constraining the focal sets to be composed of at most two clusters. In this way, the number of evidential components, and hence the number of parameters to be estimated, is drastically reduced from the order of 2^c to the order of c^2 (see the short count below).
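A quick count of the non-empty focal sets (our own illustration) makes this reduction explicit:

```python
# Number of evidential components: all non-empty subsets vs. at most two clusters.
for c in (3, 5, 8, 10):
    full = 2**c - 1                     # all non-empty subsets of Omega
    restricted = c + c * (c - 1) // 2   # singletons and pairs only
    print(c, full, restricted)          # e.g. c = 10: 1023 vs. 55
```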

3.5 Determining the Number of Clusters

One important issue arising in clustering is the determination of the number of clusters. This problem is often referred to as cluster validity. Most of the methods for GMM clustering start by obtaining a set of partitions for a range of values of c (from c_min to c_max) which is assumed to contain the optimal c. The number of clusters is then selected according to

\hat{c} = \arg\max_{c_{\min} \le c \le c_{\max}} V(\hat{\Theta}_c, c),        (30)

where \hat{Θ}_c is the estimated parameter set with c clusters, and V is some validity index. A very common criterion can be expressed in the form Aggarwal and Reddy (2014)

V(\hat{\Theta}_c, c) = \log L(\hat{\Theta}_c \mid X) - P(c),        (31)

where \log L(\hat{Θ}_c | X) is the maximized mixture log-likelihood when the number of clusters is chosen as c, and P(c) is an increasing function penalizing higher values of c.

Many examples of such criteria have been proposed for the GMM, including Bayesian approximation criteria, such as the Laplace-empirical criterion (LEC) McLachlan and Peel (2000) and the Bayesian inference criterion (BIC) Fraley and Raftery (2002), and information-theoretic criteria, such as the minimum description length (MDL) Grünwald (2007), the minimum message length (MML) Yatracos (2015), and Akaike's information criterion (AIC) Charkhi and Claeskens (2018). Among these criteria, the BIC has given good results in a wide range of applications of model-based clustering. For general mixture models, the BIC is defined as

\mathrm{BIC}(c) = \log L(\hat{\Theta}_c \mid X) - \frac{\nu_c}{2} \log n,        (32)

where ν_c is the number of independent parameters to be estimated in Θ when the number of clusters is chosen as c.

For our proposed clustering approach, we adopt the above BIC as the validity index to determine the number of clusters. For the EGMM, the mixture log-likelihood is replaced by the evidential Gaussian mixture log-likelihood defined in Eq. (18), and the number of independent parameters is counted in the EGMM parameter set Θ = {π_1, …, π_f, μ_1, …, μ_c, Σ}. Consequently, the evidential version of the BIC for the EGMM is derived as

\mathrm{EBIC}(c) = \log L(\hat{\Theta}_c \mid X) - \frac{\nu_c}{2} \log n,        (33)

where ν_c = (2^c − 2) + cd + d(d+1)/2, including 2^c − 2 independent parameters in the mixing probabilities π_j, cd independent parameters in the mean vectors μ_k, and d(d+1)/2 independent parameters in the common covariance matrix Σ. This index has to be maximized to determine the optimal number of clusters.
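The model-selection procedure of Eqs. (30)-(33) can be sketched as follows. The parameter count nu follows the EGMM parameterization given above, and `egmm_fit` is a hypothetical routine (not defined here) that runs the EM procedure of Section 3.3 to convergence and returns the maximized observed-data log-likelihood of Eq. (18).

```python
# Sketch of choosing the number of clusters by maximizing the EBIC of Eq. (33).
import numpy as np

def ebic(loglik, c, n, d):
    f = 2**c - 1                                 # number of evidential components
    nu = (f - 1) + c * d + d * (d + 1) // 2      # free parameters in the EGMM
    return loglik - 0.5 * nu * np.log(n)

def select_n_clusters(X, egmm_fit, c_min=2, c_max=6):
    n, d = X.shape
    scores = {c: ebic(egmm_fit(X, c), c, n, d) for c in range(c_min, c_max + 1)}
    return max(scores, key=scores.get), scores   # best c and all EBIC values
```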

4 Experiments

This section consists of two parts. In Section 4.1, some numerical examples are used to illustrate the behavior of the EGMM algorithm (the MATLAB source code can be downloaded from https://github.com/jlm-138/EGMM). In Section 4.2, we compare the performance of our proposal with those of related clustering algorithms on several real datasets.

4.1 Illustrative examples

In this subsection, we consider three numerical examples to illustrate the interest of the proposed EGMM algorithm for deriving an evidential partition that better characterizes cluster-membership uncertainty.

4.1.1 Diamond dataset

In the first example, we consider the famous Diamond dataset to illustrate the behavior of EGMM compared with the general GMM Aggarwal and Reddy (2014). This dataset is composed of 11 objects, as shown in Fig. 1. We first calculated the cluster validity indices by running the EGMM algorithm with different numbers of clusters. Table 2 shows the EBIC indices with the desired number of clusters ranging from 2 to 6. It can be seen that the maximum is obtained for c = 2 clusters, which is consistent with our intuitive understanding of the partition of this dataset. Figs. 2 and 3 show the clustering results (with c = 2) by GMM and EGMM, respectively. For the GMM, object 6, which lies at the cluster boundary, is assigned a high probability of belonging to one of the two clusters. For our proposed EGMM, in contrast, object 6 is assigned a high evidential membership to the set {ω_1, ω_2}, which reveals that this point is ambiguous: it could be assigned either to ω_1 or to ω_2. In addition, the EGMM finds approximate locations for both cluster centers, whereas the GMM gives a biased estimate of the location of one cluster center. This example demonstrates that the proposed EGMM is more powerful for detecting ambiguous objects, and thus can reveal the underlying structure of the considered data in a more comprehensive way.

Figure 1: Diamond dataset
Cluster number 2 3 4 5 6
EBIC index -55.9 -63.1 -70.6 -91.5 -132.3
Table 2: Diamond dataset: EBIC indices for different numbers of clusters
(a) Cluster probability of each object
(b) Hard partition result and the cluster centers
Figure 2: Diamond dataset: clustering results by GMM
(a) Cluster evidential membership of each object
(b) Hard evidential partition result and the cluster centers
Figure 3: Diamond dataset: clustering results by EGMM

4.1.2 Two-class dataset

In the second example, a dataset generated from two Gaussian distributions is considered to demonstrate the superiority of the proposed EGMM over the prototype-based ECM Masson and Denœux (2008) and the model-based bootGMM Denœux (2020), which are two representative evidential clustering algorithms developed in the belief function framework. This dataset is composed of two classes of 400 points, generated from Gaussian distributions with the same covariance matrix and different mean vectors. The dataset and the contours of the distributions are represented in Fig. 4 (a). We first calculated the cluster validity indices by running the EGMM algorithm with different numbers of clusters. Table 3 shows the EBIC indices with the desired number of clusters ranging from 2 to 6. It indicates that the number of clusters should be chosen as c = 2, which is consistent with the real class distributions. Figs. 4 (b)-(d) show the clustering results (with c = 2) by ECM, bootGMM, and EGMM, respectively. It can be seen that ECM fails to recover the underlying structure of the dataset, because its Euclidean distance-based similarity measure can only discover hyperspherical clusters. The proposed EGMM accurately recovers the two underlying hyperellipsoidal clusters thanks to the adaptive similarity measure derived via MLE. This example demonstrates that the proposed EGMM is more powerful than ECM for distinguishing hyperellipsoidal clusters with arbitrary orientation and shape. As for bootGMM, it successfully recovers the two underlying hyperellipsoidal clusters by fitting the model based on mixtures of Gaussian distributions with arbitrary geometries. However, it fails to detect the ambiguous objects lying at the cluster boundary via the hard evidential partition, as quite small evidential membership is assigned to the meta-cluster {ω_1, ω_2} for these objects. By comparison, the proposed EGMM can automatically detect these ambiguous objects thanks to the mixture model constructed over the powerset of the desired clusters.

Cluster number 2 3 4 5 6
EBIC index -3395.6 -3413.1 -3425.6 -3443.7 -3466.8
Table 3: Two-class dataset: EBIC indices for different numbers of clusters
(a) The dataset
(b) Hard evidential partition result and the cluster centers by ECM
(c) Hard evidential partition result and the cluster centers by bootGMM
(d) Hard evidential partition result and the cluster centers by EGMM
Figure 4: Two-class dataset: clustering results by ECM, bootGMM, and EGMM

4.1.3 Four-class dataset

In the third example, a more complex dataset is considered to illustrate the interest of the evidential partition obtained by the proposed EGMM. This dataset is composed of four classes of 200 points, generated from Gaussian distributions with the same covariance matrix and different mean vectors. The dataset and the contours of the distributions are represented in Fig. 5 (a). We first calculated the cluster validity indices by running the EGMM algorithm with different numbers of clusters. Table 4 shows the EBIC indices with the desired number of clusters ranging from 2 to 6. Noting that the maximum is obtained for c = 4 clusters, the underlying structure of the dataset is correctly discovered. Fig. 5 (b) shows the hard evidential partition result (represented by convex hulls) and the cluster centers (marked by red crosses) with c = 4. It can be seen that the four clusters are accurately recovered, and the points that lie at the cluster boundaries are assigned to ambiguous sets of clusters. Apart from the hard evidential partition, it is also possible to characterize each cluster ω_k by two sets: the set of objects which can be classified as ω_k without any ambiguity, and the set of objects which could possibly be assigned to ω_k Masson and Denœux (2008). These two sets, referred to as the lower and upper approximations of ω_k, are defined respectively as the objects whose highest-mass focal set is exactly {ω_k} and the objects whose highest-mass focal set contains ω_k (a small code sketch is given after Fig. 5). Figs. 5 (c) and (d) show the lower and upper approximations of each cluster, which provide a pessimistic and an optimistic clustering result, respectively. This example demonstrates that the evidential partition generated by the proposed EGMM is quite intuitive and easier to interpret than the numerical probabilities obtained by the GMM, and can provide much richer partition information than the classical hard partition.

Cluster number 2 3 4 5 6
EBIC index -3648.5 -3666.5 -3635.2 -3655.1 -3679.9
Table 4: Four-class dataset: EBIC indices for different numbers of clusters
(a) The dataset
(b) Hard evidential partition result and the cluster centers
(c) Lower approximations of the four clusters
(d) Upper approximations of the four clusters
Figure 5: Four-class dataset: interpretation of the evidential partition generated by EGMM
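The lower and upper approximations used in this example can be computed directly from the hard evidential partition. The sketch below is our own helper (not code from the paper); it assumes the evidential memberships are stored as an n × f array aligned with a list of focal sets given as tuples of cluster indices.

```python
# Lower/upper approximations of each cluster from the hard evidential partition.
import numpy as np

def approximations(gamma, subsets, c):
    """gamma: (n, f) evidential memberships; subsets: focal sets as index tuples."""
    a_star = [subsets[j] for j in np.argmax(gamma, axis=1)]   # max-mass focal sets
    lower = {k: [i for i, A in enumerate(a_star) if A == (k,)] for k in range(c)}
    upper = {k: [i for i, A in enumerate(a_star) if k in A] for k in range(c)}
    return lower, upper
```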

4.2 Real data test

In this subsection, we evaluate the performance of the proposed EGMM on several classical benchmark datasets from the UCI Machine Learning Repository Dua and Karra Taniskidou (2020), whose characteristics are summarized in Table 5. The clustering results obtained with the proposed EGMM are compared with those of the following representative clustering algorithms:

  • HCM Jain (2010): hard c-means (function kmeans in the MATLAB Statistics toolbox).

  • FCM Bezdek (1981): fuzzy c-means (function fcm in the MATLAB Fuzzy Logic toolbox).

  • ECM Masson and Denœux (2008): evidential c-means (function ECM in the MATLAB Evidential Clustering package, available at https://www.hds.utc.fr/~tdenoeux/dokuwiki/en/software).

  • GMM Aggarwal and Reddy (2014): general Gaussian mixture model without constraints on covariance (function fitgmdist in the MATLAB Statistics toolbox).

  • GMM (constrained) Aggarwal and Reddy (2014): Gaussian mixture model with constant covariance across clusters (function fitgmdist with ‘SharedCovariance’ = true).

  • bootGMM Denœux (2020): calibrated model-based evidential clustering by bootstrapping the best-fitting GMM (function bootclus in the R package evclust).

For determining the number of clusters, the validity indices of modified partition coefficient (MPC) Davé (1996) and average normalized specificity (ANS) Masson and Denœux (2008) are used for FCM and ECM, respectively, and the classical BIC Fraley and Raftery (2002) is used for the three model-based algorithms including GMM, GMM (constrained) and bootGMM.

Datasets # Instances # Features # Classes
Iris 150 4 3
Knowledge 403 5 4
Seeds 210 7 3
Vehicle 846 18 4
Wine 178 13 3
Table 5: Characteristics of the real datasets used in the experiment

To perform a fair evaluation of the clustering results, hard partitions are adopted for all the considered algorithms. For the three evidential clustering algorithms, hard partitions are obtained by selecting the cluster with maximum pignistic probability for each object. The following three common external criteria are used for evaluation Manning et al. (2008):

  • Purity: Purity is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned objects and dividing by n. Formally,

    \mathrm{Purity} = \frac{1}{n} \sum_{k} \max_{j} |C_k \cap L_j|,        (34)

    where {C_1, C_2, …} is the set of partitioned clusters and {L_1, L_2, …} is the set of actual classes.

  • NMI (Normalized Mutual Information): NMI is an information-theoretic evaluation measure, which is defined as

    \mathrm{NMI} = \frac{I(C; L)}{[H(C) + H(L)]/2},        (35)

    where I(·;·) and H(·) denote the mutual information and the entropy, respectively, of the cluster and class assignments.

  • ARI (Adjusted Rand Index): ARI is a pair-counting based evaluation measure, which is defined as

    \mathrm{ARI} = \frac{2(\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FN} \cdot \mathrm{FP})}{(\mathrm{TP}+\mathrm{FN})(\mathrm{FN}+\mathrm{TN}) + (\mathrm{TP}+\mathrm{FP})(\mathrm{FP}+\mathrm{TN})},        (36)

    where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pairs of objects, respectively.
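For reference, the three measures can be computed as in the sketch below (our own code, assuming NumPy integer label vectors); in practice, library implementations such as sklearn's normalized_mutual_info_score and adjusted_rand_score can be used for NMI and ARI.

```python
# Purity (Eq. (34)) and pair-counting ARI (Eq. (36)); illustrative, O(n^2) pair loop.
import numpy as np
from itertools import combinations

def purity(pred, true):
    clusters = np.unique(pred)
    return sum(np.bincount(true[pred == k]).max() for k in clusters) / len(true)

def ari(pred, true):
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true)), 2):
        same_c, same_l = pred[i] == pred[j], true[i] == true[j]
        tp += same_c and same_l
        fp += same_c and not same_l
        fn += (not same_c) and same_l
        tn += (not same_c) and not same_l
    num = 2.0 * (tp * tn - fn * fp)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den
```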

For each dataset, two experimental settings were considered. In the first case, the number of clusters was unknown and had to be determined based on the associated validity indices. For all the algorithms except HCM (which requires the number of clusters to be known in advance), the number of clusters was searched between 2 and 6. All algorithms were run 10 times, and the average estimated number of clusters was calculated for each algorithm. For evaluating the clustering performance, the average NMI and ARI were calculated for each algorithm using the corresponding estimated number of clusters. Note that the purity measure was not used here, because it is severely affected by the number of clusters, as indicated in Manning et al. (2008) (high purity is easy to achieve when the number of clusters is large). In the second case, the number of clusters was assumed to be known in advance. All algorithms were run 10 times using the correct number of clusters, and the average purity, NMI, and ARI were calculated for each algorithm.

Tables 6-10 show the clustering results of the different algorithms on the five considered datasets. We can see that the proposed EGMM performed best at determining the number of clusters (obtaining the best estimation accuracy on four of the five datasets, all except Vehicle), while the performance of the other algorithms was generally unstable. Comparing the quality of the obtained partitions, as measured by purity, NMI, and ARI, the proposed EGMM performed well both when the number of clusters was unknown and when it was known. When the number of clusters was assumed to be unknown, it obtained the best results for Iris, Knowledge, and Seeds, and the second best results for the other two datasets. When the number of clusters was known in advance, it obtained the best results for Knowledge, Seeds, and Vehicle, and the second best results for the other two datasets. These results show the superiority of the proposed EGMM both in finding the number of clusters and in clustering the data.

Measures HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Estimated c (c unknown) n/a 4.1±1.45 2.0±0 2.0±0 4.5±0.85 2.0±0 3.4±0.52
NMI (c unknown) n/a 0.70±0.01 0.57±0.05 0.73±0 0.77±0.06 0.73±0 0.87±0.05
ARI (c unknown) n/a 0.62±0.02 0.54±0 0.57±0 0.72±0.11 0.57±0 0.85±0.10
Purity (c fixed at 3) 0.87±0.07 0.89±0 0.89±0 0.89±0.15 0.87±0.14 0.97±0.01 0.93±0.09
NMI (c fixed at 3) 0.74±0 0.75±0 0.76±0 0.84±0.13 0.83±0.11 0.91±0.01 0.87±0.05
ARI (c fixed at 3) 0.70±0.10 0.73±0 0.73±0 0.81±0.21 0.78±0.21 0.92±0.01 0.85±0.10
Table 6: Clustering results on Iris dataset (mean ± standard deviation over 10 runs; n/a: HCM requires the number of clusters to be known)
Measures HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Estimated c (c unknown) n/a 4.9±1.10 2.0±0 2.5±0.53 2.5±0.85 3.0±0 3.8±0.63
NMI (c unknown) n/a 0.29±0.02 0.33±0 0.36±0.12 0.02±0.02 0.39±0 0.43±0.05
ARI (c unknown) n/a 0.22±0.02 0.28±0 0.26±0.11 0.01±0 0.29±0 0.31±0.04
Purity (c fixed at 4) 0.57±0.05 0.51±0.01 0.51±0.01 0.62±0.02 0.38±0.10 0.49±0.01 0.63±0.04
NMI (c fixed at 4) 0.36±0.03 0.29±0.03 0.29±0.02 0.39±0.02 0.10±0.17 0.26±0.01 0.43±0.05
ARI (c fixed at 4) 0.25±0.03 0.23±0.03 0.23±0.03 0.28±0.02 0.07±0.14 0.21±0.01 0.31±0.04
Table 7: Clustering results on Knowledge dataset
Measures HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Estimated c (c unknown) n/a 4.7±1.06 5.4±0.52 2.0±0 3.1±0.31 4.0±0 3.0±0
NMI (c unknown) n/a 0.61±0 0.58±0 0.60±0 0.78±0.07 0.59±0.01 0.80±0
ARI (c unknown) n/a 0.52±0 0.52±0 0.51±0 0.81±0.13 0.53±0.03 0.85±0
Purity (c fixed at 3) 0.89±0 0.90±0 0.89±0 0.84±0.09 0.92±0.09 0.89±0.01 0.95±0
NMI (c fixed at 3) 0.70±0.01 0.69±0 0.66±0 0.68±0.08 0.78±0.07 0.69±0.02 0.80±0
ARI (c fixed at 3) 0.71±0 0.72±0 0.72±0 0.65±0.11 0.81±0.13 0.72±0.02 0.85±0
Table 8: Clustering results on Seeds dataset
Measures HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Estimated c (c unknown) n/a 4.1±1.45 6.0±0 5.1±0.57 4.7±0.95 6.0±0 5.9±0.32
NMI (c unknown) n/a 0.18±0 0.12±0 0.19±0.03 0.15±0.04 0.35±0.01 0.24±0
ARI (c unknown) n/a 0.12±0 0.14±0 0.13±0.03 0.10±0.03 0.21±0.01 0.13±0
Purity (c fixed at 4) 0.44±0.01 0.45±0 0.40±0 0.43±0.03 0.41±0.03 0.44±0.01 0.45±0.18
NMI (c fixed at 4) 0.19±0.01 0.18±0 0.13±0 0.17±0.04 0.17±0.03 0.20±0.01 0.21±0.02
ARI (c fixed at 4) 0.12±0 0.12±0 0.12±0 0.12±0.03 0.10±0.02 0.16±0.01 0.14±0.02
Table 9: Clustering results on Vehicle dataset
Measures HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Estimated c (c unknown) n/a 4.1±1.45 6.0±0 2.0±0 5.1±0.88 4.0±0 3.8±0.42
NMI (c unknown) n/a 0.36±0 0.35±0.01 0.58±0.07 0.81±0.03 0.95±0 0.82±0.05
ARI (c unknown) n/a 0.27±0 0.24±0 0.44±0.06 0.80±0.05 0.97±0 0.83±0.06
Purity (c fixed at 3) 0.69±0.01 0.69±0 0.69±0 0.87±0.12 0.84±0.13 0.98±0.01 0.85±0.13
NMI (c fixed at 3) 0.42±0.01 0.42±0 0.40±0 0.70±0.19 0.73±0.15 0.95±0.01 0.81±0.14
ARI (c fixed at 3) 0.36±0.02 0.35±0 0.35±0 0.70±0.19 0.69±0.20 0.97±0.01 0.75±0.21
Table 10: Clustering results on Wine dataset

At the end of this section, we evaluate the run time of the considered clustering algorithms. The computations were executed on a Microsoft Surface Book with an Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz and 8 GB of memory. All algorithms were tested on the MATLAB platform, except bootGMM, which was tested on the R platform. As both MATLAB and R are scripting languages, their execution efficiency is roughly at the same level. Table 11 shows the average run time of the different algorithms on the five considered datasets. It can be seen that the three evidential clustering algorithms (i.e., ECM, bootGMM, EGMM) generally cost more time than the non-evidential ones, mainly because more independent parameters need to be estimated for the evidential partition. Among these three evidential clustering algorithms, the proposed EGMM runs fastest. In particular, it shows an obvious advantage over bootGMM, which spends a great deal of time in the bootstrapping and calibration procedures.

Datasets HCM FCM ECM GMM GMM (constrained) bootGMM EGMM
Iris 0.004 0.004 0.085 0.012 0.008 7.610 0.029
Knowledge 0.004 0.014 0.080 0.030 0.031 56.660 0.035
Seeds 0.003 0.003 0.078 0.013 0.011 18.160 0.045
Vehicle 0.006 0.022 4.134 0.048 0.034 167.790 2.255
Wine 0.003 0.005 0.116 0.009 0.006 46.770 0.052
Table 11: CPU time (second) on different datasets

5 Conclusions

In this paper, a new model-based approach to evidential clustering has been proposed. It is based on the notion of evidential partition, which extends the probabilistic (or fuzzy), possibilistic, and rough partitions. Different from the approximately calibrated approach in Denœux (2020), our proposal generates the evidential partition directly by searching for the maximum likelihood solution of the new EGMM via an EM algorithm. In addition, a validity index is presented to determine the number of clusters automatically. The proposed EGMM is convenient to use: it has no open parameter, and its convergence properties are well guaranteed. More importantly, the generated evidential partition provides a more complete description of the clustering structure than the probabilistic (or fuzzy) partition provided by the GMM. Examples have shown that more meaningful partitions of the datasets can be obtained. We have also demonstrated the applicability of this approach to several real datasets, showing that the proposed method generally performs better than some other prototype-based and model-based algorithms in finding a partition with an unknown or known number of clusters.

As indicated in Banfield and Raftery (1993), different kinds of constraints can be imposed on the covariance matrices of the GMM, which results in a total of fourteen models with different assumptions on the shape, volume and orientation of the clusters. In our work, the commonly used model with equal covariance matrix is adopted to develop the evidential clustering algorithm. It is quite interesting to further study evidential versions of the GMM with other constraints. This research direction will be explored in future work.

Appendix A EM solution for the EGMM: The E-step

In the E-step, we need to derive the Q-function by computing the conditional expectation of the complete-data log-likelihood log L_c(Θ) given X, using the current fit Θ^{(t)} for Θ, i.e.,

Q(\Theta; \Theta^{(t)}) = E_{Z}\big[\log L_c(\Theta) \mid X, \Theta^{(t)}\big].        (37)

By substituting the expression of the complete-data log-likelihood in Eq. (21) into the above formula, we have

Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{f} E\big[z_{ij} \mid x_i, \Theta^{(t)}\big] \big[\log \pi_j + \log \mathcal{N}(x_i \mid \bar{\mu}_j, \Sigma)\big],        (38)

with

E\big[z_{ij} \mid x_i, \Theta^{(t)}\big] = P\big(z_{ij}=1 \mid x_i, \Theta^{(t)}\big) = \frac{\pi_j^{(t)}\, \mathcal{N}(x_i \mid \bar{\mu}_j^{(t)}, \Sigma^{(t)})}{\sum_{l=1}^{f} \pi_l^{(t)}\, \mathcal{N}(x_i \mid \bar{\mu}_l^{(t)}, \Sigma^{(t)})} = \gamma_{ij}^{(t)},        (39)

from which we obtain Eqs. (22) and (23) of the E-step.

Appendix B EM solution for the EGMM: The M-step

In the M-step, we need to maximize the Q-function derived in the E-step with respect to the involved parameters: the mixing probabilities of the