A Novel Bayesian Cluster Enumeration Criterion for Unsupervised Learning

The Bayesian Information Criterion (BIC) has been widely used for decades to estimate the number of clusters in an observed data set. The original derivation, referred to as the classic BIC, does not incorporate information about the specific model selection problem at hand, which renders it generic. However, very little effort has been made to check its appropriateness for cluster analysis. In this paper we derive BIC from first principles by formulating the problem of estimating the number of clusters in a data set as the maximization of the posterior probability of candidate models given the observations. We provide a general BIC expression that is independent of the data distribution, provided that some mild assumptions are satisfied. This serves as an important starting point when deriving BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed observations. We show that incorporating the clustering problem into the derivation of BIC yields an expression whose penalty term differs from that of the classic BIC. We propose a two-step cluster enumeration algorithm that uses a model-based unsupervised learning algorithm to partition the observed data according to each candidate model, and the proposed BIC to select the model with the optimal number of clusters. The performance of the proposed criterion is tested on synthetic and real-data examples. Simulation results show that our proposed criterion outperforms existing BIC-based cluster enumeration methods. It is particularly powerful for estimating the number of data clusters when the observations contain unbalanced and overlapping clusters.
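The two-step procedure described above can be sketched as follows. Since the abstract does not give the proposed penalty term, this illustration uses scikit-learn's classic GMM BIC as a stand-in score; the paper's criterion would replace that scoring step. The data set, candidate range, and cluster centers are all hypothetical.

```python
# Hedged sketch of the two-step cluster enumeration procedure:
# (1) partition the data under each candidate model with a model-based
#     unsupervised learner (EM for a Gaussian mixture), then
# (2) score each candidate with a BIC and select the minimizer.
# NOTE: sklearn's .bic() is the *classic* BIC; the paper derives a
# criterion with a different penalty term, not reproduced here.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with a known number of clusters (3), for illustration.
X, _ = make_blobs(n_samples=300,
                  centers=[(-5, -5), (0, 5), (6, -2)],
                  cluster_std=0.7, random_state=0)

candidates = range(1, 7)
scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in candidates}
best_k = min(scores, key=scores.get)
print(best_k)
```

With well-separated synthetic clusters, the classic BIC already recovers the true number; the paper's contribution is a penalty tailored to the clustering problem, which the authors report is especially helpful for unbalanced, overlapping clusters.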


Robust Bayesian Cluster Enumeration

A major challenge in cluster analysis is that the number of data cluster...

Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions

Robustly determining the optimal number of clusters in a data set is an ...

A novel cluster internal evaluation index based on hyper-balls

It is crucial to evaluate the quality and determine the optimal number o...

Data Consistency Approach to Model Validation

In scientific inference problems, the underlying statistical modeling as...

VARCLUST: clustering variables using dimensionality reduction

VARCLUST algorithm is proposed for clustering variables under the assump...

Selecting the Number of Clusters K with a Stability Trade-off: an Internal Validation Criterion

Model selection is a major challenge in non-parametric clustering. There...

Model Order Selection Based on Information Theoretic Criteria: Design of the Penalty

Information theoretic criteria (ITC) have been widely adopted in enginee...
