Bayesian nonparametric mixture inconsistency for the number of components: How worried should we be in practice?

07/29/2022
by Yannis Chaumeny, et al.

We consider Bayesian mixture of finite mixtures (MFM) and Dirichlet process mixture (DPM) models for clustering. Recent asymptotic theory has established that DPMs overestimate the number of clusters for large samples and that estimators from both classes of models are inconsistent for the number of clusters under misspecification, but the implications for finite-sample analyses are unclear. The final reported estimate after fitting these models is often a single representative clustering obtained using an MCMC summarisation technique, but it is unknown how well such a summary estimates the number of clusters. Here we investigate these practical considerations through simulations and an application to gene expression data, and find that (i) DPMs overestimate the number of clusters even in finite samples, but only to a limited degree that may be correctable using appropriate summaries, and (ii) misspecification can lead to considerable overestimation of the number of clusters in both DPMs and MFMs, but the results are nevertheless often still interpretable. We provide recommendations on MCMC summarisation and suggest that, although the more appealing asymptotic properties of MFMs provide strong motivation to prefer them, results obtained using MFMs and DPMs are often very similar in practice.
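
To make the idea of reducing MCMC output to a single representative clustering concrete, the following is a minimal sketch of one common summarisation approach. It is not necessarily the technique used by the authors. It assumes the sampler output is available as a list of label vectors (the name `samples` and all function names are hypothetical), builds a posterior similarity matrix, and returns the sampled clustering closest to that matrix in the least-squares sense. With equal misclassification costs, this choice also minimises the posterior expected Binder loss among the sampled clusterings.

```python
from itertools import combinations


def posterior_similarity(samples):
    """Estimate, for every pair (i, j), the posterior probability that
    observations i and j share a cluster, as the fraction of MCMC samples
    in which their labels coincide."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


def least_squares_distance(labels, psm):
    """Squared distance between the co-clustering indicators of `labels`
    and the posterior similarity matrix, summed over pairs i < j."""
    return sum(
        (float(labels[i] == labels[j]) - psm[i][j]) ** 2
        for i, j in combinations(range(len(labels)), 2)
    )


def representative_clustering(samples):
    """Return the sampled clustering closest to the posterior similarity
    matrix (a search restricted to the sampled clusterings only)."""
    psm = posterior_similarity(samples)
    return min(samples, key=lambda labels: least_squares_distance(labels, psm))


# Toy, hand-made "MCMC output" for four observations (illustrative only).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 1, 1, 1],
]
summary = representative_clustering(samples)
n_clusters = len(set(summary))
```

The number of clusters in such a summary (`n_clusters` above) can differ from the mode of the posterior distribution of the number of clusters, which is one reason the choice of summary matters when reporting a cluster count.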
