A Bayesian non-parametric method for clustering high-dimensional binary data

by Tapesh Santra, et al.

In many real-life problems, objects are described by a large number of binary features. For instance, documents are characterized by the presence or absence of certain keywords, and cancer patients by the presence or absence of certain mutations. In such cases, grouping similar objects or profiles based on these high-dimensional binary features is desirable but challenging. Here, I present a Bayesian non-parametric algorithm for clustering high-dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing not only to cluster binary data but also to find the optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with other algorithms using simulated datasets. It outperformed all other clustering methods tested in the simulation studies. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents by topic and handwritten images by digit-writing style, and identified the tissue and mutation specificity of chemotherapy treatments.
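To make the combination of a DP mixture model and simulated annealing concrete, here is a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It is a generic annealed collapsed Gibbs sampler for a DP mixture of independent Bernoulli features, assuming a Beta(a, b) prior on each feature probability and a geometric cooling schedule. The function name `anneal_dp_bernoulli` and all of its parameters are hypothetical.

```python
import numpy as np

def anneal_dp_bernoulli(X, alpha=1.0, a=1.0, b=1.0,
                        n_sweeps=50, t_start=2.0, t_end=0.1, seed=0):
    """Illustrative only: annealed collapsed Gibbs sweeps for a DP mixture
    of Bernoulli features. Returns one integer cluster label per row of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    z = np.zeros(n, dtype=int)        # start with every row in cluster 0
    counts = {0: n}                   # cluster label -> number of rows
    ones = {0: X.sum(axis=0)}         # cluster label -> per-feature count of 1s
    next_label = 1
    # Assumed cooling schedule: temperature falls geometrically per sweep.
    for T in np.geomspace(t_start, t_end, n_sweeps):
        for i in range(n):
            # Remove row i from its current cluster.
            k_old = z[i]
            counts[k_old] -= 1
            ones[k_old] = ones[k_old] - X[i]
            if counts[k_old] == 0:
                del counts[k_old], ones[k_old]
            labels = list(counts)
            logp = np.empty(len(labels) + 1)
            for j, k in enumerate(labels):
                # Posterior predictive P(x = 1) per feature under Beta(a, b).
                p1 = (ones[k] + a) / (counts[k] + a + b)
                loglik = np.sum(X[i] * np.log(p1) + (1 - X[i]) * np.log1p(-p1))
                logp[j] = np.log(counts[k]) + loglik   # CRP weight: cluster size
            # Weight for opening a new cluster: CRP weight alpha, prior predictive.
            p1_new = a / (a + b)
            logp[-1] = np.log(alpha) + np.sum(
                X[i] * np.log(p1_new) + (1 - X[i]) * np.log1p(-p1_new))
            # Annealing: temper the conditional, so low T approaches a greedy choice.
            logp = logp / T
            prob = np.exp(logp - logp.max())
            prob /= prob.sum()
            choice = rng.choice(len(prob), p=prob)
            if choice == len(labels):
                k_new = next_label
                next_label += 1
                counts[k_new] = 0
                ones[k_new] = np.zeros(d)
            else:
                k_new = labels[choice]
            counts[k_new] += 1
            ones[k_new] = ones[k_new] + X[i]
            z[i] = k_new
    # Relabel clusters as 0..K-1.
    return np.unique(z, return_inverse=True)[1]

# Toy usage: two groups of binary profiles with disjoint active features.
X_demo = np.vstack([np.tile([1] * 5 + [0] * 5, (8, 1)),
                    np.tile([0] * 5 + [1] * 5, (8, 1))])
labels_demo = anneal_dp_bernoulli(X_demo)
```

Because the Chinese restaurant process prior always leaves some probability of opening a new cluster, the number of clusters is inferred from the data rather than fixed in advance. Lowering the temperature gradually turns the sampler into a search for a single high-probability partition.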




