A Bayesian non-parametric method for clustering high-dimensional binary data

03/08/2016
by Tapesh Santra, et al.

In many real-life problems, objects are described by a large number of binary features. For instance, documents are characterized by the presence or absence of certain keywords, and cancer patients are characterized by the presence or absence of certain mutations. In such cases, grouping similar objects or profiles based on these high-dimensional binary features is desirable but challenging. Here, I present a Bayesian non-parametric algorithm for clustering high-dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing not only to cluster binary data, but also to find the optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with that of other algorithms using simulated datasets, where it outperformed all the other clustering methods tested. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents by topic, grouped handwritten digit images by writing style, and identified the tissue and mutation specificity of chemotherapy treatments.
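To make the general idea concrete, here is a minimal sketch of how a DP mixture over binary vectors can be searched with simulated annealing. It is not the paper's algorithm: the abstract does not give the likelihood, priors, proposal moves or cooling schedule, so everything below is an assumption chosen for illustration. The sketch assumes a collapsed Beta-Bernoulli likelihood per feature, a Chinese restaurant process prior with concentration `alpha`, single-item reassignment moves, and a geometric cooling schedule. The function names (`partition_score`, `anneal_clusters`) and all parameter defaults are hypothetical.

```python
import math
import random
from collections import defaultdict


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def cluster_score(rows, alpha, a, b):
    """Unnormalised log CRP weight of one cluster plus its collapsed
    Beta-Bernoulli marginal likelihood (assumed model, not the paper's)."""
    n = len(rows)
    score = math.log(alpha) + math.lgamma(n)  # CRP term: alpha * (n - 1)!
    for d in range(len(rows[0])):
        ones = sum(r[d] for r in rows)
        score += log_beta(a + ones, b + n - ones) - log_beta(a, b)
    return score


def partition_score(data, labels, alpha=1.0, a=1.0, b=1.0):
    """Log posterior of a partition, up to a constant independent of labels."""
    groups = defaultdict(list)
    for row, z in zip(data, labels):
        groups[z].append(row)
    return sum(cluster_score(rows, alpha, a, b) for rows in groups.values())


def anneal_clusters(data, alpha=1.0, a=1.0, b=1.0, steps=3000,
                    t_start=2.0, t_end=0.05, seed=0):
    """Search for a high-scoring partition by simulated annealing.

    Each step proposes moving one item to an existing or a new cluster,
    so the number of clusters is not fixed in advance. The proposal is
    used as an optimisation heuristic only (no Hastings correction).
    """
    rng = random.Random(seed)
    n = len(data)
    labels = [0] * n  # start with every item in a single cluster
    current = partition_score(data, labels, alpha, a, b)
    best_labels, best = labels[:], current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        old = labels[i]
        choices = sorted(set(labels)) + [max(labels) + 1]  # last = new cluster
        new = rng.choice(choices)
        if new == old:
            continue
        labels[i] = new
        proposed = partition_score(data, labels, alpha, a, b)
        accept = proposed >= current or \
            rng.random() < math.exp((proposed - current) / temp)
        if accept:
            current = proposed
            if current > best:
                best, best_labels = current, labels[:]
        else:
            labels[i] = old  # reject: undo the move
    return best_labels, best
```

Calling `anneal_clusters(data)` on a list of equal-length 0/1 lists returns the best labelling visited and its score. The score is recomputed from scratch at every step to keep the sketch short. For data of the dimensionality the abstract describes, one would instead update per-cluster counts incrementally.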


research
07/02/2020

A Novel Graph Based Clustering Approach to Document Topic Modeling

Clustering is the task of assigning a set of objects into groups so that...
research
11/02/2018

A Fast Algorithm for Clustering High Dimensional Feature Vectors

We propose an algorithm for clustering high dimensional data. If P featu...
research
02/18/2019

Going deep in clustering high-dimensional data: deep mixtures of unigrams for uncovering topics in textual data

Mixtures of Unigrams (Nigam et al., 2000) are one of the simplest and mo...
research
07/11/2017

Efficient mixture model for clustering of sparse high dimensional binary data

In this paper we propose a mixture model, SparseMix, for clustering of s...
research
12/14/2016

Border-Peeling Clustering

In this paper, we present a novel non-parametric clustering technique, w...
research
08/20/2021

A comparison of different clustering approaches for high-dimensional presence-absence data

Presence-absence data is defined by vectors or matrices of zeroes and on...
research
07/31/2020

Bayesian Approaches for Flexible and Informative Clustering of Microbiome Data

We propose two unsupervised clustering methods that are designed for hum...
