Significance-Based Categorical Data Clustering

11/08/2022
by Lianyu Hu, et al.

Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical p-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method achieves performance comparable to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation for statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
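The abstract does not state the exact form of the test statistic or of the search procedure, so the following is only a rough sketch of the general idea under stated assumptions. It assumes a standard likelihood-ratio (G) statistic that compares per-cluster category frequencies with the pooled frequencies, a simple accept-if-not-worse random reassignment search, and a null model built by permuting each attribute column independently. The function names `lr_statistic`, `monte_carlo_search`, and `empirical_p_value` are hypothetical and are not taken from the paper.

```python
import numpy as np


def lr_statistic(X, labels):
    """Assumed G-statistic: 2 * sum of n_kjc * log(n_kjc * n / (n_k * n_jc)).

    X is an (n, m) array of integer-coded categorical attributes and
    labels assigns each row to a cluster. This is an illustrative choice,
    not necessarily the statistic derived in the paper.
    """
    X = np.asarray(X)
    n, m = X.shape
    _, lab_idx = np.unique(np.asarray(labels), return_inverse=True)
    k = lab_idx.max() + 1
    total = 0.0
    for j in range(m):
        _, col_idx = np.unique(X[:, j], return_inverse=True)
        # Contingency table: clusters (rows) by categories (columns).
        table = np.zeros((k, col_idx.max() + 1))
        np.add.at(table, (lab_idx, col_idx), 1)
        expected = (table.sum(axis=1, keepdims=True)
                    * table.sum(axis=0, keepdims=True) / n)
        mask = table > 0  # empty cells contribute zero
        total += 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
    return total


def monte_carlo_search(X, k, n_iter=2000, seed=0):
    """Random single-object reassignments, kept only if the score does not drop."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    labels = rng.integers(k, size=X.shape[0])
    best = lr_statistic(X, labels)
    for _ in range(n_iter):
        i = rng.integers(X.shape[0])
        old = labels[i]
        labels[i] = rng.integers(k)
        score = lr_statistic(X, labels)
        if score >= best:
            best = score
        else:
            labels[i] = old  # revert a move that lowered the statistic
    return labels, best


def empirical_p_value(X, k, n_null=20, n_iter=2000, seed=0):
    """Share of null data sets whose optimized score reaches the observed one."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    _, observed = monte_carlo_search(X, k, n_iter, seed)
    exceed = 0
    for b in range(n_null):
        # Assumed null model: shuffle every column independently, which
        # keeps the marginals but removes any joint cluster structure.
        X_null = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        _, score = monte_carlo_search(X_null, k, n_iter, seed + b + 1)
        exceed += score >= observed
    return (exceed + 1) / (n_null + 1)
```

Running the same search on each null data set, rather than scoring fixed labels, matters here: the observed statistic is the result of an optimization, so the reference distribution should be optimized in the same way to avoid an overly optimistic p-value.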

research
01/31/2019

A Novel Initial Clusters Generation Method for K-means-based Clustering Algorithms for Mixed Datasets

Mixed datasets consist of numeric and categorical attributes. Various K-...
research
09/28/2021

An exact test for significance of clusters in binary data

Unsupervised clustering of feature matrix data is an indispensable techn...
research
12/09/2018

A matching based clustering algorithm for categorical data

Cluster analysis is one of the essential tasks in data mining and knowle...
research
10/22/2019

Hypergraph clustering with categorical edge labels

Graphs and networks are a standard model for describing data or systems ...
research
10/17/2019

Multi-level conformal clustering: A distribution-free technique for clustering and anomaly detection

In this work we present a clustering technique called multi-level confor...
research
02/20/2021

nTreeClus: a Tree-based Sequence Encoder for Clustering Categorical Series

The overwhelming presence of categorical/sequential data in diverse doma...
research
06/09/2015

Clustering by transitive propagation

We present a global optimization algorithm for clustering data given the...
