An Efficient k-modes Algorithm for Clustering Categorical Datasets

06/06/2020
by Karin S. Dorman, et al.

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data, but it cannot be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure, and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable because they do not require calculation of all pairwise dissimilarities. We provide a computationally efficient implementation of k-modes, OTQT, and prove that it can find clusterings superior to those of existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel and all appropriate for k-modes. By examining performance on real and simulated datasets, we show that simple random initialization is the best initializer, that a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
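
To make the idea of swapping means for modes concrete, the following is a minimal sketch of a classic alternating (Lloyd-style) k-modes iteration with simple random initialization. It is not the OTQT algorithm, whose details the abstract does not give. The abstract also does not name the dissimilarity measure, so the Hamming (simple matching) distance used here is an assumption, and the names hamming, column_modes, and k_modes are illustrative only.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of coordinates at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of equal-length records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Alternate between assigning records to the nearest mode and updating modes.

    data is a list of tuples of categorical values (assumed hashable and sortable).
    Returns (labels, modes, cost), where cost is the summed Hamming distance
    from each record to the mode of its assigned cluster.
    """
    rng = random.Random(seed)
    # Simple random initialization: k distinct observations become the initial modes.
    modes = rng.sample(sorted(set(data)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the closest mode.
        new_labels = [min(range(k), key=lambda j: hamming(x, modes[j])) for x in data]
        if new_labels == labels:
            break  # no record changed cluster, so the iteration has converged
        labels = new_labels
        # Update step: each non-empty cluster's mode becomes its coordinate-wise mode.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
labels, modes, cost = k_modes(data, k=2)
```

Each pass compares every record with only the k current modes, which illustrates why no pairwise dissimilarity matrix is needed. Like k-means, this kind of iteration reaches only a local optimum, so in practice it would typically be rerun from several random initializations.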

