An Efficient k-modes Algorithm for Clustering Categorical Datasets

06/06/2020
by Karin S. Dorman, et al.

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data, but it cannot be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure, and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel, and all appropriate for k-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best initializer, that a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
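
To make the modified objective concrete, the following is a minimal sketch of a generic alternating (Lloyd-style) k-modes iteration. It is not the OTQT algorithm from the paper, whose details are not given in this abstract. It assumes the simple matching (Hamming) dissimilarity as the replacement measure, and the names `hamming`, `column_modes`, and `k_modes`, along with the `init`, `max_iter`, and `seed` parameters, are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of coordinates that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of equal-length tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, init=None, max_iter=100, seed=0):
    """Illustrative alternating k-modes (assumed Hamming dissimilarity).

    data: list of equal-length tuples of categorical values.
    init: optional list of k starting modes; if omitted, k observations
          are drawn at random (simple random initialization).
    Returns (labels, modes, cost), where cost is the summed dissimilarity
    between each observation and the mode of its assigned cluster.
    """
    rng = random.Random(seed)
    modes = list(init) if init is not None else rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: hamming(x, modes[j])) for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: recompute each cluster's mode coordinate by coordinate.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost
```

Each pass computes only the n × k dissimilarities between observations and modes, never all pairwise dissimilarities between observations, which is the scalability property the abstract refers to.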

04/24/2013

The K-modes algorithm for clustering

Many clustering algorithms exist that estimate a cluster centroid, such ...
11/19/2020

Similarity-based Distance for Categorical Clustering using Space Structure

Clustering is spotting pattern in a group of objects and resultantly gro...
04/12/2017

Finding Modes by Probabilistic Hypergraphs Shifting

In this paper, we develop a novel paradigm, namely hypergraph shift, to ...
09/30/2019

K-Metamodes: frequency- and ensemble-based distributed k-modes clustering for security analytics

Nowadays processing of Big Security Data, such as log messages, is commo...
09/25/2020

Skew Brownian Motion and Complexity of the ALPS Algorithm

Simulated tempering is a popular method of allowing MCMC algorithms to m...
06/02/2021

Band Depth based initialization of k-Means for functional data clustering

The k-Means algorithm is one of the most popular choices for clustering ...
01/31/2019

Generalized Dirichlet-process-means for f-separable distortion measures

DP-means clustering was obtained as an extension of K-means clustering. ...