An Efficient k-modes Algorithm for Clustering Categorical Datasets

by Karin S. Dorman, et al.

Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data, but it cannot be applied to categorical-valued observations. The k-modes algorithm addresses this lacuna by taking the k-means objective function, replacing the dissimilarity measure, and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both k-modes and k-means are scalable because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of k-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of K-selection methods, many of them novel and all appropriate for k-modes. By examining performance on real and simulated datasets, we show that simple random initialization is the best initializer, that a novel K-selection method is more accurate than two methods adapted from k-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
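
To make the abstract's description concrete, the following is a minimal sketch of a basic alternating k-modes iteration. It is not the OTQT algorithm, whose details are not given here. It assumes the simple matching (Hamming) dissimilarity as the replacement for squared Euclidean distance, coordinate-wise modes as cluster centers, and random initialization from distinct observations. The names `hamming`, `column_modes`, and `k_modes` are hypothetical and chosen for illustration only.

```python
from collections import Counter
import random


def hamming(a, b):
    """Simple matching dissimilarity: number of coordinates where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of categorical tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Illustrative alternating k-modes (assumes max_iter >= 1 and at least
    k distinct observations); returns (labels, modes, cost)."""
    rng = random.Random(seed)
    # Random initialization: k distinct observations serve as the initial modes.
    modes = rng.sample(sorted(set(data)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest mode.
        # Only n * k dissimilarities are computed, never all n * n pairs.
        new_labels = [min(range(k), key=lambda j: hamming(x, modes[j]))
                      for x in data]
        if new_labels == labels:
            break  # assignments are stable, so the modes are stable too
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Objective: total dissimilarity between observations and their modes.
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost
```

Calling `k_modes(data, k)` on a list of equal-length tuples of category labels returns the cluster assignments, the fitted modes, and the objective value. Because the result depends on the random start, such a procedure is typically run from several seeds and the lowest-cost solution kept.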





The K-modes algorithm for clustering

Many clustering algorithms exist that estimate a cluster centroid, such ...

Similarity-based Distance for Categorical Clustering using Space Structure

Clustering is spotting pattern in a group of objects and resultantly gro...

Finding Modes by Probabilistic Hypergraphs Shifting

In this paper, we develop a novel paradigm, namely hypergraph shift, to ...

K-Metamodes: frequency- and ensemble-based distributed k-modes clustering for security analytics

Nowadays processing of Big Security Data, such as log messages, is commo...

Skew Brownian Motion and Complexity of the ALPS Algorithm

Simulated tempering is a popular method of allowing MCMC algorithms to m...

Band Depth based initialization of k-Means for functional data clustering

The k-Means algorithm is one of the most popular choices for clustering ...

Generalized Dirichlet-process-means for f-separable distortion measures

DP-means clustering was obtained as an extension of K-means clustering. ...