Benchmarking distance-based partitioning methods for mixed-type data

03/30/2022
by Efthymios Costa, et al.

Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing six distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations following a full factorial design examined the effect of a variety of factors on cluster recovery. The amount of cluster overlap had the largest effect on cluster recovery, and in most of the tested scenarios Modha-Spangler K-Means, K-Prototypes and a sequential Factor Analysis and K-Means clustering typically performed better than the other methods. The study can serve as a useful reference for practitioners choosing the most appropriate method.
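To make the two technical ideas in the abstract concrete, the sketch below enumerates a full factorial design (every level of every factor crossed with every other) and scores cluster recovery with the adjusted Rand index. Both parts are assumptions for illustration: the factor names and levels in `FACTORS` are hypothetical, and the abstract does not say which recovery measure the study used.

```python
from collections import Counter
from itertools import product
from math import comb

# Hypothetical factors and levels, for illustration only;
# the abstract does not list the ones used in the study.
FACTORS = {
    "n_clusters": [3, 5],
    "n_observations": [100, 600],
    "overlap": ["low", "high"],
}


def full_factorial(factors):
    """Return one dict per scenario, crossing every level of every factor."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two partitions (needs at least 2 points).

    Assumed here as the cluster recovery measure; 1.0 means perfect
    agreement, values near 0 mean chance-level agreement.
    """
    n = len(true_labels)
    # Pairs of points grouped together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs grouped together in each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


scenarios = full_factorial(FACTORS)
print(len(scenarios), "scenarios; first:", scenarios[0])
# Relabelled but identical partitions count as perfect recovery.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In a benchmark of this kind, each scenario would be used to generate data sets, each clustering method would be run on them, and the recovery score would be recorded per method and scenario.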


