Hybrid Density- and Partition-based Clustering Algorithm for Data with Mixed-type Variables

05/06/2019
by   Shu Wang, et al.

Clustering is an essential technique for discovering patterns in data. The steady increase in the amount and complexity of data over the years has led to improvements in existing clustering algorithms and the development of new ones. However, algorithms that can cluster data with mixed variable types (continuous and categorical) remain limited, despite the abundance of mixed-type data, particularly in the medical field. Among existing methods for mixed data, some rely on unverifiable distributional assumptions, while others do not balance the contributions of the different variable types well. We propose a two-step hybrid density- and partition-based algorithm (HyDaP) that can detect clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to recognize the variables that are important for clustering; the second step uses a partition-based algorithm together with a novel dissimilarity measure we designed for mixed data to obtain the clustering results. We conducted simulations across various scenarios and data structures to examine the performance of the HyDaP algorithm compared to commonly used methods. We also applied the HyDaP algorithm to electronic health records to identify sepsis phenotypes.
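
To make the idea of partition-based clustering on a mixed-data dissimilarity concrete, the following is a minimal sketch and not the HyDaP implementation. It assumes a Gower-style dissimilarity as a stand-in for the novel measure described in the abstract, which is not specified here, and it omits the first step (density-based structure detection and variable selection) entirely. The function names gower_dissimilarity and k_medoids, the farthest-point initialization, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def gower_dissimilarity(X_cont, X_cat):
    """Gower-style dissimilarity (an assumed stand-in, not the HyDaP measure).

    Continuous variables contribute a range-scaled absolute difference,
    categorical variables contribute a 0/1 mismatch, and the result is
    the average over all variables.
    """
    X_cont = np.asarray(X_cont, dtype=float)
    X_cat = np.asarray(X_cat)
    ranges = X_cont.max(axis=0) - X_cont.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    d_cont = (np.abs(X_cont[:, None, :] - X_cont[None, :, :]) / ranges).sum(axis=2)
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    n_vars = X_cont.shape[1] + X_cat.shape[1]
    return (d_cont + d_cat) / n_vars


def k_medoids(D, k, n_iter=100):
    """Simple alternating k-medoids on a precomputed dissimilarity matrix.

    Initialization is deterministic: start from the most central point,
    then repeatedly add the point farthest from the medoids chosen so far.
    """
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        # Update step: the new medoid minimizes total within-cluster dissimilarity.
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids

    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids


# Toy mixed-type data: two continuous variables and one categorical variable.
X_cont = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
    [5.0, 5.1], [5.2, 5.0], [5.1, 5.3], [5.3, 5.2],
])
X_cat = np.array([["a"], ["a"], ["a"], ["a"], ["b"], ["b"], ["b"], ["b"]])

D = gower_dissimilarity(X_cont, X_cat)
labels, medoids = k_medoids(D, k=2)
print(labels)
```

Working from a precomputed dissimilarity matrix keeps the partition step independent of how the continuous and categorical contributions are weighted, so a different mixed-data measure could be swapped in without changing k_medoids.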


