Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning

06/02/2023
by Jesse S. Ghashti, et al.

Distance-based clustering and classification are widely used across many fields to group mixed numeric and categorical data. Many such algorithms rely on a predefined distance measure to cluster data points by their dissimilarity. Numerous distance measures exist for purely numerical attributes, and several for ordered and unordered categorical attributes, but an efficient and accurate distance for mixed-type data that uses the continuous and discrete properties simultaneously remains an open problem. Many existing metrics convert numerical attributes to categorical ones or vice versa, so that the data are handled as a single attribute type, or compute a distance for each attribute separately and add the results. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with bandwidths selected by cross-validation. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and that it improves clustering accuracy when used in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data.
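To make the idea of a kernel-based mixed-type dissimilarity concrete, here is a minimal Python sketch. It is not the authors' KDSUM definition, which the abstract does not spell out. It assumes a Gaussian kernel for numeric attributes and an Aitchison-Aitken kernel for unordered categorical attributes, and it sums, per attribute, the gap between the kernel's value at a perfect match and its value for the observed pair. All function and parameter names are hypothetical, and the bandwidths are fixed by hand rather than chosen by cross-validation.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian kernel on a numeric difference u with bandwidth h > 0."""
    return math.exp(-0.5 * (u / h) ** 2)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for unordered categories.

    lam is the bandwidth, in [0, (n_levels - 1) / n_levels].
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def mixed_kernel_dissimilarity(x, y, num_idx, cat_idx, h, lam, n_levels):
    """Sum over attributes of k(match) - k(x_j, y_j).

    Assumed form for illustration only; h, lam and n_levels are dicts
    keyed by attribute index.
    """
    total = 0.0
    for j in num_idx:
        # The Gaussian kernel equals 1 when the two values coincide.
        total += 1.0 - gaussian_kernel(x[j] - y[j], h[j])
    for j in cat_idx:
        # The Aitchison-Aitken kernel equals 1 - lam for matching levels.
        k_match = 1.0 - lam[j]
        total += k_match - aitchison_aitken_kernel(x[j], y[j], lam[j], n_levels[j])
    return total


if __name__ == "__main__":
    x = (0.0, "red")
    y = (1.0, "blue")
    d = mixed_kernel_dissimilarity(
        x, y, num_idx=[0], cat_idx=[1],
        h={0: 1.0}, lam={1: 0.2}, n_levels={1: 3},
    )
    print(d)
```

In this sketch the bandwidths control how much each attribute contributes. As a numeric bandwidth `h` grows, that attribute's term tends to zero, and at `lam = (c - 1) / c` for a categorical attribute with `c` levels the kernel is uniform and its term is exactly zero, so every pair of points looks equally dissimilar on that attribute. This gives one way to picture how bandwidth selection can shrink a mixed-type distance toward a uniform one.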

research
11/19/2020

Similarity-based Distance for Categorical Clustering using Space Structure

Clustering is spotting pattern in a group of objects and resultantly gro...
research
01/01/1997

Improved Heterogeneous Distance Functions

Instance-based learning techniques typically handle continuous and linea...
research
07/21/2017

A New Family of Near-metrics for Universal Similarity

We propose a family of near-metrics based on local graph diffusion to ca...
research
03/30/2022

Benchmarking distance-based partitioning methods for mixed-type data

Clustering mixed-type data, that is, observation by variable data that c...
research
03/28/2018

Active Metric Learning for Supervised Classification

Clustering and classification critically rely on distance metrics that p...
research
05/06/2020

Graph Spectral Feature Learning for Mixed Data of Categorical and Numerical Type

Feature learning in the presence of a mixed type of variables, numerical...
research
12/09/2018

A matching based clustering algorithm for categorical data

Cluster analysis is one of the essential tasks in data mining and knowle...
