Clustering of Big Data with Mixed Features

11/11/2020
by Joshua Tobin, et al.

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means and are therefore sensitive to initialisation, detect only spherical clusters, and require the number of clusters, which is usually unknown, to be specified a priori. We develop a new clustering algorithm for large data of mixed type, aiming to improve the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) it can detect outliers and clusters of relatively low density; (3) it can determine the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice.

Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.
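For orientation, the generic peak-finding idea that the abstract builds on can be sketched in a few lines of Python. This is not the authors' algorithm: the Gower-style distance, the k-nearest-neighbor density estimate, the function names `mixed_distance` and `density_peaks`, and the fixed `n_clusters` argument are all assumptions made for illustration. The sketch uses a dense O(n²) distance matrix rather than a fast k-nearest-neighbors method, and it omits the outlier detection, the conductance-based choice of the number of clusters, and the scaling down to component sets mentioned above.

```python
import numpy as np


def mixed_distance(num, cat):
    """Gower-style pairwise distance for mixed data (illustrative assumption).

    num: (n, p_num) float array of numeric features.
    cat: (n, p_cat) array of integer-coded categorical features.
    """
    span = num.max(axis=0) - num.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero on constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / span
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])


def density_peaks(dist, k, n_clusters):
    """Generic density-peak clustering on a precomputed distance matrix."""
    n = dist.shape[0]
    # Density: inverse mean distance to the k nearest neighbours (self excluded).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)
    order = np.argsort(-rho, kind="stable")  # points by descending density

    # delta: distance to the nearest point of higher density; parent: that point.
    delta = np.empty(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()  # densest point has no denser neighbour
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Peaks are the points with the largest rho * delta.
    centres = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centres] = np.arange(n_clusters)
    # Every other point inherits the label of its denser parent.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centres
```

Calling `density_peaks(mixed_distance(num, cat), k, n_clusters)` returns one label per point and the indices of the chosen peaks. Fixing `n_clusters` by hand is exactly the limitation the abstract says the proposed algorithm removes.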

