CDF Transform-Shift: An effective way to deal with inhomogeneous density datasets

10/05/2018
by Ye Zhu, et al.

Many distance-based algorithms are biased towards dense clusters in inhomogeneous datasets, i.e., datasets that contain clusters in both dense and sparse regions of the space. For example, in the presence of a sparse cluster, density-based clustering algorithms tend to merge neighbouring dense clusters into a single group, while distance-based anomaly detectors have difficulty detecting local anomalies that lie close to a dense cluster. In this paper, we propose the CDF Transform-Shift (CDF-TS) algorithm, which is based on a multi-dimensional Cumulative Distribution Function (CDF) transformation. It effectively converts a dataset with clusters of inhomogeneous density into one with clusters of homogeneous density, i.e., the data distribution is converted to one in which all locally low/high-density locations become globally low/high-density locations. Thus, after performing the proposed Transform-Shift, a single global density threshold can be used to separate the data into clusters and their surrounding noise points. Our empirical evaluations show that CDF-TS overcomes the shortcomings of existing density-based clustering and distance-based anomaly detection algorithms and significantly improves their performance.
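
The density-equalising effect of a CDF transformation can be illustrated with a minimal one-dimensional sketch. This is not the CDF-TS algorithm, whose multi-dimensional procedure is not specified in the abstract. It is the textbook empirical CDF (rank) transform, applied to synthetic data made up for illustration. The function names, cluster parameters, and the gap-based density proxy are all assumptions.

```python
import numpy as np

def empirical_cdf_transform(x):
    """Map each value to its empirical CDF value rank/n, in (0, 1]."""
    x = np.asarray(x, dtype=float)
    # Double argsort gives the 0-based rank of each value; add 1 for 1-based ranks.
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / x.size

def mean_gap(x):
    """Mean gap between consecutive sorted values (a crude inverse-density proxy)."""
    return float(np.mean(np.diff(np.sort(x))))

# Hypothetical 1-D dataset: one dense and one sparse cluster, far apart.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, 200)    # tight cluster
sparse = rng.normal(10.0, 1.0, 200)  # cluster spread about ten times wider
data = np.concatenate([dense, sparse])

u = empirical_cdf_transform(data)
u_dense, u_sparse = u[:200], u[200:]

# Before the transform the sparse cluster has much larger gaps (lower density).
ratio_before = mean_gap(sparse) / mean_gap(dense)
# After the transform both clusters have the same point spacing.
ratio_after = mean_gap(u_sparse) / mean_gap(u_dense)

print(f"sparse/dense gap ratio before: {ratio_before:.2f}")
print(f"sparse/dense gap ratio after:  {ratio_after:.2f}")
```

A global transform like this one flattens all density variation, including the gap between the two clusters, so it cannot by itself preserve cluster structure. The abstract describes CDF-TS as instead keeping locally low-density locations low-density globally, which this sketch does not attempt.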

