Probabilistic Partitive Partitioning (PPP)

03/09/2020
by Mujahid Sultan, et al.

Clustering is an NP-hard problem, so no efficient exact algorithm is known and heuristics are used to cluster data in practice. Heuristics can be very resource-intensive if not applied properly. For substantially large datasets, computational efficiency can be gained by reducing the input space, provided the loss of information is minimal. Clustering algorithms in general face two common problems: 1) they converge to different solutions under different initial conditions; and 2) the number of clusters has to be chosen arbitrarily beforehand. These problems have become critical in the realm of big data. Recently, clustering algorithms have emerged that speed up computation using parallel processing over the grid, but they still face the aforementioned problems.

Goals: Our goals are to find methods to cluster data that: 1) guarantee convergence to the same solution irrespective of the initial conditions; 2) eliminate the need to fix the number of clusters beforehand; and 3) can be applied to large datasets.

Methods: We introduce a method that combines probabilistic and combinatorial clustering methods to produce repeatable and compact clusters that are not sensitive to initial conditions. The method harnesses the power of k-means (a combinatorial clustering method) to cluster/partition very large, high-dimensional datasets and uses the Gaussian Mixture Model (a probabilistic clustering method) to validate the k-means partitions.

Results: We show that this method produces very compact clusters that are not sensitive to initial conditions. It can be used to identify the most 'separable' set in a dataset, which increases the dataset's 'clusterability'. The method also eliminates the need to specify the number of clusters in advance.
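The abstract does not spell out how the Gaussian Mixture Model validates the k-means partitions, so the following is only a rough sketch of the general idea, not the authors' PPP algorithm. It assumes 1-D data, a deterministic quantile-based k-means start (so repeated runs agree), and a BIC score computed from one Gaussian fitted to each k-means cell. The names `kmeans_1d`, `partition_bic`, and `choose_k`, the variance floor, and the use of BIC are all illustrative assumptions.

```python
import math
import random


def kmeans_1d(xs, k, iters=100):
    """Lloyd's k-means on 1-D data with a deterministic start (an assumption)."""
    s = sorted(xs)
    n = len(s)
    # Initial centers are evenly spaced order statistics, so no random seed is involved.
    centers = [s[int((i + 0.5) * n / k)] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = sum(members) / len(members)
    return labels


def partition_bic(xs, labels, var_floor=1e-3):
    """BIC of a Gaussian mixture whose components are fitted to the k-means cells."""
    n = len(xs)
    comps = []  # (weight, mean, variance) for each non-empty cluster
    for j in sorted(set(labels)):
        members = [x for x, lab in zip(xs, labels) if lab == j]
        mean = sum(members) / len(members)
        var = max(sum((x - mean) ** 2 for x in members) / len(members), var_floor)
        comps.append((len(members) / n, mean, var))
    log_lik = 0.0
    for x in xs:
        density = sum(
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in comps
        )
        log_lik += math.log(density)
    n_params = 3 * len(comps) - 1  # means, variances, and free mixing weights
    return n_params * math.log(n) - 2.0 * log_lik


def choose_k(xs, max_k=6):
    """Score every k from 1 to max_k and return the k with the lowest BIC."""
    scores = {k: partition_bic(xs, kmeans_1d(xs, k)) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get), scores


# Synthetic data (hypothetical): three well-separated Gaussian blobs.
rng = random.Random(0)
data = [rng.gauss(c, 1.0) for c in (0.0, 10.0, 20.0) for _ in range(50)]
best_k, bic = choose_k(data)
print(best_k, {k: round(v, 1) for k, v in bic.items()})
```

Because the start is deterministic, calling `choose_k(data)` twice returns identical scores, which mirrors the repeatability goal stated above. Selecting the number of clusters by the lowest score is one possible way to avoid fixing it in advance.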

research
01/31/2019

A Novel Initial Clusters Generation Method for K-means-based Clustering Algorithms for Mixed Datasets

Mixed datasets consist of numeric and categorical attributes. Various K-...
research
12/29/2022

Cluster-level Group Representativity Fairness in k-means Clustering

There has been much interest recently in developing fair clustering algo...
research
06/27/2018

Quantile-based clustering

A new cluster analysis method, K-quantiles clustering, is introduced. K-...
research
04/06/2021

A New Parallel Adaptive Clustering and its Application to Streaming Data

This paper presents a parallel adaptive clustering (PAC) algorithm to au...
research
11/03/2016

A-Ward_pβ: Effective hierarchical clustering using the Minkowski metric and a fast k-means initialisation

In this paper we make two novel contributions to hierarchical clustering...
research
07/25/2023

DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm

With the development of Big data technology, data analysis has become in...
research
09/20/2016

An Efficient Method of Partitioning High Volumes of Multidimensional Data for Parallel Clustering Algorithms

An optimal data partitioning in parallel & distributed implementation of...
