A Generic Distributed Clustering Framework for Massive Data

06/19/2021
by Pingyi Luo, et al.

In this paper, we introduce a novel Generic distributEd clustEring frameworK (GEEK) that goes beyond k-means clustering to process massive amounts of data. To handle different data types, GEEK first converts data from the original feature space into a unified bucket format; we then design a new Seeding method based on simILar bucKets (SILK) to determine the initial seeds. Unlike state-of-the-art seeding methods such as k-means++ and its variants, SILK does not require k to be specified in advance: it identifies the number of initial seeds automatically, based on the closeness of the data objects shared by similar buckets. Its time complexity is therefore independent of k. With these well-selected initial seeds, GEEK needs only a single pass of data assignment to obtain the final clusters. We implement GEEK on a distributed CPU-GPU platform for large-scale clustering. We evaluate GEEK on five large-scale real-life datasets and show that it can handle massive data of different types and is comparable to (or even better than) many state-of-the-art customized GPU-based methods, especially for large values of k.
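
The pipeline described in the abstract (bucket the data, derive seeds from the buckets, then assign every object once) can be illustrated with a small sketch. This is not the authors' SILK algorithm: the abstract does not specify how buckets are built or compared, so the sketch assumes a simple grid hash as the bucketing step and takes the centroid of every sufficiently populated bucket as a seed. The function names `bucketize`, `pick_seeds`, and `assign_once`, and the parameters `width` and `min_size`, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def bucketize(points, width):
    """Group point indices by grid cell.

    Assumption: a grid hash stands in for GEEK's bucket conversion,
    which the abstract does not describe in detail.
    """
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(math.floor(x / width) for x in point)
        buckets[key].append(idx)
    return buckets


def pick_seeds(points, buckets, min_size):
    """Return one seed (the bucket centroid) per bucket with >= min_size members.

    Assumption: a simplified stand-in for SILK. The number of seeds
    depends on the data, not on a pre-specified k.
    """
    seeds = []
    for members in buckets.values():
        if len(members) >= min_size:
            dim = len(points[members[0]])
            centroid = tuple(
                sum(points[i][d] for i in members) / len(members)
                for d in range(dim)
            )
            seeds.append(centroid)
    return seeds


def assign_once(points, seeds):
    """One-pass assignment: label each point with the index of its nearest seed."""
    return [
        min(range(len(seeds)), key=lambda s: math.dist(point, seeds[s]))
        for point in points
    ]
```

Calling `bucketize`, then `pick_seeds`, then `assign_once` on a dataset with two tight, well-separated groups produces two seeds and two clusters without k ever being passed in, which is the property the abstract attributes to SILK. The real method compares similar buckets and their shared objects rather than thresholding bucket size.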

research
05/30/2016

k2-means for fast and accurate large scale clustering

We propose k^2-means, a new clustering method which efficiently copes wi...
research
08/28/2019

Data ultrametricity and clusterability

The increasing needs of clustering massive datasets and the high cost of...
research
06/13/2020

SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

This paper presents a batch-wise density-based clustering approach for l...
research
10/16/2019

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications r...
research
10/09/2017

Distributed Kernel K-Means for Large Scale Clustering

Clustering samples according to an effective metric and/or vector space ...
research
05/25/2019

A New Clustering Method Based on Morphological Operations

With the booming development of data science, many clustering methods ha...
research
01/09/2018

An efficient K-means clustering algorithm for massive data

The analysis of continuously larger datasets is a task of major importanc...
