Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

07/08/2019
by Gregor Ulm, et al.

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible because of their high memory requirements, their unfavorable runtime complexity, or both. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm that identifies density-based clusters with linear time complexity. Because its runtime is linear and its memory requirements are constant, the algorithm is well suited to big data applications that must process very large amounts of data. It consists of two steps: (1) a contraction step, which projects objects onto tiles, and (2) an agglomeration step, which groups tiles into clusters. The algorithm is extremely fast in both sequential and parallel execution. In single-threaded execution on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds. Parallelization yields a significant speedup, a factor of around 4 on an 8-core machine.
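The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Rust implementation. The abstract says only that objects are projected onto tiles and that tiles are grouped into clusters, so the rest is assumed for illustration: the function names `contract` and `agglomerate`, the parameters `precision`, `threshold`, and `min_size`, 2D points, tiles formed by truncating coordinates to a fixed number of decimal places, and the rule that tiles touching in the 8-neighborhood belong to the same cluster.

```python
from collections import Counter
import math


def contract(points, precision=1, threshold=2):
    """Contraction step (sketch): project 2D points onto grid tiles.

    Assumption: a tile is obtained by scaling each coordinate by
    10**precision and flooring it; only tiles that received at least
    `threshold` points are kept.
    """
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size=1):
    """Agglomeration step (sketch): group adjacent tiles into clusters.

    Assumption: two tiles are adjacent if they differ by at most 1 in
    each coordinate (8-neighborhood); clusters with fewer than
    `min_size` tiles are discarded.
    """
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            tx, ty = frontier.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        frontier.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


# Hypothetical sample data: two dense regions and one isolated point.
points = [
    (0.11, 0.12), (0.13, 0.14),  # both fall on tile (1, 1)
    (0.21, 0.11), (0.25, 0.15),  # both fall on tile (2, 1), adjacent to (1, 1)
    (5.01, 5.02), (5.03, 5.04),  # both fall on tile (50, 50)
    (9.0, 9.0),                  # alone on its tile, below the threshold
]
significant_tiles = contract(points, precision=1, threshold=2)
clusters = agglomerate(significant_tiles)
```

In this sketch each point is read once and only per-tile counts are stored, so memory depends on the number of occupied tiles rather than on the number of points. How the original algorithm bounds that number is not stated in the abstract.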

