Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

03/24/2023
by Marco Gaido, et al.

In the big data era, a key requirement for any algorithm is the ability to run efficiently in parallel in a distributed environment. Unfortunately, the popular Silhouette metric for evaluating the quality of a clustering lacks this property: its computational complexity is quadratic in the size of the input dataset. As a result, its use has been hindered in big data scenarios, where clusterings had to be evaluated by other means. To fill this gap, this paper introduces the first algorithm that computes the Silhouette metric with linear complexity and can easily execute in parallel in a distributed environment. Its implementation is freely available in the Apache Spark ML library.
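The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way a linear-time Silhouette can be obtained. It assumes the squared Euclidean distance, for which the mean distance from a point to a cluster can be derived from three per-cluster aggregates (point count, vector sum, and sum of squared norms). The function name and the pure-Python, single-machine setting are illustrative choices, not the paper's or Spark's code.

```python
from collections import defaultdict


def silhouette_squared_euclidean(points, labels):
    """Mean Silhouette under squared Euclidean distance (illustrative sketch).

    Assumes at least two distinct cluster labels. Cost is O(n * k * d) for
    n points, k clusters, d dimensions, i.e. linear in n.
    """
    dim = len(points[0])
    count = defaultdict(int)                    # N_c: points per cluster
    vec_sum = defaultdict(lambda: [0.0] * dim)  # Y_c: sum of the cluster's points
    sq_sum = defaultdict(float)                 # Psi_c: sum of squared norms

    # Pass 1: per-cluster aggregates. Each is a plain sum, so in a
    # distributed setting it could be built per partition and merged.
    for x, c in zip(points, labels):
        count[c] += 1
        sq_sum[c] += sum(v * v for v in x)
        acc = vec_sum[c]
        for j, v in enumerate(x):
            acc[j] += v

    def mean_sq_dist(x, c):
        # (1/N_c) * sum_y ||x - y||^2 = ||x||^2 + Psi_c/N_c - 2 * (x . Y_c)/N_c
        sq_norm = sum(v * v for v in x)
        dot = sum(v * y for v, y in zip(x, vec_sum[c]))
        return sq_norm + sq_sum[c] / count[c] - 2.0 * dot / count[c]

    # Pass 2: each point's score needs only the k cluster aggregates.
    total = 0.0
    for x, c in zip(points, labels):
        if count[c] == 1:
            continue  # a singleton cluster contributes a score of 0
        # Exclude the point itself: rescale the own-cluster mean from N to N - 1.
        a = mean_sq_dist(x, c) * count[c] / (count[c] - 1)
        b = min(mean_sq_dist(x, k) for k in count if k != c)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Because the first pass only accumulates sums, the quadratic all-pairs comparison is avoided, and the second pass touches each point once. In Apache Spark ML, the Silhouette metric is exposed through the ClusteringEvaluator class; how closely its internals match this sketch is not stated in the abstract.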


