Fast communication-efficient spectral clustering over distributed data

05/05/2019
by Donghui Yan, et al.

The last decades have seen a surge of interest in distributed computing, thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume that all the data are already in one place, then divide the data and conquer on multiple machines. Increasingly, however, the data are located at a number of distributed sites, and one wishes to compute over all the data with low communication overhead. For spectral clustering, we propose a novel framework that enables its computation over such distributed data with "minimal" communication and a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. Our approach allows local parallel computing where the data are located, thus turning the distributed nature of the data into a blessing; the speedup is most substantial when the data are evenly distributed across sites. Experiments on synthetic and large UC Irvine datasets show almost no loss in accuracy with our approach and about a 2x speedup under various settings with two distributed sites. As the transmitted data need not be in their original form, our framework readily addresses privacy concerns for data sharing in distributed computing.
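
The abstract does not spell out the mechanism, so the following is only a rough sketch of one way such a pipeline could look, not the authors' implementation. It assumes that each site compresses its data into a few representative points with plain k-means, that only those representatives are transmitted, that a central node runs a two-way spectral cut on them, and that each original point inherits the label of its representative. All function names, parameters, and the toy data are hypothetical.

```python
import numpy as np

def local_codewords(X, k, iters=20, seed=0):
    """Plain Lloyd k-means at one site (assumed compression step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    # Final assignment, consistent with the final centers.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d.argmin(axis=1)

def spectral_bipartition(points, sigma=2.0):
    """Two-way normalized spectral cut with a Gaussian affinity."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(points)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    second = vecs[:, 1] * d_inv_sqrt  # second-smallest eigenvector, rescaled
    return (second > 0).astype(int)

def distributed_cluster(sites, k_per_site=8, sigma=2.0):
    # Step 1: every site compresses its own data (could run in parallel).
    local = [local_codewords(X, k_per_site, seed=i) for i, X in enumerate(sites)]
    # Step 2: only the representatives are pooled, not the raw points.
    pooled = np.vstack([centers for centers, _ in local])
    code_labels = spectral_bipartition(pooled, sigma)
    # Step 3: labels go back; each point inherits its representative's label.
    labels, start = [], 0
    for centers, assign in local:
        labels.append(code_labels[start:start + len(centers)][assign])
        start += len(centers)
    return labels

# Toy data: two sites, each holding 100 points from each of two blobs.
rng = np.random.default_rng(42)

def make_site(n):
    a = rng.normal([0.0, 0.0], 0.5, size=(n, 2))
    b = rng.normal([5.0, 5.0], 0.5, size=(n, 2))
    return np.vstack([a, b])

sites = [make_site(100), make_site(100)]
labels = distributed_cluster(sites)
print("points held:", sum(len(X) for X in sites), "| points transmitted:", 2 * 8)
```

In this toy setup the two sites hold 400 points between them but send only 16 representatives, which illustrates both the reduced communication and why the transmitted data need not be in their original form. The choice of k-means as the compression step and of a two-way cut at the center are assumptions made to keep the sketch short.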


research
07/30/2019

Learning over inherently distributed data

The recent decades have seen a surge of interests in distributed computi...
research
12/08/2022

A Distributed Block Chebyshev-Davidson Algorithm for Parallel Spectral Clustering

We develop a distributed Block Chebyshev-Davidson algorithm to solve lar...
research
02/04/2023

FedSpectral+: Spectral Clustering using Federated Learning

Clustering in graphs has been a well-known research problem, particularl...
research
06/30/2023

Hashing-Based Distributed Clustering for Massive High-Dimensional Data

Clustering analysis is of substantial significance for data mining. The ...
research
04/20/2012

A Privacy-Aware Bayesian Approach for Combining Classifier and Cluster Ensembles

This paper introduces a privacy-aware Bayesian approach that combines en...
research
12/20/2019

Heterogeneity-aware and communication-efficient distributed statistical inference

In multicenter research, individual-level data are often protected again...
research
09/24/2020

Distributed Community Detection for Large Scale Networks Using Stochastic Block Model

With rapid developments of information and technology, large scale netwo...
