Sketch and Validate for Big Data Clustering

01/22/2015
by Panagiotis A. Traganitis, et al.

In response to the need for learning tools tuned to big data analytics, the present paper introduces a framework for efficient clustering of huge sets of (possibly high-dimensional) data. Building on random sampling and consensus (RANSAC) ideas pursued earlier in a different (computer vision) context for robust regression, a suite of novel dimensionality and set-reduction algorithms is developed. The advocated sketch-and-validate (SkeVa) family includes two algorithms that rely on K-means clustering per iteration on a reduced number of dimensions and/or feature vectors: the first operates in batch fashion, while the second, sequential one offers computational efficiency and suitability for streaming modes of operation. For clustering even nonlinearly separable vectors, the SkeVa family also offers a member based on user-selected kernel functions. Further trading off performance for reduced complexity, a fourth member of the SkeVa family is based on a divergence criterion for selecting proper minimal subsets of feature variables and vectors, thus bypassing the need for K-means clustering per iteration. Extensive numerical tests on synthetic and real data sets highlight the potential of the proposed algorithms, and demonstrate their competitive performance relative to state-of-the-art random projection alternatives.
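
The batch idea described above (repeatedly cluster a small random sketch, then validate the result and keep the best draw) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the validation score used here (mean squared distance to the nearest centroid on a second random subset), the plain Lloyd's K-means, and all function and parameter names (`kmeans`, `validation_cost`, `sketch_and_validate`, `n_draws`, `sketch_size`, `val_size`) are assumptions made for illustration only.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd's K-means; returns a (k, d) array of centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def validation_cost(X, centroids):
    """Mean squared distance from each row of X to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()


def sketch_and_validate(X, k, n_draws=10, sketch_size=50, val_size=200, seed=0):
    """Assumed sketch-and-validate loop (illustrative, not the paper's SkeVa)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    best_cost, best_centroids = np.inf, None
    for _ in range(n_draws):
        # Sketch: run K-means on a small random subset of the feature vectors.
        sketch = X[rng.choice(n, size=min(sketch_size, n), replace=False)]
        centroids = kmeans(sketch, k, rng)
        # Validate: score the centroids on an independently drawn subset.
        val = X[rng.choice(n, size=min(val_size, n), replace=False)]
        cost = validation_cost(val, centroids)
        if cost < best_cost:
            best_cost, best_centroids = cost, centroids
    # Assign every vector to the nearest of the best centroids found.
    d2 = ((X[:, None, :] - best_centroids[None, :, :]) ** 2).sum(axis=2)
    return best_centroids, d2.argmin(axis=1), best_cost
```

Calling `sketch_and_validate(X, k=2)` on an `(n, d)` array returns the selected centroids, a label per row, and the validation cost of the winning draw. Each draw runs K-means on only `sketch_size` vectors, which is where the assumed savings over clustering the full set would come from.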

research
07/22/2017

Sketched Subspace Clustering

The immense amount of daily generated and communicated data presents uni...
research
10/06/2015

Large-scale subspace clustering using sketching and validation

The nowadays massive amounts of generated and communicated data present ...
research
10/10/2016

Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

Sketching techniques have become popular for scaling up machine learning...
research
04/17/2014

Subspace Learning and Imputation for Streaming Big Data Matrices and Tensors

Extracting latent low-dimensional structure from high-dimensional data i...
research
04/15/2020

A Feature-Reduction Multi-View k-Means Clustering Algorithm

The k-means clustering algorithm is the oldest and most known method in ...
research
02/07/2021

Determinantal consensus clustering

Random restart of a given algorithm produces many partitions to yield a ...
research
02/15/2012

The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces

Consider observation data, comprised of n observation vectors with value...
