Iterative Subsampling in Solution Path Clustering of Noisy Big Data

12/04/2014
by   Yuliya Marchetti, et al.

We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
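The loop the abstract describes (cluster a small subsample, then sequentially assign the remaining points, and repeat on what is left) can be illustrated with a minimal sketch. This is not the authors' algorithm and not the SPClustering API: the inner clusterer is a simple greedy leader clustering standing in for SPC, and the names `leader_clusters`, `iterative_subsample_cluster`, `radius`, and `min_size` are hypothetical, chosen only for this illustration. Points that never fall within `radius` of a cluster center keep the label `-1`, which loosely mimics leaving noise unassigned.

```python
import math
import random


def leader_clusters(points, radius, min_size):
    """Stand-in for SPC (NOT the authors' method): greedy leader clustering.

    Each point joins the first leader within `radius`; otherwise it starts a
    new group. Groups smaller than `min_size` are discarded as noise.
    Returns the mean of each surviving group.
    """
    leaders, groups = [], []
    for p in points:
        for g, leader in enumerate(leaders):
            if math.dist(p, leader) <= radius:
                groups[g].append(p)
                break
        else:
            leaders.append(p)
            groups.append([p])
    centers = []
    for members in groups:
        if len(members) >= min_size:
            dim = len(members[0])
            centers.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
    return centers


def iterative_subsample_cluster(points, subsample_size, radius,
                                min_size=3, max_iter=20, seed=0):
    """Alternate between clustering a subsample and assigning the rest.

    Returns (labels, centers); label -1 marks points left unassigned.
    """
    rng = random.Random(seed)
    labels = [-1] * len(points)
    centers = []
    remaining = list(range(len(points)))
    for _ in range(max_iter):
        if len(remaining) < min_size:
            break
        # Step 1: cluster only a small random subsample of unassigned points.
        sample = rng.sample(remaining, min(subsample_size, len(remaining)))
        new_centers = leader_clusters([points[i] for i in sample], radius, min_size)
        if not new_centers:
            break  # simplification: stop when a subsample yields no cluster
        centers.extend(new_centers)
        # Step 2: sequentially assign unassigned points to the nearest center,
        # but only if they lie within `radius`; the rest stay unassigned.
        still_unassigned = []
        for i in remaining:
            dists = [math.dist(points[i], c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] <= radius:
                labels[i] = k
            else:
                still_unassigned.append(i)
        remaining = still_unassigned
    return labels, centers


# Toy data: two tight clusters plus three isolated noise points.
cluster_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
cluster_b = [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(5) for j in range(5)]
noise = [(5.0, -5.0), (-6.0, 7.0), (20.0, 3.0)]
labels, centers = iterative_subsample_cluster(
    cluster_a + cluster_b + noise, subsample_size=20, radius=1.0
)
```

The saving in this sketch comes from running the expensive clustering step only on `subsample_size` points per iteration, while the assignment step is a cheap nearest-center lookup. The actual method also selects a single solution from a path of clusterings, which this sketch does not attempt.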

research
04/24/2014

Solution Path Clustering with Adaptive Concave Penalty

Fast accumulation of large amounts of complex data has created a need fo...
research
12/29/2021

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional...
research
03/19/2019

A Quantum Annealing-Based Approach to Extreme Clustering

In this age of data abundance, there is a growing need for algorithms an...
research
02/16/2022

IPD:An Incremental Prototype based DBSCAN for large-scale data with cluster representatives

DBSCAN is a fundamental density-based clustering technique that identifi...
research
04/15/2019

Multiple kernel learning for integrative consensus clustering of genomic datasets

Diverse applications - particularly in tumour subtyping - have demonstra...
research
05/29/2023

DMS: Differentiable Mean Shift for Dataset Agnostic Task Specific Clustering Using Side Information

We present a novel approach, in which we learn to cluster data directly ...
research
01/06/2019

Dynamic Visualization and Fast Computation for Convex Clustering and Bi-Clustering

Convex clustering is a promising new approach to the classical problem o...
