Clustering small datasets in high-dimension by random projection

08/21/2020
by Alden Bradford, et al.

High-dimensional datasets do not typically form clusters in their original space, and the problem worsens when the dataset contains few points. We propose a low-computation method for finding statistically significant clustering structures in a small dataset. The method projects the data onto a random line and seeks binary clusterings in the resulting one-dimensional data. Non-linear separations are obtained by extending the feature space with higher-degree monomials of the original features. The statistical validity of each clustering structure is tested in the projected one-dimensional space, which bypasses the challenge of statistical validation in high dimension. Projection onto a random line is an extreme dimension reduction technique that has previously been used successfully as part of a hierarchical clustering method for high-dimensional data. Our experiments show that, with this simplified framework, statistically significant clustering structures can be found with as few as 100 to 200 points, depending on the dataset. The structures uncovered persist as more points are added to the dataset.
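The pipeline described in the abstract (monomial feature extension, projection onto a random line, binary clustering in one dimension, and a significance check on the projected data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the binary clustering is assumed to be the split of the sorted projections that minimizes the within-cluster sum of squares, and the significance check is assumed to be a Monte Carlo comparison against a one-dimensional Gaussian null. The abstract does not specify either choice.

```python
from itertools import combinations_with_replacement

import numpy as np


def monomial_features(X, degree=2):
    """Return all monomials of the original features of degree 1..degree.

    Hypothetical helper: extends the feature space so that a linear split
    of the projection can correspond to a non-linear split of the data.
    """
    d = X.shape[1]
    cols = []
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), k):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)


def random_projection(X, rng):
    """Project the rows of X onto a random unit vector (a random line)."""
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    return X @ w


def best_binary_split(z):
    """Find the threshold that best splits 1-D data into two clusters.

    Assumption: "best" means minimal within-cluster sum of squares.
    Returns (threshold, within_ss / total_ss); a smaller ratio means a
    cleaner two-cluster structure.
    """
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    csum, csq = np.cumsum(z), np.cumsum(z ** 2)
    total_ss = csq[-1] - csum[-1] ** 2 / n
    k = np.arange(1, n)  # candidate sizes of the left cluster
    left_ss = csq[k - 1] - csum[k - 1] ** 2 / k
    right_sum = csum[-1] - csum[k - 1]
    right_ss = (csq[-1] - csq[k - 1]) - right_sum ** 2 / (n - k)
    within = left_ss + right_ss
    i = int(np.argmin(within))
    threshold = 0.5 * (z[i] + z[i + 1])
    return threshold, within[i] / total_ss


def split_p_value(z, rng, n_null=200):
    """Monte Carlo p-value for the split, tested in the projected 1-D space.

    Assumption: the null model is a single Gaussian. The ratio statistic is
    invariant to shift and scale, so standard normal samples suffice.
    """
    _, observed = best_binary_split(z)
    null = np.array([
        best_binary_split(rng.standard_normal(len(z)))[1]
        for _ in range(n_null)
    ])
    return (1 + np.sum(null <= observed)) / (1 + n_null)


def cluster_once(X, rng, degree=2):
    """One random-projection trial: returns (labels, p_value)."""
    z = random_projection(monomial_features(X, degree), rng)
    threshold, _ = best_binary_split(z)
    return (z > threshold).astype(int), split_p_value(z, rng)
```

In practice one would presumably repeat `cluster_once` over many random lines and keep only the splits whose p-value survives a multiple-testing correction; that outer loop is left out here because the abstract does not describe it.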


research
04/06/2015

A Probabilistic ℓ_1 Method for Clustering High Dimensional Data

In general, the clustering problem is NP-hard, and global optimality can...
research
03/29/2023

Randomly Projected Convex Clustering Model: Motivation, Realization, and Cluster Recovery Guarantees

In this paper, we propose a randomly projected convex clustering model f...
research
10/06/2021

Clustering Plotted Data by Image Segmentation

Clustering algorithms are one of the main analytical methods to detect p...
research
08/29/2016

Robust Discriminative Clustering with Sparse Regularizers

Clustering high-dimensional data often requires some form of dimensional...
research
05/28/2018

Clustering by latent dimensions

This paper introduces a new clustering technique, called dimensional cl...
research
08/20/2020

An Examination of Grouping and Spatial Organization Tasks for High-Dimensional Data Exploration

How do analysts think about grouping and spatial operations? This overar...
research
01/31/2010

Classifying the typefaces of the Gutenberg 42-line bible

We have measured the dissimilarities among several printed characters of...
