Spectral Clustering on Large Datasets: When Does it Work? Theory from Continuous Clustering and Density Cheeger-Buser

05/11/2023
by Timothy Chu, et al.

Spectral clustering is one of the most popular clustering algorithms that has stood the test of time. It is simple to describe, can be implemented using standard linear algebra, and often finds better clusters than traditional clustering algorithms like k-means and k-centers. The foundational algorithm for two-way spectral clustering, by Shi and Malik, creates a geometric graph from data and finds a spectral cut of the graph. In modern machine learning, many data sets are modeled as a large number of points drawn from a probability density function. Little is known about when spectral clustering works in this setting, and when it doesn't. Past researchers justified spectral clustering by appealing to the graph Cheeger inequality (which states that the spectral cut of a graph approximates the “Normalized Cut”), but this justification is known to break down on large data sets. We provide theoretically-informed intuition about spectral clustering on large data sets drawn from probability densities, by proving when a continuous form of spectral clustering considered by past researchers (the unweighted spectral cut of a probability density) finds good clusters of the underlying density itself. Our work suggests that Shi-Malik spectral clustering works well on data drawn from mixtures of Laplace distributions, and works poorly on data drawn from certain other densities, such as a density we call the “square-root trough”. Our core theorem proves that weighted spectral cuts have low weighted isoperimetry for all probability densities. Our key tool is a new Cheeger-Buser inequality for all probability densities, including discontinuous ones.
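
For readers unfamiliar with the discrete algorithm the abstract refers to, the following is a minimal sketch of two-way spectral clustering in the spirit of Shi and Malik: build a geometric graph on the data, take the second eigenvector of the normalized Laplacian, and sweep over thresholds to find the split with the smallest Normalized Cut. Several details are assumptions made for illustration and are not taken from the paper: the fully connected Gaussian-kernel graph, the bandwidth `sigma`, the sweep-cut thresholding, and the function name `two_way_spectral_cut`.

```python
import numpy as np


def two_way_spectral_cut(points, sigma=1.0):
    """Split points into two groups (labels 0/1) via a spectral sweep cut.

    Illustrative sketch only: the Gaussian kernel and `sigma` are assumed.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Geometric graph: Gaussian-kernel weight between every pair of points.
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; column 1 is the second
    # smallest. Rescaling by D^{-1/2} gives the generalized eigenvector.
    _, eigvecs = np.linalg.eigh(L)
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    order = np.argsort(fiedler)

    # Sweep over prefixes of the sorted order, tracking the cut weight and
    # volume incrementally, and keep the prefix with the smallest
    # Normalized Cut: cut/vol(S) + cut/vol(complement of S).
    total_vol = deg.sum()
    in_set = np.zeros(n, dtype=bool)
    vol, cut = 0.0, 0.0
    best_ncut, best_k = np.inf, 1
    for k in range(n - 1):
        v = order[k]
        # Moving v into S: edges to S stop being cut, edges to the rest start.
        cut += W[v, ~in_set].sum() - W[v, in_set].sum()
        in_set[v] = True
        vol += deg[v]
        ncut = cut / vol + cut / (total_vol - vol)
        if ncut < best_ncut:
            best_ncut, best_k = ncut, k + 1

    labels = np.zeros(n, dtype=int)
    labels[order[:best_k]] = 1
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2))
    blob_b = rng.normal(loc=(4.0, 0.0), scale=0.3, size=(20, 2))
    print(two_way_spectral_cut(np.vstack([blob_a, blob_b])))
```

The dense pairwise construction costs quadratic memory in the number of points, so this sketch is only suitable for small samples; it is meant to show the shape of the algorithm, not to address the large-data regime the paper analyzes.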

research
03/15/2016

Data Clustering and Graph Partitioning via Simulated Mixing

Spectral clustering approaches have led to well-accepted algorithms for ...
research
07/01/2019

The SpectACl of Nonconvex Clustering: A Spectral Approach to Density-Based Clustering

When it comes to clustering nonconvex shapes, two paradigms are used to ...
research
04/20/2020

Weighted Cheeger and Buser Inequalities, with Applications to Clustering and Cutting Probability Densities

In this paper, we show how sparse or isoperimetric cuts of a probability...
research
11/21/2019

Local Spectral Clustering of Density Upper Level Sets

We analyze the Personalized PageRank (PPR) algorithm, a local spectral m...
research
04/29/2014

The geometry of kernelized spectral clustering

Clustering of data sets is a standard problem in many areas of science a...
research
09/09/2018

Clustering of graph vertex subset via Krylov subspace model reduction

Clustering via graph-Laplacian spectral imbedding is ubiquitous in data ...
research
10/03/2020

Sparse Quantized Spectral Clustering

Given a large data matrix, sparsifying, quantizing, and/or performing ot...
