On the optimality of kernels for high-dimensional clustering

12/01/2019
by Leena Chennuru Vankadara, et al.

This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large-sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective match the known information-theoretic limit up to a factor of √2 for large k. These conditions also exactly match the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches, up to constant factors, the spectral threshold below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the lesser-known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.
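To make the setting concrete, the following is a minimal sketch of kernel k-means on a Gram matrix, written under several assumptions that do not come from the abstract. The kernel form `exp(-gamma * ||x - y||^2)`, the parameter `gamma`, the function names, and the farthest-first seeding are all illustrative choices; the paper's exact kernel definition and scaling are not stated here. The sketch uses a Lloyd-style local search, which is only a heuristic for the NP-hard kernel k-means objective. It is neither the exact minimizer nor the semi-definite relaxation analyzed in the paper.

```python
import numpy as np


def exponential_kernel(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel form)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_kmeans(K, k, n_iter=100):
    """Lloyd-style local search on the kernel k-means objective (a heuristic)."""
    n = K.shape[0]
    diag = np.diag(K)
    # Pairwise squared distances between points in feature space.
    D = diag[:, None] + diag[None, :] - 2.0 * K
    # Deterministic farthest-first seeding (an illustrative choice).
    seeds = [0]
    for _ in range(k - 1):
        seeds.append(int(np.argmax(D[:, seeds].min(axis=1))))
    labels = np.argmin(D[:, seeds], axis=1)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - mu_c||^2 expanded using kernel evaluations only.
            dist[:, c] = (diag
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Calling `kernel_kmeans(exponential_kernel(X, gamma), k)` on samples `X` drawn from a few well-separated Gaussians returns one integer label per row. The value of `gamma` has to be tuned to the dimension: if it is too large, the Gram matrix becomes nearly diagonal and the local search tends to stall at its starting assignment.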


