Distribution free optimality intervals for clustering

07/30/2021
by Marina Meila, et al.

We address the problem of validating the output of clustering algorithms. Given data 𝒟 and a partition 𝒞 of these data into K clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering 𝒞 is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering 𝒞. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability with high probability (w.h.p.), allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they do depend on the data set 𝒟 admitting a stable clustering.
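
To make the notion of a loss function concrete, the sketch below computes the K-means distortion (the sum of squared distances from each point to its cluster centroid) for two partitions of a toy data set. It only illustrates the loss itself, not the convex-relaxation method the abstract describes. The function name `kmeans_distortion` and the data values are assumptions made up for this example.

```python
def kmeans_distortion(points, labels):
    """Sum of squared distances from each point to its cluster centroid.

    `points` is a list of equal-length coordinate lists; `labels` assigns
    each point to a cluster. Both names are illustrative assumptions.
    """
    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Toy data (invented for illustration): two tight groups on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
good = [0, 0, 1, 1]        # the intuitive two-cluster partition
perturbed = [0, 1, 1, 1]   # same partition with one point reassigned

loss_good = kmeans_distortion(points, good)
loss_perturbed = kmeans_distortion(points, perturbed)
print(loss_good, loss_perturbed)
```

On this toy data the intuitive partition has a much lower distortion than the perturbed one. This matches the informal picture of a clustering that is good and not easily replaced by a different clustering of similar loss.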

research
06/18/2020

Guarantees for Hierarchical Clustering by the Sublevel Set method

Meila (2018) introduces an optimization based method called the Sublevel...
research
06/19/2016

Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation

A good clustering can help a data analyst to explore and understand a da...
research
09/04/2023

Selective inference after convex clustering with ℓ_1 penalization

Classical inference methods notoriously fail when applied to data-driven...
research
01/26/2019

A general model for plane-based clustering with loss function

In this paper, we propose a general model for plane-based clustering. Th...
research
06/15/2020

Selecting the Number of Clusters K with a Stability Trade-off: an Internal Validation Criterion

Model selection is a major challenge in non-parametric clustering. There...
research
05/31/2022

Scalable Distributional Robustness in a Class of Non Convex Optimization with Guarantees

Distributionally robust optimization (DRO) has shown lot of promise in p...
research
02/17/2021

Outside the Echo Chamber: Optimizing the Performative Risk

In performative prediction, predictions guide decision-making and hence ...
