An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering

02/06/2013
by Michael Kearns, et al.

Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences in behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method, which we call posterior assignment, that is close in spirit to the soft assignments of EM but leads to a surprisingly different algorithm.
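To make the entropy term concrete, here is a minimal sketch of how such a decomposition can arise, assuming distortion is measured by the complete-data log-loss $-\log\!\big(\pi_{j(x)} f_{j(x)}(x)\big)$, where $j(x)$ is the cluster a point is hard-assigned to, $\pi_j$ are mixture weights, and $f_j$ are the cluster densities (the notation is illustrative, not necessarily the paper's):

\[
\mathbb{E}\big[-\log \pi_{j(x)} f_{j(x)}(x)\big]
= \sum_j P_j\, \mathbb{E}\big[-\log f_j(x) \,\big|\, j(x)=j\big]
\;+\; \sum_j P_j \log \frac{1}{\pi_j},
\qquad P_j = \Pr[j(x)=j].
\]

Setting the mixture weights optimally, $\pi_j = P_j$, turns the second term into the entropy of the hard partition:

\[
\mathbb{E}[\text{loss}]
= \underbrace{\sum_j P_j\, \mathbb{E}\big[-\log f_j(x) \mid j\big]}_{\text{within-cluster fit}}
\;+\; \underbrace{H(P_1,\dots,P_K)}_{\text{balance of the partition}}.
\]

The first term is small when the data assigned to each cluster are similar, in the sense of being well modeled by that cluster's density; the second term depends only on how the data are divided among the clusters, which is the fit-versus-balance trade-off described above.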


