Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

12/02/2015
by Jeffrey Miller, et al.

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
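
The contrast drawn in the abstract can be illustrated with a small simulation. The sketch below is not the model introduced in the paper. It assumes a Chinese restaurant process (the partition underlying a Dirichlet process mixture) with a hypothetical concentration parameter `alpha`, and compares it with a toy partition whose clusters have a fixed maximum size, loosely mimicking an entity-resolution setting. The function names and the summary statistic (the fraction of points in the largest cluster) are illustrative choices.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Sample cluster sizes for n points from a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # Point i starts a new cluster with weight alpha, or joins an
        # existing cluster with weight equal to that cluster's size.
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for k, s in enumerate(sizes):
            if u < s:
                sizes[k] += 1
                break
            u -= s
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def fixed_size_cluster_sizes(n, size):
    """Toy partition (an assumption for illustration): clusters of at most `size` points."""
    full, rest = divmod(n, size)
    return [size] * full + ([rest] if rest else [])


def largest_fraction(sizes):
    """Fraction of all data points that fall in the largest cluster."""
    return max(sizes) / sum(sizes)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 10000):
        crp = largest_fraction(crp_cluster_sizes(n, alpha=1.0, rng=rng))
        toy = largest_fraction(fixed_size_cluster_sizes(n, size=2))
        print(f"n={n:6d}  CRP largest fraction={crp:.3f}  fixed-size={toy:.5f}")
```

Under these assumptions, the largest-cluster fraction of the fixed-size partition shrinks toward zero as `n` grows, whereas for the Chinese restaurant process it typically remains a non-negligible share of the data. This is the linear-growth behavior that the abstract describes as undesirable for tasks such as entity resolution.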

research
10/31/2016

Flexible Models for Microclustering with Application to Entity Resolution

Most generative models for clustering implicitly assume that the number ...
research
09/06/2022

Fast Generation of Exchangeable Sequence of Clusters Data

Recent advances in Bayesian models for random partitions have led to the...
research
10/29/2020

Attentive Clustering Processes

Amortized approaches to clustering have recently received renewed attent...
research
04/04/2020

Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of eac...
research
07/15/2020

Mixture Complexity and Its Application to Gradual Clustering Change Detection

In model-based clustering using finite mixture models, it is a significa...
research
10/05/2017

Reliable Learning of Bernoulli Mixture Models

In this paper, we have derived a set of sufficient conditions for reliab...
research
09/23/2016

Fast Learning of Clusters and Topics via Sparse Posteriors

Mixture models and topic models generate each observation from a single ...
