Fast Generation of Exchangeable Sequence of Clusters Data

09/06/2022
by Keith Levin, et al.

Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet Process or Pitman-Yor process mixture models, samples a priori from ESC models cannot be easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
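The rejection-sampling baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' faster method. It assumes the ESC prior is described by drawing cluster sizes i.i.d. from a distribution on the positive integers and conditioning on their sum equalling the number of observations n. The function names, the shifted-geometric size distribution, and the `max_tries` cap are hypothetical choices made only for this example.

```python
import random


def sample_esc_sizes(n, draw_size, rng, max_tries=100_000):
    """Naive rejection sampler for cluster sizes that sum exactly to n.

    Assumption: sizes are i.i.d. draws from `draw_size`, and a proposal is
    accepted only if the running total lands exactly on n.
    """
    for _ in range(max_tries):
        sizes, total = [], 0
        # Keep drawing cluster sizes until the total reaches or passes n.
        while total < n:
            s = draw_size(rng)
            sizes.append(s)
            total += s
        if total == n:
            return sizes  # accepted: the sizes partition n observations
        # Otherwise the total overshot n: reject and start over.
    raise RuntimeError("no accepted draw within max_tries")


def shifted_geometric(rng, p=0.5):
    """Hypothetical size distribution on {1, 2, ...} (illustration only)."""
    s = 1
    while rng.random() > p:
        s += 1
    return s


rng = random.Random(0)
sizes = sample_esc_sizes(20, shifted_geometric, rng)
print(sizes, sum(sizes))
```

Each rejected proposal discards every size drawn so far, so the cost depends on how often the running total lands exactly on n. That acceptance event is the kind of quantity discrete renewal theory describes, which is the connection the abstract draws on.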


research
12/02/2015

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Most generative models for clustering implicitly assume that the number ...
research
10/31/2016

Flexible Models for Microclustering with Application to Entity Resolution

Most generative models for clustering implicitly assume that the number ...
research
08/23/2020

A Prior for Record Linkage Based on Allelic Partitions

In database management, record linkage aims to identify multiple records...
research
07/14/2020

A More Robust t-Test

Standard inference about a scalar parameter estimated via GMM amounts to...
research
09/29/2014

Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models

We develop a sequential low-complexity inference procedure for Dirichlet...
research
04/04/2020

Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of eac...
research
04/26/2021

Powered Dirichlet Process for Controlling the Importance of "Rich-Get-Richer" Prior Assumptions in Bayesian Clustering

One of the most used priors in Bayesian clustering is the Dirichlet prio...
