Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)

02/21/2018
by Chen Luo, et al.

Split-Merge MCMC (Markov Chain Monte Carlo) is one of the essential and popular variants of MCMC for problems where an MCMC state consists of an unknown number of components or clusters. It is well known that state-of-the-art methods for split-merge MCMC do not scale well. Strategies for rapid mixing require smart and informative proposals to reduce the rejection rate. However, all known smart proposals incur a cost that is at least linear in the size of the data, i.e., at least O(N), to suggest informative transitions. Thus, the cost of each iteration is prohibitive for massive-scale datasets. It is further known that uninformative but computationally efficient proposals, such as random split-merge, lead to extremely slow convergence. This tradeoff between mixing time and per-update cost seems hard to get around. In this paper, we get around this tradeoff by utilizing simple similarity information, such as cosine similarity, between the entity vectors to design a proposal distribution. Such information is readily available in almost all applications. We show that the recent use of locality sensitive hashing for efficient adaptive sampling can be leveraged to obtain a computationally efficient pseudo-marginal MCMC. The new split-merge MCMC has a constant-time update, just like random split-merge, and at the same time the proposal is informative and needs significantly fewer iterations than random split-merge. Overall, we obtain a favorable tradeoff between convergence and per-update cost. As a direct consequence, our proposal, named LSHSM, is around 10x faster than the state-of-the-art sampling methods on both synthetic datasets and two large real datasets, KDDCUP and PubMed, with several million entities and thousands of cluster centers.
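
To make the idea of similarity-aware sampling concrete, the sketch below hashes entity vectors with signed random projections (SimHash, a standard locality sensitive hash for cosine similarity) and then draws a partner for a given entity from its hash bucket. It is only an illustration under stated assumptions, not the paper's LSHSM procedure: the function names `build_simhash_table` and `sample_similar`, the single hash table, and the uniform draw within a bucket are all hypothetical choices, and the acceptance-ratio correction a valid MCMC sampler needs is omitted.

```python
import numpy as np

def build_simhash_table(X, num_bits, rng):
    """Hash each row of X with signed random projections (SimHash).

    Hypothetical helper: returns the random hyperplanes and a dict that
    maps each integer hash code to the list of row indices in that bucket.
    """
    planes = rng.standard_normal((X.shape[1], num_bits))
    bits = (X @ planes) >= 0                      # one sign bit per hyperplane
    weights = 1 << np.arange(num_bits)            # pack the bits into an integer
    codes = bits.astype(np.int64) @ weights
    table = {}
    for idx, code in enumerate(codes):
        table.setdefault(int(code), []).append(idx)
    return planes, table

def sample_similar(query_idx, X, planes, table, rng):
    """Draw a partner for entity `query_idx` from its own hash bucket.

    Vectors with high cosine similarity collide more often, so the draw is
    biased toward similar entities without scanning all N rows. Returns
    None when the bucket holds no other entity.
    """
    bits = (X[query_idx] @ planes) >= 0
    code = int(bits.astype(np.int64) @ (1 << np.arange(planes.shape[1])))
    bucket = [i for i in table.get(code, []) if i != query_idx]
    if not bucket:
        return None
    return int(rng.choice(bucket))

# Example: two opposite directions stand in for two well-separated clusters.
rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, 3.0])
X = np.vstack([np.tile(direction, (5, 1)), np.tile(-direction, (5, 1))])
planes, table = build_simhash_table(X, num_bits=8, rng=rng)
partner = sample_similar(0, X, planes, table, rng)
```

In this toy setup the partner for entity 0 comes from the rows that share its direction, which is the kind of pair a merge proposal would want to consider. The cost of a draw depends on the bucket size rather than on N, which is the intuition behind a cheap yet informative proposal.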


research
01/08/2012

A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process

The hierarchical Dirichlet process (HDP) has become an important Bayesia...
research
08/18/2020

Multi-Scale Merge-Split Markov Chain Monte Carlo for Redistricting

We develop a Multi-Scale Merge-Split Markov chain on redistricting plans...
research
05/31/2014

Adaptive Reconfiguration Moves for Dirichlet Mixtures

Bayesian mixture models are widely applied for unsupervised learning and...
research
07/23/2018

Subsampling MCMC - A review for the survey statistician

The rapid development of computing power and efficient Markov Chain Mont...
research
04/26/2019

Dynamic MCMC Sampling

The Markov chain Monte Carlo (MCMC) methods are the primary tools for sa...
research
12/25/2018

Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures

Motivation: With the development of droplet based systems, massive singl...
research
08/11/2014

Comparing Nonparametric Bayesian Tree Priors for Clonal Reconstruction of Tumors

Statistical machine learning methods, especially nonparametric Bayesian ...
