Distributed Bayesian clustering
Many modern applications involve analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, clustering is often a primary goal. Existing distributed clustering algorithms are mostly distance- or density-based and lack a likelihood specification, which precludes formal statistical inference. We introduce a nearly embarrassingly parallel algorithm for distributed clustering that uses a Bayesian finite mixture of mixtures model, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g., skewed or multi-modal). With the data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws, and then refine across nodes to obtain a final cluster estimate based on any loss function on the space of partitions. DIB-C can also provide a posterior predictive distribution, estimate cluster densities, and quickly classify new subjects. Both simulation studies and real data applications show superior performance of DIB-C in terms of robustness and computational efficiency.
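As a rough illustration of two ingredients mentioned above, the following Python sketch shows (i) randomly partitioning observation indices across nodes and (ii) choosing a point estimate from a set of clustering draws by minimizing a loss on partitions. It is not the DIB-C algorithm itself: the abstract does not specify the cross-node refinement step, so that step is omitted, and Binder's loss is used only as one familiar example of a loss on partitions. All function names (`random_partition`, `similarity_matrix`, `binder_loss`, `point_estimate`) are hypothetical, and the draws are hand-written rather than produced by MCMC.

```python
import random


def random_partition(n, n_shards, seed=0):
    """Randomly split indices 0..n-1 into n_shards groups of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[s::n_shards]) for s in range(n_shards)]


def similarity_matrix(draws):
    """Fraction of draws in which each pair of points shares a cluster."""
    n = len(draws[0])
    psm = [[0.0] * n for _ in range(n)]
    for z in draws:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:
                    psm[i][j] += 1.0 / len(draws)
    return psm


def binder_loss(z, psm):
    """Binder's loss (equal costs) of partition z, summed over pairs i < j."""
    n = len(z)
    return sum(
        abs((z[i] == z[j]) - psm[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def point_estimate(draws):
    """Return the draw with the smallest Binder loss (assumed choice of loss)."""
    psm = similarity_matrix(draws)
    return min(draws, key=lambda z: binder_loss(z, psm))


# Hand-written label vectors standing in for MCMC clustering draws.
draws = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
shards = random_partition(10, 3)   # three index groups covering 0..9
print(point_estimate(draws))       # prints [0, 0, 1, 1]
```

Restricting the search to the sampled draws keeps the sketch short; any other loss on partitions could be substituted for `binder_loss` without changing the rest of the code.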