Recent technological advances have greatly accelerated data collection processes, leading to explosively growing data sizes. These data sets are often too big to be stored on a single computer and are costly to move across computers. One common query to these data sets is cluster analysis, which seeks to group observations that are cohesive and separated from other groups. Large scale data sets from astronomy, flow cytometry and many other fields raise questions as to how to discover underlying clusters quickly while allowing for statistical inference.
To cluster large scale data sets, parallel and distributed clustering algorithms have been proposed, mostly based on either distance or density without a likelihood specification. Such algorithms include density based distributed clustering [Januzaj et al., 2004]
, K-Means with MapReduce (PKMeans)[Zhao et al., 2009] and co-clustering with MapReduce [Papadimitriou and Sun, 2008]. These methods, mostly non-parametric, do not have established statistical properties in general.
In contrast to such methods, Bayesian finite mixture models allow simultaneous clustering, density estimation and uncertainty quantification of cluster-specific parameters. Let , be a sample of size . The model assumes that of dimension is generated from a finite mixture with exchangeable mixture components:
where is the weight associated with component satisfying and is the component density specified by parameter . A Bayesian model puts priors on the model parameters (e.g. and ) enabling posterior inference. With each component interpreted as a cluster, this model has been successfully applied to many areas, including agriculture, astronomy, bioinformatics, biology, economics, engineering, genetics, etc.
Finite mixture models often require a predetermined , but the number of clusters is generally unknown. Even though one can fit multiple models with different and identify the best one based on model selection criteria (e.g. BIC), such a procedure can be time-consuming given a large data set and less appealing than the natural Bayesian approach, which is to treat the true number of clusters as an unknown parameter to be estimated jointly with the component-specific parameters. Relevant methods include nonparametric Bayesian mixture models and overfitted finite mixture models. In the former category, commonly used models are Dirichlet process (DP) or Pitman-Yor process (PYP) mixtures; both carry an implicit assumption that as the sample size goes to infinity, the number of clusters inevitably tends to infinity with [Korwar and Hollander, 1973] and (Miller and Harrison  and Orbanz ), respectively, where and are constants. This means the posterior under a DP model fails to concentrate at the true number of components for data from a finite mixture. Overfitted finite mixtures instead apply a finite mixture model with the number of components intentionally set to be greater than the true number of clusters . This approach is more useful for data with a moderate number of clusters that does not increase as the number of observations increases. Setting the prior on to be , Rousseau and Mengersen  studied the asymptotic behavior of its posterior distribution: if , where is the dimension of the cluster-specific parameters , the posterior expectation of the weights associated with empty clusters asymptotically converges to zero.
A critical component of the finite mixture model specification is the choice of kernel. Although the Gaussian distribution is typically used, most data clusters in real world applications deviate from such a simple symmetric choice. Such misspecification can lead to identification of extraneous clusters to improve model fit, which essentially destroys the interpretation of a mixture component as one cluster. A remedy is to instead model each cluster by a finite Gaussian mixture, a model that can accurately approximate a wide class of distributions. However, such an approach encounters major issues in terms of identifiability of the separate components or clusters, so some care is needed in the specification.
Although Bayesian mixture models are appealing in providing uncertainty quantification in clustering and associated inferences, a key disadvantage is the large computational cost, particularly as the sample size grows. Joint posterior distributions typically have an intractable analytic form, and one typically relies on Markov chain Monte Carlo (MCMC) sampling to generate draws from the posterior. Conventional MCMC algorithms are too slow to be practically useful in massive data applications. Their inherently serial nature also poses challenges in constructing a parallel counterpart. Embarassingly parallel (EP) MCMC algorithms have been recently developed to deal with this problem (Scott et al., 2016, Neiswanger et al., 2014, Minsker et al., 2014, Srivastava et al., 2015, Wang et al., 2015, Li et al., 2017). The idea is to break the data into subsets, run MCMC in parallel for each subset and then combine the results.
Inspired by embarrassingly parallel MCMC, we propose a distributed Bayesian clustering framework and we term it DIB-C. The general idea is to produce approximate samples of cluster assignments for big data under a finite mixture of mixtures model and identify an optimal cluster estimate based on a decision-theoretic approach. For scalability, we produce samples of cluster assignments in each partition of the data and combine them for an approximate sample of global cluster allocations. Because clusters describe inherent relationships among all the data points, the independent clustering performed on each subset must be carefully merged. To minimize data transmission, we employ one of the simplest parallel programming paradigms, master-slave, and communicate only summary statistics between the master and slave nodes to refine local cluster assignments across all partitions of the data in order to produce global cluster allocations.
Our main contributions lie in developing a decision theoretic approach to identifying a reliable cluster estimate while minimizing data transmission between the master and slaves nodes, a key consideration in distributed computing. Besides significant computational gain, DIB-C exhibits superior clustering performance in comparison to its non-distributed counterpart due to better mixing in MCMC. In addition, DIB-C can accommodate any loss function on the space of partitions for cluster estimation and enables uncertainty quantification of cluster-specific parameters (such as cluster centers), density estimation and quick classification of a new subject. As a side effect, DIB-C also works for semi-supervised clustering: when the true number of clusters is greater than that represented in the labeled data, DIB-C can automatically determine the number of clusters via the use of an overfitted finite mixture model.
2 Model specification: finite mixture of mixtures
Clusters are groups of data points that are cohesive and connected within a group but are separated from other groups. Assume can be partitioned into non-overlapping subsets, with subset , denoted by , residing on slave node . We follow Malsiner-Walli et al.  (henceforth MWFSG)’s Bayesian finite mixture of Gaussian mixtures (FMGM) for clustering. Assume each component distribution in (1) is a mixture of Gaussian subcomponents:
where and are Gaussian densities used to approximate . Such a formulation leads to a hierarchy: the upper level (1) captures a heterogeneous population with different clusters and each cluster corresponds to a mixture component; the lower level (2) approximates each cluster distribution via a mixture of Gaussian densities . To distinguish the upper and lower level components, we adopt the convention in MWFSG to call cluster distribution and subcomponent distribution in cluster .
where . (3) is invariant to permutations of the subcomponents: exchanging subcomponents between clusters does not alter the likelihood, but goes contrary to the general characterization of data clusters, which is a densely connected cloud of data points far away from other densely connected ones. To incorporate such structure, MWFSG proposed a two-level hierarchical prior that repulses the cluster centers and attracts subcomponent means towards the cluster centers. Additionally, since cluster structure is invariant to the ordering of both clusters and subcomponents within a cluster, symmetric priors should be used for clusters on the upper and lower level, respectively, to ensure exchangeability.
Let be a set of fixed hyper-parameters. The priors at the cluster level are specified so that the clusters are exchangeable:
where , and are independent and identically distributed a priori.
Within each cluster , the prior distribution can be factored as:
where , are independently distributed conditional on , and are independent conditional on .
To create the conditional independence in (4) and (5), MWFSG formulated hierarchical “random effects” priors: first the cluster-specific parameters are drawn from the same set of distributions and then, conditional on these, the subcomponent-specific parameters within cluster are drawn from another set of distributions for all .
Specifically, cluster-specific parameters and , are drawn from:
where is the overall data center. Cluster centers are generated around with controlling the shrinkage of towards . MWFSG set , where is the sample covariance of all data so that cluster centers lie relatively far from each other.
Conditional on the cluster-specific random hyperparametersand the fixed lower level hyperparameters , the subcomponent means and covariance matrices are drawn independently for all :
() should be close to the cluster center to ensure no gaps among subcomponents and (
) should be diffuse so that the boundary or outlier points can be well fitted. Therefore, we needto impose strong shrinkage of ’s toward and
to be small to induce large variances in’s while permitting large variation of ’s.
To elicit the prior, MWFSG decomposes the variation of an observation into three sources:
where is the proportion of variability explained by the cluster centers around the overall mean , and is the proportion of total variation explained by subcomponent means around cluster center , for all . By setting and , we balance the variation of three sources and can elicit prior parameters accordingly.
3 DIB-C framework
The general idea of the DIB-C framework is to produce approximate MCMC samples of cluster assignments under the assumption of a finite mixture of mixtures model in a distributed computing paradigm. Specifically, we sample cluster assignments in each partition of the data and combine them for approximate samples of global cluster allocations.
Because clusters describe inherent relationships among all the data points, the independent clustering performed on each subset must be carefully merged. To minimize data transmission, we communicate only summary statistics between the master and slaves nodes to refine local cluster assignments across all partitions of the data in order to produce reliable global cluster allocations.
Based on the above discussion, we propose a DIB-C framework that consists of five steps: a. partitioning: randomly partitioning the data and distributing them over nodes (if they are stored in one machine); b. local clustering: running an embarrassingly parallel MCMC to obtain samples of cluster assignments on each partition of the data based on a finite mixture of mixture model; c. local cluster refinement: refining samples of local clusters via summary statistics sent to the master node; d. global cluster estimation: from refined local cluster assignments, identifying a global cluster estimate that minimizes the expected posterior loss defined on the space of partitions; and e. parameter estimation: conditional on the optimal global cluster estimate drawing model parameters from MCMC and performing inference. Local cluster refinement (step c), global cluster estimation (step d) and parameter estimation (step e) incur a small amount of data transmission between the master and slave nodes. In the rest of this section, we introduce the notations used throughout the paper and elaborate on steps b, c, d and e.
We represent any quantity associated with subset or node by adding a subscript , . Let the sample size of subset be , and . Assume subset is distributed to and processed on the slave node , .
As discussed in Section 2, the finite mixture of mixtures model induces both latent cluster and subcomponent allocations of the data. Let a latent cluster allocation be , where contains indices of the data in the th cluster and is the number of clusters in of sample size ; similarly let a latent subcomponent allocation be . Alternatively, a latent cluster (resp. subcomponent) allocation can be represented by (resp. ), where (resp. ) indicates that data point belongs to cluster (resp. subcomponent ).
In a distributed computing paradigm, we also express (resp. ) as (resp. ), where (resp. ) represents a latent cluster (resp. subcomponent) allocation for . In local cluster refinement (step b), both and are refined and updated. Let the updated cluster allocation be , where represents the updated cluster allocation based on .
3.2 Local clustering
Let the overall posterior density given and the th subset posterior density given , be
In our algorithm, we run MCMC on nodes in parallel based on (9), producing draws from each subset posterior , . To produce draws from , (or
in a serial algorithm), we can run a block conditional Gibbs sampler with data augmentation. The sampler alternates between imputing cluster allocations and updating parameters specific to each cluster, with the latter step alternating between imputing subcomponent allocations and updating subcomponent-specific parameters. See Appendix for a detailed sampling scheme. One can, however, adopt any other sampling scheme to improve mixing (such as a collapsed Gibbs sampler) as long as samples of cluster and subcomponent allocations are produced.
3.3 Local cluster refinement
The clustering , generated from naively combining the clusterings from the subsets, in general does not mimic a sample of cluster allocation from for two reasons.
First, the cluster labeling can vary across nodes. Figure 1 shows a sample of local cluster allocations from three different nodes when the data set is partitioned and distributed to 4 nodes. We can see that cluster 9 in subset 1 corresponds to cluster 10 in subset 3. Second, the clustering structure could vary across nodes. For example, a single cluster in one subset (e.g. cluster 1 in subset 1 in Figure 1) can correspond to several smaller ones (e.g. cluster 1 and 3 in subset 2) in another subset. Even worse, a cluster in one subset corresponds to a significant number of, but not all, data points from multiple clusters in another subset. Therefore, an algorithm to adjust samples of local cluster assignments, particularly enabling merging and splitting clusters for handling the second issue, is essential.
Ni et al.  proposed an algorithm for cluster merging using Bayesian nonparametric (BNP) methods in a distributed setting. After running an embarrassingly parallel MCMC, they freeze the local clusters, meaning that the observations within each cluster will never be split but possibly merged in the subsequent steps, and randomly divide the frozen clusters into nodes, where , for further clustering based on the same BNP model. They iteratively perform the above procedures until the number of frozen clusters is sufficiently small to be clustered in a single node. A deficiency of this algorithm is that if clusters are incorrectly merged at some iteration, then there is no hope of recovering the true clustering structure because they can never be split.
Inspired by Ni et al. , we propose a simple and communication efficient algorithm that permits both cluster merging and splitting. In contrast with Ni et al. , we freeze observations at the subcomponent level, rather than at the cluster level. One big advantage is that heavily overlapping subcomponents provide a natural solution to cluster refinement: we can merge or align the frozen local subcomponents across subsets based on their degrees of overlap and map the updated subcomponent assignments to the cluster level, which results in automatic merges or splits of the clusters. Another advantage, since each subcomponent can be described by a Gaussian distribution, is that the natural model to enable such joint grouping is simply a finite mixture of Gaussians; if the distributions of any two subcomponents can be well approximated by a single Gaussian kernel, the subcomponents are likely to be grouped together. In Figure 1, for example, subcomponent 1, 7 and 16 from subset 1, 2 and 3, respectively, are likely to be assigned to the same group due to their high overlap; mapping the group label back to the cluster level results in alignment of cluster labels and automatic merging of cluster 2 and 5 in subset 2, which is shown in Figure 2.
Before further illustrating the algorithm, we define some notation. Since each cluster contains subcomponents, we index the th subcomponent in cluster for subset by , and refer to a non-empty subcomponent as an item, using the word “item” to distinguish these objects from single observations. Let a partition by item in subset be , where contains the indices of data points in the th item and is the number of items in subset , ; the total number of items across all subsets is . Label the index set , i.e. the th item in subset , by and by for . Thus the order is For the
item, denote the associated data, group allocation (to be defined in the next paragraph), sample size, mean and second moment by, , , and , respectively, . Let and its associated sample size, mean and second moment be , and , respectively. We use subscript to represent the associated quantity with the th item removed.
To group based on their similarity, we employ a finite mixture of Gaussian model with components, each of which represents a group:
where , and .
Placing commonly used conjugate priors onand
yields a Bayesian model. The refinement procedure is not sensitive to the specific choice of prior parameters because each item contains many observations.
Based on (10), we can derive a collapsed Gibbs sampler, where we integrate out the model parameters
from the joint posterior and only update the latent group allocation through MCMC sampling; such a sampler in general accelerates convergence to the posterior distribution. The key quantity in updating the group allocation is the posterior probability of itembeing assigned to group :
Specifically, the first factor is
and the second factor is
The details of the derivation are included in the Appendix. Equation (15) is the joint marginal density of observations in item , which is a product of -densities.
Therefore, it suffices to have the slaves nodes send these summary statistics, i.e. sample size , mean and second moment () to the master node. After , and are evaluated on the master node, they are used to evaluate Equation (14) on the master node, as well as being simultaneously communicated to the slave node to evaluate Equation (15) for , in an embarrassingly parallel manner. The resulting values will again be communicated to the master node for evaluation of Equation (13) for each combination of and and an update of latent group allocation , for . Finally, ’s are transmitted to the associated slave node , where the cluster label of item is updated to , . Therefore, the updated cluster allocation, denoted by , is
, with the combined vectorconsidered an approximate sample from . See Algorithm 1 for clearly outlined steps. We find refining post burn-in iterations of local clustering samples , …, , as well as running one iteration of refinement starting from is sufficient for excellent performance.
Finally, we can use a randomly selected reference node to create consistent labels across the subsets. Specifically, a node is randomly drawn so that serves as the reference to which we jointly align the group allocations and then the cluster allocations associated with item are updated to be , a consistent set of labels used across nodes.
It is possible that items originating from the same cluster are assigned to different groups and end up forming separate clusters. It is also possible that items originating from different clusters are assigned to the same or nearby groups, which become subcomponents of the same cluster. This means that our refinement algorithm is flexible enough to allow both cluster merging and splitting, and is thus less sensitive to the clusters found on the subsets.
3.4 Global cluster estimation
Taking a decision-theoretic approach, we identify the point estimate of clustering that minimizes the posterior expected loss:
can often be simplified so that it depends on the posterior only through the posterior similarity matrix. For example, under Binder’s loss , where is the posterior probability that data point and belong to one cluster; this quantity can be easily estimated from the posterior similarity matrix. Therefore, a common practice to identify is through the posterior similarity matrix.
For big data of size , the by posterior similarity matrix would incur both large storage and computational costs, the latter of which is for each iteration in the set , where contains all the iterations involved in creating the posterior similarity matrix.
Here, we take a different perspective by directly exploiting the loss function definition in estimating the posterior expected loss. The fact that clustering is invariant to the permutation of data point indices implies that a loss function must be a function of the joint counts , which is the number of data points in both , the set of data indices in cluster under and , the set of data indices in cluster under , and [Binder, 1978]; these joint counts can be easily obtained through parallel computation. Another advantage of this method is that it can accommodate any loss function defined on the space of partitions, due to its direct use of the definition.
Without loss of generality, we demonstrate our algorithm based on the variation of information (VI) loss. The VI loss is an information theoretic criterion for comparing two clustering structures and is defined as:
where is the entropy function and is the mutual information [Meilă, 2007]; and represent the counts in under and under respectively. The posterior expected VI loss can be simplified to:
up to a constant. Because estimating Equation (21) is computationally intensive — for a given clustering candidate , Wade and Ghahramani  upper bounded 21 by via Jensen’s inequality and had computational cost reduced to ; this still remains astronomical for our case.
Based on the definition in Equation (20), the posterior expected VI loss can instead be estimated by
where is the set of iterations under consideration randomly drawn from . The clustering candidates are a random subset of , that is, we randomly draw a set and consider clustering candidates . The choice of and is at the user’s discretion; our experiments show superior clustering performance with and .
In Equation (22), both and can be computed in parallel, because the refined cluster allocations make the following relationships hold:
For iteration , the slave node first evaluates and (for and ) in parallel, ; such evaluation incurs computational cost , where the largest sample size on any slave node. The slave nodes then transmit the statistics to the master node; for a clustering candidate with clusters, the amount of data communication is , where . Next, the master node evaluates all and based on Equation 23 and then the quantity in Equation 22. The point estimate of the clustering is .
3.5 Parameter estimation
Parameter estimation is optional if cluster estimation is the goal, but interests in inference on the cluster-specific quantities or classifying future subjects make it essential. Specifically, to classify a new subject , one can use a Bayes classifier — evaluating its posterior probability of belonging to cluster : and assigning to the cluster that yields the highest posterior probability.
The general idea of our algorithm is to only sample the subcomponent-specific parameters (e.g. and ) conditional on the aligned subcomponent assignments, allowing the MCMC updates to depend on the data only through summary statistics. These statistics include subcomponent sizes , the sum of squares and the data sums , , (see the Appendix for the details of MCMC updates).
Such an approach has several advantages. First, subcomponent assignments are aligned across all subsets through the local cluster refinement algorithm, which makes the following relationships hold:
Such relationships justify our parallel algorithm: the slave nodes first compute the relevant statistics in parallel and then communicate them to the master node for summing. Such communication only occurs once; because conditional on the subcomponent assignments, the summary statistics are fixed throughout the MCMC updates. Second, the fixed subcomponent assignments and the use of summary statistics significantly reduce computation per iteration and lead to faster convergence due to reduced dependence between iterations. Third, posterior estimates of parameters are often non-identifiable in finite mixture models due to label switching, a phenomenon which occurs when exchangeable priors are placed on the parameters. It is particularly challenging to resolve, despite many available post-sampling relabeling schemes [Stephens, 2000, Sperrin et al., 2010, Papastamoulis and Iliopoulos, 2010, Papastamoulis, 2014, Rodríguez and Walker, 2014], for overfitted mixture models, as superfluous clusters may merge or overlap with other ones, or be empty. By fixing the latent cluster assignments, we bypass label switching at the cluster level, eliminating the need to post process MCMC outputs.
In this section, we provide extensive numerical experiments for assessing the performance of DIB-C, as measured by the scalability of the computation time and classification performance with the data size and number of computing nodes. We use training sets of size thousand, thousand, and million and distribute each data set to clusters ranging in size from 1 to 30 nodes. The test sets contain thousand, thousand, and thousand observations, respectively, i.e. of the total data size. All the experiments were conducted on the Duke Compute Cluster; due to limitations of the cluster, we only benchmark the time of DIB-C up to 30 nodes and performance up to 120 nodes.
When the data are distributed to a single node, local cluster refinement is not necessary; that is, a full MCMC for (local) clustering is immediately followed by (global) cluster estimation. The prior setup is shared by all experiments from the same data set. In the local clustering step, we run 1000 iterations of MCMC across all experiments, with the first 500 being burn-ins. Then we refine 100 iterations of local clustering samples, from which we randomly select 20 to be the clustering candidates considered for the global cluster estimation based on the variation of information loss. To estimate the model parameters, we run 2000 iterations of MCMC conditional on the subcomponent and cluster assignments
4.1 Experimental setup
4.1.1 Synthetic data sets
The simulation setup originated in Baudry et al.  and was later augmented by Malsiner-Walli et al. . The data sets contain four clusters of varying shapes, including one triangle shape, one L shape, one cross shape and one ellipse. They are generated from an eight-component Gaussian mixture with component means
cluster weight vectors and subcomponent weight vectors , ,