Consensus Monte Carlo for Random Subsets using Shared Anchors

by   Yang Ni, et al.

We present a consensus Monte Carlo algorithm that scales existing Bayesian nonparametric models for clustering and feature allocation to big data. The algorithm is valid for any prior on random subsets such as partitions and latent feature allocation, under essentially any sampling model. Motivated by three case studies, we focus on clustering induced by a Dirichlet process mixture sampling model, inference under an Indian buffet process prior with a binomial sampling model, and with a categorical sampling model. We assess the proposed algorithm with simulation studies and show results for inference with three datasets: an MNIST image dataset, a dataset of pancreatic cancer mutations, and a large set of electronic health records (EHR). Supplementary materials for this article are available online.


page 15

page 21

page 24


Scalable Bayesian Nonparametric Clustering and Classification

We develop a scalable multi-step Monte Carlo algorithm for inference und...

Bayesian Heterogeneity Pursuit Regression Models for Spatially Dependent Data

Most existing spatial clustering literatures discussed the cluster algor...

Determinantal consensus clustering

Random restart of a given algorithm produces many partitions to yield a ...

Bayesian Double Feature Allocation for Phenotyping with Electronic Health Records

We propose a categorical matrix factorization method to infer latent dis...

Large-data determinantal clustering

Determinantal consensus clustering is a promising and attractive alterna...

Bayesian Non-Exhaustive Classification for Active Online Name Disambiguation

The name disambiguation task partitions a collection of records pertaini...

Summary Statistics for Partitionings and Feature Allocations

Infinite mixture models are commonly used for clustering. One can sample...

1 Introduction

We develop a consensus Monte Carlo (CMC) algorithm for Bayesian nonparametric (BNP) inference with large datasets that are too big for full posterior simulation on a single machine, due to CPU or memory limitations. The proposed algorithm is for inference under BNP models for random subsets, including clustering, feature allocation (FA), and related models. We distribute a large dataset to multiple machines, run separate instances of Markov chain Monte Carlo (MCMC) simulations in parallel and then aggregate the Monte Carlo samples across machines. The idea of the proposed CMC hinges on choosing a portion of observations as

anchor points (Kunkel and Peruggia, 2018) which are distributed to every machine along with other observations that are only available to one machine. Those anchor points then serve as anchors to merge Monte Carlo draws of clusters or features across machines.

Clustering is an unsupervised learning method that partitions observations into non-overlapping subsets (clusters) with the aim of creating homogeneous groups. A widely used approaches for model-based inference on random partitions is based on mixtures, with each mixture component being a cluster. Bayesian finite mixture models with a random number of terms that allow for inference with an

a priori unknown number of clusters, is first discussed in Richardson and Green (1997)

. A natural generalization of such models are infinite mixtures with respect to discrete random probability measures. In fact, any exchangeable random partition can be represented this way

(Kingman, 1978)

. Prior probability models for random probability measures, such as the mixing measure in this representation, are known as BNP models. Examples include Dirichlet process mixtures (DPM,

Lo 1984; MacEachern 2000; Lau and Green 2007), Pitman-Yor process mixtures (Pitman and Yor, 1997) and variations with different data structures, such as Rodriguez et al. (2011) for mixtures of graphical models, normalized inverse Gaussian process mixtures (Lijoi et al., 2005), normalized generalized Gamma process mixtures (Lijoi et al., 2007), and more general classes of BNP mixture models (Barrios et al., 2013; Favaro and Teh, 2013; Argiento et al., 2010). For a more general discussion of BNP priors on random partitions see also de Blasi et al. (2015).

Feature allocation, also known as overlapping clustering, relaxes the restriction to mutually exclusive and non-overlapping subsets and allocates each observation to possibly more than one subset (“feature”). The most commonly used feature allocation model is the Indian buffet process (IBP, Ghahramani and Griffiths 2006; Broderick et al. 2013a, c).

A convenient way to represent random subsets, in clustering as well as in feature allocation, is as a binary matrix with indicating that experimental unit is a member in the -th random subset. Feature allocation for experimental units into an unknown number of subsets then takes the form of a prior for a random binary ) matrix with a random number of columns. Similarly, a random partition becomes a prior with exactly one non-zero element in each row, and a random number of columns, . An important feature of BNP clustering and feature allocation models is that they do not require an a prior specification of the number of subsets, (clusters or features). An important limitation is the requirement for intensive posterior simulation. Implementation of posterior inference by MCMC simulation is usually not scalable to large datasets.

Several approaches have been proposed to overcome these computational limitations in general big data problems, not restricted to BNP models. Huang and Gelman (2005) proposed CMC algorithms that distribute data to multiple machines, each of which runs a separate MCMC simulation in parallel. Various ways of consolidating simulations from these local posteriors have been proposed. Scott et al. (2016) combined the local posterior draws by weighted averages. Neiswanger et al. (2013) and White et al. (2015) proposed parametric, semiparametric and nonparametric approaches to approximate the full posterior density as the product of local posterior densities. Wang and Dunson (2013) introduced the Weierstrass sampler which applies a Weierstrass transform to each local posterior. Minsker et al. (2014) found the geometric median of local posterior distributions using a reproducing kernel Hilbert space embedding. Rabinovich et al. (2015) developed a variational Bayes algorithm that optimizes over aggregation functions to achieve a better approximation to the full posterior. All but the method by Rabinovich et al. (2015) are specifically designed to aggregate global parameters. However, local parameters i.e., parameters that are indexed by experimental units, such as cluster assignment and latent feature allocation are also of great importance in big data analytics. Even though the method by Rabinovich et al. (2015) can be used to aggregate local structures such as cluster assignment, it assumes the number of clusters is fixed a priori, which limits it applicability. Our contribution is to bridge this gap in the literature.

Blei et al. (2006) developed the first variational Bayes algorithm for posterior inference of DPM which was later extended to several variational algorithms by Kurihara et al. (2007). A similar variational Bayes algorithm for posterior inference under an IBP prior was derived by Doshi-Velez et al. (2009a). Doshi-Velez et al. (2009b) propose a parallel MCMC algorithm which relies on efficient message passing between the root and worker processors. Wang and Dunson (2011) developed a single-pass sequential algorithm for conjugate DPM models. In each iteration, they deterministically assign the next subject to the cluster with the highest probability conditional on past cluster assignments and data up to the current observation. Rai and Daume (2011)

developed a beam-search heuristic to perform approximate maximum a posteriori inference for linear-Gaussian models with IBP prior. Under the same model,

Reed and Ghahramani (2013) proposed a greedy maximization-expectation algorithm, a special case of variational Bayes algorithm, which exploits the submodularity of the evidence lower bound. Broderick et al. (2013b) and Xu et al. (2015)

develop small-variance asymptotics for approximate posterior inference under an IBP prior.

Lin (2013) proposed a one-pass sequential algorithm for DPM models. The algorithm utilizes a constructive characterization of the posterior distribution of the mixing distribution given data and a partition. Variational inference is adopted to sequentially approximate the marginalization. Williamson et al. (2013) introduced a parallel MCMC for DPM models which involves iteration over local updates and a global update. For the local update, they exploit the fact that a Dirichlet mixtures of Dirichlet processes (DP) again defines a DP, if the parameters of Dirichlet mixture are suitably chosen. Ge et al. (2015) used a similar characterization of the DP as in Lin (2013). But instead of a variational approximation, they adapted the slice sampler for parallel computing under a MapReduce framework. Tank et al. (2015) developed two variational inference algorithms for general BNP mixture models. Recently, Zuanetti et al. (2019)

suggested two efficient alternatives using DPM with conjugate priors. The first approach is based on a predictive recursion algorithm

(Newton et al., 1998) which requires only a single scan of all observations and avoids expensive MCMC. The second approach is a two-step MCMC algorithm. It first distributes data onto different machines and computes clusters locally in parallel. A second step combines the local clusters into global clusters. All steps are carried out using MCMC simulation under a common DPM model. Ni et al. (2019a) extended the second approach to non-conjugate models.

We propose a new CMC algorithm specifically for aggregating subject-specific latent combinatorial structures that often arise from BNP models. The proposed CMC algorithm is applicable to both clustering and feature allocation problems. It uses a similar notion, anchor points, as in Kunkel and Peruggia (2018) but with a completely different purpose. In Kunkel and Peruggia (2018)

, they used anchor points to address label switching in finite Gaussian mixture models whereas in this paper, we use anchor points to combine posterior draws of random clusters or features across different shards of the original data. The proposed CMC can reduce the computation cost by at least a factor of

where is the number of computing cores at disposal. Since modern high performance computing centers typically have tens of thousands to hundreds of thousands computing cores in total, is easily in the order of thousands.

The rest of this article is organized as follows. We provide relevant background for CMC and two BNP models in Section 2. The proposed CMC is introduced in Section 3. The utility of the proposed CMC is demonstrated with simulation studies in Section 4 and three real applications in Section 5. We conclude the paper by a brief discussion in Section 6.

2 Background

CMC schemes are algorithms to generate approximate posterior Monte Carlo samples given a large dataset. The idea of CMC can be summarized in three steps: (i) distribute a large dataset to multiple machines (shards); (ii) run separate MCMC algorithm on each machine without any inter-machine communication; and (iii) aggregate Monte Carlo samples across machines to construct consensus posterior samples. Step (iii) is non-trivial.

Let denote the full dataset and let denote shard for . And let denote the global parameters. Assuming to be exchangeable, the posterior distribution can be written as


where the fractional prior preserves the total amount of prior information. Expression (1) is exact and involves no approximation. Ideally, if one can compute analytically for each shard , then (1) can be directly used to obtain the posterior distribution based on the entire dataset. However, is usually not analytically available and only Monte Carlo samples are returned from step (ii), which necessitates step (iii).

Various methods have been proposed to aggregate Monte Carlo samples such as weighted averaging (Scott et al., 2016)

, density estimation

(Neiswanger et al., 2013; White et al., 2015), geometric median (Minsker et al., 2014) and a variational algorithm (Rabinovich et al., 2015). One common limitation of existing approaches is that they cannot aggregate Monte Carlo samples of subject-specific latent structure (e.g. cluster assignment and feature allocation) that often arise in posterior inference under BNP models. For example, in a feature allocation problem each shard introduces new, additional feature memberships. That is, could be partitioned as , with each shard adding an additional set of parameters .

BNP models are Bayesian models defined on an infinite-dimensional parameter space. Common examples of BNP models include the DPM model for clustering and the IBP prior for feature allocation. The main attraction of BNP models are flexible modeling and the full prior support in many models. For example, DP and IBP can automatically select the number of clusters and features based on available data. As examples illustrating the proposed CMC algorithm, we consider applications of the DPM model and the IBP model in three different inference problems. The inference problems are clustering, using a DP prior; FA, using an IBP prior; and double feature allocation (DFA, Ni et al. 2019), using an IBP prior. For the implementation we use the R packages DPpackage and dfa. Next we introduce some notations by way of a brief review of the three models and the inference problems.

2.1 Dirichlet process mixtures and random partitions

Let denote data observed on experimental units . Some applications call for clustering, i.e., a partition of into , with for . A widely used model-based approach to implement inference on the unknown partition is to assume i.i.d. sampling from a mixture model, . Here

is, for example, a normal distribution with location

(leaving the scale parameter as an additional hyperparameter), and

is a discrete mixing measure. Introducing latent variables , the model can equivalently be written as a hierarchical model


The discrete nature of gives rise to ties among the , which in turn naturally define the desired partition of . Assume there are unique values, denoted by , and define clusters . Sometimes it is more convenient to alternatively represent the partition using cluster membership indicators if . Yet another alternative representation, that will be useful later, is using binary indicators with if . Collecting all indicators in an binary matrix , the constraint to non-overlapping subsets becomes for all rows .

Model (2) is completed with a BNP prior on , for example, where denotes a DP prior with concentration parameter and base measure . Model (2

) with the DP hyperprior on

defines the DPM model. See, for example, Ghoshal (2010) for a good review. The implied distribution over partitions under a DPM model is known as the Chinese restaurant process. Inference on the random partition is straightforward to implement through, for example, Algorithm 8 in Neal (2000). Neal’s algorithm 8 iterates between two steps: (i) For , sample given the th observations , cluster-specific parameters , and cluster assignments for the rest of the observations. (ii) For , sample given data and cluster assignments .

2.2 Indian buffet process and feature allocation

The IBP is a BNP prior for random subsets that can possibly overlap and need not be exhaustive, i.e, without the restrictions of a partition. Again, the number of random subsets is random. The subsets are known as features. A prior defines a random feature allocation model (Broderick et al., 2013a). Similar to before, we can alternatively represent the feature allocation with feature membership indicators, if , now without the constraint to unit row sums. The columns of represent the features .

Xu et al. (2015) use random feature allocation to develop inference for tumor heterogeneity, i.e., the deconvolution of a heterogeneous population of tumor cells into latent homogeneous subclones (i.e. cell subtypes). The experiment records short reads counts of single nucleotide variants (SNVs) (essentially, mutations relative to a given reference) in tumor tissues . The hypothesized homogeneous subclones are charecterized by the presence or absence of these SNVs. In this application, if SNV is present in subclone . Let denote the observed counts of SNV in sample and let denote the total counts at locus . Let denote the unknown proportion of subclone in tumor . The experimental setup implies independent binomial sampling, for and ,


where is the relative frequency of a SNV in the background and are feature-specific parameters. The model is completed with a Dirichlet prior on , a beta prior on and a prior on the feature allocation .

The most widely used prior for feature allocation is the IBP. It defines a prior distribution for an binary matrix with a random number of columns. We start the model construction assuming a fixed number of features, to be relaxed later. Conditional on ,

’s are assumed to be independent Bernoulli random variables,

with following a conjugate beta prior, . Here is a fixed hyperparameter. Marginalizing out ,

where is the sum of the th column of .

Let be the -th harmonic number. Next, take the limit and remove columns of with all zeros. Let denote the number of non-empty columns. The resulting matrix follows an prior (without a specific column ordering), with probability


With a finite sample size, the number of non-empty columns is finite almost surely. Let denote the sum in column , excluding row . Then the conditional probability for is


provided , where is the -th column of excluding -th row. And the distribution of the number of new features (non-empty columns) for each row is . Posterior inference can be carried out using an algorithm similarly to Neal’s algorithm 8. The posterior distribution may present many peaked modes for moderate to large total counts , which makes MCMC inefficient. To improve mixing, Ni et al. (2019b) used parallel tempering to flatten the posterior while still targeting the right posterior distribution. We will follow their strategy in MCMC.

2.3 Double feature allocation

Some applications call for simultaneous clustering of rows and columns of a data matrix. This is known as bi-clustering (Hartigan, 1972). Ni et al. (2019) introduces a similar prior model for random row and column subsets, but now without the restrictions of a partition, i.e., random feature allocation on rows and columns simultaneously. Figure 1 illustrates how matching pairs of subsets and define a disease in electronic health records (EHR) data. The data is an matrix of recorded symptoms, , on patients .

Figure 1: Double feature allocation: simultaneous feature allocation on rows (patients) and columns (symptoms) defines diseases. The data are recorded symptoms, , for patients, , in electronic health records (EHR) data. Each subset of patients and matching subset of symptoms is a different disease, marked by different color boxes. The data matrix records trinary symptoms with possible outcomes .

A prior on random pairs of matching subsets defines a double feature allocation (DFA, Ni et al. 2019). A simple extension of a feature allocation model, such as the IBP, can be used to define a DFA. Let denote a matrix of feature membership indicators for the subsets , as before. We first assume . In particular, induces a prior on the number of subsets . Conditional on we then define a second membership matrix , for membership in matched subsets . Here with recording membership of column in . In anticipation of the upcoming sampling model for trinary symptoms, , we allow for membership , with indicating that disease favors symptom at level (e.g., low blood pressure), indicating that the symptom is favored at level (e.g., high blood pressure), and indicating that the symptom is not related to disease . Let

denote a probability vector. We assume

, with a conjugate hyperprior

. We assume conditionally independent trinary latent logistic regression as a sampling model for the observed symptoms

for patient ,


where are feature-specific weights, and are symptom-specific offsets. The model is completed with priors for the hyperparameters, , and , . In model (6) the feature-specific parameters are for feature . Recognizing the columns of as just another part of feature-specific parameters reveals the nature of the DFA model as a special case of a feature allocation, with one of the feature-specific parameters selecting a matching subset of the columns in the data matrix.

Straightforward modifications define similar models for categorical data with fewer or more categories. Posterior inference can be carried out using an algorithm similarly to Neal’s algorithm 8; see Ni et al. (2019) for implementation details.

3 A consensus Monte Carlo algorithm for random subsets

We describe the proposed CMC algorithm in its general form. Let denote the data of observations. Let denote the latent subset membership matrix (for clusters, features, or row-features in random partitions, FA, and DFA, respectively) where if observation belongs to subset . Let be the -th row of . Let denote an infinite sequence of subset-specific parameters and let denote a subsequence indexed by , i.e. . Many BNP models including DPM, FA, and DFA models can be generically written as a hierarchical model,


for and . The sampling model for in the first line depends on the specific inference model. In the case of a random partition it is the kernel of the mixture model in (2). In the case of an FA or DFA model the sampling model might include multiple when , as in (3) or (6). The specific interpretation is problem-specific. The dependence on in the second line is only indirect through the random number of columns in which defines the number of subsets. In DPM models, is the baseline distribution and is the Chinese restaurant process. In FA and DFA models, is the prior of feature-specific parameters and hyperparameters, and is the IBP.

For a small to moderate sample size , MCMC has been commonly used to implement posterior inference. However, when is large, MCMC becomes computationally prohibitive because at each iteration, it has to scan through all observations. The idea of CMC is to distribute the large dataset onto many shards so that MCMC can be efficiently implemented on each shard with much smaller sample size. Let be a large integer. We randomly divide the observations into non-overlapping shards and . Define for , so that are new shards that all share but are otherwise disjoint. We call the anchor points. Let where denote Monte Carlo samples obtained from the MCMC algorithm applied to the shard . To aggregate the Monte Carlo samples from shards and , we consider merging (i.e. taking the union of) two clusters or features and if


where is the number of different elements in and and is the number of common elements. By convention, we set if . In words, if two random subsets (clusters or features) have similar sets of anchor points, we merge them. The similarity is controlled by the tuning parameter which is a small fixed constant. The choice of (8) includes arbitrary choices. In particular, we note that, for example, the IBP includes positive prior probability for two identical columns in , which could question the appearance of identical subsets of anchor points as a criterion for merging. However, in most applications, including the two motivating applications related to feature allocation in this article, this is not a desirable feature of the IBP, and we argue that the criterion introduces an even desirable approximation. Alternatively, the criterion could include a comparison of feature-specific parameters . In simulation studies and the motivating applications we found the proposed criterion to work well, and prefer the simplicity of (8). We also find that the proposed algorithm is relatively robust with respect to the choice of (see Section 5.1 for sensitivity analysis) if the number of anchor points is sufficiently large.

If we decide to merge and , we merge the associated parameters and according to the following operation. For continuous parameters that index a sampling model, such as in (2), the values are averaged, weighted by the relative sizes of the two merged random subsets; and for categorical values, such as column of in (6), a majority vote is used, again based on the relative subset sizes. Note that the aggregating step is trivially parallelizable with respect to the number of Monte Carlo samples and hence the computation time is neglible compared to MCMC. The complete CMC is summarized in Algorithm 1.

1:Data preparation. Split data into disjoint shards . Form shards: for . Anchor points are present in every shard, whereas only appear in shard .
2:parfor   do
4:end parfor
5:parfor  do
6:     for each pair of pairs and  do
7:         if  then
8:              merge and
9:              merge and
10:         end if
11:     end for
12:end parfor
Algorithm 1

We implement the algorithm in the upcoming simulation studies, and in the motivating applications in the context of DPM models and FA and DFA based on IBP models. However, the approach is more general. It remains valid for any alternative BNP prior on in (2), and any alternative FA prior. For example, the BNP prior could be any other random discrete probability measure, including a normalized completely random measure as in Barrios et al. (2013); Favaro and Teh (2013) or Argiento et al. (2010). The IBP could be replaced by any other random feature allocation (Broderick et al., 2013c). Also, the sampling models in (2), (3) and (6

) are examples. Any other sampling model could be substituted, including a regression on additional covariates, or, in the case of feature allocation, a linear-Gaussian model. While we use it here only for BNP models, the same algorithm can be implemented for inference under any parametric model for random subsets, for example, finite mixture models or finite feature allocation models.

We propose a simple diagnostic to summarize the level of approximation in the CMC. First select any two shards, without loss of generality assuming they are and . We then apply CMC with as anchor points. Denote the point estimate of random subsets (e.g. clusters, latent features) by . In addition, we run a full MCMC simulation in and denote the point estimate by . We summarize the level of the approximation by measuring the distance between and . We illustrate the diagnostic in Section 5.1.

4 Simulation

We carry out simulation studies to assess the proposed CMC algorithm for DPM, FA and DFA models. We use relatively small datasets in the simulations, so that we can make comparison with full MCMC. Scalability will be explored later, in applications. For all models, we evenly split the observations into 5 shards and use one of the shards as anchor points. We report frequentist summaries based on 50 repetitions. For both CMC and MCMC, we run 5,000 iterations, discard the first 50% of Monte Carlo samples as burn-in and only keep every 5th sample. We choose and find it work well throughout the simulations and applications. The sensitivity of the choice of will be assessed in Section 5.1.

4.1 Simulation 1: Clustering under the DPM model

The first simulation considers a CMC approximation of posterior inference in a DPM model for a dimensional variable :

where . The hyperparameters are and .

We construct a simulation truth with true clusters with equal sizes. For , we generate data , where , , , , and , for . The data of one randomly selected simulation is shown in figure 2(a).

In figures 3(a) and 3(b), we show the bar plots of posterior modes for the number of clusters across repeat simulations, evaluated using the CMC and a full MCMC implementation, respectively. Compared to full MCMC, it tends to slightly overestimate the number of clusters. This might be due to the fact that inference under the DP prior typically includes many small clusters, which might include few or no anchor points. We report the misclustering rates and the MSEs of parameter estimation in table 1.

(a) True clusters
(b) True
(c) True
Figure 2: Simulations 1 and 3: Simulation truth. Panel (a) shows pairwise scatter plots of one randomly selected simulated dataset for simulation 1. The marker type and color indicate the true cluster. Panels (b) and (c) show heatmaps of true and for simulation 3. Green cells represent 1, black cells 0 and red cells -1.
(c) FA-CMC
Figure 3: Simulations 1, 2 and 3: Bar plots of posterior mode across repeat simulations for CMC and full MCMC implementations for Simulations 1 (a and b), 2 (c and d), and 3 (e and f), respectively. .
0.06 (0.03) 0.03 (0.01) 0.15 (0.05) 0.05 (0.06) 0.08 (0.01) 0.02 (0.00)
0.00 (0.00) 0.00 (0.00) 0.01 (0.01) 0.02 (0.03) 0.05 (0.04) 0.01 (0.01)
Table 1: Simulations 1, 2 and 3. We report the performance of CMC versus full MCMC for three models, DPM, FA and DFA. The error reports the misallocation rate in estimating , and

reports the MSE or average Hamming distance in estimating subset-specific continuous parameters (simulations 1 and 2) or categorical matrix parameters (simulation 3). The standard deviations are given within the parentheses. Both, mean and standard deviation, are with respect to repeat simulations.

4.2 Simulation 2: Feature allocation using the IBP

Here, we test the performance of CMC for the FA model in Section 2.2. We generate a data matrix with SNVs and tumors. We use the same simulation truth as in Xu et al. (2015). We assume subclones. The latent binary matrix is set as follows: for , for , for and for . We draw where is a random permutation of . We set and , and generate from model (3). The same tempering scheme as in Ni et al. (2019b) is adopted.

Figures 3(c) and 3(d) show the bar plots of the posterior mode of the number of features across simulations, under the CMC (panel (c)) and full MCMC (d) implementations. Similarly to DPM clustering, CMC tends to slightly overestimate the number of features compared to MCMC. Let denote the Hamming distance of two matrices and let denote the total number of elements in a matrix. We define a mis-allocation rate for as the average Hamming distance between the estimator and the truth: . In the case where has more columns than , we remove the extra columns from . The error rate that is reported in table 1 for the FA model summarizes the MSE in estimating the proportions .

4.3 Simulation 3: Double feature allocation using an IBP prior

This simulation considers a CMC approximation of posterior inference in a DFA model with IBP prior. We generate the feature allocation matrix from an IBP() model with and sample size . The resulting matrix with columns and rows is displayed in figure 2(b). Given , we set the feature-specific parameters, with as in figures 2(c). The observations are then generated from the sampling model (6).

Figures 3(e) and 3(f) show the bar plots of posterior modes for the number of features, across simulations for CMC (e) and full MCMC (f) implementations, respectively. As before, CMC tends to slightly overestimate the number of features compared to MCMC. The error rate that is reported in Table 1 for the DFA model summarizes the error in estimating the matched column subsets, i.e., the rows of . We use the same definition based on the Hamming distance as for .

We conclude that the proposed CMC algorithm implements a useful approximation for posterior inference on random subsets under widely used BNP models, for problems similar to the simulation scenarios, which were chosen to mimic the main features of the three motivating examples.

5 Applications

5.1 MNIST image clustering

MNIST (LeCun et al., 1998) is a dataset of images for classification, containing handwritten digits from different writers. The data is often used as a benchmark problem for clustering and classification algorithms. Each image has 2828 pixels that take values between 0 and 255 representing the grey levels. A subset of the images are shown in figures 4(a). To visualize the high-dimensional image data with a 2D scatter plot (figure 4(b)), we use the t-distributed stochastic neighbor embedding (t-SNE, Maaten and Hinton 2008

) with perplexity parameter 75. The t-SNE algorithm is a non-linear dimension reduction tool that maps high-dimensional data onto a two- or three-dimensional manifold.

We randomly split the 70,000 images evenly into 140 shards, each with 500 images, and use one shard as anchor points. We apply the proposed CMC to the t-SNE transformed data with 5,000 iterations (discarding the first 2,500 samples and then thinning out by 5) and . CMC takes less than minutes to run, whereas full MCMC simulation takes approximately 30 minutes for only the first 10 iterations. The posterior distribution of the number of clusters is shown in figure 4(c) which is peaked at a mode at . Conditional on , the estimated clusters are shown in figure 4(b). Keeping in mind that the DP prior favors many singleton clusters, we drop clusters with fewer than 1% of the total number of images (9 posterior estimated clusters were singletons), leaving 12 practically relevant clusters, which only slightly overestimates the desired number of 10 clusters. Next, we evaluate the clustering performance, relative to the known truth in this example, by computing the normalized mutual information (NMI) between the estimated cluster labels and the true labels. NMI is defined as , where denotes the mutual information between two sets of labels and and

is the entropy. NMI is between 0 and 1 with 1 being perfect match between two clusterings. NMI for CMC is 0.76 better than that of K-means 0.72 with

. Note that the purpose of this application using MNIST data is not to train a classifier or supervised model for the prediction of the 10 digits. Instead, we use the MNIST to examine the feasibility and performance of the proposed CMC algorithm for the clustering (i.e., unsupervised learning) of the relatively large dataset.

To assess the level of approximation of CMC, we use the diagnostic proposed earlier. Specifically, we sample two shards and then run CMC (using the same anchor points as before) as well as full MCMC posterior simulation on the merged dataset of the two selected shards and anchor points. Repeating the same procedure 69 times, we find an average NMI (between the estimated partitions from CMC and MCMC) of 0.85 with a standard deviation 0.02, which suggests a good approximation. To assess the sensitivity of the choice of , we repeat the same diagnostic procedure for and . The average NMI is 0.84 for both ’s.

(a) MNIST handwritten digits
(b) Estimated clusters
(c) Posterior distribution on
Figure 4: MNIST. (a) A subset of 100 MNIST handwritten digits. (b) The estimated clusters visualized on 1,000 randomly selected images. The images are embedded in a two-dimensional space using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The true labels are represented by the numbers and the clusters are represented by the colors. (c) The posterior distribution of with mode .

5.2 Inter-tumor Tumor heterogeneity

Tumors are genetically heterogeneous, often containing diverse subclones characterized by genotypic differences. Next-generation sequencing of tumor samples generates short reads from the genomes of multiple cells. Since the sequencing is performed in bulk, it is challenging to reconstruct subclones based on aggregated variant counts (recall the notation from model (3)). FA models have been proposed in the literature to infer tumor heterogeneity, including Bayesian approaches in Lee et al. (2015) and Ni et al. (2019b). Due to the computational limitation of MCMC simulation, these methods are restricted to a relatively small number of variants (typically, ). Xu et al. (2015) proposed a scalable optimization-based algorithm (MAD-Bayes algorithm) to find a posterior mode. We will compare the proposed CMC with MAD-Bayes. We use the same pancreatic ductal adenocarcinoma (PDAC) mutation data analyzed in Xu et al. (2015). The PDAC data record the total read counts and variant read counts at SNVs from tumors.

We randomly split the 6,599 SNV’s into 33 shards. The first 32 shards have 200 SNVs each and the last shard has 199 SNVs which are used as anchor points. We apply the proposed CMC with 5,000 iterations (discarding the first 2,500 as burn-in and then thinning by 5) and the same as before. CMC takes approximately 90 minutes, whereas the full MCMC is infeasible. We find 15 major subclones across tumors after removing latent features with fewer than 5% SNVs. The estimated feature allocation matrix is shown as a heatmap at the top of figure 5. The estimated subclone proportions in each tumor are shown as a heatmap at the bottom of figure 5. The number of subclones and the “checkerboard” pattern of the proportions suggest strong inter-tumor heterogeneity in this data.

We compare the computational efficiency with MAD-Bayes, an optimization-based approach. Since each run of MAD-Bayes algorithm may return different outputs, it is recommended to run the algorithm repeatedly (Xu et al., 2015). MAD-Bayes has a regularization parameter that controls the number of subclones. The parameter needs to be carefully tuned over a range of values (say, 10 values). Using the same computer resources as for CMC (i.e. 32 computing cores), it cannot be finished in a reasonable amount of time.

Figure 5: Tumor heterogeneity. The top part of the heatmap shows the estimated subclonal genotypes of the selected SNV with black and grey cells representing 1 (mutant) and 0 (wildtype), respectively. The columns are subclones and the rows are SNVs. The bottom part of the heatmap shows the estimated proportions of subclones in each tumor. The columns are subclones and the rows are tumors.

5.3 Electronic health records phenotyping

We consider a large EHR dataset with patients from China. The data are from a physical exam of Chinese residents in 2016. We extract the blood test results measured on testing items (shown in table 2) and diagnostic codes for diabetes from the EHR. We implement inference under a DFA prior and sampling model (6). Each latent feature can be interpreted as a latent disease that favors symptoms and is related to a subset of patients . That is, describes the patient-disease relationships and describes symptom-disease relationships. We follow the same procedure in Ni et al. (2019) to preprocess the data. Using the reference range for each test item, we discretize the data and define a symptom if the value of an item falls beyond the reference range. We fix the first column of in the DFA model according to the diabetes diagnosis. Moreover, since diabetes is clinically associated with high glucose level, we incorporate this prior information by fixing the corresponding entry in the first column of . Additional prior knowledge regarding the symptom-disease relationships is incorporated. Creatinine and blood urea nitrogen (BUN) are two important indicators of kidney disease. High levels of these two items suggest impaired kidney function. We fix the two entries (corresponding to creatinine and BUN) of the second column of to 1 and the rest to 0. Similarly, elevated systolic blood pressure and diastolic blood pressure indicate hypertension, and abnormal levels of total bilirubin (TB), aspartate aminotransferase (AST) and alanine aminotransferase (ALT) are indicators of liver diseases. We fix the corresponding entries of the third and fourth column of .

We randomly split the 100,000 patients evenly into 500 shards, each with 200 patients, and use one shard as anchor points. We apply the proposed CMC with 50,000 iterations (discarding the first 25,000 as burn-in and then thinning out to every 10th) and the same as before. CMC simulation takes approximately 1 hour, whereas full MCMC takes 90 minutes for the first 10 iterations. We find 8 major latent diseases in addition to the 4 known diseases after removing tiny latent features (with 1% patients). The estimated symptom-disease relationships are represented by a bipartite network in Figure 6 and also as a heatmap at the top of Figure 7. The estimated patient-disease relationships for 1000 randomly selected patients are shown as a heatmap at the bottom of Figure 7.

Unlike MNIST or the application to tumor heterogeneity, there is no ground truth or alternative implementation for posterior inference for the full data. Instead we compare the results with previous results by Ni et al. (2019) who used a full MCMC implementation for a subset of 1000 patients from the same dataset. Some of our findings are consistent with the earlier results, which suggests a good approximation of the proposed CMC to full MCMC. Moreover, with a hundred times more observations, we are able to find more interpretable results compared to theirs.

Latent disease X1 (previously referred to as lipid disorder) is primarily associated with elevated total cholesterol (TC) and low density lipoprotein (LDL). Patients with high levels of TC and LDL have higher risks of heart disease and stroke. Latent diseases X2 and X3 are associated with the same set of symptoms but with opposite signs. This interesting result is also found in Ni et al. (2019) where X2 and X3 were identified as polycythemia and anemia, respectively. Each of X5, X6 and X8 also finds good correspondence in Ni et al. (2019) as bacterial infection, viral infection and thrombocytopenia.

One prevelant latent disease reported in Ni et al. (2019) does not have a clear interpretation due to the excessive number of symptoms. They suspect a subset of symptoms like decreased plateletcrit (PCT), leukocytes and GRA are due to weak immune system of the elderly population. However, other symptoms are not related to immune system. With a much larger sample size, we are able to single out those symptoms without spurious links through the latent disease X7.

alanine aminotransferase (ALT) aspartate aminotransferase (AST) total bilirubin (TB)
total cholesterol (TC) triglycerides low density lipoproteins (LDL)
high density lipoproteins (HDL) urine pH (UrinePH) mn corpuscular hemoglobin conc (MCHC)
% of monocytes (%MON) alpha fetoprotein (AFP) carcinoembryonic antigen (CEA)
number of monocytes (#MON) plateletcrit (PCT) CV of platelet dist. (PDW-CV)
% of eosinophil (%Eosinophil) basophil ct (#Basophil) % of basophil (%Basophil)
platelet large cell ratio (P-LCR) platelets systolic blood pressure (Systolic)
% of granulocyte (%GRA) body temperature (BodyTemperature) leukocytes
hemoglobin creatinine blood urea nitrogen (BUN)
glucose diastolic blood pressure (Diastolic) heart rate (HeartRate)
erythrocytes hematocrit (HCT) uric acid (UricAcid)
% of lymphocyte (%LYM) mn corpuscular volume (MCV) mn corpuscular hemoglobin (MCH)
lymphocyte ct (#LYM) granulocyte ct (#GRA) mn platelet volume (MPV)
Table 2: Blood test items. Acronyms are given within parentheses. CV indicates coefficient of variation, “dist” is distribution, “mn” is mean, “ct”is count, and “conc” is concentration.
Figure 6: EHR. Bipartite network for symptom-disease relationships from DFA. The diseases are represented by rectangles. Latent diseases are represented by X1X8. Solid lines are symptom-disease relationships inferred from the data whereas dashed lines are fixed by prior knowledge. Grey lines indicate the symptoms are binary. Red (blue) lines with bars (arrowheads) indicate the disease causes the symptom to be lower (higher) than normal range.
Figure 7: EHR. The top part of the heatmap shows the estimated symptom-disease relationships with green, black and red cells representing 1, 0 and -1, respectively. The bottom part of the heatmap shows the estimated patient-disease relationships for 1000 randomly selected patients. The columns correspond to diseases and the rows are symptoms (top portion) and patients (bottom portion) .

6 Discussion

We have developed a simple CMC algorithm for fast approximate posterior inference of BNP models. The proposed CMC is a general algorithm in the sense that it can be used to scale up practically any Bayesian clustering and feature allocation methods. CMC runs MCMC on subsets of observations in parallel and aggregates the Monte Carlo samples. The main idea of this paper is using a subset of observations as anchor points to merge clusters or latent features from different machines. The aggregation step has a tuning parameter which is not influential of the results in our simulations and applications. We explicitly address the problem of large sample size but have not considered the issue of high-dimensionality, which we plan to deal with in our future work.

We discussed the approach for random subsets in random partitions and (double) feature allocation. In the motivating examples we used DPM and IBP priors. But the algorithm remains equally valid for any other prior model. The approach with common anchors remains useful also for any other models that involve random subsets that can be split and merged in a similar fashion, including latent trait models, finite mixture models and finite feature allocation.


The authors express thanks to Andres Christen (CIMAT, Guanajuato, Mexico) for first suggesting the use of anchor points for CMC with random subsets. PM and YJ are partly supported by NIH R01 CA132897.

Supplementary Materials

R code

generates all simulated data.

C code

merges Monte Carlo samples of clusters and latent features.


  • Argiento et al. (2010) Argiento, R., Guglielmi, A., and Pievatolo, A. (2010). Bayesian density estimation and model selection using nonparametric hierarchical mixtures. Computational Statistics & Data Analysis, 54(4):816 – 832.
  • Barrios et al. (2013) Barrios, E., Lijoi, A., Nieto-Barajas, L. E., and Prünster, I. (2013). Modeling with normalized random measure mixture models. Statistical Science, 28(3):313–334.
  • Blei et al. (2006) Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.
  • Broderick et al. (2013a) Broderick, T., Jordan, M. I., and Pitman, J. (2013a). Cluster and feature modeling from combinatorial stochastic processes. Statistical Science, 28(3):289–312.
  • Broderick et al. (2013b) Broderick, T., Kulis, B., and Jordan, M. (2013b). MAD-Bayes: MAP-based asymptotic derivations from Bayes. In

    International Conference on Machine Learning

    , pages 226–234.
  • Broderick et al. (2013c) Broderick, T., Pitman, J., Jordan, M. I., et al. (2013c). Feature allocations, probability functions, and paintboxes. Bayesian Analysis, 8(4):801–836.
  • de Blasi et al. (2015) de Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37:212–229.
  • Doshi-Velez et al. (2009a) Doshi-Velez, F., Miller, K., Van Gael, J., and Teh, Y. W. (2009a). Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144.
  • Doshi-Velez et al. (2009b) Doshi-Velez, F., Mohamed, S., Ghahramani, Z., and Knowles, D. A. (2009b).

    Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process.

    In Advances in Neural Information Processing Systems, pages 1294–1302.
  • Favaro and Teh (2013) Favaro, S. and Teh, Y. W. (2013). MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359.
  • Ge et al. (2015) Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning, pages 2276–2284.
  • Ghahramani and Griffiths (2006) Ghahramani, Z. and Griffiths, T. L. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
  • Ghoshal (2010) Ghoshal, S. (2010). The Dirichlet process, related priors and posterior asymptotics. In Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G., editors, Bayesian Nonparametrics, pages 22–34. Cambridge University Press.
  • Hartigan (1972) Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129.
  • Huang and Gelman (2005) Huang, Z. and Gelman, A. (2005). Sampling for Bayesian computation with large datasets. Technical report, Department of Statistics, Columbia University.
  • Kingman (1978) Kingman, J. F. C. (1978). The representation of partition structures. J. London Math. Soc. (2), 18(2):374–380.
  • Kunkel and Peruggia (2018) Kunkel, D. and Peruggia, M. (2018). Anchored bayesian gaussian mixture models. arXiv preprint arXiv:1805.08304.
  • Kurihara et al. (2007) Kurihara, K., Welling, M., and Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. In IJCAI, volume 7, pages 2796–2801.
  • Lau and Green (2007) Lau, J. W. and Green, P. J. (2007). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526–558.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
  • Lee et al. (2015) Lee, J., Müller, P., Gulukota, K., and Ji, Y. (2015). A Bayesian feature allocation model for tumor heterogeneity. Ann. Appl. Stat., 9(2):621–639.
  • Lijoi et al. (2005) Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100(472):1278–1291.
  • Lijoi et al. (2007) Lijoi, A., Mena, R. H., and Prünster, I. (2007). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society: Series B, 69(4):715–740.
  • Lin (2013) Lin, D. (2013). Online learning of nonparametric mixture models via sequential variational approximation. In Advances in Neural Information Processing Systems, pages 395–403.
  • Lo (1984) Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1):351–357.
  • Maaten and Hinton (2008) Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605.
  • MacEachern (2000) MacEachern, S. N. (2000). Dependent Dirichlet processes. Unpublished manuscript, Department of Statistics, The Ohio State University, pages 1–40.
  • Minsker et al. (2014) Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656–1664.
  • Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
  • Neiswanger et al. (2013) Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.
  • Newton et al. (1998) Newton, M. A., Quintana, F. A., and Zhang, Y. (1998). Nonparametric Bayes methods using predictive updating. In

    Practical nonparametric and semiparametric Bayesian statistics

    , pages 45–61. Springer.
  • Ni et al. (2019) Ni, Y., Mueller, P., and Ji, Y. (2019). Bayesian Double Feature Allocation for Phenotyping with Electronic Health Records. Journal of the American Statistical Association, page in press.
  • Ni et al. (2019a) Ni, Y., Müller, P., Diesendruck, M., Williamson, S., Zhu, Y., and Ji, Y. (2019a). Scalable bayesian nonparametric clustering and classification. Journal of Computational and Graphical Statistics, page in press.
  • Ni et al. (2019b) Ni, Y., Müller, P., Shpak, M., and Ji, Y. (2019b). Parallel-Tempered Feature Allocation for Large-scale Tumor Heterogeneity with Deep Sequencing Data. Technical Report.
  • Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pages 855–900.
  • Rabinovich et al. (2015) Rabinovich, M., Angelino, E., and Jordan, M. I. (2015). Variational consensus Monte Carlo. In Advances in Neural Information Processing Systems, pages 1207–1215.
  • Rai and Daume (2011) Rai, P. and Daume, H. (2011). Beam search based map estimates for the Indian buffet process. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 705–712. Citeseer.
  • Reed and Ghahramani (2013) Reed, C. and Ghahramani, Z. (2013). Scaling the Indian buffet process via submodular maximization. In International Conference on Machine Learning, pages 1013–1021.
  • Richardson and Green (1997) Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: series B, 59(4):731–792.
  • Rodriguez et al. (2011) Rodriguez, A., Lenkoski, A., and Dobra, A. (2011). Sparse covariance estimation in heterogeneous samples. Electronic Journal of Statistics, 5:981.
  • Scott et al. (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
  • Tank et al. (2015) Tank, A., Foti, N., and Fox, E. (2015). Streaming variational inference for Bayesian nonparametric mixture models. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 968–976.
  • Wang and Dunson (2011) Wang, L. and Dunson, D. B. (2011). Fast Bayesian inference in Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 20(1):196–216.
  • Wang and Dunson (2013) Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
  • White et al. (2015) White, S., Kypraios, T., and Preston, S. (2015).

    Piecewise Approximate Bayesian Computation: fast inference for discretely observed Markov models using a factorised posterior distribution.

    Statistics and Computing, 25(2):289.
  • Williamson et al. (2013) Williamson, S. A., Dubey, A., and Xing, E. P. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on International Conference on Machine Learning, pages 98–106.
  • Xu et al. (2015) Xu, Y., Müller, P., Yuan, Y., Gulukota, K., and Ji, Y. (2015). MAD Bayes for tumor heterogeneity – feature allocation with exponential family sampling. Journal of the American Statistical Association, 110(510):503–514.
  • Zuanetti et al. (2019) Zuanetti, D. A., Müller, P., Zhu, Y., Yang, S., and Ji, Y. (2019). Bayesian nonparametric clustering for large data sets. Statistics and Computing, 29:203–215.