MCMC inference algorithms for normalized random measure mixtures
Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. Dirichlet process is a well-known example of NRMs. Most of posterior inference methods for NRM mixture models rely on MCMC methods since they are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) or incremental Bayesian hierarchical clustering (IBHC) have been developed for DP or NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference for NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed since it uses heuristics that simply selects the best solution after multiple trials are made. In this paper, we present a hybrid inference algorithm for NRMM models, which combines the merits of both MCMC and IBHC. Trees built by IBHC outlines partitions of data, which guides Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and enjoys the fast convergence thanks to the effective proposals guided by trees. Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.READ FULL TEXT VIEW PDF
Nonparametric mixture models based on the Dirichlet process are an elega...
We investigate the class of σ-stable Poisson-Kingman random
Normalized compound random measures are flexible nonparametric priors fo...
Repulsive mixture models have recently gained popularity for Bayesian cl...
When using Markov chain Monte Carlo (MCMC) algorithms to perform inferen...
MCMC methods for sampling from the space of DAGs can mix poorly due to t...
Bayesian models are naturally equipped to provide recursive inference be...
MCMC inference algorithms for normalized random measure mixtures
Normalized random measures (NRMs) form a broad class of discrete random measures, including Dirichlet proccess (DP)  normalized inverse Gaussian process , and normalized generalized Gamma process [3, 4]. NRM mixture (NRMM) model  is a representative example where NRM is used as a prior for mixture models. Recently NRMs were extended to dependent NRMs (DNRMs) [6, 7] to model data where exchangeability fails. The posterior analysis for NRM mixture (NRMM) models has been developed [8, 9], yielding simple MCMC methods . As in DP mixture (DPM) models 
, there are two paradigms in the MCMC algorithms for NRMM models: (1) marginal samplers and (2) slice samplers. The marginal samplers simulate the posterior distributions of partitions and cluster parameters given data (or just partitions given data provided that conjugate priors are assumed) by marginalizing out the random measures. The marginal samplers include the Gibbs sampler, and the split-merge sampler , although it was not formally extended to NRMM models. The slice sampler  maintains random measures and explicitly samples the weights and atoms of the random measures. The term ”slice” comes from the auxiliary slice variables used to control the number of atoms to be used. The slice sampler is known to mix faster than the marginal Gibbs sampler when applied to complicated DNRM mixture models where the evaluation of marginal distribution is costly .
The main drawback of MCMC methods for NRMM models is their poor scalability, due to the nature of MCMC methods. Moreover, since the marginal Gibbs sampler and slice sampler iteratively sample the cluster assignment variable for a single data point at a time, they easily get stuck in local optima. Split-merge sampler may resolve the local optima problem to some extent, but is still problematic for large-scale datasets since the samples proposed by split or merge procedures are rarely accepted. Recently, a deterministic alternative to MCMC algorithms for NRM (or DNRM) mixture models were proposed , extending Bayesian hierarchical clustering (BHC)  which was developed as a tree-based inference for DP mixture models. The algorithm, referred to as incremental BHC (IBHC)  builds binary trees that reflects the hierarchical cluster structures of datasets by evaluating the approximate marginal likelihood of NRMM models, and is well suited for the incremental inferences for large-scale or streaming datasets. The key idea of IBHC is to consider only exponentially many posterior samples (which are represented as binary trees), instead of drawing indefinite number of samples as in MCMC methods. However, IBHC depends on the heuristics that chooses the best trees after the multiple trials, and thus is not guaranteed to converge to the true posterior distributions.
In this paper, we propose a novel MCMC algorithm that elegantly combines IBHC and MCMC methods for NRMM models. Our algorithm, called the tree-guided MCMC, utilizes the trees built from IBHC to proposes a good quality posterior samples efficiently. The trees contain useful information such as dissimilarities between clusters, so the errors in cluster assignments may be detected and corrected with less efforts. Moreover, designed as a MCMC methods, our algorithm is guaranteed to converge to the true posterior, which was not possible for IBHC. We demonstrate the efficiency and accuracy of our algorithm by comparing it to existing MCMC algorithms.
Throughout this paper we use the following notations. Denote by a set of indices and by a dataset. A partition of is a set of disjoint nonempty subsets of whose union is . Cluster is an entry of , i.e., . Data points in cluster is denoted by for . For the sake of simplicity, we often use to represent a singleton for . In this section, we briefly review NRMM models, existing posterior inference methods such as MCMC and IBHC.
Let be a homogeneous completely random measure (CRM) on measure space with Lévy intensity and base measure , written as . We also assume that,
so that has infinitely many atoms and the total mass is finite: A NRM is then formed by normalizing by its total mass . For each index , we draw the corresponding atoms from NRM, . Since is discrete, the set naturally form a partition of with respect to the assigned atoms. We write the partition as a set of sets whose elements are non-empty and non-overlapping subsets of , and the union of the elements is . We index the elements (clusters) of with the symbol , and denote the unique atom assigned to as . Summarizing the set as , the posterior random measure is written as follows:
() Let be samples drawn from where . With an auxiliary variable , the posterior random measure is written as
Moreover, the marginal distribution is written as
Using (4), the predictive distribution for the novel atom is written as
The most general CRM may be used is the generalized Gamma , with Lévy intensity . In NRMM models, the observed dataset is assusmed to be generated from a likelihood with parameters drawn from NRM. We focus on the conjugate case where is conjugate to , so that the integral is tractable.
The goal of posterior inference for NRMM models is to compute the posterior with the marginal likelihood .
Marginal Gibbs Sampler: marginal Gibbs sampler is basesd on the predictive distribution (6). At each iteration, cluster assignments for each data point is sampled, where may join an existing cluster
with probability proportional to, or create a novel cluster with probability proportional to .
Slice sampler: instead of marginalizing out , slice sampler explicitly sample the atoms and weights of . Since maintaining infinitely many atoms is infeasible, slice variables are introduced for each data point, and atoms with masses larger than a threshold (usually set as ) are kept and remaining atoms are added on the fly as the threshold changes. At each iteration, is assigned to the th atom with probability .
Split-merge sampler: both marginal Gibbs and slice sampler alter a single cluster assignment at a time, so are prone to the local optima. Split-merge sampler, originally developed for DPM, is a marginal sampler that is based on (6). At each iteration, instead of changing individual cluster assignments, split-merge sampler splits or merges clusters to propose a new partition. The split or merged partition is proposed by a procedure called the restricted Gibbs sampling, which is Gibbs sampling restricted to the clusters to split or merge. The proposed partitions are accepted or rejected according to Metropolis-Hastings schemes. Split-merge samplers are reported to mix better than marginal Gibbs sampler.
Bayesian hierarchical clustering (BHC, ) is a probabilistic model-based agglomerative clustering, where the marginal likelihood of DPM is evaluated to measure the dissimilarity between nodes. Like the traditional agglomerative clustering algorithms, BHC repeatedly merges the pair of nodes with the smallest dissimilarities, and builds binary trees embedding the hierarchical cluster structure of datasets. BHC defines the generative probability of binary trees which is maximized during the construction of the tree, and the generative probability provides a lower bound on the marginal likelihood of DPM. For this reason, BHC is considered to be a posterior inference algorithm for DPM. Incremental BHC (IBHC, ) is an extension of BHC to (dependent) NRMM models. Like BHC is a deterministic posterior inference algorithm for DPM, IBHC serves as a deterministic posterior inference algorithms for NRMM models. Unlike the original BHC that greedily builds trees, IBHC sequentially insert data points into trees, yielding scalable algorithm that is well suited for online inference. We first explain the generative model of trees, and then explain the sequential algorithm of IBHC.
IBHC aims to maximize the joint probability of the data and the auxiliary variable :
Let be a binary tree whose leaf nodes consist of the indices in . Let and denote the left and right child of the set in tree, and thus the corresponding trees are denoted by and . The generative probability of trees is described with the potential function , which is the unnormalized reformulation of the original definition . The potential function of the data given the tree is recursively defined as follows:
Here, is the hypothesis that was generated from a single cluster. The first therm is proportional to the probability that is true, and came from the term inside the product of (7). The second term is proportional to the probability that was generated from more than two clusters embedded in the subtrees and
. The posterior probability ofis then computed as
is defined to be the dissimilarity between and . In the greedy construction, the pair of nodes with smallest are merged at each iteration. When the minimum dissimilarity exceeds one (), is concluded to be false and the construction stops. This is an important mechanism of BHC (and IBHC) that naturally selects the proper number of clusters. In the perspective of the posterior inference, this stopping corresponds to selecting the MAP partition that maximizes . If the tree is built and the potential function is computed for the entire dataset , a lower bound on the joint likelihood (7) is obtained [15, 14]:
Now we explain the sequential tree construction of IBHC. IBHC constructs a tree in an incremental manner by inserting a new data point into an appropriate position of the existing tree, without computing dissimilarities between every pair of nodes. The procedure, which comprises three steps, is elucidated in Fig. 1.
Step 1 (left): Given , suppose that trees are built by IBHC, yielding to a partition . When a new data point arrives, this step assigns to a tree , which has the smallest distance, i.e., , or create a new tree if .
Step 2 (middle): Suppose that the tree chosen in Step 1 is . Then Step 2 determines an appropriate position of when it is inserted into the tree , and this is done by the procedure . chooses the position of among three cases (Fig. 1). Case 1 elucidates an option where is placed on the top of the tree . Case 2 and 3 show options where is added as a sibling of the subtree or , respectively. Among these three cases, the one with the highest potential function is selected, which can easily be done by comparing , and . If is the smallest, then Case 1 is selected and the insertion terminates. Otherwise, if is the smallest, is inserted into and SeqInsert is recursively executed. The same procedure is applied to the case where is smallest.
Step 3 (right): After Step 1 and 2 are applied, the potential functions of should be computed again, starting from the subtree of to which is inserted, to the root . During this procedure, updated values may exceed 1. In such a case, we split the tree at the level where , and re-insert all the split nodes.
After having inserted all the data points in , the auxiliary variable
and hyperparameters forare resampled, and the tree is reconstructed. This procedure is repeated several times and the trees with the highest potential functions are chosen as an output.
IBHC should reconstruct trees from the ground whenever and hyperparameters are resampled, and this is obviously time consuming, and more importantly, converge is not guaranteed. Instead of completely reconstructing trees, we propose to refine the parts of existing trees with MCMC. Our algorithm, called tree-guided MCMC (tgMCMC), is a combination of deterministic tree-based inference and MCMC, where the trees constructed via IBHC guides MCMC to propose good-quality samples. tgMCMC initialize a chain with a single run of IBHC. Given a current partition and trees , tgMCMC proposes a novel partition by global and local moves. Global moves split or merges clusters to propose , and local moves alters cluster assignments of individual data points via Gibbs sampling. We first explain the two key operations used to modify tree structures, and then explain global and local moves. More details on the algorithm can be found in the supplementary material.
SampleSub: given a tree , draw a subtree with probability . is added for leaf nodes whose , and set to the maximum among all subtrees of . The drawn subtree is likely to contain errors to be corrected by splitting. The probability of drawing is multiplied to , where is usually set to transition probabilities.
StocInsert: a stochastic version of IBHC. may be inserted to via with probability , or may just be put into (create a new cluster in ) with probability . If is inserted via SeqInsert, the potential functions are updated accordingly, but the trees are not split even if the update dissimilarities exceed 1. As in SampleSub, the probability is multiplied to .
The global moves of tgMCMC are tree-guided analogy to split-merge sampling. In split-merge sampling, a pair of data points are randomly selected, and split partition is proposed if they belong to the same cluster, or merged partition is proposed otherwise. Instead, tgMCMC finds the clusters that are highly likely to be split or merged using the dissimilarities between trees, which goes as follows in detail. First, we randomly pick a tree in uniform. Then, we compute for , and put in a set with probability (the probability of merging and ). The transition probability up to this step is . The set contains candidate clusters to merge with . If is empty, which means that there are no candidates to merge with , we propose by splitting . Otherwise, we propose by merging and clusters in .
Split case: we start splitting by drawing a subtree by 111Here, we restrict SampleSub to sample non-leaf nodes, since leaf nodes cannot be split.. Then we split to , destroy all the parents of and collect the split trees into a set (Fig. 2, top). Then we reconstruct the tree by for all . After the reconstruction, has at least two clusters since we split before insertion. The split partition to propose is . The reverse transition probability is computed as follows. To obtain from , we must merge the clusters in to . For this, we should pick a cluster , and put other clusters in into . Since we can pick any at first, the reverse transition probability is computed as a sum of all those possibilities:
Merge case: suppose that we have 222We assume that clusters are given their own indices (such as hash values) so that they can be ordered.. The merged partition to propose is given as , where . We construct the corresponding binary tree as a cascading tree, where we put on top of in order (Fig. 2, bottom). To compute the reverse transition probability , we should compute the probability of splitting back into . For this, we should first choose and put nothing into the set to provoke splitting. up to this step is . Then, we should sample the parent of (the subtree connecting and ) via , and this would result in and . Finally, we insert into via for , where we select each to create a new cluster in . Corresponding update to by StocInsert is,
Once we’ve proposed and computed both and , is accepted with probability where .
Ergodicity of the global moves: to show that the global moves are ergodic, it is enough to show that we can move an arbitrary point from its current cluster to any other cluster in finite step. This can easily be done by a single split and merge moves, so the global moves are ergodic.
Time complexity of the global moves: the time complexity of is , where is a height of the tree to insert . The total time complexity of split proposal is mainly determined by the time to execute . This procedure is usually efficient, especially when the trees are well balanced. The time complexity to propose merged partition is .
In local moves, we resample cluster assignments of individual data points via Gibbs sampling. If a leaf node is moved from to , we detach from and run 333We do not split even if the update dissimilarity exceed one, as in StocInsert.. Here, instead of running Gibbs sampling for all data points, we run Gibbs sampling for a subset of data points , which is formed as follows. For each , we draw a subtree by SampleSub. Then, we draw a subtree of again by SampleSub. We repeat this subsampling for times, and put the leaf nodes of the final subtree into . Smaller would result in more data points to resample, so we can control the tradeoff between iteration time and mixing rates.
Cycling: at each iteration of tgMCMC, we cycle the global moves and local moves, as in split-merge sampling. We first run the global moves for times, and run a single sweep of local moves. Setting and were the moderate choice for all data we’ve tested.
In this section, we compare marginal Gibbs sampler (Gibbs), split-merge sampler (SM) and tgMCMC on synthetic and real datasets.
We first compared the samplers on simple toy dataset that has 1,300 two-dimensional points with 13 clusters, sampled from the mixture of Gaussians with predefined means and covariances. Since the partition found by IBHC is almost perfect for this simple data, instead of initializing with IBHC, we initialized the binary tree (and partition) as follows. As in IBHC, we sequentially inserted data points into existing trees with a random order. However, instead of inserting them via SeqInsert, we just put data points on top of existing trees, so that no splitting would occur. tgMCMC was initialized with the tree constructed from this procedure, and Gibbs and SM were initialized with corresponding partition. We assumed the Gaussian-likelihood and Gaussian-Wishart base measure,
where , , is the dimensionality, is the sample mean and ( is the sample covariance). We compared the samplers using both DP and NGGP priors. For tgMCMC, we fixed the number of global moves and the parameter for local moves , except for the cases where we controlled them explicitly. All the samplers were run for 10 seconds, and repeated 10 times. We compared the joint log-likelihood of samples and the effective sample size (ESS) of the number of clusters found. For SM and tgMCMC, we compared the average log value of the acceptance ratio . The results are summarized in Fig. 3. As shown in the log-likelihood trace plot, tgMCMC quickly converged to the ground truth solution for both DP and NGGP cases. Also, tgMCMC mixed better than other two samplers in terms of ESS. Comparing the average values of SM and tgMCMC, we can see that the partitions proposed by tgMCMC is more often accepted. We also controlled the parameter and ; as expected, higher resulted in faster convergence. However, smaller (more data points involved in local moves) did not necessarily mean faster convergence.
We also compared the three samplers on larger dataset containing 10,000 points, which we will call as 10K dataset, generated from six-dimensional mixture of Gaussians with labels drawn from . We used the same base measure and initialization with those of the toy datasets, and used the NGGP prior, We ran the samplers for 1,000 seconds and repeated 10 times. Gibbs and SM were too slow, so the number of samples produced in 1,000 seconds were too small. Hence, we also compared Gibbs_sub and SM_sub, where we uniformly sampled the subset of data points and ran Gibbs sweep only for those sampled points. We controlled the subset size to make their running time similar to that of tgMCMC. The results are summarized in Fig. 4. Again, tgMCMC outperformed other samplers both in terms of the log-likelihoods and ESS. Interestingly, SM was even worse than Gibbs, since most of the samples proposed by split or merge proposal were rejected. Gibbs_sub and SM_sub were better than Gibbs and SM, but still failed to reach the best state found by tgMCMC.
We also compared the samplers on NIPS corpus444https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, containing 1,500 documents with 12,419 words. We used the multinomial likelihood and symmetric Dirichlet base measure , used NGGP prior, and initialized the samplers with normal IBHC. As for the 10K dataset, we compared Gibbs_sub and SM_sub along. We ran the samplers for 10,000 seconds and repeated 10 times. The results are summarized in Fig. 5. tgMCMC outperformed other samplers in terms of the log-likelihood; all the other samplers were trapped in local optima and failed to reach the states found by tgMCMC. However, ESS for tgMCMC were the lowest, meaning the poor mixing rates. We still argue that tgMCMC is a better option for this dataset, since we think that finding the better log-likelihood states is more important than mixing rates.
In this paper we have presented a novel inference algorithm for NRMM models. Our sampler, called tgMCMC, utilized the binary trees constructed by IBHC to propose good quality samples. tgMCMC explored the space of partitions via global and local moves which were guided by the potential functions of trees. tgMCMC was demonstrated to be outperform existing samplers in both synthetic and real world datasets.
: This work was supported by the IT R&D Program of MSIP/IITP (B0101-15-0307, Machine Learning Center), National Research Foundation (NRF) of Korea (NRF-2013R1A2A2A01067464), and IITP-MSRA Creative ICT/SW Research Project.
We first explain two key operations used for tgMCMC.
SampleSub: given a tree , SampleSub aims to draw a subtree of . We want to have high dissimilarity between its subtrees, so we draw with probability proportional to , where is the maximum dissimilarity between subtrees. is added for leaf nodes, whose dissimilarity between subtrees are defined to be zero. See Figure 6.
As depicted in Figure 6, the probability of drawing the subtree is computed as . If we draw as a result, corresponding probability is multiplied into ; . This procedure is needed to compute the forward transition probability. If SampleSub takes another argument as , we don’t do the actual sampling but compute the probability of drawing subtree from and multiply the probability to . In above example, does . This operation is needed to compute the reverse transition probability.
StocInsert: given a set of clusters (and corresponding trees), StocInsert inserts into one of the trees , or just put in without insertion. If is inserted into , the insertion is done by . See Figure 7 for example. The set contains three clusters, and may be inserted into via with probability , or may just be put into without insertion with probability . In the original IBHC algorithm, is inserted into the tree with smallest , or put into without insertion if the minimum exceeds one. StocInsert is a probabilistic operation of this procedure, where the probability is proportional to . As emphasized in the paper, the trees are not split after the update of the dissimilarities of
inserted trees; this is to ensure reversibility and make the computation of transition probability easier. Even if some dissimilarity exceeds one after the insertion, we expect StocInsert operation to find those nodes needed to be split. After the insertion, corresponding probability is multiplied to , as in StocInsert operation. If StocInsert takes another arguments as , we don’t do insertion but compute the probability of inserting into and multiply it to . In above example, multiplies . multiplies the probability of creating a new tree with , .
In this section, we explain global moves for our tgMCMC.
In global moves, we randomly choose between splitting and merging operations. The basic principle is this; pick a tree, and find candidate trees that can be merged with the picked tree. If there is any candidate, invoke merge operation, and invoke split operation otherwise.
As an example, suppose that our current partition is , with 6 clusters. We first uniformly pick a cluster among them, and initialize the forward transition probability . Suppose that is selected. Then, for , we put into a set with probability ; note that this probability is equal to , the probability of merging and . The splitting is invoked if contains no cluster, and merging is invoked otherwise.
Suppose that contains no cluster. The forward transition probability up to this is
Now suppose that looks like in Figure 8. The split proposal is then proposed according to the following procedure.
First, a subtree of is sampled via SampleSub procedure. Then, the tree is cut at , collecting and remaining split nodes . Nodes in are inserted into via StocInsert. Since were initialized to be split in , the resulting partition must have more than two clusters. During SampleSub and StocInsert, every intermediate transition probabilities are multiplied into . The final partition to propose is .
Suppose that contains three clusters; . The forward transition probability up to this is
In partition space, there is only one way to merge and into a single cluster , so no transition probability is multiplied. However, there can be many trees representing , since we can merge the nodes in any order or topology, in terms of tree. Hence, we fix the merged tree to be a cascaded tree, merged in order of indices of nodes; see Figure 9. As we wrote in our paper, we assume that each nodes are given unique indices (such as hash values) so that they can be sorted in order. In this example, the indices for and are 1 and 2, respectively. Then, we build a cascaded tree, merged in order . The resulting merged partition is then .
Now we explain how to compute the reverse transition probability for splitting case. We start from the illustrative partition in subsection B.2. Starting from , we must merge and back into to retrieve . For this, there are two possibilities: we first pick and collect , or pick and collect . Hence, the reverse transition probability is computed as
As we explained in subsection B.3, no more reverse transition probabilities are needed since there is only one way to merge nodes back into in partition space.
We explain how to compute the reverse transition probability for merging case, starting from the illustrative partition in subsection B.3, . See Figure 9. To retrieve , we should split into . For this, we should first invoke split operation for ; pick and collect . The reverse transition probability up to this is
Now, we should split to retrieve the original partition. The achieve this, we should first pick direct parent of and ( in Figure 9) via SampleSub. The reverse transition probability for this is computed by , and we have and . To get the original partition, and should be put into without insertion, and the reverse transition probability for this is computed by and .
Local moves resample cluster assignments of individual data points (leaf nodes) with Gibbs sampling. However, instead of running Gibbs sampling for full data, we select a random subset and run Gibbs sampling for data points in . is constructed as follows. Given a partition , for each , we sample its subtree via SampleSub. Then we sample a subtree of drawn subtree again with SampleSub. This procedure is repeated for times, where is a parameter set by users. Figure 10 depicts the subsampling procedure for with .
As a result of Figure 10, the leaf nodes are added to . After having collected for all , we run Gibbs sampling for those leaf nodes in , and if a leaf node should be moved from to another , we insert into with . We can control the tradeoff between mixing rate and speed by controlling ; higher means more elements in , and thus more data points are moved by Gibbs sampling.
A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model.Journal of Computational and Graphical Statistics, 13:158–182, 2000.
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, 2014.