1 Introduction
Table 1: Comparison with existing partially collapsed and distributed BNP samplers.

Methods | Data size | Exact | Parallel | Distributed | beta-Bernoulli process | Other CRMs | DP | HDP | Pitman-Yor process
Smyth et al. (2009) | M | ✕ | ✓ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕
Doshi-Velez and Ghahramani (2009) | M | ✓ | ✕ | ✕ | ✓ | ✕ | ✕ | ✕ | ✕
Doshi-Velez et al. (2009b) | M | ✕ | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕
Williamson et al. (2013) | M | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕
Chang and Fisher III (2013) | M | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕
Dubey et al. (2014) | M | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✓
Ge et al. (2015) | M | ✓ | ✓ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕
Yerebakan and Dundar (2017) | M | ✓ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ | ✕
This paper | M | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Bayesian nonparametric (BNP) models are a flexible class of models whose complexity adapts to the data under consideration. BNP models place priors on infinite-dimensional objects, such as partitions with infinitely many blocks; matrices with infinitely many columns; or discrete measures with infinitely many atoms. A finite set of observations is assumed to be generated from a finite—but random—subset of these components, allowing flexibility in the underlying dimensionality and providing the ability to incorporate previously unseen properties as our dataset grows.
While the flexibility of these models is a good fit for large, complex data sets, distributing existing inference algorithms across multiple machines is challenging. If we explicitly represent subsets of the underlying infinite-dimensional object – for example, using a slice sampler – we can face high memory requirements and slow convergence. Conversely, if we integrate out the infinite-dimensional object, we run into problems due to the global dependencies this induces.
Moreover, a key goal of distributed algorithms is to minimize communication between agents. This can be achieved by breaking the algorithm into sub-algorithms that can be run independently on different agents. In practice, we usually cannot split an MCMC sampler on a Bayesian hierarchical model into entirely independent sub-algorithms, since there are typically some global dependencies implied by the hierarchy. Instead, we make use of conditional independencies to temporarily partition our algorithm.
Contributions: In this paper, we propose a distributable sampler for completely random measures, which unifies exact parallel inference for a wide class of Bayesian nonparametric priors, including the widely used Dirichlet process (Ferguson, 1973) and the beta-Bernoulli process (Griffiths and Ghahramani, 2011a). After introducing appropriate background material, we first introduce general recipes for (non-distributed) partially collapsed samplers appropriate for a wide range of BNP models, focusing on the beta-Bernoulli process and the Dirichlet process as exemplars. We then demonstrate that these methods can be easily extended to a distributed setting. Next, we provide experimental results for our hybrid and distributed samplers on DP and beta-Bernoulli inference. Lastly, we conclude the paper with a discussion of future directions for our idea.
2 Background and Related Work
Completely random measures (CRMs, Kingman, 1967) are random measures that assign independent masses to disjoint subsets of a space. For example, the gamma process assigns a gamma-distributed mass to each subset. Other examples include the beta process (Hjort, 1990) and the Poisson process. The distribution of a CRM is completely determined by its Lévy measure, which controls the size and location of atoms. Many nonparametric distributions can be expressed in terms of CRMs. For example, if we sample B = Σ_k π_k δ_{θ_k} from a (homogeneous) beta process, and generate a sequence of subsets Z_n = {θ_k : z_nk = 1} where z_nk = 1 w.p. π_k, then we obtain an exchangeable distribution over sequences of subsets known as the beta-Bernoulli process (Thibaux and Jordan, 2007), which is related to the Indian buffet process (IBP, Ghahramani and Griffiths, 2005). If we sample G̃ from a gamma process on a space Θ with base measure αH, then G = G̃ / G̃(Θ) is distributed according to a Dirichlet process with concentration parameter α and base measure H.
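For intuition, the exchangeable marginal of the beta-Bernoulli process, the Indian buffet process, is straightforward to simulate. The following is an illustrative sketch, not code from the paper; the function name is ours, and the Poisson draw uses Knuth's method:

```python
import math, random

def sample_ibp(n_customers, alpha, seed=0):
    """Draw a binary feature matrix from the Indian buffet process,
    the exchangeable marginal obtained by integrating out the beta process."""
    rng = random.Random(seed)
    dish_counts = []                     # m_k: customers who took dish k so far
    rows = []
    for n in range(1, n_customers + 1):
        row = []
        # existing dishes: customer n takes dish k with probability m_k / n
        for k in range(len(dish_counts)):
            take = 1 if rng.random() < dish_counts[k] / n else 0
            dish_counts[k] += take
            row.append(take)
        # brand-new dishes: Poisson(alpha / n) many (Knuth's method)
        new, p, L = 0, 1.0, math.exp(-alpha / n)
        while True:
            p *= rng.random()
            if p <= L:
                break
            new += 1
        row.extend([1] * new)
        dish_counts.extend([1] * new)
        rows.append(row)
    K = len(dish_counts)
    return [r + [0] * (K - len(r)) for r in rows]   # pad to an N x K matrix
```

Each customer samples previously seen dishes in proportion to their popularity and then opens a Poisson-distributed number of new ones, mirroring the tail behavior that the samplers below exploit.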
Inference in such models tends to fall into three categories: uncollapsed samplers that alternate between sampling the latent measure and the assignments (Ishwaran and Zarepour, 2002; Paisley and Carin, 2009; Zhou et al., 2009; Walker, 2007; Teh et al., 2007); collapsed samplers where the latent measure is integrated out (Ishwaran and James, 2011; Neal, 2000; Ghahramani and Griffiths, 2005); and optimization-based methods that work with approximating distributions where the parameters are assumed to have a mean-field distribution (Blei et al., 2006; Doshi-Velez et al., 2009a).
Collapsed methods tend to mix slowly due to the dependency between assignments, while blocked updates mean uncollapsed methods typically have good mixing properties at convergence (Ishwaran and James, 2011). Uncollapsed methods are often slow to incorporate new components, since they typically rely on sampling unoccupied components from the prior. In high dimensions, such components are unlikely to be close to the data. Conversely, collapsed methods can make use of the data when introducing new points, which tends to lead to faster convergence (Neal, 2000).
Several works have attempted to combine collapsed and uncollapsed methods, although such approaches have been restricted to specific models. Doshi-Velez and Ghahramani (2009) introduced a linear-time "accelerated" Gibbs sampler for conjugate IBPs that effectively marginalizes over the latent factors, while more recently Yerebakan and Dundar (2017) developed a sampler for DPs by partially marginalizing the latent random measure. These methods can be seen as special cases of our hybrid framework (Section 3), but do not generalize to the distributed setting.
Several inference algorithms allow computation to be distributed across multiple machines—although again, such algorithms are specific to a single model. The approximate uncollapsed algorithm of Smyth et al. (2009) is only developed for Dirichlet process-based models, and lacks asymptotic convergence guarantees. Distributed split-merge methods have been developed for Dirichlet process-based models, but not extended to more general nonparametric models (Chang and Fisher III, 2013, 2014). Partition-based algorithms based on properties of CRMs have been developed for Dirichlet process and Pitman-Yor process-based models (Williamson et al., 2013; Dubey et al., 2014), but it is unclear how to extend them to other model families. A low-communication, distributed-memory slice sampler has been developed for the Dirichlet process, but since it is based on an uncollapsed method it will tend to perform poorly in high dimensions (Ge et al., 2015). Doshi-Velez et al. (2009b) developed an approximate distributed inference algorithm for the Indian buffet process which is superficially similar to our distributed beta-Bernoulli sampler. However, their approach allows all processors to add new features, which will lead to overestimating the number of features. We contrast our proposed approach with the partially collapsed and distributed methods described above in Table 1.
3 Hybrid Inference for CRM-based models
By definition, completely random measures can be decomposed into independent random measures. If the CRM has been transformed in some manner, we can often still decompose the resulting random measure into independent random measures – for example, a normalized random measure can be decomposed as a mixture of normalized random measures. Such representations allow us to decompose our inference algorithms, and use different inference techniques on the constituent measures.
As discussed in Section 2, collapsed and uncollapsed methods both have advantages and disadvantages. Loosely, collapsed methods are good at adding new components and exploring the tail of the distribution, while uncollapsed methods offer better mixing in established clusters and easy parallelization. We make use of the decomposition properties of CRMs to partition our model into two components: one containing the (finitely many) components currently associated with multiple observations, and one containing the infinite tail of components.
3.1 Models constructed directly from CRMs
Consider a generic hierarchical model,

(1)  B = Σ_{k=1}^∞ π_k δ_{θ_k} ~ CRM(ν),    z_nk | B ~ h(g(π_k)),

where each observation x_n is generated from its features z_n, the feature parameters θ, and a noise term ε_n. Here g is some deterministic transformation of the atom weights such that h(g(π)) is a well-defined distribution for all π; ε_n is a noise term; and h is a likelihood that forms a conjugate pair with the CRM, i.e. the posterior distribution is a CRM in the same family. The index n refers to the observations and the index k refers to the features. This framework includes exchangeable feature allocation models such as the beta-Bernoulli process (Ghahramani and Griffiths, 2005; Thibaux and Jordan, 2007), the infinite gamma-Poisson feature model (Titsias, 2008), and the beta negative binomial process (Zhou et al., 2012; Broderick et al., 2014). We assume, as is the case in these examples, that both collapsed and uncollapsed posterior inference algorithms can be described. We also assume for simplicity that the prior contains no fixed-location atoms, although this assumption could be relaxed (see Broderick et al., 2018).
Lemma 1 (Broderick et al. 2018).
If B ~ CRM(ν) and z_nk | B ~ h(g(π_k)), and if h and ν form a conjugate pair, then the posterior B | Z can be decomposed into two CRMs, each with known distribution. The first is a finite measure with fixed-location atoms at the locations θ_k for which Σ_n z_nk > 0; the distribution over the corresponding weights π_k is proportional to ν(dπ_k) Π_{n=1}^N h(z_nk | g(π_k)). The second, with infinitely many random-location atoms, has Lévy measure h(0 | g(π))^N ν(dπ).
Based on Lemma 1, we partition the posterior into a finite CRM B_K, for some size K, that contains all, or a subset of, the fixed-location atoms; and an infinite CRM B_tail that contains the remaining atoms. We use an uncollapsed sampler to sample B_K, and then sample the corresponding allocations. Then, we use a collapsed sampler to sample the allocations associated with B_tail. The size K should be changed periodically to avoid explicitly instantiating atoms that are not associated with data. In our experiments, we set K to the number of currently instantiated features.
Example 1: Beta-Bernoulli Process. As a specific example, consider the beta-Bernoulli process. Let B ~ BP(c, B_0) be a homogeneous beta process (Thibaux and Jordan, 2007), and let Z_n | B ~ BeP(B) for n = 1, …, N. The posterior is given by

B | Z_{1:N} ~ BP(c + N, (c B_0 + Σ_k m_k δ_{θ_k}) / (c + N)),

where m_k = Σ_{n=1}^N z_nk.
In this case, the following lemma helps us decompose the posterior distribution:
Lemma 2 (Thibaux and Jordan 2007).
If K is the number of features for which m_k > 0, where m_k = Σ_{n=1}^N z_nk, we can decompose the posterior distribution of the beta-Bernoulli process as B | Z_{1:N} = B_K + B_tail, where

(2)  B_K = Σ_{k=1}^K π_k δ_{θ_k} with π_k ~ Beta(m_k, c + N − m_k),    B_tail ~ BP(c + N, c B_0 / (c + N)).
We note that the atom sizes of the finite component are random variables. This allows us to split the beta-Bernoulli process into two independent feature selection mechanisms: one with a finite number of currently instantiated features, and one with an unbounded number of features.
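The finite component of this decomposition can be instantiated directly from the allocation matrix: each instantiated feature's weight has a Beta posterior. A minimal sketch (the function name is ours; c = 1 recovers the standard one-parameter IBP setting):

```python
import random

def sample_instantiated_weights(Z, c=1.0, seed=0):
    """Sample the fixed-location atom weights of the posterior beta process:
    pi_k | Z ~ Beta(m_k, c + N - m_k) for each feature with m_k > 0."""
    rng = random.Random(seed)
    N = len(Z)
    K = len(Z[0]) if Z else 0
    weights = []
    for k in range(K):
        m_k = sum(row[k] for row in Z)   # number of observations using feature k
        if m_k > 0:                      # only currently instantiated features
            weights.append(rng.betavariate(m_k, c + N - m_k))
    return weights
```

These instantiated weights are exactly what the uncollapsed part of the hybrid sampler conditions on, while the tail remains collapsed.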
This gives us the hybrid sampler, which we summarize in Algorithm 1. For each data point n, we sample z_n in a three-step manner. If k ≤ K,

(3)  P(z_nk = 1 | π_k, X) ∝ π_k f(X | z_nk = 1, Z_¬nk),    P(z_nk = 0 | π_k, X) ∝ (1 − π_k) f(X | z_nk = 0, Z_¬nk).

If k > K and m_¬n,k > 0, where m_¬n,k is the number of other observations using feature k,

(4)  P(z_nk = 1 | Z_¬nk, X) ∝ (m_¬n,k / N) f(X | z_nk = 1, Z_¬nk),    P(z_nk = 0 | Z_¬nk, X) ∝ (1 − m_¬n,k / N) f(X | z_nk = 0, Z_¬nk).
If the likelihood is a normal distribution with conjugate feature parameters, we can marginalize over the uninstantiated features as described in Ghahramani and Griffiths (2005). If this is not possible, we can instantiate them and include them in the appropriate likelihood term. Finally, we propose adding new features using a Metropolis-Hastings step. Once we have sampled the z_n, for every instantiated feature k we sample π_k and its corresponding parameters θ_k.

3.2 Models based on transformations of random measures
While applying transformations to CRMs means the posterior is no longer directly decomposable, in some cases we can still apply the above general ideas.
Example 2: Dirichlet Process. As noted in Section 2, the Dirichlet process with concentration parameter α and base measure H can be constructed by normalizing a gamma process with base measure αH. If the Dirichlet process is used as the prior in a mixture model (DPMM), the posterior distribution conditioned on the cluster allocations c_1, …, c_N, having K unique clusters, is again a Dirichlet process:

(5)  D | c_1, …, c_N ~ DP(α + N, (αH + Σ_{k=1}^K n_k δ_{θ_k}) / (α + N)),

where θ_k is the parameter of the kth cluster and n_k is the number of observations with c_i = k. In this case also, the following lemma helps us decompose the posterior.
Lemma 3.
Assuming D is distributed according to the posterior Dirichlet process in Equation 5, we can decompose the posterior of the DP as

D = b D_K + (1 − b) D_tail,

where

b ~ Beta(N, α),    D_K = Σ_{k=1}^K π_k δ_{θ_k} with (π_1, …, π_K) ~ Dirichlet(n_1, …, n_K),    D_tail ~ DP(α, H).
Proof.
We note that the posterior atom weights for the finite component are distributed according to Dirichlet(n_1, …, n_K), and can easily be sampled as part of an uncollapsed sampler. Conditioned on b and D_K, we can sample the cluster allocation c_i of point x_i as
(6)  P(c_i = k | …) ∝ b π_k f(x_i | θ_k) for k ≤ K; (1 − b) (n_¬i,k / (n_tail + α)) f(x_i | {x_j : c_j = k, j ≠ i}) for occupied tail clusters k > K; and (1 − b) (α / (n_tail + α)) ∫ f(x_i | θ) H(dθ) for a new cluster,

where f(x_i | θ_k) is the likelihood for each mixing component; f(x_i | {x_j : c_j = k, j ≠ i}) is the conditional probability of x_i given the other members of the kth cluster; and n_tail is the number of observations currently assigned to tail clusters. This procedure is summarized in Algorithm 2.

Example 3: Pitman-Yor processes. The Pitman-Yor process (Perman et al., 1992; Pitman and Yor, 1997) is a distribution over probability measures, parameterized by a discount parameter d and a concentration parameter α, that is obtained from a d-stable CRM via a change of measure and normalization. Provided d > 0, it can also be represented as a Dirichlet process mixture of normalized stable CRMs (Lemma 22, Pitman and Yor, 1997). This representation allows us to decompose the posterior distribution into a beta mixture of a finite-dimensional Pitman-Yor process and an infinite-dimensional Pitman-Yor process. We provide more details in the supplementary material.
Example 4: Hierarchical Dirichlet processes. We can decompose the hierarchical Dirichlet process (HDP, Teh et al., 2006) in a manner comparable to the Dirichlet process, allowing our hybrid sampler to be used on the HDP. For space reasons, we defer discussion to the supplement.
4 Distributed Inference for CRM-based models
The sampling algorithms in Section 3 can easily be adapted to a distributed setting, where data are partitioned across several machines and communication is limited. In this setting, we set K to the number of currently instantiated features after every communication step. We instantiate the finite measure (B_K in the case of CRMs, D_K for DPMMs), with globally shared atom sizes and locations, on all processors.
We then randomly select one out of the P processors by sampling j* uniformly. On the remaining processors, we sample the allocations using restricted Gibbs sampling (Neal, 2000). In the CRM setting, we enforce z_nk = 0 for k > K. In the DPMM setting, we only allow observations to belong to the clusters in D_K. Since the associated distributions only depend on the instantiated measure B_K (or D_K), we do not need knowledge of the feature allocations on other processors.
On the remaining processor j*, we sample the allocations using unrestricted Gibbs. If we are working with a CRM-based model, we sample z_nk conditioned on B_K for k ≤ K, and sample z_nk conditioned on B_tail for k > K. If we are working with a DPMM, we sample c_i following Equation 6. Note that the only data points that can be associated with atoms in B_tail / D_tail are those on processor j*, so we can evaluate the associated probabilities without any knowledge about the other processors.
At each global step, we gather the sufficient statistics from all instantiated clusters – from both the finite component B_K / D_K and the infinite component B_tail / D_tail – and sample parameters for those clusters. We then create a new partition, redefining K as the current number of instantiated component parameters. In the case of the DPMM, we also resample b. We summarize the distributed algorithm for the special cases of the beta-Bernoulli process and the DPMM in Algorithms 3 and 4, and for the PYMM in Algorithm 6 in the supplementary material.
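The communication pattern, one unrestricted processor and restricted Gibbs on the rest, can be sketched as follows. The per-point updates here are placeholder uniform draws rather than true Gibbs conditionals; only the restriction logic is illustrated, and all names are ours:

```python
import random

def distributed_allocation_step(data_by_proc, inst_clusters, assign, seed=0):
    """One communication round (toy sketch): a randomly chosen processor may
    open new clusters; all others must reuse globally instantiated ones."""
    rng = random.Random(seed)
    P = len(data_by_proc)
    proposer = rng.randrange(P)              # j* chosen uniformly at random
    next_id = max(inst_clusters) + 1 if inst_clusters else 0
    local_new = []                           # tail clusters live on j* only
    for p, points in enumerate(data_by_proc):
        for x in points:
            if p == proposer:
                # unrestricted: instantiated, local tail, or a brand-new cluster
                choices = sorted(inst_clusters) + local_new + [next_id]
            else:
                # restricted: globally instantiated clusters only
                choices = sorted(inst_clusters)
            c = rng.choice(choices)
            if c == next_id:                 # opened a brand-new cluster
                local_new.append(c)
                next_id += 1
            assign[x] = c
    return proposer, assign
```

Because the restricted processors never touch the tail, their updates need no inter-processor communication until the next global step.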
4.1 Warmstart heuristics
While asymptotically correct, an unfortunate consequence of the samplers proposed above is that they are slow to instantiate new clusters. With only 1/P of the data points eligible to start a new cluster or feature, the rate at which new clusters/features are added will decrease with the number of processors. This can lead to slow convergence if we start with too few clusters. To avoid this problem, we initialize our algorithm by allowing all P processors to instantiate new clusters. At each global step, we decrease the number of randomly selected processors eligible to instantiate new clusters, until we end up with a single processor. This approach tends to overestimate the number of clusters. However, the procedure acts in a manner similar to simulated annealing, encouraging large moves early in the algorithm and gradually decreasing the excess stochasticity until we are sampling from the correct algorithm. We note that a sampler with multiple processors instantiating new clusters is not a correct sampler; the algorithm only becomes exact once we revert to a single processor proposing new features, since a correct MCMC sampler is invariant to its starting position.
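A minimal sketch of such a schedule, assuming the proposer pool is halved after a warm-up fraction of one eighth (the halving rule is our illustrative choice; the text only specifies a gradual decrease down to a single processor):

```python
def warm_start_schedule(P, total_global_steps, warm_frac=0.125):
    """Number of processors allowed to propose new clusters at each global
    step: all P during the warm-up, then geometrically decreasing to one,
    after which the sampler is exact."""
    warm = int(total_global_steps * warm_frac)
    schedule = []
    active = P
    for t in range(total_global_steps):
        if t >= warm and active > 1:
            active = max(1, active // 2)   # shrink the proposer pool
        schedule.append(active)
    return schedule
```

Once the schedule reaches one proposer, subsequent iterations follow the asymptotically correct sampler regardless of the warm-up trajectory.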
5 Experimental evaluation
While our primary contribution is in the development of distributed algorithms, we first consider, in Section 5.1, the performance of the hybrid algorithms developed in Section 3 in a nondistributed setting. We show that this performance extends to the distributed setting, and offers impressive scaling, in Section 5.2.
5.1 Evaluating the hybrid sampler
We begin by considering the performance of the hybrid samplers introduced in Section 3 in a non-distributed setting. For this, we focus on the Dirichlet process, since a number of collapsed and uncollapsed inference algorithms exist for it; we expect similar results under other models.
We compare the hybrid sampler of Algorithm 2 with a standard collapsed Gibbs sampler and an uncollapsed sampler based on Algorithm 8 of Neal (2000). Algorithm 8 collapses occupied clusters and instantiates a subset of unoccupied clusters; we modify this to instantiate the atoms associated with occupied clusters as well. Concretely, at each iteration, we sample weights for the K instantiated clusters and m uninstantiated clusters as (π_1, …, π_{K+m}) ~ Dirichlet(n_1, …, n_K, α/m, …, α/m), and sample locations for the uninstantiated clusters from the base measure H. We note that this method can be distributed relatively easily.
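This weight-instantiation step can be sketched via the standard Gamma-normalization construction of the Dirichlet distribution (function and argument names are ours):

```python
import random

def sample_mixture_weights(counts, alpha, m, seed=0):
    """Instantiate weights for K occupied and m auxiliary (empty) clusters:
    (pi_1, ..., pi_{K+m}) ~ Dirichlet(n_1, ..., n_K, alpha/m, ..., alpha/m),
    drawn by normalizing independent Gamma variates."""
    rng = random.Random(seed)
    params = list(counts) + [alpha / m] * m   # Dirichlet concentration vector
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]
```

Splitting the concentration α evenly over the m auxiliary clusters keeps the total mass assigned to unoccupied components equal to α, matching Neal's Algorithm 8 construction.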
Figure 1 shows convergence plots for the three algorithms. The data set is a dimensional synthetic data set consisting of observations of Gaussian mixtures with true mixture components centered at and with an identity covariance matrix.
While the three algorithms perform comparably on lowdimensional data, as the dimension increases the performance of the uncollapsed sampler degrades much more than the collapsed sampler. This is because in high dimensions, it is unlikely that a proposed parameter will be near our data, so the associated likelihood of any given data point will be low. This is in contrast to the collapsed setting, where we integrate over all possible locations. While the hybrid method performs worse in high dimensions than the collapsed method, it outperforms the uncollapsed method.
The synthetic data in Figure 1 has fairly low-dimensional structure, so we do not see negative effects due to the poor mixing of the collapsed sampler. Next, we evaluate the algorithms on the CIFAR-100 dataset (Krizhevsky, 2009). We used PCA to reduce the dimension of the data to between 8 and 64, and plot the test set log likelihood over time in Figure 2. Each marker represents a single iteration. We see that the uncollapsed sampler requires more iterations to converge than the collapsed sampler; however, since each of its iterations takes less time, in some cases the collapsed sampler has the slower wall time to convergence. The hybrid method has iteration times comparable to the collapsed sampler but, in general, converges faster. We see that, even without taking advantage of parallelization, the hybrid method is a compelling competitor to pure-collapsed and pure-uncollapsed algorithms.
5.2 Evaluating the distributed samplers
Here, we show that the distributed inference algorithms introduced in Section 4 allow inference in BNP models to be scaled to large datasets, without sacrificing accuracy. We focus on two cases: the beta-Bernoulli process (Algorithm 3) and the Dirichlet process (Algorithm 4). Our code is available online at https://github.com/michaelzhang01/hybridBNP.
5.2.1 BetaBernoulli process
We evaluate the beta-Bernoulli sampler on synthetic data based on the "Cambridge" dataset, used in the original IBP paper (Griffiths and Ghahramani, 2011b), where each datapoint is the superposition of a randomly selected subset of four binary features of dimension 36, plus Gaussian noise with standard deviation 0.5 (see Figure 7 in the supplement for more details). We model this data using a linear Gaussian likelihood with fixed hyperparameters. We initialized the sampler to a single feature, and ran the hybrid sampler on 1,000 total observations with regular synchronization steps, distributing over 1, 4, 8, 16, 32, 64 and 128 processors.
We first evaluate the hybrid algorithm under a "cold start", where only one processor is allowed to introduce new features for the entire duration of the sampler. In the top left plot of Figure 3, we see that the cold start results in slow convergence of the test set log likelihood for large numbers of processors. We can see in the bottom left plot of Figure 3 that the number of features grows very slowly, as only one processor is allowed to propose new features in the exact setting.
Next, we explore warm-start initialization, as described in Section 4.1. For the first one-eighth of the total number of MCMC iterations, all processors can propose new features; after this we revert to the standard algorithm. The top central plot of Figure 3 shows predictive log likelihood over time, and the bottom central plot shows the number of features. We see that convergence is significantly improved relative to the cold-start experiments. Since we revert to the asymptotically correct sampler, the final number of features is generally close to the true number of features, 4 (note that BNP models are not guaranteed to find the correct number of features in the posterior; see Miller and Harrison, 2013). Additionally, we see that the convergence rate increases monotonically in the number of processors.
Next, we allowed all processors to propose new features for the entire duration ("always-hot"). This setting approximately replicates the behavior of the parallel IBP sampler of Doshi-Velez et al. (2009b). In the top right plot of Figure 3, we can see that all experiments roughly converge to the same test log likelihood. However, the number of features introduced (bottom right plot) is much greater than in the warm start experiment, and grows with the number of processors. Moreover, the difference in convergence rates between processors is not as dramatic as in the warm-start trials.
Next, we demonstrate the scalability of our distributed algorithm on a massive synthetic example, showing it can be used for large-scale latent feature models. We generate one million "Cambridge" synthetic data points as described for the previous experiments, and distribute the data over 256 processors. This represents the largest experiment run for a parallel IBP/beta-Bernoulli process algorithm (the next largest being 200,000 data points, in Doshi-Velez et al. 2009b). We limited the sampler to run for one day, in which it completed 860 MCMC iterations. In Figure 4, the test set log likelihood traceplot shows that we converge to a local mode fairly quickly even in this massive distributed setting.
5.2.2 Dirichlet process
Our distributed inference framework can also speed up inference in a DP mixture of Gaussians, using the version described in Algorithm 4. We used a dataset containing the top 64 principal components of the CIFAR-100 dataset, as described in Section 5.1. We compared against two existing distributed inference algorithms for the Dirichlet process mixture model, chosen to represent models based on both uncollapsed and collapsed samplers: 1) a DP variant of the asynchronous sampler of Smyth et al. (2009), an approximate collapsed method; and 2) the distributed slice sampler of Ge et al. (2015), an uncollapsed method.
Figure 5 shows that, when distributed over eight processors, our algorithm converges faster than the two comparison methods, demonstrating that the high quality performance seen in Section 5.1 extends to the distributed setting. Further, in Figure 6 we see roughly linear speedup in convergence as we increase the number of processors from 1 to 8.
6 Conclusion
We have proposed a general inference framework for a wide variety of BNP models. We use the inherent decomposability of the underlying completely random measures to partition the latent random measures into a finite-dimensional component that represents the majority of the data, and an infinite-dimensional component that represents the mostly uninstantiated tail. This allows us to take advantage of the inherent parallelizability of the uncollapsed sampler on the finite partition, and of the better performance of the collapsed sampler for proposing new components. The proposed hybrid inference method can thus be easily distributed over multiple machines, providing provably correct inference for many BNP models. Experiments show that, for both the DP and the beta-Bernoulli process, our proposed distributed hybrid sampler converges faster than the comparison methods.
References
 Blei et al. (2006) Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.
 Broderick et al. (2014) Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):290–306.
 Broderick et al. (2018) Broderick, T., Wilson, A. C., and Jordan, M. I. (2018). Posteriors, conjugacy, and exponential families for completely random measures. Bernoulli, 24(4B):3181–3221.
 Chang and Fisher III (2013) Chang, J. and Fisher III, J. W. (2013). Parallel sampling of DP mixture models using subcluster splits. In Advances in Neural Information Processing Systems, pages 620–628.
 Chang and Fisher III (2014) Chang, J. and Fisher III, J. W. (2014). Parallel sampling of HDPs using subcluster splits. In Advances in Neural Information Processing Systems, pages 235–243.

 Doshi-Velez and Ghahramani (2009) Doshi-Velez, F. and Ghahramani, Z. (2009). Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 273–280. ACM.
 Doshi-Velez et al. (2009a) Doshi-Velez, F., Miller, K., Van Gael, J., and Teh, Y. W. (2009a). Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144.

 Doshi-Velez et al. (2009b) Doshi-Velez, F., Mohamed, S., Ghahramani, Z., and Knowles, D. A. (2009b). Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems, pages 1294–1302.
 Dubey et al. (2014) Dubey, A., Williamson, S., and Xing, E. P. (2014). Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In Uncertainty in Artificial Intelligence, pages 142–151.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.
 Ge et al. (2015) Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning, pages 2276–2284.
 Ghahramani and Griffiths (2005) Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
 Ghosh and Ramamoorthi (2003) Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer.
 Griffiths and Ghahramani (2011a) Griffiths, T. L. and Ghahramani, Z. (2011a). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.
 Griffiths and Ghahramani (2011b) Griffiths, T. L. and Ghahramani, Z. (2011b). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.
 Hjort (1990) Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294.
 Ishwaran and James (2011) Ishwaran, H. and James, L. F. (2011). Gibbs sampling methods for stickbreaking priors. Journal of the American Statistical Association.
 Ishwaran and Zarepour (2002) Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283.
 Kingman (1967) Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.
 Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
 Miller and Harrison (2013) Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.
 Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
 Paisley and Carin (2009) Paisley, J. and Carin, L. (2009). Nonparametric factor analysis with beta process priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 777–784. ACM.
 Perman et al. (1992) Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

 Pitman et al. (1996) Pitman, J. et al. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory, pages 245–267. Institute of Mathematical Statistics.
 Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.
 Smyth et al. (2009) Smyth, P., Welling, M., and Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, pages 81–88.
 Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stickbreaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, pages 556–563.
 Teh et al. (2006) Teh, Y.W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
 Thibaux and Jordan (2007) Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In International conference on artificial intelligence and statistics, pages 564–571.
 Titsias (2008) Titsias, M. K. (2008). The infinite gammaPoisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.
 Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics—Simulation and Computation, 36(1):45–54.

Weyrauch et al. (2004)
Weyrauch, B., Heisele, B., Huang, J., and Blanz, V. (2004).
Componentbased face recognition with 3D morphable models.
In2004 Conference on Computer Vision and Pattern Recognition Workshop
, pages 85–85. IEEE.  Williamson et al. (2013) Williamson, S., Dubey, A., and Xing, E. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on Machine Learning, pages 98–106.
 Yerebakan and Dundar (2017) Yerebakan, H. Z. and Dundar, M. (2017). Partially collapsed parallel Gibbs sampler for Dirichlet process mixture models. Pattern Recognition Letters, 90:22–27.
 Zhou et al. (2009) Zhou, M., Chen, H., Ren, L., Sapiro, G., Carin, L., and Paisley, J. W. (2009). Nonparametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems, pages 2295–2303.
 Zhou et al. (2012) Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics, pages 1462–1471.
Distributed, partially collapsed MCMC for Bayesian Nonparametrics: Supplementary
References
 Blei et al. (2006) Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.
 Broderick et al. (2014) Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE transactions on pattern analysis and machine intelligence, 37(2):290–306.
 Broderick et al. (2018) Broderick, T., Wilson, A. C., and Jordan, M. I. (2018). Posteriors, conjugacy, and exponential families for completely random measures. Bernoulli, 24(4B):3181–3221.
 Chang and Fisher III (2013) Chang, J. and Fisher III, J. W. (2013). Parallel sampling of DP mixture models using subcluster splits. In Advances in Neural Information Processing Systems, pages 620–628.
 Chang and Fisher III (2014) Chang, J. and Fisher III, J. W. (2014). Parallel sampling of HDPs using subcluster splits. In Advances in Neural Information Processing Systems, pages 235–243.

 Doshi-Velez and Ghahramani (2009) Doshi-Velez, F. and Ghahramani, Z. (2009). Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 273–280. ACM.
 Doshi-Velez et al. (2009a) Doshi-Velez, F., Miller, K., Van Gael, J., and Teh, Y. W. (2009a). Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144.

 Doshi-Velez et al. (2009b) Doshi-Velez, F., Mohamed, S., Ghahramani, Z., and Knowles, D. A. (2009b). Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems, pages 1294–1302.
 Dubey et al. (2014) Dubey, A., Williamson, S., and Xing, E. P. (2014). Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In Uncertainty in Artificial Intelligence, pages 142–151.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230.
 Ge et al. (2015) Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning, pages 2276–2284.
 Ghahramani and Griffiths (2005) Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
 Ghosh and Ramamoorthi (2003) Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer.
 Griffiths and Ghahramani (2011a) Griffiths, T. L. and Ghahramani, Z. (2011a). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224.
 Griffiths and Ghahramani (2011b) Griffiths, T. L. and Ghahramani, Z. (2011b). The Indian buffet process: An introduction and review. The Journal of Machine Learning Research, 12:1185–1224.
 Hjort (1990) Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294.
 Ishwaran and James (2011) Ishwaran, H. and James, L. F. (2011). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association.
 Ishwaran and Zarepour (2002) Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283.
 Kingman (1967) Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.
 Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
 Miller and Harrison (2013) Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.
 Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
 Paisley and Carin (2009) Paisley, J. and Carin, L. (2009). Nonparametric factor analysis with beta process priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 777–784. ACM.
 Perman et al. (1992) Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

 Pitman et al. (1996) Pitman, J. et al. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory, pages 245–267. Institute of Mathematical Statistics.
 Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.
 Smyth et al. (2009) Smyth, P., Welling, M., and Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, pages 81–88.
 Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, pages 556–563.
 Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
 Thibaux and Jordan (2007) Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In International conference on artificial intelligence and statistics, pages 564–571.
 Titsias (2008) Titsias, M. K. (2008). The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.
 Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics—Simulation and Computation, 36(1):45–54.

 Weyrauch et al. (2004) Weyrauch, B., Heisele, B., Huang, J., and Blanz, V. (2004). Component-based face recognition with 3D morphable models. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 85–85. IEEE.
 Williamson et al. (2013) Williamson, S., Dubey, A., and Xing, E. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on Machine Learning, pages 98–106.
 Yerebakan and Dundar (2017) Yerebakan, H. Z. and Dundar, M. (2017). Partially collapsed parallel Gibbs sampler for Dirichlet process mixture models. Pattern Recognition Letters, 90:22–27.
 Zhou et al. (2009) Zhou, M., Chen, H., Ren, L., Sapiro, G., Carin, L., and Paisley, J. W. (2009). Nonparametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems, pages 2295–2303.
 Zhou et al. (2012) Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics, pages 1462–1471.
Appendix A Hybrid Sampler for the Pitman-Yor Mixture Model (PYMM)
In this section, we expand upon Example 3 in Section 3.
Example 3: Pitman-Yor processes. The Pitman-Yor process (Perman et al., 1992; Pitman and Yor, 1997) is a distribution over probability measures, parametrized by a discount parameter $d$, a concentration parameter $\alpha$, and a base measure $H$. While the Pitman-Yor process is not a normalized CRM, it can be derived from a $d$-stable CRM via a change of measure and normalization. When the discount parameter is zero, we recover the Dirichlet process. As the discount parameter increases, we obtain increasingly heavy-tailed distributions over the atom sizes of the resulting probability measure.
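To illustrate the role of the discount parameter, the following sketch (ours, not part of the paper's algorithms) simulates the two-parameter Chinese restaurant process induced by a Pitman-Yor prior; with discount $d = 0$ it reduces to the ordinary DP urn, while larger discounts yield many more, heavier-tailed clusters.

```python
import random

def pitman_yor_crp(n, d, alpha, seed=0):
    """Simulate n cluster assignments from the two-parameter CRP.

    d is the discount (0 <= d < 1) and alpha the concentration
    (alpha > -d).  Returns the list of cluster sizes."""
    rng = random.Random(seed)
    counts = []   # counts[k]: number of customers in cluster k
    for _ in range(n):
        # Existing cluster k has weight n_k - d; a new cluster has
        # weight alpha + K*d, where K is the current number of clusters.
        weights = [c - d for c in counts] + [alpha + len(counts) * d]
        u = rng.random() * sum(weights)
        k = 0
        while u > weights[k]:
            u -= weights[k]
            k += 1
        if k == len(counts):
            counts.append(0)   # open a new cluster
        counts[k] += 1
    return counts

c_dp = pitman_yor_crp(2000, d=0.0, alpha=1.0)   # Dirichlet process case
c_py = pitman_yor_crp(2000, d=0.5, alpha=1.0)   # heavier-tailed PY case
```

With $d = 0.5$ the number of occupied clusters grows roughly like $\sqrt{n}$, versus logarithmically in the DP case.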
Lemma 4.
If $G \sim \mathrm{PY}(d, \alpha, H)$ with $0 \le d < 1$ and $\alpha > -d$, and $\theta_i \mid G \sim G$ for $i = 1, \dots, n$, then the posterior distribution is described by
(7)  $G \mid \theta_{1:n} \sim \sum_{k=1}^{K} \pi_k \delta_{\theta_k^*} + \pi_{K+1} G'$, with $(\pi_1, \dots, \pi_K, \pi_{K+1}) \sim \mathrm{Dirichlet}(n_1 - d, \dots, n_K - d, \alpha + Kd)$ and $G' \sim \mathrm{PY}(d, \alpha + Kd, H)$,
where $K$ is the number of occupied clusters, $\theta_1^*, \dots, \theta_K^*$ are the unique values among $\theta_{1:n}$, and $n_k$ is the number of $\theta_i$ equal to $\theta_k^*$.
Proof.
We note that the posterior atom weights for the finite component are distributed according to $\mathrm{Dirichlet}(n_1 - d, \dots, n_K - d, \alpha + Kd)$, and can easily be sampled as part of an uncollapsed sampler. Conditioned on $\pi$ and $\theta^* = (\theta_1^*, \dots, \theta_K^*)$, we can sample the cluster allocation $z_i$ of point $x_i$ as
(8)  $P(z_i = k \mid \pi, \theta^*, z_{-i}, x_i) \propto \begin{cases} \pi_k\, f(x_i \mid \theta_k^*) & \text{for } k = 1, \dots, K, \\ \pi_{K+1}\, p(x_i \mid \{x_j : z_j = k, j \neq i\}) & \text{for } k > K, \end{cases}$
where $f(\cdot \mid \theta_k^*)$ is the likelihood for each mixing component; $p(x_i \mid \{x_j : z_j = k, j \neq i\})$ is the conditional probability of $x_i$ given the other members of the $k$th cluster; and clusters with $k > K$ are allocated according to a Chinese restaurant process associated with $G'$. This procedure is summarized in Algorithm 5.
We can similarly derive the distributed sampler for the PYMM shown in Algorithm 6.
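As a concrete illustration of the two conditionals in Equations (7) and (8), the sketch below draws the instantiated posterior weights and then resamples one allocation. It is a minimal sketch, not the paper's Algorithm 5: the univariate Gaussian likelihood, the standard-normal base measure, and all function names are our own assumptions.

```python
import math
import random

rng = random.Random(0)

def dirichlet(params):
    """Draw from a Dirichlet distribution via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def resample_allocation(x, counts, thetas, d, alpha, var=1.0):
    """One hybrid-sampler step for a single point x.

    Step 1 (Equation 7): draw (pi_1..pi_K, pi_{K+1}) ~
    Dirichlet(n_1 - d, ..., n_K - d, alpha + K*d).
    Step 2 (Equation 8): score instantiated clusters with pi_k f(x | theta_k);
    route the leftover mass pi_{K+1} to a new cluster, scored with the
    prior predictive under an assumed N(0, 1) base measure."""
    K = len(counts)
    pi = dirichlet([n - d for n in counts] + [alpha + K * d])
    weights = [pi[k] * normal_pdf(x, thetas[k], var) for k in range(K)]
    weights.append(pi[K] * normal_pdf(x, 0.0, 1.0 + var))  # prior predictive
    u = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        u -= w
        if u <= 0:
            return k          # k == K signals "open a new cluster"
    return K

z = resample_allocation(0.2, counts=[12, 6, 2], thetas=[0.0, 4.0, -4.0],
                        d=0.25, alpha=1.0)
```

A full sweep would apply this update to every point, decrementing and re-incrementing the counts as allocations change.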
Appendix B Hybrid Sampler for Hierarchical Dirichlet Processes
In this section, we expand upon Example 4 in Section 3.
Example 4: Hierarchical Dirichlet processes. Hierarchical Dirichlet processes (HDPs, Teh et al., 2006) extend the DP to model grouped data. The hierarchical Dirichlet process is a distribution over probability distributions $G_j$, each of which is conditionally distributed according to a DP. These distributions are coupled using a discrete common base measure $G_0$, itself distributed according to a DP. Each distribution $G_j$ can be used to model a collection of observations $x_{j1}, \dots, x_{jn_j}$, where
(9)  $G_0 \sim \mathrm{DP}(\gamma, H)$;  $G_j \mid G_0 \sim \mathrm{DP}(\alpha, G_0)$;  $\theta_{ji} \mid G_j \sim G_j$;  $x_{ji} \mid \theta_{ji} \sim f(\theta_{ji})$,
for $j = 1, \dots, J$ and $i = 1, \dots, n_j$.
We consider a Chinese restaurant franchise (CRF, Teh et al., 2006) representation of the HDP, where each data point is represented by a customer, each atom in $G_j$ is represented by a table, and each atom location in the support of $G_0$ is represented by a dish. Let $x_{ji}$ represent the $i$th customer in the $j$th restaurant; let $t_{ji}$ be the table assignment of customer $x_{ji}$; and let $k_{jt}$ be the dish assigned to table $t$ in restaurant $j$. Let $m_{jk}$ denote the number of tables in restaurant $j$ serving dish $k$, and $n_{jtk}$ denote the number of customers in restaurant $j$ at table $t$ having dish $k$.
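The CRF seating dynamics and the count statistics above can be sketched as follows; this is our own illustrative simulation of the generative CRF (not the paper's inference algorithm), and the data layout is an assumption.

```python
import random

def chinese_restaurant_franchise(group_sizes, alpha, gamma, seed=0):
    """Simulate table/dish assignments in a Chinese restaurant franchise.

    Customer i in restaurant j sits at existing table t with weight
    n_jt. (customers already at t) or opens a new table with weight
    alpha; a new table picks dish k with weight m_.k (tables
    franchise-wide serving k) or a new dish with weight gamma."""
    rng = random.Random(seed)

    def draw(weights):
        u = rng.random() * sum(weights)
        for k, w in enumerate(weights):
            u -= w
            if u <= 0:
                return k
        return len(weights) - 1

    m = []        # m[k]: number of tables, franchise-wide, serving dish k
    tables = []   # tables[j]: list of [dish, customer_count] per table
    for n_j in group_sizes:
        restaurant = []
        for _ in range(n_j):
            t = draw([c for _, c in restaurant] + [alpha])
            if t == len(restaurant):          # open a new table
                k = draw(m + [gamma])         # choose its dish
                if k == len(m):
                    m.append(0)               # brand-new dish
                m[k] += 1
                restaurant.append([k, 0])
            restaurant[t][1] += 1
        tables.append(restaurant)
    return tables, m

tables, m = chinese_restaurant_franchise([50, 50, 50], alpha=1.0, gamma=1.0)
```

Because dishes are shared across restaurants through the counts `m`, the same mixture components recur in different groups, which is exactly the coupling the HDP provides.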
Lemma 5.
Conditioned on the table/dish assignment counts $\{m_{jk}\}$, the posterior distribution of $G_0$ can be written as
$G_0 \mid \{m_{jk}\} \sim \sum_{k=1}^{K} \beta_k \delta_{\phi_k} + \beta_{K+1} G_0'$,
where $(\beta_1, \dots, \beta_K, \beta_{K+1}) \sim \mathrm{Dirichlet}(m_{\cdot 1}, \dots, m_{\cdot K}, \gamma)$ and $G_0' \sim \mathrm{DP}(\gamma, H)$; here $K$ is the total number of instantiated dishes; $m_{\cdot k} = \sum_j m_{jk}$; $\phi_1, \dots, \phi_K$ are the locations of the instantiated dishes; and $\gamma$ is the concentration parameter of the top-level DP.
Proof.
This is a direct extension of Lemma 3, applied to the top-level Dirichlet process $G_0$. ∎
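The finite Dirichlet part of this posterior can be drawn directly; a minimal sketch (function name and data layout are our own), assuming the table counts $m_{jk}$ have already been tallied and every instantiated dish has at least one table:

```python
import random

def sample_beta(m, gamma, seed=0):
    """Draw (beta_1..beta_K, beta_{K+1}) ~ Dirichlet(m_.1, ..., m_.K, gamma)
    via normalized Gamma draws.  m[j][k] is the number of tables in
    restaurant j serving dish k."""
    rng = random.Random(seed)
    K = len(m[0])
    # Franchise-wide table counts per dish: m_.k = sum_j m_jk
    m_dot = [sum(m[j][k] for j in range(len(m))) for k in range(K)]
    g = [rng.gammavariate(a, 1.0) for a in m_dot + [gamma]]
    s = sum(g)
    return [x / s for x in g]

# Two restaurants, three instantiated dishes.
beta = sample_beta([[3, 1, 0], [2, 2, 1]], gamma=1.0)
```

The final weight `beta[-1]` is the mass reserved for as-yet-unseen dishes, handled by the collapsed tail $G_0'$.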
We can therefore construct a hybrid sampler, where $G_0$ is represented via the weights $\beta$ and corresponding dish locations $\phi$, and each $G_j$ is represented using a Chinese restaurant process. We can then sample the table allocations according to
(10)  $P(t_{ji} = t \mid \cdots) \propto \begin{cases} n_{jt\cdot}^{-ji}\, p(x_{ji} \mid k_{jt}) & \text{for an existing table } t, \\ \alpha\, p(x_{ji} \mid t^{\mathrm{new}}) & \text{for a new table,} \end{cases}$
and sample the dish for each table according to
(11)  $P(k_{jt} = k \mid \cdots) \propto \begin{cases} \beta_k\, p(\mathbf{x}_{jt} \mid \phi_k) & \text{for } k = 1, \dots, K, \\ \beta_{K+1}\, p(\mathbf{x}_{jt}) & \text{for a new dish,} \end{cases}$
where $n_{jt\cdot}^{-ji}$ is the number of customers at table $t$ excluding $x_{ji}$; $p(x_{ji} \mid k_{jt})$ is the predictive probability of $x_{ji}$ under table $t$'s current dish; $p(x_{ji} \mid t^{\mathrm{new}})$ marginalizes over the new table's dish; and $\mathbf{x}_{jt}$ denotes the set of customers seated at table $t$ in restaurant $j$.
We summarize the hybrid sampler in Algorithm 7.
If we ensure that all data associated with a given “restaurant” or group resides on the same processor, we can extend this hybrid algorithm to a distributed setting, as described in Algorithm 8.
Appendix C Further IBP Empirical Results
For the “Cambridge” data sets described in the main paper, we generated images based on a superposition of the four features in the top row of Figure 7, and then flattened each image to create a 36-dimensional vector. The bottom row of Figure 7 shows some sample data points. In addition to synthetic data, we also evaluated the distributed beta-Bernoulli process sampler on a real-world data set, the CBCL MIT face dataset (Weyrauch et al., 2004), which consists of 2,429 face images. We distributed the data across 32 processors and ran the sampler in parallel. Figure 9 shows the test-set log likelihood of the sampler over time, and Figure 8 shows the features learned by the hybrid IBP method; our method clearly discovers the underlying facial features in this data set.