Distributed, partially collapsed MCMC for Bayesian Nonparametrics

01/15/2020, by Avinava Dubey et al.

Bayesian nonparametric (BNP) models provide elegant methods for discovering underlying latent features within a data set, but inference in such models can be slow. We exploit the fact that completely random measures, in terms of which commonly used models such as the Dirichlet process and the beta-Bernoulli process can be expressed, decompose into independent sub-measures. We use this decomposition to partition the latent measure into a finite measure containing only instantiated components and an infinite measure containing all other components. We then select different inference algorithms for the two components: uncollapsed samplers mix well on the finite measure, while collapsed samplers mix well on the infinite, sparsely occupied tail. The resulting hybrid algorithm can be applied to a wide class of models, and can be easily distributed to allow scalable inference without sacrificing asymptotic convergence guarantees.


1 Introduction

Table 1: Comparison of parallel and distributed inference algorithms proposed for BNP models (Smyth et al., 2009; Doshi-Velez and Ghahramani, 2009; Doshi-Velez et al., 2009b; Williamson et al., 2013; Chang and Fisher III, 2013; Dubey et al., 2014; Ge et al., 2015; Yerebakan and Dundar, 2017; this paper) along the following dimensions: data size handled (millions of observations), exactness, parallel and distributed operation, and the BNP processes supported (beta-Bernoulli process, other CRMs, Dirichlet process, HDP, Pitman-Yor process). Other CRMs include the gamma-Poisson process, the beta-negative binomial process, etc.

Bayesian nonparametric (BNP) models are a flexible class of models whose complexity adapts to the data under consideration. BNP models place priors on infinite-dimensional objects, such as partitions with infinitely many blocks; matrices with infinitely many columns; or discrete measures with infinitely many atoms. A finite set of observations is assumed to be generated from a finite—but random—subset of these components, allowing flexibility in the underlying dimensionality and providing the ability to incorporate previously unseen properties as our dataset grows.

While the flexibility of these models makes them a good fit for large, complex data sets, distributing existing inference algorithms across multiple machines is challenging. If we explicitly represent subsets of the underlying infinite-dimensional object, for example using a slice sampler, we can face high memory requirements and slow convergence. Conversely, if we integrate out the infinite-dimensional object, we run into problems due to the global dependencies this induces.

Moreover, a key goal of distributed algorithms is to minimize communication between agents. This can be achieved by breaking the algorithm into sub-algorithms that can be run independently on different agents. In practice, we usually cannot split an MCMC sampler on a Bayesian hierarchical model into entirely independent sub-algorithms, since the hierarchy typically implies some global dependencies. Instead, we make use of conditional independencies to temporarily partition our algorithm.

Contributions: In this paper, we propose a distributable sampler for completely random measures, which unifies exact parallel inference for a wide class of Bayesian nonparametric priors, including the widely used Dirichlet process (Ferguson, 1973) and beta-Bernoulli process (Griffiths and Ghahramani, 2011a). After introducing appropriate background material, we first present general recipes for (non-distributed) partially collapsed samplers appropriate for a wide range of BNP models, focusing on the beta-Bernoulli process and the Dirichlet process as exemplars. We then demonstrate that these methods can be easily extended to a distributed setting. Next, we provide experimental results for our hybrid and distributed samplers on Dirichlet process and beta-Bernoulli process inference. Lastly, we conclude with a discussion of future directions.

2 Background and Related Work

Completely random measures (CRMs; Kingman, 1967) are random measures that assign independent masses to disjoint subsets of a space. For example, the gamma process assigns a gamma-distributed mass to each subset. Other examples include the beta process (Hjort, 1990) and the Poisson process. The distribution of a CRM is completely determined by its Lévy measure, which controls the sizes and locations of its atoms.

Many nonparametric distributions can be expressed in terms of CRMs. For example, if we sample $B \sim \mathrm{BP}(c, B_0)$ from a (homogeneous) beta process with atom weights $\pi_k$ at locations $\theta_k$, and generate a sequence of subsets $Z_n$ in which $\theta_k \in Z_n$ with probability $\pi_k$, then we obtain an exchangeable distribution over sequences of subsets known as the beta-Bernoulli process (Thibaux and Jordan, 2007), which is related to the Indian buffet process (IBP, Ghahramani and Griffiths, 2005). If we sample $G \sim \Gamma\mathrm{P}(\alpha H)$ from a gamma process on $\Theta$ with base measure $\alpha H$, then the normalized measure $G/G(\Theta)$ is distributed according to a Dirichlet process with concentration parameter $\alpha$ and base measure $H$.
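To make these two constructions concrete, the sketch below draws finite truncations of both: an IBP-style stick-breaking truncation of the beta process (the one-parameter special case of Teh et al., 2007) followed by Bernoulli feature draws, and a finite-dimensional gamma-process approximation that is normalized to give an approximate Dirichlet process draw. The truncation levels and hyperparameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Truncated beta-Bernoulli process -------------------------------------
# Approximate a (one-parameter) beta process by K ordered atom weights via
# IBP stick-breaking (Teh et al., 2007), then draw binary indicators
# z[n, k] ~ Bernoulli(pi_k); each row of Z is one "subset".
K, N, alpha = 50, 10, 5.0
v = rng.beta(alpha, 1.0, size=K)
pi = np.cumprod(v)                      # decreasing atom weights
Z = rng.binomial(1, pi, size=(N, K))

# --- Dirichlet process via a normalized gamma process ---------------------
# A finite-dimensional gamma-process approximation: K_dp independent
# gamma(alpha/K_dp, 1) weights, normalized, with atom locations drawn
# from the base measure H = N(0, 1).
alpha_dp, K_dp = 2.0, 100
gamma_weights = rng.gamma(alpha_dp / K_dp, 1.0, size=K_dp)
dp_weights = gamma_weights / gamma_weights.sum()
atoms = rng.normal(0.0, 1.0, size=K_dp)
```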

Inference algorithms for such models tend to fall into three categories: uncollapsed samplers that alternate between sampling the latent measure and the assignments (Ishwaran and Zarepour, 2002; Paisley and Carin, 2009; Zhou et al., 2009; Walker, 2007; Teh et al., 2007); collapsed samplers in which the latent measure is integrated out (Ishwaran and James, 2011; Neal, 2000; Ghahramani and Griffiths, 2005); and optimization-based methods that work with approximating distributions in which the parameters are assumed to have a mean-field distribution (Blei et al., 2006; Doshi-Velez et al., 2009a).

Collapsed methods tend to mix slowly due to the dependency between assignments, while blocked updates mean uncollapsed methods typically have good mixing properties at convergence (Ishwaran and James, 2011). Uncollapsed methods are often slow to incorporate new components, since they typically rely on sampling unoccupied components from the prior. In high dimensions, such components are unlikely to be close to the data. Conversely, collapsed methods can make use of the data when introducing new points, which tends to lead to faster convergence (Neal, 2000).

Several works have attempted to combine collapsed and uncollapsed methods, although such approaches have been restricted to specific models. Doshi-Velez and Ghahramani (2009) introduced a linear-time “accelerated” Gibbs sampler for conjugate IBPs that effectively marginalizes over the latent factors, while more recently Yerebakan and Dundar (2017) developed a sampler that partially marginalizes the latent random measure for DPs. These methods can be seen as special cases of our hybrid framework (Section 3), but do not generalize to the distributed setting.

Several inference algorithms allow computation to be distributed across multiple machines, although again, such algorithms are specific to a single model. The approximate uncollapsed algorithm of Smyth et al. (2009) is only developed for Dirichlet process-based models, and lacks asymptotic convergence guarantees. Distributed split-merge methods have been developed for Dirichlet process-based models, but not extended to more general nonparametric models (Chang and Fisher III, 2013, 2014). Partition-based algorithms based on properties of CRMs have been developed for Dirichlet process- and Pitman-Yor process-based models (Williamson et al., 2013; Dubey et al., 2014), but it is unclear how to extend them to other model families. A low-communication, distributed-memory slice sampler has been developed for the Dirichlet process, but since it is based on an uncollapsed method it will tend to perform poorly in high dimensions (Ge et al., 2015). Doshi-Velez et al. (2009b) developed an approximate distributed inference algorithm for the Indian buffet process which is superficially similar to our distributed beta-Bernoulli sampler. However, their approach allows all processors to add new features, which leads to overestimating the number of features. We contrast our proposed approach with the partially collapsed and distributed methods described above in Table 1.

3 Hybrid Inference for CRM-based models

By definition, completely random measures can be decomposed into independent random measures. If the CRM has been transformed in some manner we can often still decompose the resulting random measure into independent random measures – for example, a normalized random measure can be decomposed as a mixture of normalized random measures. Such representations allow us to decompose our inference algorithms, and use different inference techniques on the constituent measures.

As discussed in Section 2, collapsed and uncollapsed methods both have advantages and disadvantages. Loosely, collapsed methods are good at adding new components and exploring the tail of the distribution, while uncollapsed methods offer better mixing in established clusters and easy parallelization. We make use of decomposition properties of CRMs to partition our model into two components: One containing (finitely many) components currently associated with multiple observations, and one containing the infinite tail of components.

3.1 Models constructed directly from CRMs

Consider a generic hierarchical model,

(1)

where is some deterministic transformation such that for all ; is a noise term; and is a likelihood that forms a conjugate pair with , i.e. the posterior distribution is a CRM in the same family. One index refers to the observations and the other to the features. This framework includes exchangeable feature allocation models such as the beta-Bernoulli process (Ghahramani and Griffiths, 2005; Thibaux and Jordan, 2007), the infinite gamma-Poisson feature model (Titsias, 2008), and the beta negative binomial process (Zhou et al., 2012; Broderick et al., 2014). We assume, as is the case in these examples, that both collapsed and uncollapsed posterior inference algorithms are available. We also assume for simplicity that the prior contains no fixed-location atoms, although this assumption could be relaxed (see Broderick et al., 2018).

Lemma 1 (Broderick et al. 2018).

If and , and if and form a conjugate pair, then the posterior can be decomposed into two CRMs, each with known distribution. The first, , is a finite measure with fixed-location atoms at locations . The distribution over the corresponding weights is proportional to . The second, with infinitely many random-location atoms, has Lévy measure .

Based on Lemma 1, we partition the posterior into a finite CRM that contains all, or a subset of, the fixed-location atoms, and an infinite CRM that contains the remaining atoms. We use an uncollapsed sampler to sample the finite measure and the corresponding allocations, and then a collapsed sampler to sample the allocations to the infinite tail. The size of the finite component should be updated periodically to avoid explicitly instantiating atoms that are not associated with data. In our experiments, we set it to the number of currently instantiated atoms.

1:while not Converged do
2:     Select
3:     for  do
4:         Sample
5:         Sample
6:         Sample according to Equation 3
7:     end for
8:     for  do
9:         Sample according to Equation
10:         Sample new features
11:     end for
12:end while
Algorithm 1 Hybrid Beta-Bernoulli Sampler

Example 1: Beta-Bernoulli Process. As a specific example, consider the beta-Bernoulli process. Let the latent measure be a homogeneous beta process (Thibaux and Jordan, 2007), and let each observation include each feature independently with probability given by the corresponding atom weight. The following lemma helps us decompose the resulting posterior distribution:

Lemma 2 (Thibaux and Jordan 2007).

If is the number of features for which and , we can decompose the posterior distribution of the beta-Bernoulli process as the sum of two independent measures, where

(2)

We note that the atom sizes of are random variables. This allows us to split the beta-Bernoulli process into two independent feature selection mechanisms: one with a finite number of currently instantiated features, and one with an unbounded number of features.
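A minimal numerical sketch of this decomposition is given below, assuming a homogeneous beta process prior with concentration c and total mass gamma (the hyperparameters and toy allocation matrix are illustrative): the instantiated atom weights have Beta posteriors determined by the feature counts, while the uninstantiated tail is again a beta process whose expected mass shrinks as data accumulate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: a homogeneous beta process prior BP(c, gamma * H) and a binary
# allocation matrix Z for N observations over the instantiated features.
c, gamma = 1.0, 3.0
N, K_inst = 100, 6
Z = rng.binomial(1, 0.3, size=(N, K_inst))   # illustrative allocations
m = Z.sum(axis=0)
m = m[m > 0]                                 # instantiated features have m_k >= 1

# Finite part: instantiated atom weights have Beta(m_k, c + N - m_k) posteriors.
pi_inst = rng.beta(m, c + N - m)

# Infinite tail: a beta process with concentration c + N and total mass
# gamma * c / (c + N), so its expected mass decays with the sample size.
tail_mass = gamma * c / (c + N)
print(pi_inst, tail_mass)
```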

This gives us the hybrid sampler, which we summarize in Algorithm 1. For each data point , we sample in a three-step manner. If ,

(3)

If and ,

(4)

If the likelihood is a normal distribution and the feature parameters have a conjugate (Gaussian) prior, we can marginalize over the uninstantiated features as described in Ghahramani and Griffiths (2005). If this is not possible, we can instantiate them and include them in the appropriate likelihood term. Finally, we propose adding new features using a Metropolis-Hastings step. Once we have sampled the feature allocations, for every instantiated feature we sample its weight and its corresponding parameters.
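As a concrete illustration of the uncollapsed step for the instantiated features, the sketch below performs one Gibbs sweep over the binary indicators in a linear-Gaussian feature model, with the weights pi and feature matrix A explicitly instantiated. The function name and arguments are ours, and the tail update and new-feature proposals are omitted.

```python
import numpy as np

def gibbs_z_instantiated(X, Z, A, pi, sigma2, rng):
    """One sweep over z[n, k] for instantiated features in a linear-Gaussian
    model X ~ N(Z A, sigma2 I); a sketch of the uncollapsed conditional
    p(z_nk = 1) proportional to pi_k * f(x_n | z_n, A)."""
    N, K = Z.shape
    for n in range(N):
        for k in range(K):
            Z_on, Z_off = Z[n].copy(), Z[n].copy()
            Z_on[k], Z_off[k] = 1, 0
            # Gaussian log-likelihood of x_n under each setting of z_nk.
            ll_on = -0.5 * np.sum((X[n] - Z_on @ A) ** 2) / sigma2
            ll_off = -0.5 * np.sum((X[n] - Z_off @ A) ** 2) / sigma2
            log_p_on = np.log(pi[k]) + ll_on
            log_p_off = np.log1p(-pi[k]) + ll_off
            p_on = 1.0 / (1.0 + np.exp(log_p_off - log_p_on))
            Z[n, k] = rng.binomial(1, p_on)
    return Z
```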

We note that similar algorithms can easily be derived for other nonparametric latent feature models, such as those based on the infinite gamma-Poisson process (Titsias, 2008) and the beta-negative binomial process (Zhou et al., 2012; Broderick et al., 2018).

3.2 Models based on transformations of random measures

While applying transformations to CRMs means the posterior is no longer directly decomposable, in some cases we can still apply the above general ideas.

Example 2: Dirichlet Process. As noted in Section 2, the Dirichlet process with concentration parameter and base measure can be constructed by normalizing a gamma process with base measure . If the Dirichlet process is used as the prior in a mixture model (DPMM), the posterior distribution conditioned on the cluster allocations, with unique clusters, is again a Dirichlet process:

(5)

where and is the number of clusters with . In this case, too, the following lemma helps us decompose the posterior.

Lemma 3.

Assuming , and , we can decompose the posterior of DP as

where

Proof.

This is a direct extension of the fact that the Dirichlet process has Dirichlet-distributed marginals (Ferguson, 1973). See Chapter 3 of Ghosh and Ramamoorthi (2003) for a detailed analysis. ∎

1:while not converged do
2:     Select
3:     Sample
4:     Sample
5:     For , sample
6:     For each data point sample according to Equation 6
7:end while
Algorithm 2 Hybrid DPMM Sampler

We note that the posterior atom weights for the finite component are distributed according to , and can easily be sampled as part of an uncollapsed sampler. Conditioned on and , we can sample the cluster allocation of point as

(6)

where the first term is the likelihood under each mixing component, and the second is the conditional probability of the observation given the other members of the corresponding cluster. This procedure is summarized in Algorithm 2.
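The sketch below illustrates the flavour of this allocation rule for a one-dimensional Gaussian mixture: instantiated clusters are scored in an uncollapsed manner using their sampled weights and means, while the uninstantiated tail is summarised here by a single "new cluster" option scored with the prior predictive. This is a simplification of Equation 6 (which also handles previously created tail clusters in a collapsed manner), and all names and hyperparameters are ours.

```python
import numpy as np
from scipy.stats import norm

def sample_allocation(x_i, pi, mu, pi_tail, prior_mu, prior_var, lik_var, rng):
    """Hybrid allocation step (simplified sketch of Equation 6) for a 1-D
    Gaussian mixture with known likelihood variance lik_var and a Gaussian
    prior N(prior_mu, prior_var) on cluster means."""
    # Uncollapsed terms: pi_k * f(x_i | mu_k) for the instantiated clusters.
    log_p = np.log(pi) + norm.logpdf(x_i, loc=mu, scale=np.sqrt(lik_var))
    # Tail term: pi_tail times the prior predictive (new mean marginalized out).
    log_p_new = np.log(pi_tail) + norm.logpdf(
        x_i, loc=prior_mu, scale=np.sqrt(prior_var + lik_var))
    log_p = np.append(log_p, log_p_new)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)   # index len(pi) means "new cluster"
```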

Example 3: Pitman-Yor processes. The Pitman-Yor process (Perman et al., 1992; Pitman and Yor, 1997) is a distribution over probability measures, parameterized by a discount parameter and a concentration parameter, that is obtained from a stable CRM via a change of measure and normalization. Provided , it can also be represented as a Dirichlet process mixture of normalized stable CRMs (Lemma 22, Pitman and Yor, 1997). This representation allows us to decompose the posterior distribution into a beta mixture of a finite-dimensional Pitman-Yor process and an infinite-dimensional Pitman-Yor process. We provide more details in the supplementary material.

Example 4: Hierarchical Dirichlet processes We can decompose the hierarchical Dirichlet process (HDP, Teh et al., 2006) in a manner comparable to the Dirichlet process, allowing our hybrid sampler to be used on the HDP. For space reasons, we defer discussion to the supplement.

4 Distributed Inference for CRM-based models

1:procedure Local() Global variables , ,
2:     for  do
3:         Sample according to (3)
4:     end for
5:     if  then
6:         For , sample according to (4)
7:         Sample new features
8:     end if
9:end procedure
10:procedure Global()
11:     Gather feature counts and parameter sufficient statistics from all processors.
12:     Let be the number of instantiated features.
13:     For , sample
14:     Sample
15:end procedure
Algorithm 3 Distributed Beta Bernoulli Sampler

The sampling algorithms in Section 3 can easily be adapted to a distributed setting, where data are partitioned across several machines and communication is limited. In this setting, we set the size of the finite component to the number of currently instantiated features after every communication step. We instantiate the finite measure (the fixed-atom CRM in the CRM case, the finite Dirichlet component for DPMMs), with globally shared atom sizes and locations, on all processors.

We then randomly select one of the processors. On the remaining processors, we sample the allocations using restricted Gibbs sampling (Neal, 2000). In the CRM setting, we enforce that allocations to features outside the instantiated set are zero. In the DPMM setting, we only allow observations to belong to the instantiated clusters. Since the associated distributions only depend on the instantiated measures, we do not need knowledge of the feature allocations on other processors.

On the selected processor, we sample the allocations using unrestricted Gibbs. If we are working with a CRM-based model, we sample allocations to the instantiated features conditioned on the finite measure, and allocations to the remaining features conditioned on the infinite tail measure. If we are working with a DPMM, we sample the allocations following Equation 6. Note that the only data points that can be associated with atoms in the tail are those on this processor, so we can evaluate the associated probabilities without any knowledge of the other processors.

At each global step, we gather the sufficient statistics of all instantiated clusters, from both the finite component and the infinite tail, and sample parameters for those clusters. We then create a new partition, redefining the finite component to contain the currently instantiated component parameters. In the case of the DPMM, we also resample the global mixture weights. We summarize the distributed algorithm for the special cases of the beta-Bernoulli process and the DPMM in Algorithms 3 and 4, and for the PYMM in Algorithm 6 in the supplementary material.

1:procedure Local() Global variables
2:     if  then
3:         Sample according to (6)
4:     else
5:         
6:     end if
7:end procedure
8:procedure Global()
9:     Gather cluster counts and parameter sufficient statistics from all processors.
10:     Sample
11:     Let be the number of instantiated clusters.
12:     Sample
13:     For , sample
14:     Sample
15:end procedure
Algorithm 4 Distributed DPMM Sampler
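To make the local/global communication pattern of Algorithms 3 and 4 concrete, the following single-process simulation of one local round of the distributed DPMM sampler restricts all but one randomly chosen shard to the instantiated clusters. The 1-D Gaussian likelihood, the single "tail" option on the unrestricted shard, and all names are simplifying assumptions of ours.

```python
import numpy as np

def _norm_logpdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def distributed_dpmm_round(X_shards, pi, mu, pi_tail, lik_var,
                           prior_mu, prior_var, rng):
    """One local round of a distributed DPMM sampler, simulated in a single
    process for a 1-D Gaussian mixture: restricted Gibbs on all shards
    except one, which may also assign points to the uninstantiated tail."""
    P = len(X_shards)
    p_star = rng.integers(P)                        # the unrestricted shard
    assignments = []
    for p, X_p in enumerate(X_shards):
        z_p = np.empty(len(X_p), dtype=int)
        for i, x in enumerate(X_p):
            # Uncollapsed scores for the instantiated clusters.
            log_p = np.log(pi) + _norm_logpdf(x, mu, lik_var)
            if p == p_star:
                # Tail option: pi_tail times the prior predictive of a new cluster.
                log_p = np.append(
                    log_p,
                    np.log(pi_tail) + _norm_logpdf(x, prior_mu,
                                                   prior_var + lik_var))
            prob = np.exp(log_p - log_p.max())
            prob /= prob.sum()
            z_p[i] = rng.choice(len(prob), p=prob)  # index len(pi) means "tail"
        assignments.append(z_p)
    return p_star, assignments

# Global step (sketch): gather cluster counts n_k from all shards, resample
# (pi_1, ..., pi_K, pi_tail) ~ Dirichlet(n_1, ..., n_K, alpha), resample the
# cluster parameters, and redefine the finite component for the next round.
```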

4.1 Warm-start heuristics

While asymptotically correct, an unfortunate consequence of the samplers proposed above is that they are slow to instantiate new clusters. With only one processor's share of the data eligible to start a new cluster or feature, the rate at which new clusters or features are added decreases with the number of processors. This can lead to slow convergence if we start with too few clusters. To avoid this problem, we initialize our algorithm by allowing all processors to instantiate new clusters. At each global step, we decrease the number of randomly selected processors eligible to instantiate new clusters, until we end up with a single processor. This approach tends to over-estimate the number of clusters early on. However, the procedure acts in a manner similar to simulated annealing, encouraging large moves early in the algorithm and gradually decreasing the excess stochasticity until we are sampling from the correct algorithm. We note that a sampler in which multiple processors instantiate new clusters is not a correct sampler; however, because MCMC samplers are invariant to their starting position, correctness is recovered once we revert to a single processor proposing new features.
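One possible warm-start schedule is sketched below. The one-eighth warm-up fraction mirrors the choice used in the experiments of Section 5.2.1, but the linear decay inside the warm-up window, and the function name, are illustrative assumptions rather than the paper's exact schedule.

```python
def warm_start_schedule(P, n_global_steps, warm_frac=0.125):
    """Number of processors allowed to open new clusters at each global
    synchronisation step: starts at P and shrinks to 1 over the first
    warm_frac of the run, after which the sampler is exact."""
    warm_steps = max(1, int(warm_frac * n_global_steps))
    schedule = []
    for t in range(n_global_steps):
        if t >= warm_steps:
            schedule.append(1)                      # exact regime
        else:
            frac = 1.0 - t / warm_steps             # linear decay (illustrative)
            schedule.append(max(1, int(round(P * frac))))
    return schedule

print(warm_start_schedule(P=16, n_global_steps=40))
```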

Figure 1: Synthetic data experiments. Comparison of F1 scores across iterations for the collapsed, uncollapsed and hybrid samplers as the dimensionality grows.
Figure 2: Test log-likelihood with time on CIFAR-100 dataset as the dimensionality increases from 8 to 64.

5 Experimental evaluation

While our primary contribution is in the development of distributed algorithms, we first consider, in Section 5.1, the performance of the hybrid algorithms developed in Section 3 in a non-distributed setting. We show that this performance extends to the distributed setting, and offers impressive scaling, in Section 5.2.

5.1 Evaluating the hybrid sampler

We begin by considering the performance of the hybrid samplers introduced in Section 3 in a non-distributed setting. For this, we focus on the Dirichlet process, since there exist a number of collapsed and uncollapsed inference algorithms; we expect similar results under other models.

We compare the hybrid sampler of Algorithm 2 with a standard collapsed Gibbs sampler and an uncollapsed sampler based on Algorithm 8 of Neal (2000). Algorithm 8 collapses occupied clusters and instantiates a subset of unoccupied clusters; we modify this to instantiate the atoms associated with unoccupied clusters. Concretely, at each iteration, we sample weights for the instantiated clusters and uninstantiated clusters as , and sample locations for the uninstantiated clusters from the base measure . We note that this method can be distributed relatively easily.
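One plausible reading of this modified uncollapsed step is sketched below: weights for the occupied clusters and a fixed number of unoccupied clusters are drawn jointly from a Dirichlet distribution, and the unoccupied locations are drawn from the base measure. The Dirichlet parameterization, function name, and toy values are our assumptions.

```python
import numpy as np

def instantiate_clusters(counts, alpha, m_empty, base_sampler, rng):
    """Sketch of the uncollapsed weight/location step: occupied clusters get
    Dirichlet weights proportional to their counts, the concentration alpha
    is split evenly across m_empty unoccupied clusters, and unoccupied
    locations are drawn from the base measure."""
    dir_param = np.concatenate([counts, np.full(m_empty, alpha / m_empty)])
    weights = rng.dirichlet(dir_param)
    new_locations = [base_sampler(rng) for _ in range(m_empty)]
    return weights, new_locations

weights, new_locs = instantiate_clusters(
    counts=np.array([40.0, 35.0, 25.0]), alpha=1.0, m_empty=3,
    base_sampler=lambda rng: rng.normal(0.0, 1.0),
    rng=np.random.default_rng(2))
```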

Figure 1 shows convergence plots for the three algorithms. The data set is synthetic, consisting of observations drawn from a Gaussian mixture whose true mixture components have identity covariance matrices; we vary the dimensionality.

While the three algorithms perform comparably on low-dimensional data, as the dimension increases the performance of the uncollapsed sampler degrades much more than that of the collapsed sampler. This is because, in high dimensions, it is unlikely that a parameter proposed from the prior will be near our data, so the associated likelihood of any given data point will be low. This is in contrast to the collapsed setting, where we integrate over all possible locations. While the hybrid method performs worse in high dimensions than the collapsed method, it outperforms the uncollapsed method.

The synthetic data in Figure 1 has fairly low-dimensional structure, so we do not see negative effects due to the poor mixing of the collapsed sampler. Next, we evaluate the algorithms on the CIFAR-100 dataset (Krizhevsky, 2009). We used PCA to reduce the dimension of the data to between 8 and 64, and plot the test set log likelihood over time in Figure 2. Each marker represents a single iteration. We see that the uncollapsed sampler requires more iterations to converge than the collapsed sampler; however, since each iteration takes less time, its wall-clock time to convergence is only slower in some cases. The hybrid method has iteration times comparable to the collapsed sampler but, in general, converges faster. We see that, even without taking advantage of parallelization, the hybrid method is a compelling competitor to pure-collapsed and pure-uncollapsed algorithms.

5.2 Evaluating the distributed samplers

Here, we show that the distributed inference algorithms introduced in Section 4 allow inference in BNP models to be scaled to large datasets, without sacrificing accuracy. We focus on two cases: the beta-Bernoulli process (Algorithm 3) and the Dirichlet process (Algorithm 4). Our code is available online at https://github.com/michaelzhang01/hybridBNP.

5.2.1 Beta-Bernoulli process

Figure 3: Top row, left to right: test set log likelihood on synthetic data without warm-start initialization, with warm start, and with all processors on. The x-axis represents CPU wall time in seconds, on a log scale; the y-axis represents the test set log likelihood. Bottom row, left to right: number of instantiated features over iterations with cold start, warm start, and all processors introducing features.

We evaluate the beta-Bernoulli sampler on synthetic data based on the “Cambridge” dataset used in the original IBP paper (Griffiths and Ghahramani, 2011b), where each data point is the superposition of a randomly selected subset of four binary features of dimension 36, plus Gaussian noise with standard deviation 0.5 (see Figure 7 in the supplement for more details). We model this data using a linear Gaussian likelihood. We initialized the sampler to a single feature and ran it on 1,000 total observations with periodic synchronization steps, distributing over 1, 4, 8, 16, 32, 64 and 128 processors.
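A sketch of this generative process is given below. The particular feature shapes, the per-feature inclusion probability of 0.5, and all names are illustrative stand-ins rather than the exact construction used in the paper.

```python
import numpy as np

def make_cambridge_data(N, noise_sd=0.5, rng=None):
    """Generate synthetic 'Cambridge'-style data: each 6x6 image is the
    superposition of a random subset of four binary features plus
    Gaussian noise, flattened to a 36-dimensional vector."""
    rng = rng or np.random.default_rng(0)
    A = np.zeros((4, 6, 6))
    A[0, 0:3, 0:3] = 1          # four simple binary 6x6 feature images
    A[1, 0:3, 3:6] = 1          # (stand-ins for the features in Figure 7)
    A[2, 3:6, 0:3] = 1
    A[3, 3:6, 3:6] = 1
    A = A.reshape(4, 36)
    Z = rng.binomial(1, 0.5, size=(N, 4))        # random subset per image
    X = Z @ A + rng.normal(0.0, noise_sd, size=(N, 36))
    return X, Z, A

X, Z, A = make_cambridge_data(1000)
```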

We first evaluate the hybrid algorithm under a “cold start”, where only one processor is allowed to introduce new features for the entire duration of the sampler. In the top left plot of Figure 3, we see that the cold start results in slow convergence of the test set log likelihood for large numbers of processors. We can see in the bottom left plot of Figure 3 that the number of features grows very slowly, since only a single processor is allowed to propose new features in the exact setting.

Next, we explore warm-start initialization, as described in Section 4.1. For the first one-eighth of the total number of MCMC iterations, all processors can propose new features; after this we revert to the standard algorithm. The top central plot of Figure 3 shows predictive log likelihood over time, and the bottom central plot shows the number of features. We see that convergence is significantly improved relative to the cold-start experiments. Since we revert to the asymptotically correct sampler, the final number of features is generally close to the true number of features, 4. (Note that BNP models are not guaranteed to find the correct number of features in the posterior; see Miller and Harrison, 2013.) Additionally, we see that the convergence rate increases monotonically in the number of processors.

Figure 4: Test set log likelihood trace plot for a million observation “Cambridge” data set.

Next, we allowed all processors to propose new features for the entire duration (“always-hot”). This setting approximately replicates the behavior of the parallel IBP sampler of Doshi-Velez et al. (2009b). In the top right plot of Figure 3, we can see that all experiments converge to roughly the same test log likelihood. However, the number of features introduced (bottom right plot) is much greater than in the warm-start experiment, and grows with the number of processors. Moreover, the difference in convergence rates between processor counts is not as dramatic as in the warm-start trials.

Next, we demonstrate the scalability of our distributed algorithm on a massive synthetic example, showing it can be used for large-scale latent feature models. We generate one million “Cambridge” synthetic data points as described for the previous experiments, and distribute the data over 256 processors. This represents the largest experiment run for a parallel IBP/beta-Bernoulli process algorithm (the next largest being 200,000 data points, in Doshi-Velez et al. 2009b). We limited the sampler to run for one day, in which it completed 860 MCMC iterations. The test set log likelihood trace plot in Figure 4 shows that we converge to a local mode fairly quickly in this massively distributed setting.

5.2.2 Dirichlet process

Figure 5: CIFAR-100 test log-likelihood with time, with 8 processors: Method comparison.
Figure 6: CIFAR-100 test log-likelihood with time. Convergence for varying numbers of processors, .

Our distributed inference framework can also speed up inference in a DP mixture of Gaussians, using the version described in Algorithm 4. We used a dataset containing the top 64 principal components of the CIFAR-100 dataset, as described in Section 5.1. We compared against two existing distributed inference algorithms for the Dirichlet process mixture model, chosen to represent both uncollapsed and collapsed samplers: 1) a DP variant of the asynchronous sampler of Smyth et al. (2009), an approximate collapsed method; and 2) the distributed slice sampler of Ge et al. (2015), an uncollapsed method.

Figure 5 shows that, when distributed over eight processors, our algorithm converges faster than the two comparison methods, showing that the high-quality performance seen in Section 5.1 extends to the distributed setting. Further, in Figure 6 we see roughly linear speed-up in convergence as we increase the number of processors from 1 to 8.

6 Conclusion

We have proposed a general inference framework for a wide variety of BNP models. We use the inherent decomposability of the underlying completely random measures to partition the latent random measure into a finite-dimensional component that represents the majority of the data, and an infinite-dimensional component that represents the mostly uninstantiated tail. This allows us to take advantage of the inherent parallelizability of the uncollapsed sampler on the finite partition and the better performance of the collapsed sampler for proposing new components. The proposed hybrid inference method can thus be easily distributed over multiple machines, providing provably correct inference for many BNP models. Experiments show that, for both the DP and the beta-Bernoulli process, our proposed distributed hybrid sampler converges faster than the comparison methods.

References

  • Blei et al. (2006) Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.
  • Broderick et al. (2014) Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE transactions on pattern analysis and machine intelligence, 37(2):290–306.
  • Broderick et al. (2018) Broderick, T., Wilson, A. C., and Jordan, M. I. (2018). Posteriors, conjugacy, and exponential families for completely random measures. Bernoulli, 24(4B):3181–3221.
  • Chang and Fisher III (2013) Chang, J. and Fisher III, J. W. (2013). Parallel sampling of DP mixture models using sub-cluster splits. In Advances in Neural Information Processing Systems, pages 620–628.
  • Chang and Fisher III (2014) Chang, J. and Fisher III, J. W. (2014). Parallel sampling of HDPs using sub-cluster splits. In Advances in Neural Information Processing Systems, pages 235–243.
  • Doshi-Velez and Ghahramani (2009) Doshi-Velez, F. and Ghahramani, Z. (2009). Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 273–280. ACM.
  • Doshi-Velez et al. (2009a) Doshi-Velez, F., Miller, K., Van Gael, J., and Teh, Y. W. (2009a). Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144.
  • Doshi-Velez et al. (2009b) Doshi-Velez, F., Mohamed, S., Ghahramani, Z., and Knowles, D. A. (2009b). Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems, pages 1294–1302.
  • Dubey et al. (2014) Dubey, A., Williamson, S., and Xing, E. P. (2014). Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In Uncertainty in Artificial Intelligence, pages 142–151.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230.
  • Ge et al. (2015) Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning, pages 2276–2284.
  • Ghahramani and Griffiths (2005) Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
  • Ghosh and Ramamoorthi (2003) Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer.
  • Griffiths and Ghahramani (2011a) Griffiths, T. L. and Ghahramani, Z. (2011a). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224.
  • Griffiths and Ghahramani (2011b) Griffiths, T. L. and Ghahramani, Z. (2011b). The Indian buffet process: An introduction and review. The Journal of Machine Learning Research, 12:1185–1224.
  • Hjort (1990) Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294.
  • Ishwaran and James (2011) Ishwaran, H. and James, L. F. (2011). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association.
  • Ishwaran and Zarepour (2002) Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283.
  • Kingman (1967) Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.
  • Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
  • Miller and Harrison (2013) Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.
  • Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
  • Paisley and Carin (2009) Paisley, J. and Carin, L. (2009). Nonparametric factor analysis with beta process priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 777–784. ACM.
  • Perman et al. (1992) Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.
  • Pitman et al. (1996) Pitman, J. et al. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory, pages 245–267. Institute of Mathematical Statistics.
  • Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.
  • Smyth et al. (2009) Smyth, P., Welling, M., and Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, pages 81–88.
  • Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, pages 556–563.
  • Teh et al. (2006) Teh, Y.-W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Thibaux and Jordan (2007) Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In International conference on artificial intelligence and statistics, pages 564–571.
  • Titsias (2008) Titsias, M. K. (2008). The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.
  • Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics—Simulation and Computation, 36(1):45–54.
  • Weyrauch et al. (2004) Weyrauch, B., Heisele, B., Huang, J., and Blanz, V. (2004). Component-based face recognition with 3D morphable models. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 85–85. IEEE.
  • Williamson et al. (2013) Williamson, S., Dubey, A., and Xing, E. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on Machine Learning, pages 98–106.
  • Yerebakan and Dundar (2017) Yerebakan, H. Z. and Dundar, M. (2017). Partially collapsed parallel Gibbs sampler for Dirichlet process mixture models. Pattern Recognition Letters, 90:22–27.
  • Zhou et al. (2009) Zhou, M., Chen, H., Ren, L., Sapiro, G., Carin, L., and Paisley, J. W. (2009). Non-parametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems, pages 2295–2303.
  • Zhou et al. (2012) Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics, pages 1462–1471.

Distributed, partially collapsed MCMC for Bayesian Nonparametrics: Supplementary


Appendix A Hybrid Sampler for the Pitman-Yor Mixture Model (PYMM)

In this section, we expand upon Example 3 in Section 3.

Example 3: Pitman-Yor processes The Pitman-Yor process (Perman et al. (1992); Pitman and Yor (1997)) is a distribution over probability measures, parametrized by a discount parameter , a concentration parameter , and a base measure . While the Pitman-Yor process is not a normalized CRM, it can be derived from a ‐stable CRM via a change of measure and normalization. When the discount parameter is zero, we recover the Dirichlet process. As the discount parameter increases, we get increasingly heavy-tailed distributions over the atom sizes in the resulting probability measure.

Lemma 4.

If with , and , then the posterior distribution is described by

(7)

where is the number of occupied clusters, , and .

Proof.

The proof is a direct consequence of Lemma 22 in Pitman and Yor (1997) and Theorem 1 in Dubey et al. (2014). The special case for is presented in Corollary 20 of Pitman et al. (1996). ∎

We note that the posterior atom weights for the finite component are distributed according to , and can easily be sampled as part of an uncollapsed sampler. Conditioned on and , we can sample the cluster allocation of point as

(8)

where the first term is the likelihood under each mixing component, and the second is the conditional probability of the observation given the other members of the corresponding cluster. This procedure is summarized in Algorithm 5.

1:while not converged do
2:     Select
3:     Sample
4:     Sample
5:     For , sample
6:     For each data point sample according to Equation 8
7:end while
Algorithm 5 Hybrid PYMM Sampler
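A minimal sketch of the finite-part weight update used in this hybrid sampler is given below, assuming the standard Pitman-Yor posterior in which the occupied cluster weights and the leftover tail mass are jointly Dirichlet with parameters (n_1 - d, ..., n_K - d, theta + K d); the function name and toy counts are ours.

```python
import numpy as np

def py_finite_weights(counts, discount, concentration, rng):
    """Draw the instantiated Pitman-Yor cluster weights and the tail mass
    given cluster counts n_1..n_K, discount d, and concentration theta."""
    K = len(counts)
    params = np.concatenate([counts - discount,
                             [concentration + K * discount]])
    w = rng.dirichlet(params)
    return w[:-1], w[-1]          # instantiated weights, tail mass

pi_inst, pi_tail = py_finite_weights(
    counts=np.array([30.0, 12.0, 5.0]), discount=0.3, concentration=1.0,
    rng=np.random.default_rng(3))
```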

We can similarly derive a distributed sampler for the PYMM, shown in Algorithm 6.

1:procedure Local() Global variables
2:     if  then
3:         Sample according to (8)
4:     else
5:         
6:     end if
7:end procedure
8:procedure Global()
9:     Gather cluster counts and parameter sufficient statistics from all processors.
10:     Let be the number of instantiated clusters.
11:     Sample
12:     Sample
13:     For , sample
14:     Sample
15:end procedure
Algorithm 6 Distributed PYMM Sampler

Appendix B Hybrid Sampler for Hierarchical Dirichlet Processes

In this section, we expand upon Example 4 in Section 3.

Example 4: Hierarchical Dirichlet Process. Hierarchical Dirichlet processes (HDPs, Teh et al., 2006) extend the DP to model grouped data. The hierarchical Dirichlet process is a distribution over probability distributions, each of which is conditionally distributed according to a DP. These distributions are coupled using a discrete common base measure, itself distributed according to a DP. Each distribution can be used to model a collection of observations, where

(9)

for and .

We consider a Chinese Restaurant Franchise (CRF, Teh et al., 2006) representation of the HDP, where each data point is represented by a customer, each atom in is represented by a table, and each atom location in the support of is represented by a dish. Let represent the th customer in the th restaurant; let be the table assignment of customer ; let be the dish assigned to table in restaurant . Let denote the number of tables in restaurant serving dish , and denote the number of customers in restaurant at table having dish .

Lemma 5.

Conditioned on the table/dish assignment counts , the posterior distribution can be written as

where

where is the total number of instantiated dishes; ; ; and

Proof.

This is a direct extension of Lemma 3, applied to the top-level Dirichlet process . ∎
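A minimal sketch of the resulting finite/tail split of the top-level weights is shown below, assuming the standard result (Teh et al., 2006) that, given the number of tables serving each instantiated dish across all restaurants, the global dish weights and the remaining tail mass are jointly Dirichlet distributed; the function name and toy counts are ours.

```python
import numpy as np

def hdp_top_level_weights(table_counts, gamma, rng):
    """Given m_k, the total number of tables serving dish k across all
    restaurants, draw the instantiated global dish weights and the tail
    mass from their joint Dirichlet conditional."""
    params = np.concatenate([table_counts, [gamma]])
    w = rng.dirichlet(params)
    return w[:-1], w[-1]

beta_inst, beta_tail = hdp_top_level_weights(
    table_counts=np.array([7.0, 4.0, 2.0]), gamma=1.0,
    rng=np.random.default_rng(4))
```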

We can therefore construct a hybrid sampler, where is represented via and corresponding , and is represented using a Chinese restaurant process. We can then sample the table allocations according to

(10)

and sample each table according to

(11)

We summarize the hybrid sampler in Algorithm 7.

1:while not converged do
2:     Select
3:     Sample
4:     Sample
5:     For , sample
6:     For each data point sample according to Equation 10
7:     Sample the dish according to Equation 11
8:end while
Algorithm 7 Hybrid HDP Sampler

If we ensure that the data associated with each “restaurant”, or group, lies on the same processor, we can extend this hybrid algorithm to a distributed setting, as described in Algorithm 8.

1:procedure Local() Global variables
2:     if  then
3:         Sample according to (10)
4:         Sample according to (11)
5:     else
6:         Sample tables according to
(12)
7:         Sample dishes according to
(13)
8:     end if
9:end procedure
10:procedure Global()
11:     Gather cluster counts and parameter sufficient statistics from all processors.
12:     Sample
13:     Let be the number of instantiated clusters.
14:     Sample
15:     For , sample
16:     Sample
17:end procedure
Algorithm 8 Distributed HDP Sampler

Appendix C Further IBP Empirical Results

For the “Cambridge” data sets described in the main paper, we generated images based on a superposition of the four features in the top row of Figure 7, and then flattened each image to create a 36-dimensional vector. The bottom row of Figure 7 shows some sample data points.

In addition to synthetic data, we also evaluated the distributed beta-Bernoulli process sampler on a real-world data set, the CBCL MIT face dataset (Weyrauch et al., 2004). This data set consists of 2,429 face images. We distributed the data across 32 processors and ran the sampler in parallel. Figure 9 shows the test set log likelihood of the sampler over time, and Figure 8 shows the features learned by the hybrid IBP method; our method clearly discovers the underlying facial features in this data set.

Figure 7: Top: The true features present in the synthetic data set. Bottom: Examples of observations in the synthetic data set.
Figure 8: Learned features from the hybrid IBP for the CBCL face data set.
Figure 9: Test set log-likelihood trace plot for CBCL face data set.