1 Introduction
Bayesian nonparametric (BNP) stochastic processes are streaming priors – their unique feature is that they specify, in a probabilistic sense, that the complexity of a latent model should grow as the amount of observed data increases. This property captures common sense in many data analysis problems – for example, one would expect to encounter far more topics in a document corpus after reading documents than after reading – and becomes crucial in settings with unbounded, persistent streams of data. While their fixed, parametric cousins can be used to infer model complexity for datasets with known magnitude a priori [1, 2], such priors are silent with respect to notions of model complexity growth in streaming data settings.
Bayesian nonparametrics are also naturally suited to parallelization of data processing, due to the exchangeability, and thus conditional independence, they often exhibit via de Finetti’s theorem. For example, labels from the Chinese Restaurant process [3] are rendered i.i.d. by conditioning on the underlying Dirichlet process (DP) random measure, and feature assignments from the Indian Buffet process [4] are rendered i.i.d. by conditioning on the underlying beta process (BP) random measure.
Given these properties, one might expect there to be a wealth of inference algorithms for BNPs that address the challenges associated with parallelization and streaming. However, previous work has only addressed these two settings in concert for parametric models
[5, 6], and only recently has each been addressed individually for BNPs. In the streaming setting, [7] and [8] developed streaming inference for DP mixture models using sequential variational approximation. Stochastic variational inference [9] and related methods [10, 11, 12, 13] are often considered streaming algorithms, but their performance depends on the choice of a learning rate and on the dataset having known, fixed size a priori [5]. Outside of variational approaches, which are the focus of the present paper, there exist exact parallelized MCMC methods for BNPs [14, 15]; the tradeoff in using such methods is that they provide samples from the posterior rather than the distribution itself, and results regarding assessing convergence remain limited. Sequential particle filters for inference have also been developed [16], but these suffer issues with particle degeneracy and exponential forgetting.The main challenge posed by the streaming, distributed setting for BNPs is the combinatorial problem of component identification. Most BNP models contain some notion of a countably infinite set of latent “components” (e.g. clusters in a DP mixture model), and do not impose an inherent ordering on the components. Thus, in order to combine information about the components from multiple processors, the correspondence between components must first be found. Brute force search is intractable even for moderately sized models – there are possible correspondences for two sets of components of sizes and . Furthermore, there does not yet exist a method to evaluate the quality of a component correspondence for BNP models. This issue has been studied before in the MCMC literature, where it is known as the “label switching problem”, but past solution techniques are generally modelspecific and restricted to use on very simple mixture models [17, 18].
This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric models. In the proposed framework (shown for a single node A in Figure 1
), processing nodes receive a sequence of data minibatches, compute a variational posterior for each, and make asynchronous streaming updates to a central model using a mapping obtained from a component identification optimization. The key contributions of this work are as follows. First, we develop a minibatch posterior decomposition that motivates a learningratefree streaming, distributed framework suitable for Bayesian nonparametrics. Then, we derive the component identification optimization problem by maximizing the probability of a component matching. We show that the BNP prior regularizes model complexity in the optimization; an interesting side effect of this is that regardless of whether the minibatch variational inference scheme is truncated, the proposed algorithm is truncationfree. Finally, we provide an efficiently computable regularization bound for the Dirichlet process prior based on Jensen’s inequality
^{1}^{1}1Regularization bounds for other popular BNP priors may be found in the supplement.. The paper concludes with applications of the methodology to the DP mixture model, with experimental results demonstrating the scalability and performance of the method in practice.2 Streaming, distributed Bayesian nonparametric inference
The proposed framework, motivated by a posterior decomposition that will be discussed in Section 2.1, involves a collection of processing nodes with asynchronous access to a central variational posterior approximation (shown for a single node in Figure 1). Data is provided to each processing node as a sequence of minibatches. When a processing node receives a minibatch of data, it obtains the central posterior (Figure 0(a)), and using it as a prior, computes a minibatch variational posterior approximation (Figure 0(b)). When minibatch inference is complete, the node then performs component identification between the minibatch posterior and the current central posterior, accounting for possible modifications made by other processing nodes (Figure 0(c)). Finally, it merges the minibatch posterior into the central variational posterior (Figure 0(d)).
In the following sections, we use the DP mixture [3] as a guiding example for the technical development of the inference framework. However, it is emphasized that the material in this paper generalizes to many other BNP models, such as the hierarchical DP (HDP) topic model [19], BP latent feature model [20], and PitmanYor (PY) mixture [21] (see the supplement for further details).
2.1 Posterior decomposition
Consider a DP mixture model [3], with cluster parameters , assignments , and observed data . For each asynchronous update made by each processing node, the dataset is split into three subsets for analysis. When the processing node receives a minibatch of data , it queries the central processing node for the original posterior , which will be used as the prior for minibatch inference. Once inference is complete, it again queries the central processing node for the intermediate posterior which accounts for asynchronous updates from other processing nodes since minibatch inference began. Each subset , , has observations , and each variable assigns to cluster parameter . Given the independence of and in the prior, and the conditional independence of the data given the latent parameters, Bayes’ rule yields the following decomposition of the posterior of and given ,
(1) 
This decomposition suggests a simple streaming, distributed, asynchronous update rule for a processing node: first, obtain the current central posterior density , and using it as a prior, compute the minibatch posterior ; and then update the central posterior density by using (1) with the current central posterior density . However, there are two issues preventing the direct application of the decomposition rule (1):
Unknown component correspondence: Since it is generally intractable to find the minibatch posteriors exactly, approximate methods are required. Further, as (1) requires the multiplication of densities, samplingbased methods are difficult to use, suggesting a variational approach. Typical meanfield variational techniques introduce an artificial ordering of the parameters in the posterior, thereby breaking symmetry that is crucial to combining posteriors correctly using density multiplication [6]. The use of (1) with meanfield variational approximations thus requires first solving a component identification problem.
Unknown model size: While previous posterior merging procedures required a 1to1 matching between the components of the minibatch posterior and central posterior [5, 6], Bayesian nonparametric posteriors break this assumption. Indeed, the datasets , , and from the same nonparametric mixture model can be generated by the same, disjoint, or an overlapping set of cluster parameters. In other words, the global number of unique posterior components cannot be determined until the component identification problem is solved and the minibatch posterior is merged.
2.2 Variational component identification
Suppose we have the following meanfield exponential family prior and approximate variational posterior densities in the minibatch decomposition (1),
(2)  
where , are products of categorical distributions for the cluster labels , and the goal is to use the posterior decomposition (1) to find the updated posterior approximation
(3) 
As mentioned in the previous section, the artificial ordering of components causes the naïve application of (1) with variational approximations to fail, as disparate components from the approximate posteriors may be merged erroneously. This is demonstrated in Figure 2(a), which shows results from a synthetic experiment (described in Section 4) ignoring component identification. As the number of parallel threads increases, more matching mistakes are made, leading to decreasing model quality.
To address this, first note that there is no issue with the first components of and ; these can be merged directly since they each correspond to the components of . Thus, the component identification problem reduces to finding the correspondence between the last components of the minibatch posterior and the last components of the intermediate posterior. For notational simplicity (and without loss of generality), fix the component ordering of the intermediate posterior , and define to be the 1to1 mapping from minibatch posterior component to updated central posterior component , where . The fact that the first components have no ordering ambiguity can be expressed as . Note that the maximum number of components after merging is , since each of the last components in the minibatch posterior may correspond to new components in the intermediate posterior. After substituting the three variational approximations (2) into (1), the goal of the component identification optimization is to find the 1to1 mapping that yields the largest updated posterior normalizing constant, i.e. matches components with similar densities,
(4) 
where is the distribution such that . Taking the logarithm of the objective and exploiting the meanfield decoupling allows the separation of the objective into a sum of two terms: one expressing the quality of the matching between components (the integral over ), and one that regularizes the final model size (the sum over ). While the first term is available in closed form, the second is in general not. Therefore, using the concavity of the logarithm function, Jensen’s inequality yields a lower bound that can be used in place of the intractable original objective, resulting in the final component identification optimization:
(5) 
A more detailed derivation of the optimization may be found in the supplement. denotes expectation under the distribution , and
(6) 
where denotes the range of the mapping . The definitions in (6) ensure that the prior is used whenever a posterior does not contain a particular component . The intuition for the optimization (5) is that it combines finding component correspondences with high similarity (via the logpartition function) with a regularization term^{2}^{2}2This is equivalent to the KLdivergence regularization . on the final updated posterior model size.
Despite its motivation from the Dirichlet process mixture, the component identification optimization (5) is not specific to this model. Indeed, the derivation did not rely on any properties specific to the Dirichlet process mixture; the optimization applies to any Bayesian nonparametric model with a set of “components” , and a set of combinatorial “indicators” . For example, the optimization applies to the hierarchical Dirichlet process topic model [10] with topic word distributions and localtoglobal topic correspondences , and to the beta process latent feature model [4] with features
and binary assignment vectors
. The form of the objective in the component identification optimization (5) reflects this generality. In order to apply the proposed streaming, distributed method to a particular model, one simply needs a blackbox variational inference algorithm that computes posteriors of the form (2), and a way to compute or bound the expectation in the objective of (5).2.3 Updating the central posterior
To update the central posterior, the node first locks it and solves for via (5). Locking prevents other nodes from solving (5) or modifying the central posterior, but does not prevent other nodes from reading the central posterior, obtaining minibatches, or performing inference; the synthetic experiment in Section 4 shows that this does not incur a significant time penalty in practice. Then the processing node transmits and its minibatch variational posterior to the central processing node where the product decomposition (1) is used to find the updated central variational posterior in (3), with parameters
(7) 
Finally, the node unlocks the central posterior, and the next processing node to receive a new minibatch will use the above , , and from the central node as their , , and .
3 Application to the Dirichlet process mixture model
The expectation in the objective of (5) is typically intractable to compute in closedform; therefore, a suitable lower bound may be used in its place. This section presents such a bound for the Dirichlet process, and discusses the application of the proposed inference framework to the Dirichlet process mixture model using the developed bound. Crucially, the lower bound decomposes such that the optimization (5) becomes a maximumweight bipartite matching problem. Such problems are solvable in polynomial time [22] by the Hungarian algorithm, leading to a tractable component identification step in the proposed streaming, distributed framework.
3.1 Regularization lower bound
For the Dirichlet process with concentration parameter , is the Exchangeable Partition Probability Function (EPPF) [23]
(8) 
where is the amount of data assigned to cluster , and is the set of labels of nonempty clusters. Given that the variational distribution , is a product of independent categorical distributions , Jensen’s inequality may be used to bound the regularization in (5) below (see the supplement for further details) by
(9) 
where is a constant with respect to the component mapping , and
(10) 
Note that the bound (9) allows incremental updates: after finding the optimal mapping , the central update (7) can be augmented by updating the values of and on the central node to
(11) 
As with , , and from (7), after performing the regularization statistics update (11), a processing node that receives a new minibatch will use the above and as their and , respectively.
Figure 2 demonstrates the behavior of the lower bound in a synthetic experiment with datapoints for various DP concentration parameter values . The true regularization was computed by sample approximation with samples. In Figure 1(a), the number of clusters was varied, with symmetric categorical label weights set to . This figure demonstrates two important phenomena. First, the bound increases as ; in other words, it gives preference to fewer, larger clusters, which is the typical BNP “rich get richer” property. Second, the behavior of the bound as depends on the concentration parameter – as increases, more clusters are preferred. In Figure 1(b), the number of clusters was fixed to 10, and the categorical label weights were sampled from a symmetric Dirichlet distribution with parameter . This figure demonstrates that the bound does not degrade significantly with high labelling uncertainty, and is nearly exact for low labelling uncertainty. Overall, Figure 1(a) demonstrates that the proposed lower bound exhibits similar behaviors to the true regularization, supporting its use in the optimization (5).
3.2 Solving the component identification optimization
Given that both the regularization (9) and component matching score in the objective (5) decompose as a sum of terms for each , the objective can be rewritten using a matrix of matching scores and selector variables . Setting indicates that component in the minibatch posterior is matched to component in the intermediate posterior (i.e. ), providing a score defined using (6) and (10) as
(12) 
The optimization (5) can be rewritten in terms of and as
(13) 
The first two constraints express the 1to1 property of . The constraint fixes the upper block of to (due to the fact that the first components are matched directly), and the offdiagonal blocks to . Denoting , to be the lower right blocks of , , the remaining optimization problem is a linear assignment problem on with cost matrix , which can be solved using the Hungarian algorithm^{3}^{3}3For the experiments in this work, we used the implementation at github.com/hrldcpr/hungarian.. Note that if or , this implies that no matching problem needs to be solved – the first components of the minibatch posterior are matched directly, and the last are set as new components. In practical implementation of the framework, new clusters are typically discovered at a diminishing rate as more data are observed, so the number of matching problems that are solved likewise tapers off. The final optimal component mapping is found by finding the nonzero elements of :
(14) 
4 Experiments
In this section, the proposed inference framework is evaluated on the DP Gaussian mixture with a normalinverseWishart (NIW) prior. We compare the streaming, distributed procedure coupled with standard variational inference [24] (SDADP) to five stateoftheart inference algorithms: memoized online variational inference (moVB) [13], stochastic online variational inference (SVI) [9] with learning rate , sequential variational approximation (SVA) [7] with cluster creation threshold and prune/merge threshold , subcluster splits MCMC (SC) [14], and batch variational inference (Batch) [24]. Priors were set by hand and all methods were initialized randomly. Methods that use multiple passes through the data (e.g. moVB, SVI) were allowed to do so. moVB was allowed to make birth/death moves, while SVI/Batch had fixed truncations. All experiments were performed on a computer with 24 CPU cores and 12GiB of RAM.
Synthetic:
This dataset consisted of 100,000 2dimensional vectors generated from a Gaussian mixture model with 100 clusters and a
prior with , , , and . The algorithms were given the true NIW prior, DP concentration , and minibatches of size 50. SDADP minibatch inference was truncated to components, and all other algorithms were truncated to components. Figure 3 shows the results from the experiment over 30 trials, which illustrate a number of important properties of SDADP. First and foremost, ignoring the component identification problem leads to decreasing model quality with increasing number of parallel threads, since more matching mistakes are made (Figure 2(a)). Second, if component identification is properly accounted for using the proposed optimization, increasing the number of parallel threads reduces execution time, but does not affect the final model quality (Figure 2(b)). Third, SDADP (with 40 threads) converges to the same final test log likelihood as the comparison algorithms in significantly reduced time (Figure 2(c)). Fourth, each component identification optimization typically takes seconds, and thus matching accounts for less than a millisecond of total computation and does not affect the overall computation time significantly (Figure 2(d)). Fifth, the majority of the component matching problems are solved within the first 80 minibatch updates (out of a total of 2,000) – afterwards, the true clusters have all been discovered and the processing nodes contribute to those clusters rather than creating new ones, as per the discussion at the end of Section 3.2 (Figure 2(e)). Finally, increased parallelization can be advantageous in discovering the correct number of clusters; with only one thread, mistakes made early on are built upon and persist, whereas with more threads there are more component identification problems solved, and thus more chances to discover the correct clusters (Figure 2(f)).

Airplane Trajectories: This dataset consisted of
3,000,000 automatic dependent surveillance broadcast (ADSB) messages collected from planes across the United States during the period 20130322 01:30:00UTC to 20130328 12:00:00UTC. The messages were connected based on plane call sign and time stamp, and erroneous trajectories were filtered based on reasonable spatial/temporal bounds, yielding 15,022 trajectories with 1,000 held out for testing. The latitude/longitude points in each trajectory were fit via linear regression, and the 3dimensional parameter vectors were clustered. Data was split into minibatches of size 100, and SDADP used 16 parallel threads.
MNIST Digits [25]: This dataset consisted of 70,000 images of handwritten digits, with 10,000 held out for testing. The images were reduced to 20 dimensions with PCA prior to clustering. Data was split into minibatches of size 500, and SDADP used 48 parallel threads.
SUN Images [26]: This dataset consisted of 108,755 images from 397 scene categories, with 8,755 held out for testing. The images were reduced to 20 dimensions with PCA prior to clustering. Data was split into minibatches of size 500, and SDADP used 48 parallel threads.
Figure 4 shows the results from the experiments on the three real datasets. From a qualitative standpoint, SDADP discovers sensible clusters in the data, as demonstrated in Figures 3(a)–3(c). However, an important quantitative result is highlighted by Table 0(a): the larger a dataset is, the more the benefits of parallelism provided by SDADP become apparent. SDADP consistently provides a model quality that is competitive with the other algorithms, but requires orders of magnitude less computation time, corroborating similar findings on the synthetic dataset.
5 Conclusions
This paper presented a streaming, distributed, asynchronous inference algorithm for Bayesian nonparametric models, with a focus on the combinatorial problem of matching minibatch posterior components to central posterior components during asynchronous updates. The main contributions are a component identification optimization based on a minibatch posterior decomposition, a tractable bound on the objective for the Dirichlet process mixture, and experiments demonstrating the performance of the methodology on largescale datasets. While the present work focused on the DP mixture as a guiding example, it is not limited to this model – exploring the application of the proposed methodology to other BNP models is a potential area for future research.
Acknowledgments
This work was supported by the Office of Naval Research under ONR MURI grant N000141110688.
References
 Nobile [1994] Agostino Nobile. Bayesian Analysis of Finite Mixture Distributions. PhD thesis, Carnegie Mellon University, 1994.
 Miller and Harrison [2013] Jeffrey W. Miller and Matthew T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems 26, 2013.

Teh [2010]
Yee Whye Teh.
Dirichlet processes.
In
Encyclopedia of Machine Learning
. Springer, New York, 2010.  Griffiths and Ghahramani [2005] Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 22, 2005.
 Broderick et al. [2013] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Advances in Neural Information Procesing Systems 26, 2013.

Campbell and How [2014]
Trevor Campbell and Jonathan P. How.
Approximate decentralized Bayesian inference.
InProceedings of the 30th Conference on Uncertainty in Artificial Intelligence
, 2014.  Lin [2013] Dahua Lin. Online learning of nonparametric mixture models via sequential variational approximation. In Advances in Neural Information Processing Systems 26, 2013.
 Zhang et al. [2014] Xiaole Zhang, David J. Nott, Christopher Yau, and Ajay Jasra. A sequential algorithm for fast fitting of Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 23(4):1143–1162, 2014.
 Hoffman et al. [2013] Matt Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
 Wang et al. [2011] Chong Wang, John Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2011.
 Bryant and Sudderth [2009] Michael Bryant and Erik Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Proecssing Systems 23, 2009.
 Wang and Blei [2012] Chong Wang and David Blei. Truncationfree stochastic variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25, 2012.
 Hughes and Sudderth [2013] Michael Hughes and Erik Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Advances in Neural Information Processing Systems 26, 2013.
 Chang and Fisher III [2013] Jason Chang and John Fisher III. Parallel sampling of DP mixture models using subclusters splits. In Advances in Neural Information Procesing Systems 26, 2013.
 Neiswanger et al. [2014] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarassingly parallel MCMC. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.
 Carvalho et al. [2010] Carlos M. Carvalho, Hedibert F. Lopes, Nicholas G. Polson, and Matt A. Taddy. Particle learning for general mixtures. Bayesian Analysis, 5(4):709–740, 2010.
 Stephens [2000] Matthew Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4):795–809, 2000.
 Jasra et al. [2005] Ajay Jasra, Chris Holmes, and David Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67, 2005.
 Teh et al. [2006] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
 DoshiVelez and Ghahramani [2009] Finale DoshiVelez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Proceedings of the International Conference on Machine Learning, 2009.
 Dubey et al. [2014] Avinava Dubey, Sinead Williamson, and Eric Xing. Parallel Markov chain Monte Carlo for PitmanYor mixture models. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.
 Edmonds and Karp [1972] Jack Edmonds and Richard Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the Association for Computing Machinery, 19:248–264, 1972.
 Pitman [1995] Jim Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145–158, 1995.
 Blei and Jordan [2006] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.
 [25] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST database of handwritten digits. Online: yann.lecun.com/exdb/mnist.
 [26] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN 397 image database. Online: vision.cs.princeton.edu/projects/2010/SUN.
Comments
There are no comments yet.