1 Introduction
Bayesian nonparametric (BNP) models have been successfully applied to a wide range of domains but, despite significant improvements in computational hardware, statistical inference in most BNP models remains infeasible for large datasets. The flexibility gained by such models is paid for with severe decreases in computational efficiency, which makes these models somewhat impractical. There is therefore an emerging need for approaches that simultaneously minimize both empirical risk and computational complexity (Bousquet and Bottou, 2008). Towards that end, we present a simple, statistically rigorous and computationally efficient approach to the estimation of BNP models based on maximum a posteriori (MAP) inference, and we concentrate on inference in Dirichlet process mixtures (DPMs).
DPMs are mixture models which use the Dirichlet process (DP) (Ferguson, 1973)
as a prior over the parameters of the distribution of some random variable. The random variable has a distribution with a potentially infinite number of mixture components. The DP is an adaptation of the discrete Dirichlet distribution to the infinite, uncountable sample space. A draw from a DP is itself a density function. A DP is the Bayesian conjugate density to the empirical probability density function, much as the discrete Dirichlet distribution is conjugate to the categorical distribution. Hence, DPs have value in Bayesian probabilistic models because they are priors over completely general density functions. This is one sense in which DPMs are nonparametric.
An additional, interesting property of DP-distributed density functions is that they are discrete in the following sense: they are formed of an infinite but countable mixture of Dirac delta functions. Since each Dirac has zero measure, the support of the density function is also countable. This discreteness means that draws from such densities have a nonzero probability of being repeats of previous draws. Furthermore, the more often a sample is repeated, the higher the probability of that sample being drawn again – an effect known as the “rich get richer” property (known as preferential attachment in the network science literature (Barabási and Albert, 1999)). This repetition, coupled with preferential attachment, leads to another valuable property of DPs: samples from DP-distributed densities have a strong clustering property whereby $N$ draws can be partitioned into $K \leq N$ representative draws, where $K$ is not fixed a priori.
Practical methods for drawing densities from DPs, and samples from those densities, revolve around three equivalent constructions: the (generalized) Pólya urn scheme (Blackwell and MacQueen, 1973) allowing draws of samples directly from the DP, the stick-breaking method (Ishwaran and James, 2001) which creates explicit representations of DP-distributed densities, and the Chinese restaurant process (CRP) which defines exchangeable conditional distributions over partitions of the draws from the DP, defined by the $K$ representatives. All three constructions lead to practical stochastic sampling inference schemes for DPMs. Sampling inference is most often conducted using Markov chain Monte Carlo (MCMC) methods, e.g. Gibbs (Geman and Geman, 1984) and slice (Neal, 2003) samplers. Here we discuss only the Gibbs sampler as it is the “starting point” for the development of the novel algorithm of this paper.
Approximate inference algorithms that scale better with data size include variational Bayes (VB) schemes for DPMs (Blei and Jordan, 2004). However, the required factorization assumption leads to biased estimates of the posterior of interest. Further, VB algorithms can become trapped in local minima and often underestimate the variance of the quantities of interest (Bishop, 2006, page 462). VB for DPMs also truncates the infinite number of components in the DPM, which causes additional approximation error. More importantly, even with this truncation, closed-form maximization steps for the different DPM quantities are rarely obtained, so iterative optimization steps are required. Although DPM VB often converges more quickly than MCMC, it usually requires high computational effort for each step. Finally, obtaining the optimization schedule for the variational distribution with respect to the auxiliary variables involved is often a challenging task.
Small variance asymptotics create degenerate, point-mass Dirac distributions in the probabilistic model in order to devise simplified inference algorithms. The DP-means algorithm (Kulis and Jordan, 2012) is derived from a Gaussian DPM Gibbs sampler by shrinking the component covariances to zero (discussed further in Section 2.3). The approach was later extended to exponential family distributions by Jiang et al. (2012). Broderick et al. (2013) subsequently used DP-means as a tool for finding the MAP solution of the degenerate complete likelihood for a DPM, and applied the same principle to Bayesian nonparametric latent feature models. However, in using a degenerate likelihood, some of the defining properties of the DPM, for example the prior over the partition (see Section 2.1), are lost. In this work, we present an algorithm for finding the solution of the MAP problem posed in Broderick et al. (2013) without resorting to a degenerate likelihood. This makes the algorithm more faithful to inference in the corresponding probabilistic model, and also allows the use of standard rigorous tools such as out-of-sample prediction for cross-validation.
We concentrate on inference in DP mixtures and show how the CRP may be exploited to produce simplified MAP inference algorithms for DPMs. Like DP-means, our approach provides only point estimates of the joint posterior. However, while DP-means follows the close relationship between $K$-means and the finite Gaussian mixture model (GMM) to derive a “nonparametric $K$-means”, we exploit the concept of iterated conditional modes (ICM) (Kittler and Föglein, 1984).
After reviewing the CRP (Section 2.1) and DPM (Section 2.2), we discuss the DP-means algorithm, highlighting some of its deficiencies, in Section 2.3, and we show how these can be overcome using non-degenerate MAP inference in Section 3. We compare the different approaches on synthetic and real datasets in Section 4 and conclude with a discussion of future directions in Section 5.
2 Background
2.1 Chinese restaurant process (CRP)
The CRP is a discrete-time stochastic process over the sample space of partitions, or equivalently can be thought of as a probability distribution over cluster indicator variables. It is strictly defined by an integer $N$ (the number of items) and a positive, real concentration parameter $\alpha$. A draw from a CRP has probability:
(1) $p\left(z_{1},\dots,z_{N}\right) = \frac{\Gamma\left(\alpha\right)}{\Gamma\left(\alpha+N\right)}\,\alpha^{K}\prod_{k=1}^{K}\Gamma\left(N_{k}\right)$
with indicators $z_{i}\in\left\{1,\dots,K\right\}$, where $K$ is the unknown number of clusters and $N_{k}$ is the number of indicators taking value $k$, with $\sum_{k=1}^{K}N_{k}=N$. For any finite $N$ we will have $K\leq N$, and usually $K$ will be much smaller than $N$, so the CRP returns a partitioning of $N$ elements into some smaller number of groups $K$. The probability over indicators is constructed in a sequential manner using the following conditional probability:
(2) $p\left(z_{i}=k \mid z_{1},\dots,z_{i-1}\right) = \begin{cases} \frac{N_{k}}{i-1+\alpha} & \text{for an existing cluster } k \\ \frac{\alpha}{i-1+\alpha} & \text{for a new cluster} \end{cases}$
By increasing the value of $i$ from $1$ to $N$ and using the corresponding conditional probabilities, we obtain the joint distribution (1) over the indicators $z_{1},\dots,z_{N}$. The stochastic process is often explained using the metaphor of customers sitting at tables in a Chinese restaurant, where the probability of customer $i$ sitting at a previously occupied table or a new table is given by (2).
2.2 The Dirichlet process Gaussian mixture model and Gibbs sampler
DPMs are popular nonparametric Bayesian models, related to the finite mixture model but making additional assumptions that allow for greater flexibility. For illustration, consider the case where the mixture components are Gaussian with joint mean and precision parameters $\left(\mu,T\right)$. We will denote by $\mathbf{x}$ the full data matrix formed of the $N$ observed data points $x_{1},\dots,x_{N}$, which are $D$-dimensional vectors. The Dirichlet process Gaussian mixture model (DPGMM) with collapsed mixture weights can be written:
(3) $z \sim \mathrm{CRP}\left(\alpha,N\right), \qquad \left(\mu_{k},T_{k}\right) \sim G_{0} \text{ for } k=1,\dots,K, \qquad x_{i} \sim \mathcal{N}\left(\mu_{z_{i}},T_{z_{i}}^{-1}\right)$
for $i=1,\dots,N$. Here, $z$ denotes a partition drawn from a CRP which implicitly constrains the indicators $z_{i}$ for the data points. Variables $\mu_{k}$ and $T_{k}$ are, respectively, component means and precision matrices that are jointly normal-Wishart (NW) distributed. Here, $i$ indexes observations and $G_{0}$ is a prior density function over these parameters. Given the clustering property of the CRP, there will be $K$ draws from $G_{0}$ pointed to by the cluster indicators $z_{i}$, so that $\left(\mu_{k},T_{k}\right)$ become the cluster parameters for cluster $k$. Each observation is then Gaussian but shares parameters with all the other observations in the same cluster, i.e. $x_{i} \sim \mathcal{N}\left(\mu_{k},T_{k}^{-1}\right)$ for $z_{i}=k$.
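The CRP draw at the top of this generative process can be simulated directly from the sequential conditionals (2). Below is a minimal Python sketch (function and variable names are our own, not from the paper):

```python
import random

def sample_crp(n, alpha, seed=0):
    """Draw cluster indicators z_1..z_n from a CRP with concentration alpha,
    using the sequential conditional probabilities of Eq. (2)."""
    rng = random.Random(seed)
    z = []
    counts = []  # counts[k] = number of customers already at table k
    for i in range(n):
        # Existing table k is chosen w.p. counts[k]/(i+alpha);
        # a new table is chosen w.p. alpha/(i+alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                z.append(k)
                counts[k] += 1
                break
        else:
            z.append(len(counts))  # open a new table
            counts.append(1)
    return z
```

Because tables are labeled in order of first appearance, the returned indicators form a partition with contiguous labels, and larger tables attract new customers more often (the rich-get-richer effect).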
One simple and popular sampling algorithm for parameter inference in the DPM is CRP-based Gibbs sampling, first applied by West et al. (1994) and later discussed in Neal (2000). This MCMC algorithm alternates until convergence between two stages: sampling the component indicators while holding the cluster parameters fixed, and sampling new cluster parameters while holding the cluster indicators fixed. The first stage samples the cluster indicators from the following conditional distribution:
(4) $p\left(z_{i}=k \mid z_{-i},\mu_{k},T_{k},x_{i}\right) \propto \frac{N_{k,-i}}{N-1+\alpha}\,\mathcal{N}\left(x_{i} \mid \mu_{k},T_{k}^{-1}\right)$
where $N_{k,-i}$ denotes the number of times an observation, apart from observation $i$, has been assigned to cluster $k$. However, as this is a DPM there is always a finite probability of creating a new cluster, whereby we sample new cluster parameters and add a new value to the possible values for each indicator. The probability of creating a new cluster is:
(5) $p\left(z_{i}=K+1 \mid z_{-i},x_{i}\right) \propto \frac{\alpha}{N-1+\alpha}\int\mathcal{N}\left(x_{i} \mid \mu,T^{-1}\right)\mathrm{d}G_{0}\left(\mu,T\right)$
The new cluster parameters are sampled from the posterior distribution defined by the prior $G_{0}$ and the single observation likelihood $\mathcal{N}\left(x_{i} \mid \mu,T^{-1}\right)$. Due to conjugacy, this posterior is NW with parameters updated from the prior parameters we have set for $G_{0}$. In the second stage, we sample the new cluster parameters for each cluster $k$ from the posterior distribution over $\left(\mu_{k},T_{k}\right)$ conditional on the updated $z$, the prior $G_{0}$, and the joint likelihood of all observations assigned to cluster $k$, i.e. every $x_{i}$ where $z_{i}=k$. This likelihood is a product of Gaussian distributions and, by conjugacy, the posterior distribution will be another NW.
2.3 Hard clustering via small variance asymptotics
Based on the CRP Gibbs sampler described above and with some simplifying assumptions, Kulis and Jordan (2012) describe a hard clustering algorithm that is closely related to $K$-means clustering, but where $K$ can vary with the number of observations $N$. This DP-means algorithm mirrors the well-known small variance asymptotic derivation of the $K$-means algorithm from a simplification of the expectation-maximization (EM) algorithm for the finite Gaussian mixture model (Bishop, 2006, page 423).
In DP-means, each Gaussian component is spherical with identical covariance $\sigma I$, and the variance parameter $\sigma$ is assumed known and hence fixed in the algorithm. This is equivalent to assuming $T_{k}=\sigma^{-1}I$ in the DPGMM model (3). Then, since the cluster components have fixed covariances, the conjugate choice of prior for the cluster means is Gaussian. To obtain closed-form solutions, Kulis and Jordan (2012) assume a zero-mean Gaussian prior with covariance $\rho I$ and fixed $\rho$. They further assume a functional dependency between $\alpha$ and the covariances, $\alpha=\left(1+\rho/\sigma\right)^{D/2}\exp\left(-\frac{\lambda}{2\sigma}\right)$, for some new parameter $\lambda$. The probability of assigning observation $i$ to cluster $k$ becomes:
(6) $p\left(z_{i}=k \mid \cdot\right) \propto N_{k,-i}\exp\left(-\frac{1}{2\sigma}\left\|x_{i}-\mu_{k}\right\|^{2}\right)$
and the probability of creating a new cluster is:
(7) $p\left(z_{i}=K+1 \mid \cdot\right) \propto \exp\left(-\frac{\lambda}{2\sigma}-\frac{1}{2\left(\rho+\sigma\right)}\left\|x_{i}\right\|^{2}\right)$
Then, in the small variance asymptotic limit $\sigma\to 0$ (as in $K$-means) the probability over $z_{i}$ collapses to $1$ for the cluster $k$ whose mean $\mu_{k}$ is closest to $x_{i}$; or instead, the probability of creating a new cluster becomes $1$ when $\lambda$ is smaller than any of the squared distances $\left\|x_{i}-\mu_{k}\right\|^{2}$. Therefore, a new cluster is created if there is any observed data point for which $\lambda$ is smaller than the squared distance from that data point to every existing component mean vector. If a new component is generated, it will have mean $\mu_{K+1}=x_{i}$ because, in the small variance limit, the covariance of the posterior over $\mu_{K+1}$ becomes zero.
The component parameter update stage simplifies to the $K$-means update, i.e. the mean of each component is simply replaced by the mean of the observations assigned to that component. This occurs because, by conjugacy, the posterior over the component means is multivariate Gaussian, and as $\sigma\to 0$ the likelihood term dominates over the prior.
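Putting the assignment rule and the mean update together gives the following minimal DP-means sketch. This is a simplified illustration of Kulis and Jordan (2012); the initialization from a single global cluster and the stopping rule are our own simplifications:

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Hard clustering with a new-cluster penalty lam (squared-distance scale)."""
    mu = [X.mean(axis=0)]                  # start from one global cluster
    z = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            d = np.array([np.sum((x - m) ** 2) for m in mu])
            if d.min() > lam:              # farther than lam from every centre
                mu.append(x.copy())        # open a new cluster at x_i
                k = len(mu) - 1
            else:
                k = int(d.argmin())
            if z[i] != k:
                z[i], changed = k, True
        # drop empty clusters, relabel, then the k-means-style mean update
        keep = [k for k in range(len(mu)) if np.any(z == k)]
        relabel = {k: j for j, k in enumerate(keep)}
        z = np.array([relabel[k] for k in z])
        mu = [X[z == j].mean(axis=0) for j in range(len(keep))]
        if not changed:
            break
    return z, np.array(mu)
```

Note that, as discussed above, the counts $N_{k,-i}$ play no role here: the assignment depends only on distances and $\lambda$.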
3 MAP-DPM using elliptical multivariate Gaussians
Although the DP-means algorithm presented above is straightforward, it has various drawbacks in practice. The most problematic is that the functional dependency between the concentration parameter $\alpha$ and the covariances destroys the preferential attachment property of the DPM, because the counts of assignments to components no longer influence which component an observed data point gets assigned to. Only the geometry of the data space matters: a new cluster is created by comparing the parameter $\lambda$ against the distances between cluster centers and data points, so the number of clusters is controlled by the geometry alone, and not by the number of data points already assigned to each cluster. Consequently, for high-dimensional datasets, it is not clear how to choose the parameter $\lambda$. By contrast, in the CRP Gibbs sampler for the DPM, the concentration parameter $\alpha$ controls the rate at which new clusters are produced in a way which is largely independent of the geometry.
Another problem with small variance asymptotics is that the introduction of degenerate Dirac point masses makes likelihood comparisons meaningless, since the model likelihood becomes infinite. This means that we cannot readily choose parameters such as $\lambda$ using standard model selection methods such as cross-validation.
Here, we propose a DPM inference algorithm based on iterated conditional modes (ICM; see Kittler and Föglein (1984) and also Bishop (2006, page 546)). This is also called the maximization-maximization (MM) algorithm by Welling and Kurihara (2006). The basic idea is to use conditional modal point estimates rather than samples from the conditional probabilities used in Gibbs sampling.
3.1 Probabilistic model overview
We will make the simplifying assumption that the precision matrices $T_{k}$ are diagonal, denoting the nonzero entries $\tau_{kd}$ for dimensions $d=1,\dots,D$. The component distributions are then products of univariate Gaussians, $\prod_{d=1}^{D}\mathcal{N}\left(x_{id} \mid \mu_{kd},\tau_{kd}^{-1}\right)$. We can therefore replace the NW prior over the cluster parameters with the simpler product of normal-Gamma (NG) priors, retaining conjugacy (Appendix A.1). The prior for a specific cluster component is therefore:
(8) $p\left(\mu_{k},\tau_{k}\right) = \prod_{d=1}^{D}\mathcal{N}\left(\mu_{kd} \mid m_{0},\left(c_{0}\tau_{kd}\right)^{-1}\right)\mathrm{Gamma}\left(\tau_{kd} \mid a_{0},b_{0}\right)$
The NG prior parameters $\left(m_{0},c_{0},a_{0},b_{0}\right)$ need to be specified. As we are only interested in clustering, we will integrate out the cluster parameters (a process known as Rao-Blackwellization (Blackwell, 1947), which often leads to more efficient samplers (Casella and Robert, 1994)). In the Gibbs sampler framework, this requires integrating the cluster parameters out of the conditional of the cluster indicators given the cluster parameters. The integration is tractable because of conjugacy. Integrating out the means and precisions from the corresponding Gaussian likelihoods of observation $x_{i}$, we obtain the cluster assignment probabilities for assigning a point to an existing cluster $k$ or to a new cluster $K+1$, respectively:
(9) $p\left(z_{i}=k \mid z_{-i},\mathbf{x}\right) \propto \frac{N_{k,-i}}{N-1+\alpha}\,p\left(x_{i} \mid \mathbf{x}_{k,-i}\right)$
(10) $p\left(z_{i}=K+1 \mid z_{-i},\mathbf{x}\right) \propto \frac{\alpha}{N-1+\alpha}\,p\left(x_{i}\right)$
where for brevity we have used the shorthand $\mathbf{x}_{k,-i}=\left\{x_{j}:z_{j}=k,\,j\neq i\right\}$. The right-hand-side conditional probabilities are:
(11) $p\left(x_{i} \mid \mathbf{x}_{k,-i}\right) = \prod_{d=1}^{D}\mathrm{St}\left(x_{id} \mid m_{kd},\frac{a_{kd}c_{kd}}{b_{kd}\left(c_{kd}+1\right)},2a_{kd}\right)$
(12) $p\left(x_{i}\right) = \prod_{d=1}^{D}\mathrm{St}\left(x_{id} \mid m_{0},\frac{a_{0}c_{0}}{b_{0}\left(c_{0}+1\right)},2a_{0}\right)$
Here, $\mathrm{St}\left(x \mid m,s,\nu\right)$ denotes a Student-T distribution with mean $m$, precision $s$ and degrees of freedom $\nu$, and $\left(m_{kd},c_{kd},a_{kd},b_{kd}\right)$ are the NG posterior parameters – which we call component statistics. For cluster $k$ and dimension $d$, after removing the effect of the current observation $i$ (Neal, 2000):
(13) $c_{kd}=c_{0}+N_{k,-i}, \quad m_{kd}=\frac{c_{0}m_{0}+\sum_{j\neq i:z_{j}=k}x_{jd}}{c_{0}+N_{k,-i}}, \quad a_{kd}=a_{0}+\frac{N_{k,-i}}{2}, \quad b_{kd}=b_{0}+\frac{1}{2}\sum_{j\neq i:z_{j}=k}\left(x_{jd}-\bar{x}_{kd}\right)^{2}+\frac{c_{0}N_{k,-i}\left(\bar{x}_{kd}-m_{0}\right)^{2}}{2\left(c_{0}+N_{k,-i}\right)}$
where $\bar{x}_{kd}$ is the mean of $x_{jd}$ over the observations $j\neq i$ with $z_{j}=k$.
Note that these component statistics can be computed efficiently by adding and removing the effect of a single observation from the per-cluster statistics $\sum_{j:z_{j}=k}x_{jd}$ and $\sum_{j:z_{j}=k}x_{jd}^{2}$ (see Appendix A.2, Algorithm 2).
The degrees of freedom $2a_{kd}$ of the Student-T marginal likelihoods depend upon the number of observations in the corresponding component. The likelihood for smaller components will have heavier-than-Gaussian tails, while the likelihood for clusters assigned a large number of observations will be close to Gaussian. This makes the clustering more robust to outliers and penalizes creating clusters with few observations assigned to them.
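The conjugate NG update (13) and the Student-T marginal (11) are straightforward to implement for a single dimension. A minimal sketch under the parameterization above (helper names are ours; the efficient add/remove bookkeeping of Algorithm 2 is omitted):

```python
import math

def ng_posterior(x, m0=0.0, c0=1.0, a0=1.0, b0=1.0):
    """Normal-Gamma posterior parameters for one dimension given data x."""
    n = len(x)
    if n == 0:
        return m0, c0, a0, b0          # no data: posterior equals the prior
    xbar = sum(x) / n
    ss = sum((v - xbar) ** 2 for v in x)
    c = c0 + n
    m = (c0 * m0 + n * xbar) / c
    a = a0 + n / 2.0
    b = b0 + 0.5 * ss + c0 * n * (xbar - m0) ** 2 / (2.0 * c)
    return m, c, a, b

def log_student_t(x, m, c, a, b):
    """Log predictive density: Student-T with dof 2a, mean m,
    precision a*c/(b*(c+1)), as in Eq. (11)."""
    nu, prec = 2.0 * a, a * c / (b * (c + 1.0))
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            + 0.5 * math.log(prec / (math.pi * nu))
            - (nu + 1) / 2 * math.log1p(prec * (x - m) ** 2 / nu))
```

With no assigned observations, `ng_posterior` returns the prior parameters, so the same `log_student_t` call yields the new-cluster marginal (12).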
In summary, the nonparametric Bayesian probabilistic model we will use is (see Figure 1):
(14) $z \sim \mathrm{CRP}\left(\alpha,N\right), \qquad x_{id} \mid z_{i}=k,\mathbf{x}_{k,-i} \sim \mathrm{St}\left(m_{kd},\frac{a_{kd}c_{kd}}{b_{kd}\left(c_{kd}+1\right)},2a_{kd}\right) \text{ for } d=1,\dots,D$
3.2 MAP-DPM inference algorithm
MAP inference applied to the probabilistic model above involves finding the mode of each individual Gibbs step described in (9)–(10). For observation $i$, we compute the negative log probability for each existing cluster $k$ and for a new cluster $K+1$:
(15) $q_{ik} = -\ln N_{k,-i} - \ln p\left(x_{i} \mid \mathbf{x}_{k,-i}\right), \qquad q_{i,K+1} = -\ln\alpha - \ln p\left(x_{i}\right)$
omitting quantities independent of $k$ (detailed expressions are given in Appendix A.2). For each observation $i$ we compute the above $\left(K+1\right)$-dimensional vector and select the cluster number of its smallest element:
$z_{i} = \arg\min_{k\in\left\{1,\dots,K+1\right\}} q_{ik}$
The algorithm proceeds to the next observation by updating the component statistics to reflect the new value of $z_{i}$ and removing the effect of the next data point. To check convergence of the algorithm we compute the complete data likelihood:
(16) $p\left(\mathbf{x},z\right) = p\left(z\right)\prod_{i=1}^{N}\prod_{k=1}^{K}p\left(x_{i} \mid \mathbf{x}_{k,-i}\right)^{\delta\left(z_{i},k\right)}$
where $\delta$ is the Kronecker delta function and $p\left(z\right)$ is defined in (1). The negative log of this quantity (the negative log likelihood, NLL) is used to assess convergence as described in Algorithm 1. ICM is guaranteed never to increase the NLL at any iteration step, and therefore the MAP-DPM algorithm will converge to a fixed point (Welling and Kurihara, 2006). The susceptibility of the algorithm to local minima can be alleviated using multiple restarts with random parameter initializations. The existence of a closed-form, non-degenerate likelihood (unlike in small variance asymptotic approaches) can be used to estimate the model parameters, such as the concentration parameter, and to perform model selection (see Appendix A.3). We can also use techniques such as cross-validation by computing an out-of-sample likelihood, which we now discuss.
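A single ICM sweep then looks as follows. To keep the sketch short we substitute a fixed-variance Gaussian predictive with a conjugate prior on the mean for the Student-T marginals of (11)–(12); the assignment logic of (15) is unchanged, and all names here are our own:

```python
import math

def neg_log_pred_gauss(x, members, s2=1.0, m0=0.0, v0=10.0):
    """Simplified predictive: known variance s2 and a N(m0, v0) prior on the
    cluster mean (a stand-in for the Student-T marginals of the paper)."""
    prec = 1.0 / v0 + len(members) / s2          # posterior precision of the mean
    mean = (m0 / v0 + sum(members) / s2) / prec  # posterior mean
    var = s2 + 1.0 / prec                        # predictive variance for x
    return 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def map_dp_sweep(X, z, alpha, neg_log_pred=neg_log_pred_gauss):
    """One ICM sweep: each point moves to the cluster minimising its negative
    log assignment probability; a new cluster is opened when that choice wins."""
    for i, x in enumerate(X):
        z[i] = -1  # remove point i before computing leave-one-out statistics
        labels = sorted({k for k in z if k >= 0})
        costs = []
        for k in labels:
            members = [X[j] for j in range(len(X)) if z[j] == k]
            costs.append(-math.log(len(members)) + neg_log_pred(x, members))
        costs.append(-math.log(alpha) + neg_log_pred(x, []))  # new cluster
        best = min(range(len(costs)), key=costs.__getitem__)
        z[i] = labels[best] if best < len(labels) else max(labels, default=-1) + 1
    return z
```

Iterating the sweep until the labeling stops changing gives a fixed point of the ICM updates; in the full algorithm the NLL of (16) is monitored instead.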
3.3 Out-of-sample prediction
To compute the out-of-sample likelihood for a new observation $x^{\star}$ we can consider two approaches that differ in how the indicator $z^{\star}$ is estimated. Firstly, the unknown indicator may be integrated out, resulting in a mixture density:
(17) $p\left(x^{\star} \mid z\right) = \sum_{k=1}^{K}\frac{N_{k}}{N+\alpha}\,p\left(x^{\star} \mid \mathbf{x}_{k}\right) + \frac{\alpha}{N+\alpha}\,p\left(x^{\star}\right)$
where we have omitted the dependency on the training observations $\mathbf{x}$. The assignment probability is $\frac{N_{k}}{N+\alpha}$ for an existing cluster and $\frac{\alpha}{N+\alpha}$ for a new cluster. The second term in each product corresponds to the Student-T marginal in (11) and (12) for an existing and a new cluster, respectively. Alternatively, we can use a point estimate for $z^{\star}$ by picking the minimum negative log posterior of the indicator, or equivalently:
(18) $z^{\star} = \arg\max_{k\in\left\{1,\dots,K+1\right\}} p\left(z^{\star}=k \mid x^{\star},z\right)$
The first (marginalization) approach is used in Blei and Jordan (2004) and is more robust as it incorporates the probability mass of all cluster components. However, the second (modal) approach can be useful in cases where only a point prediction is needed, such as when computing the test set normalized mutual information (see Section 4.1).
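Computed in log space for numerical stability, the mixture predictive of (17) is a log-sum-exp over the per-cluster terms. A minimal sketch (the caller supplies the log marginal densities; function names are ours):

```python
import math
from collections import Counter

def log_predictive_mixture(counts, alpha, log_marg_existing, log_marg_new):
    """log p(x*) for a new point: existing cluster k has weight N_k/(N+alpha),
    a new cluster has weight alpha/(N+alpha); the log marginal densities of x*
    under each cluster (and under the prior) are supplied by the caller."""
    n = sum(counts.values())
    terms = [math.log(c / (n + alpha)) + log_marg_existing[k]
             for k, c in counts.items()]
    terms.append(math.log(alpha / (n + alpha)) + log_marg_new)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

For the modal prediction (18), one would instead return the index of the largest term rather than their log-sum.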
4 Experiments
4.1 Synthetic CRP parameter estimation
We examine the performance of the MAP-DPM algorithm in terms of discovering the partitioning distribution, and the computational effort needed to do this. We generate 100 samples from a two-dimensional CRP mixture model (3) with diagonal precision matrices (an example is shown in Figure 2). The partitioning is sampled from a CRP with fixed concentration parameter $\alpha$ and data size $N$. Gaussian component parameters are sampled from an NG prior with fixed parameters. As expected when using a CRP prior, the sizes of the different clusters vary significantly, with many small clusters containing only a few observations.
We fit a CRP mixture model with integrated-out component parameters (14) by MAP and integrated-out Gibbs inference, using the known, ground-truth values for the NG prior and $\alpha$ used to generate the data. Convergence of the Gibbs algorithm is tested using the Raftery diagnostic (Raftery and Lewis, 1992). We use a high convergence acceptance tolerance to obtain less conservative estimates of the number of iterations required. As there is no commonly accepted way to check the convergence of MCMC algorithms, our comparison is by necessity somewhat arbitrary, but we believe the choices we have made are realistic and useful conclusions may be drawn from the comparison.
We measure clustering estimation accuracy using the (sum) normalized mutual information $\mathrm{NMI}\left(z,\hat{z}\right) = \frac{2\,I\left(z;\hat{z}\right)}{H\left(z\right)+H\left(\hat{z}\right)}$ between the ground truth clustering $z$ and the estimate $\hat{z}$ (Vinh et al., 2010), where $H$ is the entropy and $I$ the mutual information. NMI lies in the range $\left[0,1\right]$, with higher values signifying closer agreement between the clusterings (see Appendix A.4 for other reported measures, e.g. the adjusted mutual information (Vinh et al., 2010)).
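The sum-normalized NMI above can be computed directly from two label vectors; a small self-contained sketch (in practice a library routine such as scikit-learn's `normalized_mutual_info_score` can be used instead):

```python
import math
from collections import Counter

def nmi(u, v):
    """Sum-normalised mutual information: 2*I(U;V) / (H(U) + H(V))."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum(c / n * math.log(c * n / (pu[a] * pv[b]))
             for (a, b), c in puv.items())
    hu = -sum(c / n * math.log(c / n) for c in pu.values())
    hv = -sum(c / n * math.log(c / n) for c in pv.values())
    return 2 * mi / (hu + hv)
```

The measure is invariant to permutations of the cluster labels, which is why labels and class assignments can be compared directly.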
                 Gibbs          MAP-DPM
NMI              0.67 (0.23)    0.71 (0.23)
Iterations       2,020 (900)    13.3 (13.9)
CPU time (secs)  7,300 (4,000)  28 (32)
ΔK               0.91 (7.12)    6.94 (6.93)
Empty clusters   5,900 (3,600)  10.33 (22.1)
Test set NMI     0.72 (0.20)    0.71 (0.23)
Table 1: Mean and two standard deviations (in brackets) reported across the 100 CRP mixture samples. The quantity ΔK is the difference between the estimated number of clusters and the known number of clusters. The range of the normalized mutual information (NMI) is [0, 1], with higher values reflecting lower clustering error.
In Table 1 a range of performance metrics for the MAP-DPM and Gibbs algorithms is shown. Both MAP-DPM and Gibbs achieve similar clustering performance in terms of both training and test set NMI. To assess out-of-sample performance, another set of observations was sampled from each CRP mixture sample and the out-of-sample point prediction calculated for each model (Section 3.3). The Gibbs sampler requires, on average, approximately 152 times as many iterations to converge as MAP-DPM, as reflected in the CPU times. Also, the number of empty clusters created in Gibbs is higher than in MAP-DPM due to the larger number of iterations required; an effective Gibbs implementation would therefore need to handle empty clusters efficiently.
When examining the number of clusters (Table 1), Gibbs is closest to the ground truth whilst MAP-DPM produces significant underestimates. In Figure 3 the median result is shown in terms of the partitioning and the number of clusters. MAP-DPM fails to identify the smaller clusters, whereas the Gibbs sampler is able to do so to a much greater extent. This is a form of underfitting, where the MAP algorithm captures the mode of the partitioning distribution but fails to put enough mass on the tails (the smaller clusters); it may also be described as an underestimation of the variance of the partitioning distribution. The NMI scores do not reflect this effect as the impact of the smaller clusters on the overall measure is minimal.
4.2 UCI datasets
We compare the DP-means, MAP-DPM and Gibbs algorithms on seven UCI datasets and assess their performance using the same NMI measure as in Section 4.1. Class labels in the datasets are treated as cluster numbers. (We do not assess the “Car” and “Balance scale” datasets used in Kulis and Jordan (2012) because they consist of a complete enumeration of 6 and 4 categorical factors respectively, and it is not meaningful to apply an unsupervised clustering algorithm in such a setting.) As in Section 4.1, the Gibbs sampler is stopped using the Raftery diagnostic (Raftery and Lewis, 1992). For DP-means, we choose $\lambda$ to give the true number of clusters in the corresponding dataset (Kulis and Jordan, 2012). The NG prior in the Gibbs and MAP-DPM algorithms is set empirically: the prior mean $m_{0}$ is set to the sample mean of the data, $c_{0}$ to a scaled version of the value proposed in Kass and Wasserman (1995), and each dimension of $b_{0}$ to the sample variance of the data in the corresponding dimension. The concentration parameter $\alpha$ is selected by minimizing the NLL (Section 3) in the MAP-DPM algorithm across a discrete set of candidate values, and is reused in the Gibbs algorithm. For the Gibbs algorithm, we compute the mean and two standard deviations of the NMI score across all samples (Table 2).
               DP-means   Gibbs         MAP-DPM
Wine           0.42       0.72 (0.06)   0.86
Iris           0.76       0.75 (0.06)   0.76
Breast cancer  0.75       0.70 (0.01)   0.71
Soybean        0.36       0.49 (0.00)   0.40
Parkinson’s    0.02       0.13 (0.02)   0.12
Pima           0.03       0.07 (0.01)   0.07
Vehicle        0.21       0.15 (0.02)   0.15
Table 2: NMI scores on the UCI datasets (mean and, for Gibbs, two standard deviations in brackets).
On almost all of the datasets (6 out of 7), MAP-DPM is comparable to, or better than, the Gibbs sampler, and on 5 out of 7 datasets it performs better than DP-means (Table 2). DP-means performs well on lower-dimensional datasets with a small number of clusters. In higher dimensions, clusters are more likely to be elliptical rather than spherical, and in such cases the other algorithms outperform DP-means because of their more flexible model assumptions. In addition, for higher-dimensional data it is more often the case that the different features have different numerical scales, so the squared Euclidean distance used in DP-means is inappropriate. Furthermore, MAP-DPM and the Gibbs sampler are more robust to smaller clusters due to the longer tails of the Student-T distribution and the rich-get-richer effect of existing clusters assigned many observations. DP-means is especially sensitive to geometric outliers and can easily produce excessive numbers of spurious clusters for poor choices of $\lambda$.
Even though MAP-DPM gives only a point estimate of the full Gibbs distribution, it can in practice achieve higher NMI scores. This can occur both because of Gibbs convergence issues, and because we take the average NMI across all Gibbs samples, whereas MAP-DPM may correspond to a Gibbs sample with much higher NMI than average. For instance, in the Wine dataset (178 observations, 13 dimensions), the NMI across different Gibbs samples varies considerably, and the MAP-DPM NMI score is close to the highest achieved across all Gibbs samples. In the Soybean dataset (266 observations, 35 dimensions), visual inspection of the Gibbs samples revealed slow Markov chain mixing, and even after 10,000 iterations the samples had not converged. The sparseness of the data in such a high-dimensional space makes this a particularly challenging clustering problem, and a more sophisticated MCMC sampling method would likely be required in practice.
We emphasize that these algorithms attempt to maximize the model fit rather than to maximize NMI. The true labels would not be available in practice, and it is not always the case that maximizing the likelihood also maximizes NMI. Furthermore, if we choose the NG parameters for each dataset separately, by minimizing the negative log likelihood with respect to each parameter, higher NMI can be achieved, but choosing empirical estimates for the model parameters simplifies the computations.
               DP-means   Gibbs     MAP-DPM
Wine           19         2,365     11
Iris           8          1,543     5
Breast cancer  8          939       8
Soybean        14         10,000+   9
Parkinson’s    1,000+     1,307     13
Pima           20         1,189     17
Vehicle        12         939       9
Table 3: Number of iterations required for convergence on the UCI datasets.
In all cases, the MAP-DPM algorithm converges more rapidly than the other algorithms (Table 3). The Gibbs sampler takes, on average, approximately two orders of magnitude more iterations than MAP-DPM to achieve comparable NMI scores. The computational complexity per iteration for Gibbs and MAP-DPM is comparable, as both require the computation of the same quantities. This makes the Gibbs sampler significantly less efficient than MAP-DPM at finding a good labeling for the data. The cost per iteration for DP-means can often be considerably smaller than for MAP-DPM or the Gibbs sampler, as one iteration often does not include a sweep through all of the data points: unlike in MAP-DPM and Gibbs, the sweep ends whenever a new cluster has to be created. But this also implies that DP-means requires more iterations to converge than MAP-DPM.
5 Discussion and future directions
We have presented a simple algorithm for inference in DPMs based on non-degenerate MAP estimation, and demonstrated its efficiency and accuracy by comparison to the ubiquitous Gibbs sampler and to a simple alternative, the small variance asymptotic approach. We believe our approach is highly relevant to applications since, unlike the small variance approach, it retains the preferential attachment (rich-get-richer) property while needing two orders of magnitude fewer iterations than Gibbs. Unlike the asymptotic approach, an out-of-sample likelihood may be computed, allowing the use of standard model selection and model fit diagnostic procedures. Lastly, the non-degenerate MAP approach requires no factorization assumptions for the model distribution, unlike VB.
As with all MAP methods, the algorithm can get trapped in local minima; however, standard heuristics such as multiple random restarts can be employed to mitigate this risk. This would increase the total computational cost of the algorithm somewhat, but even with random restarts it would still be far more efficient than the Gibbs sampler.
Although not reported here due to space limitations, we point out that different implementations of the Gibbs sampler can lead to different MAP inference algorithms for the DPM models which naturally arise from this probabilistic graphical model structure (Neal, 2000). In general, we have found the resulting alternative algorithms to be less robust in practice, as they retain the Gaussian likelihood over the observations given the cluster indicators. If such assumptions are justified, however, then our MAP approach can be readily applied to these models as well, for example where non-conjugate priors are appropriate.
The generality and simplicity of our approach make it reasonable to adapt it to other Bayesian nonparametric mixture models, for example the Pitman-Yor process which generalizes the CRP (Pitman and Yor, 1997). The MAP approach can also be readily applied to hierarchical Bayesian nonparametric models such as the hierarchical DP (Teh et al., 2006) and the nested DP (Rodriguez et al., 2008). Another useful direction, for large-scale datasets in particular, would be to extend our approach to perform inference that does not need to sweep through the entire dataset in each iteration, for increased efficiency (Welling and Teh, 2011).
References
 Bousquet and Bottou [2008] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
 Ferguson [1973] Thomas S Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 03 1973. doi: 10.1214/aos/1176342360. URL http://dx.doi.org/10.1214/aos/1176342360.
 Barabási and Albert [1999] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 Blackwell and MacQueen [1973] David Blackwell and James B MacQueen. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1(2):353–355, 1973.
 Ishwaran and James [2001] Hemant Ishwaran and Lancelot F James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453), 2001.
 Geman and Geman [1984] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
 Neal [2003] Radford M Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003. doi: 10.1214/aos/1056562461.

 Blei and Jordan [2004] David M Blei and Michael I Jordan. Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 12. ACM, 2004.
 Bishop [2006] Christopher M Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.
 Kulis and Jordan [2012] Brian Kulis and Michael I Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 513–520, 2012.
 Jiang et al. [2012] Ke Jiang, Brian Kulis, and Michael I Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages 3158–3166, 2012.
 Broderick et al. [2013] Tamara Broderick, Brian Kulis, and Michael I Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. Proceedings of the 30th International Conference on Machine Learning (ICML), 28(3):226–234, 2013.
 Kittler and Föglein [1984] Josef Kittler and J Föglein. Contextual classification of multispectral pixel data. Image and Vision Computing, 2(1):13–29, 1984.
 West et al. [1994] Mike West, Peter Muller, and Michael D Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. Aspects of Uncertainty, 1:363–386, 1994.
 Neal [2000] Radford M Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
 Welling and Kurihara [2006] Max Welling and Kenichi Kurihara. Bayesian KMeans as a "MaximizationExpectation" algorithm. In SDM, pages 474–478. SIAM, 2006.
 Blackwell [1947] David Blackwell. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics, 18(1):105–110, 03 1947. doi: 10.1214/aoms/1177730497. URL http://dx.doi.org/10.1214/aoms/1177730497.
 Casella and Robert [1996] George Casella and Christian P Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
 Raftery and Lewis [1992] Adrian E Raftery and Steven Lewis. How many iterations in the Gibbs sampler? Bayesian Statistics, 4(2):763–773, 1992.
 Vinh et al. [2010] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.
 Kass and Wasserman [1995] Robert E Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
 Pitman and Yor [1997] Jim Pitman and Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997. doi: 10.1214/aop/1024404422. URL http://dx.doi.org/10.1214/aop/1024404422.
 Teh et al. [2006] Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
 Rodriguez et al. [2008] Abel Rodriguez, David B Dunson, and Alan E Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483), 2008.
 Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011.
 Rasmussen [1999] Carl E Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pages 554–560, 1999.
Appendix A
A.1 Model and prior
The mixture model, given the component means, precisions and weights for each component, is:
(19) $p(x_i \mid \boldsymbol{\mu}, \mathbf{T}, \boldsymbol{\pi}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x_i \mid \mu_k, T_k^{-1}\right)$
where $K$, the number of clusters, becomes infinite in the DPM model. The variable $\mu_k$ is a $D$-dimensional vector of cluster means for cluster $k$, and $T_k$ is a diagonal matrix of precisions, $T_k = \operatorname{diag}(\tau_{k1}, \dots, \tau_{kD})$. This makes the covariance diagonal, so the normal can be written as a product of univariate normals, $\mathcal{N}(x_i \mid \mu_k, T_k^{-1}) = \prod_{d=1}^{D} \mathcal{N}(x_{id} \mid \mu_{kd}, \tau_{kd}^{-1})$.
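The product-of-univariate-normals factorisation can be made concrete with a short Python sketch for a finite number of components (all function and variable names here are our own, not from the paper):

```python
import math

def diag_normal_logpdf(x, mu, tau):
    """log N(x | mu, diag(tau)^-1): with a diagonal precision matrix the
    multivariate normal factorises into a product of univariate normals."""
    return sum(0.5 * (math.log(t / (2 * math.pi)) - t * (xd - md) ** 2)
               for xd, md, t in zip(x, mu, tau))

def mixture_logpdf(x, weights, mus, taus):
    """log of sum_k pi_k * N(x | mu_k, T_k^-1), computed with the
    log-sum-exp trick for numerical stability."""
    logs = [math.log(w) + diag_normal_logpdf(x, m, t)
            for w, m, t in zip(weights, mus, taus)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))
```

In the DPM the number of components is unbounded, which is why the collapsed representation below, rather than this explicit finite mixture, is used for inference.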
We place the conjugate normal-Gamma prior over each dimension of the cluster means and precisions:
(20) $p(\mu_{kd}, \tau_{kd}) = \mathcal{N}\!\left(\mu_{kd} \mid \mu_0, (c_0 \tau_{kd})^{-1}\right) \mathrm{Gamma}(\tau_{kd} \mid a_0, b_0)$
Then, we integrate out the cluster parameters of the model, which is analytically tractable because of the conjugate priors. This results in a collapsed model structure in which the probability of an observation is a mixture of Student-t likelihoods:
(21) $p(x_i \mid \mathbf{x}_{-i}, \mathbf{z}_{-i}) = \sum_{k=1}^{K} \frac{N_{k,-i}}{N - 1 + \alpha} \prod_{d=1}^{D} \mathrm{St}\!\left(x_{id} \mid \mu_{kd}^{-i}, \lambda_{kd}^{-i}, 2a_{kd}^{-i}\right) + \frac{\alpha}{N - 1 + \alpha} \prod_{d=1}^{D} \mathrm{St}\!\left(x_{id} \mid \mu_0, \lambda_0, 2a_0\right)$
where the parameters of the Student-t are functions of the normal-Gamma posteriors defined in the paper, with the effect of observation $i$ removed (denoted by the superscript $-i$): the location is $\mu_{kd}^{-i}$, the precision is $\lambda_{kd}^{-i} = a_{kd}^{-i} c_{kd}^{-i} / \left(b_{kd}^{-i} (c_{kd}^{-i} + 1)\right)$, and the degrees of freedom are $2a_{kd}^{-i}$; $\lambda_0$ is defined analogously from the prior hyperparameters. The effect of the current observation must be removed from the corresponding conditionals because integrating out the cluster parameters introduces a dependency between the cluster indicators (see Algorithm 3 in Neal [2000]).
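The per-dimension predictive obtained by integrating a univariate normal against its normal-Gamma posterior can be sketched in Python as follows; the parameterisation (location $\mu$, precision $ac/(b(c+1))$, degrees of freedom $2a$) is the standard conjugate result, and the function names are our own:

```python
import math

def student_t_logpdf(x, mu, lam, nu):
    """Log density of a Student-t with location mu, precision lam and
    nu degrees of freedom."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            + 0.5 * math.log(lam / (math.pi * nu))
            - (nu + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / nu))

def collapsed_log_predictive(x, mu, c, a, b):
    """Log predictive density of one dimension of an observation after
    integrating out (mean, precision) against a normal-Gamma distribution
    with parameters (mu, c, a, b): a Student-t with location mu,
    precision a*c / (b*(c+1)) and 2*a degrees of freedom."""
    return student_t_logpdf(x, mu, a * c / (b * (c + 1)), 2 * a)
```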
A.2 Cluster assignments
To complete the MAP algorithm we take the negative logarithm of the assignment probabilities to arrive at (15). For an observation $x_i$, we compute the negative log probability for each existing cluster, ignoring constant terms:
(22) $d_{ik} = -\log N_{k,-i} - \sum_{d=1}^{D} \log \mathrm{St}\!\left(x_{id} \mid \mu_{kd}^{-i}, \lambda_{kd}^{-i}, 2a_{kd}^{-i}\right)$
Similarly, for a new cluster, whose Student-t parameters come from the prior hyperparameters (with $\lambda_0 = a_0 c_0 / (b_0 (c_0 + 1))$):
(23) $d_{i,K+1} = -\log \alpha - \sum_{d=1}^{D} \log \mathrm{St}\!\left(x_{id} \mid \mu_0, \lambda_0, 2a_0\right)$
For each observation we compute the above $(K+1)$-dimensional vector and select the index of its smallest element:
(24) $z_i = \arg\min_{k \in \{1, \dots, K+1\}} d_{ik}$
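The assignment step in (22)–(24) can be sketched as follows; the per-cluster log-likelihoods are taken as inputs so the sketch stays independent of the Student-t details, and all names are our own:

```python
import math

def map_assign(loglik_existing, counts, loglik_new, alpha):
    """One MAP assignment step: return the index minimising the negative
    log assignment probability, as in (22)-(24).

    loglik_existing -- log p(x_i | cluster k) for each existing cluster k
    counts          -- cluster sizes N_{k,-i} with observation i removed
    loglik_new      -- log predictive of x_i under the prior (new cluster)
    alpha           -- DP concentration parameter

    Returns an index in 0..K, where index K means 'open a new cluster'."""
    costs = [-(math.log(n) + ll)
             for ll, n in zip(loglik_existing, counts)]
    costs.append(-(math.log(alpha) + loglik_new))
    return min(range(len(costs)), key=costs.__getitem__)
```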
Algorithm 2 describes a fast-update version of the method, in which the normal-Gamma (NG) statistics are updated by removing the effect of a single point rather than reprocessing the entire dataset.
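One way to realise such fast updates is to cache, per cluster and dimension, the sufficient statistics $(n, \sum x, \sum x^2)$ and recompute the posterior parameters from them; this is our own minimal formulation, not necessarily the paper's Algorithm 2:

```python
def posterior_params(n, s, q, mu0, c0, a0, b0):
    """Normal-Gamma posterior (mu, c, a, b) for one dimension, computed
    from the cached sufficient statistics n, s = sum(x), q = sum(x^2)."""
    c = c0 + n
    mu = (c0 * mu0 + s) / c
    a = a0 + n / 2
    # Equivalent to b0 + 0.5*sum((x - xbar)^2) + c0*n*(xbar - mu0)^2 / (2*(c0 + n)),
    # rewritten in terms of the raw moments s and q.
    b = b0 + 0.5 * (q + c0 * mu0 ** 2 - c * mu ** 2)
    return mu, c, a, b

def add_point(stats, x):
    """Include one observation in the cached statistics in O(1)."""
    n, s, q = stats
    return n + 1, s + x, q + x * x

def remove_point(stats, x):
    """Remove the effect of one observation in O(1)."""
    n, s, q = stats
    return n - 1, s - x, q - x * x
```

Removing a point and recomputing the posterior is then constant-time per dimension, instead of linear in the cluster size.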
A.3 Learning the concentration parameter
We have considered the following approaches to infer the concentration parameter $\alpha$:

Cross-validation. Considering a finite set of candidate values for $\alpha$, choose the value with the smallest average out-of-sample negative log likelihood across all cross-validation folds. This approach is taken in Blei and Jordan [2004] to compare different inference methods.

Multiple restarts. Compute the NLL at convergence for each candidate value of $\alpha$, and pick the $\alpha$ corresponding to the smallest NLL at convergence. This is the approach taken in the UCI experiment section of the paper.

MAP estimate. Compute the posterior over $\alpha$ and numerically locate its mode. An example posterior is given in Rasmussen [1999]:
(25) $p(\alpha \mid K, N) \propto \alpha^{K - 3/2} \exp\!\left(-\tfrac{1}{2\alpha}\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$
where $K$ is the number of existing clusters. This posterior arises from the choice of an inverse Gamma prior on the concentration parameter and the likelihood $p(K \mid \alpha, N) \propto \alpha^{K} \Gamma(\alpha) / \Gamma(N + \alpha)$. We generalize this calculation by using a $\mathrm{Gamma}(\alpha \mid a, b)$ prior:
(26) $p(\alpha \mid K, N) \propto \alpha^{K + a - 1} e^{-b\alpha} \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$
We numerically minimize the negative log of this posterior using Newton's method. To ensure the solution is positive we optimize with respect to $\log \alpha$: as Rasmussen [1999] notes, the posterior over $\log \alpha$ is log-concave and therefore has a unique maximum (so its negative log has a unique minimum), where $K$ is the number of nonempty (represented) components.
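A sketch of this optimisation in Python: we search over $\theta = \log \alpha$ so the solution is automatically positive, and, since the objective is unimodal in $\theta$, we use a simple ternary search in place of Newton's method. The hyperparameter defaults $a = b = 1$ and the search interval are our own choices:

```python
import math

def log_post(alpha, K, N, a=1.0, b=1.0):
    """Unnormalised log posterior of the DP concentration, as in (26):
    Gamma(a, b) prior times the likelihood alpha^K Gamma(alpha)/Gamma(N+alpha)."""
    return ((K + a - 1) * math.log(alpha) - b * alpha
            + math.lgamma(alpha) - math.lgamma(N + alpha))

def map_alpha(K, N, a=1.0, b=1.0, lo=-10.0, hi=10.0, iters=200):
    """Locate the posterior mode by searching over theta = log(alpha),
    which keeps alpha positive; ternary search exploits unimodality."""
    f = lambda t: log_post(math.exp(t), K, N, a, b)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return math.exp((lo + hi) / 2)
```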
In practice we found the third approach least effective, due to the presence of local minima in the MAP estimation. The second approach is the simplest to apply, but it can be prone to overfitting on small datasets, where we recommend the cross-validation approach instead.
A.4 CRP experiment
We provide more details on the CRP experiment presented in the paper. In Tables 4 and 5 we include the maximum NMI and the adjusted mutual information (AMI) score, which corrects for chance effects when comparing clusterings by penalizing partitions with larger numbers of clusters [Vinh et al., 2010]. To assess out-of-sample accuracy we also include, in Table 5, the average leave-one-out negative log likelihood discussed in Section 3.3 of the paper. All metrics are similar for the Gibbs and MAP-DPM algorithms, reflecting the good clustering performance of the MAP algorithm.
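For reference, the "NMI sum" and "NMI max" variants reported below differ only in the normaliser: mutual information divided by $(H(A)+H(B))/2$ versus by $\max(H(A), H(B))$. A minimal pure-Python sketch of both (our own code; in practice a library implementation would normally be used):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b, norm="max"):
    """Normalised mutual information between two clusterings.
    norm='max' divides MI by max(H(A), H(B)); norm='sum' divides by the
    arithmetic mean (H(A) + H(B)) / 2."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(c * n / (pa[i] * pb[j]))
             for (i, j), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    denom = max(ha, hb) if norm == "max" else (ha + hb) / 2
    return mi / denom if denom > 0 else 1.0
```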
Measure            | Gibbs          | MAP
-------------------|----------------|------------
NMI sum            | 0.67 (0.23)    | 0.71 (0.23)
NMI max            | 0.64 (0.24)    | 0.64 (0.27)
AMI                | 0.62 (0.25)    | 0.62 (0.27)
Iterations         | 2,000 (900)    | 13 (14)
CPU time (secs)    | 7,300 (4,000)  | 28 (32)
                   | 0.91 (7.1)     | 6.9 (6.9)
Empty clusters     | 6,000 (3,600)  | 14 (28)
Measure              | Gibbs         | MAP
---------------------|---------------|------------
NMI sum              | 0.72 (0.20)   | 0.71 (0.23)
NMI max              | 0.68 (0.22)   | 0.64 (0.27)
AMI                  | 0.67 (0.23)   | 0.62 (0.27)
NLL (leave-one-out)  | 7.17 (1.10)   | 7.20 (1.11)
In Figure 4 we also show the 5th and 95th quantiles of the clustering distribution for the CRP ground truth and for the MAP and Gibbs algorithms. The effect discussed in the paper is present at both extreme quantiles, with the MAP algorithm consistently underestimating the total number of clusters by failing to identify the smaller clusters.