Bayesian nonparametric (BNP) models have been successfully applied on a wide range of domains but despite significant improvements in computational hardware, statistical inference in most BNP models remains infeasible in the context of large datasets. The flexibility gained by such models is paid for with severe decreases in computational efficiency, and this makes these models somewhat impractical. Therefore, there is an emerging need for approaches that simultaneously minimize both empirical risk and computational complexity (Bousquet and Bottou, 2008)
. Towards that end we present a simple, statistically rigorous and computationally efficient approach for the estimation of BNP models based on maximum a-posteriori (MAP) inference and concentrate on inference inDirichlet process mixtures (DPMs).
DPMs are mixture models which use the Dirichlet Process (DP) (Ferguson, 1973)
as a prior over the parameters of the distribution of some random variable. The random variable has a distribution with a potentially infinite number of mixture components. The DP is an adaptation of the discrete Dirichlet distribution to the infinite, uncountable sample space. A draw from a DP is itself a density function. A DP is the Bayesian conjugate density to the empirical probability density function, much as the discrete Dirichlet distribution is conjugate to the categorical distribution. Hence, DPs have value in Bayesian probabilistic models because they are priors over completely general density functions. This is one sense in which DPMs are nonparametric.
An additional, interesting property of DP-distributed density functions is that they are discrete in the following sense: they are formed of an infinite, but countable mixture of Dirac delta functions. Since the Dirac has zero measure, the support of the density function is also countable. This discreteness means that draws from such densities have a non-zero probability of being repeats of previous draws. Furthermore, the more often a sample is repeated, the higher the probability of that sample being drawn again – an effect known as the “rich get richer” property (known as preferential attachment in the network science literature (Barabási and Albert, 1999)). This repetition, coupled with preferential attachment, leads to another valuable property of DPs: samples from DP-distributed densities have a strong clustering property whereby draws can be partitioned into representative draws, where and is not fixed a-priori.
Practical methods to deal with drawing densities from DPs, and samples from these densities, revolve around three equivalent constructions: the (generalized) Pólya urn scheme (Blackwell and MacQueen, 1973) allowing draws of samples directly from the DP, the stick-breaking method (Ishwaran and James, 2001) which creates explicit representations of DP-distributed densities, and the Chinese restaurant process (CRP) which defines exchangeable conditional distributions over partitions of the draws from the DP, defined by the
representatives. All three constructions lead to practical stochastic sampling inference schemes for DPMs. Sampling inference is most often conducted using Markov Chain Monte Carlo (MCMC) methods, e.g. Gibbs(Stuart Geman, 1984) and slice (Neal, 2003) samplers. Here we discuss only the Gibbs sampler as it is the “starting point” for the development of the novel algorithm of this paper.
Approximate inference algorithms that scale better with data size include variational Bayes (VB) schemes for DPMs (Blei and Jordan, 2004). However, the factorization assumption required leads to biased estimates of the posterior of interest. Further, VB algorithms can become trapped in local minima and often underestimate the variance of the quantities of interest (Bishop, 2006, page 462). VB for DPMs also truncate the infinite number of components in the DPM which causes additional approximation error. More importantly, even with this truncation, closed form maximization steps for the different DPM quantities are rarely obtained so that iterative optimization steps are required. Although DPM VB often converges more quickly than MCMC, it usually requires high computational effort for each step. Finally, obtaining the optimization schedule for the variational distribution with respect to the auxiliary variables involved is often a challenging task.
Small variance asymptotics create degenerate, point mass Dirac distributions in the probabilistic model to devise simplified inference algorithms. The DP-means algorithm (Kulis and Jordan, 2012) is derived from a Gaussian DPM Gibbs sampler by shrinking the component covariances to zero (further discussed in Section 2.3). The approach was later extended to exponential family distributions by Jiang et al. (2012). Later Broderick et al. (2013) uses DP-means as a tool for finding the MAP solution of the degenerate complete likelihood for a DPM, and applies the same principle to Bayesian nonparametric latent feature models. However, in using a degenerate likelihood, some of the defining properties of the DPM, for example the prior over the partition (see Section 2.1), are lost. In this work, we present an algorithm for finding the solution of the MAP problem posed in Broderick et al. (2013) without resorting to a degenerate likelihood. This enables the algorithm to be more faithful to inference in the corresponding probabilistic model, and also allows the use of standard rigorous tools such as out-of-sample prediction for cross-validation.
We concentrate on inference in DP mixtures and show how the CRP may be exploited to produce simplified MAP inference algorithms for DPMs. Similar to DP-means it provides only point estimates of the joint posterior. However, while DP-means follows the close relationship between
-means and the finite Gaussian mixture model (GMM) to derive a “nonparametric-means”, we exploit the concept of iterated conditional modes (ICM) (Kittler and Föglein, 1984).
After reviewing the CRP (Section 2.1) and DPM (Section 2.2), we discuss the DP-means algorithm highlighting some of its deficiencies in Section 2.3 and we show how these can be overcome using non-degenerate MAP inference in Section 3. We compare the different approaches on synthetic and real datasets in Section 4 and we conclude with a discussion of future directions Section 5.
2.1 Chinese restaurant process (CRP)
The CRP is a discrete time stochastic process over the sample space of partitions, or equivalently can be thought as a probability distribution over cluster indicator variables. It is strictly defined by an integer(number of items) and a positive, real concentration parameter . A draw from a CRP has probability:
with indicators , where is the unknown number of items and is the number of indicators taking value with . For any finite we will have and usually will be much smaller than , so the CRP returns a partitioning of elements into some smaller number of groups . The probability over indicators is constructed in a sequential manner using the following conditional probability:
By increasing the value of from to
and using the corresponding conditional probabilities, we obtain the joint distribution over indicators from (1), . The stochastic process is often explained using the metaphor of customers sitting at tables at a Chinese restaurant, where the probability of customer sitting on a previously occupied table or a new table is given by (2).
2.2 The Dirichlet process Gaussian mixture model and Gibbs sampler
DPMs are popular nonparametric Bayesian models, related to the finite mixture model, but making additional assumptions that allow for greater flexibility. For illustration, consider the case where the mixture components are Gaussian with joint mean and precision parameters . We will denote using the full data matrix formed of the observed data points which are
-dimensional vectors. The Dirichlet process Gaussian mixture model (DPGMM) with collapsed mixture weights can be written:
for . Here, denotes a partition drawn from a CRP which implicitly constrains the indicators for data points. Variables and are, respectively, component means and precision matrices that are jointly normal-Wishart (NW) distributed. Here, indexes observations and is a prior density function over these parameters. Given the clustering property of the CRP, there will be draws from pointed to by cluster indicators , so that become the cluster parameters for cluster . Each observation is then Gaussian but sharing parameters with all the other observations in the same cluster i.e. for .
One simple and popular sampling algorithm for parameter inference in the DPM is CRP-based Gibbs sampling, first applied by West et al. (1994) and later discussed in Neal (2000). This MCMC algorithm alternates until convergence between the two stages of sampling the component indicators while holding the cluster parameters fixed, and sampling new cluster parameters while holding the cluster indicators fixed. The first stage samples the cluster indicators from the following conditional distribution:
where denotes the number of times an observation, apart from the observation , has been assigned to cluster . However, as this is a DPM there is always a finite probability of creating a new cluster, whereby we sample new cluster parameters and add a new value to the possible values for each indicator. The probability of creating a new cluster is:
The parameters are sampled from the posterior distribution defined by the prior and the single observation likelihood . Due to conjugacy, this posterior is NW with the prior parameters we have set for . In the second stage, we sample the new cluster parameters for each cluster from the posterior distribution over conditional on the updated , the prior , and the joint likelihood for all observations assigned to cluster , i.e. every where . This likelihood is a product of Gaussian distributions, and by conjugacy, the posterior distribution will be another NW.
2.3 Hard clustering via small variance asymptotics
Based on the CRP Gibbs sampler described above and with some simplifying assumptions, Kulis and Jordan (2012) describe a hard clustering algorithm that is closely related to -means clustering, but where can vary with the number of observations . This DP-means algorithm mirrors the well-known small variance asymptotic derivation of the -means algorithm from a simplification of the expectation-maximization (E-M) algorithm for the finite Gaussian mixture model (Bishop, 2006, page 423).
In DP-means, each Gaussian component is spherical with identical covariance , and the variance parameter is assumed known and hence fixed in the algorithm. This is equivalent to assuming in the DPGMM model (3). Then, since the cluster components have fixed covariances, the conjugate choice for the cluster means is Gaussian. To obtain closed form solutions Kulis and Jordan (2012) assume a zero mean Gaussian prior with covariance and fixed . They further assume a functional dependency between and the covariances, , for some new parameter . The probability of assigning observation to cluster becomes:
and the probability for creating a new cluster is:
Then, in the small variance asymptotic limit (as in -means) the probability over collapses to when has the smallest distance to ; or instead, the probability of creating a new cluster becomes when is smaller than any of these distances. Therefore, a new cluster is created if there are any observed data points for which is smaller than the distance from that data point to any existing component mean vector. If a new component is generated, it will have because in the small variance limit, the covariance of the posterior over becomes zero.
The component parameter update stage simplifies to the -means update, i.e. the means of each component are simply replaced by the mean of every observation assigned to that component. This occurs because by conjugacy the posterior over the component means is multivariate Gaussian and as the likelihood term dominates over the prior.
3 MAP-DPM using elliptical multivariate Gaussians
Although the DP-means algorithm presented above is straightforward, it has various drawbacks in practice. The most problematic is that the functional dependency between the concentration parameter and the covariances destroys the preferential attachment property of the DPM because the counts of assignments to components no longer influence which component gets assigned to an observed data point. Only the geometry in the data space matters. A new cluster is created by comparing the parameter against the distances between cluster centers and data points so that the number of clusters is controlled by the geometry alone, and not by the number of data points already assigned to each cluster. So, for high-dimensional datasets, it is not clear how to choose the parameter . By contrast, in the CRP Gibbs sampler for the DPM, the concentration parameter controls the rate at which new clusters are produced in a way which is largely independent of the geometry.
Another problem with small variance asymptotics is that the introduction of degenerate Dirac point masses causes likelihood comparisons to be no longer meaningful since the model likelihood becomes infinite. This means that we cannot readily choose parameters such as using standard model selection methods such as cross-validation.
Here, we propose a DPM inference algorithm based on iterated conditional modes (ICM, see Kittler and Föglein (1984) and also Bishop (2006, page 546)). This is also called the maximization-maximization (M-M) algorithm by Welling and Kurihara (2006). The basic idea is to use conditional modal point estimates rather than samples from the conditional probabilities used in Gibbs.
3.1 Probabilistic model overview
We will make the simplifying assumption that are diagonal, denoting the non-zero entries . So, the component distributions are a product of univariate Gaussians . Then we can replace the NW prior over the cluster parameters with the simpler product of normal-Gamma (NG) priors, retaining conjugacy (Appendix A.1). The prior for a specific cluster component is therefore:
The NG prior parameters need to be specified. As we are only interested in clustering, we will integrate out the cluster parameters (a process known as Rao-Blackwellization (Blackwell, 1947) which often leads to more efficient samplers (Cassella and Robert, 1994)). In the Gibbs sampler framework, this requires integrating out the cluster parameters from the conditional of the cluster indicators given the cluster parameters, . In the DPGMM Gibbs sampler, we assign observation to cluster with probability . The integration is tractable because of conjugacy. Integrating out the means and precisions from the corresponding Gaussian likelihoods of observation , we obtain the cluster assignment probabilities for assigning a point to an existing cluster or a new cluster , respectively:
where for brevity we have used the shorthand . The right hand side conditional probabilities are:
Here, denotes a Student-T distribution with mean , precision, and are the NG posterior parameters – which we call component statistics. For cluster and dimension , after removing the effect of the current observation (Neal, 2000):
The degrees of freedom of the Student-T marginal likelihoods
depend upon the number of observations in the corresponding component. The likelihood for smaller components will have heavier than Gaussian tails, while the likelihood for clusters assigned a large number of observations will be close to Gaussian. This makes the clustering more robust to outliers and penalizes creating clusters with few observations assigned to them.
In summary, the nonparametric Bayesian probabilistic model we will use is (see Figure 1):
3.2 MAP-DPM inference algorithm
MAP inference applied to the probabilistic model above involves finding the mode for each individual Gibbs step described in (9)-(10). For observation , we compute the negative log probability for each existing cluster and for a new cluster :
omitting quantities independent of (detailed expressions in the Appendix A.2). For each observation we compute the above -dimensional vector and select the cluster number of the smallest element from it:
The algorithm proceeds to the next observation by updating the component statistics to reflect the new value of and remove the effect of data point . To check convergence of the algorithm we compute the complete data likelihood:
where is the Kronecker delta function and is defined in (1). The negative log of this quantity (negative log likelihood, NLL) is used to assess convergence as described in Algorithm 1. ICM is guaranteed to never increase the NLL at each iteration step and therefore the MAP-DPM algorithm will converge to a fixed point (Welling and Kurihara, 2006). The susceptibility of the algorithm to local minima can be alleviated using multiple restarts with random parameter initializations. The existence of a closed-form, non-degenerate likelihood (unlike in small variance asymptotic approaches) can be used to estimate the model parameters, such as the concentration parameter, and perform model selection (see Appendix A.3). We can also use techniques such as cross-validation by computing an out-of-sample likelihood, which we now discuss.
3.3 Out-of-sample prediction
To compute the out-of-sample likelihood for a new observation we can consider two approaches that differ in how the indicator is estimated. Firstly, the unknown indicator may be integrated out resulting in a mixture density:
where we have omitted the dependency on the training observations . The assignment probability is for an existing cluster and for a new cluster. The second term corresponds to the Student-T marginal in (11) and (12) for an existing and a new cluster, respectively. Alternatively, we can use a point estimate for by picking the minimum negative log posterior of the indicator or equivalently:
The first (marginalization) approach is used in Blei and Jordan (2004) and is more robust as it incorporates the probability mass of all cluster components. However the second (modal) approach can be useful in cases where only a point prediction is needed such as when computing the test set normalised mutual information (see Section 4.1).
4.1 Synthetic CRP parameter estimation
We examine the performance of the MAP-DPM algorithm in terms of discovering the partitioning distribution, and the computational effort needed to do this. We generate 100 samples from a two-dimensional CRP model (3) with diagonal precision matrices (example in Figure 2). The partitioning is sampled from a CRP with fixed concentration parameter and data size . Gaussian component parameters are sampled from an NG prior with parameters . As expected, when using a CRP prior, the sizes of the different clusters vary significantly with many small clusters containing only a few observations in them.
We fit a CRP mixture model with integrated-out component parameters (14) by MAP and integrated-out Gibbs inference, using the known, ground truth model values for the NG prior and used to generate the data. Convergence for the Gibbs algorithm is tested using the Raftery diagnostic () Raftery and Lewis (1992). We use a high convergence acceptance tolerance of to obtain less conservative estimates for the number of iterations required. As there is no commonly accepted way to check the convergence of MCMC algorithms, our comparison is by necessity somewhat arbitrary but we believe the choices we have made are realistic and useful conclusions may be drawn from the comparison.
We measure clustering estimation accuracy using the (sum) normalized mutual information between the ground truth clustering and the estimate (Vinh et al., 2010), where is the entropy. NMI lies in the range with higher values signifying closer agreement between the clusterings (see Appendix A.4 for other reported measures, e.g. the adjusted mutual information (Vinh et al., 2010)).
|NMI||0.67 (0.23)||0.71 (0.23)|
|Iterations||2,020 (900)||13.3 (13.9)|
|CPU time (secs)||7,300 (4,000)||28 (32)|
|-0.91 (7.12)||-6.94 (6.93)|
|Empty clusters||5,900 (3,600)||10.33 (22.1)|
|Test set NMI||0.72 (0.20)||0.71 (0.23)|
). Mean and two standard deviations (in brackets) reported across the 100 CRP mixture samples. The quantityis the difference between the estimated number of clusters and the known number of clusters. The range of the normalized mutual information (NMI) is with higher values reflecting lower clustering error.
In Table 1 a range of performance metrics for the MAP-DPM and Gibbs algorithms are shown. Both MAP-DPM and Gibbs achieve similar clustering performance for both training and test set NMI. To assess out-of-sample performance, another set of observations were sampled from each CRP mixture sample and the out-of-sample point prediction calculated for each model (Section 3.3). The Gibbs sampler requires, on average, approximately 152 times the number of iterations to converge than MAP-DPM, as reflected in the CPU times. Also, the number of empty clusters created in Gibbs is higher than MAP-DPM due to the higher number of iterations required; an effective Gibbs implementation therefore would need to efficiently handle the empty clusters.
When examining the number of clusters (), Gibbs is closest to the ground truth whilst MAP-DPM produces signficant underestimates. In Figure 3 the median partitioning is shown in terms of the partitioning and the number of clusters. MAP-DPM fails to identify the smaller clusters whereas the Gibbs sampler is able to do so to a much greater extent. This is a form of underfitting where the MAP algorithm captures the mode of the partitioning distribution but fails to put enough mass on the tails (the smaller clusters); that may also be described as an underestimation of the variance of the partitioning distribution. The NMI scores do not reflect this effect as the impact of the smaller clusters on the overall measure is minimal.
4.2 UCI datasets
We compare the DP-means, MAP-DPM and Gibbs samplers on seven UCI datasets and assess their performance using the same NMI measure as in Section 4.1. Class labels in the datasets are treated as cluster numbers.111We do not assess “Car” and “Balance scale” datasets used in Kulis and Jordan (2012) because they consist of a complete enumeration of 6 and 4 categorical factors respectively, and it is not meaningful to apply an unsupervised clustering algorithm to such a setting. As in Section 4.1 the Gibbs sampler is stopped using the Raftery diagnostic (Raftery and Lewis, 1992). For DP-means, we choose to give the true number of clusters in the corresponding dataset (Kulis and Jordan, 2012). The NG prior in the Gibbs and MAP-DPM algorithms is set empirically; the mean is set to the sample mean of the data, , a scaled version of value proposed in Kass and Wasserman (1995), and each dimension of is set to the sample variance of the data in that corresponding dimension. Concentration parameter is selected by minimizing the NLL (Section 3) in the MAP-DPM algorithm across a discrete set of candidate values and reused in the Gibbs algorithm. For the Gibbs algorithm, we compute the mean and two standard deviations for the NMI score across all samples (Table 2).
|Breast cancer||0.75||0.70 (0.01)||0.71|
On almost all of the datasets (6 out of 7), MAP-DPM is comparable to, or even better than, the Gibbs sampler, and on 5 out 7 datasets it performs better than DP-means (Table 2). DP-means performs well on lower-dimensional datasets with a small number of clusters. In higher dimensions, it is more likely for the clusters to be elliptical rather than spherical and in such cases the other algorithms outperform DP-means because of the more flexible model assumptions. In addition, for higher dimensional data it is more often the case that the different features have different numerical scales, so the squared Euclidean distance used in DP-means is inappropriate. Furthermore, MAP-DPM and the Gibbs sampler are more robust to smaller clusters due to the longer tails of the Student-T distribution and the rich-get-richer effect of existing clusters assigned many observations. DP-means is especially sensitive to geometric outliers and can easily produce excessive numbers of spurious clusters for poor choices of .
Even though MAP-DPM only gives a point estimate of the full Gibbs distribution, MAP-DPM can in practice achieve higher NMI scores. This can occur both because of Gibbs convergence issues, and because we take the average NMI across all Gibbs samples, where MAP-DPM could correspond to a Gibbs sample with much higher NMI than average. For instance, in the Wine dataset (178 observations, 13 dimensions), the NMI across different Gibbs samples varies considerably and the MAP-DPM NMI score is close to the highest one achieved across all Gibbs samples. In the Soybean dataset (266 observations, 35 dimensions), visual inspection of the Gibbs samples revealed slow Markov chain mixing and even after 10,000 iterations, the samples had not converged. The sparseness of the data in such a high-dimensional space makes this a particularly challenging clustering problem and a more sophisticated MCMC sampling method would likely be required in practice.
We emphasize that these algorithms attempt to maximize the model fit rather than maximize NMI. The true labels would not be available in practice and it is not always the case that maximizing the likelihood also maximizes NMI. Furthermore, if we choose the NG parameters for each dataset separately, by minimizing the negative log likelihood with respect to each parameter, higher NMI can been achieved, but choosing empirical estimates for the model parameters simplifies the computations.
In all cases, the MAP-DPM algorithm converges more rapidly than the other algorithms (Table 3). The Gibbs sampler takes, on average, approximately more iterations to converge than MAP-DPM to achieve comparable NMI scores. The computational complexity per iteration for Gibbs and MAP-DPM is comparable, requiring the computation of the same quantities. This makes the Gibbs sampler significantly less efficient than MAP-DPM in finding a good labeling for the data. The price per iteration for DP-means can often be considerably smaller than MAP-DPM or the Gibbs sampler, as one iteration often does not include a sweep through all of the data points. This occurs because the sweep ends when a new cluster has to be created, unlike MAP-DPM and Gibbs. But, this also implies that DP-means requires more iterations to converge than MAP-DPM.
5 Discussion and future directions
We have presented a simple algorithm for inference in DPMs based on non-degenerate MAP, and demonstrated its efficiency and accuracy by comparison to the ubiquitous Gibbs sampler, and a simple alternative, the small variance asymptotic approach. We believe our approach is highly relevant to applications since, unlike the small variance approach, it retains the preferential attachment (rich-get-richer) property while needing two orders of magnitude fewer iterations than Gibbs. Unlike the asymptotic approach, an out-of-sample likelihood may be computed allowing the use of standard model selection and model fit diagnostic procedures. Lastly, the non-degenerate MAP approach requires no factorization assumptions for the model distribution unlike VB.
As with all MAP methods, the algorithm can get trapped in local minima, however, standard heuristics such as multiple random restarts can be employed to mitigate the risk. This would increase the total computational cost of the algorithm somewhat but even with random restarts it would still be far more efficient than the Gibbs sampler.
Although not reported here due to space limitations, we point out that different implementations of the Gibbs sampler can lead to different MAP inference algorithms for different model DPMs which naturally arise from this probabilistic graphical model structure (Neal, 2000)
. In general, we have found the resulting alternative algorithms to be less robust in practice, as they retain the Gaussian likelihood over the observations given the cluster indicators. If such assumptions are justified, however, then our MAP approach can be readily applied to these models as well, for example, where non-conjugate priors are appropriate.
The generality and the simplicity of our approach makes it reasonable to adapt to other Bayesian nonparametric mixture models, for example the Pitman-Yor process which generalizes the CRP (Pitman and Yor, 1997). The MAP approach can also be readily applied to hierarchical Bayesian nonparametric models such as the Hierarchical DP (Teh et al., 2006) and nested DP (Rodriguez et al., 2008). Another useful direction, for large-scale datasets in particular, would be to extend our approach to perform inference that does not need to sweep through the entire dataset in each iteration, for increased efficiency (Welling and Teh, 2011).
- Bousquet and Bottou  Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
- Ferguson  Thomas S Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 03 1973. doi: 10.1214/aos/1176342360. URL http://dx.doi.org/10.1214/aos/1176342360.
- Barabási and Albert  Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
- Blackwell and MacQueen  David Blackwell and James B MacQueen. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1(2):353–355, 1973.
- Ishwaran and James  Hemant Ishwaran and Lancelot F James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453), 2001.
- Stuart Geman  Donald Geman Stuart Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
- Neal  Radford M Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003. doi: 10.1214/aos/1056562461.
Blei and Jordan 
David M Blei and Michael I Jordan.
Variational methods for the Dirichlet process.
Proceedings of the 21th International Conference on Machine Learning (ICML), page 12. ACM, 2004.
- Bishop  Christopher M Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.
- Kulis and Jordan  Brian Kulis and Michael I Jordan. Revisiting K-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 513–520, 2012.
- Jiang et al.  Ke Jiang, Brian Kulis, and Michael I Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages 3158–3166, 2012.
- Broderick et al.  Tamara Broderick, Brian Kulis, and Michael I Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. Proceedings of the 30th International Conference on Machine Learning (ICML), 28(3):226–234, 2013.
- Kittler and Föglein  Josef Kittler and J Föglein. Contextual classification of multispectral pixel data. Image and Vision Computing, 2(1):13–29, 1984.
- West et al.  Mike West, Peter Muller, and Michael D Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. Aspects of Uncertainty, 1:363–386, 1994.
- Neal  Radford M Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
- Welling and Kurihara  Max Welling and Kenichi Kurihara. Bayesian K-Means as a "Maximization-Expectation" algorithm. In SDM, pages 474–478. SIAM, 2006.
- Blackwell  David Blackwell. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics, 18(1):105–110, 03 1947. doi: 10.1214/aoms/1177730497. URL http://dx.doi.org/10.1214/aoms/1177730497.
- Cassella and Robert  George Cassella and Christian P Robert. Rao-Blackwellisation of sampling schemes. Biometrics, 83:81–94, 1994.
- Raftery and Lewis  Adrian E Raftery and Steven Lewis. How many iterations in the Gibbs sampler. Bayesian Statistics, 4(2):763–773, 1992.
- Vinh et al.  Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.
Kass and Wasserman 
Robert E Kass and Larry Wasserman.
A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion.Journal of the American Statistical Association, 90(431):928–934, 1995.
- Pitman and Yor  Jim Pitman and Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997. doi: 10.1214/aop/1024404422. URL http://dx.doi.org/10.1214/aop/1024404422.
- Teh et al.  Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
- Rodriguez et al.  Abel Rodriguez, David B Dunson, and Alan E Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483), 2008.
- Welling and Teh  Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011.
- Rasmussen  Carl E Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pages 554–560, 1999.
Appendix A Appendix
a.1 Model and prior
The mixture model given the component means, precisions and weights for each component, is:
where , the number of clusters, becomes infinity in the DPM model. Variable is a -dimensional vector of cluster means for cluster , and is a diagonal matrix of precisions, . This makes the covariance diagonal, so the normal can be written as a product of univariate normals, = .
We place the conjugate normal-Gamma prior over each dimension of the cluster means and variances:
Then, we integrate out the cluster parameters of the model, which is analytically tractable because of the conjugate priors. This results in a collapsed model structure that places a mixture of Student-T likelihoods for the probability over an observation:
where the parameters of the Student-T are functions of the normal-Gamma posteriors defined in the paper with effect of observation removed from them. Recall that , and . The removal of the effect of the current observation in the corresponding conditionals occurs because a dependency is introduced between the cluster indicators, which is a result of integrating out the cluster parameters (see Algorithm 3 in Neal ).
a.2 Cluster assignments
To complete the MAP algorithm we take the negative logarithm of the assignment probabilities to arrive at (15). For an observation , we compute the negative log probability for each existing cluster, ignoring constant terms:
Similarly, for a new cluster:
For each obsevation we compute the above dimensional vector and select the index of the smallest element from it:
In Algorithm 2 a fast update version of the method is described where the NG statistics are updated by removing the effect of one point rather than processing the entire data set.
a.3 Learning the concentration parameter
We have considered the following approaches to infer the concentration parameter :
Cross-validation. By considering a finite set of values for , choose the value corresponding to the minimum, average out-of-sample likelihood across all cross-validation folds. This approach is taken in Blei and Jordan  to compare different inference methods.
Multiple restarts. Compute the NLL at convergence for each different value of , and pick the corresponding to the smallest NLL at convergence. This is the approach taken in the UCI experiment section of the paper.
MAP estimate. Compute the posterior over and numerically locate the mode. An example posterior is given in Rasmussen :
where is the number of existing clusters. This posterior arises from the choice of an inverse Gamma prior on the concentration parameter and the likelihood . We generalize this calculation by using a Gamma prior :
We numerically minimize the negative log of this posterior using Newton’s method. To ensure the solution is positive we compute the gradient with respect to : as Rasmussen  notes is log-concave and therefore has a unique maximum (so that the negative log of this has a unique minimum), where is the number of non-zero represented components.
In practice we found the third approach least effective due to the presence of local minima when doing MAP estimation. The second approach is the simplest to apply in practice but can be prone to overfitting for small datasets where we recommend using the cross-validation approach.
a.4 CRP experiment
We provide more details on the CRP experiment presented in the paper. In Tables 4-5 we include the maximum NMI and adjusted mutual information (AMI) score which corrects for chance effects when comparing clusterings by penalizing partitions with larger numbers of clusters [Vinh et al., 2010]. To assess out-of-sample accuracy we also include the average, leave-one-out negative log likelihood discussed in Section 3.3 of the paper in Table 5. All metrics are similar for the Gibbs and MAP-DPM algorithms, reflecting the good clustering performance of the MAP algorithm.
|NMI sum||0.67 (0.23)||0.71 (0.23)|
|NMI max||0.64 (0.24)||0.64 (0.27)|
|AMI||0.62 (0.25)||0.62 (0.27)|
|Iterations||2,000 (900)||13 (14)|
|CPU Time (secs)||7,300 (4,000)||28 (32)|
|-0.91 (7.1)||-6.9 (6.9)|
|Empty clusters||6,000 (3,600)||14 (28)|
|NMI sum||0.72 (0.20)||0.71 (0.23)|
|NMI max||0.68 (0.22)||0.64 (0.27)|
|AMI||0.67 (0.23)||0.62 (0.27)|
|NLL (leave-one-out)||7.17 (1.10)||7.20 (1.11)|
In Figure 4
we also show the 5th and 95th quantiles of the clustering distribution for the CRP ground-truth, MAP and Gibbs algorithms. We see that the effect discussed in the paper is present at both extreme quantiles with the MAP algorithm consistently underestimating the total number of clusters by not identifying the smaller clusters.