# Posterior Distribution for the Number of Clusters in Dirichlet Process Mixture Models

Dirichlet process mixture models (DPMMs) play a central role in Bayesian nonparametrics, with applications throughout statistics and machine learning. DPMMs are generally used in clustering problems where the number of clusters is not known in advance, and the posterior distribution is treated as providing inference for this number. Recently, however, it has been shown that the DPMM is inconsistent in inferring the true number of components in certain cases. This is an asymptotic result, and it would be desirable to understand whether it holds with finite samples, and to understand the full posterior more completely. In this work, we provide a rigorous study of the posterior distribution of the number of clusters in DPMMs under different prior distributions on the parameters and constraints on the distributions of the data. We provide novel lower bounds on the ratios of probabilities between $s+1$ clusters and $s$ clusters when the prior distributions on parameters are chosen to be Gaussian or uniform distributions.

## 1 Introduction

Since the introduction of the Dirichlet process in Ferguson’s seminal paper [4, 2], models based on the Dirichlet process and related combinatorial stochastic processes, such as the Pitman-Yor process (see for example [22]), have been used for various statistical and machine learning problems. These models are generally mixture models, with the Dirichlet process or Pitman-Yor process used as a prior on mixture components [1]. Statistical applications have included a wide variety of problems in density estimation [3, 6, 7, 25, 11, 8, 9, 27, 10] and parameter estimation [26, 23, 5, 21]. Applied problems that have been studied include computer vision and text processing, as well as problems in scientific fields such as economics, astronomy, molecular biology, and genetics [26, 23, 20, 3, 14, 13, 17]. The use of Dirichlet process mixture models (DPMMs) in these fields is generally motivated by the assertion that they can be used to determine the number of components in nonparametric mixture models.

Classical clustering algorithms such as $k$-means or Gaussian mixture models (GMMs) generally require us to set the number of clusters a priori. However, in practice the real number of clusters is rarely known, and almost never known for dynamically growing data sets. This has motivated the use of DPMMs to find the number of clusters. Unlike $k$-means or GMMs, the DPMM is based on the Dirichlet process, which has infinitely many components and does not require one to specify the number of components in advance. The goal is to use the posterior distribution of the number of clusters to find an optimal choice for clustering.

However, unlike the case of density estimation, a theoretical understanding of the convergence of the number of components in the DPMM is still largely missing in the literature. On the negative side, it has been demonstrated that the DPMM and the Pitman-Yor process mixture model (PYPMM) may exhibit posterior inconsistency in the number of components when the true number of components is finite [18, 19]. Moreover, in practice, it has been observed that DPMM-based inference can generate small clusters that do not reflect the underlying data-generating process, especially when the real number of components is small rather than infinite. Despite these observations, we have little quantitative understanding of the behavior of the posterior distribution of the number of components when the number of samples goes to infinity, let alone in the nonasymptotic regime.

To fill the gap, in this work we study the posterior distribution of the number of clusters for DPMM-based clustering models. Our main results are lower bounds on the ratio of the probabilities of obtaining $s+1$ clusters and $s$ clusters under Gaussian or uniform priors for the parameters, with different assumptions and constraints on the data. This yields a fine-grained understanding of the posterior distribution induced by the DPMM on the number of clusters, and positions us for future work on topics such as the rate of growth of the number of clusters in the posterior when the number of clusters is not fixed but grows with the sample size.

## 2 Model Description

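We first introduce some key notation. We use $x = \{x_i\}_{i=1}^n$ to denote the samples $x_1, \dots, x_n$, $[n]$ to denote the set $\{1, \dots, n\}$, and $A \in \rho_s(n)$ to denote the set $\{A_1, \dots, A_s\}$ such that the $A_i$’s form an $s$-partition of $[n]$, where $\rho_s(n)$ is the set of all $s$-partitions on $[n]$.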

The DPMM [1, 15] is specified as follows:

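$$p(A, k) := \frac{\alpha^k}{\alpha^{(n)}} \prod_{i=1}^{k} (|A_i| - 1)! \tag{1}$$

$$p(\theta \mid A, k) := \prod_{i=1}^{k} \pi(\theta_i) \tag{2}$$

$$p\big(\{x_i\}_{i=1}^n \,\big|\, \{\theta_j\}_{j=1}^k, A, k\big) := \prod_{j=1}^{k} \prod_{x_i \in A_j} f_{\theta_j}(x_i), \tag{3}$$

where $\pi$ stands for a given prior on the parameter $\theta$, while $\{f_\theta\}$ is a known family of density functions.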

The DPMM has been widely used in machine learning and statistics for problems including density estimation and parameter estimation. In this paper, we specifically focus on the application of the DPMM to clustering problems. For this application, the prior for the number of clusters $K_n$ with $n$ samples is given by

$$P(K_n = s) = \sum_{A \in \rho_s(n)} p(A, s).$$
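The prior on $K_n$ can be checked by brute force for small $n$. The sketch below enumerates every set partition of $[n]$ and sums $p(A, s)$ from (1). It assumes that $\alpha^{(n)}$ denotes the rising factorial $\alpha(\alpha+1)\cdots(\alpha+n-1)$, which is the usual convention but is not spelled out in the text; the helper names are illustrative.

```python
from math import factorial

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or open a new singleton block.
        yield [[first]] + partition

def rising_factorial(alpha, n):
    """Assumed meaning of alpha^{(n)}: alpha (alpha + 1) ... (alpha + n - 1)."""
    out = 1.0
    for i in range(n):
        out *= alpha + i
    return out

def prior_num_clusters(n, alpha):
    """Return {s: P(K_n = s)} by summing p(A, s) over all partitions of [n]."""
    norm = rising_factorial(alpha, n)
    prior = {}
    for partition in set_partitions(list(range(n))):
        s = len(partition)
        weight = alpha ** s / norm
        for block in partition:
            weight *= factorial(len(block) - 1)
        prior[s] = prior.get(s, 0.0) + weight
    return prior

prior = prior_num_clusters(n=6, alpha=1.0)
for s in sorted(prior):
    print(s, round(prior[s], 4))
```

Under this convention the probabilities sum to one, which is a quick sanity check on the normalizing constant. Enumeration is only feasible for small $n$, since the number of partitions grows as the Bell numbers.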

Given the above prior distribution for the number of clusters, the posterior for the number of clusters admits the following formulation:

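$$
\begin{aligned}
P(K_n = s \mid \{x_i\}_{i=1}^n) &= \frac{P(\{x_i\}_{i=1}^n \mid K_n = s)\, P(K_n = s)}{P(\{x_i\}_{i=1}^n)} \\
&\propto \sum_{A \in \rho_s(n)} p(A, s) \cdot \int_{\{\theta_j\}_{j=1}^s} p\big(\{x_i\}_{i=1}^n \,\big|\, \{\theta_j\}_{j=1}^s\big)\, p\big(\{\theta_j\}_{j=1}^s \,\big|\, A, s\big)\, d\{\theta_j\}_{j=1}^s \\
&= \sum_{A \in \rho_s(n)} p(A, s) \cdot \int_{\{\theta_j\}_{j=1}^s \in \Theta^s} \left( \prod_{j=1}^{s} \prod_{x_i \in A_j} f_{\theta_j}(x_i) \prod_{j=1}^{s} \pi(\theta_j) \right) d\{\theta_j\}_{j=1}^s .
\end{aligned}
$$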

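The central goal of this work is a rigorous study of the behavior of $P(K_n = s \mid \{x_i\}_{i=1}^n)$ under two representative choices of the prior $\pi$ and different assumptions on the data-generating process. In particular, to ease the ensuing discussion, we use $m(x_{A_j})$ to denote the cluster probability:

$$m(x_{A_j}) = \int_{\theta_j} f_{\theta_j}(x_{j,1}) \cdots f_{\theta_j}(x_{j,a_j})\, \pi(\theta_j)\, d\theta_j,$$

where in the above integral $x_{j,l} \in x_{A_j}$ for all $1 \le l \le a_j$, with $a_j = |A_j|$ and $x_{A_j}$ the samples assigned to $A_j$. Given this definition of $m(x_{A_j})$, we can rewrite $P(K_n = s \mid \{x_i\}_{i=1}^n)$ as follows:

$$
\begin{aligned}
P(K_n = s \mid \{x_i\}_{i=1}^n) &\propto \sum_{A \in \rho_s(n)} \left( p(A, s) \cdot \prod_{j=1}^{s} m(x_{A_j}) \right) \\
&= \sum_{A \in \rho_s(n)} \left( \frac{\alpha^s}{\alpha^{(n)}} \prod_{i=1}^{s} (|A_i| - 1)! \cdot \prod_{j=1}^{s} m(x_{A_j}) \right).
\end{aligned} \tag{4}
$$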

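To understand the behavior of the posterior distribution of the number of clusters in (4), we consider the ratio between its values at $s+1$ and $s$, which can be computed directly as follows:

$$
\begin{aligned}
R(s \mid \{x_i\}_{i=1}^n) &:= \frac{P(K_n = s+1 \mid \{x_i\}_{i=1}^n)}{P(K_n = s \mid \{x_i\}_{i=1}^n)} \\
&= \frac{\sum_{A \in \rho_{s+1}(n)} \left( p(A, s+1) \cdot \prod_{j=1}^{s+1} m(x_{A_j}) \right)}{\sum_{B \in \rho_s(n)} \left( p(B, s) \cdot \prod_{j=1}^{s} m(x_{B_j}) \right)} \\
&= \alpha \cdot \frac{\sum_{A \in \rho_{s+1}(n)} \left( \left( \prod_{i=1}^{s+1} (|A_i| - 1)! \right) \cdot \prod_{j=1}^{s+1} m(x_{A_j}) \right)}{\sum_{B \in \rho_s(n)} \left( \left( \prod_{i=1}^{s} (|B_i| - 1)! \right) \cdot \prod_{j=1}^{s} m(x_{B_j}) \right)} .
\end{aligned} \tag{5}
$$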
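For a handful of points, the posterior in (4) and the ratio in (5) can be evaluated exactly by enumerating partitions. The sketch below is illustrative only: it assumes a unit-variance normal kernel and, purely to have a closed-form $m(x_A)$, a $\mathcal{N}(0, \sigma^2)$ prior on the mean; the function names and the toy data are hypothetical. The constant $\alpha^{(n)}$ cancels when normalizing, so it is omitted.

```python
from math import exp, factorial, log, pi

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition

def log_marginal(block, sigma2=1.0):
    """log m(x_A) for a N(theta, 1) kernel and an assumed N(0, sigma2) prior."""
    a = len(block)
    total = sum(block)
    sum_sq = sum(x * x for x in block)
    tau = 1.0 / sigma2
    return (-0.5 * a * log(2 * pi) - 0.5 * log(1 + a * sigma2)
            - 0.5 * sum_sq + 0.5 * total ** 2 / (a + tau))

def posterior_num_clusters(xs, alpha, log_m=log_marginal):
    """Return {s: P(K_n = s | x)} following equation (4)."""
    weights = {}
    for partition in set_partitions(list(xs)):
        s = len(partition)
        log_w = s * log(alpha)
        for block in partition:
            log_w += log(factorial(len(block) - 1)) + log_m(block)
        weights[s] = weights.get(s, 0.0) + exp(log_w)
    z = sum(weights.values())
    return {s: w / z for s, w in sorted(weights.items())}

def ratio(post, s):
    """R(s | x) = P(K_n = s + 1 | x) / P(K_n = s | x), as in equation (5)."""
    return post[s + 1] / post[s]

xs = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]   # two well-separated toy groups
post = posterior_num_clusters(xs, alpha=1.0)
for s in range(1, len(xs)):
    print(s, round(post[s], 4), round(ratio(post, s), 4))
```

Passing a different `log_m` swaps in another prior, so the same enumeration can be reused to probe the uniform-prior setting on small data sets.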

Throughout the paper, we consider the Dirichlet process mixture of normals with unit variance, i.e., $f_\theta$ is the density of $\mathcal{N}(\theta, 1)$ for all $\theta \in \Theta$.

## 3 Uniform Prior

We first study the posterior distribution of the number of clusters of the DPMM when the data lie on a bounded set [24]. Practically speaking, many data sets that naturally arise in fields such as biology, genetics, and economics are essentially bounded. In this setting, the parameter space is usually chosen to be a compact set.

In this section, to simplify the proof, we specifically consider a simple uniform prior on the parameter space $\Theta$, where $\Theta$ is a bounded segment of $\mathbb{R}$ with size $|\Theta|$. With this choice, we have $\pi(\theta) = 1/|\Theta|$ for all $\theta \in \Theta$. We start with the following lower bound on $R(s \mid \{x_i\}_{i=1}^n)$ under certain assumptions on the data:

###### Theorem 1.

Consider the DPMM defined in (3) with a uniform prior on $\Theta$. Then, when $n$ is sufficiently large, if and for some , then the ratio between consecutive terms is lower bounded by

$$R(s \mid \{x_i\}_{i=1}^n) \succsim \frac{\alpha}{s\,|\Theta|}. \tag{6}$$

###### Remark.

The condition of Theorem 1 regarding the data can be relaxed so that only most of the data need to satisfy it; however, that weaker condition requires a slightly more complicated proof. Additionally, the condition is mild in many problems: in a clustering problem, when one applies a uniform prior for the parameters (the means of the normal distributions), one expects the support of the uniform prior to be large enough to capture the means of all the components.

###### Proof.

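Now, for $A \in \rho_{s+1}(n)$ and $B \in \rho_s(n)$, we define two key terms $\eta_{s+1}(A)$ and $\tilde\eta_s(B)$ as follows:

$$
\begin{aligned}
\eta_{s+1}(A) &:= \big\{ \tilde A \in \rho_s(n) : \exists!\, i \in [s] : \forall j \neq i,\ A_j = \tilde A_j,\ A_i \cup A_{s+1} = \tilde A_i \big\}, \\
\tilde\eta_s(B) &:= \big\{ \tilde B \in \rho_{s+1}(n) : B \in \eta_{s+1}(\tilde B) \big\}.
\end{aligned}
$$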

To avoid notation cluttering, in the following we fix in our discussion and will write and , unless otherwise specified. To interpret the definitions above, we note that $\eta_{s+1}(A)$ is the set of partitions in $\rho_s(n)$ that can be obtained from $A$ by combining two elements of $A$ into one element and keeping all other elements the same. Conversely, $\tilde\eta_s(B)$ is the set of partitions in $\rho_{s+1}(n)$ in which two elements can be combined into one to get $B$. Toward showing the result, we further define the posterior probabilities of a partition for $A \in \rho_{s+1}(n)$ and $B \in \rho_s(n)$ to be:

$$p(A, x) := p(A, s+1) \cdot \prod_{i=1}^{s+1} m(x_{A_i}), \qquad p(B, x) := p(B, s) \cdot \prod_{i=1}^{s} m(x_{B_i}).$$

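Based on the above definitions, we can rewrite the ratio as follows:

$$R(s) = \frac{P(K_n = s+1 \mid \{x_i\}_{i=1}^n)}{P(K_n = s \mid \{x_i\}_{i=1}^n)} = \frac{2}{(s+1)s} \cdot \frac{\sum_{B \in \rho_s(n)} \left( \sum_{A \in \tilde\eta_s(B)} p(A, x) \right)}{\sum_{B \in \rho_s(n)} p(B, x)} \tag{7}$$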

for .
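The counting behind (7) can be checked numerically for small $n$. The sketch below builds $\tilde\eta_s(B)$ explicitly by merging pairs of blocks and compares $\sum_A p(A, x)$ with $\frac{2}{(s+1)s} \sum_B \sum_{A \in \tilde\eta_s(B)} p(A, x)$. The `weight` function is an arbitrary positive stand-in for $p(A, x)$, since the identity is purely combinatorial; all names are illustrative.

```python
from itertools import combinations
from math import isclose

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition

def canonical(partition):
    """Order-free representation of a partition."""
    return frozenset(frozenset(block) for block in partition)

def merge_pairs(partition):
    """All partitions obtained from `partition` by merging exactly two blocks."""
    blocks = list(partition)
    merged = []
    for i, j in combinations(range(len(blocks)), 2):
        rest = [blocks[k] for k in range(len(blocks)) if k not in (i, j)]
        merged.append(frozenset(rest + [blocks[i] | blocks[j]]))
    return merged

def weight(partition):
    """Arbitrary positive stand-in for p(A, x)."""
    w = 1.0
    for block in partition:
        w *= 1.0 + sum(block) / 10.0
    return w

def check_identity(n, s):
    fine = [canonical(p) for p in set_partitions(list(range(n))) if len(p) == s + 1]
    eta_tilde = {}  # maps B to the set of A that merge into B
    for A in fine:
        for B in merge_pairs(A):
            eta_tilde.setdefault(B, set()).add(A)
    lhs = sum(weight(A) for A in fine)
    double_sum = sum(weight(A) for members in eta_tilde.values() for A in members)
    return lhs, 2.0 / ((s + 1) * s) * double_sum

lhs, rhs = check_identity(n=6, s=3)
print(lhs, rhs)
```

Each $(s+1)$-partition appears in exactly $\binom{s+1}{2}$ of the sets $\tilde\eta_s(B)$, which is where the factor $2/((s+1)s)$ comes from.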

Assume for the moment that the above claim holds (its proof is deferred to the end of the proof of Theorem 1). We proceed to finish the proof of the theorem. For $i \in [s]$, let $\tilde\eta_i(B)$ denote the set of $A \in \tilde\eta_s(B)$ such that $A_i \cup A_{s+1} = B_i$ and $A_k = B_k$ for $k \neq i$. Additionally, let $\tilde\eta_{i,j}(B)$ be the subset of $\tilde\eta_i(B)$ such that the corresponding $A$ has $|A_i| = j$. Here, we note that the order in the partition does not matter, so we label the two parts that merge into $B_i$ as $A_i$ and $A_{s+1}$ for notational convenience. Furthermore, to ease the ensuing presentation, we denote $a_i = |A_i|$ and $b_i = |B_i|$. Then, we obtain the following equations

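$$
\begin{aligned}
\sum_{A \in \tilde\eta_s(B)} \frac{p(A \mid x)}{p(B \mid x)}
&= \sum_{A \in \tilde\eta_s(B)} \alpha \cdot \frac{\prod_{i=1}^{s+1} (|A_i| - 1)!\, m(x_{A_i})}{\prod_{i=1}^{s} (|B_i| - 1)!\, m(x_{B_i})} \\
&= \alpha \cdot \sum_{i=1}^{s} \sum_{j=1}^{b_i - 1} \sum_{A \in \tilde\eta_{i,j}(B)} \frac{(j-1)!\,(b_i - j - 1)!\; m(x_{A_i})\, m(x_{A_{s+1}})}{(b_i - 1)!\; m(x_{B_i})} \\
&= \alpha \cdot \sum_{i=1}^{s} \sum_{j=1}^{b_i - 1} \frac{(j-1)!\,(b_i - j - 1)!}{(b_i - 1)!} \left( \sum_{A \in \tilde\eta_{i,j}(B)} \frac{m(x_{A_i})\, m(x_{A_{s+1}})}{m(x_{B_i})} \right).
\end{aligned} \tag{8}
$$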

Given the above results, we define the following shorthands:

$$\bar X_{A_i} = \frac{1}{|A_i|} \sum_{x \in x_{A_i}} x; \qquad S^2_{A_i} = \sum_{x \in x_{A_i}} x^2.$$

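With simple algebra, writing $A = A_i \cup A_{s+1}$ and cancelling the common powers of $2\pi$, we can verify that

$$
\begin{aligned}
\frac{m(x_{A_i})\, m(x_{A_{s+1}})}{m(x_A)}
&= \frac{\frac{1}{|\Theta|}\sqrt{\frac{2\pi}{a_i}} \exp\!\left( -\frac{S_i^2 - a_i \bar X_i^2}{2} \right) P_i(\Theta) \cdot \frac{1}{|\Theta|}\sqrt{\frac{2\pi}{a_{s+1}}} \exp\!\left( -\frac{S_{s+1}^2 - a_{s+1} \bar X_{s+1}^2}{2} \right) P_{s+1}(\Theta)}{\frac{1}{|\Theta|}\sqrt{\frac{2\pi}{a_i + a_{s+1}}} \exp\!\left( -\frac{S^2 - (a_i + a_{s+1}) \bar X^2}{2} \right) P_{i \cup s+1}(\Theta)} \\
&= \frac{\sqrt{2\pi}}{|\Theta|} \cdot \frac{\sqrt{a_i + a_{s+1}}}{\sqrt{a_i a_{s+1}}} \cdot \exp\!\left( \frac{a_i \bar X_i^2 + a_{s+1} \bar X_{s+1}^2 - (a_i + a_{s+1}) \bar X^2}{2} \right) \cdot \frac{P_i(\Theta)\, P_{s+1}(\Theta)}{P_{i \cup s+1}(\Theta)},
\end{aligned}
$$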

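where $P_i(\Theta)$ denotes the probability that a $\mathcal{N}(\bar X_i, 1/a_i)$ random variable falls in $\Theta$, and similarly for $P_{s+1}(\Theta)$ and $P_{i \cup s+1}(\Theta)$. If the samples satisfy that and for some , where for simplicity of presentation we may choose but note that the result holds for any with some constant depending on , then we have:

$$\frac{P_i(\Theta)\, P_{s+1}(\Theta)}{P_{i \cup s+1}(\Theta)} \geq (0.997)^2.$$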
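The closed form for $m(x_A)$ under the uniform prior can be compared against direct numerical integration. The sketch below assumes a unit-variance normal kernel and $\Theta = [\text{lo}, \text{hi}]$, and writes the mass term as the probability that $\mathcal{N}(\bar X_A, 1/|A|)$ falls in $\Theta$; this reading of $P(\Theta)$, the interval endpoints, and the toy data are assumptions made for illustration. It also shows that a block mean lying at least three units inside $\Theta$ gives a mass of at least 0.997, which is one plausible source of that constant.

```python
from math import erf, exp, isclose, pi, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mass_in_theta(block, lo, hi):
    """P(N(mean, 1/a) in [lo, hi]) for a block with a points."""
    a = len(block)
    mean = sum(block) / a
    return normal_cdf(sqrt(a) * (hi - mean)) - normal_cdf(sqrt(a) * (lo - mean))

def marginal_closed_form(block, lo, hi):
    """m(x_A) for a N(theta, 1) kernel and a uniform prior on [lo, hi]."""
    a = len(block)
    mean = sum(block) / a
    sum_sq = sum(x * x for x in block)
    gauss = (2 * pi) ** (-a / 2) * exp(-(sum_sq - a * mean ** 2) / 2) * sqrt(2 * pi / a)
    return gauss * mass_in_theta(block, lo, hi) / (hi - lo)

def marginal_numeric(block, lo, hi, steps=20000):
    """Midpoint-rule approximation of the same integral over theta."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        lik = 1.0
        for x in block:
            lik *= exp(-(x - theta) ** 2 / 2) / sqrt(2 * pi)
        total += lik
    return total * h / (hi - lo)

block = [1.0, 1.5, 2.0]
closed = marginal_closed_form(block, -5.0, 5.0)
numeric = marginal_numeric(block, -5.0, 5.0)
edge_mass = mass_in_theta([2.0], -5.0, 5.0)  # a single point 3 units from the edge
print(closed, numeric, edge_mass)
```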

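On the other hand, for any $k$ sets with sizes $n_1, \dots, n_k$ and means $\bar X_1, \dots, \bar X_k$, whose union has size $n = \sum_{i=1}^{k} n_i$ and mean $\bar X$, we have

$$
\begin{aligned}
\sum_{i=1}^{k} n_i \bar X_i^2 - n \bar X^2
&= \sum_{i=1}^{k} n_i \bar X_i^2 - \frac{\left( \sum_{i=1}^{k} n_i \bar X_i \right)^2}{n} \\
&= \sum_{i=1}^{k} \frac{n_i (n - n_i)}{n} \bar X_i^2 - \sum_{i \neq j} \frac{n_i n_j}{n} \bar X_i \bar X_j \\
&= \frac{1}{n} \sum_{i < j} n_i n_j \left( \bar X_i - \bar X_j \right)^2 .
\end{aligned}
$$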

The above result leads to the following equation

$$\frac{a_i \bar X_{A_i}^2 + a_{s+1} \bar X_{A_{s+1}}^2 - (a_i + a_{s+1}) \bar X^2}{2} = \frac{a_i a_{s+1} \left( \bar X_{A_i} - \bar X_{A_{s+1}} \right)^2}{2 (a_i + a_{s+1})}.$$
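The decomposition used here is a standard between-group sum-of-squares identity, and it is easy to confirm numerically. The sketch below draws a few random groups (the sizes, means, and seed are arbitrary choices) and compares $\sum_i n_i \bar X_i^2 - n \bar X^2$ with the pairwise form, including the two-set special case that appears in the exponent.

```python
import random
from math import isclose

def between_group_gap(groups):
    """sum_i n_i * mean_i^2 - n * overall_mean^2."""
    n = sum(len(g) for g in groups)
    overall = sum(sum(g) for g in groups) / n
    return sum(len(g) * (sum(g) / len(g)) ** 2 for g in groups) - n * overall ** 2

def pairwise_form(groups):
    """(1/n) * sum_{i<j} n_i n_j (mean_i - mean_j)^2."""
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            total += len(groups[i]) * len(groups[j]) * (means[i] - means[j]) ** 2
    return total / n

rng = random.Random(0)
groups = [[rng.gauss(mu, 1.0) for _ in range(size)]
          for mu, size in [(-2.0, 4), (0.5, 7), (3.0, 2)]]

gap_all = between_group_gap(groups)
pair_all = pairwise_form(groups)
gap_two = between_group_gap(groups[:2])   # the two-set case used in the proof
pair_two = pairwise_form(groups[:2])
print(gap_all, pair_all, gap_two, pair_two)
```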

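Combining all the results above, we obtain the following inequality

$$\frac{m(x_{A_i})\, m(x_{A_{s+1}})}{m(x_{A_i \cup A_{s+1}})} \geq \frac{(0.997)^2 \sqrt{2\pi}}{|\Theta|} \cdot \frac{\sqrt{a_i + a_{s+1}}}{\sqrt{a_i a_{s+1}}} \cdot \exp\!\left( \frac{a_i a_{s+1} \left( \bar X_{A_i} - \bar X_{A_{s+1}} \right)^2}{2 (a_i + a_{s+1})} \right).$$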

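Given the above inequality, we can derive the following bounds for the term in (8):

$$
\begin{aligned}
\sum_{A \in \tilde\eta_s(B)} \frac{p(A \mid x)}{p(B \mid x)}
&\geq \frac{\alpha (0.997)^2 \sqrt{2\pi}}{|\Theta|} \cdot \sum_{i=1}^{s} \sum_{j=1}^{b_i - 1} \left( \frac{b_i}{j (b_i - j)} \right)^{3/2} \\
&\qquad \times \left( \frac{1}{\binom{b_i}{j}} \sum_{A \in \tilde\eta_{i,j}(B)} \exp\!\left( \frac{j (b_i - j) \left( \bar X_{A_i} - \bar X_{A_{s+1}} \right)^2}{2 b_i} \right) \right) \\
&\geq \frac{\alpha (0.997)^2 \sqrt{2\pi}}{|\Theta|} \cdot \sum_{i=1}^{s} \sum_{j=1}^{b_i - 1} \left( \frac{b_i}{j (b_i - j)} \right)^{3/2} .
\end{aligned} \tag{9}
$$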

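The inner sum over $j$ can be compared with the corresponding integral:

$$\int_{1}^{b_i - 1} \left( \frac{b_i}{x (b_i - x)} \right)^{3/2} dx \;\leq\; \sum_{j=1}^{b_i - 1} \left( \frac{b_i}{j (b_i - j)} \right)^{3/2} \;\leq\; \int_{1}^{b_i - 1} \left( \frac{b_i}{x (b_i - x)} \right)^{3/2} dx + 2 \left( \frac{b_i}{1 \cdot (b_i - 1)} \right)^{3/2}.$$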

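The above result yields that

$$\frac{4 (b_i - 2)}{\sqrt{(b_i - 1)\, b_i}} \;\leq\; \sum_{j=1}^{b_i - 1} \left( \frac{b_i}{j (b_i - j)} \right)^{3/2} \;\leq\; \frac{4 (b_i - 2)}{\sqrt{(b_i - 1)\, b_i}} + 2^{5/2}.$$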
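The sandwich bound above can be checked directly over a range of block sizes. The following sketch evaluates the sum $\sum_{j=1}^{b-1} (b / (j(b-j)))^{3/2}$ and the closed-form integral value $4(b-2)/\sqrt{(b-1)b}$ for $b = 2, \dots, 500$; the range is an arbitrary choice for illustration.

```python
from math import sqrt

def discrete_sum(b):
    """sum_{j=1}^{b-1} (b / (j (b - j)))^{3/2}."""
    return sum((b / (j * (b - j))) ** 1.5 for j in range(1, b))

def integral_value(b):
    """Closed form of the integral from 1 to b - 1: 4 (b - 2) / sqrt((b - 1) b)."""
    return 4.0 * (b - 2) / sqrt((b - 1) * b)

checks = [(b, integral_value(b), discrete_sum(b)) for b in range(2, 501)]

# Lower and upper bounds from the display above.
sandwich_holds = all(lo <= s <= lo + 2 ** 2.5 for _, lo, s in checks)
# The sum itself never drops below 2 once the block has at least two points.
at_least_two = all(s >= 2.0 for _, _, s in checks)

print(sandwich_holds, at_least_two)
for b, lo, s in checks[:4]:
    print(b, round(lo, 4), round(s, 4))
```

The second check is the property the proof relies on when each block with $b_i \geq 2$ contributes at least 2 to the lower bound.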

When , simple algebra indicates that . Additionally, the left hand side in the inequalities above is always no less than 2 for . Invoking these results, we have the following lower bound

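$$\sum_{A \in \tilde\eta_s(B)} \frac{p(A \mid x)}{p(B \mid x)} \geq \frac{\alpha (0.997)^2 \sqrt{2\pi}}{|\Theta|} \cdot \sum_{i=1}^{s} 2 \cdot \mathbb{I}_{b_i \geq 2} \succsim \frac{\alpha s}{|\Theta|}.$$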

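Combining the above lower bound with equation (7), whose prefactor $\frac{2}{(s+1)s}$ is of order $1/s^2$, we obtain the following bound on the ratio between consecutive terms

$$R(s) = \frac{P(K_n = s+1 \mid \{x_i\}_{i=1}^n)}{P(K_n = s \mid \{x_i\}_{i=1}^n)} \succsim \frac{\alpha}{s\,|\Theta|}.$$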

As a consequence, we reach the conclusion of the theorem.

### Proof of claim (7):

Using equation (4), we can rewrite the ratio between the posterior probability of $s+1$ components and that of $s$ components as follows:

$$\frac{P(K_n = s+1 \mid \{x_i\}_{i=1}^n)}{P(K_n = s \mid \{x_i\}_{i=1}^n)} = \frac{\sum_{A \in \rho_{s+1}(n)} p(A, x)}{\sum_{B \in \rho_s(n)} p(B, x)}.$$

Note that for each $A \in \rho_{s+1}(n)$, we can merge any two of its parts to get some $B \in \rho_s(n)$. The number of distinct ways to do so is exactly $\binom{s+1}{2} = (s+1)s/2$. Also, for each $B \in \rho_s(n)$, the set $\tilde\eta_s(B)$ collects all $A \in \rho_{s+1}(n)$ that can merge two of their parts to get $B$. Thus, the double sum in the numerator on the right-hand side of (7) counts each $A$ exactly $(s+1)s/2$ times, from which the equation follows. Note that is required to prevent the case we have degenerate components. Although we only need , but for simplicity and consistence in the proof argument, we choose to have . Therefore, we achieve the conclusion of claim (7). ∎

The bound in Theorem 1 does not require the data-generating distribution to be a mixture distribution. In particular, the empirical average of the exponential term in equation (9) goes to infinity as $n \to \infty$, provided that the true underlying distribution has finite and nonzero variance. This result is implied by the moment generating function of the chi-squared distribution. Therefore, given the result of Theorem 1, we obtain the following corollary:

###### Corollary 2.

If the conditions in Theorem 1 hold as $n \to \infty$, then we obtain that

$$\lim_{n \to \infty} R(s \mid \{x_i\}_{i=1}^n) = \infty.$$

Combining the results from Theorem 1 and Corollary 2, we can see that for any true distribution with finite but nonzero variance, the posterior probability of obtaining $s+1$ clusters will eventually exceed that of obtaining $s$ clusters, and their ratio will grow without bound. If the true distribution has a finite number of components, more samples may even worsen the result, since the model will ultimately fit an infinite number of clusters almost surely. In finite samples, however, the behavior depends more on the properties of the distribution.

## 4 Gaussian Prior

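Moving beyond the uniform prior, we consider the Gaussian prior on the parameter $\theta$, which has been widely employed with DPMMs [3, 16]. In particular, we choose the prior density coming from the univariate Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with fixed variance $\sigma^2$, namely, $\pi(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\theta^2}{2\sigma^2} \right)$ for all $\theta \in \mathbb{R}$. Given this prior on $\theta$, we have the following asymptotic result regarding the lower bound of $R(s \mid \{x_i\}_{i=1}^n)$: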

###### Theorem 3.

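For the DPMM defined in (3), with a Gaussian prior on $\theta$, as $n$ goes to infinity, the ratio $R(s \mid \{x_i\}_{i=1}^n)$ satisfies the following asymptotic lower bound:

$$\lim_{n \to \infty} R(s \mid \{x_i\}_{i=1}^n) \geq \frac{C \alpha}{s^2} \cdot \frac{1}{1 + \sqrt{\sigma^2}}, \tag{10}$$

where $C$ is a universal constant.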

###### Remark.

This result also holds with high probability in finite samples, provided the number of samples is sufficiently large. However, the number of samples required to attain this bound with fixed probability depends heavily on the true data distribution.

###### Proof.

To ease the ensuing presentation, we reuse the notation from the proof of Theorem 1 in this proof. Direct computations yield the following result:

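$$\frac{m(x_{A_1})\, m(x_{A_2})}{m(x_A)} = \frac{1}{\sqrt{\sigma^2}} \sqrt{\frac{(a_1 + a_2) + \frac{1}{\sigma^2}}{\left( a_1 + \frac{1}{\sigma^2} \right)\left( a_2 + \frac{1}{\sigma^2} \right)}} \exp\!\left( \frac{1}{2} \left( \frac{a_1^2 \bar X_1^2}{a_1 + \frac{1}{\sigma^2}} + \frac{a_2^2 \bar X_2^2}{a_2 + \frac{1}{\sigma^2}} - \frac{a^2 \bar X^2}{a + \frac{1}{\sigma^2}} \right) \right),$$

where $a = a_1 + a_2$ and $\bar X$ is the mean of $x_A$ with $A = A_1 \cup A_2$. To simplify the notation, we let $\tau = 1/\sigma^2$ be the precision, and rewrite the above expression as:

$$\sqrt{\tau} \cdot \sqrt{\frac{a_1 + a_2 + \tau}{(a_1 + \tau)(a_2 + \tau)}} \exp\!\Bigg( \frac{1}{2} \underbrace{\left( \frac{a_1^2 \bar X_1^2}{a_1 + \tau} + \frac{a_2^2 \bar X_2^2}{a_2 + \tau} - \frac{a^2 \bar X^2}{a + \tau} \right)}_{F(\tau;\, x_{A_1},\, x_{A_2})} \Bigg).$$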

Note that the term $F$ is nonnegative at zero since

$$F(0; x_{A_1}, x_{A_2}) = \frac{a_1 a_2}{a_1 + a_2} \left( \bar X_1 - \bar X_2 \right)^2 \geq 0.$$
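The closed-form ratio under the Gaussian prior can be compared with marginals obtained by numerical integration. The sketch below assumes a unit-variance normal kernel and a zero-mean $\mathcal{N}(0, \sigma^2)$ prior on the cluster mean, which is what the exponent above corresponds to; the integration window, step count, and toy blocks are arbitrary illustrative choices.

```python
from math import exp, isclose, pi, sqrt

def F(tau, block1, block2):
    """The exponent term F(tau; x_A1, x_A2), written with block sums a_i * mean_i."""
    a1, a2 = len(block1), len(block2)
    u, v = sum(block1), sum(block2)
    return u * u / (a1 + tau) + v * v / (a2 + tau) - (u + v) ** 2 / (a1 + a2 + tau)

def ratio_closed_form(block1, block2, sigma2):
    """m(x_A1) m(x_A2) / m(x_A) with A the union, using tau = 1 / sigma2."""
    a1, a2 = len(block1), len(block2)
    tau = 1.0 / sigma2
    prefactor = sqrt(tau) * sqrt((a1 + a2 + tau) / ((a1 + tau) * (a2 + tau)))
    return prefactor * exp(0.5 * F(tau, block1, block2))

def marginal_numeric(block, sigma2, half_width=25.0, steps=50000):
    """Midpoint-rule value of the integral of prod_i N(x_i; theta, 1) N(theta; 0, sigma2)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        theta = -half_width + (k + 0.5) * h
        log_val = -0.5 * theta * theta / sigma2
        for x in block:
            log_val -= 0.5 * (x - theta) ** 2
        total += exp(log_val)
    norm = (2 * pi) ** (-len(block) / 2) / sqrt(2 * pi * sigma2)
    return total * h * norm

block1, block2, sigma2 = [0.5, 1.0, 1.5], [2.0, 2.5], 4.0
closed = ratio_closed_form(block1, block2, sigma2)
numeric = (marginal_numeric(block1, sigma2) * marginal_numeric(block2, sigma2)
           / marginal_numeric(block1 + block2, sigma2))

# F(0) should equal a1 a2 / (a1 + a2) * (mean1 - mean2)^2.
f_zero = F(0.0, block1, block2)
f_zero_expected = 3 * 2 / (3 + 2) * (1.0 - 2.25) ** 2
print(closed, numeric, f_zero, f_zero_expected)
```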

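Solving a quadratic equation shows that the positive roots of $F(\cdot\,; x_{A_1}, x_{A_2})$ form the set

$$
\begin{cases}
\dfrac{1}{4} \left[ \sqrt{\dfrac{8 a_1 a_2 \left( \bar X_1 - \bar X_2 \right)^2}{\bar X_1 \bar X_2} + Q^2} + Q \right], & \text{if } \bar X_1 \bar X_2 > 0, \\[2ex]
\dfrac{1}{4} \left[ \sqrt{\dfrac{8 a_1 a_2 \left( \bar X_1 - \bar X_2 \right)^2}{\bar X_1 \bar X_2} + Q^2} + Q \right] \text{ or } \emptyset, & \text{if } \bar X_1 \bar X_2 < 0, \\[2ex]
\emptyset, & \text{if } \bar X_1 = 0,\ \bar X_2 \neq 0 \text{ or } \bar X_1 \neq 0,\ \bar X_2 = 0, \\[1ex]
\mathbb{R}_+, & \text{if } \bar X_1 = \bar X_2 = 0,
\end{cases}
$$

where $Q = \dfrac{a_1 \bar X_1^2 + a_2 \bar X_2^2 - 2 (a_1 + a_2) \bar X_1 \bar X_2}{\bar X_1 \bar X_2}$.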
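The same-sign case of the root formula can be sanity-checked numerically. In the sketch below, the expression for `q` is an assumption: it is what one obtains by clearing denominators in $F(\tau) = 0$ and dividing by $a_1 a_2 \bar X_1 \bar X_2$, and it should be read as a reconstruction rather than as the definition from the text. Only the case $\bar X_1 \bar X_2 > 0$ is handled; the function names and inputs are illustrative.

```python
from math import isclose, sqrt

def F(tau, a1, a2, mean1, mean2):
    """F(tau) written in terms of block sizes and block means."""
    u, v = a1 * mean1, a2 * mean2
    return u * u / (a1 + tau) + v * v / (a2 + tau) - (u + v) ** 2 / (a1 + a2 + tau)

def positive_root(a1, a2, mean1, mean2):
    """Positive root of F in the same-sign case; None when this sketch does not apply."""
    prod = mean1 * mean2
    if prod <= 0:
        return None  # other cases of the display are not handled here
    # Assumed form of Q, derived by clearing denominators in F(tau) = 0.
    q = (a1 * mean1 ** 2 + a2 * mean2 ** 2 - 2 * (a1 + a2) * prod) / prod
    root = 0.25 * (sqrt(8 * a1 * a2 * (mean1 - mean2) ** 2 / prod + q * q) + q)
    return root if root > 0 else None

root = positive_root(5, 5, 1.0, 2.0)
print(root, F(root, 5, 5, 1.0, 2.0))
print(F(0.5, 5, 5, 1.0, 2.0), F(5.0, 5, 5, 1.0, 2.0))
```

For these inputs $F$ is positive below the root and negative above it, which matches the role the root plays in the argument that follows: a fixed $\tau$ gives $F \geq 0$ once the root has grown past it.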

If there is no positive root or every positive number is a root, then $F(\tau; x_{A_1}, x_{A_2}) \geq 0$ for any $\tau > 0$. Otherwise, as shown above, there exists a unique positive root, a random variable depending on the data in the two blocks, whose probability density function favors larger and larger values as long as one of the block sizes $a_1, a_2$ goes to infinity. That is, the root grows larger in probability as $n$ increases, where the rate at which it scales up depends on the real data distribution. For any fixed $\tau$, as $n$ goes to infinity, it follows that for most partitions of $x_A$ into $x_{A_1}$ and $x_{A_2}$, $\tau$ falls between zero and the root, so we have that $F(\tau; x_{A_1}, x_{A_2}) \geq 0$.

We now return to the computation of the ratio $R(s \mid \{x_i\}_{i=1}^n)$. For fixed $s$ and a partition $B \in \rho_s(n)$, we define

$$U(B) := \left\{ i \in [s] : b_i \geq \frac{n}{s^2} \right\}.$$

For any $i \in U(B)$, since $b_i$ increases as $n$ increases, we may assume that the aforementioned condition $F \geq 0$ asymptotically holds for any fixed proportion (less than 1) of all the partitions of $B_i$. Note that for any positive integers $a_1, a_2$ and nonnegative number $\tau$, we have

$$\frac{a_1 + a_2 + \tau}{(a_1 + \tau)(a_2 + \tau)} > \frac{1}{2} \cdot \frac{1}{1 + \tau} \cdot \frac{a_1 + a_2}{a_1 a_2}.$$
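This elementary inequality is easy to probe on a grid before relying on it. The sketch below scans positive integers $a_1, a_2 \leq 40$ and a few nonnegative values of $\tau$ (both ranges are arbitrary) and collects any violations.

```python
def lhs(a1, a2, tau):
    """(a1 + a2 + tau) / ((a1 + tau) (a2 + tau))."""
    return (a1 + a2 + tau) / ((a1 + tau) * (a2 + tau))

def rhs(a1, a2, tau):
    """(1/2) * 1 / (1 + tau) * (a1 + a2) / (a1 a2)."""
    return 0.5 / (1.0 + tau) * (a1 + a2) / (a1 * a2)

taus = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1e4]
violations = [
    (a1, a2, tau)
    for a1 in range(1, 41)
    for a2 in range(1, 41)
    for tau in taus
    if not lhs(a1, a2, tau) > rhs(a1, a2, tau)
]
print(len(violations))
```

The margin is tightest for $a_1 = a_2 = 1$ and large $\tau$, where the two sides differ by a factor of $(2 + \tau)/(1 + \tau)$.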

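Then, for sufficiently large $n$ we have:

$$
\begin{aligned}
\sum_{A \in \tilde\eta_s(B)} \frac{p(A \mid x)}{p(B \mid x)}
&= \alpha \cdot \sum_{i=1}^{s} \sum_{j=1}^{b_i - 1} \frac{(j-1)!\,(b_i - j - 1)!}{(b_i - 1)!} \left( \sum_{A \in \tilde\eta_{i,j}(B)} \frac{m(x_{A_i})\, m(x_{A_{s+1}})}{m(x_{B_i})} \right) \\
&\overset{\text{w.h.p.}}{\geq} C_0\, \alpha \sqrt{\tau} \cdot \sum_{i \in U(B)} \sum_{j=1}^{b_i - 1} \frac{(j-1)!\,(b_i - j - 1)!}{(b_i - 1)!} \left( \sum_{A \in \tilde\eta_{i,j}(B)} \sqrt{\frac{b_i + \tau}{(j + \tau)(b_i - j + \tau)}} \right) \\
&\geq C_0\, \alpha \sqrt{\tau} \cdot \sum_{i \in U(B)} \sum_{j=1}^{b_i - 1} \frac{(j-1)!\,(b_i - j - 1)!}{(b_i - 1)!} \left( \sum_{A \in \tilde\eta_{i,j}(B)} \frac{1}{\sqrt{2 (1 + \tau)}} \sqrt{\frac{b_i}{j (b_i - j)}} \right) \\
&\geq C\, \frac{\sqrt{\tau}}{1 + \sqrt{\tau}} \cdot \alpha,
\end{aligned} \tag{11}
$$

with high probability, where $C$ is a universal constant. Here, the second step follows with high probability by our previous argument, where $C_0$ is a positive universal constant between zero and one. The last step follows by an argument similar to the uniform-prior case, with $C$ being some constant independent of $n$, except that here it is possible for $U(B)$ to contain a single index, so the result can only be bounded by a constant multiple of $\alpha$, without the factor $s$ obtained in the uniform case.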

Finally, note that as $n$ goes to infinity, the above result holds with probability 1. Therefore, we obtain that

 limn→∞R(s|{xi}ni=1)=limn→∞P(Kn=s+1|{xi}ni=1)P(Kn=s|{xi}ni=1)≿αs2⋅√τ