Since the introduction of the Dirichlet process in Ferguson’s seminal paper [4, 2], models based on the Dirichlet process and related combinatorial stochastic processes, such as the Pitman-Yor process (see for example ), have been used for a variety of statistical and machine learning problems. These models are generally mixture models, with the Dirichlet process or Pitman-Yor process used as a prior on the mixture components. Statistical applications have included a wide variety of problems in density estimation [3, 6, 7, 25, 11, 8, 9, 27, 10] and parameter estimation [26, 23, 5, 21]. Applied areas that have been studied include computer vision and text processing, as well as scientific fields such as economics, astronomy, molecular biology, and genetics [26, 23, 20, 3, 14, 13, 17]. The use of Dirichlet process mixture models (DPMMs) in these fields is generally motivated by the assertion that they can be used to determine the number of components in nonparametric mixture models.
Classical clustering algorithms such as $k$-means or Gaussian mixture models (GMMs) generally require us to set the number of clusters a priori. However, in practice the true number of clusters is rarely known, and almost never known for dynamically growing data sets. This has motivated the use of DPMMs to find the number of clusters. Unlike $k$-means or GMMs, the DPMM is based on the Dirichlet process, which has infinitely many components and does not require one to specify the number of components in advance. The goal is to use the posterior distribution of the number of clusters to find an optimal choice for clustering.
However, unlike the case of density estimation, a theoretical understanding of the convergence of the number of components in the DPMM is still largely missing in the literature. On the negative side, it has been demonstrated that the DPMM and the Pitman-Yor process mixture model (PYPMM) may exhibit posterior inconsistency in the number of components when the true number of components is finite [18, 19]. Moreover, in practice, it has been observed that DPMM-based inference can generate small clusters that do not reflect the underlying data-generating process, especially when the true number of components is small rather than infinite. Despite these observations, we have little quantitative understanding of the behavior of the posterior distribution of the number of components when the number of samples goes to infinity, let alone in the nonasymptotic regime.
To fill this gap, in this work we study the posterior distribution of the number of clusters for DPMM-based clustering models. Our main results are lower bounds on the ratio of the posterior probabilities of obtaining $s+1$ clusters and $s$ clusters under Gaussian or uniform priors for the parameters, with different assumptions and constraints on the data. This yields a fine-grained understanding of the posterior distribution induced by the DPMM on the number of clusters, and positions us for future work on topics such as the rate of growth of the number of clusters in the posterior when the number of clusters is not fixed but grows with the sample size.
2 Model Description
We first introduce some key notation. We use to denote the samples , to denote the set , to denote the set such that ’s form an s-partition of , where is the set of all -partitions on .
where stands for a given prior on the parameter while is a known family of density functions.
The DPMM has been widely used in machine learning and statistics for problems including density estimation and parameter estimation. In this paper, we specifically focus on the application of DPMM to clustering problems. For this application, the prior for the number of clusters with samples is given by
Given the above prior distribution for the number of clusters, the posterior for the number of clusters admits the following formulation:
The central goal of this work is a rigorous study of the behavior of this posterior under two representative choices of the prior and different assumptions on the data-generating process. In particular, to ease the ensuing discussion, we use to denote the cluster probability:
where in the above integral for all . Given this definition of , we can rewrite as follows:
To understand the behavior of the posterior distribution of the number of clusters in (4), we consider the ratio between its values at $s+1$ and $s$, which can be computed directly as follows:
Throughout the paper, we consider the Dirichlet mixture of standard normals, i.e., for all .
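As a sanity check on the prior over the number of clusters described in this section, the following Python sketch computes that prior for small sample sizes in two ways. It assumes the standard Chinese-restaurant-process form of the Dirichlet process partition prior, in which a partition into s blocks A_1, ..., A_s has weight proportional to alpha^s times the product of (|A_i| - 1)!. The concentration parameter `alpha`, the sample size `n`, and all function names are illustrative assumptions, not notation taken from this paper.

```python
import math


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):           # put `first` into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part               # or open a new block for `first`


def prior_by_enumeration(n, alpha):
    """P(s clusters) from summing alpha^s * prod (|A_i| - 1)! over partitions."""
    weights = [0.0] * (n + 1)
    for part in set_partitions(list(range(n))):
        w = alpha ** len(part)
        for block in part:
            w *= math.factorial(len(block) - 1)
        weights[len(part)] += w
    total = sum(weights)
    return [w / total for w in weights]


def prior_by_stirling(n, alpha):
    """Same distribution via unsigned Stirling numbers of the first kind."""
    # c[k] stores c(m, k); recurrence: c(m + 1, k) = m * c(m, k) + c(m, k - 1).
    c = [1] + [0] * n
    for m in range(n):
        new = [0] * (n + 1)
        for k in range(1, n + 1):
            new[k] = m * c[k] + c[k - 1]
        c = new
    rising = 1.0
    for j in range(n):
        rising *= alpha + j                  # alpha (alpha + 1) ... (alpha + n - 1)
    return [alpha ** k * c[k] / rising for k in range(n + 1)]


prior_a = prior_by_enumeration(6, alpha=1.0)
prior_b = prior_by_stirling(6, alpha=1.0)
for s in range(1, 7):
    print(s, round(prior_a[s], 6), round(prior_b[s], 6))
```

Brute-force enumeration is only feasible for very small `n`, since the number of partitions grows as the Bell numbers; the Stirling-number route scales to much larger `n` and is included to confirm that the two views of the prior agree.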
3 Uniform Prior
We first study the posterior distribution of the number of clusters of the DPMM when the data lie in a bounded set. Practically speaking, many data sets that naturally arise in fields such as biology, genetics, and economics are essentially bounded. In this setting, the parameter space is usually chosen to be a compact set.
In this section, to simplify the proof, we specifically consider a simple uniform prior on the parameter space where is a bounded segment of with size . With this choice, we have for all . We start with the following result regarding the lower bound of under certain assumptions on the data:
Theorem 1. Consider the DPMM defined in (3) with a uniform prior on . When is sufficiently large, if and for some , then the ratio between consecutive terms is lower bounded by
The condition of Theorem 1 regarding the data can be relaxed to requiring only most of the data to be within ; however, that weaker condition will require a slightly more complicated proof. Additionally, that condition is mild in many problems since for a clustering problem, when one applies a uniform prior for the parameters (the means of normal distributions), one expects the uniform prior to be big enough to capture the means of all the components.
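The setting of this section can be explored numerically on tiny data sets. The sketch below computes the exact posterior over the number of clusters by enumerating all partitions, for a mixture of unit-variance normals whose means have a uniform prior. Several ingredients are assumptions made for illustration only: the prior interval is taken to be symmetric, `[-c, c]`; the partition prior is the standard Chinese-restaurant-process weight with concentration `alpha`; and the data, `alpha`, `c`, and all function names are hypothetical.

```python
import math


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_marginal_uniform(xs, c):
    """log of the integral of prod_j N(x_j | theta, 1) / (2c) over [-c, c]."""
    k = len(xs)
    mean = sum(xs) / k
    ss = sum((x - mean) ** 2 for x in xs)    # within-cluster sum of squares
    root_k = math.sqrt(k)
    # Prior mass reachable by the cluster mean; assumes data near [-c, c],
    # otherwise this difference can underflow to zero.
    mass = norm_cdf(root_k * (c - mean)) - norm_cdf(root_k * (-c - mean))
    return (-0.5 * k * math.log(2.0 * math.pi) - 0.5 * ss
            + 0.5 * math.log(2.0 * math.pi / k)
            + math.log(mass) - math.log(2.0 * c))


def posterior_num_clusters(xs, alpha, c):
    """Exact posterior over the number of clusters, by brute force."""
    n = len(xs)
    log_terms = [[] for _ in range(n + 1)]
    for part in set_partitions(list(range(n))):
        lw = len(part) * math.log(alpha)
        for block in part:
            lw += math.lgamma(len(block))    # log (|A_i| - 1)!
            lw += log_marginal_uniform([xs[j] for j in block], c)
        log_terms[len(part)].append(lw)
    top = max(max(t) for t in log_terms if t)  # shift for numerical stability
    unnorm = [sum(math.exp(v - top) for v in t) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


xs = [-1.2, -0.8, -1.0, 0.9, 1.1, 1.3, 0.1]    # hypothetical data inside [-3, 3]
post = posterior_num_clusters(xs, alpha=1.0, c=3.0)
for s in range(1, len(xs)):
    print(s, post[s], post[s + 1] / post[s])   # ratio between consecutive terms
```

The last column printed is the ratio between consecutive posterior probabilities, the quantity that Theorem 1 bounds from below. Enumeration limits this check to a handful of points, so it illustrates the definitions rather than the large-sample behavior.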
Now, for and , we define two key terms and as follows:
To avoid cluttering the notation, in the following we fix in our discussion and will write and , unless otherwise specified. To interpret the definitions above, we note that is the set of partitions in that can be obtained from by combining two elements of into one element and keeping all other elements unchanged. Conversely, is the set of partitions in from which one can combine two elements into one to get . Toward showing the result, we further define the posterior probabilities of a partition for to be:
Based on the above definitions, we can rewrite the ratio as follows:
Assume for the moment that the above claim holds (its proof is deferred to the end of the proof of Theorem 1). We proceed to finish the proof of the theorem. Let and denote the set of such that and for . Additionally, let be the subset of such that the corresponding has . Here, we note that the order in the partition does not matter, so we choose and for notational convenience. Furthermore, to ease the ensuing presentation, we denote and . Then, we obtain the following equations
Given the above results, we define the following shorthands:
With simple algebra, we can verify that
where . If the samples satisfy and for some , where for simplicity of presentation we may choose but note that the result holds for any with some constant depending on , then we have:
On the other hand, for any sets with sizes and means , whose union is with size and mean , we have
The above result leads to the following equation
Combining all the results above, we obtain the following inequality
Given the above inequality, we can derive the following bounds for the term in (8):
Direct computations lead to
The above result yields that
When , simple algebra indicates that . Additionally, the left hand side in the inequalities above is always no less than 2 for . Invoking these results, we have the following lower bound
Combining the above lower bound with equation (7), we eventually obtain the following bound on the ratio between consecutive terms
As a consequence, we reach the conclusion of the theorem.
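The merging relation between partitions with s+1 parts and partitions with s parts, which underlies claim (7), can be checked by direct enumeration. The sketch below is only an illustration: the choice `n = 6`, `s = 3` and all function names are hypothetical. It verifies that every partition with s+1 parts can be merged into exactly C(s+1, 2) distinct partitions with s parts, and that counting (finer, coarser) pairs from either side gives the same total.

```python
from itertools import combinations
from math import comb


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def canon(part):
    """Order-free, hashable representation of a partition."""
    return frozenset(frozenset(block) for block in part)


def merges(part):
    """All partitions obtained by merging exactly two blocks of `part`."""
    blocks = list(part)
    return {frozenset([x for x in blocks if x not in (a, b)] + [a | b])
            for a, b in combinations(blocks, 2)}


n, s = 6, 3
parts = [canon(p) for p in set_partitions(list(range(n)))]
fine = [p for p in parts if len(p) == s + 1]      # partitions with s + 1 parts
coarse = {p for p in parts if len(p) == s}        # partitions with s parts

# Each (s+1)-partition merges into exactly C(s+1, 2) distinct s-partitions.
merge_counts = [len(merges(p)) for p in fine]

# For each s-partition, count the (s+1)-partitions that merge into it.
refinements = {a: 0 for a in coarse}
for p in fine:
    for a in merges(p):
        refinements[a] += 1

# Splitting one block of size m into two nonempty parts: 2^(m-1) - 1 ways.
split_formula = {a: sum(2 ** (len(b) - 1) - 1 for b in a) for a in coarse}

print(len(fine), len(coarse), set(merge_counts), sum(refinements.values()))
```

The number of finer partitions that merge into a given coarser one depends on its block sizes, whereas the number of merges available from a finer partition is always the same; this asymmetry is why the counting factor in the claim appears on one side only.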
Proof of claim (7):
Using equation (4), we can rewrite the ratio between the posterior probability of components and that of components as follows:
Note that for each , we can merge any two of its parts to get some . The number of distinct ways to do so is exactly . Also, for each , the set collects all such that merging two of their parts gives . Thus, the index of the numerator on the right-hand side of the second equation counts each exactly , from which the equation follows. Note that is required to rule out degenerate components. Although we only need , for simplicity and consistency in the proof argument we choose to have . Therefore, we reach the conclusion of claim (7). ∎
The bound in Theorem 1 does not require the data-generating distribution to be a mixture distribution. In particular, note that the empirical average of the exponential term in equation (9) goes to infinity as , provided that the true underlying distribution has finite and nonzero variance. This result is implied by the moment generating function of the chi-squared distribution. Therefore, given the result of Theorem 1, we obtain the following corollary:
Corollary 2. If the conditions in Theorem 1 hold as , then we obtain that
Combining the results from Theorem 1 and Corollary 2, we see that for any true distribution with finite but nonzero variance, the posterior probability of obtaining $s+1$ clusters will eventually exceed that of obtaining $s$ clusters, and their ratio will grow without bound. Even if the original distribution has a finite number of components, the result may worsen with more samples, since the model will ultimately fit an infinite number of clusters almost surely. However, in finite samples, the behavior of the posterior depends more on the properties of the distribution.
4 Gaussian Prior
In this section, we consider a Gaussian prior on the parameter. In particular, we choose the prior density to be that of a univariate Gaussian distribution with fixed variance , namely, for all . Given this prior on , we have the following asymptotic result regarding the lower bound of :
For the DPMM defined in (3), with a Gaussian prior on , as goes to infinity, the ratio satisfies the following asymptotic lower bound:
where is a universal constant.
This result also holds with high probability in finite samples, provided the number of samples is sufficiently large. However, the number of samples required to attain this bound with fixed probability is highly dependent on the real data distribution.
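When comparing s+1 clusters with s clusters under a Gaussian prior, a natural quantity to examine is how the marginal likelihood of two separate clusters compares with that of their union. The sketch below computes this log ratio in closed form and checks the closed form against numerical quadrature. It assumes unit-variance normal components and a zero-mean Gaussian prior on the cluster mean with variance `sigma2`; the zero prior mean, the example data, and all function names are illustrative assumptions, and the ratio shown is not claimed to be the exact expression used in the proof.

```python
import math


def log_marginal_gaussian(xs, sigma2):
    """log of the integral of prod_j N(x_j | t, 1) * N(t | 0, sigma2) over t."""
    k = len(xs)
    total = sum(xs)
    sq = sum(x * x for x in xs)
    return (-0.5 * k * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + k * sigma2)
            - 0.5 * sq
            + 0.5 * sigma2 * total * total / (1.0 + k * sigma2))


def log_marginal_quadrature(xs, sigma2, lo=-20.0, hi=20.0, steps=20000):
    """Same integral by the trapezoid rule, as an independent check."""
    h = (hi - lo) / steps
    acc = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        logf = sum(-0.5 * (x - t) ** 2 - 0.5 * math.log(2.0 * math.pi)
                   for x in xs)
        logf += -0.5 * t * t / sigma2 - 0.5 * math.log(2.0 * math.pi * sigma2)
        weight = 0.5 if i in (0, steps) else 1.0
        acc += weight * math.exp(logf)
    return math.log(acc * h)


def log_split_ratio(xs_a, xs_b, sigma2):
    """log [m(A) m(B) / m(A union B)]; positive values favor keeping A, B apart."""
    return (log_marginal_gaussian(xs_a, sigma2)
            + log_marginal_gaussian(xs_b, sigma2)
            - log_marginal_gaussian(xs_a + xs_b, sigma2))


sigma2 = 4.0
far_a, far_b = [-1.1, -0.9, -1.0], [1.0, 1.2, 0.8]       # well separated
near_a, near_b = [-0.1, 0.1, 0.0], [0.05, -0.05, 0.0]    # overlapping

closed = log_marginal_gaussian(far_a, sigma2)
numeric = log_marginal_quadrature(far_a, sigma2)
print(closed, numeric)
print(log_split_ratio(far_a, far_b, sigma2))    # positive for this data
print(log_split_ratio(near_a, near_b, sigma2))  # negative for this data
```

For a single pair of clusters the ratio can fall on either side of one, depending on how far apart the cluster means are relative to the cluster sizes and the prior variance, which matches the remark that the finite-sample behavior depends on the data distribution.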
To ease the ensuing presentation, we reuse the notation from the proof of Theorem 1 in this proof. Direct computations yield the following result:
To simplify the notation, we let be the precision, and rewrite the above expression as:
Note that the term is nonnegative at zero since
Solving the quadratic equation shows that has its positive root at:
If there is no positive root or every positive number is a root, then for any . Otherwise, as shown above, there exists a unique positive root , a random variable depending on , whose probability density function favors larger and larger values as long as one of goes to infinity. That is, the root grows larger in probability as increases, where the rate at which it grows depends on the real data distribution. For any fixed , as goes to infinity, it follows that for most partitions of into and , falls into , so we have that .
We now return to the computation of the ratio . For fixed and a partition , we define
For any , since increases as increases, we may assume that the aforementioned condition that asymptotically holds for any fixed proportion (less than 1) of all the partitions of . Note that for any positive integers and nonnegative number , we have
Then, for sufficiently large we have:
with high probability where is a universal constant. Here, the second step follows with high probability by our previous argument where is a positive universal constant between zero and one, and the last step follows by a similar argument as in the case of uniform prior with being some constant independent of , except that here it is possible to have , so the result can only be bounded by a constant multiple of without the factor in the uniform case.
Finally, note that as goes to infinity, the above result holds with probability 1. Therefore, we obtain that