Flexible clustering via hidden hierarchical Dirichlet priors

by   Antonio Lijoi, et al.

The Bayesian approach to inference stands out for naturally allowing borrowing information across heterogeneous populations, with different samples possibly sharing the same distribution. A popular Bayesian nonparametric model for clustering probability distributions is the nested Dirichlet process, which however has the drawback of grouping distributions in a single cluster when ties are observed across samples. With the goal of achieving a flexible and effective clustering method for both samples and observations, we investigate a nonparametric prior that arises as the composition of two different discrete random structures and derive a closed-form expression for the induced distribution of the random partition, the fundamental tool regulating the clustering behavior of the model. On the one hand, this allows to gain a deeper insight into the theoretical properties of the model and, on the other hand, it yields an MCMC algorithm for evaluating Bayesian inferences of interest. Moreover, we single out limitations of this algorithm when working with more than two populations and, consequently, devise an alternative more efficient sampling scheme, which as a by-product, allows testing homogeneity between different populations. Finally, we perform a comparison with the nested Dirichlet process and provide illustrative examples of both synthetic and real data.



page 8

page 16

page 28

page 30


Latent nested nonparametric priors

Discrete random structures are important tools in Bayesian nonparametric...

The semi-hierarchical Dirichlet Process and its application to clustering homogeneous distributions

Assessing homogeneity of distributions is an old problem that has receiv...

Powered Dirichlet Process for Controlling the Importance of "Rich-Get-Richer" Prior Assumptions in Bayesian Clustering

One of the most used priors in Bayesian clustering is the Dirichlet prio...

A Common Atom Model for the Bayesian Nonparametric Analysis of Nested Data

The use of high-dimensional data for targeted therapeutic interventions ...

Inference of global clusters from locally distributed data

We consider the problem of analyzing the heterogeneity of clustering dis...

Enriched Pitman-Yor processes

In Bayesian nonparametrics there exists a rich variety of discrete prior...

Kullback-Leibler Divergence for Bayesian Nonparametric Model Checking

Bayesian nonparametric statistics is an area of considerable research in...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dirichlet process (DP) mixtures are well-established and highly successful Bayesian nonparametric models for density estimation and clustering, which also enjoy appealing frequentist asymptotic properties

(Lo1984a; Escobar1994; Escobar1995; Ghosal2017a). However, they are not suitable to model data that are recorded under different, though related, experimental conditions. This is due to exchangeability implying a common underlying distribution across populations, a homogeneity assumption which is clearly too restrictive. To make things concrete we consider the Collaborative Perinatal Project, which is a large prospective epidemiologic study conducted from 1959 to 1974 (analyzed in Section 5.3), where pregnant women were enrolled in 12 hospitals and followed over time. Using a standard DP mixture on the patients enrolled across all 12 hospitals would correspond to ignoring the information on the specific center where the data are collected and, thus, the heterogeneity across samples. The opposite, also unrealistic, extreme case corresponds to modeling data from each hospital independently, thus ignoring possible similarities among them.

A natural compromise between the aforementioned extreme cases is partial exchangeability (DeFinetti1938), which entails exchangeability within each experimental condition (but not across) and dependent population–specific distributions (thus allowing borrowing of information). See Kallenberg2006 for a detailed account of the topic. In this framework the proposal of dependent versions of the DP date back to the seminal papers of Cifarelli1978c and MacEachern1999; MacEachern2000. Dependent DPs can be readily used within mixtures leading to several success stories in topic modeling, biostatistics, speaker diarization, genetics, fMRI analysis, and so forth. See Dunson2010d; Teh2010b; Foti2015; ddp2021_Quintana and references therein.

Two hugely popular dependent nonparametric priors, which will also represent the key ingredients of the present contribution, are the hierarchical Dirichlet process (HDP) (Teh2006) and the nested Dirichlet process (NDP) (Rodriguez2008a). The HDP clusters observations within and across populations. The NDP aims to cluster both population distributions and observations, but as shown in Camerlenghi2019c, does not achieve this goal. In fact, if there is a cluster of observations shared by different samples, the model degenerates to exchangeability across samples. This issue is successfully overcome in Camerlenghi2019c by introducing latent nested nonparametric priors. However, while this proposal has the merit of being the first to solve the degeneracy problem, it suffers from other limitations in terms of implementation and modeling: (a) with data from more than two populations the analytical and computational burden implied by the additive structure becomes overwhelming; (b) the model lacks the flexibility needed to capture different weights that common clusters may feature across different populations. More details can be found in the discussion to Camerlenghi2019c.

The goal of this paper is thus to devise a principled Bayesian nonparametric approach, which allows to cluster simultaneously distributions and observations (within and across populations). We achieve this by blending peculiar features of both the NDP and the HDP into a model, which we term Hidden Hierarchical Dirichlet Process (HHDP). Importantly, the HHDP overcomes the above-mentioned theoretical, modeling, and computational limitations since it, respectively, does not suffer from the degeneracy flaw, is able to effectively capture different weights of shared clusters and allows to handle several populations as showcased in the real data application. Note that the idea of the model was first hinted at in james2008discussion and, later, considered in Agrawal2013 from a mere computational point of view without providing results on distributional properties that are relevant for Bayesian inference. Hence, as a by-product, our theoretical results shed also some light on the topic modeling applications of Agrawal2013. Additionally, the same model was independently applied in balocchi2021clustering to successfully cluster urban areal units at different levels of resolution simultaneously.

Section 2 concisely reviews the HDP and the NDP with a focus on the random partitions they induce. In Section 3 we define the HHDP and investigate its properties, foremost its clustering structure (induced by a partially exchangeable array of observations). These findings lead to the development of marginal and conditional Gibbs sampling schemes in Section 4. In Section 5 we draw a comparison between HHDP and NDP on synthetic data and present a real data application for our model. Finally, Section 6 is devoted to some concluding remarks and possible future research.

2 Bayesian nonparametric priors for clustering

The assumption of exchangeability that characterizes widely used Bayesian inferential procedures is equivalent to assuming data homogeneity. This is not realistic in many applied contexts, for instance, for data recorded under

different experimental conditions inducing heterogeneity. A natural assumption that relaxes exchangeability and is suited for arrays of random variables

is partial exchangeability, which amounts to assuming homogeneity within each population, though not across different populations. This is characterized by

for every finitary permutation with henceforth denoting equality in distribution. Thanks to de Finetti’s representation theorem for partially exchangeable arrays, the dependence structure is effectively represented through the following hierarchical formulation


Here we focus on priors defined as compositions of discrete random structures and including, as special cases, both the HDP and the NDP. More specifically, we consider in (1) that is defined as follows


with discrete random probability measures (), and . The data are denoted by with and the size of the th sample. Discreteness of these random structures entails that with positive probability there are ties within each sample and also across samples , i.e.  for any , and for any . Hence, induces a random partition of the integers with , whose distribution encapsulates the whole probabilistic clustering of the model and is, therefore, the key quantity to study. Importantly, the random partition can be characterized in terms of the partially exchangeable partition probability function (pEPPF) as defined in Camerlenghi2019. The pEPPF is the natural generalization of the concept of exchangeable partition probability function (EPPF) for the exchangeable case (see e.g. Pitman2006). More precisely, is the number of distinct values among the observations in the overall sample . The vector of frequency counts is denoted by with indicating the number of elements in the th sample that coincide with the th distinct value in order of arrival. Clearly, and . One may well have , which implies that the th distinct value is not recorded in the th sample, though by virtue of it must be recorded at least in one of the samples. The th distinct value is shared by any two samples and if and only if . The probability law of the random partition is characterized by the pEPPF defined as


with the constraint , for each and where is the space in which the ’s take values and is the collection of vectors in whose entries are all distinct. We stress that the expected value in (3) is computed with respect to the joint law of the vector of random probabilities , that is the de Finetti measure in (1). Hence, the pEPPF may also be interpreted as a marginal likelihood when directly model the observations according to (1). Obviously, for a single population, that is , the standard EPPF is recovered and (3) is further interpretable as an extension of a product partition model to a multiple samples framework. As such, it provides an alternative approach to popular covariate–dependent product partition models. See, e.g., MQR, Page_Quintana_2016 and Page_Quintana.

If we specify and such that they give rise to an NDP, then one may have ties also among the population probability distributions , i.e.  for any . Therefore, in the framework of (1) and (2), one may investigate two types of clustering: (i) distributional clustering, which is related to and (ii) observational clustering, which refers to . The composition of these two clustering structures is the main tool we rely on to devise a simple, yet effective, model that considerably improves over existing alternatives.

2.1 Hierarchical Dirichlet process

Probably the most popular nonparametric prior for the partially exchangeable case is the HDP of Teh2006, which can be nicely framed in the composition scheme (2) as


where denotes the law of a DP with concentration parameter and baseline probability measure . Here we assume that is a non–atomic probability measure on and we refer to such prior as the -dimensional HDP denoted by . Hence, the ’s share the atoms through and this leads to the creation of shared clusters of observations (or latent features) across the groups. The pEPPF induced by a partially exchangeable array in (1) with has been determined in Camerlenghi2019. It is important to stress that the model is not suited for comparing populations’ distributions since for any (unless the ’s are degenerate at , in which case all distributions are equal). Similar compositions have been considered in Camerlenghi2019 and, later, in argiento and bassetti. Hierarchically dependent mixture hazards have been introduced in camerlenghi2021survival. Anyhow, the HDP and its variations cannot be used to cluster both populations and observations. To achieve this, one has to rely on priors induced by nested structures, the most popular being the NDP.

2.2 Nested Dirichlet process

The NDP, introduced by Rodriguez2008a, is the most widely used nonparametric prior allowing to cluster both observations and populations. However, as proved in Camerlenghi2019c, it suffers from a degeneracy issue, because even a single tie shared across samples is enough to group the population distributions into a single cluster.

Like the HDP, also the NDP can be framed in the composition structure (2) as


where is a random probability measure on the space of probability measures on and is degenerate at the atom , which is the law of a DP on the sample space . As in (4), is assumed to be a non-atomic probability measure on . Henceforth, we write . By virtue of the well–known stick–breaking representation of the DP (Sethuraman1994a) one has


where the weights and the random distributions are independent. Recall that GEM stands for the distribution of probability weights after Griffiths, Engen, and McCloskey, according to the well-established terminology of Ewens1990. Given a sequence such that , this means that and , for any . Since for any , generates ties among the random distributions ’s with positive probability and, thus, clusters populations. Furthermore, a structure similar to the one displayed in (6) holds for each , i.e.

and, due to the non–atomicity of , the are all distinct values.

The discrete structure of the ’s generates ties across the samples with positive probability. For example, for any . Hence, the ’s induce the clustering of the observations .

If the data are modeled as in (1), with , conditional on a partition of the ’s the observations from populations allocated to the same cluster are exchangeable and those from populations allocated to distinct clusters are independent. This potentially appealing feature of the NDP is however the one responsible for the above-mentioned degeneracy issue. For exposition clarity, consider the case of populations. If the two populations belong to different clusters, i.e. , they cannot share even a single atom due to the non–atomicity of . Hence, for any and . Therefore there is neither clustering of observations nor borrowing of information across different populations. On the contrary, . These two findings are quite intuitive. Indeed, means they are independent realizations of a DP with atoms iid from the same non-atomic probability distribution and, thus, they are almost surely different. Instead, corresponds to all observations coming from the same population distribution, more precisely from the same DP, and ties occur with positive probability. A less intuitive fact is that when a single atom, say , is shared between and the model degenerates to the exchangeable case, namely and the two populations have (almost surely) equal distributions. Hence, the NDP is not an appropriate specification when aiming at clustering both populations and observations across different populations. This was shown in Camerlenghi2019c where, spurred by this anomaly of the NDP, a novel class of priors named latent nested processes (LNP) designed to ensure that is proposed. However, while this formally solves the problem, it has computational and modeling limitations. On the one hand, the implementation of LNPs with more than two samples is not feasible due to severe computational hurdles. On the other hand, LNPs have limited flexibility since the weights of the common clusters of observations across different populations are the same. This feature is not suited to several applications and the discussion to Camerlenghi2019c provides interesting examples. See also soriano_ma2019; christ_ma2020; denti2021common; beraha2021semi for further stimulating contributions to this literature.

Hence, within the composition structure framework (2), our goal is to obtain a prior distribution able to infer the clustering structure of both populations and observations, which is highly flexible and implementable for a large number of populations and associated samples.

3 Hidden hierarchical Dirichlet process

Our proposal consists in blending the HDP and the NDP in a way to leverage on their strengths, namely clustering data across multiple heterogeneous samples for the HDP and clustering different populations (or probability distributions) for the NDP. More precisely we combine these two models in a structure (2) as

This leads to the following definition.

Definition 1.

The vector of random probability measures is a hidden hierarchical Dirichlet process (HHDP) if

with and independent. In the sequel we write .

In terms of a graphical model, the HHDP can be represented as in Figure 1.

Figure 1: Graphical model representing the dependencies for a . Here the ’s are auxiliary integer–valued random variables that assign each to a specific atom of .

The sequence acts as a hidden, or latent, component that is crucial to avoid the bug of the NDP, namely clustering of populations when they share some observations. Moreover, by extending (4) to , it can be more conveniently represented as


where independence holds true between the sequences and and between and . Combining the stick-breaking representation and a closure property of the DP with respect to grouping, one further has

where , and , for .

In this scheme, the clustering of populations is governed, a priori, by the NDP layer through . However, the aforementioned degeneracy issue of the NDP, a posteriori, is successfully avoided. The intuition is quite straightforward: unlike for the NDP, the distinct distributions in the HHDP are dependent and have a common random discrete base measure , which leads to shared atoms across the ’s and thus borrowing of information, similarly to the HDP case.

3.1 Some distributional properties

Given the discreteness of , the key quantity to derive is the induced random partition, which controls the clustering mechanism of the model. However, it is useful to start with a description of pairwise dependence of the elements of the vector , which allows a better understanding of the model and intuitive parameter elicitation. To this end, as customary, we evaluate the correlation between and : whenever it does not depend on the specific measurable set , it is used as a measure of overall dependence between and .

Proposition 1.

If and is a measurable subset of , then

Arguments similar to those in the proof of Proposition 1 lead to determine the correlation between observations, either from the same or from different samples.

Proposition 2.

If are from according to (1), then

Figure 2:

Correlations as functions of the hyperparameters

and with . The left plot represents the correlation between random probabilities , the middle one between observations collected in the same population and the right one between observations from different populations.

The correlation between observations of the same sample depends only on the parameters of the underlying HDP that governs the atoms : this is not surprising since, whatever the value of the parameter at the NDP layer, observations from the same sample are exchangeable. Moreover, an appealing feature is that such a correlation is higher than for the case of observations from different samples, i.e. . As for the dependence on the hyperparameters , when the ’s ar forced to equal different unique distributions , similarly to the NDP case. However, unlike the NDP, this does not imply that the distributions are independent, and the correlation is controlled by the hyperparameters and (increasing in and decreasing in ). In Fig. 2 we report the aforementioned correlations as functions of and with set equal . Finally, if the a priori probability to degenerate to the exchangeable case, i.e. all ’s coincide a.s., tends to and so does also .

We now investigate the random partition structure associated with a HHDP, namely the partition of , with , induced by a partially exchangeable sample modeled as in (1). Since a arises from the composition of two discrete random structures, it is clear that the partition induced by will depend on the partition, say , of the random probability measures . As for the latter, the ’s are drawn from a discrete random probability measure on whose weights have a distribution and whose atoms are almost surely different since they are sampled from an . Then the probability distribution of is the well–known Ewens sampling formula, namely

where is a partition of , with , the frequencies are such that and . This structure a priori implies, as in the NDP case, that for any . However, unlike the NDP, a posteriori the HHDP yields , regardless of the shared clusters across the samples . Moreover, let denote the pEPPF of a , namely

where , and . An explicit expression of has been established in Camerlenghi2019, even beyond the DP case. Now we can state the pEPPF induced by in (1), where is the law of a .

Theorem 1.

The random partition induced by the partially exchangeable array drawn from , according to (1), is characterized by the following pEPPF


where the sum runs over all partitions of and for each , .

Given the composition structure underlying the , the pEPPF (8) unsurprisingly is a mixture of pEPPF’s induced by different HDPs. For ease of interpretation consider the case of populations and note that the pEPPF boils down to


where is the EPPF of a single , namely , while is the pEPPF of a with two samples, namely . Clearly (9) arises from mixing with respect to partitions of in either and groups, where the former corresponds to exchangeability across the two populations. Still for the case

, a straightforward application of the pEPPF leads to the posterior probability of gathering the two probability curves,

and , in the same cluster thus making the two samples exchangeable, or homogeneous.

Proposition 3.

If the sample is from , according to (1), the posterior probability of degeneracy is


where and are the EPPF and the pEPPF induced by the for a single exchangeable sample and for two partially exchangeable samples, respectively.

The pEPPF is a fundamental tool in Bayesian calculus and it plays, in the partially exchangeable framework, the same role of the EPPF in the exchangeable case. Indeed, the pEPPF governs the learning mechanism, e.g. the strength of the borrowing information, clustering, and, in view of Proposition 3, it allows to perform hypothesis testing for distributional homogeneity between populations. Finally, one can obtain a Pólya urn scheme that is essential for inference and prediction, See LABEL:subsec:MargGibbs in the Supplementary Material. In the next section, we provide a characterization of the that is reminiscent of the popular Chinese restaurant franchise metaphor for the HDP and allows us to devise a suitable sampling algorithm and further understand the model behavior.

3.2 The hidden Chinese restaurant franchise

The marginalization of the underlying random probability measures, as displayed in Theorem 1, can be characterized in terms of a hidden Chinese restaurant franchise (HCRF) metaphor. This representation sheds further light on the HHDP and clarifies the sense in which it generalizes the well-known Chinese restaurant (CRP) and franchise (CRF) processes induced by the DP and the HDP, respectively. For simplicity we consider the case .

As with simpler sampling schemes, all restaurants of the franchise share the same menu, which has an infinite number of dishes generated by the non–atomic base measure . However, unlike the standard CRF, the restaurants of the franchise are merged into a single one if , while they differ if . Moreover, each identifies the label of the dish that customer from the –th population chooses from the shared menu , with the unique dishes . If , customers may be assigned to different restaurants and when , they are all seated in the same restaurant. Given such a grouping of the restaurants, the customers are, then, seated according to the CRF applied either to a single restaurant or to two distinct restaurants (Teh2006; Camerlenghi2018e). Furthermore, each restaurant has infinitely many tables. The first customer who arrives at a previously unoccupied table chooses a dish that is shared by all the customers who will join the table afterward. It is to be noted that distinct tables within each restaurant and across restaurants may share the same dish. An additional distinctive feature, compared to the CRF, is that tables can be shared across populations when they are assigned to the same restaurant, i.e. when . Accordingly, the allocation of each customer to a specific restaurant clearly depends on having either or .

The sampling scheme simplifies if latent variables ’s, denoting the tables’ labels for customer from population , are introduced. We stress that, if , the number of shared tables across the two populations is zero, given the populations are assigned to different restaurants, labeled , respectively. Conversely, if , one may have shared tables across populations, since they are assigned to the same restaurant .

Now define as the frequencies of observations sitting at table eating the th dish, for a table specific to restaurant . Moreover, is the dish label corresponding to table and the frequency of tables serving dish in restaurant . Marginal frequencies are represented with dots, e.g.  is the number of tables in restaurant . Throughout the symbol identifies either a set or a frequency obtained upon removing the element from . Finally, stands for an indicator function such that if , while if .

The stepwise structure of the sampling procedure reflects the composition of the three layers , and in (7) relying on a conditional CRF. First, one sample the populations’ clustering and, given the allocations of the populations to the restaurants, one has a CRF. Hence, the algorithm becomes

  1. Sample the population assignments to the restaurants from .

  2. Sequentially sample the table assignments and corresponding dishes from

where is the index set associated to the future random variables not yet sampled.

4 Posterior Inference for HHDP mixture models

Thanks to the results of Section 3, we now devise MCMC algorithms for drawing posterior inferences with mixture models driven by a HHDP. Though the samplers are tailored to mixture models, they are easily adapted to other inferential problems such as e.g. survival analysis and species sampling. Henceforth, is a density kernel and we consider


We develop two samplers: (i) a marginal algorithm that relies on the posterior degeneracy probability (Proposition 3) in LABEL:subsec:MargGibbs of the Supplementary Material; (ii) a conditional blocked Gibbs sampler, in the same spirit of the sampler proposed for the NDP by Rodriguez2008a, in Section 4.1. As for (i), the underlying random probability measures and ’s are integrated out leading to urn schemes that extend the class of Blackwell-MacQueen Pólya urn processes. In such a way we generalize the a posteriori sampling scheme of the Chinese restaurant process for the DP mixture Neal2000 and the one of the Chinese restaurant franchise for the HDP mixture (Teh2006). In the Supplementary Material, we describe the marginal sampler for the case of populations. Even if in principle it can be generalized in a straightforward way, it is computationally intractable for a larger number of populations. Similarly to the hidden Chinese restaurant franchise situation, one has to evaluate the posterior probability of all possible groupings of , which boils down to when but becomes involved for .

This shortcoming is overcome by the conditional algorithm we derive in Section 4.1, which relies on finite–dimensional approximations of the trajectories of the underlying random probability measure. Its effectiveness in dealing with populations is further illustrated in the synthetic data example 5.2 and in the application of Section 5.3.

4.1 A conditional blocked Gibbs sampler

A more effective algorithm is based on a simple blocked conditional procedure. To this end, we use a finite approximation of the DP in the spirit of Muliere1998 and Ishwaran2001. However, instead of truncating the stick–breaking representation of the DP, we use a finite Dirichlet approximation. See Ishwaran2002a. Therefore, we approximate , with a – and an –dimensional Dirichlet distribution, respectively. More precisely, we consider the following approximation


implying that , for .

Introduce the auxiliary variables and which represent the distributional and observational cluster memberships, respectively, such that and if and only if and . Henceforth, and, in order to identify the full conditionals of the Gibbs sampler, we note that under the finite Dirichlet approximation (12)

This leads to the following

  1. Sample the unique from

  2. Sample distributional cluster probabilities from

    with .

  3. Sample probability weights of the base DP from


    with .

  4. Sample the observational cluster probabilities independently from

    with .

  5. Sample distributional and observational cluster membership from

Importantly, all the full conditional distributions are available in simple closed forms, with the exception of the distributions of and, possibly, of . To update we perform a Metropolis-Hastings step, where we work on the unconstrained space after the transformation and we adopt a component–wise adaptive random walk proposal following Roberts2009. The update of the unique atoms is standard, as with the DP mixture model in the exchangeable case.

In Section 5 we assume a Gaussian kernel and a conjugate Normal-inverse-Gamma base measure and obtain

with , , , and