# Latent nested nonparametric priors

Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.


## 1 Introduction

Data that are generated from different (though related) studies, populations or experiments are typically characterised by some degree of heterogeneity. A number of Bayesian nonparametric models have been proposed to accommodate such data structures, but analytic complexity has limited understanding of the implied dependence structure across samples. The spectrum of possible dependence ranges from homogeneity, corresponding to full exchangeability, to complete heterogeneity, corresponding to unconditional independence. It is clearly desirable to construct a prior that can cover this full spectrum, leading to a posterior that can appropriately adapt to the true dependence structure in the available data.

This problem has been partly addressed in several papers. In Lijoi et al. (2014) a class of random probability measures is defined in such a way that proximity to full exchangeability or independence is expressed in terms of a $[0,1]$–valued random variable. In the same spirit, a model decomposable into idiosyncratic and common components is devised in Müller et al. (2004). Alternatively, approaches based on Pólya tree priors are developed in Ma & Wong (2011); Holmes et al. (2015); Filippi & Holmes (2017), while a multi–resolution scanning method is proposed in Soriano & Ma (2017). In Bhattacharya & Dunson (2012) Dirichlet process mixtures are used to test homogeneity across groups of observations on a manifold. A popular class of dependent nonparametric priors that fits this framework is the nested Dirichlet process of Rodríguez et al. (2008), which aims at clustering the probability distributions associated to $d$ populations. For $d=2$ this model is

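$$
\begin{aligned}
(X_{i,1}, X_{j,2}) \mid (\tilde p_1, \tilde p_2) &\overset{\text{ind}}{\sim} \tilde p_1 \times \tilde p_2 \\
(\tilde p_1, \tilde p_2) \mid \tilde q &\sim \tilde q^2, \qquad \tilde q = \sum_{i \ge 1} \omega_i \delta_{G_i}
\end{aligned}
\tag{1}
$$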

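where the random elements $X_\ell = (X_{i,\ell})_{i \ge 1}$, for $\ell = 1, 2$, take values in a space $\mathbb{X}$, the sequences $(\omega_i)_{i \ge 1}$ and $(G_i)_{i \ge 1}$ are independent, with $\sum_{i \ge 1} \omega_i = 1$ almost surely, and the $G_i$'s are i.i.d. random probability measures on $\mathbb{X}$ such that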

$$
G_i = \sum_{t \ge 1} w_{t,i} \delta_{\theta_{t,i}}, \qquad \theta_{t,i} \overset{\text{iid}}{\sim} P
\tag{2}
$$

for some non–atomic probability measure $P$ on $\mathbb{X}$. In Rodríguez et al. (2008) it is assumed that $\tilde q$ and the $G_i$'s are realizations of Dirichlet processes, while in Rodríguez & Dunson (2014) it is assumed they are from a generalised Dirichlet process introduced in Hjort (2000). Due to the discreteness of $\tilde q$, one has $\tilde p_1 = \tilde p_2$ with positive probability, which allows for clustering at the level of the populations’ distributions and implies that the two samples are identically distributed in such cases.

The nested Dirichlet process has been widely used in a rich variety of applications, but it has an unappealing characteristic that provides motivation for this article. In particular, if $X_1$ and $X_2$ share at least one value, then the posterior distribution of $(\tilde p_1, \tilde p_2)$ degenerates on $\{\tilde p_1 = \tilde p_2\}$, forcing homogeneity across the two samples. This occurs also in nested Dirichlet process mixture models in which the $X_{i,\ell}$ are latent, and is not specific to the Dirichlet process but is a consequence of nesting discrete random probabilities.
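To make the two-level structure of (1) and (2) concrete, the following is a minimal sketch of a draw from a nested Dirichlet process. It rests on several assumptions that are not in the article: both levels are truncated stick-breaking approximations (the truncation levels `n_outer` and `n_inner` are hypothetical), the base measure $P$ is taken to be standard normal, and all function names are illustrative.

```python
import numpy as np

def stick_breaking(concentration, n_atoms, rng):
    """Truncated stick-breaking weights of a Dirichlet process."""
    betas = rng.beta(1.0, concentration, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    return weights / weights.sum()  # renormalise after truncation

def sample_nested_dp(c, c0, n1, n2, rng, n_outer=50, n_inner=50):
    """Draw two samples from a truncated nested DP (assumed base measure N(0, 1))."""
    omega = stick_breaking(c, n_outer, rng)  # outer weights omega_i of q
    inner_w = np.stack([stick_breaking(c0, n_inner, rng) for _ in range(n_outer)])
    inner_atoms = rng.normal(size=(n_outer, n_inner))  # theta_{t,i} drawn i.i.d. from P
    labels = rng.choice(n_outer, size=2, p=omega)  # which G_i each population picks
    x1 = rng.choice(inner_atoms[labels[0]], size=n1, p=inner_w[labels[0]])
    x2 = rng.choice(inner_atoms[labels[1]], size=n2, p=inner_w[labels[1]])
    return labels, x1, x2

rng = np.random.default_rng(0)
labels, x1, x2 = sample_nested_dp(c=1.0, c0=1.0, n1=5, n2=4, rng=rng)
print("same distribution:", labels[0] == labels[1])
print("shared values:", set(x1) & set(x2))
```

In this sketch the two samples can share values only when both populations select the same $G_i$, because atoms of distinct $G_i$'s are drawn from a non-atomic $P$. This is the mechanism behind the degeneracy discussed above.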

To overcome this major limitation, we propose a more flexible class of latent nested processes, which preserve heterogeneity a posteriori, even when distinct values are shared by different samples. Latent nested processes define $\tilde p_1$ and $\tilde p_2$ in (1) as resulting from normalisation of an additive random measure model with common and idiosyncratic components, the latter with nested structure. Latent nested processes are shown to have appealing distributional properties. In particular, nesting corresponds, in terms of the induced partitions, to a convex combination of full exchangeability and unconditional independence, the two extreme cases. This leads naturally to methodology for testing equality of distributions.

## 2 Nested processes

### 2.1 Generalising nested Dirichlet processes via normalised random measures

We first propose a class of nested processes that generalise nested Dirichlet processes by replacing the Dirichlet process components with a more flexible class of random measures. The idea is to define $\tilde q$ in (1) in terms of normalised completely random measures on the space $\mathsf{P}_{\mathbb{X}}$ of probability measures on $\mathbb{X}$. Let $\tilde\mu = \sum_{i \ge 1} J_i \delta_{G_i}$ be an almost surely finite completely random measure without fixed points of discontinuity, where the $G_i$ are i.i.d. random probability measures on $\mathbb{X}$ with some fixed distribution $Q$ on $\mathsf{P}_{\mathbb{X}}$. The corresponding Lévy measure on $\mathbb{R}^+ \times \mathsf{P}_{\mathbb{X}}$ is assumed to factorise as

$$
\nu(ds, dp) = c\, \rho(s)\, ds\, Q(dp)
\tag{3}
$$

where is some non–negative function such that and . Since such a characterises through its Lévy-Khintchine representation

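$$
E\left[e^{-\lambda \tilde\mu(A)}\right] = \exp\left[-c\, Q(A) \int_0^\infty \left(1 - e^{-\lambda s}\right) \rho(s)\, ds\right] =: e^{-c\, Q(A)\, \psi(\lambda)}
\tag{4}
$$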

for any measurable , we use the notation . The function in (4) is also referred to as the Laplace exponent of . For a more extensive treatment of completely random measures, see Kingman (1993). If one additionally assumes that , then almost surely and we can define in (1) as

$$
\tilde q \overset{d}{=} \frac{\tilde\mu}{\tilde\mu(\mathsf{P}_{\mathbb{X}})}
\tag{5}
$$

This is known as a normalised random measure with independent increments, introduced in Regazzini et al. (2003), and is denoted as . The baseline measure, , of in (3) is, in turn, the probability distribution of , with and having Lévy measure

$$
\nu_0(ds, dx) = c_0\, \rho_0(s)\, ds\, Q_0(dx)
\tag{6}
$$

for some non–negative function such that and . Moreover, is a non–atomic probability measure on and is the Laplace exponent of . The resulting general class of nested processes is such that and is indicated by The nested Dirichlet process of Rodríguez et al. (2008) is recovered by specifying and to be gamma processes, namely , so that both and are Dirichlet processes.

### 2.2 Clustering properties of nested processes

A key property of nested processes is their ability to cluster both population distributions and data from each population. In this subsection, we present results on: (i) the prior probability that $\tilde p_1 = \tilde p_2$ and the resulting impact on ties at the observations’ level; (ii) equations for mixed moments as convex combinations of fully exchangeable and unconditionally independent special cases; and (iii) a similar convexity result for the partially exchangeable partition probability function. The probability distribution of an exchangeable partition depends only on the numbers of objects in each group; the exchangeable partition probability function is the probability of observing a particular partition as a function of the group counts. Partial exchangeability is exchangeability within samples; the partially exchangeable partition probability function depends only on the numbers of objects in the groups that are specific to one sample and in the groups that are shared across samples. Simple forms for the partially exchangeable partition probability function not only provide key insights into the clustering properties but also greatly facilitate computation.

Before stating result (i), define

$$
\tau_q(u) = \int_0^\infty s^q e^{-us} \rho(s)\, ds, \qquad \tau_q^{(0)}(u) = \int_0^\infty s^q e^{-us} \rho_0(s)\, ds,
$$

for any $u > 0$, and agree that $\tau_0(u) \equiv \tau_0^{(0)}(u) \equiv 1$.

###### Proposition 1.

If , and , then

$$
\pi_1 := P(\tilde p_1 = \tilde p_2) = c \int_0^\infty u\, e^{-c \psi(u)}\, \tau_2(u)\, du
\tag{7}
$$

and the probability that any two observations from the two samples coincide equals

$$
P(X_{j,1} = X_{k,2}) = \pi_1\, c_0 \int_0^\infty u\, e^{-c_0 \psi_0(u)}\, \tau_2^{(0)}(u)\, du > 0.
\tag{8}
$$

This result shows that the probability of $\tilde p_1$ and $\tilde p_2$ coinciding is positive, as desired, but also that this implies a positive probability of ties at the observations’ level. Moreover, (7) only depends on $\nu$ and not $\nu_0$, since the latter acts on the $\mathbb{X}$ space. In contrast, the probability that any two observations $X_{j,1}$ and $X_{k,2}$ from the two samples coincide, given in (8), depends also on $\nu_0$. If $(\tilde p_1, \tilde p_2)$ is a nested Dirichlet process, which corresponds to $\rho(s) = \rho_0(s) = e^{-s}/s$, one obtains $\pi_1 = 1/(c+1)$ and $P(X_{j,1} = X_{k,2}) = \pi_1/(c_0+1)$.
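The integral in (7) can be checked numerically. The sketch below is an illustration rather than part of the article: it uses a plain midpoint rule after the substitution $u = t/(1-t)$, and the function names and grid size are hypothetical. For the gamma case $\rho(s) = e^{-s}/s$ one has $\psi(u) = \log(1+u)$ and $\tau_2(u) = (1+u)^{-2}$, so the integral has the closed form $1/(c+1)$.

```python
import math

def prob_equal_distributions(c, psi, tau2, n_grid=100_000):
    """Midpoint-rule approximation of c * int_0^inf u exp(-c psi(u)) tau2(u) du.

    The substitution u = t / (1 - t) maps (0, inf) onto (0, 1).
    """
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * h
        u = t / (1.0 - t)
        jacobian = 1.0 / (1.0 - t) ** 2
        total += u * math.exp(-c * psi(u)) * tau2(u) * jacobian
    return c * total * h

# Gamma case: Laplace exponent and tau_2 for rho(s) = exp(-s) / s
gamma_psi = math.log1p

def gamma_tau2(u):
    return 1.0 / (1.0 + u) ** 2

for c in (1.0, 2.0, 5.0):
    print(c, prob_equal_distributions(c, gamma_psi, gamma_tau2), 1.0 / (c + 1.0))
```

The midpoint rule is accurate here because the transformed integrand is a polynomial for integer $c$. For $c < 1$ the integrand has an integrable singularity at $t = 1$ and a more careful quadrature would be needed.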

The following proposition [our result (ii)] provides a representation of mixed moments as a convex combination of full exchangeability and unconditional independence between samples.

###### Proposition 2.

If and is as in (7), then

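$$
E\left[\int_{\mathsf{P}_{\mathbb{X}}^2} f_1(p_1) f_2(p_2)\, \tilde q(dp_1)\, \tilde q(dp_2)\right] = \pi_1 \int_{\mathsf{P}_{\mathbb{X}}} f_1(p) f_2(p)\, Q(dp) + (1 - \pi_1) \int_{\mathsf{P}_{\mathbb{X}}} f_1(p)\, Q(dp) \int_{\mathsf{P}_{\mathbb{X}}} f_2(p)\, Q(dp)
\tag{9}
$$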

for all measurable functions .

This convexity property is a key property of nested processes.

The component with weight in (9) accounts for heterogeneity among data from different populations and it is important to retain this component also a posteriori in (1). Proposition 2 is instrumental to obtain our main result (iii) characterizing the partially exchangeable random partition induced by and in (1). To fix ideas consider a partition of the data of sample into specific groups and groups shared with sample () with corresponding frequencies and . For example, 0.5, 2, , 5, 5, 0.5, 0.5 and 5, , 0.5, 0.5 yield a partition of objects into groups of which and are specific to the first and the second sample, respectively, and are shared. Moreover, the frequencies are , , and . Let us start by analyzing the two extreme cases. For the fully exchangeable case (in the sense of exchangeability holding true across both samples), one obtains the exchangeable partition probability function

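$$
\Phi_k^{(N)}(n_1, n_2, q_1 + q_2) = \frac{c_0^k}{\Gamma(N)} \int_0^\infty u^{N-1} e^{-c_0 \psi_0(u)} \prod_{j=1}^{k_1} \tau^{(0)}_{n_{j,1}}(u) \prod_{i=1}^{k_2} \tau^{(0)}_{n_{i,2}}(u) \prod_{r=1}^{k_0} \tau^{(0)}_{q_{r,1} + q_{r,2}}(u)\, du
\tag{10}
$$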

having set $N = n_1 + n_2$, $k = k_0 + k_1 + k_2$, and $|a| = \sum_i a_i$ for any vector $a = (a_1, \ldots, a_p)$. The marginal exchangeable partition probability functions for the individual samples are

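$$
\Phi^{(n_\ell)}_{k_0 + k_\ell}(n_\ell, q_\ell) = \frac{c_0^{k_0 + k_\ell}}{\Gamma(n_\ell)} \int_0^\infty u^{n_\ell - 1} e^{-c_0 \psi_0(u)} \prod_{j=1}^{k_\ell} \tau^{(0)}_{n_{j,\ell}}(u) \prod_{r=1}^{k_0} \tau^{(0)}_{q_{r,\ell}}(u)\, du
\tag{11}
$$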

Both (10) and (11) hold true with the constraints and , for each . Finally, the convention implies that whenever an argument of the function is zero, then it reduces to . For example, . Both (10) and (11) solely depend on the Lévy intensity of the completely random measure and can be made explicit for specific choices. We are now ready to state our main result (iii).

###### Theorem 1.

The random partition induced by the samples and drawn from , according to (1), is characterised by the partially exchangeable partition probability function

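$$
\Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \pi_1\, \Phi_k^{(N)}(n_1, n_2, q_1 + q_2) + (1 - \pi_1)\, \Phi^{(n_1 + |q_1|)}_{k_0 + k_1}(n_1, q_1)\, \Phi^{(n_2 + |q_2|)}_{k_0 + k_2}(n_2, q_2)\, \mathbb{1}_{\{0\}}(k_0)
\tag{12}
$$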

The two independent exchangeable partition probability functions in the second summand on the right–hand side of (12) are crucial for accounting for the heterogeneity across samples. However, the result shows that one shared value, i.e. $k_0 \ge 1$, forces the random partition to degenerate to the fully exchangeable case in (10). Hence, a single tie forces the two samples to be homogeneous, representing a serious limitation of all nested processes, including the nested Dirichlet process special case. This result shows that degeneracy is a consequence of combining simple discrete random probabilities with nesting. In the following section, we develop a generalisation that is able to preserve heterogeneity in the presence of ties between the samples.
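The degeneracy in (12) can be illustrated in the nested Dirichlet process case, where the gamma choice $\rho_0(s) = e^{-s}/s$ turns (10) and (11) into Ewens sampling formulas with total mass $c_0$ and where $\pi_1 = 1/(1+c)$. The sketch below is an illustration under those assumptions; the helper names and the example data are hypothetical and are not the article's example.

```python
import math
from collections import Counter

def partition_frequencies(sample1, sample2):
    """Frequencies of sample-specific groups (n1, n2) and shared groups (q1, q2)."""
    c1, c2 = Counter(sample1), Counter(sample2)
    shared = [v for v in c1 if v in c2]
    n1 = [c1[v] for v in c1 if v not in c2]
    n2 = [c2[v] for v in c2 if v not in c1]
    q1 = [c1[v] for v in shared]
    q2 = [c2[v] for v in shared]
    return n1, n2, q1, q2

def log_ewens_eppf(counts, mass):
    """Log EPPF of a Dirichlet process with total mass `mass` (Ewens formula)."""
    n, k = sum(counts), len(counts)
    log_rising = math.lgamma(mass + n) - math.lgamma(mass)
    return k * math.log(mass) + sum(math.lgamma(m) for m in counts) - log_rising

def ndp_peppf(n1, n2, q1, q2, c, c0):
    """Equation (12) in the nested Dirichlet process case."""
    pi1 = 1.0 / (1.0 + c)
    pooled = list(n1) + list(n2) + [a + b for a, b in zip(q1, q2)]
    exchangeable = math.exp(log_ewens_eppf(pooled, c0))
    if len(q1) > 0:  # k0 >= 1: the independence term is switched off
        return pi1 * exchangeable
    independent = math.exp(log_ewens_eppf(n1, c0) + log_ewens_eppf(n2, c0))
    return pi1 * exchangeable + (1.0 - pi1) * independent

# Hypothetical data with two shared values
n1, n2, q1, q2 = partition_frequencies([1.2, 3.4, 3.4, 0.7, 1.2, 1.2], [3.4, 9.9, 1.2])
print(n1, n2, q1, q2)
print(ndp_peppf(n1, n2, q1, q2, c=1.0, c0=1.0))
```

As soon as `q1` is non-empty, only the fully exchangeable term contributes, so the conditional probability of $\tilde p_1 = \tilde p_2$ given such a partition equals one.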

## 3 Latent nested processes

To address degeneracy of the partially exchangeable partition probability function in (12), we look for a model that, while still able to cluster random probabilities, can also take into account heterogeneity of the data in the presence of ties between the two samples. The issue is relevant also in mixture models where $\tilde p_1$ and $\tilde p_2$ are used to model partially exchangeable latent variables such as, e.g., vectors of means and variances in normal mixture models. To see this, consider a simple density estimation problem, where two-sample data are generated from

$$
X_{i,1} \sim \tfrac{1}{2} N(5, 0.6) + \tfrac{1}{2} N(10, 0.6), \qquad X_{j,2} \sim \tfrac{1}{2} N(5, 0.6) + \tfrac{1}{2} N(0, 0.6).
$$

This can be modeled by dependent normal mixtures with mean and variance specified in terms of a nested structure as in (1). The results, carried out by employing the algorithms detailed in Section 4, show two possible outcomes: either the model is able to estimate well the two bimodal marginal densities, while not identifying the presence of a common component, or it identifies the shared mixture component but does not yield a sensible estimate of the marginal densities, which both display three modes. The latter situation is displayed in Figure 1: once the shared component is detected, the two marginal distributions are considered identical as the whole dependence structure boils down to exchangeability across the two samples.
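For readers who want to reproduce a dataset of this kind, the following is a minimal simulation sketch. Two points are assumptions: the second argument of $N(\cdot, 0.6)$ is read as a variance, and the sample sizes used below are hypothetical placeholders rather than the sizes used in the article.

```python
import numpy as np

def simulate_two_samples(n1, n2, rng, variance=0.6):
    """Simulate the two bimodal samples sharing the component centred at 5."""
    sd = np.sqrt(variance)  # assumption: 0.6 is a variance, not a standard deviation
    means1 = rng.choice([5.0, 10.0], size=n1)  # equal-weight mixture for sample 1
    means2 = rng.choice([5.0, 0.0], size=n2)   # equal-weight mixture for sample 2
    return rng.normal(means1, sd), rng.normal(means2, sd)

rng = np.random.default_rng(0)
x1, x2 = simulate_two_samples(100, 100, rng)  # hypothetical sample sizes
print(x1.mean(), x2.mean())
```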

This critical issue can be tackled by a novel class of latent nested processes. Specifically, we introduce a model where the nesting structure is placed at the level of the underlying completely random measures, which leads to greater flexibility while preserving tractability. In order to define the new process, let be the space of boundedly finite measures on and the probability measure on induced by , where is as in (6). Hence, for any measurable subset of

###### Definition 1.

Let , with . Random probability measures are a latent nested process if

$$
\tilde p_\ell = \frac{\mu_\ell + \mu_S}{\mu_\ell(\mathbb{X}) + \mu_S(\mathbb{X})}, \qquad \ell = 1, 2,
\tag{13}
$$

where and is the law of a , where , for some . Henceforth, we will use the notation .

Furthermore, since

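$$
\tilde p_i = w_i \frac{\mu_i}{\mu_i(\mathbb{X})} + (1 - w_i) \frac{\mu_S}{\mu_S(\mathbb{X})}, \qquad \text{where } w_i = \frac{\mu_i(\mathbb{X})}{\mu_S(\mathbb{X}) + \mu_i(\mathbb{X})},
\tag{14}
$$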

each $\tilde p_i$ is a mixture of two components: an idiosyncratic component $\mu_i / \mu_i(\mathbb{X})$ and a shared component $\mu_S / \mu_S(\mathbb{X})$. Here $\mu_S$ preserves heterogeneity across samples even when shared values are present. The parameter $\gamma$ in the intensity of $\mu_S$ tunes the effect of such a shared CRM. One recovers model (1) as $\gamma \to 0$. A generalisation to nested completely random measures of the results given in Propositions 1 and 2 is provided in the following proposition, whose proof is omitted.
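The equivalence between (13) and (14) can be checked on a toy example. The sketch below replaces the completely random measures with finite vectors of positive jumps, which is a simplifying assumption made only for illustration; the gamma-distributed jumps and the function name are likewise hypothetical.

```python
import numpy as np

def normalise_with_shared(jumps_own, jumps_shared):
    """Weights of the normalised sum (13) and of the mixture form (14).

    Atoms are ordered as: idiosyncratic atoms first, shared atoms last.
    """
    total = jumps_own.sum() + jumps_shared.sum()
    direct = np.concatenate([jumps_own, jumps_shared]) / total  # (13)
    w = jumps_own.sum() / total                                 # mixture weight w_i
    mixture = np.concatenate([
        w * jumps_own / jumps_own.sum(),
        (1.0 - w) * jumps_shared / jumps_shared.sum(),
    ])                                                          # (14)
    return direct, mixture, w

rng = np.random.default_rng(0)
jumps_1 = rng.gamma(1.0, size=5)  # hypothetical jumps of mu_1
jumps_S = rng.gamma(1.0, size=5)  # hypothetical jumps of mu_S
direct, mixture, w1 = normalise_with_shared(jumps_1, jumps_S)
print(np.allclose(direct, mixture), w1)
```

Shrinking the shared jumps towards zero pushes `w1` towards one, which mirrors how a vanishing shared component leaves only the idiosyncratic measure.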

###### Proposition 3.

If , where as in Definition 1, then

$$
\pi_1^* = P(\mu_1 = \mu_2) = c \int_0^\infty u\, e^{-c \psi(u)}\, \tau_2(u)\, du
\tag{15}
$$

and

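$$
E\left[\int_{\mathbb{M}^2} f_1(m_1) f_2(m_2)\, \tilde q^2(dm_1, dm_2)\right] = \pi_1^* \int_{\mathbb{M}} f_1(m) f_2(m)\, Q(dm) + (1 - \pi_1^*) \prod_{\ell=1}^{2} \int_{\mathbb{M}} f_\ell(m)\, Q(dm)
\tag{16}
$$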

for all measurable functions .

###### Proposition 4.

If , then .

Proposition 4, combined with , entails namely

$$
P(\{\tilde p_1 = \tilde p_2\} \cap \{\mu_1 = \mu_2\}) + P(\{\tilde p_1 \ne \tilde p_2\} \cap \{\mu_1 \ne \mu_2\}) = 1
$$

and, then, the random variables $\mathbb{1}\{\tilde p_1 = \tilde p_2\}$ and $\mathbb{1}\{\mu_1 = \mu_2\}$ coincide almost surely. As a consequence, the posterior distribution of $\mathbb{1}\{\mu_1 = \mu_2\}$ can be readily employed to test equality between the distributions of the two samples. Further details are given in Section 5.
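As a rough illustration of how such a test could be read off sampler output, the sketch below summarises posterior draws of the indicator of $\mu_1 = \mu_2$. It is not the procedure of Section 5, which is not reproduced here: the function name, the input format (a list of 0/1 draws) and the Bayes factor summary are assumptions added for illustration.

```python
import math

def homogeneity_summary(indicator_draws, prior_prob):
    """Posterior probability of mu_1 = mu_2 and the implied Bayes factor.

    indicator_draws: 0/1 values of the indicator across MCMC iterations.
    prior_prob: prior probability of equality (pi_1^* in the notation above).
    """
    posterior_prob = sum(indicator_draws) / len(indicator_draws)
    prior_odds = prior_prob / (1.0 - prior_prob)
    if posterior_prob == 1.0:
        return posterior_prob, math.inf
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    return posterior_prob, posterior_odds / prior_odds

# Hypothetical draws, for illustration only
post, bayes_factor = homogeneity_summary([1, 0, 0, 1, 0, 0, 0, 0], prior_prob=0.5)
print(post, bayes_factor)
```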

For analytic purposes, it is convenient to introduce an augmented version of the latent nested process, which includes latent indicator variables. In particular, , with if and only if

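$$
\begin{aligned}
(X_{i,1}, X_{j,2}) \mid (\zeta_{i,1}, \zeta_{j,2}, \mu_1, \mu_2, \mu_S) &\overset{\text{ind}}{\sim} p_{\zeta_{i,1}} \times p_{2\zeta_{j,2}} \\
(\zeta_{i,1}, \zeta_{j,2}) \mid (\mu_1, \mu_2, \mu_S) &\sim \mathrm{Bern}(w_1) \times \mathrm{Bern}(w_2) \\
(\mu_1, \mu_2, \mu_S) \mid (\tilde q, \tilde q_S) &\sim \tilde q^2 \times \tilde q_S.
\end{aligned}
\tag{17}
$$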

The latent variables indicate which random probability measure between and generates each , for .

###### Theorem 2.

The random partition induced by the samples and drawn from , as in (17), is characterised by the partially exchangeable partition probability function

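$$
\Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \pi_1^* \frac{c_0^k (1+\gamma)^k}{\Gamma(N)} \int_0^\infty s^{N-1} e^{-(1+\gamma) c_0 \psi_0(s)} \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \tau^{(0)}_{n_{j,\ell}}(s) \prod_{j=1}^{k_0} \tau^{(0)}_{q_{j,1} + q_{j,2}}(s)\, ds + (1 - \pi_1^*) \sum_{(*)} I_2(n_1, n_2, q_1 + q_2, \zeta^*)
\tag{18}
$$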

where

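$$
\begin{aligned}
I_2(n_1, n_2, q_1 + q_2, \zeta^*) = {} & \frac{c_0^k \gamma^{k - \bar k}}{\Gamma(n_1) \Gamma(n_2)} \int_0^\infty \int_0^\infty u^{n_1 - 1} v^{n_2 - 1} e^{-\gamma c_0 \psi_0(u+v) - c_0 (\psi_0(u) + \psi_0(v))} \\
& \times \prod_{j=1}^{k_1} \tau^{(0)}_{n_{j,1}}\big(u + (1 - \zeta^*_{j,1}) v\big) \prod_{j=1}^{k_2} \tau^{(0)}_{n_{j,2}}\big((1 - \zeta^*_{j,2}) u + v\big) \prod_{j=1}^{k_0} \tau^{(0)}_{q_{j,1} + q_{j,2}}(u + v)\, du\, dv
\end{aligned}
$$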

and the sum in the second summand on the right hand side of (18) runs over all the possible labels .

The partially exchangeable partition probability function (18) is a convex linear combination of an exchangeable partition probability function corresponding to full exchangeability across samples and one corresponding to unconditional independence. Heterogeneity across samples is preserved even in the presence of shared values. The above result is stated in full generality, and hence may seem somewhat complex. However, as the following examples show, when considering stable or gamma random measures, explicit expressions are obtained. When $\gamma \to 0$ the expression (18) reduces to (12), which means that the nested process is recovered as a special case.

###### Example 1.

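Based on Theorem 2 we can derive an explicit expression of the partition structure of latent nested $\sigma$–stable processes. Suppose $\rho(s) = \sigma s^{-1-\sigma} / \Gamma(1-\sigma)$ and $\rho_0(s) = \sigma_0 s^{-1-\sigma_0} / \Gamma(1-\sigma_0)$, for some $\sigma$ and $\sigma_0$ in $(0,1)$. In such a situation it is easy to see that $\pi_1^* = 1 - \sigma$, $\tau_q^{(0)}(u) = \sigma_0 (1-\sigma_0)_{q-1} u^{\sigma_0 - q}$ and $\psi_0(u) = u^{\sigma_0}$. Moreover let $c = c_0 = 1$, since the total mass of a stable process is redundant under normalization. If we further set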

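$$
J_{\sigma_0, \gamma}(H_1, H_2; H) := \int_0^1 \frac{w^{H_1 - 1} (1 - w)^{H_2 - 1}}{\left[\gamma + w^{\sigma_0} + (1 - w)^{\sigma_0}\right]^{H}}\, dw,
$$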

for any positive , and , and

$$
\xi_a(n_1, n_2, q_1 + q_2) := \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} (1 - a)_{n_{j,\ell} - 1} \prod_{j=1}^{k_0} (1 - a)_{q_{j,1} + q_{j,2} - 1},
$$

for any , then the partially exchangeable partition probability function in (18) may be rewritten as

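$$
\Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \sigma_0^{k-1}\, \Gamma(k)\, \xi_{\sigma_0}(n_1, n_2, q_1 + q_2) \left\{ \frac{1 - \sigma}{\Gamma(N)} + \frac{\sigma}{\Gamma(n_1) \Gamma(n_2)} \sum_{(*)} \gamma^{k - \bar k} J_{\sigma_0, \gamma}\big(n_1 - \bar n_1 + \bar k_1 \sigma_0,\; n_2 - \bar n_2 + \bar k_2 \sigma_0;\; k\big) \right\}.
$$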

The sum with respect to can be evaluated and it turns out that

where Beta stands for the beta distribution with parameters and , while is the beta function with parameters and . As is well known, $\sigma_0^{k-1}\, \Gamma(k)\, \xi_{\sigma_0}(n_1, n_2, q_1 + q_2) / \Gamma(N)$ is the exchangeable partition probability function of a normalised $\sigma_0$–stable process. Details on the above derivation, as well as for the following example, can be found in the Appendix.
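The integral $J_{\sigma_0, \gamma}$ defined in this example has no simple closed form in general, but it is one-dimensional and easy to approximate. The sketch below uses a plain midpoint rule; the function name and grid size are hypothetical, and the quadrature choice is an assumption rather than the article's method. As a sanity check, at the boundary value $\sigma_0 = 1$ the denominator is constant and $J$ reduces to a beta function divided by $(\gamma + 1)^H$.

```python
import math

def j_integral(sigma0, gamma, h1, h2, h, n_grid=100_000):
    """Midpoint-rule approximation of J_{sigma0, gamma}(H1, H2; H) on (0, 1)."""
    step = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        w = (i + 0.5) * step
        numerator = w ** (h1 - 1.0) * (1.0 - w) ** (h2 - 1.0)
        denominator = (gamma + w ** sigma0 + (1.0 - w) ** sigma0) ** h
        total += numerator / denominator
    return total * step

def beta_function(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

# Boundary check at sigma0 = 1: J = B(H1, H2) / (gamma + 1)^H
print(j_integral(1.0, 1.0, 2.0, 3.0, 2.0), beta_function(2.0, 3.0) / 2.0 ** 2)
# A value inside the admissible range sigma0 in (0, 1)
print(j_integral(0.5, 1.0, 2.0, 3.0, 2.0))
```

When $H_1$ or $H_2$ is below one the integrand is singular at an endpoint, and the midpoint rule converges slowly; a quadrature designed for endpoint singularities would be preferable in that case.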

###### Example 2.

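Let $\rho(s) = \rho_0(s) = e^{-s}/s$. Recall that $\tau_q^{(0)}(u) = \Gamma(q)/(u+1)^q$ and $\psi_0(u) = \log(1+u)$; furthermore $\pi_1^* = 1/(1+c)$ by standard calculations. From Theorem 2 we obtain the partition structure of the latent nested Dirichlet process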

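$$
\Pi_k^{(N)}(n_1, n_2, q_1, q_2) = \xi_0(n_1, n_2, q_1 + q_2)\, c_0^k \left\{ \frac{1}{1+c}\, \frac{(1+\gamma)^k}{(c_0 (1+\gamma))_N} + \frac{c}{1+c} \sum_{(*)} \frac{\gamma^{k - \bar k}}{(\alpha)_{n_2} (\beta)_{n_1}}\; {}_3F_2\big(c_0 + \bar n_2, \alpha, n_1;\; \alpha + n_2, \beta + n_1;\; 1\big) \right\}
$$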

where , and ${}_3F_2$ is the generalised hypergeometric function. In the same spirit as in the previous example, the first element in the convex linear combination above is nothing but Ewens’ sampling formula, i.e. the exchangeable partition probability function associated to the Dirichlet process whose base measure has total mass $c_0(1+\gamma)$.

## 4 Markov Chain Monte Carlo algorithm

We develop a class of Markov Chain Monte Carlo algorithms for posterior computation in latent nested process models relying on the partially exchangeable partition probability functions in Theorem 2, as they tended to be more effective. Moreover, the sampler is presented in the context of density estimation, where

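$$
\begin{aligned}
X_{j,\ell} \mid (\theta_1^{(n_1)}, \theta_2^{(n_2)}) &\overset{\text{ind}}{\sim} h(\,\cdot\,; \theta_{j,\ell}), \qquad \ell = 1, 2 \\
(X_{i,1}, X_{j,2}) \mid (\theta_1^{(n_1)}, \theta_2^{(n_2)}) &\overset{\text{ind}}{\sim} h(\,\cdot\,; \theta_{i,1}) \times h(\,\cdot\,; \theta_{j,2})
\end{aligned}
$$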

and the vectors , for and with each taking values in , are partially exchangeable and governed by a pair of as in (17). The discreteness of and entails ties among the latent variables and that give rise to distinct clusters identified by

• the distinct values specific to , i.e. not shared with . These are denoted as , with corresponding frequencies and labels ;

• the distinct values specific to , i.e. not shared with . These are denoted as , with corresponding frequencies and labels ;

• the distinct values shared by and . These are denoted as , with being their frequencies in and shared labels .

As a straightforward consequence of Theorem 2, one can determine the joint distribution of the data $x$, the corresponding latent variables $\theta$ and labels $\zeta$ as follows

$$
f(x \mid \theta)\, \Pi_k^{(N)}(n_1, n_2, q_1, q_2) \prod_{\ell=0}^{2} \prod_{j=1}^{k_\ell} Q_0(d\theta^*_{j,\ell})
\tag{19}
$$

where is as in (18) and, for and ,

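$$
f(x \mid \theta) = \prod_{\ell=1}^{2} \prod_{j=1}^{k_\ell} \prod_{i \in C_{j,\ell}} h(x_{i,\ell}; \theta^*_{j,\ell}) \prod_{r=1}^{k_0} \prod_{i \in C_{r,\ell,0}} h(x_{i,\ell}; \theta^*_{r,0}).
$$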

We now specialise (19) to the case of latent nested $\sigma$–stable processes described in Example 1. The Gibbs sampler is described just for sampling , since the structure is replicated for . To simplify the notation, denotes the random variable after the removal of . Moreover, with