 # Is infinity that far? A Bayesian nonparametric perspective of finite mixture models

Mixture models are one of the most widely used statistical tools when dealing with data from heterogeneous populations. This paper considers the long-standing debate over finite and infinite mixtures and brings the two modelling strategies together, by showing that a finite mixture is simply a realization of a point process. Following a Bayesian nonparametric perspective, we introduce a new class of priors: the Normalized Independent Point Processes. We investigate the probabilistic properties of this new class. Moreover, we design a conditional algorithm for finite mixture models with a random number of components, overcoming the challenges associated with the Reversible Jump scheme and the recently proposed marginal algorithms. We illustrate our model on real data and discuss an important application in population genetics.


## 1 Introduction

Mixture models are a very powerful and natural statistical tool to model data from heterogeneous populations. In a mixture model, observations are assumed to have arisen from one of $M$ (finite or infinite) groups, each group being suitably modelled by a density typically from a parametric family. The density of each group is referred to as a component of the mixture, and is weighted by the relative frequency (weight) of the group in the population. This model offers a conceptually simple way of relaxing distributional assumptions and a convenient and flexible way to approximate distributions that cannot be modelled satisfactorily by a standard parametric family. Moreover, it provides a framework by which observations may be clustered together into groups for discrimination or classification. For a comprehensive review of mixture models and their applications see McLachlan et al. (2000); Frühwirth-Schnatter (2006) and Frühwirth-Schnatter et al. (2019). In more detail, let $Y$ be the population variable; each observation is assumed to have arisen from one of $M$ groups:

$$
f_Y(y \mid P) = \int_{\Theta} f(y \mid \theta)\, P(d\theta) = \sum_{j=1}^{M} w_j f(y \mid \tau_j) \qquad (1)
$$

where $\{f(\cdot \mid \theta), \theta \in \Theta\}$ is a parametric family of densities on the sample space, while $P$ is an almost surely discrete measure on $\Theta$, referred to as the mixing measure. Here $\{\tau_1, \dots, \tau_M\}$ is a collection of points in $\Theta$ that defines the support of $P$. For each $j$, the density $f(y \mid \tau_j)$ is the kernel of the mixture, and is weighted by $w_j$, the relative frequency of the group in the population. Model (1) defines a framework by which observations may be clustered together into groups, so that conditionally, data are independent and identically distributed within the groups and independent between groups. To avoid confusion in terminology, in what follows $M$ will denote the number of components in a mixture, i.e. of possible clusters/sub-populations, while by number of clusters, $k$, we mean the number of allocated components, i.e. components to which at least one observation has been assigned. The latter quantity can only be estimated a posteriori.

We believe that in the context of mixture modelling the words cluster and component are often used interchangeably, i.e. the distinction between number of components and number of clusters has generally been overlooked in the parametric world, leading to the criticism that if we fix a priori the number $M$, we cannot estimate the number of clusters. What needs to be highlighted (see Rousseau and Mengersen, 2011) is that when in a finite mixture model we fix $M$, we are specifying the number of components (i.e. possible clusters) that corresponds to the data generating process, but we still need to estimate the actual number of clusters in the sample (allocated components). Nobile et al. (2004) had already pointed out this difference, noticing that the posterior distribution of the number of components $M$ might assign considerable probability to values greater than the number of allocated components. Similar observations have been made by Richardson and Green (1997), who specify a prior on the number of components $M$, highlighting the fact that some of the components might be empty, as not all the components might be represented in a finite sample and the data are non-informative on unallocated components. This leads to an identifiability problem for $M$ and, as a consequence, fully non-informative priors cannot be elicited in a mixture context. Nevertheless, Richardson and Green (1997) still focus their inference problem on $M$ and do not investigate the relationship between $M$ and $k$. More recently, Malsiner-Walli et al. (2016) introduce sparse finite mixture models as an alternative to infinite (nonparametric) mixtures, and impose sparsity to estimate the number of non-empty components in a deliberately over-fitting mixture model where $M$ is fixed to a relatively large value. On the other hand, in Bayesian nonparametrics $M$ is set equal to infinity (i.e., $M = \infty$) and the focus of inference is only $k$. In this work we stress the importance of the distinction between $M$ and $k$, as it will allow us to place nonparametric and parametric mixtures in exactly the same framework.

In Bayesian nonparametrics (i.e., $M = \infty$), the Dirichlet process mixture model (DPM, Lo, 1984; Neal, 2000), i.e. the model in (1) where the mixing measure is the Dirichlet process, plays a pivotal role. The popularity of the DPM is mainly due to its high flexibility and its mathematical and computational tractability, both in density estimation and in clustering problems. However, in some statistical applications, the use of the Dirichlet process as a clustering mechanism may be restrictive (see, for instance, Lau and Green, 2007; Miller and Harrison, 2013): the clustering results often depend on the choice of a particular kernel, partitions will typically be dominated by a few large clusters (the rich-get-richer property), and the number of clusters increases as the number of observations increases (as $\log n$), often leading to the creation of too many non-interpretable singleton clusters. To overcome these drawbacks many alternative mixing measures have been proposed (e.g. Ishwaran and James, 2003; Dey et al., 2012). In particular, Lijoi et al. (2007) replace the Dirichlet process with a large and flexible class of random probability measures obtained by normalization of random (infinite dimensional) measures with independent increments. Once again, all these approaches assume $M = \infty$ and focus on estimating $k$.

On the other hand, in a Bayesian parametric context (i.e., $M < \infty$ almost surely) the most popular approaches are to (i) fix $M$ and then focus mainly on density estimation, or (ii) treat $M$ as a random parameter and make it the focus of inference. Then, conditionally on $M$, the mixture weights are chosen according to a Dirichlet distribution on the $(M-1)$-dimensional simplex. We refer to the latter model as the finite Dirichlet mixture model (FDMM). See, among others, Nobile (1994); Richardson and Green (1997); Stephens (2000) and Miller and Harrison (2018) for more details. The literature is rich in proposals on how to estimate the number of components $M$, but there is no consensus on the best method. Likelihood-based inference typically relies on model choice criteria, such as BIC or the approximate weight of evidence (see Biernacki et al., 2000, for a review). Although in the Bayesian paradigm there are approaches based on model choice criteria, such as DIC, it is usually preferable to perform full posterior inference on $M$ as well, eliciting an appropriate prior. A fully Bayesian approach in FDMM is often based on the reversible jump Markov chain Monte Carlo (Richardson and Green, 1997; Dellaportas and Papageorgiou, 2006) or, alternatively, on the marginal likelihood. Both methods present significant computational challenges.

Although mainly for computational purposes, the connection between finite and infinite mixture models has been present in the literature for at least two decades, since the work of Muliere and Tardella (1998). Since then extensive research effort has been devoted to finding approximate representations of the Dirichlet process (e.g. Ishwaran and Zarepour, 2002). Moreover, algorithms for posterior inference of infinite mixture models often truncate the infinite measure and approximate it with a finite mixture with $M$ components, where $M$ is sufficiently large or random (see for instance Ishwaran and James, 2001; Argiento et al., 2016) and, in practice, the inferential problem translates into estimating the number of allocated components (clusters) and the cluster-specific parameters. The focus of this work is to provide a probabilistic treatment of mixture modelling that reconciles the two approaches: $M < \infty$ and $M = \infty$. This connection has received some attention from a theoretical point of view (see e.g. Miller and Harrison, 2018; Frühwirth-Schnatter and Malsiner-Walli, 2018), but has never been investigated thoroughly and fully resolved.

### 1.1 Contribution of this work

In this work, instead of approximating an infinite mixture with a parametric one, we show that a finite mixture model is simply a realization of a stochastic process whose dimension is random and has an infinite dimensional support. To this end, we introduce a new class of random measures obtained by normalization of a point process and use it as mixing measures in Model (1). We refer to this new class as Normalized Independent Finite Point Processes and we derive the family of prior distributions induced on the data partition by providing a general formula for exchangeable partition probability functions (Pitman, 1996). Finally, we characterize the posterior distribution of the Normalized Independent Finite Point Process. Our construction is exactly in the spirit of Bayesian nonparametrics, as it is based on the normalization of a point process, leading to an almost surely discrete measure. Indeed, there is a fundamental and simple idea behind the construction of almost surely discrete random measures: they can be obtained by normalization of stochastic processes. Ferguson (1973), in his seminal work, already derived the Dirichlet process as the normalization of a Gamma process. More recently, Regazzini et al. (2003) propose a new class of nonparametric priors, called Normalized Random Measures with Independent Increments, obtained through the normalization of a Lévy process. This latter work has opened the door to one of the most active lines of current research in Bayesian statistics as well as in machine learning. On the one hand, it has led to the development of nonparametric priors beyond the Dirichlet process (e.g. Lijoi et al., 2010); on the other, the same techniques are widely used for clustering in the machine learning community under the name of Normalised Completely Random Measures (e.g. Jordan, 2010).

The class we propose is rich and includes as a particular case the popular finite Dirichlet mixture model. Several inference methods have been proposed for finite mixture models, of which the most commonly used are the Reversible Jump Markov chain Monte Carlo (Green, 1995; Richardson and Green, 1997) and the recently proposed marginal algorithm (Miller and Harrison, 2018). The Reversible Jump algorithm is a very general technique and has been successfully applied in many contexts, but it can be difficult to implement since it requires designing problem-specific moves, which is often a nontrivial task, particularly in high dimensions. The algorithm proposed by Miller and Harrison (2018), although more efficient and in some ways more automatic, restricts the class of prior distributions for the weights and does not allow inference on the hyper-parameters of the process, which could constitute a serious limitation in complex set-ups. In Miller and Harrison (2018), by integrating out the mixing measure, inference is limited to the number of allocated components and the mean of linear functionals of the posterior distribution of the mixture model. See Gelfand and Kottas (2002) for a discussion of these issues.

Among the main achievements of this work is the construction of a Gibbs sampler scheme to simulate from the posterior distribution of the Normalized Independent Finite Point Process, in particular a conditional MCMC algorithm based on the posterior characterization of such a process. This algorithm, in the particular case of a Dirichlet prior on the mixture weights, leads to conjugate updating with a substantial gain in computational efficiency over current algorithms. The key result (associated with the nonparametric construction of the process) is the ability to propose transdimensional moves which are automatic and naturally implied by the prior process. We illustrate the proposed prior process through the benchmark example offered by the Galaxy data (Roeder, 1990) and an important application in population genetics.

The manuscript is structured as follows. In Section 2 we introduce the finite mixture model framework, highlighting the connection between parametric and nonparametric constructions. Section 3 reviews the necessary theory of Finite Point Processes. In Section 4 we introduce the prior process, the Normalised Independent Finite Point Process, and discuss its clustering properties, while in Section 5 we characterise its posterior distribution. In Section 6 we show how the new construction leads to efficient conditional and marginal algorithms. In Section 7 we briefly show how the new prior can be used as a component in more complex hierarchies. In Section 8 we discuss possible choices for the prior on the number of mixture components, while Section 9 presents special cases of the process. We then present two examples: (i) the Galaxy data example, which provides an opportunity to benchmark our method, in Section 10; (ii) a real data genetic application aimed at identifying population structure from microsatellite loci genotyped in a sample of thrushes, in Section 11. We conclude the paper in Section 12.

## 2 Finite Mixture Models

In this section we introduce the finite mixture model (FMM) and show how it can be written in three equivalent ways. The first two representations are widely used in the parametric literature, while the last one uses notions typical of random mixing measure (nonparametric) set-ups. Our starting point is the general finite mixture model. Let $Y_1, \dots, Y_n$ be a set of observations taking values in a Euclidean space. We consider the following mixture model:

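$$
Y_1, \dots, Y_n \mid M, w, \tau \overset{iid}{\sim} \sum_{m=1}^{M} w_m f(y \mid \tau_m) \qquad (2)
$$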

where $f(y \mid \tau)$ is a parametric density on the sample space, which depends on a vector of parameters $\tau$. The vector of parameters assumes values in $\Theta$ and is assigned a non-atomic prior density $p_0$, corresponding to the probability measure $P_0$ on $\Theta$. The number of components $M$ is an important parameter of the mixture model and in a fully Bayesian approach it is given a prior $q_M$. Conditionally on $M$, the vector of weights $w = (w_1, \dots, w_M)$, which represents the probability of belonging to each mixture component, is given a prior probability $P_W(w \mid M)$ on the simplex of dimension $M - 1$. The model in Eq. (2) can be rewritten in terms of latent variables, since this representation allows for simpler computations. To this end, we need to introduce a latent allocation vector $c = (c_1, \dots, c_n)$, whose element $c_i$ denotes the component to which observation $Y_i$ is assigned, $c_i \in \{1, \dots, M\}$. Then the model in (2) is equivalent to:

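$$
\begin{aligned}
Y_i \mid \theta_i &\overset{ind}{\sim} f(y \mid \theta_i), \quad i = 1, \dots, n \\
\theta_i \mid c_i &\overset{ind}{\sim} \delta_{\tau_{c_i}}(d\theta_i) \\
\tau_m \mid M &\overset{iid}{\sim} P_0(d\tau), \quad m = 1, \dots, M \\
w \mid M &\sim P_W(w \mid M) \\
c_i \mid M, w &\sim \text{Multinomial}_M(1, w_1, \dots, w_M) \\
M &\sim q_M
\end{aligned}
\qquad (3)
$$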

where $\delta_{\tau}$ is the Dirac measure assigning unit mass at location $\tau$. Usually $P_W$ is assumed to be a Dirichlet distribution, while typical choices for $q_M$ include a discrete uniform on some finite space, a Negative Binomial or a Poisson distribution. Note that prior information about the relative sizes of the mixing weights $w$ can be introduced through the Dirichlet hyperparameter: roughly speaking, small values favour lower-entropy weight vectors, while large values favour higher-entropy ones. In general, this hyperparameter is either set equal to a constant (e.g. 1) or is assigned a Gamma hyperprior.
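The hierarchy in Eq. (3) can be simulated forward, which makes the distinction between the number of components and the number of allocated components concrete. The following is only a minimal sketch under assumptions that are not in the text: a shifted Poisson prior on the number of components (so that it is at least 1), a symmetric Dirichlet prior on the weights with hyperparameter `gamma`, a Gaussian base measure and a unit-variance Gaussian kernel. All function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def simulate_fmm(n, lam=3.0, gamma=1.0, seed=0):
    """Forward-simulate the latent-allocation hierarchy (illustrative choices)."""
    rng = random.Random(seed)
    # M ~ q_M: shifted Poisson, an assumed prior that excludes M = 0.
    M = 1 + sample_poisson(lam, rng)
    # w | M ~ Dirichlet(gamma, ..., gamma), built by normalising Gamma draws.
    g = [rng.gammavariate(gamma, 1.0) for _ in range(M)]
    total = sum(g)
    w = [x / total for x in g]
    # tau_m | M iid from P0; here P0 = N(0, 5^2) is an assumption.
    tau = [rng.gauss(0.0, 5.0) for _ in range(M)]
    # c_i | M, w: one categorical draw per observation.
    c = rng.choices(range(M), weights=w, k=n)
    # Y_i | theta_i = tau_{c_i}; the kernel is assumed to be N(theta_i, 1).
    y = [rng.gauss(tau[ci], 1.0) for ci in c]
    # k = number of allocated components (clusters), never larger than M.
    k = len(set(c))
    return M, w, tau, c, y, k
```

Calling `simulate_fmm(n=50)` returns both `M` and `k`; for small `n` or very unequal weights, `k` is often strictly smaller than `M`, which is the situation discussed in the introduction.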

Frühwirth-Schnatter and Malsiner-Walli (2018) propose a sparsity prior on the weights $w$, which allows the number of non-empty components to be much smaller than $M$, where $M$ is a non-random constant. Alternatively, Miller and Harrison (2018) do not make strong assumptions on the prior on $M$ but, by showing the connection between FMM and exchangeable partition probability functions (eppf), manage to apply the well-developed inferential methods for DPMs to FMMs with significant gains in computational efficiency. The strategy proposed by Miller and Harrison (2018) is limited to the Dirichlet prior on $w$ and employs a marginal-type algorithm to perform posterior inference. This approach, often used in DPMs, marginalises over the weights of the mixture and is most appropriate when the main object of scientific interest is the number of clusters $k$. In this work we propose a richer construction, where the prior on $w$ is obtained by normalising a finite point process. Advantages of the proposed approach include: (i) extension of the family of prior distributions for the weights; (ii) full Bayesian inference on all the unknowns (in particular the number of components $M$ and the weights $w$); (iii) possibility of inducing sparsity through an appropriate choice of hyper-parameters; (iv) ease of interpretation; (v) possibility of extending the construction to covariate-dependent weights; and (vi) extension to more general processes.

In a nutshell, we build a general class of finite mixture models by proposing a new prior process for the mixing measure, which admits the conventional mixture model described in Eq. (3) as a special case. To introduce this new class of prior distributions, which we refer to as Normalized Independent Finite Point Processes (Norm-IFPP), we first need to review some background theory and introduce some notation. Then, in Section 9 we provide examples which do not require the prior for the weights to be a Dirichlet distribution.

The theoretical developments are based on the key observation that a realization from the prior on the mixture model parameters defined in Eq. (2), in terms of a hierarchical parametric distribution, defines an almost surely (a.s.) finite-dimensional random probability measure on the parameter space $\Theta$:

$$
P(d\theta) = \sum_{m=1}^{M} w_m \delta_{\tau_m}(d\theta) \qquad (4)
$$

This implies that the joint probability distribution on $(M, w, \tau)$ induces a distribution on $P$ defined in Eq. (4), whose support is the space of the a.s. finite-dimensional random probability measures on $\Theta$. Moreover, it is straightforward to prove (see Argiento et al., 2019) that by letting $\theta_i = \tau_{c_i}$, as in Eq. (3), the variables $\theta_1, \dots, \theta_n$ can be considered as a sample from $P$, i.e. $\theta_1, \dots, \theta_n \mid P \overset{iid}{\sim} P$. From this observation, the link between infinite (nonparametric) and finite mixture models becomes evident, as the model in Eq. (2) can be easily rewritten as

$$
\begin{aligned}
Y_1, \dots, Y_n \mid \theta_1, \dots, \theta_n &\overset{ind}{\sim} f(y; \theta_i) \\
\theta_1, \dots, \theta_n \mid P &\overset{iid}{\sim} P \\
P &\sim \mathcal{P}
\end{aligned}
\qquad (5)
$$

where $P$ is defined in Eq. (4) and $\mathcal{P}$ is the law of $P$ defined via the prior on $(M, w, \tau)$. The main theoretical contribution of this work is to give a constructive definition of $\mathcal{P}$, introducing a class of FMM for which the weights represent the normalised jumps of a finite point process and the parameters are defined in terms of realisations of the same point process. As in any mixture, the $\theta_i$'s in Model (5) are equal to one of the $\tau_m$'s in Eq. (4), depending on which component the $i$-th observation is assigned to. The link between finite mixture models and point processes is not unknown, as pointed out in the introduction. In particular, Stephens (2000) highlights this connection, but defines the point process on the complex space of normalized weights (i.e. the union of infinitely many simplexes). In this work, normalization allows us not only to work on a simpler space, but also to build a new general class of distributions $\mathcal{P}$, i.e. a new class of priors for the weights of the mixing measure $P$.

## 3 Finite Point Processes

In this section we review some concepts from point process theory which are necessary to construct the Norm-IFPP process. We refer to the books of Daley and Vere-Jones (2007) and Møller and Waagepetersen (2003) for a complete treatment of finite point processes.

Let $\mathbb{X}$ be a complete separable metric space; a point process $X$ is a random countable subset of $\mathbb{X}$. In this paper we restrict our attention to processes whose realizations are a finite subset of $\mathbb{X}$. For any realization $x$ of the process, let $n(x)$ denote the cardinality of $x$. The realizations of $X$ are constrained to the set $\{x \subseteq \mathbb{X} : n(x) < \infty\}$, whose elements are called finite point configurations. The law of a finite point process is identified by the following quantities:

1. a discrete probability density $\{q_M, M = 0, 1, \dots\}$ determining the law of the total number $M$ (i.e. $n(X)$) of points of the process,

2. for each integer $M \geq 1$, a probability distribution $\Pi_M(\cdot)$ on the Borel sets of $\mathbb{X}^M$, that determines the joint distribution of the positions of the points of the process, given that their total number is $M$.

In particular, $\{q_M\}$ and $\{\Pi_M\}$ provide a constructive definition of the process which is very useful in simulations. First generate a random integer $M$ from $\{q_M\}$ and then, given $M$, generate a random set of $M$ points which is a sample from $\Pi_M$. If $M = 0$, the random generation stops and the realization coincides with the empty set.

Note that a point process is a set of unordered points. For this reason, the distribution $\Pi_M$ needs to give equal weight to all permutations of the elements in the vector $(\xi_1, \dots, \xi_M)$, i.e. $\Pi_M$ must be symmetric. A convenient way to specify the law of the process is based on the Janossy measure (Daley and Vere-Jones, 2007):

$$
J(A_1 \times \dots \times A_M) = M!\, q_M\, \Pi_M(A_1 \times \dots \times A_M)
$$

where the $A_m$'s are Borel sets of $\mathbb{X}$, with $m = 1, \dots, M$. The Janossy measure is unnormalised and plays a fundamental role in the study of finite point processes and spatial point patterns. It has a simple interpretation which makes it easy to work with. Let $\mathbb{X} = \mathbb{R}^d$ and let $j(\xi_1, \dots, \xi_M)$ denote the density of $J$ with respect to the Lebesgue measure, with $\xi_i \neq \xi_j$ for $i \neq j$. Then $j(\xi_1, \dots, \xi_M)\, d\xi_1 \cdots d\xi_M$ is the probability that there are exactly $M$ points in the process, one in each of the distinct infinitesimal regions $(\xi_m, \xi_m + d\xi_m)$. Here, we will use the Janossy measure to characterize the prior and the posterior distribution of the new class of finite discrete random probability measures, i.e. the family of Normalised Independent Finite Point Processes (Norm-IFPP). We now introduce a simplified version of this process which assumes that the points are conditionally independent and identically distributed.

###### Definition 1.

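Let $\nu(\cdot)$ and $\{q_M, M = 0, 1, \dots\}$ be a density on $\mathbb{X}$ and a probability mass function respectively. A finite point process $X$ is an independent finite point process (IFPP) with parameters $\nu$ and $\{q_M\}$ if its Janossy density can be written as

$$
j(\xi_1, \dots, \xi_M) = M!\, q_M \prod_{m=1}^{M} \nu(\xi_m) \qquad (6)
$$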

In what follows, our construction is based on Eq. (6).
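The two-step constructive definition (first the number of points, then their positions) and the Janossy density in Eq. (6) can be sketched in a few lines. This is an illustration only: the uniform law on the number of points and the uniform density on the unit square are arbitrary choices made here, and every function name is hypothetical.

```python
import math
import random


def sample_finite_point_process(sample_count, sample_point, rng):
    """Draw the number of points first, then that many i.i.d. positions."""
    m = sample_count(rng)
    # When m == 0 the realization is the empty configuration.
    return [sample_point(rng) for _ in range(m)]


def janossy_density(points, q, nu):
    """Evaluate M! * q_M * prod_m nu(xi_m) for a configuration of M points."""
    m = len(points)
    dens = math.factorial(m) * q(m)
    for xi in points:
        dens *= nu(xi)
    return dens


# Illustrative ingredients (assumptions, not taken from the paper):
def count_uniform_0_to_4(rng):
    """Number of points uniform on {0, 1, 2, 3, 4}."""
    return rng.randint(0, 4)


def point_uniform_square(rng):
    """A point uniformly distributed on the unit square."""
    return (rng.random(), rng.random())
```

With these choices the density of any two-point configuration in the unit square is $2! \times 0.2 \times 1 = 0.4$, which is what `janossy_density` returns when `q` is the uniform mass function and `nu` is identically 1.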

## 4 Normalized Independent Finite Point Processes

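Let $\Theta \subset \mathbb{R}^d$, for some positive integer $d$, and let $\mathbb{X}$ be $\mathbb{R}^+ \times \Theta$. We denote with $\xi = (s, \tau)$ a point of $\mathbb{X}$. Let $\nu(s, \tau)$ be a density on $\mathbb{X}$ such that $\nu(s, \tau) = h(s)\, p_0(\tau)$, where $h(\cdot)$ is a density on $\mathbb{R}^+$ and $p_0(\cdot)$ is a density on $\Theta$. Finally, we consider only $\{q_M\}$ such that $q_0 = 0$, i.e. the prior probability of $M = 0$ is zero. We consider the independent finite point process $\tilde{P} = \{(S_m, \tau_m)\}$ with parameters $\nu$ and $\{q_M\}$. In what follows, it is easier to introduce a slight change of notation and index the process by $(h, \{q_M\}, P_0)$, to highlight the dependence of the process also on $P_0$. Let $\mathcal{M}$ be the set of indexes corresponding to the points of the process. Since we are assuming that $q_0 = 0$, the total mass $T = \sum_{m \in \mathcal{M}} S_m$ is almost surely larger than 0, so that we can give the following definition: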

###### Definition 2.

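Let $\tilde{P} = \{(S_m, \tau_m), m \in \mathcal{M}\}$ be an independent finite point process with parameters $h$, $\{q_M\}$ and $P_0$, with $q_0 = 0$. A normalized independent finite point process (Norm-IFPP) with parameters $h$, $\{q_M\}$ and $P_0$ is a discrete probability measure on $\Theta$ defined by

$$
P(A) = \sum_{m \in \mathcal{M}} w_m \delta_{\tau_m}(A) \overset{d}{=} \sum_{m \in \mathcal{M}} \frac{S_m}{T} \delta_{\tau_m}(A) \qquad (7)
$$

where $T = \sum_{m \in \mathcal{M}} S_m$ and $A$ denotes a measurable set of $\Theta$. We refer to the process in Eq. (7) as a Norm-IFPP.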
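A draw from the random measure in Eq. (7) can be sketched by sampling the unnormalised points and dividing each jump by the total mass. The code below is a sketch under illustrative assumptions: the three sampler arguments are hypothetical callables, and the concrete choices (a uniform law on the number of points, Gamma jumps, Gaussian atoms) are made here only for the example. With Gamma-distributed jumps of common scale, the normalised weights follow a symmetric Dirichlet distribution, which is the finite Dirichlet mixture case mentioned in the text.

```python
import random


def sample_norm_ifpp(sample_m, sample_jump, sample_atom, rng):
    """Draw (weights, atoms) of a normalised independent finite point process."""
    m = sample_m(rng)
    if m < 1:
        # The definition requires q_0 = 0, so an empty process is not allowed.
        raise ValueError("the number of points must be at least 1")
    jumps = [sample_jump(rng) for _ in range(m)]   # S_m, i.i.d. from h
    atoms = [sample_atom(rng) for _ in range(m)]   # tau_m, i.i.d. from P0
    total = sum(jumps)                             # T, positive almost surely
    weights = [s / total for s in jumps]           # w_m = S_m / T
    return weights, atoms


def sample_from_measure(weights, atoms, n, rng):
    """Draw theta_1..theta_n i.i.d. from the discrete measure; ties are expected."""
    return rng.choices(atoms, weights=weights, k=n)


# Illustrative ingredients (assumptions, not taken from the paper):
def m_uniform_1_to_5(rng):
    return rng.randint(1, 5)


def jump_gamma(rng):
    return rng.gammavariate(1.0, 1.0)   # h = Gamma(shape 1, scale 1)


def atom_gaussian(rng):
    return rng.gauss(0.0, 5.0)          # P0 = N(0, 5^2)
```

Sampling `n` values with `sample_from_measure` and counting the distinct ones gives the number of allocated components for that draw.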

The finite dimensional process defined in Eq. (7) belongs to the wide class of species sampling models (see Pitman, 1996) and this will allow us to use all the efficient machinery developed for such models. Let $\theta_1, \dots, \theta_n$ be a sample from a Norm-IFPP. It is well known that sampling from a discrete probability measure induces ties among the $\theta_i$'s and, therefore, a random partition of the observations. Let $\rho_n = \{C_1, \dots, C_k\}$ indicate a partition of the set $\{1, \dots, n\}$ in $k$ subsets, where $i \in C_j$ if observation $i$ belongs to cluster $j$, for $j = 1, \dots, k$, and let $\theta^\star_1, \dots, \theta^\star_k$ denote the set of distinct $\theta_i$'s associated to each $C_j$. The marginal law of $(\theta_1, \dots, \theta_n)$ has a unique characterization:

$$
\mathcal{L}(d\theta_1, \dots, d\theta_n) = \mathcal{L}(\rho_n, d\theta^\star_1, \dots, d\theta^\star_k) = \pi(n_1, \dots, n_k) \prod_{j=1}^{k} P_0(d\theta^\star_j)
$$

where $n_j$ is the size of $C_j$, and $\pi$ is the exchangeable partition probability function (eppf) associated to the random probability $P$ (Pitman, 1996). For each $n$, the eppf is a probability law on the set of the partitions of $\{1, \dots, n\}$, which determines the (random) number of clusters $k$ and the size $n_j$ of each cluster $C_j$. The partition is exchangeable because its law depends only on the number and size of the clusters, and not on the allocation of the individuals to each cluster. The eppf is a key tool in Bayesian analysis, as mixture models can be rewritten in terms of random partitions and this equivalence is often exploited to improve computational efficiency, in particular of marginal algorithms (Lijoi et al., 2010). The following theorem provides an expression for the eppf of a Norm-IFPP measure.

###### Theorem 1.

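Let $(n_1, \dots, n_k)$ be a vector of positive integers such that $\sum_{j=1}^{k} n_j = n$. Then, the eppf associated with a Norm-IFPP is

$$
\pi(n_1, \dots, n_k) = \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)} \left\{ \sum_{m=0}^{\infty} \frac{(m+k)!}{m!} \psi(u)^m q_{m+k} \right\} \prod_{j=1}^{k} \kappa(n_j, u)\, du \qquad (8)
$$

where $\psi(u)$ is the Laplace transform of the density $h$, i.e.

$$
\psi(u) := \int_0^{\infty} e^{-us} h(s)\, ds \qquad (9)
$$

and

$$
\kappa(n_j, u) := \int_0^{\infty} s^{n_j} e^{-us} h(s)\, ds = (-1)^{n_j} \frac{d^{n_j}}{du^{n_j}} \psi(u)
$$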

Proof: See Appendix
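Eq. (8) can be evaluated numerically once $h$ and $q_M$ are fixed. The sketch below assumes, purely for illustration, that $h$ is a Gamma density with shape `gamma` and unit rate, so that $\psi(u) = (1+u)^{-\gamma}$ and $\kappa(n_j, u) = \frac{\Gamma(\gamma + n_j)}{\Gamma(\gamma)} (1+u)^{-(\gamma + n_j)}$, and that $q_M$ is a Poisson distribution shifted to start at 1. The integral is approximated with a midpoint rule after the change of variable $u = t/(1-t)$, and the infinite sum is truncated; the function names, the truncation level and the grid size are all choices made here.

```python
import math


def log_q_shifted_poisson(M, lam):
    """log q_M for M - 1 ~ Poisson(lam); this prior puts zero mass on M = 0."""
    return -lam + (M - 1) * math.log(lam) - math.lgamma(M)


def bracket_sum(u, k, gamma, lam, m_max=100):
    """Truncated sum over m of (m+k)!/m! * psi(u)^m * q_{m+k}."""
    log_psi = -gamma * math.log1p(u)
    total = 0.0
    for m in range(m_max):
        log_term = (math.lgamma(m + k + 1) - math.lgamma(m + 1)
                    + m * log_psi + log_q_shifted_poisson(m + k, lam))
        total += math.exp(log_term)
    return total


def kappa(n_j, u, gamma):
    """n_j-th moment of the exponentially tilted Gamma(gamma, 1) density."""
    log_val = (math.lgamma(gamma + n_j) - math.lgamma(gamma)
               - (gamma + n_j) * math.log1p(u))
    return math.exp(log_val)


def eppf(sizes, gamma=1.0, lam=3.0, grid=2000):
    """Midpoint-rule approximation of the eppf for cluster sizes `sizes`."""
    n, k = sum(sizes), len(sizes)
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        u = t / (1.0 - t)                       # maps (0, 1) onto (0, infinity)
        val = u ** (n - 1) / math.gamma(n) * bracket_sum(u, k, gamma, lam)
        for n_j in sizes:
            val *= kappa(n_j, u, gamma)
        total += val / (1.0 - t) ** 2           # Jacobian du/dt
    return total / grid
```

A useful sanity check is that the eppf sums to one over all set partitions: for $n = 3$ there is one partition of type $(3)$, three of type $(2, 1)$ and one of type $(1, 1, 1)$, so `eppf((3,)) + 3 * eppf((2, 1)) + eppf((1, 1, 1))` should be close to 1 up to numerical error.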

For what follows it is important to highlight the difference between $M$ and $k$. The number of components $M$ of the finite mixture is given by a realisation of the process in Eq. (7). On the other hand, $k$ denotes the number of non-empty (allocated) components, with $k \leq M$. This difference has been noted before in the literature (see, for example, Nobile et al. (2004); Miller and Harrison (2018); Frühwirth-Schnatter and Malsiner-Walli (2018)). Suppose that a realization of $P$ is a discrete measure with $M$ atoms $\tau_1, \dots, \tau_M$ as in Eq. (4), and that a sample $\theta_1, \dots, \theta_n$ from it takes only $k$ distinct values. Then the allocated components are the $k$ atoms that appear in the sample, while the total number of mixture components is $M$. Note that the representation in Eq. (7) implies that the jumps of the point process, indexed by the elements of $\mathcal{M}$, correspond to the components of the finite mixture, and their relative size defines the weights. More formally, we denote by $\mathcal{M}^{(a)}$ the set of indexes of allocated jumps of the process in Eq. (7), i.e. the indexes $m \in \mathcal{M}$ corresponding to jumps for which there exists a location $\tau_m$ such that $\tau_m = \theta_i$ for some $i \in \{1, \dots, n\}$. The remaining values of $\mathcal{M}$ correspond to the non-allocated jumps and we denote this set with $\mathcal{M}^{(na)}$. We use the superscripts $(a)$ and $(na)$ for random variables related to allocated and non-allocated jumps respectively.

One of the main goals of inference when using finite mixture models is to determine the clustering allocation of the observations. The eppf gives the prior distribution on the space of possible partitions. Moreover, marginalising over the cluster sizes, it is also possible to derive the implied prior distribution on the number of clusters, $k$, which corresponds to the number of allocated components.

###### Corollary 1.

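Under the assumptions of Theorem 1, the marginal prior probability of sampling a partition with $k$ clusters is given by

$$
p^\star_k = \Pr\{M^{(a)} = k\} = \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)} \left\{ \sum_{m=0}^{\infty} \frac{(m+k)!}{m!} \psi(u)^m q_{m+k} \right\} B_{n,k}(\kappa(\cdot, u))\, du \qquad (10)
$$

where $k = 1, \dots, n$, and $B_{n,k}$ is the partial Bell polynomial (Pitman, 2006) over the sequence of coefficients $\{\kappa(n, u), n = 1, 2, \dots\}$.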

Proof: See Appendix A.2

Moreover, from de Finetti's theorem it follows that $M^{(a)}$ converges almost surely to $M$ as $n \to \infty$.

## 5 Posterior Characterization of a Norm-IFPP Process

In this section we characterise the posterior distribution of the process $P$. To this end, we introduce the random variable $U_n = \Gamma_n / T$, where $\Gamma_n \sim \text{Gamma}(n, 1)$, with $\Gamma_n$ and $T$ independent, where $T = \sum_{m \in \mathcal{M}} S_m$. It is easy to show (see Appendix A.4) that if $P$ is a Norm-IFPP then, for any $n \geq 1$, the marginal density of $U_n$ is given by

$$
f_{U_n}(u; n) = \frac{u^{n-1}}{\Gamma(n)} (-1)^n \frac{d^n}{du^n} E\left(\psi(u)^M\right) \qquad (11)
$$

where $\psi(u)$ is the Laplace transform of the density $h$, as defined in Eq. (9). The posterior distribution of $U_n$, given $\theta_1, \dots, \theta_n$, is crucial to perform posterior inference and allows us to derive the posterior distribution of the unnormalised process $\tilde{P}$. To this end, we need to show that a posteriori, conditionally on $U_n = u$, $\tilde{P}$ is the superposition (union) of two independent processes: a point process $\tilde{P}^{(na)}$ and a finite process $\tilde{P}^{(a)}$ with fixed locations at $\theta^\star_1, \dots, \theta^\star_k$. Note that $k$ corresponds to the number of allocated jumps and $M$ is equal to the sum of $k$ and the number of unallocated jumps $M^{(na)}$, which assumes values in $\{0, 1, 2, \dots\}$. The process of unallocated jumps is a latent variable which links the parametric part of the model to a nonparametric process. This link is essential for computations, as will become clearer in Section 6, where we discuss the algorithm. The results below are conditional on the realizations of the random variable $U_n$, which is a typical strategy in the theory of normalised random measures, since working on the augmented space allows us to exploit the quasi-conjugacy of the process (see James et al., 2009). We now present the main theoretical contribution of this work.

###### Theorem 2.

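If $P$ is a Norm-IFPP with parameters $h$, $\{q_M\}$ and $P_0$, then the unnormalized process $\tilde{P}$, given the distinct values $\theta^\star_1, \dots, \theta^\star_k$, the cluster sizes $n_1, \dots, n_k$ and $U_n = u$, is the superposition of two processes:

$$
\tilde{P} \overset{d}{=} \tilde{P}^{(na)} \cup \tilde{P}^{(a)}
$$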

where

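1. The process of non-allocated jumps $\tilde{P}^{(na)}$ is an independent finite point process with Janossy density given by

$$
j_m((s_1, \tau_1), \dots, (s_m, \tau_m)) = m!\, q^\star_m \prod_{j=1}^{m} h^\star(s_j)\, p_0(\tau_j)
$$

where $h^\star(s) \propto e^{-us} h(s)$, $q^\star_m \propto \frac{(m+k)!}{m!} \psi(u)^m q_{m+k}$, $\psi(u)$ is the Laplace transform of $h$, and $m$ is a realization of $M^{(na)}$, the number of unallocated jumps, taking values in $\{0, 1, 2, \dots\}$.

2. The process of allocated jumps $\tilde{P}^{(a)}$ is the unordered set of points $(S_1, \tau_1), \dots, (S_k, \tau_k)$, such that, for $j = 1, \dots, k$, $\tau_j = \theta^\star_j$ and the distribution of $S_j$ is proportional to $s^{n_j} e^{-us} h(s)$.

3. Conditionally on $\mathcal{M}^{(a)}$ and $U_n = u$, $\tilde{P}^{(a)}$ and $\tilde{P}^{(na)}$ are independent.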

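Moreover, the posterior law of $U_n$ given $\theta_1, \dots, \theta_n$ depends only on the partition $\rho_n$ and has density on the positive reals given by

$$
f_U(u \mid \rho_n) \propto \frac{u^{n-1}}{\Gamma(n)} \left\{ \sum_{m=0}^{\infty} \frac{(m+k)!}{m!} \psi(u)^m q_{m+k} \right\} \prod_{j=1}^{k} \kappa(n_j, u)
$$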

Proof: See Appendix A.3

The result in Theorem 2 is the finite dimensional counterpart of Theorem 1 in James et al. (2009) for normalised completely random measures. This theorem will allow us to build an efficient block Gibbs sampler for finite mixture models. Since the order in which the points of a point process arise is not important, without loss of generality, given a realization of the posterior process $\tilde{P}$, we assume that the first $k$ points correspond to the allocated jumps, while the last $M^{(na)}$ correspond to the non-allocated ones.
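One step of a conditional sampler built on Theorem 2 can be sketched as follows. The code rests on several assumptions that should be read as such: $h$ is taken to be a Gamma density with shape `gamma` and unit rate, $q_M$ is a Poisson shifted to start at 1, the number of unallocated jumps is drawn from a law proportional to $\frac{(m+k)!}{m!} \psi(u)^m q_{m+k}$ (the bracketed term of Eq. (8)), unallocated jumps follow the exponentially tilted density proportional to $e^{-us} h(s)$ and allocated jumps the density proportional to $s^{n_j} e^{-us} h(s)$. Under the Gamma choice both tilted densities are again Gamma, with rate $1 + u$. The function name and the truncation level are hypothetical.

```python
import math
import random


def sample_posterior_jumps(sizes, u, gamma, lam, rng, m_max=200):
    """Draw allocated and unallocated jumps given cluster sizes and U_n = u."""
    k = len(sizes)
    log_psi = -gamma * math.log1p(u)     # log Laplace transform of Gamma(gamma, 1)

    # Unnormalised log-probabilities of the number of unallocated jumps,
    # with q_M the shifted Poisson (an assumed prior), truncated at m_max.
    log_w = []
    for m in range(m_max):
        M = m + k
        log_q = -lam + (M - 1) * math.log(lam) - math.lgamma(M)
        log_w.append(math.lgamma(m + k + 1) - math.lgamma(m + 1)
                     + m * log_psi + log_q)
    top = max(log_w)
    probs = [math.exp(v - top) for v in log_w]
    m_na = rng.choices(range(m_max), weights=probs)[0]

    scale = 1.0 / (1.0 + u)              # random.gammavariate takes a scale
    # Allocated jumps: density proportional to s^{n_j} e^{-us} h(s).
    allocated = [rng.gammavariate(gamma + n_j, scale) for n_j in sizes]
    # Unallocated jumps: density proportional to e^{-us} h(s).
    unallocated = [rng.gammavariate(gamma, scale) for _ in range(m_na)]

    total = sum(allocated) + sum(unallocated)
    weights = [s / total for s in allocated + unallocated]
    return allocated, unallocated, weights
```

The first `len(sizes)` entries of `weights` belong to the allocated components and the remaining ones to the unallocated components, matching the ordering convention stated above. The locations of the unallocated jumps would be drawn independently from the base measure; they are omitted here to keep the sketch short.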

## 6 Posterior inference

To perform posterior inference, tailored MCMC algorithms need to be devised. The two most popular strategies in Bayesian nonparametrics are marginal (Neal, 2000) and conditional algorithms (Ishwaran and James, 2001; Kalli et al., 2011; Argiento et al., 2016). Our construction allows for a straightforward extension of such strategies to the finite mixture case, offering a convenient alternative to the often inefficient and labour-intensive reversible jump. To implement marginal algorithms it is desirable (although not necessary, but at the cost of extra computations) to be able to compute the sum in Eq. (8) to obtain the probability of a random partition. On the other hand, for conditional algorithms we need to sample from the posterior distribution of a Norm-IFPP, which requires a closed form expression for the Laplace transform in Theorem 2. More specifically, it is essential to be able to sample from the posterior distribution of the number of non-allocated jumps, $M^{(na)}$, as well as from the distribution of the unallocated and allocated jumps, i.e. the densities proportional to $e^{-us} h(s)$ (exponentially tilted) and $s^{n_j} e^{-us} h(s)$ (Gamma tilted). Specific solutions for well known processes will be presented in the following sections. Here we give a general outline of both algorithms.

### 6.1 Marginal Algorithm

As mentioned before, a sample from induces a partition of the set of the data indexes, denoted by , such that implies that datum belongs to cluster . Marginal algorithms rely on the fact that, by integrating out the measure , the only parameters left in Eq. (5) are the random partition and the cluster specific parameters . Posterior sampling strategies for are based on the Chinese restaurant process (Aldous, 1985), which describes the (a priori) predictive generative process for , and relies on the evaluation of the eppf associated with . Nevertheless, when corresponds to the Norm-IFPP model, this evaluation can be computationally burdensome due to the integral with respect to in Eq. (8). To design efficient algorithms we adopt a disintegration technique following a strategy similar to the one suggested by James et al. (2009) and Favaro and Teh (2013) for NRMI. In particular, we augment the state space introducing the latent variable (see Theorem 2).

We now explain how, conditional on the latent variable , the Chinese restaurant process can be adapted to this set-up. Recall that the marginal distribution of , defined in Section 7, with , has been derived in Eq. (11): .

The partition (or clustering) can be generated using the eppf derived in Theorem 1. It is straightforward to show that

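$$\mathcal{L}(\rho_n, U_n) = \pi(n_1,\dots,n_k;u) = \frac{u^{n-1}}{\Gamma(n)}\left\{\sum_{m=0}^{\infty}\frac{(m+k)!}{m!}\,\psi(u)^m\,q_{m+k}\right\}\prod_{j=1}^{k}\kappa(n_j,u) = \frac{u^{n-1}}{\Gamma(n)}\,\Psi(u,k)\prod_{j=1}^{k}\kappa(n_j,u) \tag{12}$$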

where $\Psi(u,k)$ denotes the infinite sum in braces. From this joint distribution, the predictive probability (conditionally on $u$ and $\rho_n$) that observation $n+1$ belongs to a new cluster is

$$P(n+1 \in C_{k+1} \mid u, \rho_n) \propto \frac{\pi(n_1,\dots,n_k,1;u)}{\pi(n_1,\dots,n_k;u)} = \frac{\Psi(u,k+1)}{\Psi(u,k)}\,\kappa(1,u) \tag{13}$$

while the predictive probability of belonging to an existing cluster is

$$P(n+1 \in C_j \mid u, \rho_n) \propto \frac{\pi(n_1,\dots,n_j+1,\dots,n_k;u)}{\pi(n_1,\dots,n_j,\dots,n_k;u)} = \frac{\kappa(n_j+1,u)}{\kappa(n_j,u)}, \qquad j=1,\dots,k \tag{14}$$

As in a standard Chinese restaurant process (Aldous, 1985), a sequence of customers (data ) enter a restaurant with an infinite number of tables (groups ). The first customer sits at the first table and a random variable is drawn. Then each subsequent customer joins a new table with probability proportional to Eq. (13), or an existing table with customers with probability proportional to Eq. (14). For each new customer , a variable is drawn. After customers have entered the restaurant, the seating arrangement of customers around tables corresponds to a partition of with numerosity , . The seating arrangement of the customers is exchangeable, in the sense that any seating that leads to the same number of occupied tables and the same number of customers per table has the same probability. The main difference from the standard Chinese restaurant process lies in updating the cluster allocation conditional on ’s. The strategy of conditioning on a sequence of auxiliary variables to generalise the Chinese restaurant process was introduced for infinite dimensional measures by James et al. (2009). Here, we have derived the finite dimensional counterpart.
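As a rough illustration, the two predictive rules in Eqs. (13) and (14) can be evaluated in log space once $\kappa$ and $\Psi$ are available. The sketch below assumes, purely for illustration, Gamma($\gamma$, 1) jumps, so that $\kappa(n_j,u)=\Gamma(\gamma+n_j)/\Gamma(\gamma)\,(1+u)^{-(n_j+\gamma)}$, and a shifted Poisson($\Lambda$) prior on the number of components, for which $\Psi(u,k)$ has the closed form given in Section 8. The function names and numerical values are hypothetical.

```python
import math

def log_kappa(n_j, u, gamma):
    # Assumption: Gamma(gamma, 1) jumps, so
    # kappa(n_j, u) = Gamma(gamma + n_j) / Gamma(gamma) * (1 + u)^-(n_j + gamma)
    return (math.lgamma(gamma + n_j) - math.lgamma(gamma)
            - (n_j + gamma) * math.log1p(u))

def log_Psi(u, k, lam, gamma):
    # Assumption: shifted Poisson(lam) prior on the number of components, so
    # Psi(u, k) = lam^(k-1) * (lam * psi(u) + k) * exp(lam * (psi(u) - 1))
    psi = (1.0 + u) ** (-gamma)  # Laplace transform of the Gamma(gamma, 1) density
    return (k - 1) * math.log(lam) + math.log(lam * psi + k) + lam * (psi - 1.0)

def predictive_probs(counts, u, lam, gamma):
    """Normalised seating probabilities given u and the current cluster sizes.

    Returns [P(join cluster 1), ..., P(join cluster k), P(open a new cluster)].
    """
    k = len(counts)
    # Eq. (14): existing cluster j has weight kappa(n_j + 1, u) / kappa(n_j, u)
    log_w = [log_kappa(n_j + 1, u, gamma) - log_kappa(n_j, u, gamma)
             for n_j in counts]
    # Eq. (13): a new cluster has weight Psi(u, k+1) / Psi(u, k) * kappa(1, u)
    log_w.append(log_Psi(u, k + 1, lam, gamma) - log_Psi(u, k, lam, gamma)
                 + log_kappa(1, u, gamma))
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(x - top) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

probs = predictive_probs(counts=[5, 2, 1], u=1.5, lam=3.0, gamma=1.0)
```

Under these assumptions the weight of an existing cluster reduces to $(\gamma+n_j)/(1+u)$, so larger clusters attract new observations in the usual rich-get-richer fashion, while the new-cluster weight depends on $u$ through $\psi(u)$.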

A general scheme to implement a posterior Gibbs sampler for the Norm-IFPP mixture model is the following:

• Draw from . This can be done, for instance, using one of the several algorithms presented in Neal (2000), by simply substituting the predictive distributions of the Dirichlet process with the conditional predictive structure of a Norm-IFPP given in Eqs. (13) and (14).

• Draw from . This update requires a Metropolis step (or any other update that leaves the target distribution invariant) with target distribution proportional to in Eq. (12).

• Draw , for each , from . In general, this is straightforward and involves a simple parametric update from

$$\prod_{i\in C_j} f(y_i \mid \theta^{\star}_j)\; p_0(\theta^{\star}_j)$$

Special cases in which the full conditional distributions of and have a simple expression will be discussed later.
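The Metropolis update of the latent variable in the scheme above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes Gamma($\gamma$, 1) jumps and a shifted Poisson($\Lambda$) prior on the number of components, so that both $\Psi(u,k)$ and $\kappa(n_j,u)$ in Eq. (12) are available in closed form, and it uses a random walk on $\log u$ (an assumed proposal) to keep $u$ positive. All names and tuning values are hypothetical.

```python
import math
import random

def log_target_u(u, counts, lam, gamma):
    # Log of Eq. (12) as a function of u, under the assumptions stated above.
    n, k = sum(counts), len(counts)
    psi = (1.0 + u) ** (-gamma)
    log_Psi = (k - 1) * math.log(lam) + math.log(lam * psi + k) + lam * (psi - 1.0)
    log_kappa = sum(math.lgamma(gamma + n_j) - math.lgamma(gamma)
                    - (n_j + gamma) * math.log1p(u) for n_j in counts)
    return (n - 1) * math.log(u) - math.lgamma(n) + log_Psi + log_kappa

def metropolis_step_u(u, counts, lam, gamma, step, rng):
    # Random walk on log(u); the log(prop) - log(u) term is the Jacobian
    # of the change of variable from u to log(u).
    prop = u * math.exp(step * rng.gauss(0.0, 1.0))
    log_ratio = (log_target_u(prop, counts, lam, gamma)
                 - log_target_u(u, counts, lam, gamma)
                 + math.log(prop) - math.log(u))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return prop
    return u

rng = random.Random(0)
u = 1.0
chain = []
for _ in range(500):
    u = metropolis_step_u(u, counts=[5, 2, 1], lam=3.0, gamma=1.0, step=0.5, rng=rng)
    chain.append(u)
```

In a full sampler this step would alternate with the partition update and the update of the cluster-specific parameters, with `counts` recomputed from the current partition at each iteration.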

### 6.2 Conditional Algorithm

Conditional algorithms are usually of wide applicability. The most famous example of this type of strategy is the one proposed by Ishwaran and James (2001), which consists of a blocked Gibbs sampler based on the stick-breaking representation of a discrete random measure. Conditional algorithms allow us to draw from the joint distribution of in Eq. (3), where , which in turn defines a draw of the random probability measure on :

$$P(d\theta)=\sum_{m=1}^{M} w_m\,\delta_{\tau_m}(d\theta)$$

As the algorithm samples from the posterior distribution of the random measure, we are able to perform full posterior inference, at least numerically, on any functional of such a distribution. These issues are discussed in detail in Gelfand and Kottas (2002). Moreover, it is simple to make inference on the hyper-parameters of the distributions of and . An outline of the MCMC algorithm is given in Figure 1. The scheme follows directly from Theorem 2, adapted to the mixture case. Note that in step 2 of the algorithm, the relabelling of the mixture components is essential so that the non-empty components correspond to the first components.
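To make the remark on functionals concrete, the following sketch shows how stored draws of the weights and atoms $(w_m,\tau_m)$ could be turned into numerical estimates. It assumes a Gaussian kernel with known standard deviation for the mixture density, and the list `draws` is a hypothetical stand-in for the output of a conditional sampler; neither is prescribed by the text.

```python
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, atoms, sd=1.0):
    # Density induced by one draw of P = sum_m w_m * delta_{tau_m},
    # convolved with an (assumed) Gaussian kernel.
    return sum(w * normal_pdf(y, tau, sd) for w, tau in zip(weights, atoms))

def mean_functional(weights, atoms):
    # Mean of the discrete measure P: sum_m w_m * tau_m.
    return sum(w * tau for w, tau in zip(weights, atoms))

def posterior_mean_density(y, draws, sd=1.0):
    # Monte Carlo average of the mixture density over the stored draws.
    return sum(mixture_density(y, w, tau, sd) for w, tau in draws) / len(draws)

# Hypothetical sampler output: each draw is (weights, atoms) and may have
# a different number of components M.
draws = [
    ([0.7, 0.3], [-1.0, 2.0]),
    ([0.5, 0.25, 0.25], [-1.2, 0.0, 2.1]),
]
density_at_zero = posterior_mean_density(0.0, draws)
means = [mean_functional(w, tau) for w, tau in draws]
```

Because each draw carries its own number of components, the same code handles a random $M$ without any special treatment.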

## 7 Norm-IFPP hierarchical mixture models

Most real-world applications of discrete random measures involve an additional layer in the model hierarchy and convolve the random measure with a continuous kernel, leading to nonparametric mixture models. In this context, data are assumed to be generated from a parametric distribution indexed by some parameter , with . Usually is assigned a nonparametric prior, in our case a Norm-IFPP. This leads to models of the form

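$$\begin{aligned}
Y_1,\dots,Y_n \mid \theta_1,\dots,\theta_n &\overset{\text{ind}}{\sim} f(y \mid \theta_i)\\
\theta_1,\dots,\theta_n \mid P &\overset{\text{iid}}{\sim} P\\
P &\sim \text{Norm-IFPP}(h, p_0, q_m)
\end{aligned} \tag{15}$$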

where is a parametric density on , for all . We point out that is the density of a non-atomic probability measure on , such that for all . Model (15) will be addressed here as a Norm-IFPP hierarchical mixture model. The model can be extended by specifying appropriate hyperpriors. It is well known that this model is equivalent to assuming that the ’s, conditional on , are independently distributed according to the random density (1). We point out that Model (15) admits as a special case the popular finite Dirichlet mixture model (see Nobile (1994); Richardson and Green (1997); Stephens (2000); Miller and Harrison (2018)), discussed in more detail in Section 9.1.1. The posterior characterization given in Theorem 2, as well as the analytical expression for the eppf given in Theorem 1, allow us to devise conditional or marginal algorithms to perform inference under Model (15), as discussed in Section 6.
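A forward simulation can help to fix ideas about Model (15). The sketch below is one possible instance under explicit assumptions that are not dictated by the text: a shifted Poisson($\Lambda$) number of components, Gamma($\gamma$, 1) unnormalized jumps that are normalized to give the weights, a N(0, 3²) base density $p_0$, and a Gaussian kernel with unit variance. All names and values are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's multiplication method; adequate for small lam.
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def simulate_model(n, lam, gamma, rng):
    # Number of components: shifted Poisson on {1, 2, ...} (assumption).
    M = 1 + sample_poisson(lam, rng)
    # Unnormalized jumps from h = Gamma(gamma, 1), then normalized weights.
    jumps = [rng.gammavariate(gamma, 1.0) for _ in range(M)]
    total = sum(jumps)
    weights = [s / total for s in jumps]
    # Atoms drawn i.i.d. from the base density p0 = N(0, 3^2) (assumption).
    atoms = [rng.gauss(0.0, 3.0) for _ in range(M)]
    # theta_i | P ~ P: pick an atom with probability equal to its weight.
    labels = rng.choices(range(M), weights=weights, k=n)
    # Y_i | theta_i ~ N(theta_i, 1): Gaussian kernel (assumption).
    y = [rng.gauss(atoms[c], 1.0) for c in labels]
    return M, weights, atoms, labels, y

M, weights, atoms, labels, y = simulate_model(n=50, lam=3.0, gamma=1.0,
                                              rng=random.Random(1))
n_clusters = len(set(labels))  # occupied components, never more than M
```

The gap between `M` and `n_clusters` corresponds to the distinction between allocated and non-allocated components used in Section 6.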

## 8 Special Choices of q in Norm-IFPP

The exact evaluation of the eppf in Eq. (8) presents two challenges: an integral and an infinite sum. The integral is handled numerically within the MCMC via the augmentation trick, while here we discuss in more detail the infinite sum, defined as

$$\Psi(u,k) := \sum_{m=0}^{\infty}\frac{(m+k)!}{m!}\,\psi(u)^m\,q_{m+k}$$

for each real and each integer . As shown in the proof of Theorem 2, , i.e. the sum always converges.

The analytical solution of the latter depends on the particular choice of prior distribution for . Since is less than 1, is related to a binomial series. This implies that if is a Poisson or a Negative Binomial, we can derive conjugate updates for and we can find a closed form solution for . In particular, if , corresponding to the density of a random variable shifted on , then we obtain

$$\Psi(u,k)=\Lambda^{k-1}\bigl(\Lambda\psi(u)+k\bigr)\exp\{\Lambda(\psi(u)-1)\}$$

Moreover, the full conditional distribution of , i.e. in item (a) of Theorem 2, is

$$q^{\star}_m \propto k\,\psi(u)\,P_0\bigl(m,\Lambda\psi(u)\bigr)+\Lambda\psi(u)\,P_1\bigl(m,\Lambda\psi(u)\bigr)$$

where $P_0$ and $P_1$ are the probability mass functions of a Poisson and of a shifted Poisson, respectively.

Finally, it is worth mentioning that the shifted Poisson choice for implies that in Eq. (11), we have

$$f_{U_n}(u;n)=\frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\Bigl[\psi(u)\,e^{\Lambda(\psi(u)-1)}\Bigr]$$

Note that it is also possible to use a Truncated Poisson distribution for , with a slight difference in results.
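The shifted Poisson closed form for $\Psi(u,k)$ can be checked numerically against a truncated version of the defining series. The sketch below does this for a given value of $\psi(u)\in(0,1)$, and also evaluates the full conditional of the number of non-allocated components directly from the series terms rather than through a mixture representation. The truncation level `m_max` and all function names are assumptions made for illustration.

```python
import math

def log_q_shifted_poisson(m, lam):
    # q_m = exp(-lam) * lam^(m-1) / (m-1)!  for m = 1, 2, ...
    return -lam + (m - 1) * math.log(lam) - math.lgamma(m)

def log_term(m, psi, k, lam):
    # log of (m+k)!/m! * psi^m * q_{m+k}, the m-th term of the series
    return (math.lgamma(m + k + 1) - math.lgamma(m + 1)
            + m * math.log(psi) + log_q_shifted_poisson(m + k, lam))

def Psi_series(psi, k, lam, m_max=200):
    # Truncated version of the infinite sum defining Psi(u, k)
    return sum(math.exp(log_term(m, psi, k, lam)) for m in range(m_max + 1))

def Psi_closed(psi, k, lam):
    # Closed form under the shifted Poisson prior
    return lam ** (k - 1) * (lam * psi + k) * math.exp(lam * (psi - 1.0))

def unallocated_pmf(m, psi, k, lam):
    # Full conditional of the number of non-allocated components:
    # each series term divided by the normalising constant Psi(u, k)
    return math.exp(log_term(m, psi, k, lam)) / Psi_closed(psi, k, lam)

series_value = Psi_series(psi=0.6, k=3, lam=2.0)
closed_value = Psi_closed(psi=0.6, k=3, lam=2.0)
pmf_total = sum(unallocated_pmf(m, 0.6, 3, 2.0) for m in range(201))
```

Since the terms decay factorially in $m$, a modest truncation level is already sufficient for the two values to agree to machine precision in this setting.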

On the other hand, if we choose , a Negative Binomial density with parameters $p$ and $r$ and support on $\{1,2,\dots\}$, i.e.

$$q_m=\frac{\Gamma(r+m-1)}{(m-1)!\,\Gamma(r)}\,p^{m-1}(1-p)^r\,\mathbb{I}_{\{1,2,\dots\}}(m)$$

then, it is easy to show that

$$\Psi(u,k)=\frac{\Gamma(r+k-1)}{\Gamma(r)}\,p^{k-1}(1-p)^r\,\frac{p\,\psi(u)(r-1)+k}{\bigl(1-p\,\psi(u)\bigr)^{k+r}},\qquad u>0 \text{ and } k\ge 1$$

In this case we obtain that the full conditional for the number of non-allocated components has support in with probability mass function

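$$q^{\star}_m \propto (r+k)\,p\,\psi(u)\,\mathrm{NegBin}\bigl(m;\,p\,\psi(u),\,r+k\bigr)+k\bigl(1-p\,\psi(u)\bigr)\,\mathrm{NegBin}\bigl(m+1;\,p\,\psi(u),\,r+k-1\bigr)$$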

Moreover,

$$f_{U_n}(u;n)=\frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\left[\psi(u)\,\frac{(1-p)^r}{\bigl(1-p\,\psi(u)\bigr)^r}\right]$$
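The Negative Binomial closed form for $\Psi(u,k)$ can be verified in the same spirit, by comparing it with a truncated evaluation of the defining series. The sketch below uses the probability mass function $q_m$ stated above; the truncation level `m_max`, the function names and the parameter values are illustrative assumptions.

```python
import math

def log_q_negbin(m, r, p):
    # q_m = Gamma(r+m-1) / ((m-1)! Gamma(r)) * p^(m-1) * (1-p)^r,  m = 1, 2, ...
    return (math.lgamma(r + m - 1) - math.lgamma(m) - math.lgamma(r)
            + (m - 1) * math.log(p) + r * math.log1p(-p))

def Psi_series_negbin(psi, k, r, p, m_max=400):
    # Truncated sum over m of (m+k)!/m! * psi^m * q_{m+k}
    total = 0.0
    for m in range(m_max + 1):
        total += math.exp(math.lgamma(m + k + 1) - math.lgamma(m + 1)
                          + m * math.log(psi) + log_q_negbin(m + k, r, p))
    return total

def Psi_closed_negbin(psi, k, r, p):
    # Gamma(r+k-1)/Gamma(r) * p^(k-1) * (1-p)^r * (p*psi*(r-1) + k) / (1 - p*psi)^(k+r)
    x = p * psi
    return (math.exp(math.lgamma(r + k - 1) - math.lgamma(r))
            * p ** (k - 1) * (1.0 - p) ** r
            * (x * (r - 1.0) + k) / (1.0 - x) ** (k + r))

nb_series = Psi_series_negbin(psi=0.6, k=3, r=2.5, p=0.4)
nb_closed = Psi_closed_negbin(psi=0.6, k=3, r=2.5, p=0.4)
```

The series converges geometrically at rate $p\,\psi(u)<1$, so the truncation level needed grows as $p\,\psi(u)$ approaches one, i.e. for small $u$ and $p$ close to one.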

Finally, in applications, we might want to fix the number of mixture components, i.e. the number of points of the point process, leading to the standard finite mixture setup. In this case, if is set very large, we recover the sparse mixture framework of Frühwirth-Schnatter and Malsiner-Walli (2018). Let with probability 1, we obtain

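$$\Psi(u,k)=\begin{cases}\dfrac{\tilde{M}!}{(\tilde{M}-k)!}\,\psi(u)^{\tilde{M}-k} & \text{if } k\le\tilde{M}\\[1ex] 0 & \text{if } k>\tilde{M}\end{cases}$$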

This prior specification for implies that the support for is bounded, , assigns probability mass one to and

$$f_{U_n}(u)=\frac{u^{n-1}}{\Gamma(n)}(-1)^n\,\tilde{M}\,\psi(u)^{\tilde{M}-1}\,\psi'(u)$$

## 9 Important Examples

The Norm-IFPP$(h, p_0, q_m)$ depends on three densities. The prior on the cluster-specific parameters is $p_0$ and, in applications, a conjugate prior is usually preferred. The choice of $q_m$ has been discussed in Section 8. We now focus on the choice of $h$. Once again, the particular choice of $h$ influences the induced clustering in Eq. (8) as well as the efficiency of computations. There are two possible alternatives: either to choose $h$ as a parametric density or to select the Laplace transform of $h$, $\psi(u)$.

### 9.1 Choice of h

#### 9.1.1 Finite Dirichlet process

Let be the density. Under this choice of the Norm-IFPP is a finite Dirichlet measure, that is an almost surely discrete probability measure as in Eq. (7), where, conditionally on , the jump sizes of are a sample from the -dimensional Dirichlet distribution. Therefore, this is equivalent to a conventional finite mixture model as described in Section 2. Recall that the Laplace transform and its derivatives for a density are given by Then, applying Theorem 1, we obtain that the eppf of this model is

 p(n1,…,nk) = ∫∞0un−1Γ(n)(∞∑m=0(m+k)!m!1(u+1)mγqm+k)k∏j=1Γ(γ+nj)Γ(γ)1(u+1)nj+γdu = {1Γ(n)∞∑m=0(m+k)!m!qm+k∫∞0un−1(u+1)mγ+n+kndu}k∏j=1Γ(γ+nj)Γ(γ) = {∞∑m=0(m+k)!m!qm+kΓ((k+m)γ)Γ((k+m)γ+n)}k∏jΓ(γ+nj)Γ(γ) = V(n,k)k∏