# Bayesian estimation and prediction for mixtures

For two vast families of mixture distributions and a given prior, we provide unified representations of posterior and predictive distributions. Model applications presented include bivariate mixtures of Gamma distributions labelled as Kibble-type, non-central Chi-square and F distributions, the distribution of R^2 in multiple regression, variance mixtures of normal distributions, and mixtures of location-scale exponential distributions including the multivariate Lomax distribution. An emphasis is also placed on analytical representations and the relationships with a host of existing distributions and several hypergeometric functions of one or two variables.


## 1 Introduction

Mixture models are ubiquitous in probability and statistics. Such models, whether they are finite mixture models, mixtures of Poisson, exponential, gamma or normal distributions, etc., are quite useful and appealing for best representing data and heterogeneous environments. As well, distributional properties of mixture models are often quite elegant and instructive. However, analytical and numerical challenges are present and well documented, namely in terms of likelihood-based and Bayesian inference.

It is also the case that several familiar distributions are representable in terms of mixtures, and that such representations facilitate the derivation of various statistical properties and approaches to statistical inference. Prominent examples include the noncentral chi-square, Beta, and Fisher distributions, which typically arise in connection with quadratic forms in normal linear models. Other important examples include the distribution of the square of a multiple correlation coefficient (R^2) in a standard multiple regression linear model with normally distributed errors, as well as the vast class of univariate or bivariate Gamma mixtures which includes the Kibble distribution (see Example 1.1).

We consider mixture models for summary statistics which are of one of the following two types:

$$\text{Type I:}\quad X\,|\,K \sim f_K \ \text{ with } K \sim g_\theta, \qquad Y\,|\,J \sim q_J \ \text{ with } J \sim h_\theta\,; \tag{1.1}$$

$$\text{Type II:}\quad X\,|\,K \sim f_{\theta,K} \ \text{ with } K \sim g, \qquad Y\,|\,J \sim q_{\theta,J} \ \text{ with } J \sim h\,. \tag{1.2}$$

The classification of these types will be adhered to and seems a natural way to present the various expressions and examples that make up this paper. In the above models, the mixing variables K and J will typically have either a discrete or a continuous univariate distribution. The parameter θ is unknown and a prior density π will be assumed for it. Otherwise, the mixing densities will be assumed known (except for the value of θ) and absolutely continuous with respect to a finite measure μ. Similarly, the conditional densities f_K and q_J in (1.1), as well as f_{θ,K} and q_{θ,J} in (1.2), will be assumed known and absolutely continuous with respect to a common dominating measure. Examples will include both discrete and continuous mixing distributions for K and J, as well as discrete and continuous models for the conditional distributions of X|K and Y|J.

We provide analytical expressions for Bayesian posterior distributions of θ based on X, as well as Bayesian predictive densities of Y based on X. We are particularly interested in eliciting the general structures driving these Bayesian solutions. We also strive for elegant and concise representations, informative connections, and aim to present various illustrations and applications. Although the findings are quite general, model applications that we present include bivariate mixtures of Gamma distributions that we label as Kibble-type, non-central Chi-square and F distributions, the distribution of R^2 in multiple regression, variance mixtures of normal distributions, and mixtures of location-scale exponential distributions including the multivariate Lomax distribution. For bivariate gamma mixtures, which we focus on in Section 4, we also consider bivariate prior distributions with dependence structures, in particular as occurring under an order restriction on the parameters. The posterior and predictive distribution decompositions in this work follow familiar paths, but the analytical representations provided are nevertheless unified, useful and insightful, and lead to simplifications in Bayesian posterior analyses. This is also the case where we purposely exploit the mixture representation of a familiar distribution.

Here is a first illustration of Type I and Type II mixtures, which will be addressed more generally in Sections 3 and 4 respectively.

###### Example 1.1.

The Kibble bivariate distribution (Wicksell, 1933; Kibble, 1941) for (X1, X2) admits the following mixture representation:

$$X_1, X_2 \,|\, K \;\sim\; \text{ind. } \mathrm G\Big(\nu+K,\, \frac{\lambda_1}{1-\rho}\Big) \ \text{ and } \ \mathrm G\Big(\nu+K,\, \frac{\lambda_2}{1-\rho}\Big), \quad \text{with } K \sim \mathrm{NB}(\nu,\, 1-\rho). \tag{1.3}$$

Here ν > 0, λ1, λ2 > 0, and ρ ∈ [0, 1). Such a distribution originates, for instance, in describing the joint distribution of sample variances generated from a bivariate normal distribution with given correlation coefficient. The case ρ = 0 reduces to a pair of independent Gamma distributions. Now, observe that we have a Type I mixture when the unknown parameter appears only in the mixing distribution, the remaining parameters being known, and a Type II mixture when it appears only in the conditional distributions, with the mixing parameters known. A third type of mixture arises when unknown parameters appear in both, but it will not be further addressed in this paper. The mixing representation of the Kibble distribution has been exploited in a critical fashion for Bayesian analysis by Iliopoulos et al. (2005).
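As a quick numerical check of the mixture representation (1.3), the sketch below simulates the Kibble pair through its negative binomial mixing variable. It assumes the shape-rate convention for G(·,·) used elsewhere in the paper, with conditional rates λi/(1−ρ); the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, lam1, lam2, rho = 3.0, 2.0, 1.5, 0.6    # illustrative values
n = 200_000

# Mixing variable: K ~ NB(nu, 1 - rho).  numpy's negative_binomial(n, p) counts
# failures before the n-th success with success probability p, matching NB(r, p).
K = rng.negative_binomial(nu, 1 - rho, size=n)

# Conditionally independent Gamma draws with shape nu + K and rate lam_i/(1 - rho),
# i.e. scale (1 - rho)/lam_i; the marginals then come out as G(nu, lam_i).
X1 = rng.gamma(nu + K, (1 - rho) / lam1)
X2 = rng.gamma(nu + K, (1 - rho) / lam2)

print(X1.mean(), nu / lam1)              # marginal mean of a G(nu, lam1) variable
print(np.corrcoef(X1, X2)[0, 1])         # close to rho for the Kibble distribution
```

The empirical correlation recovers ρ, which is exactly the dependence injected by the common negative binomial mixing variable.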

## 2 Notations and definitions

Here are some notations and definitions used throughout concerning some special functions and various distributions that appear below, related either to the model, the mixing variables, the prior, the posterior, or the predictive distribution.

In what follows, we denote P(λ) as the Poisson distribution with mean λ and p.m.f. (or density) e^{−λ} λ^k / k!, for k ∈ ℕ. We write NB(r, p), with r > 0 and p ∈ (0, 1), to denote a negative binomial distribution with p.m.f. (or density)

$$\frac{\Gamma(r+k)}{k!\,\Gamma(r)}\; p^{r}\,(1-p)^{k}\,, \qquad k \in \mathbb N.$$

Throughout the paper, we define, for positive real numbers a_1, …, a_p and b_1, …, b_q, the generalized hypergeometric function as

$${}_pF_q(a_1,\ldots,a_p;\,b_1,\ldots,b_q;\,z) \;=\; \sum_{k=0}^{\infty} \frac{\prod_{i=1}^{p}(a_i)_k}{\prod_{j=1}^{q}(b_j)_k}\;\frac{z^k}{k!}\,,$$

with the Pochhammer function (c)_k = c(c+1)⋯(c+k−1) = Γ(c+k)/Γ(c) defined here for c > 0. We write N ∼ Hyp(a_1, …, a_p; b_1, …, b_q; λ) to denote a generalized hypergeometric distribution (e.g., Johnson et al., 1995) with p.m.f.

$$P(N=n) \;=\; \frac{1}{{}_pF_q(a_1,\ldots,a_p;\,b_1,\ldots,b_q;\,\lambda)}\; \frac{\prod_{i=1}^{p}(a_i)_n}{\prod_{j=1}^{q}(b_j)_n}\;\frac{\lambda^n}{n!}\; \mathbb I_{\mathbb N}(n)\,.$$

For these distributions, we will have positive a_i's and b_j's, and λ will be non-negative and within the radius of convergence of the ${}_pF_q$ function. With such a notation, for instance, the Poisson and negative binomial distributions are recovered as special cases.
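As a small sketch of these conventions, the Hyp p.m.f. can be computed directly from a truncated pFq series; with empty parameter lists it collapses to the Poisson law, one of the special cases alluded to above. The truncation levels below are ad hoc.

```python
from math import exp, factorial

def poch(c, k):
    """Pochhammer symbol (c)_k = c (c+1) ... (c+k-1)."""
    out = 1.0
    for i in range(k):
        out *= c + i
    return out

def pFq(a, b, z, terms=100):
    """Truncated series for the generalized hypergeometric function pFq."""
    s = 0.0
    for k in range(terms):
        num = 1.0
        for ai in a:
            num *= poch(ai, k)
        den = 1.0
        for bj in b:
            den *= poch(bj, k)
        s += (num / den) * z ** k / factorial(k)
    return s

def hyp_pmf(n, a, b, lam):
    """p.m.f. of the generalized hypergeometric distribution Hyp(a; b; lam)."""
    num = 1.0
    for ai in a:
        num *= poch(ai, n)
    den = 1.0
    for bj in b:
        den *= poch(bj, n)
    return (num / den) * lam ** n / factorial(n) / pFq(a, b, lam)

lam = 1.3
total = sum(hyp_pmf(n, [], [], lam) for n in range(60))
print(total)                                              # ~1: a proper p.m.f.
print(hyp_pmf(2, [], [], lam), lam ** 2 / 2 * exp(-lam))  # Poisson(lam) mass at n = 2
```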

We denote G(a, b), B(a, b), and B2(a, b, σ), for a, b, σ > 0, as Gamma, Beta, and Beta type II distributions, with densities

$$\frac{b^a}{\Gamma(a)}\,t^{a-1}e^{-bt}\,\mathbb I_{(0,\infty)}(t)\,, \qquad \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,t^{a-1}(1-t)^{b-1}\,\mathbb I_{(0,1)}(t)\,, \qquad \text{and} \qquad \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\;\frac{\sigma^{b}\,t^{a-1}}{(t+\sigma)^{a+b}}\,\mathbb I_{(0,\infty)}(t)\,,$$

respectively. The latter Beta type II family includes Pareto distributions on (0, ∞) for a = 1.

The Kummer distribution of type II, denoted K2(a, b, c, σ) for parameters a, σ > 0, c ≥ 0, and real b, is taken with density

$$\frac{\sigma^{b}}{\Gamma(a)\,\psi(a,\,1-b,\,c)}\; \frac{t^{a-1}}{(t+\sigma)^{a+b}}\; e^{-ct/\sigma}\; \mathbb I_{(0,\infty)}(t)\,,$$

where ψ is the confluent hypergeometric function of type II, defined for a, z > 0 as ψ(a, b, z) = (1/Γ(a)) ∫_0^∞ e^{−zt} t^{a−1} (1+t)^{b−a−1} dt. This class of distributions includes Gamma distributions with the choice b = −a. The class can also be extended to include the case c = 0, which corresponds to B2(a, b, σ) distributions.
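The normalizing constant of this density can be checked numerically. The sketch below evaluates ψ through the type II integral representation given above and verifies that the K2 density integrates to one; the parameter values and quadrature settings are illustrative.

```python
import math

def psi(a, b, z, upper=200.0, nsteps=400_000):
    """psi(a, b, z) = (1/Gamma(a)) * int_0^inf e^{-z t} t^{a-1} (1+t)^{b-a-1} dt,
    evaluated with a midpoint rule (which avoids the endpoint t = 0)."""
    h = upper / nsteps
    s = 0.0
    for i in range(1, nsteps + 1):
        t = (i - 0.5) * h
        s += math.exp(-z * t) * t ** (a - 1) * (1 + t) ** (b - a - 1)
    return s * h / math.gamma(a)

# K2(a, b, c, sigma) normalizing constant: sigma^b / (Gamma(a) psi(a, 1-b, c))
a, b, c, sigma = 2.5, 1.5, 0.8, 2.0          # illustrative values
norm = sigma ** b / (math.gamma(a) * psi(a, 1 - b, c))

h = 0.001
mass = 0.0
for i in range(1, 200_001):                  # integrate the density over (0, 200)
    t = (i - 0.5) * h
    mass += norm * t ** (a - 1) * (t + sigma) ** (-(a + b)) * math.exp(-c * t / sigma)
mass *= h
print(mass)                                  # ~1
```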

We will denote McKay's bivariate gamma distribution with parameters γ1, γ2, γ3 > 0 as the distribution with p.d.f.

$$\frac{\gamma_3^{\,\gamma_1+\gamma_2}}{\Gamma(\gamma_1)\,\Gamma(\gamma_2)}\; z_1^{\gamma_1-1}\,(z_2-z_1)^{\gamma_2-1}\, e^{-\gamma_3 z_2}\; \mathbb I_{(0,\infty)}(z_1)\, \mathbb I_{(z_1,\infty)}(z_2)\,.$$

The distribution has a long history (e.g., McKay, 1934) and is a benchmark bivariate distribution to model durations that are ordered. It is easy to verify that the marginals are distributed as Z1 ∼ G(γ1, γ3) and Z2 ∼ G(γ1+γ2, γ3), and that Z1 and Z2 − Z1 are independently distributed, with Z2 − Z1 ∼ G(γ2, γ3). A generalization will be presented in Section 4.
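The stated properties follow from a simple additive construction, sketched below under the shape-rate convention for G(·,·); the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
g1, g2, g3 = 2.0, 3.0, 1.5                  # illustrative values
n = 200_000

# Build the McKay pair from independent Gamma increments with common rate g3:
Z1 = rng.gamma(g1, 1 / g3, size=n)          # Z1 ~ G(g1, g3)
W = rng.gamma(g2, 1 / g3, size=n)           # W = Z2 - Z1 ~ G(g2, g3), independent of Z1
Z2 = Z1 + W

print(bool((Z1 < Z2).all()))                # ordering holds by construction: True
print(Z2.mean(), (g1 + g2) / g3)            # Z2 ~ G(g1 + g2, g3)
```

Multiplying the two increment densities and changing variables to (z1, z2) reproduces exactly the McKay p.d.f. displayed above.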

## 3 Type I mixtures

We begin with Type I mixtures, providing general representations for the posterior distribution of θ, as well as for the predictive distribution of Y, and following up with various examples and observations.

###### Theorem 3.1.

Let X and Y be conditionally independent given θ and distributed as in (1.1), and let θ have prior density π with respect to a finite measure τ on Θ. Let π_k and f′_x be the densities given by

$$\pi_k(u) \;=\; \frac{\pi(u)\,g_u(k)}{m_\pi(k)} \qquad \text{and} \qquad f'_x(k) \;=\; \frac{f_k(x)\,m_\pi(k)}{\int_{\mathcal K} f_k(x)\,m_\pi(k)\,d\mu(k)}\,, \qquad u \in \Theta,\; x \in \mathbb R^d,$$

with the density m_π given by

$$m_\pi(k) \;=\; \int_\Theta g_\theta(k)\,\pi(\theta)\,d\tau(\theta)\,, \qquad k \in \mathcal K.$$

Then,

1. The posterior distribution of θ admits the mixture representation:

$$U\,|\,K' \sim \pi_{K'}\,, \qquad K' \sim f'_x\,; \tag{3.4}$$
2. The Bayes predictive density of Y admits the representation:

$$Y\,|\,J' \sim q_{J'}\,, \qquad J' \sim h'_x\,, \tag{3.5}$$

with

$$h'_x(j') \;=\; \int_{\mathcal K} q_\pi(j'\,|\,k')\,f'_x(k')\,d\mu(k')\,, \quad \text{and} \quad q_\pi(j'\,|\,k') \;=\; \int_\Theta h_\theta(j')\,\pi_{k'}(\theta)\,d\tau(\theta). \tag{3.6}$$

Proof. (a) We have indeed

$$\pi(\theta\,|\,x) \;\propto\; \int_{\mathcal K} f_k(x)\,g_\theta(k)\,\pi(\theta)\,d\mu(k) \;\propto\; \int_{\mathcal K} f_k(x)\,m_\pi(k)\,\pi_k(\theta)\,d\mu(k) \;\propto\; \int_{\mathcal K} f'_x(k)\,\pi_k(\theta)\,d\mu(k)\,.$$

(b) The predictive density of Y, i.e. the conditional density of Y given X = x, is given by:

$$\begin{aligned} q_\pi(y\,|\,x) \;&=\; \int_\Theta q(y\,|\,\theta)\,\pi(\theta\,|\,x)\,d\tau(\theta) \\ &=\; \int_\Theta \Big\{\int_{\mathcal K} q_{j'}(y)\,h_\theta(j')\,d\mu(j')\Big\} \Big\{\int_{\mathcal K} \pi_{k'}(\theta)\,f'_x(k')\,d\mu(k')\Big\}\,d\tau(\theta) \\ &=\; \int_{\mathcal K} q_{j'}(y)\,\Big\{\int_{\mathcal K} f'_x(k') \int_\Theta h_\theta(j')\,\pi_{k'}(\theta)\,d\tau(\theta)\; d\mu(k')\Big\}\,d\mu(j')\,, \end{aligned}$$

where we have used (3.4). This establishes the result. ∎

###### Remark 3.1.

The posterior and predictive distribution representations of Theorem 3.1 are particularly appealing. Indeed, observe that posterior distribution (3.4) mixes the densities π_k, each of which corresponds to the posterior density of θ as if one had actually observed K = k. Moreover, the mixing density f′_x for K′ is a weighted version of the marginal density m_π of K, with weight proportional to f_k(x).

The predictive distribution (3.5) for Y mixes the same densities q_j as the model density for Y, with the mixing density h_θ replaced by the posterior mixing density h′_x. Furthermore, this mixing density is itself a mixture of the predictive (or conditional) densities q_π(·|k′) of J as if one had observed K′ = k′.
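Part (a) of the theorem can be checked numerically on a fully discrete toy model, where all integrals reduce to sums: the direct posterior and the mixture representation (3.4) must coincide. The probability tables below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete toy model: theta on a finite grid with prior pi, K | theta ~ g_theta,
# X | K ~ f_K, all encoded as random stochastic tables.
n_theta, n_k, n_x = 5, 4, 6
pi = rng.dirichlet(np.ones(n_theta))              # prior pi(theta)
g = rng.dirichlet(np.ones(n_k), size=n_theta)     # g[t, k] = g_theta(k)
f = rng.dirichlet(np.ones(n_x), size=n_k)         # f[k, x] = f_k(x)

x = 3                                             # observed value of X

# Direct posterior: pi(theta | x) proportional to sum_k f_k(x) g_theta(k) pi(theta)
post = pi * (g @ f[:, x])
post /= post.sum()

# Mixture representation (3.4)
m = pi @ g                                        # m_pi(k)
f_prime = f[:, x] * m                             # f'_x(k) proportional to f_k(x) m_pi(k)
f_prime /= f_prime.sum()
pi_k = (pi[:, None] * g) / m[None, :]             # pi_k(theta) = pi(theta) g_theta(k) / m_pi(k)
post_mix = pi_k @ f_prime                         # sum_k pi_k(theta) f'_x(k)

print(np.allclose(post, post_mix))                # True
```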

The following examples concern posterior and predictive distribution illustrations of Theorem 3.1.

###### Example 3.2.

In model (1.1), consider Poisson mixing K ∼ P(θ) with a G(a, b) prior for θ. From this familiar set-up, we obtain π_k as a G(a + k, 1 + b) density and m_π as a NB(a, b/(1+b)) p.m.f. Following (3.4), the posterior distribution is a mixture of the above π_k's with mixing density given by

$$f'_x(k) \;\propto\; \frac{(a)_k}{k!}\;\frac{b^a}{(1+b)^{a+k}}\; f_k(x)\,.$$

Now, consider the cases: (i) X | K ∼ χ²_{p+2K}, (ii) X | K ∼ B(p/2 + K, q/2), and (iii) X | K ∼ B2(p/2 + K, q/2, 1). In the context of model (1.1), case (i) corresponds to a non-central chi-square distribution with p degrees of freedom and non-centrality parameter 2θ, case (ii) to a non-central Beta distribution with shape parameters p/2, q/2, and non-centrality parameter 2θ, and case (iii) to the density of a multiple of a variable distributed as a non-central F distribution with degrees of freedom p and q, and non-centrality parameter 2θ. The latter two cases are essentially equivalent though, related by the fact that X in (ii) is distributed as X/(1+X) is in (iii).

For (i), we obtain

$$f'_x \;\sim\; \mathrm{Hyp}\Big(a;\; \frac p2;\; \frac{x}{2(1+b)}\Big). \tag{3.7}$$

Observe that the above generalized hypergeometric distribution reduces to a P(x/(2(1+b))) distribution for a = p/2. The posterior expectation can be evaluated with the help of its mixture representation and standard calculations involving the above p.m.f. One obtains

$$\begin{aligned} \mathbb E(\theta\,|\,x) \;&=\; \mathbb E^{K'|x}\big\{\mathbb E(\theta\,|\,K',x)\big\} \qquad (3.8) \\ &=\; \mathbb E^{f'_x}\Big(\frac{a+K'}{1+b}\Big) \\ &=\; \frac{a}{1+b}\left(1 \,+\, \frac{x}{p(1+b)}\, \frac{{}_1F_1\big(a+1;\, p/2+1;\, \frac{x}{2(1+b)}\big)}{{}_1F_1\big(a;\, p/2;\, \frac{x}{2(1+b)}\big)}\right), \qquad (3.9) \end{aligned}$$

with the case a = p/2 simplifying to E(θ|x) = (p + x/(1+b)) / (2(1+b)), as noted by Saxena and Alam (1982).

For the two other cases, we obtain the mixing power series densities:

$$f'_x \sim \mathrm{Hyp}\Big(a,\, \frac{p+q}{2};\; \frac p2;\; \frac{x}{1+b}\Big) \ \text{ for (ii)}, \qquad \text{and} \qquad f'_x \sim \mathrm{Hyp}\Big(a,\, \frac{p+q}{2};\; \frac p2;\; \frac{x}{(1+x)(1+b)}\Big) \ \text{ for (iii)}. \tag{3.10}$$

Observe that the case a = p/2 reduces to a NB((p+q)/2, 1 − x/(1+b)) distribution for (ii) and a NB((p+q)/2, 1 − x/((1+x)(1+b))) distribution for (iii). The posterior expectation may be computed from (3.8), with simplifications occurring for a = p/2. Finally, we point out that the above posterior distribution representation applies as well for the improper prior choice b = 0 (i.e., π(θ) ∝ θ^{a−1}), with m_π then not a p.m.f., but given by m_π(k) ∝ Γ(a+k)/k!.
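Formula (3.9) can be sanity-checked against a brute-force computation of the posterior mean, writing the likelihood as the Poisson mixture of central chi-square densities under the set-up K ∼ P(θ), X | K ∼ χ²_{p+2K}, with a G(a, b) prior in the shape-rate convention. The parameter values, truncation levels, and integration grid below are all illustrative.

```python
import math

def chi2_pdf(x, d):
    """Central chi-square density with d degrees of freedom."""
    return x ** (d / 2 - 1) * math.exp(-x / 2) / (2 ** (d / 2) * math.gamma(d / 2))

def lik(theta, x, p, kmax=120):
    """Marginal density of X at x: X | K ~ chi2_{p+2K}, K ~ Poisson(theta)."""
    return sum(math.exp(-theta) * theta ** k / math.factorial(k) * chi2_pdf(x, p + 2 * k)
               for k in range(kmax))

def hyp1f1(a, b, z, terms=200):
    """Truncated series for the confluent hypergeometric function 1F1."""
    s, term = 0.0, 1.0
    for k in range(terms):
        s += term
        term *= (a + k) / (b + k) * z / (k + 1)
    return s

a, b, p, x = 3.0, 1.0, 4.0, 3.0
z = x / (2 * (1 + b))

# Closed-form posterior mean (3.9)
closed = a / (1 + b) * (1 + x / (p * (1 + b))
                        * hyp1f1(a + 1, p / 2 + 1, z) / hyp1f1(a, p / 2, z))

# Direct numerical posterior mean under the prior pi(theta) ~ theta^{a-1} e^{-b theta}
h = 0.004
num = den = 0.0
for i in range(1, 10_001):                 # midpoint rule on (0, 40)
    theta = (i - 0.5) * h
    w = theta ** (a - 1) * math.exp(-b * theta) * lik(theta, x, p)
    num += theta * w
    den += w
print(num / den, closed)                   # the two values should agree
```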

###### Example 3.3.

Turning now to predictive densities within the same set-up as in Example 3.2, we consider Y distributed identically to X (i.e., q ≡ f and h ≡ g). For case (i), Theorem 3.1 tells us that the Bayes predictive density of Y admits the mixture representation Y | J′ ∼ χ²_{p+2J′}, with J′ ∼ h′_x as given in (3.6). The latter admits itself the mixture representation

$$J'\,|\,K' \sim \mathrm{NB}\Big(a+K',\; \frac{1+b}{2+b}\Big)\,, \qquad K' \sim f'_x \ \text{ as in (3.7)}, \tag{3.11}$$

with the former being the Bayes predictive density of J based on K′ = k′ and the G(a, b) prior. Alternatively represented, the mixing p.m.f. for J′ may be expressed as

$$\begin{aligned} h'_x(j') \;&=\; \sum_{k'} \frac{(a+k')_{j'}}{j'!}\,\Big(\frac{1+b}{2+b}\Big)^{a+k'} \Big(\frac{1}{2+b}\Big)^{j'}\; \frac{\frac{(a)_{k'}}{k'!\,(p/2)_{k'}}\big(\frac{x}{2(1+b)}\big)^{k'}}{{}_1F_1\big(a;\,p/2;\,\frac{x}{2(1+b)}\big)} \\ &=\; \frac{(a)_{j'}}{j'!}\,\Big(\frac{1+b}{2+b}\Big)^{a} \Big(\frac{1}{2+b}\Big)^{j'}\; \frac{{}_1F_1\big(a+j';\,p/2;\,\frac{x}{2(2+b)}\big)}{{}_1F_1\big(a;\,p/2;\,\frac{x}{2(1+b)}\big)}\,, \end{aligned}$$

which also can be viewed directly as a weighted p.m.f.

The non-central Beta and Fisher distribution cases are similar. For instance, in the former case, with X and Y identically distributed, X | K ∼ B(p/2 + K, q/2) and K ∼ P(θ), predictive densities associated with G(a, b) priors are also distributed as mixtures

$$Y\,|\,J' \sim \mathrm B\big(p/2+J',\; q/2\big)\,, \quad \text{with } J'\,|\,K' \sim \mathrm{NB}\Big(a+K',\;\frac{1+b}{2+b}\Big)\,, \quad K' \sim f'_x \ \text{ as in (3.10)}.$$

For the mixing p.m.f. of J′, a development as above yields the expression:

$$h'_x(j') \;=\; \frac{(a)_{j'}}{j'!}\,\Big(\frac{1+b}{2+b}\Big)^{a}\Big(\frac{1}{2+b}\Big)^{j'}\; \frac{{}_2F_1\big(a+j',\,(p+q)/2;\;p/2;\;\frac{x}{2+b}\big)}{{}_2F_1\big(a,\,(p+q)/2;\;p/2;\;\frac{x}{1+b}\big)}\,.$$
###### Example 3.4.

A doubly non-central F distribution is a Type I mixture (e.g., Bulgren, 1971) with K = (K1, K2) admitting the representation

$$X\,|\,K \;\sim\; \mathrm B2\Big(\frac{\nu_1}{2}+K_1,\; \frac{\nu_2}{2}+K_2,\; \frac{\nu_2}{\nu_1}\Big) \quad \text{with } K_1, K_2 \sim \text{indep. } \mathrm P\Big(\frac{\theta_1}{2}\Big) \ \text{ and } \ \mathrm P\Big(\frac{\theta_2}{2}\Big). \tag{3.12}$$

Such a distribution arises naturally as a multiple of the ratio of two independent non-central chi-square variates, and reduces to a non-central F distribution for θ2 = 0. Consider now the application of Theorem 3.1 for a prior under which θ1/2 and θ2/2 are independently distributed as G(a1, b1) and G(a2, b2), yielding the familiar distributional results:

$$\pi_k(\theta_1,\theta_2) \;\sim\; \text{indep. } \mathrm G(a_i+k_i,\; 1+b_i)\,, \quad i=1,2,$$

and

$$m_\pi(k_1,k_2) \;\sim\; \text{indep. } \mathrm{NB}\Big(a_i,\; \frac{b_i}{1+b_i}\Big)\,, \quad i=1,2.$$

Representation (3.4) tells us that the posterior distribution of (θ1, θ2) is a mixture of the π_k's with mixing density

$$\begin{aligned} f'_x(k) \;&\propto\; f_k(x)\,m_\pi(k) \;\propto\; \frac{\big(\frac{\nu_1+\nu_2}{2}\big)_{k_1+k_2}\,(a_1)_{k_1}(a_2)_{k_2}}{\big(\frac{\nu_1}{2}\big)_{k_1}\big(\frac{\nu_2}{2}\big)_{k_2}\,k_1!\,k_2!}\;\beta_1^{k_1}\beta_2^{k_2}\;\mathbb I_{\mathbb N^2}(k_1,k_2) \\[4pt] \Longrightarrow\quad f'_x(k) \;&=\; \frac{\big(\frac{\nu_1+\nu_2}{2}\big)_{k_1+k_2}\,(a_1)_{k_1}(a_2)_{k_2}}{F_2\big(\frac{\nu_1+\nu_2}{2};\,a_1,a_2;\,\frac{\nu_1}{2},\frac{\nu_2}{2};\,\beta_1,\beta_2\big)}\; \frac{\beta_1^{k_1}\beta_2^{k_2}}{\big(\frac{\nu_1}{2}\big)_{k_1}\big(\frac{\nu_2}{2}\big)_{k_2}\,k_1!\,k_2!}\;\mathbb I_{\mathbb N^2}(k)\,, \end{aligned} \tag{3.13}$$

with β1 = ν1 x / ((ν1 x + ν2)(1 + b1)), β2 = ν2 / ((ν1 x + ν2)(1 + b2)), and where F2 is the Appell function of the second kind given by

$$F_2(\gamma_1;\,\gamma_2,\gamma_3;\,\gamma_4,\gamma_5;\,w,z) \;=\; \sum_{m=0}^{\infty}\sum_{n=0}^{\infty} \frac{(\gamma_1)_{m+n}\,(\gamma_2)_m\,(\gamma_3)_n}{m!\,n!\,(\gamma_4)_m\,(\gamma_5)_n}\; w^m z^n.$$

The bivariate p.m.f. in (3.13), and how it arises here, are of interest. It is a bivariate power series p.m.f. generated by the coefficients of the Appell function, and it will be a bona fide p.m.f. for β1, β2 > 0 with β1 + β2 < 1. Appell's function appears in a similar way, again in a Bayesian framework, as a bivariate discrete distribution called Bailey by Laurent (2012) (see also Jones & Marchand, 2019 for another derivation). For a particular choice of the parameters, the p.m.f. in (3.13) simplifies and the corresponding random pair (K′1, K′2) admits an explicit stochastic representation.
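Since F2 enters (3.13) precisely as the normalizing constant of the bivariate power series, a direct check is that the truncated p.m.f. sums to one when β1 + β2 < 1. The parameter values below are illustrative, not taken from the paper.

```python
import math

def poch(c, k):
    """Pochhammer symbol (c)_k."""
    out = 1.0
    for i in range(k):
        out *= c + i
    return out

def F2(g1, g2, g3, g4, g5, w, z, terms=40):
    """Truncated double series for Appell's F2; converges for |w| + |z| < 1."""
    return sum(poch(g1, m + n) * poch(g2, m) * poch(g3, n)
               / (math.factorial(m) * math.factorial(n) * poch(g4, m) * poch(g5, n))
               * w ** m * z ** n
               for m in range(terms) for n in range(terms))

# Illustrative parameters for the mixing p.m.f. (3.13), with beta1 + beta2 < 1
nu1, nu2, a1, a2, b1, b2 = 3.0, 5.0, 2.0, 1.5, 0.2, 0.15
norm = F2((nu1 + nu2) / 2, a1, a2, nu1 / 2, nu2 / 2, b1, b2)

def pmf(k1, k2):
    return (poch((nu1 + nu2) / 2, k1 + k2) * poch(a1, k1) * poch(a2, k2)
            / (norm * poch(nu1 / 2, k1) * poch(nu2 / 2, k2)
               * math.factorial(k1) * math.factorial(k2))
            * b1 ** k1 * b2 ** k2)

total = sum(pmf(k1, k2) for k1 in range(40) for k2 in range(40))
print(total)   # ~1, confirming F2 is the normalizing constant
```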

Turning to the predictive density, consider Y distributed as X in (3.12) and the same prior on (θ1, θ2) as above. Similarly to Example 3.3, which shares the same Poisson mixing and gamma prior structure, we obtain from Theorem 3.1 and (3.5) that the predictive density of Y admits the representation

$$Y\,|\,J' \sim \mathrm B2\Big(\frac{\nu_1}{2}+J'_1,\; \frac{\nu_2}{2}+J'_2,\; \frac{\nu_2}{\nu_1}\Big)\,, \quad \text{with } J'_i\,|\,K' \sim \text{indep. } \mathrm{NB}\Big(a_i+K'_i,\; \frac{1+b_i}{2+b_i}\Big)\,, \quad K' \sim f'_x \ \text{ as in (3.13)}.$$

We point out that, if the distribution of Y is non-identical to that of X, with associated degrees of freedom ν′1 and ν′2 say, the only change in the previous expression is to replace the νi's by the ν′i's in the conditional distribution of Y given J′. Similar observations apply to the other examples of this section.

###### Example 3.5.

Consider univariate Gamma mixtures with X | K ∼ G(α, K) in model (1.1), with the mixing distribution K ∼ G(a, θ) and prior θ ∼ G(c, b), with α, a > 0 known. Theorem 3.1 applies and tells us that the posterior distribution is a mixture of G(a + c, b + k) distributions, making use of the familiar posterior distribution for gamma models with gamma priors. In evaluating the mixing density of K′ given in (3.4), it is easy to verify that m_π(k) ∝ k^{a−1}(k + b)^{−(a+c)}, and one thus obtains

$$f'_x(k) \;\propto\; m_\pi(k)\,f_k(x) \;\propto\; \frac{k^{a+\alpha-1}}{(k+b)^{a+c}}\; e^{-xk}\; \mathbb I_{(0,\infty)}(k)\,,$$

which corresponds to a K2(a + α, c − α, bx, b) Kummer distribution as defined in Section 2.

Now consider the Bayesian predictive density for the Gamma mixture Y | J ∼ G(α′, J) with J ∼ G(a′, θ), which includes the particular case of identically distributed X and Y for α′ = α and a′ = a. A calculation (e.g., Aitchison & Dunsmore, 1975) yields the predictive density q_π(·|k′) of J based on K′ = k′ as a B2(a′, a + c, b + k′) density. With the above, it follows from Theorem 3.1 that the predictive distribution of Y admits the representation:

$$Y\,|\,J' \sim \mathrm G(\alpha',\, J')\,, \quad \text{with } J'\,|\,K' \sim \mathrm B2(a',\; a+c,\; b+K') \ \text{ and } \ K' \sim \mathrm K2(a+\alpha,\; c-\alpha,\; bx,\; b).$$

An alternative representation comes from simply evaluating the marginal density of J′. A calculation gives:

$$h'_x(j') \;=\; \frac{\Gamma(a'+a+c)}{\Gamma(a')\,\Gamma(a+c)}\; \frac{b^{\,c-\alpha}\,(j')^{a'-1}}{(j'+b)^{a'+c-\alpha}}\; \frac{\psi\big(a+\alpha,\; 1+\alpha-a'-c,\; (j'+b)x\big)}{\psi\big(a+\alpha,\; 1+\alpha-c,\; bx\big)}\; \mathbb I_{(0,\infty)}(j')\,,$$

with the confluent hypergeometric function of type II as defined in Section 2.

The final application of Theorem 3.1 concerns the coefficient of determination in a standard multiple regression context.

###### Example 3.6.

Consider a coefficient of determination R², or square of a multiple correlation coefficient, arising from a sample of an m-variate normal distribution and the regression of one of the components on the remaining m − 1 components. For more details on the underlying distributional theory, see for instance Muirhead (1982). It is well known that the distribution of X = R² is a Type I mixture as in (1.1), with

$$X\,|\,K \;\sim\; \mathrm B\Big(\frac{m-1}{2}+K,\; \frac{n_1-m+1}{2}\Big)\,, \quad \text{with } K \sim \mathrm{NB}\Big(\frac{n_1-1}{2},\; 1-\theta\Big)\,,$$

where θ is the theoretical squared multiple correlation coefficient. As in Marchand (2001), a convenient prior on θ is a Beta B(a, b) prior, and it leads, along with the negative binomial distributed K, to a conjugate posterior analysis. Specifically, we obtain the posterior densities π_k as B(a + k, b + (n1−1)/2) densities, as well as the marginal p.m.f.

$$\begin{aligned} m_\pi(k) \;&=\; \int_{(0,1)} \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}\; \frac{\Gamma\big(\frac{n_1-1}{2}+k\big)}{k!\,\Gamma\big(\frac{n_1-1}{2}\big)}\,\theta^{k}(1-\theta)^{(n_1-1)/2}\; d\theta \\ &=\; \frac{\big(\frac{n_1-1}{2}\big)_k\,(a)_k}{k!\,\big(a+b+\frac{n_1-1}{2}\big)_k}\; \frac{(b)_{(n_1-1)/2}}{(a+b)_{(n_1-1)/2}}\,, \qquad \text{for } k \in \mathbb N. \end{aligned}$$
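The Beta-negative binomial integral behind this closed form can be checked numerically; the generalized Pochhammer symbols are evaluated through Gamma functions, and the values of a, b, n1, and k below are illustrative.

```python
import math

a, b, n1 = 2.0, 3.0, 11          # illustrative values
nu = (n1 - 1) / 2                # negative binomial size parameter

def poch(c, k):
    """Generalized Pochhammer symbol (c)_k = Gamma(c + k) / Gamma(c)."""
    return math.gamma(c + k) / math.gamma(c)

def integrand(theta, k):
    beta_prior = (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
                  * theta ** (a - 1) * (1 - theta) ** (b - 1))
    nb = (math.gamma(nu + k) / (math.factorial(k) * math.gamma(nu))
          * theta ** k * (1 - theta) ** nu)
    return beta_prior * nb

k = 3
h = 1e-5
numeric = sum(integrand((i - 0.5) * h, k) for i in range(1, 100_001)) * h  # midpoint rule

closed = (poch(nu, k) * poch(a, k) / (math.factorial(k) * poch(a + b + nu, k))
          * poch(b, nu) / poch(a + b, nu))
print(numeric, closed)           # the two should agree
```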

Theorem 3.1 tells us that the posterior distribution is a mixture of the π_k's with mixing

$$\begin{aligned} f'_x(k) \;&\propto\; f_k(x)\,m_\pi(k) \;\propto\; \frac{\big(\frac{n_1-1}{2}\big)_k}{\big(\frac{m-1}{2}\big)_k}\,x^k\; \frac{\big(\frac{n_1-1}{2}\big)_k\,(a)_k}{k!\,\big(a+b+\frac{n_1-1}{2}\big)_k} \\[4pt] \Longrightarrow\quad f'_x \;&\sim\; \mathrm{Hyp}\Big(\frac{n_1-1}{2},\, \frac{n_1-1}{2},\, a;\;\; \frac{m-1}{2},\, a+b+\frac{n_1-1}{2};\;\; x\Big). \end{aligned} \tag{3.14}$$

This result, which we have derived here from the general context of Theorem 3.1, was obtained by Marchand (2001) for this specific set-up. In doing so, he referred to such Beta mixtures as HyperBeta distributions and also provided several graphs of prior-posterior densities for varying prior parameters (a, b), sample sizes, and observed values of R².

For the predictive density of a future coefficient of determination Y, distributed as X but allowing for a possibly different sample size n2, expression (3.5) tells us that such a predictive density admits the mixture representation:

$$Y\,|\,J' \;\sim\; \mathrm B\Big(\frac{m-1}{2}+J',\; \frac{n_2-m+1}{2}\Big)\,, \quad J'\,|\,k' \sim q_\pi(\cdot\,|\,k')\,, \quad \text{and } K' \sim f'_x \ \text{ as in (3.14)}, \tag{3.15}$$

with q_π(·|k′) the predictive density of J based on K′ = k′ and the B(a, b) prior. An evaluation of (3.6) yields

 qπ(j′|k′) = ∫10(n2−12)j′j′!θ