On Johnson's "sufficientness" postulates for features-sampling models

In the 1920s, the English philosopher W.E. Johnson introduced a characterization of the symmetric Dirichlet prior distribution in terms of its predictive distribution. This is typically referred to as Johnson's "sufficientness" postulate, and it has been the subject of many contributions in Bayesian statistics, leading to predictive characterizations for infinite-dimensional generalizations of the Dirichlet distribution, i.e. species-sampling models. In this paper, we review "sufficientness" postulates for species-sampling models, and then investigate analogous predictive characterizations for the more general features-sampling models. In particular, we present a "sufficientness" postulate for a class of features-sampling models referred to as Scaled Processes (SPs), and then discuss analogous characterizations in the general setup of features-sampling models.

1 Introduction

Exchangeability (definetti) provides a natural modeling assumption in a large variety of statistical problems, and it amounts to assuming that the order in which observations are recorded is not relevant. Consider a sequence of random variables $(Z_j)_{j\geq 1}$ defined on a common probability space $(\Omega,\mathscr{A},\mathbb{P})$ and taking values in an arbitrary space, which is assumed to be Polish. The sequence $(Z_j)_{j\geq 1}$ is exchangeable if and only if

$$(Z_1,\ldots,Z_n) \stackrel{d}{=} (Z_{\sigma(1)},\ldots,Z_{\sigma(n)})$$

for any permutation $\sigma$ of the set $\{1,\ldots,n\}$ and any $n\geq 1$. By virtue of the celebrated de Finetti representation theorem, exchangeability of $(Z_j)_{j\geq 1}$ is tantamount to asserting the existence of a random element $\tilde{\mu}$, defined on a (parameter) space, such that conditionally on $\tilde{\mu}$ the $Z_j$s are independent and identically distributed with common distribution $p_{\tilde{\mu}}$, i.e.,

$$Z_j \mid \tilde{\mu} \stackrel{\text{iid}}{\sim} p_{\tilde{\mu}}, \quad j\geq 1, \qquad \tilde{\mu} \sim \mathscr{M}, \tag{1}$$

where $\mathscr{M}$ is the distribution of $\tilde{\mu}$. In a Bayesian setting, $\mathscr{M}$ takes on the interpretation of a prior distribution for the parameter object of interest. In this sense, the de Finetti representation theorem is a natural framework for Bayesian statistics. For mathematical convenience, the parameter space is assumed to be a Polish space, equipped with its Borel $\sigma$-algebra. Hereafter, with the term parameter we refer to both finite-dimensional and infinite-dimensional objects.

Within the framework of exchangeability (1), a critical role is played by the predictive distributions, namely the conditional distributions of the $(n+1)$th observation $Z_{n+1}$, given $(Z_1,\ldots,Z_n)$. The problem of characterizing prior distributions in terms of their predictive distributions has a long history in Bayesian statistics, starting from the seminal work of the English philosopher johnson1932, who provided a predictive characterization of the symmetric Dirichlet prior distribution. Such a characterization is typically referred to as Johnson’s “sufficientness” postulate. Species-sampling models (Pitman96species) provide arguably the most popular infinite-dimensional generalization of the Dirichlet distribution. They form a broad class of nonparametric prior models which correspond to assuming that $p_{\tilde{\mu}}$ in (1) is an almost surely discrete random probability measure

$$\tilde{p} = \sum_{i\geq 1} \tilde{p}_i \delta_{\tilde{z}_i}, \tag{2}$$

where: i) $(\tilde{p}_i)_{i\geq 1}$ are non-negative random weights summing up to $1$ almost surely; ii) $(\tilde{z}_i)_{i\geq 1}$ are random species’ labels, independent of $(\tilde{p}_i)_{i\geq 1}$, and i.i.d. with common (non-atomic) distribution $P$. The term species refers to the fact that the law of $\tilde{p}$ is a prior distribution for the unknown species composition $(\tilde{p}_i)_{i\geq 1}$ of a population of individuals $Z_j$s, with $Z_j$ belonging to species $\tilde{z}_i$ with probability $\tilde{p}_i$, for $i,j\geq 1$. In the context of species-sampling models, Regazzini1978 and Lo1991 provided a “sufficientness” postulate for the Dirichlet process (ferguson1973bayesian). Such a characterization has then been extended by Zabell2005 to the Pitman-Yor process (Perman92; pityor_97), and by marcoB to the more general Gibbs-type prior models (gnepit_06).

In this paper, we introduce and discuss Johnson’s “sufficientness” postulates in the features-sampling setting, which generalizes the species-sampling setting by allowing each individual of the population to belong to multiple species, now called features. Features-sampling models are widely used in several applied areas; see, e.g., Gri_11; Ayed2021 and references therein. Under the framework of exchangeability (1), the features-sampling setting assumes that

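$$Z_j \mid \tilde{\mu} = \sum_{i\geq 1} A_{j,i}\,\delta_{\tilde{w}_i} \sim p_{\tilde{\mu}}, \tag{3}$$

and

$$\tilde{\mu} = \sum_{i\geq 1} \tilde{p}_i\,\delta_{\tilde{w}_i}$$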

where: i) conditionally on $\tilde{\mu}$, $(A_{j,i})_{i\geq 1}$ are independent Bernoulli random variables with parameters $(\tilde{p}_i)_{i\geq 1}$; ii) $(\tilde{p}_i)_{i\geq 1}$ are $(0,1)$-valued random weights; iii) $(\tilde{w}_i)_{i\geq 1}$ are random features’ labels, independent of $(\tilde{p}_i)_{i\geq 1}$, and i.i.d. with common (non-atomic) distribution $P$. That is, individual $Z_j$ displays feature $\tilde{w}_i$ if and only if $A_{j,i}=1$, which happens with probability $\tilde{p}_i$. For example, if, conditionally on $\tilde{\mu}$, $Z_j$ displays only two features, say $\tilde{w}_1$ and $\tilde{w}_5$, it equals the random measure $\delta_{\tilde{w}_1}+\delta_{\tilde{w}_5}$. The distribution $p_{\tilde{\mu}}$ is the law of a Bernoulli process with parameter $\tilde{\mu}$, which is denoted by $\mathrm{BeP}(\tilde{\mu})$, whereas the law of $\tilde{\mu}$ is a nonparametric prior distribution for the unknown feature probabilities $(\tilde{p}_i)_{i\geq 1}$, i.e. a features-sampling model. Here, we investigate the problem of characterizing prior distributions for $\tilde{\mu}$ in terms of their predictive distributions, with the goal of providing “sufficientness” postulates for features-sampling models. We discuss such a problem, and present partial results for a class of features-sampling models referred to as Scaled Process (SP) priors for $\tilde{\mu}$ (james2015scaled; camerlenghi2021scaled). With these results, we aim at stimulating future research in this field, to obtain “sufficientness” postulates for general features-sampling models.
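The sampling mechanism in (3) can be illustrated with a minimal sketch in Python. It assumes a finite truncation of the random measure to a handful of fixed weights and labels; the numerical weights, the labels and the function name `sample_bernoulli_process` are hypothetical and serve only as an illustration.

```python
import random


def sample_bernoulli_process(weights, labels, rng):
    """Draw one individual Z_j given the (truncated) measure.

    Feature labels[i] is displayed when the Bernoulli variable A_{j,i}
    with success probability weights[i] equals 1.
    """
    return {w for p, w in zip(weights, labels) if rng.random() < p}


rng = random.Random(0)
weights = [0.9, 0.5, 0.1, 0.01]    # hypothetical feature probabilities p_i
labels = ["w1", "w2", "w3", "w4"]  # hypothetical feature labels w_i
sample = [sample_bernoulli_process(weights, labels, rng) for _ in range(5)]
```

Each element of `sample` is the set of features displayed by one individual, so an individual may display several features or none, in contrast with the species-sampling setting.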

The paper is structured as follows. In Section 2, we present a brief review of Johnson’s “sufficientness” postulates for species-sampling models. Section 3 focuses on nonparametric prior models for the Bernoulli process, i.e. features-sampling models: we review their definitions, properties and sampling structures. In Section 4 we present a “sufficientness” postulate for SPs. Section 5 concludes the paper by discussing our results and conjecturing analogous results for more general classes of features-sampling models.

2 Species-sampling models

To introduce species-sampling models, we assume that the observations are $\mathbb{Z}$-valued random elements, and $\mathbb{Z}$ is supposed to be a Polish space whose Borel $\sigma$-algebra is denoted by $\mathscr{Z}$. Thus $\mathbb{Z}$ contains all the possible species’ labels of the population. When we deal with species-sampling models, the hierarchical formulation (1) specializes as

$$Z_j \mid \tilde{p} \stackrel{\text{iid}}{\sim} \tilde{p}, \quad j\geq 1, \qquad \tilde{p} \sim \mathscr{M}, \tag{4}$$

where $\tilde{p}=\sum_{i\geq 1}\tilde{p}_i\delta_{\tilde{z}_i}$ is an almost surely discrete random probability measure on $\mathbb{Z}$, and $\mathscr{M}$ denotes its law. We recall that: i) $(\tilde{p}_i)_{i\geq 1}$ are non-negative random weights summing up to $1$ almost surely; ii) $(\tilde{z}_i)_{i\geq 1}$ are random species’ labels, independent of $(\tilde{p}_i)_{i\geq 1}$, and i.i.d. with common (non-atomic) distribution $P$. Using the terminology of Pitman96species, the discrete random probability measure $\tilde{p}$ is a species-sampling model. In Bayesian nonparametrics, popular examples of species-sampling models are the Dirichlet process (ferguson1973bayesian), the Pitman-Yor process (Perman92; pityor_97), and the normalized generalized Gamma process (Brix1999; Lijoi2007). These examples belong to a particular subclass of species-sampling models, which are referred to as Gibbs-type prior models (gnepit_06; gibbs_15). More general subclasses of species-sampling models are, e.g., the homogeneous normalized random measures (regazzini2003) and the Poisson-Kingman models (pitman2003poisson; Pitman_06). We refer to lijoi2010models and fundamentals for a detailed and stimulating account of species-sampling models and their use in Bayesian nonparametrics.

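Because of the almost sure discreteness of $\tilde{p}$ in (4), a random sample $(Z_1,\ldots,Z_n)$ from $\tilde{p}$ features ties, that is $\mathbb{P}(Z_{j_1}=Z_{j_2})>0$ if $j_1\neq j_2$. Thus, $(Z_1,\ldots,Z_n)$ induces a random partition of the set $\{1,\ldots,n\}$ into $K_n=k$ blocks, labelled by $Z_1^*,\ldots,Z_{K_n}^*$, with corresponding frequencies $(N_1,\ldots,N_{K_n})=(n_1,\ldots,n_k)$, such that $1\leq n_i\leq n$ and $\sum_{i=1}^k n_i=n$. From Pitman96species, the predictive distribution of $\tilde{p}$ is of the form

$$\mathbb{P}(Z_{n+1}\in A \mid \mathbf{Z}_n) = g(n,k,\mathbf{n})\,P(A) + \sum_{i=1}^{k} f_i(n,k,\mathbf{n})\,\delta_{Z_i^*}(A), \quad A\in\mathscr{Z}, \tag{5}$$

for any $n\geq 1$, having set $\mathbf{Z}_n:=(Z_1,\ldots,Z_n)$ and $\mathbf{n}:=(n_1,\ldots,n_k)$, with $g$ and $f_i$ being arbitrary non-negative functions that satisfy the constraint $g(n,k,\mathbf{n})+\sum_{i=1}^k f_i(n,k,\mathbf{n})=1$. The predictive distribution (5) admits the following interpretation: i) $g(n,k,\mathbf{n})$ corresponds to the probability that $Z_{n+1}$ is a new species, that is a species not observed in $\mathbf{Z}_n$; ii) $f_i(n,k,\mathbf{n})$ corresponds to the probability that $Z_{n+1}$ is the species $Z_i^*$ already observed in $\mathbf{Z}_n$. The functions $g$ and $f_i$ completely determine the distribution of the exchangeable sequence $(Z_j)_{j\geq 1}$, and in turn the distribution of the random partition of $\{1,\ldots,n\}$ induced by $\mathbf{Z}_n$. Predictive distributions of popular species-sampling models, e.g. the Dirichlet process, the Pitman-Yor process and the normalized generalized Gamma process, are of the form (5) for suitable specifications of the functions $g$ and $f_i$. We refer to Pitman_06 for a detailed account of random partitions induced by species-sampling models, and generalizations thereof.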

Here, we recall the predictive distribution of Gibbs-type prior models (gnepit_06; gibbs_15). Let us first introduce the definition of these processes.

Definition 2.1

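Let $\sigma\in(-\infty,1)$ and let $P$ be a (non-atomic) distribution on $(\mathbb{Z},\mathscr{Z})$. A Gibbs-type prior model is a species-sampling model with predictive distribution of the form

$$\mathbb{P}(Z_{n+1}\in A \mid \mathbf{Z}_n) = \frac{V_{n+1,k+1}}{V_{n,k}}\,P(A) + \frac{V_{n+1,k}}{V_{n,k}}\sum_{i=1}^{k}(n_i-\sigma)\,\delta_{Z_i^*}(A), \quad A\in\mathscr{Z}, \tag{6}$$

for any $n\geq 1$, where $\{V_{n,k}:\ n\geq 1,\ 1\leq k\leq n\}$ is a collection of non-negative weights that satisfy the recurrence relation $V_{n,k}=V_{n+1,k+1}+(n-\sigma k)V_{n+1,k}$ for all $k=1,\ldots,n$, $n\geq 1$, with the proviso $V_{1,1}:=1$.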

Note that the Dirichlet process is a Gibbs-type prior model which corresponds to $\sigma=0$ and

$$V_{n,k} = \frac{\theta^{k}}{(\theta)_n}$$

for $\theta>0$, and having denoted by $(a)_n:=\Gamma(a+n)/\Gamma(a)$ the Pochhammer symbol for the rising factorials. Moreover, the Pitman-Yor process is a Gibbs-type prior model corresponding to

$$V_{n,k} = \frac{\prod_{i=0}^{k-1}(\theta+i\sigma)}{(\theta)_n}$$

for $\sigma\in(0,1)$ and $\theta>-\sigma$. We refer to pitman2003poisson for other examples of Gibbs-type prior models, and for a detailed account of the $V_{n,k}$s. See also Pitman_06 and references therein.
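As a numerical sanity check of the two examples, the following Python sketch evaluates the Pitman-Yor weights, verifies the recurrence of Definition 2.1 at one point, and computes the predictive probabilities in (6). The function names and the numerical values of $\theta$, $\sigma$ and of the frequencies are hypothetical choices made only for illustration; setting `sigma=0` recovers the Dirichlet process weights.

```python
from math import isclose


def rising(a, n):
    """Pochhammer symbol (a)_n = a (a+1) ... (a+n-1), with (a)_0 = 1."""
    out = 1.0
    for j in range(n):
        out *= a + j
    return out


def v_pitman_yor(n, k, theta, sigma):
    """Gibbs weight V_{n,k} of the Pitman-Yor process (sigma=0: Dirichlet)."""
    num = 1.0
    for i in range(k):
        num *= theta + i * sigma
    return num / rising(theta, n)


def gibbs_predictive(counts, theta, sigma):
    """Return (probability of a new species, probabilities of old species)."""
    n, k = sum(counts), len(counts)
    v = v_pitman_yor(n, k, theta, sigma)
    p_new = v_pitman_yor(n + 1, k + 1, theta, sigma) / v
    ratio_old = v_pitman_yor(n + 1, k, theta, sigma) / v
    return p_new, [ratio_old * (n_i - sigma) for n_i in counts]


theta, sigma = 1.0, 0.5          # hypothetical parameter values
counts = [3, 1, 2]               # hypothetical frequencies (n_1, ..., n_k)
n, k = sum(counts), len(counts)

# Recurrence V_{n,k} = V_{n+1,k+1} + (n - sigma k) V_{n+1,k}.
lhs = v_pitman_yor(n, k, theta, sigma)
rhs = (v_pitman_yor(n + 1, k + 1, theta, sigma)
       + (n - sigma * k) * v_pitman_yor(n + 1, k, theta, sigma))

p_new, p_old = gibbs_predictive(counts, theta, sigma)
```

In this sketch `p_new` equals $(\theta+k\sigma)/(\theta+n)$, so it depends on the sample only through $n$ and $k$, while each entry of `p_old` equals $(n_i-\sigma)/(\theta+n)$.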

Because of de Finetti’s representation theorem, there exists a one-to-one correspondence between the functions $g$ and $f_i$ in the predictive distribution (5) and the law $\mathscr{M}$ of $\tilde{p}$, i.e. the de Finetti measure. This is at the basis of Johnson’s “sufficientness” postulates, characterizing species-sampling models through their predictive distributions. Regazzini1978, and later Lo1991, provided the first “sufficientness” postulate for species-sampling models, showing that the Dirichlet process is the unique species-sampling model for which the function $g$ depends on $(n,k,\mathbf{n})$ only through $n$, and the function $f_i$ depends on $(n,k,\mathbf{n})$ only through $n$ and $n_i$, for $i=1,\ldots,k$. Such a result has been extended in Zabell1997, providing the following “sufficientness” postulate for the Pitman-Yor process: the Pitman-Yor process is the unique species-sampling model for which the function $g$ depends on $(n,k,\mathbf{n})$ only through $n$ and $k$, and the function $f_i$ depends on $(n,k,\mathbf{n})$ only through $n$ and $n_i$, for $i=1,\ldots,k$. marcoB discussed the “sufficientness” postulate in the more general setting of Gibbs-type prior models, showing that Gibbs-type prior models are the sole species-sampling models for which the function $g$ depends on $(n,k,\mathbf{n})$ only through $n$ and $k$, and the function $f_i$ depends on $(n,k,\mathbf{n})$ only through $n$, $k$ and $n_i$. Such a result shows a critical difference, at the sampling level, between the Pitman-Yor process and Gibbs-type prior models, which lies in the inclusion of the sampling information on the observed number of distinct species in the probability of observing at the $(n+1)$-th draw a species already observed in the sample.

3 Features-sampling models

Features-sampling models generalize species-sampling models by allowing each individual to belong to more than one species, which are now called features. To introduce features-sampling models, we consider a space of features $\mathbb{W}$, which is assumed to be a Polish space, and we denote by $\mathscr{W}$ its Borel $\sigma$-field. Thus $\mathbb{W}$ contains all the possible features’ labels of the population. Observations are represented through the counting measure (3), whose parameter $\tilde{\mu}$ is an almost surely discrete measure with masses in $(0,1)$. When we deal with features-sampling models, the hierarchical formulation (1) specializes as

$$Z_j \mid \tilde{\mu} \stackrel{\text{iid}}{\sim} \mathrm{BeP}(\tilde{\mu}), \qquad \tilde{\mu} \sim \mathscr{M}, \tag{7}$$

where $\tilde{\mu}=\sum_{i\geq 1}\tilde{p}_i\delta_{\tilde{w}_i}$ is an almost surely discrete random measure on $\mathbb{W}$, and $\mathscr{M}$ denotes its law. We recall that: i) conditionally on $\tilde{\mu}$, $(A_{j,i})_{i\geq 1}$ are independent Bernoulli random variables with parameters $(\tilde{p}_i)_{i\geq 1}$; ii) $(\tilde{p}_i)_{i\geq 1}$ are $(0,1)$-valued random weights; iii) $(\tilde{w}_i)_{i\geq 1}$ are random features’ labels, independent of $(\tilde{p}_i)_{i\geq 1}$, and i.i.d. with common (non-atomic) distribution $P$. Completely random measures (CRMs) (daleyII; kingman1967completely) provide a popular class of nonparametric priors $\mathscr{M}$, the most common examples being the Beta process prior and the stable Beta process prior (teh2009indian; james2017bayesian). See also broderick2018posteriors and references therein for other examples of CRM priors, and generalizations thereof. Recently camerlenghi2021scaled investigated an alternative class of nonparametric priors $\mathscr{M}$, generalizing CRM priors and referred to as Scaled Processes (SPs). SP priors first appeared in the work of james2017bayesian.

We assume a random sample $\mathbf{Z}_n:=(Z_1,\ldots,Z_n)$ to be modeled as in (7), and we introduce the predictive distribution of $\tilde{\mu}$, that is the conditional probability of $Z_{n+1}$ given $\mathbf{Z}_n$. Note that, because of the pure discreteness of $\tilde{\mu}$, the observations may share a random number of distinct features, say $K_n=k$, denoted here as $W_1^*,\ldots,W_{K_n}^*$, and each feature $W_i^*$ is displayed by exactly $M_i=m_i$ of the $n$ individuals, as $i=1,\ldots,K_n$. Since the features’ labels are immaterial and i.i.d. from the base measure $P$, the conditional distribution of $Z_{n+1}$, given $\mathbf{Z}_n$, may be equivalently characterized through the vector $(Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})$, where: i) $Y_{n+1}$ is the number of new features displayed by the $(n+1)$th individual, namely features hitherto unobserved in the sample $\mathbf{Z}_n$; ii) $A^*_{n+1,i}$ is a $\{0,1\}$-valued random variable for any $i=1,\ldots,K_n$, with $A^*_{n+1,i}=1$ if the $(n+1)$th individual displays feature $W_i^*$, and $A^*_{n+1,i}=0$ otherwise. Hence, the predictive distribution of $\tilde{\mu}$ is

$$\mathbb{P}\big((Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})=(y,a_1,\ldots,a_{K_n}) \mid \mathbf{Z}_n\big) = f(y,a_1,\ldots,a_k;n,k,\mathbf{m}) \tag{8}$$

where we denote by $f(\,\cdot\,;n,k,\mathbf{m})$ a probability distribution evaluated at $(y,a_1,\ldots,a_k)$, and where $n$, $k$ and $\mathbf{m}:=(m_1,\ldots,m_k)$ are the sampling information. In the rest of this section we specify the function $f$ under the assumption of a CRM prior and of a SP prior, showing its dependence on $n$, $k$ and $\mathbf{m}$. In particular, we show how SP priors allow one to enrich the predictive distribution of CRM priors, by including additional sampling information in terms of the number of distinct features and their corresponding frequencies.
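The quantities entering (8) can be computed from data with a short Python sketch. It assumes that each observation is stored as a set of feature labels; the representation, the function names and the toy sample are hypothetical and only meant to make the notation concrete.

```python
from collections import Counter


def feature_summary(sample):
    """Return (n, k, m): sample size, number of distinct features and
    the frequency m_i of each observed feature (keyed by its label)."""
    counts = Counter(w for z in sample for w in z)
    labels = sorted(counts)
    return len(sample), len(labels), {w: counts[w] for w in labels}


def predictive_vector(sample, new_obs):
    """Return (y, a): the number of new features displayed by new_obs and
    the 0/1 indicators of the previously observed features it displays."""
    _, _, m = feature_summary(sample)
    y = len(set(new_obs) - set(m))
    a = [1 if w in new_obs else 0 for w in m]
    return y, a


sample = [{"a", "b"}, {"a"}, {"c"}]   # hypothetical sample Z_1, Z_2, Z_3
n, k, m = feature_summary(sample)
y, a = predictive_vector(sample, {"a", "d", "e"})
```

Here the new observation displays two features not seen in the sample and one of the three observed features, so the vector in (8) takes the value $(2,1,0,0)$ with the observed features ordered alphabetically.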

3.1 Priors based on CRMs

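Let $\mathsf{M}_{\mathbb{W}}$ denote the space of all boundedly finite measures on $(\mathbb{W},\mathscr{W})$, that is to say $\mu\in\mathsf{M}_{\mathbb{W}}$ iff $\mu(A)<+\infty$ for any bounded set $A\in\mathscr{W}$. Here we recall the definition of a Completely Random Measure (CRM) (see, e.g., daleyII).

Definition 3.1

A Completely Random Measure (CRM) $\tilde{\mu}$ on $(\mathbb{W},\mathscr{W})$ is a random element taking values in the space $\mathsf{M}_{\mathbb{W}}$ such that the random variables $\tilde{\mu}(A_1),\ldots,\tilde{\mu}(A_n)$ are independent for any choice of bounded and disjoint sets $A_1,\ldots,A_n\in\mathscr{W}$ and for any $n\geq 1$.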

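We recall that kingman1967completely proved that a CRM may be decomposed as the sum of a deterministic drift and a purely atomic component. In Bayesian nonparametrics, it is common to consider purely atomic CRMs without fixed points of discontinuity, that is to say $\tilde{\mu}$ may be represented as $\tilde{\mu}=\sum_{i\geq 1}\tilde{\eta}_i\delta_{\tilde{w}_i}$, where $(\tilde{\eta}_i)_{i\geq 1}$ is a sequence of random atoms and $(\tilde{w}_i)_{i\geq 1}$ are the random locations. An appealing property of purely atomic CRMs is the availability of their Laplace functional: for any measurable function $f:\mathbb{W}\to\mathbb{R}_+$ one has

$$\mathbb{E}\left[e^{-\int_{\mathbb{W}} f(w)\,\tilde{\mu}(dw)}\right] = \exp\left\{-\int_{\mathbb{W}\times\mathbb{R}_+}\big(1-e^{-sf(w)}\big)\,\nu(dw,ds)\right\} \tag{9}$$

where $\nu$ is a measure on $\mathbb{W}\times\mathbb{R}_+$ called the Lévy intensity of the CRM $\tilde{\mu}$, and it is such that

$$\nu(\{w\}\times\mathbb{R}_+)=0 \quad \forall w\in\mathbb{W}, \quad\text{and}\quad \int_{A\times\mathbb{R}_+}\min\{s,1\}\,\nu(dw,ds)<\infty \tag{10}$$

for any bounded Borel set $A$. Here, we focus on homogeneous CRMs by assuming that the atoms $\tilde{\eta}_i$s and the locations $\tilde{w}_i$s are independent; in this case the Lévy measure may be written as

$$\nu(dw,ds)=\lambda(s)\,ds\,P(dw)$$

for some measurable function $\lambda:\mathbb{R}_+\to\mathbb{R}_+$ and a probability measure $P$ on $(\mathbb{W},\mathscr{W})$, called the base measure, which is assumed to be diffuse. In this case the distribution of $\tilde{\mu}$ will be denoted as $\mathrm{CRM}(\lambda;P)$, and the second integrability condition in (10) reduces to the following

$$\int_{\mathbb{R}_+}\min\{s,1\}\,\lambda(s)\,ds<+\infty. \tag{11}$$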

In the features-sampling framework, the law of a homogeneous CRM may be used as a prior distribution if the atoms $(\tilde{\eta}_i)_{i\geq 1}$ lie in $(0,1)$, which happens if the Lévy intensity has support on $\mathbb{W}\times(0,1)$. A noteworthy example, widely used in this setting, is the stable Beta process prior (teh2009indian). It is defined as a CRM with Lévy intensity

$$\lambda(s)=\alpha\cdot\frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)}\,s^{-1-\sigma}(1-s)^{c+\sigma-1}\,\mathbb{1}_{(0,1)}(s) \tag{12}$$

where $\alpha>0$, $\sigma\in(0,1)$ and $c>-\sigma$ (james2017bayesian; masoero2019more). Now, we describe the predictive distribution of an arbitrary CRM prior. For the sake of clarity, we fix the following notation

$$\mathrm{Poiss}(y;C):=\frac{C^{y}e^{-C}}{y!},\quad y\in\mathbb{N}, \quad\text{and}\quad \mathrm{Bern}(a;p):=p^{a}(1-p)^{1-a},\quad a\in\{0,1\}$$

to denote the probability mass functions of a Poisson random variable with parameter $C$ and of a Bernoulli random variable with parameter $p$, respectively. We refer to james2017bayesian for a detailed posterior analysis of CRM priors. See also broderick2018posteriors and references therein.

Theorem 3.1 (james2017bayesian)

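Let $(Z_j)_{j\geq 1}$ be exchangeable random variables modeled as in (7), where $\mathscr{M}$ equals the law $\mathrm{CRM}(\lambda;P)$ of a homogeneous CRM with Lévy intensity $\lambda(s)\,ds\,P(dw)$. If $\mathbf{Z}_n$ is a random sample which displays $K_n=k$ distinct features $W_1^*,\ldots,W_{K_n}^*$, and feature $W_i^*$ appears exactly $M_i=m_i$ times in the sample, as $i=1,\ldots,K_n$, then

$$\mathbb{P}\big((Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})=(y,a_1,\ldots,a_{K_n}) \mid \mathbf{Z}_n\big) = \mathrm{Poiss}\left(y;\int_0^1 s(1-s)^{n}\lambda(s)\,ds\right)\prod_{i=1}^{k}\mathrm{Bern}(a_i;p_i^*) \tag{13}$$

being

$$p_i^*:=\frac{\int_0^1 s^{m_i+1}(1-s)^{n-m_i}\lambda(s)\,ds}{\int_0^1 s^{m_i}(1-s)^{n-m_i}\lambda(s)\,ds}.$$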
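Proof.

We consider james2017bayesian for Bernoulli product models (see also camerlenghi2021scaled), thus the distribution of $Z_{n+1}$, given $\mathbf{Z}_n$, equals the distribution of

$$Z'_{n+1}+\sum_{i=1}^{K_n}A^*_{n+1,i}\,\delta_{W_i^*}, \tag{14}$$

where $Z'_{n+1}\mid\tilde{\mu}'=\sum_{i\geq 1}A'_{n+1,i}\delta_{\tilde{w}'_i}\sim\mathrm{BeP}(\tilde{\mu}')$ such that $\tilde{\mu}'=\sum_{i\geq 1}\tilde{\eta}'_i\delta_{\tilde{w}'_i}$ is a CRM with Lévy intensity $(1-s)^{n}\lambda(s)\,ds\,P(dw)$, and $A^*_{n+1,1},\ldots,A^*_{n+1,K_n}$ are Bernoulli random variables with parameters $J_1,\ldots,J_{K_n}$, respectively, such that each $J_i$ is a random variable whose distribution has density function of the form

$$f_{J_i}(s)\propto(1-s)^{n-m_i}s^{m_i}\lambda(s).$$

By exploiting the previous predictive characterization, we can derive the posterior distribution of $Y_{n+1}$, given $\mathbf{Z}_n$, by means of a direct application of the Laplace functional. Indeed, the distribution of $Y_{n+1}$, given $\mathbf{Z}_n$, equals the distribution of $\sum_{i\geq 1}A'_{n+1,i}$. Thus, for any $t>0$, we have the following

$$\mathbb{E}\big[e^{-tY_{n+1}}\mid\mathbf{Z}_n\big]=\mathbb{E}\left[\prod_{i\geq 1}\big(e^{-t}\tilde{\eta}'_i+(1-\tilde{\eta}'_i)\big)\right],$$

where we used the representation $\tilde{\mu}'=\sum_{i\geq 1}\tilde{\eta}'_i\delta_{\tilde{w}'_i}$ and the fact that the $A'_{n+1,i}$s are independent Bernoulli random variables conditionally on $\tilde{\mu}'$. We now use the Laplace functional for $\tilde{\mu}'$ to get

$$\mathbb{E}\big[e^{-tY_{n+1}}\mid\mathbf{Z}_n\big]=\mathbb{E}\left[\exp\left\{\sum_{i\geq 1}\log\big(1+\tilde{\eta}'_i(e^{-t}-1)\big)\right\}\right]=\exp\left\{-(1-e^{-t})\int_0^1(1-s)^{n}s\,\lambda(s)\,ds\right\}.$$

As a direct consequence, the posterior distribution of $Y_{n+1}$ given $\mathbf{Z}_n$ is a Poisson distribution with mean $\int_0^1 s(1-s)^{n}\lambda(s)\,ds$. Again, by exploiting the predictive representation (14), the posterior distribution of $A^*_{n+1,i}$, as $i=1,\ldots,K_n$, is a Bernoulli with the following mean

$$\mathbb{E}[J_i]=\int_0^1 s\,f_{J_i}(s)\,ds=\frac{\int_0^1(1-s)^{n-m_i}s^{m_i+1}\lambda(s)\,ds}{\int_0^1(1-s)^{n-m_i}s^{m_i}\lambda(s)\,ds}.$$

∎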

Corollary 3.1

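Let $(Z_j)_{j\geq 1}$ be exchangeable random variables modeled as in (7), where $\mathscr{M}$ is the law of the stable Beta process. If $\mathbf{Z}_n$ is a random sample which displays $K_n=k$ distinct features $W_1^*,\ldots,W_{K_n}^*$, and feature $W_i^*$ appears exactly $M_i=m_i$ times in the sample, as $i=1,\ldots,K_n$, then

$$\mathbb{P}\big((Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})=(y,a_1,\ldots,a_{K_n}) \mid \mathbf{Z}_n\big) = \mathrm{Poiss}\left(y;\alpha\frac{(c+\sigma)_n}{(c+1)_n}\right)\prod_{i=1}^{k}\mathrm{Bern}\left(a_i;\frac{m_i-\sigma}{n+c}\right), \tag{15}$$

where $(a)_n:=\Gamma(a+n)/\Gamma(a)$ denotes the Pochhammer symbol for $a>0$.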

Proof.

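It is sufficient to specialize Theorem 3.1 for the stable Beta process. In particular, from Theorem 3.1 the posterior distribution of $Y_{n+1}$ given $\mathbf{Z}_n$ is a Poisson distribution with mean

$$\int_0^1 s(1-s)^{n}\lambda(s)\,ds=\alpha\frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)}\int_0^1 s^{-\sigma}(1-s)^{n+c+\sigma-1}\,ds=\alpha\frac{(c+\sigma)_n}{(c+1)_n}.$$

Moreover the parameters of the Bernoulli random variables are equal to

$$p_i^*=\frac{\int_0^1 s^{m_i+1}(1-s)^{n-m_i}\lambda(s)\,ds}{\int_0^1 s^{m_i}(1-s)^{n-m_i}\lambda(s)\,ds}=\frac{B(m_i+1-\sigma,\,c+\sigma+n-m_i)}{B(m_i-\sigma,\,c+\sigma+n-m_i)}=\frac{m_i-\sigma}{n+c}$$

as $i=1,\ldots,K_n$, where $B(\cdot,\cdot)$ denotes the Euler Beta function. ∎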
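The closed forms in (15) lend themselves to a quick numerical check and to a sequential simulation. The Python sketch below is an illustration under stated assumptions: the function names, the parameter values and the hand-written Poisson sampler are hypothetical, and the sampler simply applies the predictive distribution (15) one individual at a time.

```python
import math
import random


def log_beta(a, b):
    """Logarithm of the Euler Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def new_feature_rate(n, alpha, c, sigma):
    """Poisson mean alpha (c+sigma)_n / (c+1)_n from (15), via log-Gamma."""
    return alpha * math.exp(
        math.lgamma(c + sigma + n) - math.lgamma(c + sigma)
        - math.lgamma(c + 1 + n) + math.lgamma(c + 1)
    )


def new_feature_rate_integral(n, alpha, c, sigma):
    """Same mean, written as the Beta integral of s (1-s)^n lambda(s)."""
    log_const = (math.lgamma(1 + c) - math.lgamma(1 - sigma)
                 - math.lgamma(c + sigma))
    return alpha * math.exp(log_const + log_beta(1 - sigma, n + c + sigma))


def old_feature_prob(m_i, n, c, sigma):
    """Probability (m_i - sigma) / (n + c) of re-observing a feature."""
    return (m_i - sigma) / (n + c)


def poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold, k, prod = math.exp(-rate), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_feature_counts(n_individuals, alpha, c, sigma, seed=0):
    """Simulate feature frequencies by applying (15) sequentially."""
    rng = random.Random(seed)
    m = []  # m[i]: number of individuals displaying feature i so far
    for n in range(n_individuals):  # n individuals already observed
        for i in range(len(m)):
            if rng.random() < old_feature_prob(m[i], n, c, sigma):
                m[i] += 1
        m.extend([1] * poisson(new_feature_rate(n, alpha, c, sigma), rng))
    return m


counts = sample_feature_counts(10, alpha=2.0, c=1.0, sigma=0.5)
```

For $\sigma=0$ and $c=1$ the rate reduces to $\alpha/(n+1)$, which is the familiar rate of the Indian buffet process; the sketch accepts `sigma=0.0` for this comparison even though (12) is stated for $\sigma\in(0,1)$.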

3.2 SP priors

From Theorem 3.1, under CRM priors the distribution of the number of new features $Y_{n+1}$ is a Poisson distribution which depends on the sampling information only through the sample size $n$. Moreover, the probability of observing a feature already observed in the sample, say $W_i^*$, depends only on the sample size $n$ and the frequency $m_i$ of feature $W_i^*$ in the initial sample. camerlenghi2021scaled showed that SP priors allow one to enrich the predictive structure of CRM priors, including additional sampling information in the probability of discovering new features. To introduce SP priors, consider a CRM $\tilde{\mu}=\sum_{i\geq 1}\tilde{\tau}_i\delta_{\tilde{w}_i}$ on $\mathbb{W}$, where $(\tilde{\tau}_i)_{i\geq 1}$ are positive random atoms and $(\tilde{w}_i)_{i\geq 1}$ are i.i.d. random atoms, with Lévy intensity $\nu(dw,ds)=\lambda(s)\,ds\,P(dw)$ satisfying

$$\int_0^{\infty}\min\{s,1\}\,\lambda(s)\,ds<+\infty. \tag{16}$$

Consider the ordered jumps $\Delta_1>\Delta_2>\cdots$ of the CRM $\tilde{\mu}$ and define the random measure

$$\tilde{\mu}_{\Delta_1}=\sum_{i\geq 1}\frac{\Delta_{i+1}}{\Delta_1}\,\delta_{\tilde{w}_i}$$

normalizing by the largest jump. The definition of SPs follows by a suitable change of measure of $\Delta_1$ (james2015scaled; camerlenghi2021scaled). Let us denote by $\mathscr{L}(\cdot,a)$ a regular version of the conditional probability distribution of $(\Delta_{i+1}/\Delta_1)_{i\geq 1}$, given $\Delta_1=a$. Now denote by $\Psi_1$ a positive random variable with density function $f_{\Psi_1}$ on $\mathbb{R}_+$, and define

$$\mathscr{L}(\cdot):=\int_{\mathbb{R}_+}\mathscr{L}(\cdot,a)\,f_{\Psi_1}(a)\,da$$

the distribution of $(\Delta_{i+1}/\Delta_1)_{i\geq 1}$ obtained by mixing $\mathscr{L}(\cdot,a)$ with respect to the density function $f_{\Psi_1}$. Thus, we are ready to define a SP.

Definition 3.2

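A Scaled Process (SP) prior on $(\mathbb{W},\mathscr{W})$ is defined as the almost surely discrete random measure

$$\tilde{\mu}_{\Psi_1}:=\sum_{i\geq 1}\tilde{\eta}_i\,\delta_{\tilde{w}_i}, \tag{17}$$

where $(\tilde{\eta}_i)_{i\geq 1}$ has distribution $\mathscr{L}$ and $(\tilde{w}_i)_{i\geq 1}$ is a sequence of independent random variables with common distribution $P$, also independent of $(\tilde{\eta}_i)_{i\geq 1}$. We will write $\tilde{\mu}_{\Psi_1}\sim\mathrm{SP}(\nu,f_{\Psi_1})$.

A thorough account with a complete posterior analysis for SPs is given in camerlenghi2021scaled. Here we characterize the predictive distribution (8) of SPs.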

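Theorem 3.2 (james2017bayesian; camerlenghi2021scaled)

Let $(Z_j)_{j\geq 1}$ be exchangeable random variables modeled as in (7), where $\mathscr{M}$ equals the law $\mathrm{SP}(\nu,f_{\Psi_1})$ of a SP with Lévy intensity $\nu(dw,ds)=\lambda(s)\,ds\,P(dw)$. If $\mathbf{Z}_n$ is a random sample which displays $K_n=k$ distinct features $W_1^*,\ldots,W_{K_n}^*$, and feature $W_i^*$ appears exactly $M_i=m_i$ times in the sample, as $i=1,\ldots,K_n$, then the conditional distribution of $\Psi_1$, given $\mathbf{Z}_n$, has posterior density:

$$f_{\Psi_1\mid\mathbf{Z}_n}(a)\propto\exp\left\{-\sum_{i=1}^{n}\int_0^1 s(1-s)^{i-1}a\lambda(as)\,ds\right\}\prod_{i=1}^{k}\int_0^1 s^{m_i}(1-s)^{n-m_i}a\lambda(as)\,ds\; f_{\Psi_1}(a). \tag{18}$$

Moreover, conditionally on $\mathbf{Z}_n$ and $\Psi_1$,

$$\mathbb{P}\big((Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})=(y,a_1,\ldots,a_{K_n}) \mid \mathbf{Z}_n,\Psi_1\big) = \mathrm{Poiss}\left(y;\int_0^1 s\Psi_1(1-s)^{n}\lambda(s\Psi_1)\,ds\right)\prod_{i=1}^{k}\mathrm{Bern}\big(a_i;p_i^*(\Psi_1)\big) \tag{19}$$

being

$$p_i^*(\Psi_1):=\frac{\int_0^1 s^{m_i+1}(1-s)^{n-m_i}\lambda(s\Psi_1)\,ds}{\int_0^1 s^{m_i}(1-s)^{n-m_i}\lambda(s\Psi_1)\,ds}.$$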
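Proof.

The representation of the predictive distribution (19) follows from camerlenghi2021scaled. Indeed the posterior distribution of the largest jump directly follows from (camerlenghi2021scaled, Equation (4)). In addition (camerlenghi2021scaled, Proposition 2) shows that the conditional distribution of $Z_{n+1}$, given $\mathbf{Z}_n$ and $\Psi_1$, equals the distribution of the following counting measure

$$Z'_{n+1}+\sum_{i=1}^{K_n}A^*_{n+1,i}\,\delta_{W_i^*}, \tag{20}$$

where $Z'_{n+1}\mid\tilde{\mu}'_{\Psi_1}=\sum_{i\geq 1}A'_{n+1,i}\delta_{\tilde{w}'_i}\sim\mathrm{BeP}(\tilde{\mu}'_{\Psi_1})$ and $\tilde{\mu}'_{\Psi_1}$ is a CRM with Lévy intensity of the form

$$\nu'_{\Psi_1}(dw,ds)=(1-s)^{n}\Psi_1\lambda(\Psi_1 s)\,\mathbb{1}_{(0,1)}(s)\,ds\,P(dw).$$

Moreover $A^*_{n+1,1},\ldots,A^*_{n+1,K_n}$ are Bernoulli random variables with parameters $J_1,\ldots,J_{K_n}$, respectively, such that conditionally on $\Psi_1$ each $J_i$ has distribution with density function of the form

$$f_{J_i\mid\Psi_1}(s)\propto(1-s)^{n-m_i}s^{m_i}\Psi_1\lambda(\Psi_1 s)\quad\text{on }(0,1).$$

As in the proof of Theorem 3.1, the distribution of $Y_{n+1}$, given $\mathbf{Z}_n$ and $\Psi_1$, equals the distribution of $\sum_{i\geq 1}A'_{n+1,i}$. Thus, by the evaluation of the Laplace functional, one may easily realize that the last random sum has a Poisson distribution with mean $\int_0^1 s\Psi_1(1-s)^{n}\lambda(s\Psi_1)\,ds$. Moreover, by exploiting the posterior representation (20), the variables $A^*_{n+1,i}$, as $i=1,\ldots,K_n$, conditionally on $\mathbf{Z}_n$ and $\Psi_1$, are independent and Bernoulli distributed with mean

$$\mathbb{E}[J_i\mid\Psi_1]=\int_0^1 s\,f_{J_i\mid\Psi_1}(s)\,ds=\frac{\int_0^1(1-s)^{n-m_i}s^{m_i+1}\Psi_1\lambda(s\Psi_1)\,ds}{\int_0^1(1-s)^{n-m_i}s^{m_i}\Psi_1\lambda(s\Psi_1)\,ds}.$$

∎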

Remark 3.1

According to (18), the conditional distribution of $\Psi_1$, given $\mathbf{Z}_n$, may include the whole sampling information, depending on the specification of $\lambda$ and $f_{\Psi_1}$, and hence the conditional distribution of $Y_{n+1}$, given $\mathbf{Z}_n$, may also include such sampling information. As a corollary of Theorem 3.2, the conditional distribution of $Y_{n+1}$, given $\mathbf{Z}_n$, is a mixture of Poisson distributions that may include the whole sampling information; in particular, the amount of sampling information in the posterior distribution is uniquely determined by the mixing distribution, namely by the conditional distribution of $\Psi_1$, given $\mathbf{Z}_n$.

Hereafter, we specialize Theorem 3.2 for the stable SP, that is a particular SP defined through a CRM with a Lévy intensity $\nu(dw,ds)=\lambda(s)\,ds\,P(dw)$ such that $\lambda(s)=\sigma s^{-1-\sigma}$ for a parameter $\sigma\in(0,1)$. We refer to camerlenghi2021scaled for a detailed posterior analysis of the stable SP prior.

Corollary 3.2

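Let $(Z_j)_{j\geq 1}$ be exchangeable random variables modeled as in (7), where $\mathscr{M}$ equals the law $\mathrm{SP}(\nu,f_{\Psi_1})$ of a SP, with $\nu(dw,ds)=\sigma s^{-1-\sigma}\,ds\,P(dw)$ for some $\sigma\in(0,1)$. If $\mathbf{Z}_n$ is a random sample which displays $K_n=k$ distinct features $W_1^*,\ldots,W_{K_n}^*$, and feature $W_i^*$ appears exactly $M_i=m_i$ times in the sample, as $i=1,\ldots,K_n$, then the conditional distribution of $\Psi_1$, given $\mathbf{Z}_n$, has posterior density:

$$f_{\Psi_1\mid\mathbf{Z}_n}(a)\propto a^{-k\sigma}\exp\left\{-\sigma a^{-\sigma}\sum_{i=1}^{n}B(1-\sigma,i)\right\}f_{\Psi_1}(a) \tag{21}$$

having denoted by $B(\cdot,\cdot)$ the classical Euler Beta function. Moreover, conditionally on $\mathbf{Z}_n$ and $\Psi_1$,

$$\mathbb{P}\big((Y_{n+1},A^*_{n+1,1},\ldots,A^*_{n+1,K_n})=(y,a_1,\ldots,a_{K_n}) \mid \mathbf{Z}_n,\Psi_1\big) = \mathrm{Poiss}\big(y;\sigma\Psi_1^{-\sigma}B(1-\sigma,n+1)\big)\prod_{i=1}^{k}\mathrm{Bern}\left(a_i;\frac{m_i-\sigma}{n-\sigma+1}\right). \tag{22}$$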
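Proof.

The proof is a plain application of Theorem 3.2 under the choice $\lambda(s)=\sigma s^{-1-\sigma}$. Indeed, $a\lambda(as)=\sigma a^{-\sigma}s^{-1-\sigma}$, so that $\int_0^1 s(1-s)^{i-1}a\lambda(as)\,ds=\sigma a^{-\sigma}B(1-\sigma,i)$ and $\int_0^1 s^{m_i}(1-s)^{n-m_i}a\lambda(as)\,ds=\sigma a^{-\sigma}B(m_i-\sigma,n-m_i+1)$. ∎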
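The quantities in (21) and (22) can be evaluated with a few lines of Python. The sketch below is only illustrative: the function names are hypothetical, and the prior density of $\Psi_1$ is passed in by the user as a log-density, with an exponential density used purely as a placeholder.

```python
import math


def beta_fn(a, b):
    """Euler Beta function B(a, b), computed through log-Gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def log_posterior_unnorm(a, n, k, sigma, log_prior):
    """Unnormalized logarithm of the posterior density (21) at a > 0.

    The sample enters only through the sample size n and the number of
    distinct features k; log_prior is a user-supplied log-density.
    """
    s_n = sum(beta_fn(1 - sigma, i) for i in range(1, n + 1))
    return -k * sigma * math.log(a) - sigma * a ** (-sigma) * s_n + log_prior(a)


def new_feature_rate(psi1, n, sigma):
    """Poisson mean sigma psi1^(-sigma) B(1 - sigma, n + 1) from (22)."""
    return sigma * psi1 ** (-sigma) * beta_fn(1 - sigma, n + 1)


def old_feature_prob(m_i, n, sigma):
    """Probability (m_i - sigma) / (n - sigma + 1) from (22)."""
    return (m_i - sigma) / (n - sigma + 1)


def log_prior(a):
    """Placeholder log-density (unit exponential), a hypothetical choice."""
    return -a


value = log_posterior_unnorm(1.0, n=1, k=1, sigma=0.5, log_prior=log_prior)
```

The signature of `log_posterior_unnorm` makes the point of Corollary 3.2 explicit: under the stable SP, the frequencies $m_i$ do not enter the posterior density of $\Psi_1$, while they do enter the Bernoulli probabilities.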

4 Predictive characterizations for SPs

In this section, we introduce and discuss Johnson’s “sufficientness” postulates in the context of features-sampling models, under the class of SP priors. According to Theorem 3.1, if the features-sampling model is a CRM prior, then the conditional distribution of $Y_{n+1}$, given $\mathbf{Z}_n$, is a Poisson distribution that depends on the sampling information only through the sample size $n$. Moreover, the conditional probability of generating an old feature $W_i^*$, given $\mathbf{Z}_n$, depends on the sampling information only through $n$ and $m_i$. As shown in Theorem 3.2, SP priors enrich the predictive structure of CRM priors through the conditional distribution of the latent variable $\Psi_1$, given the observable sample $\mathbf{Z}_n$. In the next theorem, we characterize the class of SP priors for which the conditional distribution of , given , depends on the sampling information only through .

Theorem 4.1

Let be exchangeable random variables modeled as in (7), where equals , being . Moreover, suppose that is a random sample which displays