# The combinatorial structure of beta negative binomial processes

We characterize the combinatorial structure of conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure. In Bayesian nonparametric applications, such processes have served as models for latent multisets of features underlying data. Analogously, random subsets arise from conditionally-i.i.d. sequences of Bernoulli processes with a common beta process base measure, in which case the combinatorial structure is described by the Indian buffet process. Our results give a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process (NB-IBP). As an intermediate step toward this goal, we provide a construction for the beta negative binomial process that avoids a representation of the underlying beta process base measure. We describe the key Markov kernels needed to use an NB-IBP representation in a Markov chain Monte Carlo algorithm targeting a posterior distribution.


## 1 Introduction

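The focus of this article is on exchangeable sequences of multisets, that is, set-like objects in which repetition is allowed. Let $(\Omega, \mathcal{A})$ be a complete, separable metric space equipped with its Borel $\sigma$-algebra and let $\mathbb{Z}_+ := \{0, 1, 2, \ldots\}$ denote the non-negative integers. By a point process on $(\Omega, \mathcal{A})$, we mean a random measure $X$ on $(\Omega, \mathcal{A})$ such that $X(A)$ is a $\mathbb{Z}_+$-valued random variable for every $A \in \mathcal{A}$. Because $(\Omega, \mathcal{A})$ is Borel, we may write

$$X = \sum_{k \le \kappa} \delta_{\gamma_k} \tag{1}$$

for a random element $\kappa$ in $\mathbb{Z}_+ \cup \{\infty\}$ and some (not necessarily distinct) random elements $\gamma_1, \gamma_2, \ldots$ in $\Omega$. We will take the point process $X$ to represent the multiset of its unique atoms $\gamma_k$ with corresponding multiplicities $X\{\gamma_k\}$. We say $X$ is simple when $X\{\gamma_k\} = 1$ for all $k \le \kappa$, in which case $X$ represents a set.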

In statistical applications, latent feature models associate each data point in a dataset with a latent point process from an exchangeable sequence of simple point processes, which we denote by $(X_n)_{n \in \mathbb{N}}$. The unique atoms among the sequence are referred to as features, and a data point is said to possess those features appearing in its associated point process. We can also view these latent feature models as generalizations of mixture models that allow data points to belong to multiple, potentially overlapping clusters [10, 2]. For example, in an object recognition task, a model for a dataset consisting of street camera images could associate each image with a subset of object classes (for example, “trees”, “cars”, and “houses”) appearing in the images. In a document modeling task, a model for a dataset of news articles could associate each document with a subset of topics (for example, “politics”, “Europe”, and “economics”) discussed in the documents. Recent work in Bayesian nonparametrics utilizing exchangeable sequences of simple point processes has focused on the Indian buffet process (IBP) [10, 7], which characterizes the marginal distribution of the sequence $(X_n)_{n \in \mathbb{N}}$ when its elements are conditionally-i.i.d. Bernoulli processes, given a common beta process base measure [11, 24].

If the point processes are no longer constrained to be simple, then data points can contain multiple copies of features. For example, in the object recognition task, an image could be associated with two cars, two trees, and one house. In the document modeling task, an article could be associated with 100 words from the politics topic, 200 words from the Europe topic, and 40 words from the economics topic. In this article, we describe a count analogue of the IBP called the negative binomial Indian buffet process (NB-IBP), which characterizes the marginal distribution of $(X_n)_{n \in \mathbb{N}}$ when it is a conditionally-i.i.d. sequence of negative binomial processes [3, 28], given a common beta process base measure. This characterization allows us to describe new Markov chain Monte Carlo algorithms for posterior inference that do not require numerical integration over representations of the underlying beta process.

### 1.1 Results

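Let $c > 0$, let $\tilde{B}_0$ be a non-atomic, finite measure on $\Omega$, and let $\eta$ be a Poisson (point) process on $\Omega \times (0, 1]$ with intensity

$$(ds, dp) \mapsto c\, p^{-1} (1 - p)^{c - 1}\, dp\, \tilde{B}_0(ds). \tag{2}$$

As this intensity is non-atomic and merely $\sigma$-finite, $\eta$ will have an infinite number of atoms almost surely (a.s.), and so we may write $\eta = \sum_{j=1}^{\infty} \delta_{(\gamma_j, b_j)}$ for some a.s. unique random elements $b_1, b_2, \ldots$ in $(0, 1]$ and $\gamma_1, \gamma_2, \ldots$ in $\Omega$. From $\eta$, construct the random measure

$$B := \sum_{j=1}^{\infty} b_j \delta_{\gamma_j}, \tag{3}$$

which is a beta process. The construction of $B$ ensures that the random variables $B(A_1), \ldots, B(A_k)$ are independent for every finite, disjoint collection $A_1, \ldots, A_k \in \mathcal{A}$, and $B$ is said to be completely random or, equivalently, to have independent increments. We review completely random measures in Section 2.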

The conjugacy of the family of beta distributions with various other exponential families carries over to beta processes and randomizations by probability kernels lying in these same exponential families. The beta process is therefore a convenient choice for further randomizations, or in the language of Bayesian nonparametrics, as a prior stochastic process. For example, previous work has focused on the (simple) point process that takes each atom $\gamma_j$ with probability $b_j$ for every $j \in \mathbb{N}$, which is, conditioned on $B$, called a Bernoulli process (with base measure $B$). In this article, we study the point process

$$X := \sum_{j=1}^{\infty} \zeta_j \delta_{\gamma_j}, \tag{4}$$

where the random variables $\zeta_1, \zeta_2, \ldots$ are conditionally independent given $B$ and

$$\zeta_j \mid b_j \sim \mathrm{NB}(r, b_j), \qquad j \in \mathbb{N}, \tag{5}$$

for some parameter $r > 0$. Here, $\mathrm{NB}(r, p)$ denotes the negative binomial distribution with parameters $r > 0$, $p \in (0, 1)$, whose probability mass function (p.m.f.) is

$$\mathrm{NB}(z; r, p) := \frac{(r)_z}{z!}\, p^z (1 - p)^r, \qquad z \in \mathbb{Z}_+, \tag{6}$$

where $(a)_n := a(a+1)\cdots(a+n-1)$ with $(a)_0 := 1$ is the $n$th rising factorial. Note that, conditioned on $B$, the point process $X$ is the fixed component of a negative binomial process [3, 28]. Unconditionally, $X$ is the ordinary component of a beta negative binomial process, which we formally define in Section 2.
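As a quick numerical check of the parameterization in equation (6), the following minimal Python sketch evaluates the p.m.f. through log-gamma functions. The function name `nb_pmf` and the parameter values are illustrative assumptions, not part of the original article. Under this convention the mean is $rp/(1-p)$, so the distribution concentrates on large counts as $p \to 1$.

```python
import math


def nb_pmf(z, r, p):
    """NB(z; r, p) = (r)_z / z! * p^z * (1 - p)^r, with (r)_z the rising factorial."""
    log_rising = math.lgamma(r + z) - math.lgamma(r)  # log of (r)_z
    log_mass = log_rising - math.lgamma(z + 1) + z * math.log(p) + r * math.log1p(-p)
    return math.exp(log_mass)


# Illustrative parameter values (assumptions for this sketch).
r, p = 2.5, 0.3
total = sum(nb_pmf(z, r, p) for z in range(200))      # should be close to 1
mean = sum(z * nb_pmf(z, r, p) for z in range(200))   # should be close to r p / (1 - p)
print(total, mean, r * p / (1 - p))
```

Note that some libraries parameterize the negative binomial by the complementary probability $1 - p$, so the convention should be checked before reusing a library routine.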

Conditioned on $B$, construct a sequence of point processes $(X_n)_{n \in \mathbb{N}}$ that are i.i.d. copies of $X$. In this case, $(X_n)_{n \in \mathbb{N}}$ is an exchangeable sequence of beta negative binomial processes, and our primary goal is to characterize the (unconditional) distribution of the sequence. This task is non-trivial because the construction of the point process $X$ in equation (4) is not finitary in the sense that no finite subset of the atoms of $B$ determines $X$ with probability one. In the case of conditionally-i.i.d. Bernoulli processes, the unconditional distributions of the measures remain in the class of Bernoulli processes, and so a finitary construction is straightforwardly obtained with Poisson (point) processes. Then the distribution of the sequence, which Thibaux and Jordan showed is characterized by the IBP, may be derived immediately from the conjugacy between the classes of beta and Bernoulli processes [13, 11, 24]. While conjugacy also holds between the classes of beta and negative binomial processes [3, 28], the unconditional law of the point process $X$ is no longer that of a negative binomial process; instead, it is the law of a beta negative binomial process.

Existing constructions for beta negative binomial processes truncate the number of atoms in the underlying beta process and typically use slice sampling to remove the error introduced by this approximation asymptotically [3, 28, 23, 19]. In this work, we instead provide a construction for the beta negative binomial process directly, avoiding a representation of the underlying beta process. To this end, note that while the beta process $B$ has a countably infinite number of atoms a.s., it can be shown that $B$ is still an a.s. finite measure. It follows as an easy consequence that the point process $X$ is a.s. finite as well and, therefore, has an a.s. finite number of atoms, which we represent with a Poisson process. The atomic masses are then characterized by the digamma distribution, introduced by Sibuya, which has p.m.f. (for parameters $r, \theta > 0$) given by

$$\mathrm{digamma}(z; r, \theta) := \frac{1}{\psi(r + \theta) - \psi(\theta)}\, \frac{(r)_z}{(r + \theta)_z}\, z^{-1}, \qquad z \ge 1, \tag{7}$$

where $\psi(x) := \Gamma'(x)/\Gamma(x)$ denotes the digamma function. In Section 3, we prove the following: Let $\sum_{k=1}^{\kappa} \delta_{\gamma_k}$ be a Poisson process on $\Omega$ with finite intensity

$$ds \mapsto c\,[\psi(c + r) - \psi(c)]\, \tilde{B}_0(ds), \tag{8}$$

that is, $\kappa$ is a Poisson random variable with mean $c\,[\psi(c + r) - \psi(c)]\, \tilde{B}_0(\Omega)$ and $\gamma_1, \gamma_2, \ldots$ are i.i.d. random variables, independent from $\kappa$, each with distribution $\tilde{B}_0 / \tilde{B}_0(\Omega)$. Let $\zeta_1, \zeta_2, \ldots$ be an independent collection of i.i.d. $\mathrm{digamma}(r, c)$ random variables. Then

$$X \overset{d}{=} \sum_{k=1}^{\kappa} \zeta_k \delta_{\gamma_k}, \tag{9}$$

where $X$ is the beta negative binomial process defined in equation (4).
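The finitary construction in equation (9) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the base measure is taken to be `total_mass` times the uniform distribution on $[0, 1]$, all function names are hypothetical, the digamma function is approximated by a recurrence plus an asymptotic series, and the digamma distribution is sampled by inverting its c.d.f. with a hard truncation at `z_max` (a numerical safeguard that is not part of the construction).

```python
import math
import random


def psi(x):
    """Digamma function: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def sample_poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals falling in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_digamma(r, theta, rng, z_max=10**6):
    """Inverse-c.d.f. draw from digamma(r, theta), equation (7); z_max is our truncation."""
    pmf = r / ((r + theta) * (psi(r + theta) - psi(theta)))  # mass at z = 1
    u, z, cdf = rng.random(), 1, pmf
    while u > cdf and z < z_max:
        # Ratio of consecutive masses: (r + z) / (r + theta + z) * z / (z + 1).
        pmf *= (r + z) / (r + theta + z) * z / (z + 1)
        z += 1
        cdf += pmf
    return z


def sample_bnbp(r, c, total_mass, rng):
    """Atoms of X as {location: multiplicity}; assumes base measure total_mass * Uniform[0, 1]."""
    kappa = sample_poisson(c * total_mass * (psi(c + r) - psi(c)), rng)
    return {rng.random(): sample_digamma(r, c, rng) for _ in range(kappa)}


rng = random.Random(0)
print(sample_bnbp(r=2.0, c=3.0, total_mass=2.0, rng=rng))
```

No atoms of the underlying beta process are ever instantiated: only the finitely many atoms that carry positive counts are drawn.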

With this construction and conjugacy (the relevant results are reproduced in Section 4), characterizing the distribution of $(X_n)_{n \in \mathbb{N}}$ is straightforward. However, in applications we are only interested in the combinatorial structure of the sequence $(X_n)_{n \in \mathbb{N}}$, that is, the pattern of sharing amongst the atoms while ignoring the locations of the atoms themselves. More precisely, for every $n \in \mathbb{N}$, let $\mathcal{H}_n$ be the set of all length-$n$ sequences of non-negative integers, excluding the all-zero sequence. Elements in $\mathcal{H}_n$ are called histories, and can be thought of as representations of non-empty multisets of $[n] := \{1, \ldots, n\}$. For every $h \in \mathcal{H}_n$, let $M_h$ be the number of elements $s \in \Omega$ such that $X_j\{s\} = h(j)$ for all $j \le n$. By the combinatorial structure of a finite subsequence $X_{[n]} := (X_1, \ldots, X_n)$, we will mean the collection $(M_h)_{h \in \mathcal{H}_n}$ of counts, which together can be understood as representations of multisets of histories. These counts are combinatorial in the following sense: Let $\phi$ be a Borel automorphism on $(\Omega, \mathcal{A})$, that is, a measurable permutation of $\Omega$ whose inverse is also measurable, and define the transformed processes $X_j \circ \phi^{-1}$, for every $j \le n$, where each atom $s$ is repositioned to $\phi(s)$. The collection $(M_h)_{h \in \mathcal{H}_n}$ is invariant to this transformation, and it is in this sense that the counts only capture the combinatorial structure. In Section 4, we prove the following.

The probability mass function of is

 P{Mh=mh:h∈Hn} (10)

where , for every , and .

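As one would expect, equation (10) is reminiscent of the p.m.f. for the IBP, and indeed the collection $(M_h)_{h \in \mathcal{H}_n}$ is characterized by what we call the negative binomial Indian buffet process, or NB-IBP. Let $\text{beta-NB}(r, \alpha, \beta)$ denote the beta negative binomial distribution (with parameters $r, \alpha, \beta > 0$), that is, we write $Z \sim \text{beta-NB}(r, \alpha, \beta)$ if there exists a beta random variable $p \sim \mathrm{beta}(\alpha, \beta)$ such that $Z \mid p \sim \mathrm{NB}(r, p)$. Write $T := \tilde{B}_0(\Omega)$. In the NB-IBP, a sequence of customers enters an Indian buffet restaurant:

• The first customer

  • selects $\mathrm{Poisson}(cT[\psi(c + r) - \psi(c)])$ distinct dishes, taking $\mathrm{digamma}(r, c)$ servings of each dish, independently.

• For $n \ge 1$, the $(n+1)$st customer

  • takes $\text{beta-NB}(r, S_{n,k}, c + nr)$ servings of each previously sampled dish $k$, where $S_{n,k}$ is the total number of servings taken of dish $k$ by the first $n$ customers;

  • selects $\mathrm{Poisson}(cT[\psi(c + (n+1)r) - \psi(c + nr)])$ new dishes to taste, taking $\mathrm{digamma}(r, c + nr)$ servings of each dish, independently.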
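The buffet scheme above can be simulated directly. The sketch below is a hedged illustration in standard-library Python, with the rates and distributions read off from the posterior predictive law in Sections 3 and 4: a customer arriving after `n` others takes beta-NB$(r, S, c + nr)$ servings of a dish with `S` servings so far, and samples a Poisson number of new dishes with mean $cT[\psi(c + (n+1)r) - \psi(c + nr)]$, each with digamma$(r, c + nr)$ servings. Function names, the numerical guard on the beta draw, and the truncation `z_max` are assumptions of this sketch.

```python
import math
import random


def psi(x):
    """Digamma function: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def sample_poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals falling in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_beta_nb(r, alpha, beta, rng):
    """beta-NB(r, alpha, beta): p ~ beta(alpha, beta), then NB(r, p) as a gamma-Poisson mixture."""
    p = min(max(rng.betavariate(alpha, beta), 1e-12), 1.0 - 1e-9)  # numerical guard (ours)
    return sample_poisson(rng.gammavariate(r, p / (1.0 - p)), rng)


def sample_digamma(r, theta, rng, z_max=10**6):
    """Inverse-c.d.f. draw from digamma(r, theta); z_max is a truncation added for safety."""
    pmf = r / ((r + theta) * (psi(r + theta) - psi(theta)))  # mass at z = 1
    u, z, cdf = rng.random(), 1, pmf
    while u > cdf and z < z_max:
        pmf *= (r + z) / (r + theta + z) * z / (z + 1)
        z += 1
        cdf += pmf
    return z


def sample_nb_ibp(n_customers, r, c, total_mass, rng):
    """Return one list per dish holding the servings taken by customers 1..n_customers."""
    dishes = []
    for n in range(n_customers):  # n customers have already been served
        theta = c + n * r
        for servings in dishes:  # revisit previously sampled dishes
            servings.append(sample_beta_nb(r, sum(servings), theta, rng))
        n_new = sample_poisson(c * total_mass * (psi(theta + r) - psi(theta)), rng)
        for _ in range(n_new):  # new dishes: earlier customers took zero servings
            dishes.append([0] * n + [sample_digamma(r, theta, rng)])
    return dishes


rng = random.Random(0)
for dish in sample_nb_ibp(n_customers=3, r=2.0, c=3.0, total_mass=2.0, rng=rng):
    print(dish)
```

Each printed row is the history of one dish; tallying identical rows gives the counts $M_h$. The Poisson means telescope, so the expected total number of dishes after $N$ customers is $cT[\psi(c + Nr) - \psi(c)]$.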

The interpretation here is that, for every $h \in \mathcal{H}_n$, the count $M_h$ is the number of dishes $k$ such that, for every $j \le n$, customer $j$ took $h(j)$ servings of dish $k$. Then the sum $S(h) := \sum_{j \le n} h(j)$ in equation (10) is the total number of servings taken of such a dish by the first $n$ customers. Because the NB-IBP is the combinatorial structure of a conditionally-i.i.d. process, its distribution, given in Theorem 1.1, must be invariant to every permutation of the customers. We can state this property formally as follows. [(Exchangeability)] Let $\pi$ be a permutation of $[n]$, and, for $h \in \mathcal{H}_n$, note that the composition $h \circ \pi \in \mathcal{H}_n$ is given by $(h \circ \pi)(j) = h(\pi(j))$, for every $j \le n$. Then

$$(M_h)_{h \in \mathcal{H}_n} \overset{d}{=} (M_{h \circ \pi})_{h \in \mathcal{H}_n}. \tag{11}$$

The exchangeability of the combinatorial structure and its p.m.f. in equation (10) allow us to develop Gibbs sampling techniques analogous to those originally developed for the IBP [7, 17]. In particular, because the NB-IBP avoids a representation of the beta process underlying the exchangeable sequence $(X_n)_{n \in \mathbb{N}}$, these posterior inference algorithms do not require numerical integration over representations of the beta process. We discuss some of these techniques in Section 5.

## 2 Preliminaries

Here, we review completely random measures and formally define the negative binomial and beta negative binomial processes. We provide characterizations via Laplace functionals and conclude the section with a discussion of related work.

### 2.1 Completely random measures

Consider the space of $\sigma$-finite measures on $(\Omega, \mathcal{A})$, equipped with the $\sigma$-algebra generated by the projection maps $\mu \mapsto \mu(A)$ for all $A \in \mathcal{A}$. A random measure $\xi$ on $(\Omega, \mathcal{A})$ is a random element in this space, and we say that $\xi$ is completely random or has independent increments when, for every finite collection of disjoint, measurable sets $A_1, \ldots, A_k \in \mathcal{A}$, the random variables $\xi(A_1), \ldots, \xi(A_k)$ are independent. Here, we briefly review completely random measures; for a thorough treatment, the reader should consult Kallenberg, Chapter 12, or the classic text by Kingman. Every completely random measure $\xi$ can be written as a sum of three independent parts

$$\xi = \bar{\xi} + \sum_{s \in \mathscr{A}} \vartheta_s \delta_s + \sum_{(s, p) \in \eta} p\, \delta_s \quad \text{a.s.}, \tag{12}$$

called the diffuse, fixed, and ordinary components, respectively, where:

1. $\bar{\xi}$ is a non-random, non-atomic measure;

2. $\mathscr{A} \subseteq \Omega$ is a non-random countable set whose elements are referred to as the fixed atoms and whose masses $(\vartheta_s)_{s \in \mathscr{A}}$ are independent random variables in $\mathbb{R}_+$ (the non-negative real numbers);

3. $\eta$ is a Poisson process on $\Omega \times (0, \infty)$ whose intensity $\mathbb{E}\eta$ is $\sigma$-finite and has diffuse projections onto $\Omega$, that is, the measure $(\mathbb{E}\eta)(\,\cdot \times (0, \infty))$ on $\Omega$ is non-atomic.

In this article, we will only study purely-atomic completely random measures, which therefore have no diffuse component. It follows that we may characterize the law of $\xi$ by (1) the distributions of the atomic masses in the fixed component, and (2) the intensity of the Poisson process underlying the ordinary component.

### 2.2 Definitions

By a base measure on $(\Omega, \mathcal{A})$, we mean a $\sigma$-finite measure $B$ on $(\Omega, \mathcal{A})$ such that $B\{s\} \le 1$ for all $s \in \Omega$. For the remainder of the article, fix a base measure $B_0$. We may write

$$B_0 = \tilde{B}_0 + \sum_{s \in \mathscr{A}} \bar{b}_s \delta_s \tag{13}$$

for some non-atomic measure $\tilde{B}_0$; a countable set $\mathscr{A} \subseteq \Omega$; and constants $\bar{b}_s$ in $(0, 1]$. (Note that we have relaxed the condition on $\tilde{B}_0$ in the Introduction to be merely $\sigma$-finite.) As discussed in the Introduction, a convenient model for random base measures is the class of beta processes, a class of completely random measures introduced by Hjort. For the remainder of the article, let $c \colon \Omega \to \mathbb{R}_+$ be a measurable function, which we call a concentration function (or parameter when it is constant).

[(Beta process)] A random measure $B$ on $(\Omega, \mathcal{A})$ is a beta process with concentration function $c$ and base measure $B_0$, written $B \sim \mathrm{BP}_L(c, B_0)$, when it is purely atomic and completely random, with a fixed component

$$\sum_{s \in \mathscr{A}} \vartheta_s \delta_s, \qquad \vartheta_s \overset{\text{ind}}{\sim} \mathrm{beta}\big(c(s)\bar{b}_s,\; c(s)(1 - \bar{b}_s)\big), \tag{14}$$

and an ordinary component with intensity measure

$$(ds, dp) \mapsto c(s)\, p^{-1} (1 - p)^{c(s) - 1}\, dp\, \tilde{B}_0(ds). \tag{15}$$

It is straightforward to show that a beta process is itself a base measure with probability one. This definition of the beta process generalizes the version given in the Introduction to a non-homogeneous process with a fixed component. Likewise, we generalize our earlier definition of a negative binomial process to include an ordinary component.

[(Negative binomial process)] A point process $X$ on $(\Omega, \mathcal{A})$ is a negative binomial process with parameter $r > 0$ and base measure $B_0$, written $X \sim \mathrm{NBP}(r, B_0)$, when it is purely atomic and completely random, with a fixed component

$$\sum_{s \in \mathscr{A}} \vartheta_s \delta_s, \qquad \vartheta_s \overset{\text{ind}}{\sim} \mathrm{NB}(r, \bar{b}_s), \tag{16}$$

and an ordinary component with intensity measure

$$(ds, dp) \mapsto r\, \delta_1(dp)\, \tilde{B}_0(ds). \tag{17}$$

The fixed component in this definition was given by Broderick et al. and Zhou et al. (and by Thibaux for the case $r = 1$). Here, we have additionally defined an ordinary component, following intuitions from Roy.

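The law of a random measure is completely characterized by its Laplace functional, and this representation is often simpler to manipulate: From Campbell’s theorem, or a version of the Lévy–Khinchin formula for Borel spaces, one can show that the Laplace functional of $X \sim \mathrm{NBP}(r, B_0)$ is

$$f \mapsto \mathbb{E}\big[e^{-X(f)}\big] = \exp\Big[-\int \big(1 - e^{-f(s)}\big)\, r\, \tilde{B}_0(ds)\Big] \prod_{s \in \mathscr{A}} \Big[\frac{1 - \bar{b}_s}{1 - \bar{b}_s e^{-f(s)}}\Big]^r, \tag{18}$$

where $f$ ranges over non-negative measurable functions and $X(f) := \int f(s)\, X(ds)$.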

Finally, we define beta negative binomial processes via their conditional law.

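[(Beta negative binomial process)] A random measure $X$ on $(\Omega, \mathcal{A})$ is a beta negative binomial process with parameter $r > 0$, concentration function $c$, and base measure $B_0$, written

$$X \sim \mathrm{BNBP}(r, c, B_0),$$

if there exists a beta process $B$ with concentration function $c$ and base measure $B_0$ such that

$$X \mid B \sim \mathrm{NBP}(r, B). \tag{19}$$

This characterization was given by Broderick et al. and can be seen to match a special case of the model in Zhou et al. (see the discussion of related work in Section 2.3). It is straightforward to show that a beta negative binomial process is also completely random, and that its Laplace functional is given by

$$\begin{aligned} \mathbb{E}\big[e^{-X(f)}\big] ={}& \exp\Big[-\int \Big[1 - \Big(\frac{1 - p}{1 - p e^{-f(s)}}\Big)^r\Big]\, c(s)\, p^{-1} (1 - p)^{c(s) - 1}\, dp\, \tilde{B}_0(ds)\Big] \\ &\times \prod_{s \in \mathscr{A}} \int \Big(\frac{1 - p}{1 - p e^{-f(s)}}\Big)^r\, \mathrm{beta}\big(p;\, c(s)\bar{b}_s,\, c(s)(1 - \bar{b}_s)\big)\, dp, \end{aligned}$$

for $f \colon \Omega \to \mathbb{R}_+$ measurable, where we note that the factors in the product term take the form of the Laplace transform of the beta negative binomial distribution.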

### 2.3 Related work

The term “negative binomial process” has historically been reserved for processes with negative binomial increments (a class into which the process we study here does not fall), and these processes have long been studied in probability and statistics. We direct the reader to Kozubowski and Podgórski for references.

One way to construct a process with negative binomial increments is to rely upon the fact that a negative binomial distribution is a gamma mixture of Poisson distributions. In particular, similarly to the construction by Lo, consider a Cox process directed by a gamma process $G$ with finite non-atomic intensity. So constructed, the Cox process has independent increments with negative binomial distributions. Like the beta process (with a finite intensity underlying its ordinary component), the gamma process has, with probability one, a countably infinite number of atoms but a finite total mass, and so the Cox process is a.s. finite as well. Despite these similarities, a comparison of Laplace functionals shows that the law of this Cox process is not that of a beta negative binomial process. Using an approach directly analogous to the derivation of the IBP, Titsias characterizes the combinatorial structure of a sequence of point processes that, conditioned on $G$, are independent and identically distributed copies of the Cox process. See Section 4 for comments. This was the first count analogue of the IBP; the possibility of a count analogue arising from beta negative binomial processes was first raised by Zhou et al., who described the distribution of the number of new dishes sampled by each customer. Recent work by Zhou, Madrid and Scott, independent of our own and proceeding along different lines, describes a combinatorial process related to the NB-IBP (following a re-scaling of the beta process intensity).

Finally, we note that another negative binomial process without negative binomial increments was defined on Euclidean space by Barndorff-Nielsen and Yeo and extended to general spaces by Grégoire and by Wolpert and Ickstadt. These measures are generally Cox processes on $\Omega$ directed by random measures of the form

$$ds \mapsto \int_{\mathbb{R}_+} \nu(t, ds)\, G(dt),$$

where $G$ is again a gamma process, this time on $\mathbb{R}_+$, and $\nu$ is a probability kernel from $\mathbb{R}_+$ to $\Omega$, for example, the Gaussian kernel.

## 3 Constructing beta negative binomial processes

Before providing a finitary construction for the beta negative binomial process, we make a few remarks on the digamma distribution. For the remainder of the article, define $\lambda_{r,\theta} := \psi(r + \theta) - \psi(\theta)$ for some $r, \theta > 0$. Following a representation by Sibuya, we may relate the digamma and beta negative binomial distributions as follows: Let $Z \sim \mathrm{digamma}(r, \theta)$ and define $W := Z - 1$, the latter of which has p.m.f.

$$\mathbb{P}\{W = w\} = (\theta \lambda_{r,\theta})^{-1}\, \frac{w + r}{w + 1}\, \text{beta-NB}(w; r, 1, \theta), \qquad w \in \mathbb{Z}_+. \tag{21}$$

Deriving the Laplace transform of the law of $W$ is straightforward, and because $Z = W + 1$, one may verify that the Laplace transform of the digamma distribution is given by

 (22)

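The form of equation (21) suggests the following rejection sampler, which was first proposed by Devroye, Proposition 2, Remark 1: Let $r, \theta > 0$ and let $(U_n)_{n \in \mathbb{N}}$ be an i.i.d. sequence of uniformly distributed random numbers. Let

$$(Y_n)_{n \in \mathbb{N}} \overset{\text{i.i.d.}}{\sim} \text{beta-NB}(r, 1, \theta),$$

and define $\eta := \inf\{n \in \mathbb{N} : \max\{r, 1\}\, U_n < (Y_n + r)/(Y_n + 1)\}$, that is, the index of the first proposal accepted with probability proportional to the weight $(w + r)/(w + 1)$ in equation (21). Then

$$Y_\eta + 1 \sim \mathrm{digamma}(r, \theta),$$

and

$$\mathbb{E}\eta = \frac{\max\{r, 1\}}{\theta\,[\psi(r + \theta) - \psi(\theta)]}; \qquad \mathbb{E}\eta$$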
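The rejection sampler suggested by equation (21) can be sketched as follows in standard-library Python. The acceptance rule used here is derived from equation (21) under the assumption that proposals come from beta-NB$(r, 1, \theta)$ and that the weight $(w + r)/(w + 1)$ is bounded by $\max\{r, 1\}$; function names and the numerical guard on the beta draw are hypothetical. The beta-NB proposal is drawn as a beta mixture of gamma-Poisson (negative binomial) variables.

```python
import math
import random


def psi(x):
    """Digamma function: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def sample_poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals falling in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_beta_nb(r, alpha, beta, rng):
    """beta-NB(r, alpha, beta): p ~ beta(alpha, beta), then NB(r, p) as a gamma-Poisson mixture."""
    p = min(max(rng.betavariate(alpha, beta), 1e-12), 1.0 - 1e-9)  # numerical guard (ours)
    return sample_poisson(rng.gammavariate(r, p / (1.0 - p)), rng)


def sample_digamma_rejection(r, theta, rng):
    """Return (draw, number of proposals) for one digamma(r, theta) variate."""
    bound = max(r, 1.0)  # upper bound on the weight (w + r) / (w + 1)
    trials = 0
    while True:
        trials += 1
        y = sample_beta_nb(r, 1.0, theta, rng)
        if bound * rng.random() < (y + r) / (y + 1.0):
            return y + 1, trials  # shift by one: the digamma law lives on z >= 1


rng = random.Random(0)
r, theta = 2.0, 3.0
draws = [sample_digamma_rejection(r, theta, rng) for _ in range(20000)]
mean_trials = sum(t for _, t in draws) / len(draws)
expected_trials = max(r, 1.0) / (theta * (psi(r + theta) - psi(theta)))
print(mean_trials, expected_trials)
```

The empirical average number of proposals can be compared against the expression for $\mathbb{E}\eta$; for these illustrative parameter values, the acceptance probability is $\theta\lambda_{r,\theta}/\max\{r, 1\} = 7/8$.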

With digamma random variables, we provide a finitary construction for the beta negative binomial process. The following result generalizes the statement given by Theorem 1.1 (in the Introduction) to a non-homogeneous process, which also has a fixed component.

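Let $r > 0$, and let $\vartheta := (\vartheta_s)_{s \in \mathscr{A}}$ be a collection of independent random variables with

$$\vartheta_s \sim \text{beta-NB}\big(r,\; c(s)\bar{b}_s,\; c(s)(1 - \bar{b}_s)\big), \qquad s \in \mathscr{A}. \tag{23}$$

Let $\sum_{j=1}^{\kappa} \delta_{\gamma_j}$ be a Poisson process on $\Omega$, independent from $\vartheta$, with (finite) intensity

$$ds \mapsto c(s)\,[\psi(c(s) + r) - \psi(c(s))]\, \tilde{B}_0(ds), \tag{24}$$

for some random element $\kappa$ in $\mathbb{Z}_+$ and a.s. unique random elements $\gamma_1, \gamma_2, \ldots$ in $\Omega$, and put $\mathcal{F} := \sigma(\kappa, \gamma_1, \gamma_2, \ldots)$. Let $(\zeta_j)_{j \in \mathbb{N}}$ be a collection of random variables that are independent from $\vartheta$ and are conditionally independent given $\mathcal{F}$, and let

$$\zeta_j \mid \mathcal{F} \sim \mathrm{digamma}(r, c(\gamma_j)), \qquad j \in \mathbb{N}. \tag{25}$$

Then

$$X = \sum_{s \in \mathscr{A}} \vartheta_s \delta_s + \sum_{j=1}^{\kappa} \zeta_j \delta_{\gamma_j} \sim \mathrm{BNBP}(r, c, B_0). \tag{26}$$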

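We have

$$\mathbb{E}^{\mathcal{F}}\big[e^{-X(f)}\big] = \prod_{s \in \mathscr{A}} \mathbb{E}\big[e^{-\vartheta_s f(s)}\big] \times \prod_{j=1}^{\kappa} \mathbb{E}^{\mathcal{F}}\big[e^{-\zeta_j f(\gamma_j)}\big], \tag{27}$$

for every $f \colon \Omega \to \mathbb{R}_+$ measurable. For $s \in \Omega$, write $g(s)$ for the Laplace transform of the $\mathrm{digamma}(r, c(s))$ distribution evaluated at $f(s)$, which is given by equation (22). We may then write

$$\prod_{j=1}^{\kappa} \mathbb{E}^{\mathcal{F}}\big[e^{-\zeta_j f(\gamma_j)}\big] = \prod_{j=1}^{\kappa} g(\gamma_j). \tag{28}$$

Then by the chain rule of conditional expectation, complete randomness, and Campbell’s theorem,

$$\begin{aligned} \mathbb{E}\big[e^{-X(f)}\big] &= \prod_{s \in \mathscr{A}} \mathbb{E}\big[e^{-\vartheta_s f(s)}\big] \times \exp\Big[-\int_{\Omega} (1 - g(s))\, c(s)\, \lambda_{r, c(s)}\, \tilde{B}_0(ds)\Big] && (29) \\ &= \prod_{s \in \mathscr{A}} \Big[\int \Big(\frac{1 - p}{1 - p e^{-f(s)}}\Big)^r\, \mathrm{beta}\big(p;\, c(s)\bar{b}_s,\, c(s)(1 - \bar{b}_s)\big)\, dp\Big] \\ &\quad \times \exp\Big[-\int \Big[1 - \Big(\frac{1 - p}{1 - p e^{-f(s)}}\Big)^r\Big]\, c(s)\, p^{-1} (1 - p)^{c(s) - 1}\, dp\, \tilde{B}_0(ds)\Big], && (30) \end{aligned}$$

which is the desired form of the Laplace functional.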

A finitary construction for conditionally-i.i.d. sequences of negative binomial processes with a common beta process base measure now follows from known conjugacy results. In particular, for every $n \in \mathbb{N}$, let $X_{[n]} := (X_1, \ldots, X_n)$. The following theorem characterizes the conjugacy between the (classes of) beta and negative binomial processes and follows from repeated application of the results by Kim, Theorem 3.3, or Hjort, Corollary 4.1. This result, which is tailored to our needs, is similar to those already given by Broderick et al. and Zhou et al., and generalizes the result given by Thibaux for the case $r = 1$. [(Hjort, Zhou et al.)] Let $B \sim \mathrm{BP}_L(c, B_0)$ and, conditioned on $B$, let $(X_n)_{n \in \mathbb{N}}$ be a sequence of i.i.d. negative binomial processes with parameter $r$ and base measure $B$. Then for every $n \in \mathbb{N}$,

$$B \mid X_{[n]} \sim \mathrm{BP}_L\Big(c_n,\; \frac{c}{c_n} B_0 + \frac{1}{c_n} S_n\Big), \tag{31}$$

where $S_n := \sum_{i=1}^{n} X_i$ and $c_n(s) := c(s) + S_n\{s\} + nr$, for $s \in \Omega$.

It follows immediately that, for every $n \in \mathbb{N}$, the law of $X_{n+1}$ conditioned on $X_1, \ldots, X_n$ is given by

$$X_{n+1} \mid X_{[n]} \sim \mathrm{BNBP}\Big(r,\; c_n,\; \frac{c}{c_n} B_0 + \frac{1}{c_n} S_n\Big). \tag{32}$$

We may therefore construct this exchangeable sequence of beta negative binomial processes with Theorem 3.

## 4 Combinatorial structure

We now characterize the combinatorial structure of the exchangeable sequence $(X_n)_{n \in \mathbb{N}}$ in the case when $c > 0$ is constant and $B_0 = \tilde{B}_0$ is non-atomic. In order to make this precise, we introduce a quotient of the space of sequences of integer-valued measures. Let $n \in \mathbb{N}$ and for any pair $U := (U_1, \ldots, U_n)$ and $V := (V_1, \ldots, V_n)$ of (finite) sequences of integer-valued measures, write $U \sim V$ when there exists a Borel automorphism $\phi$ on $(\Omega, \mathcal{A})$ satisfying $U_j = V_j \circ \phi^{-1}$ for every $j \le n$. It is easy to verify that $\sim$ is an equivalence relation. Each sequence $U$ belongs to exactly one equivalence class. The quotient space induced by $\sim$ is itself a Borel space, and can be related to the Borel space of sequences of $\mathbb{Z}_+$-valued measures by coarsening the $\sigma$-algebra to that generated by the functionals

$$M_h(U_1, \ldots, U_n) := \#\{s \in \Omega : \forall j \le n,\ U_j\{s\} = h(j)\}, \qquad h \in \mathcal{H}_n, \tag{33}$$

where $\#A$ denotes the cardinality of $A$, and $\mathcal{H}_n$ is the space of histories defined in the Introduction. The collection $(M_h)_{h \in \mathcal{H}_n}$ of multiplicities (of histories) corresponding to $X_1, \ldots, X_n$, also defined in the Introduction, then satisfies $M_h = M_h(X_1, \ldots, X_n)$ for every $h \in \mathcal{H}_n$. The collection thus identifies a point in the quotient space induced by $\sim$. Our aim is to characterize the distribution of $(M_h)_{h \in \mathcal{H}_n}$, for every $n \in \mathbb{N}$.
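To make the multiplicities of histories concrete, the following small Python sketch tallies them for a toy configuration. The representation of the measures as a dictionary from atom locations to tuples of multiplicities, the function name `history_counts`, and the numerical values are all assumptions made for illustration. Relocating the atoms by an injective map (a stand-in for a Borel automorphism) leaves the counts unchanged.

```python
from collections import Counter


def history_counts(atoms):
    """Tally histories: atoms maps a location s to the tuple (U_1{s}, ..., U_n{s})."""
    # The all-zero history is excluded, as in the definition of the space of histories.
    return Counter(h for h in atoms.values() if any(h))


# Toy example with n = 3 measures and four atoms (illustrative values).
atoms = {0.12: (2, 0, 1), 0.57: (1, 1, 0), 0.80: (2, 0, 1), 0.93: (0, 3, 0)}
M = history_counts(atoms)
print(M)

# Relocate every atom s to 1 - s; the multiplicities of histories do not change.
moved = {1.0 - s: h for s, h in atoms.items()}
print(history_counts(moved) == M)
```

Only the counts attached to each history survive the relocation, which is the sense in which the collection of multiplicities records combinatorial structure and nothing about atom locations.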

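Let $n \in \mathbb{N}$, and for $\hbar \in \mathcal{H}_n$ define $\mathcal{H}^{(\hbar)}_{n+1}$ to be the collection of histories in $\mathcal{H}_{n+1}$ that agree with $\hbar$ on the first $n$ entries. Then note that

$$M_\hbar = \sum_{h \in \mathcal{H}^{(\hbar)}_{n+1}} M_h, \qquad \hbar \in \mathcal{H}_n, \tag{34}$$

that is, the multiplicities at stage $n + 1$ completely determine the multiplicities at all earlier stages. It follows that

$$\begin{aligned} \mathbb{P}\{M_h = m_h : h \in \mathcal{H}_{n+1}\} ={}& \mathbb{P}\{M_\hbar = m_\hbar : \hbar \in \mathcal{H}_n\} \\ &\times \mathbb{P}\{M_h = m_h : h \in \mathcal{H}_{n+1} \mid M_\hbar = m_\hbar : \hbar \in \mathcal{H}_n\}, \end{aligned}$$

where $m_\hbar := \sum_{h \in \mathcal{H}^{(\hbar)}_{n+1}} m_h$ for $\hbar \in \mathcal{H}_n$. The structure of this factorization suggests an inductive proof for Theorem 1.1.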

### 4.1 The law of Mh for h∈H1

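Note that $\mathcal{H}_1$ is isomorphic to $\mathbb{N}$ and that the collection $(M_h)_{h \in \mathcal{H}_1}$ counts the number of atoms of each positive integer mass. It follows from Theorem 1.1 and a transfer argument (Propositions 6.10, 6.11 and 6.13) that there exist:

1. a Poisson random variable $\kappa$ with mean $cT\lambda_{r,c}$, where $T := \tilde{B}_0(\Omega)$;

2. an i.i.d. collection $(\gamma_j)_{j \in \mathbb{N}}$ of a.s. unique random elements in $\Omega$;

3. an i.i.d. collection $(\zeta_j)_{j \in \mathbb{N}}$ of $\mathrm{digamma}(r, c)$ random variables;

all mutually independent, such that

$$X_1 = \sum_{j=1}^{\kappa} \zeta_j \delta_{\gamma_j} \quad \text{a.s.}$$

It follows that

$$M_h = \#\{j \le \kappa : \zeta_j = h(1)\} \quad \text{a.s.}, \qquad \text{for } h \in \mathcal{H}_1, \tag{36}$$

and $\kappa = \sum_{h \in \mathcal{H}_1} M_h$ a.s. Therefore,

$$\mathbb{P}\{M_h = m_h : h \in \mathcal{H}_1\} = \mathbb{P}\Big\{\kappa = \sum_{h \in \mathcal{H}_1} m_h\Big\}\, \mathbb{P}\Big\{M_h = m_h : h \in \mathcal{H}_1 \,\Big|\, \kappa = \sum_{h \in \mathcal{H}_1} m_h\Big\}. \tag{37}$$

Because $(\zeta_j)_{j \in \mathbb{N}}$ are i.i.d., the collection $(M_h)_{h \in \mathcal{H}_1}$ has a multinomial distribution conditioned on its sum $\kappa$. Namely, $M_h$ counts the number of times, in $\kappa$ independent trials, that the multiplicity $h(1)$ arises from a $\mathrm{digamma}(r, c)$ distribution. In particular,

$$\mathbb{P}\Big\{M_h = m_h : h \in \mathcal{H}_1 \,\Big|\, \kappa = \sum_{h \in \mathcal{H}_1} m_h\Big\} = \frac{\big(\sum_{h \in \mathcal{H}_1} m_h\big)!}{\prod_{h \in \mathcal{H}_1} (m_h!)} \prod_{h \in \mathcal{H}_1} \big[\mathrm{digamma}(h(1); r, c)\big]^{m_h}. \tag{38}$$

It follows that

$$\mathbb{P}\{M_h = m_h : h \in \mathcal{H}_1\} = \frac{(cT\lambda_{r,c})^{\sum_{h \in \mathcal{H}_1} m_h}}{\prod_{h \in \mathcal{H}_1} (m_h!)}\, \exp(-cT\lambda_{r,c}) \prod_{h \in \mathcal{H}_1} \big[\mathrm{digamma}(h(1); r, c)\big]^{m_h}. \tag{39}$$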
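Equation (39) is straightforward to evaluate on the log scale. The sketch below, in standard-library Python, takes the counts as a dictionary `{z: m_z}` (number of atoms of $X_1$ with mass $z$); the function names and the numerical digamma approximation are assumptions of this illustration. Equation (39) is also the joint p.m.f. of independent Poisson counts with means $cT\lambda_{r,c}\,\mathrm{digamma}(z; r, c)$, which offers a convenient sanity check.

```python
import math


def psi(x):
    """Digamma function: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def digamma_pmf(z, r, theta):
    """digamma(z; r, theta) = (r)_z / ((r + theta)_z * z) / (psi(r + theta) - psi(theta))."""
    log_ratio = (math.lgamma(r + z) - math.lgamma(r)
                 - math.lgamma(r + theta + z) + math.lgamma(r + theta))
    return math.exp(log_ratio) / (z * (psi(r + theta) - psi(theta)))


def log_pmf_first_customer(counts, r, c, total_mass):
    """Log of equation (39); counts maps a mass z >= 1 to the number of atoms m_z."""
    rate = c * total_mass * (psi(c + r) - psi(c))  # c * T * lambda_{r,c}
    n_atoms = sum(counts.values())
    # The rate is positive, so taking its logarithm is safe.
    out = n_atoms * math.log(rate) - rate
    for z, m in counts.items():
        out += m * math.log(digamma_pmf(z, r, c)) - math.lgamma(m + 1)
    return out


# Two atoms of mass 1 and one atom of mass 3 (illustrative values).
print(log_pmf_first_customer({1: 2, 3: 1}, r=2.0, c=3.0, total_mass=2.0))
```

For example, the empty configuration has log-probability $-cT\lambda_{r,c}$, the probability that the Poisson number of atoms is zero.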

### 4.2 The conditional law of Mh for h∈Hn+1

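Let $n \in \mathbb{N}$. Recall that $S_n := \sum_{j=1}^{n} X_j$ for $n \in \mathbb{N}$, and write $S(\hbar) := \sum_{j \le n} \hbar(j)$ for $\hbar \in \mathcal{H}_n$. We may write

$$S_n = \sum_{\hbar \in \mathcal{H}_n} \sum_{j=1}^{M_\hbar} S(\hbar)\, \delta_{\omega_{\hbar, j}}, \tag{40}$$

for some collection $\omega := (\omega_{\hbar, j})$ of a.s. distinct random elements in $\Omega$. It follows from Remark 3, Theorem 1.1, and a transfer argument that there exist:

1. a Poisson random variable $\kappa$ with mean $cT\lambda_{r, c + nr}$;

2. an i.i.d. collection $(\gamma_j)_{j \in \mathbb{N}}$ of a.s. unique random elements in $\Omega$, a.s. distinct also from $\omega$;

3. an i.i.d. collection $(\zeta_j)_{j \in \mathbb{N}}$ of $\mathrm{digamma}(r, c + nr)$ random variables;

4. for each $\hbar \in \mathcal{H}_n$, an i.i.d. collection $(\vartheta_{\hbar, j})_{j \in \mathbb{N}}$ of random variables satisfying

$$\vartheta_{\hbar, j} \sim \text{beta-NB}(r, S(\hbar), c + nr) \qquad \text{for } j \in \mathbb{N};$$

all mutually independent and independent of $X_{[n]}$, such that

$$X_{n+1} = \sum_{\hbar \in \mathcal{H}_n} \sum_{j=1}^{M_\hbar} \vartheta_{\hbar, j}\, \delta_{\omega_{\hbar, j}} + \sum_{j=1}^{\kappa} \zeta_j \delta_{\gamma_j} \quad \text{a.s.} \tag{41}$$

Conditioned on $X_{[n]}$, the first and second terms on the right-hand side correspond to the fixed and ordinary components of $X_{n+1}$, respectively. Let

$$\mathcal{H}^{(0)}_{n+1} := \{h \in \mathcal{H}_{n+1} : h(j) = 0,\ j \le n\} \tag{42}$$

be the set of histories for which $h(n+1)$ is the first non-zero element. Then, with probability one,

$$M_h = \#\{j \le \kappa : \zeta_j = h(n+1)\} \qquad \text{for } h \in \mathcal{H}^{(0)}_{n+1}, \tag{43}$$

and

$$M_h = \#\{j \le M_\hbar : \vartheta_{\hbar, j} = h(n+1)\} \qquad \text{for } \hbar \in \mathcal{H}_n \text{ and } h \in \mathcal{H}^{(\hbar)}_{n+1}. \tag{44}$$

By the stated independence of the variables above, we have

$$\mathbb{P}\{M_h = m_h : h \in \mathcal{H}_{n+1} \mid M_\hbar = m_\hbar : \hbar \in \mathcal{H}_n\} = \mathbb{P}\{M_h = m_h : h \in \mathcal{H}^{(0)}_{n+1}\} \prod_{\hbar \in \mathcal{H}_n} \mathbb{P}\{M_h = m_h : h \in \mathcal{H}^{(\hbar)}_{n+1} \mid M_\hbar = m_\hbar\}. \tag{45}$$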

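Let $\mathcal{H}^{+}_{n+1} := \mathcal{H}_{n+1} \setminus \mathcal{H}^{(0)}_{n+1}$. For every $\hbar \in \mathcal{H}_n$, the random variables $(\vartheta_{\hbar, j})_{j \in \mathbb{N}}$ are i.i.d., and therefore, conditioned on $M_\hbar$, the collection $(M_h)_{h \in \mathcal{H}^{(\hbar)}_{n+1}}$ has a multinomial distribution. In particular, the product term in equation (45) is given by

$$\prod_{\hbar \in \mathcal{H}_n} \mathbb{P}\{M_h = m_h : h \in \mathcal{H}^{(\hbar)}_{n+1} \mid M_\hbar = m_\hbar\} = \frac{\prod_{\hbar \in \mathcal{H}_n} (m_\hbar!)}{\prod_{h \in \mathcal{H}^{+}_{n+1}} (m_h!)} \prod_{h \in \mathcal{H}^{+}_{n+1}} \big[\text{beta-NB}(h(n+1); r, S(h) - h(n+1), c + nr)\big]^{m_h}.$$

The p.m.f. of the beta negative binomial distribution is given by

$$\text{beta-NB}(z; r, \alpha, \beta) = \frac{(r)_z}{z!}\, \frac{B(z + \alpha, r + \beta)}{B(\alpha, \beta)}, \qquad z \in \mathbb{Z}_+, \tag{46}$$

for positive parameters $r$, $\alpha$ and $\beta$, where $B(\alpha, \beta) := \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$ denotes the beta function. We have that $\kappa = \sum_{h \in \mathcal{H}^{(0)}_{n+1}} M_h$ a.s., and therefore

$$\mathbb{P}\{M_h = m_h : h \in \mathcal{H}^{(0)}_{n+1}\} = \mathbb{P}\Big\{\kappa = \sum_{h \in \mathcal{H}^{(0)}_{n+1}} m_h\Big\} \times \mathbb{P}\Big\{M_h = m_h : h \in \mathcal{H}^{(0)}_{n+1} \,\Big|\, \kappa = \sum_{h \in \mathcal{H}^{(0)}_{n+1}} m_h\Big\}. \tag{47}$$

Because $(\zeta_j)_{j \in \mathbb{N}}$ are i.i.d., conditioned on the sum $\kappa$, the collection $(M_h)_{h \in \mathcal{H}^{(0)}_{n+1}}$ has a multinomial distribution, and so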

 P{Mh=mh:h∈H(0)n+1∣∣κ=∑h∈H(0)n+1mh} (48)

It follows that

 P{Mh=mh:h∈Hn+1|Mℏ=mℏ:ℏ∈Hn} =(cTλr,c+nr)∑h∈H(0)n+1mh(∑h∈H(0)n+1mh)!exp(−cTλr,c+nr) (49) ×∏ℏ∈Hn(mℏ!)∏h∈H+n+1(mh!)∏h∈H+n+1[beta-NB(h(n+1);r,S(h)−h(n+1),c+nr)mh] ×(∑h∈H(