# The convergent Indian buffet process

We propose a new Bayesian nonparametric prior for latent feature models, which we call the convergent Indian buffet process (CIBP). We show that under the CIBP, the number of latent features is distributed as a Poisson distribution with the mean monotonically increasing but converging to a certain value as the number of objects goes to infinity. That is, the expected number of features is bounded above even when the number of objects goes to infinity, unlike the standard Indian buffet process under which the expected number of features increases with the number of objects. We provide two alternative representations of the CIBP based on a hierarchical distribution and a completely random measure, respectively, which are of independent interest. The proposed CIBP is assessed on a high-dimensional sparse factor model.

06/08/2020


## 1 Introduction

In this paper, we introduce a new three-parameter generalization of the Indian buffet process (IBP). The IBP, first introduced by [2], is an exchangeable distribution over binary matrices with a finite number of rows but an infinite number of columns. In the context of latent feature models, a binary matrix $\Xi = (\xi_{jk})$ for $j \in [p]$ and $k \in \mathbb{N}$ describes the feature allocation for $p$ objects by letting $\xi_{jk} = 1$ if the $j$-th object possesses the $k$-th feature and $\xi_{jk} = 0$ otherwise. The IBP and its two- and three-parameter generalizations [9, 8] have been widely used in various applications [e.g., 4, 6, 5, 1].

It is well known that the expected number of features increases at a certain rate (logarithmic or polynomial) as the number of objects increases under the one-, two-, and three-parameter IBPs [2, 9, 8]. Therefore, these IBPs, which can produce many unnecessary features, may not be suitable for modelling data sets that are believed to have a finite number of features. For example, in macroeconomic applications, fluctuations in data such as stock returns can be traced to a few important sources, so it is natural to assume that the number of features is fixed even if the data dimension increases [7].

In this paper, we propose a new stochastic process for latent feature models, under which the distribution of the number of features converges to a certain fixed distribution. Under the proposed process, fewer unnecessary features are generated than under the standard IBPs, and thus both interpretability and prediction ability of the model can be improved.

### 1.1 Convergent Indian buffet process

Our proposed variant of the IBP, which we call the convergent Indian buffet process (CIBP), can be described by the following restaurant analogy.

###### Definition 1 (The restaurant analogy of the CIBP).

Let $\gamma > 0$, $\alpha > 0$ and $\kappa > 0$. We call the stochastic process given below the restaurant analogy of $\mathrm{CIBP}(\gamma,\alpha,\kappa)$:

1. The first customer tries $\mathrm{Poisson}\left(\gamma B(\alpha+1,\kappa+1)/B(\alpha,\kappa+1)\right)$ dishes, where $B(a,b)$ denotes the beta function with parameters $a$ and $b$.

2. For every $j = 2,\dots,p$, the $j$-th customer

• tries each previously tasted dish independently according to

$$\mathrm{Bernoulli}\left(\frac{m_{j,k}+\alpha}{j+\kappa+\alpha}\right) \tag{1.1}$$

where $m_{j,k}$ is the number of previous customers (before the $j$-th customer) who have tried the $k$-th dish;

• and tries

$$\mathrm{Poisson}\left(\gamma\frac{B(\alpha+1,\kappa+j)}{B(\alpha,\kappa+1)}\right) \tag{1.2}$$

new dishes.

The restaurant analogy leads to a binary matrix whose number of rows is the number of customers and whose number of columns is unbounded, where the $(j,k)$-th element of the binary matrix is equal to 1 if the $j$-th customer tried the $k$-th dish and 0 otherwise. We denote by $\mathrm{CIBP}(\gamma,\alpha,\kappa)$ the distribution of the binary matrix induced by the above restaurant analogy. In this note, we discuss properties and constructions of the CIBP.
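The restaurant analogy can be simulated directly. The following is a minimal sketch in plain Python, not the authors' code: the function names `beta_fn`, `sample_poisson` and `sample_cibp_restaurant` are hypothetical, and the Poisson sampler is Knuth's multiplication method, chosen only because it needs nothing beyond the standard library.

```python
import math
import random


def beta_fn(a, b):
    """Beta function B(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_cibp_restaurant(p, gamma, alpha, kappa, rng):
    """Return p rows of 0/1 entries generated by the restaurant analogy."""
    counts = []  # counts[k] = number of previous customers who tried dish k
    rows = []
    for j in range(1, p + 1):
        # Previously tasted dishes: Bernoulli((m_jk + alpha) / (j + kappa + alpha)).
        row = [int(rng.random() < (m + alpha) / (j + kappa + alpha)) for m in counts]
        # New dishes: Poisson(gamma * B(alpha + 1, kappa + j) / B(alpha, kappa + 1)).
        mean_new = gamma * beta_fn(alpha + 1, kappa + j) / beta_fn(alpha, kappa + 1)
        row += [1] * sample_poisson(mean_new, rng)
        # Update the counts of old dishes and register the new ones.
        counts = [m + x for m, x in zip(counts, row)] + row[len(counts):]
        rows.append(row)
    width = len(counts)
    # Pad earlier rows with zeros for dishes introduced by later customers.
    return [r + [0] * (width - len(r)) for r in rows]
```

Calling `sample_cibp_restaurant(30, 3.0, 1.0, 2.0, random.Random(0))` returns a 30-row matrix whose columns are exactly the dishes tried by at least one customer; the parameter values here are arbitrary illustrations.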

### 1.2 Organization

The rest of the paper is organized as follows. In Section 2, we show that the number of features under the CIBP follows a Poisson distribution whose mean increases monotonically but converges to a certain value as the number of objects goes to infinity. The name convergent IBP comes from this property. We also describe the connection between the CIBP and the two-parameter IBP. In Section 3, we provide two alternative representations of the CIBP, where the first one is based on a hierarchical distribution of Poisson, Beta and Bernoulli distributions and the second one is based on random measures. In Section 4, as an application, we use the CIBP as the prior distribution on the factor loading matrix for Bayesian estimation of a sparse factor model. We provide a straightforward posterior computation algorithm and some numerical examples. In Section 5, we give the proofs for the results of Section 3. Section 6 concludes the paper.

### 1.3 Notation

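We denote by $\mathbb{1}(\cdot)$ the indicator function. Let $\mathbb{R}$ be the set of real numbers and $\mathbb{R}_+$ be the set of positive numbers. Let $\mathbb{N}$ be the set of natural numbers. For $n \in \mathbb{N}$, we let $[n] := \{1,\dots,n\}$. For notational convenience, we let $\bar{B}^{a_2,b_2}_{a_1,b_1}$ be the ratio of two beta functions defined as

$$\bar{B}^{a_2,b_2}_{a_1,b_1} := \frac{B(a_1+a_2,\,b_1+b_2)}{B(a_1,b_1)}.$$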

## 2 Properties

### 2.1 Distribution of the number of features

In this section, we show that the number of features (i.e., dishes) under the CIBP follows a Poisson distribution whose mean remains bounded as the number of objects increases. Let $K^+$ be the number of nonzero columns of $\Xi$, which represents the number of features. Formally, we can define

$$K^+ := K^+(\Xi) := \sum_{k=1}^{\infty}\mathbb{1}(\xi_{\bullet k} \neq 0),$$

where $\xi_{\bullet k}$ denotes the $k$-th column of $\Xi$. The following proposition describes the distribution of $K^+$.

###### Proposition 2.1.

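If $\Xi \sim \mathrm{CIBP}(\gamma,\alpha,\kappa)$, then

$$K^+ \sim \mathrm{Poisson}\left(\gamma\left(1-\bar{B}^{0,p}_{\alpha,\kappa+1}\right)\right), \tag{2.1}$$

where $\bar{B}^{0,p}_{\alpha,\kappa+1} \in (0,1)$. Moreover, the Poisson mean monotonically increases and converges to $\gamma$ as $p \to \infty$, which, in particular, implies that $K^+$ converges to the random variable $K \sim \mathrm{Poisson}(\gamma)$ in distribution.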

###### Proof.

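From the restaurant analogy of the CIBP, we have that

$$K^+ \overset{d}{=} \sum_{j=1}^{p} K_j^{\mathrm{new}}, \quad \text{where } K_j^{\mathrm{new}} \overset{\mathrm{ind}}{\sim} \mathrm{Poisson}\left(\gamma\frac{B(\alpha+1,\kappa+j)}{B(\alpha,\kappa+1)}\right).$$

Therefore, by the additive property of independent Poisson random variables,

$$K^+ \sim \mathrm{Poisson}\left(\frac{\gamma}{B(\alpha,\kappa+1)}\sum_{j=1}^{p}B(\alpha+1,\kappa+j)\right).$$

From the identity $B(a+1,b) = B(a,b) - B(a,b+1)$, we have

$$\sum_{j=1}^{p}B(\alpha+1,\kappa+j) = \sum_{j=1}^{p}\left\{B(\alpha,\kappa+j)-B(\alpha,\kappa+j+1)\right\} = B(\alpha,\kappa+1)-B(\alpha,\kappa+p+1),$$

which implies (2.1). The fact that $\bar{B}^{0,p}_{\alpha,\kappa+1} \in (0,1)$ follows from $0 < B(\alpha,\kappa+p+1) < B(\alpha,\kappa+1)$.

For the second assertion, note that

$$\bar{B}^{0,p}_{\alpha,\kappa+1} = \frac{\Gamma(\alpha)\Gamma(\kappa+p+1)}{\Gamma(\alpha+\kappa+p+1)}\cdot\frac{\Gamma(\alpha+\kappa+1)}{\Gamma(\alpha)\Gamma(\kappa+1)} = \prod_{j=1}^{p}\frac{\kappa+j}{\alpha+\kappa+j},$$

where $\Gamma(\cdot)$ denotes the gamma function. Every factor lies in $(0,1)$, so the product decreases in $p$ and the Poisson mean increases. Since $\sum_{j=1}^{\infty}\alpha/(\alpha+\kappa+j) = \infty$, it follows that $\bar{B}^{0,p}_{\alpha,\kappa+1} \to 0$ as $p \to \infty$. ∎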
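The two expressions for the Poisson mean in the proof, the telescoping sum and the closed form $\gamma(1-\bar{B}^{0,p}_{\alpha,\kappa+1})$, can be compared numerically. The sketch below uses hypothetical function names and arbitrary parameter values; it only illustrates the identity and the convergence of the mean to $\gamma$.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def poisson_mean_by_sum(p, gamma, alpha, kappa):
    """Sum of the per-customer Poisson means from (1.2), j = 1, ..., p."""
    base = log_beta(alpha, kappa + 1)
    return gamma * sum(
        math.exp(log_beta(alpha + 1, kappa + j) - base) for j in range(1, p + 1)
    )


def poisson_mean_closed_form(p, gamma, alpha, kappa):
    """Closed form gamma * (1 - prod_{j<=p} (kappa + j) / (alpha + kappa + j))."""
    prod = 1.0
    for j in range(1, p + 1):
        prod *= (kappa + j) / (alpha + kappa + j)
    return gamma * (1.0 - prod)


if __name__ == "__main__":
    for p in (1, 10, 100, 1000):
        print(p, poisson_mean_by_sum(p, 3.0, 1.5, 2.0),
              poisson_mean_closed_form(p, 3.0, 1.5, 2.0))
```

The printed means increase with $p$ and stay below $\gamma = 3$, in line with the proposition.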

### 2.2 Exchangeability

Exchangeability of the IBP makes the corresponding posterior computation algorithms tractable. The CIBP is also exchangeable, as shown in the following corollary. This is a direct consequence of Proposition 3.1 and Proposition 3.2, which are presented in the next section.

###### Corollary 2.2.

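Assume that a $p \times \infty$-dimensional binary matrix $\Xi$ follows $\mathrm{CIBP}(\gamma,\alpha,\kappa)$. Then the random vectors $\xi_{1\bullet},\dots,\xi_{p\bullet}$ are exchangeable, where $\xi_{j\bullet}$ denotes the $j$-th row of the matrix $\Xi$.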

### 2.3 Connection to the two-parameter IBP

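The restaurant analogy of the two-parameter IBP with parameters $\omega$ and $\kappa$ is as follows: The first customer tries $\mathrm{Poisson}(\omega\kappa/(1+\kappa))$ dishes. The $j$-th customer for $j = 2,\dots,p$ tries each previously tasted dish independently according to $\mathrm{Bernoulli}(m_{j,k}/(j+\kappa))$ and tries $\mathrm{Poisson}(\omega\kappa/(j+\kappa))$ new dishes. We denote by $\mathrm{IBP}(\omega,\kappa)$ the distribution induced by the above restaurant analogy.

By comparing the restaurant analogies of $\mathrm{CIBP}(\gamma,\alpha,\kappa)$ and $\mathrm{IBP}(\omega,\kappa)$, we then have the following proposition that connects these two stochastic processes.

###### Proposition 2.3.

For two $p \times \infty$-dimensional binary matrices $\Xi \sim \mathrm{CIBP}(\gamma,\alpha,\kappa)$ and $\Xi' \sim \mathrm{IBP}(\omega,\kappa)$, $\Xi$ converges to $\Xi'$ in distribution as $\alpha \to 0$ and $\gamma\alpha \to \omega\kappa$.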

###### Proof.

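It suffices to show that the means of the Bernoulli distribution in (1.1) and the Poisson distribution in (1.2) converge to the corresponding quantities for $\mathrm{IBP}(\omega,\kappa)$, which can be derived as:

$$\frac{m_{j,k}+\alpha}{j+\kappa+\alpha} \to \frac{m_{j,k}}{j+\kappa}$$

as $\alpha \to 0$ and

$$\gamma\frac{B(\alpha+1,\kappa+j)}{B(\alpha,\kappa+1)} = \gamma\frac{\Gamma(\alpha+1)\Gamma(\kappa+j)}{\Gamma(\alpha+\kappa+j+1)}\cdot\frac{\Gamma(\alpha+\kappa+1)}{\Gamma(\alpha)\Gamma(\kappa+1)} = \frac{\gamma\alpha}{\alpha+\kappa+j}\prod_{h=1}^{j-1}\frac{\kappa+h}{\alpha+\kappa+h} \to \frac{\omega\kappa}{j+\kappa},$$

as $\alpha \to 0$ and $\gamma\alpha \to \omega\kappa$. ∎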

We visualize the result of the above proposition. Figure 1 shows four binary matrices generated by the two-parameter IBP and the CIBP under different parameter settings. We can see that the IBP tends to generate more features than the CIBP.

## 3 Alternative representations

In this section, we provide two alternative representations of the CIBP. The first one is based on a hierarchical distribution of Poisson, Beta and Bernoulli distributions and the second one is based on random measures. The proofs of all the results in this section are deferred to Section 5.

### 3.1 Hierarchical representation

In this section we show that the CIBP is equivalent to the following hierarchical distribution.

###### Definition 2 (Hierarchical representation of the CIBP).

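Let $\gamma > 0$, $\alpha > 0$ and $\kappa > 0$. We call the probability distribution given below the hierarchical representation of $\mathrm{CIBP}(\gamma,\alpha,\kappa)$:

$$\begin{aligned} K &\sim \mathrm{Poisson}(\gamma), \\ \theta_k &\overset{\mathrm{iid}}{\sim} \mathrm{Beta}(\alpha,\kappa+1), \quad k \in [K], \\ \xi_{jk} \mid \theta_k &\overset{\mathrm{ind}}{\sim} \mathrm{Bernoulli}(\theta_k), \quad j \in [p],\ k \in [K]. \end{aligned} \tag{3.1}$$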
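The hierarchical representation translates into a three-step sampler. The following is a minimal sketch under the assumption that plain standard-library Python is acceptable; `sample_poisson`, `sample_cibp_hierarchical` and `count_nonzero_columns` are hypothetical names, and the Poisson draw uses Knuth's method rather than anything prescribed by the paper.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_cibp_hierarchical(p, gamma, alpha, kappa, rng):
    """Draw a p x K binary matrix from the hierarchy (3.1)."""
    K = sample_poisson(gamma, rng)                                   # K ~ Poisson(gamma)
    thetas = [rng.betavariate(alpha, kappa + 1) for _ in range(K)]   # theta_k ~ Beta
    # xi_jk | theta_k ~ Bernoulli(theta_k), independently over rows and columns.
    return [[int(rng.random() < t) for t in thetas] for _ in range(p)]


def count_nonzero_columns(xi):
    """Number of columns with at least one nonzero entry, i.e. K^+."""
    if not xi:
        return 0
    return sum(any(row[k] for row in xi) for k in range(len(xi[0])))
```

Unlike the restaurant analogy, this sampler may produce all-zero columns; only the nonzero ones count as features. With $p=5$, $\gamma=3$, $\alpha=1$ and $\kappa=2$, the average of `count_nonzero_columns` over many draws should be close to $\gamma(1-\bar{B}^{0,5}_{1,3}) = 3(1-3/8) = 1.875$.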

To state the result rigorously, we need the concept of lof-equivalence classes. Under the latent feature model, the ordering of the features does not affect the likelihood of the data. Hence, we say that two $p \times K$-dimensional binary matrices are equivalent if they are identical up to a permutation of columns. It is convenient to choose a representative of every equivalence class by the left-ordering procedure. The left-ordering procedure maps each $p \times K$-dimensional binary matrix $\Xi$ to its left-ordered version, whose columns are ordered by the score $s_k$, which is defined by

$$s_k := \sum_{j=1}^{p}\xi_{jk}2^{p-j},$$

i.e., the columns are ordered so that $s_1 \ge s_2 \ge \cdots \ge s_K$. We call the equivalence class defined by the left-ordering procedure the lof-equivalence class, and we denote the lof-equivalence class of a binary matrix $\Xi$ by $[\Xi]$.
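The left-ordering procedure amounts to reading each column as a binary number, with the first row as the most significant bit, and sorting the columns in decreasing order. A small sketch with hypothetical helper names `column_scores` and `left_order`:

```python
def column_scores(xi):
    """Scores s_k = sum_j xi[j][k] * 2^(p - j), with rows indexed j = 1, ..., p."""
    p = len(xi)
    K = len(xi[0]) if xi else 0
    # Row index j is 0-based here, so the exponent p - j becomes p - 1 - j.
    return [sum(xi[j][k] << (p - 1 - j) for j in range(p)) for k in range(K)]


def left_order(xi):
    """Return the left-ordered version of xi (columns sorted by decreasing score)."""
    scores = column_scores(xi)
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    return [[row[k] for k in order] for row in xi]


if __name__ == "__main__":
    xi = [[0, 1, 1],
          [1, 0, 1]]
    print(column_scores(xi))  # [1, 2, 3]
    print(left_order(xi))     # [[1, 1, 0], [1, 0, 1]]
```

Two matrices that differ only by a column permutation are mapped to the same left-ordered matrix, which is why it can serve as a representative of the lof-equivalence class.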

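We introduce some useful notation. Let $\Delta := \{0,1\}^p$, which is the set of $p$-dimensional binary vectors, and $\Delta_1 := \Delta \setminus \{\mathbf{0}\}$, where $\mathbf{0}$ is the vector of zeros. For each $u \in \Delta$, we define

$$K_u := \sum_{k=1}^{\infty}\mathbb{1}(\xi_{\bullet k} = u), \tag{3.2}$$

where $\xi_{\bullet k}$ denotes the $k$-th column of $\Xi$. In words, $K_u$ is the number of columns equal to the binary vector $u$. Note that $K^+ = \sum_{u \in \Delta_1} K_u$. Moreover, let

$$m_k := \sum_{j=1}^{p}\xi_{jk}$$

be the number of rows that have the $k$-th feature.

In the next proposition, we provide the explicit form of the probability mass function of the lof-equivalence class $[\Xi]$.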

###### Proposition 3.1.

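If a $p \times K$-dimensional random binary matrix $\Xi$ follows the distribution in (3.1), then

$$P([\Xi]) = \frac{\gamma^{K^+}}{\prod_{u \in \Delta_1}K_u!}\exp\left(-\gamma\sum_{j=1}^{p}\bar{B}^{1,j-1}_{\alpha,\kappa+1}\right)\left[\prod_{k=1}^{K^+}\bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1}\right]. \tag{3.3}$$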
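The probability mass function (3.3) is straightforward to evaluate on the log scale. The sketch below is an illustration with hypothetical names (`log_bar_beta`, `log_pmf_lof`); it takes any binary matrix, ignores its all-zero columns, and returns $\log P([\Xi])$.

```python
import math
from collections import Counter


def log_bar_beta(a1, b1, a2, b2):
    """log of B(a1 + a2, b1 + b2) / B(a1, b1), the beta-function ratio."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(a1 + a2, b1 + b2) - log_beta(a1, b1)


def log_pmf_lof(xi, gamma, alpha, kappa):
    """Evaluate log P([xi]) from (3.3) for a p-row binary matrix xi."""
    p = len(xi)
    K = len(xi[0]) if xi else 0
    cols = [tuple(row[k] for row in xi) for k in range(K)]
    nonzero = [c for c in cols if any(c)]
    log_prob = len(nonzero) * math.log(gamma)                    # gamma^{K^+}
    # Divide by prod_u K_u! over the distinct nonzero column patterns u.
    log_prob -= sum(math.lgamma(n + 1) for n in Counter(nonzero).values())
    # Exponential term: -gamma * sum_j bar-B^{1, j-1}_{alpha, kappa+1}.
    log_prob -= gamma * sum(
        math.exp(log_bar_beta(alpha, kappa + 1, 1, j - 1)) for j in range(1, p + 1)
    )
    # Product over nonzero columns of bar-B^{m_k, p - m_k}_{alpha, kappa+1}.
    log_prob += sum(
        log_bar_beta(alpha, kappa + 1, sum(c), p - sum(c)) for c in nonzero
    )
    return log_prob
```

For $p = 1$ the formula reduces to a Poisson probability with mean $\gamma\alpha/(\alpha+\kappa+1)$, which gives a quick sanity check: the values `math.exp(log_pmf_lof([[1] * n], gamma, alpha, kappa))` over $n = 0, 1, 2, \dots$ should sum to one.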

From Proposition 3.1, we can show that the restaurant analogy and the hierarchical representation of the CIBP are equivalent.

###### Proposition 3.2.

Suppose that a $p \times K$-dimensional binary matrix $\Xi$ follows the hierarchical distribution presented in (3.1). Then the lof-equivalence class $[\Xi]$ follows $\mathrm{CIBP}(\gamma,\alpha,\kappa)$.

### 3.2 Random measure representation

In this section, we provide another representation of the CIBP, which is based on random measures.

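We first briefly review completely random measures. Let $\Omega$ be a Polish space with its Borel $\sigma$-field $\mathcal{B}$ and let $\mathcal{M}$ be the set of all measures on $\Omega$ with its Borel $\sigma$-field. A completely random measure (CRM) on $\Omega$ is a random measure $\mu$ such that for all disjoint measurable sets $A_1,\dots,A_n$, the random variables $\mu(A_1),\dots,\mu(A_n)$ are mutually independent. Every CRM can be decomposed into three independent parts:

$$\mu = \mu_0 + \sum_{k=1}^{K}q_k\delta_{\omega_k} + \sum_{(q,\omega) \in \Phi}q\,\delta_{\omega},$$

where $\mu_0$ is a non-random measure, $\omega_1,\dots,\omega_K$ are fixed atoms in $\Omega$, $q_1,\dots,q_K$ are independent random variables on $\mathbb{R}_+$ and $\Phi$ is a Poisson process on $\mathbb{R}_+ \times \Omega$. Here we only consider purely-atomic CRMs such that $\mu_0 = 0$. We write

$$\mu \sim \mathrm{CRM}\left(\Lambda, (\omega_k, P_k)_{k \in [K]}\right)$$

if $\mu$ is the purely-atomic CRM represented by $\mu = \sum_{k=1}^{K}q_k\delta_{\omega_k} + \sum_{(q,\omega) \in \Phi}q\,\delta_{\omega}$ with $q_k \sim P_k$ for $k \in [K]$ and $\Phi$ a Poisson process with intensity measure $\Lambda$, for some probability measures $P_k$ on $\mathbb{R}_+$ and a measure $\Lambda$ on $\mathbb{R}_+ \times \Omega$. In particular, we write $\mu \sim \mathrm{CRM}(\Lambda)$ if $\mu \sim \mathrm{CRM}(\Lambda, (\omega_k, P_k)_{k \in [K]})$ with $K = 0$.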

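It is well known that the two-parameter IBP has the following random measure representation:

$$\begin{aligned} \xi_{j\bullet} \mid \mu &\overset{\mathrm{iid}}{\sim} \mathrm{BeP}(\mu), \quad j \in [p], \\ \mu &\sim \mathrm{BP}(\kappa+1, \alpha\Lambda_0) \end{aligned}$$

for some smooth probability measure $\Lambda_0$, i.e., $\Lambda_0(\{\omega\}) = 0$ for every $\omega \in \Omega$. Here, $\mathrm{BeP}(\mu)$ denotes the Bernoulli process with mean $\mu$, which is equivalent to $\mathrm{CRM}(\Lambda_{\mathrm{BeP}(\mu)})$ with

$$\Lambda_{\mathrm{BeP}(\mu)}(dq, d\omega) = \delta_1(dq)\,\mu(d\omega),$$

where $\delta_1$ denotes a point mass at 1, and $\mathrm{BP}(\kappa+1, \alpha\Lambda_0)$ denotes the Beta process, which is equivalent to $\mathrm{CRM}(\Lambda_{\mathrm{BP}(\kappa+1,\alpha\Lambda_0)})$ on $[0,1] \times \Omega$ with

$$\Lambda_{\mathrm{BP}(\kappa+1,\alpha\Lambda_0)}(dq, d\omega) = \alpha(\kappa+1)q^{-1}(1-q)^{\kappa}\,dq\,\Lambda_0(d\omega).$$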

We introduce another stochastic process represented by a random measure, which will be shown to be related to the CIBP.

###### Definition 3 (Random measure representation of the CIBP).

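Let $\gamma > 0$, $\alpha > 0$ and $\kappa > 0$. We call the stochastic process given below the random measure representation of $\mathrm{CIBP}(\gamma,\alpha,\kappa)$:

$$\begin{aligned} \xi_{j\bullet} \mid \mu &\overset{\mathrm{iid}}{\sim} \mathrm{BeP}(\mu), \quad j \in [p], \\ \mu &\sim \mathrm{CRM}\left(\Lambda_{\mathrm{CIBP}(\gamma,\alpha,\kappa)}\right) \end{aligned} \tag{3.4}$$

with

$$\Lambda_{\mathrm{CIBP}(\gamma,\alpha,\kappa)}(dq, d\omega) = \frac{\gamma}{B(\alpha,\kappa+1)}q^{\alpha-1}(1-q)^{\kappa}\,dq\,\Lambda_0(d\omega) \tag{3.5}$$

for some smooth probability measure $\Lambda_0$.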

The next proposition shows that the hierarchical representation in Definition 2 and the random measure representation in Definition 3 of the CIBP are equivalent.

###### Proposition 3.3.

Let $\xi_{1\bullet},\dots,\xi_{p\bullet}$ be random measures following the distribution given in (3.4). Then the joint distribution of $(\xi_{1\bullet},\dots,\xi_{p\bullet})$ is given by

 (3.6)

where there are atoms such that for , and denotes the density of .

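The function $q \mapsto q^{\alpha-1}(1-q)^{\kappa}$ is integrable on $[0,1]$, with integral $B(\alpha,\kappa+1)$, so the measure in (3.5) has finite total mass $\gamma$. This means that there is a finite number of features.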

## 4 Application to Bayesian sparse factor models

In this section, we consider an application of the CIBP prior distribution to Bayesian estimation of the factor model.

### 4.1 Model and prior

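We consider the following factor model where a $p$-dimensional random vector $Y$ is distributed as

$$Y \mid Z = z \sim \mathrm{N}_p(Bz, \sigma^2 I), \quad Z \sim \mathrm{N}_K(0, I), \tag{4.1}$$

where $B$ is a $p \times K$ loading matrix and $\sigma^2 > 0$ is a noise variance.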

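We consider the following prior on the loading matrix $B$. Let $\beta_{jk}$ be the $(j,k)$-th entry of the $p \times K$-dimensional loading matrix $B$. We impose the prior distribution based on the CIBP distribution such that

$$\begin{aligned} \beta_{jk} \mid \xi_{jk} &\overset{\mathrm{ind}}{\sim} (1-\xi_{jk})\delta_0 + \xi_{jk}\mathrm{N}(0,\tau), \quad j \in [p],\ k \in [K], \\ \xi_{jk} \mid \theta_k &\overset{\mathrm{ind}}{\sim} \mathrm{Bernoulli}(\theta_k), \quad j \in [p],\ k \in [K], \\ \theta_k &\overset{\mathrm{iid}}{\sim} \mathrm{Beta}(\alpha,\kappa+1), \quad k \in [K], \\ K &\sim \mathrm{Poisson}(\gamma), \end{aligned}$$

where $\tau > 0$ is the slab variance and $\gamma$, $\alpha$, $\kappa$ are the CIBP parameters. That is, we impose $\mathrm{CIBP}(\gamma,\alpha,\kappa)$ on the binary matrix $\Xi = (\xi_{jk})$. We refer to the above distribution on $B$ as the spike and slab CIBP.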

### 4.2 Posterior computation

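We provide a Markov chain Monte Carlo (MCMC) algorithm for sampling from the posterior distribution under the spike and slab CIBP prior on $B$ and an inverse Gamma prior $\mathrm{IG}(a,b)$ on $\sigma^2$. Here $Y_{ij}$ denotes the $j$-th coordinate of the $i$-th observation and $Z_{ik}$ the $k$-th coordinate of its latent factor, for $i \in [n]$. Let $K^+$ be the number of nonzero columns of the loading matrix $B$. The MCMC algorithm is as follows:

#### Sample $\beta_{jk}$ for $j \in [p]$ and $k \in [K^+]$.

$$\beta_{jk} \mid - \sim \begin{cases} \mathrm{N}(\hat{\beta}_{jk}, \hat{\tau}_k) & \text{if } \xi_{jk} = 1, \\ \delta_0 & \text{if } \xi_{jk} = 0, \end{cases}$$

where

$$\begin{aligned} \hat{\tau}_k &:= \left(\sigma^{-2}\sum_{i=1}^{n}Z_{ik}^2 + \tau^{-1}\right)^{-1}, \\ \hat{\beta}_{jk} &:= \hat{\tau}_k\left\{\sigma^{-2}\sum_{i=1}^{n}Z_{ik}\left(Y_{ij} - \sum_{h \in [K^+]:\, h \neq k}Z_{ih}\beta_{jh}\right)\right\}. \end{aligned}$$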
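The conditional mean and variance of a nonzero loading can be computed as follows. This is a sketch with hypothetical names and a plain list-of-lists data layout (`Y` is $n \times p$, `Z` is $n \times K^+$, `B` is $p \times K^+$), chosen for illustration rather than efficiency.

```python
import random


def beta_conditional_moments(j, k, Y, Z, B, sigma2, tau):
    """Return (beta_hat_jk, tau_hat_k) of the Gaussian conditional of beta_jk."""
    n = len(Y)
    K = len(B[0])
    # tau_hat_k = (sigma^{-2} * sum_i Z_ik^2 + tau^{-1})^{-1}
    tau_hat = 1.0 / (sum(Z[i][k] ** 2 for i in range(n)) / sigma2 + 1.0 / tau)
    # Partial residual of Y_ij after removing all factors except the k-th.
    weighted = sum(
        Z[i][k] * (Y[i][j] - sum(Z[i][h] * B[j][h] for h in range(K) if h != k))
        for i in range(n)
    )
    beta_hat = tau_hat * weighted / sigma2
    return beta_hat, tau_hat


def sample_beta(j, k, Y, Z, B, xi, sigma2, tau, rng):
    """Draw beta_jk: Gaussian if xi_jk = 1, exactly zero otherwise."""
    if xi[j][k] == 0:
        return 0.0
    beta_hat, tau_hat = beta_conditional_moments(j, k, Y, Z, B, sigma2, tau)
    return rng.gauss(beta_hat, tau_hat ** 0.5)  # gauss takes a standard deviation
```

As a check, with a single factor, `Z = [[1.0], [1.0]]`, `Y = [[2.0], [4.0]]` and $\sigma^2 = \tau = 1$, the precision is $2 + 1 = 3$, so $\hat{\tau}_k = 1/3$ and $\hat{\beta}_{jk} = 6/3 = 2$.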

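#### Sample $\xi_{jk}$ for $j \in [p]$ and $k \in \mathbb{N}$.

When we sample $\xi_{jk}$, we use the fact that the CIBP is exchangeable to assume that the $j$-th customer is the last customer to enter the restaurant. Therefore, for each $k \in [K^+]$, $\xi_{jk}$ is sampled according to the posterior odds

$$\frac{\Pi(\xi_{jk} = 1 \mid -)}{\Pi(\xi_{jk} = 0 \mid -)} = \frac{m_{j,k}+\alpha}{\kappa+p-m_{j,k}}\sqrt{\frac{\hat{\tau}_k}{\tau}}\exp\left(\frac{1}{2\hat{\tau}_k}\hat{\beta}_{jk}^2\right),$$

where $m_{j,k} := \sum_{l \neq j}\xi_{lk}$. We then sample $\xi_{jk}$ for each of the infinitely many all-zero columns. To do this, we use Metropolis–Hastings (MH) steps as follows. We propose the number of new columns $K_j^*$ and the corresponding loadings $\beta_j^*$ from the proposal distribution

$$J(K_j^*)\,J(\beta_j^* \mid K_j^*) = \mathrm{Poisson}(1)\,\mathrm{N}(0,\tau)^{K_j^*}.$$

Then we accept the proposal with probability

$$\min\left\{1,\ |M_j|^{-n/2}\exp\left(\frac{1}{2}(\beta_j^*)^{\top}M_j^{-1}\beta_j^*\sum_{i=1}^{n}E_{ij}^2\right)\left(\gamma\bar{B}^{1,p-1}_{\alpha,\kappa+1}\right)^{K_j^*}\right\},$$

where

$$\begin{aligned} M_j &:= \sigma^{-2}\beta_j^*(\beta_j^*)^{\top} + I, \\ E_{ij} &:= \sigma^{-2}\left(Y_{ij} - \sum_{k=1}^{K^+}Z_{ik}\beta_{jk}\right). \end{aligned}$$

If the proposal is accepted, we update

$$\begin{aligned} B &\leftarrow \left(B, \left(\beta^*_{j,k}\mathbb{1}(l = j)\right)_{l \in [p],\, k \in [K_j^*]}\right), \\ K^+ &\leftarrow K^+ + K_j^*. \end{aligned}$$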

#### Sample $Z_i$ for $i \in [n]$.

The latent variable $Z_i$ is sampled from

where .

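#### Sample $\sigma^2$.

The noise variance $\sigma^2$ is sampled from

$$\sigma^2 \mid - \sim \mathrm{IG}\left(a + \frac{np}{2},\ b + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{p}\left(Y_{ij} - \sum_{k=1}^{K^+}Z_{ik}\beta_{jk}\right)^2\right).$$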
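The inverse Gamma update can be sketched with the standard library by drawing a Gamma variate and inverting it. The names `sigma2_posterior_params` and `sample_sigma2` are hypothetical, and the sketch assumes the shape-rate parametrization of $\mathrm{IG}(a,b)$, under which $1/\sigma^2$ follows a Gamma distribution with shape $a$ and rate $b$.

```python
import random


def sigma2_posterior_params(Y, Z, B, a, b):
    """Shape and rate of the inverse Gamma conditional of sigma^2."""
    n, p = len(Y), len(Y[0])
    K = len(B[0])
    # Residual sum of squares of Y_ij - sum_k Z_ik * beta_jk.
    rss = sum(
        (Y[i][j] - sum(Z[i][k] * B[j][k] for k in range(K))) ** 2
        for i in range(n)
        for j in range(p)
    )
    return a + n * p / 2.0, b + rss / 2.0


def sample_sigma2(Y, Z, B, a, b, rng):
    """Draw sigma^2 as the reciprocal of a Gamma(shape, rate) variate."""
    shape, rate = sigma2_posterior_params(Y, Z, B, a, b)
    # random.gammavariate takes a scale parameter, i.e. 1 / rate.
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)
```

For example, with `Y = [[1.0, 2.0]]`, `Z = [[1.0]]`, `B = [[1.0], [1.0]]`, $a = 2$ and $b = 1$, the residuals are $0$ and $1$, so the conditional is $\mathrm{IG}(3, 1.5)$.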

### 4.3 Simulation

We conduct a simulation study to compare the CIBP and the two-parameter IBP when they are used as prior distributions for the sparse factor model.

We generate simulated data sets as follows. For each value , we generate a -dimensional loading matrix with the number of nonzero rows

. The loadings in the sampled nonzero rows are generated from the uniform distribution on

. Then we sample

random vectors from the multivariate normal distribution with mean

and variance independently. We repeat this generating procedure 100 times.

For each synthetic data set, we compute the posterior distribution under the CIBP and IBP priors, respectively. For the CIBP prior, we set , and . For the IBP prior, we set which is equal to . In Figure 2, we present the posterior mean of the number of factors under the CIBP and IBP priors, respectively, for each value of $p$ over 100 replications. As the dimension increases, the IBP prior overestimates the number of factors more severely, whereas the CIBP prior provides accurate estimates of the number of factors for all the values of $p$.

## 5 Proofs for Section 3

### 5.1 Proof of Proposition 3.1

###### Proof.

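Recall that $m_k = \sum_{j=1}^{p}\xi_{jk}$. If $K \ge K^+$, we have that

$$\begin{aligned} P(\Xi \mid K) &= \prod_{k=1}^{K}\frac{B(m_k+\alpha,\ p-m_k+\kappa+1)}{B(\alpha,\kappa+1)} \\ &= \left(\frac{B(\alpha,\ p+\kappa+1)}{B(\alpha,\kappa+1)}\right)^{K-K^+}\prod_{k=1}^{K^+}\frac{B(m_k+\alpha,\ p-m_k+\kappa+1)}{B(\alpha,\kappa+1)} \\ &= \left(\bar{B}^{0,p}_{\alpha,\kappa+1}\right)^{K-K^+}\prod_{k=1}^{K^+}\bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1}, \end{aligned}$$

where the second equality follows from reordering the columns such that $m_k > 0$ if $k \le K^+$ and $m_k = 0$ otherwise. Recall that $K_u$ is the number of columns equal to $u \in \Delta$, so that $K_0 = K - K^+$. Therefore, since the cardinality of the lof-equivalence class is $K!/\prod_{u \in \Delta}K_u!$, the probability of the lof-equivalence class of $\Xi$ given $K$ is given by

$$P([\Xi] \mid K) = \frac{K!}{\prod_{u \in \Delta}K_u!}\left(\bar{B}^{0,p}_{\alpha,\kappa+1}\right)^{K-K^+}\prod_{k=1}^{K^+}\bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1}.$$

If $K < K^+$, it is clear that $P([\Xi] \mid K) = 0$.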

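Let $p_K$ be the probability mass function of $K$, i.e., $p_K(k) = e^{-\gamma}\gamma^{k}/k!$ for $k \in \mathbb{N} \cup \{0\}$. Marginalizing over $K$, we have that

$$P([\Xi]) = \frac{1}{\prod_{u \in \Delta_1}K_u!}\left[\prod_{k=1}^{K^+}\bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1}\right]\sum_{K=K^+}^{\infty}\frac{K!}{K_0!}\left(\bar{B}^{0,p}_{\alpha,\kappa+1}\right)^{K-K^+}p_K(K).$$

The summation term of the preceding display can be written as

$$\begin{aligned} \sum_{K=K^+}^{\infty}\frac{K!}{K_0!}\left(\bar{B}^{0,p}_{\alpha,\kappa+1}\right)^{K-K^+}p_K(K) &= e^{-\gamma}\gamma^{K^+}\sum_{K=K^+}^{\infty}\frac{1}{(K-K^+)!}\left(\gamma\bar{B}^{0,p}_{\alpha,\kappa+1}\right)^{K-K^+} \\ &= \gamma^{K^+}e^{-\gamma\left(1-\bar{B}^{0,p}_{\alpha,\kappa+1}\right)}, \end{aligned} \tag{5.1}$$

where we use $K_0 = K - K^+$ for the first equality and the identity $\sum_{k=0}^{\infty}x^k/k! = e^x$ for the second equality. Lastly, from the identity $B(a+1,b) = B(a,b) - B(a,b+1)$, it follows that

$$\begin{aligned} 1-\bar{B}^{0,p}_{\alpha,\kappa+1} &= 1-\frac{B(\alpha,\ p+\kappa+1)}{B(\alpha,\kappa+1)} \\ &= \frac{1}{B(\alpha,\kappa+1)}\left\{B(\alpha,\kappa+1)-B(\alpha,\ p+\kappa+1)\right\} \\ &= \frac{1}{B(\alpha,\kappa+1)}\sum_{j=1}^{p}\left\{B(\alpha,\kappa+j)-B(\alpha,\kappa+j+1)\right\} \\ &= \frac{1}{B(\alpha,\kappa+1)}\sum_{j=1}^{p}B(\alpha+1,\kappa+j) \\ &= \sum_{j=1}^{p}\bar{B}^{1,j-1}_{\alpha,\kappa+1}. \end{aligned} \tag{5.2}$$

Combining (5.1) and (5.2), we get the desired result. ∎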

### 5.2 Proof of Proposition 3.2

###### Proof.

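The proof is by induction. Let $\xi_{j\bullet}$ be the $j$-th row of $\Xi$. For $p = 1$, from a Poisson likelihood, we have

$$P(\xi_{1\bullet}) = \frac{1}{K_1^+!}\left(\gamma\bar{B}^{1,0}_{\alpha,\kappa+1}\right)^{K_1^+}e^{-\gamma\bar{B}^{1,0}_{\alpha,\kappa+1}},$$

where $K_1^+$ is the number of nonzero elements of $\xi_{1\bullet}$. This is the same as (3.3) with $p = 1$ and $K^+ = K_1^+$.

For $p \ge 2$, consider the conditional distribution of $\xi_{p\bullet}$ given $\xi_{1\bullet},\dots,\xi_{p-1\bullet}$, which is given by

$$\begin{aligned} P(\xi_{p\bullet} \mid \xi_{1\bullet},\dots,\xi_{p-1\bullet}) &= e^{-\gamma\bar{B}^{1,p-1}_{\alpha,\kappa+1}}\frac{\left(\gamma\bar{B}^{1,p-1}_{\alpha,\kappa+1}\right)^{K_p^{\mathrm{new}}}}{K_p^{\mathrm{new}}!} \\ &\quad\times\prod_{k \in J_p}\frac{m_{p,k}+\alpha}{p+\kappa+\alpha}\prod_{k \notin J_p}\frac{p-m_{p,k}+\kappa}{p+\kappa+\alpha}, \end{aligned} \tag{5.3}$$

where $m_{p,k} := \sum_{j=1}^{p-1}\xi_{jk}$, $K_p^{\mathrm{new}}$ is the number of new features sampled by the $p$-th customer and $J_p$ is the set of previously tasted dishes taken by the $p$-th customer, i.e., $J_p := \{k : m_{p,k} \ge 1,\ \xi_{pk} = 1\}$. Both products run over the previously tasted dishes only. Let $K_{p-1}^+ := \sum_{j=1}^{p-1}K_j^{\mathrm{new}}$ and $K_p^+ := K_{p-1}^+ + K_p^{\mathrm{new}}$. By the inductive hypothesis, we have

$$\begin{aligned} P(\xi_{1\bullet},\dots,\xi_{p\bullet}) &= P(\xi_{p\bullet} \mid \xi_{1\bullet},\dots,\xi_{p-1\bullet})\,P(\xi_{1\bullet},\dots,\xi_{p-1\bullet}) \\ &= e^{-\gamma\sum_{j=1}^{p}\bar{B}^{1,j-1}_{\alpha,\kappa+1}}\frac{\gamma^{K_p^+}}{\prod_{j=1}^{p}K_j^{\mathrm{new}}!}\prod_{k \in J_p}\frac{m_{p,k}+\alpha}{p+\kappa+\alpha}\bar{B}^{m_{p,k},\,p-1-m_{p,k}}_{\alpha,\kappa+1} \\ &\quad\times\prod_{k \notin J_p}\frac{p-m_{p,k}+\kappa}{p+\kappa+\alpha}\bar{B}^{m_{p,k},\,p-1-m_{p,k}}_{\alpha,\kappa+1}\times\left(\bar{B}^{1,p-1}_{\alpha,\kappa+1}\right)^{K_p^{\mathrm{new}}}. \end{aligned}$$

Since $m_k = m_{p,k}+1$ for $k \in J_p$ and $m_k = m_{p,k}$ otherwise, we have

$$\begin{aligned} \frac{m_{p,k}+\alpha}{p+\kappa+\alpha}\bar{B}^{m_{p,k},\,p-1-m_{p,k}}_{\alpha,\kappa+1} &= \frac{m_{p,k}+\alpha}{p+\kappa+\alpha}\cdot\frac{B(m_{p,k}+\alpha,\ p-m_{p,k}+\kappa)}{B(\alpha,\kappa+1)} \\ &= \frac{B(m_{p,k}+1+\alpha,\ p-m_{p,k}+\kappa)}{B(\alpha,\kappa+1)} \\ &= \bar{B}^{m_{p,k}+1,\,p-1-m_{p,k}}_{\alpha,\kappa+1} \\ &= \bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1} \end{aligned}$$

and similarly,

$$\frac{p-m_{p,k}+\kappa}{p+\kappa+\alpha}\bar{B}^{m_{p,k},\,p-1-m_{p,k}}_{\alpha,\kappa+1} = \bar{B}^{m_{p,k},\,p-m_{p,k}}_{\alpha,\kappa+1} = \bar{B}^{m_k,\,p-m_k}_{\alpha,\kappa+1}.$$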

Therefore,

 P(ξ1∙,…,ξp∙) =e−γ∑pj=1¯B1,j−1α,κ+1γK+p∏pj=1Knewj!∏k∈Jp¯Bmk,p−mkα,κ+1 (5.4) ×∏k∉Jp¯Bmk,p−mkα,κ+1×\del¯B1,p−1α,κ+1Knewp =e−γ∑pj=1¯B1,j−1α,κ+1γK+p