    # Posterior Contraction Rates of the Phylogenetic Indian Buffet Processes

By expressing prior distributions as general stochastic processes, nonparametric Bayesian methods provide a flexible way to incorporate prior knowledge and constrain the latent structure in statistical inference. The Indian buffet process (IBP) is such an example that can be used to define a prior distribution on infinite binary features, where the exchangeability among subjects is assumed. The phylogenetic Indian buffet process (pIBP), a derivative of IBP, enables the modeling of non-exchangeability among subjects through a stochastic process on a rooted tree, which is similar to that used in phylogenetics, to describe relationships among the subjects. In this paper, we study the theoretical properties of IBP and pIBP under a binary factor model. We establish the posterior contraction rates for both IBP and pIBP and substantiate the theoretical results through simulation studies. This is the first work addressing the frequentist property of the posterior behaviors of IBP and pIBP. We also demonstrated its practical usefulness by applying pIBP prior to a real data example arising in the field of cancer genomics where the exchangeability among subjects is violated.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Recently nonparametric Bayesian approaches have become popular methods in machine learning and other fields to learn structural information from data. By expressing prior distributions as general stochastic processes, nonparametric Bayesian methods provide flexible ways to incorporate prior knowledge and constrain the latent structure. The Indian buffet process (IBP) is such a stochastic process that can be used to define a prior distribution where the latent structure is presented in the form of a binary matrix with a finite number of rows and an infinite number of columns

[18, 22]

. The exchangeability among subjects is assumed in IBP, i.e., the joint probability of the subjects being modeled by the prior is invariant to permutation. In certain applications, exogenous information may suggest certain groupings of the subjects, such as studies involving cancer patients with different subtypes. In these cases, treating all subjects exchangeable using IBP is not appropriate. As an alternative, the phylogenetic Indian buffet process (pIBP)

 provides a flexible framework to incorporate prior structural information among subjects for more accurate statistical inference. In pIBP, the dependency structure among subjects is captured by a stochastic process on a rooted tree similar to that used in phylogenetics. As a derivative of IBP, pIBP inherits many of the nice features of IBP including inducing sparsity and allowance of a potentially infinite number of latent factors. In addition, pIBP provides an effective approach to incorporate useful information on the relationship among subjects without losing computational tractability.

Despite many successful applications of IBP and its variants in many areas 

, as far as we know, there has not been any theoretical investigation of their posterior behaviors. Suppose there is a true data-generating process, do the posterior distributions of IBP and pIBP concentrate on the truth? In the parametric setting where the number of parameters is fixed, the posterior distribution is well behaved according to the classical Bernstein-von Mises theorem

. However, when the prior charges a diverging or an infinite number of parameters, whether the posterior distribution still possesses such convergence properties is no longer guaranteed. IBP prior and pIBP prior belong to the second situation because they are stochastic processes on infinite binary matrices. Besides the issue of posterior convergence, we are also interested in the question whether the extra information in pIBP prior would lead to better posterior behavior than that of IBP prior.

In this paper, we study the theoretical properties of IBP and pIBP under a binary factor model. Posterior contraction rates are derived for both priors under various settings. By imposing a group structure on the true binary factor matrix, pIBP is proved to have faster convergence rates than IBP whenever the group structure is well-specified by the phylogenetic tree. Even when the group structure is mis-specified by pIBP, it still has the same convergence rate as that of IBP. To the best of our knowledge, this is the first work addressing the frequentist property of the posterior behaviors of both IBP and pIBP.

We further substantiated the theoretical results through simulation studies. Our simulations show that pIBP is an attractive alternative to IBP when subjects can be related through a tree structure based on some prior information. Moreover, even when the tree structure is mis-specified in pIBP prior, the posterior behavior is still comparable to that of IBP prior, suggesting a robust property of pIBP. We further apply pIBP to analyze cancer genomics data to demonstrate its practical usefulness.

We organize the rest of the paper as follows. Section 2 introduces a binary factor model, which is the probabilistic setting of the paper. The definitions of IBP and pIBP are reviewed in Section 3. Section 4 presents our theoretical studies of the posterior contraction rates of IBP and pIBP. Simulation studies are carried out in Section 5. Sections 6 presents the analysis of a TCGA data set using pIBP. Section 7 discusses related work on factor models and an extension of our theoretical results. Proofs for theoretical results are collected in the supplementary materials.

## 2 Problem Setting

### 2.1 Notation

We denote by and by . For two positive sequences and , means there exists a , such that for all . For a matrix , denote its matrix Frobenius norm by . For a set , denote its cardinality by . The symbol

stands for the prior probability distribution associated with the mixture of IBP or pIBP defined in Section

3.4, and is the corresponding posterior distribution.

### 2.2 Binary Factor model

Let denote the observed data matrix, where each of the rows represents one individual and each of the columns represents one measurement. We hypothesize that the measurement profiles can be characterized by latent factors. We model the effects of these latent factors on through the following model:

 X=ZA+E,

where is a binary factor matrix, and is a loading matrix. The status of , which takes a value of or , indicates the presence or the absence of the th factor in the th individual. The value of weighs the contribution to the th measurement from the th factor. We assume that each entry of follows independently. Let each entry of follow independently, and is independent of . Conditioning on ,

follows a matrix normal distribution with mean

. Integrating out with respect to its distribution, each column of follows

 (x1j,…,xnj)T∼N(0,σ2AZZT+σ2XI), (1)

independently for . Formula (1) shows the covariance structure across individuals imposed by the binary factor model. From this representation, it is easy to see that the matrix

and the variance components

and uniquely determine the data generating process.

### 2.3 Feature Similarity Matrix ZZT

We name the feature similarity matrix because of its important statistical meaning as reflected in (1). An identifiability issue is that the distribution of (1) will not change if one reorder the columns of the factor matrix . Thus, is not identifiable in the model. However, the feature similarity matrix , according to (1), is identifiable. We denote each element of this matrix by . Each row/column of this matrix describes the feature similarity between a particular individual and the other individuals. Note that

 ξij=K∑k=1zikzjk=|{k:zik=zjk=1}|.

Thus, the diagonal element denotes the number of factors possessed by the th individual, and the off-diagonal entry is the number of the factors shared between the th and th individuals. In short, the feature similarity matrix characterizes the latent feature sharing structure among samples. For the th individual, we define as its degree. When we have a group structure among the samples, the individual with the highest degree has the most shared factors among a group. That particular individual is a representative prototype for that group.

## 3 Tree Structured Indian Buffet Process Prior

### 3.1 A Bayesian Framework

To pursue a full Bayesian approach, we put a prior distribution on the triple . The choice of the prior on is not essential, because for asymptotic purpose (when and are large), the prior effect on the parametric part is negligible. In contrast, the prior on the binary matrix is important. Since we do not specify the number of columns in advance, the potential number of parameters in is infinite. It is well-known that when the number of parameters diverges, Bayesian method is no longer guaranteed to be consistent . Thus, the choice of the prior on is important. According to the model representation (1), the order of the columns of is not identifiable. In other words, we cannot tell the first factor from the second. Instead of specifying a prior on , we specify a prior on the equivalent class , where denotes the collection of matrices which are equivalent by reordering the columns.

We describe two priors on in this section, the Indian buffet process proposed by , and its tree-structured generalization, the phylogenetic Indian buffet process proposed by . Both are priors on sparse infinite binary matrices.

### 3.2 Indian Buffet Process

We describe the Indian buffet process (IBP) on by its stick-breaking representation derived in . Given some , first draw independently and identically distributed. Then, is

 pk=k∏i=1vi(k=1,2,…). (2)

Given ,

is drawn independently from a Bernoulli distribution with parameter

for and . The final matrix drawn in this way has dimension , where is the number of nonzero columns. According to ,

follows a Poisson distribution with mean

. Thus, it is finite with probability . The IBP prior on is the image measure induced by the equivalence map . A larger indicates a larger in the prior modeling.

### 3.3 Phylogenetic Indian Buffet Process

The phylogenetic Indian buffet process (pIBP) also starts with drawing as in (2). Different from IBP, given , the entries of the th column of are not independent in pIBP. Their dependency structure is captured by a stochastic process on a rooted tree similar to the models used in phylogenetics . The individuals are modeled as leaves of the tree. The total edge length from the root to any leaf is . Conditioning on , we describe the generating process of the th column of . First, assign to the root of the tree. Along any path from the root to a leaf, let the value of any node change to along any edge of length with probability , where . Once the value has changed to along any path from the root, all leaves below that point are assigned value . pIBP prior is defined to be the image measure on .

### 3.4 A Hyperprior on α

Both IBP and pIBP are determined by the hyper-parameter , which can be tuned in practice. In this paper, we pursue a full Bayesian approach, and put a Gamma prior on for both IBP and pIBP. Thus, the final prior on the equivalent class is a mixture of IBP or pIBP after is integrated out.

## 4 Posterior Contraction Rates of IBP and pIBP

### 4.1 Convergence of the Feature Similarity Matrix

In this section, we establish the posterior convergence of both mixture of IBP and mixture of pIBP and characterize their difference by different convergence rates. Such theoretical comparisons are interesting because IBP can be viewed as a special case of pIBP with a default tree. These results will illustrate the impacts of tree structure imposed by the prior.

We define the triple to be the true parameter generating the data matrix , where is an binary matrix and is the number of factors. For the sake of clearer presentation, we assume , so that the only unknown parameter is . Denote the data generating process of (1) by , and let be the associated expectation (and similarly define and ). The generalization to the case where is unknown is covered in the supplementary materials. Let be the mixture of IBP or pIBP prior on . Note that the matrix does not depend on the order of columns of , and thus we have . We consider the posterior convergence in the sense of

 (3)

for some sequences and constant . When , this is called posterior contraction of feature similarity matrix with rate under the squared Frobenius loss. We choose to study the posterior contraction in terms of the feature similarity matrix because of both the identifiability issue and statistical interpretation described in Section 2.3.

### 4.2 A General Method for Discrete Priors

The theory of Bayesian posterior consistency was first studied by . She proposed a Kullback-Leibler property of the prior and a testing argument to prove weak consistency in the parametric case. The first nonparametric posterior consistency result was obtained by , where the idea of testing on the essential support of the prior is used. Later, the same argument was modified to achieve rate of contraction by . In the current setting of binary factor model, we propose the following general method to prove posterior rate of contraction for priors supported on a discrete set.

###### Theorem 4.1.

For any measurable set , and any testing function , we have

 EZ0[Π(U∣X)]≤EZ0(ϕ)+1Π(||ZZT−Z0ZT0||2F=0)supZ∈UEZ(1−ϕ). (4)

The theorem can be viewed as a discrete version of the Schwartz theorem . We take advantage of the discrete nature of the problem, thus avoiding calculating the prior mass of the Kullback-Leibler neighborhood of . We specify to be

 U={||ZZT−Z0ZT0||2F>Mϵ2n,p}.

Thus, in order to obtain (3), it is sufficient to upper bound the right hand side of (4). This can be done by lower bounding and constructing a testing function for and with appropriate type 1 and type 2 error bounds. The existence of such testing function is guaranteed by the following lemma.

###### Lemma 4.1.

For any , there is a testing function such that the testing error is upper bounded by

 exp{−Cpmin(Mϵ2n,pn2K20,√Mϵn,pnK0)+2logn}+exp(−Cp+2logn),

for some universal constant and introduced in (3).

Therefore, it is sufficient to lower bound the prior mass to obtain (3).

### 4.3 Two-Group Tree and Factor Decomposition

Before studying the prior mass lower bound of IBP and pIBP, we need to specify a non-exchangeable structure among the subjects. To demonstrate the power of pIBP to model non-exchangeability, we study a special but representative tree structure, the two-group tree. Let individuals be labeled by . Without loss of generality, we assume is even. Let , where and . The tree induced by the two-group structure has one root, two group nodes and leaves. The two group nodes are connected with the root by two edges of length . Then, the th group node is connected with each member of by an edge of length , where . The parameter is the strength of the group structure imposed by the prior . When , pIBP reduces to IBP. Figure 1: An illustration of the two group tree and the factor decomposition.

Our theory covers three cases. The first case is IBP prior, with no group structure specified in the prior. The second case is the two-group pIBP prior with group structure correctly specified. The third case is the the two-group pIBP prior with group structure mis-specified. Let have columns, representing factors. Given the two-group structure by the prior , we have the following factor decomposition

 K0=K01+K02+K∗0, (5)

where is the number of factors unique to , is the number of factors unique to , and is the number of factors shared across and . Decomposition (5) is determined by both the structure of and the prior . It characterizes how well the group structure is specified compared with the true (see Figure 6). Generally speaking, the smaller is, the better the group structure is specified by .

### 4.4 Prior Mass

Under the two-group structure defined above, we obtain the following prior mass lower bound.

###### Theorem 4.2.

For any constant , there exists some constant such that the prior mass can be lower bounded by

 exp(−Cn((K∗0+κ)2+1)−CnK0−K∗0(4/3)K∗0+κ−C(K0+κ)(K0−K∗0+1))

for any .

Theorem 4.2 provides an explicit characterization of the prior mass lower bound as a function of . For a larger , the prior mass will be at a smaller order due to an increased level of misspecification. The prior mass lower bound directly determines the posterior contraction rate according to Theorem 4.1 and Lemma 4.1. In the following, we consider and , separately.

When , pIBP and IBP are equivalent. The prior does not impose any group structure. Thus, in the decomposition (5), we have . By letting , Theorem 4.2 can be written as

 (6)

The prior mass lower bound for IBP in (6) is the benchmark for us to compare IBP with pIBP in various situations.

When , the tree structure plays a role in the prior. In practice, is often used to characterize moderate group structure belief in the prior . We say the group structure is effectively specified if for some . In this case, the result of Theorem 4.2 can be optimized for for any . That is, for sufficiently large , we have

 Π(||ZZT−Z0ZT0||2F=0)≥exp(−C2nmink≥K∗0(k2∨K0(4/3)k)), (7)

which is lower bounded by

 exp(−C′2nK2(1−β)0).

This rate is superior to (6). Thus, pIBP is advantageous over IBP as long as the tree structure captures any group-specific features in the sense that .

On the other hand, the group structure is mis-specified if . In this case, we reduce to (6), so that

Thus, a mis-specified tree structure does not compromise the results, compared to a default tree structure of IBP. One may wonder whether this is due to a possibly loose bound in Theorem 4.2. By scrutinizing the proof, we found that the slack is at most at a constant level independent of . Thus, the prior mass lower bounds of pIBP with a mis-specified tree and of IBP are essentially the same.

### 4.5 Posterior Contraction Rates

Combining Theorem 4.1, Lemma 4.1 and Theorem 4.2, we can derive the posterior contraction rates in the sense of (3) for both IBP and pIBP.

###### Theorem 4.3.

For the mixture of IBP prior or pIBP prior on , let be the true factor matrix. Then, for the binary factor model, there exist and , such that

 EZ0[Π(||ZZT−Z0ZT0||2F≤MK40n3p∣∣X)]≥1−exp(−C′nK20),

as long as .

###### Theorem 4.4.

For the mixture of pIBP prior on with , let be the true factor matrix. When and for , for the binary factor model, there exist and , such that

 EZ0⎡⎣Π(||ZZT−Z0ZT0||2F≤MK4−2β0n3p∣∣X)⎤⎦≥1−exp(−C′nK2(1−β)0),

as long as .

The above two theorems establish rates of contraction for the posterior distributions of IBP and pIBP. The posterior probabilities on the neighborhood of the truth can be arbitrarily close to

in expectation under the true model for sufficiently large , and . The contraction rate is faster for larger and smaller , because more variables are helpful to identify the feature similarity of a group of individuals.

Compared with the rate of IBP in Theorem 4.3, when the tree structure is effectively specified, the upper bound of the rate of pIBP in Theorem 4.4 is faster by a factor of . Such difference is significant if the number of features is large. Moreover, Theorem 4.3 also suggests that even when the tree structure of pIBP is mis-specified, the rate of contraction is the same as that of IBP, implying the robust property of pIBP. Although our theoretical study is carried out in the simple two-group structure model, similar conclusions can also be obtained under a more complicated structural assumption using the same method.

## 5 Simulation Studies

In this section, we perform simulations to evaluate the performances of IBP and pIBP. We implemented the Markov chain Monte Carlo algorithm proposed in

 to perform posterior inference of the feature similarity matrix . In the algorithm, the sampling process on the tree structure is expressed as a graphical model, where the prior probabilities can be calculated efficiently by a sum-product algorithm. All the parameters , , and (marginal probabilities of a latent feature equaling ) are sampled as part of the overall Markov chain Monte Carlo procedure.

In the first simulation, we evaluated the performance of IBP, pIBP with a correctly specified tree structure, and pIBP with a mis-specified tree structure (mispIBP). We constructed a set of samples with a clear subgroup structure on . Specifically we simulated data with eight subgroups characterized by six latent factors as illustrated by Figure 2. Twelve models presented in Table 1 are considered. For each model, we generated an matrix with 05. For IBP, we let so that pIBP is equivalent to IBP. For pIBP, we let and a proper tree structure is given. For the mispIBP, we let

and the prior is a mis-specified tree with samples within a subtree assigned to different groups. Estimation error on

is evaluated in terms of the normalized Frobenius norm of the feature similarity matrix . We further evaluated the latent structure recovery by the number of estimated latent factors. We observed that both IBP and pIBP overestimates the number of latent factors because of the presence of many factors with only a few samples. This is similar to what was proved for Dirichlet and Pitman-Yor processes where the posterior is inconsistent for estimating the number of clusters . Therefore, we reported a truncated estimator of the number of latent factors counting only those factors shared by at least samples.

The algorithm of  is implemented for 1000 MCMC steps. We observe that it guarantees convergence in the problem sizes that are considered in this simulation.

Generally, the reported twelve models represent two scenarios: the small scenario and the large scenario. Remember in our setting, the larger the value of is, the more accurately we can recover the latent features. In the models with a small ( and ), the information from data is limited and the inference relies more heavily on the prior information. We found pIBP performs better than the other two methods in both cases. Besides, mispIBP has comparable performance with IBP, implying that pIBP is robust to mis-specified tree structure. The simulation results substantiate the conclusions we have from Theorem 4.3 and Theorem 4.4. In the models with large ( and ), there is adequate information from the data and the priors play a less important role. Inferences using different priors lead to similar results. Figure 2: The illustration of IBP, pIBP with an appropriate tree structure and pIBP with a mis-specified tree structure and the latent factor matrix Z0 used in the first simulation.

In the second simulation, we used the similarity data to construct pIBP prior. Nine models presented in Table 2 are considered. For each model, we generated an binary matrix with 4 columns sampled from a

and 5 columns with fixed structure. For IBP, no prior of the group structure is given. For pIBP, we first apply a hierarchical cluster analysis with complete linkage on the rows of

and then use its output dendrogram as the tree in the pIBP prior (see Figure 3). In our analysis, we constructed our prior based on the true knowledge of in order to investigate whether the correct structural information will improve the performance through pIBP priors. In practice, such trees need to be constructed from external sources. For mispIBP, the tree prior was constructed in the same way as pIBP but using a random permutation of on rows in the clustering. In this setting, mispIBP represents totally incorrect information. Similar as the previous simulation, we evaluated the performance by and the truncated number of estimated latent features (Table 2). When is small, pIBP outperforms IBP in all cases. When is adequately large ( in this setting), the inference is less influenced by the prior information. Figure 3: The illustration of the latent factor matrix Z0 and tree prior constructed from the hierarchical clustering analysis of Z0 in the second simulation.

## 6 Applications of pIBP in the Integrative Cancer Genomics Analysis

Cancer research has been revolutionized by recent advances in high through-put technologies. Diverse types of genomics data, e.g., DNA, RNA, and epigenetic, have been profiled for different tumor types [28, 27, 3, 33]. These data have revealed that substantial heterogeneities exist across tumor types, across individuals within the same tumor types and even within an individual tumor. However, the tumor heterogeneity at somatic level has not been explicitly explored in the integrative analysis.

Here we propose to use binary factor model to integrate somatic mutation and gene expression data based on pIBP prior. Our working hypothesis is that gene expression profiles of a cancer patient may be predicted by a set of latent factors that represent distinct molecular drivers. With this hypothesis, the more similar the somatic mutation profiles are between two cancer patients, the more similar their gene expression profiles are. Therefore, we build a pIBP prior based on somatic mutation data then specify it on the latent factors of gene expression data. Using this approach, we can investigate the gene expression data by taking into account the heterogeneities across cancer patients at somatic level.

We consider studies on a specific cancer type/subtype, which collects somatic mutations from whole exome sequencing and gene expressions either from sequencing or microarrays for each sample. Somatic mutations can either be more narrowly defined as single nucleotide changes and small insertions/deletions, or more broadly defined to include changes at the copy number level. We denote the detected somatic mutations for a group of samples by a binary matrix , with indicating the mutation status of the th gene on the th individual, as an external resource to construct the tree prior. When subclonality information is available, may be expressed as a continuous measure between 0 and 1, representing the percentage of the cells containing mutations at the th gene.

As for using a tree structure to express the relationships of individuals using the somatic mutation data, we propose to construct either logic tree or dendrogram tree. The logic tree prior is constructed as a logic tree based on the presence/absence of a set of somatic mutations. In this case, each node represents the status of a specific mutation. The dendrogram tree prior is adapted from the dendrogram tree of a hierarchical clustering on the somatic profiles . In such a tree, the non-leaf nodes have no explicit meaning but represent a local cluster of individuals. When the order of mutation acquisitions and the effects of specific mutations are unknown, the dendrogram tree provides a measure of the overall similarities between individuals.

We analyzed the TCGA BRCA Level 3 dataset generated by  (downloaded from cBio ) using the dendrogram tree construction strategy. We focused on 134 samples categorized as HER2 or Basal-like subtypes. Among these two subtypes, HER2 subtype is relatively well characterized and has effective clinical treatments. The basal-like subtype, which is also known as triple-negative breast cancers (TNBCs, lacking expression of ER, progesterone receptor (PR) and HER2), is poorly understood, with only chemotherapy as the main therapeutic option . Characterization of the basal-like subtype at the molecular level has important clinical implications. We built a tree prior from the dendrogram of a hierarchical clustering analysis with the frequent mutations in breast cancer including AKT1, CDH1, GATA3, MAP3K1, MLL3, PIK3CA, PIK3R1, PTEN, RUNX1 and TP53. For expression data, genes having top 300 MAD across samples were kept and centered. We ran 10 Markov chains. No substantial difference was observed across runs and we chose the one with largest posterior probability as the final result. Figure 4 shows the input tree prior, subtype information and the inferred latent feature matrix .

In our samples, the basal-like and HER2 samples display different and almost complementary patterns in their possession of the first two features. 74 of 81 Basal-like samples exhibit the first feature and 79 of 81 are deplete with the second feature. In contrast, 43 of 53 HER2 samples are deplete with the first feature and 31 of 53 exhibit the second feature. For the first feature, the top 10 genes with the largest loadings include MRPL9, PUF60, SCNM1, EIF2C2, BOP1, MTBP, DEDD, PHF20L1, HSF1 and HEATR1. Among these, BOP1 is involved in ribosome biogenesis and contributes to genomic stability, deregulation of which leads to altered chromosome segregation ; MTBP inhibits cancer metastasis by interacting with MDM2 ; DEDD interacts with PI3KC3 to activate autophagy and attenuate epithelial mesenchymal transition in cancer ; and HSF1 has been proposed as a predictor of survival in breast cancer . EIF2C2, PUF60 and PHF20L1 have been reported as prognostic markers in ovarian cancer [30, 38], which is consistent with the recent discovery that basal-like breast tumours with high-grade serous ovarian tumours share many molecular commonalities . These basal-like specific genes may potentially become novel therapeutic targets or prognostic markers. For the second feature, the top 10 genes with the largest loadings include STARD3, MED1, PSMD3, GRB7, ORMDL3, WIPF2, CASC3, RPL19, SNF8 and AMZ2. Among these, overexpressions of STARD3, PSMD3, GRB7, CASC3 and RPL19 have been reported in HER2-amplified breast cancer cell lines ; MED1 is required for estrogen receptor-mediated gene transcription and breast cancer cell growth 

. As revealed by principal component analysis based on gene expression (Figure

4), these genes weighing high on first two latent features have discriminating power on Basal-like and HER2 samples. Figure 4: A graph showing the dendrogram tree prior (left), the inferred latent factor matrix [Z] (middle, only first 20 columns shown) and PCA analysis of Basal-like (Red) and HER2 (Green) based on genes with top loading on latent factors (topright, with a set of 10 genes from first factor; bottomright, with a set of 20 genes from first two factors) for TCGA BRCA dataset.

Furthermore, we found that the status of the fifth and sixth features was strongly associated with disease recurrence in our samples as revealed by survival analysis (Figure 5 shows the Kaplan–Meier plot). Samples with the fifth feature have a higher probability of recurrence than those without it, with a p-value of 00068, whereas samples without the sixth feature have a higher probability of recurrence than those with it, with a p-value of 000084. Examinations of the loadings on these two features identified RMDN1, ARMC1, TMEM70, VCPIP1, TCEB1, MTDH, EBAG9, MRPL13, UBE2V2, FAM91A1 and RRS1 on the fifth feature and TRIM11, COMMD5, PYCRL, TIGD5, MRPL55, LSM1, SETDB1, CNOT7, PROSC, DEDD and HSF1 on the sixth feature. Among these, the prognosis significance of some has been discussed before, for example, MTDH activation by 8q22 genomic gain promotes chemoresistance and cetastasis of poor-prognosis breast cancer ; EBAG9 (RCAS1) is associated with ductal breast cancer progression . The other genes may serve as candidate tumor progression markers.

In comparison, we analyzed the same 134 breast cancer samples with the expression profiles of 300 genes and the mutation status of 11 genes with IBP prior. The resulting latent factor matrix is less sparse than that of pIBP, which offers compromised interpretability (See Supplementary Figure 1). Moreover, the above reported features were not recovered by IBP prior, suggesting the integration of somatic mutations might lead to better understanding of gene expression. Figure 5: A Kaplan–Meier plot for groups with different status of the fifth and sixth feature inferred from TCGA BRCA dataset.

## 7 Discussion

### 7.1 Related Work on Factor Models

This paper attempts to provide a theoretical foundation for the widely used IBP and pIBP priors. We illustrate the performance of the priors through a simple binary factor model. To the best of our knowledge, there are only a few literatures on posterior rates of contraction for factor models and its alternative form principal component analysis (PCA).  is the first work to consider posterior contraction rates for sparse factor models.  derives rate-optimal posterior contraction for sparse PCA. Both results achieve the frequentist minimax rates (up to a logarithmic factor for the first work). Frequentist estimation in factor models include ,  and .

Minimax rates for factor models usually appear in the literature in the form of principal component analysis. For example, minimax rates for sparse PCA are derived by , ,  and  under various settings.

For binary factor models, minimax rates are not available in the literature, and it cannot be easily derived from the existing results. In the current binary factor model setting, there are two main points that deviate from the settings considered in the literature. First, the largest eigenvalue of the matrix

may diverge as in the extreme case, while most minimax rates in the literature for covariance estimation assume bounded spectrum. Second, the binary factor model only takes value in , which distinguishes itself from ordinary factor models. The results in this paper suggest at least two open problems. First, what is the minimax rate of the binary factor model? Second, is IBP or pIBP rate-optimal? If not, what is the best rate of contraction that can be achieved by the posterior distribution?

### 7.2 Approximate Group Structure

Theorem 4.4 states the posterior contraction rates of pIBP under the model of a two-group structure through the factor decomposition (5). Such characterization of group structure is exact in the sense that even only one person in possesses a factor that is mostly possessed by people in

, that factor is classified as a common factor, contributing to the total

. Therefore, in many real cases the exact two-group structure is violated and we can easily get , thus losing the advantage of using pIBP.

In this section, we present a result to demonstrate that pIBP still gains advantage over IBP even when but the two-group structure approximately holds. We say has an approximate two-group structure if there exists a binary matrix of the same size such that the number associated with is bounded by and is small. In other words, may have a large , but it is close to a binary factor matrix whose is small. The following theorem is an oracle inequality for pIBP under the posterior distribution.

###### Theorem 7.1.

Let be an arbitrary binary factor matrix, and let be a binary factor matrix with a well specified group structure such that its for . Under the assumption of Theorem 4.4,

 EZ0⎡⎣Π(∥∥ZZT−Z0(Z0)T∥∥2F≤M⎛⎝n3K4−2β0p+n2K20||Z0ZT0−Z∗(Z∗)T||4⎞⎠∣∣X)⎤⎦
 ≥1−exp(−C′nK2(1−β)0)−2p,

for some constants .

In the case when has an exact two-group structure, we may choose so that . Then it reduces to the result in Theorem 4.4. Otherwise, we may choose a with an exact two-group structure to approximate . In this case, the posterior distribution contracts to the truth with a rate consisting of two parts. The first part can be viewed as the estimation error of a binary factor matrix with an exact two-group structure. The second part is the approximation error for the true binary factor matrix by . Note that the rate of convergence for IBP in Theorem 4.3 is . Therefore, as long as

 n2K20∥∥Z0ZT0−Z∗(Z∗)T∥∥4F=o(K40n3p),

pIBP still converges faster than IBP if the true binary factor matrix has an approximate two-group structure.

Let us consider the following example to illustrate Theorem 7.1. Let be a binary factor matrix which generates the data. Among the factors that possess approximate group structures, there are factors belonging to and factors belonging to . In addition, for some small , people in can possess a constant number of factors belonging to , and people in can possess a constant number of factors belonging to . We call this situation a -approximate two-group structure. By zeroing out these entries, we obtain a binary factor matrix with an exact two-group structure, whose factor decomposition is . In other words, for , there are factors exclusively belonging to and factors exclusively belonging to . The approximation error is bounded by , where

denotes the spectral norm of a matrix, which is its largest singular value. We summarize this example in the following corollary.

###### Corollary 7.1.

Under the setting of Theorem 7.1, let have a factor decomposition satisfying , then as long as , we have

 EZ0[Π(∥∥ZZT−Z0(Z0)T∥∥2F≤ϵ2n,p∣∣X)]≥1−exp(−C′nK2(1−β)0)−2p,

for some positive sequence and some constant .

The corollary provides an example that pIBP converges at a faster rate than that of IBP when satisfies the -approximate two-group structure. The quantity quantifies the sparsity of the binary factor matrix . In many applied situations, the true binary factor matrix has a sparse structure [19, 22, 7]. This leads to a small .

## References

• Arriola et al.  Edurne Arriola, Caterina Marchio, David SP Tan, Suzanne C Drury, Maryou B Lambros, Rachael Natrajan, Socorro Maria Rodriguez-Pinilla, Alan Mackay, Narinder Tamber, Kerry Fenwick, et al. Genomic analysis of the her2/top2a amplicon in breast cancer and breast cancer cell lines. Laboratory investigation, 88(5):491–503, 2008.
• Barron et al.  Andrew Barron, Mark J Schervish, and Larry Wasserman. The consistency of posterior distributions in nonparametric problems. The Annals of Statistics, 27(2):536–561, 1999.
• Bell et al.  D Bell, A Berchuck, M Birrer, J Chien, DW Cramer, F Dao, R Dhir, P DiSaia, H Gabra, P Glenn, et al. Integrated genomic analyses of ovarian carcinoma. Nature, 474:609–615, 2011.
• Birnbaum et al.  Aharon Birnbaum, Iain M Johnstone, Boaz Nadler, and Debashis Paul.

Minimax bounds for sparse pca with noisy high-dimensional data.

Annals of statistics, 41(3):1055, 2013.
• Cai et al. [2013a] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013a.
• Cai et al. [2013b] Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields, pages 1–35, 2013b.
• Carvalho et al.  Carlos M Carvalho, Jeffrey Chang, Joseph E Lucas, Joseph R Nevins, Quanli Wang, and Mike West. High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438–1456, 2008.
• Cerami et al.  Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E Gross, Selcuk Onur Sumer, Bülent Arman Aksoy, Anders Jacobsen, Caitlin J Byrne, Michael L Heuer, Erik Larsson, et al. The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery, 2(5):401–404, 2012.
• Chen et al.  Mengjie Chen, Chao Gao, and Hongyu Zhao. Posterior contraction rates of the phylogenetic indian buffet processes. 2014.
• Chène  Patrick Chène. Inhibiting the p53–mdm2 interaction: an important target for cancer therapy. Nature reviews cancer, 3(2):102–109, 2003.
• Diaconis and Freedman  Persi Diaconis and David Freedman. On the consistency of bayes estimates. The Annals of Statistics, pages 1–26, 1986.
• Fan et al.  Jianqing Fan, Yingying Fan, and Jinchi Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186–197, 2008.
• Fan et al.  Jianqing Fan, Yuan Liao, and Martina Mincheva. High dimensional covariance matrix estimation in approximate factor models. Annals of statistics, 39(6):3320, 2011.
• Fan et al.  Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603–680, 2013.
• Gao and Zhou  Chao Gao and Harrison H Zhou. Rate-optimal posterior contraction for sparse pca. Annals of Statistics, to appear, 2015.
• Ghosal and Van Der Vaart  Subhashis Ghosal and Aad Van Der Vaart. Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223, 2007.
• Ghosal et al.  Subhashis Ghosal, Jayanta K Ghosh, and Aad W van der Vaart. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531, 2000.
• Griffiths and Ghahramani  Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the indian buffet process. In In NIPS, pages 475–482. MIT Press, 2005.
• Griffiths and Ghahramani  Thomas L Griffiths and Zoubin Ghahramani. The indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.
• Hu et al.  Guohong Hu, Robert A Chong, Qifeng Yang, Yong Wei, Mario A Blanco, Feng Li, Michael Reiss, Jessie L-S Au, Bruce G Haffty, and Yibin Kang. Mtdh activation by 8q22 genomic gain promotes chemoresistance and metastasis of poor-prognosis breast cancer. Cancer cell, 15(1):9–20, 2009.
• Killian et al.  Audrey Killian, Nasrin Sarafan-Vasseur, Richard Sesboüé, Florence Le Pessot, France Blanchard, Aude Lamy, Michelle Laurent, Jean-Michel Flaman, and Thierry Frébourg. Contribution of the bop1 gene, located on 8q24, to colorectal tumorigenesis. Genes, Chromosomes and Cancer, 45(9):874–881, 2006.
• Knowles and Ghahramani  David Knowles and Zoubin Ghahramani. Nonparametric bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5(2B):1534–1552, 2011.
• Le Cam and Yang  Lucien Le Cam and Grace Lo Yang. Asymptotics in statistics: some basic concepts. Springer, 2000.
• Lv et al.  Qi Lv, Wei Wang, Jianfei Xue, Fang Hua, Rong Mu, Heng Lin, Jun Yan, Xiaoxi Lv, Xiaoguang Chen, and Zhuo-Wei Hu. Dedd interacts with pi3kc3 to activate autophagy and attenuate epithelial–mesenchymal transition in human breast cancer. Cancer research, 72(13):3238–3250, 2012.
• Miller and Harrison  Jeffrey W Miller and Matthew T Harrison. Inconsistency of pitman-yor process mixtures for the number of components. arXiv preprint arXiv:1309.0024, 2013.
• Miller et al.  Kurt T Miller, Thomas Griffiths, and Michael I Jordan. The phylogenetic indian buffet process: A non-exchangeable nonparametric prior for latent features. arXiv preprint arXiv:1206.3279, 2012.
• Muzny et al.  Donna M Muzny, Matthew N Bainbridge, Kyle Chang, Huyen H Dinh, Jennifer A Drummond, Gerald Fowler, Christie L Kovar, Lora R Lewis, Margaret B Morgan, Irene F Newsham, et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487:330–337, 2012.
• Nik-Zainal et al.  Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov, Christopher D Greenman, King Wai Lau, Keiran Raine, David Jones, John Marshall, Manasa Ramakrishna, et al. The life history of 21 breast cancers. Cell, 149(5):994–1997, 2012.
• Pati et al.  Debdeep Pati, Anirban Bhattacharya, Natesh S Pillai, and David Dunson. Posterior contraction in sparse bayesian factor models for massive covariance matrices. The Annals of Statistics, 42(3):1102–1130, 2014.
• Ramakrishna et al.  Manasa Ramakrishna, Louise H Williams, Samantha E Boyle, Jennifer L Bearfoot, Anita Sridhar, Terence P Speed, Kylie L Gorringe, and Ian G Campbell. Identification of candidate growth promoting genes in ovarian cancer through integrated copy number and expression analysis. PloS one, 5(4):e9983, 2010.
• Rousseau et al.  Joel Rousseau, Bernard Têtu, Danielle Caron, Patrick Malenfant, Paola Cattaruzzi, Marie Audette, Charles Doillon, Jacques P Tremblay, and Benoit Guérette. Rcas1 is associated with ductal breast cancer progression. Biochemical and biophysical research communications, 293(5):1544–1549, 2002.
• Schwartz  Lorraine Schwartz. On bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26, 1965.
• TCGA  TCGA. Comprehensive molecular portraits of human breast tumours. Nature, 490:61–70, 2012.
• Teh et al.  Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani. Stick-breaking construction for the indian buffet process. In

Proceedings of the International Conference on Artificial Intelligence and Statistics

, volume 11, 2007.
• Van De Vijver et al.  Marc J Van De Vijver, Yudong D He, Laura J van’t Veer, Hongyue Dai, Augustinus AM Hart, Dorien W Voskuil, George J Schreiber, Johannes L Peterse, Chris Roberts, Matthew J Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
• Vershynin  Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• Vu and Lei  Vincent Q Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013.
• Wrzeszczynski et al.  Kazimierz O Wrzeszczynski, Vinay Varadan, James Byrnes, Elena Lum, Sitharthan Kamalakaran, Douglas A Levine, Nevenka Dimitrova, Michael Q Zhang, and Robert Lucito. Identification of tumor suppressors and oncogenes from genomic and epigenetic features in ovarian cancer. PLoS One, 6(12):e28503, 2011.
• Zhang et al.  Dingxiao Zhang, Pingping Jiang, Qinqin Xu, Xiaoting Zhang, Dingxiao Zhang, Pingping Jiang, Qinqin Xu, and Xiaoting Zhang. Arginine and glutamate-rich 1 (arglu1) interacts with mediator subunit 1 (med1) and is required for estrogen receptor-mediated gene transcription and breast cancer cell growth. Journal of Biological Chemistry, 286(20):17746–17754, 2011.

## Appendix A Proofs

### a.1 Preparatory lemmas

###### Lemma A.1.

There is some constant such that for any ,

 PZ{∥∥∥1pXXT−(ZZT+I)∥∥∥F>t}≤exp{−Cpmin(t2n2||ZZT+I)||2∞,tn||ZZT+I)||∞)+2logn},

and

 PZ{∥∥∥1pXXT−(ZZT+I)∥∥∥∞>t}≤exp{−Cpmin(t2||ZZT+I)||2∞,t||ZZT+I)||∞)+2logn}.

### a.2 Proofs of Theorem 4.1 and Lemma 4.1

For notational simplicity, we write for , with the dependency on and being implicit.

###### Proof of Theorem 4.1.

The posterior distribution, according to Bayes formula, is

 Π(U|X)=∫Up(X|Z)p(X|Z0)dΠ([Z])∫p(X|Z)p(X|Z0)dΠ([Z]).

The denominator has lower bound

 ∫p(X|Z)p(X|Z0)dΠ([Z])≥∫{||ZZT−Z0ZT0||2F=0}p(X|Z)p(X|Z0)dΠ([Z])=Π(||ZZT−Z0ZT0||2F=0).

The above equality is because when . Thus, we have

 EZ0Π(U|X) ≤ EZ0ϕ+EZ0Π(U|X)(1−ϕ) ≤ EZ0ϕ+EZ0(∫Up(X|Z)p(X|Z0)dΠ([Z])(1−ϕ))Π(||ZZT−Z0ZT0||2F=0) = EZ0ϕ+∫UEZ(1−ϕ)dΠ([Z])Π(||ZZT−Z0ZT0||2F=0) ≤ EZ0ϕ+1Π(||ZZT−Z0ZT0||2F=0)supZ∈UEZ(1−ϕ),

where the equality above is due to Fubini’s Theorem. Therefore, the proof is complete. ∎

###### Proof of Lemma 4.1.

We consider the following test.

 H0:Z=Z0,H1:||ZZT−Z0ZT0||F>√Mϵ.

The alternative region has decomposition

 H1 ⊂ {||ZZT−Z0ZT0||F>√Mϵ,||ZZT+I||∞≤4(K0+1)} ∪⋃l≥1{4l(K0+1)<||ZZT+I||∞≤4(l+1)(K0+1)} = ∞⋃l=0H1l

Define the testing functions

 ϕ0=I{∥∥∥1pXXT−(Z0ZT0+I)∥∥∥F>12√Mϵ},
 ϕl=I{∥∥∥1pXXT∥∥∥∞>2l(K0+1)},\rm for each l.

Then, by Lemma A.1 and the fact that , we have

 EZ0ϕ0≤exp{−Cpmin(Mϵ2n2K20,√MϵnK0)+2logn},

and

 EZ0ϕl = PZ0{∥∥∥1pXXT∥∥∥∞>2l(K0+1)} ≤ PZ0{∥∥∥1pXXT−(Z0ZT0+I)∥∥∥∞>2l(K0+1)−∥∥Z0ZT0+I∥∥∞} ≤ PZ0{∥∥∥1pXXT−(Z0ZT0+I)∥∥∥∞>l(K0+1)} ≤ exp(−Clp+2logn),

where the second inequality above is by , and the last inequality above is by Lemma A.1. We also have for any ,

 EZ(1−ϕ0) = PZ{∥∥∥1pXXT−(Z0ZT0+I)∥∥∥F≤12√Mϵ} ≤ PZ{||ZZT−Z0ZT0||F−∥∥∥1pXXT−(ZZT+I)∥∥∥F≤12√Mϵ} ≤ PZ{∥∥∥1pXXT−(ZZT+I)∥∥∥F>12√Mϵ} ≤ exp{−Cpmin(Mϵ2n2K20,√MϵnK0)+2logn},

where the last inequality is by Lemma A.1 and the fact that for any . Taking supreme over , we get

 supZ∈H10EZ(1−ϕ0)≤exp{−Cpmin(Mϵ2n2K20,√MϵnK0)+2logn}.

For any , we have

 EZ(1−ϕl) = PZ{∥∥∥1pXXT∥∥∥∞≤2l(K0+1)} ≤ P