# Bayesian Nonparametric Models for Biomedical Data Analysis

In this dissertation, we develop nonparametric Bayesian models for biomedical data analysis. In particular, we focus on inference for tumor heterogeneity and inference for missing data. First, we present a Bayesian feature allocation model for tumor subclone reconstruction using mutation pairs. The key innovation lies in the use of short reads mapped to pairs of proximal single nucleotide variants (SNVs). In contrast, most existing methods use only marginal reads for unpaired SNVs. In the same context of using mutation pairs, in order to recover the phylogenetic relationship of subclones, we then develop a Bayesian treed feature allocation model. In contrast to commonly used feature allocation models, we allow the latent features to be dependent, using a tree structure to introduce dependence. Finally, we propose a nonparametric Bayesian approach to monotone missing data in longitudinal studies with non-ignorable missingness. In contrast to most existing methods, our method allows for incorporating information from auxiliary covariates and is able to capture complex structures among the response, missingness and auxiliary covariates. Our models are validated through simulation studies and are applied to real-world biomedical datasets.

There are no comments yet.

## Authors

• 10 publications
11/03/2017

### Bayesian Nonparametric Mixed Effects Models in Microbiome Data Analysis

Detecting associations between microbial composition and sample characte...
03/16/2018

### Phylogeny-based tumor subclone identification using a Bayesian feature allocation model

Tumor cells acquire different genetic alterations during the course of e...
08/04/2019

### Full-semiparametric-likelihood-based inference for non-ignorable missing data

During the past few decades, missing-data problems have been studied ext...
04/23/2014

### Bayesian Reconstruction of Missing Observations

We focus on an interpolation method referred to Bayesian reconstruction ...
09/01/2021

### Bayesian data combination model with Gaussian process latent variable model for mixed observed variables under NMAR missingness

In the analysis of observational data in social sciences and businesses,...
02/20/2020

### A Bayesian Feature Allocation Model for Identification of Cell Subpopulations Using Cytometry Data

A Bayesian feature allocation model (FAM) is presented for identifying c...
02/16/2019

### Sequentially additive nonignorable missing data modeling using auxiliary marginal information

We study a class of missingness mechanisms, called sequentially additive...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

### 1.1 Overview

This dissertation develops nonparametric Bayesian models, corresponding Markov chain Monte Carlo (MCMC) algorithms, and applications for biomedical data analysis. Chapters

2 and 3 are about applications to genomic data analysis, and Chapter 4 discusses applications to longitudinal missing data analysis. Nonparametric Bayesian methods provide flexible and highly adaptable approaches for statistical inference. In applications with biostatistics data such methods can often better address biological research problems than more restrictive parametric methods.

In Chapter 2

, we talk about inference on tumor heterogeneity. During tumor growth, tumor cells acquire somatic mutations that allow them to gain advantages compared to normal cells. As a result, tumor cell populations are typically heterogeneous consisting of multiple subpopulations with unique genomes, characterized by different subsets of mutations. This is known as tumor heterogeneity. The homogeneous subpopulations are known as subclones and are an important target in precision medicine. We propose a Bayesian feature allocation model to reconstruct tumor subclones using next-generation sequencing (NGS) data. The key innovation is the use of (phased) pairs of proximal single nucleotide variants (SNVs) for the subclone reconstruction. We utilize parallel tempering to achieve a better mixing Markov chain with highly multi-modal posterior distributions. We also develop trans-dimensional MCMC algorithms with transition probabilities that are based on splitting the data into training and test data sets to efficiently implement trans-dimensional MCMC sampling. Through simulation studies we show that inference under our model outperforms models using only marginal SNVs by recovering the number of subclones as well as their structures more accurately. This is the case despite significantly smaller number of phased pairs than the number of marginal SNVs. Estimating our model for four lung cancer tissue samples, we successfully infer their subclone structures. For this work, I collaborate with Peter Mueller (The University of Texas at Austin), Subhajit Sengupta (NorthShore University HealthSystem) and Yuan Ji (NorthShore University HealthSystem and The University of Chicago).

In Chapter 3, we address another important aspect of statistical inference for tumor heterogeneity, aiming to recover the phylogenetic relationship of subclones. Such inference can significantly enrich our understanding of subclone evolution and cancer development. We develop a tree-based feature allocation model which explicitly models dependence structure among subclones. That is, in contrast to commonly used feature allocation models, we allow the latent features to be dependent, using a tree structure to introduce dependence. In the application to inference for tumor heterogeneity this tree structure is interpreted as a phylogenetic tree of tumor cell subpopulations. We adapt our MCMC sampling techniques to efficiently search the tree space. We analyze a lung cancer data set and infer the underlying evolutionary process. For this work, I collaborate with Subhajit Sengupta, Peter Mueller and Yuan Ji.

In Chapter 4

, we model missing data in longitudinal studies. In longitudinal clinical studies, the research objective is often to make inference on a subject’s full data response conditional on covariates that are of primary interest; for example, to calculate the treatment effect of a test drug at the end of a study. The vector of responses for a research subject is often incomplete due to dropout. Dropout is typically non-ignorable and in such cases the joint distribution of the full data response and missingness needs to be modeled. In addition to the covariates that are of primary interest, we would often have access to some auxiliary covariates (often collected at baseline) that are not desired in the model for the primary research question. Such variables can often provide information about the missing responses and missing data mechanism. In this setting, auxiliary covariates should be incorporated in the joint model as well, and we should proceed with inference unconditional on these auxiliary covariates. As a result, we consider a joint model for the full data response, missingness and auxiliary covariates. In particular, we specify a nonparametric Bayesian model for the observed data via Gaussian process priors and Bayesian additive regression trees. These model specifications allow us to capture non-linear and non-additive effects, in contrast to existing parametric methods. We then separately specify the conditional distribution of the missing data response given the observed data response, missingness and auxiliary covariates (i.e. the extrapolation distribution) using identifying restrictions. We introduce meaningful sensitivity parameters that allow for a simple sensitivity analysis. Informative priors on those sensitivity parameters can be elicited from subject-matter experts. We use Monte Carlo integration to compute the full data estimands. Our methodology is motivated by, and applied to, data from a clinical trial on treatments for schizophrenia. For this work, I collaborate with Michael Daniels (The University of Texas at Austin).

The remainder of this chapter is organized as follows, Section 1.2

contains basics of Bayesian inference and Bayesian nonparametrics. Sections

1.3, 1.4 and 1.5 present three classes of statistical models, discuss how nonparametric Bayesian methods can be used, and demonstrate applications related to succeeding chapters. These models include latent class models, latent feature models and regression.

### 1.2 Bayesian Nonparametrics

##### Bayesian Inference

By way of introducing notation, we briefly review the setup of Bayesian inference. Bayesian inference, named for Thomas Bayes, is a particular approach to statistical inference. Let denote the observed data, denote the unobserved parameters of interest, and denote unknown but potentially observable quantities (such as a data point that is not yet observed) of interest. In the Bayesian framework, we update our belief on the unobserved parameters according to evidences in the observed data based on Bayes’ rule:

 p(θ∣y)=p(y∣θ)p(θ)p(y). (1.1)

In equation (1.1), is called the prior distribution, is called the sampling distribution (when regarded as a function of with fixed ) or the likelihood (when regarded as a function of with fixed ). The denominator is the marginal distribution of , which is calculated by . Inference on is given by the posterior distribution . We can then make inference on based on the posterior predictive distribution

 p(~y∣y)=∫p(~y∣θ)p(θ∣y)dθ.

Bayesian statistical inference is stated in terms of probability statements conditional on the observed values of . For a review of Bayesian statistics, see, for example, Gelman et al. (2014) or Hoff (2009).

##### Exchangeability

Exchangeability plays an important role in statistics. Suppose we have random variables (which can be data points or parameters) with a joint distribution . The random variables are called exchangeable if their joint distribution is invariant to permutation. Let and denote by a permutation of . (Finite) exchangeability states that

 y1,…,yNd=yσ(1),…,yσ(N)

for any , where means equal in distribution. Furthermore, an infinite sequence of random variables is called infinitely exchangeable if

 y1,y2,…d=yσ(1),yσ(2),…,

where is a finite permutation. That is, for some finite value , for all .

The importance of exchangeability is due to de Finetti’s theorem (De Finetti, 1931, Hewitt and Savage, 1955, De Finetti, 1974), which states that 111This is a rephrased simpler version from Gelman et al. (2014). The original version is a statement about probability measure. De Finetti’s original paper De Finetti (1931) is for the case of binary random variables, and Hewitt and Savage (1955) extended it to any real valued random variables. if are infinitely exchangeable random variables, their joint distribution can be expressed as a mixture of independent and identical distributions

 p(y1,…,yN)=∫(N∏i=1p(yi∣θ))p(θ)dθ. (1.2)

The theorem can be rephrased from a more general perspective 222Another simpler version from Teh (2011).. If are infinitely exchangeable, there exists a random distribution such that the sequence is composed of i.i.d. draws from it,

 p(y1,…,yN)=∫N∏i=1F(yi)dp(F). (1.3)

That is, in Equation (1.2) can be interpreted as indexing a probability measure , or can even be the probability measure itself.

##### Bayesian Nonparametrics

A model is called parametric if it only has a finite (and usually small) number of parameters, i.e. lives in a finite dimensional space. In contrast, a nonparametric model has a potentially infinite number of parameters, i.e. or

are in an infinite dimensional space. Thus, nonparametric Bayesian inference requires constructing probability distributions on an infinite dimensional parameter space. Such probability distributions are called

stochastic processes

with sample paths in the parameter space. Nonparametric Bayesian methods avoid the often restrictive assumptions of parametric models and provide flexible and highly adaptable approaches for statistical modeling. Reviews of Bayesian nonparametrics include

Hjort et al. (2010), Ghosh and Ramamoorthi (2003), Müller et al. (2015), Walker et al. (1999), Müller and Quintana (2004), Orbanz and Teh (2011), Gershman and Blei (2012) and Orbanz (2014) 333Hjort et al. (2010) includes a set of introductory and overview papers, Ghosh and Ramamoorthi (2003) focuses on posterior convergence, and Müller et al. (2015) has more discussion on data analysis problems..

Nonparametric Bayesian approaches have been widely used in many statistical inference problems, including density estimation, clustering, feature allocation, regression, classification, and graphical models. In the next sections, we give examples to show how Bayesian nonparametric approaches can be applied to address those important problems.

### 1.3 Latent Class Models

Suppose we have objects . In a latent class model (for a review, see Griffiths and Ghahramani, 2011), each object belongs to a latent class , , where is the number of possible classes, and is allowed. When , the model has an infinite number of classes and thus has an infinite number of parameters. In this case, the model is nonparametric. We are interested in how the classes are related to the objects, , and the distribution over class assignments, . For , we assume conditional independence,

 p(y∣c) =N∏n=1p(yn∣cn), p(yn∣cn=k,μ∗k) =G(⋅∣μ∗k). (1.4)

That is, for an object belonging to class , we assume it has a distribution with parameter . We then put some prior distribution on the ’s,

 μ∗1,…,μ∗Kiid∼F0. (1.5)

Next, we specify . Specifying is equivalent to defining a distribution on a random partition of the index set . The formal definition of random partition follows in Section 1.3.1. The equivalence can be seen by noticing that the unique values of correspond to a partition of , , with if , and vice versa.

#### 1.3.1 Random Partition

We briefly summarize the definition of random partitions as given in Broderick, Jordan, and Pitman (2013). See there for more details and discussion. Let denote the index set of objects. A partition of is a collection of mutually exclusive, exhaustive, nonempty subsets of called blocks. Let , where is the number of blocks. Here is allowed, in which case the index set becomes , and is also allowed.

Let be the space of all partitions of . A random partition of is a random element of . The probability is called the partition probability function of .

Exchangeability of a random partition can be defined as follows. Let be a finite permutation. That is, for some finite value , for all . Furthermore, for any block , denote the permutation applied to the block as . For any partition , denote the permutation applied to the partition as . A random partition is called exchangeable if for every permutation of . The importance of exchangeability has been stated in Section 1.2.

#### 1.3.2 Chinese Restaurant Process

The Chinese restaurant process (CRP) defines an exchangeable random partition. The CRP can be derived in multiple ways, such as from the Dirichlet process (Blackwell and MacQueen, 1973), or by taking limit of a finite mixture model (Green and Richardson, 2001, Neal, 1992, 2000). We take the latter approach, following Griffiths and Ghahramani (2011).

##### A Finite Mixture Model

We assume object belongs to class with probability ,

 p(cn=k∣πk)=πk.

Note that the distribution of always induces a distribution on a random partition of . In the above case, the assignment of an object to a class is independent of the assignments of the other objects conditional on , and the latent class model is called a mixture model. To complete the model, we put a symmetric Dirichlet distribution prior on ,

 (π1,…,πK)∼Dir(α/K,…,α/K),

and .

Integrating out the ’s, the marginal distribution of is

 p(c)=∏Kk=1Γ(mk+αK)Γ(αK)KΓ(α)Γ(N+α), (1.6)

where is the number of objects assigned to class . This distribution is exchangeable, since it only depends on the counts and does not depend on the ordering of objects.

##### Equivalence Classes

A partition includes an ordering of the blocks by increasing labels . In many applications, we are only interested in the division of objects, and the ordering of the blocks does not matter. For example, and correspond to the same division of objects, where the only difference is the choice of labels of the blocks. If the order of the blocks is not identifiable, it is helpful to define an equivalence class of assignment vectors, denoted by , with two assignment vectors and belonging to the same equivalence class if they imply the same division of objects.

We therefore focus on the equivalence classes . Let be the number of classes for which , and be the number of classes for which , so . The cardinality of is . Taking the summation over all assignment vectors that belong to the same equivalence class, and expanding (1.6), we obtain

 p([c]) =∑c∈[c]p(c) =K!K0!(αK)K+(K+∏k=1mk−1∏j=1(j+αK))Γ(α)Γ(N+α). (1.7)
##### Taking the Infinite Limit

When the number of classes , taking the limit in Equation (1.7), we get

 limK→∞p([c])=αK+(K+∏k=1(mk−1)!)Γ(α)Γ(N+α). (1.8)

See details in Griffiths and Ghahramani (2011). Note that this distribution is still exchangeable, just as in the finite case.

##### Chinese Restaurant Analogy

The Chinese restaurant analogy (Aldous, 1985) 444Aldous credits this analogy to Jim Pitman and Lester Dubins. comes from the fact that the distribution in Equation (1.8) can be described by a Chinese restaurant metaphor. Let the objects be customers in a restaurant, and the classes be tables at which they sit, . The customers enter the restaurant one by one, and each chooses a table at random. At time 1, the first customer comes in and chooses the first table to sit. At time , the -th customer comes in, chooses an occupied table with probability proportional to the number of customers sitting at that table, or the first unoccupied table with probability proportional to , . The customers and tables form a partition of , if we treat the tables as partition blocks. Denote by the event that customer sits at table , the number of customers sitting at table after time , and . Mathematically, the CRP can be written as

 cn∣c1,…,cn−1={k,w. pr. mkα+n−1, for k≤K+;K++1,w. pr. αα+n−1, (1.9)

for . The probability of a partition of given by Equation (1.9) is identical to what given in Equation (1.8).

##### Dirichlet Process

The CRP defines an exchangeable random partition of . We can further extend he model by assigning each table a value , with generated from some fixed distribution . We then assign customer a value if the customer sits at table . The predictive distribution of given the values of the first customers is

 μn∣μ1,…,μn−1={μ∗k,w. pr. mkα+n−1, for k≤K+;μ∗K++1∼F0,w. pr. αα+n−1. (1.10)

It can be shown that the random sequence is infinitely exchangeable. By Equation (1.3), there exists a random distribution such that and . Here is a prior over the random distribution , which is known as the Dirichlet process (DP) (Ferguson, 1973). We denote by a DP with concentration parameter and base distribution . Equation (1.10) is also called the Pólya urn representation (Blackwell and MacQueen, 1973) of the DP. Using the notion of DP, we can re-parameterize Equations (1.4), (1.5) and (1.9) with a hierarchical model

 yn∣μn∼G(⋅∣μn)μn∣F∼FF∣α,F0∼DP(α,F0) (1.11)

The model (1.11) is called a Dirichlet process mixture (DPM) model.

The DP is probably the most popular nonparametric Bayesian model. Discussions and extensions of the DP include Blackwell (1973), Blackwell and MacQueen (1973), Antoniak (1974), Lo et al. (1984), Sethuraman (1994), Pitman and Yor (1997), MacEachern (2000), Müller et al. (2004), Teh et al. (2006), Rodriguez et al. (2008), Lijoi and Prünster (2010), Adams et al. (2010), Teh (2011), De Blasi et al. (2015), where Sethuraman (1994) proposes the stick-breaking construction of the DP, Pitman and Yor (1997) extend the DP to the Pitman-Yor process, MacEachern (2000) proposes the dependent DP, Teh et al. (2006) develop the hierarchical DP, and Rodriguez et al. (2008) develop the nested DP. Literature about posterior inference methods for the DP includes West et al. (1994), Escobar and West (1995), MacEachern and Müller (1998), Neal (2000), Rasmussen (2000), Ishwaran and James (2001), Jain and Neal (2004), and Blei and Jordan (2006).

#### 1.3.3 Related Applications

Latent class models, in particular, the DPM model and its variations, have been extensively used in many data analysis problems. We highlight their applications to inference for tumor heterogeneity and inference for missing data because of the relevance to the following chapters.

##### Inference for Tumor Heterogeneity

Tumor cell populations are typically heterogeneous consisting of multiple homogeneous subpopulations with unique genomes. Such subpopulations are known as subclones. Our goal is reconstructing such subclones from next-generation sequencing (NGS) data (Mardis, 2008). See more details in Chapters 2 and 3. One approach to this problem is to model the observed read count data using a latent class model. This approach is taken by PyClone (Roth et al., 2014) and PhyloWGS (Jiao et al., 2014, Deshwar et al., 2015). Tumor evolution is a complex process involving many biological details, such as tumor purity, copy number variations and tumor phylogeny. Also, NGS data are subject to sequencing error and are often overdispersed. For simplicity of illustration we consider only one pure tumor tissue sample, ignore all the complexities mentioned above and also ignore the zygosity of the mutation sites. For detailed discussions see, for example, Roth et al. (2014), Jiao et al. (2014), Deshwar et al. (2015) and Chapters 2 and 3.

Consider single nucleotide variants (SNVs). Here SNVs refer to the loci of the nucleotides (base pairs) for which we record variants. Variants are defined relative to some reference genome. The SNVs are the objects in the latent class model. In an NGS experiment, DNA fragments are first produced by extracting the DNA molecules from the cells in a tumor sample. The fragments are then sequenced using short reads. The short reads are mapped to the reference genome, and counts are recorded for each locus (i.e. base pair). In the end, for each SNV locus (), denote by and the total number of reads and number of variant reads covering the locus, respectively. The total number of reads is usually treated as a fixed number. PyClone uses a DPM model for ,

 ns∣ps ∼Binom(Ns;ps), ps∣F ∼F, F∣α,F0 ∼DP(α,F0),

where the base distribution is chosen to be . Here is known as the cellular prevalence of mutation , i.e. the fraction of cancer cells harbouring a mutation. The DP prior for allows multiple mutations to share the same cellular prevalence. The critical step towards subclone reconstruction is the following. Mutations having the same cellular prevalence are thought of having occurred at the same point in the clonal phylogeny. Thus, latent classes of mutations can be used as markers of subclone populations (Roth et al., 2014). We note that this is essentially an application of latent class models to clustering. PhyloWGS, on the other hand, uses the tree-structured stick breaking process (TSSB) (Adams et al., 2010) as the prior for ,

 F∣α,γ,F0 ∼TSSB(α,γ,F0),

which allows it to infer tumor phylogeny.

One restriction of the latent class model is that each object can only belong to one class. In the tumor heterogeneity application, this restriction implies that each mutation can only occur once in the clonal phylogeny. Therefore, subclone reconstruction methods based on latent class models usually rely on the infinite site assumption (ISA) (Kimura, 1969), which can be summarized as (Roth et al., 2014)

1. Subclone populations follow a perfect phylogeny. That is, no SNV site mutates more than once in its evolutionary history;

2. Subclone populations follow a persistent phylogeny. That is, mutations do not disappear or revert.

However, ISA is not necessarily valid, in which case we should model the observed read count data using latent feature models. See Section 1.4.5.

##### Inference for Missing Data

Missing data are very common in real studies. Missingness is typically non-ignorable (Rubin, 1976, Little and Rubin, 2014), and in such cases the joint distribution of the full data response and missingness needs to be modeled. We focus on the missing outcome case. See Linero and Daniels (2017) for a general review of Bayesian nonparametric approach to missing outcome data. Let denote the outcome that was planned to be collected for subject at time , and be the missingness indicator with or accordingly as is observed or not, , . Let denote the covariates that are of primary interest to the study. In longitudinal clinical trial setting, is usually an indicator of treatment. We often treat as fixed and do not proceed with inference on it. The full data for subject are . The observed data for subject are , where . We assume . We stratify the model by and suppress the conditional on hereafter to simplify notation. The extrapolation factorization (Daniels and Hogan, 2008) factorizes

 p(y,r)=p(ymis∣yobs,r)p(yobs,r).

The observed data distribution is identified by the observed data, while the extrapolation distribution is not. Identifying the extrapolation distribution relies on untestable assumptions such as parametric models for the full data distribution or identifying restrictions. See Chapter 4 or Linero and Daniels (2017).

For current discussion, we focus on specifying . One way of specifying is specifying , and set

 p(yobs,r)=∫p(y,r)dymis.

This is known as the working model idea (Linero and Daniels, 2015, Linero, 2017, Linero and Daniels, 2017). Linero and Daniels (2017) discuss a Bayesian nonparametric framework for modeling a complex joint distribution of outcome and missingness, which sets

 p(y,r)=K∑k=1πkG(⋅∣μk). (1.12)

Equation (1.12) is another way of writing a latent class model

 Yi,Ri∣ci=k,μk ∼G(⋅∣μk), p(ci=k∣πk) =πk.

When , mixture models of the form (1.12) can approximate any joint distribution for (subject to technical constraints). We note that this is essentially an application of latent class models to density estimation.

Linero and Daniels (2017) use a model of the form (1.12) to analyze data from the Breast Cancer Prevention Trial. In this trial, or represent subject is depressed or not at time . They model

 p(y,r)=∞∑k=1πk{J∏j=1γrjkj(1−γkj)1−rj}{J∏j=1βyjkj(1−βkj)1−rj}.

Linero and Daniels (2015) discussion is another example of applying latent class models to inference for missing data. They analyze data from an acute schizophrenia clinical trial, where the outcome is a continuous variable called the positive and negative syndrome scale (PANSS) score. Missingness is monotone in this application. That means, if was unobserved then was also unobserved. Let denote the dropout time, i.e. if then was observed but was not. For monotone missingness, captures all the information about missingness. Denote by the history of outcomes through the first times. They model

 p(y,s)=∞∑k=1πkf(y∣μk,Σk)g(s∣y,ζk,γk),

with

 f(y∣μ,Σ)=N(y;μ,Σ), logit[g(S=j∣S≥j,y,ζ,γ)]=ζj+γTj¯yj.

### 1.4 Latent Feature Models

Suppose we have objects, . In a latent feature model (for a review, see Griffiths and Ghahramani, 2011), each object is represented by a vector of latent feature values , where is the number of features. Similar to the latent class model case, is allowed, in which case the model is nonparametric. Examples of latent feature models include probabilistic principle component analysis (Tipping and Bishop, 1999) and factor analysis (Roweis and Ghahramani, 1999). We focus on the case that there is no upper bound on the number of features.

We can break the vector into two components: a binary vector with if object has feature and otherwise, and a second vector indicating the value of each feature. The vector can be expressed as the elementwise product of and , i.e.

 dnk=znkvk,k=1,…,K. (1.13)

Let and denote matrices with columns and , respectively. We are interested in how the feature values are related to the data, , and the distribution over feature values, . For , we generally assume conditional independence

 p(y∣D) =N∏n=1p(yn∣dn), p(yn∣dn) ∼G(⋅∣dn). (1.14)

The distribution can be broken into two components, with being determined by and . We assume an independent prior on ,

 v1,…,vKiid∼F0.

We then focus on defining a prior on . Specifying is equivalent to defining a distribution on a random feature allocation of the index set . The formal definition of random feature allocation follows in Section 1.4.1. The equivalence can be seen by noticing that the values of correspond to a feature allocation , with if and if , and vice versa.

#### 1.4.1 Random Feature Allocation

We briefly summarize the definition of random feature allocations from Broderick, Jordan, and Pitman (2013). See there for more details and discussion. Feature allocations could be seen as a generalization of partitions which relaxes the restriction to mutually exclusive and exhaustive subsets. Consider an index set . A feature allocation of is a multiset of non-empty subsets of called features, such that no index belongs to infinitely many features. Denote by , where is the number of features. Here and are allowed. For example, a feature allocation of is .

Let be the space of all feature allocations of . A random feature allocation of is a random element of .

Let be a finite permutation. That is, for some finite value , for all . Furthermore, for any feature , denote the permutation applied to the feature as . For any feature allocation , denote the permutation applied to the feature allocation as . Let be a random feature allocation of . A random feature allocation is called exchangeable if for every permutation of .

##### Matrix Representation

Suppose objects and features are present. A feature allocation can be represented by a binary matrix, denoted by . Rows of correspond to the index set , and columns of correspond to features. Each element , , , is a binary indicator, where or indicates index belongs or does not belong to feature , i.e. or , respectively. For example, Figure 1.1(a) shows a feature allocation of with features, , , , , , , , , , , .

#### 1.4.2 Indian Buffet Process

The Indian buffet process (IBP) (Griffiths and Ghahramani, 2006, 2011)) is a popular example of an exchangeable random feature allocation. Using the matrix representation of feature allocation, the IBP defines a distribution on binary matrices (with an unbounded random number of columns).

##### A Finite Feature Model

The IBP can be defined as the limit of a finite feature model. Suppose we have objects and

features. We use a binary variable

to indicate object has feature , thus form a binary matrix . We assume that each object possesses feature with probability

, and the features are independent. Furthermore, beta distribution priors are put on

’s. That is,

 znk∣πk ∼Bernulli(πk); πk ∼Beta(α/K,1).

Integrating out the ’s, the marginal distribution of is

 p(Z)=K∏k=1αKΓ(mk+αK)Γ(N−mk+1)Γ(N+1+αK),

where is the number of objects possessing feature . This distribution is exchangeable, since it only depends on the counts and does not depend on the ordering of the objects.

##### Left-ordered Constraint and Equivalence Classes

A feature allocation indicates an ordering of the features. In many applications, the ordering of the features is not identifiable. When the labels of the features are arbitrary, it is helpful to define an equivalence class of binary matrices, denoted by . We first introduce an order constraint on binary matrices called the left-ordered constraint. For a binary matrix , its corresponding left-ordered binary matrix, denoted by , is obtained by ordering the columns of from left to right by the magnitude of the binary number expressed by that column, taking the first row as the most significant bit. For example, Figure 1.1(b) shows the corresponding left-ordered binary matrix of Figure 1.1(a). In the first row of the left-ordered matrix, the columns for which are grouped at the left. In the second row, the columns for which are grouped at the left of the sets for which . This grouping structure persists throughout the matrix.

We can then define equivalence classes with respect to the function . This function maps binary matrices to left-ordered binary matrices, as described before. The function is many-to-one: many binary matrices reduce to the same left-ordered form, and there is a unique left-ordered form for every binary matrix. Any two binary matrices and are equivalent if . In models where feature order is not identifiable, performing inference at the level of -equivalence classes is appropriate. The probability of a particular -equivalence class of binary matrices is .

The matrix left-ordered form motivates the following definition. The history of feature at object is defined to be . When is not specified, history refers to the full history of feature , . The histories of features are individuated using the decimal equivalent of the binary numbers corresponding to the column entries. For example, at object 3, features can have one of four histories: 0, corresponding to a feature with no previous assignments, 1, being a feature for which but , 2, being a feature for which but , and 3, being a feature possessed by both previous objects were assigned. The number of features possessing the history is denoted by , with being the number of features for which and being the number of features for which , so . The function thus places the columns of a matrix in ascending order of their histories.

Using the notion above, the cardinality of is . Thus,

 p([Z])=K!∏2N−1h=0Kh!⋅K∏k=1αKΓ(mk+αK)Γ(N−mk+1)Γ(N+1+αK). (1.15)
##### Taking the Infinite Limit

Taking the limit in Equation (1.15),

 limK→∞p([Z])=αK+∏2N−1h=1Kh!⋅exp{−αHN}⋅K+∏k=1(N−mk)!(mk−1)!N!, (1.16)

where is the -th harmonic number, . See Griffiths and Ghahramani (2011) for details. This distribution is still exchangeable. In practice, we usually drop all columns with all zeros, since they corresponds to the features that no object possesses, and it should not be included in the feature allocation as features are non-empty sets. It can be proved that we can obtain a matrix with finite columns with probability 1 by deleting the columns with all zeros.

##### Indian Buffet Analogy

The probability distribution defined in Equation (1.16) can be derived from a simple stochastic process, which is referred to as the IBP. Think about an Indian buffet where customers (objects) choose dishes (features). In the buffet, customers enter one after another, and each customer encounters infinitely many dishes arranged in a line. The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a number of dishes. The -th customer moves along the buffet, sampling dishes in proportion to their popularity, serving himself with probability , where is the number of previous customers who have sampled a dish. Having reached the end of all previously sampled dishes, the -th customer then tries a number of new dishes. We use a binary matrix with rows and infinitely many columns to indicate which customers chose which dishes, where if the -th customer sampled the -th dish. The matrices produced by this process are generally not in left-ordered form, and customers are not exchangeable under this distribution. However, if we only record the -equivalence classes of the matrices generated by this process, one obtains the exchangeable distribution given by Equation (1.16).

##### Beta Process

Similar to the relationship between the DP and the CRP, the de Finetti’s measure underlying the exchangeable distribution produced by the IBP is the beta process (BP) (Hjort, 1990). See full details in Thibaux and Jordan (2007).

Other discussions of the IBP and the BP include Teh et al. (2007), Teh and Gorur (2009), Doshi et al. (2009), Paisley et al. (2010), Williamson et al. (2010), Knowles and Ghahramani (2011), and Miller et al. (2012).

#### 1.4.3 Latent Categorical Feature Models

In a latent feature model of the form (1.13) and (1.14), a feature can either have the same effect on the objects possessing it, or have no effect on the objects not possessing it. This is sometimes too restrictive. One relaxation is to assume different effects of each feature on different objects, as seen in Griffiths and Ghahramani (2011). That is, in Equation (1.13), can depend on , and . We can then calibrate prior on according to the specific application.

In some applications, for example, the tumor heterogeneity application in Chapters 2 and 3, it is natural to categorize the features. To elaborate, let indicating object does not possess feature , if , or possesses category of feature , if , . A feature has the same effect on the objects possessing the same category of it, while the feature has different effects on the objects possessing different categories of it. That is, in Equation (1.13),

 dnk=g(znk)⋅vk,k=1,…,K, (1.17)

where represents those types of effects. We do not restrict . That is, a feature can have effect on the objects that do not possess it. We will see this is the case in the application in Chapters 2 and 3. We hereafter refer to this type of latent feature models (Equations (1.14) and (1.17)) as latent categorical feature models. To complete the model, we need to define a distribution on categorical valued matrices.

#### 1.4.4 Categorical Indian Buffet Process

The categorical Indian buffet process (cIBP) (Sengupta, 2013) is a categorical extension of the IBP, which defines a distribution on categorical valued matrices (with a random and unbounded number of columns). Each entry of the matrix can take values from a set of integers , where is fixed. Here indicates object does not possess feature , and , represents object possesses category of feature .

##### A Finite Feature Model

Similar to the construction of the IBP, the cIBP can be derived as a limit of finite feature allocation models. Assume that each object possesses category of feature with probability , i.e. , and the features are independent. Furthermore, beta-Dirichlet distribution (Kim et al., 2012) priors are put on ’s. That is, follows a beta distribution with parameters 1 and , i.e. . Let , . Then follows a Dirichlet distribution with parameters . In summary,

 znk∣πk ∼Categorical(πk), πk ∼Beta-Dirichlet(α/K,1,β1,…,βQ).

For simplicity we only consider a symmetric Dirichlet distribution , which is sufficient in most cases. Integrating out the ’s, the marginal distribution of is

 p(Z)=K∏k=1{Γ(Qβ)Γ(αK+1)Γ(αK)(Γ(β))Q⋅[∏Qq=1Γ(β+mkq)]Γ(N−mk⋅+1)Γ(αK+mk⋅)Γ(αK+N+1)Γ(Qβ+mk⋅)}, (1.18)

where denotes the number of objects from total objects possessing category of feature , and denotes the number of objects possessing feature . The distribution is exchangeable since it depends on the counts only.

##### Left-ordered Constraint and Equivalent Classes

Similar to what was defined for binary matrices, we can define left-ordered form and history on -nary matrices. A left-ordered -nary matrix or is obtained by ordering the columns of from left to right by the magnitude of the -nary number (i.e represented in base ) expressed by that column taking first row as the most significant bit. Figure 1.2(a) shows an example of -nary matrix, where , and Figure 1.2(b) shows the corresponding left-ordered matrix. The -equivalence class of matrix is still denoted by . The history of feature at object is defined to be the decimal equivalent of the -nary number represented by the vector , and the full history of feature refers to the decimal equivalent of the -nary number of . The number of features having history is denoted by , with being the number of features for which and being the number of features for which , and .

Using the notions above, the cardinality of is , and

 p([Z])=∑Z∈[Z]p(Z)=K!∏(Q+1)N−1h=0Kh!⋅p(Z). (1.19)
##### Taking the Infinite Limit

Taking the limit in Equations (1.19) and (1.18), we obtain

 limK→∞p([Z])=(α/Q)K+∏(Q+1)N−1h=1Kh!⋅exp{−αHN}⋅K+∏k=1{(N−mk⋅)!(mk⋅−1)!N!⋅1∏mk⋅−1j=1(j+Qβ)1βQ∏q=1Γ(β+mkq)Γ(β)}. (1.20)

Details in Sengupta (2013). This distribution is exchangeable as in the finite case. The last step is dropping the columns with all zeros.

##### Indian Buffet Analogy

The probability distribution defined in Equation (1.20) can also be derived from a stochastic process similar to the IBP, which is referred to as the cIBP. The customers are the objects and the dishes are the features. The categories of a feature can be seen as different spice levels of a dish. Start with customers and an infinite number of dishes. As the -th customer walks in, he/she chooses the -th dish with a particular spice level with probability , where denotes the number of customers who have tasted the dish with spice level , denotes the number of customers who have tasted the dish in total, and . Then the -th customer tastes new dishes with spice level determined by a draw from . Using a -nary matrix with rows and infinitely many columns to indicate which customers chose which dishes, where if the -th customer did not choose the -th dish, and if the -th customer chose the -th dish with spice level . If one only pays attention to the -equivalence classes of the matrices generated by this process, one obtains the exchangeable distribution given by Equation (1.20).

#### 1.4.5 Inference for Tumor Heterogeneity Using Feature Allocation Models

We continue the discussion of inference for tumor heterogeneity in Section 1.3.3. Recall that one approach to this problem is to model the read counts using a latent class model, where the SNVs are the objects, and the subclones are the classes. We have discussed in Section 1.3.3 that subclone reconstruction methods based on latent class models usually rely on the ISA. However, ISA is not necessarily valid because multiple tumor subclones might acquire the same mutation in convergent evolution. See more discussions in Marass et al. (2017). Such concerns inspire another approach to this problem: modeling the read counts using a latent feature model. Clomial (Zare et al., 2014), Lee et al. (2015), BayClone (Sengupta et al., 2015), Cloe (Marass et al., 2017) and PairClone (Chapters 2, 3) take this approach. We briefly summarize BayClone and Cloe here.

Still, ignoring many biological details, consider SNVs. Let and denote the total and variant read counts covering locus , respectively, . BayClone models the variant read counts using a latent (categorical) feature model

 ns∣ps∼Binom(Ns;ps), ps∣w,Z=C∑c=1wcg(zsc), p(zsc=q∣πcq)=πcq, (πc0,…,πcQ)∼Beta-Dirichlet(α/C,1,β,…,β)

where represent the subclones (i.e. features), is the cellular proportion of subclone (i.e. value of feature ), and is the genotype, with , or corresponding to a homozygous wild-type, heterozygous variant or homozygous variant at site of subclone . The mapping maps , and , which indicates the contributions of three different genotypes to the expected variant allele fraction .

Cloe, on the other hand, uses a column-dependent phylogenetic prior for , which allows it to infer tumor phylogeny.

### 1.5 Regression

Given two or more observables, regression analysis is concerned with the relationship between a dependent variable (or response), denote by , and one or more independent variables (or predictors), denote by . We usually treat as fixed quantities and treat as random variables. We focus on the case that

is continuous and is normally distributed. For more general cases (e.g. generalized linear models) see, for example,

Dobson and Barnett (2008) or Dey et al. (2000). Suppose we have observations . For observation , we assume

 yn=f(xn)+εn, (1.21)

where

 εniid∼N(0,σ2)

is normally distributed random error.

In the linear regression setting, the function

in Equation (1.21) is specified as a linear function

 f(x)=xTβ,

where is a parameter vector, and the elements of are called regression coefficients. The model is completed with a prior on

, which is usually a normal conjugate prior. Linear regression model is a traditional statistical model and has been studied extensively. For a review, see, for example,

Gelman et al. (2014) or Christensen (2011).

In many applications, the linearity assumption is too restrictive. One possible generalization using nonparametric Bayesian models is to replace the linear model with a less restrictive flexible prior on . A popular prior specification for is the Gaussian process, which we will describe in detail in Section 1.5.1. Another popular specification for is based on a function basis and a representation of as . A prior model for induces a prior model for . See Müller and Quintana (2004) for a review.

#### 1.5.1 Gaussian Process

The Gaussian process (GP) (O’Hagan, 1978, Neal, 1998, Rasmussen and Williams, 2006) defines a distribution over random functions (stochastic processes). A GP is a collection of random variables