# Separate Exchangeability as Modeling Principle in Bayesian Nonparametrics

We argue for the use of separate exchangeability as a modeling principle in Bayesian inference, especially for nonparametric Bayesian models. While in some areas, such as random graphs, separate and (closely related) joint exchangeability are widely used, and it naturally arises for example in simple mixed models, it is curiously underused for other applications. We briefly review the definition of separate exchangeability. We then discuss two specific models that implement separate exchangeability. One example is about nested random partitions for a data matrix, defining a partition of columns and nested partitions of rows, nested within column clusters. Many recently proposed models for nested partitions implement partially exchangeable models. We argue that inference under such models in some cases ignores important features of the experimental setup. The second example is about setting up separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved.


## 1 Introduction

We argue for the use of separate exchangeability as a modeling principle and unifying framework for data that involve multiple sets of experimental units. While exchangeability and partial exchangeability have proven to be powerful principles for statistical modeling (Bernardo and Smith, 2009), separate exchangeability has been curiously under-used as a modeling principle in some of the nonparametric Bayesian literature. In the context of two typical examples, we introduce two widely applicable modeling frameworks that implement inference under separate exchangeability.

In contrast to exchangeability and partial exchangeability, separate exchangeability facilitates inference that maintains the identity of experimental units in situations like the two motivating applications, where one set of experimental units is shared across subsets of the other type of experimental units or blocks. That is, if the data (or a design matrix) form a rectangular array, the distinct nature of rows and columns as different experimental units is preserved. See the two motivating examples for illustrations.

Under parametric inference separate exchangeability is often naturally preserved by just introducing additive row- or column-specific effects. In contrast, this is not true in nonparametric Bayesian models. In the present article we discuss two general strategies to define Bayesian nonparametric (BNP) models to perform flexible inference and prediction under separate exchangeability. The first example concerns inference for microbiome data, with patients and OTUs (essentially different types of microbiomes) being two different experimental units (Denti et al., 2021). We discuss how to define separately exchangeable partition structures via nested partitions similar to Lee et al. (2013). The second example is about inference for protein activation for proteins in patients, under two different conditions. We build a separately exchangeable model for random effects, using simple additive structure.

The rest of this article is organized as follows. Section 2 reviews the notion of exchangeability and how it relates to modeling in Bayesian inference. Section 2.3 discusses the borrowing of information under separate exchangeability, focusing in particular on dependent random partitions as they arise under mixture models. Section 3 describes two motivating data sets. In the context of these two applications we then proceed to introduce two specific models to implement separate exchangeability in related problems. Section 4 presents one general strategy based on a construction of nested partitions that preserves the identity of two types of experimental units. The strategy is illustrated with inference for the microbiome data set. Section 5 presents another general strategy based on a BNP regression model with an additive structure of random effects related to the two types of experimental units. Section 6 contains concluding remarks and suggested future work. Additional details, including Markov chain Monte Carlo (MCMC) based posterior inference algorithms, are presented in the supplementary materials.

## 2 Exchangeability as a Modeling Principle

To perform inference and prediction we rely on some notion of homogeneity across observations that allows us to leverage information from the sample to deduce inference about future observations. De Finetti refers to this as “analogy” (de Finetti, 1937). In Bayesian statistics, the assumptions are stated in the language of probability and learning is naturally performed via conditional probability.

### 2.1 Exchangeability and partial exchangeability.

#### Exchangeability.

A fundamental assumption that allows such generalization in Bayesian learning is exchangeability, that is, invariance of the joint law of the data with respect to permutations of the observation indices. This entails that the order of the observations is irrelevant in the learning process and one can deduce inference for future observations from the observed $x_{1:n}$. More precisely, a sequence $x_{1:n} = (x_1, \ldots, x_n)$ is judged exchangeable if

$$x_{1:n} \overset{d}{=} x_{\pi(1:n)} \tag{1}$$

for any permutation $\pi$ of $(1, \ldots, n)$. Here $\overset{d}{=}$ denotes equality in distribution. If the observables are considered a sample from an infinite exchangeable sequence, that is, if finite exchangeability holds for any sample size $n$, de Finetti’s theorem (de Finetti, 1930) states that an extendable sequence is exchangeable if and only if it can be expressed as conditionally independent and identically distributed (i.i.d.) given a random probability measure $P$,

$$x_i \mid P \overset{iid}{\sim} P, \quad i = 1, 2, \ldots, \qquad P \sim \mathcal{L}. \tag{2}$$
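As a concrete illustration of (2), the following sketch (a toy Gaussian example of our own, not from the paper) draws an exchangeable sequence by first sampling a random distribution $P$ and then sampling i.i.d. from it; in this toy model the marginal correlation between any two coordinates is $\mathrm{Var}(\mu)/(\mathrm{Var}(\mu)+1) = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exchangeable(n, rng):
    """Draw an exchangeable sequence as in (2): first draw the random
    distribution P (here a N(mu, 1) with random mean), then sample iid."""
    mu = rng.normal(0.0, 1.0)           # P ~ L: random center
    return rng.normal(mu, 1.0, size=n)  # x_i | P iid~ P

# Monte Carlo check: Corr(x_1, x_2) = Var(mu) / (Var(mu) + 1) = 0.5.
pairs = np.array([sample_exchangeable(2, rng) for _ in range(20000)])
corr = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]
```

The positive correlation induced by the shared latent $\mu$ is exactly the mixture-of-i.i.d. structure of de Finetti's theorem.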

The characterization of infinite exchangeability as a mixture of i.i.d. sequences highlights the fact that the homogeneity assumption of exchangeability in Bayesian learning is equivalent to assuming an i.i.d. sequence in the frequentist paradigm. The de Finetti measure $\mathcal{L}$ can be interpreted as the prior in the Bayes-Laplace paradigm. If $P$ is restricted to a parametric family, we can write $P$ as $P_\theta$, where $\theta$ denotes a finite dimensional random parameter, and $\mathcal{L}$ reduces to a prior probability model on $\theta$. However, if $P$ is unrestricted we have the characterization in (2) and $\mathcal{L}$ can take the form of a BNP prior on $P$ (Müller and Quintana, 2004).

Note that the unknown probability $P$ arises from an assumption on observable quantities, justifying inference on the parameters/latent quantities in terms of measurable quantities. Eliciting assumptions in terms of observable, and thus testable, events is fundamental in science even if the main inference goal is inference on latent quantities. In general, a predictive approach to statistics is becoming increasingly popular in the statistics and machine learning community. See Fortini et al. (2000, 2012) or Fortini and Petrone (2016) for interesting discussions of the predictive approach and characterization results for the prior probability measure in terms of predictive sequences under exchangeability and partial exchangeability.

#### Partial exchangeability.

In real world applications, the assumption of exchangeability is often too restrictive. To quote de Finetti (1937): “But the case of exchangeability can only be considered as a limiting case: the case in which this ‘analogy’ is, in a certain sense, absolute for all events under consideration. [...] To get from the case of exchangeability to other cases which are more general but still tractable, we must take up the case where we still encounter ‘analogies’ among the events under consideration, but without attaining the limiting case of exchangeability.” Indeed, depending on the design of the experiment it is often meaningful to generalize exchangeability to less restrictive invariance assumptions that allow us to introduce more structure into the prior, and thus into the learning mechanism.

If the data are collected in different, related populations, a simple generalization of exchangeability is partial exchangeability. Let $x = (x_{ij} : i = 1, \ldots, I_j;\ j = 1, \ldots, J)$ denote a data array, where $j$ is the label of the population from which $x_{ij}$ is collected. We say that $x$ is partially exchangeable if the joint law is invariant under different permutations of the observations within each population,

$$(x_{ij} : i = 1, \ldots, I_j;\ j = 1, \ldots, J) \overset{d}{=} (x_{\pi_j(i),j} : i = 1, \ldots, I_j;\ j = 1, \ldots, J), \tag{3}$$

for any family of permutations $\pi_1, \ldots, \pi_J$. Partial exchangeability entails that the order of the observations is irrelevant in the learning mechanism up to preserving the information of the population memberships. If partial exchangeability holds for any sample sizes, the “analogy” assumption of partial exchangeability can be characterized in terms of latent quantities (e.g. parameters) via de Finetti’s theorem (de Finetti, 1937),

$$x_{ij} \mid (P_1, \ldots, P_J) \overset{ind}{\sim} P_j, \quad j = 1, 2, \ldots;\ i = 1, 2, \ldots, \qquad (P_1, \ldots, P_J) \sim \mathcal{L}. \tag{4}$$

Therefore, partially exchangeable extendable arrays can be thought of as decomposable into different conditionally independent exchangeable populations. As in (2), the characterization in (4) does not restrict the distributions associated with the different populations to any parametric family. The $P_j$ would usually be assumed to be dependent, allowing borrowing of strength across blocks.

Note that exchangeability is a degenerate special case of partial exchangeability which corresponds to ignoring the information on the specific populations from which the data are collected, i.e., ignoring the known heterogeneity. The opposite, also degenerate, extreme case corresponds to modeling data from each population independently, i.e., ignoring similarities between populations. See Aldous (1985) or Kallenberg (2006) for detailed probabilistic accounts of different exchangeability assumptions, and Foti and Williamson (2013) and Orbanz and Roy (2014) for insightful discussions of the topic in the context of nonparametric Bayesian (BNP) models.

The way in which partial exchangeability preserves heterogeneity is perhaps most easily seen by considering marginal correlations between pairs of observations. As desired, partial exchangeability allows increased dependence between two observations arising from the same experimental condition, compared to the correlation between observations arising from different populations. For instance, under partial exchangeability (3) it is possible that

$$\mathrm{Corr}(x_{ij}, x_{i'j}) > \mathrm{Corr}(x_{ij}, x_{i'j'}), \qquad j \neq j',\ i \neq i', \tag{5}$$

while by definition of exchangeability, under (1)

$$\mathrm{Corr}(x_{ij}, x_{i'j}) = \mathrm{Corr}(x_{ij}, x_{i'j'}), \tag{6}$$

entailing that in the learning process we can borrow information across groups, but to learn about a specific population $j$ the observations recorded in that population are more informative than observations from another population $j'$. Here and in the following we assume that the $x_{ij}$ are real valued, square integrable random variables such that the stated correlations are well defined. See Appendix A for a (simple) proof of (5).

Going beyond correlations between individual observations, one can see a similar pattern for groups of observations under populations $j$ and $j'$. See Section 2.3 for more discussion.
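The correlation contrast in (5) and (6) can be checked numerically. The following toy model (our own illustration, not from the paper) realizes partial exchangeability through a shared population effect: observations within a population are correlated, while observations across populations are not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50000

# Toy partially exchangeable model: a population effect alpha_j shared
# within population j, plus iid unit-variance noise.
alpha = rng.normal(0.0, 1.0, size=(n, 2))   # effects for populations j = 1, 2
noise = rng.normal(0.0, 1.0, size=(n, 3))
x11 = alpha[:, 0] + noise[:, 0]             # x_{1,1}
x21 = alpha[:, 0] + noise[:, 1]             # x_{2,1}: same population as x_{1,1}
x12 = alpha[:, 1] + noise[:, 2]             # x_{1,2}: different population

within = np.corrcoef(x11, x21)[0, 1]        # theory: 1 / (1 + 1) = 0.5
across = np.corrcoef(x11, x12)[0, 1]        # theory: 0.0
```

The Monte Carlo estimates reproduce inequality (5): `within` exceeds `across`, because only observations in the same population share the latent $\alpha_j$.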

#### Statistical inference.

From a statistical perspective partial exchangeability is a framework that allows us to borrow information across related populations while still preserving the heterogeneity of populations. Partially exchangeable models are widely used in many applications, including meta-analysis, topic modeling and survival analysis, when the design matrix records different statistical units (e.g. patients) under related experimental conditions (e.g. hospitals).

Flexible learning can be achieved by assuming dependent nonparametric priors for the vector of random probabilities $(P_1, \ldots, P_J)$ in (4). An early proposal appeared in Cifarelli and Regazzini (1978), but the concept was widely taken up in the literature only after the seminal paper of MacEachern (1999) introduced the dependent DP (DDP) as a prior over families of random probability measures, an instance of which could be used for $(P_1, \ldots, P_J)$ in (4). See Quintana et al. (2020) for a recent review of DDP models.

Finally, in anticipation of the later application examples we note that while in a stylized setup it is convenient to think of the $x_{ij}$ in (4) as the observable data, in many applications the discussed symmetry assumptions are used to construct prior probability models for (latent) parameters. The $x_{ij}$ might be, for example, cluster membership indicators in bi-clustering models, or any other latent quantities in a larger hierarchical model. The same comment applies for the upcoming discussion of separate exchangeability.

### 2.2 Separate exchangeability.

We discussed earlier that exchangeability can be too restrictive to model the case when observations are arranged in different populations. Similarly, partial exchangeability can prove to be too restrictive when the same experimental unit is recorded across different blocks of observations (e.g., the same patient is recorded in different hospitals). A similar case arises when two types of experimental units are involved, and observations are recorded for combinations of the two types of units, as is common in many experimental designs (e.g., in the first motivating study in Section 3 the same type of microbiome is observed across different subjects).

In such a case a simple, but effective, homogeneity assumption that preserves the information on the experimental design is separate exchangeability, that is, invariance of the joint law under separate permutations of the indices related to the two types of units (or blocks). If the data are arranged in a data matrix with rows and columns corresponding to the two different types of units (or blocks), this reduces to invariance with respect to arbitrary permutations of row and column indices. More precisely, a data matrix $x = (x_{ij} : i = 1, \ldots, I;\ j = 1, \ldots, J)$ is separately exchangeable if

$$x_{1:I,1:J} \overset{d}{=} x_{\pi_1(1:I), \pi_2(1:J)} \tag{7}$$

for separate permutations $\pi_1$ and $\pi_2$ of rows and columns, respectively.

The notion of separate exchangeability formally reflects the known design in the learning mechanism. That is, it introduces more dependence between two values recorded from the same statistical unit than between values recorded from different statistical units. Note also that partial exchangeability of the $x_{ij}$ grouped w.r.t. columns, plus exchangeability of the columns, is a stronger homogeneity assumption than separate exchangeability, in a similar way as exchangeability is a degenerate case of partial exchangeability (i.e., we lose some structure). Similar to (5) and (6), under separate exchangeability it is possible that

$$\mathrm{Corr}(x_{ij}, x_{ij'}) > \mathrm{Corr}(x_{ij}, x_{i'j'}), \qquad j \neq j',\ i \neq i', \tag{8}$$

while by the definition of partial exchangeability, under (3) we always have

$$\mathrm{Corr}(x_{ij}, x_{ij'}) = \mathrm{Corr}(x_{ij}, x_{i'j'}). \tag{9}$$

In fact, inequality (8) always holds with $\geq$ (for an extendable separately exchangeable array), as can be shown similarly to the corresponding argument for partial exchangeability in Appendix A. See Section 2.3 for a more detailed discussion of the borrowing of information under partial and separate exchangeability.

Finally, we conclude this section by reviewing a version of de Finetti’s theorem for separately exchangeable arrays. If $x_{1:I,1:J}$ is extendable, that is, it can be seen as a projection of an infinite array $(x_{ij} : i, j = 1, 2, \ldots)$, a representation theorem in terms of latent quantities was proven independently by Aldous (1981) and Hoover (1979). See also Kallenberg (1989) and references therein. More precisely, an extendable matrix is separately exchangeable if and only if

$$x_{ij} = f(\theta, \xi_i, \eta_j, \zeta_{ij}), \tag{10}$$

for some measurable function $f$ and i.i.d. uniform random variables $\theta$, $\xi_i$, $\eta_j$ and $\zeta_{ij}$, $i, j = 1, 2, \ldots$. The representation theorem in (10) implies a less strict representation theorem in which the uniform distributions are replaced by any distributions $p_\theta$, $p_\xi$ and $p_\eta$, as long as independence is maintained (together with a corresponding change of the domain of $f$). This representation in turn can alternatively be written as

$$p(x_{1:I,1:J}) = \int p_\theta(\theta) \prod_{i=1}^{I} p_\xi(\xi_i) \prod_{j=1}^{J} p_\eta(\eta_j) \prod_{i,j} p(x_{ij} \mid \theta, \xi_i, \eta_j)\, d\theta\, d\xi_{1:I}\, d\eta_{1:J}. \tag{11}$$

Similar to (2) or (4), model (11) can be stated as a hierarchical model

$$x_{ij} \mid \theta, \xi_i, \eta_j \overset{ind}{\sim} P_{\theta, \xi_i, \eta_j}, \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots, \qquad \theta \sim p_\theta \perp \xi_i \overset{iid}{\sim} p_\xi \perp \eta_j \overset{iid}{\sim} p_\eta, \tag{12}$$

where $P_{\theta, \xi_i, \eta_j}$ is the law of $x_{ij}$ in (11). Finally, in both (11) and (12), the probability models can still be indexed with unknown hyperparameters.
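Representation (12) suggests a direct simulation scheme. The following toy sketch (our own, with Gaussian choices for $p_\theta$, $p_\xi$, $p_\eta$ and a simple additive $f$) generates a separately exchangeable array:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_array(I, J, rng):
    """Simulate (12): global theta, row effects xi_i, column effects eta_j,
    and idiosyncratic zeta_ij, combined by a fixed additive f."""
    theta = rng.normal()                 # theta ~ p_theta
    xi = rng.normal(size=(I, 1))         # xi_i  iid~ p_xi
    eta = rng.normal(size=(1, J))        # eta_j iid~ p_eta
    zeta = rng.normal(size=(I, J))       # zeta_ij iid
    return theta + xi + eta + zeta       # x_ij = f(theta, xi_i, eta_j, zeta_ij)

x = sample_array(4, 3, rng)
# Permuting rows and columns separately, as in (7), leaves the joint
# law unchanged, because xi, eta and zeta are iid across their indices.
x_perm = x[rng.permutation(4)][:, rng.permutation(3)]
```

This additive structure is the same device used for the separately exchangeable regression prior in Section 5.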

### 2.3 Separately exchangeable random partitions

We discuss in more detail the nature of borrowing of information under partial and separate exchangeability. For easier exposition we focus on a matrix $x$ with two columns, i.e., $J = 2$. Note that the assumptions of partial and separate exchangeability imply marginal exchangeability within columns. Moreover, we exclude the degenerate cases of (marginal) independence between the columns $x_{\cdot 1}$ and $x_{\cdot 2}$, when no borrowing of strength occurs under Bayesian learning, as well as the fully exchangeable case, in which we lose heterogeneity across columns.

In the previous section we discussed how separate exchangeability, contrary to partial exchangeability, allows us to respect the identity of an experimental unit by allowing for increased dependence between $x_{i1}$ and $x_{i2}$ compared to the dependence between $x_{i1}$ and $x_{i'2}$, $i \neq i'$. (Note that sometimes it might instead be desirable to introduce negative dependence, i.e., repulsion.)

We now extend the discussion to borrowing of information also on other functionals of interest of $x_{\cdot 1}$ and $x_{\cdot 2}$, potentially even while assuming $x_{i1}$ and $x_{i'2}$ independent for any $i \neq i'$, but $x_{i1}$ and $x_{i2}$ dependent. In the following we focus on a fundamental example related to borrowing of information about a partition of $x_{\cdot 1}$ and $x_{\cdot 2}$. For this discussion, we assume discrete $x_{ij}$. In that case, ties among the $x_{ij}$ define a partition of $\{1, \ldots, I\}$ for each column (and similarly for rows, but for simplicity we only focus on columns). Let $\rho_j$ denote this partition in column $j$. A common context in which such situations arise is when the $x_{ij}$ are latent indicators in a mixture model for observable data, with a top level sampling model indexed by additional cluster-specific parameters.

Similarly, a partially exchangeable model (4) with discrete probability measures $P_j$ defines a random partition $\rho_j$ in column $j$. If we then induce dependence between $\rho_1$ and $\rho_2$ (by way of dependent $P_1, P_2$), it is possible to borrow information about the law of the random partitions, for example, the distribution of the number of clusters. See Franzolini et al. (2021) for a definition and probabilistic characterizations of partially exchangeable partitions. However, by definition of partial exchangeability (3), with only partial exchangeability it is not possible to borrow information about the actual realizations of the random partitions $\rho_j$. For example, under a partially exchangeable partition

$$p(\{x_{1,j'} = x_{2,j'}\} \mid \{x_{1,j} = x_{2,j}\}) = p(\{x_{1,j'} = x_{3,j'}\} \mid \{x_{1,j} = x_{2,j}\}). \tag{13}$$

In contrast, under separate exchangeability it is possible to have

$$p(\{x_{1,j'} = x_{2,j'}\} \mid \{x_{1,j} = x_{2,j}\}) > p(\{x_{1,j'} = x_{3,j'}\} \mid \{x_{1,j} = x_{2,j}\}). \tag{14}$$

From a statistical perspective this means that under separate exchangeability the fact that two observations (e.g., $x_{1,j}$ and $x_{2,j}$) are clustered together in one column can increase the probability that the same two observations are clustered together in another column. This difference in the probabilistic structures, and thus in the borrowing of information under Bayesian learning, is particularly relevant when the observations indexed by $i$ have a meaningful identity in a particular application. For example, in one of the motivating examples, the units $i$ refer to different types of microbiomes (OTUs). In words, (14) implies that seeing OTUs 1 and 2 clustered together in subject $j$ increases the probability of seeing the same OTUs co-cluster in subject $j'$. In the actual application the $x_{ij}$ will refer to parameters in a hierarchical model.
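The gap between (13) and (14) can be made concrete with a minimal nested-partition toy model of our own (not the model of Section 4): each of two column clusters carries its own row partition, and columns inherit the partition of their column cluster. Conditional on rows 1 and 2 co-clustering in column 1, the same pair co-clusters in column 2 with higher probability than a different pair.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100000
# M[s, i, k]: binary cluster label of row i under column cluster k,
# for Monte Carlo replicate s. S[s, j]: column cluster of column j.
M = rng.integers(0, 2, size=(n, 3, 2))
S = rng.integers(0, 2, size=(n, 2))
rows = np.arange(3)
lab1 = M[np.arange(n)[:, None], rows, S[:, 0][:, None]]  # row labels, column 1
lab2 = M[np.arange(n)[:, None], rows, S[:, 1][:, None]]  # row labels, column 2

given = lab1[:, 0] == lab1[:, 1]                 # rows 1,2 co-clustered in col 1
lhs = (lab2[:, 0] == lab2[:, 1])[given].mean()   # same pair in col 2: theory 0.75
rhs = (lab2[:, 0] == lab2[:, 2])[given].mean()   # pair (1,3) in col 2: theory 0.50
```

With probability 1/2 the two columns fall in the same column cluster and then share the row partition exactly, which lifts the left-hand side of (14) above the right-hand side.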

In Section 4 we discuss an effective way to define flexible, but analytically and computationally tractable, non-degenerate separately exchangeable partitions. We first cluster columns, and then set up nested partitions of the rows, nested within column clusters. That is, all columns in the same column-cluster share the same nested partition of rows.

## 3 Two Examples

We use two motivating examples to illustrate the notion of separate exchangeability, and to introduce two modeling strategies that provide specific implementations respecting separate exchangeability.

In the first example we assume separate exchangeability for the data. In short, we construct a separately exchangeable model for a data matrix by setting up a partition of columns and, nested within column clusters, partitions of rows. Such nested partitions are created by assuming separate exchangeability for parameters in a hierarchical prior model. In the second example we assume separate exchangeability for the prior on model parameters which index semiparametric regression models. The separately exchangeable model is set up in a straightforward way by defining an additive structure with exchangeable priors on terms indexed by rows and columns. In both cases the model is separately exchangeable without reducing to the special cases of partial exchangeability or marginal exchangeability.

#### Microbiome data - random partitions.

The first example is inference for microbiome data for subjects $j = 1, \ldots, J$ and OTUs (operational taxonomic units) $i = 1, \ldots, I$. The data report the frequencies $y_{ij}$ of OTU $i$ for subject $j$ (after suitable normalization). In building a model for these data we are guided by separate exchangeability with respect to subject indices $j$ and OTU indices $i$. We use a publicly available microbiome data set from a diet swap study (O’Keefe et al., 2015). The same data was analyzed by Denti et al. (2021). The dataset includes subjects from the US and rural Africa. The interest is to investigate the different patterns of microbial diversity (OTU counts) across subjects and how the distributions of OTU abundancy vary across subgroups of subjects. Focusing on inference for microbial diversity, Denti et al. (2021) restricted attention to identifying clusters of subjects with similar distributions of OTU frequencies, considering OTU counts as partially exchangeable. While this focus is in keeping with the tradition in the literature, it ignores the shared identity of the OTUs across subjects. In Section 4 we will set up an alternative model that respects OTU identity by modeling the data as separately exchangeable.

We show some summary plots of the data to motivate the proposed inference. First we sort OTUs by overall abundancy (across all subjects). Figure 1 (top) shows the cumulative relative frequencies of OTUs in rural Africans (RU), African Americans (AA) and all subjects, highlighting a difference in the distribution of OTU frequencies between RU and AA. Throughout, the OTU frequencies in this data have been scaled by average library size, i.e., $y_{ij}$ is the absolute count of OTU $i$ normalized by the subject-specific total. The two barplots at the bottom of the same figure show OTU abundancies in the two groups, suggesting that subjects might meaningfully group by distribution of OTU frequencies.

In Figure 2, we show hierarchical clustering of subjects based on the 10 OTUs with highest empirical variance. Note how the clusters correlate well with the two groups, RU and AA, suggesting that in grouping subjects by distribution of OTU frequencies we should proceed in a way that maintains and respects OTU identities (as is the case in the hierarchical clustering). Importantly, any inference on such groupings has to account for substantial uncertainty. Observing these features in the figures motivates us to formalize inference on grouping subjects by OTU abundancies using model-based inference. We will set up a separately exchangeable model, exchangeable with respect to subject and OTU indices.

#### Protein expression - nonparametric regression.

In a second example we analyze protein measurement data in a study of ataxia, a neurodegenerative disease. The same data is studied in Lee et al. (2021). The data measure the abundancy of potentially disease-related proteins in two groups of subjects, one control group and one patient group. Each group includes subjects with ages between 5 and 50 years. The subjects’ ages define time for our observations. There are $T$ unique times, with one control and one patient for each unique time. That is, calendar age is recorded in years, and $t = 1, \ldots, T$ states time as an index in the list of unique time points. The data are protein activation data $y_{ij}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, for $J$ subjects and $I$ proteins. For each subject $j$, $d_j$ is an indicator for being a patient ($d_j = 1$) or control ($d_j = 0$), and $t_j$ denotes the age. The inference goal is to identify proteins that are related to the disease, defined as proteins with a large difference between patients and controls in the change of protein expression over age. We set up a nonparametric regression for $y_{ij}$ versus age. The regression mean function is constructed using cubic B-splines with 2 interior knots, an offset for protein $i$, and an interaction of treatment and B-spline, to allow for a difference in spline coefficients for patients versus controls. See the later discussion for more model details. In this case, the separate exchangeability assumption is made for the protein-and-age specific effects that appear in the regression mean function.

Figure 3 shows a random subset of the data. The figure shows protein expression (on a logarithmic scale) over time for randomly selected proteins, separately for patient and control groups. The figure highlights the high level of noise in this data.

Recall the primary inference goal of identifying proteins with the largest difference in slopes between patients versus controls. Table 1 shows the selected top 10 proteins using two simple data summaries. Let $j_x^t$, $x \in \{0, 1\}$, $t \in \{1, T\}$, denote the indices of the four subjects with $d_j = x$ observed at the youngest ($t = 1$) and oldest ($t = T$) unique ages. The first set of 10 proteins are the proteins with the largest empirical difference

$$\hat{\gamma}_i = \frac{y_{i j_1^T} - y_{i j_1^1}}{T - 1} - \frac{y_{i j_0^T} - y_{i j_0^1}}{T - 1}, \tag{15}$$

that is, the 10 proteins with the largest observed difference between patients and controls in change over time. One problem with $\hat{\gamma}_i$ is that it is based on only the 4 subjects with minimum and maximum age, and does not borrow any strength from data for subjects with ages in between, or from other proteins. And both summaries ignore the uncertainty of $\hat{\gamma}_i$. The second set of 10 proteins are the proteins with the largest fitted difference, obtained by fitting for each protein two separate smoothing splines, one to all patients and a second one to all controls, and evaluating (15) with the data replaced by the fitted values under these smoothing splines. The second set therefore includes borrowing of strength across all subjects, but still no borrowing of strength across proteins. Note that the top 10 proteins based on these two data summaries include no overlap, highlighting again the high level of noise in these data, and the need for more principled inference and characterization of uncertainties. These observations motivate us to develop a hierarchical model to borrow strength across proteins and time, and to allow for a full description of uncertainties. The hierarchical model across subjects and proteins is set up using separate exchangeability on parameters. Note that in this case separate exchangeability is not on the data, but on parameters (including slopes etc.) of the fitted curves. See later for details. The model includes a random partition of proteins to group proteins with similar patterns into common clusters, allowing us to identify a group of interesting proteins as the cluster with the highest cluster-specific change in protein expression over time.
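The empirical summary (15) is easy to compute. The following sketch uses a hypothetical toy array (sizes and signal chosen by us, not the study's data), with protein 0 given an upward expression drift for patients so that it should top the ranking:

```python
import numpy as np

rng = np.random.default_rng(3)
I, T = 6, 10   # hypothetical: 6 proteins, 10 unique ages

# y[i, t, d]: expression of protein i at age index t for control (d=0)
# or patient (d=1); protein 0 drifts upward with age for patients only.
y = 0.1 * rng.normal(size=(I, T, 2))
y[0, :, 1] += np.arange(T, dtype=float)

# Summary (15): patient-minus-control difference in the per-unit-time
# change between the youngest (t=1) and oldest (t=T) subjects.
gamma_hat = (y[:, T - 1, 1] - y[:, 0, 1]) / (T - 1) \
          - (y[:, T - 1, 0] - y[:, 0, 0]) / (T - 1)
top = np.argsort(-gamma_hat)   # proteins ranked by gamma_hat
```

As the text notes, this estimate uses only the endpoint subjects; all intermediate ages and all other proteins are ignored, which is precisely what the hierarchical model remedies.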

## 4 Separate Exchangeability through Nested Partitions

### 4.1 Separate versus partial exchangeability

The microbiome data record frequencies $y_{ij}$ of OTUs $i = 1, \ldots, I$ for subjects $j = 1, \ldots, J$. The measurement $y_{ij}$ reports the scaled abundancy of OTU $i$ in subject $j$. The main inference goal is to identify subgroups $C_k$, $k = 1, \ldots, K$, of subjects with different patterns of OTU frequencies. The subsets define a partition $\{1, \ldots, J\} = \cup_k C_k$ with $C_k \cap C_{k'} = \emptyset$ for $k \neq k'$. We refer to the $C_k$ as subject clusters. Alternatively, we represent the partition using cluster membership indicators $S_j = k$ if and only if $j \in C_k$.

#### Partially exchangeable model.

We assume a mixture of normal distributions. Let $\mathrm{GEM}(\alpha)$ denote a stick-breaking prior for a sequence of weights (Sethuraman, 1994). Marginally for each subject $j$ (i.e., marginalizing over other subjects $j' \neq j$), we assume

$$\begin{aligned} y_{ij} \mid S_j = k &\overset{iid}{\sim} G_k, \quad i = 1, 2, \ldots \\ G_k = \sum_{\ell=1}^{\infty} w_{k\ell}\, N(\mu_\ell, \sigma^2_\ell) \ \text{ and } \ \theta_\ell &= (\mu_\ell, \sigma^2_\ell) \overset{iid}{\sim} G_0, \quad w_k \sim \mathrm{GEM}(\alpha) \\ p(S_j = k \mid \pi) &= \pi_k \ \text{ and } \ \pi = (\pi_1, \ldots) \sim \mathrm{GEM}(\beta) \end{aligned} \tag{16}$$

with independent sampling across $i$ and $j$. Marginally, for one subject $j$, the first two lines of (16) imply a Dirichlet process (DP) mixture of normals model for $y_{1:I,j}$. That is, the marginal law for each subject is a DP mixture

$$y_{ij} \mid G \overset{iid}{\sim} G, \quad i = 1, \ldots, I, \qquad G = \int N(\mu, \sigma^2)\, dF(\mu, \sigma^2) \ \text{ and } \ F = \sum_\ell w_{k\ell}\, \delta_{(\mu_\ell, \sigma_\ell)} \sim \mathrm{DP}(\alpha, G_0). \tag{17}$$

We keep using the notation $w_{k\ell}$ to highlight the link with (16). Here the $(\mu_\ell, \sigma_\ell)$ are the atoms of a discrete random probability measure and $\mathrm{DP}(\alpha, G_0)$ defines a DP prior with total mass $\alpha$ and base measure $G_0$. See, for example, Müller et al. (2015, Chapter 2) for a review of such DP mixture models. However, the fact that in (16) multiple subjects can share the same $G_k$ introduces dependence across subjects. Additionally, note that the normal moments $(\mu_\ell, \sigma^2_\ell)$ are indexed by $\ell$ only, implying common atoms of the $G_k$ across $k$. The model construction is completed by assuming that the $y_{ij}$ in (16) are sampled independently also across $j$, given the vector of random probabilities. Below we will introduce a variation of this final assumption, motivated by the following observation.

The use of common atoms across $k$ allows us to define clusters of observations across OTUs and subjects. This is most easily seen by replacing the mixture of normals model in the first line of (16) by a hierarchical model with latent indicators $\tilde{M}_{ij}$ as

$$p(y_{ij} \mid \tilde{M}_{ij} = \ell) = N(\mu_\ell, \sigma^2_\ell) \ \text{ and } \ p(\tilde{M}_{ij} = \ell \mid S_j = k) = w_{k\ell}.$$

Interpreting the $\tilde{M}_{ij}$ as cluster (of OTUs) membership indicators, the model defines a random partition of OTUs for each subject $j$, with clusters defined by the ties among the $\tilde{M}_{ij}$, i.e., $\{i : \tilde{M}_{ij} = \ell\}$. In this construction subjects with shared $S_j$ share the same prior on the partition of OTUs, implied by $w_{k\ell}$ with $k = S_j$, defining partial exchangeability as in (4), with the random partition model defining a dependent prior on the $\tilde{M}_{ij}$. Conditional on $S$, the latter is characterized by only $K$ distinct priors.
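A minimal sketch (our own, with hypothetical values for $\alpha$ and the truncation level) of the truncated GEM stick-breaking weights used in (16), and of sampling the latent indicators from them:

```python
import numpy as np

def gem_weights(alpha, L, rng):
    """Truncated stick-breaking (GEM) weights: v_l ~ Beta(1, alpha) and
    w_l = v_l * prod_{m<l}(1 - v_m), with the stick closed at level L."""
    v = rng.beta(1.0, alpha, size=L)
    v[-1] = 1.0                                    # ensures sum(w) == 1
    stick = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick

rng = np.random.default_rng(5)
w_k = gem_weights(alpha=1.0, L=20, rng=rng)        # weights for one subject cluster
m_tilde = rng.choice(20, size=50, p=w_k)           # indicators for 50 OTUs
```

Ties among the sampled indicators induce the random partition of OTUs discussed above.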

#### A separately exchangeable prior.

Recognizing that the described construction with $\tilde{M}_{ij}$ reduces to partial exchangeability when non-degenerate separate exchangeability is implied by the nature of the experiment, we introduce a modification. While the change is minor in terms of notation, it has major consequences for interpretation and inference, as we will show later. We replace the indicators $\tilde{M}_{ij}$ by $M_{ik}$ (note the subindex $k$), specific to each OTU $i$ and cluster $k$ of subjects, with otherwise unchanged marginal prior

$$p(M_{ik} = \ell) = w_{k\ell} \tag{18}$$

and $\tilde{M}_{ij} = M_{i,S_j}$. The assumption completes the marginal model (16) by introducing dependence of the $y_{ij}$ across $j$, which is parsimoniously introduced with the $M_{ik}$ indicators. The marginal distribution (17) remains unchanged. But now the implied random partitions are shared among all $j$ with $S_j = k$.

In practice we use an implementation using a finite Dirichlet process (Ishwaran and James, 2001) for , i.e., we truncate with atoms. Similarly, we truncate the stick-breaking prior for at a fixed number of atoms. For later reference we state the joint probability model conditional on and is

 p(y_{1:I,1:J}, S, M, \pi, w \mid K, L) = \prod_{j=1}^{J} \prod_{i=1}^{I} p(y_{ij} \mid \mu_{M_{i,S_j}}, \sigma^2_{M_{i,S_j}}) \times \prod_{j=1}^{J} \pi_{S_j} \Big\{ \prod_{k=1}^{K} \prod_{i=1}^{I} w_{M_{ik},k} \Big\} \; p(\pi) \prod_{k=1}^{K} p(w_k) \prod_{\ell=1}^{L} p(\mu_\ell, \sigma^2_\ell) (19)

with p(\pi) and p(w_k) being finite stick-breaking priors, and p(\mu_\ell, \sigma^2_\ell) chosen to be conditionally conjugate for the normal sampling model.
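As a sketch of the truncation, the weights of a finite stick-breaking prior can be generated as follows (the truncation level H and total mass alpha are illustrative choices, not the values used in the application):

```python
import numpy as np

def finite_stick_breaking(alpha, H, rng):
    """Weights of a DP truncated at H atoms (Ishwaran and James, 2001):
    v_h ~ Beta(1, alpha) for h < H, v_H = 1, and w_h = v_h * prod_{g<h} (1 - v_g),
    so that the weights sum to one."""
    v = rng.beta(1.0, alpha, size=H)
    v[-1] = 1.0                                   # truncation: last stick takes the rest
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w

rng = np.random.default_rng(0)
w = finite_stick_breaking(alpha=1.0, H=10, rng=rng)
```

Setting the last stick v_H = 1 guarantees a proper probability vector at any truncation level.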

In summary, we have introduced separate exchangeability for the data matrix by defining (i) a random partition of columns, corresponding to the cluster membership indicators S_j, and (ii) nested within column clusters k, a nested partition of rows, represented by cluster membership indicators M_{ik}. In contrast, a model under which subjects only share the prior on the nested partition reduces to the special case of partial exchangeability, as it remains invariant under arbitrary permutation of the row (OTU) labels separately in any column (subject). Without the reference to separate exchangeability, a similar construction with nested partitions was also used in Lee et al. (2013). The construction of the nested partition is identical, but there is no notion of common atoms to allow for clusters of observations across column clusters.

Finally, to highlight the nature of the model as being separately exchangeable with respect to OTUs and subjects, we exhibit the explicit Aldous-Hoover representation (11), still conditional on hyperparameters. Separate exchangeability conditional on the hyperparameters follows by matching the model's random elements, including the subject indicators S_j and the nested indicators M_{ik}, with the arguments in (11). Here we use that (11) allows conditioning on additional hyperparameters.

### 4.2 Results

Posterior simulation is implemented as a straightforward Gibbs sampling algorithm. All required complete conditional distributions follow easily from the joint probability model (19). Posterior Monte Carlo simulation is followed by a posterior summary of the random partition using the approach of Dahl (2006) to minimize Binder loss, using the algorithm in Dahl et al. (2021). We estimate three clusters of subjects. We show the three subject clusters in Figure 4 by plotting cumulative frequencies of OTUs. We sort all OTUs by overall abundance across subjects. For each cluster of subjects, we collect all subjects and plot cumulative (observed) frequencies.
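The partition summary can be sketched as follows. The function below implements a Dahl (2006)-style point estimate: among the sampled partitions it returns the one minimizing a Binder-type loss against the estimated co-clustering probabilities (the toy posterior sample is purely illustrative):

```python
import numpy as np

def dahl_point_estimate(partitions):
    """Given a list of sampled partitions (label vectors of length n), return
    the sampled partition minimizing sum_{i,i'} (1[c_i = c_i'] - phat_{ii'})^2,
    where phat is the posterior co-clustering probability estimated from the
    same samples, together with phat itself."""
    P = np.array([np.equal.outer(c, c) for c in partitions], dtype=float)
    phat = P.mean(axis=0)                           # co-clustering probabilities
    losses = ((P - phat) ** 2).sum(axis=(1, 2))
    return partitions[int(np.argmin(losses))], phat

# Toy posterior sample of partitions of n = 4 items (hypothetical labels).
samples = [np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
chat, phat = dahl_point_estimate(samples)   # chat is the reported partition
```

The matrix phat is also the quantity displayed in the co-clustering heatmaps discussed next.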

Figure 5 summarizes the nested partition of OTUs, nested within the three subject clusters. The three panels correspond to the three subject clusters. For each subject cluster, the figure shows the estimated co-clustering probabilities of OTUs, i.e., the posterior probability of falling in the same cluster, for each pair of OTUs. As usual for heatmaps, the OTUs are sorted for a better display, to highlight the clusters.

Figure 6 shows the same nested partitions, but now by showing the data arranged by subject clusters. In each panel OTUs are sorted by observational clusters. That is, each panel shows the data y_{ij} for the subjects j in one subject cluster, with the subjects on the x-axis and the OTUs on the y-axis, arranged by the estimated observational clusters. The patterns in the data echo the clusters shown in the previous plot.

Inference as in Figure 5 or 6 is not meaningfully possible under partial exchangeability, since the nested partitions are not shared across subjects. In other words, consider for example two subjects j = 1 and j = 2. Assume that for subject 1 two particular OTUs co-cluster in a cluster with high frequency, whereas for subject 2 two entirely different OTUs cluster together. Under partial exchangeability subjects 1 and 2 might be placed in the same subject cluster, although the respective OTUs might be linked to entirely different diets.

## 5 Nonparametric Regression

### 5.1 Separate exchangeability in an ANOVA DDP model

#### Anova Ddp.

In this example we set up a separately exchangeable model as an implementation of the popular dependent Dirichlet process (DDP), by introducing a symmetric structure in the prior for the atoms of a Dirichlet process (DP) random measure over subjects and time. In this example, the separate exchangeability assumption is not on the observed data (protein expression over time and under two different conditions). Instead we set up a separately exchangeable prior for the linear model parameters in a statement of the DDP as a DP mixture of linear models. The actual construction is very simple. We achieve separate exchangeability by setting up an additive structure with terms specific to proteins and to time.

The DDP is a predictor-dependent extension of DP mixtures first proposed by MacEachern (1999, 2000). It defines a family of random probability measures \{G_x\}, where the random distributions are indexed by covariates x and each G_x marginally follows a DP prior as in (17). The desired dependence of G_x across x is achieved by writing the atoms or the weights of the DP random measure as functions of the covariates (Quintana et al., 2020). The simplest version of the DDP arises when only the atoms of G_x vary over x (common-weights DDP) and are specified as linear functions of x. This defines the linear dependent DP (LDDP) or ANOVA DDP (De Iorio et al., 2004). The model can be written as a DP mixture of linear models, that is, as a mixture with respect to some (or all) linear model parameters, and a DP prior on the mixing measure. See below for a specific example.

The DDP naturally implements partially exchangeable structure if it is assumed as the de Finetti measure in (4). Consider data grouped by a covariate level x, sampled from G_x, typically using an additional convolution with a density kernel to construct a continuous sampling model, if desired. A similar situation arises when data are observed together with a categorical covariate. In either case, a DDP prior on \{G_x\} implements a partially exchangeable model for the observable data, grouped by the covariate. However, if the experimental context calls for separately exchangeable structure with respect to two sets of experimental units, appropriate model variations are called for. We introduce one next, where the separately exchangeable structure is on the parameters in (a variation of) the DP mixture representation.

#### A separately exchangeable ANOVA DDP.

To state the specific model we need some more notation. Recall the notation set up in Section 3, with y_{ij} denoting the abundance of protein i in subject j, and with subject-specific age t_j and condition (patient versus control). We use cubic B-splines with two internal knots to represent protein expression over a grid of time points t_1, \ldots, t_T. The linear model parameters in the ANOVA DDP model include coefficients for the B-spline basis functions to represent protein expression over time for controls, plus an equal number of additional coefficients for the same basis functions to represent an offset for protein expression for patients. More specifically, let x_j denote a design vector for subject j, with the first 6 entries being the spline basis functions evaluated at time t_j, and the last 6 entries representing an offset for patients (zero for controls). That is, the linear model coefficients \beta_{i,1:6} model protein expression over time for controls, while the coefficients \beta_{i,7:12} represent an additional offset for patients in protein expression over time.
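The design vectors x_j can be assembled from a cubic B-spline basis with two internal knots, which indeed yields 4 + 2 = 6 basis functions per condition. The following sketch evaluates such a basis with a direct Cox-de Boor recursion (knot placement, time grid, and the standardization to [0, 1] are illustrative assumptions; the paper uses the R function bs from the package splines):

```python
import numpy as np

def bspline_basis(t, knots, degree=3):
    """Evaluate all B-spline basis functions at the points t via the
    Cox-de Boor recursion. For a clamped knot vector with two internal
    knots and degree 3 this gives 4 + 2 = 6 basis functions."""
    n_basis = len(knots) - degree - 1
    # degree-0 indicator functions; close the last non-empty interval on the right
    B = np.zeros((len(t), len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    last = np.searchsorted(knots, knots[-1], side="left") - 1
    B[t == knots[-1], last] = 1.0
    for d in range(1, degree + 1):
        Bn = np.zeros((len(t), len(knots) - 1 - d))
        for i in range(len(knots) - 1 - d):
            den1, den2 = knots[i + d] - knots[i], knots[i + d + 1] - knots[i + 1]
            left = (t - knots[i]) / den1 if den1 > 0 else 0.0
            right = (knots[i + d + 1] - t) / den2 if den2 > 0 else 0.0
            Bn[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bn
    return B[:, :n_basis]

knots = np.r_[np.zeros(4), [1 / 3, 2 / 3], np.ones(4)]  # clamped, two internal knots
tgrid = np.linspace(0.0, 1.0, 5)                        # hypothetical standardized ages
Bmat = bspline_basis(tgrid, knots)                      # 5 x 6 basis matrix
cond = 1                                                # 1 = patient, 0 = control
X = np.hstack([Bmat, cond * Bmat])                      # design rows x_j, 12 columns
```

With cond = 1 the last six columns repeat the basis, so that the coefficients 7:12 act as a patient-specific offset, matching the parametrization above.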

Let y_i = (y_{i1}, \ldots, y_{iJ}) denote all data for protein i. We assume an ANOVA DDP model, written as the DP mixture

 f(y_i \mid G, \alpha, \delta) = \int \Big\{ \prod_{j=1}^{J} N(y_{ij};\, \alpha_i + \delta_{t_j} + x_j'\beta_i,\, \sigma_i^2) \Big\} \, dG(\beta_i, \sigma_i^2), (20)

with a DP prior for G. Here the \alpha_i are protein-specific offsets and the \delta_t are offsets for each of the T unique time points. We use a finite DP, truncated at a fixed number of atoms (Ishwaran and James, 2001), with a fixed total mass parameter and a conditionally conjugate base measure for (\beta, \sigma^2). We complete the prior specification with independent priors for the offsets \alpha_i and \delta_t, with fixed hyperparameters. See Appendix B for specific values.

Note that the linear model is over-parametrized. For example, we do not restrict the cubic splines to average zero over all (two) subjects observed at the same time t, implying confounding of the intercept with the offsets \delta_t. However, recall that the inference target is the difference in slope between patient and control for a given protein, which is not affected by this over-parametrization.

The regression model is set up to allow straightforward inference on the protein-specific difference in slope for patients versus controls. As in (15), let j_0 and j_1 again denote indices of subjects under control and patient conditions, respectively. Then, keeping in mind that the design vectors for patients and controls differ only in the last 6 elements (and ignoring scaling by a constant),

 \gamma_i = \big[ (x_{j_1,T} - x_{j_1,1}) - (x_{j_0,T} - x_{j_0,1}) \big]' \beta_i = (x_{j_1,T,7:12} - x_{j_1,1,7:12})' \, \beta_{i,7:12} (21)

represents the desired difference in slope between patients and controls. Posterior inference on \gamma_i implements the desired model-based inference on the difference of slopes, with borrowing of strength across proteins and subjects.
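The cancellation behind (21) is easy to verify numerically. In this sketch the basis evaluations and coefficients are random placeholders; the point is only that the control block of \beta_i drops out of \gamma_i:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical spline basis evaluations at the first and last time point
# (any 6-vectors serve to illustrate the algebra of eq. 21).
b_first = rng.dirichlet(np.ones(6))
b_last = rng.dirichlet(np.ones(6))
beta_i = rng.normal(size=12)     # 6 control coefficients + 6 patient offsets

# Patient and control design vectors differ only in the last six entries,
# so the control block cancels and gamma_i depends on beta_{i,7:12} alone.
gamma_i = (b_last - b_first) @ beta_i[6:]

# The same quantity from the full bracketed expression in (21):
x1_first, x1_last = np.r_[b_first, b_first], np.r_[b_last, b_last]           # patient
x0_first, x0_last = np.r_[b_first, np.zeros(6)], np.r_[b_last, np.zeros(6)]  # control
gamma_full = ((x1_last - x1_first) - (x0_last - x0_first)) @ beta_i
```

Both expressions agree, confirming that the over-parametrization noted above does not affect \gamma_i.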

Let \theta_{it} denote the parameters that index the regression model for protein i at time t. By construction the prior probability model on \theta is separately exchangeable, as it is invariant with respect to separate permutations \pi of protein indices and \pi' of time (age) indices:

 p(\theta_{1:I,\,1:T}) = p(\theta_{\pi(1:I),\,\pi'(1:T)}).

Note that \beta_i represents the mean function for protein i as coefficients for the spline basis (valid for any time t). Only in (20) is this mean function evaluated at the observed times t_j. In particular, separate exchangeability is assumed for the mean function parameters, not for the fitted values or the data. The model is separately exchangeable by construction. Alternatively, one can trivially match the variables with the arguments in (11).

### 5.2 Posterior Inference

#### MCMC posterior simulation.

Posterior simulation under the proposed ANOVA DDP (or DDP of splines) model for protein expression is straightforward. We used the R function bs from the package splines to evaluate the spline basis functions at the observed times. The transition probabilities are detailed in Algorithm 1 in the appendix.

For the statement of the detailed transition probabilities in Algorithm 1 it is useful to replace the mixture model (20) by an equivalent hierarchical model

 y_{ij} \mid s_i = h \sim N(\alpha_i + \delta_{t_j} + x_j' \tilde{\beta}_h,\ \tilde{\sigma}_h^2) (22)

with p(s_i = h) given by the DP weights. Recall that (\tilde{\beta}_h, \tilde{\sigma}_h^2) are the atoms of the random mixing measure G in (20), and that we use a finite DP truncated at a fixed number of atoms. Interpreting s_i as cluster membership indicators defines inference on a random partition of proteins, with clusters defined by ties among the s_i. Note that in this notation we allow for empty clusters that arise when an atom is not linked with any observation. Using this notation, see Algorithm 1 in Appendix B for a description of MCMC posterior simulation.

We are mainly interested in two posterior summaries: a partition of proteins by different patterns of protein expression over time, and identification of the proteins with the highest rank in \gamma_i (the difference in slope between patients and controls). The latter proteins are the ones most likely linked with ataxia.

We start by identifying the MAP estimate for the number of clusters in (22), and then follow the approach of Dahl (2006) to report a posterior summary of the random partition of proteins. Conditional on the reported partition, we then generate a posterior Monte Carlo sample for the cluster-specific parameters of each cluster to obtain (conditional) posterior means and variances. We did not encounter problems related to label switching in the actual implementation; in general one might need to account for possible label switching.

#### Ranking proteins.

The main inference target is to identify proteins with the most significant difference between time profiles for patients versus controls, suggesting such proteins are the most likely to be linked with ataxia.

We therefore focus on the difference (across conditions) in differences (over age) of mean protein expression, defined as \gamma_i in (21). We evaluate the posterior mean E(\gamma_i \mid y) using Rao-Blackwellization, that is, as a Monte Carlo average of conditional expectations E(\gamma_i \mid \theta^{(m)}), where \theta^{(m)} are all parameters in the m-th posterior Monte Carlo sample, excluding \gamma_i. Note that \gamma_i is a deterministic function of the cluster-specific \tilde{\beta}_h when s_i = h. Conditioning on the reported partition we can then identify the set of proteins with the largest average change.

Alternatively, we can cast the problem of identifying interesting proteins as a problem of ranking, and more specifically, one of estimating a certain quantile, to report the most promising proteins. The percentile cutoff c is chosen by the investigator, and the choice should reflect the effort and capacity to further investigate the selected proteins. We then formalize the problem of reporting promising proteins as the problem of identifying the proteins with \gamma_i in the top 1 - c fraction of values. The problem of ranking experimental units in a hierarchical model and reporting the top percentile is discussed in Lin et al. (2004), who cast it as a Bayesian decision problem. Let

 R_i = \mathrm{rank}(\gamma_i) = \sum_{i'=1}^{I} I\big( |\gamma_i| \ge |\gamma_{i'}| \big),

denote the true ranks, with R_i = 1 for the smallest and R_i = I for the largest |\gamma_i|. Alternatively we use the percentiles P_i = R_i/(I+1). One of the loss functions that Lin et al. (2004) consider is a 0-1 loss aimed at identifying the top units, i.e., the units above the c-th quantile. Let \hat{R}_i denote the estimated rank for unit i, and \hat{P}_i = \hat{R}_i/(I+1). The following loss function penalizes the number of misclassifications in the top quantile, including the number of proteins that are falsely reported (false positives) plus those that fail to be reported (false negatives).

 L_{0/1}(c) = \frac{1}{I} \{\#\text{ misclassifications}\} = \frac{1}{I} \Big\{ \sum_{i=1}^{I} AB(c, P_i, \hat{P}_i) + BA(c, P_i, \hat{P}_i) \Big\}

where AB and BA are indicators of the two types of misclassifications,

 AB(c, P, \hat{P}) = 1(P > c,\ \hat{P} \le c) \quad \text{and} \quad BA(c, P, \hat{P}) = 1(P \le c,\ \hat{P} > c),

equivalently stated in terms of ranks, with P > c if and only if R > c(I+1).

Lin et al. (2004) show that L_{0/1}(c) is optimized by reporting the units with the largest values of

 R^\star_i(c) = \mathrm{rank}\{ p(P_i > c \mid y) \}, (23)

i.e., by ranking the posterior probabilities p(P_i > c \mid y).
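A sketch of this selection rule, given posterior draws of \gamma_i: the function below computes within-draw percentiles, estimates p(P_i > c | y), and reports the units in the top 1 - c fraction. The function name and array layout are our own conventions; the rule follows (23):

```python
import numpy as np

def top_percentile_report(gamma_samples, c):
    """gamma_samples: (M, I) posterior draws of gamma_1..gamma_I.
    Within each draw, compute ranks R_i (1 = smallest |gamma|) and percentiles
    P_i = R_i / (I + 1); then rank units by p(P_i > c | y), as in (23),
    and report those in the top 1 - c fraction."""
    M, I = gamma_samples.shape
    a = np.abs(gamma_samples)
    R = np.argsort(np.argsort(a, axis=1), axis=1) + 1   # within-draw ranks
    P = R / (I + 1.0)
    prob_above = (P > c).mean(axis=0)                   # estimates p(P_i > c | y)
    n_top = int(round((1 - c) * I))                     # size of the reported set
    return np.argsort(-prob_above)[:n_top]

# Toy example: unit 3 dominates in every posterior draw.
rng = np.random.default_rng(0)
g = rng.normal(size=(100, 10))
g[:, 3] += 50.0
top = top_percentile_report(g, c=0.8)
```

Sorting by prob_above is equivalent to sorting by the optimal ranks R*_i(c) in (23).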

### 5.3 Simulations

We set up a simulation with J subjects and I proteins. The simulation does not include a split into patients and controls (think of the data as already reporting the difference of patient and control). We generate time points t_j (equivalent to age in the actual study) and a hypothetical partition of proteins into three clusters with cluster membership indicators s_i. We set up protein-specific offsets \alpha_i using shared common values for all proteins in a cluster, and subject-specific offsets \delta_j. To mimic the actual data, we manually modify some offsets to create patterns similar to those in the data (see Figure 8b). Using these parameters we set up the simulation truth

 y_{ij} = \begin{cases} \alpha_i + 2t_j + 3t_j^3 + \delta_j + \epsilon_{ij}, & s_i = 1 \\ \alpha_i - 2t_j + t_j^3 + \delta_j + \epsilon_{ij}, & s_i = 2 \\ \alpha_i + t_j - 3t_j^3 + \delta_j + \epsilon_{ij}, & s_i = 3 \end{cases}

with independent normal errors \epsilon_{ij}. See Figure 8b (black lines) for the mean functions for proteins in the three clusters under the simulation truth. Under this simulation truth, the proteins in the first cluster are those with the largest overall slope. Figure 7 shows randomly selected protein trajectories y_{ij} versus age t_j, in the simulation (left panel) and in the real data (right panel).
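The simulation truth can be generated directly. Sample sizes, the age range, cluster probabilities, and noise levels below are illustrative stand-ins where the text leaves them unspecified:

```python
import numpy as np

rng = np.random.default_rng(4)

I, J = 60, 20                                # illustrative numbers of proteins, subjects
t = np.sort(rng.uniform(-1.0, 1.0, size=J))  # standardized "age"; hypothetical range
s = rng.choice([1, 2, 3], size=I)            # true cluster membership indicators s_i
alpha = rng.normal(0.0, 1.0, size=I)         # protein-specific offsets
delta = rng.normal(0.0, 0.1, size=J)         # subject-specific offsets delta_j
eps = rng.normal(0.0, 0.2, size=(I, J))      # residual noise

# Cluster-specific mean curves from the simulation truth above.
curves = {1: lambda u: 2 * u + 3 * u**3,
          2: lambda u: -2 * u + u**3,
          3: lambda u: u - 3 * u**3}
y = np.array([alpha[i] + curves[s[i]](t) + delta + eps[i] for i in range(I)])
```

Manual modification of individual offsets, as described above, would be applied to alpha before generating y.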

MCMC posterior simulation converges after a moderate number of iterations. As before, let s_i denote the cluster membership indicators, and let H^+ denote the number of non-empty clusters. Figure 8a shows a histogram of the posterior distribution p(H^+ \mid y). The posterior mode recovers the simulation truth of three clusters, with little posterior uncertainty. We then evaluate the earlier described point estimate for the posterior random partition. Conditioning on the estimated partition we continue to simulate MCMC transitions to evaluate conditional posterior means for the cluster-specific parameters in the analysis model (20).

Let \bar{\alpha}_i, \bar{\delta}_{t_j}, and \bar{\beta}_h denote the conditional posterior means. All posterior means are evaluated numerically as Monte Carlo sample averages. For a data point y_{ij}, let the corresponding fitted quantities be defined accordingly. Then