# A Bayesian Framework for Non-Collapsible Models

In this paper, we discuss the non-collapsibility concept and propose a new approach based on Dirichlet process mixtures to estimate the conditional effect of covariates in non-collapsible models. Using synthetic data, we evaluate the performance of our proposed method and examine its sensitivity under different settings. We also apply our method to real data on access failure among hemodialysis patients.

## Authors

• 7 publications
• 21 publications
• 6 publications
06/08/2015

### Learning Mixtures of Ising Models using Pseudolikelihood

Maximum pseudolikelihood method has been among the most important method...
11/16/2020

### Foundations of Bayesian Learning from Synthetic Data

There is significant growth and interest in the use of synthetic data as...
10/21/2021

### Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

With recent advances in speech synthesis, synthetic data is becoming a v...
08/14/2019

### Uplift Modeling for Multiple Treatments with Cost Optimization

Uplift modeling is an emerging machine learning approach for estimating ...
07/17/2021

### A new Method for the in vivo identification of material properties of the human eye. Feasibility analysis based on synthetic data

This paper proposes a new method for in vivo and almost real-time identi...
01/20/2021

### Accounting for not-at-random missingness through imputation stacking

Not-at-random missingness presents a challenge in addressing missing dat...
07/02/2020

### Epidemiology of exposure to mixtures: we cant be casual about causality when using or testing methods

Background: There is increasing interest in approaches for analyzing the...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Statistically, non-collapsibility represents the setting where the marginal measure of association between two random variables

and , differs from the conditional measure of association between these two random variables, after conditioning upon the levels of a third random variable , where is not a confounder, i.e., is associated with one random variable but not the other (Greenland et al., 1999). In this situation, a careful attention is required to properly interpret a conditional association as opposed to a marginal association. Further, one should note that in the absence of confounding, both the marginal association and the conditional association, despite being different, are unbiased. Hence, a clear distinction between confounding and non-collapsibility is required.

Similarly, non-collapsibility exists in a regression setting when the marginal association between a predictor variable,

, and a response variable,

, differs from the conditional association in a separate regression model where a third variable is adjusted in the model. As before, we assume that is not a confounder so it is only associated with the response variable.

In general, one needs to consider the relative importance of estimating the marginal association between the two random variables and , in contrast to the conditional association given a third random variable . When

is observed, it is possible to heuristically compare the difference between the marginal and the conditional associations by simply comparing the adjusted and unadjusted estimated associations. However, when

is latent, analysts generally default to estimating a marginal association without giving any thoughts to the relative merits of the two estimands.

In longitudinal studies, non-collapsibility has especially garnered some attention when comparing the estimates from the generalized linear mixed model with the estimates from the generalized estimating equation model, where the former provides conditional estimates that are conditioned upon the subject-specific random effects, and the latter provides estimates that are marginalized over all subjects. Longitudinal data can be considered as a special case of the repeated measure data with measurements indexed by time. We shall use the words “longitudinal data” and “repeated measure data” interchangeably.

As a simple case, one may consider subjects, each with within subject measurements with and as the outcome and the covariate for the measurement on the subject, respectively. One can write a generalized linear mixed effect model with random intercepts of the form

 E[Yij|tij,β0i] =μij,

where the mean and the covariate and the subject-specific random intercept are linked using a link function , where

 g(μij) =β0i+β0+β1tij. (1)

In this model, and are intercept and slope that are shared across all subjects. In a typical mixed effects model, , where

, are assumed to be independent and Normally distributed. Under this model setting, conditioned upon the subject-specific random intercepts,

, represents the conditional association between the random variable and the outcome, .

Alternatively, one may consider a model of the form

 E[Yij|tij] =ηij,

where the mean is related to the covariate through a link function , where

 g(ηij) =γ0+γ1tij. (2)

In this model, is the intercept and is the slope where both are shared across all subjects. Under this model setting, represents the marginal association between the covariate and the outcome, .

Generally, even with random intercepts with no confounding effect, the conditional covariate effect (equation (1)) and the marginal covariate effect (equation (2)) need not be equal. Several authors including Gail et al. (1984), Gail (1986) showed that with non-confounding subject-specific random intercept, , is guaranteed to be collapsible, if is either the identity link or the log link. That means with the identity or the log link and in the absence of confounding, equality of the conditional covariate effect and the marginal covariate effect is guaranteed. Hence, we are primarily interested in studying non-collapsibility in logistic and proportional hazards models.

To show the non-collapsibility effect in the logistic regression model, we generated synthetic data, where we considered three different groups with different intercepts of

, , . Independently of the intercepts, we generated covariate , where is simulated from the standard Normal . Using the a logistic link and with a true coefficient values of , we generated binary outcomes. We then fit a conditional model of the form

 logit[E(Yij|Xij,β0i)]=β0i+β0+β1Xij,

where is a binary outcome for the measurement on the cluster, is the covariate value corresponding to the outcome , and is the true value of the cluster-specific intercept that is directly adjusted in the model. We also fit a marginal model of the form

 logit[E(Yij|Xij)]=γ0+γ1Xij.

After fitting the conditional and the marginal models above, we plot the results, where the x-axis is the covariate values and the y-axis is the predicted probability of

. In this plot, the red curve shows the predicted values from the marginal model and the three black curves show the predicted values from the conditional each corresponding to a sub-group. As Figure 1 shows, the marginal slope that is averaged across sub-groups () is smaller than the stratum-specific slope (

). This plots clearly shows non-collapsibility in logit link.

As Figure 1 shows, the marginal coefficient estimand

is shrunk towards the null hypothesis of no covariate effect compared to the conditional coefficient estimand

. When random intercepts are latent, even under the conditional generalized linear mixed effects model (equation (1)), the coefficient estimate may shrink towards 0 compared to the true conditional estimand and that is when the distribution of the random intercepts are mis-specified. One such example is a random intercept model with true random intercepts distributed according to a bi-modal distribution. In this situation, coefficient estimates under a model that assumes random intercepts are distributed Gaussian, may still attenuate towards compared to the true conditional coefficients.

Similar to the logistic regression models, proportional hazards models are also non-collapsible. Let denote the survival time for the cluster. One example of such repeated measure survival data is the survival data on access failure among hemodialysis patients where each patient may have multiple access failures. Let be the covariate corresponding to the survival outcome. One can write a multiplicative hazard function of the form

 h(Tij|Xij,β0i)=h0(Tij)exp{β0i+β1Xij},

where is the hazard at time , is the baseline hazard at time , is the frailty term including latent cluster-specific baseline hazard multipliers, and is the log relative risk of the effect of the covariate on the risk of “death”. Under this model setting, is the conditional relative risk of the covariate that is conditioned on the frailty them .

Alternatively, a marginal proportional hazards model is of the form

 h(Tij|Xij)=h0(Tij)exp{γ1Xij},

where and are hazard and baseline hazard at time , respectively. represents the marginal log relative risk of the effect of every one unit changes in the covariate on the risk of “death”. As we shall show with synthetic data, proportional hazards models are non-collapsible where the marginal parameter shrinks toward compared to the conditional parameter .

In this paper, we explore non-collapsibility in longitudinal data when there exists latent subject-specific random intercepts. For non-collapsible logistic regression and proportional hazards models, we propose Dirichlet process mixture models (Antoniak, 1974; Sethuraman, 1994) that are capable of detecting underlying structure of data by clustering units of analysis into sub-clusters based on the distributional similarities of those units. We believe that our approach can provide insights into conditional associations between the response variable and a set of covariates given population subgroups. Using simulation studies, we compare our proposed models with the common statistical models to analyze longitudinal data. Finally, we use our proposed models to analyze data on hemodialysis patients in order to find risk factors associated with access failure among these patients.

## 2 Methods

With the focus on logistic regression and the proportional hazards models, and in the context of modeling correlated longitudinal data where repeated measures on sampling units are collected over time, we propose Dirichlet process mixture models capable of estimating conditional covariate effects when there exists latent sub-population effects. In Section 2.1, we introduce our proposed Bayesian logistic model, and in Section 2.2 we introduce our proposed Bayesian proportional hazards model.

### 2.1 A Bayesian Hierarchical Logistic Regression with Dirichlet Process Mixture Priors

The logistic link is non-collapsible. This means, when there exists latent population subgroup effects in the form of random intercepts, failure to adjust for these subgroup effects leads to coefficient estimates that are shrunk toward

compared to the true conditional estimands from a separate model with those latent random intercepts taken into account. Generalized linear mixed effects models are capable of modeling random intercepts where they typically assume random intercepts to be distributed according to a Gaussian distribution, however, distributional mis-specification of the random intercepts may still cause coefficient estimates to shrink. A model capable of detecting subgroup random intercepts, that is also robust to distributional mis-specification of random intercepts, can provide the merits of estimating the conditional coefficient estimates.

We propose a hierarchical Bayesian model that is capable of detecting latent subgroup effects that are in the form of latent random intercepts. The models is capable of estimating conditional parameters. Using a Dirichlet process mixture prior, our proposed model is robust to distributional mis-specification of the random intercepts. In our proposed model, we consider the binary data to be distributed according to

 Yij|β0i,β0,β1,Xij ∼Bernoulli(pi=β0i+β0+β1Xij),

where and with as the number of subjects and as the number of measurements on the subject, is the corresponding covariate to the outcome , is the subject-specific intercept for subject , and and are the intercept and the slope that are shared across all subjects, respectively. We consider Gaussian priors on the shared intercept and the shared slope of the form

 β0 ∼N(0,σ2β0), β1 ∼N(0,σ2β1).

We propose using the Dirichlet process mixture prior on the random intercepts , where with as the number of subjects in the data. Using the Dirichlet process mixture prior, as opposed to an explicit distributional assumption, will make the model robust to distributional mis-specification. Further, DPM prior will allow subjects to cluster based on the distributional similarities of their latent random intercepts, hence, provides higher precision in estimating those latent subject effects. We specify a Dirichlet process mixture prior on as

 β0i ∼N(μi,σ2β0i), μi|G ∼G, (3) G ∼DP(α,G0=N(0,σ20)).

The Dirichlet process mixture prior above induces a prior on

that is essentially an infinite mixture of Normal distributions that are mixed over the mean parameter. We shall refer to this model as a Mean-DPM model. Alternatively, one may set a Dirichlet process prior that induces an infinite Gaussian mixture prior that are mixed over the standard deviation parameter. Such a prior can be specified as

 β0i ∼N(0,σ2(i)β(i)0), σ2(i)β(i)0|G ∼G, (4) G ∼DP(α,G0=log−Normal(μG0,σ2G0)).

We shall refer to this model as Sigma-DPM model.

### 2.2 A Bayesian Hierarchical Proportional Hazards Model with Dirichlet Process Mixture Priors

Similar to the logistic regression models, proportional hazards models are also non-collapsible. When there exists differential subject-specific baseline hazard risk, even in the absence of confounding in the baseline hazard risks, failure to adjust for these subject-specific baseline risks in a proportional hazards model leads to coefficient estimates that are shrunk toward compared to the true conditional estimands from a separate model with those latent baseline risks taken into account. In this situation, a proportional hazards model that is capable of detecting subject-specific baseline hazards, can provide the merits of estimating the conditional coefficient estimates.

We propose a hierarchical Bayesian proportional hazards model that is capable of detecting the differential subject-specific baseline hazard risk across subjects. Our proposed model uses a Dirichlet process mixture prior on the latent subject-specific baseline hazards. The Dirichlet process mixture prior allows clustering subjects based on the distributional similarities of their baseline hazards. Further, by using the Dirichlet process mixture prior, we avoid any explicit distributional assumption on the latent subject-specific baseline hazards. In our proposed model, we consider survival times , where and with as the number of subjects and as the number of measurements on the subject, to be distributed according to a Weibull distribution of the form

 Tij|τ,β0i,β0,β1,Xij ∼Weibull(τ,θi), log(θi) =β0i+β0+β1Xij,

where is the covariate value corresponding to the survival time, is a subject-specific baseline hazard, is a shared intercept across all subjects, is a shared slope across all subjects that represents the log relative risk of every one unit increase in the covariate , is the shape parameter, and is a subject-specific scale parameter. In the model specification above, we introduced covariates into the model through the scale parameter and using the equation . For our proposed model, we consider Gaussian priors on and parameters as

 β0 ∼N(0,σ2β0), β1 ∼N(0,σ2β1),

where and are fixed numbers. We also assume a log-Normal prior on the shape parameter, , as

 τ ∼log−Normal(μτ,σ2τ),

with and as fixed numbers.

We use a Dirichlet process mixture prior for the subject-specific parameters as

 β0i ∼N(μi,σ2β0i) μi|G ∼G, (5) G ∼DP(α,G0=N(0,σ20)).

The Dirichlet process mixture prior above is essentially an infinite mixture of Normal distributions that are mixed over the mean parameter. We shall refer to this model with the Mean-DPM proportional hazards model. Alternatively, we propose a Dirichlet process mixture model that induces an infinite Normal distributions mixed overt the standard deviation parameter. This Dirichlet process prior can be written as

 β0i ∼N(0,σ2(i)β0i), σ2(i)β0i|G ∼G, (6) G ∼DP(α,G0=log−Normal(μG0,σ2G0)).

We shall refer to this new model with the above Dirichlet process mixture prior as Sigma-DPM proportional hazards model.

## 3 Simulation Studies

Using simulation studies, we investigate non-collapsibility in logistic regression and proportional hazards models. We consider three simulation scenarios: one when subject-specific intercepts are sampled independently from the standard Normal , another when subject-specific intercepts are sampled from a mixture distribution of the form

 β0iiid∼θiN(μ=−1.5,σ2=1)+(1−θi)N(μ=1.5,σ2=1),

where with where is the number of subjects. Finally, in the third scenario subject-specific intercepts are sampled from a mixture distribution of the form

 β0iiid∼θiN(μ=0,σ2=1)+(1−θi)N(μ=0,σ2=5),

where for .

We compare parameter estimation between our proposed models and some common statistical models used to analyze repeated measure binary data and survival data. For every simulation scenario, we run 1,000 simulations each with 300 subjects and 12 within-subject measurements per subject.

### 3.1 Logistic Regression Models

Unlike linear and log links, logistic link is not collapsible. In this section, using synthetic data we compare parameter estimation under our proposed Mean-DPM and Sigma-DPM Bayesian hierarchical logistic regressions we use the following common statistical models to analyze repeated measure binary data:

• Generalized linear model with a logit link (GLM): We fit a frequentist GLM model with the logit link. This technique ignores the correlation between within-subject measurements. Further, this model does not account for any subject-specific effect. Due to the ignorance of within subject correlations in this model, standard error for the estimated coefficients tend to underestimate the true standard error once the within subject correlation is taken into account.

• Generalized estimating equation (GEE): Instead of a simple generalized linear model with the logit link where all within-subject measurements are treated as independent measures, one can use the generalized estimating equation framework to account for the correlation between within-subject measurements. Despite accounting for the correlation between measurements taken on the same subject, GEE does not consider any subject-specific random effect.

• Generalized linear mixed effects model (GLMM): We also fit the frequentist generalized linear mixed effects model with subject-specific random intercepts to model binary data. GLMM is capable of taking the correlation in with-subject measurements into account. Further, GLMM is also capable of estimating subject-specific random intercepts with the assumption that the random intercepts are Normally distributed.

• Bayesian logistic regression: We also consider a Bayesian logistic regression model with a likelihood of the form

 Yij|β0,β1,Xij ∼Bernoulli(pi=β0+β1Xij),

where is the outcome of the measurement on the subject, is the measured covariate corresponding to outcome, and and are intercept and slope. We assume priors of the form

 β0 ∼N(0,σ2β0), β1 ∼N(0,σ2β1),

where and are fixed numbers.

• Hierarchical Bayesian logistic regression model: Analogous to the the GLMM model to analyze binary data, one can setup a Bayesian hierarchical model with a likelihood of the form

 Yij|β0i,β0,β1,Xij ∼Bernoulli(pi=β0+β1Xij),

where is the outcome of the measurement on the subject, is the measured covariate corresponding to outcome, is the subject-specific random intercepts where with as the number of subjects in the data, and and are intercept and slope. We assume Gaussian priors on subject-specific random intercepts of the form

 β0i ∼N(0,σ2β0i),

where . Also, Gaussian priors are assumed on coefficients , and of the form

 β0 ∼N(0,σ2β0), β1 ∼N(0,σ2β1),

where and are fixed numbers.

Figure 2 shows the histogram of the posterior median of , where from the proposed Mean-DPM hierarchical Bayesian logistic model, where is the subject-specific prior mean on the random intercept of subject (equation (3)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 within-subject measurements and applied our proposed Mean-DPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most of the posterior medians are close to zero. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are bi-modal where modes are around the true values of -1.5 and 1.5. Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . Due to the differences in the standard deviations, one may expect the histogram to be spread more widely compare to the first scenario, nonetheless, posterior medians are still centered around the true mean of .

Figure 3 shows the histogram of the posterior median of , where from the proposed Sigma-DPM hierarchical Bayesian logistic model, where is the subject-specific prior standard deviation on the random intercept of subject (equation (4)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 within-subject measurements and applied our proposed Sigma-DPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to 1. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter

. As the histogram in the middle shows, posterior medians are uniformly distributed from 3.18 to 3.28. This results make sense as now the data is widely spread with two distinct mean with a distance of

. Our Sigma-DPM model with prior mean on random intercepts has to have a larger standard deviation to provide a prior to cover all plausible subject-specific random intercepts . Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . It seems that in this case, the model converged to a standard deviation that is close . This makes sense since a when a random intercept is plausible under the prior , it’s also plausible under a prior with larger standard deviation. Hence, posterior medians converged to a large standard deviation that is plausible according to the random intercepts sampled from .

While Figure 2 and Figure 3 show the performance of our proposed models in estimating prior mean and prior standard deviation of the random intercepts, , however, the main interest is on evaluating the performance of the model on estimating the actual random intercepts. Figure 4 provides a grid of scatter plots each shows the relation between the true random intercept value and the posterior median or the estimate of random intercepts. As one can see in the plot, when random intercepts are Normally distributed according to the standard Normal distribution, in terms of estimating the latent random intercepts, our proposed Mean-DPM and Sigma-DPM models work equally well as the GLMM model and the hierarchical Bayesian logistic model with explicit Normal assumption on the random intercepts. When the reference distribution of the sampled random intercepts is not Normal, our proposed Mean-DPM and Sigma-DPM models that are robust to distributional mis-specification of the random intercepts, outperform the GLMM and the hierarchical Bayesian logistic regression in terms of estimating the latent random intercepts.

As tables (1), (2), and (3) show, coefficient estimates under marginal Bayesian model and marginal frequentist GLM and GEE shrank toward the compared to the true conditional value. The fact that in table (1) coefficient estimates under both GLM and GEE are the same is not surprising as we are using balanced data with the canonical link. By taking sub-group intercepts into account, coefficient estimates from the generalized linear mixed effect model and the hierarchical Bayesian model with Normal prior on the random intercepts are closer to the true conditional estimand compared to the marginal models. However, the coefficient estimate under these models still shrink toward no 0. The amount of shrinkage is larger under the second and the third scenarios when the distribution of random intercepts is mis-specified. Our proposed Dirichlet process mixture models, however, are capable of detecting sub-group intercepts and are robust to distributional mis-specification of the random intercepts. Coefficient estimates from our proposed models lead to the minimum mean squared error (MSE) in estimating the true conditional coefficient value.

### 3.2 Proportional Hazards Survival Models

To explore non-collapsibility in proportional hazards models and to compare coefficient estimation under our proposed Mean-DPM and Sigma-DPM models with common proportional hazards models, we consider the following proportional hazards models:

• The frequentist Cox model: We fit the frequentist Cox proportional hazards model. This model assumes an overall baseline hazards for all subjects. Using the partial likelihood techniques, Cox model does not need any baseline hazard specification as that measure gets canceled out during the estimation process. The Cox frequentist model does not take the differential baseline hazards across subjects into account. In fitting the Cox model, we take the within subject correlation between multiple within-subject measurements into account using the approach proposed by Lee et al. (1992) where we first estimate model coefficients using the independent covariance matrix and then we use a robust sandwich covariance matrix to account for within subject correlation between measurements.

• Weibull accelerated failure time model (AFT): AFT models describe survival times as a function of predictor variables. Generally, Weibull AFT models are of the form

 log(Tij)=β0+β1Xij+ϵ,

where is the survival time for the measurement on the subject, is the corresponding covariate to the outcome , and a random error such that is distributed according to a Weibull distribution with shape parameter and scale parameter . When there exists multiple measurements per subject, failure to account for the correlation between within subject measurements leads to incorrect estimated standard error of coefficients. In order to account for this intra class correlations, we take the the approach proposed by Lee et al. (1992) where first coefficients in the model are estimated using an independent covariance structure between within subject measurements and then a robust sandwich covariance matrix is used to account for the within cluster correlations.

• Bayesian marginal proportional hazards model: We consider a Bayesian proportional hazard model with a likelihood of the form

 Tij|τ,β0,β1,Xij ∼Weibull(τ,λi=β0+β1Xij),

where and are the survival times and the measured covariate on the measurement on the subject, is the shape parameter, and are the intercepts and the slope with as the log relative risk of death per every one unit change in . Similar to the previously introduced Weibull distribution for survival times, is the log of the subject-specific scale parameter. We specify a log-Normal prior on the shape parameter that is of the form

 τ ∼log−Normal(μτ,στ),

where and are fixed numbers. Also, and are assumed to have Gaussian priors of the form

 β0 ∼N(0,σ2β0), β1 ∼N(0,σ2β1).
• Hierarchical Bayesian proportional hazards model: In order to account for the differential baseline hazard across subjects, one can consider a likelihood of the form

 Tij|τ,β0i,β0,β1,Xij ∼Weibull(τ,λi=β0i+β0+β1Xij),

where can be considered as the subject-specific log baseline hazard. For this model, we assume similar to priors as the one specified for the “Bayesian marginal proportional hazards model”. Additionally, we assume , where , to have a Gaussian prior of the form:

 β0i∼N(0,σ2β0i),

where is a fixed number.

Figure 5 shows the histogram of the posterior median of , where from the proposed Mean-DPM hierarchical Bayesian proportional hazard model, where is the subject-specific prior mean on the subject-specific log baseline hazard of subject , which we represent it with and for the sake consistency, we shall refer to it as the subject-specific random intercept (equation (5)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 within-subject measurements and applied our proposed Mean-DPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to zero. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are bi-modal where modes are around the true values of -1.5 and 1.5. Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . Due the the differences in the standard deviations, one may expect the histogram to be spread more widely compare to the first scenario, nonetheless, posterior medians are still centered around the true mean of .

Figure 6 shows the histogram of the posterior median of , where from the proposed Sigma-DPM hierarchical Bayesian proportional hazard model, where is the subject-specific prior standard deviation on the random intercept of subject (equation (6)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 within-subject measurements and applied our proposed Sigma-DPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to 1. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are uniformly distributed from 3.18 to 3.28. This results make sense as now the data is widely spread with two distinct mean with a distance of . Our Sigma-DPM model with prior mean on random intercepts has to have a larger standard deviation to provide a prior to cover all plausible subject-specific random intercepts . Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . It seems that in this case, the model converged to a standard deviation that is close . This makes sense since a when a random intercept is plausible under the prior , it’s also plausible under a prior with larger standard deviation. Hence, posterior medians converged to a large standard deviation that is plausible according to the random intercepts sampled from .

Based on Figure 5 and Figure 6, our proposed models show good performance when estimating the prior mean and prior standard deviation of the random intercepts, , however, the main interest is on evaluating the performance of the proposed model on estimating the actual random intercepts. In Figure 7, we provide a grid of scatter plots each shows the relation between the true random intercept value and the posterior median estimates of those random intercepts. As Figure 7 shows, when random intercepts are distributed according to the standard Normal distribution, in terms of estimating the latent random intercepts, our proposed Mean-DPM and Sigma-DPM models work equally well as the the hierarchical Bayesian proportional hazard model with explicit Normal assumption on the random intercepts. When the reference distribution of the sampled random intercepts is not Normal, our proposed Mean-DPM and Sigma-DPM models that are robust to distributional mis-specification of the random intercepts, outperform the hierarchical Bayesian proportional hazard model in terms of estimating the latent random intercepts .

Tables 4, 5, and 6 show the results for the proportional hazards models. Coefficient estimates under the Cox model, the Bayesian marginal model, and the Weibull AFT model, all examples of marginal models, are smaller compared to the true conditional estimand and the marginal coefficient estimate under these models shrink toward .

By taking the differential subject-specific baseline hazard into account, the hierarchical Bayes model with the Normal prior on random intercepts is capable of estimating the true conditional estimand when the random intercepts are truly Normally distributed (Table 4). However, the model is not robust to distributional mis-specification as under the second and the third scenarios, the coefficient estimate of shrank toward 0 (Table 5 and Table 6).

Finally, our proposed Mean-DPM and Sigma-DPM proportional hazards models assume no explicit distributional assumption on the random intercepts, are capable of detecting subject-specific random intercepts, and are robust to distributional mis-specification of the random intercepts. Hence, our proposed DPM proportional hazard models can estimate the true conditional estimand.

## 4 Sensitivity Analysis

Using synthetic data, we showed that our proposed Mean-DPM and Sigma-DPM are capable of estimating latent cluster-specific intercepts and are robust to distributional mis-specification. Based on the simulation results presented in Section 3, in terms of MSE of estimating conditional coefficients, our proposed models outperform common frequentist and Bayesian models to analyze repeated measure binary data and survival data. In this section, we are interested in testing the sensitivity of our proposed Mean-DPM and Sigma-DPM proportional hazards models with respect to the three main parameters of the number of within unit measurements (), the difference in mean parameter and when random intercepts are simulated from the mixture of two Normal distributions of the form and , and the ratio between the two parameters and when random intercepts are simulated from the mixture of two Normal distributions of the form and .

### 4.1 Sensitivity to li

In this section, we test the sensitivity of our proposed Mean-DPM proportional hazards and Sigma-DPM proportional hazards models with respect to the number of within subject measurements and under the case where the distribution of the random random intercepts is mis-specified. We generate synthetic repeated measure binary and survival data under two scenarios - one when subject-specific intercepts are sampled from a mixture distribution of the form , and another when subject-specific intercepts are sampled from a mixture distribution of the form , where with and as the number of subjects. By changing the number of within subject measurements , we test the sensitivity of our proposed models.

Figure 8 provides a histogram of posterior medians of the prior mean on the random intercepts . The results are from our proposed Mean-DPM hierarchical Bayesian proportional hazard model that is run on a single dataset that is generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions with means or and with the standard deviation of . As one can see, as the number of within subject measurements increases, our proposed Mean-DPM can better estimate the prior mean ’s with the true values that are either -1.5 or 1.5.

Similarly, Figure 9 provides a histogram of posterior medians of the prior standard deviation on the random intercepts . The results are from our proposed Sigma-DPM hierarchical Bayesian proportional hazard model that is run on a single dataset generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions both with mean and with the standard deviation of and . As one can see, as the number of within subject measurements increases, our proposed Sigma-DPM can better estimate the prior standard deviations ’s with the true values that are either or .

As Figure8 and Figure9 show, using our proposed Mean-DPM and Sigma-DPM hierarchical Bayesian proportional hazard model, the larger within subject number of measurements, , are, the more accurate the posterior medians of prior means and prior standard deviations will be. and are the hyper-parameters that are parameters of prior distributions on the random intercepts .

Figure 10 includes scatterplots that show the relation between the true values and the posterior medians from our proposed Mean-DPM and Sigma-DPM proportional hazard models on simulated data with the true subject-specific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . As one can infer from the plots in this figure, as the number of within-subject measurements increase, posterior medians of the random intercepts provide a more accurate estimate of the true .

Similarly, Figure 11 includes similar scatterplots that show the relation between the true values and the posterior medians from our proposed Mean-DPM and Sigma-DPM proportional hazard models on data simulated with the true subject-specific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . From the plots in the figure, one can clearly realize that as the number of within-subject measurements increase, posterior medians of the random intercepts provide a more accurate estimate of the true .

Table 7 provides results on the sensitivity of our models under the first simulation scenario and table 8 provides the result on the sensitivity of our models under the second simulation scenario. As the results in Table 7 and Table 8 show, with larger number of within subject measurements , our proposed models can better estimate the latent random intercepts, and hence, lead to a smaller error in estimating the true conditional coefficient estimate.

### 4.2 Sensitivity to |μ2−μ1|

In this section, we test the sensitivity of our proposed Mean-DPM and Sigma-DPM proportional hazards models with respect to the distance between the mean parameters and , where and are the mean parameters of two Normal distributions that are used to simulate subject-specific random intercepts. Subject-specific random intercepts are sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . In this section, we evaluate the sensitivity of our proposed Mean-DPM and Sigma-DPM proportional hazards models with respect to the distance between the means and . In particular, we consider five cases where the distance is half of the standard deviation shared between both components, , or is equal to the , or is two times bigger than the , or three times bigger, or four times bigger (Table 9).

Figure 12 provides a histogram of posterior medians of the prior mean on the random intercepts . The results are from our proposed Mean-DPM hierarchical Bayesian proportional hazard model that is run on a single dataset that is generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions with means or and with the standard deviation of . In order to test the sensitivity of our models with respect to the distance between and , we consider 5 cases based on the distance between and . Those cases are when the distance between the means is half of the standard deviation , equal to , twice of the , three times of the , or four times of the .

Figure 13 includes scatterplots that show the relation between the true values and the posterior medians from our proposed Mean-DPM and Sigma-DPM proportional hazard models on simulated data with the true subject-specific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . To test the sensitivity of our proposed models with respect to the distance between and , we consider 5 different cases. Those cases are when the distance between the means are , , , , and .