1 Introduction
Statistically, noncollapsibility represents the setting where the marginal measure of association between two random variables
and , differs from the conditional measure of association between these two random variables, after conditioning upon the levels of a third random variable , where is not a confounder, i.e., is associated with one random variable but not the other (Greenland et al., 1999). In this situation, a careful attention is required to properly interpret a conditional association as opposed to a marginal association. Further, one should note that in the absence of confounding, both the marginal association and the conditional association, despite being different, are unbiased. Hence, a clear distinction between confounding and noncollapsibility is required.Similarly, noncollapsibility exists in a regression setting when the marginal association between a predictor variable,
, and a response variable,
, differs from the conditional association in a separate regression model where a third variable is adjusted in the model. As before, we assume that is not a confounder so it is only associated with the response variable.In general, one needs to consider the relative importance of estimating the marginal association between the two random variables and , in contrast to the conditional association given a third random variable . When
is observed, it is possible to heuristically compare the difference between the marginal and the conditional associations by simply comparing the adjusted and unadjusted estimated associations. However, when
is latent, analysts generally default to estimating a marginal association without giving any thoughts to the relative merits of the two estimands.In longitudinal studies, noncollapsibility has especially garnered some attention when comparing the estimates from the generalized linear mixed model with the estimates from the generalized estimating equation model, where the former provides conditional estimates that are conditioned upon the subjectspecific random effects, and the latter provides estimates that are marginalized over all subjects. Longitudinal data can be considered as a special case of the repeated measure data with measurements indexed by time. We shall use the words “longitudinal data” and “repeated measure data” interchangeably.
As a simple case, one may consider subjects, each with within subject measurements with and as the outcome and the covariate for the measurement on the subject, respectively. One can write a generalized linear mixed effect model with random intercepts of the form
where the mean and the covariate and the subjectspecific random intercept are linked using a link function , where
(1) 
In this model, and are intercept and slope that are shared across all subjects. In a typical mixed effects model, , where
, are assumed to be independent and Normally distributed. Under this model setting, conditioned upon the subjectspecific random intercepts,
, represents the conditional association between the random variable and the outcome, .Alternatively, one may consider a model of the form
where the mean is related to the covariate through a link function , where
(2) 
In this model, is the intercept and is the slope where both are shared across all subjects. Under this model setting, represents the marginal association between the covariate and the outcome, .
Generally, even with random intercepts with no confounding effect, the conditional covariate effect (equation (1)) and the marginal covariate effect (equation (2)) need not be equal. Several authors including Gail et al. (1984), Gail (1986) showed that with nonconfounding subjectspecific random intercept, , is guaranteed to be collapsible, if is either the identity link or the log link. That means with the identity or the log link and in the absence of confounding, equality of the conditional covariate effect and the marginal covariate effect is guaranteed. Hence, we are primarily interested in studying noncollapsibility in logistic and proportional hazards models.
To show the noncollapsibility effect in the logistic regression model, we generated synthetic data, where we considered three different groups with different intercepts of
, , . Independently of the intercepts, we generated covariate , where is simulated from the standard Normal . Using the a logistic link and with a true coefficient values of , we generated binary outcomes. We then fit a conditional model of the formwhere is a binary outcome for the measurement on the cluster, is the covariate value corresponding to the outcome , and is the true value of the clusterspecific intercept that is directly adjusted in the model. We also fit a marginal model of the form
After fitting the conditional and the marginal models above, we plot the results, where the xaxis is the covariate values and the yaxis is the predicted probability of
. In this plot, the red curve shows the predicted values from the marginal model and the three black curves show the predicted values from the conditional each corresponding to a subgroup. As Figure 1 shows, the marginal slope that is averaged across subgroups () is smaller than the stratumspecific slope (). This plots clearly shows noncollapsibility in logit link.
As Figure 1 shows, the marginal coefficient estimand
is shrunk towards the null hypothesis of no covariate effect compared to the conditional coefficient estimand
. When random intercepts are latent, even under the conditional generalized linear mixed effects model (equation (1)), the coefficient estimate may shrink towards 0 compared to the true conditional estimand and that is when the distribution of the random intercepts are misspecified. One such example is a random intercept model with true random intercepts distributed according to a bimodal distribution. In this situation, coefficient estimates under a model that assumes random intercepts are distributed Gaussian, may still attenuate towards compared to the true conditional coefficients.Similar to the logistic regression models, proportional hazards models are also noncollapsible. Let denote the survival time for the cluster. One example of such repeated measure survival data is the survival data on access failure among hemodialysis patients where each patient may have multiple access failures. Let be the covariate corresponding to the survival outcome. One can write a multiplicative hazard function of the form
where is the hazard at time , is the baseline hazard at time , is the frailty term including latent clusterspecific baseline hazard multipliers, and is the log relative risk of the effect of the covariate on the risk of “death”. Under this model setting, is the conditional relative risk of the covariate that is conditioned on the frailty them .
Alternatively, a marginal proportional hazards model is of the form
where and are hazard and baseline hazard at time , respectively. represents the marginal log relative risk of the effect of every one unit changes in the covariate on the risk of “death”. As we shall show with synthetic data, proportional hazards models are noncollapsible where the marginal parameter shrinks toward compared to the conditional parameter .
In this paper, we explore noncollapsibility in longitudinal data when there exists latent subjectspecific random intercepts. For noncollapsible logistic regression and proportional hazards models, we propose Dirichlet process mixture models (Antoniak, 1974; Sethuraman, 1994) that are capable of detecting underlying structure of data by clustering units of analysis into subclusters based on the distributional similarities of those units. We believe that our approach can provide insights into conditional associations between the response variable and a set of covariates given population subgroups. Using simulation studies, we compare our proposed models with the common statistical models to analyze longitudinal data. Finally, we use our proposed models to analyze data on hemodialysis patients in order to find risk factors associated with access failure among these patients.
2 Methods
With the focus on logistic regression and the proportional hazards models, and in the context of modeling correlated longitudinal data where repeated measures on sampling units are collected over time, we propose Dirichlet process mixture models capable of estimating conditional covariate effects when there exists latent subpopulation effects. In Section 2.1, we introduce our proposed Bayesian logistic model, and in Section 2.2 we introduce our proposed Bayesian proportional hazards model.
2.1 A Bayesian Hierarchical Logistic Regression with Dirichlet Process Mixture Priors
The logistic link is noncollapsible. This means, when there exists latent population subgroup effects in the form of random intercepts, failure to adjust for these subgroup effects leads to coefficient estimates that are shrunk toward
compared to the true conditional estimands from a separate model with those latent random intercepts taken into account. Generalized linear mixed effects models are capable of modeling random intercepts where they typically assume random intercepts to be distributed according to a Gaussian distribution, however, distributional misspecification of the random intercepts may still cause coefficient estimates to shrink. A model capable of detecting subgroup random intercepts, that is also robust to distributional misspecification of random intercepts, can provide the merits of estimating the conditional coefficient estimates.
We propose a hierarchical Bayesian model that is capable of detecting latent subgroup effects that are in the form of latent random intercepts. The models is capable of estimating conditional parameters. Using a Dirichlet process mixture prior, our proposed model is robust to distributional misspecification of the random intercepts. In our proposed model, we consider the binary data to be distributed according to
where and with as the number of subjects and as the number of measurements on the subject, is the corresponding covariate to the outcome , is the subjectspecific intercept for subject , and and are the intercept and the slope that are shared across all subjects, respectively. We consider Gaussian priors on the shared intercept and the shared slope of the form
We propose using the Dirichlet process mixture prior on the random intercepts , where with as the number of subjects in the data. Using the Dirichlet process mixture prior, as opposed to an explicit distributional assumption, will make the model robust to distributional misspecification. Further, DPM prior will allow subjects to cluster based on the distributional similarities of their latent random intercepts, hence, provides higher precision in estimating those latent subject effects. We specify a Dirichlet process mixture prior on as
(3)  
The Dirichlet process mixture prior above induces a prior on
that is essentially an infinite mixture of Normal distributions that are mixed over the mean parameter. We shall refer to this model as a MeanDPM model. Alternatively, one may set a Dirichlet process prior that induces an infinite Gaussian mixture prior that are mixed over the standard deviation parameter. Such a prior can be specified as
(4)  
We shall refer to this model as SigmaDPM model.
2.2 A Bayesian Hierarchical Proportional Hazards Model with Dirichlet Process Mixture Priors
Similar to the logistic regression models, proportional hazards models are also noncollapsible. When there exists differential subjectspecific baseline hazard risk, even in the absence of confounding in the baseline hazard risks, failure to adjust for these subjectspecific baseline risks in a proportional hazards model leads to coefficient estimates that are shrunk toward compared to the true conditional estimands from a separate model with those latent baseline risks taken into account. In this situation, a proportional hazards model that is capable of detecting subjectspecific baseline hazards, can provide the merits of estimating the conditional coefficient estimates.
We propose a hierarchical Bayesian proportional hazards model that is capable of detecting the differential subjectspecific baseline hazard risk across subjects. Our proposed model uses a Dirichlet process mixture prior on the latent subjectspecific baseline hazards. The Dirichlet process mixture prior allows clustering subjects based on the distributional similarities of their baseline hazards. Further, by using the Dirichlet process mixture prior, we avoid any explicit distributional assumption on the latent subjectspecific baseline hazards. In our proposed model, we consider survival times , where and with as the number of subjects and as the number of measurements on the subject, to be distributed according to a Weibull distribution of the form
where is the covariate value corresponding to the survival time, is a subjectspecific baseline hazard, is a shared intercept across all subjects, is a shared slope across all subjects that represents the log relative risk of every one unit increase in the covariate , is the shape parameter, and is a subjectspecific scale parameter. In the model specification above, we introduced covariates into the model through the scale parameter and using the equation . For our proposed model, we consider Gaussian priors on and parameters as
where and are fixed numbers. We also assume a logNormal prior on the shape parameter, , as
with and as fixed numbers.
We use a Dirichlet process mixture prior for the subjectspecific parameters as
(5)  
The Dirichlet process mixture prior above is essentially an infinite mixture of Normal distributions that are mixed over the mean parameter. We shall refer to this model with the MeanDPM proportional hazards model. Alternatively, we propose a Dirichlet process mixture model that induces an infinite Normal distributions mixed overt the standard deviation parameter. This Dirichlet process prior can be written as
(6)  
We shall refer to this new model with the above Dirichlet process mixture prior as SigmaDPM proportional hazards model.
3 Simulation Studies
Using simulation studies, we investigate noncollapsibility in logistic regression and proportional hazards models. We consider three simulation scenarios: one when subjectspecific intercepts are sampled independently from the standard Normal , another when subjectspecific intercepts are sampled from a mixture distribution of the form
where with where is the number of subjects. Finally, in the third scenario subjectspecific intercepts are sampled from a mixture distribution of the form
where for .
We compare parameter estimation between our proposed models and some common statistical models used to analyze repeated measure binary data and survival data. For every simulation scenario, we run 1,000 simulations each with 300 subjects and 12 withinsubject measurements per subject.
3.1 Logistic Regression Models
Unlike linear and log links, logistic link is not collapsible. In this section, using synthetic data we compare parameter estimation under our proposed MeanDPM and SigmaDPM Bayesian hierarchical logistic regressions we use the following common statistical models to analyze repeated measure binary data:

Generalized linear model with a logit link (GLM): We fit a frequentist GLM model with the logit link. This technique ignores the correlation between withinsubject measurements. Further, this model does not account for any subjectspecific effect. Due to the ignorance of within subject correlations in this model, standard error for the estimated coefficients tend to underestimate the true standard error once the within subject correlation is taken into account.

Generalized estimating equation (GEE): Instead of a simple generalized linear model with the logit link where all withinsubject measurements are treated as independent measures, one can use the generalized estimating equation framework to account for the correlation between withinsubject measurements. Despite accounting for the correlation between measurements taken on the same subject, GEE does not consider any subjectspecific random effect.

Generalized linear mixed effects model (GLMM): We also fit the frequentist generalized linear mixed effects model with subjectspecific random intercepts to model binary data. GLMM is capable of taking the correlation in withsubject measurements into account. Further, GLMM is also capable of estimating subjectspecific random intercepts with the assumption that the random intercepts are Normally distributed.

Bayesian logistic regression: We also consider a Bayesian logistic regression model with a likelihood of the form
where is the outcome of the measurement on the subject, is the measured covariate corresponding to outcome, and and are intercept and slope. We assume priors of the form
where and are fixed numbers.

Hierarchical Bayesian logistic regression model: Analogous to the the GLMM model to analyze binary data, one can setup a Bayesian hierarchical model with a likelihood of the form
where is the outcome of the measurement on the subject, is the measured covariate corresponding to outcome, is the subjectspecific random intercepts where with as the number of subjects in the data, and and are intercept and slope. We assume Gaussian priors on subjectspecific random intercepts of the form
where . Also, Gaussian priors are assumed on coefficients , and of the form
where and are fixed numbers.
Figure 2 shows the histogram of the posterior median of , where from the proposed MeanDPM hierarchical Bayesian logistic model, where is the subjectspecific prior mean on the random intercept of subject (equation (3)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 withinsubject measurements and applied our proposed MeanDPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most of the posterior medians are close to zero. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are bimodal where modes are around the true values of 1.5 and 1.5. Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . Due to the differences in the standard deviations, one may expect the histogram to be spread more widely compare to the first scenario, nonetheless, posterior medians are still centered around the true mean of .
Figure 3 shows the histogram of the posterior median of , where from the proposed SigmaDPM hierarchical Bayesian logistic model, where is the subjectspecific prior standard deviation on the random intercept of subject (equation (4)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 withinsubject measurements and applied our proposed SigmaDPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to 1. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter
. As the histogram in the middle shows, posterior medians are uniformly distributed from 3.18 to 3.28. This results make sense as now the data is widely spread with two distinct mean with a distance of
. Our SigmaDPM model with prior mean on random intercepts has to have a larger standard deviation to provide a prior to cover all plausible subjectspecific random intercepts . Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . It seems that in this case, the model converged to a standard deviation that is close . This makes sense since a when a random intercept is plausible under the prior , it’s also plausible under a prior with larger standard deviation. Hence, posterior medians converged to a large standard deviation that is plausible according to the random intercepts sampled from .While Figure 2 and Figure 3 show the performance of our proposed models in estimating prior mean and prior standard deviation of the random intercepts, , however, the main interest is on evaluating the performance of the model on estimating the actual random intercepts. Figure 4 provides a grid of scatter plots each shows the relation between the true random intercept value and the posterior median or the estimate of random intercepts. As one can see in the plot, when random intercepts are Normally distributed according to the standard Normal distribution, in terms of estimating the latent random intercepts, our proposed MeanDPM and SigmaDPM models work equally well as the GLMM model and the hierarchical Bayesian logistic model with explicit Normal assumption on the random intercepts. When the reference distribution of the sampled random intercepts is not Normal, our proposed MeanDPM and SigmaDPM models that are robust to distributional misspecification of the random intercepts, outperform the GLMM and the hierarchical Bayesian logistic regression in terms of estimating the latent random intercepts.
As tables (1), (2), and (3) show, coefficient estimates under marginal Bayesian model and marginal frequentist GLM and GEE shrank toward the compared to the true conditional value. The fact that in table (1) coefficient estimates under both GLM and GEE are the same is not surprising as we are using balanced data with the canonical link. By taking subgroup intercepts into account, coefficient estimates from the generalized linear mixed effect model and the hierarchical Bayesian model with Normal prior on the random intercepts are closer to the true conditional estimand compared to the marginal models. However, the coefficient estimate under these models still shrink toward no 0. The amount of shrinkage is larger under the second and the third scenarios when the distribution of random intercepts is misspecified. Our proposed Dirichlet process mixture models, however, are capable of detecting subgroup intercepts and are robust to distributional misspecification of the random intercepts. Coefficient estimates from our proposed models lead to the minimum mean squared error (MSE) in estimating the true conditional coefficient value.
SD  MSE  

GLM  0.845  0.031  0.025 
GEE  0.845  0.031  0.025 
Bayesian Logistic Reg.  0.847  0.031  0.025 
GLMM  0.951  0.032  0.004 
Hierarchical Bayes Logistic Reg.  0.947  0.035  0.005 
MeanDPM Hierarchical Logistic Reg.  1.001  0.036  0.001 
SigmaDPM Hierarchical Logistic Reg.  1.003  0.037  0.001 
SD  MSE  

GLM  0.626  0.027  0.141 
GEE  0.627  0.275  0.140 
Bayesian Logistic Reg.  0.626  0.027  0.141 
GLMM  0.938  0.034  0.005 
Hierarchical Bayes Logistic Reg.  0.931  0.033  0.006 
MeanDPM Hierarchical Logistic Reg.  1.006  0.042  0.002 
SigmaDPM Hierarchical Logistic Reg.  0.978  0.042  0.002 
SD  MSE  

GLM  0.702  0.028  0.090 
GEE  0.701  0.030  0.090 
Bayesian Logistic Reg.  0.700  0.028  0.091 
GLMM  0.946  0.033  0.004 
Hierarchical Bayes Logistic Reg.  0.935  0.034  0.005 
MeanDPM Hierarchical Logistic Reg.  0.998  0.041  0.001 
SigmaDPM Hierarchical Logistic Reg.  0.994  0.040  0.001 
3.2 Proportional Hazards Survival Models
To explore noncollapsibility in proportional hazards models and to compare coefficient estimation under our proposed MeanDPM and SigmaDPM models with common proportional hazards models, we consider the following proportional hazards models:

The frequentist Cox model: We fit the frequentist Cox proportional hazards model. This model assumes an overall baseline hazards for all subjects. Using the partial likelihood techniques, Cox model does not need any baseline hazard specification as that measure gets canceled out during the estimation process. The Cox frequentist model does not take the differential baseline hazards across subjects into account. In fitting the Cox model, we take the within subject correlation between multiple withinsubject measurements into account using the approach proposed by Lee et al. (1992) where we first estimate model coefficients using the independent covariance matrix and then we use a robust sandwich covariance matrix to account for within subject correlation between measurements.

Weibull accelerated failure time model (AFT): AFT models describe survival times as a function of predictor variables. Generally, Weibull AFT models are of the form
where is the survival time for the measurement on the subject, is the corresponding covariate to the outcome , and a random error such that is distributed according to a Weibull distribution with shape parameter and scale parameter . When there exists multiple measurements per subject, failure to account for the correlation between within subject measurements leads to incorrect estimated standard error of coefficients. In order to account for this intra class correlations, we take the the approach proposed by Lee et al. (1992) where first coefficients in the model are estimated using an independent covariance structure between within subject measurements and then a robust sandwich covariance matrix is used to account for the within cluster correlations.

Bayesian marginal proportional hazards model: We consider a Bayesian proportional hazard model with a likelihood of the form
where and are the survival times and the measured covariate on the measurement on the subject, is the shape parameter, and are the intercepts and the slope with as the log relative risk of death per every one unit change in . Similar to the previously introduced Weibull distribution for survival times, is the log of the subjectspecific scale parameter. We specify a logNormal prior on the shape parameter that is of the form
where and are fixed numbers. Also, and are assumed to have Gaussian priors of the form

Hierarchical Bayesian proportional hazards model: In order to account for the differential baseline hazard across subjects, one can consider a likelihood of the form
where can be considered as the subjectspecific log baseline hazard. For this model, we assume similar to priors as the one specified for the “Bayesian marginal proportional hazards model”. Additionally, we assume , where , to have a Gaussian prior of the form:
where is a fixed number.
Figure 5 shows the histogram of the posterior median of , where from the proposed MeanDPM hierarchical Bayesian proportional hazard model, where is the subjectspecific prior mean on the subjectspecific log baseline hazard of subject , which we represent it with and for the sake consistency, we shall refer to it as the subjectspecific random intercept (equation (5)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 withinsubject measurements and applied our proposed MeanDPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to zero. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are bimodal where modes are around the true values of 1.5 and 1.5. Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . Due the the differences in the standard deviations, one may expect the histogram to be spread more widely compare to the first scenario, nonetheless, posterior medians are still centered around the true mean of .
Figure 6 shows the histogram of the posterior median of , where from the proposed SigmaDPM hierarchical Bayesian proportional hazard model, where is the subjectspecific prior standard deviation on the random intercept of subject (equation (6)). Under each simulation scenario, we simulated a single dataset with 300 subjects each with 12 withinsubject measurements and applied our proposed SigmaDPM model. The plot to the left shows a histogram of the posterior median of when data are simulated with random intercept sampled from the standard Normal . As the histogram shows, most the posterior medians are close to 1. The histogram in the model shows the distribution of the posterior median when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form , where is distributed Bernoulli with parameter . As the histogram in the middle shows, posterior medians are uniformly distributed from 3.18 to 3.28. This results make sense as now the data is widely spread with two distinct mean with a distance of . Our SigmaDPM model with prior mean on random intercepts has to have a larger standard deviation to provide a prior to cover all plausible subjectspecific random intercepts . Finally, the histogram to the right shows the posterior median of when data are simulated with random intercepts sampled from mixture of two Normal distributions of the form . It seems that in this case, the model converged to a standard deviation that is close . This makes sense since a when a random intercept is plausible under the prior , it’s also plausible under a prior with larger standard deviation. Hence, posterior medians converged to a large standard deviation that is plausible according to the random intercepts sampled from .
Based on Figure 5 and Figure 6, our proposed models show good performance when estimating the prior mean and prior standard deviation of the random intercepts, , however, the main interest is on evaluating the performance of the proposed model on estimating the actual random intercepts. In Figure 7, we provide a grid of scatter plots each shows the relation between the true random intercept value and the posterior median estimates of those random intercepts. As Figure 7 shows, when random intercepts are distributed according to the standard Normal distribution, in terms of estimating the latent random intercepts, our proposed MeanDPM and SigmaDPM models work equally well as the the hierarchical Bayesian proportional hazard model with explicit Normal assumption on the random intercepts. When the reference distribution of the sampled random intercepts is not Normal, our proposed MeanDPM and SigmaDPM models that are robust to distributional misspecification of the random intercepts, outperform the hierarchical Bayesian proportional hazard model in terms of estimating the latent random intercepts .
Tables 4, 5, and 6 show the results for the proportional hazards models. Coefficient estimates under the Cox model, the Bayesian marginal model, and the Weibull AFT model, all examples of marginal models, are smaller compared to the true conditional estimand and the marginal coefficient estimate under these models shrink toward .
By taking the differential subjectspecific baseline hazard into account, the hierarchical Bayes model with the Normal prior on random intercepts is capable of estimating the true conditional estimand when the random intercepts are truly Normally distributed (Table 4). However, the model is not robust to distributional misspecification as under the second and the third scenarios, the coefficient estimate of shrank toward 0 (Table 5 and Table 6).
Finally, our proposed MeanDPM and SigmaDPM proportional hazards models assume no explicit distributional assumption on the random intercepts, are capable of detecting subjectspecific random intercepts, and are robust to distributional misspecification of the random intercepts. Hence, our proposed DPM proportional hazard models can estimate the true conditional estimand.
SD  MSE  

Frequentist Cox Model  0.661  0.089  0.123 
Weibull_AFT  0.709  0.096  0.095 
Bayesian Marginal Proportional Hazard Model  0.700  0.038  0.100 
Hierarchical Bayesian Proportional Hazard Model  1.015  0.122  0.014 
MeanDPM Proportional Hazard Model  0.995  0.124  0.015 
SigmaDPM Proportional Hazard Model  0.999  0.122  0.016 
SD  MSE  

Frequentist Cox Model  0.471  0.101  0.290 
Weibull_AFT  0.507  0.107  0.255 
Bayesian Marginal Proportional Hazard Model  0.506  0.038  0.257 
Hierarchical Bayesian Proportional Hazard Model  0.898  0.122  0.047 
MeanDPM Proportional Hazard Model  1.002  0.170  0.029 
SigmaDPM Proportional Hazard Model  1.000  0.209  0.033 
SD  MSE  

Frequentist Cox Model  0.460  0.107  0.303 
Weibull_AFT  0.481  0.109  0.292 
Bayesian Marginal Proportional Hazard Model  0.483  0.038  0.290 
Hierarchical Bayesian Proportional Hazard Model  0.924  0.121  0.037 
MeanDPM Proportional Hazard Model  1.014  0.184  0.029 
SigmaDPM Proportional Hazard Model  0.997  0.206  0.046 
4 Sensitivity Analysis
Using synthetic data, we showed that our proposed MeanDPM and SigmaDPM are capable of estimating latent clusterspecific intercepts and are robust to distributional misspecification. Based on the simulation results presented in Section 3, in terms of MSE of estimating conditional coefficients, our proposed models outperform common frequentist and Bayesian models to analyze repeated measure binary data and survival data. In this section, we are interested in testing the sensitivity of our proposed MeanDPM and SigmaDPM proportional hazards models with respect to the three main parameters of the number of within unit measurements (), the difference in mean parameter and when random intercepts are simulated from the mixture of two Normal distributions of the form and , and the ratio between the two parameters and when random intercepts are simulated from the mixture of two Normal distributions of the form and .
4.1 Sensitivity to
In this section, we test the sensitivity of our proposed MeanDPM proportional hazards and SigmaDPM proportional hazards models with respect to the number of within subject measurements and under the case where the distribution of the random random intercepts is misspecified. We generate synthetic repeated measure binary and survival data under two scenarios  one when subjectspecific intercepts are sampled from a mixture distribution of the form , and another when subjectspecific intercepts are sampled from a mixture distribution of the form , where with and as the number of subjects. By changing the number of within subject measurements , we test the sensitivity of our proposed models.
Figure 8 provides a histogram of posterior medians of the prior mean on the random intercepts . The results are from our proposed MeanDPM hierarchical Bayesian proportional hazard model that is run on a single dataset that is generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions with means or and with the standard deviation of . As one can see, as the number of within subject measurements increases, our proposed MeanDPM can better estimate the prior mean ’s with the true values that are either 1.5 or 1.5.
Similarly, Figure 9 provides a histogram of posterior medians of the prior standard deviation on the random intercepts . The results are from our proposed SigmaDPM hierarchical Bayesian proportional hazard model that is run on a single dataset generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions both with mean and with the standard deviation of and . As one can see, as the number of within subject measurements increases, our proposed SigmaDPM can better estimate the prior standard deviations ’s with the true values that are either or .
As Figure8 and Figure9 show, using our proposed MeanDPM and SigmaDPM hierarchical Bayesian proportional hazard model, the larger within subject number of measurements, , are, the more accurate the posterior medians of prior means and prior standard deviations will be. and are the hyperparameters that are parameters of prior distributions on the random intercepts .
Figure 10 includes scatterplots that show the relation between the true values and the posterior medians from our proposed MeanDPM and SigmaDPM proportional hazard models on simulated data with the true subjectspecific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . As one can infer from the plots in this figure, as the number of withinsubject measurements increase, posterior medians of the random intercepts provide a more accurate estimate of the true .
Similarly, Figure 11 includes similar scatterplots that show the relation between the true values and the posterior medians from our proposed MeanDPM and SigmaDPM proportional hazard models on data simulated with the true subjectspecific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . From the plots in the figure, one can clearly realize that as the number of withinsubject measurements increase, posterior medians of the random intercepts provide a more accurate estimate of the true .
Table 7 provides results on the sensitivity of our models under the first simulation scenario and table 8 provides the result on the sensitivity of our models under the second simulation scenario. As the results in Table 7 and Table 8 show, with larger number of within subject measurements , our proposed models can better estimate the latent random intercepts, and hence, lead to a smaller error in estimating the true conditional coefficient estimate.
MeanDPM  SigmaDPM  

SD  MSE  SD  MSE  
1  0.998  0.223  0.0291  0.754  0.228  0.104 
3  1.014  0.200  0.0283  0.939  0.218  0.044 
6  1.010  0.182  0.033  0.958  0.211  0.039 
12  1.002  0.170  0.029  1.000  0.209  0.033 
MeanDPM  SigmaDPM  

SD  MSE  SD  MSE  
1  0.939  0.321  0.077  0.803  0.237  0.080 
3  0.984  0.201  0.031  0.947  0.201  0.039 
6  0.987  0.190  0.039  0.995  0.210  0.047 
12  1.014  0.184  0.046  0.997  0.206  0.046 
4.2 Sensitivity to
In this section, we test the sensitivity of our proposed MeanDPM and SigmaDPM proportional hazards models with respect to the distance between the mean parameters and , where and are the mean parameters of two Normal distributions that are used to simulate subjectspecific random intercepts. Subjectspecific random intercepts are sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . In this section, we evaluate the sensitivity of our proposed MeanDPM and SigmaDPM proportional hazards models with respect to the distance between the means and . In particular, we consider five cases where the distance is half of the standard deviation shared between both components, , or is equal to the , or is two times bigger than the , or three times bigger, or four times bigger (Table 9).
Figure 12 provides a histogram of posterior medians of the prior mean on the random intercepts . The results are from our proposed MeanDPM hierarchical Bayesian proportional hazard model that is run on a single dataset that is generated under the simulation scenario where random intercepts ’s are sampled from an equally weighted mixture of two Normal distributions with means or and with the standard deviation of . In order to test the sensitivity of our models with respect to the distance between and , we consider 5 cases based on the distance between and . Those cases are when the distance between the means is half of the standard deviation , equal to , twice of the , three times of the , or four times of the .
Figure 13 includes scatterplots that show the relation between the true values and the posterior medians from our proposed MeanDPM and SigmaDPM proportional hazard models on simulated data with the true subjectspecific random intercepts sampled from a mixture of two Normal distributions of the form , where is distributed Bernoulli with the parameter . To test the sensitivity of our proposed models with respect to the distance between and , we consider 5 different cases. Those cases are when the distance between the means are , , , , and .
Comments
There are no comments yet.