In clinical trials, continuous outcomes are often measured at multiple time points besides baseline to form a picture of treatment effects or the progression of disease. However, binary outcomes, defined by the proportion of subjects who fall below or above a clinically relevant threshold, are frequently of clinical interest. For example, in diabetes clinical trials, clinicians not only focus on the mean of hemoglobin A1c (HbA1c) but are also interested in the proportion of subjects reaching an HbA1c target, such as HbA1c < 7%. davies2018management, garber2013american To analyze such data, we typically dichotomize the continuous outcome measures and then apply statistical methods based on binary outcomes.
Incomplete data are quite common in longitudinal measurements, for example, when some subjects drop out from the study early or some measurements are not obtained due to technical errors or subjects’ unavailability. Rubin rubin1976inference classified missing mechanisms into three categories. When missingness does not depend on any values in the data set, observed or unobserved, the mechanism is called missing completely at random (MCAR). A mechanism where missingness is related to observed outcomes, but not to unobserved outcomes given the observed outcomes, is called missing at random (MAR). When missingness depends on unobserved outcomes, the mechanism is said to be missing not at random (MNAR). Both MCAR and MAR are considered “ignorable” when mild regularity conditions are satisfied, since likelihood-based inferences can be made by analyzing the observed outcomes without explicitly modeling the missing mechanism. In contrast, an MNAR mechanism is “non-ignorable” in this situation.
MAR is usually a reasonable assumption in clinical trials when drawing inference from sensitivity analyses of incomplete longitudinal outcomes. molenberghs2004analyzing The traditional approach to analyzing longitudinal binary outcomes under MAR uses the generalized linear mixed model (GLMM), which can be easily implemented by PROC GLIMMIX in SAS. However, GLMMs use only the dichotomized variable to model the missingness, which may not capture the full mechanism of missingness, leading to a potential loss of information. In addition, GLMMs sometimes fail to converge or produce unstable estimates. Alternatively, assuming MAR, one can first impute the continuous outcome by the multiple imputation (MI) method and then categorize it into the binary outcome. This approach is intuitively appealing because continuous outcomes can reasonably be expected to provide more information than binary outcomes, royston2006dichotomizing potentially resulting in more accurate imputation than indirectly imputing the dichotomized binary outcome in GLMMs. Another advantage of the MI method is that imputation strategies can be customized to create an estimator consistent with the potential outcome of interest and the corresponding estimand, ich2020e9, lipkovich2020causal unlike the GLMM approach, which always assumes all missing potential outcomes follow the same pattern as the observed values. For example, when we assume the missing potential outcomes after certain intercurrent events will be similar to the outcomes for the retrieved dropouts (patients who had similar intercurrent events but no missing values), retrieved dropout imputation may be used to impute these missing values. CHMP2010
Several researchers have compared statistical methods for analyzing longitudinal binary outcomes created from underlying continuous outcomes assuming MAR. Lipkovich et al lipkovich2005multiple compared the performance of multiple imputation with generalized estimating equations and restricted pseudo-likelihood in estimating overall treatment effects and treatment differences at the last scheduled visit. Based on their simulations, with moderate to high dropout rates (40% to 60%), multiple imputation led to less biased and more precise estimates of treatment differences for binary outcomes based on underlying continuous scores. In addition, Floden and Bell floden2019imputation compared multiple imputation of the continuous and dichotomous forms of the outcome and concluded that, with a higher rate of missingness, imputing the continuous outcome was less biased and had better-controlled coverage probabilities of the 95% confidence interval than imputing the binary outcome. Grobler and Lee grobler2020multiple further compared the performance of five different imputation methods and, based on their simulation studies, recommended imputation using the substantive-model-compatible fully conditional specification method.
All the work mentioned above related to MI is based on the assumption that the continuous outcome follows a (multivariate) normal distribution. This assumption, however, is often violated in clinical trials. In this article, we compared the performance of GLMM with MI based on the continuous outcome, where the continuous outcome follows a multivariate normal distribution, a multivariate t-distribution with three degrees of freedom, or a multivariate log-normal distribution, under the MCAR or the MAR missing mechanism. Additional simulations were conducted with data resampled from a real clinical trial. Furthermore, we applied the two methods to 29 clinical trials from three diabetes compounds. Both methods were evaluated by the estimation of the risk difference (RD) and the logarithm of the odds ratio (OR) at the last visit between treatment arms.
The ICH E9 (R1) Addendum ich2020e9 provides a framework to define estimands, especially strategies to handle intercurrent events. Missing values can be either truly unobserved data or the result of censoring data collected after intercurrent events under a hypothetical strategy. In the setting of this article, we assume the outcome of interest is the potential outcome if patients would continue the randomized treatment without intercurrent events (ie, using the controlled direct hypothetical [CDH] strategy qu2021implementation).
Let $Y_{ij}$ denote the continuous outcome measurement for subject $i$ ($i = 1, \ldots, n$) at time point $j$ ($j = 0, 1, \ldots, J$, with $j = 0$ denoting baseline). Without loss of generality, assume that lower values of $Y_{ij}$ indicate better outcomes. Let
$$Z_{ij} = I(Y_{ij} < c)$$
denote the corresponding dichotomized binary variable, such that $Z_{ij} = 1$ if $Y_{ij} < c$ or $Z_{ij} = 0$ otherwise, where $c$ is a clinically relevant threshold and $I(\cdot)$ is the indicator function. Using the potential outcome notation introduced by Lipkovich et al., lipkovich2020causal let $Y_{ij}^{(a, a')}$ denote the potential outcome if a patient is initially assigned to treatment $a$ and follows a treatment $a'$ after the intercurrent event. Under the CDH strategy, $a'$ is always equal to $a$ and we simplify the notation as $Y_{ij}^{(a)}$. Let $\mu_a$ denote the true mean for treatment group $a$ ($a = 0$ for the control treatment and $a = 1$ for the experimental treatment) under the CDH strategy such that
$$\mu_a = E\left[ I\left( Y_{iJ}^{(a)} < c \right) \right].$$
If the $Y_{iJ}^{(a)}$ are identically distributed across subjects, the true mean can be written as
$$\mu_a = \Pr\left( Y_{iJ}^{(a)} < c \right).$$
For a given subject, some of the longitudinal outcomes $Y_{ij}$ and $Z_{ij}$ may be missing after a certain time point. The missing data in $Z_{ij}$ can be handled either through a GLMM or by imputing the continuous outcome $Y_{ij}$ using multiple imputation.
2.1 Generalized Linear Mixed Model (GLMM)
A GLMM for binary data takes the binomial exponential family, with the canonical link being the logit. Specifically, the mean response is modeled as
$$\operatorname{logit}(p_{ij}) = \log\left( \frac{p_{ij}}{1 - p_{ij}} \right) = x_{ij}'\beta,$$
where $p_{ij}$ is the probability of $Z_{ij} = 1$ for the $i$th subject at the $j$th timepoint, $x_{ij}$ is the covariate vector, and $\beta$ is the fixed effect. The within-subject correlation for the binary outcome can be modeled with a subject-level random effect or through the residuals in the marginal model. In this article, we consider only the marginal model because it allows more flexible covariance structures for the within-subject correlation. The GLMM can be fitted using various statistical software packages, such as R, SAS, and SPSS.
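To make the canonical logit link concrete, the following minimal sketch (pure Python on made-up toy data; it fits independent binary observations only and ignores the within-subject correlation that the marginal GLMM models through the residuals) estimates $\beta$ by Newton-Raphson:

```python
import math
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting for small linear systems."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit_logistic(X, z, n_iter=25):
    """Newton-Raphson for logit Pr(z = 1) = x'beta; X is a list of covariate rows."""
    p_dim = len(X[0])
    beta = [0.0] * p_dim
    for _ in range(n_iter):
        # score U = X'(z - p); information I = X' W X with W = diag(p(1 - p))
        U = [0.0] * p_dim
        I = [[0.0] * p_dim for _ in range(p_dim)]
        for x, zi in zip(X, z):
            eta = sum(b * xj for b, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for j in range(p_dim):
                U[j] += (zi - p) * x[j]
                for k in range(p_dim):
                    I[j][k] += w * x[j] * x[k]
        beta = [b + d for b, d in zip(beta, solve(I, U))]
    return beta

# toy data: intercept + treatment indicator, true beta = (-0.5, 0.8)
random.seed(1)
X, z = [], []
for i in range(400):
    trt = i % 2
    p_true = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * trt)))
    X.append([1.0, float(trt)])
    z.append(1 if random.random() < p_true else 0)
beta_hat = fit_logistic(X, z)
```

With 400 observations, `beta_hat` lands close to the generating values; in practice one would of course rely on dedicated software (eg, PROC GLIMMIX) rather than this sketch.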
2.2 Multiple Imputation (MI) Method
Rubin first introduced MI as a tool to handle nonresponse in large public-use surveys. rubin1987multiple There are three main steps in the MI procedure: imputation, analysis, and pooling. In the imputation step, complete versions of the data are created by replacing the missing values with plausible values drawn from a chosen imputation model. Next, the analysis of interest is performed on each of the completed data sets separately. Finally, the results are pooled into a single MI result by “Rubin’s rules.” rubin1987multiple The variance formula based on the within- and between-imputation variances (Rubin’s rules) may overestimate the variance when the imputation model and the analysis model are uncongenial. bartlett2020bootstrap, meng1994multiple, xie2017dissecting Therefore, we also evaluated inference based on bootstrap methods.
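The pooling step can be sketched in a few lines (pure Python; the function name and the numeric inputs below are ours, for illustration only). Given point estimates $Q_m$ and their variance estimates $U_m$ from $M$ imputed data sets, Rubin’s rules combine them as follows:

```python
def rubin_pool(estimates, variances):
    """Pool M point estimates and their variances by Rubin's rules."""
    M = len(estimates)
    q_bar = sum(estimates) / M                               # pooled point estimate
    w_bar = sum(variances) / M                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (M - 1)   # between-imputation variance
    t = w_bar + (1.0 + 1.0 / M) * b                          # total variance
    return q_bar, t

# example: hypothetical RD estimates from M = 5 imputed data sets
est = [0.12, 0.14, 0.11, 0.13, 0.15]
var = [0.0021, 0.0020, 0.0022, 0.0021, 0.0019]
q, t = rubin_pool(est, var)
```

The inflation factor $(1 + 1/M)$ accounts for using a finite number of imputations; it is this total variance $T$ that can overestimate the sampling variance under uncongeniality, motivating the bootstrap comparison above.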
To analyze a derived binary outcome, we can first impute its underlying continuous outcome in the imputation step, then dichotomize it into the binary outcome. Analysis and pooling steps can be further applied to the imputed data sets. By imputing the continuous outcome rather than working on the binary outcome directly, more information is expected to be utilized, which potentially improves the imputation accuracy and overall efficiency for the estimators.
2.3 Group Means and Variance Estimation
For GLMMs with adjustment for baseline covariates, the model-based mean for a treatment group (e.g., the least squares mean) produced by most software packages (including SAS and R) estimates the response at the mean covariate value, which has little clinical meaning. We therefore consider the average mean probability for each treatment group and the unconditional treatment effect (RD or OR). guidancedraft Accordingly, the mean response for each treatment group should be estimated by averaging the predicted responses under that treatment assignment across all subjects from all groups. freedman2008randomization The marginal treatment contrasts (RD or OR for the binary outcome) can then be constructed from the group means. Qu and Luo qu2015estimation further studied this problem and provided the formula for variance estimation.
Specifically, the group means can be written as
$$\hat{\mu}_a = \frac{1}{n} \sum_{i=1}^{n} h\{ x_{iJ}'(a) \hat{\beta} \}, \quad a = 0, 1,$$
where $a$ is the treatment arm indicator, $x_{iJ}(a)$ is the covariate vector for subject $i$ at the last visit $J$ with the treatment indicator set to $a$, and
$$h(u) = \frac{e^{u}}{1 + e^{u}}$$
is the inverse of the logit link function, with $h'(u) = h(u)\{1 - h(u)\}$ and $\hat{\beta}$ the estimated fixed effect. It is straightforward to obtain the estimated group mean difference (RD) as
$$\widehat{\text{RD}} = \hat{\mu}_1 - \hat{\mu}_0.$$
Using the delta method, the variance of the risk difference can be estimated by
$$\widehat{\text{Var}}(\widehat{\text{RD}}) = \hat{d}' \hat{V} \hat{d}, \tag{1}$$
where $\hat{V}$ is the estimated variance-covariance matrix for $\hat{\beta}$ (sandwich variance estimation is recommended to estimate $\hat{V}$) and
$$\hat{d} = \frac{1}{n} \sum_{i=1}^{n} \left[ h'\{ x_{iJ}'(1) \hat{\beta} \} x_{iJ}(1) - h'\{ x_{iJ}'(0) \hat{\beta} \} x_{iJ}(0) \right].$$
Similarly, the estimated logarithm of OR [log(OR)] has the following form
$$\widehat{\log(\text{OR})} = \log \frac{\hat{\mu}_1 / (1 - \hat{\mu}_1)}{\hat{\mu}_0 / (1 - \hat{\mu}_0)},$$
with variance estimated by
$$\widehat{\text{Var}}\{ \widehat{\log(\text{OR})} \} = \hat{e}' \hat{V} \hat{e}, \tag{2}$$
where
$$\hat{e} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{h'\{ x_{iJ}'(1) \hat{\beta} \}}{\hat{\mu}_1 (1 - \hat{\mu}_1)} x_{iJ}(1) - \frac{h'\{ x_{iJ}'(0) \hat{\beta} \}}{\hat{\mu}_0 (1 - \hat{\mu}_0)} x_{iJ}(0) \right].$$
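As an illustration, these delta-method computations can be sketched in pure Python. The coefficient values, covariance matrix, and covariate rows below are made up for illustration; in practice the fitted coefficients and their (sandwich) covariance come from the fitted model:

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def marginal_rd_logor(X0, X1, beta, V):
    """Marginal RD and log(OR) with delta-method variances.

    X0/X1: covariate rows for every subject with treatment set to 0/1;
    beta: fitted coefficients; V: covariance matrix of beta.
    """
    n, p = len(X0), len(beta)
    mu, grad = [0.0, 0.0], [[0.0] * p, [0.0] * p]
    for arm, rows in enumerate((X0, X1)):
        for x in rows:
            h = expit(sum(b * xj for b, xj in zip(beta, x)))
            mu[arm] += h / n
            for j in range(p):                     # d mu_a / d beta_j
                grad[arm][j] += h * (1.0 - h) * x[j] / n
    rd = mu[1] - mu[0]
    d = [grad[1][j] - grad[0][j] for j in range(p)]
    var_rd = sum(d[j] * V[j][k] * d[k] for j in range(p) for k in range(p))
    log_or = math.log(mu[1] / (1 - mu[1])) - math.log(mu[0] / (1 - mu[0]))
    e = [grad[1][j] / (mu[1] * (1 - mu[1])) - grad[0][j] / (mu[0] * (1 - mu[0]))
         for j in range(p)]
    var_log_or = sum(e[j] * V[j][k] * e[k] for j in range(p) for k in range(p))
    return rd, var_rd, log_or, var_log_or

# illustration: intercept + treatment + one baseline covariate (hypothetical values)
beta_hat = [-0.4, 0.6, 0.3]
V_hat = [[0.04, -0.02, 0.0], [-0.02, 0.05, 0.0], [0.0, 0.0, 0.01]]
base = [-1.0, -0.5, 0.0, 0.5, 1.0]
X0 = [[1.0, 0.0, b] for b in base]
X1 = [[1.0, 1.0, b] for b in base]
rd, var_rd, log_or, var_log_or = marginal_rd_logor(X0, X1, beta_hat, V_hat)
```

Note that both prediction passes use the full sample of covariate rows, toggling only the treatment indicator, which is exactly the averaging over all subjects described above.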
In this section, we compared the performance of GLMM with that of MI in analyzing the binary outcome derived from an underlying continuous outcome in a longitudinal setting. We considered a range of distributions for the longitudinal continuous data. To achieve this comprehensive evaluation, we adopted two data generation strategies to obtain complete data sets with continuous values. In the first strategy, data were generated from pre-specified distributions besides a multivariate normal distribution, including a multivariate t-distribution (heavier tails than normal) and a multivariate log-normal distribution (skewed). In the other, data were resampled from a completed real clinical trial. We focused on monotone missingness in our simulations; a missing data pattern is said to be monotone when an outcome missing for a particular individual implies that all subsequent outcomes are missing for that individual. Missing mechanisms of MCAR and MAR were considered. The objective was to estimate the RD and log(OR) at the final visit assuming the missing potential outcomes follow the same pattern as the observed data. In the GLMM, the within-subject errors were modeled through the residuals using an unstructured variance-covariance matrix; when the analysis failed to converge, an independent variance-covariance matrix was used instead. For the MI method, missing values were imputed using fully conditional specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn. van2011mice Multiple completed data sets were created in the imputation step. Estimated variances were obtained by (1) and (2), respectively, for RD and log(OR) in each data set and pooled by “Rubin’s rules.” Ten thousand replicates were simulated for each simulation scenario.
All of our simulation settings were based on the IMAGINE-2 study, a phase 3, double-blinded, randomized diabetes clinical trial in insulin-naïve adults with type 2 diabetes. davies2016basal The primary objective was to test the non-inferiority of peglispro to glargine for HbA1c change from baseline to week 52. As secondary objectives, comparisons of the percentages of subjects achieving HbA1c targets of < 7% and 6.5% were also of interest. HbA1c was measured at baseline (week 0) and at five subsequent timepoints (weeks 4, 12, 26, 39, and 52). The mean HbA1c (%) were and for the glargine (control) and the peglispro (treatment) arm, respectively. The standard deviations were very close for the two arms at each time point. A total of subjects were randomized 2:1 to the peglispro and glargine arms, respectively. Among them, 1269 had complete HbA1c data.
3.1 Simulation Based on Complete Continuous Data Generated From a Pre-specified Distribution
We considered two arms with 200 subjects in each arm and a total of six measurements on the continuous outcome of interest for each subject. Complete data sets for the continuous outcomes, $Y_{ij}$, were generated from a multivariate normal distribution, a multivariate t-distribution with three degrees of freedom, and a multivariate log-normal distribution, respectively, using the R packages “mvtnorm” and “MethylCapSig.” Based on empirical evidence from the IMAGINE-2 study, davies2016basal we assumed the mean HbA1c (%) at the six measurement timepoints, including baseline, were and for the control and the treatment arm, respectively. We further assumed the standard deviation vector over time from measurement 1 to measurement 6 to be and the correlation between measurements and to be . Given $Y_{ij}$, the binary outcomes were constructed by $Z_{ij} = I(Y_{ij} < 7)$. To be consistent with the clinical data, we set the missing rates to be 3%, 6%, 10%, 13%, and 15% across the post-baseline timepoints, respectively. Under MCAR, the missingness is completely at random, ie, generated from Bernoulli distributions independent of all other elements in a subject’s data vector. Under MAR, the missingness depends on the observed longitudinal history of HbA1c and on the treatment arm: the dropout rate is higher in the control arm than in the treatment arm, and subjects with higher HbA1c at previous visits are more likely to drop out. The programming code for the missing data generation under MCAR and MAR is provided in Appendix A.
Table 1 summarizes the empirical bias (BIAS), empirical variance (VAR), average estimated variance (EVAR), relative mean squared error (RMSE) of GLMM versus MI, and 95% empirical coverage probability (CP) for RD and log(OR), respectively. The RMSE was calculated as the ratio of the MSE for GLMM to that for MI. Note that the true values of RD were 0.125, 0.197, and 0.128 for the normal, t, and log-normal distribution settings, respectively. The corresponding true values of log(OR) were 0.505, 0.801, and 0.515. Under all distribution assumptions, both the GLMM and MI approaches performed well with small biases. The estimated variances were in good agreement with the empirical variances, and the coverage probabilities were close to the nominal value of 95%. Given RMSE > 1 in all scenarios, we concluded that MI provides a more efficient estimator than GLMM in the scenarios considered. As expected, adopting “Rubin’s rules” in the MI method overestimated the variances, resulting in higher CPs.
Abbreviations: DIST, distribution; RD, risk difference; OR, odds ratio; MCAR, missing completely at random; MAR, missing at random; BIAS, empirical bias; VAR, empirical variance; EVAR, average estimated variance; CP, 95% empirical coverage probability; RMSE, relative mean squared error.
3.2 Simulation Based on Resampling of Complete Continuous Data From a Diabetes Clinical Trial
Furthermore, we conducted a simulation study by resampling from the IMAGINE-2 study. davies2016basal We considered a beneficial treatment scenario and a null effect scenario. For the beneficial treatment scenario, the mean HbA1c was lower for the treatment arm than for the control arm. In this case, we let the mean HbA1c (%) be and for the control and the treatment arm, respectively. Data for the null effect scenario were generated by letting the treatment-arm means equal the control-arm means. For the binary outcomes of interest, we considered two thresholds, 7% and 6.5%. The corresponding binary outcomes were constructed by dichotomizing $Y_{ij}$ at these thresholds, respectively. Incomplete data sets were generated similarly as above.
The simulation process was as follows: (1) subset the IMAGINE-2 data to subjects with a complete history of HbA1c measurements (ie, data from the 1269 subjects); (2) adjust the mean response at each time point by applying a constant to each subject’s outcome at that visit, corresponding to the control and treatment means, respectively; these served as the potential outcomes for each subject under each arm at each visit; (3) resample the subjects; (4) draw treatment assignments from a Bernoulli distribution with a 2:1 randomization ratio; (5) assign the potential outcomes accordingly; (6) induce dropouts according to MCAR or MAR; (7) conduct the analyses; (8) repeat steps (3)-(7) 10,000 times; and (9) summarize the operating characteristics.
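Steps (3)-(5) of this loop can be sketched as follows (pure Python; the potential-outcome values are made-up placeholders rather than IMAGINE-2 data, only the final visit is carried, and the dropout and analysis steps are omitted):

```python
import random

random.seed(2024)
# hypothetical potential outcomes at the final visit for a pool of 1269 subjects:
# (y under control, y under treatment); real runs carry the full visit history
pool = [(7.0 + random.gauss(0, 1), 6.6 + random.gauss(0, 1)) for _ in range(1269)]

def one_replicate(pool, n=400):
    sample = [random.choice(pool) for _ in range(n)]                # (3) resample subjects
    trt = [1 if random.random() < 2 / 3 else 0 for _ in range(n)]   # (4) 2:1 randomization
    y = [s[a] for s, a in zip(sample, trt)]                         # (5) assign potential outcome
    # (6)-(7) would induce dropout and fit GLMM/MI; here: complete-data RD for y < 7%
    n1 = max(sum(trt), 1)
    p1 = sum(1 for yi, a in zip(y, trt) if a == 1 and yi < 7.0) / n1
    p0 = sum(1 for yi, a in zip(y, trt) if a == 0 and yi < 7.0) / max(n - n1, 1)
    return p1 - p0

rds = [one_replicate(pool) for _ in range(200)]
mean_rd = sum(rds) / len(rds)
```

Averaging the replicate estimates (and their empirical variance) across the loop gives the operating characteristics summarized in step (9).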
Similar conclusions to those in the previous section can be drawn from the simulation results in Table 2. Besides BIAS, VAR, EVAR, RMSE, and CP, the 95% coverage probability based on the non-parametric bootstrap with 10,000 resamples (CP-B) is reported. For the bootstrap, we calculated bootstrap percentile intervals. bartlett2020bootstrap The true values of RD and log(OR) were 0 for the null effect scenarios. For the beneficial treatment scenarios, the true values of RD were 0.128 and 0.136 for the two thresholds, respectively. The corresponding true values of log(OR) were 0.649 and 0.546. CP-Bs were closer to 95% than CPs, indicating that the variance was estimated more accurately by the bootstrap. These results suggest that MI may be recommended as a primary method of analysis, with the corresponding variances better estimated by the bootstrap.
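The bootstrap percentile interval underlying CP-B can be sketched as follows (pure Python; the function name, the responder-rate statistic, and the data are ours, for illustration):

```python
import random

def percentile_ci(data, statistic, n_boot=2000, alpha=0.05, seed=7):
    """Nonparametric bootstrap percentile interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]           # 2.5th percentile
    hi = stats[int((1 - alpha / 2) * n_boot) - 1] # 97.5th percentile
    return lo, hi

# example: interval for a responder rate from binary data (62 of 100 responders)
data = [1] * 62 + [0] * 38
lo, hi = percentile_ci(data, lambda d: sum(d) / len(d))
```

For the RD or log(OR), the statistic would refit the analysis model on each resampled data set; the percentile endpoints then form the interval evaluated by CP-B.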
Abbreviations: DIST, distribution; RD, risk difference; OR, odds ratio; BEN, beneficial; MCAR, missing completely at random; MAR, missing at random; BIAS, empirical bias; VAR, empirical variance; EVAR, average estimated variance; CP, 95% empirical coverage probability; CP-B, 95% coverage probability based on bootstrap; RMSE, relative mean squared error.
4 Application in Clinical Trials
In this section, we applied both GLMM and MI to 29 phase 3 clinical trials of diabetes treatments to analyze the RD or the log(OR) of reaching the HbA1c target of < 7% or 6.5% at the end of the study. The durations of these studies ranged from 12 to 104 weeks, and the sample sizes were between 182 and 1518. For trials with multiple treatment arms, we compared each treatment arm of interest with the reference arm, resulting in a total of 38 comparisons. Since the simulation studies in Section 3 showed both methods have little bias in estimating the RD or the log(OR), and we were not able to estimate the bias for the real data (not knowing the truth), we focused on the relative variance (the variance ratio of GLMM to MI) for these comparisons, where variances were estimated by the non-parametric bootstrap. efron1994introduction Figure 1 shows the relative variances for the RD and the log(OR) for the two aforementioned treatment targets. For the HbA1c target of < 7%, MI had smaller variance than GLMM in all comparisons for both RD and log(OR). The mean ratio of variances between the two methods was 1.28 for both RD and log(OR). For the HbA1c target of 6.5%, the MI-based approach had smaller variances than GLMM in 97.4% (37 out of 38) of comparisons for both RD and log(OR). The mean ratios of variances between the two methods were 1.42 and 4.83 for RD and log(OR), respectively. Some ratios were high (> 2) because the GLMM was suspected to diverge, with unusually high variance estimates, for four comparisons. Excluding the results for these four comparisons, with the HbA1c target of < 7%, the mean ratios of variances were 1.20 and 1.25 for RD and log(OR), respectively; with the HbA1c target of 6.5%, the mean ratios of variances were 1.27 and 1.22 for RD and log(OR), respectively. In summary, the MI approach generally led to smaller variance estimates than GLMM. We also performed linear regressions of the variance ratio on the rate of missing values at the final visit (Figure 2). The results imply that MI was more likely to yield lower variance estimates than GLMM when the rate of missing values was higher.
5 Summary and Discussion
Binary endpoints derived from continuous variables are often used as the primary endpoint or as important secondary endpoints in clinical trials. Missing data occur either because subjects discontinue from the study early or because outcomes collected after intercurrent events, which do not reflect the potential outcomes of interest, are censored. Therefore, it is important to understand the appropriate methods for handling the missing values for these binary endpoints. Literature comparing the two approaches (GLMM and MI) through simulations generating the underlying continuous variable from a normal distribution showed that MI outperforms GLMM. However, continuous variables are often skewed or heavy-tailed, and the normality assumption can be violated. Therefore, the simulation results in the existing literature provide only limited assurance for using the MI-based approach in practice.
We compared GLMM with MI for analyzing a binary outcome with monotone missing data through two types of simulations. First, we compared the performance of the two methods when the underlying continuous variable was generated from normal, log-normal, and t-distributions. Second, we generated data by resampling from a clinical trial in diabetes. In both simulations, MI outperformed GLMM in all scenarios. In addition, we applied both methods to 38 comparisons from 29 historical clinical trials and showed that MI almost always resulted in lower variance estimates than GLMM. Finally, through the simulation studies and real data analyses, we found that GLMM was more likely to fail to converge. The ratio of estimated variances for GLMM versus MI increased with the missing rate, indicating that MI becomes more advantageous over GLMM as the missing rate increases.
In this article, we assumed the underlying continuous outcomes for all subjects were independently and identically distributed and that all missing values followed the same pattern. In real clinical trial analyses, there may be different missingness patterns according to the reasons for missingness. qu2021implementation Compared to GLMM, MI is an easier approach for implementing such a pattern mixture model. little1993pattern Although we did not simulate data based on a pattern mixture model, the convincing results for the MI-based approach under a single missingness pattern naturally suggest that the missing values will be imputed well within each pattern, and a good estimate of the overall treatment effect can be obtained when mixed missingness patterns exist.
In conclusion, to handle the missing data for binary outcomes in clinical trials, we should impute the missing values for the underlying continuous variable and then analyze the derived binary outcome.
Appendix A Programming Codes for Missing Data Generation for MCAR and MAR
Input data include:
y1–y6: complete data vectors at timepoints 1–6;
data: complete data set;
trt: treatment indicator;
n: number of subjects.
The helper functions expit(x) = 1/(1 + exp(-x)) and logit(p) = log(p/(1 - p)) are assumed to be defined, and the dplyr package is loaded for mutate().
# Missing data generation under MCAR
R1 <- rbinom(n, 1, 0.97)
R2 <- rep(0, n)
R2[R1 == 1] <- rbinom(sum(R1), 1, (1 - 0.031))
R3 <- rep(0, n)
R3[R2 == 1] <- rbinom(sum(R2), 1, (1 - 0.042))
R4 <- rep(0, n)
R4[R3 == 1] <- rbinom(sum(R3), 1, (1 - 0.033))
R5 <- rep(0, n)
R5[R4 == 1] <- rbinom(sum(R4), 1, (1 - 0.023))
data <- data %>% mutate(y2 = ifelse(R1 == 1, y2, NA), y3 = ifelse(R2 == 1, y3, NA), y4 = ifelse(R3 == 1, y4, NA), y5 = ifelse(R4 == 1, y5, NA), y6 = ifelse(R5 == 1, y6, NA))
# Missing data generation under MAR
p1 <- with(data, expit(logit(0.99) - 0.12 * y1 * trt - 0.14 * y1 * (1 - trt)))
R1 <- rbinom(n, 1, p1)
R2 <- rep(0, n)
p2 <- with(data[R1 == 1, ], expit(logit(1 - 0.031) - 0.5 * residuals(lm(y2 ~ y1)) * trt - 0.7 * residuals(lm(y2 ~ y1)) * (1 - trt)))
R2[R1 == 1] <- rbinom(sum(R1), 1, p2)
R3 <- rep(0, n)
p3 <- with(data[R2 == 1, ], expit(logit(1 - 0.042) - 0.51 * residuals(lm(y3 ~ y1)) * trt - 0.72 * residuals(lm(y3 ~ y1)) * (1 - trt)))
R3[R2 == 1] <- rbinom(sum(R2), 1, p3)
R4 <- rep(0, n)
p4 <- with(data[R3 == 1, ], expit(logit(1 - 0.033) - 0.52 * residuals(lm(y4 ~ y1)) * trt - 0.74 * residuals(lm(y4 ~ y1)) * (1 - trt)))
R4[R3 == 1] <- rbinom(sum(R3), 1, p4)
R5 <- rep(0, n)
p5 <- with(data[R4 == 1, ], expit(logit(1 - 0.023) - 0.53 * residuals(lm(y5 ~ y1)) * trt - 0.76 * residuals(lm(y5 ~ y1)) * (1 - trt)))
R5[R4 == 1] <- rbinom(sum(R4), 1, p5)
data <- data %>% mutate(y2 = ifelse(R1 == 1, y2, NA), y3 = ifelse(R2 == 1, y3, NA), y4 = ifelse(R3 == 1, y4, NA), y5 = ifelse(R4 == 1, y5, NA), y6 = ifelse(R5 == 1, y6, NA))