
Bridged treatment comparisons: an illustrative application in HIV treatment

06/09/2022
by   Paul N Zivich, et al.

Comparisons of treatments or exposures are of central interest in epidemiology, but direct comparisons are not always possible due to practical or ethical reasons. Here, we detail a data fusion approach that allows bridged treatment comparisons across studies. The motivating example entails comparing the risk of the composite outcome of death, AIDS, or a greater than 50% decline in CD4 cell count in people with HIV when assigned triple versus mono antiretroviral therapy, using data from the AIDS Clinical Trial Group (ACTG) 175 (mono versus dual therapy) and ACTG 320 (dual versus triple therapy). We review a set of identification assumptions and estimate the risk difference using an inverse probability weighting estimator that leverages the shared trial arms (dual therapy). A fusion diagnostic based on comparing the shared arms is proposed that may indicate violation of the identification assumptions. Application of the data fusion estimator and diagnostic to the ACTG trials indicates triple therapy results in a reduction in risk compared to mono therapy in individuals with baseline CD4 counts between 50 and 300 cells/mm^3. Bridged treatment comparisons address questions that none of the constituent data sources could address alone, but valid fusion-based inference requires careful consideration.


Introduction

Comparisons between new and existing treatments are of central interest in epidemiology and medical science. However, direct comparisons between two treatments are often infeasible due to practical or ethical reasons. For example, once an efficacious treatment is developed, it is typically no longer ethical to conduct placebo-controlled trials, thus precluding direct comparisons between new treatments and placebo. Instead, informal transitivity arguments are often made to compare a set of studies through shared intermediary treatments.[Catala2014, Lumley2002, Rouse2017] For example, if treatment A is better than B, and B is better than C, then A is better than C. However, such naive comparisons can be of limited utility or even misleading for a variety of reasons; e.g., studies may sample from different populations, define endpoints differently, or have different rates of loss to follow-up. The exact conditions for valid comparisons of treatments across studies are often left implicit and not assessed.

As a motivating example for bridged treatment comparisons, consider different combinations of antiretroviral therapies for the treatment of HIV infection. The AIDS Clinical Trials Group (ACTG) 175 randomized trial compared mono therapy versus dual therapy,[Hammer1996actg175] while the ACTG 320 randomized trial compared dual versus triple therapy.[Hammer1997actg320] Researchers, agencies, patients, and other stakeholders might be interested in comparing the efficacy of triple therapy versus mono therapy. However, using estimates from the ACTG trials to compare mono and triple therapy requires careful consideration.

Here, we detail a recent approach to combine or "fuse" data from multiple sources to make bridged treatment comparisons,[Breskin2021, Cole2022] with the ACTG 175 and ACTG 320 trials serving as the motivating example. Previous work has defined a sufficient set of assumptions under which the bridged treatment comparison is identified and proposed an estimator of the risk difference when the trial populations differ in certain ways that can be accounted for analytically.[Breskin2021] We provide a step-by-step application of this approach and propose a novel diagnostic that tests the underlying assumptions of the fusion method for the data being combined. To facilitate wider application, the ACTG trial data are provided along with corresponding SAS, R, and Python code that implement the data fusion estimator and diagnostic.

Trials

ACTG 175 was a double-masked randomized trial comparing four antiretroviral therapy regimens among HIV-1 infected participants with CD4 counts between 200 and 500 cells/mm^3 who were followed for over three years.[Hammer1996actg175] Trial arms consisted of zidovudine-only, zidovudine-didanosine, zidovudine-zalcitabine, and didanosine-only. A total of 2493 participants were recruited from sites in the United States. Eligible participants were at least 12 years of age, had no history of previous AIDS-defining illness (except minimal Kaposi's sarcoma), and had a Karnofsky score of at least 70. Here, we restricted to three of the trial arms: zidovudine-didanosine or zidovudine-zalcitabine (combined as dual therapy), and zidovudine only (mono therapy). The composite outcome was time from randomization to AIDS, death, or a 50%+ decline in CD4 cell count from baseline.

ACTG 320 was a double-masked randomized trial comparing dual nucleoside antiretroviral therapy plus a protease inhibitor (zidovudine-lamivudine-indinavir) to dual nucleoside antiretroviral therapy (zidovudine-lamivudine) among HIV-1 infected participants with CD4 cell counts below 200 cells/mm^3 who were followed for one year.[Hammer1997actg320] A total of 1156 participants were recruited from sites in the United States. Eligible participants were at least 16 years old, had a Karnofsky score of at least 70, had at least three months of prior treatment with zidovudine, no more than one week of prior treatment with lamivudine, and no prior treatment with protease inhibitors. The outcome was time from randomization to AIDS or death.

To harmonize data from these trials, we restricted the ACTG 175 sample to participants 16 years or older with at least three months of prior zidovudine treatment and no prior non-zidovudine antiretroviral therapy (813 of 2493 participants), administratively censored all participants at one year, and used the composite outcome of AIDS, death, or a 50%+ decline in CD4 cell count. Both trials collected a variety of baseline characteristics (Table 1). Note the non-overlapping baseline CD4 cell counts due to differing eligibility criteria between the trials (Figure A1), a key point returned to below.
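As a minimal sketch of this harmonization step (all column names here are hypothetical; the published SAS, R, and Python code implements the actual analysis):

import numpy as np
import pandas as pd

def harmonize_actg175(actg175, tau=365.0):
    """Apply the ACTG 320-style eligibility criteria and administrative
    censoring at one year (tau, in days) to hypothetical ACTG 175 data."""
    d = actg175.loc[
        (actg175["age"] >= 16)
        & (actg175["prior_zdv_months"] >= 3)
        & (~actg175["prior_nonzdv_art"])
    ].copy()
    # Administrative censoring: events after tau become censored at tau
    d["delta"] = np.where(d["t"] <= tau, d["delta"], 0)
    d["t"] = np.minimum(d["t"], tau)
    return d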

Bridged Comparison Design and Estimation

The fusion study design presented here consists of five steps: (1) definition of the target population, (2) definition of the parameter of interest, (3) identification assumptions and selection of an appropriate estimator, (4) diagnostic of the fusion, and (5) estimation.

The following notation will be used. Let $T_i^a$ indicate the potential event time for individual $i$ when assigned treatment $a$, where $a=1$ is mono therapy, $a=2$ is dual therapy, and $a=3$ is triple therapy. Let $A_i$ indicate the assigned antiretroviral therapy, and $T_i = \sum_a I(A_i = a) T_i^a$ indicate the event time corresponding to the assigned treatment, where $I(\cdot) = 1$ if $\cdot$ is true and $0$ otherwise. The observed event times may be right censored due to drop-out or the end of follow-up. Let $C_i$ indicate the censoring time, such that the observed event times are $T_i^* = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$. Finally, let $W_i$ indicate a vector of observed baseline covariates measured in both trials, $V_i$ indicate a separate vector of observed baseline covariates (the distinction between $W$ and $V$ is made below), and $S_i = 1$ if individual $i$ participated in ACTG 320 and $S_i = 0$ indicate participation in ACTG 175. Note $S$ is fundamentally different from other covariates (i.e., $W$, $V$) in that $S$ is not an inherent characteristic of an individual. The observed data consists of $n$ independent observations of $O_i = (W_i, V_i, S_i, A_i, T_i^*, \Delta_i)$.
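To make the notation concrete, a sketch of how the stacked analysis data might be organized (column names are hypothetical and are reused in the code sketches below):

import pandas as pd

# One row per participant across both trials.
# t_star: observed time T*; delta: event indicator; a: assigned therapy (1/2/3);
# s: trial indicator (1 = ACTG 320, 0 = ACTG 175); remaining columns are W/V.
stacked = pd.DataFrame({
    "t_star": [210.0, 365.0, 97.0],
    "delta":  [1, 0, 1],
    "a":      [3, 2, 1],
    "s":      [1, 1, 0],
    "age":    [34, 41, 29],
    "cd4":    [120, 180, 320],
})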

Step 1: Definition of the target population

Due to possible treatment effect heterogeneity across trials, the target population must be clearly defined. While trial participants are not typically selected via a formal sampling design, a trial-specific population can still be defined as the set of individuals who meet the trial’s inclusion/exclusion criteria and would self-select to participate in the trial. Therefore, the target parameter can be defined for the ACTG 175 population, the ACTG 320 population, the combination of both trials, or some external population (which would require additional data from or assumptions about that external population). Here, we chose to estimate the parameter in the ACTG 320 population.

Step 2: Definition of the parameter of interest

The motivating quantity of interest is the risk difference at one year of follow-up comparing the risk of the composite outcome had everyone been assigned triple therapy with the risk of the composite outcome had everyone been assigned mono therapy in the ACTG 320 population. This parameter, $\psi(t)$, can be expressed as

$\psi(t) = \Pr(T^3 < t \mid S = 1) - \Pr(T^1 < t \mid S = 1)$ (1)

Since no individuals in ACTG 320 were randomized to mono therapy, $\Pr(T^1 < t \mid S = 1)$ is not identified from the ACTG 320 data. Therefore, we re-express $\psi(t)$ in terms of quantities that may be possible to identify using data from both trials. With the risk of the outcome at time $t$ under dual therapy, $\Pr(T^2 < t \mid S = 1)$, acting as a 'bridge', the parameter can instead be written as

$\psi(t) = [\Pr(T^3 < t \mid S = 1) - \Pr(T^2 < t \mid S = 1)] + [\Pr(T^2 < t \mid S = 1) - \Pr(T^1 < t \mid S = 1)] = \psi_{32}(t) + \psi_{21}(t)$ (2)

where $\psi_{32}(t)$ denotes the first bracketed term (triple versus dual) and $\psi_{21}(t)$ the second (dual versus mono).

Step 3: Identification assumptions and target parameter estimator

In this section, a set of sufficient identification assumptions is provided along with the corresponding parameter estimators.

Step 3a: Identification assumptions and estimation of triple versus dual therapy

The ACTG 320 trial was designed specifically to estimate $\psi_{32}(t)$, so it is natural to use data from this trial to estimate this risk difference. As triple and dual antiretroviral therapy regimens were randomly assigned in the ACTG 320 trial, the identification assumptions of exchangeability[Hernan2006] (i.e., assigned antiretroviral therapy and potential outcomes were independent) and positivity[Westreich2010] (i.e., all individuals had non-zero probability of assignment to each antiretroviral therapy regimen) are given by design. The exchangeability and positivity assumptions are expressed, respectively, as

$T^a \perp A \mid S = 1$ and $\Pr(A = a \mid S = 1) > 0$ for $a \in \{2, 3\}$.

Under these identification assumptions and causal consistency,[Cole2009] $\psi_{32}(t)$ can be expressed in terms of the observed data distribution in the absence of censoring.[Breskin2021, Robins2008] However, some participants were right censored prior to one year and hence the event time was not observed for all participants. Additionally, censoring due to loss to follow-up may be driven by both baseline characteristics of participants and assigned treatment. Therefore, exchangeability between censored and uncensored participants conditional on $A$ and $W$, and positivity, are assumed at each time point:

$I(T \ge t) \perp I(C \ge t) \mid A, W, S = 1$ and $\Pr(C \ge t \mid A, W, S = 1) > 0$ for $t \in (0, \tau]$, where $\tau$ is one year.

To estimate $\psi_{32}(t)$, we use the following inverse probability weighted estimator:

$\hat{\psi}_{32}(t) = \widehat{\Pr}(T^3 < t \mid S = 1) - \widehat{\Pr}(T^2 < t \mid S = 1)$

where

$\widehat{\Pr}(T^a < t \mid S = 1) = \frac{1}{n_{320}} \sum_{i=1}^{n} \frac{I(S_i = 1) I(A_i = a) I(T_i^* < t) \Delta_i}{\widehat{\Pr}(A_i = a \mid S_i = 1) \widehat{\Pr}(C_i > T_i^* \mid A_i, W_i, S_i = 1)}$ (3)

the subscript 320 emphasizes that the ACTG 320 trial data is being used for estimation, and $n_{320}$ is the number of participants in the ACTG 320 trial. The numerator of equation 3 corresponds to whether an event was observed by time $t$ (i.e., $I(T_i^* < t)\Delta_i$) among participants in the ACTG 320 trial (i.e., $I(S_i = 1)$) in the assigned treatment group (i.e., $I(A_i = a)$ for $a \in \{2,3\}$). The denominator of equation 3 consists of the randomization probability of the corresponding treatment (i.e., $\Pr(A_i = a \mid S_i = 1)$ for $a \in \{2,3\}$) and a model-estimated conditional probability of remaining uncensored (i.e., $\widehat{\Pr}(C_i > T_i^* \mid A_i, W_i, S_i = 1)$), described further below. The denominator effectively re-weights the observed events to account for those who were censored and for those in the other treatment arm.

In this analysis, $W$ consisted of gender, race, injection drug use, age, and Karnofsky score categories. To estimate the nuisance parameter $\Pr(C > t \mid A, W, S = 1)$, we combined a Cox model for censoring with the Breslow estimator for the cumulative hazard.[Lin2007] Specifically, the parameters of a Cox model were estimated with censoring prior to one year as the event of interest. Using the estimated coefficients, the baseline hazard function was estimated via the Breslow estimator. The estimated baseline hazard and model coefficients were then used to estimate the probability of remaining uncensored at time $t$. For flexibility in the Cox model, we stratified by antiretroviral therapy, and age was modeled using a restricted quadratic spline (knots placed at 25, 34, 40, and 54). It was not necessary to estimate the probability of treatment because the treatment probabilities are known. However, we estimated $\Pr(A = a \mid S = 1)$ using an intercept-only logistic model, since estimating the treatment weights leads to a smaller asymptotic variance of the estimated risk difference.[vdL2003]
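A minimal sketch of these steps, assuming the hypothetical stacked data frame above and the lifelines library (this is an illustration under assumed column names, not the authors' published code):

import numpy as np
from lifelines import CoxPHFitter

def prob_uncensored(df, covariates):
    """Estimate Pr(C > t*_i | A_i, W_i) for each participant via a Cox model
    for the censoring hazard (stratified by assigned therapy) with the
    Breslow baseline hazard, as described in Step 3a."""
    d = df.copy()
    d["censored"] = 1 - d["delta"]              # censoring is the 'event' here
    cph = CoxPHFitter()
    cph.fit(d[covariates + ["a", "t_star", "censored"]],
            duration_col="t_star", event_col="censored", strata=["a"])
    surv = cph.predict_survival_function(d)     # rows: times, cols: subjects
    times = surv.index.to_numpy()
    idx = np.searchsorted(times, d["t_star"].to_numpy(), side="right") - 1
    return surv.to_numpy()[np.maximum(idx, 0), np.arange(len(d))]

def ipw_risk(df, p_unc, a, t, p_treat):
    """Equation (3): IPW estimate of Pr(T^a < t | S = 1) from ACTG 320 rows."""
    in_320 = df["s"].to_numpy() == 1
    num = (in_320 & (df["a"].to_numpy() == a)
           & (df["t_star"].to_numpy() < t) & (df["delta"].to_numpy() == 1))
    return float(np.sum(num / (p_treat * p_unc)) / in_320.sum())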

Step 3b: Identification assumptions and estimation of dual versus mono therapy

Conditional on $S = 0$, exchangeability and positivity for dual and mono therapy are given by design in the ACTG 175 trial:

$T^a \perp A \mid S = 0$ and $\Pr(A = a \mid S = 0) > 0$ for $a \in \{1, 2\}$.

Conditional exchangeability and positivity for censoring are also assumed:

$I(T \ge t) \perp I(C \ge t) \mid A, W, S = 0$ and $\Pr(C \ge t \mid A, W, S = 0) > 0$ for $t \in (0, \tau]$.

While the covariates included in $W$ are the same for both trials here, this need not be the case.

The ACTG 175 trial was designed to estimate $\Pr(T^2 < t \mid S = 0) - \Pr(T^1 < t \mid S = 0)$, which need not equal $\psi_{21}(t)$ (i.e., we are unwilling to assume both trials are random samples of the same population). Therefore, we need to account for differences between the populations. To do so, the estimator in equation 3 is extended with developments from the generalizability and transportability literature.[Cole2010, Lesko2017, Westreich2017] This extension is done under the assumption that selection into the study populations is exchangeable conditional on a set of baseline covariates $V$ common to the two trials. In particular, conditional exchangeability and positivity of sampling are assumed:

$T^a \perp S \mid V$ for $a \in \{1, 2\}$ and $\Pr(S = 0 \mid V = v) > 0$ for all $v$ such that $\Pr(V = v \mid S = 1) > 0$.

Here, $V$ includes demographics, risk factors, and other contextual factors that both differ between the trial populations and are related to the composite outcome. For the intent-to-treat parameter, $V$ also includes contextual factors related to adherence to assigned treatment. While intent-to-treat analyses of a single trial require no assumptions regarding adherence for internal validity, the same is not true for bridged treatment comparisons. Specifically, if adherence differed by an unmeasured covariate $U$, and $U$ differed between trial populations, then the ACTG 175 population would no longer adequately 'stand in' for the ACTG 320 population under mono therapy, as conditioning on $U$ is needed for $T^a$ and $S$ to be independent. This problem arises in general when transporting intent-to-treat parameters across populations.[Westreich2015]

The estimator in equation 3 is extended to account for differences between trial populations via inverse odds of sampling weights.[Westreich2017] The inverse odds weights are defined as

$\frac{1 - \pi(V_i)}{\pi(V_i)}$

where $\pi(V_i) = \Pr(S_i = 0 \mid V_i)$ is the probability of being a member of the ACTG 175 sample conditional on the baseline covariates $V_i$. The conditional probability of being in ACTG 175 is estimated using a model (e.g., logistic regression), where $\hat{\pi}(V_i)$ represents a model-based estimate of $\pi(V_i)$. Here, the baseline covariates included in the censoring and sampling nuisance models are the same (i.e., $W = V$), but this need not be the case.

The inverse probability weighted estimator for $\psi_{21}(t)$, which includes the inverse odds of sampling weights, is

$\hat{\psi}_{21}(t) = \widehat{\Pr}(T^2 < t \mid S = 1) - \widehat{\Pr}(T^1 < t \mid S = 1)$

where

$\widehat{\Pr}(T^a < t \mid S = 1) = \frac{1}{\hat{n}_{175}} \sum_{i=1}^{n} \frac{I(S_i = 0) I(A_i = a) I(T_i^* < t) \Delta_i}{\widehat{\Pr}(A_i = a \mid S_i = 0) \widehat{\Pr}(C_i > T_i^* \mid A_i, W_i, S_i = 0)} \times \frac{1 - \hat{\pi}(V_i)}{\hat{\pi}(V_i)}$ (4)

and $\hat{n}_{175} = \sum_{i=1}^{n} I(S_i = 0) \frac{1 - \hat{\pi}(V_i)}{\hat{\pi}(V_i)}$ is the inverse odds of sampling weighted sample size. If the weight model is correctly specified, $\hat{n}_{175}$ should be approximately equal to $n_{320}$. To estimate $\hat{\pi}(V_i)$, data from both trials were stacked together. The conditional probability of remaining uncensored was estimated separately for each trial using the Breslow estimator and antiretroviral-therapy-stratified Cox models, as detailed in Step 3a. Lastly, $\Pr(A = a \mid S = 0)$ was estimated via an intercept-only logistic model.
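A sketch of the inverse odds of sampling weights, again under the hypothetical column names above (a minimal illustration, not the published implementation):

import numpy as np
import statsmodels.api as sm

def inverse_odds_of_sampling_weights(stacked, v_cols):
    """Fit pi(V) = Pr(S = 0 | V) by logistic regression on the stacked trials
    and return (1 - pi)/pi for ACTG 175 rows (S = 0); ACTG 320 rows get 1."""
    X = sm.add_constant(stacked[v_cols])
    fit = sm.GLM(1 - stacked["s"], X, family=sm.families.Binomial()).fit()
    pi = np.asarray(fit.predict(X))              # Pr(S = 0 | V)
    return np.where(stacked["s"].to_numpy() == 0, (1 - pi) / pi, 1.0)

# Weight check from the text: under a correct model, the weighted ACTG 175
# sample size should approximate the ACTG 320 sample size.
# iow = inverse_odds_of_sampling_weights(stacked, ["age", "cd4"])
# n175_w = iow[stacked["s"] == 0].sum(); n320 = int((stacked["s"] == 1).sum())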

Step 4: Diagnostic for the fusion

As the dual therapy risk estimators in equations 3 and 4 target the same quantity (i.e., the risk under dual therapy in the ACTG 320 population), the difference between these estimators should be close to zero if all of the assumptions above hold, including correct model specification. Therefore, we propose comparing the estimated risk between the shared arms across trials as a diagnostic for the data fusion. Differences between the two estimators can be assessed graphically or statistically. Twister plots of the differences in estimated risks between the shared arms can be used for graphical comparisons.[Zivich2021] The extent to which the shared trial arms differ may be quantified by the area between the estimated risk curves over time.[Cole2009a] A permutation-type test (Appendix 1) can be used to assess whether this area is larger than expected by chance. Briefly, a distribution of areas between the risk functions is obtained by repeatedly permuting the trial indicator at random and calculating the area for each permutation. The observed area between the risk functions is then compared to the areas from the set of permutations, where the proportion of permuted data sets with an area greater than the observed area corresponds to the P value for the test. Small P values indicate that, even after adjustment by inverse probability of censoring and inverse odds of sampling weights, the two shared arms remain incomparable, bringing into question the veracity of the assumptions underlying the fusion estimator. A simulation study was conducted to assess the performance of the proposed test (Appendix 2). As expected for permutation-based tests, the type 1 error was controlled at the nominal level (Table A1).

Applying this diagnostic test with 10,000 permutations before the ACTG 175 dual therapy arm was re-weighted to the ACTG 320 population, we observed a sizable difference in the estimated risk functions for the composite outcome (Figure 1A). After the ACTG 175 arm was re-weighted using the inverse odds of sampling weights, the difference between the risks of the dual therapy arms was reduced but not eliminated (Figure 1B). As CD4 count is predictive of the composite outcome and, as in Breskin et al. (2021),[Breskin2021] was not initially included in the sampling model, differences in baseline CD4 likely explain why the ACTG 175 dual therapy arm (where baseline CD4 counts are higher) appears beneficial. Adding CD4 count to the sampling model did not correct this discrepancy (Figure 1C), and the inverse odds weighted sample size $\hat{n}_{175}$ was nearly double $n_{320}$, likely due to deterministic nonpositivity of $\Pr(S = 0 \mid V)$ for lower values of CD4. Restricting the analyses to the region of overlap in CD4 counts between the trials (50-300 cells/mm^3) reduced the differences between the dual therapy arms (Figure 1D). We proceeded with the CD4-restricted population for estimation (Table 2) and note that the restriction by CD4 alters the target population from the population defined in Step 1.

Step 5: Estimation

The triple versus mono therapy risk difference at time $t$ was estimated for the CD4-restricted target population via

$\hat{\psi}(t) = \hat{\psi}_{32}(t) + \hat{\psi}_{21}(t)$

A closed-form variance estimator is provided in Appendix 3.[Breskin2021] In Appendix 2, performance of the estimator is shown in a brief simulation study. The estimator for the risk difference had little bias and the variance estimator performed well (Tables A2-A3).

Differences in the estimated risk of the composite outcome under triple versus mono therapy are depicted in Figure 2. These results indicate that assignment to triple therapy is preferred over mono therapy for the one-year risk of AIDS, death, or a 50%+ decline in CD4. At one year of follow-up, the risk of the composite outcome was 21 percentage points lower (95% Wald confidence interval: -34%, -7%) when assigned triple therapy compared to mono therapy in the ACTG 320 population with baseline CD4 counts between 50-300 cells/mm^3.

Discussion

Here, we considered bridged treatment comparisons in the context of antiretroviral therapy for the prevention of AIDS, death, or decline in CD4 cell counts. Through a shared intermediate arm, the discordant arms of two trials were compared. The described fusion estimator accounts for observed differences between the two trial samples and possible informative censoring through inverse probability weighting. To evaluate the proposed fusion, we described a novel diagnostic that compares the estimated risks of the shared arms. As expected, we found that assignment to triple therapy is preferred to mono therapy in terms of preventing AIDS, death, or a 50%+ decline in CD4 for a one-year period. However, the target population had to be restricted due to deterministic non-positivity between the two trials.

Bridged comparisons between treatments may be necessary in some settings where ethical or feasibility considerations rule out a direct comparison. Another approach to making bridged treatment comparisons is network meta-analysis.[Lumley2002] Network meta-analyses compare study-specific estimates through shared trial arms between studies, akin to the use of the shared arm in the fusion design. However, network meta-analysis implicitly assumes marginal exchangeability of sampling (i.e., $T^a \perp S$). To check the validity of a comparison in network meta-analysis, multiple comparisons can be made through several different studies, possibly through different shared arms. Comparisons through different studies or intermediates that lead to substantively different conclusions regarding the contrast of interest, referred to as incoherence,[Lumley2002] may indicate that the marginal exchangeability assumption is incorrect. The fusion design detailed here allows for a weaker exchangeability assumption between trials but necessitates the use of individual-level data.

When there is non-adherence, there are two possible parameters of interest: the intent-to-treat and per-protocol parameters. The intent-to-treat parameter compares outcomes following treatment assignment, regardless of treatment received, whereas the per-protocol parameter reflects both assigned treatment and adherence.[Rudolph2020] Here, focus was on the intent-to-treat parameter. A per-protocol extension of the bridging estimator is discussed in Breskin et al.[Breskin2021] We were unable to estimate the per-protocol parameter due to the unavailability of adherence data for ACTG 175.

In the motivating example, the proposed diagnostic indicated that naively fusing data from the two ACTG trials may not be valid, which was expected due to known differences in CD4 cell counts. While the differences in eligibility by CD4 count are apparent in the inclusion criteria, data sources from different places or times may differ in ways not readily apparent in inclusion or exclusion criteria. The proposed diagnostics may be helpful in other bridged treatment comparisons, since differences between populations unaccounted for in estimation can result in biased estimates of the parameter. The proposed graphical diagnostic and permutation test comparing the shared arms can be used to assess whether at least one of the data fusion assumptions (i.e., the identification assumptions or correct specification of the nuisance models) does not hold. The proposed diagnostic is analogous to testable implications of assumptions proposed for generalizability and meta-analyses.[Stuart2011, Dahabreh2020] It should be noted, however, that the lack of a difference between the shared arms is not sufficient to conclude that all assumptions hold true. As a counterexample, consider the case where the censoring models for the shared arms are correctly specified but at least one of the censoring models for the other treatment arms is incorrectly specified. In this case, the proposed diagnostics will fail to detect the model misspecification. Alternatively, if multiple variables required for exchangeability of sampling were omitted from the sampling model, the effects of those missing variables could cancel or mask each other such that no difference between the shared arms is detected. These counterexamples are akin to limitations of other diagnostics, such as comparison of the g-formula under the natural course to the observed data,[Keil2014] or checking for imbalances in the pseudo-population when using inverse probability weighting.[Austin2015]

There are several possible areas of future methodological developments related to bridged treatment comparisons. First, bridged comparisons could further incorporate time-varying covariates predictive of drop-out and the outcome. In the context of the HIV example, the presence of side-effects of protease inhibitors may drive differential drop-out.[Carr1998] Second, a more efficient estimator relative to the described inverse probability weighting estimator may be possible by incorporation of a model for the outcome (i.e., an augmented inverse probability weighting estimator).[Daniel2014, Rotnitsky2006]

Lastly, the estimation approach relied on parametric or semiparametric models to estimate the corresponding nuisance parameters. While restricted quadratic splines were used to model age and the censoring models were estimated separately by trial, such modeling choices may not be sufficiently flexible to meet the correct model specification assumption in general. Instead, more flexible data-adaptive or machine learning algorithms could be considered to estimate the nuisance parameters.[Zivich2021a, Naimi2017]

Acknowledgments

This work was supported in part by R01-AI157758 (PNZ, SRC, BES, JKE), P30-AI050410 (SRC, MGH, BES), K01AI125087 (JKE), and T32-AI007001 (PNZ). The authors would like to thank the HIV/STI trainees at the University of North Carolina at Chapel Hill for their feedback and suggestions. Corresponding code is available at https://github.com/pzivich/publications-code


Appendix

Figure A1: Distribution of baseline CD4 cell counts by AIDS Clinical Trial Group (ACTG) trial. Plots were created using a Gaussian kernel density estimator stratified by trial. The gray shaded region indicates baseline CD4 between 50-300 cells/mm^3.

Appendix 1: Proposed fusion diagnostics

As a diagnostic, we propose comparing the difference between the estimated risk functions of the shared intermediate arms between the two data sources. As implied by the identification assumptions, the difference between the shared arms is expected to be approximately zero. Potential differences can be assessed visually using a twister plot[Zivich2021] or statistically using the following permutation test.

Permutation test

For the permutation test, the (positive) area between the two estimated risk functions is proposed as the test statistic. To determine whether the observed area between the curves is larger than would be expected by chance, the following permutation procedure can be used.

The permutation procedure consists of randomly re-assigning the trial indicator while holding everything else fixed. With the re-assigned trial indicator, the estimators can be reapplied to the data and the area between the estimated risk functions calculated. The indicator re-assignment and area calculation are repeated many times (at least 1000 is recommended, with 10,000 being preferred). The observed area can then be compared to this set of permuted data set areas to determine whether the observed area is larger than expected by chance. The P value for the permutation test is given by the proportion of permutations with an area greater than the area for the observed data.

The proposed permutation test can be implemented using the following algorithm:

calculate the estimated risk functions for the shared arms
calculate the observed area between the estimated risk functions
for k = 1, ..., K do
     randomly shuffle the trial indicator S and denote the permuted indicator S*
     estimate the risk functions over time using the permuted data, where the sampling weights are re-estimated for each participant under S*
     calculate the area between the permuted risk functions
     save the area for permutation k
end for
calculate the P value as the proportion of permutations with an area larger than the observed area

procedure Area between risk functions
     (1) Align the risk functions by the union of the unique event times
     (2) Subtract the current event time from the next event time
     (3) Subtract the estimated risks of the shared arms at each unique event time
     (4) Multiply the time differences from (2) by the absolute value of the risk differences from (3)
     (5) Sum the products from (4)
end procedure
Algorithm 1: Permutation test
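A compact Python sketch of Algorithm 1 follows; here `estimate_shared_arm_curves` is a hypothetical wrapper around the weighted risk estimators above that returns the step functions (times and risks, each starting at time 0 with risk 0) for the two shared dual therapy arms:

import numpy as np

def area_between(times1, risk1, times2, risk2):
    """Positive area between two right-continuous risk step functions
    (the procedure in Algorithm 1)."""
    grid = np.union1d(times1, times2)                    # (1) union of event times
    r1 = np.asarray(risk1)[np.searchsorted(times1, grid, side="right") - 1]
    r2 = np.asarray(risk2)[np.searchsorted(times2, grid, side="right") - 1]
    dt = np.diff(grid)                                   # (2) time differences
    return float(np.sum(dt * np.abs(r1 - r2)[:-1]))      # (3)-(5)

def permutation_p_value(data, estimate_shared_arm_curves, k=10_000, seed=2022):
    """P value: shuffle the trial indicator S, re-estimate the shared-arm
    curves (including sampling weights), and compare areas."""
    rng = np.random.default_rng(seed)
    observed = area_between(*estimate_shared_arm_curves(data))
    areas = np.empty(k)
    for j in range(k):
        permuted = data.copy()
        permuted["s"] = rng.permutation(permuted["s"].to_numpy())
        areas[j] = area_between(*estimate_shared_arm_curves(permuted))
    return float(np.mean(areas > observed))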

An illustration of the calculation of the area between the risk functions is given below. Note that the values of $t_k$ are the union of the unique event times from the two data sources.

k   t_k   Risk, arm 1   Risk, arm 0   (2) Time difference   (3) Risk difference
1   0.0   0.00          0.00          0.2 - 0.0 = 0.2        0.00
2   0.2   0.00          0.07          0.4 - 0.2 = 0.2       -0.07
3   0.4   0.10          0.07          1.2 - 0.4 = 0.8        0.03
4   1.2   0.20          0.07          1.7 - 1.2 = 0.5        0.13
5   1.7   0.35          0.30          2.4 - 1.7 = 0.7        0.05
6   2.4   0.35          0.45          2.5 - 2.4 = 0.1       -0.10
7   2.5   0.45          0.45          3.0 - 2.5 = 0.5        0.00
8   3.0   0.45          0.55          0                     -0.10

(5) Area = 0.2(0.00) + 0.2(0.07) + 0.8(0.03) + 0.5(0.13) + 0.7(0.05) + 0.1(0.10) + 0.5(0.00) + 0(0.10) = 0.148
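The final sum can be checked with a few lines using the sketch above (values taken from the table):

import numpy as np

t = np.array([0.0, 0.2, 0.4, 1.2, 1.7, 2.4, 2.5, 3.0])
d = np.array([0.00, -0.07, 0.03, 0.13, 0.05, -0.10, 0.00, -0.10])
area = np.sum(np.diff(t) * np.abs(d)[:-1])
print(round(area, 3))  # 0.148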

Appendix 2: Simulation study

To assess the type 1 error and power of the proposed diagnostic test, a simulation study was conducted mimicking the ACTG example with mono therapy ($a = 1$), dual therapy ($a = 2$), and triple therapy ($a = 3$). The parameter of interest compared triple therapy to mono therapy.

First, observations of the target population ($S = 1$) were generated with two baseline covariates: history of injection drug use (IDU) and baseline CD4 cell count. Next, observations for the second population ($S = 0$) were generated as follows: candidate observations were drawn from the preceding IDU and CD4 distributions, trial membership was assigned as a function of these covariates, and a random subset of the observations meeting the selection criterion was retained. Therefore, this second population is a biased sample of the target population. Treatment was then randomly assigned within each trial following trial-specific Bernoulli distributions.

Event times under each treatment were simulated by inverse transform sampling from a Weibull model, where the random draw is from a uniform distribution over (0, 1) and the Weibull parameters depended on assigned treatment and the baseline covariates. The event time under the observed treatment was then obtained via causal consistency, $T_i = \sum_a I(A_i = a) T_i^a$. Censoring times were generated from a Weibull model that did not depend on the covariates, and all observations were administratively censored at one year. Therefore, the observed time and event indicator were $T_i^* = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$, with $C_i$ incorporating the administrative censoring.

Two variations of the fusion estimator were compared: one with an incorrectly specified sampling model and one with a correctly specified sampling model. The incorrectly specified fusion estimator used a sampling model that included only baseline CD4 and was estimated using logistic regression. Since history of injection drug use is related to both the time to the event and sampling, this sampling model is incorrectly specified. The probability of remaining uncensored was estimated via the Nelson-Aalen estimator, since the censoring mechanism does not depend on any covariates.

The correctly specified fusion estimator used a sampling model that included both baseline CD4 and history of injection drug use. This fusion estimator assumes that the study samples are exchangeable conditional on CD4 and injection drug use, which corresponds to the true data generating mechanism. Therefore, this sampling model is correctly specified.
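A minimal sketch of a data generating mechanism in this spirit is below; the distributions and all numeric parameters are hypothetical placeholders, not the values used in the actual simulation study:

import numpy as np

rng = np.random.default_rng(2022)

def generate_trial(n, s):
    """Hypothetical DGM sketch: IDU ~ Bernoulli, CD4 ~ truncated normal,
    Weibull event times via inverse transform, covariate-free Weibull censoring."""
    idu = rng.binomial(1, 0.2, n)
    cd4 = np.clip(rng.normal(250.0, 100.0, n), 20.0, 500.0)
    a = rng.choice([2, 3] if s == 1 else [1, 2], size=n)     # trial-specific arms
    lam = np.exp(5.5 + 0.3 * a - 0.4 * idu + 0.002 * cd4)    # placeholder scale
    t_event = lam * (-np.log(rng.uniform(size=n))) ** (1 / 1.5)
    t_cens = 400.0 * (-np.log(rng.uniform(size=n)))          # no covariates
    t_star = np.minimum.reduce([t_event, t_cens, np.full(n, 365.0)])
    delta = (t_event <= np.minimum(t_cens, 365.0)).astype(int)
    return {"s": s, "idu": idu, "cd4": cd4, "a": a, "t_star": t_star, "delta": delta}

# Note: in the actual simulation the second population (S = 0) is a biased
# sample of the target population (selection depends on the covariates);
# that selection rule is omitted from this sketch.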

We simulated two different sample sizes. For each sample size, 1000 simulated data sets were generated, and 1000 permutations were used for the diagnostic test. Type 1 error and power were estimated. Type 1 error was estimated by the proportion of P values below a selected significance level under the correctly specified sampling model. Power was estimated by the proportion of P values below a selected significance level under the incorrectly specified model. Significance levels of 0.05, 0.10, and 0.20 were assessed. In addition to assessment of the proposed diagnostic, we also assessed bias, empirical standard error, standard error ratio, and 95% Wald confidence interval coverage for the risk difference at several time points. Bias was defined as the mean of the difference between the estimated risk difference in each simulated data set and the truth. The true risk difference was estimated from the potential outcomes of 10,000,000 generated observations. Empirical standard error was estimated by the standard deviation of the point estimates across simulations. The standard error ratio was defined as the empirical standard error divided by the mean of the estimated standard errors across the data sets. Confidence interval coverage was defined as the proportion of confidence intervals that contained the true risk difference. All simulations were conducted using Python 3.6.8 with the following libraries: NumPy,[Harris2020] SciPy,[Virtanen2020] pandas,[Mckinney2010] statsmodels,[Seabold2010] and matplotlib.[Hunter2007]

Type 1 error was near the nominal rate for all significance levels at one of the simulated sample sizes but was below the nominal rate at the other (Table A1). Power was 1, meaning that incorrectly specified sampling models were always detected.

When the sampling model was correctly specified, there was little bias in the risk difference (Tables A2-A3). The standard error estimators closely approximated the empirical standard error, as indicated by standard error ratios close to 1, and confidence interval coverage was close to 95% across all time points. When the sampling model was incorrectly specified, bias was present for the risk difference. However, confidence interval coverage was close to nominal levels at most time points.

                alpha = 0.05      alpha = 0.10      alpha = 0.20
                Type 1   Power    Type 1   Power    Type 1   Power
Sample size A   0.03     1.00     0.06     1.00     0.13     1.00
Sample size B   0.05     1.00     0.11     1.00     0.20     1.00

Type 1 error was estimated by the proportion of P values below the corresponding significance level for the correctly specified sampling model. Power was estimated by the proportion of P values below the corresponding significance level for the incorrectly specified sampling model. Each row corresponds to one of the two simulated sample sizes.

Table A1: Estimated type 1 error and power of the proposed diagnostic permutation test
Bias ESE SER 95% CI Coverage
Day 91
Incorrect sampling model -0.022 0.053 1.06 94.2%
Correct sampling model 0.001 0.031 0.99 95.9%
Day 183
Incorrect sampling model -0.024 0.067 1.08 94.8%
Correct sampling model 0.001 0.040 1.02 96.2%
Day 274
Incorrect sampling model -0.023 0.077 1.14 96.1%
Correct sampling model -0.002 0.048 1.03 96.7%
Day 365
Incorrect sampling model -0.020 0.087 1.13 96.9%
Correct sampling model -0.002 0.055 1.03 96.4%

ESE: empirical standard error of the estimator; SER: standard error ratio; CI: confidence interval. Bias was defined as the mean of the estimated risk difference at the corresponding time minus the true risk difference. ESE was defined as the standard deviation of the simulation estimates at the corresponding time. SER was the ESE divided by the mean of the estimated standard errors across all simulations. 95% CI coverage was defined as the proportion of 95% CIs containing the true risk difference.
The incorrectly specified sampling model included only CD4. The correctly specified sampling model included CD4 and injection drug use.

Table A2: Simulation study results for the fusion estimators with the first of the two simulated sample sizes
Bias ESE SER 95% CI Coverage
Day 91
Incorrect sampling model -0.021 0.038 1.03 92.3%
Correct sampling model 0.000 0.023 1.01 95.5%
Day 183
Incorrect sampling model -0.024 0.048 1.06 93.6%
Correct sampling model 0.000 0.029 1.02 95.5%
Day 274
Incorrect sampling model -0.023 0.055 1.06 95.3%
Correct sampling model -0.001 0.034 1.03 96.1%
Day 365
Incorrect sampling model -0.018 0.062 1.03 95.9%
Correct sampling model 0.000 0.039 1.02 95.4%

ESE: empirical standard error of the estimator; SER: standard error ratio; CI: confidence interval. Bias was defined as the mean of the estimated risk difference at the corresponding time minus the true risk difference. ESE was defined as the standard deviation of the simulation estimates at the corresponding time. SER was the ESE divided by the mean of the estimated standard errors across all simulations. 95% CI coverage was defined as the proportion of 95% CIs containing the true risk difference.
The incorrectly specified sampling model included only CD4. The correctly specified sampling model included CD4 and injection drug use.

Table A3: Simulation study results for the fusion estimators with the second of the two simulated sample sizes

Appendix 3: Closed-form variance estimator

Breskin et al. (2021) proposed a closed-form variance estimator for the fusion estimator described above; the full expression and the definitions of its components are given in Breskin et al.[Breskin2021]