
Unbiased and Efficient Estimation of Causal Treatment Effects in Cross-over Trials

We introduce causal inference reasoning to cross-over trials, with a focus on Thorough QT (TQT) studies. For such trials, we propose different sets of assumptions and consider their impact on the modelling strategy and estimation procedure. We show that unbiased estimates of a causal treatment effect are obtained by a G-computation approach in combination with weighted least squares predictions from a working regression model. Only a few natural requirements on the working regression and weighting matrix are needed for the result to hold. It follows that a large class of Gaussian linear mixed working models lead to unbiased estimates of a causal treatment effect, even if they do not capture the true data generating mechanism. We compare a range of working regression models in a simulation study where data are simulated from a complex data generating mechanism with input parameters estimated on a real TQT data set. In this setting, we find that for all practical purposes working models adjusting for baseline QTc measurements have comparable performance. Specifically, this is observed for working models that are by default too simplistic to capture the true data generating mechanism. Cross-over trials and particularly TQT studies can be analysed efficiently using simple working regression models without biasing the estimates for the causal parameters of interest.





1 Introduction

A TQT study is an essential component of drug development, ensuring drug safety for patients. Therefore, it is a regulatory requirement to conduct such trials [6]. These trials have complicated designs in an attempt to minimize the sample size. This complexity has in turn led to a long-standing debate and several suggestions on how best to model the resulting data, and in particular how to use baseline measurements [15, 13, 20, 30, 19]. In parallel with these developments, the causal inference literature, and recently also official regulatory recommendations, have increased the focus on clearly defining what we are actually trying to estimate. This has led to the endorsement of the so-called estimand framework within regulatory guidance documents and the causal roadmap concept within the causal inference literature [11, 33].

One of the central points made in both is that the research question should be definable in terms of the trial data and not just in terms of a specific model. That said, we still want to use models in order to gain efficiency or eliminate bias. This is in line with recent regulatory guidance documents that encourage the active use of baseline variables in randomized trials to gain precision in estimated treatment effects [4, 5].

The developments in this paper aim to support this push towards clearer and more focused statistical procedures. In particular, the causal inference approach we present facilitates a clear and transparent definition of our target of estimation in cross-over trials and specifically in TQT studies.

Similar developments are already available in the literature for one-period trials. Here, standard estimators have been shown to be unbiased for causal parameters [29, 34]. Moreover, appreciable gains in efficiency compared to unadjusted estimators have been demonstrated [28, 9, 1]. We provide similar results for standard estimators in cross-over trials, and in particular in TQT studies.

For a standard TQT study, healthy volunteers [11] are enrolled with the purpose of obtaining electrocardiograms (ECGs) from each subject under different treatment conditions. An ECG measures the electrical output from the heart, resulting in replicates of graphical outputs as illustrated in Figure 1. The output is characterized by different waves, complexes, and intervals. In Figure 1 we see the P, Q, R, S, and T waves. The interval from the beginning of the Q wave to the end of the T wave is called the QT interval, and is measured in milliseconds (ms). The QT interval measures the time it takes the heart to repolarize and prepare for the next beat. The longer the QT interval, the longer the time between heart beats, and the less oxygen is transported to cells in the body. Specifically, prolongation of the QT interval has been shown to be related to an increased risk of Torsades de Pointes, a malignant ventricular arrhythmia. Thus, it is undesirable for the QT interval to be prolonged due to drug exposure.

Figure 1: ECG.

The length of the QT interval is positively associated with the length of the RR interval, i.e. the interval from the R wave on an ECG until the next R wave on the ECG. Therefore, QT intervals are standardized in order to get a QTc (QT corrected) measurement corresponding to a particular length of the RR interval (typically 1 second). An example of a commonly used correction is the Fridericia correction, which is given by QTcF = QT / RR^(1/3), with the QT interval measured in milliseconds and the RR interval in seconds.
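As a minimal sketch, the Fridericia correction QTcF = QT / RR^(1/3) (QT in milliseconds, RR in seconds) is a one-liner:

```python
def fridericia_qtc(qt_ms, rr_s):
    """Fridericia correction: QTcF = QT / RR**(1/3), QT in ms, RR in seconds."""
    return qt_ms / rr_s ** (1.0 / 3.0)

# at RR = 1 s (heart rate 60 bpm) the correction leaves QT unchanged
print(fridericia_qtc(400.0, 1.0))  # 400.0
```

At faster heart rates (RR below 1 second) the correction scales the observed QT upward toward its 1-second-RR equivalent.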

The purpose of the statistical analysis in a TQT study is to formally assess if clinically relevant prolongation is present based on the QTc measurements [21].

TQT studies are predominantly conducted as cross-over trials in an attempt to lower sample size and eliminate between subject variation by paired comparisons of different treatments. In a cross-over trial, each subject is randomized to one of several treatment sequences that uniquely determines the treatment they receive at any given treatment period throughout the trial. Within each period, a baseline QTc measurement is obtained just prior to treatment, followed by a number of post treatment QTc measurements obtained at pre-defined time points following treatment (see Figure 2 for the two-period case). Each pair of consecutive periods will be separated by a washout period to minimize the risk of carry-over effects, i.e. any effects of treatment from the previous period on the QTc measurements in the current period. From a practical point of view, the main challenge with cross-over designs is that the washout period needs to be tailored to the half-life of the drug concentration to reflect proper washout of the drug. Specifically, if the half-life is long, an even longer washout period (typically five half-lives) is required in order to avoid carry-over effects. Ultimately, this may impose a very long study period on the subjects enrolled in the study. This is clearly not optimal and may also prove to be a challenge with regard to subject retention. In such situations, a parallel arm design may be more feasible [6].

Figure 2: cross-over trial design.

The outline of the paper is as follows. In the next section we introduce the basic notation used in the paper and in that context define the fundamental assumption of no carry-over in a cross-over trial. We briefly introduce the concept of counterfactual outcomes and define the causal quantities of interest alongside the data assumptions needed to identify these quantities directly from the data. Section 3 is dedicated to deriving and assessing the theoretical performance of a number of causally motivated estimators. The section also contains the main result of this paper, which shows that a large class of working models still result in unbiased estimates of the causal target parameters even if they do not capture the true data generating mechanism. In Section 4, we analyse data from a real TQT study using a range of working models that in theory result in unbiased estimates of the causal target parameters according to the main result of this paper. Section 5 is dedicated to comparing the same working models in a simulation study cast around the data example in order to evaluate performance in a realistic scenario. We conclude the paper with a discussion in Section 6.

2 Notation and assumptions

In the following, we introduce the notation and causal assumptions needed in order to identify the target of estimation. Let denote the QTc measurement for subject in period at time , . Denote baseline measurement for subject in period by , and treatment by . TQT studies tend to have as many treatments as periods. Thus, we will denote treatments by , where corresponds to subject receiving placebo in period . Often we will suppress the since we assume the subjects are independent draws from the same distribution. Furthermore, let

denote the vector of post-baseline measurements in period


In line with the informed choice of washout period in TQT studies, we assume the wash-out period has been sufficient to ensure no carry-over effects. Under this assumption, our data can be described by the Directed Acyclic Graph (DAG) in Figure 3 in the two-period case. Note that the DAG has no arrows from baseline measurements to post-baseline measurements, since we do not expect a causal effect of the baseline measurements. Instead, we expect any association between baseline measurements and post-baseline measurements to arise from the latent variables, , , and from Figure 3. The latent variable reflects the dependence owing to measurements being from the same subject, whereas the latent variables, and reflect the dependence between measurements from the same period, i.e. temporary traits. Despite the lack of arrows between baseline and post-baseline measurements, it still makes sense to adjust for baseline measurements, for example in a regression model, because we do not observe the latent variables, in which case the baseline measurements act as proxies for the latent variables.

Figure 3: , , = subject specific latent variables, R = Randomization.

The DAG in Figure 3 implies the following about the data distribution:

Assumption 1 (No carry-over).

Let , and likewise for and , and let be the density for variable and likewise for all other variables. The distribution of our data satisfies the Markov factorization property with respect to the DAG in Figure 3 [22], i.e., the joint density of our data can be written as

In accordance with Figure 3, Assumption 1 states that the conditional distribution of in period only depends on treatment in period and the latent variables and .

In the following, we take a causal approach to TQT studies. By doing so, we provide a clearly stated research question that is completely disentangled from the modelling of the data. This exercise provides complete clarity on what assumptions about the data generating mechanism are necessary to answer the research question and sets them apart from purely technical assumptions made during the modelling stage of the estimation.

Let denote the post-baseline QTc measurements we would have made in period if, possibly counter to fact, the subject had received treatments . Note that under Assumption 1, the QTc measurements in period only depend on the treatment in period . I.e., for all treatment regimes and periods. Thus, from here on, will denote the potential QTc measurements in period if the subject, possibly counter to fact, had received treatment in period .

If we were particularly interested in treatment and had ample resources in terms of money, time, and subjects, we would have made a two-arm trial and used

as the causal contrasts of interest. Note that the first period in a cross-over trial corresponds to such a trial and as a consequence the targeted causal contrasts can be identified as:

We can then easily estimate the contrasts based on data from the first period only, for example by
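For instance, at a fixed post-baseline time point, a first-period-only estimator is simply the difference in arm means; a minimal numpy sketch (array names are ours, not the paper's):

```python
import numpy as np

def first_period_effect(y1, a1, active=1, placebo=0):
    """Difference in means of period-1 outcomes between the arm randomized
    to `active` and the arm randomized to `placebo` in period 1."""
    y1, a1 = np.asarray(y1, float), np.asarray(a1)
    return float(y1[a1 == active].mean() - y1[a1 == placebo].mean())

# toy period-1 data for 6 subjects; 0 = placebo, 1 = active
y1 = [402.0, 398.0, 410.0, 395.0, 412.0, 401.0]
a1 = [0, 0, 0, 1, 1, 1]
print(first_period_effect(y1, a1))
```

Randomization alone makes this unbiased, but it discards everything collected after period one.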

Clearly this is not an efficient use of all the data collected in the cross-over trial. However, to enable full use of the data, stricter assumptions than Assumption 1 are needed. Specifically, we need to make assumptions about the distributions of the post-baseline measurements. To this end, it would be natural to assume the same treatment effect in all periods:

Assumption 2 (Same treatment effect).

for all and . I.e. the treatment effect is the same in all periods.

Assumption 2 enables estimation of one overall treatment effect across all periods. A special case where Assumption 2 holds is when the distribution of period specific data does not depend on period:

Assumption 3 (Same distribution).

for all and .

Assumption 3 is rather restrictive compared to Assumption 2 in that it a priori excludes any systematic effect due to period. This contradicts current modelling and design practice in cross-over trials, where potential period effects are modelled and subjects are randomized according to a Latin square in order to balance any period effect [31].

As an alternative, one can assume that the conditional distribution of the post-baseline measurements, given baseline measurements and treatment, is the same in all periods:

Assumption 4 (Same relationship).

for all and .

This facilitates a model fit based on data from all periods, which may in turn be used to infer the period specific causal contrast .

3 Estimation

The causal framework from the last section enables us to clearly define our target of estimation. In this section, we provide inference procedures tailored to assess the causal target parameter in the context of a TQT study. For brevity, we only consider the case where the average causal treatment effect is assumed to be the same in all periods. I.e., we specifically develop estimators for under Assumption 2. An outline of how to estimate period specific average causal effects without Assumption 2 is provided in Section 6.

Under Assumption 2 the fact that subjects receive both the placebo and active treatment initially motivates the following simple non-parametric estimator:

where the indicator function, , maps elements of to one and is zero otherwise. Note that this, and all other estimators throughout this paper, has in the subscript to indicate that it is the estimator for the post-baseline measurement time point . Naturally, an effect will be estimated for each . Note that simply takes the outcomes in the periods where the subjects receive the treatment of interest, subtracts the outcomes in the placebo periods, and averages across subjects. It is an unbiased estimator for the treatment effects of interest due to the randomization. However, it only uses post-baseline measurements, and may therefore lack precision. Alternatively, one can fit a working regression model to the data and thereby bring baseline measurements into play. We specifically assume that the possible outcome predictions in this working model are given by , where is a vector of regression parameters. Under Assumption 3 it is possible to ignore periods and use the simpler working regression model . Such a working model can be used to estimate the causal effects of interest simply by plugging into the G-computation formula [27]:

This estimator uses covariate information to gain efficiency. Moreover, in situations with missing endpoint data, the estimator is still unbiased under the missing at random assumption given that the regression model is correctly specified. In comparison, the endpoint data has to be missing completely at random for the simple estimator to be unbiased.
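Both estimators above can be sketched in a few lines. The following uses a hypothetical long-format layout (outcomes at one fixed post-baseline time point, treatment codes, subject ids, baselines; all names are ours) with a plain linear working model standing in for whatever working regression is actually chosen:

```python
import numpy as np

def nonparametric_effect(y, treat, subject, active=1, placebo=0):
    """Average within-subject difference at one fixed post-baseline time
    point: outcome under `active` minus outcome under `placebo`."""
    y, treat, subject = (np.asarray(v) for v in (y, treat, subject))
    diffs = [y[(subject == s) & (treat == active)].mean()
             - y[(subject == s) & (treat == placebo)].mean()
             for s in np.unique(subject)]
    return float(np.mean(diffs))

def gcomp_effect(y, treat, baseline, active=1):
    """G-computation with working model y ~ 1 + baseline + 1{treat == active}:
    average prediction with everyone on `active` minus everyone on placebo."""
    y = np.asarray(y, float)
    b = np.asarray(baseline, float)
    ind = (np.asarray(treat) == active).astype(float)
    X = np.column_stack([np.ones_like(b), b, ind])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    X1, X0 = X.copy(), X.copy()
    X1[:, 2], X0[:, 2] = 1.0, 0.0     # predict under each treatment
    return float((X1 @ beta - X0 @ beta).mean())

# toy long-format data: 2 subjects x 2 periods, one time point
subject = [1, 1, 2, 2]
treat = [0, 1, 1, 0]                  # 0 = placebo, 1 = active
baseline = [398.0, 402.0, 405.0, 399.0]
y = [599.0, 606.0, 607.5, 599.5]
print(nonparametric_effect(y, treat, subject))
print(gcomp_effect(y, treat, baseline))
```

With this simple working model the G-computation contrast collapses to the fitted treatment coefficient; with richer models (e.g. baseline-by-treatment interactions) the averaging over predictions is what recovers the marginal effect.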

As presented, the above developments depend on the working regression model capturing the true mean value structure. Since this is by no means warranted, it is important to mitigate the impact, in terms of bias, of a misspecified working model. Such a mitigation can be successfully achieved with the following semi-parametric estimator:

The estimator is derived in Appendix C. It uses covariates to gain precision, but is unbiased due to the independence between and . This independence is ensured by the randomization and implies that the last term has a mean of zero. Thus, the estimator has the same mean as the non-parametric estimator, namely the true causal effect. The following theorem shows that for certain types of working models.

Theorem 1 (Unbiasedness of ).

Assume the data structure of this paper, and assume we use a working model with post baseline measurements as outcome, and with main effects of treatment specific to each post baseline time point. Let be the vector of all post baseline measurements for subject , and let be the vector of all predictions for subject from the working model. Assume the working model parameters, , are estimated from the following weighted least squares estimating equation:


where is the design matrix for subject , and is a weight matrix of the form


where A and B are matrices. If and are non-singular, then


The proof is provided in Appendix A. ∎

We know that is unbiased by construction. Hence, Theorem 1 implies unbiased estimation when using with a working model satisfying the conditions in Theorem 1, regardless of the ability of the working model to capture the true data generating mechanism.

Previously, the optimal choice of model for TQT studies has been debated extensively, based on the premise that such a model should provide a valid fit to the data. Theorem 1 implies the existence of a whole range of working models that may be used to provide unbiased estimates of causal treatment effects without such considerations.

In this context, it is also worthwhile to note that many of the models that are traditionally applied in TQT studies satisfy the conditions in Theorem 1. We summarize this in the following corollary.

Corollary 1 (Gaussian linear mixed models).

Assume the working model is a Gaussian linear mixed model with main effects of treatment specific to each post baseline time point, and a correlation structure satisfying (2). Then the conclusion of Theorem 1 holds if model parameters are estimated by maximum likelihood or restricted maximum likelihood estimation.


The estimating equation for Gaussian linear mixed models is given on page 10 of [12] and can be rewritten to (1). ∎

When the working model is a Gaussian linear mixed model, the requirement (2) corresponds to modelling the dependence within periods by some matrix , and the dependence between periods by a matrix . For example, might be a matrix of constants corresponding to a random subject effect, and can be modelled more flexibly, for example with an unstructured covariance structure, or an AR(1) covariance structure.
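With p periods and m post-baseline time points per period, a covariance of this type can be assembled with Kronecker products. A numpy sketch (the names, the assignment of blocks to within/between roles, and all parameter values are illustrative assumptions):

```python
import numpy as np

def subject_covariance(within, between, n_periods):
    """Per-subject working covariance: `within` on the diagonal period
    blocks, `between` on all off-diagonal period blocks."""
    p = n_periods
    I, J = np.eye(p), np.ones((p, p))
    return np.kron(I, np.asarray(within)) + np.kron(J - I, np.asarray(between))

# example: 2 periods, 3 time points, random-subject-effect structure
m, tau2 = 3, 4.0                          # tau2: between-subject variance
within = 10.0 * np.eye(m) + tau2          # compound symmetry within a period
between = tau2 * np.ones((m, m))          # constant block between periods
V = subject_covariance(within, between, n_periods=2)
print(V.shape)  # (6, 6)
```

Swapping the within-period block for an unstructured or AR(1) matrix changes nothing else in the construction.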

In the special case of a Gaussian linear mixed model, consider and a matrix of zeros. In this case, the working model is a standard linear regression model that ignores any dependence between observations. Corollary 1 then ensures that we are able to produce sound inference even with this simplistic model. We do, however, expect this working model to be less efficient than if we model the dependence structure in a Gaussian linear mixed model.

A particular example of a widely used working regression model is given by [21]:


Clearly the conditions of Theorem 1 are satisfied for the systematic part of this model, as it includes a main effect of treatment specific to each post baseline time point. Additionally, the covariance structure proposed in [21] is AR(1) within periods, with constant covariance assumed between observations on the same subject in different periods. Accordingly, the proposed covariance structure complies with (2). The estimates of the time specific effects of treatment equal the estimates obtained if we were to plug model (3) into . Thus, the estimates of are unbiased for the treatment effects of interest under arbitrary model misspecification.

In order to enable inference fully, we further need to characterize the large sample behaviour. This is well established if the targeted treatment effects appear as parameters in the model, and assuming that the model is correctly specified. However, in more complex models, the target treatment effect is not readily identified as a parameter specified in the model. Moreover, model based standard errors may not be appropriate, unless the model is correctly specified. The influence function of is derived in Appendix B under the assumption that the parameters are estimated using an M-estimator. For models not covered by Theorem 1, is still unbiased whereas may be biased. Therefore, it would be preferable to use in such cases. Accordingly, we also derive the influence function for in Appendix C.

One particular use of the above asymptotic results is when assessing QT prolongation in a TQT trial. In this context, the test for QTc prolongation for the drug of interest is carried out by use of an intersection-union test. I.e. the null-hypothesis is


and is some reasonable amount of QT prolongation, such as 10 ms [21]. Loosely speaking, the null-hypothesis dictates that there exists a time point where QT prolongation exceeds a prespecified clinically negligible threshold. Tests are carried out for each based on the asymptotic behaviour of the standardized estimates of , and the null is rejected if all of these tests are rejected.
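A sketch of the intersection-union decision rule, assuming normally distributed estimates, a 10 ms threshold, and a one-sided 5% level per time point (the level is our illustrative assumption):

```python
from statistics import NormalDist

def iut_no_prolongation(estimates, std_errors, delta=10.0, alpha=0.05):
    """Intersection-union test: conclude 'no relevant prolongation' only if
    every time point's one-sided upper confidence bound lies below delta."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return all(e + z * s < delta for e, s in zip(estimates, std_errors))

print(iut_no_prolongation([3.1, 4.0, 2.2], [1.5, 1.4, 1.6]))  # True
```

Because every component test must reject, the intersection-union test controls the level without any multiplicity adjustment across time points.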

In addition, it is common practice to assess if prolongation can be detected for the positive control. The corresponding null-hypothesis is

The test of this hypothesis needs to be adjusted for multiple testing. This adjustment should be made as efficiently as possible, which we can do because we can estimate the joint (asymptotic) distribution of our estimators [23, 10]. We can derive the joint asymptotic distribution of our estimators from the influence functions as


are the influence functions of the individual estimators. It follows that the asymptotic variance matrix of the joint distribution of our estimators equals


The estimation of targeted treatment effects as outlined above is based on well-known regression models, and, as we have seen, the targeted treatment effects may sometimes even be identified directly as regression parameters in those models. Theorem 1 then guarantees that the identified regression parameters are estimated without bias, irrespective of whether the regression model is correctly specified or not. This guarantee, however, does not extend to the model based variance matrix of the estimates. For this to be appropriate, one also needs to assume that the regression model is correct. The asymptotic variance matrix (4), on the other hand, is generally applicable. When the targeted treatment effects are identified as regression parameters, it may even be obtained by standard software [24].
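Given the estimated per-subject influence function values, the joint variance matrix of the stacked estimators is just an empirical second-moment matrix; a numpy sketch (assuming `phi` stores one row per subject and one column per estimator):

```python
import numpy as np

def joint_variance(phi):
    """Variance matrix of the stacked estimators: with influence functions
    phi_i (one row per subject), Var is approximately (1/n) E[phi phi^T],
    estimated here by (1/n^2) * sum_i phi_i phi_i^T."""
    phi = np.asarray(phi, float)
    n = phi.shape[0]
    return (phi.T @ phi) / n**2

rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 3))   # toy influence-function values, 3 estimators
V = joint_variance(phi)
print(V.shape)  # (3, 3)
```

The off-diagonal entries of this matrix are what make an efficient multiplicity adjustment across hypotheses possible.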

The asymptotic theory developed is applicable as is when the sample size is large. However, TQT trials and many cross-over trials have a rather small sample size, in which case the appropriateness of the asymptotic theory is questionable. For such cases, there is a substantial literature on how to improve asymptotic standard errors and confidence intervals [2, 3, 16, 25]. One simple improvement of the asymptotic standard errors from [3] is to use the influence function in the same way as we would in the context of a linear normal model, i.e. to estimate the variance by instead of , and use the 97.5% quantile of the t-distribution with degrees of freedom instead of a standard normal distribution to construct confidence intervals. These modifications vanish as the sample size increases, and will consequently lead to correct asymptotic inference. These simple small sample modifications are used in the analyses and simulations throughout this paper.
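A sketch of such an adjusted interval, assuming the linear-normal-model analogy uses an n/(n - p) variance inflation and a t quantile with n - p degrees of freedom (n subjects, p mean parameters; both choices are our assumptions for the analogy):

```python
import math
from scipy.stats import t as t_dist

def small_sample_ci(estimate, se_asymptotic, n, p, level=0.95):
    """t-based interval: inflate the asymptotic variance by n / (n - p) and
    use a t quantile with n - p degrees of freedom, mirroring the linear
    normal model."""
    se = se_asymptotic * math.sqrt(n / (n - p))
    q = t_dist.ppf(0.5 + level / 2.0, df=n - p)
    return estimate - q * se, estimate + q * se

lo, hi = small_sample_ci(8.1, 1.5, n=39, p=12)
```

Both the inflation factor and the t quantile converge to their asymptotic counterparts as n grows, so the adjustment is harmless in large samples.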

4 Data example

To illustrate the developments in this paper, we reanalyse a standard TQT trial also analysed in chapter 9 of [21]. The data set is freely available on the book website, and consists of 41 subjects, two of which we have excluded due to missingness. There are three single-dose treatments (C, D, E) and a placebo (F). Treatment E is included as a positive control, i.e. treatment E is known to mildly prolong the QT interval. The subjects’ QT intervals are measured in triplicates at baseline, 0.5 hours, 1 hour, 1.5 hours, 2.5 hours, and 4 hours post treatment. The triplicates are averaged at each time point.

In accordance with the developments presented in Section 3, we analyse the data with a range of models of differing complexity that, in theory, facilitate unbiased estimates of the average causal effects under Assumption 2. We informally compare these models in terms of the obtained estimates and standard errors.

Our benchmark model corresponds to the recommendation made in [15]. This paper advocates a regression model including average baseline measurements as a covariate. It is shown in [15] that this approach is consistent with the joint baseline and post baseline measurement model advocated in [13, 17]. It is further argued in [15] that the resulting estimates of the treatment effects will be superior in terms of precision.

Specifically, the working model proposed in [15] is given by:

where is the average baseline measurement. The effects of the baseline measurement and the average baseline measurement are allowed to differ between time points.

Moreover, we fit a simpler model:

i.e. a model without average baseline measurements, but still with interaction between the effect of baseline and time point.

Furthermore, we fit an even simpler model:

i.e. without average baseline measurement, no interaction between time point and period, and the effect of the baseline measurement is the same at all time points. All the models above have the treatment effects as specific parameters. In order to illustrate the modelling flexibility facilitated by Theorem 1, we fit a model with interaction between baseline and treatment:

Note that with the complexity of this model, it is no longer possible to identify the average causal effect as a parameter in the regression model. Therefore, we can no longer rely on standard inference of regression models, but need to rely on the general inference procedures developed in Section 3.

On top of specifying a working regression structure for the models, we also need to specify the working covariance structure. The working covariance structure has to be of the form (2), and in the following we will use a random subject effect, corresponding to being a matrix of constants, unless otherwise specified. We consider three different specifications for the matrix:

  1. Unspecified: all the variances and covariances have to be estimated.

  2. AR(1): the (s, t) entry of the variance matrix is σ²ρ^|s-t|, where σ² and ρ are parameters to be estimated.

  3. Independence: the variance matrix is σ² times the identity matrix. In this case, will be a matrix of zeros, corresponding to a standard linear regression model.
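These blocks are simple to construct programmatically; a sketch with illustrative parameter values, taking the AR(1) block to have (s, t) entry sigma2 * rho**|s - t| (the standard parameterization):

```python
import numpy as np

def ar1_block(m, sigma2, rho):
    """AR(1) covariance block: entry (s, t) equals sigma2 * rho**|s - t|."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

m, sigma2, rho = 5, 60.0, 0.6
blocks = {
    "independence": sigma2 * np.eye(m),  # standard linear regression case
    "ar1": ar1_block(m, sigma2, rho),
    # "unspecified": all m * (m + 1) / 2 entries free, estimated from data
}
print(blocks["ar1"].shape)  # (5, 5)
```

The unspecified structure has no closed form to generate; it is simply estimated entry by entry during model fitting.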

Last, we also fit the non-parametric estimator . The estimates of the effect of treatment E compared to placebo at post baseline time 4 from the models are displayed in Table 1. Note that the standard errors and confidence intervals are based on the small sample size adjustment discussed in the last section. The remaining estimates of treatment effects are presented in Appendix D. The different models result in similar treatment effects as expected. In practice, the main reason for choosing one model over another is therefore in order to have as much efficiency as possible, and not because we expect to actually know the true data-generating mechanism.

Mean structure Covariance structure Estimate Standard Error 95% CI
Unspecified 8.32 1.53 (5.22, 11.42)
AR(1) 8.11 1.49 (5.09, 11.14)
Independence 8.19 1.54 (5.07, 11.32)
Unspecified 8.22 1.52 (5.14, 11.31)
AR(1) 8.09 1.48 (5.11, 11.08)
Independence 8.19 1.52 (5.12, 11.26)
Unspecified 8.49 1.43 (5.59, 11.40)
AR(1) 8.30 1.44 (5.38, 11.22)
Independence 8.43 1.44 (5.51, 11.34)
Unspecified 8.43 1.40 (5.60, 11.26)
AR(1) 8.29 1.43 (5.49, 11.10)
Independence 8.42 1.42 (5.55, 11.30)
8.18 2.05 (4.03, 12.33)
Table 1: Estimates and standard errors for data example.

From Table 1 we note that estimates of the targeted treatment effect across all models seem comparable. The standard errors are substantially larger with the non-parametric approach, whereas for all other model based approaches, standard errors are comparable. We investigate these observations further in a simulation study mimicking the data example in the next section.

5 Simulation study

Simulation studies have already demonstrated that it is theoretically possible to gain precision by including the average baseline measurement as a covariate [15, 17]. However, it is unclear how much the addition of the average baseline covariates matters for the precision in realistic setups. Therefore, we have based our simulation on the data set from the previous section.

Specifically, we have simulated the data as follows: a joint normal distribution of the baseline measurements is fitted to the baseline measurements in the data set, and baseline measurements are simulated according to this fit. The average baseline model from the previous section is fitted to the data set, and the post baseline measurements are simulated from that model.

There are at least two advantages to this approach: first, the simulation must be considered realistic since it is based on parameters estimated on a data set from a real TQT trial. Second, we know that models ignoring the average baseline measurements are too simple to capture the true data-generating mechanism. Thereby, the simulations can show us how much precision we can expect to lose by not using average baseline measurements as covariates in real TQT studies.
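A stripped-down version of this two-step simulation can be sketched as follows (all dimensions and parameter values are illustrative, not the estimates from the TQT data):

```python
import numpy as np

rng = np.random.default_rng(123)
n_subjects = 40

# step 1: simulate baselines from a multivariate normal fitted to the
# observed baselines (toy mean/covariance here, one entry per period)
base_mean = np.array([400.0, 401.0])
base_cov = np.array([[64.0, 40.0], [40.0, 64.0]])
baselines = rng.multivariate_normal(base_mean, base_cov, size=n_subjects)

# step 2: simulate a post-baseline outcome from the fitted working model
# (toy version: intercept + baseline effect + treatment effect + noise)
treat = rng.integers(0, 2, size=n_subjects)   # period-1 treatment indicator
y = (200.0 + 0.5 * baselines[:, 0] + 8.0 * treat
     + rng.normal(scale=6.0, size=n_subjects))
print(baselines.shape, y.shape)  # (40, 2) (40,)
```

In the actual study design, step 2 would use the fitted average-baseline model with its full treatment-sequence and time-point structure rather than this single-outcome stand-in.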

We compare all the models from the last section. The models are misspecified both in terms of mean structure and in terms of correlation structure, but are all unbiased by Corollary 1. We ran 10000 simulations in the statistical program R [26], and the code is available online. The results of the simulations are displayed in Table 2. Standard errors and confidence interval coverages are again based on an adjustment for small sample size. As expected, all estimators have negligible bias.

Mean structure Covariance structure Bias SD Avg. SE Coverage
Unspecified 0.02 1.48 1.43 0.947
AR(1) 0.02 1.48 1.43 0.947
Independence 0.02 1.48 1.48 0.954
Unspecified 0.01 1.49 1.45 0.945
AR(1) 0.01 1.49 1.45 0.945
Independence 0.01 1.52 1.52 0.952
Unspecified 0.02 1.48 1.46 0.951
AR(1) 0.02 1.48 1.46 0.951
Independence 0.02 1.51 1.51 0.954
Unspecified 0.02 1.48 1.38 0.939
AR(1) 0.02 1.47 1.38 0.939
Independence 0.02 1.50 1.49 0.951
0.02 1.94 1.94 0.952
Table 2: Bias, standard deviation of estimates, average standard error, and coverage of confidence intervals in simulations.

The standard error estimates seem approximately correct compared to the sample standard deviation of the estimates, and the coverage of the confidence intervals is consequently close to 95%. The standard deviations in the table show that by far the majority of the gain in precision from using a model comes from the inclusion of the baseline measurements. Using the average baseline measurement as a covariate adds little precision, even over the linear regression model. However, the gain in precision from including the average baseline as a covariate and using a more flexible covariance structure than independence comes for free, since Corollary 1 implies unbiased estimation in any case.

6 Discussion

We have introduced causal reasoning to the field of TQT studies. We have shown how typical choices of estimators can be given very clear causal interpretations in terms of the data, and not just in terms of specific models. Furthermore, we have shown that popular choices of working models in many circumstances yield unbiased estimates of causal parameters under arbitrary model misspecification. We have illustrated these results in a data example and a simulation study.

However, the unbiasedness of the proposed estimators follows from the balancing induced by randomization. In practice, we may have missing data, which can invalidate the balancing initially ensured by the randomization. That said, the amount of missing data is typically negligible in TQT studies, owing to the fact that they are most often conducted on healthy subjects who are paid to participate [6]. In cases with a non-negligible amount of missing data, the working regression models can be fitted to the observed data with MLE, assuming missingness at random (MAR) and distinct parameters for the missing data mechanism and for the outcome. As a consequence, it is straightforward to estimate the causal effects when using a linear mixed model, where the causal effects are identified as the main effects of treatment. However, this becomes challenging for more complex models. In that case, a viable strategy could be to use imputation or weighting methods in order to go from model to estimates of the causal effects [14, 32].

We have focused on how to estimate causal effects under assumption (2), that is, assuming the effects are the same in all periods. To complement these developments, it is interesting to consider what we are estimating if that assumption does not hold. In general, the mean of equals

i.e., the average of the causal effects for each period.  has the same mean due to the randomization and will equal  when estimation is done according to Theorem 1. Thus, in general, the above strategy leads to unbiased estimation of the average of the period-specific causal effects. Alternatively, one may fit a working model to all data under Assumption 4 and subsequently apply G-computation for a single period:


The estimation of  gains precision by using data from all periods, which in turn makes (5) more precise. The estimator (5) emulates a standard one-period trial, whereas it is unclear what we are emulating when the period-specific treatment effects are ignored [8]. This is a topic for further research.
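The claim that a pooled working model with a single main effect of treatment estimates the average of the period-specific effects can be illustrated with a small simulation. The design and the effect values (4 and 6, averaging to 5) below are invented for the example:

```python
# Toy check: with period-specific effects 4 and 6, a pooled model with one
# treatment main effect estimates their average, 5 (setup is illustrative).
import numpy as np

rng = np.random.default_rng(3)
n = 5000
subj = rng.normal(0, 3, n)                        # latent subject level
seq = rng.permutation(np.repeat([0, 1], n // 2))  # period of active treatment
effects = [4.0, 6.0]                              # period-specific causal effects

Y, A = [], []
for p in range(2):
    treat = (seq == p).astype(float)
    Y.append(40 + subj + effects[p] * treat + rng.normal(0, 2, n))
    A.append(treat)
Y, A = np.concatenate(Y), np.concatenate(A)

# Pooled working model with a single main effect of treatment
X = np.column_stack([np.ones_like(Y), A])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
# beta[1] is close to the average of the period-specific effects, 5.0
```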

Two other theoretical issues also deserve more attention. First, we have not theoretically shown that , or for that matter , is in fact more efficient than the non-parametric estimator . We suspect, however, that this is the case, in line with the results obtained for one-period trials in [1, 33]. Second, the impact of violating the restrictions on the working model dictated by Theorem 1 deserves further investigation. Possibly, a more flexible weight matrix than what is warranted by Theorem 1 may lead to further efficiency gains.

Finally, we would like to point out the difference between the estimation procedure proposed in this paper and the traditional approach of reporting treatment effects based on differences in least squares means (see, for example, Chapter 8 of [21]). Differences in least squares means are equivalent to  when the treatment effects are modelled as main effects in a linear mixed model. However, they are not equivalent to  when the model is more complex. In those cases, least squares means lack a proper causal interpretation in a meaningful population, and on those grounds  or  should be preferred for assessing causal treatment effects.


  • [1] Jonathan W Bartlett. Covariate adjustment and estimation of mean response in randomised trials. Pharmaceutical statistics, 17(5):648–666, 2018.
  • [2] Robert M Bell and Daniel F McCaffrey. Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodology, 28(2):169–182, 2002.
  • [3] A. Colin Cameron and Douglas L. Miller. A Practitioner’s Guide to Cluster-Robust Inference. Journal of Human Resources, 50(2):317–372, 2015.
  • [4] EMA. European Medicines Agency Guideline on Adjustment for Baseline Covariates in Clinical Trials. European Medicines Agency: CPMP/295050/2013, 2015.
  • [5] FDA. Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biologics with Continuous Outcomes. Draft Guidance for Industry, 2019.
  • [6] Food and Drug Administration, HHS. International Conference on Harmonisation; guidance on E14 Clinical Evaluation of QT/QTc Interval Prolongation and Proarrhythmic Potential for Non-Antiarrhythmic Drugs; availability. Notice. Federal Register, 70(202):61134–61135, 2005.
  • [7] David A. Harville. Matrix algebra from a statistician’s perspective. Springer, 2008.
  • [8] Miguel A. Hernán, Alvaro Alonso, Roger Logan, Francine Grodstein, Karin B. Michels, Walter C. Willett, JoAnn E. Manson, and James M. Robins. Observational Studies Analyzed Like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease. Epidemiology, 19(6):766–779, 2008.
  • [9] Adrián V. Hernández, Ewout W. Steyerberg, Isabella Butcher, Nino Mushkudiani, Gillian S. Taylor, Gordon D. Murray, Anthony Marmarou, Sung C. Choi, Juan Lu, J. Dik F. Habbema, and Andrew I.R. Maas. Adjustment for Strong Predictors of Outcome in Traumatic Brain Injury Trials: 25% Reduction in Sample Size Requirements in the IMPACT Study. Journal of Neurotrauma, 23(9):1295–1303, 2006.
  • [10] Torsten Hothorn, Frank Bretz, and Peter Westfall. Simultaneous Inference in General Parametric Models. Biometrical Journal, 50(3):346–363, 2008.
  • [11] ICH. Addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials E9 (R1), 2019.
  • [12] Jiming Jiang. Linear and generalized linear mixed models and their applications. Springer, 2007.
  • [13] Michael G Kenward and James H Roger. The use of baseline covariates in crossover studies. Biostatistics, 11(1):1–17, 2010.
  • [14] Roderick J. A. Little and Donald B. Rubin. Statistical analysis with missing data. Wiley, third edition, 2020.
  • [15] Kaifeng Lu. An efficient analysis of covariance model for crossover thorough QT studies with period-specific pre-dose baselines. Pharmaceutical Statistics, 13(6):388–396, 2014.
  • [16] James G MacKinnon and Halbert White. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3):305–325, 1985.
  • [17] Zhaoling Meng, Hui Quan, Li Fan, Robert Kringle, and Gordon Sun. Use of the Average Baseline Versus the Time-Matched Baseline in Parallel Group Thorough QT/QTc Studies. Journal of Biopharmaceutical Statistics, 20(3):665–682, 2010.
  • [18] Whitney K. Newey and Daniel McFadden. Chapter 36 Large sample estimation and hypothesis testing. In Handbook of Econometrics, volume 4, pages 2111–2245. Elsevier, 1994.
  • [19] Yasushi Orihashi and Yuji Kumagai. Concentration-QTc analysis with two or more correlated baselines. Journal of Pharmacokinetics and Pharmacodynamics, 48(5):615–622, 2021.
  • [20] Yasushi Orihashi, Yuji Kumagai, and Kazuhito Shiosakai. Novel concentration-qtc models for early clinical studies with parallel placebo controls: A simulation study. Pharmaceutical Statistics, 20(2):375–389, 2021.
  • [21] Scott D. Patterson and Byron Jones. Bioequivalence and statistics in clinical pharmacology. Chapman & Hall/CRC, 2006.
  • [22] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • [23] Christian Bressen Pipper, Christian Ritz, and Hans Bisgaard. A versatile method for confirmatory evaluation of the effects of a covariate in multiple models: Evaluation of Effects of a Covariate in Multiple Models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(2):315–326, 2012.
  • [24] James Pustejovsky. clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections, 2022. R package version 0.5.6.
  • [25] James E. Pustejovsky and Elizabeth Tipton. Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models. Journal of Business & Economic Statistics, 36(4):672–683, 2018.
  • [26] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021.
  • [27] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
  • [28] Laurence D. Robinson and Nicholas P. Jewell. Some Surprising Results about Covariate Adjustment in Logistic Regression Models. International Statistical Review / Revue Internationale de Statistique, 59(2):227–240, 1991.
  • [29] Michael Rosenblum and Jon Arni Steingrimsson. Matching the efficiency gains of the logistic regression estimator while avoiding its interpretability problems, in randomized trials. Johns Hopkins University, Dept. of Biostatistics Working Papers, 2016. Working Paper 281.
  • [30] Robert Schall and Arne Ring. Mixed models for data from thorough QT studies: part 1. assessment of marginal QT prolongation. Pharmaceutical Statistics, 10(3):265–276, 2011.
  • [31] Stephen Senn. Cross-over trials in clinical research. John Wiley & Sons, 2nd edition, 2002.
  • [32] Anastasios A. Tsiatis. Semiparametric theory and missing data. Springer, 2006.
  • [33] Mark J. van der Laan and Sherri Rose. Targeted Learning. Springer, 2011.
  • [34] Bingkai Wang, Ryoko Susukida, Ramin Mojtabai, Masoumeh Amin-Esmaeili, and Michael Rosenblum. Model-Robust Inference for Clinical Trials that Improve Precision by Stratified Randomization and Covariate Adjustment. Journal of the American Statistical Association, pages 1–12, 2021.

7 Appendix A: Proof of Theorem 1

We base the proof on rewriting the estimator as

The last rewriting shows that  is equal to  plus an adjustment term; the two estimators thus coincide if the last term is zero.

We assume the -parameters are estimated from the following weighted least squares estimating equation:


where  is of the form (2). The inverse of these matrices is also of the form


where C and D are matrices of the same dimensions as A and B, namely . This can be seen using the Woodbury identity (Theorem 18.2.8 in [7]), which states that if  is nonsingular, then  is nonsingular if and only if  is nonsingular, and in that case:

For our purposes we will choose:


such that . To see that (7) follows from the Woodbury identity, note that  is block diagonal with  in the diagonal. Moreover,

The inverse of is then , where

This can be checked by multiplying the matrices and verifying that the product is the identity (it may be easier after substituting  with ). Then  is a block matrix in which all blocks are equal. The Woodbury identity then gives a block diagonal matrix minus a block matrix with all blocks equal, which is of the form (7).
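The block-structure claim can also be checked numerically. The sketch below, with arbitrary made-up matrices A and B, verifies that the inverse of a matrix of the form I ⊗ A + J ⊗ B (A on each diagonal block, B added to every block, J the all-ones matrix) again has that form:

```python
# Numerical sanity check with made-up A and B: the inverse of I⊗A + J⊗B is
# again of the form I⊗C + J⊗D, as used in the proof.
import numpy as np

rng = np.random.default_rng(1)
p, t = 3, 4                                   # periods, time points (assumed)
A = rng.normal(size=(t, t))
A = A @ A.T + t * np.eye(t)                   # make A well-conditioned
B = rng.normal(size=(t, t))
B = (B + B.T) / 10                            # small symmetric perturbation

I, J = np.eye(p), np.ones((p, p))
V = np.kron(I, A) + np.kron(J, B)
Vinv = np.linalg.inv(V)

# Read off candidate C and D from the inverse's blocks and rebuild it:
# every off-diagonal block equals D, every diagonal block equals C + D.
D = Vinv[:t, t:2 * t]
C = Vinv[:t, :t] - D
same_form = np.allclose(Vinv, np.kron(I, C) + np.kron(J, D))  # True
```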

For convenience, we will introduce

There is one equation in (6) for each column of the design matrix, i.e., one equation for each -parameter. However, as it turns out, we only need some of the equations to obtain . We re-parametrize the model so that, instead of a -parameter per combination of treatment and time point, we have main effects of time points and main effects of non-placebo treatment per time point, i.e., placebo becomes the reference level. First, we need the equations coming from having an effect of time point in our model. The estimating equation coming from the effect of time point  looks like this:


The inverse variance matrix means that residuals from all time points, periods, and subjects enter the estimating equation (8) for time point . However, (8) constitutes T linear equations, one for each , in T unknowns, all equal to zero. Hence, we can conclude


The second set of estimating equations we need comes from having main effects of treatment per time point in our model. The estimating equation corresponding to the effect of treatment at time point is:

(*) follows from the fact that each subject receives each treatment exactly once, so that . (**) follows from (9). The rest is just interchanging the order of summation. Again, we have T equations in T unknowns and can conclude


as wanted. Note that the above argument only works for treatments other than placebo, since placebo is the reference treatment. However, we can rewrite (9) to

where the last equality comes from (10). Thus, the extra term in  is zero, and . Adding other covariates or terms to the regression does not change this fact, as long as we have main effects of treatment specific to each post-baseline time point.

8 Appendix B: Influence function for

Suppose the estimation in the first stage solves the following equation:


where  is all the information we have about subject . This would, for example, be the case if we used MLE, in which case  would be the score function. Denote the solution to (11) by

, and the limit in probability of

by . Then the influence function of is found using Theorem 6.1 from [18]:



The first term in (12) represents the uncertainty coming from the covariate distribution and is zero if the model is simply a main-term linear mixed model, where  is a parameter in the model and thereby independent of covariates. The second term represents the uncertainty arising from the estimation of the -parameters in the first stage. The variance of the estimator is then
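As a toy illustration of estimating a variance from empirical influence functions, consider the sample mean, whose influence function is Z − μ. The example below is our own and not the paper's estimator; it shows that the influence-function-based standard error agrees with the classical one:

```python
# Toy illustration (assumed setup): the variance of a sample mean estimated
# as the empirical mean of squared influence function values, divided by n.
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(10, 2, 500)
mu_hat = z.mean()

ic = z - mu_hat                                  # empirical influence functions
se_if = np.sqrt(np.mean(ic ** 2) / len(z))       # IF-based standard error
se_classic = z.std(ddof=1) / np.sqrt(len(z))     # classical standard error
# se_if and se_classic agree up to a finite-sample degrees-of-freedom factor
```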

9 Appendix C: Derivation of

The estimator is derived in the same way as the semi-parametric estimator for the pretest-posttest study in [32], but with  as the non-parametric starting point. Note that this will not result in the efficient estimator in our setup, since we also have the assumption of the same treatment effect in all periods, which is not used in the derivation.

The influence function of is given by

where  is the true causal effect, which, we remind the reader, is independent of period under Assumption 2, and  consists of all the data from the subject. We derive the estimator by the same calculations as [32], namely


where , and likewise for . We calculate the expectations:


The expectations and require a model, .

When plugging the above into (13) we get an estimator of the influence function, , and the estimator is obtained by simple isolation from the equation

which defines influence functions.

9.1 Influence function for

It might seem that the influence function of  was derived above. However, one could imagine that the estimation of the models in the first step changes the influence function, yielding an extra term like the one in the case of .

We can simply take the estimator , subtract , and multiply by  in order to get


This looks like an equation that gives us the influence function, but note that  is estimated on all data, so the terms above are, strictly speaking, not independent. To proceed, we need an assumption about the behaviour of the model as  grows. Assume that there exists