Crossover trials, in which participants are randomly allocated to receive a sequence of treatments across a series of time periods, are an extremely useful tool in clinical research. Their nature permits each patient to act as their own control, exploiting the fact that in most instances the variability of measurements on different subjects in a study will be far greater than that on the same subject. In this way, crossover trials are often more efficient than parallel group trials. As with most experimental designs, the sample size a crossover trial requires to achieve a certain power for a particular treatment effect depends on the significance level and on at least one factor that accounts for the participants' variance in response to treatment. Whilst the former are designated quantities, the variance factors will usually be subject to substantial uncertainty at the design stage. Their values will often be greatly affected by components of the current trial, such as the inclusion/exclusion criteria, which renders estimates obtained from previous trials biased. This is troubling since sample size calculation is of paramount importance in study design. Planning a trial that is too large exposes an unnecessary number of patients to interventions that may be harmful. It also needlessly wastes valuable resources in terms of time, money, and available trial participants. In contrast, too small a sample size confers little chance of success for a trial. The consequences of this could be far reaching: a wrong decision may lead to the halting of the development of a therapy, which could deprive future patients of a valuable treatment option.
To address this problem in a parallel group setting with normally distributed outcome variables, Wittes and Brittain, building upon previous work by Stein, proposed the internal pilot study design. In their approach, at an interim time period the accrued data is unblinded, the within-group variance computed, and the trial's required sample size adjusted if necessary. However, unblinding an ongoing trial can reduce its integrity and introduce bias. Consequently, Gould and Shih explored several approaches for re-estimating the required sample size in a blinded manner. Since then, a number of papers have advocated for re-estimation in a parallel group setting to be based upon a crude one-sample estimate of the variance, and methodology has also been proposed which allows the type-I error-rate to be more accurately controlled. More recently, much work has been conducted on similar methods for an array of possible trial designs and types of outcome variable (see, e.g., Jensen and Kieser and Togo and Iwasaki), with these methods also gaining regulatory acceptance [2, 5].
Thus, today, sample size re-estimation procedures have established themselves for parallel group trials as an advantageous method to employ when there is pre-trial uncertainty over the appropriate sample size. In contrast, there has been little exploration of such methodology within the context of multi-treatment crossover trials. Golkowski et al. recently explored a blinded sample size re-estimation procedure for establishing bioequivalence in a trial utilising an AB/BA crossover design. Jones and Kenward discussed how the results of Kieser and Friede could be rephrased for an AB/BA crossover trial testing for superiority. In addition, several unblinded re-estimation procedures for AB/BA bioequivalence trials have been proposed [21, 22, 32], the performance of which has recently been extensively compared. The work of Lake et al. and van Schie and Moerbeek on sample size re-estimation in cluster randomised trials has some parallels with the methodology required for crossover trials, because both necessitate a mixed model for data analysis. The same is true of the methodology presented by Zucker and Denne on re-estimation procedures for longitudinal trials. However, we are unaware of any article that explicitly discusses re-estimation in crossover trials with more than two treatments. There are many examples of such trials in the literature, and they also remain the focus of much research (see, e.g., Bailey and Druilhet and Lui and Chang).
In this article, we consider several possible approaches to the interim re-assessment of the sample size required by a multi-treatment crossover trial. We assume a normally distributed outcome variable, and that a commonly utilised linear mixed model will be employed for data analysis. We focus primarily on a setting in which the final analysis is based on many-to-one comparisons for one-sided null hypotheses, but provide additional guidance for other possibilities in the Appendix. Blinded procedures for estimating the between and within person variance in response are proposed, following either simple or block randomisation to sequences that are balanced for period. The performance of these estimators is contrasted to that of an unblinded procedure via a simulation study motivated by a real four-treatment four-period crossover trial. Additionally, in the Appendix we provide results for two additional examples. We now proceed by specifying the notation used in the re-estimation procedures. Our findings are then summarised in Section 3, before we conclude in Section 4 with a discussion.
2.1 Hypotheses, notation, and analysis
We consider a crossover trial with $D$ treatments, indexed $d = 0, \dots, D-1$. Treatments $d = 1, \dots, D-1$ are considered experimental, and are to be compared to the common control $d = 0$. We suppose that $K$ sequences, indexed $k = 1, \dots, K$, are utilised for treatment allocation, and denote by $n_k$ the number of patients allocated to sequence $k$. The number of periods in the trial, which is equal to the length of each of the sequences, is denoted by $P$.
We restrict our focus to trials with normally distributed outcome data, to be analysed using the following linear mixed model
$$y_{ijk} = \mu_0 + \pi_j + \tau_{d(j,k)} + s_{ik} + \epsilon_{ijk}. \qquad (2.1)$$
Here,
$y_{ijk}$ is the response for individual $i$, in period $j$, on sequence $k$;
$\mu_0$ is an intercept term; the mean response on treatment 0 in period 1;
$\pi_j$ is a fixed effect for period $j$, with the identifiability constraint $\pi_1 = 0$;
$\tau_{d(j,k)}$ is a fixed direct treatment effect for the treatment administered to an individual in period $j$, on sequence $k$, with the identifiability constraint $\tau_0 = 0$. Thus $d(j,k) \in \{0, \dots, D-1\}$;
$s_{ik} \sim N(0, \sigma_b^2)$ is a random effect for individual $i$ on sequence $k$, with $\sigma_b^2$ the between person variance;
$\epsilon_{ijk} \sim N(0, \sigma_e^2)$ is the residual for the response from individual $i$, in period $j$, on sequence $k$, with $\sigma_e^2$ the within person variance.
This model, and its implied covariance structure, is the standard for a crossover trial that ignores the possible effects of carryover. Thus we are implicitly heeding the advice of Senn, and others, that a crossover trial should not be conducted when carryover is likely to be an issue. Furthermore, note that by the above, two observations $y_{i_1 j_1 k_1}$ and $y_{i_2 j_2 k_2}$ are independent unless $i_1 = i_2$ and $k_1 = k_2$.
We assume that the following hypotheses are to be tested, to attempt to establish the superiority of each experimental intervention versus the control
$$H_{0d}: \tau_d \le 0, \qquad H_{1d}: \tau_d > 0, \qquad d = 1, \dots, D-1.$$
Note though that for Examples 1 and 3, slightly different hypotheses are assessed, as negative effects imply efficacy. Additionally, in the Appendix we detail how one can handle alternate hypotheses of interest.
We suppose that it is desired to strongly control the FWER, the maximal probability of one or more incorrect rejections amongst the family of null hypotheses for all possible treatment effects, to some specified level $\alpha$. There are several possible ways to define power in a multi-treatment setting. Throughout, we assume that pairwise power of at least $1 - \beta$ to reject, without loss of generality, $H_{01}$ is required when
$$\tau_1 = \delta,$$
for designated type-II error-rate $\beta$ and clinically relevant difference $\delta$. Thus, from here, when referring to power we mean the probability that $H_{01}$ is rejected. However, in the Appendix we describe how a desired familywise power could be achieved.
To test the hypotheses, we assume that patients in total will be recruited to the trial, with each randomised to one of the sequences, and that the linear mixed model (2.1) will be fitted to the accumulated data. Note that in fitting this model, a choice must be made over whether to utilise maximum likelihood, or restricted maximum likelihood (REML), estimation. Given the bias of the maximum likelihood estimator of the variance components of a linear mixed model in finite samples, and that crossover trials are often conducted with relatively small sample sizes, here we always take the latter approach. Note though that this choice would have little effect for larger sample sizes. For further details on these considerations, we refer the reader to, for example, Fitzmaurice et al. In brief, the REML estimation procedure, for a linear mixed model of the form with and , iteratively optimises the parameter estimates for the effects in the model. The following modified log-likelihood is maximised to provide an estimate, , for , using an estimate, , for
Then, is updated to
and the process repeated. Given the final solutions and , we take .
In our case, , and the following Wald test statistics are formed
where and are extracted from and respectively.
Next, we reject if , with chosen to control the FWER. Explicitly, using a Dunnett test , we take as the solution to
where is the -dimensional cumulative distribution function of a central multivariate t-distribution with covariance matrix and degrees of freedom. We take the degrees of freedom here, for sample size , to be , which arises from that associated with an analogous multi-level ANOVA design. Moreover, is the covariance matrix of , which can be calculated using .
Now, in this case, if and were known, and we assumed that , we could derive a simple formula for the total number of patients, , required to achieve the desired power for the trial. Here, we denote this formula using the function , explicitly stating its dependence upon the within and between person variances. In the Appendix, we elaborate on how this formula can be derived.
Our problem, as discussed, is that in practice and are rarely known accurately at the design stage. Therefore, we propose to re-estimate the required sample size at an interim analysis timed after patients. That is, we consider several methods to construct estimates, and , for and respectively, based on the data accrued up to the interim analysis. Then, the final sample size for the trial is taken as
where denotes the nearest integer greater than or equal to and is a specified maximal allowed sample size. It could be based, for example, on the cost restrictions or feasible recruitment rate of a trial. Of course, if then the trial will be expected to be under-powered. Thus, if necessary, additional patients are recruited and a final analysis conducted as above based on the calculated values of the test statistics , and the critical value as defined in Equation (2.2).
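The capping rule described above can be sketched as follows. This is only an illustration under a stated assumption: the function name `final_sample_size` and its arguments are hypothetical, and the rule is assumed to take the form "round the re-estimated requirement up, cap it at the maximal allowed sample size, and never go below the number of patients already recruited at the interim analysis".

```python
import math


def final_sample_size(n_required, n_interim, n_max):
    """Hypothetical sketch of the re-estimated final sample size.

    n_required : re-estimated requirement from the sample size formula
                 (may be non-integer).
    n_interim  : number of patients already recruited at the interim analysis.
    n_max      : specified maximal allowed sample size.
    """
    if n_required >= n_max:
        # Requirement exceeds the cap (this also covers an infinite value);
        # the trial would then be expected to be under-powered.
        capped = n_max
    else:
        # Nearest integer greater than or equal to the requirement.
        capped = math.ceil(n_required)
    # Patients recruited before the interim analysis cannot be removed.
    return max(n_interim, capped)
```

For instance, `final_sample_size(61.2, 16, 1000)` rounds the requirement up to 62, whereas a requirement below the interim sample size simply returns the interim sample size.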
Throughout, to give our function a simple form, we consider values of that imply an equal number of patients could be allocated to each of the sequences, and assume a randomisation scheme that ensures this is the case. Moreover, for reasons to be elucidated shortly, we consider from here only settings where the sequences are balanced for period. That is, across the chosen sequences, each treatment appears an equal number of times in each period. We now proceed by detailing each of our explored methods for estimating and based on the internal pilot data.
2.2 Unblinded estimator
The first of the methods we consider is an unblinded procedure. As noted, such an approach is typically less favoured by regulatory agencies. However, one may anticipate that it will outperform the blinded procedures, both in estimating the key variance components and and in providing the desired operating characteristics, though this may not always prove to be the case (see, e.g., Friede and Kieser). This method therefore serves as a standard against which to assess the blinded estimators. Explicitly, this approach breaks the randomisation code and fits the linear mixed model (2.1) to the accrued data using REML estimation. The REML estimates of and so obtained are then utilised in the re-estimation procedure as described above.
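To make the unblinded step concrete, the sketch below computes an unblinded estimate of the within person variance from interim data. It is not the REML fit itself: as a dependency-free stand-in, it uses the ANOVA-type residual mean square after removing subject, period, and treatment means. This stand-in is only appropriate under the assumptions that every patient receives every treatment exactly once, the sequences are period balanced with equal allocation, and there are no missing data; in that balanced setting it would typically be expected to agree with the REML estimate. The function name and the row layout `(subject, period, treatment, response)` are hypothetical.

```python
from collections import defaultdict


def within_person_variance(rows):
    """ANOVA-type unblinded estimate of the within person variance.

    rows: iterable of (subject, period, treatment, response) tuples from a
    complete-block, period-balanced crossover with equal allocation.
    """
    rows = list(rows)
    grand_mean = sum(r[3] for r in rows) / len(rows)

    def level_means(index):
        # Mean response for each level of the factor stored at `index`.
        totals, counts = defaultdict(float), defaultdict(int)
        for r in rows:
            totals[r[index]] += r[3]
            counts[r[index]] += 1
        return {level: totals[level] / counts[level] for level in totals}

    subject_mean = level_means(0)
    period_mean = level_means(1)
    treatment_mean = level_means(2)

    # With mutually orthogonal factors, the fitted value is the sum of the
    # three marginal means minus twice the grand mean.
    sse = sum(
        (y - subject_mean[i] - period_mean[j] - treatment_mean[d]
         + 2.0 * grand_mean) ** 2
        for i, j, d, y in rows
    )
    # Residual degrees of freedom after fitting an intercept plus subject,
    # period, and treatment effects.
    df = (len(rows) - 1 - (len(subject_mean) - 1)
          - (len(period_mean) - 1) - (len(treatment_mean) - 1))
    return sse / df
```

A full implementation would instead fit model (2.1) with a mixed-model routine, which also yields the between person variance and copes with imbalance.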
2.3 Adjusted blinded estimator
Zucker et al.  considered a blinded estimator for two-arm parallel trial designs based on an adjustment to the one-sample variance. Golkowski et al.  considered a similar unadjusted procedure for two-arm bioequivalence trials. Here, we consider a similar approach for multi-treatment crossover trials. Specifically, the following blinded estimators of the within and between person variances are used
for specified , , with , where
In the Appendix, we show that if for then and , and thus and are unbiased estimators for and respectively. This is the reason for our restrictions on the employed randomisation scheme (which assumes at the interim reassessment), and the employed sequences (which are assumed to be balanced for period). The above estimator could be used when there is imbalance in the number of patients allocated to each sequence, or without making this restriction on the sequences, but results on the expected values of the variance components would have a more complex form. It is therefore advantageous to ensure an equal number of patients are allocated to each sequence, and also logical to utilise period balanced sequences. We therefore also view it as sensible to explore the performance of the estimators in this case.
It is also important to assess the sensitivity of the performance of these estimators to the choice of the , hoping for it to have negligible impact as in analogous procedures for other trial settings . Adapting previous works (see, e.g., Kieser and Friede , Zucker et al. , Gould and Shih ), we assess this procedure for , and , , and refer to these henceforth as the null adjusted and alternative adjusted re-estimation procedures respectively.
Note that one limitation of this approach in practice is that there is no guarantee that the above value for will be positive. Therefore, we actually re-evaluate the required sample size as . For the examples provided in the Appendix, we demonstrate that the above procedure still performs well despite this inconvenience. Moreover, in certain routinely faced scenarios, as will be discussed shortly, the value of is inconsequential and this issue therefore no longer exists. However, in general this must be kept in mind when considering using this procedure for sample size re-estimation.
2.4 Blinded estimator following block randomisation
The above re-estimation procedures are explored within the context of a simple randomisation scheme that only ensures an equal number of patients are allocated to each sequence prior to the interim re-assessment. In contrast, the final blinded estimator we consider exploits the advantages block randomisation can bring, extending the methodology presented in Xing and Ganju  for parallel arm trials to crossover studies.
We suppose that patients are allocated to sequences in blocks, each of length (with these values chosen such that ). We recategorise our data as , the response from patient , in period , in block . Then, the following blinded estimators are used to recalculate the required sample size
In the Appendix, provided that an equal number of patients are allocated to each of a set of period balanced sequences, these are also shown to be unbiased estimators for and . Note though that as above, we must actually re-estimate using . Additionally, when using block randomisation, the actual sample size used by a trial may differ from , if it is not divisible by the block length .
3 Simulation study
3.1 Motivating examples
We present results for three motivating examples based on real crossover trials. Example 1 is described in Section 3.2, with Examples 2 and 3 discussed in the Appendix, where their associated results are also presented. Among the three examples we consider settings with a range of required sample sizes, utilising complete block, incomplete block, and extra-period designs. This allows us to provide a thorough depiction of the performance of the various estimators in a wide range of realistic trial design settings.
3.2 Example 1: TOMADO
First, we assess the performance of the various re-estimation procedures using the TOMADO trial as motivation. TOMADO compared the clinical effectiveness of a range of mandibular devices for the treatment of obstructive sleep apnoea-hypopnoea. Precise details can be found in Quinnell et al. Briefly, TOMADO was a four-treatment four-period crossover trial, with patients allocated treatment sequences using two Williams squares. The data for the outcome Epworth Sleepiness Scale was to be analysed using linear mixed model (2.1), with the following hypotheses tested
since a reduction in the Epworth Sleepiness Scale score is indicative of an efficacious treatment. Consequently, the null hypotheses were to be rejected if , using the value of determined as above.
Following the methodology described in the Appendix, we can demonstrate that when complete-block period-balanced sequences are used for treatment allocation, the required sample size has no dependence upon the between person variance . Explicitly, we have
where is defined in the Appendix. See Jones and Kenward , for an alternative derivation of this formula. This substantially simplifies the re-estimation procedure, as we only need to provide a value for , and do not require use of the estimators for .
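As a hedged illustration of a formula of this kind, the sketch below assumes the standard result for complete-block period-balanced designs that each estimated treatment contrast has variance $2\sigma_e^2 / N$, together with the normal approximation used in the Appendix. Under those assumptions, requiring power $1 - \beta$ at a clinically relevant difference $\delta$ gives $N = 2\sigma_e^2 (e + z_{1-\beta})^2 / \delta^2$, where $e$ is the critical value controlling the FWER. The function and argument names are hypothetical, and the critical value is supplied by the user rather than computed.

```python
from statistics import NormalDist


def required_sample_size(sigma_e2, delta, critical_value, beta):
    """Assumed normal-approximation sample size for pairwise power 1 - beta.

    sigma_e2       : within person variance.
    delta          : clinically relevant difference.
    critical_value : rejection boundary controlling the FWER.
    beta           : designated type-II error-rate.
    """
    z_power = NormalDist().inv_cdf(1.0 - beta)
    # Assumes Var(estimated contrast) = 2 * sigma_e2 / N, so that the
    # between person variance does not enter the calculation.
    return 2.0 * sigma_e2 * (critical_value + z_power) ** 2 / delta ** 2
```

The returned value is generally not an integer; in practice it would be rounded up, and to a total that permits equal allocation across the sequences.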
TOMADO's complete case analysis estimated the following values for the various components of the linear mixed model (2.1)
Therefore, for , the trial's planned recruitment of 72 patients would have conferred power of 0.8 at a significance level of 0.05 for . Consequently, we set and throughout. In the main manuscript, we additionally take and always. The effect of other underlying values for and is considered in the Appendix. In contrast, whilst we focus here on the case with , we also consider the influence of alternative values for this parameter. When simulating data we take , , , and . However, the effect of other period effects is discussed in Section 4 and in the Appendix.
We explore the performance of the procedures under the global null hypothesis (), when only treatment one is effective (), when treatments one and two are effective (), under the global alternative hypothesis (), and under what we refer to henceforth as the observed treatment effects (). For simplicity, we assume a single Latin square was used for treatment allocation, and set so that there is no practical upper limit on the allowed sample size. In all cases, the average result for a particular design and analysis scenario was determined using 100,000 trial simulations.
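A single simulated trial of the kind described above could be generated along the following lines. This is a minimal sketch, assuming the additive structure of the linear mixed model (intercept, period effect, direct treatment effect, normally distributed subject effect, and normally distributed residual). The function name, argument names, and the particular 4x4 Latin square are hypothetical choices made for illustration and are not taken from TOMADO.

```python
import random

# A period-balanced 4x4 Latin square: each treatment (0 = control) appears
# exactly once in every period. Chosen for illustration only.
LATIN_SQUARE_4 = [(0, 1, 3, 2), (1, 2, 0, 3), (2, 3, 1, 0), (3, 0, 2, 1)]


def simulate_crossover(sequences, n_per_sequence, mu0, period_effects,
                       treatment_effects, sigma_b, sigma_e, seed=None):
    """Simulate one trial; returns (subject, period, treatment, response) rows.

    sigma_b and sigma_e are the between and within person standard deviations.
    """
    rng = random.Random(seed)
    rows = []
    subject = 0
    for sequence in sequences:
        for _ in range(n_per_sequence):
            # One random effect per subject, shared across all periods.
            subject_effect = rng.gauss(0.0, sigma_b)
            for period, treatment in enumerate(sequence):
                residual = rng.gauss(0.0, sigma_e)
                response = (mu0 + period_effects[period]
                            + treatment_effects[treatment]
                            + subject_effect + residual)
                rows.append((subject, period, treatment, response))
            subject += 1
    return rows
```

Repeating such a simulation many times, applying a re-estimation procedure at the interim analysis and the final test to each replicate, gives Monte Carlo estimates of the FWER and power.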
3.3 Distributions of and
First, the performance of the re-estimation procedures was explored for the parameters listed in Section 3.2, with , and . The resulting distributions of , the interim estimate of , are shown in Figure 1 via the median, lower and upper quartiles in each instance. Additionally, Figure 2 depicts the equivalent results for the distribution of , the interim re-estimated value for . The results are grouped according to the timing of the re-estimation and by the true value of the treatment effects. Note that is only considered for values of which allow an equal number of patients to be allocated to each sequence by the interim analysis.
The median value of for the unblinded procedure is always close to, but typically slightly less than, the true value . The same statement holds for the block randomisation procedure with or 4. However, whilst this is true for the adjusted procedures under the global null hypothesis, it is not otherwise always the case. In particular, both perform poorly for the observed treatment effects.
As would be anticipated, the alternative adjusted procedure has lower median values for than the null adjusted procedure. Moreover, using the block randomised re-estimation procedure with seems to improve performance over , both in terms of the median value of , and by imparting a smaller interquartile range for .
The results for mirror those for . Thus is larger for the adjusted estimators under the observed treatment effects, but otherwise the distributions are comparable.
Increasing the value of reduces the interquartile range for and for each procedure, and results in median values closer to the truth, as would be expected. Finally, we observe that the interquartile range for the unblinded procedure is often smaller than that of its adjusted or block randomisation counterparts.
3.4 Familywise error-rate and power
For the scenarios from Section 3.3 that were not conducted under the observed treatment effects, the estimated FWER and power were also recorded. The results are displayed in Table 1.
The FWER for each of the procedures is usually close to the nominal level, with a maximal value of 0.052 for the unblinded procedure with . The adjusted procedures arguably have the smallest inflation across the considered values of .
In most cases the re-estimation procedures attain a power close to the desired level. Of the adjusted procedures, the null adjusted has a larger power, as would be anticipated given our observations on and above. In fact, the null adjusted method conveys the highest power for each value of . The power of the block randomised procedures is typically similar to that of the alternative adjusted method. In addition, whether only treatment one, treatments one and two, or all three treatments are effective has little effect on the power.
There is no clear trend as to the effect of increasing on the FWER; however, it leads in almost all instances to an improvement in power. Finally, increasing the value for in the block randomisation procedure increases power, as would be predicted.
3.5 Influence of
In this section, we consider the influence of the value of on the performance of our re-estimation procedures. Specifically, whilst we know that increasing will increase the required sample size, we would like to assess the effect this has upon the ability of the methods to control the FWER and attain the desired power.
Figures 3 and 4 respectively present our results on the FWER and power of the various re-estimation procedures when for several values of , under the global null and alternative hypotheses respectively.
Arguably, we observe that the FWER is more variable for smaller values of , with it changing little for several of the procedures when . There is additionally some evidence to suggest that increasing the value of reduces the overall effect has on the FWER.
For the power, as would be anticipated, the re-estimation procedures are over-powered when and is small. Moreover, increasing the value of universally increases the power. Finally, as increases beyond approximately , for both considered values of , there is little change in power.
3.6 Sample size inflation factor
Whilst the above suggests the overall performance of the re-estimation is good, there are several simple refinements that can be implemented to improve the observed results.
One such refinement, to help ensure the power provided by the re-estimation procedures is at least the desired , is to utilise a sample size inflation factor as originally proposed by Zucker et al. . With it, the value of as determined using the arguments above, is enlarged by the following factor
Of course, one must be careful that the new implied sample size does not exceed any specified maximal allowed sample size. Nonetheless, this factor has been shown to improve the performance of re-estimation procedures in superiority, non-inferiority, and two-treatment bioequivalence trials.
Figure 5 displays its effect in the context of our multi-treatment crossover trials. Explicitly, the power of the various re-estimation procedures under the global alternative hypothesis, for and , is shown with and without the use of the inflation factor. For the unblinded, null adjusted, and block randomised method with , the inflation factor increases power to above the desired level in every instance. Consequently, this simple inflation factor appears once more to be an effective adjustment to the basic procedures.
4 Discussion
In this article, we have developed and explored several methods for the interim re-assessment of the sample size required by a multi-treatment crossover trial. Our methodology is applicable to any trial analysed using the linear mixed model (2.1), when there is equal participant allocation to a set of period balanced sequences. Thus whilst adapting the work of Golkowski et al. would be advisable in the case of an AB/BA superiority trial, given that it does not require the use of simulation, our methods are pertinent to a broader set of crossover designs. Indeed, they are as readily applicable to multi-treatment superiority trials as they are to ones for establishing bioequivalence.
We explored performance via three motivating examples, allowing consideration of settings with different types of sequences and a range of required sample sizes. Overall, the results presented here for the TOMADO trial are similar to those provided in the Appendix for Examples 2 and 3. However, larger inflation to the FWER was observed in Example 2, most likely as a consequence of its associated smaller sample sizes. Nonetheless, the methods were found to provide desirable power characteristics with negligible inflation to the FWER in many settings. In particular, the blinded procedures provided comparable operating characteristics to the unblinded procedure, and thus can be considered viable alternatives. In line with results for parallel arm trials, the null adjusted blinded estimator arguably performed better than the other estimators, in that its typical over-estimation of the variance at interim led to the desired power being achieved more often. We may therefore tentatively suggest the null adjusted blinded estimator to be the preferred approach in this setting.
Our findings indicate that for each of the re-estimation procedures, the choice of and the underlying values of and often have little effect upon the FWER and power. We may be reassured therefore that the performance of the procedures should often be relatively insensitive to the design parameters. On a similar note, it is important to recognise that one cannot be certain when utilising these methods that the value of the period effects will not influence the performance of the re-estimation procedures. Whilst the final analysis should be asymptotically invariant to period effects, in finite samples it may influence the results of the hypothesis tests. Intuitively though one would not anticipate this effect to be large, nor would one routinely expect large period effects in many settings. In the Appendix, simulations to explore this are presented for the TOMADO example. The results indicate that there is little evidence to suggest the value of the period effects influences the performance of the re-estimation procedures. Trialists must be mindful however that this cannot be guaranteed, and should therefore be investigated.
We also considered the utility of a simple sample size inflation factor in ensuring the power reaches the desired level. Ultimately, we demonstrated that this was an effective extension to the basic re-estimation procedures. Though the observed inflation to the FWER of our procedures was often small, if more strict control is desired, a crude $\alpha$-level adjustment procedure can also be utilised. For a particular re-estimation scenario, the values of and , and say, which maximise the inflation to the FWER under the global null hypothesis can be determined via a two-dimensional search. Then, the significance level used in the analysis of the trial can be adjusted to the that confers a FWER of for this , pair, according to further simulations. This may be useful in practice if the inflation is large for a particular trial design scenario of interest.
It is important to note the seemingly inherent advantages and disadvantages of the various re-estimation procedures. The adjusted estimator is perhaps the most constrained of those considered, requiring an equal number of patients to be allocated to each sequence for any non-zero adjustment to be reasonable. This is particularly troubling because of the possibility of patient drop-out.
The estimator following block randomisation does not necessitate equal allocation to sequences (though its performance was considered here only when this was the case), but could also fall foul of patient drop-out that would prevent the estimation of the within person variance for each block. It also requires block randomisation, and could not be used with a simpler randomisation scheme if this was desired. The unblinded estimator of course suffers from none of these problems, but as discussed may be looked upon less favourably by regulators.
Finally, note that in conducting our work we also considered the performance of two re-estimation procedures based on methodology for the clustering of longitudinal data [7, 10]. The motivation for this came from the Expectation-Maximisation algorithm approaches of Gould and Shih for parallel two-arm, and Kieser and Friede for parallel multi-arm, studies. These methods may seem appealing, as they are blinded, under certain assumptions can produce unbiased estimates of the variance parameters, do not require specification of any adjustment, and in theory should be able to more readily handle small amounts of missing data. However, we found that they routinely vastly under-estimated the size of the within person variance, resulting in substantially lower power than that attained by the other re-estimation procedures. Accordingly, especially given the associated concerns about the appropriateness of an Expectation-Maximisation algorithm for blinded sample size re-estimation, we would not recommend that re-estimation be performed using a clustering-based approach.
In conclusion, following findings for other trial design settings, blinded estimators can be used for sample size re-estimation in multi-treatment crossover trials. The operating characteristics of any chosen procedure should of course be assessed pre-trial through a comprehensive simulation study. But, often, investigators can hope to find that the likelihood of correctly powering their study when there is pre-trial uncertainty over the within and between person variances can be enhanced.
The authors would like to thank the two anonymous reviewers whose comments helped to substantially improve the quality of this article. This work was supported by the Medical Research Council [grant number MC_UU_00002/3 to M.J.G. and A.P.M., and grant number MC_UU_00002/6 to J.M.S.W.].
Appendix A Deriving
In this section, we elaborate on how a formula for the sample size required by a trial, , can be specified when and are known, a set of sequences has been chosen, and . First, we focus on the set up from Section 2.1, before briefly describing adjustments for other testing scenarios.
We begin by denoting the linear mixed model for our vector of observations by . Here, it is important to note that the particular form of the matrices and is dependent on the sample size and the chosen sequences. Then, the generalised least squares estimate for , , is given by
where is known. Precisely, by our choice of covariance structure implied by linear mixed model (1), will be an block diagonal matrix, consisting of blocks given by . In addition, , will also be known. Finally, is an unbiased estimate of . Thus the vector of test statistics , where
has the following multivariate normal distribution
Here, , and with . Furthermore, is the matrix formed by placing the elements of the vector along the leading diagonal, and we take the convention that .
Using the distribution of , we can control the FWER to a desired level for our hypotheses of interest by rejecting if , for the which is the solution of
where is the -dimensional cumulative distribution function of a central multivariate normal distribution with covariance matrix . The power to reject when is then
for . Consequently, to have power of we must have
By deriving the explicit form of we can determine its dependence upon the sample size , which allows the above to be arranged and our final formula specified.
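The critical value in the normal approximation above can be approximated numerically. The sketch below is a Monte Carlo illustration only, and it rests on an assumption not stated in this section: that the test statistics are equicorrelated with pairwise correlation 0.5, as arises when every experimental treatment is compared with a shared control using equally precise estimates. The function name and arguments are hypothetical; an exact calculation would instead evaluate the multivariate normal distribution function.

```python
import random


def dunnett_critical_value(n_comparisons, alpha, n_draws=100_000, seed=0):
    """Monte Carlo approximation to the one-sided many-to-one critical value.

    Assumes standard normal test statistics with pairwise correlation 0.5.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_draws):
        control = rng.gauss(0.0, 1.0)
        # Each statistic shares the same control draw, which induces a
        # correlation of 0.5 between any two of them under the global null.
        stats = [(rng.gauss(0.0, 1.0) - control) / 2 ** 0.5
                 for _ in range(n_comparisons)]
        maxima.append(max(stats))
    maxima.sort()
    # Empirical (1 - alpha) quantile of the maximum statistic.
    index = min(int((1.0 - alpha) * n_draws), n_draws - 1)
    return maxima[index]
```

With a single comparison this reduces to the usual one-sided normal quantile, which offers a simple check of the simulation.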
For example, as discussed in Section 3.2, in the case where complete block sequences that are balanced for period are utilised, it is well known that (Jones and Kenward, 2014). Thus, we can rearrange the above to acquire our previously specified formula
In the case where alternative hypotheses are to be assessed (e.g., a global test that foregoes many-to-one comparisons), provided testing is performed using effects from , the above can easily be adapted to designate an appropriate sample size formula. In each instance, one specifies a vector of test statistics, the distribution of which can be derived using that of . This allows an appropriate formula for controlling the FWER to be provided. For example, in the case where
we still utilise as defined above, but now reject if , where is the solution of
Then, a formula for the power of interest can similarly be designated, which allows to either be specified explicitly, or permits an iterative search to be performed to determine its required value.
For example, for the design scenario of Section 2.1, if we instead desire a familywise power of at least when (that is, power to reject at least one of ), our formula for the power becomes
Having derived and for any , a one-dimensional search can then be performed to acquire the minimal such that the above is at least .
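The one-dimensional search described above can be sketched generically. In the illustration below, `minimal_sample_size` steps through candidate sample sizes until a user-supplied power function reaches the target. The accompanying `pairwise_power` function is only a stand-in used to exercise the search: it assumes the normal approximation and a contrast variance of $2\sigma_e^2 / N$, and it is not the familywise power formula of this section. All names are hypothetical.

```python
from statistics import NormalDist


def pairwise_power(n, sigma_e2, delta, critical_value):
    """Stand-in power function: normal approximation to pairwise power,
    assuming the estimated contrast has variance 2 * sigma_e2 / n."""
    standardised_effect = delta * (n / (2.0 * sigma_e2)) ** 0.5
    return NormalDist().cdf(standardised_effect - critical_value)


def minimal_sample_size(power_fn, target_power, start, step, n_max):
    """Smallest n in start, start + step, ... (up to n_max) with
    power_fn(n) >= target_power, or None if no such n exists."""
    n = start
    while n <= n_max:
        if power_fn(n) >= target_power:
            return n
        n += step
    return None
```

Setting `step` to the number of sequences keeps every candidate sample size compatible with equal allocation across sequences; a familywise power function could be passed in place of the stand-in without changing the search.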
Appendix B Adjusted blinded estimator
In this section, we find the expected value of the adjusted blinded estimators of the within and between person variances discussed in the main paper.
First, note that for any sequences which are balanced for period
for , and
Now note that for the linear mixed model (1) that
In addition, observe that unless and . Furthermore
where we have used Equation (3). Additionally
Next, consider the following, which we call
Taking expectations we have