Bayesian design and analysis of external pilot trials for complex interventions

08/16/2019 ∙ by Duncan T. Wilson, et al. ∙ 0

External pilot trials of complex interventions are used to help determine if and how a confirmatory trial should be undertaken, providing estimates of parameters such as recruitment, retention and adherence rates. The decision to progress to the confirmatory trial is typically made by comparing these estimates to pre-specified thresholds known as progression criteria, although the statistical properties of such decision rules are rarely assessed. Such assessment is complicated by several methodological challenges, including the simultaneous evaluation of multiple endpoints, complex multi-level models, small sample sizes, and uncertainty in nuisance parameters. In response to these challenges, we describe a Bayesian approach to the design and analysis of external pilot trials. We show how progression decisions can be made by minimising the expected value of a loss function, defined over the whole parameter space to allow for preferences and trade-offs between multiple parameters to be articulated and used in the decision making process. The assessment of preferences is kept feasible by using a piecewise constant parameterisation of the loss function, the parameters of which are chosen at the design stage to lead to desirable operating characteristics. We describe a flexible, yet computationally intensive, nested Monte Carlo algorithm for estimating operating characteristics. The method is used to revisit the design of an external pilot trial of a complex intervention designed to increase the physical activity of care home residents.




1 Introduction

Randomised clinical trials (RCTs) of complex interventions can be compromised by factors such as slow patient recruitment, poor levels of adherence to the intervention, and low completeness of follow-up data. To identify these problems prior to the main RCT we often conduct small trials [1] known as pilots. These typically take the same form as the planned RCT but with a considerably lower sample size [2]. If there is a seamless transition between the pilot and the main RCT, with all data being pooled and used in the final analysis, they are known as internal pilots. External pilots, in contrast, are carried out separately to the main RCT with a clear gap between the two trials.

The data generated by an external pilot trial are used to help decide if the main RCT should go ahead, and if so, whether the intervention or the trial design should be adjusted to ensure success. These decisions are typically made by comparing pilot estimates against pre-specified thresholds known as progression criteria. In the UK, the National Institute for Health Research asks that progression criteria are pre-specified and included in the research plan [3], and the recent CONSORT extension to randomised pilot trials requires their reporting [4]. A single pilot trial can collect data on several progression criteria, often focused on the aforementioned areas of recruitment, protocol adherence, and data collection [5]. Although they may take the form of single threshold values leading to binary stop/go decision rules, investigators are increasingly using two thresholds to accommodate an intermediate decision between stopping altogether and progressing straight to the main trial, which would allow progression but only after some adjustments have been made [4]. The need for appropriate progression criteria is clear when we consider the consequences of poor post-pilot progression decisions. If the criteria are too lax, there is a greater risk that the main trial will go ahead but be found to be infeasible, and thus a waste of resources; if the criteria are too strict, a promising intervention may be discarded under the mistaken belief that the main trial would be infeasible. Despite this, there is little published guidance about how progression criteria should be determined [5, 6].

In addition to pre-specifying progression criteria, another key design decision is the choice of pilot sample size. Conventional methods of sample size determination, which focus on ensuring the trial will have sufficient power to detect a target difference in the primary outcome, are rarely used since they would lead to a pilot sample size comparable with the main trial sample size. Several methods for pilot sample size determination instead aim to provide a sufficiently precise estimate of the variance in the primary outcome measure to inform the sample size of the main trial [7, 8, 9, 10, 11, 12]. Others have suggested a simple rule of thumb for when the goal is to identify unforeseen problems [13]. While some have noted that the low sample size in pilots may lead to a considerable probability that a certain progression criterion will be met (or missed) due to random sampling variation [11, 14], and despite the consequences of making the wrong progression decision, the statistical properties of pilot decision rules are rarely used to inform the choice of sample size. This may be due to the methodological challenges commonly found in pilot trials of complex interventions, including the simultaneous evaluation of multiple endpoints, complex multi-level models, small sample sizes, and prior uncertainty in nuisance parameters [15].

In this paper we will describe a method for designing and analysing external pilot trials which addresses these challenges. We take a Bayesian view, where progression decisions are made to minimise the expected value of a loss function. We propose a loss function with three parameters whose values can be determined either through direct elicitation of preferences or by considering the pilot trial operating characteristics they lead to. The operating characteristics we propose are all unconditional probabilities (with respect to a prior distribution) of making incorrect decisions, also known as assurances [16]. Using assurances rather than the analogous frequentist error rates brings several benefits, including the ability to make use of existing knowledge whilst allowing for any uncertainty, and a more natural interpretation [17]. As we will show, assurances are also useful when our preferences for different end-of-trial decisions are based on several attributes in a complex way that involves trading off some against others.

The remainder of this paper is organised as follows. In Section 2 we describe the general framework for pilot design and analysis, some operating characteristics used for evaluation, and a routine for optimising the design. Two illustrative examples are then described in Sections 3 and 4. Finally, we discuss implications and limitations in Section 5.

2 Methods

2.1 Analysis and progression decisions

Consider a pilot trial which will produce data $x$ according to model $p(x \mid \theta)$. We decompose the parameters into $\theta = (\phi, \psi)$, where $\phi$ denotes the parameters of substantive interest and $\psi$ the nuisance parameters. We follow [18] and assume that two joint prior distributions of $\theta$ have been specified, one each for the design and analysis stages of the trial. The first, denoted $p_D(\theta)$, is a completely subjective prior which fully expresses our knowledge and uncertainty in the parameters at the design stage. The second prior, $p_A(\theta)$, will be used when analysing the pilot data. Although it may be that the same subjective prior is intended to be used at both the design and analysis stages, it has been argued that regulators are unlikely to accept the prior beliefs of the trial sponsor for analysis of the data and as such a weakly or non-informative prior should be used for $p_A(\theta)$ [16, 19].

After observing the pilot data $x$, we must decide whether or not to progress to the main RCT. We consider three possible actions following the aforementioned ‘traffic light’ system commonly used in pilot trials:

  • red - discard the intervention and stop all future development or evaluation;

  • amber - proceed to the main RCT, but only after some modifications to the intervention, the planned trial design, or both; or

  • green - proceed immediately to the main RCT.

In what follows we will denote these decisions by $r$, $a$ and $g$ respectively. We assume that our preferences between the three possible decisions are influenced by the substantive parameters $\phi$ but independent of the nuisance parameters $\psi$, formalising the separation of the full parameter vector $\theta$ into substantive and nuisance components. We can then partition the substantive parameter space $\Phi$ into three subsets where each decision is optimal. We denote these subsets by $\Phi_I$, for $I = R, A, G$. We will henceforth refer to these three subsets as hypotheses. Throughout, we will distinguish hypothesis $I$ from the corresponding optimal decision $i$ by using upper and lower case letters respectively.

When $\phi \in \Phi_I$ and we choose a decision $d \neq i$, there will be negative consequences. In particular, we may make three kinds of mistakes: proceed to an infeasible main RCT; discard a promising intervention; or make unnecessary adjustments to the intervention or trial design. We denote these errors as $E_1$, $E_2$, $E_3$ respectively. The occurrence of error $E_i$ will be denoted by $e_i = 1$, otherwise $e_i = 0$. An error’s occurrence will be a function of the decision made and the true parameter value $\phi$, i.e. $e_i(d, \phi)$ for $i = 1, 2, 3$. We then use a loss function to express the preferences of the decision maker(s) on the space of possible events under uncertainty, defined as

$$L(d, \phi) = c_1 e_1(d, \phi) + c_2 e_2(d, \phi) + c_3 e_3(d, \phi).$$

Note that the additive form of the loss function implies that our preferences for any one of the attributes are independent of the values taken by the others [20].

To determine appropriate values of the parameters $c = (c_1, c_2, c_3)$ we first scale the loss function by setting $c_1 + c_2 + c_3 = 1$. Thus, a loss of 0 is obtained if no errors occur, and a loss of 1 is obtained if all errors occur (although note that this is not possible in this setting). We then elicit some judgements from the decision maker(s) and use these to determine the values of $c$. One approach is to ask them to first consider their preferences between the event with probability 1, and a simple gamble which will lead to no errors (i.e. $e_1 = e_2 = e_3 = 0$) with probability and the event with probability . They are asked to choose a value for such that they are indifferent between the two options. Similarly, we ask for a value of such that they are indifferent between the event with probability 1, and a simple gamble which will lead to no errors with probability and the event with probability . These two judgements leave us with the three equations

which can then be solved to obtain

The loss function will then take values as given in Table 1. For example, suppose we make a ‘green’ decision under the ‘amber’ hypothesis. The subsequent trial will be infeasible because the necessary adjustments will not have been made; but we have also discarded a promising intervention, since it would have been redeemed had the adjustments been made. The overall loss is therefore $c_1 + c_2$, the sum of the losses attached to an infeasible main RCT and to a discarded promising intervention.

Decision 0
Table 1: Losses associated with each decision under each hypothesis.

Given a loss function with parameters $c$, we follow the principle of maximising expected utility (or in our case, minimising the expected loss) when making a progression decision. We first use the pilot data $x$ in conjunction with the analysis prior $p_A(\theta)$ to obtain a posterior $p_A(\theta \mid x)$, and then choose the decision $d^*$ such that

$$d^* = \arg\min_{d \in \{r, a, g\}} \mathbb{E}_{\phi \mid x}\left[ L(d, \phi) \right].$$


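We can simplify this expression by noting that, given the piecewise constant nature of the loss function, the expected loss of each decision depends only on the posterior probabilities

$$p_I = \Pr[\phi \in \Phi_I \mid x]$$

for $I = R, A, G$. We then have

$$\mathbb{E}_{\phi \mid x}\left[ L(d, \phi) \right] = \sum_{I \in \{R, A, G\}} p_I \, L(d, I),$$

where $L(d, I)$ denotes the constant loss incurred by decision $d$ under hypothesis $I$, as listed in Table 1.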


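For some simple models which admit a conjugate analysis, the posterior probabilities can be obtained exactly. Otherwise, Monte Carlo estimates can be computed based on the samples from the joint posterior distribution generated by an MCMC analysis of the pilot data. Specifically, given $M$ posterior samples $\phi^{(1)}, \ldots, \phi^{(M)}$ of the substantive parameters,

$$\Pr[\phi \in \Phi_I \mid x] \approx \frac{1}{M} \sum_{k=1}^{M} \mathbb{I}\left(\phi^{(k)} \in \Phi_I\right),$$

where $\mathbb{I}(\cdot)$ is the indicator function.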
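The Monte Carlo estimate and the resulting decision can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the 3 x 3 loss table is an assumption assembled from the three error types described above (the text only spells out one cell, a green decision under the amber hypothesis, which incurs both the infeasible-trial and the discarded-intervention losses).

```python
import numpy as np

DECISIONS = ("r", "a", "g")
HYPOTHESES = ("R", "A", "G")

def loss_table(c1, c2, c3):
    """ASSUMED loss table: rows are decisions (r, a, g), columns hypotheses (R, A, G)."""
    return np.array([
        [0.0,     c2,      c2],   # red: discards a promising intervention under A or G
        [c1 + c3, 0.0,     c3],   # amber: assumed losses under R and G
        [c1,      c1 + c2, 0.0],  # green: the c1 + c2 cell is the one stated in the text
    ])

def posterior_probs(labels):
    """Proportion of posterior samples whose substantive parameters fall in each hypothesis."""
    labels = np.asarray(labels)
    return np.array([np.mean(labels == h) for h in HYPOTHESES])

def best_decision(probs, c):
    """Return the decision minimising expected loss, plus all three expected losses."""
    expected = loss_table(*c) @ np.asarray(probs, dtype=float)
    return DECISIONS[int(np.argmin(expected))], expected

# Hypothetical posterior samples, already classified into hypotheses.
labels = ["R"] * 1 + ["A"] * 2 + ["G"] * 7
decision, expected = best_decision(posterior_probs(labels), c=(0.3, 0.4, 0.3))
```

In practice the `labels` array would come from applying the hypothesis definitions to each MCMC draw; here it is hard-coded purely for illustration.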

2.2 Operating characteristics

Defining a loss function and following the steps of the preceding section effectively prescribes a decision rule mapping the pilot data sample space $\mathcal{X}$ to the decision space $\mathcal{D} = \{r, a, g\}$. To gain some insight at the design stage into the properties of this rule, we propose to calculate some trial operating characteristics. These take the form of unconditional probabilities of making an error when following the rule, calculated with respect to the design prior $p_D(\theta)$. We consider the following:

  • $OC_1$ - probability of proceeding to an infeasible main RCT;

  • $OC_2$ - probability of discarding a promising intervention;

  • $OC_3$ - probability of making unnecessary adjustments to the intervention or the trial design.

These operating characteristics can be estimated using simulation. First, we draw $N$ samples $(\theta^{(k)}, x^{(k)})$, $k = 1, \ldots, N$, from the joint distribution $p(x \mid \theta)\, p_D(\theta)$. For each data set we then apply the analysis and decision making procedure described in Section 2.1, using some vector $c$ to parametrise the loss function. This results in $N$ decisions which can be contrasted with the corresponding true parameter value and the hypothesis in which it resides, noting if any of the three types of errors have been made. MC estimates of the operating characteristics can then be calculated as the proportion of occurrences of each type of error in the $N$ simulated cases. Assuming that $N$ is large, the unbiased MC estimate of an operating characteristic with true probability $p$ will be approximately normally distributed with variance $p(1-p)/N$.¹

¹ Note that in the case of complex models which do not admit a conjugate analysis, the posterior probabilities obtained using an MCMC analysis will themselves be approximate and as such the optimal decision will be subject to error, which may increase the variance of the operating characteristic estimates. However, this issue can be sidestepped by assuming that, for each data set, the analysis that is simulated corresponds exactly to the analysis that would be carried out in practice. In particular, we assume that exactly the same number of posterior samples will be generated by the same MCMC algorithm, using the same seed in the random number generator.
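The nested structure of this simulation (an outer loop over simulated trials, an inner posterior sample for each) can be illustrated with a deliberately simple stand-in. The sketch below is not the paper's model: it assumes a single binomial feasibility parameter, a stop/go decision, a hypothetical threshold of 0.75, a hypothetical Beta(8, 2) design prior and a uniform analysis prior, and it draws the inner posterior samples directly rather than by MCMC.

```python
import numpy as np

def estimate_ocs(n, c1, threshold=0.75, design=(8.0, 2.0), N=2000, M=1000, seed=1):
    """Nested Monte Carlo estimate of two operating characteristics (illustrative only).

    Outer loop: N (parameter, data) pairs drawn from the design prior and the model.
    Inner loop: M posterior draws per data set under a uniform Beta(1, 1) analysis prior.
    """
    rng = np.random.default_rng(seed)
    phi = rng.beta(*design, size=N)            # true parameter values from the design prior
    x = rng.binomial(n, phi)                   # simulated pilot data, one data set per draw
    post = rng.beta(1 + x[:, None], 1 + n - x[:, None], size=(N, M))
    p_green = (post > threshold).mean(axis=1)  # MC posterior probability of feasibility
    go = p_green > c1                          # two-hypothesis rule with losses summing to 1
    feasible = phi > threshold
    oc1 = np.mean(go & ~feasible)              # proceeded to an infeasible main trial
    oc2 = np.mean(~go & feasible)              # discarded a promising intervention
    se = lambda p: np.sqrt(p * (1 - p) / N)    # normal-approximation MC standard error
    return oc1, oc2, se(oc1), se(oc2)

lenient = estimate_ocs(n=30, c1=0.2)
strict = estimate_ocs(n=30, c1=0.8)
```

Because the seed is fixed, both calls reuse identical simulated trials, so any difference between `lenient` and `strict` is attributable to the loss parameter alone.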

2.3 Optimisation

Elicitation of the loss function parameters in the manner described in Section 2.1 may be challenging, particularly when multiple decision-makers are involved [21]. An alternative way to determine the loss parameter vector $c$ is through examining the operating characteristics it leads to (for some fixed pilot design). As $c$ is adjusted, the balance between the conflicting objectives of minimising each OC will change, and the task is then to find the $c$ which returns the best balance from the perspective of the decision-maker. Formally, and thinking of operating characteristics as functions of $c$, we wish to solve the multi-objective optimisation problem

$$\min_{c \in \mathcal{C}} \; \big( OC_1(c),\; OC_2(c),\; OC_3(c) \big), \qquad (7)$$

where $\mathcal{C} = \{ c \in [0, 1]^3 : c_1 + c_2 + c_3 = 1 \}$.

Since the three objectives are in conflict there will be no single solution which simultaneously minimises each one. Writing $\mathcal{C}$ for the set of admissible loss parameter vectors, we would instead like to find a set $\mathcal{C}^* \subset \mathcal{C}$ such that each member provides a different balance between minimising the three operating characteristics. If there exist $c, c' \in \mathcal{C}$ such that $OC_i(c) \le OC_i(c')$ for all $i \in \{1, 2, 3\}$ and $OC_j(c) < OC_j(c')$ for some $j$, we say that $c$ dominates $c'$. In this case, because $c'$ leads to worse (or at least no better) values of all three operating characteristics when compared to $c$, we have no reason to include it in our set $\mathcal{C}^*$. Because the search space has only two dimensions, problem (7) can be approximately solved by generating a uniform random sample of $c$’s and estimating the operating characteristics for each. Any parameters which are dominated in this set can then be discarded, and the operating characteristics of those which remain can be illustrated graphically. The decision maker(s) can then view the range of available options, all providing different trade-offs amongst the three operating characteristics, and choose from amongst them.
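The sampling and filtering steps can be sketched as follows. The helper names are hypothetical, and drawing from a flat Dirichlet distribution is one assumed way of sampling uniformly over loss vectors that sum to one. The first three rows of the example are the operating characteristics reported for points a, b and c in Table 3; the fourth row is an invented, deliberately dominated point.

```python
import numpy as np

def sample_loss_vectors(size, seed=0):
    """Uniform random sample over {c : c_i >= 0, c_1 + c_2 + c_3 = 1} (flat Dirichlet)."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet([1.0, 1.0, 1.0], size=size)

def non_dominated(ocs):
    """Boolean mask of rows not dominated by any other row (smaller is better)."""
    ocs = np.asarray(ocs, dtype=float)
    keep = np.ones(len(ocs), dtype=bool)
    for i, row in enumerate(ocs):
        no_worse = np.all(ocs <= row, axis=1)         # candidates at least as good everywhere
        strictly_better = np.any(ocs < row, axis=1)   # and strictly better somewhere
        keep[i] = not np.any(no_worse & strictly_better)
    return keep

ocs = [
    [0.107, 0.108, 0.232],  # point a (Table 3)
    [0.021, 0.394, 0.080],  # point b (Table 3)
    [0.151, 0.539, 0.002],  # point c (Table 3)
    [0.200, 0.600, 0.300],  # hypothetical point, dominated by a
]
mask = non_dominated(ocs)
cs = sample_loss_vectors(5)
```

The pairwise comparison is quadratic in the number of candidate vectors, which is unproblematic for a few hundred candidates.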

To solve the problem in a timely manner we must be able to estimate operating characteristics quickly. Noting from equation (3) that the expected loss of each decision depends only on the loss parameters $c$ and the posterior probabilities of the hypotheses, we first generate $N$ samples of these posterior probabilities and then use this same set of samples for every evaluation. This approach not only ensures that optimisation is computationally feasible, but also means that differences in operating characteristics are entirely due to differences in costs, as opposed to differences in the random posterior probability samples.
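A sketch of this reuse: once the posterior probabilities and the true hypothesis of each simulated trial are stored, evaluating a new loss vector needs only a matrix product and an argmin. The error indicator tables below are an assumption about which decision and hypothesis pairs trigger each error type (the text only states the green-under-amber case explicitly), and all names are hypothetical.

```python
import numpy as np

# ASSUMED error indicators; rows are decisions (r, a, g), columns are hypotheses (R, A, G).
E1 = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])  # proceed to an infeasible main RCT
E2 = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])  # discard a promising intervention
E3 = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]])  # make unnecessary adjustments

def ocs_for(c, post_probs, truth):
    """Operating characteristics for loss vector c, reusing stored simulation output.

    post_probs: (N, 3) posterior probabilities of R, A, G for each simulated trial.
    truth:      (N,) index (0, 1, 2) of the hypothesis containing the true parameter.
    """
    post_probs = np.asarray(post_probs, dtype=float)
    truth = np.asarray(truth, dtype=int)
    loss = c[0] * E1 + c[1] * E2 + c[2] * E3   # loss[d, h]
    expected = post_probs @ loss.T             # expected[k, d] for trial k, decision d
    d = expected.argmin(axis=1)                # optimal decision per simulated trial
    return np.array([E[d, truth].mean() for E in (E1, E2, E3)])

# Toy stored output: three trials whose posteriors are certain of R, A and G in turn.
post_probs = np.eye(3)
c = (1 / 3, 1 / 3, 1 / 3)
all_correct = ocs_for(c, post_probs, truth=[0, 1, 2])
all_green = ocs_for(c, post_probs, truth=[2, 2, 2])
```

Calling `ocs_for` repeatedly with different `c` but the same stored arrays mirrors the common-random-numbers idea described above.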

3 Illustrative example - Child psychotherapy (TIGA-CUB)

TIGA-CUB (Trial on Improving Inter-Generational Attachment for Children Undergoing Behaviour problems) was a two-arm, individually-randomised, controlled pilot trial informing the feasibility and design of a confirmatory RCT comparing Child Psychotherapy (CP) to Treatment as Usual (TaU), for children with treatment-resistant conduct disorders. The trial aimed to recruit primary carer-child dyads, to be randomised equally to each arm. This sample size was chosen to give desired levels of precision in the estimates of the common standard deviation of the primary outcome, the follow-up rate, and the adherence rate. Here, we focus on the latter two parameters and consider how our proposed method could have informed the design of TIGA-CUB.

We model the number of participants successfully followed up using a binomial distribution with parameter $\phi_f$, and similarly the number successfully adhering to the intervention using a binomial distribution with parameter $\phi_a$. At the design stage, the follow-up rate $\phi_f$ was thought to be somewhere in the range 62% to 92%, while the adherence rate $\phi_a$ was thought to lie between 40% and 95%. We reflect these ranges of uncertainty in our design priors by using a beta distribution for $\phi_f$ (giving a prior mean of 0.8) and a beta distribution for $\phi_a$ (giving a prior mean of 0.7). We assume that a uniform ‘non-informative’ prior will be used for each parameter in the analysis.

TIGA-CUB’s progression criteria included only simple stop/go thresholds, with no intermediate ‘amber’ decisions. As such, in this example we partition the parameter space into two hypotheses, $\Phi_R$ and $\Phi_G$. For the purposes of illustration we define the hypothesis $\Phi_G$ as the subset of the parameter space where the follow-up rate and the adherence rate each exceed a fixed threshold, hypothesis $\Phi_R$ being its complement. Thus, in this example we do not consider there to be a trade-off between the two parameters of interest. For the main trial to be feasible, both must be above their respective thresholds. The prior distributions on the follow-up and adherence rates imply an a priori probability of 0.28 that $\phi \in \Phi_G$, i.e. that both follow-up and adherence are sufficiently high.

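In this special case, the loss function is

$$L(d, \phi) = c_1 e_1(d, \phi) + c_2 e_2(d, \phi),$$

where $e_1$ and $e_2$ indicate an infeasible main trial and a discarded promising intervention respectively. The expected losses of decisions $r$ and $g$ will be $c_2 p_G$ and $c_1 p_R$, where $p_R = \Pr[\phi \in \Phi_R \mid x]$ and $p_G = \Pr[\phi \in \Phi_G \mid x]$. Decision $g$ is therefore optimal whenever $c_1 p_R < c_2 p_G$, which reduces to $p_G > c_1$ when the losses are scaled so that $c_1 + c_2 = 1$, since $p_R = 1 - p_G$. The posterior probability $p_G$ can be easily calculated given the pilot data due to the beta prior distributions being conjugate. Specifically, given a total sample size $n$ and observing $x_f$ participants with follow-up and $x_a$ participants with adherence, the posterior probability is given by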


where $F(\cdot\,; \alpha, \beta)$ denotes the cumulative distribution function of the beta distribution with parameters $\alpha, \beta$.
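A conjugate calculation of this kind can be sketched with the standard library alone. Under a uniform prior the posterior of each rate is a beta distribution with integer parameters, so its CDF can be written as a binomial tail sum. Everything specific here is an assumption for illustration: the function names, the thresholds `t_f` and `t_a`, the use of separate denominators for follow-up and adherence, and the independence of the two posteriors (which follows if the two binomial models and their priors are independent).

```python
from math import comb

def beta_cdf_int(t, a, b):
    """CDF at t of Beta(a, b) for positive integers a, b, via the binomial-tail identity."""
    m = a + b - 1
    return sum(comb(m, j) * t**j * (1 - t) ** (m - j) for j in range(a, m + 1))

def prob_green(n_f, x_f, n_a, x_a, t_f, t_a):
    """Posterior probability that both rates exceed their thresholds (uniform priors).

    x_f of n_f participants were followed up; x_a of n_a adhered.
    t_f and t_a are hypothetical thresholds for follow-up and adherence.
    """
    p_follow = 1 - beta_cdf_int(t_f, 1 + x_f, 1 + n_f - x_f)
    p_adhere = 1 - beta_cdf_int(t_a, 1 + x_a, 1 + n_a - x_a)
    return p_follow * p_adhere

def decide(p_green, c1):
    """Stop/go rule when the two losses are scaled to sum to one."""
    return "g" if p_green > c1 else "r"

# With no data the posterior is the uniform prior, so the result is (1 - t_f)(1 - t_a).
p_prior = prob_green(0, 0, 0, 0, t_f=0.8, t_a=0.7)
```

For non-integer prior parameters a general incomplete beta routine would be needed instead of the tail sum used here.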

At the design stage we can calculate the probability of an infeasible trial ($OC_1$),


and similarly for the probability of discarding a promising intervention. As these calculations can be computationally expensive for moderate $n$ due to the nested summation term, we use Monte Carlo approximations as described in Section 2.2.

Keeping the sample size fixed at per arm, we estimated the operating characteristics for a range of cost parameter values using Monte Carlo samples. The results are plotted in Figure 1, with some specific values of highlighted. The decision-maker can decide which point on the operating characteristic curve best reflects their own priorities in terms of the two types of error. For example, if the consequences of running an infeasible main RCT are considered less important than those of needlessly discarding a potentially effective intervention, the decision-maker may choose to set and would obtain .

Figure 1: Probabilities of an infeasible main trial ($OC_1$) and of discarding a promising intervention ($OC_2$) for a range of loss parameters when sample size is fixed at .

To examine the effect of adjusting the sample size, we evaluated the operating characteristics obtained for per arm whilst setting . The results are shown in Figure 2. Each line includes a shaded area denoting the 95% Monte Carlo error intervals, although these are barely visible given the high number of MC samples used for each calculation. Although operating characteristics generally improve as the sample size is increased, we see that for and 0.5 the probability of an infeasible main trial, $OC_1$, remains flat whilst $OC_2$ has a downward trend. As we would expect, the expected loss reduces smoothly as the sample size increases in all cases. In contrast, there is some variability in the OCs beyond that explained by MC error. This can be explained by the discrete nature of simulated adherence and follow-up data. Our results show that, for the design priors and hypotheses used in this example, the chosen sample size in TIGA-CUB of can provide error rates broadly in line with conventional type I and II error rates under the usual hypothesis testing framework.

Figure 2: Probabilities of an infeasible main trial ($OC_1$) and of discarding a promising intervention ($OC_2$) for a range of per-arm sample sizes and different values of the loss parameter.

4 Illustrative example - Physical activity in care homes (REACH)

The REACH (Research Exploring Physical Activity in Care Homes) trial aimed to inform the feasibility and design of a future definitive RCT assessing a complex intervention designed to increase the physical activity of care home residents [22]. The trial was cluster randomised at the care home level, with twelve care homes in total randomised equally between treatment as usual (TaU) and the intervention plus TaU.

Data on several feasibility outcomes were collected. Here, we focus on four: recruitment (measured in terms of the average number of residents in each care home who participate in the trial); adherence (a binary indicator at the care home level indicating if the intervention was fully implemented); data completion (a binary indicator for each resident of successful follow-up at the planned primary outcome time of 12 months); and potential efficacy (a continuous measure of physical activity at the resident level). Progression criteria using the traffic light system were pre-specified for all of these outcomes except potential efficacy, as detailed in Table 2.

Outcome | Red | Amber | Green
Recruitment (avg. per care home) | Less than 8 | Between 8 and 10 | At least 10
Adherence | Less than 50% | Between 50% and 75% | At least 75%
Follow-up | Less than 65% | Between 65% and 75% | At least 75%

Table 2: Pre-specified progression criteria used in the original REACH design.

4.1 Model specification

To begin specifying a model for the REACH trial, we first note that the four substantive parameters can be divided into two pairs. Firstly, mean cluster size and follow-up rate relate to the amount of information which a confirmatory trial will gather. Secondly, potential efficacy and adherence relate to the effectiveness of the intervention, where effectiveness is thought of as the effect which will be obtained in practice when the effect of non-adherence is accounted for. We expect that a degree of trade-off between adherence and potential efficacy will be acceptable, with a decrease in one being compensated by an increase in the other. Likewise, low mean cluster size could be compensated to some extent by higher follow-up rate, and vice versa.

While there may be trade-offs within these pairs of parameters, we do not expect trade-offs between them. A trial with no effectiveness will be futile regardless of the amount of information collected, and so should not be conducted. Similarly, a confirmatory trial should not be conducted if it is highly unlikely to produce enough information for the research question to be adequately answered. We therefore consider the sub-spaces of the substantive parameter space formed by these parameter pairs, partition these into hypotheses, and combine these together. Constructing hypotheses in these two-dimensional spaces is cognitively simpler than working in the original four-dimensional space, not least because they can be easily illustrated graphically.

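Formally, let $\Phi^1$ be the sub-space of mean cluster size and follow-up rate, and $\Phi^2$ be that of adherence and potential efficacy, with $\phi = (\phi^1, \phi^2)$. Having specified hypotheses $\Phi^k_R$, $\Phi^k_A$ and $\Phi^k_G$ for $k = 1, 2$, we then have

$$\Phi_R = \{\phi : \phi^1 \in \Phi^1_R \text{ or } \phi^2 \in \Phi^2_R\}, \qquad \Phi_G = \{\phi : \phi^1 \in \Phi^1_G \text{ and } \phi^2 \in \Phi^2_G\}, \qquad \Phi_A = \Phi \setminus (\Phi_R \cup \Phi_G). \qquad (11)$$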
4.1.1 Follow-up and cluster size

We assume cluster sizes are normally distributed. A normal-inverse-gamma prior is placed on the mean and variance to allow for prior uncertainty in both parameters. It was anticipated that an average of 8 to 12 residents would be recruited in each care home. To reflect this prior belief we set the hyper-parameters to , giving a prior cluster size of 10 with mean variance 2.05.

We assume that follow-up rates are constant across clusters. The number of participants followed up is assumed to follow a binomial distribution. We take a Beta distribution as the prior for the follow-up rate, with hyper-parameters giving a prior mean of 0.7 and a standard deviation of 0.08.
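A Beta distribution with a stated mean and standard deviation can be found by moment matching. The helper below is a hypothetical illustration of that arithmetic; the values it returns are not claimed to be the hyper-parameters used in the trial design.

```python
def beta_from_moments(mean, sd):
    """Return (alpha, beta) of the Beta distribution with the given mean and standard deviation."""
    var = sd ** 2
    if not 0 < mean < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    if var >= mean * (1 - mean):
        raise ValueError("variance too large for a Beta distribution with this mean")
    k = mean * (1 - mean) / var - 1   # k = alpha + beta
    return mean * k, (1 - mean) * k

# Moments quoted in the text for the follow-up rate prior.
alpha_f, beta_f = beta_from_moments(0.7, 0.08)
```

The sum `alpha + beta` acts as a prior sample size: the tighter the stated standard deviation, the more pilot data are needed to move the posterior away from the prior mean.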

To partition the parameter space into hypotheses, we first consider the case where follow-up is perfect, i.e. the follow-up rate is equal to 1. Conditional on this, we reason that a mean cluster size of below 5 should lead to a red decision (stop development), whereas a size of above 7 should lead to a green decision (proceed to the main trial). As the probability of successful follow-up decreases, we suppose that this can be compensated by an increase in mean cluster size. We assume the nature of this trade-off is linear and decide that if the follow-up rate were reduced to 0.8, we would want to have a mean cluster size of at least 8 to consider decisions $a$ or $g$. We further decide that a follow-up rate of less than would be critically low, regardless of the mean cluster size, and should always lead to decision $r$. Similarly, a follow-up rate of should lead to modification of the intervention or trial design. Together, these conditions lead to the following partitioning of the parameter space:


The hypotheses are illustrated in Figure 3 (a). Having specified both the hypotheses and the prior distribution for these two parameters, we can obtain prior probabilities of each hypothesis by sampling from the prior and calculating the proportion of these samples falling into the red, amber and green regions. We have plotted 1000 samples from the prior in Figure 3 (a), falling into hypotheses $R$, $A$ and $G$ in proportions 0.354, 0.517, 0.129 respectively. This demonstrates that there is substantial prior uncertainty regarding the optimal decision, indicating the potential value of the pilot trial.

Figure 3: Marginal hypotheses over parameters for (a) follow-up rate and mean cluster size; and (b) adherence rate and potential efficacy. Each point is a sample from the joint prior distribution.

4.1.2 Adherence and potential efficacy

Having defined priors and hypotheses with respect to cluster size and follow-up, we now consider adherence and potential efficacy. The number of care homes which successfully adhere to the intervention delivery plan is assumed to be binomially distributed. We assume that adherence is absolute in the sense that all residents in a care home which does not successfully deliver the intervention will not receive any of the treatment effect. We place a Beta prior on the adherence probability, with hyper-parameters giving a prior mean of 0.9 and a standard deviation of 0.05.

The continuous measure of physical activity is expected to be correlated within care homes. We model this using a random intercept, where the outcome of resident $i$ in care home $j$ is

$$y_{ij} = \mu + \delta\, t_j a_j + u_j + \epsilon_{ij}.$$

Here, $\mu$ is the mean outcome under TaU, $\delta$ is the potential efficacy, $t_j$ is a binary indicator of care home $j$ being randomised to the intervention arm, $a_j$ is a binary indicator of care home $j$ successfully adhering to the intervention, $u_j$ is the random effect for care home $j$ and $\epsilon_{ij}$ is the residual for resident $i$. We parametrise the model using the intracluster correlation coefficient $\rho$, and place priors on the model parameters in the manner suggested in [23]. Specifically, we choose


To reflect prior expectation of an ICC around 0.05 but possibly as large as 0.1, the hyperparameters give a prior mean of 0.05 for the ICC with a prior probability of 0.104 that it will exceed 0.1.
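To make the data-generating model concrete, the sketch below simulates outcomes from a random-intercept model of the kind described, with a treatment effect that is received only in intervention homes that adhere. It is a simplified illustration under stated assumptions, not the authors' simulation code: all names and arguments are hypothetical, every care home has the same number of residents, and the variance is split using the usual ICC definition (between-home variance divided by total variance).

```python
import numpy as np

def simulate_cluster_trial(k_per_arm, m, mu, delta, p_adhere, rho, total_sd, seed=0):
    """Simulate resident-level outcomes for 2 * k_per_arm care homes with m residents each."""
    rng = np.random.default_rng(seed)
    n_homes = 2 * k_per_arm
    t = np.repeat([0, 1], k_per_arm)                  # arm indicator per care home
    a = rng.binomial(1, p_adhere, n_homes) * t        # adherence only applies in the intervention arm
    sd_u = total_sd * np.sqrt(rho)                    # between-home standard deviation
    sd_e = total_sd * np.sqrt(1 - rho)                # within-home (residual) standard deviation
    u = rng.normal(0.0, sd_u, n_homes)                # random intercepts
    eps = rng.normal(0.0, sd_e, (n_homes, m))         # resident-level residuals
    y = mu + delta * (t * a)[:, None] + u[:, None] + eps
    return y, t, a

# Hypothetical inputs: 6 homes per arm, 10 residents each, ICC of 0.05.
y, t, a = simulate_cluster_trial(6, 10, mu=0.0, delta=0.3, p_adhere=0.9, rho=0.05, total_sd=1.0)
```

Setting `rho` to 1 removes all within-home variation, which gives a quick sanity check on the variance split.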

While there is potential for adherence to be improved after the pilot, we assume there will be little opportunity to improve the potential efficacy of the intervention. Moreover, we suppose an absolute improvement in adherence of up to around 0.1 is feasible. To define the hypotheses in this subspace we first set a minimal level of potential efficacy to be 0.1, and decide that we would be happy to make decision $g$ at this point if and only if adherence is perfect. As adherence reduces from 1, a corresponding linear increase in potential efficacy is considered to maintain the overall effectiveness of the intervention. The rate of substitution for this trade-off is determined to be approximately 0.57 units of potential efficacy per unit of adherence probability. We consider an absolute lower limit in adherence of , below which we will always consider decision $r$ to be optimal. Taking these considerations together, the marginal hypotheses are defined as


The hypotheses are illustrated in Figure 3 (b). Again, a sample of size 1000 from the joint marginal prior distribution is also plotted, falling into hypotheses $R$, $A$ and $G$ in proportions 0.234, 0.470, 0.296 respectively. As before, this indicates substantial prior uncertainty regarding the optimal decision and thus supports the use of a pilot study.

The marginal hypotheses are combined together using equation (11). Considering the same 1000 samples from the design prior plotted in Figure 3, these now fall into the overall regions $\Phi_R$, $\Phi_A$ and $\Phi_G$ in proportions 0.507, 0.458, and 0.035 respectively. Note that the prior probabilities of these overall hypotheses are quite different to those of the marginal hypotheses. In particular, there is a considerable increase in the probability that decision $r$ will be optimal, and a considerable decrease in the probability that decision $g$ will be.

4.2 Evaluation

4.2.1 Weakly informative analysis

We applied the proposed method assuming that a weakly informative joint prior distribution will be used at the analysis stage.² We took the sample size of the trial to be clusters per arm. For calculating operating characteristics we generated $N = 10^4$ samples from the joint distribution of parameters and data under the design prior. We analysed each simulated data set using Stan via the R package rstan [24], in each case generating 5000 samples in four chains and discarding the first 2500 samples in each to allow for burn-in, leading to 10,000 posterior samples in total. This gave a maximum Monte Carlo error of approximately 0.005 when estimating a posterior probability, which we considered sufficient. These posterior samples were then used to find the posterior probabilities of each hypothesis, for each simulated data set.

² Full details of the weakly informative prior are given in the appendix.
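The quoted Monte Carlo error follows from the binomial standard error evaluated at its worst case, a probability of 0.5. The check below treats the retained draws as independent, which is an idealisation for MCMC output (autocorrelation would increase the error); the variable names are illustrative.

```python
from math import sqrt

chains, iterations, warmup = 4, 5000, 2500
kept = chains * (iterations - warmup)   # posterior draws retained across all chains

def mc_standard_error(p, m):
    """Standard error of a proportion estimated from m independent draws."""
    return sqrt(p * (1 - p) / m)

worst_case = mc_standard_error(0.5, kept)   # the error is largest at p = 0.5
```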

We evaluated the operating characteristics for a sample of loss parameter vectors as described in Section 2.3. A total of 254 parameter vectors were evaluated, of which 62 led to operating characteristics which were worse in every respect than some other vector (i.e. dominated) and were discarded. The operating characteristics of the non-dominated parameters are shown in Figure 4. The three operating characteristics are found to be highly correlated. In particular, changing the parameters to give a lower probability of discarding a promising intervention ($OC_2$) tends to lead to a reduction in the probability of making an unnecessary adjustment ($OC_3$). When selecting the loss parameters, the key decision appears to be trading off the probability of an infeasible trial ($OC_1$) against $OC_2$. There is a very limited opportunity to minimise $OC_3$ at the expense of these. For example, compare points b and c in Figure 4, details of which are given in Table 3. We see that point c reduces $OC_3$ by 0.078 in comparison to point b, but only at the expense of increases in $OC_1$ and $OC_2$ of 0.13 and 0.145 respectively.

Figure 4: Operating characteristics of the example pilot trial for a range of loss parameter vectors, when a weakly informative analysis prior is used.
Table 3: Estimated operating characteristics (with standard errors) of the REACH trial for the three loss parameter vectors highlighted in Figure 4, when a weakly informative analysis prior is used. Costs have been rounded to 2 decimal places; operating characteristics and their errors to 3.

| Point | Loss parameters | Infeasible trial | Discarding a promising intervention | Unnecessary adjustment |
|---|---|---|---|---|
| a | (0.07, 0.9, 0.03) | 0.107 (0.003) | 0.108 (0.003) | 0.232 (0.004) |
| b | (0.18, 0.58, 0.24) | 0.021 (0.001) | 0.394 (0.005) | 0.08 (0.003) |
| c | (0.01, 0.29, 0.7) | 0.151 (0.004) | 0.539 (0.005) | 0.002 (0) |

We would expect to see a clear relationship between the value of each loss parameter and the operating characteristic it relates to. We explore this in Figure 5 with scatter plots of each parameter against each operating characteristic. The results show that there is indeed a strong relationship between the loss assigned to discarding a promising intervention and the probability that this event will occur (see centre plot). Moreover, this loss also seems to be the main determinant of the other two operating characteristics. The implication is that once the loss for discarding a promising intervention has been chosen, the operating characteristics of the trial depend only weakly on the way in which the remaining loss is allocated to the other two errors. This appears to be because, regardless of how errors are weighted, the way we have defined our prior distributions and hypotheses means we are much more likely to make the error of discarding a promising intervention than the other types of error. The cost we assign to this error is therefore more influential on the overall operating characteristics than the other costs.

Figure 5: Relationships between the three loss parameters (x axes) and resulting operating characteristics (y axes).

To illustrate the effect of varying sample size in the REACH trial, we fixed the loss function parameters at those of one of the points highlighted in Figure 4 and Table 3. We then estimated the operating characteristics obtained for three choices of the number of clusters per arm; only three were considered because of the significant computational burden of each evaluation. The results are plotted in Figure 6. Increasing the sample size appears to have little effect on the probabilities of an infeasible trial and of an unnecessary adjustment, while leading to a decrease in the probability of discarding a promising intervention. This behaviour reflects the priorities encoded by the loss parameters.

Figure 6: Operating characteristics of the REACH trial for the three per-arm sample sizes considered, with the loss parameters held fixed. Error bars denote 95% confidence intervals. All points have been adjusted horizontally to avoid overlap.

4.2.2 Incorporating subjective priors

Rather than use weakly or non-informative priors when analysing the pilot data, we may instead want to make use of the (subjective) elicited knowledge of parameter values described in the design prior. Anticipating criticisms of a fully subjective analysis, we can envisage two particular cases where this might be appropriate. Firstly, we could use the components of the design prior which describe the nuisance parameters while maintaining weakly informative priors on the substantive parameters. Secondly, when very little data on a specific substantive parameter is going to be collected in the pilot, using the informative design prior for that parameter could substantially improve operating characteristics.

We replicated the above analysis for these two scenarios. For the second, we used informative priors for all nuisance parameters and for the probability of adherence. Recall that this is informed by a binary indicator at the care home level and only in the intervention arm, and will therefore have very little pilot data bearing on it. For each case we used the same samples of parameters and pilot data which were used in the weakly informative case, repeating the Bayesian analysis using the appropriate analysis prior and obtaining estimated posterior probabilities as before. These were used in conjunction with the same set of loss parameter vectors to obtain corresponding operating characteristics.

For brevity we will refer to the three cases as Weakly Informative (WI), Informative Nuisance (IN), and Informative Nuisance and Adherence (INA). Comparing the operating characteristics of cases WI and IN, we found very little difference (further details are provided in the appendix). When we contrast cases WI and INA, however, there is a clear distinction. Using the INA analysis prior will lead to larger probabilities of an infeasible trial and of unnecessary adjustment, while reducing the probability of discarding a promising intervention, for almost all loss parameters. The expected loss is always lower for the INA analysis than for WI, as we would expect.

Figure 7: Operating characteristics and expected utilities for weakly (WI) and partially informative (INA) analysis priors.

5 Discussion

When deciding if and how a definitive RCT of a complex intervention should be conducted, and basing this decision on an analysis of data from a small pilot trial, there is a risk we will inadvertently make the wrong choice. A Bayesian analysis of pilot data followed by decision making based on a loss function can help ensure this risk is minimised. The expected results of such a pilot can be evaluated through simulation at the design stage, producing operating characteristics which help us understand the potential for the pilot to lead to better decision making. These evaluations can in turn be used to find the loss function which leads to the most desirable operating characteristics, and to inform the choice of sample size.
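The decision step summarised above can be sketched as follows. Given posterior probabilities for the red, amber and green hypotheses and a vector of three losses, the decision with the smallest posterior expected loss is chosen. The loss table in this example is an assumption made for illustration: it assigns the first loss to decisions that lead to an infeasible trial, the second to discarding a promising intervention, and the third to an unnecessary adjustment. The function name and the posterior probabilities are hypothetical; the two loss vectors are points a and b from Table 3.

```python
from typing import Dict, Tuple


def choose_decision(post: Dict[str, float],
                    c: Tuple[float, float, float]) -> str:
    """Return the decision ('red', 'amber' or 'green') that minimises
    posterior expected loss under an assumed piecewise constant loss.

    post maps the hypotheses 'R', 'A', 'G' to posterior probabilities.
    c = (c1, c2, c3) are the losses for an infeasible trial, discarding a
    promising intervention, and an unnecessary adjustment respectively.
    """
    c1, c2, c3 = c
    expected_loss = {
        # Stopping is an error whenever the intervention is promising.
        "red": c2 * (post["A"] + post["G"]),
        # Adjusting is futile under R and unnecessary under G.
        "amber": c1 * post["R"] + c3 * post["G"],
        # Proceeding unchanged is an error unless G holds.
        "green": c1 * (post["R"] + post["A"]),
    }
    return min(expected_loss, key=expected_loss.get)


# Hypothetical posterior probabilities from one pilot data set.
posterior = {"R": 0.2, "A": 0.3, "G": 0.5}

decision_a = choose_decision(posterior, (0.07, 0.90, 0.03))
decision_b = choose_decision(posterior, (0.18, 0.58, 0.24))
print(decision_a, decision_b)
```

With the same posterior probabilities, the two loss vectors lead to different decisions, which illustrates how the choice of loss parameters at the design stage shapes the decision rule.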

Our proposal has been motivated by some salient characteristics of complex intervention pilot trials, and offers several potential benefits over standard pilot trial design and analysis techniques. The Bayesian approach to analysis means that complex multi-level models can be used to describe the data, even when the sample size is small. In contrast to the usual application of independent progression criteria for several parameters of interest, we provide a way for preferential relationships between parameters to be articulated and used when making decisions. Using a subjective prior distribution on unknown parameters at the design stage allows both our knowledge and our uncertainty to be fully expressed, meaning we can leverage external information whilst also avoiding decisions which are highly sensitive to imprecise point estimates.

Our proposed design is related to the literature on assurance calculations for clinical trials [16], applying the idea of using unconditional event probabilities as operating characteristics to the pilot trial setting. In doing so we have shown how assurances can be defined for multiple substantive parameters with trade-offs between them, and with respect to the ‘traffic light’ red/amber/green decision structure commonly found in pilot trials. The multi-objective optimisation framework we have used to inform trial design allows the decision-maker to explicitly consider the different trade-offs between operating characteristics which are available, and select that which best reflects their own preferences. A similar approach has been taken in the context of phase II trials using the statistical concept of admissible designs [25, 26]. This can be contrasted with the conventional and much criticised approach common in the frequentist context, where arbitrary constraints are placed on type I and II error rates in order to define a single optimal design [27].

The benefits brought by the Bayesian approach must be set against the challenges it brings, particularly in terms of computation time and implementation. In terms of the latter, we are required to specify a joint prior distribution over the parameters and a partitioning of the parameter space into the three hypotheses. The specification of the prior distribution may be a challenging and time-consuming task. Although some relevant data relating to similar contexts may be available, for example in systematic reviews or observational studies, expert opinion may still be required to articulate the relevance of such data to the problem at hand. When no data are available, which is not unlikely given the early phase nature of pilot studies, expert opinion will be the only source of information. Although potentially challenging, many examples describing successful practical applications of elicitation for clinical trial design are available [19, 17, 28], as are tools for its conduct such as the Sheffield Elicitation Framework (SHELF) [29]. Dividing the parameter space into three hypotheses may also prove challenging in practice, particularly when trade-offs between more than two parameters are to be elicited. There is a need for methodological research investigating how methods for multi-attribute preference elicitation, such as those set out in [21], can be applied in this context.

The computational burden of the proposed method is significant, particularly when the model is too complex to allow a conjugate analysis to be used when sampling from the posterior distribution. We have used a nested Monte Carlo sampling scheme to estimate operating characteristics, as seen elsewhere [18, 16, 30]. One potential approach to improve efficiency is to use non-parametric regression to predict the expected losses of Equation (3) based on some simulated data, thus bypassing the need to undertake a full MCMC analysis for each of the samples in the outer loop. This approach has been shown to be successful in the context of expected value of information calculations [31, 32]. The computational difficulties will be particularly pertinent when using our approach to determine sample size, as several evaluations of different sample size choices will be required. If the choice of sample size can be framed as an optimisation problem, methods for efficient global optimisation of computationally expensive functions such as those described in [33, 34] may be useful [15]. Alternatively, one of several rules-of-thumb for choosing pilot sample size [35, 8, 10, 12] could be used, with the resulting operating characteristics evaluated using the proposed method.
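A toy version of the nested Monte Carlo scheme may help to show its structure. The sketch below is not the REACH model: it assumes a single substantive parameter (a probability), a conjugate beta-binomial analysis so that posterior draws can be sampled directly rather than by MCMC, and hypothetical hypothesis thresholds, design prior and loss table. The outer loop samples a parameter and a pilot data set from the design prior; the inner loop estimates the posterior probabilities of each hypothesis; the resulting decision is compared with the truth to count errors.

```python
import random
from typing import Dict, Tuple


def hypothesis(p: float) -> str:
    """Hypothetical partition of the parameter space into R, A and G."""
    if p < 0.5:
        return "R"
    return "A" if p < 0.7 else "G"


def posterior_probs(x: int, n: int, inner: int,
                    rng: random.Random) -> Dict[str, float]:
    """Inner loop: estimate posterior hypothesis probabilities by sampling
    from the Beta(1 + x, 1 + n - x) posterior (uniform analysis prior)."""
    counts = {"R": 0, "A": 0, "G": 0}
    for _ in range(inner):
        counts[hypothesis(rng.betavariate(1 + x, 1 + n - x))] += 1
    return {h: k / inner for h, k in counts.items()}


def decide(post: Dict[str, float], c: Tuple[float, float, float]) -> str:
    """Minimise expected loss under an assumed piecewise constant loss."""
    c1, c2, c3 = c
    loss = {
        "r": c2 * (post["A"] + post["G"]),
        "a": c1 * post["R"] + c3 * post["G"],
        "g": c1 * (post["R"] + post["A"]),
    }
    return min(loss, key=loss.get)


def operating_characteristics(n: int, c: Tuple[float, float, float],
                              outer: int = 1000, inner: int = 200,
                              seed: int = 1) -> Tuple[float, float, float]:
    """Outer loop: average error indicators over the design prior."""
    rng = random.Random(seed)
    infeasible = discard = adjust = 0
    for _ in range(outer):
        p = rng.betavariate(6, 4)  # hypothetical design prior
        x = sum(rng.random() < p for _ in range(n))  # simulated pilot data
        d = decide(posterior_probs(x, n, inner, rng), c)
        truth = hypothesis(p)
        if (d == "g" and truth in ("R", "A")) or (d == "a" and truth == "R"):
            infeasible += 1
        elif d == "r" and truth in ("A", "G"):
            discard += 1
        elif d == "a" and truth == "G":
            adjust += 1
    return infeasible / outer, discard / outer, adjust / outer


ocs = operating_characteristics(n=20, c=(0.18, 0.58, 0.24))
print(ocs)
```

The total cost is the product of the outer and inner sample sizes, which is why the scheme becomes expensive when each inner step requires a full MCMC run instead of direct sampling.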

We have defined our procedure in terms of a loss function, where the decision making following the pilot will minimise the expected loss. However, the piecewise constant loss function we have proposed may not adequately represent the preferences of the decision maker. For example, we may object to the loss associated with discarding a promising intervention being independent of exactly how effective the intervention is. An alternative is to try to define a richer representation of the loss function through direct elicitation of the decision maker's preferences under uncertainty [20], leading to a fully decision-theoretic approach to design and analysis [36]. However, as previously noted by others [37, 38, 39], implementation of these approaches has been limited in practice, which may indicate that they are difficult to apply.

The proposed method could be extended in several ways. More operating characteristics could be defined and used in design optimisation, more complicated trade-off relationships between multiple parameters could be addressed, or the hypotheses could be expanded to include nuisance parameters which would be used as part of the sample size calculation in the main RCT. A particularly interesting avenue for future research is to consider how to model post-pilot trial actions in more detail. For example, while we allow for the possibility of making an ‘amber’ decision, indicating that modifications to the intervention or trial design should be made, we do not model what that decision will actually look like and how it should relate to the observed pilot data. Methodology for jointly modelling a pilot and subsequent main RCT in this manner could be informed by developments for designing phase II/III programs in the drug setting [40, 41, 42, 43].

Data availability statement

All simulated data used in this manuscript, together with the code used to generate it, is available at


This work was supported by the Medical Research Council under Grant MR/N015444/1 to D.T.W. and Grant MC_UU_00002/6 to J.M.S.W.


  • [1] Peter Craig, Paul Dieppe, Sally Macintyre, Susan Michie, Irwin Nazareth, and Mark Petticrew. Developing and evaluating complex interventions: the new Medical Research Council guidance. BMJ: British Medical Journal, 337, 9 2008.
  • [2] Sandra M. Eldridge, Gillian A. Lancaster, Michael J. Campbell, Lehana Thabane, Sally Hopewell, Claire L. Coleman, and Christine M. Bond. Defining feasibility and pilot studies in preparation for randomised controlled trials: Development of a conceptual framework. PLOS ONE, 11(3):e0150205, mar 2016.
  • [3] National Institute for Health Research. Research for Patient Benefit (RfPB) programme guidance on applying for feasibility studies, 2017.
  • [4] Sandra M Eldridge, Claire L Chan, Michael J Campbell, Christine M Bond, Sally Hopewell, Lehana Thabane, and Gillian A Lancaster. CONSORT 2010 statement: extension to randomised pilot and feasibility trials. BMJ, page i5239, oct 2016.
  • [5] Kerry N L Avery, Paula R Williamson, Carrol Gamble, Elaine O’Connell Francischetto, Chris Metcalfe, Peter Davidson, Hywel Williams, and Jane M Blazeby. Informing efficient randomised controlled trials: exploration of challenges in developing progression criteria for internal pilot studies. BMJ Open, 7(2):e013537, feb 2017.
  • [6] Lisa V Hampson, Paula R Williamson, Martin J Wilby, and Thomas Jaki. A framework for prospectively defining progression rules for internal pilot studies monitoring recruitment. Statistical Methods in Medical Research, 0(0):0962280217708906, 2017. PMID: 28589752.
  • [7] Richard H. Browne. On the use of a pilot sample for sample size determination. Statistics in Medicine, 14(17):1933–1940, 1995.
  • [8] Steven A. Julious. Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics, 4(4):287–291, 2005.
  • [9] Julius Sim and Martyn Lewis. The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. Journal of Clinical Epidemiology, 65(3):301–308, mar 2012.
  • [10] M Teare, Munyaradzi Dimairo, Neil Shephard, Alex Hayman, Amy Whitehead, and Stephen Walters. Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials, 15(1):264, 2014.
  • [11] Sandra M Eldridge, Ceire E Costelloe, Brennan C Kahan, Gillian A Lancaster, and Sally M Kerry. How big should the pilot study for my cluster randomised trial be? Statistical Methods in Medical Research, 2015.
  • [12] Amy L Whitehead, Steven A Julious, Cindy L Cooper, and Michael J Campbell. Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Statistical Methods in Medical Research, 2015.
  • [13] Wolfgang Viechtbauer, Luc Smits, Daniel Kotz, Luc Budé, Mark Spigt, Jan Serroyen, and Rik Crutzen. A simple formula for the calculation of sample size in pilot studies. Journal of Clinical Epidemiology, 68(11):1375–1379, nov 2015.
  • [14] Cindy L Cooper, Amy Whitehead, Edward Pottrill, Steven A Julious, and Stephen J Walters. Are pilot trials useful for predicting randomisation and attrition rates in definitive studies: A review of publicly funded trials. Clinical Trials, 0(0):1740774517752113, 2018. PMID: 29361833.
  • [15] D. T. Wilson, R. E. Walwyn, J. Brown, A. J. Farrin, and S. R. Brown. Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies. Statistical Methods in Medical Research, 25(3):997–1009, jun 2015.
  • [16] Anthony O’Hagan, John W. Stevens, and Michael J. Campbell. Assurance in clinical trial design. Pharmaceutical Statistics, 4(3):187–201, 2005.
  • [17] Adam Crisp, Sam Miller, Douglas Thompson, and Nicky Best. Practical experiences of adopting assurance as a quantitative framework to support decision making in drug development. Pharmaceutical Statistics, 0(0), 2018.
  • [18] Fei Wang and Alan E. Gelfand. A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science, 17(2):193–208, 2002.
  • [19] Rosalind J. Walley, Claire L. Smith, Jeremy D. Gale, and Phil Woodward. Advantages of a wholly Bayesian approach to assessing efficacy in early drug development: a case study. Pharmaceutical Statistics, 14(3):205–215, apr 2015.
  • [20] Simon French and David Rios Insua. Statistical Decision Theory. Number 9 in Kendall’s Library of Statistics. Oxford University Press, 2000.
  • [21] Ralph L. Keeney and Howard Raiffa. Decisions with multiple objectives: preferences and value tradeoffs. John Wiley & Sons, 1976.
  • [22] Anne Forster, Jennifer Airlie, Karen Birch, Robert Cicero, Bonnie Cundill, Alison Ellwood, Mary Godfrey, Liz Graham, John Green, Claire Hulme, Rebecca Lawton, Vicki McLellan, Nicola McMaster, and Amanda Farrin. Research exploring physical activity in care homes (REACH): study protocol for a randomised controlled trial. Trials, 18(1), apr 2017.
  • [23] David J. Spiegelhalter. Bayesian methods for cluster randomized trials with continuous responses. Statistics in Medicine, 20(3):435–452, 2001.
  • [24] Stan Development Team. RStan: the R interface to Stan, 2016. R package version 2.14.1.
  • [25] Sin-Ho Jung, Taiyeong Lee, Kyung Mann Kim, and Stephen L. George. Admissible two-stage designs for phase II cancer clinical trials. Statistics in Medicine, 23(4):561–569, 2004.
  • [26] Adrian P. Mander, James M.S. Wason, Michael J. Sweeting, and Simon G. Thompson. Admissible two-stage designs for phase II cancer clinical trials that incorporate the expected sample size under the alternative hypothesis. Pharmaceutical Statistics, 11(2):91–96, 2012.
  • [27] Peter Bacchetti. Current sample size conventions: Flaws, harms, and alternatives. BMC Medicine, 8(1):17, Mar 2010.
  • [28] Nigel Dallow, Nicky Best, and Timothy H Montague. Better decision making in drug development through adoption of formal prior elicitation. Pharmaceutical Statistics, 0(0), 2018.
  • [29] Anthony O’Hagan, Caitlin E. Buck, Alireza Daneshkhah, J. Richard Eiser, Paul H. Garthwaite, David J. Jenkinson, Jeremy E. Oakley, and Tim Rakow. Uncertain Judgements: Eliciting Experts’ Probabilities. John Wiley and Sons, 2006.
  • [30] Alexander J. Sutton, Nicola J. Cooper, David R. Jones, Paul C. Lambert, John R. Thompson, and Keith R. Abrams. Evidence-based sample size calculations based upon updated meta-analysis. Statistics in Medicine, 26(12):2479–2500, 2007.
  • [31] Mark Strong, Jeremy E. Oakley, and Alan Brennan. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample. Medical Decision Making, 34(3):311–326, apr 2014.
  • [32] Mark Strong, Jeremy E Oakley, Alan Brennan, and Penny Breeze. Estimating the expected value of sample information using the probabilistic sensitivity analysis sample: a fast, nonparametric regression-based method. Medical Decision Making, 35(5):570–583, 2015.
  • [33] Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
  • [34] Olivier Roustant, David Ginsbourger, and Yves Deville. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.
  • [35] Gillian A. Lancaster, Susanna Dodd, and Paula R. Williamson. Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice, 10(2):307–312, 2004.
  • [36] Dennis V. Lindley. The choice of sample size. Journal of the Royal Statistical Society: Series D (The Statistician), 46(2):129–138, 1997.
  • [37] Lawrence Joseph and David B. Wolfson. Interval-based versus decision theoretic criteria for the choice of sample size. Journal of the Royal Statistical Society: Series D (The Statistician), 46(2):145–149, 1997.
  • [38] Peter Bacchetti, Charles E. McCulloch, and Mark R. Segal. Simple, defensible sample sizes based on cost efficiency. Biometrics, 64(2):577–585, jun 2008.
  • [39] John Whitehead, Elsa Valdés-Márquez, Patrick Johnson, and Gordon Graham. Bayesian sample size for exploratory clinical trials incorporating historical data. Statistics in Medicine, 27(13):2307–2327, 2008.
  • [40] Nigel Stallard. Optimal sample sizes for phase II clinical trials and pilot studies. Statistics in Medicine, 31(11-12):1031–1042, 2012.
  • [41] James M. S. Wason, Thomas Jaki, and Nigel Stallard. Planning multi-arm screening studies within the context of a drug development program. Statistics in Medicine, 32(20):3424–3435, 2013.
  • [42] Heiko Götte, Armin Schüler, Marietta Kirchner, and Meinhard Kieser. Sample size planning for phase II trials based on success probabilities for phase III. Pharmaceutical Statistics, 14(6):515–524, sep 2015.
  • [43] Marietta Kirchner, Meinhard Kieser, Heiko Götte, and Armin Schüler. Utility-based optimization of phase II/III programs. Statistics in Medicine, 35(2):305–316, aug 2015.