 # Efficient and flexible simulation-based sample size determination for clinical trials with multiple design parameters

Simulation offers a simple and flexible way to estimate the power of a clinical trial when analytic formulae are not available. The computational burden of using simulation has, however, restricted its application to only the simplest of sample size determination problems, minimising a single parameter (the overall sample size) subject to power being above a target level. We describe a general framework for solving simulation-based sample size determination problems with several design parameters over which to optimise and several conflicting criteria to be minimised. The method is based on an established global optimisation algorithm widely used in the design and analysis of computer experiments, using a non-parametric regression model as an approximation of the true underlying power function. The method is flexible, can be used for almost any problem for which power can be estimated using simulation, and can be implemented using existing statistical software packages. We illustrate its application to three increasingly complicated sample size determination problems involving complex clustering structures, co-primary endpoints, and small sample considerations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The sample size of a clinical trial is typically minimised subject to the power of the trial being above a nominal level, often 80 or 90%. For many sample size determination (SSD) problems, power can be calculated using a simple mathematical formula and the optimisation problem can be solved in a timely manner. When complexity in the trial design or the method of analysis mean such formulae are not readily available, we can estimate power using a Monte Carlo (MC) approximation [1, 2]

. To do so, we simply simulate several hypothetical sets of trial data under the alternative hypothesis, analyse each of these, and calculate the proportion of analyses which reject the null hypothesis. The simplicity and flexibility of the simulation method has seen it used for a variety of statistical models and study designs, including problems involving hierarchical models

[3, 4], proportional hazards models 

, logistic regression models

, individual patient data meta-analyses [7, 8], patient enrolment models , stepped wedge designs [10, 11], and cluster randomised crossover designs . Although calculating MC estimates of power can be computationally demanding, these SSD problems remain feasible because, as optimisation problems, they are quite simple. In particular, optimisation takes place over a single parameter (the sample size), subject to a single constraint (power), and with respect to a single objective to be minimised (the sample size again).

SSD problems, particularly those found in trials of complex interventions, are not always this simple . There may be several parameters, each influencing the power of the trial, which need to be specified at the design stage. Several design parameters are common in, for example, trials with multilevel structures such as cluster randomised trials, where both the number of clusters and the number of participants in each cluster must be specified. Increasing the number of design parameters complicates the SSD problem by increasing the number of possible solutions to search. A second complication arises when there is more than one criterion we are interested in minimising, subject to the nominal power constraint. A cluster randomised trial will often have this property, as we would like to minimise both the total number of participants and the number of clusters. Given multiple conflicting objectives, there is no single ‘optimum’ solution but rather a range of solutions which offer different degrees of trade-off between the objectives. Seeking a set of good solutions, rather than a single optimum, further adds to the difficulty of the SSD problem.

Complex SSD problems with several design parameters and several objectives could in theory be solved using benchmark multi-objective optimisation algorithms such as NSGA-II , robust implementations of which are freely available in statistical software such as R . However, these so-called ‘greedy’ algorithms typically assume that evaluating any proposed solution to the problem is a very fast process, and consequently evaluate many thousands of solutions during the search. If these algorithms were applied to problems where evaluating solutions required computing an MC estimate of power, they would take an infeasibly long time to converge. Thus, if we are to extend simulation-based trial design to complex SSD problems, we require a more general framework employing more efficient optimisation algorithms.

Outwith the context of clinical trial design, a great deal of research has addressed optimisation problems where the evaluation of a solution is a computationally demanding, or expensive, operation [16, 17]. One approach addresses the problem by substituting the expensive function with a mathematical approximation known as a surrogate model. The surrogate model is then used to make predictions about the true function for different values of design parameters, with these predictions informing which point should be evaluated next. The information obtained from this evaluation is used to update the surrogate model, thus improving the predictions available at the next iteration. One class of surrogate model is Gaussian process (GP) regression. Also known as Kriging and having its roots in geostatistics 

, GP models are spatial interpolators which are computationally tractable

 and can be fitted using robust and freely available software . A GP surrogate model provides not only a prediction of the true function value at any point, but also a measure of uncertainty in this prediction. This property is exploited by the benchmark Efficient Global Optimisation (EGO) algorithm , allowing the next point in the search to be chosen in a way that formally balances the potential benefits of exploitation (searching around areas already known to be promising) and exploration (searching in areas of high uncertainty). Although EGO was originally proposed for unconstrained optimisation of expensive objective functions with deterministic output, various proposals have extended it to incorporate the expensive constraints , multiple objectives , and stochastic outputs  that feature in complex SSD problems.

In this paper we will explore how GP regression models and a variant of the EGO algorithm can be used to solve complex SSD problems. In contrast with many of the available methods and software for simulation-based SSD, which focus on specific application areas such as multilevel designs (MLPowSim ), IPD meta-analyses (ipdpower ) or stepped wedge design (SWSamp ), we take the same approach as that used in the SimSam package  and propose a more general framework which can be applied to a broad class of SSD problems following the Neyman-Pearson hypothesis testing formulation. Specifically, we require that the user can write a program which simulates the data generating process and analysis of the trial, returning a binary indicator denoting rejection or otherwise of the null hypothesis. This flexibility will not only mean simulation-based SSD can be used for a wide range of existing trial designs, but will also facilitate SSD for novel designs developed in the future and which cannot be anticipated now.

The remainder of the paper is structured as follows. Three motivating problems are described in Section 2. In Section 3 we provide the necessary background and notation regarding Monte Carlo estimation and multi-objective optimisation. In Section 4 we describe Gaussian process regression, the efficient global optimisation algorithm, and a framework for its application to sample size determination. We return to the examples in Section 5, illustrating how the method can be applied in practice. We conclude with a discussion of the implications and limitations of the proposed approach in Section 6.

## 2 Motivating examples

‘Pacing, graded Activity, and Cognitive behaviour therapy; a randomised Evaluation’ (PACE) [26, 27]

was a randomised controlled trial comparing adaptive pacing therapy (APT), cognitive behavioural therapy (CBT), graded exercise therapy (GET) and specialist medical care (SMC) as secondary care treatments for patients with chronic fatigue syndrome (CFS/ME). The data had a complex multilevel structure, with three of the four arms including a therapy provided by different therapists (partially nested structure) and all arms including medical care from the same doctors (crossed structure), leading to the potential for treatment-related clustering. Since some participants receive treatment from both a therapist and a doctor, the relationship between participants and care providers is cross-classified. Moreover, two potentially correlated co-primary endpoints (fatigue and disability) were used. The original sample size calculation used a simple analytic formula for comparing proportions in two groups of equal size. As was typical at the time, it did not account for the impact of clustering or of simultaneously analysing two correlated endpoints. In this section we will describe three theoretical example SSD problems based around the PACE trial, increasing in complexity at each step.

### 2.1 Complex clustering

For simplicity, we consider redesigning the PACE trial to detect a difference in the probability of participants responding with respect to fatigue between the APT and SMC arms. A participant is considered to have responded if they have a score of 3 or less (indicating normal fatigue) on the likert Chalder Fatigue Scale (CFS)

. As in the original design, we assume an equal number of participants will be recruited to each arm. In the intervention arm,

therapists will deliver APT to participants, with each participants receiving treatment from a single therapist. We assume that the number of participants allocated to each therapist will vary with therapist. Specifically, we model the proportion of all participants in the APT arm allocated to a therapist using a Gamma distribution with shape parameter 1. Participants in both APT and SMC arms will receive specialist medical care from one of

doctors, with the proportion allocated to each doctor also following a gamma distribution with shape parameter 1. This leads to a multilevel data structure where therapists are partially nested within interventions, doctors are crossed with interventions, and patients are cross-classified with therapists and doctors in the intervention arm and nested within doctors in the control arm . This structure is illustrated in Figure 1. Figure 1: Multilevel structure of the SMC and APT arms of our example, where therapists (T) are partially nested within interventions, doctors (D) are crossed with interventions, and patients (P) are cross-classified with therapists and doctors in the intervention arm and nested within doctors in the control arm.

In this example, the primary analysis will be a logistic mixed effect model with a likelihood ratio test of the null hypothesis that the probability of response in the APT arm, , is equal to that in the SMC arm,

. For simplicity, the model includes random intercepts in the linear predictor for doctor effects and a random coefficient for therapist effects. Further work is needed on the model that is recommended in this scenario but as the model we assume is for illustration purposes only, it could simply be updated once a recommended model becomes available. We assume that the test statistic follows a chi-squared distribution with 1 degree of freedom, and therefore that the type I error rate can be controlled at a nominal level of

(two-sided). As model convergence may be an issue, we include this in our definition of power by not rejecting the null hypothesis when the model fails to converge. We require that the power at the alternative hypothesis be no less than 90%. Subject to these constraints, we aim to find minimal values of and , recognising that these two objectives will conflict with one another.

Although sample size formulae for partially nested designs with binary outcomes are available , they do not extend to the structure in this example where participants are cross-classified and doctors are crossed with interventions. Including the complex process of model non-convergence into the definition of power further necessitates the use of simulation.

### 2.2 Co-primary endpoints

We extend the previous example to include a co-primary endpoint relating to disability, a binary response defined as 75 (out of 100) or more on the short form-36 physical function subscale , where the mean score for the UK adult population is around 85. An in PACE, the endpoints will be analysed separately, each time fitting a logisitc mixed effect model and conducting a likelihood ratio test as described in Section 2.1. The results of the trial will be considered positive only if both of the analyses show a statistically significant difference, leading to reduced power under the alternative hypothesis of an effect on each endpoint in comparison to the univariate case of Section 2.1. Correlation between the endpoints is expected at the participant level, and correlations between the random effects of therapists and doctors for different endpoints are also expected. This correlation will be modelled in the data generating process, but not in the analysis due to concerns about the feasibility of fitting a multivariate model. Such a mismatch between the data generating and analysis models has been noted as a clear motivation for simulation-based power calculations .

### 2.3 Small sample pilot

Finally, we consider how we might have designed a pilot trial prior to the definitive PACE trial to provide a preliminary test of potential efficacy. At this early stage, we would like 90% power to detect a meaningful effect in either the fatigue or disability endpoints. To enable a small trial to have such high power, we relax the type I error rate to 0.2 (one-sided) and change the primary endpoints from binary responses to the continuous scores on the CFS and SF-36. In the pilot setting we assume we have greater control over the numbers of participants allocated to therapists and to doctors, and so can maximise efficiency by balancing cluster sizes. We also now consider varying the number of doctors. Our objectives are to minimise the total number of participants, the number of therapists, and the number of doctors. The small sample setting of a pilot trial implies the distributional assumptions underpinning type I error control may be violated, and so we simulate power under the null hypothesis and model this constraint in addition to power under the alternative. We include the nominal type I error rate used when testing the null hypothesis as a design variable, allowing an appropriate adjustment to be made as part of the larger optimisation process.

In terms of design parameters, we must choose the total sample size in the APT arm, denoted ; the number of APT therapists, ; the allocation ratio relating the total number of participants in each arm, ; the number of doctors, ; and the nominal type I error rate, . Thus, in comparison with the preceding examples, the number of designs to be searched over is significantly larger. By requiring the simulation of power under the null and alternative hypotheses, the computational burden of simulation is doubled. By minimising three objectives simultaneously, a larger set of solutions will be required to enable the available trade-offs between them to be fully appreciated.

## 3 Background

### 3.1 Monte Carlo estimation

Monte Carlo methods can be used to numerically approximate expectations of real valued functions

with respect to the probability distribution of

. Given samples of , denoted , , the MC estimate is

 E[f(Z)]≈1NN∑i=1f(zi). (1)

The estimate is unbiased for all

and has variance equal to

 ω2=Var[1NN∑i=1f(zi)]=1NVar[f(zi)]. (2)

The standard error of the MC estimate will therefore reduce at a rate of

as we increase . When

is large we can consider an MC estimate to be the true expectation plus a normally distributed error term with 0 mean and variance

, i.e.

 1NN∑i=1f(zi)=E[f(Z)]+e, where e∼N(0,ω2). (3)

In the context of simulation-based trial design, if is the test statistic to be compared with an acceptance region then the probability of acceptance under hypothesis is , where is the indicator function. An MC estimate of the power of a trial design under can therefore be obtained given test statistics sampled under . The steps required to simulate these statistics are described in , and we briefly summarise them here:

1. Define the population model. This describes the underlying target population and should specify all population parameters and distributions under the hypothesis of interest.

2. Define the sampling strategy. This should specify the numbers of patients, clusters, or any other sampling units in the trial and how they will be drawn from the population.

3. Define the method of analysis. For hypothesis testing, this will include defining the form of the test statistic and the acceptance region .

Given each of the above elements, pseudo-random number generators can be used to simulate the recruitment, randomisation and primary outcome measure of patients under the hypothesis of interest, from which a test statistic can be calculated.

### 3.2 Multi-objective optimisation

A solution to the SSD problem consists of a vector of design parameters

, and the solution space is the set of all solutions. A simple SSD problem may have a 1-dimensional solution space, while more complex problems may have several dimensions. Elements of may include parameters defining the sample size of the trial, the acceptance region to be used in the analysis, or any other design aspect over which we have control and which may influence the trial operating characteristics. For instance, in example 2.1 a solution is defined by the number of participants in each arm () and the number of therapists in the intervention arm ().

An objective function is a function which we wish to minimise. In a multi-objective problem with objectives, we denote the vector of objective values as . We will describe as the objective space. In our example 2.1, we have two objectives: minimising the number of clusters ; and minimising the total number of participants, .

A constraint function is a function which must be less than or equal to 0 for the solution

to be considered feasible. For example, if type II error rate is denoted by

and the nominal type II error rate is set at , a constraint function would be . We denote constraint functions as . The general SSD problem can now be stated as

 minx∈X fi(x), i=1,…,B (4) subject to gj(x)≤0, j=1,…,C. (5)

We denote by the relation of Pareto dominance, where a solution dominates another if it is at least as good in all respects, and better in at least one. Formally, if for , and for some . For instance, in example 2.1, but . The Pareto set is the set of non-dominated solutions . An example Pareto set for example 2.1 is plotted in Figure 2, illustrating the available trade-offs between the two objectives.

Multi-objective optimisation seeks to find a set of solutions that are close to the true Pareto set, with every member of the set non-dominated with respect to all other members. We will refer to these as approximation sets, denoted . That is, any set such that with is an approximation set . A set is feasible if all constraints are satisfied by every member of . An example feasible approximation set for example 2.1, plotted in Figure 2, is given by the four points

 A={(589,24),(705,20),(810,12),(982,10)}. (6)

To understand the similarity between any approximation set and the ideal Pareto set, we measure its dominated hypervolume. This is the volume of the subspace dominated by solutions in and bounded by a reference point :

 H(A)=Vol({y∈RB∣y is% dominated by some y∗∈A and y≺r}). (7)

The largest possible hypervolume of any feasible approximation set is achieved by the true Pareto set . We can therefore frame the multi-objective optimisation problem as finding the feasible approximation set with largest hypervolume. Taking a reference point of (marked by the cross in Figure 2), our example approximation set has a dominated hypervolume of 9202. This can be compared with that of the Pareto set, at 14501. We would expect the approximation set to converge to the Pareto set as the number of optimisation iterations increases. Figure 2: Example Pareto front Xp and approximation set A for a cluster randomised trial design problem. The dominated hypervolume of the approximation set with respect to a reference point (cross) is the shaded area.

## 4 Simulation-based sample size determination

### 4.1 Overview

The proposed method is based on the Efficient Global Optimisation algorithm . For clarity we will describe the algorithm in the context of an SSD problem with a single constraint function, denoted , which must be estimated using simulation. The more general case of several constraints will follow. The initial step is to select a number of potential solutions to the SSD problem and evaluate the constraint function at each of these points, giving . A Gaussian process regression model is then fitted to the data, relating the solutions to the estimates and providing an approximation of the constraint function . The solution which has the largest expected improvement according to the predictions of the GP model, is then found. This solution is evaluated to obtain . This new data is then used to update the GP model, which is then used again to find the next solution to evaluate. The algorithm can be repeated until either the computational resources available have been exceeded, or until no further improvements are being obtained. The Algorithm is summarised in 1 below.

The process of computing MC estimates used in steps (1) and (5) has already been described in Section 3.1. In what follows we will first consider step (3), describing Gaussian process regression models and outlining how they can be fitted and used to make predictions. The notion of expected improvement in step (4) will then be defined for the constrained multi-objective problems we are concerned with. Finally, we cover the remaining aspects of implementation.

### 4.2 Gaussian process regression

Consider a set of points at which an expensive function will be estimated using the Monte Carlo method. Consider also some other point where we are interested in making a prediction of . The value of at each point in is initially unknown, but can be modelled by a Gaussian process (GP).

In using a GP we assume that our belief regarding the the values of can be represented by a multivariate normal distribution. Prior to computing any estimates of , we assume that the mean function of this multivariate normal is equal to zero111This is not a restrictive assumption. After observing estimates of the function and updating the GP model to account for these, the mean function can take on non-zero values.. We write the covariance matrix of the distribution as

 (K(XE,XE)K(XE,x∗)K(x∗,XE)K(x∗,x∗)), (8)

where is the covariance matrix for the points , is the -length vector of covariances between and , and is the variance at .

Given this prior distribution, we compute the MC estimates at each point in . From equation (3), where

is a zero-mean normally distributed error term with standard deviation

. We denote by the diagonal matrix where the th entry is . The distribution of conditional on the observed can be shown to be normal with mean and variance  . Thus, given a prior covariance matrix of the form (8) and some MC estimates of at the points , a conditional predictive distribution of can be found. It is this distribution which will be used in the optimisation algorithm when deciding which solution should next be evaluated.

The predictive distributions are influenced by the prior covariance matrix (8). The matrix is populated using a covariance function (or kernel), . This function must be symmetric and positive definite for the covariance matrix to have the same properties. One such covariance function is the squared exponential, which has the form

 k(x,x′)=σexp(−D∑j=1(xj−x′j)2λ2j). (9)

By using covariance functions of this form we will obtain a Gaussian process which is infinitely differentiable over and thus very smooth. This would appear to be a reasonable restriction to place upon the power functions we are interested in. In order to populate the covariance matrix we must choose values of the hyper-parameters . We do this by numerically optimising the log marginal likelihood

 (10)

considered as a function of  . Fitting a GP model by maximum likelihood in this manner can be done using the function km in the R package DiceKriging, as illustrated in the appendix.

An illustration of a Gaussian process regression model of a power function in one dimension is given in Figure 3. The power of three different choices of sample size have been calculated and a GP model fitted to the results. The figure illustrates how the uncertainty in the model predictions (shaded area) increases the further we are from a point which has been evaluated. The GP prediction of power at a sample size of , shown as a dashed line, is normally distributed with mean 0.84 and standard deviation 0.035. Figure 3: A Gaussian process model of a power function over a one-dimensional sample size (solid line) based on three evaluations. Uncertainty is shown as the shaded area. Expected improvement (dotted line) is maximised at a sample size of 190, where the predicted power is normally distributed around a mean estimate of 0.84 (dashed line).

### 4.3 Expected improvement

At any given point during the optimisation process we can obtain an approximation set based on the set of solutions which have been evaluated up to that point. If a new point is considered feasible, a new approximation set will be identified. The improvement resulting from the evaluation of is the difference in the dominated hypervolumes:

 I=H(A∗)−H(A). (11)

Prior to evaluation, we do not know if the point will be considered feasible. We therefore modify to account for the probability that will be considered feasible after the MC estimates have been obtained. This probability can be estimated using the GP regression methodology described in Section 4.2. A GP model of unknown constraint function will describe our current belief about the value of at using a normal distribution with mean and variance , and we will consider the point feasible if the upper

% quantile of this distribution is below 0. We denote this quantile as

 q(x∗)=m+Φ−1(p)s, (12)

where is the standard normal cumulative distribution. Following an evaluation of the GP model will be updated and the quantile revised to . Before the evaluation the value of is unknown, but it is shown in  that its predictive distribution is where

 m+ =m+Φ−1(p)√ω2s2ω2+s2 (13) s2+ =[s2]2ω2+s2, (14)

and is the MC error of the planned evaluation, estimated as where is the number of MC samples to be used. The predictive distribution can then be used to calculate the probability that the point will be considered feasible following its evaluation. Following , we multiply the theoretical improvement by this probability, thus penalising candidate solutions with a low chance of satisfying the constraint. This then gives us our expected improvement measure , where

 Expected Improvement EI(x∗)=[H(A∗)−H(A)]C∏j=1Φ(−mj,+sj,+). (15)

Note that we include a penalty term for all constraint functions. This maximisation problem is in itself complex, with a potentially large number of local maxima. We therefore use the particle swarm optimisation algorithm as implemented in the R package pso , designed to avoid becoming trapped in local maxima, to solve this sub-problem.

An illustration of expected improvement for a single-objective problem is given in Figure 3. When choosing which sample size to evaluate next and aiming to find the lowest per-arm sample size with at least 80% power, we balance the potential improvement over the best current solution (a sample size of 260) with the probability of constraint satisfaction. In this case, we would choose to evaluate the sample size of 190, estimated by the GP model to have a power of 84%.

### 4.4 Implementation

To apply Algorithm 1 in practice we must first choose the initial set of points to be evaluated, . One recommendation is to include 10 points for each dimension of the solution space, and to allocate between 30 and 50% of the total computation budget to their evaluation. To select the location of the points in we use the space-filling Sobol sequence generated using the R package randtoolbox . The number of iterations and the number of MC samples used at each iteration must also be chosen. Given a total computational budget in terms of MC samples, the choice of these values should account for the fact that fitting GP regression models in R to more than around 800 points is currently infeasible .

As the algorithm depends on GP regression models, it can be helpful to assess the fit of these models. One approach is to regularly plot the predicted mean and standard deviation in one or two dimensions, centred at the last evaluated point. Poor model fit could be identified if the mean function is not, for example, strictly increasing as expected. We can also contrast the predicted function values with the obtained function values at each iteration, halting the algorithm if a large and unexpected discrepancy in these values is observed.

We have used R to implement the proposed framework, partly due to the availability of robust and efficient R packages for fitting Gaussian process models (DiceKriging ) and for global optimisation (pso ). Using R also provides flexibility in terms of the user-writen simulation routines by facilitating various complicated analysis procedures, e.g. multilevel modelling through lme4 . Our implementation works to a simple interface. The user must provide instances of two data frames. The first, design_space, contains a row for each design parameter describing its name and its lower and upper bounds. The second, constraints, contains a row for each constraint function . Each row should include a label for the constraint, the hypothesis it pertains to, a nominal power which should not be exceeded, and the confidence we require in the constraint being satisfied (i.e. the in Equation 12). Further, two functions are required. The first, objectives, takes as its argument a vector of design parameter values and returns a vector of objective values . The second, sim_trial, takes as its arguments a vector of design parameter values and a vector of parameter values defining the conditions under which we wish to simulate. The function should simulate the necessary data generation and analysis and return a boolean indicator of the rejection of the null hypothesis, or, more generally, of declaring ‘success’. Given these components, the example R code in the appendix can be modified to solve the problem at hand.

## 5 Application to the examples

In this section we describe the data generating models used to simulate trial data for each of our examples and the methods used in their analyses. Full details, including all the programs used to generate the results presented, are given in the appendix.

### 5.1 Complex clustering

We model the binary response of the th participant using a latent variable representation. Specifically, we suppose that underlying the binary response there is a continuous latent variable such that

 yi={1 if y∗i≥00 if y∗i<0.

As before, we define our model in terms of the . Using the multilevel model notation of :

 y∗i =β0+β1ti+u(2)therapist(i)ti+v(2)doctor(i)+ei (16) u(2)therapist(i) ∼N(0,σ2T) (17) v(2)doctor(i) ∼N(0,σ2D), (18)

where is a level 1 residual with mean zero and variance . Assuming follows a logistic distribution with leads to a random intercept logistic model.

For the purposes of power calculations we must make some assumptions about the various nuisance parameter values. We set the 2nd level variance components to in order to give a variance partition coefficient of in the control arm, a typical value in this setting. Similarly, the variance partition coefficient for between-therapist variance is then , and for between-doctor variation, . Recall that we want to simulate the power of the trial under the alternative hypothesis . We can translate these probabilities into corresponding values for the coefficients in our model, giving .

The design parameters are the number of participants in each arm , the number of therapists in the APT arm, and the number of doctors delivering specialist medical care across both arms. For simplicity and ease of illustration we will fix . When searching over the remaining design parameters and we will initially consider and , noting that these can be easily revised if the initial evaluations indicate larger values are required to achieve nominal power. We wish to minimise both the total number of patients and the total number of care providers . The only constraint we must satisfy is that the type II error rate under the alternative hypothesis is no more than . This gives the constraint function .

The original PACE sample size calculation did not account for clustering and, using simple analytic formulae for power of test of proportions, arrived at per arm (before inflating for attrition) to achieve 90% power. Simulating the actual power obtained from under our proposed model, with therapists and

doctors, gave an MC estimate of 0.69 power (95% confidence interval 0.66 to 0.72). Fitting two multilevel models for each sample led to a significant computational burden, needing over 5.3 minutes to generate the

samples required for this estimate. Thus, there is a need to search for an appropriate sample size using simulation, but a practical limit on the number of designs which we can evaluate in a timely manner.

We initialised the optimisation algorithm by generating a Sobol sequence of size 20 and computing MC estimates of power for each point using samples. Following this, 30 iterations of the algorithm were applied, with samples used at each iteration. We chose these optimisation parameters to ensure a solution could be found quickly, noting that further iterations can easily be added if solutions of a higher quality are sought. In Figure 4 we plot the 50 evaluated solutions, distinguishing between those in the initial design , those which were subsequently evaluated during the iterative phase of the algorithm, and those which together form the final approximation set. The contours of the mean function of the final GP model are also shown. Figure 4: Objective values of solutions in the initial set XE (open circles), subsequent iterations of the algorithm (crosses), and those in the final approximation of the Pareto set (filled squares). The approximation set obtained using a fixed design is also shown (filled circles). Contours represent the mean function of the final GP model.

For comparison, we also plot the approximation set obtained using a similar procedure as that implemented in MLPowSim, software designed for simulation-based SSD for problems with multilevel data. Specifically, we take a Sobol sequence of size 50 and estimated the type II error at each of these points using MC samples. For each point a 95% confidence interval based on the MC error was calculated, and any points where the interval was not entirely below the nominal value of 0.1 were discarded. Of those that remained, any dominated solutions were discarded. The remaining two solutions are plotted in Figure 4. The proposed method has led to solutions of higher quality which collectively dominate those produced by the simpler method, with lower numbers of participants, providers, or both.

At the th iteration of the algorithm we calculated the dominated volume , plotted in Figure 5. In this instance the algorithm appears to successfully improve the quality of the approximation set as the number of iterations increases, with the rate of improvement decreasing over time. The total running time was 47 minutes. Note that is not strictly increasing. This is because the evaluation of a new solution can lead to revised estimates of other solutions which were in the approximation set, such that they are then considered infeasible and removed from the set. Figure 5: Quality of the approximation set obtained as the algorithm proceeds, where higher dominated hypervolume reflects higher quality.

The solutions which form the final approximation set are detailed in Table 1. As few as 6 therapists and 12 doctors can lead to a sufficiently powered trial, although 742 participants in total are required in this configuration. On the other hand, if minimising the total sample size is deemed more important than minimising the number of therapists and doctors, we see that as few as 546 participants are necessary providing 12 therapists and 24 doctors are included. The approximation set contains 6 solutions in total, providing a reasonable set of options from which the solution best representing the priorities of the decision makers can be selected.

As can be seen from Table 1, approximate upper 95% confidence limits based on the initial MC estimates of power often exceed the corresponding nominal bound of 0.1. For example, the solution described in the first row would have an approximate upper 95% two-sided confidence interval of . To verify the actual type II error we computed a more precise MC estimates using samples, which gave an estimate of 0.093 and a two-sided interval of . Similar results are seen for the remaining solutions in the approximation set, as shown in Table 1. This demonstrates the GP’s ability to share information of MC estimates computed at several points to increase the precision at each of them.

### 5.2 Co-primary endpoints

For our second example we consider a second co-primary binary responder endpoint. We use the same latent variable representation as described in the preceding section to model the fatigue response and disability response of the th participant. Correlation between these two endpoints is modelled by simulating bivariate residuals from a joint logistic distribution with correlation and marginal variances as before. We also allow for correlation between the random effects associated with each therapist and doctor. These are now simulated according to the bivariate normal distributions

We set all correlations equal at , reflecting a situation where both a patient’s responses and the individual therapist and doctor effects and very similar for both the fatigue and disability outcomes.

The algorithm was applied using the same settings as before, with an initial design of 20 points followed by 30 iterations, and each evaluation using MC samples. The run time in this case was 96 minutes, roughly double that of the previous example due to each simulation fitting twice as many models. The resulting approximation set is plotted in Figure 6. Again, the contours represent the mean function of the final GP model. Figure 6: Objective values of solutions in the initial set XE (open circles), subsequent iterations of the algorithm (crosses), and those in the final approximation of the Pareto set (filled squares).

Table 2 provides full details of the obtained approximation set together with their initial MC estimates of type II error rate using MC samples, and further estimates using MC samples. Approximate 95% confidence intervals around the more precise estimates all either include the nominal constraint of 0.1, or lie entirely below it.

### 5.3 Small sample pilot

For our final example, recall that we have two continuous co-primary endpoints. For notational simplicity we use model (16) but now consider the to be the actual observed continuous response, as opposed to a latent variable. We now assume the individual-level residual term is normally distributed but with the same variance as before, thus maintaining the variance partition coefficients at the same levels. The alternative hypothesis remains . Note that this corresponds to a treatment effect standardised with respect to the total standard deviation in the APT arm of .

Our design parameters (together with the ranges considered) are the total sample size in the APT arm, denoted (50 to 100); the number of APT therapists, (2 to 10); the allocation ratio relating the total number of participants in each arm, (0.5 to 1.5); the number of doctors, (3 to 20); and the nominal type I error rate to be used in the hypothesis tests, (0.05 to 0.2). The three objective functions to be minimised are , and . The two constraints to be satisfied are and .

Given the increase in dimensions of the solution space, we use an initial Sobol sequence design of 50 solutions. As before, we use 100 MC samples for each evaluation. After 50 iterations of the algorithm, an approximation set containing 15 solutions was obtained. The algorithm took 2 hours and 36 minutes to run. The objective values of these solutions are illustrated in Figure 7, with full details provided in Table 3. The total number of participants ranged from 140 to 214; of therapists, from 5 to 10; and of doctors, from 5 to 23. Type I error rates ranged from 0.09 to 0.14, all some way below the actual constraint value of 0.2. We calculated precise MC estimates (using samples) of both type I and II error rates for each solution in the approximation set. As shown in Table 3, type II error rates all appear to be around or slightly below the constraint of 0.1. Type I error rates, in contrast, are in some cases significantly below the constraint of 0.2. This suggests there is some potential for improvement in the approximation set by applying further iterations of the algorithm. Figure 7: Objective values of the approximation set obtained following 50 iterations of the algorithm for Example 3.

## 6 Discussion

Although simulation is often required for clinical trial sample size determination, related methodology has typically assumed that there is only one parameter which we are able to adjust (the sample size); that there is only one operating characteristic which must be estimated using simulation (the power of the trial); and that our goal is to minimise only one criterion (the sample size again) [2, 4]. In this paper we have described a flexible approach to simulation-based SSD which can be used for more general multi-parameter problems. The method draws on established global optimisation algorithms which use statistical ‘surrogate‘ models to solve design problems where there are several parameters to be chosen, several objectives to minimise, and several constraints to satisfy. We have illustrated how such problems arise in clinical trials of complex interventions.

The general optimisation framework we have suggested recognises that in many complex trials we are interested in minimising more than one quantity subject to constraints on operating characteristics. Problems of this sort are common in multilevel trial design , but are typically approached by first reducing the multiple objectives down to a single objective. For example, in the design of a cluster randomised trial it is common to fix the number of participants per cluster and minimise the number of clusters , or vice-versa [42, 43]. Alternatively, a function which specifies the cost of sampling at the cluster and the patient level could be specified , and the overall cost minimised . The latter approach has been suggested for both two-level  and three-level hierarchical trial designs [47, 48]. However, the a priori specification of such a cost function may not always be feasible, particularly when several stakeholders are involved . The Pareto optimisation framework we have described leads to a more computationally challenging optimisation problem, but produces a set of good solutions enabling the available trade-offs between objectives to be seen and selected from.

As noted in Section 1, related work in simulation-based design methodology has often focussed on a specific area of application. One advantage that brings is the relative ease with which the software can be used to solve a new problem within the same area. In contrast, our approach requires that the user provides a program which simulates the data generation and analysis of their proposed trial design. Although some have argued that this requirement may be prohibitive in practice , it allows the user to solve their specific problem rather than some related version of it. Moreover, prior to addressing the sample size issue, modelling and simulation can help inform many other aspects of trial design, such as the patient population or the choice of endpoint . One way to assist users in writing their own simulations is to share example programs for a range of problems, providing a starting point for the development of a new program. We have provided some examples in the appendix.

When submitting a proposed design for approval by a funding body it is important that the sample size calculation is transparent and replicable. This may be achieved in the context of simulation-based SSD by supplying the simulation program as part of the application . Given this, any reviewer should be able to re-calculate the operating characteristics of the proposed design. However, a greater challenge for the reviewer is understanding the program and ensuring it is an accurate representation of the model in question. This requirement has partly motivated our use of R. Although significantly slower than a compiled language such as C++, it has been argued that software written in R is more transparent . Validation will be further facilitated if a simulation protocol of the sort described in  is provided alongside the code. Future work could develop an interface for alternative statistical software such as Stata or SAS, allowing a simulation program to be written in them and connect with the R implementation of the optimisation algorithm.

We have followed the conventional approach to clinical trial design whereby constraints on operating characteristics are set and then a constrained optimisation problem is solved. In practice the constraints are not fixed in advance, but adjusted iteratively in response to the design requirements they produce. For example, an initial nominal power of 90% may require an infeasibly large sample size, leading to a revision down to 80%. Such an iterative procedure will increase an already substantial computational burden for simulation-based design. However, note that a change to a constraint does not mean starting the process again, since any previous MC estimates can still be used when fitting the GP model(s). The sequential nature of the optimisation algorithm suggests that an interactive routine could be developed, where the user pauses the algorithm in response to the sample size requirements which are being observed, adjusts the constraints, and then continues with the optimisation.

The examples have demonstrated that the time required to solve a sample size determination problem can be significant, of the order of hours. Given that the majority of computational effort is expended generating MC samples when evaluating solutions, it is important that these simulation programs are as efficient as possible. We recommend making use of code profilers such as R’s ‘Rprof’ to identify the parts of the program that are consuming the most resources. Further efficiencies could potentially be gained by using more sophisticated methods for surrogate modelling and efficient optimisation. For example, when the modelled function can be assumed monotonic, this information can be incorporated into the surrogate modelling process .

Numerous extensions to the proposed approach can be considered. One argument for simulation-based design is the ease with which sensitivity to model assumptions, such as the value of nuisance parameters, can be assessed . Future work could consider how a systematic assessment of sensitivity to nuisance parameters could be conducted, given a proposed trial design. Such investigations fall under the heading of uncertainty quantification and can be carried out using GP regression and associated techniques . A further extension could consider Bayesian approaches to trial design, including hybrid Bayesian-frequentist assurances , fully Bayesian measures such as average coverage criterion , and decision-theoretic methods . Aside from very simple cases involving only conjugate analyses, evaluating these Bayesian criteria will generally require simulation  and so optimal design may benefit from the efficient methods discussed here. Complex SSD problems are also common in the area of adaptive designs, which can aim to minimise the expected sample size under several different hypotheses and over a number of stopping rule parameters . Extending the proposed methods to such problems would require using surrogate models to approximate the objective functions, as opposed to only the constraints.

In conclusion, efficient optimisation algorithms based on surrogate models of expensive operating characteristic functions can be used to solve complex clinical trial sample size determination problems. By using these methods we can avoid making unrealistic simplifying assumptions at the trial design stage, both in terms of the statistical model underlying the trial and of the nature of the design problem.

### Funding

This work was supported by the Medical Research Council [grant number MR/N015444/1].

### Data availability statement

All simulated data used in this manuscript, together with the code used to generate it, is available at https://github.com/DTWilson/Bayes_opt_SSD.

## References

•  Benjamin F Arnold, Daniel R Hogan, John M Colford, and Alan E Hubbard. Simulation methods to estimate design power: an overview for applied research. BMC Med Res Methodol, 11(1), 6 2011.
•  Sabine Landau and Daniel Stahl. Sample size and power calculations for medical studies by simulation when closed form expressions are not available. Statistical Methods in Medical Research, 22(3):324–345, 2013.
•  Ziding Feng and James E. Grizzle. Correlated binomial variates: Properties of estimator of intraclass correlation and its effect on sample size calculation. Statistics in Medicine, 11(12):1607–1614, 1992.
•  Richard Hooper. Versatile sample-size calculation using simulation. The STATA Journal, 13(1):21–38, 2013.
•  David A. Schoenfeld and Michael Borenstein. Calculating the power or sample size for the logistic and proportional hazards models. Journal of Statistical Computation and Simulation, 75(10):771–785, oct 2005.
•  Andrew P. Grieve and Shah-Jalal Sarker. Simulation-based sample-sizing and power calculations in logistic regression with partial prior information. Pharmaceutical Statistics, 15(6):507–516, sep 2016.
•  Alexander J. Sutton, Nicola J. Cooper, David R. Jones, Paul C. Lambert, John R. Thompson, and Keith R. Abrams. Evidence-based sample size calculations based upon updated meta-analysis. Statistics in Medicine, 26(12):2479–2500, 2007.
•  Evangelos Kontopantelis, David A Springate, Rosa Parisi, and David Reeves. Simulation-based power calculations for mixed effects modeling: ipdpower in stata. Journal of Statistical Software, 74(12), 2016.
•  Valerii Fedorov and Byron Jones. The design of multicentre trials. Statistical Methods in Medical Research, 14(3):205–248, 2005. PMID: 15969302.
•  Gianluca Baio, Andrew Copas, Gareth Ambler, James Hargreaves, Emma Beard, and Rumana Z Omar. Sample size calculation for a stepped wedge trial. Trials, 16(1), aug 2015.
•  Richard Hooper, Steven Teerenstra, Esther de Hoop, and Sandra Eldridge. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35(26):4718–4728, jun 2016.
•  Nicholas G. Reich, Jessica A. Myers, Daniel Obeng, Aaron M. Milstone, and Trish M. Perl. Empirical power and sample size calculations for cluster-randomized and cluster-randomized crossover studies. PLOS ONE, 7(4):1–7, 04 2012.
•  D. T. Wilson, R. E. Walwyn, J. Brown, A. J. Farrin, and S. R. Brown. Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies. Statistical Methods in Medical Research, 25(3):997–1009, jun 2015.
•  K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan.

A fast and elitist multiobjective genetic algorithm: NSGA-II.

IEEE Transactions on Evolutionary Computation

, 6(2):182–197, apr 2002.
•  Olaf Mersmann. mco: Multiple Criteria Optimization Algorithms and Related Functions, 2014. R package version 1.0-15.1.
•  Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
•  Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag New York, Inc., 2003.
•  D.G. Krige. A statistical approach to some basic mine valuation problems on the witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, 52(6):119–139, 1951.
•  Carl Edward Rasmussen and Christopher K. I. Williams.

Gaussian Processes for Machine Learning

.
MIT Press, 2006.
•  Olivier Roustant, David Ginsbourger, and Yves Deville. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.
•  Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
•  Michael J. Sasena, Panos Papalambros, and Pierre Goovaerts. Exploration of metamodeling sampling criteria for constrained global optimization. Engineering Optimization, 34(3):263–278, jan 2002.
•  Michael TM Emmerich, André H Deutz, and Jan Willem Klinkenberg. Hypervolume-based expected improvement: Monotonicity properties and exact computation. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pages 2147–2154. IEEE, 2011.
•  Victor Picheny and David Ginsbourger. Noisy kriging-based optimization methods: A unified implementation within the DiceOptim package. Computational Statistics & Data Analysis, 71:1035–1053, mar 2014.
•  William J Browne, Mousa Golalizadeh Lahi, and Richard MA Parker. A Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package, 2009.
•  Peter White, Michael Sharpe, Trudie Chalder, Julia DeCesare, Rebecca Walwyn, and the PACE trial group. Protocol for the pace trial: A randomised controlled trial of adaptive pacing, cognitive behaviour therapy, and graded exercise as supplements to standardised specialist medical care versus standardised specialist medical care alone for patients with the chronic fatigue syndrome/myalgic encephalomyelitis or encephalopathy. BMC Neurology, 7(1):6, 2007.
•  PD White, KA Goldsmith, AL Johnson, L Potts, R Walwyn, JC DeCesare, HL Baber, M Burgess, LV Clark, DL Cox, J Bavinton, BJ Angus, G Murphy, M Murphy, H O’Dowd, D Wilks, P McCrone, T Chalder, and M Sharpe. Comparison of adaptive pacing therapy, cognitive behaviour therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome (pace): a randomised trial. The Lancet, 377(9768):823–836, 2011.
•  T Chalder, G Berelowitz, S Hirsch, T Pawlikowska, P Wallace, and S Wessely. Development of a fatigue scale. Journal of Psychometric Research, 37(2):147–153, 1993.
•  Rebecca E. A. Walwyn and Chris Roberts. Therapist variation within randomised trials of psychotherapy: implications for precision, internal and external validity. Statistical Methods in Medical Research, 19(3):291–315, 2010.
•  Chris Roberts, Evridiki Batistatou, and Stephen A. Roberts. Design and analysis of trials with a partially nested design and a binary outcome measure. Statistics in Medicine, 35(10):1616–1636, 2015.
•  Colleen A. McHorney, John E. Ware, and Anastasia E. Raczek. The mos 36-item short-form health survey (sf-36): Ii. psychometric and clinical tests of validity in measuring physical and mental health constructs. Medical Care, 31(3):247–263, 1993.
•  Stephen Senn and Frank Bretz. Power and sample size when multiple endpoints are considered. Pharmaceut. Statist., 6(3):161–170, 2007.
•  Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
•  Claus Bendtsen. , 2012. R package version 1.0.3.
•  V. Picheny, D. Ginsbourger, and Y. Richet. Noisy expected improvement and on-line computation time allocation for the optimization of simulators with tunable fidelity. In 2nd International Conference on Engineering Optimization, 2010.
•  Christophe Dutang and Petr Savicky. randtoolbox: Generating and Testing Random Numbers, 2015. R package version 1.17.
•  Clément Chevalier, Victor Picheny, and David Ginsbourger. KrigInv: An efficient and user-friendly implementation of batch-sequential inversion strategies based on kriging. Computational Statistics & Data Analysis, 71:1021–1034, mar 2014.
•  Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
•  Harvey Goldstein. Multilevel Statistical Models. Arnold, 3rd edition, 2003.
•  K Hemming, S Eldridge, G Forbes, C Weijer, and M Taljaard. How to design efficient cluster randomised trials. BMJ, 358, 2017.
•  Allan Donner and Neil Klar. Design and Analysis of Cluster Randomization Trials in Health Research. London Arnold Publishers, 2000.
•  Karla Hemming, Alan Girling, Alice Sitch, Jennifer Marsh, and Richard Lilford. Sample size calculations for cluster randomised controlled trials with a fixed number of clusters. BMC Medical Research Methodology, 11(1):102, 2011.
•  Sandra M Eldridge, Ceire E Costelloe, Brennan C Kahan, Gillian A Lancaster, and Sally M Kerry. How big should the pilot study for my cluster randomised trial be? Statistical Methods in Medical Research, 2015.
•  Joop Hox. Multilevel Analysis: Techniques and Applications. Lawrence Erlbaum Associates, Inc., 2002.
•  Tom A. B. Snijders and Roel J. Bosker. Standard errors and sample sizes for two-level research. Journal of Educational and Behavioral Statistics, 18(3):237–259, 1993.
•  Stephen W. Raudenbush and Xiaofeng Liu. Statistical power and optimal design for multisite randomized trials. Psychological Methods, 5(2):199–213, 2000.
•  Gerard J.P. van Breukelen and Math J.J.M. Candel. Calculating sample sizes for cluster randomized trials: We can keep it simple and efficient! Journal of Clinical Epidemiology, 65(11):1212 – 1218, 2012.
•  S. Teerenstra, M. Moerbeek, T. van Achterberg, B. J. Pelzer, and G. F. Borm. Sample size calculations for 3-level cluster randomized trials. Clinical Trials, 5(5):486–495, sep 2008.
•  Lawrence Joseph and David B. Wolfson. Interval-based versus decision theoretic criteria for the choice of sample size. Journal of the Royal Statistical Society: Series D (The Statistician), 46(2):145–149, 1997.
•  Mike K. Smith and Andrea Marshall. Importance of protocols for simulation studies in clinical drug development. Statistical Methods in Medical Research, 2010.
•  Andrea Burton, Douglas G. Altman, Patrick Royston, and Roger L. Holder. The design of simulation studies in medical statistics. Statistics in Medicine, 25(24):4279–4292, 2006.
•  Marc C. Kennedy and Anthony O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425–464, 2001.
•  Anthony O’Hagan, John W. Stevens, and Michael J. Campbell. Assurance in clinical trial design. Pharmaceutical Statistics, 4(3):187–201, 2005.
•  Jing Cao, J. Jack Lee, and Susan Alber. Comparison of bayesian sample size criteria: ACC, ALC, and WOC. Journal of Statistical Planning and Inference, 139(12):4111 – 4122, 2009.
•  Jeremy E. Oakley, Alan Brennan, Paul Tappenden, and Jim Chilcott. Simulation sample sizes for monte carlo partial evpi calculations. Journal of Health Economics, 29(3):468 – 477, 2010.
•  James M. S. Wason and Thomas Jaki. Optimal design of multi-arm multi-stage trials. Statistics in Medicine, 31(30):4269–4279, jul 2012.