Group Sequential Clinical Trial Designs for Normally Distributed Outcome Variables

10/09/2017 ∙ by Michael Grayling, et al. ∙ University of Cambridge

In a group sequential clinical trial, accumulated data are analysed at numerous time-points in order to allow early decisions about a hypothesis of interest. These designs have historically been recommended for their ethical, administrative and economic benefits. In this work, we discuss a collection of new Stata commands for computing the stopping boundaries and required group size of various classical group sequential designs, assuming a normally distributed outcome variable. Following this, we demonstrate how the performance of several designs can be compared graphically.

0.1 Introduction

Parallel group randomised controlled trials are typically conducted by recruiting a fixed number of individuals and allocating each to receive one of two treatments, ultimately testing a pre-specified hypothesis. Since Wald published his work on the sequential probability ratio test (wald1947), there has been substantial interest in trial designs that allow hypotheses to be tested multiple times during the trial. With this approach, the trial may be stopped early if the data so suggest. This limits patient exposure to inferior treatments and, by helping to lower the expected required sample size, will often reduce the cost of a trial.

Armitage1975 was responsible for much of the early use of such methods in medicine. However, his and the other initial approaches were fully sequential, with data analysed after every patient. Whilst this may seem desirable, it is impractical and thus this methodology did not gain general acceptance. The pivotal moment in this field came with the work of pocock1977, who provided a clear way of determining group sequential designs with desired type-I and type-II error rates. In a group sequential design, a hypothesis is tested multiple times during an ongoing trial but, as the name suggests, only after groups of certain sizes have been assessed. This allows the majority of the benefits of a fully sequential approach to be retained, whilst also making the design feasible in practice.

Since this paper, group sequential designs have been researched extensively, and utilised regularly in clinical trials. Today, methodology is well established for designing group sequential trials with normal, binary, and survival endpoints. Approaches are available to design trials with unknown variance, with multiple arms, or to optimise a design's features. For a detailed discussion of available methods see whitehead1997 or jennison2000.

In this paper, we focus on the design of two-treatment group sequential trials with a normally distributed outcome variable, but note that asymptotically other endpoint types can be treated with the same normal test statistics. We proceed by summarising the statistical theory behind group sequential methodology. Following this we detail our new commands, and provide several examples of their use.

0.2 Statistical Theory

We consider a randomised two-arm group sequential trial design with up to $L$ planned analyses. We index one arm by 0, and the other by 1. Often, it will be the case that arm 0 is a control and arm 1 an experimental treatment, but this may not always be true. We assume that the $l$th analysis takes place after $ln$ and $lrn$ patients have been randomised to arms 0 and 1, respectively. Thus $r$ is the ratio of patients allocated to arm 1 relative to arm 0, and we refer to $n$ as the group size. Possible extensions to this framework are discussed in Section 0.8. The outcome from patient $i$ in arm $d$ in stage $l$, $X_{dli}$, is assumed to be distributed as follows

$$X_{dli} \sim N(\mu_d, \sigma_d^2).$$

Thus, we are assuming that the variance in response of both treatments is known.

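Our ultimate goal is to make inference about the difference in the average treatment effect of arms 0 and 1. To this end, we define $\tau = \mu_1 - \mu_0$, and at each analysis $l = 1,\dots,L$ compute the following test statistic

$$Z_l = \hat{\tau}_l I_l^{1/2}, \qquad \hat{\tau}_l = \frac{1}{lrn}\sum_{k=1}^{l}\sum_{i=1}^{rn} X_{1ki} - \frac{1}{ln}\sum_{k=1}^{l}\sum_{i=1}^{n} X_{0ki},$$

with

$$I_l = \left(\frac{\sigma_0^2}{ln} + \frac{\sigma_1^2}{lrn}\right)^{-1} \qquad (1)$$

the information for this analysis. It can be shown that $Z_1,\dots,Z_L$ have, for the parameter of interest $\tau$ with information levels $I_1,\dots,I_L$, what has been referred to as the canonical joint distribution (jennison2000). That is

  • $(Z_1,\dots,Z_L)$ is multivariate normal;

  • $E(Z_l) = \tau I_l^{1/2}$, $l = 1,\dots,L$;

  • $\mathrm{Cov}(Z_{l_1}, Z_{l_2}) = (I_{l_1}/I_{l_2})^{1/2}$, $1 \le l_1 \le l_2 \le L$.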
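To make the canonical joint distribution concrete, the following Python sketch computes the information levels of Equation (1), the means of the test statistics, and their covariance matrix for an illustrative design. It is only a sketch: the function names and the input values (group size 100, three analyses) are assumptions chosen for illustration and are not part of the Stata commands described in this paper.

```python
from math import sqrt

def information_levels(n, ratio, sigma0, sigma1, L):
    """Information I_l at analyses l = 1..L, assuming l*n and l*ratio*n patients
    in arms 0 and 1 and known standard deviations sigma0 and sigma1."""
    return [1.0 / (sigma0**2 / (l * n) + sigma1**2 / (l * ratio * n))
            for l in range(1, L + 1)]

def z_means(tau, info):
    """E(Z_l) = tau * sqrt(I_l) for each analysis."""
    return [tau * sqrt(i) for i in info]

def z_covariance(info):
    """Covariance matrix of (Z_1, ..., Z_L): entry (i, j) is sqrt(I_min / I_max)."""
    size = len(info)
    return [[sqrt(min(info[i], info[j]) / max(info[i], info[j]))
             for j in range(size)] for i in range(size)]

# Illustrative (hypothetical) inputs: n = 100 per group in arm 0, equal allocation,
# sigma_0 = sigma_1 = 2, and L = 3 analyses.
info = information_levels(n=100, ratio=1, sigma0=2, sigma1=2, L=3)
means = z_means(0.2, info)
cov = z_covariance(info)

print("Information levels:", [round(i, 3) for i in info])
print("E(Z_l) at tau = 0.2:", [round(m, 3) for m in means])
for row in cov:
    print([round(c, 3) for c in row])
```

Because the groups are of equal size, the information grows linearly in the analysis number, so the covariance between any two test statistics depends only on the ratio of their analysis numbers.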

Using this, the operating characteristics of a group sequential design with any choice of stopping boundaries can be determined using multivariate normal integration as described in jennison2000 and wason2015. This allows the use of numerical optimisation routines to determine suitable sample sizes and stopping boundaries. The particular type of boundaries to utilise depends on the chosen hypothesis testing framework. Therefore, in the following sections we discuss several established methods for two-sided, and then one-sided, tests.

0.3 Two-Sided Tests

0.3.1 Stopping Rules and Operating Characteristics

In a two-sided test, we assess whether there is significant evidence of a difference in the mean responses of the two treatment arms. That is, we test

$$H_0: \tau = 0 \quad \text{against} \quad H_1: \tau \neq 0.$$

Here, a group sequential trial design is characterised by stopping boundaries $a_1,\dots,a_L$ and $r_1,\dots,r_L$, with $a_l < r_l$ for $l = 1,\dots,L-1$, and $a_L = r_L$, and the following stopping rules at analyses $l = 1,\dots,L$

  • If $|Z_l| \ge r_l$ stop and reject $H_0$;

  • If $|Z_l| < a_l$ stop and do not reject $H_0$;

  • otherwise continue to stage $l + 1$.

The choice $a_L = r_L$ ensures termination after analysis $L$, whilst also guaranteeing a conclusion is made about $H_0$.

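Then, the probability of rejecting $H_0$ for any $\tau$, given the stopping boundaries and the group size $n$, is

$$P(\text{Reject } H_0 \mid \tau) = \sum_{l=1}^{L} P(a_1 \le |Z_1| < r_1, \dots, a_{l-1} \le |Z_{l-1}| < r_{l-1}, |Z_l| \ge r_l \mid \tau).$$

Similarly, the probability of not rejecting $H_0$ for any $\tau$ is

$$P(\text{Do not reject } H_0 \mid \tau) = \sum_{l=1}^{L} P(a_1 \le |Z_1| < r_1, \dots, a_{l-1} \le |Z_{l-1}| < r_{l-1}, |Z_l| < a_l \mid \tau).$$

Using the above, the expected sample size for any $\tau$ can be calculated as

$$E(N \mid \tau) = \sum_{l=1}^{L} l(1 + r)n \, P(a_1 \le |Z_1| < r_1, \dots, a_{l-1} \le |Z_{l-1}| < r_{l-1}, |Z_l| \notin [a_l, r_l) \mid \tau).$$

As discussed earlier, each of these probabilities can be computed using multivariate normal integration. Explicitly, defining $\Lambda_l$ to be the covariance matrix of $(Z_1,\dots,Z_l)$ and $I_{(l)} = (I_1,\dots,I_l)^\top$, then, for example

$$P(a_1 \le Z_1 < r_1, Z_2 \ge r_2 \mid \tau) = \int_{a_1}^{r_1} \int_{r_2}^{\infty} \phi\{x, \tau I_{(2)}^{1/2}, \Lambda_2\} \, dx_2 \, dx_1.$$

Here, $x = (x_1, x_2)^\top$, the square root of the vector $I_{(2)}$ is taken in an element-wise manner, and $\phi\{x, \mu, \Lambda\}$ is the probability density function of a multivariate normal distribution with mean $\mu$ and covariance matrix $\Lambda$, evaluated at vector $x$. In all of the commands presented here, these integrals are evaluated using the Mata function pmvnormal_mata() (grayling2016).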
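The calculation can be illustrated for the simplest case of two analyses. The Python sketch below is not the routine used by the commands (which rely on pmvnormal_mata()); it assumes the two-sided stopping rules above, with rejection when $|Z_l| \ge r_l$ and acceptance when $|Z_l| < a_l$, and replaces the bivariate integral with a one-dimensional midpoint rule over the first-stage continuation region, using the conditional normal distribution of $Z_2$ given $Z_1$. The function name is hypothetical, and the design supplied is the rounded double triangular design reported in the output of Section 0.6.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def two_stage_two_sided(tau, n, ratio, sigma0, sigma1, a, r, steps=4000):
    """Return (P(reject H0 | tau), E(N | tau)) for a two-stage, two-sided design
    with acceptance boundaries a = (a1, a2) and rejection boundaries r = (r1, r2)."""
    info = [1.0 / (sigma0**2 / (l * n) + sigma1**2 / (l * ratio * n)) for l in (1, 2)]
    m1, m2 = tau * sqrt(info[0]), tau * sqrt(info[1])   # E(Z_1), E(Z_2)
    rho = sqrt(info[0] / info[1])                        # Cov(Z_1, Z_2)
    cond_sd = sqrt(1.0 - rho**2)                         # sd of Z_2 given Z_1

    def outside(c):
        """P(|Z_1| >= c)."""
        return STD.cdf(-c - m1) + 1.0 - STD.cdf(c - m1)

    reject_1 = outside(r[0])
    continue_1 = outside(a[0]) - reject_1                # P(a1 <= |Z_1| < r1)

    def reject_2_given(z1):
        """P(|Z_2| >= r2 | Z_1 = z1), from the conditional normal distribution."""
        mean = m2 + rho * (z1 - m1)
        return STD.cdf((-r[1] - mean) / cond_sd) + 1.0 - STD.cdf((r[1] - mean) / cond_sd)

    # Midpoint rule over both halves of the continuation region a1 <= |z1| < r1.
    h = (r[0] - a[0]) / steps
    reject_2 = 0.0
    for k in range(steps):
        z = a[0] + (k + 0.5) * h
        for z1 in (z, -z):
            reject_2 += STD.pdf(z1 - m1) * reject_2_given(z1) * h

    # Stage 1 is always observed; stage 2 only if the trial continues.
    expected_n = (1 + ratio) * n * (1.0 + continue_1)
    return reject_1 + reject_2, expected_n

# Rounded design taken from the doubleTriangular output in Section 0.6.
design = dict(n=875.5, ratio=1, sigma0=2, sigma1=2, a=(0.73, 2.07), r=(2.2, 2.07))
type1, ess_null = two_stage_two_sided(0.0, **design)
power, ess_alt = two_stage_two_sided(0.2, **design)
print(round(type1, 4), round(power, 4), round(ess_null, 1), round(ess_alt, 1))
```

Since the boundaries are entered to two decimal places, the printed values should only be expected to agree approximately with the operating characteristics reported by the command.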

With the above specifications, all that remains is a method for determining stopping boundaries, and an associated required sample size, such that $P(\text{Reject } H_0 \mid \tau = 0) = \alpha$ and $P(\text{Reject } H_0 \mid \tau = \delta) = 1 - \beta$, for clinically relevant difference $\delta$, and desired type-I and type-II error rates $\alpha$ and $\beta$. It is this problem that much of the group sequential clinical trial design literature has focused upon. In the following sections we discuss several options available via our commands.

0.3.2 Early Stopping to Reject

Much of the early work on group sequential trial design focused on two-sided tests with early stopping only to reject $H_0$. That is, with $a_l = 0$ for $l = 1,\dots,L-1$. In particular, haybittle1971 and peto1976 suggested a simple set of boundaries with $r_l = 3$ for $l = 1,\dots,L-1$. The final critical boundary $r_L$ is then determined to ensure an overall type-I error rate of $\alpha$. Following the determination of $r_L$, a one-dimensional numerical search is utilised to ascertain the exact required group size $n$ for power of $1 - \beta$ when $\tau = \delta$, treating $n$ as a continuous quantity.

Haybittle and Peto's procedure is advantageous in its simplicity, whilst its wide stopping boundaries mean that early stopping is unlikely: a desirable property in some instances to help increase data accumulation, with termination only in the case of extreme disparities in treatment performance. However, trialists will often desire stopping boundaries that help to substantially reduce the expected sample size when $H_0$ is not true. For this, wang1987 suggested the following family of stopping boundaries, indexed by a parameter $\Omega$

$$r_l = C (l/L)^{\Omega - 1/2}, \qquad l = 1,\dots,L.$$

Their procedure encompasses the popular pocock1977 and obrien1979 boundaries, by taking $\Omega = 0.5$ or $\Omega = 0$ respectively. In this approach, a numerical search is utilised for any chosen $\Omega$ to determine the value of $C$ that implies the correct type-I error rate $\alpha$. Following this, as with Haybittle and Peto's design, a further search is then used to ascertain the required sample size for the power constraint. In general, it has been shown that as $\Omega$ increases, the maximum sample size increases, but the expected sample size for larger values of $|\tau|$ decreases.
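The search for the boundary constant can be sketched for the two-stage case as follows. This Python illustration assumes the boundary form $r_l = C(l/L)^{\Omega - 1/2}$ with equally sized groups and no early stopping to accept, and it uses simple bisection rather than the Brent's algorithm implementation used by the commands; all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def wang_tsiatis_boundaries(C, omega, L):
    """Rejection boundaries r_l = C * (l / L) ** (omega - 1/2), l = 1..L."""
    return [C * (l / L) ** (omega - 0.5) for l in range(1, L + 1)]

def type_one_error_two_stage(C, omega, steps=2000):
    """Two-sided type-I error rate of a two-stage design with equal group sizes."""
    r1, r2 = wang_tsiatis_boundaries(C, omega, 2)
    rho = sqrt(0.5)                 # Cov(Z_1, Z_2) = sqrt(I_1 / I_2)
    cond_sd = sqrt(1.0 - rho**2)    # sd of Z_2 given Z_1
    reject_1 = 2.0 * (1.0 - STD.cdf(r1))
    h = r1 / steps
    reject_2 = 0.0
    for k in range(steps):
        z1 = (k + 0.5) * h          # midpoint rule on 0 <= z1 < r1
        p = STD.cdf((-r2 - rho * z1) / cond_sd) + 1.0 - STD.cdf((r2 - rho * z1) / cond_sd)
        reject_2 += 2.0 * STD.pdf(z1) * p * h   # factor 2: symmetry under tau = 0
    return reject_1 + reject_2

def find_constant(omega, alpha=0.05, lo=1.0, hi=4.0, tol=1e-6):
    """Bisection for the C giving type-I error alpha (the error decreases in C)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if type_one_error_two_stage(mid, omega) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_pocock = find_constant(omega=0.5)   # constant boundaries
c_obf = find_constant(omega=0.0)      # boundaries that narrow over time
print("Pocock:", [round(b, 3) for b in wang_tsiatis_boundaries(c_pocock, 0.5, 2)])
print("O'Brien-Fleming:", [round(b, 3) for b in wang_tsiatis_boundaries(c_obf, 0.0, 2)])
```

With $\Omega = 0.5$ the two boundaries coincide, whereas with $\Omega = 0$ the first boundary is larger than the second by a factor of $\sqrt{2}$, which makes early rejection harder.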

Later, we present commands haybittlePeto and wangTsiatis for determining the stopping boundaries and required sample size of these designs for any choice of $L$, $\alpha$, $\beta$, $\delta$, $(\sigma_0, \sigma_1)$, and $r$.

0.3.3 Early Stopping to Reject and Not Reject

The above designs deal well with the issue of ethics in two-sided clinical trials; namely the desire to stop early when the difference between treatments is substantial. However, there are also often sound reasons to desire early stopping when it is clear there is no detectable treatment difference; usually based around reducing the cost of a trial. These are trial designs with not all $a_l = 0$, $l = 1,\dots,L-1$. pampallona1994 described a one-parameter family of such trial designs, again indexed by a shape parameter $\Omega$, that has been referred to as the power family of inner wedge designs. Explicitly

$$r_l = C_r (l/L)^{\Omega - 1/2}, \qquad a_l = \delta I_l^{1/2} - C_a (l/L)^{\Omega - 1/2}.$$

The final information level is then

$$I_L = \{(C_r + C_a)/\delta\}^2$$

to ensure $a_L = r_L$ as desired. A two-dimensional numerical search is utilised to determine the values of $C_r$ and $C_a$ that provide the desired type-I and type-II error rates given choices for $L$, $\alpha$, $\beta$, and $\Omega$. With these values identified, the final required information level is used to determine the exact required group size $n$ through Equation (1). As in the procedure of wang1987 above, the inclusion of the parameter $\Omega$ allows a large range of designs to be determined, with varying performance in terms of their expected sample sizes. In Section 0.6 we will see how these performances can be examined graphically.

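Alternatively, whitehead1983 and whitehead1997 proposed an approach for the determination of a group sequential clinical trial design for a two-sided test with early stopping to not reject $H_0$, termed the double triangular test. Specifically, they demonstrated that a design with

$$r_l = \left\{\frac{2}{\tilde{\delta}} \log\left(\frac{1}{\alpha}\right) - 0.583 \left(\frac{I_L}{L}\right)^{1/2} + \frac{\tilde{\delta}}{4} I_l\right\} I_l^{-1/2}, \qquad a_l = \left\{-\frac{2}{\tilde{\delta}} \log\left(\frac{1}{\alpha}\right) + 0.583 \left(\frac{I_L}{L}\right)^{1/2} + \frac{3\tilde{\delta}}{4} I_l\right\} I_l^{-1/2},$$

where

$$\tilde{\delta} = \frac{2\Phi^{-1}(1 - \alpha/2)}{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)} \delta,$$

with $\Phi^{-1}$ the standard normal quantile function, and

$$I_L = \left[\left\{\frac{4(0.583)^2}{L} + 8 \log\left(\frac{1}{\alpha}\right)\right\}^{1/2} - \frac{2(0.583)}{L^{1/2}}\right]^2 \tilde{\delta}^{-2},$$

would approximately attain a type-I error rate of $\alpha$ when $\tau = 0$, and a type-II error rate of $\beta$ when $\tau = \delta$.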

Later, we discuss our commands innerWedge and doubleTriangular for determining these designs.
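Because the double triangular design is available in closed form, it is straightforward to sketch outside Stata. The Python illustration below assumes the standard form of Whitehead's triangular boundaries with the 0.583 overshoot correction, applied with $\alpha/2$ in each direction, together with the relationship between information and group size implied by Equation (1). The function name is hypothetical, and the code is a sketch rather than the implementation used by doubleTriangular; with the inputs of Section 0.6 it gives values in line with the group size and boundaries reported there.

```python
from math import log, sqrt
from statistics import NormalDist

def double_triangular(L, alpha, beta, delta, sigma0, sigma1, ratio):
    """Group size n, acceptance boundaries a and rejection boundaries r of a
    double triangular design with L equally spaced analyses."""
    q = NormalDist().inv_cdf                      # standard normal quantile function
    # Adjusted treatment effect, using alpha / 2 for the two-sided test.
    delta_t = 2.0 * q(1.0 - alpha / 2.0) / (q(1.0 - alpha / 2.0) + q(1.0 - beta)) * delta
    log_term = log(1.0 / alpha)                   # log{1 / (2 * (alpha / 2))}
    # Final information level, chosen so that a_L = r_L.
    info_max = ((sqrt(4.0 * 0.583**2 / L + 8.0 * log_term)
                 - 2.0 * 0.583 / sqrt(L)) / delta_t) ** 2
    # Invert Equation (1) at l = L to obtain the group size in arm 0.
    n = info_max * (sigma0**2 + sigma1**2 / ratio) / L
    shift = 2.0 / delta_t * log_term - 0.583 * sqrt(info_max / L)
    acc, rej = [], []
    for l in range(1, L + 1):
        info_l = l * info_max / L                 # equally spaced information
        rej.append((shift + delta_t / 4.0 * info_l) / sqrt(info_l))
        acc.append((-shift + 3.0 * delta_t / 4.0 * info_l) / sqrt(info_l))
    return n, acc, rej

n, acc, rej = double_triangular(L=2, alpha=0.05, beta=0.2, delta=0.2,
                                sigma0=2, sigma1=2, ratio=1)
max_n = 2 * (1 + 1) * n                           # L * (1 + r) * n
print("Group size:", round(n, 1))
print("Rejection boundaries:", [round(b, 2) for b in rej])
print("Acceptance boundaries:", [round(b, 2) for b in acc])
print("Maximum sample size:", round(max_n, 1))
```

No numerical search is required here, which is one reason the triangular designs are quick to compute.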

0.4 One-Sided Tests

0.4.1 Stopping Rules and Operating Characteristics

In a one-sided test, we assess whether, without loss of generality, the mean response on treatment 1 is significantly larger than that on treatment 0. That is, we test

$$H_0: \tau \le 0 \quad \text{against} \quad H_1: \tau > 0.$$

A group sequential trial design of this type is characterised by stopping boundaries $a_1,\dots,a_L$ and $r_1,\dots,r_L$, with $a_l < r_l$ for $l = 1,\dots,L-1$ and $a_L = r_L$, and the following stopping rules at analyses $l = 1,\dots,L$

  • If $Z_l \ge r_l$ stop and reject $H_0$;

  • If $Z_l < a_l$ stop and do not reject $H_0$;

  • otherwise continue to stage $l + 1$.

Again, the choice $a_L = r_L$ is to ensure termination after analysis $L$, and to guarantee a conclusion is drawn about $H_0$.

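Now, the probability of rejecting $H_0$ for any $\tau$, given the stopping boundaries and the group size $n$, becomes

$$P(\text{Reject } H_0 \mid \tau) = \sum_{l=1}^{L} P(a_1 \le Z_1 < r_1, \dots, a_{l-1} \le Z_{l-1} < r_{l-1}, Z_l \ge r_l \mid \tau).$$

Similarly, the probability of not rejecting $H_0$ for any $\tau$ is

$$P(\text{Do not reject } H_0 \mid \tau) = \sum_{l=1}^{L} P(a_1 \le Z_1 < r_1, \dots, a_{l-1} \le Z_{l-1} < r_{l-1}, Z_l < a_l \mid \tau).$$

As before, the expected sample size for any $\tau$ is given by

$$E(N \mid \tau) = \sum_{l=1}^{L} l(1 + r)n \, P(a_1 \le Z_1 < r_1, \dots, a_{l-1} \le Z_{l-1} < r_{l-1}, Z_l \notin [a_l, r_l) \mid \tau).$$

Moreover, these probabilities can again be computed using multivariate normal integration. Using our notation from earlier, we have for example

$$P(a_1 \le Z_1 < r_1, Z_2 \ge r_2 \mid \tau) = \int_{a_1}^{r_1} \int_{r_2}^{\infty} \phi\{x, \tau I_{(2)}^{1/2}, \Lambda_2\} \, dx_2 \, dx_1,$$

where $I_{(2)} = (I_1, I_2)^\top$ and $\Lambda_2$ is the covariance matrix of $(Z_1, Z_2)$.

In some situations, a one-sided test will be more appropriate because departures from $H_0$ in one direction are implausible. Alternatively, it may be the case that we are interested in directly testing the superiority of one treatment over another. Consequently, much research has gone into determining designs that will have desired operating characteristics (now, a type-I error rate of $\alpha$ when $\tau = 0$, and a type-II error rate of $\beta$ when $\tau = \delta$) and favourable performance in terms of the expected sample size. Below, we discuss two popular methods, available for implementation via our commands.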

0.4.2 Power Family of One-Sided Designs

In addition to their power family of inner wedge designs, pampallona1994 also detailed a one-parameter family of designs for one-sided tests, with boundaries given by

$$r_l = C_r (l/L)^{\Omega - 1/2}, \qquad a_l = \delta I_l^{1/2} - C_a (l/L)^{\Omega - 1/2}.$$

As before, taking a final information level of

$$I_L = \{(C_r + C_a)/\delta\}^2$$

ensures that $a_L = r_L$ as desired, and a two-dimensional numerical search can be used to determine the appropriate values of $C_r$ and $C_a$. Our command powerFamily is available to perform these computations.
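The role of the final information level can be seen in a short Python sketch of the one-sided power family. It assumes boundaries of the form $r_l = C_r(l/L)^{\Omega - 1/2}$ and $a_l = \delta I_l^{1/2} - C_a(l/L)^{\Omega - 1/2}$ with equally spaced information. The function name and, importantly, the two constants used below are hypothetical placeholders: they have not been searched for and do not correspond to any particular type-I or type-II error rate.

```python
from math import sqrt

def power_family_one_sided(c_r, c_a, omega, L, delta, sigma0, sigma1, ratio):
    """Group size n, acceptance boundaries a and rejection boundaries r for
    given (not optimised) constants c_r and c_a."""
    info_max = ((c_r + c_a) / delta) ** 2          # forces a_L = r_L
    # Invert Equation (1) at l = L to obtain the group size in arm 0.
    n = info_max * (sigma0**2 + sigma1**2 / ratio) / L
    acc, rej = [], []
    for l in range(1, L + 1):
        shape = (l / L) ** (omega - 0.5)
        info_l = l * info_max / L                  # equally spaced information
        rej.append(c_r * shape)
        acc.append(delta * sqrt(info_l) - c_a * shape)
    return n, acc, rej

# Placeholder constants (assumptions for illustration only, not a calibrated design).
n, acc, rej = power_family_one_sided(c_r=1.5, c_a=1.2, omega=0.0, L=3,
                                     delta=0.25, sigma0=1, sigma1=2, ratio=2)
print("Group size:", round(n, 2))
for l, (lower, upper) in enumerate(zip(acc, rej), start=1):
    print(f"Analysis {l}: a = {lower:.3f}, r = {upper:.3f}")
```

In a full design procedure, a two-dimensional search would adjust the two constants until the type-I and type-II error rates, computed by multivariate normal integration, match their targets.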

0.4.3 Triangular Test

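whitehead1983 and whitehead1997 also proposed a triangular test for one-sided group sequential clinical trial designs. Specifically, they proposed

$$r_l = \left\{\frac{2}{\tilde{\delta}} \log\left(\frac{1}{2\alpha}\right) - 0.583 \left(\frac{I_L}{L}\right)^{1/2} + \frac{\tilde{\delta}}{4} I_l\right\} I_l^{-1/2}, \qquad a_l = \left\{-\frac{2}{\tilde{\delta}} \log\left(\frac{1}{2\alpha}\right) + 0.583 \left(\frac{I_L}{L}\right)^{1/2} + \frac{3\tilde{\delta}}{4} I_l\right\} I_l^{-1/2},$$

with

$$\tilde{\delta} = \frac{2\Phi^{-1}(1 - \alpha)}{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)} \delta,$$

where $\Phi^{-1}$ is the standard normal quantile function, and

$$I_L = \left[\left\{\frac{4(0.583)^2}{L} + 8 \log\left(\frac{1}{2\alpha}\right)\right\}^{1/2} - \frac{2(0.583)}{L^{1/2}}\right]^2 \tilde{\delta}^{-2},$$

demonstrating this design would approximately attain the desired operating characteristics.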

This design has proven popular with trialists because of the speed with which it can be calculated, and also because of its strong performance in terms of its expected sample sizes (wason2012). Our command triangular determines this design.

0.5 Syntax

In this section, we detail the syntax of our six discussed commands, which are all declared as rclass

doubleTriangular, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *

haybittlePeto, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *

innerWedge, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *

powerFamily, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *

triangular, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *

wangTsiatis, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *

Here, the prescribed options denote the following

alpha is the desired overall type-I error rate, $\alpha$. That is, it is the two-sided or one-sided type-I error rate according to the chosen command.

beta is the desired type-II error rate, $\beta$.

delta is the clinically relevant difference at which we power, $\delta$.

l is the maximum number of allowed stages in the design, $L$.

omega is the shape parameter of the boundaries of the power family and Wang-Tsiatis designs, $\Omega$.

performance specifies that the performance of the identified design, i.e. its expected sample size and power curves, should be determined and plotted.

ratio is the desired ratio, $r$, of the sample size in arm 1 to that in arm 0.

sigma is the standard deviation of the responses in arms 0 and 1; $\sigma_0$ and $\sigma_1$. This can either be of length two, containing the assumed values of these two parameters, or of length one, implying $\sigma_0 = \sigma_1$.

Attainable via return list for all six commands are the determined exact required group size (r(n)) and the stopping boundaries, as appropriate (e.g., r(a)). In addition, the vector of information levels (r(I)), the covariance matrix (r(Lambda)), and a vector summarising the performance of the design (r(performance)) are available.

Note that in all of these commands, required one-dimensional numerical searches are performed using a purpose-built implementation of Brent's algorithm (Brent1973). In contrast, all two-dimensional numerical searches are carried out with the Nelder-Mead option in optimize().

0.6 Example 1: Two-Sided Tests

As our first example, we consider the case $L = 2$, $\alpha = 0.05$, $\beta = 0.2$, $\delta = 0.2$, $\sigma_0 = \sigma_1 = 2$, and $r = 1$, in two-sided testing.

We begin by demonstrating how doubleTriangular can be used to determine the boundaries and sample size required by the Double Triangular test of whitehead1983. Explicitly, the following code is used to determine the design

. doubleTriangular, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) r(1)
2-stage Group Sequential Trial Design
-------------------------------------
The hypotheses to be tested are as follows:
  H0: tau = 0
  H1: tau != 0,
with the following error constraints:
  P(Reject H0 | tau = 0) = .05,
  P(Reject H0 | tau = delta = .2) = 1 - .2.
Double-triangular boundaries selected....................
...now determining design................................
...design determined. Returning the results..............
...Exact required group sizes for each arm determined to be: 875.5 and 875.5.
...Rejection boundaries r determined to be: (2.2,2.07).
...Acceptance boundaries a determined to be: (.73,2.07).
...Operating characteristics of the design are:
  P(Reject H0 | tau = 0) = .0531,
  P(Reject H0 | tau = .2) = .8003,
  E(N | tau = 0) = 2514.6,
  E(N | tau = .2) = 2550.5,
  max_tau E(N | tau) = 2716.4,
  max N = 3501.9.

As can be seen, by default the commands return an informative summary of the chosen testing framework, their progress, and the characteristics of the final design. Specifically, the first few lines describe the hypotheses that will be tested based on the chosen command. The input values of alpha and beta are then used in printing a summary of the desired operating characteristics. Several lines then follow which describe the progress of the command in completing its required computations. Next, the exact required number of patients in each arm, in each stage, is printed. The rejection and acceptance boundaries then follow, along with a summary of the operating characteristics of the identified design. In this case we see the design has a type-I error rate of 0.053, and power of 0.800. This is a well-known limitation of the double triangular design: the type-I and type-II error requirements are only approximately achieved. The final four printed results summarise various important sample size characteristics of the design: the expected sample size when $\tau = 0$, that when $\tau = \delta$, the maximum expected sample size over all possible values of $\tau$, and the maximum possible required sample size. We can see that in this case, whilst the maximum possible value of $N$ is 3501.9, we would expect to not require more than 2716.4 patients.

Being able to easily determine this design is useful; however, in most situations it is unlikely that a trialist will have a single design in mind. Consequently, it is important to be able to determine the performance of several designs and compare them graphically. Here, we demonstrate this for the power family of inner wedge designs. Using the following code, we find the designs for $\Omega = -0.5$, $-0.25$, $0$, and $0.25$, saving their performance. Then, we combine the saved graphs to produce Figure 1

. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(-0.5) r(1)
> perf saving(firstDesign) nodraw title({&Omega} = -0.5) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(-0.25) r(1)
> perf saving(secondDesign) nodraw title({&Omega} = -0.25) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(0) r(1) perf
> saving(thirdDesign) nodraw title({&Omega} = 0) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(0.25) r(1)
> perf saving(fourthDesign) nodraw title({&Omega} = 0.25) scale(0.75) scheme(sj)
. graph combine firstDesign.gph secondDesign.gph thirdDesign.gph fourthDesign.gph,
> ycommon scheme(sj)

Figure 1: Comparison of the performance of several two-sided power family of inner wedge designs.

We observe that increasing the value of $\Omega$ appears to reduce the expected sample size required when $|\tau|$ is small. However, this comes at a cost to that required when $|\tau|$ is large.

0.7 Example 2: One-Sided Tests

As our next example, we consider one-sided testing. We take $L = 3$, $\alpha = 0.1$, $\beta = 0.1$, $\delta = 0.25$, $(\sigma_0, \sigma_1) = (1, 2)$, and $r = 2$. Similarly to the above, we demonstrate how powerFamily can be used to determine several designs ($\Omega = -0.25$, $0$, and $0.25$), and in addition compute the boundaries and sample size of the triangular test. Saving the performance of each, we then compare their performance graphically, creating Figure 2 with the following code

. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(-0.25)
> r(2) perf saving(firstDesign) nodraw title(Power family with {&Omega} = -0.25)
> scale(0.75) scheme(sj)
. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(0) r(2)
> perf saving(secondDesign) nodraw title(Power family with {&Omega} = 0)
> scale(0.75) scheme(sj)
. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(0.25)
> r(2) perf saving(thirdDesign) nodraw title(Power family with {&Omega} = 0.25)
> scale(0.75) scheme(sj)
. qui triangular, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) r(2) perf
> saving(fourthDesign) nodraw title(Triangular test) scale(0.75) scheme(sj)
. graph combine firstDesign.gph secondDesign.gph thirdDesign.gph fourthDesign.gph,
> ycommon scheme(sj)

Figure 2: Comparison of the performance of several one-sided power family designs and the triangular test.

As has been reported previously, the triangular test does indeed fare well in comparison to the three identified power family designs. Explicitly, it has the lowest maximum expected sample size of the four designs. However, this does come at the cost of an increased maximum possible sample size, as evidenced by its performance for large $\tau$.

0.8 Conclusion

It is important that any clinical trial control both its type-I and type-II error rates accurately. For this task, Stata introduced in Version 13 the command power, which can be used for an extremely wide array of trial scenarios. However, as we have discussed, group sequential clinical trial designs are extremely popular with researchers, and to date few commands are available in Stata for determining such designs. Notable exceptions include nstage (Barthel2009; Bratton2015) and nstagebin (Bratton2014) for multi-arm multi-stage trial designs with time-to-event and binary endpoints respectively. In addition, the command simsam can determine the required sample size of certain group sequential clinical trial designs given stopping boundaries (Hooper2013). There are, however, no established commands for determining the boundaries and group size required by the wide array of group sequential trial designs for normally distributed outcomes discussed here.

Several extensions to our commands are now possible. We have assumed that the variance of the responses on both treatment arms is known prior to trial commencement. Whilst this is a common assumption in the group sequential design literature, it will often be a strong one to make. However, whitehead2009 proposed a simple quantile substitution method for dealing with this problem, which has been shown to generally control the type-I error rate to the correct level (wason2012a). This would no doubt be a useful addition to our commands. Moreover, we have assumed that the interim analyses are equally spaced in terms of the number of patient responses accrued in each arm. lan1983 proposed an error spending approach to the design of group sequential trials that allows this assumption to be relaxed. Consequently, a command to employ such methodology could prove useful to those seeking more complex designs.

Additionally, our focus has been on two-arm trials. Today, multi-arm multi-stage trials are becoming increasingly popular. Therefore, extending these designs to allow for multiple experimental arms would be advantageous. Finally, there have now been several proposals for the determination of optimal or near-optimal group sequential designs (see, for example, wason2012, wason2012a, and Wason2015a). To allow trialists to maximise the efficiency gains made by utilising a group sequential design, the establishment of commands for determining such designs would be highly advantageous.

Regardless of these possible expansions, our commands can be used to determine stopping boundaries, exact required group sizes, and also to compare the performance of a selection of designs. Consequently, they should prove useful to those seeking to exploit the efficiencies of a group sequential design whilst working in Stata.

0.9 Acknowledgements

Michael J. Grayling is supported by the Wellcome Trust (Grant Number 099770/Z/12/Z). James M. S. Wason is supported by the National Institute for Health Research Cambridge Biomedical Research Centre (Grant Number MC_UP_1302/6). Adrian P. Mander is supported by the Medical Research Council (Grant Number MC_UP_1302/2).

References