A two-stage Fisher exact test for multi-arm studies with binary outcome variables

11/28/2017 ∙ by Michael Grayling, et al. ∙ University of Cambridge

In small sample studies with binary outcome data, use of a normal approximation for hypothesis testing can lead to substantial inflation of the type-I error-rate. Consequently, exact statistical methods are necessitated, and accordingly, much research has been conducted to facilitate this. Recently, this has included methodology for the design of two-stage multi-arm studies utilising exact binomial tests. These designs were demonstrated to carry substantial efficiency advantages over a fixed sample design, but generally suffered from strong conservatism. An alternative classical means of small sample inference with dichotomous data is Fisher's exact test. However, this method is limited to single-stage designs when there are multiple arms. Therefore, here, we propose a two-stage version of Fisher's exact test, with the potential to stop early to accept or reject null hypotheses, which is applicable to multi-arm studies. In particular, we provide precise formulae describing the requirements for achieving weak or strong control of the familywise error-rate with this design. Following this, we describe how the design parameters may be optimised to confer desirable operating characteristics. For a motivating example based on a phase II clinical trial, we demonstrate that on average our approach is less conservative than corresponding optimal designs based on exact binomial tests.




1 Introduction

When designing a study, it is not uncommon for the outcome variable of interest to be dichotomous. Unsurprisingly therefore, there is a long history of publications pertaining to the design of studies that compare two binomial proportions. Whilst a normal approximation could be utilised for comparing proportions, in the case of small samples this can lead to substantial inflation of the familywise error-rate (FWER). Therefore, exact statistical methods are required, and one classical approach to this is Fisher’s exact test (Fisher, 1935), sometimes also credited to Yates (1934) or Irwin (1935). Unfortunately, prior to the advent of modern computing power this method was prohibitively computationally intensive, and therefore the early literature contains several proposals for approximating the problem of sample size determination (see, for example, Casagrande et al. (1978), Fleiss et al. (1980), and Ury and Fleiss (1980)). Today, computational speed is no longer an issue, and exact methods are readily available, not only for two-arm, but also multi-arm studies (Mehta and Patel, 1983; Mehta and Patel, 1986). Thus, along with alternative exact methods, such as Barnard’s test (Barnard, 1945), Fisher’s exact test allows for the effectual design and analysis of small fixed-sample studies with binary outcome variables.
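The contrast between a normal approximation and Fisher's exact test can be checked numerically for a small two-arm study. The following is a minimal sketch, assuming equal group sizes, a one-sided pooled z-test at a nominal 5% level, and illustrative function names that do not come from the paper. It enumerates every possible outcome to obtain the actual rejection probability of each test when both arms share the same success probability.

```python
from math import comb, sqrt


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n experiments."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def fisher_p_value(x_1, x_0, n):
    """One-sided Fisher p-value: P(X_1 >= x_1 | X_1 + X_0 = z) under equal success probabilities."""
    z = x_1 + x_0
    upper = min(z, n)
    tail = sum(comb(n, i) * comb(n, z - i) for i in range(x_1, upper + 1))
    return tail / comb(2 * n, z)


def z_test_rejects(x_1, x_0, n, crit=1.6449):
    """One-sided pooled z-test for two proportions (normal approximation)."""
    pooled = (x_1 + x_0) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # the statistic is undefined when no variation is observed
    z_stat = ((x_1 - x_0) / n) / sqrt(2 * pooled * (1 - pooled) / n)
    return z_stat > crit


def fisher_rejects(x_1, x_0, n, alpha=0.05):
    return fisher_p_value(x_1, x_0, n) <= alpha


def size(rejects, n, p):
    """Exact rejection probability when both arms have success probability p."""
    return sum(
        binom_pmf(x_1, n, p) * binom_pmf(x_0, n, p)
        for x_1 in range(n + 1)
        for x_0 in range(n + 1)
        if rejects(x_1, x_0, n)
    )
```

Calling `size(z_test_rejects, 10, p)` and `size(fisher_rejects, 10, p)` over a range of `p` shows how far each test sits from the nominal level. The Fisher version can never exceed it, because the test is valid conditional on every observed total number of successes.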

Since Wald published his work on the sequential probability ratio test (Wald, 1945), there has been substantial interest in study designs that allow for the interim assessment of hypotheses. Recently, for logistical reasons, the focus of such research has typically been on group, rather than fully, sequential designs. Indeed, much methodology now exists on this (see, for example, Jennison and Turnbull (2000)), including recent extensions to multi-arm studies with normally distributed outcomes (Magirr et al., 2012). This approach has also been demonstrated to be applicable asymptotically to binary outcome variables (Jaki and Magirr, 2013).

Unfortunately, the drawbacks of utilising a normal approximation in small sample studies with binary data persist in this sequential setting. Accordingly, recent proposals have included methods for the design of two-stage multi-arm studies, allowing early stopping to accept null hypotheses, that employ exact binomial tests (Jung, 2008). Later, this methodology was extended in the case of two-arm designs to allow early stopping to reject null hypotheses (Jung, 2013). However, this approach was demonstrated to be highly conservative for certain values of the shared success probability across the design arms. To attempt to resolve this in a two-arm setting, a recent paper presented a two-stage version of Fisher’s exact test (Jung and Sargent, 2014). It was established that this design could on average more exhaustively utilise the designated FWER.

Nonetheless, as stated, the design presented in Jung and Sargent (2014) was limited to two-arm studies only. Moreover, a functional form for the stopping boundaries at the interim analysis was assumed, which may substantially reduce the chance the design will be more efficient than an approach based on exact binomial tests. In this paper, we seek to address these limitations by presenting an optimised two-stage version of Fisher’s exact test applicable to multi-arm studies. Extension to a multi-arm domain is of particular importance because of the noted advantages in terms of efficiency of comparing multiple arms against a single control (Parmar et al., 2014). Furthermore, this study design scenario is becoming increasingly common. For example, in clinical research, in many disease settings numerous novel compounds are available for testing in phase II, and the primary outcome variable is typically a binary response indicator, whilst randomised trials are now far more commonly utilised in this setting (Ivanova et al., 2016).

This paper will consist of the following sections. First we introduce the notation used in the paper in Section 2.1. Following this, in Section 2.2, we summarise the previously proposed design based on exact binomial tests. Our approach is then presented in Section 2.3. Throughout these sections, particular focus is given to the requirements for achieving weak or strong control of the FWER. Then, in Section 2.4 we introduce a phase II clinical trial utilised as a motivating example. Our results are then detailed in Section 3, before we conclude with a discussion of the advantages and disadvantages of the two considered designs in Section 4.

2 Methods

2.1 Notation, hypotheses and analysis

Define the sets and . Then, we suppose that our study will have a single (control) arm, indexed by , which will be compared to (experimental) arms, indexed by , in a pairwise manner within a two-stage design. Denoting the success probability for each arm by , we test the following composite hypotheses

where for . That is, if is the random variable describing the outcome from experiment in arm , then for .

We allow early stopping to both reject and accept the null hypotheses, if so desired, assuming that the rejection of any null hypothesis at the first analysis leads to the termination of the whole study. Note that methodology for a design which terminates the study only when a decision has been made for every null hypothesis could be specified similarly.

We assume that control of the FWER, the probability of one or more incorrect rejections of null hypotheses, is desired to some maximal level . We discuss criteria for both weak and strong control of the FWER. Moreover, we suppose the experiment must have a familywise power (FWP) of at least when for . That is, we design the experiment to have power to reject at least one of the , when is true for . Note however that power to reject a particular null hypothesis could be achieved similarly.

In testing our hypotheses, we will make repeated reference to the odds ratios

given by

Furthermore, we will make use of the vectors

, , and . To indicate the implied by a particular using the relationship given above, we use the notation .
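The mapping between success probabilities and odds ratios can be sketched as follows. This assumes the standard definition of the odds ratio of an experimental arm against the control arm, namely the odds of success in arm k divided by the odds of success in the control arm. The symbols did not survive in the text above, so the names `p_k`, `p_0` and `theta_k` are illustrative only.

```python
def odds_ratio(p_k: float, p_0: float) -> float:
    """Odds of success in arm k relative to the odds of success in the control arm."""
    return (p_k * (1.0 - p_0)) / (p_0 * (1.0 - p_k))


def success_prob(theta_k: float, p_0: float) -> float:
    """Success probability in arm k implied by an odds ratio and the control success probability."""
    odds_k = theta_k * p_0 / (1.0 - p_0)
    return odds_k / (1.0 + odds_k)
```

The second function plays the role of the notation for the success probabilities implied by a particular vector of odds ratios: for example, `success_prob(odds_ratio(0.85, 0.7), 0.7)` recovers 0.85 up to floating-point error.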

In stage one, we suppose that experiments are to be conducted in the control arm, for some . Then, we specify parameters such that the total number of experiments conducted in the control arm by the completion of stage is . Similarly, we designate values such that the total number of experiments conducted in each arm still present in the experiment is . Here, we retain the notion of a stage 0 to simplify the expressions that follow. To facilitate the dropping of one or more arms after stage one, we denote by the actual number of trials conducted in arm after stage . Note therefore that and , the other and will be assumed to be pre-specified. In contrast, will be determined based on the study’s power requirements. One could however choose to search numerically for advantageous values of the and , according to some designated optimality criteria, if desired.

Next, we denote by the unknown total number of successes in arm in stage , and by its corresponding observed value. Thus .

Then, in both the Fisher exact and exact binomial test frameworks, at each interim analysis , we employ the following test statistics


Finally, in what follows it will be convenient to formalise the eventual (unknown) outcome of the study. We achieve this via the pairs , where and with

  • , with if is rejected, and otherwise,

  • , with if following stage one is rejected or accepted, or the whole study is stopped and no decision is made on , and otherwise.

Using this notation, recalling our previous prescription that the rejection of one or more null hypotheses at the interim analysis causes the termination of the entire study, we can define the sample space of the random pairs by

Here, is the indicator function on event . The following sets will now also be useful

These sets represent the subset of outcomes in which a single particular null hypothesis, at least one null hypothesis, or at least one true null hypothesis, is rejected respectively.

Then, it is the ability to evaluate the probability of observing on trial completion, for any vector of success probabilities , that is key to determining and optimising the considered designs. Explicitly, referring to this probability as , we have


where , and are functions that evaluate the FWP, FWER and expected sample size (ESS) for a given .

2.2 Exact binomial testing design

In this section, we discuss designs based upon exact binomial tests. Jung (2008) and Jung (2013) together provide detailed discussions on such designs in the case , and some guidance for . We expand on these considerations, providing formulae for computing the FWER, FWP, and ESS of any design, with any value for , and early stopping to reject and accept null hypotheses as desired.

In this approach, the goal is to identify suitable values for the parameter discussed earlier, along with acceptance and rejection boundaries and respectively. Informally, if we accept , whilst results in the rejection of . The space of possible designs is given by

Here, , place logical limits on the allowed value of . These could be chosen, for example, based on the sample size required by a corresponding single-stage design. Moreover, the restrictions ensure for any that it is possible to stop to accept or reject null hypotheses at both analyses, and ensure the study terminates after at most two stages. Note that if it is desired to prevent the possibility to reject or accept null hypotheses at the end of stage one, further restrictions can simply be placed on the set , with design optimisation then proceeding as below.

Having specified , and , the study’s formal conduct can be defined, along with the formulae for . We provide both in the Appendix. Here, we proceed directly to discussing how values for , and can be chosen. Specifically, in this instance, it was proposed by Jung (2008) that an optimal design be determined by exhaustively searching over the set . Since the evaluation of each design is independent, parallel execution can be used to enhance the speed of this search. Extending Jung (2013), we search for the solution to

where are weights that indicate which of the three factors we desire to minimise most. Heeding the advice of Mander et al. (2012), we always ensure that since many designs will likely share the same minimal maximal sample size. Later, we will make use of the notation . Additionally, is a specified vector of success probabilities to utilise in the optimisation procedure. Typically, this will be those expected under a global null hypothesis for . The two given constraints here are our requirements on the study’s FWER and FWP. The exact form of the set is dependent on whether weak or strong control of the FWER is desired. In the case where we designate that weak control (i.e., control when all null hypotheses are true) must be achieved, we set . Alternatively, for strong control we must take .

Note that this contradicts the advice given in Jung (2008), which stated that strong control could be achieved for the case of two-arm studies by controlling the FWER for . Although the true maximal FWER appears in general to be close to that attained when , a simple counterexample to it being universally true can be constructed by considering a design with . With such a design, attains a larger FWER than . Consequently, a search over one of the specified must be included when determining a design. In practice, the multi-dimensional search required for strong control must be broached by utilising the criteria for weak control during the design determination stage. Then after choosing an optimal design, a retrospective search over should be performed to argue that strong control has been achieved. Intuitively, it is logical that the maximum will occur when . This type of problem is common to experiments with binary outcome variables (see, for example, Kunzmann and Kieser (2016)).

With this, the methodology required for determining an optimal design based on exact binomial tests has been specified. We will later compare this approach to our two-stage Fisher exact test.
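To make the evaluation and search described in this section concrete, the following is a minimal sketch for the two-arm case only. It rests on several assumptions that are not confirmed by the text above: the test statistic is taken to be the cumulative difference in successes between the experimental and control arms, a null hypothesis is accepted at stage one when the statistic is at most `a_1`, rejected at stage one when it is at least `r_1`, and rejected at stage two when it is at least `r_2`. The group size `n` is the number of experiments per arm per stage. All names are hypothetical.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n experiments."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def diff_pmf(n, p_1, p_0):
    """pmf of X_1 - X_0 for independent Bin(n, p_1) and Bin(n, p_0), as a dict."""
    out = {}
    for x_1 in range(n + 1):
        for x_0 in range(n + 1):
            prob = binom_pmf(x_1, n, p_1) * binom_pmf(x_0, n, p_0)
            out[x_1 - x_0] = out.get(x_1 - x_0, 0.0) + prob
    return out


def reject_prob(n, a_1, r_1, r_2, p_1, p_0):
    """Probability of rejecting the null hypothesis under the assumed stopping rules."""
    stage = diff_pmf(n, p_1, p_0)  # the same distribution applies to each stage
    prob = sum(pr for t, pr in stage.items() if t >= r_1)  # early rejection
    for t_1, pr_1 in stage.items():
        if a_1 < t_1 < r_1:  # continue to stage two
            prob += pr_1 * sum(pr_2 for t_2, pr_2 in stage.items() if t_1 + t_2 >= r_2)
    return prob


def max_type_one_error(n, a_1, r_1, r_2, grid=99):
    """Largest rejection probability over a grid of common success probabilities."""
    points = (i / (grid + 1) for i in range(1, grid + 1))
    return max(reject_prob(n, a_1, r_1, r_2, p, p) for p in points)
```

A call such as `max_type_one_error(10, 0, 5, 4)` scans the line on which both arms share a success probability. As noted above, a search of this kind is needed because the rejection probability varies with the common success probability, and a grid is only a numerical approximation to the true maximum.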

2.3 Fisher exact test design

As was discussed, the exact binomial test method summarised above was confirmed to be highly conservative for many combinations of success probabilities, and to address this in a two-arm setting Jung and Sargent (2014) proposed a two-stage version of Fisher’s exact test. In this section, we detail our extension to their proposal; allowing for multiple arms and the optimisation of the stopping boundaries.

Our test here is based on the conditional distribution of the , specified earlier in Equation (2.1), given the observed total number of successes in each completed stage , (with the unknown total number of stage-wise successes denoted by ). To achieve this, we let , with if arm is present in the study in stage , and otherwise. Therefore, note that . Then, extending Jung and Sargent (2014), the probability mass function of conditional on , and is


and we set . Note that this immediately implies if
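For intuition, the two-arm special case of this conditional distribution can be written down directly. The sketch below is not the multi-arm formula of the paper. It assumes a single experimental arm with `n_1` experiments and a control arm with `n_0` experiments in the stage, and gives the probability of `x` successes in the experimental arm given `z` successes in total and an odds ratio `theta` (Fisher's noncentral hypergeometric distribution). The function name is hypothetical.

```python
from math import comb


def cond_pmf(x, z, n_1, n_0, theta):
    """P(X_1 = x | X_1 + X_0 = z) for independent binomial arms with odds ratio theta."""
    lo, hi = max(0, z - n_0), min(z, n_1)
    if not lo <= x <= hi:
        return 0.0  # outside the support implied by the total number of successes

    def weight(i):
        return comb(n_1, i) * comb(n_0, z - i) * theta**i

    return weight(x) / sum(weight(i) for i in range(lo, hi + 1))
```

With `theta = 1` this reduces to the central hypergeometric distribution used by the classical Fisher exact test, and the distribution no longer depends on the unknown control success probability, which is what makes conditional error control possible.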

Now, the study’s conduct depends upon having chosen stopping boundaries for all possible total numbers of successes that could be observed in stage one and stage two, for every possible number of experimental arms that could be present in stage two. Formally, the following are required

  • , (with ) for and;

  • for , , .

Thus, whilst the rejection boundary in stage one depends on the number of observed successes, we have chosen to simplify matters by making independent of , as in Jung and Sargent (2014).

The study’s formal conduct is then as follows

  1. Set .

  2. Conduct stage one of the study, allocating patients to the control arm, and patients to each arm . Following data accrual, compute and the .

  3. For each

    • If reject , setting and .

    • If accept , setting .

  4. If and , continue to 5. Otherwise stop the study, and for each with , set .

  5. Set .

  6. Conduct stage two of the study, allocating patients to the control arm, and patients to each arm with . Following data accrual, compute and the .

  7. For each with

    • If reject , setting and .

    • If accept , setting .
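The decision logic applied at the first analysis in the steps above can be sketched as a small function. The direction of the comparisons is an assumption, since the inequalities did not survive in the text: here an arm's null hypothesis is rejected when its test statistic is at least `reject_bound` and accepted when it is at most `accept_bound`. The function and argument names are hypothetical. In the Fisher exact design the boundaries would first be looked up from the observed total number of successes.

```python
def interim_analysis(stats, accept_bound, reject_bound):
    """Classify each experimental arm at the first analysis.

    stats maps an arm index to its test statistic. Returns the sets of arms whose
    null hypotheses are rejected, accepted, or still undecided, together with a
    flag indicating whether the study continues to stage two.
    """
    rejected = {k for k, t in stats.items() if t >= reject_bound}
    accepted = {k for k, t in stats.items() if t <= accept_bound and k not in rejected}
    remaining = set(stats) - rejected - accepted
    # Any rejection stops the whole study, as does dropping every experimental arm.
    proceed = not rejected and bool(remaining)
    return rejected, accepted, remaining, proceed
```

For instance, `interim_analysis({1: 2, 2: -1}, accept_bound=0, reject_bound=4)` drops arm 2 and continues with arm 1 alone, which is the situation in which the stage-two boundaries depend on the number of arms remaining.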

The above specifies the conduct of our study given values for , the allocation ratios and , and the required stopping boundaries. At the design stage of a study though, we require the ability to choose suitable values for and the stopping boundaries. The large number of required stopping boundaries precludes the possibility of optimising every chosen value, as in the method of the previous section. The aim of the Fisher exact approach though is not to optimise every boundary, but to instead ensure conditional control of the FWER for all possible values of , and , such that marginal control is then certain.

However, we can identify designs with more desirable operating characteristics. In Jung (2013) and Jung and Sargent (2014), the stopping boundaries for stage one were pre-specified, such that only those for the second stage needed to be chosen. Designing a study in this manner reduces the computational complexity of the ensuant optimisation problem. However, it reduces the chance that the operating characteristics of the resultant design will compare favourably with a design determined using the exact binomial test approach. Consequently, we here propose a more flexible design framework, reliant on the specification of two parameters, and , that can then be optimised. We first describe how the stopping boundaries are chosen for any , given and . Following this, we detail how can be chosen such that the study attains the desired FWP.

First, for any , define the function as follows. Precisely, this describes the probability of committing a familywise error at the end of stage one if successes are observed, a rejection boundary of is utilised, and the true vector of odds ratios is . We have

Motivated by the error spending approach to group sequential trial design, when desiring weak control of the FWER, we select for as the minimal integer for which . Alternatively, for strong control is instead the smallest integer such that

That is, we either weakly or strongly control the possibility of committing a familywise error at the first analysis to .
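The selection of a stage-one rejection boundary as the minimal integer meeting a conditional error constraint can be illustrated for the two-arm case. This sketch assumes equal group sizes `n`, a test statistic equal to the difference in successes, rejection when the statistic is at least the boundary, and a hypothetical name `alpha_1` for the error allowed at the first analysis. It evaluates the conditional error with all odds ratios equal to one.

```python
from math import comb


def null_diff_pmf(z, n):
    """Conditional pmf of T = X_1 - X_0 given X_1 + X_0 = z when the odds ratio is one."""
    lo, hi = max(0, z - n), min(z, n)
    total = comb(2 * n, z)
    return {2 * x - z: comb(n, x) * comb(n, z - x) / total for x in range(lo, hi + 1)}


def stage_one_reject_bound(z, n, alpha_1):
    """Smallest integer e with P(T >= e | z successes in total) <= alpha_1."""
    pmf = null_diff_pmf(z, n)
    for bound in range(-n, n + 2):
        tail = sum(pr for t, pr in pmf.items() if t >= bound)
        if tail <= alpha_1:
            return bound
    return n + 1  # unreachable for alpha_1 >= 0, since the tail is zero at n + 1
```

Because the boundary is recomputed for every possible total `z`, the conditional error never exceeds `alpha_1` whatever total is observed, and so the marginal error at the first analysis cannot exceed it either. A returned value of `n + 1` means early rejection is impossible for that total.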

Next, define

We then choose to be the largest integer such that the marginal type-II error rate for at the first stage, when , is at most . That is, is the largest integer such that

Here is the probability mass function of given and

with .

Now, define the function . This evaluates the probability of committing a familywise error at the end of the second stage, if and successes are observed in stages one and two respectively, with experimental arms present in the second stage, conditional on the use of the stopping boundaries , , and , and a nominated value of . Precisely

where can be computed from the and the stopping rules.

Then, to ensure weak control of the FWER we choose for , , to be the smallest integer such that

Alternatively, for strong control we must ensure that

Here, division by in the right-hand side of the above formulae is to allocate the unspent familywise error equally across the scenarios defined by the number of arms remaining in the experiment in stage two.

As in the previous section, for strong control, in practice it is necessary to assume the maximal FWER occurs on the boundary, and then search for the maximal marginal FWER after design determination. Note however that weak control is always assured by this design, unlike for the exact binomial test approach.

We have now fully specified a method for boundary determination given . With the boundaries specified, it is then possible to define the formula for in this design. We provide this in the Appendix. Then, with this formula ascertained, for any and , we can ensure our study’s FWP requirement is met by searching for the minimal such that

As alluded to previously, we are then free to optimise and . Following the previous section, we minimise

Here, we do not need to list constraints on the study’s operating characteristics as they are assured as part of design determination for any and .

Now, since and are continuous variables, this minimum could in theory be ascertained by utilising a conventional greedy optimisation routine. However, there is no guarantee that this would converge to the global optimum in a parameter space that may contain several local optima, given the discrete nature of the stopping boundaries and required value for . One possible solution therefore is to employ a global continuous optimisation algorithm such as simulated annealing, which has previously been used with success in clinical trial design (see, for example, Wason and Jaki (2012) and Chan and Lee (2013)). Such a search is typically a highly time-consuming process however. A compromise can be achieved by instead proposing a set of candidate values for and , and respectively. The efficiency of the designs in can then be evaluated in parallel using a grid-search and what is essentially, depending on and , a near globally optimal design obtained. This method is also advantageous in that the solution for different combinations of , and can be attained simultaneously. In practice, can be specified by choosing a range of equally spaced values with and , and similarly for . This will be our approach in what follows.
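The grid search over candidate values can be sketched generically. In the sketch below, `evaluate` stands in for the full design determination and scoring for one pair of parameter values, which is not reproduced here. The helper names, the thread pool, and the toy objective in the usage note are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def equally_spaced(lower, upper, count):
    """count equally spaced candidate values from lower to upper inclusive."""
    step = (upper - lower) / (count - 1)
    return [lower + i * step for i in range(count)]


def grid_search(evaluate, grid_a, grid_b, workers=4):
    """Evaluate every pair of candidate values and return the best pair with its score."""
    candidates = list(product(grid_a, grid_b))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda pair: evaluate(*pair), candidates))
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_objective(a, b):
    """Placeholder objective used only to exercise the search; not a design criterion."""
    return (a - 0.05) ** 2 + (b - 0.10) ** 2
```

For example, `grid_search(toy_objective, equally_spaced(0.01, 0.20, 20), equally_spaced(0.01, 0.20, 20))` returns the grid point closest to the minimiser of the placeholder. Since each candidate is evaluated independently, the thread pool could be swapped for a process pool when the evaluation is computationally heavy. Keeping every score rather than only the best also allows different weightings to be compared without repeating the search.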

2.4 Example

In this coming section, we determine several example designs using our two-stage Fisher exact procedure, and compare their performance to those based on the exact binomial method. We reconsider an example motivated by CALGB 50502; a randomised phase II clinical trial that compared the anti-tumour activity of two regimens in patients with relapsed/refractory classical Hodgkin’s lymphoma. With this, we set , , , and (Jung, 2008). For simplicity, we limit our focus to designs with .

In determining the designs based on the exact binomial method, we limit the maximal value of in our search to , where is the group size required by a corresponding single stage design. For the designs utilising the Fisher’s exact method, we determine the near optimal design by taking , .

Code to replicate our findings is available upon request.

3 Results

3.1 Comparison of optimised and non-optimised two-stage Fisher exact designs

First, we consider the case with . Jung and Sargent (2014) designated and for in this instance. Here, we compare the efficiency of their design to several determined using our optimisation procedure. Table 1 presents the ESS for and , along with the maximal possible sample size, of several designs. Note that in this instance, the criteria for weak and strong control coincide. Thus strong control is guaranteed by the Fisher exact approach.

It can be seen that the optimised designs allow the ESS under and to be reduced by up to 13.1% () and 15.0% () respectively, with only small increases to the maximal possible sample size. Moreover, one of the optimised designs reduces the maximal sample size by 8.3% (), though this comes at a cost to both and . A compromise can be achieved between reducing the ESSs and the maximal sample size by taking . In fact, the design with performs better than Jung and Sargent’s design in terms of , , and .

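Design | Weights | First design parameter | Second design parameter | Group size | ESS, first setting | ESS, second setting | Maximal sample size
Jung and Sargent | N/A | N/A | N/A | 48 | 145.49 | 146.89 | 192
Optimised | | 0.11 | 0.16 | 52 | 126.49 | 125.42 | 208
Optimised | | 0.08 | 0.17 | 51 | 126.54 | 124.93 | 204
Optimised | | 0.01 | 0.01 | 44 | 162.54 | 164.60 | 176
Optimised | | 0.04 | 0.10 | 46 | 131.54 | 131.27 | 184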
Table 1: A comparison of the design from Jung and Sargent (2014) and several optimised designs for different values of , and is shown. Here, , , and .
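The way in which different weights select different rows of Table 1 can be reproduced with a few lines of code. This is a sketch under the assumption that the optimality criterion is a weighted sum of the two ESS values and the maximal sample size. The numerical values are copied from the optimised rows of Table 1, while the tuple layout, function name and example weights are illustrative.

```python
# (first parameter, second parameter, group size, first ESS, second ESS, maximal sample size)
OPTIMISED_DESIGNS = [
    (0.11, 0.16, 52, 126.49, 125.42, 208),
    (0.08, 0.17, 51, 126.54, 124.93, 204),
    (0.01, 0.01, 44, 162.54, 164.60, 176),
    (0.04, 0.10, 46, 131.54, 131.27, 184),
]


def best_design(weights, designs=OPTIMISED_DESIGNS):
    """Return the design minimising the assumed weighted sum of the three criteria."""
    w_1, w_2, w_3 = weights
    return min(designs, key=lambda d: w_1 * d[3] + w_2 * d[4] + w_3 * d[5])
```

Placing all weight on one criterion picks out the row that is best for that criterion, whereas equal weights, `best_design((1 / 3, 1 / 3, 1 / 3))`, select the design with parameters 0.04 and 0.10, which balances the ESSs against the maximal sample size.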

3.2 Comparison of two-stage Fisher exact and exact binomial test designs

Next, we compare the performance of our proposed two-stage Fisher exact design to that based on the exact binomial test in the case . For each approach, we enforce their respective criteria for weak control of the FWER during design determination.

Searching over for the Fisher exact approach, in this case the optimal design for each of is given by the case . A summary of this design’s operating characteristics, along with that for the corresponding optimal designs using the exact binomial test method, are given in Table 2.

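Design | Weights | Group size | First boundary | Second boundary | Third boundary | ESS, first setting | ESS, second setting | Maximal sample size
Fisher exact test | All below | 38 | N/A | N/A | N/A | 154.2 | 151.7 | 228
Exact binomial test | | 37 | 2 | 11 | 7 | 144.2 | 190.3 | 222
Exact binomial test | | 47 | 4 | 8 | 9 | 158.0 | 170.5 | 282
Exact binomial test | | 44 | 3 | 8 | 9 | 156.3 | 171.0 | 264
Exact binomial test | | 38 | 1 | 9 | 8 | 156.9 | 181.4 | 228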
Table 2: A summary of several optimal designs for the Fisher and exact binomial test approaches. Here, , , and .

We can see that the Fisher exact design attains the smallest value of , being at least 11% smaller than that for the exact binomial test designs. However, an exact binomial test design does exist with a smaller value for . The maximal possible required sample size of the Fisher exact design is comparable to that of several of the exact binomial test designs.

The operating characteristics of these designs are further elucidated in Figure 1, which depicts the FWER and ESS when for , as well as the FWP and ESS when for .

For more extreme values of , the Fisher exact design has a FWER closer to the nominal level. There is however a region around in which the exact binomial test designs more exhaustively utilise the allowed FWER. This mirrors the results previously presented for the case with .

It is clear that the Fisher exact design almost universally has a larger FWP than the exact binomial test designs. The exact binomial test designs display a characteristic in which the FWP can drop as tends towards 0.85. This is a consequence of the fact that the probability of observing a large number of successes in arm 0 grows more quickly in this region than that for arms 1 and 2.

With the exception of the optimal exact binomial test design for , the designs require similar ESSs when . When however, the ESS for the Fisher exact design is far smaller than that for all of the exact binomial test designs. The fundamental shape of the ESS curves for the Fisher exact design and the exact binomial test designs in this instance are also different. This reflects the ability of the Fisher exact test design to alter the rejection boundary when a large number of successes are observed.

Figure 1: The operating characteristics of the Fisher exact and exact binomial test designs given in Table 2 are presented. Specifically, in (a) and (c) the familywise error-rate (FWER) and expected sample size (ESS) when for are shown respectively. In (b) and (d) the familywise power (FWP) and ESS when for are shown. The Fisher exact design is shown in red. The optimal exact binomial test designs with , , , and , are shown in blue, green, orange, and purple respectively.

4 Discussion

In this paper, we proposed methodology for the design of a two-stage multi-arm randomised study with binary outcome variables, based upon Fisher’s exact test. Through several examples, we were able to demonstrate that pre-specifying the stopping boundaries after stage one reduced potential trial efficiency substantially. Moreover, we observed that the Fisher exact design typically controlled the FWER more closely to the nominal level across possible values of the common success probability. In addition, this design typically retained a power advantage over its binomial exact counterparts, and would be expected to require a smaller sample size under the global alternative hypothesis. However, in some instances this did come at a small cost to the examined expected sample size under the global null hypothesis.

In general, the Fisher exact approach is advantageous over the exact binomial test since the common success probability under the global null hypothesis does not need to be searched over to weakly control the FWER. However, both methods considered here require a numerical search to control the type-II error rate to the desired level.

The issues with controlling trial operating characteristics could of course be alleviated by the specification of point null and alternative hypotheses. However, in many design situations such an approach may not be wise. For example, randomised designs have been proposed for phase II clinical trials primarily to address issues caused by a lack of knowledge about the success probability in the control arm. It is worth noting that for each of the designs presented in Table 2, we utilised a stochastic search procedure to search for the maximal FWER across all . In each case the maximum appeared to occur, as anticipated, when for some . This suggests strong control can be attained by controlling the maximal FWER over this . A formal proof of this fact remains to be presented however.

For both of the design procedures considered, it is extremely computationally intensive to determine the optimal design. Consequently, neither design can be declared preferential based on this factor. However, as discussed in Jung and Sargent (2014), the Fisher exact approach retains an advantage for scenarios in which a study’s sample size differs from its pre-specified value. This can easily be accounted for, by additionally conditioning on the realised sample sizes.

Throughout, we restricted ourselves to designs in which an equal number of patients were allocated to each of the experimental arms present in the study. This was necessary for the exact methods considered here, as unequal allocation would imply that different boundaries would be needed for each of the experimental arms. It is worth noting that the use of a normal approximation for the determination of optimal two-stage designs could more easily handle differing allocation to the experimental arms. Furthermore, this approach would readily extend to three-stage designs, whereas the exact methods discussed in this article may be computationally intractable with the addition of a third stage. However, a normal approximation approach would of course not be appropriate in the case of small sample sizes.

In conclusion, we have provided new methodology for the design of multi-arm studies with binary outcome variables. For the considered example, the operating characteristics of our design were found to often be preferable to an approach based on exact binomial tests. Thus, whilst there is no golden rule as to which technique should be preferred, our two-stage Fisher exact test may routinely be expected to provide more desirable efficiency gains.


Appendix

The formal conduct of a trial utilising the exact binomial test approach is as follows

  1. Set , and .

  2. Conduct stage of the trial, allocating patients to the control arm, and patients to each arm with . Following data accrual, compute the .

  3. For each with

    • If reject , setting and .

    • If accept , setting .

  4. If and , set and return to 2. Otherwise stop the experiment, and for each with , set .

With this, on trial termination we have that . Then