A review of Bayesian perspectives on sample size derivation for confirmatory trials

06/28/2020 ∙ by Kevin Kunzmann, et al. ∙ University of Cambridge, Roche, Queen Mary University of London, Newcastle University

Sample size derivation is a crucial element of the planning phase of any confirmatory trial. A sample size is typically derived based on constraints on the maximal acceptable type I error rate and a minimal desired power. Here, power depends on the unknown true effect size. In practice, power is typically calculated either for the smallest relevant effect size or a likely point alternative. The former might be problematic if the minimal relevant effect is close to the null, thus requiring an excessively large sample size. The latter is dubious since it does not account for the a priori uncertainty about the likely alternative effect size. A Bayesian perspective on the sample size derivation for a frequentist trial naturally emerges as a way of reconciling arguments about the relative a priori plausibility of alternative effect sizes with ideas based on the relevance of effect sizes. Many suggestions as to how such ‘hybrid’ approaches could be implemented in practice have been put forward in the literature. However, key quantities such as assurance, probability of success, or expected power are often defined in subtly different ways in the literature. Starting from the traditional and entirely frequentist approach to sample size derivation, we derive consistent definitions for the most commonly used ‘hybrid’ quantities and highlight connections, before discussing and demonstrating their use in the context of sample size derivation for clinical trials.




1 Introduction

Randomised Controlled Trials (RCTs) are the gold-standard study design for evaluating the effectiveness and safety of new interventions. Despite their successful history of demonstrating the benefits and uncovering the risks of new treatments, they face substantial challenges. Recent evidence has shown that the success rate of RCTs is low [wong2019estimation]. This low success rate and the increasing cost of conducting RCTs are resulting in large real-term increases in the cost of drug development [dimasi2016innovation]. Since the sample size of a trial is a key determinant of the chances of detecting a treatment effect (if it is present) and an important cost factor, the choice of an adequate sample size enables more economic drug development. Purely economic arguments would suggest a utility-based approach as discussed in lindley-1997, for example. However, specifying utility functions, particularly in investigator-sponsored studies concerned with non-drug interventions, can be hard in practice. These practical problems, the desire to maintain minimal ethical standards, as well as recommendations of health authority guidelines result in the vast majority of confirmatory clinical trials deriving their sample size based on desired conditional type I and type II error rates. For instance, a randomised clinical trial with an unnecessarily large sample size (‘overpowered’) is unethical if the treatment shows a substantial effect and the consequences of being randomised to the control arm are severe. Thus the issue of selecting an appropriate sample size is a vitally important part of conducting a clinical trial.

The traditional approach to determining the required sample size for a trial is to choose a point alternative and derive a sample size such that the probability to reject the null hypothesis exceeds a certain threshold (typically 80% or 90%) while maintaining a certain maximal type I error rate (typically 2.5% one-sided). The maximal type I error rate is usually realised at the boundary of the null hypothesis and thus available immediately. The type II error rate, however, critically depends on the choice of the (point) alternative. There are at least two ways of justifying the choice of point alternative. The first is based on a relevance argument, which requires the specification of a minimal clinically relevant difference (MCID). Since the probability to reject the null hypothesis is typically monotonic in the effect size, determining the sample size such that the desired probability to reject the null is exceeded at the MCID implies that the power for all other relevant differences will be even larger. Guidance has recently been published on choosing the MCID [cook2018delta2], but making this choice may still be difficult in practice.

The second perspective is based on a priori considerations about the likelihood of the treatment effect. Here, an a priori likely effect size is used as the point alternative. This implies that the resulting sample size might be too small to detect smaller but still relevant differences reliably. On the other hand, the potential savings in terms of sample size might outweigh the risk of ending up with an underpowered study. The core difference between these approaches is that basing the required sample size on the MCID might be ineffective if prior evidence for a larger effect is available, but the MCID is chosen based on relevance arguments and is thus not subject to uncertainty. In contrast, choosing the point alternative based on considerations about the relative a priori likelihood of effect sizes implies that there is an inherent uncertainty about the effect size that can be expected – otherwise no trial would be needed in the first place.

Depending on the study objective, the sample size can also be derived based on entirely different considerations. For example, studies that aim to establish new diagnostic tools or biomarkers would rather target a certain width of the confidence interval for the AUC. Similarly, studies aimed at estimating population parameters with sufficient precision should target the standard error of the estimate, rather than deriving a sample size based on power arguments [Thompson2012, grouin-2007]. These approaches are beyond the scope of this manuscript and we will only discuss sample size derivation based on error rate considerations.

To make things more tangible, consider the case of a simple one-stage, one-arm $Z$-test. Let $X_i$, $i = 1, \ldots, n$, be observations with mean $\mu$ and known standard deviation $\sigma$. Under suitable regularity conditions, their mean $\overline{X}_n$ is asymptotically normal and $Z_n := \sqrt{n}\,\overline{X}_n/\sigma \sim \mathcal{N}(\sqrt{n}\,\theta, 1)$, where $\theta := \mu/\sigma$. Assume that interest lies in testing the null hypothesis $H_0\colon \theta \le \theta_0 = 0$ at a one-sided significance level of $\alpha$. The critical value for rejecting $H_0$ is given by the $(1-\alpha)$-quantile of the standard normal distribution, $z_{1-\alpha}$, and is independent of $n$. The probability of rejecting the null hypothesis for given $\theta$ and $n$ is

$$\Pr_\theta[Z_n > z_{1-\alpha}] = 1 - \Phi\big(z_{1-\alpha} - \sqrt{n}\,\theta\big), \tag{1}$$

where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. Often, $\Pr_\theta[Z_n > z_{1-\alpha}]$ is seen as a function of $\theta$ and termed the ‘power function’. This terminology may lead to confusion when considering parameter values $\theta \le \theta_0$ and $\theta > \theta_0$, since the probability to reject the null hypothesis corresponds to the type I error rate in the former case and classical ‘power’ in the latter. For the sake of clarity we will thus use the neutral term ‘probability to reject’.
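The probability to reject in equation (1) can be evaluated directly. The following is a minimal sketch in Python, assuming a standardised effect size and a null boundary of 0; the function name `prob_reject` and the example values are illustrative assumptions and do not appear in the original text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_reject(theta: float, n: int, alpha: float = 0.025) -> float:
    """Probability to reject H0: theta <= 0 with a one-sided z-test.

    Evaluates 1 - Phi(z_{1-alpha} - sqrt(n) * theta), where theta is the
    standardised effect size (assumption: null boundary theta_0 = 0).
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)  # critical value z_{1-alpha}
    return 1.0 - STD_NORMAL.cdf(z_crit - sqrt(n) * theta)


if __name__ == "__main__":
    # At the null boundary the probability to reject equals alpha (type I
    # error rate); for theta > 0 it is the classical power.
    for theta in (0.0, 0.1, 0.3):
        print(f"theta={theta:.1f}: {prob_reject(theta, n=100):.3f}")
```

Evaluating the same function at the null boundary and at a positive effect illustrates why the neutral term ‘probability to reject’ is used: one formula yields the type I error rate in the first case and the power in the second.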

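Assume that a point alternative $\theta_{\mathrm{alt}} > \theta_0$ is given. A sample size can then be chosen as the smallest sample size that results in a probability to reject of at least $1-\beta$ at $\theta_{\mathrm{alt}}$:

$$n^*_{\theta_{\mathrm{alt}}} := \min\big\{\, n : \Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}] \ge 1-\beta \,\big\}.$$
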
Since $\Pr_\theta[Z_n > z_{1-\alpha}]$ is monotone in $\theta$, the probability to reject the null hypothesis for $\theta > \theta_{\mathrm{alt}}$ is also greater than $1-\beta$. Thus, if $\theta_{\mathrm{alt}} = \theta_{\mathrm{MCID}}$, the null hypothesis can be rejected for all clinically relevant effect sizes with a probability of at least $1-\beta$. This approach requires no assumptions about the a priori likelihood of the value of $\theta$ but only about $\theta_{\mathrm{MCID}}$ and the desired minimal power level (see also [chuang2011, Section 3] and chuang-2006). However, the required sample size increases quickly as $\theta_{\mathrm{MCID}}$ approaches $\theta_0$. In almost all practically relevant situations, a maximal feasible sample size is given (e.g., due to budget constraints), which might not be sufficient to achieve the desired power if $\theta_{\mathrm{MCID}}$ is close to $\theta_0$. This problem might arise when considering overall survival in oncology trials, for example. In that setting, it may be hard to justify a value for $\theta_{\mathrm{MCID}}$ much larger than $\theta_0$ since almost any improvement in overall survival can be considered relevant (see footnote 1). The problem becomes even more pressing if the null hypothesis is defined as $H_0\colon \theta \le \theta_{\mathrm{MCID}}$, i.e., if the primary study objective is to demonstrate a clinically important effect. In either case it is clearly impossible to derive a feasible sample size based on the minimal clinically important difference alone [chuang2011]. Formulating a principled approach to eliciting a sample size in situations like these is difficult, and in practice trialists may resort to back-calculating an effect size in order to achieve the desired power given the maximum feasible sample size [lenth2001, lan-2012, grouin-2007].

Footnote 1: It is important to distinguish between the (unknown) true effect size and the observed effect size. It might still be reasonable to additionally require a certain deviation of the observed effect from the null, e.g. a hazard ratio of less than 0.85.

One way of justifying a smaller sample size is to consider a point alternative based on a priori likelihood arguments: if there is prior evidence for effect sizes larger than $\theta_{\mathrm{MCID}}$, determining the sample size based on $\theta_{\mathrm{MCID}}$ might well be inefficient and lead to an unnecessarily large trial. Therefore, planning of the required sample size is often based on a single point alternative $\theta_{\mathrm{alt}}$, $\theta_{\mathrm{alt}} > \theta_{\mathrm{MCID}}$. Yet, this pragmatic approach is unsatisfactory in that it ignores any uncertainty about the effect size [lenth2001].
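The sample size for a point alternative, and its rapid growth as the alternative approaches the null, can be sketched as follows. This assumes the one-sided z-test with a standardised effect size and null boundary 0; the function name `required_n` and the example effect sizes are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_n(theta_alt: float, alpha: float = 0.025, beta: float = 0.2) -> int:
    """Smallest n with 1 - Phi(z_{1-alpha} - sqrt(n) * theta_alt) >= 1 - beta.

    Assumes a null boundary of 0 and a standardised effect theta_alt > 0.
    """
    if theta_alt <= 0:
        raise ValueError("theta_alt must lie strictly above the null boundary 0")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    z_beta = std_normal.inv_cdf(1.0 - beta)
    # Solving the power constraint for n gives
    # sqrt(n) >= (z_{1-alpha} + z_{1-beta}) / theta_alt.
    return ceil(((z_alpha + z_beta) / theta_alt) ** 2)


if __name__ == "__main__":
    # Hypothetical alternatives: n scales like 1 / theta_alt^2.
    for theta_alt in (0.5, 0.3, 0.1, 0.05):
        print(theta_alt, required_n(theta_alt))
```

Because the required sample size scales with the inverse square of the effect size, halving the alternative roughly quadruples the trial, which is the feasibility problem described for a minimal relevant effect close to the null.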

In the following, we review approaches to sample size derivation that do account for a priori uncertainty via a prior density for the effect size. We propose a framework encompassing the most relevant quantities discussed in this context, give precise definitions of the terms, and highlight connections between individual items. Where necessary, we refine existing definitions to improve overall consistency. Note that we exclusively focus on what is usually termed a ‘hybrid’ Bayesian-frequentist approach to sample size derivation [spiegelhalter-2004]. This means that, although Bayesian arguments are used to derive a sample size under uncertainty about the true effect size, the final analysis is strictly frequentist. After introducing the individual concepts in detail, a structured overview of all quantities considered is provided in Figure 1. We then present a review of the literature on the subject, showcasing the confusing diversity of terminology used in the field and relating our definitions back to the existing literature. Finally, we present some numerical examples and conclude with a discussion.

2 A Bayesian argument for the choice of the point alternative

One way of incorporating planning uncertainty is to make assumptions about the relative a priori likelihood of the unknown effect size. This approach can be formalised within a Bayesian framework by seeing the true effect size $\theta$ as the realisation of a random variable $\Theta$ with prior density $\varphi(\cdot)$. This means that the CDF of $\Theta$ is given by $\Pr[\Theta \le x] = \int_{-\infty}^{x} \varphi(\theta)\,\mathrm{d}\theta$. At the planning stage, the probability to reject the null hypothesis is then given by the random variable $\operatorname{RPR}(n)$, the ‘random probability to reject’:

$$\operatorname{RPR}(n) := \Pr_\Theta[Z_n > z_{1-\alpha}].$$

We explicitly denote this quantity as ‘random’ to emphasise the distinction between the (conditional on $\Theta = \theta$) probability to reject given in equation (1) and the unconditional ‘random’ probability to reject. The variation of the random variable $\operatorname{RPR}(n)$ reflects the a priori uncertainty about the unknown underlying effect size that is encoded in the prior density of the random variable $\Theta$. We prefer the term ‘random probability to reject’ over ‘random power’ since $\operatorname{RPR}(n)$ is unconditional on the effect size, and consequently does not distinguish between rejections under the null hypothesis and under relevant effect sizes. Instead, we define the conditional random variable ‘random power’ as

$$\operatorname{RPow}(n) := \operatorname{RPR}(n) \mid \Theta \ge \theta_{\mathrm{MCID}}.$$

This definition more closely resembles the concept of frequentist power since it conditions on a relevant effect size.

Determining the required sample size based on a point alternative as outlined in the introduction evaluates the probability to reject the null hypothesis solely on $\theta_{\mathrm{alt}}$. This can be understood as conditioning the random probability to reject, or the random power, on $\Theta = \theta_{\mathrm{alt}}$, i.e., to consider $\operatorname{RPR}(n) \mid \Theta = \theta_{\mathrm{alt}}$. Due to conditioning on a single parameter value $\theta_{\mathrm{alt}} \ge \theta_{\mathrm{MCID}}$, under any prior density with $\varphi(\theta_{\mathrm{alt}}) > 0$, both random variables (almost surely) reduce to the deterministic expression $\Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}]$, which is often termed ‘power’ in a frequentist context. Basing the sample size derivation on this quantity means that the probability to reject the null hypothesis for relevant values of $\theta \ne \theta_{\mathrm{alt}}$ is completely ignored and the a priori evidence encoded in $\varphi(\cdot)$ is not used.

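spiegelhalter1986 have pointed out that a power constraint for sample size derivation could be computed based on “[…] a somewhat arbitrarily chosen location parameter of the [prior] distribution (for example the mean, the median or the 70th percentile).” Using a location parameter of the unconditional prior distribution of $\Theta$, however, might lead to situations where no sample size can be determined if the location parameter lies within the null hypothesis. Here, we follow a similar idea but motivate the choice of location parameter in terms of the a priori distribution of random power. To this end, let

$$Q_{1-\gamma}[\operatorname{RPow}(n)] := \inf\big\{\, x : \Pr_\varphi[\operatorname{RPow}(n) \le x] \ge 1-\gamma \,\big\}$$

be the $(1-\gamma)$-quantile of the random power. Furthermore, let

$$\varphi(\theta \mid \Theta \ge \theta_{\mathrm{MCID}}) := \frac{\mathbf{1}[\theta \ge \theta_{\mathrm{MCID}}]\;\varphi(\theta)}{\Pr[\Theta \ge \theta_{\mathrm{MCID}}]} \tag{7}$$

be the conditional prior density of $\Theta$ given a relevant effect. We choose to make the dependency of

$$\Pr_\varphi[\operatorname{RPow}(n) \le x] = \int \mathbf{1}\big[\Pr_\theta[Z_n > z_{1-\alpha}] \le x\big]\;\varphi(\theta \mid \Theta \ge \theta_{\mathrm{MCID}})\,\mathrm{d}\theta$$

on the prior density explicit by using the index ‘$\varphi$’ since the random parameter $\Theta$ does not appear directly in the description of the event ‘$\operatorname{RPow}(n) \le x$’. Whenever $\Theta$ appears explicitly, we omit the index since the dependency on $\varphi$ is then clear from the context. The expression $\Pr_\varphi[\operatorname{RPow}(n) \le x]$ is a real number in $[0, 1]$ and thus different from $\Pr_\Theta[Z_n > z_{1-\alpha}]$, which is a random variable.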

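If a sample size was then chosen such that $Q_{1-\gamma}[\operatorname{RPow}(n)] \ge 1-\beta$, the a priori probability of exceeding a probability to reject of $1-\beta$ given a relevant effect would be, by definition, at least $\gamma$. The required sample size for this approach is the solution of

$$n^*_\gamma := \min\big\{\, n : Q_{1-\gamma}[\operatorname{RPow}(n)] \ge 1-\beta \,\big\}.$$

Since $\Pr_\theta[Z_n > z_{1-\alpha}]$ is monotonic in $\theta$, this problem is equivalent to solving

$$n^*_\gamma = \min\big\{\, n : \Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}] \ge 1-\beta \,\big\}, \qquad \theta_{\mathrm{alt}} := Q_{1-\gamma}[\Theta \mid \Theta \ge \theta_{\mathrm{MCID}}],$$

where $Q_{1-\gamma}[\Theta \mid \Theta \ge \theta_{\mathrm{MCID}}]$ is the $(1-\gamma)$-quantile of the prior distribution of the random variable $\Theta$ conditional on a relevant effect. Consequently, this ‘prior quantile approach’ can be used with any existing frequentist sample size derivation formula. It is merely a formal Bayesian justification for determining the sample size of a trial based on a point alternative $\theta_{\mathrm{alt}} = Q_{1-\gamma}[\Theta \mid \Theta \ge \theta_{\mathrm{MCID}}]$. The prior quantile approach reduces to powering on $\theta_{\mathrm{MCID}}$ whenever the target power needs to be met with absolute certainty for all relevant effect sizes ($\gamma = 1$). One may thus see the prior quantile approach as a principled relaxation of powering on $\theta_{\mathrm{MCID}}$.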
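To illustrate the prior quantile approach, the sketch below powers a one-sided z-test on a quantile of the prior conditional on a relevant effect. It assumes a normal prior on the standardised effect and a null boundary of 0; the prior parameters, the relevance threshold, and the function names are hypothetical choices made for this example only.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_prior_quantile(p, prior_mean, prior_sd, theta_mcid):
    """p-quantile of a N(prior_mean, prior_sd^2) prior given Theta >= theta_mcid."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    # Prior mass below the relevance threshold.
    p_below = STD_NORMAL.cdf((theta_mcid - prior_mean) / prior_sd)
    # Invert the CDF of the normal prior truncated to [theta_mcid, inf).
    return prior_mean + prior_sd * STD_NORMAL.inv_cdf(p_below + p * (1.0 - p_below))


def quantile_sample_size(gamma, prior_mean, prior_sd, theta_mcid,
                         alpha=0.025, beta=0.2):
    """Sample size when powering on the (1 - gamma)-quantile of the conditional prior.

    Assumes theta_mcid > 0 (null boundary 0) and 0 < gamma <= 1.
    """
    theta_alt = conditional_prior_quantile(1.0 - gamma, prior_mean, prior_sd, theta_mcid)
    z_sum = STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(1.0 - beta)
    return ceil((z_sum / theta_alt) ** 2)


if __name__ == "__main__":
    # Hypothetical prior N(0.3, 0.2^2) and relevance threshold 0.1.
    for gamma in (1.0, 0.9, 0.5):
        print(gamma, quantile_sample_size(gamma, 0.3, 0.2, 0.1))
```

With $\gamma = 1$ the quantile coincides with the relevance threshold and the sketch reproduces powering on the minimal relevant effect; smaller values of $\gamma$ move the point alternative further into the prior and reduce the sample size.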

The approach differs from Spiegelhalter and Freedman’s suggestions in two key aspects. Firstly, the point alternative naturally emerges as a quantile of the prior conditional on a relevant effect by imposing an upper bound on the a priori probability to undershoot the target power. This intuitively makes sense since a large probability to reject is only desirable when the underlying $\theta$ is relevant. This also ensures that $Q_{1-\gamma}[\Theta \mid \Theta \ge \theta_{\mathrm{MCID}}] > \theta_0$ for $\gamma < 1$ and thus guarantees a finite sample size irrespective of the choice of prior. Secondly, to guarantee a more than 50% chance of exceeding the target power, the conditional prior quantile will typically be chosen smaller (i.e., $\gamma > 0.5$) than the conditional median ($\gamma = 0.5$) which was discussed by Spiegelhalter and Freedman.

3 Probability of success and expected power

Spiegelhalter and Freedman also proposed the use of the “probability of concluding that the new treatment is superior and of this being correct” to derive a required sample size [spiegelhalter1986]. The quantity has also been referred to as ‘prior adjusted power’ [spiegelhalter-2004, shao-2008]. This definition of probability of success is also discussed in liu-2010 and ciarleglio-2015. In the situation at hand, it reduces to

$$\operatorname{PoS}(n) := \Pr[Z_n > z_{1-\alpha},\, \Theta \ge \theta_{\mathrm{MCID}}] = \int_{\theta_{\mathrm{MCID}}}^{\infty} \int_{z_{1-\alpha}}^{\infty} \phi\big(z - \sqrt{n}\,\theta\big)\,\mathrm{d}z\;\varphi(\theta)\,\mathrm{d}\theta,$$

where $\phi$ is the PDF of the standard normal distribution (not to be confused with the prior density $\varphi$). Here, we are slightly more general than previous authors in that we allow $\theta_{\mathrm{MCID}} > \theta_0$ and use a tighter definition of ‘success’: a trial is only successful if the null hypothesis is rejected and the effect is relevant. Whenever $\theta_{\mathrm{MCID}} = \theta_0$, this coincides with the definitions used previously in the literature.

The definition of $\operatorname{PoS}(n)$ critically relies on what is being considered a ‘success’. The original proposal of Spiegelhalter and Freedman only considers a significant study result a success if the underlying effect is also non-null (i.e., the joint probability of non-null and detection). In more recent publications, a majority of authors tend to follow O’Hagan et al. who consider a slightly different definition of the probability of success by integrating the probability to reject over the entire parameter range [o2001bayesian, ohagan-2005] and term this ‘assurance’. For a more comprehensive overview of the terms used in the literature, see Section 5. The alternative definition for probability of success introduced by O’Hagan et al. corresponds to the marginal probability of rejecting the null hypothesis irrespective of the corresponding parameter value

$$\begin{aligned}
\operatorname{PoS}'(n) := \Pr[Z_n > z_{1-\alpha}] &= \int \Pr_\theta[Z_n > z_{1-\alpha}]\,\varphi(\theta)\,\mathrm{d}\theta \\
&= \operatorname{PoS}(n) + \Pr[Z_n > z_{1-\alpha},\, \theta_0 < \Theta < \theta_{\mathrm{MCID}}] + \Pr[Z_n > z_{1-\alpha},\, \Theta \le \theta_0].
\end{aligned} \tag{15}$$

The decomposition in equation (15) makes it clear that the implicit definition of ‘success’ underlying $\operatorname{PoS}'(n)$ is at least questionable [liu-2010]. The marginal probability of rejecting the null hypothesis includes rejections under irrelevant or even null values of $\theta$, and is thus inflated by type I errors and rejections under irrelevant values of $\theta$. This issue was first raised by spiegelhalter-2004 for simple (point) null and alternative hypotheses. The degree to which $\operatorname{PoS}(n)$ and $\operatorname{PoS}'(n)$ differ numerically is discussed in more detail in Section 7.2. Which definition of ‘success’ is preferred mainly depends on perspective: a short-term oriented pharmaceutical company might just be interested in rejecting the null hypothesis to monetise a new drug, irrespective of it actually showing a relevant effect. This view would then correspond to $\operatorname{PoS}'(n)$. Regulators and companies worried about the longer-term consequences of potentially having to retract ineffective drugs might tend towards the joint probability of correctly rejecting the null, i.e., $\operatorname{PoS}(n)$. We take the latter perspective and focus on $\operatorname{PoS}(n)$.

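$\operatorname{PoS}(n)$ is an unconditional quantity and must therefore implicitly depend on the a priori probability of a relevant effect. To see this, consider the following decomposition

$$\begin{align}
\operatorname{PoS}(n) &= \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\,\varphi(\theta)\,\mathrm{d}\theta \tag{17} \\
&= \Pr[\Theta \ge \theta_{\mathrm{MCID}}] \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\,\varphi(\theta \mid \Theta \ge \theta_{\mathrm{MCID}})\,\mathrm{d}\theta \notag \\
&= \Pr[\Theta \ge \theta_{\mathrm{MCID}}]\;\Pr[Z_n > z_{1-\alpha} \mid \Theta \ge \theta_{\mathrm{MCID}}]. \tag{19}
\end{align}$$

This means that the probability of success can be expressed as the product of the ‘expected power’, $\operatorname{EP}(n) := \Pr[Z_n > z_{1-\alpha} \mid \Theta \ge \theta_{\mathrm{MCID}}]$, and the a priori probability of a relevant effect, $\Pr[\Theta \ge \theta_{\mathrm{MCID}}]$ (see again spiegelhalter-2004 for the situation with point hypotheses). Expected power was implicitly mentioned in spiegelhalter1986 as a way to characterise the properties of a design. The use of expected power as a means to derive the required sample size of a design under uncertainty was then proposed in brown1987projection by solving

$$n^*_{\mathrm{EP}} := \min\big\{\, n : \operatorname{EP}(n) \ge 1-\beta \,\big\}. \tag{20}$$
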
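Since the power function is monotonically increasing in $\theta$, expected power is strictly larger than power at the minimal relevant value whenever $\Pr[\Theta > \theta_{\mathrm{MCID}}] > 0$. This implies that a constraint on expected power instead of a constraint on the probability to reject the null hypothesis at $\theta_{\mathrm{MCID}}$ is less restrictive, and consequently the sample size required under the expected power constraint does not exceed the sample size obtained by powering on $\theta_{\mathrm{MCID}}$, $n^*_{\mathrm{EP}} \le n^*_{\theta_{\mathrm{MCID}}}$.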
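A sample size based on an expected power constraint can be found numerically. The sketch below assumes a normal prior on the standardised effect, a null boundary of 0, and a simple midpoint rule for the integral; the prior, the relevance threshold, and the function names are hypothetical. The search uses bisection and relies on expected power being increasing in the sample size for positive relevant effects.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_power(n, prior, theta_mcid, alpha=0.025, steps=2000):
    """Average probability to reject on [theta_mcid, inf) under the conditional prior."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    upper = prior.mean + 8.0 * prior.stdev  # prior mass beyond this is negligible
    width = (upper - theta_mcid) / steps
    joint = 0.0
    for i in range(steps):  # midpoint rule
        theta = theta_mcid + (i + 0.5) * width
        power = 1.0 - STD_NORMAL.cdf(z_crit - sqrt(n) * theta)
        joint += power * prior.pdf(theta) * width
    # Normalise by the a priori probability of a relevant effect.
    return joint / (1.0 - prior.cdf(theta_mcid))


def ep_sample_size(prior, theta_mcid, alpha=0.025, beta=0.2):
    """Smallest n with expected power >= 1 - beta, found by bisection."""
    z_sum = STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(1.0 - beta)
    # Powering on theta_mcid itself yields an upper bound for the search.
    lo, hi = 1, ceil((z_sum / theta_mcid) ** 2)
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_power(mid, prior, theta_mcid, alpha) >= 1.0 - beta:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    prior = NormalDist(0.3, 0.2)  # hypothetical prior on the standardised effect
    print(ep_sample_size(prior, theta_mcid=0.1))
```

Using the sample size from powering on the minimal relevant effect as the upper end of the search interval reflects the ordering of the two sample sizes stated above.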

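The terms ‘expected power’ and ‘probability of success’ are sometimes used interchangeably in the literature (see Section 5). In the following, we take a closer look at their connection to clarify their characteristic differences. Expected power is merely a weighted average of the probability to reject in the relevance region $[\theta_{\mathrm{MCID}}, \infty)$, where the weight function is given by the conditional prior density defined in equation (7):

$$\operatorname{EP}(n) = \int_{\theta_{\mathrm{MCID}}}^{\infty} \Pr_\theta[Z_n > z_{1-\alpha}]\;\varphi(\theta \mid \Theta \ge \theta_{\mathrm{MCID}})\,\mathrm{d}\theta. \tag{22}$$
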
$\operatorname{PoS}(n)$, on the other hand, integrates the probability to reject over the same region using the unconditional prior density (see equations (17) and (22)). Thus, in contrast to $\operatorname{PoS}(n)$, expected power does not depend on the a priori probability of a relevant effect size but only on the relative magnitude of the prior density (‘a priori likelihood’) of relevant parameter values. Since the conditional prior density differs from the unconditional one only by normalisation via the a priori probability of a relevant effect, it follows from equation (19) that $\operatorname{PoS}(n)$ and $\operatorname{EP}(n)$ differ only by the constant factor $\Pr[\Theta \ge \theta_{\mathrm{MCID}}]$. Consequently, any constraint on probability of success can be reformulated as a constraint on expected power and vice versa:

$$\operatorname{PoS}(n) \ge 1-\beta \quad\Longleftrightarrow\quad \operatorname{EP}(n) \ge \frac{1-\beta}{\Pr[\Theta \ge \theta_{\mathrm{MCID}}]}. \tag{23}$$

Furthermore, since $\operatorname{PoS}(n) = \Pr[\Theta \ge \theta_{\mathrm{MCID}}]\,\operatorname{EP}(n)$ and $\operatorname{EP}(n) \le 1$, $\operatorname{PoS}(n)$ can never exceed the a priori probability of a relevant effect, $\Pr[\Theta \ge \theta_{\mathrm{MCID}}]$. This implies that the usual conventions on the choice of $\beta$ as the maximal type II error rate for a point alternative cannot be meaningful in terms of the unconditional $\operatorname{PoS}(n)$, since the maximum attainable probability of success is the situation-specific a priori probability of a relevant effect. The need to recalibrate typical benchmark thresholds when considering probability of success was previously discussed in the literature. For instance, o2001bayesian state that “[t]he assurance figure is often much lower [than the power], because there is an appreciable prior probability that the treatment difference is less than […]”, where the symbol elided from the quotation is specific to their notation. A similar argument is put forward in rufibach_15. The key issue is thus whether one is interested in the joint probability of rejecting the null hypothesis and the effect being relevant, $\operatorname{PoS}(n)$, or the conditional probability of rejecting the null hypothesis given a relevant effect, $\operatorname{EP}(n)$. While the interpretation of both quantities is different, in any particular situation, they only differ by a constant factor.
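The relationship between probability of success, expected power, and the marginal probability to reject can be checked numerically. The following sketch assumes a normal prior on the standardised effect and a null boundary of 0, and approximates the integrals with a midpoint rule; all names and numerical values are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def integrate_rejection(n, prior, lower, upper, alpha=0.025, steps=4000):
    """Midpoint-rule integral of Pr_theta[reject] * prior density over [lower, upper]."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        theta = lower + (i + 0.5) * width
        total += (1.0 - STD_NORMAL.cdf(z_crit - sqrt(n) * theta)) * prior.pdf(theta)
    return total * width


def success_quantities(n, prior, theta_mcid, alpha=0.025):
    """Return (PoS, EP, marginal probability to reject, Pr[Theta >= theta_mcid])."""
    lo = prior.mean - 8.0 * prior.stdev  # tails beyond +-8 sd are negligible
    hi = prior.mean + 8.0 * prior.stdev
    p_relevant = 1.0 - prior.cdf(theta_mcid)
    # Joint probability of rejecting and the effect being relevant.
    pos = integrate_rejection(n, prior, theta_mcid, hi, alpha)
    # Same integral, weighted with the conditional prior density.
    ep = pos / p_relevant
    # Adding rejections under null or irrelevant effects gives the marginal.
    marginal = integrate_rejection(n, prior, lo, theta_mcid, alpha) + pos
    return pos, ep, marginal, p_relevant


if __name__ == "__main__":
    prior = NormalDist(0.3, 0.2)  # hypothetical prior
    pos, ep, marginal, p_rel = success_quantities(100, prior, theta_mcid=0.1)
    print(f"PoS={pos:.3f}  EP={ep:.3f}  marginal={marginal:.3f}  Pr[relevant]={p_rel:.3f}")
```

In this sketch the probability of success is bounded by the prior probability of a relevant effect, while expected power is not, which is why the two quantities call for different benchmark thresholds.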

3.1 Expected power versus quantile-based approach for sample size derivation

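Since expected power and probability of success are proportional, it suffices to compare expected power and the quantile-based approach outlined in Section 2 with respect to sample size derivation. Consider $\theta_{\mathrm{alt}} := Q_{1-\gamma}[\Theta \mid \Theta \ge \theta_{\mathrm{MCID}}]$ and an arbitrary but fixed parameter value $\theta' \ge \theta_{\mathrm{MCID}}$. Clearly, under the quantile-based approach, the rejection probability at any $\theta' \ne \theta_{\mathrm{alt}}$ does not contribute towards the fulfilment of the power constraint since the probability to reject is only evaluated at $\theta_{\mathrm{alt}}$. For expected power, however, the total functional derivative with respect to changes in the probability to reject at $\theta_{\mathrm{alt}}$ and $\theta'$ is

$$\mathrm{d}\operatorname{EP}(n) = \varphi(\theta_{\mathrm{alt}} \mid \Theta \ge \theta_{\mathrm{MCID}})\,\mathrm{d}\Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}] + \varphi(\theta' \mid \Theta \ge \theta_{\mathrm{MCID}})\,\mathrm{d}\Pr_{\theta'}[Z_n > z_{1-\alpha}].$$

Keeping expected power constant, i.e., setting $\mathrm{d}\operatorname{EP}(n) = 0$ and solving for $\mathrm{d}\Pr_{\theta'}[Z_n > z_{1-\alpha}]$ yields

$$\mathrm{d}\Pr_{\theta'}[Z_n > z_{1-\alpha}] = -\,\frac{\varphi(\theta_{\mathrm{alt}} \mid \Theta \ge \theta_{\mathrm{MCID}})}{\varphi(\theta' \mid \Theta \ge \theta_{\mathrm{MCID}})}\;\mathrm{d}\Pr_{\theta_{\mathrm{alt}}}[Z_n > z_{1-\alpha}].$$

A reduction in the probability to reject at $\theta_{\mathrm{alt}}$ by one percentage point can thus be compensated by an increase in the probability to reject at $\theta'$ by $\varphi(\theta_{\mathrm{alt}} \mid \Theta \ge \theta_{\mathrm{MCID}}) / \varphi(\theta' \mid \Theta \ge \theta_{\mathrm{MCID}})$ percentage points. This demonstrates that the core difference between the prior quantile-based approach and the expected power approach is whether a trade-off between power at different parameter values is deemed permissible (expected power) or not (quantile-based approach). A structured overview of the terms introduced so far and the respective connections between them is given in Figure 1.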

4 Connection to utility maximisation

In a regulatory environment, and most scientific fields, the choice of the significance level, $\alpha$, is a pre-determined quality criterion. In the life sciences a one-sided $\alpha$ of 2.5% is common. Yet, the exact choice of the power threshold $1-\beta$ is much more arbitrary. In clinical trials, $1-\beta = 0.8$ or $1-\beta = 0.9$ are common choices when a classical sample size derivation is conducted. From the previous section it is already clear that a generic threshold for $1-\beta$ that is independent of the specific context of a trial only makes sense with conditional approaches like the (conditional prior) quantile approach or when using $\operatorname{EP}(n)$ to derive a required sample size. In principle, the unconditional $\operatorname{PoS}(n)$ should be easier to interpret by non-statisticians. Equation (23) allows the transformation of an $\operatorname{EP}$-based sample size derivation, which can readily use any of the established values for $1-\beta$, into a $\operatorname{PoS}$-based sample size derivation by re-calibrating the threshold with the proportionality factor $\Pr[\Theta \ge \theta_{\mathrm{MCID}}]$ linking $\operatorname{EP}(n)$ and $\operatorname{PoS}(n)$. This only transforms the conditional criteria (minimum $\operatorname{EP}$) of the classical sample size derivation to the unconditional domain (minimum $\operatorname{PoS}$) without affecting the derived sample size in any way. For instance, a threshold of $1-\beta$ on $\operatorname{EP}(n)$ translates into the threshold $(1-\beta)\,\Pr[\Theta \ge \theta_{\mathrm{MCID}}]$ on $\operatorname{PoS}(n)$. In making the assumptions underlying the sample size derivation more transparent by formulating them in terms of unconditional probabilities, there is a need to explain to practitioners why the threshold now differs from study to study.
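As a worked illustration of this re-calibration, consider the following numbers, which are assumptions chosen purely for this example and are not taken from the original text: an expected power threshold of 0.8 and an a priori probability of a relevant effect of 0.6.

```latex
% Assumed values for illustration only:
%   1 - \beta = 0.8,   \Pr[\Theta \ge \theta_{\mathrm{MCID}}] = 0.6
\begin{align*}
\operatorname{EP}(n) \ge 0.8
  \quad&\Longleftrightarrow\quad
  \operatorname{PoS}(n) = \Pr[\Theta \ge \theta_{\mathrm{MCID}}]\,\operatorname{EP}(n) \ge 0.6 \times 0.8 = 0.48, \\
\operatorname{PoS}(n) \ge 0.48
  \quad&\Longleftrightarrow\quad
  \operatorname{EP}(n) \ge \frac{0.48}{0.6} = 0.8 .
\end{align*}
```

Both constraints select the same sample size; only the scale on which the threshold is reported changes, and a different prior probability of a relevant effect would yield a different threshold on the probability of success for the same expected power target.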

The Bayesian view and the prior density give a natural answer to this issue of trial-specific thresholds via the concept of utility maximisation or maximal expected utility (MEU). An in-depth discussion of the MEU concept is beyond the scope of this manuscript and we refer the reader to, for example, lindley-1997. We merely want to highlight the fact that the choice of the constraint threshold can be justified by making the link to MEU principles. To this end we consider a particularly simple utility function.

Assume that the maximal type I error rate is still to be controlled at level $\alpha$. For the sake of simplicity, further assume that a correct rejection of the null hypothesis yields an expected return of $\lambda$. Here the return is given in terms of the average per-patient costs within the trial. Ignoring fixed costs, the expected trial utility (in units of average per-patient costs) is then given by

$$U(n) := \lambda\,\operatorname{PoS}(n) - n \tag{29}$$

and the utility-maximising sample size is $n^*_U := \arg\max_n U(n)$. Obviously, the same sample size would be obtained by solving problem (20), given

$$1-\beta = \operatorname{EP}(n^*_U).$$

The right hand side, $\operatorname{EP}(n^*_U)$, is the utility-maximising expected power threshold given the utility parameter $\lambda$. Similarly, one could start with $n^*_{\mathrm{EP}}$ for a given $\beta$ and derive the corresponding $\lambda$ such that $n^*_U = n^*_{\mathrm{EP}}$. This value of $\lambda$ would then correspond to the implied expected reward upon successful rejection of the null for given $\beta$. Under the assumption of a utility function of the form (29), $\lambda$ and $\beta$ can thus be matched such that the corresponding utility maximisation problem and the constrained minimisation of the sample size under a power constraint both lead to the same required sample size. Consequently, practitioners are free to either define an expected return upon successful rejection, $\lambda$, or a threshold on the minimal expected power, $1-\beta$.

While it is theoretically attractive to derive the sample size directly based on a utility function, an informed choice of $\lambda$ is often hard to justify in practice. In these situations one may instead reverse the perspective and determine the value of $\lambda$ under which the utility-maximising design would coincide with the design obtained under a standard (expected) power threshold of, say, 80% or 90%. The implied reward parameter might then be used to communicate the consequences of different choices of power thresholds to decision makers and to inform the final choice of $1-\beta$. This approach can, of course, be generalised to more detailed utility functions. Note, however, that for utility functions with more than one free parameter there is no longer a one-to-one correspondence between power level and utility parameters. Rather, for a given power level, there will be a level-set of values for the utility parameters that match the specified power. We give a practical example of this process in Section 7.5.
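The link between a reward parameter and an implied expected power threshold can be sketched numerically. The code below assumes a utility of the form reward times probability of success minus sample size, a normal prior on the standardised effect, and a null boundary of 0; the reward values, the prior, and the function names are hypothetical and only meant as an illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def make_pos(prior, theta_mcid, alpha=0.025, steps=400):
    """Return n -> PoS(n) using a fixed midpoint grid on [theta_mcid, mean + 8 sd]."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    upper = prior.mean + 8.0 * prior.stdev
    width = (upper - theta_mcid) / steps
    grid = [theta_mcid + (i + 0.5) * width for i in range(steps)]
    weights = [prior.pdf(theta) * width for theta in grid]

    def pos(n):
        root_n = sqrt(n)
        return sum(w * (1.0 - STD_NORMAL.cdf(z_crit - root_n * t))
                   for t, w in zip(grid, weights))

    return pos


def utility_maximising_design(reward, prior, theta_mcid, alpha=0.025):
    """Maximise reward * PoS(n) - n; return (n, implied expected power threshold)."""
    pos = make_pos(prior, theta_mcid, alpha)
    p_relevant = 1.0 - prior.cdf(theta_mcid)
    # Beyond n = reward * p_relevant the utility is negative, so larger n are
    # not searched (assumption: some design has non-negative utility).
    n_max = max(1, int(reward * p_relevant))
    best_n = max(range(1, n_max + 1), key=lambda n: reward * pos(n) - n)
    return best_n, pos(best_n) / p_relevant


if __name__ == "__main__":
    prior = NormalDist(0.3, 0.2)  # hypothetical prior
    for reward in (500.0, 1000.0, 2000.0):  # hypothetical rewards in per-patient cost units
        n_star, implied_ep = utility_maximising_design(reward, prior, theta_mcid=0.1)
        print(f"reward={reward:.0f}: n*={n_star}, implied EP threshold={implied_ep:.3f}")
```

Reading the output in reverse, from a conventional power threshold back to the reward that would imply it, corresponds to the reversed perspective described above.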

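[Figure 1: flow diagram; the boxes and arrow labels recovered from the diagram are listed below.]

Boxes:

Probability to reject. Sym.: $\Pr_\theta[Z_n > z_{1-\alpha}]$. Def.: $1 - \Phi(z_{1-\alpha} - \sqrt{n}\,\theta)$. Int.: real number; probability to reject the null hypothesis given a fixed value of $\theta$.

Random probability to reject. Sym.: $\operatorname{RPR}(n)$. Def.: $\Pr_\Theta[Z_n > z_{1-\alpha}]$. Int.: random variable; realisations correspond to the probability to reject the null hypothesis for $\Theta = \theta$.

Random power. Sym.: $\operatorname{RPow}(n)$. Def.: $\operatorname{RPR}(n) \mid \Theta \ge \theta_{\mathrm{MCID}}$. Int.: random variable; realisations correspond to the probability to reject the null hypothesis for $\Theta = \theta$ given a relevant effect.

Expected power. Sym.: $\operatorname{EP}(n)$. Def.: $\Pr[Z_n > z_{1-\alpha} \mid \Theta \ge \theta_{\mathrm{MCID}}]$. Int.: real number; average probability to reject the null weighted with prior density conditional on $\Theta \ge \theta_{\mathrm{MCID}}$.

Probability of success. Sym.: $\operatorname{PoS}(n)$. Def.: $\Pr[Z_n > z_{1-\alpha},\, \Theta \ge \theta_{\mathrm{MCID}}]$. Int.: real number; joint probability to reject the null and have a relevant effect; average probability to reject on $[\theta_{\mathrm{MCID}}, \infty)$ weighted with unconditional prior density.

Marginal probability to reject. Sym.: $\operatorname{PoS}'(n)$. Def.: $\Pr[Z_n > z_{1-\alpha}]$. Int.: real number; marginal probability to reject the null irrespective of underlying effect.

Arrow labels:

replace fixed parameter with random variable
condition on $\Theta \ge \theta_{\mathrm{MCID}}$
integrate with respect to prior conditional on relevant effect
integrate over $[\theta_{\mathrm{MCID}}, \infty)$ with unconditional prior
integrate over entire parameter range with unconditional prior
add expected type one error rate and probability of rejection under irrelevant values
form expected value (three arrows)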

Figure 1: Structured overview of all quantities related to ‘power’ that are introduced in Sections 1 to 3. The symbols used in the text (Sym.), their exact definitions (Def.), and verbal interpretation (Int.) are summarised in the respective boxes. The relationships between the individual quantities are given as labelled arrows. For an overview of previous mentions and synonyms used in the literature, see Table 1.

References Notes
Marginal probability to reject crook-1982

Termed ‘strength’; application in multinomial contingency tables.

spiegelhalter1986 Only implicitly mentioned; discussing close relation to , termed ‘expected/average power’ in spiegelhalter-2004.
gillett-1994 Termed ‘average power‘; focus on replication.
o2001bayesian Termed ‘assurance’ or ‘expected power’; different from our notion of expected power which is conditional on a relevant effect, see also [ohagan-2005].
chuang-2006 Termed ‘average probability of success’; discusses other definitions of ‘success’ based on additional criteria for the observed point estimates; discusses how basing the sample size on relevance arguments alone is theoretically correct but ineffective if evidence for larger effect sizes is available, see also chuang2011.
grouin-2007 Termed ‘predictive power’ and ‘predictive probability to reject ’; review of regulatory aspects, discussion of interval-based sample size calculation, and utility considerations.
daimon-2008 Termed ‘hybrid Neyman–Pearson–Bayesian (hNPB) probability‘; application in non-inferiority setting.
shao-2008 Termed ‘adjusted power’; review of regulatory aspects, discussion of interval-based sample size calculation, and utility considerations.

Termed ‘extended Bayesian expected power 1’; extended by treating variance as unknown, also consider

and .
lan-2012 Termed ‘average power’; discusses upper limit of ‘average power‘ depending on prior choice and suggest truncated priors which would be very close to conditioning on a relevant effect.
carroll-2013 Termed ‘assurance’ and ‘probability of success’ (); discusses other definitions of success but all definitions are also exclusively based on observed quantities (minimum threshold on point estimate), see also chuang-2006.
brutti-2014 Termed ‘predictive frequentist power’; also discusses sample size derivation based on Bayesian decision criteria.
ren-2014 Termed ‘assurance’; discusses ideas of ohagan-2005 in time-to-event setting.
hu-2014 Termed ‘probability of success’; considers priors on mean and standard deviation; discuss upper limit on probability of success in the more complex two-parameter situation.
ibrahim-2015 Termed ‘average probability of success’; discussed in context of historical data integration.
walley-2015 Termed ‘assurance’ or ‘probability of success’; extension to multi-parameter situations.
ciarleglio-2015 Termed ‘expected power’; also consider and , very similar settings considered in ciarleglio-2016, ciarleglio-2017.
rufibach_15 Termed ‘assurance’ or ‘probability of success’; in-depth discussion of the distribution of the probability to reject the null hypothesis.
saint-hilary-2018 Termed ‘predictive probability of success’; consider both ‘statistical success’ (-value ) and ‘clinical relevance’ (observed effect above relevance threshold), see also saint-hilary-2019.
chen-2017 Termed ‘assurance’ and ‘expected power’; discusses conditional nature of the (frequentist) probability to reject the null hypothesis from a Bayesian perspective.
jiang-2011, kirby-2012, zhang-2013, wang-2015, gotte-2017 Termed ‘probability of statistical success’, ‘probability of success’, ‘assurance’, ‘predictive power’; discusses extensions to multiple studies or entire drug development programs.
ambrosius-2012, wang-2013, wang-2015b, crisp-2018, chen-2018 Termed ‘assurance’, ‘probability of success’, ‘probability of study success’; practical applications in various settings.
Probability of success spiegelhalter1986 Only implicitly mentioned, termed ‘prior adjusted power’ in spiegelhalter-2004; discusses close relation to marginal probability to reject (suggesting the latter as practical approximation).
brown1987projection Termed ‘expected power’; also discusses ‘conditional expected power’ which corresponds to our definition of .
shao-2008 Termed ‘adjusted power’; application of the ideas of spiegelhalter-2004 to binary setting, defines probability of success but approximates it with the marginal probability to reject .
liu-2010 Termed ‘extended Bayesian expected power 2’; extended by treating variance as unknown, also considers and .
ciarleglio-2015 Termed ‘prior-adjusted power’; also considers and , very similar settings considered in ciarleglio-2016, ciarleglio-2017.
Expected power brown1987projection Termed ‘conditional expected power’; also discusses unconditional expected power which corresponds to our definition of .
spiegelhalter-2004 Not named; referencing brown1987projection.
liu-2010 Termed ‘extended Bayesian expected power 3’; extended by treating variance as unknown, also consider and .
ciarleglio-2015 Termed ‘conditional expected power’; also considers and , very similar settings considered in ciarleglio-2016, ciarleglio-2017.
Table 1: Selected publications on ‘hybrid’ sample size derivation based on error rates. Structured by concepts as defined in Figure 1.

5 Literature review of terminology

A structured overview of the literature on ‘hybrid’ Bayesian sample size derivation in the context of clinical trials is given in Table 1. The table relates publications in the field to the terms defined in Figure 1. Publications with a similar take on the matter are grouped. In the following, we highlight a few particularly interesting contributions and how they relate to the definitions used in this manuscript.

The majority of the manuscripts only consider the marginal probability to reject (). Many publications refer to o2001bayesian or ohagan-2005, where this quantity was introduced as ‘assurance’. The range of names for what we call the ‘marginal probability to reject ’ is, however, quite diverse: ‘assurance’, ‘probability of success’, ‘predictive probability of success’, ‘average probability of success’, ‘probability of statistical success’, ‘probability of study success’, ‘predictive power’, ‘predictive frequentist power’, ‘expected power’, ‘average power’, ‘strength’, ‘extended Bayesian expected power 1’, and ‘hybrid Neyman-Pearson-Bayesian probability’.

However, only a handful of authors elaborate on the intricacies of defining what exactly constitutes a ‘success’ and whether to consider an unconditional measure of success or to condition on the presence of a relevant effect for sample size derivation [spiegelhalter1986, brown1987projection, shao-2008, liu-2010, ciarleglio-2015]. Most publications fail to define explicitly what exactly constitutes a ‘success’. Yet, the use of implies that rejection of the null hypothesis, irrespective of its truth, must be considered a success. Our analysis in Section 7.2 confirms the statement in spiegelhalter-2004 that can be used as a practical approximation to in many situations. The exact definition of ‘probability of success’ becomes more interesting when allowing for , a potential extension rarely considered in the literature (see, e.g., brown1987projection for the binary case). We revisit the distinction between and in a concrete example in Section 7.2.

The exact choice of wording should not be given too much weight. However, we feel that any notion of power in the ‘hybrid’ Bayesian/frequentist setting should be conditional on a relevant effect (or at least a non-null effect) to preserve the conditional nature of the purely frequentist power. Using the term ‘power’ to refer to a joint probability like the ‘expected power’ of brown1987projection and ciarleglio-2015 (our ) or the ‘average/expected power’ of spiegelhalter-2004 (our ) is potentially misleading. Others suggest ‘conditional expected power’ for to distinguish it from ‘expected power’ (our ) [brown1987projection, ciarleglio-2015]. This wording, however, may lead to confusion when also considering interim analyses where ‘conditional power’ is a well-established term for the probability of rejecting the null hypothesis given and partially observed data [bauer-2016].

A particularly interesting publication is liu-2010. They extend hybrid sample size derivation in the normal case to also incorporate uncertainty about the variance and clearly distinguish between = ‘extended Bayesian expected power 1’, = ‘extended Bayesian expected power 2’, and = ‘extended Bayesian expected power 3’. Apart from nomenclature, our definitions of these three quantities only differ in that they assume the standard deviation to be fixed and the fact that we accommodate the optional notion of a relevant effect via . The former makes explicit formulas more manageable, the latter is important to keep sample sizes small in situations with vague or conservative prior information but substantial relevance thresholds. liu-2010 and rufibach_15 are also the only publications we found that study the distribution of the quantities that are averaged over ( and in our notation, see Figures 4 and 5). In ciarleglio-2015, the distinction between all three quantities is also made explicit (‘expected power’ is our , ‘prior-adjusted power’ is our , and ‘conditional expected power’ is our ).

6 Prior elicitation

A major issue in modelling uncertainty and computing sample size via prior densities is the elicitation of an adequate prior. At first glance, non-informative or ‘objective’ priors seem to be a viable choice. As illustrated in rufibach_15, the prior crucially impacts the properties and interpretability of and likewise any other quantity depending on in Figure 1, so careful selection is paramount. Often in clinical research, there is no direct prior knowledge on the effect size of interest, e.g., overall survival in a phase III trial, as no randomised trials comparing these same treatments using the same endpoint have been run previously. Researchers are then often tempted to use a vague prior, typically a normal prior with large variance, as, e.g., advocated in saint-hilary-2019.

Assuming a non-informative, improper prior for would imply that arbitrarily large effect sizes are just as likely as small ones. Yet, in clinical trials, the standardised effect size rarely exceeds 0.5 [lamberink2018]. We thus illustrate the characteristics of the different approaches to defining power constraints under uncertainty using a convenient truncated Gaussian prior. The truncated Gaussian is conjugate to a Gaussian likelihood and allows us to restrict the plausible range of effect sizes to, e.g., liberally . Also, the truncated Gaussian is the maximum entropy distribution on the truncation interval, given mean and variance, which can be interpreted as a ‘least-informative’ property under constraints on the first two moments.
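The truncated Gaussian prior described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names and the argument `mcid` (the relevance threshold) are hypothetical, and `mean` and `sd` refer to the parameters of the normal distribution before truncation.

```python
from statistics import NormalDist


def truncated_normal_pdf(theta, mean, sd, lower, upper):
    """Density of N(mean, sd^2) restricted to [lower, upper] and renormalised."""
    if theta < lower or theta > upper:
        return 0.0
    base = NormalDist(mean, sd)
    mass = base.cdf(upper) - base.cdf(lower)  # prior mass kept by the truncation
    return base.pdf(theta) / mass


def truncated_normal_cdf(theta, mean, sd, lower, upper):
    """CDF of the truncated normal prior."""
    if theta <= lower:
        return 0.0
    if theta >= upper:
        return 1.0
    base = NormalDist(mean, sd)
    mass = base.cdf(upper) - base.cdf(lower)
    return (base.cdf(theta) - base.cdf(lower)) / mass


def prob_relevant(mcid, mean, sd, lower, upper):
    """A priori probability that the effect is at least the relevance threshold."""
    return 1.0 - truncated_normal_cdf(mcid, mean, sd, lower, upper)
```

For instance, `prob_relevant(0.1, 0.3, 0.2, -0.3, 0.7)` would give the a priori probability of a relevant effect for an illustrative prior with pre-truncation mean 0.3 and standard deviation 0.2 and an assumed relevance threshold of 0.1.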

Alternatively, a trial designer can formally elicit a prior on the effect size of interest. Kinnersley and Day describe a structured approach to the elicitation of expert beliefs in a clinical trial context based on the SHELF framework [kinnersley_13, shelf]. dallow_18 discusses how SHELF is routinely used within a pharmaceutical company to define prior distributions that are used in conjunction with calculation of probability of success and to inform internal decision making at key project milestones. Formal and informal prior elicitation is also discussed in spiegelhalter-2004.

7 Results

7.1 Comparison of required sample sizes for various prior choices

Let and let the maximal feasible sample size be 1000. Figure 2 shows the required sample sizes under the expected power, the probability of success, and the quantile approach (for ). We use and .

Figure 2: Required sample size plotted against prior parameters (Normal truncated to [-0.3, 0.7], with varying mean and standard deviation); ; EP = Expected Power, PoS = Probability of Success, quantile = quantile approach with and , respectively.

The patterns of required sample sizes are qualitatively different between the three approaches. For probability of success, large prior uncertainty implies low a priori probability of a relevant effect and thus the required sample sizes explode for large prior standard deviations (in relation to the prior mean). For very large standard deviations, the constraint on probability of success becomes infeasible.

The expected power criterion leads to a completely different sample size pattern. Since expected power is defined conditional on a relevant effect, large prior uncertainty increases the weight in the upper tails of the power curve where power quickly approaches one. Consequently, for small prior means, larger uncertainty decreases the required sample size. For large prior means, however, smaller prior uncertainty leads to smaller sample sizes since again more weight is concentrated in the tails of the power curve.
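A sample size search under an expected power constraint might look as follows. The sketch assumes a one-sided z-test with test statistic distributed as N(sqrt(n) * theta, 1), a truncated normal prior, and conditioning on theta >= `mcid`; these modelling choices, all function names, and the midpoint-rule integration are assumptions made for illustration, so the resulting numbers need not coincide with those shown in the figures.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal


def power(theta, n, alpha=0.025):
    """Rejection probability of a one-sided z-test, assuming Z_n ~ N(sqrt(n) * theta, 1)."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - n ** 0.5 * theta)


def integrate(f, a, b, m=400):
    """Composite midpoint rule on [a, b] with m sub-intervals."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))


def expected_power(n, mean, sd, lower, upper, mcid, alpha=0.025):
    """Power averaged over the truncated normal prior, conditional on theta >= mcid."""
    prior = NormalDist(mean, sd)
    a = max(lower, mcid)
    # The truncation constant cancels in the ratio, so the untruncated pdf suffices.
    num = integrate(lambda t: power(t, n, alpha) * prior.pdf(t), a, upper)
    den = integrate(prior.pdf, a, upper)
    return num / den


def required_n_expected_power(target, mean, sd, lower, upper, mcid,
                              alpha=0.025, n_max=1000):
    """Smallest n <= n_max with expected power >= target, or None if infeasible.

    Bisection is valid because power increases in n for every theta >= mcid > 0.
    """
    args = (mean, sd, lower, upper, mcid, alpha)
    if expected_power(n_max, *args) < target:
        return None
    lo, hi = 1, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_power(mid, *args) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Calling `required_n_expected_power(0.8, 0.3, 0.2, -0.3, 0.7, 0.1)` for a grid of prior means and standard deviations would trace out a pattern of the kind discussed above, under the stated assumptions.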

The characteristics of the prior-quantile approach very much depend on the choice of . When using the conditional prior median () the approach is qualitatively similar to the expected power approach. This is due to the fact that computing power on the conditional median of the prior is close to computing power on the conditional prior mean. Since the power function is locally linear around the centre of mass of the conditional prior, this approximates computing expected power by interchanging forming the expected value and computing power (i.e., first average the prior and then compute power or average over power with weights given by the conditional prior). For a stricter criterion () the required sample sizes are much larger. This is due to the fact that the quantile approach does not allow a trade-off between power in the upper tails of the power curve and regions with low power. Higher uncertainty then decreases the -quantile towards the minimal relevant effect and thus increases the required sample size.
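The quantile approach reduces to powering on a point alternative, namely a lower quantile of the conditional prior. The sketch below assumes the same one-sided z-test model (test statistic distributed as N(sqrt(n) * theta, 1)) and reads the approach as requiring the target power at the (1 - gamma)-quantile of the prior conditional on theta >= `mcid`; function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal


def power(theta, n, alpha=0.025):
    """Rejection probability of a one-sided z-test, assuming Z_n ~ N(sqrt(n) * theta, 1)."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - n ** 0.5 * theta)


def conditional_quantile(q, mean, sd, lower, upper, mcid):
    """q-quantile of the truncated normal prior conditional on theta >= mcid."""
    prior = NormalDist(mean, sd)
    a = max(lower, mcid)
    p_lo, p_hi = prior.cdf(a), prior.cdf(upper)
    # Conditioning on [a, upper] rescales the CDF of the untruncated normal.
    return prior.inv_cdf(p_lo + q * (p_hi - p_lo))


def required_n_quantile(gamma, target, mean, sd, lower, upper, mcid, alpha=0.025):
    """Smallest n with power >= target at the (1 - gamma)-quantile of the conditional prior.

    Assumes mcid > 0 so that the quantile is a positive point alternative.
    """
    theta_q = conditional_quantile(1.0 - gamma, mean, sd, lower, upper, mcid)
    z_sum = STD.inv_cdf(1.0 - alpha) + STD.inv_cdf(target)
    return math.ceil((z_sum / theta_q) ** 2)
```

Because the point alternative moves towards the relevance threshold as `gamma` grows, the required sample size increases with `gamma`, which mirrors the behaviour described for the stricter criterion.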

7.2 Probability of success as the basis for sample size derivation

In theory, one might be inclined to derive a sample size based on the probability of success instead of using expected power. Consider a situation in which the a priori probability of is . The probability of success is then only (for 80% expected power) or (for 90% expected power). A sponsor might want to increase these relatively low unconditional success probabilities by deriving a sample size based on a minimal of instead. The choice of is limited by the a priori probability of a relevant effect (0.51 in this case). Using equation (23) a minimal probability of success of 0.5 is equivalent to requiring an expected power of more than 98%. In essence, the attempt to increase via a more stringent threshold on implies that low a priori chances of success are to be offset with an almost certain detection () in the unlikely event of an effect actually being present. The ethical implication of this approach to sample size derivation is that an extremely large number of patients would be exposed to treatment although the sponsor more or less expects it to be ineffective.

This example demonstrates that the (situation-agnostic) conventions on power levels cannot be transferred to thresholds for probability of success without adjustment for the situation-specific a priori probability of a relevant effect. It thus seems much easier to directly impose a constraint on expected power, which implicitly adjusts for the a priori probability of a relevant effect via equation (23).
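The arithmetic behind this example can be checked with a short sketch. It assumes that the relation referred to as equation (23) is the product rule, probability of success = expected power × a priori probability of a relevant effect, which follows from expected power being the rejection probability conditional on a relevant effect. The function names are hypothetical.

```python
def probability_of_success(expected_power, prob_relevant):
    """Joint probability of a relevant effect and rejection (assumed product rule)."""
    return expected_power * prob_relevant


def expected_power_needed(pos_target, prob_relevant):
    """Expected power implied by a probability-of-success target.

    Returns None if the target cannot be reached with any finite sample size,
    i.e. if it is not below the a priori probability of a relevant effect.
    """
    if pos_target >= prob_relevant:
        return None
    return pos_target / prob_relevant


p_rel = 0.51  # a priori probability of a relevant effect from the example
print(probability_of_success(0.8, p_rel))   # PoS implied by 80% expected power
print(probability_of_success(0.9, p_rel))   # PoS implied by 90% expected power
print(expected_power_needed(0.5, p_rel))    # expected power needed for PoS = 0.5
```

With an a priori probability of 0.51, a probability of success of 0.5 corresponds to an expected power of 0.5 / 0.51, which is slightly above 0.98.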

To investigate the difference between the probability of success, , and the marginal probability of rejecting the null hypothesis, , Figure 3 visualises the proportion of the individual components of for varying prior standard deviation and prior means. The sample size is fixed at , , the maximal type I error rate is , and the minimal clinically important difference is .

Figure 3: Components of for , , , and varying prior mean and standard deviation; numbers correspond to overall ; proportions in individual pie charts correspond to: A = probability to reject and null effect (type I error), B = probability to reject and irrelevant but non-null effect, C = probability to reject and relevant effect ().

Evidently, the contribution of type I errors (component ‘A’ in Figure 3) to is mostly negligible unless the prior is sharply peaked at an effect size slightly smaller than the null. The a priori probability of a relevant effect size is close to zero in these cases and so is . For the more practically relevant scenarios with prior mean greater than , the contribution of the average type I error rate to is almost negligible. Still, if , might be inflated substantially by rejections under parameter values that are non-null but also not clinically relevant. This phenomenon evidently depends on the magnitude of ; the more of the prior mass is concentrated in , the larger the contribution towards . If , the numeric difference between and is negligible since the maximal type I error rate is controlled at level and the power curve quickly approaches zero on the interior of the null hypothesis. This was already pointed out by spiegelhalter-2004, who argue that can be used as an approximation to in many practically relevant situations.
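The three components of the marginal rejection probability can be approximated numerically. The sketch below assumes a null hypothesis of the form theta <= 0, a one-sided z-test with test statistic distributed as N(sqrt(n) * theta, 1), and a truncated normal prior with lower < 0 < mcid < upper; the function names, the midpoint rule, and the example values are illustrative assumptions rather than the settings used for Figure 3.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal


def power(theta, n, alpha=0.025):
    """Rejection probability of a one-sided z-test, assuming Z_n ~ N(sqrt(n) * theta, 1)."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - n ** 0.5 * theta)


def integrate(f, a, b, m=400):
    """Composite midpoint rule on [a, b] with m sub-intervals."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))


def rejection_components(n, mean, sd, lower, upper, mcid, alpha=0.025):
    """Split the marginal probability to reject into components A, B and C.

    A: reject and theta <= 0 (type I error, assuming H0: theta <= 0)
    B: reject and 0 < theta < mcid (non-null but irrelevant effect)
    C: reject and theta >= mcid (relevant effect)
    """
    assert lower < 0.0 < mcid < upper
    prior = NormalDist(mean, sd)
    mass = prior.cdf(upper) - prior.cdf(lower)  # truncation constant

    def joint(t):
        return power(t, n, alpha) * prior.pdf(t) / mass

    return {
        "A": integrate(joint, lower, 0.0),
        "B": integrate(joint, 0.0, mcid),
        "C": integrate(joint, mcid, upper),
    }


parts = rejection_components(n=100, mean=0.2, sd=0.2,
                             lower=-0.3, upper=0.7, mcid=0.1)
total = sum(parts.values())  # marginal probability to reject
print({k: round(v / total, 3) for k, v in parts.items()})
```

Component A is bounded by the type I error rate times the prior mass on the null, whereas component B grows with the prior mass between zero and the relevance threshold.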

7.3 Distribution of random power under a constraint on expected power

To further investigate the properties of the random variable , we consider three example prior configurations with means and standard deviations respectively. The corresponding sample sizes to reach an expected power of at least 80% are 854, 126, and 32. Figure 4 shows the unconditional and conditional (on a relevant effect) priors, the corresponding probability of rejecting the null as a function of , and histograms of the distributions of random power (), and the unconditional probability to reject the null hypothesis ().

Figure 4: Top: conditional and unconditional prior PDF and power curves corresponding to for . Bottom: histogram of the random power (conditional on relevant effect) and the random probability to reject (unconditional); vertical lines indicate 80% power and the numbers are the respective (conditional) probabilities to exceed a probability to reject of 80%.

Clearly, the distributions of the conditional and unconditional rejection probabilities (random power and random probability to reject, respectively) are qualitatively very different in the three situations. In the first case (mean , standard deviation 0.4), the prior mass is mostly concentrated on the null hypothesis and the normalising factor that links the unconditional and the conditional prior is clearly noticeable. The conditional prior then assigns most weight to values of close to the relevance threshold leading to a large required sample size. The large sample size then implies a steep power curve and a distribution of that is highly right-skewed towards 1 since is conditional on . If the unconditional distribution of the rejection probability is considered instead (), the characteristic u-shape discussed in rufibach_15 is recovered.

For the intermediate setting (mean 0.3, standard deviation 0.125), most prior mass is already concentrated on relevant values. The difference between conditional and unconditional prior is less pronounced (the normalising factor is closer to 1) and even the unconditional distribution of the rejection probability is no longer u-shaped.

Finally, in the last setting (mean 0.5, standard deviation 0.05), is almost certainly highly relevant. Since the normalising factor is thus close to 1, there is no discernible difference between the conditional and the unconditional prior densities. Not surprisingly, the assumption of a highly relevant effect with high certainty only requires a small sample size (32). This leads to a relatively flat curve of the rejection probability and to a peaked distribution of and . Flat power curves and high certainty about the effect size tend to result in peaked distributions of and because the power curve is almost linear in the region of the parameter where the prior mass is concentrated. The distribution of and is thus well approximated by a linear transformation of the (conditional) prior, which is a peaked truncated normal distribution. Since conditioning has almost no effect, the unconditional distribution of the probability to reject is the same in this case.

Interestingly, both settings with higher a priori uncertainty lead to a high chance of exceeding a power of 80%. This is due to the fact that the rare occurrence of very low rejection probabilities needs to be compensated to achieve an overall expected power of 80%.
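The distribution of random power can be explored by simulation: draw effect sizes from the conditional prior and evaluate the power curve at each draw. The following sketch uses inverse-CDF sampling and assumes the one-sided z-test model with test statistic distributed as N(sqrt(n) * theta, 1); the function names, the seed, and the number of draws are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal


def power(theta, n, alpha=0.025):
    """Rejection probability of a one-sided z-test, assuming Z_n ~ N(sqrt(n) * theta, 1)."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - alpha) - n ** 0.5 * theta)


def sample_conditional_prior(rng, size, mean, sd, lower, upper, mcid):
    """Inverse-CDF draws from the truncated normal prior given theta >= mcid."""
    prior = NormalDist(mean, sd)
    a = max(lower, mcid)
    p_lo, p_hi = prior.cdf(a), prior.cdf(upper)
    draws = []
    for _ in range(size):
        p = p_lo + rng.random() * (p_hi - p_lo)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
        draws.append(prior.inv_cdf(p))
    return draws


def random_power_summary(n, mean, sd, lower, upper, mcid,
                         alpha=0.025, target=0.8, size=20000, seed=1):
    """Monte Carlo mean of random power and probability of exceeding the target."""
    rng = random.Random(seed)
    thetas = sample_conditional_prior(rng, size, mean, sd, lower, upper, mcid)
    powers = [power(t, n, alpha) for t in thetas]
    mean_power = sum(powers) / size  # Monte Carlo estimate of expected power
    prob_exceed = sum(p >= target for p in powers) / size
    return mean_power, prob_exceed
```

A histogram of `powers` for a given sample size would correspond to the kind of display shown in the bottom row of Figure 4, subject to the assumptions above.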

7.4 Distribution of random power under quantile approach

To compare the results in the previous section with the prior quantile-based approach, we consider the intermediate example with prior mean 0.3 and prior standard deviation 0.2 again. For this situation, the required sample sizes under and target power or , the corresponding curves of the rejection probability, and histograms of the distribution of the rejection probability are given in Figure 5.

Figure 5: Top: conditional and unconditional prior PDF for a truncated normal prior on with mean 0.3, standard deviation 0.2 and power curves corresponding to for or and ; Bottom: histogram of the random power (conditional on relevant effect) and the random probability to reject (unconditional); vertical lines mark 80% power and the numbers are the respective (conditional) probabilities to exceed a probability to reject of either 70% or 80%.

The required sample sizes depend heavily on the choice of . The crucial difference between the quantile-based and the expected power approach is that for the prior quantile approach the exact distribution of power below the target value of 80% is irrelevant; only the total mass of the distribution below this critical point matters. This means that the sample sizes for the cases are substantially lower than the corresponding sample size derived from an expected power constraint. The flip side of this ignorance about the exact amount by which the target power is undershot is that there is a relatively high chance of ending up with a severely underpowered study in these cases. Increasing the certainty to exceed a power of 80% or 70% by setting , however, leads to substantially larger required sample sizes than under the expected power approach.

The example demonstrates the problems arising from having to specify both and . While this allows more fine-grained control over the distribution of the (conditional) rejection probability, there seems to be no canonical choice for , which is critical in determining the required sample size.

7.5 A clinical trial example

To make things more tangible, consider the case of a clinical trial designed to demonstrate superiority of an intervention over a historical control group with respect to the endpoint of overall survival. To stay within the framework of (approximately) normally distributed test statistics, we assume that effect sizes are given on a standardised log hazard scale, i.e., corresponds to no difference in overall survival and to superiority of the intervention group. Assume that the prior for the treatment effect of the intervention is given by a truncated Normal distribution on with mean and standard deviation (pre-truncation). The minimal clinically relevant difference is set to . This setting corresponds to an a priori probability of a relevant effect of approximately .

Figure 6 shows the (conditional) prior density, the curves of the rejection probability corresponding to the required sample sizes derived from constraints on a minimal probability to reject of at (MCID), at (quantile, 0.5), at (quantile, 0.9), or a minimal expected power of (EP).

Figure 6: Left panel: prior PDF and conditional prior PDF (on ); middle panel: probability to reject the null hypothesis as function of for the expected power design (EP), the design powered for (MCID), the design based on power at the conditional prior median (quantile 0.5), and the design using the conditional 0.1-quantile, i.e. (quantile 0.9); right panel: CDF of random power (probability to reject given ) for the four different design choices.

In this case the MCID criterion requires . The quantile approach (with ) already reduces this to while still maintaining an a priori chance of 90% to exceed the target power of 80%. The quantile approach with results in the lowest sample size of at the cost of only having a 50% chance to exceed the target power of 80%. The EP approach is more liberal than the quantile approach () with but still guarantees a chance of exceeding the target power of roughly 75%. A sample size based on cannot be derived in this example since the a priori probability of a relevant value is lower than . The large spread between the derived sample sizes shows how sensitive the required sample size is to changes in the power constraint. Clearly, the MCID approach is highly inefficient, as accepting a small chance to undershoot the target power with the quantile approach () reduces the required sample size from to roughly a quarter (). At the other extreme, constraining power only on the conditional prior median (quantile approach, ) leads to a rather unattractive a priori distribution of the random power: by definition, the probability to exceed a rejection probability of is still but the a priori chance of ending up with a severely underpowered study is non-negligible.

These considerations leave the trial-sponsor with essentially two options. Either a range of scenarios for the quantile approach with values of between and could be discussed in more detail and a decision on the exact value of could be reached by considering the corresponding distributions of , or the intermediate EP approach could be used. We assume that the trial-sponsor accepts the implicit trade-off inherent to expected power and decides to base the sample size derivation on the EP approach. The required sample size for an (expected) power of 80% is then . Note that this still means that there is a roughly one-in-five a priori probability to end up in a situation with less than 50% power (see Figure 6, CDF panel).

In a situation where is not set in stone, further insights might be gained by making the link to utility maximisation explicit. In a first step, we will assume that the sponsor has no way of quantifying the reward parameter directly. One may then guide decision making by computing the values of that lead to the same required sample size for a range of values of . Figure 7 shows this ‘implied reward’ as a function of the minimal expected power constraint.

Figure 7: Utility-maximising implied reward for varying expected power levels in the situation discussed in Section 7.5.

An expected power of is thus ideal in this situation if the expected reward upon successful (i.e., the effect is indeed relevant) rejection of the null hypothesis is approximately times the average per-patient costs within the planned trial. Using the curve depicted in Figure 7, a discussion with the trial sponsor about the plausibility of certain reward levels can be started. Usually the average per-patient costs are well-known in advance, so that the scale can even be transformed to monetary units, e.g., $US. Assume to this end, that for the particular study at hand, the expected average per-patient costs are $US. Then, the sample size corresponding to an expected power of is maximising utility if the expected reward is $US. The utility-maximising reward for an expected power of would be approximately , i.e., $US. Even without committing to a fixed value of , these considerations can be used to guide the decision as to which of the ‘standard’ power levels (0.8 or 0.9) might be more appropriate in the situation at hand.

Of course, one might also directly optimise utility if the reward upon successful rejection of the null hypothesis can be specified. To that end, assume that a reward of $US is expected. Under the same assumption about average per-patient costs, this translates to . The utility-maximising sample size is then and the corresponding utility-maximising expected power is .
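Direct utility maximisation can be sketched as a grid search over the sample size. The utility function itself is defined in Section 4 and not repeated here, so the code assumes the simple form utility(n) = reward × probability of success(n) − n, with the reward expressed in units of average per-patient costs. That form, the one-sided z-test model with test statistic distributed as N(sqrt(n) * theta, 1), all function names, and the reward values in the usage note are assumptions for illustration only.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal


def probability_of_success(n, mean, sd, lower, upper, mcid, alpha=0.025, m=200):
    """Joint probability of theta >= mcid and rejection (midpoint rule, m sub-intervals)."""
    prior = NormalDist(mean, sd)
    mass = prior.cdf(upper) - prior.cdf(lower)  # truncation constant
    a = max(lower, mcid)
    h = (upper - a) / m
    z_alpha = STD.inv_cdf(1.0 - alpha)
    total = 0.0
    for i in range(m):
        t = a + (i + 0.5) * h
        # assumed power curve: Z_n ~ N(sqrt(n) * theta, 1), reject if Z_n > z_alpha
        total += (1.0 - STD.cdf(z_alpha - n ** 0.5 * t)) * prior.pdf(t)
    return h * total / mass


def utility_maximising_n(reward, mean, sd, lower, upper, mcid,
                         alpha=0.025, n_max=1000):
    """Grid search for the n maximising reward * PoS(n) - n (assumed utility).

    n = 0 stands for not running the trial and has utility 0.
    """
    best_n, best_u = 0, 0.0
    for n in range(1, n_max + 1):
        u = reward * probability_of_success(n, mean, sd, lower, upper, mcid, alpha) - n
        if u > best_u:
            best_n, best_u = n, u
    return best_n, best_u
```

Under this assumed utility, evaluating `utility_maximising_n` for several hypothetical rewards, for example 500 and 5000 per-patient cost units, shows how a larger reward justifies a larger trial; inverting the relationship for a fixed sample size would give an ‘implied reward’ in the sense discussed above.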

8 Discussion

The concept of ‘hybrid’ sample size derivations based on Bayesian priors for planning and the design’s frequentist error rate properties is well-established in the literature on clinical trial design. Nevertheless, the substantial variation in the terminology used and small differences in the exact definition of the terms used can be confusing. We have tried our best to formulate a consistent naming scheme, to be explicit about the exact definitions, to highlight connections between the different quantities (see Figure 1), and to relate back to previous authors (see Table 1). Any naming scheme necessarily has a subjective element to it and ours is by no means exempt from this problem (see also https://xkcd.com/927/). We do hope, however, that our review encourages a clearer separation between terminology for joint probabilities (avoiding the use of the word ‘power’) and for probabilities that condition on the presence of an effect (‘power’ seems more appropriate here), as well as a more transparent distinction between arguments based on a priori likelihood of effects and their relevance. We also strongly believe that an explicit definition (in formulae) of any quantities used should be given when discussing the subject. Merely referring to terms like ‘expected power’ or ‘probability of success’ is too ambiguous given their inconsistent use in the literature.

Often, the main argument for a ‘hybrid’ approach to sample size derivation is the fact that the uncertainty about the true underlying effect can be incorporated in the planning of a design. This is certainly a major advantage but it is equally important that the ‘hybrid’ approach allows a very natural distinction between arguments relating to the (relative) a priori likelihood of different parameter values (encoded in the prior density) and relevance arguments (encoded in the choice of ). The fact that these two components can be represented naturally within the ‘hybrid’ approach has the potential to make sample size derivation much more transparent.

The ‘hybrid’ quantity considered most commonly in the literature is the marginal probability to reject . Often, it is not clear whether the authors are aware of the fact that this quantity includes the error of rejecting the null hypothesis incorrectly, i.e. when . In many practical situations this problem is numerically negligible and , i.e., the marginal probability to reject the null hypothesis is approximately the same as the joint probability of a non-null effect and the rejection of the null hypothesis. If, however, the definition of ‘success’ also takes into account a non-trivial relevance threshold , the distinction becomes more important in practice. Given the great emphasis on strict type I error rate control in the clinical trials community it seems at least strange to implicitly consider type I errors as ‘successful’ trial outcomes. Beyond these principled considerations, a practical advantage of over is the direct and simple connection to . While is independent of the a priori probability of a relevant effect and only depends on the relative a priori likelihood of different effects through the conditional prior, does directly depend on . Although spiegelhalter-2004 see this as a disadvantage of , it is actually a necessary property to use it for sample size derivation without re-calibrating the conventional values for (see also brown1987projection). If one tried to derive a sample size such that this would be impossible for situations with . In a situation where exceeds 0.8 only slightly, the expected power () would have to be close to 1 to compensate for the a priori probability of a relevant effect. In essence, one would thus increase the sample size in situations where the efficacy of the new treatment is still uncertain. This would put more study participants at risk just to make sure that the treatment effect is detected almost certainly if it is indeed present. 
The use of for sample size derivation thus only makes sense in a setting where the threshold is adapted to the a priori probability of a relevant effect. The simplest way to do so is by using which is, however, entirely equivalent to . Another option to derive situation-specific thresholds is via utility maximisation, and is a key term in the simple expected utility function proposed in Section 4. Ultimately, and can be used interchangeably once the prior distribution is fixed as long as the respective multiplicative factor is taken into account. The main advantage of is that it is an unconditional probability which might be easier to interpret by practitioners, while can be readily used in conjunction with an already established power threshold in a research field.

A slightly different concept to sample size derivation via expected power is what we call the ‘quantile approach’. This approach uses a different functional of the probability to reject the null hypothesis given a relevant effect. Instead of the mean, we propose to use a quantile of this distribution. Compared to expected power, this allows direct control of the left tail of the a priori distribution of the probability to reject the null hypothesis given a relevant effect. This can be desirable since a sample size derived via a threshold for expected power might still lead to a substantial chance of ending up with an underpowered study. This can be avoided with the quantile approach and a higher value for (see Figure 5). The quantile approach is also relatively easy to implement in practice, since it is just a Bayesian justification for powering on a point alternative. This flexibility comes at the price of having to specify an additional parameter, (the acceptable risk of ending up with an underpowered study). Theoretically, both expected power and the prior quantile approach are perfectly viable to determine a sample size. Whichever approach is preferred, it is certainly advisable to not only plot the corresponding power curves but also the resulting distribution of (see Figure 4). In essence, the problem of defining a ‘hybrid’ power constraint boils down to finding a summary functional of the power curve that reflects the planning objectives. Ideally, one would like to control the a priori distribution of such that it is sharply peaked around a certain target value avoiding both over- and underpowered studies. Yet, controlling both location (e.g., mean) and spread (e.g., standard deviation) of the distribution of is impossible. A second constraint on the standard deviation of in addition to the mean constraint (expected power) would lead to an over-determined problem since there is only one free parameter, . 
To increase expected power, the sample size must be increased. The standard deviation of , however, decreases as the sample size is lowered since this flattens the power curve of the resulting test (the standard deviation would be 0 if the power curve was constant). Both conflicting objectives (high expected power, low standard deviation of power) are thus not fulfillable at the same time.

Finally, it should be stressed again that the key frequentist property of strict type I error rate control of the designs is not affected by the fact that the arguments for calculating a required sample size are Bayesian. In fact, at no point is Bayes' theorem invoked (i.e., the posterior distribution of the effect size is not required). The Bayesian perspective is merely a principled and insightful way of specifying a weight function (prior density) that can then be used to guide the choice of the power level of the design, or as brown1987projection put it: “This proposed use of Bayesian methods should not be criticised by frequentists in that these methods do not replace any current statistical techniques, but instead offer additional guidance where current practice is mute”.

Supplemental Materials

The code required to reproduce the figures is available at https://github.com/kkmann/sample-size-calculation-under-uncertainty. A permanent backup of the exact version of the repository used for this manuscript is available under the digital object identifier 10.5281/zenodo.3899943 (release 0.2.1). An interactive version of the repository at the time of publication is hosted at https://mybinder.org/v2/gh/kkmann/sample-size-calculation-under-uncertainty/0.2.1?urlpath=lab/tree/notebooks/figures-for-manuscript.ipynb using Binder [jupyter-2018]. A simple shiny app implementing the sample size calculation procedures is available at https://mybinder.org/v2/gh/kkmann/sample-size-calculation-under-uncertainty/0.2.1?urlpath=shiny/apps/sample-size-calculation-under-uncertainty/.


DSR was funded by the Biometrika Trust and the Medical Research Council (MC_UU_00002/6).

Conflicts of interest

None to declare.