1 Introduction
Clinical development of a new treatment is a lengthy, rigorous, and very costly process. Each treatment generally goes through a stagewise procedure before regulatory approval for marketing. Research typically begins with smaller exploratory studies and gradually moves to larger confirmatory studies. It has been increasingly recognized that early phase trials can not only provide safety assessments but also evaluate the efficacy signal. Sponsors tend to advance only compounds that show promising efficacy (e.g., meeting a certain threshold value) and a reasonable safety profile in smaller early studies. It is not uncommon, however, for compounds selected on the basis of a smaller study to show poor efficacy results in a subsequent larger study.
For example, in a phase 2 study, anifrolumab showed an impressive improvement in the systemic lupus erythematosus (SLE) responder index 4 (SRI4) at 52 weeks compared to placebo (62.6% for anifrolumab 300 mg vs. 40.2% for placebo; p < 0.001) [furie2017anifrolumab]. However, two recent phase 3 studies showed less impressive results. In the TULIP-1 study, the response rates were 36.2% for anifrolumab 300 mg vs. 40.4% for placebo [furie2019type]. In the TULIP-2 study, the response rates were 55.5% and 37.3% for anifrolumab 300 mg and placebo, respectively (p < 0.001) [morand2020trial]. The first study did not reach statistical significance, and neither study achieved the magnitude of treatment effect observed in the phase 2 study. No clear differences in study design or study population could explain the sharp contrast in results between the phase 2 study and the first phase 3 study.
The average cost of developing a new drug that gains marketing approval is estimated to be $1 to $2.6 billion [dimasi2016innovation, wouters2020estimated]. To improve efficiency, the most critical factor is to improve the probability of success for phase 2 and 3 development [paul2010improve]. Therefore, a statistical framework that can explain the above-mentioned “sharp contrast” helps enhance quantitative decision-making during the stagewise drug development process.
The terminology of the proof-of-concept (PoC) study we use in the article is relative: depending on the endpoints, for some therapeutic areas, a dose-response phase 2 study may be considered a PoC study, while for some other areas, a multiple ascending dose (MAD) study may serve as a PoC study. To avoid confusion, we refer to the smaller earlier study as the Small Study, and the subsequent larger study as the Large Study throughout the article. There are many reasons the positive result observed in a Small Study may not be replicated in the later Large Study. For example, the populations, duration, and endpoints of the two studies may differ. In oncology, the primary endpoint in a PoC study is often progression-free survival, while the primary endpoint for a confirmatory study becomes overall survival. For diabetes, glucose may be investigated in a PoC study, while hemoglobin A1c (HbA1c) is generally the primary endpoint for a subsequent study. While these factors are important when evaluating the difference in results between the two studies or when predicting the outcome of the Large Study from the results of the Small Study, we will explore the fundamental reason for the difference in the estimation of the treatment effects between the two studies, assuming they are similar except for the sample size.
The probability of study success (PrSS) has been proposed as an alternative to statistical power because it accounts for the variability in the assumed true treatment difference [o2005assurance, wang2013evaluating]. However, statisticians often use a normal prior distribution with the estimated mean and variance from the Small Study (Frequentist approach), or use the posterior distribution given the data in the Small Study with a noninformative prior (Bayesian approach). This approach only accounts for the variability from the Small Study, not for the selection bias (only compounds with promising results from the Small Study are moved to the next development stage).
The problem of selection bias or regression to the mean has been described through Tweedie’s formula [robbins1956empirical, efron2011tweedie]. Tweedie’s formula for estimating the prior density assumes that the variance of the measurement error is constant across observations. Its application in clinical trials has not been widely realized. Chuang-Stein and Kirby described the phenomenon of selection bias and regression to the mean, and provided an overview of the research on discounting early phase study results [chuang2017quantitative]. One approach is to apply an empirical discount factor to the treatment effect or to raise the bar for the criterion for moving the drug to the next development stage [wang2006adapting, kirby2012discounting]. Zhang evaluated the selection bias phenomenon through simulation using a Bayesian approach with an informative prior distribution [zhang2013evaluating]. Again, the prior was used as an empirical way of “discounting” the results from early phase studies, without a clinical meaning, and the paper does not state how the prior should be determined.
In this article, we will describe the aforementioned problem through a theoretical framework in the drug development context, and build the connection between Frequentist and Bayesian methods. Furthermore, we will propose a Bayesian hierarchical model to estimate the distribution for a portfolio of historical compounds, which can be used as the prior for future drug development for treating the same disease with a similar mechanism. We envision that a very important benefit of this research will be the improvement of probability of late phase study success and the reduction of the overall cost of drug development.
This article is organized as follows. In Section 2, we provide the theoretical framework to explain the systematic bias introduced by the limited size of the Small Study and by the fact that only promising compounds are moved forward. We also describe the methods to form a prior distribution used in the framework that characterizes the treatment effect for a portfolio of similar compounds. In Section 3, we provide two data examples from drug development in diabetes and rheumatoid arthritis (RA), respectively, using the methods described in Section 2. Finally, Section 4 provides a summary and discussion of the topic.
2 Methods
Assume we are interested in estimating the treatment effect comparing a new treatment (at a certain dose) to a comparator (typically placebo). The treatment effect can be the treatment difference in means or proportions, logarithm of odds ratio, or logarithm of hazard ratio, depending on the type of variables for the outcome and study objectives. Given a compound, let
denote the true treatment effect. Without loss of generality, we assume a smaller value of means better efficacy. Consider a typical sequential clinical development program for a compound in which an exploratory Small Study is conducted first, and the subsequent Large Study is conducted upon promising results from the Small Study. Assume the estimators of the treatment difference for the Small Study and Large Study are distributed from(1) 
and
(2) 
respectively. We further assume the compound being studied is from a portfolio of candidate treatments with the true treatment effect distributed from
(3) 
We assume

, i.e., the conditional independence between and given the treatment effect for the compound under investigation.

. Although sometimes the estimators are only asymptotically unbiased given , we assume the conditional unbiasedness for the convenience of calculation and argumentation.
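In one possible notation, the setup of this section can be written compactly. The symbols below are assumptions chosen for illustration and need not match those of equations (1)-(3): $\theta$ is the true treatment effect, $\hat\theta_S$ and $\hat\theta_L$ are the Small Study and Large Study estimators, $f_S$ and $f_L$ are their sampling distributions, and $\pi$ is the portfolio (prior) distribution.

```latex
\begin{align*}
\hat\theta_S \mid \theta &\sim f_S(\cdot \mid \theta), \\
\hat\theta_L \mid \theta &\sim f_L(\cdot \mid \theta), \\
\theta &\sim \pi(\cdot), \\
\hat\theta_S &\perp \hat\theta_L \mid \theta
  \quad \text{(conditional independence)}, \\
E(\hat\theta_S \mid \theta) &= E(\hat\theta_L \mid \theta) = \theta
  \quad \text{(conditional unbiasedness)}.
\end{align*}
```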
2.1 The probability of observing a large treatment effect in a small PoC study
We now illustrate the impact of the variability in on the probability of meeting a desired threshold value for the treatment difference through a normal prior distribution:
For example, suppose we intend to develop an antidiabetes treatment and the primary endpoint in the Small Study is the change in HbA1c from baseline at week 12. The desired treatment effect for the new treatment versus placebo (difference in group means) is at least as good as . Figure 1 shows the relationship between the probability of meeting the threshold value [
)] and the standard deviation,
, for . We can see that ) increases as goes up. This means the smaller the sample size, the more likely we would be to observe a promising treatment effect for this particular compound in the portfolio.
2.2 Define and adjust for the selection bias from a frequentist approach
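Before turning to the adjustment, the relationship described in Section 2.1 can be checked numerically. The sketch below assumes the Small Study estimator is normally distributed around a fixed true effect. The function name and the numerical values (a true effect of -0.5 and a threshold of -0.8 on the HbA1c % scale) are hypothetical and are not the values behind Figure 1.

```python
from statistics import NormalDist


def prob_meeting_threshold(true_effect, threshold, se):
    """P(estimate <= threshold) when estimate ~ N(true_effect, se^2).

    Smaller values mean better efficacy, so "meeting the threshold"
    means the estimate falls at or below it.
    """
    return NormalDist(mu=true_effect, sigma=se).cdf(threshold)


# Hypothetical values: the threshold is more optimistic than the true effect.
true_effect, threshold = -0.5, -0.8
for se in (0.15, 0.30, 0.45):
    p = prob_meeting_threshold(true_effect, threshold, se)
    print(f"se = {se:.2f}  P(estimate <= threshold) = {p:.3f}")
```

The probability rises with the standard error here only because the threshold is better than the true effect. If the true effect already beats the threshold, a larger standard error pulls the probability down toward 0.5 instead.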
Although , an underlying distribution for imposes an unconditional correlation between and
. A joint distribution between
and is shown below:(4) 
The conditional distribution can subsequently be derived accordingly:
(5) 
Drug development is a stagewise process: we only move the compound to the Large Study if the Small Study shows a reasonably promising result (e.g., ). Therefore, even though itself is an unbiased estimator for , given is no longer an unbiased estimator for .
We now evaluate two quantities of interest from the Small Study based on the conditional distribution:

The mean treatment difference for the Large Study conditional on the result from the Small Study, i.e., . Since is a function of , it has no selection bias. Therefore, is an unbiased estimator for , which can be seen by .

The conditional probability of achieving the desired treatment effect for the Large Study conditional on the result from the Small Study, i.e., . It can also be seen that .
The bias (sometimes also called the discount factor) for using to estimate the expected treatment effect for the Large Study given the Small Study is defined by
(6) 
The bias is computed from the information in the Small Study and the information from the portfolio, but is independent of the estimate from the Large Study. It represents the amount of adjustment, or discount, that should be applied to the treatment effect estimate from the Small Study in order to have a more realistic view of the expected treatment effect in the Large Study.
To have a more concrete perspective on the definition of “bias", let us assume a special case where , and
are normal distributions such that
(7) 
(8) 
and
(9) 
where and are the mean and variance for the prior distribution of . The unconditional distribution of is easily derived as
(10) 
The conditional distribution of given is
(11) 
According to the definition by (6), the bias is given by
(12) 
Figure 2 shows the bias as a function of the observed treatment effect in the Small Study for various Small Study sample sizes. In this plot, we mimic the change in HbA1c in antidiabetes drug development, assuming the mean and standard deviation for the prior distribution are 1% and 1%, respectively. In antidiabetes PoC or phase 2 studies, the sample size is generally between 20 and 100 per treatment arm. Since is the mean of a portfolio of candidate drugs of the same class and sponsors tend to pick the compounds with promising results from the Small Study to move forward, is generally smaller than (keep in mind that smaller values mean better efficacy). In that case, the bias is positive, and the treatment effect observed in the Large Study is very likely worse than that in the Small Study. The more promising and the more variable the Small Study result is, the larger the difference between the expected treatment effects from the two studies will be. Therefore, a larger discount should be applied to the observed treatment effect in the Small Study when planning the next study, to offset the magnitude and variability in estimating .
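The normal special case can be written out explicitly. In the sketch below the symbols are assumptions for illustration and may differ from the form of equations (7)-(12): $\theta$ is the true effect, $\hat\theta_S$ and $\hat\theta_L$ are the two estimators with sampling variances $\sigma_S^2$ and $\sigma_L^2$, and $\mu$ and $\tau^2$ are the prior mean and variance. The sign convention for the bias (expected Large Study effect minus observed Small Study effect) is chosen so that it is positive when the Small Study result is better than the portfolio mean, in line with the discussion above.

```latex
\begin{align*}
\hat\theta_S \mid \theta &\sim N(\theta, \sigma_S^2), \qquad
\hat\theta_L \mid \theta \sim N(\theta, \sigma_L^2), \qquad
\theta \sim N(\mu, \tau^2), \\
\hat\theta_S &\sim N(\mu, \tau^2 + \sigma_S^2), \\
\theta \mid \hat\theta_S &\sim N\!\left(
  \frac{\tau^2 \hat\theta_S + \sigma_S^2 \mu}{\tau^2 + \sigma_S^2},\;
  \frac{\tau^2 \sigma_S^2}{\tau^2 + \sigma_S^2}\right), \\
\hat\theta_L \mid \hat\theta_S &\sim N\!\left(
  \frac{\tau^2 \hat\theta_S + \sigma_S^2 \mu}{\tau^2 + \sigma_S^2},\;
  \sigma_L^2 + \frac{\tau^2 \sigma_S^2}{\tau^2 + \sigma_S^2}\right), \\
E(\hat\theta_L \mid \hat\theta_S) - \hat\theta_S
  &= \frac{\sigma_S^2}{\tau^2 + \sigma_S^2}\,(\mu - \hat\theta_S).
\end{align*}
```

The shrinkage weight $\sigma_S^2 / (\tau^2 + \sigma_S^2)$ grows with the sampling variance of the Small Study, so a smaller study and a result further from the portfolio mean both lead to a larger adjustment.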
The conditional distribution of given also enables us to estimate the probability of observing the desired efficacy conditional on the result from the Small Study. Figure 3 depicts the relationship between the observed treatment effect from the Small Study () and the conditional probability of success of the Large Study for different standard errors of the observed treatment difference in the Small Study (with a setting similar to that of Figure 2). Again, the more promising the observed treatment effect is in the Small Study, the more likely the next study is to succeed. However, given any fixed value of , as the standard error of the estimate increases from 0.18 to 0.45, the conditional probability of success of the Large Study falls by %.
From (11), the adjusted estimator of the expected treatment effect for the Large Study is readily available:
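A minimal sketch of this calculation under the normal special case follows. The function names and all numerical inputs are hypothetical: the prior mean of -1.0 assumes a negative sign because smaller values mean better efficacy, and the threshold, observed effect, and Large Study standard error are illustrative choices rather than the settings behind Figures 2 and 3.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_estimate(est_small, se_small, prior_mean, prior_sd):
    """Shrink the Small Study estimate toward the prior (portfolio) mean."""
    shrink = se_small**2 / (se_small**2 + prior_sd**2)
    return est_small + shrink * (prior_mean - est_small)


def prob_large_study_success(est_small, se_small, se_large,
                             prior_mean, prior_sd, threshold):
    """P(Large Study estimate <= threshold | Small Study estimate)."""
    mean = adjusted_estimate(est_small, se_small, prior_mean, prior_sd)
    # Posterior variance of the true effect given the Small Study estimate.
    post_var = (prior_sd**2 * se_small**2) / (prior_sd**2 + se_small**2)
    # Predictive SD adds the Large Study sampling variance.
    sd = sqrt(se_large**2 + post_var)
    return NormalDist(mean, sd).cdf(threshold)


# Hypothetical inputs (assumptions, see the lead-in).
prior_mean, prior_sd = -1.0, 1.0
est_small, threshold, se_large = -1.6, -1.2, 0.10
for se_small in (0.18, 0.45):
    adj = adjusted_estimate(est_small, se_small, prior_mean, prior_sd)
    p = prob_large_study_success(est_small, se_small, se_large,
                                 prior_mean, prior_sd, threshold)
    print(f"se_small = {se_small:.2f}  adjusted = {adj:.3f}  "
          f"bias = {adj - est_small:.3f}  P(success) = {p:.3f}")
```

With the same observed effect, the larger Small Study standard error gives both a larger adjustment and a lower conditional probability of success, which is the qualitative pattern described for Figure 3.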
Although it is called a special case when , and are normal distributions, this is, in fact, quite general and can apply to a wide variety of scenarios in clinical development. By the central limit theorem, with a moderate or large sample size, both and approximately follow normal distributions regardless of the distribution of their associated outcomes. This is not very different from estimation and inference in the Frequentist approach, where the normal approximation is commonly used. The only distribution that could be very non-Gaussian is the prior distribution , which depends on the nature of the portfolio of drugs under investigation.
2.3 Understand and adjust for the bias from the perspective of Bayesian statistics
As Efron [efron2011tweedie] pointed out, the Bayesian method inherently prevents the bias if an appropriate prior distribution is used. Let be the posterior distribution given the estimator with a prior distribution of (3). The conditional expectation is equivalent to the Bayesian posterior mean for the Small Study with the prior distribution of (3). This can be seen by
(13) 
Similarly,
(14) 
is the posterior expectation of given the estimator from the Small Study.
Often the sponsor who conducts the Small Study has the individual data . Then, a fully Bayesian method based on the observed data (instead of the estimator ) can be used to estimate the posterior distribution . The adjusted point estimator is
(15) 
and the adjusted probability of meeting a threshold for the treatment effect is
(16) 
When the individual patient data are available, the estimation based on the posterior distribution is more natural, and may be more accurate compared to when the distribution of is not exactly normally distributed; while the estimation based on the posterior distribution given provides an advantage when only the estimate for the Small Study is available.
2.4 Modeling the distribution of
The estimation of the distribution of is the key to the understanding of the portfolio performance in both Frequentist and Bayesian methods. In this section, we will use a Bayesian hierarchical model to form the prior distribution.
Following the special case in Section 2.2, estimates of and are required to carry out the bias adjustment as defined by (12). We propose to construct a hierarchical model for the estimation of the parameters and . At any given point in time, we use the study data on the compound portfolio accumulated so far to infer the parameters of the distribution of . As the portfolio expands with time, the inference should be updated when more relevant study information becomes available.
Suppose we have information for compounds. For the compound, there are studies. Note that it is important to include data from both positive and negative studies, and from both early and late phase studies. For some studies, the population, endpoint, or study duration may differ from what we desire; either these studies are excluded, or the treatment effect for the desired population, endpoint, and study duration is estimated with additional modeling and extrapolation. The estimates from the studies can be modeled as
(17) 
and
(18) 
By assigning prior distributions for and , the inference on and can be easily obtained by maximum likelihood estimation or within a Bayesian hierarchical modeling framework. Note that Tweedie’s formula cannot be used here because the variances are not constant [robbins1956empirical, efron2011tweedie]. We suggest a noninformative or weakly informative prior for the distribution of and , where for example, . Standard software, such as JAGS, WinBUGS, and Stan, is readily available to draw posterior inference from the defined hierarchical model. Once the posterior estimates for and are obtained, they can be fed into (12) to gauge the bias for any compound in the portfolio that finishes the Small Study with a promising result, and to construct an adjusted estimator for assessing the efficacy of this compound and planning the next study.
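As a lightweight alternative to a full Bayesian fit in JAGS, WinBUGS, or Stan, the portfolio mean and standard deviation can be approximated by maximum marginal likelihood. The sketch below is not the Bayesian hierarchical procedure itself: it assumes normal distributions at both levels, treats the reported standard errors as known and positive, pools the studies of each compound by inverse-variance weighting, and profiles the likelihood over a grid of prior standard deviations. The function names and the data are hypothetical.

```python
from math import log


def pool_compound(estimates, std_errors):
    """Inverse-variance pooled estimate and its variance for one compound."""
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, 1.0 / total


def fit_portfolio_prior(compounds, tau_max=5.0, n_grid=5001):
    """Maximum marginal likelihood estimates of the prior mean and SD.

    `compounds` is a list of (estimates, std_errors) pairs, one per compound.
    Marginally, each pooled estimate y_i ~ N(mu, v_i + tau^2).
    """
    pooled = [pool_compound(y, se) for y, se in compounds]
    best = None
    for k in range(n_grid):
        tau = tau_max * k / (n_grid - 1)
        prec = [1.0 / (v + tau**2) for _, v in pooled]
        # For a fixed tau, the likelihood is maximized by the weighted mean.
        mu = sum(p * y for p, (y, _) in zip(prec, pooled)) / sum(prec)
        loglik = sum(0.5 * log(p) - 0.5 * p * (y - mu) ** 2
                     for p, (y, _) in zip(prec, pooled))
        if best is None or loglik > best[0]:
            best = (loglik, mu, tau)
    return best[1], best[2]


# Hypothetical portfolio: five compounds, one study each, standard error 0.5.
compounds = [([-1.5], [0.5]), ([-0.5], [0.5]), ([-1.0], [0.5]),
             ([-2.0], [0.5]), ([0.0], [0.5])]
mu_hat, tau_hat = fit_portfolio_prior(compounds)
print(f"prior mean = {mu_hat:.3f}, prior SD = {tau_hat:.3f}")
```

A grid search is used only to keep the example dependency-free. With few compounds the estimate of the prior standard deviation is imprecise, which is one reason a Bayesian fit with a weakly informative prior may be preferable in practice.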
The method of estimating can be easily generalized to other models or to a non-Gaussian distribution. For example, a more complex approach to modeling the prior is to include the within-compound between-study variability. Specifically, one can replace equation (17) with
(19) 
where the mean for study and compound , is the variance of given , is the mean for compound , and is the between-study variance for compound . In some cases, it may be difficult to estimate the within-compound between-study variability when there is only 1 Small Study for a compound.
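In assumed notation, the extended model described above adds one level to the hierarchy. The symbols are illustrative and may differ from those of equation (19): $\hat\theta_{ij}$ and $s_{ij}^2$ denote the estimate and its variance for study $j$ of compound $i$, $\theta_{ij}$ the study-level mean, $\theta_i$ the compound-level mean, $\omega_i^2$ the between-study variance for compound $i$, and $\mu$ and $\tau^2$ the portfolio mean and variance.

```latex
\begin{align*}
\hat\theta_{ij} \mid \theta_{ij} &\sim N(\theta_{ij}, s_{ij}^2), \\
\theta_{ij} \mid \theta_i &\sim N(\theta_i, \omega_i^2), \\
\theta_i &\sim N(\mu, \tau^2).
\end{align*}
```

Under this sketch, a compound with a single study contributes $\hat\theta_{i1} \sim N(\mu, s_{i1}^2 + \omega_i^2 + \tau^2)$ marginally, so its data alone cannot separate $\omega_i^2$ from $\tau^2$, which matches the difficulty noted above.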
3 Real Data Examples
In this section, we illustrate the application of bias adjustment through real data examples in two therapeutic areas: diabetes and immunology. In the diabetes therapeutic area, we considered the endpoint of body weight for the class of incretins as antidiabetic drugs for patients with type 2 diabetes, while in the immunology therapeutic area, we focused on treatments for rheumatoid arthritis (RA).
3.1 Diabetes Therapeutic Area
Due to confidentiality requirements, the compounds have been de-identified. The method proposed in Section 2.4 is implemented to model the distribution of the treatment effect in the portfolio of candidate compounds. For that purpose, we gathered and used the available study data (from publications or conference presentations) on the effect of incretins on weight loss across pharmaceutical companies. In this real data application, the Small Study refers to the phase 1b MAD study, while the Large Study corresponds to the phase 2 study. Again, we assume normal distributions for , and . In this illustrative example, the outcome is the change in body weight from baseline to 4 weeks. There are 5 compounds for which we know the estimates and their standard errors for both the phase 1b and phase 2 studies, so that we can compare the bias-adjusted estimate with the observed treatment effect in the phase 2 study. In all studies for all 5 compounds, the population was similar: patients with type 2 diabetes with inadequate glycemic control by diet and exercise or by treatment with metformin. The application follows the below procedure:
The availability of the Large Study (phase 2 study) results allows for a comparison of the unadjusted and adjusted estimates ( and , respectively) with the observed estimates of treatment effect in the phase 2 study. Figure 4 shows a graphical display of the unadjusted treatment effect based on the Small Study () versus the observed treatment effect in the Large Study (solid circles), and the adjusted treatment effect based on the Small Study () versus the observed treatment effect in the Large Study (hollow circles) for all 5 incretin compounds. The circle size (area) is proportional to the information of the estimates (i.e., one over the square of the standard error). Most data used in this figure are confidential and have not been published, so we removed the scales for the x-axis and y-axis. For illustrating the effectiveness of the bias adjustment, the relative, but not the absolute, scale is sufficient. Ideally, we expect the solid circles (unadjusted estimates) to be distributed symmetrically around the 45-degree line; however, most of the time sponsors are only willing to move compounds to the next developmental stage when they exhibit promising results in the Small Study. This results in an uneven distribution of the solid circles, with more circles in the lower right quadrant and fewer circles in the upper left area, a phenomenon of selection bias.
The adjusted estimates are closer to the observed treatment effect in the phase 2 studies: the hollow circles move toward the 45-degree line for all compounds except the purple one, which moves away from it. This exception is not surprising because both the adjusted estimates from phase 1b studies (also called Bayesian shrinkage estimates) and the unadjusted estimates from phase 2b studies are subject to variability and, more importantly, the Bayesian shrinkage estimator provides better overall estimation (e.g., in terms of mean squared error) but may be biased conditional on an individual compound. The estimate for the orange compound had a relatively large variance (as indicated by a small circle) and a large treatment effect (much more than ). Therefore, the circle for the orange compound shifted the most to the left. On the other hand, there was also one compound (the leftmost) with almost no weight loss based on the phase 1b study, while some weight loss was observed in the subsequent Large Study. This is rare in practice but possible for several reasons; for example, the primary outcome upon which the decision is made may not be weight loss. In addition, Figure 4 shows that the adjustment may not always make the treatment effect smaller. For the 2 leftmost compounds, the adjusted treatment effect was larger than the unadjusted one. This phenomenon is consistent with the property of the Bayesian shrinkage estimator, which shrinks the estimates toward the center of the prior distribution.
3.2 Immunology Therapeutic Area
In this section, we further illustrate how the bias-adjusting method could be implemented in an RA example. While a handful of compounds are approved for RA, it is not uncommon for a compound that is promising in a phase 2 study to fail in phase 3. In this example, the Small Study refers to the phase 2 study, while the Large Study corresponds to the phase 3 study. The outcome is ACR20 (whether a patient has improvement in the ACR [American College of Rheumatology] assessment) at week 12, and the treatment effect is defined as the difference in ACR20 response rates between the experimental treatment and placebo arms. As the treatment effect varies significantly across subpopulations, we focus only on studies with populations that have had an inadequate response to methotrexate (MTX). As previously mentioned, normal distributions for , and are assumed throughout this section.
To estimate the portfolio distribution of , we performed a systematic review of the literature and selected RA clinical trials based on two criteria: (i) double-blind, placebo-controlled RA trials with ACR20 results reported at week 12; (ii) of enrolled patients have an inadequate response to MTX. Ultimately, 48 phase 2 trials were selected. We applied the method in Section 2.4 using the published estimates and standard errors from the selected phase 2 trials. The prior distribution is therefore estimated such that , which is then used to calculate the adjusted treatment effect . Fourteen of the 48 phase 2 trials were matched to 15 phase 3 trials, all of which share a similar population and background therapies with the selected phase 2 trials. Note that a phase 2 trial may be matched with more than one phase 3 trial. In this case, we used meta-analysis to pool the results from the multiple phase 3 trials and treated them as one large phase 3 study. Overall, there were 9 compounds and 24 compound-dose levels (“treatments”) included in the analysis.
Figure 5 shows the unadjusted treatment effect based on the phase 2 study versus the observed treatment effect in the phase 3 study (solid circles), and the adjusted treatment effect based on the phase 2 study versus the observed treatment effect in the phase 3 study (hollow circles). Each circle represents 1 compound-dose level. The circle size (area) is proportional to the information of the estimates (i.e., one over the square of the standard error). The observed treatment effects in phase 2 range from to , while phase 3 results are more stable and treatment effects are within the range to . Again, most of the solid circles fall under the 45-degree line, which means that in general phase 3 results appear worse than the originally reported phase 2 results. This is another example of selection bias. Compared to the observed phase 2 treatment difference estimator , the bias-adjusted estimator is closer to its observed phase 3 results . The plot also shows the magnitude of the bias adjustment through the length of the arrow between each pair of solid and hollow circles. The longer the arrow, the larger the bias adjustment. The bias adjustment is associated with two terms: (i) the difference between and the observed phase 2 result : the closer the observed phase 2 result is to , the smaller the adjustment; (ii) the phase 2 sample size: a larger phase 2 sample size leads to a smaller adjustment, as indicated in (13).
4 Summary and Discussion
Selection bias or regression to the mean phenomenon has been observed in the past, and some research has been done in this area. In this article, we made new contributions in four aspects: the clinical meaning of the prior, the connection between the estimation bias due to selection bias and the Bayesian posterior mean, the use of hierarchical modeling to estimate the distribution of the underlying treatment effect for a portfolio of compounds, which can be used as the prior distribution to adjust the treatment effect for current and future studies, and the role of estimation variability in the Small Study in the estimation bias.
Although a prior distribution has been used in the estimation of the posterior mean to account for selection bias, the clinical meaning of the prior distribution was not clearly defined [zhang2013evaluating]. In this article, we defined the prior as the distribution of the underlying treatment effect of a portfolio of similar compounds. Clarifying the clinical meaning of the prior distribution allows us to develop a Bayesian hierarchical model to estimate it. We started from the commonly understood bias as and showed this is equivalent to , the “regression to the mean” (discount) effect from the Bayesian framework, if is the treatment effect. While Chuang-Stein and Kirby consider the discount based on the Bayesian framework as a different definition of “regression to the mean” [chuang2017quantitative], we established the equivalence of the discount between Frequentist and Bayesian methods. We also quantified the impact of the variability in the estimator for the Small Study on the probability of achieving a large promising treatment effect, as well as on the bias. The larger the estimation variability and the larger the effect size in the Small Study, the more discount should be applied to adjust for the bias in the estimator for the Large Study. Therefore, special caution should be taken when a very promising signal is observed in a very small study.
This phenomenon may be perplexing. For a single asset, , the estimator from the Small Study, is unbiased for treatment effect since . However, from a portfolio perspective, the distribution of the observed treatment effect is more variable than the distribution of true , but is still an unbiased estimator for the mean of true treatment effect for the portfolio. This can be seen from the theoretical perspective:
and
A Bayesian shrinkage estimator will provide a more realistic estimation of the distribution of . When only promising compounds (e.g., ) based on the Small Study are moved forward for the next stage of development, a selection bias is introduced. In the presence of such selection bias, is also biased for . This can be seen by
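In assumed notation, the three statements in this passage can be sketched as follows. The symbols are illustrative: $\theta$ is the true effect and $\hat\theta_S$ the Small Study estimator. The last line additionally assumes the normal special case with sampling variance $\sigma_S^2$, prior mean $\mu$, prior variance $\tau^2$, and a selection rule $\hat\theta_S \le c$ for a hypothetical threshold $c$; $\phi$ and $\Phi$ are the standard normal density and distribution function.

```latex
\begin{align*}
E(\hat\theta_S) &= E\{E(\hat\theta_S \mid \theta)\} = E(\theta), \\
\mathrm{Var}(\hat\theta_S)
  &= \mathrm{Var}\{E(\hat\theta_S \mid \theta)\}
   + E\{\mathrm{Var}(\hat\theta_S \mid \theta)\}
   = \mathrm{Var}(\theta) + E\{\mathrm{Var}(\hat\theta_S \mid \theta)\}
   \ge \mathrm{Var}(\theta), \\
E(\theta - \hat\theta_S \mid \hat\theta_S \le c)
  &= \frac{\sigma_S^2}{\tau^2 + \sigma_S^2}
     \left\{\mu - E(\hat\theta_S \mid \hat\theta_S \le c)\right\}
   = \frac{\sigma_S^2}{\sqrt{\tau^2 + \sigma_S^2}}\,
     \frac{\phi(a)}{\Phi(a)} > 0,
  \qquad a = \frac{c - \mu}{\sqrt{\tau^2 + \sigma_S^2}} .
\end{align*}
```

The first two lines follow from the laws of total expectation and total variance; the third uses the mean of a truncated normal distribution and shows that, among selected compounds, the true effect is on average worse (larger) than the Small Study estimate.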
In this article, we provided two examples in which the treatment effects were the difference in means and the difference in proportions, each with a normal prior. The theoretical framework in this article can be applied to other contrasts of treatment effect (e.g., the logarithm of the odds ratio or the logarithm of the hazard ratio), since all these estimators approximately follow a normal distribution. The prior distribution for does not have to be a normal distribution. For example, the prior distribution may be a mixture of two normal distributions representing two sets of compounds with little and reasonable treatment effect, respectively, or may contain a point mass at a treatment effect of zero to account for drugs with no treatment effect. The selection of studies and compounds to form the prior distribution is important when applying the proposed method to adjust for the selection bias. For a new compound that is not novel (i.e., data are available for similar compounds), the data from compounds in the same or a similar class should be used to form the prior. For a compound with a novel target (i.e., no data are available for similar compounds), we may assume this compound comes from a distribution of existing classes of treatments. For such compounds, a prior distribution that characterizes different classes of treatments for the same disease can be used: either a prior constructed by treating each class as one data point (the mean treatment effect for each class based on meta-analysis) as described in Equation (17), or a hierarchical model considering the between-study, compound, and class variabilities [e.g., Equation (19)].
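To illustrate a non-normal prior, the sketch below computes the posterior mean of the true effect under a normal mixture prior with a normal likelihood for the Small Study estimate. The mixture weights, component means and standard deviations, and the observed values are hypothetical, and the function name is ours.

```python
from statistics import NormalDist


def mixture_posterior_mean(est, se, weights, means, sds):
    """Posterior mean of the true effect theta given est ~ N(theta, se^2)
    and theta ~ sum_k weights[k] * N(means[k], sds[k]^2)."""
    post_weights, post_means = [], []
    for w, m, s in zip(weights, means, sds):
        # Marginal density of the estimate under component k.
        marginal = NormalDist(m, (s**2 + se**2) ** 0.5).pdf(est)
        post_weights.append(w * marginal)
        # Conjugate normal update within component k.
        post_means.append((s**2 * est + se**2 * m) / (s**2 + se**2))
    total = sum(post_weights)
    return sum(pw * pm for pw, pm in zip(post_weights, post_means)) / total


# Hypothetical prior: 60% of compounds with little effect,
# 40% with a reasonable effect (smaller values mean better efficacy).
weights, means, sds = (0.6, 0.4), (0.0, -1.0), (0.2, 0.4)
for est in (-0.3, -1.0, -1.6):
    adj = mixture_posterior_mean(est, 0.4, weights, means, sds)
    print(f"observed = {est:+.2f}  adjusted = {adj:+.3f}")
```

In this sketch, setting a component's standard deviation to 0 turns that component into a point mass, so a mass at zero treatment effect can be represented by a component with mean 0 and standard deviation 0. With a single component, the function reduces to the normal shrinkage estimator.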
In addition to sample size, there may be many differences between Small and Large Studies, including but not limited to population, endpoint, duration, and dose. In this article, we proposed a method to predict the mean treatment effect of the Large Study based on the Small Study that adjusts for the selection bias only, assuming the Small and Large Studies are otherwise similar except for sample size. Although we generally try to make phase 2 studies more translatable to phase 3 studies, considerable differences may still exist between phase 2 and 3 studies, possibly due to safety considerations, limited knowledge of the candidate treatment, and financial considerations. In this case, one should take a two-step approach to predicting the treatment effect of the candidate treatment in phase 3 studies. First, a model that accounts for the differences between phase 2 and 3 studies should be built to predict phase 3 results without adjusting for the selection bias. Then, the method proposed in this article can be applied to the predicted outcome from Step 1 to adjust for the selection bias. The prediction model as well as the prior for adjusting for selection bias should be prespecified and developed before the completion of the Small Study so that (1) the prediction model and the prior can be agreed upon before the Small Study results are known, to avoid bias due to subjectivity, and (2) the adjusted treatment effect can be estimated expeditiously right after the Small Study results become available.
Acknowledgements
We thank Dr. Ilya Lipkovich and Dr. Michael Sonksen for their scientific review of this manuscript and useful comments.
Conflict of Interest
All authors are employees and stockholders of Eli Lilly and Company.