Abstract
We propose an information borrowing strategy for the design and monitoring of phase II basket trials based on the local multisource exchangeability assumption between baskets (disease types). We construct a flexible statistical design using the proposed strategy. Our approach partitions potentially heterogeneous baskets into nonexchangeable blocks. Information borrowing is only allowed to occur locally, i.e., among similar baskets within the same block. The amount of borrowing is determined by between-basket similarities. The number of blocks and block memberships are inferred from data based on the posterior probability of each partition. The proposed method is compared to the multisource exchangeability model and Simon’s two-stage design. In a variety of simulation scenarios, we demonstrate that the proposed method maintains the type I error rate and achieves desirable basketwise power. In addition, our method is computationally efficient compared to existing Bayesian methods in that the posterior profiles of interest can be derived explicitly without the need for sampling algorithms. R scripts that perform the simulation and generate the figures are available at
https://github.com/smartbenben/baskettrialdesign.
Keywords: between-basket similarity, clustering, local multisource exchangeability, type I error rate
1 Introduction
Breakthroughs in molecular biology have led to the development of more personalized treatment strategies targeting specific molecular aberrations involved in tumor growth. As a molecular aberration can occur in tumors of different histological or anatomical types, the traditional “one indication at a time” strategy of evaluating a new treatment in a single cancer (sub)type is no longer sustainable. To accelerate the oncology drug development process, there has been growing interest in conducting basket trials, which provide a framework for simultaneously testing the antitumor activity of a novel agent in a variety of cancer (sub)types harboring the same therapeutic target [woodcock2017master, redig2015basket, renfro2017statistical]. In a basket trial, the term basket or indication refers to a cohort of patients with the same cancer (sub)type. Patients in different baskets are commonly treated with the same targeted agent. The majority of phase II basket trials have been conducted as independent parallel studies without a concurrent control arm [mcneil2015nci, middleton2020national].
Consider a prototypical basket trial based on the Vemurafenib study. The Vemurafenib trial was designed to test the preliminary efficacy of Vemurafenib, a BRAF inhibitor, in BRAF V600 mutation-positive nonmelanoma cancers in six prespecified baskets comprising non-small-cell lung cancer (NSCLC), cholangiocarcinoma (bile duct or BD), Erdheim-Chester disease or Langerhans’ cell histiocytosis (ED.LH), anaplastic thyroid cancer (ATC), and colorectal cancer (CRC). The statistical challenges in implementing a basket trial can be described based on this prototype. A pooled analysis combining data across baskets may be performed if we assume the treatment effect is homogeneous across different baskets. This assumption is often invalid, and the effect of a treatment can differ substantially across baskets. For example, Vemurafenib was found to be effective in treating BRAF V600E mutant melanoma and hairy cell leukemia, but not BRAF mutant colon cancer [flaherty2010inhibition, tiacci2011braf, prahallad2012unresponsiveness].
When the homogeneity assumption is invalid, a pooled analysis will lead to a biased estimate of the treatment effect. Alternatively, we may perform basketwise analysis, but such an analysis often suffers from a lack of statistical power owing to the difficulty of accruing patients from rare disease (sub)types. In the Vemurafenib study, the ATC and BD baskets only enrolled 7 and 8 patients, respectively, as compared to the 26 patients in the CRC basket.
A variety of statistical strategies have been proposed to improve the accuracy and efficiency of basket trials [thall2003hierarchical, liu2017increasing, berry2013bayesian, chu2018bayesian, chu2018blast, hobbs2018bayesian, neuenschwander2016robust, simon2016bayesian, zhou2020robot, kang2021hierarchical]. Bayesian hierarchical models (BHM) formulated by Thall et al. [thall2003hierarchical] and Berry et al. [berry2013bayesian] improve the efficiency of basket trials by enabling information borrowing across baskets. The validity of the BHM relies on the assumption of single source exchangeability (SSE), which states that the response rates from different baskets arise from a common parent distribution defined by trial-level parameters. The SSE assumption is often violated in practice because a treatment might be effective in only a subset of the baskets. It has been shown that a BHM based on the SSE assumption can lead to inappropriate borrowing in the presence of nonexchangeable baskets, resulting in an inflated type I error rate and reduced power [hobbs2018bayesian].
To address the limitations of the traditional BHM approach, Neuenschwander [neuenschwander2016robust] and Berry [simon2016bayesian] proposed different methods to measure exchangeability between baskets. Hobbs et al. [hobbs2018bayesian] developed a basket trial design based on the multisource exchangeability model (MEM), which allows the identification of exchangeable baskets as well as singleton baskets. Zhou et al. [zhou2020robot] infer the number of subgroups and subgroup memberships using a Dirichlet process mixture model. Kang et al. [kang2021hierarchical] propose a hierarchical Bayesian clustering design that clusters arms into either an active or an inactive subgroup. These approaches are computationally demanding and difficult to implement due to the complexities involved in model and prior specification.
We propose an information borrowing strategy under the local multisource exchangeability assumption and construct a two-stage design for phase II basket trials using the proposed strategy. The proposed work differs from previous approaches in the following aspects. To address the pooling vs. not pooling issue, we formally evaluate the evidence for and against the heterogeneity of treatment effects using the Bayes factor, and information sharing is allowed only if there is sufficient evidence for pooling some baskets together. To determine the extent as well as the amount of borrowing, the proposed approach partitions baskets into mutually nonexchangeable blocks and conducts local information borrowing within each block according to the similarity of their response profiles. We implement the proposed information borrowing strategy in both the interim and final analyses such that nonpromising baskets can be removed from the trial earlier. Compared to existing approaches, our method is easy to implement because all the posterior quantities of interest can be derived without the need for sampling algorithms.
In Section 2.1, we consider whether or not to pool data across baskets by formulating a set of hypotheses representing different levels of between-basket heterogeneity. Section 2.2 then describes our method for assessing between-basket similarity based on these hypotheses. In Section 2.3, we describe our strategies for conducting posterior inferences under the local multisource exchangeability assumption. We construct stopping boundaries for making go/no go decisions in Section 2.4. We show results from simulation studies under varying response scenarios in Section 3.1 based on comparisons with MEM and Simon’s two-stage design. We discuss limitations and future directions in Section 4.
2 Methods
2.1 To pool or not to pool
Consider our prototypical phase II basket trial consisting of $B$ baskets, each with a binary endpoint indicating treatment success or failure. We model each basket as a sequence of independent Bernoulli samples with success probability $\theta_b$ for $b = 1, \dots, B$. Denote $y_b$ as the number of responses out of the $n_b$ patients in basket $b$. The number of successful responses in a basket is assumed to have a binomial distribution,
$$y_b \mid \theta_b \sim \mathrm{Binomial}(n_b, \theta_b).$$
A crucial decision in the design and analysis of basket trials is whether data from different baskets should be combined or not. We will evaluate whether data should be pooled and to what extent it should be pooled by setting up hypotheses representing all the possible partitions of the baskets into blocks. We assume baskets within the same block have the same response rate, whereas baskets in different blocks have different response rates.
Consider $B = 6$ baskets in our prototype. The hypothesis $H_1$ represents the scenario where each basket has its unique response rate. That is, the six baskets are partitioned into six blocks,
$$H_1: \theta_1, \theta_2, \dots, \theta_6 \text{ are all distinct}.$$
The other hypotheses correspond to cases in which some baskets are similar in terms of patient responses and these baskets can be considered as having the same response rate. Specifically, we consider the hypotheses $H_2, \dots, H_{G-1}$, in which some but not all baskets share a response rate, and the hypothesis of complete homogeneity across baskets,
$$H_G: \theta_1 = \theta_2 = \dots = \theta_6.$$
Under $H_g$, we hypothesize there are $K_g$ blocks and each block has its unique response rate $\theta_{gk}$, where $k = 1, \dots, K_g$. For example, we have $K_1 = 6$ and $K_G = 1$ unique response rates under $H_1$ and $H_G$, respectively. Denote the block membership of basket $b$ under $H_g$ as $\omega_{gb}$. Basket $b$ belongs to the $k$th block ($k = 1, \dots, K_g$) if $\omega_{gb} = k$. Under $H_1$, baskets should be analyzed individually without any pooling. Under the other hypotheses, we should group baskets by their response rates and conduct a pooled analysis within each block.
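The hypothesis space described above is the set of all partitions of the baskets into blocks. As a minimal sketch (the function name `set_partitions` and the label encoding are assumptions made here for illustration, not taken from the authors' R scripts), each partition can be encoded as a list of block labels and enumerated recursively. For six baskets this yields 203 partitions, the sixth Bell number.

```python
from typing import Iterator, List


def set_partitions(n_baskets: int) -> Iterator[List[int]]:
    """Yield every partition of baskets 0..n_baskets-1 as a list of block labels.

    Labels form a restricted growth string: basket 0 is always in block 0 and
    each later basket either joins an existing block or opens the next new
    block, so every partition is produced exactly once.
    """
    def extend(labels: List[int], n_blocks: int) -> Iterator[List[int]]:
        if len(labels) == n_baskets:
            yield list(labels)
            return
        for block in range(n_blocks + 1):
            labels.append(block)
            yield from extend(labels, max(n_blocks, block + 1))
            labels.pop()

    yield from extend([], 0)


partitions = list(set_partitions(6))
print(len(partitions))  # 203, the Bell number for six items
```

In this encoding, `[0, 1, 2, 3, 4, 5]` corresponds to the hypothesis in which every basket has its own response rate, and `[0, 0, 0, 0, 0, 0]` to complete homogeneity.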
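The plausibility of each hypothesis can be estimated based on the observed response rates of the baskets. Let $Y_{gk} = \sum_{b:\, \omega_{gb} = k} y_b$ and $N_{gk} = \sum_{b:\, \omega_{gb} = k} n_b$ denote the sum of responses and sample sizes in the $k$th block of the $g$th partition, respectively; then
$$Y_{gk} \mid \theta_{gk} \sim \mathrm{Binomial}(N_{gk}, \theta_{gk}).$$
Let
$$f(\theta; a, b) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{\mathcal{B}(a, b)}$$
denote the probability density function of a Beta distribution. We assume $\theta_{gk}$ has a noninformative prior and follows the Beta distribution with probability density function $f(\theta_{gk}; a_0, b_0)$, where $a_0, b_0 > 0$. Common choices of the prior include Beta(1, 1) and Beta(0.5, 0.5).
The posterior distribution of $\theta_{gk}$ under the hypothesis $H_g$ is
$$\theta_{gk} \mid D, H_g \sim \mathrm{Beta}(a_0 + Y_{gk},\; b_0 + N_{gk} - Y_{gk}), \qquad (1)$$
where $D = \{(y_b, n_b): b = 1, \dots, B\}$ denotes the observed data.
Let $\mathcal{B}(\cdot, \cdot)$ denote the Beta function. Under $H_g$, the marginal density of $D$ is
$$m(D \mid H_g) = \prod_{b=1}^{B} \binom{n_b}{y_b} \prod_{k=1}^{K_g} \frac{\mathcal{B}(a_0 + Y_{gk},\; b_0 + N_{gk} - Y_{gk})}{\mathcal{B}(a_0, b_0)}.$$
With the use of Bayes’s rule, the posterior probability (PP) of $H_g$ is
$$P(H_g \mid D) = \frac{P(H_g)\, m(D \mid H_g)}{\sum_{g'=1}^{G} P(H_{g'})\, m(D \mid H_{g'})}, \qquad (2)$$
where $g = 1, \dots, G$ and $\sum_{g=1}^{G} P(H_g) = 1$.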
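Because the marginal density of each partition is a product of Beta functions, the posterior probability of every partition can be computed in closed form. The sketch below illustrates this under stated assumptions: a Beta(1, 1) prior on each block's response rate, a prior that places probability 0.5 on the no-pooling partition and splits the rest equally, and illustrative response counts `y` and sample sizes `n` that are hypothetical rather than taken from the source. The binomial coefficients are omitted because they cancel in the normalization.

```python
from math import exp, lgamma, log
from typing import Iterator, List, Sequence


def set_partitions(n: int) -> Iterator[List[int]]:
    """Yield each partition of n baskets as a list of block labels."""
    def extend(labels: List[int], n_blocks: int) -> Iterator[List[int]]:
        if len(labels) == n:
            yield list(labels)
            return
        for block in range(n_blocks + 1):
            labels.append(block)
            yield from extend(labels, max(n_blocks, block + 1))
            labels.pop()
    yield from extend([], 0)


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(labels: Sequence[int], y: Sequence[int], n: Sequence[int],
                 a0: float = 1.0, b0: float = 1.0) -> float:
    """Log marginal likelihood of one partition, up to binomial coefficients."""
    totals = {}
    for lab, yi, ni in zip(labels, y, n):
        resp, size = totals.get(lab, (0, 0))
        totals[lab] = (resp + yi, size + ni)
    return sum(log_beta(a0 + Y, b0 + N - Y) - log_beta(a0, b0)
               for Y, N in totals.values())


def partition_posteriors(partitions, y, n, prior, a0=1.0, b0=1.0) -> List[float]:
    """Posterior probability of each partition via Bayes's rule."""
    log_w = [log(p) + log_marginal(lab, y, n, a0, b0)
             for lab, p in zip(partitions, prior)]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical data for six baskets (illustration only).
y = [8, 0, 1, 1, 6, 2]
n = [19, 10, 26, 8, 14, 7]
partitions = list(set_partitions(len(y)))
no_pool = list(range(len(y)))
G = len(partitions)
prior = [0.5 if lab == no_pool else 0.5 / (G - 1) for lab in partitions]
post = partition_posteriors(partitions, y, n, prior)
best = max(range(G), key=post.__getitem__)
print(partitions[best], round(post[best], 3))
```

Working on the log scale with `lgamma` avoids overflow in the Beta functions; no sampling is needed at any step.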
We rank all the possible partitions of the six baskets in the prototype according to their posterior probabilities and show the top ten partitions in Table 1. In this case, the top partition has a posterior probability of 0.283, suggesting some evidence for basketwise analysis.
We evaluate the relative evidence for and against pooling using the Bayes factor (BF). The prior probabilities for these $G$ hypotheses are given by $P(H_1) = 0.5$ and $P(H_g) = 0.5/(G-1)$ for $g = 2, \dots, G$.
That is, we consider pooling and not pooling ($H_1$) as equally likely a priori. The hypothesis $H_1$ represents complete heterogeneity at the basket level and requires the use of basketwise analysis. The other hypotheses imply there is some homogeneity at the basket level and some baskets should be pooled together for analysis. The Bayes factor in favor of pooling is the ratio of the posterior odds to the prior odds of pooling,
$$BF = \frac{\sum_{g=2}^{G} P(H_g \mid D) \,/\, P(H_1 \mid D)}{\sum_{g=2}^{G} P(H_g) \,/\, P(H_1)} = \frac{1 - P(H_1 \mid D)}{P(H_1 \mid D)},$$
where the second equality holds because the prior odds of pooling equal one.
A Bayes factor between 1 and 3.2 is generally considered not worth more than a bare mention, whereas a BF between 3.2 and 10 offers substantial evidence [kass1995bayes]. Therefore, we will consider pooling only if the BF reaches the level of substantial evidence, and we will analyze each basket individually without any pooling otherwise. Pooling without appropriate statistical considerations can lead to biased estimates and inflated error rates. The test of pooling vs. not pooling reduces the chance of inappropriate information borrowing and helps to maintain the type I and type II error rates of our proposed method.
In Table 1, the BF of pooling vs. not pooling does not reach this level, so we decide not to pool the baskets. We would reach the same conclusion if a popular PP cutoff of 0.05 were used. That is, we fail to reject the hypothesis $H_1$ because the PP of $H_1$ is greater than 0.05.
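The Bayes factor for pooling is the posterior odds of pooling divided by the prior odds of pooling. The following sketch shows that calculation; the function name `bayes_factor_pooling` is an assumption introduced here. When pooling and not pooling are equally likely a priori, the prior odds equal one and the BF reduces to the posterior odds. As an illustration, the example plugs in the posterior probability of 0.283 reported above for the top partition, treating it as the posterior probability of the no-pooling hypothesis.

```python
def bayes_factor_pooling(post_no_pool: float, prior_no_pool: float = 0.5) -> float:
    """Bayes factor in favor of pooling versus the no-pooling hypothesis.

    post_no_pool:  posterior probability of the no-pooling partition
    prior_no_pool: prior probability of the no-pooling partition
    All remaining probability mass belongs to the pooling partitions.
    """
    if not (0.0 < post_no_pool < 1.0 and 0.0 < prior_no_pool < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = (1.0 - post_no_pool) / post_no_pool
    prior_odds = (1.0 - prior_no_pool) / prior_no_pool
    return posterior_odds / prior_odds


bf = bayes_factor_pooling(0.283)
print(round(bf, 2))  # 2.53
```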
2.2 Assessment of similarity between baskets
Specifying the hypotheses not only helps us to carefully weigh the decision of pooling vs. basketwise analysis, it also provides a framework for the evaluation of between-basket similarity. We define the similarity between the $i$th and $j$th baskets as the posterior probability of these two baskets being in the same block, i.e., having the same response rate. Specifically, define
$$S_{ij} = \sum_{g=1}^{G} P(H_g \mid D)\, I(\omega_{gi} = \omega_{gj}) \qquad (3)$$
for $i \neq j$, where $I(\cdot)$ is the indicator function. We consider $S_{ij} = 1$ if $i = j$.
The between-basket similarity measures for the six baskets considered in our prototype are displayed in Figure 2.
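The similarity between two baskets is the total posterior probability of the partitions that place them in the same block. A minimal sketch of that sum follows; the three-basket example and its posterior probabilities are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from typing import List, Sequence


def similarity_matrix(partitions: Sequence[Sequence[int]],
                      post: Sequence[float]) -> List[List[float]]:
    """Pairwise similarity: posterior mass of partitions where two baskets share a block."""
    n_baskets = len(partitions[0])
    sim = [[0.0] * n_baskets for _ in range(n_baskets)]
    for labels, prob in zip(partitions, post):
        for i in range(n_baskets):
            for j in range(n_baskets):
                if labels[i] == labels[j]:
                    sim[i][j] += prob
    return sim


# All five partitions of three baskets with hypothetical posterior probabilities.
partitions = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 1, 2]]
post = [0.10, 0.40, 0.05, 0.15, 0.30]
S = similarity_matrix(partitions, post)
for row in S:
    print([round(v, 2) for v in row])
```

Each diagonal entry equals one because a basket always shares a block with itself, and the matrix is symmetric by construction.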
2.3 Basket-level inference
Let $H_{g^*}$ denote the partition of our final choice based on BF and PP. When the BF supports pooling, $H_{g^*}$ is the partition with the largest posterior probability among all pooling partitions ($g^* = \arg\max_{g \geq 2} P(H_g \mid D)$). Otherwise, $g^* = 1$, in which all baskets are assigned to different blocks corresponding to different response rates. The posterior distribution for the response probability of basket $b$ is
$$\theta_b \mid D, S \sim \mathrm{Beta}(\alpha_b, \beta_b), \qquad (4)$$
where $\alpha_b = a_0 + \sum_{j=1}^{B} I(\omega_{g^*j} = \omega_{g^*b})\, S_{bj}\, y_j$ and $\beta_b = b_0 + \sum_{j=1}^{B} I(\omega_{g^*j} = \omega_{g^*b})\, S_{bj}\, (n_j - y_j)$. We denote $S$ as the similarity matrix with elements $S_{bj}$ defined by equation (3).
A basket is only allowed to borrow from a different basket if the two baskets are estimated to have the same group membership ($\omega_{g^*j} = \omega_{g^*b}$). When borrowing is allowed, the amount of information borrowing is proportional to the between-basket similarity $S_{bj}$.
In the Bayesian context, the amount of information sharing across baskets is measured by the effective sample size (ESS) of the resultant posterior distribution [hobbs2013adaptive, kaizer2018bayesian]. The ESS quantifies information content in relation to the number of observables that would be required to obtain the level of posterior precision achieved by the candidate posterior distribution when analyzed using a vague prior [morita2008determining]. When pooling is allowed, i.e., $g^* \neq 1$, the ESS of basket $b$ based on equation (4) is
$$ESS_b = \alpha_b + \beta_b = a_0 + b_0 + \sum_{j=1}^{B} I(\omega_{g^*j} = \omega_{g^*b})\, S_{bj}\, n_j. \qquad (5)$$
When information borrowing is not permitted, the effective sample size of basket $b$ is reduced to
$$ESS_b = a_0 + b_0 + n_b,$$
which is the effective sample size if basketwise analysis is performed.
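The local borrowing step can be sketched as follows, under an assumed form of the posterior: each basket's Beta posterior adds the responses and non-responses of the baskets in its own block, weighted by their similarity to it, with weight one for the basket itself. This weighting is inferred from the statement that borrowing is proportional to between-basket similarity and should be read as an assumption, not as the authors' exact formula. The function name `local_posterior`, the data, and the similarity values are hypothetical.

```python
from typing import List, Sequence, Tuple


def local_posterior(y: Sequence[int], n: Sequence[int], labels: Sequence[int],
                    sim: Sequence[Sequence[float]],
                    a0: float = 1.0, b0: float = 1.0) -> List[Tuple[float, float]]:
    """Beta posterior parameters per basket with similarity-weighted local borrowing."""
    params = []
    for b in range(len(y)):
        alpha, beta = a0, b0
        for j in range(len(y)):
            if labels[j] != labels[b]:
                continue  # never borrow across blocks
            weight = 1.0 if j == b else sim[b][j]
            alpha += weight * y[j]
            beta += weight * (n[j] - y[j])
        params.append((alpha, beta))
    return params


# Hypothetical example: baskets 0 and 1 share a block, basket 2 stands alone.
y = [3, 4, 0]
n = [10, 10, 10]
labels = [0, 0, 1]
sim = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
posterior = local_posterior(y, n, labels, sim)
ess = [alpha + beta for alpha, beta in posterior]
print(posterior)  # [(6.0, 11.0), (6.5, 10.5), (1.0, 11.0)]
print(ess)        # [17.0, 17.0, 12.0]
```

In this example the singleton basket keeps the basketwise effective sample size of 12 (prior plus its own 10 patients), while each of the two pooled baskets gains 5 effective patients from its neighbor.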
2.4 Go/no go decisions for monitoring a basket trial
In exploratory basket trials, the primary objective is to evaluate whether the experimental treatment is worthy of further investigation for each basket. Denote $\theta_{0b}$ as the fixed, prespecified response rate seen in historical controls for basket $b$ and let $\theta_{1b}$ be the target response rate of clinical interest. This primary objective is accomplished by testing the null hypothesis
$$\theta_b \leq \theta_{0b}$$
versus the alternative
$$\theta_b > \theta_{0b},$$
evaluating the statistical power at $\theta_b = \theta_{1b}$, where $\theta_{1b} > \theta_{0b}$.
Consider a two-stage design allowing early termination for futility. Denote $N_b$ as the prespecified maximum sample size for basket $b$. Let $n_{1b}$ denote the prespecified sample size for basket $b$ at stage I. To account for potential differences in the maximum sample size ($N_b$) due to differences in patient availability, the stage I sample size is set such that a smaller interim sample size is chosen for smaller baskets.
Let $PP_b = P(\theta_b > \theta_{0b} \mid D, S)$ denote the posterior probability of having a higher response rate than $\theta_{0b}$ in basket $b$, given data $D$ from all the baskets and the similarity matrix $S$. With $Q_f$ and $Q_e$ denoting the futility and efficacy cutoffs, we define the following rules for making go/no go decisions:

At the first stage, if $PP_b < Q_f$, then terminate accrual for futility for basket $b$,

otherwise, continue basket $b$ to the next stage.

At the second stage, calculate $S$ and $PP_b$ using data from the baskets remaining in the trial. Due to possible early termination of baskets, this analysis may involve fewer baskets than the first stage.

If $PP_b > Q_e$, then conclude the new treatment is promising for this basket.

We conclude the new treatment is not promising in basket $b$ if $PP_b \leq Q_e$.
Figure 1 shows the workflow of our proposed design.
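The go/no go rules depend only on the posterior probability that a basket's response rate exceeds the historical control rate, which is a Beta tail probability. The sketch below evaluates it with the standard continued-fraction expansion of the regularized incomplete beta function, using only the Python standard library. The function names, the cutoffs (`q_f`, `q_e`), and the numbers in the example are hypothetical placeholders, not values from the source.

```python
from math import exp, lgamma, log


def _beta_cf(a: float, b: float, x: float, max_iter: int = 300, eps: float = 3e-14) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def prob_exceeds(alpha: float, beta: float, theta0: float) -> float:
    """Posterior probability that the response rate exceeds theta0."""
    return 1.0 - beta_cdf(theta0, alpha, beta)


def interim_decision(pp: float, q_f: float) -> str:
    return "stop for futility" if pp < q_f else "continue"


def final_decision(pp: float, q_e: float) -> str:
    return "promising" if pp > q_e else "not promising"


# Hypothetical posterior Beta(6, 11), control rate 0.15, cutoffs 0.05 and 0.90.
pp = prob_exceeds(6.0, 11.0, 0.15)
print(round(pp, 3), interim_decision(pp, 0.05), final_decision(pp, 0.90))
```

Because the posterior is available in closed form, each decision requires a single evaluation of the Beta CDF, which is what makes repeated calibration by simulation inexpensive.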
The frequentist operating characteristics of the proposed approach can be evaluated based on the following criteria via simulation studies:

Familywise type I error rate (FWER): This is the probability that the treatment is wrongly claimed to be efficacious for at least one nonpromising basket.

Basketwise type I error rate: Consider a single basket . This is the probability of rejecting the null hypothesis when the treatment is actually nonpromising.

Basketwise power: Consider a single basket . This is the probability of rejecting the null hypothesis when the treatment is actually efficacious.

Trialwise power: this is defined as the weighted average of basketwise power for all promising baskets, where the weight is the sample size for each promising basket.
In practice, the maximum sample size of each basket is often determined by budget or accrual rate. Given the maximum basket size, our choice of the decision boundary is the pair of futility and efficacy cutoffs ($Q_f$, $Q_e$) that maximizes the trialwise power of the design under the global alternative, i.e., all baskets have promising response rates, while controlling the FWER at a target level (e.g., 0.1) under the global null, i.e., the treatment is not effective in any basket. The values of ($Q_f$, $Q_e$) can be found via grid search.
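The four operating characteristics listed above can be estimated from simulated trial decisions. The sketch below assumes the simulation output is a list of trials, each a list of booleans indicating whether the treatment was declared promising in each basket; the function name and the toy inputs are hypothetical. A grid search over cutoffs would call such a function once per candidate pair under the global null and the global alternative.

```python
from typing import Dict, List, Sequence


def operating_characteristics(decisions: Sequence[Sequence[bool]],
                              promising: Sequence[bool],
                              sizes: Sequence[int]) -> Dict[str, object]:
    """Summarize simulated go decisions.

    decisions: one row per simulated trial, True if a basket was declared promising
    promising: True for baskets where the treatment truly works
    sizes:     sample size of each basket (weights for trialwise power)
    """
    n_trials = len(decisions)
    n_baskets = len(promising)
    # Rejection rate per basket: power if truly promising, type I error otherwise.
    rates: List[float] = [sum(trial[b] for trial in decisions) / n_trials
                          for b in range(n_baskets)]
    null_idx = [b for b in range(n_baskets) if not promising[b]]
    alt_idx = [b for b in range(n_baskets) if promising[b]]
    # FWER: at least one false claim among the nonpromising baskets.
    fwer = sum(any(trial[b] for b in null_idx) for trial in decisions) / n_trials
    weight = sum(sizes[b] for b in alt_idx)
    trial_power = (sum(sizes[b] * rates[b] for b in alt_idx) / weight
                   if weight else float("nan"))
    return {"basketwise": rates, "fwer": fwer, "trialwise_power": trial_power}


# Toy example: four simulated trials, three baskets, the middle one nonpromising.
decisions = [[True, False, True],
             [True, True, False],
             [False, False, True],
             [False, False, True]]
summary = operating_characteristics(decisions, [True, False, True], [10, 20, 30])
print(summary)
```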
3 Simulation
3.1 Simulation setup and evaluation metrics
We conducted simulation studies to evaluate the operating characteristics of the proposed design under different response scenarios in the context of the Vemurafenib trial [hyman2015vemurafenib]. Consider a total of six baskets. The response scenarios we considered include the global null (“6 failures”), the global alternative (“6 successes”), and the mixed scenarios. Under the global null scenario, the response rate in each basket is the same as the historical control rate and we should not declare success in any of these baskets. Under the global alternative, the response rate is promising in all the baskets. The mixed scenarios consist of baskets with both promising and nonpromising response rates, representing varying levels of between-basket heterogeneity. We also considered a scenario consisting of baskets with promising (H), nonpromising (L) and intermediate (M) response rates, denoted as “HHLMMH”. We summarize these scenarios in Table 2.
We evaluated these scenarios in two specific cases:

Case 1 A fixed design is used for each basket.

Case 2 A two-stage design is applied to each basket, assuming the same maximum sample size for all the baskets.
We simulated 5,000 trials for each response scenario under each case. In case 1, we evaluate the performance of the proposed method, the exact binomial test and the MEM approach developed by Hobbs et al. [hobbs2018bayesian] based on basketwise power and basketwise type I error rate. A sample size of 19 is chosen for each basket such that the FWER using the exact binomial method under the global null is as close to 0.10 as possible without exceeding it. In case 2, we compared the proposed method with Simon’s two-stage design on the basis of basketwise power, basketwise type I error rate and expected sample size. Likewise, the FWER under the global null scenario is maintained at comparable levels for the proposed method and Simon’s two-stage design. The proposed approach and the MEM method are both calibrated to maximize the trialwise power under the global alternative while maintaining the FWER under the global null scenario at 0.10. The R scripts for simulation can be found at https://github.com/smartbenben/baskettrialdesign. The comparison results are discussed in the following sections.
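For the exact binomial comparator, the FWER across independent baskets follows directly from the basketwise type I error rate, so the rejection threshold can be calibrated without simulation. The sketch below finds the smallest critical number of responses that keeps the FWER at or below a target for six independent baskets of 19 patients each. The control rate of 0.15 is an assumed placeholder; the target rate of 0.45 matches the response rate mentioned in Section 3.3 but is likewise used here only for illustration.

```python
from math import comb
from typing import Tuple


def upper_tail(n: int, r: int, p: float) -> float:
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(r, n + 1))


def calibrate_exact_binomial(n: int, p0: float, n_baskets: int,
                             fwer_target: float) -> Tuple[int, float, float]:
    """Smallest critical value r whose FWER over independent baskets meets the target.

    Returns (r, basketwise type I error rate, familywise type I error rate).
    """
    # Largest basketwise alpha compatible with the FWER target under independence.
    alpha_max = 1.0 - (1.0 - fwer_target) ** (1.0 / n_baskets)
    for r in range(n + 2):
        alpha = upper_tail(n, r, p0)
        if alpha <= alpha_max:
            return r, alpha, 1.0 - (1.0 - alpha) ** n_baskets
    raise RuntimeError("unreachable: the tail probability is zero at r = n + 1")


# Hypothetical rates: control 0.15, target 0.45; n = 19 and FWER 0.10 as in the text.
r, alpha, fwer = calibrate_exact_binomial(n=19, p0=0.15, n_baskets=6, fwer_target=0.10)
power = upper_tail(19, r, 0.45)
print(r, round(alpha, 4), round(fwer, 4), round(power, 4))
```

Repeating the calculation over a range of sample sizes would show which `n` brings the attained FWER closest to the target without exceeding it.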
3.2 Comparison with MEM and Exact Binomial Method
This section compares the performance of our approach with the MEM model developed by Hobbs et al. [hobbs2018bayesian] and the exact binomial method in a fixed design without early termination rules. Here we refer to the model proposed by Hobbs et al. as “global MEM” since it allows information borrowing between any possible pair of baskets. By contrast, we refer to our proposed model as “local MEM” because it only allows borrowing to occur between neighbors, i.e., within the same block. We implement the global MEM model using the ‘basket’ package in R. For the local and global MEM methods, we set the futility cutoff $Q_f$ so that early futility stopping is disabled and choose the efficacy cutoff $Q_e$ as the posterior probability cutoff maximizing the trialwise power under the global alternative while controlling the FWER under the global null. Figure 3 shows comparisons of these three methods. Basketwise power (or basketwise type I error rate when a basket is assumed to be a failure) is displayed for each basket.
In case 1, the maximum sample size is set to be for all baskets. As shown in Figure 3, both local and global MEM achieve greater basketwise power than the exact binomial method in a fixed design. Among all truly promising baskets, the minimum basketwise power using the proposed local MEM model is , which is achieved when only a single basket is promising (“1 success”). Under the global alternative when all baskets are truly promising (“6 successes”), the local MEM method achieves a basketwise power of at least . In the extreme case, when the treatment is truly promising in only one basket (Table 2: basket F in the “1 success” case), our local MEM model is more effective at identifying the singleton effective basket. In basket F, the local MEM approach achieves power, whereas global MEM only has power. Similarly, when there are only two truly promising baskets (“2 success”), the local MEM has greater power () than global MEM () in detecting the efficacious baskets. As more effective baskets are added to the alternative scenarios, both local and global MEM models show an increasing trend in power. Although the global MEM model seems to achieve greater power when there are at least “3 success” out of six baskets, it also leads to an inflated type I error rate in the nonpromising baskets. In particular, when the treatment is truly promising in all but one basket, the type I error rate for the nonpromising basket exceeds . By contrast, the basketwise type I error rate using the local MEM method is under the scenarios “0 success”, “1 success”, “2 success”, “3 success” and “4 success”, and under the “5 success” and “HHLMMH” scenarios.
Under the “HHLMMH” scenario, the local MEM successfully controls the basketwise type I error rate for the nonpromising arm (C), and achieves basketwise power for the promising arms (A, B, F). By contrast, although the global MEM has basketwise power for promising arms and for intermediate arms, its basketwise type I error rate is inflated to . Detailed results can be found in Table 3.
3.3 Comparison with Simon’s two-stage design
For case 2, we conducted one interim futility analysis when half of the maximum sample size has been accrued for each basket. Our proposed method has a FWER of under the global null, which is then used to determine the significance level for Simon’s two-stage design. In comparison, Simon’s ‘minmax’ design has a basketwise type I error rate of when the treatment is not efficacious, closely matching the FWER of local MEM.
Based on Figure 5, Simon’s two-stage design is shown to provide a modest benefit in expected sample size (EN), leading to patients reduction in EN, whereas our method demonstrates significant gains in power. Figure 4 shows the results of our local MEM model for all response scenarios. Our local MEM model is able to maintain basketwise type I error rates for nonpromising arms at an acceptable level. Compared with Simon’s two-stage design, which has power in recognizing an intermediate efficacious arm and power in detecting an arm with a response rate of 45%, our approach achieved much greater power in promising baskets across all scenarios considered.
4 Discussion
In this article, we propose an information borrowing strategy under the local multisource exchangeability assumption. In contrast to the global MEM framework, which allows information sharing among all possible pairs of baskets, we restrict information sharing to a similarity-based local scope using a clustering method. By conducting six-arm simulation studies for fixed and two-stage designs, we demonstrate that the proposed local MEM approach is effective in preventing inappropriate information borrowing. Compared with the exact binomial method, our approach achieves greater basketwise power while maintaining basketwise type I error rates at an acceptable level. In comparison with the MEM model, our approach shows better control of the basketwise type I error rate under varying clinical scenarios ranging from the global null to all successes, and avoids the undesirable reduction of basketwise power caused by inappropriate information borrowing under the “Nugget” scenario. Moreover, our proposed approach is more computationally efficient than MEM, as all posterior quantities of interest can be derived explicitly without the need for sampling algorithms. For the two-stage analysis, we compare our proposed model with Simon’s two-stage design. Our method outperforms Simon’s two-stage design, with higher basketwise power and comparable type I error rates across all response scenarios.
Furthermore, the choice of prior in our approach can be easily adapted to different circumstances in practice. Without patient-level information, we consider pooling and not pooling as equally likely a priori. Patient-level information, such as data from high-throughput sequencing analysis, might help to predict treatment benefit. When predictive biomarkers are available, we can assign a more informative prior distribution to the different partitions of baskets by assuming some partitions are more likely than others. Likewise, we can consider some partitions as unlikely based on the biology of the disease.
Some limitations of our approach should also be noted. Our local MEM framework currently enumerates all possible partitions to find the best one, which is computationally intensive when the number of baskets is large. Alternative methods, including clustering and mixture models, can be applied instead when the number of baskets is larger.